Effect of multi-byte locales on GNU grep speed in OpenSolaris
I have a lab machine running OpenSolaris 2009.06 (updated to snv_117) and had created an LDIF file with about 100k small entries in it (file size was ~ 63 megs). I wanted to get a count of the exact number of entries so I ran:
grep -c ^dn: /var/tmp/search.out
I expected it to take a second or two. I was wrong. It was painfully slow.
I used the time command to re-run the grep and saw it clocked in at just over a minute.
This was weird, so I thought it was time to investigate further. I used the DTrace Toolkit’s hotuser command to see which functions were hot:
pfexec /opt/DTT/hotuser -c "grep -c ^dn: /var/tmp/search.out"
Sampling... Hit Ctrl-C to end.
FUNCTION                                        COUNT   PCNT
...<snipped out smaller functions>...
ggrep`check_multibyte_string                     5480   8.9%
methods_unicode.so.3`__mbrtowc_dense_utf8       12328  20.1%
libc.so.1`mbrlen                                13566  22.1%
libc.so.1`memset                                23014  37.5%
Hmm, interesting to see the calls to mbrlen and methods_unicode among the hot functions. Let’s check my $LANG setting:
echo $LANG
en_US.UTF-8
Bingo! Let’s try it again with a single-byte LANG setting.
LANG=C time grep -c ^dn: /var/tmp/search.out
99987
real 0.1
user 0.0
sys 0.0
That looks normal. Now let’s try one more time with a multi-byte LANG to be sure:
LANG=en_US.UTF-8 time grep -c ^dn: /var/tmp/search.out
99987
real 1:01.4
user 1:01.3
sys 0.0
Yep, the problem is confirmed.
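As a workaround, the locale can be forced for a single command instead of changing the login environment. The sketch below is only an illustration: the function names effective_ctype and cgrep are made up for this post. It relies on the standard precedence of the locale variables (LC_ALL overrides LC_CTYPE, which overrides LANG), so it uses LC_ALL rather than LANG to make sure the override sticks even if one of the other variables is set.

```shell
#!/bin/sh
# Hypothetical helper: report which locale setting governs character handling.
# Precedence per POSIX: LC_ALL first, then LC_CTYPE, then LANG.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Assumption: with nothing set, the system falls back to the C locale.
        printf 'C\n'
    fi
}

# Hypothetical wrapper: run grep with the C locale forced for this one command.
# The caller's environment is left untouched.
cgrep() {
    LC_ALL=C grep "$@"
}
```

With these defined, cgrep -c ^dn: /var/tmp/search.out should behave like the fast LANG=C run above. Keep in mind that in the C locale grep matches bytes, not characters, so this is only safe for patterns like ^dn: that are plain ASCII.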
Notes
For those unfamiliar with OpenSolaris, the default path has /usr/gnu/bin first. The grep I was using was:
grep -V
grep (GNU grep) 2.5
Copyright 1988, 1992-1999, 2000, 2001 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
If you use the non-GNU grep available at /usr/xpg4/bin/grep, it doesn’t show the big slowdown regardless of LANG.
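Since more than one grep can live on the same box, it helps to see which ones the PATH would find and in what order. Here is a rough sketch; list_on_path is a made-up name, and it only looks at PATH, so it won’t notice shell aliases or functions that shadow the command.

```shell
#!/bin/sh
# Hypothetical helper: print every executable called $1 that PATH would find,
# in search order. The function body is wrapped in ( ... ) so the IFS and
# globbing changes stay inside a subshell.
list_on_path() (
    name=$1
    IFS=:
    set -f                      # do not glob-expand PATH entries
    for dir in $PATH; do
        [ -n "$dir" ] || dir=.  # an empty PATH entry means the current directory
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
)
```

Running list_on_path grep prints one path per line. On a default OpenSolaris setup I would expect the /usr/gnu/bin copy to be listed first, since that directory leads the PATH.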
I also ran the same test with the GNU wc command and saw about a 25x slowdown when using a multi-byte LANG.
For both grep and wc, I re-ran the tests multiple times to make sure that file system caching played no role in the results.
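The repeated runs can be scripted along these lines. This is a sketch, not the exact commands I ran: time_grep is a made-up name, and it assumes a date that understands +%s (GNU date does). Whole seconds are coarse, but they are plenty to tell a sub-second run from a minute-long one.

```shell
#!/bin/sh
# Hypothetical helper: run grep -c '^dn:' on a file several times under one
# locale and print "locale count seconds" for each run.
# Assumption: date supports the +%s format (seconds since the epoch).
time_grep() {
    loc=$1
    file=$2
    runs=${3:-3}                # default to three runs
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s)
        count=$(LC_ALL=$loc grep -c '^dn:' "$file")
        end=$(date +%s)
        printf '%s %s %s\n' "$loc" "$count" "$((end - start))"
        i=$((i + 1))
    done
}
```

For example, time_grep C /var/tmp/search.out followed by time_grep en_US.UTF-8 /var/tmp/search.out gives three timings per locale. If the first run in each batch is no slower than the later ones, file system caching is not what makes the difference.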
I think these performance differences are way higher than they should be. I’m going to dig further when I have a chance.
September 25th, 2009 at 8:31 pm
Have you filed a bug with GNU for the grep problem? If not, please do that :).
September 25th, 2009 at 10:48 pm
Hi Mike – I found out there is a matching bug submitted to GNU grep:
#14472 grep is slow in multibyte locales
http://savannah.gnu.org/bugs/?14472
The bug was filed 4 years ago and is marked confirmed, but a fix hasn’t been integrated yet. I pulled down and compiled GNU grep 2.5.4 to double-check that a fix hadn’t actually made it into the latest version.