Effect of multi-byte locales on GNU grep speed in OpenSolaris

On a lab machine running OpenSolaris 2009.06 (updated to snv_117), I had created an LDIF file with about 100k small entries (roughly 63 MB). I wanted an exact count of the entries, so I ran:

grep -c ^dn: /var/tmp/search.out

I expected it to take a second or two.  I was wrong.  It was painfully slow.

I used the time command to re-run the grep and saw it clocked in at just over a minute.

This was weird, so I thought it was time to investigate further. I used the DTrace Toolkit’s hotuser command to see what the hot functions were:

pfexec /opt/DTT/hotuser -c "grep -c ^dn: /var/tmp/search.out"
Sampling... Hit Ctrl-C to end.

FUNCTION                                                     COUNT   PCNT
...<snipped out smaller functions>...
ggrep`check_multibyte_string                                    5480   8.9%
methods_unicode.so.3`__mbrtowc_dense_utf8                      12328  20.1%
libc.so.1`mbrlen                                               13566  22.1%
libc.so.1`memset                                               23014  37.5%

Hmm, interesting to see the calls to mbrlen and methods_unicode among the hot functions. Let’s check my $LANG setting:

echo $LANG
en_US.UTF-8

Bingo! Let’s try it again with a single-byte LANG setting.

LANG=C time grep -c ^dn: /var/tmp/search.out
99987

real        0.1
user        0.0
sys         0.0

That looks normal. Now let’s try one more time with a multi-byte LANG to be sure:

LANG=en_US.UTF-8 time grep -c ^dn: /var/tmp/search.out
99987

real     1:01.4
user     1:01.3
sys         0.0

Yep, the problem is confirmed.
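The comparison above can be repeated on any machine with a small script. The following is only a sketch: count_dn is a hypothetical helper name, the sample file is synthetic rather than the original LDIF data, and LC_ALL is used instead of LANG because LC_ALL overrides both LANG and the individual LC_* variables, so a stray LC_CTYPE in the environment cannot skew the result.

```shell
# Hypothetical helper: count LDIF entries in file $2 with the locale forced to $1.
# LC_ALL takes precedence over LANG and over the individual LC_* variables.
count_dn() {
    LC_ALL=$1 grep -c '^dn:' "$2"
}

# Synthetic stand-in for the real LDIF file (1000 small entries, made-up contents).
sample=$(mktemp)
i=1
while [ "$i" -le 1000 ]; do
    printf 'dn: uid=user%d,ou=people,dc=example,dc=com\ncn: user%d\n\n' "$i" "$i"
    i=$((i + 1))
done > "$sample"

# Same count in both locales; only the run time is expected to differ.
count_c=$(count_dn C "$sample")
count_utf8=$(count_dn en_US.UTF-8 "$sample")
echo "C: $count_c  en_US.UTF-8: $count_utf8"

rm -f "$sample"
```

To see the timing gap, prefix each grep invocation with time and use a file closer in size to the real one. If the en_US.UTF-8 locale is not installed, grep will typically fall back to the C locale and no slowdown will show up.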

Notes

For those unfamiliar with OpenSolaris, the default PATH has /usr/gnu/bin first, so a bare grep runs the GNU version. The grep I was using was:

grep -V
grep (GNU grep) 2.5

Copyright 1988, 1992-1999, 2000, 2001 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The non-GNU grep available at /usr/xpg4/bin/grep doesn’t show the big slowdown, regardless of the LANG setting.
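Because several grep binaries can coexist on one system, it helps to know which one a bare grep resolves to. A minimal sketch, with list_greps as a hypothetical helper name, that walks PATH in lookup order:

```shell
# Hypothetical helper: print every executable named grep found on PATH,
# in lookup order. The first line printed is what a bare "grep" runs.
list_greps() {
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        # An empty PATH element means the current directory.
        if [ -z "$dir" ]; then
            dir=.
        fi
        if [ -x "$dir/grep" ]; then
            printf '%s\n' "$dir/grep"
        fi
    done
    IFS=$old_ifs
}

list_greps
```

Running grep -V on a listed path then shows whether it is the GNU implementation; other implementations may not accept the -V option.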

I also tried the same test on the GNU wc command and saw about a 25x difference when using a multi-byte LANG.

For both grep and wc, I re-ran the tests multiple times to make sure that file system caching played no role in the results.
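One way to take file system caching out of the picture is to read the file once before measuring and then repeat the measured command several times. The sketch below assumes that approach; run_repeated is a hypothetical helper name, and it uses whatever locale is set in the calling environment.

```shell
# Hypothetical helper: warm the cache for file $2, then run the count $1 times.
run_repeated() {
    runs=$1
    file=$2
    cat "$file" > /dev/null      # read the file once so later runs hit the cache
    n=1
    while [ "$n" -le "$runs" ]; do
        grep -c '^dn:' "$file"   # prefix with "time" to see per-run timings
        n=$((n + 1))
    done
}

# Tiny made-up sample so the helper can be tried without the real data.
demo=$(mktemp)
printf 'dn: uid=a,dc=example,dc=com\n\ndn: uid=b,dc=example,dc=com\n' > "$demo"
repeat_output=$(run_repeated 3 "$demo")
echo "$repeat_output"
rm -f "$demo"
```

If the reported times stay roughly constant across runs, disk reads are unlikely to explain the difference between locales.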

I think these performance differences are far larger than they should be. I’m going to dig further when I have a chance.

2 Responses to “Effect of multi-byte locales on GNU grep speed in OpenSolaris”

  1. Mike Says:

    Have you filed a bug with GNU for the grep problem? If not, please do that :).

  2. William Hathaway Says:

    Hi Mike – I found out there is a matching bug submitted to GNU grep:

    #14472 grep is slow in multibyte locales
    http://savannah.gnu.org/bugs/?14472

    The bug was created 4 years ago and is marked confirmed, but the fix hasn’t been integrated yet. I downloaded and compiled GNU grep 2.5.4 to double-check that it hadn’t made it into the latest version.

Copyright © 2010 williamhathaway.com. All Rights Reserved.