Archive for April, 2009

Make sure your graphs visually represent your results

Monday, April 6th, 2009

I was just reading a presentation on SSD performance for PostgreSQL and came across a graph that made my head spin.

If you eyeball it, the hard disk drive (HDD) appears to be roughly 40% as fast as the solid state disk (SSD) for a specific test. The presentation author did include the yellow star to the right highlighting the fact that the difference is actually 0.5%, but why not start the graph's axis at 0 so that it visually represents the results? Or, if the difference is that small, skip the graph entirely and just say there is essentially no difference for this type of benchmark.

I think the way the graph was made was very likely the result of the default settings of the charting tool, and I don’t think the author had any intention of misleading people (especially since he highlighted the real difference outside the graph). Still, if I generated a graph that didn’t reflect what I was trying to convey, I would fix it or leave it out.
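To put numbers on the effect, here is a minimal sketch of the arithmetic. The scores 597 and 600 are hypothetical values I picked because they differ by 0.5%; they are not the presentation's data, and `bar_ratio` is just an illustrative helper name.

```shell
# Height of the shorter bar relative to the taller one, as a percentage,
# for a chart whose value axis starts at axis_min.
#   $1 = smaller value, $2 = larger value, $3 = where the axis starts
bar_ratio() {
    awk -v lo="$1" -v hi="$2" -v min="$3" \
        'BEGIN { printf "%.1f\n", 100 * (lo - min) / (hi - min) }'
}

# Hypothetical scores that differ by 0.5% (NOT the presentation's real data).
bar_ratio 597 600 0      # axis starts at 0     -> prints 99.5
bar_ratio 597 600 595    # axis starts at 595   -> prints 40.0
```

With the axis starting at 0 the two bars are nearly the same height; with the axis truncated just below the smaller value, the same data looks like a 60% gap.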

Besides the egregious graph, I think the presentation overall was very good.  I especially liked the recommendation section at the end, which was very actionable.

Appalachian Trail Run

Monday, April 6th, 2009

This afternoon I ran 7.3 miles on the Appalachian Trail with Mark Rebuck. The section that we did started at Rt 641 (Trindle Road) and went out to the bridge that crosses over the Carlisle Pike. The weather was fantastic, and we saw a lot of people out on the trail hiking or running. I really liked this section of the trail as it had a variety of terrain, including both open fields and woods, some very flat sections and a few hills, and no part that had crazy rocks. I’ve been on the AT a few times, but often there were sections that had a crazy amount of rocks and I would be nervous while running that I’d turn an ankle. This afternoon’s run was a total delight, and I’m definitely going to run this section again.

Motionbased

Checking out Blogbench with ZFS – atime matters

Saturday, April 4th, 2009

I recently ran across a storage benchmark called Blogbench and decided to give it a quick whirl on my lab machine.

My environment consists of a Sun x4150 (32G RAM and dual quad-core Xeons @ 2.93 GHz) running Solaris 10 05/2008 with (8) 73GB 10k RPM SAS drives and an LSI SAS RAID controller with 256M of memory.

For this test I created the following 6 disk zpool:

Two options I decided to test for ZFS using BlogBench were atime and compression. For those unfamiliar with it, the atime of a file is updated whenever the file is accessed. All UNIX file systems I’m aware of have atime updating enabled by default. As a sysadmin, it can be very handy to have the atime available to see when a file was last accessed, but atime updating can add a significant amount of overhead for some access patterns, so a lot of sites disable it on mounts that need performance. To disable atime on ZFS you use:

zfs set atime=off $datasetname

For most other UNIX file system types there is a noatime or similar mount option that can be used.

I ran 5 iterations of BlogBench (using ./blogbench --directory=/data/blogbench) for each permutation of atime and compression settings. I had the script sleep for 60 seconds between runs to make sure any background activity for memory or ZFS housekeeping had finished before the next run started. Averaging the 5 runs together for each permutation gave me the following results:

When atime was on (which is the default), there was very little difference between the non-compressed and compressed results. With atime disabled there was a 30-50% increase in read transactions performed and about a 250% increase in write operations. Note that the data size of the benchmark (~3.6G) was significantly smaller than the memory on the machine (32G), so all reads were satisfied out of file system cache.

These results only apply to this specific test and software/hardware combination, so results in your environment may vary significantly. Still, I would like people to be aware of the atime setting as another potential knob to turn.
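For anyone who wants to repeat the test, the procedure described above (5 runs per atime/compression permutation with a 60 second pause between runs) could be sketched roughly as follows. This is not the script I used: the dataset name `data` is a hypothetical placeholder, and `DRY_RUN`, `run`, `run_matrix` and `average` are names made up for this sketch. Setting `DRY_RUN=1` only prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of the benchmark loop. ASSUMPTION: the dataset name is hypothetical.
DATASET=${DATASET:-data}                  # hypothetical ZFS dataset name
BENCH_DIR=${BENCH_DIR:-/data/blogbench}   # directory passed to blogbench
RUNS=${RUNS:-5}                           # iterations per permutation
PAUSE=${PAUSE:-60}                        # seconds to wait between runs

# Execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Walk through every atime/compression permutation.
run_matrix() {
    for atime in on off; do
        for compression in on off; do
            run zfs set atime="$atime" "$DATASET"
            run zfs set compression="$compression" "$DATASET"
            i=1
            while [ "$i" -le "$RUNS" ]; do
                echo "# atime=$atime compression=$compression run=$i"
                run ./blogbench --directory="$BENCH_DIR"
                run sleep "$PAUSE"
                i=$((i + 1))
            done
        done
    done
}

# Average a column of numbers (one per line) read from stdin.
average() {
    awk '{ sum += $1; n++ } END { if (n) printf "%.1f\n", sum / n }'
}
```

Running `DRY_RUN=1 run_matrix` shows the full command sequence before touching any dataset. Pulling the scores out of the BlogBench output is left out on purpose, since that output is not reproduced here; `average` simply expects one number per line.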

Using IBM Quickr with Sun Directory Server

Thursday, April 2nd, 2009

A customer was testing out Lotus/IBM’s Quickr collaboration software and using Sun’s Directory Server as the user store.  One of the system admins mentioned that queries searching for people were glacially slow.  We investigated by checking out the access log to look for slow queries and saw that Quickr was running un-indexed queries that searched against cn,  givenName, and displayName.  These queries were taking about 30 seconds to run since the directory server had to do the DB equivalent of full-table scans.  We checked the indexes and saw that displayName wasn’t indexed.  After adding an index for the displayName attribute the queries were snappy, taking less than a second.
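The access log check can be sketched as a small filter. It assumes the usual Sun Directory Server access log layout, where each operation is logged as a SRCH line and a matching RESULT line sharing the same conn= and op= values, and where un-indexed searches carry notes=U on the RESULT line. The function name `unindexed_searches` and the threshold argument are made up for this example.

```shell
# Print "etime conn op filter" for un-indexed searches that took at least
# $1 seconds. Reads an access log on stdin.
# ASSUMPTION: SRCH/RESULT lines share conn= and op=, and un-indexed
# searches are flagged with notes=U on the RESULT line.
unindexed_searches() {
    awk -v min="${1:-1}" '
    {
        # Pick out the connection, operation and elapsed time tokens.
        conn = op = etime = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^conn=/)  conn  = $i
            if ($i ~ /^op=/)    op    = $i
            if ($i ~ /^etime=/) etime = substr($i, 7)
        }
        key = conn " " op
    }
    # Remember the filter of every search request.
    / SRCH / && match($0, /filter="[^"]*"/) {
        filter[key] = substr($0, RSTART, RLENGTH)
    }
    # Report slow, un-indexed results together with the original filter.
    / RESULT / && /notes=U/ && etime + 0 >= min + 0 && (key in filter) {
        print etime, key, filter[key]
    }'
}
```

For example, `unindexed_searches 10 < access` would list every un-indexed search that ran for 10 seconds or longer, which makes it easy to spot which attributes are missing an index.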

Troubleshooting file descriptor problems in Sun Directory Server

Wednesday, April 1st, 2009

I have a customer that was encountering a problem where their test directory server (running Sun DS 5.2p4) was constantly running out of file descriptors.  They had bumped the allowed number of file descriptors up to 4096, and that slowed the occurrence of the error, but the  root cause had not been diagnosed yet.  We first took a look using netstat and saw:


netstat -an | grep ^$THEIR_IP.389 | grep -c ESTAB

4012

So we have confirmed the problem is as stated.  Often this problem is caused by applications that don’t use connection pools properly and open way too many connections.

Next we checked under cn=monitor to see which accounts were connected to the directory server:

/bin/ldapsearch -T -D cn=directory\ manager  -h ldap -b cn=monitor -s base objectclass=* connection | awk -F: '{ print $7 }' | sort | uniq  -c

2500  uid=application_xyz,ou=apps,dc=example,dc=com

1200  uid=application_foo,ou=apps,dc=example,dc=com

220  uid=application_shizzle,ou=apps,dc=example,dc=com

So it looks like applications xyz and foo are the primary culprits.

We’ll also count the established connections by IP address to tell which machines are creating the most connections:

netstat -an | nawk -v addr="$LDAP_IP.389" '$1 == addr && /ESTAB/ { print $2 }' | cut -d. -f1-4 | sort | uniq -c
2700   10.10.1.168
400    10.10.1.169
300    192.168.1.1
...

So 10.10.1.168 is the machine with the most connections. We then hopped over to 10.10.1.168 (which runs an application server) and took a look from its point of view:

netstat -an | grep -c $LDAP_IP.389

2

Whoa! Houston, we have a problem. From the LDAP server’s point of view, it has 2700 connections from the app server. From the app server’s point of view, it has 2 connections to the LDAP server. If we had seen symmetry between the app server’s network connections and the directory server’s network connections, it would have been an application-level problem of allocating too many connections. In this case, since the connection counts are extremely asymmetrical, it looks like there is a firewall, load balancer, or other network device in the path between these two machines that is killing connections from the application server without also telling the LDAP server that the connection is dead.

We asked the network team to investigate and in the meantime put in a work-around of setting an idle timeout on the LDAP server. This lets the directory server kill any connection that it hasn’t received an operation from in some time period (we set it to a generous 12 hours), and we immediately saw the number of established connections drop down to a few hundred. Problem solved.
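One way to set such an idle timeout is through the nsslapd-idletimeout attribute on cn=config, which takes a value in seconds. The sketch below only generates the LDIF for a given number of hours; `idle_timeout_ldif` is a made-up helper name, and whether this attribute is the right mechanism for your directory server version is an assumption you should check against its documentation.

```shell
# Print an LDIF change that sets the server-wide idle timeout.
#   $1 = timeout in hours (converted to seconds for the attribute value)
# ASSUMPTION: the server honors nsslapd-idletimeout on cn=config.
idle_timeout_ldif() {
    hours=$1
    cat <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-idletimeout
nsslapd-idletimeout: $((hours * 3600))
EOF
}

# 12 hours, as used above -> the last line reads "nsslapd-idletimeout: 43200"
idle_timeout_ldif 12
```

The output can be fed to ldapmodify while bound as the directory manager. Keep in mind that a server-wide idle timeout also affects well-behaved clients that hold long-lived connections, which is why a generous value is a sensible starting point.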


Copyright © 2012 williamhathaway.com. All Rights Reserved.