Troubleshooting file descriptor problems in Sun Directory Server

I have a customer whose test directory server (running Sun DS 5.2p4) was constantly running out of file descriptors. They had bumped the allowed number of file descriptors up to 4096, which slowed the occurrence of the error, but the root cause had not yet been diagnosed. We first took a look using netstat and saw:


netstat -an | grep "^$LDAP_IP.389" | grep -c ESTAB

4012

That confirms the problem is as stated: roughly 4000 established connections against a limit of 4096 file descriptors. Often this problem is caused by applications that don’t use connection pools properly and open far too many connections.

Next we checked under cn=monitor to see which accounts were connected to the directory server:

/bin/ldapsearch -T -D "cn=directory manager" -h ldap -b cn=monitor -s base "objectclass=*" connection | awk -F: '{ print $7 }' | sort | uniq -c

2500  uid=application_xyz,ou=apps,dc=example,dc=com

1200  uid=application_foo,ou=apps,dc=example,dc=com

220  uid=application_shizzle,ou=apps,dc=example,dc=com

So it looks like applications xyz and foo are the primary culprits.

We’ll also count the established connections by IP address to tell which machines are creating the most connections:

netstat -an | nawk -v addr="$LDAP_IP.389" '$1 == addr && /ESTAB/ { print $2 }' | cut -d. -f1-4 | sort | uniq -c
2700   10.10.1.168
400    10.10.1.169
300    192.168.1.1
...
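
If the netstat output is saved during an incident, the same tally can be rerun on the file later. Here is a minimal sketch, assuming Solaris-style netstat -an lines where the local address is the first column and the remote address is the second, each written as IP.port. The helper name count_by_remote_ip is made up for illustration and is not part of any tool.

```shell
# Tally ESTABLISHED connections per remote IP from `netstat -an` text on stdin.
# Assumes Solaris-style columns: local address first, remote address second,
# each written as IP.port. count_by_remote_ip is a hypothetical helper name.

# Prefer nawk where it exists (the stock Solaris awk lacks -v and sub()).
AWK=nawk
command -v nawk >/dev/null 2>&1 || AWK=awk

count_by_remote_ip() {
    # $1 = local address and port to match, in the same IP.port form
    "$AWK" -v addr="$1" '
        $1 == addr && /ESTAB/ {
            ip = $2
            sub(/\.[0-9]+$/, "", ip)   # strip the trailing .port
            count[ip]++
        }
        END { for (ip in count) print count[ip], ip }
    ' | sort -rn                       # busiest client first
}
```

It could be used as `netstat -an > netstat.out` followed by `count_by_remote_ip "$LDAP_IP.389" < netstat.out`. Passing the address with -v keeps the shell variable out of the single-quoted awk program, where it would not be expanded.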

We now know that 10.10.1.168 is the machine with the most connections coming from it. We then hopped over to 10.10.1.168 (running an application server) and took a look from its point of view:

netstat -an | grep -c "$LDAP_IP.389"

2

Whoa! Houston, we have a problem. From the LDAP server’s point of view, it has 2700 connections from the app server. From the app server’s point of view, it has 2 connections to the LDAP server. If we had seen symmetry between the app server’s network connections and the directory server’s network connections, it would have been an application-level problem of allocating too many connections. In this case, since the connection counts are extremely asymmetrical, it looks like there is a firewall, load balancer, or other network device in the path between these two machines that is killing connections from the application server without telling the LDAP server that the connection is dead.

We asked the network team to investigate and in the meantime put in a workaround of setting an idle timeout on the LDAP server. This lets the directory server close any connection on which it hasn’t received an operation within some time period (we set it to a generous 12 hours). We immediately saw the number of established connections drop to a few hundred. Problem solved.
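
For reference, the idle timeout workaround can be sketched as a small LDIF change. This assumes the server exposes the limit as the nsslapd-idletimeout attribute on cn=config, measured in seconds, which is how the Sun and Netscape-derived directory servers document it. The file location and the commented ldapmodify invocation are illustrative only, so check them against your own installation before applying anything.

```shell
# Build an LDIF that sets the idle timeout to 12 hours.
# Assumption: the setting is nsslapd-idletimeout on cn=config, in seconds.
IDLE_HOURS=12
IDLE_SECONDS=$((IDLE_HOURS * 60 * 60))       # convert hours to seconds
LDIF_FILE=${TMPDIR:-/tmp}/idletimeout.ldif   # hypothetical location

cat > "$LDIF_FILE" <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-idletimeout
nsslapd-idletimeout: $IDLE_SECONDS
EOF

# Review the file, then apply it with your usual bind credentials, e.g.:
# ldapmodify -h ldap -D "cn=directory manager" -f "$LDIF_FILE"
```

Writing the change to a file first makes it easy to review, and to revert later by replacing the value once the network device has been fixed.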

Copyright © 2010 williamhathaway.com. All Rights Reserved.