Troubleshooting file descriptor problems in Sun Directory Server
I have a customer whose test directory server (running Sun DS 5.2p4) kept running out of file descriptors. They had bumped the file descriptor limit up to 4096, which made the error less frequent, but nobody had diagnosed the root cause yet. We first took a look using netstat and saw:
netstat -an | grep ^$THEIR_IP.389 | grep -c ESTAB
4012
With 4012 established connections against a limit of 4096 file descriptors, we have confirmed the problem is as stated. Often this problem is caused by applications that don’t use connection pools properly and open way too many connections.
Next we checked under cn=monitor to see which accounts were connected to the directory server:
/bin/ldapsearch -T -D cn=directory\ manager -h ldap -b cn=monitor -s base objectclass=* connection | awk -F: '{ print $7 }' | sort | uniq -c
2500 uid=application_xyz,ou=apps,dc=example,dc=com
1200 uid=application_foo,ou=apps,dc=example,dc=com
220 uid=application_shizzle,ou=apps,dc=example,dc=com
…
So it looks like applications xyz and foo are the primary culprits.
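To see why the awk step picks field 7, here is a minimal sketch of the same tally run against a few made-up lines. It assumes the connection values come back in LDIF form with the bind DN as the seventh colon-separated field, which is what the command above implies; the sample values and the count_bind_dns name are invented for illustration.

```shell
# Tally connections per bind DN from cn=monitor "connection" values.
# Assumption: the bind DN is the 7th colon-separated field, matching
# the awk -F: '{ print $7 }' step used in the post.
count_bind_dns() {
  awk -F: '/^connection:/ && $7 != "" { n[$7]++ }
           END { for (dn in n) printf "%d %s\n", n[dn], dn }' | sort -rn
}

# Made-up sample input, for illustration only.
sample='connection: 64:20070101120000Z:3:3::uid=application_xyz,ou=apps,dc=example,dc=com
connection: 65:20070101120001Z:1:1::uid=application_xyz,ou=apps,dc=example,dc=com
connection: 66:20070101120002Z:5:5::uid=application_foo,ou=apps,dc=example,dc=com'

printf '%s\n' "$sample" | count_bind_dns
```

Counting inside awk instead of sort | uniq -c gives the same numbers; it just keeps the output free of the padding uniq adds.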
We’ll also count the established connections by IP address to tell which machines are creating the most connections:
netstat -an | nawk -v addr="$LDAP_IP.389" '$1 == addr && /ESTAB/ { print $2 }' | cut -d. -f1-4 | sort | uniq -c
2700 10.10.1.168
400 10.10.1.169
300 192.168.1.1
...
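If the cut step looks odd: Solaris netstat prints addresses as address.port with a dot separator, so keeping the first four dot-separated fields drops the client's ephemeral port. Here is a rough sketch of the per-client count against made-up netstat lines. The addresses, the column values, and the count_by_client name are all hypothetical, and it uses plain awk with -v (substitute nawk on Solaris).

```shell
# Count established connections per client IP from `netstat -an` output.
# $1 is the server's local address in Solaris form, e.g. 192.0.2.10.389.
# cut keeps the four IP octets and drops the trailing .port field.
count_by_client() {
  awk -v addr="$1" '$1 == addr && /ESTAB/ { print $2 }' |
    cut -d. -f1-4 | sort | uniq -c | sort -rn
}

# Made-up sample lines, for illustration only.
sample='192.0.2.10.389 10.10.1.168.40001 49640 0 49640 0 ESTABLISHED
192.0.2.10.389 10.10.1.168.40002 49640 0 49640 0 ESTABLISHED
192.0.2.10.389 10.10.1.169.51000 49640 0 49640 0 ESTABLISHED
192.0.2.10.389 10.10.1.169.51001 49640 0 49640 0 TIME_WAIT
192.0.2.10.636 10.10.1.168.40003 49640 0 49640 0 ESTABLISHED'

printf '%s\n' "$sample" | count_by_client 192.0.2.10.389
```

The TIME_WAIT line and the port 636 line are there to show that only established connections to port 389 get counted.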
So 10.10.1.168 is the machine with the most connections. We then hopped over to 10.10.1.168 (an application server) and took a look from its point of view:
netstat -an | grep -c $LDAP_IP.389
2
Whoa! Houston, we have a problem. From the LDAP server’s point of view, it has 2700 connections from the app server. From the app server’s point of view, it has 2 connections to the LDAP server. If the two counts had matched, this would have been an application-level problem of allocating too many connections. Since the counts are wildly asymmetric, it looks like a firewall, load balancer, or other network device in the path between these two machines is killing connections from the application server without telling the LDAP server that the connection is dead.
We asked the network team to investigate and, in the meantime, put in a workaround by setting an idle timeout on the LDAP server. This lets the directory server close any connection it hasn’t received an operation on within some time period (we set it to a generous 12 hours). We immediately saw the number of established connections drop to a few hundred. Problem solved.
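For reference, here is a sketch of what that idle timeout change might look like. The post doesn't say how the timeout was set; this assumes the server-wide nsslapd-idletimeout attribute on cn=config, which takes a value in seconds, so 12 hours becomes 43200. The idle_timeout_ldif helper is a made-up name that only prints the LDIF.

```shell
# Emit an LDIF modify that sets the server-wide idle timeout.
# Assumption: the timeout lives in nsslapd-idletimeout on cn=config
# and is expressed in seconds; the argument here is in hours.
idle_timeout_ldif() {
  hours=$1
  cat <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-idletimeout
nsslapd-idletimeout: $((hours * 3600))
EOF
}

idle_timeout_ldif 12
```

The output is meant to be fed to ldapmodify bound as the directory manager. Printing it first makes it easy to eyeball the value before touching the server.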