Sun Messaging Server login hang
30-second summary for those who don’t want to read the troubleshooting details
When using replicated LDAP servers in a Sun Communications Suite deployment, it is important that every connection from a given Convergence (webmail component) instance go to the same LDAP server. Otherwise, address book creation can partially fail, causing some user logins to webmail to hang. To fix this, use one of the following techniques:
1) Configure Convergence’s application level failover to point to individual LDAP servers (be sure to switch the host order on alternating Convergence instances to spread the load)
/opt/sun/comms/iwc/sbin/iwcadmin -u admin -W pwdfile -o ugldap.host -v ldap1:$port,ldap2:$port
(you will also need to restart the web container for this to take effect)
2) Use Directory Proxy Server to route writes to a preferred master
3) If pointing at a HW load balancer virtual IP, use a distribution algorithm that has backend server persistence based on originating IP. Note that with a few machines this might not actually balance out well, so verify you aren’t overloading one LDAP instance.
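To make the caveat in option 3 concrete, here is a small Python sketch of source-IP persistence. The hash, the function names, and the IP addresses are my own placeholders for illustration; a real load balancer uses its own algorithm, so treat this as a model of the idea rather than a description of any product.

```python
import hashlib
from collections import Counter


def pick_backend(client_ip, backends):
    """Map a client IP to one backend; the same IP always gets the same backend."""
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    index = int.from_bytes(digest[:4], "big") % len(backends)
    return backends[index]


def distribution(client_ips, backends):
    """Count how many clients land on each backend (backends with no clients show 0)."""
    counts = Counter({backend: 0 for backend in backends})
    counts.update(pick_backend(ip, backends) for ip in client_ips)
    return dict(counts)


if __name__ == "__main__":
    backends = ["ldap1", "ldap2"]
    # Hypothetical addresses for two Convergence hosts.
    convergence_hosts = ["192.0.2.11", "192.0.2.12"]
    print(distribution(convergence_hosts, backends))
```

With only two or three originating IPs, nothing stops all of them from mapping to the same LDAP server, which is why it is worth checking the real connection counts on each LDAP instance after enabling persistence.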
Background
A customer of mine is deploying Sun’s Communications Suite (aka Messaging, Calendar, and IM servers) and was testing their custom provisioning tool. A few accounts had been created that worked fine, but one of the accounts would just hang when trying to log in to webmail. The screen would show the application initialization progress bar stuck at 84% and indicate it was dealing with the address book.
I verified that the account would hang and then took a look at the account’s main LDAP entry, which looked fine. I then checked the account’s LDAP address book data. The bad account had 3 LDAP entries in the address book branch, but a good account should have 4. Taking a look at the iwc.log from Convergence, I could see an error:
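The entry-count check can be scripted against an LDIF export of the address book branch. This is only a sketch under a few assumptions of mine: the export is plain LDIF with unfolded `dn:` lines (no base64 `dn::` values), DNs have no spaces after the commas, and `example.com` stands in for the real domain. The function name is hypothetical.

```python
def count_address_book_entries(ldif_text, user, domain):
    """Count LDIF entries at or below the user's piPStoreOwner entry."""
    base = f"piPStoreOwner={user},o={domain},o=PiServerDb".lower()
    count = 0
    for line in ldif_text.splitlines():
        if not line.lower().startswith("dn:"):
            continue
        dn = line[3:].strip().lower()
        # Match the parent entry itself or anything underneath it.
        if dn == base or dn.endswith("," + base):
            count += 1
    return count


if __name__ == "__main__":
    sample = "\n".join([
        "dn: piPStoreOwner=baduser,o=example.com,o=PiServerDb",
        "",
        "dn: piEntryID=e1,piPStoreOwner=baduser,o=example.com,o=PiServerDb",
        "",
        "dn: piEntryID=e2,piPStoreOwner=baduser,o=example.com,o=PiServerDb",
    ])
    found = count_address_book_entries(sample, "baduser", "example.com")
    # At this site a healthy account had 4 entries; fewer suggests a partial create.
    print(f"baduser has {found} address book entries")
```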
ADDRESS_BOOK: ERROR from com.sun.comms.client.ab.wabp.WABPEngineServlet Thread httpSSLWorkerThread-80-4 at 2009-06-27 14:33:05,586 – pstore object couldn’t be created for user :baduser
At this site we have a pair of replicated LDAP servers behind a pair of load balancers that are used by the messaging components. Sun’s Directory Server has a loose replication model that usually works fine, but you can run into a rare race condition when applications add inter-related entries to different masters in rapid-fire succession. Convergence was initially pointing at the load balancer virtual IP to reach the LDAP servers.
When I checked the logs on the LDAP servers, I could see that Convergence had tried to create address book entries when the user first logged in, but had done so over several different LDAP connections which, via the load balancer, went to different LDAP servers. It created a parent entry on ldap1, then in a connection to ldap2 tried to create a dependent entry, which failed because the parent had not yet replicated to ldap2. Convergence then created another version of the parent entry on ldap2 (which worked, but caused a replication conflict). Later attempts to log in ended up adding some dependent entries, but the address book was still in an unusable state.
When things are working correctly, you will see an LDAP operation pattern that looks like:
[27/Jun/2009:16:21:39 -0400] conn=566625 op=5 msgId=948 - ADD dn="piPStoreOwner=$user,o=$domain,o=PiServerDb"
[27/Jun/2009:16:21:39 -0400] conn=566636 op=1 msgId=950 - ADD dn="piEntryID=random1,piPStoreOwner=$user,o=$domain,o=PiServerDb"
[27/Jun/2009:16:21:39 -0400] conn=566636 op=2 msgId=951 - ADD dn="piEntryID=random2,piPStoreOwner=$user,o=$domain,o=PiServerDb"
[27/Jun/2009:16:21:39 -0400] conn=566636 op=3 msgId=952 - ADD dn="piEntryID=random3,piPStoreOwner=$user,o=$domain,o=PiServerDb"
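If you want to spot the broken case without reading the logs by eye, a rough Python sketch like the following can flag address books whose ADD operations were spread across more than one LDAP server. The function names and the `logs_by_server` structure are my own invention. It assumes you have already read each server's access log into a list of lines, and that those lines contain only client-originated ADDs; if your server also logs replicated updates as ADD operations, filter out the replication connections first.

```python
import re
from collections import defaultdict

# Accept straight or typographic quotes around the DN, since logs pasted
# through a blog or word processor often have the latter.
ADD_RE = re.compile(r'\bADD dn=["\u201c\u201d]([^"\u201c\u201d]+)')


def servers_per_pstore(logs_by_server):
    """Map each piPStoreOwner subtree to the set of servers that saw ADDs for it."""
    seen = defaultdict(set)
    for server, lines in logs_by_server.items():
        for line in lines:
            match = ADD_RE.search(line)
            if not match:
                continue
            dn = match.group(1).lower()
            start = dn.find("pipstoreowner=")
            if start == -1:
                continue  # not an address book entry
            seen[dn[start:]].add(server)
    return dict(seen)


def split_pstores(logs_by_server):
    """Return only the address books whose ADDs hit more than one server."""
    return {
        pstore: sorted(servers)
        for pstore, servers in servers_per_pstore(logs_by_server).items()
        if len(servers) > 1
    }


if __name__ == "__main__":
    logs = {
        "ldap1": ['conn=1 op=5 msgId=948 - ADD dn="piPStoreOwner=baduser,o=example.com,o=PiServerDb"'],
        "ldap2": ['conn=7 op=1 msgId=950 - ADD dn="piEntryID=x1,piPStoreOwner=baduser,o=example.com,o=PiServerDb"'],
    }
    for pstore, servers in split_pstores(logs).items():
        print(f"{pstore}: ADDs seen on {', '.join(servers)}")
```

An empty result from `split_pstores` means every address book in the sampled logs was created entirely on one server, which is the healthy pattern shown above.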
The fix
To fix the account and the problem in general, we deleted the skeleton address book entries for the user in question and used iwcadmin to point Convergence at the individual LDAP servers in failover mode. Since we had two Convergence instances and two LDAP instances, it was easy to flip the preferred order so that the LDAP load would stay well balanced.
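With more than two instances, flipping the order by hand gets tedious. As a small sketch, the host list for each Convergence instance could be generated by rotating the starting server. The function name is hypothetical and port 389 is a placeholder for whatever `$port` is at your site; each resulting string is what you would pass as the `-v` value of the `ugldap.host` option.

```python
def rotated_host_lists(ldap_hosts, instance_count):
    """Return one comma-separated host list per Convergence instance.

    Instance i starts with host i (wrapping around), so the preferred
    LDAP server differs from one instance to the next.
    """
    host_lists = []
    for i in range(instance_count):
        shift = i % len(ldap_hosts)
        host_lists.append(",".join(ldap_hosts[shift:] + ldap_hosts[:shift]))
    return host_lists


if __name__ == "__main__":
    # Placeholder host:port pairs; substitute your own.
    for number, hosts in enumerate(rotated_host_lists(["ldap1:389", "ldap2:389"], 2), start=1):
        print(f"Convergence instance {number}: ugldap.host = {hosts}")
```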
Things that could be improved
1) Convergence could give the user a better error experience instead of just hanging, perhaps timing out after 30 seconds with a message such as “There is a problem with initializing your address book, please ask your administrator to investigate”.
2) Convergence could use a single LDAP connection when performing address book creation for any given user
3) Sun’s Directory Server could have an assured replication model (this is available in the OpenDS 2.0 release candidates)