Archive for the ‘ldap’ Category

Creating a slow operation log for OpenDS

Monday, November 2nd, 2009

Anyone who has spent much time looking at MySQL performance will be familiar with the ‘slow query log’, a log that records any query that took longer than some threshold. For kicks, I tried implementing a similar hook for OpenDS. My current version is in pretty rough shape (not very efficient or configurable), but it seems to work. I started from a copy of the TextAccessLogPublisher.java file and created a new one called TextSlowAccessLogPublisher.java. My logic is basically:

  • create a hash table
  • empty out all the log XYZIntermediateMessage and connect/disconnect methods
  • when a request comes in, store the text to log in the hash table (keyed off connectionID and opNumber) instead of outputting it (changed the logSearchRequest, logModifyRequest, … methods)
  • when a request is finished processing, we check the elapsed time (etime)
    • if the elapsed time is greater than or equal to our threshold
      • print the request info we stashed in the hash table and delete it
      • print the response info
    • if the elapsed time is less than our threshold
      • delete the request info from the hash table, don’t print anything
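The same stash-and-compare logic can be tried out on an existing access log before touching any Java. What follows is only a rough sketch, not the publisher code itself: it assumes each request line and each response line carries conn= and op= fields, and that only response lines carry an etime= field. The function name slow_ops is made up for illustration.

```shell
# slow_ops THRESHOLD
# Reads access-log lines on stdin and prints the request/response pair for
# every operation whose etime is >= THRESHOLD. The field names (conn=, op=,
# etime=) are assumptions about the log layout.
slow_ops() {
  awk -v threshold="$1" '
    {
      conn = ""; op = ""; etime = ""
      for (i = 1; i <= NF; i++) {
        if (conn == ""  && $i ~ /^conn=/)  conn  = substr($i, 6)
        if (op == ""    && $i ~ /^op=/)    op    = substr($i, 4)
        if (etime == "" && $i ~ /^etime=/) etime = substr($i, 7)
      }
      if (conn == "" || op == "") next      # not an operation line
      key = conn SUBSEP op                  # keyed off connection ID and op number
      if (etime == "") {                    # request: stash it, print nothing yet
        pending[key] = $0
        next
      }
      if (etime + 0 >= threshold + 0) {     # slow: print stashed request, then response
        if (key in pending) print pending[key]
        print
      }
      delete pending[key]                   # fast or slow, forget the request
    }'
}
```

Something like `slow_ops 100 < logs/access` would then print only the operations that took at least 100 time units, in whatever unit the log records etime.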

There are a few more things I want to do:

  • Make the ‘slow operation threshold time’ dynamically changeable (looks like I will need to mess with configuration objects since I want to add an additional parameter not in the standard access log type)
  • Add extra information to the output format such as authorization DN (and potentially client connection info if not too hard to retrieve)
  • Instead of formatting text for every request, just put the Operation object into the hash table. Since the majority of operations won’t ever get printed, we shouldn’t burn CPU formatting them; an operation would only be formatted to text if it ends up being slow and printed.


Sun Messaging Server login hang

Sunday, June 28th, 2009

30 second summary for those that don’t want to read the troubleshooting details

When using replicated LDAP servers in a Sun Communications Suite deployment, it is important that every connection from a given Convergence (webmail component) instance goes to the same LDAP server; otherwise address book creation can partially fail, causing some user logins to webmail to hang. To fix this, use one of the following techniques:

1) Configure Convergence’s application level failover to point to individual LDAP servers (be sure to switch the host order on alternating Convergence instances to spread the load)

/opt/sun/comms/iwc/sbin/iwcadmin -u admin -W pwdfile -o ugldap.host -v ldap1:$port,ldap2:$port

(you will also need to restart the web container for this to take effect)

2) Use Directory Proxy Server to route writes to a preferred master

3) If pointing at a HW load balancer virtual IP, use a distribution algorithm that has backend server persistence based on originating IP.  Note that with a few machines this might not actually balance out well, so verify you aren’t overloading one LDAP instance.

Background

A customer of mine is deploying Sun’s Communications Suite (aka Messaging, Calendar, and IM servers) and was testing their custom provisioning tool. A few accounts had been created that worked fine, but one of the accounts would just hang when trying to log in to webmail. The screen would show the application initialization progress bar stuck at 84% and indicate it was dealing with the address book.

I verified that the account would hang and then took a look at the account’s main LDAP entry, which looked fine.  I then checked the account’s LDAP address book data.  The bad account had 3 LDAP entries in the address book branch, but a good account should have 4. Taking a look at the iwc.log from Convergence, I could see an error:

ADDRESS_BOOK: ERROR from com.sun.comms.client.ab.wabp.WABPEngineServlet  Thread httpSSLWorkerThread-80-4 at 2009-06-27 14:33:05,586 - pstore object couldn't be created for user :baduser

At this site we have a pair of replicated LDAP servers behind a pair of load balancers that are used by the messaging components. Sun’s Directory Server has a loose replication model that usually works fine, but you can run into a rare race condition when applications add inter-related entries to different masters in rapid-fire succession. Convergence was initially pointing at the load balancer virtual IP to reach the LDAP servers.

When I checked the logs on the LDAP servers, I could see that Convergence had tried to create address book entries when the user first logged in, but had done so over several different LDAP connections, which the load balancer sent to different LDAP servers. It created a parent entry on ldap1, then in a connection to ldap2 tried to create a dependent entry, which failed. Convergence then created another version of the parent entry on ldap2 (which worked, but caused a replication conflict). Later attempts to log in ended up adding some dependent entries, but the address book was still in an unusable state.

When things are working correctly, you will see an LDAP operation pattern that looks like:

[27/Jun/2009:16:21:39 -0400] conn=566625 op=5 msgId=948 - ADD dn="piPStoreOwner=$user,o=$domain,o=PiServerDb"
[27/Jun/2009:16:21:39 -0400] conn=566636 op=1 msgId=950 - ADD dn="piEntryID=random1,piPStoreOwner=$user,o=$domain,o=PiServerDb"
[27/Jun/2009:16:21:39 -0400] conn=566636 op=2 msgId=951 - ADD dn="piEntryID=random2,piPStoreOwner=$user,o=$domain,o=PiServerDb"
[27/Jun/2009:16:21:39 -0400] conn=566636 op=3 msgId=952 - ADD dn="piEntryID=random3,piPStoreOwner=$user,o=$domain,o=PiServerDb"

The fix

To fix the account and the problem in general, we ended up deleting the skeleton address book entries for the user in question and used iwcadmin to change Convergence to point to individual LDAP servers in a failover mode. Since we had two Convergence instances and two LDAP instances, it was easy to flip the preferred order so that the LDAP load would be well balanced.

Things that could be improved

1) Convergence could give a better error experience to the user instead of just hanging.  Perhaps timing out after 30 seconds with a message “There is a problem with initializing your address book, please ask your administrator to investigate”.

2) Convergence could use a single LDAP connection when performing address book creation for any given user

3) Sun’s Directory Server could have an assured replication model (this is available in the OpenDS 2.0 release candidates)

Sun Directory Server support tool – Dirtracer

Wednesday, April 29th, 2009

I just watched Lee Trujillo give a presentation and demo of Dirtracer, his cool tool for gathering support data on Sun’s DS. The data captured is very helpful for troubleshooting Sun DS problems in a variety of situations ranging from hangs to replication problems to performance problems. I’ve used it in the past, but the latest version looks even easier to use and captures more data. If you manage Sun’s Directory Server on Solaris, Linux, or HP-UX, pull down a copy and check it out.

Using IBM Quickr with Sun Directory Server

Thursday, April 2nd, 2009

A customer was testing out Lotus/IBM’s Quickr collaboration software and using Sun’s Directory Server as the user store.  One of the system admins mentioned that queries searching for people were glacially slow.  We investigated by checking out the access log to look for slow queries and saw that Quickr was running un-indexed queries that searched against cn,  givenName, and displayName.  These queries were taking about 30 seconds to run since the directory server had to do the DB equivalent of full-table scans.  We checked the indexes and saw that displayName wasn’t indexed.  After adding an index for the displayName attribute the queries were snappy, taking less than a second.
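For anyone wanting to repeat the access log check, here is a small sketch of one way to pull out the offenders. It assumes the usual Sun DS access log layout, where the RESULT line of an unindexed search is tagged notes=U and carries an etime= field; the function name unindexed_searches is just something I made up for the example.

```shell
# unindexed_searches
# Reads Sun DS access-log lines on stdin and prints the conn=, op= and etime=
# fields of every RESULT line flagged notes=U (assumed marker for an
# unindexed search).
unindexed_searches() {
  awk '/ RESULT / && /notes=U/ {
    conn = ""; op = ""; etime = ""
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^conn=/)  conn  = $i
      if ($i ~ /^op=/)    op    = $i
      if ($i ~ /^etime=/) etime = $i
    }
    print conn, op, etime
  }'
}
```

Running `unindexed_searches < logs/access` gives a list of connection and operation numbers that can then be matched back to their SRCH lines to see which filters need an index.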

Troubleshooting file descriptor problems in Sun Directory Server

Wednesday, April 1st, 2009

I have a customer that was encountering a problem where their test directory server (running Sun DS 5.2p4) was constantly running out of file descriptors.  They had bumped the allowed number of file descriptors up to 4096, and that slowed the occurrence of the error, but the  root cause had not been diagnosed yet.  We first took a look using netstat and saw:


netstat -an | grep ^$THEIR_IP.389 | grep -c ESTAB

4012

So we have confirmed the problem is as stated.  Often this problem is caused by applications that don’t use connection pools properly and open way too many connections.

Next we checked under cn=monitor to see which accounts were connected to the directory server:

/bin/ldapsearch -T -D cn=directory\ manager  -h ldap -b cn=monitor -s base objectclass=* connection | awk -F: '{ print $7 }' | sort | uniq  -c

2500  uid=application_xyz,ou=apps,dc=example,dc=com

1200  uid=application_foo,ou=apps,dc=example,dc=com

220  uid=application_shizzle,ou=apps,dc=example,dc=com

So it looks like applications xyz and foo are the primary culprits.

We’ll also count the established connections by IP address to tell which machines are creating the most connections:

netstat -an | nawk -v local="$LDAP_IP.389" '$1 == local && /ESTAB/ { print $2 }' | cut -d. -f1-4 | sort | uniq -c
2700   10.10.1.168
400    10.10.1.169
300    192.168.1.1
...

We now know that 10.10.1.168 is the machine with the most connections coming from it. We then hopped over to 10.10.1.168 (running an application server) and took a look from its point of view:

netstat -an | grep -c $LDAP_IP.389

2

Whoa! Houston, we have a problem. From the LDAP server’s point of view, it has 2700 connections from the app server. From the app server’s point of view, it has 2 connections to the LDAP server. If we had seen symmetry between the app server’s network connections and the directory server’s network connections, it would have been an application-level problem of allocating too many connections. In this case, since the connection counts are extremely asymmetrical, it looks like there is a firewall, load balancer, or other network device in the path between these two machines that is killing connections from the application server without telling the LDAP server that the connection is dead. We ask the network team to investigate and in the meantime put in a workaround of setting an idle timeout on the LDAP server. This lets the directory server kill any connection on which it hasn’t received an operation in some time period (we set it to a generous 12 hours), and we immediately see the number of established connections drop down to a few hundred. Problem solved.

Viewing the current status of LDAP servers in Directory Proxy Server 6.3

Friday, March 20th, 2009

The dpconf command for managing DSEE Directory Proxy Servers (DPS) shows you a lot of information about the ldap-data-sources (the back-end directory servers), including whether or not they are administratively enabled or disabled. One status that I couldn’t find was whether a given back-end server was actually considered on-line by the DPS itself. It turns out the current status information is available, but only by digging through the cn=monitor entry on the DPS instance. Bear in mind you will need to authenticate as the proxy’s root DN (default is “cn=proxy manager”) to dig it up. Also, it appears that the logic that implements cn=monitor doesn’t handle all search criteria perfectly, so we will use a little bit of grep magic to reduce the result set to what we want. Here is an example ldapsearch to get the current status of servers:

ldapsearch -D "cn=proxy manager" -j ~/.dmpass -b cn=monitor serveravailable=*  \
| egrep "^backendServer|^serverAvailable"

backendServer: testdscc01:3998/
serverAvailable: true
backendServer: testds05:389/
serverAvailable: true
backendServer: testds06:389/
serverAvailable: false
backendServer: testds07:389/
serverAvailable: true

In this case it would be a good idea to check testds06 and see if the server is down, or perhaps it is failing a DPS health check for some other reason.
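With more than a handful of back ends it gets tedious to eyeball the pairs, so here is a small sketch that reduces output in the backendServer/serverAvailable format shown above to just the servers marked unavailable. The function name unavailable_backends is made up for this example, and it assumes each backendServer line is followed by its own serverAvailable line.

```shell
# unavailable_backends
# Reads "backendServer: ..." / "serverAvailable: ..." pairs on stdin and
# prints the back-end servers whose serverAvailable value is false.
unavailable_backends() {
  awk -F': ' '
    $1 == "backendServer"                    { server = $2 }
    $1 == "serverAvailable" && $2 == "false" { print server }'
}
```

Piping the filtered ldapsearch output into `unavailable_backends` would print only testds06:389/ for the example above, which makes it easy to drop into a cron job or a monitoring check.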

If you want to dig a little deeper into cn=monitor, you can find a lot of detailed information about the thread that is monitoring a particular data source. Here is an example of one pointing to an LDAP server that is unavailable:

dn: cn=Proactive Monitor for testds06:389/,cn=Monitor Thread,cn=Resource,
 cn=testdps01:/opt/dsee/instances/dps,cn=Instance,cn=DPS6.0,cn=Product,cn=monitor
objectClass: top
objectClass: extensibleObject
cn: Proactive Monitor for testds06:389/
started: true
running: true
startTime: [03/19/2009:12:20:36 -0700]
operationalStatus: OK
statusDescription: The monitor thread is fully operational
threadId: 19
threadStack: java.lang.Thread.sleep(Native Method) /  com.sun.directory.proxy.server.ProactiveMonitorThread.runThread(ProactiveMonitorThread.java:122) /  com.sun.directory.proxy.util.DistributionThread.run(DistributionThread.java:225) /
backendServer: testds06:389/
serverAvailable: false
checkInterval: 30000
additionalCheckType: op connection
totalChecks: 594
availabilityChecksFailed: 2
additionalChecksFailed: 0

Command line completion in bash for DSEE and ZFS

Tuesday, March 17th, 2009

I’m working on an environment for a customer where we are using Directory Server Enterprise Edition (DSEE) and ZFS. On the DSEE side, my co-worker Mitch and I were inspired by Ludovic’s post a while back about setting up command line completion for dsconf and dpconf. One small item Mitch noticed was that in the original examples, if you had a command name that didn’t contain a hyphen (like dsconf import), it wouldn’t be completed (but a command like dsconf get-server-prop would be).

Here is what Mitch came up with:

for cmd in dsconf dsadm dpconf dpadm; do
  complete -W "`$cmd --help | \
    perl -lane 'print $F[0] if \
      (/^The accepted values for SUBCMD/ .. \
       /^The accepted values for GLOBAL_OPTS/ \
       and not /^The /)'`" $cmd
done

For ZFS, check out this script on Big Admin by Mark Musante.
Mitch made a small update to the script so that it builds the list of sub-commands on the fly to account for additions. Mitch’s updated version is available here.

Creating an LDAP environment to test a tool

Thursday, March 5th, 2009

Yesterday I spent some time helping a developer who is creating a tool for synchronizing accounts between an RDBMS and an LDAP server and thought I would document the process. The tool basically makes a request to the RDBMS for all the accounts sorted by a specific attribute, then makes a similar request to the LDAP server. The customer expected the number of records to max out at about 200,000 entries.

The first thing we did was spin up local copies of MySQL and the LDAP server.  I’m not going to document the MySQL part since there are a million pages available on that.

Note that the DSEE 6.3 binaries were already installed on my test machine under /opt/dsee63.  I personally prefer the zip based distribution.

Here are the steps for the LDAP server:

Step 1 – create a new instance and add a suffix for the data

# export PATH=$PATH:/opt/dsee63/ds6/bin

# dsadm create -w /tmp/dspassword /data/ds3

# dsadm start /data/ds3

# dsconf create-suffix dc=example,dc=com

Step 2 – create a sample LDIF with 200k entries

# cd /opt/dsee63/dsrk6/bin/example_files

# cp example.template 200k.template

# vi 200k.template (change numusers value to be 200000 and added employeeNumber as a sequentially valued attribute)

# ../makeldif -t 200k.template -o 200k.ldif

Step 3 – import the sample data

# dsadm stop /data/ds3

# dsadm import -i /data/ds3 /opt/dsee63/dsrk6/bin/example_files/200k.ldif

# dsadm start /data/ds3

Step 4 – create an account with proper settings

We created an account uid=dbsync,ou=admins,dc=example,dc=com that will be used by the application to perform the search and updates.

Note that we had to adjust 2 attributes on the dbsync account. We added the following operational attributes/values:

nsSizeLimit: -1

nsLookThroughLimit: -1

We also added an ACI to the ou=people,dc=example,dc=com branch giving the dbsync user  full permissions.

aci: (targetattr != "aci")(version 3.0; acl "db sync - full permissions"; allow (all)(userdn = "ldap:///uid=dbsync,ou=admins,dc=example,dc=com");)

The tool was now able to pull back all 200,000 entries, but was not able to make a server-side sort request.

To enable server-side sorting we had to create a VLV index.

Step 5 – VLV index creation

We used the following LDIF to create a VLV index sorting on employeenumber

dn: cn=people_browsing_index,cn=example,cn=ldbm database,cn=plugins,cn=config
objectClass: top
objectClass: vlvSearch
cn: Browsing ou=People
vlvBase: ou=People,dc=example,dc=com
vlvScope: 1
vlvFilter: (objectclass=inetOrgPerson)
aci: (targetattr="*")(version 3.0; acl "VLV for Anonymous";
 allow (read,search,compare) userdn="ldap:///all";)

dn: cn=Sort employeenumber,cn=people_browsing_index,
 cn=example,cn=ldbm database,cn=plugins,cn=config
objectClass: top
objectClass: vlvIndex
cn: Sort employeenumber
vlvSort: employeenumber

We then had to use dsadm to create the index

# dsadm stop /data/ds3

# dsadm reindex -l -t "Sort employeeNumber" /data/ds3 dc=example,dc=com

# dsadm start  /data/ds3

After these changes the tool was able to query all 200,000 entries and have the server return them as a sorted list.

We also ended up doing 2 small performance tweaks to the server, but these weren’t strictly required:

dsconf set-server-prop db-env-path:/tmp/ds_cache

dsconf set-server-prop db-batched-transaction-count:5

dsadm restart /data/ds3

New LDAP vendor – Unbound ID

Wednesday, January 21st, 2009

I saw that Unbound ID’s website is now live. There isn’t much data available except for the management team, which is a collection of ex-Sun big brains. I’ve interacted with a few of those folks in the past on mailing lists and a couple of phone calls. I’m looking forward to hearing more details of their solution set in the future. I think they have a good pulse on customer needs, a strong sense of practicality, and some amazing engineering talent.

Sun Directory Server – Replication over WAN

Wednesday, November 19th, 2008

Yesterday we had to modify a huge number of entries in our directory server environment. The updates were all done in one data center, and they went extremely fast. When I later went to check on the replication, I noticed the data was being replicated to the remote data center much more slowly than I expected. Given that the other data center is a pretty decent WAN hop away, I decided to try changing some of the replication agreement parameters. To do this you use:

dsconf set-repl-agmt-prop $suffix $consumer_host:$consumer_port $property:$value

You can see more information on the properties and suggested values at the Replication Over a WAN page of the DSEE Admin Guide.

In our case, I did some quick experimenting and found the values suggested for WANs seemed to work pretty well and gave us about a 3x-4x boost in performance versus the defaults. The changes take effect immediately; there was no need to restart the servers or replication agreements.

To measure how fast replication was going I would go to the remote server and run something like

grep 2008:10:23 logs/access | grep -c MOD

where 10:23 was the previous minute, to count how many MOD operations had come through in one minute.
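To avoid re-running that grep by hand for every minute, the counting can be done in one pass. This is only a sketch: it assumes access log lines that start with a bracketed timestamp like [19/Nov/2008:10:23:45 -0500] and contain the operation name as a separate word, and mods_per_minute is a name I made up for the example.

```shell
# mods_per_minute
# Reads access-log lines on stdin and prints "dd/Mon/yyyy:HH:MM count" for
# each minute in which at least one MOD operation was logged.
mods_per_minute() {
  awk '/ MOD / {
    # $1 looks like "[19/Nov/2008:10:23:45"; keep the 17 characters after "["
    minute = substr($1, 2, 17)
    count[minute]++
  }
  END { for (m in count) print m, count[m] }' | sort
}
```

Running `mods_per_minute < logs/access` on the remote server before and after a parameter change gives a quick side-by-side view of the replication rate. Note that, unlike a plain grep for MOD, this pattern does not count MODRDN operations.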


Copyright © 2010 williamhathaway.com. All Rights Reserved.