Archive for the ‘technology’ Category

Thumbs up for “Release It!”

Thursday, January 21st, 2010


I borrowed “Release It!” from Bill Kratzer a few weeks ago and have really enjoyed reading it.  My favorite quote from the book is “Feature complete does not mean production ready”.  I think this sums up a lot of large software projects, especially when there is a disconnect between the development team and the group responsible for deployment and operations.

The book covers 4 main topics:

  • Stability
  • Capacity
  • General Design Issues
  • Operations

In most of the sections the author breaks his advice down into an introduction (with a real example showing a problem), a set of anti-patterns that encourage the problem, and a set of patterns that help software cope with the various stresses placed on it and make it manageable.

The book stays at a relatively high level of discussion and is easy to follow.  If you are looking for lots of low-level coding examples you will be disappointed, but I think the book offers good advice that can be consumed by a wide range of people ranging from developers, to system administrators, and to project managers.

Last year I was involved in a project that struggled with a lot of the issues mentioned in this book and I think that hundreds of thousands of dollars and countless hours of stress and frustration could have been saved if this book had been required reading at the start of the project.

I recommend this book to anyone involved in developing or operating software services, or managing the people that do.

ZFS presentation

Thursday, January 14th, 2010

Photo by John @ ThinkHole.com

On Tuesday night I gave a presentation on ZFS to the Central PA Linux User Group. Since the audience was a Linux user group, I wasn’t expecting too many in the crowd to be familiar with ZFS, but I was pleasantly surprised that about 40% of the ~20 people in attendance had used ZFS in some capacity. If you are already a seasoned ZFS user, I would highly recommend Richard Elling’s ZFS presentation, which he uses in his day-long tutorials.

Prepping new hires

Monday, January 4th, 2010

Alex, the son of a friend of mine, has recently accepted a software development job at a financial trading company.  He is starting his final semester of school, so he won’t begin working at the company for another 4-5 months.  When I was over visiting at their house last week, Alex showed me a small stack of books that the company had sent him.  The books covered a mix of technical and business topics that would help him build up an understanding of the software tools, development philosophies, and business concepts specific to the organization, so that when he arrives at work he will be productive much more quickly.

I think this is a fantastic investment by the company, and should be considered by organizations hiring for any but senior positions.  You obviously don’t want to overwhelm new hires with an onslaught of 10,000 pages of recommended reading, but having a small package from Amazon show up at their door containing a few books most appropriate to their position and your culture is a great way to help new hires get up to speed, even before they hit the door.

Installing Puppet on OpenSolaris

Saturday, December 26th, 2009

While looking at the Reductive Labs’ Puppet on Solaris page I saw there was a repository which hosts Puppet in a pkg format.  This makes installing a Puppet server on OpenSolaris pretty easy.

pkg set-publisher -O http://pkg.codenursery.com/ codenursery.com
pkg install puppet

groupadd puppet
useradd -g puppet puppet
mkdir /etc/puppet /var/puppet

/usr/ruby/1.8/sbin/puppetd  --genconfig > /etc/puppet/puppet.conf

svcadm enable puppet/master

2009 LISA Conference

Sunday, November 8th, 2009

I spent last week at the LISA Conference in Baltimore, MD.  If you aren’t familiar with LISA, it is a conference focused on system administration.  This is the 4th LISA I’ve attended in the last 12 years.

On Monday I attended a tutorial by Richard Elling on ZFS: A Filesystem for Modern Hardware.

On Tuesday I attended two tutorials.  The first was Jacob Farmer’s Disk-to-Disk Backup and Eliminating Backup System Bottlenecks.  The second was Tom Limoncelli’s Design Patterns for System Administrators.

Unfortunately on both Monday and Tuesday I had to spend a significant amount of time on conference calls helping to troubleshoot some work related issues, but the time I spent in all 3 sessions and viewing their materials was helpful.  I would definitely recommend attending tutorials by any of the 3 people above if they are teaching a topic of interest to you.

On Tuesday night I attended some (Open)Solaris birds-of-a-feather sessions.  There were a few times that people in the crowd were being belligerent towards a speaker (mostly complaining about the difficulty of finding information of various types), even though the speaker certainly had no sway over what the person in the crowd was upset about.  I don’t care how much money your company spends with a vendor; there is never a reason to be rude.  David Miner gave a talk about what’s coming in Solaris.next and Ben Rockwood gave an entertaining and informative presentation on ZFS in the Trenches.

I was lucky enough to get a chance to talk with David Miner over a quick lunch later in the week and talk about the new opportunities and challenges with the OpenSolaris installation technologies.

On Wednesday through Friday I attended a mix of presentations, met with a bunch of vendors, and also sat in some of the ‘Guru is in’ sessions and talked with a number of conference attendees.  The highlights for me were:

  • Werner Vogels (CTO of Amazon) gave a fascinating talk on the history of Amazon’s IT philosophy and infrastructure and how they evolved from a humble internal IT shop to running a business that is the dominant cloud computing provider.
  • Elizabeth Zwicky’s talk on “Searching for Truth, or at Least Data: How to Be an Empiricist Skeptic”
  • Bryan Cantrill’s talk on “Visualizing DTrace: Sun Storage 7000 Analytics”
  • Talking with the folks from Splunk (awesome log searching analysis tool)

Creating a slow operation log for OpenDS

Monday, November 2nd, 2009

Anyone who has spent much time looking at MySQL performance will be familiar with the ‘slow query log’.  This is a log that records any query that took longer than some threshold amount of time.  For kicks, I tried implementing a similar hook for OpenDS.  My current version is in pretty rough shape (not very efficient or configurable), but seems to work.  I started from a copy of the TextAccessLogPublisher.java file and created a new one called TextSlowAccessLogPublisher.java.  My logic is basically:

  • create a hash table
  • empty out all the log XYZIntermediateMessage and connect/disconnect methods
  • when a request comes in, store the text to log in the hash table (keyed off connectionID and opNumber) instead of outputting it (changed the logSearchRequest, logModifyRequest, … methods)
  • when a request is finished processing, we check the elapsed time (etime)
    • if the elapsed time is greater than or equal to our threshold
      • print the request info we stashed in the hash table and delete it
      • print the response info
    • if the elapsed time is less than our threshold
      • delete the request info from the hash table, don’t print anything
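The logic above can be sketched in a few lines. This is a minimal Python illustration rather than the actual Java publisher: the class name, method names, and log text format are all made up for the example, and the threshold is given in milliseconds as an assumption.

```python
import io


class SlowOperationLog:
    """Buffer request text per operation; emit it only if the operation was slow."""

    def __init__(self, threshold_ms, out):
        self.threshold_ms = threshold_ms
        self.out = out
        # (connection_id, op_number) -> request text waiting for its response
        self.pending = {}

    def log_request(self, connection_id, op_number, text):
        # Stash the request instead of writing it out immediately.
        self.pending[(connection_id, op_number)] = text

    def log_response(self, connection_id, op_number, text, etime_ms):
        # Always remove the stashed request so the table does not grow forever.
        request = self.pending.pop((connection_id, op_number), None)
        if etime_ms >= self.threshold_ms:
            if request is not None:
                self.out.write(request + "\n")
            self.out.write(text + "\n")


# Hypothetical log lines: one fast search (dropped) and one slow search (printed).
out = io.StringIO()
log = SlowOperationLog(threshold_ms=100, out=out)
log.log_request(1, 1, "SEARCH REQ conn=1 op=1")
log.log_response(1, 1, "SEARCH RES conn=1 op=1 etime=5", etime_ms=5)
log.log_request(1, 2, "SEARCH REQ conn=1 op=2")
log.log_response(1, 2, "SEARCH RES conn=1 op=2 etime=250", etime_ms=250)
print(out.getvalue(), end="")
```

Only the second operation reaches the output, and the table is empty again once both responses have been handled.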

There are a few more things I want to do:

  • Make the ‘slow operation threshold time’ dynamically changeable (looks like I will need to mess with configuration objects since I want to add an additional parameter not in the standard access log type)
  • Add extra information to the output format such as authorization DN (and potentially client connection info if not too hard to retrieve)
  • Instead of doing all the text formatting for every request, just put the Operation object into the hash table.  Since the majority of operations won’t ever get printed, we shouldn’t burn CPU formatting them.  The operations would only be formatted to text if they end up being slow and printed.
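The last item, deferring the formatting, might look roughly like the following. Again this is a Python sketch under assumptions, not OpenDS code: the `Operation` fields, the output format, and the 100 ms threshold are invented for illustration, and the `format_calls` counter exists only to show how often the formatter actually runs.

```python
from dataclasses import dataclass

THRESHOLD_MS = 100  # assumed threshold, for illustration only


@dataclass
class Operation:
    # Hypothetical stand-in for the server's operation object.
    connection_id: int
    op_number: int
    op_type: str
    base_dn: str


format_calls = 0  # counts how often we pay the formatting cost
pending = {}      # (connection_id, op_number) -> Operation


def format_operation(op):
    global format_calls
    format_calls += 1
    return f'{op.op_type} conn={op.connection_id} op={op.op_number} base="{op.base_dn}"'


def on_request(op):
    # Store the object itself; no string formatting happens here.
    pending[(op.connection_id, op.op_number)] = op


def on_response(connection_id, op_number, etime_ms):
    op = pending.pop((connection_id, op_number), None)
    if op is None or etime_ms < THRESHOLD_MS:
        return None  # fast operation: dropped without ever being formatted
    return f"{format_operation(op)} etime={etime_ms}"


on_request(Operation(7, 1, "SEARCH", "dc=example,dc=com"))
on_request(Operation(7, 2, "SEARCH", "dc=example,dc=com"))
fast = on_response(7, 1, etime_ms=3)
slow = on_response(7, 2, etime_ms=420)
print(fast, slow, format_calls)
```

Two operations were received, but the formatter ran only once, for the slow one.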

Central PA Open Source Conference

Saturday, October 17th, 2009

I had a good time attending the CPOSC event  today, which was held at Harrisburg University.  Got to see lots of old friends and acquaintances and enjoyed the speakers.  At only $35 (which included a t-shirt and food), it was an awesome bargain.  Thanks to John and Eric for doing such a great job organizing the event and all the presenters for sharing their knowledge.

Testing – please ignore

Sunday, September 6th, 2009

Just wanting Google to start indexing my wiki.

Sun Messaging Server login hang

Sunday, June 28th, 2009

30 second summary for those that don’t want to read the troubleshooting details

When using replicated LDAP servers in a Sun Communications Suite deployment, it is important that every connection from a given Convergence (webmail component) instance go to the same LDAP server; otherwise address book creation can partially fail, causing some user logins to webmail to hang.  To fix this, use one of the following techniques:

1) Configure Convergence’s application level failover to point to individual LDAP servers (be sure to switch the host order on alternating Convergence instances to spread the load)

/opt/sun/comms/iwc/sbin/iwcadmin -u admin -W pwdfile -o ugldap.host -v ldap1:$port,ldap2:$port

(you will also need to restart the web container for this to take effect)

2) Use Directory Proxy Server to route writes to a preferred master

3) If pointing at a HW load balancer virtual IP, use a distribution algorithm that has backend server persistence based on originating IP.  Note that with a few machines this might not actually balance out well, so verify you aren’t overloading one LDAP instance.
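To see why source-IP persistence can balance poorly with only a few clients, here is a small Python sketch. It assumes a made-up distribution rule (the client address as an integer, modulo the number of backends) and hypothetical client addresses; real load balancers use their own algorithms, so treat this purely as an illustration of the effect.

```python
import ipaddress
from collections import Counter

BACKENDS = ["ldap1", "ldap2"]


def pick_backend(client_ip, backends=BACKENDS):
    # Assumed rule: the same client IP always maps to the same backend.
    return backends[int(ipaddress.ip_address(client_ip)) % len(backends)]


# Two hypothetical Convergence hosts.
clients = ["10.0.0.11", "10.0.0.13"]
assignments = {ip: pick_backend(ip) for ip in clients}

load = Counter({name: 0 for name in BACKENDS})
load.update(assignments.values())

print(assignments)
print(dict(load))
```

Each client sticks to one server, which is what address book creation needs, but in this example both clients land on ldap2 and ldap1 sits idle.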

Background

A customer of mine is deploying Sun’s Communications Suite (aka Messaging, Calendar, and IM servers) and was testing their custom provisioning tool.  A few accounts had been created that worked fine, but one of the accounts would just hang when trying to log in to webmail.  The screen would show the application initialization progress bar stuck at 84% and indicate it was dealing with the address book.

I verified that the account would hang and then took a look at the account’s main LDAP entry, which looked fine.  I then checked the account’s LDAP address book data.  The bad account had 3 LDAP entries in the address book branch, but a good account should have 4. Taking a look at the iwc.log from Convergence, I could see an error:

ADDRESS_BOOK: ERROR from com.sun.comms.client.ab.wabp.WABPEngineServlet  Thread httpSSLWorkerThread-80-4 at 2009-06-27 14:33:05,586 – pstore object couldn’t be created for user :baduser

At this site we have a pair of replicated LDAP servers behind a pair of load balancers that are used by the messaging components. Sun’s Directory Server has a loose replication model that usually works fine, but you can run into a rare race condition when applications are adding inter-related entries to different masters in a rapid fire succession.  Convergence was initially pointing at the load balancer virtual IP to reach the LDAP servers.

When I checked the logs on the LDAP servers, I could see that Convergence had tried to create address book entries when the user first logged in, but had done so over several different LDAP connections, which the load balancer sent to different LDAP servers. It created a parent entry on ldap1, then in a connection to ldap2 tried to create a dependent entry, which failed.  Convergence then created another version of the parent entry on ldap2 (which worked, but caused a replication conflict).  Later attempts to log in ended up adding some dependent entries, but the address book was still in an unusable state.

When things are working correctly, you will see an LDAP operation pattern that looks like:

[27/Jun/2009:16:21:39 -0400] conn=566625 op=5 msgId=948 - ADD dn="piPStoreOwner=$user,o=$domain,o=PiServerDb"
[27/Jun/2009:16:21:39 -0400] conn=566636 op=1 msgId=950 - ADD dn="piEntryID=random1,piPStoreOwner=$user,o=$domain,o=PiServerDb"
[27/Jun/2009:16:21:39 -0400] conn=566636 op=2 msgId=951 - ADD dn="piEntryID=random2,piPStoreOwner=$user,o=$domain,o=PiServerDb"
[27/Jun/2009:16:21:39 -0400] conn=566636 op=3 msgId=952 - ADD dn="piEntryID=random3,piPStoreOwner=$user,o=$domain,o=PiServerDb"
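Checking for this by hand across two servers gets tedious, so a small script can help. The Python sketch below assumes you have collected the access log lines from each server separately, and that the lines use plain ASCII quotes and hyphens as in a raw log file. The server names, users, and sample lines are hypothetical. It reports any address book owner whose ADD operations under o=PiServerDb were spread across more than one server.

```python
import re
from collections import defaultdict

ADD_RE = re.compile(r'ADD dn="([^"]+)"')
OWNER_RE = re.compile(r"piPStoreOwner=([^,]+)")


def servers_per_owner(logs):
    """logs maps a server name to an iterable of its access log lines."""
    seen = defaultdict(set)
    for server, lines in logs.items():
        for line in lines:
            add = ADD_RE.search(line)
            if not add or not add.group(1).endswith("o=PiServerDb"):
                continue  # not an address book ADD
            owner = OWNER_RE.search(add.group(1))
            if owner:
                seen[owner.group(1)].add(server)
    return seen


def split_owners(logs):
    # Owners whose address book entries were created on more than one server.
    return sorted(o for o, servers in servers_per_owner(logs).items() if len(servers) > 1)


# Hypothetical sample data in the same shape as the access log lines above.
logs = {
    "ldap1": [
        '[27/Jun/2009:16:21:39 -0400] conn=1 op=5 msgId=948 - ADD dn="piPStoreOwner=gooduser,o=example.com,o=PiServerDb"',
        '[27/Jun/2009:16:21:39 -0400] conn=2 op=1 msgId=950 - ADD dn="piEntryID=e1,piPStoreOwner=gooduser,o=example.com,o=PiServerDb"',
        '[27/Jun/2009:14:33:05 -0400] conn=3 op=5 msgId=960 - ADD dn="piPStoreOwner=baduser,o=example.com,o=PiServerDb"',
    ],
    "ldap2": [
        '[27/Jun/2009:14:33:05 -0400] conn=9 op=1 msgId=961 - ADD dn="piEntryID=e2,piPStoreOwner=baduser,o=example.com,o=PiServerDb"',
    ],
}
print(split_owners(logs))
```

Note that different connection numbers alone are not a problem (the healthy pattern above also uses two connections); what matters is whether those connections ended up on the same server.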

The fix

In order to fix the account and the problem in general, we ended up deleting the skeleton address book entries for the user in question and used iwcadmin to change Convergence to point to individual LDAP servers in a failover mode. Since we had two Convergence instances and two LDAP instances, it was easy to flip the preferred order so that LDAP load will be well-balanced.
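Flipping the preferred order per instance is just a rotation of the host list. As a rough Python sketch (the function name is made up, and port 389 is an assumption standing in for $port), the value passed to ugldap.host for each Convergence instance could be generated like this:

```python
def ugldap_host_value(ldap_hosts, port, instance_index):
    """Build a comma-separated host:port list, rotated per Convergence instance."""
    shift = instance_index % len(ldap_hosts)
    rotated = ldap_hosts[shift:] + ldap_hosts[:shift]
    return ",".join(f"{host}:{port}" for host in rotated)


# Port 389 is assumed here; substitute whatever $port is at your site.
values = [ugldap_host_value(["ldap1", "ldap2"], 389, i) for i in range(2)]
for index, value in enumerate(values):
    print(index, value)
```

The first instance prefers ldap1 and fails over to ldap2, and the second instance does the reverse.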

Things that could be improved

1) Convergence could give a better error experience to the user instead of just hanging.  Perhaps timing out after 30 seconds with a message “There is a problem with initializing your address book, please ask your administrator to investigate”.

2) Convergence could use a single LDAP connection when performing address book creation for any given user

3) Sun’s Directory Server could have an assured replication model (this is available in the OpenDS 2.0 release candidates)

OpenSolaris automated installs

Sunday, June 21st, 2009

I took a test drive of the OpenSolaris automated installer (AI) utility today.  This is the replacement for the venerable Solaris jumpstart technology and is the only way to install OpenSolaris in a hands-off approach.  Based on my 2 hours or so of perusing the documentation and working with it, I think it is still a work in progress (e.g. I didn’t see any way of having jumpstart-like custom finish scripts).

The first thing I did was read through the automated installer docs.  There isn’t a lot there yet, so it is a quick read, but it will help you get the basics.  Another good place to look for information is the OpenSolaris forum for installer technology (aka project Caiman).

There appear to be several components involved, at least for x86 based clients.  I haven’t yet tried SPARC so am not sure how it differs.

1) DHCP server – to hand out an address and the PXE boot parameters to a client

2) TFTP server – to serve the PXE boot image

3) Install server – an Apache instance that hands back the XML configuration files and the mini root.  In my case it was running on port 5555.

4) Package repository – for fetching the actual packages to install. By default it is pkg.opensolaris.org/release, but you could change it to a different repository (including a mirror hosted locally if you had one).

Note that there is no NFS service needed, which should make firewall admins very happy.

The lab

I built a lab environment consisting of two virtual machines inside VMWare on my desktop.

To keep things simple, the first VM was called “server”, and the second “client”.  The purpose of my lab environment was to configure the AI environment on the server machine and complete a hands-off install on the client machine.

The VMs were configured as follows:

Server

  • RAM – 800M
  • Disk – 16GB
  • NIC1 (e1000g0) – bridged to public network
  • NIC2 (e1000g1) – host-internal network

I also went into the VMWare networking tool and disabled VMWare’s built-in DHCP server on the host-internal network to ensure that my server would be handing out any DHCP responses.

Client

  • RAM 800M
  • NIC1 (e1000g0) – host-internal network
  • Disk – 8GB.  Note: when I first tried an 8GB disk, AI complained that it couldn’t find any suitable disks because it wanted at least a 12.5GB disk.  You can work around this by explicitly specifying which disk you want to install on, in which case the default minimum size limit won’t be triggered.

OpenSolaris Auto Installer Lab

Setting up the server

1) Installed OpenSolaris 2009.06

2) Installed the automated install software

I saw from the docs that I needed the installadm utility, which wasn’t on my system.  I wasn’t sure which package this was from, so I ran:

pkg search installadm

This told me I wanted the SUNWinstalladm-tools package.  I installed that using:

pfexec pkg install SUNWinstalladm-tools

3) Download the automated install ISO image

The installer needs an architecture (x86 or SPARC) specific ISO image for each type of client that will be supported.  Since I was going to install on x86, I downloaded the appropriate image from genunix: http://genunix.org/distributions/indiana/osol-0906-ai-x86.iso

4) Create an install environment under /auto-install named ai-x64 on the 192.168.72.0 network starting at .10 and using 5 addresses

pfexec installadm create-service -n ai-x64 -i 192.168.72.10 -c 5 -s /export/home/wdh/Downloads/osol-0906-ai-x86.iso /auto-install
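As a quick sanity check on what the -i and -c options cover, the Python sketch below lists the addresses for a starting address of 192.168.72.10 and a count of 5. The /24 netmask is my assumption for the lab network, and the helper name is made up.

```python
import ipaddress


def dhcp_range(start, count):
    """Return `count` consecutive addresses beginning at `start`."""
    first = ipaddress.ip_address(start)
    return [str(first + offset) for offset in range(count)]


addresses = dhcp_range("192.168.72.10", 5)
# Assumed netmask: the lab network is treated as 192.168.72.0/24.
network = ipaddress.ip_network("192.168.72.0/24")
all_in_network = all(ipaddress.ip_address(a) in network for a in addresses)
print(addresses, all_in_network)
```

So the service hands out 192.168.72.10 through 192.168.72.14, which is plenty for a single test client.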

5) Configure dhcpd to run on the appropriate interface

The installadm command configured dhcpd, but it was running on the e1000g0 interface by default.  For my environment, I needed to switch that to e1000g1 so it would see the requests from the client VM.

pfexec dhcpconfig -P INTERFACES=e1000g1
svcadm disable dhcp-server
svcadm enable dhcp-server

6) Install squid

We will need squid (or some other proxy) since we aren’t running a local repository server and the client machine will need to be able to fetch packages from pkg.opensolaris.org.  We will tell the client machine to use the proxy on the server.

pkg search squid

This showed that the package I was looking for is SUNWsquid.

pfexec pkg install SUNWsquid

svcadm enable http:squid

I was pleasantly surprised how easy that was.  If you are on a non-NATed network, you will likely need to edit the squid configuration file to allow access to your clients.

7) Customize the default AI manifest (I’ll call mine ai_proxy.xml)

cd /auto-install/auto_install

make a copy of the default manifest and name it something more specific

pfexec cp default.xml ai_proxy.xml

added <ai_target_device><target_device_name>c8t0d0</target_device_name> </ai_target_device>

so I could use a disk that was smaller than the auto-installer default

added <ai_http_proxy url="http://192.168.72.2:3128"/> so it would use the proxy and be able to reach the internet

changed the ai_auto_reboot setting to true, and changed the default user and password from jack to my normal values.

ran installadm to let the AI service know it should use the custom version of the file

pfexec /usr/sbin/installadm add -m ai_proxy.xml -n ai-x64

8) Register the target system as a client

Started the client virtual machine and retrieved the MAC address (00:0c:29:b6:43:bf)

On the server use installadm to register the client

pfexec installadm create-client -e 00:0c:29:b6:43:bf -t /auto-install -n ai-x64

9) Started the client system in network boot mode

The install succeeded, but it took about 1.5 hours.  I suspect if I had a local repository and was installing on a non-emulated hard disk it would have gone substantially faster.

Overall thoughts

I was happy that it was relatively straightforward to get working, but I think it will be a while before the system has as much flexibility for customizing installs as Jumpstart.  Based on all the traffic I see on the forum, it seems like the AI project has a lot of momentum behind it, so I am looking forward to giving it another spin in a few months.  I’d also like to try this with a local mirror of the pkg repository and see how quickly the installer will run.

Update on June 24th

I saw this morning that a functional spec for the AI client has been submitted and the project team is asking for comments.  Please read the thread/document and give any feedback you might have.


Copyright © 2010 williamhathaway.com. All Rights Reserved.