By martin on Jun 23, 2009
We put a lot of thought into planning for failure when we set up our sites (like www.sun.com, blogs.sun.com and so on). Every component is redundant, from border firewalls to load balancers to front-end web servers to root disks. We even put the gear in separate racks on separate power, just in case someone accidentally knocks both power cables out. The racks are arranged in an odd side and an even side, and each server is placed on the corresponding side, i.e. blogs1.sun.com goes on the odd side and blogs2.sun.com goes on the even side. If we use more than two servers, each additional one is added to its respective side.
But the chain is only as strong as its weakest link: if I screw up when I update the puppet profile for our base server class, things will quickly go south.
No matter how carefully I test things before I commit my changes to the master mercurial repository and on to the puppetmaster (we only ran one per site before), there is still a chance things go boink! There are always some servers that were set up a few years ago, long before we started using puppet, and that aren't installed and configured the way I expect. When puppet modifies them, they break!
So it doesn't matter that we are running multiple systems: with a single puppetmaster, they all get changed by puppet within 30 minutes, and a bad change reaches both sides at once.
To work around this problem I've set up two puppetmasters, each serving its corresponding side (odd or even). This lets me push a change to one side first and let it stew for a while before I push it to the other side.
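As a rough illustration of the odd/even split, here is a minimal Python sketch that works out which side a server belongs to from the number at the end of its short hostname, and which puppetmaster would serve it. It assumes the side always follows that number (as with blogs1 and blogs2 above), and the puppetmaster hostnames are made up for the example; this is not how our setup actually assigns them.

```python
import re

# Hypothetical puppetmaster hostnames, one per side (not taken from the real setup).
PUPPETMASTERS = {
    "odd": "puppetmaster-odd.example.com",
    "even": "puppetmaster-even.example.com",
}


def side_for(hostname: str) -> str:
    """Return 'odd' or 'even' from the number ending the short hostname.

    Assumption: the side is determined by the trailing number,
    e.g. blogs1 -> odd, blogs2 -> even.
    """
    short = hostname.split(".", 1)[0]
    match = re.search(r"(\d+)$", short)
    if match is None:
        raise ValueError(f"no trailing number in {hostname!r}")
    return "odd" if int(match.group(1)) % 2 else "even"


def puppetmaster_for(hostname: str) -> str:
    """Return the puppetmaster that serves this host's side."""
    return PUPPETMASTERS[side_for(hostname)]


if __name__ == "__main__":
    for host in ("blogs1.sun.com", "blogs2.sun.com", "blogs3.sun.com"):
        print(host, side_for(host), puppetmaster_for(host))
```

With a mapping like this, a change goes to the odd puppetmaster first, and the even side only gets it once the odd side has run cleanly for a while.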