As well as doing all manner of stuff relating to Security, I occasionally get to do a bunch of Networking work; as with security, I like to do the whole piece, from requirements capture through design to implementation. I first encountered Layer 4-7 load balancing on Alteon ACEDirector 3s back in the days of Web OS 5.2 - which I figure must have been about 2001 - and found it really rather cool, especially when it came to deciding which load balancing algorithm to use based on the connection and state model of the protocol being balanced.
Later on, when Global Server Load Balancing (GSLB) was introduced, I initially wondered what the reasoning was behind it, then figured that it had all sorts of shortcomings which added up to a verdict of "avoid at all costs", then eventually realised that there are one or two cases where it's actually The Right Thing to do. I occasionally hear from various customers who are part-way down the path I've taken, so I figured I'd get some thoughts down on what GSLB is, when you should use it, and when there are better ways to solve the problem.
Why Customers Want "GSLB"
Back in the dotcom days, we'd get folk coming to us with words to the effect of "here's my $10 million in venture capital, build me a resilient service infrastructure to present my creation to the world at large".
We delivered on their requests, many times, including disaster recovery (DR) environments for those whose industry requirements stipulated them, or those who simply wanted them.
Then, when folks' money started getting tight, their requirements shifted a bit. They came back to us, saying "I've got this DR site, sitting there soaking up power and rack rental space, and while my main site is running normally, this other site is not actually doing me any good. How can I put the kit there to work, delivering my creation to the world such that it can still pick up as the DR site if the main one goes down? Oh, and if this site is in another country from the main one, how do I do this without having to put a huge expensive link between the two?"
Thus was GSLB born; the first time I came across it was in Alteon Web OS 8, probably around 2002-3. The following year, pretty much all the load balancer manufacturers had an implementation.
Taxonomy of GSLB
Most GSLB implementations work by exploiting the fact that clients tend to reach a service by resolving its name through DNS, rather than by referring explicitly to IP addresses. Switches performing GSLB across the various datacentres are set up as the domain's primary external DNS servers, with the DNS service itself typically backed off to a pair of load-balanced DNS "slaves" hidden from external view. The switches handle DNS requests from clients, and use round-trip time (RTT) measurements to determine whether a given client (or, strictly, that client's DNS server) is closer to one GSLB-equipped datacentre than another.
If a load-balanced service fails at one datacentre, where the virtual IP address (vip) associated with the service is presented via the external DNS, the GSLB system in the switches answers DNS queries for that service with the other site's vip instead.
Also - and this is the main reason why folk wanted GSLB - if multiple sites are up, the GSLB switches can compare their RTTs to the client's DNS server, such that the site with the lowest RTT is the one to which the service request is pointed. Thus, we get GSLB as "active-active bandwidth-weighted load balancing by DNS", which fulfils the customer requirements of "making the DR site do useful work".
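To make the two behaviours above concrete, here's a minimal sketch of the decision a GSLB switch makes when a DNS query arrives: discard any site whose service has failed its health check, then answer with the vip of the remaining site that has the lowest RTT to the client's DNS server. This is an illustration only, not any vendor's implementation; the Site record, the resolve function, the site names and the addresses (taken from the ranges reserved for documentation) are all my own assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Site:
    """Hypothetical record of what a GSLB switch knows about one datacentre."""
    name: str
    vip: str        # virtual IP address advertised for the service at this site
    healthy: bool   # outcome of the service health check at this site
    rtt_ms: float   # measured RTT from this site to the client's DNS server


def resolve(sites: List[Site]) -> Optional[str]:
    """Return the vip of the healthy site with the lowest RTT, or None if all are down."""
    candidates = [site for site in sites if site.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda site: site.rtt_ms).vip


# Both sites up: the one nearer (by RTT) to the client's DNS server wins.
sites = [
    Site("main", "192.0.2.10", healthy=True, rtt_ms=80.0),
    Site("dr", "198.51.100.10", healthy=True, rtt_ms=25.0),
]
print(resolve(sites))

# The DR site's service fails: answers fall back to the main site's vip.
sites[1].healthy = False
print(resolve(sites))
```

Note that the only input to the choice is the RTT to the DNS server that asked; the actual client never appears.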
Implicit Assumptions of GSLB
Perhaps the most significant implicit assumption in GSLB - which Alteon later claimed to have figured out a workaround for, although I never quite sorted out the nature of this workaround in my head - is that the client's DNS server (which contacts the GSLB-presented DNS address) is located logically close to the actual client trying to access the service being presented. By "logically close to", I mean "sufficiently far away from the GSLB infrastructure itself, in terms of hop count and link bandwidth bottlenecks, that the speed of the effective aggregate link between client and infrastructure is the same as that of the effective link between the DNS server chain the client is using and the infrastructure".
This assumption isn't always correct, especially when you consider that DNS is hierarchical and thus the server which makes the request of the GSLB-presented DNS service may be some levels removed from the client. Also, the "logical closeness" assumption doesn't wash too well when you consider entities such as the AOL mega-proxy.
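A toy illustration of how this goes wrong, with entirely made-up site names and RTT figures: the GSLB switches can only measure RTT to the DNS server that queried them, so if that server sits near one datacentre while the real client sits near the other, the client gets sent to the wrong site.

```python
from typing import Dict


def best_site(rtts_ms: Dict[str, float]) -> str:
    """Return the name of the site with the lowest RTT."""
    return min(rtts_ms, key=rtts_ms.get)


# Hypothetical measurements, in milliseconds, from each site.
rtt_to_dns_server = {"london": 20.0, "new_york": 95.0}   # what GSLB can see
rtt_to_real_client = {"london": 140.0, "new_york": 30.0}  # what actually matters

gslb_choice = best_site(rtt_to_dns_server)
ideal_choice = best_site(rtt_to_real_client)

print("GSLB sends the client to:", gslb_choice)
print("The client would be better served by:", ideal_choice)
```

With these numbers, GSLB picks london even though new_york is much closer to the client, which is exactly the mega-proxy style of failure described above.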
When, and When Not, to Use GSLB
GSLB works best when you have multiple independent datacentres which you want to have doing useful work, where the bandwidth of any existing link between them is small (assuming that GSLB negotiation traffic is light when compared to live service transaction traffic), and where the datacentres do not need to perform back-end synchronisation or multi-phase commits between them.
This qualifier is what has relegated GSLB to the small niche it now occupies. While HTTP and many other protocols are stateless, the transaction data they carry very often isn't. In the event that a non-read-only transaction was performed with one GSLB site and that GSLB site subsequently went down, you'd want the other site(s) to have a record of the data written in the transaction. This usually results in back-end databases at each site needing to do either regular synchronisations or multi-phase write commits, at which point it's frequently the case that the bandwidth of the links between the sites needs to be raised to the point where, rather than use GSLB, you might as well do regular active-active load balancing. While active-active is rather trickier to set up and maintain than active-standby, it's still simpler than GSLB and has the advantage that it's easier to weight the distribution of load across your sites, to cater for any differences in hardware performance between them.
So, in the (relatively unlikely) event that your services are read-only and the bandwidth between your datacentres is small, go right ahead and use GSLB. For all other circumstances, there's usually a better way to approach the problem, based on the fact that there's more inter-site bandwidth available.
There are actually a couple more points worth considering. First, if you're going to avoid GSLB, it's often useful to source the upstream links from your datacentres from the same supplier, in a manner such that any vips you need to fail over between datacentres are in the same upstream subnet. Second, if you are going to go down the route of GSLB, beware the latency involved with DNS map updates; while a GSLB device will readily do a map "push" to its upstream server as part of a failover event, you are at the mercy of the "ripple carry" latency involved in reaching the client-end DNS server, before a client will be redirected to the live site.
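For a rough feel of that second point, here's a back-of-envelope sketch of the worst-case delay before a client stops being handed the dead site's address. It's a simplified model under my own assumptions, not a description of how any particular resolver behaves: the function name and parameters are hypothetical, detect_s is the time the GSLB device takes to notice the failure, and ttl_s is the TTL on the record it serves. A cache that passes on the remaining TTL keeps a stale answer alive for at most one TTL in total; if each cache in the chain instead restarts the TTL when it stores the record, the staleness can stack up level by level.

```python
def worst_case_redirect_delay(detect_s: float, ttl_s: float,
                              cache_levels: int = 1,
                              caches_restart_ttl: bool = False) -> float:
    """Upper bound, in seconds, on how long clients may still get the dead site's vip.

    Simplified model (an assumption, not measured behaviour):
    - a cache may fetch the old record right up until the failure is detected;
    - if caches pass on the remaining TTL, the record is stale for at most ttl_s;
    - if every cache level restarts the TTL, staleness is ttl_s per level.
    """
    if cache_levels < 1:
        raise ValueError("cache_levels must be at least 1")
    stale_s = ttl_s * cache_levels if caches_restart_ttl else ttl_s
    return detect_s + stale_s


# 10 s to detect the failure, 60 s TTL, well-behaved caches.
print(worst_case_redirect_delay(detect_s=10, ttl_s=60))

# Same record, but three cache levels that each restart the TTL.
print(worst_case_redirect_delay(detect_s=10, ttl_s=60,
                                cache_levels=3, caches_restart_ttl=True))
```

The practical upshot is that a short TTL on GSLB-served records limits the damage, at the cost of more DNS queries arriving at the switches.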