High Availability Part 5
By Todd Little (Oracle) on Jan 23, 2014
In the first parts of this series on high availability, I’ve explained what high availability means, how availability is defined and measured, what impacts availability, and what needs to be done in general to improve availability. In this post I’ll look at some of the features in Tuxedo that aid in creating highly available applications.
The first thing I want to cover is using redundancy to improve availability. If you recall from the second post, adding even a single level of redundancy can make a huge improvement in availability. Many highly available systems advertise this property as having no single point of failure, meaning that for every required component there is at least one duplicate or redundant copy. Highly available hardware platforms like Exalogic are designed this way, with redundant power distribution and power supplies, redundant networks, redundant or extra fans, redundant SSDs, etc. Any single hardware component can fail without bringing the system down. Tuxedo is designed with the same strategy in mind, i.e., no single point of failure. Let’s look at each type of component in a Tuxedo application and see how it can be made redundant.
Each of the critical Tuxedo system servers can have additional copies running. By critical, I mean those that are required for the application to perform its task. The reason I qualify this is that I’m often asked about the Bulletin Board Liaison (BBL), Distinguished Bulletin Board Liaison (DBBL), and the BRIDGE. These three servers cannot have multiple copies running. So how does Tuxedo provide high availability if these servers cannot run redundantly? For the BBL and DBBL, the answer is that these servers are not needed by Tuxedo to process requests. If a BBL or the DBBL crashes for some reason, the application can still function, because normal request processing doesn’t require those servers to be up. They are primarily responsible for management, configuration, and housekeeping tasks. So while one of them is temporarily down, the application continues to run, but certain configuration and administration tasks may not be possible until the failed server is restarted.
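While the DBBL itself cannot be duplicated, a backup master machine can be declared so the DBBL can be moved if the master machine fails. As a minimal sketch, the UBBCONFIG fragment below (the machine names SITE1/SITE2, hostnames, paths, and IPCKEY are placeholder assumptions, not from the original post) lists a second LMID on the MASTER line for this purpose:

```
*RESOURCES
IPCKEY   51002                  # placeholder value
MASTER   SITE1,SITE2           # SITE2 is the backup master; the DBBL can
MODEL    MP                    # be migrated there if SITE1 goes down
OPTIONS  LAN,MIGRATE

*MACHINES
host1  LMID=SITE1  TUXDIR="/opt/tuxedo"  APPDIR="/app"  TUXCONFIG="/app/tuxconfig"
host2  LMID=SITE2  TUXDIR="/opt/tuxedo"  APPDIR="/app"  TUXCONFIG="/app/tuxconfig"
```

With a configuration like this, the administrator (or automation) can promote SITE2 to master, and the DBBL will run there while SITE1 is repaired.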
The BRIDGE, on the other hand, may be in the path of application request processing, so why does its being down not necessarily affect application availability? This largely depends upon the application configuration. Ideally each machine in a clustered configuration would run all the servers necessary to process application requests. In that case, Tuxedo would note that the BRIDGE is down and route requests only to local servers. Certain unavoidable configurations, such as a server that cannot run on more than one machine at a time (a singleton server), may cause temporary unavailability of the services offered by that server, but the outage should be extremely short. The reason it will be short is that Tuxedo operates on a buddy system: every process has another process monitoring its presence and usually its health as well. All configured processes are monitored by the BBL, so if the BRIDGE fails, the BBL will automatically restart it; in general the worst that happens is that some requests are temporarily delayed, and no requests are lost. Since every process needs a buddy to monitor it, the BRIDGE monitors the BBL, so if the BBL should crash, the BRIDGE will automatically restart it. One final comment on these specific system servers: they contain no application or customer-provided code and are heavily tested, so their likelihood of failing in the first place is extremely low.
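The buddy system described above can be sketched in a few lines. This is an illustrative toy, not Tuxedo code: the `Process` class and `monitor` function are invented for this sketch, which only shows the mutual-monitoring idea (the BBL restarts the BRIDGE, and the BRIDGE restarts the BBL).

```python
# Toy sketch (not Tuxedo code) of the "buddy system": each process has a
# partner that detects its failure and restarts it.

class Process:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.restarts = 0

    def crash(self):
        self.alive = False

    def restart(self):
        self.alive = True
        self.restarts += 1

def monitor(watcher, target):
    """If the watcher finds its buddy down, it restarts it."""
    if watcher.alive and not target.alive:
        target.restart()

bbl = Process("BBL")
bridge = Process("BRIDGE")

bridge.crash()
monitor(bbl, bridge)    # the BBL restarts the failed BRIDGE
bbl.crash()
monitor(bridge, bbl)    # the BRIDGE restarts the failed BBL
```

After both failures, both processes are running again; requests issued in the interim are delayed rather than lost, since they remain queued until the restarted process picks them up.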
Let’s take a quick look at the other common Tuxedo system servers. TMQUEUE servers can be run with multiple copies accessing the same queuespace on a single machine. Thus if one of the TMQUEUE servers fails, the others simply continue to process /Q requests for the queuespace. Should the machine hosting the queuespace fail, it would be necessary to migrate the server group containing the TMQUEUE servers to another machine. With Tuxedo 12c, this process can be automated with automatic server group migration, so that the server group is automatically migrated to its backup machine. The failure of a machine here could introduce some minimal unavailability for the queuespace while the servers are being migrated. Planned enhancements to Tuxedo Message Queue (TMQ) should address even this small bit of unavailability in a future release. In the meantime, there is nothing preventing multiple queuespaces with the same name from existing in a single domain, so TMQUEUE servers could be run on multiple machines; but these queuespaces are independent of each other, so a dequeue operation could block even though there are messages in the “queue”, because the dequeue operation landed on a machine where that queue is empty.
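For illustration, a UBBCONFIG fragment along these lines (the group name QGRP1, queuespace name QSPACE1, device path, and LMIDs are made-up placeholders) runs two TMQUEUE copies against one queuespace and lists a backup machine for the group so it can be migrated:

```
*GROUPS
QGRP1    LMID=SITE1,SITE2  GRPNO=100     # SITE2 is the backup machine for migration
         TMSNAME=TMS_QM  TMSCOUNT=2
         OPENINFO="TUXEDO/QM:/app/qdevice:QSPACE1"

*SERVERS
TMQUEUE  SRVGRP=QGRP1  SRVID=1  MIN=2  MAX=2  RESTART=Y
         RQADDR="qspace1q"  REPLYQ=Y     # both copies share one request queue
         CLOPT="-s QSPACE1:TMQUEUE"
```

If either TMQUEUE copy dies, the survivor keeps servicing QSPACE1; if SITE1 itself dies, the whole group QGRP1 is migrated to SITE2 (with 12c, automatically when automatic server group migration is enabled).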
The next set of system servers are those that operate in a master/slave arrangement. These include the Tuxedo event brokers and the CORBA naming servers. They operate in such a fashion that one server is configured as the master, and when changes are made, it has the responsibility of propagating them to the slave servers. Both the master and the slaves can handle read-only requests, and they communicate via standard tpcalls, meaning they can run across multiple machines. If the master server dies (it is normally configured as restartable), it will be restarted by the BBL and will then restore its state from the slave servers. So in the worst case there may be an extremely brief period during which a new event subscription or a new CORBA naming entry can’t be made, but event posting and name lookups still proceed even while the master is down.
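The availability trade-off of this master/slave pattern can be sketched as follows. This is a toy model, not the event broker’s actual implementation: the `Server` and `Cluster` classes are invented for the sketch, which shows only that writes need the master while reads survive its failure.

```python
# Toy sketch (not Tuxedo code) of the master/slave pattern: writes go
# through the master, which propagates them to the slaves; any surviving
# server can answer reads, so reads continue while the master is down.

class Server:
    def __init__(self):
        self.alive = True
        self.entries = {}   # e.g. event subscriptions or naming entries

class Cluster:
    def __init__(self, n):
        self.servers = [Server() for _ in range(n)]   # servers[0] is the master

    def write(self, key, value):
        master = self.servers[0]
        if not master.alive:
            raise RuntimeError("master down: writes briefly unavailable")
        for s in self.servers:          # master propagates to every live slave
            if s.alive:
                s.entries[key] = value

    def read(self, key):
        for s in self.servers:          # any live server can serve reads
            if s.alive:
                return s.entries.get(key)
        raise RuntimeError("no servers available")

cluster = Cluster(3)
cluster.write("MYEVENT", "subscriber-1")
cluster.servers[0].alive = False        # master crashes
print(cluster.read("MYEVENT"))          # reads still succeed: subscriber-1
```

In Tuxedo the "briefly unavailable" window for writes is closed by the BBL restarting the master, which then rebuilds its state from the slaves.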
In my next post I’ll cover the remaining system servers and application servers, as they are both handled in the same way for redundancy. In subsequent posts I’ll cover other ways to improve the availability of Tuxedo applications, such as Tuxedo’s built-in failover/failback capabilities and means to automate application- or configuration-specific steps to remediate failures.