Using Grid Engine for Service Management

Someone just asked about using Grid Engine for service management on a Grid Engine open source mailing list. I gave the short answer on the mailing list, but I thought I'd give the long answer here. (At least I thought I responded on the mailing list. I haven't gotten a copy of my reply yet...)

If you take a naïve look at Grid Engine, you might notice some specific characteristics. You might see that it has these things called load sensors, which are scripts that monitor the state of the grid and provide data (called complex variables or complexes) which gets sent as load reports to the qmaster. You might also notice that a queue can be set with a suspend threshold. A suspend threshold is a line that when crossed causes the qmaster to take action. A suspend threshold is defined as a set of complexes and their corresponding values. Each complex defines for itself whether being above the value, below the value, at the value, or not at the value is "bad." Finally, the queue suspend method and resume method might catch your eye. Normally the qmaster will suspend a job in a queue which has crossed the suspend threshold by sending the job a SIGSTOP and resume a job by sending a SIGCONT. The suspend and resume methods allow the administrator to override that behavior and execute a script instead.

If you put all those pieces together, you could say from a very high level that Grid Engine can watch for conditions in the grid and respond to those conditions by running a script. And you'd be right. Mostly. Using the above elements, you can configure a Grid Engine cluster to maintain a cluster of services, either keeping a given number of instances running, or responding to drops in responsiveness, or whatever. The only catch is that the configuration is a little odd. To help you get started, I'm going to try to provide here a sort of cookbook for building a Grid Engine cluster which manages services. Throughout, I will often refer you to the man pages. You can find them online here.

  1. Define the complex variables that you're going to use for service monitoring. See the man page for qconf(1). Look for the -mc switch. If you don't need any special load values, skip this and the next step.

  2. Write a load sensor script to monitor the complex variables you just defined. See this howto on the open source web site and Chapter 3 of the N1 Grid Engine 6 Administration Guide. The important thing to keep in mind is that load sensors run on the execution hosts. That means every host you want to monitor locally or from which you want to do remote monitoring should be an execution host. You should also keep in mind that the load sensors are only asked for data every few seconds. (You can configure the period with "qconf -mconf global".) That doesn't say anything about what the load sensor does in the meantime. It could be continuously collecting data, but only reporting when asked.

    A natural language for load sensors is Bourne or C shell. Load sensors can, however, be written in any language. The only requirement is that they adhere to the load sensor input/output requirements, as explained in the howto, admin guide, and the sge_conf(5) man page. I have written load sensors in Perl, C, and the Java language. The Java language is particularly interesting for service monitoring because of the JMX management interface that many enterprise servers have.

  3. Create a special queue for service management. How you configure the queue depends on what you're doing. If you're trying to manage the number of service instances running, you probably want to have the queue on a single host and have one slot per service instance that you plan to run. This will make more sense in a couple of minutes.

  4. Write a script that does whatever you want done, such as running your service. If needed, write a script that undoes whatever you wanted done, such as stopping your service.

  5. Set the start script as the suspend_method for your queue. Set the stop script, if you have one, as the resume_method. Set the suspend_thresholds attribute appropriately. See the queue_conf(5) man page for all three.

  6. Write a job script that does nothing indefinitely. I recommend sleep in an infinite loop. Submit one instance of that job to your queue for every slot the queue has.
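To make step 2 a little more concrete, here is a minimal sketch of a load sensor in Python. It follows the request/report exchange described in the howto: the sensor waits for a line on standard input, exits if that line is "quit", and otherwise prints a report of host:name:value lines wrapped in "begin" and "end". The complex name svc_response_ms, the TCP-connect probe, and the default host and port are all made up for illustration; substitute whatever you defined in step 1.

```python
import socket
import sys
import time


def measure_response_ms(host="localhost", port=8080, timeout=2.0):
    """Hypothetical probe: time a TCP connect to the service, in milliseconds.

    Returns the timeout (in ms) if the service cannot be reached, so an
    unreachable service looks "slow" rather than missing.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError:
        return int(timeout * 1000)
    return int((time.monotonic() - start) * 1000)


def format_report(host, values):
    """Build one load report: 'begin', one host:name:value line each, 'end'."""
    lines = ["begin"]
    lines += [f"{host}:{name}:{value}" for name, value in values.items()]
    lines.append("end")
    return "\n".join(lines) + "\n"


def run_sensor(stdin, stdout, host, measure):
    """Answer each request line with a report until 'quit' or end of input."""
    for line in stdin:
        if line.strip() == "quit":
            break
        # svc_response_ms is an assumed complex name, not a built-in one.
        stdout.write(format_report(host, {"svc_response_ms": measure()}))
        stdout.flush()  # the reader is waiting for the complete report


def main():
    """Entry point for real use; the host name must match the one Grid Engine uses."""
    run_sensor(sys.stdin, sys.stdout, socket.gethostname(), measure_response_ms)
```

To use this as an actual load sensor, call main() at the bottom of the script. Keeping the measurement behind a plain function makes it easy to swap in something smarter, such as a probe that collects data continuously and only reports the latest value when asked.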

Here's how the system works. The qmaster will monitor the suspend threshold of your queue. When the suspend threshold is crossed, the qmaster will attempt to suspend one of the jobs in the queue. The jobs in the queue are all the sleeper jobs, so suspending them doesn't really affect them. However, because you configured a suspend method for the queue, every time it suspends a job, it runs your start script. The reason why you want a slot for every service instance is that through this mechanism, the qmaster can start one service instance for every job that it can suspend. You have as many jobs as you have queue slots, so the number of queue slots directly determines the number of service instances the qmaster can start. (If you're not starting service instances, extrapolate as needed.)

When the load drops back below the suspend threshold, the qmaster will begin resuming suspended jobs. Again, the jobs themselves don't really care, but since you set a resume method, your stop script gets run every time a job gets resumed.
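If it helps to see the mechanism in motion, the following toy model in Python mimics the behavior described above. It is not Grid Engine code, just a sketch under a few assumptions: the complex counts as "bad" when the load is at or above the threshold, the qmaster acts on at most one sleeper job per check, and every name in it (ToyQueue, tick, simulate) is invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyQueue:
    """Toy model of the service queue; not Grid Engine code."""
    slots: int            # one sleeper job per slot
    threshold: float      # suspend threshold on the monitored load value
    suspended: int = 0    # suspended sleepers, i.e. service instances running
    log: list = field(default_factory=list)

    def tick(self, load):
        """Model one check of the threshold: act on at most one sleeper job."""
        if load >= self.threshold and self.suspended < self.slots:
            self.suspended += 1
            self.log.append("start")  # suspend method runs the start script
        elif load < self.threshold and self.suspended > 0:
            self.suspended -= 1
            self.log.append("stop")   # resume method runs the stop script
        return self.suspended


def simulate(loads, slots, threshold):
    """Feed a series of load values through the toy queue, one per check."""
    queue = ToyQueue(slots=slots, threshold=threshold)
    return [queue.tick(load) for load in loads]
```

For example, simulate([600, 700, 800, 100, 100, 100], slots=2, threshold=500) returns [1, 2, 2, 1, 0, 0]: two instances come up while the load is high, the third high reading changes nothing because both slots are already used, and the instances are stopped one at a time once the load drops.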

Really, that's the medium answer. The long answer takes into account things like load jitter, service resource usage accounting, and service provisioning. That should be enough, however, if you're already familiar with Grid Engine, to begin building something that resembles a service management framework.


Perhaps it's worth mentioning in this "medium" answer that the suspend interval on the queue plays an important role. Every "dummy" job running in this special queue represents a potential service, but when the suspend threshold is crossed, one of these gets started every N minutes, where N is the suspend interval. At the very least, you'd want the suspend interval to be big enough to allow for a *single* new instance to have a chance at making a difference in your service level, but small enough so that you don't need to wait an excessive amount of time for the next instance to start, if the first instance didn't do enough to affect your service level.

Posted by Charu Chaubal on August 29, 2005 at 03:16 AM PDT #
