By templedf on Feb 11, 2010
Let's take a break from the Sun Grid Engine 6.2u5 feature posts and talk about something that's been in the product since 6.2. (It's actually the foundation of two of the remaining three features, so consider this ground work for finishing my u5 features series.)
Service Domain Manager (or the open source Project Hedeby (formerly Project Haithabu)) is an add-on component for Sun Grid Engine that enables multiple clusters to share resources. It was designed to allow for services of all types to share resources with each other. The basic idea is this: each cluster has a set of performance metrics specified via service level objectives (SLOs). If at any point a cluster is in violation of its SLOs, it appeals to the SDM resource provider service for additional resources. The resource provider will look for resources wherever they're available: in spare resource pools, from cloud service providers, or from other less-loaded clusters. If resources are available, the resource provider will (re)assign the resources to the cluster in need. From the users' perspective, nothing really changes, except that the overloaded cluster is now feeling better. Let's get into a little more detail.
A Little More Detail
The resource provider is the heart and brain of SDM. Its job is to keep track of services and resources and adjust resource assignments as needed. At the level of the resource provider, everything is very abstract. It doesn't know (or care) what any of its managed services do, as long as they implement the required interface. It also doesn't care about the details of the resources it's managing, beyond the fact that there are details, and that the services it's managing may care about those details.
One other abstract concept that the resource provider understands is a need. When a service managed by the resource provider needs more resources, it tells the resource provider about its need. That need is expressed as a description of the resources desired to satisfy the need (including quantity), and how important the need is. For example, a managed service might say to the resource provider, "Hey! I want two OpenSolaris x86 resources with at least 4GB memory each. This need is critical to me continuing to service my users!" To satisfy this request, the resource provider will look around at the other services it's managing to see who could potentially give up the requested resources. Among the other services there might be spare pools (basically just holding tanks for idle resources), cloud service providers (e.g. Amazon EC2), or other services. If the requested resources are free, they will be reassigned to the requesting service. With a spare pool, the decision is easy: any resources in the spare pool are fair game. Same for the cloud. With other services, though, it's not so simple. In general, if a service is still holding a resource, that's because it's still using it to some degree. How do we know when it's OK to take a resource away from a service? Well, the resource provider has a set of policies that govern the relative importance of the services. Using those policies, the resource provider will decide if the importance of the requesting service plus the criticality of its need outweighs the importance of the potential donor service and how much it's using the resources in question. If, in the end, there are no resources that can reasonably be reassigned to the needy cluster, then the request stays pending and will be reevaluated later.
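To make the idea of a need a little more concrete, here is a minimal sketch in Python. It is not SDM code: the `Need` and `Resource` classes, the numeric urgency and usage scales, and the `may_reassign` rule are all assumptions made up for illustration. It only shows the shape of the decision described above: match the resource description, then weigh requester importance plus urgency against donor importance plus usage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    os: str
    arch: str
    memory_gb: int


@dataclass(frozen=True)
class Need:
    """What a service asks for: a resource description, a quantity, an urgency."""
    os: str
    arch: str
    min_memory_gb: int
    quantity: int
    urgency: int  # hypothetical scale: larger means more critical

    def matches(self, resource: Resource) -> bool:
        # A resource satisfies the need if it fits the description.
        return (resource.os == self.os
                and resource.arch == self.arch
                and resource.memory_gb >= self.min_memory_gb)


def may_reassign(requester_importance: int, need: Need,
                 donor_importance: int, donor_usage: int,
                 donor_is_spare_pool: bool = False) -> bool:
    """Assumed policy, for illustration only.

    Spare-pool resources are always fair game. Otherwise the requester's
    importance plus the urgency of its need must outweigh the donor's
    importance plus how heavily it is using the resource.
    """
    if donor_is_spare_pool:
        return True
    return requester_importance + need.urgency > donor_importance + donor_usage


# "I want two OpenSolaris x86 resources with at least 4GB memory each."
need = Need(os="OpenSolaris", arch="x86", min_memory_gb=4, quantity=2, urgency=90)
idle_box = Resource("host-a", "OpenSolaris", "x86", 8)
small_box = Resource("host-b", "OpenSolaris", "x86", 2)
```

In this sketch, a request that fails `may_reassign` for every candidate would simply stay in a pending list and be checked again later.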
On the service side of things there is a service adapter. The job of the service adapter is to be the shim between the service itself and the resource provider. It implements that abstracted service interface that the resource provider expects and translates those abstract concepts we just talked about into concrete artifacts understood by the service. In particular, it's up to the service adapter to define and implement the SLOs for the service. Why? Well, consider this use case. Imagine you have a cluster of application servers and a Sun Grid Engine cluster, and you want to share resources between them. The service level criteria will be very different between them, and it wouldn't make any sense to expect the resource provider to understand them all. Instead, it's more flexible and more scalable to allow the service adapters to manage the SLOs and only report the results (e.g. needs) to the resource provider.
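Here is a rough sketch of that division of labor, again in Python and again purely hypothetical: the `ServiceAdapter` interface, the two toy adapters, and `collect_needs` are invented for this example and are not the real Hedeby API. The point is that each adapter judges its own service by its own metric, while the resource provider side only ever sees the resulting needs.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple


class ServiceAdapter(ABC):
    """Hypothetical abstract interface seen by the resource provider."""
    name: str

    @abstractmethod
    def current_need(self) -> Optional[str]:
        """Return a description of the needed resource, or None if all is well."""


class AppServerAdapter(ServiceAdapter):
    """Toy adapter whose SLO is based on response time."""

    def __init__(self, max_response_ms: float) -> None:
        self.name = "app_servers"
        self.max_response_ms = max_response_ms
        self.response_ms = 0.0

    def current_need(self) -> Optional[str]:
        if self.response_ms > self.max_response_ms:
            return "application server host"
        return None


class GridAdapter(ServiceAdapter):
    """Toy adapter whose SLO is based on the number of pending jobs."""

    def __init__(self, max_pending: int) -> None:
        self.name = "grid"
        self.max_pending = max_pending
        self.pending_jobs = 0

    def current_need(self) -> Optional[str]:
        if self.pending_jobs > self.max_pending:
            return "execution host"
        return None


def collect_needs(adapters: List[ServiceAdapter]) -> List[Tuple[str, str]]:
    """Resource provider side: gather needs without knowing any service metric."""
    needs = []
    for adapter in adapters:
        need = adapter.current_need()
        if need is not None:
            needs.append((adapter.name, need))
    return needs
```

Adding a third kind of service in this sketch means writing one more adapter class; `collect_needs` does not change.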
Let's use the Sun Grid Engine adapter to illustrate how a service adapter works. Starting with 6.2, the Sun Grid Engine qmaster includes a JMX interface known as JGDI. (While JGDI is openly accessible, we don't really advertise it because it's not really abstract enough for public consumption.) The Sun Grid Engine service adapter uses the JGDI interface to monitor the state of the qmaster. The service adapter implements one unique SLO: maximum number of pending jobs. (It actually inherits a couple of other SLOs from the service adapter SDK that are universally applicable, such as the minimum number of resources that should be assigned.) When the state of the cluster changes, the qmaster sends an event to the service adapter. The service adapter then checks the new cluster state against the SLOs that have been configured to see if any SLO has been violated. If an SLO has been violated, the SLO configuration specifies what kind of resource is needed to address the issue. For example, suppose there's an SLO that states that there should never be more than 100 pending Solaris x86 jobs. If the service adapter finds out that the 101st Solaris x86 job is pending, it will appeal to the resource provider and request an additional Solaris x86 resource.
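The pending-jobs example can be sketched in a few lines of Python. This is an illustration under assumptions, not the adapter's actual code: the class names, the `on_cluster_event` function, and the choice to request exactly one resource per violation are all made up here, and the real adapter gets its cluster state through JGDI rather than from a plain list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PendingJob:
    job_id: int
    os: str
    arch: str


@dataclass(frozen=True)
class ResourceRequest:
    os: str
    arch: str
    quantity: int


@dataclass(frozen=True)
class MaxPendingJobsSLO:
    """Hypothetical SLO: no more than max_pending pending jobs of one kind."""
    os: str
    arch: str
    max_pending: int

    def check(self, pending_jobs: List[PendingJob]) -> Optional[ResourceRequest]:
        # Count only the pending jobs this SLO is about.
        matching = sum(1 for job in pending_jobs
                       if job.os == self.os and job.arch == self.arch)
        if matching > self.max_pending:
            # Assumption: a violation asks for one additional matching resource.
            return ResourceRequest(self.os, self.arch, quantity=1)
        return None


def on_cluster_event(slos: List[MaxPendingJobsSLO],
                     pending_jobs: List[PendingJob]) -> List[ResourceRequest]:
    """Run on each state change: return the requests to send to the resource provider."""
    requests = []
    for slo in slos:
        request = slo.check(pending_jobs)
        if request is not None:
            requests.append(request)
    return requests


slo = MaxPendingJobsSLO("Solaris", "x86", max_pending=100)
at_limit = [PendingJob(i, "Solaris", "x86") for i in range(100)]
over_limit = at_limit + [PendingJob(100, "Solaris", "x86")]
```

With exactly 100 pending Solaris x86 jobs nothing is requested; the 101st one tips the SLO into violation and produces a single request.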
When the resource provider assigns a resource to the service, the service adapter is responsible for prepping the resource and adding it into the service. Now, here's the interesting part. After the new resource takes on its share of the workload and the service is happy again, we don't take the resource away. The resource stays with the service until someone else needs it more. Resources are shared, not leased. It is possible to configure SDM to behave in a fashion that is in effect leasing, but it's something you have to explicitly set up.
On the other side of the coin, when the resource provider is asked for a resource, it talks to the service adapters for the managed services to find out who has something that can be borrowed. The resource provider keeps a map of where all the resources are assigned, so it can immediately tell which services are currently holding resources that are candidates for reassignment. It then contacts those services' service adapters to find out whether the resources are in use. The service adapter's job is to look at the service and place a numerical value on how well the resources are being used by the service. Once the resource provider has collected the usage values for all the candidate resources, it applies policies (such as relative importance of the services) and picks the resources that seem most available. This process applies equally to services, spare pools*, and cloud service providers. (* There is a built-in spare pool in the resource provider that doesn't actually have its own service adapter, but it works as though it did.)
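As a rough Python sketch of that selection step: the assignment map, the per-service usage callbacks, the 0 to 100 usage scale, and the "importance plus usage" score below are all assumptions chosen for illustration, not SDM's actual policy engine.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical adapter callback: resource name -> usage value, 0 (idle) to 100 (busy).
UsageFn = Callable[[str], int]


def pick_resources(assignments: Dict[str, str],
                   usage_by_service: Dict[str, UsageFn],
                   importance: Dict[str, int],
                   requester: str,
                   quantity: int) -> List[str]:
    """Rank resources held by other services and return the most available ones."""
    scored: List[Tuple[int, str]] = []
    for resource, owner in assignments.items():
        if owner == requester:
            continue  # never take a resource from the service that is asking
        usage = usage_by_service[owner](resource)  # ask the owner's adapter
        # Assumed policy: a lower (importance + usage) score means more available.
        scored.append((importance[owner] + usage, resource))
    scored.sort()
    return [resource for _, resource in scored[:quantity]]


# The resource provider's map of which service holds which resource.
assignments = {"n1": "spare_pool", "n2": "app_servers",
               "n3": "app_servers", "n4": "grid"}
usage_by_service = {
    "spare_pool": lambda resource: 0,  # idle by definition
    "app_servers": lambda resource: {"n2": 80, "n3": 10}[resource],
    "grid": lambda resource: 95,
}
importance = {"spare_pool": 0, "app_servers": 50, "grid": 60}

picked = pick_resources(assignments, usage_by_service, importance,
                        requester="grid", quantity=2)
```

Here the grid asks for two resources and gets the spare-pool machine plus the lightly used application server, while the busy application server is left alone.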
With the 6.2u5 release, we have two service adapter implementations. One is for the Sun Grid Engine software itself. The other is a generic cloud adapter that comes with integration scripts for use with Amazon EC2 and for use with IPMI power management. Out of the box, you can use SDM to manage Sun Grid Engine clusters and to resource those clusters on demand from EC2. You can also configure a spare pool* that powers down idle or underutilized machines. (* It's not technically a spare pool, but it behaves like one.) The intention is to add more service adapter implementations as we uncover the concrete demand for them. In addition, the original plan was to make the service adapter API clean, public, and well-documented. So far, it's fairly clean and fairly well documented, but only public insofar as the Hedeby Project is open source. If you have interest in seeing or (even better) developing a service adapter for a particular service, please do let us know, and we'll see what we can do to help.
Hopefully this overview gives you a pretty good idea of what SDM does and at least an inkling of how it does it. If not, let me know!