Diving in at the deep end - testing scalability of the xVM Ops Center at TACC
By nickstephen on Nov 12, 2007
xVM Ops Center is the software that we are busy building for managing data centers (yes, in the plural).
It needs to scale up to thousands and thousands of managed systems, spread across multiple geos.
One question we faced was how to test at this kind of scale: no individual engineer has a setup like that to test on.
For initial development testing, beyond what we can do in our own small labs (which contain a real mix of machines, already a good test of heterogeneity), the approach was to create a pseudo-data-center: a very simple software simulation of the managed resources in a subnet, with the basic management characteristics and configuration of a real set of machines. In our implementation, this came down to having a JMX MBeanServer with an MBean for each managed resource (chassis, server, OS, ...), each implementing the same management interface that we require drivers for real managed systems to implement.
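To make the idea concrete, here is a minimal sketch of what such a pseudo-data-center could look like. It is not the Ops Center code: the `ManagedResourceMBean` interface, its attributes, the `pseudo.datacenter` domain and the naming scheme are all made up for illustration. Only the JMX calls themselves are standard.

```java
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;

/** Hypothetical sketch of a pseudo-data-center: one MBean per simulated resource. */
final class PseudoDataCenter {

    /** Assumed management interface; the real driver interface is not shown in this post. */
    public interface ManagedResourceMBean {
        String getKind();
        String getPowerState();
        void powerOn();
        void powerOff();
    }

    /** A simulated resource that keeps its state in memory instead of touching hardware. */
    public static class ManagedResource implements ManagedResourceMBean {
        private final String kind;
        private volatile String powerState = "off";

        public ManagedResource(String kind) { this.kind = kind; }

        @Override public String getKind() { return kind; }
        @Override public String getPowerState() { return powerState; }
        @Override public void powerOn() { powerState = "on"; }
        @Override public void powerOff() { powerState = "off"; }
    }

    /** Builds an MBeanServer holding the given number of simulated servers, each with an OS. */
    static MBeanServer create(int servers) throws Exception {
        MBeanServer mbs = MBeanServerFactory.newMBeanServer();
        for (int i = 0; i < servers; i++) {
            mbs.registerMBean(new ManagedResource("server"),
                    new ObjectName("pseudo.datacenter:type=server,name=server-" + i));
            mbs.registerMBean(new ManagedResource("os"),
                    new ObjectName("pseudo.datacenter:type=os,name=os-" + i));
        }
        return mbs;
    }

    public static void main(String[] args) throws Exception {
        MBeanServer mbs = create(1000);
        // Query by pattern, much as a discovery pass over a subnet might.
        Set<ObjectName> found =
                mbs.queryNames(new ObjectName("pseudo.datacenter:type=server,*"), null);
        System.out.println("Simulated servers: " + found.size());
    }
}
```

Since every simulated resource is just a small in-memory object, scaling the simulation up is mostly a matter of changing the argument to `create`.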
Developers can then configure a pseudo-data-center of their own (it's just a JVM) and install the pseudo-data-center driver into the product, which allows for the discovery and management of pseudo-resources instead of real ones. The driver just queries and manipulates the MBeans using JMX Remoting, instead of manipulating real resources on the network over IPMI, SNMP, ssh, or another management protocol.
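As a rough illustration of the driver side, the sketch below registers a single simulated server, exposes it through a JMX Remoting (RMI) connector, and then reads and changes its state over that connection, the way a pseudo-data-center driver might. The `PseudoResourceMBean` interface and the object name are hypothetical, and for brevity both ends run in one JVM. In a real setup the driver would connect to the address of a separate pseudo-data-center JVM.

```java
import javax.management.MBeanServer;
import javax.management.MBeanServerConnection;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXConnectorServer;
import javax.management.remote.JMXConnectorServerFactory;
import javax.management.remote.JMXServiceURL;

/** Hypothetical sketch of a driver talking to a pseudo-resource over JMX Remoting. */
final class PseudoDriverDemo {

    /** Assumed management interface, reduced to one attribute and one operation. */
    public interface PseudoResourceMBean {
        String getPowerState();
        void powerOn();
    }

    /** Simulated server whose power state lives in memory. */
    public static class PseudoResource implements PseudoResourceMBean {
        private volatile String powerState = "off";
        @Override public String getPowerState() { return powerState; }
        @Override public void powerOn() { powerState = "on"; }
    }

    /** Powers the simulated server on through a remote connection and returns the state read back. */
    static String powerOnRemotely() throws Exception {
        // Assumption for this demo: keep the RMI stubs on the loopback address.
        System.setProperty("java.rmi.server.hostname", "127.0.0.1");

        // Pseudo-data-center side: one simulated server in an MBeanServer.
        MBeanServer mbs = MBeanServerFactory.newMBeanServer();
        ObjectName name = new ObjectName("pseudo.datacenter:type=server,name=server-0");
        mbs.registerMBean(new PseudoResource(), name);

        // Expose the MBeanServer over RMI; the resulting address embeds the RMI stub.
        JMXConnectorServer server = JMXConnectorServerFactory.newJMXConnectorServer(
                new JMXServiceURL("service:jmx:rmi://"), null, mbs);
        server.start();

        // Driver side: connect, manipulate the pseudo-resource, then query it.
        try (JMXConnector connector = JMXConnectorFactory.connect(server.getAddress())) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            conn.invoke(name, "powerOn", new Object[0], new String[0]);
            return (String) conn.getAttribute(name, "PowerState");
        } finally {
            server.stop();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Power state after remote powerOn: " + powerOnRemotely());
    }
}
```

The point of the design is that the rest of the product sees the same management interface either way, so only the driver needs to know it is talking to MBeans rather than to IPMI or SNMP endpoints.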
This lets us test scale up to thousands of pseudo systems... but who says that this is going to work on a set of real data-centers? And who says that things like OS provisioning that are typically huge network resource hogs will work as expected in the real world?
Well, one answer is to increase the level of reality of the pseudo-data-center, and another is to go test against real systems.
Both have been done.
What better way to dive in at the deep end than to deploy a preview version of Ops Center at one of the world's largest super-computing data-centers and give it a test-drive...
Initial results are extremely promising. Of course, not everything has worked right the first time, and we've found and fixed various issues, ranging from locking and synchronization problems to queue length issues and job weights. This is a great "dive-in-at-the-deep-end" opportunity to validate our product ahead of getting it to market.
We're also doing internal data-center level testing using the Sun Grid Compute Facility. While we don't have access to as big a system there, it lets us test other topologies, hardware, and system configurations.
By combining the above approaches with careful algorithmic design for our scalability requirements, we have a high level of confidence that we can hit this nail on the head.