Big Just Got Huge
By templedf on Sep 03, 2008
I'm sure you're all familiar with the TACC Ranger system by now. 62,000 cores, 125TB RAM, a couple of petabytes of storage, #4 on the Top500 list, all running one gigantic Sun Grid Engine cluster under a single qmaster. As if that weren't exciting enough, I have some new amazingness to share.
I heard a couple of weeks ago from one of the Sun engineers onsite at TACC that they have successfully run a 60,000-core parallel job on Ranger. For those of you who are familiar with MPI, I'll give you a moment to recover. For those of you who aren't, a parallel job is a distributed application with multiple cooperating tasks running across multiple machines. In this case, it's a single application instance composed of cooperating tasks spread across several thousand servers. Yes, really.
Even more unbelievable, this feat was accomplished with a special branch of the Sun Grid Engine 6.1 release using the old SSH-based parallel job support. (The 6.2 Grid Engine release includes a more scalable "built-in" method for starting parallel jobs that blows the doors off the old RSH- or SSH-based model.) When TACC has completed the upgrade to 6.2, the scalability numbers will be outrageous!
This 60k-core job is part of a facial recognition application being developed by PNNL. Running on the Ranger system at TACC, the application is able to recognize faces in images faster than real time. The reason the job didn't use all 62k+ cores in the system is administrative: there isn't a single queue that spans every host yet. That will be remedied soon, I'm told.