Big Just Got Huge

I'm sure you're all familiar with the TACC Ranger system by now. 62,000 cores, 125TB RAM, a couple of petabytes of storage, #4 on the Top500 list, all running one gigantic Sun Grid Engine cluster under a single qmaster. As if that weren't exciting enough, I have some new amazingness to share.

I heard a couple of weeks ago from one of the Sun engineers onsite at TACC that they have successfully run a 60,000-core parallel job on Ranger. For those of you who are familiar with MPI, I'll give you a moment to recover. For those of you who aren't, a parallel job is a distributed application with multiple cooperating tasks running across multiple machines. In this case, it's a single application instance composed of cooperating tasks spread across several thousand servers. Yes, really.

Even more unbelievable, this feat was accomplished with a special branch of the Sun Grid Engine 6.1 release using the old SSH-based parallel job support. (The 6.2 Grid Engine release includes a more scalable "built-in" method for starting parallel jobs that blows the doors off the old RSH- or SSH-based model.) When TACC has completed the upgrade to 6.2, the scalability numbers will be outrageous!

This 60k-core job is part of a facial recognition application being developed by PNNL. Running on the Ranger system at TACC, the application is able to recognize faces in images faster than real time. The reason the job didn't use all 62k+ cores in the system is administrative: there isn't a single queue that spans every host yet. That will be remedied soon, I'm told.

About

templedf
