Data-intensive research in life sciences has always been important to the development of life-saving medicine. Research has only accelerated during this seven-month-long battle with COVID-19 to discover potential vaccines and therapies to protect us from the dangerous effects of the novel coronavirus. Ordinary citizens are now schooled in previously unfamiliar terms, such as phase III clinical trials and herd immunity. We’re all eager witnesses and sometimes participants in this effort.
Sometimes the software needed to process and analyze large amounts of data goes significantly beyond the capabilities of desktop computers and standalone servers. Here, we must consider high-performance computing (HPC). HPC typically consists of clusters of servers arranged in a coordinated fashion to accomplish a task with common software instructions, often accessed in the cloud.
In life sciences, drug discovery often requires running complex mathematical models of the behavior of multiple chemical and biological processes using HPC or even supercomputers (HPC on steroids, but more expensive). Applying modern data analysis tools, such as artificial intelligence and machine learning, to existing or discovered disease and molecular data can drastically accelerate the process of identifying promising molecular or biological candidates for further evaluation, including preclinical studies in animals and clinical trials in humans. This data analysis includes the search for antibodies that can prevent or fight COVID-19.
Oncology and neurological research also require analysis and modeling of such large data sets and use HPC in a similar way. In recent years, advances in cloud processing and storage have enabled researchers and data scientists to collaborate with other teams, potentially situated around the globe, and to access common data sets and software in a global effort to fight disease. Scientists are also using cloud-accessible HPC for applications such as bioinformatics, simulation of organ function, and genomics research. Whole-genome sequencing (WGS) relies heavily on HPC.
Cloud-based HPC offers the potential to save and improve lives through accelerated drug development while enabling scaling of effort without direct investment in expensive and maintenance-intensive server infrastructure. The ability of teams of researchers to collaborate remotely on cloud HPC accelerates these critical efforts.
Life sciences research computing is increasingly based on tightly coupled, low-latency processing. International Data Corporation (IDC) characterizes this processing as massively parallel compute (MPC). Today, several cloud service providers (CSPs) are offering sophisticated environments for life sciences research that include the following features:
Flexible scaling with remote direct memory access (RDMA) that allows parallel applications to scale to thousands of cores
High memory bandwidth; the latest-generation Intel processors with AVX-512 and high single-core turbo clock speeds; GPUs, FPGAs, or ASICs; and CUDA, frameworks, libraries, containers, and development kits for AI development
Dedicated private network fiber connections for high-speed connection to the data center or lab
Open source tools that scientists are familiar with, such as Kubernetes and job schedulers, as well as cloud implementations of life sciences toolkits
Vast public data sets for research in the cloud and the ability to collaborate with competitors in a shared cloud environment while keeping those competitors outside the firewall
Cloud-based compliance that alleviates some of the work involved with ensuring compliance
IDC believes that HPC in the cloud has matured to the point where businesses experience mostly advantages and few disadvantages from running their life sciences research in the cloud. When considering a move to the cloud, IDC recommends that organizations understand the cloud provider’s scalability capabilities. A cluster with thousands of CPU cores doesn’t perform the same on all CSPs. They can also make the deployment easier for themselves by building machine images in the cloud that are identical to their on-premises images, and by automating the cluster setup with open source tools like Terraform or CfnCluster.
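As a rough illustration of that automation advice, the fragment below sketches how a compute cluster might be declared in Terraform. Every specific here is an assumption for illustration: the AWS provider, the region, the instance type, the node count, and the `cluster_ami` variable are placeholders, not recommendations for any particular workload or CSP.

```hcl
# Hypothetical sketch: provider, region, instance type, and node count
# are illustrative assumptions, not tuned recommendations.
provider "aws" {
  region = "us-east-1"
}

variable "cluster_ami" {
  description = "Machine image built to match the on-premises environment"
  type        = string
}

resource "aws_instance" "compute_node" {
  count         = 16                # scale the node count up or down per run
  ami           = var.cluster_ami   # same image as on-premises for portability
  instance_type = "c5n.18xlarge"    # high-bandwidth networking class (assumption)

  tags = {
    Name    = "hpc-node-${count.index}"
    Project = "life-sciences-research"
  }
}
```

Because the whole cluster is described in one declarative file, tearing it down after a run (and so controlling cost) is as simple as destroying the configuration, which is exactly the kind of repeatable setup the recommendation above is pointing at.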
Finally, it’s important to monitor cost once the research runs in the cloud. If it’s easy to scale a cluster in the cloud, it’s also easy to blow the budget, especially on storage.