Powering protein large language models in antibody discovery on OCI

February 15, 2024 | 4 minute read
Deepak Soni
AI Infra and GPU Specialist

With the rapid growth of ChatGPT—over 100 million active users in just two months—people hear “large language model” (LLM) over and over. However, while the “language” in the name refers to their original development for language processing tasks, the “large” is far more informative. All LLMs have many nodes—the neurons in a neural network—and an even larger number of values that describe the weights of the connections among those nodes. Although LLMs were first developed to process language, they can be applied to many other tasks, including extensive use in protein research.

The recent breakthroughs of LLMs in natural language processing (NLP), and the parallels between human natural language and the language of proteins, invite the application and adaptation of LLMs to protein modelling and design. Protein language models have already been trained to accurately predict protein properties and to generate novel, functionally characterized proteins, achieving state-of-the-art results. In this blog post, we show how an Oracle Cloud Infrastructure (OCI) customer takes advantage of the compute scalability and performance of OCI NVIDIA GPU infrastructure for antibody discovery.

Protein LLM training

To test their system, the researchers tried a range of LLM sizes, varying the number of parameters that describe the strength of the connections among the nodes from 8 million up to 15 billion. A clear pattern emerged. For proteins that were rare or unusual in some way, performance started out low with the base LLM and improved as the model size increased through XL to XXL. Nothing indicated that this improvement had saturated, even with the 15-billion-parameter XXL system.

So, we’re still at the point where throwing more computational resources at the problem improves overall performance, even though it’s already as good as it’s likely to get for many individual proteins. We provisioned a single OCI NVIDIA GPU node with eight NVIDIA A100 GPUs to train the protein LLMs.
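To give a concrete picture of how a run like this is typically launched on a single eight-GPU node, the following is a minimal sketch using PyTorch DistributedDataParallel. The model, dataset, and hyperparameters are illustrative placeholders, not the customer’s actual training code.

```python
# Minimal sketch of single-node, 8-GPU data-parallel training with PyTorch DDP.
# Everything here (model, data, hyperparameters) is a placeholder.
# Launch with: torchrun --standalone --nproc_per_node=8 train_protein_lm.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    dist.init_process_group(backend="nccl")              # one process per GPU, NCCL for A100s
    local_rank = int(os.environ["LOCAL_RANK"])            # set by torchrun
    torch.cuda.set_device(local_rank)

    # Placeholder model: a small Transformer encoder standing in for a protein LM.
    model = torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=6,
    ).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Placeholder data: random embeddings standing in for encoded protein sequences.
    data = TensorDataset(torch.randn(4096, 128, 512))
    sampler = DistributedSampler(data)                     # shards the dataset across the 8 GPUs
    loader = DataLoader(data, batch_size=8, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for epoch in range(2):
        sampler.set_epoch(epoch)                           # reshuffle shards each epoch
        for (batch,) in loader:
            batch = batch.cuda(local_rank, non_blocking=True)
            loss = model(batch).pow(2).mean()              # dummy objective for illustration
            optimizer.zero_grad()
            loss.backward()                                # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

On a BM.GPU4.8 node, torchrun starts one worker per A100, and NCCL handles the gradient all-reduce across the eight GPUs.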

Solution Architecture 

A solution architecture describes how a specific solution is built on the platform following best practices. The following diagram shows the OCI architecture that the customer used. A bastion node in the public subnet is used for login purposes. The server node sits in a private subnet and uses the BM.GPU4.8 shape with eight NVIDIA A100 GPUs and a 500-GB boot volume. A 27.2-TB local NVMe SSD is attached to the server node for AI model training.

OCI solution architecture deployment
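As an illustration of how an instance like this might be provisioned programmatically, the following sketch uses the OCI Python SDK. Every OCID, the availability domain, and the image reference are placeholders; the customer’s actual provisioning workflow may differ.

```python
# Hypothetical sketch: launch a BM.GPU4.8 instance in a private subnet with a 500-GB boot volume.
# All OCIDs and names below are placeholders, not values from the customer's tenancy.
import oci

config = oci.config.from_file()                       # reads ~/.oci/config
compute = oci.core.ComputeClient(config)

details = oci.core.models.LaunchInstanceDetails(
    availability_domain="AD-1-placeholder",
    compartment_id="ocid1.compartment.oc1..example",
    display_name="protein-llm-training",
    shape="BM.GPU4.8",                                # bare metal, 8x NVIDIA A100
    create_vnic_details=oci.core.models.CreateVnicDetails(
        subnet_id="ocid1.subnet.oc1..private-subnet-example",
        assign_public_ip=False,                       # reachable only through the bastion
    ),
    source_details=oci.core.models.InstanceSourceViaImageDetails(
        image_id="ocid1.image.oc1..gpu-image-example",
        boot_volume_size_in_gbs=500,                  # matches the 500-GB boot volume above
    ),
)

instance = compute.launch_instance(details).data
print(instance.id, instance.lifecycle_state)
```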

Challenges

The customer found that the Logical Volume Manager (LVM) root partition in the default Oracle Linux NVIDIA GPU image wasn’t large enough. We had created a 500-GB boot volume, but the root partition needed to be resized to use all the available storage. We followed the steps in Resize Boot Volume in Linux 7 and 8 Instances in OCI to grow the LVM root partition to 500 GB on the NVIDIA GPU instance, which worked perfectly.
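For context, the general pattern behind that procedure is to grow the partition, resize the LVM physical volume, extend the logical volume, and then grow the filesystem. The sketch below shells out to the standard tools; the device, partition, and volume-group names (/dev/sda, partition 3, ocivolume/root) are assumptions for a typical Oracle Linux image, so confirm them with lsblk and vgs, and prefer the steps in the referenced article.

```python
# Illustrative sketch only: extend an LVM root filesystem after enlarging the boot volume.
# Device and volume-group names are assumptions; verify them on your instance first.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["growpart", "/dev/sda", "3"])                            # grow the partition backing the LVM PV
run(["pvresize", "/dev/sda3"])                                # make LVM aware of the new space
run(["lvextend", "-l", "+100%FREE", "/dev/ocivolume/root"])   # extend the root logical volume
run(["xfs_growfs", "/"])                                      # grow the XFS filesystem online
```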

Testing and Results

We trained models in the OCI GPU infrastructure environment for the customer’s internal wet lab validation project. A wet lab is one where drugs, chemicals, and other types of biological matter can be analyzed and tested using various liquids. The customer needed to redesign important regions of trastuzumab, an approved breast cancer drug.

We trained six novel protein LLMs (about 1 billion parameters each) to generate antibody binding regions and 15 novel protein LLMs (about 0.5 billion parameters each) to select better candidates from those generated. The customer used all eight NVIDIA A100 GPUs for this LLM training, and computational resources at that scale were crucial for the project.
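The customer’s selection models are proprietary, but a common way to rank generated variants with a protein language model is to score each sequence by its average per-token log-likelihood. The sketch below illustrates that idea with the Hugging Face Transformers API; the checkpoint path and example sequence fragments are placeholders, not the customer’s models or candidates.

```python
# Hedged sketch: rank candidate sequences by average log-likelihood under a protein LM.
# The checkpoint path and sequences below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/protein-lm-checkpoint"           # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH).eval().cuda()

def score(sequence: str) -> float:
    """Average log-likelihood per token; higher means the model finds the sequence more plausible."""
    ids = tokenizer(sequence, return_tensors="pt").input_ids.cuda()
    with torch.no_grad():
        out = model(ids, labels=ids)                   # loss is the mean negative log-likelihood
    return -out.loss.item()

variants = ["EVQLVESGGGLVQPGGS", "QVQLQESGPGLVKPSET"]  # placeholder fragments, not real candidates
scores = {seq: score(seq) for seq in variants}
for seq in sorted(scores, key=scores.get, reverse=True):
    print(f"{scores[seq]:8.3f}  {seq}")
```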

Using those models, the customer generated 40 variants of trastuzumab with redesigned regions and sent them to their wet lab partner for validation. These results show that OCI reduces the time needed to train large protein language models with a simple Ethernet network architecture. OCI Compute bare metal instances powered by NVIDIA GPUs offer ultralow latency and line-rate network performance.

Conclusion

With OCI as the foundation for their solutions, customers can use our GPU Compute service, high-performance network, and storage to help run LLM workloads at scale. This customer used OCI to train protein LLMs within the necessary time frame, and the modelling productivity continues to increase and can be applied to a wide variety of uses.

OCI offers support from cloud engineers for training LLMs and deploying NVIDIA GPUs at scale. To learn more about Oracle Cloud Infrastructure’s capabilities and supercluster architecture, see the following resources:

 

 

Deepak Soni

AI Infra and GPU Specialist

I bring more than 19 years of experience, with two years in end-to-end deployment of generative AI and LLM AI/ML workloads on OCI, and the remaining years in high-performance computing infrastructure and solution development across a variety of cloud platforms. Experienced in the data centre, automotive, financial, and healthcare domains, my areas of expertise include HPC infrastructure architecture, design, and solution development with application profiling and benchmarking, scientific application deployments, and business systems.

Interests:

Emerging technologies and innovations

GenAI and LLM AI/ML workloads

HPC/AI architecture design and great collaborations

 

