Released in July 2024, Meta’s Llama 3.1 405B is a state-of-the-art open source model with a custom commercial license, the Llama 3.1 Community License. With a context window of up to 128K tokens and support for eight languages (English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai), it rivals the top AI models in general knowledge, steerability, math, tool use, and multilingual translation. For more information, see the model’s technical specifications.

Customers can already deploy Meta’s Llama 2 and Llama 3 models in Oracle Cloud Infrastructure (OCI) Data Science, and you can now deploy Llama 3.1 too. To harness the power of Llama 3.1 405B, you can deploy the model with OCI Data Science Model Deployment as an HTTP endpoint. OCI Data Science supports leading industry GPUs, which enables customers to deploy a model as large as Llama 3.1 405B.

In this blog post, we go through the steps to deploy Llama 3.1 405B in OCI Data Science using the Bring Your Own Container (BYOC) approach. We deploy Meta-Llama-3.1-405B-Instruct-FP8, the FP8-quantized version of the model.

Setup

First, install the necessary packages with the following command:

pip install oracle-ads oci huggingface_hub -U

Prepare the model artifacts

Download the model from HuggingFace 

First, download the model files from HuggingFace to a local directory using a valid HuggingFace token. Llama 3.1 requires acceptance of its user agreement. If you don’t have a HuggingFace token, see the HuggingFace documentation on how to generate one. For details on downloading a model from HuggingFace, refer to the documentation.

huggingface-cli login --token "<your-huggingface-token>"
# Download the LLama3.1 405B model from Hugging Face to a local folder  
huggingface-cli download meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --local-dir Meta-Llama-3.1-405B-Instruct-FP8
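
Alternatively, the download can be scripted with the huggingface_hub library that we installed earlier. The following is a minimal sketch; the local directory name matches the CLI example above:

from huggingface_hub import snapshot_download

# Downloads all files in the repository (FP8 weights, tokenizer, configs).
snapshot_download(
    repo_id="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    local_dir="Meta-Llama-3.1-405B-Instruct-FP8",
    token="<your-huggingface-token>",  # or rely on a prior huggingface-cli login
)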

Upload the model to OCI Object Storage

Then, upload the model folder to a versioned bucket in OCI Object Storage. The bucket is versioned to prevent unintentional deletion of model artifacts. If you don’t have an Object Storage bucket, create one using the OCI software development kit (SDK) or the Console. Take note of the namespace, compartment, and bucket name. Configure the policies to allow the Data Science service to read and write the model artifacts to the Object Storage bucket in your tenancy.

allow service datascience to manage object-family in compartment <compartment> where ALL {target.bucket.name='<bucket_name>'}

An administrator must configure the policies in Identity and Access Management (IAM) in the Console.

oci os object bulk-upload --src-dir Meta-Llama-3.1-405B-Instruct-FP8 --prefix "Meta-Llama-3.1-405B-Instruct-FP8/" -bn "<bucket_name>" -ns "<bucket_namespace>" --auth "resource_principal"
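
If you prefer to script the upload, the OCI Python SDK’s UploadManager handles multipart uploads for the large weight shards. The following is a minimal sketch with placeholder bucket names:

import os
import oci

# Resource principal auth works inside OCI (for example, in a notebook session);
# use oci.config.from_file() when running locally.
signer = oci.auth.signers.get_resource_principals_signer()
client = oci.object_storage.ObjectStorageClient(config={}, signer=signer)
uploader = oci.object_storage.UploadManager(client)

local_dir = "Meta-Llama-3.1-405B-Instruct-FP8"  # folder downloaded earlier
namespace, bucket = "<bucket_namespace>", "<bucket_name>"

# Walk the model folder and upload each file under the same prefix.
for root, _, files in os.walk(local_dir):
    for name in files:
        path = os.path.join(root, name)
        object_name = f"{local_dir}/{os.path.relpath(path, local_dir)}"
        uploader.upload_file(namespace, bucket, object_name, path)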

Save the model to the model catalog

Finally, create a model catalog entry for the model using the Object Storage path. You can use the Accelerated Data Science (ADS) SDK. We’re using the model-by-reference feature in ADS, which stores a reference to the artifacts in Object Storage instead of copying them into the catalog.

from ads.model.datascience_model import DataScienceModel

# Placeholders: the bucket and prefix from the upload step, plus your OCIDs.
bucket, namespace = "<bucket_name>", "<bucket_namespace>"
model_prefix = "Meta-Llama-3.1-405B-Instruct-FP8"
compartment_id, project_id = "<compartment_ocid>", "<project_ocid>"

model = (DataScienceModel()
     .with_compartment_id(compartment_id)
     .with_project_id(project_id)
     .with_display_name("Meta-Llama-3.1-405B-Instruct-FP8")
     .with_artifact(f"oci://{bucket}@{namespace}/{model_prefix}")).create(model_by_reference=True)

Inference container

vLLM is an easy-to-use library for large language model (LLM) inference and serving. You can get the container image from Docker Hub.

docker pull --platform linux/amd64 vllm/vllm-openai:v0.5.3.post1

Currently, OCI Data Science Model Deployment supports only container images residing in the OCI Registry (OCIR). Before pushing the pulled vLLM container image, ensure that you have created a repository in your tenancy with the following steps:

  1. Go to your tenancy in Container Registry.
  2. Select the Create repository button.
  3. Under Access types, select Private.
  4. Set a name for your repository. In our example, we’re using “vllm-odsc.”
  5. Select Create.

You might need to log in to OCIR with Docker to push the image. To log in, use your auth token, which you can create in the Console under your Oracle Cloud user account, under Auth tokens. You need to log in only once. Replace <region> with the OCI region that you’re using.

docker login -u '<tenancy-namespace>/<username>' <region>.ocir.io

If your tenancy is federated with Oracle Identity Cloud Service, use the format <tenancy-namespace>/oracleidentitycloudservice/<username>. You can then push the container image to the OCI Registry.

docker tag vllm/vllm-openai:v0.5.3.post1 <region>.ocir.io/<tenancy-namespace>/vllm-odsc/vllm-openai:v0.5.3.post1
docker push <region>.ocir.io/<tenancy-namespace>/vllm-odsc/vllm-openai:v0.5.3.post1

Create and test the model deployment

We can now create a Data Science Model Deployment. For details, refer to this GitHub article.
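
As a rough guide, the deployment can also be defined with the ADS SDK. The following is a minimal sketch, not the exact configuration from the article: the shape name, ports, and vLLM startup arguments are assumptions that you should adapt to your tenancy and GPU availability.

from ads.model.deployment import (
    ModelDeployment,
    ModelDeploymentContainerRuntime,
    ModelDeploymentInfrastructure,
)

# Assumed values: adjust the shape and replica count for your environment.
infrastructure = (
    ModelDeploymentInfrastructure()
    .with_project_id(project_id)
    .with_compartment_id(compartment_id)
    .with_shape_name("BM.GPU.H100.8")  # assumed multi-GPU bare metal shape
    .with_bandwidth_mbps(10)
    .with_replica(1)
)

container_runtime = (
    ModelDeploymentContainerRuntime()
    .with_image("<region>.ocir.io/<tenancy-namespace>/vllm-odsc/vllm-openai:v0.5.3.post1")
    .with_server_port(8080)
    .with_health_check_port(8080)
    .with_model_uri(model.id)  # the model catalog entry created earlier
    .with_cmd([  # hypothetical vLLM launch arguments
        "--port", "8080",
        "--model", "/opt/ds/model/deployed_model/Meta-Llama-3.1-405B-Instruct-FP8",
        "--tensor-parallel-size", "8",
        "--served-model-name", "odsc-llm",
    ])
    .with_deployment_mode("HTTPS_ONLY")
)

deployment = (
    ModelDeployment()
    .with_display_name("Meta-Llama-3.1-405B-Instruct-FP8 deployment")
    .with_infrastructure(infrastructure)
    .with_runtime(container_runtime)
).deploy(wait_for_completion=False)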

Inference

When the Model Deployment has reached the Active state, you can invoke it with an HTTP request to interact with the LLM. 

Prompt the Llama 3.1 model

The Instruct version of the model uses the following prompt template:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

<prompt><|eot_id|><|start_header_id|>assistant<|end_header_id|>

The following example uses Python code:

import ads
import requests
from string import Template

ads.set_auth("resource_principal")  # or your preferred authentication method
endpoint = "<model_deployment_url>/predict"  # the deployment's invoke endpoint

data = requests.post(
    endpoint,
    json={
        "model": "odsc-llm",
        "prompt": Template(
            """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 23 Jul 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

$prompt<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
        ).substitute(
            prompt="What amateur radio band can a general class license holder use?"
        ),
        "max_tokens": 250,
        "temperature": 0.7,
        "top_p": 0.8,
    },
    auth=ads.common.auth.default_signer()["signer"],
    headers={},
).json()
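
The generated text can then be read from the OpenAI-style completions response that vLLM returns (assuming the standard response schema):

print(data["choices"][0]["text"])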

The model produces the following output:

A General Class amateur radio license holder in the United States can use
frequencies on 80 meters (3.5 MHz), 40 meters (7 MHz), 20 meters (14 MHz),
and 15 meters (21 MHz) bands, as well as the 10 meter band (28-29.995 MHz).

Conclusion

Deploying Llama 3.1 405B on the OCI Data Science service enables you to harness the latest open source AI technology and use it in your applications, with infrastructure flexibility for optimization and enterprise-grade security and privacy.

Try Oracle Cloud Free Trial! A 30-day trial with US$300 in free credits gives you access to the Oracle Cloud Infrastructure Data Science service. For more information, see the following resources: