Over the past few years, we have witnessed constant changes in the availability of large language models (LLMs) for public use. New models and new versions of existing models are continuously being released for different use cases and applications. If you find an LLM in a public repository or on Hugging Face, you can try using the model in an application by deploying it as an HTTP endpoint and providing your own data to generate new output.
Oracle Cloud Infrastructure (OCI) Data Science offers model deployments as a managed resource that allows you to deploy machine learning models as HTTP endpoints. To create a Data Science model deployment for an LLM, you can download the model artifacts and set up a custom container with the necessary runtime dependencies for the model through Data Science’s Bring Your Own Container model deployment approach. Recently, OCI Data Science has made available a container, set up and managed by the Data Science service, for deploying models compatible with vLLM.
vLLM is an open source, high-throughput, and memory-efficient inference and serving engine for LLMs from UC Berkeley. This container is available for customers to use when creating a model deployment inside the Data Science platform. In this post, we walk through an example of deploying ELYZA-japanese-Llama-2-13b-instruct from ELYZA, a company known for its LLM research that originated at the University of Tokyo.
Many popular LLMs are trained primarily on English text, so their performance on non-English languages tends to lag. ELYZA starts from the English-dominant Llama 2 model and continues pretraining on an additional 18 billion tokens of Japanese data. This combination makes the model well suited for tasks in Japanese.
We cover the following steps necessary to deploy this model in OCI Data Science:
- Prerequisites: Set up policies necessary for Data Science model deployment and import packages
- Download the model and prepare the model artifacts
- Create a Data Science model deployment
You can use the following steps to deploy other non-English language models or other foundation models, provided the model is compatible with the version of vLLM in the service-managed container described in this post.
Prerequisites
Setting up required policies
To deploy your model in OCI Data Science, you need to set up the necessary policies. We deploy the model through the Data Science Bring Your Own Container approach. You can find the list of necessary policies on GitHub.
Importing packages
Next, we install the necessary packages. We download the model artifacts for ELYZA from Hugging Face and use the OCI Accelerated Data Science (ADS) software developer kit (SDK) to help set up the model deployment. You can run the following Python code to install the necessary packages inside the Data Science platform:
# Install required python packages
!pip install oracle-ads
!pip install oci
!pip install huggingface_hub
# Uncomment this code and set the correct proxy links if you have to set up a proxy for internet access
# import os
# os.environ['http_proxy']="http://myproxy"
# os.environ['https_proxy']="http://myproxy"
# Use os.environ['no_proxy'] to route traffic directly
import ads
ads.set_auth("resource_principal")
# Extract region information from the Notebook environment variables and signer
ads.common.utils.extract_region()
The following code block sets up common variables, including compartment ID, project ID, log group ID, and container image location:
# Change as required for your environment
import os
compartment_id = os.environ["PROJECT_COMPARTMENT_OCID"]
project_id = os.environ["PROJECT_OCID"]
log_group_id = "ocid1.loggroup.oc1.xxx.xxxxx"
log_id = "ocid1.log.oc1.xxx.xxxxx"
instance_shape = "VM.GPU.A10.2"
container_image = "dsmc://odsc-vllm-serving:0.3.0.7"
region = "us-ashburn-1"
The container image, dsmc://odsc-vllm-serving:0.3.0.7, is an Oracle Service Managed container that was built with the following packages:
- Oracle Linux 8 – Slim
- CUDA 12.4
- cuDNN 9
- Torch 2.1.2
- Python 3.11.5
- vLLM v0.3.0
Download the model and prepare the model artifacts
Now, we download and prepare the model artifacts. This step has the following main tasks:
- Download the model from Hugging Face, an online AI platform. Certain models on Hugging Face require users to accept agreements and terms of service before they gain access. You must generate a Hugging Face token to verify you have access to those models if you want to use them in Data Science. For information on how to generate one, see the Hugging Face documentation.
- Upload the model folder to a versioned bucket in OCI Object Storage. If you don’t have an Object Storage bucket, you can create one by using the OCI SDK or through the Console (a sketch follows this list). After you create the bucket, note its namespace, compartment, and name. An administrator in your tenancy needs to configure policies that allow the Data Science service to read and write the model artifact to the Object Storage bucket in your tenancy.
- Create an entry in the Data Science model catalog for the model using the Object Storage path.
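If you don’t already have a versioned bucket for the second task, the following is a minimal sketch of creating one with the OCI Python SDK from a notebook session. The bucket name is a placeholder, and the code assumes the compartment_id variable defined earlier and a resource principal signer:
# A minimal sketch: create a versioned Object Storage bucket with the OCI Python SDK.
# The bucket name is a placeholder; compartment_id is the variable defined earlier.
import oci
signer = oci.auth.signers.get_resource_principals_signer()
object_storage = oci.object_storage.ObjectStorageClient(config={}, signer=signer)
namespace = object_storage.get_namespace().data
bucket_details = oci.object_storage.models.CreateBucketDetails(
    name="<bucket_name>",
    compartment_id=compartment_id,
    versioning="Enabled",  # storing the model by reference requires a versioned bucket
)
object_storage.create_bucket(namespace_name=namespace, create_bucket_details=bucket_details)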
The following code accomplishes these three tasks.
Download the model from the Hugging Face Model Hub:
# Log in to Hugging Face using your access token
HUGGINGFACE_TOKEN = "<HUGGINGFACE_TOKEN>"  # Your Hugging Face token
!huggingface-cli login --token $HUGGINGFACE_TOKEN
We use the snapshot_download() function to download an entire repository from the Hugging Face Model Hub. For more information, see the Hugging Face website.
# Download the ELYZA model from Hugging Face to a local folder.
from huggingface_hub import snapshot_download
from tqdm.auto import tqdm
model_name = "elyza/ELYZA-japanese-Llama-2-13b-instruct"
local_dir = "models/ELYZA-japanese-Llama-2-13b-instruct"
snapshot_download(repo_id=model_name, local_dir=local_dir, force_download=True, tqdm_class=tqdm)
print(f"Downloaded model {model_name} to {local_dir}")
Upload the model to OCI Object Storage:
model_prefix = "ELYZA-japanese-Llama-2-13b-instruct/"  # "<bucket_prefix>"
bucket = "<bucket_name>"  # This should be a versioned bucket
namespace = "<bucket_namespace>"
!oci os object bulk-upload --src-dir $local_dir --prefix $model_prefix -bn $bucket -ns $namespace --auth "resource_principal"
Create a model using ADS:
from ads.model.datascience_model import DataScienceModel
artifact_path = f"oci://{bucket}@{namespace}/{model_prefix}"
model = (DataScienceModel()
.with_compartment_id(compartment_id)
.with_project_id(project_id)
.with_display_name("ELYZA-japanese-Llama-2-13b-instruct")
.with_artifact(artifact_path)
)
model.create(model_by_reference=True)
Create a Data Science model deployment
We set up and create an OCI Data Science model deployment with ADS. Import the ADS model deployment modules:
from ads.model.deployment import (
ModelDeployment,
ModelDeploymentContainerRuntime,
ModelDeploymentInfrastructure,
ModelDeploymentMode,
)
Set up the model deployment infrastructure:
infrastructure = (
ModelDeploymentInfrastructure()
.with_project_id(project_id)
.with_compartment_id(compartment_id)
.with_shape_name(instance_shape)
.with_bandwidth_mbps(10)
.with_replica(1)
.with_web_concurrency(10)
.with_access_log(
log_group_id=log_group_id,
log_id=log_id,
)
.with_predict_log(
log_group_id=log_group_id,
log_id=log_id,
)
)
Configure the model deployment runtime:
env_var = {
'BASE_MODEL': model_prefix,
'PARAMS': '--served-model-name odsc-llm --seed 42',
'MODEL_DEPLOY_PREDICT_ENDPOINT': '/v1/completions',
'MODEL_DEPLOY_ENABLE_STREAMING': 'true'
}
container_runtime = (
ModelDeploymentContainerRuntime()
.with_image(container_image)
.with_server_port(8080)
.with_health_check_port(8080)
.with_env(env_var)
.with_deployment_mode(ModelDeploymentMode.HTTPS)
.with_model_uri(model.id)
.with_region(region)
.with_overwrite_existing_artifact(True)
.with_remove_existing_artifact(True)
)
Deploy the model using container runtime:
deployment = (
ModelDeployment()
.with_display_name(f"ELYZA-japanese-Llama-2-13b-Instruct MD with vLLM SMC")
.with_description("Deployment of ELYZA-japanese-Llama-2-13b-Instruct MD with vLLM(0.3.0) container")
.with_infrastructure(infrastructure)
.with_runtime(container_runtime)
).deploy(wait_for_completion=False)
When the model deployment has reached the Active state, we can invoke the model deployment endpoint to interact with the LLM.
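For example, here is a minimal sketch of calling the /v1/completions endpoint from a notebook session. It assumes a resource principal signer and uses a placeholder deployment OCID in the endpoint URL, which you copy from the model deployment’s details page:
# A minimal sketch: invoke the deployment's /v1/completions endpoint.
# The endpoint URL below is a placeholder; copy the invoke endpoint from the
# model deployment's details page in the Console.
import ads
import requests
ads.set_auth("resource_principal")
endpoint = "https://modeldeployment.us-ashburn-1.oci.customer-oci.com/ocid1.datasciencemodeldeployment.oc1.xxx.xxxxx/predict"
body = {
    "model": "odsc-llm",  # matches --served-model-name in the PARAMS environment variable
    "prompt": "富士山の高さを教えてください。",  # "Tell me the height of Mount Fuji."
    "max_tokens": 250,
    "temperature": 0.7,
}
# Sign the request with the notebook session's resource principal
signer = ads.common.auth.default_signer()["signer"]
response = requests.post(endpoint, json=body, auth=signer)
print(response.json())
Because the container routes /v1/completions requests to vLLM’s OpenAI-compatible server, the generated text is typically returned under the choices field of the JSON response.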
Conclusion
We have walked through how to deploy the ELYZA model by downloading it from Hugging Face and using the Data Science service-managed container and the Accelerated Data Science SDK to create a Data Science model deployment. Soon, we plan to offer a service-managed container compatible with another popular inferencing option, Text Generation Inference.
The code samples mentioned in this blog are available in our GitHub repository, where you can find more examples and tutorials on using OCI Data Science. You can also visit our technical documentation.
