Set up autoscaling for OCI Data Science model deployment

August 31, 2022 | 15 minute read
Tzvi Keisar
Director of Product Management

Deploying machine learning models into production has many challenges and complexities. One example is ensuring that the inference endpoint is always available to receive and process requests. This process isn’t different from ensuring that a website or any online service has 24/7 availability. Customers don’t respond well to slow, unreliable services.

One of the main challenges with this requirement is that the load on the server can be unpredictable because it’s driven by external factors: your customers. If your service is successful, people want to use it, sending more requests to the service and increasing the load. In turn, the service eventually needs more compute resources to handle the increased demand. Sometimes, to handle the demand, you need a more powerful compute resource.

Sometimes, the load decreases for various reasons, such as seasonality (e.g. predicting sales of patio furniture), popularity of the service, reduced inference frequency need, and others. To optimize the costs and prevent underutilization of the resources, you want to lower the compute power that you use.

In this blog post, we create a solution for autoscaling Oracle Cloud Infrastructure (OCI) Data Science model deployments to handle increased or decreased loads.

Autoscaling overview

The idea behind the autoscaling solution is to create rules based on the metrics that the model deployment emits. Increase the compute power when the metrics go over a certain utilization value, which is called scaling up. Decrease the compute power when the metrics go under a certain utilization value, which is called scaling down.

Scaling the compute power has two dimensions: CPU utilization and memory utilization. When CPU utilization needs to be scaled, you can add or remove virtual machine (VM) instances from the deployment to adjust the processing power. Adding an instance decreases the processing load on existing instances. This scaling operation is called horizontal scaling because it adds more instances of the same VM shape to the deployment.
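As a back-of-the-envelope illustration (the numbers are hypothetical), if the load balancer spreads requests roughly evenly, the per-instance load falls as instances are added:

```python
def per_instance_load(total_requests_per_sec: float, instance_count: int) -> float:
    """Approximate request load per instance, assuming even load balancing."""
    return total_requests_per_sec / instance_count

# 300 req/s across 2 instances is 150 req/s each; a third instance drops it to 100.
```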

However, memory doesn’t behave in the same way. Memory load on an instance doesn’t decrease because another instance is added. Each instance has complete duplication of the deployment and has the same memory footprint as all other instances. To adjust memory utilization, we need to use a different VM shape with a different memory size. This scaling operation is called vertical scaling because it upgrades or downgrades the VM shape for the deployment.

In both horizontal and vertical scaling, the model deployment has zero downtime for serving inference requests. The old deployment remains active until the updated deployment is active.

Solution architecture overview

The data science autoscaling solution is built on the following OCI services:

  • OCI Monitoring: For catching model deployment events and firing alarms when thresholds are crossed

  • OCI Notifications: For subscribing to the alarms and activating the relevant OCI Function

  • OCI Functions: For running the logic code. The function also creates and manages alarms for scaling down, so you don’t need to create them in advance.

  • OCI Logging: For collecting all outputs from all the other services

The following graphic shows a high-level view of the solution architecture:

A graphic showing the solution architecture for model deployment autoscaling.

For the full, step-by-step, detailed solution, including all files, see the OCI Data Science sample repository on GitHub.

You need to have an active machine learning model deployment and all permissions set. For more details, see the prerequisites.

Creating the scaling function

Main function handler

This function is called when the OCI function is invoked and the request contains the alarm message.

def handler(ctx, data: io.BytesIO=None):

The alarm processing only proceeds if the alarm has started firing (OK_TO_FIRING) or if it was intentionally defined to refire (REPEAT). The alarm OCID and the model deployment OCID are extracted to be used later. You can learn more about alarms in the Monitoring documentation.

if alarm_msg["type"] == "OK_TO_FIRING" or alarm_msg["type"] == "REPEAT":
    alarm_metric_dimension = alarm_msg["alarmMetaData"][0]["dimensions"][0]
    alarm_id = alarm_msg["alarmMetaData"][0]["id"]
    model_deployment_id = alarm_metric_dimension["resourceId"]
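As a self-contained sketch, the parsing above could be packaged into a small helper (the message shape follows the test payload shown later in this post; the helper name is hypothetical):

```python
import io
import json

def parse_alarm_message(data: io.BytesIO):
    """Extract the fields the scaling logic needs from a Monitoring alarm message."""
    alarm_msg = json.loads(data.getvalue())
    # Only act on alarms that just started firing or are defined to refire.
    if alarm_msg["type"] not in ("OK_TO_FIRING", "REPEAT"):
        return None
    metadata = alarm_msg["alarmMetaData"][0]
    return {
        "alarm_id": metadata["id"],
        "alarm_type": alarm_msg["title"],
        "model_deployment_id": metadata["dimensions"][0]["resourceId"],
    }
```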

Create a signer to send signed requests to OCI and create a Data Science client to communicate with the service and retrieve the model deployment to scale. Learn more about OCI authentication in OCI SDK Authentication Methods and about the Python SDK data science client in the Data Science documentation.

signer = oci.auth.signers.get_resource_principals_signer()

data_science_client = oci.data_science.DataScienceClient(config={}, signer=signer)

model_deployment = data_science_client.get_model_deployment(model_deployment_id).data

The CPU or memory load can change while the model deployment is updating, such as when adding or removing an instance or changing VM shapes. You can’t update a deployment that’s already updating, so we need to ensure that it’s in an active state.

if (model_deployment.lifecycle_state == model_deployment.LIFECYCLE_STATE_ACTIVE):

Based on the alarm title, the appropriate function is called to prepare the update_model_deployment_details object, which becomes part of the model deployment update request.

alarm_type = alarm_msg['title']
if (alarm_type == 'model_deployment_high_cpu'):
   update_model_deployment_details = scale_instance_up(alarm_id=alarm_id, model_deployment=model_deployment)
elif (alarm_type == 'model_deployment_low_cpu'):
   update_model_deployment_details = scale_instance_down(alarm_id=alarm_id, model_deployment=model_deployment)
elif (alarm_type == 'model_deployment_high_mem'):
   update_model_deployment_details = scale_vm_up(alarm_id=alarm_id, model_deployment=model_deployment)
elif (alarm_type == 'model_deployment_low_mem'):
   update_model_deployment_details = scale_vm_down(alarm_id=alarm_id, model_deployment=model_deployment)

Now, call the model deployment update API to initiate the scaling:

if update_model_deployment_details is not None:
    resp = data_science_client.update_model_deployment(
        model_deployment_id=model_deployment_id,
        update_model_deployment_details=update_model_deployment_details)

Scaling functions

We have four functions for preparing the update_model_deployment_details object, one per scenario: high CPU, low CPU, high memory, and low memory.

Let’s look at the highlights of each function:

scale_instance_up

Get the current number of instances in the deployment with the following command:

current_instance_count = model_deployment.model_deployment_configuration_details.model_configuration_details.scaling_policy.instance_count

If the number of instances is already at the maximum allowed by the MAX_INSTANCES constant, don’t update the deployment.

if new_instance_count > MAX_INSTANCES:
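Put together, the count check might look like this sketch (the MAX_INSTANCES value is an assumption; align it with your tenancy limits):

```python
MAX_INSTANCES = 4  # assumed cap; tune to your quota and budget

def next_instance_count(current_instance_count: int):
    """Return the instance count for a scale-up, or None if already at the cap."""
    new_instance_count = current_instance_count + 1
    if new_instance_count > MAX_INSTANCES:
        return None  # at the maximum; leave the deployment unchanged
    return new_instance_count
```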

Create the update_model_deployment_details object with the new scaling policy.

model_deployment_configuration_details=oci.data_science.models.UpdateSingleModelDeploymentConfigurationDetails(
    model_configuration_details=oci.data_science.models.UpdateModelConfigurationDetails(

Create an alarm to report when CPU utilization drops below a threshold to scale the deployment down.

create_cpu_scale_down_alarm(alarm_id=alarm_id, model_deployment=model_deployment)

Now let’s look at that function.

create_cpu_scale_down_alarm

Create a monitoring client with the resource principal signer to communicate with the Monitoring service, then retrieve the alarm based on the ID in the message.

signer = oci.auth.signers.get_resource_principals_signer()
monitoring_client = oci.monitoring.MonitoringClient(config={}, signer=signer)
current_alarm = monitoring_client.get_alarm(alarm_id=alarm_id).data

If the same alarm was created for the same model deployment, don’t create another one. We don’t want duplicates!

alarm_summary_response = monitoring_client.list_alarms(compartment_id=current_alarm.compartment_id, display_name="model_deployment_low_cpu")
if len(alarm_summary_response.data) > 0 and (alarm_summary_response.data[0].query.find(model_deployment.id) != -1):

Now, create the alarm in the Monitoring service. Use the same parameters as the firing alarm. You can change the threshold by changing the LOW_CPU_THRESHOLD value.

alarm_query = 'CpuUtilization[1m]{resourceId = \"' + model_deployment.id + '\"}.max() < ' + str(LOW_CPU_THRESHOLD)
create_alarm_response = monitoring_client.create_alarm(
        body="Low cpu for model deployment. Instances can be scaled down to reduce costs",
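The query string is built the same way for the memory alarm later in the post, so it can be factored into a small helper. A sketch, with the threshold constants assumed rather than taken from the sample repository:

```python
# Assumed threshold values; tune to your workload.
LOW_CPU_THRESHOLD = 20
LOW_MEM_THRESHOLD = 20

def build_scale_down_query(metric_name: str, model_deployment_id: str, threshold: int) -> str:
    """Build the MQL query that fires when the metric's max drops below the threshold."""
    return (metric_name + '[1m]{resourceId = "' + model_deployment_id
            + '"}.max() < ' + str(threshold))

# For the CPU alarm: build_scale_down_query("CpuUtilization", md_id, LOW_CPU_THRESHOLD)
```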

scale_instance_down

This function is mostly the same as scale_instance_up. The differences are that scaling down verifies a minimum number of instances, because we need at least one to remain, and that it deletes the model_deployment_low_cpu alarm so that it doesn’t fire again. A new alarm is created when the deployment is scaled up again.

if current_instance_count == 1:
    delete_alarm_response = monitoring_client.delete_alarm(alarm_id=alarm_id)

scale_vm_up

Adding instances when the memory is overloaded doesn’t help lower the memory load, because the added instance has the same amount of memory and serves exactly the same model. So, we need to add more memory to the server, which means using a different VM shape with more memory. This process is called vertical scaling and is a bit more challenging than scaling horizontally because we need to know the next or previous VM shape of the same type. To solve this issue, a list of shapes called VM_SHAPES is available, ordered by shape size.

Get the current VM shape and then find its index in the list with the following command:

current_vm_shape = model_deployment.model_deployment_configuration_details.model_configuration_details.instance_configuration.instance_shape_name
vm_shape_index = VM_SHAPES[VM_STANDARD].index(current_vm_shape)

If we’re using the largest possible shape, we have nowhere to upgrade to.

if vm_shape_index == len(VM_SHAPES[VM_STANDARD]) - 1:

Otherwise, find the next shape in the list.

new_vm_shape = VM_SHAPES[VM_STANDARD][vm_shape_index+1]
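The shape-stepping logic can be sketched end to end like this (the shape names and list are illustrative, not the sample repository's actual VM_SHAPES):

```python
VM_STANDARD = "VM.Standard2"

# Illustrative ordered list: smallest to largest memory footprint.
VM_SHAPES = {
    VM_STANDARD: [
        "VM.Standard2.1",
        "VM.Standard2.2",
        "VM.Standard2.4",
        "VM.Standard2.8",
    ],
}

def next_shape_up(current_vm_shape: str):
    """Return the next-larger shape, or None if already at the largest."""
    shapes = VM_SHAPES[VM_STANDARD]
    vm_shape_index = shapes.index(current_vm_shape)
    if vm_shape_index == len(shapes) - 1:
        return None  # nowhere to upgrade to
    return shapes[vm_shape_index + 1]
```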

Now, prepare the update_model_deployment_details object for the model deployment update.


Call create_mem_scale_down_alarm to create an alarm that reports when memory utilization drops below the threshold.

create_mem_scale_down_alarm(alarm_id=alarm_id, model_deployment=model_deployment)

Let’s look at the memory scale-down alarm.

create_mem_scale_down_alarm

This function is mostly the same as the CPU scale-down alarm, with differences in the title and the query for triggering the alarm:

alarm_query = 'MemoryUtilization[1m]{resourceId = \"' + model_deployment.id + '\"}.max() < ' + str(LOW_MEM_THRESHOLD)

scale_vm_down

This function is mostly the same as scale_vm_up, only it checks if the current shape is already the smallest possible. If so, it deletes the alarm and doesn’t scale.

if vm_shape_index == 0:
    delete_alarm_response = monitoring_client.delete_alarm(alarm_id=alarm_id)

Otherwise, it finds the previous shape in the shapes list and prepares the update_model_deployment_details object.

new_vm_shape = VM_SHAPES[VM_STANDARD][vm_shape_index-1]

Now that we’ve covered the function code, let’s create and deploy it to the OCI Functions service.

Deploy the function to OCI Functions

We follow the directions given by OCI Functions for deploying functions with OCI Cloud Shell, which is the recommended method. However, feel free to deploy on a local host or OCI Compute instance if that works better for you. Follow the instructions for setting up the tenancy. We don’t list them here and instead jump directly to the application and function creation.

Create the application

  • In the Console, open the navigation menu and click Developer Services. Under Functions, click Applications.

  • Click Create Application.

  • In the create application window, fill in the details:

    • Application name: For this example, let's use model-deployment-scaling.

    • VCN: Select a VCN to use. You can use the same VCN you used for the model deployment. Read more about VCNs in the documentation. We also have a short guide on VCNs in the Data Science documentation.

    • Click Create.


Set up your Cloud Shell development environment

Click the application that you created to display the application details page, then select Getting Started and click Cloud Shell Setup. Follow the guidelines on that page (set the context, generate an auth token, and log in with the token) up to the Create, deploy, and invoke your function section.

Enable function logs

  • In your model-deployment-scaling application, click Logs in the side panel.

  • Click the toggle in the table to open the Enable Log panel.

  • Select the compartment.

  • Select the log group to use. Create one if you don't already have one.

  • Select the log name to use for the function log.

  • Select the log retention. You can leave the default.

  • Click Enable Log.

All outputs from the function (stdout, stderr) are captured in the log.

Create and deploy your function

Generate a model-deployment-autoscale function with the following command:

fn init --runtime python model-deployment-autoscale

Switch into the generated directory.

cd model-deployment-autoscale

Replace the generated files with the files from the function folder: func.py and requirements.txt. Leave func.yaml as is.

Deploy the function.

fn -v deploy --app model-deployment-scaling

Creating the notification topic and subscription

Because alarms only fire events to the Notifications service, we need to create a topic in the Notifications service and subscribe the function to it.

  • Open the navigation menu and click Developer Services. Under Application Integration, click Notifications.

  • Choose the compartment you want to work in. Preferably, use the same compartment as your function and your model deployment.

  • In the menu, click Topics and then click Create Topic at the top of the topic list.

  • Configure your topic with the following parameters:

    • Name: This example uses model-deployment-autoscaling

    • Description (Optional): Enter a description for the topic.

    • Click Create.

  • Now let’s create the subscription to invoke the function when the topic is invoked. In the details page of the topic you created, click Create Subscription and configure the subscription:

    • Protocol: Choose Function.

    • Function compartment: Select the compartment where the function was created.

    • Function application: Select model-deployment-scaling.

    • Function: Select model-deployment-autoscale

    • Click Create.

You can read more about subscriptions, topics, and the Notifications service in Managing Topics and Subscriptions.

Now that the notification is configured, we can create the alarm.

Creating the alarms

We create one alarm to fire when the CPU utilization rises above a threshold for the horizontal scaling and one alarm to fire when the memory utilization rises above a threshold for the vertical scaling. You can change the thresholds, frequency of sampling, or any other part of the alarm to your liking.

Creating the high-CPU alarm

Open the navigation menu and click Observability & Management. Under Monitoring, click Alarm Definitions.

  • Click Create Alarm.

  • Configure the alarm:

    • Alarm name: model_deployment_high_cpu. Use that exact name because it’s the identifier of the alarm in the function.

    • Alarm severity: Info

    • Metric description:

      • Compartment: Select the compartment to use, preferably the same as the function and the model deployment.

      • Metric namespace: oci_datascience_modeldeploy (the source service that the alarm monitors)

      • Resource group: Optional. This example leaves it blank.

      • Metric name: CpuUtilization. This name is the specific metric to monitor (in this case, the CPU utilization).

      • Interval: 5m. The interval in which the metric aggregations are calculated. In our case, we want to filter out burst outliers, so five minutes suffices for averaging.

      • Statistic: Mean. We use this aggregation strategy for calculating the metric during the interval. Mean filters out outliers.

    • Metric dimensions:

      • Dimension name: resourceId. We’re monitoring the entire model deployment resource.

      • Dimension value: Select the OCID of your model deployment. The OCID of the model deployment appears in the details page of the model deployment in the Oracle Cloud Console UI.

    • Trigger rule:

      • Operator: Greater than

      • Value: 80. We want to get an alarm when the CPU utilization of the deployment exceeds 80% so that the deployment scales up.

      • Trigger delay minutes: 1. The number of minutes that the condition must be maintained before the alarm is in firing state.

    • Notifications:

      • Destination service: Notifications service

      • Compartment: Select the compartment where your topic was created.

      • Topic: Select the topic you created for invoking the function.

      • Message Format: Send formatted messages.

      • Repeat notification?: Check this option. We want the alarm to fire again even if the utilization remains high after the deployment scaling.

      • Suppress notifications: Leave unchecked.

    • Enable this alarm?: Leave checked.

  • Click Save alarm.

Creating the high-memory alarm

Repeat the steps for creating the CPU alarm with the following differences:

  • Alarm name: model_deployment_high_mem

  • Metric name: MemoryUtilization

Testing the solution

Now that the function is deployed and the alarms are in place, we can test them by simulating alarm messages. We use a JSON file for easier reuse of the calls.

Create a file named test-alarm-scale-cpu-up.json and paste the following content into it:

{
    "dedupeKey": "12345678-1234-4321-1234-123456789012",
    "title": "model_deployment_high_cpu",
    "type": "OK_TO_FIRING",
    "severity": "INFO",
    "timestampEpochMillis": 1657320300000,
    "timestamp": "2022-07-08T22:45:00Z",
    "alarmMetaData": [
        {
            "id": "<YOUR_HIGH_CPU_ALARM_OCID>",
            "status": "FIRING",
            "severity": "INFO",
            "namespace": "oci_datascience_modeldeploy",
            "query": "CpuUtilization[5m]{resourceId = \"<YOUR_MODEL_DEPLOYMENT_OCID>\"}.max() > 80",
            "totalMetricsFiring": 1,
            "dimensions": [
                {
                    "resourceId": "<YOUR_MODEL_DEPLOYMENT_OCID>",
                    "instanceId": "instance:1234567890ABCDEF"
                }
            ],
            "alarmUrl": ""
        }
    ],
    "version": 1.3
}

From OCI Cloud Shell, invoke the function with the following payload:

cat test-alarm-scale-cpu-up.json | fn invoke model-deployment-scaling model-deployment-autoscale

This command invokes the function with the JSON payload, which in turn causes the function to initiate a model deployment update to increase the number of instances by one. It also creates the low-CPU alarm to scale the deployment down. This alarm fires shortly after it’s created and eventually scales the deployment back down.

You can do the same for testing the memory scale-up function.


Congrats! You now have an autoscaling solution for your model deployment. Repeat for your other model deployments.


