March 16, 2020

Container Engine for Kubernetes Service (OKE) Provides Built-In Monitoring

Mickey Boxell
Product Management

We are excited to announce the general availability of built-in Kubernetes monitoring for all Container Engine for Kubernetes Service (OKE) supported regions. The new integrated monitoring capability provides OKE users with a number of essential health and performance metrics. Users can drill down on metrics critical to monitoring resource operations from their OKE clusters, node pools, and nodes. These metrics can be viewed using the OKE and OCI Monitoring consoles. Additionally, alarms can be created using industry standard statistics, trigger operators, and time intervals.

Some of the new metrics include:

  • Unschedulable pods, which can be used to trigger node pool scale operations when there are insufficient resources on which to schedule pods

  • API Server requests per second, which helps you identify underlying performance issues in the Kubernetes API server

Refer to the OKE Monitoring documentation for more detailed information.


Node Pool Scaling Use Case

Let's dive deeper into the first example use case above: using the Unschedulable Pods metric to trigger Kubernetes node pool scaling. To do this, create an alarm that sends a message through the Notifications service, which in turn triggers a function that scales up a node pool.

This scenario involves writing a function that scales a node pool and creating an alarm that sends a message to that function. When the alarm fires, the Notifications service sends the alarm message to the destination topic, which then fans out to the topic's subscriptions. In this scenario, the topic's subscriptions include the function as well as your email. The function is invoked on receipt of the alarm message.
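Because the topic delivers every alarm message to the function, the function usually needs to decide whether a given message calls for action. Here is a minimal sketch of that check. It assumes the alarm message is a JSON object with a "type" field whose value is "OK_TO_FIRING" when the alarm starts firing; both the field name and the value are assumptions that should be verified against a real message, and `should_scale_up` is a hypothetical helper name.

```python
import json

# Alarm transitions that should trigger a scale-up. The "type" field and the
# "OK_TO_FIRING" value are assumptions about the alarm message format;
# check them against a message actually delivered to your topic.
SCALE_UP_TYPES = {"OK_TO_FIRING"}


def should_scale_up(raw_message):
    """Return True if the alarm message reports an alarm that just started firing."""
    try:
        message = json.loads(raw_message)
    except (TypeError, ValueError):
        # Not JSON at all (or not a string/bytes payload): ignore it.
        return False
    if not isinstance(message, dict):
        return False
    return message.get("type") in SCALE_UP_TYPES
```

Filtering on the transition type keeps repeat or "back to OK" messages from adding another node each time they arrive.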

To use Oracle Cloud Infrastructure, you must be given the required type of access in a policy. You need access to Container Engine for Kubernetes, Monitoring, Notifications, and Functions. To scale a node pool, the function must be authorized to update node pools. To authorize your function for access to other Oracle Cloud Infrastructure resources, such as node pools, include the function in a dynamic group and create a policy that grants the dynamic group access to those resources. For more information, see Accessing Other Oracle Cloud Infrastructure Resources from Running Functions.


Create A Function and Policy

Create a function that gets the number of nodes in a node pool and then uses the UpdateNodePool operation to increment that number by one. For more information on the available API calls, refer to the Container Engine for Kubernetes API reference. Here is another example of using a function triggered by the Monitoring service to resize a VM. After creating your function, include it in a dynamic group and grant the dynamic group access to Container Engine for Kubernetes.
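The core of such a function might look like the sketch below. It is not the implementation behind this post. It assumes the OCI Python SDK's `ContainerEngineClient` (`get_node_pool`, `update_node_pool`) and its `UpdateNodePoolDetails` and `UpdateNodePoolNodeConfigDetails` models, and it assumes the node pool reports its size in `node_config_details.size`. The `max_size` cap and the helper names are hypothetical additions for illustration.

```python
def next_pool_size(current_size, max_size):
    """Return current_size + 1, or current_size if the pool is already at the cap."""
    if current_size >= max_size:
        return current_size
    return current_size + 1


def oci_update_details(size):
    """Build the UpdateNodePool request body using the OCI Python SDK models."""
    import oci  # imported lazily so the sizing logic runs without the SDK installed

    models = oci.container_engine.models
    return models.UpdateNodePoolDetails(
        node_config_details=models.UpdateNodePoolNodeConfigDetails(size=size)
    )


def scale_up_node_pool(client, node_pool_id, max_size=10,
                       build_details=oci_update_details):
    """Add one node to the pool unless it has reached max_size; return the target size.

    `client` is expected to behave like oci.container_engine.ContainerEngineClient.
    `max_size` is a hypothetical safety cap, not something the service requires.
    """
    node_pool = client.get_node_pool(node_pool_id).data
    # Assumption: the pool was created with node_config_details populated.
    current = node_pool.node_config_details.size
    target = next_pool_size(current, max_size)
    if target == current:
        return current  # already at the cap, nothing to update
    client.update_node_pool(node_pool_id, build_details(target))
    return target
```

Inside a deployed function, the client would typically be constructed with a resource principal signer, for example `oci.container_engine.ContainerEngineClient(config={}, signer=oci.auth.signers.get_resource_principals_signer())`, which is what makes the dynamic group policy above take effect.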


Create A Notification and Alarm Topic

Create an alarm with the name "Unschedulable Pods" and choose the compartment in which your cluster is deployed. Under metric namespace, choose "oci_oke", and for metric name, choose "Unschedulable Pods".

Next, create a trigger rule that uses "greater than or equal to" as the operator, "1" as the value, and "1" for trigger delay minutes. Select your function under Notifications, Destinations and choose the appropriate compartment for your cluster. Create a new topic called Alarm Topic with Function as the Subscription Protocol and then select your Function Compartment, Application, and Function. Save the alarm.
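The console turns these choices into a Monitoring Query Language (MQL) expression. The sketch below shows how the same trigger rule could be written as a query string, following the `metric[interval]{dimension}.statistic()` shape used in the CLI example later in this post. The metric name `UnschedulablePods`, the `1m` interval, and the `mean` statistic are assumptions, and `build_alarm_query` is a hypothetical helper.

```python
# Map the console's operator labels to MQL comparison operators.
OPERATORS = {
    "greater than": ">",
    "greater than or equal to": ">=",
    "less than": "<",
    "less than or equal to": "<=",
    "equal to": "==",
    "not equal to": "!=",
}


def build_alarm_query(metric, interval, statistic, operator, threshold,
                      dimensions=None):
    """Return an MQL alarm query such as 'UnschedulablePods[1m].mean() >= 1'."""
    if operator not in OPERATORS:
        raise ValueError(f"unsupported operator: {operator!r}")
    dimension_filter = ""
    if dimensions:
        pairs = ", ".join(f'{key} = "{value}"' for key, value in dimensions.items())
        dimension_filter = "{" + pairs + "}"
    return (f"{metric}[{interval}]{dimension_filter}.{statistic}() "
            f"{OPERATORS[operator]} {threshold}")


# The trigger rule described above, with an assumed metric name and statistic.
query = build_alarm_query("UnschedulablePods", "1m", "mean",
                          "greater than or equal to", 1)
```

Adding a `clusterId` dimension limits the alarm to a single cluster, which matters when several clusters share a compartment.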

The next time your Unschedulable Pods count reaches 1 or greater, a scale-up operation will be triggered for your node pool.


Accessing Metrics

Telemetry data from Container Engine for Kubernetes is available in a number of ways: the Container Clusters (OKE) console, the Monitoring Service Metrics page, the API, and the CLI. Below are examples of the same data accessed in two different ways.

Here is the Container Clusters (OKE) console view of the API Server Requests metric with a 5 minute interval:

Here is the same API Server Requests metric with a 5 minute interval, this time accessed via the CLI (some of the telemetry data was manually removed so that it fits better on the page):


$ oci monitoring metric-data summarize-metrics-data --namespace oci_oke --compartment-id ocid1.compartment.oc1... --query-text='(APIServerRequestCount[5m]{ clusterId="ocid1.cluster.oc1.eu-zurich-1"}.rate() )'
{
  "data": [
    {
      "aggregated-datapoints": [
        {
          "timestamp": "2020-03-12T15:47:00+00:00",
          "value": 9.24907063197026
        },
        {
          "timestamp": "2020-03-12T15:52:00+00:00",
          "value": 9.20446096654275
        },
        {
          "timestamp": "2020-03-12T15:57:00+00:00",
          "value": 9.22962962962963
        }
      ],
      "compartment-id": "ocid1.compartment.oc1..",
      "dimensions": {
        "clusterId": "ocid1.cluster.oc1.eu-zurich-1",
        "resourceDisplayName": "monitoring",
        "resourceId": "ocid1.cluster.oc1.eu-zurich-1"
      },
      "metadata": {
        "displayName": "APIServer Requests",
        "unit": "count"
      },
      "name": "APIServerRequestCount",
      "namespace": "oci_oke",
      "resolution": null,
      "resource-group": null
    }
  ]
}


Refer to the OKE Monitoring documentation for more information regarding accessing metrics.
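If you want to post-process the CLI output rather than read it by eye, a short script is enough. The sketch below assumes the CLI prints a JSON object with a top-level "data" list whose entries each carry an "aggregated-datapoints" list of "timestamp" and "value" pairs, as in the output above. `summarize_datapoints` is a hypothetical helper name.

```python
import json


def summarize_datapoints(cli_output):
    """Return the count, mean, and most recent value across all metric streams.

    Returns None when the response contains no datapoints.
    """
    response = json.loads(cli_output)
    points = [
        point
        for stream in response.get("data", [])
        for point in stream.get("aggregated-datapoints", [])
    ]
    if not points:
        return None
    values = [point["value"] for point in points]
    # Assumption: all timestamps share one ISO 8601 format and UTC offset,
    # so comparing them as strings orders them chronologically.
    latest = max(points, key=lambda point: point["timestamp"])
    return {
        "count": len(values),
        "mean": sum(values) / len(values),
        "latest": latest["value"],
    }
```

One way to use it is to redirect the CLI output to a file and pass the file's contents to `summarize_datapoints`.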
