Simplify Spark monitoring with OCI Data Flow application metrics

February 18, 2021 | 4 minute read
Carter Shanklin
Sr. Director of Product Management
Text Size 100%:

Oracle Cloud Infrastructure (OCI) Data Flow is a fully managed Big Data service that runs Apache Spark applications at any scale with almost no administration. Spark has become the leading Big Data processing framework, and OCI Data Flow is the easiest way to run Spark in Oracle Cloud because there’s nothing for Spark developers to install or manage.

The problem

Big Data applications crunch large amounts of data, such as logs or sensor data, and summarize this data for downstream consumers, such as dashboards or user-facing applications. Because downstream consumers need this data fresh, Big Data applications need to complete within a certain time window, referred to as a service-level agreement (SLA).

Consider a manufacturing sensor application that crunches hundreds of gigabytes of data per minute and summarizes it in a dashboard. The business needs data in the dashboard to be no more than five minutes old to avoid manufacturing problems. That five-minute threshold becomes the SLA, and the operations team needs to be alerted immediately if the SLA is not met.

Because these applications run on a schedule, no one is watching them, and Operations needs to be alerted if things are running too slow or failing. Until now, Data Flow users had to solve this problem themselves, building custom management scripts or using OCI’s Apache Airflow plugin for Data Flow. Data Flow application metrics give the power to solve these problems directly within the OCI Console.

Announcing Data Flow application metrics

Available now, Data Flow applications include metrics to track SLAs and stay on top of Spark application health. Look deeper into a Data Flow application, and the following SLA statistics are available:

  • Startup times for each run of the Data Flow application

  • Execution times for each run of the Data Flow application

  • Total run time = startup time + process time

These metrics show how consistently the application hits its SLAs. These metrics also integrate with the OCI Monitoring service, allowing alerts through PagerDuty, Slack, or email if things go wrong. Let’s look at an example.

Example: Alert when a spark application runs too long

Let’s say we have an application that needs to complete in less than 10 minutes. We start by navigating to the specific application in the OCI Console.

A screenshot of the OCI Console, showing the applications in Data Flow outlined in red.

When we select the application in the Console, we see a panel called Total Run Time. Selecting Options inside this panel lets us create an alarm on this query.

A screenshot of the Metric page in the Console, with the Options menu in the Total Run Time section expanded and Create an Alarm on this Query outlined in red.

Clicking this option takes us to an alarm definition interface with many fields prepopulated.

This blog doesn’t cover basic setup details for alarms like required IAM policies. For an overview, refer to the Alarms section of the OCI Monitoring service. To create our alarm, we need to do the following tasks:

  • Configure our metric name and dimensions

  • Set the SLA threshold in the Trigger rule section

  • Link the alarm to a notification topic

First, the metric name and dimension. Since we’re tracking the total run time, we use the RunTotalTime metric and the Max statistic. For the dimension, we select the specific Data Flow application that we’re monitoring.

A screenshot of the Metric description page with RunTotalTime, Statistic, Dimension name, and Dimension value sections outlined in red.

Set the trigger rule value to the SLA in seconds. Because we’re targeting 10 minutes, we set this to 600 seconds.

A screenshot of the Trigger rule page with the Value section outlined in red.

Finally, link the alarm to a notification topic. The topic controls how alerts are delivered and to whom. For more detail on how, see the Notifications service documentation.

A screenshot of the Notifications section in the Console with the Topic menu outlined in red.

With this alarm, any run of the Data Flow application lasting longer than 600 seconds causes the alarm to go off.

Other use cases

Application metrics address the following common monitoring tasks:

  • Alerting if jobs fail, or they fail repeatedly.

  • Alerting if too many jobs are running concurrently.

  • Alerting if a job takes a long time to start.

See for yourself

Data Flow application metrics are available now in all commercial regions. With Data Flow, you only pay for the resources that your Spark jobs use while Spark applications are running, so there’s no extra charge for these metrics. To learn more about Data Flow application metrics, see the Data Flow documentation.

To get started today, sign up for the Oracle Cloud Free Trial or sign in to your account to try OCI Data Flow. Try Data Flow’s 15-minute no-installation-required tutorial to see how easy Spark processing can be with Oracle Cloud Infrastructure.

Carter Shanklin

Sr. Director of Product Management

<p>Carter Shanklin is Senior Director of Product Management for Big Data in Oracle Cloud Platform. Carter has over 10 years of experience developing data management and big data products including Hadoop, Spark and SQL-based solutions.</p>

Previous Post

A retail perspective: Driving digital transformation with Oracle Cloud

Guru Sulibhavi | 6 min read

Next Post

GoDaddy DNS management through Oracle Cloud Infrastructure

Cody Brinkman | 9 min read