
Serverless Big Data Pipelines Architecture

Alexander Koenig
Principal Product Manager

Cloud technology has enabled data scientists and data analysts to deliver value without investing in extensive infrastructure. A serverless architecture goes a step further by reducing the associated costs to per-use billing.

Oracle Cloud Infrastructure (sometimes referred to as OCI) offers per-second billing for many of its services. Together with Oracle Functions, a serverless platform based on the open source Fn Project, this infrastructure lets you build a Big Data pipeline that is charged according to its actual runtime.

Big Data Pipeline Phases

Big Data is often viewed as a pipeline or combination of services that provides efficient, robust, and scalable insight into data to deliver value to customers. Those pipelines are often divided into the following phases:

  1. Ingestion

    In this phase, data is loaded from various sources, such as streams, APIs, logging services, or direct uploads. This data can originate from many devices or applications (mobile apps, websites, IoT devices, and so on) and commonly arrives in a nonbinary, semistructured or unstructured format, such as CSV or JSON.

  2. Data Lake

    In this phase, raw data is held in a repository. In cloud environments, object storage is often used because it provides a highly scalable, highly available, and inexpensive way to store data.

  3. Preparation and Computation

    In this phase, data is extracted, transformed, and loaded (ETL). Data preparation produces a cleansed, conformant data format as output for further processing. ETL can be done as a batch or as a stream. Computation lets data scientists create models from the data; for example, they might use incoming data to train machine learning models. Apache Spark and the Hadoop ecosystem are widely considered the leading technologies in this space. Spark is provided on Oracle Cloud Infrastructure through the Data Flow service, and Hadoop is provided as a managed Cloudera service (Oracle Big Data Cloud Service). For machine learning, Oracle provides the Data Science service.

  4. Data Warehouse

    In this phase, data is stored in a structured format in a database. For Big Data, this structured storage is possible only after ETL processing. Managing a data warehouse can be complicated because you need to consider high availability, fault tolerance, scalability, security, patching, and so on. To meet this challenge, Oracle provides Autonomous Data Warehouse, a managed service that performs all those tasks autonomously.

  5. Presentation

    In this phase, data is presented in analytics or business intelligence tools, commonly using graphics and providing filtering and dashboarding capabilities. These tools usually read from the data warehouse, because that’s where all the relevant data exists in a structured format for efficient processing. For this phase, Oracle provides Oracle Analytics Cloud.

A screenshot that shows the five phases of a Big Data pipeline described in the text, with arrows showing how services in one phase flow to services in another phase.

Phases of a Big Data Pipeline in Oracle Cloud Infrastructure

Building a data pipeline requires automation: services are launched on demand, and the relevant data needs to be loaded as it arrives. To this end, Oracle Cloud Infrastructure provides capabilities and services such as Events and Functions.

Big Data Pipeline Example

The following example shows how the upload of a CSV file triggers the creation of a Data Flow application through Events and Functions. The application infers the schema and converts the file into a Parquet file for further processing. This process could be one ETL step in a larger data processing pipeline.

The required Python code is provided in this GitHub repository.

  1. A CSV file is loaded into Oracle Cloud Infrastructure Object Storage (phases 1 and 2). The upload can also be a stream, as Todd Sharp describes in his blog post. (A scripted version of this upload is sketched after these steps.)

    A screenshot that shows the Upload Objects dialog box for a bucket in the Object Storage page of the Console.

  2. Object Storage emits an event that starts a function (phase 2). This tutorial shows how to write functions with Python. (A sketch of such a function handler follows these steps.)

    A screenshot that shows the Rules details page for an event.

    A screenshot that shows the application information for a function starting a data flow.

    A screenshot that shows the configuration of a function starting a data flow.

  3. The function creates a Data Flow application, which converts the CSV into a Parquet file (phase 3). (The PySpark code for this conversion is sketched after these steps.)

    A screenshot that shows the resource configuration for the CSV file that has been converted to Parquet.

    A screenshot from Data Flow that shows the new Parquet file.
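
The upload in step 1 can also be scripted instead of using the Console. The following is a minimal sketch using the OCI Python SDK; the bucket name, object name, and configuration file are placeholders for this example.

    # Minimal sketch: upload a CSV file to Object Storage with the OCI Python SDK.
    # The bucket name, object name, and config profile are placeholders.
    import oci

    config = oci.config.from_file()  # reads ~/.oci/config by default
    object_storage = oci.object_storage.ObjectStorageClient(config)
    namespace = object_storage.get_namespace().data

    with open("sales_data.csv", "rb") as f:
        object_storage.put_object(
            namespace_name=namespace,
            bucket_name="raw-data",        # the bucket watched by the event rule
            object_name="sales_data.csv",
            put_object_body=f,
        )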
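
The function in step 2 can be kept small. The following is a minimal sketch of such a handler, using the Fn Python FDK and the OCI Python SDK with a resource principal. It assumes the Data Flow application already exists and that its OCID and the compartment OCID are supplied through the function configuration; all names and arguments are placeholders.

    # Minimal sketch of an Oracle Function (Fn Python FDK) that starts a run of an
    # existing Data Flow application when an Object Storage event arrives.
    # The OCIDs and argument names are placeholders supplied via configuration.
    import io
    import json
    import os

    import oci
    from fdk import response


    def handler(ctx, data: io.BytesIO = None):
        event = json.loads(data.getvalue())
        # Object Storage events carry the name of the uploaded object.
        object_name = event["data"]["resourceName"]

        # Authenticate as the function itself (resource principal).
        signer = oci.auth.signers.get_resource_principals_signer()
        df_client = oci.data_flow.DataFlowClient(config={}, signer=signer)

        run_details = oci.data_flow.models.CreateRunDetails(
            application_id=os.environ["DATAFLOW_APPLICATION_OCID"],
            compartment_id=os.environ["COMPARTMENT_OCID"],
            display_name="csv-to-parquet-" + object_name,
            arguments=["--input", object_name],  # passed to the Spark application
        )
        run = df_client.create_run(run_details)

        return response.Response(
            ctx,
            response_data=json.dumps({"run_id": run.data.id}),
            headers={"Content-Type": "application/json"},
        )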
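
The Data Flow application in step 3 is an ordinary PySpark job. The following is a minimal sketch of the conversion, assuming the input and output locations are passed as arguments; the bucket and namespace names in the oci:// URIs are placeholders.

    # Minimal PySpark sketch for Data Flow: read a CSV with an inferred schema
    # and write it back to Object Storage as Parquet.
    # The bucket and namespace in the oci:// URIs are placeholders.
    import argparse

    from pyspark.sql import SparkSession


    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--input", default="oci://raw-data@mynamespace/sales_data.csv")
        parser.add_argument("--output", default="oci://processed-data@mynamespace/sales_data.parquet")
        args = parser.parse_args()

        spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

        # Infer column types from the CSV content instead of reading everything as strings.
        df = spark.read.option("header", "true").option("inferSchema", "true").csv(args.input)
        df.write.mode("overwrite").parquet(args.output)

        spark.stop()


    if __name__ == "__main__":
        main()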

In this example, the reformatted files can be used to create data science models with the Data Science service. The models can be deployed to functions to run in a serverless manner (phase 3).

The same logic can be used to trigger further processing, such as uploading the file to Autonomous Data Warehouse (phase 4). This process works similarly to this example.
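
One way to implement that step is a function that calls DBMS_CLOUD.COPY_DATA in the Autonomous Database to copy the Parquet output from Object Storage into a table. The following is a minimal sketch using the python-oracledb driver; the connection details, credential, table, and file URI are placeholders, and the sketch assumes the credential and target table already exist in the database.

    # Minimal sketch: copy the Parquet output from Object Storage into an
    # Autonomous Data Warehouse table with DBMS_CLOUD.COPY_DATA.
    # Connection details, credential, table, and file URI are placeholders.
    import oracledb

    connection = oracledb.connect(
        user="ADMIN",
        password="...",              # better: fetch from a vault
        dsn="mydb_high",             # TNS alias from the Autonomous Database wallet
        config_dir="/path/to/wallet",
        wallet_location="/path/to/wallet",
        wallet_password="...",
    )

    plsql = """
    BEGIN
      DBMS_CLOUD.COPY_DATA(
        table_name      => 'SALES_DATA',
        credential_name => 'OBJ_STORE_CRED',
        file_uri_list   => 'https://objectstorage.us-phoenix-1.oraclecloud.com/n/mynamespace/b/processed-data/o/sales_data.parquet/*.parquet',
        format          => '{"type":"parquet"}'
      );
    END;
    """

    with connection.cursor() as cursor:
        cursor.execute(plsql)
    connection.close()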

Autonomous Data Warehouse is ideal to use as a source for data analysis with Oracle Analytics Cloud (phase 5).

The Serverless Perspective

The services in this architecture follow a usage-based pricing model:

  • Object Storage: Billed per stored amount

  • Events: Free

  • Functions: Billed for execution time (in seconds)

  • Data Flow: Billed for execution time (in seconds)

  • Data Science: Billed when running (in seconds)

  • Autonomous Data Warehouse: Billed when running (in seconds)

Conclusion

This article demonstrates how different services can be combined to build a complete Big Data pipeline. Oracle provides data analysts and data scientists with familiar technology stacks, such as Spark and Python, and enriches them with the latest innovations in cloud technology, such as serverless functionality. Event triggers and automation provide easy integration while significantly lowering cost through on-demand billing.

If you don’t already have an Oracle Cloud account, sign up for a free trial today.
