
StreamSets Data Collector on Oracle Cloud Infrastructure

Pinkesh Valdria
Principal Solutions Architect

We are proud to announce a validated reference architecture for StreamSets Data Collector™ on Oracle Cloud Infrastructure. StreamSets Data Collector is an open source, award-winning solution for efficiently building, testing, running, and maintaining dataflow pipelines that connect a variety of batch and streaming data sources. Starting today, you can deploy it on Oracle's high-performance cloud by using Terraform templates.

With this announcement, Oracle Cloud Infrastructure expands its ecosystem of Big Data ISV partners. The partnership between StreamSets and Oracle enables customers to use Data Collector like a pipe for a data stream: it moves, collects, and processes data on the way to its destination. Data Collector connects the hops in the stream on a unified enterprise cloud platform with unmatched performance, security, and availability.

StreamSets Data Collector

The Data Collector is a design and execution engine that streams data in real time. You use Data Collector to route and process data in your data streams by defining the flow of data (the pipeline). A pipeline consists of stages that represent the origin and destination of the pipeline, and any additional processing. The graphical UI lets you efficiently build batch and streaming data flows with minimal schema specification, connecting many sources to multiple big data solutions with built-in transformations for data normalization and cleansing.
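The idea of a pipeline as a chain of stages (an origin, optional processors, and a destination) can be sketched in a few lines of Python. This is a conceptual illustration only and not the StreamSets API: the function names `origin`, `normalize`, `drop_incomplete`, `destination`, and `run_pipeline` are hypothetical, and records are assumed to be plain dictionaries.

```python
from typing import Callable, Iterable, Iterator, List

Record = dict  # assumption: a record is a flat dict of field name -> value
Processor = Callable[[Iterable[Record]], Iterator[Record]]


def origin(rows: Iterable[Record]) -> Iterator[Record]:
    """Origin stage: emits records one at a time (here, from memory)."""
    for row in rows:
        yield row


def normalize(records: Iterable[Record]) -> Iterator[Record]:
    """Processor stage: trims and lowercases every string field."""
    for record in records:
        yield {
            key: value.strip().lower() if isinstance(value, str) else value
            for key, value in record.items()
        }


def drop_incomplete(records: Iterable[Record]) -> Iterator[Record]:
    """Processor stage: discards records that have an empty or missing field."""
    for record in records:
        if all(value not in ("", None) for value in record.values()):
            yield record


def destination(records: Iterable[Record], sink: List[Record]) -> int:
    """Destination stage: writes records to a sink and returns the count."""
    count = 0
    for record in records:
        sink.append(record)
        count += 1
    return count


def run_pipeline(rows: Iterable[Record],
                 processors: List[Processor],
                 sink: List[Record]) -> int:
    """Connects origin -> processors (in order) -> destination."""
    stream = origin(rows)
    for processor in processors:
        stream = processor(stream)
    return destination(stream, sink)
```

Because each stage is a generator, records flow through one at a time instead of being loaded all at once, which loosely mirrors how a streaming engine processes data in motion.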

Figure 1: StreamSets Data Collector Web UI

Learn more about StreamSets Data Collector

Oracle Cloud Infrastructure Provides Big Data Flexibility and Performance

Blazing Fast Performance

Oracle offers the most powerful bare metal compute instances with local NVMe flash storage in the industry. Only Oracle offers this local storage, based on advanced NVMe SSD technology, and backed by a storage performance SLA.

Oracle also offers DenseIO virtual machines (VMs), a new high-performance instance type with large local storage backed by NVMe SSD. DenseIO VMs are available in multiple shapes, including 4, 8, and 16 OCPUs, allowing you to customize compute resources for your I/O-bound and storage-bound applications. Oracle also offers standard VM instances with block storage. See our compute page for more details.

Data Collector can take advantage of the bare metal compute instances, which are connected in clusters to a nonoversubscribed 25-gigabit network infrastructure. This network guarantees low latency and high throughput, a key requirement for high-performance, distributed, streaming workloads. Oracle Cloud Infrastructure is the only cloud provider that offers a guaranteed 25-Gbps connection between any two nodes (network throughput performance SLA).
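To put a 25-Gbps link in perspective, the following back-of-the-envelope calculation estimates the best-case time to move a dataset between two nodes. It is a rough sketch that assumes the full line rate is available and ignores protocol overhead, disk speed, and contention; the function name `transfer_seconds` and the 1 TB example size are illustrative and not taken from any Oracle or StreamSets tooling.

```python
def transfer_seconds(size_bytes: float, link_gbps: float = 25.0) -> float:
    """Best-case transfer time for size_bytes over a link of link_gbps.

    Uses decimal units: 1 Gbps = 1e9 bits per second.
    """
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    bits = size_bytes * 8
    return bits / (link_gbps * 1e9)


# Example: 1 TB (1e12 bytes) at 25 Gbps is 8e12 bits / 25e9 bits per second.
one_terabyte = 1e12
print(f"{transfer_seconds(one_terabyte):.0f} s")
```

In practice, observed throughput is lower than this bound, so treat the result as a floor on transfer time and not as a prediction.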

Unmatched Data Ecosystem

Data Collector instances that are spun up in the cloud can sit right next to your favorite Hadoop/Spark clusters using Cloudera, Hortonworks, or MapR, and also connect to many other data sources to route and process data on the way to its destination. Data Collector comes with a large number of ready-to-use data origin and destination connectors, so you can build data pipelines in hours (not weeks) without any coding, which reduces development costs.

Right-Size Your Infrastructure in the Cloud

Cloud infrastructure enables you to deploy the optimal amount of infrastructure to meet your demands. No more underutilization of too much infrastructure or higher latency caused by underforecasting. In addition, Oracle offers:

  • The lowest compute pricing from a pay-as-you-go (PAYG) perspective
  • The lowest network egress costs in the industry

Deploying StreamSets Data Collector

You can deploy Data Collector on Oracle Cloud Infrastructure by using Terraform automation, which is fast becoming the leading cross-cloud framework for infrastructure as code (IaC). The Terraform template performs all of the steps necessary to deploy and configure a standalone Data Collector instance. Optionally, Data Collector instances can connect to StreamSets Control Hub, which you can use to manage all Data Collector instances.

You can customize the Terraform deployment template by choosing the shape for the Data Collector instance, changing the CIDR block sizes for the virtual cloud network and subnets, and changing other configuration settings. For details about the Terraform templates, see the readme.md file.
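Before changing CIDR block sizes, it can help to check how a virtual cloud network range divides into subnets. The sketch below uses Python's standard `ipaddress` module for that check. The `10.0.0.0/16` network, the `/24` subnet size, and the helper name `plan_subnets` are illustrative assumptions and are not values taken from the Terraform templates.

```python
import ipaddress
from itertools import islice
from typing import List


def plan_subnets(vcn_cidr: str, subnet_prefix: int, count: int) -> List[str]:
    """Return the first `count` subnets of size /subnet_prefix inside vcn_cidr.

    Raises ValueError if the network is invalid, if subnet_prefix is shorter
    than the network's own prefix, or if the network cannot hold `count`
    subnets of the requested size.
    """
    vcn = ipaddress.ip_network(vcn_cidr)
    subnets = [str(net) for net in islice(vcn.subnets(new_prefix=subnet_prefix), count)]
    if len(subnets) < count:
        raise ValueError(
            f"{vcn_cidr} cannot hold {count} subnets of size /{subnet_prefix}"
        )
    return subnets


# Example: carve three /24 subnets out of a hypothetical /16 network.
for cidr in plan_subnets("10.0.0.0/16", 24, 3):
    print(cidr)
```

A check like this catches overlapping or undersized ranges before Terraform attempts to create the network resources.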

Figure 2: StreamSets Data Collector Standalone on Oracle Cloud Infrastructure Architecture

In the future, we will add information and templates for deploying Data Collector standalone with a Cloudera Enterprise Data Hub cluster and Data Collector via the Cloudera CDH Parcel Manager.

What’s Next?

  1. If you don’t have an Oracle Cloud Infrastructure account yet, you can sign up for a 30-day free trial account.
  2. Follow the instructions on the GitHub Oracle Cloud Infrastructure StreamSets page to install Data Collector on Oracle Cloud Infrastructure.
  3. Come and meet us at the Oracle OpenWorld booth #OCI-A01 to learn more about our Big Data ecosystem offerings.
  4. We also encourage you to read how StreamSets views the new partnership and why OCI and StreamSets are a great fit to move, collect, and process data in the cloud.

We hope you are as excited as we are about the StreamSets Data Collector on Oracle Cloud Infrastructure solution. Let us know what you think!

Pinkesh Valdria

Principal Solutions Architect, Big Data

