We are proud to announce a validated reference architecture for StreamSets Data Collector™ on Oracle Cloud Infrastructure. Starting today, you can use Terraform templates to deploy StreamSets Data Collector on Oracle's high-performance cloud. Data Collector is an open source, award-winning solution for efficiently building, testing, running, and maintaining dataflow pipelines that connect a variety of batch and streaming data sources.
With this announcement, Oracle Cloud Infrastructure expands its ecosystem of Big Data ISV partners. The partnership between StreamSets and Oracle enables customers to use Data Collector like a pipe for a data stream: it moves, collects, and processes data on the way to its destination. Data Collector connects the hops in the stream on a unified enterprise cloud platform with unmatched performance, security, and availability.
Data Collector is a design and execution engine that streams data in real time. You use Data Collector to route and process data in your data streams by defining the flow of data (the pipeline). A pipeline consists of stages that represent the origin and the destination of the data, plus any additional processing in between. The graphical UI lets you efficiently build batch and streaming data flows with minimal schema specification, connecting many sources to multiple big data solutions with built-in transformations for data normalization and cleansing.
Figure 1: StreamSets Data Collector Web UI
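To make the origin, processor, and destination idea concrete, here is a minimal conceptual sketch in plain Python. It is not the StreamSets API: every function name and record field below is a hypothetical stand-in, chosen only to show how stages chain together and how normalization and cleansing steps fit between an origin and a destination.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]
Processor = Callable[[Iterable[Record]], Iterator[Record]]


def origin() -> Iterator[Record]:
    """Hypothetical origin stage: emits raw, untyped records."""
    yield {"name": " alice ", "age": "30"}
    yield {"name": "BOB", "age": "n/a"}
    yield {"name": "carol", "age": "41"}


def normalize_name(records: Iterable[Record]) -> Iterator[Record]:
    """Normalization step: trim whitespace and title-case the name."""
    for record in records:
        yield {**record, "name": str(record["name"]).strip().title()}


def drop_invalid_age(records: Iterable[Record]) -> Iterator[Record]:
    """Cleansing step: keep only records whose age is numeric."""
    for record in records:
        age = str(record["age"])
        if age.isdigit():
            yield {**record, "age": int(age)}


def run_pipeline(
    source: Iterable[Record],
    processors: List[Processor],
    destination: Callable[[Record], None],
) -> None:
    """Chain the stages lazily, then push each record to the destination."""
    stream: Iterable[Record] = source
    for processor in processors:
        stream = processor(stream)
    for record in stream:
        destination(record)


# Hypothetical destination: an in-memory list standing in for a real sink.
sink: List[Record] = []
run_pipeline(origin(), [normalize_name, drop_invalid_age], sink.append)
```

Because each stage is a generator, records flow through one at a time rather than being collected between stages, which loosely mirrors how a streaming pipeline behaves.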
Oracle offers the most powerful bare metal compute instances with local NVMe flash storage in the industry. Only Oracle offers this local storage, based on advanced NVMe SSD technology, and backed by a storage performance SLA.
Oracle also offers DenseIO virtual machines (VMs), a new high-performance instance type with large local storage backed by NVMe SSDs. DenseIO VMs are available in multiple shapes, including 4, 8, and 16 OCPUs, allowing you to customize compute resources for your I/O-bound and storage-bound applications. Oracle also offers standard VM instances with block storage. See our compute page for more details.
Data Collector can take advantage of the bare metal compute instances, which are connected in clusters to a non-oversubscribed 25-gigabit network infrastructure. This network guarantees low latency and high throughput, a key requirement for high-performance, distributed, streaming workloads. Oracle Cloud Infrastructure is the only cloud provider that offers a guaranteed 25-Gbps connection between any two nodes (network throughput performance SLA).
Data Collector instances that are spun up in the cloud can sit right next to your favorite Hadoop/Spark clusters using Cloudera, Hortonworks, or MapR, and can also connect to many other data sources to route and process data on the way to its destination. Data Collector comes with a large number of ready-to-use origin and destination connectors, so you can build data pipelines in hours (not weeks) without any coding, which reduces development costs.
Cloud infrastructure enables you to deploy the optimal amount of infrastructure to meet your demands. No more underutilization of too much infrastructure or higher latency caused by underforecasting. In addition, Oracle offers:
You can deploy Data Collector on Oracle Cloud Infrastructure by using Terraform, which is fast becoming the leading cross-cloud framework for infrastructure as code (IaC). The Terraform template performs all of the steps necessary to deploy and configure a standalone StreamSets Data Collector instance. Optionally, you can connect your Data Collector instances to StreamSets Control Hub to manage them all from one place.
You can customize the Terraform deployment template by choosing the shape for the Data Collector instance, changing the CIDR block sizes for the virtual cloud network and subnets, and changing other configuration settings. For details about the Terraform templates, see the readme.md file.
Figure 2: StreamSets Data Collector Standalone on Oracle Cloud Infrastructure Architecture
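If you plan to change the CIDR block sizes for the virtual cloud network and subnets, it can help to check that the subnet blocks fit inside the network block before you edit the template. The following is a small planning sketch that uses Python's standard ipaddress module. The function name and the example address ranges are assumptions made for illustration; they are not taken from the Terraform templates, so check the readme.md file for the actual variable names and defaults.

```python
import ipaddress
from itertools import islice
from typing import List


def plan_subnets(vcn_cidr: str, subnet_prefix: int, count: int) -> List[str]:
    """Carve `count` equally sized subnets out of a network CIDR block.

    Hypothetical helper: the inputs are illustrative, not template values.
    """
    vcn = ipaddress.ip_network(vcn_cidr)
    if subnet_prefix < vcn.prefixlen:
        raise ValueError("subnet prefix must not be shorter than the network prefix")
    # subnets() yields candidate blocks lazily; take only as many as needed.
    planned = list(islice(vcn.subnets(new_prefix=subnet_prefix), count))
    if len(planned) < count:
        raise ValueError("network block is too small for the requested subnets")
    return [str(subnet) for subnet in planned]


# Example: three /24 subnets inside an assumed 10.0.0.0/16 network block.
subnet_cidrs = plan_subnets("10.0.0.0/16", 24, 3)
print(subnet_cidrs)  # ['10.0.0.0/24', '10.0.1.0/24', '10.0.2.0/24']
```

The helper raises an error when the requested subnets cannot fit, so a sizing mistake can be caught before Terraform is run.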
In the future, we will add information and templates for deploying standalone Data Collector alongside a Cloudera Enterprise Data Hub cluster, and for deploying Data Collector through the Cloudera CDH Parcel Manager.
We hope you are as excited as we are about the StreamSets Data Collector on Oracle Cloud Infrastructure solution. Let us know what you think!
Principal Solutions Architect, Big Data