Effortlessly set up customized clusters with OCI Big Data cluster profiles

May 3, 2023 | 6 minute read
Anand Chandak
Principal Product Manager

The Oracle Cloud Infrastructure (OCI) Big Data service with Apache Hadoop is a fully managed cloud service from OCI that allows you to process big data workloads using popular open source frameworks, such as Hadoop, Spark, Hive, and HBase. One of the features of OCI Big Data is the ability to create and manage Hadoop clusters of different profiles or types. In this blog post, we discuss the cluster profile feature in OCI Big Data, its benefits, and how to use it.

What is OCI Big Data?

OCI Big Data is a cloud-based big data processing service offered by OCI. The fully managed platform enables customers to process large amounts of data using popular open source big data frameworks, such as Hadoop, Spark, Hive, and Kafka. The service is designed to simplify the process of deploying and managing big data solutions, and you can integrate it with other OCI services and on-premises systems.

Use cases for OCI Big Data

The following examples show the top use cases for OCI Big Data:

  • ETL processing: Extract, transform, load (ETL) is a common use case for OCI Big Data. You can use OCI Big Data to process large amounts of data, transform it into a format suitable for analysis, and load it into a data warehouse or other storage system.

  • Data analysis: You can use OCI Big Data for data analysis and exploration, for example by running Apache Spark jobs to analyze data and generate insights for business intelligence, data visualization, and machine learning (ML) applications.

  • Machine learning: You can use OCI Big Data for ML applications to train models on large datasets and then use the models for prediction and other tasks.

  • Log processing: With OCI Big Data, you can process and analyze log data from web servers, application servers, and other sources to identify patterns and trends.

  • Batch processing: You can use OCI Big Data to process large amounts of data in batches, for example, to generate reports or perform calculations.

  • Real-time processing: You can use OCI Big Data to process streams of data in real-time, for example, to perform fraud detection or anomaly detection.

Why customers love OCI Big Data service

Customers love OCI Big Data, a cloud-based big data analytics service, for the following common reasons:

  • Scalability: OCI Big Data can easily scale to handle large amounts of data and processing power, enabling customers to gain insights quickly and efficiently.

  • Compatibility: OCI Big Data supports various open source big data frameworks, such as Hadoop, Spark, Hive, and Kafka, which allows customers to use the tools they’re familiar with and use existing code.

  • Security: OCI Big Data offers robust security features, including encryption at rest and in transit, and integrated authentication with OCI Identity and Access Management (IAM) service, providing customers with peace of mind that their data is protected.

  • Flexibility: Customers can choose to deploy OCI Big Data in various ways, including using preconfigured clusters or creating custom clusters with specific configurations, enabling them to tailor the service to their specific needs.

  • Integration with other OCI services: OCI Big Data integrates with other OCI services, such as OCI Data Catalog, OCI Data Flow, and OCI Lake House, making it easy for customers to build end-to-end solutions for their big data needs.

What is a cluster profile?

A cluster profile in OCI Big Data represents a preconfigured set of resources optimized for a particular workload or use case. Each cluster profile has a predefined set of configuration parameters that are optimized for a specific data processing job. For example, OCI offers cluster profiles for Hadoop, Spark, HBase, Hive, and Trino (Interactive query), each designed for specific workloads and use cases.

Figure: The expanded Cluster profile menu, with Hadoop Extended highlighted and selected.

Hadoop

The Hadoop cluster profile in OCI Big Data is designed to work with Hadoop Distributed File System (HDFS) and MapReduce. It’s ideal for batch processing, such as data warehousing, log analysis, and ETL. The Hadoop cluster profile comes with Hadoop, Hive, Pig, and Oozie preinstalled, making it easy to get started with big data processing.

Spark

The Spark cluster profile in OCI Big Data is designed to work with Apache Spark, an open source data processing framework that supports both batch and streaming data processing. It’s ideal for real-time data processing, machine learning, and graph processing. The Spark cluster profile comes with Spark, Hive, and Jupyter preinstalled, making it easy to start using Spark for big data processing.

HBase

The HBase cluster profile in OCI Big Data is designed to work with Apache HBase, an open source NoSQL database that runs on top of Hadoop. It’s ideal for storing and retrieving large amounts of structured data, such as sensor data, social media data, and financial data.

Trino

This cluster profile is for interactive querying of large datasets. Trino returns results to you as soon as they’re available. This availability offers data analysts and data scientists the ability to query large amounts of data, test hypotheses, run A/B testing, and build visualizations or dashboards.

Kafka

This cluster profile is designed for streaming data processing and supports Apache Kafka.

Hadoop Extended

This cluster profile matches the cluster configuration that was used before the cluster profile feature was introduced.
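
As a rough illustration of how the profiles described above line up with workloads, the following Python sketch encodes those descriptions as a lookup table. The names PROFILE_WORKLOADS and suggest_profiles are hypothetical helpers written for this post, not part of any OCI SDK, and the workload labels are simply the phrases used in the descriptions above.

```python
# Illustrative only: workload labels are taken from the profile descriptions
# in this post. PROFILE_WORKLOADS and suggest_profiles are hypothetical names,
# not an OCI API.
PROFILE_WORKLOADS = {
    "Hadoop": {"batch processing", "data warehousing", "log analysis", "etl"},
    "Spark": {"real-time data processing", "machine learning", "graph processing"},
    "HBase": {"structured data storage"},
    "Trino": {"interactive querying"},
    "Kafka": {"streaming data processing"},
}


def suggest_profiles(workload: str) -> list[str]:
    """Return the profiles whose description mentions the given workload."""
    key = workload.strip().lower()
    return sorted(
        profile
        for profile, workloads in PROFILE_WORKLOADS.items()
        if key in workloads
    )


if __name__ == "__main__":
    print(suggest_profiles("ETL"))               # ['Hadoop']
    print(suggest_profiles("machine learning"))  # ['Spark']
```

A table like this is only a starting point for discussion. The right profile for a real workload also depends on data volume, latency needs, and the tools your team already uses.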

Benefits of cluster profiles

Using cluster profiles provides the following benefits:

  • Faster cluster deployment: The use of preconfigured cluster profiles speeds up the deployment process by reducing the amount of manual configuration required.

  • Better performance: Cluster profiles are optimized for specific workloads, providing better performance compared to a generic cluster setup.

  • Simplified management: Each cluster profile comes with preconfigured services, reducing the need for manual configuration and simplifying cluster management.

How to use cluster profiles in OCI Big Data

To create an OCI Big Data cluster using a specific cluster profile, use the following steps:

  1. In the Oracle Cloud Console, navigate to the OCI Big Data service.

  2. Click the Create cluster button.

  3. Enter the cluster name and admin password.

  4. Select the checkbox for Secure and Highly Available (HA) to make the cluster secure and highly available.

  5. Select the distribution and version of Hadoop from either Oracle’s Distribution of Hadoop (ODH) or Cloudera’s Distribution of Hadoop (CDH).

  6. Select the cluster profile that best suits your use case from the menu. You can also select the version of the cluster type that you want to use.

  7. Select the Compute shape and block storage for the master and utility nodes, and the Compute shape and number of worker nodes.

  8. Provide the network-related details, such as CIDR block, virtual cloud network (VCN), and subnet details.

  9. Select your encryption type: Oracle-managed or customer-managed.

  10. Click Create to provision the Big Data cluster.

When the cluster is created, off you go! You can use it to process big data workloads using the services and tools enabled by default.
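
If you plan to script these choices rather than click through the Console, it can help to collect them in one place and check them first. The following Python sketch is a minimal, assumption-heavy illustration: ClusterRequest, KNOWN_PROFILES, and every field name are hypothetical and do not correspond to the actual OCI SDK or CLI parameters, the profile names are the ones listed in this post, and the "at least one worker" rule is an assumed sanity check rather than a documented limit.

```python
from dataclasses import dataclass

# Profile names as they appear in this post. The identifiers the service
# actually accepts may differ, so treat this set as an assumption.
KNOWN_PROFILES = {
    "Hadoop Extended", "Hadoop", "Hive", "Spark", "HBase", "Trino", "Kafka",
}


@dataclass
class ClusterRequest:
    """Hypothetical container for the choices made in steps 3 through 9."""
    name: str
    admin_password: str
    secure_and_ha: bool
    distribution: str            # "ODH" or "CDH" (step 5)
    cluster_profile: str         # step 6
    worker_shape: str            # step 7
    worker_count: int            # step 7
    subnet_id: str               # step 8
    encryption: str = "oracle-managed"  # or "customer-managed" (step 9)

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means nothing was flagged."""
        problems = []
        if not self.name:
            problems.append("cluster name is required")
        if not self.admin_password:
            problems.append("admin password is required")
        if self.distribution not in {"ODH", "CDH"}:
            problems.append(f"unknown distribution: {self.distribution}")
        if self.cluster_profile not in KNOWN_PROFILES:
            problems.append(f"unknown cluster profile: {self.cluster_profile}")
        if self.worker_count < 1:  # assumed sanity check, not a service limit
            problems.append("at least one worker node is expected")
        if self.encryption not in {"oracle-managed", "customer-managed"}:
            problems.append(f"unknown encryption type: {self.encryption}")
        return problems


if __name__ == "__main__":
    request = ClusterRequest(
        name="demo-cluster",
        admin_password="example-password",   # placeholder value
        secure_and_ha=True,
        distribution="ODH",
        cluster_profile="Spark",
        worker_shape="example-shape",        # placeholder, not a real shape name
        worker_count=3,
        subnet_id="example-subnet-id",       # placeholder, not a real OCID
    )
    print(request.validate())
```

A check like this could run before any request is sent, so that a mistyped profile name or a missing field is caught locally. For the real parameter names and accepted values, refer to the OCI Big Data documentation.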

Conclusion

Cluster profiles in OCI Big Data provide a preconfigured set of configurations and services optimized for specific workloads, enabling faster cluster deployment, better performance, and simplified management. When creating an OCI Big Data cluster, we recommend selecting the cluster profile that best suits your workload to optimize performance and reduce management overhead.

To learn more and find detailed instructions, see our documentation or get hands-on with our LiveLab, Setup and Manage an Oracle Big Data Service (BDS) HA cluster using Oracle Distribution of Hadoop (ODH).
