Deploy HA Availability Domain Spanning Cloudera Enterprise Data Hub Clusters on Oracle Cloud Infrastructure

Zachary Smith
Principal Solution Architect

Hello, my name is Zachary Smith, and I'm a Solutions Architect working on Big Data for Oracle Cloud Infrastructure.

We're proud to announce that Terraform automation for availability domain spanning is now available for Cloudera Enterprise Data Hub deployments on Oracle Cloud Infrastructure. This deployment architecture adds enhanced security and fault tolerance while maintaining performance.

Cloudera Enterprise Data Hub: Availability Domain Spanning

Availability domain spanning is ideal for customers who want to maintain the performance of Cloudera Enterprise Data Hub on Oracle Cloud Infrastructure while using cloud constructs to enhance fault tolerance and high availability. Cloudera Enterprise Data Hub cluster hosts are deployed across all three availability domains in a region, and ZooKeeper, NameNode, and HDFS services are distributed across the nodes in each availability domain.
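To make the idea of spreading hosts across availability domains concrete, here is a minimal Python sketch of a round-robin placement. It is only an illustration, not the logic used by the Terraform templates: the availability domain labels, the host names, and the `spread_hosts` helper are all hypothetical.

```python
from collections import defaultdict
from itertools import cycle

# Hypothetical availability domain labels; real names are tenancy-specific.
AVAILABILITY_DOMAINS = ["AD-1", "AD-2", "AD-3"]


def spread_hosts(hostnames, domains=AVAILABILITY_DOMAINS):
    """Assign hosts to availability domains in round-robin order."""
    placement = defaultdict(list)
    for host, domain in zip(hostnames, cycle(domains)):
        placement[domain].append(host)
    return dict(placement)


# Five hypothetical worker hosts spread over three availability domains.
workers = [f"worker-{i}" for i in range(1, 6)]
placement = spread_hosts(workers)

for domain, hosts in placement.items():
    print(domain, hosts)
```

With five workers and three availability domains, no domain holds more than two workers, so losing any single domain still leaves at least three workers running.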

Cloudera Cluster Hosts on a Private Subnet

In keeping with our focus on enabling enterprise customers to deploy secure environments in the cloud, this architecture deploys the master and worker cluster hosts on a private subnet that is not directly accessible from the internet. To achieve this, the bastion host in the deployment is set up as a NAT gateway, which hosts on the private subnet use to route internet-destined traffic to the internet gateway. This architecture provides enhanced security without sacrificing cluster performance.
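The routing behavior described above can be modeled in a few lines of Python. This is a simplified sketch under stated assumptions, not the actual route configuration: the CIDR block, the bastion IP address, and the `next_hop` helper are hypothetical, and traffic inside the network is modeled as an explicit "local" route for clarity.

```python
import ipaddress

# Hypothetical addressing, for illustration only.
VCN_CIDR = ipaddress.ip_network("10.0.0.0/16")
BASTION_NAT_IP = "10.0.0.2"  # assumed private IP of the bastion/NAT host

# Simplified route table for the private subnet: (destination, next hop).
PRIVATE_ROUTES = [
    (VCN_CIDR, "local"),
    (ipaddress.ip_network("0.0.0.0/0"), BASTION_NAT_IP),
]


def next_hop(destination, routes=PRIVATE_ROUTES):
    """Return the next hop for a destination using longest-prefix match."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes if addr in net]
    _, hop = max(matches, key=lambda match: match[0].prefixlen)
    return hop


print(next_hop("10.0.1.15"))     # cluster-internal traffic stays local
print(next_hop("203.0.113.10"))  # internet-destined traffic goes via the bastion
```

Cluster hosts never hold a public address in this model; only the bastion forwards their outbound traffic toward the internet gateway.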

Performance Testing

To test the performance of Cloudera Enterprise Data Hub on Oracle Cloud Infrastructure, we chose TeraSort as the benchmark. TeraSort is a standard benchmark for Hadoop because it exercises every element involved in a Hadoop deployment: compute, memory, storage, and network.

The following graph compares 10-TB TeraSort runs across two cluster types on each deployment architecture. The first cluster type uses virtual machines with six 1.5-TB block volumes each for HDFS. The second cluster type uses bare metal hosts with local NVMe for HDFS. The cluster topology is the same for both architectures: five worker nodes, one Cloudera Manager node, two master nodes for cluster services, and one bastion host.
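For a sense of scale, the short Python calculation below works through the storage arithmetic for the virtual machine cluster described above. The worker count, volume count, volume size, and sort size come from the text; the replication factor of 3 is an assumption (it is the stock HDFS default, and the setting used in these tests is not stated).

```python
# Figures taken from the cluster description above.
WORKERS = 5
VOLUMES_PER_WORKER = 6
VOLUME_SIZE_TB = 1.5
SORT_SIZE_TB = 10

# Assumption: the HDFS default replication factor; not stated for these tests.
ASSUMED_REPLICATION = 3

raw_per_worker_tb = VOLUMES_PER_WORKER * VOLUME_SIZE_TB
raw_cluster_tb = WORKERS * raw_per_worker_tb
usable_cluster_tb = raw_cluster_tb / ASSUMED_REPLICATION
sort_share_per_worker_tb = SORT_SIZE_TB / WORKERS

print(f"Raw HDFS capacity per worker: {raw_per_worker_tb} TB")
print(f"Raw HDFS capacity, cluster:   {raw_cluster_tb} TB")
print(f"Usable at replication {ASSUMED_REPLICATION}:      {usable_cluster_tb} TB")
print(f"Sort data per worker:         {sort_share_per_worker_tb} TB")
```

Under these assumptions, each worker handles roughly 2 TB of the 10-TB data set.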

Not only are the results extremely fast for sorting 10 TB with five workers, but the sort times are also extremely close between the single availability domain architecture and the availability domain spanning architecture. We ran these tests multiple times in a row, and the results were almost identical regardless of the time of day that the job ran. This is a great example of Oracle’s industry-leading SLA for cloud.

We have more improvements in this space, as well as a white paper that details a Reference Architecture for Cloudera Enterprise Data Hub on Oracle Cloud Infrastructure and the use of these Terraform templates.

Have questions or want to learn more? Join us at the Cloudera Now Virtual Event Booth on August 2 from 9 a.m. to 1 p.m. PDT. Register Now.

We hope you will be as excited as we are about the improvements we’re making to the Cloudera plus Oracle solution. Let us know what you think!

Zachary Smith

Senior Member of Technical Staff

https://www.linkedin.com/in/zachary-c-smith/
