The two most challenging aspects of managing a Hadoop cluster have historically been security and High Availability (HA). HDFS has many built-in features to mitigate hardware failures and data loss, but configuring the cluster so that the processes themselves can fail over has never been easy. On top of that, Hadoop has no native solution for securing a distributed computing environment, forcing administrators to lean on technologies like Kerberos. The OCI Big Data Service (BDS) provides an automated solution to both of these complex problems right out of the box, enabled by a simple check box at cluster creation time.
Leveraging the built-in functionality of BDS, the minimum size of an HA cluster is 7 nodes. This configuration (as seen in the image above) consists of 2 name nodes configured to fail over to one another if anything should happen, 2 utility nodes running other HA and security functions such as the Kerberos KDC, and 3 worker nodes (the minimum for any BDS cluster). VM shapes and sizes are chosen at cluster creation time but can be changed at any point in the lifetime of the cluster. Worker nodes can also be scaled both horizontally and vertically with autoscaling policies to provide either more CPU/RAM, or more storage if the cluster is using local HDFS.
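The seven-node minimum can be sketched as simple arithmetic. The role counts below mirror the paragraph above; the function itself is a made-up illustrative helper, not part of any official BDS SDK.

```python
# Illustrative sketch of the minimum BDS HA topology described above.
# The role counts come from the article; this helper is hypothetical.

def min_ha_cluster() -> dict:
    """Minimum node counts for a BDS HA cluster, by role."""
    return {
        "name_nodes": 2,     # active/standby pair that fail over to one another
        "utility_nodes": 2,  # host HA/security services such as the Kerberos KDC
        "worker_nodes": 3,   # minimum worker count for any BDS cluster
    }

print(sum(min_ha_cluster().values()))  # 7
```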
Kerberos is a network authentication and security protocol that authenticates requests between trusted hosts across an untrusted network, like the internet or, in our case, a Hadoop cluster. It uses secret-key cryptography and a trusted third-party service to authenticate client-server applications as well as users.
Kerberos users and services depend directly on the Key Distribution Center (KDC), which provides two main functions: authentication and ticket-granting. KDC "tickets" authenticate all parties, allowing nodes to verify their identities securely.
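The KDC's two functions can be pictured with a toy model: an authentication exchange that hands back a ticket-granting ticket (TGT), and a ticket-granting exchange that trades a valid TGT for a service ticket. This is a conceptual sketch only; real Kerberos protects every step with secret-key cryptography, while the principal name and tickets here are plain illustrative values.

```python
# Toy model of the KDC's two functions described above. NOT a real
# Kerberos implementation: real tickets are encrypted with secret keys,
# whereas here they are opaque random strings for illustration.
import secrets

class ToyKDC:
    def __init__(self):
        # Hypothetical principal database (name -> secret key).
        self._principals = {"alice@HADOOP.EXAMPLE": "s3cret"}
        self._valid_tgts = set()

    def authenticate(self, principal: str, key: str) -> str:
        """Authentication exchange: prove identity, receive a TGT."""
        if self._principals.get(principal) != key:
            raise PermissionError("pre-authentication failed")
        tgt = secrets.token_hex(8)
        self._valid_tgts.add(tgt)
        return tgt

    def grant_ticket(self, tgt: str, service: str) -> dict:
        """Ticket-granting exchange: trade a valid TGT for a service ticket."""
        if tgt not in self._valid_tgts:
            raise PermissionError("invalid TGT")
        return {"service": service, "ticket": secrets.token_hex(8)}

kdc = ToyKDC()
tgt = kdc.authenticate("alice@HADOOP.EXAMPLE", "s3cret")
svc = kdc.grant_ticket(tgt, "hdfs/namenode1.HADOOP.EXAMPLE")
print(svc["service"])
```

The key idea the sketch captures is that services never see the user's secret key: once the TGT is issued, all further access flows through tickets minted by the KDC.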
While the services provided by Kerberos are very valuable, administering a kerberized Hadoop cluster can be one of the most challenging tasks in the Big Data admin space. The BDS HA configuration automatically lays down Kerberos on all of the nodes, allows for easy user creation via Identity Management, and provides patching and upgrades for both the kerberized nodes and the KDC running on the utility nodes, with no extra effort from the customer. This is a huge benefit of running the BDS HA configuration.
Now that we are familiar with what the BDS HA cluster configuration brings to the table, how do we use it? First, in this age of cybersecurity and audit requirements, a kerberized cluster is a minimum requirement for any production Hadoop cluster. The same goes for having multiple name nodes in production workloads: nobody wants to see a cluster rebooted in the middle of a long-running job (some of which can last for days or weeks) because the name node had a log folder fill up. So a best practice we see very often is to use a configuration like this in production environments.
However, when it comes to QA, dev, and other environments, the conversation can be trickier. Customers I have talked to who are trying to cut down on costs often point out that this configuration requires an extra 4 nodes over the minimum cluster configuration. That extra cost can add up quickly.
In my experience, matching QA to production is a no-brainer, so that is not as big of an issue. However, many customers run multiple small dev environments for different projects, code branches, and so on, and these environments can become costly.
To set up development environments, we have 3 choices:
Assuming we go with option 3, there are still ways to mitigate cost in these development environments. If workloads are small, the minimum size of the name nodes and utility nodes can be as low as the E4 Flex shape with 4 OCPUs and 32 GB of RAM. This is a very cost-efficient shape that can always be scaled up during times of heavier load. Another very important aspect of the BDS service that plays a role here is the ability to stop a cluster. In times of little use, such as weekends and holidays, clusters can be stopped completely. Stopping a cluster is like pausing it: you won't be billed for compute resources, but you will be billed for storage. On the storage side, if the configuration uses Object Storage for data rather than HDFS, the storage footprint on the cluster is extremely small. These clusters can be stopped and started in as little as 5 to 15 minutes depending on the size of the cluster.
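As a rough illustration of why stopping dev clusters matters, the sketch below estimates the compute-hours saved by pausing a cluster on weekends. The node count and schedule are illustrative assumptions, not OCI pricing; consult actual OCI billing for real numbers, and remember that stopped clusters still incur storage charges.

```python
# Back-of-the-envelope sketch of compute savings from stopping a dev
# cluster on weekends. All figures are illustrative assumptions; this
# does not reflect actual OCI pricing.

def weekly_compute_hours(nodes: int, stopped_days: int) -> int:
    """Compute-hours billed per week, given days the cluster is stopped."""
    running_days = 7 - stopped_days
    return nodes * running_days * 24

always_on = weekly_compute_hours(nodes=7, stopped_days=0)    # 1176
weekends_off = weekly_compute_hours(nodes=7, stopped_days=2) # 840
savings = 1 - weekends_off / always_on
print(f"{savings:.0%} fewer compute-hours")  # 29% fewer compute-hours
```

Even this simple weekend schedule trims compute billing by roughly two-sevenths, which compounds quickly across multiple small dev clusters.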
Hadoop clusters bring many valuable technologies: processing data at large scale, real-time data processing and movement, fast queries over large data sets, and much more. The ability to secure and run these clusters reliably, without the overhead of solving these problems yourself, is a huge benefit that OCI's Big Data Service brings to our customers.