Security is a critical aspect of many projects and must not be underestimated. Hadoop security is complex and consists of many components, so it is best to enable security features one at a time. Before explaining the different security options, I'll share some materials that will help you get familiar with the algorithms and technologies that underpin many of Hadoop's security features.
First of all, I recommend that you watch this excellent video series, which explains how asymmetric-key cryptography and the RSA algorithm work (the basis for SSL/TLS). Then, read this blog about TLS/SSL principles.
Also, if you mix up terms such as Active Directory, LDAP, and OpenLDAP, it will be useful to check this page.
After you get familiar with the concepts, you can concentrate on the implementation scenarios with Hadoop. You can think of Hadoop security as being divided into a few different levels - from the simplest (i.e. no security) to the most robust.
Level 0 is "relaxed" - or no security on the cluster. It assumes a level of trust; there may be authorization rules assigned to objects - but these rules can be easily subverted. In Hadoop, files and folders have permissions - similar to Linux - and users access files based on access control lists - or ACLs. See the diagram below that highlights different access paths into the cluster: shell/CLI, JDBC, and tools like Hue.
As you can see below, each file/folder has access privileges assigned. The oracle user is the owner of the items - and other users can access the items based on the access control definition (e.g. -rw-r--r-- means that the oracle user can read/write the file, users in the oinstall group can read it, and the rest of the world can also read it).
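To make the mode string concrete, here is a small shell sketch (nothing here touches a real cluster; the oracle/oinstall names mirror the listing above) that decodes -rw-r--r-- into its owner/group/other classes:

```shell
# Decode a Unix/HDFS mode string such as "-rw-r--r--" into its three classes.
mode='-rw-r--r--'
owner=${mode:1:3}   # rw- : the owner (oracle) can read and write
group=${mode:4:3}   # r-- : members of the group (oinstall) can only read
other=${mode:7:3}   # r-- : everyone else can only read
echo "owner=$owner group=$group other=$other"
```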
In this "relaxed" security level - it is very easy to subvert these ACLs. Because there is no authentication, a user can impersonate someone else; the identity is determined by the current login identity on the client machine. So, as shown below, a malicious user can define the "hdfs" user (a power user in Hadoop) on their local machine - access the cluster - and then delete import financial and healthcare data.
Additionally, access to data thru tools like Hive is also completely open. The user passed as part of the JDBC connection will be used for data authorization. Note that you can specify any user you want - there is no authentication! So, all data in Hive is open for query.
If you care about the data, then this is a real problem. Let's now review Level 1 security: creating a Bastion.
A Bastion limits access to the cluster. Instead of enabling connectivity from any client, a bastion is created that users log into - and this bastion has access to the cluster.
An Edge node in Hadoop is an example of a bastion host. It is:
- Used to run jobs and interact with the Hadoop Cluster
- A node in the Hadoop cluster that runs only gateway services
Because a user logged into this edge node does not have the ability to alter his or her identity, the identity can be trusted (at least to a degree). This means that:
1) HDFS ACLs now have some meaning
- User identity established on edge node
- Connect only thru known access paths and hosts
Note: HDFS has an extended ACL feature, which allows extended access control lists, so you can grant permissions beyond the owner and group.
$ hadoop fs -mkdir /user/oracle/test_dir
$ hdfs dfs -getfacl /user/oracle/test_dir
# file: /user/oracle/test_dir
# owner: oracle
# group: hadoop
user::rwx
group::r-x
other::r-x
$ hdfs dfs -setfacl -m user:ben:rw- /user/oracle/test_dir
$ hdfs dfs -getfacl /user/oracle/test_dir
# file: /user/oracle/test_dir
# owner: oracle
# group: hadoop
user::rwx
user:ben:rw-
group::r-x
mask::rwx
other::r-x
2) But JDBC is still insecure
- User identified in JDBC connect string not authenticated
Here is an example of how I can use the beeline tool from the CLI to work on behalf of a "superuser" who may do whatever he wants on the cluster:
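A sketch of such a beeline session (the host name is hypothetical; note that no password is ever verified):

```shell
# The -n value is taken at face value; no authentication happens.
beeline -u "jdbc:hive2://clusternode:10000/default" -n hdfs
# From here, every query runs with hdfs's authorization rights.
```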
To ensure that identities are trusted, we need to introduce a capability that you probably use all the time without even knowing it: Kerberos.
The two most common ways to use Kerberos with Oracle Big Data Appliance are with Active Directory Kerberos or MIT Kerberos. On the Oracle support site you will find step-by-step instructions for enabling both of these configurations. Check "BDA V4.2 and Higher Active Directory Kerberos Install and Upgrade Frequently Asked Questions (FAQ) (Doc ID 2013585.1)" for the Active Directory implementation and "Instructions to Enable Kerberos on Oracle Big Data Appliance with Mammoth V3.1/V4.* Release (Doc ID 1919445.1)" for MIT Kerberos. Oracle recommends using local MIT Kerberos for system services like hdfs and yarn, and AD Kerberos for human users (like user John Smith). You also have to set up a trust between them by following support note "How to Set up a Cross-Realm Trust to Configure a BDA MIT Kerberos Enabled Cluster with Active Directory on BDA V4.5 and Higher (Doc ID 2198152.1)".
Note: Big Data Appliance greatly simplifies the implementation of a highly available Kerberos deployment on a Hadoop cluster. You do not need to do the manual setup of Kerberos settings (and should not); use the tools provided by BDA.
Using MOS 1919445.1 I've enabled MIT Kerberos on my BDA cluster.
So, what has changed in my daily life with the Hadoop cluster?
First of all, I'm trying to list files in HDFS:
Oops... it seems something is missing. I cannot access the data in HDFS - "No valid credentials provided". In order to gain access to HDFS, I must first obtain a Kerberos ticket:
Still not able to access HDFS! That's because the user principal must be added to the Key Distribution Center - or KDC. As the Kerberos admin, add the principal:
Now, I can successfully obtain the Kerberos ticket:
Here we go! I'm ready to work with my Hadoop cluster. But I don't want to enter the password every single time I obtain a ticket (this is important for services as well). For this, I need to create a keytab file.
and I can obtain a new Kerberos ticket without the password:
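A sketch of how the keytab can be built with ktutil and then used (the principal and realm are from my test cluster; the path, key version number, and encryption type are assumptions - they must match what your KDC has for the principal):

```shell
$ ktutil
ktutil:  addent -password -p oracle@BDACLOUDSERVICE.ORACLE.COM -k 1 -e aes256-cts
Password for oracle@BDACLOUDSERVICE.ORACLE.COM:
ktutil:  wkt /home/oracle/oracle.keytab
ktutil:  quit
$ chmod 600 /home/oracle/oracle.keytab    # keep it private
$ kinit -kt /home/oracle/oracle.keytab oracle@BDACLOUDSERVICE.ORACLE.COM
$ klist
```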
Now I can work with the Hadoop cluster on behalf of the oracle user:
WARNING: please keep in mind that you have to store the keytab in a safe directory and set its permissions carefully (anyone who can read it can impersonate its principal).
It's interesting to note that if you work on the Hadoop servers, many keytab files already exist there; for example, if you want to get an HDFS ticket, you can easily do so. To get the list of principals in a certain keytab file, just run:
and to obtain the ticket, run:
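For example, for the hdfs service keytab it could look like this (the keytab path is an assumption - on CDH nodes the service keytabs usually live under the Cloudera agent's process directories):

```shell
$ klist -kt /var/run/cloudera-scm-agent/process/*-hdfs-NAMENODE/hdfs.keytab
$ kinit -kt /var/run/cloudera-scm-agent/process/*-hdfs-NAMENODE/hdfs.keytab hdfs/$(hostname -f)
```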
Note: to debug kinit, you can export KRB5_TRACE=/dev/stdout. Here is an example:
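A minimal sketch:

```shell
$ export KRB5_TRACE=/dev/stdout
$ kinit oracle    # every exchange with the KDC is now traced to stdout
```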
Obtaining a Kerberos Ticket without Access to the KDC
In my experience, there can be cases when a client machine cannot access the KDC directly but needs to work with Kerberos-protected resources. Here is a workaround. First, go to a machine that has access to the KDC and generate a ticket cache:
[opc@hadoopnode ~]$ cp /etc/krb5.conf /tmp/TMP_TICKET_CACHE/krb5.conf
[opc@hadoopnode ~]$ export KRB5_CONFIG=/tmp/TMP_TICKET_CACHE/krb5.conf
[opc@hadoopnode ~]$ export KRB5CCNAME=DIR:/tmp/TMP_TICKET_CACHE/
[opc@hadoopnode ~]$ kinit oracle
Password for oracle@BDACLOUDSERVICE.ORACLE.COM:
afilanov-mac:ssh afilanov$ scp -i id_rsa_new.dat email@example.com:/tmp/TMP_TICKET_CACHE/* /tmp/
Enter passphrase for key 'id_rsa_new.dat':
krb5.conf 100% 795 12.3KB/s 00:00
primary 100% 10 0.2KB/s 00:00
tkt0kVvY6 100% 874 13.6KB/s 00:00
Then point the Kerberos environment variables at the copied cache and check that the current user has a ticket:
afilanov-mac:ssh afilanov$ export KRB5_CONFIG=/tmp/krb5.conf
afilanov-mac:ssh afilanov$ export KRB5CCNAME=/tmp/tkt0kVvY6
afilanov-mac:ssh afilanov$ klist
Credentials cache: FILE:/tmp/tkt0kVvY6
Issued Expires Principal
Jun 18 09:33:10 2018 Jun 19 09:33:10 2018 krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM
It is quite common that companies use an Active Directory server to manage users and groups and want to give them access to the secure Hadoop cluster according to their roles and permissions. For example, I have my corporate login afilanov and I want to work with the Hadoop cluster as afilanov.
To do this, you have to build a trust relationship between Active Directory and the MIT KDC on BDA. All the details can be found in the MOS note "How to Set up a Cross-Realm Trust to Configure a BDA MIT Kerberos Enabled Cluster with Active Directory on BDA V4.5 and Higher (Doc ID 2198152.1)", but here I'll show a quick example of how it works.
First, I will log in to the Active Directory server and configure the trusted relationships with my BDA KDC:
After this, I'm going to create a user in AD:
I skipped the explanations here because you can find all the details in MOS; I just wanted to show that the user is created on the AD (not Hadoop) side.
On the KDC side you have to create one more principal as well:
After this we are ready to use our corporate login/password to work with Hadoop on behalf of this user:
Well, now we can obtain a Kerberos ticket and work with Hadoop as a certain user. It's important to note that we can be logged in to the OS as any user (e.g. we could be logged in as root while working with the Hadoop cluster as a user from AD); here is an example:
For Hadoop, it's important to have users (and their groups) available thru the OS. Services like HDFS perform lookups at the OS level to determine what groups a user belongs to - and then use that information to authorize access to files and folders. But what if I want to have OS users from Active Directory? This is where SSSD steps in. It is a PAM module that forwards user lookups to Active Directory. This means you don't need to replicate user/group information at the OS level; it simply leverages the information found in Active Directory. See the Oracle Support site for MOS notes on how to set it up (well-written, detailed instructions):
Guidelines for Active Directory Organizational Unit Setup Required for BDA 4.9 and Higher SSSD Setup (Doc ID 2289768.1)
How to Set up an SSSD on BDA V4.9 and Higher (Doc ID 2298831.1)
After you complete all the steps listed there, you can use an AD user/password to log in to the Linux servers of your Hadoop cluster. Here is an example:
Another important aspect of security is network encryption. For example, even if you protect access to the servers, somebody may listen to the network between the cluster and client and intercept network packets for future analysis.
Here is an example of how it could be hacked (note: my cluster is already Kerberized).
Now we are hacked. Fortunately, Hadoop has the capability to protect the network between clients and cluster. It will cost you some performance, but the performance impact should not prevent you from enabling the network encryption between clients and cluster.
Before enabling encryption, I ran a simple performance test:
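The test was a TeraGen/TeraSort pair; a sketch (the parcel path and data size are assumptions):

```shell
# Generate rows, then sort them; the wall-clock time of the two jobs is the metric.
EXAMPLES=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
hadoop jar $EXAMPLES teragen  100000000 /user/oracle/teragen_in
hadoop jar $EXAMPLES terasort /user/oracle/teragen_in /user/oracle/terasort_out
```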
Both jobs took 3.7 minutes. Remember it for now.
Fortunately, Oracle Big Data Appliance provides an easy way to enable network encryption with bdacli. You can set it up with one command:
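As with the shuffle encryption mentioned later in this post, the bdacli call is, to the best of my knowledge, the following; verify it against the MOS note for your BDA release:

```shell
# Run as root on a cluster node; bdacli will prompt for the required passwords.
bdacli enable hadoop_network_encryption
```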
You will need to answer some questions about your specific cluster configs, such as Cloudera Manager admin password and OS passwords.
After the command finished, I ran the performance test again:
and now the jobs took 4.5 and 4.2 minutes respectively. They run a bit more slowly, but it is worth it.
For improved performance when transferring encrypted data, we can take advantage of Intel's embedded AES instructions and change the encryption algorithm (go to Cloudera Manager -> HDFS -> Configuration -> dfs.encrypt.data.transfer.algorithm -> AES/CTR/NoPadding):
Another vulnerability is network interception during the shuffle (the step between the Map and Reduce operations) and in communication between clients. To prevent this, you have to encrypt the shuffle traffic. BDA again has an easy solution: run the bdacli enable hadoop_network_encryption command.
Now the intermediate files generated during the shuffle step will be encrypted as well.
As in the previous example, simply answer a few questions and the encryption will be enabled.
Let's check performance numbers again:
we got 5.1 and 4.4 minutes respectively. A bit slower, but worth it to keep the data safe.
All right, we have now protected the cluster from external unauthorized access (by enabling Kerberos), encrypted the network communication between the cluster and clients, and encrypted intermediate files, but we still have vulnerabilities. If a user gets access to the cluster's servers, he or she could read the data directly (despite the ACLs). Let me give you an example.
An ordinary user puts sensitive information in a file and loads it into HDFS:
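For instance (file contents and paths are made up for the demo):

```shell
$ echo "name=John Smith, ssn=123-45-6789" > sensitive.txt
$ hadoop fs -mkdir -p /user/oracle/private
$ hadoop fs -put sensitive.txt /user/oracle/private/
$ hadoop fs -chmod 700 /user/oracle/private    # ACLs allow the owner only
```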
The intruder knows the file name and wants to get its contents (the sensitive information).
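The attack can be sketched like this (the block ID and DataNode paths are illustrative; root access on a DataNode is assumed):

```shell
# Ask the NameNode which blocks hold the file and where they live:
$ hdfs fsck /user/oracle/private/sensitive.txt -files -blocks -locations
# Then, as root on that DataNode, read the block file straight off the local disk;
# HDFS ACLs are never consulted at this level:
$ find /u01 -name "blk_1073741857*" 2>/dev/null
$ cat /u01/hadoop/dfs/current/BP-*/current/finalized/subdir0/subdir0/blk_1073741857
```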
The hacker found the blocks that store the data and then reviewed the physical files at the OS level. It was so easy to do. What can you do to prevent this? The answer is to use HDFS encryption.
Again, BDA has a single command to enable HDFS transparent encryption: bdacli enable hdfs_transparent_encryption. You may find more details in MOS "How to Enable/Disable HDFS Transparent Encryption on Oracle Big Data Appliance V4.4 with bdacli (Doc ID 2111343.1)".
I'd like to note that Cloudera has a great blog post about HDFS transparent encryption, and I recommend you read it. After encryption has been enabled, I'll repeat my previous test case. Prior to running the test, we will create an encryption zone and copy the files into that zone.
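Creating the zone and moving data into it can be sketched as follows (the key and zone names are my own; hadoop key create and hdfs crypto require the appropriate admin rights):

```shell
$ hadoop key create myZoneKey                      # EZ key, stored in the KMS
$ hadoop fs -mkdir /user/oracle/encrypted_zone     # the zone directory must be empty
$ hdfs crypto -createZone -keyName myZoneKey -path /user/oracle/encrypted_zone
$ hdfs crypto -listZones
$ hadoop fs -cp /user/oracle/private/sensitive.txt /user/oracle/encrypted_zone/
```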
Bingo! The file is encrypted and the person who attempted to access the data can only see a series of nonsensical bytes.
Now let me say a couple of words about how encryption works. There are a few types of keys (screenshots taken from Cloudera's blog):
1) Encryption Zone key. You may encrypt the files in a certain directory with a unique key. This directory is called an Encryption Zone (EZ) and the key is called the EZ key. This approach can be quite useful when you share a Hadoop cluster among different divisions of the same company. This key is stored in the KMS (Key Management Server). The KMS handles generating encryption keys (EZ keys and DEKs), communicates with the key server, and decrypts EDEKs.
2) The Encrypted Data Encryption Key (EDEK) is an attribute of a file, stored in the NameNode.
3) The DEK (Data Encryption Key) is not persisted; it is computed on the fly from the EDEK and the EZ key.
Here is the flow of how a client writes data to encrypted HDFS.
I took this explanation from the Hadoop Security book:
1) The HDFS client calls create() to write to the new file.
2) The NameNode requests the KMS to create a new EDEK using the EZK-id/version.
3) The KMS generates a new DEK.
4) The KMS retrieves the EZK from the key server.
5) The KMS encrypts the DEK, resulting in the EDEK.
6) The KMS provides the EDEK to the NameNode.
7) The NameNode persists the EDEK as an extended attribute for the file metadata.
8) The NameNode provides the EDEK to the HDFS client.
9) The HDFS client provides the EDEK to the KMS, requesting the DEK.
10) The KMS requests the EZK from the key server.
11) The KMS decrypts the EDEK using the EZK.
12) The KMS provides the DEK to the HDFS client.
13) The HDFS client encrypts data using the DEK.
14) The HDFS client writes the encrypted data blocks to HDFS.
For reading data, you follow these steps:
1) The HDFS client calls open() to read a file.
2) The NameNode provides the EDEK to the client.
3) The HDFS client passes the EDEK and EZK-id/version to the KMS.
4) The KMS requests the EZK from the key server.
5) The KMS decrypts the EDEK using the EZK.
6) The KMS provides the DEK to the HDFS client.
7) The HDFS client reads the encrypted data blocks, decrypting them with the DEK.
I'd like to highlight again that all these steps are completely transparent; the end user doesn't notice any difference while working with HDFS.
Key Trustee Server and Key Trustee KMS
For those of you who are just starting to work with HDFS data encryption - the terms Key Trustee Server and KMS may be a bit confusing. Which component do you need to use, and for what purpose? From Cloudera's documentation:
Key Trustee Server is an enterprise-grade virtual safe-deposit box that stores and manages cryptographic keys. With Key Trustee Server, encryption keys are separated from the encrypted data, ensuring that sensitive data is protected in the event that unauthorized users gain access to the storage media.
Key Trustee KMS - for HDFS Transparent Encryption, Cloudera provides Key Trustee KMS, a customized Key Management Server. The KMS service is a proxy that interfaces with a backing key store on behalf of HDFS daemons and clients. Both the backing key store and the KMS implement the Hadoop KeyProvider client API. Encryption and decryption of EDEKs happen entirely on the KMS. More importantly, the client requesting creation or decryption of an EDEK never handles the EDEK's encryption key (that is, the encryption zone key).
This picture (again from Cloudera's documentation) shows that the KMS is an intermediate service between the NameNode and the Key Trustee Server:
HDFS transparent encryption operations.
HDFS is a filesystem and, as we discussed earlier, it has ACLs for managing file permissions.
As you know, those three magic numbers define access rules for owner-group-others. But how do you determine which groups your user belongs to? There are two types of user-group lookup - LDAP-based and UnixShell-based. In Cloudera Manager it is defined through the hadoop.security.group.mapping parameter:
To check the list of groups for a certain user from the Linux console, just run:
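For example (the user name follows the earlier examples):

```shell
$ id -Gn oracle        # what the OS (or SSSD) reports
$ hdfs groups oracle   # what the NameNode's group mapping resolves
```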
Another powerful security capability of Hadoop is role-based access for Hive queries. In the Cloudera distribution, it is managed by Sentry. Kerberos is required for a Sentry installation. As with many things on Big Data Appliance, the installation of Sentry is automated and can be done with one command:
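To the best of my knowledge, the call is:

```shell
# Run as root; bdacli asks for the Cloudera Manager admin password and other details.
bdacli enable sentry
```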
and follow the tool's prompts. More information can be found in MOS "How to Add or Remove Sentry on Oracle Big Data Appliance v4.2 or Higher with bdacli (Doc ID 2052733.1)".
After you enable Sentry, you can create and enable Sentry policies. I want to mention that Sentry has a strict hierarchy: users belong to groups, groups have roles, and roles have privileges. You have to follow this hierarchy; you can't assign privileges directly to a user or group.
I will show how to set up these policies with HUE. For this test case, I'm going to create test data and load it into HDFS:
After creating the file, I log in to Hive and create an external table (for power users) and a view with a restricted set of columns (for limited users):
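The DDL could be sketched like this (the table, view, and column names, the HiveServer2 host, and the realm are my own):

```shell
beeline -u "jdbc:hive2://clusternode:10000/default;principal=hive/clusternode@BDACLOUDSERVICE.ORACLE.COM" <<'EOF'
CREATE EXTERNAL TABLE emp (name STRING, dept STRING, salary INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/oracle/emp';
CREATE VIEW emp_limited AS SELECT name, dept FROM emp;
EOF
```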
Once the data is in my cluster, I'm going to create test users across all nodes (or, if using AD, create the users/groups there):
Now, let's use the user-friendly HUE graphical interface. First, go to the Security tab; we find that we have two objects and no security rules yet:
Then I clicked the "Add policy" link and created a policy for power_user that allows it to read the "emp" table:
Then I did the same for limited_user, but allowed it to read only the emp_limited view. Here are my roles with their policies:
Now I log in as "limited_user" and ask for the list of tables:
Only emp_limited is available. Let's query it:
Perfect. The "emp" table is not in my list, but let's imagine that I know its name and want to query it.
My attempt failed because of a lack of privileges.
This highlighted table-level granularity, but Sentry also allows you to restrict access to certain columns. I'm going to reconfigure power_role to allow it to select all columns except "salary":
After this, I run test queries with and without the salary column:
Here we go! If I list "salary" in the select statement my query fails because of lack of privileges.
Useful Sentry commands.
Alternatively, you may use the beeline CLI to view and manage Sentry roles. Below I show how to manage privileges with beeline. First, I obtain a Kerberos ticket for the hive user ("hive" is the admin) and log in to beeline; then I drop the role, create it again, assign some privileges to it, and link the role to a group. Then I log in as limited_user, who belongs to limited_grp, and check the permissions:
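A sketch of that session (the role, group, table, and host names follow the examples above; the keytab path is an assumption):

```shell
# As the Sentry/Hive admin:
$ kinit -kt hive.keytab hive/$(hostname -f)
$ beeline -u "jdbc:hive2://clusternode:10000/default;principal=hive/clusternode@BDACLOUDSERVICE.ORACLE.COM" <<'EOF'
DROP ROLE limited_role;
CREATE ROLE limited_role;
GRANT SELECT ON TABLE emp_limited TO ROLE limited_role;
GRANT ROLE limited_role TO GROUP limited_grp;
SHOW GRANT ROLE limited_role;
EOF
```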
Connect to the hive from the bash console.
The two most common ways of connecting to Hive from the shell are 1) the hive cli and 2) beeline. The first is deprecated because it bypasses the security in HiveServer2; it communicates directly with the metastore. Therefore, beeline is the recommended tool; it communicates with HiveServer2, which allows the authorization rules to engage.
The hive cli tool is a big security back door, and it's highly recommended to disable it. To accomplish this, you need to configure Hive properly. You can use the hadoop.proxyuser.hive.groups parameter to allow only users belonging to the groups specified in the proxy list to connect to the metastore (the application components); as a consequence, a user who does not belong to these groups and runs the hive cli will not be able to connect to the metastore. Go to Cloudera Manager -> Hive -> Configuration, type "hadoop.proxyuser.hive.groups" in the search bar, and add the hive, impala, and hue users:
Restart the Hive server. Now you will be able to connect to the hive cli only as a privileged user (one belonging to the hive, hue, or impala groups).
Great, now that we've locked down the old hive cli, it's a good time to use the modern beeline console.
So, to run beeline, invoke it from the CLI and use the following connection string:
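With Kerberos enabled, the URL has to carry the HiveServer2 service principal; a sketch (the host name is hypothetical, the realm is from my cluster):

```shell
$ beeline
beeline> !connect jdbc:hive2://clusternode:10000/default;principal=hive/clusternode@BDACLOUDSERVICE.ORACLE.COM
```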
Note: before connecting with beeline, you must obtain a Kerberos ticket, which is used to confirm your identity.
If you enable Hive TLS/SSL encryption (to ensure the integrity and confidentiality of your client-server connection), you need a slightly tricky authentication setup with beeline: you have to supply an SSL trust store file and trust store password. When a client connects to a server, the server sends a public certificate to the client to begin the encrypted connection, and the client must determine whether it 'trusts' the server's certificate. To do this, it checks the certificate against a list of things it has been configured to trust, called a trust store.
You may find the trust store path and password with the Cloudera Manager REST API:
Alternatively, you may use bdacli tool on BDA:
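To my knowledge, the two getinfo calls are the following; check the bdacli help on your release:

```shell
# Each command prints a single value; run as root on a cluster node.
bdacli getinfo cluster_https_truststore_path
bdacli getinfo cluster_https_truststore_password
```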
Now you know the truststore password and path.
If you doubt that it matches, you can take a look at the truststore file's content:
afilanov-mac:~ afilanov$ keytool -list -keystore testbdcs.truststore
Enter keystore password:
Keystore type: JKS
Keystore provider: SUN
Your keystore contains 5 entries
cfclbv3874.us2.oraclecloud.com, Apr 13, 2018, trustedCertEntry,
Certificate fingerprint (SHA1): F0:5D:28:36:99:67:FB:C0:B1:D5:B3:75:DF:D6:51:9B:DF:EB:3E:3A
cfclbv3871.us2.oraclecloud.com, Apr 13, 2018, trustedCertEntry,
Certificate fingerprint (SHA1): AF:3A:20:90:04:0A:27:B5:BD:DF:83:32:C7:4A:AF:AF:C4:97:E1:30
cfclbv3873.us2.oraclecloud.com, Apr 13, 2018, trustedCertEntry,
Certificate fingerprint (SHA1): 30:09:B9:A8:79:D7:F4:02:3F:72:8C:05:F1:A4:BF:04:9B:8B:78:CA
cfclbv3870.us2.oraclecloud.com, Apr 13, 2018, trustedCertEntry,
Certificate fingerprint (SHA1): EA:F0:38:1E:BB:89:E2:05:38:CA:F2:FB:4D:41:82:75:BE:5D:F7:88
cfclbv3872.us2.oraclecloud.com, Apr 13, 2018, trustedCertEntry,
Certificate fingerprint (SHA1): C5:7D:F2:FA:96:8C:AB:4A:D2:03:02:DA:D3:F5:0C:7C:45:8E:26:E7
For example, in my case I used:
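That is, a connection string of this shape (the host name, truststore path, and password are placeholders):

```shell
beeline -u "jdbc:hive2://clusternode:10000/default;principal=hive/clusternode@BDACLOUDSERVICE.ORACLE.COM;ssl=true;sslTrustStore=/tmp/testbdcs.truststore;trustStorePassword=MyTrustStorePass"
```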
Please note that the truststore holds only public certificates, so it's not a big secret; it is generally not a problem to use it in scripts and share it.
Alternatively, if you don't want to type such a long connection string every time, you can put the truststore credentials in the Linux environment:
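One way to do that is via HADOOP_OPTS, which the beeline wrapper script passes to its JVM (the paths and password are placeholders):

```shell
export HADOOP_OPTS="-Djavax.net.ssl.trustStore=/tmp/testbdcs.truststore -Djavax.net.ssl.trustStorePassword=MyTrustStorePass"
beeline -u "jdbc:hive2://clusternode:10000/default;principal=hive/clusternode@BDACLOUDSERVICE.ORACLE.COM;ssl=true"
```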
One more interesting thing you can do with your Active Directory (or any other LDAP implementation) is integrate HUE with LDAP and use LDAP passwords to authenticate your users in HUE.
Before doing this, you have to enable TLSv1 in the Java settings on the HUE server. Here is the detailed MOS note on how to do this: "Disables TLSv1 by Default For Cloudera Manager/Hue/And in System-Wide Java Configurations (Doc ID 2250841.1)".
After this, you may want to watch these YouTube videos to understand how easy this integration is: Authenticate Hue with LDAP and Search Bind, or Authenticate Hue with LDAP and Direct Bind. It's really not too hard. You may need to define your base_dn, and here is a good article on how to do this. You may also need a bind user to make the first connection and import the other users (here is the explanation from Cloudera Manager: "Distinguished name of the user to bind as. This is used to connect to LDAP/AD for searching user and group information. This may be left blank if the LDAP server supports anonymous binds."). For this purpose I used my AD account afilanov.
After this, I log in to HUE using the afilanov login/password:
Then click on the user name and choose "Manage Users".
Add/Sync LDAP users:
Optionally, you may put the Username pattern and click Sync:
Here we go! Now we have the list of LDAP users imported into HUE:
and now we can use any of these accounts to log in to HUE:
Auditing tracks who does what on the cluster, making it easy to identify improper access attempts. Fortunately, Cloudera provides an easy and efficient way to do this: Cloudera Navigator.
Cloudera Navigator is included with BDA and Big Data Cloud Service. It is accessible thru Cloudera Manager:
After this, you may use the "admin" password from Cloudera Manager. After logging on, choose the "Audit" section,
where you can create different audit reports, like "which files user afilanov created on HDFS in the last hour":
or "which Hive queries were run within the last 24 hours against the emp_limited table":