One of the major Big Data SQL benefits is security. You deal with the data, which you store in HDFS or other sources, through Oracle Database, which means, that you could apply many Database features, such as Data Redaction, VPD or Database Vault. These features in conjunction with database schema/grant privileged model, allows you to protect data from Database side (when intruder tries to reach data from database side).
But it's also important to keep in mind, that Data stored on HDFS may be required for other purposes (Spark, Solr, Impala...) and they need to have some other mechanism for protection. In Hadoop world, Kerboros is most popular way for protect data (authentification method). Kerberos in conjunction with HDFS ACL gives you opportunity to protect data on the file system level. HDFS as a file system has concept of user and group. And files, which you store on HDFS have different privileges for owner, group and all others.
Conclusion: For working with Kerberized clusters, Big Data SQL needs to have valid Kerberos ticket for work with HDFS files. Fortunitely, all this setup been automated and available within standard Oracle Big Data SQL installer. For get more details please check here.
Big Data SQL and Kerberos.
Well, usually, customers have a Kerbirized cluster and for working with it, we need to have valid Kerberos ticket. But here raised up the question - which principal do you need to have with Big Data SQL?
Answer is easy - oracle. In prior Big Data SQL releases, all Big Data SQL run on the Hadoop cluster as the same user: oracle. This has the following consequences:
- Unable to authorize access to data based on the user that is running a query
- Hadoop cluster audits show that all data queried thru Big Data SQL is made by oracle
What if I already have some data, used by other application and have different privileges (belonging to different users and groups)? Here in Big Data SQL 3.2 we introduced the new feature - Multi-User Authorization.
In foundation of Multi-User Authorization lays Hadoop feature, called impersonalization. I took description from here:
"A superuser with username ‘super’ wants to submit job and access hdfs on behalf of a user joe. The superuser has Kerberos credentials but user joe doesn’t have any. The tasks are required to run as user joe and any file accesses on namenode are required to be done as user joe. It is required that user joe can connect to the namenode or job tracker on a connection authenticated with super’s Kerberos credentials. In other words super is impersonating the user joe."
at the same manner, "oracle" is the superuser and other users are impersonalized.
Multi-User Authorization key concepts.
1) Big Data SQL will identify the trusted user that is accessing data on the cluster. By executing the query as the trusted user:
- Authorization rules specified in Hadoop will be respected
- Authorization rules specified in Hadoop do not need to be replicated in the database
- Hadoop cluster audits identify the actual Big Data SQL query user
2) Consider the Oracle Database as the entity that is providing the trusted user to Hadoop
3) Must map the database user that is running a query in Oracle Database to a Hadoop user
4) Must identify the actual user that is querying the Oracle table and pass that identity to Hadoop
- This may be an Oracle Database user (i.e. schema)
- Lightweight user comes from session-based contexts (see SYS_CONTEXT)
- User/Group map must be available thru OS lookup in Hadoop
Full documentation for this feature, you may find here and now I'm going to show few most popular cases with code examples.
For working with certain objects, you need to grant follow permissions for user, who will manage a mapping table:
In my cases, this is user "bikes".
Just in case clean up permissions for user BIKES:
check that the mapping table is empty:
and after this run a query:
this is the default mode, without any mapping, so I assume that I'll contact HDFS as oracle user.
For double check this, I review audit files:
here is clear, that user Oracle reads the file (ugi=oracle).
Let's check permissions for given file (which represents this external table):
so, everybody may read it. Remember this and let's try to create the first mapping.
this mapping tells me that user BIKES, will be always mapped to user1 for OS. Let's find this in file permission table:
Run query again and check the user, who reads this file:
It's interesting that user1 doesn't exist on the Hadoop OS:
if user not exists (user1 case), it could only read 777 files. Let me revoke read permission from everyone and run the query again:
Now it failed. For make it works I may create "user1" account on each Hadoop node and add it to oinstall group.
Run the query again and check the user, who reads this file:
here we are! We could read the file because of group permissions.
What if I want to map this schema to HDFS or some other powerful user? Let's try:
the reason why we got this exception is that hdfs user is on the blacklist for impersonation.
the second scenario is authorization with the thin client or with CLIENT_IDENTIFIER. In case of multi-tier architecture (when we have application tier and database tier), it may be a challenge to differentiate multiple users within the same application, which use the same schema.
Bellow is the example, which illustrates this:
we have an application, which connected to a database as HR_APP user, but many people may use this application and this database login. To differentiate these human users we may use dbms_session.set_IDENTIFIER procedure (more details you could find here).
So, Big Data SQL multi-user authorization feature allows using SYS_CONTEXT user for authorization on the Hadoop.
Bellow is a test case, which illustrates this.
the third way to make authentification is user authentification identity. Users connecting to a database (via Kerberos, DB User, etc...) have their authenticated identity passed to Hadoop. To make it work, simply run:
and after this your user on HDFS will be that returned by:
for example, if I logged on to the database as BIKES (as database user), on HDFS I'll be authenticated as bikes user
for checking all rules, which you have for multi-user authorization you may run follow query:
Hope that this feature may allow you to create robust security bastion around your data in HDFS.
My role involves overseeing the product management of Data Lake features for the Oracle Autonomous Database.