
Information, tips, tricks and sample code for Big Data Warehousing in an autonomous, cloud-driven world

Big Data SQL Quick Start. Multi-user Authorization - Part 25

By: Alexey Filanovskiy | Product Manager

One of the major benefits of Big Data SQL is security. You access the data stored in HDFS (or other sources) through Oracle Database, which means you can apply many database features to it, such as Data Redaction, Virtual Private Database (VPD) or Database Vault. These features, together with the database schema/grant privilege model, allow you to protect the data from the database side (when an intruder tries to reach the data through the database).
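As a quick illustration of these database-side controls, here is a minimal sketch of a Data Redaction policy that fully masks a column of an external table over HDFS data; the schema, table and column names (HR, EMP_EXT, SALARY) and the exemption condition are just assumptions for the example:

SQL> begin
DBMS_REDACT.ADD_POLICY(
    object_schema => 'HR',                -- hypothetical schema
    object_name   => 'EMP_EXT',           -- hypothetical external table over HDFS data
    policy_name   => 'redact_salary',
    column_name   => 'SALARY',
    function_type => DBMS_REDACT.FULL,    -- replace the value with the default redaction value
    expression    => 'SYS_CONTEXT(''USERENV'',''SESSION_USER'') != ''HR_ADMIN'''
);
end;
/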

But it's also important to keep in mind that data stored in HDFS may be needed by other engines (Spark, Solr, Impala...), and those need their own protection mechanisms. In the Hadoop world, Kerberos is the most popular way to protect data (it is the authentication method). Kerberos, in conjunction with HDFS ACLs, gives you the opportunity to protect data at the file system level. HDFS, as a file system, has the concept of users and groups, and the files you store in HDFS have separate privileges for the owner, the group, and everyone else.
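For example, on the Hadoop side you can tighten a directory with plain POSIX-style permissions, or grant a specific group read access with an HDFS ACL; the group name "analysts" below is an assumption, and the path is the one used later in this post:

$ hadoop fs -chmod 750 /data/weather                      # owner full, group read/execute, others nothing
$ hadoop fs -setfacl -m group:analysts:r-x /data/weather  # additionally let the analysts group read the directory
$ hadoop fs -getfacl /data/weather                        # verify the resulting ACL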

Conclusion: to work with Kerberized clusters, Big Data SQL needs a valid Kerberos ticket in order to access HDFS files. Fortunately, all of this setup is automated and available within the standard Oracle Big Data SQL installer. For more details, please check here.
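On a working installation you can quickly verify that the oracle user holds a valid ticket on the database and Hadoop nodes; the principal and keytab names below are examples only, since the installer normally configures and renews the ticket for you:

$ klist                                          # show the cached ticket of the oracle user
$ kinit -kt /home/oracle/oracle.keytab oracle    # renew the ticket manually from a keytab, if ever needed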

Big Data SQL and Kerberos.

Well, usually customers have a Kerberized cluster, and to work with it we need a valid Kerberos ticket. But this raises the question: which principal do you need for Big Data SQL?

The answer is easy: oracle. In prior Big Data SQL releases, all Big Data SQL queries ran on the Hadoop cluster as the same user: oracle. This has the following consequences:

- Unable to authorize access to data based on the user that is running a query

- Hadoop cluster audits show that all data queried through Big Data SQL was accessed by oracle

But what if the data is already used by other applications and carries different privileges (belonging to different users and groups)? That is why Big Data SQL 3.2 introduces a new feature: Multi-User Authorization.

Hadoop impersonation.

At the foundation of Multi-User Authorization lies a Hadoop feature called impersonation. I took this description from here:

"A superuser with username ‘super’ wants to submit job and access hdfs on behalf of a user joe. The superuser has Kerberos credentials but user joe doesn’t have any. The tasks are required to run as user joe and any file accesses on namenode are required to be done as user joe. It is required that user joe can connect to the namenode or job tracker on a connection authenticated with super’s Kerberos credentials. In other words super is impersonating the user joe."

In the same manner, "oracle" is the superuser and the other users are impersonated.
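Under the hood, impersonation is driven by the proxyuser settings in core-site.xml; a typical entry for the oracle superuser looks roughly like this (the wildcard values are examples, and on a Big Data SQL cluster the installer and cluster manager normally maintain these settings for you):

<property>
  <name>hadoop.proxyuser.oracle.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.oracle.groups</name>
  <value>*</value>
</property>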

Multi-User Authorization key concepts.

1) Big Data SQL will identify the trusted user that is accessing data on the cluster.  By executing the query as the trusted user:

- Authorization rules specified in Hadoop will be respected
- Authorization rules specified in Hadoop do not need to be replicated in the database
- Hadoop cluster audits identify the actual Big Data SQL query user

2) Consider the Oracle Database as the entity that is providing the trusted user to Hadoop

3) Must map the database user that is running a query in Oracle Database to a Hadoop user

4) Must identify the actual user that is querying the Oracle table and pass that identity to Hadoop 
- This may be an Oracle Database user (i.e. schema)
- Lightweight user comes from session-based contexts (see SYS_CONTEXT)
- The user/group mapping must be available through OS lookup in Hadoop

Demonstration.

You can find the full documentation for this feature here; now I'm going to show a few of the most popular cases with code examples.

To work with the relevant objects, you need to grant the following permissions to the user who will manage the mapping table:

SQL> grant select on BDSQL_USER_MAP to bikes;
SQL> grant execute on DBMS_BDSQL to bikes;
SQL> grant BDSQL_ADMIN to bikes;

In my case, this is the user "bikes".

Just in case, clean up any existing mapping for the user BIKES:

SQL> begin
DBMS_BDSQL.REMOVE_USER_MAP
(current_database_user =>'BIKES');
end;
/ 

Check that the mapping table is empty:

SQL> select * from SYS.BDSQL_USER_MAP;

And after this, run a query:

SQL> select /*+ MONITOR */ * from bikes.weather_ext;

This is the default mode, without any mapping, so I expect to access HDFS as the oracle user.

To double-check this, I review the audit files:

$ cd /var/log/hadoop-hdfs
$ tail -f hdfs-audit.log |grep central_park
2018-03-01 17:42:10,938 INFO ... ugi=oracle ... ip=/10.0.0.10 cmd=open ... src=/data/weather/central_park_weather.csv..

Here it's clear that the oracle user reads the file (ugi=oracle).

Let's check the permissions of the file behind this external table:

$ hadoop fs -ls /data/weather/central_park_weather.csv
-rw-r--r--   3 oracle oinstall      26103 2017-10-24 13:03 /data/weather/central_park_weather.csv

So everybody may read it. Remember this, and let's try to create the first mapping.

SQL> begin
DBMS_BDSQL.ADD_USER_MAP(
    current_database_user =>'BIKES',
    syscontext_namespace  => null,
    syscontext_parm_hadoop_user => 'user1'
);
end;
/ 

This mapping says that the database user BIKES will always be mapped to the OS user user1.

Run the query again and check which user reads the file:

SQL> select /*+ MONITOR */ * from bikes.weather_ext;
$ cd /var/log/hadoop-hdfs
$ tail -f hdfs-audit.log |grep central_park
2018-03-01 17:42:10,938 INFO ... ugi=user1... ip=/10.0.0.10 cmd=open ... src=/data/weather/central_park_weather.csv..

It's interesting that user1 doesn't exist on the Hadoop OS:

# id user1
id: user1: No such user

If the user doesn't exist in the OS (as in the user1 case), it can only read files that are readable by everyone. Let me revoke the read permission from "others" and run the query again:

$ sudo -u hdfs hadoop fs -chmod 640 /data/weather/central_park_weather.csv
$ hadoop fs -ls /data/weather/central_park_weather.csv
-rw-r-----   3 oracle oinstall      26103 2017-10-24 13:03 /data/weather/central_park_weather.csv

Now it fails. To make it work, I can create a "user1" account on each Hadoop node and add it to the oinstall group:

$ useradd user1
$ usermod -a -G oinstall user1

Run the query again and check which user reads the file:

SQL> select /*+ MONITOR */ * from bikes.weather_ext;
$ cd /var/log/hadoop-hdfs
$ tail -f hdfs-audit.log |grep central_park
2018-03-01 17:42:10,938 INFO ... ugi=user1... ip=/10.0.0.10 cmd=open ... src=/data/weather/central_park_weather.csv..

Here we are! We can read the file thanks to the group permissions.

What if I want to map this schema to hdfs or some other powerful user? Let's try:

SQL> begin
DBMS_BDSQL.REMOVE_USER_MAP
(current_database_user =>'BIKES');
DBMS_BDSQL.ADD_USER_MAP(
    current_database_user =>'BIKES',
    syscontext_namespace  => null,
    syscontext_parm_hadoop_user => 'hdfs'
);
end;
/ 

The reason we get an exception here is that the hdfs user is on the impersonation blacklist:

$ cat $ORACLE_HOME/bigdatasql/databases/orcl/bigdata_config/bigdata.properties| grep impersonation
....
# Impersonation properties
impersonation.enabled=true
impersonation.blacklist='hue','yarn','oozie','smon','mapred','hdfs','hive','httpfs','flume','HTTP','bigdatamgr','oracle'
...

The second scenario is authorization with a thin client, using CLIENT_IDENTIFIER. In a multi-tier architecture (an application tier on top of the database tier), it can be challenging to differentiate the multiple users of the same application who share the same database schema.

Below is an example that illustrates this: an application connects to the database as the HR_APP user, but many people use this application and this database login. To differentiate these human users we may use the dbms_session.set_identifier procedure (you can find more details here).

So, the Big Data SQL multi-user authorization feature allows you to use a SYS_CONTEXT value for authorization on Hadoop.

Below is a test case that illustrates this.

-- Remove previous rule, related with BIKES user --
SQL> begin
DBMS_BDSQL.REMOVE_USER_MAP
(current_database_user =>'BIKES');
end;
/ 

-- Add a new rule, which says that if the database user is BIKES, the Hadoop user has to be taken from USERENV as CLIENT_IDENTIFIER --
SQL> begin
DBMS_BDSQL.ADD_USER_MAP(
    current_database_user =>'BIKES',
    syscontext_namespace  => 'USERENV',
    syscontext_parm_hadoop_user => 'CLIENT_IDENTIFIER'
);
end;
/

--Check current database user (schema) --
SQL> select user from dual;

BIKES

-- Check CLIENT_IDENTIFIER from USERENV --
SQL> select SYS_CONTEXT('USERENV', 'CLIENT_IDENTIFIER') from dual;

NULL

-- Run any query against Hadoop --
SQL> select /*+ MONITOR */ * from bikes.weather_ext;

-- check in the Hadoop audit logs --
-bash-4.1$ tail -f hdfs-audit.log |grep central_park
2018-03-01 18:14:40 ... ugi=oracle ... src=/data/weather/central_park_weather.csv

-- Set CLIENT_IDENTIFIER --
SQL> begin
dbms_session.set_IDENTIFIER('Alexey');
end;
/

-- Check CLIENT_IDENTIFIER for current session --
SQL> select SYS_CONTEXT('USERENV', 'CLIENT_IDENTIFIER') from dual;
Alexey

-- Run the query again over HDFS data --
SQL> select /*+ MONITOR */ * from bikes.weather_ext;

-- check in the Hadoop audit logs: --
-bash-4.1$  tail -f hdfs-audit.log |grep central_park
2018-03-01 18:17:43 ... ugi=Alexey ... src=/data/weather/central_park_weather.csv

The third way to handle authentication is to use the user's authenticated identity. Users connecting to the database (via Kerberos, as a database user, etc.) have their authenticated identity passed to Hadoop. To make it work, simply run:

SQL> begin
DBMS_BDSQL.ADD_USER_MAP(
    current_database_user => '*' ,
    syscontext_namespace  => 'USERENV',
    syscontext_parm_hadoop_user => 'AUTHENTICATED_IDENTITY');
end;
/ 

and after this your user on HDFS will be whatever is returned by:

SQL> select SYS_CONTEXT('USERENV', 'AUTHENTICATED_IDENTITY') from dual;

BIKES

For example, since I logged on to the database as BIKES (a database user), on HDFS I'll be identified as the bikes user:

-bash-4.1 $  tail -f hdfs-audit.log |grep central_park
2018-03-01 18:23:23 ... ugi=bikes... src=/data/weather/central_park_weather.csv

To check all the rules you have for multi-user authorization, you may run the following query:

SQL> select * from SYS.BDSQL_USER_MAP;
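With the rules created above, the output should contain something along these lines (the column names follow the parameters of DBMS_BDSQL.ADD_USER_MAP; the exact formatting may differ):

DATABASE_USER_NAME   SYSCONTEXT_NAMESPACE   SYSCONTEXT_PARM_HADOOP_USER
------------------   --------------------   ---------------------------
BIKES                USERENV                CLIENT_IDENTIFIER
*                    USERENV                AUTHENTICATED_IDENTITY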

I hope this feature helps you build a robust security bastion around your data in HDFS.
