By dan.mcclary on Jul 18, 2013
Documentation and most discussions are quick to point out that HDFS provides OS-level permissions on files and directories. However, there is less readily-available information about what the effects of OS-level permissions are on accessing data in HDFS via higher-level abstractions such as Hive or Pig. To provide a bit of clarity, I decided to run through the effects of permissions on different interactions with HDFS.
In this scenario, we have three users: oracle, dan, and not_dan. The oracle user has captured some data in an HDFS directory. The directory has 750 permissions: read/write/execute for oracle, read/execute for the directory's group (which includes dan), and no access for anyone else, including not_dan. One of the files in the directory has 700 permissions, meaning that only the oracle user can read it. Each user will try to do the following tasks:
- List the contents of the directory
- Count the lines in a subset of files including the file with 700 permissions
- Run a simple Hive query over the directory
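To make the 750 and 700 modes in this scenario concrete, here is a minimal sketch that decodes a three-digit octal mode into the rwx string that HDFS prints in listings and error messages. The function names digit_to_rwx and mode_to_rwx are my own, not part of Hadoop, and the sketch ignores special bits such as the sticky bit.

```shell
# Convert one octal digit (0-7) to its rwx triplet.
digit_to_rwx() {
  case "$1" in
    0) echo '---' ;;
    1) echo '--x' ;;
    2) echo '-w-' ;;
    3) echo '-wx' ;;
    4) echo 'r--' ;;
    5) echo 'r-x' ;;
    6) echo 'rw-' ;;
    7) echo 'rwx' ;;
    *) echo "invalid digit: $1" >&2; return 1 ;;
  esac
}

# Convert a three-digit octal mode such as 750 into owner/group/other triplets.
# Hypothetical helper for illustration only; it does not talk to HDFS.
mode_to_rwx() {
  mode=$1
  owner=$(digit_to_rwx "${mode%??}") || return 1   # first digit: owner
  rest=${mode#?}
  group=$(digit_to_rwx "${rest%?}") || return 1    # second digit: group
  other=$(digit_to_rwx "${mode#??}") || return 1   # third digit: everyone else
  echo "${owner}${group}${other}"
}

mode_to_rwx 750   # the shared directory
mode_to_rwx 700   # the restricted file
```

The first call prints rwxr-x---, which matches the drwxr-x--- reported for the directory later in this post. The error output for the restricted file shows -rw-------, which reads as 600 rather than 700; for read access the effect is the same, since only the owner can read the file.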
Each user issues the command
hadoop fs -ls /user/shared/moving_average|more
And what do they see:
[oracle@localhost ~]$ hadoop fs -ls /user/shared/moving_average|more
Found 564 items
Obviously, the oracle user can see all the files in its own directory.
[dan@localhost oracle]$ hadoop fs -ls /user/shared/moving_average|more
Found 564 items
Similarly, since dan has group read access, that user can also list all the files. The user without group read permissions, however, receives an error.
[not_dan@localhost oracle]$ hadoop fs -ls /user/shared/moving_average|more
ls: Permission denied: user=not_dan, access=READ_EXECUTE,
Counting Rows in the Shell
In this test, each user pipes a set of HDFS files into a Unix command and counts rows. Recall that one of the files has 700 permissions.
The oracle user, again, can see all the available data:
[oracle@localhost ~]$ hadoop fs -cat /user/shared/moving_average/FlumeData.137408218405*|wc -l
40
The user with partial permissions receives an error on the console, but still gets a count for the files they are allowed to read. Naturally, the user without permissions receives only the error and a count of 0.
[dan@localhost oracle]$ hadoop fs -cat /user/shared/moving_average/FlumeData.137408218405*|wc -l
cat: Permission denied: user=dan, access=READ, inode="/user/shared/moving_average/FlumeData.1374082184056":oracle:shared_hdfs:-rw-------
30
[not_dan@localhost oracle]$ hadoop fs -cat /user/shared/moving_average/FlumeData.137408218405*|wc -l
cat: Permission denied: user=not_dan, access=READ_EXECUTE, inode="/user/shared/moving_average":oracle:shared_hdfs:drwxr-x---
0
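The reason dan still gets a count is that the permission error is written to standard error, while the readable rows travel down the pipe on standard output. The sketch below separates the two streams so the errors can be inspected afterwards. Both count_with_errors and fake_cat are hypothetical names of my own; fake_cat is only a stand-in that mimics the shape of dan's run so the example works without a cluster.

```shell
# Hypothetical stand-in for `hadoop fs -cat`: three data rows on stdout
# and one permission error on stderr.
fake_cat() {
  printf 'row1\nrow2\nrow3\n'
  echo 'cat: Permission denied: user=dan, access=READ' >&2
}

# Run a command, save its stderr to the file named in $1,
# and print the number of lines it wrote to stdout.
count_with_errors() {
  err_file=$1
  shift
  "$@" 2>"$err_file" | wc -l | tr -d ' '
}

err_log=$(mktemp)
count_with_errors "$err_log" fake_cat   # prints the row count only
cat "$err_log"                          # shows what was denied
rm -f "$err_log"
```

Against a real cluster, the same helper would be called as count_with_errors errors.log hadoop fs -cat followed by the path pattern, with errors.log being any writable local file of your choosing.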
Permissions on Hive
In this final test, the oracle user defines an external Hive table over the shared directory. Each user issues a simple COUNT(*) query against the directory. Interestingly, the results are not the same as piping the datastream to the shell.
The oracle user's query runs correctly, while both dan and not_dan's queries fail:
Job Submission failed with exception 'java.io.FileNotFoundException(File /user/shared/moving_average/FlumeData.1374082184056 does not exist)'
Job Submission failed with exception 'org.apache.hadoop.security.AccessControlException (Permission denied: user=not_dan, access=READ_EXECUTE,
So, what's going on here? In each case, the query fails, but for different reasons. In the case of not_dan, the query fails because the user has no permissions on the directory. However, the query issued by dan fails with a FileNotFoundException. Because dan does not have read permissions on the file, Hive cannot find all the files necessary to build the underlying MapReduce job. Thus, the query fails before being submitted to the JobTracker. The rule, then, becomes simple: to issue a Hive query, a user must have read permissions on all files read by the query. If a user has permissions on one set of partition directories, but not another, they can issue queries against the readable partitions, but not against the entire table.
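Given that rule, it can help to check a directory for unreadable files before pointing a Hive query at it. The following is a rough sketch, not a complete permission checker: unreadable_for is a hypothetical helper of my own that reads hadoop fs -ls output on standard input and prints every path the named user cannot read. It assumes the usual listing columns (permissions, replication, owner, group, size, date, time, path), assumes the user belongs to exactly one group, and ignores ACLs, the HDFS superuser, and paths containing spaces.

```shell
# Print paths from `hadoop fs -ls` output (on stdin) that a user cannot read.
# $1 = user name, $2 = the single group that user is assumed to belong to.
unreadable_for() {
  awk -v user="$1" -v group="$2" '
    NF >= 8 {                        # skips the "Found N items" header line
      perms = $1; owner = $3; grp = $4
      if (owner == user)     bit = substr(perms, 2, 1)   # owner read bit
      else if (grp == group) bit = substr(perms, 5, 1)   # group read bit
      else                   bit = substr(perms, 8, 1)   # other read bit
      if (bit != "r") print $NF
    }'
}

# Example run (requires a cluster; the group name comes from the errors above):
# hadoop fs -ls /user/shared/moving_average | unreadable_for dan shared_hdfs
```

If the command prints nothing, every listed file is readable by that user under these assumptions. Any path it does print is a file that would be expected to make a full-table query fail in the way dan's did.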
In a nutshell, the OS-level permissions of HDFS behave just as we would expect in the shell. However, problems can arise when tools like Hive or Pig try to construct MapReduce jobs. As a best practice, permissions structures should be tested against the tools which will access the data. This ensures that users can read what they are allowed to, in the manner that they need to.