Tuesday Mar 15, 2016

Big Data SQL Quick Start. Parallel Query - Part3.

[Read More]

Sunday Mar 06, 2016

Hadoop Compression. Choosing compression codec. Part2.

Many customers are keep asking me about "default" (single) compression codec for Hadoop. Actually answer on this question is not so easy and let me explain why.

Bzip2 or not Bzip2?

In my previous blogpost I published results of the compression rate for some particular compression codecs into Hadoop. Based on those results you may think that it’s a good idea to compress everything with bzip2. But be careful with this. Within the same research, I noted that bzip2 actually has on average 3 times worse performance than Gzip for querying (decompress) and archive (compress) data (it’s not surprising based on the complexity of algorithm).  Are you ready to sacrifice performance? I think it will depend on the compression benefits derived from bzip2 and the frequency of querying this data (compression speed is not so import after data is stored in Hadoop systems since you usually compress data once and read it many times).  On average, bzip2 is 1.6 times better than gzip.  But, again my research showed that sometimes you can achieve 2.3 times better compression, while other times you may gain only 9% of the disk space usage (and performance is still much worse compared to gzip and other codecs). Second factor to keep in mind is the frequency of data querying and your performance SLAs. If you don’t care about query performance (don’t have any SLAs) and you select this data very rarely – bzip2 could be good a candidate.  Otherwise consider other options. I encourage you to benchmark your own data and decide for yourself “Bzip2 or not Bzip2”.

[Read More]

Thursday Feb 04, 2016

Hadoop Compression. Compression rate. – Part1.

Compression codecs.

Text files (csv with “,” delimiter):

Codec Type  Average rate  Minimum rate  Maximum rate
bzip2 17.36 3.88 61.81
gzip 9.73 2.9 26.55
lz4 4.75 1.66 8.71
snappy 4.19 1.61 7.86
lzo 3.39 2 5.39

RC File: 

Codec Type Average rate Minimum rate Maximum rate
 bzip2 17.51 4.31 54.66
 gzip 13.59 3.71 44.07
 lz4 7.12 2 21.23
 snappy 6.02 2.04  15.38
 lzo 4.37 2.33 7.02

Parquet file:

Codec Type Average rate Minimum rate Maximum rate
 gzip 17.8 3.9 60.35
 snappy 12.92 2.63 45.99

[Read More]

Tuesday Jan 19, 2016

Big Data SQL Quick Start. Introduction - Part1.

Today I am going to explain steps that required to start working with Big Data SQL. It’s really easy!  I hope after this article you all will agree with me. First, if you want to get caught up on what Big Data SQL is, I recommend that you read these blogs: Oracle Big Data SQL: One Fast Query, Big Data SQL 2.0 - Now Available.

The above blogs cover design goals of Big Data SQL. One of the goals of Big Data SQL is transparency. You just define table that links to some directory in HDFS or some table in HCatalog and continue working with it like with general Oracle Database table.It’s also useful to read the product documentation.

Your first query with Big Data SQL

Let’s start with simplest one example and query data that is actually stored in HDFS via Oracle Database using Big Data SQL. I’m going to begin this example by checking of the data that actually lies into HDFS. To accomplish this, I run the hive console and check hive table DDL:

[Read More]

Thursday Jan 07, 2016

Data loading into HDFS - Part1

Today I’m going to start first article that will be devoted by very important topic in Hadoop world – data loading into HDFS. Before all, let me explain different approaches of loading and processing data in different IT systems.

Schema on Read vs Schema on Write

So, when we talking about data loading, usually we do this into system that could belong on one of two types.  One of this is schema on write. With this approach we have to define columns, data formats and so on. During the reading  every user will observe the same data set. As soon as we performed ETL (transform data in format that mostly convenient to some particular system), reading will be pretty fast and overall system performance will be pretty good. But you should keep in mind, that we already paid penalty for this when were loading data. Like example of schema on write system you could consider Relational data base, for example, like Oracle or MySQL.


Schema on Write

Another approach is schema on read. In this case we load data as-is without any changing and transformations.  With this approach we skip ETL (don’t transform data) step and we don’t have any headaches with data format and data structure. Just load file on file system, like coping photos from FlashCard or external storage to your laptop’s disk. How to interpret data you will decide during the data reading. Interesting stuff that the same data (same files) could be read in different manner. For instance, if you have some binary data and you have to define Serialization/Deserialization framework and using it within your select, you will have some structure data, otherwise you will get set of the bytes. Another example, even if you have simplest CSV files you could read the same column like a Numeric or like a String. It will affect on different results for sorting or comparison operations.

Schema on Read

Hadoop Distributed File System is classical example of schema on read system.More details about Schema on Read and Schema on Write approach you could findhere. Now we are going to talk about data loading data into HDFS. I hope after explanation above, you understand that data loading into Hadoop is not equal of ETL (data doesn’t transform).

[Read More]

Tuesday Jul 15, 2014

Oracle Big Data SQL: One Fast Query, All Your Data

Introduction

Today we're pleased to announce Big Data SQL, Oracle's unique approach to providing unified query over data in Oracle Database, Hadoop, and select NoSQL datastores.  Big Data SQL has been in development for quite a while now, and will be generally available in a few months.  With today's announcement of the product, I wanted to take a chance to explain what we think is important and valuable about Big Data SQL.

SQL on Hadoop

As anyone paying attention to the Hadoop ecosystem knows, SQL-on-Hadoop has seen a proliferation of solutions in the last 18 months, and just as large a proliferation of press.  From good, ol' Apache Hive to Cloudera Impala and SparkSQL, these days you can have SQL-on-Hadoop any way you like it.  It does, however, prompt the question: Why SQL?

There's an argument to be made for SQL simply being a form of skill reuse.  If people and tools already speak SQL, then give the people what they know.  In truth, that argument falls flat when one considers the sheer pace at which the Hadoop ecosystem evolves.  If there were a better language for querying Big Data, the community would have turned it up by now.

I think the reality is that the SQL language endures because it is uniquely suited to querying datasets.  Consider, SQL is a declarative language for operating on relations in data.  It's a domain-specific language where the domain is datasets.  In and of itself, that's powerful: having language elements like FROM, WHERE and GROUP BY make reasoning about datasets simpler.  It's set theory set into a programming language.

It goes beyond just the language itself.  SQL is declarative, which means I only have to reason about the shape of the result I want, not the data access mechanisms to get there, the join algorithms to apply, how to serialize partial aggregations, and so on.  SQL lets us think about answers, which lets us get more done.

SQL on Hadoop, then, is somewhat obvious.  As data gets bigger, we would prefer to only have to reason about answers.

SQL On More Than Hadoop

For all the obvious goodness of SQL on Hadoop, there's a somewhat obvious drawback.  Specifically, data rarely lives in a single place.  Indeed, if Big Data is causing a proliferation of new ways to store and process data, then there are likely more places to store data then every before.  If SQL on Hadoop is separate from SQL on a DBMS, I run the risk of constructing every IT architect's least favorite solution: the stovepipe.

If we want to avoid stovepipes, what we really need is the ability to run SQL queries that work seamlessly across multiple datastores.  Ideally, in a Big Data world, SQL should "play data where it lies," using the declarative power of the language to provide answers from all data.

This is why we think Oracle Big Data SQL is obvious too.

It's just a little more complicated than SQL on any one thing.  To pull it off, we have to do a few things:

  • Maintain the valuable characteristics of the system storing the data
  • Unify metadata to understand how to execute queries
  • Optimize execution to take advantage of the systems storing the data

For the case of a relational database, we might say that the valuable storage characteristics include things like: straight-through processing, change-data logging, fine-grained access controls, and a host of other things.

For Hadoop, I believe that the two most valuable storage characteristics are scalability and schema-on-read.  Cost-effective scalability is one of the first things that people look to HDFS for, so any solution that does SQL over a relational database and Hadoop has to understand how HDFS scales and distributes data.  Schema-on-read is at least equally important if not more.  As Daniel Abadi recently wrote, the flexibility of schema-on-read is gives Hadoop tremendous power: dump data into HDFS, and access it without having to convert it to a specific format.  So, then, any solution that does SQL over a relational database and Hadoop is going to have to respect the schemas of the database, but be able to really apply schema-on-read principals to data stored in Hadoop.

Oracle Big Data SQL maintains all of these valuable characteristics, and it does it specifically through the approaches taken for unifying metadata and optimizing performance.

Big Data SQL queries data in a DBMS and Hadoop by unifying metadata and optimizing performance.

Unifying Metadata

To unify metadata for planning and executing SQL queries, we require a catalog of some sort.  What tables do I have?  What are their column names and types?  Are there special options defined on the tables?  Who can see which data in these tables?

Given the richness of the Oracle data dictionary, Oracle Big Data SQL unifies metadata using Oracle Database: specifically as external tables.  Tables in Hadoop or NoSQL databases are defined as external tables in Oracle.  This makes sense, given that the data is external to the DBMS.

Wait a minute, don't lots of vendors have external tables over HDFS, including Oracle?

 Yes, but Big Data SQL provides as an external table is uniquely designed to preserve the valuable characteristics of Hadoop.  The difficulty with most external tables is that they are designed to work on flat, fixed-definition files, not distributed data which is intended to be consumed through dynamically invoked readers.  That causes both poor parallelism and removes the value of schema-on-read.

  The external tables Big Data SQL presents are different.  They leverage the Hive metastore or user definitions to determine both parallelism and read semantics.  That means that if a file in HFDS is 100 blocks, Oracle database understands there are 100 units which can be read in parallel.  If the data was stored in a SequenceFile using a binary SerDe, or as Parquet data, or as Avro, that is how the data is read.  Big Data SQL uses the exact same InputFormat, RecordReader, and SerDes defined in the Hive metastore to read the data from HDFS.

Once that data is read, we need only to join it with internal data and provide SQL on Hadoop and a relational database.

Optimizing Performance

Being able to join data from Hadoop with Oracle Database is a feat in and of itself.  However, given the size of data in Hadoop, it ends up being a lot of data to shift around.  In order to optimize performance, we must take advantage of what each system can do.

In the days before data was officially Big, Oracle faced a similar challenge when optimizing Exadata, our then-new database appliance.  Since many databases are connected to shared storage, at some point database scan operations can become bound on the network between the storage and the database, or on the shared storage system itself.  The solution the group proposed was remarkably similar to much of the ethos that infuses MapReduce and Apache Spark: move the work to the data and minimize data movement.

The effect is striking: minimizing data movement by an order of magnitude often yields performance increases of an order of magnitude.

Big Data SQL takes a play from both the Exadata and Hadoop books to optimize performance: it moves work to the data and radically minimizes data movement.  It does this via something we call Smart Scan for Hadoop.

Moving the work to the data is straightforward.  Smart Scan for Hadoop introduces a new service into to the Hadoop ecosystem, which is co-resident with HDFS DataNodes and YARN NodeManagers.  Queries from the new external tables are sent to these services to ensure that reads are direct path and data-local.  Reading close to the data speeds up I/O, but minimizing data movement requires that Smart Scan do some things that are, well, smart.

Smart Scan for Hadoop

Consider this: most queries don't select all columns, and most queries have some kind of predicate on them.  Moving unneeded columns and rows is, by definition, excess data movement and impeding performance.  Smart Scan for Hadoop gets rid of this excess movement, which in turn radically improves performance.

For example, suppose we were querying a 100 of TB set of JSON data stored in HDFS, but only cared about a few fields -- email and status -- and only wanted results from the state of Texas.

Once data is read from a DataNode, Smart Scan for Hadoop goes beyond just reading.  It applies parsing functions to our JSON data, discards any documents which do not contain 'TX' for the state attribute.  Then, for those documents which do match, it projects out only the email and status attributes to merge with the rest of the data.  Rather than moving every field, for every document, we're able to cut down 100s of TB to 100s of GB.

The approach we take to optimizing performance with Big Data SQL makes Big Data much slimmer.

Summary

So, there you have it: fast queries which join data in Oracle Database with data in Hadoop while preserving the makes each system a valuable part of overall information architectures.  Big Data SQL unifies metadata, such that data sources can be queried with the best possible parallelism and the correct read semantics.  Big Data SQL optimizes performance using approaches inspired by Exadata: filtering out irrelevant data before it can become a bottleneck.

It's SQL that plays data where it lies, letting you place data where you think it belongs.

[Read More]

Thursday Jul 18, 2013

Practical HDFS Permissions

Introduction

Documentation and most discussions are quick to point out that HDFS provides OS-level permissions on files and directories.  However, there is less readily-available information about what the effects of OS-level permissions are on accessing data in HDFS via higher-level abstractions such as Hive or Pig.  To provide a bit of clarity, I decided to run through the effects of permissions on different interactions with HDFS.

The Setup

In this scenario, we have three users: oracle, dan, and not_dan.  The oracle user has captured some data in an HDFS directory.  The directory has 750 permissions: read/write/execute for oracle, read/execute for dan, and no access for not_dan.  One of the files in the directory has 700 permissions, meaning that only the oracle user can read it.  Each user will tries to do the following tasks:

  • List the contents of the directory
  • Count the lines in a subset of files including the file with 700 permissions
  • Run a simple Hive query over the directory

Listing Files

Each user issues the command

hadoop fs -ls /user/shared/moving_average|more

And what do they see:

[oracle@localhost ~]$ hadoop fs -ls /user/shared/moving_average|more

Found 564 items

Obviously, the oracle user can see all the files in its own directory.

[dan@localhost oracle]$ hadoop fs -ls /user/shared/moving_average|more
Found 564 items

Similarly, since dan has group read access, that user can also list all the files. The user without group read permissions, however, receives an error.

[not_dan@localhost oracle]$ hadoop fs -ls /user/shared/moving_average|more

ls: Permission denied: user=not_dan, access=READ_EXECUTE,

inode="/user/shared/moving_average":oracle:shared_hdfs:drwxr-x---

Counting Rows in the Shell

In this test, each user pipes a set of HDFS files into a unix command and counts rows.  Recall, one of the files has 700 permissions.

The oracle user, again, can see all the available data:

[oracle@localhost ~]$ hadoop fs -cat /user/shared/moving_average/FlumeData.137408218405*|wc -l
40

The user with partial permissions receives an error on the console, but can access the data they have permissions on.  Naturally, the user without permissions only receives the error.

[dan@localhost oracle]$ hadoop fs -cat /user/shared/moving_average/FlumeData.137408218405*|wc -l
cat: Permission denied: user=dan, access=READ, inode="/user/shared/moving_average/FlumeData.1374082184056":oracle:shared_hdfs:-rw-------
30
[not_dan@localhost oracle]$ hadoop fs -cat /user/shared/moving_average/FlumeData.137408218405*|wc -l
cat: Permission denied: user=not_dan, access=READ_EXECUTE, inode="/user/shared/moving_average":oracle:shared_hdfs:drwxr-x---
0

Permissions on Hive

In this final test, the oracle user defines an external Hive table over the shared directory.  Each user issues a simple COUNT(*) query against the directory.  Interestingly, the results are not the same as piping the datastream to the shell.

The oracle user's query runs correctly, while both dan and not_dan's queries fail:

As dan

Job Submission failed with exception 'java.io.FileNotFoundException(File /user/shared/moving_average/FlumeData.1374082184056 does not exist)'


As not_dan

Job Submission failed with exception 'org.apache.hadoop.security.AccessControlException
(Permission denied: user=not_dan, access=READ_EXECUTE,
inode="/user/shared/moving_average":oracle:shared_hdfs:drwxr-x-


So, what's going on here? In each case, the query fails, but for different reasons. In the case of not_dan, the query fails because the user has no permissions on the directory. However, the query issued by dan fails because of a FileNotFound exception. Because dan does not have read permissions on the file, Hive cannot find all the files necessary to build the underlying MapReduce job. Thus, the query fails before being submitted to the JobTracker.  The rule then, becomes simple: to issue a Hive query, a user must have read permissions on all files read by the query. If a user has permissions on one set of partition directories,  but not another, they can issue queries against the readable partitions, but not against the entire table.

 Summary


In a nutshell, the OS-level permissions of HDFS behave just as we would expect in the shell. However, problems can arise when tools like Hive or Pig try to construct MapReduce jobs. As a best practice, permissions structures should be tested against the tools which will access the data. This ensures that users can read
what they are allowed to, in the manner that they need to. 

[Read More]
About

The data warehouse insider is written by the Oracle product management team and sheds lights on all thing data warehousing and big data.

Search

Archives
« May 2016
SunMonTueWedThuFriSat
1
3
4
5
6
7
8
9
10
11
12
14
15
16
18
20
21
22
23
24
25
26
27
28
29
30
31
    
       
Today