Tuesday Mar 22, 2016
By Alexey Filanovskiy-Oracle on Mar 22, 2016
Today I’m starting the first article in a series devoted to a very important topic in the Hadoop world: loading data into HDFS. Before anything else, let me explain the different approaches to loading and processing data found in different IT systems.
Schema on Read vs Schema on Write
So, when we talk about data loading, we are usually loading into a system that belongs to one of two types. The first is schema on write. With this approach we have to define columns, data formats, and so on up front, before any data is loaded. During reads, every user then observes the same data set. Because we have already performed ETL (transformed the data into the format most convenient for this particular system), reads are fast and overall system performance is good. But keep in mind that we paid the penalty for this up front, while loading the data. A classic example of a schema-on-write system is a relational database such as Oracle or MySQL.
Schema on Write
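As a minimal sketch of schema on write, here is the idea in miniature using SQLite from the Python standard library (the table and column names are made up for illustration): the structure is declared before any data arrives, the transformation cost is paid at load time, and every reader afterwards sees the same typed columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is fixed up front -- this is the "write-time" contract.
conn.execute("CREATE TABLE sales (item TEXT, amount INTEGER)")

# Loading pays the transformation cost: values are stored according to
# the declared column types as they are written.
conn.execute("INSERT INTO sales VALUES (?, ?)", ("apples", 10))
conn.execute("INSERT INTO sales VALUES (?, ?)", ("pears", 9))

# Every reader now observes the same structured, typed data set.
rows = conn.execute("SELECT item, amount FROM sales ORDER BY amount").fetchall()
print(rows)  # [('pears', 9), ('apples', 10)]
```

Note that the sort on `amount` is numeric because the column type was fixed at write time; no reader can accidentally interpret it differently.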
The other approach is schema on read. In this case we load the data as-is, without any changes or transformations. We skip the ETL step entirely and have no headaches with data formats and structures at load time: we just put the file on the file system, like copying photos from a flash card or external storage to a laptop’s disk. How to interpret the data is decided at read time. Interestingly, the same data (the same files) can be read in different ways. For instance, if you have binary data, you have to define a serialization/deserialization framework and use it within your query in order to get structured data back; otherwise you just get a set of bytes. As another example, even with the simplest CSV file you can read the same column as a numeric or as a string, which leads to different results for sorting and comparison operations.
Schema on Read
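The CSV point above can be sketched in a few lines of Python (the file content is a made-up example): the raw bytes are loaded as-is, and two readers interpret the same column differently at read time, getting different sort orders.

```python
import csv
import io

# Data loaded "as-is" -- no ETL, no declared column types.
raw = "id,value\n1,10\n2,9\n3,100\n"

rows = list(csv.DictReader(io.StringIO(raw)))

# Reader A interprets "value" as a string: lexicographic order.
as_strings = sorted(r["value"] for r in rows)
print(as_strings)  # ['10', '100', '9']

# Reader B interprets the same column as a number: numeric order.
as_numbers = sorted(int(r["value"]) for r in rows)
print(as_numbers)  # [9, 10, 100]
```

The schema lives in the reader, not in the stored data, so nothing stops two consumers from drawing different conclusions from identical files.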
The Hadoop Distributed File System (HDFS) is a classic example of a schema-on-read system. You can find more details about the schema-on-read and schema-on-write approaches here. Now we are going to talk about loading data into HDFS. I hope that, after the explanation above, you understand that loading data into Hadoop is not the same as ETL (the data is not transformed).
The Data Warehouse Insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.