Tuesday Mar 15, 2016

Big Data SQL Quick Start. Parallel Query - Part3.

[Read More]

Sunday Mar 06, 2016

Hadoop Compression. Choosing compression codec. Part2.

Many customers are keep asking me about "default" (single) compression codec for Hadoop. Actually answer on this question is not so easy and let me explain why.

Bzip2 or not Bzip2?

In my previous blogpost I published results of the compression rate for some particular compression codecs into Hadoop. Based on those results you may think that it’s a good idea to compress everything with bzip2. But be careful with this. Within the same research, I noted that bzip2 actually has on average 3 times worse performance than Gzip for querying (decompress) and archive (compress) data (it’s not surprising based on the complexity of algorithm).  Are you ready to sacrifice performance? I think it will depend on the compression benefits derived from bzip2 and the frequency of querying this data (compression speed is not so import after data is stored in Hadoop systems since you usually compress data once and read it many times).  On average, bzip2 is 1.6 times better than gzip.  But, again my research showed that sometimes you can achieve 2.3 times better compression, while other times you may gain only 9% of the disk space usage (and performance is still much worse compared to gzip and other codecs). Second factor to keep in mind is the frequency of data querying and your performance SLAs. If you don’t care about query performance (don’t have any SLAs) and you select this data very rarely – bzip2 could be good a candidate.  Otherwise consider other options. I encourage you to benchmark your own data and decide for yourself “Bzip2 or not Bzip2”.

[Read More]

Friday Feb 19, 2016

How Can Graph Databases and Analytics Help Your Business?

Several videos describing the value of graph analyses are now available from our Oracle Big Data Spatial and Graph + Oracle Labs teams.

video

Check out this blog post for details :). 

Big Data Lite 4.4.0 is now available on OTN

big data lite


It's now available for download on OTN.  Check out this VM to help you learn about Oracle's big data platform.[Read More]

Thursday Feb 04, 2016

Hadoop Compression. Compression rate. – Part1.

Compression codecs.

Text files (csv with “,” delimiter):

Codec Type  Average rate  Minimum rate  Maximum rate
bzip2 17.36 3.88 61.81
gzip 9.73 2.9 26.55
lz4 4.75 1.66 8.71
snappy 4.19 1.61 7.86
lzo 3.39 2 5.39

RC File: 

Codec Type Average rate Minimum rate Maximum rate
 bzip2 17.51 4.31 54.66
 gzip 13.59 3.71 44.07
 lz4 7.12 2 21.23
 snappy 6.02 2.04  15.38
 lzo 4.37 2.33 7.02

Parquet file:

Codec Type Average rate Minimum rate Maximum rate
 gzip 17.8 3.9 60.35
 snappy 12.92 2.63 45.99

[Read More]

Tuesday Jan 26, 2016

Big Data SQL Quick Start. Offloading - Part2.

After reading these articles: Big Data SQL Quick Start. Introduction, One fast Query All your Data, you know what Big Data SQL is, and you understandthat it allows you query data from Hadoop through the Oracle Database.  But you also should to know that it’s not just reading data. Big Data SQL allows you to process data stored in HDFS locally and return back to the database only data relevant to the query. Let’s imagine a simple diagram of a data management system that includes Oracle Database and Hadoop:

[Read More]

Tuesday Jan 19, 2016

Big Data SQL Quick Start. Introduction - Part1.

Today I am going to explain steps that required to start working with Big Data SQL. It’s really easy!  I hope after this article you all will agree with me. First, if you want to get caught up on what Big Data SQL is, I recommend that you read these blogs: Oracle Big Data SQL: One Fast Query, Big Data SQL 2.0 - Now Available.

The above blogs cover design goals of Big Data SQL. One of the goals of Big Data SQL is transparency. You just define table that links to some directory in HDFS or some table in HCatalog and continue working with it like with general Oracle Database table.It’s also useful to read the product documentation.

Your first query with Big Data SQL

Let’s start with simplest one example and query data that is actually stored in HDFS via Oracle Database using Big Data SQL. I’m going to begin this example by checking of the data that actually lies into HDFS. To accomplish this, I run the hive console and check hive table DDL:

[Read More]

Wednesday Jan 13, 2016

BIWA 2016 - here's my list of must-attend sessions and labs

It’s almost here - the 2016 BIWA conference at the Oracle Conference Center. The conference starts on January 26 with a welcome by the conference leaders at 8:30am. The BIWA summit is dedicated to providing all the very latest information and best practices for data warehousing, big data, spatial analytics and BI. This year the conference has expanded to include the most important query language on the planet: SQL. There will be a whole track dedicated to YesSQL! The full agenda is available here

Unfortunately I won’t be able to attend this year’s conference but if I was going to be there, then this would be my list of must-attend sessions and hands-on labs.

[Read More]

Thursday Jan 07, 2016

Data loading into HDFS - Part1

Today I’m going to start first article that will be devoted by very important topic in Hadoop world – data loading into HDFS. Before all, let me explain different approaches of loading and processing data in different IT systems.

Schema on Read vs Schema on Write

So, when we talking about data loading, usually we do this into system that could belong on one of two types.  One of this is schema on write. With this approach we have to define columns, data formats and so on. During the reading  every user will observe the same data set. As soon as we performed ETL (transform data in format that mostly convenient to some particular system), reading will be pretty fast and overall system performance will be pretty good. But you should keep in mind, that we already paid penalty for this when were loading data. Like example of schema on write system you could consider Relational data base, for example, like Oracle or MySQL.


Schema on Write

Another approach is schema on read. In this case we load data as-is without any changing and transformations.  With this approach we skip ETL (don’t transform data) step and we don’t have any headaches with data format and data structure. Just load file on file system, like coping photos from FlashCard or external storage to your laptop’s disk. How to interpret data you will decide during the data reading. Interesting stuff that the same data (same files) could be read in different manner. For instance, if you have some binary data and you have to define Serialization/Deserialization framework and using it within your select, you will have some structure data, otherwise you will get set of the bytes. Another example, even if you have simplest CSV files you could read the same column like a Numeric or like a String. It will affect on different results for sorting or comparison operations.

Schema on Read

Hadoop Distributed File System is classical example of schema on read system.More details about Schema on Read and Schema on Write approach you could find here. Now we are going to talk about data loading data into HDFS. I hope after explanation above, you understand that data loading into Hadoop is not equal of ETL (data doesn’t transform).


[Read More]

Tuesday Nov 24, 2015

Little things to know about ... Oracle Partitioning (part one of hopefully many)

Oracle Partitioning is one of the most commonly used option on top of Enterprise Edition - if not the most often used one, which is as you can guess always a discussion in our buildings ;-)

Over the years Oracle Partitioning matured significantly and became more powerful and flexible. But, as in real life, with power and flexibility comes always little things that are good to know (no, that's not an equivalent for complexity). So I am happy to see Connor McDonald just blogging about such a little detail around Interval Partitioning.  

Check it out, it's worth it.

Thursday Sep 03, 2015

Oracle Big Data Lite 4.2.1 - Includes Big Data Discovery

We just released Oracle Big Data Lite 4.2.1 VM.  This VM provides many of the key big data technologies that are part of Oracle's big data platform.  Along with all the great features of the previous version, Big Data Lite now adds Oracle Big Data Discovery 1.1:

The list of big data capabilities provided by the virtual machine continues to grow.  Here's a list of all the products that are pre-configured:

  • Oracle Enterprise Linux 6.6
  • Oracle Database 12c Release 1 Enterprise Edition (12.1.0.2) - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more.
  • Cloudera Distribution including Apache Hadoop (CDH5.4.0)
  • Cloudera Manager (5.4.0)
  • Oracle Big Data Discovery 1.1
  • Oracle Big Data Connectors 4.2
    • Oracle SQL Connector for HDFS 3.3.0
    • Oracle Loader for Hadoop 3.4.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.5.0
    • Oracle XQuery for Hadoop 4.2.0
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.3.4)
  • Oracle Big Data Spatial and Graph 1.0
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.1
  • Oracle Data Integrator 12cR1 (12.1.3.0.1)
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.1.1
  • Oracle Perfect Balance 2.4.0
  • Oracle CopyToBDA 2.0 
Take it for a spin - and check out the tutorials and demos that are available from the Big Data Lite download page.

Thursday Jul 02, 2015

Using Oracle Big Data Spatial and Graph

Wondering how to get started with graph analyses?  The latest Oracle Big Data Lite VM includes Oracle's new spatial and graph toolkit for big data.  Check out these two blog posts that describe how to find interesting relationships in data:

 Pretty cool :)

 

Wednesday Mar 11, 2015

Oracle Big Data Lite 4.1 VM is available on OTN

Oracle Big Data Lite 4.1 VM is now available for download on OTN.  Big Data Lite includes many of the key capabilities of Oracle's big data platform.  Each of the components have been configure to work together - and there are many hands-on labs and demonstrations to help you get started using the system.  Below is a listing of what's included:

  • Oracle Enterprise Linux 6.5
  • Oracle Database 12c Release 1 Enterprise Edition (12.1.0.2) - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more.
  • Cloudera Distribution including Apache Hadoop (CDH5.3.0)
  • Cloudera Manager (5.3.0)
  • Oracle Big Data Connectors 4.1
    • Oracle SQL Connector for HDFS 3.2.0
    • Oracle Loader for Hadoop 3.3.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.4.1
    • Oracle XQuery for Hadoop 4.1.0
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.2.5)
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.0.3
  • Oracle Data Integrator 12cR1 (12.1.3)
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.1.1
  • Oracle Perfect Balance 2.3.0
  • Oracle CopyToBDA 1.1 

 

 Enjoy!

Tuesday Mar 03, 2015

Are you leveraging Oracle's database innovations for Cloud and Big data?

If you are interested in big data, Hadoop, SQL and data warehousing then mark your calendars because on March 18th at 10:00AM PST/1:00PM EST, you will be able to hear Tom Kyte (Oracle Database Architect) talk about how you can use Oracle Big Data SQL to seamlessly integrate all your Hadoop big data datasets with your relational schemas stored in Oracle Database 12c. As part of this discussion Tom will outline how you can build the perfect foundation for your enterprise big data management system using Oracle's innovative technology.

If you are working on a data warehousing project and/or a big data project then this is one webcast you will not want to miss so register today (click here) to hear the latest about Oracle Database innovations and best practices. The full list of speakers is:

Tom Kyte
Oracle Database Architect
Keith Wilcox
VP, Database Administration
Epsilon
Bill Callahan
Director, Principal Engineer,
CCC Information Services, Inc.

Tuesday Dec 09, 2014

X-Charging for Sandboxes

This is the next part in my on-going series of posts on the topic of how to successfully manage sandboxes within an Oracle data warehouse environment. In Part 1 I provided an overview of sandboxing (key characteristics, deployment models) and introduced the concept of a lifecycle called BOX’D (Build, Observe, X-Charge and Drop). In Part 2 I briefly explored the key differences between data marts and sandboxes. Part 3 explored the Build-phase of our lifecycle. Part 4 explored the Observer-phase of our lifecycle so we have now arrived at the X-Charge part of our model.

To manage the chargeback process for our sandbox environment we are going to use the new Enterprise Manager 12c Cloud Management pack, for more information visit the EM home page on OTN

Why charge for your providing sandbox services? The simple answer is that placing a price or cost on a service ensures that the resources are used wisely. If a project team incurred zero costs for their database environment then there is no incentive to evaluate the effectiveness of the data set and the cost-benefit calculation for the project is skewed by the lack of real-world cost data. This type of approach is the main reason why sandbox projects evolve over time into “production” data marts. Even if the project is not really delivering on its expected goals there is absolutely no incentive to kill the project and free up resources. Therefore, by not knowing the cost, it is impossible to establish the value...

[Read More]
About

The data warehouse insider is written by the Oracle product management team and sheds lights on all thing data warehousing and big data.

Search

Archives
« June 2016
SunMonTueWedThuFriSat
   
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
25
26
27
28
29
30
  
       
Today