Wednesday Feb 05, 2014

OTN Virtual Developer Day Database 12c content now available on-demand

Thank you to everyone who attended the SQL pattern matching session during yesterday's OTN Virtual Developer Day event. We had a great crowd of people join our live workshop session. I hope everyone enjoyed using the amazing platform which the OTN team put together to host the event.  

The great news is that all the content from the event is now available for download, and you can watch all the on-demand videos from the four tracks (Big Data DBA, Big Data Developer, Database DBA and Database Developer).

The link to the fantastic OTN VDD platform is here: https://oracle.6connex.com/portal/database2014/login?langR=en_US&mcc=aceinvite and this is what the landing page looks like:

[Screenshot: OTN Virtual Developer Day landing page]

This page will give you access to the keynote session by Tom Kyte and Jonathan Lewis, which covered the landscape of Oracle Database technology evolution and adoption. The content looks at what's next for Oracle Database 12c, examining the high-value technologies and techniques that are driving greater database efficiency and innovation.

You will be able to access the videos, slides from each presentation and a huge range of technical hands-on labs covering big data and database technologies, including my SQL Pattern Matching workshop. If you want to download the VirtualBox image for the Database tracks it is available here: http://www.oracle.com/technetwork/database/enterprise-edition/databaseappdev-vm-161299.html (this contains everything you need to run my SQL Pattern Matching workshop).

While you are doing the workshop, if you have any questions then please feel free to email me - keith.laker@oracle.com.

Enjoy.

Friday Jan 17, 2014

StubHub's Data Scientists reap benefits of integrated approach….

We have released yet another great customer video, this time with StubHub.

Many customers are still pulling data out of their data warehouse and shipping it to specialised processing engines so they can mine their data, run spatial analytics and/or build multi-dimensional cubes. The problem with this approach, as the team at StubHub points out, is that typically when you move the data to these specialised engines you have to work with a subset of the data that is sitting in your data warehouse. When you work with a subset of data you immediately start to impose compromises on your analytical workflows. If you can't work with all your data then you can't be sure that your analytical model is as good as it could be, and that could mean losing customers or missing out on additional revenue.

The other problem comes from everyone using their own favourite tool to do their analysis: how do you share your discoveries, how do you develop a high level of corporate-wide analytical skills?

StubHub asked Oracle to help them resolve these two key problems...


Thursday Dec 19, 2013

SQL Analytics Part 2 - Key Concepts

This post continues on from my first post on analytical SQL "introduction to SQL for reporting and analysis" which looked at the reasons why it makes sense to use analytical SQL in your data warehouse and operational projects.  In this post we are going to examine the key processing concepts behind analytical SQL.  

One of the main advantages of Oracle's SQL analytics is that the key concepts are shared across all functions - in effect we have created a unified SQL framework for delivering analytics. These concepts build on existing SQL features to provide developers and business users with a framework that is both flexible and powerful in terms of its ability to support sophisticated calculations. There are four key concepts that you need to understand when implementing features and functions relating to SQL analytics:

  1. Processing order
  2. Result-set Partitions
  3. Windows
  4. Current Row

Let's look at each of these topics in more detail.

1) Processing order

The execution workflow for SQL statements containing analytical SQL is relatively simple: first all the joins, WHERE, GROUP BY and HAVING clauses are processed. The output from this step is then passed to the analytical functions so all the calculations can be applied. This typically involves the use of window functions, which are applied based on the partitions that have been defined, with analytic functions applied to each row in each partition. Finally the ORDER BY clause is processed to provide control over the final output. It is useful to keep this workflow in mind when you are building your analytical SQL because it will help you understand the inputs flowing into your analytical functions and the resulting output.
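
As a minimal sketch of this ordering (the sales table and its columns here are purely illustrative), the RANK below is computed only after the GROUP BY and HAVING clauses have reduced the rows, and the final ORDER BY then sorts the ranked output:

SELECT prod_id
, SUM(amount_sold) AS total_sales
, RANK() OVER (ORDER BY SUM(amount_sold) DESC) AS sales_rank
FROM sales
GROUP BY prod_id
HAVING COUNT(*) > 10
ORDER BY sales_rank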

2) Result-set partitions

Oracle's analytic functions allow the input data set to be divided into groups of rows which are referred to as "partitions". It is important to note that in this context the term "partition" is completely unrelated to the table partition feature.

These analytical partitions are created after the groups defined with the GROUP BY clause and can be used by any analytical aggregate functions such as sums and averages. The partitions can be based on any column that is part of the input data set, and individual partitions can be any size. It is quite possible to create a single partition containing all the rows from the initial query result set, a small number of very large partitions, or a large number of very small partitions where each partition just contains a few rows.

3) Windows

For each row in a partition it is possible to define a window over the data which determines the range of rows used to perform the calculations for the current row (the next section will explain the concept of the "current row"). The size of a window can be based on either a physical number of rows or a logical interval, which is typically time-based. The window has a starting row and an ending row, and depending on how the window is defined it may move at only one end or, in some cases, at both ends.

Physical windows

For example a cumulative sum function has its starting row fixed at the first row in the partition, and the ending row then slides from the starting row all the way to the last row of the partition to create a running total over the rows in the partition (a sketch of this running total follows the fixed-window example below). In the simplest case the window is fixed at both ends: the query below returns a quarter total and a grand total alongside every row.

SELECT Qtrs
, Months
, Channels
, Revenue
, SUM(Revenue) OVER (PARTITION BY Qtrs) AS Qtr_Sales
, SUM(Revenue) OVER () AS Total_Sales
FROM sales_table


[Figure: fixed window example]
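
As a sketch of the running total described above (reusing the same illustrative sales_table, and assuming Months sorts chronologically), adding an ORDER BY and an explicit window keeps the starting row anchored at the first row of each quarter while the ending row slides forward with the current row:

SELECT Qtrs
, Months
, Channels
, Revenue
, SUM(Revenue) OVER (PARTITION BY Qtrs ORDER BY Months
                     ROWS UNBOUNDED PRECEDING) AS Cum_Qtr_Sales
FROM sales_table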

Logical windows

If the data set contains a date column then it is possible to use logical windows by taking advantage of Oracle’s built-in time awareness. A good example of a window where the starting row changes is the calculation of a moving average. In this case both the starting and ending points slide so that a constant physical or logical range is maintained during the processing. The example below creates a four-period moving average, and the images show the current row, which is identified by the arrow, and the moving window, which is marked as the pink area:

[Figures: the moving window at six successive current-row positions]
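
As a hedged sketch of both styles (the monthly_sales table, its DATE column time_id and its sales column are illustrative), the first average below uses a physical four-row window while the second uses a logical three-month interval; because the current row is included, a four-period average only needs to specify three preceding rows or months:

SELECT time_id
, sales
-- physical window: the three preceding rows plus the current row (four periods in total)
, AVG(sales) OVER (ORDER BY time_id ROWS 3 PRECEDING) AS moving_avg_rows
-- logical window: rows whose time_id falls within the three months before the current row
, AVG(sales) OVER (ORDER BY time_id RANGE INTERVAL '3' MONTH PRECEDING) AS moving_avg_range
FROM monthly_sales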

The concept of a "window" is very powerful and provides a lot of flexibility in terms of being able to interact with the data. A window can be set as large as all the rows in a partition. At the other extreme it could be just a single row. Users may specify a window containing a constant number of rows, or a window containing all rows where a column value is in a specified numeric range. Windows may also be defined to hold all rows where a date value falls within a certain time period, such as the prior month.

When using window functions the current row is included during calculations, so when you are dealing with n items you should only specify (n-1) preceding rows - see the next section for more information….

4) Current Row

Each calculation performed with an analytic function is based on a current row within a partition. The current row serves as the reference point: during processing it begins at the starting row and moves through the following rows until the last row of the partition is reached. For instance, a centered moving average calculation could be defined with a window that holds the current row, the six preceding rows and the following six rows. In the example below the calculation of a running total is the result of the current row plus the values from all the preceding rows in the same partition. At the end of each partition the running total is reset. The example shown below creates running totals within a result set showing the total sales for each channel within a product category within year:

SELECT calendar_year
, prod_category_desc
, channel_desc
, country_name
, sales
, units
, SUM(sales) OVER (PARTITION BY calendar_year, prod_category_desc, channel_desc ORDER BY country_name) sales_tot_cat_by_channel
FROM . . .

[Figure: running totals and the current row]

Summary

This post has outlined the four main processing concepts behind analytical SQL. The next series of posts will provide an overview of the key analytical features and functions that use these concepts. In the next blog post we will review the analytical SQL features and techniques that are linked to enhanced reporting which includes: windowing, lag-lead, reporting aggregate functions, pivoting operations and data densification for reporting and time series calculations. Although these topics will be presented in terms of data warehousing, they are actually applicable to any activity needing analysis and reporting. 

If you have any questions or comments about analytical SQL then feel free to contact me via this blog.


Wednesday Nov 27, 2013

Read-All-About-It: new weekly Oracle Data Warehousing newspaper

Thanks to Brendan Tierney for bringing this excellent online automated news service to my attention….

For a long time I have been wondering how to pull together all the articles from my favourite Twitter feeds, Facebook pages and blogs. Well, thanks to Brendan I have discovered a service called Paper.li. This weekend I spent some time setting up feeds from all my favourite sources related to data warehousing, big data, Exadata and other related Oracle technologies. The result is the "#Oracle DW-Big Data Weekly Roundup", which is designed to "keep you up to date on all the weekly sql analytics, data warehousing and big data news from #Oracle". The newspaper is refreshed every Sunday night so that it is ready to read over breakfast on Monday morning. It is the perfect way to start the working week….



[Screenshot: the #Oracle DW-Big Data Weekly Roundup newspaper]


If you want to subscribe to this weekly newspaper then go here: http://paper.li/OracleBigData/1384259272 and click on the red SUBSCRIBE link in the top right region of the screen. To give you some guidance on where all this content is coming from, I am pulling articles from the following sources:

  • Oracle Twitter accounts
    • OracleBigData
    • Oracle Database
    • SQLMaria (Optimizer)
    • CharlieDataMine (Advanced Analytics)
    • NoSQL Database
    • SQL Developer
    • Hardware team
    • BI technology
    • Profit Online Magazine
    • Mark Hornick (R Enterprise)
    • Oracle University
  • Oracle Blogs
    • Data Warehousing
    • Data Mining
    • R
  • Oracle Facebook pages
    • Data Warehousing and Big Data page

I am looking for feedback on how useful this is: we have so many ways to communicate with you that it is good to know what works and what does not.
If you want to subscribe to Brendan's data mining/analytics newsletter it is here: http://paper.li/brendantierney/1364568794.

Now I am off to investigate creating the same thing on Flipboard for all you iPad/iPhone and Android users…..hope to have an update for you on this very soon so stay tuned!

Thursday Nov 14, 2013

Data Scientist Boot camp (Skills and Training)

As almost everyone is interested in data science, take the boot camp to get ahead of the curve. Leverage this free Data Science Boot camp from Oracle Academy to learn some of the following things:

  • Introduction: Providing Data-Driven Answers to Business Questions
  • Lesson 1: Acquiring and Transforming Big Data
  • Lesson 2: Finding Value in Shopping Baskets
  • Lesson 3: Unsupervised Learning for Clustering
  • Lesson 4: Supervised Learning for Classification and Prediction
  • Lesson 5: Classical Statistics in a Big Data World
  • Lesson 6: Building and Exploring Graphs

You will also find the code samples that go with the training, so you can get off to a running start.


Friday Nov 01, 2013

SQL analytical mash-ups deliver real-time WOW! for big data

One of the overlooked capabilities of SQL as an analysis engine, because we all just take it for granted, is that you can mix and match analytical features to create some amazing mash-ups. As we move into the exciting world of big data these mash-ups can really deliver those "wow, I never knew that" moments.

While Java is an incredibly flexible and powerful framework for managing big data, there are some significant challenges in using Java and MapReduce to drive the analysis that creates these "wow" discoveries. One of these "wow" moments was demonstrated at this year's OpenWorld during Andy Mendelsohn's general keynote session. 

Here is the scenario: we are looking for fraudulent activities in our big data stream, and in this case we are identifying potentially fraudulent activity by looking for specific patterns. We are using geospatial tagging of each transaction so we can create a real-time fraud map for our business users.

[Screenshot: real-time fraud dashboard combining pattern matching and spatial tagging]

Where we start to move towards a "wow" moment is when we extend this basic use of spatial and pattern matching, as shown in the above dashboard screen, to incorporate spatial analytics within the SQL pattern matching clause. This allows us to compute the distance between transactions. Apologies for the quality of this screenshot…. hopefully below you can see where we have extended our SQL pattern matching clause to use the location of each transaction and to calculate the distance between consecutive transactions:

[Screenshot: the SQL pattern matching clause extended with a spatial distance calculation]
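
The screenshot is hard to read, so here is a rough, hedged sketch of the general shape of such a clause. This is not the demo's actual code: the table, columns and thresholds are illustrative, and it simply assumes a transactions table with a card id, a timestamp and an SDO_GEOMETRY location. The spatial call (SDO_GEOM.SDO_DISTANCE) sits inside the DEFINE clause so that a row is only flagged when it is too far from the previous transaction to be reachable in the elapsed time:

SELECT *
FROM transactions
MATCH_RECOGNIZE (
  PARTITION BY card_id
  ORDER BY tstamp
  MEASURES trans.tstamp AS prev_time,
           suspect.tstamp AS alert_time
  ONE ROW PER MATCH
  PATTERN (trans suspect)
  DEFINE
    -- flag the second transaction when the implied travel speed exceeds roughly 900 km/h
    suspect AS SDO_GEOM.SDO_DISTANCE(suspect.location, trans.location, 0.005, 'unit=KM')
               > 900 * ((CAST(suspect.tstamp AS DATE) - CAST(trans.tstamp AS DATE)) * 24)
);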

This allows us to compare the time of the last transaction with the time of the current transaction and see if the distance between the two points is possible given the time frame. Obviously if I buy something in Florida from my favourite bike store (maybe a new carbon saddle for my Trek) and then 5 minutes later the system sees my credit card details being used in Arizona, there is a high probability that this transaction in Arizona is actually fraudulent (I am fast on my Trek but not that fast!) and we can flag this up in real-time on our dashboard:

[Screenshot: the suspect transaction flagged in real time on the dashboard]

In this post I have used the term "real-time" a couple of times, and this is an important point: it is one of the key reasons why SQL really is the only language to use if you want to analyse big data. One of the most important questions that comes up in every big data project is: how do we do analysis? Many enlightened customers are now realising that using Java-MapReduce to deliver analysis does not result in "wow" moments. These "wow" moments only come with SQL because it offers a much richer environment, it is simpler to use and it is faster - which makes it possible to deliver real-time "Wow!". Below is a slide from Andy's session showing the results of a comparison of Java-MapReduce vs. SQL pattern matching to deliver our "wow" moment during our live demo.

[Slide: Java-MapReduce vs. SQL pattern matching comparison from Andy Mendelsohn's session]

You can watch our analytical mash-up "Wow" demo that compares the power of 12c SQL pattern matching + spatial analytics vs. Java-MapReduce here:

[Video: the analytical mash-up demo]

You can get more information about SQL Pattern Matching on our SQL Analytics home page on OTN, see here http://www.oracle.com/technetwork/database/bi-datawarehousing/sql-analytics-index-1984365.html

You can get more information about our spatial analytics here: http://www.oracle.com/technetwork/database-options/spatialandgraph/overview/index.html

If you would like to watch the full Database 12c OOW presentation see here: http://medianetwork.oracle.com/video/player/2686974264001


Wednesday Oct 30, 2013

Oracle Magazine: Getting started with SQL Analytics

I am currently working on a series of podcasts covering the broad categories of our SQL analytical functions and features, and while I was doing some research I came across a series of four articles in Oracle Magazine.

This series of articles is written by Melanie Caffrey, who is a senior development manager at Oracle. She is a coauthor of Expert PL/SQL Practices for Oracle Developers and DBAs (Apress, 2011) and Expert Oracle Practices: Oracle Database Administration from the Oak Table (Apress, 2010).

The four articles appear under the banner "Technology: SQL 101", and parts 9, 10, 11 and 12 of the series cover SQL analytics.

The articles cover topics such as GROUP BY, SUM, AVG, HAVING, window functions, RANK, FIRST, LAST, LAG and LEAD.

The great news is that  you can try out the examples in this series. All you need is access to an Oracle Database instance. All the schemas, data sets and SQL statements that you will need can be downloaded from a link included in the January article.  

 I hope you find this series of articles useful.

Wednesday Jul 17, 2013

Oracle: Big Data at Work

There is a lot of hype around big data, but here at Oracle we try to help customers implement big data solutions to solve real business problems. For those of you interested in understanding more about how you can put big data to work at your organization, consider joining these events:

San Jose | August 5 - 6
Marriott San Jose
301 S Market St, San Jose, California 95113
Event Registration Page
Chicago | August 7 - 8
The Westin Michigan Avenue
909 N Michigan Ave, Chicago, IL 60611

New York | August 12 - 13
Marriott Marquis Times Square
1535 Broadway, New York, NY 10036
Event Registration Page

Enjoy!

Wednesday Feb 20, 2013

Looking for tools to solve your big data problems?

Look no further: Infosys today announced its Infosys BigDataEdge developer platform to drive value from your big data stack. 

By empowering business users to rapidly develop insights from vast amounts of structured and unstructured data, the platform enables better business decisions in near real-time. With Infosys BigDataEdge, enterprises can reduce the time taken to extract information by up to 40 percent and generate insights up to eight times faster.

Read More.

Wednesday Jan 30, 2013

Parallel R: Quick Ways Model More

Introduction

I am less and less often mistaken for a pirate when I mention the R language.  While I miss the excuse to wear an eyepatch, I'm glad more people are beginning to explore a statistical language I've been touting for years.  When it comes to plotting or running complex statistics in a single line of code, R is a great tool to have.  That said, there are plenty of pitfalls for the casual or new user: syntax, learning to write vectorized code, or even just knowing which "apply" function you really should choose.

  I want to explore a slightly less-often considered aspect of R development: parallelism.  Out of the box, R can seem very limited to someone used to working on compute clusters or even a multicore server.  However, there are a few tricks we can leverage to get the most out of R on everything from a personal workstation to a Hadoop cluster.

 R is Single-Threaded

The R interpreter is -- and likely always will be -- single-threaded.  This means loading data frames is done in a single thread.  So is building your linear model, or generating that pretty surface plot.  Even on my laptop, that leaves a lot of cores sitting unused for modeling.  No matter how much my web browser might covet those cycles, I'd like to use them for work.

Rather than a complex multithreaded re-implementation, the R interpreter offers a number of ways to allow users to selectively apply parallelism.  Some of these approaches leverage MPI libraries and mirror that message passing approach.  Others allow a more implicit parallelism via "foreach" or "apply" constructs. We'll just focus on a pair of strategies using the parallelism that's been included in R since its 2.14.1 release: the parallel library.

 Setting The Stage for Parallel Execution

We're going to need to load a few libraries into our R session before we can execute anything outside of our single thread.  We'll use doParallel and foreach because they allow us to focus on what to parallelize rather than on how to coordinate our threads.

> data(iris)
library(parallel)
library(iterators)
library(doParallel)
library(foreach)

Knowing that calculations in R will be single-threaded, we want to use the parallel package to operate on logical subsets of the data simultaneously.  For example, I loaded the iris data set, which contains measurements for a number of different species.  One way I might want to parallelize is to fit the same model to each species simultaneously.  For that, I'm going to have to split the data by species:

> species.split <- split(iris, iris$Species)

 This gives us a list we can iterate over -- or parallelize.  From here on out, it's simply a question of deciding what resources we want to leverage: local CPUs or remote hosts.

FORKs and SOCKs

We're going to use the makeCluster function to bind together a set of computational resources.  But first we need to decide: do we want to use only local CPUs, or is it necessary to open up socket connections to other machines to distribute our workload?  In the former case we'll use makeCluster to create what's called a FORK cluster (in that it uses UNIX's fork call to create slaves).  In the latter, we'll create a SOCK cluster by opening up sockets to a list of remote hosts and starting slave processes on them.

Here's a FORK cluster which uses all my cores:

> cl <- makeCluster(detectCores(), type = "FORK")
registerDoParallel(cl)

And here's a SOCK cluster across three nodes (password-less SSH is required):

> hostlist <- c("10.0.0.1", "10.0.0.2", "10.0.0.3")
cl <- makeCluster(hostlist)
registerDoParallel(cl)

In each case, I call registerDoParallel to bind this cluster to the %dopar% operator.  This is the operator which will let us easily iterate in parallel.

Running in Parallel

Once we've got something to iterate over and a cluster with which to do it, modeling in parallel becomes straightforward.  Suppose I want to fit a model of sepal length as a linear combination of petal characteristics.  In that case, the code is simply:

> species.models <- foreach(i=species.split) %dopar% {
m<-lm(i$Sepal.Length ~ i$Petal.Width*i$Petal.Length);
return(m)
}

But I'm not just restricted to fitting linear models on my little cluster.  I can run k-means clustering for several different k simultaneously using basically the same block:

> species.clusters<- foreach(i=2:5) %dopar% {
km <- kmeans(iris[, 1:4], i);  # numeric columns only; kmeans cannot handle the Species factor
return(km)
}

When I'm done with my block, I can just call stopCluster(cl) to ensure my processes terminate and I'm not hogging resources.

Using Hadoop

Finally, there will be situations in which I need to deploy in parallel against much larger datasets -- specifically, datasets stored in HDFS.  Both Hive and Pig will let me run an R script as part of a streaming process.  In Hive, the TRANSFORM operator will send data to an R script.  In Pig, you can use the STREAM operator to send a whole bag to an R script.  However, you can't stream from within Pig's FOREACH blocks, so I occasionally use a UDF which invokes R scripts for me.

Regardless of the method you choose to send HDFS data to an R process, it's important to make sure your R script can consume data streaming from standard input.  I find the most expedient way of doing this is via the file function.  A typical script might start:

#! /usr/bin/env Rscript
#Connection to STDIN for reading a data frame
con <- file(description="stdin")
my.data.frame <- read.table(con, header=FALSE, sep=",")
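
For example, on the Hive side a TRANSFORM query can stream a table through such a script. The sketch below is hedged: the table, columns and script name are illustrative, the script is assumed to be executable with the Rscript shebang shown above, and the input row format is set to commas so that it matches the read.table(sep=",") call:

-- ship the script to the task nodes, then stream rows through it
ADD FILE /path/to/my_model.R;

SELECT TRANSFORM (id, x, y)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
       USING 'my_model.R'
       AS (id, score)
FROM my_hdfs_table;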

Summary

We've covered several ways to push R beyond the bounds of its single-threaded core.  There are forking and socket mechanisms for spreading our work around, not to mention tricks for leveraging the power of Hadoop Streaming.  In each case, however, one thing stands out: we must be smart as modelers and understand what can and should be done in parallel.


Monday Jan 28, 2013

First Oracle BIWA Data Scientists Certified

Those of you who attended the BIWA Summit a few weeks ago will have seen the data scientist certification. BIWA just listed the first batch of data scientists it has certified:

Instructor Level Certificate  - Brendan Tierney

Oracle Data Scientist Certificate 
Don Ferguson, CherryRoad Technologies
Jorge Anicama, IBM (GBS)
Tim Vlamis, Vlamis Software Solutions
Vijayalakshmi Muthukrishnan, Motorola
Sicheng Liu, Deloitte Consulting
Avik Bhattacharya, Printpack Inc.
Ari Kaplan, Ariball
Paul Mitchell, Oracle

Associate Level 
Suresh Anand, Sashatech LLC

Participation Certificate 
Ahmed Kopap
Ekine Akuiyibo
Khader Mohiuddin

For more on the program, see here: http://oraclebiwasig.blogspot.com/2013/01/oracle-data-scientist-at-biwa-summit.html


Friday Jan 18, 2013

Big Data Appliance X3-2 Updates


Hello world. Wow, time went by too fast. Happy new year, and here is the long overdue update on the new Big Data Appliance and its software.

Big Data Appliance X3-2

Both the software and the hardware of the Big Data Appliance got a refresher.

Hardware Update

A good place to start is to quickly review the hardware differences (no price changes!). On a per node basis the following is a comparison between old and new (X3-2) hardware:

Big Data Appliance v1 vs. Big Data Appliance X3-2 (per node):

  • CPU: 2 x 6-Core Intel® Xeon® 5675 (3.06 GHz) → 2 x 8-Core Intel® Xeon® E5-2660 (2.2 GHz)
  • Memory: 48GB → 64GB, expandable to 512GB
  • Disk: 12 x 3TB High Capacity SAS (unchanged)
  • InfiniBand: 40Gb/sec (unchanged)
  • Ethernet: 10Gb/sec (unchanged)
  • KVM: 1 KVM Switch → N/A (removed)

For all the details on the environmentals and other useful information, review the data sheet for Big Data Appliance X3-2. For those wondering what we did with the 2RU we now have left from the KVM: that is now open space at the top of the rack.

The higher core count gives a BDA X3-2 more parallel compute power while saving some 30% in energy and heat.

Software Update

As we did with the hardware, a good place to start is a quick overview of the software changes in the table below:

Big Data Appliance v1.1.x Software Stack vs. Big Data Appliance V2.0.1 Software Stack:

  • Linux: Oracle Linux 5.6 → Oracle Linux 5.8 with UEK
  • JDK: 1.6 → 1.6u35
  • Cloudera CDH: CDH 3u4 → CDH 4.1.x
  • Cloudera Manager: CM 3 → CM 4.1
  • Oracle Enterprise Manager: N/A → Big Data Appliance Plug-In for Enterprise Manager
  • R: Open Source R → Oracle R Distribution 2.x
  • Big Data Connectors *: Big Data Connectors 1.1.x → Big Data Connectors 2.0.x
  • Oracle NoSQL Database CE **: NoSQL DB 1.x → NoSQL DB 2.x

* Oracle Big Data Connectors is a separately licensed product which can be pre-installed and pre-configured on BDA
** Oracle NoSQL DB 2.x will be pre-installed in a future update to Mammoth but can be applied manually today

Apart from the version updates, bug fixes and a great number of performance improvements across the entire system, the biggest changes are the inclusion of CDH 4.1.2 with highly available name nodes set up by default for Hadoop, Enterprise Manager management of the BDA, the uptake of the Oracle R Distribution and the updates to Oracle NoSQL Database. In a nutshell these updates deliver the following improvements:

Cloudera CDH 4.1.x

The latest version of CDH and CM deliver:

  • Higher overall performance
  • Highly available name nodes with the BDA using failover quorum processes instead of an external HA filer solution
  • Vastly expanded management capabilities via CM 4

On top of this, BDA now has both Zookeeper and Oozie configured out of the box.

Oracle Enterprise Manager

The new Big Data Appliance Plug-In for Enterprise Manager delivers the first end-to-end management of the Hadoop cluster from hardware metrics to software and Hadoop metrics. To achieve the end-to-end management of the system Enterprise Manager delivers all the system metrics users are used to from the Exadata Plug-In for Enterprise Manager. Enterprise Manager enables a seamless transition between the Hardware and high level software monitoring and the expanded Hadoop monitoring and diagnostics from Cloudera Manager. This combination of functionality makes operations for a BDA simpler and allows operations staff to seamlessly switch between their Exadata, Big Data Appliance and other Oracle Engineered systems.

Oracle R Distribution

The big difference between the Oracle R Distribution and the open source R distribution is that the Oracle R Distribution can dynamically load the math kernel libraries for CPUs from both Intel and AMD. This increases the performance of basic calculations, which in turn increases the performance of overall R calculations because more math is off-loaded onto the CPUs.

Oracle NoSQL Database 2.x

A great number of new features have been added in NoSQL DB 2.x. Most of these are in the Community Edition as well as in the Enterprise Edition. Charles Lamb has a nice, concise post describing what is new here.

Big Data Connectors

To close out, Big Data Connectors got a refresher focused on performance, so download the new products here and give them a go via this download page. For more information, read the data sheet here.

Wednesday Aug 01, 2012

Flume and Hive for Log Analytics

There's a lot to learn from log data, but to get the most value from it, that data needs to be easy to collect and analyze. Otherwise, time that could be used to learn from data is spent writing parsers and transport components. In this entry we'll simplify log collection and transport using JSON serialization and parts of the Hadoop ecosystem.

Logging everything in JSON is a great idea. As serialization formats go it's engineer-friendly: you and your favorite programming language can both read it. Moreover, having all of your log data structured as universal data structures makes getting started with analytics much simpler. To illustrate how much simpler, we'll take JSON logs written to a flat file, stream them into HDFS, and expose them via Hive for exploration and aggregation.

The Preliminaries

We're going to use three components to put our system together:

  • A flat file that's collecting JSON data. Assume entries look a bit like this:
    {"fieldA":"string data","fieldB":400,"fieldC":0.99}
  • Flume: the distributed log-collection service that's part of the Hadoop ecosystem
  • Hive and a SerDe for handling JSON data

The "Tail Table"

We'll begin by setting up the final destination for our log data. This requires we create a directory in HDFS to hold the log data and define a Hive table over it. Making the directory's easy:

hadoop fs -mkdir /user/oracle/tail_table

Similarly, defining the external table is straightforward in the Hive command line:

CREATE EXTERNAL TABLE IF NOT EXISTS tail_table(fieldA string, fieldB int, fieldC float)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/user/oracle/tail_table';

This gives us a table which will read and respect the types of the values in our JSON records. If a field isn't present for a given record, a NULL value is returned for that column. Fields not included in the CREATE statement are ignored but still exist in the JSON. This allows the schema of the JSON to remain flexible while minimally impacting the Hive table.
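
With the table in place, exploration and aggregation is plain HiveQL. As a quick, hedged example over the sample fields above (the column names are the ones defined in the CREATE statement, the data is whatever lands in the directory):

SELECT fieldA, COUNT(*) AS record_count, AVG(fieldC) AS avg_fieldC
FROM tail_table
GROUP BY fieldA;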

Streaming Data with Flume OG

CDH 3u4 ships with two very different versions of Flume. The default is Flume 0.9.4, or Flume OG. It's great at streaming data into HDFS, but Flume OG has some requirements.

  • You must run Zookeeper to coordinate Flume nodes
  • You must run a Flume master to control Flume nodes

Those caveats aside, setting up Flume to stream data into our Hive table is remarkably simple. We only need to define a source which tails our JSON logs and a sink which writes these into the appropriate HDFS directory. We can set this up via the Flume master's web interface. Just navigate to the Flume master web interface at http://flumemaster.your.domain:35871 and click the config link. From here, select the Flume node you want to configure from the dropdown menu (i.e. the node which has the JSON log file). The rest is easy:
  • Set the source as: tail("/path/to/json.log")
  • Set the sink as: collectorSink("hdfs://namenode/user/oracle/tail_table", "logdata", 30)

This configuration will tail the log file and write a new message into HDFS with each new line. The collectorSink will commit data to our Hive table every 30 seconds. The resulting configuration looks like this: 

Streaming Data with Flume NG

The other version of Flume which ships with CDH3 is Flume NG. Flume NG is significantly different from its predecessor. Our tail source from the previous section is gone, but so too are many of its restrictions.
  • Zookeeper is no longer a requirement
  • The master-slave architecture has been replaced by independent Flume agents
  • We can now use Avro RPCs to transfer data in multi-hop flows

That last point is a big advance for Flume. In Flume OG, transfer from application servers to our Hadoop cluster was a gray area. Either our application servers run Flume nodes connected to the Zookeeper instances and Flume masters for the Hadoop cluster, or logs must be transferred into the Hadoop cluster via another method. In Flume NG, we can run independent Flume agents on the application servers and the Hadoop cluster, relying on Avro RPC to handle forwarding.

For this type of multi-hop log transfer, we need a flume-ng-agent running on each application server and one on the Hadoop cluster. The application servers will have a flume.conf file which includes something like this:

app-agent.sources = tail
app-agent.channels = memoryChannel
app-agent.sinks = avro-forward-sink
# a memory channel buffers events between the source and the sink
app-agent.channels.memoryChannel.type = memory
# tail the JSON log with an exec source and feed it into the channel
app-agent.sources.tail.type = exec
app-agent.sources.tail.command = tail -f /path/to/json.log
app-agent.sources.tail.channels = memoryChannel
# the Avro forwarder is a sink, so its properties sit under app-agent.sinks
app-agent.sinks.avro-forward-sink.type = avro
app-agent.sinks.avro-forward-sink.hostname = 10.1.1.100
app-agent.sinks.avro-forward-sink.port = 10000
app-agent.sinks.avro-forward-sink.channel = memoryChannel

This sets up a source that runs "tail" and sinks that data via Avro RPC to 10.1.1.100 on port 10000.


The collecting Flume agent on the Hadoop cluster will need a flume.conf with an avro source and an HDFS sink.

hdfs-agent.sources = avro-collect
hdfs-agent.sinks = hdfs-write
hdfs-agent.channels = memoryChannel
# a memory channel buffers events between the Avro source and the HDFS sink
hdfs-agent.channels.memoryChannel.type = memory
# listen for Avro RPC messages forwarded by the application servers
hdfs-agent.sources.avro-collect.type = avro
hdfs-agent.sources.avro-collect.bind = 10.1.1.100
hdfs-agent.sources.avro-collect.port = 10000
hdfs-agent.sources.avro-collect.channels = memoryChannel
# write events under the Hive table's directory, rolling the file every 30 seconds
hdfs-agent.sinks.hdfs-write.type = hdfs
hdfs-agent.sinks.hdfs-write.hdfs.path = hdfs://namenode/user/oracle/tail_table
hdfs-agent.sinks.hdfs-write.hdfs.rollInterval = 30
hdfs-agent.sinks.hdfs-write.channel = memoryChannel

On this side we've defined a source that reads Avro messages from port 10000 on 10.1.1.100 and writes the results into HDFS, rolling the file every 30 seconds. It's just like our setup in Flume OG, but now multi-hop forwarding is a snap.


The resulting configuration looks like this: 

Takeaway

No matter which version you deploy, the combination of Flume, Hive and JSON makes it straightforward to set up an end-to-end pipeline for consuming and analyzing serialized log data. With deployments this simple, you can spend more time focusing on your applications and analytics.


Wednesday Feb 08, 2012

Announcing Oracle Advanced Analytics

The Oracle Advanced Analytics Option extends the database into a comprehensive advanced analytics platform for big data business analytics. Oracle Advanced Analytics, a combination of Oracle Data Mining and Oracle R Enterprise, delivers predictive analytics, data mining, text mining, statistical analysis, advanced numerical computations and interactive graphics inside the database. It brings powerful computations to the database resulting in dramatic improvements in information discovery, scalability, security, and savings. Oracle Advanced Analytics eliminates data movement to external analytical servers, accelerates information cycle times and reduces total cost of ownership. 

Resources:

  • Press release (here)
  • Advanced Analytics Option on OTN (here)
  • Oracle R Enterprise on OTN (here)

The release of these deep analysis tools and languages complements the earlier release of Oracle Big Data Appliance and Oracle Big Data Connectors, enabling the management and analysis of all data.

Monday Jun 27, 2011

Big Data Accelerator

For everyone who does not regularly listen to earnings calls, Oracle's Q4 call was interesting (as it mostly is). One of the announcements in the call was the Big Data Accelerator from Oracle (Seeking Alpha link here - slightly tweaked for correctness shown below):

 "The big data accelerator includes some of the standard open source software, HDFS, the file system and a number of other pieces, but also some Oracle components that we think can dramatically speed up the entire map-reduce process. And will be particularly attractive to Java programmers [...]. There are some interesting applications they do, ETL is one. Log processing is another. We're going to have a lot of those features, functions and pre-built applications in our big data accelerator."

Not much else we can say right now; more on this (and Big Data in general) at OpenWorld!

About

The Data Warehouse Insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.
