By Jean-Pierre Dijcks-Oracle on Mar 13, 2014
The certification of Oracle Big Data Connectors on Intel Distribution for Hadoop is now complete (see our previous post). This video from Strata gives you a nice overview of IDH and BDC.
Intel partnered with Oracle to certify compatibility between Intel® Distribution for Apache Hadoop* (IDH) and Oracle Big Data Connectors*. Users can now connect IDH to Oracle Database with Oracle Big Data Connectors, taking advantage of the high performance feature-rich components of that product suite. Applications on IDH can leverage the connectors for fast load into Oracle Database, in-place query of data in HDFS with Oracle SQL, analytics in Hadoop with R, XQuery processing on Hadoop, and native Hadoop integration within Oracle Data Integrator.
Over the past six months I have posted a number of articles about SQL pattern matching, as shown here:
Most of these have been related to explaining the basic concepts, along with some specific use cases.
In this post I want to review some of the internal performance tests that we have run during the development of this feature. In part 3 of the series of podcasts I covered a number of use cases for SQL pattern matching such as: stock market analysis, tracking governance-compliance, call service quality and sessionization. The most popular scenario is likely to be the analysis of sessionization data, as this is currently a hot topic when we start considering machine data and, in more general terms, big data.
To help us create a meaningful test data set we decided to use the TPC-H schema because it contained approximately seven years of data, equating to approximately 1TB. One of the objectives of our performance tests was to compare and contrast the performance and scalability of code using the 12c MATCH_RECOGNIZE clause with code using 11g window functions.
To make things easy to understand I have divided our sessionization workflow into a number of steps.
Part 1 - For the purposes of this specific use case we defined a session as a sequence of one or more events with the same partition key where the gap between the timestamps is less than 10 seconds - obviously the figure for the gap is completely arbitrary and could be set to any number as required. The 1TB of source data looked like this:
The following sections compare the use of 11g window functions vs. 12c MATCH_RECOGNIZE clause.
Part 2 - To create the first part of the sessionization workflow we took the original source data and used the USER_ID as the PARTITION BY key and the timestamp for the ORDER BY clause. The objective for this first step is to detect the various sessions and assign a surrogate session id to each session within each partition (USER_ID).
This creates an output result set that delivers a simplified sessionization data set as shown here:
The 12c SQL to create the initial result set is as follows:
SELECT user_id, session_id, start_time, no_of_events, duration
FROM Events MATCH_RECOGNIZE
 (PARTITION BY User_ID
  ORDER BY Time_Stamp
  MEASURES match_number() session_id,
           count(*) as no_of_events,
           first(time_stamp) start_time,
           last(time_stamp) - first(time_stamp) duration
  ONE ROW PER MATCH
  PATTERN (b s*)
  DEFINE s as (s.Time_Stamp - prev(s.Time_Stamp) <= 10)
 );
As a comparison, here is how to achieve the same result using 11g analytical window functions:
CREATE VIEW Sessionized_Events as
SELECT Time_Stamp, User_ID,
       Sum(Session_Increment)
         over (partition by User_ID order by Time_Stamp asc) Session_ID
FROM (SELECT Time_Stamp, User_ID,
             CASE WHEN (Time_Stamp - Lag(Time_Stamp)
                        over (partition by User_ID order by Time_Stamp asc)) < 10
                  THEN 0
                  ELSE 1
             END Session_Increment
      FROM Events);

SELECT User_ID, Min(Time_Stamp) Start_Time,
       Count(*) No_Of_Events,
       Max(Time_Stamp) - Min(Time_Stamp) Duration
FROM Sessionized_Events
GROUP BY User_ID, Session_ID;
As you can see, the 11g approach using window functions (SUM() OVER(PARTITION BY…)) is a little more complex to understand, but it produces the same output - i.e. our initial sessionized data set.
Part 3 - However, to get business value from this derived data set we need to do some additional processing. Typically, with this kind of analysis the business value within the data emerges only after aggregation, which in this case needs to be by session. We need to reduce the data set to a single tuple, or row, per session along with some derived attributes, such as:
To do this with Database 12c we can use the MATCH_RECOGNIZE clause to determine how many events are captured within each session. There are actually two ways to do this: 1) we can compare the current record to the previous record, i.e. peek backwards or 2) we can compare the current record to the next record, i.e. peek forwards.
Here is code based on using the PREV() function to compare the current record against the previous record:
select count(*)
from ( select /* mr_sessionize_prev */ *
       from ( select o_pbykey, session_id, start_time, no_of_events, duration
              from orders_v
              MATCH_RECOGNIZE
               (PARTITION BY o_pbykey
                ORDER BY O_custkey, O_Orderdate
                MEASURES match_number() session_id,
                         count(*) as no_of_events,
                         first(o_orderdate) start_time,
                         last(o_orderdate) - first(o_orderdate) duration
                PATTERN (b s*)
                DEFINE s as (s.O_Orderdate - prev(O_Orderdate) <= 100)
               )
            )
       where No_Of_Events >= 20
     );
Here is code based on using the NEXT() function to compare the current record against the next record:
select count(*)
from ( select /* mr_sessionize_next */ *
       from ( select o_pbykey, session_id, start_time, no_of_events, duration
              from orders_v
              MATCH_RECOGNIZE
               (PARTITION BY o_pbykey
                ORDER BY O_custkey, O_Orderdate
                MEASURES match_number() session_id,
                         count(*) as no_of_events,
                         first(o_orderdate) start_time,
                         last(o_orderdate) - first(o_orderdate) duration
                PATTERN (s* e)
                DEFINE s as (next(s.O_Orderdate) - s.O_Orderdate <= 100)
               )
            )
       where No_Of_Events >= 20
     );
Finally we can compare the 12c MATCH_RECOGNIZE code to the 11g code which uses window functions (which in my opinion is a lot more complex):
select count(*)
from ( select /* wf */ *
       from (select O_pbykey, Session_ID,
                    min(O_Orderdate) Start_Time,
                    count(*) No_Of_Events,
                    (max(O_Orderdate) - min(O_Orderdate)) Duration
             from (select O_Orderdate, O_Custkey, o_pbykey,
                          sum(Session_Increment)
                            over (partition by o_pbykey order by O_custkey, O_Orderdate) Session_ID
                   from (select O_Custkey, O_Orderdate, O_pbykey,
                                case when (O_Orderdate - Lag(O_Orderdate)
                                           over (partition by o_pbykey order by O_custkey, O_Orderdate)) <= 100 -- Threshold
                                     then 0
                                     else 1
                                end Session_Increment
                         from orders_v)
                  )
             group by o_pbykey, Session_ID)
       where No_Of_Events >= 20
     );
The final output generated by both sets of code (11g using window functions and 12c using MATCH_RECOGNIZE clause ) would look something like this:
[Image: sample sessionized output - http://lh6.ggpht.com/-HDaEnake_3o/UrL_AViG4iI/AAAAAAAAA20/WyCX1sPjRVk/NewImage.png?imgmax=800]
Part 4 - The performance results for these three approaches (11g window functions vs. MATCH_RECOGNIZE using PREV() vs. MATCH_RECOGNIZE using NEXT()) are shown below. Please note that on the graph the X-axis shows the number of partitions within each test run and the Y-axis shows the time taken to run each test. There are three key points to note from this graph:
The first is that, in general, the 12c MATCH_RECOGNIZE code is between 1.5x and 1.9x faster than the code using window functions, which is good news if you are looking for a reason to upgrade to Database 12c.
Secondly, it is clear from the X-axis that as the number of partitions increases the MATCH_RECOGNIZE clause continues to deliver excellent performance. In other words, it both performs well and scales well as your data volumes increase.
However, it is important to remember that the 11g window function code shows similarly excellent scalability and performance. If you are using 11g at the moment and have not considered using Oracle Database to run your sessionization analysis, then it is definitely worth pulling that transformation code back inside the database and using window functions to run those sessionization transformations. And if you need a reason to upgrade to Database 12c, MATCH_RECOGNIZE offers significant performance benefits if you are currently doing pattern matching operations either inside Oracle Database 11g or using an external processing engine.
Lastly, when you are designing your own MATCH_RECOGNIZE implementations and you are using the NEXT() or PREV() functions, it is worth investigating whether using the alternate function offers any significant performance benefits. Obviously, much will depend on the nature of the comparison you are trying to formulate, but it is an interesting area and we would welcome feedback on this specific point based on your own data sets.
In general, if you are using the MATCH_RECOGNIZE clause then I would love to hear about your use case and experiences in developing your code. You can contact me directly either via this blog or via email (email@example.com).
Have fun with MATCH_RECOGNIZE….
Thank you to everyone who attended the SQL pattern matching session during yesterday's OTN Virtual Developer Day event. We had a great crowd of people join our live workshop session. I hope everyone enjoyed using the amazing platform which the OTN team put together to host the event.
The great news is that all the content from the event is now available for download and you can watch all the on-demand videos from the four tracks (Big Data DBA, Big Data Developer, Database DBA and Database Developer).
The link to the fantastic OTN VDD platform is here: https://oracle.6connex.com/portal/database2014/login?langR=en_US&mcc=aceinvite and this is what the landing page looks like:
This page will give you access to the keynote session by Tom Kyte and Jonathan Lewis, which covered the landscape of Oracle DB technology evolution and adoption. The content looks at what's next for Oracle Database 12c, examining the high-value technologies and techniques that are driving greater database efficiencies and innovation.
You will be able to access the videos, slides from each presentation and a huge range of technical hands-on labs covering big data and database technologies, including my SQL Pattern Matching workshop. If you want to download the Virtualbox image for the Database tracks it is available here: http://www.oracle.com/technetwork/database/enterprise-edition/databaseappdev-vm-161299.html (this contains everything you need to run my SQL Pattern Matching workshop).
While you are doing the workshop, if you have any questions then please feel free to email me - firstname.lastname@example.org.
You've been hearing a lot about Oracle's big data platform. Today, we're pleased to announce Oracle Big Data Lite Virtual Machine - an environment to help you get started with the platform. And, we have a great OTN Virtual Developer Day event scheduled where you can start using our big data products as part of a series of workshops.
Oracle Big Data Lite Virtual Machine is an Oracle VM VirtualBox image that contains many key components of Oracle's big data platform, including: Oracle Database 12c Enterprise Edition, Oracle Advanced Analytics, Oracle NoSQL Database, Cloudera Distribution including Apache Hadoop, Oracle Data Integrator 12c, Oracle Big Data Connectors, and more. It's been configured to run on a "developer class" computer; all Big Data Lite needs is a couple of cores and about 5GB memory to run (this means that your computer should have at least 8GB total memory). With Big Data Lite, you can develop your big data applications and then deploy them to the Oracle Big Data Appliance. Or, you can use Big Data Lite as a client to the BDA during application development.
How do you get started? Why not start by registering for the Virtual Developer Day scheduled for Tuesday, February 4, 2014 - 9am to 1pm PT / 12pm to 4pm ET / 3pm to 7pm BRT:
There will be 45 minute sessions delivered by product experts (from both Oracle and Oracle Aces) - highlighted by Tom Kyte and Jonathan Lewis' keynote "Landscape of Oracle Database Technology Evolution". Some of the big data technical sessions include:
Keep an eye on this space - we'll be publishing how-to's that leverage the new Oracle Big Data Lite VM. And, of course, we'd love to hear about the clever applications you build as well!
Make sure you are free on Tuesday February 4 because the OTN team are hosting another of their virtual developer day events. Most importantly, it is FREE. Even more importantly, I will be running a 12c pattern matching workshop at 11:45am Pacific Time. Of course there are lots of other sessions that you can attend relating to big data and Oracle Database 12c, and the OTN team has created two streams to help you learn about these two important areas:
You can sign-up for the event and pick your tracks and sessions via this link: https://oracle.6connex.com/portal/database2014/login?langR=en_US&mcc=aceinvite.
My pattern matching session is included in the Oracle 12c DBA section of the application development track and the workshop will cover the following topics:
As my session is only 45 minutes long I am only going to cover the first three topics and leave you to work through the last three topics in your own time. During the 45 minute workshop I will be available to answer any questions via the live Q&A chat feature.
There is a link to the full agenda on the invitation page. The OTN team will be providing a Database 12c Virtualbox VM that you will be able to download later this week. For the pattern matching session I will be providing the scripts to install our sample schema, the slides from the webcast and the workshop files which include a whole series of exercises that will help you learn about pattern matching and test your SQL skills.
The big data team has kindly included my pattern matching content inside their Virtualbox image so if you want to focus on the sessions offered on the big data tracks but still want to work on the pattern matching exercises after the event then you will have everything you need already installed and ready to go!
Don't forget to register as soon as possible and I hope you have a great day…Let me know if you have any questions or comments.
We have released yet another great customer video, this time with StubHub.
Many customers are still pulling data out of their data warehouse and shipping it to specialised processing engines so they can mine their data, run spatial analytics and/or build multi-dimensional cubes. The problem with this approach, as the team at StubHub points out, is that typically when you move the data to these specialised engines you have to work with a subset of the data that is sitting in your data warehouse. When you work with a subset of data you immediately start to impose compromises on your analytical workflows. If you can't work with all your data then you can't be sure that your analytical model is as good as it could be, and that could mean losing customers or missing out on additional revenue.
The other problem comes from everyone using their own favourite tool to do their analysis: how do you share your discoveries, and how do you develop a high level of corporate-wide analytical skills?
CaixaBank is Spain’s largest domestic bank by market share, with a customer base of 13.7 million. It is also Spain’s leading bank in terms of innovation and technology, and one of the most prominent innovators worldwide. CaixaBank was recently awarded the title of the World’s Most Innovative Bank at the 2013 Global Banking Innovation Awards (November 2013).
Like most financial services companies, CaixaBank wants to get closer to its customers by collecting data about their activities across all the different channels (offices, internet, phone banking, ATMs, etc.). In the old days we used to call this CRM, and then this morphed into the "360-degree view" and so on. While many companies have delivered these types of projects, and customers feel much more connected and in control of their relationship with their bank, the capture of streams of big data has the potential to create another revolution in the way we interact with our bank. What banks like CaixaBank want to do is to capture data in one part of the business and make it available to all the other lines of business as quickly as possible.
Big data is allowing businesses like CaixaBank to significantly enhance the business value of their existing customer data by integrating it with all sorts of other internal and external data sets. This is probably the most exciting part of big data because the potential business benefits are really only constrained by the imagination of the team working on these types of projects. However, that in itself does create problems in terms of securing funding and ownership of projects, because the benefits can be difficult to estimate. This is where all the industry use cases, conference papers and blog posts can help, by providing insight into what is going on across the market in broad general terms.
To help them implement a strategic Big Data project, CaixaBank has selected Oracle for the deployment of its new Big Data infrastructure. This project, which includes an array of Oracle solutions, positions CaixaBank at the forefront of innovation in the banking industry. The new infrastructure will allow CaixaBank to maximize the business value from any kind of data and embark on new business innovation projects based on valuable information gathered from large data sets. Projects currently under review include:
The Oracle solution (including Oracle Engineered Systems, Oracle Software and Oracle Consulting Services) consists of the implementation of a new Information Management Architecture that provides a unified corporate data model and new advanced analytic capabilities. (For more information about how Oracle's Reference Architecture can help you integrate structured, semi-structured and unstructured information into a single logical information resource that can be exploited for commercial gain, click here to view our whitepaper.)
The importance of the project is best explained by Juan Maria Nin, CEO of CaixaBank:
Here's another great story about how to use data warehousing and big data technologies to solve real-world problems with diverse sets of data using Oracle technology. BAE Systems is taking unstructured, semi-structured, operational and social media data and using it to solve complex problems such as financial crime, cyber security and digital transformation. The volumes of data that BAE deals with are very large, and this creates its own set of challenges in terms of optimising hardware and software to work efficiently and effectively together. Although BAE had their own in-house Hadoop experts, they chose Oracle Big Data Appliance for their Hadoop cluster because it’s easier, cheaper, and faster to operate.
BAE is working with many telco customers to explore the new areas that are being opened up by the use of big data to manage browsing data and call record data. These data sources are being transformed to provide additional insight for network operations teams, to analyse customer quality, and to drive marketing campaigns.
Click on the image to watch the video, or click here: http://medianetwork.oracle.com/video/player/2940549413001
This post continues on from my first post on analytical SQL "introduction to SQL for reporting and analysis" which looked at the reasons why it makes sense to use analytical SQL in your data warehouse and operational projects. In this post we are going to examine the key processing concepts behind analytical SQL.
One of the main advantages of Oracle's SQL analytics is that the key concepts are shared across all functions - in effect we have created a unified SQL framework for delivering analytics. These concepts build on existing SQL features to provide developers and business users with a framework that is both flexible and powerful in terms of its ability to support sophisticated calculations. There are four key concepts that you need to understand when implementing features and functions relating to SQL analytics:
Let's look at each of these topics in more detail.
The execution workflow for SQL statements containing analytical SQL is relatively simple: all the joins and all the WHERE, GROUP BY and HAVING clauses are processed first. The output from this step is then passed to the analytical functions so all the calculations can be applied. This typically involves the use of window functions, which are applied based on the partitions that have been defined, with analytic functions applied to each row in each partition. Finally, the ORDER BY clause is processed to provide control over the final output. It is useful to keep this workflow in mind when you are building your analytical SQL because it will help you understand the inputs flowing into your analytical functions and the resulting output.
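To make this concrete, here is a minimal sketch (the employees table and its columns are illustrative, not part of the original example):

SELECT department_id,
       SUM(salary) AS dept_salary,
       RANK() OVER (ORDER BY SUM(salary) DESC) AS dept_rank
FROM employees
WHERE hire_date > DATE '2000-01-01'
GROUP BY department_id
ORDER BY dept_rank;

The WHERE filter and GROUP BY are evaluated first, the RANK() function is then applied to the grouped rows, and the ORDER BY sorts the final output.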
Oracle's analytic functions allow the input data set to be divided into groups of rows which are referred to as "partitions". It is important to note that in this context the term "partition" is completely unrelated to the table partition feature.
These analytical partitions are created after the groups defined with GROUP BY clauses and can be used by any analytical aggregate function, such as sums and averages. The partitions can be based on any column that is part of the input data set, and individual partitions can be any size. It is quite possible to create a single partition containing all the rows from the initial query result set, a small number of very large partitions, or a large number of very small partitions where each partition just contains a few rows.
For each row in a partition it is possible to define a window over the data which determines the range of rows used to perform the calculations for the current row (the next section will explain the concept of the "current row"). The size of a window can be based on either a physical number of rows or a logical interval, which is typically time-based. The window has a starting row and an ending row and, depending on how the window is defined, it may move at only one end or, in some cases, both ends.
For example, a cumulative sum function would have its starting row fixed at the first row in the partition and the ending row would then slide from the starting row all the way to the last row of the partition to create a running total over the rows in the partition.
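A minimal sketch of such a running total (assuming a simple sales table with sales_month and revenue columns; the names are illustrative only):

SELECT sales_month, revenue,
       SUM(revenue) OVER (ORDER BY sales_month
                          ROWS UNBOUNDED PRECEDING) AS cum_revenue
FROM sales;

Here the frame always starts at the first row and ends at the current row, so each row shows the cumulative revenue to date.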
Reporting aggregates work the same way but with a fixed window: with an empty OVER() clause, or one containing only a PARTITION BY, the window spans the whole partition, so group-level and grand totals can be returned on every row. Assuming a sales table with Qtrs and Revenue columns:

SELECT Qtrs
     , Revenue
     , SUM(Revenue) OVER (PARTITION BY Qtrs) AS Qtr_Sales
     , SUM(Revenue) OVER () AS Total_Sales
FROM sales;
If the data set contains a date column then it is possible to use logical windows by taking advantage of Oracle’s built-in time awareness. A good example of a window where the start row changes is the calculation of a moving average. In this case both the starting and ending points slide so that a constant physical or logical range is maintained during the processing. The example below creates a four-period moving average, and the images show the current row, which is identified by the arrow, and the moving window, which is marked as the pink area:
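A minimal sketch of such a four-period moving average (again using an illustrative sales table):

SELECT sales_month, revenue,
       AVG(revenue) OVER (ORDER BY sales_month
                          ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS mov_avg_4
FROM sales;

The frame covers the current row plus the three preceding rows - four periods in total - and both ends slide forward together as the current row advances.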
The concept of a "window" is very powerful and provides a lot of flexibility in terms of being able to interact with the data. A window can be set as large as all the rows in a partition. At the other extreme it could be just a single row. Users may specify a window containing a constant number of rows, or a window containing all rows where a column value is in a specified numeric range. Windows may also be defined to hold all rows where a date value falls within a certain time period, such as the prior month.
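For example, a logical time-based window can be declared with a RANGE clause (a sketch, assuming a hypothetical orders table with a DATE-typed order_date column):

SELECT order_date, amount,
       SUM(amount) OVER (ORDER BY order_date
                         RANGE BETWEEN INTERVAL '30' DAY PRECEDING
                               AND CURRENT ROW) AS sales_last_30_days
FROM orders;

Because the frame is defined as a time interval rather than a row count, each row's window automatically includes however many rows fall within the prior 30 days.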
When using window functions, the current row is included during calculations, so you should only specify (n-1) when you are dealing with n items - see the next section for more information….
Each calculation performed with an analytic function is based on a current row within a partition. The current row serves as the reference point: during processing it begins at the starting row and moves through the following rows until the end row of the window is reached. For instance, a centered moving average calculation could be defined with a window that holds the current row, the six preceding rows, and the six following rows. In the example below the running total for each row is the result of the current row plus the values from the preceding rows in its partition, and at the start of each new partition the running total is reset. The example shown below creates running totals within a result set showing the total sales for each channel within a product category within year:
SELECT calendar_year, prod_category_desc, channel_desc, country_name, sales
     , SUM(sales) OVER (PARTITION BY calendar_year, prod_category_desc, channel_desc
                        ORDER BY country_name) sales_tot_cat_by_channel
FROM . . .
This post has outlined the four main processing concepts behind analytical SQL. The next series of posts will provide an overview of the key analytical features and functions that use these concepts. In the next blog post we will review the analytical SQL features and techniques that are linked to enhanced reporting which includes: windowing, lag-lead, reporting aggregate functions, pivoting operations and data densification for reporting and time series calculations. Although these topics will be presented in terms of data warehousing, they are actually applicable to any activity needing analysis and reporting.
If you have any questions or comments about analytical SQL then feel free to contact me via this blog.
Are you interested in learning how Oracle customers are taking advantage of Oracle's Data Warehousing and Big Data Platform? Want to keep up on the latest product releases and how they might impact your organization? Looking for best practices that describe how to most effectively apply Oracle DW-Big Data technology?
Check out Oracle's new monthly magazine - Oracle DW-Big Data Monthly Roundup. Powered by Flipboard, you can now view all this content on your favorite device.
Let us know what you think!
Supporting quote: “Oracle Exadata Database Machine is the best platform on which to run the Oracle Database and the X4 release extends that value proposition,” said Oracle President Mark Hurd. “As private database clouds grow in popularity, the strengths of Oracle Exadata around performance, availability and quality of service set it apart from all alternatives.”
We have just announced the release of the fifth-generation of our flagship database machine: Oracle Exadata Database Machine X4. This latest release introduces new hardware and software to accelerate performance, increase capacity, and improve efficiency and quality-of-service for enterprise data warehouse deployments.
Other key highlights are:
1) Improved workload management
We are using a new generation of InfiniBand network protocols to ensure that network-intensive workloads such as reporting, batch and backups do not delay response-time sensitive interactive workloads, which is great news for IT teams that have to define and manage service level agreements.
2) Bigger flash cache for even faster performance
3) Hardware driven compression/decompression
A feature that is unique to Exadata is Flash Cache Compression, which transparently compresses database data in the flash cache, using hardware acceleration to compress and decompress data with zero performance overhead.
4) In-memory processing
5) Increased support for big data
The full press release is here: http://www.oracle.com/us/corporate/press/2079925
dunnhumby presented at this year's OpenWorld, where they outlined the how and why of data warehousing on Exadata. Our engineered system delivered a performance improvement of more than 24x. dunnhumby pushes its data warehouse platform really hard: with more than 280 billion fact rows and 250 million dimension rows for one large retailer client alone, dunnhumby’s massive data requires the best performance the industry has to offer.
In Oracle Exadata, dunnhumby has found that solution, using Oracle Exadata’s advanced Smart Scan technology and robust Oracle Database features. This new environment has empowered its analysts to perform complex ad hoc queries across billions of fact rows and hundreds of millions of dimension rows in minutes or seconds, compared to hours or even days on other platforms.
You can download the presentation by Philip Moore - Exadata Datawarehouse Architect, Dunnhumby USA LLC - from the OpenWorld site, see here: https://oracleus.activeevents.com/2013/connect/sessionDetail.ww?SESSION_ID=3412.
If you missed Philip's session at OpenWorld then we have just released a new video interview with Chris Wones, Director of Data Solutions at dunnhumby. During the interview Chris outlines some of the challenges his team faced when trying to do joined-up analytics across disparate and disconnected data sets, and how Exadata allowed them to bring everything together so that they could run advanced analytical queries that were simply not possible before, which meant being able to bid on completely new types of contracts. The combination of Exadata and Oracle Advanced Analytics is delivering real business benefit to dunnhumby and its customers.
For more information about Oracle's Advanced Analytics option check out Charlie Berger's advanced analytics blog: http://blogs.oracle.com/datamining and Charlie's twitter feed: https://twitter.com/CharlieDataMine
To watch the video click on the image:
If the video does not start follow this link: http://medianetwork.oracle.com/video/player/2889835899001
As a follow-on to the previous post (here) on use cases, follow the link below to a recording that explains how to go about expanding the data warehouse into a big data platform.
The idea behind it all is to cover a best practice on adding to the existing data warehouse and expanding the system (shown in the figure below) to deal with:
To access the webcast:
The core business reason to build a Data Factory as it is presented here is to implement a cost savings strategy by placing long-running batch jobs on a cheaper system. The project is often funded by not spending money on the more expensive system - for example by switching Mainframe MIPS off - and instead leveraging that cost savings to fund the Data Factory. The first figure shows a simplified implementation of the Data Factory.
As the image below shows, the data factory must be scalable, flexible and (more) cost effective for processing the data. The typical system used to build a data factory is Apache Hadoop or, in the case of Oracle’s Big Data Appliance, Cloudera’s Distribution including Apache Hadoop (CDH).
Hadoop (and therefore Big Data Appliance and CDH) offers an extremely scalable environment to process large data volumes (or a large number of small data sets) and jobs. Most typical is the offload of large batch updates, matching and de-duplication jobs etc. Hadoop also offers a very flexible model, where data is interpreted on read, rather than on write. This idea enables a data factory to quickly accommodate all types of data, which can then be processed in programs written in Hive, Pig and other Hadoop frameworks.
As shown above, the data factory is an integration platform, much like an ETL tool. Data sets land in the data factory, batch jobs process data and this processed data moves into the upstream systems. These upstream systems include RDBMSs, which are then used for various information needs. In the case of a Data Warehouse, this is very close to pattern 2 described below, with the difference that in the data factory data is often transient and removed after the processing is done.
This transient nature of data is not a required feature, but it is often implemented to keep the Hadoop cluster relatively small. The aim is generally to just transform data in a more cost effective manner.
Where the upstream system is a NoSQL database, data is often prepared in a specific key-value format to be served up to end applications such as a website. NoSQL databases work really well for that purpose, but the batch processing is better left to the Hadoop cluster.
It is also very common for data to flow in the reverse direction, i.e. for data from RDBMS or NoSQL databases to flow into the data factory. In most cases this is reference data, like customer master data. In order to process new customer data, this master data is required in the Data Factory.
Because of its low risk profile - the logic of these batch processes is well known and understood - and funding from savings in other systems, the Data Factory is typically an IT department’s first attempt at a big data project. The downside of a Data Factory project is that business users see very little benefit, in that they do not get new insights out of big data.
The common way to drive new insights out of big data is pattern two: expanding the data warehouse with a data reservoir enables an organization to capture raw data in a system that adds agility to the organization. The graphical pattern is shown below.
A Data Reservoir - like the Data Factory from Pattern 1 - is based on Hadoop and Oracle Big Data Appliance, but rather than holding transient data, processing it and then handing it off, a Data Reservoir aims to store data at a lower grain, and for a much longer period, than was previously stored.
The Data Reservoir is initially used to capture data, aggregate new metrics and augment (not replace) the data warehouse with new and expansive KPIs or context information. A very typical addition is the sentiment of a customer towards a product or brand which is added to a customer table in the data warehouse.
The addition of new KPIs or new context information is a continuous process. That is, new analytics on raw and correlated data should find their way into the upstream Data Warehouse on a very, very regular basis.
As the Data Reservoir grows and becomes known within the organization because of the new KPIs and context it provides, users should start to look at it as an environment to “experiment” and “play” with data. With some rudimentary programming skills, power users can start to combine various data elements in the Data Reservoir, using for example Hive, as the sketch below shows. This enables the users to verify a hypothesis without the need to build a new data mart. Hadoop and the Data Reservoir now become an economically viable sandbox for power users, driving innovation, agility and possibly revenue from hitherto unused data.
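As a sketch of the kind of ad-hoc exploration this enables (the web_logs and customer_master tables and their columns are purely hypothetical), a power user might test whether one customer segment browses differently from another by joining two raw data sets directly in Hive:

SELECT c.segment,
       COUNT(DISTINCT w.session_id) AS sessions,
       COUNT(*) AS page_views
FROM web_logs w
JOIN customer_master c ON (w.customer_id = c.customer_id)
WHERE w.log_date >= '2014-01-01'
GROUP BY c.segment;

If the hypothesis looks promising, the result can be promoted into the Data Warehouse as a new KPI; if not, nothing has been built that needs to be torn down.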
Agility for power users and expert programmers is one thing, but eventually the goal is to enable business users to discover new and exciting things in the data. Pattern 3 combines the data reservoir with a special information discovery system to provide a Graphical User Interface specifically for data discovery. This GUI emulates in many ways how an end user today searches for information on the internet.
To empower a set of business users to truly discover information, they first and foremost require a Discovery tool. A project should therefore always start with that asset.
Once the Discovery tool (like Oracle Endeca) is in place, it pays to start to leverage the Data Reservoir to feed the Discovery tool. As is shown above, the Data Reservoir is continuously fed with new data, and the Discovery tool gives business users a way to create ad-hoc data marts from it. Having the Data Reservoir simplifies data acquisition for end users because they only need to look in one place for data.
In essence, the Data Reservoir is now used to drive two different systems: the Data Warehouse and the Information Discovery environment. In practice, users will very quickly gravitate to the appropriate system, but no matter which system they use, they now have the ability to drive value from data into the organization.
So far, most of what was discussed was analytics and batch based. But a lot of organizations want to move to a real-time interaction model with their end customers (or, in the world of the Internet of Things, with other machines and sensors).
Hadoop is very good at providing the Data Factory and the Data Reservoir, at providing a sandbox, and at providing massive storage and processing capabilities, but it is less good at doing things in real time. Therefore, to build a closed-loop recommendation system - which should react in real time - Hadoop is only one of the components.
Typically the bottom half of the last figure is akin to pattern 2 and is used to catch all data, analyze the correlations between recorded events (detected fraud, for example) and generate a set of predictive models describing something like “if a, b and c during a transaction, mark as suspect and hand off to an agent”. Such a model would, for example, block a credit card transaction.
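As a sketch of the kind of rule such a model might express (the transactions table, its columns and the thresholds are purely hypothetical):

SELECT txn_id,
       CASE WHEN amount > 5000               -- condition a (illustrative threshold)
             AND txn_country <> home_country -- condition b
             AND txns_last_hour > 3          -- condition c
            THEN 'SUSPECT'
            ELSE 'OK'
       END AS fraud_flag
FROM transactions;

In the real-time layer the same scoring logic would of course be applied to events in flight rather than through a batch query.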
To make such a system work it is important to use the right technology at both levels. Real time technologies like Oracle NoSQL Database, Oracle Real Time Decisions and Oracle Event Processing work on the data stream in flight. Oracle Big Data Appliance, Oracle Exadata/Database and Oracle Advanced Analytics provide the infrastructure to create, refine and expose the models.
Today’s big data technologies offer a wide variety of capabilities. Leveraging these capabilities alongside the environment and skills already in place, according to the four patterns described, does enable an organization to benefit from big data today. It is a matter of identifying the applicable pattern for your organization and then starting on the implementation.
The technology is ready. Are you?
The Data Warehouse Insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.