Tuesday Apr 14, 2015

Statement of Direction -- Big Data Management System

Click here to start reading the Full Statement of Direction. 

Introduction: Oracle Big Data Management System Today 

As today's enterprises embrace big data, their information architectures must evolve. Every enterprise has data warehouses today, but the best-practices information architecture embraces emerging technologies such as Hadoop and NoSQL. Today’s information architecture recognizes that data is stored not only in increasingly disparate data platforms, but also in increasingly disparate locations: on-premises and potentially multiple cloud platforms. The ideal of a single monolithic ‘enterprise data warehouse’ has faded as a new, more flexible architecture has emerged. Oracle calls this new architecture the Oracle Big Data Management System, and today it consists of three key components:

  • The data warehouse, running on Oracle Database and Oracle Exadata Database Machine, is the primary analytic database for storing much of a company’s core transactional data: financial records, customer data, point-of-sale data and so forth. Despite now being part of a broader architecture, the demands on the RDBMS for performance, scalability, concurrency and workload management are greater than ever; Oracle Database 12c introduced Oracle Database In-Memory (with columnar tables, SIMD processing, and advanced compression schemes) as the latest in a long succession of warehouse-focused innovations. The market-leading Oracle Database is the ideal starting point for customers to extend their architecture to the Big Data Management System.
  • The ‘data reservoir’, hosted on Oracle Big Data Appliance, will augment the data warehouse as a repository for the new sources of large volumes of data: machine-generated log files, social-media data, and videos and images -- as well as a repository for more granular transactional data or older transactional data which is not stored in the data warehouse. Oracle’s Big Data Management System embraces complementary technologies and platforms, including open-source technologies: Oracle Big Data Appliance includes Cloudera’s Distribution of Hadoop and Oracle NoSQL Database for data management.
  • A ‘franchised query engine,’ Oracle Big Data SQL, enables scalable, integrated access in situ to the entire Big Data Management System. SQL is the accepted language for day-to-day data access and analytic queries, and thus SQL is the primary language of the Big Data Management System. Big Data SQL enables users to combine data from Oracle Database, Hadoop and NoSQL sources within a single SQL statement, as sketched after this list. Leveraging the architecture of Exadata Storage Software and the SQL engine of the Oracle Database, Big Data SQL delivers high-performance access to all data in the Big Data Management System.
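
To make this concrete, here is a hypothetical sketch of what such a statement can look like: an external table is defined over a Hive table using the ORACLE_HIVE access driver provided with Big Data SQL, and is then joined to an ordinary warehouse table in one query. All table, column and directory names are illustrative, and the exact access parameters depend on the Big Data SQL release in use.

  -- Illustrative only: expose a Hive table of web log data to Oracle SQL.
  -- With no explicit access parameters, ORACLE_HIVE maps the external table
  -- to a Hive table of the same name (check the documentation for your release).
  CREATE TABLE weblogs_ext
  ( cust_id   NUMBER
  , log_time  TIMESTAMP
  , page_url  VARCHAR2(4000)
  )
  ORGANIZATION EXTERNAL
  ( TYPE ORACLE_HIVE
    DEFAULT DIRECTORY default_dir
  )
  REJECT LIMIT UNLIMITED;

  -- One SQL statement combining warehouse data with Hadoop data.
  SELECT c.cust_segment, COUNT(*) AS page_views
  FROM   customers   c
  JOIN   weblogs_ext w ON w.cust_id = c.cust_id
  GROUP  BY c.cust_segment;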

Using this architecture, the Oracle Big Data Management System combines the performance of Oracle’s market-leading relational database, the power of Oracle’s SQL engine, and the cost-effective, flexible storage of Hadoop and NoSQL. The result is an integrated architecture for managing Big Data, providing all of the benefits of Oracle Database, Exadata, and Hadoop, without the drawbacks of independently-accessed data repositories.  

Note that the scope of this statement of direction is the data platform for Big Data. An enterprise Big Data solution would also include big data tools and big data applications built upon this data platform. 

Read the full Statement of Direction -- Big Data Management System here.


Wednesday Mar 18, 2015

Production workloads blend Cloud and On-Premise Capabilities

Prediction #7 - blending production workloads across cloud and on-premise in Oracle's Enterprise Big Data Predictions 2015 is a tough nut to crack. Yet, we at Oracle think this is really the direction we all will go. Sure we can debate the timing, and whether or not this happens in 2015, but it is something that will come to all of us who are looking towards that big data future. So let’s discuss what we think is really going to happen over the coming years in the big data and cloud world.

Reality #1 – Data will live both in the cloud and on-premise

We see this today. Organizations run Human Capital Management systems in the cloud and integrate these with data from outside cloud-based systems (think, for example, LinkedIn, staffing agencies etc.), while their general data warehouses and new big data systems are all deployed as on-premise systems. We also see the example in the prediction where various auto dealer systems uplink into the cloud to enable the manufacturer to consolidate all of their disparate systems. This data may be translated into data warehouse entries and may well live in two worlds, both in the cloud and on-premise, either in detail for deep-dive analytics or in aggregated form.

Reality #2 – Hybrid deployments are difficult to query and optimize

We also see this today and it is one of the major issues of living in the hybrid world of cloud and on-premise. A lot of the issues are driven by low-level technical limitations, specifically in network bandwidth and upload/download capacity into and out of the cloud environment. The other challenges are really (big) data management challenges in that they go to the art of running queries across two ecosystems with very different characteristics. We see a trend to use engineered systems on-premise, which deliver optimized performance for the applications, but in the cloud we often see virtualization pushing the trade-off towards ease of deployment and ease of management. These completely different ecosystems make optimization of queries across them very difficult.

Solution – Equality brings optimizations to mixed environments

As larger systems like big data and data warehouse systems move to the cloud, better performance becomes a key success criterion. Oracle is uniquely positioned to drive both standardization and performance optimizations into the cloud by deploying on engineered systems like Oracle Exadata and Oracle Big Data Appliance. Deploying engineered systems enables customers to run large systems in the cloud, delivering the same performance they see today in on-premise deployments. This means that we do not live in a world divided into slow and fast, but in a world of fast and fast.
This equivalence also means that we have the same functionality in both worlds, and here we can sprinkle in some – future – Oracle magic, where we start optimizing queries to take into account where the data lives, how fast we can move it around (the dreaded network bandwidth issue) and where we need to execute code. Now, how are we going to do this? That is a piece of magic, and you will just need to wait a bit… suffice it to say we are hard at work solving this challenging problem.

Monday Mar 09, 2015

Why SQL Part 2 - It has a powerful framework

In the first part (Why SQL is the natural language for data analysis) of this series of blog posts I explained why analytics is important to the business, how the volume and variety of data continue to expand, and why that makes it vital that you select the right language for data analysis. Many of us work with SQL every day and take for granted many of its unique features: its power, its flexibility and the richness of its analytics.

This familiarity with SQL means that sometimes we are a bit slow to stop some of our projects from investing in other big data languages and frameworks such as MapReduce, Impala, Spark, Java and many of the new generation of SQL-like open source projects. While many of these new languages are considered “cool”, it is very easy to end up building new proprietary data silos, investing in a language that is eventually replaced by another open source project, or investing time and effort in code that eventually fails to deliver the required analytics.

One of the aims of this series of blog posts is to refresh your memory about why SQL has been so successful in the area of analytics. SQL has four unique features that make it perfect for data analysis...

    [Read More]

    Tuesday Mar 03, 2015

    Why SQL is the natural language for data analysis

    Analytics is a must-have component of every corporate data warehousing and big data project. It is the core driver for the business: the development of new products, better targeting of customers with promotions, hiring of new talent and retention of existing key talent. Yet the analysis of “big data environments” in particular, that is, data stored and processed outside of classical relational systems, continues to be a significant challenge for the majority of companies. According to Gartner, 72% of companies are planning to increase their expenditure on big data yet 55% state they don’t have the necessary skills to make use of it. 

    The objective of this series of articles, which will appear over the coming weeks, is to explain why SQL is the natural language for any kind of data analysis, including big data, and the benefits that this brings for application developers, DBAs and business users. 

    [Read More]

    Friday Jan 16, 2015

    Deploying SAS High Performance Analytics on Big Data Appliance

    Oracle and SAS have an ongoing commitment to our joint customers to deliver value-added technology integrations through engineered systems such as Exadata, Big Data Appliance, SuperCluster,  Exalogic and ZFS Storage Appliance.  Dedicated resources manage and execute on joint SAS/Oracle Database, Fusion Middleware, and Oracle Solaris integration projects; providing customer support, including sizing and IT infrastructure optimization and consolidation.  Oracle support teams are onsite at SAS Headquarters in Cary, NC (USA); and in the field on a global basis.

    The latest in this effort is to enable our joint customers to deploy SAS High Performance Analytics on Big Data Appliance. This enables SAS users to leverage the lower-cost infrastructure Hadoop offers in a production-ready deployment on Oracle Big Data Appliance. Hear from Paul Kent (VP Big Data, SAS) on some of the details.

    Read more on deploying SAS High Performance Analytics on www.oracle.com/SAS. Don't miss the deployment guide and best practices here.

    Friday Oct 10, 2014

    Review of Data Warehousing and Big Data at #OOW14

    Data Warehousing and Big Data were at the heart of this year’s OpenWorld conference, featuring in a number of keynotes and a huge number of general sessions. Our hands-on labs were all completely full as people got valuable hands-on time with our most important new features. The key areas at this year’s conference were:

    • Big Data SQL - One Fast SQL Query for All Your Data
    • Database In-Memory - Powering the Real-Time Enterprise
    • Multitenant - Plug Your Data Warehousing into the Cloud

    All these topics appeared in the main keynote sessions, including live on-stage demonstrations of how each feature can be used to increase the performance and analytical capability of your data warehouse.

    If you want to revisit the most important sessions, or if you simply missed this year’s conference and want to catch up on all the most important topics, then I have put together a book of the highlights from this year’s conference. The booklet is divided into the following sections:

    • Key Messages
    • Overview of Oracle Database 12c
    • Session Catalogue
    • Your Oracle Presenters
    • Links
    • OpenWorld 2015


    You can download my review in PDF format by clicking here. Hope this proves useful and if I missed anything then let me know. 

    Thursday Oct 09, 2014

    One of the ways Oracle is using Big Data

    Today, Oracle is using big data technology and concepts to significantly improve the effectiveness of its support operations, starting with its hardware support group. While the company is just beginning this journey, the initiative is already delivering valuable benefits.

    In 2013, Oracle’s hardware support group began to look at how it could use automation to improve support quality and accelerate service request (SR) resolution. Its goal is to use predictive analytics to automate SR resolution with 80% to 95% accuracy.

    Oracle’s support group gathers a tremendous amount of data. Each month, for example, it logs 35,000 new SRs and receives nearly 6 TB of telemetry data via automated service requests (ASRs)—which represent approximately 18% of all SRs. Like many organizations, Oracle had a siloed view of this data, which hindered analysis. For example, it could look at SRs but could not analyze the associated text, and it could review SRs and ASRs separately, but not together.

    Oracle was conducting manual root-cause analysis to identify which types of SRs were the best candidates for automation. This was a time-consuming, difficult, and costly process, and the company looked to introduce big data and predictive analytics to automate insight.

    The team knew that it had to walk before it could run. It started by taking information from approximately 10 silos, such as feeds from SRs and ASRs, parts of databases, and customer experience systems, and migrating the information to an Oracle Endeca Information Discovery environment. Using the powerful Oracle Endeca solution, Oracle could look at SRs, ASRs, and associated notes in a single environment, which immediately yielded several additional opportunities for automation. On the first day of going live with the solution, Oracle identified 4% more automation opportunities.

    Next, Oracle focused its efforts on gaining insight in near real time, leveraging the parallel processing of Hadoop to automatically feed Oracle Endeca Information Discovery—dramatically improving data velocity. Oracle’s first initiative with this new environment looked at Oracle Solaris SRs. In the first few weeks of that project, Oracle identified automation opportunities that will increase automated SR resolution from less than 1% to approximately 5%—simply by aggregating all of the data in near real-time. 

    Once Oracle proved via these early proofs of concept that it could process data more efficiently and effectively to feed analytical projects, it began to deploy Oracle Big Data Appliance and Oracle Exalytics In-Memory Machine.

    Read the entire profile here.

    Monday Sep 29, 2014

    The End of the Data Scientist Bubble...

    Looking around northern California and inside many technology kitchens makes me believe that we are about to see the Data Scientist bubble burst. And then I read the Fortune Magazine article on Peter Thiel - and the excerpt from Zero to One (his new book) in that article - and it dawned on me that this is one of the interesting ways to look at the Data Scientist bubble.

    Thiel's Classification of Innovation

    Without trying to simplify and/or bastardize Mr. Thiel's theory, the example in the Fortune Magazine article will make this visible to most people (I hope). In the article the analogy is: going from one typewriter to 100 typewriters is 1 to N; inventing a word processor moves us from 0 to 1. In other words, true innovation dramatically changes things by giving previously unknown power to the masses. It is that innovation that moves us from 0 to 1. Expansion of existing ideas - not true innovation - moves us from 1 to N. Of course, don't take my word on this but read the article or the book...

    The Demise of the Human Data Scientist

    The above paradigm explains the Data Scientist bubble quite nicely. Once upon a time companies hired a few PhD students who by chance had a degree in statistics, had learned how to program and had figured out how to deal with (large) data sets. These newly minted data scientists proved that there is potential value in mashing data together and running analytics on these newly created data sets, and thus caused a storm of publicity. Companies large and small are now frantically trying to hire these elusive data scientists or, something a little more down to earth, are creating data scientists (luckily not in the lab) by forming teams that each bring a part of the skillset to the table.

    This approach starts to smell pretty much like a whole busload of typewriters being thrown at a well-known data analysis and data wrangling problem. Neither the problem nor the solution is new, nor innovative. Data Scientists are therefore not moving us from 0 to 1...

    One could argue that while the data scientist quest is not innovative, at least it solves the problem of doing analytics. Fair and by some measure correct, but there is one bigger issue with the paradigm of "data scientists will solve our analytics problem" and that is scale. Giving the keys to all that big data to only a few data scientists is not going to work because these smart and amazing people are now becoming, often unbeknownst to them, an organizational bottleneck to gaining knowledge from big data.

    The only real solution, our 0 to 1, is to expose a large number of consumers to all that big data, while enabling these consumers to apply a lot of the cool data science to all that data. In other words, we need to provide tools which include data science smarts. Those tools will enable us to apply the 80% common data science rules to the 80% of common business problems. This approach drives real business value at scale. With large chunks of issues resolved, we can then focus our few star data scientists on the 20% of problems or innovations that drive competitive advantage and change markets.

    My Conclusion

    The bubble is bursting because what I am seeing is more and more tools coming to market (soon) that will drive data science into the day-to-day job of all business people. Innovation is not the building of a better tool for data scientists or hiring more of them; instead, the real 0 to 1 innovation is tools that make all of us data scientists and let us solve our own data science problems. The future of Data Science is smarter tools, not smarter humans.

    Wednesday Sep 03, 2014

    OpenWorld on your iPad and iPhone - UPDATED!

    In my last blog post I provided links to our OpenWorld data warehouse web app for smartphones and tablets. Now that the OOW team has released the session schedule (it is now live on the OpenWorld site) I have updated my iPhone and iPad apps to include the list of sessions on a day-by-day basis (Sunday, Monday, Tuesday, Wednesday, Thursday). The list of sessions can still be viewed in subject area order (data warehousing, performance and scalability, analytics, unstructured data, industry models and big data) within the app via the “Switch to subject view” link in the top left part of the screen.


    I have also added a location map which can be viewed by clicking the linked text “View location map” in the top right part of the screen in each app. The same location map is available within both the iPad and iPhone apps.




    If you want to run these updated web apps on your smartphone and/or tablet then you can reuse the existing links that I published in my last blog post. If you missed that post then follow these links:

    Android users: I have tested the app on Android and there appears to be a bug in the way the Chrome browser displays frames, since scrolling within frames does not work. The app does work correctly if you use either the Android version of the Opera browser or the standard Samsung browser on Samsung devices.

    If you have any comments about the app (content you would like to see) then please let me know. Enjoy OpenWorld and, if you have time, it would be great to see you if you want to stop by at the Parallel Execution and Analytical SQL demo booth.

    Tuesday Jul 22, 2014

    StubHub Taps into Big Data for Insight into Millions of Customers’ Ticket-Buying Patterns, Fraud Detection, and Optimized Ticket Prices

    The benefits of Big Data at StubHub:

    • StubHub enabled data scientists to work directly with customer-related data—such as ticket-purchasing history—inside the database, and to use database options to explore the data graphically, build and evaluate multiple data-mining models, and deploy predictions and insights throughout the enterprise—drastically improving StubHub’s agility and responsiveness
    • Developed highly targeted ticket promotional campaigns and offers by having the ability to calculate 180 million customers’ lifetime value (or propensity) instead of just 20,000 values at a time
    • Used Oracle R Enterprise component of Oracle Advanced Analytics—an Oracle Database option—to reduce a fraud issue by up to 90%

    Read more or watch the video:

    Monday Jun 30, 2014

    My Data Warehousing and Big Data Must-See Guide for OpenWorld

    Data Warehousing Must-See Session Guide for Open World 2014

    There’s so much to learn at this year’s Oracle OpenWorld - it provides more educational and networking opportunities than any other conference dedicated to Oracle business and technology users. To get the most from this year’s event I have prepared an initial guide which lists all the most important data warehousing and big data sessions. The first part of the guide provides links to videos and content from last year’s event so you can either re-live the highlights from last year or see what you missed by not being there! If you have an iPad or use OSX 10.9 then you will want to download the iBook version because this contains video highlights from last year’s database keynote session.

    The session guide is divided into the following chapters:

    • Data Warehousing
    • Performance and Scalability
    • Database Analytics
    • Industry Data Models and Data Integration
    • Unstructured Data
    • Big Data

    The last part of the booklet lists all the key data warehouse and big data product managers who will be presenting at this year’s conference. Included alongside each speaker’s biography are links to their social media sites and blogs.

    Please note that, as usual, there will be hands-on labs at this year’s OpenWorld but these have not yet been included in the session catalog. I am expecting them to be added shortly. Details of all our labs will appear in the more detailed guide that I will publish before the conference begins.

    The Must-See guide is available in two formats: 

    For iPad and OSX 10.9 (Mavericks) users please download the iBook, which is available here: https://dl.dropboxusercontent.com/u/69436560/OOW/Must-See%20OpenWorld%202014_Sessions.ibooks

    For all other platforms please download the PDF, which is available here: https://dl.dropboxusercontent.com/u/69436560/OOW/Must-See%20OpenWorld%202014_Sessions.pdf

     Hope to see you in September at OpenWorld. 

    Thursday Apr 17, 2014

    Analyzing our Big Data Lite movie app clickstream data

    Since last year's OpenWorld I have posted quite a few articles on our new 12c pattern matching feature: 

     To date most of my online demos and tutorials have used a very simple data set consisting of a single table with 60 rows of fictitious trade data. Over the last few months I have been searching around for a slightly bigger and more challenging data set to use. Fortunately, our Big Data PMs (led by Marty Gubar) have put together a very rich big data schema as part of their recently launched Big Data Virtual Machine, which you can download from our OTN web page, see here: http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html.

    The data set is based around an online customer movie application. Here is a little background…

    Movieplex on otn

    Oracle MoviePlex Demo Application

    Oracle MoviePlex is a fictitious on-line movie streaming company. Customers log into Oracle MoviePlex where they are presented with a targeted list of movies based on their past viewing behavior. Because of this personalized experience and reliable and fast performance, customers spend a lot of money with the company and it has become extremely profitable.

    As users watch movies, search for their favorite actors and rate movies, the system records all the clicks in a log file. The contents of that log file look like this:

    Avro file

    In its raw state this information is not very helpful. It needs a lot of processing to convert the various data points into usable information. In many cases companies have been turning to Hadoop and its related Java-based programming framework, MapReduce, to process and convert these types of files into usable information. Most business users will want to summarise this information by customer and/or movie and then merge this information with other data sets. So how can we make access to and analysis of this type of data much easier? As part of this post I am going to compare the Hadoop-Java-MapReduce approach with an alternative approach using 12c SQL. My objective is not to determine which solution is the best because each approach has its own camp of supporters. Once we have reviewed the two solutions I will put everything into context and make some recommendations…..so let's bring on the code!

    Sessionization using Java

    Accessing the Avro file

    At the moment, in the context of Big Data, everything seems to revolve around Hadoop, MapReduce and Java. It is quite natural for a big data developer to extend their basic MapReduce processing to include more complicated requirements. In the case of our movie demo there is a lot of processing that needs to be done using the native features of Hadoop to collect and process the weblog file being generated by our application. There is an excellent video prepared by Marty Gubar (Director of Product Management for Big Data) which explains this process. It is part four of a six-part series that explains the MoviePlex demo: Part 4. Turn Clicks into Value - Flume & Hive. The video lasts about 5 minutes and you can watch it here: https://www.youtube.com/watch?v=IwrjJUoUwXY.

    The steps shown in the video explain how to access the Avro file and then how to clean the data to provide some interesting additional metrics.

    Calculating the sessionization data

     Creating the sessionization analysis is a little more complicated. In fact, it requires 370 lines of Java code.  Here is the Java code we created for doing the sessionization analysis (the code window is scrollable so you can review the very long code sample):

    The actual logic for the sessionization analysis is about 100 lines of code, as shown here (as before, the code window is scrollable so you can review the very long code sample):

    As you can see from the code listing this requires a strong knowledge of Java and, with 370 lines of code, if we decide to change the details of the pattern that we are searching for it is going to be a lengthy process to make the required changes. Making any changes to the code to reflect changing business requirements is definitely going to be beyond the skills of your average business analyst and/or data scientist, and this might negatively impact the level of project agility.

    Making life easier and richer with SQL

    My personal view is that sessionization analysis is quite a sophisticated requirement and is best achieved using SQL because the requirements and parameters for the analysis will naturally evolve over time as new questions are triggered by existing answers. Therefore, you need an agile approach to analytics that can deal with constantly changing requirements during the data discovery phase.

    Accessing the Avro file

    The first step is to create a DIRECTORY object to access the raw data file:

    CREATE DIRECTORY session_file_dir AS '/home/oracle/applog';
    GRANT READ, WRITE ON DIRECTORY session_file_dir TO pmuser;

    Next I created an external table over the data file. Notice that in the Avro file, each key column has an identifier and a value.

    {"custId":1077225,"movieId":100,"genreId":8,"time":"2011-07-30:22:55:32","recommended":"N","activity":6}

    The objective for this external table was to keep the processing as simple as possible:

    CREATE TABLE RAW_SESSION_DATA 
    (
      CUST_ID VARCHAR2(20) 
    , MOVIE_ID VARCHAR2(20) 
    , GENRE_ID VARCHAR2(20) 
    , SESSION_DATE VARCHAR2(50) 
    , RECOMMENDED_ID VARCHAR2(20) 
    , ACTIVITY_ID VARCHAR2(20)
    , RATING_ID VARCHAR2(20)
    ) 
    ORGANIZATION EXTERNAL 
    ( 
     TYPE ORACLE_LOADER 
     DEFAULT DIRECTORY SESSION_FILE_DIR 
     ACCESS PARAMETERS
     (RECORDS DELIMITED BY NEWLINE
     NOBADFILE
     NODISCARDFILE
     NOLOGFILE
     FIELDS TERMINATED BY ','
     OPTIONALLY ENCLOSED BY "'"
     MISSING FIELD VALUES ARE NULL)
     LOCATION (SESSION_FILE_DIR: 'movieapp_30months.log') 
    ) 
    REJECT LIMIT 1;
    ALTER TABLE RAW_SESSION_DATA 
    PROJECT COLUMN REFERENCED;

    Cleaning the data

    Now the next stage is to clean the data and remove the column identifiers such as custId, movieId, genreId etc. To do this processing I simply used the SQL SUBSTR() function. At the same time I decided to break out the activity ID column to create unique columns for each type of activity. This approach is useful if you are going to be using the data as part of a data mining project because in many cases it is useful to uniquely identify specific attributes. In many cases you will want to create a fact table from this process but in this case I have created a view while I do some basic prototyping:

    CREATE OR REPLACE VIEW VW_SESSION_DATA AS SELECT 
      SUBSTR(CUST_ID,11) AS CUST_ID
    , CASE SUBSTR(MOVIE_ID,11) WHEN 'null' THEN null ELSE SUBSTR(MOVIE_ID,11) END AS MOVIE_ID
    , CASE SUBSTR(GENRE_ID,11) WHEN 'null' THEN null ELSE SUBSTR(GENRE_ID,11) END AS GENRE_ID
    , TO_DATE(SUBSTR(SESSION_DATE, 9,19), 'YYYY-MM-DD HH24:MI:SS') AS SESS_DATE
    , CASE SUBSTR(RECOMMENDED_ID,15) WHEN 'null' THEN null ELSE SUBSTR(RECOMMENDED_ID,16,1) END AS REC_ID
    , (CASE SUBSTR(ACTIVITY_ID,12,2)
        WHEN '1' THEN '1'
        WHEN '2}' THEN '2'
        WHEN '3}' THEN '3'
        WHEN '4}' THEN '4'
        WHEN '5}' THEN '5'
        WHEN '6}' THEN '6'
        WHEN '7}' THEN '5'
        WHEN '8}' THEN '8'
        WHEN '9}' THEN '9'
        WHEN '10' THEN '10'
        WHEN '11' THEN '11'
        ELSE null
      END) AS ACTIVITY_ID
    , CASE SUBSTR(RATING_ID,9) WHEN 'null' THEN null ELSE SUBSTR(RATING_ID,10,1) END as RATING_ID
    , case SUBSTR(ACTIVITY_ID,12,1) WHEN '1' THEN 'Y' END as act_rate
    , case SUBSTR(ACTIVITY_ID,12,1) WHEN '2' THEN 'Y' END as act_complete 
    , case SUBSTR(ACTIVITY_ID,12,1) WHEN '3' THEN 'Y' END as act_pause
    , case SUBSTR(ACTIVITY_ID,12,1) WHEN '4' THEN 'Y' END as act_start 
    , case SUBSTR(ACTIVITY_ID,12,1) WHEN '5' THEN 'Y' END as act_browse 
    , case SUBSTR(ACTIVITY_ID,12,1) WHEN '6' THEN 'Y' END as act_list
    , case SUBSTR(ACTIVITY_ID,12,1) WHEN '7' THEN 'Y' END as act_search
    , case SUBSTR(ACTIVITY_ID,12,1) WHEN '8' THEN 'Y' END as act_login
    , case SUBSTR(ACTIVITY_ID,12,1) WHEN '9' THEN 'Y' END as act_logout
    , case SUBSTR(ACTIVITY_ID,12,2) WHEN '10' THEN 'Y' END as act_incomplete
    , case SUBSTR(ACTIVITY_ID,12,2) WHEN '11' THEN 'Y' END as act_purchase
    FROM RAW_SESSION_DATA;

    Running a query against this view generates our basic data set that we can then feed into our MATCH_RECOGNIZE clause to create the sessionization result set.
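
    As a quick sanity check before adding the MATCH_RECOGNIZE clause, a simple query along the following lines (the customer id is just the one from the sample record shown earlier) returns the cleaned-up event stream for a single customer in time order:

    SELECT cust_id, sess_date, activity_id, act_start, act_complete
    FROM   vw_session_data
    WHERE  cust_id = '1077225'
    ORDER  BY sess_date;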

    Sessionization

     Is there a better, simpler way to deal with the Avro file? Well yes there is. As part of the Database 12c release programme we will be adding in-database support for JSON data. This will allow us to take the raw Avro file and access and query it directly from within the database without the need to go through the complicated cleaning process! At the moment we have not made any collateral (PPTs etc.) on this feature publicly available so I can't provide you with any supporting links that will give you more information. However, once this feature is released I will revisit this little demo to explore how the new JSON feature can be used alongside SQL pattern matching.

    Calculating the sessionization data

    The code to calculate the sessionization information is as follows:

    SELECT *
    FROM vw_session_data 
    MATCH_RECOGNIZE
     (PARTITION BY cust_id ORDER BY sess_date 
      MEASURES match_number() session_id, 
        COUNT(*) no_of_events,
        FIRST(sess_date) start_date,
        TO_CHAR(FIRST(sess_date), 'hh24:mi:ss') start_time,
        LAST(sess_date) end_date,
        TO_CHAR(LAST(sess_date), 'hh24:mi:ss') end_time, 
        TO_CHAR(ROUND(TO_NUMBER(LAST(sess_date) - FIRST(sess_date)) * 1440), '999,999') duration,
        LAST(activity_id) last_act_id,
        COUNT(act_rate) act_rate,
        COUNT(act_complete) act_complete,
        COUNT(act_pause) act_pause,
        COUNT(act_start) act_start,
        COUNT(act_browse) act_browse,
        COUNT(act_list) act_list,
        COUNT(act_search) act_search,
        COUNT(act_login) act_login, 
        COUNT(act_logout) act_logout,
        COUNT(act_incomplete) act_incomplete,
        COUNT(act_purchase) act_purchase 
     ONE ROW PER MATCH 
     PATTERN (strt s+) 
     DEFINE 
        s as (round(to_number(sess_date - prev(sess_date)) * 1440) <= 5)
     );

    This statement uses many of the MATCH_RECOGNIZE features that I have discussed in previous posts. In this specific example I have decided to set the maximum gap between events within a session to 5 minutes. That means that if the user does nothing for more than 5 minutes I will assume that a new session has started, even if the user has not logged out. 

    Beyond the basic sessionization model there is another interesting pattern that we can search for within our data set. If a user starts watching a movie then we might not get another event being logged for over two hours. Therefore, we could look for instances where the time between events is less than 150 minutes (assuming most movies last around 2 hours) and at least one "start" event is logged and at least one "complete" event is logged:

    PATTERN (strt s* f+ c+) 
     DEFINE 
     f as act_start = 'Y',
     c as act_complete = 'Y',
     s as (round(to_number(sess_date - prev(sess_date))*1440) <= 150)
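
    For completeness, here is a sketch of that pattern dropped into a full MATCH_RECOGNIZE statement, using the same vw_session_data view as before and a trimmed-down set of measures:

    SELECT *
    FROM vw_session_data
    MATCH_RECOGNIZE
     (PARTITION BY cust_id ORDER BY sess_date
      MEASURES match_number() session_id,
        COUNT(*) no_of_events,
        FIRST(sess_date) start_date,
        LAST(sess_date) end_date,
        COUNT(act_start) act_start,
        COUNT(act_complete) act_complete
      ONE ROW PER MATCH
      PATTERN (strt s* f+ c+)
      DEFINE
        f as act_start = 'Y',
        c as act_complete = 'Y',
        s as (round(to_number(sess_date - prev(sess_date)) * 1440) <= 150)
     );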

    The output from this query is shown below:

    Start Complete Data Full

    Looking at the results what really jumps out is the level of interaction on the site before, during and after the user has watched a movie. Many of the records that are returned show users searching our movie database, going to their lists of favourite movies and rating the movies they have just watched. All this implies that if we can get customers to begin watching a movie there is a huge knock-on effect in terms of how long they remain on the site and the number of "events" they create while on the site. This is the sort of information that is useful to marketing and campaign management teams. 

    A different view: Calculating the sessionization data for each movie

    While developing the above analysis I began to wonder about analysing the data not from a customer perspective but from the point of view of each movie. This would allow me to look at the session activity over time for each movie and understand the usage patterns across movies and genres. For example, are there certain movies or types of movies that are more or less likely to be rated. Is the process of watching a movie the same across all movies in terms of the events that are registered during a session? There are all sorts of new avenues of analysis that could be created from looking at the data in this way.

    So how easy would it be to change the focus of the analysis to movies? Well, the code took me approximately 30 seconds to write - I actually cut & pasted from the customer sessionization example, switched the customer id for the movie id and finally removed a few fields. Here is the new code:

    SELECT *
    FROM vw_session_data td
    MATCH_RECOGNIZE
     (PARTITION BY movie_id ORDER BY sess_date 
      MEASURES COUNT(*) no_of_events,
         FIRST(sess_date) start_date,
         TO_CHAR(first(sess_date), 'hh24:mi:ss') start_time,
         LAST(sess_date) end_date,
         TO_CHAR(LAST(sess_date), 'hh24:mi:ss') end_time, 
         TO_CHAR(ROUND(TO_NUMBER(LAST(sess_date) - FIRST(sess_date)) * 1440), '999,999') duration,
         MIN(activity_id) act_id,
         MAX(activity_id) last_act_id,
         COUNT(act_rate) act_rate,
         COUNT(act_complete) act_complete,
         COUNT(act_pause) act_pause,
         COUNT(act_start) act_start,
         COUNT(act_browse) act_browse,
         COUNT(act_list) act_list,
         COUNT(act_search) act_search,
         COUNT(act_login) act_login, 
         COUNT(act_logout) act_logout,
         COUNT(act_incomplete) act_incomplete,
         COUNT(act_purchase) act_purchase 
     ONE ROW PER MATCH 
     PATTERN (strt s*) 
     DEFINE 
         s as (ROUND(TO_NUMBER(sess_date - PREV(sess_date))*1440) <= 120)
     ) MR;

    The speed at which you can create these additional/alternative result sets is really great!  It is now relatively easy to continue this discovery process by analysing the information by movie genre or any of the other related attributes.

    Summary

    The main take-away from this post is that (hopefully) I have shown how easy it is to use SQL for sessionization analysis. Firstly, in terms of creating the framework to support the normal "sessionization" transformation process: taking the source log file, extracting the data points and then grouping the information by user. The great news is that processing of the source file will be greatly simplified when the JSON feature is released because it will be possible to query the log file in a much simpler way to extract the required data points. Secondly, SQL provides a much simpler and easier way to support the evolutionary nature of the discovery process. Changing the focus from customer analysis to movie analysis is both quick and easy. I am not convinced that a Java-based approach to data discovery can offer the same level of simplicity and agility but I will let you be the judge of that last statement….

    Tuesday Apr 15, 2014

    OpenWorld call for Papers closes today!

     Just a gentle reminder - if you have not submitted a paper for this year's OpenWorld conference then there is still just enough time because the deadline is today (Tuesday, April 15) at 11:59pm PDT. The call for papers website is here: http://www.oracle.com/openworld/call-for-papers/index.html and it provides all the details of how and what to submit.

    I have been working with a number of customers on some really exciting papers so I know this year's conference is going to be really interesting for data warehousing and analytics. I would encourage everyone to submit a paper, especially if you have never done this before. Right now both data warehousing and analytics are among the hottest topics in IT and I am sure all of you have some great stories that you could share with your industry peers who will be attending the conference. It is a great opportunity to present to your peers and also learn from them by attending their data warehouse/analytics sessions during this week-long conference. And of course you get a week of glorious Californian sunshine and the chance to spend time in one of the world's most beautiful waterfront cities.

    If you would like any help submitting a proposal then feel free to email during today and I will do my best to provide answers and/or guidance. My email address is keith.laker@oracle.com.

    Have a great day and get those papers entered into our OpenWorld system right now! 

    Monday Mar 24, 2014

    Built-in sorting optimizations to support analytical SQL

    One of the proof points that I often make for using analytical SQL over more sophisticated SQL-based methods is that we have included specific optimizations within the database engine to support our analytical functions. In this blog post I am going to briefly talk about how the database optimizes the number of sorts that occur when using analytical SQL.

    Sort Optimization 1: Ordering Groups

    Many analytical functions include a PARTITION BY and/or an ORDER BY clause, both of which by definition imply that an ordering process is going to be required. As each function can have its own PARTITION BY-ORDER BY clause, this can create situations where a lot of different sorts are needed. For example, if we have a SQL statement that includes the following:

    Rank() Over (Partition by (x) Order by (w))
    Sum(a) Over (Partition by (w,x) Order by (z))
    Ntile() Over (Partition by (x) Order by (y))
    Sum(b) Over (Partition by (x,y) Order by (z))

    this could involve four different sort processes to take into account the use of both PARTITION BY and ORDER BY clauses across the four functions. Performing four separate sort processes on a data set could add a tremendous overhead (depending on the size of the data set). Therefore, we have taken two specific steps to optimize the sorting process.

    The first step is to create the notion of "Ordering Groups". This optimization looks for ways to group together sets of analytic functions which can be evaluated with a single sort. The objective is to construct a minimal set of ordering groups which in turn minimizes the number of sorts. In the example above we would create two ordering groups as follows:

    [Figure: the four analytic functions grouped into two ordering groups]

    This allows us to reduce the original list of sorts down from 4 to just 2.
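
    To make the example concrete, here is roughly what such a statement looks like written out in full (the table T and its columns a, b, w, x, y and z are purely illustrative, and NTILE has been given an arbitrary bucket count so that the statement is valid):

    SELECT RANK()   OVER (PARTITION BY x    ORDER BY w) AS rnk,
           SUM(a)   OVER (PARTITION BY w, x ORDER BY z) AS sum_a,
           NTILE(4) OVER (PARTITION BY x    ORDER BY y) AS nt,
           SUM(b)   OVER (PARTITION BY x, y ORDER BY z) AS sum_b
    FROM   t;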

    Sort Optimization 2: Eliminating Sorts

    We can further reduce the number of sorts that need to be performed by carefully scheduling the execution so that:

    • Ordering groups with sorts corresponding to that in the GROUP BY execute first (immediately after the GROUP BY) 
    • Ordering groups with sorts corresponding to that in the ORDER BY execute last (immediately before the ORDER BY)

    In addition, we can also eliminate sorts when an index or join method (sort-merge) makes sorting unnecessary. 
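
    As a rough illustration (whether a sort is actually eliminated always depends on the optimizer, the available indexes and the data), consider a query where the analytic ordering lines up with both the GROUP BY and the final ORDER BY:

    -- Illustrative schema: sales(prod_id, amount).
    -- The window sort is keyed on the same aggregate that drives the final
    -- ORDER BY, so the optimizer has the opportunity to satisfy both with a
    -- single sort on top of the GROUP BY aggregation.
    SELECT prod_id,
           SUM(amount)                             AS total_amount,
           RANK() OVER (ORDER BY SUM(amount) DESC) AS sales_rank
    FROM   sales
    GROUP  BY prod_id
    ORDER  BY sales_rank;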

    Optimization 3: RANK Predicates

    Where a SQL statement includes RANK() functions there are additional optimizations that kick in. Instead of sorting all the data, computing the RANK and then applying the predicate, the RANK predicate is evaluated as part of the sort process. The net result is that fewer records are actually sorted, resulting in more efficient execution.
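
    The classic example is a Top-N query. In the sketch below (the table and columns are illustrative) the rnk <= 10 filter can be evaluated while the data is being sorted, so only the leading rows of each partition need to be retained:

    SELECT *
    FROM  (SELECT region_id,
                  cust_id,
                  revenue,
                  RANK() OVER (PARTITION BY region_id ORDER BY revenue DESC) AS rnk
           FROM   customer_revenue)
    WHERE  rnk <= 10;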

    Summary 

    Overall, these three optimizations ensure that as few sorts as possible are performed when you include SQL analytical functions as part of your SQL statements. 

    Thursday Feb 06, 2014

    Sessionization analysis with 12c SQL pattern matching is super fast

    Over the past six months I have posted a number of articles about SQL pattern matching, as shown here:

    Most of these have been related to explaining the basic concepts along with some specific use cases.

    In this post I want to review some of the internal performance tests that we have run during the development of this feature. In part 3 of the series of podcasts I covered a number of use cases for SQL pattern matching such as: stock market analysis, tracking governance-compliance, call service quality and sessionization. The most popular scenario is likely to be the analysis of sessionization data, as this is currently a hot topic when we start considering machine data and, in more general terms, big data.

    To help us create a meaningful test data set we decided to use the TPC-H schema because it contains approximately seven years of data, which equated to approximately 1TB of data. One of the objectives of our performance tests was to compare and contrast the performance and scalability of code using the 12c MATCH_RECOGNIZE clause with code using 11g window functions.

    Analysis of Sessionization Data

    To make things easy to understand I have divided our sessionization workflow into a number of steps.
    Part 1 - For the purposes of this specific use case we defined a session as a sequence of one or more events with the same partition key where the gap between the timestamps is less than 10 seconds - obviously the figure for the gap is completely arbitrary and could be set to any number as required. The 1TB of source data looked like this:

    Time_Stamp User_ID
    1 Mary
    2 Sam
    3 Richard
    11 Mary
    12 Sam
    13 Richard
    22 Sam
    23 Mary
    23 Richard
    32 Sam
    33 Richard
    34 Mary
    43 Richard
    43 Sam
    44 Mary
    47 Sam
    48 Sam
    53 Mary
    54 Richard
    59 Sam
    60 Sam
    63 Mary
    63 Richard
    68 Sam


    The following sections compare the use of 11g window functions vs. the 12c MATCH_RECOGNIZE clause.
    Part 2 - To create the first part of the sessionization workflow we took the original source data and used the USER_ID as the PARTITION BY key and the timestamp for the ORDER BY clause. The objective of this first step is to detect the various sessions and assign a surrogate session id to each session within each partition (USER_ID).
    This creates an output result set that delivers a simplified sessionization data set.


    The 12c SQL to create the initial result set is as follows:

    SELECT user_id, session_id, start_time, no_of_events, duration
    FROM Events MATCH_RECOGNIZE
     (PARTITION BY User_ID ORDER BY Time_Stamp
      MEASURES match_number() session_id,
               count(*) as no_of_events,
               first(time_stamp) start_time,
               last(time_stamp) - first(time_stamp) duration
      ONE ROW PER MATCH 
      PATTERN (b s*)
      DEFINE
           s as (s.Time_Stamp - prev(s.Time_Stamp) <= 10)
     )
    ;

    As a comparison, here is how to achieve the above using 11g analytical window functions:

    CREATE VIEW Sessionized_Events as
    SELECT Time_Stamp, User_ID,
     Sum(Session_Increment) over (partition by User_ID order by Time_Stamp asc) Session_ID
    FROM ( SELECT Time_Stamp, User_ID,
     CASE WHEN (Time_Stamp - Lag(Time_Stamp) over (partition by User_ID order by Time_Stamp asc)) < 10
     THEN 0 ELSE 1 END Session_Increment
     FROM Events);
    
    SELECT User_ID, Min(Time_Stamp) Start_Time,
     Count(*) No_Of_Events, Max(Time_Stamp) -Min(Time_Stamp) Duration
    FROM Sessionized_Events
    GROUP BY User_ID, Session_ID;


    As you can see the 11g approach using window functions ( SUM() OVER(PARTITION BY…) ) is a little more complex to understand but it produces the same output - i.e. our initial sessionized data set.


    Part 3 - However, to get business value from this derived data set we need to do some additional processing. Typically, with this kind of analysis the business value within the data emerges only after aggregation, which in this case needs to be by session. We need to reduce the data set to a single tuple, or row, per session along with some derived attributes, such as:

    • Within-partition Session_ID
    • Number of events in a session
    • Total duration 

    To do this with Database 12c we can use the MATCH_RECOGNIZE clause to determine how many events are captured within each session. There are actually two ways to do this: 1) we can compare the current record to the previous record, i.e. peek backwards or 2)  we can compare the current record to the next record, i.e. peek forwards.
    Here is code based on using the PREV() function to compare the current record against the previous record:

    select count(*)
    from ( select /* mr_sessionize_prev */ *
     from
     ( select o_pbykey,
    session_id,
     start_time,
     no_of_events,
     duration
     from orders_v MATCH_RECOGNIZE
     (
      PARTITION BY o_pbykey
      ORDER BY O_custkey, O_Orderdate
      MEASURES match_number() session_id, count(*) as no_of_events,
               first(o_orderdate) start_time,
               last(o_orderdate) - first(o_orderdate) duration
      PATTERN (b s*)
      DEFINE s as (s.O_Orderdate - prev(O_Orderdate) <= 100)
     )
     )
     where No_Of_Events >= 20
    ); 

    Here is code based on using the NEXT() function to compare the current record against the next record:

    select count(*)
    from ( select /* mr_sessionize_next */ *
     from
     ( select o_pbykey, session_id, start_time, no_of_events, duration
     from orders_v MATCH_RECOGNIZE
     (
    PARTITION BY o_pbykey
    ORDER BY O_custkey, O_Orderdate
    MEASURES match_number() session_id, count(*) as no_of_events,
             first(o_orderdate) start_time,
             last(o_orderdate) - first(o_orderdate) duration
    PATTERN (s* e)
    DEFINE s as (next(s.O_Orderdate) - s.O_Orderdate <= 100)
     )
     )
     where No_Of_Events >= 20
    ); 

    Finally we can compare the 12c MATCH_RECOGNIZE code to the 11g code that uses window functions (which in my opinion is a lot more complex):

    select count(*)
    from (
     select /* wf */ *
     from (select O_pbykey,
                  Session_ID,
                  min(O_Orderdate) Start_Time,
                  count(*) No_Of_Events,
                 (max(O_Orderdate) - Min(O_Orderdate)) Duration
            from (select O_Orderdate,
    O_Custkey, o_pbykey,
                         sum(Session_Increment)
                         over(partition by o_pbykey order by O_custkey, O_Orderdate) Session_ID
                   from ( select O_Custkey,
                                 O_Orderdate,
                                 O_pbykey,
                                  case when (O_Orderdate -
                                 Lag(O_Orderdate)
                                     over(partition by o_pbykey
                                          order by O_custkey, O_Orderdate)) <= 100 -- Threshold
                                 then 0 else 1 end Session_Increment
                          from orders_v
                        )
                )
           group by o_pbykey, Session_ID
        )
      where No_Of_Events >= 20
     );


    The final output generated by both sets of code (11g using window functions and 12c using the MATCH_RECOGNIZE clause) is the same sessionized, aggregated result set.

    Part 4 - The performance results for these three approaches (11g window functions vs. MATCH_RECOGNIZE using PREV() vs. MATCH_RECOGNIZE using NEXT()) are shown below. Please note that on the graph the X-axis shows the number of partitions within each test run and the Y-axis shows the time taken to run each test. There are three key points to note from this graph:
    The first is that, in general, the 12c MATCH_RECOGNIZE code is between 1.5x and 1.9x faster than the window function code, which is good news if you are looking for a reason to upgrade to Database 12c.

    MR Performance Results

    Secondly, it is clear from the X-axis that as the number of partitions increases the MATCH_RECOGNIZE clause offers excellent scalability and continues to deliver excellent performance. So it performs well and scales well as your data volumes increase.

    However, it is important to remember that the 11g window function code shows similarly excellent scalability and performance. If you are using 11g at the moment and you have not considered using Oracle Database to run your sessionization analysis then it is definitely worth pulling that transformation code back inside the database and using window functions to run those sessionization transformations. If you need a reason to upgrade to Database 12c then MATCH_RECOGNIZE does offer significant performance benefits if you are currently doing pattern matching operations either inside Oracle Database 11g or using an external processing engine.

    Lastly, when you are designing your own MATCH_RECOGNIZE implementations and you are using the NEXT() or PREV() functions, it is worth investigating whether using the alternate function offers any significant performance benefits. Obviously, much will depend on the nature of the comparison you are trying to formulate, but it is an interesting area and we would welcome feedback on this specific point based on your own data sets.

    In general, if you are using the MATCH_RECOGNIZE clause then I would love to hear about your use case and experiences in developing your code. You can contact me directly either via this blog or via email (keith.laker@oracle.com).
    Have fun with MATCH_RECOGNIZE….

    About

    The Data Warehouse Insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.
