Friday Nov 20, 2015

Review of Data Warehousing and Big Data at #oow15


This year OpenWorld was bigger, more exciting and packed with sessions about the very latest technology and product features. Most importantly, both data warehousing and Big Data were at the heart of this year’s conference across a number of keynotes and a huge number of general sessions. Our hands-on labs were all completely full as people got valuable hands-on time with our most important new features. The key focus areas at this year’s conference were:
  • Database 12c for Data Warehousing
  • Big Data and the Internet of Things 
  • Cloud, from on-premise deployments to the Cloud to running hybrid Cloud systems
  • Analytics and SQL, which continue to evolve to enable more and more sophisticated analysis. 
All these topics appeared across the main keynote sessions, including live on-stage demonstrations of how many of our new features can be used to increase the performance and analytical capability of your data warehouse and big data management system - check out the on-demand videos for the keynotes and executive interviews....
[Read More]

Tuesday Apr 14, 2015

Statement of Direction -- Big Data Management System

Click here to start reading the Full Statement of Direction. 

Introduction: Oracle Big Data Management System Today 

As today's enterprises embrace big data, their information architectures must evolve. Every enterprise has data warehouses today, but the best-practices information architecture embraces emerging technologies such as Hadoop and NoSQL. Today’s information architecture recognizes that data is stored not only in increasingly disparate data platforms, but also in increasingly disparate locations: on-premises and potentially in multiple cloud platforms. The ideal of a single monolithic ‘enterprise data warehouse’ has faded as a new, more flexible architecture has emerged. Oracle calls this new architecture the Oracle Big Data Management System, and today it consists of three key components:

  • The data warehouse, running on Oracle Database and Oracle Exadata Database Machine, is the primary analytic database for storing much of a company’s core transactional data: financial records, customer data, point-of-sale data and so forth. Despite now being part of a broader architecture, the demands on the RDBMS for performance, scalability, concurrency and workload management are greater than ever; Oracle Database 12c introduced Oracle Database In-Memory (with columnar tables, SIMD processing, and advanced compression schemes) as the latest in a long succession of warehouse-focused innovations. The market-leading Oracle Database is the ideal starting point for customers to extend their architecture to the Big Data Management System.
  • The ‘data reservoir’, hosted on Oracle Big Data Appliance, will augment the data warehouse as a repository for the new sources of large volumes of data: machine-generated log files, social-media data, and videos and images -- as well as a repository for more granular transactional data or older transactional data which is not stored in the data warehouse. Oracle’s Big Data Management System embraces complementary technologies and platforms, including open-source technologies: Oracle Big Data Appliance includes Cloudera’s Distribution of Hadoop and Oracle NoSQL Database for data management.
  • A ‘franchised query engine,’ Oracle Big Data SQL, enables scalable, integrated access in situ to the entire Big Data Management System. SQL is the accepted language for day-to-day data access and analytic queries, and thus SQL is the primary language of the Big Data Management System.  Big Data SQL enables users to combine data from Oracle Database, Hadoop and NoSQL sources within a single SQL statement.  Leveraging the architecture of Exadata Storage Software and the SQL engine of the Oracle Database, Big Data SQL delivers high-performance access to all data in the Big Data Management System.

Using this architecture, the Oracle Big Data Management System combines the performance of Oracle’s market-leading relational database, the power of Oracle’s SQL engine, and the cost-effective, flexible storage of Hadoop and NoSQL. The result is an integrated architecture for managing Big Data, providing all of the benefits of Oracle Database, Exadata, and Hadoop, without the drawbacks of independently-accessed data repositories.  
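
To make the role of the franchised query engine concrete, here is a minimal sketch of a Big Data SQL query that spans the data warehouse and the data reservoir. The table and column names (web_logs_ext, customers, cust_segment) are invented for illustration; only the ORACLE_HIVE access driver and the general external table syntax are taken from the feature itself.

-- Hedged sketch: expose a Hive table from the data reservoir as an Oracle external
-- table, then join it to relational data in a single SQL statement.
CREATE TABLE web_logs_ext (
  cust_id    NUMBER,
  page_url   VARCHAR2(500),
  click_time TIMESTAMP
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE              -- Big Data SQL access driver for Hive-managed data
  DEFAULT DIRECTORY default_dir
)
REJECT LIMIT UNLIMITED;

-- One SQL statement across Oracle Database and Hadoop:
SELECT c.cust_segment, COUNT(*) AS page_views
FROM   customers c             -- relational table in the data warehouse
JOIN   web_logs_ext w ON w.cust_id = c.cust_id
GROUP  BY c.cust_segment;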

Note that the scope of this statement of direction is the data platform for Big Data. An enterprise Big Data solution would also comprise big data tools and big data applications built upon this data platform. 

Read the full Statement of Direction -- Big Data Management System here.

Wednesday Mar 18, 2015

Production workloads blend Cloud and On-Premise Capabilities

Prediction #7 - blending production workloads across cloud and on-premise in Oracle's Enterprise Big Data Predictions 2015 - is a tough nut to crack. Yet we at Oracle think this is really the direction we will all go. Sure, we can debate the timing, and whether or not this happens in 2015, but it is something that will come to all of us who are looking towards that big data future. So let's discuss what we think is really going to happen over the coming years in the big data and cloud world.

Reality #1 – Data will live both in the cloud and on-premise

We see this today. Organizations run Human Capital Management systems in the cloud and integrate these with data from outside cloud-based systems (think, for example, LinkedIn, staffing agencies etc.), while their general data warehouses and new big data systems are all deployed as on-premise systems. We also see the example in the prediction where various auto dealer systems uplink into the cloud to enable the manufacturer to consolidate all of their disparate systems. This data may be translated into data warehouse entries and possibly live in two worlds – both in the cloud and on-premise – either for deep-dive analytics or in aggregated form.

Reality #2 – Hybrid deployments are difficult to query and optimize

We also see this today, and it is one of the major issues of living in the hybrid world of cloud and on-premise. A lot of the issues are driven by low-level technical limitations, specifically network bandwidth and upload/download capacity into and out of the cloud environment. The other challenges are really (big) data management challenges, in that they go to the art of running queries across two ecosystems with very different characteristics. We see a trend to use engineered systems on-premise, which deliver optimized performance for the applications, but in the cloud we often see virtualization pushing the trade-off towards ease of deployment and ease of management. These completely different ecosystems make optimization of queries across them very difficult.

Solution – Equality brings optimizations to mixed environments

As larger systems like big data and data warehouse systems move to the cloud, better performance becomes a key success criterion. Oracle is uniquely positioned to drive both standardization and performance optimizations into the cloud by deploying engineered systems like Oracle Exadata and Oracle Big Data Appliance. Deploying engineered systems enables customers to run large systems in the cloud with the same performance they see today in on-premise deployments. This then means that we do not live in a world divided into slow and fast, but in a world of fast and fast.
This equivalence also means that we have the same functionality in both worlds, and here we can sprinkle in some – future – Oracle magic, where we start optimizing queries to take into account where the data lives, how fast we can move it around (the dreaded networking bandwidth issue) and where we need to execute code. Now, how are we going to do this? That is a piece of magic, and you will just need to wait a bit… suffice it to say we are hard at work solving this challenging problem.

Monday Mar 09, 2015

Why SQL Part 2 - It has a powerful framework

In the first part of this series of blog posts (Why SQL is the natural language for data analysis) I explained why analytics is important to the business, how the volume and variety of data continue to expand and why that makes it vital that you select the right language for data analysis. Many of us work with SQL every day and take for granted many of its unique features, its power, flexibility and the richness of its analytics.

This familiarity with SQL means that sometimes we are a bit slow to stop some of our projects from investing in other big data languages such as MapReduce, Impala, Spark, Java and many of the new generation of SQL-like open source projects. While many of these new languages are considered “cool”, it is very easy to end up building new proprietary data silos, investing in a language that is eventually replaced by another open source project, or investing time and effort in code that eventually fails to deliver the required analytics.

One of the aims of this series of blog posts is to refresh your memory about why SQL has been so successful in the area of analytics. SQL has four unique features that make it perfect for data analysis...

    [Read More]

    Tuesday Mar 03, 2015

    Why SQL is the natural language for data analysis

    Analytics is a must-have component of every corporate data warehousing and big data project. It is the core driver for the business: the development of new products, better targeting of customers with promotions, hiring of new talent and retention of existing key talent. Yet the analysis of “big data environments” in particular - data stored and processed outside of classical relational systems - continues to be a significant challenge for the majority of companies. According to Gartner, 72% of companies are planning to increase their expenditure on big data yet 55% state they don’t have the necessary skills to make use of it. 

    The objective of this series of articles, which will appear over the coming weeks, is to explain why SQL is the natural language for any kind of data analysis, including big data, and the benefits that this brings for application developers, DBAs and business users. 

    [Read More]

    Friday Jan 16, 2015

    Deploying SAS High Performance Analytics on Big Data Appliance

    Oracle and SAS have an ongoing commitment to our joint customers to deliver value-added technology integrations through engineered systems such as Exadata, Big Data Appliance, SuperCluster,  Exalogic and ZFS Storage Appliance.  Dedicated resources manage and execute on joint SAS/Oracle Database, Fusion Middleware, and Oracle Solaris integration projects; providing customer support, including sizing and IT infrastructure optimization and consolidation.  Oracle support teams are onsite at SAS Headquarters in Cary, NC (USA); and in the field on a global basis.

    The latest in this effort is to enable our joint customers to deploy SAS High Performance Analytics on Big Data Appliance. This effort enables SAS users to leverage the lower cost infrastructure Hadoop offers in a production-ready deployment on Oracle Big Data Appliance. Hear from Paul Kent (VP Big Data, SAS) on some of the details.

    Read more on deploying SAS High Performance Analytics on Oracle Big Data Appliance. Don't miss the deployment guide and best practices here.

    Friday Oct 10, 2014

    Review of Data Warehousing and Big Data at #OOW14

    Data Warehousing and Big Data were at the heart of this year’s OpenWorld conference, appearing across a number of keynotes and a huge number of general sessions. Our hands-on labs were all completely full as people got valuable hands-on time with our most important new features. The key areas at this year’s conference were:

    • Big Data SQL - One Fast SQL Query for All Your Data
    • Database In-Memory - Powering the Real-Time Enterprise
    • Multitenant - Plug your data warehousing into the Cloud

    All these topics appeared in the main keynote sessions including live on-stage demonstrations of how each feature can be used to increase the performance and analytical capability of your data warehouse.

    If you want to revisit the most important sessions, or if you simply missed this year’s conference and want to catch up on all the most important topics, then I have put together a booklet of the highlights from this year’s conference. The booklet is divided into the following sections:

    • Key Messages
    • Overview of Oracle Database 12c
    • Session Catalogue
    • Your Oracle Presenters
    • Links
    • OpenWorld 2015


    You can download my review in PDF format by clicking here. Hope this proves useful and if I missed anything then let me know. 

    Thursday Oct 09, 2014

    One of the ways Oracle is using Big Data

    Today, Oracle is using big data technology and concepts to significantly improve the effectiveness of its support operations, starting with its hardware support group. While the company is just beginning this journey, the initiative is already delivering valuable benefits.

    In 2013, Oracle’s hardware support group began to look at how it could use automation to improve support quality and accelerate service request (SR) resolution. Its goal is to use predictive analytics to automate SR resolution with 80% to 95% accuracy.

    Oracle’s support group gathers a tremendous amount of data. Each month, for example, it logs 35,000 new SRs and receives nearly 6 TB of telemetry data via automated service requests (ASRs)—which represent approximately 18% of all SRs. Like many organizations, Oracle had a siloed view of this data, which hindered analysis. For example, it could look at SRs but could not analyze the associated text, and it could review SRs and ASRs separately, but not together.

    Oracle was conducting manual root-cause analysis to identify which types of SRs were the best candidates for automation. This was a time-consuming, difficult, and costly process, and the company looked to introduce big data and predictive analytics to automate insight.

    The team knew that it had to walk before it could run. It started by taking information from approximately 10 silos, such as feeds from SRs and ASRs, parts of databases, and customer experience systems, and migrating the information to an Oracle Endeca Information Discovery environment. Using the powerful Oracle Endeca solution, Oracle could look at SRs, ASRs, and associated notes in a single environment, which immediately yielded several additional opportunities for automation. On the first day of going live with the solution, Oracle identified 4% more automation opportunities.

    Next, Oracle focused its efforts on gaining insight in near real time, leveraging the parallel processing of Hadoop to automatically feed Oracle Endeca Information Discovery—dramatically improving data velocity. Oracle’s first initiative with this new environment looked at Oracle Solaris SRs. In the first few weeks of that project, Oracle identified automation opportunities that will increase automated SR resolution from less than 1% to approximately 5%—simply by aggregating all of the data in near real-time. 

    Once Oracle proved via these early proofs of concept that it could process data more efficiently and effectively to feed analytical projects, it began to deploy Oracle Big Data Appliance and Oracle Exalytics In-Memory Machine.

    Read the entire profile here.

    Monday Sep 29, 2014

    The End of the Data Scientist Bubble...

    Looking around northern California and inside many technology kitchens makes me believe that we are about to see the Data Scientist bubble burst. And then I read the Fortune Magazine article on Peter Thiel - and the excerpt from Zero to One (his new book) in that article - and it dawned on me that this is one of the interesting ways to look at the Data Scientist bubble.

    Thiel's Classification of Innovation

    Without trying to simplify and/or bastardize Mr. Thiel's theory, the example in the Fortune Mag article will make this visible to most people (I hope). In the article the analogy is: going from one typewriter to 100 typewriters is 1 to N; inventing a word processor moves us from 0 to 1. In other words, true innovation dramatically changes things by giving previously unknown power to the masses. It is that innovation that moves us from 0 to 1. Expansion of existing ideas - not true innovation - moves us from 1 to N. Of course, don't take my word on this but read the article or the book...

    The Demise of the Human Data Scientist

    The above paradigm explains the Data Scientist bubble quite nicely. Once upon a time companies hired a few PhD students who by chance had a degree in statistics, had learned how to program and had figured out how to deal with (large) data sets. These newly minted data scientists proved that there is potential value in mashing data together and running analytics on these newly created data sets, and thus caused a storm of publicity. Companies large and small are now frantically trying to hire these elusive data scientists or, something a little more down to earth, are creating data scientists (luckily not in the lab) by forming teams in which each member brings a part of the skill set to the table.

    This approach all starts to smell pretty much like a whole busload of typewriters being thrown at a well-known data analysis and data wrangling problem. Neither the problem nor the solution is new, or innovative. Data Scientists are therefore not moving us from 0 to 1...

    One could argue that while the data scientist quest is not innovative, at least it solves the problem of doing analytics. Fair, and by some measure correct, but there is one bigger issue with the paradigm of "data scientists will solve our analytics problem" and that is scale. Giving the keys to all that big data to only a few data scientists is not going to work, because these smart and amazing people are now becoming, often unbeknownst to them, an organizational bottleneck to gaining knowledge from big data.

    The only real solution, our 0 to 1, is to expose a large number of consumers to all that big data, while enabling these consumers to apply a lot of the cool data science to all that data. In other words, we need to provide tools which include data science smarts. Those tools will enable us to apply the 80% common data science rules to the 80% of common business problems. This approach drives real business value at scale. With large chunks of issues resolved, we can then focus our few star data scientists on the 20% of problems or innovations that drive competitive advantage and change markets.

    My Conclusion

    The bubble is bursting because what I am seeing is more and more tools coming to market (soon) that will drive data science into the day-to-day job of all business people. Innovation is not the building of a better tool for data scientists or hiring more of them; instead, the real 0 to 1 innovation is tools that make all of us data scientists and let us solve our own data science problems. The future of Data Science is smarter tools, not smarter humans.

    Wednesday Sep 03, 2014

    OpenWorld on your iPad and iPhone - UPDATED!

    In my last blog post I provided links to our OpenWorld data warehouse web app for smartphones and tablets. Now that the OOW team has released the session schedule (it is now live on the OpenWorld site) I have updated my iPhone and iPad apps to include the list of sessions on a day-by-day basis (Sunday, Monday, Tuesday, Wednesday, Thursday). The list of sessions can still be viewed in subject area order (data warehousing, performance and scalability, analytics, unstructured data, industry models and big data) within the app via the “Switch to subject view” link in the top left part of the screen.


    I have also added a location map which can be viewed by clicking on the linked text “View location map” in the top right part of the screen of each application. The location map that is available within both the iPad and iPhone apps is shown below:

    [Screenshot: the location map displayed in the iPad and iPhone apps]

    If you want to run these updated web apps on your smartphone and/or tablet then you can reuse the existing links that I published on my last blog post. If you missed that post then  follow these links:

    Android users: I have tested the app on Android and there appears to be a bug in the way the Chrome browser displays frames since scrolling within frames does not work. The app does work correctly if you use either the Android version of the Opera browser or the standard Samsung browser on Samsung devices.

    If you have any comments about the app (content you would like to see) then please let me know. Enjoy OpenWorld and, if you have time, it would be great to see you if you want to stop by at the Parallel Execution and Analytical SQL demo booth.

    Tuesday Jul 22, 2014

    StubHub Taps into Big Data for Insight into Millions of Customers’ Ticket-Buying Patterns, Fraud Detection, and Optimized Ticket Prices

    The benefits of Big Data at StubHub:

    • StubHub enabled data scientists to work directly with customer-related data—such as ticket-purchasing history—inside the database, and to use database options to explore the data graphically, build and evaluate multiple data-mining models, and deploy predictions and insights throughout the enterprise—drastically improving StubHub’s agility and responsiveness
    • Developed highly targeted ticket promotional campaigns and offers by having the ability to calculate 180 million customers’ lifetime value (or propensity) instead of just 20,000 values at a time
    • Used Oracle R Enterprise component of Oracle Advanced Analytics—an Oracle Database option—to reduce a fraud issue by up to 90%

    Read more or watch the video:

    Monday Jun 30, 2014

    My Data Warehousing and Big Data Must-See Guide for OpenWorld

    Data Warehousing Must-See Session Guide for Open World 2014

    There’s so much to learn at this year’s Oracle OpenWorld - it provides more educational and networking opportunities than any other conference dedicated to Oracle business and technology users. To get the most from this year’s event I have prepared an initial guide which lists all the most important data warehousing and big data sessions. The first part of the guide provides links to videos and content from last year’s event so you can either re-live the highlights from last year or see what you missed by not being there! If you have an iPad or use OSX 10.9 then you will want to download the iBook version because this contains video highlights from last year’s database keynote session.

    The session guide is divided into the following chapters:

    • Data Warehousing
    • Performance and Scalability
    • Database Analytics
    • Industry Data Models and Data Integration
    • Unstructured Data
    • Big Data

    The last part of the booklet lists all the key data warehouse and big data product managers who will be presenting at this year’s conference. Included alongside each speaker’s biography are links to their social media sites and blogs.

    Please note that, as usual, there will be hands-on labs at this year’s OpenWorld but these have not yet been included in the session catalog. I am expecting them to be added shortly. Details of all our labs will appear in the more detailed guide that I will publish before the conference begins.

    The Must-See guide is available in two formats: 

    For iPad and OSX 10.9 (Mavericks) users please download the iBook, which is available here:

    For all other platforms please download the PDF, which is available here:

     Hope to see you in September at OpenWorld. 

    Thursday Apr 17, 2014

    Analyzing our Big Data Lite movie app clickstream data

    Since last year's OpenWorld I have posted quite a few articles on our new 12c pattern matching feature: 

    To date most of my online demos and tutorials have used a very simple data set consisting of a single table with 60 rows of fictitious trade data. Over the last few months I have been searching around for a slightly bigger and more challenging data set to use. Fortunately, our Big Data PMs (led by Marty Gubar) have put together a very rich big data schema as part of their recently launched Big Data Virtual Machine that you can download from our OTN web page, see here:

    The data set is based around an online customer movie application. Here is a little background…


    Oracle MoviePlex Demo Application

    Oracle MoviePlex is a fictitious on-line movie streaming company. Customers log into Oracle MoviePlex where they are presented with a targeted list of movies based on their past viewing behavior. Because of this personalized experience and reliable and fast performance, customers spend a lot of money with the company and it has become extremely profitable.

    As users watch movies, search for their favorite actors and rate movies, the system records all of the clicks in a log file. The contents of that log file look like this:

    [Screenshot: sample records from the Avro-format log file]

    In its raw state this information is not very helpful. It needs a lot of processing to convert the various data points into usable information. In many cases companies have been turning to Hadoop and its related Java-based programming framework MapReduce to process and convert these types of files into usable information. Most business users will want to summarise this information by customer and/or movie and then merge it with other data sets. So how can we make access to and analysis of this type of data much easier? As part of this post I am going to compare the Hadoop-Java-MapReduce approach with an alternative approach using 12c SQL. My objective is not to determine which solution is best because each approach has its own camp of supporters. Once we have reviewed the two solutions I will put everything into context and make some recommendations… let's bring on the code!

    Sessionization using Java

    Accessing the Avro file

    At the moment, in the context of Big Data, everything seems to revolve around Hadoop, MapReduce and Java. It is quite natural for a big data developer to extend their basic MapReduce processing to include more complicated requirements. In the case of our movie demo there is a lot of processing that needs to be done using the native features of Hadoop to collect and process the weblog file being generated by our application. There is an excellent video prepared by Marty Gubar (Director of Product Management for Big Data) which explains this process. This is part four of a six-part series that explains the movieplex demo: Part 4. Turn Clicks into Value - Flume & Hive. The video lasts about 5 minutes and you can watch it here:

    The steps shown in the movie explain how to access the avro file and then how to clean the data to provide some interesting additional metrics.

    Calculating the sessionization data

     Creating the sessionization analysis is a little more complicated. In fact, it requires 370 lines of Java code.  Here is the Java code we created for doing the sessionization analysis (the code window is scrollable so you can review the very long code sample):

    The actual logic for the sessionization analysis is about 100 lines of code, as shown here (as before, the code window is scrollable so you can review the very long code sample):

    As you can see from the code listing, this requires a strong knowledge of Java, and with 370 lines of code, if we decide to change the details of the pattern that we are searching for it is going to be a lengthy process to make the required changes. Making any changes to the code to reflect changing business requirements is definitely going to be beyond the skills of your average business analyst and/or data scientist, and this might negatively impact the level of project agility.

    Making life easier and richer with SQL

    My personal view is that sessionization analysis is quite a sophisticated requirement and is best achieved using SQL, because the requirements and parameters for the analysis will naturally evolve over time as new questions are triggered by existing answers. Therefore, you need an agile approach to analytics that can deal with constantly changing requirements during the data discovery phase.

    Accessing the Avro file

    The first step is to create a DIRECTORY object to access the raw data file:

    CREATE DIRECTORY session_file_dir AS '/home/oracle/applog';
    GRANT READ, WRITE ON DIRECTORY session_file_dir to pmuser;

    Next I created an external table over the data file. Notice that in the avro file, each key column  has an identifier and a value.


    The objective for this external table was to keep the processing as simple as possible:

      CUST_ID VARCHAR2(20) 
    , MOVIE_ID VARCHAR2(20) 
    , GENRE_ID VARCHAR2(20) 
     LOCATION (SESSION_FILE_DIR: 'movieapp_30months.log') 
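
    For completeness, here is a hedged sketch of what the full external table definition might look like. The column list, the field delimiter and the table name below are assumptions based on the fragment above rather than the original definition:

    -- Illustrative only: the column list and access parameters are assumptions.
    CREATE TABLE movieapp_log_ext (
      CUST_ID     VARCHAR2(20)
    , MOVIE_ID    VARCHAR2(20)
    , GENRE_ID    VARCHAR2(20)
    , TIME_ID     VARCHAR2(40)
    , ACTIVITY_ID VARCHAR2(20)
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_LOADER
      DEFAULT DIRECTORY session_file_dir
      ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        FIELDS TERMINATED BY ','
        MISSING FIELD VALUES ARE NULL
      )
      LOCATION (SESSION_FILE_DIR: 'movieapp_30months.log')
    )
    REJECT LIMIT UNLIMITED;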

    Cleaning the data

    Now the next stage is to clean the data and remove the column identifiers such as custId, movieId, genreId etc. To do this processing I simply used the SQL SUBSTR() function. At the same time I decided to break out the activity ID column to create unique columns for each type of activity. This approach is useful if you are going to be using the data as part of a data mining project because in many cases it is useful to uniquely identify specific attributes. In many cases you will want to create a fact table from this process but in this case I have created a view while I do some basic prototyping:

        WHEN '1}' THEN '1'
        WHEN '2}' THEN '2'
        WHEN '3}' THEN '3'
        WHEN '4}' THEN '4'
        WHEN '5}' THEN '5'
        WHEN '6}' THEN '6'
        WHEN '7}' THEN '7'
        WHEN '8}' THEN '8'
        WHEN '9}' THEN '9'
        WHEN '10' THEN '10'
        WHEN '11' THEN '11'
        ELSE null
    , case SUBSTR(ACTIVITY_ID,12,1) WHEN '1' THEN 'Y' END as act_rate
    , case SUBSTR(ACTIVITY_ID,12,1) WHEN '2' THEN 'Y' END as act_complete 
    , case SUBSTR(ACTIVITY_ID,12,1) WHEN '3' THEN 'Y' END as act_pause
    , case SUBSTR(ACTIVITY_ID,12,1) WHEN '4' THEN 'Y' END as act_start 
    , case SUBSTR(ACTIVITY_ID,12,1) WHEN '5' THEN 'Y' END as act_browse 
    , case SUBSTR(ACTIVITY_ID,12,1) WHEN '6' THEN 'Y' END as act_list
    , case SUBSTR(ACTIVITY_ID,12,1) WHEN '7' THEN 'Y' END as act_search
    , case SUBSTR(ACTIVITY_ID,12,1) WHEN '8' THEN 'Y' END as act_login
    , case SUBSTR(ACTIVITY_ID,12,1) WHEN '9' THEN 'Y' END as act_logout
    , case SUBSTR(ACTIVITY_ID,12,1) WHEN '10' THEN 'Y' END as act_incomplete
    , case SUBSTR(ACTIVITY_ID,12,1) WHEN '11' THEN 'Y' END as act_purchase
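
    For context, here is a hedged sketch of how these expressions might be wrapped up into the prototyping view. The view name vwsession_data is taken from the query further down; the source column names, SUBSTR() offsets and date format shown here are placeholders rather than the original values:

    -- Hypothetical wrapper only: offsets, formats and the source table are assumptions.
    CREATE OR REPLACE VIEW vwsession_data AS
    SELECT SUBSTR(cust_id, 9)   AS cust_id    -- strip the 'custId' identifier (offset illustrative)
         , SUBSTR(movie_id, 10) AS movie_id   -- strip the 'movieId' identifier (offset illustrative)
         , SUBSTR(genre_id, 10) AS genre_id
         , TO_DATE(SUBSTR(time_id, 9, 19), 'YYYY-MM-DD HH24:MI:SS') AS sess_date  -- format is an assumption
         , activity_id
         -- ...plus the CASE expressions shown above for act_rate, act_complete, etc.
    FROM movieapp_log_ext;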

    Running a query against this view generates our basic data set that we can then feed into our MATCH_RECOGNIZE clause to create the sessionization result set.


    Is there a better, simpler way to deal with the avro file? Well yes there is. As part of the Database 12c release programme we will be adding in-database support for JSON data. This will allow us to take the raw avro file and access and query it directly from within the database without the need to go through the complicated cleaning process! At the moment we have not made any collateral (PPTs etc) on this feature publicly available so I can't provide you with any supporting links that will give you more information. However, once this feature is released I will revisit this little demo to explore how the new JSON feature can be used alongside SQL pattern matching.
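
    Purely as an illustration of the kind of simplification this could bring, a query over the raw log might end up looking something like the sketch below. Since, as noted above, the feature details are not yet public, the function name, JSON paths and table name are all assumptions on my part:

    -- Hypothetical sketch only: direct access to the raw clickstream without the
    -- SUBSTR()-based cleaning step; syntax and names are assumptions.
    SELECT JSON_VALUE(doc, '$.custId')   AS cust_id
         , JSON_VALUE(doc, '$.movieId')  AS movie_id
         , JSON_VALUE(doc, '$.genreId')  AS genre_id
         , JSON_VALUE(doc, '$.activity') AS activity_id
    FROM   movieapp_log_json;  -- assumed external table exposing each record as a JSON document column 'doc'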

    Calculating the sessionization data

    The code to calculate the sessionization information is as follows:

    SELECT *
    FROM vwsession_data
    MATCH_RECOGNIZE
     (PARTITION BY cust_id ORDER BY sess_date 
      MEASURES match_number() session_id, 
        COUNT(*) no_of_events,
        FIRST(sess_date) start_date,
        TO_CHAR(FIRST(sess_date), 'hh24:mi:ss') start_time,
        LAST(sess_date) end_date,
        TO_CHAR(LAST(sess_date), 'hh24:mi:ss') end_time, 
        TO_CHAR(ROUND(TO_NUMBER(LAST(sess_date) - FIRST(sess_date)) * 1440), '999,999') duration,
        LAST(activity_id) last_act_id,
        COUNT(act_rate) act_rate,
        COUNT(act_complete) act_complete,
        COUNT(act_pause) act_pause,
        COUNT(act_start) act_start,
        COUNT(act_browse) act_browse,
        COUNT(act_list) act_list,
        COUNT(act_search) act_search,
        COUNT(act_login) act_login, 
        COUNT(act_logout) act_logout,
        COUNT(act_incomplete) act_incomplete,
        COUNT(act_purchase) act_purchase 
     PATTERN (strt s+) 
     DEFINE
        s as (ROUND(TO_NUMBER(sess_date - PREV(sess_date)) * 1440) <= 5)
     ) MR;

    This statement uses many of the MATCH_RECOGNIZE features that I have discussed in previous posts. In this specific example I have decided to set the maximum gap between events within a session at 5 minutes. That means that if the user does nothing for about 5 minutes then I will assume that a new session has started, even if the user has not logged out. 

    Beyond the basic sessionization model there is another interesting pattern that we can search for within our data set. If a user starts watching a movie then we might not get another event being logged for over two hours. Therefore, we could look for instances where the time between events is less than 150 minutes (assuming most movies last around 2 hours) and at least one "start" event and at least one "complete" event are logged:

    PATTERN (strt s* f+ c+) 
    DEFINE
     f as act_start = 'Y',
     c as act_complete = 'Y',
     s as (round(to_number(sess_date - prev(sess_date))*1440) <= 150)

    The output from this query is shown below:

    [Screenshot: output of the start/complete pattern query]

    Looking at the results what really jumps out is the level of interaction on the site before, during and after the user has watched a movie. Many of the records that are returned show users searching our movie database, going to their lists of favourite movies and rating the movies they have just watched. All this implies that if we can get customers to begin watching a movie there is a huge knock-on effect in terms of how long they remain on the site and the number of "events" they create while on the site. This is the sort of information that is useful to marketing and campaign management teams. 

    A different view: Calculating the sessionization data for each movie

    While developing the above analysis I began to wonder about analysing the data not from a customer perspective but from the point of view of each movie. This would allow me to look at the session activity over time for each movie and understand the usage patterns across movies and genres. For example, are there certain movies or types of movies that are more or less likely to be rated? Is the process of watching a movie the same across all movies in terms of the events that are registered during a session? There are all sorts of new avenues of analysis that could be created from looking at the data in this way.

    So how easy would it be to change the focus of the analysis to movies? Well, the code took me approximately 30 seconds to write: I actually cut and pasted the customer sessionization example, switched the customer id for the movie id and finally removed a few fields. Here is the new code:

    SELECT *
    FROM vwsession_data
    MATCH_RECOGNIZE
     (PARTITION BY movie_id ORDER BY sess_date 
      MEASURES COUNT(*) no_of_events,
         FIRST(sess_date) start_date,
         TO_CHAR(first(sess_date), 'hh24:mi:ss') start_time,
         LAST(sess_date) end_date,
         TO_CHAR(LAST(sess_date), 'hh24:mi:ss') end_time, 
         TO_CHAR(ROUND(TO_NUMBER(LAST(sess_date) - FIRST(sess_date)) * 1440), '999,999') duration,
         MIN(activity_id) act_id,
         MAX(activity_id) last_act_id,
         COUNT(act_rate) act_rate,
         COUNT(act_complete) act_complete,
         COUNT(act_pause) act_pause,
         COUNT(act_start) act_start,
         COUNT(act_browse) act_browse,
         COUNT(act_list) act_list,
         COUNT(act_search) act_search,
         COUNT(act_login) act_login, 
         COUNT(act_logout) act_logout,
         COUNT(act_incomplete) act_incomplete,
         COUNT(act_purchase) act_purchase 
     PATTERN (strt s*) 
     DEFINE
         s as (ROUND(TO_NUMBER(sess_date - PREV(sess_date))*1440) <= 120)
     ) MR;

    The speed at which you can create these additional/alternative result sets is really great!  It is now relatively easy to continue this discovery process by analysing the information by movie genre or any of the other related attributes.


    The main take-away from this post is that (hopefully) I have shown how easy it is to use SQL for sessionization analysis. Firstly, in terms of creating the framework to support the normal "sessionization" transformation process: taking the source log file, extracting the data points and then grouping the information by user. The great news is that processing of the source file will be greatly simplified when the JSON feature is released, because it will be possible to query the log file in a much simpler way to extract the required data points. Secondly, SQL provides a much simpler and easier way to support the evolutionary nature of the discovery process. Changing the focus from customer analysis to movie analysis is both quick and easy. I am not convinced that a Java-based approach to data discovery can offer the same level of simplicity and agility, but I will let you be the judge of that last statement….

    Tuesday Apr 15, 2014

    OpenWorld call for Papers closes today!

     Just a gentle reminder - if you have not submitted a paper for this year's OpenWorld conference then there is still just enough time because the deadline is Today (Tuesday, April 15) at 11:59pm PDT. The call for papers website is here and this provides all the details of how and what to submit.

    I have been working with a number of customers on some really exciting papers so I know this year's conference is going to be really interesting for data warehousing and analytics. I would encourage everyone to submit a paper, especially if you have never done this before. Right now both data warehousing and analytics are among the hottest topics in IT and I am sure all of you have some great stories that you could share with your industry peers who will be attending the conference. It is a great opportunity to present to your peers and also learn from them by attending their data warehouse/analytics sessions during this week-long conference. And of course you get a week of glorious Californian sunshine and the chance to spend time in one of the world's most beautiful waterfront cities.

    If you would like any help submitting a proposal then feel free to email me today and I will do my best to provide answers and/or guidance. My email address is

    Have a great day and get those papers entered into our OpenWorld system right now! 

    Monday Mar 24, 2014

    Built-in sorting optimizations to support analytical SQL

    One of the proof points that I often make for using analytical SQL over more sophisticated SQL-based methods is that we have included specific optimizations within the database engine to support our analytical functions. In this blog post I am going to briefly talk about how the database optimizes the number of sorts that occur when using analytical SQL.

    Sort Optimization 1: Ordering Groups

    Many analytical functions include a PARTITION BY and/or an ORDER BY clause, both of which by definition imply that an ordering process is going to be required. As each function can have its own PARTITION BY-ORDER BY clause, this can create situations where a lot of different sorts are needed. For example, if we have a SQL statement that included the following:

    Rank() Over (Partition by (x) Order by (w))
    Sum(a) Over (Partition by (w,x) Order by (z))
    Ntile() Over (Partition by (x) Order by (y))
    Sum(b) Over (Partition by (x,y) Order by (z))

    this could involve four different sort processes to take into account the use of both PARTITION BY and ORDER BY clauses across the four functions. Performing four separate sort processes on a data set could add a tremendous overhead (depending on the size of the data set). Therefore, we have taken two specific steps to optimize the sorting process.
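
    For concreteness, here is a hedged sketch of such a statement; the table name is invented and NTILE is given an explicit bucket count:

    -- Illustrative only: four analytic functions that could, naively, require four
    -- separate sorts before the ordering-group optimization is applied.
    SELECT RANK()   OVER (PARTITION BY x    ORDER BY w) AS rnk
         , SUM(a)   OVER (PARTITION BY w, x ORDER BY z) AS sum_a
         , NTILE(4) OVER (PARTITION BY x    ORDER BY y) AS bucket
         , SUM(b)   OVER (PARTITION BY x, y ORDER BY z) AS sum_b
    FROM   some_fact_table;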

    The first step is to create the notion of "Ordering Groups". This optimization looks for ways to group together sets of analytic functions which can be evaluated with a single sort. The objective is to construct a minimal set of ordering groups, which in turn minimizes the number of sorts. In the example above we would create two ordering groups as follows:

    [Figure: the two ordering groups for the four functions above. Functions 1 and 2 can share a single sort on (x, w, z); functions 3 and 4 can share a single sort on (x, y, z).]

    This allows us to reduce the original list of sorts down from 4 to just 2.

    Sort Optimization 2: Eliminating Sorts

    We can further reduce the number of sorts that need to be performed by carefully scheduling the execution so that:

    • Ordering groups whose sort matches the sort required by the GROUP BY execute first (immediately after the GROUP BY) 
    • Ordering groups whose sort matches the sort required by the final ORDER BY execute last (immediately before the ORDER BY), as illustrated in the sketch after this list
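
    As a hedged illustration of the second case (table and column names are invented), the analytic sort below can be scheduled so that it also satisfies the final ORDER BY, leaving only one sort:

    -- The window function and the final ORDER BY share a single sort on (prod_id, sale_date).
    SELECT prod_id
         , sale_date
         , SUM(amount_sold) OVER (PARTITION BY prod_id ORDER BY sale_date) AS running_total
    FROM   sales
    ORDER  BY prod_id, sale_date;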

    In addition, we can also eliminate sorts when an index or join method (sort-merge) makes sorting unnecessary. 

    Sort Optimization 3: RANK Predicates

    Where a SQL statement includes RANK() functions there are additional optimizations that kick in. Instead of sorting all the data, adding the RANK and then applying the predicate, the RANK predicate is evaluated as part of the sort process. The net result is that fewer records are actually sorted, resulting in more efficient execution.
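
    As a hedged example (again with invented names), this is the classic Top-N pattern where the RANK predicate can be evaluated during the sort, so rows that can never make the top 10 are discarded early:

    SELECT prod_id, amount_sold, rnk
    FROM (
      SELECT s.prod_id
           , s.amount_sold
           , RANK() OVER (ORDER BY s.amount_sold DESC) AS rnk
      FROM   sales s
    )
    WHERE rnk <= 10;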


    Overall, these three optimizations ensure that as few sorts as possible are performed when you include SQL analytical functions as part of your SQL statements. 


    The Data Warehouse Insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.

