Tuesday May 26, 2015

Space Management and Oracle Direct Path Load

If you're loading your data warehouse using Oracle direct path load, you might want to check out a post on the Optimizer blog. It covers how the Oracle database manages space during a direct path load and what we've changed in Oracle Database 12c.

Wednesday Feb 25, 2015

New Way to Enable Parallel DML

This post was triggered by Jonathan Lewis' tweet here.

The conventional way of enabling parallel DML is to enable it with an ALTER SESSION command as explained in the documentation. This command enables parallel DML for the session and all subsequent DML statements are candidates for parallel execution provided that the rules and restrictions for parallel DML are met.

12c introduces a new way of enabling and disabling parallel DML. Rather than enabling or disabling it session-wise you can enable or disable it statement-wise using new hints. The hint ENABLE_PARALLEL_DML enables parallel DML for the statement, and the hint DISABLE_PARALLEL_DML disables it for the statement. All rules and restrictions for parallel DML still apply, these hints are only alternatives to the related ALTER SESSION commands.
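
For example, a single statement can be made a candidate for parallel DML without any ALTER SESSION command, just by adding the hint (a minimal sketch; the table names are illustrative):

INSERT /*+ APPEND ENABLE_PARALLEL_DML PARALLEL(4) */ INTO sales_copy
SELECT * FROM sales;

Conversely, DISABLE_PARALLEL_DML in the same position turns parallel DML off for just that statement even if it has been enabled for the session.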

This example shows the plans for the same statement without and with the ENABLE_PARALLEL_DML hint.

We can see that parallel DML is not enabled for the above statement, as indicated by the LOAD operation being above the PX COORDINATOR in the plan. In 12c the notes section nicely and clearly shows that it is not enabled.

Here we see that the notes section does not mention parallel DML as disabled and the LOAD operation is under the PX COORDINATOR; both of these indicate that parallel DML is enabled.

These two new hints are available starting with Oracle Database 12c and can be used regardless of the value of the OPTIMIZER_FEATURES_ENABLE parameter. They can be used in INSERT, UPDATE, DELETE, and MERGE statements. We are updating the related documentation to include these; I will update this post to include links when the documentation is refreshed.

Sep 2, 2015 UPDATE: The documentation is updated to include these hints, here and here.

Friday Dec 12, 2014

Analytical SQL scripts now on Github

After this year’s OpenWorld I ran a 1-day workshop on analytical SQL for our Data Warehouse and Big Data Global Leaders customers. This was part of the Global Leaders 'Days with Development' programme. We had a packed room at the Oracle Conference Center and I was very lucky to have Stew Ashton, Technical Architect, BNP Paribas, Dr. Holger Friedrich, CTO at sumIT AG, and Joerg Otto, Head of DB Engineering, IDS GmbH, co-present with me and explain how they were using analytical SQL in their projects.

 The workshop covered the following topics:

  • Analytic SQL concepts and foundations
  • Analytic functions for reporting
  • Analytic functions for aggregation
  • More advanced and new 12c features: Pattern Matching
  • SQL Model clause

For the workshop I created a comprehensive slide deck (I will post the presentation shortly on our OTN home page) which included code samples and explain plans to highlight the key benefits of using our analytical SQL features and functions. The great news is that I now have a repository for analytical SQL code samples on GitHub. To kick things off I have posted all the SQL scripts that I created for this workshop so you can now download and work through a series of use cases that explain how to use window functions, intelligently aggregate data, manipulate rows and columns of data, find patterns and create what-if scenarios. Below is my repository home page where you can download the code:

Github Repository for Analytical SQL

So what is Github?

At a simple level, it is an online version control system (and a lot more!) that stores and manages modifications to code script files within a central repository. Its key benefit is that it makes it very easy for developers to work together on a common project. This environment makes it easy to download a new version of code scripts, make changes, and upload the revisions to those files. Everyone can then see these new changes, download them, and contribute. This system is very popular with developers so we have decided to join this community and make our SQL scripts available via this site. It is the ability to “collaborate” which is most important for me.

To help you get started there is a great section on the website called “Set Up Git”. If like me you are using a Mac then GitHub has a Mac client! You can use it without ever touching the command line interface (which can be a little frustrating at times!).

You can contribute too!

It would be great if you could contribute your own scripts to this repository so I can build up a library of scripts for analytical SQL. All you need to do is create an account on Github, search for the analytical SQL repository and then either download the repository as a zip file or use the “Clone in Desktop” option. What I want to do is build up a series of well documented use cases and when we have enough content then I will create industry specific folders to help organize the content.

So my new repository is now live, please have a look and feel free to upload your own scripts that show how you have used analytical SQL to solve specific business problems within your projects. Looking forward to lots of files arriving into this great new repository. 

Friday Dec 05, 2014

Parallel Execution Fundamentals White Paper

As my first blog post as the product manager for Parallel Execution (still known as Parallel Query by many people) I want to point out a new white paper about parallel execution fundamentals in the Oracle Database. You can find the paper here.

Also, stay tuned for another one in which we will talk about using Database Resource Manager to manage concurrent workloads with parallel execution; this will be published soon.

Please send any comments, questions and feedback about the paper, or generally about parallel execution in Oracle, to yasin.baskan@oracle.com, or even better use the comments section below. 

Tuesday Nov 11, 2014

Optimizing Table Scans with Zone Maps

Most of you will be familiar with partition pruning, where the Oracle Database will avoid the need to scan table and index partitions based on query predicates. This optimization is transparent to your application, but for it to work, the database has to find a way of mapping a query filter predicate to the partitioning key column (or columns). Partition pruning can only occur if the query has predicates that match the predetermined shape of a partitioned object. For example, a query on a SALES table partitioned by ORDER_DATE will need to include ORDER_DATE in a join or WHERE clause for it to be optimized by partition pruning.

What if you could do better than this? What if you could prune partitions using a variety of column predicates and dimension hierarchies, irrespective of their appearance in the partitioning key? How about pruning at a much finer level of granularity than a partition? Perhaps we want to optimize queries that filter SALES by SHIP_DATE, STATE and COUNTY, as well as ORDER_DATE. The new Oracle zone map feature is designed to achieve this, and just like partitioning, zone maps are transparent to your queries; you don’t have to change your applications to make use of them.

Zone maps are available in Oracle Database 12c for Oracle Engineered Systems. Conceptually, they divide a table up into contiguous regions of blocks called zones (the default zone size being 1024 blocks). For each zone, the Oracle database records the minimum and maximum values for specified columns using a new database object called a zone map. Queries that filter on zone map columns have the potential to be optimized; it’s possible to prune zones that contain ranges of column values outside the match specified in the query predicate.

Consider a query that filters a sales table by (North American) state; in this case “CA”. A zone map on the STATE column will record the minimum and maximum values for this column for each zone in the table. This makes it possible to skip the zones that we can be certain won’t contain rows for “CA”.

Zone Map Representation
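
For the simple single-table case described above, a basic zone map can be created with one DDL statement. A minimal sketch, assuming SALES has a STATE column and using an illustrative zone map name:

CREATE MATERIALIZED ZONEMAP sales_state_zmap ON sales (state);

The more interesting case, where the filter columns come from a joined dimension table, is covered below.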

You are probably aware that Exadata storage cells and the Oracle Database In-Memory Column Store use similar storage index techniques, so what benefits do zone maps add? Besides the fact that you can control zone maps explicitly, the most significant difference between zone maps and storage indexes is that zone maps can be used to prune zones using column predicates from multiple (joined) tables. Consider a more realistic scenario, in which the SALES table doesn’t have a STATE column, but instead has a LOCATION_ID referencing a dimension table called LOCATIONS. This is our query for summing the sales figures in California:

SELECT SUM(amount)
FROM   sales     s,
       locations l
WHERE  s.location_id = l.location_id
AND    l.state = 'CA';

It would be great if we could avoid scanning zones in SALES that don’t contain rows associated with “CA”. Before we look at how we can do this, we’ll make the scenario even more realistic by assuming that LOCATIONS is a dimensional hierarchy of State and County, like this:

Zone Map Locations Table Example

 Each State is made up of multiple Counties, so “CA” will be associated with multiple LOCATION_ID values. If we want the “CA” rows in SALES, we’ll need to match the ones marked below in bold/red:

Zone Map Sales Table Example

If we want to optimize a scan for “CA” rows, we will have to address a few issues:

  • The SALES table does not have a STATE column, so no storage index structure on SALES data will allow us to directly prune disk regions based on “CA”.
  • Table rows associated with “CA” are likely to be physically scattered throughout the SALES table, so it’s unlikely that these rows will be confined to a relatively small number of zones or disk regions. We might not be able to make efficient use of an Exadata storage index on SALES.LOCATION_ID, if any (note that I am consciously ignoring the push down of BLOOM FILTERs to Exadata here, which still suffers from the physical scattering).
  • A SALES storage index based on min/max Location ID is likely to be less efficient than using zones based on min/max State values, simply because each State is made up of multiple Location IDs. This inefficiency is more pronounced if Location IDs for “CA” are not numerically close to one another - it will reduce the chances that the Location IDs we’re searching for will be found within the same min/max Location ID regions.

Of course, zone maps are designed to address these issues - with a little bit of help from another Oracle Database 12c feature called attribute clustering.  I introduced attribute clustering in an earlier post, but don’t worry if you haven’t read that yet; I’ll cover the basics here anyway. You’ve probably deduced that we can reduce the number of zones that contain “CA” rows if we cluster or sort the rows in SALES, keeping these rows close to one another, like this:

Table with Zone Map

Attribute clustering is the feature that’s used to cluster the rows together. Zone maps are used to record the min/max values for specified columns for each zone (and this can include column values derived from joins; LOCATIONS.STATE and LOCATIONS.COUNTY in our case).

The following DDL will create a zone map on our SALES fact table using the dimension table columns LOCATIONS.STATE and LOCATIONS.COUNTY. It will also enable attribute clustering, using the same columns to cluster the table’s rows:

ALTER TABLE sales 
ADD CLUSTERING sales JOIN locations ON (sales.location_id = locations.location_id) 
BY LINEAR ORDER (locations.state, locations.county) 
WITH MATERIALIZED ZONEMAP;

The LINEAR ORDER clause specifies a linear clustering algorithm, which is ideal for this example. Another algorithm is available; it is specified with "INTERLEAVED" and is optimized for more complex combinations of query predicates and dimension tables. Note that the definition of attribute clustering by itself does not change any data stored on disk; instead, it provides a directive for direct path operations (such as INSERT APPEND and MOVE) that will physically perform the clustering operation for us. If there are pre-existing rows in SALES, we can MOVE the table (or its partitions) to re-order them.

Joins between SALES and the dimension table are now candidates for optimization when the query includes predicates on the dimension hierarchy “state” and “state, county”. For example:

SELECT SUM(amount)
FROM   sales
JOIN   locations  ON (sales.location_id = locations.location_id) 
WHERE  locations.state  = 'NM';
SELECT SUM(amount)
FROM   sales
JOIN   locations  ON (sales.location_id = locations.location_id) 
WHERE  locations.state  = 'CA'
AND    locations.county = 'Kern';

By clustering the rows and recording appropriate min/max column values for our zones, we have addressed all of the issues I identified above. What’s more, we can still get benefit from Exadata storage indexes because zone maps and storage indexes complement one another, and they work together transparently.

Zone maps are explicitly created and controlled by the database administrator on a table-by-table basis. They are an inherent part of the physical database design and can be thought of as a coarse anti-index structure (unlike an index, a zone map tells you what zones not to access). Zone maps are very compact, and in some cases it is possible to use them where you would otherwise use an index. This is most relevant in data warehousing environments where scanning is often more appropriate than indexed row retrieval, and where indexes can use a considerable amount of storage space. Zone maps must be refreshed to be synchronized with the underlying table data, so you will need to give some consideration to how you want them to be kept up-to-date if you decide to use them as an alternative to indexes.
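
For example, a stale zone map can be brought back in sync with an explicit rebuild (a sketch, using the illustrative zone map name from earlier; refresh behaviour can also be declared when the zone map is created):

ALTER MATERIALIZED ZONEMAP sales_state_zmap REBUILD;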

In summary, take a look at zone maps if you want to:

  • Optimize scanning queries, particularly when joining with one or more tables.
  • Reduce your dependency on indexes, particularly in data warehousing environments.
  • Improve performance in your data warehouse; particularly for star or snowflake schemas.

Here’s an example of using zone maps to optimize a table scan. To compare before and after, start by creating a table that has no zone map or attribute clustering:

CREATE TABLE sales_zm (sale_id NUMBER(10), customer_id NUMBER(10));

Insert 8 million rows with the following PL/SQL code. Why that many? With our example, we'll read one or two zones rather than the entire table, so I'm aiming to make the difference pretty obvious when you look at the block read statistics:

BEGIN
  FOR i IN 1..80 LOOP
    INSERT INTO sales_zm
    SELECT ROWNUM, MOD(ROWNUM, 1000)   -- generated values are illustrative; any spread of customer IDs will do
    FROM   dual
    CONNECT BY LEVEL <= 100000;
  END LOOP;
  COMMIT;
END;
/
EXEC dbms_stats.gather_table_stats(ownname=>NULL, tabname=>'SALES_ZM');

Run the following query a few times to see what value “consistent gets” settles at:

SELECT COUNT(*)   -- the select list is illustrative; the CUSTOMER_ID filter is what matters here
FROM   sales_zm
WHERE  customer_id = 50;

On my machine, I read 7,545 blocks from the buffer cache, but since the value depends on some storage defaults don’t be surprised if your value is different:

Before Zone Map

The following DDL will create a zone map, but since attribute clustering is a property of the table (like compression), any existing rows will not be re-ordered:

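A minimal sketch of that DDL, assuming we cluster and zone map on CUSTOMER_ID (the column our test query filters on):

ALTER TABLE sales_zm 
ADD CLUSTERING BY LINEAR ORDER (customer_id) 
WITH MATERIALIZED ZONEMAP;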

The zone map will not be efficient until we cluster the rows together, so we’ll MOVE the table to achieve this. This will refresh the zone map too:

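ALTER TABLE sales_zm MOVE;   -- re-orders the existing rows per the clustering directive and refreshes the zone map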

Run the same query a few times to see what value “consistent gets” settles at:

SELECT COUNT(*)   -- the select list is illustrative; the CUSTOMER_ID filter is what matters here
FROM   sales_zm
WHERE  customer_id = 50;

On my database, I read around 1,051 database blocks instead of 7,545: a considerable improvement:

After Zone Map

You'll find more examples covering zone maps and attribute clustering in the Oracle Learning Library and inside the Oracle Github repository. Full details on zone maps and attribute clustering can be found in the Oracle documentation library; particularly the Oracle 12c Database Data Warehousing Guide.

There's an earlier post on attribute clustering if you haven't read it already.

If there's anything you need to ask, or if you can't find what you need regarding zone maps or attribute clustering, please let me know by leaving a comment below. Thanks!

Friday Oct 24, 2014

Optimizing Queries with Attribute Clustering

Attribute clustering is a feature that’s new to Oracle Database (Enterprise Edition). It was designed to work with other features, such as compression, storage indexes, and especially with a new feature called zone maps, but since attribute clustering has some useful benefits of its own I’ll cover them here and make zone maps the subject of a later post.

So what is attribute clustering? It is simply a table property - just like compression - but it defines how rows should be ordered and clustered together in close physical proximity, based on one or more column values. For example, in a sales transaction table you could choose to cluster together rows that share common customer ID values. Why would you consider doing this? Perhaps your system frequently queries sales data relating to particular customers, and perhaps there is a requirement for extremely fast and consistent response times (a call center CRM application would be a good example). Your physical database design will probably incorporate an index on the customer ID column already, but you can gain further benefit if sales table rows are physically near to one another when they share common customer IDs. The diagram below represents an index being scanned to find a particular customer, followed by some reads that fetch the corresponding sales table rows:

Attribute clustering improving index clustering.

With attribute clustering, the matching sales table rows are near to one another, so it is likely that fewer database blocks will need to be read from storage (or database cache) than if the rows are scattered throughout the sales table. The reason for this is that database blocks will usually contain multiple rows, so it is beneficial if each block we read happens to contain multiple matching rows. Technically, attribute clustering improves index clustering factors, an important metric with regards to the efficiency of scanning indexes and fetching the corresponding table rows.

Many DBAs have used a similar trick in the past by ordering rows as they are loaded into the database (using an explicit SQL “ORDER BY” clause). Attribute clustering has the advantage of being transparent and a property of the table itself; clustering behavior is inherited from the table definition and is implemented automatically. Just like compression, attribute clustering is a directive that transparently kicks in for certain operations, namely direct path insertion or data movement. This is especially useful because row clustering will occur during table and partition movement as well as during data load.

It's pretty common for database systems to order rows on data load or data movement to improve table compression ratios. Attribute clustering can be used instead to achieve a similar result.

The name, “attribute clustering” might remind you of another database feature called Oracle Table Clusters, but be careful not to confuse the two. Oracle Table Clusters store rows from one or multiple tables in close proximity inside a specialized storage structure. Attribute clustering simply orders rows within a table (or its partitions and subpartitions); related rows will be physically close to one another, but they will not be held inside any new type of storage structure that's specific to attribute clustering.

Although attribute clustering is especially useful in conjunction with zone maps, it can be used as a transparent, declarative way to cluster or order table rows in order to improve:

  • Index range scan performance.
  • Table compression ratios (including for Oracle Database In-Memory).
  • Smart scan filtering on Exadata and the In-Memory column store through more efficient storage indexes.

Here’s an example of using attribute clustering to speed up a query. We'll compare before and after; so start by creating a table that is not attribute clustered:

CREATE TABLE sales_ac (sale_id NUMBER(10), customer_id NUMBER(10));

INSERT INTO sales_ac
SELECT ROWNUM, MOD(ROWNUM, 1000)   -- row count and generated values are illustrative
FROM   dual
CONNECT BY LEVEL <= 100000;

EXEC dbms_stats.gather_table_stats(ownname=>NULL, tabname=>'SALES_AC');

CREATE INDEX sales_ac_cix ON sales_ac (customer_id);


Our table is not very large, so I’ve included a hint in the test query to encourage the optimizer to use the index: 

SELECT /*+ INDEX(sales_ac sales_ac_cix) */ COUNT(DISTINCT sale_id)
FROM  sales_ac
WHERE customer_id = 50;

Run the query a few times, and see what the reported value for “consistent gets” settles at. I get 96, but since the value depends on some storage defaults, don’t be surprised if your value is different: 

Before attribute clustering.

Attribute clustering is a property of the table, so when it is added, existing rows are not re-ordered. The following command is very fast because it just makes a change to the data dictionary: 

ALTER TABLE sales_ac 
ADD CLUSTERING BY LINEAR ORDER (customer_id);
Now we can physically cluster the existing table data by moving the table:

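ALTER TABLE sales_ac MOVE;   -- re-orders the existing rows according to the clustering directive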

Moving tables and partitions is much cleaner and simpler than the manual “ORDER BY” method, where we would have to create a new table, add indexes, drop the old table and then rename. The simpler MOVE approach is particularly relevant in real-world scenarios, where it would be more usual to move tables a partition or sub-partition at a time, potentially using on-line operations.

Rebuild the index:

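ALTER INDEX sales_ac_cix REBUILD;   -- the table move marks the index unusable, so rebuild it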

Use the same test query:

SELECT /*+ INDEX(sales_ac sales_ac_cix) */ COUNT(DISTINCT sale_id)
FROM   sales_ac
WHERE  customer_id = 50;

Again, run it a number of times to settle the reported value for “consistent gets”. In my case, I now read 3 database blocks instead of 96: a considerable improvement!  

With attribute clustering. 

Full details on zone maps and attribute clustering can be found in the Oracle documentation, particularly the Oracle 12c Data Warehousing Guide.

The zone maps blog post includes some more on attribute clustering, plus links to a bunch of scripts for you to try out. 

Monday Jun 23, 2014

Part 3: DBAs guide to building and deploying sandboxes

Recently I started a series of blog posts covering the need for a private, secure and safe area for data discovery within the data warehouse ecosystem. The most common name for this type of area is a “sandbox”. The demand for sandboxing is growing rapidly as many companies start exploring the new types of data streams linked to “big data”. In the first of my series of blog posts I started with an overview of the basic elements of sandboxes in the cloud. In the second post I looked at the differences between sandboxes vs. data marts.  As part of the first post I put forward the idea of a sandbox lifecycle model, as shown below, and in this post I am going to explore the first stage of that lifecycle model: the BUILD stage:

Many companies are actively working on lots of different big data led projects: customer sentiment analysis, clickstream analysis, product/service recommendations, real time event monitoring etc. Consequently there are lots of different project teams wanting to evaluate a wide variety of new data sources inside their own private space. They are looking for new customers or segments to target, new product opportunities, new ways to retain customers and ways to predict system faults before they happen. Consequently, DBAs are being flooded with requests to create new sandbox environments. The process of building, or deploying, a sandbox can be based on one of two approaches:

  • Deliver an empty environment: the project team then finds the required data sources and loads them into the sandbox before starting their analysis
  • Deliver a pre-populated environment: the project team loads any additional new data sources into the sandbox before starting their analysis

At the moment we can easily support both of these scenarios. Below is a review of the features available in earlier, pre-12c, versions of the database followed by new features that are part of Database 12c. Hopefully, if you are already using sandboxes and looking for a reason to move to Database 12c then this post will help you make the case for moving your data warehouse to 12c.

Building Sandboxes pre-Database 12c

There are a number of ways to create sandboxes using pre-12c versions of the Oracle Database. The two most popular choices that I have seen used are Database Configuration Assistant and Workspace Manager. 

Database Configuration Assistant

Creating an empty sandbox is relatively easy for DBAs and business users. The Database Configuration Assistant (DBCA) provides a wizard for creating and configuring an empty, but ready to go, database. This tool is very flexible and allows DBAs to control all aspects of the database creation process. For most business users the wizard provides far too many parameters to configure so Oracle provides three default configuration templates which provide default values for all the main configuration parameters. The pre-built templates include: general purpose, data warehouse and custom. The DBA can add additional templates for specific types of sandboxes by simply creating new template files; this is covered in the database documentation (see here: http://docs.oracle.com/cd/E16655_01/server.121/e17643/install.htm#BABEGFCG). The information in each template includes details of database options, initialisation parameters, and storage attributes (for data files, tablespaces, control files, and online redo logs). A common approach might be to create templates to configure large, medium and small sandboxes with appropriate settings for the specific hardware platform being used.

Once the new sandbox is ready the business users/data scientists can then start loading data. I am not going to cover the data loading process here, however, please note that this can be done using a data integration tool such as Oracle Data Integrator or flat files. In fact ODI now supports a scripting language called Groovy which allows ETL developers to create easy-to-use dialogs for selecting a table/view from a source system and copying its data to a target schema. For more information about ODI’s Groovy scripting language check out the posts on the data integration blog: https://blogs.oracle.com/warehousebuilder/tags/groovy.

Overall, creating empty sandboxes using DBCA is a simple process. However, this approach does mean that data is replicated from the original database and each instance requires its own dedicated hardware resources (I/O, CPU and memory) which cannot be easily transferred to other sandboxes. Every time a new sandbox is required the DBA and systems administrators have to ensure that sufficient memory for managing the data along with all the typical database background processes, storage and CPU are available. As more and more sandboxes are requested and deployed these replicated overheads quickly consume the available server resources and this limits the number of sandboxes that can be run on a specific platform. Having dedicated, non-sharable, resources is extremely limiting and very costly.

Many teams find that trying to get approval to create a new database is a long and painful process. Sometimes it can take months to work through the bureaucracy and ensure all the pieces are in place ready to create the new database: storage, processing nodes, network connectivity etc.

Key Benefits: relatively simple process for creating new, empty, databases;

Key issues: requires dedicated system resources; cannot copy an existing database with its schemas and data; no simple mechanism for moving discoveries (objects, data etc) to production environment; may take a long time to provision due to internal bureaucracy

Workspace Manager

If projects require pre-populated sandbox environments then creating a sandbox with access to existing data is best done using Oracle Workspace Manager. This provides the ability to manage current, proposed and historical versions of data within the same database instance so there is no need to move data. Workspace Manager creates a self-contained virtual area within a database where users can update existing data without affecting the original data. This virtual area isolates all the changes until they are explicitly merged with production data or discarded. This means there is no need to duplicate existing data and any new data can be loaded in a controlled way without impacting the production data.

Users in a workspace always see a transactionally consistent view of the entire database; that is, they see changes made in their current workspace plus the original data in the database as it existed either when the workspace was created or when the workspace was last refreshed with changes from the parent workspace. Workspace Manager even supports a hierarchy of workspaces for very complex sandbox requirements involving multiple teams and/or project goals.

There is a great example of how to use Workspace Manager for what-if analysis on the Oracle Learning Library. The step-by-step tutorial shows you how to use workspaces to do location-based analysis on current and proposed data. The fictitious company MyCompany is planning to build a new warehouse to provide better service to its customers. Two potential sites are under consideration. The company wants to use SQL to analyse the prospective sites. To do this, data for both prospective sites must be entered into the production warehouse schema. However, this new data must be kept separate from the existing production data so that it does not impact the work of employees who are not part of the site selection team. Using Workspace Manager it is possible to isolate data for each of the prospective sites and allow the two site selection teams to work concurrently. This use case creates the following hierarchy of workspaces, as shown below:

 version-enabled warehouse table and associated team workspaces: SITE1, SITE2, and LIVE. 

You can take this sandbox tutorial by following the steps on the Oracle Learning Library page, see here: https://apex.oracle.com/pls/apex/f?p=44785:24:15780601154836::NO:24:P24_CONTENT_ID%2CP24_PREV_PAGE:4553%2C29
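
At its simplest, the Workspace Manager calls behind a setup like this look something like the following sketch (DBMS_WM is the Workspace Manager PL/SQL API; the table name is made up for illustration):

-- Version-enable the table both teams will modify
EXEC DBMS_WM.EnableVersioning('warehouse_sites');
-- Create an isolated workspace for each site selection team
EXEC DBMS_WM.CreateWorkspace('SITE1');
EXEC DBMS_WM.CreateWorkspace('SITE2');
-- Each team works inside its own workspace...
EXEC DBMS_WM.GotoWorkspace('SITE1');
-- ...and the chosen changes can later be merged back into LIVE, or discarded with RollbackWorkspace
EXEC DBMS_WM.MergeWorkspace('SITE1');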

One of the challenges of using workspaces relates to resource management and the ability to increase or decrease resources which is important for overall resource utilisation.

Obviously at some point users will want to add new data into the workspace and, as with the DBCA overview, ETL developers can use ODI’s Groovy scripting language to create easy-to-use dialogs for selecting a table/view from a source system and copying its data to a target schema. For more information about ODI’s Groovy scripting language check out the posts on the data integration blog: https://blogs.oracle.com/warehousebuilder/tags/groovy

Key Benefits: relatively simple process; no need to copy/clone data; easy to move discoveries to production

Key issues: requires careful workload management; monitoring and chargeback is difficult to achieve

For more information about Oracle Workspace Manager visit the product home page on OTN: http://www.oracle.com/technetwork/database/enterprise-edition/index-087067.html.


In general, it is likely that most customers will use both approaches for delivering sandboxes to their project teams. Where there are hundreds of projects taking place at the same time then both approaches will probably create additional work for the DBA and systems administrators in terms of having to monitor and track lots of workspaces and/or plan-monitor-track resource usage across lots of database instances. Ideally what we need is an approach that can easily support the deployment of thousands of sandboxes, offers simple and fast provisioning, leaves data in place to prevent unnecessary movement of large volumes of data, provides a single centrally managed platform and, most importantly, allows for expansion/reduction of resources as needed across all sandboxes. Is this possible? 

Building Sandboxes with Database 12c

The good news is that the new features we introduced as part of Database 12c combine all the best parts of the sandboxing features from the pre-12c approach with none of the drawbacks. Specifically, Database 12c introduced the concept of a multitenant database. This was originally positioned as a way to consolidate databases and applications under the control of a single environment but the approach works amazingly well for sandboxes. Therefore, with 12c you can now deploy sandboxes into the cloud!

This new feature (Oracle Multitenant) delivers a new architecture that provides a multitenant container database to hold a series of pluggable databases, or in this case sandboxes. There is a great quick overview of the new multitenant feature available on YouTube, follow the link below to watch the video:

Video: Oracle Multitenant Architecture

With the new multitenant architecture featuring pluggable databases (sandboxes) it is now possible to create a single container database that allows multiple sandboxes to be simply plugged in. Below is a simple overview of the main multitenant terminology. When we start to create a new multitenant sandbox environment we plug-in our multiple pluggable sandboxes to the root database. The combination of root database and pluggable sandboxes (PDBs) is referred to as a “Container Database” or CDB as shown below:


The great thing about the multitenant feature is that it offers a very fast and efficient way of creating new sandboxes. It is a bit like the template feature in Database Configuration Assistant with one important difference: deploying a new pluggable sandbox takes seconds as shown here:


The above is taken from the Multitenant infographic, which is here: http://www.oracle.com/us/products/database/oracle-multitenant-infographic-1961308.pdf. The actual speed at which we can deploy pluggable sandboxes is outlined in the graph below:


The first bar on the graph shows the time taken to create a “standard” database, which involves copying template data files and then configuring the instance, which can take a considerable amount of time. The second bar shows the time taken to create a new pluggable sandbox, which is significantly faster. The third bar shows that we can deliver pluggable sandboxes in real time by using the built-in “clone” feature - this allows us to deploy a new pluggable sandbox which is able to “see” the source data in the PDB which acted as the source for the clone. The cloned-pluggable sandbox also gets access to new/refreshed data as it is changed in the source system. This means that our cloned pluggable sandbox is always up to date with the very latest data. 

What is clone?

Provisioning a sandbox that is a copy of an existing database so the project teams have access to existing data sets used to be a lengthy multistep process.  Multitenant provides a capability called cloning. It is now possible to create local clones (from a pluggable database in the same container database) or remote clones (across a database link to a remote container database). What are the benefits of using this feature? Firstly, it is very simple - a single SQL statement. Secondly, it keeps the overall security requirements nice and  simple  because all that is required is database access and not OS access. Lastly, business users and data scientists like it because the whole process is so fast. The process of creating full clones is fast. Even faster—ludicrously fast—is the process for creating snapshot clones. Snapshot clones are built on a capability of the underlying file system—where available—called copy-on-write. This makes it possible to request and deploy a pre-populated pluggable sandbox in a matter of seconds!
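
For illustration, the SQL for a local clone can be as simple as this (a sketch; the PDB names are made up, a FILE_NAME_CONVERT clause may be needed depending on your file layout, and the snapshot variant assumes the underlying file system supports copy-on-write):

-- Full clone of an existing pluggable database
CREATE PLUGGABLE DATABASE sandbox1 FROM salesmart;

-- Snapshot clone: data files are shared copy-on-write, so creation takes seconds
CREATE PLUGGABLE DATABASE sandbox2 FROM salesmart SNAPSHOT COPY;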

Now we understand the overall architecture and speed at which we can deploy new sandboxes, how do you actually request a new sandbox/PDB?

Self-Service provisioning of sandboxes

In prior releases of the Oracle Database a lot of the tools and features that were used for sandboxing were typically aimed at DBAs. Today, many business users and data scientists want to control and own the process of requesting and building a sandbox. As part of Database 12c we have provided a new self-service application that DBAs can make available to their business teams for self-service provisioning. This new application provides a simple 4-step process for creating a new pluggable sandbox:

Step 1: Start the APEX self-service provisioning app and click on the “New Database” link:

Build 1

Step 2: Select whether to create a new sandbox/pluggable database or plug in an existing database that has been moved from another container. In the majority of cases business users will typically provision a new sandbox:

Build 2

Step 3: Determine if you are going to create a new empty sandbox or clone an existing database (mart or warehouse or existing sandbox). In the example below we have selected an existing sandbox, MIKETEST, and the option to clone this database is available at the bottom of the screen. 

Build 3

Step 4: Check the details and confirm the deployment. 

Build 4

As mentioned earlier, the speed at which pluggable sandboxes can be created means that business users can request a new pluggable sandbox and have the environment ready for use in a matter of seconds (of course your mileage might vary according to your hardware!).

Workspace Manager and Multitenant

In many cases there will still be a need for an environment where the same data sets can be shared across multiple project teams with each team working independently on their own view of the data. This is exactly the sort of use case for Workspace Manager, as described earlier, and the good news is that Workspace Manager functions transparently in a multitenant architecture. Workspaces benefit from the efficient administration of one multitenant container database, and the separation and resource prioritisation allowed by multiple pluggable databases. 

SQL Developer and Multitenant

For data scientists and more sophisticated business users the latest version of SQL Developer is now multitenant aware. Therefore, rather than using the self-service application, all the required operations for creating and managing a sandbox are now available from within SQL Developer as shown here:

SQLDeveloper 1

From within SQL Developer you can: deploy a sandbox (CREATE PLUGGABLE or CLONE PLUGGABLE), move a sandbox (UNPLUG and PLUG) and delete a sandbox (DROP PLUGGABLE).
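
The SQL equivalents of those operations look roughly like this (a sketch; the sandbox name and XML manifest path are illustrative):

-- Unplug a sandbox (its definition is written to an XML manifest)
ALTER PLUGGABLE DATABASE sandbox1 CLOSE IMMEDIATE;
ALTER PLUGGABLE DATABASE sandbox1 UNPLUG INTO '/u01/app/oracle/sandbox1.xml';

-- Plug it into another container database
CREATE PLUGGABLE DATABASE sandbox1 USING '/u01/app/oracle/sandbox1.xml' NOCOPY;

-- Remove a sandbox that is no longer needed
DROP PLUGGABLE DATABASE sandbox1 INCLUDING DATAFILES;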

SQLDeveloper 2

SQL Developer offers fine-grained control and most business users and data scientists will probably need some guidance from their DBA on the various parameter settings for deploying a new sandbox. For more information about using SQL Developer to manage (create, modify, plug/unplug) pluggable sandboxes see the online tutorial on the Oracle Learning Library: https://apex.oracle.com/pls/apex/f?p=44785:24:113456037406764:::24:P24_CONTENT_ID%2CP24_PROD_SECTION_GRP_ID%2CP24_PREV_PAGE:7649%2C%2C24


Database 12c offers significant advantages for delivering sandboxes - deploying pluggable sandboxes into the cloud is now simple and really fast. The fast and efficient deployment of pluggable sandboxes can be delivered as a self-service approach using a variety of tools and applications. These new pluggable sandboxes also support the existing data isolation features such as Workspaces. Overall, Database 12c makes sandboxing faster, simpler and more efficient.


In this latest post on sandboxing I have examined the “Build” phase of our BOX’D sandbox lifecycle. For customers who have not yet moved to Database 12c there are a number of tools and features that can be used to deploy sandboxes. Customers who have moved to Database 12c can use the new multitenant feature alongside some of the earlier features to deliver real-time deployment of pre-populated and empty sandboxes.

Thursday May 15, 2014

DBAs Guide to Sandboxes vs. Data Marts

I had an interesting response to my first post on the topic of sandboxing (DBA's Guide to Deploying Sandboxes in the Cloud). The following question was asked: what is the difference between a data mart and sandbox?

This is actually a great question so I thought it would be useful to convert my answer into a short blog post. I am sure there will be lots of different opinions on this topic just as there are alternative names for "sandbox environment" (from analytical sandbox, to analytical appliance to discovery zone etc etc) but here is my attempt at an answer:

OLAP1 IndustryDataModels

In my experience data marts tend to be a single subject area data repository and/or linked to a specific corporate application (such as finance, HR, CRM, ERP, logistics, sales tracking etc). The source data is pushed to a specific line of business for analysis. The push and loading processes implements all the necessary data cleansing and transformation routines so the data arrives into its destination schema ready for use. Most importantly, the data push happens on a regular basis and is driven by the needs of the business.  Many customers implement a data lifecycle management workflow to ensure that sufficient historical data is available to support the required analysis. In many cases the life of the data mart is largely open-ended and the IT team will ensure that regular backups and all the usual patching and maintenance operations are performed on a regular basis. Where the mart is seen as a mission critical system then high-availability features can be implemented and in extreme cases a disaster recovery site is setup and data synchronised between the production and DR systems.

The schema itself is typically organized to support reporting requirements and will be based around the standard relational models such as star and/or snowflake schemas although this is not a mandatory requirement. Sometimes a 3NF approach is required to support the particular needs of the business. The majority of queries within the mart are "well-defined" and "well-known" and subject to tuning and monitoring by the DBA team. 

A "sandbox" is generally meant as a non-operational environment where business analysts and data scientists can test ideas, manipulate data and model "what if" scenarios without placing an excessive computational load on the core operational processes. It has a finite life expectancy so that when the timer runs out the sandbox is deleted and the associated discoveries are either incorporated into the enterprise warehouse, or data mart, or simply abandoned. The primary driver from an organisational perspective is to use a 'fail-fast' approach. At any one point in time an organization might be running any number of analytical experiments spread across hundreds of sandboxes. However, at some point in time those experiments will be halted and evaluated and the "hardware" resources being consumed will be returned to a general pool for reuse by existing projects or used to create environments for new projects. 

In general terms a sandbox environment is never patched or upgraded, except in exceptional circumstances. There is never a great urgency to apply software or operating system patches so the IT team will just incorporate these tasks into the normal cycles.

A sandbox should never be considered mission critical so there is no need to implement high availability features or build and manage a DR environment. If a sandbox becomes unavailable due to a fault (hardware or software) there is no pressing urgency to resolve the issue - in Oracle parlance a "sandbox going down" is not a P1 issue for support.

Below is a summary of how I view the differences between marts and sandboxes:

Business Centric Attributes

  • Business scope – Data Mart: single subject area. Sandbox: potentially a mixing pot of data sourced from multiple systems.
  • Core objective – Data Mart: managing the business. Sandbox: discovery of new products, markets and/or customer segments.
  • Query scope – Data Mart: batch reporting for dashboards and pre-configured reports along with limited ad-hoc analysis. Sandbox: ad-hoc data discovery.
  • Frequency of update – Data Mart: regular, scheduled data loads. Sandbox: one-off and ad-hoc loads as required.
  • Data volumes – Data Mart: driven by external factors such as GRC requirements. Sandbox: driven by the needs of the project.
  • Data quality – Data Mart: very important; data is fully cleansed and transformed during the load process. Sandbox: raw data is loaded; transformations, cleansing and enhancements are incorporated into the discovery process.
  • Typical output – Data Mart: historical reports, KPIs, scorecards, multi-dimensional analysis. Sandbox: data mining models (forecasts, predictions, scoring) and sophisticated analytics (aggregations, spatial, graph etc).
  • Typical query patterns – Data Mart: pre-defined patterns returning small data sets, easily tuned. Sandbox: complex ad-hoc queries over massive data volumes.

IT Centric Attributes

  • Mission critical – Data Mart: often treated as such. Sandbox: never.
  • Performance SLAs – Data Mart: queries are tuned and monitored by the DBA team. Sandbox: none; there is no pressing urgency to resolve issues.
  • Individual/LOB chargeback for resource usage
  • HA features/DR Site – Data Mart: implemented where the mart is considered mission critical. Sandbox: not required.
  • Backups – Data Mart: full + incremental.
  • Patching and maintenance – Data Mart: as required to resolve specific issues along with scheduled maintenance programs. Sandbox: only when necessary.
  • Use of beta software – Sandbox: possibly if the project needs specific features.
  • Life expectancy – Data Mart: largely open-ended. Sandbox: limited – typically 90 days or less.
  • Number of instances – Data Mart: most companies choose to implement a very small number of subject-specific marts. Sandbox: large companies are likely to have many hundreds of sandboxes running at any given point in time.

Table: Key differences between data mart and sandbox

As you can see from this table, in some ways sandboxes are similar to data marts and in other ways they are not. For me, the key difference is in the life expectancy - a sandbox should never outstay its welcome. The best sandbox environments that I have come across are those where strict time limits are enforced on their duration. If you let a sandbox live on too long then you run the danger of it morphing into a shadow data mart and that is a very dangerous situation if you look at the attributes and descriptions listed in the first table (Business Centric Attributes). 

So why is there so much confusion about the differences between marts and sandboxes? Much of this is down to niche vendors trying to jump on specific marketing bandwagons. At the moment the latest marketing bandwagon is the concept of "analytical databases" which in reality is nothing more than a data mart (and in many cases these vendors are simply peddling highly specialised data silos). These niche vendor platforms are simply not designed to run hundreds of environments with resources being continually returned to a centralised pool for redistribution to existing or new projects - which is a core requirement for effective sandboxing.

Over the next 2-3 months I will share why and how the unique features of Oracle Database 12c provide the perfect platform for supporting environments running hundreds of thousands of sandbox-driven projects.

Thursday Apr 17, 2014

Analyzing our Big Data Lite movie app clickstream data

Since last year's OpenWorld I have posted quite a few articles on our new 12c pattern matching feature. 

To date most of my online demos and tutorials have used a very simple data set consisting of a single table with 60 rows of fictitious trade data. Over the last few months I have been searching around for a slightly bigger and more challenging data set to use. Fortunately, our Big Data PMs (led by Marty Gubar) have put together a very rich big data schema as part of their recently launched Big Data Virtual Machine that you can download from our OTN web page, see here: http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html.

The data set is based around an online customer movie application. Here is a little background…

Movieplex on otn

Oracle MoviePlex Demo Application

Oracle MoviePlex is a fictitious on-line movie streaming company. Customers log into Oracle MoviePlex where they are presented with a targeted list of movies based on their past viewing behavior. Because of this personalized experience and reliable and fast performance, customers spend a lot of money with the company and it has become extremely profitable.

As the users watch movies, search for their favorite actors and rate their movies, the system records all the clicks in a log file. The contents of that log file look like this:

Avro file

In its raw state this information is not very helpful. It needs a lot of processing to convert the various data points into usable information. In many cases companies have been turning to Hadoop and its related Java-based programming model MapReduce to process and convert these types of files into usable information. Most business users will want to summarise this information by customer and/or movie and then merge this information with other data sets. So how can we make access to and analysis of this type of data much easier? As part of this post I am going to compare the Hadoop-Java-MapReduce approach with an alternative approach using 12c SQL. My objective is not to determine which solution is best because each approach has its own camp of supporters. Once we have reviewed the two solutions I will put everything into context and make some recommendations…..so let's bring on the code!

Sessionization using Java

Accessing the Avro file

At the moment, in the context of Big Data, everything seems to revolve around Hadoop, MapReduce and Java. It is quite natural for a big data developer to extend their basic MapReduce processing to include more complicated requirements. In the case of our movie demo there is a lot of processing that needs to be done using the native features of Hadoop to collect and process the weblog file being generated by our application. There is an excellent video prepared by Marty Gubar (Director of Product Management for Big Data) which explains this process. This is part four of a six-part series that explains the movieplex demo: Part 4. Turn Clicks into Value - Flume & Hive. The video lasts about 5 mins and you can watch it here: https://www.youtube.com/watch?v=IwrjJUoUwXY.

The steps shown in the movie explain how to access the avro file and then how to clean the data to provide some interesting additional metrics.

Calculating the sessionization data

Creating the sessionization analysis is a little more complicated. In fact, it requires 370 lines of Java code. Here is the Java code we created for doing the sessionization analysis:

The actual logic for the sessionization analysis is about 100 lines of code, as shown here:

As you can see from the code listing this requires a strong knowledge of Java and, with 370 lines of code, if we decide to change the details of the pattern that we are searching for it is going to be a lengthy process to make the required changes. Making any changes to the code to reflect changing business requirements is definitely going to be beyond the skills of your average business analyst and/or data scientist and this might negatively impact the level of project agility.

Making life easier and richer with SQL

My personal view is that sessionization analysis is quite a sophisticated requirement and is best achieved using SQL because the requirements and parameters for the analysis will naturally evolve over time as new questions are triggered by existing answers. Therefore, you need an agile approach to analytics that can deal with constant changing requirements during the data discovery phase.

Accessing the Avro file

First step is to create a DIRECTORY object to access the raw data file:

CREATE DIRECTORY session_file_dir AS '/home/oracle/applog';
GRANT READ, WRITE ON DIRECTORY session_file_dir TO pmuser;

Next I created an external table over the data file. Notice that in the Avro file, each key column has an identifier and a value.


The objective for this external table was to keep the processing as simple as possible:

 LOCATION (SESSION_FILE_DIR: 'movieapp_30months.log') 

Cleaning the data

Now the next stage is to clean the data and remove the column identifiers such as custId, movieId, genreId etc. To do this processing I simply used the SQL SUBSTR() function. At the same time I decided to break out the activity ID column to create unique columns for each type of activity. This approach is useful if you are going to be using the data as part of a data mining project because in many cases it is useful to uniquely identify specific attributes. In many cases you will want to create a fact table from this process but in this case I have created a view while I do some basic prototyping:

    WHEN '1' THEN '1'
    WHEN '2}' THEN '2'
    WHEN '3}' THEN '3'
    WHEN '4}' THEN '4'
    WHEN '5}' THEN '5'
    WHEN '6}' THEN '6'
    WHEN '7}' THEN '5'
    WHEN '8}' THEN '8'
    WHEN '9}' THEN '9'
    WHEN '10' THEN '10'
    WHEN '11' THEN '11'
    ELSE null
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '1' THEN 'Y' END as act_rate
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '2' THEN 'Y' END as act_complete 
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '3' THEN 'Y' END as act_pause
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '4' THEN 'Y' END as act_start 
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '5' THEN 'Y' END as act_browse 
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '6' THEN 'Y' END as act_list
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '7' THEN 'Y' END as act_search
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '8' THEN 'Y' END as act_login
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '9' THEN 'Y' END as act_logout
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '10' THEN 'Y' END as act_incomplete
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '11' THEN 'Y' END as act_purchase

Running a query against this view generates our basic data set that we can then feed into our MATCH_RECOGNIZE clause to create the sessionization result set.


Is there a better, simpler way to deal with the Avro file? Well yes there is. As part of the Database 12c release programme we will be adding in-database support for JSON data. This will allow us to take the raw Avro file and access and query it directly from within the database without the need to go through the complicated cleaning process! At the moment we have not made any collateral (PPTs etc) on this feature publicly available so I can't provide you with any supporting links that will give you more information. However, once this feature is released I will revisit this little demo to explore how the new JSON feature can be used alongside SQL pattern matching.

Calculating the sessionization data

The code to calculate the sessionization information is as follows:

SELECT *
FROM vwsession_data MATCH_RECOGNIZE
 (PARTITION BY cust_id ORDER BY sess_date 
  MEASURES match_number() session_id, 
    COUNT(*) no_of_events,
    FIRST(sess_date) start_date,
    TO_CHAR(FIRST(sess_date), 'hh24:mi:ss') start_time,
    LAST(sess_date) end_date,
    TO_CHAR(LAST(sess_date), 'hh24:mi:ss') end_time, 
    TO_CHAR(ROUND(TO_NUMBER(LAST(sess_date) - FIRST(sess_date)) * 1440), '999,999') duration,
    LAST(activity_id) last_act_id,
    COUNT(act_rate) act_rate,
    COUNT(act_complete) act_complete,
    COUNT(act_pause) act_pause,
    COUNT(act_start) act_start,
    COUNT(act_browse) act_browse,
    COUNT(act_list) act_list,
    COUNT(act_search) act_search,
    COUNT(act_login) act_login, 
    COUNT(act_logout) act_logout,
    COUNT(act_incomplete) act_incomplete,
    COUNT(act_purchase) act_purchase 
 PATTERN (strt s+) 
 DEFINE
    s as (round(to_number(sess_date - prev(sess_date)) * 1440) <= 5)
 ) MR;

This statement uses many of the MATCH_RECOGNIZE features that I have discussed in previous posts. In this specific example I have decided to set the maximum duration between events within a session to 5 minutes. That means if the user does nothing for about 5 minutes then I will assume that a new session has started even if the user has not logged out. 

Beyond the basic sessionization model there is another interesting pattern that we can search for within our data set. If a user starts watching a movie then we might not get another event being logged for over two hours. Therefore, we could look for instances where the time between events is less than 150 minutes (assuming most movies last around 2 hours) and where at least one (or more) "start" event and at least one (or more) "complete" event is logged:

PATTERN (strt s* f+ c+) 
DEFINE
 f as act_start = 'Y',
 c as act_complete = 'Y',
 s as (round(to_number(sess_date - prev(sess_date))*1440) <= 150)

The output from this query is shown below:

Start Complete Data Full

Looking at the results what really jumps out is the level of interaction on the site before, during and after the user has watched a movie. Many of the records that are returned show users searching our movie database, going to their lists of favourite movies and rating the movies they have just watched. All this implies that if we can get customers to begin watching a movie there is a huge knock-on effect in terms of how long they remain on the site and the number of "events" they create while on the site. This is the sort of information that is useful to marketing and campaign management teams. 

A different view: Calculating the sessionization data for each movie

While developing the above analysis I began to wonder about analysing the data not from a customer perspective but from the point of view of each movie. This would allow me to look at the session activity over time for each movie and understand the usage patterns across movies and genres. For example, are there certain movies or types of movies that are more or less likely to be rated? Is the process of watching a movie the same across all movies in terms of the events that are registered during a session? There are all sorts of new avenues of analysis that could be created from looking at the data in this way.

So how easy would it be to change the focus of the analysis to movies? Well, the new code took me approximately 30 seconds to write: I cut and pasted the customer sessionization example, switched the customer id for the movie id and removed a few fields. Here is the new code:

SELECT *
FROM vwsession_data MATCH_RECOGNIZE
 (PARTITION BY movie_id ORDER BY sess_date
  MEASURES COUNT(*) AS no_of_events,
     FIRST(sess_date) AS start_date,
     TO_CHAR(FIRST(sess_date), 'hh24:mi:ss') AS start_time,
     LAST(sess_date) AS end_date,
     TO_CHAR(LAST(sess_date), 'hh24:mi:ss') AS end_time,
     TO_CHAR(ROUND(TO_NUMBER(LAST(sess_date) - FIRST(sess_date)) * 1440), '999,999') AS duration,
     MIN(activity_id) AS act_id,
     MAX(activity_id) AS last_act_id,
     COUNT(act_rate) AS act_rate,
     COUNT(act_complete) AS act_complete,
     COUNT(act_pause) AS act_pause,
     COUNT(act_start) AS act_start,
     COUNT(act_browse) AS act_browse,
     COUNT(act_list) AS act_list,
     COUNT(act_search) AS act_search,
     COUNT(act_login) AS act_login,
     COUNT(act_logout) AS act_logout,
     COUNT(act_incomplete) AS act_incomplete,
     COUNT(act_purchase) AS act_purchase
  ONE ROW PER MATCH
  PATTERN (strt s*)
  DEFINE
     s AS (ROUND(TO_NUMBER(sess_date - PREV(sess_date)) * 1440) <= 120)
 ) MR;

The speed at which you can create these additional/alternative result sets is really great!  It is now relatively easy to continue this discovery process by analysing the information by movie genre or any of the other related attributes.
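
As a quick sketch of that next step (the MOVIES dimension table with its GENRE column is an assumption on my part, not something shown earlier), one simple approach is to roll the movie-level sessions up to genre:

-- Sketch only: assumes a MOVIES dimension (movie_id, genre) exists.
SELECT m.genre,
       COUNT(*)             AS no_of_sessions,
       AVG(mr.no_of_events) AS avg_events_per_session,
       SUM(mr.act_complete) AS completed_viewings
FROM  (SELECT *
       FROM vwsession_data MATCH_RECOGNIZE
        (PARTITION BY movie_id ORDER BY sess_date
         MEASURES COUNT(*)            AS no_of_events,
                  COUNT(act_complete) AS act_complete
         ONE ROW PER MATCH
         PATTERN (strt s*)
         DEFINE s AS (ROUND(TO_NUMBER(sess_date - PREV(sess_date)) * 1440) <= 120)
        )) mr
JOIN  movies m ON m.movie_id = mr.movie_id
GROUP BY m.genre
ORDER BY no_of_sessions DESC;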


The main take-away from this post is that (hopefully) I have shown how easy it is to use SQL for sessionization analysis. Firstly, in terms of creating the framework to support the normal "sessionization" transformation process: taking the source log file, extracting the data points and then grouping the information by user. The great news is that processing of the source file will be greatly simplified when the JSON feature is released, because it will be possible to query the log file in a much simpler way to extract the required data points. Secondly, SQL provides a much simpler and easier way to support the evolutionary nature of the discovery process. Changing the focus from customer analysis to movie analysis is both quick and easy. I am not convinced that a Java-based approach to data discovery can offer the same level of simplicity and agility, but I will let you be the judge of that last statement….

Monday Oct 28, 2013

Partitioning tutorial - new features in Oracle Database 12c

For data warehousing projects Oracle Partitioning really is a must-have feature because it delivers so many important benefits such as:

  • Dramatically improves query performance and speeds up database maintenance operations
  • Lowers costs by enabling a tiered storage approach that allows data to be stored on the most cost-effective storage for better resource utilisation
  • Combined with Oracle Advanced Compression, it provides an automated approach to information lifecycle management using a simple, efficient, yet powerful way to manage data growth and reduce complexity and costs

To help you get the most from partitioning we have released a new tutorial that covers the 12c new features. Topics include how to:

  1. Use Interval Reference Partitioning
  2. Perform Cascading TRUNCATE and EXCHANGE Operations
  3. Move Partitions Online (see the sketch after this list)
  4. Maintain Multiple Partitions
  5. Maintain Global Indexes Asynchronously
  6. Use Partial Indexes
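
By way of illustration (the SALES table, partition and tablespace names below are made up), here is roughly what a couple of these operations look like in 12c:

-- Sketch only: object names are hypothetical.

-- Move a partition online: queries and DML can continue while the partition
-- is compressed and relocated onto cheaper storage.
ALTER TABLE sales MOVE PARTITION sales_q1_2013
  COMPRESS TABLESPACE low_cost_ts ONLINE;

-- Maintain multiple partitions in a single operation.
ALTER TABLE sales MERGE PARTITIONS sales_q1_2013, sales_q2_2013
  INTO PARTITION sales_h1_2013;

-- Partial indexes: only partitions marked INDEXING ON are included in the index.
ALTER TABLE sales MODIFY PARTITION sales_q1_2013 INDEXING OFF;
CREATE INDEX sales_cust_idx ON sales (cust_id) INDEXING PARTIAL;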

For more information about this tutorial follow this link to the Oracle Learning Library: http://apex.oracle.com/pls/apex/f?p=44785:24:0::NO:24:P24_CONTENT_ID,P24_PREV_PAGE:8408,2 where you can begin your tutorial right now!

For more information about Oracle Partitioning visit our home page on OTN: http://www.oracle.com/technetwork/database/bi-datawarehousing/dbbi-tech-info-part-100980.html

Friday Oct 18, 2013

OLL Live webcast - Using SQL for Pattern Matching in Oracle Database

If you are interested in learning about our exciting new 12c SQL pattern matching feature then mark your diaries. On Wednesday, October 30th at 8:00 am (US/Pacific time zone) Supriya Ananth, who is one of our top curriculum developers at Oracle, will be hosting an OLL webcast on our new SQL pattern matching feature.

The ability to recognize patterns in a sequence of rows has been a capability that was widely desired, but not possible with SQL until now. Row pattern matching in native SQL improves application and development productivity and query efficiency for row-sequence analysis.

With Oracle Database 12c you can use the new MATCH_RECOGNIZE clause to perform pattern matching in SQL to do the following:
  • Logically partition and order the data using the PARTITION BY and ORDER BY clauses
  • Use regular expression syntax to define patterns of rows to seek using the PATTERN clause. These patterns are a powerful and expressive feature, applied to the pattern variables you define.
  • Specify the logical conditions required to map a row to a row pattern variable in the DEFINE clause.
  • Define measures, which are expressions usable in the MEASURES clause of the SQL query (a short example follows this list).
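
As a taster, here is about the smallest useful MATCH_RECOGNIZE query, exercising all four of the clauses listed above (the TICKER table with symbol, tstamp and price columns is just the usual illustrative example, not something specific to the webcast):

-- Find simple "V" shapes (a run of falling prices followed by a run of
-- rising prices) for each symbol.
SELECT *
FROM ticker MATCH_RECOGNIZE
 (PARTITION BY symbol ORDER BY tstamp           -- logically partition and order the rows
  MEASURES FIRST(tstamp)     AS start_tstamp,   -- expressions exported for each match
           LAST(down.tstamp) AS bottom_tstamp,
           LAST(up.tstamp)   AS end_tstamp
  ONE ROW PER MATCH
  PATTERN (strt down+ up+)                      -- regular-expression-style row pattern
  DEFINE                                        -- conditions mapping rows to variables
    down AS down.price < PREV(down.price),
    up   AS up.price   > PREV(up.price)
 ) MR;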
For more information and to register for this exciting webcast please visit the OLL Live website, see here: https://apex.oracle.com/pls/apex/f?p=44785:145:116820049307135::::P145_EVENT_ID,P145_PREV_PAGE:461,143

Please note - if the above link does not work then go to OLL (https://apex.oracle.com/pls/apex/f?p=44785:1:) and click the OLL Live icon (upper right, beneath the Login link or logout link if you are already logged in). The pattern matching webcast is listed on the calendar of events on 30 October.

Sunday Nov 04, 2012

Blueprints for Oracle NoSQL Database

I think that some of the most interesting analytic problems are graph problems.  I'm always interested in new ways to store and access graphs.  As such, I really like the work being done by Tinkerpop to create Open Source Software to make property graphs more accessible over a wide variety of datastores.  Since key-value stores like Oracle NoSQL Database are well-suited to storing property graphs, I decided to extend the Blueprints API to work with it.  Below I'll discuss some of the implementation details, but you can check out the finished product here: http://github.com/dwmclary/blueprints-oracle-nosqldb.

 What's in a Property Graph? 

In the most general sense, a graph is just a collection of vertices and edges.  Vertices and edges can have properties: weights, names, or any number of other traits.  In an undirected graph, edges connect vertices without direction.  A directed graph specifies that all edges have a head and a tail --- a direction.  A multi-graph allows multiple edges to connect two vertices.  A "property graph" encompasses all of these traits.

Key-Value Stores for Property Graphs

Key-Value stores like Oracle NoSQL Database tend to be ideal for implementing property graphs.  First, if any vertex or edge can have any number of traits, we can treat it as a hash map.  For example:

Vertex["name"] = "Mary"

Vertex["age"] = 28

Vertex["ID"] = 12345

and so on. This is a natural key-value relationship: the key "name" maps to the value "Mary". Moreover, if we maintain two hash maps, one for vertex objects and one for edge objects, we've essentially captured the graph. As such, any scalable key-value store is fertile ground for planting graphs.

Oracle NoSQL Database as a Scalable Graph Database

While Oracle NoSQL Database offers useful features like tunable consistency, what lends it to storing property graphs are the storage guarantees around its key structure. Keys in Oracle NoSQL Database are divided into two parts: a major key and a minor key. The storage guarantee is simple: major keys will be distributed across storage nodes, which could encompass a large number of servers, but all minor keys which are children of a given major key are guaranteed to be stored on the same storage node.

For example, two vertices with different major keys may be stored on different servers, but all of the records under a single vertex's major key (its properties, stored as minor keys) will always be on the same server. This means that we can structure our graph database such that retrieving all the properties for a vertex or edge requires I/O from only a single storage node. Moreover, Oracle NoSQL Database provides a storeIterator which allows us to store a huge number of vertices and edges in a scalable fashion. By storing the vertices and edges as major keys, we guarantee that they are distributed evenly across all storage nodes. At the same time we can use a partial major key to iterate over all the vertices or edges (e.g. we search over /Personnel/Vertex to iterate over all vertices).

Fork It!

The Blueprints API and Oracle NoSQL Database present a great way to get started using a scalable key-value database to store and access graph data.  However, a graph store isn't useful without a good graph to work on.  I encourage you to fork or pull the repository, store some data, and try using Gremlin or any other language to explore.


Friday Sep 02, 2011

So why do I need "so much extra space" for my partition SPLIT operation?


Monday Jun 13, 2011

Parallel Execution – Precedence of Hints and other Factors

The following table reflects the precedence of hints, session settings such as ALTER SESSION ENABLE PARALLEL DML, and other factors when using Auto DOP. It is also important to understand how the DML and query parts of a statement work together and how they influence each other.

All of the below is based on a simple statement:

insert into t3 select * from t1, t2 where t1.c1 = t2.c1;


Some explanatory words based on the lines in the table above:

Line 1 => The cleanest way to run PX statements, where Auto DOP gets to do its work and we will use the computed DOP of the statement

Line 4 => Because a FORCE parallel is used, we must ensure that the DOP > 1

Line 9 => The statement-level hint overrides all other means and we run the statement with the DOP in the hint

A word on the internal degree limit. This is NOT (repeat NOT) a parameter you can find or set, nor is there an underscore parameter for it. It is the built-in limit on the DOP (set to the default DOP, or parallel_degree_limit = CPU). It is an internal boundary to ensure we do not blast through the upper bound of CPU power. Also note that for any non-computed degree these boundaries and limits do not apply. That in itself is a reason to go look at and understand Auto DOP.
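
As a minimal sketch of a couple of the combinations described above (the DOP value is illustrative; the tables are the ones from the statement at the top of this post):

-- Enable parallel DML for the session; with Auto DOP and no hints the
-- statement runs with the computed DOP.
ALTER SESSION ENABLE PARALLEL DML;

INSERT INTO t3 SELECT * FROM t1, t2 WHERE t1.c1 = t2.c1;

-- Commit before touching T3 again in this session (required after parallel DML).
COMMIT;

-- A statement-level hint overrides the other means and the statement runs
-- with the DOP given in the hint.
INSERT /*+ PARALLEL(16) */ INTO t3
SELECT * FROM t1, t2 WHERE t1.c1 = t2.c1;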

Friday May 20, 2011

Screencasts on Parallel Execution are now Live

New screencasts discussing automatic parallelism in Oracle Database 11g Release 2 are now available.

The Data Warehouse Insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.

