Wednesday Mar 18, 2015

Production workloads blend Cloud and On-Premise Capabilities

Prediction #7 - blending production workloads across cloud and on-premise - in Oracle's Enterprise Big Data Predictions 2015 is a tough nut to crack. Yet, we at Oracle think this is really the direction we will all go. Sure, we can debate the timing, and whether or not this happens in 2015, but it is something that will come to all of us who are looking towards that big data future. So let’s discuss what we think is really going to happen over the coming years in the big data and cloud world.

Reality #1 – Data will live both in the cloud and on-premise

We see this today. Organizations run Human Capital Management systems in the cloud and integrate these with data from outside cloud-based systems (think, for example, LinkedIn, staffing agencies etc.) while their general data warehouses and new big data systems are all deployed as on-premise systems. We also see the example in the prediction where various auto dealer systems uplink into the cloud to enable the manufacturer to consolidate all of their disparate systems. This data may be translated into data warehouse entries and possibly live in two worlds, both in the cloud and on-premise, whether for deep-dive analytics or in aggregated form.

Reality #2 – Hybrid deployments are difficult to query and optimize

We also see this today, and it is one of the major issues of living in the hybrid world of cloud and on-premise. A lot of the issues are driven by low-level technical limitations, specifically network bandwidth and upload/download capacity into and out of the cloud environment. The other challenges are really (big) data management challenges, in that they go into the art of running queries across two ecosystems with very different characteristics. We see a trend to use engineered systems on-premise, which deliver optimized performance for the applications, but in the cloud we often see virtualization pushing the trade-off towards ease of deployment and ease of management. These completely different ecosystems make optimizing queries across them very difficult.

Solution – Equality brings optimizations to mixed environments

As larger systems like big data and data warehouse systems move to the cloud, better performance becomes a key success criterion. Oracle is uniquely positioned to drive both standardization and performance optimizations into the cloud by deploying on engineered systems like Oracle Exadata and Oracle Big Data Appliance. Deploying engineered systems enables customers to run large systems in the cloud with the same performance they see today in on-premise deployments. This then means that we do not live in a world divided into slow and fast, but in a world of fast and fast.
This equivalence also means that we have the same functionality in both worlds, and here we can sprinkle in some – future – Oracle magic, where we start optimizing queries to take into account where the data lives, how fast we can move it around (the dreaded networking bandwidth issue) and where we need to execute code. Now, how are we going to do this? That is a piece of magic, and you will just need to wait a bit… suffice it to say we are hard at work solving this challenging problem.

Wednesday Mar 11, 2015

Oracle Big Data Lite 4.1 VM is available on OTN

Oracle Big Data Lite 4.1 VM is now available for download on OTN. Big Data Lite includes many of the key capabilities of Oracle's big data platform. Each of the components has been configured to work together, and there are many hands-on labs and demonstrations to help you get started using the system. Below is a listing of what's included:

  • Oracle Enterprise Linux 6.5
  • Oracle Database 12c Release 1 Enterprise Edition (12.1.0.2) - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more.
  • Cloudera Distribution including Apache Hadoop (CDH5.3.0)
  • Cloudera Manager (5.3.0)
  • Oracle Big Data Connectors 4.1
    • Oracle SQL Connector for HDFS 3.2.0
    • Oracle Loader for Hadoop 3.3.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.4.1
    • Oracle XQuery for Hadoop 4.1.0
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.2.5)
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.0.3
  • Oracle Data Integrator 12cR1 (12.1.3)
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.1.1
  • Oracle Perfect Balance 2.3.0
  • Oracle CopyToBDA 1.1 

 

 Enjoy!

Tuesday Mar 10, 2015

Parallel DML on Tables with LOB Columns

Until 12c, running parallel DML on tables with LOB columns required the table to be partitioned. Starting with 12c, we allow parallel INSERTs into non-partitioned tables with LOB columns provided that those columns are declared as SecureFiles LOBs. Parallel UPDATE, DELETE, and MERGE are still not supported. The documentation on this has some errors that lead some customers to expect all parallel DML to be supported, so I wanted to point out the current behavior.

INSERTs into non-partitioned table with SecureFiles LOB column

Here is an example showing that the INSERT will be executed in parallel.

SQL> create table t (a number,b varchar2(200),c clob) lob(c) store as securefile;
Table created.

SQL> explain plan for insert /*+ parallel enable_parallel_dml */
  2  into t select /*+ parallel */ * from t;

Explained.

SQL> select * from table(dbms_xplan.display(format=>'basic +note'));


Other DML on non-partitioned table with SecureFiles LOB column

A DELETE statement, on the other hand, is executed serially, as indicated in the note section of the plan. The note also shows the reason, which is that the table is non-partitioned. The same behavior applies to UPDATE and MERGE statements.

SQL> explain plan for delete /*+ parallel enable_parallel_dml */  from t;
Explained.

SQL> select * from table(dbms_xplan.display(format=>'basic +note'));


DML on partitioned table with SecureFiles LOB column

For partitioned tables with LOB columns, all parallel DML operations are supported as before. Remember that the effective DOP will be the number of partitions; this behavior is also the same as before because we use the partitions as PX granules in this case.

This example shows that parallel DML is allowed on a partitioned table with LOB columns. V$PQ_SESSTAT shows we get a DOP of 4 even though the plan shows a DOP of 8; that is because the table has 4 partitions.

SQL> create table t (a number,b varchar2(200),c clob)
  2  lob(c) store as securefile partition by hash(a) partitions 4;

Table created.

SQL> explain plan for delete /*+ parallel(8) enable_parallel_dml */  from t;

Explained.

SQL> select * from table(dbms_xplan.display(format=>'basic +note'));



SQL> delete /*+ parallel(8) enable_parallel_dml */  from t;

0 rows deleted.

SQL> select * from V$PQ_SESSTAT;

STATISTIC                      LAST_QUERY SESSION_TOTAL     CON_ID
------------------------------ ---------- ------------- ----------
Queries Parallelized                    0             0          0
DML Parallelized                        1             1          0
DDL Parallelized                        0             0          0
DFO Trees                               1             1          0
Server Threads                          4             0          0
Allocation Height                       4             0          0
Allocation Width                        1             0          0
Local Msgs Sent                         8             0          0
Distr Msgs Sent                         0             0          0
Local Msgs Recv'd                       8             0          0
Distr Msgs Recv'd                       0             0          0
DOP                                     4             0          0
Slave Sets                              1             0          0

13 rows selected.

As always, we are fixing the documentation about this.

Monday Mar 09, 2015

Why SQL Part 2 - It has a powerful framework

In the first part (Why SQL is the natural language for data analysis) of this series of blog posts I explained why analytics is important to the business, how the volume of data along with the types of data is continuing to expand and why that makes it vital that you select the right language for data analysis. Many of us work with SQL every day and take for granted many of its unique features, its power, flexibility and the richness of the analytics.

This familiarity with SQL means that sometimes we are a bit slow to stop some of our projects from investing in other big data languages such as MapReduce, Impala, Spark, Java and many of the new generation of SQL-like open source projects. While many of these new languages are considered “cool”, it is very easy to end up building new proprietary data silos, investing in a language that is eventually replaced by another open source project, or investing time and effort in code that ultimately fails to deliver the required analytics.

One of the aims of this series of blog posts is to refresh your memory about why SQL has been so successful in the area of analytics. SQL has four unique features that make it perfect for data analysis...

    [Read More]

    Tuesday Mar 03, 2015

    Are you leveraging Oracle's database innovations for Cloud and Big data?

    If you are interested in big data, Hadoop, SQL and data warehousing then mark your calendars because on March 18th at 10:00AM PST/1:00PM EST, you will be able to hear Tom Kyte (Oracle Database Architect) talk about how you can use Oracle Big Data SQL to seamlessly integrate all your Hadoop big data datasets with your relational schemas stored in Oracle Database 12c. As part of this discussion Tom will outline how you can build the perfect foundation for your enterprise big data management system using Oracle's innovative technology.

    If you are working on a data warehousing project and/or a big data project, then this is one webcast you will not want to miss, so register today (click here) to hear the latest about Oracle Database innovations and best practices. The full list of speakers is:

    Tom Kyte
    Oracle Database Architect
    Keith Wilcox
    VP, Database Administration
    Epsilon
    Bill Callahan
    Director, Principal Engineer,
    CCC Information Services, Inc.

    Why SQL is the natural language for data analysis

    Analytics is a must-have component of every corporate data warehousing and big data project. It is the core driver for the business: the development of new products, better targeting of customers with promotions, hiring of new talent and retention of existing key talent. Yet the analysis of “big data environments” in particular, data stored and processed outside of classical relational systems, continues to be a significant challenge for the majority of companies. According to Gartner, 72% of companies are planning to increase their expenditure on big data yet 55% state they don’t have the necessary skills to make use of it.

    The objective of this series of articles, which will appear over the coming weeks, is to explain why SQL is the natural language for any kind of data analysis, including big data, and the benefits that this brings for application developers, DBAs and business users.

    [Read More]

    Wednesday Feb 25, 2015

    New Way to Enable Parallel DML

    This post was triggered by Jonathan Lewis' tweet here.

    The conventional way of enabling parallel DML is to enable it with an ALTER SESSION command as explained in the documentation. This command enables parallel DML for the session and all subsequent DML statements are candidates for parallel execution provided that the rules and restrictions for parallel DML are met.

    12c introduces a new way of enabling and disabling parallel DML. Rather than enabling or disabling it at the session level, you can enable or disable it at the statement level using new hints. The hint ENABLE_PARALLEL_DML enables parallel DML for the statement, and the hint DISABLE_PARALLEL_DML disables it for the statement. All rules and restrictions for parallel DML still apply; these hints are only alternatives to the related ALTER SESSION commands.
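
    As a quick sketch of the difference (the SALES and SALES_STAGE tables here are hypothetical), the two approaches look like this:

    -- Conventional, session-level approach:
    ALTER SESSION ENABLE PARALLEL DML;
    INSERT /*+ PARALLEL(4) */ INTO sales SELECT * FROM sales_stage;
    COMMIT;

    -- 12c statement-level approach, no session setting required:
    INSERT /*+ ENABLE_PARALLEL_DML PARALLEL(4) */ INTO sales SELECT * FROM sales_stage;
    COMMIT;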

    This example shows the plans for the same statement without and with the ENABLE_PARALLEL_DML hint.

    We can see that parallel DML is not enabled for the first statement, as indicated by the LOAD operation sitting above the PX COORDINATOR in the plan. In 12c the notes section also clearly shows that it is not enabled.

    Here we see that the notes section no longer reports parallel DML as disabled and the LOAD operation is under the PX COORDINATOR; both of these indicate that parallel DML is enabled.

    These two new hints are available starting with 12.1.0.1 and can be used regardless of the value of the OPTIMIZER_FEATURES_ENABLE parameter. They can be used in INSERT, UPDATE, DELETE, and MERGE statements. We are updating the related documentation to include these; I will update this post with links when the documentation is refreshed.

    Friday Feb 06, 2015

    Unified Query: SQL for All Seasons

    In a recent interview, the topic of "a SQL for All Seasons" came up.  Initially, the phrasing made me think we were going to talk about a database that staunchly refused to answer queries about divorce.  Instead, the conversation centered around the pain enterprises feel when dealing with polyglot persistence.  As much as we, as developers, may choose to avoid (or embrace) polyglot persistence, in large enterprises it's becoming unavoidable.

    What we focus on with Oracle Big Data SQL is unified query, and it's designed to be the complement to polyglot persistence.  Store data in the places the business deems correct, resulting in the "polyglot problem," but query it all simultaneously using a single SQL statement.  We think it's a pretty valuable concept, and it makes storing data in Hadoop or NoSQL stores for business or performance requirements easier to manage.  To explain the concept more fully, we've released a new whitepaper which considers why unified query is important, and what pitfalls can exist in some implementations.

    [Read More]

    Tuesday Jan 27, 2015

    MATCH_RECOGNIZE and the Optimizer

    If you have already been working with the new 12c pattern matching feature you will have probably spotted some new keywords appearing in your explain plans. Essentially there are four new keywords that you need to be aware of:
    • MATCH RECOGNIZE
    • SORT
    • BUFFER
    • DETERMINISTIC FINITE AUTO
    The first three bullet points are reasonably obvious (at least I hope they are!) but just in case… the MATCH RECOGNIZE keyword refers to the row source for evaluating the match_recognize clause. The SORT keyword means the row source sorts the data before running it through the state machine to find the matches. The last keyword is the most interesting and is linked to the use of a “state machine”, as mentioned in the previous sentence. Its appearance, or lack of it, affects the performance of your pattern matching query. The importance of this keyword is based on the way that pattern matching is performed.
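
    For reference, here is a minimal pattern matching query of the kind that produces these row sources in a plan (the TICKER table with SYMBOL, TSTAMP and PRICE columns is assumed purely for illustration):

    -- Find V-shaped price patterns (a fall followed by a rise) for each symbol:
    SELECT symbol, start_tstamp, bottom_tstamp, end_tstamp
    FROM   ticker
    MATCH_RECOGNIZE (
      PARTITION BY symbol
      ORDER BY tstamp
      MEASURES STRT.tstamp       AS start_tstamp,
               LAST(DOWN.tstamp) AS bottom_tstamp,
               LAST(UP.tstamp)   AS end_tstamp
      ONE ROW PER MATCH
      AFTER MATCH SKIP TO LAST UP
      PATTERN (STRT DOWN+ UP+)
      DEFINE
        DOWN AS DOWN.price < PREV(DOWN.price),
        UP   AS UP.price   > PREV(UP.price)
    );

    Running EXPLAIN PLAN for a statement like this is a quick way to check whether the DETERMINISTIC FINITE AUTO keyword appears alongside the MATCH RECOGNIZE row source for your own patterns.
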
    [Read More]

    Friday Jan 16, 2015

    Deploying SAS High Performance Analytics on Big Data Appliance

    Oracle and SAS have an ongoing commitment to our joint customers to deliver value-added technology integrations through engineered systems such as Exadata, Big Data Appliance, SuperCluster, Exalogic and ZFS Storage Appliance. Dedicated resources manage and execute on joint SAS/Oracle Database, Fusion Middleware, and Oracle Solaris integration projects, providing customer support, including sizing and IT infrastructure optimization and consolidation. Oracle support teams are onsite at SAS Headquarters in Cary, NC (USA) and in the field on a global basis.

    The latest in this effort is to enable our joint customers to deploy SAS High Performance Analytics on Big Data Appliance. This enables SAS users to leverage the lower-cost infrastructure Hadoop offers in a production-ready deployment on Oracle Big Data Appliance. Hear from Paul Kent (VP Big Data, SAS) on some of the details.

    Read more on deploying SAS High Performance Analytics on www.oracle.com/SAS. Don't miss the deployment guide and best practices here.

    Friday Dec 12, 2014

    Analytical SQL scripts now on Github

    After this year’s OpenWorld I ran a 1-day workshop on analytical SQL for our Data Warehouse and Big Data Global Leaders customers. This was part of the Global Leaders 'Days with Development' programme. We had a packed room at the Oracle Conference Center and I was very lucky to have Stew Ashton, Technical Architect, BNP Paribas, Dr. Holger Friedrich, CTO at sumIT AG and Joerg Otto, Head of DB Engineering, IDS GmbH co-present with me and explain how they were using analytical SQL in their projects.

     The workshop covered the following topics:

    • Analytic SQL concepts and foundations
    • Analytic functions for reporting
    • Analytic functions for aggregation
    • More advanced and new 12c features: Pattern Matching
    • SQL Model clause

    For the workshop I created a comprehensive slide deck (I will post the presentation shortly on our OTN home page) which included code samples and explain plans to highlight the key benefits of using our analytical SQL features and functions. The great news is that I now have a repository for analytical SQL code samples on GitHub. To kick things off on this new repository I have posted all the SQL scripts that I created for this workshop, so you can now download and work through a series of use cases that explain how to use window functions, intelligently aggregate data, manipulate rows and columns of data, find patterns and create what-if scenarios. Below is my repository home page where you can download the code:

    Github Repository for Analytical SQL

    So what is Github?

    At a simple level, it is an online version control system (and a lot more!) that stores and manages modifications to code script files within a central repository. Its key benefit is that it makes it very easy for developers to work together on a common project. This environment makes it easy to download a new version of code scripts, make changes, and upload the revisions to those files. Everyone can then see these new changes, download them, and contribute. This system is very popular with developers so we have decided to join this community and make our SQL scripts available via this site. It is the ability to “collaborate” which is most important for me.

    To help you get started there is a great section on the website called “Set Up Git”. If like me you are using a Mac then GitHub has a Mac client! You can use it without ever touching the command line interface (which can be a little frustrating at times!).

    You can contribute too!

    It would be great if you could contribute your own scripts to this repository so I can build up a library of scripts for analytical SQL. All you need to do is create an account on Github, search for the analytical SQL repository and then either download the repository as a zip file or use the “Clone in Desktop” option. What I want to do is build up a series of well documented use cases and when we have enough content then I will create industry specific folders to help organize the content.

    So my new repository is now live, please have a look and feel free to upload your own scripts that show how you have used analytical SQL to solve specific business problems within your projects. Looking forward to lots of files arriving into this great new repository. 

    Tuesday Dec 09, 2014

    X-Charging for Sandboxes

    This is the next part in my on-going series of posts on the topic of how to successfully manage sandboxes within an Oracle data warehouse environment. In Part 1 I provided an overview of sandboxing (key characteristics, deployment models) and introduced the concept of a lifecycle called BOX’D (Build, Observe, X-Charge and Drop). In Part 2 I briefly explored the key differences between data marts and sandboxes. Part 3 explored the Build-phase of our lifecycle. Part 4 explored the Observe-phase of our lifecycle, so we have now arrived at the X-Charge part of our model.

    To manage the chargeback process for our sandbox environment we are going to use the new Enterprise Manager 12c Cloud Management pack; for more information, visit the EM home page on OTN.

    Why charge for providing sandbox services? The simple answer is that placing a price or cost on a service ensures that the resources are used wisely. If a project team incurred zero costs for their database environment then there would be no incentive to evaluate the effectiveness of the data set, and the cost-benefit calculation for the project would be skewed by the lack of real-world cost data. This type of approach is the main reason why sandbox projects evolve over time into “production” data marts. Even if the project is not really delivering on its expected goals there is absolutely no incentive to kill the project and free up resources. Therefore, by not knowing the cost, it is impossible to establish the value...

    [Read More]

    Friday Dec 05, 2014

    Parallel Execution Fundamentals White Paper

    As my first blog post as the product manager for Parallel Execution (still known as Parallel Query by many people) I want to point out a new white paper about parallel execution fundamentals in the Oracle Database. You can find the paper here.

    Also, stay tuned for another one in which we will talk about using Database Resource Manager to manage concurrent workloads with parallel execution; this will be published soon.

    Please send any comments, questions and feedback about the paper, or generally about parallel execution in Oracle, to yasin.baskan@oracle.com, or even better use the comments section below. 

    Tuesday Nov 11, 2014

    Optimizing Table Scans with Zone Maps

    Most of you will be familiar with partition pruning, where the Oracle Database will avoid the need to scan table and index partitions based on query predicates. This optimization is transparent to your application, but for it to work, the database has to find a way of mapping a query filter predicate to the partitioning key column (or columns). Partition pruning can only occur if the query has predicates that match the predetermined shape of a partitioned object. For example, a query on a SALES table partitioned by ORDER_DATE will need to include ORDER_DATE in a join or WHERE clause for it to be optimized by partition pruning.
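
    For example (assuming a SALES table that is range partitioned on ORDER_DATE), pruning kicks in when the partitioning key appears in the WHERE clause:

    -- Only the partitions covering Q4 2014 are scanned; all other partitions are pruned:
    SELECT SUM(amount)
    FROM   sales
    WHERE  order_date >= DATE '2014-10-01'
    AND    order_date <  DATE '2015-01-01';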

    What if you could do better than this? What if you could prune partitions using a variety of column predicates and dimension hierarchies, irrespective of their appearance in the partitioning key? How about pruning at a much finer level of granularity than a partition? Perhaps we want to optimize queries that filter SALES by SHIP_DATE, STATE and COUNTY, as well as ORDER_DATE. The new Oracle 12.1.0.2 zone map feature is designed to achieve this, and just like partitioning, zone maps are transparent to your queries; you don’t have to change your applications to make use of them.

    Zone maps are available in Oracle Database 12c for Oracle Engineered Systems. Conceptually, they divide a table up into contiguous regions of blocks called zones (the default zone size being 1024 blocks). For each zone, the Oracle database records the minimum and maximum values for specified columns using a new database object called a zone map. Queries that filter on zone map columns have the potential to be optimized; it’s possible to prune zones that contain ranges of column values outside the match specified in the query predicate.

    Consider a query that filters a sales table by (North American) state; in this case “CA”. A zone map on the STATE column will record the minimum and maximum values for this column for each zone in the table. This makes it possible to skip the zones that we can be certain won’t contain rows for “CA”.

    Zone Map Representation
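
    As a minimal sketch of the DDL behind this idea (assuming the simple case above, where SALES itself has a STATE column), a basic zone map can be created directly on the filtered column:

    -- Record per-zone MIN/MAX values of STATE for the SALES table:
    CREATE MATERIALIZED ZONEMAP sales_state_zmap
    ON sales (state);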

    You are probably aware that Exadata storage cells and the Oracle Database In-Memory Column Store use similar storage index techniques, so what benefits do zone maps add? Besides the fact that you can control zone maps explicitly, the most significant difference between zone maps and storage indexes is that zone maps can be used to prune zones using column predicates from multiple (joined) tables. Consider a more realistic scenario, in which the SALES table doesn’t have a STATE column, but instead has a LOCATION_ID referencing a dimension table called LOCATIONS. This is our query for summing the sales figures in California:

    SELECT SUM(amount)
    FROM   sales     s,
           locations l
    WHERE  s.location_id = l.location_id
    AND    l.state = 'CA';

    It would be great if we could avoid scanning zones in SALES that don’t contain rows associated with “CA”. Before we look at how we can do this, we’ll make the scenario even more realistic by assuming that LOCATIONS is a dimensional hierarchy of State and County, like this:

    Zone Map Locations Table Example

     Each State is made up of multiple Counties, so “CA” will be associated with multiple LOCATION_ID values. If we want the “CA” rows in SALES, we’ll need to match the ones marked below in bold/red:

    Zone Map Sales Table Example

    If we want to optimize a scan for “CA” rows, we will have to address a few issues:

    • The SALES table does not have a STATE column, so no storage index structure on SALES data will allow us to directly prune disk regions based on “CA”.
    • Table rows associated with “CA” are likely to be physically scattered throughout the SALES table, so it’s unlikely that these rows will be confined to a relatively small number of zones or disk regions. We might not be able to make efficient use of an Exadata storage index on SALES.LOCATION_ID, if any (note that I am consciously ignoring the push down of BLOOM FILTERs to Exadata here, which still suffers from the physical scattering).
    • A SALES storage index based on min/max Location ID is likely to be less efficient than using zones based on min/max State values, simply because each State is made up of multiple Location IDs. This inefficiency is more pronounced if Location IDs for “CA” are not numerically close to one another - it will reduce the chances that the Location IDs we’re searching for will be found within the same min/max Location ID regions.

    Of course, zone maps are designed to address these issues - with a little bit of help from another Oracle Database 12c feature called attribute clustering.  I introduced attribute clustering in an earlier post, but don’t worry if you haven’t read that yet; I’ll cover the basics here anyway. You’ve probably deduced that we can reduce the number of zones that contain “CA” rows if we cluster or sort the rows in SALES, keeping these rows close to one another, like this:

    Table with Zone Map

    Attribute clustering is the feature that’s used to cluster the rows together. Zone maps are used to record the min/max values for specified columns for each zone (and this can include column values derived from joins; LOCATIONS.STATE and LOCATIONS.COUNTY in our case).

    The following DDL will create a zone map on our SALES fact table using the dimension table columns LOCATIONS.STATE and LOCATIONS.COUNTY. It will also enable attribute clustering, using the same columns to cluster the table’s rows:

    ALTER TABLE sales
    ADD CLUSTERING sales 
    JOIN locations ON (sales.location_id = locations.location_id)
    BY LINEAR ORDER (locations.state, locations.county)
    WITH MATERIALIZED ZONEMAP;
    

    The LINEAR ORDER clause specifies a linear clustering algorithm, which is ideal for this example. Another algorithm is available; it is specified with "INTERLEAVED" and is optimized for more complex combinations of query predicates and dimension tables. Note that the definition of attribute clustering by itself does not change any data stored on disk; instead, it provides a directive for direct path operations (INSERT APPEND and MOVE) that will physically perform the clustering operation for us. If there are pre-existing rows in SALES, we can MOVE the table (or its partitions) to re-order them.
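
    As a quick illustration of that last point (SALES_STAGING is a hypothetical staging table), new rows pick up the clustering via direct path loads, while pre-existing rows are re-ordered with a MOVE:

    -- Direct path loads honor the clustering directive for newly loaded rows:
    INSERT /*+ APPEND */ INTO sales SELECT * FROM sales_staging;
    COMMIT;

    -- Re-order existing rows (this also refreshes the zone map):
    ALTER TABLE sales MOVE;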

    Joins between SALES and the dimension table are now candidates for optimization when the query includes predicates on the dimension hierarchy “state” and “state, county”. For example:

    SELECT SUM(amount)
    FROM   sales
    JOIN   locations  ON (sales.location_id = locations.location_id) 
    WHERE  locations.state  = 'NM';
    
    SELECT SUM(amount)
    FROM   sales
    JOIN   locations  ON (sales.location_id = locations.location_id) 
    WHERE  locations.state  = 'CA'
    AND    locations.county = 'Kern';

    By clustering the rows and recording appropriate min/max column values for our zones, we have addressed all of the issues I identified above. What’s more, we can still get benefit from Exadata storage indexes because zone maps and storage indexes complement one another, and they work together transparently.

    Zone maps are explicitly created and controlled by the database administrator on a table-by-table basis. They are an inherent part of the physical database design and can be thought of as a coarse anti-index structure (unlike an index, a zone map tells you what zones not to access). Zone maps are very compact, and in some cases it is possible to use them where you would otherwise use an index. This is most relevant in data warehousing environments where scanning is often more appropriate than indexed row retrieval, and where indexes can use a considerable amount of storage space. Zone maps must be refreshed to be synchronized with the underlying table data, so you will need to give some consideration to how you want them to be kept up-to-date if you decide to use them as an alternative to indexes.
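
    For example, an explicitly created zone map (like the hypothetical SALES_STATE_ZMAP sketched earlier) can be rebuilt on demand once it has gone stale; see the documentation for the full set of refresh options:

    -- Bring a stale zone map back in sync with its table:
    ALTER MATERIALIZED ZONEMAP sales_state_zmap REBUILD;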

    In summary, take a look at zone maps if you want to:

    • Optimize scanning queries, particularly when joining with one or more tables.
    • Reduce your dependency on indexes, particularly in data warehousing environments.
    • Improve performance in your data warehouse, particularly for star or snowflake schemas.

    Here’s an example of using zone maps to optimize a table scan. To compare before and after, start by creating a table that has no zone map or attribute clustering:

    CREATE TABLE sales_zm (sale_id NUMBER(10), customer_id NUMBER(10));

    Insert 8 million rows with the following PL/SQL code. Why that many? With our example, we'll read one or two zones rather than the entire table, so I'm aiming to make the difference pretty obvious when you look at the block read statistics:

    DECLARE
      i NUMBER(10);
    BEGIN
      FOR i IN 1..80
      LOOP
        INSERT INTO sales_zm
        SELECT ROWNUM, MOD(ROWNUM,1000)
        FROM   dual
        CONNECT BY LEVEL <= 100000;
        COMMIT;
      END LOOP;
    END;
    /
    EXEC dbms_stats.gather_table_stats(ownname=>NULL, tabname=>'SALES_ZM');
    
    SET AUTOTRACE ON STATISTIC

    Run the following query a few times to see what value “consistent gets” settles at:

    SELECT COUNT(DISTINCT sale_id)
    FROM   sales_zm
    WHERE  customer_id = 50;

    On my machine, I read 7,545 blocks from the buffer cache, but since the value depends on some storage defaults don’t be surprised if your value is different:

    Before Zone Map

    The following DDL will create a zone map, but since attribute clustering is a property of the table (like compression), any existing rows will not be re-ordered:

    ALTER TABLE sales_zm ADD CLUSTERING BY LINEAR ORDER (customer_id) 
    WITH MATERIALIZED ZONEMAP;

    The zone map will not be efficient until we cluster the rows together, so we’ll MOVE the table to achieve this. This will refresh the zone map too:

    ALTER TABLE sales_zm MOVE;

    Run the same query a few times to see what value “consistent gets” settles at:

    SELECT COUNT(DISTINCT sale_id)
    FROM   sales_zm
    WHERE  customer_id = 50;

    On my database, I read around 1,051 database blocks instead of 7,545: a considerable improvement:

    After Zone Map

    You'll find more examples covering zone maps and attribute clustering in the Oracle Learning Library and inside the Oracle Github repository. Full details on zone maps and attribute clustering can be found in the Oracle documentation library; particularly the Oracle 12c Database Data Warehousing Guide.

    There's an earlier post on attribute clustering if you haven't read it already.

    If there's anything you need to ask, or if you can't find what you need regarding zone maps or attribute clustering, please let me know by leaving a comment below. Thanks!


    Thursday Oct 30, 2014

    Part 4 of DBAs guide to managing sandboxes - Observe

    This is the next part in my on-going series of posts on the topic of how to successfully manage sandboxes within an Oracle data warehouse environment. In Part 1 I provided an overview of sandboxing (key characteristics, deployment models) and introduced the concept of a lifecycle called BOX’D (Build, Observe, X-Charge and Drop). In Part 2 I briefly explored the key differences between data marts and sandboxes. Part 3 explored the Build-phase of our lifecycle.

    Now, in this post I am going to focus on the Observe-phase. At this stage in the lifecycle we are concerned with managing our sandboxes. Most modern data warehouse environments will be running hundreds of data discovery projects, so it is vital that the DBA can monitor and control the resources that each sandbox consumes by establishing rules for the resources available to each project, both in general terms and on a project-by-project basis.

    In most cases, DBAs will set up a sandbox with dedicated resources. However, this approach does not make efficient use of resources, since sharing unused resources across other projects is just not possible. The key advantage of Oracle Multitenant is its unique approach to resource management. The only realistic way to support thousands of sandboxes, which in today’s analytics-driven environments is entirely possible if not inevitable, is to allocate one chunk of memory and one set of background processes for each container database. This provides much greater utilisation of existing IT resources and greater scalability as multiple pluggable sandboxes are consolidated into the multitenant container database.
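
    As a simple sketch (names are purely illustrative, and storage clauses such as FILE_NAME_CONVERT are omitted), provisioning another sandbox then becomes a matter of creating and opening one more pluggable database inside the shared container:

    -- The new sandbox PDB shares the CDB's SGA and background processes
    -- with every other sandbox plugged into the container:
    CREATE PLUGGABLE DATABASE sandbox_mktg
      ADMIN USER sbx_admin IDENTIFIED BY "change_on_install";
    ALTER PLUGGABLE DATABASE sandbox_mktg OPEN;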

    Resources

    Using multitenant we can now expand and reduce our resources as required to match our workloads. In the example below we are running an Oracle RAC environment, with two nodes in the cluster. You can see that only certain PDBs are open on certain nodes of the cluster and this is achieved by opening the corresponding services on these nodes as appropriate. In this way we are partitioning the SGA across the various nodes of the RAC cluster. This allows us to achieve the scalability we need for managing lots of sandboxes. At this stage we have a lot of project teams running large, sophisticated workloads which is causing the system to run close to capacity as represented by the little resource meters.

    Expand 1

    It would be great if our DBA could add some additional processing power to this environment to handle this increased workload. With 12c what we can do is simply drop another node into the cluster, which allows us to spread the processing of the various sandbox workloads out across the expanded cluster.

    Expand 2

    Now our little resource meters are showing that the load on the system is a lot more comfortable. This shows that the new multitenant feature integrates really well with RAC. It’s a symbiotic relationship whereby Multitenant makes RAC better and RAC makes Multitenant better.

    So now we can add resources to the cluster how do we actually manage resources across each of our sandboxes? As a DBA I am sure that you are familiar with the features in Resource Manager that allow you to control system resources: CPU, sessions, parallel execution servers, Exadata I/O. If you need a quick refresher on Resource Manager then check out this presentation by Dan Norris “Overview of Oracle Resource Manager on Exadata” and the chapter on resource management in the 12c DBA guide.

    With 12c, Resource Manager is now multitenant-aware. Using Resource Manager we can configure policies to control how system resources are shared across the sandboxes/projects. Policies control how resources are utilised across PDBs, creating hard limits that can enforce a “get what you pay for” model, which is an important point when we move forward to the next phase of the lifecycle: X-Charge. Within Resource Manager we have adopted an “industry standard” approach to controlling resources based on two notions:

    1. a number of shares is allocated to each PDB
    2. a maximum utilization limit may be applied to each PDB

    To help DBAs quickly deploy PDBs with a pre-defined set of shares and utilisation limits there is a “Default” configuration that works, even as PDBs are added or removed. How would this work in practice? Using a simple example this is how we could specify resource plans for the allocation of CPU between three PDBs:

    RM 1

    As you can see, there are four total shares, 2 for the data warehouse and one each for our two sandboxes. This means that our data warehouse is guaranteed 50% of the CPU whatever else is going on in the other sandboxes (PDBs). Similarly each of our sandbox projects is guaranteed at least 25%. However, in this case we did not specify settings for maximum utilisation. Therefore, our marketing sandbox could use 100% of the CPU if both the data warehouse and the sales sandbox were idle.
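
    A sketch of how this allocation could be defined with the DBMS_RESOURCE_MANAGER CDB plan API (plan and PDB names are illustrative; note that no utilization limits are set, matching the example above):

    BEGIN
      DBMS_RESOURCE_MANAGER.CLEAR_PENDING_AREA();
      DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();

      -- CDB-level plan: 2 shares for the data warehouse PDB, 1 share for each sandbox PDB
      DBMS_RESOURCE_MANAGER.CREATE_CDB_PLAN(
        plan    => 'sandbox_plan',
        comment => 'DW gets 2 shares, each sandbox gets 1');

      DBMS_RESOURCE_MANAGER.CREATE_CDB_PLAN_DIRECTIVE(
        plan => 'sandbox_plan', pluggable_database => 'dw_pdb',    shares => 2);
      DBMS_RESOURCE_MANAGER.CREATE_CDB_PLAN_DIRECTIVE(
        plan => 'sandbox_plan', pluggable_database => 'sales_sbx', shares => 1);
      DBMS_RESOURCE_MANAGER.CREATE_CDB_PLAN_DIRECTIVE(
        plan => 'sandbox_plan', pluggable_database => 'mktg_sbx',  shares => 1);

      DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();
      DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
    END;
    /

    -- Activate the plan at the CDB root:
    ALTER SYSTEM SET RESOURCE_MANAGER_PLAN = 'sandbox_plan';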

    By using the “Default” profile we can simplify the whole process of adding and removing sandboxes/PDBs. As we add and remove sandboxes, the system resources are correctly rebalanced across all the plugged-in sandboxes/PDBs, using the settings in the default profile, as shown below.

    RM 2

    Summary

    In this latest post on sandboxing I have examined the “Observe” phase of our BOX’D sandbox lifecycle. With the new  multitenant-aware Resource Manager we can configure policies to control how system resources are shared across sandboxes. Using Resource Manager it is possible to configure a policy so that the first tenant in a large, powerful server experiences a realistic share of the resources that will eventually be shared as other tenants are plugged in.

    In the next post I will explore the next phase of our sandbox lifecycle, X-charge, which will cover the metering and chargeback services for pluggable sandboxes. 

    About

    The Data Warehouse Insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.
