Tuesday Mar 03, 2015

Are you leveraging Oracle's database innovations for Cloud and Big data?

If you are interested in big data, Hadoop, SQL and data warehousing then mark your calendars because on March 18th at 10:00AM PST/1:00PM EST, you will be able to hear Tom Kyte (Oracle Database Architect) talk about how you can use Oracle Big Data SQL to seamlessly integrate all your Hadoop big data datasets with your relational schemas stored in Oracle Database 12c. As part of this discussion Tom will outline how you can build the perfect foundation for your enterprise big data management system using Oracle's innovative technology.

If you are working on a data warehousing project and/or a big data project then this is one webcast you will not want to miss so register today (click here) to hear the latest about Oracle Database innovations and best practices. The full list of speakers is:

  • Tom Kyte, Oracle Database Architect
  • Keith Wilcox, VP, Database Administration, Epsilon
  • Bill Callahan, Director, Principal Engineer, CCC Information Services, Inc.

Tuesday Jan 27, 2015

MATCH_RECOGNIZE and the Optimizer

If you have already been working with the new 12c pattern matching feature you will probably have spotted some new keywords appearing in your explain plans. Essentially there are four new keywords that you need to be aware of:
  • MATCH RECOGNIZE
  • SORT
  • BUFFER
  • DETERMINISTIC FINITE AUTO
The first three bullet points are reasonably obvious (at least I hope they are!) but just in case: the keyword MATCH RECOGNIZE refers to the row source for evaluating the match_recognize clause. The SORT keyword means the row source sorts the data before running it through the state machine to find the matches.
The last keyword is the most interesting and is linked to the use of “state machine”, as mentioned in the previous sentence. Its appearance or lack of appearance affects the performance of your pattern matching query. The importance of this keyword is based on the way that pattern matching is performed. To search for a pattern containing a specific set of events we build something called a “state-machine”. At this point I will turn to Wikipedia to provide a definition of a state machine:
…a mathematical model of computation used to design both computer programs and sequential logic circuits. It is conceived as an abstract machine that can be in one of a finite number of states. The machine is in only one state at a time; the state it is in at any given time is called the current state. It can change from one state to another when initiated by a triggering event or condition; this is called a transition…
The classic example of a state machine is a traffic light which moves through a given sequence of events in a set order and always in that order: Red -> Red & Yellow -> Green -> Yellow. The traffic light model can also be viewed as a “deterministic” state machine. This implies that for every state there is exactly one transition for a given input, i.e. it is not possible to have two different transitions leading out of a particular state. With our traffic light state model it is clear that given a red light state there is only one next transition, which is to red & yellow.
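To make the idea concrete, here is a minimal Python sketch of the traffic light as a deterministic state machine. The transition table and the `run` helper are illustrative names, and the step from yellow back to red is an assumption that simply closes the cycle described above.

```python
# Transition table for the traffic light. Every state has exactly one
# successor, which is what makes this state machine deterministic.
# Assumption: yellow leads back to red to close the cycle.
TRANSITIONS = {
    "red": "red & yellow",
    "red & yellow": "green",
    "green": "yellow",
    "yellow": "red",
}


def run(start, steps):
    """Return the list of states visited, beginning with `start`."""
    state = start
    visited = [state]
    for _ in range(steps):
        state = TRANSITIONS[state]  # only one possible next state
        visited.append(state)
    return visited


sequence = run("red", 4)
print(" -> ".join(sequence))
```

Because each state maps to a single next state, the machine never has to choose between alternatives, so there is never anything to undo.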
Let’s use our normal stock ticker sample schema that tracks stock market prices for three ticker symbols. Let’s look at two very similar pattern matching queries:
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
   PARTITION BY symbol ORDER BY tstamp
   MEASURES STRT.tstamp AS start_tstamp,
            LAST(DOWN.tstamp) AS bottom_tstamp,
            LAST(UP.tstamp) AS end_tstamp
   ONE ROW PER MATCH
   AFTER MATCH SKIP TO LAST UP
   PATTERN (STRT DOWN* UP*)
   DEFINE
      DOWN AS DOWN.price < PREV(DOWN.price),
      UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.start_tstamp;

SELECT *
FROM Ticker
MATCH_RECOGNIZE (
   PARTITION BY symbol ORDER BY tstamp
   MEASURES STRT.tstamp AS start_tstamp,
            LAST(DOWN.tstamp) AS bottom_tstamp,
            LAST(UP.tstamp) AS end_tstamp
   ONE ROW PER MATCH
   AFTER MATCH SKIP TO LAST UP
   PATTERN (STRT DOWN UP)
   DEFINE
      DOWN AS DOWN.price < PREV(DOWN.price),
      UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.start_tstamp;


Match Recognize keywords in Optimizer explain plan

Note that the key difference between the two SQL statements is the PATTERN clause. The first statement checks for zero or more instances of two different events: 1) where the price in the current row is less than the price in the previous row and 2) where the price in the current row is more than the price in the previous row. The second statement checks for only one instance of each down-up pattern. This difference in the definition of the pattern results in different explain plans, where the plan for the second statement includes the key phrase “DETERMINISTIC FINITE AUTO”.

The phrase “DETERMINISTIC FINITE AUTO” means that the state machine that we constructed is deterministic and thus, when running the sorted rows through the state machine, we don’t do backtracking. The key benefit of building a “DETERMINISTIC FINITE AUTO” plan is that the execution is more efficient when there is no backtracking. I will write a separate blog post on backtracking very soon as it is a key concept in pattern matching. For the moment I will simply point you to the Wikipedia page on backtracking; personally I found the section headed “Description of the method” the most useful.
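As a rough illustration of what backtracking costs, here is a small Python sketch of a greedy matcher that counts how often it has to give a row back. It is not how Oracle implements MATCH_RECOGNIZE; the function name, the pattern encoding and the LOW variable are all hypothetical, and only the "exactly one" and "*" quantifiers are handled.

```python
def greedy_match(rows, pattern, define):
    """Match `pattern` against `rows`, starting at the first row.

    pattern: list of (variable, quantifier); quantifier is "" or "*".
    define:  dict of variable -> predicate(current, previous); a variable
             without an entry matches any row (like STRT in the queries).
    Returns (rows_consumed, backtracks); rows_consumed is None on failure.
    """
    backtracks = 0

    def ok(var, i):
        pred = define.get(var)
        if pred is None:
            return True
        prev = rows[i - 1] if i > 0 else None
        return pred(rows[i], prev)

    def step(k, i):
        nonlocal backtracks
        if k == len(pattern):
            return i
        var, quant = pattern[k]
        if quant == "":
            if i < len(rows) and ok(var, i):
                return step(k + 1, i + 1)
            return None
        # "*": consume as many rows as possible, then hand rows back
        # one at a time until the rest of the pattern matches.
        j = i
        while j < len(rows) and ok(var, j):
            j += 1
        while True:
            end = step(k + 1, j)
            if end is not None:
                return end
            if j == i:
                return None
            j -= 1
            backtracks += 1

    return step(0, 0), backtracks


define = {
    "DOWN": lambda cur, prev: prev is not None and cur < prev,
    "UP": lambda cur, prev: prev is not None and cur > prev,
    "LOW": lambda cur, prev: cur < 9,  # hypothetical; overlaps with DOWN
}
no_overlap = greedy_match([10, 9, 8, 9], [("STRT", ""), ("DOWN", "*"), ("UP", "")], define)
overlap = greedy_match([10, 9, 8, 7], [("STRT", ""), ("DOWN", "*"), ("LOW", "")], define)
print(no_overlap, overlap)
```

In the first call DOWN and UP can never both be true for the same row, so the matcher never has to undo a decision. In the second call the last falling row also satisfies LOW, so the greedy DOWN* has to release it before the pattern can complete.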

When we analyze the PATTERN clause and build the corresponding state machine we are able to detect a deterministic finite automaton by checking the state machine. If any state has two or more outgoing transitions then we regard the state machine as non-deterministic. Likewise, if any final state is followed by a non-final state then the state machine is regarded as non-deterministic. At the moment we can only detect a few trivial cases such as PATTERN (A B C), PATTERN (A B+), PATTERN (A B*), etc.
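The two rules above can be sketched in a few lines of Python. This is only an illustration of the rules as stated, not the optimizer's actual code; the state numbering and the hand-built transition lists are my own hypothetical encodings of each PATTERN clause.

```python
from collections import Counter


def is_deterministic(transitions, final_states):
    """Apply the two rules from the text to a hand-built state machine.

    transitions:  list of (from_state, pattern_variable, to_state)
    final_states: states in which a match may end
    """
    # Rule 1: two or more outgoing transitions from any state.
    outgoing = Counter(src for src, _var, _dst in transitions)
    if any(count >= 2 for count in outgoing.values()):
        return False
    # Rule 2: a final state followed by a non-final state.
    finals = set(final_states)
    if any(src in finals and dst not in finals for src, _var, dst in transitions):
        return False
    return True


# Hypothetical encodings: (transitions, final states) for each pattern.
MACHINES = {
    "A B C": ([(0, "A", 1), (1, "B", 2), (2, "C", 3)], {3}),
    "A B+": ([(0, "A", 1), (1, "B", 2), (2, "B", 2)], {2}),
    "A B*": ([(0, "A", 1), (1, "B", 1)], {1}),
    "A | B": ([(0, "A", 1), (0, "B", 1)], {1}),
    "A B+ C": ([(0, "A", 1), (1, "B", 2), (2, "B", 2), (2, "C", 3)], {3}),
}

for pattern, (transitions, finals) in MACHINES.items():
    print(pattern, is_deterministic(transitions, finals))
```

Note that the check looks only at the shape of the machine. It never inspects the DEFINE predicates, which is why a pattern such as A B+ C is treated as non-deterministic even when B and C could never both be true for the same row.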
The first of these detectable patterns is shown above (see the second statement, which uses the STRT DOWN UP pattern) and examples of the other two types of deterministic pattern are shown below:
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
   PARTITION BY symbol ORDER BY tstamp
   MEASURES STRT.tstamp AS start_tstamp,
            LAST(DOWN.tstamp) AS bottom_tstamp,
            LAST(UP.tstamp) AS end_tstamp
   ONE ROW PER MATCH
   PATTERN (STRT DOWN UP+)
   DEFINE
      DOWN AS DOWN.price < PREV(DOWN.price),
      UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.start_tstamp;

SELECT *
FROM Ticker
MATCH_RECOGNIZE (
   PARTITION BY symbol ORDER BY tstamp
   MEASURES STRT.tstamp AS start_tstamp,
            LAST(DOWN.tstamp) AS bottom_tstamp,
            LAST(UP.tstamp) AS end_tstamp
   ONE ROW PER MATCH
   PATTERN (STRT DOWN UP*)
   DEFINE
      DOWN AS DOWN.price < PREV(DOWN.price),
      UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.start_tstamp;

Match Recognize keywords in Optimizer explain plan

For PATTERN (A | B) or PATTERN (A B+ C) we simply regard the state machine as non-deterministic. Therefore, the explain plans only contain the keywords MATCH RECOGNIZE (SORT), as shown below:
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
   PARTITION BY symbol ORDER BY tstamp
   MEASURES STRT.tstamp AS start_tstamp,
            LAST(DOWN.tstamp) AS bottom_tstamp,
            LAST(UP.tstamp) AS end_tstamp
   ONE ROW PER MATCH
   PATTERN (STRT | DOWN | UP)
   DEFINE
      DOWN AS DOWN.price < PREV(DOWN.price),
      UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.start_tstamp;

SELECT *
FROM Ticker
MATCH_RECOGNIZE (
   PARTITION BY symbol ORDER BY tstamp
   MEASURES STRT.tstamp AS start_tstamp,
            LAST(DOWN.tstamp) AS bottom_tstamp,
            LAST(UP.tstamp) AS end_tstamp
   ONE ROW PER MATCH
   PATTERN (STRT DOWN* UP)
   DEFINE
      DOWN AS DOWN.price < PREV(DOWN.price),
      UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.start_tstamp;

Within the current version of 12c (12.1.0.2) we are not checking the mutual exclusiveness of the DEFINE predicates when detecting a deterministic state machine. Therefore, the execution plan defaults to a MATCH RECOGNIZE (SORT) style plan, where we may or may not have to use backtracking. Obviously, as we continue to develop the MATCH_RECOGNIZE feature we will expand our ability to detect a deterministic state machine, which means we will process your pattern more efficiently.
In summary, if you want the most efficient execution plan then try to define your pattern in such a way that we are able to create a deterministic state machine. This assumes, of course, that backtracking is not needed within each partition/data set in order to identify the required pattern (more on this in my next blog post).
Hope this information is useful. If you have any questions then feel free to contact me directly (keith.laker@oracle.com).

Tuesday Dec 09, 2014

X-Charging for Sandboxes

This is the next part in my on-going series of posts on the topic of how to successfully manage sandboxes within an Oracle data warehouse environment. In Part 1 I provided an overview of sandboxing (key characteristics, deployment models) and introduced the concept of a lifecycle called BOX’D (Build, Observe, X-Charge and Drop). In Part 2 I briefly explored the key differences between data marts and sandboxes. Part 3 explored the Build-phase of our lifecycle. Part 4 explored the Observe-phase of our lifecycle, so we have now arrived at the X-Charge part of our model.

To manage the chargeback process for our sandbox environment we are going to use the new Enterprise Manager 12c Cloud Management pack. For more information visit the EM home page on OTN.

Why charge for providing sandbox services? The simple answer is that placing a price or cost on a service ensures that the resources are used wisely. If a project team incurs zero costs for its database environment then there is no incentive to evaluate the effectiveness of the data set, and the cost-benefit calculation for the project is skewed by the lack of real-world cost data. This type of approach is the main reason why sandbox projects evolve over time into “production” data marts. Even if the project is not really delivering on its expected goals there is absolutely no incentive to kill the project and free up resources. Therefore, by not knowing the cost, it is impossible to establish the value.

The benefit of metering and x-charging is that it enables project teams to focus on the real value of their analysis. If all analysis is free then it is almost impossible to quantify the benefits or costs of a particular analysis. Project teams can also use x-charging as a way to adjust their consumption of resources and control their IT costs. It benefits the IT team as it enables them to achieve higher utilisation rates across their servers. Most importantly, the cost element attached to running a sandbox acts as a strong incentive to finalize and shut down sandboxes, ensuring that they do not morph into uncontrolled marts.

There is a fantastic whitepaper that explores the much wider topic of metering and chargeback within a cloud environment. It is available on the Enterprise Manager webpage; click here to view the whitepaper.

Overview

Enterprise Manager 12c uses the rich monitoring and configuration data that is collected for Enterprise Manager targets as the basis for a metering and chargeback solution. Enterprise Manager Chargeback provides the administrator with:

  • Assignment of rates to metered resources
  • Management of a cost center hierarchy
  • Assignment of resources to cost centers
  • Usage and charge-back reports

This set of features can be used to implement a chargeback regime for analytical sandboxes. There is a rich set of APIs that allow you to extract metering and charge data so that it can be incorporated into enterprise billing solutions such as the Oracle Billing and Revenue Management application.

Setting up an x-charging framework for our analytical sandboxes involves three key stages:

  • Creating chargeback plans for resources and database options
  • Defining users and cost centers to “take” charges
  • Reporting on usage and charges
Let’s look at each of these stages in more detail:

Step 1: Creating charge plans

A Charge Plan is created by the DBA and it defines the metered resources along with the associated rates. Enterprise Manager Chargeback offers two types of Charge Plan – Universal Charge Plan and Extended Charge Plans.

The Universal Charge Plan is the simplest way to enable chargeback for sandboxes and is probably adequate for the vast majority of projects. It contains just 3 metrics:

  • CPU Usage
  • Memory Allocation
  • Storage Allocation

and the DBA can set the rates for each metric as shown here:

Charge Plans

Even with this basic profile you can implement quite sophisticated charging models. It is possible to vary the rates used in charge calculations by month/period. Each “period” is known as a “Reporting Cycle”. If rates are modified, the updated rates will be used to re-calculate the charges for all days from the first of the current period onwards.
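The arithmetic behind a Universal Charge Plan can be sketched as follows. This is a minimal illustration, not Enterprise Manager's actual calculation: the rates, units and usage figures are invented, and the function names are hypothetical.

```python
# Hypothetical rates per unit per day; real rates are set by the DBA.
RATES = {"cpu": 2.00, "memory_gb": 0.50, "storage_gb": 0.10}


def daily_charge(usage, rates):
    """Charge for one day: usage of each metric multiplied by its rate."""
    return sum(usage[metric] * rate for metric, rate in rates.items())


def period_charge(daily_usage, rates):
    """Charge for every day of the reporting cycle so far."""
    return sum(daily_charge(usage, rates) for usage in daily_usage)


# Ten days of (invented) usage for one sandbox in the current cycle.
usage_so_far = [{"cpu": 4, "memory_gb": 16, "storage_gb": 200}] * 10

original = period_charge(usage_so_far, RATES)

# A rate change mid-cycle: the whole cycle is re-calculated from day one.
updated_rates = dict(RATES, cpu=3.00)
recalculated = period_charge(usage_so_far, updated_rates)

print(round(original, 2), round(recalculated, 2))
```

The point of the second calculation is that a rate change is not applied only from today onwards; every day since the first of the current reporting cycle is charged again at the new rate.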

Some projects may need access to analytical features that are costed database options. For example, if a project needs to build data mining models then it will require the Oracle Advanced Analytics option. Alternatively, supporting semantic analysis or social network analysis requires the use of the Spatial and Graph option. Extended Charge Plans allow the DBA to factor in charging for database options alongside the standard charging metrics of the Universal Charge Plan. For database options it makes sense to make use of the ability to create fixed cost charges to effectively “rent out” each option for each sandbox environment. Of course if a project suddenly decides it needs access to a specific type of analytical option, such as in-memory, it is simply a case of adding the relevant cross-charge item to the profile for the specific sandbox and the project team can start using that feature right away (assuming the database instance has the correct options pre-installed).

Charge Plans Extended
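A fixed "rental" fee per database option can be layered on top of the usage-based charge in a very simple way. The sketch below assumes a flat fee per option per reporting cycle; the option names and fee amounts are made up for illustration.

```python
# Hypothetical fixed fees per reporting cycle for each database option.
OPTION_FEES = {
    "advanced_analytics": 500,
    "spatial_and_graph": 300,
    "in_memory": 400,
}


def sandbox_charge(usage_charge, options, option_fees):
    """Usage-based charge plus a fixed fee for each option the sandbox rents."""
    return usage_charge + sum(option_fees[option] for option in options)


# A sandbox that starts with one option and later adds in-memory.
usage_charge = 360  # invented usage-based charge for the cycle
before = sandbox_charge(usage_charge, ["advanced_analytics"], OPTION_FEES)
after = sandbox_charge(usage_charge, ["advanced_analytics", "in_memory"], OPTION_FEES)
print(before, after)
```

Adding an option to the sandbox's profile only adds one more line item, which is what makes it easy for a project team to pick up a new feature part way through a project.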


Step 2: Setting up users and cost centers

When administering a self-service analytic sandbox, it is necessary to meter resource consumption for each self-service user. These costs then need to be rolled up to an aggregate level, such as cost centers, to generate a total charge for each department/project team accessing the sandbox. For ease of administration and chargeback the self-service users can be represented within a Cost Center structure. Each cost center contains a list of “consumers” who have access to the sandbox and, of course, its associated resources. The cost centers can be organized in a hierarchical fashion to support aggregation and drill down within the cost analysis or billing reports. A typical cost center hierarchy within a project might look something like this:

Cost center hierarchy
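Rolling per-user charges up through a cost center hierarchy can be sketched like this. The hierarchy, the consumers and the amounts are all invented, and the code illustrates the aggregation idea rather than how Enterprise Manager stores cost centers.

```python
# Hypothetical hierarchy: cost center -> parent (None marks the root).
PARENT = {
    "Marketing Analytics": None,
    "Campaign Team": "Marketing Analytics",
    "Churn Team": "Marketing Analytics",
}

# Hypothetical consumers, their cost centers and their metered charges.
CONSUMER_COST_CENTER = {"alice": "Campaign Team", "bob": "Campaign Team", "carol": "Churn Team"}
CHARGES = {"alice": 120, "bob": 80, "carol": 200}


def roll_up(charges, consumer_cost_center, parent):
    """Total charge per cost center, including everything beneath it."""
    totals = {cost_center: 0 for cost_center in parent}
    for consumer, amount in charges.items():
        cost_center = consumer_cost_center[consumer]
        # Add the charge to the consumer's cost center and every ancestor.
        while cost_center is not None:
            totals[cost_center] += amount
            cost_center = parent[cost_center]
    return totals


totals = roll_up(CHARGES, CONSUMER_COST_CENTER, PARENT)
print(totals)
```

The same totals support both directions of analysis: the root gives the bill for the whole project, while the lower levels show which team the charges came from.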

Step 3: Chargeback Reports

Any chargeback solution will involve reporting so that users can understand how their use of the sandbox (storing data, running reports etc) translates to charges. Enterprise Manager provides reports that show both resource usage and charging information. This is broken down into two categories of reports: summary and trending reports.

Summary Reports show information related to charge or resource utilisation broken down by cost center, target type and resource. These reports allow both sandbox owners and business users to drill down and quickly analyse charges in terms of type of target (database instance, host operating environment, virtual machine etc) or cost centers as shown below.

EM summary report

Trending Reports show metric or charge trends over time and are useful for project teams who want to see how their charges change over time. At an aggregate level the I.T. team can use this information to help them with capacity planning. A report of CPU usage is shown below.

EM trend report

What’s missing?

While this latest version of Enterprise Manager has some great features for managing analytical sandboxes it would be really useful if the project team could enter a total budget for their sandbox. This budget could then be shown on graphs such as the trending report. It would be useful to know how much of the budget has been spent, how many days or periods of budget remain based on current spending patterns etc. Of course once the budget has been used up it would be useful if the sandbox could be locked - this would focus the minds of the project team and ensure that a sandbox does not evolve into a “live” data mart. Which brings us nicely to the next blog post which will be on the final part of our lifecycle model: ensuring that sandboxes have a “Drop” phase.

If you want more information about how to set up the chargeback plans then there is a great video on the Oracle Learning Library: Oracle Enterprise Manager 12c: Setup and Use Chargeback.

Thursday Oct 30, 2014

Part 4 of DBAs guide to managing sandboxes - Observe

This is the next part in my on-going series of posts on the topic of how to successfully manage sandboxes within an Oracle data warehouse environment. In Part 1 I provided an overview of sandboxing (key characteristics, deployment models) and introduced the concept of a lifecycle called BOX’D (Build, Observe, X-Charge and Drop). In Part 2 I briefly explored the key differences between data marts and sandboxes. Part 3 explored the Build-phase of our lifecycle.

Now, in this post I am going to focus on the Observe-phase. At this stage in the lifecycle we are concerned with managing our sandboxes. Most modern data warehouse environments will be running hundreds of data discovery projects so it is vital that the DBA can monitor and control the resources that each sandbox consumes by establishing rules to control the resources available to each project both in general terms and specifically for each project.  

In most cases, DBAs will set up a sandbox with dedicated resources. However, this approach does not make efficient use of resources since sharing of unused resources across other projects is just not possible. The key advantage of Oracle Multitenant is its unique approach to resource management. The only realistic way to support thousands of sandboxes, which in today’s analytically driven environments is entirely possible if not inevitable, is to allocate one chunk of memory and one set of background processes for each container database. This provides much greater utilisation of existing IT resources and greater scalability as multiple pluggable sandboxes are consolidated into the multitenant container database.

Resources

Using multitenant we can now expand and reduce our resources as required to match our workloads. In the example below we are running an Oracle RAC environment, with two nodes in the cluster. You can see that only certain PDBs are open on certain nodes of the cluster and this is achieved by opening the corresponding services on these nodes as appropriate. In this way we are partitioning the SGA across the various nodes of the RAC cluster. This allows us to achieve the scalability we need for managing lots of sandboxes. At this stage we have a lot of project teams running large, sophisticated workloads which is causing the system to run close to capacity as represented by the little resource meters.

Expand 1

It would be great if our DBA could add some additional processing power to this environment to handle this increased workload. With 12c what we can do is simply drop another node into the cluster, which allows us to spread the processing of the various sandbox workloads out across the expanded cluster.

Expand 2

Now our little resource meters are showing that the load on the system is a lot more comfortable. This shows that the new multitenant feature integrates really well with RAC. It’s a symbiotic relationship whereby Multitenant makes RAC better and RAC makes Multitenant better.

So now that we can add resources to the cluster, how do we actually manage resources across each of our sandboxes? As a DBA I am sure that you are familiar with the features in Resource Manager that allow you to control system resources: CPU, sessions, parallel execution servers, Exadata I/O. If you need a quick refresher on Resource Manager then check out this presentation by Dan Norris “Overview of Oracle Resource Manager on Exadata” and the chapter on resource management in the 12c DBA guide.

With 12c, Resource Manager is now multitenant-aware. Using Resource Manager we can configure policies to control how system resources are shared across the sandboxes/projects. Policies control how resources are utilised across PDBs, creating hard limits that can enforce a “get what you pay for” model, which is an important point when we move forward to the next phase of the lifecycle: X-Charge. Within Resource Manager we have adopted an “industry standard” approach to controlling resources based on two notions:

  1. a number of shares is allocated to each PDB
  2. a maximum utilization limit may be applied to each PDB

To help DBAs quickly deploy PDBs with a pre-defined set of shares and utilisation limits there is a “Default” configuration that works, even as PDBs are added or removed. How would this work in practice? Using a simple example this is how we could specify resource plans for the allocation of CPU between three PDBs:

RM 1

As you can see, there are four shares in total: two for the data warehouse and one each for our two sandboxes. This means that our data warehouse is guaranteed 50% of the CPU whatever else is going on in the other sandboxes (PDBs). Similarly each of our sandbox projects is guaranteed at least 25%. However, in this case we did not specify settings for maximum utilisation. Therefore, our marketing sandbox could use 100% of the CPU if both the data warehouse and the sales sandbox were idle.
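The share arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation described above, with no maximum utilisation limits applied; the function and PDB names are hypothetical.

```python
def cpu_entitlement(shares, active=None):
    """Percentage of CPU each active PDB is entitled to, based on shares.

    shares: dict of PDB name -> number of shares
    active: PDBs currently demanding CPU (defaults to all of them)
    """
    active = list(shares) if active is None else active
    total = sum(shares[pdb] for pdb in active)
    return {pdb: 100 * shares[pdb] / total for pdb in active}


SHARES = {"data_warehouse": 2, "marketing_sandbox": 1, "sales_sandbox": 1}

# Everything busy: each PDB gets its guaranteed minimum.
all_busy = cpu_entitlement(SHARES)

# Data warehouse and sales sandbox idle, and no utilisation limit set.
only_marketing = cpu_entitlement(SHARES, active=["marketing_sandbox"])

print(all_busy, only_marketing)
```

Shares therefore describe a guaranteed minimum under contention, not a cap. A cap only comes into play once a maximum utilisation limit is set for a PDB.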

By using the “Default” profile we can simplify the whole process of adding and removing sandboxes/PDBs. As we add and remove sandboxes, the system resources are correctly rebalanced across all the plugged-in sandboxes/PDBs, using the settings specified in the default profile, as shown below.

RM 2
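The rebalancing effect of a default profile can be sketched as follows. The sketch assumes that any PDB without an explicit setting receives a default of one share; that default value and all the names are assumptions for illustration only.

```python
DEFAULT_SHARES = 1  # assumed default for PDBs without an explicit setting


def plan_for(plugged_in, explicit_shares, default_shares=DEFAULT_SHARES):
    """Guaranteed CPU percentage for every plugged-in PDB."""
    shares = {pdb: explicit_shares.get(pdb, default_shares) for pdb in plugged_in}
    total = sum(shares.values())
    return {pdb: round(100 * count / total, 1) for pdb, count in shares.items()}


explicit = {"data_warehouse": 2}  # only the warehouse has its own setting

two_sandboxes = plan_for(["data_warehouse", "sandbox_1", "sandbox_2"], explicit)
three_sandboxes = plan_for(["data_warehouse", "sandbox_1", "sandbox_2", "sandbox_3"], explicit)

print(two_sandboxes)
print(three_sandboxes)
```

Plugging in the third sandbox requires no change to the plan itself: it picks up the default and every guarantee is recalculated from the new total.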

Summary

In this latest post on sandboxing I have examined the “Observe” phase of our BOX’D sandbox lifecycle. With the new  multitenant-aware Resource Manager we can configure policies to control how system resources are shared across sandboxes. Using Resource Manager it is possible to configure a policy so that the first tenant in a large, powerful server experiences a realistic share of the resources that will eventually be shared as other tenants are plugged in.

In the next post I will explore the next phase of our sandbox lifecycle, X-charge, which will cover the metering and chargeback services for pluggable sandboxes. 

Friday Sep 26, 2014

Why SQL is becoming the go-to language for Big Data analysis

Since the term big data first appeared in our lexicon of IT and business technology it has been intrinsically linked to the no-SQL, or anything-but-SQL, movement. However, we are now seeing that SQL is experiencing a renaissance. The term “noSQL” has softened to a much more realistic “not-only-SQL” approach. And now there is an explosion of SQL-based implementations designed to support big data. Leveraging the Hadoop ecosystem, there are Hive, Stinger, Impala, Shark, Presto and many more. Other NoSQL vendors such as Cassandra are also adopting flavors of SQL. Why is there a growing level of interest in the reemergence of SQL? Probably, a more pertinent question is: did SQL ever really go away? Proponents of SQL often cite the following explanations for the re-emergence of SQL for analysis:

  1. There are legions of developers who know SQL. Leveraging the SQL language allows those developers to be immediately productive.
  2. There are legions of tools and applications using SQL today.
  3. Any platform that provides SQL will be able to leverage the existing SQL ecosystem.

However, despite the virtues of these explanations, they alone do not explain the recent proliferation of SQL implementations. Consider this: how often does the open-source community embrace a technology just because it is the corporate orthodoxy? The answer is: probably not ever. If the open-source community believed that there was a better language for basic data analysis, they would be implementing it. Instead, a huge range of emerging projects, as mentioned earlier, have SQL at their heart. The simple conclusion is that SQL has emerged as the de facto language for big data because, frankly, it is technically superior. Let’s examine the four key reasons for this:

  1. SQL is a natural language for data analysis.
  2. SQL is a productive language for writing queries.
  3. SQL queries can be optimised.
  4. SQL is extensible.

1. SQL is a natural language for data analysis.

The concept of SQL is underpinned by the relational algebra - a consistent framework for organizing and manipulating sets of data - and the SQL syntax concisely and intuitively expresses this mathematical system.

Most business users, data analysts and even data scientists think about data within the context of a spreadsheet. If you think about a spreadsheet containing a set of customer orders then what do most people do with that spreadsheet? Typically, they might filter the records to look only at the customer orders for a given region. Alternatively, they might hide some columns: maybe the customer address is not needed for a particular piece of analysis, but the customer name and their orders are important data points. Finally, they might add calculations to compute totals and/or perhaps create a cross tabular report.

Within the language of SQL these are common steps: 1) projections (SELECT), 2) filters and joins (WHERE), and 3) aggregations (GROUP BY). These are core operators in SQL. The vast majority of people have found the fundamental SQL query constructs to be a straightforward and readable representation of everyday data analysis operations.
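The spreadsheet steps map onto a single query. The sketch below runs it through Python's built-in sqlite3 module purely so that it is self-contained; the orders table and its rows are invented, and nothing here is specific to Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, region TEXT, address TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        ("Acme", "West", "1 Main St", 100.0),
        ("Acme", "West", "1 Main St", 50.0),
        ("Bolt", "West", "9 Oak Ave", 75.0),
        ("Core", "East", "5 Elm Rd", 20.0),
    ],
)

rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total  -- projection: address is left out
    FROM orders
    WHERE region = 'West'                  -- filter: one region only
    GROUP BY customer                      -- aggregation: total per customer
    ORDER BY customer
    """
).fetchall()

print(rows)
```

Each clause corresponds to one of the things a spreadsheet user would do by hand: hiding a column, filtering rows and adding a total.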

2. SQL is a productive language for writing queries.

When a developer writes a SQL query, he or she simply describes the results that they want. The developer does not have to get into any of the nitty-gritty of describing how to get the results.

This type of approach is often referred to as “declarative programming”, and it makes the developer's job easier. Even the simplest SQL query illustrates the benefits of declarative programming:

SELECT day, prcp, temp FROM weather
WHERE city = 'San Francisco' AND prcp > 0.0;

SQL engines may have multiple ways to execute this query (for example, by using an index). Fortunately the developer doesn't need to understand any of the underlying database processing techniques. The developer simply specifies the desired set of data using projections (SELECT) and filters (WHERE).

This is perhaps why SQL has emerged as such an attractive alternative to the MapReduce framework for analyzing HDFS data. MapReduce requires the developer to specify, at each step, how the underlying data is to be processed. For the same “query”, the code is longer and more complex in MapReduce. For the vast majority of data analysis requirements, SQL is more than sufficient, and the additional expressiveness of MapReduce introduces complexity without providing significant benefits.
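The contrast can be shown on the weather query from above. The sketch below is not real MapReduce code; it simply puts the declarative query (run via Python's sqlite3 module) next to a hand-written loop that spells out how to scan and test each record. The sample rows are invented.

```python
import sqlite3

# Invented sample data: (day, prcp, temp, city)
WEATHER = [
    ("2014-09-01", 0.0, 18.0, "San Francisco"),
    ("2014-09-02", 0.3, 16.0, "San Francisco"),
    ("2014-09-02", 1.2, 14.0, "Hayward"),
]

# Declarative: say what is wanted and let the engine decide how.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (day TEXT, prcp REAL, temp REAL, city TEXT)")
conn.executemany("INSERT INTO weather VALUES (?, ?, ?, ?)", WEATHER)
declarative = conn.execute(
    "SELECT day, prcp, temp FROM weather "
    "WHERE city = 'San Francisco' AND prcp > 0.0"
).fetchall()

# Procedural: the developer spells out the scan, the test and the output.
procedural = []
for day, prcp, temp, city in WEATHER:
    if city == "San Francisco" and prcp > 0.0:
        procedural.append((day, prcp, temp))

print(declarative == procedural, declarative)
```

Both approaches return the same rows, but only the procedural version fixes the access path in the code. If an index on city appeared tomorrow, the loop would have to be rewritten to benefit from it while the query would not.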


3. SQL queries can be optimised.

The fact that SQL is a declarative language not only shields the developer from the complexities of the underlying query techniques, but also gives the underlying SQL engine a lot of flexibility in how to optimize any given query.

In a lot of programming languages, if the code runs slow, then it's the programmer's fault. For the SQL language, however, if a SQL query runs slow, then it's the SQL engine's fault.

This is where analytic databases really earn their keep – databases can easily innovate ‘under the covers’ to deliver faster performance; parallelization techniques, query transformations, indexing and join algorithms are just a few key areas of database innovation that drive query performance.
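A small experiment shows the engine changing its approach without the query changing at all. The sketch uses SQLite through Python's sqlite3 module and its EXPLAIN QUERY PLAN statement because that is easy to run anywhere; the index name is made up, and the plan text is SQLite's, not what Oracle's optimizer would print.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (day TEXT, prcp REAL, temp REAL, city TEXT)")

QUERY = (
    "SELECT day, prcp, temp FROM weather "
    "WHERE city = 'San Francisco' AND prcp > 0.0"
)


def plan(connection, sql):
    """Return the plan steps SQLite reports (last column of each row)."""
    return [row[-1] for row in connection.execute("EXPLAIN QUERY PLAN " + sql)]


before = plan(conn, QUERY)  # no index available yet

# The DBA adds an index; the query text stays exactly the same.
conn.execute("CREATE INDEX idx_weather_city ON weather (city)")
after = plan(conn, QUERY)

print(before)
print(after)
```

The exact wording of the plan varies between SQLite versions, but the second plan should reference idx_weather_city while the first cannot. The developer's SQL is untouched in both cases.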

4. SQL is extensible.

SQL provides a robust framework that adapts to new requirements

SQL has stayed relevant over the decades because, even though its core is grounded in universal data processing techniques, the language itself can be extended with new processing techniques and new calculations. Simple time-series calculations, statistical functions, and pattern-matching capabilities have all been added to SQL over the years. 

Consider, as a recent example, what many organizations realized as they started to ask queries such as 'how many distinct visitors came to my website last month?' These organizations realized that it is not vital to have a precise answer to this type of query ... an approximate answer (say, within 1%) would be more than sufficient. This requirement has now been quickly delivered by implementing the existing HyperLogLog algorithms within SQL engines for 'approximate count distinct' operations.
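For readers curious about what sits behind an approximate count distinct, here is a compact Python sketch of the basic HyperLogLog estimator. It is a textbook-style illustration and not the code used by any particular SQL engine; the register count, the use of SHA-1 as the hash and the visitor ids are all choices made for this example.

```python
import hashlib
import math


def hll_estimate(items, p=10):
    """Estimate the number of distinct items using 2**p registers."""
    m = 1 << p
    registers = [0] * m
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")       # 64-bit hash value
        index = h >> (64 - p)                       # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)            # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1     # position of the first 1-bit
        registers[index] = max(registers[index], rank)

    alpha = 0.7213 / (1 + 1.079 / m)                # bias correction for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)              # small-range correction
    return raw


# 60,000 page views from 20,000 distinct (invented) visitors.
visits = [f"visitor-{i % 20000}" for i in range(60000)]
estimate = hll_estimate(visits)
print(round(estimate))
```

With 1,024 registers the typical error is around 3%, and the memory needed stays fixed however many visitors there are. Using more registers tightens the error at the cost of a little more memory.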

More importantly, SQL is a language that is not explicitly tied to a storage model. While some might think of SQL as synonymous with relational databases, many of the new adopters of SQL are built on non-relational data. SQL is well on its way to being a standard language for accessing data stored in JSON and other serialized data structures.  

Summary

SQL is an immensely popular language today … and if anything its popularity is growing as the language is adopted for new data types and new use cases. The primacy of SQL for big data is not simply a default choice, but a conscious realization that SQL is the best suited language for basic analysis.

PS. Next week, many sessions at this year’s OpenWorld will focus on the power, richness and performance of SQL for sophisticated data analysis including the following:

Monday September 28

Using Analytical SQL to Intelligently Explore Big Data @ 4:00PM Moscone North 131

Joerg Otto - Head of Database Engineering, IDS GmbH
Marty Gubar - Director, Oracle
Keith Laker - Senior Principal Product Manager, Data Warehousing and Big Data, Oracle


YesSQL! A Celebration of SQL and PL/SQL @ 6:00PM Moscone South 103

Steven Feuerstein - Architect, Oracle
Thomas Kyte - Architect, Oracle


Tuesday September 29

SQL Is the Best Development Language for Big Data @ 10:45AM Moscone South 104

Thomas Kyte - Architect, Oracle

Enjoy OpenWorld 2014 and if you have time please come and meet the Analytical SQL team in the Moscone South Exhibition Hall. We will be on the Parallel Execution and Advanced SQL Processing demo booth (id 3720).

Oracle Big Data Lite 4.0 Virtual Machine Now Available

Big Data Lite 4.0 is now available for download from OTN.  There are lots of new capabilities in this latest version:
  • Oracle Database 12c (12.1.0.2), including new JSON support and Oracle Big Data SQL-enabled external tables.  Check out this hands-on lab to learn how to securely analyze all your data - across both Hadoop and Oracle Database 12c - using Big Data SQL.
  • New versions of SQL Developer and Data Modeler that support Hive access and automatic generation of Big Data SQL external tables
  • GoldenGate and the latest ODI versions are now included - with some great new hands-on labs.
  • Cloudera Manager is back - you can now optionally use CM to manage your Hadoop environment (requires 10GB memory devoted to the VM).  If you don't want to use CM, you can use the manual CDH configuration with the Big Data Lite services application
  • New versions of the entire stack... Big Data Connectors, NoSQL Database, CDH, JDeveloper and more.

Here's the inventory of all the features and versions:

  • Oracle Enterprise Linux 6.4
  • Oracle Database 12c Release 1 Enterprise Edition (12.1.0.2) - including Oracle Big Data SQL-enabled external tables, Oracle Advanced Analytics, OLAP, Spatial and more
  • Cloudera Distribution including Apache Hadoop (CDH5.1.2)
  • Cloudera Manager (5.1.2)
  • Oracle Big Data Connectors 4.0
    • Oracle SQL Connector for HDFS 3.1.0
    • Oracle Loader for Hadoop 3.2.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.4.1
    • Oracle XQuery for Hadoop 4.0.1
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.0.14)
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.0.3
  • Oracle Data Integrator 12cR1 (12.1.3)
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.1.1

Tuesday Apr 29, 2014

Oracle Data Warehouse and Big Data Magazine April Edition for Customers + Partners


The latest edition of our monthly data warehouse and big data magazine for Oracle customers and partners is now available. The content for this magazine is taken from the various data warehouse and big data Oracle product management blogs, Oracle press releases, videos posted on Oracle Media Network and Oracle Facebook pages. Click here to view the April Edition


Please share this link to our magazine, http://flip.it/fKOUS, with your customers and partners.


This magazine is optimized for display on tablets and smartphones using the Flipboard App, which is available from the Apple App store and Google Play store.





Monday Apr 28, 2014

DBAs Guide to Deploying Sandboxes in the Cloud

Overview

The need for a private, secure and safe area for data discovery within the data warehouse ecosystem is growing rapidly as many companies start investing in and investigating "big data". Business users need space and resources to evaluate new data sources to determine their value to the business and/or explore new ways of analyzing existing datasets to extract even more value. These safe areas are most commonly referred to as "Sandboxes" or "Discovery Sandboxes" or "Discovery Zones". If you are not familiar with the term then Forrester Research defines a "sandbox" as:

“data exploration environment where a power user can analyse production […] with near complete freedom to modify data models, enrich data sets and run the analysis whenever necessary, without much dependency on IT and production environment restrictions.” *1

These sandboxes are tremendously useful for business users because they allow them to quickly and informally explore new data sets or new ways of analyzing data without having to go through the formal rigour normally associated with data flowing into the EDW or deploying analytical scripts within the EDW. They provide business users with a high degree of freedom. The real business value is highlighted in a recent article by Ralph Kimball:

In several of the e-commerce enterprises interviewed for this white paper, analytic sandboxes were extremely important, and in some cases hundreds of the sandbox experiments were ongoing simultaneously.

As one interviewee commented, “newly discovered patterns have the most disruptive potential, and insights from them lead to the highest returns on investment.” *2

Key Characteristics

So what are the key characteristics of a sandbox? Essentially there are three:

  1. Used by skilled business analysts and data scientists
  2. Environment has fewer rules of engagement
  3. Time boxed

Sandboxes are not really designed to be used by CIOs or CEOs or general BI users. They are designed for business analysts and data scientists who have a strong knowledge of SQL and a detailed understanding of both the business and the source data that is being evaluated/analyzed. As with many data exploration projects you have to be able to understand the results that come back from a query and be able to determine very quickly if they make sense.

As I stated before, the normal EDW rules of engagement are significantly relaxed within the sandbox and new data flowing into the sandbox is typically disorganised and dirty. Hence the need for strong SQL skills to create simplified but functional data cleaning and transformation scripts, with the emphasis on making new data usable as quickly as possible. Part of the "transformation" process might be to generate new data points derived from existing attributes. A typical example of this is where a data set contains date-of-birth information, which in itself is quite a useful piece of information, that can be transformed to create a new data point of "age". Obviously the business analysts and data scientists need to be reasonably proficient in SQL to create the required transformation steps. It is not a complicated process, but it highlights the point that the business community needs to have the necessary skills so that they are self-sufficient.
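To make the date-of-birth example concrete, here is a minimal sketch of the derivation. In a sandbox this would normally be a SQL expression; it is written in Python here purely for illustration, and the function name age_on and the sample dates are hypothetical.

```python
from datetime import date


def age_on(date_of_birth, as_of):
    """Derive the "age" data point (in whole years) from a date of birth."""
    years = as_of.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


# Hypothetical sample: someone born in June 1980, evaluated in April 2014.
sample_age = age_on(date(1980, 6, 15), date(2014, 4, 28))
```

The only subtlety is the birthday check: subtracting the years alone would overstate the age for anyone whose birthday falls later in the year.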

Most importantly the sandbox environment needs to have a time limit. In the past this is where most companies have gone wrong! Many companies fail to kill off their sandboxes. Instead these environments evolve and flourish into shadow marts and/or data warehouses which end up causing havoc as users can never be sure which system contains the correct data. Today, most enlightened companies enforce a 90-day timer on their sandboxes. Once the 90-day cycle is complete, ownership of the processes and data is either moved over to the EDW team, who can then start to apply the corporate standards to the various objects and scripts, or the environment and all its data is simply dropped.

The only way a business can support the hundreds of live sandbox experiments described in Kimball's recent report (*2) is by enforcing these three key characteristics.

Choosing your deployment model:

Over the years that I have spent working on various data warehouse projects I have seen a wide variety of weird and wonderful deployment models designed to support sandboxing. In very general terms these various deployment models reduce down to one of the following types:

  1. Desktop sandbox
  2. Detached sandbox
  3. Attached sandbox

Each one of these deployment models has benefits and disadvantages, as described here:

1. Desktop Sandboxes

Many business users prefer to use their desktop tools, such as spreadsheet packages, because the simple row-column data model gives them a simplified and easily managed view of their data set. However, this approach places a significant processing load on the desktop computer (laptop or PC) and while some vendors offer a way to off-load some of that processing to bespoke middleware servers this obviously means implementing an additional specialised middleware server on dedicated hardware.  Otherwise, companies have to invest large amounts of money upgrading their desktop systems with additional memory and solid-state disks.

Creating a new sandbox is just a question of opening a new, fresh worksheet and loading the required data set. Obviously, the size and breadth of the dataset is limited by the resources on the desktop system and complicated calculations can take a considerable time to run with little or no scope for additional optimisation or tuning. Desktop sandboxes are, by default, data silos and completely disconnected from the enterprise data warehouse, which makes it very difficult to do any sort of joined-up analysis.

The main advantage of this approach is that power users can easily run what-if models where they redefine their data model to test new "hierarchies", add new dimensions or new attributes. They can even change the data by simply over-typing existing values. Collaboration is a simple process of emailing the spreadsheet model to other users for comments. The overriding assumption here is that users who receive the spreadsheet are actually authorised to view the data! Of course there is nothing to prevent recipients forwarding the data to other users. Therefore, it is fair to say that data security is non-existent.

For DBAs, the biggest problem with this approach is that it offers no integration points into the existing cloud management infrastructure. Therefore, it is difficult for the IT team to monitor the resources being used and make appropriate x-charges.  Of course the DBA has no control over the deletion of desktop based sandboxes so there is a tendency for these environments to take on a life of their own with business users using them to create "shadow" production systems that are never decommissioned.

Overall, the deployment of desktop sandboxes is not recommended.

2. Detached Sandboxes

Using a detached, dedicated sandbox platform resolves many of the critical issues related to desktop sandbox platforms, most notably the issues relating to data security and processing scalability. Assuming a relatively robust platform is used to manage the sandboxes then the security profiles implemented in the EDW can be replicated across to the stand-alone platform. This approach still allows users to redefine their data model to test new "hierarchies", add new dimensions or new attributes within what-if models and even change data points but this ability is "granted" by the DBA rather than being automatically taken and enforced by the business user. In terms of sharing results there is no need to distribute data via email and this ensures everyone gets the same consistent view of the results (and by default the original source, should there be a need to work backwards from the results to the source).

The key concern for business users is the level of latency that occurs from the need to unload and reload not only the required data but also all the supporting technical and business metadata. Unloading, moving and importing large historical data sets can be very time consuming and can require large amounts of resources on the production system, which may or may not be available depending on the timing of the request.

For the DBA, issues arise around the need to monitor additional hardware and software services in the data center. For IT this means more costs because additional floor space, network bandwidth, power and cooling may be required. Of course, assuming that the sandbox platform fits into the existing monitoring and control infrastructure then x-charging can be implemented. In this environment the DBA has full control over the deletion of a sandbox so they can prevent the spread of "shadow" production data sets. For important business discoveries, the use of detached sandboxes does provide the IT team with the opportunity to grab the loading and analysis scripts and move them to the production EDW environment. This helps to reduce the amount of time and effort needed to "productionize" discoveries.

While detached sandboxes remove some of the disadvantages of desktop platforms, they are still not an ideal way to deliver sandboxes to the business community.

3. Attached Sandboxes

Attached sandboxes resolve all the problems associated with the other two scenarios. Oracle provides a rich set of in-database features that allow business users to work with in-place data, which, in effect, removes the issue of data latency. Oracle Database is able to guarantee complete isolation for any changes to dimensions, hierarchies, attributes and/or even individual data points so there is no need to unload, move and then reload data. All the existing data security policies remain in place, which means there is no need to replicate security profiles to other systems where there is the inherent risk that something might be missed in the process.

For the DBA, x-charging can be implemented using existing infrastructure management tools. The DBA has full control over the sandbox in terms of resources (storage space, CPU, I/O) and duration. The only concern that is normally raised regarding the use of attached sandboxes is the impact on the existing operational workloads. Fortunately, Oracle Database, in conjunction with our engineered systems, has a very robust workload management framework (see earlier posts on this topic: https://blogs.oracle.com/datawarehousing/tags/Workload_Management). This means that the DBA can allocate sufficient resource to each sandbox while ensuring that the key operational workloads continue to meet their SLAs. Overall, attached sandboxes, within an Oracle Database environment, are a win-win solution: both the DBA and the business community get what they need.

Summary

Deployment Model: Desktop Sandbox

Benefits:
  • High degree of local control over data
  • “Fast” performance
  • Quick and easy sharing of results

Disadvantages:
  • Reduced data scalability
  • Not easy to integrate new data
  • Very costly to implement
  • Undermines data consistency-governance
  • Data security is compromised

Deployment Model: Detached Sandbox

Benefits:
  • Reduces workload on EDW
  • Upload personal/external data to sandbox
  • Explore large volumes of data without limits

Disadvantages:
  • Requires additional hardware and software
  • Requires replication of corporate data
  • High latency
  • Replication + increased management of operational metadata

Deployment Model: Attached Sandbox

Benefits:
  • Upload additional data to virtual partitions
  • Easy to mix new data with corporate data
  • No replication of corporate data
  • Efficient use of DW platform resources
  • Data access controlled by enterprise security features

Disadvantages:
  • Requires robust workload management tools

From this list of pros and cons it is easy to see that the "Attached Sandbox" is the best deployment model to use. Fortunately, Oracle Database 12c has a number of new features and improvements to existing features that mean it is the perfect platform for deploying and managing attached sandboxes.

B-O-X-D: the lifecycle of a sandbox

Now that we know what type of sandbox we need to deploy to keep our business users happy (just in case you were not paying attention: attached sandboxes!), the next step is to consider the lifecycle of the sandbox along with the tools and features that support each of the key phases. To make things easier I have broken this down into four key DBA-centric phases as shown below:

[Image: Sandbox lifecycle]

Over the next four weeks I will cover these four key phases of the sandbox lifecycle and explain which Oracle tools and Oracle Database features are relevant and how they can be used. 

Footnotes

*1 Solve the Data Management Conflict Between Business and IT, by Brad Peters - Information Management Newsletters, July 20, 2010

*2 The Evolving Role of the Enterprise Data Warehouse in the Era of Big Data Analytics by Ralph Kimball

Tuesday Apr 15, 2014

OpenWorld call for Papers closes today!

Just a gentle reminder - if you have not submitted a paper for this year's OpenWorld conference then there is still just enough time because the deadline is today (Tuesday, April 15) at 11:59pm PDT. The call for papers website is here http://www.oracle.com/openworld/call-for-papers/index.html and this provides all the details of how and what to submit.

I have been working with a number of customers on some really exciting papers so I know this year's conference is going to be really interesting for data warehousing and analytics. I would encourage everyone to submit a paper, especially if you have never done this before. Right now both data warehousing and analytics are among the hottest topics in IT and I am sure all of you have some great stories that you could share with your industry peers who will be attending the conference. It is a great opportunity to present to your peers and also learn from them by attending their data warehouse/analytics sessions during this week-long conference. And of course you get a week of glorious Californian sunshine and the chance to spend time in one of the world's most beautiful waterfront cities.

If you would like any help submitting a proposal then feel free to email me today and I will do my best to provide answers and/or guidance. My email address is keith.laker@oracle.com.

Have a great day and get those papers entered into our OpenWorld system right now! 

Monday Mar 24, 2014

Built-in sorting optimizations to support analytical SQL

One of the proof points that I often make for using analytical SQL over more sophisticated SQL-based methods is that we have included specific optimizations within the database engine to support our analytical functions. In this blog post I am going to briefly talk about how the database optimizes the number of sorts that occur when using analytical SQL.

Sort Optimization 1: Ordering Groups

Many analytical functions include a PARTITION BY and/or an ORDER BY clause, both of which by definition imply that an ordering process is going to be required. As each function can have its own PARTITION BY-ORDER BY clause this can create situations where a lot of different sorts are needed. For example, if we have a SQL statement that includes the following:

Rank() Over (Partition by (x) Order by (w))
Sum(a) Over (Partition by (w,x) Order by (z))
Ntile() Over (Partition by (x) Order by (y))
Sum(b) Over (Partition by (x,y) Order by (z))

this could involve four different sort processes to take into account the use of both PARTITION BY and ORDER BY clauses across the four functions. Performing four separate sort processes on a data set could add a tremendous overhead (depending on the size of the data set). Therefore, we have taken two specific steps to optimize the sorting process.

The first step is to create the notion of "Ordering Groups". This optimization looks for ways to group together sets of analytic functions which can be evaluated with a single sort. The objective is to construct a minimal set of ordering groups which in turn minimizes the number of sorts. In the example above we would create two ordering groups as follows:

[Image: the two ordering groups created for the example above]

This allows us to reduce the original list of sorts down from 4 to just 2.
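The grouping idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the database's actual algorithm: it assumes a function can share a sort when its PARTITION BY columns (in any order) followed by its ORDER BY columns form a prefix of the sort key, and it picks groups greedily. The names satisfies, ordering_groups and EXAMPLE are hypothetical.

```python
from itertools import permutations


def satisfies(sort_key, partition_by, order_by):
    """True if rows sorted by sort_key are also ordered as this function needs."""
    n = len(partition_by)
    need = n + len(order_by)
    if len(sort_key) < need:
        return False
    # Partition columns may appear in any order; ORDER BY columns must follow as-is.
    return (set(sort_key[:n]) == set(partition_by)
            and list(sort_key[n:need]) == list(order_by))


def ordering_groups(functions):
    """Greedily pick sort keys so that each key serves as many functions as possible."""
    remaining = dict(functions)  # name -> (partition_by, order_by)
    groups = []
    while remaining:
        best_key, best_members = None, []
        for part, order in remaining.values():
            # Try every ordering of the partition columns as a candidate sort key.
            for perm in permutations(part):
                key = tuple(perm) + tuple(order)
                members = [name for name, (p, o) in remaining.items()
                           if satisfies(key, p, o)]
                if len(members) > len(best_members):
                    best_key, best_members = key, members
        groups.append((best_key, best_members))
        for name in best_members:
            del remaining[name]
    return groups


# The four functions from the example above.
EXAMPLE = {
    "rank":  (["x"], ["w"]),
    "sum_a": (["w", "x"], ["z"]),
    "ntile": (["x"], ["y"]),
    "sum_b": (["x", "y"], ["z"]),
}
```

Calling ordering_groups(EXAMPLE) yields two groups: a sort on (x, w, z) serves Rank and Sum(a), and a sort on (x, y, z) serves Ntile and Sum(b). The key step is noticing that the partition columns (w, x) can be reordered to (x, w) so that the first sort also satisfies the Rank function.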

Sort Optimization 2: Eliminating Sorts

We can further reduce the number of sorts that need to be performed by carefully scheduling the execution so that:

  • Ordering groups with sorts corresponding to that in the GROUP BY execute first (immediately after the GROUP BY) 
  • Ordering groups with sorts corresponding to that in the ORDER BY execute last (immediately before the ORDER BY)

In addition, we can also eliminate sorts when an index or join method (sort-merge) makes sorting unnecessary. 

Sort Optimization 3: RANK Predicates

Where a SQL statement includes RANK() functions there are additional optimizations that kick in. Instead of sorting all the data, adding the RANK and then applying the predicate, the RANK predicate is evaluated as part of the sort process. The net result is that fewer records are actually sorted, resulting in more efficient execution.
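The effect of evaluating a predicate such as "rank <= N" during the sort can be illustrated with a small Python sketch. This is not how the database implements it; it simply shows, under the assumption of an ascending RANK() over a single key, that rows which cannot qualify never need to be fully sorted. The function name rows_with_rank_at_most and the sample rows are hypothetical.

```python
import heapq


def rows_with_rank_at_most(rows, key, n):
    """Return (rank, row) pairs for rows whose ascending RANK() by key is <= n."""
    if n <= 0 or not rows:
        return []
    # Find the n-th smallest key without sorting the whole input.
    threshold = heapq.nsmallest(n, (key(r) for r in rows))[-1]
    # Only rows at or below the threshold can have rank <= n (ties included).
    survivors = sorted((r for r in rows if key(r) <= threshold), key=key)
    ranked = []
    for position, row in enumerate(survivors, start=1):
        if ranked and key(row) == key(ranked[-1][1]):
            rank = ranked[-1][0]   # ties share the same rank
        else:
            rank = position        # RANK() leaves gaps after ties
        ranked.append((rank, row))
    return ranked


# Hypothetical sample: (name, amount) rows, ranked by amount.
SAMPLE_ROWS = [("a", 30), ("b", 10), ("c", 20), ("d", 10), ("e", 40)]
top_two = rows_with_rank_at_most(SAMPLE_ROWS, key=lambda r: r[1], n=2)
```

Here only the two rows tied on the smallest amount survive the filter and get sorted; the other three rows are discarded after a single comparison against the threshold.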

Summary 

Overall, these three optimizations ensure that as few sorts as possible are performed when you include SQL analytical functions as part of your SQL statements. 

Wednesday Feb 05, 2014

OTN Virtual Developer Day Database 12c content now available on-demand

Thank you to everyone who attended the SQL pattern matching session during yesterday's OTN Virtual Developer Day event. We had a great crowd of people join our live workshop session. I hope everyone enjoyed using the amazing platform which the OTN team put together to host the event.  

The great news is that all the content from the event is now available for download and you can watch all the on-demand videos from the four tracks (Big Data DBA, Big Data Developer, Database DBA and Database Developer).

The link to the fantastic OTN VDD platform is here: https://oracle.6connex.com/portal/database2014/login?langR=en_US&mcc=aceinvite and this is what the landing page looks like:

[Image: OTN Virtual Developer Day landing page]

This page will give you access to the keynote session by Tom Kyte and Jonathan Lewis, which covered the landscape of Oracle DB technology evolution and adoption. The keynote looks at what's next for Oracle Database 12c, focusing on the high-value technologies and techniques that are driving greater database efficiencies and innovation.

You will be able to access the videos, slides from each presentation and a huge range of technical hands-on labs covering big data and database technologies, including my SQL Pattern Matching workshop. If you want to download the Virtualbox image for the Database tracks it is available here: http://www.oracle.com/technetwork/database/enterprise-edition/databaseappdev-vm-161299.html (this contains everything you need to run my SQL Pattern Matching workshop).

While you are doing the workshop, if you have any questions then please feel free to email me - keith.laker@oracle.com.

Enjoy.

Monday Jan 27, 2014

FREE OTN virtual workshop - Learn about SQL pattern matching with Oracle Database 12c.


Make sure you are free on Tuesday February 4 because the OTN team are hosting another of their virtual developer day events. Most importantly it is FREE. Even more important is the fact that I will be running a 12c pattern matching workshop at 11:45am Pacific Time. Of course there are lots of other sessions that you can attend relating to big data and Oracle Database 12c and the OTN team has created two streams to help you learn about these two important areas:

  • Oracle Database application development — Learn expert tips and tricks on how to develop applications for Oracle Database 12c and Big Data environments more effectively.
  • Oracle Database platform deployment processes — From integration, to data migration, experts showcase new capabilities in Oracle 12c and Big Data environments that will allow you to deliver greater database performance and integration.

You can sign-up for the event and pick your tracks and sessions via this link: https://oracle.6connex.com/portal/database2014/login?langR=en_US&mcc=aceinvite

My pattern matching session is included in the Oracle 12c DBA section of the application development track and the workshop will cover the following topics:

  • Part 1 - Introduction to SQL Pattern Matching
  • Part 2 - Pattern Match: simple example
  • Part 3 - How to use built-in measures
  • Part 4 - Searching for more complex patterns
  • Part 5 - Deep dive into how SQL Pattern Matching works
  • Part 6 - More Advanced Topics

As my session is only 45 minutes long I am only going to cover the first three topics and leave you to work through the last three topics in your own time. During the 45 minute workshop I will be available to answer any questions via the live Q&A chat feature.

There is a link to the full agenda on the invitation page. The OTN team will be providing a Database 12c Virtualbox VM that you will be able to download later this week. For the pattern matching session I will be providing the scripts to install our sample schema, the slides from the webcast and the workshop files which include a whole series of exercises that will help you learn about pattern matching and test your SQL skills. 

The big data team has kindly included my pattern matching content inside their Virtualbox image so if you want to focus on the sessions offered on the big data tracks but still want to work on the pattern matching exercises after the event then you will have everything you need already installed and ready to go!

Don't forget to register as soon as possible and I hope you have a great day…Let me know if you have any questions or comments.

Wednesday Oct 30, 2013

Oracle Magazine: Getting started with SQL Analytics

I am currently working on a series of podcasts covering the broad categories of our SQL analytical functions and features and while I was doing some research I came across a series of four articles in Oracle Magazine.

This series of articles is written by Melanie Caffrey, who is a senior development manager at Oracle. She is a coauthor of Expert PL/SQL Practices for Oracle Developers and DBAs (Apress, 2011) and Expert Oracle Practices: Oracle Database Administration from the Oak Table (Apress, 2010).

The four articles are under the banner "Technology: SQL 101" and parts 9, 10, 11 and 12 cover SQL analytics. Here are the links to the four articles:

The articles cover topics such as GROUP BY, SUM, AVG, HAVING, window functions, RANK, FIRST, LAST, LAG, LEAD etc.  

The great news is that  you can try out the examples in this series. All you need is access to an Oracle Database instance. All the schemas, data sets and SQL statements that you will need can be downloaded from a link included in the January article.  

 I hope you find this series of articles useful.

Tuesday Oct 22, 2013

OOW content for Pattern Matching....

If you missed my sessions at OpenWorld then don't worry - all the content we used for pattern matching (presentation and hands-on lab) is now available for download.

My presentation "SQL: The Best Development Language for Big Data?" is available for download from the OOW Content Catalog, see here: https://oracleus.activeevents.com/2013/connect/sessionDetail.ww?SESSION_ID=9101

For the hands-on lab ("Pattern Matching at the Speed of Thought with Oracle Database 12c") we used the Oracle-By-Example content. The OOW hands-on lab uses Oracle Database 12c Release 1 (12.1) and uses the MATCH_RECOGNIZE clause to perform some basic pattern matching examples in SQL. This lab is broken down into four main steps:
  • Logically partition and order the data that is used in the MATCH_RECOGNIZE clause with its PARTITION BY and ORDER BY clauses.
  • Define patterns of rows to seek using the PATTERN clause of the MATCH_RECOGNIZE clause. These patterns use regular expression syntax, a powerful and expressive feature, applied to the pattern variables you define.
  • Specify the logical conditions required to map a row to a row pattern variable in the DEFINE clause.
  • Define measures, which are expressions usable in the MEASURES clause of the SQL query.
You can download the setup files to build the ticker schema and the student notes from the Oracle Learning Library. The direct link to the example on using pattern matching is here: http://apex.oracle.com/pls/apex/f?p=44785:24:0::NO:24:P24_CONTENT_ID,P24_PREV_PAGE:6781,2.
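The four steps above can be mimicked outside the database to get a feel for how they fit together. The following Python sketch is an analogy only, not how MATCH_RECOGNIZE is implemented: it assumes a single partition of prices already in order, labels each row with one letter (a simplification of the DEFINE clause, where conditions are really evaluated during matching), and uses an ordinary regular expression in place of a PATTERN such as (STRT DOWN+ UP+). The function name find_v_shapes and the sample prices are hypothetical.

```python
import re


def find_v_shapes(prices):
    """Find V-shaped runs (a fall followed by a rise) in one ordered partition."""
    # DEFINE step: label each row by comparing it with the previous row.
    symbols = []
    for i, price in enumerate(prices):
        if i > 0 and price < prices[i - 1]:
            symbols.append("D")   # DOWN: price fell
        elif i > 0 and price > prices[i - 1]:
            symbols.append("U")   # UP: price rose
        else:
            symbols.append("S")   # first row or unchanged price
    encoded = "".join(symbols)

    # PATTERN step: any starting row, then one or more DOWN, then one or more UP.
    matches = []
    for m in re.finditer(r".(D+)(U+)", encoded):
        # MEASURES step: report the start, bottom and end price of each match.
        start_price = prices[m.start()]
        bottom_price = prices[m.end(1) - 1]
        end_price = prices[m.end(2) - 1]
        matches.append((start_price, bottom_price, end_price))
    return matches


# Hypothetical prices for a single ticker, already ordered by date.
SAMPLE_PRICES = [12, 11, 10, 11, 13, 12, 9, 10]
v_shapes = find_v_shapes(SAMPLE_PRICES)
```

For these sample prices the sketch finds two V-shapes, (12, 10, 13) and (12, 9, 10), with the second search starting on the row after the first match ends. To cover the PARTITION BY step you would call the function once per ticker.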

Wednesday Jun 27, 2012

FairScheduling Conventions in Hadoop

While scheduling and resource allocation control has been present in Hadoop since 0.20, a lot of people haven't discovered or utilized it in their initial investigations of the Hadoop ecosystem. We could chalk this up to many things:

  • Organizations are still determining what their dataflow and analysis workloads will comprise
  • Small deployments under test aren't likely to show the signs of strain that would send someone looking for resource allocation options
  • The default scheduling options -- the FairScheduler and the CapacityScheduler -- are not placed in the most prominent position within the Hadoop documentation.

However, for production deployments, it's wise to start with at least the foundations of scheduling in place so that you can tune the cluster as workloads emerge. To do that, we have to ask ourselves what the off-the-rack scheduling options are. We have some choices:

  • The FairScheduler, which will work to ensure resource allocations are enforced on a per-job basis.
  • The CapacityScheduler, which will ensure resource allocations are enforced on a per-queue basis.
  • Writing your own implementation of the abstract class org.apache.hadoop.mapred.TaskScheduler is an option, but usually overkill.

If you're going to have several concurrent users and leverage the more interactive aspects of the Hadoop environment (e.g. Pig and Hive scripting), the FairScheduler is definitely the way to go. In particular, we can do user-specific pools so that default users get their fair share, and specific users are given the resources their workloads require.

To enable fair scheduling, we're going to need to do a couple of things. First, we need to tell the JobTracker that we want to use scheduling and where we're going to be defining our allocations. We do this by adding the following to the mapred-site.xml file in HADOOP_HOME/conf:

<property>
<name>mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.FairScheduler</value>
</property>

<property>
<name>mapred.fairscheduler.allocation.file</name>
<value>/path/to/allocations.xml</value>
</property>

<property>
<name>mapred.fairscheduler.poolnameproperty</name>
<value>pool.name</value>
</property>

<property>
<name>pool.name</name>
<value>${user.name}</value>
</property>

What we've done here is simply tell the JobTracker that we'd like task scheduling to use the FairScheduler class rather than a single FIFO queue. Moreover, we're going to be defining our resource pools and allocations in a file called allocations.xml. For reference, the allocation file is read every 15s or so, which allows for tuning allocations without having to take down the JobTracker.

Our allocation file is now going to look a little like this:

<?xml version="1.0"?>
<allocations>
<pool name="dan">
<minMaps>5</minMaps>
<minReduces>5</minReduces>
<maxMaps>25</maxMaps>
<maxReduces>25</maxReduces>
<minSharePreemptionTimeout>300</minSharePreemptionTimeout>
</pool>
<user name="dan">
<maxRunningJobs>6</maxRunningJobs>
</user>
<userMaxJobsDefault>3</userMaxJobsDefault>
<fairSharePreemptionTimeout>600</fairSharePreemptionTimeout>
</allocations>

In this case, I've explicitly set my username to have upper and lower bounds on the maps and reduces, and allotted myself double the number of running jobs. Now, if I run hive or pig jobs from either the console or via the Hue web interface, I'll be treated "fairly" by the JobTracker. There's a lot more tweaking that can be done to the allocations file, so it's best to dig down into the description and start trying out allocations that might fit your workload.
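Because the allocation file is re-read while the JobTracker is running, it can be worth sanity-checking edits before saving them. The sketch below is a hypothetical helper, not part of Hadoop: it parses an allocation file with Python's standard XML library (which rejects malformed markup such as a mismatched closing tag) and flags any pool whose minimum exceeds its maximum. The function name check_allocations and the embedded sample are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def check_allocations(xml_text):
    """Return a list of problems found in FairScheduler-style allocation XML."""
    root = ET.fromstring(xml_text)  # raises ParseError on malformed XML
    problems = []
    for pool in root.findall("pool"):
        name = pool.get("name")
        for kind in ("Maps", "Reduces"):
            low = pool.findtext("min" + kind)
            high = pool.findtext("max" + kind)
            # Only compare when both bounds are present in the pool definition.
            if low is not None and high is not None and int(low) > int(high):
                problems.append(
                    f"pool {name}: min{kind} ({low}) exceeds max{kind} ({high})"
                )
    return problems


# Sample mirroring the allocation file shown above.
SAMPLE_ALLOCATIONS = """<allocations>
<pool name="dan">
<minMaps>5</minMaps>
<minReduces>5</minReduces>
<maxMaps>25</maxMaps>
<maxReduces>25</maxReduces>
</pool>
<user name="dan">
<maxRunningJobs>6</maxRunningJobs>
</user>
<userMaxJobsDefault>3</userMaxJobsDefault>
</allocations>"""

problems = check_allocations(SAMPLE_ALLOCATIONS)
```

An empty list means the pool bounds are consistent; the check says nothing about whether the values suit your workload.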

About

The data warehouse insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.
