November 9, 2009

Next Generation Data Warehouse Platforms

Rather than entertain you all with my own theories I decided to be a literature critic and discuss a paper / survey done by TDWI's Philip Russom. The report goes into "what a next generation DW platform should try to address and what it could look like". You can read an Oracle sponsored copy here. Do note that the research was sponsored by a large number of vendors - not just Oracle - and that this is just our way of getting a complementary copy to you...

I think there are a couple of interesting outcomes from the report and I'd like to summarize some of it in the following points:

  • You should start thinking about either a new infrastructure or start thinking about how to leverage the existing infrastructure for more advanced data warehousing in the next year or so.
  • Poor query performance and the lack of advanced analytics (OLAP, Data Mining, Statistics etc.) and inadequate load performance are the top three issues identified in the report. So look for a platform that will go fast (really fast!) and allows you to use advanced analytics for a large use community.
  • More and more companies are looking towards a more real-time (dare I say Right-Time) environment - for at least parts of the DW - in light of integrating DW data with operational systems or in a fancy way striving for operational excellence. I would think this is where the mixed workloads come from as well, which are an interesting trend in the paper. Again, you will want a platform that will allow for these mixed workloads and the trickle feeds while reading from the DW.
  • Hardware evolutions do drive platforms in that the capacity increases, the price of that capacity decreases and that storage media change to faster media over the next couple of years. Think flash, think memory, think (maybe even) solid state disks.
  • Private Clouds - inline with the previous statements - are expected to be seen as an infrastructure. Inline with the big boss, I would prefer to call this a compute grid rather than referring to water vapor too much... But it means that low cost small commodity compute resources racked together and used across workloads (consolidation anyone?) are something in your near future...

As a true critic, I of course have to say something sour about the report... my sour grape about this report is the question as to whether you would like a DBMS built for transactions as your DW database... Not quite sure what a survey on DW workloads expects to get as an answer. It did surprise me that the DBMS built for DW has flat growth and moderate to good commitment. My conclusion is that this is not necessarily a great discussion to have, as neither seem to be interesting to too many people. I think people focus on finding a platform that qualifies by solving their pain points and/or the above points.

And then of course, to keep on the sour face,  the fun discussion on MPP. I think that everyone agrees that large operational data stores (3NF like schemas in many cases) benefit from parallel processing capacities. What we are seeing is that more and more people need more than just full table scans on these schemas. In light of the above trends (and there is more on those in the report) people require multiple workloads on these systems, read consistency without readers blocking writers etc. In other words, a next generation platform needs to deal with both massively parallel queries and with single transaction level reads and writes.

A discussion around MPP versus SMP versus OLTP DB versus DW DB is not very useful I would think. It should be on the issues and merits of a DBMS and platform not on a theoretical argument. And I think most of the report does bring up the right points you should be thinking about while designing what the platform should be doing. I just think that people should keep in mind that theory is not always the best lead argument. Keith has written a long post on a related topic. One other thing, many new DBMS systems have sprung into the DW market, many just featuring a single core feature (compression, columnar storage, in-memory etc.). For a critical system that sees mixed workloads and lots of data, these point solutions may not be the right thing to base an architecture on...

I was also surprised by the Data Quality entry in Good Growth - Good Commitment. While I think that more and more people are paying attention, I still see this as the orphan in the DW environment. Lots of talk, not much action. I just wish it was more action and less talk, because, come on, this is really a problem! So I hope the report is spot on there, but I'm a skeptic (nothing to blame the report for, just me being in this business not seeing too much happening in DQ).

Ok, be done now with the sulking :-)

Couple of things around some new trends and things in Oracle that would help with building an infrastructure for the next generation data warehouse:

  • Performance => Sun Oracle Database Machine. Yes, it really is fast!
  • In-memory processing => Oracle now has (11gR2) In-Memory Parallel Execution. More about this can be read in Maria's excellent post here.
  • In-Database Analytics => As the report says in Exadata V2 and Oracle 11gR2 we are now offloading data mining model scoring to the storage side of the house, which allows us to embed mining models into more and more operational systems and get online (direct) feedback on transactions. We also have for years moved more and more OLAP and Stats functionality into the engine
  • Real-time data warehousing => First and foremost the read consistency model introduced in Oracle 4 (this is not a typo...) allowing readers to see consistent data during writes, secondly, the just completed acquisition of Golden Gate and the ETL capabilities (like streams) in Oracle allow for very nice real time data feeds. Oracle's MAA architecture allows us to be up and running 24*7 on commodity hardware and deliver an online experience to all customers...
  • Cloud computing => see the in-database mapreduce post here.
  • Appliances => Sun Oracle Database Machine

So, rather than finding point solutions for all of these challenges as one of the conclusions shows, maybe a solution would be to choose a platform that actually allows you to do what you need from a single vendor... If you are really thinking operational BI, real time DW it is hard to imagine how you are going to do this with multiple DBMS solutions. If the analytics run on a secondary system, you have just created a delay as you now need to update the EDW and then this analytics system. This simply means you cannot actually do real time data warehousing... Loading just to the analytics system means that you are back into data silos.

Concluding, I think most conclusions of the report are spot on. Think Big Data, understand the business drivers, plan for your systems etc. Not so sure a set of separate point solutions is what you need if you are going to do real time data warehousing...

Anyways, do read the report! I think it gives you good food for thought and a great insight in the trends you will be dealing with in your near future.

November 4, 2009

Chunking of Updates

While I was looking at the Edition-based Redefinition I was puzzled by, or thinking about the performance of large data volumes moving across the edition. So I started thinking about how do I get that trigger to do stuff in parallel. Turns out I was thinking upside down... or inside out is maybe better.

In 11g Release 2 a new DBMS package is introduced called DBMS_PARALLEL_EXECUTE, and in the RUN_TASK procedure it allows me to actually fire a cross-edition trigger, and there is my answer to how to get the redefinition of the data to happen in parallel. Eureka... we already figured this out, it just took me a while to figure out the connections...

If you are just interested in avoiding large rollback segments or trying to find some way of doing updates on data in a small controlled batch manner I would say, have a look at the documentation for DBMS_PARALLEL_EXECUTE. The nice thing about the chunking is that you get restartability and control over the update in smaller chunks without having to build the infrastructure for all of this yourself.

You can read some more here and here.

October 21, 2009

Edition Based Redefinition

Or, how can I use 11gR2 to update my operational data warehouse with real time data loads and still patch the schema to deal with errors or with upgrades due to changing business demands.

The Theory

In a time when the data warehouse was refreshed weekly (wow, that often?) fixing the schema or the data was a piece of cake. The data was static, the users were querying (you hoped) away and you could tinker with the schema without really making a mess of it all. Take the DW down over the weekend, hey no problem. Those where the days, when you had nice batch windows to load data, weekends to upgrade the schema and the query layers and in general not too much pressure.

Both Operational Data Warehousing and Real Time Data Warehousing make all of the above a blast from the past and the fantasy dream world of all data warehouse architects.

Today chances are that your data warehouse is as critical to your business as the OLTP application that captures sales, orders or something else that makes you money. Taking the system down is not an option in many cases.

In OLTP land this reality has been around for a while... patching the app while the app is running is something that really saves time and money over there.

Oh, but we just said that the DW is like the OLTP app. Therefore, should I use the same patching process and functionality? Yes, and that is where we bring in Edition Based Redefinition.

The Scenario

We have a very small star schema, with a single fact table and two dimensions. The idea is to simply use this example and highlight some of the things that should be done to enable online patching. The goal is not to create a very complex scenario and cover every possible case to be looked at...

Schema, user and objects

We are also looking at a real time or short load time interval. Again, the scenario will just be inserting a small number of rows, but it should actually show the principle quite well.

To simulate the query angle we have two users, one is the owner of the schema and the objects to be worked on, the other queries these objects. The owner is REDEF_OWN and the user is REDEF_QRY (and yes, very creative naming...).

Redefinition in Action

Let's get going. We have created the tables (all three) and loaded the initial data set. Users are all happy and business is booming.

Our change means that the product dimension gets a second and third level, but most importantly name needs to be changed to being a code to uniformly identify the product across the globe. This means there will be some re-linking of data as some facts will go to a different product code (the new "name").

First, develop and test all of the steps on a copy in the development environment, then go to work on the editions in production. For a complete list of the steps to do take a look in the documentation (here) or download the script with the code for this example.

Now that you know your changes you will create Editioning Views that will regulate access to editioned data structures. This access is both ETL as well as the query access. This setup is shown in the following illustration.

blog_redef_procedure

What we did here is rename the actual tables which now have a _GEN suffix and created Editioning Views with the original names. The code looks like this (note the new keyword in the create view syntax!):

ALTER TABLE PRODUCTS RENAME TO PRODUCTS_GEN;

CREATE OR REPLACE EDITIONING VIEW PRODUCTS
AS
SELECT  prd_id 
,       prd_name
,       prd_desc
FROM PRODUCTS_GEN;

With this redirection in place (and we do this for all three tables to ensure we can modify all of these) we are ready to create the new edition and start modifying the actual objects. As a quick side note on grants and security, make sure the query users (in this case REDEF_QRY) have the correct grants on both the tables and the views to allow for selects.

redef_change_edition_screen

Once we are in the new edition all changes can be made, tested and then published. The screen above shows how to change the Edition and see what edition this session is working with. In this case, we are working in the PHASETWO edition.

As we add the columns to the table, make sure to not change the primary key (yet). We will do that later and by using the view, re-point the foreign key to that new primary key. Also note that we are not changing the definition of the description, so we will leave it as is...

The change looks like this:

redef_change_objects_screen

The next step is to get the changes propagated from the live system into the new system. Remember data loads are going on as we set all of this up. For this we need a basic procedure that captures the logic and then use a cross edition trigger on the first edition to invoke the logic for each inserted row (or a batch).

We use a procedure to allow extensions to the example without directly modifying the trigger. For this simple example, the trigger probably would have been flexible enough to deal with the changes.

After creating the procedure (see the code for some simple PL/SQL procedure doing some work), create the forward edition trigger (this is in Edition PhaseTwo!) and enable it. For simplicity we will not go into waiting for pending DML. In a real scenario you may want to do this of course. See the links to the documentation for more information on that step.

redef_create_trigger_screen

At this point, both editions are still in synch. E.g. no data has been changed. First we transform the data to be upgraded by executing the transforms. Obviously, at this point, every insert going into the base schema will have the trigger fire and will show up transformed in the new edition.

Just to show what the actual status is, here is the view a query user will see when looking at products (he is looking at Ora$Base):

redef_qry_view_current

At the same time, if the owner looks at PRODUCTS (he is in PhaseTwo) he will see the following:

redef_owner_view_before_xform

As you can see only the non-modified columns are showing (why? - See the actual code of the view - in short, because we are selecting the new columns...).

Using the code in the documentation we will apply the transformations in the new edition and see what the result will be:

redef_owner_view_after_xform

We do the same steps to the fact table to cater for the redefinition of the dimension key (we went from 20 to 40). The details of this are in the code package.

At this point in time we have all data available in the PhaseTwo edition in it current form. Now a new ETL record with the following values comes in for the FACT table (SALES): 20, 10, 200, 3, 20. In other words we are still referring to dimension key 20.

This gives us the following results in the Ora$Base edition, note that the last record refers to dimension key 20 as a product_key:

redef_qry_after_etl_current

In the PhaseTwo edition we see the record transformed and we see that the product key field is always 40, even for the new record:

redef_qry_after_etl_phasetwo

In effect we have updated our data, redefined some codes, added some columns and all of this with no noticeable impact for the end users.

The situation is now as follows:

blog_redef_procedure_step2

Now we should change the primary and foreign keys as well so it all refers to the new situation, we do this right before the switch of the users and the load processes. Because the cross edition trigger is in place you will get a key violation on the next generation primary key.

After we change the keys, the PK on PRODUCTS_GEN is named PRODUCT_PK_1. Inserting into the dimension in Ora$Base (there are no keys on PRD_ID_1 in that revision) will give the error coming from the trigger:

redef_etl_current_dupkeyerror

Now we are all set and we can switch the edition for all users:

ALTER DATABASE DEFAULT EDITION = PhaseTwo;

After a disconnect and reconnect of any users that were connected, they will see the new status and are by default running on the upgraded edition. Next steps will include the removal of the triggers and the procedures.

Obviously your ETL is now changing as well as there are more fields to fill and the fields have different names. Some of the logic built into the triggers will move into ETL. And this is where the whole testing in development and test comes into place. Just make sure the ETL cut over is a simple as the above for the query users.

A Word on Performance

The interesting discussion is now whether or not you want to do this exact construct with large batch ETL routines. I would say that when doing this in batch, it may make sense to do a dual data load and use the new logic to populate the new system with the old data. But I'm open to any comments on this...

A Word on Naming Conventions

As you can see we are adding columns to the original table. If you start with a clean sheet of paper, you may consider using Redefinition Views as the query object, and keeping the underlying tables more generic. In other words, rather than using _1 suffixes maybe you can use more common or global names that will allow you to keep the tables within some set naming standards (and not have slightly ad hoc _1 version indicators).

The idea therefore is to work with a level of indirection to the query users... if that makes sense.

So What?

Why is this so interesting? Didn't you test this all anyways in your test environment? Yes, all of that is true, as we all test our OLTP app upgrade in the lab. The point is that with this edition based approach I can run the ETL process continuously, everyone keeps on querying without any downtime, and then everything is all of a sudden changed with no downtime for the users. That means no interruption on the business at all... that is what you need when you are running an active data warehouse (or should we call this OLDW for On-Line Data Warehousing?).

More Resources

Most of the above scripts and a different example can actually be found in the user guides here and here. To download the script, click here.

October 15, 2009

Impressions from Openworld

Well, Openworld closed today with a wrap up party under (kind of) sunny skies. If nothing else it was a lot better than the weather on Tuesday. Wow, what a storm! Luckily the tents were all up and I think everyone just enjoyed the show.

On the data warehousing front, I think this was a great show. For anyone - like me - mostly interested in DW related technology and customers there was a lot to enjoy. We had quite a few sessions where Database machine customers shared their stories. Four of them shared their (top 10) lessons learned while implementing a Database machine. Six of them talked about their experiences on a customer panel. Various other sessions discussed how Database machine changes the data warehousing game. All in all, I think no one should doubt that this is a product very much in demand.

Once my feet are all happy again I will update the blog and give all of you the links to the customer presentations.

That was also shown at the keynote hall where we had some pretty cool demonstrations on an actual V2 Database machine. The demogrounds had a nice steady flow of folks looking for more technical details and we got quite busy discussing what the machine can do. Lots of interest in Data Warehousing, but also a lot of interest around Consolidation and of course OLTP deployments.

A lot more happened at the show, and I'm sure the news outlets covered most of the things Larry Ellison talked about ($10 Million anybody?).

For those looking for some light entertainment, but with a serious message, have a look at the speech by Governor Arnold Schwarzenegger (here).

For now, it is time to take the shoes off and enjoy a quite evening in front of a TV... See you all next year!

October 8, 2009

If you are coming to OpenWorld...

A quick update on how to get organized at OpenWorld. The various product teams have created focus area booklets that list all the presentations related to a specific topic and these are now available for download from the OpenWorld site. For the data warehousing area we have created the following PDF booklets to help you:


In addition there is a whole series of subject specific PDF booklets that you can download from here:

  • Cloud Computing

  • Data Integration and Oracle GoldenGate

  • Database Application Development

  • Database Security

  • Database Upgrade

  • Database Utilities

  • Enterprise Manager

  • Grid Computing

  • Green/Environmental Sustainability

  • High Availability

  • Information Management

  • Management and Infrastructure

  • Modernization

  • Oracle Consulting

  • Oracle Multimedia

  • Oracle Spatial and Oracle MapViewer

  • Oracle@Oracle Best Practices

  • Secure Enterprise Search and Text

  • Semantic Technologies

The data warehouse PM team will be in demo grounds all week on the data warehouse demo pod so please come and visit and say hi and let us know if you find the blog useful. We will be based in Moscone West in the database area. A map of the layout of the Moscone West demo grounds is here. Look forward to seeing you all next week.

October 1, 2009

In-Database MapReduce (Map-Reduce)

The Map-Reduce model has become a popular way for programmers to describe and implement parallel programs. These custom map-reduce programs are often used to process a large data set in parallel. This post shows how to implement Map-Reduce Programs within the Oracle database using Parallel Pipelined Table Functions and parallel operations.

The Theory

Pipelined Table Functions were introduced in Oracle 9i as a way of embedding procedural logic within a data flow. At a logical level, a Table Function is a function that can appear in the FROM clause and thus functions as a table returning a stream of rows. Table Functions can also take a stream of rows as an input. Since Pipelined Table Functions are embedded in the data flow they allow data to be 'streamed' to a SQL statement avoiding intermediate materialization in most cases. Additionally, Pipelined Table Functions can be parallelized.

To parallelize a Table Function the programmer specifies a key to repartition the input data. Table Functions can be implemented natively in PL/SQL, Java, and C. You can find more information and examples about Table Functions and the functionality mentioned above at the following URL:

http://download.oracle.com/docs/cd/B10501_01/appdev.920/a96624/08_subs.htm#19677

Pipelined Table Functions have been used by customers for several releases and are a core part of Oracle's extensibility infrastructure. Both external users and Oracle Development have used Table Functions as an efficient and easy way of extending the database kernel.

Examples of table functions being used within Oracle are the implementation of a number of features in Oracle Spatial and Oracle Warehouse Builder. Oracle Spatial usages include spatial joins and several spatial data mining operations. Oracle Warehouse Builder allows end users to leverage Table Functions to parallelize procedural logic in data flows such as the Match-Merge algorithm and other row-by-row processing algorithms.

Step-by-Step Example

All examples are available in plain text in this file: omr.sql.

To illustrate the usage of parallelism, and Pipelined Table Functions to write a Map-Reduce algorithm inside the Oracle database, we describe how to implement the canonical map-reduce example: a word count. For those unfamiliar with the example, the goal of word count is to return all distinct words within a set of documents as well as a count of how often this word occurs within this set of documents.

The procedural code in this word count example is implemented in PL/SQL but, as said before, Oracle allows you to pick your language of choice to implement said procedural logic.

Step 1 - Setting up the Environment

We will be looking at a set of documents, these documents can be either files outside of the database, or they can be stored as Secure Files/CLOB columns within the database. Within this table our documents are stored, effectively reflecting a file system.

In this case we are going to create an table within the database using the following definition:

CREATE TABLE documents (a CLOB)
  LOB(a) STORE AS SECUREFILE(TABLESPACE sysaux);

Each row in this table corresponds to a single document. We populate this table with a very simple corpus resulting in 3 documents with the text shown here:

INSERT INTO documents VALUES ('abc def');
INSERT INTO documents VALUES ('def ghi');
INSERT INTO documents VALUES ('ghi jkl');
commit;

The end result of both the map function and the reduce table function are going to live in a package, keeping the code nice and tidy. To show the steps to be taken we will take snippets from the overall package and show those in the section to follow. The actual package will contain a set of types, which are required for the code to work. All code was tested on Oracle Database 11g (11.1.0.6).

Download the full code here.

The following figures show the package being deployed.

CreatePackageHeader

CreatePackageBody

Step 2 - Creating the Mapper and the Reducer

First we need to create a generic function to "map" (as in map-reduce) or tokenize a document. Note that the goal is not to show the best map function, but how this will work in principle in the database. This specific map function is very basic and better implementations may be found elsewhere. 

You can use the aggregation engine to get the results and only use the mapper. A query and a result would look like this:

FirstSelect

The aggregation is done in SQL, no reducer required.

Of course, you could write your own aggregation Table Function to count the occurrences of words in a document. That is what you would do if you were writing the map-reduce program without leveraging the Oracle aggregation engine as we did before. This aggregation Table Function is the reducer of the map-reduce program unit.

The Table Function specifies that it's input be partitioned by word and could (to use the Oracle execution engine's sort) ask for the data to ordered or clustered by word. We show a sample count program in this post to complete the example.

Step 3 - In-Database Map-Reduce

After you have completed both the mapper and the reducer you are ready to do the full map-reduce in the database. Running a query using this Table Function will give us a parallel workload on external documents, doing what the typical map-reduce programs do.

SecondSelect

Summary

Oracle Table Functions are a proven technology, used by many internal and external parties to extend Oracle Database 11g.

Oracle Table Functions are a robust scalable way to implement Map-Reduce within the Oracle database and leverage the scalability of the Oracle Parallel Execution framework. Using this in combination with SQL provides an efficient and simple mechanism for database developers to develop Map-Reduce functionality within the environment they understand and with the languages they know.

Download the code here: omr.sql. For the example, I ran this in OE (as you can see on the SQL screens). No special privileges required.

September 28, 2009

OLTP And Data Warehouse? How Does That Work?

I recently came across a posting on another website that was questioning some of the messaging associated with our recent launch of the Sun Oracle Database Machine. The implication being made was that an appliance designed to support data warehousing could not support OLTP operations and an appliance designed to support OLTP operations could not be expected to support data warehousing.

Are these two types of operations mutually exclusive? Is it actually possible to deliver an appliance that truly does provide the best of both worlds? More importantly why would you need such an appliance? I would argue that were are rapidly reaching a point where there is very little difference between the types of features and of operations needed to support OLTP operations compared to data warehouse.

Many of today's data warehouses are looking more and more like an OLTP application. This is what is often referred to as "active data warehousing". Active data warehousing is completely different to traditional data warehousing (as supported by the current data warehouse appliance vendors) and is based around three types of operations that overlap and intersect:

  • Data loading in near-real time using trickle feed ETL jobs to update small amounts of data
  • Operational style queries requesting very small amounts of data to support transactional operations that run around the clock
  • Analytical style queries, executed continuously to plan next year's budgets, uncover market trends or predict future growth patterns

To support these types of "active" operations a data warehouse appliance must provide the following:

1) Strong Transactional Heritage
The traditional data warehouse is becoming a thing of the past. Very few customers are looking to build a data warehouse based on single batch updates where the database is taken down over night, or over the weekend, to load new data. An active data warehouse is as the name suggests, active. It is not supposed to be offline. By linking into operational systems the data warehouse becomes a 24*7 extension. In reality an active data warehouse cannot be unavailable. It is also wrong to assume that trickle feeds just deliver small amounts of data at regular intervals. For many customers, such as those in retail, financial organizations or telecommunications this is not the case. The trickle feeds deliver terabytes of data that need to be loaded as quickly a possible so business users can extract competitive advantages as quickly as possible.

Active data warehousing is often seen as "the revenge of OLTP" systems because of the need to combine a strong robust transactional model with data warehouse features within a single database engine. Therefore, this latest announcement of a Sun Oracle Database Machine that supports both OLTP and data warehousing means it provides the perfect platform for delivering active data warehousing.

The traditional data warehouse appliance vendors have no background in transactional processing. Their databases lack a sophisticated locking model so queries struggle to deliver accurate and consistent results during data loads. Typically these appliance databases fail to support active data warehouse style operations because they can only support a single view of the data. Consequently when data is being loaded all readers are locked out of the system until the write operation(s) completes.

If read operations are running when the data load process starts the data load is locked out until the queries all complete. This simplistic model causes severe performance problems for customers looking for the benefits of moving to a truly active data warehouse.

In contrast Oracle has a long and distinguished history of both transactional processing and data warehousing. The world-class robust locking model within the Oracle Database allows both readers and writers to co-exist without blocking each other. Oracle automatically provides read consistency to a query so that all the data that the query sees comes from a single point in time. Oracle can also provide read consistency to all of the queries in a transaction. This means the Oracle Database is ideally suited to deliver a real active data warehouse.
The leading analyst companies such as Gartner, Forrester and IDC recognize the Oracle Database as a market leader in both data warehousing and OLTP.

2) Robust Security Model
Pushing data out to many different OLTP applications and business users means that the issue of data security becomes a critical focus area. There are many areas that have to be considered when attempting to secure a database. Security has become a multidimensional problem. Customers need to consider how best to:

  • Protect their network and media
  • Develop a strong authentication framework
  • Manage internal threats
  • Deliver audit and configuration monitoring
  • Create a central place to manage all users
  • Encrypt and mask sensitive data

In addition, many countries have passed laws that specify how data must be managed and who can access that data. Failure to meet the requirements of this legislation can result in damaging publicity as well as fines or even lengthy jail sentences.

Security within most data warehouse appliance systems is limited. The typical appliance database provides basic user authentication and authorization procedures so users who access the data can only see information to which they have been granted access. In some cases appliance vendors will offer some form of data encryption, which can be applied to a specific column of data. However, the encryption process is not transparent and usually requires changes to both ETL/data loading processes and the query access processes.

The auditing and logging of user activity is invariably limited to basic information such as details of logons and logoffs, number of failed logins, non-typical access periods and non-typical logon sources.

In today's security conscious world these security features are completely inadequate. In fact, most of the appliance vendors have now woken up to this fact and are slowly linking up with a third party vendors in an attempt to bring 21st Century rigor to their security features. The problem for customers is that they have to deal with another software vendor, additional licenses and attempt to broker solutions to support issues where the boundaries of responsibility are unclear.

In contrast Oracle is able to fully secure your data from end to end and across the whole data lifecycle. For over 30 years Oracle has been developing market-leading solutions to help customers secure all information within their data warehouses and OLTP systems. Oracle data security solutions help address numerous government and industry regulations including SOX, PCI, HIPAA, FISMA, JSOX and the European Union Data Privacy Directive.

To help customers secure and defend their data warehouse Oracle has developed its 'Defence-in-Depth Guide' along with its 'Maximum Security Architecture'. This is a series of solutions to help customers address today's top security concerns. The Oracle Maximum Security Architecture provides an easy to follow overview of Oracle's defence-in-depth approach to data security.

Oracle's market leading security and data encryption features are designed to work transparently, minimizing any impact on existing applications while addressing mandatory requirements found in many regulations.

Overall, Oracle provides everything you need to secure the data within both your data warehouse and your OLTP applications. There is no need to contact third party vendors, no need to change your database schemas, no need to change your existing business intelligence tools. Oracle Database security features are designed to be transparent to make customers lives easier and all Oracle security features work transparently with Sun Oracle Database Machine.

3) Strong Resource Management
The other implication made on the website mentioned earlier was that you would have to be idiot to put your operational systems on the same appliance as your data warehouse. But why would the appliance vendors want to stop customers from running OLTP applications along side their data warehouse? Apart from the lack of strong transactional features the appliance vendors typical uses a shared nothing architecture, which is optimized to support pre-defined batch queries that scan large volumes of data. These optimizations do not suit the types of queries that dominate active data warehouse and OLTP operations. Operational queries request small amounts of data but the shared nothing architecture has to run these queries across all nodes within the cluster tying up resources that really are not needed. It does not take many of these operational queries to flood an appliance system and the DBA has little, if any, control over the amount of resource acquired by each query.

The Oracle DBA has complete control over how system resources are used because Oracle's 'shared everything' architecture provides the ability to dynamically optimize each query. Before the query is executed the current workload is examined and the amount of processing power, i.e. degree of parallelism, allocated to the query is adjusted rather than simply allocating the same amount of processing resource every time (as happens with the other appliance vendors). The degree of parallelism is, therefore, optimized for each query and there is no requirement for a minimal degree of parallelism across all nodes. Operations can run in parallel using one, some or all nodes depending on the current workload, the characteristics and importance of the query.

Over the years the Oracle Database has been extended to provide fully automated performance optimization features. These query optimization features have the ability to learn over time, which are the most optimal query plans, and then lock subsequent queries into using those plans to ensure consistent performance.

It is this ability to effectively manage the resources within the Sun Oracle Database Machine that allows Oracle to run both OLTP and data warehouse operations on the same platform. Oracle has a world-class database with strong transactional and data warehouse features, it has world-class resource management features and now with Sun Oracle Database Machine it has a world-class high performance platform.

4) Deliver High Availability
Customers cannot afford to have their active data warehouse or OLTP applications inactive. Without access to the operational systems and/or the data warehouse, revenue and customers can be lost, penalties incurred and bad press generated, all of which can have a lasting effect on both customers and the company's reputation. Building a robust, high availability IT infrastructure to support both a data warehouse and OLTP operations is critical to the success and well being of all enterprises in today's fast moving economy. The data warehouse, like all OLTP systems, is now a mission-critical system.

Building a robust and resilient platform based on a typical data warehouse appliance is very difficult because of its shared nothing architecture. This means data is dedicated to a specific node and if a node fails then the data associated with that node also becomes unavailable. Which by default, implies that the whole system is unavailable.

To resolve this architectural challenge appliance vendors offer a number of different methods for delivering a resilient enterprise data warehouse, each governed by the size of the customer's budget.

In some cases the use of features to deliver a robust and resilient platform can affect performance. For example when using a 'fallback' configuration to protect data, write performance is reduced as data has to be written to both the primary and fallback locations to ensure consistency. This has a knock-on effect in terms of query performance because user requests can be blocked while a write operation is being performed. If that write operation takes longer because of the need to support a fallback copy of the data, business users will be locked out of the system for longer.

Oracle databases operate in highly resilient and recoverable configurations and run thousands of mission-critical systems around the world. Oracle's database is designed to support mission critical environments from the ground up. One of the key parts in providing a highly available solution is Oracle Real Application Clusters (RAC). This is the premier database clustering technology that allows two or more computers (also referred to as "nodes") in a cluster to concurrently access a single shared database. This effectively creates a single database system that spans multiple hardware systems yet appears to the application as a single unified database.

Oracle Automatic Storage Management (ASM) provides protection at the storage layer. With performance and high availability as a primary objective, ASM builds on the principle of stripe and mirror everything. Intelligent mirroring capabilities allow administrators to define two or three way mirrors for the ultimate protection of critical business data.

Furthermore, Oracle has an extremely mature set of fully integrated and resilient solutions that enable customers to deliver highly available data warehousing. Oracle has developed "Maximum Availability Architecture" (MAA) to provide superior data protection and availability by minimizing or eliminating planned and unplanned downtime at all technology stack layers including hardware or software components.

OLTP And Data Warehouse? How Does That Work?
For many appliance vendors it just not an option to run an OLTP application on their servers because of the inherent limitations built in to their database and architecture. Therefore, to go back to the original question: "OLTP And Data Warehouse? How Does That Work?" For Oracle customers the answer is: very nicely thank you!

September 27, 2009

In-Memory Parallel Execution in Oracle Database 11gR2

As promised, the next entry in our 11gR2 explorations is In-Memory Parallel Execution. If you are going to Oracle OpenWorld next month make sure you check out the following session:

Tuesday, October 13 2009 5:30PM, Moscone South Room 308
Session S311420
Extreme Performance with Oracle Database 11g and In-Memory Parallel Execution.

In this session you will get more details and insight from the folks who actually built this functionality! A must see if this is of any interest, so book that ticket now and register!

Down to business, what is "In-Memory Parallel Execution"?

Let's begin by having a quick trip down memory-lane back to Oracle Database 7 when Parallel Execution (PX) was first introduced. The goal of PX then and now is to reduce the time it takes to complete a complex SQL statement by using multiple processes to go after the necessary data instead of just one process. Up until now these parallel server processes, typically by-passed the buffer cache and read the necessary data directly from disk. The main reasoning for this was that the objects accessed by PX were large and would not fit into the buffer cache. Any attempt made to read these large objects into the cache would have resulted in trashing the cache content.

However, as hardware systems have evolved; the memory capacity on a typical database server have become extremely large. Take for example the 2 CPU socket Sun server being used in new the Sun Oracle Database Machine. It has an impressive 72GB of memory, giving a full Database Machine (8 database nodes) over ½ a TB of memory. Suddenly using the buffer cache to hold large object doesn't seem so impossible any more.

In-Memory Parallel Execution (In-Memory PX) takes advantage of these larger buffer caches but it also ensures we don't trash the cache.

In-Memory PX begins by determining if the working set (group of database blocks) necessary for a query fits into the aggregated buffer cache of the system. If the working set does not fit then the objects will be accessed via direct path IO just as they were before. If the working set fits into the aggregated buffer cache then the blocks will be distributed among the nodes and the blocks will be affinitzed or associated with that node.

In previous releases, if the Parallel Execution of one statement read part of an object into the buffer cache, then subsequent SQL statement on other nodes in the cluster would access that data via Cache Fusion. This behavior could eventually result in a full copy of that table in every buffer cache in the cluster. In-Memory PX is notably different because Cache Fusion will not be used to copy the data from its original node to another node, even if a parallel SQL statement that requires this data is issued from another node. Instead Oracle uses the parallel server process on the same node (that the data resides on) to access the data and will return only the result to the node where the statement was issued.

The decision to use the aggregated buffer cache is based on an advanced set of heuristics that include; the size of the object, the frequency at which the object changes and is accessed, and the size of the aggregated buffer cache. If the object meets these criteria it will be fragmented or broken up into pieces and each fragment will be mapped to a specific node. If the object is hash partitioned then each partition becomes a fragment, otherwise the mapping is based on the FileNumber and ExtentNumber.

InMemoryPX_final.jpg

To leverage In-Memory PX you must set the initialization parameter PARALLEL_DEGREE_POLICY to AUTO (default MANUAL). Once this is set, the database controls which objects are eligible to be read into the buffer cache and which object will reside there at any point in time. It is not possible to manual control the behavior for specific statements.

Obviously this post is more of a teaser, for in-depth discussions on this, go to Openworld and/or keep an eye out for a new white paper called Parallel Execution Fundemental in Oracle Database 11gR2 that will be coming soon to oracle.com. This paper not only covers In-Memory PX but Auto-DOP and parallel statement queuing.

Stay tuned for more on 11gR2 coming soon...

September 21, 2009

500GB/sec and Database Machine Generation 2

Last Tuesday we announced the second generation of Database Machine. This second generation of Oracle Exadata is now running on Sun hardware. The premise for Database Machine is still the same: deliver extreme performance systems on commodity hardware with ease of deployment.

The database machine is a prime example (as was the first generation) of software-enabled hardware. The software offers the real value, the hardware is of the shelve stuff allowing a great price point and an easy way to quickly release a next generation system and get the benefits of faster chips and other components. The software allows the easy migration and the extreme benefits.

The Sun Oracle Database Machine comes with some new and very cool Exadata software features, it once again has InfiniBand - generation 2 delivers even higher throughput numbers - and it is now available in smaller configurations.

So what is new here?

For one, the addition of flash into the system is something very compelling and a leap forward in terms of performance and throughput. And yes, that is where the 500GB/sec comes from...

cache_hierarchy.JPG

Effectively what we did in generation 2 is adding a very fast cache into the storage tier of the system, and by doing this created a hierarchy as shown above. The fastest tier is the actual memory in the database nodes, which we increased on the machine. The bottom part of hierarchy is the disk, here we increased the throughput for a whole rack to 21GB/sec. By adding flash cards (not flash drives!) to the storage tier we can leverage this as cache and get the benefits from a scale out strategy. As we scale out the storage, we scale out the flash and the throughput.

The Exadata cache is a smart cache that we carefully manage. If you deem it necessary you can pin objects into the cache as well. Since the Exadata Storage Server actually understands the structure of the data stored, the cache does so too. It is after all managed by the Exadata software. This means that we do not use a regular LRU (Least Recently Used) algorithm, but determine which data is hot and cache these sets when we deem it better to do so.

One distinct difference with the flash you see in traditional storage arrays is that we are not using flash disks in Exadata. We are using PCIe cards. This means we are not constraint by slow disk controllers and can get these massive throughput numbers of 50GB/sec for a full rack database machine.

On top of this, we are introducing Hybrid Columnar Compression with Exadata generation 2. We talked about this already in a previous post around the 11g Release 2 database new features.

In the data warehousing workload (assuming bulk loads for example and lots of querying) we can achieve a 10x compression of the data with almost no impact on query performance. That compression rate allows us to achieve up to 500GB/sec of scan rates from the flash cards.

To put that into perspective, in generation 1 of the Database Machine we achieved up to 14GB/sec of throughput from the disks (in a full rack). In generation 2 we are up to 21 GB/sec, both numbers are uncompressed. Flash gets us to around 50GB/sec. The truly staggering numbers come with that 10x Hybrid Columnar Compression rate... For anyone who has ever run queries on a system, 500 GB/sec is really, really fast!

Storage Indexes

That is not all though. Generation 2 of Exadata also introduces Storage Indexes. A storage index is something more akin to a range partition, but we evaluate this at the storage layer. Sometime this is referred to as a negative index.

What happens is that for each column commonly queried we transparently store the min and max values of that column. We do this for a certain data size e.g. as soon as we finish writing the data and filling up that predefined size we calculate the min and max for the relevant columns. The result is something like this:

storage_index_schematic.JPG

If the user now issues a query asking something like SELECT * FROM TABLE WHERE B<2 the scans will only look for the first set of rows in above picture. Since the minimum value in the second block is 3, no rows matching the query will be in that set of rows. This allows a Storage Index to gives us transparent data elimination without overhead, making the scans more distinctive and therefore faster.

So as you can see, the whole system is faster on all accounts than the already fast generation 1 system. It is also much faster than anything else out there in the market.

Seeing that there is much more news, like the actual family details (half racks, quarter racks and smaller) the offloading of data mining scoring and all the 11g Release 2 details we haven't yet cover, expect quite a few follow-up posts on both 11gR2 and Generation 2 Exadata.

Next is as promised earlier, the 11gR2 in-memory parallel execution

September 16, 2009

Introducing the next generation Database Machine

Yesterday, in a live webcast Larry Ellison introduced the next generation Database Machine. Before we dive into technical details, have a look at the announcement and the information out on the web:

http://www.oracle.com/us/corporate/press/033684 (click here)
http://www.oracle.com/database/database-machine.html (click here)

Most importantly, have a look at the following page which has some excellent overview and introductory papers:

http://www.oracle.com/technology/products/bi/db/exadata/index.html (click here)

Now, there are a number of very important implications of this next generation database machine. The most important one being Flash Cards! Imagine 500GB/sec of throughput from your system...

About

Data Warehousing is a broad topic area. This blog is written by the product management team and tries to shed light on the ins and outs of Oracle Data Warehousing and Oracle Database Machine.

That said, the views expressed on this blog are our own and do not necessarily reflect the views of Oracle.

Powered by
Movable Type and Oracle