February 7, 2010

New LinkedIn group for Oracle Data Warehouse Professionals


Just a quick message to say that I have created a new group on LinkedIn for "Oracle Data Warehousing Professionals". The URL for the group is here:

http://www.linkedin.com/groups?gid=2129659

Please feel free to share this link with friends, colleagues etc. I have linked our Data Warehouse and Database Insider blogs to this community so as we post articles on these blogs the articles will be automatically pushed out to the LinkedIn community.

Any comments and feedback let me know.


February 5, 2010

Oracle Positioned in Leaders Quadrant of Gartner's latest EDW MQ

Gartner has positioned Oracle in the Leaders Quadrant of its 2010 "Magic Quadrant (MQ) for Data Warehouse Database Management Systems (DBMS)" (written by Donald Feinberg and Mark A. Beyer, January 28, 2010). This report is widely used by many customers who consider it an authoritative guide which is valuable in making database investment decisions. Of course Oracle has been the leader in both the database and data warehouse markets for a considerable time, as you can see from our advertising:

Gartner's positioning of Oracle as a data warehouse leader within its EDW MQ is only natural. This latest MQ report highlights the main strengths of Oracle's data warehouse strategy and includes recognition for delivery to the market of our Database Machine platform. However, it is important to remember that Oracle has a broad range of data warehouse solutions ranging from fully pre-configured solutions delivered and supported by Oracle - Oracle Database Machine)-, optimized warehouse reference configurations with all the major hardware vendors, and stand-alone Oracle Database, 11g Release 2. No other data warehouse vendor offers this level of portability (although some have seen the error of their ways and are now trying to copy Oracle's approach).

The latest release of Oracle Database, 11g Release 2, contains many optimizations to enhance query performance with minimal intervention from the DBA. This is simply the continuation of a process which started a long time ago and is aimed at reducing both the setup, management and overall TCO. Optimizations have been added to improve refresh times for MVs and cube MVs. It is now much easier to define and maintain these transparent query optimization features and most importantly, they benefit every BI tool on the market.

To help DBAs and database designers create the best partitioning strategy, the partition advisor has been improved to take the guesswork out of partitioning and to help build a database structure that can deliver the very best levels of performance.

To help customers reduce their cost of storage, 11gR2 added columnar compression as part of Exadata. Oracle's database compression features also provide significant improvements in performance. 11g Release 2 now comes with a compression advisor which can be used to estimate the likely levels of compression to help define an optimal data/storage strategy.

Gartner recognized all the work Oracle has done to improve the automation and optimization of the Oracle data warehouse and as Gartner points out, these optimizations are unique to Oracle. No other database provides this level of support and guidance for the DBA to help simplify the manage of workloads within the data warehouse.

Both RAC and ASM are now seen as vital features, and key strengths, for extremely large data warehouses and the highlighting of these features within the MQ report validates our use of these two features within Oracle Database Machine. The use of both RAC and ASM provide the ability to flexibly scale any platform by adding additional servers and/or storage without having to take the data warehouse offline. In addition, RAC and ASM deliver an environment that provides the level of highly availability needed to support the demands of today's mission critical data warehouses.

Overall, this latest EDW MQ is an important and impressive acknowledgment of Oracle's data warehouse strategy. Taking a wider view of data warehousing it is important to understand that there are other factors which need to be considered, although the EDW MQ is definitely the foundation, or cornerstone MQ. There are a lot of other factors to consider when implementing a data warehouse such as ETL tools, data quality tools, data integration tools, BI platforms, enterprise content management etc etc. Therefore, a true leader in the data warehouse space needs to be in the leaders quadrant for all these areas - like Oracle:

This is the reason why Oracle is No1 in data warehousing!


January 25, 2010

Are you a Data Warehouse IT Superhero?

Are you transforming your data warehouse through innovation by using Oracle solutions? Does your company consider you a data warehouse super hero? Have you ever dreamed of landing a role in a major Hollywood blockbuster?

If so, then you could be well on your way to being the Honorary CIO of Stark Industries in the next Iron Man movie, which is due out later this year. Marvel and Oracle are challenging you to enter the "Stark Industries Honorary CIO" contest. Our winner will:

  • Become an Honorary CIO of Stark Industries
  • Attend a premier of Iron Man 2
  • Have their success story appear on Oracle.com

How do I become the Honorary CIO of Stark Industries?

All you have to do is tell us about your Oracle Data Warehouse success story. The deadline for entries is March 15, 2010 and you can apply online by going to the specially designed Marvel and Oracle website, which is here. Obviously there are certain requirements that you will need to meet and all the rules have been posted here. Good luck!

If you do decide to enter please let me know (you can email me at keith.laker@oracle.com) as it would be nice to keep tabs on all the submissions from our data warehouse customers.


January 22, 2010

Database Machine Customer Success Stories

Recently I saw some interesting stories from a few customers, in this case in EMEA and APAC. I won't comment too much on the details as the customers do so in the links here other than saying that Banca Transilvania is a Sun Oracle Database Machine customer (the new V2 system).

SK Telecom case study

Banca Transilvania case study

Enjoy the reading as customers being successful is always fun to read about.

January 18, 2010

Delivering hardware to support a data warehouse cloud strategy

From a hardware perspective, the move to the cloud is mostly being driven by the need to find a better way to utilize hardware resources more effectively and efficiently. Most enterprise data warehouse environments consist of an enterprise data warehouse with usually, but not always, additional data marts spun-off to support key departmental objectives. In addition there are the separate processing engines needed to support the EDW such as data transformation, data quality, metadata repositories, de-duplication etc etc. Each of these applications (EDW, marts, etc etc) tends to run on its own dedicated platform and this results in an inefficient use of resources. In some cases, attempts have been made to share storage services but this creates a whole series of other performance and management problems.

As I stated before, the rush to move everything to the cloud seems to be based on the idea that somehow this new concept is magically going to deliver better hardware utilization. In reality, this can only really be true if the hardware platform used to support a cloud based EDW strategy has the following characteristics:

  • Based on high volume hardware
  • Based on grid processing model
  • Uses intelligent shared storage
  • Based on open standards based operating systems
  • Provide a highly available platform
  • Deliver increased utilization

1) Use of High Volume Hardware
One of the main factors holding back many data warehouses today is their inability to quickly and easily integrate new hardware developments - new CPUs, bigger disks, faster interconnects etc. Every new advance in hardware delivers faster performance for less money. This makes it vital that such developments can incorporated into the data center as soon as possible.

This is especially true in data warehousing where many customers need to store ever-increasing volumes of data as well as supporting an ever-growing community of users running every more complex queries. New innovations such as Intel’s latest Nehalem chipset, interconnect technology from the super-computing industry (InfiniBand), ever expanding SATA/SAS disk storage capacities and the introduction of SSD/flash technology are all vital in terms of delivering increased performance, improving overall efficiency and reducing total costs.

Many customers are being told that the simplest way to access new technology is to off-load some, or all, of their processing to the cloud (at this point I am not differentiating between public and private clouds). The problem is that simply moving to the cloud (public or private) does not mean guarantee access to the latest hardware innovations. In many cases, it is simply a way of masking the use of proprietary hardware (and related software) that is probably well passed its sell-by date.

2) Use of Grid Processing
Most customers are working with large numbers of dedicated servers with storage assigned to each application. This creates a computing infrastructure where resources are tied to specific applications, resulting in an inflexible architecture. This increases cost, power requirements, reduces overall performance, scalability and availability.

The way to resolve these issues is to move to an approach based on the grid. Grid Computing is a virtualizes and pools IT resources, such as compute power, storage and network capacity into a set of shared services that can be distributed and re-distributed as needed. Grid computing provides the required flexibility to meet the changing needs of the business. It is much easier to support short-term special projects for a department if additional resources can quickly and easily provisioned. Placing applications on a grid computing based architecture enables multiple applications to share computing infrastructure, resulting in much greater flexibility, cost, power efficiency, performance, scalability and availability, all at the same time.

Oracle Real Application Clusters (RAC) allows the Oracle Database to run in a grid platform. Nodes, CPUs, storage and memory can all be dynamically provisioned while the system remains online. This makes it easy to maintain service levels while at the same time lowering overall costs through improved utilization. In fact you could consider the "C" in RAC as referring to "Cloud" rather than "Cluster".

Adding additional resources “on-demand” is a key requirement for delivering cloud computing. I would argue that this can be a complicated process within a shared-nothing infrastructure. This makes this type of approach unsuitable for use with a cloud computing strategy. In reality adding something relatively simple such as more storage has a profound impact on the whole environment. The database has to be completely rebuilt to re-distribute data evenly across all the disks. For many vendors adding more storage space also means adding more processing nodes to ensure the hardware remains in balance with the data. This all creates additional downtime for the business, as the whole platform has to go offline while the new resources are added and configured which impacts SLAs.

3) Use of Intelligent Shared Storage
Today’s data warehouse is completely different from yesterday’s data warehouse. Data volumes, query complexity and numbers of users have all increased dramatically and will continue to increase. The pressure to analyze increasing amounts of data will put more strain on the storage layer and many systems will struggle with I/O bottlenecks. With traditional storage, creating a shared storage grid is difficult to achieve because of the inability to prioritize the work of the various jobs and users consuming I/O bandwidth from the storage subsystem. The same occurs when multiple databases share the storage subsystem

Exadata delivers a new kind of storage – intelligent storage - specifically built for the Oracle database. Exadata has powerful smart scan features which reduce the time taken to find the data relevant to a specific query and begin the process of transforming the data into information. At the disk level there is a huge amount of intelligent processing to support a query. Consequently, the result returned from the disk is reduced to the necessary information to satisfy a query, being significantly smaller than compared with a traditional block storage approach (as used by many other vendors such as data warehouse).

The resource management capabilities of Exadata storage can prevent one class of work, or one database, from monopolizing disk resources and bandwidth and ensures user defined SLAs are met when using Exadata storage. With an Exadata system it is possible to identify various types of workloads, assign priority to these workloads, and ensure the most critical workloads get priority.

The tight integration between the storage layer, Exadata, and the Oracle Oracle Database ensures customers to get all the benefits of extreme performance with all the scalability and high availability required to support a “cloud” based enterprise data warehouse.

4) Use of Open Standards Based Operating Systems
The same concept that applies to hardware also applies to operating systems. Customers need to move from proprietary operating systems to one based on open standard, such as Linux. The use of open standards based operating systems also allows new technologies to rapidly incorporated.

Oracle provides its own branded version of Linux – Oracle Enterprise Linux. Oracle is committed to making Linux stronger and better. Oracle works closely with, and contributes to, the Linux community to ensure the Oracle Database runs optimally across all major flavors of Linux. This cooperation extends to the very latest technology supporting both Exadata and Sun Oracle Database Machine: such as the support of InfiniBand as a networking infrastructure. Oracle is working with the Linux community to help standardize the use of InfiniBand interconnects. Oracle has already released the InfiniBand drivers it developed for use with Oracle Database Machine to the open-source community.

With its support for Linux, use of commodity hardware components, intelligent shared storage and grid architecture, Oracle is able to deliver the most open approach to enterprise data warehousing in the market today and support the key elements needed to allow customers to develop a successful cloud based data warehouse strategy.

5) Use of a Highly Available Framework
In a hardware-cloud it is vital that there is no single point of failure. As the number of applications sharing the hardware increases so does the impact of a loss of service. A data warehouse, either inside or outside the cloud, can be subjected to both planned and unplanned outages. There are many types of unplanned system outage such as computer failure, storage failure, human error, data corruption, lost writes, hangs or slow downs and even complete site failure. Planned system outages are the result of needing to perform routine and periodic maintenance operations and new deployments. The key is to minimize the amount of downtime to reduce the impact on productivity, lost revenue, damaged customer relationships, bad publicity, and lawsuits.

A data warehouse built around a shared nothing architecture is vulnerable to the loss of a node or a disk since losing one or both of these items means that a specific portion of the data set is unavailable. As a result queries and/or processes have to be halted until the node/disk is repaired.

A shared everything architecture, such as Oracle’s, is the ideal solution for cloud computing since there is no single point of failure. If a disk or node fails, queries and/or processes are simply serviced from another disk containing a copy of the data from the failed disk or transparently moved to another node. This is achieved without interruptions in service, saving cost and ensuring business continuity.

Exadata has been designed to incorporate the same standards of high availability (HA) customers have come to expect for Oracle products. With Exadata, all database features and tools work just as they do with traditional non-Exadata storage. With the Exadata architecture, all single points of failure are eliminated. Familiar features such as mirroring, fault isolation, and protection against drive and cell failure have been incorporated into Exadata to ensure continual availability and protection of data. Other features to ensure high availability within the Exadata Storage Server are described below

Oracle's Hardware Assisted Resilient Data (HARD) Initiative is a comprehensive program designed to prevent data corruptions before they happen. Data corruptions are very rare, but when they happen, they can have a catastrophic effect on a database, and therefore a business. Exadata has enhanced HARD functionality embedded in it to provide even higher levels of protection and end-to-end data validation for your data.


6) Delivers Increased Utilization
One of the key aims of cloud computing is to increase the utilization of existing hardware. What is actually required is a new approach to hardware that allows applications to be consolidated onto a single, scalable platform. This allows resources (disk, processing, memory) to be shared across all applications and allocated as required. If one particular application needs additional short-term resources for a special project, the infrastructure should be flexible enough to allow those resources to be made available without significantly impacting other applications.

The use of high volume hardware, grid architecture, highly available framework and open standards makes it possible to create a suitable platform for consolidation to support enterprise wide applications.

Now we have the second part of our cloud strategy in place: a hardware platform to support the data warehouse cloud:

The last stage is to review the key software requirements needed to support and develop a successful cloud based data warehouse strategy.


New Data Mining Blog

I just noticed that blogs.oracle.com has a new member discussing Data Mining. It has the obvious address: http://blogs.oracle.com/datamining/

I'm sure Charlie will give you lots of fun, interesting and cool details on data mining. So enjoy.

And just to vent my opinion, according the the TDWI report critiqued earlier, this is a topic to pay attention to. In-database analytics is something that will become more and more important as we see more data being stored. As with many of these things Oracle has been pushing this into the server for years. Now with Exadata V2 we are also pushing down (offloading) data mining functionality into the storage layers. All in all this will be an interesting technology area to keep an eye on!

January 11, 2010

Integrating Hadoop Data with Oracle Parallel Processing

Various industry verticals are seeing vast amounts of data that is stored on file systems. These vast amounts of data are typically data that contains a lot of irrelevant detail and some gems useful for further analysis or enriching other data sources. Despite storing this data outside of the database some customers do want to integrate this data with data stored in the database. The goal of such integration is to extract information that is of value to the business users.

This post describes in detail how to access data stored in a Hadoop cluster from within an Oracle database. Note that we picked Hadoop and HDFS as an example. These strategies apply to other distributed storage mechanisms as well. In this post we describe various access methods and show a concrete example of an implementation of such an access method to access data from both HDFS and from with the Oracle database.

Access Methods for External Hadoop Data

The simplest way to access external files or external data on a file system from within an Oracle database is through an external table. See here for an introduction to External tables.

External tables present data stored in a file system in a table format and can be used in SQL queries transparently. External tables could thus potentially be used to access data stored in HDFS (the Hadoop File System) from inside the Oracle database. Unfortunately HDFS files are not directly accessible through the normal operating system calls that the external table driver relies on. The FUSE (File system in Userspace) project provides a workaround in this case. There are a number of FUSE drivers that allow users to mount a HDFS store and treat it like a normal file system. By using one of these drivers and mounting HDFS on the database instance (on every instance if this was a RAC database), HDFS files can be easily accessed using the External Table infrastructure.

clip_image002

In the picture shown above we are utilizing Oracle Database 11g to implement in-database mapreduce as described in this previous blog post. In general, the parallel execution framework in Oracle Database 11g is sufficient to run most of the desired operations in parallel directly from the external table.

The external table approach may not be suitable in some cases (say if FUSE is unavailable). Oracle Table Functions provide an alternate way to fetch data from Hadoop. Our attached example outlines one way of doing this. At a high level we implement a table function that uses the DBMS_SCHEDULER framework to asynchronously launch an external shell script that submits a Hadoop Map-Reduce job. The table function and the mapper communicate using Oracle's Advanced Queuing feature. The Hadoop mapper en-queue's data into a common queue while the table function de-queues data from it. Since this table function can be run in parallel additional logic is used to ensure that only one of the slaves submits the External Job.

clip_image002[8]

The queue gives us load balancing since the table function could run in parallel while the Hadoop streaming job will also run in parallel with a different degree of parallelism and outside the control of Oracle's Query Coordinator.

An Example leveraging Table Functions

As an example we translated the architecture shown in Figure 2 in a real example. Note that our example only shows a template implementation of using a Table Function to access data stored in Hadoop. Other, possibly better, implementations are clearly possible.

The following diagrams are a technically more accurate and more detailed representation of the original schematic in Figure 2 explaining where and how we use the pieces of actual code that follow:

clip_image002[10]

In step 1 we figure out who gets to be the query coordinator. For this we use a simple mechanism that writes records with the same key value into a table. The first insert wins and will function as the query coordinator (QC) for the process. Note that the QC table function invocation does play a processing role as well.

In step 2 this table function invocation (QC) starts an asynchronous job using dbms_scheduler - the Job Monitor as shown above - that than runs the synchronous bash script on the Hadoop cluster. This bash script, called the launcher in Figure 3 starts the mapper processes (step 3) on the Hadoop cluster.

The mapper processes process data and write to a queue in step 5. In the current example we chose a queue as it is available cluster wide. For now we simply chose to write any output directly into the queue. You can achieve better performance by either batching up the outputs and then moving them into the queue. Obviously you can choose various other delivery mechanisms, including pipes and relational tables.

Step 6 is then the de-queuing process which is done by parallel invocations of the table function running in the database. As these parallel invocations process data it gets served up to the query that is requesting the data. The table function leverages both the Oracle data and the data from the queue and thereby integrates data from both sources in a single result set for the end user(s).

clip_image004

After the Hadoop side processes (the mappers) are kicked off, the job monitor process keeps an eye on the launcher script. Once the mappers have finished processing data on the Hadoop cluster, the bash script finishes as is shown in the illustration above.

The job monitor monitors a database scheduler queue and will notice that the shell script has finished (step 7). The job monitor checks the data queue for remaining data elements, step 8. As long as data is present in the queue the table function invocations keep on processing that data (step 6).

clip_image006

When the queue is fully de-queued by the parallel invocations of the table function, the job monitor terminates the queue (step 9 above) ensuring the table function invocations in Oracle stop. At that point all the data has been delivered to the requesting query.

Sample Code

The query to run a request, the table function, the Launcher script and the mapper are all available for download via this link. We do not claim that this is production ready code of course, it is an illustration on how such integration could work.

January 5, 2010

New whitepaper on Exadata Hybrid Columnar Compression

The Exadata team has just released a new technical white paper on Exadata Hybrid Columnar Compression (EHCC). It is available on OTN and you can download it by clicking here.

To follow on from Jean-Pierre's recent posting on "Increase Performance While Reducing Cost",
the new Exadata compression is probably one of the most important and exciting features of this release of Exadata for data warehouse customers, because it increases performance while significantly reducing the overall cost of storage.

In a data warehouse it would be very useful to apply different rates of compression based on data usage/query patterns. Data that is being actively updated should really be compressed in a different way from historical or archive data. Ideally, as much historical data as possible should be kept online to support adhoc analysis and the development of data mining models. Many database vendors have tried to resolve this growing need - to keep more data online and maintain query performance while reducing overall storage costs. The result to date has been either 1) high rates of compression with poor query performance that impacts on general BI and adhoc queries as well as the development of data mining models or 2) low rate of compression which does not allow large amounts of historical data to be kept online.

EHCC has two compression methods and this makes it very useful for data warehousing. The first compression mode, which is optimized for query performance, provides up to 10x compression ratio, with corresponding improvements in query performance. The second mode provides up to 50x compression ratios with only limited impact on performance. This makes EHCC ideal for managing data that is considered suitable for archiving but also needs to be kept online to support various business queries. This means it is possible to keep more data online for much longer. Having access to a lot more data helps all sorts of data warehouse operations, especially data mining where access to large datasets helps develop more robust models.

Combining EHCC with partitioning provides the ideal solution. Oracle partitioning provides the ability to divide a single table into smaller partitions. These partitions are typically based on date ranges (although other options are available). Using the combination of partitioning and EHCC it is possible to develop something like this:

  • Active Partitions (0-6 months) - up to 10x compression using Exadata Hybrid Columnar Compression - Query Mode.
  • Historical Partitions (7- 24 months) - up to 10x compression using Exadata Hybrid Columnar Compression - Query mode
  • Archive Partitions (25 - 60 months) - up to 50x compressing using Exadata Hybrid Columnar Compression - Archive mode

Defining a compression strategy similar to this provides a lot of benefits. In almost all cases query performance will improve due to improved disk scan rates and a reduction in the number of I/Os. Oracle Database does not require data to be decompressed - it can keep data compressed in memory, thereby further reducing I/Os and providing fast data reads.

So now you can store more data and access that data much faster. For more information visit the Exadata home page on OTN: http://www.oracle.com/technology/products/bi/db/exadata/index.html



December 29, 2009

EDW and the Cloud

Cloud computing is typically presented as a single concept within the media by journalists, analysts and management consultants. The nature of that single concept depends on who is presenting the message. Most of the information being pushed out at the moment concentrates on either the hardware needed to support the cloud or the software needed to build applications. However, these two areas are inter-linked and they also drive a third subject area, which is vital to the success of both the other two areas. I would propose that cloud computing, especially in the case of data warehousing, is three overlapping processes or concepts:

  • DW Process Reengineering
  • Hardware to enable the Cloud
  • Software to enable the Cloud

The term "cloud" should really be used as an umbrella term covering all three elements. While it is probably possible to implement a cloud just in terms of software or hardware, I believe the full benefits of cloud computing can only be realized by considering all three aspects and using the reengineering phase to drive the other two.

DW Process Reengineering (DWPR) Most businesses have been here before. Who can forget the boom in management consultants explaining that businesses needed to reengineer their whole business to focus on core activities? The key point of BPR was that it forced a business to take a fresh look at all their processes and determine how they could best re-construct those processes to improve how they delivered their core products and services. This caused every company to ask some very basic questions about their operations:

  • Who are our customers?
  • Does our mission need to be redefined?
  • Are our strategic goals aligned with our mission?

This started a huge chain of events that spread across every department - except the data warehouse! I would argue that the data warehouse was somehow bypassed or the full impact was never actually considered. This is not surprising given that most IT departments were busy implementing new CRM/ERP applications, re-writing existing OLTP applications and trying to decommission old systems. There was just not enough time to think through all the changes that would be required in the long run to support the data warehouse. In effect, the data warehouse got left behind.

Today, every application within an organization is somehow connected to the data warehouse - not only pushing data in but also pulling data out. As both the volume of data and the richness of that data within the data warehouse have increased so has the need to analyze it in all sorts of different ways. Now, that analysis is being pushed out to a much wider audience which in turn is driving all sorts of new requests for reports and data extracts.

Beyond the every day events within the EDW there are also the "special" projects that demand additional resources for short periods of time. The trouble is: every department is creating more and more "special" projects every year. These projects aim to slice, dice and mash-up data in completely new ways. The IT team is expected to immediately make resources available to these projects and manage them for the duration of the project. Invariably, there are just not enough spare resources available, or resources have to be borrowed from other projects, which in turn have significant impact on those donor projects.

It is time to take a step back and to apply the same basic principles of BPR to data warehousing. What is needed is: data warehouse process reengineering - DWPR.

The need for DWPR Cloud computing is supposed to be the answer to this problem of how to deliver more data with less resources to more users. I would argue that the expectation and the reality of cloud based solutions are quite different. First let us consider the reality of actually delivering a data warehouse. Making data available within the data warehouse so business users can analyze it is driven by a large number of processes. The diagram below (I think I have actually over-simplified it) outlines a typical data warehouse:

cc4.jpgThe traditional data warehouse environment has evolved over time probably starting as a simple data mart or collection of data marts. Now it is built around numerous engines that collect and push data out to multiple disconnected data marts (information silos) running on dedicated hardware. In fact each "engine" probably runs on its own dedicated hardware, so you have OLAP hardware, ETL hardware, data cleansing hardware, data mining hardware etc. This process of evolution has created a plumbing and data movement nightmare that is holding back many customers from evolving to a real-time EDW that can truly drive the business forward.

At the moment just about every vendor and management consultant is pushing the "cloud" as the way forward for every new application, including the data warehouse. The dream that is being peddled is that the cloud will somehow save customers from this plumbing/engine/hardware nightmare that haunts many systems. What many people envisage when they are presented with the concept of a running a data warehouse in the "cloud" is the following:

CC2.jpgAll that complication (database engines, data cleansing, loading data, unloading data, hardware platforms, storage layers) magically disappears and life is wonderful. Unfortunately, the cloud is simply being used as a masking device, a smoke screen, to hide the horrors of all the plumbing and interaction needed to make a real enterprise data warehouse work.

This is where the process reengineering needs to start. The "cloud" should be used as framework for completely rethinking when, where, how and why data is moved around the organization. The aim of any cloud strategy should not be to throw away all existing hardware and then to "rent" the same software in cyber space but I suspect that is what most IT departments will end up delivering to their business.

cc3.jpgThis "new" environment is even worse than the environment running in most customers data centers today because at least all these engines are within the same data center. Moving data into each engine, processing it and then pushing it to the next engine involves relatively minimal distances of network travel. Once you move to a cloud-computing model your Name and Address scrubbing engine could be in Australia and your data-mining engine in Hungary. If performance against large data sets is a major consideration then this might not be the best solution.

This is a golden opportunity for every customer to look at how data is moved, transformed, cleansed, secured, analyzed and presented not just in terms of what is required today but also in terms of how it will be structured, presented and co-exist with other data in the future. What is needed is a basic re-evaluation of the core principles relating to the EDW:

  • What are the business objectives and goals that the EDW will help support?
  • What are the high-level requirements for business use of enterprise data?
  • What business problems will not be addressed by the EDW?
  • What are the risks and how will they be managed?
  • How does the EDW interface with supporting down-stream and up-stream applications?
  • What is delivery strategy for the data?
  • Are the users internal, external or both?
  • How will data be delivered to these users?

Most importantly, is the business ready to build an EDW environment from both the business and technology perspectives? Once you sign-up to the need to re-evaluate all the processes driving your data warehouse the next step is to ensure you build your cloud enabled EDW on the correct platform (hardware and software). So what are the key characteristics of the perfect EDW cloud platform?


December 3, 2009

Increase Performance while Reducing Cost

Exciting times! Well yes the performance piece here is cool and reducing cost is obviously something exciting, but the most exciting thing to happen is that Sinterklaas is about to come and bring us cool presents... December 6th is his birthday and all Dutch folks get nice gifts from him on the evening before his birthday... boy am I glad I'm Dutch :-)

Anywho, you don't want to read about Dutch folklore but about serious matters around performance and reducing cost.

As part of the German Oracle User Group (DOAG) conference in Nuremberg I did spend some time debating a presentation about this exact topic. This post is something that will summarize some of those thoughts. The presentation is actually titled "Top 5 tips on reducing storage cost while improving performance".

The tips are reasonably obvious, so I will just list them and then kind of pick out the interesting pieces:

  1. Think, plan and design
  2. Get appropriate hardware
  3. Design tiered storage
  4. Partition data into smaller chunks
  5. Compress the data

Like I said, these are fairly obvious points, but there is some thought behind this. Step one is something always on the list of course but with the newest hardware - and particularly the price points of components - it will pay of to really think about leveraging new components such as Solid State Disks and most certainly Flash technology.

The below picture shows a little bit of what I mean:

Upward and Downward ILM

In essence what this is saying is that typically the cost is higher for faster storage media (not surprising). However those higher speed media costs are increasingly creeping down making them more feasible. The picture also introduces two variations of Information Life-cycle Management, I call them upward ILM and downward ILM.

Downward ILM is traditional data management where older data is placed on lower cost, slower media as the access to this data is infrequent and a little longer response times don't hurt anyone.

Upward ILM is the opposite. Take high value data, accessed a lot and place it on either faster disks but more interestingly on faster media. Like Flash, or in memory. This will get everyone extreme fast query access to this highly relevant data.

That combination of managing data not just from a cost perspective is really what is encapsulated within those first three points in my top 5 tips. Get hardware that allows you to plan for and execute both Upward and Downward ILM, make sure you have that platform. Then tier the storage so you have appropriate performance characteristics across these storage tiers. Plan for all of this from the start (and create budgets and value propositions etc).

The lesson learned, or message taken from this is really: understand the storage technology and hardware available and leverage this accordingly.

The trick to get this done today is to balance all of this in a manner that delivers performance and still keeps the system to buy reasonably priced. Storing all data in SSD's or Flash is going to be really fast and really, really expense if you choose all SSD. First of all, you will introduce sticker shock and instantly create antagonism against your proposal, secondly, you really do not need that performance for all of your data. So balance this out a little and create a data distribution that leverages the hardware, gets great performance and keeps prices in line with the business value delivered.

To understand this a little better consider the following example (disclaimer - purely an illustration of a concept, no attempt to tell you that this is the case in your environment):

data distribution over media

This is stating that most of your queries only access a small amount of your total data volume. This may be 60% or 40% or some other number, but a typical system will look a lot at current (for example 1 - 7 day old) data.

This would mean that you have all of these media in a setup somehow and manage this.

Software, and this gets us to tip 4 and 5 allows us to change these numbers and create systems that are less error prone (what happens if your analysis is not correct and people need to wait for a small data set stored on slow storage) and faster for a good price.

Partitioning data is traditionally a method to speed up the scanning of data in queries (and there are more usages). In its simplest form this means that if I partition data and therefore scan a lot less data to satisfy my query, I can use slower disk and still get adequate performance. BTW, partitioning means Oracle Partitioning but also any other techniques that improve scan rates. Oracle provides things like Storage Indexes and Smart Scans in Exadata which I think are part of the partitioning scheme.

The long and short of it is that partitioning is a downward mobility technology, meaning it allows you to use cheaper and slower media but still achieve good performance.

Compression is often seen in the same light. I compress data, this means I can scan it faster, so there you go.

However I would argue that compression is actually upward mobility. If 5% of my capacity is in flash and in memory due to cost considerations, I may not be able to load all data for my 60% of queries into the flash and memory spaces. Compression however allows me to - at least with Oracle - put a lot (10x) more data onto those media. That gives me both a huge cost advantage in that I do not need to expand the flash cost and it gives me an incredible speed up by using flash for a lot more queries than I could before applying compression.

Better leverage high performance tiers

The picture above is trying to illustrate this concept. Without spending more on hardware you now have a lot more data available for direct access and high speed response times. This can mean that you can put all data online, but more importantly it can mean that all hot data is in flash and memory and you get ultra fast retrieval rates.

The latest developments in hardware, cost and performance combined with the software advances allow us to really reconsider data warehouse storage and performance characteristics. My conclusion would be that it is important to keep track of these new media and start planning on using them. Apply compression and partitioning schemes to move more data into these high performance tiers and get large performance gains at equal hardware costs.

In this case it pays to play around with what is available and define a strategy around the hardware, the software and the business needs. This way you will be able to deliver better performance at lower overall cost to the business.

About

Data Warehousing is a broad topic area. This blog is written by the product management team and tries to shed light on the ins and outs of Oracle Data Warehousing and Oracle Database Machine.

That said, the views expressed on this blog are our own and do not necessarily reflect the views of Oracle.

Powered by
Movable Type and Oracle