EDW and the Cloud
By KLaker on Dec 28, 2009
Cloud computing is typically presented as a single concept within the media by journalists, analysts and management consultants. The nature of that single concept depends on who is presenting the message. Most of the information being pushed out at the moment concentrates on either the hardware needed to support the cloud or the software needed to build applications. However, these two areas are inter-linked and they also drive a third subject area, which is vital to the success of both the other two areas. I would propose that cloud computing, especially in the case of data warehousing, is three overlapping processes or concepts:
- DW Process Reengineering
- Hardware to enable the Cloud
- Software to enable the Cloud
The term "cloud" should really be used as an umbrella term covering all three elements. While it is probably possible to implement a cloud just in terms of software or hardware, I believe the full benefits of cloud computing can only be realized by considering all three aspects and using the reengineering phase to drive the other two.
DW Process Reengineering (DWPR)
Most businesses have been here before. Who can forget the boom in management consultants explaining that businesses needed to reengineer their whole business to focus on core activities? The key point of business process reengineering (BPR) was that it forced a business to take a fresh look at all its processes and determine how best to reconstruct them to improve the delivery of its core products and services. This caused every company to ask some very basic questions about its operations:
- Who are our customers?
- Does our mission need to be redefined?
- Are our strategic goals aligned with our mission?
This started a huge chain of events that spread across every department - except the data warehouse! I would argue that the data warehouse was somehow bypassed or the full impact was never actually considered. This is not surprising given that most IT departments were busy implementing new CRM/ERP applications, re-writing existing OLTP applications and trying to decommission old systems. There was just not enough time to think through all the changes that would be required in the long run to support the data warehouse. In effect, the data warehouse got left behind.
Today, every application within an organization is somehow connected to the data warehouse - not only pushing data in but also pulling data out. As both the volume of data and the richness of that data within the data warehouse have increased, so has the need to analyze it in all sorts of different ways. Now, that analysis is being pushed out to a much wider audience, which in turn is driving all sorts of new requests for reports and data extracts.
Beyond the everyday events within the EDW there are also the "special" projects that demand additional resources for short periods of time. The trouble is: every department is creating more and more "special" projects every year. These projects aim to slice, dice and mash up data in completely new ways. The IT team is expected to immediately make resources available to these projects and manage them for the duration of the project. Invariably, there are just not enough spare resources available, or resources have to be borrowed from other projects, which in turn has a significant impact on those donor projects.
It is time to take a step back and to apply the same basic principles of BPR to data warehousing. What is needed is: data warehouse process reengineering - DWPR.
The need for DWPR
Cloud computing is supposed to be the answer to this problem of how to deliver more data with fewer resources to more users. I would argue that the expectation and the reality of cloud-based solutions are quite different. First let us consider the reality of actually delivering a data warehouse. Making data available within the data warehouse so business users can analyze it is driven by a large number of processes. The diagram below (I think I have actually over-simplified it) outlines a typical data warehouse:
The traditional data warehouse environment has evolved over time, probably starting as a simple data mart or collection of data marts. Now it is built around numerous engines that collect and push data out to multiple disconnected data marts (information silos) running on dedicated hardware. In fact, each "engine" probably runs on its own dedicated hardware, so you have OLAP hardware, ETL hardware, data cleansing hardware, data mining hardware etc. This process of evolution has created a plumbing and data movement nightmare that is holding back many customers from evolving to a real-time EDW that can truly drive the business forward.
At the moment just about every vendor and management consultant is pushing the "cloud" as the way forward for every new application, including the data warehouse. The dream that is being peddled is that the cloud will somehow save customers from this plumbing/engine/hardware nightmare that haunts many systems. What many people envisage when they are presented with the concept of running a data warehouse in the "cloud" is the following:
All that complication (database engines, data cleansing, loading data, unloading data, hardware platforms, storage layers) magically disappears and life is wonderful. Unfortunately, the cloud is simply being used as a masking device, a smoke screen, to hide the horrors of all the plumbing and interaction needed to make a real enterprise data warehouse work.
This is where the process reengineering needs to start. The "cloud" should be used as a framework for completely rethinking when, where, how and why data is moved around the organization. The aim of any cloud strategy should not be to throw away all existing hardware and then to "rent" the same software in cyberspace, but I suspect that is what most IT departments will end up delivering to their business.
This "new" environment is even worse than the environment running in most customers' data centers today, because at least all these engines are within the same data center. Moving data into each engine, processing it and then pushing it to the next engine involves relatively short network distances. Once you move to a cloud-computing model your Name and Address scrubbing engine could be in Australia and your data-mining engine in Hungary. If performance against large data sets is a major consideration then this might not be the best solution.
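To put some rough numbers on the distance problem, here is a back-of-the-envelope sketch in Python. Every figure in it (the size of the data set, the number of engine-to-engine hops and both bandwidth values) is an assumption I have picked purely for illustration, not a measurement, and the function names are hypothetical. It ignores latency, protocol overhead and compression, so treat the result as an order-of-magnitude comparison only.

```python
# Rough estimate of time spent purely on moving data between engines.
# All constants below are illustrative assumptions, not measurements.

def transfer_seconds(size_gb: float, bandwidth_gbps: float) -> float:
    """Seconds to move size_gb gigabytes over a link of bandwidth_gbps gigabits/s."""
    size_gigabits = size_gb * 8  # 8 bits per byte
    return size_gigabits / bandwidth_gbps


def pipeline_seconds(size_gb: float, hops: int, bandwidth_gbps: float) -> float:
    """Total movement time when the same data set crosses `hops` links one after another."""
    return hops * transfer_seconds(size_gb, bandwidth_gbps)


DATA_SET_GB = 500   # assumed size of a nightly load
HOPS = 3            # assumed: source -> cleansing -> mining -> data mart
LOCAL_GBPS = 10.0   # assumed link speed inside one data center
WAN_GBPS = 0.1      # assumed effective throughput between continents

local = pipeline_seconds(DATA_SET_GB, HOPS, LOCAL_GBPS)
remote = pipeline_seconds(DATA_SET_GB, HOPS, WAN_GBPS)

print(f"same data center: {local / 60:.0f} minutes")   # 20 minutes
print(f"across regions:   {remote / 3600:.1f} hours")  # 33.3 hours
```

With these assumed figures the same three hops go from about twenty minutes to well over a day. The exact numbers will differ for every customer, but the comparison shows why the location of each engine has to be part of the reengineering discussion.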
This is a golden opportunity for every customer to look at how data is moved, transformed, cleansed, secured, analyzed and presented not just in terms of what is required today but also in terms of how it will be structured, presented and co-exist with other data in the future. What is needed is a basic re-evaluation of the core principles relating to the EDW:
- What are the business objectives and goals that the EDW will help support?
- What are the high-level requirements for business use of enterprise data?
- What business problems will not be addressed by the EDW?
- What are the risks and how will they be managed?
- How does the EDW interface with supporting down-stream and up-stream applications?
- What is the delivery strategy for the data?
- Are the users internal, external or both?
- How will data be delivered to these users?
Most importantly, is the business ready to build an EDW environment from both the business and technology perspectives? Once you sign up to the need to re-evaluate all the processes driving your data warehouse, the next step is to ensure you build your cloud-enabled EDW on the correct platform (hardware and software). So what are the key characteristics of the perfect EDW cloud platform?