Next Generation Data Warehouse Platforms
By Jean-Pierre Dijcks on Nov 09, 2009
Rather than entertain you all with my own theories I decided to be a literature critic and discuss a paper / survey done by TDWI's Philip Russom. The report goes into "what a next generation DW platform should try to address and what it could look like". You can read an Oracle sponsored copy here. Do note that the research was sponsored by a large number of vendors - not just Oracle - and that this is just our way of getting a complementary copy to you...
I think there are a couple of interesting outcomes from the report and I'd like to summarize some of it in the following points:
- You should start thinking about either a new infrastructure or start thinking about how to leverage the existing infrastructure for more advanced data warehousing in the next year or so.
- Poor query performance and the lack of advanced analytics (OLAP, Data Mining, Statistics etc.) and inadequate load performance are the top three issues identified in the report. So look for a platform that will go fast (really fast!) and allows you to use advanced analytics for a large use community.
- More and more companies are looking towards a more real-time (dare I say Right-Time) environment - for at least parts of the DW - in light of integrating DW data with operational systems or in a fancy way striving for operational excellence. I would think this is where the mixed workloads come from as well, which are an interesting trend in the paper. Again, you will want a platform that will allow for these mixed workloads and the trickle feeds while reading from the DW.
- Hardware evolutions do drive platforms in that the capacity increases, the price of that capacity decreases and that storage media change to faster media over the next couple of years. Think flash, think memory, think (maybe even) solid state disks.
- Private Clouds - inline with the previous statements - are expected to be seen as an infrastructure. Inline with the big boss, I would prefer to call this a compute grid rather than referring to water vapor too much... But it means that low cost small commodity compute resources racked together and used across workloads (consolidation anyone?) are something in your near future...
As a true critic, I of course have to say something sour about the report... my sour grape about this report is the question as to whether you would like a DBMS built for transactions as your DW database... Not quite sure what a survey on DW workloads expects to get as an answer. It did surprise me that the DBMS built for DW has flat growth and moderate to good commitment. My conclusion is that this is not necessarily a great discussion to have, as neither seem to be interesting to too many people. I think people focus on finding a platform that qualifies by solving their pain points and/or the above points.
And then of course, to keep on the sour face, the fun discussion on MPP. I think that everyone agrees that large operational data stores (3NF like schemas in many cases) benefit from parallel processing capacities. What we are seeing is that more and more people need more than just full table scans on these schemas. In light of the above trends (and there is more on those in the report) people require multiple workloads on these systems, read consistency without readers blocking writers etc. In other words, a next generation platform needs to deal with both massively parallel queries and with single transaction level reads and writes.
A discussion around MPP versus SMP versus OLTP DB versus DW DB is not very useful I would think. It should be on the issues and merits of a DBMS and platform not on a theoretical argument. And I think most of the report does bring up the right points you should be thinking about while designing what the platform should be doing. I just think that people should keep in mind that theory is not always the best lead argument. Keith has written a long post on a related topic. One other thing, many new DBMS systems have sprung into the DW market, many just featuring a single core feature (compression, columnar storage, in-memory etc.). For a critical system that sees mixed workloads and lots of data, these point solutions may not be the right thing to base an architecture on...
I was also surprised by the Data Quality entry in Good Growth - Good Commitment. While I think that more and more people are paying attention, I still see this as the orphan in the DW environment. Lots of talk, not much action. I just wish it was more action and less talk, because, come on, this is really a problem! So I hope the report is spot on there, but I'm a skeptic (nothing to blame the report for, just me being in this business not seeing too much happening in DQ).
Ok, be done now with the sulking :-)
Couple of things around some new trends and things in Oracle that would help with building an infrastructure for the next generation data warehouse:
- Performance => Sun Oracle Database Machine. Yes, it really is fast!
- In-memory processing => Oracle now has (11gR2) In-Memory Parallel Execution. More about this can be read in Maria's excellent post here.
- In-Database Analytics => As the report says in Exadata V2 and Oracle 11gR2 we are now offloading data mining model scoring to the storage side of the house, which allows us to embed mining models into more and more operational systems and get online (direct) feedback on transactions. We also have for years moved more and more OLAP and Stats functionality into the engine
- Real-time data warehousing => First and foremost the read consistency model introduced in Oracle 4 (this is not a typo...) allowing readers to see consistent data during writes, secondly, the just completed acquisition of Golden Gate and the ETL capabilities (like streams) in Oracle allow for very nice real time data feeds. Oracle's MAA architecture allows us to be up and running 24*7 on commodity hardware and deliver an online experience to all customers...
- Cloud computing => see the in-database mapreduce post here.
- Appliances => Sun Oracle Database Machine
So, rather than finding point solutions for all of these challenges as one of the conclusions shows, maybe a solution would be to choose a platform that actually allows you to do what you need from a single vendor... If you are really thinking operational BI, real time DW it is hard to imagine how you are going to do this with multiple DBMS solutions. If the analytics run on a secondary system, you have just created a delay as you now need to update the EDW and then this analytics system. This simply means you cannot actually do real time data warehousing... Loading just to the analytics system means that you are back into data silos.
Concluding, I think most conclusions of the report are spot on. Think Big Data, understand the business drivers, plan for your systems etc. Not so sure a set of separate point solutions is what you need if you are going to do real time data warehousing...
Anyways, do read the report! I think it gives you good food for thought and a great insight in the trends you will be dealing with in your near future.