The ODTUG aftermath… (hadoop, partitions and best practices)
By Jean-Pierre Dijcks on Jul 19, 2010
Pfff, it took me a while to finally post my thoughts on ODTUG this year, and now I'm reading Mike telling us all to sign up for the super-early-bird special for Kscope11... man, I'm getting old and slow as I'm just about to write a bunch of stuff on the 2010 show...
You have to be there next year!
Ok, I am biased: this 2010 show was my 8th in a row, and I think every year is better than the one before. It was quite fun to be back in DC and to watch the World Cup, as we did in 2006. Overall I thought the BI/DW track did well, and I'm always impressed with the speakers that are out there. I ended up doing a few more core database sessions this time around, and the audience is just great for that as well. It is quite fun to talk to developers, as they bring a different angle to the table than the DBAs do. Not better, not worse, just different!
Hadoop with Oracle
Yup, I did do a session on Hadoop, basically discussing how it could fit within the Oracle space. I got the feeling that the audience was just as curious as I was about the whole idea, and it was confirmed that everyone is kind of looking at Hadoop as an add-on to the information they store in their database.
There are clearly still a lot of unknowns in the field about Hadoop and what it really is and does for "mainstream" workloads. The few people who were actively looking at doing something with Hadoop were really focusing on leveraging it as a processing engine for raw data that is not going to live in a database. It did seem like the end results would very much live in a database.
So I think the focus will really be on "integrating" the data streams. Since Hadoop is completely file based, Oracle and Hadoop really do integrate well out-of-the-box today. For one, the external table route is an easy way to grab results from Hadoop jobs and integrate them (without dual storage cost) into an Oracle query. Secondly, the Oracle Grid Engine allows you to actually schedule jobs across systems and leverage nodes as a Hadoop cluster without dedicating the nodes to only that task. Using Grid Engine would (in theory) look like this:
The fun thing is not just that you can re-allocate nodes in your cluster for different tasks, you can also farm out work to your (private) cloud solution if needed. All is done in the Grid Engine.
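To make the external table route mentioned above a bit more concrete, here is a minimal sketch in Python that simply renders the kind of DDL you would run. Everything named here is an assumption for illustration: the table `hadoop_word_counts`, its columns, the directory object `hadoop_out_dir`, and the idea that the job wrote comma-separated `part-*` files to a location the database server can read.

```python
def external_table_ddl(table, columns, directory, files, delimiter=","):
    """Render CREATE TABLE ... ORGANIZATION EXTERNAL DDL for delimited text files.

    All identifiers passed in are hypothetical; adjust them to your own job output.
    """
    column_list = ",\n".join(f"  {name} {sql_type}" for name, sql_type in columns)
    locations = ", ".join(f"'{f}'" for f in files)
    return (
        f"CREATE TABLE {table} (\n{column_list}\n)\n"
        "ORGANIZATION EXTERNAL (\n"
        "  TYPE ORACLE_LOADER\n"
        f"  DEFAULT DIRECTORY {directory}\n"
        "  ACCESS PARAMETERS (\n"
        "    RECORDS DELIMITED BY NEWLINE\n"
        f"    FIELDS TERMINATED BY '{delimiter}'\n"
        "  )\n"
        f"  LOCATION ({locations})\n"
        ")\n"
        "REJECT LIMIT UNLIMITED"
    )


# Assumed example: a word-count job that left two reducer output files behind.
ddl = external_table_ddl(
    table="hadoop_word_counts",
    columns=[("word", "VARCHAR2(200)"), ("occurrences", "NUMBER")],
    directory="hadoop_out_dir",
    files=["part-00000", "part-00001"],
)
print(ddl)
```

Once a table like this exists, a plain SELECT (or a join against regular tables) reads the files in place, which is where the "no dual storage" point comes from.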
Partitioning
Always an interesting session to do... and this time we focused on the "new stuff" in Oracle Database 11g. I got to cover three new partitioning features in some detail:
- Interval partitioning
- REF partitioning
- Virtual Column partitioning
The one we spent the most time on is really Interval partitioning, since it is so widely applicable in almost any workload.
The concept is simple and elegant. As users produce data, the database allocates partitions as they are needed. There is no maintenance required other than setting up the table once for interval partitioning. This makes interval partitioning particularly useful in any case in which you may not have all partitions created up front. Lots of information on this cool scheme can be found in the documentation.
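As a rough illustration of that "allocate as needed" behavior, here is a toy Python model of a monthly interval-partitioned table. It is not Oracle code: the class, the `M2010_07` style partition names (Oracle generates its own system names) and the `sales` table in the comment are all made up for the sketch.

```python
from datetime import date

# The one-time setup in the database would look roughly like this
# (hypothetical table and column names):
#
#   CREATE TABLE sales (sale_date DATE, amount NUMBER)
#   PARTITION BY RANGE (sale_date)
#   INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
#   (PARTITION p0 VALUES LESS THAN (DATE '2010-01-01'));


class IntervalPartitionedTable:
    """Toy model: one pre-created partition, monthly partitions created on insert."""

    def __init__(self, transition: date):
        # Keep the sketch simple: monthly intervals start on the first of a month.
        if transition.day != 1:
            raise ValueError("transition point must be the first day of a month")
        self.transition = transition
        self.partitions = {"P0": []}  # the only partition defined up front

    def _partition_name(self, d: date) -> str:
        if d < self.transition:
            return "P0"
        return f"M{d.year}_{d.month:02d}"  # illustrative name, not Oracle's

    def insert(self, d: date, row) -> str:
        name = self._partition_name(d)
        # setdefault creates the partition the first time a row needs it.
        self.partitions.setdefault(name, []).append(row)
        return name


table = IntervalPartitionedTable(transition=date(2010, 1, 1))
table.insert(date(2009, 12, 31), "old row")     # lands in the pre-created P0
table.insert(date(2010, 7, 19), "odtug row")    # creates the July 2010 partition
table.insert(date(2010, 7, 1), "another row")   # reuses the July 2010 partition
print(sorted(table.partitions))
```

Nobody had to add the July partition ahead of time; it shows up the moment the first July row arrives, and that is the whole point of the feature.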
Best Practices for Data Warehousing
This was the last session(s) on the last day... and still people showed up... It was almost a lesson in speed talking, as Maria and I went through a lot of material in under 3 hours. I guess it is a tribute to the ODTUG crowd that they were all still able to absorb new knowledge after that many days of conference!
While we cannot capture the depth and breadth of the materials we covered at ODTUG (which shows again why you should go next year), you can get a hint of it all over here.
I guess we'd better listen to Mike and go sign up for 2011... and I'd better get onto creating some cool abstracts for both the DW track and the general DB track. See you all in Long Beach!