Recently Keith spent some time talking about the cloud on this blog and I will spare you my thoughts on the whole thing. What I do want to write down is something about the Big Data movement and what I think is the killer app for Big Data...
Where is this coming from, ok, I confess... I spent 3 days in cloud land at the Cloud Connect conference in Santa Clara and it was quite a lot of fun. One of the nice things at Cloud Connect was that there was a track dedicated to Big Data, which prompted me to some extend to write this post.
The most valuable point made in the Big Data track was that Big Data in itself is not very cool. Doing something with Big Data is what makes all of this cool and interesting to a business user!
The other good insight I got was that a lot of people think Big Data means a single gigantic monolithic system holding gazillions of bytes or documents or log files. Well turns out that most people in the Big Data track are talking about a lot of collections of smaller data sets. So rather than thinking "big = monolithic" you should be thinking "big = many data sets".
This is more than just theoretical, it is actually relevant when thinking about big data and how to process it. It is important because it means that the platform that stores data will most likely consist out of multiple solutions. You may be storing logs on something like HDFS, you may store your customer information in Oracle and you may store distilled clickstream information in some distilled form in MySQL. The big question you will need to solve is not what lives where, but how to get it all together and get some value out of all that data.
Nope, sorry, this is not the killer app... and no I'm not saying this because my business card says Oracle and I'm therefore biased. I think language is important, but as with storage I think pragmatic is better. In other words, some questions can be answered with SQL very efficiently, others can be answered with PERL or TCL others with MR.
History should teach us that anyone trying to solve a problem will use any and all tools around. For example, most data warehouses (Big Data 1.0?) get a lot of data in flat files. Everyone then runs a bunch of shell scripts to massage or verify those files and then shoves those files into the database. We've even built shell script support into external tables to allow for this.
I think the Big Data projects will do the same. Some people will use MapReduce, although I would argue that things like Cascading are more interesting, some people will use Java. Some data is stored on HDFS making Cascading the way to go, some data is stored in Oracle and SQL does do a good job there.
As with storage and with history, be pragmatic and use what fits and neither NoSQL nor MR will be the one and only. Also, a language, while important, does in itself not deliver business value. So while cool it is not a killer app...
This is the killer app!
And you are now thinking: "what does that mean?"
Let's decompose that heading.
First of all, analytics. I would think you had guessed by now that this is really what I'm after, and of course you are right. But not just analytics, which has a very large scope and means many things to many people. I'm not just after Business Intelligence (analytics 1.0?) or data mining (analytics 2.0?) but I'm after something more interesting that you can only do after collecting large volumes of specific data.
That all important data is about behavior. What do my customers do? More importantly why do they behave like that? If you can figure that out, you can tailor web sites, stores, products etc. to that behavior and figure out how to be successful.
Today's behavior that is somewhat easily tracked is web site clicks, search patterns and all of those things that a web site or web server tracks. that is where the Big Data lives and where these patters are now emerging. Other examples however are emerging, and one of the examples used at the conference was about prediction churn for a telco based on the social network its members are a part of. That social network is not about LinkedIn or Facebook, but about who calls whom. I call you a lot, you switch provider, and I might/will switch too.
And that just naturally brings me to the next word, vertical. Vertical in this context means per industry, e.g. communications or retail or government or any other vertical.
The reason for being more specific than just behavioral analytics is that each industry has its own data sources, has its own quirky logic and has its own demands and priorities. Of course, the methods and some of the software will be common and some will have both retail and service industry analytics in place (your corner coffee store for example).
But the gist of it all is that analytics that can predict customer behavior for a specific focused group of people in a specific industry is what makes Big Data interesting.
Well, that is going to be interesting. I have not seen much going on in that space and if I had to have some criticism on the cloud connect conference it would be the lack of concrete user cases on big data. The telco example, while a step into the vertical behavioral part is not really on big data. It used a sample of data from the customers' data warehouse.
One thing I do think, and this is where I think parts of the NoSQL stuff come from, is that we will be doing this analysis where the data is. Over the past 10 years we at Oracle have called this in-database analytics. I guess we were (too) early? Now the entire market is going there including companies like SAS.
In-place btw does not mean "no data movement at all", what it means that you will do this on data's permanent home. For SAS that is kind of the current problem. Most of the inputs live in a data warehouse. So why move it into SAS and back? That all worked with 1 TB data warehouses, but when we are looking at 100TB to 500 TB of distilled data...
As it is still early days with these systems, I'm very interested in seeing reactions and thoughts to some of these thoughts...