RTD, Big Data and Map Reduce algorithms

I was recently asked to compare and contrast the way that RTD processes its learning with Map Reduce Big Data systems such as Hadoop. The question is very relevant, and in reality you could implement RTD Learning using Map Reduce in most cases without significant difference in functionality. Nevertheless, there are a few interesting differences.

RTD learning is incremental. As the decision nodes generate data to be learned, the learning server learns from this data, incrementally updating its models. A Map Reduce implementation would accumulate all the data and periodically use several nodes to learn in parallel on portions of the data (Map) and then accumulate that knowledge into a single view of the truth (Reduce). This implementation would be non-incremental, with the full data processed in each batch to produce a brand new model.
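The contrast can be sketched in a few lines of Python. This is only an illustration, assuming a toy model that keeps counts of (attribute value, outcome) pairs; the names `learn_incrementally` and `retrain_batch` are hypothetical and do not correspond to any RTD or Hadoop API.

```python
from collections import Counter

def learn_incrementally(model, record):
    """Update the existing model in place from one new record."""
    attribute_value, outcome = record
    model[(attribute_value, outcome)] += 1
    return model

def retrain_batch(all_records):
    """Build a brand new model by re-reading the full stored history."""
    model = Counter()
    for attribute_value, outcome in all_records:
        model[(attribute_value, outcome)] += 1
    return model

# Hypothetical records: (attribute value, outcome)
history = [("gold", "accepted"), ("silver", "declined"), ("gold", "accepted")]

# Incremental style: each record touches the model once, as it arrives.
incremental_model = Counter()
for record in history:
    learn_incrementally(incremental_model, record)

# Batch style: every refresh reprocesses the whole history.
batch_model = retrain_batch(history)
```

Both styles arrive at the same counts here. The difference is in the work done per refresh: the incremental version only looks at new records, while the batch version rereads everything.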

If enough nodes are used together with fast storage, and the volume of data is not extremely high, then the latency in producing a model could be managed, with new models produced in a matter of minutes. The computing resources needed to keep these models fresh would, however, be much higher than the current requirements of a typical RTD implementation. For example, a single 2-core node dedicated to learning can cope with a volume of more than 20 million records per day. If the time window used is a quarter, then the total number of records for a model is about 3 billion, considering that each model learns for 2 time windows. As fast as Map Reduce may be, processing three billion records with a few hundred attributes each (for two overlapping model time windows) would probably take quite a long time.
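As a back-of-the-envelope check of that volume, the arithmetic can be written out as below. The 91-day quarter is an assumption made for this sketch; with it, the total lands in the same ballpark as the figure of about 3 billion quoted above, and the exact number depends on the daily volume and the number of days counted.

```python
RECORDS_PER_DAY = 20_000_000  # daily volume quoted in the text
DAYS_PER_QUARTER = 91         # assumption: a quarter of about 13 weeks
WINDOWS_PER_MODEL = 2         # each model learns for two time windows

records_per_window = RECORDS_PER_DAY * DAYS_PER_QUARTER
records_per_model = records_per_window * WINDOWS_PER_MODEL

print(f"records per time window: {records_per_window / 1e9:.2f} billion")
print(f"records per model:       {records_per_model / 1e9:.2f} billion")
```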

Is there an alternative? Is it possible to use the Map Reduce principles to process data incrementally, similarly to what RTD does?

I believe it is. The batch-oriented Map Reduce principles can be modified and extended to work on streams of data instead of stored shards of data. For example, there could be several Map nodes that each continuously accumulate the learning from a number of records, in parallel, and pass the accumulated learning to a consolidating Reduce node that incrementally maintains the models. The number of records each Map node accumulates before summarizing determines how long it takes for new data to affect the models, which provides interesting configuration options to optimize for throughput or latency. Ultimately the tradeoff could be made dynamically, with the Map nodes processing larger windows when they see more load.
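A minimal sketch of this streaming arrangement follows, assuming count-based summaries and a single Map node for brevity (several would run in parallel in practice). `map_node`, `ReduceNode`, and `batch_size` are hypothetical names introduced for illustration, not part of RTD or Hadoop.

```python
from collections import Counter
from itertools import islice

def map_node(stream, batch_size):
    """Yield one partial summary for every `batch_size` records read."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield Counter(batch)  # summary of this micro-batch

class ReduceNode:
    """Maintains the model incrementally by folding in partial summaries."""

    def __init__(self):
        self.model = Counter()
        self.summaries_received = 0

    def consolidate(self, summary):
        self.model.update(summary)  # Counter.update adds the counts
        self.summaries_received += 1

# Hypothetical stream of 100 (attribute value, outcome) records.
events = [("gold", "accepted"), ("silver", "declined")] * 50

def run(batch_size):
    reducer = ReduceNode()
    for summary in map_node(events, batch_size):
        reducer.consolidate(summary)
    return reducer

low_latency = run(batch_size=5)       # many small summaries, fresher model
high_throughput = run(batch_size=50)  # few large summaries, less Reduce work
```

Both settings end with the same model. The smaller batch size sends 20 summaries to the Reduce node and the larger one sends 2, which is the throughput versus latency knob described above.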

An additional consideration is the cost of consolidating the summaries produced by the Map nodes. It needs to be small relative to the resources needed to do the accumulation in the first place, as we do not want the Reduce phase to become the dominant bottleneck. Additionally, it would be possible to have layers of accumulation, where several summaries are consolidated at each layer.
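Layered accumulation could look roughly like the sketch below, again assuming count-based summaries. The function `consolidate_in_layers` and its `fan_in` parameter are hypothetical; the point is only that no single consolidation step has to merge more than `fan_in` summaries.

```python
from collections import Counter

def merge(summaries):
    """Consolidate a group of summaries into one by adding their counts."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total

def consolidate_in_layers(summaries, fan_in):
    """Merge summaries in groups of `fan_in`, layer by layer, until one remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    layer = list(summaries)
    layers_used = 0
    while len(layer) > 1:
        layer = [merge(layer[i:i + fan_in]) for i in range(0, len(layer), fan_in)]
        layers_used += 1
    model = layer[0] if layer else Counter()
    return model, layers_used

# Eight hypothetical Map summaries, consolidated two at a time: 8 -> 4 -> 2 -> 1.
map_summaries = [Counter({("gold", "accepted"): n}) for n in range(1, 9)]
model, layers_used = consolidate_in_layers(map_summaries, fan_in=2)
```

A larger `fan_in` means fewer layers but more work per consolidating node, so the choice depends on where the bottleneck is expected.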

The fact that RTD models are naturally additive facilitates the distribution of work among the Map nodes.
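To illustrate what additivity buys, the sketch below uses simple counts as a stand-in for an additive model; this is an assumption for illustration and not a description of RTD's internal model format. When summaries can be added together, the records can be split across Map nodes in any way and merged in any order, and the result matches a single pass over all the data.

```python
from collections import Counter

def summarize(records):
    """Partial model for one shard: counts of (attribute value, outcome)."""
    return Counter(records)

# Hypothetical records.
records = [
    ("gold", "accepted"),
    ("silver", "declined"),
    ("gold", "declined"),
    ("gold", "accepted"),
    ("bronze", "accepted"),
]

# One node sees everything.
single_pass = summarize(records)

# Three Map nodes each see a round-robin shard of the records.
shards = [records[i::3] for i in range(3)]
partials = [summarize(shard) for shard in shards]

# Adding the partial summaries in either order gives the same model.
merged_forward = sum(partials, Counter())
merged_reversed = sum(reversed(partials), Counter())
```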

In spite of this long discussion of theoretical uses for Map Reduce in the RTD context, it is important to note that with parallel learning and suitable configurations RTD learning, as it stands today, can cope with very high volumes of data.

Comments:

What about using Hadoop as a real-time source for calculation data? HBase streaming, etc.

Posted by Joe Kleinhenz on March 16, 2012 at 07:27 AM PDT #

The problem with using Hadoop in real time is a question of the personality of the different systems. Generally speaking, a Hadoop cluster is not managed as a 24/7 operational system, but more as an analytical back office system. Also, traditional Hadoop works in batch mode, and thus has an impedance mismatch with real-time systems.

One thing you could do is to expose the output from Hadoop, perhaps scores for each customer or some other summarization, in a database, whether relational or NoSQL. Then you use that DB as a buffer between the real-time environment and the Map Reduce cluster.

The alternative of using streaming Hadoop is interesting. Streaming Hadoop is not as developed as the batch-oriented one, and it is not yet clear how real-time it can be. It is not just a matter of enabling technology; you also need incremental algorithms that can work in a streaming environment. Furthermore, if you are streaming then presumably you would need a much smaller cluster: when you process in real time, there is no longer a need for a 100x difference between the rate of production of data and the rate of processing, and a 1:1 ratio is sufficient. RTD by itself is such a streaming system, one that needs just the 1:1 processing capability. RTD is not built with Hadoop technology, but it handles big data all the same.

Posted by Michel Adar on March 29, 2012 at 04:36 AM PDT #

Thank you for this very interesting post! Have you already seen Hadoop + RTD working together in a prod environment? I would be really interested in the feedback :)

Posted by Eric Som on October 30, 2012 at 04:23 AM PDT #

Thank you for this very interesting post! Have you already seen Hadoop + RTD working together in a prod environment? I would be really interested in the feedback :)

Posted by badou38 on October 30, 2012 at 08:57 AM PDT #

About

Issues related to Oracle Real-Time Decisions (RTD). Entries include implementation tips, technology descriptions and items of general interest to the RTD community.
