RTD, Big Data and Map Reduce algorithms
By Michel Adar on Jan 24, 2012
I was recently asked to compare and contrast the way RTD processes its learning with Map Reduce based Big Data systems like Hadoop. The question is very relevant, and in reality you could implement RTD learning using Map Reduce in many or most cases without a significant difference in functionality. Nevertheless, there are a few interesting differences.
RTD learning is incremental. As data to be learned from is generated by the decision nodes, the learning server processes it right away, incrementally updating its models. A Map Reduce implementation would instead accumulate all the data and periodically use several nodes to learn in parallel (Map) on portions of the data, consolidating that knowledge into a single view of the truth (Reduce). This implementation would be non-incremental: the full data set is reprocessed in each batch to produce a brand-new model.
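To make the contrast concrete, here is a minimal sketch in Python. It assumes a toy model whose state is a set of attribute/value counts, which is a simplification for illustration only, not RTD's actual model representation:

```python
from collections import Counter

class IncrementalModel:
    """Toy count-based model, updated one record at a time (RTD-style)."""
    def __init__(self):
        self.counts = Counter()

    def learn(self, record):
        # Fold the record into the statistics; old records are never revisited.
        for attribute, value in record.items():
            self.counts[(attribute, value)] += 1

def batch_relearn(all_records):
    """Batch-style job: rebuild the model from the full data set every time."""
    model = IncrementalModel()
    for record in all_records:   # the entire accumulated history, every run
        model.learn(record)
    return model
```

The incremental learner touches each record once; the batch job pays for the whole history on every refresh.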
With enough nodes and fast storage, and when the volume of data is not extremely high, it is possible to manage the latency of model production so that new models are produced in a matter of minutes. The computing requirements to keep these models fresh, however, would be much higher than the current requirements of a typical RTD implementation. For example, a single 2-core node dedicated to learning can cope with a volume of more than 20 million records per day. If the time window used is a quarter, and each model learns over two overlapping time windows, the total number of records behind a model is about 3.6 billion. As fast as Map Reduce may be, reprocessing 3.6 billion records, each with a few hundred attributes, across two overlapping model time windows would probably take quite a long time.
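The back-of-the-envelope arithmetic behind that figure, using the numbers above:

```python
records_per_day = 20_000_000       # single 2-core node learning throughput
window_days = 90                   # one quarter
windows_per_model = 2              # each model learns over two overlapping windows

records_per_model = records_per_day * window_days * windows_per_model
print(f"{records_per_model:,}")    # 3,600,000,000
```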
Is there an alternative? Is it possible to use the Map Reduce principles to process data incrementally, similar to what RTD does?
I believe it is. The batch-oriented Map Reduce principles can be modified and extended to work on streams of data instead of stored shards of data. For example, several Map nodes could continuously accumulate learning from incoming records in parallel, each passing its accumulated learning to a consolidating Reduce node that incrementally maintains the models (see the sketch below). The number of records a Map node accumulates before emitting a summary determines how quickly new data influences the models, providing an interesting configuration knob for trading off throughput against latency. Ultimately the tradeoff could even be made dynamically, with the Map nodes processing larger windows when they see more load.
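A minimal sketch of this streaming arrangement, reusing the toy count-based model from above. The queue, the SUMMARY_BATCH constant, and the node functions are all illustrative assumptions, not part of RTD or Hadoop:

```python
import queue
from collections import Counter

SUMMARY_BATCH = 10_000      # records folded into each Map summary; a tuning knob

summaries = queue.Queue()   # channel from the Map nodes to the Reduce node

def map_node(records, summaries):
    """Continuously fold incoming records into a partial summary,
    handing it off to the Reduce node every SUMMARY_BATCH records."""
    partial, seen = Counter(), 0
    for record in records:                      # records is a continuous stream
        for attribute, value in record.items():
            partial[(attribute, value)] += 1
        seen += 1
        if seen >= SUMMARY_BATCH:
            summaries.put(partial)              # emit the summary downstream
            partial, seen = Counter(), 0

def reduce_node(summaries, model_counts):
    """Incrementally merge Map summaries into the live model;
    the model is never rebuilt from scratch."""
    while True:
        model_counts.update(summaries.get())    # additive merge of one summary
```

Lowering SUMMARY_BATCH yields fresher models at the cost of more merge traffic to the Reduce node; raising it does the opposite, which is the dynamic tradeoff described above.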
An additional consideration is the cost of consolidating the Map-produced summaries: it needs to be small relative to the resources spent on the accumulation itself, since we do not want the Reduce phase to become the dominant bottleneck. It would also be possible to have layers of accumulation, where several summaries are consolidated before reaching the final Reduce node, as sketched below.
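One hypothetical way to layer the accumulation is a pairwise merge tree, so that no single node has to absorb every summary directly (again a sketch over the toy count-based summaries, not a prescribed design):

```python
from collections import Counter

def consolidate_layered(summaries):
    """Merge a list of Counter summaries pairwise, layer by layer,
    until a single consolidated summary remains."""
    while len(summaries) > 1:
        next_layer = []
        for i in range(0, len(summaries), 2):
            merged = Counter()
            for summary in summaries[i:i + 2]:  # a pair, or a leftover single
                merged.update(summary)
            next_layer.append(merged)
        summaries = next_layer
    return summaries[0]
```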
The fact that RTD models are naturally additive facilitates the distribution of work among the Map nodes: two models learned on disjoint portions of the data can be merged into the model that would have been learned on their union.
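Using the same toy count-based statistics as a stand-in for the model state, additivity looks like this:

```python
from collections import Counter

def learn(records):
    """Toy stand-in for model learning: tally attribute/value counts."""
    counts = Counter()
    for record in records:
        for attribute, value in record.items():
            counts[(attribute, value)] += 1
    return counts

shard_a = [{"age": "young", "response": "yes"}]
shard_b = [{"age": "old", "response": "no"}]

# Learning each shard separately and adding the results yields exactly
# the model that would have been learned on all the data at once.
assert learn(shard_a) + learn(shard_b) == learn(shard_a + shard_b)
```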
In spite of this long discussion of theoretical uses of Map Reduce in the RTD context, it is important to note that, with parallel learning and suitable configuration, RTD learning as it stands today can cope with very high volumes of data.