Following on from the previous post, we continue discussing the fundamental capabilities that an end-to-end big data discovery product needs in order to allow anyone (not just highly skilled 'techies') to turn raw data in Hadoop into actionable insight. So far we have covered 1) the ability to 'find' relevant data in Hadoop and 2) the ability to 'explore' a new data set to understand its potential. We now turn our focus to:
3. Transform (to make big data better). We have already discussed that data in Hadoop typically isn't ready for analytics because it needs changing in some way first. Perhaps we need to tease out product names or IDs buried in some text, replace missing values, concatenate fields together, or turn integers into strings and strings into dates. Maybe we want to infer a geographic hierarchy from a customer address or IP address, or a date hierarchy from a single timestamp. The BDD Transform page allows any user to change the data in Hadoop directly, without moving it, and without picking up the phone to call IT and then waiting for ETL tools to get the data ready for them.

Via an Excel-like view of the data, Transform allows users to quickly change data with a simple right-click and preview the results of the transformation before applying the change. More sophisticated data wranglers can leverage a library of hundreds of common transforms to get the data ready. They can even make the data better and richer by adding new data elements extracted from large text fields using term and named entity extraction algorithms. Any transform or enrichment can be previewed before it is applied; once applied, BDD leverages a massively scalable open source data processing framework, Apache Spark, behind the scenes, so transforms can run at scale against data sets in Hadoop containing billions of records. All of this complexity is masked from the user, who can just sit back and wait for the magic to happen.
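BDD's actual transform library isn't shown here, but the kinds of row-level transforms described above (teasing an ID out of free text, filling a missing value, concatenating fields, and inferring a date hierarchy from a timestamp) can be sketched in plain Python. All field names and the `SKU-` pattern below are hypothetical, purely for illustration:

```python
import re
from datetime import datetime

def transform_record(rec):
    """Apply a few typical 'make the data better' transforms to one record."""
    out = dict(rec)
    # Tease out a product ID buried in free text (e.g. "... SKU-1234 ...").
    m = re.search(r"SKU-(\d+)", rec.get("comment", ""))
    out["product_id"] = m.group(1) if m else None
    # Replace a missing value with a default.
    out["region"] = rec.get("region") or "UNKNOWN"
    # Concatenate fields together.
    out["full_name"] = f"{rec['first_name']} {rec['last_name']}"
    # Turn a string into a date, then infer a date hierarchy from it.
    ts = datetime.strptime(rec["ts"], "%Y-%m-%d %H:%M:%S")
    out["year"], out["month"], out["day"] = ts.year, ts.month, ts.day
    return out

record = {
    "first_name": "Ada", "last_name": "Lovelace",
    "region": None,
    "comment": "Customer asked about SKU-1234 availability",
    "ts": "2015-03-17 09:45:00",
}
print(transform_record(record))
```

In a tool like BDD, a function of this shape would be applied to every record in parallel by Spark, rather than looping over rows on a single machine; that is what lets the same logic scale to billions of records.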
In the next and final post we will discuss the final two critical capabilities for an effective big data discovery tool. Until then... please let us know your thoughts!