## Sunday Apr 07, 2013

### Introduction

In the final installment in our series on Hive UDFs, we're going to tackle the least intuitive of the three types: the User Defined Aggregating Function.  While they're challenging to implement, UDAFs are necessary if we want functions for which the distinction between map-side and reduce-side operations is hidden from the user.  Most users writing a query would prefer to focus on the data they're trying to compute, not on which part of the plan is running a given function.

The UDAF also provides a valuable opportunity to consider some of the nuances of distributed programming and parallel database operations.  Since each task in a MapReduce job operates in a bit of a vacuum (e.g. Map task A does not know what data Map task B has), a UDAF has to explicitly account for more operational states than a simple UDF.  We'll return to the notion of a simple Moving Average function, but ask yourself: how do we compute a moving average if we don't have state or order around the data?

As before, the code is available on github, but we'll excerpt the important parts here.

### Prefix Sum: Moving Average without State

In order to compute a moving average without state, we're going to need a specialized parallel algorithm.  For moving average, the "trick" is to use a prefix sum, effectively keeping a table of running totals for quick computation (and recomputation) of our moving average.  A full discussion of prefix sums for moving averages is beyond the length of a blog post, but John Jenq provides an excellent discussion of the technique as applied to CUDA implementations.
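To make the idea concrete before looking at the UDAF classes, here is a minimal sketch of a prefix-sum moving average over a plain array. It is not the code from the repository: the class and method names (`PrefixSumSketch`, `prefixSums`, `movingAverages`) are hypothetical, and it assumes the values are already in period order. Like the `add` method excerpted later in this post, it divides by the full window size even for the first few entries.

```java
import java.util.Arrays;

// Illustrative sketch only; names are hypothetical, not from the post's repository.
class PrefixSumSketch {

    // prefix[i] holds the sum of values[0..i], inclusive.
    static double[] prefixSums(double[] values) {
        double[] prefix = new double[values.length];
        double running = 0.0;
        for (int i = 0; i < values.length; i++) {
            running += values[i];
            prefix[i] = running;
        }
        return prefix;
    }

    // Moving average over the trailing window ending at each index.
    // Entries with fewer than `window` predecessors are still divided
    // by the full window size.
    static double[] movingAverages(double[] values, int window) {
        double[] prefix = prefixSums(values);
        double[] result = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // The window total is the difference of two running totals.
            double windowTotal = (i >= window)
                    ? prefix[i] - prefix[i - window]
                    : prefix[i];
            result[i] = windowTotal / window;
        }
        return result;
    }

    public static void main(String[] args) {
        double[] delays = {2.0, 4.0, 6.0, 8.0, 10.0};
        System.out.println(Arrays.toString(prefixSums(delays)));
        System.out.println(Arrays.toString(movingAverages(delays, 2)));
    }
}
```

Because every window total is a difference of two table entries, inserting late-arriving values only requires re-sorting and rebuilding the running totals, which is what makes the approach workable when rows arrive in no particular order.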

What we'll cover here is the necessary implementation of a pair of classes to store and operate on our prefix sum entry within the UDAF.

```
public class PrefixSumMovingAverage {
    static class PrefixSumEntry implements Comparable
    {
        int period;
        double value;
        double prefixSum;
        double subsequenceTotal;
        double movingAverage;

        public int compareTo(Object other)
        {
            PrefixSumEntry o = (PrefixSumEntry)other;
            if (period < o.period)
                return -1;
            if (period > o.period)
                return 1;
            return 0;
        }
    }
```

Here we have the definition of our moving average class and the static inner class which serves as an entry in our table.  What's important here are some of the variables we define for each entry in the table: the time-index or period of the value (its order), the value itself, the prefix sum, the subsequence total, and the moving average itself.  Every entry in our table requires not just the current value to compute the moving average, but also the sum of the entries in our moving average window.  It's the pair of these two values which allows prefix sum methods to work their magic.

```
    // class variables
    private int windowSize;
    private ArrayList<PrefixSumEntry> entries;

    public PrefixSumMovingAverage()
    {
        windowSize = 0;
    }

    public void reset()
    {
        windowSize = 0;
        entries = null;
    }

    public boolean isReady()
    {
        return (windowSize > 0);
    }
```

The above are simple initialization routines: a constructor, a method to reset the table, and a boolean method on whether or not the object has a prefix sum table on which to operate.  From here, there are 3 important methods to examine: add, merge, and serialize.  The first is intuitive: as we scan rows in Hive, we want to add them to our prefix sum table.  The second and third are important because of partial aggregation.

We cannot say ahead of time where this UDAF will run, and partial aggregation may be required.  That is, it's entirely possible that some values may run through the UDAF during a map task, but then be passed to a reduce task to be combined with other values.  The serialize method will allow Hive to pass the partial results from the map side to the reduce side.  The merge method allows reducers to combine the results of partial aggregations from the map tasks.

```
    @SuppressWarnings("unchecked")
    public void add(int period, double v)
    {
        // Add a new entry to the list and update table
        PrefixSumEntry e = new PrefixSumEntry();
        e.period = period;
        e.value = v;
        entries.add(e);
        // do we need to ensure this is sorted?
        //if (needsSorting(entries))
        Collections.sort(entries);
        // update the table
        // prefixSums first
        double prefixSum = 0;
        for (int i = 0; i < entries.size(); i++)
        {
            PrefixSumEntry thisEntry = entries.get(i);
            prefixSum += thisEntry.value;
            thisEntry.prefixSum = prefixSum;
            entries.set(i, thisEntry);
        }
```

The first part of the add task is simple: we add the element to the list and update our table's prefix sums.

```
        // now do the subsequence totals and moving averages
        for (int i = 0; i < entries.size(); i++)
        {
            double subsequenceTotal;
            double movingAverage;
            PrefixSumEntry thisEntry = entries.get(i);
            PrefixSumEntry backEntry = null;
            if (i >= windowSize)
                backEntry = entries.get(i-windowSize);
            if (backEntry != null)
            {
                subsequenceTotal = thisEntry.prefixSum - backEntry.prefixSum;
            }
            else
            {
                subsequenceTotal = thisEntry.prefixSum;
            }
            movingAverage = subsequenceTotal/(double)windowSize;
            thisEntry.subsequenceTotal = subsequenceTotal;
            thisEntry.movingAverage = movingAverage;
            entries.set(i, thisEntry);
        }
```

In the second half of the add function, we compute our moving averages based on the prefix sums.  It's here you can see the hinge on which the algorithm swings: thisEntry.prefixSum - backEntry.prefixSum -- that offset between the current table entry and its nth predecessor makes the whole thing work.

```
    public ArrayList<DoubleWritable> serialize()
    {
        ArrayList<DoubleWritable> result = new ArrayList<DoubleWritable>();

        result.add(new DoubleWritable(windowSize));
        if (entries != null)
        {
            for (PrefixSumEntry i : entries)
            {
                result.add(new DoubleWritable(i.period));
                result.add(new DoubleWritable(i.value));
            }
        }
        return result;
    }
```

The serialize method needs to package the results of our algorithm to pass to another instance of the same algorithm, and it needs to do so in a type that Hadoop can serialize.  In the case of a method like sum, this would be relatively simple: we would only need to pass the sum up to this point.  However, because we cannot be certain whether this instance of our algorithm has seen all the values, or seen them in the correct order, we actually need to serialize the whole table.  To do this, we create a list of DoubleWritables, pack the window size at its head, and then each period and value.  This gives us a structure that's easy to unpack and merge with other lists with the same construction.
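The flat layout is easier to see with plain Java types. The sketch below is an illustration only: it uses `List<Double>` in place of Hadoop's `DoubleWritable` so it runs without Hadoop on the classpath, and the class and helper names (`FlatTableSketch`, `pack`, `period`, `value`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; Double stands in for DoubleWritable.
class FlatTableSketch {

    // Layout: [windowSize, period0, value0, period1, value1, ...]
    static List<Double> pack(int windowSize, int[] periods, double[] values) {
        List<Double> flat = new ArrayList<>();
        flat.add((double) windowSize);
        for (int i = 0; i < periods.length; i++) {
            flat.add((double) periods[i]);
            flat.add(values[i]);
        }
        return flat;
    }

    // The window size always sits at the head of the list.
    static int windowSize(List<Double> flat) {
        return (int) flat.get(0).doubleValue();
    }

    // Everything after the head comes in (period, value) pairs.
    static int entryCount(List<Double> flat) {
        return (flat.size() - 1) / 2;
    }

    static int period(List<Double> flat, int entry) {
        return (int) flat.get(1 + 2 * entry).doubleValue();
    }

    static double value(List<Double> flat, int entry) {
        return flat.get(2 + 2 * entry);
    }

    public static void main(String[] args) {
        List<Double> flat = pack(4, new int[] {1, 2}, new double[] {10.5, 20.25});
        System.out.println(flat);
        System.out.println(period(flat, 1) + " -> " + value(flat, 1));
    }
}
```

Note that only the raw periods and values travel between tasks. The prefix sums and moving averages are derived columns, so the receiving side rebuilds them after merging.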

```
    @SuppressWarnings("unchecked")
    public void merge(List<DoubleWritable> other)
    {
        if (other == null)
            return;

        // if this is an empty buffer, just copy in other
        // but deserialize the list
        if (windowSize == 0)
        {
            windowSize = (int)other.get(0).get();
            entries = new ArrayList<PrefixSumEntry>();
            // we're serialized as period, value, period, value
            for (int i = 1; i < other.size(); i+=2)
            {
                PrefixSumEntry e = new PrefixSumEntry();
                e.period = (int)other.get(i).get();
                e.value = other.get(i+1).get();
                entries.add(e);
            }
        }
```

Merging results is perhaps the most complicated thing we need to handle.  First, we check the case in which there was no partial result passed -- just return and continue.  Second, we check to see if this instance of PrefixSumMovingAverage already has a table.  If it doesn't, we can simply unpack the serialized result and treat it as our window.

```
        // if we already have a buffer, we need to add these entries
        else
        {
            // we're serialized as period, value, period, value
            for (int i = 1; i < other.size(); i+=2)
            {
                PrefixSumEntry e = new PrefixSumEntry();
                e.period = (int)other.get(i).get();
                e.value = other.get(i+1).get();
                entries.add(e);
            }
        }
```

The third case is the non-trivial one: if this instance has a table and receives a serialized table, we must merge them together.  Consider a Reduce task: as it receives outputs from multiple Map tasks, it needs to merge all of them together to form a larger table.  Thus, merge will be called many times to add these results and reassemble a larger time series.

```
        // sort and recompute
        Collections.sort(entries);
        // update the table
        // prefixSums first
        double prefixSum = 0;
        for (int i = 0; i < entries.size(); i++)
        {
            PrefixSumEntry thisEntry = entries.get(i);
            prefixSum += thisEntry.value;
            thisEntry.prefixSum = prefixSum;
            entries.set(i, thisEntry);
        }
```

This part should look familiar; it's just like the add method.  Now that we have new entries in our table, we need to sort by period and recompute the moving averages.  In fact, the rest of the merge method is exactly like the add method, so we might consider putting sorting and recomputing in a separate method.
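One possible shape for that shared method is sketched below. This is a suggestion rather than code from the repository: the `Entry` class is a trimmed-down, hypothetical stand-in for `PrefixSumEntry`, and `sortAndRecompute` is an invented name. It folds the two loops into a single pass, which works because every earlier entry already has its prefix sum by the time we need it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; Entry is a simplified stand-in for PrefixSumEntry.
class RecomputeSketch {

    static class Entry implements Comparable<Entry> {
        final int period;
        final double value;
        double prefixSum;
        double movingAverage;

        Entry(int period, double value) {
            this.period = period;
            this.value = value;
        }

        @Override
        public int compareTo(Entry other) {
            return Integer.compare(period, other.period);
        }
    }

    // Both add and merge could end by calling this: order the table by
    // period, then rebuild the derived columns from scratch.
    static void sortAndRecompute(List<Entry> entries, int windowSize) {
        Collections.sort(entries);
        double running = 0.0;
        for (int i = 0; i < entries.size(); i++) {
            Entry e = entries.get(i);
            running += e.value;
            e.prefixSum = running;
            double windowTotal = (i >= windowSize)
                    ? running - entries.get(i - windowSize).prefixSum
                    : running;
            e.movingAverage = windowTotal / windowSize;
        }
    }

    public static void main(String[] args) {
        List<Entry> entries = new ArrayList<>();
        // Entries arrive out of order, as they might from separate map tasks.
        entries.add(new Entry(3, 6.0));
        entries.add(new Entry(1, 2.0));
        entries.add(new Entry(2, 4.0));
        sortAndRecompute(entries, 2);
        for (Entry e : entries) {
            System.out.println(e.period + ": " + e.movingAverage);
        }
    }
}
```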

### Orchestrating Partial Aggregation

We've got a clever little algorithm for computing moving average in parallel, but Hive can't do anything with it unless we create a UDAF that understands how to use our algorithm.  At this point, we need to start writing some real UDAF code.  As before, we extend a generic class, in this case GenericUDAFEvaluator.

```
    public static class GenericUDAFMovingAverageEvaluator extends GenericUDAFEvaluator {

        // input inspectors for PARTIAL1 and COMPLETE
        private PrimitiveObjectInspector periodOI;
        private PrimitiveObjectInspector inputOI;
        private PrimitiveObjectInspector windowSizeOI;

        // input inspectors for PARTIAL2 and FINAL
        // list for MAs and one for residuals
        private StandardListObjectInspector loi;
```

As in the case of a UDTF, we create ObjectInspectors to handle type checking.  However, notice that we have inspectors for different states: PARTIAL1, PARTIAL2, COMPLETE, and FINAL.  These correspond to the different states in which our UDAF may be executing.  Since our serialized prefix sum table isn't the same input type as the values our add method takes, we need different type checking for each.
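As a quick reference for the four modes, the sketch below restates which input each one consumes and which output it produces. The mode names and the methods mentioned in the comments follow Hive's `GenericUDAFEvaluator.Mode`; the `ModeSketch` class and its helper methods are hypothetical and exist only so the example runs without Hive.

```java
// Illustrative sketch only; this enum mirrors the names in Hive's
// GenericUDAFEvaluator.Mode but is not the Hive class itself.
class ModeSketch {

    enum Mode { PARTIAL1, PARTIAL2, FINAL, COMPLETE }

    // PARTIAL1 and COMPLETE see the original rows (iterate is called);
    // PARTIAL2 and FINAL see serialized partial results (merge is called).
    static boolean readsRawRows(Mode m) {
        return m == Mode.PARTIAL1 || m == Mode.COMPLETE;
    }

    // PARTIAL1 and PARTIAL2 hand off a partial result (terminatePartial);
    // FINAL and COMPLETE produce the user-visible value (terminate).
    static boolean emitsPartialResult(Mode m) {
        return m == Mode.PARTIAL1 || m == Mode.PARTIAL2;
    }

    static String describe(Mode m) {
        String in = readsRawRows(m) ? "raw rows" : "partial results";
        String out = emitsPartialResult(m) ? "partial result" : "final result";
        return m + ": " + in + " -> " + out;
    }

    public static void main(String[] args) {
        for (Mode m : Mode.values()) {
            System.out.println(describe(m));
        }
    }
}
```

This is why the evaluator needs two sets of input inspectors and two possible output inspectors: the input side depends on `readsRawRows`, and the output side on `emitsPartialResult`.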

```
        @Override
        public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {

            super.init(m, parameters);

            // initialize input inspectors
            if (m == Mode.PARTIAL1 || m == Mode.COMPLETE)
            {
                assert(parameters.length == 3);
                periodOI = (PrimitiveObjectInspector) parameters[0];
                inputOI = (PrimitiveObjectInspector) parameters[1];
                windowSizeOI = (PrimitiveObjectInspector) parameters[2];
            }
```

Here's the beginning of our overridden initialization function.  We check the parameters for two modes, PARTIAL1 and COMPLETE.  Here we assume that the arguments to our UDAF are the same as the user passes in a query: the period, the input, and the size of the window.  If the UDAF instance is consuming the results of our partial aggregation, we need a different ObjectInspector.  Specifically, this one:

```
            else
            {
                loi = (StandardListObjectInspector) parameters[0];
            }
```

Similar to the UDTF, we also need type checking on the output types -- but for both partial and full aggregation. In the case of partial aggregation, we're returning lists of DoubleWritables:

```
            // init output object inspectors
            if (m == Mode.PARTIAL1 || m == Mode.PARTIAL2) {
                // The output of a partial aggregation is a list of doubles representing the
                // moving average being constructed.
                // the first element in the list will be the window size
                //
                return ObjectInspectorFactory.getStandardListObjectInspector(
                    PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
            }
```

But in the case of FINAL or COMPLETE, we're dealing with the types that will be returned to the Hive user, so we need to return a different output.  We're going to return a list of structs that contain the period, moving average, and residuals (since they're cheap to compute).

```
            else {
                // The output of FINAL and COMPLETE is a full aggregation, which is a
                // list of structs of DoubleWritables, one per period, holding the
                // period, the moving average, and the residual.

                ArrayList<ObjectInspector> foi = new ArrayList<ObjectInspector>();
                foi.add(PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
                foi.add(PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
                foi.add(PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
                ArrayList<String> fname = new ArrayList<String>();
                fname.add("period");
                fname.add("moving_average");
                fname.add("residual");

                return ObjectInspectorFactory.getStandardListObjectInspector(
                    ObjectInspectorFactory.getStandardStructObjectInspector(fname, foi));
            }
```

Next come methods to control what happens when a Map or Reduce task is finished with its data.  In the case of partial aggregation, we need to serialize the data.  In the case of full aggregation, we need to package the result for Hive users.

```
    @Override
    public Object terminatePartial(AggregationBuffer agg) throws HiveException {
        // return an ArrayList where the first parameter is the window size
        MaAgg myagg = (MaAgg) agg;
        return myagg.prefixSum.serialize();
    }

    @Override
    public Object terminate(AggregationBuffer agg) throws HiveException {
        // final return value goes here
        MaAgg myagg = (MaAgg) agg;

        if (myagg.prefixSum.tableSize() < 1)
        {
            return null;
        }
        else
        {
            ArrayList<DoubleWritable[]> result = new ArrayList<DoubleWritable[]>();
            for (int i = 0; i < myagg.prefixSum.tableSize(); i++)
            {
                double residual = myagg.prefixSum.getEntry(i).value - myagg.prefixSum.getEntry(i).movingAverage;

                DoubleWritable[] entry = new DoubleWritable[3];
                entry[0] = new DoubleWritable(myagg.prefixSum.getEntry(i).period);
                entry[1] = new DoubleWritable(myagg.prefixSum.getEntry(i).movingAverage);
                entry[2] = new DoubleWritable(residual);
                result.add(entry);
            }

            return result;
        }
    }
```

We also need to provide instruction on how Hive should merge the results of partial aggregation.  Fortunately, we already handled this in our PrefixSumMovingAverage class, so we can just call that.

```
    @SuppressWarnings("unchecked")
    @Override
    public void merge(AggregationBuffer agg, Object partial) throws HiveException {
        // if we're merging two separate sets we're creating one table that's doubly long

        if (partial != null)
        {
            MaAgg myagg = (MaAgg) agg;
            List<DoubleWritable> partialMovingAverage = (List<DoubleWritable>) loi.getList(partial);
            myagg.prefixSum.merge(partialMovingAverage);
        }
    }
```

Of course, merging and serializing isn't very useful unless the UDAF has logic for iterating over values.  The iterate method handles this and -- as one would expect -- relies entirely on the PrefixSumMovingAverage class we created.

```
    @Override
    public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {

        assert (parameters.length == 3);

        if (parameters[0] == null || parameters[1] == null || parameters[2] == null)
        {
            return;
        }

        MaAgg myagg = (MaAgg) agg;

        // Parse out the window size just once if we haven't done so before.  We need a window of at least 1,
        // otherwise there's no window.
        if (!myagg.prefixSum.isReady())
        {
            int windowSize = PrimitiveObjectInspectorUtils.getInt(parameters[2], windowSizeOI);
            if (windowSize < 1)
            {
                throw new HiveException(getClass().getSimpleName() + " needs a window size >= 1");
            }
            myagg.prefixSum.allocate(windowSize);
        }

        // Add the current data point and compute the average.
        // The period is read with its own inspector (periodOI), not the value's.
        int p = PrimitiveObjectInspectorUtils.getInt(parameters[0], periodOI);
        double v = PrimitiveObjectInspectorUtils.getDouble(parameters[1], inputOI);
        myagg.prefixSum.add(p,v);
    }
```

### Aggregation Buffers: Connecting Algorithms with Execution

One might notice that the code for our UDAF references an object of type AggregationBuffer quite a lot.  This is because the AggregationBuffer is the interface which allows us to connect our custom PrefixSumMovingAverage class to Hive's execution framework.  While it doesn't constitute a great deal of code, it's glue that binds our logic to Hive's execution framework.  We implement it as such:

```
    // Aggregation buffer definition and manipulation methods
    static class MaAgg implements AggregationBuffer {
        PrefixSumMovingAverage prefixSum;
    };

    @Override
    public AggregationBuffer getNewAggregationBuffer() throws HiveException {
        MaAgg result = new MaAgg();
        reset(result);
        return result;
    }
```

### Using the UDAF

The goal of a good UDAF is that, no matter how complicated it was for us to implement, it should be simple for our users.  For all that code and parallel thinking, usage of the UDAF is very straightforward:

```
ADD JAR /mnt/shared/hive_udfs/dist/lib/moving_average_udf.jar;
CREATE TEMPORARY FUNCTION moving_avg AS 'com.oracle.hadoop.hive.ql.udf.generic.GenericUDAFMovingAverage';

-- get the moving average for a single tail number
SELECT TailNum, moving_avg(timestring, delay, 4)
FROM ts_example
WHERE TailNum='N967CA'
GROUP BY TailNum
LIMIT 100;
```

Here we're applying the UDAF to get the moving average of arrival delay from a particular flight.  It's a really simple query for all that work we did underneath.  We can do a bit more and leverage Hive's ability to handle complex types as columns.  Here's a query which creates a table of time series as arrays.

```
-- create a set of moving averages for every plane starting with N
-- Note: this UDAF blows up unpleasantly in heap; there will be data volumes for which you need to throw
-- excessive amounts of memory at the problem
CREATE TABLE moving_averages AS
SELECT TailNum, moving_avg(timestring, delay, 4) AS timeseries
FROM ts_example
WHERE TailNum LIKE 'N%'
GROUP BY TailNum;
```

### Summary

We've covered all manner of UDFs: from simple class extensions which can be written very easily, to very complicated UDAFs which require us to think about distributed execution and plan orchestration done by query engines.  With any luck, the discussion has provided you with the confidence to go out and implement your own UDFs -- or at least pay some attention to the complexities of the ones in use every day.

## Thursday Apr 04, 2013

### Introduction

In our ongoing exploration of Hive UDFs, we've covered the basic row-wise UDF.  Today we'll move to the UDTF, which generates multiple rows for every row processed.  This UDF built its house from sticks: it's slightly more complicated than the basic UDF and allows us an opportunity to explore how Hive functions manage type checking.

We'll step through some of the more interesting pieces, but as before the full source is available on github here.

### Extending GenericUDTF

Our UDTF is going to produce pairwise combinations of elements in a comma-separated string.  So, for a string column "Apples, Bananas, Carrots" we'll produce three rows:

• Apples, Bananas
• Apples, Carrots
• Bananas, Carrots

As with the UDF, the first few lines are a simple class extension with a decorator so that Hive can describe what the function does.

```
@Description(name = "pairwise", value = "_FUNC_(doc) - emits pairwise combinations of an input array")
public class PairwiseUDTF extends GenericUDTF {

    private PrimitiveObjectInspector stringOI = null;
```

We also create an object of PrimitiveObjectInspector, which we'll use to ensure that the input is a string.  Once this is done, we need to override methods for initialization, row processing, and cleanup.

```
    @Override
    public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        if (args.length != 1) {
            throw new UDFArgumentException("pairwise() takes exactly one argument");
        }

        // Reject anything that is not a primitive string.  The || short-circuits,
        // so the cast only runs once we know args[0] is a primitive.
        if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE
            || ((PrimitiveObjectInspector) args[0]).getPrimitiveCategory() !=
               PrimitiveObjectInspector.PrimitiveCategory.STRING) {
            throw new UDFArgumentException("pairwise() takes a string as a parameter");
        }

        stringOI = (PrimitiveObjectInspector) args[0];
```

This UDTF is going to return an array of structs, so the initialize method needs to return a StructObjectInspector object.  Note that the arguments to the method come in as an array of ObjectInspector objects.  This allows us to handle arguments in a "normal" fashion but with the benefit of methods to broadly inspect type.  We only allow a single argument -- the string column to be processed -- so we check the length of the array and validate that the sole element is both a primitive and a string.

The second half of the initialize method is more interesting:

```
        List<String> fieldNames = new ArrayList<String>(2);
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(2);
        fieldNames.add("memberA");
        fieldNames.add("memberB");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }
```

Here we set up information about what the UDTF returns.  We need this in place before we start processing rows, otherwise Hive can't correctly build execution plans before submitting jobs to MapReduce.  The structures we're returning will be two strings per struct, which means we'll need ObjectInspector objects for both the values and the names of the fields.  We create two lists, one of strings for the names, the other of ObjectInspector objects.  We pack them manually and then use a factory to get the StructObjectInspector which defines the actual return value.

Now we're ready to actually do some processing, so we override the process method.

```
    @Override
    public void process(Object[] record) throws HiveException {
        final String document = (String) stringOI.getPrimitiveJavaObject(record[0]);

        if (document == null) {
            return;
        }
        String[] members = document.split(",");
        java.util.Arrays.sort(members);
        // start j at i + 1 so each unordered pair is emitted only once
        for (int i = 0; i < members.length - 1; i++)
            for (int j = i + 1; j < members.length; j++)
                if (!members[i].equals(members[j]))
                    forward(new Object[] {members[i], members[j]});
    }
```

This is simple pairwise expansion, so the logic isn't anything more than a nested for-loop.  There are, though, some interesting things to note.  First, to actually get a string object to operate on, we have to use an ObjectInspector and some typecasting.  This allows us to bail out early if the column value is null.  Once we have the string, splitting, sorting, and looping is textbook stuff.
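For readers who want to try the expansion logic outside of Hive, here is a standalone sketch. It is not the UDTF itself: `PairwiseSketch` and `pairs` are hypothetical names, the pairs are collected into a list instead of being forwarded, and trimming whitespace around each member is an addition made here so that an input like "Apples, Bananas, Carrots" yields clean names. Starting the inner loop at `i + 1` keeps each unordered pair from appearing twice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; collects pairs instead of calling forward().
class PairwiseSketch {

    static List<String[]> pairs(String document) {
        String[] members = document.split(",");
        for (int i = 0; i < members.length; i++) {
            members[i] = members[i].trim();   // assumption: whitespace is not significant
        }
        Arrays.sort(members);

        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < members.length - 1; i++) {
            for (int j = i + 1; j < members.length; j++) {
                // skip pairs of identical members, e.g. a basket listing an item twice
                if (!members[i].equals(members[j])) {
                    out.add(new String[] {members[i], members[j]});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] pair : pairs("Apples, Bananas, Carrots")) {
            System.out.println(pair[0] + ", " + pair[1]);
        }
    }
}
```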

The last notable piece is that the process method does not return anything.  Instead, we call forward to emit our newly created structs.  From the context of those used to database internals, this follows the producer-consumer model of most RDBMSs.  From the context of those used to MapReduce semantics, this is equivalent to calling write on the Context object.

```
    @Override
    public void close() throws HiveException {
        // do nothing
    }
```

If there were any cleanup to do, we'd take care of it here.  But this is simple emission, so our override doesn't need to do anything.

### Using the UDTF

Once we've built our UDTF, we can access it via Hive by adding the jar and assigning it to a temporary function.  However, mixing the results of a UDTF with other columns from the base table requires that we use a LATERAL VIEW.

```
-- Add the jar
ADD JAR /mnt/shared/market_basket_example/pairwise.jar;

-- Create a function
CREATE TEMPORARY FUNCTION pairwise AS 'com.oracle.hive.udtf.PairwiseUDTF';

-- View the pairwise expansion output
SELECT m1, m2, COUNT(*)
FROM market_basket
LATERAL VIEW pairwise(basket) pwise AS m1, m2
GROUP BY m1, m2;
```

### Introduction

In our ongoing series of posts explaining the ins and outs of Hive User Defined Functions, we're starting with the simplest case.  Of the three little UDFs, today's entry built a straw house: simple, easy to put together, but limited in applicability.  We'll walk through important parts of the code, but you can grab the whole source from github here.

### Extending UDF

The first few lines of interest are very straightforward:

```
@Description(name = "moving_avg", value = "_FUNC_(x, n) - Returns the moving mean of a set of numbers over a window of n observations")
@UDFType(deterministic = false, stateful = true)
public class UDFSimpleMovingAverage extends UDF
```

We're extending the UDF class with some decoration.  The decoration is important for usability and functionality.  The description decorator allows us to give Hive some information to show users about how to use our UDF and what its method signature will be.  The UDFType decoration tells Hive what sort of behavior to expect from our function.

A deterministic UDF will always return the same output given a particular input.  A square-root computing UDF will always return the same square root for 4, so we can say it is deterministic; a call to get the system time would not be.  The stateful annotation of the UDFType decoration is relatively new to Hive (e.g., CDH4 and above).  The stateful directive allows Hive to keep some static variables available across rows.  The simplest example of this is a "row-sequence," which maintains a static counter which increments with each row processed.

Since square-root and row-counting aren't terribly interesting, we'll use the stateful annotation to build a simple moving average function.  We'll return to the notion of a moving average later when we build a UDAF, so as to compare the two approaches.

```
    private DoubleWritable result = new DoubleWritable();
    private static ArrayDeque<Double> window;
    int windowSize;

    public UDFSimpleMovingAverage() {
        result.set(0);
    }
```

The above code is basic initialization.  We make a double in which to hold the result, but it needs to be of class DoubleWritable so that MapReduce can properly serialize the data.  We use a deque to hold our sliding window, and we need to keep track of the window's size.  Finally, we implement a constructor for the UDF class.

```
    public DoubleWritable evaluate(DoubleWritable v, IntWritable n) {
        double sum = 0.0;
        double moving_average;
        double residual;
        if (window == null)
        {
            window = new ArrayDeque<Double>();
        }
```

Here's the meat of the class: the evaluate method.  This method will be called on each row by the map tasks.  For any given row, we can't say whether or not our sliding window exists, so we initialize it if it's null.

```
        // slide the window
        if (window.size() == n.get())
            window.pop();

        window.addLast(new Double(v.get()));

        // compute the average
        for (Iterator<Double> i = window.iterator(); i.hasNext();)
            sum += i.next().doubleValue();
```

Here we deal with the deque and compute the sum of the window's elements.  Deques are essentially double-ended queues, so they make excellent sliding windows.  If the window is full, we pop the oldest element and add the current value.

```
        moving_average = sum/window.size();
        result.set(moving_average);
        return result;
```

Computing the moving average without weighting is simply dividing the sum of our window by its size.  We then set that value in our Writable variable and return it.  The value is then emitted as part of the map task executing the UDF function.
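The same sliding-window logic can be exercised without Hive or Hadoop. The sketch below is a plain-Java stand-in, not the UDF: `SlidingWindowSketch` and `next` are hypothetical names, the window is an instance field rather than a static one, and primitive doubles replace `DoubleWritable`.

```java
import java.util.ArrayDeque;

// Illustrative sketch only; mirrors the UDF's deque logic with plain doubles.
class SlidingWindowSketch {

    private final ArrayDeque<Double> window = new ArrayDeque<>();
    private final int windowSize;

    SlidingWindowSketch(int windowSize) {
        this.windowSize = windowSize;
    }

    // Push one value and return the mean of the values currently in the window.
    double next(double v) {
        if (window.size() == windowSize) {
            window.removeFirst();   // drop the oldest value
        }
        window.addLast(v);

        double sum = 0.0;
        for (double x : window) {
            sum += x;
        }
        return sum / window.size();
    }

    public static void main(String[] args) {
        SlidingWindowSketch ma = new SlidingWindowSketch(2);
        for (double v : new double[] {2.0, 4.0, 6.0, 8.0}) {
            System.out.println(ma.next(v));
        }
    }
}
```

Unlike the prefix-sum table, this approach divides by the number of values actually in the window, so the first few results are averages over a shorter window. It also depends entirely on rows arriving in order within a single task.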

### Going Further

The stateful annotation made it simple for us to compute a moving average since we could keep the deque static.  However, how would we compute a moving average if there was no notion of state between Hadoop tasks? At the end of the series we'll examine a UDAF that does this, but the algorithm ends up being much different.  In the meantime, I challenge the reader to think about what sort of approach is needed to compute the window.

## Tuesday Apr 02, 2013

### Introduction

User-defined Functions (UDFs) have a long history of usefulness in SQL-derived languages.  While query languages can be rich in their expressiveness, there's just no way they can anticipate all the things a developer wants to do.  Thus, the custom UDF has become commonplace in our data manipulation toolbox.

Apache Hive is no different in this respect from other SQL-like languages.  Hive allows extensibility via both Hadoop Streaming and compiled Java.  However, largely because of the underlying MapReduce paradigm, not all Hive UDFs are created equal.  Some UDFs are intended for "map-side" execution, while others are portable and can be run on the "reduce-side."  Moreover, UDF behavior via streaming requires that queries be formatted so as to direct script execution where we desire it.

The intricacies of where and how a UDF executes may seem like minutiae, but we would be disappointed if the time spent coding a cumulative sum UDF yielded a function that only executed on single rows.  To that end, I'm going to spend the rest of the week diving into the three primary types of Java-based UDFs in Hive.  You can find all of the sample code discussed here.

### The Three Little UDFs

Hive provides three classes of UDFs that most users are interested in: UDFs, UDTFs, and UDAFs.  Broken down simply, the three classes can be explained as such:

• UDFs -- User Defined Functions; these operate row-wise, generally during map execution.  They're the simplest UDFs to write, but constrained in their functionality.
• UDTFs -- User Defined Table-Generating Functions; these also execute row-wise, but they produce multiple rows of output (i.e., they generate a table).  The most common example of this is Hive's explode function.
• UDAFs -- User Defined Aggregating Functions; these can execute on either the map-side or the reduce-side and are far more flexible than UDFs.  The challenge, however, is that in writing UDAFs we have to think about more than what to do with a single row, or even a group of rows: one also has to consider partial aggregation and serialization between map and reduce processes.

Over the next few days, we'll walk through code for each of these function types, from simple to complex.  Along the way, we'll end up with a couple of useful functions you can use in your own Hive code (or improve upon).

## Tuesday Mar 05, 2013

About a year ago we compared building your own Hadoop cluster with a Big Data Appliance, focusing purely on the hardware and software cost. We thought it could use an update, and luckily an analyst firm did one for us. This time it covers not only the hardware and software costs but also ventures much further into other costs.

Some highlights from the report:

• Oracle Big Data Appliance (\$450k) is 39% less costly than "build your own" (\$733k)
• OBDA reduces time-to-market by 33% vs "build"

But the report is not just about those numbers. It also covers a number of very interesting things, like 3 Hadoop Myths, the importance of big data in the near future, and the priority customers give to improving their analytics footprint.

## Wednesday Feb 20, 2013

### Looking for tools to solve your big data problems?

Look no further, Infosys today announced its Infosys BigDataEdge developer platform to drive value from your big data stack.

By empowering business users to rapidly develop insights from vast amounts of structured and unstructured data, better business decisions can be made in near real-time. With Infosys BigDataEdge, enterprises can reduce the time taken to extract information by up to 40 percent and generate insights up to eight times faster.

## Thursday Feb 07, 2013

### New Series of Video Workshops

Contrary to my habit of posting How-To articles, today I'm not going to tell you how to do anything.  Instead, I'm going to ask you to go do something.  What I want you to do is check out the series of video tutorials I'm doing around pieces of the Hadoop ecosystem and Big Data analytics.

The first entry is on YouTube and in the Big Data Playlist on Oracle Media Network.  It gives a brief overview of setting up Flume flows into HDFS.  Of course, if you're looking for a more detailed How-To, just check the archives -- I covered it a while back.

Some of the topics I'll cover in upcoming months include:

• Hive User Defined Functions
• Market Basket Analysis with Hive, R and Data Miner
• Customer Segmentation and Clustering in Hadoop
• Classification and Targeting using Random Forests

## Wednesday Jan 30, 2013

### Introduction

I am less and less often mistaken for a pirate when I mention the R language.  While I miss the excuse to wear an eyepatch, I'm glad more people are beginning to explore a statistical language I've been touting for years.  When it comes to plotting or running complex statistics in a single line of code, R is a great tool to have.  That said, there are plenty of pitfalls for the casual or new user: syntax, learning to write vectorized code, or even just knowing which "apply" function you really should choose.

I want to explore a slightly less-often considered aspect of R development: parallelism.  Out of the box, R can seem very limited to someone used to working on compute clusters or even a multicore server.  However, there are a few tricks we can leverage to get the most out of R on everything from a personal workstation to a Hadoop cluster.

The R interpreter is -- and likely always will be -- single-threaded.  This means loading data frames is done in a single thread.  So is building your linear model, or generating that pretty surface plot.  Even on my laptop, that's a lot of threads to not use for modeling.  No matter how much my web browser might covet those cycles, I'd like to use them for work.

Rather than a complex multithreaded re-implementation, the R interpreter offers a number of ways to allow users to selectively apply parallelism.  Some of these approaches leverage MPI libraries and mirror that message passing approach.  Others allow a more implicit parallelism via "foreach" or "apply" constructs. We'll just focus on a pair of strategies using the parallelism that's been included in R since version 2.14.1: the parallel library.

### Setting The Stage for Parallel Execution

We're going to need to load a few libraries into our R session before we can execute anything outside of our single thread.  We'll use the doParallel and foreach packages because they allow us to focus on what to parallelize rather than how to coordinate our threads.

```
data(iris)
library(parallel)
library(iterators)
library(doParallel)
library(foreach)
```

Knowing that calculations in R will be single-threaded, we want to use the parallel package to operate on logical subsets of the data simultaneously.  For example, I loaded the iris data set, which contains a number of different species.  One way I might want to parallelize is to fit the same model to each species simultaneously.  For that, I'm going to have to split the data by species:

> `species.split <- split(iris, iris$Species)`

This gives us a list we can iterate over -- or parallelize.  From here on out, it's simply a question of deciding what resources we want to leverage: local CPUs or remote hosts.

### FORKs and SOCKs

We're going to use the makeCluster function to bind together a set of computational resources.  But first we need to decide: do we want to use only local CPUs, or is it necessary to open up socket connections to other machines to distribute our workload?  In the former case we'll use makeCluster to create what's called a FORK cluster (in that it uses UNIX's fork call to create worker processes).  In the latter, we'll create a SOCK cluster by opening up sockets to a list of remote hosts and starting worker processes on them.

Here's a FORK cluster which uses all my cores:

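```
# type = "FORK" must be given explicitly; makeCluster() defaults to a socket (PSOCK) cluster
cl <- makeCluster(detectCores(), type = "FORK")
registerDoParallel(cl)
```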

And here's a SOCK cluster across three nodes (password-less SSH is required):

```
hostlist <- c("10.0.0.1", "10.0.0.2", "10.0.0.3")
cl <- makeCluster(hostlist)
registerDoParallel(cl)
```

In each case, I call registerDoParallel to bind this cluster to the %dopar% operator.  This is the operator which will let us easily iterate in parallel.

### Running in Parallel

Once we've got something to iterate over and a cluster with which to do it, modeling in parallel becomes straightforward.  Suppose I want to fit a model of sepal length as a linear combination of petal characteristics.  In that case, the code is simply:

```
species.models <- foreach(i = species.split) %dopar% {
  m <- lm(i$Sepal.Length ~ i$Petal.Width * i$Petal.Length)
  return(m)
}
```

But I'm not just restricted to fitting linear models on my little cluster.  I can run k-means clustering for several different k simultaneously using basically the same block:

```
species.clusters <- foreach(i = 2:5) %dopar% {
  # kmeans needs numeric input, so drop the Species factor column
  km <- kmeans(iris[, 1:4], i)
  return(km)
}
```

When I'm done with my block, I can just call stopCluster(cl) to ensure my processes terminate and I'm not hogging resources.
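
For readers more at home outside R, the same split, apply-in-parallel, and clean-up pattern can be sketched in Python. This is only an illustration under a few assumptions: the column names are hypothetical, the model is a single-predictor least-squares line rather than the interaction model above, and a thread pool stands in for the worker processes that makeCluster would start.

```python
from concurrent.futures import ThreadPoolExecutor


def split_by(rows, key):
    """Group rows (dicts) by the value of `key`, much like R's split()."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    return groups


def fit_line(rows, x="petal_length", y="sepal_length"):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(rows)
    mean_x = sum(r[x] for r in rows) / n
    mean_y = sum(r[y] for r in rows) / n
    sxx = sum((r[x] - mean_x) ** 2 for r in rows)
    sxy = sum((r[x] - mean_x) * (r[y] - mean_y) for r in rows)
    slope = sxy / sxx
    return {"intercept": mean_y - slope * mean_x, "slope": slope}


def fit_per_group(rows, key, workers=2):
    """Fit one model per group, handing each group to a pool worker."""
    groups = split_by(rows, key)
    names = list(groups)
    # Leaving the with-block shuts the pool down, as stopCluster(cl) does in R.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        models = list(pool.map(fit_line, (groups[n] for n in names)))
    return dict(zip(names, models))
```

The design choice is the same as with foreach: the function being applied knows nothing about the pool, so the decision about how many workers to use (or whether to use any) stays in one place.
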

Finally, there will be situations in which I need to deploy in parallel against much larger datasets -- specifically, datasets stored in HDFS.  Both Hive and Pig will let me run an R script as part of a streaming process.  In Hive, the TRANSFORM operator will send data to an R script.  In Pig, you can use the STREAM operator to send a whole bag to an R script.  However, you can't stream from within Pig's FOREACH blocks, so I occasionally use a UDF which invokes R scripts for me.

Regardless of the method you choose to send HDFS data to an R process, it's important to make sure your R script can consume data streaming from standard input.  I find the most expedient way of doing this is via the file function.  A typical script might start:

```
#! /usr/bin/env Rscript
# Connection to STDIN for reading a data frame
con <- file(description="stdin")
my.data.frame <- read.table(con, header=FALSE, sep=",")
```
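
The same contract (read delimited rows from standard input, turn them into typed columns) holds for a streaming script in any language. As a point of comparison, here is a rough Python sketch of it; the function names are my own, and the type guessing is far cruder than what read.table does.

```python
import csv
import sys


def convert(field):
    """Turn numeric-looking fields into floats; leave the rest as strings."""
    try:
        return float(field)
    except ValueError:
        return field


def read_table(stream, sep=","):
    """Read delimited text from a stream into a list of typed rows."""
    return [
        [convert(f) for f in row]
        for row in csv.reader(stream, delimiter=sep)
        if row  # skip blank lines
    ]
```

In a real streaming script the call would be `rows = read_table(sys.stdin)`; taking the stream as an argument keeps the function easy to exercise without a cluster.
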

### Summary

We've covered several ways to push R beyond the bounds of its single-threaded core.  There are forking and socket mechanisms for spreading our work around, not to mention tricks for leveraging the power of Hadoop Streaming.  In each case, however, one thing stands out: we must be smart as modelers and understand what can and should be done in parallel.

## Monday Jan 28, 2013

### First Oracle BIWA Data Scientists Certified

For those who attended the BIWA Summit a few weeks ago, you would have seen the data scientist certification. BIWA just listed the first batch of data scientists it certified:

Instructor Level Certificate  - Brendan Tierney

Oracle Data Scientist Certificate
Jorge Anicama, IBM (GBS)
Tim Vlamis, Vlamis Software Solutions
Vijayalakshmi Muthukrishnan, Motorola
Sicheng Liu, Deloitte Consulting
Avik Bhattacharya, Printpack Inc.
Ari Kaplan, Ariball
Paul Mitchell, Oracle

Associate Level
Suresh Anand, Sashatech LLC

Participation Certificate
Ahmed Kopap
Ekine Akuiyibo

For more on the program, see: http://oraclebiwasig.blogspot.com/2013/01/oracle-data-scientist-at-biwa-summit.html

## Friday Jan 25, 2013

### Announcing: OTN Big Data Developer Day

Announcing the first Big Data Developer Day. A full day with two tracks and hands-on on all things Big Data at Oracle!!

An influx of new data types combined with new approaches for analyzing data are creating untapped growth opportunities that have the potential to transform your business. Oracle is the first vendor to provide a complete and integrated set of enterprise-ready products to address the full spectrum of big data business requirements. Jumpstart your understanding of big data in the enterprise by attending this complimentary one-day hands-on workshop. You will learn from technical experts how to:

• Write MapReduce on Oracle’s Big Data Platform
• Manage a Big Data environment
• Access Oracle NoSQL Database
• Manage Oracle NoSQL DB Cluster
• Use data from a Hadoop Cluster with Oracle
• Develop analytics on big data

Register today to learn these skills which you can immediately put to use within your organization.

So if you are in the bay area, do come and learn the coolest new technologies.

## Friday Jan 18, 2013

### Big Data Appliance X3-2 Updates

Hello world. Waaw, time went by too fast. Happy new year, and here is the long past due update on the new Big Data Appliance and the software updates.

## Big Data Appliance X3-2

Both the software and the hardware of the Big Data Appliance got a refresh.

### Hardware Update

A good place to start is to quickly review the hardware differences (no price changes!). On a per node basis the following is a comparison between old and new (X3-2) hardware:

| | Big Data Appliance v1 | Big Data Appliance X3-2 |
|---|---|---|
| CPU | 2 x 6-Core Intel® Xeon® 5675 (3.06 GHz) | 2 x 8-Core Intel® Xeon® E5-2660 (2.2 GHz) |
| Memory | 48GB | 64GB expandable to 512GB |
| Disk | 12 x 3TB High Capacity SAS | 12 x 3TB High Capacity SAS |
| InfiniBand | 40Gb/sec | 40Gb/sec |
| Ethernet | 10Gb/sec | 10Gb/sec |
| KVM | 1 KVM Switch | N/A (removed) |

For all the details on the environmentals and other useful information, review the data sheet for Big Data Appliance X3-2. For those wondering what we did with the 2RU we now have left from the KVM, that is open space, at the top of the rack.

The higher core count gives a BDA X3-2 more parallel compute power while saving some 30% in energy and heat.

### Software Update

As we did with the hardware, a good place to start is a quick overview of the software changes in the table below:

| | Big Data Appliance v1.1.x Software Stack | Big Data Appliance V2.0.1 Software Stack |
|---|---|---|
| Linux | Oracle Linux 5.6 | Oracle Linux 5.8 with UEK |
| JDK | 1.6 | 1.6u35 |
| Cloudera CDH | CDH 3u4 | CDH 4.1.x |
| Cloudera Manager | CM 3 | CM 4.1 |
| Oracle Enterprise Manager | N/A | Big Data Appliance Plug-In for Enterprise Manager |
| R | Open Source R | Oracle R Distribution 2.x |
| Big Data Connectors \* | Big Data Connectors 1.1.x | Big Data Connectors 2.0.x |
| Oracle NoSQL Database CE \*\* | NoSQL DB 1.x | NoSQL DB 2.x |

\* Oracle Big Data Connectors is a separately licensed product which can be pre-installed and pre-configured on BDA

\*\* Oracle NoSQL DB 2.x will be pre-installed in a future update to Mammoth but can be applied manually today

Apart from the version updates, bug fixes and a great number of performance improvements across the entire system, the biggest updates are the inclusion of CDH 4.1.2 and the default setup of highly available name nodes for Hadoop, the Enterprise Manager management of the BDA, the uptake of the Oracle R Distribution and the updates to Oracle NoSQL Database. In a nutshell, these updates deliver the following improvements:

#### Cloudera CDH 4.1.x

• Higher overall performance
• Highly available name nodes with the BDA using failover quorum processes instead of an external HA filer solution
• Vastly expanded management capabilities via CM 4

On top of this, BDA now has both Zookeeper and Oozie configured out of the box.

#### Oracle Enterprise Manager

The new Big Data Appliance Plug-In for Enterprise Manager delivers the first end-to-end management of the Hadoop cluster from hardware metrics to software and Hadoop metrics. To achieve the end-to-end management of the system Enterprise Manager delivers all the system metrics users are used to from the Exadata Plug-In for Enterprise Manager. Enterprise Manager enables a seamless transition between the Hardware and high level software monitoring and the expanded Hadoop monitoring and diagnostics from Cloudera Manager. This combination of functionality makes operations for a BDA simpler and allows operations staff to seamlessly switch between their Exadata, Big Data Appliance and other Oracle Engineered systems.

#### Oracle R Distribution

The big difference between Oracle R Distribution and the open source R distribution is that Oracle R Distribution can dynamically load the math kernel libraries for CPUs from both Intel and AMD. This increases the performance of basic calculations, which in turn increases the performance of the overall R calculations because more math is off-loaded into the CPUs.

#### Oracle NoSQL Database 2.x

A great number of new features have been added to NoSQL DB 2.x. Most of these are in both the Community Edition and the Enterprise Edition. Charles Lamb has a nice concise post describing what is new here.

## Tuesday Nov 20, 2012

### Winner of the 2012 Government Big Data Solutions Award

Hot off the press:

The winner of the 2012 Government Big Data Solutions Award is the National Cancer Institute!! Read all the details on CTOLabs.com.

A short excerpt to whet your appetite: "... This solution, based on the Oracle Big Data Appliance with the Cloudera Distribution of Apache Hadoop (CDH), leverages capabilities available from the Big Data community today in pioneering ways that can serve a broad range of researchers. The promising approach of this solution is repeatable across many other Big Data challenges for bioinfomatics, making this approach worthy of its selection as the 2012 Government Big Data Solution Award."

Congrats to the entire team!!

## Friday Oct 05, 2012

### Introduction

More often than not, data doesn't come packaged exactly as we'd like it for analysis. Transformation, match-merge operations, and a host of data munging tasks are usually needed before we can extract insights from our Big Data sources. Few people find data munging exciting, but it has to be done. Once we've suffered that boredom, we should take steps to automate the process. We want to codify our work into repeatable units and create workflows which we can leverage over and over again without having to write new code. In this article, we'll look at how to use Oozie to create a workflow for the parallel machine learning task I described on Cloudera's site.

### Hive Actions: Prepping for Pig

In my parallel machine learning article, I use data from the National Climatic Data Center to build weather models on a state-by-state basis. NCDC makes the data freely available as gzipped files of day-over-day observations stretching from the 1930s to today. In reading that post, one might get the impression that the data came in handy, ready-to-model files with convenient delimiters. The truth of it is that I need to perform some parsing and projection on the dataset before it can be modeled. If I get more observations, I'll want to retrain and test those models, which will require more parsing and projection. This is a good opportunity to start building up a workflow with Oozie.

I store the data from the NCDC in HDFS and create an external Hive table partitioned by year. This gives me the flexibility of Hive's query language when I want it, but lets me put the dataset in a directory of my choosing in case I want to treat the same data with Pig or MapReduce code.

```
CREATE EXTERNAL TABLE IF NOT EXISTS historic_weather(column1, column2)
PARTITIONED BY (yr string)
STORED AS ...
LOCATION '/user/oracle/weather/historic';
```

As new weather data comes in from NCDC, I'll need to add partitions to my table. That's an action I should put in the workflow. Similarly, the weather data requires parsing in order to be useful as a set of columns. Because of its long history, the weather data is broken up into fields of specific byte lengths: x bytes for the station ID, y bytes for the dew point, and so on. The delimiting is consistent from year to year, so writing a SerDe or a parser for transformation is simple. Once that's done, I want to select columns on which to train, classify certain features, and place the training data in an HDFS directory for my Pig script to access.

```
-- The partition value and its directory should refer to the same year
ALTER TABLE historic_weather ADD IF NOT EXISTS PARTITION (yr='2010')
LOCATION '/user/oracle/weather/historic/yr=2010';

INSERT OVERWRITE DIRECTORY '/user/oracle/weather/cleaned_history'
SELECT w.stn, w.wban, w.weather_year, w.weather_month,
       w.weather_day, w.temp, w.dewp, w.weather
FROM (
  FROM historic_weather
  SELECT TRANSFORM(...)
  USING '/path/to/hive/filters/ncdc_parser.py'
  AS stn, wban, weather_year, weather_month, weather_day, temp, dewp, weather
) w;
```
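
The parser script itself isn't shown here, so the following is only a minimal sketch of what a fixed-width TRANSFORM script of this kind could look like. The byte offsets in `FIELDS` are invented for illustration (the real ones have to come from the NCDC format documentation), and the function names are my own. What it does reflect is the general TRANSFORM contract: rows arrive on standard input and tab-separated columns go back out on standard output.

```python
import sys

# Hypothetical layout: (column name, start offset, end offset).
# These offsets are placeholders, not the real NCDC record layout.
FIELDS = [
    ("stn", 0, 6),
    ("wban", 7, 12),
    ("weather_year", 14, 18),
    ("weather_month", 18, 20),
    ("weather_day", 20, 22),
    ("temp", 24, 30),
    ("dewp", 35, 41),
    ("weather", 42, 48),
]


def parse_line(line):
    """Slice one fixed-width record into stripped string fields."""
    return [line[start:end].strip() for _, start, end in FIELDS]


def transform(lines, out):
    """Write one tab-separated row per non-empty input record."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            out.write("\t".join(parse_line(line)) + "\n")
```

Run as a script, this would be driven by `transform(sys.stdin, sys.stdout)`. Keeping the layout in a single table means a format change touches one list rather than the parsing logic.
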

Since I'm going to prepare training directories with at least the same frequency that I add partitions, I should also add that to my workflow. Oozie is going to invoke these Hive statements using what's somewhat obviously referred to as a Hive action. Hive actions amount to Oozie running a script file containing our query language statements, so we can place them in a file called `ncdc_parse.hql`.

### Starting Our Workflow

Oozie offers two types of jobs: workflows and coordinator jobs. Workflows are straightforward: they define a set of actions to perform as a sequence or directed acyclic graph. Coordinator jobs can take all the same actions as workflow jobs, but they can be automatically started either periodically or when new data arrives in a specified location. To keep things simple we'll make a workflow job; coordinator jobs simply require another XML file for scheduling. The bare minimum for workflow XML defines a name, a starting point, and an end point:

```
<workflow-app name="WeatherMan" xmlns="uri:oozie:workflow:0.1">
    <start to="ParseNCDCData"/>
    <end name="end"/>
</workflow-app>
```

To this we need to add an action, and within that we'll specify the Hive parameters. Also, keep in mind that actions require `<ok>` and `<error>` tags to direct the next action on success or failure.

```
<action name="ParseNCDCData">
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>localhost:8021</job-tracker>
        <name-node>localhost:8020</name-node>
        <configuration>
            <property>
                <name>oozie.hive.defaults</name>
                <value>/user/oracle/weather_ooze/hive-default.xml</value>
            </property>
        </configuration>
        <script>ncdc_parse.hql</script>
    </hive>
    <ok to="WeatherMan"/>
    <error to="end"/>
</action>
```

There are a couple of things to note here:
1. I have to give the FQDN (or IP) and port of my JobTracker and NameNode.
2. I have to include a `hive-default.xml` file.
3. I have to include a script file.
4. The `hive-default.xml` and script file must be stored in HDFS.

That last point is particularly important. Oozie doesn't make assumptions about where a given workflow is being run. You might submit workflows against different clusters, or have a different `hive-default.xml` on each cluster (e.g. MySQL or Postgres-backed metastores).
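
Since every action needs `<ok>` and `<error>` transitions, and every transition has to land on a node that exists, it can be handy to check a workflow file locally as it grows. Below is a small Python sketch of such a check. It is not part of Oozie: `check_workflow` is a made-up helper, and it assumes the `uri:oozie:workflow:0.1` namespace used in the workflow above.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the workflow-app element shown in the text.
NS = "{uri:oozie:workflow:0.1}"


def check_workflow(xml_text):
    """Return a list of problems found in an Oozie workflow definition."""
    root = ET.fromstring(xml_text)
    # Every named top-level node is a legal transition target.
    targets = {node.get("name") for node in root if node.get("name")}
    problems = []

    start = root.find(NS + "start")
    if start is None:
        problems.append("missing <start>")
    elif start.get("to") not in targets:
        problems.append("start points at unknown node %r" % start.get("to"))

    for action in root.findall(NS + "action"):
        for tag in ("ok", "error"):
            transition = action.find(NS + tag)
            if transition is None:
                problems.append("%s: missing <%s>" % (action.get("name"), tag))
            elif transition.get("to") not in targets:
                problems.append("%s: <%s> points at unknown node %r"
                                % (action.get("name"), tag, transition.get("to")))
    return problems
```

Run against the bare-minimum workflow from earlier, this reports that `start` points at `ParseNCDCData`, which doesn't exist until the action is added. An empty list means every transition resolves.
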

A quick way to ensure that all the assets end up in the right place in HDFS is just to make a working directory locally, build your workflow.xml in it, and copy the assets you'll need to it as you add actions to `workflow.xml`. At this point, our local directory should contain:

1. `workflow.xml`
2. `hive-default.xml` (make sure this file contains your metastore connection data)
3. `ncdc_parse.hql`

### Adding Pig to the Ooze

Adding our Pig script as an action is slightly simpler from an XML standpoint. All we do is add an action to `workflow.xml` as follows:

```
<action name="WeatherMan">
    <pig>
        <job-tracker>localhost:8021</job-tracker>
        <name-node>localhost:8020</name-node>
        <script>weather_train.pig</script>
    </pig>
    <ok to="end"/>
    <error to="end"/>
</action>
```

Once we've done this, we'll copy `weather_train.pig` to our working directory. However, there's a bit of a "gotcha" here. My pig script registers the Weka Jar and a chunk of jython. If those aren't also in HDFS, our action will fail from the outset -- but where do we put them? The Jython script goes into the working directory at the same level as the pig script, because pig attempts to load Jython files in the directory from which the script executes. However, that's not where our Weka jar goes.

While Oozie doesn't assume much, it does make an assumption about the Pig classpath. Anything under `working_directory/lib` gets automatically added to the Pig classpath and no longer requires a REGISTER statement in the script. Anything that uses a REGISTER statement cannot be in the `working_directory/lib` directory. Instead, it needs to be in a different HDFS directory and attached to the pig action with an `<archive>` tag.

Yes, that's as confusing as you think it is.

You can get the exact rules for adding Jars to the distributed cache from Oozie's Pig Cookbook.

### Making the Workflow Work

We've got a workflow defined and have collected all the components we'll need to run. But we can't run anything yet, because we still have to define some properties about the job and submit it to Oozie. We need to start with the job properties, as this is essentially the "request" we'll submit to the Oozie server. In the same working directory, we'll make a file called `job.properties` as follows:

```
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
queueName=default
weatherRoot=weather_ooze
mapreduce.jobtracker.kerberos.principal=foo
dfs.namenode.kerberos.principal=foo
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.wf.application.path=${nameNode}/user/${user.name}/${weatherRoot}
outputDir=weather-ooze
```

While some of the pieces of the properties file are familiar (e.g., JobTracker address), others take a bit of explaining. The first is `weatherRoot`: this is essentially an environment variable for the script (as are jobTracker and queueName). We're simply using them to simplify the directives for the Oozie job. The `oozie.libpath` piece is extremely important. This is a directory in HDFS which holds Oozie's shared libraries: a collection of Jars necessary for invoking Hive, Pig, and other actions. It's a good idea to make sure this has been installed and copied up to HDFS. The last two lines are straightforward: run the application defined by workflow.xml at the application path listed and write the output to the output directory.
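
If it isn't obvious what a line like the application path resolves to, a few lines of Python can expand the `${...}` references for inspection. This is only a rough approximation of the substitution, not Oozie's own resolver: the helper names are invented, and `user.name` is supplied by hand here (assumed to be `oracle`) because it normally comes from the environment rather than from the file.

```python
import re

_REF = re.compile(r"\$\{([^}]+)\}")


def parse_properties(text):
    """Parse simple key=value lines, ignoring blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


def expand(value, props, depth=10):
    """Substitute ${name} references; unknown names are left untouched."""
    for _ in range(depth):
        new = _REF.sub(lambda m: props.get(m.group(1), m.group(0)), value)
        if new == value:
            break
        value = new
    return value
```

With the properties above and `user.name` set to `oracle`, the application path expands to `hdfs://localhost:8020/user/oracle/weather_ooze`, which is the HDFS directory the working directory has to be copied to.
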

We're finally ready to submit our job! After all that work we only need to do a few more things:

1. Validate our `workflow.xml`
2. Copy our working directory to HDFS
3. Submit our job to the Oozie server
4. Run our workflow
Let's do them in order. First validate the workflow:
`oozie validate workflow.xml`

Next, copy the working directory up to HDFS:
`hadoop fs -put working_dir /user/oracle/working_dir`

Now we submit the job to the Oozie server. We need to ensure that we've got the correct URL for the Oozie server, and we need to specify our `job.properties` file as an argument.
`oozie job -oozie http://url.to.oozie.server:port_number/ -config /path/to/working_dir/job.properties -submit`

We've submitted the job, but we don't see any activity on the JobTracker? All I got was this funny bit of output:
`14-20120525161321-oozie-oracle`

This is because submitting a job to Oozie creates an entry for the job and places it in `PREP` status. What we got back, in essence, is a ticket for our workflow to ride the Oozie train. We're responsible for redeeming our ticket and running the job.
`oozie job -oozie http://url.to.oozie.server:port_number/ -start 14-20120525161321-oozie-oracle`

Of course, if we really want to run the job from the outset, we can change the `-submit` argument above to `-run`. This will prep and run the workflow immediately.
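
When these calls end up inside a script, it helps to keep the three variants side by side. Here is a small Python sketch that only builds the argument lists for `oozie job`; the helper names are my own, and the server URL used in the usage note is a placeholder. The lists can be handed to `subprocess.run` on a machine where the Oozie client is installed.

```python
import shlex


def oozie_job(server_url, *args):
    """Build the argument list for an `oozie job` invocation."""
    return ["oozie", "job", "-oozie", server_url, *args]


def submit(server_url, properties_path):
    # Registers the workflow and leaves it in PREP status.
    return oozie_job(server_url, "-config", properties_path, "-submit")


def start(server_url, job_id):
    # Starts a workflow that was submitted earlier.
    return oozie_job(server_url, "-start", job_id)


def run(server_url, properties_path):
    # Submits and starts in a single step.
    return oozie_job(server_url, "-config", properties_path, "-run")


def as_shell(argv):
    """Render an argument list as a copy-pasteable shell command."""
    return shlex.join(argv)
```

For example, `as_shell(run("http://oozie.example.com:11000/oozie", "job.properties"))` renders the one-step command as a single string for logging.
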

### Takeaway

So, there you have it: the somewhat laborious process of building an Oozie workflow. It's a bit tedious the first time out, but it does present a pair of real benefits to those of us who spend a great deal of time data munging. First, when new data arrives that requires the same processing, we already have the workflow defined and ready to run. Second, as we build up a set of useful action definitions over time, creating new workflows becomes quicker and quicker.

## Wednesday Sep 05, 2012

### E-Book on big data (featuring Analysts, Customers and more)

As we are gearing up for OpenWorld, here is a nice E-book on big data to start paging through. It contains Gartner's take on big data, customer and partner interviews and a lot more good info. Enjoy the read so you come prepared for OpenWorld!!

For those coming to Oracle OpenWorld (or the America's Cup races around the same time), you can find big data sessions via this URL.

Enjoy!!

## Wednesday Aug 01, 2012

### Flume and Hive for Log Analytics

There's a lot to learn from log data, but to get the most value from it, that data needs to be easy to collect and analyze. Otherwise, time that could be used to learn from data is spent writing parsers and transport components. In this entry we'll simplify log collection and transport using JSON serialization and parts of the Hadoop ecosystem.

Logging everything in JSON is a great idea. As serialization formats go it's engineer-friendly: you and your favorite programming language can both read it. Moreover, having all of your log data structured as universal data structures makes getting started with analytics much simpler. To illustrate how much simpler, we'll take JSON logs written to a flat file, stream them into HDFS, and expose them via Hive for exploration and aggregation.

### The Preliminaries

We're going to use three components to put our system together:

• A flat file that's collecting JSON data. Assume entries look a bit like this:
`{"fieldA":"string data","fieldB":400,"fieldC":0.99}`
• Flume: the distributed log-collection service that's part of the Hadoop ecosystem
• Hive and a SerDe for handling JSON data
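
On the application side, producing a log like this takes very little. A minimal Python sketch (the function name is made up, and the field names are the ones from the sample entry above) writes one JSON object per line so that anything tailing the file sees complete records:

```python
import json


def write_json_log(stream, **fields):
    """Append one JSON object per line so line-oriented tools can tail it."""
    stream.write(json.dumps(fields, separators=(",", ":")) + "\n")
```

Calling `write_json_log(log_file, fieldA="string data", fieldB=400, fieldC=0.99)` on an open file appends a line in the same shape as the sample entry.
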

### The "Tail Table"

We'll begin by setting up the final destination for our log data. This requires we create a directory in HDFS to hold the log data and define a Hive table over it. Making the directory's easy:

`hadoop fs -mkdir /user/oracle/tail_table`

Similarly, defining the external table is straightforward in the Hive command line:

```
-- Column types follow the sample record: fieldA is a string, fieldB an integer
CREATE EXTERNAL TABLE IF NOT EXISTS tail_table(
  fieldA string, fieldB int, fieldC float)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/user/oracle/tail_table';
```

This gives us a table which will read and respect the types of the values in our JSON records. If a field isn't present for a given record, a NULL value is returned for that column. Fields not included in the CREATE statement are ignored but still exist in the JSON. This allows the schema of the JSON to remain flexible while minimally impacting the Hive table.
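
The behavior described here (missing fields become NULL, extra fields are ignored) is easy to picture as a projection of each JSON record onto the table's columns. The Python sketch below mimics that behavior for illustration only; it is not how the SerDe is implemented, and the `SCHEMA` list simply follows the types in the sample JSON entry.

```python
import json

# Column names and types assumed from the sample log entry.
SCHEMA = [("fieldA", str), ("fieldB", int), ("fieldC", float)]


def project(line, schema=SCHEMA):
    """Map one JSON log line onto the table's columns, in column order."""
    record = json.loads(line)
    row = []
    for name, cast in schema:
        value = record.get(name)
        # Absent fields come back as None, the stand-in for NULL.
        row.append(None if value is None else cast(value))
    return row
```

A record that carries an unexpected `fieldD` projects to the same three columns as before, which is why the JSON schema can drift without breaking the table.
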

### Streaming Data with Flume OG

CDH 3u4 ships with two very different versions of Flume. The default is Flume 0.9.4, or Flume OG. It's great at streaming data into HDFS, but Flume OG has some requirements.

• You must run Zookeeper to coordinate Flume nodes
• You must run a Flume master to control Flume nodes

Those caveats aside, setting up Flume to stream data into our Hive table is remarkably simple. We only need to define a source which tails our JSON logs and a sink which writes these into the appropriate HDFS directory. We can set this up via the Flume master's web interface. Just navigate to the Flume master web interface at http://flumemaster.your.domain:35871 and click the config link. From here, select the Flume node you want to configure from the dropdown menu (i.e. the node which has the JSON log file). The rest is easy:

• Set the source as: `tail("/path/to/json.log")`
• Set the sink as: `collectorSink("hdfs://namenode/user/oracle/tail_table", "logdata", 30)`

This configuration will tail the log file and write a new message into HDFS with each new line. The collectorSink will commit data to our Hive table every 30 seconds. The resulting configuration looks like this:

### Streaming Data with Flume NG

The other version of Flume which ships with CDH3 is Flume NG. Flume NG is significantly different from its predecessor. Our tail source from the previous section is gone, but so too are many of the restrictions.
• Zookeeper is no longer a requirement
• The master-slave architecture has been replaced by independent Flume agents
• We can now use Avro RPCs to transfer data in multi-hop flows

That last point is a big advance for Flume. In Flume OG, transfer from application servers to our Hadoop cluster was a gray area. Either our application servers run Flume nodes connected to the Zookeeper instances and Flume masters for the Hadoop cluster, or logs must be transferred into the Hadoop cluster via another method. In Flume NG, we can run independent Flume agents on each application server and on the Hadoop cluster, relying on Avro RPC to handle forwarding.

For this type of multi-hop log transfer, we need a flume-ng-agent running on each application server and one on the Hadoop cluster. The application servers will have a flume.conf file which includes something like this:

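```
app-agent.sources = tail
app-agent.channels = memoryChannel
app-agent.sinks = avro-forward-sink

app-agent.sources.tail.type = exec
app-agent.sources.tail.command = tail -f /path/to/json.log
app-agent.sources.tail.channels = memoryChannel

app-agent.channels.memoryChannel.type = memory

# Sink settings live under "sinks", and each sink reads from one channel
app-agent.sinks.avro-forward-sink.type = avro
app-agent.sinks.avro-forward-sink.hostname = 10.1.1.100
app-agent.sinks.avro-forward-sink.port = 10000
app-agent.sinks.avro-forward-sink.channel = memoryChannel
```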

This sets up a source that runs "tail" and sinks that data via Avro RPC to 10.1.1.100 on port 10000.

The collecting Flume agent on the Hadoop cluster will need a flume.conf with an avro source and an HDFS sink.

```
hdfs-agent.sources = avro-collect
hdfs-agent.sinks = hdfs-write
hdfs-agent.channels = memoryChannel

hdfs-agent.sources.avro-collect.type = avro
hdfs-agent.sources.avro-collect.bind = 10.1.1.100
hdfs-agent.sources.avro-collect.port = 10000
hdfs-agent.sources.avro-collect.channels = memoryChannel

hdfs-agent.channels.memoryChannel.type = memory

# HDFS sink settings carry an "hdfs." prefix
hdfs-agent.sinks.hdfs-write.type = hdfs
hdfs-agent.sinks.hdfs-write.hdfs.path = hdfs://namenode/user/oracle/tail_table
hdfs-agent.sinks.hdfs-write.hdfs.rollInterval = 30
hdfs-agent.sinks.hdfs-write.channel = memoryChannel
```

On this side we've defined a source that reads Avro messages from port 10000 on 10.1.1.100 and writes the results into HDFS, rolling the file every 30 seconds. It's just like our setup in Flume OG, but now multi-hop forwarding is a snap.
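
One thing that is easy to miss in a Flume NG configuration is that every source and every sink has to be attached to a declared channel (`channels` on a source, `channel` on a sink). The Python sketch below is a hypothetical lint for that, not a Flume tool: the function names are invented, and it only understands the flat `agent.key = value` layout used in this post.

```python
def parse_agent(conf_text, agent):
    """Collect one agent's properties, keyed without the agent prefix."""
    props = {}
    prefix = agent + "."
    for line in conf_text.splitlines():
        line = line.strip()
        if line.startswith(prefix) and "=" in line:
            key, _, value = line.partition("=")
            props[key.strip()[len(prefix):]] = value.strip()
    return props


def unbound_components(props):
    """Return sources and sinks that are not attached to a declared channel."""
    channels = set(props.get("channels", "").split())
    missing = []
    for name in props.get("sources", "").split():
        # A source may feed several channels, listed space-separated.
        bound = set(props.get("sources.%s.channels" % name, "").split())
        if not bound or not bound <= channels:
            missing.append("source " + name)
    for name in props.get("sinks", "").split():
        # A sink drains exactly one channel.
        if props.get("sinks.%s.channel" % name) not in channels:
            missing.append("sink " + name)
    return missing
```

Feeding it the text of a flume.conf and an agent name returns an empty list when everything is wired up, and otherwise names the components that would never see any events.
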

The resulting configuration looks like this:

### Takeaway

No matter which version you deploy, the combination of Flume, Hive and JSON makes it straightforward to set up an end-to-end pipeline for consuming and analyzing serialized log data. With deployments this simple, you can spend more time focusing on your applications and analytics.