Three Little Hive UDFs: Part 2
By dan.mcclary on Apr 04, 2013
In our ongoing exploration of Hive UDFs, we've covered the basic row-wise UDF. Today we'll move to the UDTF, which generates multiple rows for every row processed. This UDF built its house from sticks: it's slightly more complicated than the basic UDF and gives us an opportunity to explore how Hive functions manage type checking.
We'll step through some of the more interesting pieces, but, as before, the full source is available on GitHub.
Our UDTF is going to produce pairwise combinations of elements in a comma-separated string. So, for a string column "Apples, Bananas, Carrots" we'll produce three rows:
- Apples, Bananas
- Apples, Carrots
- Bananas, Carrots
As with the UDF, the first few lines are a simple class extension with an annotation so that Hive can describe what the function does.
We also declare a PrimitiveObjectInspector, which we'll use to ensure that the input is a string. Once this is done, we need to override the methods for initialization (initialize), row processing (process), and cleanup (close).
Each row this UDTF emits is a struct, so the initialize method needs to return a StructObjectInspector object. Note that the arguments to initialize come in as an array of ObjectInspector objects. This allows us to handle arguments in a "normal" fashion but with the benefit of methods to broadly inspect type. We only allow a single argument (the string column to be processed), so we check the length of the array and validate that the sole element is both a primitive and a string.
The second half of the initialize method is more interesting:
Here we set up information about what the UDTF returns. We need this in place before we start processing rows, otherwise Hive can't correctly build execution plans before submitting jobs to MapReduce. The structures we're returning will be two strings per struct, which means we'll need ObjectInspector objects for the values as well as names for the fields. We create two lists, one of strings for the field names, the other of ObjectInspector objects. We pack them manually and then use a factory to get the StructObjectInspector which defines the actual return value.
Now we're ready to actually do some processing, so we override the process method.
This is simple pairwise expansion, so the logic isn't anything more than a nested for-loop. There are, though, some interesting things to note. First, to actually get a string object to operate on, we have to use an ObjectInspector and some typecasting. This allows us to bail out early if the column value is null. Once we have the string, splitting, sorting, and looping are textbook stuff.
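To make the splitting, sorting, and looping concrete, here is a minimal plain-Java sketch of the pairwise expansion on its own, with no Hive classes involved. The class name PairwiseSketch and the method pairs are hypothetical and are not taken from the original source; trimming whitespace around each item is also an assumption about how the input should be cleaned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the original PairwiseUDTF source.
final class PairwiseSketch {

    // Returns every unordered pair of items from a comma-separated string.
    // A null input yields no pairs, mirroring the early bail-out on null columns.
    static List<String[]> pairs(String basket) {
        List<String[]> result = new ArrayList<>();
        if (basket == null) {
            return result;
        }
        // Split on commas, then strip surrounding whitespace (an assumption).
        String[] items = basket.split(",");
        for (int i = 0; i < items.length; i++) {
            items[i] = items[i].trim();
        }
        // Sorting gives each pair a consistent order, so the same two items
        // always land in the same (first, second) positions.
        Arrays.sort(items);
        // Nested loop: pair each item with every item that follows it.
        for (int i = 0; i < items.length - 1; i++) {
            for (int j = i + 1; j < items.length; j++) {
                result.add(new String[] {items[i], items[j]});
            }
        }
        return result;
    }
}
```

Calling PairwiseSketch.pairs("Apples, Bananas, Carrots") gives the three pairs listed at the top of the post. In the UDTF itself the string would first be pulled out of the row through the ObjectInspector rather than passed in directly.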
The last notable piece is that the process method does not return anything. Instead, we call forward to emit our newly created structs. For those used to database internals, this follows the producer-consumer model of most RDBMSs. For those used to MapReduce semantics, this is equivalent to calling write on the Context object.
If there were any cleanup to do, we'd take care of it in the close method. But this is simple emission, so our override doesn't need to do anything.
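The emit-instead-of-return pattern can be sketched in plain Java with a callback. This is only an analogy under stated assumptions: ForwardSketch and its collector are hypothetical stand-ins, not Hive's own classes, and the Consumer plays the role of whatever downstream operator receives the rows.

```java
import java.util.Arrays;
import java.util.function.Consumer;

// Hypothetical stand-in for the emit-by-callback pattern; this is not Hive's GenericUDTF.
final class ForwardSketch {
    private final Consumer<Object[]> collector;

    // The surrounding "framework" supplies the consumer that receives each row.
    ForwardSketch(Consumer<Object[]> collector) {
        this.collector = collector;
    }

    // Hands one row to whatever is consuming the output.
    private void forward(Object[] row) {
        collector.accept(row);
    }

    // Returns nothing: each pair is emitted as soon as it is built.
    void process(String basket) {
        if (basket == null) {
            return; // nothing to emit for a null column value
        }
        // Split on commas, dropping whitespace around them (an assumption).
        String[] items = basket.trim().split("\\s*,\\s*");
        Arrays.sort(items);
        for (int i = 0; i < items.length - 1; i++) {
            for (int j = i + 1; j < items.length; j++) {
                forward(new Object[] {items[i], items[j]});
            }
        }
    }
}
```

A caller might pass a list's add method as the collector, for example new ForwardSketch(rows::add), and then inspect the list after calling process. The producer never needs to know how many rows it will emit or where they end up, which is the point of the design.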
Using the UDTF
Once we've built our UDTF, we can access it via Hive by adding the jar and assigning it to a temporary function. However, mixing the results of a UDTF with other columns from the base table requires that we use a LATERAL VIEW.
-- Add the jar
add jar /mnt/shared/market_basket_example/pairwise.jar;
-- Create a function
CREATE temporary function pairwise AS 'com.oracle.hive.udtf.PairwiseUDTF';
-- View the pairwise expansion output
SELECT m1, m2, COUNT(*) FROM market_basket
LATERAL VIEW pairwise(basket) pwise AS m1, m2 GROUP BY m1, m2;