Tuesday Aug 13, 2013

Hive 0.11 (May 15, 2013) and rank() within a category

This is a follow-up to the Stack Overflow question "HiveQL and rank()".

libjack recommended that I upgrade to Hive 0.11 (May 15, 2013) to take advantage of its windowing and analytics functions. His recommendation worked immediately, but it took me a while to find the right syntax for ranking within categories. This blog entry records the correct syntax.


1. Sales Rep data

Here is a CSV file with Sales Rep data:

$ more reps.csv
1,William,2
2,Nadia,1
3,Daniel,2
4,Jana,1


Create a Hive table for the Sales Rep data:

create table SalesRep (
  RepID INT,
  RepName STRING,
  Territory INT
  )
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n';

... and load the CSV into the Hive Sales Rep table:

LOAD DATA
 LOCAL INPATH '/home/hadoop/MyDemo/reps.csv'
 INTO TABLE SalesRep;



2. Purchase Order data

Here is a CSV file with PO data:

$ more purchases.csv
4,1,100
2,2,200
2,3,600
3,4,80
4,5,120
1,6,170
3,7,140


Create a Hive table for the POs:

create table purchases (
  SalesRepId INT,
  PurchaseOrderId INT,
  Amount INT
  )
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n';


... and load the CSV into the Hive PO table:

LOAD DATA
 LOCAL INPATH '/home/hadoop/MyDemo/purchases.csv'
 INTO TABLE purchases;



3. Hive JOIN

This is the underlying data used in the rest of the examples:

SELECT p.PurchaseOrderId, s.RepName, p.amount, s.Territory
FROM purchases p JOIN SalesRep s
WHERE p.SalesRepId = s.RepID;


PO ID  Rep      Amount  Territory
1      Jana     100     1
2      Nadia    200     1
3      Nadia    600     1
4      Daniel   80      2
5      Jana     120     1
6      William  170     2
7      Daniel   140     2
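
The query above puts the join condition in the WHERE clause. As a sketch only (I have not re-run it on the Hive 0.11 setup described here), the same join can be written with an explicit ON clause, which is the form the Hive join documentation uses for equality joins. It uses the same tables and columns as above and should return the same seven rows:

```sql
-- Same join as above, with the equality condition moved into ON.
-- Assumes the SalesRep and purchases tables created in sections 1 and 2.
SELECT p.PurchaseOrderId, s.RepName, p.Amount, s.Territory
FROM purchases p
JOIN SalesRep s
  ON p.SalesRepId = s.RepID;
```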


4. Hive Rank by Volume only

SELECT
  s.RepName, s.Territory, V.volume,
  rank() over (ORDER BY V.volume DESC) as rank
FROM
  SalesRep s
  JOIN
    ( SELECT
      SalesRepId, SUM(amount) as Volume
      FROM purchases
      GROUP BY SalesRepId) V
  WHERE V.SalesRepId = s.RepID
  ORDER BY V.volume DESC;



Rep      Territory  Amount  Rank
Nadia    1          800     1
Daniel   2          220     2
Jana     1          220     2
William  2          170     4

The ranking is over the entire data set: Daniel is tied with Jana for second among all reps, and William is ranked fourth because rank() skips a position after a tie.
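
William ends up with rank 4 rather than 3 because rank() leaves a gap after tied rows. Hive 0.11 also ships dense_rank() and row_number() as windowing functions, so a variation like the following (a sketch that I have not run on this setup; the column aliases rnk, dense_rnk and row_num are my own) puts the three numbering schemes side by side:

```sql
-- Compare rank(), dense_rank() and row_number() over the same ordering.
-- The aliases rnk, dense_rnk and row_num are illustrative names.
SELECT
  s.RepName, s.Territory, V.volume,
  rank()       over (ORDER BY V.volume DESC) as rnk,
  dense_rank() over (ORDER BY V.volume DESC) as dense_rnk,
  row_number() over (ORDER BY V.volume DESC) as row_num
FROM
  SalesRep s
  JOIN
    ( SELECT
      SalesRepId, SUM(amount) as volume
      FROM purchases
      GROUP BY SalesRepId) V
  WHERE V.SalesRepId = s.RepID
  ORDER BY V.volume DESC;
```

With the data above, dense_rank() would give William 3 instead of 4, since it does not skip positions after a tie. row_number() would number the four rows 1 through 4, with the order of the two tied reps left arbitrary.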


5. Hive Rank within Territory, by Volume

SELECT
  s.RepName, s.Territory, V.volume,
  rank() over (PARTITION BY s.Territory ORDER BY V.volume DESC) as rank
FROM
  SalesRep s
  JOIN
    ( SELECT
      SalesRepId, SUM(amount) as Volume
      FROM purchases
      GROUP BY SalesRepId) V
  WHERE V.SalesRepId = s.RepID
  ORDER BY V.volume DESC;



Rep      Territory  Amount  Rank
Nadia    1          800     1
Jana     1          220     2
Daniel   2          220     1
William  2          170     2

The ranking is now within each territory: Daniel is the best in his territory.
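
A common next step is to keep only the top rep in each territory. The result of a windowing function cannot be filtered in the WHERE clause of the same SELECT, so one way to do it is to wrap the ranked query in a subquery and filter outside. This is a sketch under that assumption and has not been run on this setup; the subquery alias ranked and the column alias rnk are my own names:

```sql
-- Keep only the best rep per territory by filtering on the computed rank.
-- "ranked" and "rnk" are illustrative aliases, not names from the original queries.
SELECT RepName, Territory, volume
FROM
  ( SELECT
      s.RepName, s.Territory, V.volume,
      rank() over (PARTITION BY s.Territory ORDER BY V.volume DESC) as rnk
    FROM
      SalesRep s
      JOIN
        ( SELECT
          SalesRepId, SUM(amount) as volume
          FROM purchases
          GROUP BY SalesRepId) V
      ON V.SalesRepId = s.RepID ) ranked
WHERE rnk = 1;
```

With the sample data this should return Nadia for territory 1 and Daniel for territory 2. Because rank() assigns the same value to tied rows, a territory with a tie at the top would return more than one rep.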


6. FYI: this example was developed on a SPARC T4 server with Oracle Solaris 11 and Apache Hadoop 1.0.4