Flume and Hive for Log Analytics

There's a lot to learn from log data, but to get the most value from it, that data needs to be easy to collect and analyze. Otherwise, time that could be used to learn from data is spent writing parsers and transport components. In this entry we'll simplify log collection and transport using JSON serialization and parts of the Hadoop ecosystem.

Logging everything in JSON is a great idea. As serialization formats go it's engineer-friendly: you and your favorite programming language can both read it. Moreover, having all of your log data structured as universal data structures makes getting started with analytics much simpler. To illustrate how much simpler, we'll take JSON logs written to a flat file, stream them into HDFS, and expose them via Hive for exploration and aggregation.

The Preliminaries

We're going to use three components to put our system together:

  • A flat file that's collecting JSON data. Assume entries look a bit like this:
    {"fieldA":"string data","fieldB":400,"fieldC":0.99}
  • Flume: the distributed log-collection service that's part of the Hadoop ecosystem
  • Hive and a SerDe for handling JSON data

The "Tail Table"

We'll begin by setting up the final destination for our log data. This requires we create a directory in HDFS to hold the log data and define a Hive table over it. Making the directory's easy:

hadoop fs -mkdir /user/oracle/tail_table

Similarly, defining the external table is straightforward in the Hive command line:

CREATE EXTERNAL TABLE IF NOT EXISTS tail_table(fieldA int, fieldB string, fieldC float)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/user/oracle/tail_table';

This gives us a table which will read and respect the types of the values in our JSON records. If a field isn't present for a given record, a NULL value is returned for that column. Fields not included in the CREATE statement are ignored but still exist in the JSON. This allows the schema of the JSON to remain flexible while minimally impacting the Hive table.

Streaming Data with Flume OG

CDH 3u4 ships with two very different versions of Flume. The default is Flume 0.9.4, or Flume OG . It's great at streaming data into HDFS, but Flume OG has some requirements.

  • You must run Zookeeper to coordinate Flume nodes
  • You must run a Flume master to control Flume nodes
Those caveats aside, setting up Flume to stream data into our Hive table is remarkably simple. We only need to define a source which tails our JSON logs and a sink which writes these into the appropriate HDFS directory. We can set this up via the Flume master's web interface. Just navigate to the Flume master web interface at http://flumemaster.your.domain:35871 and click the config link. From here, select the Flume node you want to configure from the dropdown menu (i.e. the node which has the JSON log file). The rest is easy:
  • Set the source as: tail("/path/to/json.log")
  • Set the sink as: collectorSink("hdfs://namenode/user/oracle/tail_table", "logdata", 30)

This configuration will tail the log file and write a new message into HDFS with each new line. The collectorSink will commit data to our Hive table every 30 seconds. The resulting configuration looks like this: 

Streaming Data with Flume NG

The other version of Flume which ships with CDH3 is Flume NG . Flume NG is significantly different from its predecessor. Our tail source from the previous section is gone, but so too are many of restrictions.
  • Zookeeper is no longer a requirement
  • The master-slave architecture has been replaced by independent Flume agents
  • We can now use Avro RPCs to transfer data in multi-hop flows
That last point is a big advance for Flume. In Flume OG, transfer from application servers to our Hadoop cluster was a gray area. Either our application servers run Flume nodes connected to the Zookeeper instances and Flume masters for the Hadoop cluster, or logs must be transferred into the Hadoop cluster via another method. In Flume NG, we can run independent Flume agents on application server and the Hadoop cluster, relying on Avro RPC to handle forwarding.

For this type of multi-hop log transfer, we need a flume-ng-agent running on each application server and one on the Hadoop cluster. The application servers will have a flume.conf file which includes something like this:

app-agent.sources = tail
app-agent.channels = memoryChannel
app-agent.sinks = avro-forward-sink
app-agent.sources.tail.type = exec
app-agent.sources.tail.command = tail -f /path/to/json.log
app-agent.sources.avro-forward-sink.type = avro
app-agent.sources.avro-forward-sink.hostname = 10.1.1.100
app-agent.sources.avro-forward-sink.port = 10000

This sets up a source that runs "tail" and sinks that data via Avro RPC to 10.1.1.100 on port 10000.


The collecting Flume agent on the Hadoop cluster will need a flume.conf with an avro source and an HDFS sink.
hdfs-agent.sources= avro-collect
hdfs-agent.sinks = hdfs-write
hdfs-agent.channels= memoryChannel
hdfs-agent.sources.avro-collect.type = avro
hdfs-agent.sources.avro-collect.bind = 10.1.1.100
hdfs-agent.sources.avro-collect.port = 10000
hdfs-agent.sinks.hdfs-write.type = hdfs
hdfs-agent.sinks.hdfs-write.path = hdfs://namenode/user/oracle/tail_table
hdfs-agent.sinks.hdfs-write.rollInterval = 30

On this side we've defined a source that reads Avro messages from port 10000 on 10.1.1.100 and writes the results into HDFS, rolling the file every 30 seconds. It's just like our setup in Flume OG, but now multi-hop forwarding is a snap.


The resulting configuration looks like this: 

Takeway

No matter which version you deploy, the combination of Flume, Hive and JSON make it straightforward to up an end-to-end pipeline for consuming and analyzing serialized log data. With deployments this simple, you can spend more time focusing on your applications and analytics

Comments:

Very useful; thanks Dan.

What is "hostname" in app-agent.sources.avro-forward-sink.hostname = 10.1.1.100 correpsonds to? Is it the ip address of the server running the flume agent? Can we use any available port #?

Thanks.

Posted by guest on September 17, 2012 at 01:42 PM PDT #

In this case 10.1.1.100 is the address of the server which is receiving Avro messages and writing them to HDFS. That is, it's the address of the flume agent on your Hadoop cluster. You can use any available port as long as both the sink and the source are configure to use it. We can take the lines from the two configuration files and match them up:

On the app server:
app-agent.sources.avro-forward-sink.hostname = 10.1.1.100
matches
hdfs-agent.sources.avro-collect.bind = 10.1.1.100

and:
app-agent.sources.avro-forward-sink.port = 10000
matches
hdfs-agent.sources.avro-collect.port = 10000

Posted by Dan Mcclary on October 12, 2012 at 09:50 AM PDT #

The flume.conf for agent and collector has error.

Replace
app-agent.channel = memoryChannel

with:
app-agent.channels = memoryChannel

Replace
hdfs-agent.channel= memoryChannel

with
hdfs-agent.channels= memoryChannel

Posted by guest on January 05, 2013 at 03:54 PM PST #

The hostname is field is either the hostname or IP address of the server running the flume agent which is going to be receiving the Avro messages. You can use any port to which you can bind.

Posted by Dan McClary on January 06, 2013 at 11:51 AM PST #

I think flume-ng at app server has an error:

Replace

app-agent.sources.avro-forward-sink.type = avro
app-agent.sources.avro-forward-sink.hostname = 10.1.1.100
app-agent.sources.avro-forward-sink.port = 10000

With

agent.sinks.avro-forward-sink.type = avro
agent.sinks.avro-forward-sink.hostname = 10.1.1.100
app-agent.sinks.avro-forward-sink.port = 10000

Posted by guest on October 27, 2013 at 10:30 PM PDT #

Post a Comment:
  • HTML Syntax: NOT allowed
About

The data warehouse insider is written by the Oracle product management team and sheds lights on all thing data warehousing and big data.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
2
4
5
6
7
8
9
10
11
12
13
14
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today