There's a lot to learn from log data, but to get the most value from it, that data needs to be easy to collect and analyze. Otherwise, time that could be used to learn from data is spent writing parsers and transport components.
In this entry we'll simplify log collection and transport using JSON serialization and parts of the Hadoop ecosystem.
Logging everything in JSON is a great idea. As serialization formats go it's engineer-friendly: you and your favorite programming language can both read it. Moreover, having all of your log data structured as universal data structures makes getting started with analytics much simpler. To illustrate how much simpler, we'll take JSON logs written to a flat file, stream them into HDFS, and expose them via Hive for exploration and aggregation.
We're going to use three components to put our system together:
We'll begin by setting up the final destination for our log data. This requires we create a directory in HDFS to hold the log data and define a Hive table over it. Making the directory's easy:
hadoop fs -mkdir /user/oracle/tail_table
Similarly, defining the external table is straightforward in the Hive command line:
CREATE EXTERNAL TABLE IF NOT EXISTS tail_table(fieldA int, fieldB string, fieldC float)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
CDH 3u4 ships with two very different versions of Flume. The default is Flume 0.9.4, or Flume OG . It's great at streaming data into HDFS, but Flume OG has some requirements.
collectorSink("hdfs://namenode/user/oracle/tail_table", "logdata", 30)
This configuration will tail the log file and write a new message into HDFS with each new line. The collectorSink will commit data to our Hive table
every 30 seconds. The resulting configuration looks like this:
For this type of multi-hop log transfer, we need a flume-ng-agent running on each application server and one on the Hadoop cluster.
The application servers will have a flume.conf file which includes something like this:
app-agent.sources = tail
app-agent.channels = memoryChannel
app-agent.sinks = avro-forward-sink
app-agent.sources.tail.type = exec
app-agent.sources.tail.command = tail -f /path/to/json.log
app-agent.sources.avro-forward-sink.type = avro
app-agent.sources.avro-forward-sink.hostname = 10.1.1.100
app-agent.sources.avro-forward-sink.port = 10000
This sets up a source that runs "tail" and sinks that data via Avro RPC to 10.1.1.100 on port 10000.
hdfs-agent.sinks = hdfs-write
hdfs-agent.sources.avro-collect.type = avro
hdfs-agent.sources.avro-collect.bind = 10.1.1.100
hdfs-agent.sources.avro-collect.port = 10000
hdfs-agent.sinks.hdfs-write.type = hdfs
hdfs-agent.sinks.hdfs-write.path = hdfs://namenode/user/oracle/tail_table
hdfs-agent.sinks.hdfs-write.rollInterval = 30
On this side we've defined a source that reads Avro messages from port 10000 on 10.1.1.100 and writes the results into HDFS, rolling the file every 30 seconds. It's just like our setup in Flume OG, but now multi-hop forwarding is a snap.
The resulting configuration looks like this:
No matter which version you deploy, the combination of Flume, Hive and JSON make it straightforward to up an end-to-end pipeline for consuming and analyzing serialized log data. With deployments this simple, you can spend more time focusing on your applications and analytics