In my previous blog posts, I covered loading data into HDFS. In the first post, I covered loading data from generic servers into HDFS. The second post was devoted to offloading data from an Oracle RDBMS. Here I want to explain how to load streaming data into Hadoop. Before anything else, I want to note that I will not cover Oracle GoldenGate for Big Data, simply because there are already many blog posts about it. Today I'm going to talk about Flume and Kafka.
What is Kafka?
Kafka is a distributed service bus. Ok, but what is a service bus? Let's imagine that you have a few data systems, and each one needs data from the others. You could link them directly, like this:
but that quickly becomes very hard to manage. Instead, you could have one centralized system that accumulates data from all sources and acts as a single point of contact for all the other systems. Like this:
What is Flume?
"Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store." - this definition from the documentation explains pretty well what Flume is. Flume was historically developed for loading data into HDFS. But why couldn't I just use the Hadoop client?
Challenge 1. Small files.
Hadoop was designed for storing large files, and despite the many NameNode optimizations made over the last few years, it is still recommended to store only big files. If your source produces a lot of small files, Flume can collect them and flush the collection in batch mode, as a single big file. I always use the analogy of a glass and drops: you can collect a million drops in one glass, and after this you have one glass of water instead of a million drops.
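As a sketch of what this batching looks like in practice, the Flume HDFS sink can be told to roll (close) a file only once enough data has accumulated. The agent and sink names, the path, and the 128 MB threshold below are illustrative, not from the original post:

```properties
# Roll the HDFS file only when ~128 MB (value in bytes) have accumulated;
# rollCount and rollInterval are set to 0 so file size is the only trigger.
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = /data/incoming/%Y-%m-%d
agent.sinks.hdfs-sink.hdfs.rollSize = 134217728
agent.sinks.hdfs-sink.hdfs.rollCount = 0
agent.sinks.hdfs-sink.hdfs.rollInterval = 0
```

With settings like these, a million tiny source files end up as a handful of large HDFS files - the glass instead of the drops.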
Challenge 2. Lots of data sources
Let's imagine that I have an application (or even two, on two different servers) that produces files which I want to load into HDFS.
Life is good; if the files are large enough, it's not going to be a problem.
But now let's imagine that I have 1000 application servers, each wanting to write data into HDFS. Even if the files are large, this workload will bring your Hadoop cluster to its knees. If you don't believe it, just try it (but not on a production cluster!). So we need something in between HDFS and our data sources.
Now it's time for Flume. You can build a two-tier architecture: the first tier collects data from different sources, and the second one aggregates it and loads it into HDFS.
In my example I depict 1000 sources handled by 100 Flume servers on the first tier, which load data onto the second tier, which connects directly to HDFS - in my example there are only two connections, which is affordable. You can find more details here; I just want to add that the general practice is to use one aggregation agent per 4-16 client agents.
I also want to note that it's good practice to use the Avro sink when you move data from one tier to the next. Here is an example of a Flume config file:
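A minimal sketch of such a two-tier setup might look like the following. The agent names, hostnames, ports, and paths are illustrative; the key idea is that the tier-1 agent forwards events through an Avro sink to a tier-2 agent's Avro source:

```properties
# Tier-1 (client) agent: tails a local log and forwards via Avro
client.sources = src1
client.channels = ch1
client.sinks = avro-sink
client.sources.src1.type = exec
client.sources.src1.command = tail -F /var/log/app.log
client.sources.src1.channels = ch1
client.channels.ch1.type = memory
client.sinks.avro-sink.type = avro
client.sinks.avro-sink.hostname = collector1.example.com
client.sinks.avro-sink.port = 4141
client.sinks.avro-sink.channel = ch1

# Tier-2 (aggregator) agent: receives Avro events and writes big files to HDFS
collector.sources = avro-src
collector.channels = ch1
collector.sinks = hdfs-sink
collector.sources.avro-src.type = avro
collector.sources.avro-src.bind = 0.0.0.0
collector.sources.avro-src.port = 4141
collector.sources.avro-src.channels = ch1
collector.channels.ch1.type = file
collector.sinks.hdfs-sink.type = hdfs
collector.sinks.hdfs-sink.hdfs.path = /data/logs/%Y-%m-%d
collector.sinks.hdfs-sink.channel = ch1
```

Each agent runs with its own config; the Avro hop in the middle is what stitches the two tiers together.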
You can find deep technical presentations about Kafka here and here (I actually took a few screenshots from there). A very interesting technical video can be found here. In this article, I will just recap the key terms and concepts.
Producer - a process that writes data into a Kafka cluster. It could be part of an application, or edge nodes could play this role.
Consumer - a process that reads data from a Kafka cluster.
Broker - a member of a Kafka cluster. A set of brokers forms the Kafka cluster.
You can find a lot of useful information about Flume in this book; here I will just highlight the key concepts.
Flume has 3 major components:
1) Source - where I get the data from
2) Channel - where I buffer it. It could be in memory or on disk, for example.
3) Sink - where I load my data. For example, it could be another tier of Flume agents, HDFS, or HBase.
Between the source and the channel there are two minor components: the Interceptor and the Selector.
With an Interceptor you can do simple processing; with a Selector you can choose a channel depending on the message header.
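As a hedged sketch of both pieces, the snippet below attaches Flume's built-in host interceptor to a source and routes events with a multiplexing selector. The header name `env` and the channel names are hypothetical:

```properties
# Host interceptor: stamps each event with the agent's hostname
agent.sources.src1.interceptors = i1
agent.sources.src1.interceptors.i1.type = host

# Multiplexing selector: routes events by the value of the "env" header
agent.sources.src1.selector.type = multiplexing
agent.sources.src1.selector.header = env
agent.sources.src1.selector.mapping.prod = prod-channel
agent.sources.src1.selector.default = default-channel
```

Events whose `env` header equals `prod` go to `prod-channel`; everything else falls through to `default-channel`.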
Flume and Kafka: similarities and differences.
It's a frequent question: "What is the difference between Flume and Kafka?" The answer could be very long, but let me briefly explain the key points.
1) Pull and Push.
Flume accumulates data until some condition is met (number of events, buffer size, or timeout) and then pushes it to its destination.
Kafka stores data until a client initiates a read, so the client pulls data whenever it wants.
2) Data processing
Flume can do simple transformations via interceptors.
Kafka doesn't do any data processing; it just stores the data.
3) Architecture
Flume is usually deployed as a set of standalone instances.
Kafka is a cluster, which means it gives you High Availability and scalability out of the box, without extra effort.
4) Message size
Flume doesn't have any obvious restrictions on message size.
Kafka was designed for messages of a few KB.
5) Coding vs Configuring
Flume is usually a configurable tool (users generally don't write code; instead they use its configuration capabilities).
With Kafka, you have to write code to load/unload the data.
Many customers think about choosing the right technology, either Flume or Kafka, for handling their streaming data. Stop choosing: use both. It's quite a common use case, and it is named Flafka. A good explanation with nice pictures can be found here (I actually borrowed a few screenshots from there).
First of all, Flafka is not a dedicated project. It's just a set of Java classes that integrate Flume and Kafka.
Now Kafka can be either a source for Flume, via the Flume config:
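A hedged sketch of such a Kafka source declaration, for newer Flume releases (broker addresses, topic, and channel names are illustrative; older Flume releases used ZooKeeper-based properties instead):

```properties
agent.sources = kafka-src
agent.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafka-src.kafka.bootstrap.servers = broker1:9092,broker2:9092
agent.sources.kafka-src.kafka.topics = bus-topic
agent.sources.kafka-src.channels = mem-channel
```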
or a channel, via the following directive:
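A minimal sketch of a Kafka channel declaration (broker address and topic name are illustrative):

```properties
agent.channels = kafka-ch
agent.channels.kafka-ch.type = org.apache.flume.channel.kafka.KafkaChannel
agent.channels.kafka-ch.kafka.bootstrap.servers = broker1:9092
agent.channels.kafka-ch.kafka.topic = flume-channel-topic
```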
Use Case 1. Kafka as a source or channel
If you have Kafka as an enterprise service bus (see my example above), you may want to load data from the bus into HDFS. You could do this by writing a Java program, but if you don't like that, you can use Kafka as a Flume source.
In this case, Kafka can also be useful for smoothing peak loads, and Flume provides flexible routing.
Also, you can use Kafka as a Flume channel for high-availability purposes (it's distributed by design).
Use Case 2. Kafka as a sink.
If you use Kafka as an enterprise service bus, you may want to load data into it. The native way for Kafka is a Java program, but if you feel it would be way more convenient with Flume (just a few config files), you have this option. The only thing you need to do is configure Kafka as a sink.
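A hedged sketch of such a Kafka sink configuration (broker address, topic, and channel names are illustrative):

```properties
agent.sinks = kafka-sink
agent.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafka-sink.kafka.bootstrap.servers = broker1:9092
agent.sinks.kafka-sink.kafka.topic = bus-topic
agent.sinks.kafka-sink.channel = mem-channel
```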
Use Case 3. Flume as a tool to enrich data.
As I already mentioned, Kafka can't do any data processing; it just stores data without any transformation. You can use Flume as a way to add extra information to your Kafka messages. To do this, define Kafka as a source, implement an interceptor that adds some information to each message, and write the result back to Kafka in a different topic.
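The enrichment pipeline described above might be sketched like this. The interceptor class `com.example.EnrichInterceptor` is a hypothetical custom class you would implement yourself; brokers, topics, and agent names are also illustrative:

```properties
# Read from "raw-events", enrich via a custom interceptor, write to "enriched-events"
agent.sources = kafka-src
agent.channels = mem-ch
agent.sinks = kafka-sink

agent.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafka-src.kafka.bootstrap.servers = broker1:9092
agent.sources.kafka-src.kafka.topics = raw-events
agent.sources.kafka-src.channels = mem-ch
agent.sources.kafka-src.interceptors = enrich
# Hypothetical custom interceptor that adds fields to each event
agent.sources.kafka-src.interceptors.enrich.type = com.example.EnrichInterceptor$Builder

agent.channels.mem-ch.type = memory

agent.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafka-sink.kafka.bootstrap.servers = broker1:9092
agent.sinks.kafka-sink.kafka.topic = enriched-events
agent.sinks.kafka-sink.channel = mem-ch
# The Kafka source stamps each event with its origin topic; disable the
# override so events go to "enriched-events" rather than back to "raw-events"
agent.sinks.kafka-sink.allowTopicOverride = false
```

Note the last line: without it, the Kafka sink may honor the topic header set by the Kafka source and loop events straight back into the source topic.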
There are two major tools for loading streaming data: Flume and Kafka. There is no single right answer as to which one to use, because each tool has its own advantages and disadvantages. That's generally why Flafka was created - it's just a combination of those two tools.