Enabling Real-Time MySQL to HDFS Integration
Batch processing delivered by Map/Reduce remains central to Apache Hadoop, but as the pressure to gain competitive advantage from “speed of thought” analytics grows, so Hadoop itself is undergoing significant evolution. The development of technologies allowing real time queries, such as Apache Drill, Cloudera Impala and the Stinger Initiative are emerging, supported by new generations of resource management with Apache YARN
To support this growing emphasis on real-time operations, we are releasing a new MySQL Applier for Hadoop to enable the replication of events from MySQL to Hadoop / Hive / HDFS (Hadoop Distributed File System) as they happen. The Applier complements existing batch-based Apache Sqoop connectivity.
Replication via the MySQL Applier for Hadoop is implemented by connecting to the MySQL master and reading binary log events as soon as they are committed, and writing them into a file in HDFS.
The Applier for Hadoop uses an API provided by libhdfs, a C library to manipulate files in HDFS. The library comes precompiled with Hadoop distributions.
It connects to the MySQL master to read the binary log and then:
Databases are mapped as separate directories, with their tables mapped as sub-directories with a Hive data warehouse directory. Data inserted into each table is written into text files (named as datafile1.txt) in Hive / HDFS. Data can be in comma separated format; or any other, that is configurable by command line arguments.
You can learn more about the design of the MySQL Applier for Hadoop from this blog
The installation, configuration and implementation are discussed in detail in this Applier for Hadoop blog. Integration with Hive is also documented.
You can also see it in action from this MySQL Hadoop Applier Video Tutorial
With the growth in big data projects and Hadoop adoption, it would be great to get your feedback on how we can further develop the Applier to meet your real-time integration needs. Please use the comments section to let the MySQL team know your priorities.