Monday Apr 22, 2013

Announcing the MySQL Applier for Apache Hadoop

Enabling Real-Time MySQL to HDFS Integration

Batch processing delivered by Map/Reduce remains central to Apache Hadoop, but as the pressure to gain competitive advantage from “speed of thought” analytics grows, Hadoop itself is undergoing significant evolution. Technologies allowing real-time queries, such as Apache Drill, Cloudera Impala and the Stinger Initiative, are emerging, supported by a new generation of resource management in Apache YARN.

To support this growing emphasis on real-time operations, we are releasing a new MySQL Applier for Hadoop to enable the replication of events from MySQL to Hadoop / Hive / HDFS (Hadoop Distributed File System) as they happen.  The Applier complements existing batch-based Apache Sqoop connectivity.

Replication via the MySQL Applier for Hadoop works by connecting to the MySQL master, reading binary log events as soon as they are committed, and writing them into a file in HDFS.


The Applier for Hadoop uses an API provided by libhdfs, a C library to manipulate files in HDFS. The library comes precompiled with Hadoop distributions.

It connects to the MySQL master to read the binary log and then:

  • Fetches the row insert events occurring on the master
  • Decodes these events, extracts the data inserted into each field of the row, and uses content handlers to transform it into the required format
  • Appends it to a text file in HDFS (a sketch of this step follows the list).
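
For illustration only, here is a minimal Java sketch of that final append step using Hadoop's FileSystem API. The Applier itself performs the write through the libhdfs C API, and the file path and row content below are hypothetical placeholders.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal sketch: append one decoded row-insert event to a text file in HDFS.
    // The real Applier does this through the libhdfs C API; path and CSV line
    // are hypothetical placeholders.
    public class HdfsAppendSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path dataFile = new Path("/user/hive/warehouse/mydb/mytable/datafile1.txt");
            String csvRow = "42,hello,2013-04-22\n";    // one decoded row, comma separated

            // Create the file on first write, append afterwards (append support
            // must be enabled in the HDFS configuration).
            FSDataOutputStream out = fs.exists(dataFile) ? fs.append(dataFile) : fs.create(dataFile);
            out.write(csvRow.getBytes(StandardCharsets.UTF_8));
            out.close();
            fs.close();
        }
    }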

Databases are mapped as separate directories, with their tables mapped as sub-directories within the Hive data warehouse directory. Data inserted into each table is written into text files (named datafile1.txt) in Hive / HDFS. Data can be written in comma-separated format, or in any other format configurable via command-line arguments.
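For example, assuming a hypothetical database db1 with tables t1 and t2 and the default Hive warehouse location, the resulting layout would look something like:

    /user/hive/warehouse/db1/t1/datafile1.txt
    /user/hive/warehouse/db1/t2/datafile1.txt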


You can learn more about the design of the MySQL Applier for Hadoop from this blog post.

The installation, configuration and implementation are discussed in detail in this Applier for Hadoop blog. Integration with Hive is also documented.

You can also see it in action in this MySQL Hadoop Applier Video Tutorial.

With the growth in big data projects and Hadoop adoption, it would be great to get your feedback on how we can further develop the Applier to meet your real-time integration needs. Please use the comments section to let the MySQL team know your priorities.

Thursday Jul 26, 2012

MySQL and Hadoop

Introduction

"Improving MySQL performance using Hadoop" was the talk which I and Manish Kumar gave at Java One & Oracle Develop 2012, India. Based on the response and interest of the audience, we decided to summarize the talk in a blog post. The slides of this talk can be found here. They also include a screen-cast of a live Hadoop system pulling data from MySQL and working on the popular 'word count' problem.



MySQL and Hadoop are popularly described as 'friends with benefits', and our talk was aimed at showing how!

The benefits of MySQL to developers are speed, reliability, data integrity and scalability. It can successfully process large amounts of data (up to terabytes). But for applications that require massive parallel processing, we may need the benefits of a parallel processing system such as Hadoop.

Hadoop is a highly scalable distributed framework and extremely powerful in terms of computation. Hadoop is fault tolerant and parallelizes data processing across many nodes. Popular users of Hadoop are Yahoo, Facebook, Twitter and Netflix.

Combine the speed and reliability of MySQL with the high scalability of Hadoop and what you get is a powerful system capable of using relational data and crunching it at tremendous speeds!

Hadoop

Hadoop leverages its distributed file system - the Hadoop Distributed File System (HDFS) - to cheaply and reliably replicate chunks of data. HDFS is a distributed, scalable and portable file system written in Java. A Hadoop instance typically has a single name-node, with a cluster of data-nodes forming the HDFS cluster. HDFS uses TCP/IP for communication and stores large files across multiple machines. One drawback of HDFS is that the single name-node stores all of the metadata in memory.

Submitting a task to a Hadoop system is done by writing a map-reduce job. A map-reduce job splits the input data into independent chunks, each of which is processed by a map task in parallel. During the 'Reduce' phase, the data from the data nodes is merge-sorted so that the key-value pairs for a given key are contiguous. The merged data is read sequentially, and the values are passed to the reduce method with an iterator that reads the input until the next key is encountered.
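
To make this concrete, here is a minimal word-count job written against the Hadoop MapReduce Java API: the mapper emits (word, 1) pairs, and the reducer receives all values for a key through the iterator described above. Class and path names are illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: the framework hands us all values for one key as an iterator.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }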

Sqoop

One of the popular ways to use MySQL with Hadoop is to 'Sqoop' it! Sqoop was originally developed by Cloudera and serves as an import tool for Hadoop. Recently it graduated from the Apache Incubator to a Top Level Project. It allows users to easily import data from a MySQL database into Hadoop's HDFS so it can be used in Hive. In its direct mode, Sqoop uses mysqldump to fetch the data stored in MySQL. One useful feature of Sqoop is that it can also be installed on Amazon Elastic MapReduce (EMR).
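
As a rough sketch of how an import looks on the command line (the host, database, table and user names below are made up):

    sqoop import \
      --connect jdbc:mysql://dbhost/salesdb \
      --username appuser -P \
      --table orders \
      --hive-import \
      --direct    # direct mode shells out to mysqldump for faster transfers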

You can find more about Sqoop here.

Demo

In the demo session, we showed a live Hadoop system with 1 name node and 2 data nodes. The idea was to show how easily a live Hadoop cluster can be brought up. We demonstrated the various phases of setting up the Hadoop system: installation, formatting the HDFS, editing the configuration files, starting the Hadoop cluster and finally running the map-reduce job. We also addressed questions related to troubleshooting Hadoop deployments and common mistakes made while setting up a Hadoop cluster. After this, we wrote a basic map-reduce job for the word count problem in Java and ran the job on our Hadoop cluster. Although the 2-node cluster did not give much improvement in the time required to complete the task, the map-reduce job still returned results noticeably faster than the same problem solved using simple SQL queries.
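
For readers who want to retrace the demo, the steps map roughly onto the following commands on a Hadoop 1.x installation (the paths and jar name are illustrative and vary by version and setup):

    # format HDFS (run once, on the name node)
    bin/hadoop namenode -format

    # start the HDFS and map-reduce daemons across the cluster
    bin/start-all.sh

    # submit the word count job against input already copied into HDFS
    bin/hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output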

The following image depicts the word count problem solved using a map-reduce job:



Response

Overall, the response to our session was overwhelming. Several interesting questions came up during the session, and many attendees were interested in other possible ways of integrating MySQL with Hadoop. Some of them came up with unique ideas and use cases in which the benefits of MySQL proved to be the perfect missing piece in Hadoop deployments built on typical non-relational databases.

Apart from MySQL, Hadoop, HDFS and Sqoop, related technologies discussed in the session included Hive, Scribe and Thrift. The following are some helpful resources:
  • Leveraging Hadoop to Augment MySQL Deployments - Sarah Sproehnle, Cloudera
  • http://engineering.twitter.com/2010/04/hadoop-at-twitter.html
  • http://semanticvoid.com/blog/category/mysql
  • http://michael-noll.com
  • http://www.facebook.com/MySQLatFacebook
  • http://shout.setfive.com/tag/hadoop/

Monday Dec 19, 2011

Using MySQL Cluster to Protect & Scale the HDFS Namenode

The MySQL Cluster product team is always interested to see new and innovative uses of the database. Last week, a team of students at the KTH Royal Institute of Technology in Sweden blogged about their use of MySQL Cluster in creating a scalable and highly available HDFS Namenode. The blog has received some pretty wide coverage, but was first picked up by Alex Popescu at the myNoSQL site.

There are many established use cases of MySQL Cluster in the web, cloud/SaaS, telecoms and even flight control systems – you can see those we are allowed to talk about publicly here.

The KTH team has been working on a project to move all of the metadata from the HDFS / Hadoop namenode to MySQL Cluster. Why did they want to do this, you may ask? Well…:

- The namenode is a single point of failure. If it goes down, so too does the file system.

- As a single server, the namenode becomes a bottleneck within heavily loaded HDFS / Hadoop deployments. As server resources are consumed and write volumes increase, the system can grind to a halt. (And with data volumes growing around 40% per year, this will only become more common!)

So KTH decided to move metadata storage to MySQL Cluster. Why, you may ask? Well….

- MySQL Cluster already offered them a replicated, shared-nothing database, distributed across commodity hardware.

- MySQL Cluster is widely deployed with proven stability

- The metadata can be distributed across nodes to scale out capacity, while retaining complete consistency for clients and eliminating any single point of failure.

- Linear scaling of operations per second across the cluster, as new namenodes are added.

Access to the cluster is via the MySQL Cluster Connector for Java, providing a NoSQL, Java-based ORM with very low latency. You can learn more about this ClusterJ API here.
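
To give a flavour of the programming model (the connect string, database, table and column names below are invented for illustration), a ClusterJ application maps a plain Java interface onto a cluster table and reads and writes it without any SQL:

    import java.util.Properties;

    import com.mysql.clusterj.ClusterJHelper;
    import com.mysql.clusterj.Session;
    import com.mysql.clusterj.SessionFactory;
    import com.mysql.clusterj.annotation.PersistenceCapable;
    import com.mysql.clusterj.annotation.PrimaryKey;

    public class ClusterJSketch {

        // Maps directly to an NDB table named "inode" (hypothetical schema).
        @PersistenceCapable(table = "inode")
        public interface Inode {
            @PrimaryKey
            long getId();
            void setId(long id);

            String getName();
            void setName(String name);
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("com.mysql.clusterj.connectstring", "mgmhost:1186");
            props.setProperty("com.mysql.clusterj.database", "hdfs_meta");

            SessionFactory factory = ClusterJHelper.getSessionFactory(props);
            Session session = factory.getSession();

            // Insert a row without any SQL.
            Inode file = session.newInstance(Inode.class);
            file.setId(1L);
            file.setName("/user/data/file1");
            session.persist(file);

            // Primary-key lookup, again without SQL.
            Inode found = session.find(Inode.class, 1L);
            System.out.println(found.getName());

            session.close();
        }
    }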

Of course, the work at KTH is ongoing, with further optimizations planned – which we will follow with interest.

So how can you determine if MySQL Cluster is the right choice for your new project? We have just updated our MySQL Cluster Evaluation Guide.

This update is based around the latest MySQL Cluster 7.2 Development Release, which includes a series of enhancements that further broaden the use cases of MySQL Cluster, including:

- 70x higher JOIN performance, with Adaptive Query Localization pushing JOIN operations down to MySQL Cluster’s data nodes

- Native key-value Memcached interface to the cluster, allowing both schema-based and schema-less storage (see the sketch after this list)

- New cross-data center scalability enhancements
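
Because the Memcached interface speaks the standard memcached protocol, any existing memcached client can talk to the cluster. Here is a minimal sketch using the spymemcached Java client, with a made-up host, port and key:

    import java.net.InetSocketAddress;

    import net.spy.memcached.MemcachedClient;

    // Minimal sketch: talk to MySQL Cluster's Memcached interface with a
    // standard memcached client (spymemcached). Host, port and keys are
    // hypothetical.
    public class ClusterMemcachedSketch {
        public static void main(String[] args) throws Exception {
            MemcachedClient client =
                    new MemcachedClient(new InetSocketAddress("memcachehost", 11211));

            // Writes and reads go into the cluster's data nodes; depending on
            // configuration they map onto table columns or generic key-value rows.
            client.set("town:maidenhead", 0, "SL6").get();   // wait for the async set
            System.out.println(client.get("town:maidenhead"));

            client.shutdown();
        }
    }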

MySQL Cluster is not a fit for every use-case, but by downloading the Evaluation Guide, you’ll get a clear picture of where MySQL Cluster can be useful to you, and best practices in planning and executing your evaluation.

Let us know of other interesting use cases in the comments below.
