Monday Apr 22, 2013

Announcing the MySQL Applier for Apache Hadoop

Enabling Real-Time MySQL to HDFS Integration

Batch processing delivered by Map/Reduce remains central to Apache Hadoop, but as the pressure to gain competitive advantage from “speed of thought” analytics grows, so Hadoop itself is undergoing significant evolution. The development of technologies allowing real time queries, such as Apache Drill, Cloudera Impala and the Stinger Initiative are emerging, supported by new generations of resource management with Apache YARN

To support this growing emphasis on real-time operations, we are releasing a new MySQL Applier for Hadoop to enable the replication of events from MySQL to Hadoop / Hive / HDFS (Hadoop Distributed File System) as they happen.  The Applier complements existing batch-based Apache Sqoop connectivity.

Replication via the MySQL Applier for Hadoop is implemented by connecting to the MySQL master and reading binary log events as soon as they are committed, and writing them into a file in HDFS.


The Applier for Hadoop uses an API provided by libhdfs, a C library to manipulate files in HDFS. The library comes precompiled with Hadoop distributions.

It connects to the MySQL master to read the binary log and then:

  • Fetches the row insert events occurring on the master
  • Decodes these events, extracts data inserted into each field of the row, and uses content handlers to get it in the format required
  • Appends it to a text file in HDFS.

Databases are mapped as separate directories, with their tables mapped as sub-directories with a Hive data warehouse directory. Data inserted into each table is written into text files (named as datafile1.txt) in Hive / HDFS. Data can be in comma separated format; or any other, that is configurable by command line arguments. 


You can learn more about the design of the MySQL Applier for Hadoop from this blog

The installation, configuration and implementation are discussed in detail in this Applier for Hadoop blog. Integration with Hive is also documented.

You can also see it in action from this MySQL Hadoop Applier Video Tutorial 

With the growth in big data projects and Hadoop adoption, it would be great to get your feedback on how we can further develop the Applier to meet your real-time integration needs. Please use the comments section to let the MySQL team know your priorities.

Friday Mar 08, 2013

MySQL Web Reference Architectures - Your Guide to Innovating on the Web

MySQL is deployed in 9 of the top 10 most trafficked sites on the web including Facebook, Twitter, eBay and YouTube, as well as in some of the fastest growing services such as Tumblr, Pinterest and box.com

Working with these companies has given MySQL developers, consultants and support engineers unique insight into how to design database-driven web architectures – whether deployed on-premise or in the cloud.

The MySQL Web Reference Architectures are a set of documented and repeatable best practices for building infrastructure that deliver the highest levels of scalability, agility and availability with the lowest levels of cost, risk and complexity. 

Four components common to most web and mobile properties are sized, with optimum deployment architectures for each:

User authentication and session management

Content management

Ecommerce

Analytics and big data integration

The sizing is defined by database size and load, as shown below

For each reference architecture, strategies for scaling the service and ensuring high availability are discussed, along with approaches to secure, audit and backup user data, and tools to monitor and manage the environment.

The Reference Architectures cover the core underlying technologies supporting today’s most successful web services including:

- MySQL Database

- MySQL Cluster

- MySQL Replication

- Caching with Memcached and Redis

- Big Data with Hadoop

- NoSQL APIs

- Geographic Redundancy

- Hardware Recommendations

- Operational Best Practices

An example of the "Large" reference architecture is shown below

To learn more:

- Download the MySQL Web Reference Architectures Guide

- View the MySQL Web Reference Architectures slides

The Reference Architecture are designed as a starting point which we hope will enable you build the next web and mobile phenomenon!

Tuesday Jan 15, 2013

Upcoming MySQL Events

Oracle's MySQL team is running/participating to a number of events during the upcoming weeks and months. Don't miss this chance to learn about the latest developments straight from the source and to get all your questions answered!

Additional events will likely be scheduled down the road and posted on our events page, but you can already register for the following ones:

MySQL Tech Tour: Big Data and High Availability with MySQL– Pleasanton, California

January 22

MySQL Tech Tour: Big Data and High Availability with MySQL– Belmont, California

January 23

FOSDEM – Brussels, Belgium

February 2-3

MySQL Tech Tour: From the Web to the Cloud – Istanbul, Turkey

February 5

Oracle & Zend LAMP Seminar - Milan, Italy

February 15

MySQL Tech Tour for Software & Hardware Vendors - Arie Petach Tikva, Israel

February 19

MySQL Tech Tour: From the Web to the Cloud – Oslo, Norway

February 21

MySQL Tech Tour: From the Web to the Cloud – Brussels, Belgium

February 21

MySQL Tech Tour: From the Web to the Cloud – Stockholm, Sweden

March 20

MySQL Tech Tour: From the Web to the Cloud – Munich, Germany

April 17

MySQL Tech Tour: From the Web to the Cloud – Baden, Switzerland

April 24

We hope to see many of you at those events!

Thursday Nov 29, 2012

MySQL and Hadoop Integration - Unlocking New Insight

“Big Data” offers the potential for organizations to revolutionize their operations. With the volume of business data doubling every 1.2 years, analysts and business users are discovering very real benefits when integrating and analyzing data from multiple sources, enabling deeper insight into their customers, partners, and business processes.

As the world’s most popular open source database, and the most deployed database in the web and cloud, MySQL is a key component of many big data platforms, with Hadoop vendors estimating 80% of deployments are integrated with MySQL.

The new Guide to MySQL and Hadoop presents the tools enabling integration between the two data platforms, supporting the data lifecycle from acquisition and organisation to analysis and visualisation / decision, as shown in the figure below


The Guide details each of these stages and the technologies supporting them:

Acquire: Through new NoSQL APIs, MySQL is able to ingest high volume, high velocity data, without sacrificing ACID guarantees, thereby ensuring data quality. Real-time analytics can also be run against newly acquired data, enabling immediate business insight, before data is loaded into Hadoop. In addition, sensitive data can be pre-processed, for example healthcare or financial services records can be anonymized, before transfer to Hadoop.

Organize: Data is transferred from MySQL tables to Hadoop using Apache Sqoop. With the MySQL Binlog (Binary Log) API, users can also invoke real-time change data capture processes to stream updates to HDFS.

Analyze: Multi-structured data ingested from multiple sources is consolidated and processed within the Hadoop platform.

Decide: The results of the analysis are loaded back to MySQL via Apache Sqoop where they inform real-time operational processes or provide source data for BI analytics tools.

So how are companies taking advantage of this today? As an example, on-line retailers can use big data from their web properties to better understand site visitors’ activities, such as paths through the site, pages viewed, and comments posted. This knowledge can be combined with user profiles and purchasing history to gain a better understanding of customers, and the delivery of highly targeted offers.

Of course, it is not just in the web that big data can make a difference. Every business activity can benefit, with other common use cases including:

- Sentiment analysis;

- Marketing campaign analysis;

- Customer churn modeling;

- Fraud detection;

- Research and Development;

- Risk Modeling;

- And more.

As the guide discusses, Big Data is promising a significant transformation of the way organizations leverage data to run their businesses. MySQL can be seamlessly integrated within a Big Data lifecycle, enabling the unification of multi-structured data into common data platforms, taking advantage of all new data sources and yielding more insight than was ever previously imaginable.

Download the guide to MySQL and Hadoop integration to learn more. I'd also be interested in hearing about how you are integrating MySQL with Hadoop today, and your requirements for the future, so please use the comments on this blog to share your insights.




Friday Jun 01, 2012

Configuring MySQL Cluster Data Nodes

In my previous blog post, I discussed the enhanced performance and scalability delivered by extensions to the multi-threaded data nodes in MySQL Cluster 7.2. In this post, I’ll share best practices on the configuration of data nodes to achieve optimum performance on the latest generations of multi-core, multi-thread CPU designs.

Configuring the Data Nodes

The configuration of data node threads can be managed in two ways via the config.ini file:

- Simply set MaxNoOfExecutionThreads to the appropriate number of threads to be run in the data node, based on the number of threads presented by the processors used in the host or VM.

- Use the new ThreadConfig variable that enables users to configure both the number of each thread type to use and also which CPUs to bind them too.

The flexible configuration afforded by the multi-threaded data node enhancements means that it is possible to optimise data nodes to use anything from a single CPU/thread up to a 48 CPU/thread server. Co-locating the MySQL Server with a single data node can fully utilize servers with 64 – 80 CPU/threads. It is also possible to co-locate multiple data nodes per server, but this is now only required for very large servers with 4+ CPU sockets dense multi-core processors.

24 Threads and Beyond!

An example of how to make best use of a 24 CPU/thread server box is to configure the following:

- 8 ldm threads

- 4 tc threads

- 3 recv threads

- 3 send threads

- 1 rep thread for asynchronous replication.

Each of those threads should be bound to a CPU. It is possible to bind the main thread (schema management domain) and the IO threads to the same CPU in most installations.

In the configuration above, we have bound threads to 20 different CPUs. We should also protect these 20 CPUs from interrupts by using the IRQBALANCE_BANNED_CPUS configuration variable in /etc/sysconfig/irqbalance and setting it to 0x0FFFFF.

The reason for doing this is that MySQL Cluster generates a lot of interrupt and OS kernel processing, and so it is recommended to separate activity across CPUs to ensure conflicts with the MySQL Cluster threads are eliminated.

When booting a Linux kernel it is also possible to provide an option isolcpus=0-19 in grub.conf. The result is that the Linux scheduler won't use these CPUs for any task. Only by using CPU affinity syscalls can a process be made to run on those CPUs.

By using this approach, together with binding MySQL Cluster threads to specific CPUs and banning CPUs IRQ processing on these tasks, a very stable performance environment is created for a MySQL Cluster data node.

On a 32 CPU/Thread server:

- Increase the number of ldm threads to 12

- Increase tc threads to 6

- Provide 2 more CPUs for the OS and interrupts.

- The number of send and receive threads should, in most cases, still be sufficient.

On a 40 CPU/Thread server, increase ldm threads to 16, tc threads to 8 and increment send and receive threads to 4.

On a 48 CPU/Thread server it is possible to optimize further by using:

- 12 tc threads

- 2 more CPUs for the OS and interrupts

- Avoid using IO threads and main thread on same CPU

- Add 1 more receive thread.

Summary

As both this and the previous post seek to demonstrate, the multi-threaded data node extensions not only serve to increase performance of MySQL Cluster, they also enable users to achieve significantly improved levels of utilization from current and future generations of massively multi-core, multi-thread processor designs.

A big thanks to Mikael Ronstrom, Senior MySQL Architect at Oracle, for his work in developing these enhancements and best practices.

You can download MySQL Cluster 7.2 today and try out all of these enhancements. The Getting Started guides are an invaluable aid to quickly building a Proof of Concept

Don’t forget to check out the MySQL Cluster 7.2 New Features whitepaper to discover everything that is new in the latest GA release

Wednesday May 30, 2012

MySQL Cluster 7.2: Over 8x Higher Performance than Cluster 7.1

Summary

The scalability enhancements delivered by extensions to multi-threaded data nodes enables MySQL Cluster 7.2 to deliver over 8x higher performance than the previous MySQL Cluster 7.1 release on a recent benchmark

What’s New in MySQL Cluster 7.2

MySQL Cluster 7.2 was released as GA (Generally Available) in February 2012, delivering many enhancements to performance on complex queries, new NoSQL Key / Value API, cross-data center replication and ease-of-use. These enhancements are summarized in the Figure below, and detailed in the MySQL Cluster New Features whitepaper

Figure 1: Next Generation Web Services, Cross Data Center Replication and Ease-of-Use

Once of the key enhancements delivered in MySQL Cluster 7.2 is extensions made to the multi-threading processes of the data nodes.

Multi-Threaded Data Node Extensions
The MySQL Cluster 7.2 data node is now functionally divided into seven thread types:
1) Local Data Manager threads (ldm). Note – these are sometimes also called LQH threads.
2) Transaction Coordinator threads (tc)
3) Asynchronous Replication threads (rep)
4) Schema Management threads (main)
5) Network receiver threads (recv)
6) Network send threads (send)
7) IO threads

Each of these thread types are discussed in more detail below.

MySQL Cluster 7.2 increases the maximum number of LDM threads from 4 to 16. The LDM contains the actual data, which means that when using 16 threads the data is more heavily partitioned (this is automatic in MySQL Cluster). Each LDM thread maintains its own set of data partitions, index partitions and REDO log. The number of LDM partitions per data node is not dynamically configurable, but it is possible, however, to map more than one partition onto each LDM thread, providing flexibility in modifying the number of LDM threads.

The TC domain stores the state of in-flight transactions. This means that every new transaction can easily be assigned to a new TC thread. Testing has shown that in most cases 1 TC thread per 2 LDM threads is sufficient, and in many cases even 1 TC thread per 4 LDM threads is also acceptable. Testing also demonstrated that in some instances where the workload needed to sustain very high update loads it is necessary to configure 3 to 4 TC threads per 4 LDM threads. In the previous MySQL Cluster 7.1 release, only one TC thread was available. This limit has been increased to 16 TC threads in MySQL Cluster 7.2. The TC domain also manages the Adaptive Query Localization functionality introduced in MySQL Cluster 7.2 that significantly enhanced complex query performance by pushing JOIN operations down to the data nodes.

Asynchronous Replication was separated into its own thread with the release of MySQL Cluster 7.1, and has not been modified in the latest 7.2 release.

To scale the number of TC threads, it was necessary to separate the Schema Management domain from the TC domain. The schema management thread has little load, so is implemented with a single thread.

The Network receiver domain was bound to 1 thread in MySQL Cluster 7.1. With the increase of threads in MySQL Cluster 7.2 it is also necessary to increase the number of recv threads to 8. This enables each receive thread to service one or more sockets used to communicate with other nodes the Cluster.

The Network send thread is a new thread type introduced in MySQL Cluster 7.2. Previously other threads handled the sending operations themselves, which can provide for lower latency. To achieve highest throughput however, it has been necessary to create dedicated send threads, of which 8 can be configured. It is still possible to configure MySQL Cluster 7.2 to a legacy mode that does not use any of the send threads – useful for those workloads that are most sensitive to latency.

The IO Thread is the final thread type and there have been no changes to this domain in MySQL Cluster 7.2. Multiple IO threads were already available, which could be configured to either one thread per open file, or to a fixed number of IO threads that handle the IO traffic. Except when using compression on disk, the IO threads typically have a very light load.

Benchmarking the Scalability Enhancements

The scalability enhancements discussed above have made it possible to scale CPU usage of each data node to more than 5x of that possible in MySQL Cluster 7.1. In addition, a number of bottlenecks have been removed, making it possible to scale data node performance by even more than 5x.

Figure 2: MySQL Cluster 7.2 Delivers 8.4x Higher Performance than 7.1

The flexAsynch benchmark was used to compare MySQL Cluster 7.2 performance to 7.1 across an 8-node Intel Xeon x5670-based cluster of dual socket commodity servers (6 cores each).

As the results demonstrate, MySQL Cluster 7.2 delivers over 8x higher performance per data nodes than MySQL Cluster 7.1.

More details of this and other benchmarks will be published in a new whitepaper – coming soon, so stay tuned!

In a following blog post, I’ll provide recommendations on optimum thread configurations for different types of server processor. You can also learn more from the Best Practices Guide to Optimizing Performance of MySQL Cluster

Conclusion

MySQL Cluster has achieved a range of impressive benchmark results, and set in context with the previous 7.1 release, is able to deliver over 8x higher performance per node.

As a result, the multi-threaded data node extensions not only serve to increase performance of MySQL Cluster, they also enable users to achieve significantly improved levels of utilization from current and future generations of massively multi-core, multi-thread processor designs.

About

Get the latest updates on products, technology, news, events, webcasts, customers and more.

Twitter


Facebook

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
2
5
6
9
10
11
12
13
15
16
17
18
19
20
23
24
25
26
27
28
29
30
   
       
Today