Memcached Performance on Sun's Nehalem System

Memcached is the de facto distributed caching server used to scale many web2.0 sites today. As sites grow to support a very large number of users, memcached aids scalability by cutting down on MySQL traffic and improving response times.

Memcached is a very lightweight server but is known not to scale beyond 4-6 threads. Some scalability improvements have gone into the 1.3 release (still in beta). With the new Intel Nehalem-based systems and their improved hyper-threading providing twice the performance of current systems, we were curious to see how memcached would perform on them. So we ran some tests, the results of which are shown below:




Memcached 1.3.2 does scale slightly better than 1.2.5 beyond 4 threads. However, both versions peak at 8 threads, with 1.3.2 delivering about 14% better throughput at 352,190 operations/sec.

The improvements made to per-thread stats have certainly helped, as we no longer see stats_lock at the top of the profile. That honor now goes to cache_lock. With new systems fast enough to make 350K ops/sec possible, breaking up this lock (and others) in memcached is necessary to improve scalability.
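One common way to break up a single hot lock is lock striping: keys are hashed into several independent partitions, each with its own lock, so threads working on different partitions do not contend. The sketch below only illustrates that idea; it is not memcached's code, the class name `StripedCache` and the stripe count are made up for the example, and the real cache_lock may well guard more state than a hash table.

```python
import threading

class StripedCache:
    """Toy key/value store guarded by several locks instead of one global lock."""

    def __init__(self, n_stripes=16):
        # One lock and one table per stripe (n_stripes is an arbitrary choice).
        self._locks = [threading.Lock() for _ in range(n_stripes)]
        self._tables = [{} for _ in range(n_stripes)]

    def _stripe(self, key):
        # Map a key to the stripe that owns it.
        return hash(key) % len(self._locks)

    def get(self, key):
        i = self._stripe(key)
        with self._locks[i]:          # only this stripe is blocked
            return self._tables[i].get(key)

    def set(self, key, value):
        i = self._stripe(key)
        with self._locks[i]:
            self._tables[i][key] = value

cache = StripedCache(n_stripes=8)

def worker(tid):
    # Each thread writes 1000 keys of its own.
    for n in range(1000):
        cache.set(f"k{tid}:{n}", n)

threads = [threading.Thread(target=worker, args=(t,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(cache.get("k3:999"))  # prints 999
```

In CPython the interpreter lock hides most of the benefit, so this shows the structure rather than a speedup; in a C server, the gain depends on how evenly keys spread across stripes.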

Test Details

A single instance of memcached was run on a SunFire X2270 (2-socket Nehalem) with 48GB of memory and an Oplin 10G card. Several external client systems were used to drive load against the server using an internally developed Memcached benchmark. More on the benchmark later.
The clients connected to the server over a single 10 Gigabit Ethernet link. At the maximum throughput of 350K ops/sec, the network was about 52% utilized and the server was 62% utilized. So there is plenty of headroom on this system to handle a much higher load if memcached could scale better. Of course, it is possible to run multiple instances of memcached to get better performance and make better use of the system resources, and we plan to do that next. It is important to note that using these high-performance systems effectively for memcached will require 10 GbE interfaces.
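A rough back-of-envelope check of the 10 GbE point can be made from the numbers above. This sketch assumes the 52% figure refers to a 10 Gbit/s link and counts all bytes on the wire (payload plus protocol overhead); the function names are made up for the example.

```python
# Figures quoted in the post.
LINK_BITS_PER_SEC = 10e9      # one 10 Gigabit Ethernet link
NETWORK_UTILIZATION = 0.52    # "about 52% utilized"
OPS_PER_SEC = 352_190         # peak throughput reported for 1.3.2

def bytes_per_op(link_bps, utilization, ops_per_sec):
    """Average bytes on the wire per operation, overhead included."""
    return link_bps * utilization / 8 / ops_per_sec

def max_ops_on_link(link_bps, bytes_per_operation):
    """Operations/sec at which a link of the given speed would saturate."""
    return link_bps / 8 / bytes_per_operation

avg = bytes_per_op(LINK_BITS_PER_SEC, NETWORK_UTILIZATION, OPS_PER_SEC)
gige_ceiling = max_ops_on_link(1e9, avg)

print(f"average bytes per operation: {avg:.0f}")
print(f"1 GbE would saturate near {gige_ceiling:.0f} ops/sec")
```

Under these assumptions each operation moves roughly 1.8 KB, so a single 1 Gbit/s interface would saturate at roughly 68K ops/sec with this traffic mix, well below what the server delivered.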

Benchmark Details

The Memcached benchmark we ran is based on Apache Olio - a web2.0 workload. I recently showcased results from Olio on Nehalem systems as well. Since Olio is a complex multi-tier workload, we extracted the memcached part to more easily test it in a stand-alone environment. This gave rise to our Memcached benchmark.

The benchmark initially populates the server cache with objects of different sizes to simulate the types of data that real sites typically store in memcached:

  • small objects (4-100 bytes) to represent locks and query results
  • medium objects (1-2 KBytes) to represent thumbnails, database rows, result sets
  • large objects (5-20 KBytes) to represent whole or partially generated pages

The benchmark then runs a mixture of operations (90% gets, 10% sets) and measures the throughput and response times once the system reaches steady state. The workload is implemented using Faban, an open-source benchmark development framework. Faban not only speeds benchmark development; its harness is also a great way to queue, monitor and archive runs for analysis.
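The shape of such a workload can be sketched in a few lines. This is not the Faban-based benchmark itself: a plain dict stands in for the memcached client, the equal split across size classes is an assumption (the post does not give the distribution), and all function names are made up for the example.

```python
import random

# Size classes from the post, in bytes (1 KByte taken as 1024 bytes).
SIZE_CLASSES = {
    "small": (4, 100),
    "medium": (1024, 2048),
    "large": (5 * 1024, 20 * 1024),
}
# Assumption: equal weight per class; the real benchmark may differ.
CLASS_WEIGHTS = {"small": 1, "medium": 1, "large": 1}
GET_FRACTION = 0.9  # 90% gets, 10% sets

def make_value(rng, size_class):
    """Zero-filled payload with a random length inside the class range."""
    lo, hi = SIZE_CLASSES[size_class]
    return bytes(rng.randint(lo, hi))

def populate(cache, rng, n_keys):
    """Pre-load the cache and return the list of keys created."""
    names = list(SIZE_CLASSES)
    weights = [CLASS_WEIGHTS[n] for n in names]
    keys = []
    for i in range(n_keys):
        size_class = rng.choices(names, weights=weights)[0]
        key = f"{size_class}:{i}"
        cache[key] = make_value(rng, size_class)
        keys.append(key)
    return keys

def run_mix(cache, rng, keys, n_ops):
    """Issue n_ops operations against random keys and count each kind."""
    counts = {"get": 0, "set": 0}
    for _ in range(n_ops):
        key = rng.choice(keys)
        if rng.random() < GET_FRACTION:
            cache.get(key)
            counts["get"] += 1
        else:
            size_class = key.split(":", 1)[0]
            cache[key] = make_value(rng, size_class)
            counts["set"] += 1
    return counts

rng = random.Random(42)
cache = {}  # stand-in for a memcached client
keys = populate(cache, rng, 1000)
counts = run_mix(cache, rng, keys, 10_000)
print(counts)
```

A real driver would replace the dict with a client connection, run many such loops in parallel from several machines, and record per-operation latency only after a warm-up period.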

Stay tuned for further results.
Comments:

Do you know about Ehcache?

http://ehcache.sourceforge.net/
http://gregluck.com/blog/archives/2009/02/i_have_been_wai.html

It would be interesting to compare the scalability with Memcached.

Posted by Paul Sandoz on April 17, 2009 at 05:32 PM PDT #

Yes. But I believe that EHCache and memcached are for different types of applications.

Posted by Shanti on April 18, 2009 at 03:01 AM PDT #

How many cores do you have? If it is 4 or 8, any idea what it could deliver with more cores? I wonder if the step in the curve comes only from memcached locking.

Posted by LiFo2 on April 19, 2009 at 11:00 PM PDT #

The server had 2 quad-core Nehalem processors with hyper-threading enabled, so kind of like having a total of 16 threads/cpus.

Posted by Shanti on April 20, 2009 at 04:07 AM PDT #

Do you know about Brutis - http://code.google.com/p/brutis/

It would be interesting to see how the server performs with Brutis.

Posted by Mostak on April 20, 2009 at 05:05 AM PDT #

Are there performance benefits of using memcached over regular ZFS ARC ?

Posted by J on April 26, 2009 at 02:15 PM PDT #

If what you're asking is whether Memcached can be extended to use ZFS as part of its cache, there can't be much benefit in using a straight RAM-based ARC (in which case you might as well give that memory to memcached directly), but there could be benefits in using the ZFS L2ARC, which uses SSDs to cache additional data behind the regular ARC, reducing the need to go to disk. This can dramatically increase the size of the memcache while allowing ZFS to provide a backing store for the cache.

Posted by Shanti on April 27, 2009 at 03:31 AM PDT #

I am very interested in your initial objects. As you mentioned, you have 3 kinds of objects.
  • small objects (4-100 bytes) to represent locks and query results
  • medium objects (1-2 KBytes) to represent thumbnails, database rows, result sets
  • large objects (5-20 KBytes) to represent whole or partially generated pages

But could you tell me the details about the count, size distribution, expiration distribution and the loading sequence?
Thanks

Posted by mingfan.lu on April 27, 2009 at 12:39 PM PDT #

Is there any chance to obtain your benchmark dataset/trace? It would really help with my benchmarking work.

Thanks,
Alex

Posted by Alex on April 29, 2009 at 08:45 AM PDT #

[Trackback] Behind the scenes of many of your favourite websites you will find an application, that speeds up their page creation times. It's called memcached. At the end it's not much more than a server that allows you to access remote memory to store and load ob...

Posted by c0t0d0s0.org on May 20, 2009 at 05:57 PM PDT #

How about 1.4.4? Can you compare it with the other versions?

Posted by Tao8 on December 23, 2009 at 03:24 PM PST #

About

I'm a Senior Staff Engineer in the Performance & Applications Engineering Group (PAE). This blog focuses on tips to build, configure, tune and measure performance of popular open source web applications on Solaris.
