Scaling Wikipedia with LAMP: 7 billion page views per month

I recently attended an interesting talk by Brion Vibber, CTO of the Wikimedia Foundation, the non-profit organisation that runs the infrastructure for Wikipedia. He described how his team of 7 engineers manages the Wikipedia site, which gets an average of 7 billion page views per month and ranks among the top 10 sites in terms of traffic. The highlights from the talk are listed below, including how the site infrastructure is architected to scale up to the traffic it receives.

The site runs on the LAMP stack and you know what that is:

  • Linux
  • Apache
  • MySQL from Sun
  • Perl/PHP/Python/Pwhatever :-)

Wikimedia runs the site on about 400 x86 servers. Of those, about 250 run the web servers and the rest run the MySQL database. They recently acquired OpenSolaris Thumper machines from Sun, which they are exploring. The Sun Fire X4500, aka Thumper, is the world's first open source storage server, running OpenSolaris and ZFS. Currently they are using the Thumpers to store media files on the ZFS file system and they are simply loving it. They have also begun to use the DTrace feature of OpenSolaris and can't stop raving about it!

11/21/08 Update: Link to the recent Press Release WikiMedia selects Sun Microsystems to Enhance Multimedia Experience.

At its core, Wikipedia runs on a very simple system architecture as shown below, and given that it's a non-profit organisation, almost all of the software is open source and FREE.


Simple is nice but it can be SLOW :-) To speed things up, the first step is to add caching at both the front end and the back end of the system. On the web front end, Wikipedia uses the Squid reverse proxy cache, and at the back end it uses memcached, as shown below:


Squid is a proxy server and a web cache daemon. It has a wide variety of uses, from speeding up a web server by caching repeated requests to caching web, DNS and other computer network lookups for a group of people sharing network resources. Squid is a good fit for mostly static dynamic sites like a wiki: the public face of a given wiki page does not change that often, so it can be cached at the HTTP level. Wikipedia also uses Squid for geographical load balancing so that it can use cheaper, faster local bandwidth.
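To make the HTTP-level caching idea concrete, here is a minimal sketch in Python. It is not how Squid is implemented; `PageCache`, the TTL value and the `purge` hook are illustrative assumptions, and the `render` callback stands in for the Apache/PHP backend.

```python
import time
from typing import Callable, Dict, Tuple


class PageCache:
    """Tiny in-memory stand-in for an HTTP-level page cache (hypothetical)."""

    def __init__(self, render: Callable[[str], str], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._render = render      # the slow backend that builds the page
        self._ttl = ttl_seconds    # assumed freshness window, not from the talk
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}  # url -> (expiry, body)
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> str:
        now = self._clock()
        entry = self._store.get(url)
        if entry is not None and entry[0] > now:
            self.hits += 1         # served from cache, backend untouched
            return entry[1]
        self.misses += 1
        body = self._render(url)   # only cache misses reach the backend
        self._store[url] = (now + self._ttl, body)
        return body

    def purge(self, url: str) -> None:
        """Drop a page, e.g. after an edit, so the next reader gets fresh content."""
        self._store.pop(url, None)


if __name__ == "__main__":
    cache = PageCache(lambda url: "<html>" + url + "</html>")
    cache.get("/wiki/LAMP")
    cache.get("/wiki/LAMP")
    print(cache.hits, cache.misses)
```

Because a wiki page is read far more often than it is edited, most requests end up in the `hits` branch and never touch the web servers.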

Along with the Apache/PHP servers, Wikipedia also uses APC, the Alternative PHP Cache. PHP compiles each script to bytecode and then throws the bytecode away after execution, so recompiling on every request adds a lot of unnecessary overhead. Hence it is recommended to always use an opcode cache with PHP. This drastically reduces the startup time for large apps.
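The opcode cache idea can be illustrated with a rough Python analogy (this is not APC or PHP code). The `OpcodeCache` class below is a hypothetical sketch: it compiles a script once with Python's built-in `compile()`, keeps the resulting code object, and recompiles only when the source text changes.

```python
from types import CodeType
from typing import Any, Dict, Tuple


class OpcodeCache:
    """Keeps compiled code objects so a script is compiled once, not per request."""

    def __init__(self) -> None:
        self._compiled: Dict[str, Tuple[str, CodeType]] = {}  # name -> (source, code)
        self.compilations = 0

    def run(self, name: str, source: str, variables: Dict[str, Any]) -> Dict[str, Any]:
        cached = self._compiled.get(name)
        if cached is None or cached[0] != source:
            # Cache miss, or the script changed: pay the compile cost once.
            code = compile(source, name, "exec")
            self.compilations += 1
            self._compiled[name] = (source, code)
        else:
            code = cached[1]       # reuse the stored bytecode
        namespace = dict(variables)
        exec(code, namespace)      # executing still happens on every request
        return namespace


if __name__ == "__main__":
    cache = OpcodeCache()
    for _ in range(3):
        cache.run("page.py", "result = x * 2", {"x": 21})
    print(cache.compilations)
```

Three requests trigger a single compilation here; without the cache, each request would pay the compile cost again.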

Another speedup technique used by Wikipedia is memcached, a general-purpose distributed memory caching system often used to speed up dynamic database-driven websites by caching data and objects in memory to reduce the number of times the database must be read. memcached lets you share temporary data in memory across the network. Even though one needs to go over the network to get the data, the latency is still lower than disk-based database access. Wikipedia usually stores the rendered pages in memcached.
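The usual way to use a cache like this is the "check the cache first, fall back to the database" pattern. The sketch below assumes a hypothetical `FakeMemcached` class (a plain dictionary standing in for a real memcached client) and hypothetical `load_wikitext` and `render` callbacks; none of these names come from the talk.

```python
from typing import Callable, Dict, Optional


class FakeMemcached:
    """In-process stand-in for a memcached client (get/set/delete only)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


def get_rendered_page(title: str, cache: FakeMemcached,
                      load_wikitext: Callable[[str], str],
                      render: Callable[[str], str]) -> str:
    """Return rendered HTML, reading the database only on a cache miss."""
    key = "rendered:" + title
    html = cache.get(key)
    if html is None:
        html = render(load_wikitext(title))  # slow path: database read + render
        cache.set(key, html)                 # later requests skip the slow path
    return html


def invalidate_page(title: str, cache: FakeMemcached) -> None:
    """Call after an edit so a stale rendering is not served."""
    cache.delete("rendered:" + title)
```

Storing the rendered page rather than the raw text means a hit saves both the database read and the rendering work.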

After adding all possible caches, the next thing is to add CASH! :-) i.e. add more servers to gain scalability.

Those 250 or so web servers come with plenty of memory. The underutilized memory can be used for memcached, and it adds up to a big memcached store.
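A rough sketch of how many small caches can behave like one big store: each key is hashed to pick the server that holds it. The server names and the 2 GB of spare memory per machine below are made-up numbers for illustration; only the figure of roughly 250 web servers comes from the talk. Simple modulo hashing is used here for brevity, which remaps most keys when the server list changes; consistent hashing is a common alternative.

```python
import hashlib
from typing import List


def pick_server(key: str, servers: List[str]) -> str:
    """Map a cache key to one server using a stable hash (modulo placement)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


def pooled_capacity_gb(server_count: int, spare_gb_per_server: float) -> float:
    """Total cache space gained by pooling spare memory across servers."""
    return server_count * spare_gb_per_server


if __name__ == "__main__":
    # Hypothetical host names; the real naming scheme is not in the talk.
    servers = ["web%03d" % n for n in range(250)]
    print(pick_server("rendered:LAMP", servers))
    # Assumed 2 GB spare per machine, purely for illustration.
    print(pooled_capacity_gb(len(servers), 2.0), "GB")
```

Every web server computes the same placement for a given key, so any of them can find a cached page without a central directory.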

Further, to get a speedup at the database level, Wikipedia uses simple sharding techniques. They split the data along logical data partitions, such as subsites that don't interact closely.
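Sharding along subsites can be as simple as a lookup table from wiki name to database cluster. The wiki names, cluster names and the grouping below are hypothetical; the talk did not give the actual layout.

```python
from typing import Dict, Iterable, List

# Hypothetical mapping: subsites that don't interact get their own cluster,
# while smaller ones share a cluster.
SHARD_MAP: Dict[str, str] = {
    "enwiki": "db-cluster-1",
    "dewiki": "db-cluster-2",
    "frwiki": "db-cluster-2",
    "commons": "db-cluster-3",
}
DEFAULT_CLUSTER = "db-cluster-4"  # assumed catch-all for every other wiki


def cluster_for(wiki: str) -> str:
    """Return the database cluster that stores the given wiki's data."""
    return SHARD_MAP.get(wiki, DEFAULT_CLUSTER)


def group_by_cluster(wikis: Iterable[str]) -> Dict[str, List[str]]:
    """Show which wikis end up together on each cluster."""
    grouped: Dict[str, List[str]] = {}
    for wiki in wikis:
        grouped.setdefault(cluster_for(wiki), []).append(wiki)
    return grouped
```

Because a page view on one wiki never needs to join against another wiki's tables, each cluster only carries the load of the wikis mapped to it.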


They also do functional sharding and split the machines along functional boundaries for speedup.


The next popular technique they use to gain speed is replication. They have a master server for all writes and slave servers for most reads. The secret, they claim, to configuring the master and slave machines is to make sure the slave machines are faster than the master: a slave has to apply every write the master makes while also serving reads, so it must handle writes faster than the master in order to keep up.
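The read/write split can be sketched as a small router. This is an illustration under assumptions, not Wikipedia's code: the `ReplicatedDatabase` class, the server names and the `needs_latest` flag are hypothetical, and "replicas" here means the slave servers from the talk.

```python
import itertools
from typing import Iterator, List

READ_VERBS = ("SELECT", "SHOW")


class ReplicatedDatabase:
    """Routes writes to the master and most reads to the replicas (slaves)."""

    def __init__(self, master: str, replicas: List[str]) -> None:
        if not replicas:
            raise ValueError("at least one replica is required")
        self.master = master
        self._replicas: Iterator[str] = itertools.cycle(replicas)  # round robin

    def route(self, sql: str, needs_latest: bool = False) -> str:
        """Return the name of the server that should run this statement."""
        words = sql.split()
        verb = words[0].upper() if words else ""
        if verb in READ_VERBS and not needs_latest:
            return next(self._replicas)
        # Writes, and reads that cannot tolerate replication lag, use the master.
        return self.master


if __name__ == "__main__":
    db = ReplicatedDatabase("db-master", ["db-slave-1", "db-slave-2"])
    print(db.route("SELECT * FROM page WHERE page_id = 1"))
    print(db.route("UPDATE page SET page_touched = NOW()"))
```

The `needs_latest` flag reflects the "most reads" wording: a read that must see a just-committed write can still be sent to the master.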


As you can see, the beauty of the architecture is that it is SIMPLE and all Open Source and it rocks!


Comments:

Thx for this post!

It was very interesting.

Posted by Gergo Jonas on October 20, 2008 at 12:41 AM PDT #

Very cool. Thanks for the information.

Keith

Posted by Keith Murphy on October 20, 2008 at 04:40 AM PDT #

A really lovely presentation ;)

Posted by Zilvinas on October 20, 2008 at 05:59 PM PDT #

U can do better making images in e.g. PNG (not BMP) :) Luck!

Posted by Ameli on November 18, 2008 at 04:09 PM PST #

How about a link to the actual presentation? I don't see one here.

Posted by John on November 21, 2008 at 05:43 AM PST #

Nice Presentation. Can you Provide Links for More Details to every Tuning Phase?

Posted by Muhammad Nuruddin on November 25, 2008 at 03:17 PM PST #

nice presentation and it is very useful.

Posted by guest on January 10, 2010 at 12:53 PM PST #

But remember, any encyclopedia is never a good source regardless of accuracy. It's a fantastic starting point for real research, especially if you know nothing about the topic and need to know what books to get, etc.

Posted by Whole Life Insurance on August 26, 2010 at 08:35 PM PDT #

Wikipedia is run by 7 engineers? Wow.....that is staggering, considering the amount of traffic that they receive on a daily basis......

Posted by Term Life Quotes on September 15, 2010 at 01:38 PM PDT #

When Wiki came, encyclo has lost its place somehow. And yes 7 million has grown day after day.

Posted by Gold Eagle Coins on September 30, 2010 at 04:25 AM PDT #

I would really wait for the Replication technique just to gain some speed on it. I am getting excited now. Please stop me so I can breathe.

Posted by Fullerton Personal Injury Attorney on September 30, 2010 at 04:32 AM PDT #

I very much like the idea of the system of memory caching distribution as this would ensure best cashing. Speed is expected to boost...

Posted by the best hdtv on September 30, 2010 at 04:49 AM PDT #

I thought they would have a lot more than 7 engineers at Wikipedia, surely they need more than this???

Posted by Peter on February 15, 2011 at 06:30 PM PST #

About

alkagupta
