Monday Oct 20, 2008

Scaling Wikipedia with LAMP: 7 billion page views per month

I recently attended an interesting talk by Brion Vibber, CTO of the Wikimedia Foundation, the non-profit organisation that runs the infrastructure for Wikipedia. He described how his team of 7 engineers manages the Wikipedia site, which gets an average of 7 billion page views per month and ranks among the top 10 sites in terms of traffic. The highlights of the talk are listed below, including how the site's architecture scales to handle that traffic.

The site runs on the LAMP stack and you know what that is:

  • Linux
  • Apache
  • MySQL from Sun
  • Perl/PHP/Python/Pwhatever :-)

Wikimedia runs the site on about 400 x86 servers. Of those, about 250 run the web servers and the rest run the MySQL database. They recently acquired OpenSolaris Thumper machines from Sun, which they are exploring. The Sun Fire X4500, aka Thumper, is the world's first open source storage server, running OpenSolaris and ZFS. Currently they are using the Thumpers to store media files on the ZFS file system and they are simply loving it. They have also begun to use the DTrace feature of OpenSolaris and can't stop raving about it!

11/21/08 Update: Link to the recent Press Release WikiMedia selects Sun Microsystems to Enhance Multimedia Experience.

At its core, Wikipedia runs on a very simple system architecture, as shown below, and given that it is a non-profit organisation, almost all of the software is open source and FREE.

Simple is nice but it can be SLOW :-) To speed things up, the first step is to add caching at the front end as well as at the back end of the system. On the web front end, Wikipedia uses the Squid reverse proxy cache, and at the back end they use memcached, as shown below:

Squid is a proxy server and a web cache daemon. It has a wide variety of uses, from speeding up a web server by caching repeated requests, to caching web, DNS and other computer network lookups for a group of people sharing network resources. Squid is good for dynamic sites like wikis whose content is mostly static: the public face of a given wiki page does not change that often, so one can cache at the HTTP level. Wikipedia also uses Squid for geographical load balancing so that they can use cheaper, faster local bandwidth.
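
To make the HTTP-level caching idea concrete, here is a minimal sketch in Python. It is not Squid and not Wikipedia's configuration, just a tiny in-memory stand-in: `PageCache`, `fetch_from_backend`, the TTL value and the `purge` method are all names and choices I made up for illustration.

```python
import time


class PageCache:
    """Tiny in-memory stand-in for a reverse proxy cache (not Squid itself)."""

    def __init__(self, fetch_from_backend, ttl_seconds=300, clock=time.monotonic):
        # fetch_from_backend is a hypothetical callable: url -> page body
        self.fetch_from_backend = fetch_from_backend
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.entries = {}  # url -> (expiry_time, body)

    def get(self, url):
        now = self.clock()
        entry = self.entries.get(url)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the web server is not touched
        body = self.fetch_from_backend(url)  # miss or expired: ask the backend
        self.entries[url] = (now + self.ttl_seconds, body)
        return body

    def purge(self, url):
        # Assumed invalidation hook: drop a page once it has been edited
        self.entries.pop(url, None)
```

Because a wiki page is read far more often than it is edited, most calls to `get` would return straight from the cache and never reach Apache/PHP.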

Along with the Apache/PHP servers, Wikipedia also uses APC, the Alternative PHP Cache. PHP compiles scripts to bytecode and then throws the bytecode away after execution, so recompiling on every request adds a lot of unnecessary overhead. Hence it is recommended to always use an opcode cache with PHP. This drastically reduces the startup time for large apps.
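
The opcode cache idea can be shown by analogy in Python, which also compiles source to bytecode. This is only an illustration of the principle, not how APC works internally; `BytecodeCache` and its fields are hypothetical names.

```python
class BytecodeCache:
    """Keep compiled code objects around instead of recompiling every time."""

    def __init__(self):
        self.compiled = {}      # (script name, source text) -> code object
        self.compile_count = 0  # how many real compilations happened

    def run(self, name, source, variables):
        key = (name, source)
        code = self.compiled.get(key)
        if code is None:
            # Only pay the compilation cost the first time a script is seen
            code = compile(source, name, "exec")
            self.compile_count += 1
            self.compiled[key] = code
        exec(code, variables)  # execute the cached bytecode
        return variables
```

Running the same script twice through `run` compiles it once; the second call reuses the stored code object, which is the saving an opcode cache gives PHP on every request.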

Another speedup technique used by Wikipedia is memcached. memcached is a general-purpose distributed memory caching system often used to speed up dynamic database-driven websites by caching data and objects in memory to reduce the number of times the database must be read. memcached lets you share temporary data in memory across the network. Even though one needs to go over the network to get the data, the latency is still smaller than disk-based database access. Wikipedia usually stores the rendered pages in memcached.
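
Here is a rough sketch of how rendered pages might be looked up in a cache before falling back to the database. A plain Python dict stands in for a memcached client, and `RenderedPageStore`, `render_from_database` and the `rendered:` key prefix are assumptions for illustration, not Wikipedia's actual code.

```python
class RenderedPageStore:
    """Check the cache first; render from the database only on a miss."""

    def __init__(self, cache, render_from_database):
        self.cache = cache  # dict standing in for a memcached client
        # hypothetical callable: title -> rendered HTML (the slow path)
        self.render_from_database = render_from_database

    def get_page(self, title):
        key = "rendered:" + title
        html = self.cache.get(key)
        if html is None:
            html = self.render_from_database(title)  # slow: hits the database
            self.cache[key] = html                   # remember for next time
        return html
```

The second request for the same title is served from memory, so the database is read once per page rather than once per view.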

After adding all the caching you can, the next thing is to add CASH! :-) i.e. add more servers to gain scalability.

Those 250 or so webservers come with plenty of memory. The underutilized memory can be used for memcached and adds up to a big memcached store space.
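
Pooling spare memory across many machines means a client has to decide which server holds a given key. A minimal sketch of one common approach, hashing the key and taking it modulo the number of servers, is below. The function name and server names are made up, and I don't know which distribution scheme Wikipedia's clients actually use.

```python
import hashlib


def pick_server(key, servers):
    """Map a cache key to one server so every client agrees where it lives."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Use the first 4 bytes of the hash as an integer, then wrap around the list
    index = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[index]
```

With plain modulo hashing, adding or removing a server moves most keys to a different machine, which is why larger deployments often prefer consistent hashing.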

Further, to speed things up at the database level, Wikipedia uses simple sharding techniques. They split the data along logical partitions, such as subsites that don't interact closely.
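
Splitting along subsites can be as simple as a lookup table from wiki name to database cluster. The sketch below shows the idea; the wiki names, cluster names and `cluster_for` function are all hypothetical, not the real layout.

```python
# Hypothetical mapping: each subsite lives on its own database cluster
SHARD_MAP = {
    "enwiki": "db-cluster-1",
    "dewiki": "db-cluster-2",
    "commonswiki": "db-cluster-3",
}
DEFAULT_CLUSTER = "db-cluster-misc"  # assumed home for smaller wikis


def cluster_for(wiki):
    """Return the database cluster that holds all data for one subsite."""
    return SHARD_MAP.get(wiki, DEFAULT_CLUSTER)
```

Since subsites rarely need to join data with each other, a query for one wiki only ever touches one cluster.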

They also do functional sharding and split the machines along functional boundaries for speedup.

The next popular technique they use to gain speed is replication. They have a master server for all writes and slave servers for most reads. The secret, they claim, to configuring the master and slave machines is to make sure the slaves are faster than the master: a slave has to keep up with the master's writes, so it must be able to handle writes faster than the master does.
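
A read/write split like this might be sketched as follows. It is a simplification under my own assumptions: connections are represented by plain strings, every SELECT goes to a slave in round-robin order, and the class and method names are invented for the example.

```python
import itertools


class ReplicatedDatabase:
    """Send writes to the master and spread reads across the slaves."""

    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas)  # simple round-robin

    def connection_for(self, sql):
        words = sql.split()
        is_read = bool(words) and words[0].upper() == "SELECT"
        if is_read:
            return next(self._replicas)  # reads go to the next slave in turn
        return self.master               # everything else goes to the master
```

In practice only "most" reads can go to the slaves, as the talk noted; a read that must see a write made moments ago would presumably still have to go to the master.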

As you can see, the beauty of the architecture is that it is SIMPLE and all Open Source and it rocks!



