Boston PHP meetup notes

On Wednesday I went to the monthly meeting of the Boston-area PHP developers, which I hadn't attended in a long time. This meeting was especially interesting because the presenters, Josh King and Chuck Hagenbuch, were from Blue State Digital (BSD), a company based in Boston.

For those of you who are not in the USA or didn't follow the US presidential election in November, BSD is the company that powered the Obama campaign's website. Regardless of your political orientation, it was the first time in history that politics and the internet were so intertwined, and it proved to be a winning move for the Obama campaign.
So this talk was interesting both on the technical level and on the communication/sociological one.

These notes have been transcribed pretty much verbatim, so the sentences are a bit disconnected sometimes.

There were two parts to the presentation, one about the Neighbor to Neighbor application (N2N) which was used to keep track of the phone banks and the canvassing, and the other about the mass mailings that were sent to supporters, for fundraising, coordination of events, etc.

They use an entirely open source stack: CentOS 4 [ok I am biased here, so better not to comment :-)] and MySQL. They said they did not have a database administrator on staff.

The N2N application used geocoding to locate the person using the site, usually a volunteer, and then led him through some online training sessions to get him ready to go out and contact other people in the same area in person. Another algorithm determined the radius of that area based on population density. This is how the application started, but it was later modified to include phone banking, where geographical proximity did not play a role. The point was that these databases needed to be updated within milliseconds to reflect the changes and information recorded by the volunteers. There was also a high level of data synchronization between the campaign headquarters and the databases at the BSD locations, cross-checked against voter databases. The database held about 150 million people (with addresses, phone numbers, etc.).
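[The speakers did not describe the radius algorithm, so the following is only my own guess at the general idea, written in Python rather than PHP. It assumes a target number of people per volunteer and a known population density, picks the radius of a circle that would contain that many people, and filters geocoded records by great-circle distance. All function names and the record layout are made up for illustration.]

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def radius_km_for_target(target_people, density_per_km2):
    """Radius of a circle expected to hold target_people at this density.

    Assumption: people are spread evenly, so people = density * pi * r^2.
    """
    return math.sqrt(target_people / (math.pi * density_per_km2))


def contacts_near(volunteer, voters, target_people, density_per_km2):
    """Return the voter records that fall inside the density-based radius.

    volunteer is a (lat, lon) tuple; each voter is a dict with "lat"/"lon"
    keys (a hypothetical layout, not BSD's schema).
    """
    radius = radius_km_for_target(target_people, density_per_km2)
    return [
        v for v in voters
        if haversine_km(volunteer[0], volunteer[1], v["lat"], v["lon"]) <= radius
    ]
```

With this formula a dense city block gets a small radius and a rural area a large one, which matches what the speakers said about the radius depending on density. A real system would do the distance filter in the database rather than in a loop.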

Lessons learned:
- Files of 4GB can still pose issues (they claimed that they needed to switch from tar to zip). [I am actually wondering about this, it seems odd]

- cURL needed to be used to transfer the data back and forth

- "LOAD DATA LOCAL INFILE" is a slow operation in MySQL, so they loaded the data into a temporary table and then did an atomic rename to swap it into the database. [not sure I got this right]
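[To make the last point concrete, here is a small Python sketch of my own that builds the MySQL statements for a "load into a staging table, then swap" pattern. The statements themselves (CREATE TABLE ... LIKE, LOAD DATA LOCAL INFILE, RENAME TABLE with two renames, DROP TABLE) are standard MySQL; the table names, the file path and the helper function are invented, and I don't know whether BSD did it exactly this way.]

```python
def bulk_load_statements(table, csv_path):
    """Build the SQL for loading a file into a staging table and swapping it in.

    Hypothetical helper: names are interpolated directly, so it is only
    meant for trusted, hard-coded table names and paths.
    """
    staging = f"{table}_new"
    retired = f"{table}_old"
    return [
        # Empty copy of the live table, same columns and indexes.
        f"CREATE TABLE {staging} LIKE {table}",
        # The slow part happens here, away from the live table.
        f"LOAD DATA LOCAL INFILE '{csv_path}' INTO TABLE {staging}",
        # MySQL performs a multi-table RENAME as one atomic operation,
        # so readers see either the old table or the new one.
        f"RENAME TABLE {table} TO {retired}, {staging} TO {table}",
        f"DROP TABLE {retired}",
    ]


if __name__ == "__main__":
    for statement in bulk_load_statements("voters", "/tmp/voters.csv"):
        print(statement + ";")
```

Note that this version replaces the whole table with the contents of the file; if only part of the data changes, the staging table would first need a copy of the existing rows.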

The Mailer application:

To give an idea of the size of the mail campaign:
Blue State Digital started with the Obama campaign in February 2007.
In 2006 they handled other campaigns and had a volume of 76 million emails. For the Obama campaign, they were estimating about 590 million emails, with a list size of 5 to 6 million. Their target was to be able to send 1 million personalized emails per hour.

The actual numbers were:
13 million list size
7000 different mailings (different email content)
1.3 billion emails sent

They also raised $500 million online (also handled by BSD).

When a mail was sent to supporters this would happen:
1. cron job prepares DB tables
2. Daemon process does personalization and sends
3. postfix delivers

The daemons were written in PHP and used pcntl_fork to spawn workers. The daemons exited when any of the files necessary to build the emails changed. A watchdog process monitored everything, killing processes that took too long and restarting them.
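[The daemon and watchdog behaviour can be sketched roughly as follows. This is my own illustration in Python, using os.fork where the PHP daemons used pcntl_fork; the function names, the polling interval and the "finished"/"killed" return values are made up, and it only runs on Unix-like systems.]

```python
import os
import signal
import time


def snapshot_mtimes(paths):
    """Record the modification time of every file the daemon depends on."""
    return {path: os.stat(path).st_mtime_ns for path in paths}


def files_changed(snapshot):
    """True if any watched file has a different mtime than in the snapshot."""
    return any(os.stat(path).st_mtime_ns != mtime
               for path, mtime in snapshot.items())


def run_with_watchdog(job, timeout_s, poll_s=0.05):
    """Fork a child to run job(); kill it if it outlives timeout_s."""
    pid = os.fork()
    if pid == 0:
        # Child process: do the work, then exit without returning
        # into the parent's code.
        try:
            job()
            os._exit(0)
        except BaseException:
            os._exit(1)
    # Parent process: act as the watchdog.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        done, _status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return "finished"
        time.sleep(poll_s)
    os.kill(pid, signal.SIGKILL)  # took too long
    os.waitpid(pid, 0)            # reap the killed child
    return "killed"
```

A daemon loop built on this would call files_changed() between batches and exit when it returns True, leaving it to the watchdog (or cron) to start a fresh process that picks up the new files.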

Postfix: they kept the active queue on a RAM disk and deferred to a backup MTA when needed. They found that many providers (Comcast, AOL, Hotmail) were blacklisting them, so they had to negotiate directly with them, switch the servers from which email was sent (by changing postfix rules), and also have the Obama folks "make a few phone calls".

They had many different ways to segment the lists, depending on geography, how much money was donated, etc.

They used Cricket and Ganglia to monitor the deferred message rates, etc.
[I am not familiar with these packages; I assume they were talking about the monitoring tools of those names.]

Interesting tidbit: the campaign people knew how much money a particular email message was estimated to bring in, and if it didn't it was taken out of circulation.

Use of PHP:
personalizing the emails, sending in parallel threads
working in batches
multiple processes -> multiple servers
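[A minimal sketch of the batch-and-personalize idea, in Python instead of PHP. The template syntax ($first_name), the recipient fields and the function names are my own invention; the talk did not show the actual templates.]

```python
from string import Template


def batches(items, size):
    """Split a list of recipients into consecutive batches of at most size."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def personalize(template_text, recipient):
    """Fill $placeholders from the recipient dict; unknown ones are left as-is."""
    return Template(template_text).safe_substitute(recipient)


def render_batch(template_text, batch):
    """Return (address, body) pairs ready to hand to the mail server."""
    return [(r["email"], personalize(template_text, r)) for r in batch]


if __name__ == "__main__":
    people = [
        {"email": "ann@example.com", "first_name": "Ann"},
        {"email": "bob@example.com", "first_name": "Bob"},
        {"email": "cy@example.com", "first_name": "Cy"},
    ]
    for batch in batches(people, 2):
        for address, body in render_batch("Dear $first_name, ...", batch):
            print(address, "->", body)
```

Since each batch is independent, batches can be handed to separate worker processes, and then to separate servers, without the workers having to coordinate.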

They needed to store the recipients of each email. Email for the Obama campaign was always targeted, never sent to the whole list.
Inserting a record into a 1 billion row table is SLOW.
They used MERGE tables to avoid inserts into the big table entirely.
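[My reading of the MERGE table trick, as a Python sketch that only builds SQL strings: each mailing gets its own small MyISAM table of recipients, and a MERGE table presents all of them as one big table, so nothing is ever inserted into a billion-row table. The table and column names are hypothetical and the schema is reduced to two columns; ENGINE=MERGE, UNION=(...) and INSERT_METHOD are real MySQL options, but I don't know the details of BSD's setup.]

```python
COLUMNS = "mailing_id INT NOT NULL, person_id INT NOT NULL"


def per_mailing_table_sql(mailing_id):
    """Small MyISAM table holding the recipients of one mailing."""
    return f"CREATE TABLE recipients_{int(mailing_id)} ({COLUMNS}) ENGINE=MyISAM"


def _members(mailing_ids):
    return ", ".join(f"recipients_{int(m)}" for m in mailing_ids)


def merge_table_sql(mailing_ids):
    """MERGE table that reads across all per-mailing tables.

    The column definitions must be identical to the underlying tables.
    """
    return (f"CREATE TABLE recipients_all ({COLUMNS}) "
            f"ENGINE=MERGE UNION=({_members(mailing_ids)}) INSERT_METHOD=NO")


def attach_sql(mailing_ids):
    """Redefine the union after a new per-mailing table has been filled."""
    return f"ALTER TABLE recipients_all UNION=({_members(mailing_ids)})"
```

The per-mailing table can be filled offline; making it visible in recipients_all is then a change to the union definition rather than a row-by-row insert.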

Replication was a problem: it is single-threaded, so it is very slow on large operations.
They also had locking problems, so they used InnoDB.
Some tables were optimized for inserts and others for selects.

When they had 1 million recipients, it would take 1 hour to do one insert.

Somebody asked if they would switch to a different database such as Postgres: the answer was NO,
because 1. they are used to MySQL features, 2. it would just bring up a different set of problems/slowdowns, 3. the performance profile would be unknown ahead of time, unless you can replay the MySQL traffic against a different DB.

Somebody asked if they considered using the cloud: EC2 was looked at, but they cannot send emails from there, so no point.

During the Democratic National Convention, when Obama gave the acceptance speech, they raised $2 million per hour.

They had to buy new hardware to handle all this. The campaign donated their own DB server machines.

They looked at third-party engines for MySQL: the InnoDB plugin, MySQL Cluster, eXtremeDB [??].

Replication + failover: not used. They only used a SAN + snapshotting.
It was all behind firewalls with only outgoing SMTP open.
They didn't get serious hacking attempts.

Peak rate of messages: 6 million/hour average
Peak rate for messages sent to > 1 million recipients: 4.2 million/hour
Peak rate for messages sent to > 5 million recipients: 2.2 million/hour
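[To put these rates in perspective, a back-of-the-envelope calculation of my own, using only figures from these notes: at the 2.2 million/hour peak, a mailing to the full 13 million list would take about 5.9 hours, and the 1.3 billion emails sent over the whole campaign add up to 1300 hours of sending at the 1 million/hour target. The helper name is made up.]

```python
def hours_to_send(recipients, rate_per_hour):
    """Hours needed to push `recipients` messages at a steady hourly rate."""
    return recipients / rate_per_hour


if __name__ == "__main__":
    # Full list (13 million) at the peak rate for very large mailings.
    print(round(hours_to_send(13_000_000, 2_200_000), 1), "hours")
    # Whole campaign volume at the original 1 million/hour target.
    print(hours_to_send(1_300_000_000, 1_000_000), "hours")
```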

Bottom line, they said that the most important part of all this is the strategy.
Note, the campaign provided all the content, they just provided the services.

