November 17, 2009

Some Comments on the LF End User Summit

On November 9 and 10 I attended the Linux Foundation End User Summit in Jersey City. Ted Tso and I organized a mini session about tracing in Linux. We had Steve Rostedt talk about ftrace, and Subendhu Ghosh from Red Hat talk about the perf events subsystem. The session was on Tuesday afternoon, and I'd say that aside from Christoph Hellwig's talk about KVM it was the most crowded session. It was nice that both Jon Corbet, in his Kernel Report talk that morning, and James Bottomley, in his panel the day before, mentioned tracing and our session as something "hot" to follow.

The ftrace talk consisted of an introduction to ftrace and a few demos. The audience was mostly Wall Street type people, who are still using old versions of enterprise products, so they don't really know what has happened upstream in the meantime. Ftrace has been around for a while, a couple of years, but perf events is newer, and many people didn't know what it was. There was a little demo of that as well. While talking about perf events, somebody (hch, IIRC) mentioned the new tool that Arjan built on top of perf events, called "timechart". It's a tool that spits out a graphical view of the time your system is spending doing various things: http://blog.fenrus.org/?p=5
Perf events is in Fedora 12 and so is ftrace. The timechart thing is only in mainline now.
Slides from Steve's talk are on Steve's page: http://people.redhat.com/srostedt

Besides this tracing session there was a talk by Sergio Leunissen, from my own Oracle Linux team, on OLT (Oracle Linux Testkit) and Validated Configurations, and how they are helping us stabilize OEL for enterprise loads. The talk went well, and the audience asked lots of questions about OLT and VC. They were also happy to see the oracle-validated RPM that we ship.

The other talk I attended on the second day was the filesystem recap by Ric Wheeler. Nothing new here, just a summary of all the things that have happened to the various filesystems in Linux during the last year or so. It was nice that Ric mentioned Oracle and the FS folks by name (thanks Ric!). So Chris, Martin, Jens, and Chuck got honorable mentions. In general the amount of work that Oracle is doing in this space is still a bit unknown, since several people in the audience expressed surprise to see Oracle so involved in Linux.

On the first day there were no deep technical talks or separate sessions. Of the plenary sessions, an interesting one was the one by IDC about the Linux market share. While Windows keeps dominating the server side, Linux is increasing, and apparently Unix is not going away any time soon. The investment in Unix seems to still be fairly solid.

I also attended a session discussing how kernel interfaces should keep up with the times. In particular, SSDs and SSSs, which are going to be mainstream soon, were mentioned as examples. There were some complaints about epoll. Also mentioned were the interfaces for sending messages to multiple recipients, with suggestions that the calls there should be vectorized to speed up processing of messages. It's interesting that a very similar debate was going on almost simultaneously on the linux-raid mailing list, here.
Jens Axboe, also in the Oracle Linux team, has done some work on per-bdi writeback threads; see his slides from the Linux Plumbers Conference here.

Another good overview came from Brian Clark of the NYSE, who talked about what they focus on in their IT infrastructure.

I think this is about it.

September 9, 2009

OEL 5.4 Is Available

OEL 5.4 is available on ULN as of today, September 9 2009. Both i386 and x86_64 platforms are ready for download. There are 4 new channels on ULN, el5_u4_i386_base, el5_u4_i386_patch, el5_u4_x86_64_base, el5_u4_x86_64_patch.
The IA64 version will be available shortly, on two new channels: el5_u4_ia64_base, and el5_u4_ia64_patch.
ISOs will soon be available as usual on eDelivery.
RPMs will soon be available on the public Yum repository as well.

July 28, 2009

Linux Symposium in Montreal

Two weeks ago I attended the Linux Symposium in Montreal.
I was there mostly to participate in the tracing track, and to discuss the current state of kernel tracing infrastructure and the other projects, determining what has been accomplished since the meetings we had at the Collaboration Summit in April. Some of the key players were not at the conference however, and the discussion was a bit limited. Some of the goals that were set at the Collab Summit are still lagging a bit. On the utrace side, Frank is working on the gdb stub to talk to utrace. This is built on top of kernels with utrace integrated (RHEL and Fedora) and uses the gdb remote protocol (in usual gdb fashion) to communicate with the target process being debugged under utrace control. His latest prototype is here:
http://sourceware.org/ml/systemtap/2009-q3/msg00045.html

Ptrace cleanups: Roland and Oleg are now official ptrace kernel maintainers (since April 2009). Oleg Nesterov is making progress on this, but I am not seeing any recent posts.
I see that there was a wave of ptrace cleanup patches at the end of May from Oleg (http://lkml.org/lkml/2009/5/30/229), but I cannot see that they went in. As of last week, there were no comments on them at all. It seems we are stuck in the architecture specific cleanups for ptrace (see this thread: http://lkml.org/lkml/2009/5/3/109).

Utrace: I've seen no major changes, but just this week, Oleg and Roland started looking at the utrace implementation a bit more closely, as you can see from the utrace-devel mailing list. Of course it's all depending on the ptrace cleanups getting into the kernel.

ftrace and markers: Steve Rostedt wasn't there, and Mathieu has been busy with his dissertation. Frederic Weisbecker gave a presentation, which was an introduction to ftrace.

Systemtap: Frank gave a talk on systemtap, an overall introduction and overview.

There was a panel where people talked about their projects, not much talk unfortunately around how to make progress in the overall area. It was mostly a showcase of each project.

Many people from the Montreal area talked about Eclipse. These folks work for Ericsson which uses Eclipse a lot.

The slides from the tracing mini-summits are here:
http://ltt.polymtl.ca/tracingwiki/index.php/TracingMiniSummit2009

The first day I attended a tutorial given by IBM, about Performance Inspector.
It's a decent performance tool, sharing many things with Oprofile, plus a gui. It turns out the project is an old internal IBM project, that got open sourced much later. One of the people working on it was the author of Oprofile.

I attended Martin's and Dan's talks as well. I think they were very good ones. Honestly, they had some real meat to them, and some new ideas. They were also engaging speakers, and didn't put the audience to sleep. I also attended Ric Wheeler's talk, which was interesting and a nice high level summary.
Dan's Transcendent Memory slides are here:
http://oss.oracle.com/projects/tmem/documentation/presentations/
and his work is here:
http://oss.oracle.com/projects/tmem
There is a good article on Tmem here: http://lwn.net/Articles/340080/


Jon Corbet did his usual kernel report, and mentioned Oracle a lot. That was nice to see.

A talk about autotest from Google was OK. They started from Martin Bligh's system and expanded on that. The test driver mechanism is getting more sophisticated and operates on many different machines and layers of servers.


June 19, 2009

New GCC and Libstdc++ blog

Paolo Carlini, in our team, has started a new blog on GCC, and libstdc++ in particular. Check out his postings:
http://blogs.oracle.com/pcarlini

June 16, 2009

OVM 2.1.5 has been released

OVM 2.1.5 was released on ULN on June 10 2009. The updated documentation and the release notes are here:
http://download.oracle.com/docs/cd/E11081_01/welcome.html
The ISOs are available on eDelivery.
There is a new datasheet available here:
http://www.oracle.com/technologies/virtualization/docs/ovm-ds.pdf

May 27, 2009

Linux Foundation End User Summit

The next End User Summit event has been scheduled for November 9 and 10 2009 in Jersey City. Take a look at the website. More details to appear there soon.

http://events.linuxfoundation.org/events/end-user-summit

LinuxCon in Portland, OR

The first LinuxCon conference will take place September 21 to 23 in Portland, Oregon, co-located with the Linux Plumbers Conference. The speaker list and the schedule were announced last week. It will cover a variety of topics, with an excellent lineup of panels and keynotes.
Check the schedule here:
http://events.linuxfoundation.org/events/linuxcon/speakers


May 26, 2009

OEL 4.8 released

OEL 4.8 was released on ULN and on the Oracle public yum repository on May 26, 2009.
There are 6 new channels on ULN, el4_u8_i386_base, el4_u8_i386_patch, el4_u8_x86_64_base, el4_u8_x86_64_patch, el4_u8_ia64_base, el4_u8_ia64_patch.
ISOs will soon be available as usual on eDelivery.
Updated documentation:
ULN Whitepaper
Unbreakable Linux Whitepaper
Unbreakable Linux FAQs
Unbreakable Linux Data Sheet
OEL4 Certification Guidelines
Release Notes
Source RPMs


May 6, 2009

Notes about tracing from the Collaboration Summit

A bit late, but nevertheless...

These are some notes I took at the Collab Summit in San Francisco. I already posted these notes on the systemtap mailing list, but I thought they were worthy of a blog entry.

Some of the slides, not all of them, are here: http://events.linuxfoundation.org/slides

Our tracing session went well. I think it was most useful for the developers themselves; it was a bit technical.

The first day of the summit was all dedicated to panels and presentations to the whole audience. It was not broken down into sessions. One of the presentations was by Edward Screven about Oracle and Linux. I thought it went well. It really showed how we use Linux internally and it explained why we got into the Linux business. He also talked about upstream contributions.

Another highlight of the first day was a panel with Jim Zemlin (the LF director), Ian Murdock of Sun, and Sam Ramji from Microsoft. It was entertaining to watch how the Microsoft person kept trying to not say anything that could be quoted by journalists (he said that himself). Quite an achievement, actually.

Detailed blogs have been posted by Gerrit Huizenga of IBM: http://gh-linux.blogspot.com/

For me, the important stuff that happened at the summit was the meetings we had with the tracing people and some kernel folks. They started on Thursday, after the presentations were over, and we reached some consensus on the tracing infrastructure that is going into the kernel (markers, utrace, uprobes). The discussions continued on Friday and we talked more about higher level things, like the debuginfo size problem and the systemtap integration with the kernel. Detailed notes below.

Overall it was a good conference, or at least, let's say that I was very happy with the conclusions and the agreement we reached about tracing. Of course not everything is decided, but it was a good start. It seems that the tracing problem is really getting attention from the kernel community. There is going to be a tracing mini-summit at the Linux Symposium in Montreal in July where we all will reconvene and discuss the next steps. It looks like we can get agreement on the short term, and more low level items, but there is still quite a bit of uncertainty on the higher level issues. Hopefully by having periodic meetings, the dialog will be ongoing, and progress can be made.

Steve has already posted a cleanup of his markers (one of the items below). This is great, because even systemtap can use these new trace events in the kernel: http://lkml.org/lkml/2009/4/14/414 Also keep an eye on the utrace-devel mailing list, where the ptrace cleanups are being discussed (they are now also being discussed on lkml).


These are my and Christoph's notes from the discussions we had at the Collaboration Summit in SFO, in April. I've added some clarifications, comments and updates from Roland and Frank as well. Keep in mind that the situation is not static. Every day there are new discussions on these topics on the various mailing lists, so this is a "snapshot" as of about the end of April.


Attendees: Renzo Davoli, Mathieu Desnoyers, Jake Edge, Frank Ch. Eigler, Christoph Hellwig, Masami Hiramatsu, Jim Keniston, Roland McGrath, Steven Rostedt, Josh Stone, Elena Zannoni. Plus James Bottomley (day 2), Keiichiro Tokunaga (from Fujitsu)

Thursday afternoon:

Kernel Tracing items:

- make DEFINE_TRACE work in modules (Steve)
- investigate markers removal (Christoph, Mathieu)
- the 25 magic google tracepoints (Mathieu)
- make the two major tracepoint implementations interchangeable (Mathieu, Steve) (working on a common ring buffer API)

- get djprobes and the instruction decoder upstream (Masami)

Utrace and userspace probing:

- get arm and mips converted to regsets and uprobes, set a cut off date for others (Christoph, Roland)

*** Update 4/24/09: arm work finished by Roland. (posted to lkml?) Christoph pinged the mips maintainer, no reply seen.


- more ptrace cleanups to prepare for utrace (Oleg) [In progress]
- in-kernel gdb server for debugging userspace (Frank)
- get uprobes upstream piecemeal, including backing the gdbserver (Jim)


From Friday morning:

- Include Roland in email about kbuild patch for separate debuginfo sent by Wenji Huang (Elena)

- elfutils debuginfo reduction (duplicate elimination): Roland working on it, goal is Fedora 12 timeframe.

- Look up old patch from crashdump people about pulling type info into a common .h file for debuginfo purposes. (SLES9 or SLES10) (Christoph?)

- Talk to Arnaldo about his CTF work. Red Hat folks to have a meeting with him.

*** Update 4/24/09 (Roland): Acme working on a kbuild scheme using his tools, wherein the kernel would compile with -g but have DWARF replaced with CTF in each .o during compilation. This intends to avoid the "slow compile" problem cited by some kernel hackers, which is believed to come from a slow link phase and disk I/O. (We'll see how quick it is.) That would yield CTF for the kernel, which has at least type layout information and some minimal symbolic info about functions. That is intended to suffice for function entry probes with parameters, but it does not handle the parameter ABI problem. (As far as we can tell, Sun only deals with one presumed function call ABI.)

- James had a comment about how difficult it is for stap/kretprobes to probe at the end of a function. Seems that epilogue or last instruction of a function (upon return) is hard to probe. He is interested not just in the return value, but in the value of the locals before exit. Gcc-4.4 (current fedora and future rhel6 compiler) should produce correct info.

Systemtap PR: 10056.

What James wants is *new* compiler information to identify a site at the beginning of the epilogue as a good probe site. He believes that at such a place, the locals (now technically dead) will in fact still be accessible in stack slots not clobbered before the epilogue. If that's in fact true for a particular site of interest, then the idea presupposes that GCC's DWARF location information for that PC is accurate about whatever happens to be accessible still. AFAIK we don't especially suspect any problems with that location information.

The DWARF format (line info) includes an epilogue_begin marker feature. This is what would be used for finding the probe site James wants. There was past talk (and maybe even some unfinished work) to make GCC emit those, but GCC 4.4 does NOT emit those markers.


- How good is stap at working on code that has no associated debuginfo? For markers, Josh has added a patch that can generate debuginfo on the fly. To be clear, it generates sufficient information for type layouts. That is enough to do interesting things with tracepoint parameters. (This is the more useful half of what we could get from CTF.) systemtap 0.9.6 will have two new features for this.


- Revisit which (if any) parts of systemtap code tree could be pushed into kernel tree.

- Future directions: there was agreement that doing userspace debugging via .ko's is not the best solution in the long term, but currently is the only way there is to do userspace probing.

- Complaints about stap shutting itself down (from Mathieu) when doing irq monitoring.

The germ of a good idea to take away from this (beyond documenting -DINTERRUPTIBLE and some other tweaks already done) is that we may want to teach systemtap to be a little more tolerant of reentrancy. For probe handlers that simply trace (without local vars, without many temporaries), with a reentrant-tolerant printf(), we can probably do OK -- without losing the events Mathieu was worried about, and without -DINTERRUPTIBLE=0. Future work, no PR yet.

- Idea was floated around to define a stable api into the kernel for which stap is guaranteed not to break from one kernel version to the next. Mathieu proposed to keep a part of the stap self testing module into the kernel so that kernel developers can test themselves if they break something. Idea not practical, since it puts the burden on kernel community to assure that out of tree code keeps working. Problem is that stap doesn't use a predefined set of probe points. This is the whole idea behind the current stap design, that is, to get away from the static approach of dtrace, and to let you probe everything because of the use of the debuginfo. I.e. "API" is potentially the whole kernel code. However, the translator-generated and runtime-boilerplate code use a relatively small number of exported functions as the API. This is one reason why the runtime etc. is in fact remarkably stable from kernel version to kernel version. Maybe 1-3 changes per kernel release are needed. The tapsets are a different matter. Reorienting them with a preference for tracepoints will improve them too.

WRT keeping some stap tests in the kernel tree, one problem with this idea is that while the kernel developer can then instantly fix this hypothetical self-test module, and potentially make the self-test compile ("closing the feedback loop"), there is a hole in the feedback to the systemtap developers. Unless this "self-test module" is in fact complete, we still need to carry the fuller runtime/etc. code in systemtap, so we'd need to receive and redistribute those patches.

- Mathieu mentioned gdb tracepoints: continuous debugging. Similar to systemtap idea. Collect info at breakpoints and continue the program. Code Sourcery working on that.

April 3, 2009

Boston PHP meetup notes

Wednesday I went to the monthly meeting of the PHP developers in the Boston area. I hadn't attended in a long time. This was a very interesting meeting because the presenters were from the BlueStateDigital company (BSD), based in Boston. The speakers were Josh King and Chuck Hagenbuch.

For those of you that are not in the USA or didn't follow the USA presidential elections in November, BSD is the company that powered the mybarackobama.com website. Independently of your political orientation, it was the first time in history that politics and the internet were so intertwined. It proved to be a winning move for the Obama campaign.
So this talk was interesting both on the technical level and on the communication/sociological one.

These notes have been transcribed pretty much verbatim, so the sentences are a bit disconnected sometimes.

There were two parts to the presentation, one about the Neighbor to Neighbor application (N2N) which was used to keep track of the phone banks and the canvassing, and the other about the mass mailings that were sent to supporters, for fundraising, coordination of events, etc.

They use an entirely open source stack, CentOS 4 [ok I am biased here, so better not to comment :-)] and MySQL. They said they did not have a DB admin on staff.

The N2N application used geocoding to identify the person, usually a volunteer using the site, and then led him through some on-line training sessions, in order to ready him to go out and contact other people in the same area in person. Some other algorithm was used to determine the radius of the area, based on population density. This is how the application was started, but later it was modified to include phone banking, where the geographical proximity did not play a role. The point was that these databases needed to be updated in milliseconds to reflect the changes and information recorded by the volunteers. There was also a high level of synchronization of data between the campaign headquarters and the DB at the BSD locations, cross checked with voter databases. The number of people in the database (with addresses, phone numbers, etc.) was about 150 million.

Lessons learned:
- 4GB files can still pose issues, (they claimed that they needed to switch from tar to zip). [I am actually wondering about this, it seems odd]

- cURL needed to be used to transfer the data back and forth

- "load data local infile" is a slow operation in mySQL, so they loaded the data in a temporary table and then did an atomic rename into the database. [not sure I got this right]
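
[A minimal sketch of the staging-table idea in the last item, since I was not sure I got it right. It uses Python's built-in sqlite3 module only so that it runs anywhere; the table names people and people_staging are made up for illustration and are not from the talk. In MySQL the swap step would presumably be a single statement of the form RENAME TABLE people TO people_old, people_staging TO people, which MySQL performs atomically.]

```python
import sqlite3


def load_and_swap(conn, rows):
    """Bulk-load rows into a staging table, then swap it in for the live one."""
    # Slow part: fill a table that nobody is reading yet.
    conn.execute("DROP TABLE IF EXISTS people_staging")
    conn.execute("CREATE TABLE people_staging (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("BEGIN")
    conn.executemany("INSERT INTO people_staging (id, name) VALUES (?, ?)", rows)
    conn.execute("COMMIT")

    # Fast part: the renames happen in one transaction, so readers see
    # either the old table or the new one, never a half-loaded table.
    conn.execute("BEGIN")
    conn.execute("ALTER TABLE people RENAME TO people_old")
    conn.execute("ALTER TABLE people_staging RENAME TO people")
    conn.execute("DROP TABLE people_old")
    conn.execute("COMMIT")


# isolation_level=None lets us issue BEGIN/COMMIT explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO people VALUES (1, 'old row')")

load_and_swap(conn, [(1, "Alice"), (2, "Bob")])
names = [row[0] for row in conn.execute("SELECT name FROM people ORDER BY id")]
```

[The point of the design is that the expensive load never touches the table the application is querying; only the cheap rename does.]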


The Mailer application:

to give an idea of the size of the mail campaign:
Blue State Digital started with the Obama campaign in February 2007.
In 2006 they handled other campaigns and they had a volume of 76 million emails. For the Obama campaign, they were estimating about 590 million emails, with a list size of 5M to 6M. Their target was to be able to send 1 million personalized emails per hour.

The actual numbers instead were:
13M list size
7000 different mailings (different email content)
1.3 billion emails sent
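
[To get a feel for the scale, here is some back-of-the-envelope arithmetic on the numbers above. It is my own calculation, not something shown in the talk, and the averages hide a lot of variation, since individual mailings were targeted at very different list segments.]

```python
emails_sent = 1_300_000_000   # total emails sent
mailings = 7_000              # distinct mailings
target_per_hour = 1_000_000   # the original throughput target

# Average number of recipients per mailing.
avg_per_mailing = emails_sent / mailings

# How long the whole volume would take if sent back to back at the target rate.
hours_at_target = emails_sent / target_per_hour
days_at_target = hours_at_target / 24

print(f"average mailing size: {avg_per_mailing:,.0f} recipients")
print(f"time at target rate: {hours_at_target:,.0f} hours (~{days_at_target:.0f} days)")
```

[So the average mailing went to roughly 186 thousand recipients, and the total volume corresponds to about 1300 hours of sending at the target rate.]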

They also raised $500 Million on line (also handled by BSD).

When a mail was sent to supporters this would happen:
1. cron job prepares DB tables
2. Daemon process does personalization and sends
3. postfix delivers

Daemons were written in PHP, did pcntl_fork. The daemons exited when any of the files (necessary to build the emails) changed. A watchdog process monitored everything (killing processes that took too long and restarting them).
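
[A rough sketch of the "exit when the files change" behavior, written in Python rather than the PHP they used. The function names and the use of file modification times are my assumptions; the talk did not say how the change detection was actually done. The idea is that a worker stops as soon as one of its input files changes, and relies on something like the watchdog to start a fresh worker that picks up the new content.]

```python
import os
import tempfile


def snapshot_mtimes(paths):
    """Record the current modification time of each watched file."""
    return {path: os.stat(path).st_mtime_ns for path in paths}


def any_file_changed(snapshot):
    """Return True if any watched file has a different mtime than recorded."""
    return any(os.stat(path).st_mtime_ns != mtime
               for path, mtime in snapshot.items())


def run_worker(watched_paths, batches, send):
    """Send batches until they run out or a watched file changes."""
    snapshot = snapshot_mtimes(watched_paths)
    sent = 0
    for batch in batches:
        if any_file_changed(snapshot):
            break  # stop here; a supervisor is expected to restart the worker
        send(batch)
        sent += 1
    return sent


# Small demo: the template is modified right after the first batch goes out.
with tempfile.TemporaryDirectory() as tmp:
    template = os.path.join(tmp, "template.txt")
    with open(template, "w") as f:
        f.write("Hello $name")

    delivered = []

    def send(batch):
        delivered.append(batch)
        st = os.stat(template)
        # Push the mtime forward by 10 seconds to simulate an edit.
        os.utime(template, ns=(st.st_atime_ns, st.st_mtime_ns + 10**10))

    sent = run_worker([template], [["a"], ["b"], ["c"]], send)
```

[In the demo only the first batch is delivered; the worker notices the changed template before the second batch and stops.]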

Postfix: they kept the active queue on a RAM disk and deferred to a backup MTA when needed. They found that many providers (Comcast, AOL, Hotmail) were blacklisting them, so they had to negotiate directly with them, switch the servers from which email was sent (by changing postfix rules), and also have the Obama folks "make a few phone calls".

They had many different ways to segment the lists depending on geos, how much money was donated, etc.

Used cricket and ganglia to monitor the deferred message rates, etc.
[I am not familiar with these packages, I assume they were talking about these: http://cricket.sourceforge.net/ and http://ganglia.info/]

Interesting tidbit: the campaign people knew how much money a particular email message was estimated to bring in, and if it didn't it was taken out of circulation.

Use of PHP:
personalizing the emails, send in parallel threads
work in batches
multiple processes -> multiple servers
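
[The batching and personalization steps can be sketched roughly as follows. This is in Python, not their PHP, and the template text, field names and batch size are invented for illustration. Each batch is independent of the others, which is what makes it possible to hand batches to multiple processes and then to multiple servers.]

```python
from string import Template


def in_batches(recipients, size):
    """Yield consecutive slices of the recipient list, each at most `size` long."""
    for start in range(0, len(recipients), size):
        yield recipients[start:start + size]


def personalize(template, recipient):
    """Fill the $placeholders in the template from one recipient record."""
    return Template(template).substitute(recipient)


template = "Dear $name, thank you for volunteering in $city."
recipients = [
    {"name": "Ann", "city": "Boston"},
    {"name": "Raj", "city": "Denver"},
    {"name": "Lee", "city": "Austin"},
]

batches = list(in_batches(recipients, 2))
messages = [[personalize(template, r) for r in batch] for batch in batches]
```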

mySQL:
Need to store recipients of each email. email for the B.O. campaign was always targeted, never sent to the whole list.
inserting a record into a 1 Billion row table is SLOW.
Use merge tables to avoid inserts entirely

Replication problem: the operation is single threaded, so on large operations it is very slow.
They also had locking problems, used InnoDB
some tables were optimized for inserts and selects

When they had 1 million recipients, it would take 1 hour to do one insert.

Somebody asked if they would switch to a different database such as Postgres: answer was NO.
because 1. they are used to MySQL features, 2. it would bring up just a different set of problems/slowdowns, 3. performance profile would be unknown ahead of time, unless you can re-apply mySQL traffic against a different DB.

Somebody asked if they considered using the cloud: EC2 was looked at, but they cannot send emails from there, so no point.

During the Democratic National Convention, when Obama gave the acceptance speech, they raised $2Million per hour.

They had to buy new hardware to handle all this. The campaign donated their own DB server machines.

they looked at 3rd party engines for mySQL: innoDB plugin, mySQL cluster, eXtremeDB [??].

Replication + failover: not used. only used SAN + snapshotting
It was all behind firewalls with only outgoing SMTP open.
They didn't get serious hacking attempts.

Peak rate of messages: 6 million/hour average
peak rate for messages sent to > 1M recipients: 4.2 M/hr
peak rate for messages sent to > 5M recipients: 2.2 M/hr

Bottom line, they said that the most important part of all this is the strategy.
Note, the campaign provided all the content, they just provided the services.