Monday Dec 08, 2008

What's going on in MySQL land?

Enough of one-sided stories. Let's see a different angle of MySQL 5.1.

First, let me thank my colleague Chris Powers for taking a stand in defense of the management. But saying "everyone does so" is not a good explanation. The truth is much more complex and requires some narrative.

MySQL 5.1 didn't start on the right foot. The effort to produce its features was underestimated, mostly because, at the time when it was designed, the company was still unearthing the architectural bugs that were haunting MySQL 5.0.

MySQL 5.0 was GA in October 2005. One month later, MySQL 5.1 started its alpha stage, while a rain of bugs fell on the freshly released server. When the version was hastily declared beta 6 months later, the implication of the architectural problems weren't even found yet. That's why the beta stage of this version had a long and troubled course.

In September 2007, when 5.1 was declared RC, there were really too many open bugs and instability problems. The decision of going RC had little engineering justification. A fierce internal debate followed, with a prompt decision to review the release criteria, which kept many people busy for several weeks. The desire of shipping 5.1 GA before the Users Conference 2008 was rightfully dwarfed by the discovery of new, more disturbing bugs. Two more RCs were released, while the developers fought to fix a staggering number of bugs.

More than 3500 bugs affected MySQL 5.1, and by June we had fixed 2300 of them. There were still some outstanding critical bugs, and Marketing and Sales were pressing for a release. It was understandable. The economic situation of Sun was not good, the company had just cut 2500 jobs, and we needed the new release to boost sales. However, the outstanding bugs were so bad that the people who were in direct contact with users (Support and Community Team) strongly objected to a GA declaration at that point in time. In a joint effort, we identified 40 critical bugs that needed fixing before going GA. The position of the Support team was clear. They were not going to support row-based replication and partitioning, unless these bugs were fixed. The offending bugs were examined by a committee chaired by Sun Software CTO, and 35 bugs were recommended for immediate fixing.

You can see the result in the changelog. MySQL 5.1.27, which was the intended GA, was not released. Version 5.1.28 fixed 79 bugs, and version 5.1.29 fixed 58 more, and 8 fixes went into 5.1.30. The critical bugs identified by Support as the basis for supporting the new release were all fixed. Of the usability bugs identified by the Community team, only two were deferred to MySQL 5.1.31, and will be available in the next binary release.

The above explanation makes little sense if we don't explain the criteria for triaging bugs. Until MySQL 4.1, the release criteria for GA was "zero known bugs". That rule was artificially ignored in 5.0, with the results that we all know. But in 5.1, there was no way of shipping with such rule. MySQL has become a much more complex product than it was in 4.1, and a flaw doesn't necessarily affect all the users. It is likely the contrary, in fact, that a defect identified in our QA laboratory only affects a small portion of our user base, or none at all, like it happened to a bug in prepared statements, found by a Support engineer, 4 years the feature was released, and so far claimed by no one (The bug was fixed, BTW).

Back to the triage criteria. There were 1000 open bugs when we went RC. The only way of getting rid of them was to prioritize the bugs. Classifying the bugs based on one single parameter, as we did before was not an option. We needed to be more specific. So all the open bugs were reviewed, and classified according to defect class, impact, workaround, risk to fix. Not a perfect system, I grant you, but nonetheless it allowed us to start fixing the most serious bugs with the greatest impact first. If a bug is a crash, but it only happens after a convoluted series of actions that our QA engineers have devised, and has little chance of affecting real users, that bug gets fixed after a similar one that occurs in a common sequence of events. If a bug affects two customers, then its impact is automatically raised, and thus fixed with priority.

The other important point is the risk factor. If a bug fix involves a substantial change of existing feature, it is deferred to the next version. When we went GA, we still had some open bugs, but none that prevented the normal usage of the server. The Community and Support teams are in contact with customers and other users who have been using MySQL 5.1 in production for years, and we made sure that the bugs reported by these brave souls were addressed before the GA release.

You see, there was a lot of work involved in our GA, and the engineering department, together with Support and Community teams, did an outstanding job of bringing this release to the public. Are the processes perfect? No. But they have improved significantly over the ones that led to MySQL 5.0 GA.

And speaking of that, we know that our engineering process leaves a lot of room for improvement.

Despite our claim to be the most popular open source database in the world, our development practices are very much closed source, and our release cycle is definitely in need of a revision. The obstacles towards opening the development model are quite a few, and we have been working hard to meet this goal. The first change was to get rid of the proprietary revision control system that was a real obstacle towards participation. Next, we published the Worklogs. Then we removed the MySQL Contributor License Agreement, which was another serious impediment. Next comes the release cycle and development model. There are many people at work on this issue, but this is work in progress and we don't want to raise your expectations on a quick resolution for these items. The important point is that everyone agrees on the need for change, and we are working toward this goal. It's a team effort, which will eventually bear fruits.

In the meantime, please honor the effort of the developers who created new features like Partitioning, Row-based replication and the Event Scheduler and have squashed thousands of bugs, and give it a try.


Giuseppe Maxia


« April 2014