What's going on in MySQL land?

Enough of one-sided stories. Let's see a different angle of MySQL 5.1.

First, let me thank my colleague Chris Powers for taking a stand in defense of the management. But saying "everyone does so" is not a good explanation. The truth is much more complex and requires some narrative.

MySQL 5.1 didn't start on the right foot. The effort to produce its features was underestimated, mostly because, at the time when it was designed, the company was still unearthing the architectural bugs that were haunting MySQL 5.0.

MySQL 5.0 went GA in October 2005. One month later, MySQL 5.1 started its alpha stage, while a rain of bugs fell on the freshly released server. When the version was hastily declared beta 6 months later, the implications of the architectural problems hadn't even been discovered yet. That's why the beta stage of this version had a long and troubled course.

In September 2007, when 5.1 was declared RC, there were far too many open bugs and instability problems. The decision to go RC had little engineering justification. A fierce internal debate followed, with a prompt decision to review the release criteria, which kept many people busy for several weeks. The desire to ship 5.1 GA before the Users Conference 2008 was rightfully dwarfed by the discovery of new, more disturbing bugs. Two more RCs were released, while the developers fought to fix a staggering number of bugs.

More than 3500 bugs affected MySQL 5.1, and by June we had fixed 2300 of them. There were still some outstanding critical bugs, and Marketing and Sales were pressing for a release. It was understandable. The economic situation of Sun was not good, the company had just cut 2500 jobs, and we needed the new release to boost sales. However, the outstanding bugs were so bad that the people who were in direct contact with users (Support and Community Team) strongly objected to a GA declaration at that point in time. In a joint effort, we identified 40 critical bugs that needed fixing before going GA. The position of the Support team was clear: they were not going to support row-based replication and partitioning unless these bugs were fixed. The offending bugs were examined by a committee chaired by the Sun Software CTO, and 35 bugs were recommended for immediate fixing.

You can see the result in the changelog. MySQL 5.1.27, which was the intended GA, was not released. Version 5.1.28 fixed 79 bugs, version 5.1.29 fixed 58 more, and 8 further fixes went into 5.1.30. The critical bugs identified by Support as the basis for supporting the new release were all fixed. Of the usability bugs identified by the Community team, only two were deferred to MySQL 5.1.31, and the fixes will be available in the next binary release.

The above explanation makes little sense if we don't explain the criteria for triaging bugs. Until MySQL 4.1, the release criterion for GA was "zero known bugs". That rule was artificially ignored in 5.0, with the results that we all know. But in 5.1, there was no way of shipping with such a rule. MySQL has become a much more complex product than it was in 4.1, and a flaw doesn't necessarily affect all the users. The contrary is more likely, in fact: a defect identified in our QA laboratory often affects only a small portion of our user base, or none at all. That is what happened with a bug in prepared statements, found by a Support engineer 4 years after the feature was released, and so far claimed by no one (the bug was fixed, BTW).

Back to the triage criteria. There were 1000 open bugs when we went RC. The only way of getting rid of them was to prioritize the bugs. Classifying the bugs based on one single parameter, as we did before, was not an option. We needed to be more specific. So all the open bugs were reviewed, and classified according to defect class, impact, workaround, and risk to fix. Not a perfect system, I grant you, but nonetheless it allowed us to start fixing the most serious bugs with the greatest impact first. If a bug is a crash, but it only happens after a convoluted series of actions that our QA engineers have devised, and has little chance of affecting real users, that bug gets fixed after a similar one that occurs in a common sequence of events. If a bug affects two customers, then its impact is automatically raised, and thus fixed with priority.
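To make the ordering concrete, here is a rough sketch of how such a classification could be turned into a work queue. It is an illustration only, not the tooling we used: the numeric scales, the field names, the defect class labels, and the one-level bump for bugs affecting two customers are all assumptions. The bugs are sorted criterion by criterion rather than reduced to a single score, and the sketch only orders the queue; it does not decide whether a bug gets fixed.

```python
from dataclasses import dataclass

# Hypothetical scale: the four criteria come from the text above,
# but the labels and numeric values are invented for this sketch.
SEVERITY = {"crash": 3, "wrong_result": 2, "usability": 1}


@dataclass
class Bug:
    bug_id: int
    defect_class: str        # a key of SEVERITY
    impact: int              # 1 = convoluted, unlikely scenario .. 3 = common sequence of events
    has_workaround: bool
    risk_to_fix: int         # 1 = low risk .. 3 = high risk
    customers_affected: int = 0


def effective_impact(bug):
    """Raise the impact when at least two customers are affected.

    Raising by exactly one level, capped at 3, is an assumption.
    """
    if bug.customers_affected >= 2:
        return min(bug.impact + 1, 3)
    return bug.impact


def triage_order(bugs):
    """Most serious class first, then greatest impact, then bugs
    without a workaround, then the lowest risk to fix."""
    return sorted(
        bugs,
        key=lambda b: (
            -SEVERITY[b.defect_class],
            -effective_impact(b),
            b.has_workaround,
            b.risk_to_fix,
        ),
    )


queue = triage_order([
    Bug(1, "crash", impact=1, has_workaround=False, risk_to_fix=2),   # convoluted QA scenario
    Bug(2, "crash", impact=3, has_workaround=False, risk_to_fix=2),   # common sequence of events
    Bug(3, "usability", impact=2, has_workaround=True, risk_to_fix=1,
        customers_affected=2),
])
print([b.bug_id for b in queue])  # [2, 1, 3]
```

The crash that needs a convoluted series of actions (bug 1) lands behind the similar crash that occurs in a common sequence of events (bug 2), which is the behavior described above.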

The other important point is the risk factor. If a bug fix involves a substantial change to an existing feature, it is deferred to the next version. When we went GA, we still had some open bugs, but none that prevented the normal usage of the server. The Community and Support teams are in contact with customers and other users who have been using MySQL 5.1 in production for years, and we made sure that the bugs reported by these brave souls were addressed before the GA release.

You see, there was a lot of work involved in our GA, and the engineering department, together with Support and Community teams, did an outstanding job of bringing this release to the public. Are the processes perfect? No. But they have improved significantly over the ones that led to MySQL 5.0 GA.

And speaking of that, we know that our engineering process leaves a lot of room for improvement.

Despite our claim to be the most popular open source database in the world, our development practices are very much closed source, and our release cycle is definitely in need of a revision. The obstacles to opening the development model are quite a few, and we have been working hard to remove them. The first change was to get rid of the proprietary revision control system that was a real obstacle to participation. Next, we published the Worklogs. Then we removed the MySQL Contributor License Agreement, which was another serious impediment. Next come the release cycle and the development model. There are many people at work on this issue, but this is work in progress and we don't want to raise your expectations of a quick resolution for these items. The important point is that everyone agrees on the need for change, and we are working toward this goal. It's a team effort, which will eventually bear fruit.

In the meantime, please honor the effort of the developers who created new features like Partitioning, Row-based replication and the Event Scheduler and have squashed thousands of bugs, and give MySQL 5.1 a try.

Comments:

Thanks for this, Giuseppe. I've been on a bug triage committee for an RDBMS product myself, so I understand the pressures. One thing I'd comment, however, is to be careful of applying too many quantitative processes to a decision that is ultimately qualitative. It's good to characterize priority, impact, risk, etc. so you can compare bugs to one another, but don't be tempted to create a mathematical formula to reduce these values to a single "do we fix it or not" number. That decision must still be made on a case by case basis.

Posted by Bill Karwin on December 08, 2008 at 10:55 AM CET #

the ga criteria for 4.1, or any earlier release, was not "no known bugs". documenting the bug, because it was too hard or risky to fix, was always an escape hatch. you see that with bug #989, the first bug that monty called out in his posting, which was reported against 4.0, long before 4.1 went ga.

Posted by jim winstead on December 08, 2008 at 12:15 PM CET #

Giuseppe this is a great synopsis of the 5.1 release. I'm sure most people don't know the reasons for it being in RC for so long - but it's a great thing to see the company improving their processes and moving toward an open cycle.

Posted by themattreid on December 09, 2008 at 09:52 AM CET #

Thanks for a sane post on the topic.

It'll be good to see the other working branches online at Launchpad. I.e. the ones used by the bugs team, build team, and the various development teams. This shows activity and thus gets away from the "silence then publication" thing, and we also know that there are people who will track the additional trees and catch funny stuff. So we regain the benefit of additional QA.

One comment:
> If a bug affects two customers, then its impact
> is automatically raised, and thus fixed with priority.

I appreciate this makes sense in terms of resourcing on the MySQL end, but it does nothing for quality which is what users care about. Or is the intent that lots of people just sign up for say a Basic subscription (i.e. become a customer) in order to get a bug that affects them prioritised? And does that actually work? I'd just like to know so I can advise my clients appropriately. I need to be able to rely on it though.

Posted by Arjen Lentz on December 09, 2008 at 04:20 PM CET #

About

Giuseppe Maxia
