By cwebster on Apr 28, 2009
I recently saw a blog about agile development where the author was doing a production deployment with every commit that successfully passed the unit tests, so I thought I would share the development process we used to creating Zembly.
Each planning cycle covers a single three week development cycle (sprints for those familiar with the lingo). Planning is done both top down and bottom up, that is features and tasks come from both strategic features as well as infrastructure related requirements. The input from everyone on the extended team is a categorized set of tasks and features in our bug tracking system (Jira). The planning meeting serves to further categorize this list of tasks to committed and target features (committed features are expected to be before the end of the sprint, where target features will be delivered if there is time).
While the above sound straightforward, there are several things to be aware of:
- The issues must be clear and should have an accurate time estimate of the work. Having one line descriptions with no time estimate makes is difficult to determine the priority and importance of the task. The time estimate makes it possible to determine the load on each person (tools can help a lot at this point, specifically with dependencies and total time).
- Make sure to include things like blogging, demos, presentations. Not including these will result in leftover tasks at the end of the sprint.
- Bugs or time for bug fixing also needs to be included. What we have done is to work with higher level tasks (something like bug fixing) where each engineer can commit time for bug fixing and determine the bugs to fix. Trying to prioritize the various bugs proved to time consuming, but considering the time impact is important.
- Try to avoid adding additional work during the planning meeting, this meeting will become extremely long with feature debates anyway so trying to specify a feature enough to get a time estimate will be difficult.
- Input to the planning is open to everyone, but the prioritization done during the planning meeting should be open to a few. This will make reaching consensus faster and also reduce the time everyone spends in meetings. We introduced a sprint lead rotating role (responsible for representing the engineering team during planning and update meetings) so that everyone would get to attend and participate in a planning meeting, but not all at once.
The development process works as follows:
- A developer completes a unit of work which is production ready (the assumption is that tests are being developed in conjunction with code) and commits the change to the trunk. Code reviews are incorporated into the development cycle, mostly informal (peer review of change sets), but we have formal code reviews for large and complicated changes.
- A continuous build system (Hudson), detects when changes occur and does a checkout, build, test cycle. If the build or tests fail a mail is sent to the development team. If the build is successful the binary is published, for deployment to our continuous integration server.
- The continuous integration server picks up the last successful build (up to once an hour), deploys this, then runs a series of functional tests. The functional tests exercise the zembly platform APIs and ensure the integration is working. The functional tests serve as black box tests, where the unit tests are white box tests. If the deployment or functional tests fail a mail is sent to the development team for evaluation. Another set of UI tests are exected if the functional tests are successful which verify the basic functionality of the user interface (you get the picture if the tests fail) across multiple browsers.
- A set of performance tests run nightly and detect performance variations across builds. The key here is to look for trends and specifically verifying performance optimizations are really working. One of the biggest challenges is to make the tests repeatable (if you can avoid network effects that will save you lots of headaches).
There are a few interesting things that we found work well:
- breaking the build should be a big deal. Everyone will do it, but promptly fixing the problem is essential. The larger and more distributed the team the more expensive it is to break the build as updating and having to figure out why the build is broken is a pain. Establish a rule of waiting until the continuous build is successful before leaving. Also, peer pressure is effective.
- Establish a culture of writing tests, imposing this or trying to write tests after the fact is difficult.
- Make sure commit messages are understandable. A commit message describing the change lives with the code and may be referenced long after the change commits, so describing the change as well as the potential impacts on other code help both future developers and people testing the code (we used a QE impact section describing what areas would be impacted).
Deployment and Codeline management
We split a sprint into a number of deployment units (we have done several variations between daily and weekly) but this really depends on how fast the codeline can be vetted and stabilized. This time has increased as the SLA our users have grown to expect has increased. Each deployment unit has a release manager, who is responsibile for ensuring the code line is branched (more below), the bits are tested, stabilized, and pushed to production before the next release manager takes over. The release manager is announced but also gets to wear a special badge, the NASA style vest was not available.
Here are more details on what happens:
- There are three codelines all the time:
- trunk - this is never closed, but for larger changes it is better to commit them right after the staging codeline is created. This allows more testing on developer machines and reduces the risk of stopping testing immediately after the staging deployment.
- staging - this is created from the specific trunk revision that is currently running in the staging environment. Bugs which are detected in the staging environment that are serious enough to prevent a production deployment are fixed in the this branch.
- production - represents the code which is currently running in production. This branch is used if there are blocker issues in production. Changes in this branch are emergency changes and keeping the production environment running is the highest priority.
- There is an automated deployment which deploys the latest good bits (these are the bits which pass the test described in the development section) to the staging environment (a small replica of what we run in production) as well as runs the automated tests. The staging environment automated tests are the same as those that run continuously; however, these run in a horizontally scaled environment which has different characteristics.
- The release manager's job begins after this deployment to ensure the automated test suites pass as well as the codelines described above are uptodate, which is essentially moving the staging codeline to production, and copying the specific trunk revision to the staging codeline (copy and move are used in the subversion context). The release manager also does a manual sanity check on the bits to look for things that are difficult to detect in UI tests (alignment issues for example).
- At this point the build is ready for testing. This can be done by a formal QE team and also perhaps through developer brown bag sessions (we use both). The developer brown bag session allows the team to try the software as a group and build or create things, so in addition to finding bugs, things like usability issue will surface.
- If there are any defects the release manager will determine if the defects require immediate fixes and if so will ensure the fix gets committed to the right codeline and the changes get pushed to staging.
- Once everything is working as expected, the build is given a go and deployed to production and the cycle starts again.