Last week one of our customers faced a complex escalation in which a large (2+ TB) database did not complete its migration successfully. I will not go into the details, but the problem was that the data-header blocks were not consistent with the data in the blocks. In layman's terms: "Houston, we have a problem".
As a result: no access, and the key business application was down. In Support terms, a REAL SEVERITY 1; in real life, an emergency call on Monday morning at 6:45 AM.
Our usual reflex is to assign a crisis manager who starts to engage with all parties involved (customer, partner/implementer, support, development). The starting point is to triage the situation, assign ownership and manage the process, which in fact means communicate-communicate-communicate and let the techies do their work.
During the day we had a lot of discussions with the customer about all aspects of this project. What is your plan? How is the contingency set up? Is there a fallback? Did you see this problem during testing? All interesting areas to discuss in this blog, but let's focus on the testing.
A review of the steps revealed that they had spent a lot of time in this area. A team of specialists spent many hours setting up and preparing the test, covering all areas of the migration. The team performed the planned data migration using an identical data set on an identical operating system, using the cookbook designed for the migration weekend.
So... the 1M dollar question was: "Why did they not find the data corruption in the test?" I can tell you there was a lot of speculation going on during the day.
During the day we got the answer, simple and a great learning point: "We tested the migration but did not run the application accessing the data. This was considered a separate project and was planned for a later stage... by a different team."
So, project leaders, please keep this lesson in mind when planning a migration. Don't let the limited scope of a project limit your thinking about testing.
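To make the lesson concrete, here is a minimal sketch of a migration test that actually reads the migrated data instead of stopping once the copy has finished. It is an illustration only, not what the team in this story ran: it uses Python's built-in sqlite3 module as a stand-in for the real database, and the function names and the `orders` table are hypothetical. A real test would run the application's own queries against the migrated database.

```python
import sqlite3


def read_all_rows(conn, table):
    """Fetch every row of a table so the engine has to read the data; return the row count."""
    # Table names cannot be bound as SQL parameters; this sketch assumes trusted names.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    return sum(1 for _ in cursor)


def compare_row_counts(source, target, tables):
    """Return {table: (source_count, target_count)} for every table whose counts differ."""
    mismatches = {}
    for table in tables:
        source_count = read_all_rows(source, table)
        target_count = read_all_rows(target, table)
        if source_count != target_count:
            mismatches[table] = (source_count, target_count)
    return mismatches


def demo():
    """Build two tiny in-memory databases where the 'migrated' copy is missing a row."""
    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    for conn in (source, target):
        conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    source.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (20.0,), (30.0,)])
    target.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (20.0,)])
    return compare_row_counts(source, target, ["orders"])


if __name__ == "__main__":
    print(demo())
```

A row-count comparison is only the simplest possible read check. The point is that the test touches the data after the migration, which is the step that was left to "a different team" in this case.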
Hans
...and how about the problem?
-Repeating the migration would take 7 days
-Fallback would delay the project 3 months, resulting in huge costs and loss in confidence.
-Using DUL would be an option
-The Oracle team decided to manually detect the wrong blocks and patch them manually. At 12:00 PM the same day the 2+ TB database was up and running and the customer went into production as planned (after running the data access tests).
