Information, tips, tricks and sample code for Big Data Warehousing in an autonomous, cloud-driven world

  • July 27, 2009

A follow up on "why does ETL take so long"

Jean-Pierre Dijcks
Master Product Manager

I got this comment from Oliver on the ETL posting: "in my opinion coding or modelling the etl-process is not the most time consuming part of building a data warehouse. The most time is spent on consolidating the dimensions and business definitions across the different source systems and company divisions. Also agreeing on a common definition of the KPIs takes time."

I think this is a very true statement, and it kind of puts the actual building of the technical infrastructure into perspective. As an example, I have been on a project where it took the company over a year to define the actual product hierarchy, and even then there was still massive debate on the outcome.

That said, let's put my comments into perspective. What I was referring to was a three phase/step process to build the DW. First step is the schema design, then we do the ETL and data loading and lastly we do the BI design and build the reports. From the DW side of the house, the actual ETL typically takes the longest to complete. That is what the posting was about...

Now, back to Oliver's thoughts.

There are a couple of things of extreme importance when building a DW:

- Solve a real business problem

- Get executive and business sponsorship

- Come to well defined business definitions (what does revenue mean, what does the product hierarchy look like, what KPIs are you measuring and when are we doing ok, etc...)

If you find that real business problem and you secure the right sponsorship the hard work is just beginning. Most people will realize there is a problem, but not many will understand the solution. Or they get sidetracked with minor details and thus focus on the wrong things.

That is where it makes sense to use prototypes or agile development methods to start to frame the answer for the business users. Use their own data in a simple reporting environment. Keep them focused on the information, not the color scheme or the report layout. Then work your way through the set of definitions and refine the data extraction rules as you go through that data set.

I think that will help define certain parts of the dimensions and business definitions. Seeing is in many cases believing (or discovering issues). It will at least foster understanding and give the users ideas as to what is possible.

With this start it is hopefully easier and faster to come to consensus on business definitions. But keep in mind it is easier and faster, not easy and fast!

Thanks for the comment Oliver!

Join the discussion

Comments ( 2 )
  • mintakastar Thursday, March 4, 2010
    yes, I agree that the ETL process takes a lot, but obviously the design time takes a lot too, because the design it the most important thing, it is what define the solution of the problem.
    i've got a question, not referring a lot with this topic, but.., what kind of algorithms do you use or could be used to create a DW?
  • jean-pierre.dijcks Thursday, March 4, 2010
    Hi Mintaka,
    What do you mean with algorithms? On the transformation side I think a lot of it is "generic SQL stuff"... A lot of lookups (recoding), some conversions (case statements) and then there is the whole complex transformations (custom code).
    Data cleansing is a bit more rule and algorithm driven. For example to determine whether or not two strings match you use edit-distance or jaro-winkler. Or you use all sorts of other rules to strip of noise words (the, in, at etc.).
    Not sure if this is an answer to what you are after, if not feel free to comment and clarify.
    Also, yes you are correct, designing something takes a bit. But a schema design is typically a lot simpler than creating the ETL process. In many cases it takes less time and I think it should take less time...
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.