Information, tips, tricks and sample code for Big Data Warehousing in an autonomous, cloud-driven world

Why does it take forever to build ETL processes?

Jean-Pierre Dijcks
Master Product Manager

One of the discussion topics at ODTUG was whether ETL (or better, data integration) still makes up 70% of data warehouse development time. At that session my answer was YES. For some more context, take a quick look at this post.

Let's talk some more about that.

I think there are many reasons why data integration is still a very large piece of the development of a data warehouse. Some of these are:

- It is a complex problem to solve and solving that takes time

- Data quality issues make data integration much harder than it should be

- Are we still hand coding ETL?

- New requirements make data integration more complex

I think all of the above are true. I do think however that we can mitigate some of this and streamline and simplify (and therefore shorten) data integration projects. Looking at the 4 topics above, how can we do this?


It is a complex problem...

True, but in reality some problems are more complex than others. A typical ETL routine consists of the actual extract-transform-load steps or code, a more or less generic error-handling routine with logging infrastructure, and the runtime environment for execution (users, grants, passwords, database links, etc.).

When hand coding, all of the above are the ETL developer's problem. One way to take complexity out of the task is therefore to use an ETL tool. The tool standardizes the runtime environment to a set of metadata definitions, uses a standard logging framework, and gives you a default error-handling function. Leveraging the tool cuts down on general development and testing time, and reduces maintenance in the weeks and months that follow.
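To make that concrete, here is a minimal Python sketch of the kind of scaffolding a tool gives you for free: every step runs inside the same logging and error-handling wrapper instead of each developer reinventing it. The step functions are hypothetical stand-ins for real extract/transform/load code.

```python
import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("etl")

def run_step(name, step_fn, *args):
    """Run one ETL step with standardized logging and error handling."""
    start = datetime.now()
    log.info("step %s started", name)
    try:
        rows = step_fn(*args)
        log.info("step %s finished: %d rows in %s",
                 name, rows, datetime.now() - start)
        return rows
    except Exception:
        # One place to log failures, instead of ad hoc handling per script.
        log.exception("step %s failed", name)
        raise

# Hypothetical steps; in a real flow these would touch the warehouse.
def extract():   return 100  # rows pulled from a source table
def transform(): return 100  # rows cleansed and conformed
def load():      return 100  # rows inserted into the target

for step_name, fn in [("extract", extract), ("transform", transform),
                      ("load", load)]:
    run_step(step_name, fn)
```

Hand-coded scripts each carry a private copy of this plumbing; the point of the tool is that there is exactly one copy, defined as metadata.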

The other complex issue that can be tackled in some way with tooling is the problem of understanding the source and the source data. Packaged applications are well documented (customizations excepted) but are still a pain to get data out of. One way to go is to buy packaged extraction routines like Oracle's BI Applications. The other solution is to leverage data profiling capabilities as the investigative tooling to better understand new sources. All of this is focused on reducing the time it takes to understand a source. Reducing that time reduces overall time to delivery, but perhaps more importantly, reducing the unknowns helps reduce the complexity of the problem.
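Even without a full profiling tool, the basic idea is simple. The sketch below (standard library only, with made-up sample rows) computes the kind of column statistics a profiler starts from: null counts, distinct counts, and the most common values.

```python
from collections import Counter

def profile(rows, columns):
    """Simple column profile: null count, distinct count, top values."""
    stats = {}
    for i, col in enumerate(columns):
        values = [r[i] for r in rows]
        non_null = [v for v in values if v not in (None, "")]
        stats[col] = {
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
            "top": Counter(non_null).most_common(3),
        }
    return stats

# Hypothetical sample from a source table
rows = [
    ("1001", "NL", "gold"),
    ("1002", "NL", ""),
    ("1003", "US", "silver"),
    ("1003", "US", "gold"),
]
report = profile(rows, ["cust_id", "country", "tier"])
print(report["tier"])     # the missing tier value shows up immediately
print(report["cust_id"])  # the duplicate key shows up as distinct < row count
```

Running this against an unfamiliar table answers the first questions about a source before any mapping work starts.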

Data Quality issues...

Ahhhh, the never-ending saga continues. Yes, you do suffer from quality issues, and yes, you will be the first to find out when you start moving data around. So here are two suggestions to think about:

Start using some form of profiling or auditing tool. At the very least you can analyze the data in all kinds of ways beforehand, then talk to users and show them that there really is a problem in this data set or that one.

The second suggestion is a golden oldie: use an ETL tool! Why? Because data quality issues impose changes on your transformation logic, and to deal with that efficiently you will need to handle those changes quickly. A metadata-driven tool allows you to do that; hand-coded scripts will only turn your data quality problem into an "I recoded this 12 times and it still does not work" problem.
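What "metadata-driven" buys you can be shown in a few lines. In this hypothetical Python sketch, the cleansing rules live in one mapping table rather than being scattered through hand-written scripts; when a quality rule changes, you edit one entry instead of recoding.

```python
# Column cleansing rules held as metadata, not hard-coded in scripts.
MAPPINGS = {
    "country": lambda v: (v or "UNKNOWN").strip().upper(),
    "amount":  lambda v: round(float(v or 0), 2),
}

def transform_row(row):
    """Apply the mapped rule per column; pass unmapped columns through."""
    return {col: MAPPINGS.get(col, lambda v: v)(val)
            for col, val in row.items()}

result = transform_row({"country": " nl ", "amount": "12.3456",
                        "cust_id": "1001"})
print(result)
```

A real ETL tool does the same thing at scale: the transformation logic is data, so change impact is visible and changes are cheap.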

If you are not using any of these tools, make the case for them. I know that is hard in this economy, but the case is about saving time and money and doing things faster.

Are we still hand coding...

Don't! See above ...

New requirements make data integration more complex...

Yes, we are living in a world where we see a number of trends:

1) More and more data needs to be handled

2) More and more data is used for operational purposes

3) More people are asking questions on the system

4) The business questions are getting harder

These trends make for an exciting environment, but also for some tough choices. These trends directly impact the ETL environment for obvious reasons. Moving more data in the same time window can be a challenge, moving it in a smaller time window is infinitely harder.

The big problem, however, is a growing focus on near-real-time solutions. Not only are these solutions architecturally more complicated, they are also newer, and fewer people have expert-level experience implementing them.

To deal with this set of challenges it pays to go back to basics. Look at the architecture and look at the real needs of the customer. Does all ETL need to be running in "real-time" mode, or can we model the schema in such a way that only certain parts are refreshed in real time?

Imagine a case where stock counts, and therefore the routing of shipments, are driven off of warehouse information. Yes, definitely a real case for real-time movement. But does the customer dimension require the same refresh rate? Does the marketing budget star schema need real-time feeds...?
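That per-table decision can itself be captured as metadata. A hypothetical sketch, using the example targets above: only the stock fact runs near-real-time, everything else stays on a cheaper batch schedule.

```python
# Hypothetical refresh policy: real-time only where the business needs it.
REFRESH_POLICY = {
    "warehouse_stock_fact":  {"mode": "near-real-time", "interval_seconds": 60},
    "customer_dim":          {"mode": "batch", "interval_seconds": 86400},
    "marketing_budget_fact": {"mode": "batch", "interval_seconds": 86400},
}

def needs_refresh(table, seconds_since_last_load):
    """Decide whether a target is due for a load under its policy."""
    return seconds_since_last_load >= REFRESH_POLICY[table]["interval_seconds"]

print(needs_refresh("warehouse_stock_fact", 120))  # True: stock goes stale fast
print(needs_refresh("customer_dim", 3600))         # False: daily is fine
```

The point is not the code but the design choice: refresh frequency is a property of each target, set by business need, not a single architecture-wide setting.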

In other words, yes new trends require new technology, but be practical about it. The fastest way of moving lots of data may still just be flat files.
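Staying practical can be as mundane as this sketch: stage rows to a delimited flat file that a bulk loader (SQL*Loader, external tables, COPY, and the like) can then ingest in a single pass, instead of trickling rows through row-by-row inserts. The sample rows are made up.

```python
import csv
import io

def stage_to_flat_file(rows, out):
    """Write rows as pipe-delimited lines for a downstream bulk loader."""
    writer = csv.writer(out, delimiter="|")
    for row in rows:
        writer.writerow(row)

# In practice `out` would be a real file handed to the bulk loader.
buf = io.StringIO()
stage_to_flat_file([("1001", "NL", "gold"), ("1002", "US", "silver")], buf)
print(buf.getvalue())
```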

So does it really need to take 70% of the time?

I doubt it, or maybe better, I think there are ways to load the data warehouse without spending 70% of your time on ETL. That does however require an investment in tools and a lot of consideration as to the loading architecture. If you go down the manual labor track, you may well end up spending 70% of your labor and time on ETL. And so we are back to the age-old cliché: work harder or work smarter.

So how do I work smarter not harder?

Back to my original problem list, I think:

- It is a complex problem to solve and solving that takes time

=> Use pre-built solution or tools to ensure productivity is as high as possible. Remember, you will need to change lots of things, so create an environment where you can apply changes quickly and easily.

- Data quality issues make data integration much harder than it should be

=> Face the music! Everyone has data quality issues, so plan for them and ensure you are flexible and able to deal with these issues.

- Are we still hand coding ETL?

=> Just don't... please...

- New requirements make data integration more complex

This is the hardest one to deal with. You will need to change, so investigate the right architecture; don't go overboard on new things, but make sure you are not made obsolete as time passes. In other words: plan ahead, think ahead, and invest in knowledge.

Join the discussion

Comments (5)
  • Oliver Monday, July 27, 2009
    In my opinion, coding or modelling the ETL process is not the most time-consuming part of building a data warehouse. The most time is spent on consolidating the dimensions and business definitions across the different source systems and company divisions. Also, agreeing on a common definition of the KPIs takes time.
    My 2 cents
  • MCSE Training Tuesday, July 28, 2009
    Good insights! Data handling is surely a major issue faced by operational units.
  • Jordan Tuesday, July 28, 2009
    In response to Oliver, I would also add that identifying your semantic types - more simply, finding how your cryptic and unhelpful data fields map to meaningful fields - can be a time consuming, yet extremely beneficial part of the data profiling process.
    ETL Engineer
  • Jean-Pierre Tuesday, July 28, 2009
    Hi Jordan,
    Yup, it is one of those nice things about data profiling. I've also found it useful to discover data and what is where. In ERP apps (SAP is of course the hardest one since the table defs are cryptic at best) it helps to profile the data and get a feel for and understanding of the data. While the reference manuals help, seeing the data and the data type with some form of column name always helps me understand things...
  • Elwyn Lloyd Jones Thursday, July 30, 2009
    Nonsense, it takes 5 minutes to build an ETL process with your favourite tool; or it takes 5 minutes to build an ETL process as designed; or it takes 5 minutes to build an ETL process as specified; or it takes 5 minutes to "cleanse" the data. But it is impossible to build an ETL process as designed with your favourite tool; and it is impossible to design an ETL process as specified; and it takes forever to decide whose fault it was that the data was wrong in the first place, and who has the right to fix it. ETL processes are simple, which is why everybody involved thinks they are right. The cartoon of the round bolt and the square hole says it all!