Does Database Matter?
By frank.buytendijk on Aug 18, 2009
"Infrastructure is everything below the level I care about". For developers that would consist of the network and the hardware, for the CEO of the company, probably all of IT would fit under that definition. And given the fact that data warehousing is such an established discipline, it is an attractive thought here as well. All databases can handle queries, it doesn't matter which one you use to build a data warehouse. Right? Well, not really...
Data warehouse architectures have greatly evolved the last years. Where they used to aim at providing the management information for the managers in an organization, now they serve a multitude of stakeholders, also external to the organization. It is very common to give customers access to parts of the data warehouse, so they can manage their customer relationship with the organization better. The same goes with suppliers, who have access to information to improve their administrative and logistical integration. In short, the data warehouse has become a platform to service the complete value chain. This means that all of a sudden, database security becomes a major requirement. Not every database has the same capabilities here.
The second trend in data warehouse architectures has been to support closed-loop integration. Instead of weekly or daily batch updates, data is flowing in continuously, and is being integrated continuously. Moreover, the transactional systems that feed the data warehouse also become a target for the data warehouse, feeding back integrated information. Think for instance of call center and e-business systems that need integrated customer data coming from multiple product systems. The data warehouse increasingly serves as that central data hub. Updating and querying at the same time asks for database capabilities such as multi version read consistency and fine grained locking (an advanced form of row locking, originally meant to support transactional systems), that ensures that users get consistent results from queries even during updates to the data.
Further, with data warehouses becoming more operational of nature, performance tuning becomes a different game. It is no longer acceptable to have a star-schema-based structure without at least a normalized layer underneath. Closed-loop architectures require normalized data to match transactional data models, when data is denormalized the link with the operational data is lost. Further, performance tuning shouldn't come from denormalization, but from more advanced database and hardware techniques.
At the same time, traditional uses of the data warehouse, analytical queries requiring massive table scans, haven't gone away. The type of queries has become much more mixed of nature. Sometimes short random records are needed, sometimes large joins. Query management has become much more versatile.
Both trends combined (greatly expanded user constituency spanning the complete value chain, as well as operational use) also demand high availability techniques. Database does matter.
Interestingly enough, these exact points are made by vendors of specialty databases optimized for data warehousing only. They use specific techniques such as column-based storage, and proprietary hardware to deliver 'appliances', allowing what is often called 'active data warehousing'. Active data warehousing comprises the forementioned continuously inflow and integration of data. Still, the reasoning that this requires a proprietary database is flawed.
First, and I mentioned this already, because of closed-loop architectures. Transactional systems need to access the data warehouse as well. The data warehouse is not a one-directional street, or an information silo only accessed by BI tools. It has become an integral part of both the IT infrastructure, as well as IT operations processes and skills. Having non-standard hardware and database makes it harder to integrate the data warehouse in such an environment.
The second problem with specialized databases is the need for specialist skills. DBA skills are rare. And with growing demands for scalability, availability, and performance tuning, the required skills have become more advanced. Building specialist skills for the data warehouse only isolates the data warehouse DBAs, and makes it even harder to integrate the data warehouse in the IT infrastructure and IT operations. Having mutually exclusive DBA skills for operational databases and the data warehouse places a burden on IT budgets, headcounts and skills that are under pressure already. Advanced self-management techniques in the database helps bringing the ratio DBA per terabyte of data down. Again, database does matter.
The trends in data warehousing (more types of users, and a mix of operational and analytical use) favor Oracle. With extensive experience in both operational and analytical use, mixed query loads, security, availability, self management and advanced row locking applied to data warehousing are not a problem. That leaves one thing: the advantage of an optimized hardware appliance, for advanced performance tuning. And that's the Oracle Exadata Database Machine, however running the same Oracle database, leveraging the skills you already have.