Delivering software to support the cloud
By KLaker on Feb 17, 2010
From a software perspective, developing a cloud strategy is all about the data and not moving it. For a long time know Oracle has advocated the basic principle of doing everything inside the database. When you move to the cloud this makes even more sense because you do not want to be continually unloading, moving and reloading data into different engines.
Many of the data warehouse vendors actively promote the use of multiple processing engines to support their data warehouse solution. As a result you get something like this:
In this scenario data is being loaded, unloaded, moved and reloaded multiple times which increases latency, allows errors to be introduced and makes it difficult to determine exactly where a piece of data actually came from. There is also the topic of data security to consider - and that is a big topic! All this data movement, unloading and reloading provides numerous opportunities for security breaches.
If you are going to develop a viable data warehouse cloud strategy then what is needed are some simple rules that can be used to check the suitability of your preferred database platform:
- Flexible data model not fixed data model
- Data loading based on ELT not ETL
- Analytics inside the database not outside
- End-to-end security not disconnected security
- Effective resource management not ineffective resource management
Data Model Strategy - > Flexible not fixed data model
Oracle Database is not restricted to a single type of data model. This provides the required flexibility to provide a data model that can support real-time data loading as well as the complex analytics needed to support today's BI queries. Most importantly, as the business changes (new companies acquired, new products added and old products decommissioned) it is important to have a data model that can easily move with the business and not hold it back. This is especially true when considering a cloud-based strategy for the data warehouse since one of the main drivers of moving to a cloud based environment is "increased flexibility".
Oracle has developed and proven its reference architecture through numerous customer engagements over the last 14 years During this time the model has evolved to build on the changes in the capability of the underlying database technology and tools. Each new release of the Oracle Database adds new data warehousing, security and availability features that make it significantly quicker and easier to implement this reference architecture.
The goal of Oracle’s Data Warehouse Reference Architecture is to deliver a high quality integrated system and information at a significantly reduce cost over the longer term. It does this through recognizing the differing needs for Data Management and Information Access that must both be delivered by the Warehouse, applying different types of data modeling to each in a layered and abstracted approach.
The Reference Architecture is intended as a guide and not an instruction manual. Each layer in the architecture has a role in delivering the analytical platform required to support next generation business execution. The Reference Architecture gives us something to measure back against so we can understand what we compromise by making specific architectural, technical and tools choices. It works equally well for new Data Warehouse developments as it does for developing a roadmap to migrate an existing ones.
Data loading strategy-> ELT not ETLMany data integration (DI) tools rely on their engines to perform data transformations this is because many databases have very weak data transformation engines. Therefore, most DI tools extract data from a source system, move that data into their own processing engine and perform transformations in a row-based manner. Finally, the data is then pushed into the target - the data warehouse. The situation gets more complex when you start including processes to manage data quality, data lineage, data discovery etc etc.
This approach means customers have to manage multiple servers and their network takes a beating every time the ETL jobs are run because large data sets are moved around the network being passed from engine to engine. Yet the need to use multiple engines with associated dedicated hardware is often cited as an excellent reason for moving to a cloud based strategy. This removes the need to manage all those servers and software licenses. Data can freely move around the cloud taking advantage of the latest versions of each piece of software.
Yet it is the volume of data and the complexity of the transformations (ETL and data quality) that makes it vital that processing is down within the data warehouse database engine under the control of the database workload management features. The Oracle Database has specialized and optimized data transformation features such as set-based operations, error logging, pipeline table functions, regular expressions.
The Oracle Database license includes the Warehouse Builder (OWB) which follows the approach of extracting the data from the source system, loading into the target (DW) and then applying the required data transformations using the power of the Oracle Database. Customers using OWB do not need to buy additional hardware to run their ETL or additional tools beyond their normal enterprise database license. Therefore, using OWB it is possible to "cloud-enable" the ETL process directly within the database.
As all ETL jobs are under the control of the database workload manager the priority and access to resources can be managed from one central console. Using OWB's macro language ("experts") it is possible to write wrappers around normal processes that users might want to do such as load the contents of an Excel worksheet into a table. This way, users can "build" and execute their own ETL jobs using the same ETL tools and repository as the IT team. Then when something needs to be changed the impact on the whole environment can easily be determined.
Processing data inside the database makes sense. Take the analysis to the data not the other way round!
Analytics Strategy -> inside the database not outside
As with ETL it makes sense to do as much processing inside the database as possible since this is where all the data and real processing power is located. Personally, I think the challenge for most Oracle customers is knowing what is inside the database. The latest version of Enterprise Edition offers data mining, OLAP/multi-dimensional models, spatial, text mining, and support for unstructured data. By keeping all these types analysis within the database engine it is possible to run cross-functional analysis that is simply not possible in other data warehouse databases/engines.
Imagine being able to analyze the result from a data mining model using spatial analytics and then applying a top 10 and bottom 10 query to highlight winners and losers? Could you do this using the cloud? Of course, but it would probably mean unloading data from the enterprise data warehouse into a data mining engine and then pushing the results to a spatial engine and creating a federated query across the spatial and data warehouse datasets to run the winner and losers query. That all takes time and time is what most business users do not have - even without considering who is going to write the ETL to move all that data around!
End-to-end security not disconnected security
One of the biggest challenges around cloud computing is data security. Why? Because data is continually on the move from one engine to the next and all that movement is not encrypted, some engines have an encryption process (usually unique to them) and others have nothing. How do you know who is accessing your most sensitive data and more importantly how do you know where it is being moved to?
There is an easy answer to this: don't put sensitive data in the cloud! The only problem is that the sensitive data is usually the gateway to a lot of very important analysis. Therefore, you either stop moving data around, or you apply strong encryption and authorization policies or you do both. Fortunately, Oracle offers both! Using Oracle Database as the foundation of a DW cloud strategy means you can use Oracle's transparent security features to lock down sensitive data and stop unauthorized access. Data remains locked inside the Oracle Database where you can use the built-in analytic power to run queries across effectively secured data sets.
Effective resource management not ineffective resource management
If you are going to manage resources with the cloud in an effective way then you need to be able to control all aspects of the data warehouse workload. Most database systems, including those with cloud platforms, will provide some degree of control over the processing directly within the database.
The Oracle Database Resource Manager (DBRM) allows the DBA to prioritize workloads and restrict access to resources for certain groups of users. This allows the to protect high priority users or jobs from being impacted by lower priority work. The DBRM does this by allocating CPU time to different jobs based on their priority. The amount of resources allocated to a specific workload or user can depend on the percentage of CPU time, number of active sessions, and amount of space available etc etc.
The addition of Exadata to the data warehouse platform provides the Oracle DBA with one significant advantage for managing workloads: it extends DBRM's capabilities to include the coordination and prioritization of I/O bandwidth consumed between databases, and between different users and classes of work. This is only possible with Oracle and is the direct result of the tight integration between the database with the storage layer. Exadata is aware of what types of work and how much I/O bandwidth is consumed. Users can therefore have the Exadata system identify various types of workloads, assign priority to these workloads, and ensure the most critical workloads get priority.
To support a data warehouse cloud strategy that supports both data warehousing and/or mixed workload environments, you may want to ensure different users and tasks within a database are allocated the correct relative amount of I/O resources. For example you may want to allocate 70% of I/O resources to interactive users on the system and 30% of I/O resources to batch reporting jobs. This is simply not possible, or at best extremely complex to achieve, with other vendors databases. With Oracle, this is simple setup and enforce using the DBRM and I/O resource management capabilities of Exadata storage.
With Oracle Database 11g you get an integrated and complete software platform to support a cloud strategy
In this model Oracle Database provides an intelligent cloud, or iCloud, compared to the more traditional "dumb" cloud which is being heavily promoted by many of the current data warehouse vendors as they rush to prove their cloud credentials. Oracle offers "iCloud" as the way forward for your data warehouse strategy, which really is the only way forward.