Sunday Jul 13, 2014

New Big Data Features in ODI 12.1.3

Oracle Data Integrator (ODI) 12.1.3 extends its Hadoop support with a number of exciting new capabilities. The new features include:

  • Loading of RDBMS data from and to Hadoop using Sqoop
  • Support for Apache HBase databases
  • Support for Hive append functionality

With these new additions, ODI provides full connectivity to load, transform, and unload data in a Big Data environment.

The diagram below shows all ODI Hadoop knowledge modules, with the KMs added in ODI 12.1.3 highlighted in red.

Sqoop support

Apache Sqoop is designed for efficiently transferring bulk data between Hadoop and relational databases such as Oracle, MySQL, Teradata, DB2, and others. Sqoop operates by spawning multiple parallel map-reduce processes across a Hadoop cluster, each connecting to the external database and transferring its share of the data from or to Hadoop storage in a partitioned fashion. Data can be stored in Hadoop using HDFS, Hive, or HBase. ODI adds two knowledge modules for Sqoop: IKM SQL to Hive-HBase-File (SQOOP) and IKM File-Hive to SQL (SQOOP).
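
To make the mechanics concrete, the snippet below is a minimal sketch of the kind of partitioned import Sqoop performs. It is not the exact command the SQOOP KMs generate, and the connection string, table, column, and path names are purely illustrative.

    # A rough illustration of a parallel Sqoop import into Hive.
    # All connection details, tables, and columns below are hypothetical.
    import subprocess

    sqoop_import = [
        "sqoop", "import",
        "--connect", "jdbc:oracle:thin:@//dbhost:1521/ORCL",   # source RDBMS
        "--username", "odi_user",
        "--password-file", "/user/odi/sqoop.pwd",
        "--table", "SALES",                  # source table to extract
        "--split-by", "SALES_ID",            # column used to partition the parallel extracts
        "--num-mappers", "4",                # number of parallel map tasks
        "--hive-import",                     # load the extracted data into Hive
        "--hive-table", "stg_sales",
        "--target-dir", "/user/odi/work/sales",   # HDFS staging directory
    ]
    subprocess.run(sqoop_import, check=True)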

Loading data from and to Hadoop with Sqoop in ODI is straightforward. Create a mapping with the database source and Hadoop target (or vice versa) and apply any necessary transformation expressions.

In the physical design of the mapping, make sure to set the LKM of the target to LKM SQL Multi-Connect.GLOBAL and choose a Sqoop IKM, such as IKM SQL to Hive-HBase-File (SQOOP). Change the MapReduce output directory IKM property MAPRED_OUTPUT_BASE_DIR to an appropriate HDFS directory. Review all other properties and tune as necessary. With these simple steps you should be able to perform a quick Sqoop load.

For more information please review the great ODI Sqoop article from Benjamin Perez-Goytia, or read the ODI 12.1.3 documentation about Sqoop.

HBase support

ODI adds support for HBase as a source and target. HBase metadata can be reverse-engineered using the RKM HBase knowledge module, and HBase can be used as source and target of a Hive transformation using LKM HBase to Hive and IKM Hive to HBase. Sqoop KMs also support HBase as a target for loads from a database. 
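
Under the covers, Hive reaches HBase through its HBase storage handler, which is the mechanism the HBase-to-Hive KMs build on. The sketch below shows that idea using the pyhive client (not part of ODI); the table, column family, and HiveServer2 host names are made up for illustration.

    # Register an existing HBase table as an external Hive table so it can be
    # queried (and used as a mapping source) from Hive. Names are hypothetical.
    from pyhive import hive

    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS hbase_customers (
      rowkey STRING,
      name   STRING,
      city   STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:city")
    TBLPROPERTIES ("hbase.table.name" = "CUSTOMERS")
    """

    cursor = hive.connect(host="hiveserver2-host", port=10000).cursor()  # assumed endpoint
    cursor.execute(ddl)                                                  # map the HBase table into Hive
    cursor.execute("SELECT name, city FROM hbase_customers LIMIT 10")
    print(cursor.fetchall())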

For more information please read the ODI 12.1.3 documentation about HBase.

Hive Append support

Prior to Hive 0.8, there was no direct way to append data to an existing table. Earlier Hive KMs emulated append logic by renaming the existing table and concatenating the old and new data into a new table with the prior name. This emulated append caused major data movement, particularly when the target table was large.

Starting with version 0.8, Hive supports appending data to an existing table. All ODI 12.1.3 Hive KMs have been updated to use this append capability by default, but they provide backward compatibility with the old behavior through the KM property HIVE_COMPATIBLE=0.7.
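
As a rough illustration of the difference, the sketch below contrasts the direct append available in Hive 0.8+ with the pre-0.8 rename-and-rebuild workaround. It uses the pyhive client rather than ODI itself, and the table names and HiveServer2 endpoint are hypothetical.

    # Contrast of the Hive 0.8+ append with the pre-0.8 emulation.
    # Tables sales_target / sales_staging and the host are hypothetical.
    from pyhive import hive

    def append_staged_rows(cursor, hive_08_or_later=True):
        if hive_08_or_later:
            # Hive 0.8+ (KM default): append directly to the existing table.
            cursor.execute("INSERT INTO TABLE sales_target SELECT * FROM sales_staging")
        else:
            # Pre-0.8 emulation (HIVE_COMPATIBLE=0.7): rename the old table and
            # rebuild the target from old plus new data, a full table rewrite.
            cursor.execute("ALTER TABLE sales_target RENAME TO sales_target_old")
            cursor.execute(
                "CREATE TABLE sales_target AS "
                "SELECT * FROM sales_target_old UNION ALL "
                "SELECT * FROM sales_staging"
            )
            cursor.execute("DROP TABLE sales_target_old")

    cursor = hive.connect(host="hiveserver2-host", port=10000).cursor()
    append_staged_rows(cursor)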

Conclusion

ODI 12.1.3 provides an optimal and easy-to-use way to perform data integration in a Big Data environment. ODI utilizes the processing power of the data storage and processing environment rather than relying on a proprietary transformation engine. This core "ELT" philosophy finds its perfect match in a Hadoop environment, where ODI provides unique value as a native, easy-to-use data integration environment.

Monday Jul 30, 2012

Four Ways to Wrestle a Big Data Elephant

He’s large. He’s fast. He’s strong. And very, very hungry! Meet the big data elephant. Perhaps you have seen him stalking the corners of your data warehouse looking for some untapped data to devour? Or some unstructured weblogs to weigh in on. To wrestle the elephant into working for you rather than against you, we need data integration. But not just any kind: we need newer styles of data integration that are built for these evolving data management challenges. I've put together four key requirements below with some pointers to industry experts in each category. Hopefully this is useful. And good luck with that 8 and ¼ tons of data!

Four Ways to Wrestle a Big Data Elephant

  • Leverage existing tools and skill-sets
  • Quality first
  • Remember real-time
  • Integrate the platform

Leverage existing tools and skill-sets

While Hadoop technologies are cool to name-drop and can add an impressive ‘buzz’ to your LinkedIn/Twitter profiles, a word of caution: not every big data technology may actually be necessary. The trend now is that tools are becoming integrated in such a way that designing ETL and developing MapReduce can be done in a single design environment. Data integration tools are evolving to support new forms of connectivity to sources in NoSQL and HDFS, as opposed to keeping these two worlds separate, something I referred to recently in my blog on Bridging two Worlds: Big Data and Enterprise Data.

A single solution lets you address not only the complexities of mapping, accessing, and loading big data, but also correlating it with your enterprise data, and this correlation may require integrating across mixed application environments. That correlation is key to taking full advantage of big data, and it requires a single unified tool that can straddle both environments.

Quality First

Secondly, big data sources come in many different types and many different forms. How can anyone be sure of the quality of that data? And yes, data stewardship best practices still apply. In a big data scenario, data quality is important because the multitude of data sources makes it difficult to trust the underlying data. Being able to quickly and easily identify and resolve data discrepancies, missing values, and the like in an automated fashion is a real benefit to the applications and systems that use this information.

Remember real-time

I covered this very subject in last week’s blog, Is Big Data Just Super Sexy Batch? No, it’s not. But at the same time, it would be an overstatement to say that big data addresses all of our real-time needs. [The cheetah still runs faster than the elephant… although I still wouldn’t want to try to outrun an elephant!] Tools such as Oracle GoldenGate and techniques like real-time replication and change data capture don’t simply disappear with big data. In fact, the opposite happens: they become even more crucial as our implementations cross over between the unstructured and structured worlds, where performance and low latency become increasingly paramount as volumes rise and velocity requirements grow.

Integrate the platform

Taking all the miscellaneous technologies around big data, many of which are new to organizations, and making them work with one another is challenging. Making them work together in a production-grade environment is even more daunting. Integrated systems can help an organization radically simplify its big data architecture by integrating the necessary hardware and software components to provide fast, cost-efficient access and mapping to NoSQL and HDFS.

Combined hardware and software systems can be optimized for redundancy with mirrored disks, optimized for high availability with hot-swappable power, and optimized for scale by adding new racks with more memory and processing power. Take it one step further and you can use these same systems to build out more elastic capacity to meet the flexibility requirements big data demands.

To learn more about Oracle Data Integration products, see our website, or to follow more conversations like this one, join me on Twitter @dainsworld.

About

Learn the latest trends, use cases, product updates, and customer success examples for Oracle's data integration products, including Oracle Data Integrator, Oracle GoldenGate, and Oracle Enterprise Data Quality.
