Wednesday Apr 15, 2015

Data Governance for Migration and Consolidation

By Martin Boyd, Senior Director of Product Management

How would you integrate millions of part, customer, and supplier records from multiple acquisitions into a single JD Edwards instance?  This was the question facing National Oilwell Varco (NOV), a leading worldwide provider of components used in the oil and gas industry.  If they could not find an answer, many operating synergies would be lost, but they knew from experience that simply “moving and mapping” the data from the legacy systems into JDE would not be sufficient, as the data was anything but standardized.

This was the problem described yesterday in a session at the Collaborate Conference in Las Vegas.  The presenters were Melissa Haught of NOV and Deepak Gupta of KPIT, their systems integrator. Together they walked through an excellent discussion of the problem and the solution they have developed:

The Problem:  It is first important to recognize that the data to be integrated from many different legacy systems had been created over time, with different standards, by different people, according to their different needs. Saying it lacked standardization would be an understatement.  So how do you “govern” data that is so diverse?  How do you apply standards to it months or years after it has been created?

The Solution:  The answer is that there is no single answer, and certainly no “magic button” that will solve the problem for you.  Instead, in the case of NOV, a small team of dedicated data stewards, or specialists, works to reverse-engineer a set of standards from the data at hand.  For product data, which is usually the most complex, NOV found they could actually infer rules to recognize, parse, and extract information from ‘smart’ part numbers, even across the part numbering schemes of acquired companies.  Once these rules are created for an entity or a category and built into their Oracle Enterprise Data Quality (EDQ) platform, the data is run through the DQ process and the results are examined.  Most often this surfaces problems, which in turn suggest rule refinements.  The rule refinement and data quality processing steps are run repeatedly until the result is as good as it can be.  The result is never 100% standardized and clean data, though: some data is always flagged into a “data dump” for future manual remediation.
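To make the rule-inference idea concrete, here is a minimal sketch of how a ‘smart’ part number might be recognized and parsed. The numbering scheme, field names, and regex are invented for illustration; NOV's actual EDQ rules are far richer, but the parse-or-flag pattern is the same.

```python
import re

# Hypothetical 'smart' part-number scheme: a 2-letter product family,
# a 3-digit size code, and an optional 2-letter material suffix,
# e.g. "VL-250-SS". This rule is illustrative, not NOV's actual scheme.
PART_RULE = re.compile(
    r"^(?P<family>[A-Z]{2})-(?P<size>\d{3})(?:-(?P<material>[A-Z]{2}))?$"
)

def parse_part(part_number):
    """Apply the rule; unmatched parts are flagged for manual remediation."""
    m = PART_RULE.match(part_number.strip().upper())
    if not m:
        # Equivalent of the "data dump" for later human review.
        return {"part": part_number, "status": "flagged"}
    attrs = {k: v for k, v in m.groupdict().items() if v}
    return {"part": part_number, "status": "parsed", **attrs}

for p in ["VL-250-SS", "vl-375", "legacy#123"]:
    print(parse_part(p))
```

Running the rule over a sample, inspecting what gets flagged, and then tightening the regex mirrors the iterative refine-and-rerun loop described above.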

Lessons Learned:

  • Although technology is a key enabler, it is not the whole solution. Dedicated specialists are required to build the rules and improve them through successive iterations
  • A ‘user friendly’ data quality platform is essential so that it is approachable and intuitive for the data specialists who are not (nor should they be) programmers
  • Rapid iteration through testing and rules development is important to keep up project momentum.  In the case of NOV, specialists request rule changes, which are implemented by KPIT resources in India; in effect, changes are made and re-run overnight, which has worked very well

Technical Architecture:  Data is extracted from the legacy systems by Oracle Data Integrator (ODI), which also transforms the data into the right ‘shape’ for review in EDQ.  An Audit Team reviews these results for completeness and correctness, comparing the supplied data against the required data standards.  A secondary check is also performed using EDQ, which verifies that the data is in a valid format to be loaded into JDE.
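The secondary pre-load check might look something like the following sketch: each field is validated against a required format before the record is allowed into the target system. The field names and formats here are hypothetical, not JDE's actual schema.

```python
import re

# Hypothetical target-format rules; a real EDQ check would be configured
# in the tool rather than coded by hand.
REQUIRED = {
    "item_number": r"^\d{8}$",        # assume 8-digit item numbers
    "branch_plant": r"^[A-Z0-9]{1,12}$",
}

def load_ready(record):
    """Return the list of fields that violate the format; empty means ready."""
    errors = []
    for field, pattern in REQUIRED.items():
        value = record.get(field, "")
        if not re.match(pattern, str(value)):
            errors.append(field)
    return errors

print(load_ready({"item_number": "12345678", "branch_plant": "HOU1"}))  # []
print(load_ready({"item_number": "12-3456"}))  # both fields fail
```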

The Benefit:  The benefit of having data that is “fit for purpose” in JDE is that NOV can mothball the legacy systems and use JDE as a complete and correct record for all kinds of purposes from operational management to strategic sourcing.  The benefit of having a defined governance process is that it is repeatable.  This means that every time the process is run, the individuals and the governance team as a whole learn something from it and they get better at executing it next time around.  Because of this NOV has already seen orders of magnitude improvements in productivity as well as data quality, and is already looking for ways to expand the program into other areas.

All-in-all, Melissa and Deepak gave the audience great insight into how they are solving a complex integration program and reminded us of what we should already know: "integrating" data is not simply moving it. To be of business value, the data must be 'fit for purpose', which often means that both the integration process and the data must be governed. 

Thursday Feb 19, 2015

Hive, Pig, Spark - Choose your Big Data Language with Oracle Data Integrator

The strength of Oracle Data Integrator (ODI) has always been the separation of logical design and physical implementation. Users can define a logical transformation flow that maps any sources to targets without being concerned with what exact mechanisms will be used to realize such a job. In fact, ODI doesn’t have its own transformation engine but instead outsources all work to the native mechanisms of the underlying platforms, be it relational databases, data warehouse appliances, or Hadoop clusters.

In the case of Big Data this philosophy of ODI gains even more importance. New Hadoop projects are incubated and released on a constant basis and introduce exciting new capabilities; the combined brain trust of the big data community conceives new technology that outdoes any proprietary ETL engine. ODI’s ability to separate your design from the implementation enables you to pick the ideal environment for your use case; and if the Hadoop landscape evolves, it is easy to retool an existing mapping with a new physical implementation. This way you don’t have to tie yourself to one language that is hyped this year, but might be legacy in the next.

ODI generates executable code from the logical design through physical designs and Knowledge Modules. You can even define multiple physical designs for different languages based on the same logical design. For example, you could choose Hive as your transformation platform, and ODI would generate Hive SQL as the execution language. You could also pick Pig, and the generated code would be Pig Latin. If you choose Spark, ODI will generate PySpark code, which is Python with Spark APIs. Knowledge Modules orchestrate the generation of code for the different languages and can be further configured to optimize the execution of the different implementations, for example parallelism in Pig or in-memory caching for Spark.

The example below shows an ODI mapping that reads from a log file in HDFS, registered in HCatalog. It gets filtered, aggregated, and then joined with another table, before being written into another HCatalog-based table. ODI can generate code for Hive, Pig, or Spark based on the Knowledge Modules chosen. 
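To illustrate the logical flow of such a mapping, here is the same filter-aggregate-join sequence expressed in plain Python. This is only a stand-in for the logic: ODI would render an equivalent pipeline as Hive SQL, Pig Latin, or PySpark depending on the physical design chosen. The record fields and lookup table are invented for the example.

```python
from collections import defaultdict

# Hypothetical web-log records and a lookup table, standing in for the
# HCatalog-registered sources in the mapping described above.
logs = [
    {"page": "/home", "status": 200}, {"page": "/home", "status": 200},
    {"page": "/buy", "status": 500}, {"page": "/buy", "status": 200},
]
pages = {"/home": "Landing", "/buy": "Checkout"}

# Filter: keep successful requests only.
ok = [r for r in logs if r["status"] == 200]

# Aggregate: count hits per page.
hits = defaultdict(int)
for r in ok:
    hits[r["page"]] += 1

# Join: attach the page title from the lookup table.
result = [{"page": p, "title": pages[p], "hits": n}
          for p, n in sorted(hits.items())]
print(result)
```

Because the logical design is just this sequence of operators, swapping the physical implementation (Hive, Pig, or Spark) does not require touching the mapping itself.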

 ODI provides developer productivity and can future-proof your investment by overcoming the need to manually code Hadoop transformations to a particular language.  You can logically design your mapping and then choose the implementation that best suits your use case.

Monday Feb 16, 2015

The Data Governance Commandments

This is the second of our Data Governance Series. Read the first part here.

The Four Pillars of Data Governance

Our Data Governance Commandments are simple principles that can help your organization get its data story straight, and get more value from customer, performance or employee data.

Data governance is a wide-reaching discipline, but as in all walks of life, there are a handful of essential elements you need in place before you can really start enjoying the benefits of a good data governance strategy. These are the four key pillars of data governance:


Data is like any other asset your business has: It needs to be properly managed and maintained to ensure it continues delivering the best results.

Enter the data steward: a role dedicated to managing, curating, and monitoring the flow of data through your organization. This can be a dedicated individual managing data full-time, or just a role appended to an existing employee’s tasks.

But do you really need one? If you take your data seriously, then someone should certainly be taking on this role; even if they only do it part-time.


So what are these data stewards doing with your data exactly? That’s for you to decide, and it’s the quantity and quality of these processes that will determine just how successful your data governance program is.

Whatever cleansing, cleaning and data management processes you undertake, you need to make sure they’re linked to your organization’s key metrics. Data accuracy, accessibility, consistency and completeness all make fine starting metrics, but you should add to these based on your strategic goals.


No matter how ordered your data is, it still needs somewhere to go, so you need to make sure your data warehouse is up to the task and able to hold all your data in an organized fashion that complies with all your regulatory obligations.

But as data begins filling up your data warehouse, you’ll need to improve your level of data control and consider investing in a tool to better manage metadata: the data about other data. By managing metadata, you master the data itself, and can better anticipate data bottlenecks and discrepancies that could impact your data’s performance.

More importantly, metadata management allows you to better manage the flow of data—wherever it is going. You can manage and better control your data not just within the data warehouse or a business analytics tool, but across all systems, increasing transparency and minimizing security and compliance risks.

But even if you can control data across all your systems, you also need to ensure you have the analytics to put the data to use. Unless actionable insights are gleaned from your data, it’s just taking up space and gathering dust.

Best Practices

For your data governance to really deliver—and keep delivering—you need to follow best practices.

Stakeholders must be identified and held accountable, strategies must be in place to evolve your data workflows, and data KPIs must be measured and monitored. But that’s just the start. Data governance best practices are evolving rapidly, and only by keeping your finger on the pulse of the data industry can you prepare your governance strategy to succeed.

How Many Have You Got?

These four pillars are essential to holding up a great data governance strategy, and if you’re missing even one of them, you’re severely limiting the value and reliability of your data.

If you’re struggling to get all the pillars in place, you might want to read our short guide to data governance success.

Tuesday Feb 10, 2015

The Data Governance Commandments: Ignoring Your Data Challenges is Not an Option

This is the first of our Data Governance blog series. Read the next of the series here.

Our Data Governance Commandments are simple principles that can help your organization get its data story straight, and get more value from customer, performance or employee data.

All businesses are data businesses in the modern world, and if you’re collecting any information on employees, performance, operations, or your customers, your organization is swimming in data by now. Whether you’re using it, or just sitting on it, that data is there and it is most definitely your responsibility.

Even if you lock it in a vault and bury your head in the sand, that data will still be there, and it will still be:

  • Subject to changeable regulations and legislation
  • An appealing target for cybercriminals
  • An opportunity that you’re missing out on

Those are already three very good reasons to start working on your data strategy. But let’s break it down a bit more.


Few things stand still in the world of business, but regulations in particular can move lightning-fast.

If your data is sitting in a data warehouse you built a few years ago, it could now be stored in an insecure format, listed incorrectly, and in violation of new regulations you haven’t taken into account.

You may be ignoring the data, but regulatory bodies aren’t—and you don’t want to find yourself knee-deep in fines.


Your network is like a big wall around your business. Cybercriminals only need to find one crack in the brickwork, and they’ll come flooding in.

Sure, you’ve kept firewalls, anti-virus software and your critical servers up to date, but what about that old data warehouse? How’s that looking?

If you’ve taken your eye off your DW for even a second, you’re putting all that data at risk. And if the cybercriminals establish a backdoor through the DW into the rest of the organization, who knows how far the damage could spread?

If all you lose following such a data breach is consumer reputation and some business, consider yourself lucky. The impact could be far worse for the organization that ignores its data security issues.


Even without the dangers of data neglect, ignoring your data means you’re ignoring fantastic business opportunities. The data you’re ignoring could be helping your business:

  • Better target marketing and sales activities
  • Make more informed business decisions
  • Get more from key business applications
  • Improve process efficiency

Can you afford to ignore all of these benefits, and risk the security and compliance of your data?

Thankfully, there are plenty of ways you can start tightening up your data strategy right away.

Check out our short guide to data governance, and discover the three principles you need to follow to take control of your data.

Thursday Nov 06, 2014

Oracle Data Integrator and Hortonworks

Check out Oracle's Alex Kotopoulis being featured on the Hortonworks blog discussing how Oracle Data Integrator is the best tool for data ingest into Hadoop!

Remember to register for the November 11th joint webinar presented by Jeff Pollock, VP Oracle, and Tim Hall, VP Hortonworks.  Click here to register.  

Monday Oct 20, 2014

Announcing Availability of Oracle Enterprise Metadata Management

Oracle today announced the general availability of Oracle Enterprise Metadata Management (OEMM), Oracle's comprehensive metadata management technology for Data Governance. With this release, Oracle underscores a product strategy that not only offers best-in-class Data Integration solutions like Oracle Data Integrator (ODI), Oracle GoldenGate (OGG), and Oracle Enterprise Data Quality (OEDQ), but also technology that ties them together with business initiatives like governance.

Data Governance Considerations

Organizations have long struggled to impose credible governance on their data, relying on ad-hoc processes and technologies that are unwieldy and unscalable. There are several reasons why:

  • Data Governance cannot be done without managing metadata.
  • Data Governance cannot be done without extending across all platforms, irrespective of technology.
  • Data Governance cannot be done without a business- and IT-friendly interface.

Complete Stewardship - Data Transparency from Source to Report

The biggest advantage of having an airtight Data Governance program is that it reduces data risk, increases security, and helps manage your organization's data life-cycle. Any governance tool should be able to surface lineage, impact analysis, and data flow not just within a business analytics tool or a data warehouse, but across all these systems, no matter what technology is in use. This increased transparency enables accurate assessment of risks and impacts when data changes.

Data Flow Diagram across platforms.

With a focus on stewardship, OEMM is designed to be intuitive and search-based. Its search catalog allows easy browsing of all objects, with collaboration and social features for the Data Steward.

Search-based catalog and Business Glossary for easy browsing of objects.

Big Data Governance

OEMM together with Oracle Data Integrator provides a powerful combination for governing Big Data standards including HBase, Sqoop, and JSON. With ODI providing complete support for these standards for data loading and transformation, OEMM harvests the ODI metadata to stitch together a complete data map that even traverses any Big Data reservoir an organization has in place.

Oracle and 3rd Party Metadata

OEMM is truly heterogeneous. It is designed to pull in and manage metadata from Oracle and third-party databases, data warehouses, ETL, business intelligence, and other reporting tools.

Visit the OEMM homepage for more information about Oracle Enterprise Metadata Management.

Friday Oct 17, 2014

Upcoming Webinar: Data Transformation and Acquisition Techniques, to Handle Petabytes of Data

Many organizations have become aware of the importance of big data technologies such as Apache Hadoop, but are struggling to determine the right architecture to integrate them with their existing analytics and data processing infrastructure. As companies implement Hadoop, they need to learn new skills and languages, which can impact developer productivity. Oftentimes they resort to hand-coded solutions, which can be brittle and hurt both developer productivity and the efficiency of the Hadoop cluster.

To truly tap into the business benefits of the big data solutions, it’s necessary to ensure that the business and IT have simple tools-based methods to get data in, change and transform it, and keep it continuously updated with their data warehouse.

In this webinar you’ll learn how the Oracle and Hortonworks solution can:

  • Accelerate developer productivity
  • Optimize data transformation workloads on Hadoop
  • Lower cost of data storage and processing
  • Minimize risks in deployment of big data projects
  • Provide proven industrial scale tooling for data integration projects

We will also discuss how technologies from both Oracle and Hortonworks can deploy the big data reservoir or data lake, an efficient cost-effective way to handle petabyte-scale data staging, transformations, and aged data requirements while reclaiming compute power and storage from your existing data warehouse.

Jeff Pollock, Vice President, Oracle
Tim Hall, Vice President, Hortonworks

Hosted by: Tim Matteson, Co-Founder, Data Science Central

Click Here to Register.

Wednesday Oct 15, 2014

Oracle Data Integrator Certified with Hortonworks HDP 2.1

Too often, companies fall into what they perceive as the path of least resistance by using custom, hand-coded methods to create big data solutions, but in the rush to production these hand-coded solutions often perform more slowly and are more costly to maintain.  To truly tap into the business benefits of big data solutions, a simple tools-based solution is required to move large volumes of data into Hadoop and efficiently transform it without the need for costly mid-tier servers.  The Oracle Data Integration Solutions team is pleased to announce the certification of Oracle Data Integrator with Hortonworks HDP 2.1.

This collaboration between the Oracle Data Integrator and Hortonworks teams will provide customers a familiar and comprehensive data integration platform for Hadoop, covering high-volume, high-performance batch loads, agile transformations using the power of Hadoop, and a superior developer experience with the flow-based declarative user interface of Oracle Data Integrator.

To learn more, click here.    

Also, on November 11th, 2014 Jeff Pollock, VP Oracle Data Integration Solutions and Tim Hall, VP of Product Management Hortonworks will be hosting a joint webinar to discuss the certification and how technologies from both Oracle and Hortonworks can be used to deploy big data reservoirs.    To register, click here

Wednesday Jul 02, 2014

Learn more about ODI and Apache Sqoop

The ODI A-Team just published a new article about moving data from relational databases into Hadoop using ODI and Apache Sqoop. Check out the blog post here: Importing Data from SQL databases into Hadoop with Sqoop and Oracle Data Integrator (ODI)

Monday Jun 30, 2014

Oracle Enterprise Data Quality Product Family

Oracle Enterprise Data Quality (OEDQ) is critical to the Oracle Data Integration portfolio. OEDQ helps make sure that the data that customers use is fit for purpose. While Oracle Data Integrator (ODI) helps with data movement and Extract, Load, and Transform (ELT), and Oracle GoldenGate (OGG) is a leader in data replication, OEDQ is the tool that helps maintain data consistency and quality.

Critical Data Quality Challenges

Data used for decision making and analytics has to be fully trustworthy. However, in real life data rarely comes clean. It contains missing values, duplicate entries, misspelt words, non-standardized names, and various other forms of questionable data. Making critical decisions with such data results in operational inefficiencies, loss of goodwill among customers, faulty market readings, and audit and compliance lapses.
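These problems can be made concrete with a toy example: the sketch below standardizes a few customer records, flags a missing value, and detects a duplicate. The records and rules are invented; a tool like OEDQ performs this kind of work at scale with configurable rules rather than hand-written code.

```python
# Illustrative only: a toy version of the standardization and duplicate
# detection a data quality tool automates. Sample data is invented.
raw_customers = [
    {"name": "ACME  Corp.", "city": "Houston"},
    {"name": "acme corp", "city": "Houston"},   # same company, different spelling
    {"name": "Globex Inc", "city": None},       # missing value
]

def standardize(rec):
    # Normalize punctuation, whitespace, and casing in the name.
    name = " ".join(rec["name"].replace(".", "").split()).title()
    city = rec["city"] or "UNKNOWN"  # flag missing values explicitly
    return {"name": name, "city": city}

seen, clean, duplicates = set(), [], []
for rec in map(standardize, raw_customers):
    key = (rec["name"].lower(), rec["city"])
    (duplicates if key in seen else clean).append(rec)
    seen.add(key)

print(clean)       # standardized, deduplicated records
print(duplicates)  # candidate duplicates for review
```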

Essential Data Quality Capabilities

Ever since there have been databases and applications, there have been data quality problems. Unfortunately all those problems are not created equal and neither are the solutions that address them. Some of the largest differences are driven by the data type, or domain, of the data in question. The most common data domains in data quality are customer (or more generally, party data including suppliers, employees, etc.) and product data. Oracle Enterprise Data Quality products recognize these differences and provide purpose-built capabilities to address each. Quick to deploy and easy to use, Oracle Enterprise Data Quality products bring the ability to enhance the quality of data to all stakeholders in any data management initiative.

Oracle Enterprise Data Quality covers:

  • Profiling, Audit and Dashboards
  • Parsing and Standardization
  • Match and Merge
  • Case Management
  • Address Verification
  • Product Data Capabilities

Zebra Technologies Uses OEDQ to Reduce Costs