Wednesday Nov 12, 2014

ODI 12c - Spark SQL and Hive?

In this post I'll cover some new capabilities in the Apache Spark 1.1 release and show what they mean to ODI today. There's a nice slide shown below from the Databricks training for Spark SQL that pitches some of the Spark SQL capabilities now available. As well as programmatic access via Python, Scala and Java, the Hive QL compatibility within Spark SQL is particularly interesting for ODI today. The Spark 1.1 release supports a subset of the Hive QL features, which in turn is a subset of ANSI SQL; there is already a lot there and it is only going to grow. The Hive engine today uses map-reduce, which is not fast; the Spark engine is fast and in-memory - you can read much more on that elsewhere.

Figure taken from the Databricks training for Spark SQL, July 2014.

In the examples below I used the Oracle Big Data Lite VM; I downloaded the Spark 1.1 release and built it using Maven (I was on CDH 5.2). To use Spark SQL in ODI, we need to create a Hive data server - the Hive data server masquerades as many things: it can be used for Hive, for HCatalog or for Spark SQL. Below you can see my data server; note the Hive port is 10001 (by default 10000 is the Hive server port) - we aren't using Hive server to execute the query, here we are using the Spark SQL server. I will show later how I started the Spark SQL server on this port (the Apache Spark doc for this is here).

I started the server using the Spark standalone cluster that I configured using the following command from my Spark 1.1 installation;

./sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.bind.host=bigdatalite --hiveconf hive.server2.thrift.port=10001 --master spark://192.168.122.1:7077

You can also specify local (for test), Yarn or other cluster information for the master. I could have just as easily started the server on Yarn by specifying the master URI as something like --master yarn://192.168.122.1:8032, where 8032 is my Yarn resource manager port. I ran using the 10001 port so that I can run both the Spark SQL and Hive engines in parallel whilst I do various tests. To reverse-engineer the table definitions in ODI I actually used the Hive engine (I hit some problems using the Spark SQL route, so worked around it) and then changed the model to use my newly created Spark SQL data server above.
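Before building any mappings, a quick sanity check is to point any JDBC/Hive client (beeline, SQL Developer and so on) at port 10001 and run a couple of statements against the Thrift server - a minimal sketch, assuming the movie tables from the Big Data Lite VM are already registered in the shared Hive metastore;

  -- list the tables Spark SQL sees via the Hive metastore
  SHOW TABLES;
  -- a simple aggregate executed by the Spark engine rather than map-reduce
  SELECT COUNT(*) FROM movie;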

Then I built my mappings just like normal - and used the KMs in ODI for Hive just like normal. For example the mapping below aggregates movie ratings and then joins with movie reference data to load movie rating data - the mapping uses the datastores from a model obtained from the Hive metastore;
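The Hive QL that the Hive KMs produce for a design like this is plain aggregate-and-join SQL, which Spark SQL happily accepts. A rough, hand-written sketch of the shape of the statement is below - the table and column names are illustrative, not the exact names from the metastore model;

  SELECT m.movie_id,
         m.title,
         AVG(r.rating) AS avg_rating,
         COUNT(*)      AS num_ratings
  FROM   movie_rating r
  JOIN   movie m ON m.movie_id = r.movie_id
  GROUP BY m.movie_id, m.title;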

If you look at the physical design, the Hive KMs are assigned but we will execute this through the Spark SQL engine rather than through Hive. The switch from engine to engine was handled in the URL within our Hive dataserver.

When the mapping is executed you can use the Spark monitoring API to check the status of the running application and Spark master/workers.

You can also monitor from the regular ODI operator and ODI console. Spark SQL uses the Hive metastore for all of the table definitions, whether the data is internally or externally managed.

There are other blogs from tools showing how to access and use Spark SQL, such as the one here from Antoine Amend using SQL Developer. Antoine also has another very cool blog worth checking out, Processing GDELT Data Using Hadoop. In this post he shows a custom InputFormat class that produces records/columns. This is a very useful post for anyone wanting to see the Spark newAPIHadoopFile API in action. It has a pretty funky name, but is a key piece (along with its related methods) of the framework.

  // Read file from HDFS - Use GdeltInputFormat (keys and values are org.apache.hadoop.io.Text)
  val input = sc.newAPIHadoopFile(
     "hdfs://path/to/gdelt",
     classOf[GdeltInputFormat],
     classOf[Text],
     classOf[Text])

Antoine also provides the source code to GdeltInputFormat so you can see the mechanics of his record reader; although the input data is delimited (so this could have been achieved in different ways), it's a useful resource to be aware of.

This post was all about using Spark SQL via the JDBC route - there is a whole other topic on transformations using the Spark framework alongside Spark SQL that is for future discussion. You should also check out the Hive QL compatibility documentation here to see what you can and can't do within Spark SQL today. Download the BDA Lite VM and give it a try.

Thursday Sep 25, 2014

ODI 12c - Migrating from OWB to ODI - PLSQL Procedures

The OWB to ODI migration utility does a lot, but there are a few things it doesn't handle today. Here's one for our OWB customers moving to ODI who are scratching their heads at the apparent lack of support for PLSQL procedures in ODI... With a little creative work you can not only get those mappings into ODI, but also dramatically improve performance (my test below improves performance by 400% and can easily be tuned further based on hardware).

This specific illustration covers the support for PLSQL procedures (not functions) in OWB and what to do in ODI. OWB took care of invoking PLSQL procedures mid-flow - it did this by generating PLSQL row-based code for certain (not all) map designs out of the box, and PLSQL procedure invocation was one such case. The PLSQL that OWB generated was pretty much as efficient as PLSQL could be (it used bulk collect and many other best practices you didn't have to worry about as a map designer), but it was limited in the logical map support (you couldn't have a set-based operator such as a join after a row-based-only operator such as a PLSQL transformation) - and it was PLSQL, not SQL.

Here we see how, with a simple pipelined, parallel-enabled table function wrapper around your PLSQL procedure call, you can capture the same design in ODI 12c and/or get the mapping migrated from OWB. I think the primary hurdle customers have is figuring out what the option going forward is. To solve this, we will just leverage more of the Oracle database: table functions, and parallelize the <insert your favorite word> out of it!

The mapping below calls a PLSQL procedure, and OWB generated PLSQL row-based code for this case; the target is loaded with data from the source table and the 2 output parameters of the PLSQL procedure;

When you try and migrate such a mapping using the OWB to ODI migration utility, you'll get a message indicating that the map cannot be migrated using the utility. Let's see what we can do! The vast majority of mappings are set-based; generally only a very small subset are row-based PLSQL mappings. Here is how this is achieved in ODI 12c.

I did a test using the generated code from OWB - no tuning, just the raw code for the above mapping - and it took 12 minutes 32 seconds to process about 32 million rows, invoke the PLSQL procedure and perform a direct path insert into the target. With my ODI 12c design using a very simple table function wrapper around the PLSQL procedure I can cut the time to 3 minutes 14 seconds! Not only that, but I can easily optimize it further to better leverage the Oracle database server by quickly changing the hints - I had a 4 processor machine, so that's about as much as I could squeeze out of it.

Here is my map design in ODI 12c;

The table function wrapper to call the PLSQL procedure is very simple; line 7 is where I call the PLSQL procedure - I use the object instance in the call and pipe the row when the call is made;

  1. create or replace function TF_CALL_PROC(input_values sys_refcursor) return TB_I_OBJ pipelined parallel_enable(PARTITION input_values BY ANY) AS
  2.   out_pipe i_obj := i_obj(null,null,null,null,null,null,null,null,null);
  3. BEGIN
  4.   LOOP
  5.     FETCH input_values INTO out_pipe.prod_id, out_pipe.cust_id,out_pipe.time_id,out_pipe.channel_id,out_pipe.promo_id,out_pipe.quantity_sold,out_pipe.amount_sold;
  6.     EXIT WHEN input_values%NOTFOUND;
  7.     MYPROC(out_pipe.prod_id,out_pipe.status,out_pipe.info);
  8.     PIPE ROW(out_pipe);
  9.   END LOOP;
  10.   CLOSE input_values;
  11.   RETURN;
  12. END;

This is a very simple table function (with enough metadata you could generate it); it uses table function pipelining and parallel capabilities, so I will be able to parallelize all aspects of the generated statement and really leverage the Oracle database. The table function uses the types below; it has to project all of the data used downstream - whereas OWB computed this for you, here you will have to define it.

  create or replace type I_OBJ as object (
   prod_id number,
   cust_id number,
   time_id date,
   channel_id number,
   promo_id number,
   quantity_sold number(10,2),
   amount_sold number(10,2),
   status varchar2(10),
   info number
  );
  create or replace type TB_I_OBJ as table of I_OBJ;
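Once the types and the table function are compiled, you can sanity check the wrapper directly from SQL before wiring it into the ODI mapping - a minimal sketch, assuming a SALES source table with the seven input columns fetched above;

  SELECT *
  FROM   TABLE(TF_CALL_PROC(CURSOR(
           SELECT prod_id, cust_id, time_id, channel_id,
                  promo_id, quantity_sold, amount_sold
           FROM   sales)))
  WHERE  ROWNUM <= 10;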

The physical design in ODI has the PARALLEL(4) hints on my source and target and I enable parallel DML using the begin mapping command within the physical design.

You can see in the image above that when using the Oracle KMs there are options for hints on sources and targets; you can easily set these to take advantage of the hardware resources - tweak them to pump up the throughput!
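Putting the hints and parallel DML together, the statement ODI ends up executing is a single set-based insert over the table function - roughly the shape sketched below, assuming a SALES source and a SALES_TGT target (illustrative, not the literal generated code);

  ALTER SESSION ENABLE PARALLEL DML;

  INSERT /*+ APPEND PARALLEL(t, 4) */ INTO sales_tgt t
  SELECT f.prod_id, f.cust_id, f.time_id, f.channel_id, f.promo_id,
         f.quantity_sold, f.amount_sold, f.status, f.info
  FROM   TABLE(TF_CALL_PROC(CURSOR(
           SELECT /*+ PARALLEL(s, 4) */
                  s.prod_id, s.cust_id, s.time_id, s.channel_id,
                  s.promo_id, s.quantity_sold, s.amount_sold
           FROM   sales s))) f;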

To summarize, you can see how we can leverage the database to really speed the process up (remember the 400%!). We can still capture the design in ODI and, on top of that, unlike in OWB this approach lets us carry on doing arbitrary data flow transformations after the table function component which invokes our PLSQL procedure - so we could join, lookup and so on. Let me know what you think of this; I'm a huge fan of table functions, I think they afford a great extensibility capability.

Wednesday Jul 23, 2014

ODI 12c - Sessionizing Data with MATCH_RECOGNIZE and SQL Override

The Oracle 12c database introduced a new pattern matching capability that is useful for many different use cases. The example I will use in this blog post within ODI is sessionizing data - some log data with sample URLs, the IP addresses initiating the clicks and their timestamps. To incorporate this in ODI, I will use one of the lesser known capabilities that many users often want to use - the ability to override the generated SQL. I blogged about how this could be done in ODI 11g here and here - in this post I will show how it can be done in 12c.

There are some great posts and presentations from Keith Laker and many other partners and users on the MATCH_RECOGNIZE capability; the post 'Sessionization with 12c pattern matching is super fast' is a great read, check it out! I will base the SQL used here on that post.

My source data has the following information;

  create table clicks (IP varchar2(16), Click_Timestamp timestamp, URL varchar2(200));
  insert into clicks values ('192.65.20.33','03-JUL-2014 11:33:15','stubhub.com');
  insert into clicks values ('192.65.20.33','03-JUL-2014 11:33:21','mlssoccer.com');
  insert into clicks values ('192.65.20.33','03-JUL-2014 11:59:08','netflix.com');
  insert into clicks values ('192.65.9.89','04-JUL-2014 09:02:19','amazon.com');
  insert into clicks values ('192.65.9.89','04-JUL-2014 09:02:26','amazon.com/travel');
  insert into clicks values ('192.65.9.89','04-JUL-2014 09:48:27','citibank.com');
  insert into clicks values ('192.65.20.33','04-JUL-2014 10:01:05','ebay.com');
  commit;

The type of query we need to generate a session ID using MATCH_RECOGNIZE is shown below. As in Keith's blog, we defined a session as a sequence of one or more events with the same partition key (IP address) where the gap between the timestamps is less than 10 seconds - in this demo, the gap is completely arbitrary and could be set to any number as required. Here we detect the various sessions and assign a surrogate session id to each session within each partition (IP). 

  SELECT ip, session_id, start_time, no_of_events, duration
  FROM clicks MATCH_RECOGNIZE
   (PARTITION BY ip ORDER BY click_TimeStamp
    MEASURES match_number() session_id,
             count(*) as no_of_events,
             first(click_timestamp) start_time,
             last(click_timestamp) - first(click_timestamp) duration
    ONE ROW PER MATCH
    PATTERN (b s*)
    DEFINE
         s as ((s.click_TimeStamp - prev(s.click_TimeStamp)) <= TO_DSINTERVAL('0 0:00:10.0'))
   );

The Oracle syntax (...FROM clicks MATCH_RECOGNIZE....) for this operation is a little different from other SQL constructs; it isn't a join and it isn't a table function - it might one day surface as a custom component, but until that's public we need another way. You could create a view and use that in ODI. Another option is to use the SQL override, custom template capability. This lets you define arbitrary code for a source component in a mapping. Source datastores in mappings have a physical option named CUSTOM_TEMPLATE. You can define code here, such as the SQL to produce the sessionized data above. The code can use ODI's substitution reference methods to make it more generic (parameterized schemas and so on). For example, the physical map design below defines a mapping with a source datastore defining the attributes projected from my query. I defined the datastore in an ODI model - it isn't actually in a database, it's just used in the mapping.

As its description states, 'Custom SQL statement as an inline view to override default code generation.' is exactly what this option is for.

 

Note this option is under the extract options on the source datastore's physical node; the value for CUSTOM_TEMPLATE can be edited and arbitrary code, in this case SQL, defined. Double-clicking the value field in the KM options table brings up an editor where you can see the entire code block I have defined. Note below, I have hard-wired the clicks table name in the SQL;

To make this more generic and flexible, I can use the odiRef methods to get the name of the clicks table generated based on the topology defined in ODI. My clicks table is in a model in ODI, and the use of odiRef.getObjectName below lets me get the name correctly generated depending on the generation context - perhaps in one context it is in schema DEVX and in another it is in DEVY; using the SDK shields us from such changes.

You will also note that there is another KM option named SELECT_HINT; this is used by the default 12c code generation, but if you override the code using the custom template described above you can still incorporate the hints supplied in the mapping. So if you define a value for the KM option SELECT_HINT you can have it included in your custom code by using the getOption method - pretty flexible!
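Pulling those pieces together, the CUSTOM_TEMPLATE value ends up looking something like the fragment below - a sketch only; the odiRef calls (the 'CLICKS' model code and the logical/data schema flags) need to match your own models and topology;

  SELECT /*+ <%=odiRef.getOption("SELECT_HINT")%> */
         ip, session_id, start_time, no_of_events, duration
  FROM   <%=odiRef.getObjectName("L", "CLICKS", "D")%> MATCH_RECOGNIZE
         (PARTITION BY ip ORDER BY click_timestamp
          MEASURES match_number() session_id,
                   count(*) as no_of_events,
                   first(click_timestamp) start_time,
                   last(click_timestamp) - first(click_timestamp) duration
          ONE ROW PER MATCH
          PATTERN (b s*)
          DEFINE
               s as ((s.click_timestamp - prev(s.click_timestamp)) <= TO_DSINTERVAL('0 0:00:10.0'))
         )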

Hope this is a useful insight into some of the capabilities provided in ODI 12c. I used the latest and greatest 12.1.3 release in this post, which is what you will need to use this functionality.

Tuesday Jul 15, 2014

Oracle GoldenGate 12c Capture for SQL Server

In Oracle GoldenGate 12c we have expanded the heterogeneity of the product by adding support for the latest versions of the major databases. 

In Oracle GoldenGate 12.1.2.0.1 for SQL Server, which is available for download on My Oracle Support, we offer support for capturing from and delivering to SQL Server 2012. Oracle GoldenGate 12c also supports SQL Server 2008 and SQL Server 2008 R2 Capture and Delivery.

Two other highly anticipated new features in Oracle GoldenGate 12c for SQL Server are the ability for Capture to read from SQL Server compressed transaction log backups, and the ability to remove the need for Capture to read from transaction log backups at all, which matters when customers use 3rd party transaction log backup utilities.

Some of the key capabilities of Oracle GoldenGate for SQL Server are as follows:

  • Oracle GoldenGate now supports Capture from SQL Server 2012 beginning with GoldenGate 12.1.2.0.1.
  • Oracle GoldenGate for SQL Server can be configured to Capture from tables without Primary Keys.
  • Oracle GoldenGate 12c for SQL Server supports SQL Server native backup compression and includes the optional configuration to never have to read from transaction log backups, which enables any log backup utility to be used.
  • The Capture process can work in conjunction with SQL Server Transactional Replication and Change Data Capture.
  • Oracle GoldenGate for SQL Server can provide real-time data to heterogeneous target databases, and receive real-time data from heterogeneous databases.

A complete list of supported features and details on how to implement those features are available in the Oracle GoldenGate 12c Documentation Library, available at the following address: http://docs.oracle.com/goldengate/1212/gg-winux/index.html

I also recommend reading our white paper "Using Oracle GoldenGate 12c with SQL Server" that provides more details on how GoldenGate supports SQL Server databases.

Tuesday Nov 05, 2013

ODI 12c - Aggregating Data

This posting will look at the aggregation component that was introduced in ODI 12c. For many ETL tool users this shouldn't be a big surprise; it's a little different from ODI 11g, but for good reason. You can use this component for composing data with relational-like operations such as sum, average and so forth. Also, Oracle SQL supports special functions called analytic SQL functions; in ODI 12c you can now use a specially configured aggregation component or the expression component for these. In database systems an aggregate transformation is a transformation where the values of multiple rows are grouped together on certain criteria to form a single value of more significant meaning - that's exactly the purpose of the aggregate component.

In the image below you can see the aggregate component in action within a mapping; for how this and a few other examples are built, look at the ODI 12c Aggregation Viewlet here - the viewlet illustrates a simple aggregation being built and then some Oracle analytic SQL such as AVG(EMP.SAL) OVER (PARTITION BY EMP.DEPTNO) built using both the aggregate component and the expression component.
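For the analytic SQL case, the aggregate component is configured so that the analytic attribute is excluded from the automatic group by, and the generated statement becomes a plain projection rather than an aggregate query - something like the sketch below, assuming the classic EMP table;

  -- analytic expression with the attribute excluded from the group by: no GROUP BY is generated
  SELECT e.deptno,
         e.ename,
         AVG(e.sal) OVER (PARTITION BY e.deptno) AS avg_dept_sal
  FROM   emp e;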

In 11g you used to just write the aggregate expression directly on the target. This made life easy for some cases, but it wasn't a very obvious gesture, and it had other drawbacks with the ordering of transformations (aggregate before join/lookup, after set and so forth) and with supporting analytic SQL, for example - there are a lot of postings from creative folks working around this in 11g, anything from customizing KMs to bypassing aggregation analysis in the ODI code generator.

The aggregate component has a few interesting aspects.

1. First and foremost, it defines the attributes projected from it - ODI automatically performs the grouping; all you do is define the aggregation expressions for the columns that are aggregated. In 12c you can control this automatic grouping behavior so that you get the code you desire - you can indicate that an attribute should not be included in the group by, which is what I did in the analytic SQL example using the aggregate component.

2. The component has a few other properties of interest; it has a HAVING clause and a manual group by clause. The HAVING clause includes a predicate used to filter rows resulting from the GROUP BY clause. Because it acts on the results of the GROUP BY clause, aggregation functions can be used in the HAVING clause predicate. In 11g the filter was overloaded and used for both the having clause and the filter clause; this is no longer the case. If a filter is after an aggregate, it is simply a filter after the aggregate (not sometimes a filter, sometimes a having).

3. The manual group by clause lets you use specialized database grouping grammar if you need to. For example, Oracle has a wealth of highly specialized grouping capabilities for data warehousing, such as the CUBE function. If you want to use specialized functions like that you can manually define the code here. The example below shows the use of a manual group by from an example in the Oracle database data warehousing guide, where the SUM aggregate function is used along with the CUBE function in the group by clause.

The SQL I am trying to generate looks like the following from the data warehousing guide;

  SELECT channel_desc, calendar_month_desc, countries.country_iso_code,
        TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
  FROM sales, customers, times, channels, countries
  WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND
    sales.channel_id= channels.channel_id
   AND customers.country_id = countries.country_id
   AND channels.channel_desc IN
    ('Direct Sales', 'Internet') AND times.calendar_month_desc IN
    ('2000-09', '2000-10') AND countries.country_iso_code IN ('GB', 'US')
  GROUP BY CUBE(channel_desc, calendar_month_desc, countries.country_iso_code);

I can capture the source datastores, filters and joins using ODI's dataset (or as a traditional flow), which lets us incrementally design the mapping, and add the aggregate component for the sum and group by as follows;

In the above mapping you can see the joins and filters declared in ODI's dataset, allowing you to capture the relationships of the datastores required in an entity-relationship style, just like ODI 11g. The mix of ODI's declarative design and the common flow design provides a familiar design experience. The example below illustrates flow design (basic arbitrary ordering) - a table load where only the employees who have the maximum commission are loaded into a target. The maximum commission is retrieved from the bonus datastore, and there is a lookup using employees as the driving table with only those rows having the maximum commission projected.
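A sketch of the kind of SQL that last flow produces is below, assuming the classic EMP and BONUS tables - illustrative only, since the exact statement depends on the KM and the lookup type chosen;

  SELECT e.empno, e.ename, e.comm
  FROM   emp e,
         (SELECT MAX(comm) AS max_comm FROM bonus) b
  WHERE  e.comm = b.max_comm;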

Hopefully this has given you a taster of some of the new capabilities provided by the aggregate component in ODI 12c. In summary, the actions should be much more consistent in behavior and more easily discoverable for users; the use of the components in a flow graph also supports arbitrary designs, and the tool (rather than the interface designer) takes care of the realization using ODI's knowledge modules.

I'm interested to know whether a deep dive into each component would be useful for folks. Any thoughts?

Thursday Dec 06, 2012

ODI 11g - Cleaning control characters and User Functions

In ODI, user functions really have a poor name - they should be called user expressions: a way of wrapping common expressions that you may wish to reuse many times, and reuse across many different technologies as an added bonus. To illustrate, look at the problem of how to remove control characters from text. Users ask these types of questions across all technologies - Microsoft SQL Server, Oracle, DB2 - and have for many years: how do I clean a string, how do I tokenize a string and so on. After some searching around you will find a few ways of doing this; in Oracle there is a convenient way using the TRANSLATE and REPLACE functions. So you can convert some text using the following SQL;

  • replace( translate('This is my string'||chr(9)||' which has a control character', chr(3)||chr(4)||chr(5)||chr(9), chr(3) ), chr(3), '' )

If you had many columns to perform this kind of transformation on, in the Oracle database the natural solution would be to code this as a PLSQL function, since you don't want the code splattered everywhere. But if someone then tells you there is another control character that needs to be added, you have a maintenance headache; and coding it as a PLSQL function incurs a context switch between SQL and PLSQL, which could prove costly.

In ODI, user functions let you capture this expression text and reference it many times across your mappings. This protects the expression from being copy-pasted by developers and makes maintenance much simpler - you change the expression definition in one place.

Firstly, define a name and a syntax for the user function. I am calling it UF_STRIP_BAD_CHARACTERS and it has one parameter, an input string;

We then define an implementation for each technology we will use it with. I will define Oracle's using the inputString parameter and the TRANSLATE and REPLACE functions with whatever control characters I want to replace;
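The Oracle implementation is just the expression shown earlier with the literal string swapped for the function's parameter - a sketch, assuming ODI's $(paramName) syntax for referencing the inputString argument;

  replace( translate( $(inputString),
                      chr(3)||chr(4)||chr(5)||chr(9),
                      chr(3) ),
           chr(3), '' )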

I can then use this inside mapping expressions in ODI; below I am cleaning the ENAME column - a fabricated example, but you get the gist.

Note that when I use the user function, the function name remains in the text of the mapping; the actual expression is not substituted until I generate the scenario. If you generate the scenario and export it, you can have a peek at the code that is processed at runtime - below you can see a snippet of my exported scenario;

 That's all for now, hopefully a useful snippet of info.

Wednesday Oct 31, 2012

Ameristar Wins with Oracle GoldenGate’s Heterogeneous Real-Time Data Integration

Today we announced a press release about another successful project with Oracle GoldenGate, this time at Ameristar. Ameristar is a casino gaming company that needed a single data integration solution to connect multiple heterogeneous systems to its Teradata data warehouse.

The project involves integration of Ameristar’s promotional and gaming data from 14 data sources across its 7 casino hotel properties in real time into a central Teradata data warehouse. The source systems include the Aristocrat gaming and MGT promotional management platforms running on Microsoft SQL Server 2000 databases.

As you can see, there was no Oracle Database involved in this project, but Ameristar's IT leadership knew that GoldenGate's strong heterogeneous and real-time data integration capabilities were the right technology for their data warehousing project. With GoldenGate, Ameristar was able to reduce data latency to the enterprise data warehouse and put this real-time customer information in the hands of marketing teams to improve the overall customer experience. Ameristar customers receive more targeted and timely campaign offers, and the company has more up-to-date visibility into its financial metrics.

One other key benefit the company experienced with GoldenGate is in operational costs. The previous data capture solution Ameristar used was trigger-based and required a lot of effort to manage; they needed dedicated IT staff to maintain it. With GoldenGate, the solution runs seamlessly without needing fully dedicated staff, giving the IT team at Ameristar more resources for their other IT projects.

If you want to learn more about GoldenGate and the latest features for Oracle Database and non-Oracle databases, please watch our on demand webcast about Oracle GoldenGate 11g Release 2.

Friday Oct 05, 2012

Hidden Gems: Accelerating Oracle Data Integrator with SOA, Groovy, SDK, and XML

On the last day of Oracle OpenWorld, we had a final advanced session on getting the most out of Oracle Data Integrator through the use of various advanced techniques.

The primary way to improve your ODI processes is to choose the optimal knowledge modules for your load and take advantage of the optimized tools of your database, such as Oracle Data Pump and similar mechanisms in other databases. Knowledge modules also allow you to customize tasks, letting you codify best practices that are consistently applied by all integration developers.

The ODI SDK is another very powerful means to automate and speed up your integration development process. It allows you to automate life cycle management, code comparison, repetitive code generation and changes to your integration projects. The SDK is easily accessible through Java or scripting languages such as Groovy and Jython.

Finally, all Oracle Data Integration products provide services that can be integrated into a larger Service Oriented Architecture. This moves data integration from an isolated environment into an agile part of a larger business process environment. All Oracle data integration products can play a part in this:

  • Oracle GoldenGate can integrate into business event streams by processing JMS queues or publishing new events based on database transactions.
  • Oracle Data Integrator allows full control of its runtime sessions through web services, so that integration jobs can become part of business processes.
  • Oracle Data Service Integrator provides a data virtualization layer over your distributed sources, allowing unified reading and updating for heterogeneous data without replicating and moving data.
  • Oracle Enterprise Data Quality provides data quality services to cleanse and deduplicate your records through web services.
