Wednesday Jan 16, 2013

ODI - Java Table Function for PDF capture

You can leverage the Java table function approach to easily integrate PDFs using an API like iText. I blogged earlier about writing a specific LKM for this; while that is perfectly reasonable, the writing and subsequent maintenance of the KM can be simplified, especially if you have many such APIs, and the code is much cleaner in my opinion. What do you think? It's simplified by using the LKM for a Java table function I posted here, then writing specific table functions for each adapter.

All I did was write the table function wrapper and the makeRow method - this can even support the convention for naming columns from the earlier PDF blog post by overloading the findColumn method (so the query will use COLUMN_n, such as COLUMN_2 - under the hood we parse out the number n and return the appropriate column).
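
For illustration, here is a minimal sketch of such a findColumn overload - not the actual KM code, and it assumes the table function superclass implements java.sql.ResultSet (as the vtis-example classes do);

  public int findColumn(String columnName) throws SQLException
  {
      // Map the COLUMN_n convention onto positions: COLUMN_2 -> field 2
      if (columnName != null && columnName.toUpperCase().startsWith("COLUMN_")) {
          try {
              return Integer.parseInt(columnName.substring("COLUMN_".length()));
          } catch (NumberFormatException nfe) {
              throw new SQLException("Unknown column: " + columnName);
          }
      }
      return super.findColumn(columnName); // fall back to the normal lookup
  }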

The makeRow method produces the columns from a Java Object; in this case the object is a File representing the PDF. Here is a snippet of the code;

  public String[] makeRow(Object obj) throws SQLException
  {
      String[] row = null;
      int i = 0;
      try {
        // The object handed in by the iterator is the PDF File to read
        PdfReader reader = new PdfReader(((File) obj).getPath());
        AcroFields fields = reader.getAcroFields();
        row = new String[ fields.getFields().size() ];
        // One column per form field, in the order the fields are returned
        Iterator it = fields.getFields().entrySet().iterator();
        while (it.hasNext())
          row[i++] = fields.getField(((Map.Entry) it.next()).getKey().toString());
        reader.close();
      } catch (Exception e) { e.printStackTrace(); }
      return row;
  }

The code is very simple. For the table function itself, I simply created a Java Iterator over the array of files in the directory (the al.iterator() call in the final return below); the class I extended from conveniently takes care of the iteration.

  public static ResultSet readCollection(String dirName, String endstr)
      throws SQLException, IOException
  {
    final String suffix = endstr;
    File folder = new File( dirName );
    // Collect the files in the directory that match the extension filter
    File[] listOfFiles = folder.listFiles(new FilenameFilter() {
      public boolean accept(File dir, String name) {
        return name.toLowerCase().endsWith(suffix);
      }
    } );
    ArrayList<File> al = new ArrayList<File>();
    for (int i = 0; i < listOfFiles.length; i++)
      al.add( listOfFiles[i] );
    // Peek at the first PDF (assumes at least one match) to count the form
    // fields; the columns are simply named 1..n
    PdfReader reader = new PdfReader(listOfFiles[0].getPath());
    int sz = reader.getAcroFields().getFields().size();
    reader.close();
    String[] cols = new String[sz];
    for (int j = 0; j < sz; j++)
      cols[j] = Integer.toString(j + 1);
    return new pdf_table( cols, al.iterator() );
  }

The entire Java source for the PDF table function can be found here.

We can assign the LKM to the source set, set the table function name to pdf_table.readCollection, and define the directory to use and the file extension to filter on.

For this case, using the table function, I set the model type for the PDF datastore model to my Derby/JavaDB technology (and not File). This generated the statement SELECT ... from table(PDF_W4('d:\temp\pdfs', 'pdf' )) PDF where (1=1) to load into the work table. This used the exact LKM that I used to extract from MongoDB - I can plug in any Java table function to extract and load data.
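
If you want to see what happens under the covers, here is a hedged sketch (not the KM's generated code verbatim) of registering and querying such a function against an in-memory JavaDB/Derby instance - the PDF_W4 signature and two-column list are illustrative only;

  import java.sql.*;

  public class PdfTableDemo {
      public static void main(String[] args) throws Exception {
          // In-memory Derby - no admin, nothing persisted
          Connection c = DriverManager.getConnection("jdbc:derby:memory:odi;create=true");
          Statement s = c.createStatement();
          // Register the table function; the column list mirrors the ODI datastore
          s.execute("create function PDF_W4(dirName varchar(100), suffix varchar(10)) "
                  + "returns table (COLUMN_1 varchar(50), COLUMN_2 varchar(50)) "
                  + "language java parameter style DERBY_JDBC_RESULT_SET no sql "
                  + "external name 'pdf_table.readCollection'");
          // The same shape of query the LKM generates
          ResultSet rs = s.executeQuery(
              "select * from table(PDF_W4('d:\\temp\\pdfs', 'pdf')) PDF where (1=1)");
          while (rs.next())
              System.out.println(rs.getString(1) + ", " + rs.getString(2));
          c.close();
      }
  }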

Tuesday Jan 15, 2013

ODI - Hive and MongoDB

I've been experimenting with another Hive storage handler, this time for MongoDB; there are a few out there, including this one from MongoDB. The one I have been using supports basic primitive types and also supports read and write, using the standard approach of a storage handler class and custom properties to describe the data mask. This lets you access MongoDB via a Hive external table very easily and abstracts away a lot of integration complexity, which also makes it ideal for use in ODI. I have been using it on my Linux VM, where I have Hive running, to access MongoDB running on another machine. The storage handler is found here; I used it to access the same example I blogged about here. Below is the external table definition;

  ADD JAR /home/oracle/mongo/hive-mongo.jar;

  create external table mongo_emps(EMPNO string, ENAME string, SAL int)
  stored by "org.yong3.hive.mongo.MongoStorageHandler"
  with serdeproperties ( "mongo.column.mapping" = "EMPNO,ENAME,SAL" )
  tblproperties ( "mongo.host" = "<my_mongo_ipaddress>", "mongo.port" = "27017",
                  "mongo.db" = "test", "mongo.collection" = "myColl" );

Very simple. The nice aspect of the Hive external table is the SerDe properties that can be specified - simple, but a nice flexible approach. I can then reverse engineer this into ODI (see the reverse engineering posting here) and use it in my Hive integration mappings to read from, and potentially write to, MongoDB.
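
As a quick illustration, a minimal sketch of reading the external table over JDBC - assuming a HiveServer of this era listening on localhost:10000 (the driver class and URL are the pre-HiveServer2 ones);

  import java.sql.*;

  public class QueryMongoViaHive {
      public static void main(String[] args) throws Exception {
          Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
          Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
          // Each row here is a document in the myColl MongoDB collection
          ResultSet rs = con.createStatement().executeQuery("select EMPNO, ENAME, SAL from mongo_emps");
          while (rs.next())
              System.out.println(rs.getString(1) + " " + rs.getString(2) + " " + rs.getInt(3));
          con.close();
      }
  }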

The supported primitive types can also carry nested document types. For example, in the document below (taken from here), name, contribs and awards are projected as strings but contain JSON structures;

  {
    "_id" : 1,
    "name" : { "first" : "John", "last" : "Backus" },
    "birth" : ISODate("1924-12-03T05:00:00Z"),
    "death" : ISODate("2007-03-17T04:00:00Z"),
    "contribs" : [ "Fortran", "ALGOL", "Backus-Naur Form", "FP" ],
    "awards" : [
      { "award" : "W.W. McDowell Award", "year" : 1967, "by" : "IEEE Computer Society" },
      { "award" : "National Medal of Science", "year" : 1975, "by" : "National Science Foundation" },
      { "award" : "Turing Award", "year" : 1977, "by" : "ACM" },
      { "award" : "Draper Prize", "year" : 1993, "by" : "National Academy of Engineering" }
    ]
  }

can be processed with the following external table definition, which can then be used in ODI;

  create external table mongo_bios(name string, birth string, death string, contribs string, awards string)
  stored by "org.yong3.hive.mongo.MongoStorageHandler"
  with serdeproperties ( "mongo.column.mapping" = "name,birth,death,contribs,awards" )
  tblproperties ( "mongo.host" = "<my_ip_address>", "mongo.port" = "27017",
                  "mongo.db" = "test", "mongo.collection" = "bios" );
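
Since name, contribs and awards arrive as JSON strings, Hive's get_json_object function can pick individual fields out of them downstream. A hedged sketch, reusing the kind of Hive JDBC connection (con) shown in the earlier sketch;

  // con is a Hive JDBC connection as in the earlier sketch
  ResultSet rs = con.createStatement().executeQuery(
      "select get_json_object(name, '$.first'), "      // -> John
    + "       get_json_object(name, '$.last'), "       // -> Backus
    + "       get_json_object(awards, '$[0].award') "  // -> W.W. McDowell Award
    + "from mongo_bios");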

All very simple, and that's what makes it so appealing. Anyway, that's a quick follow-on about using external tables with MongoDB and Hive, complementing the SQL-oriented approach I described here that used Java table functions.

Monday Jan 14, 2013

ODI - Integrating PDF using iText

Integrating PDF form data with ODI is as easy as any other data source - see the code template from Suraj Bang a few years back with OWB, which I followed and converted to ODI, with a few changes for the newer version of iText. The LKM PDF to SQL uses ODI's File technology as a source and loads to any ANSI SQL store. The LKM reads all of the form fields from a PDF form and supports the File technology's String and Numeric datatypes. The fields have to be defined by position, so your field names in the file can be MYCOL_1, MYCOL_2, MYCOL_3 etc., or whatever prefix you desire. The LKM uses the iText Java API, so the JAR should be downloaded and copied into the normal ODI userlib directory; I tested with version 5.3.5 of iText.
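
To get a feel for the iText calls the LKM makes, here is a short sketch (w4.pdf is just an example path) that opens a form and walks its AcroFields - the field order is what drives the positional MYCOL_n mapping;

  import java.io.IOException;
  import java.util.Map;
  import com.itextpdf.text.pdf.AcroFields;
  import com.itextpdf.text.pdf.PdfReader;

  public class ListFormFields {
      public static void main(String[] args) throws IOException {
          PdfReader reader = new PdfReader("w4.pdf");
          AcroFields form = reader.getAcroFields();
          int pos = 1;
          // Print each field's position, name and current value
          for (Map.Entry<String, AcroFields.Item> e : form.getFields().entrySet())
              System.out.println("MYCOL_" + (pos++) + " <- " + e.getKey() + " = " + form.getField(e.getKey()));
          reader.close();
      }
  }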

The LKM PDF to SQL is here. I manually defined a File datastore with column names (remembering to use the position as the suffix) and types; the delimiter and similar information are not used since it's PDF, but I set them anyway. The example I used processes W4 PDF data, the same as in Suraj's post, and was able to process all of the PDFs in the same manner. Kudos to Suraj for the Jython, with a few little changes from me to use the latest iText.

Friday Jan 11, 2013

ODI - Java Table Function for MongoDB

Behind the scenes of the MongoDB posting was a very simple JavaDB/Derby table function. The function implements a couple of methods - the table function readCollection, and makeRow, which creates an array from a Java Object. It can't get much simpler. The iteration through the collection is handled by the EnumeratorTableFunction class I extended, which came from the posting by Rick Hillegas, and it fits nicely into ODI's source/target generic integration task in the KM framework. Here is a viewlet I have created showing you everything very briefly, but end to end.

The makeRow function uses the MongoDB Java SDK and produces a row for each BasicDBObject; each value in the document is output as a column in the table. Nested/complex values are serialized as Java Strings, so you will get a JSON string for anything complex.

  public String[] makeRow(Object obj) throws SQLException
  {
    int idx = 0;
    BasicDBObject dbo = (BasicDBObject) obj;
    Iterator it = dbo.entrySet().iterator();
    String[] row = new String[ getColumnCount() ];
    it.next(); // skip the _id column
    while (it.hasNext()) {
      Map.Entry pairs = (Map.Entry) it.next();
      row[ idx++ ] = pairs.getValue().toString();
    }
    return row;
  }

The readCollection table function is a static method with a couple of parameters (for demonstration) - one is the MongoDB database name and the other is the collection name. The function initializes the object instance with the column names, which are defined to be the key names of the objects in the collection (the first object is taken and its keys are used as the column names);

  public static ResultSet readCollection(String dbName, String collectionName)
    throws SQLException, UnknownHostException
  {
    int idx = 0;
    MongoClient mongoClient = new MongoClient();
    DB db = mongoClient.getDB(dbName);
    DBCollection coll = db.getCollection(collectionName);
    DBCursor cursor = coll.find();
    // Take the first document and use its keys as the column names
    BasicDBObject dbo = (BasicDBObject) coll.findOne();
    Set<String> keys = dbo.keySet();
    String[] skeys = new String[keys.size() - 1]; // minus the skipped _id key
    Iterator it = keys.iterator();
    it.next(); // skip the _id key
    while (it.hasNext()) {
      skeys[idx++] = it.next().toString();
    }
    return new mongo_table( skeys, cursor );
  }

The mongo_table constructor just initializes itself and sets the enumeration to iterate over. The class I extend from is very useful - it can iterate over Java Enumeration, Iterator, Iterable, or array objects. The super class initializes the column names, and setEnumeration defines the collection/iterator, which in this case is a MongoDB DBCursor - which happens to be a Java Iterator<DBObject>.

  public mongo_table(String[] column_names, DBCursor cursor)
    throws SQLException
  {
    super( column_names );    // the superclass records the column names
    setEnumeration( cursor ); // ...and iterates the cursor for us
  }

This approach can be used for sourcing pretty much anything, which is great for integration needs. The ODI Knowledge Module is an LKM and stages the result of the table function into a work table; everything else is then as normal. The KM creates the work table and also registers the table function with JavaDB/Derby. My code for the function registration is as follows;

  create function <%=odiRef.getSrcTablesList("","[TABLE_NAME]", "","")%>( dbName varchar( 330), collName varchar( 30))
  returns table
  (
  <%=odiRef.getSrcColList("","[COL_NAME] [SOURCE_CRE_DT]","[COL_NAME] [SOURCE_CRE_DT]",",\n","")%> )
  language java
  parameter style DERBY_JDBC_RESULT_SET
  no sql
  external name '<%=odiRef.getOption("TABLE_FUNCTION_NAME")%>'

This creates the table function with the same name as the datastore in the interface, and the resultant table of the function has the columns (and types) from that datastore. The external JavaDB function name is taken from the KM option TABLE_FUNCTION_NAME. As I mentioned, I have hard-wired 2 parameters just now. The Java code implementing this should be compiled and put in a JAR in the normal userlib directory for adding custom code, including JDBC drivers. The other JARs needed are the MongoDB Java SDK JAR, derby.jar and vtis-example.jar (from the zip here). You can get the Java source for mongo_table.java here; it is compiled with the MongoDB Java SDK on the classpath as follows (on Windows).

  javac -classpath mongo-2.10.1.jar;vtis-example.jar mongo_table.java
  jar cvf mongo_table.jar mongo_table.class

The LKM is here; it needs to be imported into your project.

Anyway...this wasn't all about MongoDB per se, it was also about the JavaDB table function capability - do any other integration examples spring to mind using this route? I'm going to post about loading into MongoDB and how an IKM is built for this. Interested to hear any ideas/feedback from you on this....so don't be shy!

Customer Spotlight: Raymond James

Back in October we announced that Raymond James is a winner of the 2012 Fusion Middleware Innovation Awards in the Data Integration category. As I mentioned on this site before, Raymond James used Oracle Data Integrator and Oracle GoldenGate in an integrated solution for its operational data store that supports application services across the enterprise. By using Oracle's real-time data integration solution to move data from heterogeneous systems into a centralized ODS, the team achieved higher application performance, reduced complexity and cost, and up to 50 percent faster time to market.

I had the pleasure of meeting the team - Ryan Fornett, Tim Garrod, and Michael Wells - from Raymond James at Oracle OpenWorld 2012. In addition to hearing from them during their presentations, we had a chance to record interviews with Tim and Ryan while at OpenWorld. You can find the links to their videos below:

Yesterday, with their permission, we posted a detailed blog on the FMW blog site about their implementation:

Data Integration Customer Spotlight: Raymond James Financial

I highly recommend reading it to see Oracle Data Integration products in action.

Many thanks to Ryan, Tim, and Mike for sharing the details of their innovative solution with us.


Thursday Jan 10, 2013

ODI - MongoDB and Java Table Functions Anyone?

Let's see how we could integrate MongoDB using ODI, but first take a step back. Derby/JavaDB introduced table functions a few years ago. Table functions are really useful - they are in the Oracle database, and just as much fun in JavaDB! ODI is a great platform for integration, and JavaDB and table functions provide a really nice way to integrate arbitrary Java APIs into your designs. What I have done here is;

  • built a very simple Java table function to project a table representing the documents in a MongoDB collection. The collection is passed as a parameter to the KM, and the column names are the keys of the MongoDB documents. The data comes from the values.
  • built a very simple LKM from a Java table function to SQL

All of this uses JavaDB in-memory, so there is no admin and it is simple to use. Rick Hillegas wrote a nice article with some handy classes that I have used. The mongo_table class I have written uses the EnumeratorTableFunction class included in Rick's examples. The MongoDB DBCursor class is a Java Iterator, which makes it really nice to pass to the EnumeratorTableFunction class and let it do all of the work; the general pattern is sketched below.
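
A hedged skeleton of that pattern (line_table and its Iterator<String> source are hypothetical, and I am assuming the same constructor and setEnumeration signatures that the mongo_table class uses);

  import java.sql.SQLException;
  import java.util.Iterator;

  // EnumeratorTableFunction comes from Rick's vtis-example.jar
  public class line_table extends EnumeratorTableFunction {
      public line_table(String[] columnNames, Iterator<String> rows) throws SQLException {
          super(columnNames);   // the superclass records the column names
          setEnumeration(rows); // ...and drives the iteration for us
      }
      public String[] makeRow(Object obj) throws SQLException {
          return new String[] { (String) obj }; // one column per row
      }
  }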

The LKM I constructed declares the table function to JavaDB/Derby. In the example below, the function is declared based on the source datastore name (MONGO_EMPS) and columns (my source datastore has EMPNO, ENAME and SAL; note the table function will actually project the types defined in ODI's datastore), and the function takes the MongoDB database name and collection name as parameters.

  create function MONGO_EMPS( dbName varchar( 330), collectionName varchar( 30))
  returns table
  (
    EMPNO VARCHAR(20),
    ENAME VARCHAR(30),
    SAL NUMERIC(10)
  )
  language java
  parameter style DERBY_JDBC_RESULT_SET
  no sql
  external name 'mongo_table.readCollection'

Then the actual code using the function as a source is executed from a source task (the target is SQL, as I mentioned earlier for the LKM). Below you can see my execution using the test MongoDB database and the myStuff collection;

  select
    MON.EMPNO   C1_EMPNO,
    MON.ENAME   C2_ENAME,
    MON.SAL     C6_SAL
  from table(MONGO_EMPS('test', 'myStuff' )) MON
  where (1=1)
  And (MON.SAL > 4000)

Note I can also perform some filtering as an example; here it is being done in JavaDB, and in my case in-memory. No setup, no persistence, just on-the-fly Java. Ideally I would push the filter down to MongoDB rather than reading and filtering in the driver - more on that later.
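
In the meantime, a hedged sketch of what that pushdown could look like with the 2.x Java driver - build the filter as a DBObject and let the server do the work; the cursor that comes back could be handed to the mongo_table constructor unchanged;

  import com.mongodb.*;

  public class PushdownSketch {
      public static void main(String[] args) throws Exception {
          MongoClient mongoClient = new MongoClient();
          DBCollection coll = mongoClient.getDB("test").getCollection("myStuff");
          // SAL > 4000, evaluated by MongoDB rather than in JavaDB
          DBObject filter = new BasicDBObject("SAL", new BasicDBObject("$gt", 4000));
          DBCursor cursor = coll.find(filter);
          while (cursor.hasNext())
              System.out.println(cursor.next());
          mongoClient.close();
      }
  }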

I had defined my documents in MongoDB using the following basic commands in the mongo shell;

  use test
  a1 = { EMPNO: "1", ENAME : "Fred", SAL : 10000 }
  a2 = { EMPNO: "2", ENAME : "John", SAL : 2000 }
  db.myStuff.insert( a1 )
  db.myStuff.insert( a2 )

In ODI, I can then simply use the datastore representing the MongoDB collection of documents in an interface and map it to my target;

The physical design uses the LKM JavaTableFunction to SQL KM and sets the MongoDB database, the collection, and the Java table function name.

That's it. Pretty straightforward, and we are reading and consuming MongoDB documents. So what about complex document types like this? These are transported too - more to come. The other point here is that this is a generic LKM into which you can plug other arbitrary table functions, so you can consume from any API; it's very simple. For the LKM I created, I just defined 2 parameters (because my table function only had two), but really we need a better way to handle these and ensure they are ignored if not defined, etc. That's all polishing though. Anyway, fun stuff - and all visible with a small piece of code leveraging JavaDB!

Maximum Availability with Oracle GoldenGate Webcast: On Demand Link & FAQ

Last month we had a live webcast on a hot topic: Maximum Availability with Oracle GoldenGate. The webcast presentation focused on how Oracle GoldenGate can help achieve maximum availability for Oracle Database and other major databases. This webcast was very useful in answering a frequently asked question about GoldenGate's role in the Oracle Database Maximum Availability Architecture (MAA) and when to use Oracle GoldenGate vs Active Data Guard. Product experts Joe Meeks, from the Oracle Database High Availability product management team, and Nick Wagner, from the Oracle GoldenGate product management team, addressed this question for us. Joe and Nick presented GoldenGate's key solutions within MAA and when to use which product, along with the Integrated Capture and Automated Conflict Management features of GoldenGate 11gR2. If you missed this event, I highly recommend watching it on demand via the following link.

Watch On Demand: Maximum Availability with Oracle GoldenGate.

We also recorded a short podcast on this topic: Oracle GoldenGate 11gR2 for Maximum Availability.

In the live webcast we had a very interactive audience who submitted over 100 questions during the event. Joe and Nick answered the most commonly asked questions about GoldenGate’s use cases and technology. Here are a few of the questions that we did not have time to answer during the event, along with the answers from the PM team.

  • Can I use the target database as a warehouse for select-only operations? If I were to allow users to make changes to this target, are there steps to ensure both source and target databases remain in sync?

Yes, you can use the target for read-only or select-only operations. And yes, we can do bi-directional active-active replication if you want to allow users to make changes to the target system; we can replicate those changes back to the source.

  • Can Integrated Capture be run on Oracle Database Standard Edition?

Yes, but the features that are only supported by Integrated Capture are available only with Oracle Database Enterprise Edition.

  • Is it possible to capture database transactions from a source database using GoldenGate and publish them as messages into a JMS queue?

Yes, GoldenGate can capture from and deliver to JMS queues. Please see our page on the Java Adapter for more info.

  • Can GoldenGate be used in environments where a number of tables do not have primary keys?

Yes. For tables without primary keys, we will use all non-LOB columns in the where clause for updates and deletes, and then use ROWNUM = 1.

  • Is there a webinar that shows how to set up Oracle GoldenGate?

The Oracle GoldenGate YouTube channel has many useful videos and labs, including “Oracle GoldenGate Fundamentals and How Oracle GoldenGate Works” and the three-part “Oracle GoldenGate Deep Dive Hands-on Labs” series. You can browse and watch them here. Please check out GoldenGate training sessions with Oracle University as well.

I highly recommend subscribing to GoldenGate's YouTube channel, along with joining the GoldenGate Facebook and Twitter communities, to get regular updates on the product.

Wednesday Jan 02, 2013

ODI - Hive and NoSQL, the code

This post includes the Java client demonstration code used in the Hive and NoSQL post illustrated here. The BasicBigData.java code is a NoSQL client that populates a key-value store, which is queryable using the Hive external table from that post. It didn't take long to code - just a few peeks at the NoSQL javadoc to get it going. You can take this Java code and compile and run it (instructions for compiling are similar to the verification demo here - it is very easy).

The Java code uses the NoSQL major/minor path constructor to describe the Key. Below is a snippet defining the birthdate for Bob Smith;

  ArrayList<String> mjc1 = new ArrayList<String>();
  mjc1.add("Smith");
  mjc1.add("Bob");
  ...
  ArrayList<String> mnrb = new ArrayList<String>();
  mnrb.add("birthdate");
  ...
  store.put(Key.createKey(mjc1,mnrb),Value.createValue("05/02/1975".getBytes()));
  ...
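
To close the loop, here is a hedged read-back of that value, continuing the same snippet (store, mjc1 and mnrb are as above);

  // Retrieve the value just written, using the same major/minor path
  ValueVersion vv = store.get(Key.createKey(mjc1, mnrb));
  if (vv != null)
      System.out.println(new String(vv.getValue().getValue())); // 05/02/1975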

In the referenced post, to actually aggregate the key values, we used the Hive collect_set aggregation function (see here for Hive aggregation functions). The collect_set aggregation function returns a set of objects with duplicates eliminated. To get the aggregation function behavior in ODI with the correct group by, we must tell ODI about the Hive aggregation function: we can define a new language element for collect_set in the Topology tree, define the element as a group function, and define the expression for Hive under the Implementation tab;

We are then able to define expressions which reference this aggregation function and get the exact syntax defined in the earlier post. Below we see the Hive expressions using collect_set;

From this design, and the definition of the aggregation function in ODI, when it is executed you can see the generated Hive QL with the correct columns in the grouping function;
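
The generated statement is shaped roughly like this hedged illustration (table and column names are hypothetical, not taken from the original post);

  // stmt is a java.sql.Statement on a Hive JDBC connection, as sketched in an earlier post
  ResultSet grouped = stmt.executeQuery(
      "select NOSQL.lastname, NOSQL.firstname, collect_set(NOSQL.minor_key) "
    + "from nosql_ext_table NOSQL "
    + "group by NOSQL.lastname, NOSQL.firstname");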

The target Hive datastore in the interface I defined has been loaded with the key values from the NoSQL keystore - cool!

Those are a few of the missing pieces which would let you query NoSQL through Hive external tables, hopefully some useful pointers. 
