ODI - Java Table Function for PDF capture

You can leverage the Java table function approach to easily integrate PDF data using an API like iText. I blogged earlier about writing a specific LKM for this. Although that is perfectly reasonable, writing and then maintaining a dedicated KM becomes a burden, especially if you have many such APIs. The table function approach simplifies this, and the code is also much cleaner in my opinion. What do you think? It's simplified by using the LKM for a Java table function I posted here, then writing specific table functions for adapters.

All I did was write the table function wrapper and the makeRow method. This can even support the column naming convention from the earlier PDF blog post by overriding the findColumn method (so the query will use COLUMN_n, such as COLUMN_2; under the hood, we get the number n and return the appropriate column).
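
As a rough illustration of the COLUMN_n convention, here is a minimal sketch of the name-to-index lookup. The class name ColumnNames and the columnCount parameter are hypothetical and are not taken from the linked source; the real table function would do this inside its own findColumn method.

```java
import java.sql.SQLException;

/** Hypothetical helper: maps a COLUMN_n name to the 1-based column index n. */
final class ColumnNames {
    private static final String PREFIX = "COLUMN_";

    private ColumnNames() {}

    /**
     * Returns n for a name such as "COLUMN_2". The prefix match is case-insensitive.
     * Throws SQLException for any name that does not follow the convention
     * or whose index falls outside 1..columnCount.
     */
    static int findColumn(String columnName, int columnCount) throws SQLException {
        if (columnName != null
                && columnName.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
            try {
                int n = Integer.parseInt(columnName.substring(PREFIX.length()));
                if (n >= 1 && n <= columnCount) {
                    return n;
                }
            } catch (NumberFormatException e) {
                // Not a number after the prefix: fall through to the SQLException below.
            }
        }
        throw new SQLException("Unknown column: " + columnName);
    }
}
```

With five columns, ColumnNames.findColumn("COLUMN_2", 5) yields 2, while a name such as COLUMN_6 or FOO is rejected with an SQLException.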

The makeRow method produces the columns from a Java Object. The object is a File and represents the PDF file. Here is a snippet of the code:

    public String[] makeRow(Object obj) throws SQLException
    {
        // Each row is built from one PDF file; every AcroForm field becomes a column.
        PdfReader reader = null;
        try {
          reader = new PdfReader(((File)obj).getPath());
          AcroFields form = reader.getAcroFields();
          Set<String> names = form.getFields().keySet();
          String[] row = new String[ names.size() ];
          int i = 0;
          for (String name : names)
            row[i++] = form.getField(name);
          return row;
        } catch (IOException e) {
          // Report the failure to the caller instead of returning a null row.
          throw new SQLException("Cannot read PDF form fields: " + e.getMessage());
        } finally {
          if (reader != null) reader.close();
        }
    }

The code is very simple. For the table function itself, I simply created a Java Iterator over the array of files in the directory (see the return statement at the end of the method below); the class I extended from conveniently takes care of the iteration.

    public static ResultSet readCollection(String dirName, String endstr)
        throws SQLException, UnknownHostException, IOException
    {
      // Lower-case the suffix so it compares correctly with the lower-cased file name.
      final String suffix = endstr.toLowerCase();
      File folder = new File( dirName );
      File[] listOfFiles = folder.listFiles(new FilenameFilter() {
        public boolean accept(File dir, String name) {
          return name.toLowerCase().endsWith(suffix);
        }
      } );
      // listFiles returns null for a missing directory; an empty array would fail at index 0 below.
      if (listOfFiles == null || listOfFiles.length == 0)
        throw new SQLException("No files ending in '" + endstr + "' found in " + dirName);
      ArrayList<File> al = new ArrayList<File>(Arrays.asList(listOfFiles));
      // The first file defines the column count; columns are named "1", "2", ...
      PdfReader reader = new PdfReader(listOfFiles[0].getPath());
      int sz = reader.getAcroFields().getFields().size();
      reader.close();
      String[] cols = new String[sz];
      for (int j = 0; j < sz; j++)
        cols[j] = Integer.toString(j + 1);
      return new pdf_table( cols, al.iterator() );
    }
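
To make the division of labour concrete, here is a minimal sketch of the pattern: a base class owns the Iterator and the adapter only supplies makeRow. IteratorRows and CsvLineRows are hypothetical names invented for this illustration. They are not the Sun sample classes, and the toy adapter splits strings rather than reading PDFs so that it runs with the JDK alone.

```java
import java.util.Iterator;
import java.util.List;

/**
 * Hypothetical stand-in for the base class: it owns the Iterator and asks the
 * subclass to turn each element into a row of column values.
 */
abstract class IteratorRows<T> {
    private final String[] columnNames;
    private final Iterator<T> source;
    private String[] currentRow;

    IteratorRows(String[] columnNames, Iterator<T> source) {
        this.columnNames = columnNames.clone();
        this.source = source;
    }

    /** The only method an adapter has to supply. */
    protected abstract String[] makeRow(T element);

    /** Advances to the next element, in the spirit of ResultSet.next(). */
    boolean next() {
        if (!source.hasNext()) {
            currentRow = null;
            return false;
        }
        currentRow = makeRow(source.next());
        return true;
    }

    /** Returns the value of the 1-based column in the current row. */
    String getString(int columnIndex) {
        if (currentRow == null) {
            throw new IllegalStateException("no current row");
        }
        return currentRow[columnIndex - 1];
    }

    int getColumnCount() {
        return columnNames.length;
    }
}

/** Toy adapter: splits "a,b" strings into two columns instead of reading PDFs. */
final class CsvLineRows extends IteratorRows<String> {
    CsvLineRows(List<String> lines) {
        super(new String[] {"1", "2"}, lines.iterator());
    }

    @Override
    protected String[] makeRow(String line) {
        return line.split(",", -1);
    }
}
```

A new adapter for another API then only needs its own makeRow and a factory method that hands an Iterator to the base class, which is the appeal of the approach compared with one KM per API.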

The entire Java source for the PDF table function can be found here.

We can assign the LKM to the source set and set the table function name to pdf_table.readCollection, define the directory to use and the file extension to filter.

For this case, using the table function, I set the technology of the PDF datastore's model to my Derby/JavaDB technology (and not File). This generated the SQL statement SELECT........from table(PDF_W4('d:\temp\pdfs', 'pdf' )) PDF where (1=1) to load into the work table. This used the exact LKM that I used to extract from MongoDB, and I can write any Java table function to extract data and load it.
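
For Derby to resolve a call such as table(PDF_W4(...)), a table function with that name has to be registered against the Java method. The sketch below builds a CREATE FUNCTION statement of that kind using Derby's documented table function syntax. The helper class TableFunctionDdl, the parameter names DIRNAME and SUFFIX, the VARCHAR sizes and the COLUMN_n column names are all assumptions made for illustration; this post does not show the statement the LKM actually issues.

```java
/** Hypothetical helper that builds Derby DDL for registering a Java table function. */
final class TableFunctionDdl {
    private TableFunctionDdl() {}

    /**
     * Builds a CREATE FUNCTION statement whose result table has columnCount
     * VARCHAR columns named COLUMN_1..COLUMN_n (names and sizes are assumptions).
     */
    static String createFunction(String functionName, String externalName, int columnCount) {
        if (columnCount < 1) {
            throw new IllegalArgumentException("columnCount must be at least 1");
        }
        StringBuilder sql = new StringBuilder();
        sql.append("CREATE FUNCTION ").append(functionName)
           .append(" (DIRNAME VARCHAR(256), SUFFIX VARCHAR(16))\n")
           .append("RETURNS TABLE (");
        for (int n = 1; n <= columnCount; n++) {
            if (n > 1) {
                sql.append(", ");
            }
            sql.append("COLUMN_").append(n).append(" VARCHAR(256)");
        }
        sql.append(")\n")
           .append("LANGUAGE JAVA\n")
           .append("PARAMETER STYLE DERBY_JDBC_RESULT_SET\n")
           .append("READS SQL DATA\n")
           .append("EXTERNAL NAME '").append(externalName).append("'");
        return sql.toString();
    }
}
```

Calling TableFunctionDdl.createFunction("PDF_W4", "pdf_table.readCollection", 2) produces a statement that binds the SQL name PDF_W4 to the static readCollection method, so the generated SELECT can call it with the directory and suffix as arguments.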

Comments:

Hi David,

I tried downloading the latest version of itextpdf & vti-example, ran javac to compile per your instruction on the java table function post, but while trying to run it, I get "main" java.lang.NoClassDefFoundError: pdf_table.

I then updated my CLASSPATH env variable to point to itextpdf.jar, vti-example.jar, and the new pdf_table.jar; now I have a different error showing: "java:32: cannot find symbol \n symbol : class PdfReader \n PdfReader reader = new PdfReader(((File)listOfFiles[0]).getPath());"

Could you please advise what I'm missing here? Thanks in advance!!

Posted by guest on August 11, 2014 at 10:42 PM PDT #

Hi

The entire Java source is linked in the blog post ... look for 'The entire Java source for the PDF table function can be found here.' This needs to be compiled/jar'd.

Cheers
David

Posted by guest on August 12, 2014 at 08:20 AM PDT #

Thanks for the prompt reply, David. Actually, I did grab the Java source from the link in the post. The problem I am having is executing the jar file after compiling and creating the new pdf_table.jar file. See my steps below (also, I am using JDK 1.6):

Script: javac -classpath itextpdf-5.5.2.jar;vtis-example.jar pdf_table.java
Result: pdf_table$1.class and pdf_table.class were generated.

Script: jar -cvf pdf_table.jar pdf_table.java
Result: "added manifest adding: pdf_table.java (in = 2362) (out=881)(deflated 62%)"
pdf_table.jar was generated

Script: java -jar pdf_table.jar
Result: "Failed to load Main-Class manifest attribute from pdf_table.jar"

Posted by guest on August 12, 2014 at 08:26 AM PDT #

So it looks like the main-class is not being recognized. I then created a manifest.txt with the following:

Manifest-Version: 1.0
Main-Class: pdf_table
Class-Path: D:/temp/itextpdf-5.5.2.jar D:/temp/vtis-example.jar

Ran Script: jar -cfm pdf_table.jar manifest.txt pdf_table.class
Result: Exception in thread "main" java.lang.NoClassDefFoundError: sun/javadb/vti/core/EnumeratorTableFunction
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Caused by: java.lang.ClassNotFoundException: sun.javadb.vti.core.EnumeratorTableFunction
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 12 more
Could not find the main class: pdf_table. Program will exit.

Posted by guest on August 12, 2014 at 08:33 AM PDT #

Hi

You are jar'ing the java source file. You should put the *.class files in the jar.

Cheers
David

Posted by David on August 12, 2014 at 08:45 AM PDT #

Do you mean put pdf_table$1.class and pdf_table.class into the pdf_table.jar?

Or do you mean manually extract all the required classes from the itext.jar and vti-example.jar and put them in the pdf_table.jar as well?

Posted by guest on August 12, 2014 at 08:50 AM PDT #

Yes I mean the pdf*class files generated from the javac should be put into the pdf_table.jar file. Something like;

jar -cvf pdf_table.jar pdf*.class

Cheers
David

Posted by David on August 12, 2014 at 08:53 AM PDT #

Hi David,

I've done this before and no dice... I just did it again to confirm:

Script: jar -cvf pdf_table.jar pdf*.class
Result: I now see both classes added into the jar.

Script: java -jar pdf_table.jar
Result: "Failed to load Main-Class manifest attribute from pdf_table.jar"

====
So I tried...

Script: jar cvfm pdf_table.jar manifest.txt pdf*.class
Result same as before: Exception in thread "main" java.lang.NoClassDefFoundError: sun/javadb/vti/core/EnumeratorTableFunction...

Posted by guest on August 12, 2014 at 09:16 AM PDT #

When the Java is compiled and put in a JAR, the JAR needs to be copied to the ODI userlib directory. You don't run 'java' on the resultant jar; ODI will use that compiled class when the interface/mapping using it is executed.

Cheers
David

Posted by David on August 12, 2014 at 09:19 AM PDT #

Thank you so much for your help so far, David.

I should've known to try and run the interface first. Looks like the classes are being called correctly from ODI. However I am getting an array error during the Loading step as follows:

java.sql.SQLException: The exception 'java.lang.ArrayIndexOutOfBoundsException: 0' was thrown while evaluating an expression

Any ideas? The directory I am pointing to has 2 pdf files and some other subdirectories and files.

Posted by guest on August 12, 2014 at 09:55 AM PDT #

Can you make PDF available?

Cheers
David

Posted by David on August 12, 2014 at 10:07 AM PDT #

Hi David,

Sorry, the PDF was a personal file I was using for testing, so I switched to the 2014 W4 as per your example, found here (http://www.irs.gov/pub/irs-pdf/fw4.pdf).

I re-ran the interface and this is what I get:
java.sql.SQLException: The exception 'java.sql.SQLException: Unimplemented method: notImplemented' was thrown while evaluating an expression.
20000 : null : java.sql.SQLException: Unimplemented method: notImplemented

Posted by guest on August 12, 2014 at 10:16 AM PDT #

Here's a thought, I only downloaded the derby.jar library file and not the bin. Would that be the reason? Is it expecting a fully installed derbydb?

Posted by guest on August 12, 2014 at 10:24 AM PDT #

Just the jar is all you need. Did you type the table function name exactly as it is in the blog? The JavaDB code will call on down through to the Java class method. Do you have a full stack trace?

Cheers
David

Posted by David on August 12, 2014 at 11:44 AM PDT #

Hi David, this is what I was able to gather from the SNP_SESS_TASK_LOG:

20000 : 38000 : java.sql.SQLException: The exception 'java.sql.SQLException: Unimplemented method: notImplemented' was thrown while evaluating an expression.
20000 : null : java.sql.SQLException: Unimplemented method: notImplemented
java.sql.SQLException: The exception 'java.sql.SQLException: Unimplemented method: notImplemented' was thrown while evaluating an expression.
at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedResultSet.closeOnTransactionError(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedResultSet.movePosition(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedResultSet.next(Unknown Source)
at oracle.odi.runtime.agent.execution.sql.concurrent.FastJDBCRecordSet.<init>(FastJDBCRecordSet.java:86)
at oracle.odi.runtime.agent.execution.sql.SQLDataProvider.getJDBCRecordSet(SQLDataProvider.java:130)
at oracle.odi.runtime.agent.execution.sql.SQLDataProvider.readData(SQLDataProvider.java:102)
at oracle.odi.runtime.agent.execution.sql.SQLDataProvider.readData(SQLDataProvider.java:1)
at oracle.odi.runtime.agent.execution.DataMovementTaskExecutionHandler.handleTask(DataMovementTaskExecutionHandler.java:70)
at com.sunopsis.dwg.dbobj.SnpSessTaskSql.processTask(SnpSessTaskSql.java:2913)
at com.sunopsis.dwg.dbobj.SnpSessTaskSql.treatTask(SnpSessTaskSql.java:2625)
at com.sunopsis.dwg.dbobj.SnpSessStep.treatAttachedTasks(SnpSessStep.java:577)
at com.sunopsis.dwg.dbobj.SnpSessStep.treatSessStep(SnpSessStep.java:468)
at com.sunopsis.dwg.dbobj.SnpSession.treatSession(SnpSession.java:2128)
at oracle.odi.runtime.agent.processor.impl.StartSessRequestProcessor$2.doAction(StartSessRequestProcessor.java:366)
at oracle.odi.core.persistence.dwgobject.DwgObjectTemplate.execute(DwgObjectTemplate.java:216)
at oracle.odi.runtime.agent.processor.impl.StartSessRequestProcessor.doProcessStartSessTask(StartSessRequestProcessor.java:300)
at oracle.odi.runtime.agent.processor.impl.StartSessRequestProcessor.access$0(StartSessRequestProcessor.java:292)
at oracle.odi.runtime.agent.processor.impl.StartSessRequestProcessor$StartSessTask.doExecute(StartSessRequestProcessor.java:855)
at oracle.odi.runtime.agent.processor.task.AgentTask.execute(AgentTask.java:126)
at oracle.odi.runtime.agent.support.DefaultAgentTaskExecutor$2.run(DefaultAgentTaskExecutor.java:82)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.sql.SQLException: The exception 'java.sql.SQLException: Unimplemented method: notImplemented' was thrown while evaluating an expression.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
... 28 more
Caused by: java.sql.SQLException: Unimplemented method: notImplemented
at sun.javadb.vti.core.TemplateTableFunction.notImplemented(TemplateTableFunction.java:327)
at sun.javadb.vti.core.TemplateTableFunction.getWarnings(TemplateTableFunction.java:185)
at org.apache.derby.impl.sql.execute.VTIResultSet.getNextRowCore(Unknown Source)
at org.apache.derby.impl.sql.execute.BasicNoPutResultSetImpl.getNextRow(Unknown Source)
... 20 more

Posted by guest on August 12, 2014 at 12:17 PM PDT #

Are you using 12c? I had used 11g in those examples but can try on 12c if the problem is there.

Cheers
David

Posted by David on August 12, 2014 at 01:15 PM PDT #

Hi David, no I am not using 12c. I am using 11g (11.1.1).

Posted by guest on August 12, 2014 at 01:19 PM PDT #

The latest Derby code has changes that make the older Sun sample table function support classes fail. Here is a zip of the vtis-example.jar; unzip it into odi/userlib and it should work with both the old and new Derby releases.

https://blogs.oracle.com/dataintegration/resource/odi_12c/vtis-example.zip

Cheers
David

Posted by guest on August 13, 2014 at 10:03 AM PDT #
