Wednesday Feb 06, 2013

ODI - Extracting data from PDF Forms in 0 to 60

Here's an end to end viewlet illustrating how to reverse engineer form metadata from PDFs and also how to extract and integrate data. This carries on from the earlier posting here.

Wednesday Jan 16, 2013

ODI - Java Table Function for PDF capture

You can leverage the java table function approach to easily integrate PDF using an API like iText. I blogged earlier about the approach of writing a specific LKM for this, although this is perfectly reasonable, the writing and subsequent maintenance of the KM especially if you have many such APIs can be simplified, the code is also much cleaner in my opinion. What do you think? Its simplified by using the LKM for a java table function I posted here, then writing specific table functions for adapters.

All I did was write the table function wrapper and the makeRow method - this can even support the convention for naming columns from the earlier PDF blog post by overloading the findColumn method (so the query will use COLUMN_n, such as COLUMN_2 - under the hood, we will get the number n and return the appropriate column).

The makeRow method produces the columns from a java Object, the object is a File object, and represents the PDF file, here is a snippet of the code;

  1.     public String[] makeRow(Object obj) throws SQLException
  2.     {
  3.         String[] row = null;
  4.         int i = 0;
  5.         try {
  6.           PdfReader reader = new PdfReader(((File)obj).getPath());
  7.           Iterator it = reader.getAcroFields().getFields().entrySet().iterator();
  8.           row = new String[ reader.getAcroFields().getFields().size() ];
  9.           while (it.hasNext())
  10.             row[i++] = reader.getAcroFields().getField(((Map.Entry);
  11.           reader.close();
  12.         } catch (Exception e) { e.printStackTrace(); }
  13.         return row;
  14.     }

The code is very simple, for the table function itself, I simply created a Java Iterator (line 19 below) over the array of files in the directory, the class I extended from conveniently takes care of the iteration.

  1. public static ResultSet readCollection(String dirName, String endstr)
  2.         throws SQLException, UnknownHostException, IOException
  3.     {
  4.       final String suffix = endstr;
  5.       File folder = new File( dirName );
  6.       File[] listOfFiles = folder.listFiles(new FilenameFilter() {
  7.         public boolean accept(File dir, String name) {
  8.           return name.toLowerCase().endsWith(suffix);
  9.         }
  10.       } ); 
  11.       ArrayList<File> al = new ArrayList<File>();
  12.       for (int i = 0; i < listOfFiles.length; i++)
  13.         al.add( listOfFiles[i] );
  14.       PdfReader reader = new PdfReader(((File)listOfFiles[0]).getPath());
  15.       int sz = reader.getAcroFields().getFields().entrySet().size();
  16.       String[] cols = new String[sz];
  17.       for (int j = 0; j < sz; j++)
  18.         cols[j] = new String(new Integer(j+1).toString());
  19.       return new pdf_table( cols, al.iterator() );
  20.     }

The entire Java source for the PDF table function can be found here.

We can assign the LKM to the source set and set the table function name to pdf_table.readCollection, define the directory to use and the file extension to filter.

For this case using the table function I set the model type for the PDF datastore model to be my Derby/JavaDB technology (and not file). This generated the SQL SELECT........from table(PDF_W4('d:\temp\pdfs', 'pdf' )) PDF where (1=1) statement to load into the work table. This was using the exact LKM that I used to extract from MongoDB, and I can write any java table function to extract data and load.

Monday Jan 14, 2013

ODI - Integrating PDF using iText

Integrating PDF form data with ODI is as easy as any other data source,  see the code template from Suraj Bang a few years back with OWB for what I followed and converted to ODI - a few changes for the newer version of iText and to work with ODI. The LKM PDF to SQL uses ODI's File technology as a source and loads to any ANSI SQL store. The LKM reads all of the form fields from a PDF form and supports the File technology's String and Numeric datatype. The fields have to be defined by position, so your field name in the file can be MYCOL_1, MYCOL_2, MYCOL_3 etc. or whatever prefix you desire. The LKM as mentioned uses the iText java api, so the JAR should be downloaded and copied into the normal ODI userlib directory, I tested with version 5.3.5 of iText.

The LKM PDF to SQL is here. I manually defined a File datastore with column names (remembering to use the position as the suffix) and types, the delimiter and such information are not used since its PDF, but I set them anyway. The example I used was to process W4 PDF data, the same as in Suraj's post and was able to process all of the PDFs in the same manner. Kudos to Suraj for the jython, with a few little changes from me to use the latest iText.


Learn the latest trends, use cases, product updates, and customer success examples for Oracle's data integration products-- including Oracle Data Integrator, Oracle GoldenGate and Oracle Enterprise Data Quality


« April 2014