ODI - Java Table Function for PDF capture

You can use the Java table function approach to integrate PDF data through an API such as iText. I blogged earlier about writing a specific LKM for this. That approach is perfectly reasonable, but writing and then maintaining a dedicated KM becomes a burden, especially if you have many such APIs, and the table function code is also much cleaner in my opinion. What do you think? It's simplified by using the LKM for a Java table function I posted here, then writing a specific table function for each adapter.

All I did was write the table function wrapper and the makeRow method. This can even support the column naming convention from the earlier PDF blog post by overriding the findColumn method: the query uses COLUMN_n, such as COLUMN_2, and under the hood we extract the number n and return the appropriate column.
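To make the COLUMN_n convention concrete, here is a minimal sketch of the name-to-index mapping that a findColumn override could delegate to. The class and method names (PdfColumnNames, columnIndex) are hypothetical and not taken from the original source. I am also assuming that names are matched case-insensitively and that anything else is rejected with a SQLException.

```java
import java.sql.SQLException;
import java.util.Locale;

// Hypothetical helper: maps a name such as "COLUMN_2" to the 1-based index 2.
final class PdfColumnNames {
    private static final String PREFIX = "COLUMN_";

    static int columnIndex(String columnName) throws SQLException {
        // Assumption: the prefix is matched case-insensitively.
        if (columnName == null
                || !columnName.toUpperCase(Locale.ROOT).startsWith(PREFIX)) {
            throw new SQLException("Unknown column: " + columnName);
        }
        try {
            int n = Integer.parseInt(columnName.substring(PREFIX.length()));
            if (n < 1) {
                throw new SQLException("Column index must be >= 1: " + columnName);
            }
            return n;
        } catch (NumberFormatException e) {
            // The part after the prefix was not a number.
            throw new SQLException("Unknown column: " + columnName, e);
        }
    }

    public static void main(String[] args) throws SQLException {
        System.out.println(columnIndex("COLUMN_2"));
    }
}
```

A findColumn(String) override in the table function class could then simply return PdfColumnNames.columnIndex(name).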

The makeRow method produces the columns from a Java Object. Here the object is a File that represents the PDF file. Here is a snippet of the code:

    public String[] makeRow(Object obj) throws SQLException
    {
        // obj is the java.io.File for one PDF; each form field value becomes a column.
        try {
          PdfReader reader = new PdfReader(((File)obj).getPath());
          try {
            AcroFields form = reader.getAcroFields();
            String[] row = new String[ form.getFields().size() ];
            int i = 0;
            for (Object name : form.getFields().keySet())
              row[i++] = form.getField(name.toString());
            return row;
          } finally {
            reader.close();   // always release the file, even if reading fails
          }
        } catch (IOException e) {
          // Report the failure to the caller instead of returning a null row.
          throw new SQLException("Could not read PDF " + obj, e);
        }
    }

The code is very simple. For the table function itself, I simply created a Java Iterator (line 19 below) over the array of files in the directory; the class I extended from conveniently takes care of the iteration.

  1. public static ResultSet readCollection(String dirName, String endstr)
  2.         throws SQLException, UnknownHostException, IOException
  3.     {
  4.       final String suffix = endstr;
  5.       File folder = new File( dirName );
  6.       File[] listOfFiles = folder.listFiles(new FilenameFilter() {
  7.         public boolean accept(File dir, String name) {
  8.           return name.toLowerCase().endsWith(suffix);
  9.         }
  10.       } );
  11.       ArrayList<File> al = new ArrayList<File>(Arrays.asList(listOfFiles));
  12.       // Use the first PDF to find out how many form fields (columns) there are.
  13.       PdfReader reader = new PdfReader(listOfFiles[0].getPath());
  14.       int sz = reader.getAcroFields().getFields().size();
  15.       reader.close();
  16.       String[] cols = new String[sz];
  17.       for (int j = 0; j < sz; j++)
  18.         cols[j] = String.valueOf(j + 1);   // column names are the 1-based positions
  19.       return new pdf_table( cols, al.iterator() );
  20.     }
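The listing above assumes the directory exists and contains at least one matching file. As a hedged sketch of the same directory scan with those cases handled, the helper below returns the matching files as a list. The names PdfDirectoryListing and listBySuffix are hypothetical. Unlike the snippet above, it also lower-cases the suffix and sorts the result so the iteration order is deterministic; both are my assumptions rather than part of the original code.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: lists the files in a directory whose names end with a suffix.
final class PdfDirectoryListing {

    static List<File> listBySuffix(String dirName, String suffix) throws IOException {
        final String wanted = suffix.toLowerCase(Locale.ROOT);
        // Two-argument lambda, so this resolves to listFiles(FilenameFilter).
        File[] found = new File(dirName).listFiles(
                (dir, name) -> name.toLowerCase(Locale.ROOT).endsWith(wanted));
        if (found == null) {
            // listFiles returns null when the path is not a readable directory.
            throw new IOException("Not a readable directory: " + dirName);
        }
        Arrays.sort(found);  // File is Comparable; gives a stable, path-based order
        return new ArrayList<>(Arrays.asList(found));
    }

    public static void main(String[] args) throws IOException {
        String dir = args.length > 0 ? args[0] : ".";
        for (File f : listBySuffix(dir, "pdf")) {
            System.out.println(f.getPath());
        }
    }
}
```

A table function could check whether the returned list is empty before opening the first PDF to count its fields, and return an empty result in that case.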

The entire Java source for the PDF table function can be found here.

We can assign the LKM to the source set, set the table function name to pdf_table.readCollection, and define the directory to use and the file extension to filter on.

For this case, using the table function, I set the technology of the PDF datastore model to my Derby/JavaDB technology (and not File). This generated the statement SELECT........from table(PDF_W4('d:\temp\pdfs', 'pdf' )) PDF where (1=1) to load into the work table. This used the exact LKM that I used to extract from MongoDB, so I can write any Java table function to extract data and load it.
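To illustrate the shape of that generated statement, here is a small sketch that assembles a query of the same form from its parts. It is not the LKM's actual code. TableFunctionQuery, select and sqlLiteral are hypothetical names, and the select list is passed in by the caller because the generated statement above elides it. I am assuming the arguments are plain string literals, with embedded single quotes doubled.

```java
// Hypothetical helper: builds "select ... from table(FN('a', 'b')) ALIAS where (1=1)".
final class TableFunctionQuery {

    // Quote a value as a SQL string literal, doubling any embedded single quotes.
    static String sqlLiteral(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    static String select(String columns, String function, String alias, String... args) {
        StringBuilder sql = new StringBuilder("select ")
                .append(columns)
                .append(" from table(")
                .append(function)
                .append('(');
        for (int i = 0; i < args.length; i++) {
            if (i > 0) {
                sql.append(", ");
            }
            sql.append(sqlLiteral(args[i]));
        }
        return sql.append(")) ").append(alias).append(" where (1=1)").toString();
    }

    public static void main(String[] args) {
        // The column names follow the COLUMN_n convention and are only an example.
        System.out.println(
                select("COLUMN_1, COLUMN_2", "PDF_W4", "PDF", "d:\\temp\\pdfs", "pdf"));
    }
}
```

The two arguments correspond to the directory and the file extension that are set as options on the LKM.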
