OWB 11gR2 – Bulk File Loading - more, faster, easier

A common problem in warehouses is to load a number of similar files into the database, often the filenames are dynamic so the names cannot be hard-wired, if they are known its easy enough to configure the external table or SQL*Loader mapping with those names. What to do in the dynamic case? Well, here we will see one approach using code templates, we can create a design for what we want and apply it in a code template. We can load more files, faster and easier. This is a great example of customizations of the power of customizations where new templates can be created to solve real world problems.

There are 2 requirements we will specify for the file loading mechanics - the first is to allow a regular expression to be supplied in order to load all files that match a pattern and the second is to support compressed file loading.

The Starting Point

Where to start? The File to External Table code template is a fair starting point, this creates an external table for staging a single file from the file-system via an external table in the Oracle database. Currently the template stages a single file with some small changes we can enhance this to stage many files and also incorporate the preprocessor to handle compressed data.

owb_bulk_files

Essentially all we have to do is define some options on a code template so that a regular expression could be supplied for the desired file names, we will change the template so that it retrieves all filenames that matched the regular expression. An additional change will be to support the preprocessor directive to load directly from compressed files (for example). This will provide better performance of moving the file over the network - often very large files are compressed, this significantly reduces the time to transfer. Rather than having to uncompress this compressed file it can be streamed directly into the database. The Oracle database's external table has a PREPROCESSOR directive to fulfill this exact requirement. Including this as an option in the code template will add significant capabilities.

The Template

Rather than editing the File to Oracle external table seeded template we will use this as a basis. Using the OWB OMB scripting language, we will copy-paste the file to external table code template and apply the changes. Using this approach we can take updates to this base and apply the changes using the script (download here). Below we will look at excerpts of it. Firstly using OMB we can copy paste existing templates to customize.

OMBCOPY CT '/PUBLIC_PROJECT/BUILT_IN_CT/LCT_FILE_TO_ORACLE_EXTER_TABLE' TO 'LCT_FILES_TO_ORACLE_EXT_TABLE'

Then add new options to the code template to allow the user to define which preprocessor to use and an regular expression for the file (below we add FILE_REGEXP, EXT_EXE_DIR and EXT_EXE options to the template).

FILE_REGEXP - Process all files in the directory matching the regular expression.
EXT_EXE_DIR - SQL directory where the preprocessor program resides. This must exist prior to executing the mapping.
EXT_EXE - Name of the preprocessor script. If this value is set the PREPROCESSOR directive will be used in the external table.

When the template is complete there are changes to 2 parts of the create external table statement, the first change is to use the PREPROCESSOR directive if the EXT_EXE options is defined, this variable defines the name of the preprocessor;

owb_files_mapping0

The second change above checks for the existence of the regular expression, if no regular expression is defined it uses the file defined in the mapping, if a regular expression is defined then some code is executed to get all the file names that match the regular expression and include those in the external table definition.

The Mapping

The sample mapping below stages the ORDERS and ORDERITEMS data files using the Files to External Table code template, the stage tables are joined with some additional tables and then loaded into the cube/fact, this example illustrates using code templates for staging data in different ways and data warehouse operators (such as cube) integrated using the hybrid mapping design.

The mapping uses a regular expression orderit.*gz to get all COMPRESSED (gzipped) files in the directory, these will be included in the external table. The file names in our example are:

  • orderitems_uk.csv.gz
  • orderitems_us.csv.gz

Should we add new data files into this directory that match this pattern we could just rerun the mapping to pick them up! We would not have to change anything, simply copy the compressed file into the directory and we can dynamically pick up the data.

The preprocessor program gzunzipdb is a script defined in the SQL directory referenced by EXEC_DIR, and the script calls

  • /bin/gunzip -c -d $*

The Oracle external table preprocessor feature takes the standard output of the preprocessor and pumps the data right through the external table. For details of the preprocessor directive see the article 'Using the Preprocessor Feature with External Tables in Oracle Database 11g' on OTN.

owb_files_mapping2

When the mapping is executed we can see the how these properties are applied in the template;

owb_files_mapping3

 

If we then look at the actual table definition in SQLDeveloper for example we can see the files are defined in the LOCATION clause and the preprocessor is also included.

owb_files_mapping4

What we have seen here is a way in which a new code template can be designed, and where the logical mapping design has remained the same. When the template is applied to the mapping we can define various characteristics that let the same map design move more data, faster and easier than before;

  • there is no custom coding,
  • file data can be transferred compressed over the network to the data warehouse,
  • it is then uncompressed without landing the data on the file system and piped directly through the external table

and best of all, the mapping design remains the same. In subsequent postings I'll show how you add yet another level of dynamic capabilities by providing the regular expression at runtime.

Comments:

Post a Comment:
  • HTML Syntax: NOT allowed
About

ETL, CDC, Real-Time DI and Data Quality for the Oracle Database from the inside.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today