Wednesday Mar 13, 2013

OWB - Compressing Files in Parallel using Java Activity

Yesterday I posted a user function for compressing/decompressing files using parallel processes in ODI. The same code you can pick up and use from an OWB process flow. Invoking the java function from within a Java activity from within the flow.

The JAR used in the example below can be downloaded here, from the process flow OWB invokes the main method within the ZipFile class for example - passing the parameters to the function for the input, output directories and also the number of threads. The parameters are passed as a string in OWB, each parameter is wrapped in ?, so we have a string like ?param1?param2?param3? and so on. In the example I pass the input directory d:\inputlogs as the first parameter and d:\outputzips as the second, the number of processes used is 4 - I have escaped my backslash in order to get this to work on Windows.

 The classpath has the JAR file with the class compiled in it and the classpath value can be specified specified on the activity, carefully escaping the path if on windows.

Then you can define the actual class to use;

That's it, pretty easy. The return value from the method will use the exit code from your java method - normally 0 is failure and other values are error (so if you exit the java using a specific error code value you can return this code into a variable in OWB or perform a complex transition condition). Any standard output/error is also capture from within the OWB activity log in the UI, for example below you can see an exception that was thrown and also messages output to the standard output/error;

 That's a quick insight to the java activity in OWB.

Wednesday Jul 18, 2012

OWB 11gR2 – Parallelization using DBMS_PARALLEL_EXECUTE

As well as all of the parallel query and DML capabilities exposed in OWB, the 11gR2 release of OWB includes out of the box support for the DBMS_PARALLEL_EXECUTE package in 11gR2 of the Oracle database, there are various articles on how to do the grunt work manually coding it on the web (see here for example). The use cases for this range from Oracle techie stuff like managing roll back segments for large table updates to parallelizing your PLSQL mappings…whatever they may be. Here we will see how with a couple of configuration settings in OWB you can self parallelize mappings – divide and conquer the olde fashioned way.

The use case that is generally mentioned for using this is for doing large table updates – basically its dividing a BIG problem into smaller pieces, the package also exposes a resume capability for reprocessing failed chunks which is also exposed from OWB (when you execute a chunked mapping you can resume or start a mapping). The approach can also be utilized for parallelizing arbitrary PLSQL (support for set based, row based also the table being chunked can be the source or the target or an arbitrary SQL statement). The 11.2.0.3 patch is best for this.

Chunk by SQL Statement

The example below updates employees salaries, driving the parallel worker threads by the departments table. As you see below we the start and end department id are the same, we are doing a distinct, so all employees in a department will be updated in each chunk, so we can increase the salary within the EMPLOYEES table for each department in a chunk. We are using the department id to drive the chunking – each department if processed in its own thread. We will see how we can control the thread pool of workers.

Note above there are other options such as chunk by ROWID and an arbitrary SQL statement).

The mapping now using olde fashioned divide and conquer has new runtime properties which relate back to the DBMS_PARALLEL_EXECUTE package and include the resume functionality for processing failed chunks and the parallel level, so you can change at runtime whether 1,2 or n threads are used to process the chunks.

We are saying that we want 2 child processes processing the chunks, this is a runtime parameter. Its wise to think about the number of chunks and the number of child processes to ensure optimal execution plan. The image below depicts the code that gets generated with chunk parallel level 2, and parallel level 4 – essentially the package ‘main’ procedure uses DBMS_PARALLEL_EXECUTE to process the heart of the mapping in the child processes.

There is much more to the package than meets the eye the resume capability for example provides a way of reprocessing failed chunks, rather than reprocessing everything again. This is also exposed as a runtime parameter for the chunked OWB mapping so you can resume and reprocess only the failed chunks.

Chunk by ROWID

This is the classic Large Table Update example that typical divide and conquer is used for. This example updates employees salaries, driving the parallel worker threads by rows in the target employees table itself. With this example when we configure the mapping we pick chunk method as ROWID, the chunk table EMPLOYEES, chunk type BY_ROWS and size is 1000 for example. The EMPLOYEES table is set to perform and update, I define the target filter for update for the ON clause in the merge/update – so its the ‘how do I identify the rows to update within this chunk’ clause. The other thing I had to do was define a chunking expression – now in this case its kind of redundant since the chunk is being done in the ON clause of the update – so we can trick OWB by just saying ‘start=start and end=end’ using the variables. If you don’t specify this, OWB will complain that the chunking expression is invalid.

So the MERGE statement within the chunking code will increase the SALARY for the chunk, you will see the expression to increase salary by, the dummy chunking expression used in selecting from the source (DUAL) and which rows to match – the rows in this chunk.

This let’s us perform large table updates in chunks and drive the parallelized mapping using mapping input parameters.

The parallelization for PLSQL (or row based mappings) is an interesting case - for example for the likes of match merge which has an inherent divide and conquer strategy (in binning), with match merge out of the box the processing of the bins is serialized by default. Combining the chunk parallelization with the match merge binning lets you boost the performance of such mappings. So if you pair the chunking criteria with the binning you can substantially increase the performance of such mappings.

Wednesday Nov 02, 2011

Parallel Processing with DBMS_PARALLEL_EXECUTE

Here is another illustration of some of the powerful capabilities of the DBMS_PARALLEL_EXECUTE package in the Oracle database, carrying on from the earlier post here. One of the comments from that post was on how to insert into a different named table within each chunk and that insert can perform parallel DML also. This kind of scenario could be interesting for very high end processing, it could be end point target tables or tables that are prepared and then you perform partition exchanges with them or something.

The image below shows a variation on the original post where rather than inserting into a specific partition, you write into a specific table.

Driving the whole process can be your own chunking criteria, the question was how to drive this process from a table using SQL such as ‘select distinct level_key, level_key from chunk_table’ where chunk_table has the level_key and the target table name. For example it could contain the following data;

level_key table_name
1 sales_level1
2 sales_level2
3 sales_level3
4 sales_level4
 

So the first chunk with level_key 1 will write the results to table sales_level1 etc.

You can use the DBMS_PARALLEL_PACKAGE as follows to create this functionality. The start/end values have to be of data type NUMBER, so you will have to lookup the (target) table name inside your PLSQL block within the statement provided in the run_task call.

This block has the query to determine the chunks .....

begin
   begin
     DBMS_PARALLEL_EXECUTE.DROP_TASK(task_name => 'TASK_NAME');
   exception when others then null;
   end;
   DBMS_PARALLEL_EXECUTE.CREATE_TASK(task_name => 'TASK_NAME');
   DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_SQL(task_name => 'TASK_NAME',
   sql_stmt =>'select distinct level_key, level_key from chunk_table', by_rowid => false);
end;

Then the next block will construct the and process the tasks......

begin
   DBMS_PARALLEL_EXECUTE.RUN_TASK (task_name => 'TASK_NAME',
     sql_stmt =>'declare
       s varchar2(16000); vstart_id number := :start_id; vend_id number:= :end_id;
       table_name varchar2(30);
       begin
         select table_name into table_name from chunk_table where level_key=vstart_id;
         s:=''insert into ''||table_name||'' select /*+ PARALLEL(STG, 8) */ colx from STAGING_TABLE STG
           where level_key =:vstart_id'';
         execute immediate s using vstart_id;
         commit;
     end;',
     language_flag => DBMS_SQL.NATIVE, parallel_level => 2 );
end;

The anonymous PLSQL block can be any arbitrary code, you can see the table name of the target is retrieved, the example does a parallel INSERT using the hint PARALLEL(STG,8). Anyway, good to share.

Tuesday Jul 05, 2011

ODI 11g - Pipelines

There’s a write up of some pushing the envelope on ODIEE 11g capabilities here that I presented parts of last week at ODTUG KScope 11 in Long Beach, California.

The ODTUG Kaleidescope/KScope is certainly an awesome conference with gurus on all fronts willing to talk about their cherished areas. Next year its headed for San Antonio, Texas – Long Beach was fun and it was great to meet all the folks that I did!

About

ETL, CDC, Real-Time DI and Data Quality for the Oracle Database from the inside.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today