Incremental Loading of Monthly Date-Partitioned Parquet Files in OCI-Data Integration

OCI Data Integration (OCIDI) is a powerful Oracle Cloud Infrastructure service for seamless data movement, transformation, and integration across diverse sources. One of its key capabilities is incremental extraction, which processes only new or updated data—reducing resource usage, time, and costs while ensuring fresh insights.

In this blog, I’ll show how to implement incremental loading of monthly timestamp-based Parquet files stored in date-partitioned folders, and ingest them into Oracle Autonomous Database for optimized and efficient data synchronization.

Solution Architecture

Solution Prerequisites

Oracle Cloud Infrastructure data integration workspace with policies. Refer here.
Autonomous Database
Object Storage Bucket

What we need to configure?

Audit Table with necessary columns to track last extracted date and folder and target table
PL/SQL stored procedures
Date wise generated folders in object storage bucket

Configuration Setup

Step 1: Create the target table and audit table in ADW.

CREATE TABLE "TGT"."TEST_PAR_LOAD" 
   ( "MONTH_VALUE" VARCHAR2(4000 CHAR) COLLATE "USING_NLS_COMP", 
 "YEAR_VALUE" VARCHAR2(4000 CHAR) COLLATE "USING_NLS_COMP", 
 "DAY_OF_YEAR_VALUE" VARCHAR2(4000 CHAR) COLLATE "USING_NLS_COMP", 
 "DAY_OF_WEEK_VALUE" VARCHAR2(4000 CHAR) COLLATE "USING_NLS_COMP", 
 "DATE_VALUE" VARCHAR2(4000 CHAR) COLLATE "USING_NLS_COMP"
   )  DEFAULT COLLATION "USING_NLS_COMP" ;

CREATE TABLE "TGT"."AUDIT_FOLDER_NAME" 
   ( "TARGET_TABLE" VARCHAR2(100 BYTE) COLLATE "USING_NLS_COMP", 
 "FOLDER_NAME" VARCHAR2(200 BYTE) COLLATE "USING_NLS_COMP", 
 "CURR_DATE_VALUE" DATE
   )  DEFAULT COLLATION "USING_NLS_COMP" ;

Step 2: Create the PL/SQL stored procedures to update the audit table(Post we insert initial month’s data in audit table to configure initial load)

PL/SQL procedures to get the file pointer from latest folder and update the audit table values for the incremental load to be executed for next month

create or replace PROCEDURE TGT.GET_FOLDER_NAME(p_TARGET_TABLE VARCHAR2, p_folder_name OUT VARCHAR2)
AS
BEGIN
   SELECT folder_name
   INTO p_folder_name
   FROM TGT.AUDIT_FOLDER_NAME
   WHERE TARGET_TABLE = p_TARGET_TABLE
     AND CURR_DATE_VALUE = (SELECT MAX(CURR_DATE_VALUE) FROM TGT.AUDIT_FOLDER_NAME WHERE TARGET_TABLE = p_TARGET_TABLE);
END;

create or replace PROCEDURE TGT.UPDATE_folder_name_monthly(p_TARGET_TABLE VARCHAR2)
AS
    v_new_folder_name VARCHAR2(2000); -- Increased size to accommodate the new folder format
    v_curr_date_value DATE;
    v_next_month_last_day DATE;
BEGIN
    -- Retrieve the current date value from the audit table
    SELECT MAX(curr_date_value) INTO v_curr_date_value
    FROM TGT.AUDIT_FOLDER_NAME
    WHERE TARGET_TABLE = p_TARGET_TABLE;

    -- Get the last day of the next month
    v_next_month_last_day := LAST_DAY(ADD_MONTHS(v_curr_date_value, 1));

    -- Generate the new folder name using the last day of the next month and append the wildcard for .parquet files
    v_new_folder_name := 'LOAN_DATA_LOAD/process_date=' || TO_CHAR(v_next_month_last_day, 'DD-MM-YYYY') || '/*.parquet';

    -- Insert the new folder name into the audit table with the updated month
    INSERT INTO TGT.AUDIT_FOLDER_NAME (TARGET_TABLE, folder_name, curr_date_value)
    VALUES (p_TARGET_TABLE, v_new_folder_name, v_next_month_last_day);

END;

Now you can see Object Storage bucket folders in above format as shown in the image and new folders will be created in this format.(process_date=date_value) with the files uploaded in it.

Solution configuration in OCI Data Integration

In OCI Data Integration (OCIDI), we can create a data flow with a parameter that has a default value set to the first month’s folder or object entity. By using a wildcard pattern such as *.parquet, the data flow is designed to process only parquet files from the specified folder initially. The parameter is then passed into the data flow entity for dynamic execution from object storage folders within OCIDI.

Invoke the PL/SQL stored procedures using OCIDI SQL tasks

Create a pipeline task to invoke the stored procedures SQL tasks and data flow task and pass the output parameter from GET_FOLDER_NAME_SQL_TASK into the data flow parameter.
Execute the pipeline task to test the results.

Once executed the AUDIT_FOLDER_NAME table will be updated with the latest date value the folder name as well for subsequent extraction.This architecture is widely adopted for loading data from monthly date-partitioned files stored in dynamically created structured folders.

One of the key considerations when executing the solution is understanding the frequency at which new files are generated within the folders.

I hope that this blog helps as you learn more about Oracle Cloud Infrastructure Data Integration’s unique capabilities on incremental data load. For more information, check out the documentation.

Incremental Loading of Monthly Date-Partitioned Parquet Files in OCI-Data Integration

Solution Prerequisites

What we need to configure?

Configuration Setup

Solution configuration in OCI Data Integration

Sathya Prakash Velmurugan

Staff Solution Engineer, Solution Engineering Hub JAPAC

Oracle GoldenGate: MySQL to Oracle Database Instantiation Using Initial Load

Why a Migration from MongoDB to Oracle Database Makes Sense

Incremental Loading of Monthly Date-Partitioned Parquet Files in OCI-Data Integration

Solution Prerequisites

What we need to configure?

Configuration Setup

Solution configuration in OCI Data Integration

Authors

Sathya Prakash Velmurugan

Staff Solution Engineer, Solution Engineering Hub JAPAC

Oracle GoldenGate: MySQL to Oracle Database Instantiation Using Initial Load

Why a Migration from MongoDB to Oracle Database Makes Sense