OCI Data Integration offers powerful capabilities such as schema drift handling and bulk transformation with pattern matching, both critical for data integration in today's dynamic, ever-evolving data ecosystems.

  • Schema Drift Handling
    In real-world environments, data source schemas often change due to updates in applications, databases, or APIs. Manually adjusting transformation logic for each change is labor-intensive, error-prone, and slow. OCI Data Integration's schema drift feature automates this adaptation, enabling seamless, continuous data processing without manual intervention.
  • Bulk Transformation with Pattern Matching
    Large-scale data integration frequently involves repetitive transformations across multiple fields or datasets, particularly in file-based systems. Manually implementing these changes is inefficient and inconsistent. Bulk transformation, augmented with pattern matching, allows users to apply standardized operations across multiple attributes simultaneously, based on naming patterns or structural rules. This significantly reduces development time, improves consistency, and ensures scalability when processing high volumes of data.

Together, these capabilities empower organizations to manage schema variability with ease, streamline transformation workflows, and accelerate time-to-value, all while maintaining accuracy and consistency across complex data pipelines.

In this scenario, we will create a single data flow to ingest and transform two heterogeneous files from Object Storage into corresponding ADW tables. The source file and target table are parameterized so they can be selected at runtime.

 

Prerequisites

  1. Object Storage bucket with .csv files.
  2. Autonomous Database schema with read/write access.
  3. ADB user with the ‘DWROLE’ role.
  4. OCI Data Integration workspace, project, and application.

 

Getting Started

We begin by working with two CSV files stored in Object Storage: AvailSeats.csv and SeatsSold.csv.

The first step involves defining Data Assets for both Object Storage and Autonomous Database (ADB), followed by the creation of a Data Flow named SEAT_DETAILS within the OCI Data Integration (OCI DI) project.

By default, OCI DI interprets all attributes from a CSV source as VARCHAR. To enable accurate date-based filtering or numeric calculations, it’s essential to convert date fields to the appropriate DATE data type and numeric fields to NUMBER.

To streamline this process, bulk transformations can be applied to perform mass data type conversions and apply functions in bulk. This capability allows you to execute functions, including user-defined ones, across multiple attributes that match a predefined pattern, thereby reducing manual effort and improving consistency across the data flow.

Consistent attribute naming standards are essential to ensure bulk transformations are applied accurately.

 

Create a Data Flow

  1. In the project, navigate to ‘Data Flows’ and select ‘Create Data Flow’.
  2. Assign a name for the data flow, and ensure the correct project or folder is selected.

Create parameter for the source file

To make the data flow reusable, we’ll parameterize the data entity name by creating a source file parameter.

Follow these steps to configure the parameter:

Parameter P_SOURCE_FILE

Hint: Ensure that at least one file matching the defined file name pattern exists in the source Object Storage bucket. If no match is found, the data flow designer will treat the parameter as unresolved, potentially preventing successful configuration or execution.

 

  1. On the designer canvas, select View and enable the Parameters option.
  2. Click Add to create a new parameter.
  3. Define the parameter as follows.
    • Identifier: P_SOURCE_FILE
    • Data Type: String
    • Length: 255, or long enough for your file name pattern
    • Default Value: SeatsSold*.csv

Create Source Operator

Drag the Source operator onto the data flow canvas and configure it to reference one of the input files; in this example, SeatsSold*.csv.

If the data is contained in multiple files, you can use wildcard characters to process all of them in one run; for example, the pattern SeatsSold*.csv would match files such as SeatsSold_2024_01.csv and SeatsSold_2024_02.csv.

 

Seat details Source Operator - Ticket Source File

  1. Configure the Object Storage connections as required.
  2. Assign the Data Entity using the parameter: ${P_SOURCE_FILE}

 

Enable Schema Drift

Navigate to the Advanced Options tab.

Select Schema Drift from the Advanced Options tab

 

Check the Allow Schema Drift setting. This allows the data flow to dynamically adapt to changes in the source schema without manual reconfiguration. For example, if a later source file arrives with an additional column, the flow can pick it up automatically without a redesign.

 

Rename Source Attribute group

By default, the attribute group name is derived from the source file name. Because we will reuse the data flow for other source files, that name can become confusing and, in some cases, quite long. It is helpful to rename it to a more generic name, such as ‘SOURCE’.

Navigate to the Attributes tab and review the attributes from the CSV file.

(Optional) Rename the source attribute group to ‘SOURCE’

  1. Click the three dots to the right of the attribute group name, and select Rename.
  2. Enter the new name.

 

Attribute group is renamed to 'SOURCE'

 

  1. Rename the attribute group to ‘SOURCE’.

 

Convert Data Types

Depending on your source and usage, you may need to convert data types. Since the source in this scenario is a CSV file, all attributes are initially assigned a default data type of VARCHAR.

If you intend to use an attribute as a date filter or in calculations, you must convert these attributes to their appropriate data types (date, numeric, Boolean, and so on).

You can use an expression to perform these conversions in bulk.
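
As a sketch, the bulk conversions used in this scenario take the following form; the date example and its format mask are illustrative assumptions, since this scenario converts only the price attributes:

  to_number(%MACRO_INPUT%)
  to_date(%MACRO_INPUT%, 'yyyy-MM-dd')

Here %MACRO_INPUT% is a macro that stands in for every attribute matched by the bulk selection pattern, as described in the steps that follow.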

 

Add Expression 'CONVERT'

 

  1. Drag an Expression Operator onto the canvas and name it ‘CONVERT’.
  2. Connect ‘SOURCE’ to the ‘CONVERT’ Expression Operator and perform the necessary data type conversions.

 

Expression CONV_PRICE with 'Allow bulk selection' selected

To convert the ‘PRICE’ attributes from VARCHAR to NUMERIC, follow these steps:

  1. Add an expression and name it ‘CONV_PRICE’.
  2. Enable ‘Allow bulk selection’, then select Pattern as the method for selecting source attributes.
    Alternatively, you may select attributes by data type or choose individual attributes manually.
    Note: Only use ‘Attribute’ selection if you are certain the attribute names will be present in every source file being processed.

The Bulk Select option will apply the expression to all matched attributes.

 

Pattern builder with '_price' and Attribute tree expanded

 

  1. Within the Expression Builder, the matched attributes are represented by the macro variable ‘%MACRO_INPUT%’.
  2. In the Pattern Builder, enter the pattern ‘*_price*’ and click Edit Pattern.
  3. Expand the tree under the Pattern Editor. It will display all attributes matching the pattern; you can refine the selection until the desired columns are included, then confirm the pattern selection.
  4. Once you have confirmed that the correct attributes are displayed, click Save Changes.

Define the Name and Data Type for the resulting converted attribute

Choose the name and data type for the new attribute:

(Optional) Define the target attribute name using a regular expression

  1. ‘$0’ retains the original attribute name; you can also add a prefix or suffix, for example, $0_conv.
  2. In the Edit Expression window, uncheck ‘Use source attribute data types’, set the data type to NUMERIC, and specify a Scale of 2 and a Length of 12.
  3. Define the function to be applied to the matched attributes.

Expression Builder with the to_number function defined, with %MACRO_INPUT% as the parameter

 

Within the Expression Builder, the incoming macro %MACRO_INPUT% represents all attributes that match the defined pattern.

  1. Enter the function: to_number(%MACRO_INPUT%)

By using %MACRO_INPUT% in the expression, the function is applied uniformly to every attribute that matches the specified pattern or selection criteria, as illustrated below.
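
For example, if the pattern ‘*_price*’ matched two hypothetical attributes named TICKET_PRICE and RESALE_PRICE, the single bulk rule

  to_number(%MACRO_INPUT%)

would behave as though you had defined

  to_number(TICKET_PRICE)
  to_number(RESALE_PRICE)

producing one converted NUMERIC attribute for each matched input.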

 

Navigate to the Data tab and view the newly added attributes in the data preview.

Expression Data Tab with filters to select NUMERIC and attributes of pattern '_price'

 

  1. Use the filters to search for attributes matching the pattern and the NUMERIC data type.
    • Filter data: ‘*_price’
    • Data type: ‘NUMERIC’
  2. Examine the newly added attributes.

 

Create a Parameter for the Target Table

Add a new parameter for the default target table name.

Define Target Table parameter 'P_TARGET_TABLE'

 

Create a parameter for the target table name and set its default value to TICKET_SEATS_SOLD.

 

Define target table

Add a table to receive the transformed data.

Target table definition with ${P_TARGET_TABLE} as data entity

 

  1. Drag the Target Operator onto the Data Flow canvas and connect the ‘CONVERT’ Expression to the new target.
  2. Assign the identifier ‘TARGET_TABLE’.
  3. Select ‘Create new data entity’.
  4. Configure the ADB schema connections as required.
  5. Assign the Data Entity using the parameter: ${P_TARGET_TABLE}.
  6. Finally, validate and save the data flow to ensure all configurations are correctly applied.

 

Exclude unnecessary attributes from the Target table

The Expression Builder creates new attributes with the prescribed data type. You may want to remove the original attributes, or any others you don’t need.

The original VARCHAR attributes still exist in the ‘SOURCE’ attribute group.

 

Expression 'CONVERT' attributes tab with filter for '_price' and matching attributes displayed

 

  1. Add the Name filter ‘*_price’.
  2. Note that the attributes matching the pattern include both VARCHAR and NUMERIC data types from ‘SOURCE’.

 

(Optional) Remove the original VARCHAR attributes

Create an expression to exclude unwanted attributes using patterns.

 

Expression ‘Exclude’ attributes tab with Name filter ‘*_price’ and data type filter ‘VARCHAR’ selected

 

  1. Right-click the line from the ‘CONVERT’ expression to the target operator, select ‘Insert’, then ‘Expression’.
  2. Name the expression ‘EXCLUDE’.
  3. Navigate to the Attributes tab.

 

Expression exclude Attributes tab with Name filter *_price*

 

  1. Click the filter icon next to the ‘Name’ header.
  2. In the filter, enter the same pattern used in the bulk selection, ‘*_price*’.

 

Expression 'Exclude' Attributes with name and datatype filter of VARCHAR, Action button dropdown, select 'Exclude by applied filter'

 

  1. Click the filter icon next to ‘Type’ and filter by VARCHAR.
  2. Select ‘Exclude by applied filter’ from the ‘Actions’ button.

 

 

Expression Exclude attributes with 'View Rules' dropdown displaying the excluded patterns

 

Validate and save the data flow

Add the Data Flow to an Integration Task and publish it to the application

Create an Integration Task

  1. Return to the project, navigate to the Tasks menu, and click ‘Create Task’.
  2. From the dropdown, select ‘Integration Task’.

Create Integration task, select project and Data Flow

 

  1. Select the project, then select the ‘Seat Details’ data flow.
  2. Click ‘Configure’ next to ‘Configure Parameters’ and verify the parameter values.
  3. Configure the task parameters as follows:
    • Set P_SOURCE_FILE to SeatsSold*.csv
    • Set P_TARGET_TABLE to TICKET_SEATS_SOLD
  4. Click ‘Configure’ to apply the settings.
  5. Save the task.

Publish and run the task

  1. Publish the ‘Seat Details’ integration task to the application.
  2. Navigate to the application and verify that the patch was successfully applied.
  3. Navigate to Tasks on the application console, click the three-dot menu next to ‘Seat_Details’, and select ‘Run’.
  4. Leave the parameter values at their defaults and click ‘Run’.

Run default parameters are shown. Click ‘Edit’ to change them

 

Validate Results

Upon successful completion, validate the data in the Target table within the database schema to ensure accuracy.

SQL query showing TICKET_SEATS_SOLD output
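
A minimal validation query along these lines can be run against the ADW schema; the column name is hypothetical, based on the *_price naming convention used in this scenario:

  SELECT COUNT(*)          AS row_count,
         MIN(TICKET_PRICE) AS min_price,
         MAX(TICKET_PRICE) AS max_price
  FROM   TICKET_SEATS_SOLD;

A non-zero row count and sensible numeric ranges confirm both the load and the VARCHAR-to-NUMBER conversions.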

 

Run the task again, this time using the input file for Available Seats to process the corresponding dataset. (Run it from the Tasks canvas; do not rerun the existing job, as a rerun will not prompt for parameters.)

Task run parameters: AvailSeats*.csv source file and AVAIL_SEATS target table

 

Verify the table exists and examine the data.

SQL query showing AVAIL_SEATS output
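
A quick row-count check, using the target table name supplied through P_TARGET_TABLE, confirms the second run created and loaded the table:

  SELECT COUNT(*) AS row_count
  FROM   AVAIL_SEATS;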

You can reuse the same task for multiple tables with varying definitions, as long as the naming conventions are consistent and the patterns used for each bulk rule either match the intended attributes or match nothing in the other tables. Use bulk selection for any functions; do not define a function on individual attributes without bulk selection unless those attributes exist in every file format being processed.

You may also define multiple tasks for the same data flow, each with different default parameters that can be changed at runtime.

To preview the calculations and data in the Data Flow for additional files: 

  1. Alter the P_SOURCE_FILE parameter's default value to the new CSV file naming pattern (make sure a matching file exists in the bucket).
  2. Click ‘Refresh all data entities’.

This will refresh the Data Flow designer with the new file’s format and data.

Summary

By incorporating parameters and standard naming conventions along with pattern matching, the same integration logic can be applied across heterogeneous datasets. This approach enhances reusability, reduces development time, and minimizes the amount of code required.


We hope this blog has been helpful in enhancing your understanding of Oracle Cloud Infrastructure (OCI) Data Integration. For further learning, explore the comprehensive blogs and official documentation.

Be sure to explore our full collection of blogs on OCI Data Integration to deepen your knowledge and stay up to date with best practices and new features.