OCI GoldenGate Real-Time Ingestion for Azure Data Lake Storage

December 13, 2022 | 8 minute read
Deniz Sendil
Senior Principal Product Manager
Text Size 100%:

OCI GoldenGate is a fully managed, native cloud service that moves data in real-time, at scale. OCI GoldenGate processes data as it moves from one or more data management systems to target systems. You can also design, run, orchestrate, and monitor data replication tasks without having to allocate or manage any compute environments.

Azure Data Lake Storage (ADLS) is a secure cloud platform that provides scalable, cost-effective storage for big data analytics.

You can use OCI GoldenGate for real-time data ingestion into ADLS in the format of your choice. OCI GoldenGate can publish JSON, Parquet, DelimitedText, Avro_OCF and XML file formats into ADLS. 

In this blog, I will focus on how to ingest real-time data in different formats into ADLS with OCI GoldenGate.

Apart from Parquet format, OCI GoldenGate uses a 2-step process for ADLS replication. For Parquet, it uses a 3 step process as you'll see later in this blog. 

  1. Generating local files from trail files (these local files are not accessible in OCI GoldenGate)
  2. Loading files into Azure Storage Account

There are different pieces working together in this process; but, these different pieces are automated and defaulted to make the process simpler. As you can see below, different properties can be configured for customizing the replication. 

As users, we need to create an OCI GoldenGate Connection for ADLS and we need to create the Replicat

OCI GoldenGate ADLS Connection: 

OCI GoldenGate can connect to Azure Storage Accounts with public or private access. If you need to connect to an Azure Storage Acount running with a private end point, you can refer to this blog for creating the connection.

ADLS connection can be created using one of the 3 Azure Storage authentication methods:

  • Shared Access Key
  • Shared Access Signature
  • Azure Active Directory

Depending on the authentication type, you’ll need to provide different details. 

For Shared Access Key, you’ll need to provide Storage Account Access Key. You can refer to Azure Documentation to get the details about how to create and obtain storage account access keys.

For Shared Access Signature, you’ll need to provide SAS Token. You can refer to Azure Documentation to get the details  about how to create a SAS Token.

For Azure Active Directory, you’ll need to provide Azure Tenant ID, Azure Client ID and Azure Client Secret. For Azure Active Directory authentication, you need to register an application in Azure AD App Registrations and assign needed roles to the application. For example, Storage BLOB Data Owner. You can refer to Azure Documentation to get details about Azure AD App Registrations and roles.

To create an OCI GoldenGate Connection for ADLS, click Create connection button. Provide a Name for the connection and select the Type as Azure Data Lake Storage. Click Next.

OCI GoldenGate

Provide your Azure Storage Account Name, select the authentication type, provide the required details and click Create.

OCI GoldenGate

Once connection is created, assign connection to OCI GoldenGate Big Data deployment. When created, connections and deployments are detached from each other. Assigning the connection to a deployment enables replicate to access connection details to connect to the target. You can assign a connection in connection details, resources, assigned deployments, assign deployments.

OCI GoldenGate

Creating the replicat:

To start creating a Replicat in OCI GoldenGate console, click Add Replicat.
 

OCI GoldenGate

Select the Replicat Type. There are two different Replicat types available: Classic and Coordinated. Classic Replicat is a single threaded process whereas Coordinated Replicat is a multithreaded one that applies transactions in parallel. Coordinated Replicat will result in multiple files being created as there is a multithreaded process running.

Next, provide Replicat Options. Trail Name is the name of the trail file that is used for replication. Also, you need to select Target as Azure Data Lake Storage. Once selected, Available Aliases will be listed, and you will need to choose an alias. Once an alias is selected, the Replicat will be using the credential details from the connection that is selected.

OCI GoldenGate

In the Parameter File, you can either specify source to target mapping or leave it as-is with a wildcard selection. If Coordinated Replicat is selected as the Replicat Type, an additional parameter needs to be provided: TARGETDB LIBFILE libggjava.so SET property=/u02/Deployment/etc/conf/ogg/your_replicat_name.properties

OCI GoldenGate

Last step is the Properties File configuration. In this screen, you will notice that some of the properties are pre-configured and populated. These are the basic and bare minimum properties needed for starting a replication into ADLS.

Please provide your target Azure Storage Container Name in gg.eventhandler.abs.bucketMappingTemplate property. If the target container already exists, OCI GoldenGate will load the files into that container. If the container does not exist and if your access has the appropriate permissions, OCI GoldenGate will automatically create the container. For Azure Storage container names, there are two possible options: using a static name or using a dynamic formatting which can be configured with Template Keywords. You can use any of the template keywords in bucket mapping template. For example, if you use ${tableName}, OCI GoldenGate will create the buckets with the table names coming from the source database. 

If you hit Create & Run, OCI GoldenGate will load the files in the default file format, which is JSON, into ADLS.

OCI GoldenGate

By default, OCI GoldenGate is configured to use the JSON format. However, OCI GoldenGate supports DelimitedText, Avro_OP, Avro_OCF and XML file formats as well. To change the file format, you need to apply gg.eventhandler.abs.format property. For example: gg.eventhandler.abs.format=delimitedtext

OCI GoldenGate

Depending on the selected format, you can also apply Pluggable Formatters. The pluggable formatters are used to convert operations from the Oracle GoldenGate trail file into formatted messages that you can send to ADLS. 

There are different settings available for different format types. For more details on specifics of pluggable formatters, you can check these blogs: JSON, DelimitedText, Avro OCF. Please note that these blogs are written for different connectors; but you can apply the properties in ADLS replication as well.

For example, in the DelimitedText blog you see an example of adding column delimiters using gg.handler.filewriter.format.fieldDelimiter=CDATA[;]. You can apply a similar property for ADLS replication with DelimitedText format: gg.eventhandler.abs.format.fieldDelimiter=CDATA[;].

You can also apply metacolumn functionality while replicating into ADLS. Metacolumns allow you to select the metadata fields that you want
to see in the generated output messages.

Replicating Parquet files into ADLS:

OCI GoldenGate uses a 3-step process for ADLS Parquet format replication:

  1. Generating local files from trail files (these local files are not accessible in OCI GoldenGate)
  2. Converting local files to Parquet format
  3. Loading files into Azure Storage Account

We need to update the Replicat Properties for Parquet replication into ADLS and we’ll see how in this part. 

Adding the replicat, selecting Replicat Type, updating Replicat Options and Parameter File steps are the same as above. If you follow the above steps, you’ll see the Properties File as below.

OCI GoldenGate

You can follow the below steps for customizing the properties:

  1. As the first step, we need to remove the below property:

gg.target=abs

  1. Leave gg.eventhandler.abs.connectionId as is. Any changes in this property may lead into replication fail.
  2. Please provide your target Azure Storage Container Name in gg.eventhandler.abs.bucketMappingTemplate property. If the container already exists, OCI GoldenGate will load the files into that container. If the container does not exist and if your access has the appropriate permissions, OCI GoldenGate will automatically create the container. For Azure Storage container names, there are two possible options: using a static name or using a dynamic formatting which can be configured with Template Keywords.
  3. Update gg.classpath property as below:

gg.classpath=$THIRD_PARTY_DIR/abs/*:$THIRD_PARTY_DIR/hadoop/*:$THIRD_PARTY_DIR/parquet/*

  1. Add the below set of properties:

#The File Writer Handler
gg.handlerlist=filewriter
gg.handler.filewriter.type=filewriter
gg.handler.filewriter.mode=op
gg.handler.filewriter.pathMappingTemplate=./dirout
gg.handler.filewriter.stateFileDirectory=./dirsta
gg.handler.filewriter.fileRollInterval=7m
gg.handler.filewriter.inactivityRollInterval=30s
gg.handler.filewriter.fileWriteActiveSuffix=.tmp
gg.handler.filewriter.finalizeAction=delete

### Avro OCF
gg.handler.filewriter.format=avro_row_ocf
gg.handler.filewriter.fileNameMappingTemplate=${groupName}_${fullyQualifiedTableName}_${currentTimestamp}.avro
gg.handler.filewriter.format.pkUpdateHandling=delete-insert
gg.handler.filewriter.format.metaColumnsTemplate=${optype},${position}
gg.handler.filewriter.format.iso8601Format=false
gg.handler.filewriter.partitionByTable=true
gg.handler.filewriter.rollOnShutdown=true

#The Parquet Event Handler
gg.handler.filewriter.eventHandler=parquet
gg.eventhandler.parquet.type=parquet
gg.eventhandler.parquet.pathMappingTemplate=./dirparquet
gg.eventhandler.parquet.fileNameMappingTemplate=${groupName}_${fullyQualifiedTableName}_${currentTimestamp}.parquet
gg.eventhandler.parquet.writeToHDFS=false
gg.eventhandler.parquet.finalizeAction=delete
#Selecting the OCI  Event Handler
gg.eventhandler.parquet.eventHandler=abs
gg.eventhandler.abs.type=abs
gg.eventhandler.abs.pathMappingTemplate=${fullyQualifiedTableName}

Your properties file should look like as below:

OCI GoldenGate

Once ready, click Create & Run. When replicat starts successfully, you’ll see it checked with a green mark in a running state.

OCI GoldenGate

You can customise parquet file replication into ADLS by changing the above property settings. You can refer to following documents/ blogs for more details: File Writer Handler, Avro OCF, Parquet Event Handler

Conclusion:

OCI GoldenGate is an easy to use, flexible and cost-effective solution for real-time data lakehouse ingestions. You can experience it yourself with OCI GoldenGate Quickstarts.

Deniz Sendil

Senior Principal Product Manager

Deniz Sendil is a Senior Principle Product Manager in Oracle GoldenGate product group with a focus on GoldenGate for Big Data. Deniz has over 17 years of experience in data management, analytics and data integration domains. 


Previous Post

Using Oracle Cloud Infrastructure GoldenGate with MySQL Databases

Julien TESTUT | 6 min read

Next Post


Reusing Tasks in OCI Data Integration - Working Effectively

David Allan | 7 min read