OCI GoldenGate is a fully managed, native cloud service that moves data in real-time, at scale. OCI GoldenGate processes data as it moves from one or more data management systems to target systems. You can also design, run, orchestrate, and monitor data replication tasks without having to allocate or manage any compute environments.
Azure Data Lake Storage (ADLS) is a secure cloud platform that provides scalable, cost-effective storage for big data analytics.
You can use OCI GoldenGate for real-time data ingestion into ADLS in the format of your choice. OCI GoldenGate can publish JSON, Parquet, DelimitedText, Avro_OCF and XML file formats into ADLS.
In this blog, I will focus on how to ingest real-time data in different formats into ADLS with OCI GoldenGate.
Apart from Parquet format, OCI GoldenGate uses a 2-step process for ADLS replication. For Parquet, it uses a 3 step process as you'll see later in this blog.
There are different pieces working together in this process; but, these different pieces are automated and defaulted to make the process simpler. As you can see below, different properties can be configured for customizing the replication.
As users, we need to create an OCI GoldenGate Connection for ADLS and we need to create the Replicat.
OCI GoldenGate can connect to Azure Storage Accounts with public or private access. If you need to connect to an Azure Storage Acount running with a private end point, you can refer to this blog for creating the connection.
ADLS connection can be created using one of the 3 Azure Storage authentication methods:
Depending on the authentication type, you’ll need to provide different details.
For Shared Access Key, you’ll need to provide Storage Account Access Key. You can refer to Azure Documentation to get the details about how to create and obtain storage account access keys.
For Shared Access Signature, you’ll need to provide SAS Token. You can refer to Azure Documentation to get the details about how to create a SAS Token.
For Azure Active Directory, you’ll need to provide Azure Tenant ID, Azure Client ID and Azure Client Secret. For Azure Active Directory authentication, you need to register an application in Azure AD App Registrations and assign needed roles to the application. For example, Storage BLOB Data Owner. You can refer to Azure Documentation to get details about Azure AD App Registrations and roles.
To create an OCI GoldenGate Connection for ADLS, click Create connection button. Provide a Name for the connection and select the Type as Azure Data Lake Storage. Click Next.
Provide your Azure Storage Account Name, select the authentication type, provide the required details and click Create.
Once connection is created, assign connection to OCI GoldenGate Big Data deployment. When created, connections and deployments are detached from each other. Assigning the connection to a deployment enables replicate to access connection details to connect to the target. You can assign a connection in connection details, resources, assigned deployments, assign deployments.
To start creating a Replicat in OCI GoldenGate console, click Add Replicat.
Select the Replicat Type. There are two different Replicat types available: Classic and Coordinated. Classic Replicat is a single threaded process whereas Coordinated Replicat is a multithreaded one that applies transactions in parallel. Coordinated Replicat will result in multiple files being created as there is a multithreaded process running.
Next, provide Replicat Options. Trail Name is the name of the trail file that is used for replication. Also, you need to select Target as Azure Data Lake Storage. Once selected, Available Aliases will be listed, and you will need to choose an alias. Once an alias is selected, the Replicat will be using the credential details from the connection that is selected.
In the Parameter File, you can either specify source to target mapping or leave it as-is with a wildcard selection. If Coordinated Replicat is selected as the Replicat Type, an additional parameter needs to be provided: TARGETDB LIBFILE libggjava.so SET property=/u02/Deployment/etc/conf/ogg/your_replicat_name.properties
Last step is the Properties File configuration. In this screen, you will notice that some of the properties are pre-configured and populated. These are the basic and bare minimum properties needed for starting a replication into ADLS.
Please provide your target Azure Storage Container Name in gg.eventhandler.abs.bucketMappingTemplate property. If the target container already exists, OCI GoldenGate will load the files into that container. If the container does not exist and if your access has the appropriate permissions, OCI GoldenGate will automatically create the container. For Azure Storage container names, there are two possible options: using a static name or using a dynamic formatting which can be configured with Template Keywords. You can use any of the template keywords in bucket mapping template. For example, if you use ${tableName}, OCI GoldenGate will create the buckets with the table names coming from the source database.
If you hit Create & Run, OCI GoldenGate will load the files in the default file format, which is JSON, into ADLS.
By default, OCI GoldenGate is configured to use the JSON format. However, OCI GoldenGate supports DelimitedText, Avro_OP, Avro_OCF and XML file formats as well. To change the file format, you need to apply gg.eventhandler.abs.format property. For example: gg.eventhandler.abs.format=delimitedtext
Depending on the selected format, you can also apply Pluggable Formatters. The pluggable formatters are used to convert operations from the Oracle GoldenGate trail file into formatted messages that you can send to ADLS.
There are different settings available for different format types. For more details on specifics of pluggable formatters, you can check these blogs: JSON, DelimitedText, Avro OCF. Please note that these blogs are written for different connectors; but you can apply the properties in ADLS replication as well.
For example, in the DelimitedText blog you see an example of adding column delimiters using gg.handler.filewriter.format.fieldDelimiter=CDATA[;]. You can apply a similar property for ADLS replication with DelimitedText format: gg.eventhandler.abs.format.fieldDelimiter=CDATA[;].
You can also apply metacolumn functionality while replicating into ADLS. Metacolumns allow you to select the metadata fields that you want
to see in the generated output messages.
OCI GoldenGate uses a 3-step process for ADLS Parquet format replication:
We need to update the Replicat Properties for Parquet replication into ADLS and we’ll see how in this part.
Adding the replicat, selecting Replicat Type, updating Replicat Options and Parameter File steps are the same as above. If you follow the above steps, you’ll see the Properties File as below.
You can follow the below steps for customizing the properties:
gg.target=abs
gg.classpath=$THIRD_PARTY_DIR/abs/*:$THIRD_PARTY_DIR/hadoop/*:$THIRD_PARTY_DIR/parquet/*
#The File Writer Handler
gg.handlerlist=filewriter
gg.handler.filewriter.type=filewriter
gg.handler.filewriter.mode=op
gg.handler.filewriter.pathMappingTemplate=./dirout
gg.handler.filewriter.stateFileDirectory=./dirsta
gg.handler.filewriter.fileRollInterval=7m
gg.handler.filewriter.inactivityRollInterval=30s
gg.handler.filewriter.fileWriteActiveSuffix=.tmp
gg.handler.filewriter.finalizeAction=delete
### Avro OCF
gg.handler.filewriter.format=avro_row_ocf
gg.handler.filewriter.fileNameMappingTemplate=${groupName}_${fullyQualifiedTableName}_${currentTimestamp}.avro
gg.handler.filewriter.format.pkUpdateHandling=delete-insert
gg.handler.filewriter.format.metaColumnsTemplate=${optype},${position}
gg.handler.filewriter.format.iso8601Format=false
gg.handler.filewriter.partitionByTable=true
gg.handler.filewriter.rollOnShutdown=true
#The Parquet Event Handler
gg.handler.filewriter.eventHandler=parquet
gg.eventhandler.parquet.type=parquet
gg.eventhandler.parquet.pathMappingTemplate=./dirparquet
gg.eventhandler.parquet.fileNameMappingTemplate=${groupName}_${fullyQualifiedTableName}_${currentTimestamp}.parquet
gg.eventhandler.parquet.writeToHDFS=false
gg.eventhandler.parquet.finalizeAction=delete
#Selecting the OCI Event Handler
gg.eventhandler.parquet.eventHandler=abs
gg.eventhandler.abs.type=abs
gg.eventhandler.abs.pathMappingTemplate=${fullyQualifiedTableName}
Your properties file should look like as below:
Once ready, click Create & Run. When replicat starts successfully, you’ll see it checked with a green mark in a running state.
You can customise parquet file replication into ADLS by changing the above property settings. You can refer to following documents/ blogs for more details: File Writer Handler, Avro OCF, Parquet Event Handler.
OCI GoldenGate is an easy to use, flexible and cost-effective solution for real-time data lakehouse ingestions. You can experience it yourself with OCI GoldenGate Quickstarts.
Deniz Sendil is a Senior Principle Product Manager in Oracle GoldenGate product group with a focus on GoldenGate for Big Data. Deniz has over 17 years of experience in data management, analytics and data integration domains.