Introduction:

In today’s data-driven world, organizations are constantly seeking efficient ways to move, transform, and analyze vast amounts of data. Traditional data warehousing solutions often struggle with the scale and agility required for modern analytical workloads. This is where technologies like Apache Iceberg and Oracle GoldenGate for Distributed Applications and Analytics (DAA) come into play, offering a powerful combination for real-time data replication and advanced analytics.

The Challenge: Bridging Operational and Analytical Silos

Operational databases are optimized for transactional integrity and high concurrency. Analytical systems, on the other hand, demand flexibility, scalability, and the ability to run complex queries over massive datasets. Historically, bridging these two worlds has involved batch processes, ETL jobs, and significant latency, leading to stale insights and missed opportunities.

Enter Apache Iceberg: A Table Format for the Modern Data Lake

Apache Iceberg has emerged as a game-changer in the data lake ecosystem. It’s an open table format that brings SQL-like table capabilities and reliability to data stored in object storage (like S3, ADLS, GCS). Key benefits of Iceberg include:

• Schema Evolution: Easily evolve your table schemas without rewriting data.

• Time Travel: Query historical versions of your data.

• ACID Transactions: Keep data consistent and reliable across inserts, updates, and deletes.

• Performance: Optimized for query performance on large datasets.

• Open Standard: Avoids vendor lock-in and integrates with various data engines (Spark, Flink, Trino, etc.).
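
To make the time travel and atomic commit ideas concrete, here is a toy Python model of snapshot-based versioning. It is only a sketch: the class and method names are made up for illustration, it is not the Iceberg API, and real Iceberg tracks snapshots through metadata and manifest files rather than in-memory tuples.

```python
from dataclasses import dataclass, field


@dataclass
class ToySnapshotTable:
    """Toy model of snapshot-based versioning (hypothetical, not the Iceberg API)."""

    # Snapshot 0 is the empty table; each commit appends one immutable snapshot.
    snapshots: list = field(default_factory=lambda: [()])

    def commit(self, rows):
        """Publish a complete new table state in a single step."""
        self.snapshots.append(tuple(rows))
        return len(self.snapshots) - 1  # id of the new snapshot

    def read(self, snapshot_id=None):
        """Read the latest state, or an older snapshot for time travel."""
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return list(self.snapshots[snapshot_id])


table = ToySnapshotTable()
v1 = table.commit([{"id": 1, "city": "Austin"}])
v2 = table.commit([{"id": 1, "city": "Boston"}, {"id": 2, "city": "Denver"}])

latest = table.read()      # state after the second commit
as_of_v1 = table.read(v1)  # "time travel" back to the first commit
```

Readers only ever see a fully committed snapshot, which is the intuition behind both consistent reads and querying historical versions.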

Oracle GoldenGate DAA: Real-time Replication for the Enterprise

Oracle GoldenGate has long been the gold standard for heterogeneous real-time data replication. GoldenGate for Distributed Applications and Analytics (DAA) extends these capabilities, providing a robust and scalable solution for moving data from Oracle databases to various big data targets, including Apache Iceberg. GoldenGate for DAA offers:

• Low Latency: Captures changes from the Oracle redo logs in real-time, ensuring minimal data lag.

• Heterogeneous Support: Replicates data to a wide range of targets beyond just other databases.

• Data Transformation: Allows for in-flight data transformations and filtering.

• Resilience: Built-in error handling and recovery mechanisms for continuous operation.

• Scalability: Designed to handle high-volume data streams.

The Power of Integration: GoldenGate for DAA Replication to Apache Iceberg

The Oracle GoldenGate team announced the general availability of Oracle GoldenGate for Distributed Applications and Analytics (DAA) 23.7, which introduces new connections, enhancements, and fixes, with a particular focus on Apache Iceberg integration for real-time data analytics and AI use cases.

The integration of Oracle GoldenGate DAA with Apache Iceberg provides a powerful solution for building real-time analytical platforms. Here’s how it works:

1. Change Data Capture (CDC) from Oracle: GoldenGate captures transactional changes (inserts, updates, deletes) from your Oracle source database’s redo logs. This is done by setting up the Extract process on the Oracle Database using GoldenGate for Oracle.

2. Real-time Data Delivery: GoldenGate for Oracle sends these changes to GoldenGate for DAA over a Distribution Path. The Distribution Service forwards the transactions to a remote system (in this case, from GoldenGate for Oracle to GoldenGate for Distributed Applications and Analytics).

3. Receiver Service on GoldenGate for DAA: Verify that the Receiver Service has been set up automatically on the remote (oggdaa) system.

4. GoldenGate for Big Data (Iceberg Handler): GoldenGate for DAA includes handlers specifically designed for big data targets. For Iceberg, this typically involves a Replicat configured to write to Apache Iceberg. The Replicat process in GoldenGate for DAA applies the transactions to the Apache Iceberg tables.
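
The flow above ends with change records being applied to a target table in order. The Python sketch below illustrates that idea only. The record shape (`op` and `row` fields) and the function name are assumptions made for this example; they are not the GoldenGate trail format or the way Replicat is implemented.

```python
def apply_changes(target, changes, key="id"):
    """Apply ordered change records to a dict keyed by primary key."""
    for change in changes:
        op, row = change["op"], change["row"]
        if op == "insert":
            target[row[key]] = dict(row)
        elif op == "update":
            # Merge the changed columns into the existing row.
            target[row[key]] = {**target.get(row[key], {}), **row}
        elif op == "delete":
            target.pop(row[key], None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


# Hypothetical change stream for an ORDERS table.
changes = [
    {"op": "insert", "row": {"id": 1, "status": "NEW"}},
    {"op": "insert", "row": {"id": 2, "status": "NEW"}},
    {"op": "update", "row": {"id": 1, "status": "SHIPPED"}},
    {"op": "delete", "row": {"id": 2}},
]
orders = apply_changes({}, changes)
```

Order matters here: replaying the same records in a different sequence could leave the target in a different state, which is why the changes are delivered in commit order.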

Architecture

Source: Oracle Database 23ai running on Oracle Autonomous Database

Target: Apache Iceberg on AWS (Catalog – AWS Glue, Storage – Amazon S3)

Steps

1. Install GoldenGate for Oracle and GoldenGate for DAA on the same host or different hosts. Create one deployment for each GoldenGate instance.

2. Follow the steps in the documentation to configure Extract to capture from an Autonomous Database.

Create an Integrated Extract –

Add Extract

Select the Source Database and enter a trail file name –

Add Extract Options

Specify the parameters and create the Extract –

Add Extract Parameters
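
For readers who cannot see the screenshots, the following Python sketch renders the kind of text a minimal Extract parameter file usually contains. EXTRACT, USERIDALIAS, EXTTRAIL, and TABLE are standard GoldenGate parameters, but the process name, credential alias, trail name, and schema used here are placeholders, not the values from this test. An Autonomous Database source may need additional settings, so treat the documentation as the authority.

```python
def render_extract_params(name, credential_alias, trail, tables):
    """Return minimal Extract parameter file text for the given settings."""
    lines = [
        f"EXTRACT {name}",
        f"USERIDALIAS {credential_alias}",
        f"EXTTRAIL {trail}",
    ]
    # One TABLE statement per source table or wildcard pattern.
    lines += [f"TABLE {table};" for table in tables]
    return "\n".join(lines) + "\n"


# All four values below are placeholders chosen for illustration.
params = render_extract_params("EXTA", "adb_source", "ea", ["SALES.*"])
print(params)
```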

3. Create a Distribution Path on the source GoldenGate for Oracle deployment by selecting the Extract process created in the previous step.

Distribution Service

Under Target Options, select ogg as the Target Protocol. Select Receiver Service as the Target Type, then specify the Target Host (the host running GoldenGate for DAA) and the port number on which the Receiver Service of GoldenGate for DAA is listening. We use localhost here because GoldenGate for Oracle and GoldenGate for DAA are running on the same host.

Distribution Service Target Options
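
If the path fails to start, one quick thing to rule out is basic network reachability of the Receiver Service port. The sketch below is a generic TCP check in Python, not a GoldenGate tool, and the function name is our own. The demo opens a throwaway local listener so that it runs anywhere; in practice you would pass your own Target Host and port.

```python
import socket


def receiver_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Demo against a temporary local listener standing in for the Receiver Service.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

reachable_while_listening = receiver_reachable("127.0.0.1", demo_port)
listener.close()
reachable_after_close = receiver_reachable("127.0.0.1", demo_port)
```

A successful connection only shows that something is listening on that port. It does not confirm that the listener is the Receiver Service or that authentication will succeed.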

4. Verify that the Receiver Service path has been created on GoldenGate for DAA.

Receiver Service

5. Create a Replicat process on GoldenGate for DAA to replicate the data to Apache Iceberg. While creating the Replicat process, select Apache Iceberg as the Target. We ran the test with AWS Glue as the Catalog and Amazon S3 as the Storage Location. Apache Iceberg supports multiple Catalog and Storage options. Check the official documentation for details.

Add Replicat

Enter the details in the Properties File, including AWS Access Key, Secret Key and S3 Bucket details.

Add Replicat Properties
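
A properties file with a blank or placeholder value is an easy mistake to make at this step. The Python sketch below scans properties text for keys that were left unset. The key names in the sample are hypothetical and chosen only for illustration; use the property names shown in your own Replicat properties file and in the documentation.

```python
def find_unset_properties(text):
    """Return keys whose value is empty or still a <placeholder>."""
    unset = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without a key=value pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if not value or (value.startswith("<") and value.endswith(">")):
            unset.append(key)
    return unset


# Hypothetical keys, NOT the real Iceberg handler property names.
sample = """
# storage settings
example.s3.bucket=my-iceberg-bucket
example.s3.region=
example.aws.accessKey=<ACCESS_KEY>
"""
missing = find_unset_properties(sample)
```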

Create and Run the Replicat.

6. Log in to AWS and verify the table and its contents.

Table details from AWS Glue –

AWS Glue

Verify the table contents from Amazon Athena –

Amazon Athena
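
Beyond eyeballing the query results, you can compare a sample of source rows with the rows returned from the target. The Python sketch below shows one simple way to do that by primary key. It assumes both row sets have already been fetched into lists of dictionaries with the same columns; the function name and sample data are made up for this example.

```python
def compare_by_key(source_rows, target_rows, key="id"):
    """Compare two row lists by primary key and report the differences."""
    source = {row[key]: row for row in source_rows}
    target = {row[key]: row for row in target_rows}
    shared = source.keys() & target.keys()
    return {
        "missing_in_target": sorted(source.keys() - target.keys()),
        "unexpected_in_target": sorted(target.keys() - source.keys()),
        "mismatched": sorted(k for k in shared if source[k] != target[k]),
    }


# Hypothetical samples: row 3 has not arrived yet and row 2 differs.
source_rows = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Cy"},
]
target_rows = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Rob"},
]
report = compare_by_key(source_rows, target_rows)
```

If the target table carries extra columns that the source does not have, drop them from the target rows before comparing, otherwise every row will be reported as mismatched.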

Conclusion:

The combination of Oracle GoldenGate DAA and Apache Iceberg represents a significant leap forward in enterprise data replication and analytics. By enabling real-time, reliable, and scalable data movement from Oracle databases to modern data lake environments, organizations can unlock the full potential of their data, driving faster insights and informed decision-making. If you’re looking to modernize your data architecture and accelerate your analytical capabilities, exploring this powerful integration is a strategic imperative.