Introduction
In the shift to OCI GoldenGate Service, there is a dangerous temptation to assume that “fully managed” means “set it and forget it”. While Oracle manages the compute, patching, and scaling of the deployment, the logical configuration of your data pipeline remains your responsibility. A replication stream can function “normally” for months, only for a specific volume spike or a long-running transaction to suddenly shatter that stability, exposing hidden cracks in your foundation.
A recent Targeted Architecture and Configuration Review of a mission-critical OCI GoldenGate deployment identified several best-practice opportunities across configuration hygiene, performance alignment, and operational resilience. This was not a general health check; it was a forensic-style review of an OCI GoldenGate Microservices Architecture (GGMA) deployment, focusing on service-level configuration, Extract and Replicat parameter files, report files (.rpt), and error logs (ggserr.log).
The objective was clear: move the system from a merely functional state to an optimized, best-practice architecture. While deployment environments can be highly variable, this analysis identified several key best practices for configuration management, resource tuning, and maintaining data integrity in GoldenGate environments.
This post will explore these key lessons, outlining how to optimize your OCI GoldenGate implementation for stability, performance, and long-term maintainability.
1. Configuration Hygiene: The Art of Minimalism
One of the most common issues in long-standing environments is “parameter bloat”—configurations carrying legacy settings, redundant parameters, or tuning attempts that have long since been forgotten. In the review, we observed parameter files where well-intentioned overrides were actually just hardcoding default behaviors, adding noise without value.
The Golden Rule: Default parameter values in GoldenGate are regularly enhanced with each release, Release Update (RU), or patch, often improving performance and reliability. Whenever possible, leverage the current default settings, as they typically reflect the latest best practices and product optimizations. Only add a parameter if you have a specific, documented reason for it.
- Embrace Default Parameter Values: It is crucial to remember that GoldenGate parameters generally have default settings that are already optimized for general use cases. In the review, we identified parameters like LOGALLSUPCOLS and UPDATERECORDFORMAT COMPACT explicitly defined in the configuration. Since Integrated Extract defaults to capturing all supplemental columns and using compact formats automatically, hardcoding them creates unnecessary clutter.
- Best Practice: Keep your configuration files lean. A cleaner file is easier to read, debug, and upgrade. Trust that the default behaviors are designed to address the workload requirements of most environments out of the box.
- Order Matters for Key Parameters: Although GoldenGate syntax offers flexibility, specific key parameters require a particular order in the configuration file for correct functioning.
- Key Example (Encryption): ENCRYPTTRAIL must be declared immediately before the EXTTRAIL parameter it is intended to encrypt.
- Technical Requirement: GoldenGate processes parameters sequentially. Placing ENCRYPTTRAIL before the trail definition ensures the encryption rule is active at the exact moment the trail file is initialized.
- Best Practice: Define dependent parameters in contiguous blocks. Avoid placing unrelated settings between ENCRYPTTRAIL and EXTTRAIL to prevent logical fragmentation during future updates.
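As a minimal sketch (the process name, credential alias, trail name, and schema below are illustrative, not from the reviewed environment), the ordering looks like this in an Extract parameter file:

```
EXTRACT exta
USERIDALIAS ggadmin
-- ENCRYPTTRAIL must immediately precede the trail it encrypts
ENCRYPTTRAIL AES256
EXTTRAIL ea
TABLE hr.*;
```

Keeping the two lines adjacent also makes the dependency obvious to the next administrator who edits the file.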
- Comment Your Parameters: If you configure non-default parameter settings, include a brief explanation or reference—such as a date and support ticket (SR) or bug number. If your organization’s security policies allow, you may include the author’s name or initials; otherwise, strictly adhere to security best practices and avoid recording any Personally Identifiable Information (PII) in the parameter files.
- Real-World Example: A CHUNK_SIZE setting of 10GB was observed. While valid for specific use cases, such non-default values require documentation to ensure the design intent is clear to future administrators.
- Best Practice: Treat your configuration files with the same rigor as application code. Add comments referencing the specific Ticket # or technical justification so future administrators understand the intent.
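A hedged sketch of what a documented override might look like (the date, SR number, and value are hypothetical, and the exact value syntax for CHUNK_SIZE should be confirmed against your release's reference documentation):

```
-- 2024-03-11, SR 3-00000000001 (hypothetical): CHUNK_SIZE raised from
-- the default to accommodate monthly bulk-load transactions;
-- revisit after the next major upgrade
CHUNK_SIZE 10000000000
```

A comment like this turns a puzzling override into a traceable decision.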
2. Performance Tuning: Right-Sizing for the Target
When utilizing Oracle Cloud Infrastructure (OCI), capacity is measured in OCPUs. An OCPU in Oracle Cloud is defined as the CPU capacity equivalent to one physical core of a processor with hyper-threading enabled. It is critical to consider the specific capacity limits of your target database service. Replicat configuration should be aligned with the actual compute and resource capacity available on the target database service to ensure optimal and stable performance.
- Optimizing Parallelism: A Data-Driven Approach. Avoid setting arbitrarily high parallelism values based solely on CPU counts during initial deployment.
- Best Practice Workflow:
- Start with Defaults: Begin with the default parallelism settings to establish a performance baseline.
- Measure and Tune: If the Replicat cannot keep up with transaction volume (lag increases), first investigate if the target Database, network, or storage are performance limiting factors. Only after determining these resources are not the bottleneck should you incrementally increase the parallelism to optimize throughput.
- Choose Your Strategy:
- Fixed Parallelism: Use this only if the workload is strictly predictable and well-known. While this provides stable resource usage, it lacks flexibility; any change in transaction volume requires stopping the process to manually adjust the APPLY_PARALLELISM value.
- Dynamic Parallelism (Recommended for most workloads): Set a specific range (e.g., MIN_APPLY_PARALLELISM 2, MAX_APPLY_PARALLELISM 8) to allow the Replicat to self-tune. This eliminates the need for iterative manual tuning (stopping/starting) as workload patterns fluctuate (for parameter details, see the Oracle GoldenGate documentation: https://docs.oracle.com/en/middleware/goldengate/core/23/reference/apply_parallelism.html).
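A parallel Replicat fragment using a dynamic range might look like the following sketch (the process name, credential alias, schemas, and the 2–8 range are illustrative starting points, not recommendations):

```
REPLICAT repa
USERIDALIAS ggadmin_tgt
-- Allow the Replicat to scale between 2 and 8 appliers as volume fluctuates
MIN_APPLY_PARALLELISM 2
MAX_APPLY_PARALLELISM 8
MAP src_schema.*, TARGET tgt_schema.*;
```

With a MIN/MAX range in place, there is no fixed APPLY_PARALLELISM value to adjust manually as workloads shift.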
- Standardize BATCHSQL Configuration: Enable BATCHSQL for high-throughput replication. This is the recommended standard for most workloads.
- Exception: BATCHSQL should be disabled if the workload generates frequent unique constraint violations or data conflicts. In these scenarios, the Replicat must roll back the batch and retry transactions individually, causing BATCHSQL to perform slower than standard mode.
- Impact on Secondary Processes: Even secondary or maintenance-focused processes (such as those handling bulk delete operations) require this optimization. Running these processes in single-threaded mode without batching often results in significant latency due to increased database context switches and network round-trips.
- Best Practice: Verify BATCHSQL is enabled on all Replicat processes unless a specific technical exclusion applies.
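A sketch of a Replicat parameter file with batching enabled (names are illustrative; remove BATCHSQL only with a documented exception as described above):

```
REPLICAT repa
USERIDALIAS ggadmin_tgt
-- Group compatible operations into arrays to reduce database
-- round-trips; disable only if frequent constraint violations
-- force per-row fallback and make batching counterproductive
BATCHSQL
MAP src_schema.*, TARGET tgt_schema.*;
```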
3. Data Integrity: The “Strict Consistency” Standard
For mission-critical financial systems, data integrity takes precedence over availability. The configuration must ensure that the target data is an exact transactional replica of the source, even if it requires halting the process to investigate anomalies.
- Source Consistency: Snapshot-Based Extraction. To guarantee transactional consistency, the Extract process must reconstruct the data state as of the transaction's SCN (System Change Number), rather than fetching the current row version.
- The Risk: Fetching current data for a past transaction can introduce “future” data (updates committed after the original transaction), breaking transactional integrity.
- Configuration: Use FETCHOPTIONS with the following options:
- USESNAPSHOT: Instructs Extract to build a read-consistent image using the source database's Undo segments.
- NOUSELATESTVERSION: Explicitly prevents the Extract from fetching the current row version if the snapshot reconstruction fails (specifically, when the required Undo information has been overwritten due to retention limits).
- MISSINGROW ABEND: Forces the Extract to terminate if the consistent row image (at the specific SCN) cannot be reconstructed. Without this parameter, the process may default to skipping the operation (logging only a warning), resulting in silent data divergence.
- Prerequisite: Ensure the source database UNDO_RETENTION is sized to accommodate the replication lag, and verify RETENTION GUARANTEE is enabled.
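Putting the three options together, an Extract fragment might read as follows (names are illustrative; confirm the exact FETCHOPTIONS syntax against your release's reference documentation):

```
EXTRACT exta
USERIDALIAS ggadmin
-- Fetch rows as of the transaction's SCN from Undo, never fall back
-- to the current row version, and abend rather than skip silently
FETCHOPTIONS USESNAPSHOT, NOUSELATESTVERSION, MISSINGROW ABEND
TABLE hr.*;
```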
- Target Consistency: Handling “No Data Found”. When applying data to the target, handling ORA-1403 (No Data Found) is a critical decision point.
- Strict Integrity: Do NOT use REPERROR (1403, IGNORE) or DISCARD. These settings allow data divergence by silently skipping missing records.
- Best Practice: Configure REPERROR (DEFAULT, ABEND). This forces the Replicat to stop immediately upon encountering a data mismatch, preventing the issue from compounding.
- Resolution: Administrators must manually investigate the missing record to restore consistency before restarting the process.
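In parameter-file form, the strict setting is a single line (the surrounding process name, alias, and schemas are illustrative):

```
REPLICAT repa
USERIDALIAS ggadmin_tgt
-- Abend on any apply error, including ORA-1403, instead of
-- silently ignoring or discarding the operation
REPERROR (DEFAULT, ABEND)
MAP src_schema.*, TARGET tgt_schema.*;
```

An abend is noisy by design: it surfaces divergence immediately, while IGNORE or DISCARD would hide it until reconciliation.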
4. Operational Hygiene: Observability and Lifecycle Management
Ensuring long-term stability requires a proactive approach to monitoring, maintenance, and architectural standards.
- Lifecycle Management: Patching and Upgrades. While OCI GoldenGate is a managed service, the timing of upgrades remains a customer responsibility to ensure application compatibility. Operational security and stability depend on maintaining a supported software baseline.
- Regular Patching: Schedule quarterly minor version updates to uptake critical bug fixes and security patches.
- Major Upgrades: Plan for major version upgrades (e.g., 19c to 23ai) well before the “Sustaining Support” deadline to avoid running on deprecated code paths.
- Heartbeat Monitoring: Lag Visualization. Standard “Lag at Checkpoint” metrics can be misleading during periods of low transaction volume. To ensure accurate observability:
- Enable Heartbeats: Leverage the native ADD HEARTBEATTABLE functionality. This generates periodic dummy transactions, forcing the checkpoint to advance even during idle periods.
- Benefits: This provides a true “end-to-end” latency metric and prevents false positives where lag appears to spike simply because no new data has arrived to update the checkpoint. It integrates directly with the LAG REPLICAT command and handles automated history purging.
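From the Admin Client (or an OBEY script), enabling and checking the heartbeat is a short sequence like the following sketch (the credential alias and Replicat name are illustrative; ADD HEARTBEATTABLE also accepts frequency and retention options):

```
-- Connect to the target database and create the heartbeat objects
DBLOGIN USERIDALIAS ggadmin_tgt
ADD HEARTBEATTABLE
-- Check true end-to-end latency for a given Replicat
LAG REPLICAT repa
```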
- Process Visibility: Throughput Metrics. While Heartbeat Tables measure latency, understanding throughput volume is critical for capacity planning.
- Best Practice: Configure REPORTCOUNT in your Extract and Replicat parameter files to print processing statistics directly to the report file (.rpt).
- Configuration: Use the RATE option to calculate processing speed.
- Example: REPORTCOUNT EVERY 15 MINUTES, RATE.
- Why It Matters: This creates an immutable record of processing spikes, allowing administrators to correlate volume surges with system resource consumption.
- Architectural Standards: The MAA Model for Hybrid Topologies. In hybrid and multi-cloud topologies (e.g., AWS source to OCI target), the physical placement of the GoldenGate Hub (GGHub) is the single most critical factor for performance. Since OCI GoldenGate cannot capture locally inside a third-party cloud, a GGHub must be deployed in the same region or datacenter where the source database resides. This ensures the capture process or delivery process operates within the required low-latency threshold.
- The “Remote Delivery” Anti-Pattern: Never configure a Replicat to apply transactions across a Wide Area Network (WAN).
- The Risk: Replicat processes are highly sensitive to network latency. According to Oracle MAA standards, the Primary or active GGHub must reside in the same data center as the target database to ensure a round-trip latency of 4ms or less. Exceeding this limit results in performance degradation, as the process waits for SQLNET acknowledgments on every batch.
- The MAA Standard: Split-Hub Architecture. For cross-region or multi-cloud replication, the Split-Hub design (Source GGHub + Target OCI GGS Deployment) is the robust standard when latency exceeds thresholds.
- Extract Thresholds: While a remote Extract (Single Hub) is technically supported if latency is <90ms, it leaves the pipeline vulnerable to WAN fluctuations.
- The Decoupling Advantage: By placing a specific GoldenGate Hub in the source cloud/region, you ensure the Extract operates locally (<4ms latency), completely decoupling capture performance from the cross-cloud network.
- Distribution Path: Data is then transported asynchronously via the Distribution Server, which is optimized for resilience against high network jitter.
For detailed planning guidelines, see the Oracle MAA documentation: Planning GGHub Placement in the Platinum MAA Architecture.
Conclusion
Transforming an OCI GoldenGate implementation from a functional state to an optimized standard requires a disciplined focus on configuration best practices.
- Configuration Hygiene: Remove parameter bloat and rely on optimized defaults.
- Performance: Tune parallelism based on target database capacity and use BATCHSQL.
- Data Integrity: Enforce strict consistency using Snapshot-based extraction and explicit ABEND handling.
- Architecture: Adopt the MAA Split-Hub standard to decouple Capture from Delivery; ensure the Source Hub resides within 90ms of the source, while the Target Hub remains within 4ms of the target to prevent performance degradation.
By standardizing these pillars, the deployment evolves from a baseline configuration into a resilient, enterprise-grade architecture.
