Unexpected restarts of an OCI PostgreSQL Database System can be concerning. However, in most cases, these restarts are part of Oracle Cloud Infrastructure (OCI) managed operations designed to protect availability, performance, and data integrity. This document explains the primary reasons behind unexpected restarts, how OCI handles such situations, and what you should monitor to maintain optimal database operations.

1. Overview

In the OCI Managed PostgreSQL Service, database system unavailability typically results from the following:

  1. Failover between nodes (Node A to Node B)
    In a multi-node high availability (HA) database system, failover is triggered when the primary node becomes unresponsive or experiences resource issues such as high CPU or memory utilization. The failover process promotes the standby node to primary, ensuring minimal disruption to database availability. This mechanism is designed to maintain continuous service and protect against unplanned downtime.
  2. Host replacement
    Host replacement typically occurs in single-node database systems or in situations where failover to a standby node is not possible or unsuccessful. During host replacement, OCI provisions a new compute host for the database system and migrates the workload to this new host. This process ensures that the database system can recover from hardware or host-level failures, although it may take slightly longer than a failover. 
  3. Applying configuration changes
    The third reason for an OCI PostgreSQL DB system restart is the application of parameter configuration changes. If the customer applies a configuration that includes parameters requiring a database server restart, the DB system restarts automatically as part of the apply operation.

The exact restart behavior depends on the topology of the database system (single-node vs. multi-node) and the health of the underlying host and database processes. Factors such as system resource utilization, database unresponsiveness, parameter configuration changes, or hardware-level failures determine whether a failover, a host replacement, or a restart is initiated.

2. Scenario 1: Failover in a Multi-Node (HA) PostgreSQL Database System

2.1 OCI PostgreSQL Failover Overview

OCI PostgreSQL failover is an automatic high-availability (HA) mechanism that ensures continuity of database service by switching operations from the primary node to a healthy standby replica whenever the primary node becomes unresponsive or unavailable. This mechanism is designed to minimize downtime and maintain service availability without requiring manual intervention from the customer.

2.2 When Does Failover Occur?

A failover from Node A (Primary) to Node B (Standby) may be triggered under the following conditions:

  • Health check failures: The health check agent does not receive a response from the PostgreSQL DB system node within the expected time, regardless of the underlying cause.
  • Resource exhaustion: The database node becomes unresponsive due to excessive workload, such as sustained 100% CPU or memory utilization, which prevents the database from responding to queries.
  • Scheduled maintenance: Failover may occur as part of planned maintenance activities performed within the customer-defined maintenance window, ensuring minimal impact during updates or upgrades.
  • Hardware-level failures: Issues such as disk failures, block storage problems, or network connectivity disruptions can also trigger failover to protect the database from extended downtime.

2.3 Failover Behavior

  • Any disruption affecting the DB system agent will first trigger an automatic failover to the standby node. This ensures the database service remains available even when the primary node becomes unavailable.
  • If the failover completes successfully, the service is restored with minimal disruption, allowing database operations to continue with little impact.
  • From the customer’s perspective, this may appear as a brief outage or a temporary connection reset, but the underlying database continues to function and serve requests through the standby node. 

This automatic failover mechanism is a core component of OCI PostgreSQL’s high-availability design, providing resiliency, reliability, and uninterrupted database access for critical workloads.
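Because a failover surfaces to applications as a brief outage or connection reset, client code should reconnect with backoff rather than fail hard. The sketch below is illustrative only: `connect_with_retry` and the `fake_connect` stand-in are hypothetical helpers, and in practice the callable would wrap your driver's connect call (for example, psycopg2) pointed at the primary endpoint FQDN.

```python
import time

def connect_with_retry(connect, attempts=5, base_delay=1.0):
    """Retry a database connection with exponential backoff.

    `connect` is any zero-argument callable that returns a connection
    or raises on failure. During a failover, the first one or two
    attempts may fail while the standby node is being promoted.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except Exception as exc:  # real code would catch the driver's OperationalError
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise last_error

# Simulate a failover window: the first two attempts fail, the third succeeds.
calls = {"n": 0}
def fake_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server closed the connection unexpectedly")
    return "connection"

conn = connect_with_retry(fake_connect, base_delay=0.01)
```

Because the retry wrapper is driver-agnostic, the same pattern applies whether the interruption was a failover, a host replacement, or a configuration-driven restart.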

3. Scenario 2: Host Replacement

In OCI, host replacement is triggered when a database instance node becomes faulty and does not recover successfully after multiple automated retry attempts. The OCI health check agent continuously monitors the database instance and attempts to restore it to a healthy state. If the database instance fails to come back online or remains unresponsive after repeated recovery attempts, the agent determines that the node is unhealthy and initiates a host replacement, which involves provisioning a new compute host for the database system.

The host replacement process typically takes approximately 7 to 10 minutes. During this time, the affected database instance is rebuilt and the PostgreSQL service is restarted on the newly provisioned host under the managed PostgreSQL service.

3.1 When Does Host Replacement Occur?

Host replacement may be triggered under the following conditions:

  • Single-node database system limitation: A single-node DB system does not have high availability (HA) or a replica node. If any interruption occurs on the primary node, a failover is not possible. As a result, the system initiates a host replacement, which is a more time-consuming recovery process.
  • Resource exhaustion on a single node: When CPU and memory utilization reach 100%, the database system may become unresponsive. In a single-node DB system, this unresponsiveness cannot be mitigated through failover, so the health check agent triggers a host replacement to restore the service.
  • Planned maintenance activities: Host replacement is initiated during scheduled maintenance when rebuilding or replacing the underlying compute host is required.
  • Failover failure in HA setup: In a two-node database system, host replacement may be triggered if a failover attempt fails or cannot be completed successfully.
  • Infrastructure-level issues: The OCI PostgreSQL health agent detects underlying infrastructure problems, including hardware faults, persistent block storage errors, or network isolation or connectivity issues. 

In these scenarios, OCI automatically replaces the underlying compute host and restarts the PostgreSQL service on a new host, ensuring the database system is restored to a healthy operational state.
 

Important Note about IP Addresses:
When a host replacement occurs, the newly provisioned host is assigned a different private IP address. Connecting directly to this private IP address is therefore not recommended, because it changes after every host replacement.
You should always connect to the database using:

  • The Reader Endpoint (FQDN) for replica or read-only connections
  • The Primary Endpoint (FQDN) for primary database connections

From a connectivity perspective, host replacement has no impact, as the primary and reader FQDNs remain unchanged and consistent throughout the process.
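One way to enforce this in application code is to build connection strings only from the endpoint FQDNs. The helper below is a minimal sketch; the hostname shown is a made-up placeholder, not a real OCI endpoint format.

```python
def build_dsn(host, dbname, user, port=5432, sslmode="require"):
    """Build a libpq-style connection string keyed on the endpoint FQDN,
    never on a node's private IP (which changes after host replacement)."""
    return f"host={host} port={port} dbname={dbname} user={user} sslmode={sslmode}"

# Hypothetical primary endpoint FQDN, for illustration only:
primary_dsn = build_dsn("primary.dbsystem.example.oraclevcn.com", "app_db", "app_user")
```

Keeping the FQDN in a single configuration value (rather than scattering IPs through application code) means host replacements require no connection-string changes at all.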

4. Scenario 3: DB System Restart Due to Configuration Changes

The third and final reason for an OCI PostgreSQL DB system restart is the application of parameter configuration changes. If the customer applies a configuration that includes parameters requiring a database server restart, the DB system will automatically restart as part of the apply operation.

If the customer applies changes using the copy configuration method, all parameters in the new configuration—both existing and newly added—are considered during the apply operation. If any of these parameters require a restart, the DB system will restart, regardless of whether the parameter was recently added or already existed.

To avoid unnecessary restarts, if the customer only needs to apply parameters that support reload (without restart), they should use the OCI API to apply only those specific parameters. This approach ensures the changes take effect without triggering a DB system restart.
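You can check in advance whether a given parameter needs a restart by inspecting its `context` in PostgreSQL's `pg_settings` view: `postmaster` parameters only take effect after a server restart, while `sighup`, `user`, and `superuser` parameters can be applied with a reload. The small classifier below is a sketch of that rule (the SQL string is included for reference and would be run against the DB system itself):

```python
# SQL to run on the DB system to see which parameters are restart-only:
RESTART_CHECK_SQL = """
SELECT name, context, context = 'postmaster' AS needs_restart
FROM pg_settings
ORDER BY name;
"""

def requires_restart(context):
    """Classify a pg_settings context value. 'postmaster' parameters
    (e.g. shared_buffers, max_connections) only take effect after a
    server restart; 'sighup', 'user', and 'superuser' parameters can
    be applied with a reload."""
    return context == "postmaster"
```

For example, `requires_restart("postmaster")` is true (as for `shared_buffers`), while `requires_restart("user")` is false (as for `work_mem`).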

Therefore, we recommend that customers apply such configuration changes during non-peak or off-business hours to minimize any potential impact on application availability. Customers should plan and schedule these changes accordingly.

For more details, please refer to the documentation on creating and applying configurations.


5. Common Indicators in PostgreSQL Logs

When a failover or host replacement occurs, PostgreSQL logs may show entries such as:

  • fast shutdown request
  • terminating connection due to administrator command
  • database system is shutting down
  • database system is starting up

These log messages are expected during controlled shutdown and startup sequences initiated by the OCI platform. They typically indicate that OCI is stopping the PostgreSQL service as part of a failover, host replacement, or maintenance operation, rather than signaling an internal database error.

Such entries do not imply a database-level bug or corruption. Instead, they reflect OCI-initiated lifecycle actions designed to protect service availability and ensure recovery from infrastructure or system-level events.

To gain better visibility into database events and the actions performed on the DB system, we recommend enabling OCI PostgreSQL service logs.
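When triaging a restart, it can help to scan the PostgreSQL logs programmatically for the expected lifecycle markers listed above. The `find_lifecycle_events` helper below is an illustrative sketch, not part of any OCI tooling:

```python
# The log messages that typically accompany an OCI-initiated
# shutdown/startup sequence (failover, host replacement, maintenance).
LIFECYCLE_MARKERS = (
    "fast shutdown request",
    "terminating connection due to administrator command",
    "database system is shutting down",
    "database system is starting up",
)

def find_lifecycle_events(log_lines):
    """Return (line_number, marker) pairs for log entries matching the
    expected platform-initiated shutdown/startup messages."""
    events = []
    for i, line in enumerate(log_lines, start=1):
        for marker in LIFECYCLE_MARKERS:
            if marker in line:
                events.append((i, marker))
    return events

sample = [
    "LOG:  received fast shutdown request",
    "FATAL:  terminating connection due to administrator command",
    "LOG:  database system is starting up",
]
events = find_lifecycle_events(sample)
```

If log entries around a restart match only these markers, the event was most likely a controlled, platform-initiated action rather than a database-level fault.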

6. Best Practices and Recommendations

To better understand, troubleshoot, and proactively manage database restarts, failovers, or host replacement events, we strongly recommend continuous monitoring and alerting on key system metrics. This helps identify early warning signs and prevents unplanned disruptions.

6.1 Monitor Resource Utilization

Log in to the OCI Console, navigate to Monitoring, and closely track the following OCI metrics, especially around the time of any database shutdown or restart events:

  • CPU utilization
  • Memory usage
  • Storage IOPS and throughput

Recommendations:

  • Scale resources or change shape:
    Monitor resource utilization trends over time. If sustained high usage is observed, analyze workload patterns and proactively scale up database resources or change the DB system shape to ensure sufficient headroom.
  • Adjust IOPS and throughput:
    If IO performance becomes a bottleneck, IOPS and throughput can be adjusted directly from the DB System Details page, based on workload requirements.

For more details, refer to the Monitoring Database Systems section in the OCI PostgreSQL documentation.

6.2 Review Maintenance Window Timing

Check whether the unexpected restart or interruption occurred within the configured maintenance window.
 

  • If the event falls within the maintenance window, it is likely part of planned maintenance and should be considered expected behavior.
  • Review and validate your maintenance window configuration in the OCI Console to ensure it aligns with your business requirements and low-traffic periods. 

For more information, see Managing Maintenance Windows in the OCI PostgreSQL documentation.

6.3 Configure Proactive Alerts

Configure OCI Monitoring alarms to receive early notifications when system resources approach critical thresholds:

  • CPU usage exceeding defined limits
  • Memory usage nearing capacity
  • Abnormal IOPS or storage throughput behavior 

Early alerts enable proactive corrective actions, such as scaling resources or optimizing workloads, before the database becomes unresponsive or triggers failover or host replacement events.
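OCI Monitoring alarms are defined by a Monitoring Query Language (MQL) expression of the form `Metric[interval].statistic() > threshold`. The helper below sketches how such a query string is composed; the metric names used (`CpuUtilization`, `MemoryUtilization`) are assumptions about the OCI PostgreSQL metric namespace, so verify the exact names in the Monitoring console before creating alarms.

```python
def alarm_query(metric, threshold, interval="5m", statistic="mean"):
    """Compose an OCI Monitoring (MQL) alarm query string.

    Metric names vary by namespace; confirm them in the OCI console
    before using a query like this in a real alarm definition.
    """
    return f"{metric}[{interval}].{statistic}() > {threshold}"

cpu_alarm = alarm_query("CpuUtilization", 90)
memory_alarm = alarm_query("MemoryUtilization", 85)
```

The resulting strings (e.g. `CpuUtilization[5m].mean() > 90`) can then be supplied as the query when creating an alarm in the OCI Console or via the Monitoring API.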

For more information, refer to Creating Alarms in the OCI Monitoring documentation.

7. Conclusion

Unexpected PostgreSQL database restarts in OCI are most commonly the result of automated failover or host replacement mechanisms. These operations are designed to maintain service availability and protect your workloads from prolonged outages or infrastructure failures.

By understanding these scenarios and proactively monitoring system metrics, you can reduce surprises, improve stability, and better interpret PostgreSQL logs during such events.

OCI Managed PostgreSQL is a fully managed service, and such automated actions are a core part of ensuring reliability, resilience, and performance at scale.
 


Additional Resources