At Oracle, we build world-class database systems and services such as OCI Exadata Database Service, OCI MySQL Database Service, and MySQL HeatWave, a cloud-scale service that we recently covered in the First Principles series. PostgreSQL is a popular open-source database, and our goal is to give OCI customers a choice of databases for running their cloud workloads. With the recently announced OCI Database with PostgreSQL, we brought Oracle's database expertise to bear on building a world-class, cloud-scale PostgreSQL service. This blog post demystifies how Oracle adapted PostgreSQL into a cloud database service.
PostgreSQL is popular database software that provides atomicity, consistency, isolation, and durability, also known as the ACID properties. At Oracle Cloud Infrastructure (OCI), we are opinionated about what makes a great cloud database experience: customers should be able to add or remove database instances on demand, scale compute and storage independently of each other, and rely on a service architected to deliver best-in-class price performance without risk of data loss. These are the basic premises for any cloud-based database service. Here are the key properties we need from an OCI PostgreSQL service:
Figure 1: Customer Managed PostgreSQL Architecture
From here on, we will refer to open-source PostgreSQL as "vanilla PostgreSQL". Although vanilla PostgreSQL is a powerful and popular database, here are some of the challenges that customers face with it:
OCI Database with PostgreSQL has solved these challenges.
Now we'll explore how the upleveled architecture of OCI Database with PostgreSQL solves these challenges and makes it simple to run and manage PostgreSQL in OCI. In OCI Database with PostgreSQL, we push the problems of replication and durability down to the new Database Optimized Storage (DbOS) layer, which was purpose-built to enable a high-scale, highly available, high-performance database service. DbOS offers highly durable network-attached storage in which data blocks are replicated across multiple availability domains (ADs) in a three-AD region. In a single-AD region, data is replicated across multiple fault domains. All PostgreSQL nodes in a cluster access the same network-attached storage, so each standby replica no longer has to maintain its own copy of the database: the primary instance writes to the shared storage, while standby replicas read from that same shared storage and serve user queries.
Figure 2: Vanilla PostgreSQL vs. OCI Database with PostgreSQL
This new shared-storage architecture used in OCI Database with PostgreSQL offers a number of advantages in the form of safety, flexibility, efficiency, and performance. Although OCI Database with PostgreSQL has a completely different storage architecture underneath, it's still fully compatible with vanilla PostgreSQL, so you can lift-and-shift your existing PostgreSQL workloads to OCI Database with PostgreSQL, or move them back again, with ease. DbOS is a shared filesystem that leverages high-performance OCI Block Storage capabilities like built-in replication.
Let's detail why the newly embedded DbOS layer is such a powerful advantage for PostgreSQL users:
In our experiments, OCI Database with PostgreSQL’s built-in replication across ADs was more than twice as fast as synchronous replication in vanilla PostgreSQL.
Vanilla PostgreSQL on Linux uses a standard filesystem supported by the kernel, typically ext4 or XFS. Such filesystems are not designed for a distributed setup like OCI Database with PostgreSQL, in which all the nodes in the cluster share the same underlying storage. Let's review some critical aspects of the filesystem implementation used in OCI Database with PostgreSQL, as illustrated in Figure 3.
Figure 3: OCI Database with PostgreSQL Architecture
Figure 4: Vanilla PostgreSQL Replication
In vanilla PostgreSQL, an LSN (log sequence number) identifies a position in the write-ahead log (WAL), and a replica's ReplayLSN is the LSN up to which it has replayed WAL records as part of the replication process. For a replica to satisfy a read query, it must be able to read the version of each page consistent as of its current ReplayLSN. This is critical when traversing index structures such as B-trees, which can undergo concurrent structural modifications such as splits and merges. In vanilla PostgreSQL, each replica has its own storage, which means that each replica can maintain read consistency independently. Figure 4 shows that vanilla PostgreSQL stores data separately for the primary and the replica databases: the primary has data persisted at LSN 100 and is writing the latest data at LSN 110, while the replica is trailing behind at LSN 90. When a read request for Block 0 goes to the replica, it reads its latest data at LSN 90.
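The per-replica storage model can be sketched in a few lines. This is a hypothetical illustration (the `Node` class and its methods are invented for this sketch, not PostgreSQL source): each node persists WAL changes into its own private storage and serves reads from whatever it holds at its current ReplayLSN.

```python
class Node:
    """A database node with its own private storage, as in Figure 4."""

    def __init__(self):
        self.storage = {}     # block number -> (lsn, data)
        self.replay_lsn = 0

    def apply_wal(self, lsn, block, data):
        # Replaying a WAL record persists the new block version locally
        # and advances this node's ReplayLSN.
        self.storage[block] = (lsn, data)
        self.replay_lsn = lsn

    def read(self, block):
        # A read simply returns this node's own latest copy.
        return self.storage[block]


primary, replica = Node(), Node()
for lsn, data in [(90, "v90"), (100, "v100"), (110, "v110")]:
    primary.apply_wal(lsn, 0, data)
# The replica lags: it has only replayed WAL up to LSN 90.
replica.apply_wal(90, 0, "v90")

print(primary.read(0))   # (110, 'v110')
print(replica.read(0))   # (90, 'v90'), consistent as of its ReplayLSN
```

Because each replica owns a full copy of the data, consistency is easy, but storage cost multiplies with every replica added.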
Now, let's see how the same situation is handled by DbFS, the database-aware filesystem that sits on top of the shared storage layer.
Figure 5: OCI Database with PostgreSQL - Primary Ahead of Replica scenario
To allow a replica to read a page at a specific LSN, DbFS keeps multiple versions of the page at various LSNs. When the primary flushes a page, either as part of a checkpoint or a cache eviction, DbFS doesn't overwrite the previous copy if a replica still needs it. In a scheme similar to a log-structured filesystem, blocks are written to new offsets, so multiple versions of each block remain available to the read replicas. Older versions are eventually discarded once DbFS determines that all replicas have moved past their LSNs and the blocks are no longer needed.
In Figure 5, the primary has first persisted Block 0 at LSN 90 and subsequently at LSN 100. Because the replica is still at LSN 90, the primary writes LSN 100 into a separate location and doesn’t overwrite the Block 0 at LSN 90. Storing separate versions of Block 0 for primary and replica helps ensure read consistency for each replica.
Figure 6: OCI Database with PostgreSQL - Primary has not persisted latest data scenario
When the primary has not yet persisted the page as of the replica's ReplayLSN, the replica's read path fetches an older version of the page from shared storage and replays, on the fly, any WAL records applicable to that page, bringing it up to date as of the replica's current ReplayLSN. To make this WAL replay efficient, each replica maintains an in-memory index of all the WAL records since the last checkpoint, keyed by page number.
In Figure 6, the primary has persisted Block 0 at LSN 50, while its in-memory copy of Block 0 is at LSN 110. The replica has successfully replayed WAL up to LSN 90 in its in-memory state, so it issues a read request for Block 0 to DbOS at LSN 90. DbOS currently has the data persisted only at LSN 50, so it applies WAL records up to LSN 90 and returns the LSN 90 version of the data to the replica. With the WAL index, DbOS ensures that even if the primary has not persisted the latest data to storage, it can still return consistent data to the replica.
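The Figure 6 read path can be sketched as follows. The data layout, WAL-record format, and function name here are invented for illustration; the point is only the mechanism: start from the base page on shared storage and apply indexed WAL records up to the replica's ReplayLSN.

```python
# Shared storage: the last version of Block 0 the primary flushed (LSN 50).
storage = {0: (50, {"col": "a"})}

# In-memory WAL index kept since the last checkpoint, keyed by page number:
# page -> [(lsn, change), ...] in LSN order.
wal_index = {0: [(70, ("col", "b")), (90, ("col", "c")), (110, ("col", "d"))]}


def read_page(page, replay_lsn):
    """Return the page as of replay_lsn by replaying WAL over the base copy."""
    base_lsn, data = storage[page]
    data = dict(data)  # work on a copy; shared storage stays untouched
    # Replay every indexed WAL record newer than the stored version
    # but not past the reader's ReplayLSN.
    for lsn, (key, value) in wal_index.get(page, []):
        if base_lsn < lsn <= replay_lsn:
            data[key] = value
    return data


print(read_page(0, replay_lsn=90))   # {'col': 'c'}: consistent as of LSN 90
print(read_page(0, replay_lsn=50))   # {'col': 'a'}: nothing to replay
```

Note that the record at LSN 110 is skipped: the replica must never see data newer than its own ReplayLSN, even though the primary has already written it.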
For high availability, OCI Database with PostgreSQL automatically detects unhealthy primary instances and fails over to an existing read replica, or launches a new database instance if there are no read replicas. The failover process must ensure not only that the unhealthy primary stops making any new changes to shared storage, but also that any in-flight I/O requests to shared storage are canceled. In some failure scenarios that involve network partitioning, the unhealthy primary might not be reachable externally but can still modify shared storage. To handle these complex cases, DbFS uses block-layer reservation (NVMe reservation or SCSI-3 persistent reservation) to fence out the old primary and ensure that there is a single primary at any given time. With persistent or NVMe reservations, only the node that holds the reservation is allowed to write to shared storage. During failover, the new primary obtains the reservation and preempts the old primary. When this is done, the storage subsystem rejects all requests from the old primary.
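The fencing semantics can be illustrated with a toy model. This sketch invents the `SharedStorage` class and its methods; it only mirrors the reservation behavior described above (hold-to-write, preempt-on-failover), not a real NVMe or SCSI-3 implementation.

```python
class SharedStorage:
    """Toy model of reservation-based fencing on a shared block device."""

    def __init__(self):
        self.holder = None   # the node currently holding the reservation

    def acquire_reservation(self, node, preempt=False):
        # Mirrors persistent-reservation semantics: a new holder can
        # preempt the old one during failover.
        if self.holder is None or preempt:
            self.holder = node
            return True
        return False

    def write(self, node, block, data):
        # Only the reservation holder may modify shared storage.
        if node is not self.holder:
            raise PermissionError("write rejected: node is fenced out")
        return True


storage = SharedStorage()
old_primary, new_primary = object(), object()
storage.acquire_reservation(old_primary)

# Failover: the new primary preempts the reservation.
storage.acquire_reservation(new_primary, preempt=True)

storage.write(new_primary, 0, b"data")        # succeeds
try:
    storage.write(old_primary, 0, b"stale")   # the old primary is fenced
except PermissionError as e:
    print(e)                                  # write rejected: node is fenced out
```

This is why a partitioned-but-alive old primary cannot corrupt shared storage: the storage subsystem itself, not a network health check, enforces the single-writer invariant.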
Besides the shared storage optimizations, OCI Database with PostgreSQL implemented the following optimizations to further improve performance.
Atomic writes: DbOS implements optimizations for known database performance risks such as elimination of "torn writes". Typically, most databases need some sort of protection against torn writes, which happen when the database uses a page size (PostgreSQL uses 8 KB) that doesn't match the "atomic write unit" size of the underlying storage (typically 512 B or 4 KB). For example, PostgreSQL first writes an entire 8 KB page to the WAL if it's the first modification to the page since the last checkpoint, and then flushes the page to disk. If the page write is torn, then PostgreSQL falls back to using the full page it wrote previously in the WAL and no harm is done. But this protection comes at a price: it causes ballooning of the WAL, and the problem is exacerbated by frequent checkpoints, which are needed to minimize recovery time during unplanned failovers. We implemented atomic write support for PostgreSQL pages in DbOS. The storage layer never overwrites an existing page. Instead, it uses a log-structuring technique to always write pages to a new location on disk and maintains a mapping layer from logical file offset to disk location. Older versions of the pages are periodically garbage collected. This avoids writing each page twice.
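A back-of-the-envelope calculation shows why full-page writes balloon the WAL. The numbers below (record size, pages touched, updates per page) are illustrative assumptions, not measurements: the first change to each page after a checkpoint logs the entire 8 KB page instead of a small WAL record.

```python
page_size = 8 * 1024       # PostgreSQL page size
record_size = 128          # typical small WAL record (assumed)
pages_touched = 10_000     # distinct pages modified between checkpoints
updates_per_page = 5

# With full_page_writes: the first touch of each page logs a full 8 KB
# image; the remaining updates log ordinary 128 B records.
wal_with_fpw = pages_touched * (page_size + (updates_per_page - 1) * record_size)

# With atomic writes at the storage layer, no full-page images are needed.
wal_without_fpw = pages_touched * updates_per_page * record_size

print(f"{wal_with_fpw / wal_without_fpw:.1f}x more WAL with full-page writes")
# 13.6x more WAL with full-page writes
```

More frequent checkpoints make this worse: every checkpoint resets the "first modification" tracking, so the same hot page pays the 8 KB penalty again, which is exactly the ballooning that storage-level atomic writes eliminate.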
Optimized page cache: OCI Database with PostgreSQL uses a purpose-built caching layer, unlike vanilla PostgreSQL, which relies on the generic Linux kernel page cache. OCI's page-cache implementation has many optimizations, such as the following:
Storage-level backups: In vanilla PostgreSQL, maintaining database backups involves copying WAL to object storage and taking periodic snapshots of the filesystem, a process that consumes both network bandwidth and CPU on the primary node. OCI Database with PostgreSQL delegates backups to the storage layer, eliminating that network and CPU overhead.
As discussed above, the OCI Database with PostgreSQL service provides significant advantages in cost, performance, scale, availability, and durability. Most of these benefits stem from DbOS and DbFS, which are purpose-built to make PostgreSQL work more effectively at cloud scale.
Oracle Cloud Infrastructure (OCI) Engineering handles the most demanding workloads for enterprise customers, which has pushed us to think differently about designing our cloud platform. We have more of these engineering deep dives as part of this First Principles series, hosted by Pradeep Vincent and other experienced engineers at Oracle.
For more information, see the following resources:
Deepak Agarwal is a Vice President and Distinguished Engineer at Oracle Cloud Infrastructure (OCI). He has spent 23 years in the tech industry as a software engineer and architect working for Yahoo!, Microsoft, and AWS. For the past 8 years, Deepak has been enriching Oracle's distributed systems and storage to meet cloud customer needs. In his spare time, you can find Deepak scaling mountains or riffing on his piano.
Pradeep Vincent is the Chief Technical Architect and Senior Vice President at Oracle Cloud Infrastructure (OCI). He is a technology and software architect with more than 20 years of experience in tech companies such as Oracle, AWS, and IBM. He has a deep understanding of Cloud Infrastructure, Compute, Storage and Networking. Pradeep has been with Oracle for more than eight years leading a team of architects and software engineers building Oracle’s Public Cloud. He also leads OCI’s Architecture and Engineering Community initiatives.