There has been a lot of hype since Intel introduced Optane persistent memory (PMEM). Cutting through the confusing vendor assertions requires a brief overview of what PMEM is and what it is not.
Optane PMEM comes in two flavors. The first is a non-volatile dual in-line memory module (NVDIMM). The second is an NVMe SSD form factor more commonly called storage class memory (SCM). The value of Optane PMEM comes from lower latency and higher performance than NAND Flash, much greater write endurance (wear-life), and the same non-volatility as flash: data remains unchanged when power is lost. Optane PMEM is, however, noticeably slower than standard volatile DRAM.
Optane PMEM NVDIMMs have two modes, Memory Mode and Application Direct Mode. Application Direct Mode, or AppDirect, requires the application and/or the file system to be modified to move data directly in and out of Optane PMEM. Not many applications or file systems do this yet. The new Oracle Exadata X8M is one of the few. In Memory Mode, the Optane PMEM sits behind DRAM. The DRAM acts as a first-in-first-out cache to Optane PMEM, and applications have no control over data placement. Memory Mode is the more common implementation because it requires no application changes. It is how Optane PMEM is typically implemented in servers.
Many storage vendors have jumped on the Optane PMEM bandwagon, implementing SCM SSDs as caching drives in their storage systems. SCM allows them to claim lower latencies and more IOPS. However, that performance is significantly lower than that of PMEM NVDIMMs and is likely to be inconsistent under load. More importantly, it is not going to be as fast, or as consistently fast, as Exadata X8M. Here’s why.
The Exadata Database Server running Oracle Database 19c accesses the Optane PMEM directly in the Exadata Storage Servers. It leverages RDMA over Converged Ethernet (RoCE) at 100Gbps in the internal Exadata interconnect, bypassing the network, storage controller, IO software, interrupts, and context switches. From this architecture, Exadata X8M derives a consistent latency of 19µs or less and as many as 16 million 8K SQL IOPS per rack. Many database functions and all storage functions are handled by the Exadata Storage Servers, freeing up the Exadata Database Servers for more performance. All Exadata Database Servers can access ALL Exadata Storage Servers’ PMEM. Each Storage Server can have up to 1.536 TB of Optane PMEM NVDIMMs, with up to 21.5 to 27 TB per Exadata rack. All PMEM is auto-mirrored for resiliency.
Contrast the Exadata PMEM architecture with the Optane PMEM storage class memory (SCM) approach common to standalone storage systems, or with the standalone database server utilizing Optane PMEM NVDIMMs in Memory Mode rather than Application Direct Mode. These implementations realize longer latencies for IO operations or are restricted by limited PMEM scalability.
The standalone storage system has a much different path and lower performance characteristics. The database server IO connects to the storage system over an external switched network. It will likely, but not necessarily, utilize NVMe-oF on Fibre Channel or Ethernet to get the lowest possible latency from that network. NVMe-oF utilizes RDMA, which enables two computers on the same network to exchange memory contents without involving the processors. RDMA is designed to minimize network latencies; however, the result depends on several factors, one of which is the network chosen. RDMA on InfiniBand, Fibre Channel (FC), and Ethernet utilizing RoCE is built into the NIC silicon, thus providing the lowest latencies. The NVMe/TCP version on Ethernet is software-driven (slower than silicon-based) and, because it runs on a layer 3 network, has the potential for network congestion and inconsistent performance. All of this causes higher latency. This is why NVMe-oF utilizing TCP is not recommended for applications requiring consistently high performance.
Most storage systems are focused on NVMe-oF on 32Gbps FC and, in some cases, RoCE on 40Gbps Ethernet. Only two are utilizing InfiniBand. Because the layer 2 network fabric is shared, it can develop hot spots that slow performance. That is just one source of variable performance. There are others.
When the database IO hits the storage system, it must go through multiple layers. RDMA only bypasses the primary CPU to go directly to DRAM. From there, the IO passes through additional sources of latency, including the PCIe controller and the PCIe bus, before reaching the SCM SSD cache. The SCM storage architecture cannot avoid storage network or fabric issues, system interrupts, context switches, DRAM congestion, or storage system processing slowdowns caused by storage-intensive functions such as snapshots, thin provisioning, data deduplication, compression, replication, and RAID rebuilds.
As a result, latencies are very inconsistent and generally around 100µs, which is at minimum more than 5X slower than Exadata X8M.
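The "more than 5X" figure follows directly from the two latency numbers quoted above. The short sketch below simply reproduces that arithmetic. The constant and function names are illustrative, and the values are the ones cited in this paper, not independent measurements.

```python
# Latency figures as quoted in the text; illustrative, not measured here.
EXADATA_X8M_LATENCY_US = 19.0    # "19µs or less" for Exadata X8M
SCM_STORAGE_LATENCY_US = 100.0   # "around 100µs" for SCM-based standalone storage


def slowdown_factor(baseline_us: float, other_us: float) -> float:
    """Return how many times slower `other_us` is than `baseline_us`."""
    if baseline_us <= 0:
        raise ValueError("baseline latency must be positive")
    return other_us / baseline_us


ratio = slowdown_factor(EXADATA_X8M_LATENCY_US, SCM_STORAGE_LATENCY_US)
print(f"SCM storage is about {ratio:.1f}x slower")  # prints "SCM storage is about 5.3x slower"
```

Because 19µs is quoted as an upper bound for Exadata X8M, the real ratio would be at least this large whenever the standalone system sits at or above 100µs.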
Some storage vendors utilize SCMs as a faster storage tier than NAND Flash SSDs but slower than DRAM. They utilize AI-ML to determine which data blocks go where based on past history.
However, this methodology is flawed. Data access patterns change from day to day. The blocks of data that are hot today likely will not be hot tomorrow. In addition, storage blocks do not correlate with data hotness in the Oracle Database, which is determined at the table, index, or partition level. The Oracle Database knows which tables, indexes, partitions, and other database structures are hot. Standalone storage systems do not. Exadata systems know which data is hot because they are co-engineered with the Oracle Database.
The server system has a different path and different limitations. Server systems can take advantage of Optane PMEM NVDIMMs, but they have to run them in Memory Mode as a cache in front of Flash SSDs and/or HDDs. The database writes and reads through the server to DRAM, where data is cached and moved over time to PMEM as the DRAM cache fills and ages. As the PMEM cache fills and ages, it in turn flushes the data to Flash SSDs and/or potentially HDDs.
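The data movement described above can be pictured with a toy model. The sketch below is only an illustration of the demotion order (DRAM, then PMEM, then flash), assuming simple first-in-first-out aging and fixed slot counts. The class name, slot sizes, and methods are hypothetical, and the sketch does not represent how Optane Memory Mode or any server actually manages its caches.

```python
from collections import OrderedDict
from typing import Optional


class TieredCache:
    """Toy model of DRAM -> PMEM -> flash demotion, oldest block first."""

    def __init__(self, dram_slots: int, pmem_slots: int) -> None:
        self.dram_slots = dram_slots   # hypothetical DRAM cache capacity, in blocks
        self.pmem_slots = pmem_slots   # hypothetical PMEM cache capacity, in blocks
        self.dram = OrderedDict()      # insertion order doubles as block age
        self.pmem = OrderedDict()
        self.flash = {}                # backing store, treated as unbounded

    def write(self, key: str, value: bytes) -> None:
        """Place a block in DRAM, then demote older blocks if a tier overflows."""
        # Remove any stale copy so each block lives in exactly one tier.
        self.dram.pop(key, None)
        self.pmem.pop(key, None)
        self.flash.pop(key, None)
        self.dram[key] = value
        self._demote()

    def _demote(self) -> None:
        # DRAM overflow: the oldest DRAM block moves down to PMEM.
        while len(self.dram) > self.dram_slots:
            old_key, old_value = self.dram.popitem(last=False)
            self.pmem[old_key] = old_value
        # PMEM overflow: the oldest PMEM block is flushed to flash.
        while len(self.pmem) > self.pmem_slots:
            old_key, old_value = self.pmem.popitem(last=False)
            self.flash[old_key] = old_value

    def tier_of(self, key: str) -> Optional[str]:
        """Return the name of the tier currently holding `key`, if any."""
        for name, tier in (("dram", self.dram), ("pmem", self.pmem), ("flash", self.flash)):
            if key in tier:
                return name
        return None


cache = TieredCache(dram_slots=2, pmem_slots=2)
for block in ["a", "b", "c", "d", "e"]:
    cache.write(block, b"data")

for block in ["a", "c", "e"]:
    print(block, cache.tier_of(block))
```

With two slots per tier and five writes, the oldest block ("a") ends up on flash, "c" sits in PMEM, and the newest block ("e") is still in DRAM. The application never chooses the tier, which is the point the paragraph makes about Memory Mode.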
However, the server architecture has significant scalability and availability limitations. Optane PMEM capacity is limited by the number of DIMM slots which have to be shared with DRAM DIMMs. There are only so many DIMM slots in a server.
Availability is another issue because the Optane PMEM in other servers in a cluster is not a shareable pool. This limits data redundancy and high availability options. Data protection functions that are CPU- and memory-intensive, such as snapshots, replication, thin provisioning, deduplication, compression, and RAID rebuilds, all reduce database performance while running. This is because they share the same CPUs and memory as the database, which produces inconsistent latency and response times or requires those functions to be run during periods of low usage. Server vendors do not currently publish their database performance numbers or latencies because they depend on too many factors, including whether other applications are running on that server. It’s difficult for them to predict application performance consistency.
Therefore, whenever a storage or server vendor says that their Optane PMEM implementation is equal to or faster than Exadata X8M for running the Oracle Database, it’s factually incorrect. Period. In many situations, it is not even close. Exadata smokes every server, storage system, and HCI appliance on sale in the industry today. It outperforms all of them.
Cloud adjacent is another popular yet misunderstood technology. It came about because many public clouds cannot or do not provide enough storage performance for mission-critical applications, or because the customer cannot store data in a public cloud due to performance, regulatory, legal, or data sovereignty requirements and the data must remain under their control at all times. Storage vendors have been selling cloud adjacent setups for a few years. The storage can be owned by the customer or it can be a managed service from the storage vendor. The storage system is placed in an Equinix (or similar) data center located geographically near, or in the same facility or building as, the targeted public cloud. The Equinix data center is connected to the public clouds with a very high-speed 10Gbps connection. The storage systems from NetApp, Dell EMC, Infinidat, HPE, and others have much better performance than the block or file storage available within the public clouds of AWS, Microsoft Azure, and Google.
The issue is distance and the associated latency. There is no getting around speed-of-light latency so it’s paramount that the cloud adjacent data center be as close as possible to the target public cloud. This becomes increasingly evident with databases. A database transaction will kick off dozens of IOs to the storage system. The AWS website specifically states that for each RDS database transaction, expect approximately 30 storage IOs. That’s a lot of round trips between the public cloud and the cloud adjacent storage system. Latency is additive and can cause application response times to become unacceptable.
It is this problem that Oracle solves with Exadata in a Cloud Adjacent architecture. Customers can again purchase or utilize Exadata and have their system installed in the Equinix data center closest to the targeted public cloud provider (AWS, Azure, or Oracle). With the Exadata Cloud Adjacent architecture, each database transaction requires a single round trip instead of dozens, or as many as 30. Putting the Exadata in the Equinix data center is going to provide much better application response time than just putting the storage there. This is because a database talking to cloud adjacent storage incurs much higher total latency than an app server talking to an Exadata in a cloud adjacent configuration.
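The round-trip argument can be made concrete with a back-of-the-envelope model. The sketch below assumes that storage IOs are issued serially, so their link latencies add up, and it uses a hypothetical 1 ms round-trip time between the public cloud and the adjacent data center. Only the figure of roughly 30 IOs per transaction comes from the text. The function names are illustrative.

```python
def storage_adjacent_latency_ms(rtt_ms: float, ios_per_txn: int) -> float:
    """Database runs in the public cloud, storage is cloud adjacent.

    Every storage IO crosses the cloud-to-data-center link, so the link
    latency is paid once per IO (assuming the IOs are issued serially).
    """
    return rtt_ms * ios_per_txn


def database_adjacent_latency_ms(rtt_ms: float) -> float:
    """Database and its storage both sit in the adjacent data center.

    Only the application's request and response cross the link, so the
    link latency is paid once per transaction.
    """
    return rtt_ms


ASSUMED_RTT_MS = 1.0   # hypothetical cloud-to-Equinix round-trip time
IOS_PER_TXN = 30       # approximate IOs per transaction quoted in the text

storage_case = storage_adjacent_latency_ms(ASSUMED_RTT_MS, IOS_PER_TXN)
database_case = database_adjacent_latency_ms(ASSUMED_RTT_MS)

print(f"Cloud adjacent storage:  {storage_case:.1f} ms of link latency per transaction")
print(f"Cloud adjacent database: {database_case:.1f} ms of link latency per transaction")
```

Under these assumptions the link contributes 30 ms per transaction in the first case and 1 ms in the second. Real workloads overlap some IOs, so the gap may be smaller in practice, but the difference scales with the number of IOs per transaction regardless of the actual round-trip time.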
Once again, if a storage vendor pushing cloud adjacency claims that they are as fast as or faster than an Exadata Cloud Adjacent architecture, they are simply blowing smoke.
For More Information on Oracle Exadata
Paper sponsored by Oracle. About Dragon Slayer Consulting: Marc Staimer, as President and CDS of the 21-year-old Dragon Slayer Consulting in Beaverton, OR, is well known for his in-depth and keen understanding of user problems, especially with storage, networking, applications, cloud services, data protection, and virtualization. Marc has published thousands of technology articles and tips from the user perspective for internationally renowned online trades including many of TechTarget’s Searchxxx.com websites and Network Computing and GigaOM. Marc has additionally delivered hundreds of white papers, webinars, and seminars to many well-known industry giants such as: Brocade, Cisco, DELL, EMC, Emulex (Avago), HDS, HPE, LSI (Avago), Mellanox, NEC, NetApp, Oracle, QLogic, SanDisk, and Western Digital. He has additionally provided similar services to smaller, less well-known vendors/startups including: Asigra, Cloudtenna, Clustrix, Condusiv, DH2i, Diablo, FalconStor, Gridstore, ioFABRIC, Nexenta, Neuxpower, NetEx, NoviFlow, Pavilion Data, Permabit, Qumulo, SBDS, StorONE, Tegile, and many more. His speaking engagements are always well attended, often standing room only, because of the pragmatic, immediately useful information provided. Marc can be reached at firstname.lastname@example.org, (503)-312-2167, in Beaverton OR, 97007.