In the previous episode of this blog series, I discussed what makes Oracle Exadata exceptional on the compute node side. In this blog, I will show you what Exadata brings to the storage servers.
The Exadata storage servers, or storage cells, store the actual data of the Oracle databases; they are also where the magic happens in terms of Smart Scan, offloading, and other Exadata performance and storage optimizations such as Hybrid Columnar Compression (HCC).
There are three kinds of storage servers: High Capacity (HC), Extreme Flash (EF), and Extended Storage Server (XT).
All these storage servers come with the same building blocks: CPU, RAM, RDMA-accessible memory, and network adapters (HCA).
Depending on the storage server model, the cells ship with different media: XT servers contain only high-capacity hard disks, HC servers combine high-capacity hard disks with performance-optimized flash drives, and EF servers carry a mix of performance-optimized and capacity-optimized flash drives. The following animation shows the differences between the HC and EF storage servers.

You can find detailed information regarding these three types of Exadata storage servers and their intended use cases at this link.
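If you have access to a storage server, CellCLI gives you a quick view of which media a given cell contains. A minimal sketch, assuming a recent Exadata software release:

CellCLI> LIST CELL ATTRIBUTES name, makeModel
CellCLI> LIST PHYSICALDISK ATTRIBUTES name, diskType, physicalSize

The diskType attribute distinguishes hard disks from flash devices, so the output immediately shows the mix of media in the server.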
What happens if something goes wrong with the system drives of the storage servers? Will it affect the user data?
Beginning with the Exadata X7 series, the system software resides on separate M.2 drives. The drives are mirrored with software RAID and can be replaced online if something goes wrong.
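You can verify the state of that mirror yourself from the storage server's operating system, using standard Linux software RAID tooling (the md device name below is just an example and differs per model):

# cat /proc/mdstat
# mdadm --detail /dev/md24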
User and system data are thus separate, but how does Exadata handle failures, errors, poor disk health, corruptions, and replacements? And what protection is there for hot data residing in flash and memory?
The Exadata Management Server (MS) process constantly monitors disk performance and health; poor performance is often a precursor to disk failure.
Disks with poor performance are confined, and I/O is directed to an alternative mirror.
The predictive failure functionality on the storage cells will automatically power cycle the drives or flash memory to avoid false positives.
The Exadata storage cell automatically runs disk health checks and diagnostics; if the disk or flash card is deemed healthy, it is returned to service and resynchronized. If, on the contrary, it is considered unhealthy, the disk or flash drive is dropped, data is rebalanced to maintain redundancy, and a blue service LED is lit. The disk or flash drive can then be replaced online, without shutting down the storage server.
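You can follow this process from CellCLI as well. A sketch, assuming a recent Exadata software version (the disk name is an example):

CellCLI> LIST PHYSICALDISK WHERE status != normal ATTRIBUTES name, status
CellCLI> ALTER PHYSICALDISK 20:11 SERVICELED ON

The first command shows any disks that are not in the normal state; the second manually lights the service LED so the field engineer pulls the right drive.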
ASM partnering also comes into play when a network packet in the I/O path between the database server and the storage node is corrupted: the storage cell prevents the corrupt write, and ASM retries by re-sending the packet.
Database corruptions are repaired automatically by reading from one of the ASM mirrors and fixing the corruption using a good copy.
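From the database side you can verify that no corrupt blocks are left behind. A standard check:

RMAN> VALIDATE DATABASE;
SQL> SELECT file#, block#, blocks FROM v$database_block_corruption;

An empty result from v$database_block_corruption after the validation means no corruptions were found.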
Periodic Exadata Disk scrubbing is performed on hard disks, as explained in detail here.
Exadata disk scrubbing makes sure that bad sectors are detected and repaired by reading from a healthy ASM mirror copy. This is one of the reasons why it is so important to use HIGH redundancy for the ASM disk groups: it gives you a primary extent and two mirror copies.
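The scrubbing schedule is configured per storage server, and ASM can additionally scrub a disk group on demand. A minimal sketch (the disk group name is an example):

CellCLI> ALTER CELL hardDiskScrubInterval=weekly
SQL> ALTER DISKGROUP data SCRUB POWER LOW;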
The Smart OLTP Caching feature ensures that storage failures don’t impact the application and that brownouts are kept to a minimum, if not avoided entirely.
Have a look at the following animation to see how this works in practice.

The following animation shows what happens when a failure occurs.

The Exadata software also takes the “health factor” into account to ensure that the flash cache is warmed up on the storage cell where storage has been replaced. Please have a look at the animation below for a better understanding.

There is a lot of information in the Exadata section of the Automatic Workload Repository (AWR) report for post-mortem analysis, and in real time via Real-Time Insight, ensuring that you can quickly diagnose potential performance and health issues.
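The underlying cell metrics can also be queried interactively with CellCLI, for example the current cell disk metrics:

CellCLI> LIST METRICCURRENT WHERE objectType = 'CELLDISK'

Real-Time Insight streams these same metrics to an observability endpoint of your choice; see the quick-start link at the bottom of this post.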
All good; the data is safe on the disk or flash drives, but what about the disk controller itself?
In Exadata X9M and earlier versions, Exadata can automatically repair controller cache failures to prevent corruption. In X10M, further optimizations eliminated the need for a controller cache altogether by leveraging the flash cache.
Exadata offloads a lot of CPU processing to the cells, but what happens if it is too much? Will it slow the whole system down?
The MS will send out an alert when this occurs.
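Those alerts land in the cell's alert history, which you can inspect with CellCLI (and forward via SMTP or SNMP if configured):

CellCLI> LIST ALERTHISTORY WHERE severity = 'critical' DETAIL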
The reverse offload feature enables a storage cell to push some offloaded work back to the database node when the storage cell’s CPU is saturated.
If you want to know more about this feature, make sure to check the following link.
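On the database side, the cell-related statistics give a feel for how much work is offloaded. A hedged example, since the exact statistic names vary between Exadata software versions:

SQL> SELECT name, value FROM v$sysstat WHERE name LIKE 'cell%' AND value > 0 ORDER BY value DESC;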
What happens if a whole cell fails or isn’t functioning properly? Also, how do you maintain service levels during cell software updates?
The Exadata RS (Restart Server) process can decide to restart hung cells. Instant Failure Detection (IFD) prevents split-brain situations by using the four RDMA paths rather than relying on OS timeouts.
The Exadata flash cache state is preserved during cell software rolling updates and rebalance operations. When a storage server is shut down, the diskmon process in the Grid Infrastructure on the database server is notified, ensuring there is no blackout when the storage tier is taken down for maintenance.
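After maintenance you can confirm that the flash cache came back with its state intact:

CellCLI> LIST FLASHCACHE DETAIL

The output includes the cache status and its effective size, so a healthy, warmed cache is easy to verify.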
Conclusion:
As you can see again in this blog post, these features are a good example of how unique Exadata is. Engineering hardware and software together is not only a slogan, but a reality with Oracle Exadata.
More interesting links:
https://www.oracle.com/database/technologies/exadata/hardware/storageservers/
https://blogs.oracle.com/exadata/post/real-time-insight-quick-start
https://blogs.oracle.com/exadata/post/exadata-disk-scrubbing
Exadata New Features
Other parts in this series:
More than Just Redundant Hardware: Exadata MAA and HA Explained – Introduction
More than Just Redundant Hardware: Exadata MAA and HA Explained part I, the compute node
