This post offers an insider's view of how the Exadata team thinks about high availability, using three examples of availability challenges and how Exadata addresses them. Each section below describes one challenge and includes a video in which Michael Nowak (Architect, MAA) explains the solution and discusses how Oracle technical staff continually identify and address availability challenges.
Slow or hung I/Os and sick disks are a fact of life, and Exadata implements a range of machine learning and other techniques to identify and remedy problematic I/Os. This enables Exadata to maintain service levels in the face of these real-life problems.
In this 4 min video, Michael Nowak explains how storage servers detect and cancel or repair slow and hung I/Os and confine sick disks, and how database servers cooperate with storage servers to deal with undetected issues via I/O latency capping.
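The idea behind I/O latency capping can be sketched in a few lines of Python. This is only an illustration of the general technique (give up on a read that exceeds a latency cap and retry it against another copy of the data), not Exadata's implementation. The function name `read_with_latency_cap`, the `replicas` list, and the `cap_seconds` parameter are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def read_with_latency_cap(replicas, block_id, cap_seconds):
    """Read block_id from the first replica that answers within the cap.

    replicas is a list of callables (hypothetical stand-ins for reads
    against different mirror copies); each takes a block id and returns data.
    """
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    try:
        for read in replicas:
            future = pool.submit(read, block_id)
            try:
                # Wait at most cap_seconds for this copy to respond.
                return future.result(timeout=cap_seconds)
            except FutureTimeout:
                # Too slow: stop waiting and redirect to the next copy.
                future.cancel()
        raise TimeoutError(f"no replica answered within {cap_seconds}s")
    finally:
        # Do not block on reads that are still hung in the background.
        pool.shutdown(wait=False)
```

In this sketch the caller never waits longer than the cap for any single copy, so one slow device does not dictate the response time as long as a healthy mirror exists.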
When thinking about protecting and keeping data highly available, a first consideration is to introduce redundancy in how and where the data are stored. This addresses failure risks for individual (nonvolatile) storage devices (e.g., magnetic disks and flash memory). It is just as important to ensure the availability of the system managing the storage devices. As of Exadata X7, each storage server includes two redundant M.2 solid state drives to house the operating system and the Exadata storage server software.
When needed, an M.2 drive can be replaced online while the storage server continues to service the application; redundancy between the two drives is provided by Intel RSTe RAID technology. In this 3 min video, Michael Nowak explains how this solution evolved and why it is an important improvement.
Operator errors create availability challenges that hardware and software remedies alone cannot address. For example, a data center operator might mistakenly remove a disk at a time when its removal would compromise storage redundancy. Starting with Exadata X7, storage servers include a “do-not-service” LED that alerts data center personnel when shutting down a storage cell would compromise redundancy.
In this 4 min video, Michael Nowak explains ASM disk partnering and how it drives this LED warning light.
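As a rough illustration of how disk partnering could drive such a warning, the Python sketch below assumes a simplified model: each disk mirrors its data onto partner disks in other cells, so a cell should not be serviced while any partner of one of its online disks is already offline. The function `do_not_service` and its data structures are hypothetical, and the actual logic in Exadata and ASM is more involved.

```python
def do_not_service(cell, disk_to_cell, partners, offline_disks):
    """Return True if shutting down `cell` would leave some data with no online copy.

    disk_to_cell:  dict mapping disk name -> name of the cell that hosts it
    partners:      dict mapping disk name -> set of partner disk names
    offline_disks: set of disks that are currently offline
    (All three structures are assumptions made for this sketch.)
    """
    for disk, home in disk_to_cell.items():
        if home != cell or disk in offline_disks:
            continue  # only online disks on the cell being serviced matter
        # If a partner is already offline, this disk may hold the only
        # remaining online copy of the data the two of them mirror.
        if any(p in offline_disks for p in partners.get(disk, ())):
            return True
    return False


disk_to_cell = {"d1": "cellA", "d2": "cellB", "d3": "cellC"}
partners = {"d1": {"d2", "d3"}, "d2": {"d1", "d3"}, "d3": {"d1", "d2"}}

# With d3 offline, cellA and cellB each hold a last copy; the LED would be lit.
led_on = do_not_service("cellA", disk_to_cell, partners, {"d3"})
```

With all disks online the function returns False for every cell, which corresponds to the LED being off.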
For the Exadata product team at Oracle, high availability is a fundamental design principle and an ongoing commitment. Exadata embodies the leading edge of Oracle's Maximum Availability Architecture, a best practices blueprint based on proven Oracle high availability technologies, end-to-end validation, expert recommendations and customer experiences (see also the technical overview and MAA blog).
We are always interested in your feedback. You are welcome to engage with us via Twitter @ExadataPM and by comments here.