Organizations are moving VMware estates to Oracle Cloud VMware Solution for better control and security, on-demand scale, and a cost-saving, subscription-based model. But after the workloads have moved to a VMware software-defined data center (SDDC) on Oracle Cloud Infrastructure (OCI), VMware administrators might wonder how to monitor the ESXi hosts that are part of the VMware Solution SDDC.

OCI covers this area as part of the Oracle Cloud VMware Solution provisioning process by providing Infrastructure Health Monitoring for all hosts at no extra cost. The oci_compute_infrastructure_health metric namespace is automatically created when the ESXi hosts boot, and the OCI Monitoring service manages all the alarms.

A graphic depicting the architecture for monitoring ESX nodes.

How does this monitoring work?

When the ESXi host is active, infrastructure health metrics for CPUs, memory, and disk usage are sent at regular intervals to the OCI Monitoring service. If an alarm is raised for any of these infrastructure components, an automated email notification is sent, by default to the tenancy administrator, who can then act on it. We recommend creating user-defined notifications for all infrastructure metrics and ensuring that the notifications are sent to multiple Oracle Cloud VMware Solution administrators.

A screenshot of the OCI Hardware Issue Detected report screen.

You can see the same error in the Oracle Cloud Console: open the navigation menu, select Compute, then Instances, and then select the instance name.

A screenshot of the previous error in the Oracle Cloud Console.

Generating custom metrics, alarms, and notifications

VMware Solution administrators can also create custom queries against the oci_compute_infrastructure_health metrics to generate alarms and notifications as required, using the OCI Observability services.
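As a rough illustration of what such a custom query might look like, the following Python sketch builds a Monitoring Query Language (MQL) expression and a dictionary of alarm settings. Only the oci_compute_infrastructure_health name comes from this post. The metric name health_status, the resourceId dimension, the max() statistic, the five-minute interval, the field names in the dictionary, and both helper functions are assumptions made for illustration, so check them against the current OCI Monitoring documentation before relying on them. The sketch only builds strings and dictionaries and does not call any OCI API.

```python
from typing import Optional


def build_health_query(
    metric: str = "health_status",      # assumed metric name in the namespace
    interval: str = "5m",               # assumed aggregation window
    resource_id: Optional[str] = None,  # optional OCID of a single ESXi host
    threshold: int = 0,
) -> str:
    """Return an MQL-style expression that is true when the metric exceeds the threshold."""
    # Optional dimension filter that narrows the query to one host (assumed dimension name).
    dimensions = f'{{resourceId = "{resource_id}"}}' if resource_id else ""
    return f"{metric}[{interval}]{dimensions}.max() > {threshold}"


def build_alarm_details(display_name: str, compartment_id: str,
                        topic_id: str, query: str) -> dict:
    """Collect hypothetical alarm settings; the keys are illustrative, not a confirmed API shape."""
    return {
        "display_name": display_name,
        "compartment_id": compartment_id,
        "metric_compartment_id": compartment_id,
        "namespace": "oci_compute_infrastructure_health",
        "query": query,
        "severity": "CRITICAL",
        "destinations": [topic_id],  # Notifications topic that receives the alarm
        "is_enabled": True,
    }


if __name__ == "__main__":
    # Placeholder OCIDs; replace them with real values from your tenancy.
    query = build_health_query(resource_id="ocid1.instance.oc1..example")
    alarm = build_alarm_details(
        "esxi-host-health",
        "ocid1.compartment.oc1..example",
        "ocid1.onstopic.oc1..example",
        query,
    )
    print(alarm["query"])
```

The resulting settings could then be entered in the Console when creating an alarm, or passed to whichever SDK or CLI you use, after mapping the keys to that tool's actual parameter names.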

At the time of this post, the OCI Notifications service delivers notifications through email, SMS, functions, PagerDuty, Slack, and custom URLs. OCI Monitoring also plugs into existing Grafana dashboards through the OCI Grafana plugin.
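To make the recommendation about notifying several administrators more concrete, here is a small Python sketch that prepares one subscription entry per recipient for a single notification topic. The protocol identifiers, the dictionary keys, the helper name, and the example addresses are all assumptions for illustration and may not match the Notifications API exactly. The sketch only validates and assembles data and does not contact OCI.

```python
from typing import Dict, List, Tuple

# Assumed protocol identifiers for the delivery channels listed in this post.
SUPPORTED_PROTOCOLS = {
    "EMAIL", "SMS", "ORACLE_FUNCTIONS", "PAGERDUTY", "SLACK", "CUSTOM_HTTPS",
}


def build_subscriptions(topic_id: str, compartment_id: str,
                        recipients: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Build one hypothetical subscription entry per (protocol, endpoint) pair."""
    subscriptions = []
    for protocol, endpoint in recipients:
        normalized = protocol.upper()
        if normalized not in SUPPORTED_PROTOCOLS:
            raise ValueError(f"Unsupported protocol: {protocol}")
        subscriptions.append({
            "topic_id": topic_id,
            "compartment_id": compartment_id,
            "protocol": normalized,
            "endpoint": endpoint,
        })
    return subscriptions


if __name__ == "__main__":
    # Placeholder OCIDs and addresses; replace them with real values.
    entries = build_subscriptions(
        "ocid1.onstopic.oc1..example",
        "ocid1.compartment.oc1..example",
        [("email", "vmware-admin-1@example.com"),
         ("email", "vmware-admin-2@example.com")],
    )
    for entry in entries:
        print(entry["protocol"], entry["endpoint"])
```

Keeping all recipients on one topic means an alarm needs only a single destination, and administrators can be added or removed without editing the alarm itself.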

Responding to an alert

At OCI, we understand that hardware can go bad. To protect our customers’ investment in Oracle Cloud VMware Solution, we provide our customers with a zero-cost host replacement capability.

A screenshot of a menu with the option for Replace Host highlighted.

As a part of this feature, an ESXi host is created under the SDDC. When the host is active, customer VMware administrators can ingest the host into their vCenter, NSX, and vSAN configurations using the solution playbook, Add an ESXi host to Oracle Cloud VMware Solution.

When the new host is added to the cluster in vCenter and moved into production, the vCenter administrator must remove the faulty host from vCenter, NSX, and vSAN and mark the replace host operation as complete within the stipulated time period displayed in the Console. Failure to complete this process results in the tenancy being billed for both the replacement and the faulty ESXi host. Only hosts that the monitoring system detects as faulty are eligible for host replacement. For assistance, raise a service request with OCI Support.

Next steps

Enhance your Oracle Cloud VMware Solution infrastructure observability further using the Oracle Cloud Infrastructure Monitoring and Notifications services. If you have any questions or suggestions, contact the Oracle Cloud VMware Solution team.