Oracle Management Cloud Blog covers the latest releases, customer stories, how-to guides and more.

  • July 10, 2018

Oracle Management Cloud: Autonomous By Design

Vijay Tatkar
Director, Product Management

OMC? ... wait I thought the 'O' in it was for Autonomous :-)

Oracle Management Cloud (OMC) is recognized by Ovum as a leader for Systems Management and Security and it represents a culmination of Oracle's Autonomous Cloud vision. I posit that OMC is Autonomous by Design and in this blog I'm going to explain why this is the case.

The term Autonomous generally conjures images of self-driving cars, robots or unmanned vehicles that do difficult tasks without human intervention, or that significantly augment human capabilities. In other words, it implies the power to make one's own decisions, to be self-governing, and to be better by being independent.

OMC grew out of a similar desire. Network Operations Centers (NOCs) and IT Ops organizations have been inundated with ever-increasing data from application metrics, server metrics (whether host, VM or container driven), real and synthetic user data, diagnostic logs, transaction traces, configuration and compliance reports, tickets, alerts, network events and other data streams. Over time, data centers have expanded their footprint by hosting dozens of VMs in place of a single host, and the application of DevOps principles has driven development into enormous diversity.

Meanwhile, heterogeneous application stacks multiplied due to a much greater application deployment cadence and Containers and Cloud have driven heterogeneity to a new level of complexity. This has caused data streams to multiply enormously. There are almost as many monitoring and management tools to make sense of this data, but each is typically "specialized" to its own tier. Network outages have become common because managing such a wealth of data through established practices and narrow tools became increasingly unwieldy and impractical.

Now, add cyber-security issues to this mix. These are delivered not by lone hackers wanting to prove their prowess, but by well-funded organizations willing to sustain long engagements and to try to crack open cyber defenses with a wide range of hacking tools available on the underground black market (the so-called "dark web"). You now have a situation that is far more complex. Add security-related streams such as global threat feeds, Identity-driven traces and security events to the above list of data streams, and the data muddle gets even harder to tackle. OMC was Oracle's vision for tackling this deluge: smarter, machine-operated methods would pick out the crucial needle of anomalies from a haystack of repetitive, boring and easy-to-miss events.

OMC had to build some incredible smarts into this solution. We set about this task by doing two things that gave the solution the autonomy it needed: building a unified data platform so that all the input streams could be meaningfully mapped, and then applying popular machine learning techniques and auto-remediation to the data. Machine learning, with careful supervision, is what allows powerful decision making without human intervention. Popular techniques employed are:

  • Anomaly detection, which flags unusual resource usage, identifies configuration drift, etc.

  • Clustering, which separates signal from noise and aggregates topology-based data

  • Correlation which resolves dependencies and groups alerts on related symptoms

  • Forecasting which prevents outages before they can happen and helps in capacity and resource planning
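As an illustration of the first technique in the list above, here is a minimal sketch of baseline-driven anomaly detection on a metric stream. It is not OMC's implementation: the function name `flag_anomalies`, the trailing-window baseline, the z-score test and the default `window` and `threshold` values are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def flag_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from a trailing baseline.

    Hypothetical helper: the baseline for each sample is the previous
    `window` samples, and a sample is flagged when it lies more than
    `threshold` standard deviations away from the baseline mean.
    """
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: any change at all is unusual.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up CPU utilisation readings (percent) with one obvious spike.
cpu = [41, 40, 42, 39, 41, 40, 95, 41, 40]
print(flag_anomalies(cpu))  # [6]
```

The point of the sketch is that the threshold is derived from the data itself rather than configured by hand, which is the sense in which baselines can be set with little user effort. A production system would also need to handle seasonality and to keep a spike from contaminating the baseline that follows it.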

While Machine Learning is the basis of autonomy, there are two broad categories for minimizing management labor:

  • Eliminating effort associated with maintaining legacy management tools: this includes managing the unified data platform, upgrading and patching it, elastically growing according to expanding needs, updating with security patches, encrypting data in transit, etc.

  • Eliminating effort associated with using legacy management tools: this includes zero-effort techniques to set baselines and thresholds, automatic topology discovery and correlation, and machine learning techniques that have been extensively trained on patterns, behaviors, heuristics and various telemetry contexts, so that anomaly detection, clustering, correlation and forecasting can be done without extensive data science knowledge on the part of the user. Additionally, automatic orchestration and remediation can be used for preventive and corrective actions.

Oracle Management Cloud was built for today's complex environments: heterogeneous stacks (not just Oracle software or hardware), hybrid cloud environments, a mix of on-premises and cloud solutions, and users who must be tracked beyond the firewall through their Identity and associated Entities. By building a unified data platform, OMC delivers services that are fully integrated across performance monitoring, infrastructure and IT analytics, log analytics, compliance and configuration, orchestration/remediation, and security monitoring and analytics.

Whenever I present OMC to customers, they are often astonished to see that we have chosen to integrate traditional Systems Management with Security, but given OMC's original architecture, this is the most logical conclusion. Security and Systems Management are really two sides of the same coin. Log Analytics is the basis for Security Monitoring and Analytics as well as for Systems Management, because logs contain both application diagnostic and security information; the logs are parsed similarly, but the various fields are enriched differently. Configuration drift is both a Systems issue and a Security issue. User behavior is the basis of performance optimization and DevOps on the Systems side, and UEBA (User and Entity Behavior Analytics) is the basis for Security analysis. Topology is central to Systems troubleshooting, and also key to understanding the Security Kill Chain. Anomaly detection surfaces outliers and trends on the Systems side, and the same outliers are fundamental to Security. Historical samples inform forecasting, clustering and anomaly detection on the Systems side and Kill Chains on the Security side. SQL deep-dives are important for performance troubleshooting and also for detecting Security anomalies. Remediation is common to both. The two disciplines have so many similarities that they can be united; indeed, even at the top level, data center outages and cyber-security threats are not seen as distinct events.
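To make the "parsed similarly, enriched differently" idea concrete, here is a small sketch in which one parser feeds two enrichment steps, one for operations and one for security. The log format, the field names and the functions `parse`, `enrich_for_ops` and `enrich_for_security` are all hypothetical and are not taken from OMC.

```python
import re

# Hypothetical log layout: timestamp, host, level, user, free-text message.
LINE = re.compile(
    r"(?P<ts>\S+) (?P<host>\S+) (?P<level>[A-Z]+) "
    r"user=(?P<user>\S+) msg=(?P<msg>.*)"
)


def parse(line):
    """Shared parsing step: turn one raw log line into a dict of fields."""
    match = LINE.match(line)
    return match.groupdict() if match else None


def enrich_for_ops(record):
    """Systems view: mark records that indicate an application error."""
    enriched = dict(record)
    enriched["is_error"] = record["level"] in {"ERROR", "FATAL"}
    return enriched


def enrich_for_security(record):
    """Security view: mark records that look like a failed login."""
    enriched = dict(record)
    enriched["auth_failure"] = "login failed" in record["msg"].lower()
    return enriched


raw = "2018-07-10T12:00:00Z web01 ERROR user=alice msg=Login failed for account"
record = parse(raw)
ops_view = enrich_for_ops(record)
security_view = enrich_for_security(record)
print(ops_view["is_error"], security_view["auth_failure"])  # True True
```

The same parsed record serves both audiences, so the expensive step (collecting and parsing the log) happens once, and each discipline only adds the fields it cares about.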

Thus, the OMC architecture of a unified data platform, with well-trained Machine Learning algorithms running on it and auto-remediation applied to the results, is a great mechanism for tackling both types of problems equally well. In retrospect, OMC's decision to build such a platform feels like a brilliant move. But in many ways, it is the logical culmination of the thinking that tomorrow's problems need to be solved by machines that work autonomously, that learn as they go, and that combat the twin scourges of outages and cyber-security threats with a smart learning platform combined with automated remediation. This is what makes OMC Autonomous by design.

Related Reading:
Give Yourself an Edge: Use Machine Learning for Managing IT Operations
Read more about Oracle Cloud's Autonomous capabilities