IT Innovation, Oracle OpenWorld | October 3, 2017

Catch the Drift with Machine Learning

By Alan Zeichick

“One of these things is not like the others,” the television show Sesame Street taught generations of children. Easy. Let’s move to the next level: “One or more of these things may or may not be like the others, and those variances may or may not represent systems vulnerabilities, failed patches, configuration errors, compliance nightmares, or imminent hardware crashes.” That’s a lot harder than distinguishing cookies from brownies.

Looking through gigabytes of log files and transaction records to spot patterns or anomalies is hard for humans: it’s slow, tedious, error-prone, and doesn’t scale. Fortunately, it’s easy for artificial intelligence (AI) software, such as the machine learning algorithms built into Oracle Management Cloud. What’s more, the machine learning algorithms can be used to direct manual or automated remediation efforts to improve security, compliance, and performance.

Consider how large-scale systems gradually drift away from their required (or desired) configuration, a key area of concern in the large enterprise. In his Monday, October 2 Oracle OpenWorld session on managing and securing systems at scale using AI, Prakash Ramamurthy, senior vice president of systems management at Oracle, talked about how drift happens. Imagine that you’ve applied a patch, but then later you spool up a virtual server that is running an old version of a critical service or contains an obsolete library with a known vulnerability. That server is out of compliance, Ramamurthy said. Drift.
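To make the idea of drift concrete, here is a minimal Python sketch of the simplest possible check: comparing what a server actually runs against what it is required to run. This is only an illustration of the concept, not a description of how Oracle Management Cloud works. The package names, version strings, and the find_drift helper are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical required configuration: package name -> required version.
REQUIRED: Dict[str, str] = {"critical-service": "2.4.1", "shared-lib": "1.0.9"}


def find_drift(required: Dict[str, str],
               installed: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (package, required_version, installed_version) for every mismatch."""
    drift = []
    for package, wanted in sorted(required.items()):
        actual = installed.get(package, "missing")
        if actual != wanted:
            drift.append((package, wanted, actual))
    return drift


# A server that received the patch, and a virtual server spooled up later
# from an older image that still carries the obsolete library.
patched_server = {"critical-service": "2.4.1", "shared-lib": "1.0.9"}
stale_server = {"critical-service": "2.4.1", "shared-lib": "1.0.3"}

print(find_drift(REQUIRED, patched_server))
print(find_drift(REQUIRED, stale_server))
```

A check like this only works when someone has written down the required state for every package on every class of server, which is the manual burden the rest of the session argued against.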

Drift is bad, said Ramamurthy, and detecting and stopping drift is a core competency of Oracle Management Cloud. It starts with monitoring cloud and on-premises servers, services, applications, and logs, using machine learning to automatically understand normal behavior and identify anomalies. No training necessary here: A variety of machine learning algorithms teach themselves how to play the “one of these things is not like the others” game with your data, your systems, and your configuration, and also to classify the systems in ways that are operationally relevant. Even if those logs contain gigabytes of information on hundreds of thousands of transactions each second.

The Problem Is Never a Dearth of Data

We are drowning in operational data—too much of it. Most of that data represents perfectly normal operations, but is necessary for determining historical context and operational norms, as well as serving as a baseline for analytics. What is needed, Ramamurthy said, is insight: Is there a problem? If so, what is wrong? Where is it? What should be done? Can that be handled automatically?

Often, insights come from detecting and analyzing outliers, he explained. Perhaps one server has a Node.js library that’s a different version from the one installed on other servers. This may represent a deviation from patch/update policies, or might represent surreptitiously installed malware. Or perhaps a transaction that occasionally takes four minutes to write to a database, when that task usually takes less than two seconds, represents a resource allocation problem with a particular database server, or a transient issue with a network connection.
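Both of those outliers can be found without anyone stating in advance what "normal" is. The Python sketch below is one simple way to do it, assuming a majority vote for library versions and a median-based (modified z-score) test for write times. The server names, version strings, timings, and the 3.5 cutoff are illustrative assumptions, and nothing here is taken from Oracle's implementation.

```python
from collections import Counter
from statistics import median
from typing import Dict, List


def version_outliers(versions: Dict[str, str]) -> Dict[str, str]:
    """Servers whose library version differs from the most common one in the fleet."""
    most_common, _count = Counter(versions.values()).most_common(1)[0]
    return {server: v for server, v in versions.items() if v != most_common}


def latency_outliers(seconds: List[float], threshold: float = 3.5) -> List[int]:
    """Indices of values whose modified z-score exceeds the threshold.

    The score uses the median and the median absolute deviation (MAD),
    so a single four-minute write cannot distort the baseline it is
    being compared against.
    """
    med = median(seconds)
    mad = median([abs(x - med) for x in seconds])
    if mad == 0:
        # Every typical value is identical; anything different stands out.
        return [i for i, x in enumerate(seconds) if x != med]
    return [i for i, x in enumerate(seconds)
            if 0.6745 * abs(x - med) / mad > threshold]


# Hypothetical fleet: one server carries a different library version.
fleet = {"web-1": "8.6.0", "web-2": "8.6.0", "web-3": "8.6.0", "web-4": "6.11.3"}

# Hypothetical database write times in seconds, with one four-minute write.
write_seconds = [1.2, 1.5, 1.1, 1.8, 240.0, 1.4, 1.3, 1.6]

print(version_outliers(fleet))
print(latency_outliers(write_seconds))
```

Neither function says why the outlier exists. Whether the odd library is a missed patch or malware, and whether the slow write is a database problem or a network blip, is the follow-up question the outlier raises.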

Can these problems be solved without machine learning? Of course, said Ramamurthy. A log analysis system could be configured to look for database “write” operations that take more than 10 seconds, or to scan for unapproved versions of the Node.js library. However, there are thousands (or millions) of such possible anomalous cases, and it’s impractical to manually code them all, let alone keep up with changes. With machine learning, it’s not necessary to write manual filters or do any sort of manual log or transaction review, he explained. The machine learning algorithms train themselves to find problems, and they continually retrain to stay current on the active system configuration.
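The difference between a hand-written filter and a self-adjusting one can be sketched in a few lines of Python. The 10-second limit comes from the example above. Everything else (the RollingBaseline class, its window size, and the four-standard-deviation cutoff) is a hypothetical stand-in for whatever a real product does, chosen only to show the contrast.

```python
from collections import deque
from statistics import mean, stdev

STATIC_LIMIT_SECONDS = 10.0  # the hand-written rule: flag writes over 10 seconds


def static_rule(seconds: float) -> bool:
    """Manual filter: anomalous only if it crosses the fixed limit."""
    return seconds > STATIC_LIMIT_SECONDS


class RollingBaseline:
    """Learns 'normal' from the most recent observations and keeps relearning."""

    def __init__(self, window: int = 50, k: float = 4.0, min_samples: int = 10):
        self.recent = deque(maxlen=window)  # old samples fall off automatically
        self.k = k
        self.min_samples = min_samples

    def observe(self, seconds: float) -> bool:
        """Return True if the value is far above recent history, then record it."""
        anomalous = False
        history = list(self.recent)
        if len(history) >= self.min_samples:
            limit = mean(history) + self.k * max(stdev(history), 1e-9)
            anomalous = seconds > limit
        # Every value is recorded, so the baseline follows genuine shifts in the norm.
        self.recent.append(seconds)
        return anomalous


baseline = RollingBaseline()
for i in range(20):
    baseline.observe(1.0 + (i % 10) / 10)  # normal writes: 1.0 to 1.9 seconds

slow_write = 6.0  # far above normal, but under the 10-second rule
missed_by_rule = static_rule(slow_write)
caught_by_baseline = baseline.observe(slow_write)
print(missed_by_rule, caught_by_baseline)
```

The fixed rule needs a person to pick the number 10 and to revisit it whenever the system changes. The rolling baseline picks its own limit from the data and moves it as older samples age out of the window.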

Algorithms Designed For On-The-Job Training

That’s the key benefit of machine learning, compared to other popular AI techniques. Expert systems, for example, are provided with an explicit set of rules, often written by human experts, and then automatically solve problems by applying those rules. Neural networks are trained to recognize and classify specific patterns in sample datasets, and then use that training to identify similar patterns in real-world data.

While both expert systems and neural networks are incredibly powerful and efficient in production environments, they must be explicitly given rules or trained on domain-specific sample datasets, and those algorithms are not flexible or able to go beyond that preparation. By contrast, the machine learning approach described here does not need to be trained in advance, and can autonomously find patterns in real-world data, even as that data is shifting or changing over time. That makes machine learning the perfect tool for autonomous computing, and for allowing tools like Oracle Management Cloud to start work right away, potentially producing useful insights on Day 1. What’s more, machine learning algorithms can even make predictions based on how data changes over time, allowing administrators to gain insight for capacity planning or preventive maintenance.
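As a rough illustration of prediction for capacity planning, the Python sketch below fits a straight line to daily disk usage and estimates how many days remain before the volume fills. A least-squares trend line is an assumption made for simplicity, and the figures and function names are hypothetical. Real workloads are rarely this linear.

```python
from typing import List, Optional, Tuple


def fit_line(ys: List[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for ys sampled at x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def days_until_full(daily_used_gb: List[float], capacity_gb: float) -> Optional[float]:
    """Days from the latest sample until the trend line reaches capacity.

    Returns None when usage is flat or shrinking, because the trend
    never reaches capacity.
    """
    slope, intercept = fit_line(daily_used_gb)
    if slope <= 0:
        return None
    day_full = (capacity_gb - intercept) / slope
    return max(0.0, day_full - (len(daily_used_gb) - 1))


# Hypothetical volume: five daily readings, growing 10 GB a day, 600 GB capacity.
usage = [500.0, 510.0, 520.0, 530.0, 540.0]
print(days_until_full(usage, 600.0))
```

Refitting the line each day as new readings arrive gives a forecast that tracks changes in the growth rate, which is the same keep-relearning idea applied to prediction rather than detection.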

In other words, as system norms shift, the algorithms in Oracle Management Cloud keep up with the changes, and will still call out instances of drift. “By building machine learning into everything,” said Ramamurthy, “Oracle Management Cloud has fundamentally changed how users do management, operations, and security. Instead of studying thousands of bits of data to troubleshoot, machine learning is analyzing the data to bring insights to our customers.” Because one of those things is not like the others—and we need to fix that.

Alan Zeichick is principal analyst at Camden Associates, a tech consultancy in Phoenix, Arizona, specializing in software development, enterprise networking, and cybersecurity. Follow him @zeichick.