Ever heard the adage that the operating cost of a given application is often 2x the app’s acquisition cost? Or that bugs cost 100x more to fix in the production phase than during the requirements phase? Or that developers in DevOps environments often spend over half their time tweaking the “Ops” portion, like CI/CD, instead of writing code?
Removing effort from the operating portion of the equation has long been a goal of IT, though actually doing so is difficult in traditional environments where visibility into the edge (say, end-user monitoring and server-side instrumentation) is low and where remediation (say, optimizing configuration parameters) is manual. But change is on the horizon, thanks to three integrated capabilities provided by cloud platforms that can lead to autonomous, self-healing systems. Those three capabilities are automatic instrumentation, machine learning-powered analytics, and integrated remediation.
Automatic Instrumentation: Closing the Visibility Gap
Cloud software platform providers like Oracle are working hard to make visibility and instrumentation simply a feature of the underlying platform, rather than requiring a separate effort. What this means for developers is that as you write and deploy code, the platform automatically generates and delivers relevant activity and environment telemetry.
For example, PaaS services such as Java Cloud Service, SOA Cloud Service, and Database Cloud Service automatically expose detailed telemetry about both their environments (instance-level telemetry) and the artifacts deployed in those environments (code-level telemetry) to management services such as Oracle Management Cloud, without any extra work by developers or operations personnel.
By generating and exposing instrumentation automatically, we can close the visibility gap that often exists today between developers (who know what they coded, but not necessarily about environment dependencies) and operations (who know about environment dependencies, but not about what was coded).
Image 1: Two views of automated telemetry, generated by Java Cloud Service and Integration Cloud Service and exposed in Oracle Management Cloud.
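To make the idea of code-level telemetry concrete, here is a minimal sketch in Python of instrumentation that is applied once and then emits a record for every call. It is an illustration only, not how Java Cloud Service or Oracle Management Cloud work internally: the `instrumented` decorator, the in-memory `TELEMETRY` list, and the `place_order` function are all hypothetical names chosen for this example.

```python
import functools
import time

# Hypothetical in-memory sink (an assumption for this sketch); a real
# platform would ship these records to a management service instead.
TELEMETRY = []


def instrumented(func):
    """Wrap func so every call records its name, duration, and outcome."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = "ok"
        try:
            return func(*args, **kwargs)
        except Exception:
            status = "error"
            raise  # the caller still sees the original exception
        finally:
            # Runs on both success and failure, so no call goes unrecorded.
            TELEMETRY.append({
                "operation": func.__name__,
                "duration_s": time.perf_counter() - start,
                "status": status,
            })
    return wrapper


@instrumented
def place_order(quantity):
    """Hypothetical business function; it contains no monitoring code."""
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return quantity * 2
```

The business function stays free of monitoring logic, which is the point of the section above: the developer writes `place_order`, and the telemetry appears as a side effect of the environment it runs in.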
Machine Learning-Based Analytics
Having the relevant telemetry is a required first step, but understanding it is no easy task. We’re talking about terabytes of logs and tens of thousands of activity and configuration metrics, in an environment where neither developers nor operators fully understand the dependencies among components. After all, we’ve happily given up a level of control in the cloud in exchange for the ability to iterate faster.
Fortunately, we no longer have to rely on our human faculties to deal with this data overload; we can instead rely on purpose-built machine learning (ML). ML loves data. The more the better. And ML that is designed specifically for the operations problem can infer useful things from this data, such as how applications are built (topology, dependencies) and how they should behave (baselining, anomaly detection, forecasting), without any effort from developers.
So, instead of a human having to program a monitoring regime to tell how something ought to work, the monitoring regime tells the humans how the application actually works, how it should work in the future, and why it may not be working as it should. In this scenario, root-cause analysis becomes automated, capacity-planning becomes continuous, dependency-mapping just happens, and alerts/events only bubble up when they actually require attention.
Oracle Management Cloud’s ML portfolio provides topology-aware diagnostics that can forecast impending problems or identify root-cause of current problems without any operator knowledge of the systems being managed.
Image 2: Machine learning-based topology views generated automatically by Oracle Management Cloud.
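The baselining and anomaly detection described in this section can be illustrated with a deliberately simple statistical sketch in Python. It is not the algorithm used by Oracle Management Cloud, which this post does not describe; it only shows the principle of learning "normal" from the data itself rather than from a hand-written threshold. The function name `find_anomalies`, its parameters, and the sample latencies are assumptions made for this example.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from a trailing baseline.

    The baseline for each sample is the mean of the previous `window`
    values; a sample is flagged when it lies more than `threshold`
    standard deviations away from that mean.
    """
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline: any different value is unusual.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies


# Made-up response times in milliseconds, with one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 250, 101, 99, 100]
spike_indices = find_anomalies(latencies_ms)
```

Note that nobody told the function that 250 ms is too slow; the spike stands out only relative to the recent history. Production systems typically add seasonality, multiple metrics, and topology awareness on top of this basic idea.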
Automated Remediation: The Final Step
Now that we have the data we need to understand what’s going on, and the ability to analyze it in real time using machine learning to understand why it is happening and what we should do about it, we can move toward the final step: taking action.
Automated remediation is the most visible aspect of self-healing systems, but in a sense it’s also the oldest. API-based and script-based automation options have existed for most technical platforms for a long time and are wildly under-utilized. The problem in most IT organizations is not whether they can automate something, but whether they should run that particular automation at a given time. Sure, I can spin up a new VM or clone the microservice, but should I? Will it solve the problem or prevent another problem?
Put simply, for automation to be more heavily utilized, we need to get better at answering the “should I?” question. Fortunately, since we now have better telemetry data and the ability to analyze it, we can link our analytic results directly to automation at the platform level. For example, Oracle Management Cloud can automatically invoke automation regimes such as Chef and Puppet, or Cloud Service APIs, in response to analytic conclusions.
Image 3: Automated remediation in Oracle Management Cloud
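One way to picture the link between analytic conclusions and automation is a small decision gate that answers the “should I?” question before anything runs. The Python sketch below is a hypothetical illustration, not the Oracle Management Cloud API: the `Finding` class, the `PLAYBOOK` mapping, the action names, and the 0.8 confidence cutoff are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """A hypothetical analytic conclusion about one component."""
    component: str
    root_cause: str    # for example "cpu_saturation"
    confidence: float  # 0.0 (a guess) to 1.0 (certain)


# Hypothetical mapping from diagnosed root cause to a remediation action.
# In practice each action might trigger a Chef or Puppet run or an API call.
PLAYBOOK = {
    "cpu_saturation": "scale_out",
    "memory_leak": "restart_instance",
}


def decide(finding, min_confidence=0.8):
    """Return the action to take, falling back to a human when unsure."""
    action = PLAYBOOK.get(finding.root_cause)
    if action is None or finding.confidence < min_confidence:
        # Unknown cause or weak evidence: do not automate, escalate instead.
        return "notify_operator"
    return action
```

For instance, `decide(Finding("orders-api", "cpu_saturation", 0.93))` selects `scale_out`, while the same diagnosis at a confidence of 0.4 is routed to an operator. The automation itself is the easy part; the gate in front of it is what makes it safe to run unattended.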
Autonomous Software Isn’t Magic
Variability and complexity in software environments are inevitable. We have urgent business pressures to innovate and an increasingly sophisticated portfolio of loosely coupled cloud platforms on which to innovate. However, unless we take steps to remove the downstream operational effort associated with the increase in variability and complexity, we will be dragged into spending ever more time and energy on operations rather than development, and that 2x ratio may quickly become 5x or 10x.
Self-healing, self-tuning, and self-managing aren’t “magic.” Rather, they are the by-design outputs of a platform that first auto-generates sufficient instrumentation, then provides that instrumentation to an ML-based analytic engine, and finally uses the analytic results to invoke the proper automation. Given the pace of business change, these aren’t just cool features of a platform; they are absolute necessities for sustainable modern application development. And they are here, now.
We invite you to experience just what autonomous PaaS is like at cloud.oracle.com/tryit