Analyze billions of log records in seconds using log clustering

July 13, 2021 | 5 minute read
Sreeji Das
Senior Director of Development
Jae Young Yoon
Consulting Member of Technical Staff

As a developer, you consider it a good practice to log messages from your application. Logging makes life easier for whoever troubleshoots an application issue. Good logging not only reduces triaging time, but also helps you identify repeating patterns and regressions.

Logging is valuable, but log volume grows quickly, especially at cloud scale. Several logging solutions collect and send logs to a central repository, yet the repository itself still ends up holding a large number of logs. We routinely see our Logging Analytics customers generate billions of log entries a day.

High log volume: Finding a needle in a haystack

You can run into the following problems as you look through many logs:

What do I search for?

To get a manageable number of search results, you need to know what to search for. You can search for a specific error string or look at a chart for a metric, like the response time. Structured logs, like the Oracle Cloud Infrastructure (OCI) API Gateway logs, are easier to analyze, but logs that contain less structure and more English text are more difficult.

How do I search?

You need to understand the query syntax for your logging tool. If you know what you’re searching for, the process is easier because most logging tools allow you to type the search string and restrict the search to specific fields.

What do I do after searching?

If you don’t know what needle you’re searching for in the haystack, it’s also harder to know if the needle you found is really the one you wanted.

Log clustering: Billions of logs to a few hundred patterns

The cluster feature in Logging Analytics is a good solution when you have a large volume of logs, and you don’t know what to search for. Cluster has the following benefits:

  • Uses machine learning to group logs that have similar structure

  • Doesn’t require you to learn a query language

  • Doesn’t require you to supply parameters, like the number of clusters

  • Runs quickly on large datasets

Running the cluster analysis

To run cluster, log in to Oracle Cloud, select Logging Analytics, then Log Explorer, and click the Cluster visualization.

Figure 1. Invoking the cluster visualization

This selection runs cluster analysis on the default query, which searches for all the log records. The system clusters the log records, then analyzes and auto-categorizes those clusters.

The tabbed UI shows the categories of clusters.

Figure 2. Cluster tabs

The cluster table shows a sample message representing each cluster, along with details for each sample. You can click the count to view the individual log records.

Figure 3. Cluster table

As seen in the first row, the clustering algorithm grouped 30 log records into a single cluster sample. The algorithm then identified “alarm control process” and “1331” as words that vary and marked them as variables. You can click a variable to go deeper and see the other values that appear in this position.
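The idea of a cluster sample with variable positions can be illustrated with a small sketch. This is not the algorithm Logging Analytics uses, which the article does not describe. It is a deliberately simplified stand-in that assumes any token containing a digit is a variable. The names `signature`, `cluster_messages`, `variable_values`, and the sample log lines are all hypothetical.

```python
from collections import defaultdict

PLACEHOLDER = "<*>"  # marks a token position that varies between messages


def signature(message):
    """Mask every token that contains a digit (simplifying assumption)."""
    tokens = message.split()
    return " ".join(
        PLACEHOLDER if any(ch.isdigit() for ch in tok) else tok for tok in tokens
    )


def cluster_messages(messages):
    """Group messages that share the same masked signature."""
    clusters = defaultdict(list)
    for message in messages:
        clusters[signature(message)].append(message)
    return dict(clusters)


def variable_values(sig, members):
    """For each masked position, list the distinct values seen in the cluster."""
    positions = [i for i, tok in enumerate(sig.split()) if tok == PLACEHOLDER]
    return {i: sorted({m.split()[i] for m in members}) for i in positions}


# Made-up sample data for illustration only.
logs = [
    "process 1331 exited with status 0",
    "process 2207 exited with status 0",
    "disk sda1 is 91% full",
    "process 4410 exited with status 1",
]

clusters = cluster_messages(logs)
for sig, members in clusters.items():
    print(len(members), sig, variable_values(sig, members))
```

Here four records collapse into two signatures, and drilling into a masked position lists the values that occurred there, which is analogous to clicking a variable in the cluster table.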

Figure 4. Cluster variable details

Identifying potential issues

The system checks each cluster sample against a list of known keywords to decide whether it indicates an issue. Cluster then assigns a score to each issue, based on its severity. You can see the complete list of issues by clicking the Potential Issues tab.
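Keyword-based scoring of this kind can be sketched roughly as follows. The keyword list and weights below are invented for illustration; the article does not publish the keywords or scores the product actually uses, and `issue_score` and `potential_issues` are hypothetical names.

```python
# Hypothetical keyword weights (assumption, not the product's real list).
SEVERITY_KEYWORDS = {"fatal": 5, "error": 4, "fail": 3, "timeout": 2, "warn": 1}


def issue_score(sample):
    """Return the highest weight of any keyword found in the sample, or 0."""
    text = sample.lower()
    return max(
        (weight for keyword, weight in SEVERITY_KEYWORDS.items() if keyword in text),
        default=0,
    )


def potential_issues(samples):
    """Keep samples with a non-zero score, most severe first."""
    scored = [(issue_score(s), s) for s in samples]
    flagged = [pair for pair in scored if pair[0] > 0]
    return sorted(flagged, key=lambda pair: pair[0], reverse=True)


# Cluster samples; the second and third are made up for illustration.
samples = [
    "Sense Key: Hardware Error [current]",
    "session opened for user <*>",
    "connection timeout after <*> ms",
]

issues = potential_issues(samples)
for score, sample in issues:
    print(score, sample)
```

Scoring the cluster samples rather than every raw record keeps the work proportional to the number of signatures, not the number of log entries.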

Customizing cluster categories

You can use a dictionary to supply your own rules to categorize the clusters. This step is useful if you want to filter out the known messages. See the documentation for Dictionary Lookups for more details.

Identifying outliers

A typical application runs a loop, producing similar messages repeatedly over a period. When an unusual event like an application shutdown happens, you see a set of messages that aren’t usually seen. The Outliers tab shows cluster signatures that occurred only once. These occurrences usually indicate unusual events in your system.
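As a rough illustration of the "occurred only once" rule described above, the sketch below counts how often each cluster signature appears and keeps the singletons. It assumes the signatures have already been computed elsewhere; the function name `outliers` and the sample signatures are hypothetical.

```python
from collections import Counter


def outliers(signatures):
    """Return signatures that occur exactly once, in first-seen order."""
    counts = Counter(signatures)
    return [sig for sig, n in counts.items() if n == 1]


# Made-up signatures: a steady loop of routine messages plus one unusual event.
observed = (
    ["heartbeat ok"] * 3
    + ["shutdown requested by <*>"]
    + ["job <*> finished"] * 2
)

rare = outliers(observed)
print(rare)
```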

Figure 5. Cluster outliers

Autocorrelated trends

A request into your application typically traverses multiple tiers. For example, the request originates at the UI, passes through multiple mid tiers, and reaches the database or storage layers. The trend-clustering algorithm identifies messages that occur at the same time and groups them together.

You can see these correlations under the Trends tab.
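One way to picture "messages that occur at the same time" is to count each signature per time bucket and compare the resulting series. The sketch below uses Pearson correlation with a fixed threshold as a stand-in; the article does not say which similarity measure the trend-clustering algorithm uses, so the measure, the threshold of 0.9, the function names, and the counts are all assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def similar_trends(target, series, threshold=0.9):
    """Names of signatures whose counts rise and fall with the target's."""
    base = series[target]
    return [
        name
        for name, counts in series.items()
        if name != target and pearson(base, counts) >= threshold
    ]


# Made-up per-interval counts for three signatures.
series = {
    "Sense Key: Hardware Error [current]": [0, 0, 5, 9, 1, 0],
    "I/O error, dev <*>, sector <*>": [0, 1, 6, 10, 1, 0],
    "session opened for user <*>": [4, 5, 4, 5, 4, 5],
}

related = similar_trends("Sense Key: Hardware Error [current]", series)
print(related)
```

In this toy data the two disk-related signatures spike in the same intervals and are grouped, while the steady login message is not.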

Figure 6. Trend list

You can sort the Trend column to get the list of similar patterns.

The trend list page shows 31 similar trends for the message “Sense Key: Hardware Error [current]” coming from Linux Syslog logs. Clicking the link expands the entry to show all those signatures across all the tiers.

Figure 7. Trend details

Clicking the expand button next to the sparkline shows you the expanded chart. You can click Hide Similar Trends to go back to the cluster UI.

Summary

You can use the log clustering feature to reduce a large number of log records to a few signatures. The clustering algorithm automatically analyzes and categorizes the signatures, and also correlates them by their trends. This process helps when you don’t have a specific string to search for and you want the system to analyze and categorize a large volume of logs for you.
