Introduction

We are happy to announce the Solaris ACT Service, which aims to make the lives of administrators and engineers slightly easier.

Some believe that on-premises IT infrastructures and services are arcane. I disagree. They are stable, proven, and feature-rich products that power our day-to-day services in finance, manufacturing, retail, medicine, and more. Like anything we use every single day, the hardware and software that our society stands on need maintenance. We don’t brag about our maintenance. We just keep working.

Why

Internet Is Slow

Let us start by laying out some facts.

Oracle X8-8 Servers can have up to 6 TB of RAM.[1] On the SPARC architecture, Oracle M8-8 Servers can have up to 8 TB of RAM.[2] The world’s fastest broadband internet connection, as of this writing, is in Monaco, at 261.82 Mbps, or roughly 32 MB/s.[3] The world’s slowest connection is in Cuba, at 4.01 Mbps, or approximately 500 KB/s.

When your server panics, Oracle Solaris takes a snapshot of memory, compresses it, and saves it to disk. Admins can choose to send the saved file to our receiving servers at Oracle either manually or automatically.[5] Our receiving server then scans the memory snapshot, also known as a crash dump, to make sure that nothing malicious is included. All of this happens behind the scenes before our diagnosis starts.

Imagine that you are an administrator in Monaco with 8 TB of RAM in your M8-8. Suppose that only half of the memory was in use at the time of the panic and that, thanks to compression, your crash dump file is down to 2 TB. Even with the fastest internet connection in the world, you still need over 18 hours to send the file to Oracle. If you are in a country like Cuba, it is probably faster to send someone on-site for analysis.
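The arithmetic behind this estimate can be checked in a few lines of Python. This is a minimal sketch and not part of the ACT service: the function name transfer_hours is made up for illustration, and it assumes a link that sustains its full rated speed with no protocol overhead. With binary units (2 TiB at 32 MiB/s) the transfer takes about 18.2 hours; with decimal units and the unrounded 261.82 Mbps it takes about 17 hours.

```python
def transfer_hours(size_bytes: float, bytes_per_second: float) -> float:
    """Hours needed to move size_bytes over a link sustaining bytes_per_second."""
    return size_bytes / bytes_per_second / 3600


# Binary units: a 2 TiB crash dump over a 32 MiB/s link.
binary_estimate = transfer_hours(2 * 2**40, 32 * 2**20)

# Decimal units: a 2 TB crash dump over 261.82 Mbps (8 bits per byte).
decimal_estimate = transfer_hours(2e12, 261.82e6 / 8)

print(f"binary units:  {binary_estimate:.1f} hours")
print(f"decimal units: {decimal_estimate:.1f} hours")
```

Under the same assumptions, calling transfer_hours(2e12, 4.01e6 / 8) for the Cuba figure gives roughly 1,100 hours, or about 46 days.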

Admins Are Busy

When an administrator finds a crash dump, they need to extract the files, run the debugger, and execute debugger commands for analysis. During a production outage, this is only one of many things that administrators need to work on. Moreover, administrators may not be familiar with the internals of the software, so this is a task they would like help with.

What if we could automate the initial analysis after a panic? In a lot of cases, panics don’t repeat, so chances are that the rebooted system is able to run a diagnosis. Automating the process would save time for administrators. An initial diagnosis report would also be much smaller than the crash dump itself, so it would be much faster and easier to send to Oracle support. It would save time for the customers, and it would save time for us.

This is exactly the idea behind the ACT service. Your Oracle Solaris system now runs an initial analysis for you after a panic.

How Does It Work

Migration

When you update Oracle Solaris to SRU48 [4] or later, the update installs the ACT service. After the first reboot, the ACT service performs the following one-time migration.

  1. Moves your old crash dump files to /var/share/historical-crash.
  2. Creates a dedicated dataset with lz4 compression enabled for storing crash dump files (/var/share/crash).
  3. Installs the service that runs the ACT diagnosis tool after a panic.

If a Panic Happens

When a panic happens and the service is enabled, the service does the following.

  1. Check if we have sufficient free space on your storage to extract the dump files, and extract them if sufficient. If not, we don’t touch the crash dump and proceed with the boot process.
  2. If the dump files are extracted, we generate an initial analysis report by ACT.
  3. We return the system to the pre-analysis state by cleaning up the temporarily extracted files.
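The three steps above can be sketched as the following decision logic. This is only an illustration of the flow described here, not the actual implementation of the ACT service: the function names, the expansion_ratio parameter, and the step labels are all hypothetical, and nothing above says how the service estimates the space it needs.

```python
import shutil


def plan_after_panic(compressed_bytes: int, free_bytes: int,
                     expansion_ratio: float = 2.0) -> list:
    """Return the ordered actions for one crash dump.

    expansion_ratio is a made-up estimate of how much larger the
    extracted files are than the compressed dump.
    """
    needed = compressed_bytes * expansion_ratio
    if needed > free_bytes:
        # Not enough room: leave the crash dump untouched and keep booting.
        return ["continue boot"]
    return [
        "extract dump files",      # step 1
        "generate ACT report",     # step 2
        "remove extracted files",  # step 3
        "continue boot",
    ]


def free_bytes_on(path: str) -> int:
    """Free space on the filesystem that holds path (standard library call)."""
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    # "." stands in for the crash dump directory in this sketch.
    print(plan_after_panic(compressed_bytes=50 * 10**9,
                           free_bytes=free_bytes_on(".")))
```

The point of checking space first is that a failed check changes nothing on disk: the dump stays compressed and the boot continues, so the analysis never makes a bad situation worse.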

The generated report will reside under the path /var/crash/n/act.n, where n is a non-negative integer assigned to your crash dump, starting from 0.

Please see Table 1 for the estimated transfer time you would save in the US, the UK, and Japan by sending the ACT report instead of the full crash dump.

Table 1: Transfer Time You Save By Sending ACT Report First

Server      RAM (GB)  Compressed Dump Size (GB)  ACT Size (MB)  Saved Time (US)  Saved Time (UK)  Saved Time (Japan)
X7-2 Rack   1500      101.3                      6.4            1:06:16          2:12:06          1:15:33
X4-4        1500      49.2                       306 (bytes)    0:32:11          1:04:10          0:36:42
X4170 M2    148       35.94                      9.53           0:23:30          0:46:51          0:26:48

Connection speeds used: US: 203.81 Mbps, UK: 102.24 Mbps, Japan: 178.76 Mbps [3]
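The saved times in Table 1 can be reproduced with a short calculation. The sketch below assumes that each figure is the time to send the compressed dump minus the time to send the ACT report, using decimal units (1 GB = 1000 MB, 8 bits per byte) and rounding to the nearest second. That formula is inferred from the numbers rather than stated anywhere in this post, and the function name saved_time is hypothetical.

```python
def saved_time(dump_gb: float, act_mb: float, link_mbps: float) -> str:
    """Transfer time saved by sending the ACT report instead of the dump."""
    saved_megabits = (dump_gb * 1000 - act_mb) * 8
    seconds = round(saved_megabits / link_mbps)
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Connection speeds in Mbps, as listed under Table 1.
SPEEDS = {"US": 203.81, "UK": 102.24, "Japan": 178.76}

# (server, compressed dump size in GB, ACT report size in MB)
ROWS = [
    ("X7-2 Rack", 101.3, 6.4),
    ("X4-4", 49.2, 306e-6),  # 306 bytes expressed in MB
    ("X4170 M2", 35.94, 9.53),
]

for server, dump_gb, act_mb in ROWS:
    times = [saved_time(dump_gb, act_mb, mbps) for mbps in SPEEDS.values()]
    print(server, *times)
```

In every row the report is so small next to the dump that the saved time is essentially the full dump transfer time.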

Conclusion and Next Steps

As you can see, the ACT service automates the initial analysis of your panic and reduces the “panic-to-initial diagnosis” time. It saves administrators the time they would otherwise spend fighting with the debugger, time they could use for something else during the crisis. You can start using the service by upgrading Oracle Solaris to SRU48 or later. Hopefully it will reduce your downtime in case of panics.

References

[1] Oracle Server X8-8 System Architecture. At the time of the initial draft, it was common to see a 4-socket configuration that allows users to have up to 3 TB of RAM.
[2] Oracle’s SPARC T8 and SPARC M8 Server Architecture
[3] Internet Speeds by Country 2022
[4] ENH 32531200 – An Act file should be generated during savecore to aid automation