Linux kernel developer Prakash Sangappa works closely with the Oracle Database team to ensure that the database runs best on Oracle Linux. As the Oracle Database team brings new capabilities to a release, Prakash ensures that any necessary support is in Oracle Linux. It is always exciting when Prakash and the team deliver new features to the Linux operating system. In this blog post, Prakash talks about the challenges of trying to run a process with real-time (RT) priority in a user namespace.
User namespaces provide user ID and group ID isolation. With user namespaces, an unprivileged user can be mapped to the root user (uid 0) inside a user namespace. That user then gains full privileges and capabilities for operations within the user namespace, including the ability to create other namespaces. Oracle Multitenant, the architecture for the next-generation database cloud, will be using namespaces to create and isolate database instances on a system.
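As a rough illustration of the mapping described above: the kernel exposes a process's uid mapping in /proc/<pid>/uid_map, one range per line with three fields (first ID inside the namespace, first ID outside it, and the range length). The sketch below parses that format. The names `IdRange`, `parse_id_map` and `outside_id_of_root` are hypothetical helpers introduced only for this example, and the uid 1000 used in the usage note is an assumed unprivileged user.

```python
from typing import List, NamedTuple, Optional


class IdRange(NamedTuple):
    """One line of a uid_map/gid_map file (hypothetical helper type)."""
    inside: int   # first ID as seen inside the user namespace
    outside: int  # first ID as seen in the parent namespace
    length: int   # number of consecutive IDs covered by this range


def parse_id_map(text: str) -> List[IdRange]:
    """Parse the contents of a uid_map or gid_map file."""
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        ranges.append(IdRange(*(int(field) for field in fields)))
    return ranges


def outside_id_of_root(ranges: List[IdRange]) -> Optional[int]:
    """Return the outside ID that uid 0 inside the namespace maps to."""
    for r in ranges:
        if r.inside <= 0 < r.inside + r.length:
            return r.outside - r.inside
    return None  # uid 0 is not mapped in this namespace
```

For example, a namespace whose uid_map contains the single line `0 1000 1` maps uid 0 inside to uid 1000 outside, so `outside_id_of_root(parse_id_map("0 1000 1"))` evaluates to 1000. On a Linux system, the current process's own mapping can be read from /proc/self/uid_map and passed to `parse_id_map`.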
Though uid 0 in a user namespace gets all capabilities, some of them are ineffective (e.g. CAP_SYS_NICE, CAP_IPC_LOCK, CAP_SYS_TIME) because they would allow modifying global resources: setting RT priority, locking memory and setting the system time, respectively. This restriction is problematic for Oracle Multitenant, especially for CAP_SYS_NICE, which is required to set RT priority on some of its critical processes.
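One way to observe this restriction is to ask the kernel for an RT scheduling policy and see whether the request is refused. The following is a minimal sketch, not part of any Oracle tooling: `can_set_rt_priority` is a hypothetical helper that briefly switches the calling process to SCHED_FIFO and then restores its previous policy. It assumes Linux and Python's standard `os` scheduling calls.

```python
import os


def can_set_rt_priority(priority: int = 1) -> bool:
    """Probe whether the calling process may use the SCHED_FIFO RT policy.

    Hypothetical helper: tries to switch the current process (pid 0 means
    "self") to SCHED_FIFO and, on success, restores the previous policy.
    """
    old_policy = os.sched_getscheduler(0)
    old_param = os.sched_getparam(0)
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except OSError:
        # Typically EPERM: the process lacks an effective CAP_SYS_NICE,
        # as is the case for uid 0 inside a user namespace.
        return False
    # The request was accepted; put the original policy back.
    os.sched_setscheduler(0, old_policy, old_param)
    return True


if __name__ == "__main__":
    print("RT priority permitted:", can_set_rt_priority())
```

Run as an ordinary unprivileged user, or as uid 0 inside a user namespace, the probe is expected to report that RT priority is not permitted; run as root in the init namespace it would normally succeed.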
Below is a brief description of the architecture and the use case.
Oracle Multitenant helps simplify consolidation, provisioning, management and more. This new architecture allows a container database (CDB) to hold zero or more customer databases called pluggable databases (PDBs). It helps to manage many databases as one. An existing database can be adopted without change as a pluggable database. You can find more information about Oracle Multitenant here.
For security and isolation, Oracle Multitenant will use Linux namespaces including user namespaces to sandbox PDBs which are nested inside the CDB. Namespaces will also be used to isolate many CDBs on the system.
Within a CDB, there are critical processes, like the log writer, that have to run at a higher priority. Such a process needs to be scheduled on a CPU as soon as it is ready to run. For this reason, these critical processes are assigned RT priority. However, with user namespaces in use, setting RT priority from within the user namespace is not possible.
One way to handle this limitation would be to use a helper process running as root in the init namespace. This process could set RT priority for the critical processes within the user namespace on request. However, this is not convenient.
As RT priority is not a resource that can be namespaced by introducing a new namespace type, the following approaches could address the requirement. The main concern with any of them would be runaway processes running with RT priority that render the system unresponsive.
Allow the root user (uid 0) from the init namespace, when mapped inside a user namespace, to set RT priority.
If a user namespace were to be tagged or indicated in some way, permit the CAP_SYS_NICE capability to take effect so that RT priority can be set.
With cgroups bandwidth control in place, allow the root user (uid 0) inside a user namespace to set RT priority.
Add a scheduler option to run processes at a fixed high priority above all user priority, like a new scheduling class.
This topic was presented at Linux Plumbers Conference 2018. From the discussion that ensued after the presentation, opinion seems to be leaning towards a solution based on cgroups bandwidth control to allow setting RT priority inside a user namespace. We plan to further explore this approach.
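To make the idea of bandwidth control concrete: Linux already throttles RT tasks system-wide through two sysctls, sched_rt_period_us and sched_rt_runtime_us, which cap how much of each period RT tasks may consume (a runtime of -1 disables the limit). The sketch below computes that share. It only illustrates the existing global mechanism and is an assumption about what "bandwidth control" looks like in practice, not the per-cgroup design that may eventually be adopted. The function names are hypothetical.

```python
from pathlib import Path

SYSCTL_DIR = Path("/proc/sys/kernel")


def rt_bandwidth_fraction(runtime_us: int, period_us: int) -> float:
    """Fraction of each period that RT tasks are allowed to run."""
    if period_us <= 0:
        raise ValueError("period_us must be positive")
    if runtime_us < 0:
        return 1.0  # -1 means RT throttling is disabled
    return min(runtime_us / period_us, 1.0)


def read_system_rt_bandwidth() -> float:
    """Read the global RT throttling sysctls and return the allowed share."""
    runtime_us = int((SYSCTL_DIR / "sched_rt_runtime_us").read_text())
    period_us = int((SYSCTL_DIR / "sched_rt_period_us").read_text())
    return rt_bandwidth_fraction(runtime_us, period_us)
```

With the usual kernel defaults of 950000 and 1000000 microseconds, RT tasks may use 95% of each second, which leaves a small reserve for non-RT tasks so that a runaway RT process cannot fully starve the system.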