Oracle Linux kernel developer Nagarathnam Muthusamy contributed this blog post on the challenges of translating PIDs (process IDs) between different namespaces. This feature is currently lacking from namespace support in the Linux kernel and is important for enabling multitenant use of the Oracle database via CDBs.
The process ID (PID) namespace facility in the Linux kernel is an effective way of isolating groups of processes from one another, and various container implementations build on it. Though strong isolation between processes is desirable, some processes always need to monitor the activity and resource utilization of other processes in the system. Each PID namespace has its own sequence of PIDs, so a process monitoring them from the top of the hierarchy must translate process IDs to and from its own PID namespace. The Linux kernel has a variety of APIs that return a PID in their results. Any such API can be used for PID translation, and the following are a few of the approaches.
The sender can translate its PID from its own namespace to a PID in the target namespace by sending and receiving an SCM_CREDENTIALS message. The drawback of this method is that it needs a socket communication channel for PID translation, which adds management overhead. It also does not let the sender translate the PID of another process unless the sender is root or has CAP_SYS_ADMIN.
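As a rough illustration of this approach, the following C sketch sends an SCM_CREDENTIALS message over a Unix domain socket pair and reads back the PID that the kernel reports to the receiver. The helper names (send_own_creds, recv_sender_pid, scm_credentials_demo) are made up for this example. The demonstration runs both ends in one process, so the PID that comes back is simply the caller's own; a receiver in a different PID namespace would instead see the sender's PID as translated into its namespace.

```c
#define _GNU_SOURCE /* exposes struct ucred and SCM_CREDENTIALS */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for one struct ucred. */
union cred_ctrl {
    char buf[CMSG_SPACE(sizeof(struct ucred))];
    struct cmsghdr align;
};

/* Send one byte on sock with the caller's own credentials attached. */
int send_own_creds(int sock)
{
    struct ucred cred = { .pid = getpid(), .uid = getuid(), .gid = getgid() };
    char byte = 'x';
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union cred_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_CREDENTIALS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(cred));
    memcpy(CMSG_DATA(cmsg), &cred, sizeof(cred));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one message and return the sender's PID as the kernel reports it
 * to the receiver, or -1 on error. SO_PASSCRED must already be set on sock. */
pid_t recv_sender_pid(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union cred_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    struct ucred cred;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) < 0)
        return -1;
    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_CREDENTIALS) {
            memcpy(&cred, CMSG_DATA(cmsg), sizeof(cred));
            return cred.pid;
        }
    }
    return -1;
}

/* Single-process demonstration: both socket ends live in the caller. */
pid_t scm_credentials_demo(void)
{
    int sv[2], on = 1;
    pid_t pid = -1;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;
    if (setsockopt(sv[1], SOL_SOCKET, SO_PASSCRED, &on, sizeof(on)) == 0 &&
        send_own_creds(sv[0]) == 0)
        pid = recv_sender_pid(sv[1]);
    close(sv[0]);
    close(sv[1]);
    return pid;
}
```

In a real monitor the two socket ends would belong to different processes, which is exactly the channel-management overhead described above.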
The /proc/<pid>/status file provides a way to find the PIDs associated with a process in different namespaces. Translating a PID from a child namespace into the parent namespace, working from the parent namespace, requires searching all the status files in the parent namespace to find the desired PID at the desired level.
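The search described above might look something like the following C sketch. It relies on the NSpid line of the status file, which lists a process's PID in each nested PID namespace, starting with the namespace that procfs was mounted from. The function names (parse_nspid_line, read_nspids, find_outer_pid) and the buffer size of 32 entries are assumptions made for this example.

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Parse one "NSpid:" line into out[]. Returns the number of PIDs stored,
 * or -1 if the line is not an NSpid line. */
int parse_nspid_line(const char *line, pid_t *out, int max)
{
    const char *p;
    char *end;
    int n = 0;

    if (strncmp(line, "NSpid:", 6) != 0)
        return -1;
    p = line + 6;
    while (n < max) {
        long v = strtol(p, &end, 10);
        if (end == p)          /* no further number on the line */
            break;
        out[n++] = (pid_t)v;
        p = end;
    }
    return n;
}

/* Read the NSpid entries of one process; -1 if unavailable. */
int read_nspids(pid_t pid, pid_t *out, int max)
{
    char path[64], line[256];
    FILE *f;
    int n = -1;

    snprintf(path, sizeof(path), "/proc/%ld/status", (long)pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        n = parse_nspid_line(line, out, max);
        if (n >= 0)
            break;
    }
    fclose(f);
    return n;
}

/* Walk every /proc/<pid>/status looking for a process whose PID at nesting
 * depth `level` equals inner_pid. Returns the PID at depth 0, or -1.
 * Only the first match is returned; sibling namespaces at the same depth
 * can reuse the same inner PID, so a real tool must disambiguate. */
pid_t find_outer_pid(pid_t inner_pid, int level)
{
    DIR *d = opendir("/proc");
    struct dirent *e;
    pid_t ids[32], found = -1;

    if (!d)
        return -1;
    while (found < 0 && (e = readdir(d)) != NULL) {
        char *end;
        long pid = strtol(e->d_name, &end, 10);
        int n;

        if (*end != '\0' || pid <= 0)   /* not a PID directory */
            continue;
        n = read_nspids((pid_t)pid, ids, 32);
        if (n > level && ids[level] == inner_pid)
            found = ids[0];
    }
    closedir(d);
    return found;
}
```

The full directory walk in find_outer_pid is the cost this approach pays: every translation in that direction touches every process visible in /proc.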
struct shmid_ds, as returned by IPC_STAT on a shared memory segment, contains the following two elements.
pid_t shm_cpid; /* PID of creator */
pid_t shm_lpid; /* PID of last shmat(2)/shmdt(2) */
struct msqid_ds, as returned by IPC_STAT on a message queue, contains the following two elements.
pid_t msg_lspid; /* PID of last msgsnd(2) */
pid_t msg_lrpid; /* PID of last msgrcv(2) */
PIDs in these elements are translated to the PID namespace of the caller. Though monitors can use them to track the usage of shared resources by processes regardless of their namespace, these APIs cannot be used for generic PID translation without creating extra shared memory segments or message queues.
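For the shared memory case, a minimal sketch of reading these fields is shown below. The function name shm_pids_demo is hypothetical, and the example creates a private 4096-byte segment purely for demonstration, which is the kind of extra resource the paragraph above warns about. Run in a single process, both fields come back as the caller's own PID.

```c
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a private segment, attach to it, and report the creator PID and
 * the PID of the last attach/detach as seen by the caller.
 * Returns 0 on success, -1 on error. */
int shm_pids_demo(pid_t *creator, pid_t *last)
{
    struct shmid_ds ds;
    void *addr;
    int rc = -1;
    int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);

    if (id < 0)
        return -1;
    addr = shmat(id, NULL, 0);          /* records us in shm_lpid */
    if (addr != (void *)-1) {
        if (shmctl(id, IPC_STAT, &ds) == 0) {
            *creator = ds.shm_cpid;
            *last = ds.shm_lpid;
            rc = 0;
        }
        shmdt(addr);
    }
    shmctl(id, IPC_RMID, NULL);         /* remove the demo segment */
    return rc;
}
```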
The GETPID command of semctl provides the PID of the process that performed the last operation on a semaphore. As with shmctl and msgctl, this is an excellent way to monitor the users of a semaphore, but it cannot be used for generic PID translation without creating extra semaphores. shmctl and semctl were fixed in upstream Linux kernel 4.17. This facility might not be available in older releases, but it will be part of the Oracle UEK.
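A short sketch of the semaphore variant follows, assuming a throwaway private semaphore set and a hypothetical function name, sem_last_operator_demo. The caller performs one operation on the semaphore and then asks GETPID who did it, so the answer here is the caller's own PID.

```c
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <unistd.h>

/* Perform one operation on a private semaphore and return the PID that
 * semctl(GETPID) reports for it, or -1 on error. */
pid_t sem_last_operator_demo(void)
{
    struct sembuf up = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };
    pid_t last = -1;
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);

    if (id < 0)
        return -1;
    if (semop(id, &up, 1) == 0)
        last = semctl(id, 0, GETPID);   /* PID of the last semop() caller */
    semctl(id, 0, IPC_RMID);            /* remove the demo semaphore set */
    return last;
}
```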
The F_GETLK command of fcntl provides information about the process that is holding a file lock. This information is translated to the caller's namespace. Any process that requires translation across different PID namespaces can create a dummy file in a common location and lock it. Any query on the owner of the file lock through fcntl will return the PID of the observed process as translated into the caller's namespace. Though a file is lighter weight than any of the IPC mechanisms, creating and cleaning up files for every process in a system just for PID translation is an added overhead.
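The file-lock technique can be sketched in C as follows. The helper names (lock_owner, hold_lock, file_lock_demo) and the marker file under /tmp are assumptions for this example. A child process is needed because F_GETLK only reports locks that conflict with the caller, so a process never sees its own lock; here the forked child plays the observed process and the parent plays the monitor.

```c
#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return the PID, as seen by the caller, of a process holding a lock on
 * path; 0 if the file is unlocked, -1 on error. */
pid_t lock_owner(const char *path)
{
    /* l_start = l_len = 0 means the whole file */
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    int fd = open(path, O_RDWR);
    int rc;

    if (fd < 0)
        return -1;
    rc = fcntl(fd, F_GETLK, &fl);
    close(fd);
    if (rc < 0)
        return -1;
    return fl.l_type == F_UNLCK ? 0 : fl.l_pid;
}

/* Observed process: lock the marker file, report readiness, then idle. */
static void hold_lock(const char *path, int ready_fd)
{
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    int fd = open(path, O_RDWR);

    if (fd < 0 || fcntl(fd, F_SETLK, &fl) < 0)
        _exit(1);
    if (write(ready_fd, "1", 1) != 1)
        _exit(1);
    for (;;)
        pause();
}

/* Returns 0 if F_GETLK in the parent names the child as the lock owner. */
int file_lock_demo(void)
{
    char path[] = "/tmp/pid-marker-XXXXXX";
    int ready[2], fd, ok = -1;
    char c;
    pid_t child;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    close(fd);
    if (pipe(ready) < 0) {
        unlink(path);
        return -1;
    }
    child = fork();
    if (child == 0) {
        close(ready[0]);
        hold_lock(path, ready[1]);      /* never returns */
    }
    close(ready[1]);
    if (child > 0) {
        /* Wait until the child holds the lock, then query the owner. */
        if (read(ready[0], &c, 1) == 1 && lock_owner(path) == child)
            ok = 0;
        kill(child, SIGKILL);
        waitpid(child, NULL, 0);
    }
    close(ready[0]);
    unlink(path);                        /* the cleanup cost noted above */
    return ok;
}
```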
Is there any cleaner way?
Usually, when your monitor process or any other process in the system requires PID translation, you might be able to work with one of the above-mentioned methods and get around this problem. If none of the above options satisfies your use case, well, you are not alone!
I have been working with Konstantin to resurrect his old patch, which provides PID translation capabilities through a new system call called translate_pid. The discussion can be followed at https://lkml.org/lkml/2018/4/4/677. The link also has pointers to previous versions of the API.
The API started off with the following function signature:
pid_t getvpid(pid_t pid, pid_t source, pid_t target)
The major issue highlighted here was the use of a PID to identify a namespace. Any API that uses PIDs is susceptible to race conditions involving PID recycling. The Linux kernel has many existing PID-based interfaces only because there was no better method to identify the resources when those interfaces were designed. This suggestion led to the following API:
pid_t translate_pid(pid_t pid, int source, int target);
where source and target are file descriptors pointing to the /proc/<pid>/ns/pid files of the source and target namespaces. The major issue with this API is the additional step of opening and closing a file for every PID translation. This API also rules out use cases that require PID translation but lack the privileges to open the /proc/<pid>/ns/pid file.
The API under discussion at the time of writing this blog tries to get the best of both worlds as follows.
pid_t translate_pid(pid_t pid, int source_type, int source, int target_type, int target);
Here the *_type arguments change the way source and target are interpreted, as follows.
TRANSLATE_PID_CURRENT_PIDNS - current pid namespace, argument is unused
TRANSLATE_PID_TASK_PIDNS - task pid-ns, argument is task pid
TRANSLATE_PID_FD_PIDNS - pidns fd, argument is file descriptor
Once the API is finalized, we will have a cleaner method to translate PIDs without having to work around the problem using the other existing methods.