The container ecosystem is growing up. Proprietary implementations are gradually being replaced by open standards. One of the most important standards from the Open Container Initiative is the OCI runtime spec, which allows alternative container runtimes to be developed. Choice in runtimes leads to innovation and better implementations. As part of our container efforts at Oracle, we decided to implement a runtime in Rust called railcar. Below you'll find more information about the development of the runtime, the challenges we faced, and the lessons we learned.
These days, almost all container utilities are written in C or Go. C is great for interacting with the Linux kernel, but has security drawbacks. Go is great for speed of development and memory safety, but it has some limitations that create problems when interacting with namespaces. Rust sits at a perfect intersection of these two languages: it has memory safety and higher-level primitives, but doesn't sacrifice low-level control over threading and can therefore handle namespaces properly. It is a great choice for container utilities, and we hope to see the Rust community and the container community collaborate more in the future.
Linux namespaces are quite challenging to interact with correctly. The amount of forking and synchronization needed to properly enter user and pid namespaces is especially frustrating. While there are some advantages to having limited isolation primitives that can be combined, it definitely makes you long for a simpler model like jails or zones.
Complexity and Debuggability
Implementing the spec is fairly straightforward, but getting your backend working behind Docker is much more painful. Specifically, the number of different processes involved makes even seeing what is going on a challenge. The call graph from a command like `docker run` looks like docker client -> dockerd -> containerd -> containerd-shim -> runtime. When things go wrong, it isn't clear what went wrong or why. Shell calls made by containerd to your runtime will often crash containerd, which makes dockerd restart it, leading to endless failing loops with no logging of the actual error.
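One way to cope with errors vanishing somewhere in that chain is to have the runtime record its own failures before it exits. The following is only a minimal sketch of that idea, not something railcar is known to do: the names `log_failure` and `run_logged` and the log line format are hypothetical.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Append one line describing a failed invocation to `log_path`.
/// (Hypothetical helper; the line format is an assumption.)
fn log_failure(log_path: &Path, args: &[String], error: &str) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .open(log_path)?;
    writeln!(file, "args={:?} error={}", args, error)
}

/// Run `op`; if it fails, record the error before handing it back to the caller.
fn run_logged<F>(log_path: &Path, args: &[String], op: F) -> Result<(), String>
where
    F: FnOnce() -> Result<(), String>,
{
    let result = op();
    if let Err(ref e) = result {
        // Logging is best effort: a logging failure must not mask the real error.
        let _ = log_failure(log_path, args, e);
    }
    result
}
```

Wrapping the top-level command handler in something like `run_logged` means the real error survives on disk even if the processes above the runtime restart and discard it.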
One of the most frustrating bits was the discovery that simply implementing the spec as written is not enough to have an alternative backend for Docker. A few other things are needed. For example, there are times when containerd tries to call `<runtime> ps` or `<runtime> stats`. These are supported by runc but not mentioned in the spec.
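To make the gap concrete, here is a minimal sketch of a command dispatcher that accepts the operations the runtime spec defines (state, create, start, kill, delete) alongside the extra `ps` and `stats` subcommands. The `Command` enum and `parse_command` function are hypothetical names for illustration and are not taken from railcar's source.

```rust
/// Subcommands a runtime may need to accept when used behind Docker.
#[derive(Debug, PartialEq, Eq)]
enum Command {
    // Operations defined by the OCI runtime spec.
    State,
    Create,
    Start,
    Kill,
    Delete,
    // Extras that containerd may call; supported by runc, absent from the spec.
    Ps,
    Stats,
}

impl Command {
    /// True if the command is one of the operations named in the spec.
    fn in_spec(&self) -> bool {
        !matches!(self, Command::Ps | Command::Stats)
    }
}

/// Map a subcommand name to a `Command`, rejecting anything unknown.
fn parse_command(name: &str) -> Result<Command, String> {
    match name {
        "state" => Ok(Command::State),
        "create" => Ok(Command::Create),
        "start" => Ok(Command::Start),
        "kill" => Ok(Command::Kill),
        "delete" => Ok(Command::Delete),
        "ps" => Ok(Command::Ps),
        "stats" => Ok(Command::Stats),
        other => Err(format!("unsupported command: {}", other)),
    }
}
```

Keeping the spec and non-spec commands in one enum, with `in_spec` to tell them apart, makes it easy to see which parts of the surface exist only for compatibility.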
In some cases containerd and runc don't exactly follow the spec. For example, the spec says that a call to delete after the container has been deleted should return an error, but testing clearly showed that containerd calls delete twice and runc accepts the double delete. runc does return an error once the instance has been fully removed, so one must assume that the first delete doesn't actually remove everything, which is why it is called a second time. This spells trouble for alternative runtimes that do properly delete everything on the first call.
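One possible workaround for a runtime that removes everything on the first call is to treat a delete of an unknown container as success when running behind containerd. The sketch below models that with an in-memory set. `ContainerStore` and its `tolerate_missing` switch are hypothetical, and the lenient mode deliberately deviates from the spec's wording.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq, Eq)]
enum DeleteError {
    NotFound(String),
}

/// In-memory stand-in for a runtime's on-disk container state.
struct ContainerStore {
    live: HashSet<String>,
    /// If true, deleting an unknown container succeeds (double-delete tolerant).
    tolerate_missing: bool,
}

impl ContainerStore {
    fn new(tolerate_missing: bool) -> Self {
        ContainerStore { live: HashSet::new(), tolerate_missing }
    }

    fn create(&mut self, id: &str) {
        self.live.insert(id.to_string());
    }

    fn delete(&mut self, id: &str) -> Result<(), DeleteError> {
        // `remove` returns true only if the container was still present.
        if self.live.remove(id) || self.tolerate_missing {
            Ok(())
        } else {
            Err(DeleteError::NotFound(id.to_string()))
        }
    }
}
```

In strict mode the second delete fails as the spec describes; in lenient mode it succeeds, which matches what a caller that always deletes twice appears to expect.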
Some parts of the spec seem a bit superfluous, probably because it is still so young. One of the more challenging things to implement was the prestart and poststart hooks. These seemed quite useful when the "run" command was in use, but now that the spec has switched over to "create", "start", "stop" commands, this type of work could be done outside of the runtime. The runtime should focus on just containerizing well; having the runtime exec hooks violates separation of concerns.
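As an illustration of doing that work outside the runtime, the sketch below plans a launch in which the caller, not the runtime, runs the hooks around separate create and start invocations. Everything here is an assumption for illustration: the `Step` type and `plan_launch` function are hypothetical, the argument layout follows runc-style `create --bundle <path> <id>` and `start <id>`, and hooks run this way execute in the caller's context rather than exactly where the spec places them.

```rust
/// One step in launching a container when hooks are driven by the caller.
#[derive(Debug, PartialEq, Eq)]
enum Step {
    /// Arguments to pass to the runtime binary.
    Runtime(Vec<String>),
    /// Command line of an external hook program.
    Hook(Vec<String>),
}

/// Build the ordered steps: create, prestart hooks, start, poststart hooks.
fn plan_launch(
    id: &str,
    bundle: &str,
    prestart: &[Vec<String>],
    poststart: &[Vec<String>],
) -> Vec<Step> {
    let mut steps = Vec::new();
    // Assumed runc-style argument layout for `create`.
    steps.push(Step::Runtime(vec![
        "create".into(),
        "--bundle".into(),
        bundle.into(),
        id.into(),
    ]));
    // The container exists but the user process has not started yet.
    steps.extend(prestart.iter().cloned().map(Step::Hook));
    steps.push(Step::Runtime(vec!["start".into(), id.into()]));
    steps.extend(poststart.iter().cloned().map(Step::Hook));
    steps
}
```

Each step could then be executed with `std::process::Command`, stopping at the first failure, which leaves the runtime itself with no hook logic at all.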
Railcar does do one thing a bit differently from the default runtime. It always creates an init process. The lack of an init process in pid namespaces leads to some weird issues. We decided to take the opportunity afforded by a new runtime to experiment with a different option. We do like the consistency that this feature provides, but it may lead to some slightly unexpected behavior around stdout and stderr when a pseudoterminal isn't in use.
The Rust Language
Rust turned out to be an excellent choice for a project like this. The nasty C code injection before the Go runtime starts that was necessary for runc has always been annoying, and Rust gave us sufficient flexibility to implement the whole of railcar in one language. It can also build a static binary if required!
The language is still young and it could use some more library support, but the majority of what we needed to accomplish was fairly straightforward. We managed to keep the usage of unsafe to a minimum, which gives us a lot of confidence in the memory safety of our code.
Fast Container Startup
One of our goals for an alternative runtime was to see how low we could get container start time. While the Rust implementation is slightly faster than the one in Go, we were surprised to find that the majority of the time spent in container start (about 150ms on an unloaded system) is actually spent waiting on locks in the kernel.
We used some off-CPU profiling techniques and discovered that the two biggest offenders are cgroup creation and network namespace creation. That means that for really fast startup, the best choice would be to pre-create the namespaces and cgroups and only specialize them during start.
Some changes to the create/start flow in the runtime spec would be needed to support something like this, but it would be worthwhile to investigate. We have seen startup times under 10ms when the network namespaces and cgroups have been pre-created.
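The pre-creation idea can be sketched as a small pool that does the expensive work ahead of time and only assigns an owner at start. This is a sketch under stated assumptions, not railcar's implementation: `Sandbox` and `SandboxPool` are hypothetical names, and `create_sandbox` merely generates names where a real runtime would create the network namespace and cgroup in the kernel.

```rust
use std::collections::VecDeque;

/// Placeholder for a pre-created network namespace plus cgroup.
#[derive(Debug)]
struct Sandbox {
    netns_name: String,
    cgroup_name: String,
    /// Container this sandbox has been specialized for, if any.
    owner: Option<String>,
}

struct SandboxPool {
    ready: VecDeque<Sandbox>,
    next_id: u64,
}

impl SandboxPool {
    /// Pay the creation cost for `n` sandboxes up front.
    fn with_capacity(n: usize) -> Self {
        let mut pool = SandboxPool { ready: VecDeque::new(), next_id: 0 };
        for _ in 0..n {
            let sandbox = pool.create_sandbox();
            pool.ready.push_back(sandbox);
        }
        pool
    }

    /// Stand-in for the slow kernel work (namespace and cgroup creation).
    fn create_sandbox(&mut self) -> Sandbox {
        let id = self.next_id;
        self.next_id += 1;
        Sandbox {
            netns_name: format!("pool-netns-{}", id),
            cgroup_name: format!("pool-cgroup-{}", id),
            owner: None,
        }
    }

    /// Hand out a sandbox for `container_id`. The bool reports whether a
    /// pre-created sandbox was available (fast path) or one had to be built.
    fn acquire(&mut self, container_id: &str) -> (Sandbox, bool) {
        let (mut sandbox, prewarmed) = match self.ready.pop_front() {
            Some(s) => (s, true),
            None => (self.create_sandbox(), false),
        };
        // Specialization: the only work left on the start path.
        sandbox.owner = Some(container_id.to_string());
        (sandbox, prewarmed)
    }
}
```

With a pool sized to expected demand, start only pays for specialization; when the pool runs dry, `acquire` falls back to the slow path rather than failing.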
A method for allowing railcar start to access stdout and stderr of the init process would improve compatibility with runc. Right now, if you run railcar in Docker without a terminal, stdout and stderr from the user process get blackholed. Some of the other features of runc that railcar has not implemented, like the stats command, would also be useful. Automated testing against newer versions of the spec would be another valuable addition.
In general, railcar hasn't yet been used in a wide variety of scenarios. We would appreciate the community trying it out and helping with any issues that arise. It would be great to see railcar used as a backend for Kubernetes via CRI-O as well.
As a whole, implementing an alternative runtime was a very valuable exercise. We hope that the runtime will gain some attention from the community so that useful containerization alternatives continue to exist and promote experimentation and cross-pollination. Feel free to clone the GitHub repository and let us know if you find any issues.