Containers have grown dramatically in popularity over the last few years. They have been used to replace both package managers and config management. Unfortunately, while the standard build process for containers is ideal for developers, the resulting container images make operators' jobs more difficult.
While building cloud services here at Oracle, we identified a number of services that could benefit from containers, but we had to ensure their stability and security. In order to be comfortable with using containers in production, we had to make some changes to our container build process. After analyzing some of the problems with our process, we developed a method for building and running containers that dramatically improved their stability and security for our environment. Note that we started this process over two years ago, and the stability and security of containers have dramatically improved since then.
Many of the problems arise from the practice of putting an entire Linux user space into the container image. Specifically, this leads to:
1. Large Images
The standard method for building images for the longest time started with `FROM debian:jessie`, which resulted in large images that necessitated layers just to be manageable. The situation has improved somewhat with the switch to Alpine, but not all software is trivial to build there. Even today, many of the most popular images are 1 GB or larger in size.
2. Privilege Escalation
The practice of putting an entire Linux user space into the application deployment artifact means that a compromise of the application opens up other potential avenues of attack for privilege escalation. It is more secure to have nothing but the application inside the container.
3. Vulnerability Management
One of the biggest security issues in containers is out-of-date dependencies. Containers are immutable, so there is no solution for updating vulnerable software via a package manager; it must be done at build time. The standard build process makes it difficult to determine if a container needs to be updated because it is unclear which files in the container are actually being used. Is the application using the vulnerable version of openssl in the container or was it just included as part of the standard install?
Many of these build issues are already addressed by leading container systems. The addition of multi-stage builds and the --squash option provides other ways of operationalizing containers. We have taken it a step further and are curious to see how the community responds.
In addition to the build challenges, we have made a few operational changes to minimize headaches. These are not indictments of the container systems nor the features themselves, but we have had success with the following:
1. No Layers
After a few years with containers, I've decided that layers are an operational misfeature. They are fantastic during development. It is great to cache the last few steps of your build process, and to be able to run new versions of your code without having to pull down the whole image every time. The issue is that multiple layers in your deployment image mean unneeded complexity. The recent addition of the --squash option to docker build signals that others have discovered the same thing.
2. User Namespaces
User namespaces are a fantastic security feature. If someone breaks out of namespace isolation, they are mapped to a user with no permissions on the host. In general, it is difficult to enable user namespaces because user ownership in the filesystem cannot be mapped properly. Docker has been working on this for a while. They recently added the ability to share a single user namespace for all containers, but it requires a config change that is rarely used. Ideally, each container would have its own unique user namespace. Unfortunately, this requires either read-only images or kernel support for mapping filesystem ids.
3. No Overlayfs
Overlayfs (and its predecessor unionfs) has been a source of countless issues in production. If you are launching many containers from the same image, copying the source image into each container's directory can be a huge waste of space, so the goals for overlays make sense. While we would love to use overlays in the future, we have hit sufficient operational issues that we avoid them for now. In addition, an inability to map filesystem ids means they don't help with unique user namespaces.
4. No Image Repository
Pulling down images has been a source of many issues. Network problems, incorrect images, and layer corruption have plagued us. We have enjoyed much more operational success by simply deploying the necessary image to a node out of band. This also allows us to ensure that all nodes have the most recent version of an image in the case of a security vulnerability.
We enable these features by shrinking the deployment artifact into a read-only container that is small enough to deploy manually. Because the filesystem is read-only, we can enable user namespaces and avoid layers and overlays.
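To make the user namespace point concrete, here is a minimal sketch of how an id inside a namespace translates to an id on the host. It assumes the three-column format of /proc/<pid>/uid_map (first id inside the namespace, first id on the host, length of the range). The function name and the two mappings are hypothetical and chosen only for illustration; they are not taken from our setup.

```python
from typing import List, Optional, Tuple

# Each tuple mirrors one line of /proc/<pid>/uid_map:
# (first id inside the namespace, first id on the host, length of the range)
IdMap = List[Tuple[int, int, int]]


def to_host_id(container_id: int, id_map: IdMap) -> Optional[int]:
    """Translate an id inside the namespace to a host id, or None if unmapped."""
    for inside, outside, length in id_map:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None


# Hypothetical mappings: each container gets its own 65536-id range on the host.
container_a: IdMap = [(0, 100000, 65536)]
container_b: IdMap = [(0, 165536, 65536)]

print(to_host_id(0, container_a))      # 100000: root in A is unprivileged on the host
print(to_host_id(0, container_b))      # 165536: root in B maps to a different host id
print(to_host_id(70000, container_a))  # None: outside the mapped range
```

With a unique range per container, a process that escapes one container does not share a host id with any other container. Files on disk carry a single owner id, which is why a shared writable image is hard to reconcile with per-container mappings.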
In order to solve some of the issues we saw with containers in production, we created a new concept that we are calling a "microcontainer". Note that this is not a new container format, but simply a specific method for constructing a container that allows for better security and stability. Specifically, a microcontainer:
In addition to these principles, we adopt some file location conventions that make microcontainers easier to run in practice. Software packages install config files in various locations and write files in other locations. We put all temporary files (like pid files) into /run and all persistent writes (log files and data files) into /write. We put config files that need to be modified per container in /read where they can be modified via a volume mount or a kubernetes configmap.
Implementing these principles and conventions allows us to solve some of the major security and stability issues that contribute to operational headaches.
If you are building a statically linked Go binary and using scratch as your base, you are probably already building a microcontainer. For other software, building microcontainers can be tricky. For that reason, we have open sourced smith, the tool that we use to build microcontainers. The tool can be used to build a microcontainer from yum repositories and (optionally) an RPM file.
In addition, it can be used to "microize" an existing Docker container, so you can use the developer-friendly Docker tools to build your container while you are in development, then transform it into a microcontainer for your production deployment. Smith builds images in the standard OCI format, but it can also upload and download images from Docker repositories.
There are various ways to get config files into /read and output files into /write. Sometimes these can be set via command line parameters, but one easy solution is to symlink the expected location into the proper directory. For example, you could symlink /var/log/appname to /write/log.
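The symlink approach can be sketched as follows. This is an illustration only, not part of smith: the helper name `link_into_write` and the temporary staging directory are assumptions. It creates the link inside a staging root filesystem before the image is built. The link target is an absolute path that only resolves inside the running container, once a volume is mounted at /write.

```python
import os
import tempfile


def link_into_write(rootfs: str, expected: str, target: str) -> str:
    """Create rootfs/<expected> as a symlink that points at <target>."""
    link_path = os.path.join(rootfs, expected.lstrip("/"))
    # Make sure the parent directory (for example var/log) exists in the staging tree.
    os.makedirs(os.path.dirname(link_path), exist_ok=True)
    # The target is left dangling at build time; it resolves inside the container.
    os.symlink(target, link_path)
    return link_path


# Hypothetical staging directory standing in for the image's root filesystem.
rootfs = tempfile.mkdtemp(prefix="rootfs-")
link = link_into_write(rootfs, "/var/log/appname", "/write/log")
print(link, "->", os.readlink(link))
```

The application keeps writing to /var/log/appname, and the writes land in the mounted /write volume while the rest of the filesystem stays read-only.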
While smith can discover any linked dependencies of your executable, it will not automatically pick up libraries that are opened via dlopen, nor will it automatically pull in configuration and data files. In practice it is generally pretty easy to figure out what other files must be included. It generally takes me no more than an hour to package an existing application into a microcontainer. The payoff at the end is that your image is more secure and can be as much as 10x smaller than one produced using other techniques.
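As a rough illustration of what discovering linked dependencies involves, the sketch below parses the text output of `ldd` into a map of library names to paths. It is not how smith works internally; the function name and the sample output are assumptions made for the example. Libraries loaded at runtime via dlopen never appear in this output, which is why they have to be added by hand.

```python
import re
from typing import Dict

# "libssl.so.10 => /lib64/libssl.so.10 (0x...)": a library resolved to a path
_RESOLVED = re.compile(r"^\s*(\S+)\s+=>\s+(/\S+)")
# "/lib64/ld-linux-x86-64.so.2 (0x...)": the dynamic loader, listed by absolute path
_ABSOLUTE = re.compile(r"^\s*(/\S+)\s+\(")


def parse_ldd(output: str) -> Dict[str, str]:
    """Map library name to on-disk path for every file ldd resolved."""
    libs: Dict[str, str] = {}
    for line in output.splitlines():
        resolved = _RESOLVED.match(line)
        if resolved:
            libs[resolved.group(1)] = resolved.group(2)
            continue
        absolute = _ABSOLUTE.match(line)
        if absolute:
            path = absolute.group(1)
            libs[path.rsplit("/", 1)[-1]] = path
        # Anything else (such as the kernel-provided vdso) has no file to copy.
    return libs


# Illustrative sample only; real output depends on the binary and the host.
SAMPLE = (
    "\tlinux-vdso.so.1 (0x00007ffd00000000)\n"
    "\tlibssl.so.10 => /lib64/libssl.so.10 (0x00007f0000000000)\n"
    "\tlibc.so.6 => /lib64/libc.so.6 (0x00007f0000200000)\n"
    "\t/lib64/ld-linux-x86-64.so.2 (0x00007f0000400000)\n"
)

deps = parse_ldd(SAMPLE)
for name in sorted(deps):
    print(name, "->", deps[name])
```

To try it on a real binary, the output of `subprocess.run(["ldd", path], capture_output=True, text=True).stdout` could be passed to `parse_ldd`. A list like this also answers the earlier question of whether the application actually links against a given library.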
Many software packages depend on basic utilities to function properly. It is best to keep all other utilities (bash, grep, awk, etc.) out of a microcontainer, but at times the executable you are trying to run is actually a shell script that does some environment setup before running the actual application. In some cases, the effort required to remove these dependencies is too great, and you are forced to include a few of these utilities in your container. In general, every rule can be broken. It is always a good idea to be practical. Just make sure you have a good reason for everything you include in your container, since it will become something you have to maintain.
Running a container in a unique user namespace requires manual setup via something like runc, but you can gain some of the other benefits using Docker directly. Docker can't yet import OCI images, so you'll have to push a microcontainer image built with smith to a Docker repository like Docker Hub before you can run it.
To get the full benefits of the microcontainer, run the container in read-only mode:
docker run -ti --read-only --tmpfs /run -v /my/write/directory:/write my-microcontainer
Microcontainers can be an excellent way to improve the security of your system. If you're building them via smith, the id of the container will only change if the application bits themselves change. This means you can set up an automated build to alert you if you need to redeploy your containers. They do have one drawback though, which is that they are harder to debug. You don't have any tools in the container you can access via docker exec. For this reason we also have released another tool called crashcart to help with debugging.
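The idea that the id changes only when the application bits change can be illustrated with a content digest that ignores timestamps and ownership. The sketch below is an assumption for illustration and is not the algorithm smith or the OCI format uses to compute image ids; `tree_digest` and the temporary directory are hypothetical.

```python
import hashlib
import os
import tempfile


def tree_digest(root: str) -> str:
    """Digest of paths, symlink targets, and file contents; ignores mtime and owner."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix the traversal order so the digest is stable
        for name in sorted(dirnames + filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).encode()
            if os.path.islink(path):
                h.update(b"L\0" + rel + b"\0" + os.readlink(path).encode() + b"\0")
            elif os.path.isdir(path):
                h.update(b"D\0" + rel + b"\0")
            else:
                with open(path, "rb") as f:
                    content = hashlib.sha256(f.read()).digest()
                h.update(b"F\0" + rel + b"\0" + content)
    return h.hexdigest()


# Hypothetical root filesystem containing a single application file.
root = tempfile.mkdtemp(prefix="rootfs-")
app = os.path.join(root, "app")
with open(app, "wb") as f:
    f.write(b"version 1")
before = tree_digest(root)

os.utime(app, (0, 0))  # change only the timestamps
after_touch = tree_digest(root)

with open(app, "wb") as f:
    f.write(b"version 2")  # change the application bits
after_edit = tree_digest(root)

print("timestamp change alters digest:", before != after_touch)
print("content change alters digest:", before != after_edit)
```

An automated build can compare the digest of a fresh build against the one that is deployed and raise an alert only when they differ.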
(Edit 6/30/2017: An earlier version of this article failed to include recent additions to docker and incorrectly presented operational changes as issues)