Oracle AI & Data Science Blog
Learn AI, ML, and data science best practices

Using Docker Containers For Data Science Environments

Containers are a lightweight alternative to traditional virtual machines. Because they share the host's operating system kernel instead of running a full guest operating system, they don’t take up large amounts of space on your server, they are easy to create and destroy, and they are fast to boot up. They also make creating repeatable data science environments easy.

For a data scientist, running a container that is already equipped with the libraries and tools needed for a particular analysis eliminates the need to spend hours debugging packages across different environments or configuring custom environments. It’s why at DataScience.com, we use Docker containers for a variety of applications in our platform, such as allowing our users to launch isolated Jupyter and RStudio sessions outfitted with their choice of libraries and tools.  

What is a Container?

If you find yourself on Docker’s website, you’ll see that a container is “a standardized unit of software.” But what does that mean, exactly? (Docker isn’t the only name in the container ecosystem, by the way. You might also come across Kubernetes, which orchestrates containers across many machines, as well as Cloud Foundry and others in your search.)

A container is pretty much what it sounds like: It contains things, and in this case, a software container contains the code, frameworks, and libraries needed to run a software application. Because it contains only these things and relies on the host's kernel rather than bundling an entire operating system, it’s small compared with a virtual machine; that means you can run many containers on one host. It also means you have peace of mind when you run that software: everything the application needs is already in that container.

What’s really important, however, is the standardization and efficiency that containers provide. Rather than building a new environment for every analysis, your IT team can put the tools and packages required for certain types of analyses (e.g., scikit-learn, TensorFlow, and Jupyter) into a container, create an image of that container, and have every user boot up an isolated, standardized environment from that image.
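To make the idea of packaging tools into an image more concrete, here is a minimal sketch in Python that renders the text of a Dockerfile for a Jupyter environment. The function name `render_dockerfile`, the base image `python:3.11-slim`, the package list, and port 8888 are illustrative assumptions, not how the DataScience.com Platform builds its images.

```python
from typing import Sequence


def render_dockerfile(base_image: str, packages: Sequence[str], port: int = 8888) -> str:
    """Return Dockerfile text for a Jupyter environment with the given packages.

    Hypothetical helper for illustration; base image and packages are assumptions.
    """
    if not packages:
        raise ValueError("at least one package is required")
    lines = [
        # Start from an existing image that already provides Python and pip.
        f"FROM {base_image}",
        # Install all packages in one layer; --no-cache-dir keeps the image smaller.
        "RUN pip install --no-cache-dir " + " ".join(packages),
        # Document the port the notebook server listens on.
        f"EXPOSE {port}",
        # Exec-form CMD: start Jupyter listening on all interfaces in the container.
        f'CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port={port}", '
        '"--no-browser", "--allow-root"]',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dockerfile("python:3.11-slim", ["scikit-learn", "tensorflow", "jupyter"]))
```

Saving the printed text as a file named Dockerfile and building it would give every user the same starting environment. In practice you would probably pin package versions (for example `scikit-learn==<version>`) so that rebuilding the image later produces the same environment.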

Wait, What’s an Image?

An image is essentially a read-only template for containers: a snapshot of a container's filesystem and configuration at a particular point in time. Every running container was started from an image, and you can snapshot an existing container to create a new image, although images are more commonly built from a Dockerfile. You can also launch as many containers as you want from the same image. Got it?
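The relationship between images and containers can be sketched with the Docker command line. The Python below only assembles the argument lists for `docker build`, `docker run`, and `docker commit`; the helper names, the image tag `ds-env:latest`, the `session-<user>` container names, and the port numbering are assumptions made for illustration.

```python
import shlex
import subprocess
from typing import List, Sequence


def build_cmd(image: str, context_dir: str = ".") -> List[str]:
    # Build an image from the Dockerfile in context_dir and tag it.
    return ["docker", "build", "-t", image, context_dir]


def run_cmd(image: str, name: str, host_port: int, container_port: int = 8888) -> List[str]:
    # Start a detached, named container from the image and publish one port.
    return ["docker", "run", "-d", "--name", name,
            "-p", f"{host_port}:{container_port}", image]


def commit_cmd(container: str, new_image: str) -> List[str]:
    # Snapshot an existing container's current state as a new image.
    return ["docker", "commit", container, new_image]


def plan_sessions(image: str, users: Sequence[str], first_port: int = 8888) -> List[List[str]]:
    """One build, then one isolated container per user, all from the same image."""
    cmds = [build_cmd(image)]
    for offset, user in enumerate(users):
        cmds.append(run_cmd(image, f"session-{user}", first_port + offset))
    return cmds


def execute(cmds: Sequence[Sequence[str]]) -> None:
    # Runs the commands for real; requires Docker to be installed and running.
    for cmd in cmds:
        subprocess.run(list(cmd), check=True)


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for cmd in plan_sessions("ds-env:latest", ["ana", "raj"]):
        print(shlex.join(cmd))
    print(shlex.join(commit_cmd("session-ana", "ds-env:ana-snapshot")))
```

Each user gets a separate container on a separate host port, but all of them start from the identical image, which is where the standardization comes from.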

Registries like Docker Hub contain hundreds of thousands of images that can be downloaded for free. That means there is very likely an image out there containing the tools you need for your particular analysis.

If you’re working in the DataScience.com Platform, finding an image that has the tools you require is as easy as selecting the appropriate one from a dropdown menu when you launch your environment. We’ve created a number of pre-baked images for deep learning, natural language processing, and other data science techniques for this purpose that can be used in RStudio and Jupyter sessions on our platform.

Why Set Up a Data Science Environment in a Container?

One reason is speed. We want data scientists using our platform to launch a Jupyter or RStudio session in minutes, not hours. We also want them to have that fast user experience while still working in a governed, central architecture (rather than on their local machines). The process of getting an environment up and running varies from company to company, but in some cases, a data scientist must submit a formal request to IT and wait for days or weeks, depending on the backlog. That puts a burden on both groups.

Containerization benefits both data science and IT/technical operations teams. In the DataScience.com Platform, for instance, we allow IT to configure environments with different languages, libraries, and settings in an admin dashboard and make those images available in the dropdown menu when a data scientist launches a session. These environments can be selected for any run, session, scheduled job, or API. (Or you don’t have to configure anything at all. We provide plenty of standard environment templates to choose from.)

Ultimately, containers solve a lot of common problems associated with doing data science work at the enterprise level. They take the pressure off IT to produce custom environments for every analysis, standardize how data scientists work, and help ensure that old code doesn’t stop running because of environment changes. To start using containers and our library of curated images to do collaborative data science work, request a demo of our platform today.
