An introduction and application of Docker for Data Scientists
But it works on my machine!
This is a classic meme in the tech community, especially for Data Scientists who want to ship their amazing machine-learning model, only to find that the production machine has a different operating system. Far from ideal.
However…
There is a solution, thanks to these wonderful things called containers and tools to manage them, such as Docker.
In this post, we will dive into what containers are and how you can build and run them using Docker. Containers and Docker have become an industry standard and common practice for data products, so learning these tools is a valuable addition to any Data Scientist's arsenal.
Docker is a service that helps build, run, and manage code and applications in containers.
Now you may be wondering: what is a container?
On the surface, a container is very similar to a virtual machine (VM). It is a small, isolated environment where everything is self-'contained' and can be run on any machine. The primary selling point of containers and VMs is their portability, allowing your application or model to run seamlessly on any on-premise server, local machine, or cloud platform such as AWS.
The main difference between containers and VMs is how they use the host computer's resources. Containers are a lot more lightweight, as they don't actively partition the hardware resources of the host machine. I won't delve into the full technical details here; however, if you want to understand a bit more, I have linked a great article explaining their differences here.
Docker, then, is simply a tool we use to create, manage, and run these containers with ease. It is one of the main reasons why containers have become so popular, as it enables developers to easily deploy applications and models that run anywhere.
There are three main elements we need to run a container using Docker:
- Dockerfile: A text file that contains the instructions for how to build a Docker image.
- Docker Image: A blueprint or template to create a Docker container.
- Docker Container: An isolated environment that provides everything an application or machine learning model needs to run, including dependencies and OS versions.
There are also a few other key concepts to note:
- Docker Daemon: A background process (daemon) that handles incoming requests to Docker.
- Docker Client: A shell interface that lets the user communicate with Docker through its daemon.
- DockerHub: Similar to GitHub, a place where developers can share their Docker images.
Homebrew
The first thing you should install is Homebrew (link here). It is dubbed the 'missing package manager for macOS' and is very useful for anyone coding on a Mac.
To install Homebrew, simply run the command given on their website:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Confirm Homebrew is installed by running `brew help`.
Docker
Now with Homebrew installed, you can install Docker by running `brew install docker`. Confirm Docker is installed by running `which docker`; the output shouldn't raise any errors and should look like this:
/opt/homebrew/bin/docker
Colima
The final part is to install Colima. Simply run `brew install colima` and confirm it's installed with `which colima`. Again, the output should look like this:
/opt/homebrew/bin/colima
Now you might be wondering: what on earth is Colima?
Colima is a software package that enables container runtimes on macOS. In more layman's terms, Colima creates the environment for containers to work on our system. To achieve this, it runs a Linux virtual machine with a daemon that Docker can communicate with using the client-server model.
Alternatively, you can also install Docker Desktop instead of Colima. However, I prefer Colima for a few reasons: it's free, it's more lightweight, and I like working in the terminal!
See this blog post here for more arguments in favour of Colima.
Workflow
Below is an example of how Data Scientists and Machine Learning Engineers can deploy their model using Docker:
The first step is obviously to build the amazing model. Then, you need to pin down everything used to run the model, such as the Python version and the package dependencies. The final step is to use that requirements file inside the Dockerfile.
If this seems completely abstract to you at the moment, don't worry: we will go over this process step by step!
Basic Model
Let's start by building a basic model. The following code snippet shows a simple implementation of a Random Forest classifier on the famous Iris dataset:
Dataset from Kaggle with a CC0 licence.
This file is called `basic_rf_model.py` for reference.
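Here is a minimal sketch of what `basic_rf_model.py` could look like; the random seed and the use of scikit-learn's built-in copy of the Iris dataset (rather than the Kaggle CSV) are illustrative choices:

```python
# basic_rf_model.py
# A simple Random Forest classifier trained on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the Iris dataset (sepal/petal measurements and species labels)
X, y = load_iris(return_X_y=True)

# Hold out a test set (default 25% split) to evaluate the model
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train the Random Forest classifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
predictions = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions)}")
```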
Create Requirements File
Now that we have our model ready, we need to create a `requirements.txt` file to house all the dependencies required to run our model. In this simple example, we luckily only rely on the `scikit-learn` package. Therefore, our `requirements.txt` will simply look like this:
scikit-learn==1.2.2
You can check the version running on your machine with the `pip show scikit-learn` command.
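You can also check the installed version from Python itself via the package's standard `__version__` attribute:

```python
# Print the installed scikit-learn version
import sklearn

print(sklearn.__version__)
```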
Create Dockerfile
Now we can finally create our Dockerfile!
So, in the same directory as `requirements.txt` and `basic_rf_model.py`, create a file named `Dockerfile`. Inside the `Dockerfile` we will have the following:
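Putting the instructions together, the `Dockerfile` looks like this:

```dockerfile
# Base image with Python 3.9 pre-installed
FROM python:3.9

# Who maintains this image
MAINTAINER egor@some.email.com

# Set the working directory inside the image
WORKDIR /src

# Copy the current directory's files into the image
COPY . .

# Install the model's dependencies
RUN pip install -r requirements.txt

# Run the model when the container starts
CMD ["python", "basic_rf_model.py"]
```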
Let's go over it line by line to see what it all means:
- `FROM python:3.9`: This is the base image for our image
- `MAINTAINER egor@some.email.com`: This indicates who maintains this image
- `WORKDIR /src`: Sets the working directory of the image to src
- `COPY . .`: Copies the current directory's files into the Docker working directory
- `RUN pip install -r requirements.txt`: Installs the requirements from the `requirements.txt` file into the Docker environment
- `CMD ["python", "basic_rf_model.py"]`: Tells the container to execute the command `python basic_rf_model.py` and run the model
Initiate Colima & Docker
The next step is to set up the Docker environment. First, we need to boot up Colima:
colima start
After Colima has started up, check that the Docker commands are working by running:
docker ps
It should return something like this:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
This is good and means both Colima and Docker are working as expected!
Note: the `docker ps` command lists all the currently running containers.
Build Image
Now it's time to build our first Docker image from the `Dockerfile` that we created above:
docker build . -t docker_medium_example
The `-t` flag indicates the name of the image, and the `.` tells Docker to build from the current directory.
If we now run `docker images`, we should see something like this:
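An illustrative output is shown below; the `IMAGE ID` here matches the one used in the next step, but your ID, creation time, and size will differ:

```
REPOSITORY              TAG       IMAGE ID       CREATED          SIZE
docker_medium_example   latest    bb59f770eb07   1 minute ago     1.02GB
```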
Congrats, the image has been built!
Run Container
After the image has been created, we can run it as a container using the `IMAGE ID` listed above:
docker run bb59f770eb07
Output:
Accuracy: 0.9736842105263158
That's because all it has done is run the `basic_rf_model.py` script!
Extra Information
This tutorial only scratches the surface of what Docker can do and be used for. There are many more features and commands to learn to really know Docker. A great, detailed tutorial is available on the Docker website, which you can find here.
One cool feature is that you can run the container in interactive mode and enter its shell. For example, if we run:
docker run -it bb59f770eb07 /bin/bash
You will enter the Docker container, and it should look something like this:
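A session inside the container might look like the following; the hostname in the prompt is the running container's ID (hypothetical here) and `ls` lists the files copied into `/src`:

```
root@f3a1b2c4d5e6:/src# ls
Dockerfile  basic_rf_model.py  requirements.txt
```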
We also used the `ls` command to show all the files in the Docker working directory.
Docker and containers are incredible tools for ensuring Data Scientists' models can run anywhere and anytime without issues. They do this by creating small, isolated compute environments, called containers, that hold everything the model needs to run. Docker is easy to use and lightweight, which has made it common industry practice nowadays. In this article, we went over a basic example of how you can package your model into a container using Docker. The process was simple and seamless, so it is something Data Scientists can pick up quickly.
The full code used in this article can be found on my GitHub here:
(All emojis designed by OpenMoji — the open-source emoji and icon project. License: CC BY-SA 4.0)