How Docker Can Help You Become a Data Scientist

1. Description

        Over the past five years, I've heard a lot of buzz about Docker containers. It seems like all my software engineering friends are using them to develop applications. I wanted to figure out how this technology could make me more productive, but I found the tutorials online were either too detailed (clarifying features I'll never use as a data scientist) or too superficial (not giving me enough information to understand how to use Docker quickly and effectively).

        I wrote this quickstart so you don't have to parse all that information yourself and can learn just what you need to get started quickly.

2. What is Docker?

        You can think of Docker as a lightweight virtual machine that contains everything needed to run an application. A Docker container can capture a snapshot of the state of your system so that others can quickly recreate your computing environment. That's all you need to know for this tutorial, but for more details you can head here.

3. Why use docker?

  1. Reproducibility: As a professional data scientist, it is very important that your work is reproducible. Reproducibility not only facilitates peer review, it ensures that the models, applications, or analyses you build can run without friction, making your deliverables more robust and time-tested. For example, if you've built a model in Python, it's often not enough to run pip freeze and send the resulting requirements.txt file to your colleagues, as that only encapsulates Python-specific dependencies. There are usually dependencies that live outside Python, such as the operating system, compilers, drivers, configuration files, or other data required for your code to run successfully. Even when you can get away with sharing just the Python dependencies, wrapping everything in a Docker container relieves others of the burden of recreating your environment and makes your work more accessible.
  2. Portability of your computing environment: As a data scientist, especially in machine learning, being able to change computing environments quickly can greatly affect your productivity. Data science work often begins with prototyping, exploration, and research, work that doesn't necessarily require dedicated computing resources and usually takes place on a laptop or personal computer. However, there often comes a point where different computing resources would greatly speed up your workflow, such as a machine with more CPUs or a more powerful GPU for things like deep learning. I see many data scientists restricting themselves to their local computing environment because of the friction of recreating it on a remote machine. Docker makes the process of porting your environment (all of your libraries, files, etc.) very easy. Rapidly porting your computing environment is also a huge competitive advantage in Kaggle competitions, since you can cost-effectively leverage precious computing resources on AWS. Finally, creating a Dockerfile allows you to carry along many of your favorite local conveniences, such as bash aliases or vim plugins.
  3. Strengthen your engineering skills: Familiarity with Docker allows you to deploy your model or analysis as an application (for example, as a REST API endpoint that serves predictions), making your work accessible to others. Furthermore, other applications you may need to interact with in your data science workflow may live in Docker containers, such as databases.

4. Docker Terminology

Before we dive in, it's helpful to be familiar with Docker terminology:

  • Image: the blueprint for what you want to build. Example: Ubuntu + TensorFlow with Nvidia drivers and a running Jupyter server.
  • Container: an instance of an image that you bring to life. You can run multiple copies of the same image. It is very important to grasp the difference between an image and a container, as this is a common source of confusion for newcomers. If the difference isn't clear, stop and read again (a short example follows this list).
  • Dockerfile: the recipe for creating an image. A Dockerfile contains special Docker syntax. From the official docs: "A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image."
  • Commit: Like git, Docker containers offer version control. You can save the state of your Docker container as a new image at any time by committing your changes.
  • DockerHub/Image Registry : A place where people can publish public (or private) docker images to facilitate collaboration and sharing.
  • Layer : A modification to an existing image, represented by instructions in a Dockerfile. Layers are applied sequentially to the base image to create the final image.
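        To make the image/container distinction concrete, here is a minimal sketch using standard Docker CLI commands (the container names c1 and c2 are illustrative): one image can back many independent containers.

docker pull ubuntu:16.04                              # download one image
docker run -d --name c1 ubuntu:16.04 sleep infinity   # first container from that image
docker run -d --name c2 ubuntu:16.04 sleep infinity   # a second, independent container
docker ps                                             # both containers appear, backed by the same image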

        I'll use these terms throughout the rest of the post, so refer back to this list if you get lost! It's easy to mix these terms up, especially images and containers, so be mindful as you read!

5. Create your first Docker image

        Before creating a Docker container, it is useful to create a Dockerfile that defines the image. Let's walk through the Dockerfile below slowly. This file can be found in the Github repository that accompanies this tutorial.

# reference: https://hub.docker.com/_/ubuntu/
FROM ubuntu:16.04

# Add metadata to the image as key/value pairs, for example LABEL version="1.0"
LABEL maintainer="Hamel Husain <[email protected]>"

# Set environment variables
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8

RUN apt-get update --fix-missing && apt-get install -y wget bzip2 ca-certificates \
    build-essential \
    byobu \
    curl \
    git-core \
    htop \
    pkg-config \
    python3-dev \
    python-pip \
    python-setuptools \
    python-virtualenv \
    unzip \
    && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*

RUN echo 'export PATH=/opt/conda/bin:$PATH' > /etc/profile.d/conda.sh && \
    wget --quiet https://repo.continuum.io/archive/Anaconda3-5.0.0.1-Linux-x86_64.sh -O ~/anaconda.sh && \
    /bin/bash ~/anaconda.sh -b -p /opt/conda && \
    rm ~/anaconda.sh

ENV PATH /opt/conda/bin:$PATH

RUN pip --no-cache-dir install --upgrade \
        altair \
        sklearn-pandas

# Open Ports for Jupyter
EXPOSE 7745

# Set up the file system
RUN mkdir ds
ENV HOME=/ds
ENV SHELL=/bin/bash
VOLUME /ds
WORKDIR /ds
ADD run_jupyter.sh /ds/run_jupyter.sh
RUN chmod +x /ds/run_jupyter.sh

# Run a shell script
CMD  ["./run_jupyter.sh"]

5.1 FROM statement

FROM ubuntu:16.04

        The FROM statement encapsulates the most magical part of Docker. It specifies the base image you want to build on. After you specify a base image with FROM, Docker looks for an image named ubuntu:16.04 in your local environment; if it can't find one locally, it searches the Docker registry you specified, which by default is DockerHub. This layering mechanism is convenient because you often want to install programs on top of an operating system like Ubuntu. Instead of worrying about how to install Ubuntu from scratch, you can simply build on top of the official Ubuntu image! There is a wide variety of Docker images hosted on DockerHub, including images that offer more than just an operating system; for example, if you want a container with Anaconda already installed, you can build on top of the official Anaconda Docker image. Best of all, you can publish an image you build at any time, even if it was created by layering on top of another image! The possibilities are endless.

        In this example, we specify that our base image is ubuntu:16.04, which tells Docker to look in a DockerHub repository called ubuntu. The part of the image name after the colon, 16.04, is a tag that lets you specify which version of the base image you want to install. If you navigate to the Ubuntu DockerHub repository, you'll notice that different versions of Ubuntu correspond to different tags.

        For example, at the time of writing, ubuntu:16.04, ubuntu:xenial-20171201, ubuntu:xenial, and ubuntu:latest all refer to Ubuntu version 16.04 and are all aliases for the same image. Additionally, the links provided in that repository take you to the Dockerfile used to build the image for each release. You won't always find Dockerfiles in a DockerHub repository, as maintainers can choose whether to include the Dockerfile showing how they made the image. I personally find it useful to look at a few of these Dockerfiles to learn more about writing Dockerfiles (but wait until you finish this tutorial!).

        One tag deserves special mention: the :latest tag. This tag specifies what will be pulled by default if you don't specify a tag in the FROM statement. For example, if the FROM statement looks like this:

FROM ubuntu

        then you end up pulling the ubuntu:16.04 image. Why? If you look at the tag list on DockerHub, you'll see that the latest tag is associated with 16.04.

        One last note on Docker images: use good judgment when pulling random Docker images from DockerHub. Docker images created by malicious actors may contain malware.

5.2 LABEL statement

        This statement adds metadata to the image and is completely optional. I added it so others know who to contact about the image, and also so I can search my docker containers, especially if there are many containers running on the server at the same time.

LABEL maintainer="Hamel Husain <youremail>"

5.3 ENV statement

ENV LANG=C.UTF-8 LC_ALL=C.UTF-8

        This allows you to set environment variables and is very straightforward. You can read more about this here.
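        One related point worth knowing: environment variables baked in with ENV can be supplemented or overridden at run time via docker run's standard -e flag. A quick illustration (MY_VAR is a made-up example variable, and the image name is the one built later in this tutorial):

# -e sets an environment variable for this container only
docker run -it -e MY_VAR=hello hamelsmu/tutorial:v1 bash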

5.4 RUN statement

        This is usually the workhorse that does the work needed to build your Docker image. You can run arbitrary shell commands, such as apt-get and pip install, to install the packages and dependencies you need.

RUN apt-get update --fix-missing && apt-get install -y wget bzip2 \
    build-essential \
    ca-certificates \
    git-core \
...

        In this case, I'm installing some of my favorite utilities like curl, htop, and byobu, then Anaconda, followed by other libraries that aren't in the base Anaconda install (scroll up to the full Dockerfile to see all the RUN statements).

        The commands after a RUN statement have nothing to do with Docker; they are the ordinary Linux commands you would run if you were installing these packages yourself, so don't worry if you aren't familiar with some of them. As a further suggestion: when I first started learning Docker, I looked at other Dockerfiles on Github or DockerHub and copied the relevant parts into my own Dockerfile.

        One thing you might notice about the RUN statement is the formatting. Each package is neatly indented and alphabetized for readability. This is a common Dockerfile convention, so I recommend adopting it, as it will simplify collaboration.

5.5 EXPOSE statement

        This statement is useful if you're trying to expose a port, for example if you're serving Jupyter notebooks from inside the container or running some kind of web service. Docker's documentation explains the EXPOSE statement quite well:

The EXPOSE instruction does not actually publish the port. It functions as a type of documentation between the person who builds the image and the person who runs the container, about which ports are intended to be published. To actually publish the port when running the container, use the -p flag on docker run to publish and map one or more ports, or the -P flag to publish all exposed ports and map them to high-order ports.
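        Concretely, to actually publish the Jupyter port exposed in this Dockerfile, you map it with -p when you run the container (choosing host port 7745 here is illustrative; any free host port works):

# map host port 7745 to container port 7745
docker run -p 7745:7745 hamelsmu/tutorial:v1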

5.6 VOLUME statement

VOLUME /ds 

        This statement allows you to share data between your Docker container and the host. The VOLUME statement lets you mount externally mounted volumes. The host directory is declared only when the container is run (because you might run the container on different machines), not when the image is defined.* For now, you only specify the name of the directory inside the Docker container that you want to share with the host.

        From the docker user guide:

*The host directory is declared at container run-time: The host directory (the mountpoint) is, by its nature, host-dependent. This is to preserve image portability, since a given host directory can't be guaranteed to be available on all hosts. For this reason, you can't mount a host directory from within the Dockerfile. The VOLUME instruction does not support specifying a host-dir parameter. You must specify the mountpoint when you create or run the container.

        Furthermore, these volumes are designed to persist data outside of the container's filesystem, which is often useful if you're working with large amounts of data that you don't want to bloat your Docker image with. When you save a Docker image, any data in the VOLUME directory will not be saved as part of the image, but data elsewhere in the container will be.
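        To make this concrete, the host directory is supplied at run time with the standard -v flag (the host path below is illustrative):

# mount ~/my_project on the host as /ds inside the container
docker run -v ~/my_project:/ds hamelsmu/tutorial:v1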

5.7 WORKDIR statement

WORKDIR /ds

        This statement sets the working directory for any subsequent command that references a file without an absolute path. For example, the last statement in the Dockerfile is

CMD ["./run_jupyter.sh"]

which assumes the working directory is /ds.

5.8 ADD statement

        EDIT 8/24/2020: You should now use the COPY statement instead of the ADD statement. 

ADD run_jupyter.sh /ds/run_jupyter.sh

        This command lets you copy files from the host machine into the Docker image when the image is built. I use it to add bash scripts and other useful things, such as .bashrc files, to the container.

        Note that the host path is not fully specified here, because it is interpreted relative to the build context directory that you specify when the image is built (discussed later).

        It just so happens that when I build this image, the file run_jupyter.sh sits in the root of the build context directory, which is why there is no path in front of the source file.

        From the user guide:

ADD <src>... <dest>

The ADD instruction copies new files, directories or remote file URLs from <src> and adds them to the filesystem of the image at the path <dest>.
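        Following the 2020 edit above, the equivalent line using COPY (which has the same source/destination syntax for local files) would be:

COPY run_jupyter.sh /ds/run_jupyter.sh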

5.9 CMD statement

        Docker containers are designed to be ephemeral, staying around only long enough to finish running the application they're meant to run. However, for data science we often want to keep a container running even when nothing is actively running inside it. One way many people accomplish this is by simply running a bash shell (which doesn't terminate unless you kill it).

CMD ["./run_jupyter.sh"]

        In the command above, I'm running a shell script that starts a Jupyter notebook server. However, if you don't have a specific application to run but want the container to keep running without exiting, you can simply run the bash shell with:

CMD ["/bin/bash"]

        This works because the bash shell doesn't terminate until you exit, so the container keeps running normally.

        From the user guide:

There can only be one CMD instruction in a Dockerfile. If you list more than one CMD, then only the last CMD will take effect.

The main purpose of a CMD is to provide defaults for an executing container. These defaults can include an executable, or they can omit the executable, in which case you must specify an ENTRYPOINT instruction as well.
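        The contents of run_jupyter.sh live in the companion repository and aren't reproduced in this post. A minimal sketch of such a script, assuming the port exposed in this Dockerfile (the real script also configures the password tutorial), might look like:

#!/bin/bash
# Minimal sketch, not the exact script from the repo: start a Jupyter
# notebook server on the exposed port, listening on all interfaces so
# it is reachable from outside the container.
jupyter notebook --no-browser --ip=0.0.0.0 --port=7745 --allow-root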

6. Build your Docker image

        Don't worry, from here on out everything else is fairly simple. Now that we have created our recipe as a Dockerfile, it's time to build the image. You can do this with the following command:
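        A representative build command, run from the directory containing the Dockerfile, looks like this (the image name and tag are yours to choose):

# -t names and tags the image; the trailing "." makes the current directory
# the build context, which is where ADD/COPY sources are resolved from
docker build -t hamelsmu/tutorial:v1 .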

(Also available on Github.)

This will build a Docker image (not a container; reread the terminology at the beginning of this article if you don't remember the difference!), which you can then run later.

7. Create and run a container from a Docker image

Now, you're ready to put all that magic into practice. We can start this environment by executing the following command:
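        A representative invocation that publishes the Jupyter port and mounts a local directory into /ds looks like this (the container name and host path are illustrative):

docker run -it --name container1 -p 7745:7745 -v ~/my_project:/ds hamelsmu/tutorial:v1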

        (Also available on Github.)

        After running this command, your container will be up and running! The Jupyter server will be running because of the

CMD ["./run_jupyter.sh"]

        command at the end of the Dockerfile. You should now be able to access the Jupyter server on the port it is being served from; in this example it should be accessible at http://localhost:7745/ with the password tutorial. If you're running this Docker container remotely, you'll have to set up local port forwarding so you can access the Jupyter server from your browser.
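        For example, a standard way to set up that forwarding over ssh from your local machine (the hostname is illustrative):

# forward local port 7745 to port 7745 on the remote host running the container
ssh -NfL 7745:localhost:7745 user@remote-host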

8. Interact with the container

Once the container is up and running, the following commands will come in handy:

  • Attach a new terminal session to the container . This is useful if you need to install some software or use a shell.
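The standard way to do this is docker exec with an interactive terminal:

# open a bash session inside a running container
docker exec -it <container_name> bash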

  • Save the state of the container as a new image. Even if you start with a Dockerfile that includes all the libraries you want to install, over time you may significantly change the state of the container by interactively adding more libraries and packages. It is useful to save the state of a container as an image that can be shared or layered on later. You can do this with the docker commit CLI command:
docker commit <container_name> new_image_name:tag_name(optional)

        For example, if I wanted to save the state of a container named container1 to an image named hamelsmu/tutorial:v2, I would simply run the following command:

docker commit container1 hamelsmu/tutorial:v2

        You might wonder why hamelsmu/ is in front of the image name: this makes it easier to push the image to DockerHub later, since hamelsmu is my DockerHub username (more on that below). If you use Docker at work, you most likely have an internal private Docker registry you can push images to instead.

  • List running containers . I use it a lot when I forget the name of the currently running container.
docker ps -a -f status=running 

If you run the above command without the status=running filter, you'll see a list of all the containers on your system (even ones that are no longer running). This is useful for tracking down old containers.

  • List all images that have been saved locally .
docker images 
  • Push your image to DockerHub (or other registry). This is useful if you want to share your work with others, or conveniently save images in the cloud. Be careful not to share any private information when doing this (there are also private repositories on DockerHub).

        Start by creating a repository on DockerHub and naming your image appropriately, as described here. This involves running the docker login command first to connect to your account on DockerHub or another registry. For example, to push an image to this repository, I first have to name my local image hamelsmu/tutorial (I can choose any tag name). Then I run the CLI command:

docker push hamelsmu/tutorial:v2 

        This pushes the above Docker image to this repository with the tag v2. It should be noted that if you make your image publicly available, others can simply layer on top of your image, just as we added layers on top of the ubuntu image in this tutorial. This is very useful for others looking to reproduce or extend your research.

9. Now you have superpowers

Now that you know how to operate Docker, you can perform the following tasks:

  • Share reproducible research with colleagues and friends.
  • Win Kaggle competitions without breaking the bank by temporarily moving code to larger computing environments as needed.
  • Prototype locally inside a docker container on your laptop, then seamlessly move the same computation to a server without breaking a sweat, while taking many of your favorite local environments with you (your aliases, vim plugins, bash scripts, custom prompts, etc.).
  • Use  Nvidia-Docker  to quickly instantiate all dependencies needed to run Tensorflow, Pytorch, or other deep learning libraries on a GPU machine (which can be a pain if you're doing this from scratch). See the bonus section below for more information.
  • Publish a model as an application, for example as a REST API that serves predictions from a Docker container. When your application is Dockerized, it can be replicated as many times as needed.

10. Extended reading

        We've only scratched the surface of Docker; there's a lot more you can do. I've focused on the parts of Docker I think you'll encounter most often as a data scientist, and I hope this gives you enough confidence to start using it. Here are some resources that helped me on my Docker journey:
