Docker enables efficient, consistent machine learning environment setup across local development and cloud deployment, avoiding many pitfalls of virtual machines and manual dependency management. It streamlines system reproduction, resource allocation, and GPU access, supporting portability and simplified collaboration for ML projects. Machine learning engineers benefit from using pre-built Docker images tailored for ML, allowing seamless project switching, host OS flexibility, and straightforward deployment to cloud platforms like AWS ECS and Batch, resulting in reproducible and maintainable workflows.
You're listening to Machine Learning Applied. In this episode we're gonna talk about Docker. Docker is a technology that lets you run software on your computer — really, an entire operating system on your computer, packaged up into a little package that runs
inside what we call the host operating system: a guest operating system running inside your host. So if you're developing on Windows or Mac, you can use Docker to run Ubuntu Linux (or Windows, or Mac) inside your host machine. If you're not already familiar with Docker, you might be familiar with the concept from virtual machines, or virtualization.
It's basically the same idea, but technologically different. A virtual machine also lets you run an operating system inside your operating system, Linux inside your Windows, but it differs in a few ways that matter to us as developers.
The first is that a virtual machine has limited access to host resources. Namely, it can't access the GPU, which we machine learning engineers use all the time. Another big pain with virtual machines is that you specify what resources they allocate up front.
You tell a virtual machine like VirtualBox or VMware (two popular virtualization technologies), "you're gonna use five gigs of RAM and two CPUs," in advance. It pulls those resources aside for itself, and your host machine can no longer access them, which sucks.
Even if the guest machine isn't actually using those resources, they're still allocated away from your host and can't be used. Big downfall. Docker and the whole containerization movement mitigate a lot of these difficulties through some incredible technology that's been refined over many years.
So both of those problems are mitigated. The first: Docker can access your GPU, which is fantastic. You can write and run your machine learning projects inside a Docker container that accesses your GPU, and it doesn't matter what your host operating system is, be it Windows, Mac, or Linux.
The second: Docker only uses the resources it needs from the host operating system. You can run your desktop, play VR games, use half the GPU for that, and you won't have resources pre-allocated to your Docker container that you can't then use on the host.
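As a quick sanity check — a minimal sketch, assuming you've installed NVIDIA's container runtime (nvidia-docker, now the NVIDIA Container Toolkit) on a Linux host — you can confirm a container actually sees the GPU like this:

```bash
# Run nvidia-smi inside a CUDA base image to confirm GPU passthrough.
# Assumes the NVIDIA Container Toolkit (nvidia-docker) is installed on the host.
docker run --rm --gpus all nvidia/cuda:10.1-base nvidia-smi
# If passthrough works, you'll see the same GPU table you'd get from
# running nvidia-smi directly on the host.
```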
Those are the obvious benefits of containerization, but I'm gonna step away from Docker for a bit to motivate some of the bigger, more essential benefits it provides. Let's talk about the traditional setup: you're a machine learning engineer setting up your local development environment, and you're writing everything in Python.
What do you do? First you select your operating system. The most obvious choice is Ubuntu Linux, because most machine learning packages treat Ubuntu as first class — they always support Ubuntu — and any support for Windows or Mac is sort of
an afterthought, second-class citizenship; even non-Ubuntu Linux distributions can be a mixed bag. Debian's fine, because Debian and Ubuntu have so much in common that what works on Ubuntu will usually work on Debian, but other distributions may suffer from second-class citizenship when it comes to machine learning Python packages.
So your obvious choice is Ubuntu, and you install it. It's kind of a bummer, partly because you wanna play video games, and partly because, personally, I think the Ubuntu desktop experience is subpar to Windows or Mac. Some Ubuntu-primary users might disagree, but I don't like the Ubuntu desktop experience quite as much.
And games, man, I wanna play video games. So what do you do? You dual-boot Windows and Ubuntu and you just deal with it. Okay, you have your Ubuntu environment. What's the first thing you do? You set up CUDA and cuDNN. CUDA is NVIDIA's library for GPU matrix algebra, all that math that happens in TensorFlow
and PyTorch. It happens through CUDA. CUDA is the API that lets these libraries run the fast, machine-learning-related math on the GPU; your NVIDIA graphics card runs its math through the CUDA API on behalf of the machine learning library. Then you also set up cuDNN, the neural network API that sits on top of CUDA.
CUDA is the general mathy stuff, access to the GPU; cuDNN is that plus neural-network primitives, things that make writing that kind of math for neural networks a little easier, which TensorFlow and PyTorch tap into. Well, the latest version of TensorFlow as of today,
November 7th, 2020, uses CUDA 10.1 and cuDNN 7, while the latest version of PyTorch prefers CUDA 10.2. That's kind of a bummer. How do you deal with that incompatibility? The nice thing is that PyTorch also supports 10.1; you can use an older CUDA version with PyTorch.
You just have to install the specific PyTorch build that targets that CUDA version. So you install CUDA 10.1 and cuDNN 7 on your Ubuntu operating system, then you pip install TensorFlow and the specific PyTorch build that supports CUDA 10.1, and then you pip install all the other packages your project requires: scikit-learn, pandas, NumPy, all those things.
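As a rough sketch — and these exact version pins are illustrative of what lined up around late 2020 rather than gospel — the pip side of that setup looked something like this (PyTorch publishes CUDA-specific wheels on its own index):

```bash
# Sketch of the pip side of the setup, with CUDA 10.1 / cuDNN 7 already
# installed system-wide. Version pins are illustrative, not prescriptive.
pip install tensorflow==2.3.1          # built against CUDA 10.1 + cuDNN 7
pip install torch==1.7.0+cu101 \
    -f https://download.pytorch.org/whl/torch_stable.html   # CUDA 10.1 build of PyTorch
pip install scikit-learn pandas numpy
```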
Cool. You write your project, everything's fine. Now it comes time to deploy to a server. You need to put up your machine learning model plus some web API for accessing that model from the client. In my case, the machine learning models run on a separate AWS instance, which we'll talk about in a bit,
and the server, the Python API backend, runs on another AWS instance. If you were to set things up manually, the way we're going through this process, you'd have to create two EC2 instances. You'd manually spin up an EC2 instance from an Ubuntu AMI and go through all the steps again: installing CUDA 10.1 and cuDNN 7, et cetera.
It's just a nightmare, so much work. Also, say you have a new project, a new client, and they tell you they're actually using different versions of CUDA, cuDNN, PyTorch, and TensorFlow. Now you're screwed: you can only support one CUDA version and one cuDNN version on your operating system.
Wow, okay, you're really in hot water. Let's step all the way back; that was not the right approach. Let's try a new one: Anaconda. You're still on Ubuntu, but you haven't installed CUDA and cuDNN yet. Anaconda is a Python environment manager that lets you install multiple versions of Python — say Python 2.7 and Python 3.5 — side by side.
They can coexist on the same operating system, and within each Python environment you can install specific packages. So your Python 3.5 environment wants this version of TensorFlow, that version of PyTorch, this version of CUDA, this version of cuDNN, and any other specific Python package versions you might need. Then
you've got your other client, the other project you're working on. You clock out of project A and clock into project B: you switch to your other Anaconda environment, and now you have a different version of Python, different versions of TensorFlow and PyTorch and all these things.
Now, there is a thing out there called pipenv, or pyenv — I can't remember which it is — that also lets you have multiple versions of Python and separate packages within those environments. But what it doesn't provide, and Anaconda does, is management of system installs: not just the Python packages themselves, but system-level installations.
In particular, in this example we're building, CUDA and cuDNN are system installations. You manually download them from the NVIDIA website and install them into your system; you set up some environment variables and put files into /usr/lib-type directories and so on.
That has nothing to do with Python; it's all operating-system territory. So one nice thing Anaconda provides that pipenv/pyenv doesn't is management of system-level installations, including CUDA and cuDNN, and it handles these automatically as dependencies of the specific Python packages you install.
If you install the PyTorch and TensorFlow packages through Anaconda, those system pieces come along automatically — or there's some flag or extra step in the how-to that sets up the right versions of CUDA and cuDNN they need. All right, so that's a huge improvement over our previous setup.
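Roughly — and the exact versions and channels here are just illustrative — that conda workflow looks like this:

```bash
# Illustrative conda workflow; versions and channels are examples, not pins.
conda create -n project-a python=3.7
conda activate project-a
# The cudatoolkit dependency is pulled in per-environment, so project A and
# project B can sit on different CUDA versions side by side.
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
conda install tensorflow-gpu
```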
Now we have siloed environments, including the correct versions of CUDA, cuDNN, TensorFlow, PyTorch, and any other Python packages, on a project-by-project basis. Very nice. But we still face a problem: we've gotta deploy this to the internet, to an Amazon EC2 instance. We still have to spin up an AWS instance, install Anaconda, download the code for our project, set up the configuration files and environment variables, open up ports, all these things.
We basically have to replicate the entire environment setup we performed on our local Ubuntu dev box. We have to replicate that whole process in the cloud, and every time we change some aspect of the local environment setup — I'm not talking about pip
packages, I mean the environment setup itself — we have to go to the EC2 instance and replicate that change there too. So this is a problem. Anaconda takes us halfway there, but not all the way. That's where Docker comes in. Docker is incredible technology.
Docker lets you specify, in a file, how a guest operating system will be set up. These aren't AMIs (Amazon Machine Images), and they aren't ISO image files you download from the internet, which is what you'd have used previously with VirtualBox: image identifiers that fetch some binary you start from, after which you continue the environment setup manually yourself and then save it all away as an image.
That's the old way of virtual machine management. No, these are code files — text files where you say: install Ubuntu 18.04, install CUDA 10.1, install cuDNN 7, install this TensorFlow, install that PyTorch, download this GitHub repository, make that the primary directory you'll work from,
and run this command as the main process for our project — that's the ENTRYPOINT directive. Docker running locally will look at this Dockerfile you create: a text file with code in it. There's a Dockerfile syntax you have to familiarize yourself with.
I won't say there's no learning curve; it's a pretty steep one, but it's well worth your time. You write this code file, the Dockerfile, that tells Docker exactly how to create a full-fledged guest operating system to run on your host, with all the required system dependencies, including CUDA and cuDNN,
including the operating system itself and any other OS dependencies you may need, like a Postgres database (or at least the Postgres client, which might connect to a remote database server), plus any pip packages you need to install. Then you tell it to expose some port: port 80, port 8888, port 3000.
These are common ports; they'll be the ports on which you access whatever's running on that guest operating system, be it a Python API server on FastAPI or a machine learning model exposing its results over a port. Or you skip exposing a port and dive into your guest environment's
shell, like bash, from your own terminal. The command for that is something like docker exec -it your-container bash. Now you're inside your guest Ubuntu operating system, and you can do all your coding from there, run your commands, run your machine learning scripts, and so on.
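To make that concrete, here's a minimal sketch of the kind of Dockerfile I'm describing, plus the commands to build it, run it, and shell into it. The base image tag, version pins, paths, and script name are illustrative placeholders, not the exact ones from any of my projects:

```dockerfile
# Illustrative Dockerfile; image tag, version pins, paths, and script name
# are placeholders.
# Start from an image that already has Ubuntu 18.04 + CUDA 10.1 + cuDNN 7.
FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04

# System-level dependencies.
RUN apt-get update && apt-get install -y python3 python3-pip git && \
    pip3 install --upgrade pip

# Python packages (CUDA-10.1-compatible builds).
RUN pip3 install tensorflow==2.3.1 \
    torch==1.7.0+cu101 \
    scikit-learn pandas numpy \
    -f https://download.pytorch.org/whl/torch_stable.html

# Primary working directory, project code, an open port, and the main
# process (the ENTRYPOINT directive).
WORKDIR /app
COPY . /app
EXPOSE 8888
ENTRYPOINT ["python3", "serve.py"]
```

```bash
# Build the image, run it with GPU access, then drop into a shell inside it.
docker build -t my-ml-project .
docker run -d --gpus all -p 8888:8888 --name ml my-ml-project
docker exec -it ml bash
```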
So, benefit one that Docker gives us: no operating system environment setup. You don't have to set up CUDA and cuDNN. I know I keep mentioning those two installations, but they're such a doozy; if you've been doing machine learning engineering for a while, you get where I'm coming from. Setting up CUDA and cuDNN is such a freaking pain in the butt.
It's not that the setup is hard, it's that you're always going to hit issues. Sometimes it's TensorFlow and PyTorch version compatibility and what they expect of the CUDA or cuDNN libraries, or you're starting a new project and need new versions of those libraries, or whatever the case may be — you're going to hit issues manually setting up CUDA and cuDNN yourself.
So benefit one is that you get Ubuntu with CUDA and cuDNN all set up, running inside a guest Docker container, plus all the pip packages and environment setup, et cetera. Benefit two: it's running as a guest inside your host operating system, so you get to pick your desktop environment.
You're not constrained to Ubuntu as your host operating system or desktop environment. Me personally, I'm running Docker for my machine learning projects on my Windows host, which lets me play video games and virtual reality, I don't have to dual-boot, and I still have access to my GPU through the Docker containers.
Now, a little aside. If you want to run Docker on Windows — specifically machine learning projects with access to the GPU on your Windows host — you currently have to switch to what's called the Dev Channel of Windows. You go to where you'd normally check for updates in Windows;
somewhere in there is a place to say you want to use the Dev Channel. It's basically opting into Windows beta, which is a little scary, and I won't say it doesn't come with problems, but the problems are well worth the ability to access the GPU through Docker containers.
In my opinion, anyway; I've been using this Windows Insider / Dev Channel for a good year now and have only hit a minor snag in the process. Then there are some setup instructions — maybe I'll link them in the show notes: you install an NVIDIA driver specific to this Dev Channel setup, plus NVIDIA Docker.
And now your Docker containers have pass-through access to your host system's GPU. In my case this is fantastic, because I can run background machine learning jobs in a Docker container using the GPU while I'm doing some other programming task on my host operating system for another client.
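Once that's in place, a quick way to confirm the passthrough works from inside a container — a sketch where the image tag is just an example of a CUDA-enabled PyTorch build — is:

```bash
# Check CUDA visibility from Python inside a CUDA-enabled PyTorch image.
# The image tag is illustrative; use whichever CUDA build matches your setup.
docker run --rm --gpus all pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime \
    python -c "import torch; print(torch.cuda.is_available())"
# Prints True when the Windows host GPU is reachable from the container.
```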
Now, if you're using a Mac, you're still limited, unfortunately, because newer Mac models — MacBook Pros, for example — don't come with NVIDIA graphics cards. They come with AMD graphics cards, which don't support CUDA or cuDNN, so you'll have to go through some roundabout methods to write machine learning code
that uses the GPU. There's this library called PlaidML that lets you access the GPU — I think through TensorFlow and PyTorch, possibly as a layer that sits in between, I'm not sure — and it's supposed to be GPU-brand agnostic. But you're going to have to jump through hoops if you're using a Mac.
For this reason I recommend machine learning engineers stick to PCs — Windows or Ubuntu. And recently, because you can now run Docker containers on the Windows Dev/Insider Channel with access to your GPU, I'd recommend Windows. In my opinion it's a bit better as a desktop environment than Ubuntu, and it handles some other things, software-compatibility-wise, like gaming, that Ubuntu doesn't do quite as well.
Okay. Benefit number one: end-to-end system setup on your behalf. Benefit number two: you can use whatever host you want, be it Windows, Linux, or Mac. And finally, benefit number three comes when it's time to deploy your machine learning model to the cloud. We were talking about spinning up Amazon EC2 instances and setting them all up with Anaconda and all that stuff.
You don't do that. You deploy your Dockerfile. You literally just deploy your Dockerfile and it sets up your entire end-to-end server for you, replicating your environment — code, open ports, all those things. Snap: your entire system is now replicated in the cloud at the push of a button, because you maintained a Dockerfile that manages your local Docker guest container.
Now, the way that works is this: if you're making a machine learning model that's intended to run on a long-lived server, you have a couple of options. One is Amazon's Elastic Container Service, ECS — and I'm talking specifically within the AWS stack here.
I'm most familiar with AWS and less so with Google Cloud Platform (GCP), Microsoft Azure, or the other platforms. So in AWS, you could deploy a long-lived machine learning model, something intended to always stay on, to ECS. The reason you do it there rather than EC2, the traditional deployment model for AWS
server instances, is that ECS is a Docker-native product: it's designed to take your Docker image and run a server from it. You have control over these deployments — in ECS they're called services, services run tasks, and tasks have task definitions. The task definition is where you specify: where is the Docker image located?
Is it on Docker Hub, or over here in the Elastic Container Registry, ECR? You can host your image wherever you want. Where do I find this image? What ports do I want open? What are the environment variables, secret keys and the like? You specify all of those in the
ECS UI, and then it just runs the server for you automatically, because the image — built from your Dockerfile — tells it everything it needs to know to run the server end to end.
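A task definition is just a blob of JSON; a stripped-down sketch, where the family name, image URI, sizes, and environment values are all placeholders, looks roughly like this:

```json
{
  "family": "ml-api",
  "containerDefinitions": [
    {
      "name": "ml-api",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-api:latest",
      "memory": 4096,
      "cpu": 1024,
      "portMappings": [{ "containerPort": 8888, "hostPort": 8888 }],
      "environment": [{ "name": "SECRET_KEY", "value": "..." }]
    }
  ]
}
```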
So that's ECS. Another option available to you is SageMaker. Amazon SageMaker lets you deploy trained machine learning models to a REST endpoint. It's not really Docker-centric, and it maybe lies outside the purview of this episode, but basically you train a model — in TensorFlow or MXNet, say — deploy it to SageMaker, and what you get is a running
instance of that model that takes data rows you pass it over a REST endpoint and returns the results, the inferences, the predictions. Now, what does that have to do with Docker? As far as I understand, it is compatible with Docker — I could be wrong on this. My thinking was that if you have requirements beyond
SageMaker's default dependencies, you might be able to use Docker to orchestrate provisioning that setup, but I may be wrong. Overall, ECS is kind of your go-to for deploying a Docker container. And there's one more option, for jobs that are not long-lived: they're short-lived, or at least intended to die eventually with an exit 0 in Python, or they just run a Python file and eventually it stops.
For example, training a machine learning model: typically you run python train.py, it does its thing, eventually it reaches the end of the file and that's it — the script ends and you're back in the terminal. For that, you can deploy your Docker container to AWS Batch. Batch jobs let you run one-off
container runs. You specify which file to run in the Dockerfile, via the ENTRYPOINT directive — python train.py, for example. Then you set this up on AWS Batch and kick off a job one way or another: through the AWS console, or through an API call with Boto3 in Python, or some other service can kick off the job. It runs until the job completes, then the instance comes back offline, so you're no longer being charged for the compute resources of the very expensive GPU instance the code runs on.
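Kicking a job off from code with Boto3 looks roughly like this; the job queue and job definition names are placeholders for things you'd have created in Batch beforehand:

```python
# Minimal sketch of submitting an AWS Batch job with Boto3.
# "train-queue" and "train-job-def" are placeholder names for a job queue
# and job definition already set up in Batch.
import boto3

batch = boto3.client("batch")
response = batch.submit_job(
    jobName="train-my-model",
    jobQueue="train-queue",
    jobDefinition="train-job-def",
    # Optionally override the container's command (its ENTRYPOINT/CMD):
    containerOverrides={"command": ["python", "train.py"]},
)
print(response["jobId"])  # keep this ID if you want to poll job status later
```

You'd typically call something like this from whatever backend decides a training run is needed.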
The minimum GPU AWS instance available to Batch is gonna be the p2.xlarge, which carries an NVIDIA K80 GPU. That's pretty much the minimum GPU you'd want for machine learning applications these days anyway, so it's a good floor to work with. But what Batch lets you do is specify that you'd
like to run these jobs on a spot instance if available; if one isn't available, you can either wait until one is or fall back to an on-demand instance. A spot instance is a fraction of the cost of an on-demand instance. So if you're running your machine learning job inside a Docker container on a p2.xlarge,
and it happens to nab a spot instance because you told Batch to try, it could cost you one-tenth of what a long-lived ECS service on an on-demand p2.xlarge would. So these Batch jobs are fantastic. I use them for Gnothi, and in fact I use them for long-lived
container runs, not just short-lived ones. What I do is have a listener loop, a while loop that looks for new jobs to run on behalf of users: if they run a summarization or question-answering job, or create a new entry that kicks off the book-recommendation job, it listens for any new jobs that come in from users.
If there's been nothing for 15 minutes, it does a Python exit 0, Batch sees that exit code, and shuts the whole thing down. But as long as people keep it active — as long as there's activity every 15 minutes — it never hits that exit, and the Batch job is effectively a long-lived instance, as if it were running on ECS, but with the added benefit that I can use spot instances.
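The shape of that loop — a simplified sketch where fetch_next_job and run_job are hypothetical helpers standing in for however you actually pull and process work (a database poll, an SQS queue, whatever) — is something like:

```python
# Simplified sketch of the "stay alive while there's work" loop.
# fetch_next_job() and run_job() are hypothetical helpers standing in for
# however you actually pull and process work (DB poll, SQS queue, etc.).
import sys
import time

IDLE_LIMIT = 15 * 60  # shut down after 15 minutes with no jobs
last_job_at = time.time()

while True:
    job = fetch_next_job()      # hypothetical: poll for pending user jobs
    if job:
        run_job(job)            # hypothetical: summarization, QA, book recs...
        last_job_at = time.time()
    elif time.time() - last_job_at > IDLE_LIMIT:
        sys.exit(0)             # clean exit; Batch sees 0 and tears the instance down
    else:
        time.sleep(5)           # brief pause before polling again
```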
So, Docker. Those are the three big benefits of using Docker instead of Anaconda or a standard Python installation on your local host. One: you can run any number of Docker guest containers on your host environment at a time.
They only request the resources they need, so you don't have to worry about the state or setup of your desktop development environment. You do your own thing with your computer; Docker containers come and go. You might be working on a project now and forget about it next month when you're no longer on it.
It never touched your operating system. You can play your VR games on your Windows machine and run these Docker containers in the background. Number two: they do all the setup for you, from A to Z — the installation of the operating system, of CUDA and cuDNN, of TensorFlow, PyTorch, and all the pip packages. And I haven't mentioned this yet, but there are NVIDIA Docker images on Docker Hub that you can inherit from: at the top of your Dockerfile you say FROM nvidia/cuda:10.1-cudnn7 and so on.
You can inherit from images that have all these machine learning libraries set up for you, tried and true: everything's known to work together, all the versions match, so you don't have to go through that process yourself. In fact, if you're worried about Docker's learning curve, you can pretty much do nothing in your Dockerfile.
You can get started with an almost-empty Dockerfile that just inherits from one of these images that have all the packages set up for you. Popular ones are NVIDIA's own images, which, like I said, set up CUDA and cuDNN; and the Hugging Face Transformers images, which set up TensorFlow and PyTorch versions compatible with CUDA and cuDNN, and of course with the Transformers library.
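That "almost empty" Dockerfile really can be this short; the base image tags here are examples of the kind of prebuilt images I mean, not specific recommendations:

```dockerfile
# An almost-empty Dockerfile that just inherits a prebuilt ML base image.
# Tags below are examples of the kind of image I mean, not specific pins.
FROM huggingface/transformers-pytorch-gpu
# ...or: FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04

# Add only what's specific to your project.
WORKDIR /app
COPY . /app
RUN pip install -r requirements.txt
```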
I myself have made a couple of Dockerfiles too. These are under github.com/lefnire/ml-tools — lefnire is my handle — in a dockerfiles folder, and they're kind of a kitchen-sink setup: CUDA, cuDNN, PyTorch, TensorFlow, Hugging Face Transformers, UKP Lab's Sentence-Transformers, and then a whole bunch of helper libraries — scikit-learn, pandas, NumPy, blah blah blah.
If you just want to get started with machine learning engineering on your local host and don't want to worry about anything, use one of these Dockerfiles; it sets everything up for you so you don't need to think about what you may or may not need. So, benefit one: it's siloed from your host environment.
Benefit two: it sets everything up for you, even more so than Anaconda, which is already a step above raw Python. And benefit three: as you go along maintaining your Dockerfile, your environment, and your code, it's all kept up to date. If you need new software in your guest environment to do something
new you've decided to do — say you suddenly need a database: okay, I'm gonna install Postgres 12 and pip install psycopg2 — you modify your Dockerfile. You have to, so that rerunning it installs these dependencies.
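That kind of change is just a couple of lines added to the Dockerfile; here's a sketch that installs the Postgres client and the Python driver, with the database server itself assumed to live elsewhere (another container or a managed database):

```dockerfile
# Added to the existing Dockerfile when the project grows a database dependency.
# postgresql-client gives you psql for talking to a (likely remote) server;
# psycopg2-binary is the prebuilt Python driver, so no compiler is needed.
RUN apt-get update && apt-get install -y postgresql-client && \
    pip install psycopg2-binary
```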
Well, now your Dockerfile is up to date, so if you run it in the cloud it's always up to date too. Benefit number three is that your Dockerfile is the source of truth. You can take that Dockerfile and run it all over the internet; it's portable, and it automatically sets up servers wherever those servers may be.
They could be on Amazon SageMaker, Amazon ECS, AWS Fargate, AWS Batch, or the Kubernetes offering, Amazon EKS — those are all AWS products that support Docker. Then there's Google Cloud Platform. Google invented Kubernetes, a container-orchestration project that grew up around Docker,
and I'm not gonna talk about Kubernetes here, but GCP obviously has mega first-class support for everything Docker, because like I said they're extremely involved in that world. The future is Docker. All these cloud providers, I'm telling you now, are going to move away from you running your own servers eventually, so you're going to have to learn Docker eventually.
Might as well be now, because it'll save you so much heartache on environment and software setup. So learn Docker, use Docker. Stop using Anaconda — unless you're using it inside your Docker container; there are reasons to use Anaconda, just not as the primary way to silo environments on your host.
And yeah, deploy your Docker containers to the cloud and save yourself some trouble. That's it for this episode. I'll see you guys next time.