MLA 014 Machine Learning Hosting and Serverless Deployment

Jan 17, 2021

Machine learning model deployment on the cloud is typically handled with solutions like AWS SageMaker for end-to-end training and inference as a REST endpoint, AWS Batch for cost-effective on-demand batch jobs using Docker containers, and AWS Lambda for low-usage, serverless inference without GPU support. Storage and infrastructure options such as AWS EFS are essential for managing large model artifacts, while new tools like Cortex offer open source alternatives with features like cost savings and scale-to-zero for resource management.

Resources
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd Edition)
Designing Machine Learning Systems
Machine Learning Engineering for Production Specialization
Data Science on AWS: Implementing End-to-End, Continuous AI and Machine Learning Pipelines
Amazon SageMaker Technical Deep Dive Series
Show Notes

Cloud Providers for Machine Learning Hosting

  • The major cloud service providers for machine learning hosting are Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure.
  • AWS is widely adopted due to rapid innovation, a large ecosystem, extensive documentation, and ease of integration with other AWS services, despite some features of GCP, such as TPUs, being attractive for specific use cases.

Core Machine Learning Hosting Services

1. AWS SageMaker

  • SageMaker is an end-to-end service for training, monitoring, and deploying machine learning models, including REST endpoint deployment for inference.
  • It features auto-scaling, built-in monitoring, and support for Jupyter notebooks, but it incurs at least a 40% cost premium over direct EC2 usage and is always-on, which can be costly for low-traffic applications.
  • SageMaker also provides training analytics alongside its REST endpoint deployment; a minimal deployment sketch follows this list.
  • Google Cloud offers GCP Cloud ML with similar functionality.
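As a rough illustration of the endpoint workflow, here is a minimal sketch using the SageMaker Python SDK. The S3 artifact path, IAM role, framework version, instance type, and sample payload are placeholder assumptions, not values from the episode.

```python
# Minimal sketch: deploy trained TensorFlow artifacts as a SageMaker REST endpoint.
# The S3 path, role ARN, framework version, and instance types are assumptions.
from sagemaker.tensorflow import TensorFlowModel

role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # hypothetical IAM role

model = TensorFlowModel(
    model_data="s3://my-bucket/models/model.tar.gz",  # artifacts from a training job
    role=role,
    framework_version="2.3",
)

# Spins up an always-on, auto-scalable endpoint (billed for as long as it runs).
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.large",
    # accelerator_type="ml.eia2.medium",  # optionally attach Elastic Inference instead of a full GPU
)

print(predictor.predict({"instances": [[0.1, 0.2, 0.3]]}))

# predictor.delete_endpoint()  # the endpoint keeps billing until you tear it down
```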

2. AWS Batch

  • AWS Batch allows one-off batch jobs, typically for resource-intensive ML training or infrequent inference, using Docker containers.
  • Batch supports spot instances for significant cost savings and automatically shuts down resources when jobs complete, reducing always-on costs.
  • Batch jobs can be triggered via the CLI, the console, or programmatically (e.g., with boto3; see the sketch after this list), though the service does not provide automatic deployment or monitoring functionality like SageMaker.
  • AWS Batch runs Docker-based batch jobs and leverages ECR for container image hosting.
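A minimal sketch of kicking off a containerized training job programmatically with boto3; the job queue, job definition, command, and environment values are hypothetical names used for illustration.

```python
# Minimal sketch: submit a one-off, Docker-based training job to AWS Batch via boto3.
# Job queue, job definition, command, and paths are placeholder names.
import boto3

batch = boto3.client("batch", region_name="us-east-1")

response = batch.submit_job(
    jobName="train-sentiment-model",
    jobQueue="ml-gpu-spot-queue",       # a queue backed by spot instances for cost savings
    jobDefinition="ml-training:3",      # points at a container image hosted in ECR
    containerOverrides={
        "command": ["python", "train.py", "--epochs", "10"],
        "environment": [{"name": "MODEL_DIR", "value": "/mnt/efs/models"}],
    },
)

print("Submitted job:", response["jobId"])
# The instance spins up, runs the container to completion, then shuts down.
```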

3. AWS Lambda

  • AWS Lambda provides serverless deployment for machine learning inference, auto-scaling to meet demand, and incurs costs only during actual usage, but it does not support GPU or Elastic Inference.
  • Lambda functions can mount AWS EFS for storing and loading large model artifacts, which helps manage deployment size and cold-start performance (a handler sketch follows this list).
  • Only models that can perform inference efficiently on CPU within Lambda’s memory and compute limits are suitable for this approach.
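A minimal sketch of a CPU-only Lambda handler that loads a model artifact from an attached EFS mount. The mount path, file name, joblib/scikit-learn model, and API Gateway proxy event shape are assumptions for illustration.

```python
# Minimal sketch: CPU-only inference in AWS Lambda, loading artifacts from an EFS mount.
# The mount path, model file, and event format are assumptions; EFS must be attached
# to the function for this path to exist.
import json
import joblib

MODEL_PATH = "/mnt/models/sentiment.joblib"  # hypothetical EFS mount path

# Load once per container, outside the handler, so warm invocations skip the disk read.
model = joblib.load(MODEL_PATH)

def handler(event, context):
    features = json.loads(event["body"])["features"]
    prediction = model.predict([features])[0]
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": float(prediction)}),
    }
```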

4. Elastic Inference and Persistent Storage

  • AWS Elastic Inference enables the attachment of fractional GPU resources to EC2 or SageMaker for inference workloads, driving down costs by avoiding full GPU allocation.
  • AWS EFS (Elastic File System) is used to provide persistent, shared storage for model artifacts, allowing services like Batch and Lambda to efficiently access large files without repeated downloads.
  • Because the same EFS volume can be mounted across services, a one-time job can pre-download large artifacts (such as pretrained Transformer models) that Batch and Lambda then read directly from disk; a sketch follows this list.
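As an illustration of that pattern, a one-time script run from any machine or container with the EFS volume mounted can pre-download a large pretrained model into EFS, so later Batch or Lambda jobs load it from disk instead of the internet. The model name and the /mnt/efs path are assumptions, not values from the episode.

```python
# Minimal sketch: pre-download a Hugging Face model onto a mounted EFS volume once,
# so later Batch/Lambda containers read it from disk instead of re-downloading gigabytes.
# The model name and the /mnt/efs path are placeholders.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

CACHE_DIR = "/mnt/efs/hf-cache"
MODEL_NAME = "facebook/bart-large-cnn"  # example summarization model

AutoTokenizer.from_pretrained(MODEL_NAME, cache_dir=CACHE_DIR)
AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, cache_dir=CACHE_DIR)

# Later, inside the Batch or Lambda job, point at the same cache so nothing is downloaded:
#   AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, cache_dir=CACHE_DIR)
```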

Model Optimization and Compatibility

  • Model optimizers such as ONNX (Open Neural Network Exchange) and Intel’s OpenVINO can compress and optimize machine learning models for efficient inference, enabling CPU-only deployment with minimal loss of accuracy.
  • ONNX converts models into a format that is interoperable across frameworks and architectures, which supports serverless environments like Lambda; an export-and-inference sketch follows this list.
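A rough sketch of the ONNX round trip described above: exporting a PyTorch model and running it with ONNX Runtime on CPU. The toy model, tensor shapes, and file name are illustrative assumptions.

```python
# Minimal sketch: export a PyTorch model to ONNX and run CPU-only inference with
# ONNX Runtime, the pattern that makes Lambda-style CPU deployment feasible.
import numpy as np
import onnxruntime as ort
import torch
import torch.nn as nn

# Toy model standing in for a real trained network.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
model.eval()

dummy_input = torch.randn(1, 10)
torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["features"], output_names=["logits"],
    dynamic_axes={"features": {0: "batch"}},
)

# Inference with the CPU execution provider only.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
logits = session.run(None, {"features": np.random.randn(3, 10).astype(np.float32)})[0]
print(logits.shape)  # (3, 2)
```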

Emerging and Alternative Providers

1. Cortex

  • Cortex is an open source system that orchestrates model training, deployment, and scaling on AWS, including support for spot instances and potential for scale-to-zero, reducing costs during idle periods.
  • Cortex aims to provide SageMaker-like capabilities without the additional premium and with greater flexibility over infrastructure management.

2. Other Providers

  • Paperspace Gradient and FloydHub are additional providers offering ML model training and deployment services that are cost-competitive with AWS.
  • Paperspace is highlighted as significantly less expensive than SageMaker and Batch, though AWS integration and ecosystem breadth may still steer users toward AWS-native solutions.

Batch and Endpoint Model Deployment Scenarios

  • If model usage is rare (e.g., 1–50 times per day), batch approaches such as AWS Batch are cost-effective, running containerized jobs as needed and then shutting down.
  • For customer-facing applications requiring consistently available models, endpoint-based services like SageMaker, GCP Cloud ML, or Cortex are more appropriate.

Orchestration and Advanced Architectures

  • Kubernetes and related tools can be used to orchestrate ML models and complex pipelines at scale, enabling integration of components such as API gateways, serverless functions, and scalable training and inference systems.
  • Tools like Kubeflow leverage Kubernetes for deploying machine learning workloads, but they require greater expertise and management effort.



Transcript
You're listening to Machine Learning Applied, and in this episode we're going to talk about the machine learning hosting solutions out there, with a special emphasis on serverless technology. In the last Machine Learning Applied episode, we talked about various tech stack solutions, especially serverless solutions, for hosting your server. The Amplify stack in particular uses, under the hood, AWS Lambda, and AWS Lambda is a very popular solution for serverless architectures. AWS Lambda lets you write a single code function, whether it's Node.js or Python, and then that function, in conjunction with AWS API Gateway, will be exposed as a REST endpoint that your client can call. Very powerful, and it saves you a lot of time and heartache. Now, we want to achieve this level of functionality when deploying our machine learning models. So in this episode, when I'm talking about machine learning deployment solutions, I want to push towards as much of a serverless solution for deploying your machine learning models as possible. I want to preface by saying my own experience in machine learning hosting is somewhat limited. And in my experience trying to find the best serverless machine learning hosting technology available, I've found that the providers are somewhat limited too, and this makes sense. We've been deploying regular app servers using FastAPI or Node.js for a very long time, and so hosting providers like AWS and GCP have had a lot of time to sit with this problem in order to come up with solutions like AWS Lambda, whereas machine learning is a more recently popular technology, and so they haven't had as much time to really perfect this space. So you'll find that even within a single cloud hosting provider like AWS, there are competing machine learning hosting solutions, and it's not clear when you might want to use one solution over another, and any one given solution doesn't really solve all the problems you need. It's an evolving space, it's a nascent field, and things are only going to get better. All that's to say that this episode is going to be due for a refresh before too long, so expect a follow-up on this episode in the future. Now, let's start with types of hosted machine learning scenarios. The first type is if you want to train your model in the cloud. Training is the most expensive component of the machine learning story; it's the most time- and computation-expensive part of the end-to-end machine learning story. What you'll usually do is train a model, and that includes hyperparameter optimization of the model, so not just the training phase, but cross-validating against validation error, checking the test error, tweaking hyperparameters, et cetera. This can span multiple days for a single model, depending on the training set. And then later you'll end up using that model for inference. Inference is the cheap part. Usually inference can happen in seconds or milliseconds, so it's time-cheap and it's also computationally cheap. So these two phases of the machine learning pipeline are separate and handled differently. The training phase is extremely expensive; the inference phase is cheap. So you'll want to use different technology options for these different phases of the machine learning lifecycle. For the training phase, you don't have to do it in the cloud; you could do it on your desktop.
You could train your model on a custom-built, GPU-heavy desktop, which is what I typically do, and save the result of the trained machine learning model. This is what we call artifacts. The artifacts are the result of a saved machine learning model. If you've got a TensorFlow model, you'll save it in the TensorFlow format. If you've got another model, let's say XGBoost or something else, you might save it via the Python joblib library to a pickle file or something like that. So the exported, saved dump of your machine learning model is what we call an artifact, and then you will take that artifact and deploy it to wherever you're going to be hosting the inference part of your machine learning lifecycle. So that's one option: do the training on your desktop. But another option is that you can cloud-host the training phase of your machine learning model, and there are a number of cloud hosting solutions for the training phase. This is especially valuable if the training phase may take multiple machines. Let's say your desktop just doesn't cut it, it's not enough. Even if you have a very beefy, powerful desktop, it's still not enough; you may need to parallelize the training phase of your machine learning. Then you might want to consider a cloud hosting solution. And of course, as I always mention, there are three popular cloud hosting providers out there: GCP, Google Cloud Platform, which is Google's web hosting offering; AWS, Amazon Web Services, which is Amazon's web hosting solution; and Microsoft Azure, which of course is by Microsoft. So Microsoft, Amazon, and Google. Now, in the past I recommended that you consider Google primarily, and the reason I said this, a long time ago in one of my early Machine Learning Guide episodes, is that Google really is the king of machine learning. They really are. Their primary business model, ads, AdWords and AdSense, is very, very machine learning heavy, and their secondary business model, which is driven by their primary business model, is search. And search, of course, is machine learning heavy, these days primarily using BERT technology. Google puts out white papers hand over fist. Google is one of the absolute cutting-edge participants and contributors to the research space of machine learning, and so necessarily they would have fantastic machine learning backend services and solutions. And they do: GCP is absolutely fantastic tech, and they offer things that other cloud providers don't offer, like TPUs, tensor processing units, as opposed to GPUs and CPUs, which streamline your TensorFlow code immensely and make both your TensorFlow training and inference much faster than what you can get on GPUs and CPUs. So GCP is a fantastic solution, and I highly recommend it. But since I made that recommendation in the past, I never did make the switch myself to Google. I've always been a bit of an AWS guy, and it's just that AWS innovates so fast on their cloud hosting solutions, and they're so tried and true, that I found working with AWS to be easier than working with GCP. Most of the projects that I get involved with end up using AWS, and most of the resources I find out there, tutorials and videos on cloud hosting, end up being on AWS. So over time I found that AWS provided the least inertia for me as a developer.
And so you might find that to be the case for you as well. So I would recommend AWS; consider AWS first before GCP or Microsoft Azure. But if you take this seriously, then of course consider all three options. Weigh the prices of the services that you're going to be using, especially the machine learning hosting services, because GPU hosting is extremely expensive and the price differences between the three providers can be drastic, but there's no clear winner; you can't say one is cheaper than the other across the board. It just depends on the services you use, whether it's this type of GPU or that type of Kubernetes orchestration, et cetera. For the rest of this episode, I'm just going to be focusing on AWS for my examples, but bear in mind that all three cloud providers offer equivalent services for almost everything I'm going to be discussing in this episode. So, for example, we're going to be talking about SageMaker right now. SageMaker is an AWS service, but Google offers an equivalent service called Cloud ML, and I'm sure Microsoft Azure offers an equivalent service as well. AWS SageMaker is a service for training, monitoring, and deploying your machine learning models, and it is probably the most well-known of the cloud-hosted machine learning solutions out there, the reason being that it is the most end-to-end solution out there. It provides the training component, everything in between, including monitoring your model's training and outputting the artifacts, and then exporting those artifacts into a deployed API via REST endpoints. Now, we're talking about the training phase here. So what you might do is use SageMaker to train your model on, and you can do that either through the SageMaker API or console, or SageMaker allows you to spin up a Jupyter notebook where you can write your code and interact with it in a web browser on the deployed SageMaker instance that you're using, and then train your model directly from within the Jupyter notebook. Then you can dump the artifacts of your trained model at the end of your session, close that Jupyter notebook, and deploy those artifacts to a REST endpoint, which we'll talk about in a bit. One of the big benefits that SageMaker offers you in training your model is auto-scaling and parallelization if necessary, if you need lots of compute to train your model. Another benefit that SageMaker offers is monitoring of your model's training process. So something SageMaker adds that the other AWS machine learning solutions don't provide is a lot of insight into the process of your model. This might be accuracy metrics, actual error metrics, like if you're throwing errors, compute resource utilization, and all these things. So it gives you a lot of tooling in the AWS web console to monitor the training process of your machine learning model, on top of whatever tooling the framework you're using provides. So, for example, if you're training a TensorFlow model, well, TensorFlow has this package called TensorBoard, and TensorBoard gives you visual insights into the architecture of your neural network.
And that might include things like the distribution of any one neuron, the distribution of the weights, so you can tell whether things are going awry, whether they're overfitting or underfitting, whether you're getting vanishing or exploding gradients in the training process, et cetera. So that's TensorBoard. Well, that's the type of tooling you might expect SageMaker to provide for the training process and monitoring of your machine learning model. So SageMaker is very robust and powerful, and it's an end-to-end solution. And like I said, what you do then, at the very end of your notebook or however you want to interface with SageMaker, is export your saved model's artifacts, and then you can use those artifacts to deploy REST endpoints for your machine learning model, for inference. So you've done the training process, maybe it takes a half hour or an hour, you've got a trained model, and now you want to use that model on the web. You want your app server, which exists either on AWS Lambda or maybe as a Docker container on ECS Fargate, to interact with your machine learning model in order to provide results to your end user on the client. Well, you don't want to deploy your machine learning model on the Fargate instance, because that instance is going to always be on, and you don't want to attach an expensive GPU to it. You want your app server to handle what it handles best, and that's running a web server. So you leave it alone, and you deploy your SageMaker-saved model artifacts to a REST endpoint through SageMaker, and your app server can now interact with your SageMaker model's REST endpoints in order to retrieve results via inference of the machine learning model. So SageMaker creates a web API for your model, and it does a lot of the heavy lifting for you, so that you can host a model without actually spinning up an EC2 instance, setting up TensorFlow and PyTorch, creating a web API on that instance to expose the model's inference API, and all these things. So SageMaker is end-to-end; it does everything for you. If you want a one-stop-shop hosted machine learning solution, then SageMaker is your guy. Now, why might you not want to use SageMaker? That seems to be the conclusion to this podcast episode; let's stop here and just say SageMaker is the king, and we're done. Well, there are a couple of things that are a bummer about SageMaker. The first is that it's very expensive. It's going to be using a GPU instance, an EC2 instance with a GPU attached. A very common one that people use is called the p2.xlarge. That's one of the lower-end EC2 instances that supports a lot of the modern machine learning frameworks, but it's still quite expensive. If you were to host your own EC2 instance on a p2.xlarge and write your own code to deliver your machine learning model's inference capabilities, writing your own REST endpoints on that EC2 instance, let's say using FastAPI, for example, you would have a couple of benefits. One is you'd have a lot more control. Two is that you can use spot instances if you want, although you'd have to take care that those instances might get shut down if the price of the spot instances goes above your max price bid. So that's one solution: host your own on an EC2 instance, and it would have a flat cost if you're not using spot instances. Well, that cost on SageMaker is 40% more.
So compared to the p2.xlarge instance done manually via EC2, you're paying 40% more to replicate that on SageMaker. Now granted, SageMaker provides all these bells and whistles, these benefits, and saves you a lot of time, so it's very likely worth it. It's almost never a good idea to custom-host your own model on EC2. You're going to want to use one of these solutions available on AWS instead; you want to get as serverless as you can. We're trying to move away from EC2 as much as possible; over time it'll just save you time, heartache, and unnecessary boilerplate coding. So we don't want to use the EC2 model, but going with the SageMaker route, we end up paying a 40% premium. That's a big bummer. Another bummer is that whatever SageMaker deploys is always on. Always on. Now, if your machine learning project, your app, which uses machine learning on the backend by way of SageMaker, for example, is constantly being hit and has a huge user base, and users are constantly running the inference capabilities of your machine learning model, that's fine; you always want it on anyway. And what SageMaker provides for you out of the box is auto-scaling of your machine learning model's inference. Auto-scaling is very, very important; that's another benefit of using SageMaker instead of EC2: auto-scaling, auto-deployment, auto-provisioning of server architecture and instances, and so on. So if you have an app that has machine learning and that machine learning is in constant use, then it is very unlikely that you need to take that server down ever anyway. An always-on server via SageMaker is okay; that's exactly what you want. But if you are a startup and you don't have a lot of users, like me with Gnothi, which doesn't currently have enough users using its machine learning capabilities to warrant the expense of keeping a SageMaker-deployed model always available, then I would be paying excess cost out the nose: (a) I'd be paying a 40% increase on the cost of whatever EC2 instance it's deploying, and (b) I'd be paying for it to always be on, even if it's only being used some of the time. Okay, so that covers SageMaker. Again, SageMaker is an end-to-end machine learning training and deployment solution with all sorts of bells and whistles, including monitoring the training phase of your machine learning model and deployment of a REST API for inference at scale, with auto-provisioning, auto-this, auto-that; everything is just handled for you, but with a premium. That premium is 40% additional cost over the EC2 instance you're going to be using, plus the model always being on. As long as you're okay with that premium, the solution is great for you. And again, GCP has an equivalent solution to SageMaker; it's called Google Cloud ML. And Microsoft Azure, I'm sure, has an equivalent solution to SageMaker; I don't know what it is, I'm very unfamiliar with Azure's offerings, unfortunately. Now, there's another concept called Elastic Inference, provided by AWS, and Elastic Inference allows you to attach as much GPU compute power as you want to whatever EC2 instance you need, without using a full GPU instance. Now, that's a little bit confusing; the idea goes like this. The p2.xlarge EC2 instance uses, I believe, a K80 GPU. So it attaches a K80 GPU with a fixed amount of RAM and compute power.
Well, especially in the inference phase, when you've deployed your model to the cloud at a REST endpoint and you're making predictions, that GPU is going to go wildly underutilized. Even when you have a lot of users and the machine learning model is constantly in use, even then, because inference is so cheap, you're still very likely to be underutilizing your GPU, and therefore you're going to be paying for more than you should be paying for. So what Elastic Inference does is allow you to attach some amount of GPU compute power and RAM to whatever EC2 instance you might be using, without actually specifying which GPU you're going to be using, and that allows you a little bit more flexibility to drive down the costs of your deployed machine learning model. You're less likely to use this API at training time; actually, that's kind of the point, and that's why it's called the Elastic Inference API. You don't use it at training time, because training is probably going to be using your GPU to the max, but at inference time you're very unlikely to be using the full GPU, and so it's better and more cost-effective to use this AWS Elastic Inference system rather than a full-fledged GPU EC2 instance. So rather than using a p2.xlarge EC2 instance, in this example where we're deploying our own model on EC2, which is never suggested, you might use whatever kind of instance you want, including just the standard instances with normal CPU and RAM, and then attach some amount of GPU compute and GPU RAM, some small amount that you think you might actually utilize, and that way you'll save costs. And this Elastic Inference solution is available to SageMaker, actually. So even though the SageMaker deployment of your model carries a 40% cost premium on whatever instance you're using and is always up, you can still shave quite a substantial amount of cost off by not using one of the GPU instances and instead using the Elastic Inference API. So that's one way to get cost savings on SageMaker: it's compatible with the Elastic Inference API. Okay, so that's SageMaker, the end-to-end training and deployment. Now let's talk about training-only and deployment-only offerings. The traditional training-only offering is something called AWS Batch. AWS Batch lets you run a Docker container to completion, and then it dies, it goes down. And what kicks off this batch process? Well, you would either call it through a CLI call, trigger it manually in the AWS console, or use the Python Boto3 library on your app server to trigger a batch job. What Batch is intended to do is run heavy jobs. It's not just for machine learning; AWS Batch is primarily intended for any batch job, any heavy job that is Docker-containerized and intended to run once and then die. So you'd kick off an AWS Batch job, and that batch job might be, for example, training a machine learning model. That's a prime use case of AWS Batch. Now, Batch doesn't offer all the tooling that SageMaker offers for monitoring your training process and all those things, and it certainly doesn't offer deployment of a machine learning model to a REST endpoint. But what Batch does offer is a lot of cost savings. So (a), the job dies eventually; as soon as it's done, it dies, and that means you don't have an EC2 instance always on racking up costs.
And (b), you can specify that the batch job should use a spot instance if available. So what you end up doing is setting up a Batch environment. You say: I want it to run on a p2.xlarge, it's going to deploy this Docker container, which is, for example, a machine learning training job, and if available, spin up a spot instance of that p2.xlarge, but if not available, then use a standard p2.xlarge instance. That way, if there is a p2.xlarge available as a spot instance, you'll save a lot of cost, sometimes up to 90%, and if not, it will run in standard mode, if you need the batch job to run immediately no matter what. Alternatively, you could say: always use a spot instance, don't fall back on a standard instance. You would do something like that if it's not time-sensitive, if it doesn't really matter when that machine learning job runs as long as it runs eventually; then you might use a spot-instance-only batch job. Now, try to think outside the box on how you might use Batch, because of the cost savings Batch provides: (a) it can be a spot instance, (b) it doesn't carry the 40% extra cost, and (c) it dies when it's done. Well, check this: I'm using Batch for Gnothi's machine learning server. In other words, my Batch server is basically my deployment server, which is not Batch's intended use case; Batch is intended to fire and forget some heavy job. What I do is this: on the app server, when a user comes online, if there's not already a Batch server running, then I use the Boto3 library, which is Python's AWS library, to spin up a Batch instance, and that runs my inference system; it puts the inference system online. Now that inference system is running in a Batch job inside of a Docker container, and machine learning requests, requests for inference, prediction requests, are sent to that Batch job by way of a job queue. You might use RabbitMQ or Amazon SQS; I actually don't use any of those. I have my own handcrafted job queue system by way of Postgres, which I mentioned in the last episode. Postgres is so powerful; it offers a job-queuing-like system just due to the nature of some of its capabilities, as well as some pub/sub capabilities. So eventually I'll probably move to a system like RabbitMQ or SQS, but for now this Postgres-only job queuing system is working fantastically. So I send job requests to the Batch server from the app server; the Batch server runs these machine learning inference jobs, like a summarization report, question answering, or theme generation in Gnothi, returns the results, and then it waits for a while, and if it hasn't received a new job in, let's say, five to fifteen minutes, it turns itself off with an exit(0). What that exit(0) does is kill the Python process, and a killed process is what Batch is looking for in order to take the instance offline. Normally, like I said, the way people use Batch as intended is to run a script, maybe a machine learning training script, to completion, and at the end of a Python script, at its completion, an exit(0) is implied. So that's the intended use case, but what you can do with Batch is run the whole thing in a loop so it never exits until you want it to. And now what do you have? You have a machine learning server. Okay.
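What follows is a rough sketch of the idle-shutdown loop described above: poll a job queue, run inference, and call exit(0) after a quiet period so Batch tears the instance down. Here fetch_next_job, run_inference, and save_result are hypothetical helpers standing in for the Postgres-backed queue and the actual models, and the idle timeout is an assumed value.

```python
# Rough sketch of the pattern described above: a Batch container that loops over a
# job queue and exits cleanly after an idle window, so AWS Batch shuts the instance down.
# fetch_next_job(), run_inference(), and save_result() are hypothetical stand-ins for
# the real Postgres-backed queue and model code.
import sys
import time

IDLE_LIMIT_SECONDS = 10 * 60  # shut down after roughly 10 minutes with no work

def main():
    last_job_at = time.time()
    while True:
        job = fetch_next_job()           # e.g., pull the next row from a Postgres job table
        if job is not None:
            result = run_inference(job)  # summarization, question answering, themes, etc.
            save_result(job, result)
            last_job_at = time.time()
        elif time.time() - last_job_at > IDLE_LIMIT_SECONDS:
            sys.exit(0)                  # a clean exit is what Batch needs to take the instance offline
        else:
            time.sleep(5)                # brief pause before polling again

if __name__ == "__main__":
    main()
```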
So it's a little bit of a hack job, and it doesn't offer some of the benefits of SageMaker, but it does offer substantial cost savings while not requiring the server to be online all the time. Let me tell you what it does not provide that SageMaker does: auto-scaling. So when I get enough users to warrant more than one machine learning server, this solution is going to go out the door and I'm going to switch over to SageMaker. This is a temporary solution during the period in which I have very few users using the machine learning services. Okay: SageMaker for end-to-end, Batch for big jobs, and that's a shoo-in for machine learning model training. And Batch uses Docker containers, so you're going to Dockerize your machine learning stuff into a container, and then you will run that container's script to completion, and at that point AWS Batch takes itself offline. Now, as far as I understand, you could probably use the Elastic Inference API with Batch. I'm not sure, actually. The way I understand the Elastic Inference API, it's sort of a one-size-fits-all replacement for GPU instances if you are primarily using that instance's GPU capabilities for inference, not for training. So something you might want to look into, if you want to add more cost savings when using AWS Batch, is whether it's compatible with the Elastic Inference API; I don't know. Now, let's pause here real quick. I'm going to mention another solution offered by AWS called EFS, Elastic File System. EFS lets you attach a file system, a mounted network file system, to any service, whether it's EC2 or AWS Batch or SageMaker or Lambda, and that file system is permanent. So if your AWS Batch container gets taken offline when its script has completed, as it's intended to do, whatever files you saved, including saved model artifacts written to disk, get deleted. When the AWS Batch container gets taken down, its file system is ephemeral; it's temporary, it's intended to be deleted on server takedown, and that's normal. If you don't want that, then you'll use EFS. EFS is a permanent file system that you can attach to all sorts of different services, and they'll share the file system amongst themselves; they'll all be able to contribute to the same file system, and when those services get taken offline, like AWS Batch going offline, that file system will still remain available to you. AWS offers all sorts of different file storage solutions, the most obvious of which is AWS S3. S3 is usually where you'll store files that are intended to be accessible on the web, for example, but sometimes you want a private file system that's only available to your servers. You don't really mount an S3 file system; S3 is more like Dropbox or Google Drive. It's something where you put stuff over there and maybe sometimes pull stuff out of, but it's not intended to be a read-write file system that you're consistently accessing from an operating system. For that, instead, you'll use EFS. So with the solutions that I've discussed so far, SageMaker and Batch, you're almost certainly going to want to combine them with EFS, because you're going to be saving your model artifacts, you're going to be saving various training information and training data; you want a large spreadsheet or a SQL dump to be saved once and consistently available on startup, available as a mounted file system, and so on.
Another reason you would use EFS, something I use it for, is that a lot of the pre-trained models you're going to be using in your project, like the Hugging Face Transformers models, are very large. With Hugging Face Transformers, you would install the package via pip in the Docker container, in the Dockerfile: pip install transformers. But the models that are loaded by that package, the models themselves, which are pre-trained models with PyTorch weights and TensorFlow weights, are downloaded from the internet, and they can be on the order of four gigabytes or more per model. So four gigabytes per model, and in Gnothi I have a summarization model, a question answering model, a sentiment analysis model, and more. So let's say I'm dealing with four gigabytes per model. Well, I don't want to re-download each model every time I provision a Batch instance. Batch, every time it comes online, has to re-provision the instance based on the Dockerfile. You want to keep your Dockerfile as minimal as possible, so it has to download as little as possible, and the only way you can achieve that is to have the big stuff, like the Hugging Face Transformers models, pre-downloaded onto an EFS drive, not downloaded by way of the Dockerfile, and to have that EFS drive mounted to the Batch container. Okay: SageMaker for end-to-end; Batch for big jobs, which run to completion and then die, though it doesn't have to be that way, since you can run some tricks where it's an infinite loop and it exits when you want it to; and EFS to ensure that persistent files are stored between the ups and downs of your servers. Now, here's where things get a little bit tricky, because SageMaker, I think, is an obvious solution for people with deep pockets and a big system that just needs to work and always be available. But what about people like us, and my project Gnothi, which doesn't have deep pockets nor a lot of users consistently using the machine learning capabilities? That Batch system I just described doesn't really work fantastically. It works for now, because I don't need to scale this thing, but eventually I want a solution that maybe saves cost by comparison to SageMaker but ideally provides some of the capabilities of SageMaker, for example an auto-scaling inference REST endpoint, but also allows that endpoint to scale to zero. In other words, when someone's not using the machine learning capabilities, whatever solution I use can ideally turn itself off so it's not incurring costs while no one's using it. Well, I did mention in the last episode the solution AWS Lambda. Lambda lets you write Python functions or Node functions in order to deploy a single function as a REST endpoint, in conjunction with API Gateway, another AWS service. The problem with Lambda for machine learning is that Lambdas do not have GPU access. They don't even have Elastic Inference API compatibility; you can't use the Elastic Inference API on AWS Lambda. In other words, you cannot use a GPU with Lambda. It is impossible, as far as I know. If I'm wrong, please send me a message and correct me, because this to me would be the ideal scenario: if we could get GPU access on Lambda to deploy our own machine learning inference endpoint. The cool thing about Lambda is that as long as a Lambda function is not being called, you are not being charged. That's an immense benefit. To me, that's the real benefit of Lambda.
Unlike EC2, where if you have your own EC2 service, or even if you're using ECS with Fargate and a deployed Docker container, that service is always on and so you're constantly being charged, Lambda is not always on. It spins itself up in real time when it's being requested from a client, so it comes online as needed in order to do its job, and then goes offline. That's wonderful, absolutely wonderful. Now, traditionally the compute cost of any one Lambda endpoint is pretty low anyway, so that's not a huge deal. The real benefit of Lambda is that it auto-scales and auto-provisions all the server architecture; the real cost and time-savings benefit of Lambda is all the stuff it does for you: the orchestration, the scaling, the endpoint creation, and all that. But if you could have Lambda attached to a GPU, such that it would only run a machine learning model's inference when called and then be taken offline, you could save an immense amount of money and get most of the functionality you desire from SageMaker's deployments. But you can't; unfortunately, you can't use a GPU on Lambda. However, some people are still using Lambda nonetheless for their machine learning deployments, and the way they do this is, first off, remember that I said the cost of inference of a machine learning model is substantially less than the cost of training. Well, with this in mind, some people have found that maybe they don't actually need a GPU at inference time. Maybe their model is quick enough, not quite so compute-intensive as to necessitate a GPU at inference time. Not at training time; you always need a GPU at training time, but at inference time, maybe you don't need a GPU. And so they deploy their machine learning models to Lambda, not Batch, not SageMaker; they deploy to Lambda. And again, they have to use EFS attached to Lambda so that the Lambda function has access to the file system, and that file system has the saved model artifacts, let's say a TensorFlow model, or a PyTorch model, or an XGBoost joblib-pickled file, et cetera. So the Lambda function loads up a saved model from EFS disk and uses it simply for inference. And some companies have found that even though, yes, there is reduced performance of the model's compute power on Lambda, it is nonetheless sufficient for their needs. It's enough for them to use the minimal Lambda resources to run inference on a machine learning model. What they find, in some of the things I've read, is: look, a small amount of CPU and a small amount of RAM is not that much, especially when we're running inference on deep learning models, but the whole point of Lambda is that it auto-scales. That's the real big benefit of Lambda. So if you have a single function and you have a hundred users hitting that machine learning model at once, well, those hundred requests are a hundred different Lambda calls, and therefore each function call is loading up its own model into memory, running inference, and sending back the results. So the parallelization capabilities of AWS Lambda compensate for the lack of GPU availability when running inference on these models. So if you're listening to this episode and you want to cut costs, you know SageMaker is too expensive, but you don't want something online all the time, then here is what I would recommend.
Consider whether or not you could use AWS Lambda to host your model for inference. In other words, think about whether your model can be loaded into RAM, the minimal amount of RAM that Lambda provides. I think the max RAM a Lambda function provides is three gigabytes, and the max CPU a Lambda function provides is two virtual CPUs. So think about whether that's enough for one inference call, and if it is, use Lambda: deploy your machine learning model's inference capabilities to Lambda functions. If it's not, then consider some alternatives, like Batch, or some more stuff we're going to discuss here in a bit. One more thing about Lambda. If the compute resources provided by Lambda are insufficient to run inference on the model you want to deploy, there's one more trick you can try, and this is something I plan to try with Gnothi. I want to try moving away from Batch; Batch really is not the right solution here. I'm not using Batch the way it's intended to be used, and that's going to bite me in the butt eventually; Batch is really intended for things like training jobs. So I'm going to try moving from Batch to Lambda, and the way I'm going to do it is the following. There are these tools out there, which I will discuss in a future episode, considered optimization tools, machine learning model optimizers. And I'm not talking about an optimizer like Adam or Adagrad; we're not talking about backprop. We're talking about the whole machine learning model: an exported TensorFlow model artifact or an exported PyTorch model is piped through an optimizer, and out comes a faster model. It's basically like a zip file; it's like a compression algorithm for your machine learning model. It compresses it, literally compresses it. It does some tricks on the model itself, on the weights and the neurons. It can remove neurons as necessary; it can quantize the types of values that can be represented at any one given neuron. Another thing it can do: usually these models have something like float32 variable types as the weights, and it can downscale from float32 to float16, or even further to 8-bit. It applies all these tricks to your exported model without reducing the accuracy too much, and then exports a substantially, substantially minified version of your machine learning model. And now that model can run on CPU. So your previous model, let's say a Hugging Face Transformers model, for example: a lot of those probably won't work very well on CPU, and they probably won't lend themselves well to an AWS Lambda function. But you pipe the model through one of these optimizers, and out comes a minified version of the model which can be used on AWS Lambda, CPU only, a small amount of RAM, and that's okay. You know, we lost, let's say, one or two percent accuracy. Big deal; it's worth the cost savings and the scalability and all that. Of these optimizers, the most popular out there is called ONNX. Like I said, I'm going to do a dedicated episode on optimizers; they're very valuable, very powerful tools. Another popular one is called OpenVINO, and that one's by Intel. If I recall correctly, I don't know that OpenVINO is necessarily intended for general-purpose model optimization; I think it's intended for optimizing your model to be run on Intel CPU chips.
Now, if your AWS Lambda functions are running on Intel CPUs, which I think they are, then win-win: you're getting what you wanted out of it. But just keep that in mind; I don't think OpenVINO is a general-purpose optimizer, I think it's an Intel CPU optimizer. A lot of times OpenVINO is used for things like camera models, computer vision models deployed on small cameras intended to run at the edge, where the camera only has an Intel chip; it doesn't have a GPU or anything else. So OpenVINO tends to have a very specific use case, but ONNX is a general-use-case optimizer, and it has a lot of other benefits as well. You'll hear about ONNX; if this is the first time you've heard about it, I promise this won't be the last. ONNX is not just about model optimizing; it's also about model file format generalization. So when you pipe your TensorFlow model through ONNX, out comes an optimized model, a smaller, faster, leaner model with not too much reduction in accuracy, and another benefit of what just came out of the pipeline is that it can be deployed on all sorts of different architectures. So ONNX is not just about optimization, it's also about compatibility. But we'll talk about that in a future episode; let's move on from ONNX. So: SageMaker is the end-to-end solution, all sorts of bells and whistles, all sorts of tooling; deep pockets, lots of users, go SageMaker. Batch runs a one-off, fire-and-forget big job; great idea for model training. Lambda is a bunch of tiny little servers where every client request spins up that function into CPU and RAM and then spins it down. It's normally not used for machine learning, but a lot of people have been using it for machine learning recently. The main way they do that is either their model runs just fine for inference purposes on that Lambda instance, because inference is cheap, or, if inference is still not cheap enough to run on a Lambda function, you can force it to be by running your machine learning model through ONNX to output a minified version of that model, which can now run cheaply on a Lambda function. And of course, I mentioned EFS, your file system where you store your model artifacts, your saved model dumps. You would want to pipe your model through ONNX first and save it to your EFS file system in advance, and that way your Lambda function calls would load from EFS the minified, optimized models, not the original models. Now, as far as I understand, that about brings us to a conclusion on the AWS machine learning offerings. Like I said, you could do all this stuff on EC2 yourself manually. You could also do all this on ECS, Elastic Container Service, by using a Docker container, and that will relieve some of the deployment overhead of spinning up manual EC2 instances and such, but I just wouldn't recommend it. It's too expensive and it's too much legwork to write the code that is going to serve up your inference engine and all these things. For machine learning model deployment, I definitely would not recommend running a custom EC2 server. But even at the ECS level, I don't know that I would recommend running a containerized machine learning model on ECS. You could do it.
You could probably get some cost savings, but there might be a little bit more to the programming and the management, and making sure that everything works together and scales effectively and so on. So that covers the machine learning offerings by AWS, and like I said, GCP and Azure have their own equivalents to all of this stuff. But there are third-party providers out there competing in this space now. One is called FloydHub; I don't know much about them. Another is called Paperspace. They do stuff similar to SageMaker; it's a similar offering where you can train and monitor your machine learning model and then deploy it as a REST endpoint. And they also have an offering similar to AWS Batch, where you run a one-off instance that dies at the end. These services are very similar to their counterparts; they offer the same types of solutions. For example, their Batch equivalent has cost savings and can use spot instances in a similar way as AWS. Now, why would you use Paperspace if AWS has all this stuff built in, especially since a lot of us are using AWS already? Wouldn't we want to be using the VPC that we might be hosting our other services on, and so on? Well, Paperspace is a lot cheaper than AWS, actually. For some reason, I don't know how they've pulled this off, but they've found a way to make things a little bit cheaper than AWS SageMaker and AWS Batch. I did experiment with Paperspace some, and I found that they were cheaper than AWS on average, but I still prefer AWS, like I said, because I am using all the other AWS services, and I do want that EFS file system, for example, to be available to other AWS services outside of the machine learning servers I'm running. So I still prefer to stay in AWS rather than Paperspace, although do look into Paperspace; I believe you can set it up to run inside your AWS architecture, and there's a way you can run Paperspace services on-prem, on your own servers, like an on-site server farm with physical servers, and so on. I will say: check out Paperspace, but maybe don't get your hopes up too high. I think the AWS offerings are pretty much all you need, but check out Paperspace and see if you like what they've got. Finally, there's one last cloud hosting solution I want to discuss that I'm really excited about but don't know enough about yet, and it's called Cortex, at cortex.dev. Cortex is an open source cloud hosting setup. It uses your AWS stack; it actually orchestrates the provisioning of the various services in your AWS stack needed to run the various machine learning components of a full-fledged architecture. Okay, so let me step back. What you do is pip install cortex, and you write your traditional machine learning code: TensorFlow, PyTorch, Hugging Face Transformers, et cetera. At some point in your code, you'll run inference through your machine learning model, or you will run training on your machine learning model, and you will have it tied into Cortex in one way or another. There will be some flag. You know, I'm kind of blowing smoke here as I try to wrap my head around it, because I haven't used Cortex yet, so bear with me as I try to describe what I understand of Cortex.
But at some point there's sort of a flag that tells you whether you're in a dev environment or a production environment, and at the point at which you would call inference on your machine learning model, in the standard way you would call inference, like model.predict on your TensorFlow model: if this flag is in development mode, it'll do the normal stuff and run it all on your GPU, but if the flag is in production mode, it will kick off to AWS through the Cortex system. And what Cortex will do is auto-provision EC2 instances, spot instances if available, and scale them up and scale them down as necessary. And one thing they're working on, which is the reason I'm so excited about Cortex, is scale down to zero. In other words, what Cortex does for you is basically be an open source SageMaker: it does the end-to-end stuff, training, monitoring, and inference and deployment to REST endpoints. It does all that for you, same as SageMaker, without the added cost; it doesn't incur that 40% extra cost, and in fact it comes at a reduced cost, because you're able to use spot instances if available. So it's cheaper than SageMaker. And last but not least, one thing they're working on, which isn't yet released, is scale down to zero. So if you don't have users using the service, you can specify some cooldown period: after 15 minutes, just take everything offline. So Cortex is a powerful, open source, cheaper alternative to SageMaker, which may soon provide scale-to-zero, a capability that is missing from almost all of the serverless machine learning solutions out there. So keep an eye on cortex.dev, check it out, tell me what you think. Send a comment, send an email; I want to know if anyone has any experience with Cortex. I plan to give it a whirl, and I'll keep you posted on what I find. That's pretty much a wrap for this episode. I do want to mention that you can orchestrate your own machine learning architecture; there are solutions out there, like Kubeflow. So Kubernetes is a declarative orchestration system. In the same way that a Dockerfile is a declarative code file for provisioning a server, setting it up end to end, Kubernetes takes it to the next level, and it's the same kind of thing: a declarative file that provisions your entire tech stack. So you'd use Kubernetes on AWS or on GCP in the same way you'd use Docker, and it sets up your entire stack: some instances, some API gateways, some serverless functions, et cetera, end to end. So the real big guys, the real DevOps champions of this world, when they're setting up their machine learning architecture, are probably going to be using something more akin to Kubernetes. And I don't know much about Kubeflow, but I know that it is a TensorFlow companion to Kubernetes, so it makes deploying machine learning in your architecture more streamlined if you're using Kubernetes. But I don't know much about these orchestration software packages like Kubernetes or Kubeflow, so when I do learn more about these projects, I will do a follow-up episode and fill you in. Until then, thanks for listening. See you next time.