MLA 015 AWS SageMaker MLOps 1

Nov 03, 2021

SageMaker is an end-to-end machine learning platform on AWS that covers every stage of the ML lifecycle, including data ingestion, preparation, training, deployment, monitoring, and bias detection. The platform offers integrated tools such as Data Wrangler, Feature Store, Ground Truth, Clarify, Autopilot, and distributed training to enable scalable, automated, and accessible machine learning operations for both tabular and large data sets.

Resources
Designing Machine Learning Systems
Machine Learning Engineering for Production Specialization
Data Science on AWS: Implementing End-to-End, Continuous AI and Machine Learning Pipelines
Amazon SageMaker Technical Deep Dive Series
Show Notes

Amazon SageMaker: The Machine Learning Operations Platform

MLOps is the practice of deploying and operating your ML models in the cloud. See MadeWithML for an overview of the tooling landscape (and, generally, a great ML educational run-down).

Introduction to SageMaker and MLOps

  • SageMaker is a comprehensive platform offered by AWS for machine learning operations (MLOps), allowing full lifecycle management of machine learning models.
  • Its popularity provides access to extensive resources, educational materials, community support, and job market presence, amplifying adoption and feature availability.
  • SageMaker can replace traditional local development environments, such as setups using Docker, by moving data processing and model training to the cloud.

Data Preparation in SageMaker

  • SageMaker manages diverse data ingestion sources such as CSV, TSV, Parquet files, databases like RDS, and large-scale streaming data via AWS Kinesis Firehose.
  • The platform introduces the concept of data lakes, which aggregate multiple related data sources for big data workloads.
  • Data Wrangler is the entry point for data preparation, enabling ingestion, feature engineering, imputation of missing values, categorical encoding, and principal component analysis, all within an interactive graphical user interface (a rough local equivalent of these transformations is sketched after this list).
  • Data Wrangler leverages distributed computing frameworks like Apache Spark to process large volumes of data efficiently.
  • Visualization tools are integrated for exploratory data analysis, offering table-based and graphical insights typically found in specialized tools such as Tableau.
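
A rough, hypothetical local sketch of what these Data Wrangler steps do under the hood, using pandas and scikit-learn; the CSV path and column names are made up for illustration:

```python
# Local approximation of common Data Wrangler steps: date-part extraction,
# median imputation, one-hot encoding of categoricals, and PCA.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("tweets.csv")  # hypothetical dataset with "date", "category", numeric columns, "label"

# Date feature engineering (what Data Wrangler suggests automatically for date columns).
df["date"] = pd.to_datetime(df["date"])
df["day_of_week"] = df["date"].dt.dayofweek
df["week_of_year"] = df["date"].dt.isocalendar().week.astype(int)
df = df.drop(columns=["date"])

X = df.drop(columns=["label"])
numeric_cols = X.select_dtypes("number").columns
categorical_cols = X.select_dtypes("object").columns

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),  # impute missing values
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_cols),  # encode categories (scikit-learn >= 1.2)
])
pipeline = Pipeline([("preprocess", preprocess), ("pca", PCA(n_components=5))])
features = pipeline.fit_transform(X)
```

Data Wrangler packages these same kinds of steps as GUI "layers" and can run them on Spark rather than a single machine.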

Feature Store

  • Feature Store acts as a centralized repository to save and manage transformed features created during data preprocessing, ensuring different steps in the pipeline access consistent, reusable feature sets.
  • It facilitates collaboration by making preprocessed features available to various members of a data science team and across different models; a minimal SDK sketch follows below.
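
A minimal sketch of pushing engineered features into Feature Store with the SageMaker Python SDK; the feature group name, record identifier, S3 URI, and role ARN are placeholders, and exact arguments can vary by SDK version:

```python
# Register a feature group from a pandas DataFrame and ingest rows into it.
import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder IAM role

df = pd.read_csv("engineered_features.csv")  # output of the data-prep step (hypothetical)
df["event_time"] = float(time.time())        # Feature Store requires an event-time column

feature_group = FeatureGroup(name="tweets-features", sagemaker_session=session)
feature_group.load_feature_definitions(data_frame=df)  # infer feature names/types from the DataFrame
feature_group.create(
    s3_uri="s3://my-bucket/feature-store/",             # offline store location (placeholder)
    record_identifier_name="tweet_id",                  # unique ID column (placeholder)
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,                           # low-latency lookups for inference
)
feature_group.ingest(data_frame=df, max_workers=3, wait=True)
```

Other pipeline steps (training, batch transforms, online inference) can then read the same features back instead of re-deriving them.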

Ground Truth: Data Labeling

  • Ground Truth provides automated and manual data labeling options, including outsourcing to Amazon Mechanical Turk or assigning tasks to internal employees via a secure AWS GUI.
  • The system ensures quality by averaging multiple annotators’ labels and upweighting reliable workers, and can also perform automated label inference when partial labels exist.
  • This flexibility addresses both sensitive and high-volume labeling requirements; a rough API sketch follows below.
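
For reference, a labeling job can also be created programmatically. This is a heavily simplified boto3 sketch, not a copy-paste recipe: every ARN, S3 path, and Lambda here is a placeholder, and the pre-annotation/consolidation Lambdas and UI template depend on the task type (see the Ground Truth docs):

```python
# Sketch: create a Ground Truth labeling job for image classification with a private workteam.
import boto3

sm = boto3.client("sagemaker")
sm.create_labeling_job(
    LabelingJobName="cat-dog-labels",
    LabelAttributeName="animal",
    InputConfig={"DataSource": {"S3DataSource": {"ManifestS3Uri": "s3://my-bucket/manifest.json"}}},
    OutputConfig={"S3OutputPath": "s3://my-bucket/labels/"},
    RoleArn="arn:aws:iam::123456789012:role/GroundTruthRole",         # placeholder
    LabelCategoryConfigS3Uri="s3://my-bucket/label-categories.json",  # e.g. ["cat", "dog"]
    HumanTaskConfig={
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:123456789012:workteam/private-crowd/my-team",
        "UiConfig": {"UiTemplateS3Uri": "s3://my-bucket/template.liquid"},
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-1:123456789012:function:PreLabelTask",
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-east-1:123456789012:function:ConsolidateLabels"
        },
        "TaskTitle": "Classify images",
        "TaskDescription": "Label each image as cat or dog",
        "NumberOfHumanWorkersPerDataObject": 3,  # multiple annotators per item, consolidated/averaged
        "TaskTimeLimitInSeconds": 300,
    },
)
```

Swapping the private workteam ARN for the public Mechanical Turk workforce (plus a task price) is how the crowdsourced option is configured.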

Clarify: Bias Detection

  • Clarify identifies and analyzes bias in both datasets and trained models, offering measurement and reporting tools to improve fairness and compliance.
  • It integrates seamlessly with other SageMaker components for continuous monitoring and re-calibration in production deployments; a minimal sketch follows below.
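
A hedged sketch of running a pre-training bias report with the SageMaker Python SDK's Clarify helpers; the role, bucket, and column names are placeholders, and arguments may differ slightly across SDK versions:

```python
# Produce a bias report for a training dataset with SageMaker Clarify.
import sagemaker
from sagemaker import clarify

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder IAM role

processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.m5.xlarge", sagemaker_session=session
)
data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/train.csv",   # placeholder dataset
    s3_output_path="s3://my-bucket/clarify-report/",
    label="approved",                                # target column (placeholder)
    headers=["approved", "age", "income", "gender"],
    dataset_type="text/csv",
)
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],  # which label value counts as the favorable outcome
    facet_name="gender",            # sensitive attribute to measure bias against
)
processor.run_pre_training_bias(data_config=data_config, data_bias_config=bias_config)
```

Post-training bias and explainability reports follow the same pattern, with a trained model's config added.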

Build Phase: Model Training and AutoML

  • SageMaker Studio offers a web-based integrated development environment to manage all aspects of the pipeline visually.
  • Autopilot automates the selection, training, and hyperparameter optimization of machine learning models for tabular data, producing an optimal model and optionally creating reproducible code notebooks.
  • Users can take over the automated pipeline at any stage to customize or extend the process if needed (a minimal SDK sketch follows below).
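
A minimal sketch of driving Autopilot from code rather than the Studio GUI, using the SageMaker Python SDK; the S3 paths, role, and target column are placeholders:

```python
# Launch an Autopilot job on a tabular CSV and deploy the best candidate.
import sagemaker
from sagemaker.automl.automl import AutoML

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder IAM role

automl = AutoML(
    role=role,
    target_attribute_name="price",  # the label column; problem type is inferred from it
    max_candidates=20,              # cap the model/hyperparameter candidates to control cost
    sagemaker_session=session,
)
automl.fit(inputs="s3://my-bucket/train.csv", wait=True, logs=False)

# Deploy the best candidate behind a managed REST endpoint.
predictor = automl.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```

The generated candidate notebooks are what let you "take over" and customize the pipeline by hand.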

Debugger and Distributed Training

  • Debugger provides real-time training monitoring, similar to TensorBoard, and offers notifications for anomalies such as vanishing or exploding gradients by integrating with AWS CloudWatch.
  • SageMaker’s distributed training feature enables users to train models across multiple compute instances, optimizing for hardware utilization, cost, and training speed.
  • The system allows for sharding of data and auto-scaling based on resource utilization monitored via CloudWatch notifications (see the sketch below for Debugger rules and distributed training).
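
A sketch of wiring both ideas into one training job with the SageMaker Python SDK: built-in Debugger rules that flag gradient problems, plus SageMaker's data-parallel distributed training. The script name, role, and instance choices are placeholders, and the data-parallel library only runs on certain GPU instance types:

```python
# Attach Debugger rules and enable data-parallel training on a PyTorch estimator.
import sagemaker
from sagemaker.debugger import Rule, rule_configs
from sagemaker.pytorch import PyTorch

role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder IAM role

estimator = PyTorch(
    entry_point="train.py",            # your training script (placeholder)
    role=role,
    framework_version="1.12",
    py_version="py38",
    instance_count=2,                  # shard the work across two GPU instances
    instance_type="ml.p3.16xlarge",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    rules=[
        Rule.sagemaker(rule_configs.vanishing_gradient()),  # flag vanishing gradients
        Rule.sagemaker(rule_configs.exploding_tensor()),    # flag exploding values
    ],
)
estimator.fit({"train": "s3://my-bucket/train/"})
```

Rule evaluations surface in the training-job status and as CloudWatch events, which is where email or SMS alerting can be hooked in.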

Summary Workflow and Scalability

  • The SageMaker pipeline covers every aspect of machine learning workflows, from ingestion, cleaning, and feature engineering, to training, deployment, bias monitoring, and distributed computation.
  • Each tool is integrated to provide either no-code, low-code, or fully customizable code interfaces; an end-to-end code sketch follows this list.
  • The platform supports scaling from small experiments to enterprise-level big data solutions.
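
As an end-to-end example of the "fully customizable code" path, a sketch that trains SageMaker's built-in XGBoost on CSVs in S3 and deploys it as a REST endpoint (bucket paths and the role are placeholders):

```python
# Train built-in XGBoost on S3 data and deploy it behind a managed endpoint.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder IAM role
region = session.boto_region_name

image_uri = sagemaker.image_uris.retrieve("xgboost", region, version="1.5-1")
xgb = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/models/",
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="reg:squarederror", num_round=100)

# Built-in XGBoost expects CSVs with the label in the first column and no header row.
train_input = TrainingInput("s3://my-bucket/train/", content_type="text/csv")
xgb.fit({"train": train_input})

predictor = xgb.deploy(initial_instance_count=1, instance_type="ml.m5.large")
# predictor.predict(...) now calls a live HTTPS endpoint; remember to clean up:
# predictor.delete_endpoint()
```

Raising `instance_count` and setting the input's `distribution="ShardedByS3Key"` is how the same job scales out to distributed training over sharded data.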

Useful AWS and SageMaker Resources


Transcript
Welcome back to Machine Learning Applied, and this is gonna be a very important episode where we discuss Amazon SageMaker, AWS, Amazon Web Services, SageMaker. In the last episode, I talked about deploying your machine learning model to the server. It was an episode called Machine Learning Server. Well, I was in over my head. I'm older and wiser now, and I know now that this concept is called machine learning operations, or MLOps. You may be familiar, if you're a web developer or a server developer, with something called DevOps, developer operations, which has effectively replaced systems administration. It's the concept of deploying your server to the cloud, your front end to the cloud, et cetera, making these servers scalable, microservices architectures, all these things. Well, deploying your machine learning models to the server, a new concept in the world of data science and machine learning, is called MLOps, or machine learning operations. In the last episode, the machine learning server episode, I talked about a few services. I did talk about SageMaker. I talked about AWS Lambda. And then I talked about a handful of auxiliary services like Cortex.dev, Paperspace, and FloydHub. Well, I hate to say this to those mom and pops, I'm sorry, but toss those out the window, because SageMaker is gonna blow your mind. SageMaker is way more powerful than I thought it was. It has more bells and whistles than I realized. There are ways to reduce cost, one of the biggest gripes that I had with it in the prior episode, ways to handle a REST endpoint up in the cloud, a.k.a. scale to zero. And that's not all. The sky's the limit with the amount of features available by way of SageMaker. Okay, now I may talk about GCP, Google Cloud Platform, and I may also talk about Microsoft Azure in the future, but I also may not, because I feel like I'm completely sold on SageMaker. I think I'm going all in on AWS. Why? Well, one is AWS is just clearly the most popular cloud hosting provider out there. It's just the most popular; it seems clear as day to me. And what does popularity give you? It gives you more resources, whether that's educational material or modules and plugins or prospective employees and hires available on the marketplace. And so popularity is not something I shake a stick at when it comes to deciding on some technology or technology stack. I know it may be a controversial point, but in this particular case, the popularity of SageMaker seems to have amplified its capacity by way of the features that they ended up building into the framework. And as you'll see in this episode, SageMaker is absolutely astounding. The amount of features, bells and whistles, and capabilities of this framework are just out of this world, and it may indeed replace my entire localhost development setup, even what I talked about in prior episodes using Docker and Docker Compose. I may be switching to completely going all in on AWS, including developing against AWS on localhost by way of something called LocalStack. I'll talk about that in the next episode, but SageMaker has absolutely wowed me. It has dazzled me. So let's get into it. What is SageMaker? Well, in the last episode, I talked about deploying your machine learning model to the cloud by way of some of these services. And I did mention SageMaker. I also mentioned AWS Lambda. Now AWS Lambda is still a viable option under certain circumstances for deploying your machine learning models to the cloud.
But for larger models, especially GPU-centric models, and especially big-data-centric models and data pipelines that have scalability in mind, then you're gonna be using AWS SageMaker. SageMaker is not just an MLOps platform. It's not just about deploying your model to the cloud. It is the entire end-to-end stack of an entire machine learning anything, period. You're gonna collect your data using SageMaker, prepare your data using SageMaker, train your model, deploy that trained model to the web as a REST endpoint or, as we'll discuss, as running one-off inference jobs, and then monitor that model to make sure it's not drifting, or keep an eye on bias, and all these things. You're gonna be able to use SageMaker to label your data. There's even tooling in SageMaker that lets you deploy your model to an edge device. Remember edge, I talked about this before. Edge means it's on a device. So if you have a camera that has some tiny machine learning model that's running on the camera and not in the cloud somewhere, it's not making REST calls with those video frames, we call that model running at the edge. So we'll talk about deploying your SageMaker model to the edge using AWS SageMaker Neo, but we'll get to that at the end. I'm gonna take you through the SageMaker features step by step. Let's just list those features now, starting from the top. In the data preparation phase, we have Data Wrangler, Feature Store, Ground Truth, and Clarify. In the build phase, we have SageMaker Studio, Autopilot, and JumpStart. In the train and tune phase, we have Debugger and distributed training. And in the deploy and manage phase, we have deployment, Pipelines, Model Monitor, Kubernetes integration, Edge Manager, and Neo. And this is all just listed on the SageMaker website, aws.amazon.com/sagemaker. You just hover over the Features menu item and these will all be listed in the dropdown. So let's just take this one feature at a time: data preparation. So we're in the data preparation phase of a pipeline. A pipeline in machine learning or data science is where you take your data from some source, you're gonna be ingesting your data, and then you're gonna send it through a series of steps, a pipeline, before it even hits the machine learning training phase. And then the very end of that pipeline is a trained model that's deployed to a REST endpoint. In that pipeline, in the entire pipeline of your data stuff, the first step is going to be getting your data, collecting your data. We call it data ingestion, ingesting your data. Now, that data may come from a CSV on Amazon S3, or it may come from a TSV, a tab-separated value file, or something called a Parquet file. I haven't discussed that before, but I'll discuss it later. Parquet, P-A-R-Q-U-E-T. It's actually very common, especially in Amazon SageMaker, stored on S3. Or maybe your database, RDS, Postgres, MySQL, Microsoft SQL Server, whatever the case may be. Or it may be streaming from the web, coming from the internet somewhere, as something we call a firehose. A firehose is basically if you're collecting data that's maybe updating every second, or millisecond even. So for example, tweets on Twitter. Twitter is a firehose. It is so much data so fast that don't bother storing this in a database, especially not an RDS database. Maybe you might put this in a NoSQL database like DynamoDB, but more likely the case, you're actually gonna be piping this through one of AWS's services like Amazon Kinesis Firehose. So you have any number of data sources. Okay.
A data source is a single source of data. One data source might be a CSV, another data source might be a TSV, another data source might be your database. A data lake is a collection of these data sources that are related to each other. So if you're doing a whole bunch of analysis on tweets on Twitter, you might be storing some data in a DynamoDB table, you might be piping some other data straight through AWS Kinesis Firehose, and you might be saving some other data to AWS S3 as CSV files or Parquet files. If all of these data sources are related to each other, you might store them all together, maybe in a single VPC, and this is called a data lake. A data lake. And when we say big data, big data means we're dealing with so much data that you couldn't possibly run this all in a single Python script. You couldn't just load all of this data, this firehose from Kinesis, the data from the RDS database, and the CSVs all into some pandas data frame in a Python script on a single server. Anything beyond that capacity, running this all in a data frame on a server, and you're dealing with what's called big data. So big data just means you need to scale this data. You need to scale this data. And so right off the bat, right away, we find value in using SageMaker for dealing with data lakes, if the data you're working with is big, is big data, which is gonna be nine times outta ten anyway. So if you have a lot of data coming into your machine learning system through your pipeline, you're gonna want to use one of these MLOps data pipelines like SageMaker, and the tool here, the entry point of the SageMaker data preparation phase of the pipeline, is called Data Wrangler. Data Wrangler. What it'll do for you is it'll allow you to specify where your data is gonna be ingested from, where you're ingesting your data from, this data lake, and then what are we gonna do with that data? Well, you're gonna be doing some feature transformation, so we're gonna transform the data into features. And then we're gonna also be imputing missing data. Remember, imputing, I-M-P-U-T-E, means filling in the missing values of data wherever there's missing values. Maybe you wanna fill in the nulls or NaNs with the mean of that column or the median, or maybe guess the missing value. There's actually tooling around that in SageMaker, trying to fill in the missing values with some estimate based on the other features of the rows, which is already super powerful. And so SageMaker Data Wrangler allows you to do some feature engineering, some imputation, as well as some visualization: visualizing the distribution of your data per feature, deciding which features are the most important, feature importances, and then piping your data into, almost like a dry run, a quickie machine learning model like XGBoost, one of these off-the-shelf, real obvious machine learning model implementations, to see how your data would perform before we even get to the training phase, in case you may want to feature engineer and transform and impute your data some more before we get to that phase. But let me step back just a little bit and say that in the past, normally, in prior episodes, we would've done this by way of pandas and NumPy. We would've done this feature engineering and imputation in pandas.
Well, that's all fine and good, but if your data needs to scale, if you are ingesting data at scale, big data, which is gonna be most of the time when you're actually deploying your entire company to the internet and you're finally getting lots of customers and suddenly you're getting lots of data from somewhere, you wanna be prepared for big data. And you want to have your pipeline set up in advance so that the data preparation phase doesn't just run on a single Python script. Instead, it runs in a distributed, parallelized fashion, usually, in the case of Amazon SageMaker, by way of what's running under the hood, called Apache Spark, which is a distributed parallelization framework, especially for data science and machine learning, which traditionally runs either Scala, which is a JVM language, or Python by way of PySpark. So SageMaker does a lot of its parallelization for distributed data pipelining by way of Apache Spark. But you wanna be prepared for big data in your data pipeline, and that's where Data Wrangler comes into play. So let's talk about some of these features. The first feature is feature engineering, being able to transform your data into features. So for example, if you have a date, if you have a tweet coming from Twitter and it has a date, 2021, October 28th, you may want to pull some stuff out of that date. One thing may be the day of week, or week of year, or time of day, and so on. You may want to turn that date into a number of integer columns, because, you remember, in most machine learning algorithms we really wanna work with numbers, not with strings, not with dates: numbers. So one feature transformation step of this phase of the pipeline might be transforming dates. Now, in the past, we might use pandas to pull these features out of a date column, and pandas is very slick at handling these things. Well, Data Wrangler has, as part of its tooling, a whole suite of prebuilt feature engineering steps for common features. So if it automatically picks up a date column in your CSVs or your RDS column, it will suggest: this is a date, I think maybe we should pull out this, that, and the other feature automatically for you. So not only are you preparing yourself for the internet by deploying your model at scale using SageMaker, but outta the box you're already saving time and code, Python code, if you use Data Wrangler's auto feature engineering suggestion steps that it picks up by analyzing your data. So that's one part of the feature engineering phase. Another part of feature engineering is imputation strategies. So if you have a whole bunch of missing data in your spreadsheet, a bunch of nulls, missing data, well, traditionally we'd use pandas in a Python script and we would either fill those in with the mean of that column or the median of that column. That's a very common strategy for filling in holes. Another common strategy is removing rows for which that column is empty. Okay? If it's important that you don't train on data for which that column is empty, or if you don't know how to handle filling in that value, you don't know if mean or median or max or min would be wise, in this case maybe we'll just remove the row. When would you not want to remove a row for an empty column? If the net result leaves you with not very much training data. But you don't have to make this decision. SageMaker Data Wrangler will help you make that decision. It'll give you some suggestions.
Pro tip: by removing rows for this empty column, you're gonna be left with not much training data; we suggest filling it in with the mean or the median. Further, it has a feature for actually filling in the value of that column with a likely candidate for that missing value, based on the rest of the columns of that row. That is slick. That is super powerful. In other words, it uses a machine learning model to predict what would go into that missing slot for that row based on the values for other rows and what they had for that column. So that's something you don't get with pandas outta the box. That's super impressive. Finally, another quick, cool feature engineering strategy that it gives you outta the box is for string columns. If it thinks these things are categories, it can turn them into a one-hot encoded vector. Very cool. Right outta the box, no pandas, you get one-hot encodings of your string-based category columns. And the whole thing can be piped through principal component analysis to dimensionality-reduce your data. Now, what do you get at the end of this phase? Well, each feature that you feature engineer, you apply a feature engineering step as if it was a layer in Photoshop. It's like your data comes in and you say, okay, for the date columns I want to apply date feature engineering, I wanna pull out the time of day, day of week, week of year, blah, blah, blah. And that's a layer. That's a layer on top of that step of the data pipeline. Okay? Now you have a new set of data and it goes on to the next step of the Data Wrangler pipeline. You apply a new feature engineering step, whether that's imputation or one-hot encoding and so on. Now, unlike writing your feature engineering process in pandas, in a Python file, where what you get at the end is the data, maybe you're gonna stuff that back into an RDS database or that's gonna be piped into your machine learning model. What came into your pandas data frame was raw data. What came out of your pandas data frame was clean data, but that process was destructive to the data. It was destructive. There may be some other part of your machine learning pipeline that wants the date column, or that wants the string column, or that is okay with nulls; we don't want the imputation. So unlike doing this stuff through pandas in a Python file, Data Wrangler on SageMaker applies these feature engineering steps through a pipeline as if they're Photoshop layers, so that you can access any of these steps through the pipeline. And that's why we call it a pipeline. It's because, if you imagine pipes, the water going through the pipe can split in the pipe and go left and right to various parts of your data science application. But it doesn't stop there. Data Wrangler comes with a whole bunch of visualization tools, as if they were competing with Tableau. I've mentioned Tableau in a previous episode on data analysis, data exploration, EDA, exploratory data analysis. Tableau is a desktop application that, you know, you download, you pay for a premium subscription, and you punch in your CSV and you explore your data. It gives you some charts and graphs. Very cool. We also talked about matplotlib and seaborn, all these different libraries that you might use in a Jupyter notebook to explore your data. Well, SageMaker Data Wrangler has a graphical user interface, and that graphical user interface allows you to, first off, set up the data pipelining that I just described with the feature engineering.
And second off, it lets you explore your data graphically, visually, using different charts and graphs. It competes with Tableau. It competes with that program, all on your AWS stack. Are you excited yet? This is just one tool, of the, what, ten that I just listed, of the SageMaker pipeline, and it's now potentially replacing some program you might pay a large sum for, a pro subscription, to download on your desktop for exploring your data. Well, this is all part of the pipeline that you're gonna be setting up your whole machine learning project in; might as well have the whole thing strung together for free. You know, run this thing on a t1.micro, kick off a SageMaker Studio project and whatnot. You get the data exploration phase for free and built into your data pipeline phase. So it's not just for looking at your data, it's also assessing, applying some feature engineering strategy, reassessing, and so on. And now click submit, save, yes, we're good to go on to the next phase. Finally, and this is so cool, Data Wrangler will let you take the output of your data pipeline (you've done all your feature transformations, your imputations, you've done some analysis, visual analysis and whatnot) and just pipe the whole thing into a quick dry-run machine learning model. Just run the whole thing through XGBoost or linear regression or logistic regression. It will decide on which model to use based on the label column that you select. It will do a whole bunch of hyperparameter optimization and meta-learning and decide which model to use. Probably gonna be XGBoost. Okay, we did some hyperparameter optimization on XGBoost, we selected XGBoost, we've got these features, out comes a regression prediction on this numerical label column. And it then tells you the feature importances of your data, of your data lake, of your data source. The feature importances. This is something in the past I mentioned we might use XGBoost for: XGBoost, train, parentheses, your data. Now you have a trained model. You say: trained model, dot, feature importances, and it will tell you which feature of your data source contributed the most to making the predicted output. Well, unlike XGBoost, SageMaker Data Wrangler's feature importances output is, A, visual. You get that right outta the box. You just pipe in your CSV, it kicks off a dry-run machine learning quickie, determines the feature importances, and now you have a graph that you can just eyeball and see which features seem to be the most important. Super handy. You could do that in XGBoost by way of matplotlib, seaborn, whatever the case may be, but you'll have to wire up some code. Data Wrangler will do this automatically for you, and it's a really handy eyeball of what seems to be the most important in my spreadsheet. Because if you tend to know what's the most important contributor to the predictive output, and that aligns with what Data Wrangler is saying, then you know you're sort of barking up the right tree. And not only that, but another cool thing about feature importances through SageMaker is that it's using something called SHAP, S-H-A-P. Now, whereas XGBoost allows you to pull out the feature importances for a single tree, a single tree. Remember, XGBoost, gradient boosting, kicks off a whole bunch of trees and then it sort of averages the vote amongst those trees. That's what we call a forest, a random forest.
So in order to get the feature importances, what you're really doing, actually, is pulling out the feature importances for a single tree of the forest, which may or may not be accurate. SHAP will actually determine the feature importances, like, at scale. So in the case of XGBoost, it won't just be dealing with a single tree of the forest. It will actually be dealing with the real feature importances of the full trained XGBoost model. But, B, it's able to run this feature importance against not just XGBoost, but other models, like a neural network, or linear regression, naive Bayes, and so on. Previously, feature importance was only available to XGBoost. Now you can actually get feature importances from any trained model available on AWS SageMaker. And that's using this SHAP library. It's actually an open-source library. You can use that for your other machine learning models that you're writing in Python, including neural networks. You're not stuck with SageMaker to get these feature importances outside of XGBoost, but it's nice that it has it in SageMaker outta the box in Data Wrangler. Oh my gosh, so much more territory to cover. That was feature one of the SageMaker pipeline, feature one, called Data Wrangler. It lets you transform your data, impute your data, analyze it, spit the whole thing through a pipeline, determine feature importances, and kick off a real quick dry-run machine learning model, all out of the gate with a user interface. Or you could do the whole thing in code using infrastructure as code, by way of something called Terraform or AWS CDK or whatever. We're talking about that in a future episode. Alright, let's move along. The next feature is called Feature Store, and I'll actually kind of skip past this, because Feature Store I did wrap up when I was talking about Data Wrangler. When I was talking about that: you do the feature transformations, and then it sort of applies layers on top of what was there previously. The data comes in through a feature transformation, now you have the output data; you have the input and the output. The Feature Store is this: the output, these layers. There's now a central repository of features after we have applied these transformation steps, these layers, and anybody on the data science team can access this feature store against the data lake, the ingested data that comes in through the pipeline. They can either access the data beforehand, or they can access the data after these transformation steps. And it's all in a centralized repository. On AWS SageMaker, that repository in your pipeline is called Feature Store. And so it makes gathering your features for various steps in your pipeline just a breeze. Really handy. Ground Truth. This is gonna blow your mind. Ground Truth: really, really powerful tool. You have training data, you have a spreadsheet, you have an RDS database. Now, sometimes you have labels, the cost of houses in downtown Portland, Oregon, or Boston, and sometimes you don't have labels. Now, where are you going to get these labels from if you don't have the labels? If you need to label your data, where are you gonna get your label data from? Now, sometimes we have data sets available on the internet by way of Microsoft or Google, that's in some dataset repository. Or if we're using Hugging Face Transformers, we can download a dataset through their library. scikit-learn, you can download it through the library. That's all well and good, but your data is specific to you and your company, and it's unique.
And you may be able to bootstrap from previously built sets on the web, but eventually you're gonna have data custom to your customers or your business use case, but you still need to label that data. Where are you gonna get those labels from? That's where Amazon Ground Truth comes in, part of the SageMaker pipeline. This is super cool. Ground Truth has basically three options for labeling your data. The first is that it can kick off your data to what's called Amazon Mechanical Turk. You may have heard of this before. It's a marketplace of contractors, freelancers, people all over the world who are getting paid some cents per label, or dollars per label, or per hour, or something. They're actually people who are tasked, almost like TaskRabbit or something, to perform some small task that's available through the Amazon marketplace, one of which is labeling data. So some number of people all over the world get access to your data if you click the yes button, we're gonna use Amazon Mechanical Turk. If you opt into that strategy, these workers, these contractors, will get access to your data and they will label the data. So let's say you have an image, and in the image is a cat, and we're doing a classification problem. So you hope that your Mechanical Turk worker labels the image as cat. If you have one Mechanical Turk worker, and you specify the number of Mechanical Turk workers, then that worker will decide whether it's a cat or a dog or a tree or a car. Okay, you can up it to two, to three, to four. You can specify any number of workers to work on labeling your data as you want, and it will average the predictions of those workers. So if four said cat and one said dog, well, we're gonna say that this label is cat. Now, of course, labels are typically more complex than classification of images. Sometimes we have bounding boxes around objects in an image, or pixel segmentation of objects within an image. So a Mechanical Turk worker may click every single pixel that is the person in that image. So maybe we have five workers, all five are clicking the pixels that are a person in that image, and it does some AWS special sauce that determines how to average that out and determine what the real pixels are, then, for accurately representing the pixel segmentation of the images that you're after. It also does some really cool stuff where, if one Mechanical Turk worker was very accurate in the past, it has a great track record, and another has a poor track record, it takes that into account, so it upweights the score of the highly accurate Mechanical Turk worker in the averaging of the label creation. That's possibility number one for labeling your data: if it's not sensitive and you just want the world to take a crack at it, you can outsource this to Mechanical Turk. And SageMaker has this incredible tooling around making sure that that process is streamlined and that you get accurate labels. But if your data is sensitive, then it provides a graphical user interface on the AWS console, so you can provide this, uh, login URL to on-premise workers, your employees, to perform this labeling job. So let's say that you are drawing bounding boxes around cancerous sections of an X-ray, and that may be in 2D or 3D images.
Due to the sensitive nature of these images, being in a hospital setting and HIPAA compliance, you can only kick off this labeling job to your doctors or your nurses or whatever. Ground Truth has tooling around providing a login URL, a graphical user interface for click-and-drag drawing of bounding boxes, clicking pixels, entering text categories, numerical values, whatever the case may be, for your employees to label your data. Super, super powerful tool. Then finally, the last feature it provides is that it can predict automatically the label of that row. If there's no label, but there's a bunch of labels for other data, it can use machine learning, like the imputation strategy I discussed before for filling in empty columns by guessing what that column value might be. In this case as well, it can predict the outcome label as if it was an inference engine, and evidently it's pretty powerful, it's pretty accurate. So really, really powerful stuff. Okay, next feature, and we're still in the data preparation phase. Next feature: Amazon Clarify. Amazon Clarify. Now, actually, when I was talking about Data Wrangler, I said how Feature Store sort of bleeds into that a little bit. A lot of these tools, there's a lot of overlap between the tools. The overarching framework is called SageMaker, and there is a graphical interface for the entire thing, called SageMaker Studio. It's like an IDE, an integrated development environment, like if you have PyCharm or Atom or Visual Studio Code. Well, this is a web-based IDE for managing all your data in the data pipeline, including the machine learning model training and deployment phases, and SageMaker Studio houses all of these features. You don't have to use SageMaker Studio; you could deploy this all through code using Terraform or CDK. But given that, let's say we're dealing with data ingestion and feature engineering and all that stuff in Data Wrangler. Well, if you're doing feature engineering, that's gonna bleed into Feature Store naturally. So a lot of these features bleed into each other. Well, Clarify bleeds into Data Wrangler as well. Clarify is for exploring the bias in your data and your machine learning model. This is really important. You may have heard a whole bunch of press surrounding biased models, you know, like determining who's eligible for a loan, or admission to college campuses, or whatever the case may be. Maybe there's racism or sexism, or maybe it's not so sinister and it's just simply leaning heavily towards some category over another. SageMaker Clarify will point out those biases in both the data and the trained machine learning model, and it will allow you to adjust the data accordingly. It will provide some insights and suggestions and tooling around improving the bias for your data and your machine learning model. And we'll talk about Clarify again later, because after you train and deploy your model, you might be using Clarify to monitor the bias of your deployed model in order to recalibrate it, either retrain it or adjust the data, whatever the case may be. Alright, we're moving on to the build phase. The build phase of the data pipeline of your machine learning stack using SageMaker. Well, the first one listed is SageMaker Studio, and I already mentioned that SageMaker Studio is simply an IDE for managing all of these services in the cloud in a graphical user interface. Fantastic tool, but we don't need to discuss this.
Just go on SageMaker's website and look at a video of Studio, how it looks visually. It's a visual thing, so I'm not gonna be able to speak to it much here. Autopilot. Maybe I should make this two episodes. Okay, Autopilot is so cool. Autopilot allows you to create a trained model and deploy it without writing any code at all, period. You take your data source, your data lake, your feature store, whatever the case, from the prior steps we talked about, that data preparation phase of your pipeline. Now your data's prepared, ready to go. You pipe it into Autopilot, and Autopilot will train a model and deploy it. Period. You don't need to know what model. You don't need to know if it's gonna be linear regression or logistic regression or XGBoost or what. It will look at your data, okay? You specify the label column. You tell it what the output column is gonna be, and it will determine automatically: are we dealing with a regression problem, or a classification problem, or a binary classification problem? Okay, ones and zeros, or any number of categories, or some number. It will determine that automatically. I mean, that's easy. It's easy to look at a column and determine what type of column we're dealing with, no big deal. It determines if we're gonna deal with regression or classification and so on. And then it determines, based on your data, the amount of rows and the amount of columns, the distribution of the columns, the number of missing values, and the types of feature transformations that you applied in previous steps that are coming out of your feature store, the right model. It determines whether we're gonna use XGBoost, linear regression, logistic regression, naive Bayes, and so on, automatically for you. And after determining the right model for the job, it auto-applies hyperparameter optimization. It does hyperparameter optimization for you outta the box, determining the right model for the job based on the data you give it, and then running all that through hyperparameter optimization to output a well-tuned, good trained model. How freaking awesome is that? How awesome is that? Now, note, this is only valuable for tabular data. I've talked about in the past, data can come in any number of types. We have, like, time-series-based data, like stock markets and language; you might pipe that into a recurrent neural network or a Transformers language model. You have space-based data; that would be a photo, a picture, or even stock market predictions if you were considering the data as if it were a photo. We talked about that in, uh, the Bitcoin trading episode. We would use a convolutional neural network for that. Okay. So if you're dealing with time or space, you won't be using Autopilot. But if you're dealing with tabular, which is the majority of the use cases of machine learning, then you can use Autopilot, and it will determine the right model for the job, the right hyperparameters for that model, will train a model for you, all part of the data pipeline, and deploy it to a REST endpoint for inference. Back when I said "all part of the data pipeline," that's important, because like I said, in the past you'd write a single Python script using a pandas data frame coming from a CSV or an RDS database, and that's not scalable. But you plug your whole data lake into a pipeline by way of SageMaker's Data Wrangler.
Out comes a feature store and, you know, scalable data architecture on a Spark backend, and the training of all that data goes through SageMaker Autopilot, and you can train this thing and then you can kick off a REST endpoint. Now, you may be thinking, wow, that's really cool; also, I'm a little concerned that it takes the reins too much, that I won't have much control. Well, you can take the reins back. You can sort of eject this model. If you're used to React, you know, you maybe use Create React App, which will create this, uh, really simplistic React code base environment on your localhost, and then if you ever really want to do major customizations to it, you can run npm eject. If none of that makes sense, ignore it; that's kind of a web development background. You can eject an Autopilot-trained model, and it will come out with a Jupyter notebook of Python code with all the models that it tried, all the hyperparameters that it tried for each of those models, the final model that's trained with the optimal hyperparameters, and so on, and then you can modify that model. You can either modify it and then redeploy it to a REST endpoint, or you can modify it maybe because we don't even want to deploy this to a REST endpoint in the first place; we just want to run these machine learning inference jobs as what's called batch transform jobs. I'll talk about that later. This is the equivalent of the scale-to-zero, one-off machine learning jobs that I was trying to get after in the last episode. So SageMaker Autopilot is this whole automated end-to-end solution for creating and deploying a machine learning model that you don't have to touch, but if you want, you can touch: you can eject it, and then you can fine-tune it to your heart's content. Super, super powerful. And you don't have to do all this as part of the data pipeline, okay? Forget everything I'm talking about with this pipeline stuff, with Data Wrangler and feature stores and whatnot. Let's say you just wanna upload a CSV and you just want a machine learning model, and you want to turn that into a REST endpoint, or you don't even want to turn it into a REST endpoint. You just wanna upload a CSV, generate a machine learning model, look at some of the data distributions, look at what machine learning model was selected, what are the feature importances, and so on. Again, remember how a lot of these features bleed into each other. Autopilot will allow you, per Data Wrangler and Feature Store and Clarify, to explore some of the metrics around your data and your model, which includes data distributions, feature importances, model bias, data bias, and so on. So if you just want to kick off a quick CSV exploration job and get a quick trained model, you can do that by uploading a CSV. You don't have to use all this SageMaker pipelining. Really powerful stuff. All right, in the build phase still, I'm gonna breeze over this real quick. Debugger: if you're familiar with TensorBoard, TensorFlow's GUI tooling for exploring a neural network's distributions of the numerical values of the weights at any given neuron, at any given layer, you can see if maybe you're having vanishing or exploding gradients. You can determine if you need some dropout or if you need some regularization. You can look to see if something's wrong with your machine learning model, especially your deep learning models, and that's where SageMaker Debugger comes into play. It's basically TensorBoard in the cloud with a GUI.
Auto-deployed, so you don't have to run this thing on localhost. Not only that, it can actually send you email notifications by way of CloudWatch if, you know, something's up, maybe during the training phase, or maybe, by way of SageMaker Clarify, your model is drifting or there's bias detected, or whatever the case may be. Now, this is the first time you're hearing me mention CloudWatch. CloudWatch is an AWS service that monitors certain things, typically by way of logs, maybe a regular expression; it's monitoring logs and it's looking for some pattern in the logs, but it can also monitor other things. It can monitor, let's say, usage, resource utilization of a server: RAM, CPU, GPU utilization, disk space usage. And if, let's say, the resources go too high, or something is wrong, or there's an anomaly detected in your server stack, then it can use CloudWatch to send you an email notification or a text message or something, that something is awry, something is amiss. And SageMaker Debugger can, during the training phase of a model, use CloudWatch to send you notifications that maybe your neurons are vanishing or exploding, something like this. And CloudWatch is integrated into almost all of the SageMaker tooling. Actually, this is the first time I'm mentioning it, but it's all over SageMaker. You can integrate CloudWatch to send you notifications if there's bias or drift of your data or your model, or whatever the case may be. And I did mention the notification of over or under resource utilization of a CPU or RAM or GPU or disk space. That will come in handy big time in machine learning as we dovetail into the next train-and-tune feature, being distributed training. In the distributed training feature of SageMaker, you can run your training jobs over multiple instances, multiple EC2 instances with GPUs and so on. Now you can tap into CloudWatch to determine if you're over- or under-utilizing GPU or CPU or RAM. If you're over-utilizing these things, then you may want to spin up more instances as part of the distributed training feature. If you're under-utilizing, you'll want to use fewer instances so that you can save money. And so you can say: if we're training this thing in the cloud, distributed across ten instances, and we're not using very many resources, send me an email notification so that I can alter the resource utilization; I can alter which GPU is used on these instances, which CPU is used on these instances, and so on. And the same for over-utilization. So the distributed training step of the SageMaker pipeline is just like it sounds. You can train your models in SageMaker, you can write them in an IPython notebook or you can write them in a Python file, and you can kick them off in a Python script, a training job. And the distributed training feature set of SageMaker lets you run that over multiple instances so you can save time. Previously, you would write your Keras model, your convolutional neural network, in a Python script on your localhost, you have a 1080 Ti GPU, and you run your script overnight, over two days, over three days, while it's training. Takes a long time to train. Well, there are ways to distribute this, to parallelize it. If you were to do this over multiple cores or multiple slices of your GPU on localhost, there's ways to do this, but SageMaker provides tooling that eases the burden of the distribution of training across multiple instances, again using Apache Spark under the hood, or it may be using
TensorFlow tooling for distributed training. SageMaker allows you to kick off a training job of your machine learning model and, with a very little amount of extra code in your Python file or in your Jupyter notebook, very little extra code, you can specify how your data from the data lake or the data source is going to be split across the different instances. Maybe we're keying them by ID or shard key, or if we're using an AWS S3 bucket with a bunch of CSVs, we might be sharding them. Okay, shard, S-H-A-R-D, means how one would split data across multiple instances, by save date or some substring of the file name or something like this, full, you know, root folder name or whatever the case may be. SageMaker allows you to specify how data will be split across the instances, and what the instance types are, the CPU, GPU, and RAM. And again, CloudWatch will notify you if you're over- or under-utilizing. How many of these instances to kick off, what the instance types are, how many to kick off, how the data gets distributed, and that's about it. As far as the actual model training stuff and the orchestration of recalibrating the training of the models that are running on different instances, how they communicate with each other what they've learned thus far and, like, unify that into a master algorithm, a master model upstream: SageMaker will handle all that stuff for you. Huge win. Normally, you would defer to a machine learning framework like TensorFlow or PyTorch and use their tooling for orchestrating the merging of training across the distributed instances, and it's a little bit more heavy on the code side, on the Python scripting side. SageMaker eases that burden a lot with their distributed training tooling. Okay, so we have our data. We ingested it from a data store or a data lake, a data set. Normally we say data lake; data lake means a whole bunch of different data sets that have something in common with each other. So we ingest our data from a data lake. We feature engineer it through Data Wrangler. Uh, we store those features to Feature Store. We do a little bit of assessment on our data by way of SageMaker Clarify and SageMaker Data Wrangler, a little bit of data analysis: determine feature importances, some quickie models, determine what models we might wanna use downstream. Just a bird's-eye view. Okay, at this point the episode is running very long. I'm gonna split this into two episodes 'cause we still have quite a bit of SageMaker to cover. But per usual, I will be listing some resources so that you can learn some of this AWS SageMaker tooling offline without my help. And in the next episode, we'll return to the deploy-and-manage phase of the SageMaker pipeline. See you then.
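
The transcript above mentions SHAP for feature importances that go beyond a single XGBoost tree. A minimal local sketch with the open-source shap library (the dataset and model here are just for illustration):

```python
# Global feature importances via SHAP values on a trained XGBoost model.
import shap
import xgboost as xgb
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target

model = xgb.XGBRegressor(n_estimators=100).fit(X, y)

explainer = shap.TreeExplainer(model)   # fast, exact explainer for tree ensembles
shap_values = explainer.shap_values(X)  # per-row, per-feature contributions
shap.summary_plot(shap_values, X)       # aggregate view of which features matter most
```

The same library has model-agnostic explainers (e.g. KernelExplainer) for neural networks and linear models, which is the property SageMaker's built-in explainability reports build on.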