MLA 015 AWS SageMaker MLOps 1

Nov 03, 2021

SageMaker is an end-to-end machine learning platform on AWS that covers every stage of the ML lifecycle, including data ingestion, preparation, training, deployment, monitoring, and bias detection. The platform offers integrated tools such as Data Wrangler, Feature Store, Ground Truth, Clarify, Autopilot, and distributed training to enable scalable, automated, and accessible machine learning operations for both tabular and large data sets.

Resources
Designing Machine Learning Systems
Machine Learning Engineering for Production Specialization
Data Science on AWS: Implementing End-to-End, Continuous AI and Machine Learning Pipelines
Amazon SageMaker Technical Deep Dive Series
Show Notes

Amazon SageMaker: The Machine Learning Operations Platform

MLOps is the practice of deploying and operating your ML models in the cloud. See MadeWithML for an overview of the tooling landscape (and, generally, a great ML educational run-down).

Introduction to SageMaker and MLOps

  • SageMaker is a comprehensive platform offered by AWS for machine learning operations (MLOps), allowing full lifecycle management of machine learning models.
  • Its popularity provides access to extensive resources, educational materials, community support, and job market presence, amplifying adoption and feature availability.
  • SageMaker can replace traditional local development environments, such as setups using Docker, by moving data processing and model training to the cloud.

Data Preparation in SageMaker

  • SageMaker manages diverse data ingestion sources such as CSV, TSV, Parquet files, databases like RDS, and large-scale streaming data via AWS Kinesis Firehose.
  • The platform introduces the concept of data lakes, which aggregate multiple related data sources for big data workloads.
  • Data Wrangler is the entry point for data preparation, enabling ingestion, feature engineering, imputation of missing values, categorical encoding, and principal component analysis, all within an interactive graphical user interface (a rough local equivalent of these transformations is sketched after this list).
  • Data Wrangler leverages distributed computing frameworks like Apache Spark to process large volumes of data efficiently.
  • Visualization tools are integrated for exploratory data analysis, offering table-based and graphical insights typically found in specialized tools such as Tableau.
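
A rough, hypothetical local sketch of what these Data Wrangler steps do under the hood, using pandas and scikit-learn; the CSV path and column names are made up for illustration:

```python
# Local approximation of common Data Wrangler steps: date-part extraction,
# median imputation, one-hot encoding of categoricals, and PCA.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("tweets.csv")  # hypothetical dataset with "date", "category", numeric columns, "label"

# Date feature engineering (what Data Wrangler suggests automatically for date columns).
df["date"] = pd.to_datetime(df["date"])
df["day_of_week"] = df["date"].dt.dayofweek
df["week_of_year"] = df["date"].dt.isocalendar().week.astype(int)
df = df.drop(columns=["date"])

X = df.drop(columns=["label"])
numeric_cols = X.select_dtypes("number").columns
categorical_cols = X.select_dtypes("object").columns

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),  # impute missing values
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_cols),  # encode categories (scikit-learn >= 1.2)
])
pipeline = Pipeline([("preprocess", preprocess), ("pca", PCA(n_components=5))])
features = pipeline.fit_transform(X)
```

Data Wrangler packages these same kinds of steps as GUI "layers" and can run them on Spark rather than a single machine.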

Feature Store

  • Feature Store acts as a centralized repository to save and manage transformed features created during data preprocessing, ensuring different steps in the pipeline access consistent, reusable feature sets.
  • It facilitates collaboration by making preprocessed features available to various members of a data science team and across different models; a minimal SDK sketch follows below.
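
A minimal sketch of pushing engineered features into Feature Store with the SageMaker Python SDK; the feature group name, record identifier, S3 URI, and role ARN are placeholders, and exact arguments can vary by SDK version:

```python
# Register a feature group from a pandas DataFrame and ingest rows into it.
import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder IAM role

df = pd.read_csv("engineered_features.csv")  # output of the data-prep step (hypothetical)
df["event_time"] = float(time.time())        # Feature Store requires an event-time column

feature_group = FeatureGroup(name="tweets-features", sagemaker_session=session)
feature_group.load_feature_definitions(data_frame=df)  # infer feature names/types from the DataFrame
feature_group.create(
    s3_uri="s3://my-bucket/feature-store/",             # offline store location (placeholder)
    record_identifier_name="tweet_id",                  # unique ID column (placeholder)
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,                           # low-latency lookups for inference
)
feature_group.ingest(data_frame=df, max_workers=3, wait=True)
```

Other pipeline steps (training, batch transforms, online inference) can then read the same features back instead of re-deriving them.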

Ground Truth: Data Labeling

  • Ground Truth provides automated and manual data labeling options, including outsourcing to Amazon Mechanical Turk or assigning tasks to internal employees via a secure AWS GUI.
  • The system ensures quality by averaging multiple annotators’ labels and upweighting reliable workers, and can also perform automated label inference when partial labels exist.
  • This flexibility addresses both sensitive and high-volume labeling requirements; a rough API sketch follows below.
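
For reference, a labeling job can also be created programmatically. This is a heavily simplified boto3 sketch, not a copy-paste recipe: every ARN, S3 path, and Lambda here is a placeholder, and the pre-annotation/consolidation Lambdas and UI template depend on the task type (see the Ground Truth docs):

```python
# Sketch: create a Ground Truth labeling job for image classification with a private workteam.
import boto3

sm = boto3.client("sagemaker")
sm.create_labeling_job(
    LabelingJobName="cat-dog-labels",
    LabelAttributeName="animal",
    InputConfig={"DataSource": {"S3DataSource": {"ManifestS3Uri": "s3://my-bucket/manifest.json"}}},
    OutputConfig={"S3OutputPath": "s3://my-bucket/labels/"},
    RoleArn="arn:aws:iam::123456789012:role/GroundTruthRole",         # placeholder
    LabelCategoryConfigS3Uri="s3://my-bucket/label-categories.json",  # e.g. ["cat", "dog"]
    HumanTaskConfig={
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:123456789012:workteam/private-crowd/my-team",
        "UiConfig": {"UiTemplateS3Uri": "s3://my-bucket/template.liquid"},
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-1:123456789012:function:PreLabelTask",
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-east-1:123456789012:function:ConsolidateLabels"
        },
        "TaskTitle": "Classify images",
        "TaskDescription": "Label each image as cat or dog",
        "NumberOfHumanWorkersPerDataObject": 3,  # multiple annotators per item, consolidated/averaged
        "TaskTimeLimitInSeconds": 300,
    },
)
```

Swapping the private workteam ARN for the public Mechanical Turk workforce (plus a task price) is how the crowdsourced option is configured.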

Clarify: Bias Detection

  • Clarify identifies and analyzes bias in both datasets and trained models, offering measurement and reporting tools to improve fairness and compliance.
  • It integrates seamlessly with other SageMaker components for continuous monitoring and re-calibration in production deployments; a minimal sketch follows below.
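
A hedged sketch of running a pre-training bias report with the SageMaker Python SDK's Clarify helpers; the role, bucket, and column names are placeholders, and arguments may differ slightly across SDK versions:

```python
# Produce a bias report for a training dataset with SageMaker Clarify.
import sagemaker
from sagemaker import clarify

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder IAM role

processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.m5.xlarge", sagemaker_session=session
)
data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/train.csv",   # placeholder dataset
    s3_output_path="s3://my-bucket/clarify-report/",
    label="approved",                                # target column (placeholder)
    headers=["approved", "age", "income", "gender"],
    dataset_type="text/csv",
)
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],  # which label value counts as the favorable outcome
    facet_name="gender",            # sensitive attribute to measure bias against
)
processor.run_pre_training_bias(data_config=data_config, data_bias_config=bias_config)
```

Post-training bias and explainability reports follow the same pattern, with a trained model's config added.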

Build Phase: Model Training and AutoML

  • SageMaker Studio offers a web-based integrated development environment to manage all aspects of the pipeline visually.
  • Autopilot automates the selection, training, and hyperparameter optimization of machine learning models for tabular data, producing an optimal model and optionally creating reproducible code notebooks.
  • Users can take over the automated pipeline at any stage to customize or extend the process if needed (a minimal SDK sketch follows below).
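
A minimal sketch of driving Autopilot from code rather than the Studio GUI, using the SageMaker Python SDK; the S3 paths, role, and target column are placeholders:

```python
# Launch an Autopilot job on a tabular CSV and deploy the best candidate.
import sagemaker
from sagemaker.automl.automl import AutoML

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder IAM role

automl = AutoML(
    role=role,
    target_attribute_name="price",  # the label column; problem type is inferred from it
    max_candidates=20,              # cap the model/hyperparameter candidates to control cost
    sagemaker_session=session,
)
automl.fit(inputs="s3://my-bucket/train.csv", wait=True, logs=False)

# Deploy the best candidate behind a managed REST endpoint.
predictor = automl.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```

The generated candidate notebooks are what let you "take over" and customize the pipeline by hand.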

Debugger and Distributed Training

  • Debugger provides real-time training monitoring, similar to TensorBoard, and offers notifications for anomalies such as vanishing or exploding gradients by integrating with AWS CloudWatch.
  • SageMaker’s distributed training feature enables users to train models across multiple compute instances, optimizing for hardware utilization, cost, and training speed.
  • The system allows for sharding of data and auto-scaling based on resource utilization monitored via CloudWatch notifications (see the sketch below for Debugger rules and distributed training).
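
A sketch of wiring both ideas into one training job with the SageMaker Python SDK: built-in Debugger rules that flag gradient problems, plus SageMaker's data-parallel distributed training. The script name, role, and instance choices are placeholders, and the data-parallel library only runs on certain GPU instance types:

```python
# Attach Debugger rules and enable data-parallel training on a PyTorch estimator.
import sagemaker
from sagemaker.debugger import Rule, rule_configs
from sagemaker.pytorch import PyTorch

role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder IAM role

estimator = PyTorch(
    entry_point="train.py",            # your training script (placeholder)
    role=role,
    framework_version="1.12",
    py_version="py38",
    instance_count=2,                  # shard the work across two GPU instances
    instance_type="ml.p3.16xlarge",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    rules=[
        Rule.sagemaker(rule_configs.vanishing_gradient()),  # flag vanishing gradients
        Rule.sagemaker(rule_configs.exploding_tensor()),    # flag exploding values
    ],
)
estimator.fit({"train": "s3://my-bucket/train/"})
```

Rule evaluations surface in the training-job status and as CloudWatch events, which is where email or SMS alerting can be hooked in.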

Summary Workflow and Scalability

  • The SageMaker pipeline covers every aspect of machine learning workflows, from ingestion, cleaning, and feature engineering, to training, deployment, bias monitoring, and distributed computation.
  • Each tool is integrated to provide either no-code, low-code, or fully customizable code interfaces; an end-to-end code sketch follows this list.
  • The platform supports scaling from small experiments to enterprise-level big data solutions.
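
As an end-to-end example of the "fully customizable code" path, a sketch that trains SageMaker's built-in XGBoost on CSVs in S3 and deploys it as a REST endpoint (bucket paths and the role are placeholders):

```python
# Train built-in XGBoost on S3 data and deploy it behind a managed endpoint.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder IAM role
region = session.boto_region_name

image_uri = sagemaker.image_uris.retrieve("xgboost", region, version="1.5-1")
xgb = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/models/",
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="reg:squarederror", num_round=100)

# Built-in XGBoost expects CSVs with the label in the first column and no header row.
train_input = TrainingInput("s3://my-bucket/train/", content_type="text/csv")
xgb.fit({"train": train_input})

predictor = xgb.deploy(initial_instance_count=1, instance_type="ml.m5.large")
# predictor.predict(...) now calls a live HTTPS endpoint; remember to clean up:
# predictor.delete_endpoint()
```

Raising `instance_count` and setting the input's `distribution="ShardedByS3Key"` is how the same job scales out to distributed training over sharded data.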

Useful AWS and SageMaker Resources


Transcript
Welcome back to Machine Learning Applied, and this is gonna be a very important episode where we discuss Amazon SageMaker, AWS, Amazon Web Services, SageMaker. In the last episode, I talked about deploying your machine learning model to the server. It was an episode called Machine Learning Server. Well, I was in over my head. I'm older and wiser now, and I know now that this concept is called machine learning operations, or MLOps. You may be familiar, if you're a web developer or a server developer, with something called DevOps, developer operations, which has effectively replaced systems administration. It's the concept of deploying your server to the cloud, your front end to the cloud, et cetera, making these servers scalable, microservices architectures, all these things. Well, deploying your machine learning models to the server, a new concept in the world of data science and machine learning, is called MLOps, or machine learning operations. In the last episode, the machine learning server episode, I talked about a few services. I did talk about SageMaker. I talked about AWS Lambda. And then I talked about a handful of auxiliary services like Cortex.dev, Paperspace, and FloydHub. Well, I hate to say this to those mom and pops, I'm sorry, but toss those out the window, because SageMaker is gonna blow your mind. SageMaker is way more powerful than I thought it was. It has more bells and whistles than I realized. There are ways to reduce cost, one of the biggest gripes that I had with it in the prior episode, ways to handle a REST endpoint up in the cloud, a.k.a. scale to zero. And that's not all. The sky's the limit with the amount of features available by way of SageMaker. Okay, now I may talk about GCP, Google Cloud Platform, and I may also talk about Microsoft Azure in the future, but I also may not, because I feel like I'm completely sold on SageMaker. I think I'm going all in on AWS. Why? Well, one is AWS is just clearly the most popular cloud hosting provider out there. It's just the most popular; it seems clear as day to me. And what does popularity give you? It gives you more resources, whether that's educational material or modules and plugins or prospective employees and hires available on the marketplace. And so popularity is not something I shake a stick at when it comes to deciding on some technology or technology stack. I know it may be a controversial point, but in this particular case, the popularity of SageMaker seems to have amplified its capacity by way of the features that they ended up building into the framework. And as you'll see in this episode, SageMaker is absolutely astounding. The amount of features, bells and whistles, and capabilities of this framework are just out of this world, and it may indeed replace my entire localhost development setup, even what I talked about in prior episodes using Docker and Docker Compose. I may be switching to completely going all in on AWS, including developing against AWS on localhost by way of something called LocalStack. I'll talk about that in the next episode, but SageMaker has absolutely wowed me. It has dazzled me. So let's get into it. What is SageMaker? Well, in the last episode, I talked about deploying your machine learning model to the cloud by way of some of these services. And I did mention SageMaker. I also mentioned AWS Lambda. Now AWS Lambda is still a viable option under certain circumstances for deploying your machine learning models to the cloud.
But for larger models, especially GPU-centric models, and especially big-data-centric models and data pipelines that have scalability in mind, then you're gonna be using AWS SageMaker. SageMaker is not just an MLOps platform. It's not just about deploying your model to the cloud. It is the entire end-to-end stack of an entire machine learning anything, period. You're gonna collect your data using SageMaker, prepare your data using SageMaker, train your model, deploy that trained model to the web as a REST endpoint or, as we'll discuss, as running one-off inference jobs, and then monitor that model to make sure it's not drifting, or keep an eye on bias, and all these things. You're gonna be able to use SageMaker to label your data. There's even tooling in SageMaker that lets you deploy your model to an edge device. Remember edge, I talked about this before. Edge means it's on a device. So if you have a camera that has some tiny machine learning model that's running on the camera and not in the cloud somewhere, it's not making REST calls with those video frames, we call that model running at the edge. So we'll talk about deploying your SageMaker model to the edge using AWS SageMaker Neo, but we'll get to that at the end. I'm gonna take you through the SageMaker features step by step. Let's just list those features now, starting from the top. In the data preparation phase, we have Data Wrangler, Feature Store, Ground Truth, and Clarify. In the build phase, we have SageMaker Studio, Autopilot, and JumpStart. In the train and tune phase, we have Debugger and distributed training. And in the deploy and manage phase, we have deployment, Pipelines, Model Monitor, Kubernetes integration, Edge Manager, and Neo. And this is all just listed on the SageMaker website, aws.amazon.com/sagemaker. You just hover over the Features menu item and these will all be listed in the dropdown. So let's just take this one feature at a time: data preparation. So we're in the data preparation phase of a pipeline. A pipeline in machine learning or data science is where you take your data from some source, you're gonna be ingesting your data, and then you're gonna send it through a series of steps, a pipeline, before it even hits the machine learning training phase. And then the very end of that pipeline is a trained model that's deployed to a REST endpoint. In that pipeline, in the entire pipeline of your data stuff, the first step is going to be getting your data, collecting your data. We call it data ingestion, ingesting your data. Now, that data may come from a CSV on Amazon S3, or it may come from a TSV, a tab-separated value file, or something called a Parquet file. I haven't discussed that before, but I'll discuss it later. Parquet, P-A-R-Q-U-E-T. It's actually very common, especially in Amazon SageMaker, stored on S3. Or maybe your database, RDS, Postgres, MySQL, Microsoft SQL Server, whatever the case may be. Or it may be streaming from the web, coming from the internet somewhere, as something we call a firehose. A firehose is basically if you're collecting data that's maybe updating every second, or millisecond even. So for example, tweets on Twitter. Twitter is a firehose. It is so much data so fast that don't bother storing this in a database, especially not an RDS database. Maybe you might put this in a NoSQL database like DynamoDB, but more likely the case, you're actually gonna be piping this through one of AWS's services like Amazon Kinesis Firehose. So you have any number of data sources. Okay.
A data source is a single source of data. One data source might be a CSV, another data source might be a TSV, another data source might be your database. A data lake is a collection of these data sources that are related to each other. So if you're doing a whole bunch of analysis on tweets on Twitter, you might be storing some data in a DynamoDB table, you might be piping some other data straight through AWS Kinesis Firehose, and you might be saving some other data to AWS S3 as CSV files or Parquet files. If all of these data sources are related to each other, you might store them all together, maybe in a single VPC, and this is called a data lake. A data lake. And when we say big data, big data means we're dealing with so much data that you couldn't possibly run this all in a single Python script. You couldn't just load all of this data, this firehose from Kinesis, the data from the RDS database, and the CSVs all into some pandas data frame in a Python script on a single server. Anything beyond that capacity, running this all in a data frame on a server, and you're dealing with what's called big data. So big data just means you need to scale this data. You need to scale this data. And so right off the bat, right away, we find value in using SageMaker for dealing with data lakes, if the data you're working with is big, is big data, which is gonna be nine times outta ten anyway. So if you have a lot of data coming into your machine learning system through your pipeline, you're gonna want to use one of these MLOps data pipelines like SageMaker, and the tool here, the entry point of the SageMaker data preparation phase of the pipeline, is called Data Wrangler. Data Wrangler. What it'll do for you is it'll allow you to specify where your data is gonna be ingested from, where you're ingesting your data from, this data lake, and then what are we gonna do with that data? Well, you're gonna be doing some feature transformation, so we're gonna transform the data into features. And then we're gonna also be imputing missing data. Remember, imputing, I-M-P-U-T-E, means filling in the missing values of data wherever there's missing values. Maybe you wanna fill in the nulls or NaNs with the mean of that column or the median, or maybe guess the missing value. There's actually tooling around that in SageMaker, trying to fill in the missing values with some estimate based on the other features of the rows, which is already super powerful. And so SageMaker Data Wrangler allows you to do some feature engineering, some imputation, as well as some visualization: visualizing the distribution of your data per feature, deciding which features are the most important, feature importances, and then piping your data into, almost like a dry run, a quickie machine learning model like XGBoost, one of these off-the-shelf, real obvious machine learning model implementations, to see how your data would perform before we even get to the training phase, in case you may want to feature engineer and transform and impute your data some more before we get to that phase. But let me step back just a little bit and say that in the past, normally, in prior episodes, we would've done this by way of pandas and NumPy. We would've done this feature engineering and imputation in pandas.
Well, that's all fine and good, but if your data needs to scale, if you are ingesting data at scale, big data, which is gonna be most of the time when you're actually deploying your entire company to the internet and you're finally getting lots of customers and suddenly you're getting lots of data from somewhere, you wanna be prepared for big data. And you want to have your pipeline set up in advance so that the data preparation phase doesn't just run on a single Python script. Instead, it runs in a distributed, parallelized fashion, usually, in the case of Amazon SageMaker, by way of what's running under the hood, called Apache Spark, which is a distributed parallelization framework, especially for data science and machine learning, which traditionally runs either Scala, which is a JVM language, or Python by way of PySpark. So SageMaker does a lot of its parallelization for distributed data pipelining by way of Apache Spark. But you wanna be prepared for big data in your data pipeline, and that's where Data Wrangler comes into play. So let's talk about some of these features. The first feature is feature engineering, being able to transform your data into features. So for example, if you have a date, if you have a tweet coming from Twitter and it has a date, 2021, October 28th, you may want to pull some stuff out of that date. One thing may be the day of week, or week of year, or time of day, and so on. You may want to turn that date into a number of integer columns, because, you remember, in most machine learning algorithms we really wanna work with numbers, not with strings, not with dates: numbers. So one feature transformation step of this phase of the pipeline might be transforming dates. Now, in the past, we might use pandas to pull these features out of a date column, and pandas is very slick at handling these things. Well, Data Wrangler has, as part of its tooling, a whole suite of prebuilt feature engineering steps for common features. So if it automatically picks up a date column in your CSVs or your RDS column, it will suggest: this is a date, I think maybe we should pull out this, that, and the other feature automatically for you. So not only are you preparing yourself for the internet by deploying your model at scale using SageMaker, but outta the box you're already saving time and code, Python code, if you use Data Wrangler's auto feature engineering suggestion steps that it picks up by analyzing your data. So that's one part of the feature engineering phase. Another part of feature engineering is imputation strategies. So if you have a whole bunch of missing data in your spreadsheet, a bunch of nulls, missing data, well, traditionally we'd use pandas in a Python script and we would either fill those in with the mean of that column or the median of that column. That's a very common strategy for filling in holes. Another common strategy is removing rows for which that column is empty. Okay? If it's important that you don't train on data for which that column is empty, or if you don't know how to handle filling in that value, you don't know if mean or median or max or min would be wise, in this case maybe we'll just remove the row. When would you not want to remove a row for an empty column? If the net result leaves you with not very much training data. But you don't have to make this decision. SageMaker Data Wrangler will help you make that decision. It'll give you some suggestions.
Pro tip: by removing rows for this empty column, you're gonna be left with not much training data; we suggest filling it in with the mean or the median. Further, it has a feature for actually filling in the value of that column with a likely candidate for that missing value, based on the rest of the columns of that row. That is slick. That is super powerful. In other words, it uses a machine learning model to predict what would go into that missing slot for that row based on the values for other rows and what they had for that column. So that's something you don't get with pandas outta the box. That's super impressive. Finally, another quick, cool feature engineering strategy that it gives you outta the box is for string columns. If it thinks these things are categories, it can turn them into a one-hot encoded vector. Very cool. Right outta the box, no pandas, you get one-hot encodings of your string-based category columns. And the whole thing can be piped through principal component analysis to dimensionality-reduce your data. Now, what do you get at the end of this phase? Well, each feature that you feature engineer, you apply a feature engineering step as if it was a layer in Photoshop. It's like your data comes in and you say, okay, for the date columns I want to apply date feature engineering, I wanna pull out the time of day, day of week, week of year, blah, blah, blah. And that's a layer. That's a layer on top of that step of the data pipeline. Okay? Now you have a new set of data and it goes on to the next step of the Data Wrangler pipeline. You apply a new feature engineering step, whether that's imputation or one-hot encoding and so on. Now, unlike writing your feature engineering process in pandas, in a Python file, where what you get at the end is the data, maybe you're gonna stuff that back into an RDS database or that's gonna be piped into your machine learning model. What came into your pandas data frame was raw data. What came out of your pandas data frame was clean data, but that process was destructive to the data. It was destructive. There may be some other part of your machine learning pipeline that wants the date column, or that wants the string column, or that is okay with nulls; we don't want the imputation. So unlike doing this stuff through pandas in a Python file, Data Wrangler on SageMaker applies these feature engineering steps through a pipeline as if they're Photoshop layers, so that you can access any of these steps through the pipeline. And that's why we call it a pipeline. It's because, if you imagine pipes, the water going through the pipe can split in the pipe and go left and right to various parts of your data science application. But it doesn't stop there. Data Wrangler comes with a whole bunch of visualization tools, as if they were competing with Tableau. I've mentioned Tableau in a previous episode on data analysis, data exploration, EDA, exploratory data analysis. Tableau is a desktop application that, you know, you download, you pay for a premium subscription, and you punch in your CSV and you explore your data. It gives you some charts and graphs. Very cool. We also talked about matplotlib and seaborn, all these different libraries that you might use in a Jupyter notebook to explore your data. Well, SageMaker Data Wrangler has a graphical user interface, and that graphical user interface allows you to, first off, set up the data pipelining that I just described with the feature engineering.
And second off, it lets you explore your data graphically, visually, using different charts and graphs. It competes with Tableau. It competes with that program, all on your AWS stack. Are you excited yet? This is just one tool, of the, what, ten that I just listed, of the SageMaker pipeline, and it's now potentially replacing some program you might pay a large sum for, a pro subscription, to download on your desktop for exploring your data. Well, this is all part of the pipeline that you're gonna be setting up your whole machine learning project in; might as well have the whole thing strung together for free. You know, run this thing on a t1.micro, kick off a SageMaker Studio project and whatnot. You get the data exploration phase for free and built into your data pipeline phase. So it's not just for looking at your data, it's also assessing, applying some feature engineering strategy, reassessing, and so on. And now click submit, save, yes, we're good to go on to the next phase. Finally, and this is so cool, Data Wrangler will let you take the output of your data pipeline (you've done all your feature transformations, your imputations, you've done some analysis, visual analysis and whatnot) and just pipe the whole thing into a quick dry-run machine learning model. Just run the whole thing through XGBoost or linear regression or logistic regression. It will decide on which model to use based on the label column that you select. It will do a whole bunch of hyperparameter optimization and meta-learning and decide which model to use. Probably gonna be XGBoost. Okay, we did some hyperparameter optimization on XGBoost, we selected XGBoost, we've got these features, out comes a regression prediction on this numerical label column. And it then tells you the feature importances of your data, of your data lake, of your data source. The feature importances. This is something in the past I mentioned we might use XGBoost for: XGBoost, train, parentheses, your data. Now you have a trained model. You say: trained model, dot, feature importances, and it will tell you which feature of your data source contributed the most to making the predicted output. Well, unlike XGBoost, SageMaker Data Wrangler's feature importances output is, A, visual. You get that right outta the box. You just pipe in your CSV, it kicks off a dry-run machine learning quickie, determines the feature importances, and now you have a graph that you can just eyeball and see which features seem to be the most important. Super handy. You could do that in XGBoost by way of matplotlib, seaborn, whatever the case may be, but you'll have to wire up some code. Data Wrangler will do this automatically for you, and it's a really handy eyeball of what seems to be the most important in my spreadsheet. Because if you tend to know what's the most important contributor to the predictive output, and that aligns with what Data Wrangler is saying, then you know you're sort of barking up the right tree. And not only that, but another cool thing about feature importances through SageMaker is that it's using something called SHAP, S-H-A-P. Now, whereas XGBoost allows you to pull out the feature importances for a single tree, a single tree. Remember, XGBoost, gradient boosting, kicks off a whole bunch of trees and then it sort of averages the vote amongst those trees. That's what we call a forest, a random forest.
So in order to get the feature importances, what you're really doing, actually, is pulling out the feature importances for a single tree of the forest, which may or may not be accurate. SHAP will actually determine the feature importances, like, at scale. So in the case of XGBoost, it won't just be dealing with a single tree of the forest. It will actually be dealing with the real feature importances of the full trained XGBoost model. But, B, it's able to run this feature importance against not just XGBoost, but other models, like a neural network, or linear regression, naive Bayes, and so on. Previously, feature importance was only available to XGBoost. Now you can actually get feature importances from any trained model available on AWS SageMaker. And that's using this SHAP library. It's actually an open-source library. You can use that for your other machine learning models that you're writing in Python, including neural networks. You're not stuck with SageMaker to get these feature importances outside of XGBoost, but it's nice that it has it in SageMaker outta the box in Data Wrangler. Oh my gosh, so much more territory to cover. That was feature one of the SageMaker pipeline, feature one, called Data Wrangler. It lets you transform your data, impute your data, analyze it, spit the whole thing through a pipeline, determine feature importances, and kick off a real quick dry-run machine learning model, all out of the gate with a user interface. Or you could do the whole thing in code using infrastructure as code, by way of something called Terraform or AWS CDK or whatever. We're talking about that in a future episode. Alright, let's move along. The next feature is called Feature Store, and I'll actually kind of skip past this, because Feature Store I did wrap up when I was talking about Data Wrangler. When I was talking about that: you do the feature transformations, and then it sort of applies layers on top of what was there previously. The data comes in through a feature transformation, now you have the output data; you have the input and the output. The Feature Store is this: the output, these layers. There's now a central repository of features after we have applied these transformation steps, these layers, and anybody on the data science team can access this feature store against the data lake, the ingested data that comes in through the pipeline. They can either access the data beforehand, or they can access the data after these transformation steps. And it's all in a centralized repository. On AWS SageMaker, that repository in your pipeline is called Feature Store. And so it makes gathering your features for various steps in your pipeline just a breeze. Really handy. Ground Truth. This is gonna blow your mind. Ground Truth: really, really powerful tool. You have training data, you have a spreadsheet, you have an RDS database. Now, sometimes you have labels, the cost of houses in downtown Portland, Oregon, or Boston, and sometimes you don't have labels. Now, where are you going to get these labels from if you don't have the labels? If you need to label your data, where are you gonna get your label data from? Now, sometimes we have data sets available on the internet by way of Microsoft or Google, that's in some dataset repository. Or if we're using Hugging Face Transformers, we can download a dataset through their library. scikit-learn, you can download it through the library. That's all well and good, but your data is specific to you and your company, and it's unique.
And you may be able to bootstrap from previously built sets on the web, but eventually you're gonna have data custom to your customers or your business use case, but you still need to label that data. Where are you gonna get those labels from? That's where Amazon Ground Truth comes in, part of the SageMaker pipeline. This is super cool. Ground Truth has basically three options for labeling your data. The first is that it can kick off your data to what's called Amazon Mechanical Turk. You may have heard of this before. It's a marketplace of contractors, freelancers, people all over the world who are getting paid some cents per label, or dollars per label, or per hour, or something. They're actually people who are tasked, almost like TaskRabbit or something, to perform some small task that's available through the Amazon marketplace, one of which is labeling data. So some number of people all over the world get access to your data if you click the yes button, we're gonna use Amazon Mechanical Turk. If you opt into that strategy, these workers, these contractors, will get access to your data and they will label the data. So let's say you have an image, and in the image is a cat, and we're doing a classification problem. So you hope that your Mechanical Turk worker labels the image as cat. If you have one Mechanical Turk worker, and you specify the number of Mechanical Turk workers, then that worker will decide whether it's a cat or a dog or a tree or a car. Okay, you can up it to two, to three, to four. You can specify any number of workers to work on labeling your data as you want, and it will average the predictions of those workers. So if four said cat and one said dog, well, we're gonna say that this label is cat. Now, of course, labels are typically more complex than classification of images. Sometimes we have bounding boxes around objects in an image, or pixel segmentation of objects within an image. So a Mechanical Turk worker may click every single pixel that is the person in that image. So maybe we have five workers, all five are clicking the pixels that are a person in that image, and it does some AWS special sauce that determines how to average that out and determine what the real pixels are, then, for accurately representing the pixel segmentation of the images that you're after. It also does some really cool stuff where, if one Mechanical Turk worker was very accurate in the past, it has a great track record, and another has a poor track record, it takes that into account, so it upweights the score of the highly accurate Mechanical Turk worker in the averaging of the label creation. That's possibility number one for labeling your data: if it's not sensitive and you just want the world to take a crack at it, you can outsource this to Mechanical Turk. And SageMaker has this incredible tooling around making sure that that process is streamlined and that you get accurate labels. But if your data is sensitive, then it provides a graphical user interface on the AWS console, so you can provide this, uh, login URL to on-premise workers, your employees, to perform this labeling job. So let's say that you are drawing bounding boxes around cancerous sections of an X-ray, and that may be in 2D or 3D images.
Due to the sensitive nature of these images, being in a hospital setting and HIPAA compliance, you can only kick off this labeling job to your doctors or your nurses or whatever. Ground Truth has tooling around providing a login URL, a graphical user interface for click-and-drag drawing of bounding boxes, clicking pixels, entering text categories, numerical values, whatever the case may be, for your employees to label your data. Super, super powerful tool. Then finally, the last feature it provides is that it can predict automatically the label of that row. If there's no label, but there's a bunch of labels for other data, it can use machine learning, like the imputation strategy I discussed before for filling in empty columns by guessing what that column value might be. In this case as well, it can predict the outcome label as if it was an inference engine, and evidently it's pretty powerful, it's pretty accurate. So really, really powerful stuff. Okay, next feature, and we're still in the data preparation phase. Next feature: Amazon Clarify. Amazon Clarify. Now, actually, when I was talking about Data Wrangler, I said how Feature Store sort of bleeds into that a little bit. A lot of these tools, there's a lot of overlap between the tools. The overarching framework is called SageMaker, and there is a graphical interface for the entire thing, called SageMaker Studio. It's like an IDE, an integrated development environment, like if you have PyCharm or Atom or Visual Studio Code. Well, this is a web-based IDE for managing all your data in the data pipeline, including the machine learning model training and deployment phases, and SageMaker Studio houses all of these features. You don't have to use SageMaker Studio; you could deploy this all through code using Terraform or CDK. But given that, let's say we're dealing with data ingestion and feature engineering and all that stuff in Data Wrangler. Well, if you're doing feature engineering, that's gonna bleed into Feature Store naturally. So a lot of these features bleed into each other. Well, Clarify bleeds into Data Wrangler as well. Clarify is for exploring the bias in your data and your machine learning model. This is really important. You may have heard a whole bunch of press surrounding biased models, you know, like determining who's eligible for a loan, or admission to college campuses, or whatever the case may be. Maybe there's racism or sexism, or maybe it's not so sinister and it's just simply leaning heavily towards some category over another. SageMaker Clarify will point out those biases in both the data and the trained machine learning model, and it will allow you to adjust the data accordingly. It will provide some insights and suggestions and tooling around improving the bias for your data and your machine learning model. And we'll talk about Clarify again later, because after you train and deploy your model, you might be using Clarify to monitor the bias of your deployed model in order to recalibrate it, either retrain it or adjust the data, whatever the case may be. Alright, we're moving on to the build phase. The build phase of the data pipeline of your machine learning stack using SageMaker. Well, the first one listed is SageMaker Studio, and I already mentioned that SageMaker Studio is simply an IDE for managing all of these services in the cloud in a graphical user interface. Fantastic tool, but we don't need to discuss this.
Just go on SageMaker's website and look at a video of Studio, how it looks visually. It's a visual thing, so I'm not gonna be able to speak to it much here. Autopilot. Maybe I should make this two episodes. Okay, Autopilot is so cool. Autopilot allows you to create a trained model and deploy it without writing any code at all, period. You take your data source, your data lake, your feature store, whatever the case, from the prior steps we talked about, that data preparation phase of your pipeline. Now your data's prepared, ready to go. You pipe it into Autopilot, and Autopilot will train a model and deploy it. Period. You don't need to know what model. You don't need to know if it's gonna be linear regression or logistic regression or XGBoost or what. It will look at your data, okay? You specify the label column. You tell it what the output column is gonna be, and it will determine automatically: are we dealing with a regression problem, or a classification problem, or a binary classification problem? Okay, ones and zeros, or any number of categories, or some number. It will determine that automatically. I mean, that's easy. It's easy to look at a column and determine what type of column we're dealing with, no big deal. It determines if we're gonna deal with regression or classification and so on. And then it determines, based on your data, the amount of rows and the amount of columns, the distribution of the columns, the number of missing values, and the types of feature transformations that you applied in previous steps that are coming out of your feature store, the right model. It determines whether we're gonna use XGBoost, linear regression, logistic regression, naive Bayes, and so on, automatically for you. And after determining the right model for the job, it auto-applies hyperparameter optimization. It does hyperparameter optimization for you outta the box, determining the right model for the job based on the data you give it, and then running all that through hyperparameter optimization to output a well-tuned, good trained model. How freaking awesome is that? How awesome is that? Now, note, this is only valuable for tabular data. I've talked about in the past, data can come in any number of types. We have, like, time-series-based data, like stock markets and language; you might pipe that into a recurrent neural network or a Transformers language model. You have space-based data; that would be a photo, a picture, or even stock market predictions if you were considering the data as if it were a photo. We talked about that in, uh, the Bitcoin trading episode. We would use a convolutional neural network for that. Okay. So if you're dealing with time or space, you won't be using Autopilot. But if you're dealing with tabular, which is the majority of the use cases of machine learning, then you can use Autopilot, and it will determine the right model for the job, the right hyperparameters for that model, will train a model for you, all part of the data pipeline, and deploy it to a REST endpoint for inference. Back when I said "all part of the data pipeline," that's important, because like I said, in the past you'd write a single Python script using a pandas data frame coming from a CSV or an RDS database, and that's not scalable. But you plug your whole data lake into a pipeline by way of SageMaker's Data Wrangler.
Out comes a feature store and, you know, scalable data architecture on a Spark backend, and the training of all that data goes through SageMaker Autopilot, and you can train this thing and then you can kick off a REST endpoint. Now, you may be thinking, wow, that's really cool; also, I'm a little concerned that it takes the reins too much, that I won't have much control. Well, you can take the reins back. You can sort of eject this model. If you're used to React, you know, you maybe use Create React App, which will create this, uh, really simplistic React code base environment on your localhost, and then if you ever really want to do major customizations to it, you can run npm eject. If none of that makes sense, ignore it; that's kind of a web development background. You can eject an Autopilot-trained model, and it will come out with a Jupyter notebook of Python code with all the models that it tried, all the hyperparameters that it tried for each of those models, the final model that's trained with the optimal hyperparameters, and so on, and then you can modify that model. You can either modify it and then redeploy it to a REST endpoint, or you can modify it maybe because we don't even want to deploy this to a REST endpoint in the first place; we just want to run these machine learning inference jobs as what's called batch transform jobs. I'll talk about that later. This is the equivalent of the scale-to-zero, one-off machine learning jobs that I was trying to get after in the last episode. So SageMaker Autopilot is this whole automated end-to-end solution for creating and deploying a machine learning model that you don't have to touch, but if you want, you can touch: you can eject it, and then you can fine-tune it to your heart's content. Super, super powerful. And you don't have to do all this as part of the data pipeline, okay? Forget everything I'm talking about with this pipeline stuff, with Data Wrangler and feature stores and whatnot. Let's say you just wanna upload a CSV and you just want a machine learning model, and you want to turn that into a REST endpoint, or you don't even want to turn it into a REST endpoint. You just wanna upload a CSV, generate a machine learning model, look at some of the data distributions, look at what machine learning model was selected, what are the feature importances, and so on. Again, remember how a lot of these features bleed into each other. Autopilot will allow you, per Data Wrangler and Feature Store and Clarify, to explore some of the metrics around your data and your model, which includes data distributions, feature importances, model bias, data bias, and so on. So if you just want to kick off a quick CSV exploration job and get a quick trained model, you can do that by uploading a CSV. You don't have to use all this SageMaker pipelining. Really powerful stuff. All right, in the build phase still, I'm gonna breeze over this real quick. Debugger: if you're familiar with TensorBoard, TensorFlow's GUI tooling for exploring a neural network's distributions of the numerical values of the weights at any given neuron, at any given layer, you can see if maybe you're having vanishing or exploding gradients. You can determine if you need some dropout or if you need some regularization. You can look to see if something's wrong with your machine learning model, especially your deep learning models, and that's where SageMaker Debugger comes into play. It's basically TensorBoard in the cloud with a GUI.
Auto-deployed, so you don't have to run this thing on localhost. Not only that, it can actually send you email notifications by way of CloudWatch if, you know, something's up, maybe during the training phase, or maybe, by way of SageMaker Clarify, your model is drifting or there's bias detected, or whatever the case may be. Now, this is the first time you're hearing me mention CloudWatch. CloudWatch is an AWS service that monitors certain things, typically by way of logs, maybe a regular expression; it's monitoring logs and it's looking for some pattern in the logs, but it can also monitor other things. It can monitor, let's say, usage, resource utilization of a server: RAM, CPU, GPU utilization, disk space usage. And if, let's say, the resources go too high, or something is wrong, or there's an anomaly detected in your server stack, then it can use CloudWatch to send you an email notification or a text message or something, that something is awry, something is amiss. And SageMaker Debugger can, during the training phase of a model, use CloudWatch to send you notifications that maybe your neurons are vanishing or exploding, something like this. And CloudWatch is integrated into almost all of the SageMaker tooling. Actually, this is the first time I'm mentioning it, but it's all over SageMaker. You can integrate CloudWatch to send you notifications if there's bias or drift of your data or your model, or whatever the case may be. And I did mention the notification of over or under resource utilization of a CPU or RAM or GPU or disk space. That will come in handy big time in machine learning as we dovetail into the next train-and-tune feature, being distributed training. In the distributed training feature of SageMaker, you can run your training jobs over multiple instances, multiple EC2 instances with GPUs and so on. Now you can tap into CloudWatch to determine if you're over- or under-utilizing GPU or CPU or RAM. If you're over-utilizing these things, then you may want to spin up more instances as part of the distributed training feature. If you're under-utilizing, you'll want to use fewer instances so that you can save money. And so you can say: if we're training this thing in the cloud, distributed across ten instances, and we're not using very many resources, send me an email notification so that I can alter the resource utilization; I can alter which GPU is used on these instances, which CPU is used on these instances, and so on. And the same for over-utilization. So the distributed training step of the SageMaker pipeline is just like it sounds. You can train your models in SageMaker, you can write them in an IPython notebook or you can write them in a Python file, and you can kick them off in a Python script, a training job. And the distributed training feature set of SageMaker lets you run that over multiple instances so you can save time. Previously, you would write your Keras model, your convolutional neural network, in a Python script on your localhost, you have a 1080 Ti GPU, and you run your script overnight, over two days, over three days, while it's training. Takes a long time to train. Well, there are ways to distribute this, to parallelize it. If you were to do this over multiple cores or multiple slices of your GPU on localhost, there's ways to do this, but SageMaker provides tooling that eases the burden of the distribution of training across multiple instances, again using Apache Spark under the hood, or it may be using
TensorFlow tooling for distributed training. SageMaker allows you to kick off a training job of your machine learning model and, with a very little amount of extra code in your Python file or in your Jupyter notebook, very little extra code, you can specify how your data from the data lake or the data source is going to be split across the different instances. Maybe we're keying them by ID or shard key, or if we're using an AWS S3 bucket with a bunch of CSVs, we might be sharding them. Okay, shard, S-H-A-R-D, means how one would split data across multiple instances, by save date or some substring of the file name or something like this, full, you know, root folder name or whatever the case may be. SageMaker allows you to specify how data will be split across the instances, and what the instance types are, the CPU, GPU, and RAM. And again, CloudWatch will notify you if you're over- or under-utilizing. How many of these instances to kick off, what the instance types are, how many to kick off, how the data gets distributed, and that's about it. As far as the actual model training stuff and the orchestration of recalibrating the training of the models that are running on different instances, how they communicate with each other what they've learned thus far and, like, unify that into a master algorithm, a master model upstream: SageMaker will handle all that stuff for you. Huge win. Normally, you would defer to a machine learning framework like TensorFlow or PyTorch and use their tooling for orchestrating the merging of training across the distributed instances, and it's a little bit more heavy on the code side, on the Python scripting side. SageMaker eases that burden a lot with their distributed training tooling. Okay, so we have our data. We ingested it from a data store or a data lake, a data set. Normally we say data lake; data lake means a whole bunch of different data sets that have something in common with each other. So we ingest our data from a data lake. We feature engineer it through Data Wrangler. Uh, we store those features to Feature Store. We do a little bit of assessment on our data by way of SageMaker Clarify and SageMaker Data Wrangler, a little bit of data analysis: determine feature importances, some quickie models, determine what models we might wanna use downstream. Just a bird's-eye view. Okay, at this point the episode is running very long. I'm gonna split this into two episodes 'cause we still have quite a bit of SageMaker to cover. But per usual, I will be listing some resources so that you can learn some of this AWS SageMaker tooling offline without my help. And in the next episode, we'll return to the deploy-and-manage phase of the SageMaker pipeline. See you then.
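
The transcript above mentions SHAP for feature importances that go beyond a single XGBoost tree. A minimal local sketch with the open-source shap library (the dataset and model here are just for illustration):

```python
# Global feature importances via SHAP values on a trained XGBoost model.
import shap
import xgboost as xgb
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target

model = xgb.XGBRegressor(n_estimators=100).fit(X, y)

explainer = shap.TreeExplainer(model)   # fast, exact explainer for tree ensembles
shap_values = explainer.shap_values(X)  # per-row, per-feature contributions
shap.summary_plot(shap_values, X)       # aggregate view of which features matter most
```

The same library has model-agnostic explainers (e.g. KernelExplainer) for neural networks and linear models, which is the property SageMaker's built-in explainability reports build on.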