OCDevel AI Podcast
Learn AI and machine learning from the ground up - a complete, self-driving course that goes from "what is AI?" all the way to building and operating production AI systems. Every episode pairs a five-minute brief on the latest in AI with a tutorial that climbs a single ladder across ~100 episodes - interleaving the concepts, the math that actually explains them, hands-on code you build yourself, and the MLOps to ship it. It leaves no stone unturned: the probability, statistics, and Bayesian foundations most courses skip get the deep treatment they deserve, right alongside the practical work. The path runs from your first model on real data, through the mathematical bedrock, classical ML, neural networks built from scratch in PyTorch, transformers part by part, building with LLMs (RAG, fine-tuning, agents), and MLOps on AWS and GCP - to the capstone: operating a self-managing fleet of AI agents in production. The goal isn't a diploma, it's a job. Every phase leaves you a portfolio project, and the whole course is built to make you the rare "operator" who can ship real systems - the one-person AI department. For programmers who want to break into AI through self-directed learning - no grad school required. AI-generated podcast by OCDevel.
Generated with OCDevel Podcaster

The Machine Learning Workflow and the Train/Test Split: Data, Features, Training, and Honest Evaluation

13h ago

The whole point of machine learning is generalization to unseen data, and the single habit that protects it is holding out a test set you never touch until the end. We walk the full workflow loop from problem framing to monitoring, and we make the train/test split and cross-validation concrete with scikit-learn.

Show Notes
OCDevel AI Podcast — The ML Workflow and Train/Test Split

The anchor episode of Phase 1: the end-to-end machine learning workflow as a loop, and the train/test split as the discipline that keeps your evaluation honest. Builds directly on the NumPy/pandas episode; unlocks every model that follows, because they all share the same fit/predict/score harness.

Education segment (anchor)
  • Workflow as a loop, not a line: problem framing, data collection, EDA, cleaning, feature engineering, model choice, training, evaluation, tuning, deploy, monitor, iterate. Roughly 80% of project time is data + features, 20% modeling. Often drawn as CRISP-DM with back-arrows.
  • Generalization is the goal: performance on unseen data from the same distribution. Training error is an optimistically biased estimate of true error, which is why you need held-out data.
  • Overfitting vs underfitting, capacity, and the bias-variance tradeoff (intuition only).
  • The split: 80/20 or 70/30 by default, 90/10 for big data; stratify for imbalance, never shuffle time series. Three-way split (train/validation/test) and k-fold cross-validation (Kohavi's 1995 case for 10-fold).
  • Data leakage taxonomy: preprocessing before split, target leakage (the Caruana et al. pneumonia case), temporal and group leakage, distribution shift.
  • scikit-learn facts verified against version 1.9.0: train_test_split, cross_val_score, the cross-validation guide, common pitfalls, DummyClassifier, and the Kaggle Data Leakage lesson.
  • Worked examples: Iris with a dummy baseline (~0.33) vs logistic regression (~0.97); California housing with a mean baseline (R² ~0.0) vs linear regression (R² ~0.58); correct vs leaky scaling.
News segment (week of June 18–22, 2026)
  • Noam Shazeer leaves Google for OpenAI: Gemini co-lead and "Attention Is All You Need" co-author departs (also see 9to5Google). Google reportedly paid ~$2.7B in 2024 to bring him back.
  • AI funding roundup: Odyssey ($310M Series B, world models), Hydra Host ($100M), Bland AI ($50M), Radical Numerics ($50M).
  • Open-weights leads (reportedly): GLM-5.2, MiniMax M3, VibeThinker-3B, Nemotron 3 Ultra.
Transcript

Let's run the news for the week of June eighteenth through June twenty-second, twenty twenty-six.

The big one is a talent move. Noam Shazeer is leaving Google for OpenAI. It was announced around Wednesday the seventeenth into the eighteenth. Shazeer was a vice president of engineering at Google and a co-lead of the Gemini models. He announced it himself in a post on X, saying he's excited to join OpenAI and looks forward to working with the team there.

Here's why that lands hard. Shazeer is one of the co-authors of the twenty seventeen paper "Attention Is All You Need." That's the paper that introduced the Transformer, the architecture sitting under GPT, Gemini, Claude, Grok, and Llama. He's also credited with multi-query attention. So this is one of the highest-profile individual moves of the year, and it's a clean snapshot of the AI talent war. For context, Google reportedly paid around two point seven billion dollars in an August twenty twenty-four deal to bring Shazeer back from Character dot A I. That two point seven billion figure is reported, not confirmed. And he's leaving less than two years later.

Next, the funding roundup for the week of June thirteenth through eighteenth, from Crunchbase News. They called it a slower week for the really big deals, but a few stand out.

Odyssey led the week. Three hundred ten million dollars, a Series B, at a one point four five billion dollar valuation, led by Natural Capital. Total raised is around three hundred thirty-seven million. Odyssey builds A I world models, which are learnable simulators that create multimodal simulations of real-world environments. Their backers include Amazon, A M D Ventures, E Q T, Google Ventures, I Q T, and SignalRank. World models are worth keeping an eye on as a trend.

Then Hydra Host raised one hundred million in a Series A, led by Kindred Ventures. They run a bare-metal G P U platform that connects customers to distributed A I compute. Total around one hundred nineteen million. Bland A I raised fifty million, a Series C led by Dell Technologies Capital, for A I voice agents that automate enterprise phone calls. And Radical Numerics raised a fifty million dollar seed, led by Emergence Capital, building models that simulate and predict biological systems for drug discovery.

A few open-weights leads to close out, all reported rather than tested by me. GLM five point two from Z dot A I shipped mid-June with coding and agentic gains over five point one, and it was reportedly folded into Nous Research's Hermes agent within days. MiniMax M three is a new open-source coding model, ranked among the top open-weights coders. VibeThinker three B is an M I T-licensed three-billion-parameter fine-tune that reportedly claims parity with much larger reasoners on math and code. And Nvidia shipped Nemotron three Ultra with a strong capability-to-efficiency ratio. Worth pulling the small ones and running them locally.

Today we're building the thing that holds the entire course together. The machine learning workflow, and inside it, the single most important habit you'll ever pick up, the train and test split.

Here's the promise. Once you understand this loop, every model we cover later, linear regression, logistic regression, k nearest neighbors, decision trees, random forests, gradient boosting, and eventually neural networks, all of them plug into the exact same harness. You learn the loop once. After that, every new algorithm is just swap the estimator, keep the harness. That's the whole game.

So let's start with a correction of a very common mental model. Beginners think machine learning is the line where you call fit on the model. You import something, you call fit, and boom, you're a machine learning engineer. But the fit call is the smallest, easiest part of the whole thing. It's the cherry on top. The real work is everywhere around it.

The workflow is a loop, not a line

Let me walk the stages, and I want you to hear them as a cycle you re-enter, not a checklist you finish.

Stage one is problem framing. What are we actually predicting? Is this supervised learning, where we have labeled examples, or unsupervised, where we don't? Is it classification, where the answer is a discrete category, or regression, where the answer is a continuous number? And here's a subtle one. What's the business metric versus the machine learning metric? What does "good enough" even mean for this project? Most projects that fail don't fail because the model was bad. They fail because someone solved the wrong problem, precisely and beautifully, and shipped it. Framing is where you avoid that.

Stage two is data collection. Garbage in, garbage out. That phrase is old and tired and completely true. No model rescues bad data.

Stage three is exploratory data analysis, or E D A. This is where your pandas fluency from the last episode pays off. You call describe, you call info, you call value counts. You plot histograms, you look at correlations, you build a missing-value map. You're trying to understand distributions, spot outliers, and check class balance before you model anything. You look before you leap.

Stage four is cleaning. Missing values get dropped or imputed. You fix data types, you remove duplicate rows, you handle outliers, you parse dates into real datetime objects. Tedious, unglamorous, necessary.

Stage five is feature engineering. This is scaling and normalization, one-hot encoding of categories, binning continuous values, building interaction terms, decomposing a datetime into day of week and month, turning text into vectors. There's a line, widely attributed to Andrew Ng, that applied machine learning is basically feature engineering. This is often the highest-leverage stage in the whole project. The model can only see what you feed it.

Stage six is choosing a model. Start simple. A linear or logistic regression, or a single decision tree. And here's a reframing that helps. The choice of model is itself a hyperparameter. It's one more knob you'll tune, not a sacred commitment.

Stage seven is training, or fitting. You call fit on X train and y train. Under the hood, an optimizer searches for the parameters that minimize a loss function on the training data. That's it. That's the famous fit call.

Stage eight is evaluation. You measure performance on held-out data, using a metric that actually fits the problem. Hold that thought, because the back half of this episode is mostly about doing this honestly.

Stage nine is tuning. You adjust hyperparameters, you change the feature set, you try a different model family. This is where validation and cross-validation come in.

Stage ten is predict and deploy. You serialize the trained model, and you serve it, behind an A P I or as a batch job.

Stage eleven is monitoring. Live performance, data drift, latency. When something degrades, you trigger a retrain.

And stage twelve is iterate. Whatever you learn in production sends you back upstream, to reframe, recollect, re-engineer. The arrows point backward as much as forward.

You'll sometimes see this drawn formally as CRISP-DM, the Cross-Industry Standard Process for Data Mining, from nineteen ninety-nine. Its stages are business understanding, data understanding, data preparation, modeling, evaluation, and deployment, and the canonical diagram has back-arrows all over it. Same idea, older name.

One rule of thumb to anchor all of this. Roughly eighty percent of a project's time goes to data cleaning and feature engineering. Roughly twenty percent goes to the modeling. The fit call you were so excited about is a sliver of the work.

Features and targets

Let me lock down the vocabulary, because we'll use it constantly.

X is the design matrix, also called the feature matrix. It's two-dimensional, with shape number of samples by number of features. The rows are samples, also called observations. The columns are features, also called variables. We write X with a capital letter because it's a two-dimensional thing, a table.

y is the target, also called the label or the response. It's one-dimensional, with shape number of samples. We write it lowercase because it's a single column, one value per row. If you have multiple things to predict at once, it becomes two-dimensional, number of samples by number of targets, but the default case is one column.

Supervised learning means learning a function f that maps X to y, from labeled pairs. When y is categorical, that's classification. When y is continuous, that's regression. And a practical note. Scikit-learn assumes X is numeric, so you encode any categorical columns first, and it assumes the rows of X and y line up in the same order. Scikit-learn stores the number of features it saw at fit time, in an attribute called n features in underscore, and if you later hand it the wrong number of columns, it errors out rather than guessing.
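
Here is a minimal sketch of that vocabulary in code. It uses the Iris data that ships with scikit-learn; the choice of LogisticRegression as the estimator is only an illustration, and any fitted estimator should behave the same way with respect to the stored feature count.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# X is the 2-D feature matrix, y is the 1-D target.
X, y = load_iris(return_X_y=True)
print(X.shape)  # (n_samples, n_features)
print(y.shape)  # (n_samples,)

# Fitting records how many feature columns the estimator saw.
model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.n_features_in_)

# Handing the fitted model the wrong number of columns raises an error
# instead of silently guessing.
try:
    model.predict(X[:, :3])  # only 3 of the 4 columns
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```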

Generalization, the core idea

Here's the heart of everything. The point of machine learning is generalization. Performance on unseen data, drawn from the same distribution as your training data. Not performance on data the model already saw.

A model that's brilliant on its training data and useless on new data has learned nothing. It memorized. That's the failure mode the whole rest of this episode is built to prevent.

Let me give you the light formal version, because it makes the intuition sharper. We assume our data points are independent and identically distributed samples, i i d for short, drawn from some unknown distribution over X and y. When we train, we minimize what's called empirical risk, which is just the average loss over the training sample. But what we actually care about is the expected risk, the average loss over the entire distribution, including all the data we'll never see. That's the generalization error. And the gap between the two has a name, the generalization gap.

The key consequence. Training error is an optimistically biased estimate of true error. It's biased low, because you chose your parameters using exactly those points, so of course the model looks good on them. That bias is the entire reason you need held-out data. You cannot trust the model to grade its own homework.

Here's the analogy I want you to keep. Studying for an exam. Imagine the teacher hands out the exact exam ahead of time, and you memorize the answer key. You score a hundred percent. Does that hundred mean you understand the material? Not at all. It means you have a good memory. The test set is the same exam topic, but with new questions you haven't seen. That's the only score that tells you whether you learned anything. And touching the test set early is cheating, except the only person you cheat is yourself, because production, the real world, is the actual final exam, and it doesn't care about your practice score.

Overfitting and underfitting

So if memorizing is the danger, let's name both ways it goes wrong.

Underfitting is when your model is too simple for the pattern. You get high error on both the training data and the test data. Picture fitting a straight line to a relationship that's actually curved. The line just can't bend enough. Both scores are bad.

Overfitting is the opposite. The model is too flexible, and it fits the noise in the training data, not just the signal. You get low training error but high test error. The tell-tale sign of overfitting is a big gap between the training score and the test score. When train looks great and test looks bad, that gap is overfitting, made visible.

The cleanest way to feel the difference. Imagine a lookup table that simply stores every single training example and its answer. On the training data it gets zero error, perfect. On anything new, it's useless, because it never saw that exact row. That's overfitting taken to the extreme. Memorizing is not learning. Learning means extracting the generalizable signal and throwing away the noise.

The dial that controls this is called capacity, or model complexity. It's the model's ability to fit many different shapes of function. Think polynomial degree, or tree depth, or sheer parameter count. Low capacity tends to underfit. Too much capacity tends to overfit. And there's a classic picture here, a U-shaped curve. As you increase complexity, training error just keeps falling, monotonically. But test error falls at first, then bottoms out, then rises again. That U is the whole story of the tradeoff in one curve.

There's a deeper framing called the bias-variance tradeoff, and I'll only give you the intuition now, we derive it properly in a later phase. Total expected error breaks into roughly three pieces. Bias squared, plus variance, plus irreducible noise. Bias is error from a model that's too simple, that's underfitting. Variance is sensitivity to the particular training sample you happened to draw, that's overfitting. Tuning complexity is just navigating between those two.

One honest footnote, so I'm not lying to you by simplification. Very large neural networks show something called double descent, where test error falls, rises, and then falls again past a certain point. So the clean U-curve is a useful first model of the world, not the final truth. Keep the U for now, and know there's more later.

Let me make this concrete with numbers. Take a decision tree on the Iris dataset with no limit on its depth. It can hit a hundred percent training accuracy, and around ninety-two percent test accuracy. That gap, a hundred versus ninety-two, is overfitting you can measure. Now cap the tree's max depth. Training accuracy drops, because you handcuffed the model, but test accuracy rises or at least steadies. You traded a little training performance for real generalization. That's the tradeoff in action, on a real dataset.
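
A rough sketch of that experiment follows. The split settings (twenty percent test, random state forty-two, stratified) and the depth cap of two are assumptions picked for illustration, so the exact test accuracies you see may differ from the figures quoted above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Assumed split settings, chosen only for this illustration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

results = {}
for depth in (None, 2):  # None = unlimited depth, 2 = capped
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    results[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

The number to watch is the gap between the two columns, not either score on its own.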

The train and test split

Now the main event. Before you touch anything model-related, you hold out a chunk of your data, and you do not look at it until the very end. That held-out chunk is your one honest estimate of generalization.

Why does this work? Because evaluating on data the model trained on only measures memorization. The test set stands in for the future, for the unseen data you'll face in production. It's a simulation of reality, kept clean.

What ratios do people use? The two most common are eighty twenty and seventy thirty. For a very large dataset, you'll see ninety ten, because ten percent of a huge dataset is still an enormous number of test examples, and you'd rather give the model more training data. For small datasets, you push the other way, toward cross-validation, which we'll get to. There's no universal law here. The reason bigger datasets can afford a smaller test fraction is that what really controls the noise in your estimate is the absolute number of test examples, not the percentage.

Now the cardinal sin. Data leakage. Leakage is any flow of information from the test side back into the training side. Let me give you the classic examples.

The first is preprocessing before the split. Say you create a standard scaler, which subtracts the mean and divides by the standard deviation, and you fit it on the whole dataset, and then you split. You just leaked. Because the scaler's mean and standard deviation were computed using the test rows, so the test statistics are now baked into your training pipeline, and your scores come out optimistic. The fix is simple. Split first. Then fit the scaler on X train only, and use it to transform both train and test. The scaler learns from train, and merely applies to test.

The second example is evaluating on the training data itself, which we've already beaten up. That's just measuring memorization.

The third is subtler and catches experienced people. Reusing the test set over and over to make decisions. Every time you peek at the test set and then change something, you leak a little test information into your choices. The test set quietly degrades into a validation set. People call this test set overfitting, or leaderboard overfitting. The test set is sacred precisely because you spend it once.

A few mechanical knobs make splitting reliable. First, set a random seed, called random state in scikit-learn, so that you and a reviewer get the exact same split and can reproduce each other's numbers. Second, for classification, stratify on y. With imbalanced classes, a plain random split can over-represent or under-represent the minority class by chance. The breast cancer dataset, for instance, is roughly thirty-seven percent one class, and a small random test set could drift far from that. Stratifying forces both the train and test splits to match the overall class proportions. Third, shuffling is on by default, which matters a lot when your data arrives sorted, say all of class zero first, then all of class one. And the one big exception. For time series, do not shuffle. You train on the past and test on the future. If you shuffle, you let the model peek at the future to predict the past, which leaks, and there's a dedicated tool for this called TimeSeriesSplit.
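
Those knobs can be sketched as follows. The breast cancer data ships with scikit-learn; the twenty-row time series and the four-split setting are hypothetical stand-ins used only to show that TimeSeriesSplit keeps training rows ahead of test rows.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import TimeSeriesSplit, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# random_state makes the split reproducible; stratify=y keeps class
# proportions the same in the train and test portions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
overall = np.mean(y == 0)
print(f"class 0 share: all={overall:.3f}, "
      f"train={np.mean(y_train == 0):.3f}, test={np.mean(y_test == 0):.3f}")

# Time series: no shuffling, every training index precedes every test index.
X_time = np.arange(20).reshape(-1, 1)  # hypothetical ordered observations
past_before_future = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X_time):
    past_before_future.append(train_idx.max() < test_idx.min())
    print("train up to", train_idx.max(), "| test from", test_idx.min())
```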

Validation versus test, the three-way split

Here's a trap that even careful people fall into. Suppose you tune your hyperparameters by checking the test set, over and over, picking whatever scores best. You've now fit your decisions to the test set. It's no longer unseen, and the number you report at the end is inflated.

The fix is a three-way split. Train, validation, and test. The training set fits the model's parameters. The validation set is where you tune hyperparameters, compare models, and decide when to stop. And the test set, you touch exactly once, at the very end, after every decision is locked. Common ratios are sixty twenty twenty, or seventy fifteen fifteen.

You implement this with two splits. First, split off the test set. Then split the remainder into train and validation, and remember to recompute that second fraction relative to what's left, not the original total.
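
As a sketch of the two-step split, assuming a sixty twenty twenty target on Iris: the test set is twenty percent of the total, and the validation set is twenty percent of the total, which works out to twenty-five percent of the eighty percent that remains. The dataset and the random state are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Step 1: carve off the test set (20% of the total).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Step 2: validation should be 20% of the ORIGINAL total.
# Relative to the remaining 80%, that is 0.20 / 0.80 = 0.25.
val_fraction = 0.25
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=val_fraction, random_state=42, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```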

The mental model. The training set is your textbook. The validation set is your practice exams, and you can take as many of those as you want. The test set is the one real final, taken once. Many practice exams, one real final.

And one more honest warning. The validation set wears out too. If you run a huge hyperparameter search against it, you'll overfit the validation set the same way you could overfit the test set. That wearing-out is exactly what motivates cross-validation, and it's why we keep the test set locked away.

Cross-validation, the better use of data

A single train-validation split has two problems. It wastes data, because the validation rows never get to train the model. And it gives you a high-variance estimate, because you might have drawn a lucky or an unlucky split.

K-fold cross-validation fixes both. You split the data into k folds of roughly equal size. You train on k minus one of them, and validate on the one you held out. Then you rotate, so that each fold serves as the validation fold exactly once. You end up with k scores, and you average them for a robust estimate, and you also look at their standard deviation, which tells you how stable the model is across different splits.

Common choices are five folds or ten folds. The ten comes from a nineteen ninety-five study by Kohavi, which showed that ten-fold stratified cross-validation gives a good bias and variance balance for model selection. So when in doubt, five or ten.

The cost is real. K-fold means training k separate models, so it's k times the compute of a single fit. That's the price of a better estimate. For a huge dataset, a single split is often plenty, because you already have tons of data. Cross-validation earns its keep on small and medium datasets, where every row is precious.

A few variants worth knowing by name. For classification, you want stratified k-fold, which preserves the class ratios in every fold, and here's a default many people don't know. When you pass an integer for the number of folds to a classifier in scikit-learn, it automatically uses stratified k-fold for you. If it's a regressor, or otherwise, it uses plain k-fold. That auto-stratification is a quiet, helpful default.
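
A minimal sketch of k-fold scoring with that integer shortcut. The estimator and dataset are illustrative choices; passing cv=5 with a classifier is what triggers the stratified behavior described above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# An integer cv with a classifier uses stratified k-fold under the hood.
scores = cross_val_score(clf, X, y, cv=5)

print(scores)                       # one accuracy per fold
print(scores.mean(), scores.std())  # average and spread across folds
```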

Leave-one-out, sometimes called L O O, is the extreme case where k equals the number of samples. You train on everything but one row, test on that one row, and repeat for every row. It's nearly unbiased but very high variance, and very expensive, since you train n models. Mostly it's a thing to know exists.

Group k-fold matters when your rows aren't independent. Say you have multiple rows per patient, or per user, or per session. If the same patient shows up in both train and test, the model can cheat by recognizing that patient. Group k-fold keeps every group entirely within one fold, so no entity straddles the line.
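
Here is a small sketch of group k-fold on made-up data. The twelve rows, the four patient IDs, and the labels are all hypothetical; the point is only that no patient ever appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 4 patients, 3 rows each.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
groups = np.repeat(["p1", "p2", "p3", "p4"], 3)

overlaps = []
for train_idx, val_idx in GroupKFold(n_splits=4).split(X, y, groups=groups):
    train_groups = set(groups[train_idx])
    val_groups = set(groups[val_idx])
    overlaps.append(train_groups & val_groups)  # should be empty
    print("validation patients:", sorted(val_groups))
```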

And there's a cross-validation version of the leakage trap. Your preprocessing must happen inside each fold. If you scale, or impute, or select features using the full dataset and then run cross-validation, every fold's validation portion already influenced your preprocessing, and your scores come out optimistic. The clean solution is a Pipeline, which we'll cover properly in a later episode, so I'll just flag it now. When you call cross-validation on a pipeline, scikit-learn re-fits every preprocessing step on each fold's training portion only. That, right there, is the single biggest reason pipelines exist. Flag now, deliver later.

A little math, just in time

Let me give you the formal backbone, lightly, because it justifies everything we just did.

Empirical risk minimization means we minimize one over n, times the sum over the training points, of the loss between the true y and the model's prediction. But what we want is to minimize the expected loss over the true distribution. Held-out data is how we estimate that second thing using the first.
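
Written out, the two quantities look like this. The symbols are my own notation, not something fixed by the episode: L is the loss, f is the model, n is the training-set size, and P is the unknown data distribution.

```latex
\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr)
\qquad\text{versus}\qquad
R(f) = \mathbb{E}_{(x,y)\sim P}\bigl[\,L\bigl(y, f(x)\bigr)\bigr]
```

The left side is what training minimizes; the right side is what you actually care about.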

Why is training error biased low? Because the parameters were chosen using those exact points, so the model is tuned to them. Held-out points were not seen, so in expectation, under the i i d assumption, the held-out estimate is unbiased. That's the formal reason the test set is trustworthy and the training score is not.

And the variance of your estimate matters too. A test set of size m gives an accuracy estimate with a standard error of roughly the square root of the whole quantity p times one minus p over m, where p is the accuracy. The takeaway is quantitative. A twenty-sample test set is nearly worthless, its error bars are huge. A two thousand sample test set is tight. That's the real reason bigger datasets can use a smaller test fraction. It's the absolute count m that tames the noise, not the percentage.
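
To put numbers on that, here is a tiny sketch of the standard error formula. The accuracy of zero point nine is an assumed value for illustration, and the helper name is hypothetical.

```python
import math


def accuracy_standard_error(p: float, m: int) -> float:
    """Standard error of an accuracy estimate p measured on m test samples."""
    return math.sqrt(p * (1 - p) / m)


for m in (20, 2000):
    se = accuracy_standard_error(0.9, m)
    print(f"m={m}: standard error ~ {se:.4f}")
# m=20 gives about 0.067, m=2000 gives about 0.0067
```

A hundred times more test rows buys you error bars ten times tighter, because the count sits under a square root.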

And the bias-variance decomposition, which I'll state and not derive. The expected squared error of your prediction equals bias squared, plus variance, plus sigma squared. That last term, sigma squared, is irreducible noise. It's the floor that no model, however perfect, can ever beat. Some of the world is just random.
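
For reference, one common way to write that statement. The notation is an assumption on my part: the data follow y = f(x) plus noise with mean zero and variance sigma squared, f-hat is the model fitted on a random training sample, and the expectation runs over training samples and noise.

```latex
\mathbb{E}\bigl[(y - \hat{f}(x))^2\bigr]
  = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\bigl[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2\bigr]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```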

Always start dumb, the baseline

Before any fancy model, fit a dummy. Scikit-learn gives you DummyClassifier and DummyRegressor, and they establish the floor.

A dummy classifier, by default, with strategy set to prior, just predicts the most frequent class. A dummy regressor, by default, predicts the mean of the target. If your real model can't beat the dummy, your real model is worthless. Full stop.

Dummies also expose misleading metrics, and this is where they really earn their place. Picture a fraud dataset that's ninety-nine percent legitimate transactions. A model that always predicts legitimate gets ninety-nine percent accuracy, and catches exactly zero fraud. A dummy classifier set to most frequent gets that same ninety-nine percent. The dummy reveals that accuracy is the wrong metric for this problem, which is what motivates precision, recall, F one, and R O C area-under-curve in later episodes.
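
A quick sketch of that fraud scenario with made-up labels. The thousand-row dataset with ten fraud cases is hypothetical, and the scores are computed on the same rows the dummy was fit on, which is fine here because the dummy ignores the features entirely.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 990 legitimate (0) and 10 fraudulent (1).
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features; the dummy never looks at them

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = dummy.predict(X)

acc = accuracy_score(y, pred)  # 0.99, looks great
rec = recall_score(y, pred)    # 0.0, catches no fraud at all
print(acc, rec)
```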

And one more job for the baseline. It sanity-checks your whole pipeline end to end before you invest in modeling. Does the data load? Does the split run? Does the scoring function work? Run the dummy, and you've tested the plumbing.

The unifying estimator API

Now let me show you why scikit-learn took over the world. It's the estimator API, one consistent contract that everything obeys.

Every estimator has a fit method that takes X and y. Fit learns from the data, stores the learned parameters in attributes with a trailing underscore, like coef underscore, intercept underscore, classes underscore, and returns itself. Predictors add a predict method. Classifiers also offer predict proba, for class probabilities, or decision function. Transformers, the preprocessing objects, have transform and fit transform. And a transformer must not change the number of samples or reorder them, it only changes the columns. Almost everything also has a score method, with a sensible default metric, R squared for regressors and accuracy for classifiers, where higher is always better.

A couple of contract details that save you pain. The constructor does no validation, it just stores your arguments. All the real validation happens in fit. So if you passed something silly, you find out when you fit, not when you build the object. Re-fitting overwrites the prior learned state, it doesn't accumulate. And get params and set params are how grid search reaches in and tweaks your settings, which is how automated tuning works at all.
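
The contract can be sketched in a few lines. LogisticRegression on Iris is just a convenient example, and the negative value for C is a deliberately silly argument chosen to show where validation happens.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)
returned = model.fit(X, y)   # fit returns the estimator itself
print(returned is model)

# Learned state lives in attributes with a trailing underscore.
print(model.classes_)
print(model.coef_.shape)

# get_params / set_params are the hooks that tuning tools use.
print(model.get_params()["C"])
model.set_params(C=0.1)
model.fit(X, y)              # re-fitting overwrites the earlier learned state

# The constructor stores arguments without checking them...
bad = LogisticRegression(C=-1.0)
# ...and the complaint only arrives at fit time.
try:
    bad.fit(X, y)
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print("fit rejected the argument:", err)
```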

Here's the punchline. DummyClassifier, LogisticRegression, LinearRegression, StandardScaler, and Pipeline all start from the same fit, and then offer predict, transform, or score as their role requires. So swapping one model for another is a one-line change. That uniformity is scikit-learn's killer feature, and it's why it became the de facto standard. For history, scikit-learn started as a Google Summer of Code project by David Cournapeau in two thousand seven, had its first public release in twenty ten, and has been led largely out of INRIA, a French research institute.

Two worked examples, in words

Let me put it all together with two end-to-end examples. I'll describe the code in plain words.

First, classification on Iris. You load Iris with return X y set to true, which gives you X with shape one hundred fifty by four, and y with shape one hundred fifty. You call train test split with test size equals zero point two, random state equals forty-two, and stratify equals y, so the three classes stay balanced across both splits. Now the dumb baseline first. You build a dummy classifier with strategy most frequent, call fit on the training data, then call score on the test data, and you get about zero point three three, which makes sense, because Iris has three equal classes. Then the real model. You build a logistic regression with max iter set to a thousand, call fit, call score, and you get about zero point nine seven. It crushes the baseline, which is exactly what you want to see. Then cross-validation. You build a stratified k-fold with five splits, shuffle on, random state forty-two, and call cross-validation score with that fold scheme. You get back an array of five scores, the mean is about zero point nine seven, and the standard deviation is small, which tells you the model is stable, not lucky.
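
In code, that walkthrough might look like the sketch below. One assumption to flag: the episode doesn't say which rows the cross-validation runs on, so this version runs it on the training portion to keep the test set untouched. Exact scores can vary slightly across scikit-learn versions.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Dumb baseline first.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# Then the real model, same harness.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
model_acc = model.score(X_test, y_test)

# Cross-validation with an explicit fold scheme (on the training rows,
# which is an assumption of this sketch).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_train, y_train, cv=cv
)

print(f"baseline={baseline_acc:.2f}, model={model_acc:.2f}")
print(f"cv mean={cv_scores.mean():.2f}, cv std={cv_scores.std():.2f}")
```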

Second, regression on California housing. You fetch the California housing data with return X y true. You split with test size zero point two and random state zero. The dummy regressor with strategy mean gets an R squared of about zero point zero, by construction, because predicting the mean explains none of the variance. Then linear regression. Fit, score, and you get an R squared of about zero point five eight. It beats the trivial predictor, so you're learning something real.
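
The same baseline-then-model pattern for regression, sketched on a stand-in dataset. To keep the example runnable offline, this uses load_diabetes, which ships with scikit-learn, instead of the California housing data, which has to be downloaded. So the R squared values here will not match the ones quoted above; swapping in fetch_california_housing should only require changing the loader.

```python
from sklearn.datasets import load_diabetes  # stand-in, NOT California housing
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Baseline: always predict the training mean.
dummy_r2 = DummyRegressor(strategy="mean").fit(X_train, y_train).score(X_test, y_test)

# Real model: same fit and score calls, different estimator.
linear_r2 = LinearRegression().fit(X_train, y_train).score(X_test, y_test)

print(f"dummy R^2={dummy_r2:.3f}, linear R^2={linear_r2:.3f}")
```

On held-out rows the mean baseline lands at or slightly below zero, because the training mean is not exactly the test mean.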

And now the leakage demonstration, three versions side by side. The wrong way. You call fit transform with a standard scaler on all the data, then split. The leak already happened, the scaler saw the test rows. The right way. Split first, fit the scaler on X train, then transform train and test separately. And the best way, which I'm foreshadowing. You build a pipeline that chains the standard scaler and the logistic regression, and you call cross-validation score on that pipeline. Now the scaler is re-fit inside each fold, on that fold's training portion only. No leakage, ever, automatically. Remember, fit transform is just fit followed by transform in one call, and you use it on train. On test or validation, you use plain transform.
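
The three versions, sketched side by side. Iris and the split settings are illustrative choices, and make_pipeline is used here as a convenient shorthand for building a Pipeline.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# WRONG: the scaler is fit on every row, test rows included, then we split.
leaky_scaler = StandardScaler().fit(X)
X_leaky = leaky_scaler.transform(X)
Xl_train, Xl_test, yl_train, yl_test = train_test_split(
    X_leaky, y, test_size=0.2, random_state=42, stratify=y
)

# RIGHT: split first, fit the scaler on train only, transform both.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# The two scalers learned different means: the leaky one saw the test rows.
print(leaky_scaler.mean_)
print(scaler.mean_)

# BEST: a pipeline, so the scaler is re-fit inside each fold automatically.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(scores.mean())
```

On a dataset this small and clean the leaky score may look almost identical to the honest one. The habit matters because on messier data the gap can be large, and you won't know in advance which case you're in.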

The leakage taxonomy and production

Let me widen out to all the ways leakage bites you in the real world, because this is where careers are made or embarrassed.

First, train-test contamination. That's preprocessing fit on the full data, which we covered. It also includes duplicate rows that span both sets, the same record landing in train and test, and over-peeking at the test set.

Second, and nastier, target leakage. This is a feature that secretly encodes the answer, or that's only available after the moment you'd actually make the prediction. The canonical teaching case is from a study by Caruana and colleagues on pneumonia mortality. The model learned that asthma patients had lower mortality risk, which is backwards, because asthma makes pneumonia more dangerous, not less, and acting on that pattern would be dangerous too. So why did the data say it? Because asthma patients with pneumonia got rushed to the I C U and treated aggressively, and that extra care is what pushed their mortality down. Features like admitted to I C U or received antibiotics were really markers that doctors already knew the patient was severe. The feature encoded the outcome. Another example, a housing model with a neighborhood average price feature that included the very house being predicted. And a famous Kaggle competition, Predict Future Sales, where activity from the test time period bled into the features. And from Kaggle's own data leakage lesson, the took antibiotic medicine example, a feature recorded after someone got sick, which then predicts that they were sick. It predicts the past.

Third, temporal leakage. That's using future information to predict the past, and any random shuffle of a time series does exactly this.
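A minimal sketch of the alternative to shuffling, using scikit-learn's TimeSeriesSplit. That class is my choice for the illustration, and the twelve rows are toy data standing in for time-ordered observations.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: twelve observations already sorted oldest to newest
X = np.arange(12).reshape(-1, 1)

# Each fold trains on an earlier block and tests on the block right after it
tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))

for train_idx, test_idx in folds:
    # Every training index comes before every test index
    print("train up to", train_idx.max(), "| test from", test_idx.min())
```

The point is the ordering. The model is only ever scored on rows that come after everything it trained on.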

Fourth, group leakage. The same entity appearing in both train and test inflates your scores, and the fix, again, is group k-fold.
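Here's a small sketch of group k-fold. The data is made up, and the group labels are hypothetical patient I Ds, two rows per patient.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: ten rows, two rows per hypothetical patient
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])

gkf = GroupKFold(n_splits=3)

overlaps = []
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # Groups present on both sides of the split; should always be empty
    shared = set(groups[train_idx]) & set(groups[test_idx])
    overlaps.append(shared)
    print("test groups:", sorted(set(groups[test_idx])), "| shared:", shared)
```

Each patient lands entirely in train or entirely in test for a given fold, which is the whole fix.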

Then there's distribution shift, sometimes called dataset shift, and it's the reason your offline scores and your production scores disagree even when you did everything right. There are three flavors. Covariate shift, where the distribution of the inputs changes. Label shift, also called prior shift, where the distribution of the outcomes changes, like fraud rates climbing. And concept drift, where the actual relationship between inputs and outputs changes, like spammers evolving their tactics to dodge your filter. Your held-out test set only certifies generalization to the same distribution it came from. Production can drift away from that. Which is exactly why monitoring and retraining close the loop, back to stages eleven and twelve from the beginning.

A word on reproducibility. Pin your random state everywhere, in the split, in the cross-validation, in the model's initialization. Pin your library versions. And version your data. Future you, and your reviewers, will thank you.

And on deployment, briefly. You persist the fitted estimator, or the whole pipeline, using something like joblib or pickle. And here's the elegant part. The same pipeline that prevented leakage during cross-validation is the exact thing you ship to production. So your training-time preprocessing and your serving-time preprocessing are guaranteed to match, which avoids a whole class of bugs called train-serve skew.
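A minimal sketch of persisting and reloading the whole pipeline with joblib. The file name and the temporary directory are placeholders for wherever your artifacts actually live, and Iris is again just an assumed stand-in dataset.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocessing and model travel together as one fitted object
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "iris_pipeline.joblib")  # placeholder name
    joblib.dump(pipe, path)    # what you'd do at the end of training
    served = joblib.load(path)  # what you'd do when the service starts

# The reloaded pipeline scales and predicts exactly like the original
same_predictions = np.array_equal(pipe.predict(X_test), served.predict(X_test))
print("identical predictions:", same_predictions)
```

Two cautions. These files are pickles under the hood, so only load ones you trust, and load them with the same library versions you trained with, which is one more reason to pin them.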

A note on the datasets

Quick practical note, since you'll reach for these. The old Boston housing dataset was deprecated in scikit-learn version one point zero and removed in one point two, over ethical concerns about a racial feature in it. So for regression, use fetch California housing, or the diabetes dataset, which has four hundred forty-two samples and ten features. For classification, Iris gives you a hundred fifty samples in three balanced classes of fifty each, with four features. Breast cancer gives you five hundred sixty-nine samples, binary, roughly sixty-three thirty-seven, which makes it great for practicing stratification on mildly imbalanced classes. There's also wine, and digits. They all come back as a Bunch object with data, target, and feature names. Pass return X y true if you just want the arrays as a tuple, or as frame true if you want pandas.
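A quick sketch of the three ways to load these, using only the bundled datasets so nothing needs downloading. The as_frame option needs pandas installed.

```python
from sklearn.datasets import load_breast_cancer, load_diabetes, load_iris

# Default: a Bunch object with data, target, and feature_names
iris = load_iris()
print(iris.data.shape, iris.target.shape)
print(iris.feature_names)

# return_X_y=True: just the arrays, as a tuple
X_diabetes, y_diabetes = load_diabetes(return_X_y=True)
print(X_diabetes.shape)

# as_frame=True with return_X_y=True: a pandas DataFrame and Series
X_cancer, y_cancer = load_breast_cancer(return_X_y=True, as_frame=True)
class_share = y_cancer.value_counts(normalize=True)
print(X_cancer.shape)
print(class_share.round(2))
```

Checking the class shares like this is a cheap habit before you decide whether a split needs stratify.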

Where this sits on the map

Let me close by placing this episode. It builds directly on the NumPy and pandas episode from Phase one. Your X is a two-dimensional NumPy array or a pandas DataFrame. Your E D A and your cleaning are pandas work. The design matrix is exactly the array fluency you just built.

And it unlocks everything downstream. Linear regression, logistic regression, k nearest neighbors, decision trees, random forests, gradient boosting, and later, neural networks, every single one plugs into the identical fit, predict, score workflow, and the same evaluation discipline we built today. Learn the loop once. Learn the split once. After that, every new algorithm is the same move, swap the estimator, keep the harness. The train and test discipline is the through-line of this entire course. Protect your test set, and you'll always know the truth about your model.