
Fit a line, read its coefficients, and understand what "fitting" actually does before we ever touch the calculus. This is the dot product that quietly becomes logistic regression, a neuron, and every neural network downstream.
This episode opens with a quick news brief, then settles into the anchor tutorial: linear regression, the first real model in the course. Intuition first, then a hands-on scikit-learn workflow.
y = w·x + b and why that dot product is the atom of everything downstream.
LinearRegression (stable 1.9.0). Worked on fetch_california_housing (not the removed Boston set), honest test R² ≈ 0.6.
Let's start with the week's news, June 22nd through the 28th, and then we'll spend the rest of our time on your first real model. The biggest headline: on June 26th, OpenAI began a limited release of a new model family called GPT-5.6, codenamed Sol, Terra, and Luna. Think of it as a three-rung ladder. Sol is the flagship with advanced reasoning, Terra is the balanced everyday tier, and Luna is the fast, cheap option. Sol reportedly adds new reasoning-effort modes called "max" and "ultra," and the "ultra" mode reportedly spins up subagents to tackle complex projects. Reported pricing runs from about five dollars in and thirty dollars out per million tokens for Sol, down to roughly one dollar in and six dollars out for Luna. It's gated to a select group of trust partners for now, with wider availability promised in the coming weeks. The pattern worth noticing: labs now ship a price-performance ladder, not one single model.
On June 23rd, Anthropic launched Claude Tag. This one's a virtual teammate you at-mention in Slack to hand off a task. It builds context from your channel history and connected data, makes a plan, and then works on its own over extended stretches. It runs on Opus 4.8 and it's in beta for Enterprise and Team customers with some launch credits to start. For us, it's a clean, concrete example of the agent-as-coworker pattern: memory, tool use, and autonomy stitched together in production.
On June 22nd, SpaceX signed a compute deal with Reflection AI worth up to about 6.3 billion dollars. Reflection is an open-weights lab founded in 2024 by former DeepMind researchers, and they'll pay roughly 150 million dollars a month, from July 1st through 2029, to use Nvidia GB300 chips at SpaceX's Colossus 2 data center near Memphis, Tennessee. There's a 90-day exit clause after the first three months. It's smaller than SpaceX's reported deals with Anthropic and Google, but the interesting part is an open-weights lab leasing frontier-scale compute from a non-cloud provider. That's a real shift in where training capacity comes from.
Quick lightning round. Also on June 22nd, OpenAI expanded its "Daybreak" cyber program and shipped the full GPT-5.5-Cyber, a defensive-security model that reportedly scores 85.6 percent on a benchmark called CyberGym, versus 81.8 for the base model. It's gated to vetted organizations and isn't a public API model. Alongside it came a Codex Security plugin for vulnerability scanning and a project called "Patch the Planet," with Trail of Bits and HackerOne, to find and fix bugs in widely used open-source projects like cURL, Go, and Python, with mandatory human review. On the people front, around June 19th and 20th, Nobel laureate John Jumper said he's leaving Google DeepMind for Anthropic, days after Noam Shazeer said he'd go to OpenAI. And on June 23rd, ByteDance unveiled Seedance 2.5, a video model that reportedly generates 30-second clips in a single pass. Now, to our main event.
Okay. This is a big one. Today you build your first real model. Up to now we've talked about what AI is, we've set up tools, we've handled data with NumPy and pandas, and we've learned the discipline of splitting data into a training set and a test set. Today all of that pays off, because we're going to fit a line to data, read what that line tells us, and genuinely understand what "fitting" means before we ever write down a single derivative. The model is linear regression. It's the simplest thing that actually deserves to be called machine learning, and, more importantly, it's the seed of almost everything that comes later in this course.
Let me set expectations. We're going intuition-first and hands-on. There are two deep mathematical detours that I'm deliberately going to flag and then walk past: the proof that linear regression is maximum likelihood estimation under Gaussian noise, and the matrix-calculus derivation of the so-called normal equations. Both of those are real, both of them are beautiful, and both of them come back in the math phase. Today you get the punchlines, not the proofs.
Let's start with the single most basic distinction in supervised learning. Regression means predicting a continuous number. How much will this house sell for. What will the temperature be tomorrow. How many minutes until the bus arrives. Contrast that with classification, where you predict a category: spam or not spam, cat or dog, fraud or legitimate. Same machinery underneath, but regression's answer is a number on a sliding scale, and classification's answer is a label from a fixed set. Today is entirely about regression, about predicting a number.
Now the vocabulary, because it'll repeat for the rest of the course. The inputs are called features. You'll also hear them called predictors or independent variables, and we usually bundle them into a capital X. The thing you're trying to predict is the target. You'll hear it called the response or the output or the dependent variable, and we usually call it lowercase y. So the whole job of a supervised model is to learn a mapping from features X to a target y. Burn that in. Features go in, target comes out.
The simplest possible version is called simple linear regression, and "simple" here is a technical word meaning exactly one feature. With one feature you're fitting a straight line. The equation is the one you saw in school: y equals w times x plus b. The w is the slope, how steeply the line rises, and the b is the intercept, where the line crosses the vertical axis when x is zero. In machine learning we tend to call w a weight and b a bias, but it's the same line you've always known.
Real problems have more than one feature, and that's called multiple linear regression. Instead of a single slope, you now have one weight per feature, and your prediction is a weighted sum. Predicted y equals weight one times feature one, plus weight two times feature two, and so on, all the way down your list of features, plus a single bias at the end. Geometrically, with one feature you fit a line, with two features you fit a flat plane floating in three dimensions, and with many features you fit what mathematicians call a hyperplane, which is just the same flat idea in more dimensions than you can picture. Don't strain to visualize it. Trust the algebra.
Here's the piece I want you to actually memorize, because it is the spine of this entire course. That weighted sum has a name. It's a dot product. You take your vector of weights, you take your vector of features, you multiply them element by element, you add it all up, and then you add the bias. We write the hypothesis as y-hat equals w dot x plus b, where y-hat, "y with a little hat on it," just means "the model's prediction" as opposed to the true value. So the hypothesis function, the thing the model computes, is a dot product of weights and features, plus a bias. I'm naming this loudly right now because this exact structure, weights dotted with features plus a bias, is going to come back as logistic regression, and it's going to come back as a single neuron, and a neural network is just stacks of these. Learn it cold here, where it's at its simplest, and you get the rest of the course half-priced.
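If you'd like to see that hypothesis as code, here's a minimal sketch in NumPy. The weights, the bias, and the feature values are made-up numbers chosen for illustration, not anything a model learned.

```python
import numpy as np

# Hypothetical weights, bias, and one example's features (illustrative values only).
w = np.array([0.5, -1.0, 2.0])
b = 4.0
x = np.array([2.0, 3.0, 1.0])

# The hypothesis by hand: multiply element by element, add it up, then add the bias.
y_hat_manual = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

# The same computation written as a dot product.
y_hat = np.dot(w, x) + b

print(y_hat_manual, y_hat)  # both are 4.0: (1.0 - 3.0 + 2.0) + 4.0
```

Both lines compute the same number. The dot product is just the compact way to write the weighted sum.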
So what does "fitting" actually mean? Fitting means choosing the weights and the bias so that the line, or the plane, or the hyperplane, passes as close as possible to your actual data points. That's it. The data is fixed. Fitting is the act of dialing in the numbers w and b until the model's predictions hug the real values as tightly as they can. We're going to make "as close as possible" precise in a few minutes, but hold onto the picture: you're sliding and tilting a line until it threads through a cloud of dots as well as a straight line can.
One small piece of foreshadowing while we're here. There's a slick trick where you absorb the bias b into the weight vector by inventing a fake feature that's always equal to one. If one of your features is permanently the number one, then its weight just acts like the bias, and now the whole equation collapses to a single clean dot product, y equals w dot x, with no separate plus-b hanging off the end. You don't need to do this by hand, scikit-learn handles the bias for you, but it's worth knowing because that "always-one" feature shows up everywhere once you get into the math.
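Here's a small sketch of that always-one trick, assuming some made-up weights and a tiny batch of feature rows. The point is only that the two forms give identical predictions.

```python
import numpy as np

# Hypothetical weights, bias, and three feature rows (illustrative values only).
w = np.array([1.0, 2.0])
b = 3.0
X = np.array([[1.0, 1.0],
              [2.0, 3.0],
              [3.0, 5.0]])

# Usual form: dot product plus a separate bias.
y_hat = X @ w + b

# Bias trick: append a column of ones and fold b into the weight vector.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug = np.append(w, b)
y_hat_aug = X_aug @ w_aug

print(y_hat)      # predictions 6, 11, 16
print(y_hat_aug)  # the same three numbers
```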
Now let's actually run one, because the best part of linear regression is how little code it takes. The current stable scikit-learn is version 1.9.0, released on June 2nd, 2026, and it runs on Python 3.11 through 3.14. The entire interface for this model is three lines, and once you learn these three lines you basically know the shape of every model in scikit-learn. Step one, you create the model: you make a LinearRegression object. Step two, you fit it: you call its fit method and hand it your training features and your training target. That's the line where all the learning happens. Step three, you predict: you call the predict method with some new features and it hands you back predicted numbers. Create, fit, predict. That rhythm, that same three-beat pattern, is how you'll use nearly every algorithm we ever touch.
Let me give you the cleanest possible demonstration, the one straight out of the documentation, because it shows the model doing exactly what it's supposed to. Imagine I secretly build some data with a known rule. I decide the true relationship is: y equals one times the first feature, plus two times the second feature, plus three. I make four little data points that obey that rule exactly. Then I hand the features and the targets to LinearRegression and tell it to fit. When it's done, I look at what it learned. The model reports its weights, and they come back as one and two, exactly. It reports its bias, and it comes back as three, exactly. And if I ask it to score itself, it reports a perfect score of one-point-zero, because the data was perfectly linear with no noise. If I then ask it to predict for a brand new point, say feature values of three and five, it returns sixteen, which is one times three, plus two times five, plus three. The model recovered the hidden rule from nothing but examples. That's the whole magic in miniature.
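In code, that demonstration might look like the sketch below. The four data points are one choice that obeys the rule exactly; any points that follow y = 1·x0 + 2·x1 + 3 would do.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Four points that obey y = 1*x0 + 2*x1 + 3 exactly.
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = X @ np.array([1, 2]) + 3

model = LinearRegression()                 # create
model.fit(X, y)                            # fit
pred = model.predict(np.array([[3, 5]]))   # predict

print(model.coef_)        # approximately [1. 2.]
print(model.intercept_)   # approximately 3.0
print(model.score(X, y))  # approximately 1.0
print(pred)               # approximately [16.]
```

The values come back as floating-point numbers, so expect them to match the hidden rule up to tiny rounding error.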
After you fit a model, scikit-learn stashes what it learned in a set of attributes, and the naming convention is that learned things end with an underscore. There's a coefficients array, that's your weights, one number per feature. There's an intercept, that's your single bias number. There's also some bookkeeping: the rank and the singular values from the underlying solver, the number of features it saw, and, nicely, if you fed it a pandas DataFrame, it remembers the feature names, so it can tell you which weight belongs to which column. That last one matters a lot for interpretation, which is where we're headed next.
A couple of knobs on the model are worth knowing. When you create LinearRegression, the arguments are keyword-only, which just means you have to name them. There's a setting that controls whether the model computes a bias at all, on by default, and you'd only turn it off if you know your line must pass through the origin. There's a setting that can force every coefficient to be zero or positive, which is occasionally useful when negative weights would be physically nonsensical. And here's a detail that surprises people: under the hood, scikit-learn does not literally invert a matrix to solve this. The dense solver calls SciPy's least-squares routine, which is built on something called the singular value decomposition. We'll come back to why that matters, but the short version is: it's a more numerically stable way to get the same answer.
Let me get you off toy data and onto something real. We're going to use a dataset called California housing, which ships with scikit-learn and which you fetch with one function call. Quick but important aside: the famous old Boston housing dataset that you'll see in countless old tutorials has been removed from scikit-learn. It was deprecated in version 1.0 and fully removed in 1.2, because it contained a feature built on a racially loaded assumption. So do not reach for load_boston. It's gone, and for good reason. We use California housing instead.
Here's what that dataset is. It comes from the 1990 U.S. census, and it has 20,640 rows, where each row is one census block group, a small neighborhood. There are eight numeric features: median income in the block group, the median house age, the average number of rooms, the average number of bedrooms, the population, the average occupancy, and the latitude and longitude. The target is the median house value, expressed in hundreds of thousands of dollars, so a target value of 2.5 means two hundred fifty thousand dollars. And here's a fact to anchor your intuition before you even run anything: median income is by far the strongest predictor. Wealthier neighborhoods have pricier houses. You'd expect income to come out with a large positive coefficient, and it does.
Now I want to fold in the discipline we already learned: the train/test split. You never, ever judge a model by how well it does on the data it learned from. That's like grading students on the exact questions they studied. So the workflow is this. You fetch the data. You split it, holding out twenty percent as a test set you won't touch during training, and you set a random seed so the split is reproducible. You create the model and fit it on the training portion only. Then, to get an honest measure of performance, you score it on the held-out test set, the data it has never seen. That honest score, by the way, lands around R-squared of zero-point-six on California housing. Tutorials typically report somewhere in the high fifties to low sixties. That's a decent model and nowhere near perfect, and I love that number because it's honest. Plain linear regression is not magic. It captures a lot of the signal and misses plenty. Repeat the spiral rule with me: fit on the training set, evaluate on the test set, and never, ever report your training score as your performance.
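That workflow can be sketched like this. The helper names honest_test_r2 and run_on_california are hypothetical, and the seed of 42 is just an assumed choice. The fetch is wrapped in its own function because it downloads the dataset the first time you call it.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def honest_test_r2(X, y, seed=42):
    """Fit on 80% of the rows, return R-squared on the held-out 20%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LinearRegression()
    model.fit(X_train, y_train)          # training portion only
    return model.score(X_test, y_test)   # scored on rows the model never saw


def run_on_california():
    """Fetch the real dataset (downloads on first use) and print the test score."""
    data = fetch_california_housing(as_frame=True)
    print(honest_test_r2(data.data, data.target))
```

Call run_on_california() when you're ready to try it on the real data. The exact score depends on the seed and the split.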
So you've fit the model and you've got a coefficients array. How do you read it? This is genuinely one of the most misunderstood things in all of applied statistics, so go slow with me. A coefficient on a feature tells you this: how much the predicted target changes when that one feature goes up by one unit, while every other feature is held fixed. That last clause, "while every other feature is held fixed," is the entire crux of multiple regression, and it's where almost everyone slips. The coefficient on median income isn't "the effect of income in general." It's "the effect of income, after we've already accounted for house age, and rooms, and location, and everything else in the model." The sign of the coefficient tells you direction, positive means the target rises with the feature, negative means it falls. The magnitude tells you strength, how big a swing per unit.
But there's a trap in that magnitude, and you have to internalize it. A coefficient's size is meaningless without knowing the feature's units. A coefficient of 0.5 on something measured in thousands is a completely different beast from a coefficient of 0.5 on something measured in single digits. Which means you cannot just look at your coefficients, find the biggest one, and declare it the most important feature. You can only compare coefficient magnitudes if the features are on comparable scales, for instance if you've standardized them first. So please, do not read "big coefficient" as "important feature." That is a classic, confident, wrong conclusion.
Alright, let's finally make "fitting" precise. We said fitting means getting the line as close as possible to the points. Close in what sense? Picture a scatter of dots and a straight line drawn through them. For any one data point, drop a vertical line from the dot to the fitted line. That vertical gap is called the residual. Formally, a residual is the actual value minus the predicted value, the true y minus the y-hat. A positive residual means the point sits above the line, the model under-predicted. A negative residual means it sits below, the model over-predicted. Fitting is the act of choosing the weights and bias that make all those residuals collectively as small as possible.
But "collectively small" needs a rule, because some residuals are positive and some negative, and they'd cancel if you just added them. The rule we use is called ordinary least squares, OLS for short, and here's what it does: it adds up the square of every residual and chooses the line that makes that total as small as possible. Squaring kills the sign problem, since a square is never negative, and it gives us one single number, the sum of squared residuals, that measures how bad the fit is. If you divide that sum by the number of data points, you get the mean squared error, MSE, and minimizing the mean instead of the sum gives you the exact same line, because dividing by a constant doesn't move where the minimum is.
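Here's the arithmetic as a short sketch, using made-up true values and made-up predictions from some hypothetical line.

```python
import numpy as np

# Made-up true values and predictions (illustrative only).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_pred     # actual minus predicted: 0.5, -0.5, 0, -1
sse = np.sum(residuals ** 2)    # sum of squared residuals: 0.25 + 0.25 + 0 + 1
mse = sse / len(y_true)         # mean squared error

print(sse)  # 1.5
print(mse)  # 0.375
```

Notice that the plain sum of the residuals here would be negative one, hiding the two half-unit misses that cancel. The squared version counts every miss.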
This number we're minimizing has a name you'll hear constantly: it's a loss function. You'll also hear it called a cost function or an objective function. They're the same idea: a single number that says how wrong the model currently is, where lower is better. And here is the definition of training that you should tattoo on your brain: training a model means adjusting its parameters to minimize a loss function. That sentence is true for linear regression, it's true for logistic regression, it's true for a hundred-layer neural network. The model changes, the optimizer changes, but training is always "turn the knobs to make the loss go down."
Now, why squared error? Why not just use the absolute size of each residual, which seems more natural? There are four good reasons, and they're worth understanding rather than memorizing. First, squaring penalizes big mistakes much more harshly than small ones. A residual of four becomes sixteen, while a residual of one stays one, so the model is strongly motivated to avoid large errors. Second, the squared function is smooth and differentiable everywhere, whereas the absolute value has a sharp kink at zero, and that smoothness makes the calculus of optimization clean. Third, squared error gives you a convex loss, a single smooth bowl shape with no false minimum to get stuck in, and as long as your features aren't perfectly redundant, that bowl has exactly one lowest point, so there's a unique best answer. And fourth, the deep one: minimizing squared error is mathematically equivalent to maximum likelihood estimation when you assume the noise in your data is Gaussian, that is, normally distributed. That's the maximum likelihood estimation link, MLE for short, and it's the reason squared error isn't an arbitrary choice, it's the statistically principled one under a very common assumption. I'm flagging that derivation as a math-phase return. For today, just the punchline: least squares is what you get when your errors are bell-curved.
For contrast, what happens if you minimize the absolute error instead of the squared error? That's a different, legitimate method called least absolute deviations, sometimes L1. It's more robust to outliers, because it doesn't blow a single far-off point up into a giant squared penalty, and interestingly it estimates the median of the target for a given set of features, where squared error estimates the mean. So the choice of loss isn't just a technicality, it quietly decides what your model is even trying to estimate.
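You can check the mean-versus-median claim by brute force on the simplest possible model, one that predicts a single constant for everything. This is a sketch with made-up numbers: it tries a grid of candidate constants and sees which one each loss prefers.

```python
import numpy as np

# Made-up targets with one large outlier.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Try many constant predictions and measure each loss.
candidates = np.linspace(0.0, 100.0, 10001)   # grid with a step of 0.01
squared_loss = ((y[None, :] - candidates[:, None]) ** 2).sum(axis=1)
absolute_loss = np.abs(y[None, :] - candidates[:, None]).sum(axis=1)

best_squared = candidates[np.argmin(squared_loss)]
best_absolute = candidates[np.argmin(absolute_loss)]

print(best_squared, np.mean(y))     # both approximately 22.0
print(best_absolute, np.median(y))  # both approximately 3.0
```

The outlier at one hundred drags the squared-loss answer all the way to twenty-two, while the absolute-loss answer stays at three.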
So how does the computer actually find the bottom of that bowl? There are two routes, and you should know both by name. The first is the closed-form solution, the famous normal equation. There's an exact algebraic formula, w equals the inverse of X-transpose-X, times X-transpose-y, that hands you the optimal weights in one shot, no iteration. I'm saying it out loud once for completeness, but you do not need to memorize it, and as I mentioned, scikit-learn doesn't even compute it by literally inverting that matrix, it uses the singular value decomposition for numerical stability. The deep reason that formula exists is that you can set the slope of the loss bowl to zero and solve, which is the matrix-calculus derivation I'm flagging as a Phase 2 topic. Today, just hold the fact: for linear regression, an exact closed-form answer exists.
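For the curious, here's a sketch comparing the normal-equation route with an SVD-based least-squares solver on synthetic data. NumPy's lstsq stands in for the SciPy routine mentioned above, and the data-generating rule and noise level are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known rule plus a little noise (all values made up).
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, 2.0]) + 3.0 + 0.1 * rng.normal(size=100)

# Append an always-one column so the bias is learned as the last weight.
X1 = np.hstack([X, np.ones((100, 1))])

# Route A: the normal equation, solving (X^T X) w = X^T y
# with a linear solve instead of forming an explicit inverse.
w_normal = np.linalg.solve(X1.T @ X1, X1.T @ y)

# Route B: an SVD-based least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X1, y, rcond=None)

print(w_normal)  # close to [1, 2, 3]
print(w_lstsq)   # agrees with w_normal up to rounding
```

On well-behaved data like this the two routes agree. The SVD route earns its keep when features are nearly redundant and X-transpose-X becomes hard to invert reliably.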
The second route is gradient descent, and this is the one that matters for your future. Instead of solving in one shot, you start somewhere on the loss bowl and take repeated small steps downhill, in the direction that reduces the loss fastest, until you reach the bottom. For plain linear regression this is unnecessary, the closed-form solution is right there. But gradient descent is the workhorse of modern machine learning, because for huge datasets the closed-form math becomes too expensive, and for models like neural networks there is no closed-form solution at all. There is only stepping downhill. So we'll devote an entire future episode to gradient descent and optimization. For now, just file it: two ways to the minimum, an exact formula and an iterative descent, and the iterative one is where the field actually lives.
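Just so the idea isn't abstract, here's a bare-bones sketch of gradient descent on the mean squared error. The synthetic data, the learning rate of 0.1, and the 500 steps are all assumed values that happen to work for this toy problem, not recommended defaults.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known rule plus noise (made-up values).
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, 2.0]) + 3.0 + 0.1 * rng.normal(size=200)

w = np.zeros(2)        # start somewhere on the bowl
b = 0.0
learning_rate = 0.1    # step size, an assumed value for this data

for step in range(500):
    y_hat = X @ w + b
    error = y_hat - y
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2.0 * X.T @ error / len(y)
    grad_b = 2.0 * error.mean()
    # Step downhill.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)  # close to [1, 2] and 3
```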
Let's talk about measuring how good a fit is, because you need honest numbers, and you compute all of them on the held-out test data, never the training data. The headline metric is R-squared, the coefficient of determination. R-squared answers a specific question: how much better is my model than the dumbest reasonable baseline, which is always just guessing the average value of y? An R-squared of one means a perfect fit. An R-squared of zero means your model is no better than always predicting the mean. And here's the part that shocks beginners: R-squared can go negative. On test data, if your model is actually worse than guessing the mean, you get a negative R-squared. For instance, if you predict three, two, one for true values one, two, three, the R-squared comes out to negative three. So R-squared is not a percentage trapped between zero and one. It can fall through the floor.
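Here's that negative-three example worked by hand and checked against scikit-learn's r2_score, as a small sketch.

```python
from sklearn.metrics import r2_score

y_true = [1, 2, 3]
y_pred = [3, 2, 1]

# R-squared = 1 - (sum of squared residuals) / (sum of squared deviations from the mean)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # 4 + 0 + 4 = 8
mean_y = sum(y_true) / len(y_true)                            # 2.0
ss_tot = sum((t - mean_y) ** 2 for t in y_true)               # 1 + 0 + 1 = 2
r2_manual = 1 - ss_res / ss_tot                               # 1 - 8/2

print(r2_manual)                  # -3.0
print(r2_score(y_true, y_pred))   # the same value
```

Always guessing the mean, two, would have produced a squared error of two. These predictions produce eight, four times worse, and that's where the negative three comes from.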
And one more warning about R-squared, because this trips people constantly: when you call the model's score method, it returns R-squared. It does not return accuracy. Accuracy is a classification idea. This is a regression model, and its score is R-squared. Plenty of beginners see a score of zero-point-six and think "sixty percent accurate," and that is just the wrong mental model. It's "sixty percent of the variance explained relative to the mean baseline."
R-squared is great for "how does this compare to baseline," but it's unitless, so you also want a metric in real units. Mean squared error, the thing we minimized, is in the target's units squared, which for housing would be "hundreds of thousands of dollars, squared," which is meaningless to a human. So take its square root and you get RMSE, root mean squared error, which is back in plain target units and you can actually say "I'm typically off by about this many hundred-thousands of dollars." In recent scikit-learn there's a dedicated function for it, added in version 1.4. The old way of passing a squared-equals-false flag to the mean-squared-error function was deprecated in that same release and removed in 1.6, so use the new root-mean-squared-error function. RMSE heavily penalizes large errors, same as the squared loss does. There's also MAE, mean absolute error, which is the average absolute size of the residuals, also in target units, but more forgiving of outliers. A handy diagnostic: if your RMSE is much bigger than your MAE, that gap is telling you a few large outlier errors are dominating. Good practice is to report at least one absolute metric, RMSE or MAE, alongside your R-squared. R-squared tells you "compared to dumb," and RMSE tells you "in dollars."
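A small sketch of those metrics on made-up numbers. To keep it independent of which scikit-learn version you have installed, the square root is taken by hand with NumPy here; the dedicated root-mean-squared-error function should give the same number.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions; the last prediction is a big miss.
y_true = np.array([2.0, 3.0, 4.0, 5.0])
y_pred = np.array([2.0, 3.0, 4.0, 9.0])

mse = mean_squared_error(y_true, y_pred)    # target units squared: 16 / 4
rmse = np.sqrt(mse)                         # back in target units
mae = mean_absolute_error(y_true, y_pred)   # average absolute miss: 4 / 4

print(mse, rmse, mae)   # 4.0 2.0 1.0
```

The RMSE is twice the MAE here, and that gap is entirely the one four-unit miss. That's the outlier diagnostic in action.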
Now, linear regression comes with assumptions, the classic five, and you don't need to memorize them like a checklist, you need to know what each one looks like when it breaks. First, linearity: the relationship really is roughly a straight line. If it's actually curved, you'll see a curved pattern in your residuals, and the fix is transforming features, which we'll get to. Second, independence: your observations shouldn't be tied to each other. This breaks all the time with time-series data or clustered data, where one point leans on its neighbors. Third, homoscedasticity, a mouthful that just means constant variance: the spread of your residuals should be roughly the same everywhere. When it's not, that's heteroscedasticity, and you'll see the residual spread fan out like a funnel, which is common when your target spans orders of magnitude. Fourth, normality of the residuals: the errors, not your features and not your target, but the errors, should be roughly bell-shaped. This one matters mostly for statistical inference and small samples, and far less for pure prediction, and you check it with something called a Q-Q plot. Fifth, low multicollinearity: your features shouldn't be strong stand-ins for each other. When two features are nearly redundant, the model can't tell which one deserves credit, and the coefficients get unstable, even flipping sign with a tiny change in the data. You check it with a number called the variance inflation factor, VIF, and you start worrying past about five, definitely past ten.
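To make the variance inflation factor concrete, here's a sketch that computes it from its definition: regress each feature on all the others, take that R-squared, and compute one over one minus R-squared. The helper name variance_inflation_factors is hypothetical, and the three features are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def variance_inflation_factors(X):
    """VIF for each column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)   # nearly a copy of a
c = rng.normal(size=500)             # unrelated to the others

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs)  # large for the first two columns, near 1 for the third
```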
And here's the one tool that ties all five of those together, your master diagnostic: the residual plot. You plot the residuals against the model's fitted values and you look at the shape. What you want is a boring, structureless cloud of points scattered randomly around zero. Boring is good. Boring means the assumptions are holding. The moment you see structure, a curve, a funnel, a trend, that pattern is the model telling you which assumption you just violated. A curve means non-linearity. A funnel means heteroscedasticity. Learn to read that one plot and you've got a stethoscope for your regressions.
Let's talk pitfalls, because this is where good practitioners separate from naive ones. The number one misuse, by a mile: coefficients are not causal. Correlation is not causation. A positive coefficient on a feature does not mean that feature causes the target to rise. There could be a hidden confounder driving both, the causation could run backwards, or your data could be selected in a way that manufactures the pattern. The model only knows that two things move together. It knows nothing about why. So never narrate a regression coefficient as "if we increase this, the target will go up." That's a claim the model is not entitled to make.
Second pitfall: extrapolation is fantasy. The line is only trustworthy inside the range of feature values it actually saw during training. If your housing data tops out at incomes around some level, the model has no idea what happens above that, and the straight line will happily, confidently extend into nonsense. Predictions outside the training range are guesses dressed up as math.
Third: outliers plus squared loss are a dangerous combination. Remember that squaring punishes big residuals hard. So a single far-off point, especially one at an extreme feature value, what's called a high-leverage point, can grab the line and drag the whole fit toward itself. The remedies are things like mean absolute error, or Huber loss, or robust regression methods, which I'll just name for now so you know they exist.
Fourth, let me clear up feature scaling, because the truth here is nuanced and people overstate it. For ordinary least squares specifically, scaling your features does not change the solution and does not change the predictions. The math just rescales the coefficients to compensate. So if someone tells you "you must always scale before linear regression," that's not quite right for plain OLS. What scaling does change is the interpretation of the coefficients, and it matters enormously for two other things: for gradient descent, where unscaled features make the descent path zig-zag inefficiently, and for regularization, which we'll mention in a moment. So scale when you're doing gradient descent or regularization or comparing coefficient sizes, but know that vanilla OLS predictions don't care.
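You can verify that claim yourself with a sketch like this one. The two synthetic features are deliberately on wildly different scales, and the generating rule is a made-up assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two made-up features on very different scales.
X = np.column_stack([rng.normal(0, 1, 300), rng.normal(0, 1000, 300)])
y = 2.0 * X[:, 0] + 0.003 * X[:, 1] + 1.0 + 0.1 * rng.normal(size=300)

raw = LinearRegression().fit(X, y)

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
scaled = LinearRegression().fit(X_scaled, y)

print(raw.coef_)      # tiny weight on the large-scale feature
print(scaled.coef_)   # weights now on comparable scales

# Predictions agree up to floating-point error.
same_predictions = np.allclose(raw.predict(X), scaled.predict(X_scaled))
print(same_predictions)
```

On the raw features the second coefficient looks negligible. After standardizing, it turns out to be the larger of the two, which is exactly why you can't rank features by raw coefficient size. The predictions don't move at all.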
Fifth, remember multicollinearity from the assumptions: it can wreck your coefficients while leaving your predictions perfectly fine. The model might predict well and still hand you weights that are unstable and uninterpretable. So a good prediction score does not guarantee trustworthy coefficients.
And sixth, the spiral callback to overfitting. Linear regression is what we call a high-bias, low-variance model. In plain terms, it's a simple, rigid model, so its usual failure mode is underfitting, being too simple to capture the pattern, rather than overfitting. But it can absolutely overfit if you throw enough features at it, or if you start adding polynomial terms, which I'll explain in a second. This is exactly why we hold out a test set and use cross-validation, ideas you already have in your toolkit.
I want to tell you about one classic demonstration, because it makes a permanent impression: Anscombe's quartet. In 1973, a statistician named Francis Anscombe published a short paper called "Graphs in Statistical Analysis." He constructed four little datasets, eleven x-y pairs each, that are nearly identical on every summary statistic you'd normally compute. Same mean of x, which is nine. Same mean of y, seven and a half. Same variance, same correlation of about zero-point-eight-two, and they all produce the exact same fitted line, y equals three plus one-half x. By the numbers, they look like the same data. But when you actually plot them, they're wildly different. One is a clean linear relationship, just as you'd hope. One is a smooth curve, where a straight line is the wrong model entirely. One is a perfect straight line ruined by a single outlier. And one is a vertical stack of points at a single x value with one far-off high-leverage point setting the whole slope. The lesson is unforgettable and it's the moral of this entire episode: always plot your data. The summary statistics can be identical and the reality completely different. There's a modern, charming echo of this called the Datasaurus dozen, where the scatter plot literally spells out a dinosaur while the stats stay fixed.
Now let me reach forward to one technique that explains a phrase that confuses everyone. People say "linear regression can only fit straight lines," and that's not quite true. The word "linear" means linear in the parameters, the weights, not necessarily linear in your inputs. So if your data curves, you can fit that curve by transforming your features first. Add a squared version of a feature, add a cubed version, add the logarithm, add an interaction term that multiplies two features together, and then run ordinary linear regression on that expanded set of columns. In scikit-learn there's a tool that generates these polynomial features for you, and you'd chain it together with the regression in something called a pipeline. The key insight: it's still a linear least-squares problem, you've just given it more columns to work with. The catch is that high-degree polynomials overfit dramatically, producing wild wiggles that chase noise instead of signal. So it's a powerful tool with a sharp edge.
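Here's a sketch of that idea using scikit-learn's PolynomialFeatures and make_pipeline on made-up curved data. The quadratic rule, the noise level, and degree two are assumptions chosen so the effect is easy to see.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Made-up curved data: y depends on x squared.
x = rng.uniform(-3, 3, size=200)
X = x.reshape(-1, 1)
y = 0.5 * x ** 2 - x + 2.0 + 0.2 * rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

straight = LinearRegression().fit(X_train, y_train)
curved = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # adds an x^2 column
    LinearRegression(),
).fit(X_train, y_train)

straight_r2 = straight.score(X_test, y_test)
curved_r2 = curved.score(X_test, y_test)
print(straight_r2)  # a straight line misses the curve
print(curved_r2)    # same least squares, one more column
```

Both models are ordinary least squares. The only difference is that the second one was handed an extra squared column to work with.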
And one more forward-pointer, just a single sentence so you know the terms exist: when overfitting or multicollinearity bites, there are two regularization techniques, Ridge, which adds an L2 penalty on the size of the coefficients, and Lasso, which adds an L1 penalty and has the bonus of zeroing out weak features entirely, giving you automatic feature selection. The full story waits for Phase 3.
Let me close by naming the lineage out loud, because this is why today matters so much more than "I can fit a line." That dot product, weights dotted with features plus a bias, is the conceptual seed of nearly every model in this course. Next episode is logistic regression, and logistic regression is literally the same linear combination, the same w dot x plus b, squashed through an S-shaped function called a sigmoid to turn it into a probability for classification. After that, the single neuron, also called a perceptron, which is just an activation function wrapped around w dot x plus b. In fact, logistic regression is a single sigmoid neuron, the same thing under two names. And a neural network is nothing more than stacks of these linear-combination-plus-nonlinearity units wired together. So the dot product you learned today is the atom of everything downstream. Today builds directly on the train/test split and the NumPy and pandas work you've already done, and it unlocks logistic regression, which unlocks neurons, which unlocks neural networks.
A little history to send you off, because the roots here are wonderful. The method of least squares has a famous priority dispute. Legendre published it first, in 1805, in an appendix about computing comet orbits, where he coined the French phrase for "method of least squares." Then Gauss published his version in 1809, in a work on celestial motion, and claimed he'd actually been using it since 1795, which understandably annoyed Legendre. Gauss went further, though, connecting least squares to the normal distribution, the Gaussian, which is exactly that maximum likelihood link we flagged earlier, and he famously used the method to predict where astronomers should look to re-find the dwarf planet Ceres after it disappeared behind the sun in 1801. As for the word "regression" itself, that comes from Francis Galton in the 1880s, studying heredity. He noticed that very tall parents tend to have children who are tall but closer to the average, a phenomenon he called "regression toward mediocrity," in an 1886 paper. That gives us the modern phrase "regression to the mean," which, fittingly for this episode, is itself one of the most misunderstood ideas in statistics: extreme measurements tend to be followed by less extreme ones purely by chance, and people constantly mistake that statistical fact for a real effect, in medicine, in sports, in business.
So here's where you stand. You can fit a line with three lines of code. You know what the coefficients mean and the ceteris paribus trap that lurks in them. You understand that fitting is minimizing a squared-error loss, that there's a closed-form answer and an iterative one, and you can evaluate honestly with R-squared and RMSE on held-out data. You know the five assumptions, the residual plot that checks them, and the pitfalls that catch the overconfident. And above all, you know to plot your data. Next time, we keep the dot product and bend it toward classification with logistic regression. See you there.