
Logistic regression takes the same linear score from last episode, w dot x plus b, squashes it through a sigmoid into a probability, and trains it with cross-entropy, which turns out to be maximum likelihood for a Bernoulli label. It's a linear classifier and, quietly, a single neuron, the exact template every classification network stacks at its top.
Phase 1 continues. Last episode fit a line for regression; this one turns that same linear score into a classifier. Classification predicts a discrete category (usually with a probability), not a continuous quantity.
The machine: keep the linear score z = w·x + b, then squash it with the sigmoid σ(z) = 1/(1 + e^(−z)) into a probability p = P(y=1|x). Invert it and you get the logit: log(p/(1−p)) = w·x + b, so logistic regression is linear in log-odds. Each weight becomes an odds ratio via e^(w_j), which is why banking and epidemiology still use it for interpretable reason codes.
Why not squared error: MSE on a sigmoid is non-convex with vanishing gradients on the worst mistakes. The right loss is log loss / binary cross-entropy, which is the negative log-likelihood of a Bernoulli model, classification's version of MSE-as-Gaussian-MLE. Its gradient collapses to the clean (p − y)·x, the same form as linear regression.
Also covered: the linear decision boundary (a hyperplane) and why logistic regression can't solve XOR; the from-scratch NumPy rebuild (swap identity→sigmoid, MSE→cross-entropy, keep the same update); L2/L1/elastic-net regularization and the perfect-separation blow-up; multiclass via one-vs-rest vs softmax; a light look at accuracy's trap on imbalanced data (full metrics next episode); history from Verhulst's 1838 growth curve to Berkson's "logit" (1944) and Cox (1958); and the punchline, logistic regression is one neuron.
Worked example: scikit-learn 1.9.0 LogisticRegression on Breast Cancer Wisconsin, with StandardScaler, predict_proba, decision_function, and coefficient reading.
First, the news for the week of June 28th through July 5th, 2026.
Lead story. Anthropic shipped Claude Sonnet 5 on June 30th, and on July 1st it became the default model for every Free and Pro user. Anthropic calls it their most agentic Sonnet ever, landing near the flagship Opus 4.8 at much lower cost. It carries a one-million-token context window and up to 128,000 tokens of output. Intro pricing runs two dollars per million input tokens and ten dollars per million output, through August 31st. The reported benchmarks, and treat these as reported: about 63.2 percent on SWE-bench Pro for agentic coding, 81.2 on OSWorld, 80.4 on Terminal-Bench, and 84.7 on BrowseComp. Two catches. A new tokenizer emits roughly one to one-point-three-five times more tokens, so do the cost math on your own workload. And the temperature and sampling parameters are gone. Hold onto those coding benchmarks, because a pass-or-fail score on a labeled test set is exactly the classification we get into today.
Next, Meituan open-sourced LongCat-2.0 on June 30th under an MIT license. It's a 1.6-trillion-parameter Mixture-of-Experts coding model, and Mixture-of-Experts means only a sparse slice, about 33 to 56 billion parameters, actually activates per token, rather than the whole dense network. Native one-million context, pretrained on over 30 trillion tokens. The headline, and mark this reportedly, it's the first trillion-parameter model fully pretrained and served on domestic Chinese chips, a cluster of over 50,000 Chinese-made ASICs, no Nvidia in the loop. Reported SWE-bench Pro of 59.5, priced around 75 cents in and 2.95 out per million tokens. Open weights, genuinely competitive coding.
Google moved its Gemini image models to general availability on June 30th: Gemini 3.1 Flash Image and Gemini 3 Pro Image, nicknamed Nano Banana Pro, now live in Google AI Studio and the Gemini API. Pro Image runs about 13 cents an image, generates in two to five seconds, and leads on legible in-image text rendering, so infographics come out readable. Every output carries an invisible SynthID watermark for provenance.
A policy note that ties right into today's topic. Anthropic, alongside Amazon, Microsoft, and Google, proposed a jailbreak severity-scoring framework, and a safety classifier reportedly blocked over 99 percent of a targeted jailbreak. But by July 3rd came reports of false positives, routine security-adjacent coding getting blocked. A safety classifier is a binary classifier, and false positives versus false negatives, that precision-recall tradeoff, is exactly what today's tutorial introduces.
And one for the toolbox. Ai2 released MolmoMotion on July 1st, a fully open vision-language model that predicts 3D point trajectories from a video, a point, and a text instruction. The weights, the training code, the 1.16-million-video MolmoMotion-1M dataset, and the PointMotionBench eval are all on Hugging Face under allenai. Grab them if you want to play.
Last episode we fit a straight line. We wrote y-hat equals w dot x plus b, we measured how far off we were with residuals, we squared them into mean squared error, and we found the weights two ways: the closed-form normal equation, and gradient descent walking downhill on the loss. Today we take that exact same machinery, the linear score, the dot product, the gradient descent loop, and we point it at a completely different kind of problem. We're going to classify.
So let's start with the fundamental split. What is classification, and how is it different from regression?
Regression predicts a continuous quantity. The output y is a real number, somewhere on the number line. A house price, a temperature, the R-squared value we computed last time. It can be 3.7, it can be negative 200, it can be anything. Classification predicts a discrete category out of a finite set. Is this email spam or not spam. Is this tumor malignant or benign. Will this customer churn or stay. The output isn't a quantity, it's a label, and usually we want it with a probability of membership attached.
That change of task is deeper than it sounds. In regression, "how far off am I" is a meaningful, signed distance. If the true price is 300 and I predict 310, I'm off by ten, and being off by ten is genuinely worse than being off by two. The residual has direction and magnitude. In classification, "how far off" mostly collapses to right or wrong. The label is a category, not a distance. So the natural thing to predict isn't the category directly, it's the probability that the instance belongs to the positive class. We predict P of y equals one given x.
Let me lay out the flavors, because people mix them up constantly. Binary classification has exactly two classes, y in the set zero or one. Spam or ham, malignant or benign, churn or stay, default or repay. And here's a convention worth burning in: the positive class, class one, is the thing you want to detect. The rare, the dangerous, the interesting thing. Multiclass classification has one label out of K mutually exclusive classes, where K is bigger than two. A handwritten digit is exactly one of zero through nine. An iris flower is exactly one of three species. Exactly one label per instance, that's the point. Then there's multilabel, which is different, and people trip on this. In multilabel, each instance can carry several labels at once. A photo tagged beach, and sunset, and dog, all three. That's not one-out-of-K. That's K independent yes-or-no questions. Is there a beach, yes or no. Is there a sunset, yes or no. So multilabel is K separate binary problems, K sigmoids, not one softmax over K classes. Hold that distinction; it comes back when we talk output layers.
Why does any of this matter in the real world? Because the stakes are asymmetric and that changes everything. Spam filtering: a false positive, flagging a real email as spam, costs you differently than a false negative, letting spam through. Disease diagnosis: a false negative, telling a sick person they're healthy, can be fatal, so you tune the model to catch more positives even at the cost of some false alarms. Churn prediction: it's imbalanced, most customers stay, so a model that just predicts "stay" for everyone looks accurate and is useless. Credit default: this is a historical killer app, and regulators demand interpretable models, so the coefficients literally become reason codes you can show a customer, which is a big reason logistic regression still runs in banking today. And then click-through rate, conversion, fraud, sentiment. Classification is everywhere money and risk live.
Now the tempting mistake, and I want to walk through exactly why it fails, because the failure teaches you the fix. Last episode we built a beautiful tool that outputs a number: y-hat equals w dot x plus b. Why can't we just run linear regression, and say if the output is above some threshold, call it class one, otherwise class zero?
Here's why. That linear output is unbounded. It ranges over all real numbers, negative infinity to positive infinity. If you try to read it as a probability, you get nonsense: a probability of 3.7, a probability of negative 2.1. Those don't exist. The output is not a probability and it's not calibrated. And it gets worse during training. Imagine a point that's confidently, obviously in the positive class, y equals one. Suppose the linear model predicts five for it. Under mean squared error, that costs you five minus one, squared, which is sixteen. So the training loss actively punishes the model for being too correct, for being too confident on a point it got right. That's backwards. Mean squared error fights confident correct predictions. And on top of that, a single outlier, one extreme point, drags the least-squares fit toward itself and physically moves the decision boundary, because least squares is trying to minimize distance to every point including the far one.
But here's the insight that saves us. The fix isn't to throw away the linear score. The linear score is fine. The fix is to reshape that unbounded score into a proper probability. And that reshaping is the sigmoid.
So let's build the bridge. We keep the linear combination exactly as it was. We define the score, which we'll call z, sometimes called the logit: z equals w dot x plus b. That's the same dot product from the linear regression episode, the weighted sum of features plus a bias. And z lives on the whole real line, negative infinity to positive infinity.
Now we squash it. We pass z through the sigmoid function, also called the logistic function. Sigma of z equals one over one plus e to the minus z. And our prediction, the probability p, equals sigma of w dot x plus b. That's the entire model. Linear score in, sigmoid squash, probability out.
Let me give you the sigmoid's personality, because these properties do real work later. Its range is the open interval from zero to one. Open, meaning it approaches zero and approaches one but never exactly reaches either. It's S-shaped and monotonically increasing: bigger z always means bigger probability. Sigma of zero equals one half, and that's a hinge point: z equals zero corresponds to probability one half, which corresponds to sitting right on the decision boundary. It has a symmetry: sigma of negative z equals one minus sigma of z. It saturates: as z goes to positive infinity the output flattens toward one, as z goes to negative infinity it flattens toward zero. That flattening far from the boundary is a feature, it means the model gets confident, but it's also the source of the vanishing-gradient problem we'll meet in deep learning. And its derivative has a lovely form: sigma-prime of z equals sigma of z times one minus sigma of z. That derivative peaks at zero point two-five right at z equals zero, and it decays toward zero as the magnitude of z grows. One practical note: for large negative z, e to the minus z overflows numerically, so stable implementations use something like scipy's expit function rather than computing it naively.
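For the show notes, here is a minimal sketch of a numerically stable sigmoid in plain NumPy, checking the properties just listed. The function name and the test values are my own choices for illustration; scipy's expit is the ready-made alternative mentioned above.

```python
import numpy as np

def sigmoid(z):
    """Stable logistic function for scalars or arrays."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))  # always in (0, 1], so it cannot overflow
    # z >= 0: 1 / (1 + e^-z);  z < 0: e^z / (1 + e^z), the same value rewritten
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

z = np.array([-800.0, -2.0, 0.0, 2.0, 800.0])
p = sigmoid(z)

# Symmetry: sigma(-z) = 1 - sigma(z)
symmetry_gap = float(np.max(np.abs(sigmoid(-z) - (1.0 - p))))

# Derivative sigma * (1 - sigma), evaluated at z = 0 where it peaks
p_zero = float(sigmoid(0.0))
slope_at_zero = p_zero * (1.0 - p_zero)

print(p)
print(symmetry_gap, slope_at_zero)
```

One caveat: the math says the output never reaches zero or one, but in floating point the extreme inputs here do round to exactly 0.0 and 1.0, which is why the log loss later needs clipping.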
Now the deep why, the part that makes logistic regression click. Invert the sigmoid. If p equals sigma of z, then solving for z gives z equals the log of p over one minus p. Look at that quantity, p over one minus p. That's the odds. If p is zero point eight, the odds are four, as in four to one. Take the log of the odds and you get the log-odds, also called the logit. So logit of p equals log of p over one minus p. And since z equals w dot x plus b, the model is literally saying: the log-odds are linear in x. Log of p over one minus p equals w dot x plus b.
Sit with the geometry of that for a second. Probability is boxed into zero to one. Odds open that up to zero to infinity. Log-odds stretch it all the way out to negative infinity to positive infinity. The logit transform takes a bounded probability and maps it onto the unbounded real line, which is exactly the place a linear model wants to live. So here's the honest one-sentence definition of the whole method: logistic regression is a linear model in log-odds space. That's why it has "regression" in the name even though it does classification. It fits a linear regression, but on the logit.
That log-odds framing also hands you interpretability, which is a genuine advantage. A one-unit increase in feature x-j multiplies the odds by e to the w-j, holding the other features fixed. That number, e to the w-j, is called the odds ratio. So if a weight w-j is zero point six-nine-three, then e to the zero point six-nine-three is about two, which means each one-unit bump in that feature doubles the odds. A weight of zero gives an odds ratio of one, meaning no effect. A negative weight gives a ratio below one, a protective factor that lowers the odds. This is why logistic regression dominates epidemiology and credit scoring: you can point at a coefficient and say, in plain language, what it does to the odds.
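A small sketch of that odds-ratio arithmetic, using only the standard library. The starting scores of minus three and zero are arbitrary values I picked to show that the odds multiply by the same factor everywhere, while the change in probability depends on where you start.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

w_j = 0.693                  # the example weight from the text
odds_ratio = math.exp(w_j)   # close to 2

results = []
for z_start in (-3.0, 0.0):  # two arbitrary starting scores
    p_before = sigmoid(z_start)
    p_after = sigmoid(z_start + w_j)   # bump the feature by one unit
    ratio = odds(p_after) / odds(p_before)
    results.append((p_before, p_after, ratio))
    print(f"p: {p_before:.4f} -> {p_after:.4f}, odds multiplied by {ratio:.4f}")
```

The odds roughly double in both cases, but the probability moves by a few points in one and by about seventeen points in the other.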
Now, decision boundary, and why we call this a linear classifier even though the sigmoid is a curve. The default decision rule is: predict class one if p is at least one half, otherwise class zero. But watch what that condition unfolds into. p at least one half is the same as sigma of z at least one half, which happens exactly when z is at least zero, which is exactly when w dot x plus b is at least zero. So the boundary between the two classes is the set of points where w dot x plus b equals zero. And that set is a hyperplane: a line in two dimensions, a plane in three, a flat, d-minus-one-dimensional surface in d dimensions.
That's the whole reason we call logistic regression a linear classifier. The boundary is flat, even though the sigmoid squashing function is nonlinear. Let that land: the nonlinearity only reshapes the score into a probability. It does not bend the boundary. The vector w is the normal vector to that hyperplane. Its direction sets the boundary's orientation, and its magnitude sets how sharply the sigmoid transitions across it. A large-magnitude w makes the sigmoid act almost like a step function, snapping from zero to one over a tiny region, which is the model being very confident.
And because the boundary is flat, there are things logistic regression simply cannot do. The canonical example is XOR, exclusive-or. Four points: zero-zero maps to class zero, one-one maps to class zero, zero-one maps to class one, one-zero maps to class one. Try to draw a single straight line that puts the two class-one points on one side and the two class-zero points on the other. You can't. There is no such line; any single line gets at most three of the four points right. And the cross-entropy fit doesn't even manage that: by symmetry its optimum sets every weight to zero, so every prediction sits at one half. So logistic regression on XOR is a coin flip, about fifty percent accuracy, no matter how long you train it. This is the famous limitation Minsky and Papert wrote up in their 1969 book Perceptrons, and it more or less froze neural network research for years.
There are two escape hatches, and I'm flagging them now because they're the road ahead. One, engineer nonlinear features by hand. Add a feature that's x-one times x-two, or x-one squared. Suddenly a boundary that's flat in the expanded feature space looks curved back in the original space, and you can carve up XOR. That's the basis-expansion, and eventually the kernel, idea. Two, stack neurons. Let hidden layers learn their own features instead of you hand-crafting them. That's neural networks, and we'll get there.
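Here is a sketch of the XOR failure and the first escape hatch. The weights in the second half are hand-picked by me for illustration, not learned; they just show that a flat boundary exists once the product feature is added.

```python
import numpy as np

X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# With w = 0 and b = 0 every prediction is 0.5, and the cross-entropy
# gradient is exactly zero. The loss is convex, so this is the optimum.
p0 = sigmoid(X @ np.zeros(2) + 0.0)
grad_w = X.T @ (p0 - y) / len(y)
grad_b = float(np.mean(p0 - y))

# Escape hatch one: add the engineered feature x1 * x2.
X_aug = np.column_stack([X, X[:, 0] * X[:, 1]])
w_aug = np.array([1.0, 1.0, -2.0])   # hand-picked, hypothetical weights
b_aug = -0.5
pred = (X_aug @ w_aug + b_aug >= 0).astype(float)

print(grad_w, grad_b)
print(pred, y)
```

In the three-feature space the boundary is still a flat plane; it only looks curved when projected back onto x-one and x-two.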
Okay. We have the model and the boundary. How do we train it? What's the loss?
Your first instinct, after last episode, is mean squared error again. Just take sigma of z minus y, square it, average it. Don't. Two things go wrong. First, mean squared error composed with the sigmoid is non-convex in the weights. The loss surface has hills and valleys and plateaus, so gradient descent can get stuck in a local minimum and never find the best answer. Second, the gradient goes weak in exactly the wrong place. The mean-squared-error gradient carries a factor of sigma-prime of z, that sigma times one minus sigma term. Now picture a point where the model is confidently wrong: z is very negative, so p is near zero, but the true label y is one. There, sigma-prime is nearly zero, so the gradient is nearly zero, so the model barely learns from its most catastrophic mistake. It shrugs at its worst errors. Exactly backwards.
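To put numbers on that shrug, here is a sketch comparing the two gradients with respect to the score z on a confidently wrong example with true label one. The three scores are arbitrary picks; the cross-entropy gradient uses the p minus y form this episode goes on to derive.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

y = 1.0   # true label
rows = []
for z in (-2.0, -6.0, -10.0):   # increasingly confident, increasingly wrong
    p = sigmoid(z)
    # d/dz of (p - y)^2: the chain rule brings in sigma'(z) = p * (1 - p)
    grad_mse = 2.0 * (p - y) * p * (1.0 - p)
    # d/dz of the cross-entropy: the sigma' factor cancels, leaving p - y
    grad_ce = p - y
    rows.append((z, p, grad_mse, grad_ce))
    print(f"z={z:6.1f}  p={p:.5f}  dMSE/dz={grad_mse:.6f}  dCE/dz={grad_ce:.6f}")
```

As the mistake gets worse, the squared-error gradient shrinks toward zero while the cross-entropy gradient approaches its full magnitude of one.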
The right loss is log loss, also called binary cross-entropy, also called negative log-likelihood. Same object, three names, and I'll show you why they're the same. For a single example, the loss is negative the quantity: y times log p, plus one minus y times log of one minus p, where p is sigma of z. Average that over all n examples and you have the training loss. Notice that only one term is ever active per example. If the true label y is one, the second term dies and you're left with negative log p. If y is zero, the first term dies and you're left with negative log of one minus p.
Feel what that does. Suppose the truth is one and you predict p heading toward zero. Then negative log p heads toward positive infinity. A single confident wrong answer produces an enormous loss that dominates everything. Now suppose you predict correctly and confidently, p near one when the truth is one. Then negative log p heads toward zero. So this loss savagely punishes confident mistakes and rewards confident correct calls. There's an information-theoretic reading too: cross-entropy is the expected surprise, measured in bits or nats. Log loss is negative the log of the probability the model assigned to what actually happened. It's literally how surprised the model was by reality.
And here's the payoff mean squared error couldn't give you: cross-entropy composed with the sigmoid and the linear score is convex in the weights and the bias. One global minimum, at least on non-separable data. Gradient descent reliably rolls down to it. No local-minimum trap. That convexity is the whole reason we switch losses.
Now I want to justify this loss instead of just asserting it, because the justification is one of the most beautiful things in the subject. It's maximum likelihood estimation.
Model each label as a Bernoulli random variable. A Bernoulli is just a weighted coin: it comes up one with probability p and zero with probability one minus p. And here p equals sigma of w dot x plus b. There's a compact way to write the Bernoulli probability mass function: P of y given x equals p to the power y, times one minus p to the power one minus y. Check it. If y is one, that's p to the one times anything to the zero, which is p. If y is zero, that's p to the zero times one minus p to the one, which is one minus p. Neat.
Now assume the data points are independent. The likelihood of the whole dataset is the product over all examples of p-i to the y-i times one minus p-i to the one minus y-i. Products are miserable to optimize, so take the log, which is safe because log is monotonic and doesn't move the location of the maximum. The log-likelihood becomes a sum: sum over i of y-i log p-i plus one minus y-i times log of one minus p-i.
Now look at that sum and look back at our loss. Maximizing that log-likelihood is exactly the same as minimizing its negative, which is exactly the cross-entropy, up to the one-over-n averaging factor. So log loss is not arbitrary and it's not a trick someone invented. It is the negative log-likelihood of a Bernoulli model. Cross-entropy is maximum likelihood for classification.
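Written out, the chain from Bernoulli likelihood to log loss looks like this. The symbol J for the averaged loss is my own label; everything else follows the definitions above.

```latex
\begin{align*}
P(y_i \mid x_i) &= p_i^{\,y_i}\,(1-p_i)^{1-y_i},
  \qquad p_i = \sigma(w \cdot x_i + b) \\
L(w, b) &= \prod_{i=1}^{n} p_i^{\,y_i}\,(1-p_i)^{1-y_i} \\
\log L(w, b) &= \sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1-y_i)\log(1-p_i) \,\bigr] \\
J(w, b) &= -\frac{1}{n}\,\log L(w, b)
  = -\frac{1}{n}\sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1-y_i)\log(1-p_i) \,\bigr]
\end{align*}
```

Maximizing the third line and minimizing the fourth pick out the same weights, since they differ only by a sign and a positive constant.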
And this connects straight back to last episode in a way I love. Mean squared error is also a maximum-likelihood estimate. If you model y given x as Gaussian, a bell curve centered at w dot x plus b, then minimizing squared error is exactly maximum likelihood under Gaussian noise. So there's one recipe underneath both episodes. Pick a distribution for y given x. Write its negative log-likelihood. That's your loss. Gaussian noise gives you mean squared error, that's regression. Bernoulli gives you cross-entropy, that's classification. Same recipe, different noise model. Keep that; it's the seed of the probability chapter later in the course.
Now the gradient, and here's a small miracle. You'd expect that differentiating cross-entropy through a sigmoid through a linear function would produce a mess of sigmoid-derivative terms. It doesn't. When you apply the chain rule and plug in that identity, sigma-prime equals sigma times one minus sigma, all the messy pieces cancel. What's left is stunningly clean. The gradient of the loss with respect to w is one over n, times the sum over i of the quantity p-i minus y-i, times x-i. And the gradient with respect to the bias b is one over n times the sum of p-i minus y-i. In words: the gradient is the average of prediction minus target, times the input.
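If you'd rather not take the cancellation on faith, here is a sketch that checks the clean formula against a finite-difference gradient. The random data, the seed, and the step size are all arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.5).astype(float)
w = rng.normal(size=d)
b = 0.3

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b):
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Analytic gradient: average of (p - y) times x, and average of (p - y)
p = sigmoid(X @ w + b)
grad_w = X.T @ (p - y) / n
grad_b = float(np.mean(p - y))

# Central finite differences, one coordinate at a time
eps = 1e-6
basis = np.eye(d)
num_grad_w = np.array([
    (loss(w + eps * basis[j], b) - loss(w - eps * basis[j], b)) / (2 * eps)
    for j in range(d)
])
num_grad_b = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)

max_gap = max(float(np.max(np.abs(grad_w - num_grad_w))),
              abs(grad_b - num_grad_b))
print(max_gap)
```

The two gradients should agree to many decimal places, which is a handy sanity check any time you hand-derive a gradient.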
Now stare at that and compare it to last episode. Linear regression's gradient was the average of y-hat minus y, times x. Logistic regression's gradient is the average of p minus y, times x. It's the same form. The only difference is that where linear regression had y-hat equals z, logistic regression has p equals sigma of z. That's it. That deep structural unity is not a coincidence. The sigmoid is what's called the canonical link function for the Bernoulli distribution in the generalized-linear-model, or exponential-family, framework. When you match the right link to the right distribution, the algebra rewards you with this clean residual gradient. And this same p-minus-y form comes back later as the delta rule, the base case of backpropagation for a sigmoid output neuron. You're seeing the seed of backprop right now.
One more contrast with last episode. Linear regression had a closed form, the normal equation: w equals the inverse of X-transpose X, times X-transpose y. Logistic regression has no closed form. Because p-i is sigma of w dot x-i, the equations you'd set to zero are transcendental, nonlinear in w, and there's no algebra that solves them in one shot. So you optimize iteratively. You can use plain gradient descent, the exact update we derived last time, w gets w minus learning rate times gradient. Or you can use Newton's method, or iteratively reweighted least squares, which use second-derivative, Hessian information for faster, quadratic convergence near the optimum. Scikit-learn's newton-cg solver lives in that second-order family, and its default lbfgs solver is the quasi-Newton cousin: it approximates the curvature from recent gradients instead of computing the Hessian. And because the loss is convex, all of these land on the same global optimum, barring the perfect-separation problem I'll get to.
Let me make this concrete with scikit-learn, current version 1.9.0, released June 2nd, 2026, running on Python 3.11 through 3.14. We'll use the Breast Cancer Wisconsin dataset that ships with it: 569 samples, 30 features, binary target, where the encoding has benign as one and malignant as zero.
The flow goes like this. You load the data with load-breast-cancer, asking for X and y directly. You split into train and test, holding out twenty percent, with a fixed random seed for reproducibility, and you stratify on y so both splits keep the same class balance. Then, and this matters, you standardize. You fit a StandardScaler on the training features only, and use it to transform both train and test. You fit scikit-learn's LogisticRegression, and you'll want to raise max-iter to something like a thousand. Then you call its methods. Predict gives you hard zero-or-one labels. Predict-proba gives you, per row, the probability of each class, two numbers that sum to one, ordered to match the classifier's classes attribute. Score gives mean accuracy, and on this dataset you'll typically see around ninety-six to ninety-eight percent, because the classes are nearly linearly separable in thirty dimensions. The coef attribute holds the weights, shape one by thirty, that's your w. The intercept attribute is the bias b. And decision-function gives you the raw score z, the logit, w dot x plus b, before the sigmoid, which is what you want for custom thresholds and for the ROC curve next episode.
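The flow just described can be sketched as follows. The seed value of 42 is my own arbitrary choice, and everything else is left at the library defaults, so treat the exact accuracy you see as dependent on the split.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold out 20 percent, stratified so both splits keep the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training features only, then transform both splits
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_s, y_train)

labels = clf.predict(X_test_s)            # hard 0/1 labels
proba = clf.predict_proba(X_test_s)       # columns follow clf.classes_
z = clf.decision_function(X_test_s)       # raw score w.x + b
accuracy = clf.score(X_test_s, y_test)    # mean accuracy

# The sigmoid of the raw score should reproduce P(class 1)
recovered = 1.0 / (1.0 + np.exp(-z))

print(clf.classes_, clf.coef_.shape, clf.intercept_.shape)
print(round(accuracy, 3))
```

Checking that the sigmoid of decision-function matches the second predict-proba column is a quick way to convince yourself there's nothing else hiding inside the model.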
Why did we standardize? Two reasons. First, regularization is scale-sensitive. L2 penalizes the sum of the weights squared, treating every feature's weight equally. If one feature runs zero to a hundred, like age, and another runs zero to a million, like income, the raw weights live on wildly different scales and the penalty hits them unfairly. Standardizing to zero mean and unit variance levels the playing field. Second, convergence. Gradient and quasi-Newton solvers converge much faster on well-conditioned, standardized data. Skip scaling and you'll often get a ConvergenceWarning and have to crank up max-iter.
A quick tour of the important parameters in the 1.9 series, so you can read the docs with confidence. Penalty defaults to L2, with options for L1, elastic-net, or none. C defaults to one point zero, and here's the gotcha: C is the inverse regularization strength. Smaller C means stronger regularization, more shrinkage. Larger C means weaker regularization. Think of C as roughly one over lambda. The solver defaults to lbfgs; liblinear is for small data and L1 but does one-vs-rest only; saga handles large and sparse data with any penalty; and there's newton-cg and newton-cholesky. Not every solver supports every penalty. Max-iter defaults to a hundred, raise it to five hundred or a thousand if you see that convergence warning. Class-weight set to balanced reweights the loss inversely to class frequency, and that's your standard lever for imbalanced data. And note the multi-class parameter is being deprecated: modern scikit-learn auto-selects the multinomial approach for more than two classes with lbfgs, and you'd wrap things in OneVsRestClassifier only if you explicitly want one-vs-rest.
Now, the part that makes you own it. We rebuild logistic regression from scratch in NumPy, and the point is how little changes from last episode's linear-regression-from-scratch. You define a sigmoid function, one over one plus np-dot-exp of negative z, clipping z or using expit for stability. You define a binary cross-entropy loss, clipping p into a tiny interval near zero and one so you never take the log of zero. Then the training loop, per epoch, does this. Compute the linear score z equals X times w plus b, our dot product. Pass it through the sigmoid to get p. Compute the error, p minus y, the same residual form as before. Compute grad-w as X-transpose times error, over n, which is that one-over-n sum of p-i minus y-i times x-i. Compute grad-b as the mean of the error. Then update: w gets w minus learning rate times grad-w, and b gets b minus learning rate times grad-b. On a toy dataset of two linearly-separable Gaussian blobs, this hits about a hundred percent training accuracy.
Here's the thing I want you to actually notice. That one-line update, w minus learning rate times grad-w, is character-for-character identical to the linear-regression loop from last episode. The only differences in the whole program are that we compute p equals sigmoid of z instead of just using z, and we use cross-entropy instead of squared error. That's the spiral. And if you print the loss each epoch, with a reasonably small learning rate you'll watch it decrease steadily, smoothly downhill, because the loss is convex; too large a step can still overshoot and bounce. This is the same cross-entropy scikit-learn minimizes under the hood, plus its default L2 penalty; it just takes smarter steps with lbfgs. Want regularization? Add lambda over n times w onto grad-w, and don't regularize the bias.
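Here is one way the from-scratch rebuild might look. The blob centers, the seed, the learning rate of 0.1, and the 500 epochs are my own illustrative settings, and the optional lam argument is the lambda-over-n penalty on the weights only.

```python
import numpy as np

def sigmoid(z):
    # Clip the score so np.exp never overflows
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def bce(y, p, eps=1e-12):
    # Clip p away from 0 and 1 so we never take log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit(X, y, lr=0.1, epochs=500, lam=0.0):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    history = []
    for _ in range(epochs):
        z = X @ w + b                            # linear score
        p = sigmoid(z)                           # probability
        error = p - y                            # same residual form as before
        grad_w = X.T @ error / n + (lam / n) * w # L2 term skips the bias
        grad_b = error.mean()
        history.append(bce(y, p))                # loss before this update
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, history

# Toy data: two Gaussian blobs (hypothetical centers at -2 and +2)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(+2.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

w, b, history = fit(X, y)
train_acc = float(np.mean((sigmoid(X @ w + b) >= 0.5) == y))
print(history[0], history[-1], train_acc)
```

The first recorded loss is the log of two, about 0.693, because with all-zero weights every prediction starts at one half.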
Speaking of which, regularization. L2, also called ridge, is scikit-learn's default. It adds lambda times the sum of the squared weights to the loss, with strength roughly one over C, and it shrinks coefficients smoothly toward zero without ever making them exactly zero. It controls overfitting and keeps the weights finite. L1, also called lasso, adds lambda times the sum of the absolute values of the weights, and it produces sparsity: it drives some coefficients exactly to zero, which is automatic feature selection. L1 needs the liblinear or saga solver. Elastic net is a convex blend of L1 and L2, via saga with an l1-ratio. And to repeat the C convention because it trips everyone: C going to infinity means basically no regularization, a pure maximum-likelihood fit; C going toward zero crushes the coefficients toward zero and underfits. Tune it with cross-validation, and there's a LogisticRegressionCV that does exactly that.
But there's a special reason regularization matters more for classification than you might expect, and it's called perfect separation, or complete separation. Suppose your data actually is linearly separable, a clean line splits the classes. Then the unregularized maximum-likelihood fit has no finite optimum. Think about why. To make separable points more and more confident, probabilities pushed toward one and zero, the optimizer makes the sigmoid steeper and steeper, which means driving the magnitude of w toward infinity. A steeper sigmoid always lowers the loss a little further, forever. So the weights blow up, the model becomes pathologically overconfident, and the optimizer never converges. Albert and Anderson wrote this up in 1984. Regularization, or equivalently a prior on the weights, bounds the magnitude of w and restores a finite, sensible solution, which is a strong reason scikit-learn regularizes by default. Statisticians have another fix, Firth's penalized likelihood from 1993. The symptom to recognize: enormous coefficients, a ConvergenceWarning, and probabilities pinned dead at zero or one.
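You can watch the blow-up on a four-point, one-feature dataset. This is a sketch under my own assumptions: no bias term, because the data is symmetric around zero, a learning rate of 0.5, and an L2 strength of 0.1 added directly to the gradient.

```python
import numpy as np

# Perfectly separable: negatives on the left, positives on the right
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def fit_w(steps, lam=0.0, lr=0.5):
    """Gradient descent on a single weight, with optional L2 penalty."""
    w = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-w * x))
        grad = np.mean((p - y) * x) + lam * w
        w -= lr * grad
    return float(w)

step_counts = (100, 1000, 10000)
unreg = [fit_w(s) for s in step_counts]
reg = [fit_w(s, lam=0.1) for s in step_counts]

for s, a, r in zip(step_counts, unreg, reg):
    print(f"steps={s:6d}  unregularized w={a:.3f}  regularized w={r:.3f}")
```

Without the penalty the weight keeps climbing the longer you train; with it, the weight settles and stays put.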
Let me go lighter on multiclass, because it mostly foreshadows neural-net output layers. Say you have more than two classes. One approach is one-vs-rest, also called one-vs-all: train K binary logistic classifiers, each one class versus everything else, and at prediction time run all K and pick the highest score. Simple, parallelizable, works with any binary classifier. The downside is the K probabilities aren't jointly normalized, they don't sum to one, and each was trained on a lopsided one-versus-the-rest split. The other approach is multinomial, or softmax, logistic regression, also called the maximum-entropy classifier. One joint model. Each class k gets its own weights w-k, its own score z-k equals w-k dot x plus b-k, and you normalize all the scores together with the softmax: softmax of z, for class k, is e to the z-k over the sum across all classes j of e to the z-j. That gives a proper probability distribution: all positive, summing to one.
And softmax generalizes the sigmoid exactly. For two classes, softmax reduces precisely to the sigmoid, because only the difference z-one minus z-zero matters. Sigmoid is just the two-class softmax. Cross-entropy generalizes the same way: categorical cross-entropy is negative the sum over classes of y-k times log p-k, with y a one-hot vector, and for two classes it collapses right back to binary cross-entropy. The maximum-likelihood derivation is identical, just with a Categorical distribution instead of a Bernoulli. And the gradient keeps that same p minus y times x form, per class. Here's the forward pointer: softmax over linear scores is exactly the output layer of a neural-network classifier. This whole machine, linear score, squash, cross-entropy, same gradient, is the template sitting at the top of every classification network.
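A quick numerical check of that reduction, as a sketch. The two scores are arbitrary, and subtracting the maximum inside softmax is the usual stability trick, which doesn't change the result.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))   # shifting by the max avoids overflow
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z0, z1 = 0.4, 2.1               # arbitrary scores for class 0 and class 1
p_softmax = float(softmax([z0, z1])[1])
p_sigmoid = float(sigmoid(z1 - z0))   # only the difference matters

three_class = softmax([2.0, 1.0, -1.0])
print(p_softmax, p_sigmoid)
print(three_class, three_class.sum())
```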
Now, evaluation, but only a light touch, because the full treatment is next episode. Start with accuracy: correct over total. Intuitive, and it lies on imbalanced data. If ninety-nine percent of transactions are legit, a model that always predicts "not fraud" scores ninety-nine percent accuracy while catching exactly zero fraud. Accuracy alone is a trap when the classes are skewed. So here's the vocabulary we'll develop next time, just one sentence each. The confusion matrix, true positives, false positives, false negatives, true negatives, is the full picture. Precision is true positives over true positives plus false positives: when I say positive, how often am I right. Recall, also called sensitivity, is true positives over true positives plus false negatives: of the actual positives, how many did I catch. F1 is the harmonic mean of the two. And ROC-AUC measures ranking quality across all possible thresholds.
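The fraud example can be worked through directly. This is a sketch on made-up labels, one thousand transactions with ten frauds, scored against a model that always says "not fraud." The helper names are illustrative, and returning zero for precision when nothing is predicted positive is a convention assumed here, not the only choice.

```python
def confusion_counts(y_true, y_pred):
    # Returns (tp, fp, fn, tn) for binary labels in {0, 1}.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def safe_div(num, den):
    # Assumed convention: report 0.0 when the denominator is empty.
    return num / den if den else 0.0

y_true = [1] * 10 + [0] * 990   # 1% fraud (made-up data)
y_pred = [0] * 1000             # always predict "not fraud"

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = safe_div(tp, tp + fp)
recall = safe_div(tp, tp + fn)

print(accuracy, precision, recall)  # 0.99 0.0 0.0
```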
Which brings up thresholds. That zero point five cutoff is not sacred. It's just sigma of zero equals one half, an accident of where the sigmoid crosses. For imbalanced or asymmetric-cost problems, move it. Lower the threshold to catch more positives, higher recall, which is what you want for cancer screening. Raise it to be more precise when false alarms are expensive. Predict-proba and decision-function give you that knob; the plain predict method just bakes in zero point five. And to be explicit: next episode is the full metrics treatment, confusion matrix, the precision-recall tradeoff, ROC and PR curves, AUC, threshold selection, and calibration. Today we're only planting the vocabulary.
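Here's a sketch of turning that knob by hand. The probabilities and labels below are invented for illustration; with a fitted scikit-learn model they would come from something like model.predict_proba(X)[:, 1], where model and X are hypothetical names.

```python
def predict_at(probs, threshold):
    # Label as positive anything at or above the chosen cutoff.
    return [1 if p >= threshold else 0 for p in probs]

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up predicted probabilities and true labels.
probs = [0.95, 0.80, 0.62, 0.45, 0.40, 0.30, 0.22, 0.10, 0.05, 0.02]
y_true = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

results = {}
for t in (0.25, 0.5, 0.75):
    results[t] = precision_recall(y_true, predict_at(probs, t))
    print(t, results[t])
```

On this toy data, lowering the cutoff to 0.25 catches every positive at the cost of two false alarms, while raising it to 0.75 removes the false alarms but still misses half the positives.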
Let me hit the common pitfalls fast, because naming them inoculates you. One: "logistic regression is regression." No. It's used for classification. The name is historical: the model is a linear regression on the logit, the log-odds, not on the label itself. That's the number-one beginner confusion. Two: perfect separation makes weights diverge and overconfidence explode; recognize it by huge coefficients plus a convergence warning, and fix it by regularizing. Three: not scaling features gives slow or failed convergence and unfair regularization; use StandardScaler first. Four: assuming zero point five is always the threshold wrecks recall on rare-positive problems; choose the threshold from costs or validation. Five: treating predict-proba as perfectly calibrated. The probabilities are reasonable but not guaranteed, especially under strong regularization; check a calibration curve and, if needed, fix it with Platt scaling or isotonic regression, via CalibratedClassifierCV. Six: forgetting the linear-boundary limitation and being shocked when it fails on XOR or concentric-ring data; fix with feature engineering or a nonlinear model. Seven: multicollinearity, correlated features, makes coefficient magnitudes and even signs unstable, so the odds-ratio interpretation breaks even when the predictions are fine; L2 helps stabilize it. Eight: reading a coefficient as a direct effect on probability. It's an effect on log-odds. Exponentiate it for the odds ratio, and remember the effect on probability itself is nonlinear, it depends on where you sit on the S-curve.
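Pitfall eight is easy to see with numbers. This sketch takes a single hypothetical coefficient of 0.7 and bumps its feature by one unit at three different points on the S-curve: the odds ratio is the same everywhere, but the change in probability is not.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

w_j = 0.7                    # hypothetical coefficient
odds_ratio = math.exp(w_j)   # multiplicative change in odds per unit of x_j

rows = []
for z in (-4.0, 0.0, 4.0):   # starting log-odds: far left, middle, far right
    p_before = sigmoid(z)
    p_after = sigmoid(z + w_j)   # raise x_j by one unit
    rows.append((z, p_before, p_after, p_after - p_before,
                 odds(p_after) / odds(p_before)))

for z, p_before, p_after, delta_p, ratio in rows:
    print(f"z={z:+.1f}  p: {p_before:.4f} -> {p_after:.4f}  "
          f"delta_p={delta_p:.4f}  odds ratio={ratio:.4f}")
```

The probability moves most in the middle of the curve and barely at all in the saturated tails, even though the odds are multiplied by the same factor each time.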
A little history, because the names carry it. The logistic function came from Pierre-François Verhulst, a Belgian, in papers from 1838, 1845, and 1847, modeling population growth under limited resources, a self-limiting S-curve that saturates at a carrying capacity. He coined the word "logistique." Same sigma of z equals one over one plus e to the minus z curve, born in demography. It got rediscovered in the 1920s by Raymond Pearl and Lowell Reed in population biology. The term "logit" comes from Joseph Berkson in 1944, from "logistic unit," a deliberate pun on "probit," probability unit, which Chester Bliss had coined in 1934 for the probit model. Berkson championed logistic over probit for bioassay because it was computationally simpler, no Gaussian integral, and analytically cleaner. That probit-versus-logit contrast is worth knowing: probit uses the Gaussian cumulative distribution as its squashing function, the two are numerically very close, logistic just has slightly heavier tails, and logit won on the closed form and the interpretable odds-ratio story, while probit persists in econometrics. Then David Cox in 1958 wrote "The regression analysis of binary sequences," the foundational statistical treatment, and it was later folded into the generalized-linear-models framework of Nelder and Wedderburn, 1972, as the GLM for Bernoulli and binomial data with the logit link. And its workhorse resume is long: decades as the default in biostatistics and epidemiology with odds ratios in case-control studies, in credit scoring where regulators love the interpretable reason codes and FICO-style scorecards, in social science, clinical risk models, marketing churn and response, and internet-scale click-through prediction, where early ad systems were literally massive L1-regularized logistic regressions. It's still a top baseline: fast, interpretable, calibratable, and genuinely hard to beat when the signal is roughly linear in good features.
Which leaves the punchline, the reason this episode is where it is in the course. Look one more time at the model: z equals w dot x plus b, followed by sigma of z. That is exactly a single artificial neuron with a sigmoid activation. Inputs x, weights w, a bias b, a weighted sum z that people call the pre-activation, a nonlinear activation sigma, and an output p. It's the same unit as the perceptron, in the McCulloch-Pitts lineage, with the hard step threshold swapped for a smooth sigmoid. So logistic regression is the single neuron, precisely as we promised last episode when we said the path runs logistic regression, then the single neuron, then neural networks. Learning its weights by gradient descent on cross-entropy is the base case of backpropagation: for one neuron, backprop just is that p minus y times x gradient we derived today, with plain p minus y for the bias.
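One way to convince yourself of that gradient is to check it against finite differences. This is a sketch for a single sigmoid neuron on one example; the weights, input, and function names are arbitrary choices for illustration.

```python
import math

def forward(w, b, x):
    # Pre-activation z = w . x + b, then the sigmoid.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(w, b, x, y):
    p = forward(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def analytic_grad(w, b, x, y):
    # The claimed closed form: (p - y) * x for the weights, (p - y) for the bias.
    p = forward(w, b, x)
    return [(p - y) * xi for xi in x], p - y

def numeric_grad(w, b, x, y, eps=1e-6):
    # Central finite differences, one parameter at a time.
    grad_w = []
    for j in range(len(w)):
        up = w[:j] + [w[j] + eps] + w[j + 1:]
        down = w[:j] + [w[j] - eps] + w[j + 1:]
        grad_w.append((cross_entropy(up, b, x, y)
                       - cross_entropy(down, b, x, y)) / (2 * eps))
    grad_b = (cross_entropy(w, b + eps, x, y)
              - cross_entropy(w, b - eps, x, y)) / (2 * eps)
    return grad_w, grad_b

# Arbitrary parameters and one labeled example.
w, b = [0.4, -1.2, 0.7], 0.1
x, y = [1.5, 0.3, -2.0], 1

gw_a, gb_a = analytic_grad(w, b, x, y)
gw_n, gb_n = numeric_grad(w, b, x, y)
max_gap = max([abs(a - n) for a, n in zip(gw_a, gw_n)] + [abs(gb_a - gb_n)])
print(max_gap)  # tiny: the two gradients agree
```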
And stacking those neurons gives you neural networks. Many units in a layer, feeding another layer. The hidden layers learn nonlinear features on their own, so that the final, still-linear-in-features boundary becomes curved back in the original input space, which is how a network solves XOR and everything past it. The single neuron is the LEGO brick. Sigmoid and softmax reappear everywhere in deep learning: sigmoid as an activation and as the binary output unit, softmax as the standard multiclass output layer, cross-entropy as the standard classification loss. Everything we did today, the linear score, the squashing, the cross-entropy, that clean p minus y times x gradient, is the seed the whole course keeps spiraling back to. Remember the callbacks: the dot product came from linear regression, the gradient descent came from last episode, mean squared error was Gaussian maximum likelihood and cross-entropy is Bernoulli maximum likelihood, the very same recipe. And this single neuron is the thing we're going to stack into a network.
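As a preview of that stacking, here's a sketch of three sigmoid neurons wired to compute XOR. The weights are hand-picked for illustration, not learned: one hidden unit acts roughly like OR, the other like AND, and the output neuron fires for "OR but not AND."

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_net(x1, x2):
    # Hidden layer: two logistic-regression units with hand-set weights.
    h_or = sigmoid(20 * x1 + 20 * x2 - 10)    # near 1 if either input is on
    h_and = sigmoid(20 * x1 + 20 * x2 - 30)   # near 1 only if both are on
    # Output unit: linear in the hidden features, then squashed.
    return sigmoid(20 * h_or - 20 * h_and - 10)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
predictions = [round(xor_net(a, b)) for a, b in inputs]
print(predictions)  # [0, 1, 1, 0]
```

The output neuron's boundary is still a straight line, just in the space of the two hidden features rather than the raw inputs.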