This is episode seven, logistic regression. In this episode, we're gonna talk about classifiers, namely the logistic regression classifier algorithm. So remember where we are in the artificial intelligence tree? We've gone down to machine learning, down to supervised learning, and then supervised learning is broken down into two subfields, classification and regression. We studied regression last episode with linear regression, which will give you a continuous variable output, a number, so the cost of a house in Portland, Oregon. And now we're gonna talk about classification, which will give you the class of a thing. Am I looking at a cat, a dog, a tree, a house? Or in the case of binary classification, is this a dog, yes or no, zero or one? Now you may have noticed right off the bat the phrase logistic regression. That is really confusing. Wait, I said that supervised learning is broken down into regression and classification, so those are two separate categories, and now we're talking about logistic and linear regression. Those both sound like regression to me. How is logistic regression classification? Actually, the term logistic regression is historical. It was a mistake, I believe, from what I've heard. Try to ignore, if you can, the fact that it has regression in the name. Logistic regression. I think what's going on there is that, as you'll see in a bit, we pipe our linear regression algorithm into our logistic function, so the classification step is a function of our linear regression algorithm. So I think that's why regression is in the title. It's like we're doing the logistic thingy to linear regression, logistic regression. And what does logistic mean? Well, what comes out of our classification function is what's called a logit, L-O-G-I-T. So I like to imagine it this way. Linear regression is for guessing numbers, like the cost of a house in Oregon, and logistic regression is for guessing classes, this, that, or the other thing. And what you do, imagine logistic regression is like this machine, like this cartoon machine with a conveyor belt going into it and a conveyor belt coming out of it. And so in comes our linear regression algorithm, and it goes down the conveyor belt, so it's inside of our logistic regression machine, and it kind of does that cartoon, like, bing, bang, boom. It looks like there's a fight going on inside of the logistic regression machine. And then out comes a number which tells us how confident the algorithm is that the thing we're looking at is a house, or how confident it is that the thing we're looking at is a tree. So out come these numbers associated with each class, and these numbers are called logits. So a 0.7 probability of this being a house, 0.5 of it being a tree, 0.3 of it being a dog, and then we pick the thing with the highest logit, with the highest probability, and we take that class. And that function of picking the class associated with the highest logit is called argmax. You'll see this in various machine learning libraries, argmax, A-R-G-M-A-X. It says find the thing with the highest number, take that class.
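If you want to see what that argmax step looks like in code, here's a minimal sketch with made-up class names and scores; none of these numbers come from a real model, they're just for illustration:

```python
import numpy as np

# Hypothetical per-class scores ("logits") for one image; the numbers are made up.
classes = ["house", "tree", "dog", "cat"]
scores = np.array([0.7, 0.5, 0.3, 0.1])

best = np.argmax(scores)   # index of the largest score
print(classes[best])       # -> "house"
```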
Now, a random aside. You'll notice that I said a 0.7 likelihood of house, 0.5 tree, 0.3 dog, whatever. Those don't add up to one, right? They don't add up to 100%. So that's not a real probability distribution. That's not a proper probability. If your architecture needs a proper probability distribution, then you pipe those logits into another machine, something called softmax. Softmax, S-O-F-T-M-A-X. It takes your logits, and it transforms them into a proper probability distribution where all of your logits add up to one. We won't cover softmax in this episode. We'll cover that in a later episode. So logistic regression takes into it linear regression, and then it does bing, bang, boom in the machinery, and then out come logits. One, two, three, four logits. Maybe we have a four-class system. Either this picture can be of a house, a tree, a dog, or a cat. And each logit associated with each class comes out of the machinery, and we pick the one with the highest value, in this case, house, with 0.7, by way of a function called argmax. Now, like the last episode, where I made things simpler to visualize by working with univariate rather than multivariate linear regression (and I'm just assuming you're going to take that Andrew Ng Coursera course, where you will learn the details of multivariate linear regression), I'm gonna make this episode simpler by working with binary classification. So is this picture a picture of a house or not? Yes or no? Zero or one? So in comes a picture, bing, bang, boom, out comes one logit, and it's gonna be a value between zero and one where zero represents no and one represents yes. So it might be 0.7, which is the logistic regression algorithm telling us that it is 70% confident that this is a picture of a house. It's not 100% confident, it's 70% confident. And what we're gonna do with logistic regression is say anything over 0.5 is yes and anything under 0.5 is no. So this is a yes. This is a picture of a house. We're just gonna guess that it's a picture of a house. Okay, the example that we're gonna be using for this episode is actually the same example from the last episode. We're piping in a spreadsheet of houses in Portland, Oregon. The rows are the houses themselves. Each column is a feature. So square footage, number of bedrooms, number of bathrooms, distance to downtown, et cetera. The last column from the previous episode was the cost of the house, $200,000, $300,000. That last column, remember, is called the labels or the Y values, the actual cost of the house. And we're gonna be using this spreadsheet to train our model, to learn the pattern, so that in the future we can make predictions. Well, again, logistic regression is not linear regression. We're not guessing a number, we're guessing a class. And so in this example, instead of working with the cost of a house, which is a continuous variable, we're gonna ask: do we consider this house expensive or not? Yes or no? Expensive or not expensive? So zero will be not expensive and one will be expensive. And so we'll go through this spreadsheet ourselves manually. Anything, let's just say, over $300,000, we'll consider expensive. And anything under $300,000, we'll consider not expensive. So we're gonna modify our spreadsheet. We're gonna open it up in Microsoft Excel. One row at a time, we're gonna say zero, one. One, one, zero, zero, zero. One, one, one, zero, zero. One, one, one, zero. Just replacing all these actual dollar amounts with whether or not we consider the house expensive. So we're working with classes here. In this case, binary classification. It could be one of two things.
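As a quick sketch of that relabeling step, here's how you might do it with numpy instead of Excel. The prices are made up, and $300,000 is just the arbitrary cutoff from the example:

```python
import numpy as np

# Hypothetical sale prices for a handful of Portland houses, in dollars.
prices = np.array([150_000, 320_000, 275_000, 410_000, 299_000])

# Anything over $300,000 counts as "expensive" (1); everything else is 0.
labels = (prices > 300_000).astype(int)
print(labels)   # -> [0 1 0 1 0]
```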
Now remember how the machine learning system works. We have a three-step process. Predict or infer, that's step one. Step two is our error or loss function. And step three is train or learn. So we're gonna pipe our spreadsheet into our logistic regression function. And it's gonna go through all the rows, row by row by row. And it's gonna make a whole bunch of predictions, a bunch of random shots in the dark. That's step one, the predict phase. And then step two, remember, we're gonna use an error or loss function, an error function, in order to determine how bad we did, how off we were. And then we're gonna do step three, which is to train our hypothesis function. We're going to train these theta parameters, the coefficients in our function. We're gonna update their values until we have a function that fits our data accurately. A line on a graph that fits our data accurately. Now it's not gonna be a line in the case of logistic regression. So let's dive in. Let's open up that cartoon machine, zoom in, and look at these three steps in detail. So the hypothesis function. In linear regression, remember we had kind of a scatter plot cloud of dots looking like a football pointing northeast. And we wanted to shoot a line straight through the center of that football. That's called your regression line in linear regression. Well, we're not gonna have numeric values in our case, in logistic regression. We're gonna have ones and zeros. So on one side are things that are expensive based on some combination of the features of the houses. And on the other side are things that are not expensive. So we need a function that somehow gives us zeros or ones or somewhere in between. And our linear function, that line going through the football cloud toward the northeast, does not give us one or zero. It gives us a number, 200,000, 300,000. So the function we're going to use is a mathematical function from statistics. It's called the logistic function, hence logistic regression. A logistic function, or a sigmoid function. And the reason it's called a sigmoid function, as an alternative to logistic function, is that it looks like an S. Imagine you draw an S, and with your fingertips you grab the top right end of the S and the lower left end, and you stretch it out. You stretch it out so that coming from the left, from negative infinity on the X axis, the curve hugs Y equals zero, and then once you get towards the Y axis, you start curving up really fast. You cross over the Y axis at Y equals 0.5, at one half. And then when X is positive, you start leveling out towards Y equals one as you head off to the right towards infinity. So it's an S on a graph. The bottom of the S runs along Y equals zero, the top runs along Y equals one, it shoots off to the right towards infinity and off to the left towards infinity, and it crosses over the Y axis at 0.5. So we want to fit this S curve, this sigmoid function or logistic function. We want to fit our data in the graph somehow to that function. What we want to do is create what's called a decision boundary that puts all the data on one side if it's yes, and all the data on the other side if it's no. We want to learn what that decision boundary is, where we cross over from no to yes. And we're going to train our theta parameters. Remember, that's from the linear regression episode. We have these theta parameters. They're numbers inside the function that we're going to learn. We want to train these theta parameters so that we get this good decision boundary.
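To give a feel for that S shape with actual numbers, here's a quick check, just plugging a few values into the sigmoid. Nothing here comes from the course; it's only arithmetic for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Far to the left the curve hugs 0, at the Y axis it's exactly 0.5,
# and far to the right it hugs 1: that's the stretched-out S.
for z in [-6, -2, 0, 2, 6]:
    print(z, round(float(sigmoid(z)), 4))
# -6 0.0025, -2 0.1192, 0 0.5, 2 0.8808, 6 0.9975
```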
So that's our hypothesis function or our objective function. It is the sigmoid or logistic function. So remember, hypothesis or objective function is the name for the function that we're using in the predict step, step one. And depending on the machine learning algorithm you're using for the task at hand, that function will be a specific function in math. So in this case, in logistic regression, it is the sigmoid or logistic function. In linear regression, it's just a linear function. I guess that's all you call it. Now let me just give you the formula for this function. The formula for the sigmoid function is one over one plus e to the negative linear regression. That's kind of weird, right? So one over one plus e to the negative z, where z is your linear regression function, or specifically theta transpose x, where theta is the vector of parameters that we're gonna learn, or weights, and x is the matrix of examples, your spreadsheet. And if that transpose word threw you off, that's a technical detail of the multivariate linear regression step that I skipped in the last episode. But you're gonna learn that in the Andrew Ng Coursera course. You'll learn all this stuff with vectorization and matrix algebra in the Andrew Ng course. So don't worry about that right now. But one more time, the logistic regression function that gives you that S-curve on a graph is one over one plus e to the negative theta transpose x. So linear regression is inside of that logistic regression function.
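If it helps to see that formula as code, here's a minimal sketch in Python with numpy. The feature values are made up, and I'm tacking on a column of ones for the intercept term, a detail the Ng course walks through:

```python
import numpy as np

def sigmoid(z):
    # The logistic (sigmoid) function: 1 / (1 + e^(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(X, theta):
    # Linear regression (theta transpose x) piped through the sigmoid.
    return sigmoid(X @ theta)

# Made-up example: 3 houses, a bias column of ones plus 2 features
# (square footage and number of bedrooms).
X = np.array([[1.0, 2104.0, 3.0],
              [1.0,  852.0, 2.0],
              [1.0, 1600.0, 3.0]])
theta = np.zeros(3)           # parameters before any training
print(hypothesis(X, theta))   # -> [0.5 0.5 0.5], totally unsure about everything
```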
Okay, so step one is we have our hypothesis function, and we're gonna pipe in our spreadsheet, map it all on our graph, and make a bunch of random guesses. Remember that step one is to predict, predict randomly. So we're gonna be like yes, no, no, yes, no, no, no, no, yes, yes, yes, yes, yes. When the actual values are no, no, yes, yes, yes, no, no, no, yes, yes, yes. And what we're gonna do now is go to step two, which is figure out how off we were, how bad we were. That's our error or loss function. And just like in step one, where our hypothesis or objective function is a specific function depending on the machine learning algorithm you're using, in our case the logistic or sigmoid function, in this step our error function will be a specific function as well. Ours is called the log likelihood function, really the negative log likelihood or log loss, because it uses a logarithm in the function. And here's how it works. We can't use our linear regression error function because we're not working with numbers. We're working with binary classifications. Zero or one. When the actual value is one, but my guess was zero, how bad did I do? Or vice versa, my guess was one, but the actual value is zero, how bad did I do? Or if the actual value is one and the guess is one, how bad did I do? If that's the case, the error should be zero. If I guessed correctly, the error should be zero. Now remember that we're using those logits, a scale from zero to one where anything below 0.5 is no and anything above 0.5 is yes. And we may have guessed in our predict step 0.2, as in I'm only 20% confident that the answer to this particular case is yes, so I'm leaning no, when the actual answer was yes. In that case, we're less wrong than if I would have guessed zero and the actual answer is one. So our error function, what it is, is this log function. Take the case where the actual answer is no. The error starts at zero and it climbs towards infinity as the guess approaches one. So it goes up and up and up and up and then up into infinity. So what we're looking at in our error function is a graph that starts at zero, and as the guess goes right towards one, it shoots off towards infinity, off to Y equals infinity, before it ever hits X equals one. The closer my guess is to zero, which is the correct value, the closer to zero is the error. And the closer my guess is to one, even though the actual answer is zero, the closer to infinity is my error. Okay, this is very confusing, and don't dwell on the details. You're gonna learn this all in the Coursera course. I'm just describing it to you now for thoroughness. Now let's take the other example, flip the graph. In the cases where the house is considered expensive, here's how the error function works. The closer my guess goes towards one, in other words, I guessed that the house is expensive, I guessed correctly, the closer the graph gets to zero. In other words, the error is zero. And the closer my guess goes towards zero, where I'm guessing that the house is not expensive, even though it is, the closer my error on the Y-axis goes to infinity. So this one's like a graph sloping in the other direction. It's like you're coming down a ramp from high on the Y-axis, and you hit zero where X equals one. Again, I know this doesn't come out well in audio format. So just dive into the details in the Andrew Ng course, but I just wanna step you through the process. Okay, so that's the visual representation of the cost function that we're constructing for our objective function, our sigmoid function. The cost function looks like two separate cases of a logarithm. We're gonna combine those two separate cases into one function. And what happens is that one of them gets canceled out, depending on whether we're dealing with a yes or a no. It's hard to describe. What the function looks like is the negative of: Y times the logarithm of your guess, plus one minus Y times the logarithm of one minus your guess, okay? The logarithm of a number between zero and one is negative, so that minus sign out front flips it into a positive error. That's the error for one row of your spreadsheet, one guess. How bad did you do guessing for one particular row? Sum all those up and divide by the number of examples. So it's the average of the errors. And in this case, it's a little bit complicated, we're working with logarithms, but just go to the Andrew Ng course notes for week three. Okay, whew, that was crazy. Step two was to figure out how bad we did with all of our yes, no, yes, yes, no, no, no guesses. How far off were we?
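Here's that cost function as a small Python sketch, reusing the sigmoid from the earlier snippet. Again, this is just an illustrative sketch of the formula, not production code:

```python
import numpy as np

def cost(theta, X, y):
    # Average log loss (negative log likelihood) over all m rows.
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # the sigmoid hypothesis, in (0, 1)
    # y * log(h) covers the "actually expensive" rows,
    # (1 - y) * log(1 - h) covers the "actually not expensive" rows;
    # for any given row, one of the two terms cancels to zero.
    # The leading minus sign flips the negative logs into a positive error.
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```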
Now remember, the point of our cost function is to tell us how bad we did so that we can train our hypothesis function, we can train the theta parameters, to get a better graph that more accurately depicts the way things are with all of our data. And that's step three. Step three is to train our hypothesis function using our error function, okay? So our hypothesis function goes into our error function. Our error is a function of our hypothesis. And then our error function goes into our train function, namely gradient descent. Remember gradient descent. The job of gradient descent is to take the derivative of your loss function. The derivative tells you which direction you need to step with all of your theta parameters, which direction each theta parameter needs to change, maybe some negative value or some positive value, up, down, left, or right, and by how much. So your derivative says how much each of your theta parameters in your hypothesis function needs to change in order to reduce your error function. And we're going to keep doing that, one gradient step at a time, keep taking the derivative and changing the theta parameters, until our error function is at a minimum, at the smallest point it can be. So the hypothesis function goes into your error function, and your error function goes into the derivative function. Remember that the derivative itself is a function. And you repeat the derivative step, one step at a time, until your error function gives you a small value, the smallest value it can give you, which means your hypothesis function, going back one step now, is ideal. In our case, it means that our sigmoid function has a good decision boundary that can separate all the yeses on one side and all the nos on the other side. And then in the future, when you make a guess about a new house you've never seen, where you don't have the label, is this house considered expensive by our relative definition of expensive? We'll throw it on that graph. And if our function gives us anything greater than 0.5, the answer is yes. And if it gives us anything less than 0.5, the answer is no. Okay, so gradient descent trains your theta parameters by taking the derivative of your loss function, which tells you how big of a step to take in which direction, over and over and over, until your error is small. And the gradient descent formula is, for each of your theta parameters, you take your theta parameter, what it was before, minus alpha over m times the sum of your guess minus the actual value, times that feature in that position. So theta j equals theta j minus alpha over m times the sum, from i equals one to m, of your hypothesis for that row minus the actual value for that row, times feature j for that row. Again, you'll learn this in the Andrew Ng Coursera course. Oh man, that was wild, huh? So let's run through this one more time. Remember, we have supervised learning broken down into classification, where we're trying to guess the class of a thing. Is it a cat, dog, or tree? And regression, where we're trying to guess the value of a thing, the continuous variable or numeric value of a thing. And then inside of classification, you have any number of algorithms, such as a decision tree or a Bayesian classifier. And we're focusing in this episode on the 101 classifier, which is called logistic regression. Logistic regression takes a spreadsheet of data whose values or labels, the Y column, are yeses and nos. Zero, one, zero, zero, zero, one, one, one, one, zero, zero, zero, one, in the case of binary classification. In the case of multi-class classification, it can be any number of classes, but we're not gonna talk about that in this episode. We pipe that spreadsheet into our logistic regression algorithm. Our logistic regression algorithm goes over that spreadsheet and makes a whole bunch of guesses. That's step one, predict. Step two is to determine how bad you did with those guesses. And step three is to take your error function from step two and apply repeated applications of the derivative of that function, which tells you how much to change your hypothesis function's theta parameters so that you get more accurate and more accurate over time, until your error function finally reaches a minimum value. Now you have a hypothesis function that is trained on your data, on your spreadsheet or your matrix. And now when you get new samples in the future, you can pipe them into your hypothesis function and it will give you a guess, and that guess will be more accurate.
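To make that whole predict, error, train loop concrete, here's a minimal end-to-end sketch in Python: batch gradient descent on a tiny made-up dataset with one scaled feature plus a bias column. The numbers, learning rate, and step count are all just illustrative guesses, not tuned values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.5, steps=5000):
    # Batch gradient descent: theta_j := theta_j - (alpha / m) * sum((h - y) * x_j),
    # done for every parameter j at once using vectorized numpy.
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(steps):
        h = sigmoid(X @ theta)            # step 1: predict
        gradient = (X.T @ (h - y)) / m    # derivative of the average log loss
        theta -= alpha * gradient         # step 3: nudge the parameters downhill
    return theta

def predict(X, theta):
    # Anything at or above 0.5 is a yes (1), anything below is a no (0).
    return (sigmoid(X @ theta) >= 0.5).astype(int)

# Tiny made-up dataset: a bias column of ones plus one feature
# (say, square footage in thousands). 1 = expensive, 0 = not expensive.
X = np.array([[1.0, 0.8], [1.0, 2.1], [1.0, 1.6], [1.0, 3.0], [1.0, 1.0]])
y = np.array([0, 1, 0, 1, 0])

theta = train(X, y)
print(predict(X, theta))   # with enough steps this should line up with y
```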
As for the details of each step: the hypothesis function in our logistic regression algorithm is called a sigmoid function or a logistic function. Inside of that sigmoid function is actually linear regression. So logistic regression is a function of linear regression. On a graph, our logistic function looks like an S, a stretched-out S.
And we're trying to find a decision boundary that puts all the yeses on one side and all the noes on the other side. Our error function in step two is called a log likelihood function, and it tells us how off we were with our guess from the actual value. And the function is actually quite complex, so I will just refer you to the Andrew Ng course to look at the equation and to watch his videos to understand it, but in summary it just tells you how bad you did. And then of course the training step just applies repeated applications of the derivative of the loss function, and that loop of repeated applications is called gradient descent. We're descending the error graph to the bottom of the graph where the error is the lowest. That's why it's called descent, you're descending. Now let's take a very big step back and remember what we're trying to accomplish. I mean artificial intelligence in the very general sense of the term. Remember that artificial intelligence is being able to simulate any mental task. Now we dove down the details rabbit hole of linear and logistic regression, talking about mathematical equations and graphs and charts, and the training process or the learning process was like taking these lines or these S curves and altering them in some way, so you probably feel like you're very far removed from artificial intelligence by now. So let's take a big step back and remember the goal, simulating any mental task. Remember that artificial intelligence is broken down into multiple subfields, one of which is machine learning. And then I said machine learning is sort of the most interesting and essential, in my opinion, subfield of artificial intelligence in that it affects all the other fields. It's almost like any mental task could be boiled down to learning, boiled down to storing a pattern about how the world works so that you can make a prediction in the future, an inference. Now in our examples we're storing a pattern or a model of the costs of houses in Portland, Oregon, or whether a house is expensive or not, yes or no. Logistic or linear regression. That doesn't feel a lot like artificial intelligence yet. That's a pattern, and then we can make a prediction with that pattern in the future. But if you step back a bit and think about other, more high-level sorts of machine learning tasks: let's say you're on the African savannah and you're looking in front of you, sort of taking a visual picture of what's in front of you. Oh, there so happens to be a lion. Now you use classification in order to determine what class of objects is in front of you. Is this a lion, a tree, a house, or food? If it's food, I want to eat it. If it's a lion, I want to run. Okay, so my classification algorithm has determined, by way of my stored model, that this is indeed a lion. Now we go to another learning algorithm. What action should I take given the circumstances? You may have learned, in the machine learning sense, that lions will eat you. Either verbally from your parents, or maybe one took a bite out of your shoulder one day when you were on the hunt. So you have learned that lions will eat you, and the predicted course of action now, given that there is a lion in front of you, is to run away.
So here we have vision turning into action. And if we want to translate this into a machine learning situation, we might use a convolutional neural network in the case of classifying what you're looking at, with vision, and we might use a deep Q-network in order to determine what course of action, or policy, or plan to take given our determination. So everything in machine learning sort of boils down to this learn and predict cycle, but we have to start at the very bottom with linear and logistic regression, the building blocks, the Legos, in order to work our way up to the more advanced, high-level topics, things like how to take actions in an environment given your state, or advanced algorithms in vision and classification. Now I want to go on a little detour. I said that linear and logistic regression are like Legos or building blocks in the grand scheme, and that you're learning the Legos or building blocks right now, and that's why this is important. Machine learning, you will find, is a very composable branch of engineering. Composable. If you come from a software engineering background, or maybe web development, or even mathematics, you might be familiar with this thing called functional programming. Functional programming is a style of programming used in languages like Haskell or Lisp, where you have a function, function A, and it takes as its arguments other functions, functions B and C, and let's say that function B takes as its arguments D and E. Functional programming is like Russian dolls, where you nest all these functions inside of each other, and then eventually at the very bottom you have to give it a number or a string or some constant, and then you can start the process. It's like opening these Russian dolls one at a time: you open the Russian doll, and what's inside? Another Russian doll. You open that, and what's inside? Another. And you open that, and what's inside? This is called composability. Composability: your functions or your equations are composed of other functions or equations, which are composed of other functions, and so on, so everything's nested inside of each other. You already saw this in machine learning by way of logistic regression being composed of linear regression. It is a function of linear regression, so we took our linear regression algorithm and we put it inside of logistic regression. We also saw this in the steps one, two, and three process of machine learning, as the little sketch below shows.
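Just to make that Russian-doll nesting concrete, here's a tiny illustrative sketch mirroring the earlier snippets; the function names are mine, not anything official. Each piece is literally a function wrapped around the previous one:

```python
import numpy as np

def linear(X, theta):                      # innermost doll: linear regression
    return X @ theta

def logistic(X, theta):                    # wraps linear regression in the sigmoid
    return 1.0 / (1.0 + np.exp(-linear(X, theta)))

def loss(theta, X, y):                     # wraps the hypothesis in the log loss
    h = logistic(X, theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_step(theta, X, y, alpha):     # outermost doll: one training step
    h = logistic(X, theta)
    return theta - alpha * (X.T @ (h - y)) / len(y)
```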
We have our hypothesis function, and we put that into our error function, so our error function is a function of our hypothesis function; our error function is composed of our hypothesis function. And then we put our error function into a derivative function; that's the gradient descent step, that's step three, training. So in the case of logistic regression, here's how it all unwraps. We have our Russian dolls: the very outer Russian doll is our derivative. Pop that open, inside is our loss. Pop that open, inside is our logistic function. Pop that open, and inside is linear regression. And you will find that everything in machine learning is this way. Now, that's kind of a thing in mathematics, it's kind of the mathematical nature of machine learning. Remember that machine learning is kind of like applied statistics, really, and calculus. Machine learning is highly mathematical, and mathematics is highly composable, so it's like this by nature. But this is also a very useful and necessary attribute in order to scale machine learning once you're actually deploying these architectures in code, putting them on Amazon Web Services, AWS, and scaling them horizontally. If you know anything about functional programming as a software engineer or architect, you know that a proper horizontally scalable system needs to be functional by nature, and you will find that machine learning needs to scale. Indeed, a lot of these algorithms, especially once we get into deep learning, are very, very computationally expensive, very heavy algorithms, and in order to deploy a service that will be used by any number of people, you're going to need to be able to scale horizontally, and in order to do that, the nature of the architecture must be functional. Okay, that was a long-winded digression. One of the reasons I wanted to point out this composability aspect of machine learning is the following. You're probably chomping at the bit to learn about deep learning. That's all the rage in machine learning, and if you came to this podcast because you're excited, you've seen all these articles and discussions on Hacker News about artificial neural networks and deep learning and all the stuff that's happening in that space, well, patience, my friend, because we will get there, and we'll get there sooner than you think. We'll get to deep learning. But in order to understand deep learning, you have to understand logistic regression and linear regression, because logistic regression is a neuron, a neuron in a neural network. So that composability paradigm is at play here: a neural network in deep learning is a function of logistic regression, which itself is a function of linear regression. So everything's composed and nested inside of each other. So before we can get to deep learning and neural networks, we're going to need to learn all these little basics, these linear units and logistic units, because they're going to become neurons inside of our neural network. So that's kind of cool: deep learning is a function of shallow learning, which is what we call these simple algorithms like linear and logistic regression. Okay, so that was a very technical, long-winded episode. I believe, don't quote me on this, but I believe that the next few episodes won't be nearly as technical. The next episode specifically, we're going to be talking about mathematics. We're not going to go into math; we're going to talk about the branches of math that you need to know in order to succeed in machine learning, how much of each type of math you need to know, and what are the resources where you can
learn these things, et cetera, because that's a common question that comes up: what type of math do I need to know, how much of it do I need to know, can I go into machine learning without knowing any math, et cetera. So we're going to do an episode on that sometime soon. I'm going to do an episode on languages and frameworks, so Python versus R versus MATLAB, TensorFlow versus Theano versus Torch, and then we'll do a high-level overview of deep learning and all these things before we finally get back into the technical details. So do not fear, my entire podcast series will not be like this episode and the linear regression episode, which are super, super technical. Okay, what are the resources for this episode? No new resources. I'm going to point you once again to the Andrew Ng Coursera course. Like I said in the linear regression episode, that course is not optional, it is required. You need to start on it. I'm going to keep recommending it until we start getting into new territory, but I want you to start working on that course. That's it for this episode, and I'll see you guys next time.