I’m working on a statistics multipart question and need an explanation and answer to help me learn.
Do problems 4, 5, 6, 8, and 9 in Chapter 4 (pages 151152).
CHAPTER 4
Training Models
With Early Release ebooks, you get books in their earliest formâ€”
the authorâ€™s raw and unedited content as he or she writesâ€”so you
can take advantage of these technologies long before the official
release of these titles. The following will be Chapter 4 in the final
release of the book.
So far we have treated Machine Learning models and their training algorithms mostly
like black boxes. If you went through some of the exercises in the previous chapters,
you may have been surprised by how much you can get done without knowing anyâ€
thing about whatâ€™s under the hood: you optimized a regression system, you improved
a digit image classifier, and you even built a spam classifier from scratchâ€”all this
without knowing how they actually work. Indeed, in many situations you donâ€™t really
need to know the implementation details.
However, having a good understanding of how things work can help you quickly
home in on the appropriate model, the right training algorithm to use, and a good set
of hyperparameters for your task. Understanding whatâ€™s under the hood will also help
you debug issues and perform error analysis more efficiently. Lastly, most of the topâ€
ics discussed in this chapter will be essential in understanding, building, and training
neural networks (discussed in Part II of this book).
In this chapter, we will start by looking at the Linear Regression model, one of the
simplest models there is. We will discuss two very different ways to train it:
â€¢ Using a direct â€œclosedformâ€ equation that directly computes the model parameâ€
ters that best fit the model to the training set (i.e., the model parameters that
minimize the cost function over the training set).
113
â€¢ Using an iterative optimization approach, called Gradient Descent (GD), that
gradually tweaks the model parameters to minimize the cost function over the
training set, eventually converging to the same set of parameters as the first
method. We will look at a few variants of Gradient Descent that we will use again
and again when we study neural networks in Part II: Batch GD, Minibatch GD,
and Stochastic GD.
Next we will look at Polynomial Regression, a more complex model that can fit nonâ€
linear datasets. Since this model has more parameters than Linear Regression, it is
more prone to overfitting the training data, so we will look at how to detect whether
or not this is the case, using learning curves, and then we will look at several regulariâ€
zation techniques that can reduce the risk of overfitting the training set.
Finally, we will look at two more models that are commonly used for classification
tasks: Logistic Regression and Softmax Regression.
There will be quite a few math equations in this chapter, using basic
notions of linear algebra and calculus. To understand these equaâ€
tions, you will need to know what vectors and matrices are, how to
transpose them, multiply them, and inverse them, and what partial
derivatives are. If you are unfamiliar with these concepts, please go
through the linear algebra and calculus introductory tutorials availâ€
able as Jupyter notebooks in the online supplemental material. For
those who are truly allergic to mathematics, you should still go
through this chapter and simply skip the equations; hopefully, the
text will be sufficient to help you understand most of the concepts.
Linear Regression
In Chapter 1, we looked at a simple regression model of life satisfaction: life_satisfacâ€
tion = Î¸0 + Î¸1 Ã— GDP_per_capita.
This model is just a linear function of the input feature GDP_per_capita. Î¸0 and Î¸1 are
the modelâ€™s parameters.
More generally, a linear model makes a prediction by simply computing a weighted
sum of the input features, plus a constant called the bias term (also called the intercept
term), as shown in Equation 41.
Equation 41. Linear Regression model prediction
y = Î¸0 + Î¸1×1 + Î¸2×2 + â€« Ùµâ€¬+ Î¸nxn
â€¢ Å· is the predicted value.
114

Chapter 4: Training Models
â€¢ n is the number of features.
â€¢ xi is the ith feature value.
â€¢ Î¸j is the jth model parameter (including the bias term Î¸0 and the feature weights
Î¸1, Î¸2, â€«Ùµâ€¬, Î¸n).
This can be written much more concisely using a vectorized form, as shown in Equaâ€
tion 42.
Equation 42. Linear Regression model prediction (vectorized form)
y = hÎ¸ x = Î¸ Â· x
â€¢ Î¸ is the modelâ€™s parameter vector, containing the bias term Î¸0 and the feature
weights Î¸1 to Î¸n.
â€¢ x is the instanceâ€™s feature vector, containing x0 to xn, with x0 always equal to 1.
â€¢ Î¸ Â· x is the dot product of the vectors Î¸ and x, which is of course equal to
Î¸0x0 + Î¸1×1 + Î¸2×2 + â€« Ùµâ€¬+ Î¸nxn.
â€¢ hÎ¸ is the hypothesis function, using the model parameters Î¸.
In Machine Learning, vectors are often represented as column vecâ€
tors, which are 2D arrays with a single column. If Î¸ and x are colâ€
umn vectors, then the prediction is: y = Î¸Tx, where Î¸T is the
transpose of Î¸ (a row vector instead of a column vector) and Î¸Tx is
the matrix multiplication of Î¸T and x. It is of course the same preâ€
diction, except it is now represented as a single cell matrix rather
than a scalar value. In this book we will use this notation to avoid
switching between dot products and matrix multiplications.
Okay, thatâ€™s the Linear Regression model, so now how do we train it? Well, recall that
training a model means setting its parameters so that the model best fits the training
set. For this purpose, we first need a measure of how well (or poorly) the model fits
the training data. In Chapter 2 we saw that the most common performance measure
of a regression model is the Root Mean Square Error (RMSE) (Equation 21). Thereâ€
fore, to train a Linear Regression model, you need to find the value of Î¸ that minimiâ€
zes the RMSE. In practice, it is simpler to minimize the Mean Square Error (MSE)
Linear Regression

115
than the RMSE, and it leads to the same result (because the value that minimizes a
function also minimizes its square root).1
The MSE of a Linear Regression hypothesis hÎ¸ on a training set X is calculated using
Equation 43.
Equation 43. MSE cost function for a Linear Regression model
MSE X, hÎ¸ =
2
1 m T i
Î¸ x âˆ’yi
miâˆ‘
=1
Most of these notations were presented in Chapter 2 (see â€œNotationsâ€ on page 43).
The only difference is that we write hÎ¸ instead of just h in order to make it clear that
the model is parametrized by the vector Î¸. To simplify notations, we will just write
MSE(Î¸) instead of MSE(X, hÎ¸).
The Normal Equation
To find the value of Î¸ that minimizes the cost function, there is a closedform solution
â€”in other words, a mathematical equation that gives the result directly. This is called
the Normal Equation (Equation 44).2
Equation 44. Normal Equation
Î¸ = XT X
âˆ’1
XT y
â€¢ Î¸ is the value of Î¸ that minimizes the cost function.
â€¢ y is the vector of target values containing y(1) to y(m).
Letâ€™s generate some linearlooking data to test this equation on (Figure 41):
import numpy as np
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
1 It is often the case that a learning algorithm will try to optimize a different function than the performance
measure used to evaluate the final model. This is generally because that function is easier to compute, because
it has useful differentiation properties that the performance measure lacks, or because we want to constrain
the model during training, as we will see when we discuss regularization.
2 The demonstration that this returns the value of Î¸ that minimizes the cost function is outside the scope of this
book.
116

Chapter 4: Training Models
Figure 41. Randomly generated linear dataset
Now letâ€™s compute Î¸ using the Normal Equation. We will use the inv() function from
NumPyâ€™s Linear Algebra module (np.linalg) to compute the inverse of a matrix, and
the dot() method for matrix multiplication:
X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
The actual function that we used to generate the data is y = 4 + 3×1 + Gaussian noise.
Letâ€™s see what the equation found:
>>> theta_best
array([[4.21509616],
[2.77011339]])
We would have hoped for Î¸0 = 4 and Î¸1 = 3 instead of Î¸0 = 4.215 and Î¸1 = 2.770. Close
enough, but the noise made it impossible to recover the exact parameters of the origiâ€
nal function.
Now you can make predictions using Î¸:
>>> X_new = np.array([[0], [2]])
>>> X_new_b = np.c_[np.ones((2, 1)), X_new] # add x0 = 1 to each instance
>>> y_predict = X_new_b.dot(theta_best)
>>> y_predict
array([[4.21509616],
[9.75532293]])
Letâ€™s plot this modelâ€™s predictions (Figure 42):
plt.plot(X_new, y_predict, “r“)
plt.plot(X, y, “b.”)
Linear Regression

117
plt.axis([0, 2, 0, 15])
plt.show()
Figure 42. Linear Regression model predictions
Performing linear regression using ScikitLearn is quite simple:3
>>> from sklearn.linear_model import LinearRegression
>>> lin_reg = LinearRegression()
>>> lin_reg.fit(X, y)
>>> lin_reg.intercept_, lin_reg.coef_
(array([4.21509616]), array([[2.77011339]]))
>>> lin_reg.predict(X_new)
array([[4.21509616],
[9.75532293]])
The LinearRegression class is based on the scipy.linalg.lstsq() function (the
name stands for â€œleast squaresâ€), which you could call directly:
>>> theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b, y, rcond=1e6)
>>> theta_best_svd
array([[4.21509616],
[2.77011339]])
This function computes Î¸ = X+y, where á…š+ is the pseudoinverse of X (specifically the
MoorePenrose inverse). You can use np.linalg.pinv() to compute the pseudoinâ€
verse directly:
>>> np.linalg.pinv(X_b).dot(y)
array([[4.21509616],
[2.77011339]])
3 Note that ScikitLearn separates the bias term (intercept_) from the feature weights (coef_).
118

Chapter 4: Training Models
The pseudoinverse itself is computed using a standard matrix factorization technique
called Singular Value Decomposition (SVD) that can decompose the training set
matrix X into the matrix multiplication of three matrices U Î£ VT (see
numpy.linalg.svd()). The pseudoinverse is computed as X+ = VÎ£+UT. To compute
the matrix Î£+, the algorithm takes Î£ and sets to zero all values smaller than a tiny
threshold value, then it replaces all the nonzero values with their inverse, and finally
it transposes the resulting matrix. This approach is more efficient than computing the
Normal Equation, plus it handles edge cases nicely: indeed, the Normal Equation may
not work if the matrix XTX is not invertible (i.e., singular), such as if m < n or if some
features are redundant, but the pseudoinverse is always defined.
Computational Complexity
The Normal Equation computes the inverse of XT X, which is an (n + 1) Ã— (n + 1)
matrix (where n is the number of features). The computational complexity of inverting
such a matrix is typically about O(n2.4) to O(n3) (depending on the implementation).
In other words, if you double the number of features, you multiply the computation
time by roughly 22.4 = 5.3 to 23 = 8.
The SVD approach used by ScikitLearnâ€™s LinearRegression class is about O(n2). If
you double the number of features, you multiply the computation time by roughly 4.
Both the Normal Equation and the SVD approach get very slow
when the number of features grows large (e.g., 100,000). On the
positive side, both are linear with regards to the number of instanâ€
ces in the training set (they are O(m)), so they handle large training
sets efficiently, provided they can fit in memory.
Also, once you have trained your Linear Regression model (using the Normal Equaâ€
tion or any other algorithm), predictions are very fast: the computational complexity
is linear with regards to both the number of instances you want to make predictions
on and the number of features. In other words, making predictions on twice as many
instances (or twice as many features) will just take roughly twice as much time.
Now we will look at very different ways to train a Linear Regression model, better
suited for cases where there are a large number of features, or too many training
instances to fit in memory.
Gradient Descent
Gradient Descent is a very generic optimization algorithm capable of finding optimal
solutions to a wide range of problems. The general idea of Gradient Descent is to
tweak parameters iteratively in order to minimize a cost function.
Gradient Descent

119
Suppose you are lost in the mountains in a dense fog; you can only feel the slope of
the ground below your feet. A good strategy to get to the bottom of the valley quickly
is to go downhill in the direction of the steepest slope. This is exactly what Gradient
Descent does: it measures the local gradient of the error function with regards to the
parameter vector Î¸, and it goes in the direction of descending gradient. Once the graâ€
dient is zero, you have reached a minimum!
Concretely, you start by filling Î¸ with random values (this is called random initializaâ€
tion), and then you improve it gradually, taking one baby step at a time, each step
attempting to decrease the cost function (e.g., the MSE), until the algorithm converges
to a minimum (see Figure 43).
Figure 43. Gradient Descent
An important parameter in Gradient Descent is the size of the steps, determined by
the learning rate hyperparameter. If the learning rate is too small, then the algorithm
will have to go through many iterations to converge, which will take a long time (see
Figure 44).
120

Chapter 4: Training Models
Figure 44. Learning rate too small
On the other hand, if the learning rate is too high, you might jump across the valley
and end up on the other side, possibly even higher up than you were before. This
might make the algorithm diverge, with larger and larger values, failing to find a good
solution (see Figure 45).
Figure 45. Learning rate too large
Finally, not all cost functions look like nice regular bowls. There may be holes, ridges,
plateaus, and all sorts of irregular terrains, making convergence to the minimum very
difficult. Figure 46 shows the two main challenges with Gradient Descent: if the ranâ€
dom initialization starts the algorithm on the left, then it will converge to a local miniâ€
mum, which is not as good as the global minimum. If it starts on the right, then it will
take a very long time to cross the plateau, and if you stop too early you will never
reach the global minimum.
Gradient Descent

121
Figure 46. Gradient Descent pitfalls
Fortunately, the MSE cost function for a Linear Regression model happens to be a
convex function, which means that if you pick any two points on the curve, the line
segment joining them never crosses the curve. This implies that there are no local
minima, just one global minimum. It is also a continuous function with a slope that
never changes abruptly.4 These two facts have a great consequence: Gradient Descent
is guaranteed to approach arbitrarily close the global minimum (if you wait long
enough and if the learning rate is not too high).
In fact, the cost function has the shape of a bowl, but it can be an elongated bowl if
the features have very different scales. Figure 47 shows Gradient Descent on a trainâ€
ing set where features 1 and 2 have the same scale (on the left), and on a training set
where feature 1 has much smaller values than feature 2 (on the right).5
Figure 47. Gradient Descent with and without feature scaling
4 Technically speaking, its derivative is Lipschitz continuous.
5 Since feature 1 is smaller, it takes a larger change in Î¸1 to affect the cost function, which is why the bowl is
elongated along the Î¸1 axis.
122

Chapter 4: Training Models
As you can see, on the left the Gradient Descent algorithm goes straight toward the
minimum, thereby reaching it quickly, whereas on the right it first goes in a direction
almost orthogonal to the direction of the global minimum, and it ends with a long
march down an almost flat valley. It will eventually reach the minimum, but it will
take a long time.
When using Gradient Descent, you should ensure that all features
have a similar scale (e.g., using ScikitLearnâ€™s StandardScaler
class), or else it will take much longer to converge.
This diagram also illustrates the fact that training a model means searching for a
combination of model parameters that minimizes a cost function (over the training
set). It is a search in the modelâ€™s parameter space: the more parameters a model has,
the more dimensions this space has, and the harder the search is: searching for a neeâ€
dle in a 300dimensional haystack is much trickier than in three dimensions. Fortuâ€
nately, since the cost function is convex in the case of Linear Regression, the needle is
simply at the bottom of the bowl.
Batch Gradient Descent
To implement Gradient Descent, you need to compute the gradient of the cost funcâ€
tion with regards to each model parameter Î¸j. In other words, you need to calculate
how much the cost function will change if you change Î¸j just a little bit. This is called
a partial derivative. It is like asking â€œwhat is the slope of the mountain under my feet
if I face east?â€ and then asking the same question facing north (and so on for all other
dimensions, if you can imagine a universe with more than three dimensions). Equaâ€
tion 45 computes the partial derivative of the cost function with regards to parameâ€
âˆ‚
ter Î¸j, noted âˆ‚Î¸ MSE(Î¸).
j
Equation 45. Partial derivatives of the cost function
2 m T i
âˆ‚
MSE Î¸ =
Î¸ x âˆ’ y i x ji
âˆ‚Î¸ j
miâˆ‘
=1
Instead of computing these partial derivatives individually, you can use Equation 46
to compute them all in one go. The gradient vector, noted ÖÎ¸MSE(Î¸), contains all the
partial derivatives of the cost function (one for each model parameter).
Gradient Descent

123
Equation 46. Gradient vector of the cost function
âˆ‚
MSE Î¸
âˆ‚Î¸0
âˆ‚
MSE Î¸
2
ÖÎ¸ MSE Î¸ = âˆ‚Î¸1
= XT XÎ¸ âˆ’ y
m
â€«Ù´â€¬
âˆ‚
MSE Î¸
âˆ‚Î¸n
Notice that this formula involves calculations over the full training
set X, at each Gradient Descent step! This is why the algorithm is
called Batch Gradient Descent: it uses the whole batch of training
data at every step (actually, Full Gradient Descent would probably
be a better name). As a result it is terribly slow on very large trainâ€
ing sets (but we will see much faster Gradient Descent algorithms
shortly). However, Gradient Descent scales well with the number of
features; training a Linear Regression model when there are hunâ€
dreds of thousands of features is much faster using Gradient
Descent than using the Normal Equation or SVD decomposition.
Once you have the gradient vector, which points uphill, just go in the opposite direcâ€
tion to go downhill. This means subtracting ÖÎ¸MSE(Î¸) from Î¸. This is where the
learning rate Î· comes into play:6 multiply the gradient vector by Î· to determine the
size of the downhill step (Equation 47).
Equation 47. Gradient Descent step
Î¸ next step = Î¸ âˆ’ Î· ÖÎ¸ MSE Î¸
Letâ€™s look at a quick implementation of this algorithm:
eta = 0.1 # learning rate
n_iterations = 1000
m = 100
theta = np.random.randn(2,1)
# random initialization
for iteration in range(n_iterations):
gradients = 2/m * X_b.T.dot(X_b.dot(theta)  y)
theta = theta  eta * gradients
6 Eta (Î·) is the 7th letter of the Greek alphabet.
124

Chapter 4: Training Models
That wasnâ€™t too hard! Letâ€™s look at the resulting theta:
>>> theta
array([[4.21509616],
[2.77011339]])
Hey, thatâ€™s exactly what the Normal Equation found! Gradient Descent worked perâ€
fectly. But what if you had used a different learning rate eta? Figure 48 shows the
first 10 steps of Gradient Descent using three different learning rates (the dashed line
represents the starting point).
Figure 48. Gradient Descent with various learning rates
On the left, the learning rate is too low: the algorithm will eventually reach the soluâ€
tion, but it will take a long time. In the middle, the learning rate looks pretty good: in
just a few iterations, it has already converged to the solution. On the right, the learnâ€
ing rate is too high: the algorithm diverges, jumping all over the place and actually
getting further and further away from the solution at every step.
To find a good learning rate, you can use grid search (see Chapter 2). However, you
may want to limit the number of iterations so that grid search can eliminate models
that take too long to converge.
You may wonder how to set the number of iterations. If it is too low, you will still be
far away from the optimal solution when the algorithm stops, but if it is too high, you
will waste time while the model parameters do not change anymore. A simple soluâ€
tion is to set a very large number of iterations but to interrupt the algorithm when the
gradient vector becomes tinyâ€”that is, when its norm becomes smaller than a tiny
number È½ (called the tolerance)â€”because this happens when Gradient Descent has
(almost) reached the minimum.
Gradient Descent

125
Convergence Rate
When the cost function is convex and its slope does not change abruptly (as is the
case for the MSE cost function), Batch Gradient Descent with a fixed learning rate
will eventually converge to the optimal solution, but you may have to wait a while: it
can take O(1/ È½) iterations to reach the optimum within a range of È½ depending on the
shape of the cost function. If you divide the tolerance by 10 to have a more precise
solution, then the algorithm may have to run about 10 times longer.
Stochastic Gradient Descent
The main problem with Batch Gradient Descent is the fact that it uses the whole
training set to compute the gradients at every step, which makes it very slow when
the training set is large. At the opposite extreme, Stochastic Gradient Descent just
picks a random instance in the training set at every step and computes the gradients
based only on that single instance. Obviously this makes the algorithm much faster
since it has very little data to manipulate at every iteration. It also makes it possible to
train on huge training sets, since only one instance needs to be in memory at each
iteration (SGD can be implemented as an outofcore algorithm.7)
On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much
less regular than Batch Gradient Descent: instead of gently decreasing until it reaches
the minimum, the cost function will bounce up and down, decreasing only on averâ€
age. Over time it will end up very close to the minimum, but once it gets there it will
continue to bounce around, never settling down (see Figure 49). So once the algoâ€
rithm stops, the final parameter values are good, but not optimal.
Figure 49. Stochastic Gradient Descent
7 Outofcore algorithms are discussed in Chapter 1.
126

Chapter 4: Training Models
When the cost function is very irregular (as in Figure 46), this can actually help the
algorithm jump out of local minima, so Stochastic Gradient Descent has a better
chance of finding the global minimum than Batch Gradient Descent does.
Therefore randomness is good to escape from local optima, but bad because it means
that the algorithm can never settle at the minimum. One solution to this dilemma is
to gradually reduce the learning rate. The steps start out large (which helps make
quick progress and escape local minima), then get smaller and smaller, allowing the
algorithm to settle at the global minimum. This process is akin to simulated annealâ€
ing, an algorithm inspired from the process of annealing in metallurgy where molten
metal is slowly cooled down. The function that determines the learning rate at each
iteration is called the learning schedule. If the learning rate is reduced too quickly, you
may get stuck in a local minimum, or even end up frozen halfway to the minimum. If
the learning rate is reduced too slowly, you may jump around the minimum for a
long time and end up with a suboptimal solution if you halt training too early.
This code implements Stochastic Gradient Descent using a simple learning schedule:
n_epochs = 50
t0, t1 = 5, 50
# learning schedule hyperparameters
def learning_schedule(t):
return t0 / (t + t1)
theta = np.random.randn(2,1)
# random initialization
for epoch in range(n_epochs):
for i in range(m):
random_index = np.random.randint(m)
xi = X_b[random_index:random_index+1]
yi = y[random_index:random_index+1]
gradients = 2 * xi.T.dot(xi.dot(theta) – yi)
eta = learning_schedule(epoch * m + i)
theta = theta – eta * gradients
By convention we iterate by rounds of m iterations; each round is called an epoch.
While the Batch Gradient Descent code iterated 1,000 times through the whole trainâ€
ing set, this code goes through the training set only 50 times and reaches a fairly good
solution:
>>> theta
array([[4.21076011],
[2.74856079]])
Figure 410 shows the first 20 steps of training (notice how irregular the steps are).
Gradient Descent

127
Figure 410. Stochastic Gradient Descent first 20 steps
Note that since instances are picked randomly, some instances may be picked several
times per epoch while others may not be picked at all. If you want to be sure that the
algorithm goes through every instance at each epoch, another approach is to shuffle
the training set (making sure to shuffle the input features and the labels jointly), then
go through it instance by instance, then shuffle it again, and so on. However, this genâ€
erally converges more slowly.
When using Stochastic Gradient Descent, the training instances
must be independent and identically distributed (IID), to ensure
that the parameters get pulled towards the global optimum, on
average. A simple way to ensure this is to shuffle the instances durâ€
ing training (e.g., pick each instance randomly, or shuffle the trainâ€
ing set at the beginning of each epoch). If you do not do this, for
example if the instances are sorted by label, then SGD will start by
optimizing for one label, then the next, and so on, and it will not
settle close to the global minimum.
To perform Linear Regression using SGD with ScikitLearn, you can use the SGDRe
gressor class, which defaults to optimizing the squared error cost function. The folâ€
lowing code runs for maximum 1000 epochs (max_iter=1000) or until the loss drops
by less than 1e3 during one epoch (tol=1e3), starting with a learning rate of 0.1
(eta0=0.1), using the default learning schedule (different from the preceding one),
and it does not use any regularization (penalty=None; more details on this shortly):
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=1000, tol=1e3, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())
128

Chapter 4: Training Models
Once again, you find a solution quite close to the one returned by the Normal Equaâ€
tion:
>>> sgd_reg.intercept_, sgd_reg.coef_
(array([4.24365286]), array([2.8250878]))
Minibatch Gradient Descent
The last Gradient Descent algorithm we will look at is called Minibatch Gradient
Descent. It is quite simple to understand once you know Batch and Stochastic Gradiâ€
ent Descent: at each step, instead of computing the gradients based on the full trainâ€
ing set (as in Batch GD) or based on just one instance (as in Stochastic GD), Minibatch GD computes the gradients on small random sets of instances called minibatches. The main advantage of Minibatch GD over Stochastic GD is that you can
get a performance boost from hardware optimization of matrix operations, especially
when using GPUs.
The algorithmâ€™s progress in parameter space is less erratic than with SGD, especially
with fairly large minibatches. As a result, Minibatch GD will end up walking
around a bit closer to the minimum than SGD. But, on the other hand, it may be
harder for it to escape from local minima (in the case of problems that suffer from
local minima, unlike Linear Regression as we saw earlier). Figure 411 shows the
paths taken by the three Gradient Descent algorithms in parameter space during
training. They all end up near the minimum, but Batch GDâ€™s path actually stops at the
minimum, while both Stochastic GD and Minibatch GD continue to walk around.
However, donâ€™t forget that Batch GD takes a lot of time to take each step, and Stochasâ€
tic GD and Minibatch GD would also reach the minimum if you used a good learnâ€
ing schedule.
Figure 411. Gradient Descent paths in parameter space
Gradient Descent

129
Letâ€™s compare the algorithms weâ€™ve discussed so far for Linear Regression8 (recall that
m is the number of training instances and n is the number of features); see Table 41.
Table 41. Comparison of algorithms for Linear Regression
Algorithm
Large m Outofcore support Large n Hyperparams Scaling required ScikitLearn
Normal Equation Fast
No
Slow
0
No
n/a
SVD
Fast
No
Slow
0
No
LinearRegression
Batch GD
Slow
No
Fast
2
Yes
SGDRegressor
Stochastic GD
Fast
Yes
Fast
â‰¥2
Yes
SGDRegressor
Minibatch GD
Fast
Yes
Fast
â‰¥2
Yes
SGDRegressor
There is almost no difference after training: all these algorithms
end up with very similar models and make predictions in exactly
the same way.
Polynomial Regression
What if your data is actually more complex than a simple straight line? Surprisingly,
you can actually use a linear model to fit nonlinear data. A simple way to do this is to
add powers of each feature as new features, then train a linear model on this extended
set of features. This technique is called Polynomial Regression.
Letâ€™s look at an example. First, letâ€™s generate some nonlinear data, based on a simple
quadratic equation9 (plus some noise; see Figure 412):
m = 100
X = 6 * np.random.rand(m, 1) – 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)
8 While the Normal Equation can only perform Linear Regression, the Gradient Descent algorithms can be
used to train many other models, as we will see.
9 A quadratic equation is of the form y = ax2 + bx + c.
130
 Chapter 4: Training Models
Figure 412. Generated nonlinear and noisy dataset
Clearly, a straight line will never fit this data properly. So letâ€™s use ScikitLearnâ€™s Poly
nomialFeatures class to transform our training data, adding the square (2nddegree
polynomial) of each feature in the training set as new features (in this case there is
just one feature):
>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly_features = PolynomialFeatures(degree=2, include_bias=False)
>>> X_poly = poly_features.fit_transform(X)
>>> X[0]
array([0.75275929])
>>> X_poly[0]
array([0.75275929, 0.56664654])
X_poly now contains the original feature of X plus the square of this feature. Now you
can fit a LinearRegression model to this extended training data (Figure 413):
>>> lin_reg = LinearRegression()
>>> lin_reg.fit(X_poly, y)
>>> lin_reg.intercept_, lin_reg.coef_
(array([1.78134581]), array([[0.93366893, 0.56456263]]))
Polynomial Regression

131
Figure 413. Polynomial Regression model predictions
Not bad: the model estimates y = 0 . 56×21 + 0 . 93×1 + 1 . 78 when in fact the original
function was y = 0 . 5×21 + 1 . 0x1 + 2 . 0 + Gaussian noise.
Note that when there are multiple features, Polynomial Regression is capable of findâ€
ing relationships between features (which is something a plain Linear Regression
model cannot do). This is made possible by the fact that PolynomialFeatures also
adds all combinations of features up to the given degree. For example, if there were
two features a and b, PolynomialFeatures with degree=3 would not only add the
features a2, a3, b2, and b3, but also the combinations ab, a2b, and ab2.
PolynomialFeatures(degree=d) transforms an array containing n
n+d !
features, where n! is the
d! n!
factorial of n, equal to 1 Ã— 2 Ã— 3 Ã— â€« Ã— Ùµâ€¬n. Beware of the combinatoâ€
rial explosion of the number of features!
features into an array containing
Learning Curves
If you perform highdegree Polynomial Regression, you will likely fit the training
data much better than with plain Linear Regression. For example, Figure 414 applies
a 300degree polynomial model to the preceding training data, and compares the
result with a pure linear model and a quadratic model (2nddegree polynomial).
Notice how the 300degree polynomial model wiggles around to get as close as possiâ€
ble to the training instances.
132

Chapter 4: Training Models
Figure 414. Highdegree Polynomial Regression
Of course, this highdegree Polynomial Regression model is severely overfitting the
training data, while the linear model is underfitting it. The model that will generalize
best in this case is the quadratic model. It makes sense since the data was generated
using a quadratic model, but in general you wonâ€™t know what function generated the
data, so how can you decide how complex your model should be? How can you tell
that your model is overfitting or underfitting the data?
In Chapter 2 you used crossvalidation to get an estimate of a modelâ€™s generalization
performance. If a model performs well on the training data but generalizes poorly
according to the crossvalidation metrics, then your model is overfitting. If it perâ€
forms poorly on both, then it is underfitting. This is one way to tell when a model is
too simple or too complex.
Another way is to look at the learning curves: these are plots of the modelâ€™s perforâ€
mance on the training set and the validation set as a function of the training set size
(or the training iteration). To generate the plots, simply train the model several times
on different sized subsets of the training set. The following code defines a function
that plots the learning curves of a model given some training data:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
def plot_learning_curves(model, X, y):
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
train_errors, val_errors = [], []
for m in range(1, len(X_train)):
model.fit(X_train[:m], y_train[:m])
y_train_predict = model.predict(X_train[:m])
Learning Curves

133
y_val_predict = model.predict(X_val)
train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
val_errors.append(mean_squared_error(y_val, y_val_predict))
plt.plot(np.sqrt(train_errors), “r+”, linewidth=2, label=”train”)
plt.plot(np.sqrt(val_errors), “b“, linewidth=3, label=”val”)
Letâ€™s look at the learning curves of the plain Linear Regression model (a straight line;
Figure 415):
lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)
Figure 415. Learning curves
This deserves a bit of explanation. First, letâ€™s look at the performance on the training
data: when there are just one or two instances in the training set, the model can fit
them perfectly, which is why the curve starts at zero. But as new instances are added
to the training set, it becomes impossible for the model to fit the training data perâ€
fectly, both because the data is noisy and because it is not linear at all. So the error on
the training data goes up until it reaches a plateau, at which point adding new instanâ€
ces to the training set doesnâ€™t make the average error much better or worse. Now letâ€™s
look at the performance of the model on the validation data. When the model is
trained on very few training instances, it is incapable of generalizing properly, which
is why the validation error is initially quite big. Then as the model is shown more
training examples, it learns and thus the validation error slowly goes down. However,
once again a straight line cannot do a good job modeling the data, so the error ends
up at a plateau, very close to the other curve.
These learning curves are typical of an underfitting model. Both curves have reached
a plateau; they are close and fairly high.
134

Chapter 4: Training Models
If your model is underfitting the training data, adding more trainâ€
ing examples will not help. You need to use a more complex model
or come up with better features.
Now letâ€™s look at the learning curves of a 10thdegree polynomial model on the same
data (Figure 416):
from sklearn.pipeline import Pipeline
polynomial_regression = Pipeline([
(“poly_features”, PolynomialFeatures(degree=10, include_bias=False)),
(“lin_reg”, LinearRegression()),
])
plot_learning_curves(polynomial_regression, X, y)
These learning curves look a bit like the previous ones, but there are two very imporâ€
tant differences:
â€¢ The error on the training data is much lower than with the Linear Regression
model.
â€¢ There is a gap between the curves. This means that the model performs signifiâ€
cantly better on the training data than on the validation data, which is the hallâ€
mark of an overfitting model. However, if you used a much larger training set,
the two curves would continue to get closer.
Figure 416. Learning curves for the polynomial model
Learning Curves

135
One way to improve an overfitting model is to feed it more training
data until the validation error reaches the training error.
The Bias/Variance Tradeoff
An important theoretical result of statistics and Machine Learning is the fact that a
modelâ€™s generalization error can be expressed as the sum of three very different
errors:
Bias
This part of the generalization error is due to wrong assumptions, such as assumâ€
ing that the data is linear when it is actually quadratic. A highbias model is most
likely to underfit the training data.10
Variance
This part is due to the modelâ€™s excessive sensitivity to small variations in the
training data. A model with many degrees of freedom (such as a highdegree polâ€
ynomial model) is likely to have high variance, and thus to overfit the training
data.
Irreducible error
This part is due to the noisiness of the data itself. The only way to reduce this
part of the error is to clean up the data (e.g., fix the data sources, such as broken
sensors, or detect and remove outliers).
Increasing a modelâ€™s complexity will typically increase its variance and reduce its bias.
Conversely, reducing a modelâ€™s complexity increases its bias and reduces its variance.
This is why it is called a tradeoff.
Regularized Linear Models
As we saw in Chapters 1 and 2, a good way to reduce overfitting is to regularize the
model (i.e., to constrain it): the fewer degrees of freedom it has, the harder it will be
for it to overfit the data. For example, a simple way to regularize a polynomial model
is to reduce the number of polynomial degrees.
For a linear model, regularization is typically achieved by constraining the weights of
the model. We will now look at Ridge Regression, Lasso Regression, and Elastic Net,
which implement three different ways to constrain the weights.
10 This notion of bias is not to be confused with the bias term of linear models.
136

Chapter 4: Training Models
Ridge Regression
Ridge Regression (also called Tikhonov regularization) is a regularized version of Linâ€
ear Regression: a regularization term equal to Î±âˆ‘ni = 1 Î¸i2 is added to the cost function.
This forces the learning algorithm to not only fit the data but also keep the model
weights as small as possible. Note that the regularization term should only be added
to the cost function during training. Once the model is trained, you want to evaluate
the modelâ€™s performance using the unregularized performance measure.
It is quite common for the cost function used during training to be
different from the performance measure used for testing. Apart
from regularization, another reason why they might be different is
that a good training cost function should have optimizationfriendly derivatives, while the performance measure used for testâ€
ing should be as close as possible to the final objective. A good
example of this is a classifier trained using a cost function such as
the log loss (discussed in a moment) but evaluated using precision/
recall.
The hyperparameter Î± controls how much you want to regularize the model. If Î± = 0
then Ridge Regression is just Linear Regression. If Î± is very large, then all weights end
up very close to zero and the result is a flat line going through the dataâ€™s mean. Equaâ€
tion 48 presents the Ridge Regression cost function.11
Equation 48. Ridge Regression cost function
1
J Î¸ = MSE Î¸ + Î± 2 âˆ‘ni = 1 Î¸i2
Note that the bias term Î¸0 is not regularized (the sum starts at i = 1, not 0). If we
define w as the vector of feature weights (Î¸1 to Î¸n), then the regularization term is
simply equal to Â½(Ö« w Ö«2)2, where Ö« w Ö«2 represents the â„“2 norm of the weight vector.12
For Gradient Descent, just add Î±w to the MSE gradient vector (Equation 46).
It is important to scale the data (e.g., using a StandardScaler)
before performing Ridge Regression, as it is sensitive to the scale of
the input features. This is true of most regularized models.
11 It is common to use the notation J(Î¸) for cost functions that donâ€™t have a short name; we will often use this
notation throughout the rest of this book. The context will make it clear which cost function is being disâ€
cussed.
12 Norms are discussed in Chapter 2.
Regularized Linear Models

137
Figure 417 shows several Ridge models trained on some linear data using different Î±
value. On the left, plain Ridge models are used, leading to linear predictions. On the
right, the data is first expanded using PolynomialFeatures(degree=10), then it is
scaled using a StandardScaler, and finally the Ridge models are applied to the resultâ€
ing features: this is Polynomial Regression with Ridge regularization. Note how
increasing Î± leads to flatter (i.e., less extreme, more reasonable) predictions; this
reduces the modelâ€™s variance but increases its bias.
As with Linear Regression, we can perform Ridge Regression either by computing a
closedform equation or by performing Gradient Descent. The pros and cons are the
same. Equation 49 shows the closedform solution (where A is the (n + 1) Ã— (n + 1)
identity matrix13 except with a 0 in the topleft cell, corresponding to the bias term).
Figure 417. Ridge Regression
Equation 49. Ridge Regression closedform solution
Î¸ = XT X + Î±A
âˆ’1
XT y
Here is how to perform Ridge Regression with ScikitLearn using a closedform soluâ€
tion (a variant of Equation 49 using a matrix factorization technique by AndrÃ©Louis
Cholesky):
>>> from sklearn.linear_model import Ridge
>>> ridge_reg = Ridge(alpha=1, solver=”cholesky”)
>>> ridge_reg.fit(X, y)
13 A square matrix full of 0s except for 1s on the main diagonal (topleft to bottomright).
138

Chapter 4: Training Models
>>> ridge_reg.predict([[1.5]])
array([[1.55071465]])
And using Stochastic Gradient Descent:14
>>> sgd_reg = SGDRegressor(penalty=”l2″)
>>> sgd_reg.fit(X, y.ravel())
>>> sgd_reg.predict([[1.5]])
array([1.47012588])
The penalty hyperparameter sets the type of regularization term to use. Specifying
“l2” indicates that you want SGD to add a regularization term to the cost function
equal to half the square of the â„“2 norm of the weight vector: this is simply Ridge
Regression.
Lasso Regression
Least Absolute Shrinkage and Selection Operator Regression (simply called Lasso
Regression) is another regularized version of Linear Regression: just like Ridge
Regression, it adds a regularization term to the cost function, but it uses the â„“1 norm
of the weight vector instead of half the square of the â„“2 norm (see Equation 410).
Equation 410. Lasso Regression cost function
J Î¸ = MSE Î¸ + Î±âˆ‘ni = 1 Î¸i
Figure 418 shows the same thing as Figure 417 but replaces Ridge models with
Lasso models and uses smaller Î± values.
14 Alternatively you can use the Ridge class with the “sag” solver. Stochastic Average GD is a variant of SGD.
For more details, see the presentation â€œMinimizing Finite Sums with the Stochastic Average Gradient Algoâ€
rithmâ€ by Mark Schmidt et al. from the University of British Columbia.
Regularized Linear Models

139
Figure 418. Lasso Regression
An important characteristic of Lasso Regression is that it tends to completely elimiâ€
nate the weights of the least important features (i.e., set them to zero). For example,
the dashed line in the right plot on Figure 418 (with Î± = 107) looks quadratic, almost
linear: all the weights for the highdegree polynomial features are equal to zero. In
other words, Lasso Regression automatically performs feature selection and outputs a
sparse model (i.e., with few nonzero feature weights).
You can get a sense of why this is the case by looking at Figure 419: on the topleft
plot, the background contours (ellipses) represent an unregularized MSE cost funcâ€
tion (Î± = 0), and the white circles show the Batch Gradient Descent path with that
cost function. The foreground contours (diamonds) represent the â„“1 penalty, and the
triangles show the BGD path for this penalty only (Î± â†’ âˆž). Notice how the path first
reaches Î¸1 = 0, then rolls down a gutter until it reaches Î¸2 = 0. On the topright plot,
the contours represent the same cost function plus an â„“1 penalty with Î± = 0.5. The
global minimum is on the Î¸2 = 0 axis. BGD first reaches Î¸2 = 0, then rolls down the
gutter until it reaches the global minimum. The two bottom plots show the same
thing but uses an â„“2 penalty instead. The regularized minimum is closer to Î¸ = 0 than
the unregularized minimum, but the weights do not get fully eliminated.
140

Chapter 4: Training Models
Figure 419. Lasso versus Ridge regularization
On the Lasso cost function, the BGD path tends to bounce across
the gutter toward the end. This is because the slope changes
abruptly at Î¸2 = 0. You need to gradually reduce the learning rate in
order to actually converge to the global minimum.
The Lasso cost function is not differentiable at Î¸i = 0 (for i = 1, 2, â€«Ùµâ€¬, n), but Gradient
Descent still works fine if you use a subgradient vector g15 instead when any Î¸i = 0.
Equation 411 shows a subgradient vector equation you can use for Gradient Descent
with the Lasso cost function.
Equation 411. Lasso Regression subgradient vector
sign Î¸1
g Î¸, J = ÖÎ¸ MSE Î¸ + Î±
sign Î¸2
â€«Ù´â€¬
sign Î¸n
âˆ’1 if Î¸i < 0
where sign Î¸i = 0 if Î¸i = 0
+1 if Î¸i > 0
15 You can think of a subgradient vector at a nondifferentiable point as an intermediate vector between the graâ€
dient vectors around that point.
Regularized Linear Models

141
Here is a small ScikitLearn example using the Lasso class. Note that you could
instead use an SGDRegressor(penalty=”l1″).
>>> from sklearn.linear_model import Lasso
>>> lasso_reg = Lasso(alpha=0.1)
>>> lasso_reg.fit(X, y)
>>> lasso_reg.predict([[1.5]])
array([1.53788174])
Elastic Net
Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The
regularization term is a simple mix of both Ridge and Lassoâ€™s regularization terms,
and you can control the mix ratio r. When r = 0, Elastic Net is equivalent to Ridge
Regression, and when r = 1, it is equivalent to Lasso Regression (see Equation 412).
Equation 412. Elastic Net cost function
J Î¸ = MSE Î¸ + rÎ±âˆ‘ni = 1 Î¸i +
1âˆ’r
Î±âˆ‘ni = 1 Î¸i2
2
So when should you use plain Linear Regression (i.e., without any regularization),
Ridge, Lasso, or Elastic Net? It is almost always preferable to have at least a little bit of
regularization, so generally you should avoid plain Linear Regression. Ridge is a good
default, but if you suspect that only a few features are actually useful, you should preâ€
fer Lasso or Elastic Net since they tend to reduce the useless featuresâ€™ weights down to
zero as we have discussed. In general, Elastic Net is preferred over Lasso since Lasso
may behave erratically when the number of features is greater than the number of
training instances or when several features are strongly correlated.
Here is a short example using ScikitLearnâ€™s ElasticNet (l1_ratio corresponds to
the mix ratio r):
>>> from sklearn.linear_model import ElasticNet
>>> elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
>>> elastic_net.fit(X, y)
>>> elastic_net.predict([[1.5]])
array([1.54333232])
Early Stopping
A very different way to regularize iterative learning algorithms such as Gradient
Descent is to stop training as soon as the validation error reaches a minimum. This is
called early stopping. Figure 420 shows a complex model (in this case a highdegree
Polynomial Regression model) being trained using Batch Gradient Descent. As the
epochs go by, the algorithm learns and its prediction error (RMSE) on the training set
naturally goes down, and so does its prediction error on the validation set. However,
142

Chapter 4: Training Models
after a while the validation error stops decreasing and actually starts to go back up.
This indicates that the model has started to overfit the training data. With early stopâ€
ping you just stop training as soon as the validation error reaches the minimum. It is
such a simple and efficient regularization technique that Geoffrey Hinton called it a
â€œbeautiful free lunch.â€
Figure 420. Early stopping regularization
With Stochastic and Minibatch Gradient Descent, the curves are
not so smooth, and it may be hard to know whether you have
reached the minimum or not. One solution is to stop only after the
validation error has been above the minimum for some time (when
you are confident that the model will not do any better), then roll
back the model parameters to the point where the validation error
was at a minimum.
Here is a basic implementation of early stopping:
from sklearn.base import clone
# prepare the data
poly_scaler = Pipeline([
(“poly_features”, PolynomialFeatures(degree=90, include_bias=False)),
(“std_scaler”, StandardScaler())
])
X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)
sgd_reg = SGDRegressor(max_iter=1, tol=np.infty, warm_start=True,
penalty=None, learning_rate=”constant”, eta0=0.0005)
Regularized Linear Models

143
minimum_val_error = float(“inf”)
best_epoch = None
best_model = None
for epoch in range(1000):
sgd_reg.fit(X_train_poly_scaled, y_train) # continues where it left off
y_val_predict = sgd_reg.predict(X_val_poly_scaled)
val_error = mean_squared_error(y_val, y_val_predict)
if val_error < minimum_val_error:
minimum_val_error = val_error
best_epoch = epoch
best_model = clone(sgd_reg)
Note that with warm_start=True, when the fit() method is called, it just continues
training where it left off instead of restarting from scratch.
Logistic Regression
As we discussed in Chapter 1, some regression algorithms can be used for classificaâ€
tion as well (and vice versa). Logistic Regression (also called Logit Regression) is comâ€
monly used to estimate the probability that an instance belongs to a particular class
(e.g., what is the probability that this email is spam?). If the estimated probability is
greater than 50%, then the model predicts that the instance belongs to that class
(called the positive class, labeled â€œ1â€), or else it predicts that it does not (i.e., it
belongs to the negative class, labeled â€œ0â€). This makes it a binary classifier.
Estimating Probabilities
So how does it work? Just like a Linear Regression model, a Logistic Regression
model computes a weighted sum of the input features (plus a bias term), but instead
of outputting the result directly like the Linear Regression model does, it outputs the
logistic of this result (see Equation 413).
Equation 413. Logistic Regression model estimated probability (vectorized form)
p = hÎ¸ x = Ïƒ xT Î¸
The logisticâ€”noted Ïƒ(Â·)â€”is a sigmoid function (i.e., Sshaped) that outputs a number
between 0 and 1. It is defined as shown in Equation 414 and Figure 421.
Equation 414. Logistic function
Ïƒt =
144

1
1 + exp âˆ’ t
Chapter 4: Training Models
Figure 421. Logistic function
Once the Logistic Regression model has estimated the probability p = hÎ¸(x) that an
instance x belongs to the positive class, it can make its prediction Å· easily (see Equaâ€
tion 415).
Equation 415. Logistic Regression model prediction
y=
0 if p < 0 . 5
1 if p â‰¥ 0 . 5
Notice that Ïƒ(t) < 0.5 when t < 0, and Ïƒ(t) â‰¥ 0.5 when t â‰¥ 0, so a Logistic Regression
model predicts 1 if xT Î¸ is positive, and 0 if it is negative.
The score t is often called the logit: this name comes from the fact
that the logit function, defined as logit(p) = log(p / (1  p)), is the
inverse of the logistic function. Indeed, if you compute the logit of
the estimated probability p, you will find that the result is t. The
logit is also called the logodds, since it is the log of the ratio
between the estimated probability for the positive class and the
estimated probability for the negative class.
Training and Cost Function
Good, now you know how a Logistic Regression model estimates probabilities and
makes predictions. But how is it trained? The objective of training is to set the paramâ€
eter vector Î¸ so that the model estimates high probabilities for positive instances (y =
1) and low probabilities for negative instances (y = 0). This idea is captured by the
cost function shown in Equation 416 for a single training instance x.
Equation 416. Cost function of a single training instance
cÎ¸ =
âˆ’log p
if y = 1
âˆ’log 1 âˆ’ p if y = 0
Logistic Regression

145
This cost function makes sense because â€“ log(t) grows very large when t approaches
0, so the cost will be large if the model estimates a probability close to 0 for a positive
instance, and it will also be very large if the model estimates a probability close to 1
for a negative instance. On the other hand, â€“ log(t) is close to 0 when t is close to 1, so
the cost will be close to 0 if the estimated probability is close to 0 for a negative
instance or close to 1 for a positive instance, which is precisely what we want.
The cost function over the whole training set is simply the average cost over all trainâ€
ing instances. It can be written in a single expression (as you can verify easily), called
the log loss, shown in Equation 417.
Equation 417. Logistic Regression cost function (log loss)
1
i
i
J Î¸ = âˆ’ m âˆ‘m
+ 1 âˆ’ y i log 1 âˆ’ p i
i = 1 y log p
The bad news is that there is no known closedform equation to compute the value of
Î¸ that minimizes this cost function (there is no equivalent of the Normal Equation).
But the good news is that this cost function is convex, so Gradient Descent (or any
other optimization algorithm) is guaranteed to find the global minimum (if the learnâ€
ing rate is not too large and you wait long enough). The partial derivatives of the cost
function with regards to the jth model parameter Î¸j is given by Equation 418.
Equation 418. Logistic cost function partial derivatives
1 m
âˆ‚
JÎ¸ =
Ïƒ Î¸T x i âˆ’ y i x ji
âˆ‚Î¸ j
miâˆ‘
=1
This equation looks very much like Equation 45: for each instance it computes the
prediction error and multiplies it by the jth feature value, and then it computes the
average over all training instances. Once you have the gradient vector containing all
the partial derivatives you can use it in the Batch Gradient Descent algorithm. Thatâ€™s
it: you now know how to train a Logistic Regression model. For Stochastic GD you
would of course just take one instance at a time, and for Minibatch GD you would
use a minibatch at a time.
Decision Boundaries
Letâ€™s use the iris dataset to illustrate Logistic Regression. This is a famous dataset that
contains the sepal and petal length and width of 150 iris flowers of three different
species: IrisSetosa, IrisVersicolor, and IrisVirginica (see Figure 422).
146

Chapter 4: Training Models
Figure 422. Flowers of three iris plant species16
Letâ€™s try to build a classifier to detect the IrisVirginica type based only on the petal
width feature. First letâ€™s load the data:
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> list(iris.keys())
[‘data’, ‘target’, ‘target_names’, ‘DESCR’, ‘feature_names’, ‘filename’]
>>> X = iris[“data”][:, 3:] # petal width
>>> y = (iris[“target”] == 2).astype(np.int) # 1 if IrisVirginica, else 0
Now letâ€™s train a Logistic Regression model:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X, y)
Letâ€™s look at the modelâ€™s estimated probabilities for flowers with petal widths varying
from 0 to 3 cm (Figure 423)17:
X_new = np.linspace(0, 3, 1000).reshape(1, 1)
y_proba = log_reg.predict_proba(X_new)
plt.plot(X_new, y_proba[:, 1], “g“, label=”IrisVirginica”)
16 Photos reproduced from the corresponding Wikipedia pages. IrisVirginica photo by Frank Mayfield (Creaâ€
tive Commons BYSA 2.0), IrisVersicolor photo by D. Gordon E. Robertson (Creative Commons BYSA 3.0),
and IrisSetosa photo is public domain.
17 NumPyâ€™s reshape() function allows one dimension to be â€“1, which means â€œunspecifiedâ€: the value is inferred
from the length of the array and the remaining dimensions.
Logistic Regression

147
plt.plot(X_new, y_proba[:, 0], “b–“, label=”Not IrisVirginica”)
# + more Matplotlib code to make the image look pretty
Figure 423. Estimated probabilities and decision boundary
The petal width of IrisVirginica flowers (represented by triangles) ranges from 1.4
cm to 2.5 cm, while the other iris flowers (represented by squares) generally have a
smaller petal width, ranging from 0.1 cm to 1.8 cm. Notice that there is a bit of overâ€
lap. Above about 2 cm the classifier is highly confident that the flower is an IrisVirginica (it outputs a high probability to that class), while below 1 cm it is highly
confident that it is not an IrisVirginica (high probability for the â€œNot IrisVirginicaâ€
class). In between these extremes, the classifier is unsure. However, if you ask it to
predict the class (using the predict() method rather than the predict_proba()
method), it will return whichever class is the most likely. Therefore, there is a decision
boundary at around 1.6 cm where both probabilities are equal to 50%: if the petal
width is higher than 1.6 cm, the classifier will predict that the flower is an IrisVirginica, or else it will predict that it is not (even if it is not very confident):
>>> log_reg.predict([[1.7], [1.5]])
array([1, 0])
Figure 424 shows the same dataset but this time displaying two features: petal width
and length. Once trained, the Logistic Regression classifier can estimate the probabilâ€
ity that a new flower is an IrisVirginica based on these two features. The dashed line
represents the points where the model estimates a 50% probability: this is the modelâ€™s
decision boundary. Note that it is a linear boundary.18 Each parallel line represents the
points where the model outputs a specific probability, from 15% (bottom left) to 90%
(top right). All the flowers beyond the topright line have an over 90% chance of
being IrisVirginica according to the model.
18 It is the the set of points x such that Î¸0 + Î¸1×1 + Î¸2×2 = 0, which defines a straight line.
148

Chapter 4: Training Models
Figure 424. Linear decision boundary
Just like the other linear models, Logistic Regression models can be regularized using
â„“1 or â„“2 penalties. ScitkitLearn actually adds an â„“2 penalty by default.
The hyperparameter controlling the regularization strength of a
ScikitLearn LogisticRegression model is not alpha (as in other
linear models), but its inverse: C. The higher the value of C, the less
the model is regularized.
Softmax Regression
The Logistic Regression model can be generalized to support multiple classes directly,
without having to train and combine multiple binary classifiers (as discussed in
Chapter 3). This is called Softmax Regression, or Multinomial Logistic Regression.
The idea is quite simple: when given an instance x, the Softmax Regression model
first computes a score sk(x) for each class k, then estimates the probability of each
class by applying the softmax function (also called the normalized exponential) to the
scores. The equation to compute sk(x) should look familiar, as it is just like the equaâ€
tion for Linear Regression prediction (see Equation 419).
Equation 419. Softmax score for class k
sk x = xT Î¸ k
Note that each class has its own dedicated parameter vector Î¸(k). All these vectors are
typically stored as rows in a parameter matrix Î˜.
Once you have computed the score of every class for the instance x, you can estimate
the probability pk that the instance belongs to class k by running the scores through
the softmax function (Equation 420): it computes the exponential of every score,
Logistic Regression

149
then normalizes them (dividing by the sum of all the exponentials). The scores are
generally called logits or logodds (although they are actually unnormalized logodds).
Equation 420. Softmax function
pk = Ïƒ s x k =
exp sk x
K
âˆ‘ j = 1 exp s j x
â€¢ K is the number of classes.
â€¢ s(x) is a vector containing the scores of each class for the instance x.
â€¢ Ïƒ(s(x))k is the estimated probability that the instance x belongs to class k given
the scores of each class for that instance.
Just like the Logistic Regression classifier, the Softmax Regression classifier predicts
the class with the highest estimated probability (which is simply the class with the
highest score), as shown in Equation 421.
Equation 421. Softmax Regression classifier prediction
y = argmax Ïƒ s x k = argmax sk x = argmax
k
k
k
Î¸k
T
x
â€¢ The argmax operator returns the value of a variable that maximizes a function. In
this equation, it returns the value of k that maximizes the estimated probability
Ïƒ(s(x))k.
The Softmax Regression classifier predicts only one class at a time
(i.e., it is multiclass, not multioutput) so it should be used only with
mutually exclusive classes such as different types of plants. You
cannot use it to recognize multiple people in one picture.
Now that you know how the model estimates probabilities and makes predictions,
letâ€™s take a look at training. The objective is to have a model that estimates a high
probability for the target class (and consequently a low probability for the other
classes). Minimizing the cost function shown in Equation 422, called the cross
entropy, should lead to this objective because it penalizes the model when it estimates
a low probability for a target class. Cross entropy is frequently used to measure how
150

Chapter 4: Training Models
well a set of estimated class probabilities match the target classes (we will use it again
several times in the following chapters).
Equation 422. Cross entropy cost function
1
i
i
K
J Î˜ = âˆ’ m âˆ‘m
i = 1 âˆ‘k = 1 yk log pk
â€¢ yki is the target probability that the ith instance belongs to class k. In general, it is
either equal to 1 or 0, depending on whether the instance belongs to the class or
not.
Notice that when there are just two classes (K = 2), this cost function is equivalent to
the Logistic Regressionâ€™s cost function (log loss; see Equation 417).
Cross Entropy
Cross entropy originated from information theory. Suppose you want to efficiently
transmit information about the weather every day. If there are eight options (sunny,
rainy, etc.), you could encode each option using 3 bits since 23 = 8. However, if you
think it will be sunny almost every day, it would be much more efficient to code
â€œsunnyâ€ on just one bit (0) and the other seven options on 4 bits (starting with a 1).
Cross entropy measures the average number of bits you actually send per option. If
your assumption about the weather is perfect, cross entropy will just be equal to the
entropy of the weather itself (i.e., its intrinsic unpredictability). But if your assumpâ€
tions are wrong (e.g., if it rains often), cross entropy will be greater by an amount
called the Kullbackâ€“Leibler divergence.
The cross entropy between two probability distributions p and q is defined as
H p, q = âˆ’ âˆ‘x p x log q x (at least when the distributions are discrete). For more
details, check out this video.
The gradient vector of this cost function with regards to Î¸(k) is given by Equation
423:
Equation 423. Cross entropy gradient vector for class k
Ö k JÎ˜ =
Î¸
1 m
pki âˆ’ yki x i
miâˆ‘
=1
Now you can compute the gradient vector for every class, then use Gradient Descent
(or any other optimization algorithm) to find the parameter matrix Î˜ that minimizes
the cost function.
Logistic Regression

151
Letâ€™s use Softmax Regression to classify the iris flowers into all three classes. ScikitLearnâ€™s LogisticRegression uses oneversusall by default when you train it on more
than two classes, but you can set the multi_class hyperparameter to “multinomial”
to switch it to Softmax Regression instead. You must also specify a solver that supâ€
ports Softmax Regression, such as the “lbfgs” solver (see ScikitLearnâ€™s documentaâ€
tion for more details). It also applies â„“2 regularization by default, which you can
control using the hyperparameter C.
X = iris[“data”][:, (2, 3)]
y = iris[“target”]
# petal length, petal width
softmax_reg = LogisticRegression(multi_class=”multinomial”,solver=”lbfgs”, C=10)
softmax_reg.fit(X, y)
So the next time you find an iris with 5 cm long and 2 cm wide petals, you can ask
your model to tell you what type of iris it is, and it will answer IrisVirginica (class 2)
with 94.2% probability (or IrisVersicolor with 5.8% probability):
>>> softmax_reg.predict([[5, 2]])
array([2])
>>> softmax_reg.predict_proba([[5, 2]])
array([[6.38014896e07, 5.74929995e02, 9.42506362e01]])
Figure 425 shows the resulting decision boundaries, represented by the background
colors. Notice that the decision boundaries between any two classes are linear. The
figure also shows the probabilities for the IrisVersicolor class, represented by the
curved lines (e.g., the line labeled with 0.450 represents the 45% probability boundâ€
ary). Notice that the model can predict a class that has an estimated probability below
50%. For example, at the point where all decision boundaries meet, all classes have an
equal estimated probability of 33%.
Figure 425. Softmax Regression decision boundaries
152

Chapter 4: Training Models
Exercises
1. What Linear Regression training algorithm can you use if you have a training set
with millions of features?
2. Suppose the features in your training set have very different scales. What algoâ€
rithms might suffer from this, and how? What can you do about it?
3. Can Gradient Descent get stuck in a local minimum when training a Logistic
Regression model?
4. Do all Gradient Descent algorithms lead to the same model provided you let
them run long enough?
5. Suppose you use Batch Gradient Descent and you plot the validation error at
every epoch. If you notice that the validation error consistently goes up, what is
likely going on? How can you fix this?
6. Is it a good idea to stop Minibatch Gradient Descent immediately when the valiâ€
dation error goes up?
7. Which Gradient Descent algorithm (among those we discussed) will reach the
vicinity of the optimal solution the fastest? Which will actually converge? How
can you make the others converge as well?
8. Suppose you are using Polynomial Regression. You plot the learning curves and
you notice that there is a large gap between the training error and the validation
error. What is happening? What are three ways to solve this?
9. Suppose you are using Ridge Regression and you notice that the training error
and the validation error are almost equal and fairly high. Would you say that the
model suffers from high bias or high variance? Should you increase the regulariâ€
zation hyperparameter Î± or reduce it?
10. Why would you want to use:
â€¢ Ridge Regression instead of plain Linear Regression (i.e., without any regulariâ€
zation)?
â€¢ Lasso instead of Ridge Regression?
â€¢ Elastic Net instead of Lasso?
11. Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime.
Should you implement two Logistic Regression classifiers or one Softmax Regresâ€
sion classifier?
12. Implement Batch Gradient Descent with early stopping for Softmax Regression
(without using ScikitLearn).
Solutions to these exercises are available in ???.
Exercises

153
Purchase answer to see full
attachment