
Description

Q1. In the homework folder you can find the data file BinaryData.csv, which consists of two columns
x and y. The variable y is of categorical nature, taking only 0-1 values. The goal of this question is to write
your own logistic regression module and compare it with the R (or Python) logistic regression package.
(a) Load the data file BinaryData.csv and perform a simple logistic regression in the programming
language of your choice, predicting the class y based on x. Report the values of β0 and β1.
(b) Let’s see if we can regenerate the same results using our own logistic regression module. For this
purpose consider a model in the form of
y = sigmoid(β0 + β1 x).
In the class we showed that the negative log likelihood loss corresponding to this model is in the
form of
L(β0, β1) = Σ_{i=1}^{n} [ yi log(1 + e^(−β0 − β1 xi)) + (1 − yi) log(1 + e^(β0 + β1 xi)) ].
In Question 3 of HW2, you showed that each summand is a convex function. Using a similar line of
derivation as you did for that question, derive an expression for
∂L/∂β0 = … ,    ∂L/∂β1 = … .
Hint. I write the final formula for ∂L/∂β0, but still leave the derivation to you:
∂L/∂β0 = Σ_{i=1}^{n} [ sigmoid(β0 + β1 xi) − yi ].
(c) Write a gradient descent (GD) scheme to minimize L(β0, β1) in R or Python. For your scheme use a
learning rate of η = 0.01, and run the GD for 500 iterations. As the initial values for β0 and β1 you
can use zero; since our problem is convex and has only one minimum, starting from other
initializations yields identical results anyway. If you implement your code correctly, you
should get identical results for β0 and β1 as in part (a). Attach your code and final results.
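A minimal sketch of such a GD scheme in R, shown here on synthetic data rather than BinaryData.csv (swap in the real x and y to reproduce part (a)); the β1 update below uses the analogue of the hinted β0 gradient with an extra xi factor:

```r
# Sketch of the GD scheme on synthetic data (not the BinaryData.csv solution)
sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, sigmoid(-1 + 2 * x))   # synthetic 0-1 responses

eta   <- 0.01                              # learning rate
beta0 <- 0; beta1 <- 0
for (it in 1:500) {
  p     <- sigmoid(beta0 + beta1 * x)
  grad0 <- sum(p - y)                      # dL/dbeta0 (the hinted formula)
  grad1 <- sum((p - y) * x)                # dL/dbeta1 (same form, times x_i)
  beta0 <- beta0 - eta * grad0
  beta1 <- beta1 - eta * grad1
}
c(beta0, beta1)                            # compare with glm(y ~ x, family = binomial)
```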
Q2. In this question the goal is applying your own QDA framework to the data BinaryData.csv.
(a) Read the data, and calculate π0, µ0, and σ0² for class 0, and π1, µ1, and σ1² for class 1. To
calculate the variance use the equation (1/(n − 1)) Σi (xi − µ)². If you do the calculations correctly
you should get integer values for µ0, σ0², µ1, and σ1².
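A brief sketch of these calculations in R, using a synthetic stand-in for BinaryData.csv (replace dat with the loaded file):

```r
# Sketch of the class-wise estimates (synthetic stand-in for BinaryData.csv)
set.seed(1)
dat <- data.frame(x = c(rnorm(50, 2), rnorm(50, 6)),
                  y = rep(c(0, 1), each = 50))

pi0  <- mean(dat$y == 0);          pi1  <- mean(dat$y == 1)
mu0  <- mean(dat$x[dat$y == 0]);   mu1  <- mean(dat$x[dat$y == 1])
var0 <- var(dat$x[dat$y == 0]);    var1 <- var(dat$x[dat$y == 1])  # var() uses 1/(n - 1)
c(pi0, mu0, var0, pi1, mu1, var1)
```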
(b) Based on the values you found in part (a), calculate the decision point which separates class 0 from
class 1. Basically, if xd is the decision point, your class prediction for the input x is

predicted label of x = 0 if x < xd
predicted label of x = 1 if x ≥ xd

Hint: In the class we discussed how to find the decision point, which is the point on the crossing of the weighted Gaussian profiles. You will end up with a quadratic equation, with two roots. You need to pick the root that is between µ0 and µ1.

Q3. Titanic was a British passenger liner operated by the White Star Line that sank in the North Atlantic Ocean on 15 April 1912. The ship sank after striking an iceberg during her maiden voyage from Southampton to New York City. Many of you may be familiar with this tragic story from a movie of the same title, directed by James Cameron, and starring Leonardo DiCaprio and Kate Winslet. The goal of this question is to make a model that predicts which passengers survived the Titanic shipwreck. In the homework package, you can access the data file Titanic.csv, which is a part of the actual Titanic passenger data. The column Survived indicates if the passenger survived or not. If 1, the passenger survived, and if 0 they did not. Below we present a brief description of each feature:

Pclass: Passenger's ticket class
Name: Name of the passenger
Sex: Passenger's gender
Age: Passenger's age in years
SibSp: number of siblings / spouses aboard the Titanic
Parch: number of parents / children aboard the Titanic
Fare: Passenger fare
Embarked: Port of embarkation

(a) Let's first preprocess the data. Read the data file Titanic.csv. Drop the column Name. While this column is certainly useful to trace families, at this point we will not use it in our model. You may also notice that there are many missing entries in the column Age. To fill in the missing entries of this column, first calculate the mean age over all the available entries, then replace the missing entries with that value. Now you should have a dataset with no missing entries, with 891 samples (note that the first row of the CSV file contains the variable names).

(b) Split the data you created in part (a) into test and train sets. Your train set would contain the first 750 rows of the data, and your test set would contain rows 751 to 891 of the data.

(c) Try the following classification models to predict Survived in terms of the other features in the dataset. For your classification, treat the variables Pclass, Sex, Embarked as categorical variables and Age, SibSp, Parch, Fare as numerical features (you may want to look into your textbook and see examples where the command as.factor is used). Specifically, do the following:
– Use logistic regression for your classification. Report the p-values associated with the intercept and all the numerical features. Which features have large p-values? Use the test data to estimate the accuracy of your model.
– Apply LDA and QDA, and again report your model accuracies using the test data.
– Among logistic regression, LDA and QDA, which model seemed the most accurate one?

(d) Repeat the steps in part (c), this time treating Sex, Embarked as categorical variables and Pclass, Age, SibSp, Parch, Fare as numerical features.

Q4 (Bootstrapping). In statistics, for a given random variable, the ratio of variance to the mean is a measure of dispersion for that random variable, also referred to as the Fano factor. Under a normality assumption, consider n independent samples x1, x2, . . . , xn. One empirical way to estimate the Fano factor would be to calculate it through

D = σ̂² / µ̂,   where   µ̂ = (x1 + . . . + xn) / n,   σ̂² = (1/(n − 1)) Σ_{i=1}^{n} (xi − µ̂)².
Consider 10 independent samples from a reference normal distribution as follows:

x1 = 8.2344, x2 = 4.4854, x3 = 5.4821, x4 = 1.0953, x5 = 2.1565,
x6 = 2.5096, x7 = 4.9772, x8 = 2.4998, x9 = 4.2628, x10 = 0.6933.

(a) What is the value of D for this sample set?

(b) Next, we want to see, if we could repeat this experiment many times by drawing samples from the reference distribution and calculating D, how the value of D would vary. Since we do not know the reference distribution, we would need to do bootstrapping to see the variation in D. Run a bootstrapping experiment with B = 10000 in a programming language of your choice, and estimate the standard deviation for D.

(c) I am going to reveal that the reference distribution from which we drew the 10 samples above is a normal distribution with mean 3 and variance 3.24. Run an experiment where, for 10000 times, you draw 10 samples from a normal distribution with mean 3 and variance 3.24, and then calculate D. Let's call these values D1, . . . , D10000. Calculate the standard deviation of these 10000 quantities (that would simply be the command sd in R). Compare this value with the value you obtained in part (b).

Some Notes. You should see that in part (b), despite no access to the actual reference distribution, we managed to calculate the variations of D in a way that is close to part (c), where we were fully aware of the reference distribution. BTW, don't set the bar too high; for this problem you may see around 25% difference between the values in part (b) and part (c).

Q5 (VC Dimension). The goal of this question is to go through one example related to the VC dimension. This question does not require any Math and, using a bit of your intuition, you should be able to solve it. I walk with you through it step by step. Consider H to be the set of ellipsoid classifiers on the 2D plane. Technically these classifiers can be modeled as

H = {h : h(x, y) = √((x − a)² + (y − b)²) + √((x − a′)² + (y − b′)²) − c}

where a, a′, b, b′ and c ≥ 0 are parameters of your classifier. By changing these parameters you can create ellipsoids centered anywhere on the plane with any desired ratio between the radii and any orientation (examples are shown in Figure 1). As discussed in the class, for binary classification h < 0 can characterize one class and h > 0 may characterize the other class.
Figure 1: examples of classifiers that H can create
(a) Show that the VC dimension of H is at least 5.
Hint. To show this you would need to come up with one set of 5 points on the 2D plane which can
be shattered by H. To do this you would need to realize 2⁵ = 32 dichotomies, but here I suggest an
easier way. Consider the points sitting on the vertices of a pentagon; then many realizations become
rotated versions of each other. If you manage to show that in all 8 cases depicted in Figure 2, a 2D
ellipsoid can separate the two classes, you would be done! To get you started I have done this for
the first two cases. Just do it for the other 6.
Figure 2: Dichotomies needed for 5 points
(b) It turns out that the VC dimension of H is precisely 5, which basically means that no set of 6 points
can be shattered by H. Showing this is beyond our technical interest¹. However, my question is
easy. For the set of six points below, suggest an unrealizable dichotomy for H. In simple words,
color the dots in a way that no ellipsoid can separate the black dots from the white dots.
¹ I share the paper in case you are interested in pursuing research in complexity theory: https://arxiv.org/pdf/1109.4347.pdf
Class 4: Classification
MSA 8150: Machine Learning for Analytics
Alireza Aghasi
Institute for Insight, Georgia State University
Classification
– In many applications, the response is not a quantitative value and
instead represents a class, e.g., y ∈ {student, non-student},
y ∈ {white, yellow, green}
– Yet based on the observation of some features, we would like to
predict the class (what we refer to as classification)
– Regression vs classification
[Figure: left, a surface of Income as a function of Years of Education and Seniority (regression); right, a scatter of data points in the X1–X2 plane (classification).]
Classification
Example. Predicting default cases on the credit card (unable to pay the credit card), based on the income and current balance.
[Figure: scatter of Income vs. Balance colored by Default status (No/Yes), alongside boxplots of Balance and Income grouped by Default.]
(one immediate observation is that balance is probably the more useful feature)
Binary Classification
– In simple regression for a single feature x we fitted a line
y = β0 + β1 x to the data
– In binary classification with only one feature, we no longer have numerical
values to predict, but two classes (say class 0 and class 1)
– Can we do the fit in a way that the sign of β0 + β1 x becomes an
indicator of the class for us?
– In other words, for a given feature xt , we make a decision based on
the following:
yt = 1 if β0 + β1 xt > 0
yt = 0 if β0 + β1 xt < 0
– A smooth function (called Sigmoid – also inverse Logit) that takes
almost binary values 0, 1 based on the sign of the input z is
e^z / (1 + e^z) ≈ 1 if z ≫ 0
e^z / (1 + e^z) ≈ 0 if z ≪ 0
– When we have more than two labels (e.g., y ∈ {white, yellow, green}) and p
features x1, x2, · · · , xp, we fit K models parametrized by
Label 1: {β0^(1), β1^(1), · · · , βp^(1)}
Label 2: {β0^(2), β1^(2), · · · , βp^(2)}
...
Label K: {β0^(K), β1^(K), · · · , βp^(K)}
– For this problem we consider the following form:
pk(x) = P(Y = k|x) = e^(β0^(k) + β1^(k) x1 + · · · + βp^(k) xp) / ( e^(β0^(1) + · · · + βp^(1) xp) + · · · + e^(β0^(K) + · · · + βp^(K) xp) )
– What is the sum of all P(Y = k|x) for a fixed x?
Some R Simulations
– Let's perform some basic classification tasks in R!
13
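A sketch of what such a session could look like, on synthetic data (the variables below are made up for illustration):

```r
# Sketch: simple logistic regression on synthetic data (illustration only)
set.seed(1)
n <- 200
x <- rnorm(n)
p <- 1 / (1 + exp(-(-1 + 2 * x)))          # true sigmoid probabilities
y <- rbinom(n, size = 1, prob = p)

fit <- glm(y ~ x, family = binomial)
summary(fit)$coefficients                  # estimates of beta0, beta1 with p-values
predict(fit, newdata = data.frame(x = 0.5), type = "response")  # P(y = 1 | x = 0.5)
```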
Linear and Quadratic
Discriminant Analysis
LDA/QDA Story in Simple Words
[Figure: two clouds of points on the real line, blue and red, each fitted with a Normal pdf.]
Let's do an exercise. We are given 30 blue points and 40 red points as
above. Let's fit a Normal pdf to each cloud
LDA/QDA Story in Simple Words
[Figure: the two fitted Normal pdfs with similar variances, crossing at a single point.]
Let's assume both variances turn out similar. Find the intersection point.
What can you say about this point?
LDA/QDA Story in Simple Words
[Figure: the two fitted Normal pdfs, with the intersection point splitting the real line into a Blue region and a Red region.]
We have found a rule to classify each point on the real line
LDA/QDA Story in Simple Words
[Figure: a different pair of blue and red point clouds on the real line, again fitted with Normal pdfs.]
Let's do a different exercise with another set of points
LDA/QDA Story in Simple Words
[Figure: the fitted Normal pdfs now have different variances, so the crossing point moves.]
The intersection point changes since the variances are no longer the same.
LDA/QDA Story in Simple Words
[Figure: the two fitted pdfs with the new intersection point splitting the line into Blue and Red regions.]
Again we have a rule. If you want to assign a probability of being blue
and being red to each point, how would you calculate that probability?
P(Y = B|x) = pB(x) = fB(x) / (fB(x) + fR(x)),   P(Y = R|x) = pR(x) = fR(x) / (fB(x) + fR(x))
LDA/QDA Story in Simple Words
• Say you want to incorporate the probability of each class in your
calculation of P(Y |x). We can use the Bayes formula
– Probabilistically, suppose that our y can take K distinct values. By
the Bayes’ theorem we have
P(Y = ℓ|X = x) = P(Y = ℓ) P(X = x|Y = ℓ) / Σ_{k=1}^K P(Y = k) P(X = x|Y = k)
             = πℓ fℓ(x) / Σ_{k=1}^K πk fk(x)
– Let's see why this equality holds, knowing the Bayes' equality
P(A|B) = P(A ∩ B) / P(B)
and A1 ∪ A2 ∪ · · · ∪ AK covering the entire space, where Ai ∩ Aj = ∅.
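A quick sketch of why this holds: Bayes' equality gives
P(Y = ℓ|X = x) = P(Y = ℓ, X = x) / P(X = x) = P(Y = ℓ) P(X = x|Y = ℓ) / P(X = x).
Since the events {Y = 1}, . . . , {Y = K} partition the space, the law of total probability gives
P(X = x) = Σ_{k=1}^K P(Y = k) P(X = x|Y = k),
and writing πk = P(Y = k) and fk(x) = P(X = x|Y = k) yields the stated formula.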
LDA/QDA Story in Simple Words
[Figure: once the class priors are included, the weighted pdfs cross at a new point.]
The intersection point changes yet again
LDA/QDA Story in Simple Words
[Figure: the prior-weighted pdfs with the resulting Red and Blue decision regions.]
P(Y = B|x) = pB(x) = (3/7) fB(x) / ( (3/7) fB(x) + (4/7) fR(x) ),
P(Y = R|x) = pR(x) = (4/7) fR(x) / ( (3/7) fB(x) + (4/7) fR(x) )
LDA/QDA Story in Simple Words
Question. What if we have more than two classes?
[Figure: three point clouds on the real line with weighted Normal pdfs fitted to label 1 (30 pts), label 2 (60 pts), and label 3 (90 pts).]
Story in Simple Words
Answer. Same idea!
Story in Simple Words
Goal. Higher dimensions,
µ1 = (0, 0)ᵀ,  Σ1 = [1 0; 0 1],  µ2 = (−1, −1)ᵀ,  Σ2 = [1 0.9; 0.9 1]
Also, in higher dimensions the intersection between two Normal pdfs is
no longer just a point, it would be a boundary
Little Introduction about Multivariate Normal
[Figure: surface plots of bivariate Normal densities over the x1–x2 plane.]
– Recall the normal distribution for a random variable x:
f(x) = (1 / √(2πσ²)) exp( −(x − µ)² / (2σ²) )
– Similar to the scalar case, we can define a distribution for the
random vector x = [x1, · · · , xp]ᵀ as
f(x) = (1 / ((2π)^(p/2) |Σ|^(1/2))) exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )
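As a minimal sketch, this density can be evaluated directly in R (the numbers below are arbitrary, just to exercise the formula):

```r
# Minimal sketch: evaluate the multivariate Normal density by hand (illustrative values)
dmvnorm_manual <- function(x, mu, Sigma) {
  p    <- length(mu)
  diff <- x - mu
  exp(-0.5 * t(diff) %*% solve(Sigma) %*% diff) /
    ((2 * pi)^(p / 2) * sqrt(det(Sigma)))
}

mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.9, 0.9, 1), nrow = 2)
dmvnorm_manual(c(0.5, -0.2), mu, Sigma)
```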
Linear Discriminant Analysis (LDA)
– Recall
P(Y = ℓ|X = x) = P(Y = ℓ) P(X = x|Y = ℓ) / Σ_{k=1}^K P(Y = k) P(X = x|Y = k) = πℓ fℓ(x) / Σ_{k=1}^K πk fk(x)
– The purpose of LDA is learning a model for P(Y = ℓ|X = x)
– In the formulation above, fℓ(x) is in a sense the distribution we
consider for the data points in class ℓ, and πℓ is the probability that
we pick some random sample and it belongs to class ℓ
– In LDA, we assume that all fℓ(x) have a multivariate normal
distribution with similar covariances and different means, i.e.
fℓ(x) = (1 / ((2π)^(p/2) |Σ|^(1/2))) exp( −(1/2) (x − µℓ)ᵀ Σ⁻¹ (x − µℓ) )
– Unlike logistic regression, which involved a rather complicated
maximization for learning, in LDA we have closed form expressions
for πℓ, µℓ and Σ and classifying new test points becomes very easy
Linear Discriminant Analysis (LDA)
– Given a training set (x1, y1), · · · , (xn, yn) where the responses y can
take K distinct class values 1, 2, · · · , K, we can easily learn the LDA
model by calculating πℓ, µℓ and Σ via (considering cℓ to be the
index of samples in class ℓ)
π̂ℓ = (# of elements in cℓ) / n
µ̂ℓ = (1 / # of elements in cℓ) Σ_{i ∈ cℓ} xi
Σ̂ = (1 / (n − K)) Σ_{k=1}^K Σ_{i ∈ ck} (xi − µ̂k)(xi − µ̂k)ᵀ
– After this point for a new test point xt we have all that is needed to
calculate P(Y = ℓ|X = xt) for ℓ = 1, · · · , K and pick as the label
the one that is largest
Linear Discriminant Analysis (LDA)
[Figure: two panels in the X1–X2 plane showing the class scatter and the LDA decision boundaries.]
Linear Discriminant Analysis (LDA)
– In practice to assign a label to a given test point xt we do not need
to calculate
P(Y = ℓ|X = xt) = πℓ fℓ(xt) / Σ_{k=1}^K πk fk(xt)
and only comparing πℓ fℓ(xt) is enough
– This reduces to evaluate
δℓ = xtᵀ Σ⁻¹ µℓ − (1/2) µℓᵀ Σ⁻¹ µℓ + log πℓ
and pick as the class the ℓ corresponding to the largest δℓ
– You can find the decision boundary between class i and j by finding
the points for which δi = δj
– [see the sample Matlab code]
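This is not the course's sample Matlab code, but a minimal R sketch of the same closed-form recipe on simulated two-class data (all names and data below are made up):

```r
# Minimal LDA sketch on simulated 2-class data (toy example, not the course code)
set.seed(1)
n0 <- 60; n1 <- 40
X <- rbind(matrix(rnorm(2 * n0, mean = 0), ncol = 2),
           matrix(rnorm(2 * n1, mean = 2), ncol = 2))
y <- c(rep(1, n0), rep(2, n1))
K <- 2; n <- length(y); p <- ncol(X)

pi_hat <- table(y) / n                                   # class priors
mu_hat <- rbind(colMeans(X[y == 1, ]), colMeans(X[y == 2, ]))

# pooled covariance with the (n - K) denominator
S <- matrix(0, p, p)
for (k in 1:K) {
  Xk <- scale(X[y == k, ], center = mu_hat[k, ], scale = FALSE)
  S  <- S + t(Xk) %*% Xk
}
Sigma_hat <- S / (n - K)

# discriminant scores delta_l(xt) for a new point xt
delta <- function(xt) {
  sapply(1:K, function(k)
    t(xt) %*% solve(Sigma_hat) %*% mu_hat[k, ] -
      0.5 * t(mu_hat[k, ]) %*% solve(Sigma_hat) %*% mu_hat[k, ] +
      log(pi_hat[k]))
}
which.max(delta(c(1, 1)))   # predicted class label
```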
Quadratic Discriminant Analysis (QDA)
– Recall
P(Y = ℓ|X = xt) = πℓ fℓ(xt) / Σ_{k=1}^K πk fk(xt)
– The purpose of QDA is learning a model for P(Y = ℓ|X = x) in a
more flexible way compared to LDA
– In QDA, we assume that all fℓ(x) have a multivariate normal
distribution with different covariances and different means, i.e.
fℓ(x) = (1 / ((2π)^(p/2) |Σℓ|^(1/2))) exp( −(1/2) (x − µℓ)ᵀ Σℓ⁻¹ (x − µℓ) )
– The main difference between LDA and QDA is that in LDA we consider a
single Σ for all classes, but in QDA we allow more flexibility by
having a different covariance matrix for each class
– Similar to LDA, QDA can be learned easily and we can obtain closed
form expressions for πℓ, µℓ and Σℓ
Quadratic Discriminant Analysis (QDA)
– Given a training set (x1, y1), · · · , (xn, yn) where the responses y can
take K distinct class values 1, 2, · · · , K, we can easily learn the QDA
model by calculating πℓ, µℓ and Σℓ via (considering cℓ to be the
index of samples in class ℓ)
π̂ℓ = (# of elements in cℓ) / n
µ̂ℓ = (1 / # of elements in cℓ) Σ_{i ∈ cℓ} xi
Σ̂ℓ = (1 / (# of elements in cℓ − 1)) Σ_{i ∈ cℓ} (xi − µ̂ℓ)(xi − µ̂ℓ)ᵀ
– After this point for a new test point xt we have all that is needed to
calculate P(Y = ℓ|X = xt) for ℓ = 1, · · · , K and pick as the label
the one that is largest
Quadratic Discriminant Analysis (QDA)
[Figure: two panels in the X1–X2 plane showing the class scatter and the QDA decision boundaries.]
Quadratic Discriminant Analysis (QDA)
– In practice to assign a label to a given test point x t we do not need
to calculate
P(Y = ℓ|X = xt) = πℓ fℓ(xt) / Σ_{k=1}^K πk fk(xt)
and only comparing πℓ fℓ(xt) is enough
– This reduces to evaluate
δℓ = −(1/2) log |Σℓ| − (1/2) (xt − µℓ)ᵀ Σℓ⁻¹ (xt − µℓ) + log πℓ
and pick as the class the ℓ corresponding to the largest δℓ
– [see the sample Matlab code]
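Again, not the course's sample code: a minimal R sketch of the QDA scores on simulated data, where the only change from the LDA sketch is the per-class covariance:

```r
# Minimal QDA sketch on simulated 2-class data (toy example, not the course code)
set.seed(1)
X <- rbind(matrix(rnorm(120, mean = 0), ncol = 2),
           matrix(rnorm(80,  mean = 2), ncol = 2))
y <- rep(1:2, times = c(60, 40))
K <- 2; n <- length(y)

pi_hat  <- table(y) / n
mu_hat  <- rbind(colMeans(X[y == 1, ]), colMeans(X[y == 2, ]))
Sigma_k <- lapply(1:K, function(k) cov(X[y == k, ]))  # per-class covariance, 1/(n_k - 1)

delta_qda <- function(xt) {
  sapply(1:K, function(k) {
    d <- xt - mu_hat[k, ]
    -0.5 * log(det(Sigma_k[[k]])) - 0.5 * t(d) %*% solve(Sigma_k[[k]]) %*% d + log(pi_hat[k])
  })
}
which.max(delta_qda(c(1, 1)))   # predicted class label
```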
Some R Simulations
– Let's perform some basic classification tasks in R!
35
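A rough sketch of such a session using the lda() and qda() functions from the MASS package, on simulated data (the data frame and column names below are made up):

```r
# Rough sketch of an R session for LDA/QDA (simulated data, not the course's script)
library(MASS)

set.seed(1)
df <- data.frame(
  x1 = c(rnorm(60, 0), rnorm(40, 2)),
  x2 = c(rnorm(60, 0), rnorm(40, 2)),
  y  = factor(rep(c(0, 1), times = c(60, 40)))
)

lda_fit <- lda(y ~ x1 + x2, data = df)
qda_fit <- qda(y ~ x1 + x2, data = df)

# in-sample predicted labels and accuracy
mean(predict(lda_fit, df)$class == df$y)
mean(predict(qda_fit, df)$class == df$y)
```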
Summary
– Logistic regression is very popular for classification, especially for
binary classification
– LDA is especially useful when K > 2, the number of training
samples is small, or the classes are well separated, and Gaussian
assumptions are reasonable.
– QDA presents more flexibility in shaping the partitions compared to
LDA
– Logistic regression can also fit quadratic boundaries like QDA, by
explicitly including quadratic terms in the model
36
Questions?
36
References
https://www.alsharif.info/iom530, 2013.
J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning. Springer series in statistics, 2nd edition, 2009.
G. James, D. Witten, T. Hastie, and R. Tibshirani. https://lagunita.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/classification.pdf, 2013.
G. James, D. Witten, T. Hastie, and R. Tibshirani. An introduction to statistical learning: with applications in R, volume 112. Springer, 2013.
Class 5: Resampling/Boosting and Model
Complexity
MSA 8150: Machine Learning for Analytics
Alireza Aghasi
Institute for Insight, Georgia State University
Bootstrap
Bootstrap
– The bootstrap is a flexible and very powerful statistical tool that can
be used to quantify the uncertainty associated with a given
estimator or statistical learning method
– It can provide an estimate of the standard error of a coefficient, or a
confidence interval for that coefficient, regardless of how
complex the derivation of that coefficient is
1
A Very Simple Example
– Think of a block, which takes 100 randomly generated samples
drawn from a distribution and generates their sample mean x̄ as the
output
– We want to see if we do this many, many times, how the outputs of
our block vary
– In principle, to do this we would need to run the experiment, for say
1000 times, and then look at the variations of x̄, however, this
requires constant access to the distribution and a lot of sampling
(which in many applications can be expensive to acquire)
– Bootstrap can help us do this without accessing the reference
distribution
– Let’s see the code
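The code itself is shown in class; a minimal sketch of what it could look like, assuming a standard normal reference distribution (an arbitrary choice here):

```r
# Minimal sketch: bootstrap the variability of a sample mean (illustrative only)
set.seed(1)
x <- rnorm(100)                 # the one sample set we actually observe

B <- 1000
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))
sd(boot_means)                  # bootstrap estimate of SE(x_bar)
sd(x) / sqrt(length(x))         # classical SE estimate, for comparison
```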
Bootstrap via an Example
– Let's explain bootstrap via an example
– Suppose that we wish to invest a fixed sum of money in two
financial assets that yield returns of X and Y (random quantities)
– We will invest α shares in X , and will invest the remaining 1 − α in
Y
– To minimize the risk, we want to minimize var (αX + (1 − α)Y )
– We can show (in the class we do it) that the minimizer is
α = ( var(Y) − cov(X, Y) ) / ( var(X) + var(Y) − 2 cov(X, Y) )
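A quick sketch of that derivation:
var(αX + (1 − α)Y) = α² var(X) + (1 − α)² var(Y) + 2α(1 − α) cov(X, Y).
Setting the derivative with respect to α to zero,
2α var(X) − 2(1 − α) var(Y) + 2(1 − 2α) cov(X, Y) = 0,
and solving for α gives
α = ( var(Y) − cov(X, Y) ) / ( var(X) + var(Y) − 2 cov(X, Y) ).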
Bootstrap via an Example
– In the real world we do not know var(X), var(Y), cov(X, Y)
– Suppose we are given a data set containing pairs of X and Y . We
can estimate var (X ), var (Y ), cov (X , Y ) from the sample set and
get an estimate α̂ for the optimal share
– Ideally we can generate these sample sets many times, and estimate
an α̂ for each and look into the histogram
– However, in the real world we only have one sample set to use
– Bootstrap yet allows us to generate good estimates of α using only
one sample set!
4
Bootstrap via an Example
To see how nicely the bootstrap works, let's compare its outcome with the
case where α̂ is estimated from many synthetically generated sample sets
– We generate 1000 sample sets each containing 100 pairs of X , Y
– For the synthetic data generated, var(X) = 1, var(Y) = 1.25 and
cov(X, Y) = 0.5, which yield an optimal value of α = 0.6
[Figure: scatter plots of Y versus X for several of the simulated sample sets.]
Bootstrap via an Example
– To get the left panel we generate 1000 synthetic sample sets, for
each obtain α̂ and plot the histogram and calculate
ᾱ = (1/1000) Σ_{i=1}^{1000} α̂i ,   SE(α) = sqrt( (1/999) Σ_{i=1}^{1000} (α̂i − ᾱ)² )
– For the bootstrap we only use one of the sample sets and regenerate
new sample sets by sampling with replacement
– Surprisingly, the results are very close
[Figure: histograms of the α̂ estimates from the true simulations and from the bootstrap, and boxplots comparing the two (True vs. Bootstrap).]
Bootstrap General Framework
– Suppose a black-box calculates α̂ from a sample set, e.g., a
coefficient in linear or logistic regression
– We are interested in estimating the variability of α̂ without
examining many new sample sets
– Denoting the first bootstrap data set (obtained by sampling with replacement
from the original data) by Z 1 , we use Z 1 to produce a
new bootstrap estimate for α, which we call α̂1
– This procedure is repeated B (say 100 or 1000) times, in order to
produce B different bootstrap data sets, Z 1 , Z 2 , · · · , Z B , and the
corresponding α estimates, α̂1 , · · · , α̂B
– We estimate the standard error of these bootstrap estimates using
the formula
SE_B(α) = sqrt( (1/(B − 1)) Σ_{i=1}^B (α̂i − ᾱ)² ),   where ᾱ = (1/B) Σ_{i=1}^B α̂i
– This serves as an estimate of the standard error of α estimated from
the original data set!
Programming Exercise
Let’s go through some programming exercises
– Basic Example
– Linear model example with the boot function in R
– Linear model example without the boot function in R and Python
8
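A hedged sketch of the boot-function exercise on simulated data (all names below are made up; the statistic is the slope of a simple linear model):

```r
# Sketch: bootstrap SE of a regression slope with the boot package (simulated data)
library(boot)

set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 1 + 2 * df$x + rnorm(100)

slope_stat <- function(data, idx) coef(lm(y ~ x, data = data[idx, ]))[2]
b <- boot(data = df, statistic = slope_stat, R = 1000)

sd(b$t[, 1])                                        # bootstrap estimate of SE(beta1)
summary(lm(y ~ x, data = df))$coefficients[2, 2]    # formula-based SE, for comparison
```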
Model Complexity and VC
Dimension
Train and Test
– Recall that we fitted our models using training data and were
interested in evaluating the performance with respect to
independent test data
– To produce justifiable model reliability arguments, the test data
should not be used in the training
– If the model is evaluated against the training data, the results can be
very misleading
– The training error rate is often quite different from the test error
rate, and in particular the former can dramatically underestimate
the latter (recall the accuracy vs complexity chart)
9
Training vs Test Model Evaluation
– Recall this plot from the first session
– We all had a sense of what complexity is, but have not yet seen any
official way of measuring it
10
VC dimension
– In statistical learning theory and complexity theory, one measure
that is widely used to quantify the complexity of models is the
Vapnik–Chervonenkis Dimension (VCD)
– In its original form VCD is defined to measure the complexity of a
class of binary classifiers. However, variants of the VCD are also
available for multi-class classification and regression problems
– There are other ways of measuring model complexity, however, VCD
is probably the most well-known one
11
How Does VCD Relate Test and Training Errors?
– Consider a set of classifiers H. Vapnik and Chervonenkis showed
that for a binary classifier trained with N samples, the test and training
errors can be related. More specifically, with probability 1 − η (you can
pick η to be very small to make this very likely):
Test Err ≤ Training Err + sqrt( (1/N) [ VCD(H) ( log( 2N / VCD(H) ) + 1 ) − log( η / 4 ) ] )
Or roughly speaking, we get an idea of the gap between train and
test as
Test Err ≲ Training Err + sqrt( VCD(H) / N )
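As a rough numeric illustration of the bound (the values of N, VCD and η below are arbitrary, just to get a feel for the scale of the gap term):

```r
# Plug arbitrary numbers into the VC bound gap term (illustration only)
vc_gap <- function(N, d, eta) sqrt((1 / N) * (d * (log(2 * N / d) + 1) - log(eta / 4)))
vc_gap(N = 1000,  d = 3, eta = 0.05)   # gap term with 1000 samples, VCD = 3
vc_gap(N = 10000, d = 3, eta = 0.05)   # more data -> smaller gap
```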
How Does VCD Relate Test and Training Errors?
– When the test and training errors are close, we say the model
generalizes well
– When there is a large gap between the test and training errors, we
say the model generalizes poorly
– For example, we saw in the first lectures that polynomial regression
has a poor generalization
– We will see that deep neural networks generalize very well
– In general, calculating the VCD for a set of classifiers is not easy,
however, often times obtaining approximations or bounds on the
VCD is possible
– In order to define the VCD, we first need to introduce a few notions
13
What is VCD?
– Usually, the ultimate goal is obtaining the VCD for a set of
classifiers (e.g., linear classifiers in 2D), denoted by H
– We refer to H as Hypothesis Class. For example, we say, H is the
hypothesis class of linear classifiers in 2D
H = {h : h(x, y ) = ax + by + c}
– Notice that in this case 2D points can be classified based on the sign
of h; for example class 0 corresponds to h < 0 and class 1 corresponds to h ≥ 0.
– The VCD of H is defined as the cardinality of the largest set of points that H shatters
– What does shattering mean?
What is Shattering?
– Consider a given hypothesis class H for a binary classification problem (of say + and −, or white and black)
– Also consider a given set of N points (fixed location), which can obviously be labelled in 2^N ways with the + and − labels
– We say H shatters the given N points, if H realizes every one of the 2^N labelings. In other words, for any binary labelling of the N points, there exists a function in H (a classifier formulated by H) that can separate the two classes of points
– Best explained via a few examples
Example 1
– Consider the two points below and a linear hypothesis class, i.e., H = {h : h(x, y) = ax + by + c}
Can H shatter them?
Example 1
– Consider the two points below and a linear hypothesis class, i.e., H = {h : h(x, y) = ax + by + c}
Can H shatter them?
Example 2
– Consider the three points below and a linear hypothesis class, i.e., H = {h : h(x, y) = ax + by + c}
Can H shatter them?
Example 3
– Consider the three points below and a linear hypothesis class, i.e., H = {h : h(x, y) = ax + by + c}
Can H shatter them?
Example 4
– Consider the four points below and a linear hypothesis class, i.e., H = {h : h(x, y) = ax + by + c}
Can H shatter them?
Example 5
– Consider the two points below and an origin centered disk classifier, i.e., H = {h : h(x, y) = x² + y² − r²}
Can H shatter them?
How is the VCD Calculated?
– As we pointed out earlier, the VCD is defined as the cardinality of the largest set of points that H shatters
– Therefore, to show that the VCD of a classifier is M, we need to provide two steps:
– Step 1. Come up with a set of M points that our classifier shatters (usually not hard)
– Step 2. Show that NO M + 1 POINTS can be shattered by our classifier (can be hard to show!)
– From Example 2 we conclude that the VCD of a linear classifier in 2D is at least 3, but Example 4 is not sufficient to show that it is exactly 3 (we need to show no 4 points can be shattered, not only the specific ones in Example 4)
– Also Example 3 does not imply that the VCD cannot be 3 (all we need for step 1 is one set of points that are shattered)
– Using Radon's theorem one can show the VCD is in fact 3, but that requires more analysis that we skip
– In fact linear classifiers in d-dimensional space have a VCD of d + 1
Example 6
– Can you show that the origin centered disk classifier H = {h : h(x, y) = x² + y² − r²} has a VCD of 1?
– Step 1. If we have one point on the plane, it can be shattered
– Step 2. Any two distinct points that we consider on the plane cannot be shattered, because we can take the farthest point from the origin and label it −, and label the other point with +. Then a disk that contains the negative labeled one should also contain the positive one and a shattering is not possible.
– Step 1 and Step 2 imply that the VCD for this classifier is 1.
Example 7
– Prove that the VCD of axis aligned rectangles is 4 in 2D.
– Step 1. Presenting a set of 4 points that can be shattered by axis aligned rectangles
– We can take advantage of realizations that are rotated versions of each other, and only verify 6 labelings
Example 7 (Continued)
– Step 2. Proving that no set of 5 points can be shattered by axis aligned rectangles:
– Proof. Think of any 5 distinct points on the plane. We can pick the topmost, bottommost, leftmost and rightmost points and label them −, and label the fifth point by +. Any rectangle that contains the 4 negative points would also contain the positive labelled point, therefore a shattering of 5 points is never possible.
– Now based on step 1 and step 2, we can claim that the VCD of axis aligned rectangles is 4.
– Performing only step 1 cannot prove that the VCD is 4. It can only prove that the VCD is at least 4.
Some Final Notes
– In general VCD is a way of quantifying model complexities
– Calculation of the exact VCD may be hard for a given classifier (step 2 is usually the most challenging part)
– Shattering a set of M points by a classifier only guarantees a lower bound of M on the VCD, but does not prove that the VCD is M
– While VCD is a very powerful tool to have an idea about the generalization, the equation relating the test and training errors is only an inequality. For some models (such as Deep Neural Nets) we see that while the VCD is large (on the order of the number of parameters), they generalize very well! Maybe this is the magic of DNNs!
Questions?
References
https://www.alsharif.info/iom530, 2013.
J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning. Springer series in statistics, 2nd edition, 2009.
G. James, D. Witten, T. Hastie, and R. Tibshirani. https://lagunita.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/cv boot.pdf, 2013.
G. James, D. Witten, T. Hastie, and R. Tibshirani. An introduction to statistical learning: with applications in R, volume 112. Springer, 2013.
