
Q1. In the homework folder you can find the data file BinaryData.csv, which consists of two columns
x and y. The variable y is of categorical nature, taking only 0-1 values. The goal of this question is to write
your own logistic regression module and compare it with the R (or Python) logistic regression package.
(a) Load the data file BinaryData.csv and perform a simple logistic regression in the programming
language of your choice, predicting the class y based on x. Report the values of β0 and β1.
(b) Let's see if we can regenerate the same results using our own logistic regression module. For this
purpose consider a model in the form of

y = sigmoid(β0 + β1 x).
In the class we showed that the negative log-likelihood loss corresponding to this model is in the
form of

L(β0, β1) = ∑_{i=1}^n [ yi log(1 + e^{−β0 − β1 xi}) + (1 − yi) log(1 + e^{β0 + β1 xi}) ].
In Question 3 of HW2, you showed that each summand is a convex function. Using a similar line of
derivation as you did for that question, derive expressions for

∂L/∂β0 = …,   ∂L/∂β1 = ….

Hint. I write the final formula for ∂L/∂β0, but still leave the derivation to you:

∂L/∂β0 = ∑_{i=1}^n sigmoid(β0 + β1 xi) − yi.
(c) Write a gradient descent (GD) scheme to minimize L(β0, β1) in R or Python. For your scheme use a
learning rate of 0.01, and run the GD for 500 iterations. As the initial values for β0 and β1 you
can use zero, although since our problem is convex and has only one minimum, starting from other
initializations yields identical results. If you implement your code correctly, you should get
identical results for β0 and β1 as in part (a). Attach your code and final results.
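A minimal sketch of such a GD scheme in Python (NumPy only). The x and y arrays below are hypothetical placeholders, since BinaryData.csv is not reproduced here; the β1 gradient follows the same pattern as the β0 formula in the hint, with each summand multiplied by xi.

```python
import numpy as np

# Hypothetical stand-in for BinaryData.csv (not provided here); replace with
# e.g. data = np.loadtxt("BinaryData.csv", delimiter=",", skiprows=1).
x = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = 0.0, 0.0          # zero initialization
lr = 0.01                  # learning rate
for _ in range(500):       # 500 GD iterations
    p = sigmoid(b0 + b1 * x)
    g0 = np.sum(p - y)          # dL/d(beta0) from the hint
    g1 = np.sum((p - y) * x)    # dL/d(beta1): same summand times x_i
    b0 -= lr * g0
    b1 -= lr * g1

print(b0, b1)
```

With the real data, the printed β0 and β1 should agree with the packaged logistic regression fit from part (a) up to small numerical differences.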
Q2. In this question the goal is applying your own QDA framework to the data BinaryData.csv.
(a) Read the data, and calculate π0, μ0 and σ0^2 for class 0, and π1, μ1 and σ1^2 for class 1. To
calculate the variance use the equation 1/(n − 1) ∑_i (xi − μ)^2. If you do the calculations correctly
you should get integer values for μ0, σ0^2, μ1, and σ1^2.
(b) Based on the values you found in part (a), calculate the decision point which separates class 0 from
class 1. Basically, if xd is the decision point, your class prediction for a given input x depends on
which side of xd it falls.
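A sketch of the part (a) calculations; the arrays below are made-up stand-ins for BinaryData.csv, and the variance uses the 1/(n − 1) convention stated above.

```python
import numpy as np

# Hypothetical stand-in for the (x, y) columns of BinaryData.csv.
x = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
y = np.array([0, 0, 0, 1, 1, 1])

def class_stats(x, y, label):
    xc = x[y == label]
    pi = xc.size / x.size     # prior: fraction of samples in this class
    mu = xc.mean()            # class mean
    var = xc.var(ddof=1)      # 1/(n-1) * sum (x_i - mu)^2
    return pi, mu, var

pi0, mu0, var0 = class_stats(x, y, 0)
pi1, mu1, var1 = class_stats(x, y, 1)
print(pi0, mu0, var0, pi1, mu1, var1)
```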
Figure 1: examples of classifiers that H can create
(a) Show that the VC dimension of H is at least 5.
Hint. To show this you would need to come up with one set of 5 points on the 2D plane which can
be shattered by H. To do this you would need to realize 2^5 = 32 dichotomies, but here I suggest an
easier way. Consider the points sitting on the vertices of a pentagon; then many realizations become
rotated versions of each other. If you manage to show that in all 8 cases depicted in Figure 2, a 2D
ellipsoid can separate the two classes, you would be done! To get you started I have done this for
the first two cases. Just do it for the other 6.
Figure 2: Dichotomies needed for 5 points
(b) It turns out that the VC dimension of H is precisely 5, which basically means that no set of 6 points
can be shattered by H. Showing this is beyond our technical interest¹. However, my question would
be easy. For the set of six points below, suggest an unrealizable dichotomy for H. In simple words,
color the dots in a way that no ellipsoid can separate the black dots from the white dots.

¹ I share the paper in case you are interested in pursuing research in complexity theory:
https://arxiv.org/pdf/1109.4347.pdf
Class 4: Classification
MSA 8150: Machine Learning for Analytics
Alireza Aghasi
Institute for Insight, Georgia State University
Classification
– In many applications, the response is not a quantitative value and
instead represents a class, e.g., y ∈ {student, non-student},
y ∈ {white, yellow, green}
– Yet based on the observation of some features, we would like to
predict the class (what we refer to as classification)
– Regression vs classification
[Figure: left, Income as a surface over Years of Education and Seniority (regression); right, a scatter of two classes of points in the (X1, X2) plane (classification).]
Classification
[Figure: scatter of Income vs Balance colored by Default status, alongside boxplots of Balance and Income for the No/Yes Default groups.]
Example. Predicting default cases on the credit card (unable to pay the
credit card), based on the income and current balance
(one immediate observation is probably balance is a more useful feature)
Binary Classification
– In simple regression for a single feature x we fitted a line
y = β0 + β1 x to the data
– In binary classification with only one feature, we don't have values
any more, but two classes (say class 0 and class 1)
– Can we do the fit in a way that the sign of β0 + β1 x becomes an
indicator of the class for us?
– In other words, for a given feature xt, we make a decision based on
the following:

yt = 1 if β0 + β1 xt > 0,   yt = 0 if β0 + β1 xt < 0

– A smooth function (called Sigmoid, also inverse Logit) that takes almost
binary values 0, 1 based on the sign of the input z is

e^z / (1 + e^z) ≈ 1 for z ≫ 0,   ≈ 0 for z ≪ 0

– With K > 2 labels (e.g., y ∈ {white, yellow, green}) and p
features x1, x2, · · · , xp, we fit K models parametrized by

Label 1: {β0^(1), β1^(1), · · · , βp^(1)}
Label 2: {β0^(2), β1^(2), · · · , βp^(2)}
..
Label K: {β0^(K), β1^(K), · · · , βp^(K)}

– For this problem we consider the following form:

pk(x) = P(Y = k|x) = e^{β0^(k) + β1^(k) x1 + · · · + βp^(k) xp} / ( e^{β0^(1) + · · · + βp^(1) xp} + · · · + e^{β0^(K) + · · · + βp^(K) xp} )

– What is the sum of all P(Y = k|x) for a fixed x?
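The multinomial probabilities above sum to one for any x, since the denominator is exactly the sum of the K numerators. A small sketch with made-up coefficients:

```python
import numpy as np

# One (beta0, beta1) pair per label; the values are illustrative only.
B = np.array([[0.1, -0.5],   # label 1
              [0.0,  0.2],   # label 2
              [-0.3, 0.4]])  # label 3

def label_probs(x):
    scores = B[:, 0] + B[:, 1] * x   # linear score for each label
    e = np.exp(scores)
    return e / e.sum()               # p_k(x): shared denominator

p = label_probs(1.5)
print(p, p.sum())
```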
Some R Simulations
– Let's perform some basic classification tasks in R!
Discriminant Analysis
LDA/QDA Story in Simple Words
[Figure: blue and red point clouds on the real line with a Normal pdf fitted to each.]
Let's do an exercise. We are given 30 blue points and 40 red points as
above. Let's fit a Normal pdf to each cloud
LDA/QDA Story in Simple Words
[Figure: the two fitted Normal pdfs, with similar variances, crossing at a single point.]
Let's assume both variances turn out similar. Find the intersection point.
LDA/QDA Story in Simple Words
[Figure: the real line split at the intersection point into a Blue region and a Red region.]
We have found a rule to classify each point on the real line
LDA/QDA Story in Simple Words
[Figure: a different set of blue and red points on the real line.]
Let's do a different exercise with another set of points
LDA/QDA Story in Simple Words
[Figure: the fitted Normal pdfs now have different variances, and cross at a different location.]
The intersection point changes since the variances are no more the same.
LDA/QDA Story in Simple Words
[Figure: the Blue/Red decision regions induced by the two fitted pdfs.]
Again we have a rule. If you want to assign a probability of being blue
and being red to each point, how would you calculate that probability?

P(Y = B|x) = pB(x) = fB(x) / (fB(x) + fR(x)),   P(Y = R|x) = pR(x) = fR(x) / (fB(x) + fR(x))
LDA/QDA Story in Simple Words
• Say you want to incorporate the probability of each class in your
calculation of P(Y|x). We can use the Bayes formula
– Probabilistically, suppose that our y can take K distinct values. By
Bayes' theorem we have

P(Y = ℓ|X = x) = P(Y = ℓ) P(X = x|Y = ℓ) / ∑_{k=1}^K P(Y = k) P(X = x|Y = k) = πℓ fℓ(x) / ∑_{k=1}^K πk fk(x)

– Let's see why this equality holds, knowing the Bayes' equality
P(A|B) = P(A ∩ B) / P(B)
and A1 ∪ A2 ∪ · · · ∪ AK covering the entire space, where Ai ∩ Aj = ∅.
LDA/QDA Story in Simple Words
[Figure: the two pdfs weighted by their class priors; the crossing moves again.]
The intersection point changes yet again
LDA/QDA Story in Simple Words
[Figure: the Red/Blue decision regions after weighting each pdf by its class prior.]

P(Y = B|x) = pB(x) = (3/7) fB(x) / ( (3/7) fB(x) + (4/7) fR(x) ),
P(Y = R|x) = pR(x) = (4/7) fR(x) / ( (3/7) fB(x) + (4/7) fR(x) )
LDA/QDA Story in Simple Words
Question. What if we have more than two classes?
[Figure: three point clouds on the real line: label 1 (30 pts), label 2 (60 pts), label 3 (90 pts).]
Story in Simple Words
[Figure: three fitted Normal pdfs, one per label, whose crossings define the class regions.]
Story in Simple Words
Goal. Higher dimensions!

μ1 = (0, 0)^T,  Σ1 = [1 0; 0 1],  μ2 = (−1, −1)^T,  Σ2 = [1 0.9; 0.9 1]

Also, in higher dimensions the intersection between two Normal pdfs is
no more just a point; it would be a boundary
A Little Introduction to the Multivariate Normal
– Recall the normal distribution for a random variable x:

f(x) = 1/√(2πσ^2) exp( −(x − μ)^2 / (2σ^2) )

– Similar to the scalar case, we can define a distribution for the
random vector x = [x1, · · · , xp]^T as

f(x) = 1/( (2π)^{p/2} |Σ|^{1/2} ) exp( −(1/2) (x − μ)^T Σ^{-1} (x − μ) )

[Figure: surface and contour views of a bivariate Normal density over (x1, x2).]
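A direct transcription of the multivariate density formula (NumPy only); evaluating it at x = μ gives the peak value (2π)^{−p/2} |Σ|^{−1/2}.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    # f(x) = (2*pi)^(-p/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu))
    p = mu.size
    d = x - mu
    quad = d @ np.linalg.solve(Sigma, d)          # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 0.0])
Sigma = np.eye(2)
print(mvn_pdf(mu, mu, Sigma))   # peak of the standard 2D Normal: 1/(2*pi)
```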
Linear Discriminant Analysis (LDA)
– Recall

P(Y = ℓ|X = x) = P(Y = ℓ) P(X = x|Y = ℓ) / ∑_{k=1}^K P(Y = k) P(X = x|Y = k) = πℓ fℓ(x) / ∑_{k=1}^K πk fk(x)

– The purpose of LDA is learning a model for P(Y = ℓ|X = x)
– In the formulation above, fℓ(x) is in a sense the distribution we
consider for the data points in class ℓ, and πℓ is the probability that
we pick some random sample and it belongs to class ℓ
– In LDA, we assume that all fℓ(x) have a multivariate normal
distribution with the same covariance and different means, i.e.

fℓ(x) = 1/( (2π)^{p/2} |Σ|^{1/2} ) exp( −(1/2) (x − μℓ)^T Σ^{-1} (x − μℓ) )

– Unlike logistic regression, which involved a rather complicated
maximization for learning, in LDA we have closed-form expressions
for πℓ, μℓ and Σ, and classifying new test points becomes very easy
Linear Discriminant Analysis (LDA)
– Given a training set (x1, y1), · · · , (xn, yn) where the responses y can
take K distinct class values 1, 2, · · · , K, we can easily learn the LDA
model by calculating πℓ, μℓ and Σ via (considering cℓ to be the
index set of samples in class ℓ)

π̂ℓ = (# of elements in cℓ) / n

μ̂ℓ = 1/(# of elements in cℓ) ∑_{i∈cℓ} xi

Σ̂ = 1/(n − K) ∑_{k=1}^K ∑_{i∈ck} (xi − μ̂k)(xi − μ̂k)^T

– After this point, for a new test point xt we have all that is needed to
calculate P(Y = ℓ|X = xt) for ℓ = 1, · · · , K, and pick as the label
the one that is largest
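The three estimators can be sketched as follows; the toy X and y below are illustrative, not data from the course.

```python
import numpy as np

# Toy 2-class, 2-feature training set (illustrative values only).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([1, 1, 1, 2, 2, 2])

classes = np.unique(y)
n, p = X.shape
K = classes.size

pis = {k: np.mean(y == k) for k in classes}          # priors pi_hat
mus = {k: X[y == k].mean(axis=0) for k in classes}   # class means mu_hat

# Pooled covariance: 1/(n-K) * sum over classes of within-class scatter
Sigma = np.zeros((p, p))
for k in classes:
    D = X[y == k] - mus[k]
    Sigma += D.T @ D
Sigma /= (n - K)
print(pis, mus, Sigma)
```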
Linear Discriminant Analysis (LDA)
[Figure: two panels of Gaussian class clouds in the (X1, X2) plane with linear decision boundaries.]
Linear Discriminant Analysis (LDA)
– In practice, to assign a label to a given test point xt we do not need
to calculate

P(Y = ℓ|X = xt) = πℓ fℓ(xt) / ∑_{k=1}^K πk fk(xt)

and only comparing πℓ fℓ(xt) is enough
– This reduces to evaluating

δℓ = xt^T Σ^{-1} μℓ − (1/2) μℓ^T Σ^{-1} μℓ + log πℓ

and picking the class ℓ corresponding to the largest δℓ
– You can find the decision boundary between class i and j by finding
the points for which δi = δj
– [see the sample Matlab code]
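A sketch of classification by the largest δℓ. The parameters here are illustrative, not estimated from data; with a shared identity covariance and equal priors the rule reduces to picking the nearest class mean.

```python
import numpy as np

# Illustrative LDA parameters: two classes, shared covariance, equal priors.
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
pis = [0.5, 0.5]
Sigma = np.eye(2)

def lda_predict(xt):
    # delta_l = x^T Sigma^{-1} mu_l - 0.5 mu_l^T Sigma^{-1} mu_l + log(pi_l)
    deltas = [xt @ np.linalg.solve(Sigma, mu)
              - 0.5 * mu @ np.linalg.solve(Sigma, mu)
              + np.log(pi)
              for mu, pi in zip(mus, pis)]
    return int(np.argmax(deltas))   # index of the largest delta_l

print(lda_predict(np.array([0.5, 0.5])), lda_predict(np.array([2.5, 2.5])))
```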
Quadratic Discriminant Analysis (QDA)
– Recall

P(Y = ℓ|X = xt) = πℓ fℓ(xt) / ∑_{k=1}^K πk fk(xt)

– The purpose of QDA is learning a model for P(Y = ℓ|X = x) in a
more flexible way compared to LDA
– In QDA, we assume that all fℓ(x) have a multivariate normal
distribution with different covariances and different means, i.e.

fℓ(x) = 1/( (2π)^{p/2} |Σℓ|^{1/2} ) exp( −(1/2) (x − μℓ)^T Σℓ^{-1} (x − μℓ) )

– The main difference between LDA and QDA is that in LDA we consider a
single Σ for all classes, but in QDA we allow more flexibility by
having a different covariance matrix for each class
– Similar to LDA, QDA can be learned easily and we can obtain closed-form
expressions for πℓ, μℓ and Σℓ
Quadratic Discriminant Analysis (QDA)
– Given a training set (x1, y1), · · · , (xn, yn) where the responses y can
take K distinct class values 1, 2, · · · , K, we can easily learn the QDA
model by calculating πℓ, μℓ and Σℓ via (considering cℓ to be the
index set of samples in class ℓ)

π̂ℓ = (# of elements in cℓ) / n

μ̂ℓ = 1/(# of elements in cℓ) ∑_{i∈cℓ} xi

Σ̂ℓ = 1/(# of elements in cℓ − 1) ∑_{i∈cℓ} (xi − μ̂ℓ)(xi − μ̂ℓ)^T

– After this point, for a new test point xt we have all that is needed to
calculate P(Y = ℓ|X = xt) for ℓ = 1, · · · , K, and pick as the label
the one that is largest
Quadratic Discriminant Analysis (QDA)
[Figure: two panels in the (X1, X2) plane comparing quadratic decision boundaries with linear ones.]
Quadratic Discriminant Analysis (QDA)
– In practice, to assign a label to a given test point xt we do not need
to calculate

P(Y = ℓ|X = xt) = πℓ fℓ(xt) / ∑_{k=1}^K πk fk(xt)

and only comparing πℓ fℓ(xt) is enough
– This reduces to evaluating

δℓ = −(1/2) log |Σℓ| − (1/2) (xt − μℓ)^T Σℓ^{-1} (xt − μℓ) + log πℓ

and picking the class ℓ corresponding to the largest δℓ
– [see the sample Matlab code]
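The QDA rule differs from the LDA one only through the per-class |Σℓ| and Σℓ^{-1} terms; a sketch with illustrative (not estimated) parameters:

```python
import numpy as np

# (mu, Sigma, pi) per class; class 1 has a larger spread than class 0.
params = [
    (np.array([0.0, 0.0]), np.eye(2), 0.5),
    (np.array([3.0, 0.0]), np.diag([4.0, 4.0]), 0.5),
]

def qda_delta(xt, mu, Sigma, pi):
    # delta_l = -0.5 log|Sigma_l| - 0.5 (x-mu_l)^T Sigma_l^{-1} (x-mu_l) + log(pi_l)
    d = xt - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.solve(Sigma, d)
            + np.log(pi))

def qda_predict(xt):
    return int(np.argmax([qda_delta(xt, *prm) for prm in params]))

print(qda_predict(np.array([0.2, 0.0])), qda_predict(np.array([3.0, 0.0])))
```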
Some R Simulations
– Let's perform some basic classification tasks in R!
Summary
– Logistic regression is very popular for classification, especially
binary classification
– LDA is especially useful when K > 2, the number of training
samples is small, or the classes are well separated, and Gaussian
assumptions are reasonable
– QDA presents more flexibility in shaping the partitions compared to
LDA
– Logistic regression can also fit quadratic boundaries like QDA, by
explicitly including quadratic terms in the model
Questions?
Class 5: Resampling/Boosting and Model
Complexity
MSA 8150: Machine Learning for Analytics
Alireza Aghasi
Institute for Insight, Georgia State University
Bootstrap
Bootstrap
– The bootstrap is a flexible and very powerful statistical tool that can
be used to quantify the uncertainty associated with a given
estimator or statistical learning method
– It can provide an estimate of the standard error of a coefficient, or a
confidence interval for that coefficient, regardless of how
complex the derivation of that coefficient is
A Very Simple Example
– Think of a block which takes 100 randomly generated samples
drawn from a distribution and generates their sample mean x̄ as the
output
– We want to see, if we do this many, many times, how the outputs of
our block vary
– In principle, to do this we would need to run the experiment, say
1000 times, and then look at the variations of x̄; however, this
requires constant access to the distribution and a lot of sampling
(which in many applications can be expensive to acquire)
– Bootstrap can help us do this without accessing the reference
distribution
– Let's see the code
Bootstrap via an Example
– Let's explain bootstrap via an example
– Suppose that we wish to invest a fixed sum of money in two
financial assets that yield returns of X and Y (random quantities)
– We will invest α shares in X, and will invest the remaining 1 − α in Y
– To minimize the risk, we want to minimize var(αX + (1 − α)Y)
– We can show (in the class we do it) that the minimizer is

α = ( var(Y) − cov(X, Y) ) / ( var(X) + var(Y) − 2 cov(X, Y) )
Bootstrap via an Example
– In the real world we do not know var(X), var(Y), cov(X, Y)
– Suppose we are given a data set containing pairs of X and Y. We
can estimate var(X), var(Y), cov(X, Y) from the sample set and
get an estimate α̂ for the optimal share
– Ideally we could generate these sample sets many times, estimate
an α̂ for each, and look into the histogram
– However, in the real world we only have one sample set to use
– Bootstrap yet allows us to generate good estimates of α using only
one sample set!
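A sketch of this idea in Python: one synthetic sample set plays the role of the observed data, and resampling it with replacement produces the bootstrap replicates of α̂. The distribution parameters mirror the synthetic example on the following slides (var(X) = 1, var(Y) = 1.25, cov(X, Y) = 0.5, true α = 0.6).

```python
import numpy as np

rng = np.random.default_rng(0)

def alpha_hat(X, Y):
    # Plug-in estimate of the risk-minimizing share alpha
    vX, vY = X.var(ddof=1), Y.var(ddof=1)
    cXY = np.cov(X, Y, ddof=1)[0, 1]
    return (vY - cXY) / (vX + vY - 2 * cXY)

# One observed sample set of 100 pairs (synthetic here).
n = 100
cov = np.array([[1.0, 0.5], [0.5, 1.25]])
X, Y = rng.multivariate_normal([0, 0], cov, size=n).T

# Bootstrap: resample the SAME n pairs with replacement, B times.
B = 1000
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot[b] = alpha_hat(X[idx], Y[idx])

print(boot.mean(), boot.std(ddof=1))   # bootstrap mean and standard error
```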
Bootstrap via an Example
To see how nicely Bootstrap works, let's compare its outcome with the
case that α is generated from many synthetic sample generations
– We generate 1000 sample sets, each containing 100 pairs of X, Y
– For the synthetic data generated, var(X) = 1, var(Y) = 1.25 and
cov(X, Y) = 0.5, which yield an optimal value of α = 0.6
[Figure: four example scatter plots of the simulated (X, Y) sample sets.]
Bootstrap via an Example
– To get the left panel we generate 1000 synthetic sample sets, for
each obtain α̂, plot the histogram, and calculate

ᾱ = (1/1000) ∑_{i=1}^{1000} α̂i,   SE(α) = √( (1/999) ∑_{i=1}^{1000} (α̂i − ᾱ)^2 )

– For the bootstrap we only use one of the sample sets and regenerate
new sample sets by sampling with replacement
– Surprisingly, the results are very close
[Figure: histograms of α̂ over the 1000 true sample sets (left) and over the bootstrap replicates (right), with a boxplot comparing "True" and "Bootstrap".]
Bootstrap General Framework
– Suppose a black-box calculates α̂ from a sample set, e.g., a
coefficient in linear or logistic regression
– We are interested in estimating the variability of α̂ without
examining many new sample sets
– Denoting the first bootstrap data set by Z1, we use Z1 to produce a
new bootstrap estimate for α, which we call α̂1
– This procedure is repeated B (say 100 or 1000) times, in order to
produce B different bootstrap data sets, Z1, Z2, · · · , ZB, and the
corresponding α estimates, α̂1, · · · , α̂B
– We estimate the standard error of these bootstrap estimates using
the formula

SE_B(α) = √( (1/(B − 1)) ∑_{i=1}^B (α̂i − ᾱ)^2 ),  where ᾱ = (1/B) ∑_{i=1}^B α̂i

– This serves as an estimate of the standard error of α estimated from
the original data set!
Programming Exercise
Let's go through some programming exercises
– Basic Example
– Linear model example with the boot function in R
– Linear model example without the boot function in R and Python
Model Complexity and VC Dimension
Train and Test
– Recall that we fitted our models using training data and were
interested in evaluating the performance with respect to
independent test data
– To produce justifiable model reliability arguments, the test data
should not be used in the training
– If the model is evaluated against the training data, the results can be
very misleading
– The training error rate is often quite different from the test error
rate, and in particular the former can dramatically underestimate
the latter (recall the accuracy vs complexity chart)
Training vs Test Model Evaluation
– Recall this plot from the first session
– We all had a sense of what complexity is, but have not yet seen any
official way of measuring it
VC dimension
– In statistical learning theory and complexity theory, one measure
that is widely used to quantify the complexity of models is the
Vapnik–Chervonenkis Dimension (VCD)
– In its original form VCD is defined to measure the complexity of a
class of binary classifiers. However, variants of the VCD are also
available for multi-class classification and regression problems
– There are other ways of measuring model complexity; however, VCD
is probably the most well-known one
How Does VCD Relate Test and Training Errors?
– Consider a set of classifiers H. Vapnik and Chervonenkis showed
that for a binary classifier trained with N samples, the test and training
errors can be related. More specifically, with probability 1 − η (you can
pick η to be very small to make this very likely):

Test Err ≤ Training Err + √( (1/N) [ VCD(H) ( log( 2N / VCD(H) ) + 1 ) − log( η/4 ) ] )

Or roughly speaking, we get an idea of the gap between train and
test as

Test Err ≲ Training Err + √( VCD(H)/N )
How Does VCD Relate Test and Training Errors?
– When the test and training errors are close, we say the model
generalizes well
– When there is a large gap between the test and training errors, we
say the model generalizes poorly
– For example, we saw in the first lectures that polynomial regression
has poor generalization
– We will see that deep neural networks generalize very well
– In general, calculating the VCD for a set of classifiers is not easy;
however, oftentimes obtaining approximations or bounds on the
VCD is possible
– In order to define the VCD, we first need to introduce a few notions
What is VCD?
– Usually, the ultimate goal is obtaining the VCD for a set of
classifiers (e.g., linear classifiers in 2D), denoted by H
– We refer to H as the Hypothesis Class. For example, we say H is the
hypothesis class of linear classifiers in 2D

H = {h : h(x, y) = ax + by + c}
– Notice that in this case 2D points can be classified based on the sign
of h; for example, class 0 corresponds to h < 0 and class 1
corresponds to h ≥ 0
– The VCD of H is defined as the cardinality of the largest set of
points that H shatters
– What does shattering mean?

What is Shattering?
– Consider a given hypothesis class H for a binary classification
problem (of say + and −, or white and black)
– Also consider a given set of N points (fixed location), which can
obviously be labelled in 2^N ways with the + and − labels
– We say H shatters the given N points if H realizes every one of the
2^N labelings. In other words, for any binary labelling of the N
points, there exists a function in H (a classifier formulated by H)
that can separate the two classes of points
– Best explained via a few examples

Example 1
– Consider the two points below and a linear hypothesis class, i.e.,
H = {h : h(x, y) = ax + by + c}. Can H shatter them?

Example 2
– Consider the three points below and a linear hypothesis class, i.e.,
H = {h : h(x, y) = ax + by + c}. Can H shatter them?

Example 3
– Consider the three points below and a linear hypothesis class, i.e.,
H = {h : h(x, y) = ax + by + c}. Can H shatter them?

Example 4
– Consider the four points below and a linear hypothesis class, i.e.,
H = {h : h(x, y) = ax + by + c}. Can H shatter them?

Example 5
– Consider the two points below and an origin-centered disk classifier,
i.e., H = {h : h(x, y) = x^2 + y^2 − r^2}. Can H shatter them?

How is the VCD Calculated?
– As we pointed out earlier, the VCD is defined as the cardinality of
the largest set of points that H shatters
– Therefore, to show that the VCD of a classifier is M, we need to
provide two steps:
– Step 1. Come up with a set of M points that our classifier shatters
(usually not hard)
– Step 2. Show that NO M + 1 POINTS can be shattered by our
classifier (can be hard to show!)
– From Example 2 we conclude that the VCD of a linear classifier in
2D is at least 3, but Example 4 is not sufficient to show that it is
exactly 3 (we need to show no 4 points can be shattered, not only
the specific ones in Example 4)
– Also, Example 3 does not imply that the VCD cannot be 3 (all we
need for step 1 is one set of points that are shattered)
– Using Radon's theorem one can show the VCD is in fact 3, but that
requires more analysis that we skip
– In fact, linear classifiers in d-dimensional space have a VCD of d + 1

Example 6
– Can you show that the origin-centered disk classifier
H = {h : h(x, y) = x^2 + y^2 − r^2} has a VCD of 1?
– Step 1. If we have one point on the plane, it can be shattered
– Step 2. Any two distinct points that we consider on the plane cannot
be shattered, because we can take the farthest point from the origin
and label it −, and label the other point with +. Then a disk that
contains the negative labeled one should also contain the positive
one, and a shattering is not possible.
– Step 1 and Step 2 imply that the VCD for this classifier is 1.

Example 7
– Prove that the VCD of axis-aligned rectangles is 4 in 2D.
– Step 1. Presenting a set of 4 points that can be shattered by
axis-aligned rectangles
– We can take advantage of realizations that are rotated versions of
each other, and only verify 6 labelings

Example 7 (Continued)
– Step 2. Proving that no set of 5 points can be shattered by
axis-aligned rectangles:
– Proof. Think of any 5 distinct points on the plane. We can pick the
topmost, bottommost, leftmost and rightmost points and label them
−, and label the fifth point by +. Any rectangle that contains the 4
negative points would also contain the positive labelled point;
therefore a shattering of 5 points is never possible.
– Now based on step 1 and step 2, we can claim that the VCD of
axis-aligned rectangles is 4.
– Performing only step 1 cannot prove that the VCD is 4. It can only
prove that the VCD is at least 4.

Some Final Notes
– In general, VCD is a way of quantifying model complexities
– Calculation of the exact VCD may be hard for a given classifier
(step 2 is usually the most challenging part)
– Shattering a set of M points by a classifier only guarantees a lower
bound of M on the VCD, but does not prove that the VCD is M
– While VCD is a very powerful tool to have an idea about the
generalization, the equation relating the test and training errors is
only an inequality. For some models (such as Deep Neural Nets) we
see that while the VCD is large (on the order of the number of
parameters), they generalize very well! Maybe this is the magic of
DNNs!

Questions?
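Example 6's argument can be checked mechanically: for origin-centered disks, a point's label depends only on its squared distance to the origin, so shattering reduces to realizing every labeling with a radius threshold. A small brute-force sketch:

```python
from itertools import product

def shatters(dists):
    # dists: squared distances of the points to the origin.
    # H shatters the points iff every inside(0)/outside(1) labeling is
    # realizable by some radius threshold r^2.
    for labels in product([0, 1], repeat=len(dists)):
        candidates = [min(dists) - 1] + [d + 1e-9 for d in dists]
        ok = any(all((d <= r2) == (lab == 0) for d, lab in zip(dists, labels))
                 for r2 in candidates)
        if not ok:
            return False
    return True

print(shatters([2.0]))        # one point: shattered, so VCD >= 1
print(shatters([1.0, 4.0]))   # two points: the (near outside, far inside) labeling fails
```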
