Description

Q1. In the homework folder you can find the data file BinaryData.csv, which consists of two columns x and y. The variable y is of categorical nature, taking only 0-1 values. The goal of this question is to write your own logistic regression module and compare it with the R (or Python) logistic regression package.

(a) Load the data file BinaryData.csv and perform a simple logistic regression in the programming language of your choice, predicting the class y based on x. Report the values of β₀ and β₁.
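As a reference point, a minimal R sketch for part (a) (assuming the file sits in the working directory and its columns are literally named x and y):

    # Fit y ~ x with R's built-in logistic regression and report the coefficients
    dat <- read.csv("BinaryData.csv")
    fit <- glm(y ~ x, data = dat, family = binomial)
    coef(fit)   # estimated beta0 (intercept) and beta1 (slope on x)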

(b) Let's see if we can regenerate the same results using our own logistic regression module. For this purpose consider a model in the form of

    y = sigmoid(β₀ + β₁x).

In the class we showed that the negative log likelihood loss corresponding to this model is in the form of

    L(β₀, β₁) = Σ_{i=1}^n [ yᵢ log(1 + e^(−β₀ − β₁xᵢ)) + (1 − yᵢ) log(1 + e^(β₀ + β₁xᵢ)) ].

In Question 3 of HW2, you showed that each summand is a convex function. Using a similar line of derivation as you did for that question, derive an expression for

    ∂L/∂β₀ = …,    ∂L/∂β₁ = ….

Hint. I write the final formula for ∂L/∂β₀, but still leave the derivation to you:

    ∂L/∂β₀ = Σ_{i=1}^n ( sigmoid(β₀ + β₁xᵢ) − yᵢ ).

(c) Write a gradient descent (GD) scheme to minimize L(β₀, β₁) in R or Python. For your scheme use a learning rate of 0.01, and run the GD for 500 iterations. As the initial values for β₀ and β₁ you can use zero; since our problem is convex and has only one minimum, starting from other initializations yields identical results. If you implement your code correctly, you should get values of β₀ and β₁ identical to part (a). Attach your code and final results.
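A minimal gradient-descent sketch in R, under the same assumptions about the file and column names; the β₀ gradient comes from the hint above, and the β₁ gradient is the analogous sum weighted by xᵢ:

    # Gradient descent on L(beta0, beta1): learning rate 0.01, 500 iterations
    dat <- read.csv("BinaryData.csv")
    sigmoid <- function(z) 1 / (1 + exp(-z))
    b0 <- 0; b1 <- 0                       # initial values
    lr <- 0.01                             # learning rate
    for (it in 1:500) {
      p  <- sigmoid(b0 + b1 * dat$x)
      g0 <- sum(p - dat$y)                 # dL/d(beta0), from the hint
      g1 <- sum((p - dat$y) * dat$x)       # analogous expression for beta1
      b0 <- b0 - lr * g0
      b1 <- b1 - lr * g1
    }
    c(beta0 = b0, beta1 = b1)              # should match the glm() fit in part (a)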

Q2. In this question the goal is to apply your own QDA framework to the data BinaryData.csv.

(a) Read the data, and calculate π₀, µ₀ and σ₀² for the class 0, and π₁, µ₁ and σ₁² for the class 1. To calculate the variance use the equation 1/(n − 1) Σᵢ (xᵢ − µ)². If you do the calculations correctly you should get integer values for µ₀, σ₀², µ₁, and σ₁².

(b) Based on the values you found in part (a), calculate the decision point which separates class 0 from class 1. Basically, if x_d is the decision point, your class prediction for the input x is

    predicted label of x = 0 if x < x_d,    1 if x ≥ x_d.

Hint: In the class we discussed how to find the decision point, which is the point where the weighted Gaussian profiles cross. You will end up with a quadratic equation with two roots. You need to pick the root that lies between µ₀ and µ₁.
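One way to sanity-check your answer is a numeric sketch along the following lines (assuming columns named x and y; instead of solving the quadratic by hand it simply searches for the crossing of the weighted Gaussian profiles between the two means, which works as long as the weighted densities change order between µ₀ and µ₁):

    # Class statistics and the decision point for the 1-D QDA rule
    dat <- read.csv("BinaryData.csv")
    x0 <- dat$x[dat$y == 0]; x1 <- dat$x[dat$y == 1]
    pi0 <- length(x0) / nrow(dat); mu0 <- mean(x0); v0 <- var(x0)   # var() uses 1/(n-1)
    pi1 <- length(x1) / nrow(dat); mu1 <- mean(x1); v1 <- var(x1)
    g <- function(x) pi0 * dnorm(x, mu0, sqrt(v0)) - pi1 * dnorm(x, mu1, sqrt(v1))
    xd <- uniroot(g, interval = sort(c(mu0, mu1)))$root             # root between the means
    xd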
Q3. Titanic was a British passenger liner operated by the White Star Line that sank in the North
Atlantic Ocean on 15 April 1912. The ship sank after striking an iceberg during her maiden voyage from
Southampton to New York City. Many of you may be familiar with this tragic story from a movie of the
same title, directed by James Cameron, and starring Leonardo DiCaprio and Kate Winslet.
The goal of this question is to make a model that predicts which passengers survived the Titanic
shipwreck. In the homework package, you can access the data file Titanic.csv, which is a part of the
actual Titanic passenger data. The column Survived indicates if the passenger survived or not. If 1, the
passenger survived, and if 0 they did not. Below we present a brief description of each feature:
Pclass: Passenger's ticket class
Name: Name of the passenger
Sex: Passenger's gender
Age: Passenger's age in years
SibSp: number of siblings / spouses aboard the Titanic
Parch: number of parents / children aboard the Titanic
Fare: Passenger fare
Embarked: Port of embarkation
(a) Let's first preprocess the data. Read the data file Titanic.csv. Drop the column Name. While this column is certainly useful to trace families, at this point we will not use it in our model. You may also notice that there are many missing entries in the column Age. To fill in the missing entries of this column, first calculate the mean age over all the available entries, then replace the missing entries with that value. Now you should have a dataset with no missing entries, with 891 samples (note that the first row of the CSV file is the variable names).
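A short preprocessing sketch in R, assuming Titanic.csv is in the working directory and the column names match the description above:

    # Drop Name and impute missing ages with the mean of the available ages
    titanic <- read.csv("Titanic.csv")
    titanic$Name <- NULL
    titanic$Age[is.na(titanic$Age)] <- mean(titanic$Age, na.rm = TRUE)
    nrow(titanic)    # should be 891 after preprocessing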
(b) Split the data you created in part (a) into test and train sets. Your train set will contain the first 750 rows of the data, and your test set will contain rows 751 to 891 of the data.
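Continuing the sketch from part (a), the split is just row indexing:

    train <- titanic[1:750, ]     # first 750 rows
    test  <- titanic[751:891, ]   # remaining 141 rows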
(c) Try the following classification models to predict Survived in terms of the other features in the dataset. For your classification, treat the variables Pclass, Sex, Embarked as categorical variables and Age, SibSp, Parch, Fare as numerical features (you may want to look into your textbook and see examples where the command as.factor is used). Specifically, do the following (a sketch covering all three models follows this list):
– Use logistic regression for your classification. Report the p-values associated with the intercept and all the numerical features. Which features have large p-values? Use the test data to estimate the accuracy of your model.
– Apply LDA and QDA, and again report your model accuracies using the test data.
– Among logistic regression, LDA and QDA, which model seems the most accurate?
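A hedged sketch for part (c), using MASS for lda() and qda(); the factor conversion is done on the full data before re-splitting so that train and test share the same levels. Note that qda() can complain about rank deficiency if a dummy column is constant within a class (for example a rare Embarked level), in which case you may need to drop or merge that level:

    # Part (c): logistic regression, LDA and QDA with Pclass, Sex, Embarked as factors
    library(MASS)
    for (v in c("Pclass", "Sex", "Embarked")) titanic[[v]] <- as.factor(titanic[[v]])
    train <- titanic[1:750, ]; test <- titanic[751:891, ]

    form <- Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked

    log_fit <- glm(form, data = train, family = binomial)
    summary(log_fit)                                        # p-values for intercept and features
    log_pred <- ifelse(predict(log_fit, test, type = "response") > 0.5, 1, 0)
    mean(log_pred == test$Survived)                         # logistic regression accuracy

    lda_fit <- lda(form, data = train)
    mean(predict(lda_fit, test)$class == test$Survived)     # LDA accuracy

    qda_fit <- qda(form, data = train)
    mean(predict(qda_fit, test)$class == test$Survived)     # QDA accuracy

For part (d) the same sketch applies with Pclass left numeric (simply drop it from the as.factor loop).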
(d) Repeat the steps in part (c), this time treating Sex, Embarked as categorical variables and Pclass, Age, SibSp, Parch, Fare as numerical features.
Q4 (Bootstrapping). In statistics, for a given random variable, the ratio of variance to the mean is a measure of dispersion for that random variable, also referred to as the Fano factor. Under a normality assumption, consider n independent samples x₁, x₂, ..., xₙ. One empirical way to estimate the Fano factor would be to calculate it through
    D = σ̂² / µ̂,

where

    µ̂ = (x₁ + ... + xₙ) / n,    σ̂² = 1/(n − 1) Σ_{i=1}^n (xᵢ − µ̂)².
Consider 10 independent samples from a reference normal distribution as follows:
x1 = 8.2344, x2 = 4.4854, x3 = 5.4821, x4 = 1.0953, x5 = 2.1565
x6 = 2.5096, x7 = 4.9772, x8 = 2.4998, x9 = 4.2628, x10 = 0.6933.
(a) What is the value of D for this sample set?
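A two-line check in R, using the ten values above (var() already implements the 1/(n − 1) convention from the definition):

    x <- c(8.2344, 4.4854, 5.4821, 1.0953, 2.1565,
           2.5096, 4.9772, 2.4998, 4.2628, 0.6933)
    var(x) / mean(x)    # the empirical Fano factor D for this sample set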
(b) Next, we want to see how the value of D would vary if we could repeat this experiment many times, each time drawing samples from the reference distribution and calculating D. Since we do not know the reference distribution, we would need to use bootstrapping to see the variation in D. Run a bootstrapping experiment with B = 10000 in a programming language of your choice, and estimate the standard deviation of D.
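A bootstrap sketch with B = 10000, reusing the vector x from part (a):

    set.seed(1)                              # optional, for reproducibility
    D_boot <- replicate(10000, {
      xb <- sample(x, replace = TRUE)        # resample the 10 values with replacement
      var(xb) / mean(xb)
    })
    sd(D_boot)                               # bootstrap estimate of the sd of D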
(c) I am going to reveal that the reference distribution that we drew the 10 samples above from is a normal distribution with mean 3 and variance 3.24. Run an experiment where, for 10000 times, you draw 10 samples from a normal distribution with mean 3 and variance 3.24, and then calculate D. Let's call these values D₁, ..., D₁₀₀₀₀. Calculate the standard deviation for these 10000 quantities (that would simply be the command sd in R). Compare this value with the value you obtained in part (b).
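And the corresponding sketch for part (c), drawing fresh samples from N(3, 3.24) each time (note rnorm takes the standard deviation, sqrt(3.24) = 1.8):

    D_true <- replicate(10000, {
      xs <- rnorm(10, mean = 3, sd = 1.8)    # variance 3.24
      var(xs) / mean(xs)
    })
    sd(D_true)                               # compare with the bootstrap value from part (b)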
Some Notes. You should see that in part (b), despite no access to the actual reference distribution, we managed to calculate the variations of D in a way that is close to part (c), where we were fully aware of the reference distribution. BTW, don't set the bar too high; for this problem you may see around 25% difference between the values in part (b) and part (c).
Q5 (VC Dimension). The goal of this question is to go through one example related to the VC dimension. This question does not require any math; using a bit of your intuition, you should be able to solve it. I walk with you through it step by step.
Consider H to be the set of ellipsoid classifiers on the 2D plane. Technically these classifiers can be modeled as

    H = {h : h(x, y) = √((x − a)² + (y − b)²) + √((x − a′)² + (y − b′)²) − c}

where a, a′, b, b′ and c ≥ 0 are parameters of your classifier. By changing these parameters you can create ellipsoids centered anywhere on the plane with any desired ratio between the radii and any orientation (examples are shown in Figure 1). As discussed in the class, for binary classification h < 0 can characterize one class and h > 0 may characterize the other class.

Figure 1: examples of classifiers that H can create


(a) Show that the VC dimension of H is at least 5.
Hint. To show this you would need to come up with one set of 5 points on the 2D plane which can be shattered by H. To do this you would need to realize 2⁵ = 32 dichotomies, but here I suggest an easier way. Consider the points sitting on the vertices of a pentagon; then many realizations become rotated versions of each other. If you manage to show that in all 8 cases depicted in Figure 2, a 2D ellipsoid can separate the two classes, you would be done! To get you started I have done this for the first two cases. Just do it for the other 6.

Figure 2: Dichotomies needed for 5 points

(b) It turns out that the VC dimension of H is precisely 5, which basically means that no set of 6 points can be shattered by H. Showing this is beyond our technical interest¹. However, my question would be easy. For the set of six points below, suggest an unrealizable dichotomy for H. In simple words, color the dots in a way that no ellipsoid can separate the black dots from the white dots.

¹ I share the paper in case you are interested in pursuing research in complexity theory: https://arxiv.org/pdf/1109.4347.pdf


Class 4: Classification

MSA 8150: Machine Learning for Analytics

Alireza Aghasi

Institute for Insight, Georgia State University

Classification

– In many applications, the response is not a quantitative value and instead represents a class, e.g., y ∈ {student, non-student}, y ∈ {white, yellow, green}
– Yet based on the observation of some features, we would like to predict the class (what we refer to as the classification)
– Regression vs classification

[Figure: left, a regression example (Income as a function of Years of Education and Seniority); right, a two-class scatter plot in the (X1, X2) plane.]

Classification

Example. Predicting default cases on the credit card (unable to pay the credit card), based on the income and current balance.

[Figure: scatter plot and boxplots of Balance and Income, split by Default = No/Yes.]

(one immediate observation is that balance is probably a more useful feature)

Binary Classification

– In simple regression for a single feature x we fitted a line y = β₀ + β₁x to the data
– In binary classification with only one feature, we don't have values any more, but two classes (say class 0 and class 1)
– Can we do the fit in a way that the sign of β₀ + β₁x becomes an indicator of the class for us?
– In other words, for a given feature xₜ, we make a decision based on the following:

    yₜ = 1 if β₀ + β₁xₜ > 0,    yₜ = 0 if β₀ + β₁xₜ < 0

– A smooth function (called Sigmoid – also inverse Logit) that takes almost binary values 0, 1 based on the sign of the input z is

    e^z / (1 + e^z) ≈ 1 for z ≫ 0,    ≈ 0 for z ≪ 0

Multi-Class Logistic Regression

– When we have more than 2 labels (e.g., y ∈ {white, yellow, green}) and p features x₁, x₂, ···, xₚ, we fit K models parametrized by

    Label 1: {β₀^(1), β₁^(1), ···, βₚ^(1)}
    Label 2: {β₀^(2), β₁^(2), ···, βₚ^(2)}
    ···
    Label K: {β₀^(K), β₁^(K), ···, βₚ^(K)}

– For this problem we consider the following form:

    pₖ(x) = P(Y = k|x) = e^(β₀^(k) + ··· + βₚ^(k) xₚ) / ( e^(β₀^(1) + ··· + βₚ^(1) xₚ) + ··· + e^(β₀^(K) + ··· + βₚ^(K) xₚ) )

– What is the sum of all P(Y = k|x) for a fixed x?
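To see the answer numerically, here is a tiny R sketch with made-up coefficients for K = 3 labels and p = 2 features (the numbers are purely illustrative):

    x <- c(1.0, 2.0)                             # a single feature vector (p = 2)
    B <- rbind(c( 0.1,  0.5, -0.2),              # row k holds (beta0, beta1, beta2) for label k
               c(-0.3,  0.1,  0.4),
               c( 0.2, -0.6,  0.1))
    scores <- exp(B %*% c(1, x))                 # e^(beta0 + beta1*x1 + beta2*x2) per label
    probs  <- scores / sum(scores)               # the K probabilities P(Y = k | x)
    probs
    sum(probs)                                   # always 1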

Some R Simulations

– Let's perform some basic classification tasks in R!

Linear and Quadratic Discriminant Analysis

LDA/QDA Story in Simple Words

[Figure: 30 blue and 40 red sample points on the real line, each cloud fitted with a Normal pdf.]

Let's do an exercise. We are given 30 blue points and 40 red points as above. Let's fit a Normal pdf to each cloud.

LDA/QDA Story in Simple Words

[Figure: the two fitted Normal curves, crossing at a single point.]

Let's assume both variances turn out similar. Find the intersection point. What can you say about this point?

LDA/QDA Story in Simple Words

[Figure: the crossing point splits the real line into a Blue region and a Red region.]

We have found a rule to classify each point on the real line.

LDA/QDA Story in Simple Words

[Figure: a different set of blue and red points on the real line.]

Let's do a different exercise with another set of points.

LDA/QDA Story in Simple Words

[Figure: the two fitted Normal curves now have different spreads.]

The intersection point changes since the variances are no longer the same.

LDA/QDA Story in Simple Words

[Figure: the resulting Blue/Red decision regions on the real line.]

Again we have a rule. If you want to assign a probability of being blue and being red to each point, how would you calculate that probability?

    P(Y = B|x) = p_B(x) = f_B(x) / (f_B(x) + f_R(x)),    P(Y = R|x) = p_R(x) = f_R(x) / (f_B(x) + f_R(x))

LDA/QDA Story in Simple Words

• Say you want to incorporate the probability of each class in your calculation of P(Y|x). We can use the Bayes formula
– Probabilistically, suppose that our y can take K distinct values. By Bayes' theorem we have

    P(Y = ℓ|X = x) = P(Y = ℓ) P(X = x|Y = ℓ) / Σ_{k=1}^K P(Y = k) P(X = x|Y = k) = π_ℓ f_ℓ(x) / Σ_{k=1}^K π_k f_k(x)

– Let's see why this equality holds, knowing the Bayes' equality

    P(A|B) = P(A ∩ B) / P(B)

and A₁ ∪ A₂ ∪ ··· ∪ A_K covering the entire space, where Aᵢ ∩ Aⱼ = ∅.

LDA/QDA Story in Simple Words

[Figure: the prior-weighted Normal curves; the crossing point has moved again.]

The intersection point changes yet again.

LDA/QDA Story in Simple Words

[Figure: the prior-weighted Blue/Red curves and the resulting decision regions.]

    P(Y = B|x) = p_B(x) = (3/7) f_B(x) / ( (3/7) f_B(x) + (4/7) f_R(x) ),
    P(Y = R|x) = p_R(x) = (4/7) f_R(x) / ( (3/7) f_B(x) + (4/7) f_R(x) )

LDA/QDA Story in Simple Words

Question. What if we have more than two classes?

[Figure: three point clouds on the real line, label 1 (30 pts), label 2 (60 pts), label 3 (90 pts), each fitted with a Normal curve.]

Story in Simple Words

Answer. Same idea!

[Figure: the three fitted Normal curves and the resulting decision regions.]

Story in Simple Words

Goal. Higher dimensions, e.g.,

    µ₁ = (0, 0)ᵀ,  Σ₁ = [1 0; 0 1],    µ₂ = (−1, −1)ᵀ,  Σ₂ = [1 0.9; 0.9 1]

Also, in higher dimensions the intersection between two Normal pdfs is no longer just a point; it would be a boundary.

Little Introduction about Multivariate Normal

[Figure: univariate and bivariate Normal density surfaces over x1 and x2.]

– Recall the normal distribution for a random variable x:

    f(x) = 1/√(2πσ²) · exp( −(x − µ)² / (2σ²) )

– Similar to the scalar case, we can define a distribution for the random vector x = [x₁, ···, xₚ]ᵀ as

    f(x) = 1/( (2π)^(p/2) |Σ|^(1/2) ) · exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )

Linear Discriminant Analysis (LDA)

– Recall

    P(Y = ℓ|X = x) = P(Y = ℓ) P(X = x|Y = ℓ) / Σ_{k=1}^K P(Y = k) P(X = x|Y = k) = π_ℓ f_ℓ(x) / Σ_{k=1}^K π_k f_k(x)

– The purpose of LDA is learning a model for P(Y = ℓ|X = x)
– In the formulation above, f_ℓ(x) is in a sense the distribution we consider for the data points in class ℓ, and π_ℓ is the probability that we pick some random sample and it belongs to class ℓ
– In LDA, we assume that all f_ℓ(x) have a multivariate normal distribution with a common covariance and different means, i.e.

    f_ℓ(x) = 1/( (2π)^(p/2) |Σ|^(1/2) ) · exp( −(1/2) (x − µ_ℓ)ᵀ Σ⁻¹ (x − µ_ℓ) )

– Unlike logistic regression, which involved a rather complicated maximization for learning, in LDA we have closed form expressions for π_ℓ, µ_ℓ and Σ, and classifying new test points becomes very easy

Linear Discriminant Analysis (LDA)

– Given a training set (x₁, y₁), ···, (xₙ, yₙ) where the responses y can take K distinct class values 1, 2, ···, K, we can easily learn the LDA model by calculating π_ℓ, µ_ℓ and Σ via (considering c_ℓ to be the index set of samples in class ℓ)

    π̂_ℓ = (# of elements in c_ℓ) / n
    µ̂_ℓ = 1/(# of elements in c_ℓ) · Σ_{i∈c_ℓ} xᵢ
    Σ̂ = 1/(n − K) · Σ_{ℓ=1}^K Σ_{i∈c_ℓ} (xᵢ − µ̂_ℓ)(xᵢ − µ̂_ℓ)ᵀ

– After this point, for a new test point x_t we have all that is needed to calculate P(Y = ℓ|X = x_t) for ℓ = 1, ···, K and pick as the label the one that is largest
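As an illustration of how direct these closed-form estimates are, a small R sketch (not the MASS implementation, just the formulas above applied to a numeric feature matrix X and a label vector y):

    lda_estimates <- function(X, y) {
      X <- as.matrix(X); n <- nrow(X); p <- ncol(X)
      classes <- sort(unique(y)); K <- length(classes)
      pis <- sapply(classes, function(l) sum(y == l) / n)                          # pi_hat_l
      mus <- lapply(classes, function(l) colMeans(X[y == l, , drop = FALSE]))      # mu_hat_l
      S <- matrix(0, p, p)
      for (l in classes) {                                                         # pooled covariance
        D <- sweep(X[y == l, , drop = FALSE], 2, colMeans(X[y == l, , drop = FALSE]))
        S <- S + t(D) %*% D
      }
      list(pi = pis, mu = mus, Sigma = S / (n - K))
    }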

Linear Discriminant Analysis (LDA)

[Figure: two panels in the (X1, X2) plane showing the class densities and the resulting LDA decision boundaries.]

Linear Discriminant Analysis (LDA)

– In practice, to assign a label to a given test point x_t we do not need to calculate

    P(Y = ℓ|X = x_t) = π_ℓ f_ℓ(x_t) / Σ_{k=1}^K π_k f_k(x_t)

and only comparing π_ℓ f_ℓ(x_t) is enough
– This reduces to evaluating

    δ_ℓ = x_tᵀ Σ⁻¹ µ_ℓ − (1/2) µ_ℓᵀ Σ⁻¹ µ_ℓ + log π_ℓ

and picking as the class the ℓ corresponding to the largest δ_ℓ
– You can find the decision boundary between class i and j by finding the points for which δ_i = δ_j
– [see the sample Matlab code]

Quadratic Discriminant Analysis (QDA)

– Recall

    P(Y = ℓ|X = x_t) = π_ℓ f_ℓ(x_t) / Σ_{k=1}^K π_k f_k(x_t)

– The purpose of QDA is learning a model for P(Y = ℓ|X = x) in a more flexible way compared to LDA
– In QDA, we assume that all f_ℓ(x) have a multivariate normal distribution with different means and a class-specific covariance, i.e.

    f_ℓ(x) = 1/( (2π)^(p/2) |Σ_ℓ|^(1/2) ) · exp( −(1/2) (x − µ_ℓ)ᵀ Σ_ℓ⁻¹ (x − µ_ℓ) )

– The main difference between LDA and QDA is that in LDA we consider a single Σ for all classes, but in QDA we allow more flexibility by having a different covariance matrix for each class
– Similar to LDA, QDA can be learned easily and we can obtain closed form expressions for π_ℓ, µ_ℓ and Σ_ℓ

Quadratic Discriminant Analysis (QDA)

– Given a training set (x₁, y₁), ···, (xₙ, yₙ) where the responses y can take K distinct class values 1, 2, ···, K, we can easily learn the QDA model by calculating π_ℓ, µ_ℓ and Σ_ℓ via (considering c_ℓ to be the index set of samples in class ℓ)

    π̂_ℓ = (# of elements in c_ℓ) / n
    µ̂_ℓ = 1/(# of elements in c_ℓ) · Σ_{i∈c_ℓ} xᵢ
    Σ̂_ℓ = 1/(# of elements in c_ℓ − 1) · Σ_{i∈c_ℓ} (xᵢ − µ̂_ℓ)(xᵢ − µ̂_ℓ)ᵀ

– After this point, for a new test point x_t we have all that is needed to calculate P(Y = ℓ|X = x_t) for ℓ = 1, ···, K and pick as the label the one that is largest

Quadratic Discriminant Analysis (QDA)

[Figure: two panels in the (X1, X2) plane showing the (curved) QDA decision boundaries.]

Quadratic Discriminant Analysis (QDA)

– In practice, to assign a label to a given test point x_t we do not need to calculate

    P(Y = ℓ|X = x_t) = π_ℓ f_ℓ(x_t) / Σ_{k=1}^K π_k f_k(x_t)

and only comparing π_ℓ f_ℓ(x_t) is enough
– This reduces to evaluating

    δ_ℓ = −(1/2) log|Σ_ℓ| − (1/2) (x_t − µ_ℓ)ᵀ Σ_ℓ⁻¹ (x_t − µ_ℓ) + log π_ℓ

and picking as the class the ℓ corresponding to the largest δ_ℓ
– [see the sample Matlab code]

Some R Simulations

– Let's perform some basic classification tasks in R!

Summary

– Logistic regression is very popular for classification, especially for binary classification
– LDA is especially useful when K > 2, the number of training samples is small, or the classes are well separated, and Gaussian assumptions are reasonable
– QDA presents more flexibility in shaping the partitions compared to LDA
– Logistic regression can also fit quadratic boundaries like QDA, by explicitly including quadratic terms in the model

Questions?


Class 5: Resampling/Boosting and Model Complexity

MSA 8150: Machine Learning for Analytics

Alireza Aghasi

Institute for Insight, Georgia State University

Bootstrap

– The bootstrap is a flexible and very powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method
– It can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient, regardless of how complex the derivation of that coefficient is

A Very Simple Example

– Think of a block which takes 100 randomly generated samples drawn from a distribution and generates their sample mean x̄ as the output
– We want to see, if we do this many, many times, how the outputs of our block vary
– In principle, to do this we would need to run the experiment, say 1000 times, and then look at the variations of x̄; however, this requires constant access to the distribution and a lot of sampling (which in many applications can be expensive to acquire)
– Bootstrap can help us do this without accessing the reference distribution
– Let's see the code
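Since the original code is not included here, a minimal stand-in for that example (the data are synthetic; only one sample of size 100 is "observed", and everything else comes from resampling it):

    set.seed(1)
    x <- rnorm(100, mean = 5, sd = 2)       # the one sample we actually observe
    xbar_boot <- replicate(1000, mean(sample(x, replace = TRUE)))
    sd(xbar_boot)                           # bootstrap estimate of the SE of the mean
    sd(x) / sqrt(length(x))                 # compare with the usual formula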

Bootstrap via an Example

– Let's explain the bootstrap via an example
– Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of X and Y (random quantities)
– We will invest α shares in X, and will invest the remaining 1 − α in Y
– To minimize the risk, we want to minimize var(αX + (1 − α)Y)
– We can show (we do it in the class) that the minimizer is

    α = ( var(Y) − cov(X, Y) ) / ( var(X) + var(Y) − 2 cov(X, Y) )
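A quick check of that minimizer (this is the in-class derivation): expanding the variance with the bilinearity of covariance and setting the derivative with respect to α to zero gives

    d/dα [ α² var(X) + (1 − α)² var(Y) + 2α(1 − α) cov(X, Y) ] = 0
    ⇒ α ( var(X) + var(Y) − 2 cov(X, Y) ) = var(Y) − cov(X, Y),

which rearranges to the expression above.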

Bootstrap via an Example

– In the real world we do not know var(X), var(Y), cov(X, Y)
– Suppose we are given a data set containing pairs of X and Y. We can estimate var(X), var(Y), cov(X, Y) from the sample set and get an estimate α̂ for the optimal share
– Ideally we could generate these sample sets many times, estimate an α̂ for each, and look into the histogram
– However, in the real world we only have one sample set to use
– Bootstrap still allows us to generate good estimates of α using only one sample set!

Bootstrap via an Example

To see how nicely the bootstrap works, let's compare its outcome with the case where α is generated from many synthetic sample generations.

– We generate 1000 sample sets, each containing 100 pairs of X, Y
– For the synthetic data generated, var(X) = 1, var(Y) = 1.25 and cov(X, Y) = 0.5, which yield an optimal value of α = 0.6

[Figure: four scatter plots of simulated (X, Y) sample sets.]

Bootstrap via an Example

– To get the left panel we generate 1000 synthetic sample sets, for each obtain α̂, plot the histogram, and calculate

    ᾱ = (1/1000) Σ_{i=1}^{1000} α̂ᵢ,    SE(α) = √( (1/999) Σ_{i=1}^{1000} (α̂ᵢ − ᾱ)² )

– For the bootstrap we only use one of the sample sets and regenerate new sample sets by sampling with replacement
– Surprisingly, the results are very close

[Figure: histograms and boxplots of the α̂ values from the true simulation ("True") and from the bootstrap ("Bootstrap").]

Bootstrap General Framework

– Suppose a black-box calculates α̂ from a sample set, e.g., a coefficient in linear or logistic regression
– We are interested in estimating the variability of α̂ without examining many new sample sets
– Denoting the first bootstrap data set by Z¹, we use Z¹ to produce a new bootstrap estimate for α, which we call α̂¹
– This procedure is repeated B (say 100 or 1000) times, in order to produce B different bootstrap data sets, Z¹, Z², ···, Z^B, and the corresponding α estimates, α̂¹, ···, α̂^B
– We estimate the standard error of these bootstrap estimates using the formula

    SE_B(α) = √( (1/(B − 1)) Σ_{i=1}^B (α̂ᵢ − ᾱ̂)² )   where   ᾱ̂ = (1/B) Σ_{i=1}^B α̂ᵢ

– This serves as an estimate of the standard error of α estimated from the original data set!

Programming Exercise

Let's go through some programming exercises

– Basic Example
– Linear model example with the boot function in R (a stand-in sketch follows below)
– Linear model example without the boot function in R and Python
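Since those exercise files are not included here, a small stand-in for the "linear model with boot()" item (synthetic data; the statistic returns the slope, so boot() reports its bootstrap standard error):

    library(boot)
    set.seed(1)
    df <- data.frame(x = rnorm(200))
    df$y <- 1 + 2 * df$x + rnorm(200)                     # synthetic linear data
    slope_fn <- function(data, idx) coef(lm(y ~ x, data = data[idx, ]))[2]
    boot(df, slope_fn, R = 1000)                          # slope estimate and its bootstrap SE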

Model Complexity and VC Dimension

Train and Test

– Recall that we fitted our models using training data and were interested in evaluating the performance with respect to independent test data
– To produce justifiable model reliability arguments, the test data should not be used in the training
– If the model is evaluated against the training data, the results can be very misleading
– The training error rate is often quite different from the test error rate, and in particular the former can dramatically underestimate the latter (recall the accuracy vs complexity chart)

Training vs Test Model Evaluation

– Recall this plot from the first session
– We all had a sense of what complexity is, but have not yet seen any official way of measuring it

VC dimension

– In statistical learning theory and complexity theory, one measure that is widely used to quantify the complexity of models is the Vapnik–Chervonenkis Dimension (VCD)
– In its original form VCD is defined to measure the complexity of a class of binary classifiers. However, variants of the VCD are also available for multi-class classification and regression problems
– There are other ways of measuring model complexity; however, VCD is probably the most well-known one

How Does VCD Relate Test and Training Errors?

– Consider a set of classifiers H. Vapnik and Chervonenkis showed that for a binary classifier trained with N samples, the test and training errors can be related. More specifically, with probability 1 − η (you can pick η to be very small to make this very likely):

    Test Err ≤ Training Err + √( (1/N) [ VCD(H) ( log(2N/VCD(H)) + 1 ) − log(η/4) ] )

Or roughly speaking, we get an idea of the gap between train and test as

    Test Err ≲ Training Err + √( VCD(H)/N )

How Does VCD Relate Test and Training Errors?

– When the test and training errors are close, we say the model generalizes well
– When there is a large gap between the test and training errors, we say the model generalizes poorly
– For example, we saw in the first lectures that polynomial regression has a poor generalization
– We will see that deep neural networks generalize very well
– In general, calculating the VCD for a set of classifiers is not easy; however, oftentimes obtaining approximations or bounds on the VCD is possible
– In order to define the VCD, we first need to introduce a few notions

What is VCD?

– Usually, the ultimate goal is obtaining the VCD for a set of classifiers (e.g., linear classifiers in 2D), denoted by H
– We refer to H as the Hypothesis Class. For example, we say H is the hypothesis class of linear classifiers in 2D:

    H = {h : h(x, y) = ax + by + c}

– Notice that in this case 2D points can be classified based on the sign of h; for example class 0 corresponds to h < 0 and class 1 corresponds to h ≥ 0.
– The VCD of H is defined as the cardinality of the largest set of points that H shatters
– What does shattering mean?
What is Shattering?

– Consider a given hypothesis class H for a binary classification problem (of say + and −, or white and black)
– Also consider a given set of N points (fixed location), which can obviously be labelled in 2^N ways with the + and − labels
– We say H shatters the given N points if H realizes every one of the 2^N labelings. In other words, for any binary labelling of the N points, there exists a function in H (a classifier formulated by H) that can separate the two classes of points
– Best explained via a few examples
Example 1

– Consider the two points below and a linear hypothesis class, i.e.,
    H = {h : h(x, y) = ax + by + c}
Can H shatter them?

Example 1

– Consider the two points below and a linear hypothesis class, i.e.,
    H = {h : h(x, y) = ax + by + c}
Can H shatter them?

Example 2

– Consider the three points below and a linear hypothesis class, i.e.,
    H = {h : h(x, y) = ax + by + c}
Can H shatter them?

Example 3

– Consider the three points below and a linear hypothesis class, i.e.,
    H = {h : h(x, y) = ax + by + c}
Can H shatter them?

Example 4

– Consider the four points below and a linear hypothesis class, i.e.,
    H = {h : h(x, y) = ax + by + c}
Can H shatter them?

Example 5

– Consider the two points below and an origin-centered disk classifier, i.e.,
    H = {h : h(x, y) = x² + y² − r²}
Can H shatter them?
How is the VCD Calculated?

– As we pointed out earlier, the VCD is defined as the cardinality of the largest set of points that H shatters
– Therefore, to show that the VCD of a classifier is M, we need to provide two steps:
– Step 1. Come up with a set of M points that our classifier shatters (usually not hard)
– Step 2. Show that NO M + 1 POINTS can be shattered by our classifier (can be hard to show!)
– From Example 2 we conclude that the VCD of a linear classifier in 2D is at least 3, but Example 4 is not sufficient to show that it is exactly 3 (we need to show no 4 points can be shattered, not only the specific ones in Example 4)
– Also, Example 3 does not imply that the VCD cannot be 3 (all we need for step 1 is one set of points that are shattered)
– Using Radon's theorem one can show the VCD is in fact 3, but that requires more analysis, which we skip
– In fact, linear classifiers in d-dimensional space have a VCD of d + 1
Example 6

– Can you show that the origin-centered disk classifier
    H = {h : h(x, y) = x² + y² − r²}
has a VCD of 1?
– Step 1. If we have one point on the plane, it can be shattered
– Step 2. Any two distinct points that we consider on the plane cannot be shattered, because we can take the farthest point from the origin and label it −, and label the other point with +. Then a disk that contains the negative-labeled one would also contain the positive one, and a shattering is not possible.
– Step 1 and Step 2 imply that the VCD for this classifier is 1.
Example 7

– Prove that the VCD of axis-aligned rectangles is 4 in 2D.
– Step 1. Presenting a set of 4 points that can be shattered by axis-aligned rectangles
– We can take advantage of realizations that are rotated versions of each other, and only verify 6 labelings
Example 7 (Continued)

– Step 2. Proving that no set of 5 points can be shattered by axis-aligned rectangles:
– Proof. Think of any 5 distinct points on the plane. We can pick the topmost, bottommost, leftmost and rightmost points and label them −, and label the fifth point by +. Any rectangle that contains the 4 negative points would also contain the positive-labelled point; therefore a shattering of 5 points is never possible.
– Now, based on step 1 and step 2, we can claim that the VCD of axis-aligned rectangles is 4.
– Performing only step 1 cannot prove that the VCD is 4. It can only prove that the VCD is at least 4.
Some Final Notes

– In general, VCD is a way of quantifying model complexities
– Calculation of the exact VCD may be hard for a given classifier (step 2 is usually the most challenging part)
– Shattering a set of M points by a classifier only guarantees a lower bound of M on the VCD, but does not prove that the VCD is M
– While VCD is a very powerful tool to get an idea about generalization, the equation relating the test and training errors is only an inequality. For some models (such as Deep Neural Nets) we see that while the VCD is large (on the order of the number of parameters), they generalize very well! Maybe this is the magic of DNNs!
Questions?