
CS 3753/5163 Intro to Data Science

Clustering


Clustering

• Partition unlabeled examples into disjoint clusters, such that:
  – Examples within a cluster are very similar.
  – Examples in different clusters are very different.
• Discover new categories in an unsupervised manner (no sample category labels provided).
  – Therefore the term "unsupervised learning."

Clustering Example

[Figure: scatter of unlabeled points forming several visually apparent groups]

Hierarchical Clustering

• Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.

[Figure: example taxonomy — animal → vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean)]

• Recursive application of a standard clustering algorithm can produce a hierarchical clustering.

Agglomerative vs. Divisive Clustering

• Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively merge them to form larger and larger clusters.
• Divisive (partitional, top-down) methods immediately separate all examples into clusters.

Direct Clustering Method

• Direct clustering methods require the desired number of clusters, k, to be specified.
• A clustering evaluation function assigns a real-valued quality measure to a clustering.
• The number of clusters can be determined "automatically" by explicitly generating clusterings for multiple values of k and choosing the best result according to a clustering evaluation function.

Hierarchical Agglomerative Clustering (HAC)

• Assumes a similarity function for determining the similarity of two instances.
• Starts with each instance in a separate cluster and then repeatedly joins the two most similar clusters until there is only one cluster.
• The history of merging forms a binary tree, or hierarchy.

HAC Algorithm

Start with each instance in its own cluster.
Until there is only one cluster:
    Among the current clusters, determine the two clusters, ci and cj, that are most similar.
    Replace ci and cj with a single cluster ci ∪ cj.
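The merge loop above can be sketched directly in Python. This is a naive O(n³) illustration; the function name and the single/complete-link option are mine, not from the slides:

```python
import numpy as np

def hac(points, linkage="single"):
    """Naive hierarchical agglomerative clustering.

    Returns the merge history: a list of (cluster_a, cluster_b) pairs,
    where each cluster is a frozenset of point indices.
    """
    points = np.asarray(points, dtype=float)
    clusters = [frozenset([i]) for i in range(len(points))]
    history = []

    def cluster_dist(a, b):
        # All pairwise member distances; single link takes the min,
        # complete link the max.
        d = [np.linalg.norm(points[i] - points[j]) for i in a for j in b]
        return min(d) if linkage == "single" else max(d)

    while len(clusters) > 1:
        # Find the two most similar (closest) clusters...
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: cluster_dist(clusters[p[0]], clusters[p[1]]))
        # ...record the merge, and replace them with their union ci ∪ cj.
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history
```

The returned history is exactly the binary merge tree the slide describes: n − 1 merges, the last of which joins everything into one cluster.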

HAC

[Figure: dendrogram built over points a–f]

• Exact behavior depends on how the distance between two clusters is computed.
• No need to specify the number of clusters.
• A distance cutoff is often chosen to break the tree into clusters.

Cluster Similarity

• Assume a similarity function that determines the similarity of two instances: sim(x, y).
  – e.g., cosine similarity of document vectors.
• How do we compute the similarity of two clusters, each possibly containing multiple instances?
  – Single Link: similarity of the two most similar members.
  – Complete Link: similarity of the two least similar members.
  – Group Average: average similarity between members in the merged cluster.
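The three options can be written as one small function. A sketch, assuming cosine similarity over real-valued vectors as in the document's document-vector example (the function names are mine):

```python
import numpy as np

def cosine_sim(x, y):
    # Cosine similarity of two vectors.
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def cluster_sim(A, B, linkage):
    """Similarity between clusters A and B (each a list of vectors)."""
    pair = [cosine_sim(a, b) for a in A for b in B]
    if linkage == "single":        # two most similar members
        return max(pair)
    if linkage == "complete":      # two least similar members
        return min(pair)
    if linkage == "average":       # average over all pairs in the merged cluster
        merged = list(A) + list(B)
        sims = [cosine_sim(u, v) for i, u in enumerate(merged)
                for v in merged[i + 1:]]
        return sum(sims) / len(sims)
    raise ValueError(linkage)
```

Note that group average, as defined on the slide, averages over the whole merged cluster, so it includes intra-cluster pairs as well as cross-cluster pairs.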

Cluster Similarity

• Single Link: similarity of the two most similar members.
  – Single Link can separate non-elliptical shapes as long as the gap between the two clusters is not small.
  – Single Link cannot separate clusters properly if there is noise between clusters.
• Complete Link: similarity of the two least similar members.
  – The Complete Link approach does well at separating clusters if there is noise between clusters.
  – The Complete Link approach is biased towards globular clusters; it tends to break large clusters.
• Group Average: average similarity between members in the merged cluster.
  – The Group Average approach does well at separating clusters if there is noise between clusters.

Single Link Example

[Figure: points a–h and the resulting single-link dendrogram; leaf order a b c d e f g h]

Complete Link Example

[Figure: points a–h and the resulting complete-link dendrogram; leaf order a b e f c d g h]

Non-Hierarchical Clustering

• Typically must provide the number of desired clusters, k.
• Usually framed as an optimization problem.

K-Means

• Assumes instances are real-valued vectors.
• Clusters are based on centroids: the center of gravity, or mean, of the points in a cluster c:

  μ(c) = (1/|c|) ∑_{x ∈ c} x

• Reassignment of instances to clusters is based on distance to the current cluster centroids.

Distance Metrics


K-Means Algorithm

Let d be the distance measure between instances.
Select k random instances {s1, s2, …, sk} as seeds.
Until clustering converges or another stopping criterion is met:
    For each instance xi:
        Assign xi to the cluster cj such that d(xi, sj) is minimal.
    (Update the seeds to the centroid of each cluster:)
    For each cluster cj:
        sj = μ(cj)
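A minimal NumPy sketch of the algorithm above: pick k random instances as seeds, assign each instance to its nearest seed, move each seed to its cluster's centroid, and stop when the centroids no longer change. The function name and empty-cluster handling are illustrative choices, not from the slides:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means with random instance seeds."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Select k random instances as the initial seeds.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each instance to the nearest centroid (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update each seed to the centroid (mean) of its cluster;
        # keep an empty cluster's centroid where it was.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):   # converged
            break
        centroids = new
    return labels, centroids
```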

K-Means Example (K = 2)

[Figure sequence: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged]

K-Means Objective

• The objective of k-means is to minimize the total sum of the squared distances from every point to its corresponding cluster centroid:

  ∑_{l=1}^{K} ∑_{xi ∈ Xl} ||xi − μl||²

• Finding the global optimum is NP-hard.
• The k-means algorithm is guaranteed to converge to a local optimum.
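The objective can be evaluated directly from a labeling and its centroids; a short sketch (the function name is mine):

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Sum over clusters l and points x_i in X_l of ||x_i - mu_l||^2."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = np.asarray(labels)
    diffs = X - centroids[labels]   # each point minus its own cluster centroid
    return float((diffs ** 2).sum())
```

Because each reassignment step and each centroid update can only decrease this quantity, the algorithm converges, but only to a local optimum.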

Seed Choice

• Results can vary based on random seed selection.
• Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings.
• Select good seeds using a heuristic or the results of another method.
• It may be better to repeat the algorithm several times with different seeds and choose the best result.
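The repeat-with-different-seeds idea can be sketched as follows: run plain k-means from several random seed selections and keep the run with the lowest sum of squared errors. All names and structure here are illustrative:

```python
import numpy as np

def kmeans_once(X, k, rng, n_iters=100):
    # One k-means run from one random seed selection.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    sse = float(((X - centroids[labels]) ** 2).sum())  # clustering quality
    return labels, centroids, sse

def kmeans_restarts(X, k, n_restarts=10, seed=0):
    """Repeat k-means with different seeds; keep the lowest-SSE result."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    return min((kmeans_once(X, k, rng) for _ in range(n_restarts)),
               key=lambda run: run[2])
```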

Seed Choice (example)

• Results can vary based on random seed selection.

[Figure sequence, starting from three random seeds:
Iteration 1: calculate the distance from each point to the three seeds; cluster the first point, then every point, to its closest seed.
Once all points are clustered, update the seeds' locations to the centers of their clusters.
Iteration 2: recalculate the distance from each point to the three updated seeds and reassign.]

Seed Choice

• Cluster all points.
• If the seeds' new locations are different from their locations in the previous iteration, update them and go to the next iteration.
• Otherwise, stop the iteration: the current clusters are the final clusters.

How to determine the number of clusters?

• An open problem.
• Larger K:
  – More homogeneity within clusters.
  – Less separation between clusters.
• Smaller K: the opposite.
• Many heuristic methods have been proposed; none is uniformly good.

[Figures: the same data set interpreted as three clusters vs. four clusters.]

Sec. 16.3

What Is A Good Clustering?

• Internal criterion: a good clustering will produce high-quality clusters in which:
  – the intra-cluster (i.e., within-cluster) similarity is high;
  – the inter-cluster (i.e., between-cluster) similarity is low.
  – For any given data set there is usually a trade-off between the two, and you cannot optimize both.

External criteria for clustering quality

• Quality is measured by the clustering's ability to discover some or all of the hidden patterns or latent classes in gold-standard data.
• Assesses a clustering with respect to ground truth, which requires labeled data.
• Assume documents with C gold-standard classes, while our clustering algorithm produces K clusters, ω1, ω2, …, ωK, with ni members in cluster ωi.

External Evaluation of Cluster Quality

• Simple measure: purity, the ratio between the count of the dominant class in cluster ωi and the size of cluster ωi:

  Purity(ωi) = (1/ni) · max_{j ∈ C} (nij)

• Purity is biased, because splitting the data into n singleton clusters maximizes it.
• Alternatives are the entropy of classes in clusters (or the mutual information between classes and clusters).
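Overall purity, the size-weighted average of the per-cluster purities defined above, can be computed in a few lines; the function name and input format are illustrative:

```python
from collections import Counter

def purity(clusters, labels):
    """Overall purity of a clustering against gold-standard class labels.

    clusters: list of clusters, each a list of item indices.
    labels:   gold class label for each item.
    """
    n = sum(len(c) for c in clusters)
    # For each cluster, count the members belonging to its dominant class.
    dominant = sum(Counter(labels[i] for i in c).most_common(1)[0][1]
                   for c in clusters)
    return dominant / n
```

Note the bias the slide mentions: with every item in its own cluster, each cluster's dominant-class count is 1 and purity reaches its maximum of 1.0.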

Conclusions

• Unsupervised learning induces categories from unlabeled data.
• There are a variety of approaches, including:
  – HAC
  – k-means
  – many more

Hidden Markov Models

Slides based on the tutorial by Eric Fosler-Lussier:
http://www.di.ubi.pt/~jpaulo/competence/tutorials/hmm-tutorial-1.pdf

Markov Models

Three types of weather: sunny, rainy, and foggy.

• Assume the weather lasts all day (it doesn't change from rainy to sunny in the middle of the day).
• Weather prediction is all about trying to guess what the weather will be like tomorrow based on a history of observations of weather.
• For example, if we knew that the weather for the past three days was {sunny, sunny, foggy}, in chronological order, the probability that tomorrow would be rainy is given by:

  P(w4 = rainy | w1 = sunny, w2 = sunny, w3 = foggy)

• The larger n is, the more statistics we must collect. Suppose that n = 5; then we must collect statistics for 3^5 = 243 past histories.

Markov Assumption

• This is called a first-order Markov assumption, since we say that the probability of an observation at time n depends only on the observation at time n − 1:

  P(wn | w1, …, wn−1) ≈ P(wn | wn−1)

• We can express the joint distribution using the Markov assumption:

  P(w1, …, wn) = ∏_i P(wi | wi−1)

• Now the number of statistics that we need to collect is 3^2 = 9.

• Given that today is sunny, what is the probability that tomorrow is sunny and the day after is rainy?

  P(w2 = sunny, w3 = rainy | w1 = sunny) = P(w2 = sunny | w1 = sunny) · P(w3 = rainy | w2 = sunny)
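Under the first-order Markov assumption this is just a product of one-step transition probabilities. A numeric sketch, where the transition values follow the Fosler-Lussier tutorial and should be treated as illustrative assumptions, not part of these slides:

```python
import numpy as np

states = ["sunny", "rainy", "foggy"]
# A[i][j] = P(tomorrow = states[j] | today = states[i]);
# values are the illustrative ones from the Fosler-Lussier tutorial.
A = np.array([
    [0.8, 0.05, 0.15],   # from sunny
    [0.2, 0.6,  0.2 ],   # from rainy
    [0.2, 0.3,  0.5 ],   # from foggy
])

def seq_prob(seq):
    """P(w2, ..., wn | w1) under the first-order Markov assumption."""
    idx = [states.index(s) for s in seq]
    return float(np.prod([A[i, j] for i, j in zip(idx, idx[1:])]))

# Given that today is sunny, the probability that tomorrow is sunny
# and the day after is rainy: 0.8 * 0.05 = 0.04
p = seq_prob(["sunny", "sunny", "rainy"])
```

The 3 × 3 matrix holds exactly the 3^2 = 9 statistics the previous slide counts; each row is a conditional distribution and sums to 1.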

Given that today is foggy, what's the probability that it will be rainy two days from now?

• There are three ways to get from foggy today to rainy two days from now: {foggy, foggy, rainy}, {foggy, rainy, rainy}, and {foggy, sunny, rainy}.
• Therefore we have to sum over these paths:

  P(w3 = rainy | w1 = foggy) = ∑_{w2} P(w3 = rainy | w2) · P(w2 | w1 = foggy)
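The sum over the three paths can be computed directly. The transition values again follow the Fosler-Lussier tutorial and are illustrative assumptions:

```python
states = ["sunny", "rainy", "foggy"]
# Illustrative transition probabilities from the Fosler-Lussier tutorial:
# A[today][tomorrow]
A = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5},
}

# P(rainy two days from now | foggy today): sum over the three paths
# {foggy, w2, rainy} for each possible intermediate weather w2.
p = sum(A["foggy"][w2] * A[w2]["rainy"] for w2 in states)
# = 0.2*0.05 + 0.3*0.6 + 0.5*0.3 = 0.01 + 0.18 + 0.15 = 0.34
```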

• Note that you have to know where you start from.
• Usually Markov models start with a null start state and have transitions to other states with certain probabilities.
• In the previous problems you can just add a start state with a single arc of probability 1 to the initial state (sunny in problem 1 and foggy in problem 2).

Hidden Markov Models

What makes a Hidden Markov Model?

• Suppose you were locked in a room for several days, and you were asked about the weather outside. The only piece of evidence you have is whether the person who comes into the room carrying your daily meal is carrying an umbrella or not.

Example

[Figure: weather states annotated with umbrella observation probabilities]

HMM

• The equation for the weather Markov process before you were locked in the room was:

  P(w1, …, wn) = ∏_i P(wi | wi−1)

• Now we have to factor in the fact that the actual weather is hidden from you. We do that by using Bayes' Rule:

  P(w1, …, wn | u1, …, un) = P(u1, …, un | w1, …, wn) · P(w1, …, wn) / P(u1, …, un)

  where ui is whether an umbrella appeared on day i.

HMM

• The probability P(u1, …, un | w1, …, wn) can be estimated as ∏_i P(ui | wi), if you assume that, for all i, given wi, ui is independent of all uj and wj for all j ≠ i.
• P(u1, …, un) is the prior probability of seeing a particular sequence of umbrella events, e.g. {True, False, True}.

HMM Question 1

• Suppose the day you were locked in it was sunny.
• The next day, the caretaker carried an umbrella into the room.
  – Assume that the prior probability of the caretaker carrying an umbrella on any day is 0.5.
• What is the probability that the second day was rainy?
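A sketch of the computation via Bayes' rule. The slides do not state the umbrella-emission probabilities, so the values below (from the Fosler-Lussier tutorial) are assumptions:

```python
# Illustrative probabilities from the Fosler-Lussier tutorial (assumed):
trans_from_sunny = {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15}
emit = {"sunny": 0.1, "rainy": 0.8, "foggy": 0.3}   # P(umbrella | weather)
p_umbrella = 0.5                                    # stated prior

# Bayes' rule:
# P(w2 = rainy | w1 = sunny, u2 = True)
#   = P(u2 = True | w2 = rainy) * P(w2 = rainy | w1 = sunny) / P(u2 = True)
p = emit["rainy"] * trans_from_sunny["rainy"] / p_umbrella
# = 0.8 * 0.05 / 0.5 = 0.08
```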

HMM Question 2

• Suppose the day you were locked in the room it was sunny; the caretaker brought in an umbrella on day 2 but not on day 3.
• Again assume that the prior probability of the caretaker bringing an umbrella is 0.5.
• What is the probability that it's foggy on day 3?
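A sketch of Question 2: marginalize over the unknown day-2 weather, then apply Bayes' rule with the independence assumptions stated earlier. The transition and emission values again follow the Fosler-Lussier tutorial and are assumptions here:

```python
states = ["sunny", "rainy", "foggy"]
# Illustrative probabilities from the Fosler-Lussier tutorial (assumed):
trans = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5},
}
emit = {"sunny": 0.1, "rainy": 0.8, "foggy": 0.3}   # P(umbrella | weather)
p_u = 0.5                                           # stated prior per day

# P(w3 = foggy | w1 = sunny, u2 = True, u3 = False):
# sum over the unknown day-2 weather w2, with P(no umbrella | foggy)
# = 1 - emit["foggy"], and divide by the umbrella priors for the two days.
joint = sum(trans["sunny"][w2] * emit[w2] * trans[w2]["foggy"]
            for w2 in states)
p = (1 - emit["foggy"]) * joint / (p_u * p_u)
# joint = 0.012 + 0.008 + 0.0225 = 0.0425; p = 0.7 * 0.0425 / 0.25 = 0.119
```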

How many parameters?

• The hidden state space is assumed to consist of one of N possible values, modeled as a categorical distribution.
• This means that for each of the N possible states that a hidden variable at time t can be in, there is a transition probability from this state to each of the N possible states of the hidden variable at time t + 1, for a total of N² transition probabilities.

How many parameters?

• The set of transition probabilities for transitions from any given state must sum to 1. Because any one transition probability can be determined once the others are known, there are a total of N(N−1) free transition parameters.
• In addition, for each of the N possible states, there is a set of emission probabilities. The size of this set depends on the nature of the observed variable.
• For example, if the observed variable is discrete with M possible values, governed by a categorical distribution, there will be M−1 separate parameters per state, for a total of N(M−1) emission parameters over all hidden states.
• If the observed variable is an M-dimensional vector distributed according to an arbitrary multivariate Gaussian distribution, there will be M parameters controlling the means and M(M+1)/2 parameters controlling the covariance matrix, for a total of N(M + M(M+1)/2) emission parameters over all hidden states.
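The counts above can be bundled into a small helper (the name and interface are mine):

```python
def hmm_param_counts(N, M, observed="discrete"):
    """Free-parameter count for an HMM with N hidden states.

    discrete: observations take one of M values (categorical emissions).
    gaussian: observations are M-dimensional multivariate Gaussians.
    """
    transitions = N * (N - 1)           # each row of the NxN matrix sums to 1
    if observed == "discrete":
        emissions = N * (M - 1)         # each state's emission dist sums to 1
    else:  # gaussian
        emissions = N * (M + M * (M + 1) // 2)   # mean + covariance per state
    return transitions + emissions
```

For the weather HMM (N = 3 states, M = 2 umbrella values) this gives 3·2 transition parameters plus 3·1 emission parameters, i.e. 9 free parameters in total.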
