
CS 3753/5163 Intro to Data Science
Clustering

Clustering
• Partition unlabeled examples into disjoint
subsets (clusters), such that:
– Examples within a cluster are very similar
– Examples in different clusters are very different
• Discover new categories in an unsupervised
manner (no sample category labels provided).
– Hence the term "unsupervised learning"
Clustering Example

[Figure: scatter plot of unlabeled points forming visible clusters]
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy
(dendrogram) from a set of unlabeled examples.

    animal
    ├── vertebrate: fish, reptile, amphibian, mammal
    └── invertebrate: worm, insect, crustacean

• Recursive application of a standard clustering
algorithm can produce a hierarchical clustering.
Agglomerative vs. Divisive Clustering
• Agglomerative (bottom-up) methods start
with each example in its own cluster and
iteratively combine them to form larger and
larger clusters.
• Divisive (partitional, top-down) methods
immediately separate all examples into
clusters.
Direct Clustering Method
• Direct clustering methods require a
specification of the number of clusters, k,
desired.
• A clustering evaluation function assigns a
real-valued quality measure to a clustering.
• The number of clusters can be determined
"automatically" by explicitly generating
clusterings for multiple values of k and
choosing the best result according to a
clustering evaluation function.
Hierarchical Agglomerative Clustering
(HAC)
• Assumes a similarity function for determining
the similarity of two instances.
• Starts with all instances in separate clusters
and then repeatedly joins the two clusters that
are most similar until there is only one cluster.
• The history of merging forms a binary tree or
hierarchy.
HAC Algorithm
Until there is only one cluster:
    Among the current clusters, determine the two
    clusters, ci and cj, that are most similar.
    Replace ci and cj with a single cluster ci ∪ cj.
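The merge loop above can be sketched in plain Python. The slides leave the cluster-similarity function abstract; this sketch assumes single-link distance over Euclidean points (single link is defined later in the deck), so it is one possible instantiation, not the only one.

```python
import math

def single_link_distance(ci, cj):
    # Distance between two clusters = distance of their two closest members.
    return min(math.dist(p, q) for p in ci for q in cj)

def hac(points):
    """Repeatedly merge the two closest clusters until one remains.

    Returns the merge history, which encodes the binary tree (dendrogram)."""
    clusters = [[p] for p in points]   # start: each example in its own cluster
    history = []
    while len(clusters) > 1:
        # Determine the two clusters ci, cj that are most similar (closest).
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]         # replace ci and cj with ci ∪ cj
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history
```

With n points the loop performs n−1 merges, so `history` has length n−1 and its last entry is the root of the dendrogram.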
HAC

[Figure: dendrogram over points a–f]

• Exact behavior depends on how the distance between two
clusters is computed.
• No need to specify the number of clusters.
• A distance cutoff is often chosen to break the tree into clusters.
Cluster Similarity
• Assume a similarity function that determines the
similarity of two instances: sim(x, y).
– e.g. cosine similarity of document vectors.
• How to compute the similarity of two clusters, each
possibly containing multiple instances?
– Single Link: Similarity of the two most similar members.
– Complete Link: Similarity of the two least similar members.
– Group Average: Average similarity between members in
the merged cluster.
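The three cluster-similarity options can be written down directly. This sketch uses the cosine similarity of vectors mentioned on the slide as the pairwise `sim`; any other similarity function could be substituted.

```python
import math

def cosine_sim(x, y):
    # Cosine similarity of two vectors: dot product over product of norms.
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def single_link(c1, c2, sim=cosine_sim):
    # Similarity of the two MOST similar members.
    return max(sim(x, y) for x in c1 for y in c2)

def complete_link(c1, c2, sim=cosine_sim):
    # Similarity of the two LEAST similar members.
    return min(sim(x, y) for x in c1 for y in c2)

def group_average(c1, c2, sim=cosine_sim):
    # Average pairwise similarity between members of the merged cluster.
    merged = list(c1) + list(c2)
    pairs = [(x, y) for i, x in enumerate(merged) for y in merged[i + 1:]]
    return sum(sim(x, y) for x, y in pairs) / len(pairs)
```

By construction, for any pair of clusters single link ≥ group average ≥ complete link.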
Cluster Similarity
• Single Link: Similarity of the two most similar
members.
Cluster Similarity
• Single Link can separate non-elliptical shapes as long as the
gap between the two clusters is not small.
Cluster Similarity
• Single Link cannot separate clusters properly if there is
noise between clusters.
Cluster Similarity
• Complete Link: Similarity of the two least similar
members.
Cluster Similarity
• The Complete Link approach does well in separating clusters if
there is noise between clusters.
Cluster Similarity
• The Complete Link approach is biased towards globular clusters.
• It tends to break large clusters.
Cluster Similarity
• Group Average: Average similarity between members in the
merged cluster.
• The Group Average approach does well in separating
clusters if there is noise between clusters.
[Figure: dendrogram over points a–h; leaf order a b c d e f g h]
[Figure: dendrogram over points a–h; leaf order a b e f c d g h]
Non-Hierarchical Clustering
• Typically must provide the number of desired
clusters, k.
• Usually an optimization problem.
K-Means
• Assumes instances are real-valued vectors.
• Clusters based on centroids (center of
gravity, or mean) of the points in a cluster c:

    μ(c) = (1/|c|) · Σ_{x ∈ c} x

• Reassignment of instances to clusters is
based on distance to the current cluster
centroids.
Distance Metrics

[Slide: distance metric formulas]
K-Means Algorithm
Let d be the distance measure between instances.
Select k random instances {s1, s2, …, sk} as seeds.
Until clustering converges or another stopping criterion is met:
    For each instance xi:
        Assign xi to the cluster cj such that d(xi, sj) is minimal.
    (Update the seeds to the centroid of each cluster.)
    For each cluster cj:
        sj = μ(cj)
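The pseudocode above translates almost line-for-line into Python. This is a minimal sketch (Lloyd's algorithm) for points given as coordinate tuples; a fixed random seed is used so the run is reproducible.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Assign each point to its nearest seed, then move seeds to centroids,
    until no seed moves (convergence) or max_iters is reached."""
    rng = random.Random(seed)
    seeds = rng.sample(points, k)          # k random instances as initial seeds
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assignment step: each xi goes to the cluster cj with minimal d(xi, sj).
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda j: math.dist(x, seeds[j]))
            clusters[j].append(x)
        # Update step: move each seed to the centroid μ(cj) of its cluster.
        new_seeds = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else seeds[j]
            for j, c in enumerate(clusters)
        ]
        if new_seeds == seeds:             # converged: no seed moved
            break
        seeds = new_seeds
    return seeds, clusters
```

On well-separated data the algorithm converges in a handful of iterations regardless of which points are sampled as seeds.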
K-Means Example (K = 2)

[Figure: pick seeds → reassign clusters → compute centroids →
reassign clusters → compute centroids → reassign clusters → converged]
K-Means Objective
• The objective of k-means is to minimize the
total sum of the squared distances of every
point to its corresponding cluster centroid:

    Σ_{l=1}^{K} Σ_{xi ∈ Xl} ||xi − μl||²

• Finding the global optimum is NP-hard.
• The k-means algorithm is guaranteed to
converge to a local optimum.
Seed Choice
• Results can vary based on random seed
selection.
• Some seeds can result in a poor convergence
rate, or convergence to sub-optimal clusters.
• Select good seeds using a heuristic or the
results of another method.
• It may be better to repeat several times with
different seeds and choose the best result.
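The "repeat with different seeds and keep the best result" idea can be sketched by running k-means several times and keeping the run with the lowest objective (the sum of squared distances defined above). The small `kmeans_once` helper below is a compact restatement of the algorithm from the earlier slide.

```python
import math
import random

def kmeans_once(points, k, seed):
    # One k-means run from one random seed set; returns (centroids, objective).
    rng = random.Random(seed)
    seeds = rng.sample(points, k)
    while True:
        clusters = [[] for _ in range(k)]
        for x in points:
            clusters[min(range(k), key=lambda j: math.dist(x, seeds[j]))].append(x)
        new = [tuple(sum(v) / len(c) for v in zip(*c)) if c else seeds[j]
               for j, c in enumerate(clusters)]
        if new == seeds:                   # converged (guaranteed to happen)
            break
        seeds = new
    # Objective: total squared distance of each point to its cluster centroid.
    sse = sum(math.dist(x, seeds[j]) ** 2
              for j, c in enumerate(clusters) for x in c)
    return seeds, sse

def kmeans_restarts(points, k, restarts=10):
    """Run k-means from several random seed sets; keep the lowest-objective run."""
    return min((kmeans_once(points, k, seed=r) for r in range(restarts)),
               key=lambda result: result[1])
```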
Seed Choice (worked example)
• Results can vary based on random seed selection.

[Figures: one possible clustering with 3 random seeds, then another
three random seeds. Iteration 1: calculate the distance from each
point to the three seeds and assign each point to its closest seed
until all points are clustered, then update the seeds' locations to
the centers of the clusters. Iteration 2: recalculate the distance
from each point to the three seeds and reassign.]
Seed Choice
• Cluster all points.
• If the seeds' new locations are different from the locations in the
previous iteration, then update their locations and go to the next
iteration.
• Otherwise, stop iterating: we have the final clusters.
How to determine the number of clusters?
• An open problem
• Larger K:
– More homogeneity within clusters
– Less separation between clusters
• Smaller K:
– The opposite
• Many heuristic methods have been
proposed; none is uniformly good.
[Figures: the same data set clustered into three clusters vs. four clusters]

Sec. 16.3
What Is A Good Clustering?
• Internal criterion: A good clustering will
produce high-quality clusters in which:
– the intra-cluster (i.e. within-cluster) similarity is
high
– the inter-cluster (i.e. between-cluster) similarity
is low
– For any given data set there is usually a trade-off
between the two, and you cannot optimize both.
External criteria for clustering quality
• Quality is measured by the clustering's ability to
discover some or all of the hidden patterns or
latent classes in gold-standard data.
• Assesses a clustering with respect to ground
truth … requires labeled data.
• Assume documents with C gold-standard
classes, while our clustering algorithm
produces K clusters, ω1, ω2, …, ωK, with ni
members each.
External Evaluation of Cluster Quality
• Simple measure: purity, the ratio between the
size of the dominant class in cluster ωi and
the size of cluster ωi:

    Purity(ωi) = (1/ni) · max_{j ∈ C} (nij)

• Biased, because having n clusters (one per
point) maximizes purity.
• Others are entropy of classes in clusters (or
mutual information between classes and
clusters).
Conclusions
• Unsupervised learning induces categories
from unlabeled data.
• There are a variety of approaches, including:
– HAC
– k-means
– Many more
Hidden Markov Models
Slides based on the tutorial by Eric Fosler-Lussier
http://www.di.ubi.pt/~jpaulo/competence/tutorials/hmm-tutorial-1.pdf
Markov Models
Three types of weather: sunny, rainy, and foggy
• Assume the weather lasts all day (it doesn't
change from rainy to sunny in the middle of the
day).
• Weather prediction is all about trying to guess
what the weather will be like tomorrow based
on a history of observations of weather.
• For example, if we knew that the weather for
the past three days was {sunny, sunny, foggy}
in chronological order, the probability that
tomorrow would be rainy is given by:

    P(w4 = rainy | w1 = sunny, w2 = sunny, w3 = foggy)

• The larger n is, the more statistics we must
collect. Suppose that n = 5; then we must collect
statistics for 3^5 = 243 histories.
Markov Assumption
• This is called a first-order Markov assumption,
since we say that the probability of an
observation at time n only depends on the
observation at time n−1:

    P(wn | w1, …, wn−1) ≈ P(wn | wn−1)

• We can express the joint distribution using the
Markov assumption:

    P(w1, …, wn) = Π_{i=1}^{n} P(wi | wi−1)

• Now the number of statistics that we need to
collect is 3^2 = 9.
• Given that today is sunny, what is the
probability that tomorrow is sunny and the
day after is rainy?

    P(w2 = sunny, w3 = rainy | w1 = sunny)
    = P(w3 = rainy | w2 = sunny) · P(w2 = sunny | w1 = sunny)
Given that today is foggy, what's the probability
that it will be rainy two days from now?
• There are three ways to get from foggy today to
rainy two days from now: {foggy, foggy, rainy},
{foggy, rainy, rainy}, and {foggy, sunny, rainy}.
Therefore we have to sum over these paths:

    P(w3 = rainy | w1 = foggy)
    = Σ_{w2} P(w3 = rainy | w2) · P(w2 | w1 = foggy)
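The sum over intermediate paths can be computed mechanically from a transition table. The specific probabilities below are illustrative values assumed for this sketch (the slides defer the actual table to the tutorial's transition diagram); swap in the real numbers to reproduce the tutorial's answer.

```python
# Assumed transition probabilities P(tomorrow | today) -- illustrative only.
T = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5},
}

def two_day_prob(today, target):
    # P(w3 = target | w1 = today): sum over every intermediate weather w2
    # of P(w3 = target | w2) · P(w2 | w1 = today).
    return sum(T[w2][target] * T[today][w2] for w2 in T)
```

With these assumed numbers, `two_day_prob("foggy", "rainy")` sums the three paths (foggy→foggy→rainy, foggy→rainy→rainy, foggy→sunny→rainy) to 0.5·0.3 + 0.3·0.6 + 0.2·0.05 = 0.34.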
• Note that you have to know where you start
from.
• Usually Markov models start with a null start
state and have transitions to other states with
certain probabilities.
• In the previous problems you can just add a
start state with a single arc with probability 1
to the initial state (sunny in problem 1 and
foggy in problem 2).
Hidden Markov Models
What makes a Hidden Markov Model?
• Suppose you were locked in a room for several
days, and you were asked about the weather
outside. The only piece of evidence you have
is whether the person who comes into the
room carrying your daily meal is carrying an
umbrella or not.
Example

[Figure: HMM for the weather example, with umbrella observations]

HMM
• The equation for the weather Markov process
before you were locked in the room was:

    P(w1, …, wn) = Π_{i=1}^{n} P(wi | wi−1)

• Now we have to factor in the fact that the actual
weather is hidden from you. We do that by using
Bayes' rule:

    P(w1, …, wn | u1, …, un)
    = P(u1, …, un | w1, …, wn) · P(w1, …, wn) / P(u1, …, un)

HMM
• The probability P(u1, …, un | w1, …, wn)
can be estimated as Π_{i=1}^{n} P(ui | wi)
– if you assume that for all i, given wi, ui is
independent of all uj and wj for all j ≠ i.
• P(u1, …, un) is the prior probability of
seeing a particular sequence of umbrella
events, e.g. {True, False, True}.
HMM Question 1
• Suppose the day you were locked in it was
sunny.
• The next day the caretaker carried an umbrella
into the room.
– Assume that the prior probability of the
caretaker carrying an umbrella on any day is 0.5.
• What is the probability that the second day
was rainy?
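Question 1 is a single application of the Bayes'-rule factorization above. The transition and emission probabilities below are assumed, illustrative values (the slides leave them to the tutorial's tables); only the 0.5 umbrella prior comes from the slide itself.

```python
# Assumed values for this sketch -- illustrative only, not from the slides:
P_trans_from_sunny = {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15}  # P(w2 | w1 = sunny)
P_umbrella = {"sunny": 0.1, "rainy": 0.8, "foggy": 0.3}            # P(u | w)
P_u = 0.5   # prior P(umbrella on any day), given on the slide

# Bayes' rule: P(w2 = rainy | w1 = sunny, u2 = True)
#            = P(u2 = True | w2 = rainy) · P(w2 = rainy | w1 = sunny) / P(u2 = True)
p_rainy_day2 = P_umbrella["rainy"] * P_trans_from_sunny["rainy"] / P_u
```

With these assumed numbers the answer is 0.8 · 0.05 / 0.5 = 0.08.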
HMM Question 2
• Suppose the day you were locked in the room
it was sunny; the caretaker brought in an
umbrella on day 2 but not on day 3.
• Again assume that the prior probability of
the caretaker bringing an umbrella is 0.5.
• What is the probability that it's foggy on day
3?

[Slides: worked solution for Question 2]
Example

[Figure: example HMM diagram]
How many parameters?
• The hidden state space is assumed to consist of one of N possible
values, modeled as a categorical distribution.
• This means that for each of the N possible states that a hidden
variable at time t can be in, there is a transition probability from
this state to each of the N possible states of the hidden variable at
time t+1, for a total of N² transition probabilities.
How many parameters?
• The set of transition probabilities for transitions from any given state must
sum to 1. Because any one transition probability can be determined once
the others are known, there are a total of N(N−1) free transition parameters.
• In addition, for each of the N possible states, there is a set of emission
probabilities. The size of this set depends on the nature of the observed
variable.
• For example, if the observed variable is discrete with M possible values,
governed by a categorical distribution, there will be M−1 separate
parameters per state, for a total of N(M−1) emission parameters over all
hidden states.
• If the observed variable is an M-dimensional vector distributed according
to an arbitrary multivariate Gaussian distribution, there will be M
parameters controlling the means and M(M+1)/2 parameters controlling
the covariance matrix, for a total of N(M + M(M+1)/2) emission parameters
over all hidden states.
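As a sanity check on the counts above, a small helper (hypothetical, not from the slides) can tally the free parameters for either emission type:

```python
def hmm_param_count(n_states, n_obs_values=None, gaussian_dim=None):
    """Free parameters of an HMM with N hidden states, as derived above.

    Pass n_obs_values=M for a discrete (categorical) observed variable, or
    gaussian_dim=M for an M-dimensional multivariate-Gaussian observation."""
    n = n_states
    transitions = n * (n - 1)                  # rows sum to 1, so N(N-1) free
    if n_obs_values is not None:
        emissions = n * (n_obs_values - 1)     # categorical: N(M-1)
    else:
        m = gaussian_dim
        # Gaussian: M means + M(M+1)/2 covariance entries, per state.
        emissions = n * (m + m * (m + 1) // 2)
    return transitions + emissions
```

For instance, the three-weather model with a binary umbrella observation has 3·2 transition plus 3·1 emission parameters, nine in total.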