CS 3753/5163 Intro to Data Science
Clustering
1
Clustering
• Partition unlabeled examples into disjoint
subsets of clusters, such that:
– Examples within a cluster are very similar
– Examples in different clusters are very different
• Discover new categories in an unsupervised
manner (no sample category labels provided).
– Therefore the term “unsupervised learning”
2
Clustering Example
[Figure: scatter plot of unlabeled points falling into several visible groups]
3
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy
(dendrogram) from a set of unlabeled examples.
animal
  vertebrate: fish, reptile, amphibian, mammal
  invertebrate: worm, insect, crustacean
• Recursive application of a standard clustering
algorithm can produce a hierarchical clustering.
4
Agglomerative vs. Divisive Clustering
• Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
• Divisive (partitional, top-down) methods separate all examples into clusters immediately.
5
Direct Clustering Method
• Direct clustering methods require a
specification of the number of clusters, k,
desired.
• A clustering evaluation function assigns a real-valued quality measure to a clustering.
• The number of clusters can be determined “automatically” by explicitly generating clusterings for multiple values of k and choosing the best result according to a clustering evaluation function (a sketch of this follows below).
6
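A minimal sketch of this idea in Python, assuming scikit-learn is available; the toy data and the use of the silhouette score as the evaluation function are illustrative choices, not taken from the slides.

# Sketch: generate clusterings for several values of k and keep the one
# that scores best under a clustering evaluation function (silhouette here).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in ((0, 0), (5, 5), (0, 5))])

best_k, best_score = None, -1.0
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)   # higher is better
    if score > best_score:
        best_k, best_score = k, score

print(f"best k = {best_k}, silhouette = {best_score:.3f}")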
Hierarchical Agglomerative Clustering
(HAC)
• Assumes a similarity function for determining
the similarity of two instances.
• Starts with all instances in a separate cluster
and then repeatedly joins the two clusters that
are most similar until there is only one cluster.
• The history of merging forms a binary tree or
hierarchy.
7
HAC Algorithm
Start with all instances in their own cluster.
Until there is only one cluster:
    Among the current clusters, determine the two clusters, ci and cj, that are most similar.
    Replace ci and cj with a single cluster ci ∪ cj.
8
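A minimal sketch of this loop using SciPy's hierarchical clustering, which merges the two closest clusters until one remains; the toy data, the linkage method, and the distance cutoff are illustrative choices.

# Sketch: hierarchical agglomerative clustering with SciPy.
# linkage() records the sequence of merges (the binary tree / dendrogram);
# fcluster() cuts that tree at a chosen distance to obtain flat clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [9.0, 0.1]])

Z = linkage(X, method="single")                     # 'single', 'complete', or 'average'
labels = fcluster(Z, t=2.0, criterion="distance")   # break the tree at distance 2.0
print(labels)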
HAC
[Figure: dendrogram built over points a, b, c, d, e, f]
• Exact behavior depends on how to compute the distance between two clusters
• No need to specify number of clusters
• A distance cutoff is often chosen to break tree into clusters
9
Cluster Similarity
• Assume a similarity function that determines the
similarity of two instances: sim(x,y).
– Cosine similarity of document vectors.
• How to compute similarity of two clusters each
possibly containing multiple instances?
– Single Link: Similarity of two most similar members.
– Complete Link: Similarity of two least similar members.
– Group Average: Average similarity between members in
the merged cluster.
10
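A minimal Python sketch of the three rules, written against a pairwise similarity function sim(x, y); cosine similarity is used here only as an illustrative choice.

# Sketch: single-link, complete-link, and group-average similarity between
# two clusters A and B, given a pairwise similarity function sim(x, y).
import numpy as np

def sim(x, y):
    # cosine similarity, one possible pairwise similarity
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def single_link(A, B):
    return max(sim(x, y) for x in A for y in B)   # two most similar members

def complete_link(A, B):
    return min(sim(x, y) for x in A for y in B)   # two least similar members

def group_average(A, B):
    merged = list(A) + list(B)                    # average over all pairs in the merged cluster
    pairs = [(x, y) for i, x in enumerate(merged) for y in merged[i + 1:]]
    return sum(sim(x, y) for x, y in pairs) / len(pairs)

A = [np.array([1.0, 0.0]), np.array([0.9, 0.1])]
B = [np.array([0.0, 1.0]), np.array([0.1, 0.9])]
print(single_link(A, B), complete_link(A, B), group_average(A, B))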
Cluster Similarity
• Single Link: Similarity of two most similar
members.
11
Cluster Similarity
• Single Link can separate non-elliptical shapes as long as the
gap between the two clusters is not small.
12
Cluster Similarity
• Single Link can separate non-elliptical shapes as long as the
gap between the two clusters is not small.
13
Cluster Similarity
• Single Link cannot separate clusters properly if there is
noise between clusters.
14
Cluster Similarity
• Complete Link: Similarity of two least similar
members.
15
Cluster Similarity
• Complete Link approach does well in separating clusters if
there is noise between clusters.
16
Cluster Similarity
• Complete Link approach is biased towards globular clusters.
• It tends to break large clusters.
17
Cluster Similarity
• Complete Link approach is biased towards globular clusters.
• It tends to break large clusters.
18
Cluster Similarity
• Group Average: Average similarity between members in the
merged cluster.
• The group Average approach does well in separating
clusters if there is noise between clusters.
19
Single Link Example
[Figure: points a–h and the single-link dendrogram; leaves ordered a b c d e f g h]
20
Complete Link Example
[Figure: the same points and the complete-link dendrogram; leaves ordered a b e f c d g h]
21
Non-Hierarchical Clustering
• Typically must provide the number of desired
clusters, k.
• Usually an optimization problem
22
K-Means
• Assumes instances are real-valued vectors.
• Clusters based on centroids, center of
gravity, or mean of points in a cluster, c:
μ(c) = (1/|c|) Σ_{x ∈ c} x
• Reassignment of instances to clusters is
based on distance to the current cluster
centroids.
23
Distance Metrics
24
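The distance formulas from this slide are not in this extract; below is a minimal sketch of two measures commonly used for clustering real-valued vectors, Euclidean distance and cosine distance, as illustrative examples.

# Sketch: two common distance measures between real-valued vectors.
import numpy as np

def euclidean(x, y):
    return float(np.linalg.norm(x - y))           # sqrt of the sum of squared differences

def cosine_distance(x, y):
    cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return float(1.0 - cos_sim)                   # 0 when the vectors point the same way

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.0, 1.0])
print(euclidean(x, y), cosine_distance(x, y))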
K-Means Algorithm
Let d be the distance measure between instances.
Select k random instances {s1, s2, …, sk} as seeds.
Until clustering converges or other stopping criterion:
    For each instance xi:
        Assign xi to the cluster cj such that d(xi, sj) is minimal.
    (Update the seeds to the centroid of each cluster)
    For each cluster cj:
        sj = μ(cj)
25
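A minimal NumPy sketch that follows the pseudocode above: random seeds, assignment of each instance to its nearest seed (Euclidean distance is an assumed choice), and centroid updates until the seeds stop moving.

# Sketch: k-means following the slide's pseudocode.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    seeds = X[rng.choice(len(X), size=k, replace=False)]   # k random instances as seeds
    for _ in range(max_iter):
        # assign each instance to the cluster whose seed is closest
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update each seed to the centroid (mean) of its cluster
        new_seeds = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else seeds[j]
                              for j in range(k)])
        if np.allclose(new_seeds, seeds):                   # converged
            break
        seeds = new_seeds
    return labels, seeds

X = np.vstack([np.random.default_rng(1).normal(c, 0.3, size=(30, 2)) for c in ((0, 0), (3, 3))])
labels, centroids = kmeans(X, k=2)
print(centroids)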
K-Means Example (K=2)
[Figure: pick seeds, reassign clusters, compute centroids, reassign clusters, compute centroids, reassign clusters, converged]
26
K-Means Objective
• The objective of k-means is to minimize the
total sum of the squared distance of every
point to its corresponding cluster centroid.
Σ_{l=1}^{K} Σ_{xi ∈ Xl} ||xi - μl||²
• Finding the global optimum is NP-hard.
• The k-means algorithm is guaranteed to
converge to a local optimum.
27
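A small sketch of evaluating this objective for a given clustering; the function name and inputs are illustrative and match the kmeans sketch above.

# Sketch: the k-means objective, the total within-cluster sum of squared distances.
import numpy as np

def kmeans_objective(X, labels, centroids):
    # sum over clusters l, and over points x_i assigned to cluster l, of ||x_i - mu_l||^2
    return float(sum(np.sum((X[labels == l] - mu) ** 2) for l, mu in enumerate(centroids)))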
Seed Choice
• Results can vary based on random seed
selection.
• Some seeds can result in poor convergence
rate, or convergence to sub-optimal clusters.
• Select good seeds using a heuristic or the results of another method.
• It may be better to repeat several times with different seeds and choose the best result (a sketch of this follows below).
28
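A minimal sketch of the repeat-with-different-seeds idea, assuming scikit-learn: run k-means from several random seeds and keep the run with the lowest objective (inertia). The data is a toy example.

# Sketch: multiple restarts of k-means; keep the clustering with the lowest objective.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, size=(40, 2)) for c in ((0, 0), (4, 0), (2, 3))])

best = None
for seed in range(10):
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
    if best is None or km.inertia_ < best.inertia_:   # lower objective is better
        best = km

print(best.inertia_)
print(best.cluster_centers_)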
Seed Choice
• Results can vary based on random seed selection.
One possible clustering with 3 random seeds
29
Seed Choice
• Results can vary based on random seed selection.
Another three random seeds
30
Seed Choice
• Results can vary based on random seed selection.
Another three random seeds
31
Seed Choice
• Results can vary based on random seed selection.
Another three random seeds
Iteration 1: calculate the distance from each point to each of the three seeds
Assign the first point to the closest seed
32
Seed Choice
• Results can vary based on random seed selection.
All points are clustered
33
Seed Choice
• Results can vary based on random seed selection.
All points are clustered
Update seeds’ locations to the centers of clusters
34
Seed Choice
• Results can vary based on random seed selection.
All points are clustered
Update seeds’ locations to the centers of clusters
Iteration 2: calculate the distance from each point to each of the three seeds
35
Seed Choice
• Cluster all points.
• If the seeds’ new locations are different from their locations in the previous iteration, then update their locations and go to the next iteration.
• Otherwise, stop iterating; the resulting clusters are final.
36
How to determine number of clusters?
• An open problem
• Larger K:
– More homogeneity within clusters
– Less separation between clusters
• Smaller K:
– The opposite
• Many heuristic methods have been proposed; none is uniformly good.
How to determine number of clusters?
[Figures: the same data shown clustered into three clusters and into four clusters]
Sec. 16.3
What Is A Good Clustering?
• Internal criterion: A good clustering will
produce high quality clusters in which:
– the intra-cluster (i.e. within-cluster) similarity is
high
– the inter-cluster (i.e. between-cluster) similarity
is low
– For any given data set, usually there is a tradeoff between the two and you cannot optimize
both
Sec. 16.3
External criteria for clustering quality
• Quality measured by its ability to discover
some or all of the hidden patterns or latent
classes in gold standard data
• Assesses a clustering with respect to ground
truth … requires labeled data
• Assume documents with C gold standard
classes, while our clustering algorithms
produce K clusters, ω1, ω2, …, ωK with ni
members.
Sec. 16.3
External Evaluation of Cluster Quality
• Simple measure: purity, the ratio between the number of members of the dominant class in cluster ωi and the size of cluster ωi:
  Purity(ωi) = (1/ni) max_{j ∈ C} nij
• Biased because having n clusters (one point per cluster) maximizes purity
• Others are entropy of classes in clusters (or
mutual information between classes and
clusters)
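A minimal sketch of purity computed from cluster assignments and gold-standard class labels; overall purity here is the size-weighted average of the per-cluster values defined above.

# Sketch: purity of a clustering against gold-standard class labels.
from collections import Counter

def purity(cluster_labels, class_labels):
    clusters = {}
    for w, c in zip(cluster_labels, class_labels):
        clusters.setdefault(w, []).append(c)
    # for each cluster, count its dominant class; sum and divide by the total number of items
    dominant = sum(max(Counter(members).values()) for members in clusters.values())
    return dominant / len(class_labels)

# toy example: 3 clusters over 8 items drawn from classes 'x' and 'o'
print(purity([0, 0, 0, 1, 1, 2, 2, 2], ['x', 'x', 'o', 'o', 'o', 'x', 'o', 'o']))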
Conclusions
• Unsupervised learning induces categories
from unlabeled data.
• There are a variety of approaches, including:
– HAC
– k-means
– Many more
44
Hidden Markov Models
Slides based on the tutorial by Eric Fosler-Lussier
http://www.di.ubi.pt/~jpaulo/competence/tutorials/hmm-tutorial-1.pdf
Markov Models
Three types of weather: sunny, rainy, and foggy
• Assume: the weather lasts all day (it doesn’t change from rainy to sunny in the middle of the day)
• Weather prediction is all about trying to guess what the weather will be like tomorrow based on a history of observations of weather
• For example, if we knew that the weather for the past three days was {sunny, sunny, foggy} in chronological order, the probability that tomorrow would be rainy is given by:
  P(w4 = rainy | w1 = sunny, w2 = sunny, w3 = foggy)
• The larger n is, the more statistics we must collect. Suppose that n = 5; then we must collect statistics for 3^5 = 243 past histories.
Markov Assumption
• This is called a first-order Markov assumption, since we say that the probability of an observation at time n only depends on the observation at time n-1.
Markov Assumption
• P(wn | w1, …, wn-1) ≈ P(wn | wn-1)
• We can express the joint distribution using the Markov assumption:
  P(w1, …, wn) = ∏i P(wi | wi-1)
• Now, the number of statistics that we need to collect is 3^2 = 9.
• Given that today is sunny, what is the probability that tomorrow is sunny and the day after is rainy?
  P(w2 = sunny, w3 = rainy | w1 = sunny) = P(w2 = sunny | w1 = sunny) · P(w3 = rainy | w2 = sunny)
Given that today is foggy, what’s the probability that it will be rainy two days from now?
• There are three ways to get from foggy today to rainy two days from now: {foggy, foggy, rainy}, {foggy, rainy, rainy}, and {foggy, sunny, rainy}.
• Therefore we have to sum over these paths:
  P(w3 = rainy | w1 = foggy) = Σ_{w2} P(w2 | w1 = foggy) · P(w3 = rainy | w2)
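A small Python sketch of this path sum. The transition probabilities themselves appear only in a figure that is not in this extract, so the table below uses assumed, illustrative values.

# Sketch: P(rainy two days from now | foggy today), summing over tomorrow's weather.
# The transition table is assumed for illustration (the slide's actual table is in a lost figure).
P = {  # P[today][tomorrow]
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5},
}

# sum over the hidden middle day w2: P(w2 | foggy) * P(rainy | w2)
prob = sum(P["foggy"][w2] * P[w2]["rainy"] for w2 in P)
print(prob)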
• Note that you have to know where you start from.
• Usually Markov models start with a null start state and have transitions to other states with certain probabilities.
• In the previous problems you can just add a start state with a single arc with probability 1 to the initial state (sunny in problem 1 and foggy in problem 2).
Hidden Markov Models
What makes a Hidden Markov Model?
• Suppose you were locked in a room for several days and you were asked about the weather outside. The only piece of evidence you have is whether the person who comes into the room carrying your daily meal is carrying an umbrella or not.
Example  
HMM
• The equation for the weather Markov process before you were locked in the room was:
  P(w1, …, wn) = ∏i P(wi | wi-1)
• Now we have to factor in the fact that the actual weather is hidden from you. We do that by using Bayes’ rule:
  P(w1, …, wn | u1, …, un) = P(u1, …, un | w1, …, wn) · P(w1, …, wn) / P(u1, …, un)
  where ui is true if the caretaker brought an umbrella on day i.
HMM
• The probability P(u1, …, un | w1, …, wn) can be estimated as ∏i P(ui | wi)
  – if you assume that, for all i, given wi, ui is independent of all uj and wj for all j ≠ i
• P(u1, …, un) is the prior probability of seeing a particular sequence of umbrella events, e.g. {True, False, True}
HMM Question 1
• Suppose the day you were locked in it was sunny.
• The next day the caretaker carried an umbrella into the room.
  – Assume that the prior probability of the caretaker carrying an umbrella on any day is 0.5.
• What is the probability that the second day was rainy?
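A small sketch of the Bayes’-rule computation for Question 1. The transition and umbrella (emission) probabilities below are assumed, illustrative values, since the actual tables are in figures not included here; P(umbrella) = 0.5 is the prior given on the slide.

# Sketch: P(day 2 is rainy | day 1 sunny, umbrella seen on day 2), via Bayes' rule:
#   P(w2 = rainy | w1 = sunny, u2) = P(u2 | w2 = rainy) * P(w2 = rainy | w1 = sunny) / P(u2)
# All table values are assumed for illustration.
P_trans_from_sunny = {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15}   # P(w2 | w1 = sunny)
P_umbrella = {"sunny": 0.1, "rainy": 0.8, "foggy": 0.3}             # P(umbrella | weather)
P_u = 0.5                                                           # prior given on the slide

posterior = P_umbrella["rainy"] * P_trans_from_sunny["rainy"] / P_u
print(posterior)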
HMM Question 2
• Suppose the day you were locked in the room it was sunny; the caretaker brought in an umbrella on day 2 but not on day 3.
• Again assume that the prior probability of the caretaker bringing an umbrella is 0.5.
• What is the probability that it’s foggy on day 3?
 
HMM  Q2  
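The worked solution for Question 2 is not in this extract; below is a sketch under the same assumed tables, marginalizing over the hidden day-2 weather and using 0.5 as the per-day umbrella prior, as on the slides.

# Sketch: P(day 3 foggy | day 1 sunny, umbrella on day 2, no umbrella on day 3).
# All probability tables are assumed, illustrative values.
P_trans = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5},
}
P_umbrella = {"sunny": 0.1, "rainy": 0.8, "foggy": 0.3}   # P(umbrella | weather)

# marginalize over the hidden day-2 weather w2
num = sum(P_trans["sunny"][w2] * P_umbrella[w2] * P_trans[w2]["foggy"] for w2 in P_trans)
num *= (1.0 - P_umbrella["foggy"])                        # P(no umbrella | foggy) on day 3
posterior = num / (0.5 * 0.5)                             # divide by the priors P(u2) * P(u3)
print(posterior)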
Example  
How many parameters?
• The hidden state space is assumed to consist of one of N possible values, modeled as a categorical distribution.
• This means that for each of the N possible states that a hidden variable at time t can be in, there is a transition probability from this state to each of the N possible states of the hidden variable at time t+1, for a total of N² transition probabilities.
How many parameters?
• The set of transition probabilities for transitions from any given state must sum to 1. Because any one transition probability can be determined once the others are known, there are a total of N(N-1) transition parameters.
• In addition, for each of the N possible states, there is a set of emission probabilities. The size of this set depends on the nature of the observed variable.
• For example, if the observed variable is discrete with M possible values, governed by a categorical distribution, there will be M-1 separate parameters, for a total of N(M-1) emission parameters over all hidden states.
• If the observed variable is an M-dimensional vector distributed according to an arbitrary multivariate Gaussian distribution, there will be M parameters controlling the means and M(M+1)/2 parameters controlling the covariance matrix, for a total of N(M + M(M+1)/2) emission parameters.
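A small sketch of these counts for example values of N and M; the values themselves are illustrative.

# Sketch: HMM parameter counts for N hidden states and M discrete observation symbols.
N, M = 3, 2   # e.g. 3 weather states, 2 umbrella observations

transition_params = N * (N - 1)                         # each row of the transition matrix sums to 1
discrete_emission_params = N * (M - 1)                  # categorical emissions: M-1 free per state
gaussian_emission_params = N * (M + M * (M + 1) // 2)   # mean plus covariance per state, M-dim Gaussian

print(transition_params, discrete_emission_params, gaussian_emission_params)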
