Goal: Form groups (clusters) of similar records (unsupervised learning)
- Used for segmenting markets into groups of similar customers
- Example: Claritas segmented US neighborhoods based on demographics & income
- In the prior example, clustering was done by eye
- Multiple dimensions require a formal algorithm with:
  - A distance measure
  - A way to use the distance measure in forming clusters
- We will consider two algorithms: hierarchical and non-hierarchical.
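As a concrete illustration of a distance measure, here is a minimal sketch of Euclidean distance between two records; the attribute names and values are hypothetical, not from the slides:

```python
import math

# Two hypothetical customer records: (age in years, income in $1000s)
record_a = (25, 40)
record_b = (55, 75)

def euclidean_distance(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance(record_a, record_b))  # sqrt(30^2 + 35^2) ≈ 46.1
```

Note that income (in the tens) already contributes on a comparable scale to age here only by accident of units; the next slide's normalization step is what makes the comparison fair in general.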
Problem: Raw distance measures are highly influenced by the scale of measurement
- Solution: normalize (standardize) the data first
- Subtract the mean, divide by the standard deviation.
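The standardization step above can be sketched as follows; the variable names and values are hypothetical:

```python
import statistics

# Hypothetical raw data on very different scales: dollars vs. years
incomes = [40000, 55000, 75000, 30000]
ages = [25, 40, 55, 30]

def standardize(values):
    """z-score: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

z_incomes = standardize(incomes)
z_ages = standardize(ages)
# After standardization each variable has mean 0 and std. deviation 1,
# so income no longer dominates the distance computation just because
# it is measured in larger units.
```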
- Start with n clusters (each record is its own cluster)
- Merge the two closest records into one cluster
- At each successive step, the two clusters closest to each other are merged
- The dendrogram, read from the bottom up, illustrates the process:
  - Lines connected lower down are merged earlier
  - Records 10 and 13 will be merged next, after 12 & 21
- Determining the number of clusters:
  - For a given "distance between clusters", a horizontal line intersects the clusters that are that far apart, creating the clusters
  - E.g., at a distance of 4.6 (red line in next slide), the data can be reduced to 2 clusters; the smaller of the two is circled
  - At a distance of 3.6 (green line), the data can be reduced to 6 clusters, including the circled cluster
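The bottom-up merging and the dendrogram cut can be sketched with a pure-Python single-linkage version. This is a simplified stand-in for the algorithm on the slides; the points and the cut distance are hypothetical, not the records or the 4.6/3.6 cuts shown there:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def agglomerative(points, cut_distance):
    """Single-linkage agglomerative clustering.

    Start with each record as its own cluster; repeatedly merge the two
    closest clusters until the closest pair is farther apart than
    cut_distance (equivalent to cutting the dendrogram at that height).
    Returns clusters as lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None  # (distance, i, j) for the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(euclidean(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > cut_distance:
            break  # every remaining pair is above the dendrogram cut
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

# Two well-separated hypothetical groups: cutting at distance 3
# leaves them as 2 clusters.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(agglomerative(pts, cut_distance=3))  # [[0, 1, 2], [3, 4]]
```

Lowering the cut distance splits the data into more, smaller clusters, mirroring the red-line vs. green-line example above.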
- Stability: are clusters and cluster assignments sensitive to slight changes in the inputs? Are cluster assignments in partition B similar to those in partition A?
- Separation: check the ratio of between-cluster (inter-cluster) variation to within-cluster (intra-cluster) variation (higher is better).
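One way the between-to-within variation ratio might be computed, as a sketch with hypothetical partitions (the slides do not prescribe this exact formula):

```python
import statistics

def mean_point(points):
    """Centroid: coordinate-wise mean of a list of points."""
    return tuple(statistics.mean(c) for c in zip(*points))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def variation_ratio(clusters):
    """Ratio of between-cluster to within-cluster variation.

    Within: summed squared distances from each point to its cluster centroid.
    Between: summed squared distances from each cluster centroid to the
    overall centroid, weighted by cluster size. A higher ratio means
    clusters are compact and well separated.
    """
    all_points = [p for c in clusters for p in c]
    grand = mean_point(all_points)
    within = sum(sq_dist(p, mean_point(c)) for c in clusters for p in c)
    between = sum(len(c) * sq_dist(mean_point(c), grand) for c in clusters)
    return between / within

# Hypothetical partitions of the same four points:
good = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]   # tight, well separated
bad = [[(0, 0), (10, 10)], [(0, 1), (10, 11)]]    # mixed groups
print(variation_ratio(good) > variation_ratio(bad))  # True
```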