???example "Clustering Sky Objects"
    - A catalog of 2 billion sky objects represents objects by their radiation in 7 dimensions (frequency bands)
    - Problem: cluster into similar objects, e.g., galaxies, stars, quasars, etc.
<figure markdown="span">

<figcaption>36 of the 500+ Type Ia supernovae discovered by the Sloan Supernova Survey</figcaption>
</figure>
???example "Clustering music albums"
    - Music divides into categories, and customers prefer a few categories
        - Are categories simply genres?
    - Represent an album by the set of customers who bought it
    - Similar albums have similar sets of customers, and vice-versa
    - Space of all albums:
        - Think of a space with one dimension for each customer
        - Values in a dimension may be 0 or 1 only

    ???abstract "Data representation"
        - An album is a point in this space $(x_1, x_2, \dots, x_k)$, where $x_i = 1$ if and only if the $i^{th}$ customer bought the CD
        - For Amazon, the dimension is tens of millions
        - Find clusters of similar CDs
???example "Clustering documents"
    - Finding topics
        - Group together documents on the same topic
        - Documents with similar sets of words may be about the same topic
    - Dual formulation: a topic is a group of words that co-occur in many documents

    ???abstract "Data representation"
        - Represent a document by a vector $(x_1, x_2, \dots, x_k)$, where $x_i = 1$ if and only if the $i^{th}$ word (in some order) appears in the document
        - Documents with similar sets of words may be about the same topic
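A minimal sketch of this 0/1 "set of words" representation. The tiny vocabulary, the documents, and the use of Jaccard similarity to compare the resulting vectors are illustrative assumptions, not part of the notes:

```python
# Sketch: represent a document (or album) as a 0/1 vector over a fixed
# vocabulary (or customer list). The data below is made up for illustration.
vocabulary = ["galaxy", "star", "quasar", "guitar", "album"]

def to_vector(doc_words):
    """x_i = 1 iff the i-th vocabulary word appears in the document."""
    words = set(doc_words)
    return [1 if w in words else 0 for w in vocabulary]

def jaccard(u, v):
    """One plausible similarity for 0/1 vectors: |intersection| / |union|."""
    inter = sum(1 for a, b in zip(u, v) if a and b)
    union = sum(1 for a, b in zip(u, v) if a or b)
    return inter / union if union else 0.0

d1 = to_vector(["galaxy", "star", "quasar"])
d2 = to_vector(["star", "quasar", "album"])
print(jaccard(d1, d2))  # 0.5 -- the two documents share half their combined words
```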
In short: a document is a set of words, i.e., a point in the space of words; $x_i = 1$ iff word $i$ appears in the doc, so each document is a vector in the space of words.

=== "Hierarchical"
    - Agglomerative (bottom up): start with each point as its own cluster; repeatedly combine the two nearest clusters.
    - Divisive (top down): start with one cluster and recursively split it.
    - Key operation: repeatedly combine the two nearest clusters, as in the sketch below.
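    A minimal sketch of the agglomerative key operation, assuming Euclidean points and centroid-based cluster distance (the function names and the choice of centroid distance are illustrative, not prescribed by the notes):

    ```python
    from itertools import combinations

    def agglomerate(points, k):
        """Naive bottom-up clustering: merge the two nearest clusters until k remain."""
        clusters = [[p] for p in points]      # each point starts as its own cluster

        def centroid(c):
            return tuple(sum(xs) / len(c) for xs in zip(*c))

        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

        while len(clusters) > k:
            # scan all pairs for the two nearest clusters (O(N^2) per merge)
            i, j = min(combinations(range(len(clusters)), 2),
                       key=lambda ij: dist(centroid(clusters[ij[0]]),
                                           centroid(clusters[ij[1]])))
            clusters[i] += clusters.pop(j)    # i < j, so popping j is safe
        return clusters
    ```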
=== "Point assignment"
    - Maintain a set of clusters
    - Points belong to the "nearest" cluster
Three important questions:
???info "1. How do you represent a cluster of more than one point?"
    - Key problem: as you merge clusters, how do you represent the "location" of each cluster, to tell which pair of clusters is closest?
    - Euclidean case:
        - Each cluster has a centroid = average of its (data)points
    - Non-Euclidean case:
        - The only "locations" we can talk about are the points themselves, i.e., there is no "average" of two points
        - Clustroid = (data)point closest to the other points:
            - Smallest maximum distance to other points
            - Smallest average distance to other points
            - Smallest sum of squares of distances to other points
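A short sketch contrasting the two representations; it uses the "smallest sum of squares of distances" criterion, one of the options listed above, and takes the distance function as a parameter (the names are illustrative):

```python
def centroid(cluster):
    """Euclidean case: the average of the cluster's (data)points."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

def clustroid(cluster, dist):
    """Non-Euclidean case: the data point minimizing the sum of squared
    distances to the other points (one of the three criteria above)."""
    return min(cluster, key=lambda p: sum(dist(p, q) ** 2 for q in cluster))
```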
???info "2. How do you determine the 'nearness' of clusters?"
    - Measure cluster distances by the distances of centroids or clustroids
    - Alternatively, define the intercluster distance as the minimum of the distances between any two points, one from each cluster
    - Or pick a notion of "cohesion" of clusters, e.g., maximum distance from the clustroid
        - Merge the clusters whose union is most cohesive
???info "3. When to stop combining clusters?"
    - Stop when there is minimal difference between iterations
    - Naive implementation: $O(N^3)$
    - With a priority queue: $O(N^2 \log N)$
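A sketch of the priority-queue variant behind the $O(N^2 \log N)$ bound: build a heap of all pairwise distances once, then lazily skip entries whose clusters have already been merged. The code layout and centroid-based distance are illustrative assumptions:

```python
import heapq

def agglomerate_pq(points, k):
    """Agglomerative clustering with a priority queue of pairwise distances.
    Building the heap is O(N^2); each merge pops stale entries lazily and
    pushes O(N) new ones, giving O(N^2 log N) overall vs O(N^3) naive."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    clusters = {i: [p] for i, p in enumerate(points)}   # cluster id -> members
    cent = {i: p for i, p in enumerate(points)}         # cluster id -> centroid
    heap = [(dist(cent[i], cent[j]), i, j) for i in cent for j in cent if i < j]
    heapq.heapify(heap)
    next_id = len(points)

    while len(clusters) > k:
        _, i, j = heapq.heappop(heap)
        if i not in clusters or j not in clusters:
            continue                                    # stale entry: skip it
        merged = clusters.pop(i) + clusters.pop(j)
        c = tuple(sum(xs) / len(merged) for xs in zip(*merged))
        for other in clusters:                          # distances to the new cluster
            heapq.heappush(heap, (dist(c, cent[other]), other, next_id))
        clusters[next_id], cent[next_id] = merged, c
        next_id += 1
    return list(clusters.values())
```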
???abstract "k-means"
    - Assumes Euclidean space/distance
    - Start by picking $k$, the number of clusters
    - Initialize the $k$ clusters by picking one point per cluster, e.g., for a dispersed set of points, pick one point at random, then $k-1$ more points, each as far from the previously selected points as possible
    - How to select $k$? Try different values of $k$, looking at the change in the average distance to centroid as $k$ increases: it falls rapidly until the right $k$, then changes little

???example "Visual example"
???warning "Too few"

- Many long distances to centroid
???success "Just right"

- Distances are relatively short
???warning "Too many"

- Little improvement in avaerage distance
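A sketch of the "average distance to centroid" diagnostic for choosing $k$, bundled with a plain Lloyd's-style k-means and the farthest-point initialization suggested above (all names, the iteration count, and the seed are illustrative assumptions):

```python
import random

def d(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns the final centroids (a sketch only)."""
    rng = random.Random(seed)
    # pick one point at random, then k-1 more, each as far as possible
    cents = [rng.choice(points)]
    while len(cents) < k:
        cents.append(max(points, key=lambda p: min(d(p, c) for c in cents)))
    for _ in range(iters):
        groups = {i: [] for i in range(k)}
        for p in points:                       # assign each point to nearest centroid
            groups[min(range(k), key=lambda i: d(p, cents[i]))].append(p)
        cents = [tuple(sum(xs) / len(g) for xs in zip(*g)) if g else cents[i]
                 for i, g in groups.items()]   # recompute centroids
    return cents

def avg_dist_to_centroid(points, cents):
    return sum(min(d(p, c) for c in cents) for p in points) / len(points)

# Try increasing k: the average falls rapidly until the "right" k,
# then changes little.
# for k in range(1, 10):
#     print(k, avg_dist_to_centroid(points, kmeans(points, k)))
```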
???abstract "BFR"
    - Points are read from disk one main-memory-full at a time
    - Most points from previous memory loads are summarized by simple statistics
    - To begin, from the initial load we select the initial $k$ centroids by some sensible approach:
        - Take $k$ random points
        - Take a small random sample and cluster optimally
        - Take a sample; pick a random point, and then $k-1$ more points, each as far from the previously selected points as possible

???info "Discard set (DS)"
    - Points close enough to a centroid to be summarized

???info "Compression set (CS)"
    - Groups of points that are close together but not close to any existing centroid
    - These points are summarized, but not assigned to a cluster

???info "Retained set (RS)"
    - Isolated points waiting to be assigned to a compression set
Each cluster in the DS, and each group in the CS, is summarized by:

- $N$: the number of points
- SUM: whose $i^{th}$ component is the sum of the coordinates of the points in the $i^{th}$ dimension
- SUMSQ: whose $i^{th}$ component is the sum of squares of the coordinates in the $i^{th}$ dimension
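A sketch of these per-cluster statistics and the quantities they yield. The centroid (SUM/$N$) and per-dimension variance (SUMSQ/$N - ($SUM$/N)^2$) follow directly from the definitions; the class layout itself is an illustrative assumption:

```python
from dataclasses import dataclass

@dataclass
class ClusterStats:
    """2d+1 values summarize a cluster of d-dimensional points."""
    n: int        # N: number of points
    sum: list     # SUM[i]: sum of coordinates in dimension i
    sumsq: list   # SUMSQ[i]: sum of squared coordinates in dimension i

    def add(self, p):
        self.n += 1
        for i, x in enumerate(p):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # per-dimension variance: E[x^2] - E[x]^2
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]
```

Adding a point, or merging two summaries, touches only these $2d+1$ numbers; no raw points need to be kept in memory.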
???tip "1. How do we decide if a point is 'close enough' to a cluster that we will add the point to that cluster?"
    BFR suggests two approaches:
    - The Mahalanobis distance is less than a threshold
        - Normalized Euclidean distance from the centroid
        - For a point $(x_1, x_2, \dots, x_d)$ and centroid $(c_1, c_2, \dots, c_d)$:
            - Normalize in each dimension: $y_i = \frac{x_i - c_i}{\sigma_i}$, where $\sigma_i$ is the standard deviation of the cluster's points in the $i^{th}$ dimension
            - Take the sum of the squares of the $y_i$, then take the square root: $d(x, c) = \sqrt{\sum_{i=1}^{d} \left(\frac{x_i - c_i}{\sigma_i}\right)^2}$
    - High likelihood of the point belonging to the currently nearest centroid
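    A sketch of the Mahalanobis test using the `ClusterStats` summary above; the threshold value is a tunable assumption (a few standard deviations is a common heuristic):

    ```python
    def mahalanobis(point, stats):
        """Normalized Euclidean distance from the cluster centroid, computed
        from the N/SUM/SUMSQ summary alone (sigma_i^2 from the variance).
        Assumes nonzero variance in every dimension."""
        c = stats.centroid()
        var = stats.variance()
        return sum((x - ci) ** 2 / v for x, ci, v in zip(point, c, var)) ** 0.5

    def close_enough(point, stats, threshold=3.0):
        # e.g., accept if within `threshold` standard deviations after normalizing
        return mahalanobis(point, stats) < threshold
    ```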
???tip "2. How do we decide whether two compressed sets (CS) deserve to be combined into one?"
    - Compute the variance of the combined subcluster
        - $N$, SUM, and SUMSQ allow us to make that calculation quickly
    - Combine if the combined variance is below some threshold
    - Many alternatives: treat dimensions differently, consider density
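    A sketch of the combination test: the summaries add component-wise, so the combined variance comes straight from the merged statistics. The threshold, and the choice to test the worst dimension, are assumed tunables:

    ```python
    def merge_stats(a, b):
        """Summaries are additive: the union's N/SUM/SUMSQ need no raw points."""
        return ClusterStats(
            n=a.n + b.n,
            sum=[x + y for x, y in zip(a.sum, b.sum)],
            sumsq=[x + y for x, y in zip(a.sumsq, b.sumsq)],
        )

    def should_combine(a, b, threshold):
        # combine only if the merged subcluster stays tight in every dimension
        return max(merge_stats(a, b).variance()) < threshold
    ```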
