kmeans.h implements a number of algorithms for **K-means quantization**: Lloyd [lloyd82least], an accelerated version by Elkan [elkan03using], and a large-scale algorithm based on Approximate Nearest Neighbors (ANN). All algorithms support float or double data and can use the $l^1$ or the $l^2$ distance for clustering. Furthermore, all algorithms can take advantage of multiple CPU cores.
Please see K-means fundamentals for a technical description of K-means and of the algorithms implemented here.
The goal of K-means is to partition a dataset into $K$ “compact” clusters. The following example demonstrates using kmeans.h in the C programming language to partition numData float vectors into numCenters clusters using Lloyd's algorithm:
```c
#include <vl/kmeans.h>

double energy ;
double * centers ;
// Use float data and the L2 distance for clustering
VlKMeans * kmeans = vl_kmeans_new (VL_TYPE_FLOAT, VlDistanceL2) ;
// Use Lloyd algorithm
vl_kmeans_set_algorithm (kmeans, VlKMeansLloyd) ;
// Initialize the cluster centers by randomly sampling the data
vl_kmeans_init_centers_with_rand_data (kmeans, data, dimension, numData, numCenters) ;
// Run at most 100 iterations of cluster refinement using Lloyd's algorithm
vl_kmeans_set_max_num_iterations (kmeans, 100) ;
vl_kmeans_refine_centers (kmeans, data, numData) ;
// Obtain the energy of the solution
energy = vl_kmeans_get_energy (kmeans) ;
// Obtain the cluster centers
centers = vl_kmeans_get_centers (kmeans) ;
```
Once the centers have been obtained, new data points can be assigned to clusters by using the vl_kmeans_quantize function:
```c
vl_uint32 * assignments = vl_malloc (sizeof(vl_uint32) * numData) ;
float * distances = vl_malloc (sizeof(float) * numData) ;
vl_kmeans_quantize (kmeans, assignments, distances, data, numData) ;
```
Alternatively, one can directly assign new data points to the closest centers, without bothering with a VlKMeans object.
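For instance, a direct assignment can be sketched as follows. This is a minimal plain-C sketch with a hypothetical helper name (assign_to_centers is not a kmeans.h function); it mirrors what vl_kmeans_quantize computes, assuming the squared $l^2$ distance and centers stored row by row.

```c
#include <stddef.h>

/* Hypothetical helper, not part of kmeans.h: assign each of numData
 * float vectors to the nearest of numCenters centers under the squared
 * l2 distance. Centers are stored contiguously, one per row. */
void assign_to_centers (unsigned * assignments,
                        float const * data, float const * centers,
                        size_t dimension, size_t numData, size_t numCenters)
{
  for (size_t i = 0 ; i < numData ; ++i) {
    double best = -1.0 ;
    unsigned bestk = 0 ;
    for (size_t k = 0 ; k < numCenters ; ++k) {
      /* squared l2 distance between point i and center k */
      double acc = 0.0 ;
      for (size_t d = 0 ; d < dimension ; ++d) {
        double delta = (double)data[i * dimension + d]
                     - (double)centers[k * dimension + d] ;
        acc += delta * delta ;
      }
      if (best < 0 || acc < best) { best = acc ; bestk = (unsigned)k ; }
    }
    assignments[i] = bestk ;
  }
}
```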
There are several considerations that may impact the performance of K-means. First, since K-means is usually solved by a local optimization algorithm, the **initialization method** is important. The following initialization methods are supported:
Method | Function | Description
---------------|-----------------------------------------|-----------------------------------------------
Random samples | vl_kmeans_init_centers_with_rand_data | Random data points
K-means++ | vl_kmeans_init_centers_plus_plus | Random selection biased towards diversity
Custom | vl_kmeans_set_centers | Choose centers (useful to run quantization only)
See Initialization methods for further details. The initialization methods use a randomized selection of the data points; the random number generator is initialized by vl_rand_init.
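The k-means++ seeding idea can be sketched as follows. This is an illustrative plain-C sketch with hypothetical names, not the code behind vl_kmeans_init_centers_plus_plus: the first center is a uniformly random data point, and each subsequent center is a data point sampled with probability proportional to its squared distance from the centers chosen so far, which biases the selection towards diverse, well-spread seeds.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of k-means++ seeding (not a kmeans.h function).
 * Seeds numCenters centers from numData float vectors of the given
 * dimension, writing them row by row into `centers`. */
void plus_plus_seed (float * centers, float const * data,
                     size_t dimension, size_t numData, size_t numCenters,
                     unsigned seed)
{
  double * minDist = malloc (numData * sizeof (double)) ;
  srand (seed) ;
  /* first center: a uniformly random data point */
  size_t first = (size_t)rand () % numData ;
  memcpy (centers, data + first * dimension, dimension * sizeof (float)) ;
  for (size_t i = 0 ; i < numData ; ++i) minDist[i] = 1e300 ;

  for (size_t k = 1 ; k <= numCenters ; ++k) {
    /* update each point's squared distance to its nearest chosen center */
    double total = 0.0 ;
    for (size_t i = 0 ; i < numData ; ++i) {
      double acc = 0.0 ;
      for (size_t d = 0 ; d < dimension ; ++d) {
        double delta = (double)data[i * dimension + d]
                     - (double)centers[(k - 1) * dimension + d] ;
        acc += delta * delta ;
      }
      if (acc < minDist[i]) minDist[i] = acc ;
      total += minDist[i] ;
    }
    if (k == numCenters) break ;
    /* sample the next center with probability proportional to minDist */
    double r = (rand () / (double)RAND_MAX) * total ;
    size_t next = numData - 1 ;
    for (size_t i = 0 ; i < numData ; ++i) {
      r -= minDist[i] ;
      if (r <= 0) { next = i ; break ; }
    }
    memcpy (centers + k * dimension, data + next * dimension,
            dimension * sizeof (float)) ;
  }
  free (minDist) ;
}
```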
The second important choice is the **optimization algorithm**. The following optimization algorithms are supported:
Algorithm | Symbol | See | Description
------------|------------------|-------------------|-----------------------------------------------
Lloyd | VlKMeansLloyd | Lloyd's algorithm | Alternate EM-style optimization
Elkan | VlKMeansElkan | Elkan's algorithm | A speedup using triangular inequalities
ANN | VlKMeansANN | ANN algorithm | A speedup using approximated nearest neighbors
See the respective sections for further details. These algorithms are iterative, and stop when either a **maximum number of iterations** (vl_kmeans_set_max_num_iterations) is reached, or when the energy changes sufficiently slowly in one iteration (vl_kmeans_set_min_energy_variation).
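The interplay of the two stopping criteria can be sketched as follows. This is a hedged sketch with hypothetical names, not the library's internal code; it assumes the energy variation is measured relative to the previous energy, and relies on the fact that the K-means energy is non-increasing under the refinement updates.

```c
/* Hypothetical sketch (not VLFeat internals): stop refining when either
 * the iteration budget is exhausted or the relative energy decrease in
 * one iteration falls below the requested threshold. */
int should_stop (double previousEnergy, double energy,
                 int iteration, int maxNumIterations,
                 double minEnergyVariation)
{
  if (iteration >= maxNumIterations) return 1 ;
  /* relative improvement too small: the optimization has plateaued */
  if (previousEnergy > 0 &&
      (previousEnergy - energy) < minEnergyVariation * previousEnergy) {
    return 1 ;
  }
  return 0 ;
}
```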
All three algorithms support multithreaded computation. The number of threads used is usually controlled globally by vl_set_num_threads.