kmeans.h implements a number of algorithms for **K-means quantization**: Lloyd [lloyd82least], an accelerated version by Elkan [elkan03using], and a large-scale algorithm based on Approximate Nearest Neighbors (ANN). All algorithms support float or double data and can use the $l^1$ or the $l^2$ distance for clustering. Furthermore, all algorithms can take advantage of multiple CPU cores.
Please see K-means fundamentals for a technical description of K-means and of the algorithms implemented here.
The goal of K-means is to partition a dataset into $K$ “compact” clusters. The following example demonstrates using kmeans.h in the C programming language to partition numData float vectors into numCenters clusters using Lloyd's algorithm:
```c
#include <vl/kmeans.h>

double energy ;
double * centers ;
// Use float data and the L2 distance for clustering
VlKMeans * kmeans = vl_kmeans_new (VL_TYPE_FLOAT, VlDistanceL2) ;
// Use Lloyd algorithm
vl_kmeans_set_algorithm (kmeans, VlKMeansLloyd) ;
// Initialize the cluster centers by randomly sampling the data
vl_kmeans_init_centers_with_rand_data (kmeans, data, dimension, numData, numCenters) ;
// Run at most 100 iterations of cluster refinement using Lloyd's algorithm
vl_kmeans_set_max_num_iterations (kmeans, 100) ;
vl_kmeans_refine_centers (kmeans, data, numData) ;
// Obtain the energy of the solution
energy = vl_kmeans_get_energy (kmeans) ;
// Obtain the cluster centers
centers = vl_kmeans_get_centers (kmeans) ;
```
Once the centers have been obtained, new data points can be assigned to clusters by using the vl_kmeans_quantize function:
```c
vl_uint32 * assignments = vl_malloc (sizeof(vl_uint32) * numData) ;
float * distances = vl_malloc (sizeof(float) * numData) ;
vl_kmeans_quantize (kmeans, assignments, distances, data, numData) ;
```
Alternatively, one can directly assign new data points to the closest centers, without bothering with a VlKMeans object.
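For instance, a direct assignment can be sketched as follows. This is a minimal plain-C sketch with a hypothetical helper name (assign_to_centers is not a kmeans.h function); it mirrors what vl_kmeans_quantize computes, assuming the squared $l^2$ distance and centers stored row by row.

```c
#include <stddef.h>

/* Hypothetical helper, not part of kmeans.h: assign each of numData
 * float vectors to the nearest of numCenters centers under the squared
 * l2 distance. Centers are stored contiguously, one per row. */
void assign_to_centers (unsigned * assignments,
                        float const * data, float const * centers,
                        size_t dimension, size_t numData, size_t numCenters)
{
  for (size_t i = 0 ; i < numData ; ++i) {
    double best = -1.0 ;
    unsigned bestk = 0 ;
    for (size_t k = 0 ; k < numCenters ; ++k) {
      /* squared l2 distance between point i and center k */
      double acc = 0.0 ;
      for (size_t d = 0 ; d < dimension ; ++d) {
        double delta = (double)data[i * dimension + d]
                     - (double)centers[k * dimension + d] ;
        acc += delta * delta ;
      }
      if (best < 0 || acc < best) { best = acc ; bestk = (unsigned)k ; }
    }
    assignments[i] = bestk ;
  }
}
```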
There are several considerations that may impact the performance of K-means. First, since K-means is usually solved by a local optimization algorithm, the **initialization method** is important. The following initialization methods are supported:
Method | Function | Description
---------------|-----------------------------------------|-----------------------------------------------
Random samples | vl_kmeans_init_centers_with_rand_data | Random data points
K-means++ | vl_kmeans_init_centers_plus_plus | Random selection biased towards diversity
Custom | vl_kmeans_set_centers | Choose centers (useful to run quantization only)
See Initialization methods for further details. The initialization methods use a randomized selection of the data points; the random number generator is initialized by vl_rand_init.
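The k-means++ seeding idea can be sketched as follows. This is an illustrative plain-C sketch with hypothetical names, not the code behind vl_kmeans_init_centers_plus_plus: the first center is a uniformly random data point, and each subsequent center is a data point sampled with probability proportional to its squared distance from the centers chosen so far, which biases the selection towards diverse, well-spread seeds.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of k-means++ seeding (not a kmeans.h function).
 * Seeds numCenters centers from numData float vectors of the given
 * dimension, writing them row by row into `centers`. */
void plus_plus_seed (float * centers, float const * data,
                     size_t dimension, size_t numData, size_t numCenters,
                     unsigned seed)
{
  double * minDist = malloc (numData * sizeof (double)) ;
  srand (seed) ;
  /* first center: a uniformly random data point */
  size_t first = (size_t)rand () % numData ;
  memcpy (centers, data + first * dimension, dimension * sizeof (float)) ;
  for (size_t i = 0 ; i < numData ; ++i) minDist[i] = 1e300 ;

  for (size_t k = 1 ; k <= numCenters ; ++k) {
    /* update each point's squared distance to its nearest chosen center */
    double total = 0.0 ;
    for (size_t i = 0 ; i < numData ; ++i) {
      double acc = 0.0 ;
      for (size_t d = 0 ; d < dimension ; ++d) {
        double delta = (double)data[i * dimension + d]
                     - (double)centers[(k - 1) * dimension + d] ;
        acc += delta * delta ;
      }
      if (acc < minDist[i]) minDist[i] = acc ;
      total += minDist[i] ;
    }
    if (k == numCenters) break ;
    /* sample the next center with probability proportional to minDist */
    double r = (rand () / (double)RAND_MAX) * total ;
    size_t next = numData - 1 ;
    for (size_t i = 0 ; i < numData ; ++i) {
      r -= minDist[i] ;
      if (r <= 0) { next = i ; break ; }
    }
    memcpy (centers + k * dimension, data + next * dimension,
            dimension * sizeof (float)) ;
  }
  free (minDist) ;
}
```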
The second important choice is the **optimization algorithm**. The following optimization algorithms are supported:
Algorithm | Symbol | See | Description
------------|------------------|-------------------|-----------------------------------------------
Lloyd | VlKMeansLloyd | Lloyd's algorithm | Alternate EM-style optimization
Elkan | VlKMeansElkan | Elkan's algorithm | A speedup using triangular inequalities
ANN | VlKMeansANN | ANN algorithm | A speedup using approximated nearest neighbors
See the respective sections for further details. These algorithms are iterative, and stop when either a **maximum number of iterations** (vl_kmeans_set_max_num_iterations) is reached, or when the energy changes sufficiently slowly in one iteration (vl_kmeans_set_min_energy_variation).
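The interplay of the two stopping criteria can be sketched as follows. This is a hedged sketch with hypothetical names, not the library's internal code; it assumes the energy variation is measured relative to the previous energy, and relies on the fact that the K-means energy is non-increasing under the refinement updates.

```c
/* Hypothetical sketch (not VLFeat internals): stop refining when either
 * the iteration budget is exhausted or the relative energy decrease in
 * one iteration falls below the requested threshold. */
int should_stop (double previousEnergy, double energy,
                 int iteration, int maxNumIterations,
                 double minEnergyVariation)
{
  if (iteration >= maxNumIterations) return 1 ;
  /* relative improvement too small: the optimization has plateaued */
  if (previousEnergy > 0 &&
      (previousEnergy - energy) < minEnergyVariation * previousEnergy) {
    return 1 ;
  }
  return 0 ;
}
```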
All three algorithms support multithreaded computation. The number of threads used is usually controlled globally by vl_set_num_threads.