libvlfeat: VLAD fundamentals

This page describes the *Vector of Locally Aggregated Descriptors* (VLAD) image encoding of [jegou10aggregating]. See Vector of Locally Aggregated Descriptors (VLAD) encoding for an overview of the C API.

VLAD is a *feature encoding and pooling* method, similar to Fisher vectors. VLAD encodes a set of local feature descriptors $I=(\mathbf{x}_1,\dots,\mathbf{x}_N)$ extracted from an image using a dictionary built using a clustering method such as Gaussian Mixture Models (GMM) or K-means clustering. Let $q_{ik}$ be the strength of the association of data vector $\mathbf{x}_i$ to cluster $\mu_k$, such that $q_{ik} \geq 0$ and $\sum_{k=1}^K q_{ik} = 1$. The association may be either soft (e.g. obtained as the posterior probabilities of the GMM clusters) or hard (e.g. obtained by vector quantization with K-means).
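The two kinds of association can be sketched in a few lines of Python. This is only an illustration, not the VLFeat API: the function names are hypothetical, and the soft variant assumes equal-weight isotropic Gaussians with a shared variance `sigma2` as a stand-in for the posteriors of a fitted GMM.

```python
import math

def squared_distance(x, mu):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def hard_assignments(data, means):
    """One-hot q[i][k]: 1 for the nearest mean, 0 elsewhere (vector quantization)."""
    q = []
    for x in data:
        dists = [squared_distance(x, mu) for mu in means]
        nearest = dists.index(min(dists))
        q.append([1.0 if k == nearest else 0.0 for k in range(len(means))])
    return q

def soft_assignments(data, means, sigma2=1.0):
    """Soft q[i][k] proportional to exp(-||x_i - mu_k||^2 / (2 * sigma2)).

    Assumption: equal-weight isotropic Gaussians, not a fitted GMM.
    """
    q = []
    for x in data:
        dists = [squared_distance(x, mu) for mu in means]
        d_min = min(dists)  # subtract the minimum for numerical stability
        weights = [math.exp(-(d - d_min) / (2.0 * sigma2)) for d in dists]
        total = sum(weights)
        q.append([w / total for w in weights])  # each row sums to 1
    return q

data = [[0.0, 0.0], [1.0, 0.0], [4.0, 4.0]]
means = [[0.0, 0.0], [4.0, 4.0]]
q_hard = hard_assignments(data, means)
q_soft = soft_assignments(data, means)
```

Both functions return an $N \times K$ table whose rows are non-negative and sum to one, which is the only property the encoding relies on.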

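$\mu_k$ are the cluster *means*, vectors of the same dimension as the data $\mathbf{x}_i$. VLAD encodes the features $\mathbf{x}_i$ by considering, for each cluster, the *residuals* \[ \mathbf{v}_k = \sum_{i=1}^{N} q_{ik} (\mathbf{x}_{i} - \mu_k). \] The residuals are stacked together to obtain the vector \[ \hat{\Phi}(I) = \begin{bmatrix} \mathbf{v}_1 \\ \vdots \\ \mathbf{v}_K \end{bmatrix} \]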
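As a minimal sketch of the residual accumulation and stacking just described, the following Python computes the unnormalized encoding for a toy example. The function name `vlad_encode` and the data are illustrative assumptions; this is not the C API.

```python
def vlad_encode(data, means, q):
    """Return the stacked residuals v_k = sum_i q[i][k] * (x_i - mu_k)."""
    dim = len(means[0])
    encoding = []
    for k, mu in enumerate(means):
        v_k = [0.0] * dim
        for x, q_i in zip(data, q):
            for d in range(dim):
                v_k[d] += q_i[k] * (x[d] - mu[d])
        encoding.extend(v_k)  # stack v_1, ..., v_K into one K*D vector
    return encoding

data = [[1.0, 2.0], [3.0, 2.0], [10.0, 9.0]]
means = [[2.0, 1.0], [10.0, 10.0]]
q = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # hard assignments, one row per feature
phi = vlad_encode(data, means, q)
print(phi)  # [0.0, 2.0, 0.0, -1.0]
```

Here the first two features fall in the first cluster, with residuals $(-1, 1)$ and $(1, 1)$ that sum to $(0, 2)$, and the third feature contributes the residual $(0, -1)$ to the second cluster. The encoding has $K \cdot D$ components regardless of the number of features $N$.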

Before the VLAD encoding is used, it is usually normalized, as explained in VLAD normalization below.

VLAD normalization

The VLFeat VLAD implementation supports a number of different normalization strategies. These are optional and, when enabled, are applied in this order:

**Component-wise mass normalization.** Each vector $\mathbf{v}_k$ is divided by the total mass of features associated to it, $\sum_{i=1}^N q_{ik}$.

**Square-rooting.** The function $\operatorname{sign}(z)\sqrt{|z|}$ is applied to all scalar components of the VLAD descriptor.

**Component-wise $l^2$ normalization.** The vectors $\mathbf{v}_k$ are divided by their norm $\|\mathbf{v}_k\|_2$.

**Global $l^2$ normalization.** The VLAD descriptor $\hat{\Phi}(I)$ is divided by its norm $\|\hat{\Phi}(I)\|_2$.
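The four steps above can be sketched as one function that applies them in the stated order. The function name and the boolean switches are hypothetical and do not correspond to the flags of the C API. Skipping the division when a mass or norm is zero is also an assumption of this sketch.

```python
import math

def l2_norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(z * z for z in v))

def normalize_vlad(blocks, masses, mass=False, square_root=False,
                   component_l2=False, global_l2=False):
    """Apply the enabled normalizations in order and return the stacked vector.

    blocks: list of K residual vectors v_k.
    masses: list of K totals, masses[k] = sum over i of q[i][k].
    """
    blocks = [list(v) for v in blocks]  # work on a copy
    if mass:
        # Divide each v_k by its total mass (skipped for empty clusters).
        blocks = [[z / m for z in v] if m > 0 else v
                  for v, m in zip(blocks, masses)]
    if square_root:
        # sign(z) * sqrt(|z|) on every scalar component.
        blocks = [[math.copysign(math.sqrt(abs(z)), z) for z in v]
                  for v in blocks]
    if component_l2:
        # Divide each v_k by its own l2 norm (skipped for zero vectors).
        norms = [l2_norm(v) for v in blocks]
        blocks = [[z / n for z in v] if n > 0 else v
                  for v, n in zip(blocks, norms)]
    phi = [z for v in blocks for z in v]  # stack the blocks
    if global_l2:
        n = l2_norm(phi)
        if n > 0:
            phi = [z / n for z in phi]
    return phi

blocks = [[0.0, 2.0], [0.0, -1.0]]  # v_1 and v_2
masses = [2.0, 1.0]                 # two features in cluster 1, one in cluster 2
phi = normalize_vlad(blocks, masses, mass=True, square_root=True, global_l2=True)
```

In this example mass normalization turns the blocks into $(0, 1)$ and $(0, -1)$, square-rooting leaves them unchanged, and the global step scales the stacked vector to unit $l^2$ norm.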