This page describes the *Fisher Vector* (FV) of [perronnin06fisher] [perronnin10improving]. See Fisher Vector encoding (FV) for an overview of the C API and Fisher kernel for its relation to the more general notion of Fisher kernel.
The FV is an image representation obtained by pooling local image features. It is frequently used as a global image descriptor in visual classification.
While the FV can be derived as a special, approximate, and improved case of the general Fisher Kernel framework, it is easy to describe directly. Let $I = (\mathbf{x}_1,\dots,\mathbf{x}_N)$ be a set of $D$-dimensional feature vectors (e.g. SIFT descriptors) extracted from an image. Let $\Theta=(\mu_k,\Sigma_k,\pi_k:k=1,\dots,K)$ be the parameters of a Gaussian Mixture Model fitting the distribution of descriptors. The GMM associates each vector $\mathbf{x}_i$ to a mode $k$ in the mixture with a strength given by the posterior probability:
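\[ q_{ik} = \frac{\exp\left[-\frac{1}{2}(\mathbf{x}_i - \mu_k)^T \Sigma_k^{-1} (\mathbf{x}_i - \mu_k)\right]}{\sum_{t=1}^K \exp\left[-\frac{1}{2}(\mathbf{x}_i - \mu_t)^T \Sigma_t^{-1} (\mathbf{x}_i - \mu_t)\right]}. \]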
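The soft assignment of a single descriptor can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `posteriors` and its arguments are hypothetical, the covariances are taken to be diagonal (so each mode is described by a list of per-dimension variances), and the code follows the formula literally, as a softmax over negative half squared Mahalanobis distances. A full GMM posterior would typically also weight each term by the prior $\pi_k$ and the Gaussian normalization constant.

```python
import math

def posteriors(x, means, variances):
    """Soft assignments q[k] of one descriptor x to K modes.

    Hypothetical helper: `means` and `variances` are lists of K lists of
    length D (diagonal covariances are an assumption of this sketch).
    """
    # Exponent of each mode: -1/2 (x - mu_k)^T Sigma_k^{-1} (x - mu_k)
    exponents = [
        -0.5 * sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, mu, var))
        for mu, var in zip(means, variances)
    ]
    # Subtract the largest exponent before exponentiating to avoid underflow;
    # the common factor cancels in the ratio.
    top = max(exponents)
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]
```

For example, `posteriors([0.0, 0.0], [[0.0, 0.0], [2.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]])` assigns most of the mass to the first mode, whose mean coincides with the descriptor.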
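For each mode $k$, consider the mean and covariance deviation vectors
\[ \begin{aligned} u_{jk} &= \frac{1}{N \sqrt{\pi_k}} \sum_{i=1}^{N} q_{ik} \frac{x_{ji} - \mu_{jk}}{\sigma_{jk}}, \\ v_{jk} &= \frac{1}{N \sqrt{2 \pi_k}} \sum_{i=1}^{N} q_{ik} \left[ \left(\frac{x_{ji} - \mu_{jk}}{\sigma_{jk}}\right)^2 - 1 \right], \end{aligned} \]
where $j=1,2,\dots,D$ spans the vector dimensions, $x_{ji}$ is the $j$-th component of $\mathbf{x}_i$, and $\sigma_{jk}^2$ is the $j$-th diagonal entry of $\Sigma_k$ (the covariance matrices are assumed to be diagonal). The FV of image $I$ is the stacking of the vectors $\mathbf{u}_k$ and then of the vectors $\mathbf{v}_k$ for each of the $K$ modes in the Gaussian mixture: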
\[ \Phi(I) = \begin{bmatrix} \vdots \\ \mathbf{u}_k \\ \vdots \\ \mathbf{v}_k \\ \vdots \end{bmatrix}. \]
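The encoding step can be illustrated with a short pure-Python sketch. It is not the library's implementation: the function name `fisher_vector` and its arguments are hypothetical, the assignments `q[i][k]` are assumed to be precomputed, the covariances are assumed diagonal with per-dimension standard deviations `sigmas[k][j]`, and the mean and covariance deviations are normalized by $1/(N\sqrt{\pi_k})$ and $1/(N\sqrt{2\pi_k})$ respectively.

```python
import math

def fisher_vector(X, q, means, sigmas, priors):
    """Return the FV as a flat list of length 2*K*D.

    X: N descriptors of length D; q: N x K assignments;
    means, sigmas: K x D; priors: K mixing weights.
    (All names are illustrative assumptions.)
    """
    N, K, D = len(X), len(means), len(means[0])
    u = [[0.0] * D for _ in range(K)]
    v = [[0.0] * D for _ in range(K)]
    for i in range(N):
        for k in range(K):
            for j in range(D):
                # Whitened deviation of the descriptor from the mode mean
                z = (X[i][j] - means[k][j]) / sigmas[k][j]
                u[k][j] += q[i][k] * z                # first-order statistic
                v[k][j] += q[i][k] * (z * z - 1.0)    # second-order statistic
    for k in range(K):
        cu = 1.0 / (N * math.sqrt(priors[k]))
        cv = 1.0 / (N * math.sqrt(2.0 * priors[k]))
        u[k] = [cu * a for a in u[k]]
        v[k] = [cv * a for a in v[k]]
    # Stack all mean deviations first, then all covariance deviations
    return [a for row in u for a in row] + [a for row in v for a in row]
```

With a single mode of mean 2 and unit variance, two identical descriptors at 4 have whitened deviation 2, so the mean part of the encoding is 2 and the covariance part is $3/\sqrt{2}$.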
The *improved* Fisher Vector [perronnin10improving] (IFV) improves the classification performance of the representation by using two ideas:
1. *Non-linear additive kernel.* The Hellinger's kernel (or Bhattacharyya coefficient) can be used instead of the linear one at no cost by signed square-rooting. This is obtained by applying the function $\operatorname{sign}(z)\sqrt{|z|}$ to each dimension of the vector $\Phi(I)$. Other additive kernels can also be used at an increased space or time cost.
2. *Normalization.* Before using the representation in a linear model (e.g. a support vector machine), the vector $\Phi(I)$ is further normalized by the $l^2$ norm (note that the standard Fisher vector is normalized by the number of encoded feature vectors).
After square-rooting and normalization, the IFV is often used in a linear classifier such as an SVM.
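The two post-processing steps can be sketched as follows. The function name `improve` is hypothetical, and leaving an all-zero vector unchanged is a choice made for this sketch rather than something the text prescribes.

```python
import math

def improve(phi):
    """Signed square root of every entry, then l2 normalization."""
    # sign(z) * sqrt(|z|), applied to each dimension
    rooted = [math.copysign(math.sqrt(abs(z)), z) for z in phi]
    norm = math.sqrt(sum(z * z for z in rooted))
    # Assumption of this sketch: an all-zero vector is returned as is
    return [z / norm for z in rooted] if norm > 0 else rooted
```

For instance, `improve([4.0, -9.0, 0.0])` first maps the entries to 2, -3, and 0, and then divides by $\sqrt{13}$ so that the result has unit $l^2$ norm.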
In practice, many of the data-to-cluster assignments $q_{ik}$ are likely to be very small or even negligible. The *fast* version of the FV sets to zero all but the largest assignment for each input feature $\mathbf{x}_i$.
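A minimal sketch of this hard-assignment step is given below. The function name `harden` is hypothetical. The sketch keeps the surviving value unchanged; resetting it to 1 is another possible convention, and the description above does not say which one is used. Ties are broken in favour of the first mode.

```python
def harden(q_i):
    """Zero all but the largest assignment of one feature."""
    # Index of the largest assignment (the first one in case of ties)
    best = max(range(len(q_i)), key=lambda k: q_i[k])
    return [q_i[k] if k == best else 0.0 for k in range(len(q_i))]
```

Applied to `[0.1, 0.7, 0.2]`, only the middle entry survives, so the feature contributes to the statistics of a single mode and the inner loop over the other modes can be skipped.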