Rather than visiting points completely at random, VLFeat SDCA follows the best practice of visiting all the points at every epoch (pass through the data), changing the order of the visit randomly by picking every time a new random permutation.
This page describes the *Stochastic Gradient Descent* (SGD) linear SVM solver. SGD minimizes directly the primal SVM objective (see Support Vector Machines (SVM)):
\[ E() = {}{2} \| \|^2 + {1}{n} {i=1}^n ( ,) \]
Firts, rewrite the objective as the average
\[ E() = {1}{n} {i=1}^n E_i(), E_i() = {}{2} \| \|^2 + ( ,). \]
Then SGD performs gradient steps by considering at each iteration one term $E_i()$ selected at random from this average. In its most basic form, the algorithm is:
Provided that the learning rate $$ is chosen correctly, this simple algorithm is guaranteed to converge to the minimizer $^*$ of $E$.
The goal of the SGD algorithm is to bring the *primal suboptimality* below a threshold $$: \[ E() - E(^*) . \]
If the learning rate $$ is selected appropriately, SGD can be shown to converge properly. For example, [shalev-shwartz07pegasos]} show that, since $E()$ is $$-strongly convex, then using the learning rate \[ { = {1}{ t}} \] guarantees that the algorithm reaches primal-suboptimality $$ in \[ O( {1}{ } ). \] iterations. This particular SGD variant is sometimes known as PEGASOS [shalev-shwartz07pegasos]} and is the version implemented in VLFeat.
The *convergence speed* is not sufficient to tell the *learning speed*, i.e. how quickly an algorithm can learn an SVM that performs optimally on the test set. The following two observations can be used to link convergence speed to learning speed:
Under these two assumptions, PEGASOS can learn a linear SVM in time $ O(n)$, which is *linear in the number of training examples*. This fares much better with $O(n^2)$ or worse of non-linear SVM solvers.
Adding a bias $b$ to the SVM scoring function $ , +b$ is done, as explained in Adding a bias, by appending a constant feature $B$ (the *bias multiplier*) to the data vectors $$ and a corresponding weight element $w_b$ to the weight vector $$, so that $b = B w_b$ As noted, the bias multiplier should be relatively large in order to avoid shrinking the bias towards zero, but small to make the optimization stable. In particular, setting $B$ to zero learns an unbiased SVM (vl_svm_set_bias_multiplier).
To counter instability caused by a large bias multiplier, the learning rate of the bias is slowed down by multiplying the overall learning rate $$ by a bias-specific rate coefficient (vl_svm_set_bias_learning_rate).
As a rule of thumb, if the data vectors $$ are $l^2$ normalized (as they typically should for optimal performance), then a reasonable bias multiplier is in the range 1 to 10 and a reasonable bias learning rate is somewhere in the range of the inverse of that (in this manner the two parts of the extended feature vector $(, B)$ are balanced).
Initially, the learning rate $ = 1/ t$ is usually too fast: as usually $ 1$, $ 1$. But this is clearly excessive (for example, without a loss term, the best learning rate at the first iteration is simply $=1$, as this nails the optimum in one step). Thus, the learning rate formula is modified to be $ = 1 / (t + t_0)$, where $t_0 2/$, which is equivalent to start $t_0$ iterations later. In this manner $ 1/2$.
Starting from a given model $$ is easy in SGD as the optimization runs in the primal. However, the starting iteration index $t$ should also be advanced for a warm start, as otherwise the initial setting of $$ is rapidly forgot (vl_svm_set_model, vl_svm_set_bias, vl_svm_set_iteration_number).
Rather than visiting points completely at random, VLFeat SDCA follows the best practice of visiting all the points at every epoch (pass through the data), changing the order of the visit randomly by picking every time a new random permutation.
At each iteration, the SGD algorithm updates the vector $$ (including the additional bias component $w_b$) as ${t+1} - - $, where $$ is the learning rate. If the subgradient of the loss function $$ is zero at a given iteration, this amounts to simply shrink $$ towards the origin by multiplying it by the factor $1 - $. Thus such an iteration can be accelerated significantly by representing internally $ = f_t $, where $f_t$ is a scaling factor. Then, the update becomes \[ f_{t+1} {t+1} = f_{t} {t} - f_{t} {t} - = (1- ) f_{t} {t} - . \] Setting $f_{t+1} = (1- ) f_{t}$, this gives the update equation for $$ \[ {t+1} = {t} - {}{f_{t+1}} . \] but this step can be skipped whenever $$ is equal to zero.
When the bias component has a different learning rate, this scheme must be adjusted slightly by adding a separated factor for the bias, but it is otherwise identical.