Feature Triangulation

3D Cartesian Triangulation

We wish to create a solvable linear system that gives us an initial guess for the 3D Cartesian position of our feature. To do this, we treat all the poses from which the feature is seen as known quantities. The feature will be triangulated in some anchor camera frame $\{A\}$, which we can arbitrarily pick. If the feature $\mathbf{p}_f$ is observed from poses $1\ldots m$, then given the anchor pose $A$ we have the following transformation for any camera pose $C_i, i=1\ldots m$:

\begin{align*} {}^{C_i}\mathbf{p}_{f} & = {}^{C_i}_A\mathbf{R} \left( {}^A\mathbf{p}_f - {}^A\mathbf{p}_{C_i}\right) \\ {}^A\mathbf{p}_f & = {}^{C_i}_A\mathbf{R}^\top {}^{C_i}\mathbf{p}_{f} + {}^A\mathbf{p}_{C_i} \end{align*}

In the absence of noise, the measurement in the current frame is the bearing ${}^{C_i}\mathbf{b}$ and its depth ${}^{C_i}z$. Thus we have the following mapping to the feature seen from the current frame:

\begin{align*} {}^{C_i}\mathbf{p}_f & = {}^{C_i}z_{f} {}^{C_i}\mathbf{b}_{f} = {}^{C_i}z_{f} \begin{bmatrix} u_n \\ v_n \\ 1 \end{bmatrix} \end{align*}

We note that $u_n$ and $v_n$ represent the undistorted normalized image coordinates. This bearing can be warped into the anchor frame by substituting into the above equation:

\begin{align*} {}^A\mathbf{p}_f & = {}^{C_i}_A\mathbf{R}^\top {}^{C_i}z_{f} {}^{C_i}\mathbf{b}_{f} + {}^A\mathbf{p}_{C_i} \\ & = {}^{C_i}z_{f} {}^{A}\mathbf{b}_{C_i \rightarrow f} + {}^A\mathbf{p}_{C_i} \end{align*}

To remove the need to estimate the extra degree of freedom of the depth ${}^{C_i}z_{f}$, we define the following skew-symmetric matrix, whose rows are orthogonal to the bearing ${}^{A}\mathbf{b}_{C_i \rightarrow f}$:

\begin{align*} {}^{A}\mathbf{N}_i &= \lfloor {}^{A}\mathbf{b}_{C_i \rightarrow f} \times\rfloor = \begin{bmatrix} 0 & -{}^{A}{b}_{C_i \rightarrow f}(3) & {}^{A}{b}_{C_i \rightarrow f}(2) \\ {}^{A}{b}_{C_i \rightarrow f}(3) & 0 & -{}^{A}{b}_{C_i \rightarrow f}(1) \\ -{}^{A}{b}_{C_i \rightarrow f}(2) & {}^{A}{b}_{C_i \rightarrow f}(1) & 0 \\ \end{bmatrix} \end{align*}

All three rows are perpendicular to the vector ${}^{A}\mathbf{b}_{C_i \rightarrow f}$ and thus ${}^{A}\mathbf{N}_i{}^{A}\mathbf{b}_{C_i \rightarrow f}=\mathbf{0}_3$. We can then multiply the transformation constraint by ${}^{A}\mathbf{N}_i$ to form equations (two of which are linearly independent) that relate only to the unknown 3 d.o.f. ${}^A\mathbf{p}_f$:

\begin{align*} {}^{A}\mathbf{N}_i {}^A\mathbf{p}_f & = {}^{A}\mathbf{N}_i {}^{C_i}z_{f} {}^{A}\mathbf{b}_{C_i \rightarrow f} + {}^{A}\mathbf{N}_i {}^A\mathbf{p}_{C_i} \\ & = {}^{A}\mathbf{N}_i {}^A\mathbf{p}_{C_i} \end{align*}

By stacking all the measurements, we can have:

\begin{align*} \underbrace{ \begin{bmatrix} \vdots \\ {}^{A}\mathbf{N}_i \\ \vdots \end{bmatrix} }_{\mathbf{A}} {}^A\mathbf{p}_f & = \underbrace{ \begin{bmatrix} \vdots \\ {}^{A}\mathbf{N}_i {}^A\mathbf{p}_{C_i} \\ \vdots \end{bmatrix} }_{\mathbf{b}} \end{align*}

Since each pixel measurement provides two constraints, as long as $m>1$ we will have enough constraints to triangulate the feature. In practice, the more views of the feature the better the triangulation, and thus we normally want the feature to be seen from at least five views. We could select two rows of each ${}^{A}\mathbf{N}_i$ to reduce the number of rows, but instead we can perform the following "trick" to obtain a small square system:

\begin{align*} \mathbf{A}^\top\mathbf{A}~ {}^A\mathbf{p}_f &= \mathbf{A}^\top\mathbf{b} \\ \Big( \sum_i {}^{A}\mathbf{N}_i^\top {}^{A}\mathbf{N}_i \Big) {}^A\mathbf{p}_f &= \Big( \sum_i {}^{A}\mathbf{N}_i^\top {}^{A}\mathbf{N}_i {}^A\mathbf{p}_{C_i} \Big) \end{align*}

This is a $3\times3$ system which can be quickly solved, as compared to the original $3m\times3$ or $2m\times3$ stacked system. We additionally check that the triangulated feature is "valid": in front of the camera and not too far away. We also compute the condition number of the above linear system and reject systems that are "sensitive" to errors, i.e., that have a large condition number.
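
To make this concrete, below is a minimal sketch of the 3D linear triangulation using Eigen. The function name, input containers, and validity thresholds here are illustrative assumptions, not the ov_core API.

```cpp
#include <Eigen/Dense>
#include <vector>

// Minimal sketch: linearly triangulate ^A p_f from m views.
// R_AtoCi[i] : rotation ^{C_i}_A R from anchor {A} to camera {C_i}
// p_CiinA[i] : camera position ^A p_{C_i} in the anchor frame
// bearings[i]: undistorted normalized bearing [u_n, v_n, 1]^T in {C_i}
Eigen::Vector3d triangulate_3d(const std::vector<Eigen::Matrix3d> &R_AtoCi,
                               const std::vector<Eigen::Vector3d> &p_CiinA,
                               const std::vector<Eigen::Vector3d> &bearings,
                               bool &success) {
  Eigen::Matrix3d A = Eigen::Matrix3d::Zero();
  Eigen::Vector3d b = Eigen::Vector3d::Zero();
  for (size_t i = 0; i < bearings.size(); i++) {
    // Warp the bearing into the anchor frame: ^A b_{C_i -> f}
    Eigen::Vector3d b_inA = (R_AtoCi[i].transpose() * bearings[i]).normalized();
    // N_i = skew(b_inA), whose rows are orthogonal to the bearing
    Eigen::Matrix3d Ni;
    Ni << 0, -b_inA(2), b_inA(1),
          b_inA(2), 0, -b_inA(0),
          -b_inA(1), b_inA(0), 0;
    // Accumulate the 3x3 normal equations: sum(N_i^T N_i) p_f = sum(N_i^T N_i p_Ci)
    Eigen::Matrix3d NtN = Ni.transpose() * Ni;
    A += NtN;
    b += NtN * p_CiinA[i];
  }
  // Solve the small square system and check its conditioning
  Eigen::JacobiSVD<Eigen::Matrix3d> svd(A, Eigen::ComputeFullU | Eigen::ComputeFullV);
  double cond = svd.singularValues()(0) / svd.singularValues()(2);
  Eigen::Vector3d p_FinA = A.colPivHouseholderQr().solve(b);
  // Reject sensitive systems and features behind or too far from the anchor camera
  // (thresholds here are placeholders, not library defaults)
  success = (cond < 1e4 && p_FinA(2) > 0.1 && p_FinA(2) < 60.0);
  return p_FinA;
}
```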

1D Depth Triangulation

We wish to create a solvable linear system that gives us an initial guess for the 1D depth of our feature. To do this, we treat all the poses from which the feature is seen as known quantities, along with the feature bearing in the anchor frame. The feature will be triangulated in some anchor camera frame $\{A\}$, which we can arbitrarily pick, and its bearing is defined by the undistorted normalized image coordinates $[u_n ~ v_n ~ 1]^\top$ in the anchor frame. If the feature $\mathbf{p}_f$ is observed from poses $1\ldots m$, then given the anchor pose $A$ we have the following transformation for any camera pose $C_i, i=1\ldots m$:

\begin{align*} {}^{C_i}\mathbf{p}_{f} & = {}^{C_i}_A\mathbf{R} \left( {}^A\mathbf{p}_f - {}^A\mathbf{p}_{C_i}\right) \\ {}^A\mathbf{p}_f & = {}^{C_i}_A\mathbf{R}^\top {}^{C_i}\mathbf{p}_{f} + {}^A\mathbf{p}_{C_i} \\ {}^{A}z_{f} {}^A\mathbf{b}_f & = {}^{C_i}_A\mathbf{R}^\top {}^{C_i}\mathbf{p}_{f} + {}^A\mathbf{p}_{C_i} \\ \end{align*}

In the absence of noise, the measurement in the current frame is the bearing ${}^{C_i}\mathbf{b}$ and its depth ${}^{C_i}z$:

\begin{align*} {}^{C_i}\mathbf{p}_f & = {}^{C_i}z_{f} {}^{C_i}\mathbf{b}_{f} = {}^{C_i}z_{f} \begin{bmatrix} u_n \\ v_n \\ 1 \end{bmatrix} \end{align*}

We note that $u_n$ and $v_n$ represent the undistorted normalized image coordinates. This bearing can be warped into the anchor frame by substituting into the above equation:

\begin{align*} {}^{A}z_{f} {}^A\mathbf{b}_f & = {}^{C_i}_A\mathbf{R}^\top {}^{C_i}z_{f} {}^{C_i}\mathbf{b}_{f} + {}^A\mathbf{p}_{C_i} \\ & = {}^{C_i}z_{f} {}^{A}\mathbf{b}_{C_i \rightarrow f} + {}^A\mathbf{p}_{C_i} \end{align*}

To remove the need to estimate the extra degree of freedom of the depth ${}^{C_i}z_{f}$, we again define the skew-symmetric matrix whose rows are orthogonal to the bearing ${}^{A}\mathbf{b}_{C_i \rightarrow f}$:

\begin{align*} {}^{A}\mathbf{N}_i &= \lfloor {}^{A}\mathbf{b}_{C_i \rightarrow f} \times\rfloor \end{align*}

All three rows are perpendicular to the vector ${}^{A}\mathbf{b}_{C_i \rightarrow f}$ and thus ${}^{A}\mathbf{N}_i{}^{A}\mathbf{b}_{C_i \rightarrow f}=\mathbf{0}_3$. We can then multiply the transformation constraint by ${}^{A}\mathbf{N}_i$ to form equations that relate only to the unknown depth ${}^{A}z_{f}$:

\begin{align*} ({}^{A}\mathbf{N}_i {}^A\mathbf{b}_f) {}^{A}z_{f} & = {}^{A}\mathbf{N}_i {}^{C_i}z_{f} {}^{A}\mathbf{b}_{C_i \rightarrow f} + {}^{A}\mathbf{N}_i {}^A\mathbf{p}_{C_i} \\ & = {}^{A}\mathbf{N}_i {}^A\mathbf{p}_{C_i} \end{align*}

We can then formulate the following system:

\begin{align*} \Big( \sum_i ({}^{A}\mathbf{N}_i{}^A\mathbf{b}_f)^\top ({}^{A}\mathbf{N}_i{}^A\mathbf{b}_f) \Big) {}^{A}z_{f} &= \Big( \sum_i ({}^{A}\mathbf{N}_i{}^A\mathbf{b}_f)^\top {}^{A}\mathbf{N}_i{}^A\mathbf{p}_{C_i} \Big) \end{align*}

This is a $1\times1$ system which can be quickly solved with a single scalar division. We additionally check that the triangulated feature is "valid": in front of the camera and not too far away. The full feature position can then be recovered as ${}^A\mathbf{p}_f = {}^{A}z_{f} {}^A\mathbf{b}_f$.
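
A minimal sketch of this 1D depth solve, again using Eigen, is shown below. Since ${}^{A}\mathbf{N}_i\mathbf{v} = {}^{A}\mathbf{b}_{C_i \rightarrow f} \times \mathbf{v}$, the skew matrix never needs to be formed explicitly. The names here are illustrative, not the ov_core API.

```cpp
#include <Eigen/Dense>
#include <vector>

// Minimal sketch: solve for the scalar depth ^A z_f along a fixed anchor bearing.
// bearing_inA : fixed anchor bearing ^A b_f = [u_n, v_n, 1]^T (anchor frame)
// R_AtoCi, p_CiinA, bearings : per-view poses and normalized bearings as before
double triangulate_1d(const Eigen::Vector3d &bearing_inA,
                      const std::vector<Eigen::Matrix3d> &R_AtoCi,
                      const std::vector<Eigen::Vector3d> &p_CiinA,
                      const std::vector<Eigen::Vector3d> &bearings) {
  double AtA = 0.0, Atb = 0.0;
  for (size_t i = 0; i < bearings.size(); i++) {
    // Warp the measurement bearing into the anchor frame
    Eigen::Vector3d b_inA = (R_AtoCi[i].transpose() * bearings[i]).normalized();
    // N_i b_f and N_i p_Ci via cross products (N_i = skew(b_inA))
    Eigen::Vector3d Nb = b_inA.cross(bearing_inA);
    Eigen::Vector3d Np = b_inA.cross(p_CiinA[i]);
    // Accumulate the scalar normal equations: sum(Nb^T Nb) z = sum(Nb^T Np)
    AtA += Nb.dot(Nb);
    Atb += Nb.dot(Np);
  }
  // Single scalar division; the full feature is ^A p_f = z * ^A b_f
  return Atb / AtA;
}
```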

3D Inverse Non-linear Optimization

After we obtain the triangulated 3D position of the feature, a nonlinear least-squares optimization is performed to refine this estimate. To achieve good numerical stability, we use the inverse-depth representation for the point feature, which helps with convergence. We find that in most cases this problem converges within 2-3 iterations in indoor environments. The feature transformation can be written as:

\begin{align*} {}^{C_i}\mathbf{p}_f & = {}^{C_i}_A\mathbf{R} \left( {}^A\mathbf{p}_f - {}^A\mathbf{p}_{C_i} \right) \\ &= {}^Az_f {}^{C_i}_A\mathbf{R} \left( \begin{bmatrix} {}^Ax_f/{}^Az_f \\ {}^Ay_f/{}^Az_f \\ 1 \end{bmatrix} - \frac{1}{{}^Az_f} {}^A\mathbf{p}_{C_i} \right) \\ \Rightarrow \frac{1}{{}^Az_f} {}^{C_i}\mathbf{p}_f & = {}^{C_i}_A\mathbf{R} \left( \begin{bmatrix} {}^Ax_f/{}^Az_f \\ {}^Ay_f/{}^Az_f \\ 1 \end{bmatrix} - \frac{1}{{}^Az_f} {}^A\mathbf{p}_{C_i} \right) \end{align*}

We define $u_A = {}^Ax_f/{}^Az_f$, $v_A = {}^Ay_f/{}^Az_f$, and $\rho_A = {1}/{{}^Az_f}$ to get the following measurement equation:

\begin{align*} h(u_A, v_A, \rho_A) & = {}^{C_i}_A\mathbf{R} \left( \begin{bmatrix} u_A \\ v_A \\ 1 \end{bmatrix} - \rho_A {}^A\mathbf{p}_{C_i} \right) \end{align*}

The feature measurement seen from the $\{C_i\}$ camera frame can be reformulated as:

\begin{align*} \mathbf{z} & = \begin{bmatrix} u_i \\ v_i \end{bmatrix} \\ &= \begin{bmatrix} h(u_A, v_A, \rho_A)(1) / h(u_A, v_A, \rho_A)(3) \\ h(u_A, v_A, \rho_A)(2) / h(u_A, v_A, \rho_A)(3) \\ \end{bmatrix} \\ & = \mathbf{h}(u_A, v_A, \rho_A) \end{align*}

Therefore, we can formulate the least-squares problem and its Jacobians:

\begin{align*} \operatorname*{argmin}_{u_A, v_A, \rho_A}||{\mathbf{z} - \mathbf{h}(u_A, v_A, \rho_A)}||^2 \end{align*}

\begin{align*} \frac{\partial \mathbf{h}(u_A, v_A, \rho_A)}{\partial {h}(u_A, v_A, \rho_A)} & = \begin{bmatrix} 1/h(\cdots)(3) & 0 & -h(\cdots)(1)/h(\cdots)(3)^2 \\ 0 & 1/h(\cdots)(3) & -h(\cdots)(2)/h(\cdots)(3)^2 \end{bmatrix} \\ \frac{\partial {h}(u_A, v_A, \rho_A)}{\partial [u_A, v_A, \rho_A]} & = {}^{C_i}_A\mathbf{R} \begin{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix} & -{}^A\mathbf{p}_{C_i} \end{bmatrix} \end{align*}

The least-squares problem can be solved with the Gauss-Newton or Levenberg-Marquardt algorithm.
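
Below is a minimal Gauss-Newton sketch of this inverse-depth refinement using Eigen. The fixed iteration count and all names are illustrative assumptions, not the ov_core implementation (which, per the above, could equally use Levenberg-Marquardt).

```cpp
#include <Eigen/Dense>
#include <vector>

// Minimal sketch: refine x = [u_A, v_A, rho_A]^T given normalized measurements
// uvs[i] = [u_i, v_i]^T in each camera frame {C_i}. x is seeded from triangulation.
void refine_inverse_depth(Eigen::Vector3d &x,
                          const std::vector<Eigen::Matrix3d> &R_AtoCi,
                          const std::vector<Eigen::Vector3d> &p_CiinA,
                          const std::vector<Eigen::Vector2d> &uvs) {
  for (int iter = 0; iter < 5; iter++) {
    Eigen::Matrix3d H = Eigen::Matrix3d::Zero();
    Eigen::Vector3d g = Eigen::Vector3d::Zero();
    for (size_t i = 0; i < uvs.size(); i++) {
      // h(u_A, v_A, rho_A) = R * ([u_A, v_A, 1]^T - rho_A * p_Ci)
      Eigen::Vector3d h =
          R_AtoCi[i] * (Eigen::Vector3d(x(0), x(1), 1.0) - x(2) * p_CiinA[i]);
      // Residual between the measurement and the projected prediction
      Eigen::Vector2d r = uvs[i] - Eigen::Vector2d(h(0) / h(2), h(1) / h(2));
      // Chain rule: d(projection)/dh times dh/d[u_A, v_A, rho_A]
      Eigen::Matrix<double, 2, 3> dproj_dh;
      dproj_dh << 1.0 / h(2), 0.0, -h(0) / (h(2) * h(2)),
                  0.0, 1.0 / h(2), -h(1) / (h(2) * h(2));
      Eigen::Matrix3d dh_dx;
      dh_dx << Eigen::Matrix<double, 3, 2>::Identity(), -p_CiinA[i];
      Eigen::Matrix<double, 2, 3> J = dproj_dh * R_AtoCi[i] * dh_dx;
      // Accumulate the Gauss-Newton normal equations H dx = g
      H += J.transpose() * J;
      g += J.transpose() * r;
    }
    // Solve and apply the update to the inverse-depth state
    x += H.ldlt().solve(g);
  }
}
```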


