Scale and rotation invariant descriptors.
Correspondence using window matching: matching individual points is highly ambiguous, so windows of pixels are compared instead.
Stereo Cameras
Baseline: distance between the cameras. A wider baseline gives greater depth accuracy at long range, while a smaller baseline keeps the two views overlapping at close range. Increasing camera resolution also improves depth resolution overall.
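For reference, the standard rectified pinhole-stereo relation below makes this trade-off concrete (these symbols are not defined in the notes: $f$ is the focal length in pixels, $B$ the baseline, $d$ the disparity, $Z$ the depth):
$$ Z = \frac{fB}{d} \qquad \Rightarrow \qquad \left| \Delta Z \right| \approx \frac{Z^2}{fB} \left| \Delta d \right| $$
Depth error therefore grows quadratically with distance and shrinks with a wider baseline or a finer disparity step (higher resolution).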
Rectification: transformation of the images onto a common image plane (i.e. the sensor plane, but inverted). The image planes of the two cameras must be parallel, so that corresponding points lie on the same image row.
Correspondence using window matching:
- For each window and for every pixel offset, determine how well the two match; the minimum-cost offset gives the pixel offset (disparity) and thus the distance
- For a window $W_m(x, y)$ of size $m \times m$, the matching metric is the sum of squared pixel differences (SSD; see the sketch below)
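A minimal NumPy sketch of this brute-force search, assuming already-rectified grayscale float images `left` and `right` (the array names, window size, and disparity range are illustrative, not from the notes):

```python
import numpy as np

def ssd_disparity(left, right, window=5, max_disp=64):
    """Brute-force window matching: for each pixel, pick the horizontal
    offset (disparity) whose window has the smallest sum of squared
    differences. Assumes rectified, same-size grayscale float images."""
    half = window // 2
    h, w = left.shape
    disparity = np.zeros((h, w), dtype=np.float32)
    for y in range(half, h - half):
        for x in range(half, w - half):
            ref = left[y - half:y + half + 1, x - half:x + half + 1]
            best_cost, best_d = np.inf, 0
            for d in range(0, min(max_disp, x - half) + 1):
                cand = right[y - half:y + half + 1, x - d - half:x - d + half + 1]
                cost = np.sum((ref - cand) ** 2)  # sum of squared pixel differences
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disparity[y, x] = best_d
    return disparity
```

Real implementations vectorize this search or use a library routine such as OpenCV's StereoBM; the loops above are only meant to show the cost minimization.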
Image normalization: variation in sensor gain/sensitivity means normalization is recommended.
Window magnitude: $$ \left\Vert I \right\Vert_{W_m(x, y)} = \sqrt{\sum_{(u, v) \in W_m(x, y)}{\left[I(u, v)\right]^2}} $$
Average: $$ \bar{I} = \frac{1}{\left| W_m(x, y) \right|} \sum_{(u, v) \in W_m(x, y)}{I(u, v)} $$
Normalized pixel: $$ \hat{I}(u, v) = \frac{I(u, v) - \bar{I}}{\left\Vert I - \bar{I} \right\Vert_{W_m(x, y)}} $$
Vectorization: convert the window matrix into a vector by unwrapping it, concatenating its horizontal lines (rows). Denote the result as $\mathbf{w}(x, y)$.
Normalization scales the window vector to unit magnitude: $\hat{\mathbf{w}} = \mathbf{w} / \left\Vert \mathbf{w} \right\Vert$.
Distance ((normalized) sum of squared differences): $\left\Vert \hat{\mathbf{w}}_1 - \hat{\mathbf{w}}_2 \right\Vert^2$
Normalized correlation: $\hat{\mathbf{w}}_1 \cdot \hat{\mathbf{w}}_2 = \cos \theta$
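A small NumPy sketch of these normalized comparisons (function and variable names are illustrative; the mean subtraction follows the normalized-pixel definition above):

```python
import numpy as np

def normalize_window(window):
    """Unwrap a window into a vector, subtract its average, and scale it
    to unit magnitude so the metric is insensitive to gain/offset."""
    w = np.asarray(window, dtype=np.float64).ravel()   # vectorization (row by row)
    w = w - w.mean()                                   # remove the window average
    norm = np.linalg.norm(w)                           # window magnitude
    return w / norm if norm > 0 else w

def normalized_ssd(win1, win2):
    w1, w2 = normalize_window(win1), normalize_window(win2)
    return float(np.sum((w1 - w2) ** 2))   # small value = good match

def normalized_correlation(win1, win2):
    w1, w2 = normalize_window(win1), normalize_window(win2)
    return float(np.dot(w1, w2))           # close to 1 = good match (cos theta)
```

For unit vectors the two metrics are equivalent up to a monotonic transform: $\left\Vert \hat{\mathbf{w}}_1 - \hat{\mathbf{w}}_2 \right\Vert^2 = 2 - 2\, \hat{\mathbf{w}}_1 \cdot \hat{\mathbf{w}}_2$.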
Local Features
Aperture problem and normal flow: if you only have a partial view of a moving, one-dimensional object, you cannot always tell how it is moving (e.g. a moving line whose ends are outside the viewport).
Given velocities $u$ and $v$ of a pixel, brightness constancy gives a single linear constraint, the BCCE: $I_x u + I_y v + I_t = 0$ (one equation, two unknowns).
Normal flow, the vector representing translation of the line in the direction of its normal, can be written as: $$ \mathbf{u}_\perp = \frac{-I_t}{\left\Vert \nabla I \right\Vert} \cdot \frac{\nabla I}{\left\Vert \nabla I \right\Vert} $$
By considering multiple moving points (multiple constraints), the full velocity $(u, v)$ can be recovered.
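For reference, the BCCE follows from a first-order Taylor expansion of the brightness constancy assumption (a standard derivation, reconstructed here rather than copied from the notes):
$$ I(x + u\,\delta t,\; y + v\,\delta t,\; t + \delta t) \approx I(x, y, t) + I_x u\,\delta t + I_y v\,\delta t + I_t\,\delta t $$
Brightness constancy sets the left-hand side equal to $I(x, y, t)$; dividing by $\delta t$ gives $I_x u + I_y v + I_t = 0$.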
Lucas-Kanade
Assumes the same velocity for all pixels within the window, and that pixel intensities do not change between frames.
https://docs.opencv.org/4.5.0/d4/dee/tutorial_optical_flow.html
Solve using: $$ \begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = -\begin{bmatrix} \sum I_x I_t \\ \sum I_y I_t \end{bmatrix} $$ (sums taken over the window)
LHS: the sum over the window of the outer product of the gradient vector with itself, $\sum \nabla I \, \nabla I^\top$ (the structure tensor).
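A minimal sketch of solving this system for one window, assuming the spatial gradients `Ix`, `Iy` and the temporal difference `It` have already been computed over the window (the names and the conditioning threshold are illustrative):

```python
import numpy as np

def lucas_kanade_window(Ix, Iy, It):
    """Least-squares flow (u, v) for one window: build the structure
    tensor (sum of gradient outer products) and the right-hand side,
    then solve the 2x2 normal equations."""
    Ix, Iy, It = (np.asarray(a, dtype=np.float64).ravel() for a in (Ix, Iy, It))
    A = np.stack([Ix, Iy], axis=1)          # N x 2 matrix of gradients
    AtA = A.T @ A                           # sum of outer products (structure tensor)
    Atb = -A.T @ It                         # right-hand side
    if np.linalg.cond(AtA) > 1e6:           # poorly conditioned window: unreliable
        return None
    u, v = np.linalg.solve(AtA, Atb)
    return u, v
```

Singular or poorly conditioned windows are exactly the cases ruled out by the 'good features' criteria below.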
Good features:
- Satisfy brightness constancy
- Has sufficient texture variation
- But not too much texture variation: too many edges is also a problem, as it is hard to tell how the window moves along an edge (the aperture problem again)
- Corresponds to a ‘real’ surface patch (e.g. shadows do not correspond to a real surface)
- Does not deform too much over time
The previous equation can be written as $A^\top A \, \mathbf{v} = A^\top \mathbf{b}$, where $A$ stacks the gradient rows $\nabla I^\top$ of the window pixels, $\mathbf{b}$ stacks the corresponding $-I_t$ values, and $\mathbf{v} = (u, v)^\top$.
For this to be solvable:
- $A^\top A$ should be invertible
- $A^\top A$ should not be too small (signal-to-noise ratio): eigenvalues $\lambda_1$ and $\lambda_2$ should not be too small
- $A^\top A$ should be well-conditioned: $\lambda_1 / \lambda_2$ should not be too large (where $\lambda_1$ is the larger eigenvalue)
- Original (Harris) scoring function: $\lambda_1 \lambda_2 - k (\lambda_1 + \lambda_2)^2 = \det(A^\top A) - k \operatorname{trace}^2(A^\top A)$
- Shi-Tomasi scoring function: $\min(\lambda_1, \lambda_2)$, accepted if above a threshold
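The OpenCV tutorial linked above follows exactly this recipe (Shi-Tomasi corners tracked with pyramidal Lucas-Kanade); a condensed, hedged version, with placeholder file names and parameter values:

```python
import cv2
import numpy as np

prev = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)   # placeholder file names
curr = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

# Shi-Tomasi corners: keep points whose smaller eigenvalue is large enough.
pts0 = cv2.goodFeaturesToTrack(prev, maxCorners=200, qualityLevel=0.01, minDistance=7)

# Pyramidal Lucas-Kanade: track those points into the next frame.
pts1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, pts0, None,
                                             winSize=(15, 15), maxLevel=2)

good_new = pts1[status.flatten() == 1]
good_old = pts0[status.flatten() == 1]
flow = good_new - good_old   # per-feature displacement vectors
print(f"tracked {len(good_new)} features")
```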
Harris Detector
Uses auto-correlation to find ‘interesting’ points, i.e. points where there are significant intensity differences in all directions.
For a point $(x, y)$ and an image shift $(\Delta x, \Delta y)$, the auto-correlation function is: $$ E(\Delta x, \Delta y) = \sum_{(u, v) \in W(x, y)}{w(u, v) \left[ I(u + \Delta x, v + \Delta y) - I(u, v) \right]^2} $$
Avoiding discrete shifts: approximate the shifted image with a first-order Taylor expansion, $I(u + \Delta x, v + \Delta y) \approx I(u, v) + I_x(u, v)\, \Delta x + I_y(u, v)\, \Delta y$.
Auto-correlation matrix: $$ E(\Delta x, \Delta y) \approx \begin{bmatrix} \Delta x & \Delta y \end{bmatrix} M \begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix}, \qquad M = \sum_{(u, v) \in W(x, y)}{w(u, v) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}} $$
The matrix captures the structure of the local neighborhood. Interest can be measured using the matrix’s eigenvalues:
- 2 strong eigenvalues: interesting point
- 1 strong eigenvalue: contour
- 0 strong eigenvalues: uniform region
Interest point detection can be done by thresholding the interest measure, with local maxima (non-maximum suppression) used for localization.
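A short sketch of this measure using the matrix $M$ above, with a Gaussian window as the weighting (the constant $k \approx 0.04$ is the usual choice but is not given in these notes); OpenCV's cv2.cornerHarris computes the same response directly:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_response(image, sigma=1.5, k=0.04):
    """Harris interest measure R = det(M) - k * trace(M)^2, where M is the
    Gaussian-weighted auto-correlation (structure) matrix at each pixel."""
    img = np.asarray(image, dtype=np.float64)
    Ix = sobel(img, axis=1)                     # horizontal gradient
    Iy = sobel(img, axis=0)                     # vertical gradient
    Sxx = gaussian_filter(Ix * Ix, sigma)       # windowed sums of gradient products
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    det = Sxx * Syy - Sxy ** 2
    trace = Sxx + Syy
    return det - k * trace ** 2                 # threshold + local maxima afterwards
```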
Feature distortion:
- Model as affine transforms: parallel lines are preserved
- OpenCV: findFeatures
- Affine transforms: $u(x, y) = a_1 + a_2 x + a_3 y$, $v(x, y) = a_4 + a_5 x + a_6 y$
- Six parameters, min. six pixels per window
- Pass into the BCCE and minimize the error
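If two windows are assumed to be related by such an affine transform, the six parameters can also be estimated directly from point correspondences; a hedged OpenCV sketch (the point coordinates are made up for illustration):

```python
import cv2
import numpy as np

# Matched point locations in two frames (illustrative values only).
src = np.array([[10, 10], [50, 12], [30, 40], [70, 60], [20, 80], [90, 30]], dtype=np.float32)
dst = np.array([[12, 11], [52, 14], [33, 42], [73, 63], [23, 82], [93, 33]], dtype=np.float32)

# Robustly fit a 2x3 affine matrix [A | t] such that dst is approximately A @ src + t.
M, inliers = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC)
print(M)
```

Subtracting the identity from the 2x2 part of M gives the displacement coefficients $a_2, a_3, a_5, a_6$ in the model above, with the translation column giving $a_1, a_4$.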
Invariant Local Features
Local features that are invariant to translation, rotation, scale etc… They should have:
- Locality: features are local; robust to occlusion/clutter
- Distinctiveness: individual features can be matched to a large database of objects
- Quantity: even small objects should yield many features
- Efficiency: close to real-time performance
- Extensibility: can be extended for a wide range of differing feature types
SIFT: Scale-Invariant Feature Transform.
- Scale invariance:
- Gaussian pyramid
- Repeatedly blur, then halve the dimensions; the set of blurred images at one resolution is an octave
- Compute Difference of Gaussian (DoG): difference between neighboring Gaussian layers
- Approximation of Laplacian of Gaussian
- Compare each pixel against its 8 neighbors at the same scale, plus the 9 neighbors in each of the DoG scales directly above and below; use it as a keypoint if it is a minimum/maximum across all 26 neighbors
- Rotation invariance:
- Create histogram of local (i.e. neighboring pixel) gradient directions; each bin covers 10 degrees (36 bins)
- Canonical orientation = peak of the histogram (see the sketch after this list)
- Descriptor:
- 16x16 region in scale space around keypoint
- Rotate region to match canonical orientation
- Create orientation histograms on 4x4 pixel neighborhoods; 8 bins/orientations each
- Hence 16 neighborhoods with 8 bins each: 128 element vector
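A simplified sketch of the canonical-orientation step referenced above (full SIFT also Gaussian-weights contributions by distance from the keypoint and interpolates the peak; those refinements are omitted here):

```python
import numpy as np

def canonical_orientation(patch):
    """36-bin (10-degree) histogram of gradient directions over a patch
    around a keypoint, weighted by gradient magnitude; the peak bin gives
    the canonical orientation in degrees."""
    patch = np.asarray(patch, dtype=np.float64)
    gy, gx = np.gradient(patch)                            # row, column derivatives
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 360.0
    hist, _ = np.histogram(angle, bins=36, range=(0.0, 360.0), weights=magnitude)
    return 10.0 * np.argmax(hist) + 5.0                    # centre of the peak bin
```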
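Putting the whole pipeline together, OpenCV ships a SIFT implementation; a brief usage sketch with placeholder file names and a standard ratio-test threshold (0.75, not specified in these notes):

```python
import cv2

img1 = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)    # placeholder file names
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()                                  # DoG keypoints + 128-D descriptors
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors and keep matches passing Lowe's ratio test.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(good)} good matches")
```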