10. 3D Reconstruction using Computer Vision

Reconstructing 3D structure and camera positions from a set of images.

Many applications: aerial mapping, AR apps, and 3D modelling of objects and scenes.

Most important algorithms: RANSAC and bundle adjustment.

Also known as structure from motion (SfM) or photogrammetry.

SLAM (simultaneous localization and mapping) is usually real-time, while SfM is offline. Closing the loop: once you recognize that you have visited a position previously, you need to back-propagate corrections through the model, which may have drifted.

Summary: calibrate the cameras, match features between images, estimate two-view geometry with RANSAC (a homography for planar scenes, the essential matrix otherwise), then refine all cameras and points with bundle adjustment.

Background

Camera Calibration

Camera calibration: map pixel coordinates to normalized image coordinates, correcting for factors such as lens distortion, focal length, and image centre.
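A sketch of this mapping (assuming intrinsics and distortion coefficients already estimated, e.g. by cv::calibrateCamera; the values below are placeholders): cv::undistortPoints converts pixel coordinates to normalized image coordinates.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    // Placeholder intrinsics K and distortion coefficients; in practice
    // these come from a calibration step such as cv::calibrateCamera.
    cv::Mat K = (cv::Mat_<double>(3, 3) << 800, 0, 320,
                                             0, 800, 240,
                                             0,   0,   1);
    cv::Mat dist = (cv::Mat_<double>(1, 5) << -0.1, 0.02, 0, 0, 0);

    std::vector<cv::Point2f> pixels = {{100.f, 200.f}, {400.f, 300.f}};
    std::vector<cv::Point2f> normalized;

    // Undo lens distortion and the focal length / image centre scaling,
    // yielding normalized image coordinates.
    cv::undistortPoints(pixels, normalized, K, dist);
    return 0;
}
```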

Feature Matching

The process of finding point features that appear in two adjacent images and matching them between the views.

Features are usually point features found using corner detectors.

Corner features should be: distinctive (locally unique in appearance, so they can be matched unambiguously) and repeatable (detected at the same physical point across changes in viewpoint and lighting).

cv::goodFeaturesToTrack can be used to find Harris corners.
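A minimal usage sketch (the filename and parameter values are placeholders):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat gray = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);

    std::vector<cv::Point2f> corners;
    cv::goodFeaturesToTrack(gray, corners,
                            500,            // maximum number of corners
                            0.01,           // quality level vs. best corner
                            10,             // minimum distance between corners
                            cv::noArray(),  // no mask
                            3,              // block size for corner measure
                            true,           // use the Harris corner measure
                            0.04);          // Harris free parameter k
    return 0;
}
```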

Representations of appearance: ‘feature descriptors’ e.g. image patches, SIFT, SURF

Matching may produce incorrect matches: objects which look the same but are different. These are "gross outliers": outliers whose location error is much higher (orders of magnitude higher) than expected.
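One standard mitigation for ambiguous matches is Lowe's ratio test on descriptor matches. A sketch, assuming OpenCV 4.4+ (where SIFT lives in the main module) and placeholder filenames:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat img1 = cv::imread("view1.png", cv::IMREAD_GRAYSCALE);
    cv::Mat img2 = cv::imread("view2.png", cv::IMREAD_GRAYSCALE);

    // Detect keypoints and compute SIFT descriptors in both images.
    cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
    std::vector<cv::KeyPoint> kps1, kps2;
    cv::Mat desc1, desc2;
    sift->detectAndCompute(img1, cv::noArray(), kps1, desc1);
    sift->detectAndCompute(img2, cv::noArray(), kps2, desc2);

    // Find the two nearest neighbours for each descriptor, then keep a
    // match only if it is clearly better than the runner-up; this rejects
    // many "looks the same but is different" matches.
    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(desc1, desc2, knn, 2);

    std::vector<cv::DMatch> good;
    for (const auto& m : knn)
        if (m.size() == 2 && m[0].distance < 0.75f * m[1].distance)
            good.push_back(m[0]);
    return 0;
}
```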

For feature registration, algorithms should be robust against: noise in the measured feature locations, and gross outliers from incorrect matches.

Homogeneous Coordinates

Homogeneous coordinates: add an extra dimension (e.g. 3D points become 4D points) with $w = 1$.

This allows rotation and translation to be represented by matrix multiplication, and projection to be represented by normalization.

Transform of point $\vec{X}$ with rotation $R$ and translation $\vec{t}$:

$$T = \begin{bmatrix} R & \vec{t} \\ \vec{0} & 1 \end{bmatrix}$$

Then multiply by $\begin{bmatrix} \vec{X} \\ 1 \end{bmatrix}$.
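Multiplying out shows the payoff: the result stays in homogeneous form, with the rotation and translation applied by a single matrix multiplication:

$$T \begin{bmatrix} \vec{X} \\ 1 \end{bmatrix} = \begin{bmatrix} R\vec{X} + \vec{t} \\ 1 \end{bmatrix}$$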

In homogeneous coordinates, points are equal up to scale: $(x, y, w)$ represents the 2D point $(x/w, y/w)$, and projection is exactly this normalization step.

2-view Reconstruction

Recover rotation, translation and 3D structure given two images.

Planar scenes: Homography Estimation

e.g. aerial images, AR apps. All points are on a single plane, making things easier - one less dimension to worry about and no occlusion.

Given two views of a planar scene, there are three unknowns: the rotation, the translation, and the plane normal.

A 3×3 matrix $H$ can represent the relative pose of the camera and the plane normal.

For an inlier match $\vec{x} \rightarrow \vec{x}'$: $H\vec{x} = \vec{x}'$ in normalized homogeneous coordinates, where $\vec{x}$ and $\vec{x}'$ are the projections of the same plane point in the first and second image.

$$H = R + \frac{\vec{t}\,\vec{n}^T}{d}$$

Where: $R$ and $\vec{t}$ are the rotation and translation between the two cameras, $\vec{n}$ is the unit normal of the plane, and $d$ is the distance from the first camera to the plane.

Note that there is scale ambiguity in the above formula: only the ratio $\vec{t}/d$ is recoverable, not the actual length of $\vec{t}$ or the true distance $d$.

$H\vec{x} = \vec{x}'$ can be used to estimate $H$ when you have at least four inlier feature matches. The function cv::findHomography implements this algorithm.

As this only works on inliers, and you don't know which points are inliers without $H$, RANSAC is used to simultaneously estimate $H$ and filter out outliers.
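A sketch of both steps (the function name estimatePlanarPose and the intrinsics K are illustrative; pts1/pts2 are assumed to be matched feature locations from the earlier step):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Estimate a homography between matched points with RANSAC, then list the
// candidate (R, t/d, n) decompositions given the camera intrinsics K.
cv::Mat estimatePlanarPose(const std::vector<cv::Point2f>& pts1,
                           const std::vector<cv::Point2f>& pts2,
                           const cv::Mat& K) {
    // `mask` records which matches RANSAC judged to be inliers.
    cv::Mat mask;
    cv::Mat H = cv::findHomography(pts1, pts2, cv::RANSAC,
                                   3.0,  // inlier reprojection threshold (px)
                                   mask);

    // Several (R, t/d, n) solutions are returned; extra constraints
    // (e.g. which side of the plane is visible) pick the physical one.
    std::vector<cv::Mat> Rs, ts, ns;
    cv::decomposeHomographyMat(H, K, Rs, ts, ns);
    return H;
}
```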

RANSAC

Random sample consensus. A general-purpose framework for fitting a model to data which has gross (very large) outliers.

Steps:

1. Randomly select a minimal sample of the data (just enough points to fit the model).
2. Fit the model to the sample.
3. Count how many data points agree with the model to within a threshold (the inliers).
4. Repeat many times, keeping the model with the most inliers.
5. Finally, re-fit the model using all inliers of the best model.

In the case of planar scenes, the minimal sample is 4 feature matches.
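A minimal sketch of this loop in generic form (the ransac name and the Fit/Error callables are placeholders for whatever model is being fitted):

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Generic RANSAC skeleton. `fit` fits a model to a minimal sample and
// `error` scores one data point against a model; both are supplied by
// the caller for the specific problem (homography, essential matrix, ...).
template <typename Model, typename Point, typename Fit, typename Error>
Model ransac(const std::vector<Point>& data, size_t sampleSize,
             int iterations, double inlierThreshold, Fit fit, Error error) {
    std::mt19937 rng{std::random_device{}()};
    Model best{};
    int bestInliers = -1;

    for (int i = 0; i < iterations; ++i) {
        // 1. Draw a random minimal sample (e.g. 4 matches for a homography).
        std::vector<Point> sample = data;
        std::shuffle(sample.begin(), sample.end(), rng);
        sample.resize(sampleSize);

        // 2. Fit the model to the sample.
        Model model = fit(sample);

        // 3. Count the data points the model explains within the threshold.
        int inliers = 0;
        for (const Point& p : data)
            if (error(model, p) < inlierThreshold) ++inliers;

        // 4. Keep the model with the largest consensus set.
        if (inliers > bestInliers) { bestInliers = inliers; best = model; }
    }
    // In practice the model is then re-fitted to all inliers of `best`.
    return best;
}
```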

3D Scenes: Essential Matrix Estimation

3×3 matrix that represents the relative pose (rotation $R$, translation $\vec{t}$) of two cameras:

$$\begin{aligned} E &= [\vec{t}]_\times R \\ [\vec{t}]_\times &= \begin{bmatrix} 0 & -t_3 & t_2 \\ t_3 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{bmatrix} \end{aligned}$$

For an inlier match, $\vec{x}'^T E \vec{x} = 0$ (the epipolar constraint).

RANSAC is used to compute $E$ from minimal samples of five feature matches (the five-point algorithm).

Conversion to a 3D structure: decompose $E$ into $R$ and $\vec{t}$ (there are four possible solutions; keep the one that places triangulated points in front of both cameras), then triangulate each inlier match into a 3D point, as in the sketch below.
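A sketch of the pipeline with OpenCV (the function name reconstructTwoView is illustrative; the matched points and intrinsics K are assumed to come from the earlier steps):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Two-view reconstruction sketch: estimate E, recover the pose, triangulate.
// pts1/pts2 are matched feature locations; K is the intrinsic matrix.
cv::Mat reconstructTwoView(const std::vector<cv::Point2f>& pts1,
                           const std::vector<cv::Point2f>& pts2,
                           const cv::Mat& K) {
    // Estimate E with five-point RANSAC; `mask` marks the inlier matches.
    cv::Mat mask;
    cv::Mat E = cv::findEssentialMat(pts1, pts2, K, cv::RANSAC,
                                     0.999, 1.0, mask);

    // Choose among the four (R, t) decompositions of E by keeping the one
    // that places triangulated points in front of both cameras.
    cv::Mat R, t;
    cv::recoverPose(E, pts1, pts2, K, R, t, mask);

    // Build the two projection matrices and triangulate (up to scale).
    // In practice only the inlier matches flagged in `mask` are used.
    cv::Mat P1 = K * cv::Mat::eye(3, 4, CV_64F);
    cv::Mat Rt, P2;
    cv::hconcat(R, t, Rt);
    P2 = K * Rt;

    cv::Mat points4D;  // homogeneous 4-vectors, one column per match
    cv::triangulatePoints(P1, P2, pts1, pts2, points4D);
    return points4D;
}
```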

N-view Reconstruction by Bundle Adjustment

RANSAC is suitable for two views, but 3D modelling may involve tens or hundreds of photos; aerial mapping often has several thousand.

Hence, bundle adjustment is needed for accurate 3D reconstruction.

Reprojection error: the distance between where a 3D point $\vec{X}$ is projected into an image, and where it is measured to lie, $\vec{x}$.

Reprojection error for a point: $\|\vec{x} - P\vec{X}\|$

Bundle adjustment finds the set of camera positions and 3D points which minimizes the total reprojection error.

That is, it finds the 3D points $\{\vec{X}_j : j = 1, \dots, M\}$ and cameras $\{P_i = (R_i, \vec{t}_i) : i = 1, \dots, N\}$ such that

$$C(\{\vec{X}_j\}, \{P_i\}) = \sum_i \sum_{j \text{ appearing in view } i} \|\vec{x}_{ij} - P_i \vec{X}_j\|^2$$

is minimized.

After RANSAC is run on pairs of images, non-linear least-squares optimization (gradient-based, typically Levenberg–Marquardt) is used to minimize the total reprojection error.
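As a sketch of the quantity being minimized (the View structure and totalReprojectionError are illustrative; this only evaluates the cost, it is not a full bundle adjuster):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// One camera: rotation (Rodrigues vector), translation, and the indices
// and measured pixel locations of the 3D points it observes.
struct View {
    cv::Mat rvec, tvec;
    std::vector<int> pointIndex;
    std::vector<cv::Point2f> measured;
};

// Total squared reprojection error C({X_j}, {P_i}) over all views; a
// bundle adjuster varies the cameras and points to minimize this value.
double totalReprojectionError(const std::vector<View>& views,
                              const std::vector<cv::Point3f>& points,
                              const cv::Mat& K) {
    double cost = 0.0;
    for (const View& v : views) {
        if (v.pointIndex.empty()) continue;

        // Gather the 3D points seen in this view and project them.
        std::vector<cv::Point3f> seen;
        for (int j : v.pointIndex) seen.push_back(points[j]);

        std::vector<cv::Point2f> projected;
        cv::projectPoints(seen, v.rvec, v.tvec, K, cv::noArray(), projected);

        // Accumulate squared distances to the measured image locations.
        for (size_t k = 0; k < projected.size(); ++k) {
            cv::Point2f d = v.measured[k] - projected[k];
            cost += double(d.x) * d.x + double(d.y) * d.y;
        }
    }
    return cost;
}
```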

Errors in reconstruction will remain: the absolute scale cannot be recovered from images alone, and drift accumulates over long image sequences.

However, this can be mitigated by the use of additional information such as GPS, or domain knowledge - for example, buildings usually have vertical planes and right-angles.

Extensions:

BA finds a sparse structure, but if objects are assumed to be convex, a mesh can be formed from the points. This contrasts with stereo, which returns a dense structure: a distance value for each pixel.