10. 3D Reconstruction using Computer Vision

Reconstructing 3D structure and camera positions from a set of images.

Many applications: aerial mapping, AR apps, and 3D modelling of objects and scenes.

Most important algorithms: RANSAC and bundle adjustment.

Also known as structure from motion (SfM) or photogrammetry.

SLAM (simultaneous localization and mapping) is usually real-time, while SfM is offline. Closing the loop: once you recognize that you have visited a position previously, you need to back-propagate corrections through the model, which may have drifted.

Summary: calibrate the cameras, match features between images, estimate two-view geometry with RANSAC (a homography for planar scenes, the essential matrix otherwise), then refine all cameras and points with bundle adjustment.

Background

Camera Calibration

Camera calibration: map pixel coordinates to normalized image coordinates, correcting for factors such as lens distortion, focal length, and image centre.
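A sketch of this mapping (assuming intrinsics and distortion coefficients already estimated, e.g. by cv::calibrateCamera; the values below are placeholders): cv::undistortPoints converts pixel coordinates to normalized image coordinates.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    // Placeholder intrinsics K and distortion coefficients; in practice
    // these come from a calibration step such as cv::calibrateCamera.
    cv::Mat K = (cv::Mat_<double>(3, 3) << 800, 0, 320,
                                             0, 800, 240,
                                             0,   0,   1);
    cv::Mat dist = (cv::Mat_<double>(1, 5) << -0.1, 0.02, 0, 0, 0);

    std::vector<cv::Point2f> pixels = {{100.f, 200.f}, {400.f, 300.f}};
    std::vector<cv::Point2f> normalized;

    // Undo lens distortion and the focal length / image centre scaling,
    // yielding normalized image coordinates.
    cv::undistortPoints(pixels, normalized, K, dist);
    return 0;
}
```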

Feature Matching

The process of finding point features that appear in two adjacent images and matching them between the views.

Features are usually point features found using corner detectors.

Corner features should be: distinctive (locally unique in appearance, so they can be matched unambiguously) and repeatable (detected at the same physical point across changes in viewpoint and lighting).

cv::goodFeaturesToTrack can be used to find Harris corners.
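A minimal usage sketch (the filename and parameter values are placeholders):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat gray = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);

    std::vector<cv::Point2f> corners;
    cv::goodFeaturesToTrack(gray, corners,
                            500,            // maximum number of corners
                            0.01,           // quality level vs. best corner
                            10,             // minimum distance between corners
                            cv::noArray(),  // no mask
                            3,              // block size for corner measure
                            true,           // use the Harris corner measure
                            0.04);          // Harris free parameter k
    return 0;
}
```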

Representations of appearance: ‘feature descriptors’ e.g. image patches, SIFT, SURF

Matching may produce incorrect matches: objects which look the same but are different. These are "gross outliers": outliers whose location error is much higher (orders of magnitude higher) than expected.
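One standard mitigation for ambiguous matches is Lowe's ratio test on descriptor matches. A sketch, assuming OpenCV 4.4+ (where SIFT lives in the main module) and placeholder filenames:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat img1 = cv::imread("view1.png", cv::IMREAD_GRAYSCALE);
    cv::Mat img2 = cv::imread("view2.png", cv::IMREAD_GRAYSCALE);

    // Detect keypoints and compute SIFT descriptors in both images.
    cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
    std::vector<cv::KeyPoint> kps1, kps2;
    cv::Mat desc1, desc2;
    sift->detectAndCompute(img1, cv::noArray(), kps1, desc1);
    sift->detectAndCompute(img2, cv::noArray(), kps2, desc2);

    // Find the two nearest neighbours for each descriptor, then keep a
    // match only if it is clearly better than the runner-up; this rejects
    // many "looks the same but is different" matches.
    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(desc1, desc2, knn, 2);

    std::vector<cv::DMatch> good;
    for (const auto& m : knn)
        if (m.size() == 2 && m[0].distance < 0.75f * m[1].distance)
            good.push_back(m[0]);
    return 0;
}
```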

For feature registration, algorithms should be robust against: noise in the measured feature locations, and gross outliers from incorrect matches.

Homogeneous Coordinates

Homogeneous coordinates: add an extra dimension (e.g. 3D points become 4D points) with $w = 1$.

This allows rotation and translation to be represented by matrix multiplication, and projection to be represented by normalization.

Transform of point $\vec{X}$ with rotation $R$ and translation $\vec{t}$:

$$T = \begin{bmatrix} R & \vec{t} \\ \vec{0} & 1 \end{bmatrix}$$

Then multiply by $\begin{bmatrix} \vec{X} \\ 1 \end{bmatrix}$.
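Multiplying out shows the payoff: the result stays in homogeneous form, with the rotation and translation applied by a single matrix multiplication:

$$T \begin{bmatrix} \vec{X} \\ 1 \end{bmatrix} = \begin{bmatrix} R\vec{X} + \vec{t} \\ 1 \end{bmatrix}$$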

In homogeneous coordinates, points are equal up to scale: $(x, y, w)$ represents the 2D point $(x/w, y/w)$, and projection is exactly this normalization step.

2-view Reconstruction

Recover rotation, translation and 3D structure given two images.

Planar scenes: Homography Estimation

e.g. aerial images, AR apps. All points are on a single plane, making things easier - one less dimension to worry about and no occlusion.

Given two views of a planar scene, there are three unknowns: the rotation, the translation, and the plane normal.

A 3×3 matrix $H$ can represent the relative pose of the camera and the plane normal.

For an inlier match $\vec{x} \rightarrow \vec{x}'$: $H\vec{x} = \vec{x}'$ in normalized homogeneous coordinates, where $\vec{x}$ and $\vec{x}'$ are the projections of the same plane point in the first and second image.

$$H = R + \frac{\vec{t}\,\vec{n}^T}{d}$$

Where: $R$ and $\vec{t}$ are the rotation and translation between the two cameras, $\vec{n}$ is the unit normal of the plane, and $d$ is the distance from the first camera to the plane.

Note that there is scale ambiguity in the above formula: only the ratio $\vec{t}/d$ is recoverable, not the actual length of $\vec{t}$ or the true distance $d$.

$H\vec{x} = \vec{x}'$ can be used to estimate $H$ when you have at least four inlier feature matches. The function cv::findHomography implements this algorithm.

As this only works on inliers, and you don't know which points are inliers without $H$, RANSAC is used to simultaneously estimate $H$ and filter out outliers.
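A sketch of both steps (the function name estimatePlanarPose and the intrinsics K are illustrative; pts1/pts2 are assumed to be matched feature locations from the earlier step):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Estimate a homography between matched points with RANSAC, then list the
// candidate (R, t/d, n) decompositions given the camera intrinsics K.
cv::Mat estimatePlanarPose(const std::vector<cv::Point2f>& pts1,
                           const std::vector<cv::Point2f>& pts2,
                           const cv::Mat& K) {
    // `mask` records which matches RANSAC judged to be inliers.
    cv::Mat mask;
    cv::Mat H = cv::findHomography(pts1, pts2, cv::RANSAC,
                                   3.0,  // inlier reprojection threshold (px)
                                   mask);

    // Several (R, t/d, n) solutions are returned; extra constraints
    // (e.g. which side of the plane is visible) pick the physical one.
    std::vector<cv::Mat> Rs, ts, ns;
    cv::decomposeHomographyMat(H, K, Rs, ts, ns);
    return H;
}
```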

RANSAC

Random sample consensus. A general-purpose framework for fitting a model to data which has gross (very large) outliers.

Steps:

1. Randomly select a minimal sample of the data (just enough points to fit the model).
2. Fit the model to the sample.
3. Count how many data points agree with the model to within a threshold (the inliers).
4. Repeat many times, keeping the model with the most inliers.
5. Finally, re-fit the model using all inliers of the best model.

In the case of planar scenes, the minimal sample is 4 feature matches.
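A minimal sketch of this loop in generic form (the ransac name and the Fit/Error callables are placeholders for whatever model is being fitted):

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Generic RANSAC skeleton. `fit` fits a model to a minimal sample and
// `error` scores one data point against a model; both are supplied by
// the caller for the specific problem (homography, essential matrix, ...).
template <typename Model, typename Point, typename Fit, typename Error>
Model ransac(const std::vector<Point>& data, size_t sampleSize,
             int iterations, double inlierThreshold, Fit fit, Error error) {
    std::mt19937 rng{std::random_device{}()};
    Model best{};
    int bestInliers = -1;

    for (int i = 0; i < iterations; ++i) {
        // 1. Draw a random minimal sample (e.g. 4 matches for a homography).
        std::vector<Point> sample = data;
        std::shuffle(sample.begin(), sample.end(), rng);
        sample.resize(sampleSize);

        // 2. Fit the model to the sample.
        Model model = fit(sample);

        // 3. Count the data points the model explains within the threshold.
        int inliers = 0;
        for (const Point& p : data)
            if (error(model, p) < inlierThreshold) ++inliers;

        // 4. Keep the model with the largest consensus set.
        if (inliers > bestInliers) { bestInliers = inliers; best = model; }
    }
    // In practice the model is then re-fitted to all inliers of `best`.
    return best;
}
```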

3D Scenes: Essential Matrix Estimation

3×3 matrix that represents the relative pose (rotation $R$, translation $\vec{t}$) of two cameras:

$$\begin{aligned} E &= [\vec{t}]_\times R \\ [\vec{t}]_\times &= \begin{bmatrix} 0 & -t_3 & t_2 \\ t_3 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{bmatrix} \end{aligned}$$

For an inlier match, $\vec{x}'^T E \vec{x} = 0$ (the epipolar constraint).

RANSAC is used to compute $E$ from minimal samples of five feature matches (the five-point algorithm).

Conversion to a 3D structure: decompose $E$ into $R$ and $\vec{t}$ (there are four possible solutions; keep the one that places triangulated points in front of both cameras), then triangulate each inlier match into a 3D point, as in the sketch below.
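A sketch of the pipeline with OpenCV (the function name reconstructTwoView is illustrative; the matched points and intrinsics K are assumed to come from the earlier steps):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Two-view reconstruction sketch: estimate E, recover the pose, triangulate.
// pts1/pts2 are matched feature locations; K is the intrinsic matrix.
cv::Mat reconstructTwoView(const std::vector<cv::Point2f>& pts1,
                           const std::vector<cv::Point2f>& pts2,
                           const cv::Mat& K) {
    // Estimate E with five-point RANSAC; `mask` marks the inlier matches.
    cv::Mat mask;
    cv::Mat E = cv::findEssentialMat(pts1, pts2, K, cv::RANSAC,
                                     0.999, 1.0, mask);

    // Choose among the four (R, t) decompositions of E by keeping the one
    // that places triangulated points in front of both cameras.
    cv::Mat R, t;
    cv::recoverPose(E, pts1, pts2, K, R, t, mask);

    // Build the two projection matrices and triangulate (up to scale).
    // In practice only the inlier matches flagged in `mask` are used.
    cv::Mat P1 = K * cv::Mat::eye(3, 4, CV_64F);
    cv::Mat Rt, P2;
    cv::hconcat(R, t, Rt);
    P2 = K * Rt;

    cv::Mat points4D;  // homogeneous 4-vectors, one column per match
    cv::triangulatePoints(P1, P2, pts1, pts2, points4D);
    return points4D;
}
```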

N-view Reconstruction by Bundle Adjustment

RANSAC is suitable for two views, but 3D modelling may involve tens or hundreds of photos; aerial mapping often has several thousand.

Hence, bundle adjustment is needed for accurate 3D reconstruction.

Reprojection error: the distance between where a 3D point $\vec{X}$ is projected into an image, and where it is measured to lie, $\vec{x}$.

Reprojection error for a point: $\|\vec{x} - P\vec{X}\|$

Bundle adjustment finds the set of camera positions and 3D points which minimizes the total reprojection error.

That is, it finds the 3D points $\{\vec{X}_j : j = 1, \dots, M\}$ and cameras $\{P_i = (R_i, \vec{t}_i) : i = 1, \dots, N\}$ such that

$$C(\{\vec{X}_j\}, \{P_i\}) = \sum_i \sum_{j \text{ appearing in view } i} \|\vec{x}_{ij} - P_i \vec{X}_j\|^2$$

is minimized.

After RANSAC is run on pairs of images, non-linear least-squares optimization (gradient-based, typically Levenberg–Marquardt) is used to minimize the total reprojection error.
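As a sketch of the quantity being minimized (the View structure and totalReprojectionError are illustrative; this only evaluates the cost, it is not a full bundle adjuster):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// One camera: rotation (Rodrigues vector), translation, and the indices
// and measured pixel locations of the 3D points it observes.
struct View {
    cv::Mat rvec, tvec;
    std::vector<int> pointIndex;
    std::vector<cv::Point2f> measured;
};

// Total squared reprojection error C({X_j}, {P_i}) over all views; a
// bundle adjuster varies the cameras and points to minimize this value.
double totalReprojectionError(const std::vector<View>& views,
                              const std::vector<cv::Point3f>& points,
                              const cv::Mat& K) {
    double cost = 0.0;
    for (const View& v : views) {
        if (v.pointIndex.empty()) continue;

        // Gather the 3D points seen in this view and project them.
        std::vector<cv::Point3f> seen;
        for (int j : v.pointIndex) seen.push_back(points[j]);

        std::vector<cv::Point2f> projected;
        cv::projectPoints(seen, v.rvec, v.tvec, K, cv::noArray(), projected);

        // Accumulate squared distances to the measured image locations.
        for (size_t k = 0; k < projected.size(); ++k) {
            cv::Point2f d = v.measured[k] - projected[k];
            cost += double(d.x) * d.x + double(d.y) * d.y;
        }
    }
    return cost;
}
```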

Errors in reconstruction will remain: the absolute scale cannot be recovered from images alone, and drift accumulates over long image sequences.

However, this can be mitigated by the use of additional information such as GPS, or domain knowledge - for example, buildings usually have vertical planes and right-angles.

Extensions:

BA finds a sparse structure, but if objects are assumed to be convex, a mesh can be formed from the points. This contrasts with stereo, which returns a dense structure: a distance value for each pixel.