Reconstructing 3D structure and camera positions from a set of images.
Many applications:
- Robot control
- Self-driving cars
- Measuring
- Medical imaging
- Photo-realistic graphics
- AR
Most important algorithms: RANSAC and bundle adjustment.
Also known as structure from motion (SfM), photogrammetery.
SLAM (simultaneous localization and mapping) is usually real-time while SfM is offline. Closing the loop: once you recognize you have visited a position previously, you need to back-propagate changes to the model which may have drifted.
Summary:
- Homography
: relates relative pose of two cameras viewing a planar scene using RANSAC i.e. points all on the same plane - Essential matrix
: relates relative pose of two cameras viewing a 3D scene using RANSAC - Bundle adjustment: initialize using RANSAC (for
), estimate a set of 3D points and camera points which minimizes re-projection error
Background
Camera Calibration
Camera calibration: map pixel coordinates to normalized image coordinates: correct factors such as lens distortion, focal length, image center etc.
Feature Matching
Process of choosing point features that appear in two adjacent images.
Features are usually point features found using corner detectors.
Corner features should be:
- Repeatable - same corners found in every image
- Localizable: detected location is an actual 3D point
- e.g. one object behind another from the PoV of the camera - the intersection is not a real point
cv::goodFeaturesToTrack can be used to find Harris corners.
Representations of appearance: ‘feature descriptors’ e.g. image patches, SIFT, SURF
May get incorrect matches: objects which look the same but are different. “gross outliers” - outliers where location error is much higher (orders of magnitude higher) than expected.
For feature registration, algorithms should be robust against:
- Translation (scale change = z translation)
- Rotation (including skew)
- Illumination (colour shifts, shadows)
- Blur (motion/defocus blur - the former is easy to remove if you know the velocity)
- Non-rigid deformations
- Radial distortions
- Stretching
- Warping
- Intrinsic camera parameters
- Noise
- Gaussian filter, median filter
- Partial occlusion
- Camera gain changes
- Self similarity
Homogeneous Coordinates
Homogeneous coordinates: add an extra dimension (e.g. 3D points become 4D points) with
This allows matrix multiplication to be used to represent rotation, translation, and represent projection by normalization.
Transform of point
Then multiply by
In homogeneous coordinates:
- The camera is at
- Image points lie in the plane at
- Points can be mapped between the 2D sensor plane and 3D space
2-view Reconstruction
Recover rotation, translation and 3D structure given two images.
Planar scenes: Homography Estimation
e.g. aerial images, AR apps. All points are on a single plane, making things easier - one less dimension to worry about and no occlusion.
There are two views of a planar scene and three parameters: rotation, translation, and the plane normal.
A 3x3 matrix
Inlier match
Where:
-
is the relative orientation -
is the translation unit vector -
is the plane normal -
is the ratio of the distance to the plane to the translation
Note that there is scale ambiguity in the above formula - we only have
cv::findHomography implements this algorithm.
As this only works on inliers, and you don’t know which points are inliers without
RANSAC
Random sample consensus. A genera-purpose framework for fitting a model to data which has gross (very large) outliers.
Steps:
- Generate a hypothesis set: a randomly chosen set of points, the number of points being the minimum number needed to generate the model
- Hypothesis: generate a model from the hypothesis set
- Test: count the number of datapoints that would be inliers assuming the model is correct
- That is, count the number of features where
- That is, count the number of features where
- Repeat until enough points are inliers
- Generate a new model using all inlier points
In the case of planar scenes, 4 feature matches.
3D Scenes: Essential Matrix Estimation
3x3 matrix that represents the relative pose (rotation
For an inlier match,
RANSAC is used to compute
Conversion to a 3D structure:
- Pick cameras with poses (rotation and transforms)
and - For each inlier feature match
, solve and to find the 3D point - This creates a sparse set of 3D points
- http://hilandtom.com/tombotterill/code
N-view Reconstruction by Bundle Adjustment
RANSAC suitable for two views, but 3D modelling may have tens or hundreds of photos; aerial mapping often has several thousand.
Hence, bundle adjustment is needed for accurate 3D reconstruction.
Reprojection error: distance between where a 3D point
Re-projection error for point:
Bundle adjustments find the set of camera positions and 3D points which minimizes the total reprojection error.
That, is, it finds the 3D points
is minimized.
After RANSAC is run on pairs of images, non-linear gradient descent is used to minimize the total reprojection error.
Errors in reconstruction will remain:
- Remaining outliers
- Point localization errors
- Poorly-conditioned geometry
However, this can be mitigated by the use of additional information such as GPS, or domain knowledge - for example, buildings usually have vertical planes and right-angles.
Extensions:
- Camera calibration unknown
- Giant feature sets (millions) - optimization algorithm must be designed to reduce complexity
BA finds a sparse structure, but if objects are assumed to be convex, a mesh can be formed. This compares to stereo, which returns a dense structure - distance value for each pixel.