01. Introduction
Weighting:
- 50%: research project presented as a conference style paper, including a relevant literature review (due week 8)
- 15-25 references in the paper
- Deep learning: use when it will give you a better result
- No novelty required
- Possible projects:
- Automatic timecode/GPS coordinate extraction from dashcam/surveillance cameras?
- Censoring faces in real-time (except specific people)
- AR angry birds with finger dragging on table
- Automatically turning screen off when no one is looking at the screen
- Othello/chess state detection
- Russian flag replacement (with Ukraine)
- No-mask detection (stretch goal: if holding drink/food, they get a pass)
- Finger tracking for cursor movement (e.g. hand wet)
- Bread baking? Does the rate at which bread rises drop near the end of the baking cycle?
- Port/cable detection (e.g. USB-C, HDMI)
- Virtual touch bar: mirror on webcam to view keyboard area, add stickers and detect touch to change brightness
- Drink detection: detect when you take a drink, alert if you haven’t taken a sip in an hour
- Deep learning: feel free to use it where beneficial
- Classification: is there an x (and possibly a y or z) in the image or not?
- Object detection: where is the object in the image?
- Dense/semantic segmentation: recognizing what ‘object’ (e.g. sky) each pixel belongs to
- Instance segmentation: recognizing and locating multiple instances of an object
- Tracking: tracking the motion of objects
- 10%: class participation/presentations (presentations in final weeks of semester)
- 40%: final exam
Goal: to recognize objects and their motions.
Signal processing on 1D data; computer vision on 2D data.
Image processing on still images; computer vision on video and still images.
Difficulties
The sensory gap: gap between reality and what a recording of the scene can capture.
The semantic gap: lack of contextual knowledge about a scene.
The human visual system is very good:
- ~50% of cerebral cortex used for vision
- Emulating the operation of human behaviours:
- Perception based on relative brightness (e.g. Checker shadow illusion)
- Logarithmic response to brightness
- Low-light performance (for passive systems)
Recovering 3D information
Several cues available:
- Motion
- Stereo (< 3m)
- Texture
- Shading
- Contours
Or actual depth hardware:
- Active stereo (IR - brightness limited for safety)
- Structured light: dot projection
- Time of flight
- RF-modulated
- Direct timing (LIDAR)
- Brightness not as limited, as the eye's exposure duration is so short
Labs: Intel Realsense D435.
Processing
- Pre-processing: noise reduction, contrast etc.
- Low-level: color, boundary/edge, shape, texture detection
- High-level: object detection, determining spatial relationships, meanings
The higher-level the processing, the less generic and the more domain-specific knowledge is required.
Low-Level Image Processing:
- Image compression
- Noise reduction (while maximizing information kept)
- Edge extraction
- Contrast enhancement (only helps humans)
- Segmentation
- Thresholding
- Morphology
- Image restoration (e.g. deblurring using knowledge of speed of subject)
Approaches
- Build a simple model of the world; constrain the environment (e.g. fixed lighting environment)
- Focus on definite tasks with clear requirements
- Find provably good algorithms
- Try ideas based on theory
- Experiment on real world
- Solutions may not generalize to a more complex environment
- Update the model
02. Perception and Color
Color Physics
Simplified rendering model: illumination * reflectance (relative energy, 0 to 1) = color signal.
For project: use consistent light source to prevent color shifting
Human vision system: optimized for spectrum from the sun (greatest intensity at ~500 nm).
Reflectance spectra: reflectance (0 to 1) as a function of wavelength. Eyes (and cameras) simplify this signal by reducing it to intensities detected via three types of sensors; hence, objects with different spectral albedos may be perceived as having the same color (metamerism).
Spectral colors can be described as spectra with a single peak in some given wavelength range.
Color mixing:
- Additive color mixing: colors combine by adding intensity (e.g. displays)
- Subtractive color mixing: colors combine by multiplying intensity (e.g. paint)
Color Perception
The Human Eye
Specifications:
- Spectral resolution: 400 - 700 nm
- c.f. ears which have a far wider range of 10 octaves (20 Hz to 20 kHz)
- Dynamic range: similar to that of the ears
- Spatial resolution: approx. 1-3 cm at 20 m
- Radiometric resolution (differentiating between two colors after a time gap): around 16-32 shades of black and white, ~100 colors
Physiology:
- Lens and fluid causes signal attenuation
- Rods (black and white)
- ~ 100 million rods
- Extremely sensitive: can be excited by a single photon
- ~170 degree field-of-view per eye
- Cones (color):
- ~ 6.5 million cones
- Much larger than rods
- A vast majority centered around the fovea
- Three types (S, M and L, peaking at short, medium and long wavelengths):
- The M and L spectra have a large overlap
- S (blue) cones are much less sensitive than the other two types
- Cameras usually require IR filters and use a Bayer pattern (two green pixels for each red and blue)
- Dark-adapted (scotopic) vision more sensitive to shorter wavelengths (almost blind to red)
- Light-adapted (photopic) vision sensitive to all visible wavelengths, peaking at 555 nm
Brain and processing:
- Signals from left visual field of each eye go to right visual cortex and vice-versa
- ~800,000 nerve fibers but ~100 million receptors; one fiber carries signals from multiple rods/cones
Color Spaces
Although perceived color depends on illumination and surroundings, assume for now that the spectrum of light arriving at the eye fully determines the perceived color.
- CIE 1931 XYZ color space
- Subjects would vary intensity of RGB lights to match test color, adding color to the test color if necessary (leading to negative values for red)
- Perceptually uniform color space: any change of a given distance in the color space looks the same to the human eye regardless of the starting color?
- Not great for computer vision
- HSV cone:
- Angle = hue
- Radius = saturation
- Value = height
- At the bottom is the tip of the cone - black, which can only have zero saturation
- Recommended color space: CIECAM02 (JCh)
03. Cameras and Lenses
Pinhole Cameras
Projection Equation
Let $P = (X, Y, Z)$ be a point in the scene and $p = (x, y)$ its projection onto the image plane at focal distance $f$. Then:
$$ x = f\frac{X}{Z}, \qquad y = f\frac{Y}{Z} $$
Lenses
Pinhole cameras must balance diffraction (aperture too small) and light ray convergence (aperture too large). Lenses allow far more light to pass through while still allowing light rays to converge on the same point on the sensor plane for light from a specific distance.
The most basic approximation of a lens is the thin lens, which assumes that the lens has zero thickness. More accurate models:
- Assume finite lens thickness
- Use a higher-order approximation for $\sin\theta$ (the thin lens model uses only the first-order, paraxial approximation)
- Take chromatic aberrations, where different wavelengths have different refractive indices and thus focus at different points on the sensor, into account
- Coatings can be used to minimize this
- Consider the impacts of reflection on the lens surface
- Once again, coatings can minimize this and maximize light transmission
- Consider vignetting
Camera Calibration
Used to determine relationship between image coordinates and real-world coordinates - geometric camera calibration.
Intrinsic Parameters
Improvements are needed to the basic projection model above to consider:
- Scaling: convert real-world coordinates on the sensor to pixel space by multiplying the $x$ and $y$ values by per-axis scale factors, taking into account that the pixels may not be square
- Origin: need to translate by the principal point $(x_0, y_0)$, as the center of the sensor is usually not the pixel origin
- Skew: the camera pixel axes may be skewed by an angle $\theta$
As a matrix (the intrinsic matrix $K$):
$$ K = \begin{pmatrix} \alpha & s & x_0 \\ 0 & \beta & y_0 \\ 0 & 0 & 1 \end{pmatrix} $$
In more compact notation: $$ \overrightarrow{p} = \frac{1}{z} \begin{pmatrix} K & \overrightarrow{0} \end{pmatrix} \overrightarrow{P} $$
Where $K$ is the matrix of intrinsic parameters.
Then, the extrinsic parameters - the translation and rotation of the camera frame relative to the world frame - must be taken into account, further complicating things: $$ {}^{C}P = {}^{C}_{W}R \; {}^{W}P + {}^{C}O_{W} $$ Combining the two: $$ \overrightarrow{p} = \frac{1}{z} K\begin{pmatrix} {}^{C}_{W}R & {}^{C}O_{W} \end{pmatrix} \overrightarrow{P} = \frac{1}{z} M \overrightarrow{P} $$
By using these equations on many known features, we can find the value of $M$ and hence the camera parameters (a small numeric sketch follows).
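A small numpy sketch of applying these equations, using made-up intrinsics and extrinsics (all values below are placeholders, not calibration results from the course):

```python
import numpy as np

# Hypothetical intrinsics: focal length 800 px, principal point (320, 240), no skew
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Hypothetical extrinsics: rotate 10 degrees about the y axis, translate 0.5 m along x
theta = np.deg2rad(10)
R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
              [ 0.0,           1.0, 0.0          ],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([[0.5], [0.0], [0.0]])

M = K @ np.hstack([R, t])                      # 3x4 projection matrix

P = np.array([[0.2], [0.1], [4.0], [1.0]])     # homogeneous world point
p = M @ P
p = p / p[2]                                   # divide by z to get pixel coordinates
print("pixel coordinates:", p[:2].ravel())
```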
04. Filters
Modifying the pixels of an image based on a function taking input from pixels in the target pixel’s local neighborhood.
Linear Functions
When a filter is linear, the value of each pixel is a linear combination of its neighbors:
$$ g(x, y) = \sum_{(u, v)} k(u, v)\, f(x - u, y - v) $$
Where $f$ is the input image, $k$ the kernel, and $g$ the output. That is, each output pixel is the dot product of a kernel with the pixel's local neighborhood. This process is called a convolution.
Convolution kernels can be used to make a Gaussian filter - a blur. Blurring/smoothing the image reduces noise - high-frequency information.
Multiple linear functions can be stacked - in addition to convolutions - multiplication by a kernel, addition/subtraction are also valid operations.
For example, to sharpen an image, you can multiply the pixel values (and hence both low- and high-frequency information) by 2, then subtract a blurred (i.e. low-pass filtered) version of the image, leaving an image with its high-frequency information (edges) over-accentuated. In pseudo-code: 2 * img(x, y) - conv(gaussian(sigma), img)(x, y). The difference between the image and its blurred version approximates the Laplacian (sum of second partial derivatives) of Gaussian filter.
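A hedged OpenCV sketch of this sharpening (unsharp-mask) idea; the file names and sigma are placeholders:

```python
import cv2

img = cv2.imread("input.png")                      # placeholder path
blurred = cv2.GaussianBlur(img, (0, 0), sigmaX=3)  # low-pass filtered copy
# 2*img - blurred == img + (img - blurred): accentuates high-frequency detail
sharpened = cv2.addWeighted(img, 2.0, blurred, -1.0, 0)
cv2.imwrite("sharpened.png", sharpened)
```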
Gradients and Edges
An edge point is a single point; a series of edge points forms a line.
An edge is a point of sharp change (reflectance, object, illumination, noise) in an image.
The general strategy is to use linear filters to estimate the image gradient, then mark points where the change in magnitude is large in comparison to its neighbors.
Fourier Transforms
The Fourier transform of a real function is complex - in this course, the phase component is ignored; we only care about magnitude.
All natural images have a similar magnitude spectrum - running the inverse transform with the magnitude taken from another image returns similar results.
In the magnitude image generated from a fourier transform, the center of the image equals zero frequency while the edges of the image have higher frequency. Hence, by masking the image with a circle and then applying the inverse transform, either a low- (outside masked out) or high- (inside masked out) pass filter can be generated.
In the transform, the input image is tiled to infinity: this may cause discontinuities to occur at the edges. However, the effects of this can be mitigated by fading the image to grey near the edges (e.g. in a circle with a Gaussian).
https://homepages.inf.ed.ac.uk/rbf/HIPR2/fourier.htm
05. Edge Detection
Challenge: convert a 2D image into a set of curves.
That is, find all edge points that are (mostly) connected, then join them into curves.
Edges come from:
- Surface-normal discontinuity (e.g. from shading)
- Depth discontinuity
- Surface color discontinuity
- Illumination discontinuity
Edge profiles:
Edge profiles: step, roof, line.
Edge detection:
- Detect short linear edge segments - edgels
- Aggregate edgels into extended edges
Edges are the points at which the rate of change reaches a maximum - when the second derivative is zero.
The gradient, which points in the direction of greatest change, can be represented by an angle and magnitude: $\theta = \arctan\!\left(\frac{\partial I/\partial y}{\partial I/\partial x}\right)$, $\left\Vert \nabla I \right\Vert = \sqrt{\left(\frac{\partial I}{\partial x}\right)^2 + \left(\frac{\partial I}{\partial y}\right)^2}$.
However, on a discrete image, approximations can be made.
The Sobel operator:
$$ G_x = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}, \qquad G_y = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix} $$
(NB: need to scale by 1/8 to get the true gradient value.)
The Roberts Cross operator (for diagonal edges):
$$ \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}, \qquad \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} $$
The 4×4 Prewitt operator:
Can use trigonometry to determine direction of the gradient using the horizontal and vertical Sobel/Prewitt operators.
Looking at only the adjacent pixel may be useless in situations with large amounts of noise. Hence, use a Gaussian kernel to smooth the image before applying the gradient operator. However, this can be done more efficiently by applying the derivative function to the Gaussian kernel, then applying it to the signal (the derivative theorem of convolution).
Then, the point of maximum gradient - where the second derivative crosses zero - can be found to detect the edge.
The sum of the second partial derivatives is called the Laplacian.
Canny Edge Detection
An optimal edge detection algorithm should have:
- Good detection: responds to edges, not noise
- Good localization: detected edge is close to the true edge
- Single edge: one edge detected per true edge
Under the assumptions of a linear filter and independent and identically distributed Gaussian noise, the optimal detector is approximately the derivative of the Gaussian.
Detection and localization are diametrically opposed to each other: more smoothing leads to better detection but worse localization.
- Gaussian blur: reduce noise
- Use the Sobel kernel to find the gradient in the $x$ and $y$ directions
- Find the gradient magnitude and direction (rounded to 45 degree increments)
Non-maximum suppression/edge thinning:
- Zero the pixel value if the magnitude not a maximum compared to neighbors in the relevant direction
- Can predict the next edge point by moving along the normal to the gradient
- Hysteresis thresholding (through double thresholding):
- Any pixel whose magnitude is larger than some threshold is kept
- Any pixel whose magnitude is smaller than some threshold is removed
- Any pixel in between the thresholds is kept if it is connected to a remaining pixel
A low value of $\sigma$ detects fine-scale edges; a high value detects large-scale edges.
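A minimal OpenCV sketch of the Canny pipeline above (the path and thresholds are placeholders to tune per image):

```python
import cv2

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
blurred = cv2.GaussianBlur(img, (5, 5), 1.4)          # noise reduction before the gradients
edges = cv2.Canny(blurred, 50, 150)                   # 50/150 = low/high hysteresis thresholds
cv2.imwrite("edges.png", edges)
```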
Scale Space
As $\sigma$ increases:
- Smoothing/blurring increases
- Noise edges disappear
- Edges become smoother
- Fine/high-frequency detail is removed
Multiple representations of the image at different scales can be generated.
If an edge detection algorithm is applied to all the images, edges can be matched between images:
- Edges may disappear or merge as scale increases - this can be used to determine how ‘strong’ an edge is
- Detected edge positions may change with scale
- Edges will never split as scale increases
Edge Detection by Subtraction
Subtract a Gaussian-blurred image from the original, then scale and add an offset. This works as low-frequency information mostly remains in the blurred image and hence gets removed in the subtraction. This set of operations approximates the Laplacian of Gaussian.
Hough Line Detection
Finds straight lines from binary image (e.g. output of Canny algorithm).
Uses a voting scheme instead of naively searching for lines at every single position and orientation.
The Hough space is a transform of the image space into $(\rho, \theta)$ line-parameter space:
- $\rho$ is the shortest distance of the line to the origin
- $\theta$ is the angle of the line
- Each edge point $(x, y)$ is transformed to the curve $\rho = x\cos\theta + y\sin\theta$ in Hough space
For each point (that is an edge pixel) and for every angle $\theta$, find $\rho$ and add a vote to that $(\rho, \theta)$ bin.
The points with the largest vote counts are the straight lines we are most confident in.
The Hough circle transform does the same, except using the parametric equation of a circle (center and radius), giving a three-dimensional parameter space.
The same technique works for any curve that can be expressed in parametric form, although the parameter space can get huge.
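A small OpenCV sketch of Hough line detection on a Canny edge image (the file name and vote threshold are placeholders):

```python
import cv2
import numpy as np

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)   # placeholder path
edges = cv2.Canny(img, 50, 150)                        # binary edge image as input

# rho resolution = 1 px, theta resolution = 1 degree, threshold = minimum votes
lines = cv2.HoughLines(edges, 1, np.pi / 180, 150)
out = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
if lines is not None:
    for rho, theta in lines[:, 0]:
        a, b = np.cos(theta), np.sin(theta)
        x0, y0 = a * rho, b * rho                      # closest point on the line to the origin
        p1 = (int(x0 - 1000 * b), int(y0 + 1000 * a))  # extend along the line direction
        p2 = (int(x0 + 1000 * b), int(y0 - 1000 * a))
        cv2.line(out, p1, p2, (0, 0, 255), 1)
cv2.imwrite("lines.png", out)
```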
Corners
Doors and corners, kid. That’s where they get you
Detective Miller
What is the gradient at the corner? Near the corner, edges have gradients going in two different directions and at the corner, the gradient is ill-defined. Hence, edge detectors tend to fail at corners.
Corners, however, are useful for tracking movement between frames.
Harris Corner Detection
Over a small window, a corner is likely to have high intensity variation in all directions. This uses the sum-squared difference.
Given:
- $I(x, y)$ gives the intensity of a pixel
- $w(x, y)$ is a window function that determines the weight of each pixel (e.g. Gaussian) relative to the target pixel
- An offset $(u, v)$ that separates two windows
For all pixels in the window:
$$ E(u, v) = \sum_{(x, y)} w(x, y)\left[I(x + u, y + v) - I(x, y)\right]^2 $$
That is, given two windows, calculate the difference between each pair of pixels, square them, and sum them.
Using the Taylor expansion $I(x + u, y + v) \approx I(x, y) + u I_x + v I_y$, this can be approximated, giving the approximation:
$$ E(u, v) \approx \begin{pmatrix} u & v \end{pmatrix} M \begin{pmatrix} u \\ v \end{pmatrix} $$
Where:
$$ M = \sum_{(x, y)} w(x, y) \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix} $$
Given the eigenvalues $\lambda_1$ and $\lambda_2$ of $M$, the corner response can be scored as $R = \det M - k(\operatorname{tr} M)^2 = \lambda_1\lambda_2 - k(\lambda_1 + \lambda_2)^2$.
Corners have large values of both eigenvalues (and hence a large $R$).
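An OpenCV sketch of Harris corner detection (the path, window size and thresholds are placeholders):

```python
import cv2
import numpy as np

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)   # placeholder path
# blockSize = window size, ksize = Sobel aperture, k = Harris constant (typically 0.04-0.06)
response = cv2.cornerHarris(np.float32(img), blockSize=2, ksize=3, k=0.04)

out = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
out[response > 0.01 * response.max()] = (0, 0, 255)    # threshold the response map
cv2.imwrite("corners.png", out)
```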
06. Local Features
Scale and rotation invariant descriptors.
Correspondence using window matching: matching single points is highly ambiguous, so windows of pixels are compared instead.
Stereo Cameras
Baseline: distance between cameras. Wider baseline allows greater accuracy further away, while a smaller baseline allows overlap at closer distances. Increased camera resolution can increase depth resolution overall.
Rectification: transform of images onto a common image plane (i.e. sensor, but inverted). The image planes of the two cameras must be parallel.
Correspondence using window matching:
- For each window and for every pixel offset, determine how well the two match; the best-matching offset gives the pixel disparity and thus the distance
- For a window $W_m(x, y)$ of size $(2m + 1) \times (2m + 1)$:
- Matching metric: sum of squared pixel differences between the two windows
- Image normalization: variation in sensor gain/sensitivity means normalization is recommended.
- Window magnitude: $$ \left\Vert I \right\Vert_{W_m(x, y)} = \sqrt{\sum_{(u, v) \in W_m(x, y)}{\left[I(u, v)\right]^2}} $$
- Average: $$ \bar{I}_{W_m(x, y)} = \frac{1}{(2m + 1)^2}\sum_{(u, v) \in W_m(x, y)}{I(u, v)} $$
- Normalized pixel: $$ \hat{I}(u, v) = \frac{I(u, v) - \bar{I}_{W_m(x, y)}}{\left\Vert I - \bar{I} \right\Vert_{W_m(x, y)}} $$
- Vectorization: convert the window matrix into a vector by unwrapping it, concatenating its horizontal lines. Denote this as $\overrightarrow{w}(x, y)$.
- Normalization scales the magnitude of the vector to 1.
- Distance ((normalized) sum of squared differences): $$ D = \left\Vert \overrightarrow{w}_1 - \overrightarrow{w}_2 \right\Vert^2 $$
- Normalized correlation: $$ C = \overrightarrow{w}_1 \cdot \overrightarrow{w}_2 $$
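A minimal OpenCV sketch of window-based stereo matching via block matching; it assumes a pair of already-rectified images (file names and parameters are placeholders):

```python
import cv2

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # rectified left image (placeholder)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)  # rectified right image (placeholder)

# numDisparities must be a multiple of 16; blockSize is the (odd) matching window size
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right)                 # fixed-point disparities (scaled by 16)

vis = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype("uint8")
cv2.imwrite("disparity.png", vis)
```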
Local Features
Aperture problem and normal flow: if you only have a partial view of a moving, one-dimensional object, you cannot always tell how it is moving (e.g. a moving line whose ends are outside the viewport).
Given velocities $u$ and $v$ (the components of motion in the $x$ and $y$ directions):
Normal flow, the vector representing translation of the line in the direction of its normal, can be written in terms of the spatial and temporal image gradients.
By considering multiple moving points, the velocity $(u, v)$ can be fully determined.
Lucas-Kanade
Assumes the same velocity for all pixels within the window, and that pixel intensities do not change between frames.
https://docs.opencv.org/4.5.0/d4/dee/tutorial_optical_flow.html
Solve using least squares:
$$ \left(\sum_{W} \nabla I \, \nabla I^{T}\right) \begin{pmatrix} u \\ v \end{pmatrix} = -\sum_{W} \nabla I \; I_t $$
LHS: sum of the outer product tensor of the gradient vector over the window.
Good features:
- Satisfy brightness constancy
- Has sufficient texture variation
- But not too much texture variation - too many edges is also a problem, as it is hard to tell how the patch has moved
- Corresponds to a ‘real’ surface patch (e.g. shadows not real)
- Does not deform too much over time
The previous equation can be written as $A\overrightarrow{v} = \overrightarrow{b}$, where $A = \sum_{W} \nabla I \, \nabla I^{T}$.
For this to be solvable:
- $A$ should be invertible
- $A$ should not be too small (signal-to-noise ratio): the eigenvalues $\lambda_1$ and $\lambda_2$ should not be too small
- $A$ should be well-conditioned: $\lambda_1 / \lambda_2$ should not be too large (where $\lambda_1$ is the larger eigenvalue)
- Original (Harris) scoring function: $\lambda_1\lambda_2 - k(\lambda_1 + \lambda_2)^2$
- Shi-Tomasi scoring function: $\min(\lambda_1, \lambda_2)$
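A short OpenCV sketch of pyramidal Lucas-Kanade tracking on Shi-Tomasi features (frame paths and parameters are placeholders):

```python
import cv2

prev = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)   # placeholder frames
curr = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

# Shi-Tomasi corners (min(lambda1, lambda2) scoring) make good features to track
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=200, qualityLevel=0.01, minDistance=10)

# winSize is the window assumed to share a single velocity; maxLevel = pyramid depth
p1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None,
                                            winSize=(21, 21), maxLevel=3)
ok = status.ravel() == 1
for (x0, y0), (x1, y1) in zip(p0[ok].reshape(-1, 2), p1[ok].reshape(-1, 2)):
    print(f"({x0:.1f}, {y0:.1f}) -> ({x1:.1f}, {y1:.1f})")
```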
Harris Detector
Using auto-correlation on ‘interesting’ points - where there are important differences in all directions.
For a point $(x, y)$ and a shift $(\Delta x, \Delta y)$, the auto-correlation function is the weighted sum of squared differences between a window and its shifted copy.
Avoiding discrete shifts: approximate the shifted image using a first-order Taylor expansion of the image gradients.
Auto-correlation matrix: the same gradient-based matrix $M$ as above, computed over the local window.
The matrix captures the structure of the local neighborhood. Interest can be measured using the matrix’s eigenvalues:
- 2 strong eigenvalues: interesting point
- 1 strong eigenvalue: contour
- 0 strong eigenvalues: uniform region
Interest point detection can be done using thresholding, or a local maximum for localization.
Feature distortion:
- Model as affine transforms: parallel lines preserved
- OpenCV: findFeatures
- Affine transforms:
- $x' = a_1 x + a_2 y + a_3$
- $y' = a_4 x + a_5 y + a_6$
- Six parameters, min. six pixels per window
- Pass into the BCCE (brightness constancy constraint equation) and minimize the error
Invariant Local Features
Local features that are invariant to translation, rotation, scale etc… They should have:
- Locality: features are local; robust to occlusion/clutter
- Distinctiveness: individual features can be matched to a large database of objects
- Quantity: even small objects should generate many features
- Efficiency: close to real-time performance
- Extensibility: can be extended for a wide range of differing feature types
SIFT: the Scale-Invariant Feature Transform.
- Scale invariance:
- Gaussian pyramid: blur, then halve the dimensions (one octave)
- Compute the Difference of Gaussians (DoG): the difference between neighboring Gaussian layers
- Approximation of the Laplacian of Gaussian
- Compare each pixel against its 8 neighbors in the same scale, plus the 9 neighbors in the DoG one octave above and below: use it as a keypoint if it is a minimum/maximum across all 26 neighbors
- Rotation invariance:
- Create histogram of local (i.e. neighbor pixel) gradient directions; each bin covers 10 degrees
- Canonical orientation = peak of histogram
- Descriptor:
- 16x16 region in scale space around keypoint
- Rotate region to match canonical orientation
- Create orientation histograms on 4x4 pixel neighborhoods; 8 bins/orientations each
- Hence 16 neighborhoods with 8 bins each: 128 element vector
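An OpenCV sketch of computing SIFT keypoints and their 128-element descriptors (the path is a placeholder; assumes a recent opencv-python where SIFT is in the main module):

```python
import cv2

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)   # placeholder path
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), "keypoints, descriptor shape:", descriptors.shape)  # (N, 128)

out = cv2.drawKeypoints(img, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("sift_keypoints.png", out)
```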
07. Morphology
Structural processing of images
From ~1960s
- Erosion: shrinks objects
- Dilation: expands objects
- Open: erode then dilate
- Smooths images, removing small spurs, lines and noise
- Close: dilate then erode
- Fills gaps and holes while preserving thin lines
Extracting quantitative descriptions of image components:
- Boundaries
- Skeletons
- Convex hulls
Pixels are either object or non-object pixels.
Structuring element: smaller matrix applied to the image
Binary erode:
- Structuring element placed centered around every pixel: remove the pixel if any pixel of the structuring element overlaps with a non-object pixel
Binary dilation:
- Keep any pixel covered by the structuring element when it is placed on at least one object pixel
Greyscale erode:
- Replace the set operation with a min operation: each output pixel is the minimum of the input pixels under the structuring element
Greyscale dilate:
- Each output pixel is the maximum of the input pixels under the structuring element
Distance transform:
- Minimum distance of each pixel to non-object pixel
- Simple but inefficient: repeat erosion operation until all pixels gone; distance is the number of erosion operations required before the pixel disappeared
- Structuring element:
- Chessboard: pixels sharing corners or edges both have a distance of 1; 3 by 3 square structuring element
- Manhattan: 3 by 3 cross structuring element
Skeleton transform:
- Reduces regions to one-pixel line thick borders
- Methods:
- Distance transform: create by finding pixels with a distance of 1
- Thinning: repeatedly thin image, retaining end points/connections
Convex hull:
- Follow outlines of object, except concavities
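A small OpenCV sketch of these operations on a binary mask (the input image is a placeholder and the 3x3 square structuring element is just one possible choice):

```python
import cv2
import numpy as np

mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)        # binary image (placeholder)
kernel = np.ones((3, 3), np.uint8)                          # 3x3 square structuring element

eroded  = cv2.erode(mask, kernel)                           # shrinks objects
dilated = cv2.dilate(mask, kernel)                          # expands objects
opened  = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)    # erode then dilate: removes spurs/noise
closed  = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)   # dilate then erode: fills gaps and holes

# Distance transform: distance of each object pixel to the nearest non-object pixel
dist = cv2.distanceTransform(mask, cv2.DIST_L1, 3)          # DIST_L1 = Manhattan distance
```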
08. Tracking
Kalman Filter
Combine noisy measurements with predictions of how the state changes to get better estimate of real state.
Tracking: inference over time.
Can simplify the problem by assuming linear dynamics and Gaussian noise. An unscented Kalman filter can deal with non-linear state transitions, but still assumes Gaussian noise.
Task: at each time point (and in real-time), re-compute the estimate of position.
Recursive estimation: decompose this into:
- The part that depends on the new observation
- The part that can be computed from previous history
Minimal example - a running average:
$$ \hat{x}_t = \alpha \hat{x}_{t-1} + (1 - \alpha) z_t $$
Where $\hat{x}_t$ is the estimate at time $t$, $z_t$ is the new measurement, and $\alpha$ controls the weight given to history.
This would be sensitive to noise/occlusion
Tracking:
Generalized model:
- Assume there are moving objects with an underlying state $X_t$ (e.g. position + velocity)
- Assume there are measurements $Y_t$, some of which are functions of the state
- There is a clock: at each tick, the state changes and we get a new observation
- Data association: the measurements taken at tick $t$ tell us something about the object's state at tick $t$
- Prediction: the measurements $y_0, \ldots, y_{t-1}$ tell us something about the object's state at tick $t$
- That is, compute $P(X_t \mid Y_0 = y_0, \ldots, Y_{t-1} = y_{t-1})$
- Where $Y_i$ is a random variable representing the probability distribution for the $i$th measurement and $y_i$ is the observed measurement
- Correction:
- Once $y_t$ is obtained, compute $P(X_t \mid Y_0 = y_0, \ldots, Y_t = y_t)$
Simplifying assumptions:
- Only the immediate past matters; that is, only the previous state:
- $P(X_t \mid X_0, \ldots, X_{t-1}) = P(X_t \mid X_{t-1})$
- Measurements depend only on the current state
- Previous measurements do not affect the current measurement
1D Kalman Filter
Assumes the new state can be obtained by multiplying the old state by a (known) constant and adding Gaussian noise; likewise, each measurement is a constant multiple of the state plus Gaussian noise.
In other words, the prediction step scales the previous estimate and grows its variance by the process noise.
Once a measurement arrives, this prediction can be corrected by combining the predicted Gaussian with the measurement Gaussian, each weighted by the inverse of its variance.
Smoothing: if not running the filter in real time, you can run the algorithm forwards and backwards and take the mean of the two predictions.
Kalman in Python
g-h filter:
def g_h_filter(data, x0, dx, g, h, dt=1.):
    """
    Performs a g-h filter on 1 state variable with a fixed g and h.
    'data' contains the data to be filtered.
    'x0' is the initial value for our state variable.
    'dx' is the initial change rate for our state variable (assumes a linear rate of change).
    'g' is the g-h filter's g scale factor: g * 100% of the estimate comes from the measurement. Should be high for less noisy measurements.
    'h' is the g-h filter's h scale factor: a larger h responds quicker to change, but is more vulnerable to noise/outliers.
    'dt' is the time step between measurements.
    """
    x_estimate = x0
    results = []
    for x_measurement in data:
        # prediction step (the rate of change dx is assumed constant)
        x_prediction = x_estimate + (dx * dt)
        # update step
        residual = x_measurement - x_prediction  # delta between measurement and prediction
        # update the rate of change using the residual;
        # h determines how quickly the rate of change changes
        dx = dx + h * residual / dt
        # update the estimate to be a weighted average of prediction and measurement;
        # g determines the weight given to the measurement
        x_estimate = x_prediction + g * residual
        results.append(x_estimate)
    return results
Example: system where position is being measured and the object has constant velocity
The distance and velocity can each be represented as Gaussian distributions $N(\mu, \sigma^2)$.
Sum of two Gaussians: $N(\mu_1, \sigma_1^2) + N(\mu_2, \sigma_2^2) = N(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2)$.
Hence, the prediction can be represented as the sum of the distributions of the previous position and predicted velocity.
Product of two Gaussians: a Gaussian with $\mu = \frac{\sigma_2^2\mu_1 + \sigma_1^2\mu_2}{\sigma_1^2 + \sigma_2^2}$ and $\sigma^2 = \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2 + \sigma_2^2}$.
The update step returns the estimated position as the product of the distributions of the new measurement and the current estimated position.
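A tiny numeric sketch of these predict (sum) and update (product) rules for a 1D position estimate; the measurements, velocity and noise values below are made up:

```python
def predict(pos, var, movement, movement_var):
    # Sum of Gaussians: means add, variances add
    return pos + movement, var + movement_var

def update(pos, var, meas, meas_var):
    # Product of Gaussians: variance-weighted mean, smaller combined variance
    new_pos = (var * meas + meas_var * pos) / (var + meas_var)
    new_var = (var * meas_var) / (var + meas_var)
    return new_pos, new_var

pos, var = 0.0, 1000.0          # very uncertain initial estimate
for z in [1.1, 2.0, 2.9, 4.2]:  # noisy position measurements, velocity ~1 per tick
    pos, var = predict(pos, var, movement=1.0, movement_var=0.5)
    pos, var = update(pos, var, meas=z, meas_var=1.0)
    print(f"estimate {pos:.2f}, variance {var:.2f}")
```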
Particle Filter
The particle filter allows multiple positions to be predicted, and works with multi-modal and non-Gaussian distributions.
Three probability distributions:
- Prior density: previous state
- Process density: kinematic model - prediction of next state
- Observation density: likelihood of an observation given the state
The particle filter processes this into a single probability distribution, the state density
Comparisons:
- Kalman filter: linear transitions and Gaussian distributions. Breaks down if there is too much occlusion/clutter
- Unscented Kalman filter: non-linear systems, but still assumes Gaussian distribution
- Particle filter: predicts multiple states/positions with non-Gaussian distribution. Much slower
Particle Filter in Python
Algorithm:
- Generate a bunch of particles randomly
- Each has a weight proportional to the probability that it represents the state of the real system
- The use of Monte-Carlo simulation means that particles can be generated which follow any probability distribution, not just the Gaussian
- Predict the next state of the particles based on your predictions of how a real system would behave (e.g. when you send a command to change the state)
- The changes must be noisy to match the real system
- Update the weighting of the particles based on a new measurement or measurements (e.g. multiple objects being tracked)
- e.g. for each measurement, find distance between each particle and the measurement, and update the weight accordingly (e.g. for position, add the difference/residual multiplied by some factor to account for the measurements being noisy). Then, normalize the weights so they sum to one
- Resample, discarding particles that are now classed as highly improbable, and generate more particles by duplicating some of the more likely ones
- The noise added during the predict stage means the duplicate and original will separate
- If only one particle is being tracked, can use the weighted sum to get an estimate of its real position
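A minimal 1D particle-filter sketch of the predict/update/resample loop above; all values (state range, noise levels, measurements) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
particles = rng.uniform(0, 10, N)        # initial guesses spread over the state space
weights = np.ones(N) / N

def step(particles, weights, control, measurement, process_noise=0.2, meas_noise=0.5):
    # Predict: move every particle by the commanded amount plus noise
    particles = particles + control + rng.normal(0, process_noise, len(particles))
    # Update: re-weight by the likelihood of the measurement given each particle
    weights = weights * np.exp(-0.5 * ((measurement - particles) / meas_noise) ** 2)
    weights = weights / weights.sum()
    # Resample: duplicate likely particles, drop unlikely ones
    idx = rng.choice(len(particles), len(particles), p=weights)
    return particles[idx], np.ones(len(particles)) / len(particles)

true_pos = 2.0
for _ in range(5):
    true_pos += 1.0                       # the object moves 1 unit per tick
    z = true_pos + rng.normal(0, 0.5)     # noisy measurement
    particles, weights = step(particles, weights, control=1.0, measurement=z)
    print("estimate:", np.average(particles, weights=weights))
```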
09. Introduction to Deep Learning
Types:
- Supervised: labeled training set
- Unsupervised: discover patterns in unlabeled data
- Reinforcement learning: learn to act based on feedback/reward
Deep learning:
- Learning representations of data - great at learning patterns
- Uses a hierarchy of multiple layers - hence the ‘deep’ in the name
- Convolutional neural networks
- Works both supervised and unsupervised
- Compared to traditional machine learning, a lower rate of diminishing returns as the size of the training set increases
Neural networks:
- Input layer
- Hidden layer(s)
- Output layer
Activation functions: non-linearities needed to learn complex (non-linear) representations of data. More layers and neurons can approximate more complex functions.
Overfitting: when the model fails to generalize outside the training set
10. 3D Reconstruction using Computer Vision
Reconstructing 3D structure and camera positions from a set of images.
Many applications:
- Robot control
- Self-driving cars
- Measuring
- Medical imaging
- Photo-realistic graphics
- AR
Most important algorithms: RANSAC and bundle adjustment.
Also known as structure from motion (SfM) or photogrammetry.
SLAM (simultaneous localization and mapping) is usually real-time while SfM is offline. Closing the loop: once you recognize you have visited a position previously, you need to back-propagate changes to the model which may have drifted.
Summary:
- Homography $H$: relates the relative pose of two cameras viewing a planar scene (i.e. points all on the same plane), estimated using RANSAC
- Essential matrix $E$: relates the relative pose of two cameras viewing a 3D scene, estimated using RANSAC
- Bundle adjustment: initialize using RANSAC (for pairs of views), then estimate the set of 3D points and camera poses which minimizes the re-projection error
Background
Camera Calibration
Camera calibration: map pixel coordinates to normalized image coordinates: correct factors such as lens distortion, focal length, image center etc.
Feature Matching
Process of choosing point features that appear in two adjacent images.
Features are usually point features found using corner detectors.
Corner features should be:
- Repeatable - same corners found in every image
- Localizable: detected location is an actual 3D point
- e.g. one object behind another from the PoV of the camera - the intersection is not a real point
cv::goodFeaturesToTrack can be used to find Harris corners.
Representations of appearance: ‘feature descriptors’ e.g. image patches, SIFT, SURF
May get incorrect matches: objects which look the same but are different. “gross outliers” - outliers where location error is much higher (orders of magnitude higher) than expected.
For feature registration, algorithms should be robust against:
- Translation (scale change = z translation)
- Rotation (including skew)
- Illumination (colour shifts, shadows)
- Blur (motion/defocus blur - the former is easy to remove if you know the velocity)
- Non-rigid deformations
- Radial distortions
- Stretching
- Warping
- Intrinsic camera parameters
- Noise
- Gaussian filter, median filter
- Partial occlusion
- Camera gain changes
- Self similarity
Homogeneous Coordinates
Homogeneous coordinates: add an extra dimension (e.g. 3D points become 4D points) whose value is set to 1.
This allows matrix multiplication to be used to represent rotation and translation, and projection to be represented by normalization.
Transform of a point: express it in homogeneous coordinates and multiply by the transformation matrix.
Then divide by the final coordinate to recover the transformed point.
In homogeneous coordinates:
- The camera is at the origin
- Image points lie in a plane at a fixed distance in front of the camera (the image/sensor plane)
- Points can be mapped between the 2D sensor plane and 3D space
2-view Reconstruction
Recover rotation, translation and 3D structure given two images.
Planar scenes: Homography Estimation
e.g. aerial images, AR apps. All points are on a single plane, making things easier - one less dimension to worry about and no occlusion.
There are two views of a planar scene and three parameters: rotation, translation, and the plane normal.
A 3x3 matrix $H$ maps (homogeneous) image points in one view to the corresponding points in the other view.
Inlier match: $\overrightarrow{x}' \approx H\overrightarrow{x}$, where $$ H \propto R + \frac{1}{d}\,\overrightarrow{t}\,\overrightarrow{n}^{T} $$
Where:
- $R$ is the relative orientation
- $\overrightarrow{t}$ is the translation unit vector
- $\overrightarrow{n}$ is the plane normal
- $d$ is the ratio of the distance to the plane to the translation magnitude
Note that there is scale ambiguity in the above formula - we only have the direction of the translation, not its absolute magnitude.
cv::findHomography implements this algorithm.
As this only works on inliers, and you don't know which points are inliers without a model to test against, RANSAC is needed.
RANSAC
Random sample consensus. A general-purpose framework for fitting a model to data which has gross (very large) outliers.
Steps:
- Generate a hypothesis set: a randomly chosen set of points, the number of points being the minimum number needed to generate the model
- Hypothesis: generate a model from the hypothesis set
- Test: count the number of datapoints that would be inliers assuming the model is correct
- That is, count the number of features where the distance between $\overrightarrow{x}'_i$ and $H\overrightarrow{x}_i$ is below some threshold
- Repeat until enough points are inliers
- Generate a new model using all inlier points
In the case of planar scenes, 4 feature matches.
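An OpenCV sketch of robust homography estimation with RANSAC on SIFT matches between two views of a (roughly) planar scene; the image paths, ratio-test value, and the 5-pixel inlier threshold are placeholders:

```python
import cv2
import numpy as np

img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)   # placeholder images
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Brute-force matching with a ratio test to discard ambiguous matches
matcher = cv2.BFMatcher()
matches = [m for m, n in matcher.knnMatch(des1, des2, k=2) if m.distance < 0.75 * n.distance]

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# RANSAC: repeatedly fit H to 4 random matches, keep the model with the most inliers
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
print("inliers:", int(inlier_mask.sum()), "of", len(matches))
```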
3D Scenes: Essential Matrix Estimation
3x3 matrix $E$ that represents the relative pose (rotation $R$ and translation direction $\overrightarrow{t}$) of two cameras viewing a 3D scene.
For an inlier match, $\overrightarrow{x}'^{T} E \, \overrightarrow{x} = 0$ (the epipolar constraint).
RANSAC is used to compute $E$, as with the homography.
Conversion to a 3D structure:
- Pick camera poses (rotation and translation) consistent with $E$, e.g. the first camera at the origin and the second at the relative pose $(R, \overrightarrow{t})$
- For each inlier feature match $(\overrightarrow{x}, \overrightarrow{x}')$, triangulate the two rays to find the 3D point
- This creates a sparse set of 3D points
- http://hilandtom.com/tombotterill/code
N-view Reconstruction by Bundle Adjustment
RANSAC suitable for two views, but 3D modelling may have tens or hundreds of photos; aerial mapping often has several thousand.
Hence, bundle adjustment is needed for accurate 3D reconstruction.
Reprojection error: the distance between where a 3D point projects into a camera and where the corresponding feature was actually detected.
Re-projection error for a point: the squared distance between the detected feature position and the projection of the estimated 3D point through the estimated camera.
Bundle adjustment finds the set of camera positions and 3D points which minimizes the total reprojection error.
That is, it finds the 3D points and camera poses such that the sum of squared reprojection errors over all observations is minimized.
After RANSAC is run on pairs of images, non-linear gradient descent is used to minimize the total reprojection error.
Errors in reconstruction will remain:
- Remaining outliers
- Point localization errors
- Poorly-conditioned geometry
However, this can be mitigated by the use of additional information such as GPS, or domain knowledge - for example, buildings usually have vertical planes and right-angles.
Extensions:
- Camera calibration unknown
- Giant feature sets (millions) - optimization algorithm must be designed to reduce complexity
BA finds a sparse structure, but if objects are assumed to be convex, a mesh can be formed. This compares to stereo, which returns a dense structure - distance value for each pixel.
11. Deep Learning
Dr. Oliver Batchelor
Neural networks, differentiable programming, applications to CV/image processing
History
Artificial neural networks:
- Perceptron (1958)
- Linear classifier based on weighted inputs and thresholding
- Not differentiable
- Unable to learn XOR function
- Backpropagation (1975) - basically just gradient descent
- Enable composition of networks built from multiple layers of differentiable functions
- Long training times
- Vanishing gradient problem - gradients at the early layers become very small, so their weights are essentially not updated (with the sigmoid activation)
- Convolutional neural networks (CNN)
- Invented for classifying handwritten digits
- Image convolution is differentiable
- Convolutional layer - many different convolutions, each with a different filter
- Weights associated with the filters are updated
- 2009: ImageNet classification problem
- 1000 image classes, 1.4 million images
- AlexNet, 2012: used GPUs for computation, ~64% accuracy
- Now ~90%
- Now:
- Neural nets part of almost every state-of-the-art computer vision applications
- 3D reconstruction: stereo, multi-view stereo, optical flow
- 2D/3D pose estimation
- Image generation, super resolution, style transform, texture synthesis
- Point cloud segmentation, object detection
- Neural nets part of almost every state-of-the-art computer vision applications
Introduction
Neural networks:
- ‘Differentiable programming’ vs ‘Deep learning’: more flexible usage; no longer a set of layers
- Supervised machine learning
- Objective function with data $x$ and target label $y$: $L(f(x; \theta), y)$, where the model has parameters $\theta$, the function $f$ makes predictions, and the loss function $L$ gives the error of the prediction
- Minimize the objective function:
- Compute the gradient of the loss with respect to the parameters $\theta$ for a batch of examples
- Update the weights: $\theta \leftarrow \theta - \alpha \nabla_{\theta} L$, where $\alpha$ is the learning rate
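A minimal PyTorch sketch of this objective/update loop on a toy regression problem (the model, data and learning rate are made up for illustration):

```python
import torch

model = torch.nn.Linear(1, 1)                        # f(x; theta)
loss_fn = torch.nn.MSELoss()                         # L(f(x; theta), y)
opt = torch.optim.SGD(model.parameters(), lr=0.05)   # lr is the learning rate

x = torch.randn(64, 1)                               # a batch of examples
y = 3.0 * x                                          # target labels (toy task: learn y = 3x)

for step in range(200):
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()                                  # gradient of the loss w.r.t. the parameters
    opt.step()                                       # theta <- theta - lr * gradient
print("learned weight:", model.weight.item())
```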
What’s wrong with fully-connected neural networks?
- Image extremely high dimensional
- If pixels in an image flattened out into a vector, spatial information is lost and shifting by a single pixel can cause vast changes
- High dimensional inputs require a lot of data; large images impractical
- The ‘curse of dimensionality’ - data becomes sparse
Solutions:
- Inductive bias: set of assumptions used by a learner to predict results for inputs it has not yet encountered
- Generalizations
- Invariance and equivariance
- Invariance = shift inputs, outputs remain the same
- Equivariance = shift inputs, outputs shift in the same way
- Images can be transformed (translated, scaled, rotated etc.) without changing their content/class label
- Images can be transformed and their segmentation is transformed the same way
- Hence, use operations that exploit these properties
- Usually convolutions (CNNs) - translationally equivariant, multi-scale approaches
- Much less data required
- Domain specific - invariances/symmetries different between domains
Building blocks of CNNs:
- Convolution
- Non-linear stage e.g. rectified linear
- Pooling: reduce a small window of an image into a single pixel
And repeat until you get a single output or a few outputs.
Other methods also available:
- Attention methods (from NLP) - transformers
- Divide image into patches; treat them like symbols
- Form associations between two sets
- Not obviously ‘better’ than CNNs
Applications
Output types:
- Classification:
- Categorical variables
- Softmax output with cross entropy loss
- Softmax: exponential, but weighted so that sum of components adds to 1
- Sigmoid with binary cross entropy loss
- Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$ (shaped like an S)
- Regression
- Continuous variables (e.g. depth of a pixel)
Visual recognition:
- Classification - if image belongs to class
- Segmentation (dense/semantic) - classifying each pixel
- Object detection: classification + localization (bounding box)
- Instance segmentation: classification + localization + segmentation (precise mask)
- Keypoint detection (e.g. joints in hand)
- Panoptic segmentation (both dense and instance segmentation) e.g. ‘grass’
Image classification:
- Pre-trained models available: https://github.com/rwightman/pytorch-image-models
- Can be used as basis of other tasks - train on ImageNet, adapt the model by adding layers/connections. Called having a ‘backbone’ based on image classification.
Semantic/dense segmentation:
- Per-pixel classification
- General scene understanding
- Works with irregularly-shaped objects
Segmentation performance measures:
- Percent correct
- Has issues with class imbalance
- Intersection over union (IOU):
- Precision = area of intersection of prediction and ground truth divided by area of prediction
- Recall = area of intersection of prediction and ground truth divided by area of ground truth
- IoU = area of intersection of prediction and ground truth, divided by union
- Problem: they are not differentiable, although substitutes are available
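A small sketch of computing these measures for boolean segmentation masks (my own helper function, not from the course):

```python
import numpy as np

def precision_recall_iou(pred, gt):
    """pred, gt: boolean masks of the same shape."""
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    precision = intersection / pred.sum() if pred.sum() else 0.0
    recall = intersection / gt.sum() if gt.sum() else 0.0
    iou = intersection / union if union else 0.0
    return precision, recall, iou
```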
Object detection:
- Localization + classification
- Difficult as there are a variable number of outputs
- How to measure accuracy?
- Average precision: area under precision vs recall graph
- Build by adding predictions by order of confidence, highest to lowest
- mAP @ t: mean average precision across all classes of dataset using matching threshold of t (e.g. 50%)
- COCO AP: average across mAP for thresholds [0.5, 0.55, …, 0.95]
- Usually output object locations (heatmaps) and classes
- Anchor boxes/prior boxes:
- Places where the network thinks there could be objects - very dense and overlapping
- Regression on box displacement and sizing on set of anchor boxes; resize and reduce to a single high-confidence box
- Non-maximum suppression: does not work well if two objects are overlapping
- Train by matching anchor boxes to ground truth boxes
- e.g. find boxes where IoU > 0.5 or some other threshold
Keypoint recognition:
- Detecting landmarks (e.g. human skeleton/pose) - want single points rather than box
- Bottom up:
- Keypoint detector per part
- Match key-points to find full skeletons
- Top down:
- Detect full objects first
- Estimate key-points given object location
Image matching and correspondence:
- Dense rectified (images aligned so that corresponding points lie on the same row) stereo matching (e.g. HSMNet, PSMNet)
- Dense multi-view stereo
- Sparse image matching: keypoint detection, descriptor extraction
- Dense optical flow
- Estimate motion of pixels
- 2D correspondence search - movement not constrained to one axis
- RAFT
- 6DoF pose matching
- Determining object pose (e.g. orientation) from 3D model
- Key point detector (identifiable point)/descriptor (vector uniquely defining that point)
- Can’t manually label data - a single image? Maybe. But definitely not a video
- Self-supervised by warp data augmentation: given an image warp, we know how key points should be transformed
- Direct matching e.g. LoFTR
- Detect, then extract descriptors: Superpoint, R2D2
- Object tracking
- Multi-object tracking: MOT
- Track known types of objects e.g. people, seals
- Supervised learning
- Detect and track paradigm e.g. SORT https://github.com/ifzhang/ByteTrack
- Difficulties: occlusion, changing shape, fast movement
- Generic object tracking (GOT)
- One shot object tracking
- Generic: not trained on a specific type of object (but probably a wide variety of moving objects)
- Makes it easy to use as long as the target is similar to anything in the training set
- Track object from template in the first frame (bounding box)
- e.g. vision4robotics/SiamAPN, got-10k/toolkit
- Can help with assisting labelling data
- Neural 3D reconstruction
- Not just for images, although this is a common case
- Input: set of calibrated images
- Structure from Motion, COLMAP
- Sparse point cloud through correspondence search: feature extraction -> matching -> geometric verification
- Dense multiview stereo e.g. PatchMatch to generate depth maps that can be filtered or fused
- Reconstruct 3D scenes use differentiable volumetric ray-tracing
- Inverse of computer graphics
- Synthesize new images of a scene from a different orientation
- Neural radiance fields: NeRF
- Represent 3D scene using a NN by mapping a 3D coordinate to a density and color
- Problem: neural networks have bias towards smooth functions; cannot represent high frequency/discontinuity
- Solution: scale/encode inputs as a Fourier-encoded sequence
- Problem: view-directional effects
- e.g. Google MipNeRF, nerf-pytorch
- instant-ngp: massively reducing required time/processing power
- More recent NeRF methods: direct voxel grids instead of the coordinate space mapping
- Neural hash-tables: ignore hash collisions - multiple levels of resolution combined with a tiny MLP to handle collisions
Image features:
- Description of small point in the image
- Independent of attributes such as location, orientation and scale
- Handcrafted dense image features also exist e.g. histogram of gradients (HoG), dense SIFT
- From NN:
- Intermediate activations from any type of NN
- Typically dense (for CNNs)
- Usually much lower resolution than input
- Vector length depends on layer
- Sometimes modified for matching e.g. normalization, reduced dimension with PCA
- Can arise from training a NN on an auxiliary task (e.g. image classification)
- Handcrafted vs learned:
- NN features usually perform better on tasks near the training domain
- Handcrafted features often generalize better
- Simpler classifiers with extracted features
- Fine tuning on frozen networks
- SVM (support vector machine), random forest, nearest neighbor etc.
- Depends on the quality of the feature extractor
- If extracted from NN running auxiliary task, it may not match the use case
- Extremely common use cases:
- Feature matching (e.g. nearest neighbor)
- Well tested, less likely to fall down when it encounters unexpected data
- Image retrieval, face recognition (approximate nearest neighbor)
- Shortest path (e.g. skeletonization)
- Graph cut (e.g. stereo, segmentation)
- Homography estimation, Perspective nPoint (PnP)
- Tracking
- Detect and track e.g. SORT, one-shot detection
- Kalman filter
- Feature matching (e.g. nearest neighbor)
Applying models to new tasks:
- Find an off-the-shelf model
- A lot around; sometimes, may be useful to search for the dataset rather than the model
- Look for widely-used, actively-maintained ones
- Research-quality code often not maintained after publication; may require package to be updated to work with current versions of frameworks
- Check for conda/pip packages before building from source
- May not generalize
- Check similarity of images to ones in paper or those provided as examples
- Neural networks can be weak to unexpected inputs or domain shifts
- Use NN features with non-learning algorithms
- Open world visual recognition models
- Detects objects in general
- Often trained by big organizations with excessive resources
- Often self-supervised learning or from commonly available associations (e.g. OpenAI CLIP uses image data and captions from websites) - may be low quality
- Few shot image classification: create a new class using a few exemplars
- Zero shot image classification: classify an image based on textual descriptions
- Instance segmentation e.g. Detic (https://huggingface.co/spaces/akhaliq/Detic)
- Fine tuning
- Hand annotate a few different examples
- Interactive segmentation: using NN to help with training
- DiscoBox: human draws bounding box, NN generates segmentation mask
- Human in the loop
- Active learning: algorithm lets human annotate the data it is most uncertain on
- Verification-based annotation
- Partially-trained model makes suggestion
- Human selects/edits the best suggestions
- Question answering:
- Weak annotation
- Human answer yes/no questions; much faster than manual annotation
- Interactive segmentation: using NN to help with training
- Take an existing model and fine-tune the model or classifier with a smaller set of annotations
- Catastrophic forgetting/interference: when retraining model in new domain, it tends to lose information about the previous data
- Use data augmentation judiciously
- Apply transformations which are invariant/equivariant to the labels; allows training sets to be enlarged and forces the model to generalize
- Randomly perturb the data
- Geometric: translation, scale, rotation
- Photometric: brightness, contrast, hue
- Mixing: mix images (of different classes!) and labels
- Noise: Gaussian, salt and pepper, synthetic rain, cutout
- e.g. albumentations, BBAug, Torchvision, Detectron2
- Considerations:
- Too much may be harmful
- Sometimes may be better to standardize scale/orientation in input data rather than augment it (e.g. people not usually walking upside down)
- Hand annotate a few different examples
- Synthetic data
- 3D rendering to generate your own training data
- Self-supervised learning: known physical properties to provide supervision
- Correspondence tasks: cannot hand label; disparity maps must be more accurate than segmentation masks (precise disparities per pixel)
- Pseudo ground-truths
- Use device for high accuracy capture (LiDAR, structured light)
- Use more information e.g. higher-resolution or more images
- Use 3D reconstruction to generate higher-quality images and train on these
- Today: tends to be used for stereo matching; not as good as synthetic data, but more robust
- Self-supervised learning
- Consistency (left/right, inter-frame) using image warping: minimize warping error with truth to train model
- Feature matching with tracking
- TODO
12. Vineyard Project
Example Question
Describe and provide CV example for:
- Supervised learning:
- Learn function from input to output based on label-data pairs
- e.g. road sign classification
- Weakly-supervised learning:
- Labels are 'weaker': noisy, limited or imprecise
- Semi-supervised learning:
- Small amount of labeled data, larger set of unlabeled data
- Use model to assign labels on the unsupervised data, manually correct, and then use results to retrain
- Self-supervised learning:
- Use properties of the data to provide a supervision signal
- e.g. use auxiliary task like image completion to learn mapping from image to feature vector to define similarity metric between images
CNNs: what property of image matching CV algorithms enable self-supervised learning?
- Correct solutions can be verified - loss function can be written, allowing the ML algorithm to be supervised
How would this work for stereo/optical flow?
- Dense stereo/optical flow provide correspondence between two images; one image can be warped to match its counterpart. Hence, this allows a comparison to give an indication of how successful the warp is and hence provide a loss function
- SLAM/SfM: matches based on geometric consistency; badly-matched key-points will fail a geometric consistency test and be discarded. Keypoints that pass/fail can be used as a positive/negative supervision signal
Last question in the exam: briefly describe four of the following class projects, naming at least four algorithmic steps (with algorithm names). Do not select your own/similar projects.
If a person's project does not list four or more algorithms, it won't be selected.
Project Tips
- Academic paper: do not mention failures or running out of time. Phrase as positives in ‘future research’
- Remove the word ‘project’ from the paper; use ‘research’
- Avoid colloquialisms
- ‘The paper proposes a method’ not ‘the goal of this paper’
- ‘These results show the proposed approach can’ not the ‘system can’
- Do not mention the phrase 'computer vision'; the paper is for a CV conference, so it is too broad
- Worse results are fine; proposing a method, not selling a solution
- Only mention the framework, hardware etc. at the beginning of the results/methods section
Abstract:
- Not part of the paper. Self-contained, technical overview of the whole paper. Include algorithm names etc., mention at least one result number, hopefully a comparison with prior research
- Must at least attempt to compare it with prior research
Background:
- Critical review of prior research - mention limitations of prior research/algorithms
- e.g. static background required. Look at future research sections
Proposed methods:
- At least three CV algorithm names
- What algorithms are the DL networks using?
- Novel: can mean tiniest minuscule tweak
Results:
- At beginning, mention OS framework etc.
- Quantify results
- Try to quantitatively compare results with prior research
- Survey papers can be useful
Conclusions:
- Start with brief summary of results
- Quantitatively compare with prior research
- Future research sub-section
References:
- Be consistent
- Most should be newer than 10 years ago, or justify
Real World Example: CV for a Grape Vine Pruning Robot
Approx. half the cost of vineyards is in pruning; it is hard to get enough workers, you can't prune in the rain, etc.
Pruning: remove old wood and most new canes during the winter.
NZ:
- 90 million vines, mostly Sauvignon Blanc
- Hand-pruned. ~2 minutes per vine
Large project: viticulture, robotics and AI experts, software + hardware engineers. ~5 years
~85% successful. Good enough for the government, but not good enough commercially.
Lighting:
- Extremely challenging: dynamic range far too large in sunlight
- Got a mobile caravan to control lighting: lights, blue screen background etc.,
- Bike wheels
- Place lights to minimize shadows
Camera rig:
- 3 well-conditioned cameras. Allowed reconstruction in all directions
- Needed to align after every setup - bumping and vibrations caused movement
- 3D reconstruction:
- Many challenges: occlusion, depth discontinuities, self-similarity
- Solved using feature matching/bundle adjustments
- 2D feature extraction: canes, wires, posts
- Move away from pixels/point clouds to high-level features
- Correspondence between views, using knowledge of vines
- Customized every stage to use knowledge about vines (no sharp corners, vine thickness etc.)
- Made sequential chain of components that could be developed and parametrised in sequence and in isolation
- Rolling buffer of the last few dozen frames
- Now can use ML to get a very accurate 3D model, but was not available at the time
Main challenge was complexity and robustness.
- Lighting
- Even with artificial lighting, getting rid of shadows is hard
- Solution: MORE LIGHTS
- Occlusion
- Six 12-megapixel cameras with global shutters and bright lights to reduce motion blur
- Self-similarity: vines look the same (TODO)
Main challenges:
- Complexity
- Robustness
13. Image Representations
What is a good representation for image analysis?
- Fourier transform tells you ‘what’ (textural properties) but not ‘where’
- Pixel domain gives you 'where' but not 'what'
We want an image representation that gives you a local description of image events - what and where, and naturally represent objects across varying scales.
Image pyramids: apply filters of fixed size to images of different sizes. Typically, edge length changes by a scale of 2 or the square root of 2.
There are many types of image pyramids:
- Gaussian
- Acts as a low pass filter
- Applying the Gaussian to a Gaussian returns another Gaussian, allowing recursion
- Synthesis: smooth and sample
- Laplacian
- Synthesis: for a given level in a Gaussian pyramid, find the difference between the image and the up-sampled image from the level below (lower resolution)
- Acts as a band-pass filter: each level represents spatial frequencies largely unrepresented in other levels
- Multi-scale, band-pass and over-complete (more coefficients than image pixels)
- Wavelets/QMFs:
- Apply 1D filters separably in two spatial dimensions
- Wavelet function: a 1D function $\psi(x)$ with a total integral of zero ('centered' around the x axis) and limited support (only a small section of the domain returns a non-zero value)
- Haar wavelet: -1 for $0 \le x < \frac{1}{2}$ and 1 for $\frac{1}{2} \le x < 1$
- Parameters scale $s$ and horizontal translation $\tau$: $\psi_{s, \tau}(x) = \psi\!\left(\frac{x - \tau}{s}\right)$
- Continuous wavelet transform: integral of the product of the wavelet function with the 1D signal; measures the correlation of the wavelet with the signal
- Discrete wavelet transform: pick scales that are powers of two
- 2D wavelet transform: use high and low pass filters. At each step, run the filter, downsample by 2, and repeat, filtering in the other dimension. Hence, if run twice (once per dimension) you get four images downsized by a factor of 4
- Multi-scale, band-pass and complete
- Some aliasing
- Steerable:
- Can pick angle of interest
- Image corners must be removed
Uses:
- Gaussian: scale-invariance
- Laplacian: difference between pyramid levels - useful for noise reduction and coding
- Wavelet/QMF: band-passed, complete representation of the image
- Steerable pyramid: show components at each scale/orientation - useful for texture/feature analysis
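An OpenCV sketch of building Gaussian and Laplacian pyramids (the input path and number of levels are placeholders):

```python
import cv2

img = cv2.imread("input.png")                  # placeholder path

# Gaussian pyramid: smooth and subsample (one octave per level)
gaussian = [img]
for _ in range(4):
    gaussian.append(cv2.pyrDown(gaussian[-1]))

# Laplacian pyramid: each level = Gaussian level minus the up-sampled coarser level
laplacian = []
for fine, coarse in zip(gaussian[:-1], gaussian[1:]):
    up = cv2.pyrUp(coarse, dstsize=(fine.shape[1], fine.shape[0]))
    laplacian.append(cv2.subtract(fine, up))
```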
14. Tracking
Fiducial Markers
A fiducial marker is any planar object introduced into the scene, within the field-of-view of an imaging system, to be used as a point of reference or measure.
Can be used for:
- Object position estimation
- Camera position estimation
- Estimate transform/poses of robots
How:
- Calibrate camera to determine and correct distortion
- Homography: 3x3 matrix mapping between projection of two planes
- If $\overrightarrow{p}$ and $\overrightarrow{p}'$ are corresponding points on two planes, then $\overrightarrow{p}' = H\overrightarrow{p}$
- Pose: 6 degrees of freedom; rotation about and translation along each axis; can be represented as a 3x3 rotation matrix and a 3x1 translation vector. Called the extrinsics matrix
- Process:
- Find marker outline
- e.g. ArUco: adaptive thresholding, contour extraction, quad extraction
- e.g. AprilTags: edge detection, line segment detection, quad extraction
- Calculate homography using corner locations
- Calculate extrinsics using focal length and marker size
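A hedged sketch of the marker pipeline using OpenCV's ArUco module (pre-4.7 contrib API; the calibration values, dictionary and marker size below are all placeholders):

```python
import cv2
import numpy as np

# Hypothetical calibration; in practice these come from camera calibration
camera_matrix = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(5)
marker_length = 0.05                                    # marker side length in metres

img = cv2.imread("frame.png")                           # placeholder path
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
# Thresholding, contour extraction, quad extraction + ID decoding
corners, ids, _ = cv2.aruco.detectMarkers(img, dictionary)

if ids is not None:
    # Corner locations -> homography -> rotation and translation (extrinsics)
    rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
        corners, marker_length, camera_matrix, dist_coeffs)
    for rvec, tvec in zip(rvecs, tvecs):
        print("rotation (Rodrigues):", rvec.ravel(), "translation (m):", tvec.ravel())
```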
Challenges:
- Occlusion
- Unfocused or motion blur
- Dark/uneven light, vignetting
- Jitter: exact position can vary between frames
- False positives: not all squares are markers
Markerless Tracking
Use ‘natural’ features for tracking: corners, edges, points etc.
Also: templates - basically something whose representation stored in the system.
This is more difficult and usually much more computationally expensive.
Texture tracking:
- Replace marker corners with keypoints
- SIFT, SURF, GLOH, etc.
- Apply detector to every single pixel
- Find the best set of keypoints (e.g. filter by strength/similarity/distance)
- Create descriptors; windows around the keypoints
- Generate robust descriptors that allow differentiation between keypoints
- SIFT:
- Estimate dominant keypoint orientation using gradients
- Compensate for detected orientation
- e.g. if keypoints on plane, transform all features as if the camera is normal to the plane
- Describe keypoints in terms of surrounding gradient radially
- SIFT:
- Match descriptors against database (created offline)
- Build a database with all descriptors and their position on the original image
- For robustness, search for corners at multiple scales
- May require data structures to allow for efficient searching
- Build a database with all descriptors and their position on the original image
- Outlier removal:
- Start with the cheapest techniques first (e.g. geometric)
- End with homography-based tests
Hybrid tracking: use gyroscope for prediction of camera orientation, and computer vision to correct gyroscope drift. Kalman filter?
Outdoor: lots of landmarks and planar features, but varying lighting conditions make it difficult.
15. Face Recognition
- Input face image
- Face detection
- Normalization: rotation, scale
- e.g. normalize distance between eyes (see the sketch after this list)
- Face feature extraction
- Interesting feature compression techniques available
- Classifier: feature matching against database
- Decision maker
- Thresholding
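A minimal sketch of the detection + normalization front end, assuming the Haar cascades bundled with OpenCV and a placeholder input image; the target inter-eye distance is arbitrary.

```python
# Face detection plus a simple eye-distance normalization (Haar cascades bundled with OpenCV).
import cv2

face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

img = cv2.imread("person.png")                      # placeholder input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

for (x, y, w, h) in face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
    face = gray[y:y + h, x:x + w]
    eyes = eye_cascade.detectMultiScale(face)
    if len(eyes) >= 2:
        # Take the two left-most eye detections and compute their centres.
        (ex1, ey1, ew1, eh1), (ex2, ey2, ew2, eh2) = sorted(eyes, key=lambda e: e[0])[:2]
        c1 = (ex1 + ew1 / 2, ey1 + eh1 / 2)
        c2 = (ex2 + ew2 / 2, ey2 + eh2 / 2)
        dist = ((c1[0] - c2[0]) ** 2 + (c1[1] - c2[1]) ** 2) ** 0.5
        # Rescale the face crop so the inter-eye distance is fixed (64 px is arbitrary).
        scale = 64.0 / max(dist, 1e-6)
        normalized = cv2.resize(face, None, fx=scale, fy=scale)
```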
Early face detection:
- Geometric (e.g. using eyes)
- Eyes and mouths are good features to detect
- Color distribution
- Need to segment background from face color
- Skin tones ended up accounting for only a tiny portion of the color space
- Wood color is similar to skin color, which makes for much fun
- Hence, choice of color space is important
Surveillance cameras are usually mounted high up, which isn’t ideal for most face recognition algorithms.
Surveillance-based tracking: tracking people for the entire time they are in an area.
Normalization:
- e.g. normalize distance between eyes
Features:
- Eyebrow thickness, vertical position at eye center
- Eyebrow Arch
- Nose vertical position, width
- Mouth vertical position, width, lip height
- Chin shape (e.g. distance from keypoint at fixed angles)
- Bigonial breadth - face width at nose position
- Zygomatic breadth - face width halfway between nose tip and eyes
Neural networks:
- Use NN-based filter: use small filter window to scan portions of the image and detect if a face exists
- Merge overlapping detections to eliminate false positives
Static face recognition:
- Eigenface: project faces onto a low-dimensional ‘face space’ spanned by eigenfaces and find the closest match. Worked up to ~1000 faces (see the sketch after this list)
- Linear and Fisher discriminant analysis
- Fisherface: finds a linear transformation which maximizes inter-class scatter while minimizing intra-class scatter
- Now: CNNs/deep learning - can work well even with millions of faces
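A minimal Eigenface-style sketch, assuming `faces` is an N x D matrix of flattened, aligned greyscale face images and `labels` holds the corresponding identities; NumPy's SVD stands in for an explicit eigen-decomposition of the covariance matrix.

```python
# Eigenface-style recognition sketch: PCA projection + nearest neighbour in face space.
import numpy as np

def fit_eigenfaces(faces, k=50):
    """faces: N x D matrix of flattened, aligned greyscale face images (hypothetical data)."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    # SVD of the centred data gives the principal components (the eigenfaces).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]                    # top-k eigenfaces
    coords = centered @ components.T       # each face as a k-dimensional vector
    return mean, components, coords

def recognize(query, mean, components, coords, labels):
    """Project a query face into face space and return the label of the nearest stored face."""
    q = (query - mean) @ components.T
    distances = np.linalg.norm(coords - q, axis=1)
    return labels[int(np.argmin(distances))]
```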
Video face recognition:
- Low quality, small images
- But also allows tracking of the face image and continuity: re-use classification information from high-quality images when processing low-quality images
- Motion structure: create 3D model of face to match against frontal views
- Non-rigid motion analysis
16. One-Minute Demos
???:
- Real time driver fatigue detection
- Dashcam/laptop
- Algorithms:
- Gaussian blur
- Histogram of oriented gradients (HOG)
- SVM
- Percentage eyelid closure (PERCLOS); eye aspect ratio (EAR), mouth aspect ratio (MAR)
???:
- Wildfire hotspot detection
- Smouldering makes detection difficult
- Heavy duty water carrying drone
- Identify, rank, then approximate distance to hotspot
- Drone camera
- Algorithms:
- Segmentation: threshold, contour, centroids
- Morphology
- Distance approximation (camera calibration)
???:
- Same as above
- Algorithms:
- Gaussian
- Binary threshold
- Morphology: opening/erosion/dilation
- Contour detection
- Find largest centroids
- Lucas-Kanade optical flow: is contour expanding
- Stereo imaging for distance approximation?
???:
- Sign language teaching assistant
- Real time feedback
- Only RGB camera
- Algorithms:
- Sharpening kernel
- Convolutional pose machine
- Thresholding
- Hand keypoint detection at some point
- 0.6 Hz
???:
- Track moving objects in robot soccer
- Algorithms:
- Camera calibration
- Background subtraction
- Circular Hough transform + unscented Kalman filter for ball tracking
- CNN object detection for robot detection
???:
- Faster human detection
- Want low power, fast systems
- CNN-based object detection
- Algorithms:
- YOLOv5
- Kalman filter
- Tracker fit model
- 0:11:54
- Run Kalman prediction
- Run object detection
- Are they close enough (Euclidean)?
- No - create new tracker
- Yes - update Kalman with new location
- Multiple objects within max distance: pick closest one
- Low CNN depth, Kalman to remove flicker/missed detection in frames
???:
- Gorilla head tracking
- Algorithms:
- Greyscale frame difference: subtraction
- Binary thresholding
- Median blur
- Morphology: open
- Find centroid
- Moment
- Single person only
???:
- Wildfire
- Algorithms:
- Mean greyscaling
- Gaussian
- To-zero thresholding
- x > threshold? x: 0
- Circular kernel (blurring?)
- Pick brightest pixel
- Blackout pixels around the selected pixel; repeat to get next brightest pixel
- Guess location by assuming ground is flat: know drone position, camera angle
???:
- FPS enemy detection
- Algorithms:
- YOLOv5 to get bounding box
- For outlining:
- Gaussian
- Morphology: close
- Morphology: gradient (difference of dilation and erosion)
- Thresholding: Otsu
???:
- Rehab to free throw detection (basketball)
- Detectron2 pose detection: feature pyramid network + Mask R-CNN
- Nothing else?
???:
- Chess board framing, move detection
- Lighting can change
- HSV color masking to detect chess board red/blue pieces
- Algorithms:
- Board framing:
- HSV color masking to remove background
- Morphology: open/close
- Contour finding: find square
- Chess grid:
- Canny edge
- Hough line transform
- Homography matrix
- Chess piece:
- HSV Color masking
???:
- Construction panel quality control
- Detect dimensions, end tolerances
- Algorithms:
- Processing:
- Greyscale
- Blur
- Mask
- Canny edge detection
- Morphology: erosion/dilation
- Calibration:
- Find sticker - pixels to mm
- Find panel:
- Draw contour
- Template match: adapted normalized cross-correlation
???:
- Detect/localize pine tree yellow catkins
- Real time
- Algorithms:
- YOLOv5 catkin detection
- MeanShift segmentation of depth image
- Morphology: opening/closing
- Contour detection: Suzuki’s algorithm
- Shape matching: Hu moment invariants
???:
- Identify cut logs
- YOLACT with custom dataset for instance segmentation
???:
- Othello piece detection
- Algorithms:
- Gaussian
- Color thresholding
- Douglas-Peucker algorithm
- Image warping: homography matrix
???:
- Quick-time event detection in Detroit: Become Human
- QTE: fast button/gesture prompt
- Algorithms:
- Hough circle after greyscale, median blur
- Text detection:
- Tesseract OCR
- Frame crop
- Bicubic interpolation
- Gaussian
- Morphology: Erosion
- Grayscale
- Thresholding
- Harris corner detector to detect symbols
- Not sure how symbol type is identified
???:
- Rubiks cube tracking, photo -> model of state
- Stickerless cubes - edges not well defined (no black border)
- Algorithms:
- Haar cascade classifier to detect cube - generate bounding box
- Split into 3x3 grid
- Median blur
- Sample center
- Color thresholding to classify color
- Use LAB color space - less sensitive to luminance
- Also used: Canny edge detection, Suzuki85, Douglas-Peucker
- Contour detection: OpenCV, Suzuki85 https://doi.org/10.1016/0734-189X(85)90016-7
???:
- Integral calculator
- Algorithms:
- Greyscale
- Gaussian
- Binary threshold
- Morphology: erode/dilate
- Find contours, sorting by x value
- Tesseract OCR
- Convert to string
- Compute integral
- Fails with handwriting, bad with printed, good with screenshots
???:
- Squash tracking in 3D space
- Position of squash ball
- Algorithms:
- Subtraction from clean plate (different from background modeling?)
- Morphology: erosion for noise removal
- Use output as mask
- Contour detection: group
- Filter by size, shape
- Triangulate position using two different cameras
???:
- Automatic projector keystone calibration
- Algorithms:
- Canny edge detection
- Hough to detect image edges
- Ramer-Douglas-Peucker
- Homography transform
???:
- Pose estimation: squat depth
- Algorithms:
- Segmentation mask to separate athlete from background
- MediaPipe pose
- Use keypoints to estimate squat depth
???:
- Virtual paper piano
- Printout of keyboard
- Algorithms:
- Keyboard segmentation:
- Canny edge detection
- Merged Hough line transform
- Perspective transform
- Linear segmentation
- Fingertip tracking and touch detection
- MediaPipe
- Transform to perspective space
- Only works with a single finger
???:
- Robocup object detection - identify weight
- Algorithms:
- Stereo calibration (estimate parameters)
- Gaussian
- Template matching: normalized correlation coefficient
- Triangulation to estimate depth
???:
- Bicycle detection and camera tracking (with gimbal)
- Algorithms:
- YOLO v5 to detect bike
- Pass bounding box to CSRT tracking algorithm (Channel and Spatial Reliability Tracking)
???:
- Stylus input with CV
- Digital sketchpad
- Algorithms:
- Canny
- Hough
- Shi-Tomasi corner detection
- Morphology: opening
- Stylus has colored tab which gets exposed when pressure is applied to the tip
- HSV filter
???:
- Predictive motion
- Algorithms:
- Gaussian blur
- Morphology dilation/opening
- Adaptive thresholding
- Hough circle transform
???:
- Patient rehabilitation monitor; track pose when doing exercises
- Hardcoded reference angles
- Algorithms:
- CNN
- Joint angle calculation, comparison thresholding
- FSM for multistage verification
42:00
???:
- Melanoma detection using smartphone camera
- Identify moles, return measure for border irregularity and color variance
- Algorithms:
- Increase image contrast
- Gaussian
- Greyscale
- Morphology: closing
- Adaptive binary thresholding
- Suzuki-Abe contour
- Fitzgibbon ellipse fitting
???:
- Nuclei segmentation in breast cancer tissue images
- Algorithms:
- Segmentation with convolutional autoencoder (U-net)
- Thresholding
- Morphology
- Watershed
???:
- AR sudoku solver
- Overlay solution on paper
- Algorithms:
- Pre-processing:
- Adaptive thresholding
- Morphology: open/dilate
- Grey scale
- Hough to detect horizontal/vertical lines:
- Remove the lines to be left with numbers only
- Perspective transform to warp and crop image
- CNN for digit classification
- Solve sudoku using backtracking, render added text, then unwarp to overlay on top of input image
???:
- Tumor/tissue detection: generate tissue and bulk region masks
- Algorithms:
- Morphology: erosion/dilation
- Contour detection
- Color thresholding
- Median blurring
???:
- Drone detection in airports, differentiate between drone and birds
- Algorithms:
- Double difference w/ sharpen kernel to detect difference
- Contour extraction (using morphology - close/dilate)
- Intersection over union tracker
- Moving average filter
- Fast fourier transform to extract power spectral density
- K-means/linear classifier
???:
- Automatic exposure control for robot navigation
- Algorithms:
- Sobel gradient filter
- Gradient magnitude
- Soft percentile derivative: weighted sum of difference between two frames
- Slice image into equal sections: take median of soft percentile derivatives
- No way of automatically calculating number of slices
???:
- Punching technique stats
- Track fist velocity, acceleration, elbow angle, fist angle
- Algorithms:
- Detectron2 retinanet for fist top/side detection
- Mediapipe pose
- Non-maximum suppression
???:
- Football penalty ball tracking
- Camera head-on with goal, located behind player
- Detect goal posts and ball
- Used yellow ball: HSV color range used to detect ball
- Algorithms:
- Gaussian
- Morphology: open, close, erode, dilate
- Thresholding
- Contours
???:
- Joint tracking: give likeness score computed from joint angles
- D435 camera
- ~0.4x real time speed
- Algorithms:
- Region-based CNN (R-CNN): instance segmentation (detectron 2)
- Keypoint/pose detection?
- Dynamic time warping
- Linear interpolation (for dealing with varying framerates)
- Newton’s method (to align video sequences)
???:
- Scrabble board detection
- Continuing code from previous student project
- Algorithms:
- Greyscale
- Adaptive thresholding
- Detect board contours
- Crop image to contain only board?
- HSV masking: filter out undesired colors (i.e. non letter-tiles)
- Morphology: erosion/dilation to remove noise
- Detect ‘maximally stable extremal regions’
- Tesseract OCR
???:
- Rugby ball detection
- Static camera, players in frame
- Difficulty: rugby balls not circular
- Algorithms:
- Gaussian background subtraction
- Median filter
- Morphology: close
- Contour detection
- Filter:
- By area
- Compactness degree: contour area divided by area of best-fit ellipse
- By ellipse aspect ratio
???:
- Pet detection using dominant color
- Algorithms:
- Mask R-CNN instance segmentation to detect pets in frame
- K-means clustering to determine dominant pet colors
- Image thresholding to remove image background?
???:
- Handwritten digit recognition
- Non-linear transform of input features into higher dimensional space so that features are linearly separable
- Cheaper than conventional deep learning
- Nanowire network: randomly? scattered nanowires combine to form network with complex topological structures: junctions between wires act as a form of non-linearity and memory
- Create 3D simulation of wires: do not assume they are 1D lines
- Input: voltage into input electrodes (1 out of 4 edges of a square)
- Output: current from output electrodes (remaining 3 edges of the square)
- Train only the output layer through regression to make 10 classifiers
- Algorithms:
- Linear classifier:
- Moore-Penrose pseudoinverse
- Singular value decomposition (SVD)
- Single layer neural network using Tensorflow
- Softmax activation
- Categorical cross-entropy
- Adam optimizer
- Nanowires deposition: Euler rotations
- Junctions model using Stormer-Verlet integration
- Modified nodal analysis to solve Kirchhoff’s circuit laws
???:
- Blackjack simulation: detect card rank and suit
- Top-down images of cards
- Algorithms:
- Greyscale
- Gaussian
- Thresholding
- Contours: detect cards
- Morphology, close: use closing to merge close cards into a single group to detect hands (i.e. based on distance)
- Image differencing: match corner to preset images to detect rank/suit
???:
- Motion detection for raster graphics editor
- i.e. draw using hand gestures
- Webcam facing user
- Algorithms:
- Image rectification
- Classification/localization CNN to crop to hand
- Instance segmentation CNN to detect keypoints
- Use relative distance between keypoints as gesture/controls
???:
- Real-time face replacement
- https://learnopencv.com/face-swap-using-opencv-c-python
- Algorithms:
- Face detection with the dlib library - HOG + linear SVM face detector?
- Face alignment: convex hull
- Delaunay triangulation
- Texture mapping by using affine warp to map triangles between the two images
- OpenCV Seamless cloning
???:
- Blood splatter analysis
- Algorithms:
- Pre-processing:
- Thresholding
- Dilation
- Resizing
- CNN training: ResNet-50
???:
- Paper piano
- Use built-in laptop camera (paper on top of keyboard/trackpad?)
- One finger only
- Algorithms:
- Adaptive thresholding (with Gaussian mean)
- Morphology: opening for noise reduction
- Ramer-Douglas-Peucker for contour detection
- Finger detection with color thresholding
???:
- Real time number input for timer control using static hand gestures
- Algorithms:
- MediaPipe hands
- Gesture classification:
- Feature-angle thresholding
- Support vector machine (SVM)
- Debouncing:
- https://ieeexplore.ieee.org/abstract/document/8868766
- Time/frame-delay debouncing
???:
- ACL injury risk parameters
- Record jumping video
- Depth camera
- Detect 4 risk factors
- Algorithms:
- Detectron2 keypoint detection
???:
- Hand gesture controlled calculator
- Use vector of each finger as an input bit
- Thumbs were a special case
- Algorithms:
- Hand keypoint detection
- Kalman filter predictions
???:
- Butterfly/moth classification
- Algorithms:
- Instance segmentation
- Non-maximum suppression
- Detectron2 to crop
- resnet18 to classify
???:
- Blood vessel extraction from image
- Algorithms:
- Binary thresholding
- Morphology: open/close/erosion
- CLAHE algorithm to increase contrast
- Illumination equalization
- Gaussian
- Otsu’s threshold: vessel segmentation
???:
- Hand gesture recognition for sign language w/ smartphone camera
- Algorithms:
- Histogram
- Histogram backprojection
- Morphology: closing
- Inception v3 neural network
???:
- SLAM with monocular video
- Algorithms:
- Previous and current frame input into ORB keypoint detection
- FLANN keypoint matching
- Lowe’s ratio match pruning
- RANSAC 5-point: generate essential matrix
- Velocity information for pose recovery
- Filter spurious transforms
- Increment by transformations
???:
- Cow teat detection
- Camera under cow
- Algorithms:
- Blob detection
- Thermal filter - coldest = teat
- Filter by circularity, aspect ratio
- Morphology: erosion/dilation
- Repeat with different morphology parameters until four teats in valid shape detected
???:
- Visual cue for call detection for pass gesture in basketball?
- Detect gesture from player so that robot shoots ball?
- Algorithms:
- Body pose estimation with Detectron2
- Identify largest skeleton in frame
- Locate wrist
- Extract subframes around wrist (proportional to size of skeleton)
- Mediapipe hand pose
- Measure distances between certain keypoints to detect gesture
- Threshold
- Pass threshold pass/fail to FSM
???:
- Sudoku detection/solving
- Algorithms:
- Gaussian
- Adaptive Gaussian thresholding
- Contour filling: identify contours to determine board outline
- Green’s theorem
- Warp image to board shape
- OCR to identify numbers
- Solve board using backtracking
???:
- Real time NZ sign language detection
- Algorithms:
- Pre-process CNN input:
- Binary thresholding
- Issues: lighting, busy backgrounds
- Canny edge detection
- Binary thresholding
- CNN
???:
- Emotion detection of audiences (multiple people)
- Deepface for facial recognition
- 9 layer NN
- Algorithms:
- Haar cascade
- Face frontalization
- 2D/3D alignment
???:
- Speed limit recognition of NZ street signs
- Detect, then read
- Detection:
- Color space transform to emphasize red (CIELUV)
- Gaussian
- Circle hough transform
- Text
- Crop to isolate text
- Filter
- Threshold: Otsu’s method
- Tesseract OCR
???:
- Real time dart scoring
- Identify dart throw events
- Algorithms:
- Create foreground mask?
- Board detection:
- Canny edge
- Hough transform
- Morphology
- Contouring
- Dart detection:
- Background subtraction
- Morphology
- Contouring
- Keypoint detection with YOLOv4-tiny
???:
- Stereo imaging to find distance/orientation of plane
- Want to require low overlap
- Crop images (Right edge/left edge: overlap)
- Create disparity map
- Normalize values
- Convert to depth
- Average values
- Reproject to 3D
???:
- Emotion recognition through facial expression
- Algorithms:
- Haar cascades face recognition
- Facial action coding system
- Augmentation for classifier training: brightness, rotation, shift
- Classifier: CNN + softmax
???:
- Measuring body dimensions with depth camera
- Use 3D points to get dimensions
- Didn’t get to algorithms – out of time
???:
- Music sheet reader for the visually impaired
- Use eye tracking to zoom in and pan into the sheet
- Prior research: automatic page turning but no zoom
- Algorithms:
- Template matching
- Gaussian
- Region of interest
- Gradient orientation pattern (eye tracking)
???:
- Mobile pool ball detection/identification
- Track position of all balls and identify for scoring
- Prior research: permanent setups
- Try use hand-held video
- Algorithms:
- Gaussian
- Hough circle transform to mask ball
- Get average color of mask
- Use color to identify ball score
- Low success rate: 38%
???:
- Inventory stocktaker
- Repeated patterns: instance segmentation?
- One phone camera, using flash to try to control lighting
- Detects repeated vertical lines to count items
- Algorithms:
- Contouring
- Hough lines
- Angle filter
- Morphology: erode/dilate
- Hough lines
???:
- Beer pong score keeper
- Algorithms:
- Gaussian
- Ball detection: color mask + blob detection
- Cups: hough circle transform
- Issues:
- Top-down only
- Lighting/shadows
- Hardcoded size? Distance/lens combination fixed
???:
- Apple detection (for packaging robot)
- Create own dataset
- Algorithms:
- To remove background:
- HSV masking
- Morphology
- Segmentation labeling for training data
- Mask R-CNN (detectron2)
???:
- Face recognition/tracking on multiple subjects without prior training
- Real time
- Store embedding of faces, store vectors and compare (HOG)
- Algorithms:
- Single shot multibox detector (resnet base)
- Ensemble of regression trees
- Embedding creation: ResNet
- Recognition: euclidean distance + linear embedding search
???:
- Document supermarket receipts
- Product names, cost, total cost
- Use scanner rather than smartphone camera
- Algorithms:
- Adaptive binarization: Otsu’s method
- Morphology: erosion
- Tesseract OCR
???:
- Antarctic snowstorm classification with CNN
- ResNet-18 model
- Manually label dataset, then artificially grow
- Augment dataset: horizontal flip, crop, resize
- Classification:
- Random crops from image: classify each and combine
- Algorithms:
- Cross entropy loss function
- Stochastic gradient descent
- Ensemble method to combine
???:
- Cricket ball shot tracker
- Algorithms:
- Differencing
- Thresholding
- Median blur
- Dilation
- Contour detection
???:
- Guitar string picking detection
- Algorithms:
- Fretboard/frets/string detection:
- Canny edge
- Hough line
- Pick location: user identifies initial position for template matching, then tracked
- Detecting picks: after velocity goes above threshold, find rapid deceleration
???:
- Hand gestures to control interactive display
- MediaPipe
- ML to recognize gestures using keypoints
- Limited input speed
???:
- Dartboard segmentation
- Single camera
- Algorithms:
- HSV color mask
- Morphology: open/close
- Flood filling
- Edge segmentation:
- Canny edge
- Hough line
- Centroid calculation
- Point multipliers (thin green/red regions): HSV + morphology
- Scoring regions: flood filling + bitwise operations
- Wedges: canny + hough line (with centroid to determine board center)
???:
- Cricket batting shot classification with pose estimation
- Algorithms:
- Gaussian
- Canny edge detection
- Detectron2 pose estimation
- SKLearn
???:
- Dirty dishes on kitchen bench
- Prior: Hough circle, Mask-RCNN
- Solution: Mask-RCNN with COCO dataset
- Plates: transfer learning using resnet?
- Motion detection to reduce false positives
- Gaussian
- Background segmentation
???:
- Darts scoreboard identification
- Blur: gaussian blur. HSV
- Dilate/erosion
- TODO
- Mask generation: thresholding
- Region segmentation: Hough lines, Canny edge
- Motion detection - double differencing
- Flood fill
- Double differencing
- Triangle interior
???:
- Real time mask detection
- OpenCV/numpy face detection
- Keras/tensorflow face mask
- MobileNetV2
???:
- Manufacturing defect detection of surgical reamers
- Otsu thresholding
- Hough transform
- Morphology close
- Topographic TODO
- TODO 0:42
???:
- Robocup object detection, SLAM
- Identify cup weights
- Real time area mapping
- Haar cascade classifier
- ORB feature detection
- TODO
- RANSAC
- ORB-SLAM - didn’t work due to low color variance
???:
- Wheelchair docking assistant
- Prevent damage to desk/chair, injuries
- Slow down when close to the desk
- Object detection: instance segmentation (mask R-CNN)
- Distance calculation: stereo camera
???:
- Low cost stereo with two webcams
- Focus on speed: real time
- Two cameras taped together in a box
- Camera calibration
- Stereo rectification
- Disparity calculation
???:
- Determining queue times with CV
- Greyscale
- Detect faces
- Front-on faces
- Calculate bounding box
- Centroid
- TODO
- Haar cascade classifier
- Masking
???:
- Measuring growth rates
- Crop/rotate.
- Blur, HSV threshold, erode/dilate
- Create contours of plants from mask
- Exclude contours outside of region of interest
- Find size of contours
- As leaves get closer to camera, visual size increases
???:
- Vehicle lane positioning with semantic segmentation
- If no road markings
- DeepLabV3 semantic segmentation
- Extract road surface, denoise road surface mask
- median blur
- Color threshold
- Morphology
- Canny edge
- TODO
- 0:53
???:
- Climbing tutor: track 3D poses
- Structure from motion
- Perspective n-point
- Locate keypoints: MediaPipe
- Triangulate
- Bundle adjustment: Least squares solver
???:
- Interactive musical webcam
- Volume/pitch depends on hand position
- Single shot detector
- Feature mapping
- Object prediction: CNN
- Object detection: TODO
- TODO
???:
- Real time background subtraction on people with HMD (AR)
- No green screen, single camera, no depth
- YOLACT for instance segmentation
- Mask
- Contour detection
- Thresholding
- TODO
Labs
Lab 01
Algorithms
Thresholding
Binarization of images depending on brightness.
OpenCV has a few types of thresholding.
Simple/basic thresholding uses a single, global threshold for binarization.
Adaptive thresholding computes, for each pixel, an ‘average’ of the neighboring pixels (a square window centered on the target pixel), then subtracts a constant from that average to obtain the threshold for that pixel.
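A minimal OpenCV sketch contrasting the two, assuming a placeholder greyscale input; the window size (11) and the subtracted constant (2) are arbitrary.

```python
# Global vs. adaptive thresholding in OpenCV.
import cv2

gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)   # placeholder input image

# Global: one threshold (127) for the whole image.
_, global_bin = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

# Otsu: picks the global threshold automatically from the histogram.
_, otsu_bin = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Adaptive: threshold = mean of the 11x11 neighbourhood minus the constant 2.
adaptive_bin = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                     cv2.THRESH_BINARY, 11, 2)
```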
Morphology (a combined OpenCV sketch follows this list)
- Dilation:
- Result: objects get larger
- AND two matrices together
- Values must be 0 or 1
- If one is smaller than the other, treat the pixels outside them as being zero
- Repeat for every possible translation
- OR results: if the value of the pixel is one in any translation, set the value to one
- Erosion:
- Result: objects get smaller
- In the larger matrix (the image), pick a pixel
- Set that pixel to one only if the smaller matrix (the structuring element), placed at that pixel, fits entirely within the ones of the image, i.e. every one in the structuring element lines up with a one in the image
- Opening
- Result: smooths the images and narrows lines
- Erode by the smaller matrix, then dilate the result by the smaller matrix
- Closing:
- Result: fills gaps and holes
- Dilate by the smaller matrix, then erode the result by the smaller matrix
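A minimal OpenCV sketch of the four operations, assuming a placeholder binary mask and an arbitrary 5x5 structuring element.

```python
# The four morphological operations described above, using a 5x5 structuring element.
import cv2
import numpy as np

binary = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)         # placeholder binary mask
kernel = np.ones((5, 5), np.uint8)

dilated = cv2.dilate(binary, kernel)                          # objects grow
eroded = cv2.erode(binary, kernel)                            # objects shrink
opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)     # erode then dilate: removes specks
closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)    # dilate then erode: fills gaps
```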
Misc.
- Contour tracing:
- Frame difference (see the sketch after this list):
- With three frames (and hence additional latency):
(frame1 ^ frame2) & (frame2 ^ frame3) - Called the double-difference algorithm; reduces ghosting compared to a standard two-frame difference
- Using three frames allows ghosting to be minimized
- Foreground aperture: if the object’s motion between frames is small relative to its size, the object overlaps itself across frames, so no difference appears in the center of the object
- If you have absolute control of lighting, background subtraction will be much better
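A minimal sketch of the double-difference idea, with absolute differences plus thresholding standing in for the XOR of binarized frames; the threshold value is arbitrary.

```python
# Double-difference motion detection over three consecutive greyscale frames.
import cv2

def double_difference(f1, f2, f3, thresh=25):
    # Absolute differences stand in for the XOR of binarized frames.
    d12 = cv2.threshold(cv2.absdiff(f1, f2), thresh, 255, cv2.THRESH_BINARY)[1]
    d23 = cv2.threshold(cv2.absdiff(f2, f3), thresh, 255, cv2.THRESH_BINARY)[1]
    # AND of the two differences keeps only pixels that changed in both pairs,
    # which suppresses the ghost left behind at the object's old position.
    return cv2.bitwise_and(d12, d23)
```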
Lab 02
Kalman filter:
- https://github.com/rlabbe/Kalman-and-Bayesian-Filters-in-Python
- Combine noisy measurements to get a better estimate of the real state (see the sketch after this list)
- Unscented Kalman filter: for highly non-linear state transitions
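A minimal constant-velocity sketch using OpenCV's built-in Kalman filter (the labs may use a different implementation); the noise covariances are arbitrary placeholders.

```python
# Constant-velocity Kalman filter tracking a 2D point (state: x, y, vx, vy).
import cv2
import numpy as np

kf = cv2.KalmanFilter(4, 2)   # 4 state variables, 2 measured
kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], np.float32)
kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                 [0, 1, 0, 0]], np.float32)
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-3       # placeholder tuning
kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1   # placeholder tuning

def track(measurement_xy):
    """Predict the next position; correct with a measurement (e.g. a blob centre) if available."""
    prediction = kf.predict()                     # predicted (x, y, vx, vy)
    if measurement_xy is not None:
        kf.correct(np.array(measurement_xy, np.float32).reshape(2, 1))
    return prediction[:2].ravel()
```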
Blob detector:
- Threshold: binarize image with global threshold with values between min and max, incremented by some value
- Grouping: connected pixels grouped together within each image to form blobs
- Merging: centers of blobs computed, blobs between images merged together if closer than threshold
- Estimate final centers
- Filter blobs by:
- Color: color of the blob
- Area: blob area between min and max
- Circularity: how ‘circular’ the blob is (4π × area / perimeter²). 1 means a perfect circle
- Convexity: ratio of area to the area of its convex hull. 1 means completely convex
- Inertia ratio: think moment of inertia; a circle has the smallest inertia for a given area (ratio 1), a line the greatest (ratio 0). A configuration sketch follows this list
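A minimal configuration sketch of OpenCV's SimpleBlobDetector using the filters listed above; all thresholds and limits are arbitrary placeholders.

```python
# SimpleBlobDetector configured with the filters listed above.
import cv2

params = cv2.SimpleBlobDetector_Params()
params.minThreshold = 10          # binarization sweep from min...
params.maxThreshold = 200         # ...to max threshold
params.filterByArea = True
params.minArea = 50
params.filterByCircularity = True
params.minCircularity = 0.7       # 1.0 = perfect circle
params.filterByConvexity = True
params.minConvexity = 0.8
params.filterByInertia = True
params.minInertiaRatio = 0.5      # 1.0 = circle, 0.0 = line

detector = cv2.SimpleBlobDetector_create(params)
gray = cv2.imread("balls.png", cv2.IMREAD_GRAYSCALE)   # placeholder input image
keypoints = detector.detect(gray)                      # blob centres and sizes
```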
Lucas-Kanade Optical Flow:
- Detecting the velocity of features between frames (see the sketch after this list)
- Assumes that the pixel intensities of an object do not change between frames, and that neighboring pixels have similar motion. Falls apart when lighting (or background) changes.
https://docs.opencv.org/4.5.0/d4/dee/tutorial_optical_flow.html
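A minimal sketch of sparse Lucas-Kanade flow between two placeholder frames, seeding it with Shi-Tomasi corners; the window size and pyramid depth are arbitrary.

```python
# Sparse Lucas-Kanade optical flow: track Shi-Tomasi corners between two frames.
import cv2
import numpy as np

prev = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)   # placeholder frames
curr = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# Pick good features to track in the first frame.
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=200, qualityLevel=0.01, minDistance=7)

# Pyramidal LK: estimate where each feature moved to in the second frame.
p1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None,
                                           winSize=(21, 21), maxLevel=3)

good_new = p1[status.ravel() == 1]
good_old = p0[status.ravel() == 1]
velocities = good_new - good_old   # per-feature displacement between the frames
```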
Lab 03
Tesseract OCR:
- Developed at HP in the ’80s and ’90s; picked up by Google after it was open-sourced in 2005
- Can detect characters from multiple languages, or simply return bounding boxes of characters
- Requires a clean, binarized image (see the sketch after this list)
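A minimal sketch using the pytesseract wrapper on a binarized placeholder image (Tesseract itself must be installed separately).

```python
# OCR on a cleaned-up, binarized image using the Tesseract wrapper pytesseract.
import cv2
import pytesseract

gray = cv2.imread("receipt.png", cv2.IMREAD_GRAYSCALE)   # placeholder input image
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

text = pytesseract.image_to_string(binary)    # recognized text
boxes = pytesseract.image_to_boxes(binary)    # per-character bounding boxes
```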
Open3D:
- Point cloud visualization
- Point clouds are noisy; to filter out outliers, find each point’s nearest n neighbors and eliminate points whose mean distance to them is greater than some threshold
- Segmentation:
- Detect shapes within point clouds
- Filter out points outside some range
- Fit points to primitive shapes (e.g. planes, cylinders) using RANSAC (RANdom SAmple Consensus); see the sketch after this list
- Randomly pick a few data points and create a model that matches the primitive (e.g. for plane, pick three points, generate equation for the plane)
- Find what points are consistent with the model
- Outliers are further away than some error threshold
- Repeat until you get a model with few outliers
- Using all non-outlier points, generate a new model
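A minimal Open3D sketch of the outlier-removal and RANSAC steps above; the file name, neighbour count, and thresholds are placeholders.

```python
# Point-cloud outlier removal and RANSAC plane fitting with Open3D.
import open3d as o3d

pcd = o3d.io.read_point_cloud("scene.ply")   # placeholder point cloud file

# Statistical outlier removal: drop points whose mean distance to their
# 20 nearest neighbours is much larger than the overall average.
pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)

# RANSAC plane fit: repeatedly sample 3 points, build a plane, count inliers.
plane_model, inlier_idx = pcd.segment_plane(distance_threshold=0.01,
                                            ransac_n=3,
                                            num_iterations=1000)
a, b, c, d = plane_model                      # plane equation ax + by + cz + d = 0
plane = pcd.select_by_index(inlier_idx)       # points consistent with the model
rest = pcd.select_by_index(inlier_idx, invert=True)
```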