All Files in ‘COSC428 (2022-S1)’ Merged

01. Introduction

Weighting:

Goal: to recognize objects and their motions.

Signal processing on 1D data; computer vision on 2D data.

Image processing on still images; computer vision on video and still images.

Difficulties

The sensory gap: gap between reality and what a recording of the scene can capture.

The semantic gap: lack of contextual knowledge about a scene.

The human visual system is very good:

Recovering 3D information

Several cues available:

Or actual depth hardware:

Labs: Intel Realsense D435.

Processing

The higher-level the processing, the less generic and the more domain-specific knowledge is required.

Low-Level Image Processing:

Approaches

02. Perception and Color

Color Physics

Simplified rendering model: illumination * reflectance (relative energy, 0 to 1) = color signal.

For project: use consistent light source to prevent color shifting

Human vision system: optimized for spectrum from the sun (greatest intensity at ~500 nm).

Reflectance spectra: reflectance (0 to 1) as a function of wavelength. Eyes (and cameras) simplify this signal by reducing it to intensities detected by three types of sensors; hence, objects with different spectral albedos may be perceived as having the same color (metamerism).

Spectral colors can be described as a spectrum with a single peak in a given wavelength range.

Color mixing:

Color Perception

The Human Eye

Specifications:

Physiology:

Brain and processing:

Color Spaces

Although perceived color depends on illumination and surroundings, assume for now that the spectrum of light arriving at the eye fully determines the perceived color.

03. Cameras and Lenses

Pinhole Cameras

Projection Equation

Let $f$ be the focal distance - the distance between the pinhole and the sensor. A point $(x, y, z)$ will be projected onto the sensor at:

$$ \begin{aligned} u &= f \frac{x}{z} \\ v &= f \frac{y}{z} \end{aligned} $$

and $z = f$. Although all values would be multiplied by $-1$ for a real camera, we can model the sensor plane as being in front of the pinhole. This projection can be represented as a matrix equation:

$$ \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} \sim \begin{pmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} $$
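A minimal numpy sketch of this projection (the focal length and point values are illustrative):

import numpy as np

f = 0.05  # focal distance (illustrative value)

# 3x4 projection matrix for the ideal pinhole model
M = np.array([[f, 0, 0, 0],
              [0, f, 0, 0],
              [0, 0, 1, 0]])

P = np.array([0.2, 0.1, 2.0, 1.0])  # homogeneous world point (x, y, z, 1)

uvw = M @ P              # homogeneous image coordinates
u, v = uvw[:2] / uvw[2]  # divide by z to get (f*x/z, f*y/z)
print(u, v)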

Lenses

Pinhole cameras must balance diffraction (aperture too small) and light ray convergence (aperture too large). Lenses allow far more light to pass through while still allowing light rays to converge on the same point on the sensor plane for light from a specific distance.

The most basic approximation of a lens is the thin lens, which assumes that the lens has zero thickness. More accurate models:

Camera Calibration

Used to determine relationship between image coordinates and real-world coordinates - geometric camera calibration.

Intrinsic Parameters

Improvements are needed to the matrix above to consider:

$$ \begin{aligned} u &= \alpha \frac{x}{z} - \alpha\cot(\theta)\frac{y}{z} + u_0 \\ v &= \frac{\beta}{\sin(\theta)} \frac{y}{z} + v_0 \end{aligned} $$

As a matrix:

$$ \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \frac{1}{z} \begin{pmatrix} \alpha & -\alpha\cot(\theta) & u_0 & 0 \\ 0 & \frac{\beta}{\sin(\theta)} & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} $$

In more compact notation: $$ \overrightarrow{p} = \frac{1}{z} \begin{pmatrix} K & \overrightarrow{0} \end{pmatrix} \overrightarrow{P} $$

Where $\overrightarrow{P}$ are the world coordinates and $\overrightarrow{p}$ are the pixel coordinates.

Then the extrinsic parameters (the translation and rotation of the camera frame) must be taken into account, further complicating things: $$ {}^C\overrightarrow{P} = {}^{C}_{W}R \, {}^{W}\overrightarrow{P} + {}^{C}\overrightarrow{O}_{W} $$ Combining the two: $$ \overrightarrow{p} = \frac{1}{z} K \begin{pmatrix} {}^{C}_{W}R & {}^{C}\overrightarrow{O}_{W} \end{pmatrix} \overrightarrow{P} = \frac{1}{z} M \overrightarrow{P} $$

$$ \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \frac{1}{z} \begin{pmatrix} m_1^T \\ m_2^T \\ m_3^T \end{pmatrix} \overrightarrow{P} $$

$1 = \frac{m_3 \cdot \overrightarrow{P}}{z}$ and hence $u = \frac{m_1 \cdot \overrightarrow{P}}{m_3 \cdot \overrightarrow{P}}$ and $v = \frac{m_2 \cdot \overrightarrow{P}}{m_3 \cdot \overrightarrow{P}}$

By using these equations on many features, we can find the value of $M$ that minimizes the error, and hence determine the intrinsic and extrinsic parameters.
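In practice this is usually done by imaging a known target. A sketch using OpenCV's chessboard calibration, assuming a hypothetical folder of chessboard images calib/*.png with a 9×6 inner-corner pattern:

import glob
import cv2
import numpy as np

pattern = (9, 6)  # inner corners of the chessboard (assumed)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob('calib/*.png'):  # hypothetical image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# K is the intrinsic matrix; rvecs/tvecs are the per-image extrinsics
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print(K)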

04. Filters

Modifying the pixels of an image based on a function taking input from pixels in the target pixel’s local neighborhood.

Linear Functions

When a filter is linear, the value of each pixel is a linear combination of its neighbors.

Where $I$ is the image, $g$ is the kernel, and $g[k, l]$ is the value of the kernel, where $g[0, 0]$ is the center element:

$$ f[m, n] = I \otimes g = \sum_{k, l}{I[m - k, n - l]\,g[k, l]} $$

That is, the dot product of its local neighborhood multiplied by a kernel. This process is called a convolution.

Convolution kernels can be used to make a Gaussian filter - a blur. Blurring/smoothing the image reduces noise - high-frequency information.

$\sigma$, the radius (standard deviation) of the Gaussian kernel, is called the scale.

Multiple linear functions can be stacked: in addition to convolution (multiplication by a kernel), addition and subtraction are also valid operations.

For example, to sharpen an image, you can multiply the pixel magnitudes (and hence both low- and high-frequency information) by 2, then subtract a blurred (i.e. low-pass filtered) version of the image, leaving an image with its edges over-accentuated. In pseudo-code: 2 * img(x, y, 1) - dot(gaussian(sigma), img(x, y, sigma)). This approximates the Laplacian of Gaussian filter (the Laplacian being the sum of the second partial derivatives).
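A sketch of this sharpening trick with OpenCV; the file name and sigma are illustrative:

import cv2

img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE)   # hypothetical input
blurred = cv2.GaussianBlur(img, (0, 0), sigmaX=3)     # low-pass filtered copy

# 2*img - blurred: the original plus its high-frequency detail
sharpened = cv2.addWeighted(img, 2.0, blurred, -1.0, 0)
cv2.imwrite('sharpened.png', sharpened)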

Gradients and Edges

An edge is a single point: a series of edge points is a line.

An edge is a point of sharp change (reflectance, object, illumination, noise) in an image.

The general strategy is to use linear filters to estimate the image gradient, then mark points where the change in magnitude is large in comparison to its neighbors.

Fourier Transforms

The Fourier transform of a real function is complex; in this course the phase component is ignored and we only care about the magnitude.

All natural images have a similar magnitude transform: running the inverse transform with the magnitude taken from another image returns similar results.

In the magnitude image generated from a Fourier transform, the center of the image corresponds to zero frequency while the edges of the image correspond to higher frequencies. Hence, by masking the image with a circle and then applying the inverse transform, either a low-pass (outside masked out) or high-pass (inside masked out) filter can be produced.

In the transform, the input image is tiled to infinity: this may cause discontinuities to occur at the edges. However, the effects of this can be mitigated by fading the image to grey near the edges (e.g. in a circle with a Gaussian).
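A sketch of frequency-domain filtering with numpy; the input file and cut-off radius are illustrative:

import cv2
import numpy as np

img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE).astype(np.float32)

F = np.fft.fftshift(np.fft.fft2(img))   # zero frequency moved to the centre

rows, cols = img.shape
y, x = np.ogrid[:rows, :cols]
radius = 30                             # cut-off radius (illustrative)
mask = (y - rows / 2) ** 2 + (x - cols / 2) ** 2 <= radius ** 2

low_pass = np.fft.ifft2(np.fft.ifftshift(F * mask)).real    # keep only the centre
high_pass = np.fft.ifft2(np.fft.ifftshift(F * ~mask)).real  # mask out the centre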

https://homepages.inf.ed.ac.uk/rbf/HIPR2/fourier.htm

05. Edge Detection

Challenge: convert a 2D image into a set of curves.

That is, find all edge points that are (mostly) connected, then join them into curves.

Edges come from:

Edge profiles:

    |-----    /\        ||
----|        /  \    ___||____
Step         Roof      Line

Edge detection:

Edges are the points at which the rate of change reaches a maximum - when the second derivative is zero.

$$ \nabla f = \left[\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right] $$

The gradient, which points in the direction of greatest change, can be represented by an angle and magnitude:

$$ \begin{aligned} \theta &= \tan^{-1}\left(\frac{\partial f / \partial y}{\partial f / \partial x}\right) \\ \Vert\nabla f\Vert &= \sqrt{\left(\frac{\partial f}{\partial x}\right)^2 + \left(\frac{\partial f}{\partial y}\right)^2} \end{aligned} $$

However, on a discrete image, approximations can be made.

The Sobel operator:

$$ \Delta_x = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix} \quad \Delta_y = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix} $$

(NB: need to scale by $\frac{1}{8}$ to get the right gradient value, but this is irrelevant for edge detection)

The Roberts cross operator (for diagonal edges):

$$ \Delta_1 = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} \quad \Delta_2 = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} $$

The 4×4 Prewitt operator:

$$ \Delta_x = \begin{pmatrix} -3 & -1 & 1 & 3 \\ -3 & -1 & 1 & 3 \\ -3 & -1 & 1 & 3 \\ -3 & -1 & 1 & 3 \end{pmatrix} \quad \Delta_y = \begin{pmatrix} 3 & 3 & 3 & 3 \\ 1 & 1 & 1 & 1 \\ -1 & -1 & -1 & -1 \\ -3 & -3 & -3 & -3 \end{pmatrix} $$

Can use trigonometry to determine direction of the gradient using the horizontal and vertical Sobel/Prewitt operators.
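A sketch of computing gradient magnitude and direction using OpenCV's Sobel operator (the file name is a placeholder):

import cv2

img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE)

gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)   # horizontal derivative
gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)   # vertical derivative

# Gradient magnitude and direction at every pixel
magnitude, angle = cv2.cartToPolar(gx, gy, angleInDegrees=True)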

Looking at only the adjacent pixel may be useless in situations with large amounts of noise. Hence, use a Gaussian kernel to smooth the image before applying the gradient operator. However, this can be done more efficiently by applying the derivative function to the Gaussian kernel, then applying it to the signal (the derivative theorem of convolution).

Then the point of maximum gradient, where the second derivative crosses zero, can be found to detect the edge.

The sum of the second partial derivatives is called the Laplacian.

Canny Edge Detection

An optimal edge detection algorithm should have:

Under the assumptions of a linear filter and independent and identically distributed Gaussian noise, the optimal detector is approximately the derivative of the Gaussian.

Detection and localization are diametrically opposed to each other: more smoothing leads to better detection but worse localization.

Canny edge detector steps:

A low value of $\sigma$, the radius of the Gaussian blur, detects fine features, while a large value detects large-scale edges.
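A minimal OpenCV sketch; the blur sigma and hysteresis thresholds are illustrative:

import cv2

img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(img, (5, 5), 1.4)                # sigma sets the scale
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)   # hysteresis thresholds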

Scale Space

As $\sigma$ (the scale of the image) gets larger:

Multiple representations of the image at different scales can be generated.

If an edge detection algorithm is applied to all the images, edges can be matched between images:

Edge Detection by Subtraction

Subtract a Gaussian-blurred image from the original, then scale and add an offset. This works because low-frequency information mostly remains in the blurred image and hence gets removed in the subtraction. This set of operations approximates the Laplacian of Gaussian.

Hough Line Detection

Finds straight lines in a binary image (e.g. the output of the Canny algorithm).

Uses a voting scheme instead of naively searching for lines at every single position and orientation.

The Hough space is a transform of the $x$-$y$ coordinate space to $r$-$\theta$ space where:

For each point (that is an edge pixel) and for every angle, find $r$ and increment the value in the Hough space by one - the point votes for every line it could be part of.

The points with the largest vote counts are the straight lines we are most confident in.

The Hough circle transform does the same, except using the $x$-$y$-$r$ space.

The same technique works for any curve that can be expressed in parametric form, although the parameter space can get huge.
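A sketch of the OpenCV Hough line and circle transforms; all thresholds, resolutions and radii are illustrative:

import cv2
import numpy as np

img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 50, 150)

# Each returned line is (r, theta); threshold is the minimum vote count
lines = cv2.HoughLines(edges, rho=1, theta=np.pi / 180, threshold=120)

# Circle transform votes in x-y-r space
circles = cv2.HoughCircles(img, cv2.HOUGH_GRADIENT, dp=1, minDist=30,
                           param1=150, param2=40, minRadius=5, maxRadius=80)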

OpenCV Tutorial:

Corners

Doors and corners, kid. That’s where they get you

Detective Miller

What is the gradient at the corner? Near the corner, edges have gradients going in two different directions and at the corner, the gradient is ill-defined. Hence, edge detectors tend to fail at corners.

Corners, however, are useful for tracking movement between frames.

Harris Corner Detection

Over a small window, a corner is likely to have high intensity variation in all directions. This uses the sum-squared difference.

Given:

For all pixels $(x, y)$ that are part of the window:

$$ E(\Delta x, \Delta y) = \sum_{x, y}{w(x, y) \left[I(x + \Delta x, y + \Delta y) - I(x, y)\right]^2} $$

That is, given two windows, calculate the difference between each pair of pixels, square them, and sum them.

Using the Taylor expansion, this can be approximated to:

$$ \begin{aligned} I(x + \Delta x, y + \Delta y) &= I(x, y) + \Delta x \frac{\partial I}{\partial x}(x, y) + \Delta y \frac{\partial I}{\partial y}(x, y) \\ &= I(x, y) + \Delta x I_x(x, y) + \Delta y I_y(x, y) \end{aligned} $$

Giving the approximation:

$$ \begin{aligned} E(\Delta x, \Delta y) &\approx \sum_{x, y}{\left[ \Delta x I_x(x, y) + \Delta y I_y(x, y) \right]^2} \\ &\approx \sum_{x, y}{\left( \Delta x^2 I_x(x, y)^2 + 2 \Delta x \Delta y I_x(x,y) I_y(x,y) + \Delta y^2 I_y(x, y)^2 \right)} \\ &\approx \begin{pmatrix} \Delta x & \Delta y \end{pmatrix} M \begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix} \end{aligned} $$

Where:

$$ M = \sum_{x, y} \begin{bmatrix} \left(I_x\right)^2 & I_x I_y \\ I_x I_y & \left(I_y\right)^2 \end{bmatrix} $$

Given $\lambda_1$ and $\lambda_2$ are the eigenvalues of $M$:

$$ \begin{aligned} R &= \det(M) - k\left(\text{trace}(M)\right)^2 \\ &= \lambda_1 \lambda_2 - k\left(\lambda_1 + \lambda_2 \right)^2 \end{aligned} $$

Corners have large values of $R$; edges occur where $R < 0$.
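A minimal sketch using OpenCV's Harris implementation; the window size, Sobel aperture, $k$ and response threshold are illustrative:

import cv2
import numpy as np

img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE)

# blockSize = summation window, ksize = Sobel aperture, k = trace weighting
R = cv2.cornerHarris(np.float32(img), blockSize=2, ksize=3, k=0.04)

corners = R > 0.01 * R.max()   # threshold on the corner response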

06. Local Features

Scale and rotation invariant descriptors.

Correspondence using window matching: matching individual points is highly ambiguous, so windows of surrounding pixels are compared instead.

Stereo Cameras

Baseline: distance between cameras. Wider baseline allows greater accuracy further away, while a smaller baseline allows overlap at closer distances. Increased camera resolution can increase depth resolution overall.

Rectification: transform of images onto a common image plane (i.e. the sensor, but inverted). The image planes of the two cameras must be parallel.

Correspondence using window matching:

Image normalization: variation in sensor gain/sensitivity means normalization is recommended.

Window magnitude: $$ \left\Vert I \right\Vert_{W_m(x, y)} = \sqrt{\sum_{(u, v) \in W_m(x, y)}{\left[I(u, v)\right]^2}} $$

Average:

$$ \bar{I} = \frac{1}{\vert W_m(x, y) \vert}\sum_{(u, v) \in W_m(x, y)}{I(u, v)} $$

Normalized pixel:

$$ \hat{I}_{(x, y)} = \frac{I(x, y) - \bar{I}}{\left\Vert I - \bar{I}\right\Vert_{W_m(x, y)}} $$

Vectorization: convert the window matrix into a vector by unwrapping it row by row (horizontal lines concatenated). Denote this as $\omega$.

Normalization scales the magnitude of the $m^2$-dimensional vector to unit length. Two metrics are possible for comparing two windows: distance and angle.

Distance ((normalized) sum of squared differences):

$$ C_\textrm{SSD}(d) = \Vert \omega_L - \omega_R(d) \Vert^2 $$

$\omega_R(d)$ is the window centered around $(x - d, y)$.

Normalized correlation:

$$ C_\textrm{NC}(d) = \omega_L \cdot \omega_R(d) = \cos(\theta) $$
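A sketch of window-based correspondence using OpenCV's block matcher on an already-rectified stereo pair (file names and parameters are placeholders):

import cv2

left = cv2.imread('left.png', cv2.IMREAD_GRAYSCALE)     # rectified left image
right = cv2.imread('right.png', cv2.IMREAD_GRAYSCALE)   # rectified right image

# numDisparities must be a multiple of 16; blockSize is the matching window size
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right)   # larger disparity = closer object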

Local Features

Aperture problem and normal flow: if you only have a partial view of a moving, one-dimensional object, you cannot always tell how it is moving (e.g. a moving line whose ends are outside the viewport).

Given velocities $u$ and $v$ and partial derivatives $I_x$, $I_y$ and $I_t$ for a given pixel, the brightness change constraint equation (BCCE), which states that brightness should stay constant for a given point over time, can be approximated (1st-order Taylor series) as:

$$ \begin{aligned} I_x u + I_y v + I_t &= 0 \\ \nabla I \cdot \vec{U} &= 0 \end{aligned} $$

Normal flow, the vector representing translation of the line in the direction of its normal, can be written as:

$$ u_\perp = - \frac{I_t}{\vert \nabla I \vert} \frac{\nabla I}{\vert \nabla I \vert} $$

By considering multiple moving points, the velocity $U$ can be found:

$$ \begin{aligned} \nabla I^1 \cdot U &= -I_t^1 \\ \nabla I^2 \cdot U &= -I_t^2 \\ &\dots \end{aligned} $$

Lucas-Kanade

Assumes the same velocity for all pixels within the window, and that pixel intensities do not change between frames.

https://docs.opencv.org/4.5.0/d4/dee/tutorial_optical_flow.html

$$ E(u, v) = \sum_{x, y \in \Omega}{ \left( I_x(x, y)u + I_y(x, y)v + I_t \right)^2 } $$

Solve using:

$$ \begin{bmatrix} \sum{I_x^2} & \sum{I_x I_y} \\ \sum{I_x I_y} & \sum{I_y^2} \end{bmatrix} \begin{pmatrix} u \\ v \end{pmatrix} = -\begin{pmatrix} \sum{I_x I_t} \\ \sum{I_y I_t} \end{pmatrix} $$

LHS: sum of outer product tensor of gradient vector

$$ \left( \sum{\nabla I \nabla I^T} \right) \vec{U} = -\sum{\nabla I \, I_t} $$
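A sketch of sparse Lucas-Kanade tracking in OpenCV (file names and parameters are placeholders):

import cv2

prev = cv2.imread('frame0.png', cv2.IMREAD_GRAYSCALE)   # hypothetical frame pair
curr = cv2.imread('frame1.png', cv2.IMREAD_GRAYSCALE)

# Pick 'good' corners to track (well-conditioned matrices)
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=100, qualityLevel=0.3, minDistance=7)

# Pyramidal Lucas-Kanade: winSize is the window over which the velocity is assumed constant
p1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None,
                                           winSize=(15, 15), maxLevel=2)
flow_vectors = p1[status == 1] - p0[status == 1]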

Good features:

The previous equation can be written as $\mathbf{A}\vec{u} = -\vec{b}$.

For this to be solvable:

Harris Detector

Using auto-correlation on ‘interesting’ points - where there are important differences in all directions.

For a point $(x, y)$ and shift $(\Delta x, \Delta y)$, the auto-correlation is:

$$ f(x, y) = \sum_{(x_k, y_k) \in W}{ \left( I(x_k, y_k) - I( x_k + \Delta x, y_k + \Delta y ) \right)^2 } $$

Avoiding discrete shifts:

$$ I(x_k + \Delta x, y_k + \Delta y) = I(x_k, y_k) + \begin{pmatrix} I_x(x_k, y_k) & I_y(x_k, y_k) \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix} $$

$$ f(x, y) = \sum_{(x_k, y_k) \in W}{ \left( \begin{pmatrix} I_x(x_k, y_k) & I_y(x_k, y_k) \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix} \right)^2 } $$

Auto-correlation matrix:

$$ = \begin{pmatrix} \Delta x & \Delta y \end{pmatrix} \begin{bmatrix} \sum_{(x_k, y_k) \in W}{\left(I_x(x_k, y_k)\right)^2} & \sum_{(x_k, y_k) \in W}{I_x(x_k, y_k) I_y(x_k, y_k)} \\ \sum_{(x_k, y_k) \in W}{I_x(x_k, y_k) I_y(x_k, y_k)} & \sum_{(x_k, y_k) \in W}{\left(I_y(x_k, y_k)\right)^2} \end{bmatrix} \begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix} $$

The matrix captures the structure of the local neighborhood. Interest can be measured using the matrix’s eigenvalues:

Interest point detection can be done using thresholding, or a local maximum for localization.

Feature distortion:

Invariant Local Features

Local features that are invariant to translation, rotation, scale, etc. They should have:

SIFT: scale-invariant feature transform.

Sift explanation
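A minimal usage sketch, assuming an OpenCV build with SIFT available (4.4+ or the contrib package); the file name is a placeholder:

import cv2

img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)  # one 128-D descriptor per keypoint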

07. Morphology

Structural processing of images

From ~1960s

Extracting quantitative descriptions of image components:

Pixels are either object or non-object pixels.

Structuring element: smaller matrix applied to the image

Binary erode:

Binary dilation:

Greyscale erode:

Greyscale dilate:

Distance transform:

Skeleton transform:

Convex hull:
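A sketch of several of these operators in OpenCV, assuming a binary input image and a 5×5 structuring element:

import cv2
import numpy as np

binary = cv2.imread('mask.png', cv2.IMREAD_GRAYSCALE)   # hypothetical binary image
kernel = np.ones((5, 5), np.uint8)                      # structuring element

eroded = cv2.erode(binary, kernel)                           # shrinks object regions
dilated = cv2.dilate(binary, kernel)                         # grows object regions
opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)    # erode then dilate
closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)   # dilate then erode

# Distance from every object pixel to the nearest non-object pixel
dist = cv2.distanceTransform(binary, cv2.DIST_L2, maskSize=5)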

08. Tracking

Kalman Filter

Combine noisy measurements with predictions of how the state changes to get better estimate of real state.

Tracking: inference over time.

Can simplify the problem by assuming linear dynamics and Gaussian noise. An unscented Kalman filter can deal with non-linear state transitions, but still assumes Gaussian noise.

Task: at each time point (and in real-time), re-compute the estimate of position.

Recursive estimation: decompose this into:

Minimal example - running average:

$$ A_t = \alpha A_{t - 1} + (1 - \alpha) y_t $$

Where $\alpha$ is the weight given to the previous estimate and $y_t$ is the $t$-th measurement.

This would be sensitive to noise/occlusion

Tracking:

Generalized model:

Issues:

Simplifying assumptions:

1D Kalman Filter

Assumes the new state can be obtained by multiplying the old state by a constant $d_i$ and adding noise:

$$ x_i \sim N(d_i x_{i - 1}, \theta_{d_i}^2) $$

In other words:

$$ \begin{aligned} \bar{X}^{-}_i &= d_i \bar{X}^{+}_{i - 1} \\ (\theta^{-}_i)^2 &= \theta_{d_i}^2 + (d_i \theta^{+}_{i - 1})^2 \end{aligned} $$

TODO: what is $\theta_{d_i}$? Why not just the second term? (Presumably $\theta_{d_i}$ is the standard deviation of the process noise added by the state transition itself, so the prediction stays uncertain even if the previous estimate were exact.)

TODO: what are $\theta^{+}$ and $\theta^{-}$? (Presumably $\theta^{-}_i$ is the predicted (prior) standard deviation before the measurement arrives, and $\theta^{+}_i$ is the corrected (posterior) standard deviation after the update.)

Once a measurement arrives, this can be corrected:

$$ \begin{aligned} x_i^+ &= \frac{\bar{x}_i^{-} \theta^2_{m_i} + m_i y_i (\theta^{-}_i)^2}{\theta^2_{m_i} + m^2_i (\theta^{-}_i)^2} \\ \theta_i^{+} &= \sqrt{\frac{\theta^2_{m_i}(\theta^{-}_i)^2}{\theta^2_{m_i} + m^2_i(\theta_i^{-})^2}} \end{aligned} $$

Note: $\theta$ does not depend on $y$.

Smoothing: if not running the filter in real time, you can run the algorithm forwards and backwards and take the mean of the two predictions.

Kalman in Python

g-h filter:

def g_h_filter(data, x0, dx, g, h, dt=1.):
  """
  Performs a g-h filter on 1 state variable with a fixed g and h.

  'data' contains the data to be filtered.
  'x0' is the initial value for our state variable.
  'dx' is the initial change rate for our state variable (assumes a linear rate of change).
  'g' is the g-h's g scale factor: g * 100% of the correction comes from the measurement. Should be high for less noisy measurements.
  'h' is the g-h's h scale factor: a larger h responds quicker to change, but is more vulnerable to noise/outliers.
  """
  x_estimate = x0
  results = []
  for x_measurement in data:
    # prediction step: assume the rate of change dx stays constant
    x_prediction = x_estimate + (dx * dt)

    # update step
    residual = x_measurement - x_prediction  # delta between measurement and prediction

    # update the rate of change using the residual;
    # h determines how quickly the rate of change changes
    dx = dx + h * residual / dt

    # update the estimate to be a weighted average of prediction and measurement;
    # g determines the weight given to the measurement
    x_estimate = x_prediction + g * residual
    results.append(x_estimate)
  return results

Example: system where position is being measured and the object has constant velocity

The distance and velocity can be represented as Gaussian distributions:

$$ \begin{aligned} \bar{d} &= \mu_d + \mu_v \\ \bar{\theta}^2 &= \theta_d^2 + \theta_v^2 \end{aligned} $$

Sum of two Gaussians:

$$ \begin{aligned} \mu &= \mu_1 + \mu_2 \\ \sigma^2 &= \sigma_1^2 + \sigma_2^2 \end{aligned} $$

Hence, the prediction can be represented as the sum of the distributions of the previous position and predicted velocity.

Product of two Gaussians:

$$ \begin{aligned} \mu &= \frac{\sigma_1^2 \mu_2 + \sigma_2^2 \mu_1}{\sigma_1^2 + \sigma_2^2} \\ \sigma^2 &= \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2+\sigma_2^2} \end{aligned} $$

The update step returns the estimated position as the product of the distributions of the new measurement and current estimated position.
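A minimal sketch of this predict (sum of Gaussians) / update (product of Gaussians) cycle for the 1D constant-velocity example; all numbers are illustrative:

def predict(pos, pos_var, vel, vel_var):
    # Sum of two Gaussians: means add, variances add
    return pos + vel, pos_var + vel_var

def update(pos, pos_var, meas, meas_var):
    # Product of two Gaussians: precision-weighted mean, smaller variance
    new_pos = (pos_var * meas + meas_var * pos) / (pos_var + meas_var)
    new_var = (pos_var * meas_var) / (pos_var + meas_var)
    return new_pos, new_var

pos, var = 0.0, 500.0            # very uncertain initial position
for z in [1.1, 2.0, 2.9, 4.2]:   # noisy position measurements
    pos, var = predict(pos, var, vel=1.0, vel_var=0.05)
    pos, var = update(pos, var, meas=z, meas_var=0.5)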

Particle Filter

The particle filter allows multiple positions to be predicted, and works with multi-modal and non-Gaussian distributions.

Three probability distributions:

The particle filter processes this into a single probability distribution, the state density $(x_t \mid Z_t)$.

Comparisons:

Kalman in Python

Algorithm:

09. Introduction to Deep Learning

Types:

Deep learning:

Neural networks:

Activation functions: non-linearities needed to learn complex (non-linear) representations of data. More layers and neurons can approximate more complex functions.

Overfitting: when the model fails to generalize outside the training set

10. 3D Reconstruction using Computer Vision

Reconstructing 3D structure and camera positions from a set of images.

Many applications:

Most important algorithms: RANSAC and bundle adjustment.

Also known as structure from motion (SfM) or photogrammetry.

SLAM (simultaneous localization and mapping) is usually real-time while SfM is offline. Closing the loop: once you recognize you have visited a position previously, you need to back-propagate changes to the model which may have drifted.

Summary:

Background

Camera Calibration

Camera calibration: map pixel coordinates to normalized image coordinates: correct factors such as lens distortion, focal length, image center etc.

Feature Matching

Process of choosing point features that appear in two adjacent images.

Features are usually point features found using corner detectors.

Corner features should be:

cv::goodFeaturesToTrack can be used to find Harris corners.

Representations of appearance: ‘feature descriptors’ e.g. image patches, SIFT, SURF

May get incorrect matches: objects which look the same but are different. “gross outliers” - outliers where location error is much higher (orders of magnitude higher) than expected.

For feature registration, algorithms should be robust against:

Homogeneous Coordinates

Homogeneous coordinates: add an extra dimension (e.g. 3D points become 4D points) with $w = 1$.

This allows matrix multiplication to be used to represent rotation, translation, and represent projection by normalization.

Transform of a point $\vec{X}$ with rotation $R$ and translation $\vec{t}$:

$$ T = \begin{bmatrix} R & \vec{t} \\ \vec{0} & 1 \end{bmatrix} $$

Then multiply by $\begin{bmatrix} \vec{X} \\ 1 \end{bmatrix}$.

In homogeneous coordinates:

2-view Reconstruction

Recover rotation, translation and 3D structure given two images.

Planar scenes: Homography Estimation

e.g. aerial images, AR apps. All points are on a single plane, making things easier - one less dimension to worry about and no occlusion.

There are two views of a planar scene and three parameters: rotation, translation, and the plane normal.

A 3x3 matrix $H$ can represent the relative pose of the camera and the plane normal.

Inlier match $\vec{x} \rightarrow \vec{x}'$: $H\vec{x} = \vec{x}'$ when in normalized homogeneous coordinates, where $\vec{x}$ is the 3D/4D point and $\vec{x}'$ is the projection of the point on the camera sensor.

$$ H = R + \frac{\vec{t}\,\vec{n}^T}{d} $$

Where:

Note that there is scale ambiguity in the above formula: we only recover the ratio $\vec{t}/d$, not the actual length of $\vec{t}$ or the plane distance $d$.

$H\vec{x} = \vec{x}'$ can be used to estimate $H$ when you have at least four inlier feature matches. The function cv::findHomography implements this algorithm.

As this only works on inliers, and you don't know which points are inliers without $H$, RANSAC is used to simultaneously estimate $H$ and filter out outliers.

RANSAC

Random sample consensus. A general-purpose framework for fitting a model to data which has gross (very large) outliers.

Steps:

In the case of planar scenes, 4 feature matches.
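A sketch of RANSAC-based homography estimation from matched keypoints in OpenCV; the detector, matcher and threshold choices are illustrative:

import cv2
import numpy as np

img1 = cv2.imread('view1.png', cv2.IMREAD_GRAYSCALE)   # hypothetical image pair
img2 = cv2.imread('view2.png', cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
matches = cv2.BFMatcher().match(des1, des2)

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# RANSAC repeatedly fits H to 4 random matches and keeps the largest inlier set
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=3.0)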

3D Scenes: Essential Matrix Estimation

A 3x3 matrix that represents the relative pose (rotation $R$, translation $\vec{t}$) of two cameras:

$$ \begin{aligned} E &= [t]_x R \\ [t]_x &= \begin{bmatrix} 0 & -t_3 & t_2 \\ t_3 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{bmatrix} \end{aligned} $$

For an inlier match, $\vec{x}'^T E \vec{x} = 0$.

RANSAC is used to compute $E$ from five feature matches.
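A sketch of the corresponding OpenCV calls on a hypothetical image pair, with an illustrative intrinsic matrix $K$:

import cv2
import numpy as np

img1 = cv2.imread('view1.png', cv2.IMREAD_GRAYSCALE)   # hypothetical image pair
img2 = cv2.imread('view2.png', cv2.IMREAD_GRAYSCALE)
K = np.array([[700., 0., 320.],                        # intrinsics from calibration (illustrative)
              [0., 700., 240.],
              [0., 0., 1.]])

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
matches = cv2.BFMatcher().match(des1, des2)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)

# Decompose E into the relative rotation R and unit-length translation t
n_inliers, R, t, pose_mask = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)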

Conversion to a 3D structure:

N-view Reconstruction by Bundle Adjustment

RANSAC suitable for two views, but 3D modelling may have tens or hundreds of photos; aerial mapping often has several thousand.

Hence, bundle adjustment is needed for accurate 3D reconstruction.

Reprojection error: the distance between where a 3D point $X$ is projected into an image and where it is measured to lie, $x$.

Re-projection error for a point: $\| \vec{x} - P\vec{X} \|$

Bundle adjustment finds the set of camera positions and 3D points which minimizes the total reprojection error.

That is, it finds the 3D points $\{ X_j : j = 1, \dots, M \}$ and cameras $\{ P_i = (R_i, \vec{t}_i) : i = 1, \dots, N \}$ such that

$$ C(\{ \vec{X}_j \}, \{ P_i \}) = \sum_i{\sum_{j\text{ appearing in view } i}{\| \vec{x}_{ij} - P_i \vec{X}_j \|^2 }} $$

is minimized.

After RANSAC is run on pairs of images, non-linear gradient descent is used to minimize the total reprojection error.

Errors in reconstruction will remain:

However, this can be mitigated by the use of additional information such as GPS, or domain knowledge - for example, buildings usually have vertical planes and right-angles.

Extensions:

BA finds a sparse structure, but if objects are assumed to be convex, a mesh can be formed. This compares to stereo, which returns a dense structure - distance value for each pixel.

11. Deep Learning

Dr. Oliver Batchelor

Neural networks, differentiable programming, applications to CV/image processing

History

Artificial neural networks:

Introduction

Neural networks:

What’s wrong with fully-connected neural networks?

Solutions:

Building blocks of CNNs:

And repeat until you get a single output or a few outputs.

Other methods also available:

Applications

Output types:

Visual recognition:

Image classification:

Semantic/dense segmentation:

Segmentation performance measures:

Object detection:

Keypoint recognition:

Image matching and correspondence:

Image features:

Applying models to new tasks:

12. Vineyard Project

Example Question

Describe and provide CV example for:

CNNs: what property of image matching CV algorithms enable self-supervised learning?

How would this work for stereo/optical flow?

Last question in the exam: briefly describe four of the following class projects, naming at least four algorithmic steps (with algorithm names). Do not select your own/similar projects.

If person does not list four or more algorithms, won’t be selected.

Project Tips

Abstract:

Background:

Proposed methods:

Results:

Conclusions:

References:

Real World Example: CV for a Grape Vine Pruning Robot

Approx. half the cost of vineyards is in pruning; it's hard to get enough workers, you can't prune in the rain, etc.

Pruning: remove old wood and most new canes during the winter.

NZ:

Large project: viticulture, robotics and AI experts, software + hardware engineers. ~5 years

~85% successful. Good enough for the government, but not good enough commercially.

Lighting:

Camera rig:

Main challenge was complexity and robustness.

Main challenges:

13. Image Representations

What is a good representation for image analysis?

We want an image representation that gives you a local description of image events - what and where, and naturally represent objects across varying scales.

Image pyramids: apply filters of fixed size to images of different sizes. Typically, edge length changes by a scale of 2 or the square root of 2.

There are many types of image pyramids:

Uses:

14. Tracking

Fiducial Markers

A fiducial marker is any planar object introduced into the scene, within the field of view of an imaging system, to be used as a point of reference or measure.

Can be used for:

How:

Challenges:

Markerless Tracking

Use ‘natural’ features for tracking: corners, edges, points etc.

Also: templates - basically something whose representation is stored in the system.

This is more difficult and usually much more computationally expensive.

Texture tracking:

Hybrid tracking: use gyroscope for prediction of camera orientation, and computer vision to correct gyroscope drift. Kalman filter?

Outdoor: lots of landmarks and planar features, but varying lighting conditions make it difficult.

15. Face Recognition

Early face detection:

Surveillance cameras usually high up, which isn’t ideal for most face recognition algorithms.

Surveillance-based tracking: tracking people for the entire time they are in an area.

Normalization:

Features:

Neural networks:

Static face recognition:

Video face recognition:

16. One-Minute Demos


Labs

Lab 01

Algorithms

Thresholding

Binarization of images depending on brightness.

OpenCV has a few types of thresholding.

Simple/basic thresholding uses a single, global threshold for binarization.

Adaptive thresholding uses the surrounding pixels (some square centered around the target pixel) to find an 'average' value, then subtracts a constant $c$ to calculate the threshold, which is compared to the pixel value. The 'average' is either the mean or a Gaussian-weighted sum.
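A sketch of both variants in OpenCV; the global threshold, block size and constant are illustrative:

import cv2

img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE)

# Global threshold: pixels above 127 become white
_, simple = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)

# Adaptive threshold: per-pixel threshold = Gaussian-weighted local mean (11x11 block) minus C
adaptive = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, blockSize=11, C=2)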

Morphology

Basic operators:

Misc.

Lab 02

Kalman filter:

Blob detector:

Lucas-Kanade Optical Flow:

https://docs.opencv.org/4.5.0/d4/dee/tutorial_optical_flow.html

Lab 03

Tesseract OCR:

Open3D: