All Files in ‘COSC428 (2022-S1)’ Merged

01. Introduction

Weighting:

Goal: to recognize objects and their motions.

Signal processing on 1D data; computer vision on 2D data.

Image processing on still images; computer vision on video and still images.

Difficulties

The sensory gap: gap between reality and what a recording of the scene can capture.

The semantic gap: lack of contextual knowledge about a scene.

The human visual system is very good:

Recovering 3D information

Several cues available:

Or actual depth hardware:

Labs: Intel Realsense D435.

Processing

The higher-level the processing, the less generic and the more domain-specific knowledge is required.

Low-Level Image Processing:

Approaches

02. Perception and Color

Color Physics

Simplified rendering model: illumination * reflectance (relative energy, 0 to 1) = color signal.

For project: use consistent light source to prevent color shifting

Human vision system: optimized for spectrum from the sun (greatest intensity at ~500 nm).

Reflectance spectra: reflectance (0 to 1) as a function of wavelength. Eyes (and cameras) simplify this signal by reducing it to intensities detected by three types of sensors; hence, objects with different spectral albedos may be perceived as having the same color (metamerism).

Spectral colors can be described as a spectrum with a single peak in a given wavelength range.

Color mixing:

Color Perception

The Human Eye

Specifications:

Physiology:

Brain and processing:

Color Spaces

Although perceived color depends on illumination and surroundings, assume for now that the spectrum of light arriving at the eye fully determines the perceived color.

03. Cameras and Lenses

Pinhole Cameras

Projection Equation

Let $f$ be the focal distance - the distance between the pinhole and the sensor. A point $(x, y, z)$ will be projected onto the sensor at:

$$ \begin{aligned} u &= f \frac{x}{z} \\ v &= f \frac{y}{z} \end{aligned} $$

and $z = f$. Although all values would be multiplied by $-1$ for a real camera, we can model the sensor plane as being in front of the pinhole. This projection can be represented as a matrix equation:

$$ \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} \sim \begin{pmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} $$
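A minimal numpy sketch of this projection (the focal length and point values are illustrative):

import numpy as np

f = 0.05  # focal distance (illustrative value)

# 3x4 projection matrix for the ideal pinhole model
M = np.array([[f, 0, 0, 0],
              [0, f, 0, 0],
              [0, 0, 1, 0]])

P = np.array([0.2, 0.1, 2.0, 1.0])  # homogeneous world point (x, y, z, 1)

uvw = M @ P              # homogeneous image coordinates
u, v = uvw[:2] / uvw[2]  # divide by z to get (f*x/z, f*y/z)
print(u, v)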

Lenses

Pinhole cameras must balance diffraction (aperture too small) and light ray convergence (aperture too large). Lenses allow far more light to pass through while still allowing light rays to converge on the same point on the sensor plane for light from a specific distance.

The most basic approximation of a lens is the thin lens, which assumes that the lens has zero thickness. More accurate models:

Camera Calibration

Used to determine relationship between image coordinates and real-world coordinates - geometric camera calibration.

Intrinsic Parameters

Improvements are needed to the matrix above to consider:

$$ \begin{aligned} u &= \alpha \frac{x}{z} - \alpha\cot(\theta)\frac{y}{z} + u_0 \\ v &= \frac{\beta}{\sin(\theta)} \frac{y}{z} + v_0 \end{aligned} $$

As a matrix:

$$ \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \frac{1}{z} \begin{pmatrix} \alpha & -\alpha\cot(\theta) & u_0 & 0 \\ 0 & \frac{\beta}{\sin(\theta)} & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} $$

In more compact notation: $$ \overrightarrow{p} = \frac{1}{z} \begin{pmatrix} K & \overrightarrow{0} \end{pmatrix} \overrightarrow{P} $$

Where $\overrightarrow{P}$ are the world coordinates and $\overrightarrow{p}$ are the pixel coordinates.

Then the extrinsic parameters (the translation and rotation of the camera frame) must be taken into account, further complicating things: $$ {}^C\overrightarrow{P} = {}^{C}_{W}R \, {}^{W}\overrightarrow{P} + {}^{C}\overrightarrow{O}_{W} $$ Combining the two: $$ \overrightarrow{p} = \frac{1}{z} K \begin{pmatrix} {}^{C}_{W}R & {}^{C}\overrightarrow{O}_{W} \end{pmatrix} \overrightarrow{P} = \frac{1}{z} M \overrightarrow{P} $$

$$ \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \frac{1}{z} \begin{pmatrix} m_1^T \\ m_2^T \\ m_3^T \end{pmatrix} \overrightarrow{P} $$

$1 = \frac{m_3 \cdot \overrightarrow{P}}{z}$ and hence $u = \frac{m_1 \cdot \overrightarrow{P}}{m_3 \cdot \overrightarrow{P}}$ and $v = \frac{m_2 \cdot \overrightarrow{P}}{m_3 \cdot \overrightarrow{P}}$

By using these equations on many features, we can find the value of $M$ that minimizes the error, and hence determine the intrinsic and extrinsic parameters.
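In practice this is usually done by imaging a known target. A sketch using OpenCV's chessboard calibration, assuming a hypothetical folder of chessboard images calib/*.png with a 9×6 inner-corner pattern:

import glob
import cv2
import numpy as np

pattern = (9, 6)  # inner corners of the chessboard (assumed)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob('calib/*.png'):  # hypothetical image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# K is the intrinsic matrix; rvecs/tvecs are the per-image extrinsics
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print(K)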

04. Filters

Modifying the pixels of an image based on a function taking input from pixels in the target pixel’s local neighborhood.

Linear Functions

When a filter is linear, the value of each pixel is a linear combination of its neighbors.

Where $I$ is the image, $g$ is the kernel, and $g[k, l]$ is the value of the kernel, where $g[0, 0]$ is the center element:

$$ f[m, n] = I \otimes g = \sum_{k, l}{I[m - k, n - l]\,g[k, l]} $$

That is, the dot product of its local neighborhood multiplied by a kernel. This process is called a convolution.

Convolution kernels can be used to make a Gaussian filter - a blur. Blurring/smoothing the image reduces noise - high-frequency information.

$\sigma$, the radius (standard deviation) of the Gaussian kernel, is called the scale.

Multiple linear functions can be stacked: in addition to convolution (multiplication by a kernel), addition and subtraction are also valid operations.

For example, to sharpen an image, you can multiply the pixel magnitudes (and hence both low- and high-frequency information) by 2, then subtract a blurred (i.e. low-pass filtered) version of the image, leaving an image with its edges over-accentuated. In pseudo-code: 2 * img(x, y, 1) - dot(gaussian(sigma), img(x, y, sigma)). This approximates the Laplacian of Gaussian filter (the Laplacian being the sum of the second partial derivatives).
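A sketch of this sharpening trick with OpenCV; the file name and sigma are illustrative:

import cv2

img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE)   # hypothetical input
blurred = cv2.GaussianBlur(img, (0, 0), sigmaX=3)     # low-pass filtered copy

# 2*img - blurred: the original plus its high-frequency detail
sharpened = cv2.addWeighted(img, 2.0, blurred, -1.0, 0)
cv2.imwrite('sharpened.png', sharpened)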

Gradients and Edges

An edge is a single point: a series of edge points is a line.

An edge is a point of sharp change (reflectance, object, illumination, noise) in an image.

The general strategy is to use linear filters to estimate the image gradient, then mark points where the change in magnitude is large in comparison to its neighbors.

Fourier Transforms

The Fourier transform of a real function is complex; in this course the phase component is ignored and we only care about the magnitude.

All natural images have a similar magnitude transform: running the inverse transform with the magnitude taken from another image returns similar results.

In the magnitude image generated from a Fourier transform, the center of the image corresponds to zero frequency while the edges of the image correspond to higher frequencies. Hence, by masking the image with a circle and then applying the inverse transform, either a low-pass (outside masked out) or high-pass (inside masked out) filter can be produced.

In the transform, the input image is tiled to infinity: this may cause discontinuities to occur at the edges. However, the effects of this can be mitigated by fading the image to grey near the edges (e.g. in a circle with a Gaussian).
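A sketch of frequency-domain filtering with numpy; the input file and cut-off radius are illustrative:

import cv2
import numpy as np

img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE).astype(np.float32)

F = np.fft.fftshift(np.fft.fft2(img))   # zero frequency moved to the centre

rows, cols = img.shape
y, x = np.ogrid[:rows, :cols]
radius = 30                             # cut-off radius (illustrative)
mask = (y - rows / 2) ** 2 + (x - cols / 2) ** 2 <= radius ** 2

low_pass = np.fft.ifft2(np.fft.ifftshift(F * mask)).real    # keep only the centre
high_pass = np.fft.ifft2(np.fft.ifftshift(F * ~mask)).real  # mask out the centre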

https://homepages.inf.ed.ac.uk/rbf/HIPR2/fourier.htm

05. Edge Detection

Challenge: convert a 2D image into a set of curves.

That is, find all edge points that are (mostly) connected, then join them into curves.

Edges come from:

Edge profiles:

    |-----    /\        ||
----|        /  \    ___||____
Step         Roof      Line

Edge detection:

Edges are the points at which the rate of change reaches a maximum - when the second derivative is zero.

$$ \nabla f = \left[\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right] $$

The gradient, which points in the direction of greatest change, can be represented by an angle and magnitude:

$$ \begin{aligned} \theta &= \tan^{-1}\left(\frac{\partial f / \partial y}{\partial f / \partial x}\right) \\ \Vert\nabla f\Vert &= \sqrt{\left(\frac{\partial f}{\partial x}\right)^2 + \left(\frac{\partial f}{\partial y}\right)^2} \end{aligned} $$

However, on a discrete image, approximations can be made.

The Sobel operator:

$$ \Delta_x = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix} \quad \Delta_y = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix} $$

(NB: need to scale by $\frac{1}{8}$ to get the right gradient value, but this is irrelevant for edge detection)

The Roberts cross operator (for diagonal edges):

$$ \Delta_1 = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} \quad \Delta_2 = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} $$

The 4×4 Prewitt operator:

$$ \Delta_x = \begin{pmatrix} -3 & -1 & 1 & 3 \\ -3 & -1 & 1 & 3 \\ -3 & -1 & 1 & 3 \\ -3 & -1 & 1 & 3 \end{pmatrix} \quad \Delta_y = \begin{pmatrix} 3 & 3 & 3 & 3 \\ 1 & 1 & 1 & 1 \\ -1 & -1 & -1 & -1 \\ -3 & -3 & -3 & -3 \end{pmatrix} $$

Can use trigonometry to determine direction of the gradient using the horizontal and vertical Sobel/Prewitt operators.
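A sketch of computing gradient magnitude and direction using OpenCV's Sobel operator (the file name is a placeholder):

import cv2

img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE)

gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)   # horizontal derivative
gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)   # vertical derivative

# Gradient magnitude and direction at every pixel
magnitude, angle = cv2.cartToPolar(gx, gy, angleInDegrees=True)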

Looking at only the adjacent pixel may be useless in situations with large amounts of noise. Hence, use a Gaussian kernel to smooth the image before applying the gradient operator. However, this can be done more efficiently by applying the derivative function to the Gaussian kernel, then applying it to the signal (the derivative theorem of convolution).

Then the point of maximum gradient, where the second derivative crosses zero, can be found to detect the edge.

The sum of the second partial derivatives is called the Laplacian.

Canny Edge Detection

An optimal edge detection algorithm should have:

Under the assumptions of a linear filter and independent and identically distributed Gaussian noise, the optimal detector is approximately the derivative of the Gaussian.

Detection and localization are diametrically opposed to each other: more smoothing leads to better detection but worse localization.

Canny edge detector steps:

A low value of $\sigma$, the radius of the Gaussian blur, detects fine features, while a large value detects large-scale edges.
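A minimal OpenCV sketch; the blur sigma and hysteresis thresholds are illustrative:

import cv2

img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(img, (5, 5), 1.4)                # sigma sets the scale
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)   # hysteresis thresholds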

Scale Space

As $\sigma$ (the scale of the image) gets larger:

Multiple representations of the image at different scales can be generated.

If an edge detection algorithm is applied to all the images, edges can be matched between images:

Edge Detection by Subtraction

Subtract a Gaussian-blurred image from the original, then scale and add an offset. This works because low-frequency information mostly remains in the blurred image and hence gets removed in the subtraction. This set of operations approximates the Laplacian of Gaussian.

Hough Line Detection

Finds straight lines in a binary image (e.g. the output of the Canny algorithm).

Uses a voting scheme instead of naively searching for lines at every single position and orientation.

The Hough space is a transform of the $x$-$y$ coordinate space to $r$-$\theta$ space where:

For each point (that is an edge pixel) and for every angle, find $r$ and increment the value in the Hough space by one - the point votes for every line it could be part of.

The points with the largest vote counts are the straight lines we are most confident in.

The Hough circle transform does the same, except using the $x$-$y$-$r$ space.

The same technique works for any curve that can be expressed in parametric form, although the parameter space can get huge.
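A sketch of the OpenCV Hough line and circle transforms; all thresholds, resolutions and radii are illustrative:

import cv2
import numpy as np

img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 50, 150)

# Each returned line is (r, theta); threshold is the minimum vote count
lines = cv2.HoughLines(edges, rho=1, theta=np.pi / 180, threshold=120)

# Circle transform votes in x-y-r space
circles = cv2.HoughCircles(img, cv2.HOUGH_GRADIENT, dp=1, minDist=30,
                           param1=150, param2=40, minRadius=5, maxRadius=80)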

OpenCV Tutorial:

Corners

Doors and corners, kid. That’s where they get you

Detective Miller

What is the gradient at the corner? Near the corner, edges have gradients going in two different directions and at the corner, the gradient is ill-defined. Hence, edge detectors tend to fail at corners.

Corners, however, are useful for tracking movement between frames.

Harris Corner Detection

Over a small window, a corner is likely to have high intensity variation in all directions. This uses the sum-squared difference.

Given:

For all pixels $(x, y)$ that are part of the window:

$$ E(\Delta x, \Delta y) = \sum_{x, y}{w(x, y) \left[I(x + \Delta x, y + \Delta y) - I(x, y)\right]^2} $$

That is, given two windows, calculate the difference between each pair of pixels, square them, and sum them.

Using the Taylor expansion, this can be approximated to:

$$ \begin{aligned} I(x + \Delta x, y + \Delta y) &= I(x, y) + \Delta x \frac{\partial I}{\partial x}(x, y) + \Delta y \frac{\partial I}{\partial y}(x, y) \\ &= I(x, y) + \Delta x I_x(x, y) + \Delta y I_y(x, y) \end{aligned} $$

Giving the approximation:

$$ \begin{aligned} E(\Delta x, \Delta y) &\approx \sum_{x, y}{\left[ \Delta x I_x(x, y) + \Delta y I_y(x, y) \right]^2} \\ &\approx \sum_{x, y}{\left( \Delta x^2 I_x(x, y)^2 + 2 \Delta x \Delta y I_x(x,y) I_y(x,y) + \Delta y^2 I_y(x, y)^2 \right)} \\ &\approx \begin{pmatrix} \Delta x & \Delta y \end{pmatrix} M \begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix} \end{aligned} $$

Where:

$$ M = \sum_{x, y} \begin{bmatrix} \left(I_x\right)^2 & I_x I_y \\ I_x I_y & \left(I_y\right)^2 \end{bmatrix} $$

Given $\lambda_1$ and $\lambda_2$ are the eigenvalues of $M$:

$$ \begin{aligned} R &= \det(M) - k\left(\text{trace}(M)\right)^2 \\ &= \lambda_1 \lambda_2 - k\left(\lambda_1 + \lambda_2 \right)^2 \end{aligned} $$

Corners have large values of $R$; edges occur where $R < 0$.
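A minimal sketch using OpenCV's Harris implementation; the window size, Sobel aperture, $k$ and response threshold are illustrative:

import cv2
import numpy as np

img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE)

# blockSize = summation window, ksize = Sobel aperture, k = trace weighting
R = cv2.cornerHarris(np.float32(img), blockSize=2, ksize=3, k=0.04)

corners = R > 0.01 * R.max()   # threshold on the corner response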

06. Local Features

Scale and rotation invariant descriptors.

Correspondence using window matching: matching individual points is highly ambiguous, so windows of surrounding pixels are compared instead.

Stereo Cameras

Baseline: distance between cameras. Wider baseline allows greater accuracy further away, while a smaller baseline allows overlap at closer distances. Increased camera resolution can increase depth resolution overall.

Rectification: transform of images onto a common image plane (i.e. the sensor, but inverted). The image planes of the two cameras must be parallel.

Correspondence using window matching:

Image normalization: variation in sensor gain/sensitivity means normalization is recommended.

Window magnitude: $$ \left\Vert I \right\Vert_{W_m(x, y)} = \sqrt{\sum_{(u, v) \in W_m(x, y)}{\left[I(u, v)\right]^2}} $$

Average:

$$ \bar{I} = \frac{1}{\vert W_m(x, y) \vert}\sum_{(u, v) \in W_m(x, y)}{I(u, v)} $$

Normalized pixel:

$$ \hat{I}_{(x, y)} = \frac{I(x, y) - \bar{I}}{\left\Vert I - \bar{I}\right\Vert_{W_m(x, y)}} $$

Vectorization: convert the window matrix into a vector by unwrapping it row by row (horizontal lines concatenated). Denote this as $\omega$.

Normalization scales the magnitude of the $m^2$-dimensional vector to unit length. Two metrics are possible for comparing two windows: distance and angle.

Distance ((normalized) sum of squared differences):

$$ C_\textrm{SSD}(d) = \Vert \omega_L - \omega_R(d) \Vert^2 $$

$\omega_R(d)$ is the window centered around $(x - d, y)$.

Normalized correlation:

$$ C_\textrm{NC}(d) = \omega_L \cdot \omega_R(d) = \cos(\theta) $$
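A sketch of window-based correspondence using OpenCV's block matcher on an already-rectified stereo pair (file names and parameters are placeholders):

import cv2

left = cv2.imread('left.png', cv2.IMREAD_GRAYSCALE)     # rectified left image
right = cv2.imread('right.png', cv2.IMREAD_GRAYSCALE)   # rectified right image

# numDisparities must be a multiple of 16; blockSize is the matching window size
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right)   # larger disparity = closer object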

Local Features

Aperture problem and normal flow: if you only have a partial view of a moving, one-dimensional object, you cannot always tell how it is moving (e.g. a moving line whose ends are outside the viewport).

Given velocities $u$ and $v$ and partial derivatives $I_x$, $I_y$ and $I_t$ for a given pixel, the brightness change constraint equation (BCCE), which states that brightness should stay constant for a given point over time, can be approximated (1st-order Taylor series) as:

$$ \begin{aligned} I_x u + I_y v + I_t &= 0 \\ \nabla I \cdot \vec{U} &= 0 \end{aligned} $$

Normal flow, the vector representing translation of the line in the direction of its normal, can be written as:

$$ u_\perp = - \frac{I_t}{\vert \nabla I \vert} \frac{\nabla I}{\vert \nabla I \vert} $$

By considering multiple moving points, the velocity $U$ can be found:

$$ \begin{aligned} \nabla I^1 \cdot U &= -I_t^1 \\ \nabla I^2 \cdot U &= -I_t^2 \\ &\dots \end{aligned} $$

Lucas-Kanade

Assumes the same velocity for all pixels within the window, and that pixel intensities do not change between frames.

https://docs.opencv.org/4.5.0/d4/dee/tutorial_optical_flow.html

$$ E(u, v) = \sum_{x, y \in \Omega}{ \left( I_x(x, y)u + I_y(x, y)v + I_t \right)^2 } $$

Solve using:

$$ \begin{bmatrix} \sum{I_x^2} & \sum{I_x I_y} \\ \sum{I_x I_y} & \sum{I_y^2} \end{bmatrix} \begin{pmatrix} u \\ v \end{pmatrix} = -\begin{pmatrix} \sum{I_x I_t} \\ \sum{I_y I_t} \end{pmatrix} $$

LHS: sum of outer product tensor of gradient vector

$$ \left( \sum{\nabla I \nabla I^T} \right) \vec{U} = -\sum{\nabla I \, I_t} $$
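A sketch of sparse Lucas-Kanade tracking in OpenCV (file names and parameters are placeholders):

import cv2

prev = cv2.imread('frame0.png', cv2.IMREAD_GRAYSCALE)   # hypothetical frame pair
curr = cv2.imread('frame1.png', cv2.IMREAD_GRAYSCALE)

# Pick 'good' corners to track (well-conditioned matrices)
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=100, qualityLevel=0.3, minDistance=7)

# Pyramidal Lucas-Kanade: winSize is the window over which the velocity is assumed constant
p1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None,
                                           winSize=(15, 15), maxLevel=2)
flow_vectors = p1[status == 1] - p0[status == 1]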

Good features:

The previous equation can be written as $\mathbf{A}\vec{u} = -\vec{b}$.

For this to be solvable:

Harris Detector

Using auto-correlation on ‘interesting’ points - where there are important differences in all directions.

For a point $(x, y)$ and shift $(\Delta x, \Delta y)$, the auto-correlation is:

$$ f(x, y) = \sum_{(x_k, y_k) \in W}{ \left( I(x_k, y_k) - I( x_k + \Delta x, y_k + \Delta y ) \right)^2 } $$

Avoiding discrete shifts:

$$ I(x_k + \Delta x, y_k + \Delta y) = I(x_k, y_k) + \begin{pmatrix} I_x(x_k, y_k) & I_y(x_k, y_k) \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix} $$

$$ f(x, y) = \sum_{(x_k, y_k) \in W}{ \left( \begin{pmatrix} I_x(x_k, y_k) & I_y(x_k, y_k) \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix} \right)^2 } $$

Auto-correlation matrix:

$$ = \begin{pmatrix} \Delta x & \Delta y \end{pmatrix} \begin{bmatrix} \sum_{(x_k, y_k) \in W}{\left(I_x(x_k, y_k)\right)^2} & \sum_{(x_k, y_k) \in W}{I_x(x_k, y_k) I_y(x_k, y_k)} \\ \sum_{(x_k, y_k) \in W}{I_x(x_k, y_k) I_y(x_k, y_k)} & \sum_{(x_k, y_k) \in W}{\left(I_y(x_k, y_k)\right)^2} \end{bmatrix} \begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix} $$

The matrix captures the structure of the local neighborhood. Interest can be measured using the matrix’s eigenvalues:

Interest point detection can be done using thresholding, or a local maximum for localization.

Feature distortion:

Invariant Local Features

Local features that are invariant to translation, rotation, scale, etc. They should have:

SIFT: scale-invariant feature transform.

Sift explanation
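A minimal usage sketch, assuming an OpenCV build with SIFT available (4.4+ or the contrib package); the file name is a placeholder:

import cv2

img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)  # one 128-D descriptor per keypoint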

07. Morphology

Structural processing of images

From ~1960s

Extracting quantitative descriptions of image components:

Pixels are either object or non-object pixels.

Structuring element: smaller matrix applied to the image

Binary erode:

Binary dilation:

Greyscale erode:

Greyscale dilate:

Distance transform:

Skeleton transform:

Convex hull:
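A sketch of several of these operators in OpenCV, assuming a binary input image and a 5×5 structuring element:

import cv2
import numpy as np

binary = cv2.imread('mask.png', cv2.IMREAD_GRAYSCALE)   # hypothetical binary image
kernel = np.ones((5, 5), np.uint8)                      # structuring element

eroded = cv2.erode(binary, kernel)                           # shrinks object regions
dilated = cv2.dilate(binary, kernel)                         # grows object regions
opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)    # erode then dilate
closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)   # dilate then erode

# Distance from every object pixel to the nearest non-object pixel
dist = cv2.distanceTransform(binary, cv2.DIST_L2, maskSize=5)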

08. Tracking

Kalman Filter

Combine noisy measurements with predictions of how the state changes to get better estimate of real state.

Tracking: inference over time.

Can simplify the problem by assuming linear dynamics and Gaussian noise. An unscented Kalman filter can deal with non-linear state transitions, but still assumes Gaussian noise.

Task: at each time point (and in real-time), re-compute the estimate of position.

Recursive estimation: decompose this into:

Minimal example - running average:

$$ A_t = \alpha A_{t - 1} + (1 - \alpha) y_t $$

Where $\alpha$ is the weight given to the previous estimate and $y_t$ is the $t$-th measurement.

This would be sensitive to noise/occlusion

Tracking:

Generalized model:

Issues:

Simplifying assumptions:

1D Kalman Filter

Assumes the new state can be obtained by multiplying the old state by a constant $d_i$ and adding noise:

$$ x_i \sim N(d_i x_{i - 1}, \theta_{d_i}^2) $$

In other words:

$$ \begin{aligned} \bar{X}^{-}_i &= d_i \bar{X}^{+}_{i - 1} \\ (\theta^{-}_i)^2 &= \theta_{d_i}^2 + (d_i \theta^{+}_{i - 1})^2 \end{aligned} $$

TODO: what is $\theta_{d_i}$? Why not just the second term? (Presumably $\theta_{d_i}$ is the standard deviation of the process noise added by the state transition itself, so the prediction stays uncertain even if the previous estimate were exact.)

TODO: what are $\theta^{+}$ and $\theta^{-}$? (Presumably $\theta^{-}_i$ is the predicted (prior) standard deviation before the measurement arrives, and $\theta^{+}_i$ is the corrected (posterior) standard deviation after the update.)

Once a measurement arrives, this can be corrected:

$$ \begin{aligned} x_i^+ &= \frac{\bar{x}_i^{-} \theta^2_{m_i} + m_i y_i (\theta^{-}_i)^2}{\theta^2_{m_i} + m^2_i (\theta^{-}_i)^2} \\ \theta_i^{+} &= \sqrt{\frac{\theta^2_{m_i}(\theta^{-}_i)^2}{\theta^2_{m_i} + m^2_i(\theta_i^{-})^2}} \end{aligned} $$

Note: $\theta$ does not depend on $y$.

Smoothing: if not running the filter in real time, you can run the algorithm forwards and backwards and take the mean of the two predictions.

Kalman in Python

g-h filter:

def g_h_filter(data, x0, dx, g, h, dt=1.):
  """
  Performs a g-h filter on 1 state variable with a fixed g and h.

  'data' contains the data to be filtered.
  'x0' is the initial value for our state variable.
  'dx' is the initial change rate for our state variable (assumes a linear rate of change).
  'g' is the g-h's g scale factor: g * 100% of the correction comes from the measurement. Should be high for less noisy measurements.
  'h' is the g-h's h scale factor: a larger h responds quicker to change, but is more vulnerable to noise/outliers.
  """
  x_estimate = x0
  results = []
  for x_measurement in data:
    # prediction step: assume the rate of change dx stays constant
    x_prediction = x_estimate + (dx * dt)

    # update step
    residual = x_measurement - x_prediction  # delta between measurement and prediction

    # update the rate of change using the residual;
    # h determines how quickly the rate of change changes
    dx = dx + h * residual / dt

    # update the estimate to be a weighted average of prediction and measurement;
    # g determines the weight given to the measurement
    x_estimate = x_prediction + g * residual
    results.append(x_estimate)
  return results

Example: system where position is being measured and the object has constant velocity

The distance and velocity can be represented as Gaussian distributions:

$$ \begin{aligned} \bar{d} &= \mu_d + \mu_v \\ \bar{\theta}^2 &= \theta_d^2 + \theta_v^2 \end{aligned} $$

Sum of two Gaussians:

$$ \begin{aligned} \mu &= \mu_1 + \mu_2 \\ \sigma^2 &= \sigma_1^2 + \sigma_2^2 \end{aligned} $$

Hence, the prediction can be represented as the sum of the distributions of the previous position and predicted velocity.

Product of two Gaussians:

$$ \begin{aligned} \mu &= \frac{\sigma_1^2 \mu_2 + \sigma_2^2 \mu_1}{\sigma_1^2 + \sigma_2^2} \\ \sigma^2 &= \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2+\sigma_2^2} \end{aligned} $$

The update step returns the estimated position as the product of the distributions of the new measurement and current estimated position.
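A minimal sketch of this predict (sum of Gaussians) / update (product of Gaussians) cycle for the 1D constant-velocity example; all numbers are illustrative:

def predict(pos, pos_var, vel, vel_var):
    # Sum of two Gaussians: means add, variances add
    return pos + vel, pos_var + vel_var

def update(pos, pos_var, meas, meas_var):
    # Product of two Gaussians: precision-weighted mean, smaller variance
    new_pos = (pos_var * meas + meas_var * pos) / (pos_var + meas_var)
    new_var = (pos_var * meas_var) / (pos_var + meas_var)
    return new_pos, new_var

pos, var = 0.0, 500.0            # very uncertain initial position
for z in [1.1, 2.0, 2.9, 4.2]:   # noisy position measurements
    pos, var = predict(pos, var, vel=1.0, vel_var=0.05)
    pos, var = update(pos, var, meas=z, meas_var=0.5)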

Particle Filter

The particle filter allows multiple positions to be predicted, and works with multi-modal and non-Gaussian distributions.

Three probability distributions:

The particle filter processes this into a single probability distribution, the state density $(x_t \mid Z_t)$.

Comparisons:

Kalman in Python

Algorithm:

09. Introduction to Deep Learning

Types:

Deep learning:

Neural networks:

Activation functions: non-linearities needed to learn complex (non-linear) representations of data. More layers and neurons can approximate more complex functions.

Overfitting: when the model fails to generalize outside the training set

10. 3D Reconstruction using Computer Vision

Reconstructing 3D structure and camera positions from a set of images.

Many applications:

Most important algorithms: RANSAC and bundle adjustment.

Also known as structure from motion (SfM) or photogrammetry.

SLAM (simultaneous localization and mapping) is usually real-time while SfM is offline. Closing the loop: once you recognize you have visited a position previously, you need to back-propagate changes to the model which may have drifted.

Summary:

Background

Camera Calibration

Camera calibration: map pixel coordinates to normalized image coordinates: correct factors such as lens distortion, focal length, image center etc.

Feature Matching

Process of choosing point features that appear in two adjacent images.

Features are usually point features found using corner detectors.

Corner features should be:

cv::goodFeaturesToTrack can be used to find Harris corners.

Representations of appearance: ‘feature descriptors’ e.g. image patches, SIFT, SURF

May get incorrect matches: objects which look the same but are different. “gross outliers” - outliers where location error is much higher (orders of magnitude higher) than expected.

For feature registration, algorithms should be robust against:

Homogeneous Coordinates

Homogeneous coordinates: add an extra dimension (e.g. 3D points become 4D points) with $w = 1$.

This allows matrix multiplication to be used to represent rotation, translation, and represent projection by normalization.

Transform of a point $\vec{X}$ with rotation $R$ and translation $\vec{t}$:

$$ T = \begin{bmatrix} R & \vec{t} \\ \vec{0} & 1 \end{bmatrix} $$

Then multiply by $\begin{bmatrix} \vec{X} \\ 1 \end{bmatrix}$.

In homogeneous coordinates:

2-view Reconstruction

Recover rotation, translation and 3D structure given two images.

Planar scenes: Homography Estimation

e.g. aerial images, AR apps. All points are on a single plane, making things easier - one less dimension to worry about and no occlusion.

There are two views of a planar scene and three parameters: rotation, translation, and the plane normal.

A 3x3 matrix $H$ can represent the relative pose of the camera and the plane normal.

Inlier match $\vec{x} \rightarrow \vec{x}'$: $H\vec{x} = \vec{x}'$ when in normalized homogeneous coordinates, where $\vec{x}$ is the 3D/4D point and $\vec{x}'$ is the projection of the point on the camera sensor.

$$ H = R + \frac{\vec{t}\,\vec{n}^T}{d} $$

Where:

Note that there is scale ambiguity in the above formula: we only recover the ratio $\vec{t}/d$, not the actual length of $\vec{t}$ or the plane distance $d$.

$H\vec{x} = \vec{x}'$ can be used to estimate $H$ when you have at least four inlier feature matches. The function cv::findHomography implements this algorithm.

As this only works on inliers, and you don't know which points are inliers without $H$, RANSAC is used to simultaneously estimate $H$ and filter out outliers.

RANSAC

Random sample consensus. A general-purpose framework for fitting a model to data which has gross (very large) outliers.

Steps:

In the case of planar scenes, 4 feature matches.
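A sketch of RANSAC-based homography estimation from matched keypoints in OpenCV; the detector, matcher and threshold choices are illustrative:

import cv2
import numpy as np

img1 = cv2.imread('view1.png', cv2.IMREAD_GRAYSCALE)   # hypothetical image pair
img2 = cv2.imread('view2.png', cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
matches = cv2.BFMatcher().match(des1, des2)

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# RANSAC repeatedly fits H to 4 random matches and keeps the largest inlier set
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=3.0)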

3D Scenes: Essential Matrix Estimation

A 3x3 matrix that represents the relative pose (rotation $R$, translation $\vec{t}$) of two cameras:

$$ \begin{aligned} E &= [t]_x R \\ [t]_x &= \begin{bmatrix} 0 & -t_3 & t_2 \\ t_3 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{bmatrix} \end{aligned} $$

For an inlier match, $\vec{x}'^T E \vec{x} = 0$.

RANSAC is used to compute $E$ from five feature matches.
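A sketch of the corresponding OpenCV calls on a hypothetical image pair, with an illustrative intrinsic matrix $K$:

import cv2
import numpy as np

img1 = cv2.imread('view1.png', cv2.IMREAD_GRAYSCALE)   # hypothetical image pair
img2 = cv2.imread('view2.png', cv2.IMREAD_GRAYSCALE)
K = np.array([[700., 0., 320.],                        # intrinsics from calibration (illustrative)
              [0., 700., 240.],
              [0., 0., 1.]])

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
matches = cv2.BFMatcher().match(des1, des2)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)

# Decompose E into the relative rotation R and unit-length translation t
n_inliers, R, t, pose_mask = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)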

Conversion to a 3D structure:

N-view Reconstruction by Bundle Adjustment

RANSAC suitable for two views, but 3D modelling may have tens or hundreds of photos; aerial mapping often has several thousand.

Hence, bundle adjustment is needed for accurate 3D reconstruction.

Reprojection error: the distance between where a 3D point $X$ is projected into an image and where it is measured to lie, $x$.

Re-projection error for a point: $\| \vec{x} - P\vec{X} \|$

Bundle adjustment finds the set of camera positions and 3D points which minimizes the total reprojection error.

That is, it finds the 3D points $\{ X_j : j = 1, \dots, M \}$ and cameras $\{ P_i = (R_i, \vec{t}_i) : i = 1, \dots, N \}$ such that

$$ C(\{ \vec{X}_j \}, \{ P_i \}) = \sum_i{\sum_{j\text{ appearing in view } i}{\| \vec{x}_{ij} - P_i \vec{X}_j \|^2 }} $$

is minimized.

After RANSAC is run on pairs of images, non-linear gradient descent is used to minimize the total reprojection error.

Errors in reconstruction will remain:

However, this can be mitigated by the use of additional information such as GPS, or domain knowledge - for example, buildings usually have vertical planes and right-angles.

Extensions:

BA finds a sparse structure, but if objects are assumed to be convex, a mesh can be formed. This compares to stereo, which returns a dense structure - distance value for each pixel.

11. Deep Learning

Dr. Oliver Batchelor

Neural networks, differentiable programming, applications to CV/image processing

History

Artificial neural networks:

Introduction

Neural networks:

What’s wrong with fully-connected neural networks?

Solutions:

Building blocks of CNNs:

And repeat until you get a single output or a few outputs.

Other methods also available:

Applications

Output types:

Visual recognition:

Image classification:

Semantic/dense segmentation:

Segmentation performance measures:

Object detection:

Keypoint recognition:

Image matching and correspondence:

Image features:

Applying models to new tasks:

12. Vineyard Project

Example Question

Describe and provide CV example for:

CNNs: what property of image matching CV algorithms enable self-supervised learning?

How would this work for stereo/optical flow?

Last question in the exam: briefly describe four of the following class projects, naming at least four algorithmic steps (with algorithm names). Do not select your own/similar projects.

If person does not list four or more algorithms, won’t be selected.

Project Tips

Abstract:

Background:

Proposed methods:

Results:

Conclusions:

References:

Real World Example: CV for a Grape Vine Pruning Robot

Approx. half the cost of vineyards is in pruning; it's hard to get enough workers, you can't prune in the rain, etc.

Pruning: remove old wood and most new canes during the winter.

NZ:

Large project: viticulture, robotics and AI experts, software + hardware engineers. ~5 years

~85% successful. Good enough for the government, but not good enough commercially.

Lighting:

Camera rig:

Main challenge was complexity and robustness.

Main challenges:

13. Image Representations

What is a good representation for image analysis?

We want an image representation that gives you a local description of image events - what and where, and naturally represent objects across varying scales.

Image pyramids: apply filters of fixed size to images of different sizes. Typically, edge length changes by a scale of 2 or the square root of 2.

There are many types of image pyramids:

Uses:

14. Tracking

Fiducial Markers

A fiducial marker is any planar object introduced into the scene, within the field of view of an imaging system, to be used as a point of reference or measure.

Can be used for:

How:

Challenges:

Markerless Tracking

Use ‘natural’ features for tracking: corners, edges, points etc.

Also: templates - basically something whose representation is stored in the system.

This is more difficult and usually much more computationally expensive.

Texture tracking:

Hybrid tracking: use gyroscope for prediction of camera orientation, and computer vision to correct gyroscope drift. Kalman filter?

Outdoor: lots of landmarks and planar features, but varying lighting conditions make it difficult.

15. Face Recognition

Early face detection:

Surveillance cameras usually high up, which isn’t ideal for most face recognition algorithms.

Surveillance-based tracking: tracking people for the entire time they are in an area.

Normalization:

Features:

Neural networks:

Static face recognition:

Video face recognition:

16. One-Minute Demos


Labs

Lab 01

Algorithms

Thresholding

Binarization of images depending on brightness.

OpenCV has a few types of thresholding.

Simple/basic thresholding uses a single, global threshold for binarization.

Adaptive thresholding uses the surrounding pixels (some square centered around the target pixel) to find an 'average' value, then subtracts a constant $c$ to calculate the threshold, which is compared to the pixel value. The 'average' is either the mean or a Gaussian-weighted sum.
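A sketch of both variants in OpenCV; the global threshold, block size and constant are illustrative:

import cv2

img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE)

# Global threshold: pixels above 127 become white
_, simple = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)

# Adaptive threshold: per-pixel threshold = Gaussian-weighted local mean (11x11 block) minus C
adaptive = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, blockSize=11, C=2)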

Morphology

Basic operators:

Misc.

Lab 02

Kalman filter:

Blob detector:

Lucas-Kanade Optical Flow:

https://docs.opencv.org/4.5.0/d4/dee/tutorial_optical_flow.html

Lab 03

Tesseract OCR:

Open3D: