Dr. Oliver Batchelor
Neural networks, differentiable programming, applications to CV/image processing
History
Artificial neural networks:
- Perceptron (1958)
- Linear classifier based on weighted inputs and thresholding
- Not differentiable
- Unable to learn the XOR function (not linearly separable)
- Backpropagation (1975) - applies the chain rule to compute gradients, enabling gradient descent on multi-layer networks
- Enables training of networks composed of multiple layers of differentiable functions
- Long training times
- Vanishing gradient problem - gradients for layers far from the output become very small, so their weights are barely updated (worst with sigmoid activations)
- Convolutional neural networks (CNN)
- Invented for classifying handwritten digits
- Image convolution is differentiable
- Convolutional layer - many different convolutions, each with a different filter
- Weights associated with the filters are updated
- 2009: ImageNet classification problem
- 1000 image classes, 1.4 million images
- AlexNet (2012): used GPUs for computation, ~64% accuracy
- Now ~90%
- Now:
- Neural nets are part of almost every state-of-the-art computer vision application
- 3D reconstruction: stereo, multi-view stereo, optical flow
- 2D/3D pose estimation
- Image generation, super resolution, style transform, texture synthesis
- Point cloud segmentation, object detection
Introduction
Neural networks:
- ‘Differentiable programming’ vs ‘deep learning’: emphasises more flexible use - a model is no longer just a fixed stack of layers
- Supervised machine learning
- Objective function with data $x_i$ and target label $y_i$: $L(\theta) = \sum_i \ell(f(x_i; \theta), y_i)$, where the model has parameters $\theta$, the function $f$ makes predictions, and the loss function $\ell$ gives the error of the prediction
- Minimize objective function
- Compute gradient of the loss with respect to parameters, $\nabla_\theta L(\theta)$, for a batch of examples
- Update the weights: $\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)$, where $\eta$ is the learning rate
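A minimal sketch of one such update step in PyTorch; the linear model, dummy batch and learning rate are assumptions for illustration.

```python
import torch

# Hypothetical model f(x; theta), loss l, and optimiser with learning rate eta.
model = torch.nn.Linear(16, 10)
loss_fn = torch.nn.CrossEntropyLoss()
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)

# One dummy batch of 8 examples with 16 features and integer class labels.
x, y = torch.randn(8, 16), torch.randint(0, 10, (8,))

prediction = model(x)              # forward pass: f(x; theta)
loss = loss_fn(prediction, y)      # objective L(theta) on this batch
optimiser.zero_grad()
loss.backward()                    # gradient of L with respect to theta (backpropagation)
optimiser.step()                   # theta <- theta - eta * gradient
```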
What’s wrong with fully-connected neural networks?
- Images are extremely high-dimensional
- If the pixels of an image are flattened into a vector, spatial information is lost and shifting the image by a single pixel can change the input drastically
- High dimensional inputs require a lot of data; large images impractical
- The ‘curse of dimensionality’ - data becomes sparse
Solutions:
- Inductive bias: set of assumptions used by a learner to predict results for inputs it has not yet encountered
- Generalizations
- Invariance and equivariance
- Invariance = shift inputs, outputs remain the same
- Equivariance = shift inputs, outputs shift in the same way
- Images can be transformed (translated, scaled, rotated etc.) without changing their content/class label
- Images can be transformed and their segmentation is transformed the same way
- Hence, use operations that exploit these properties
- Usually convolutions (CNNs) - translationally equivariant, multi-scale approaches
- Much less data required
- Domain specific - invariances/symmetries different between domains
Building blocks of CNNs:
- Convolution
- Non-linear stage e.g. rectified linear
- Pooling: reduce a small window of the image to a single value
And repeat until you get a single output or a few outputs.
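A minimal sketch of these stages stacked in PyTorch; the layer sizes and number of repeats are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Convolution -> non-linearity -> pooling, repeated, then reduced to a few outputs.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer: 16 learned filters
    nn.ReLU(),                                    # non-linear stage (rectified linear)
    nn.MaxPool2d(2),                              # pooling: halve the spatial resolution
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1),                      # pool down to a single spatial location
    nn.Flatten(),
    nn.Linear(32, 10),                            # a few outputs, e.g. 10 class scores
)

logits = cnn(torch.randn(1, 3, 64, 64))           # dummy RGB image -> (1, 10) scores
```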
Other methods also available:
- Attention methods (from NLP) - transformers
- Divide the image into patches and treat them like tokens (symbols)
- Attention forms associations between two sets of elements
- Not obviously ‘better’ than CNNs
Applications
Output types:
- Classification:
- Categorical variables
- Softmax output with cross entropy loss
- Softmax: exponentiate each component and normalise so the components sum to 1: $\mathrm{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}$
- Sigmoid with binary cross entropy loss
- Sigmoid: $\sigma(x) = 1 / (1 + e^{-x})$ (shaped like an S); see the short loss sketch after this list
- Regression
- Continuous variables (e.g. depth of a pixel)
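A short sketch of these output/loss pairings in PyTorch (shapes and class counts are assumed); note that `cross_entropy` and `binary_cross_entropy_with_logits` apply softmax/sigmoid to raw scores internally.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                    # 4 examples, 10 classes (raw scores)
labels = torch.randint(0, 10, (4,))            # categorical targets
probs = torch.softmax(logits, dim=1)           # exponentiate and normalise to sum to 1
ce = F.cross_entropy(logits, labels)           # softmax output with cross entropy loss

binary_logits = torch.randn(4)                 # one score per example
binary_labels = torch.randint(0, 2, (4,)).float()
p = torch.sigmoid(binary_logits)               # sigma(x) = 1 / (1 + exp(-x))
bce = F.binary_cross_entropy_with_logits(binary_logits, binary_labels)

pred_depth, true_depth = torch.rand(4, 1, 32, 32), torch.rand(4, 1, 32, 32)
mse = F.mse_loss(pred_depth, true_depth)       # regression: continuous per-pixel targets
```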
Visual recognition:
- Classification - whether the image belongs to a class
- Segmentation (dense/semantic) - classifying each pixel
- Object detection: classification + localization (bounding box)
- Instance segmentation: classification + localization + segmentation (precise mask)
- Keypoint detection (e.g. joints in hand)
- Panoptic segmentation: combines dense and instance segmentation - ‘stuff’ classes like ‘grass’ plus individual object instances
Image classification:
- Pre-trained models available: https://github.com/rwightman/pytorch-image-models
- Can be used as the basis of other tasks - train on ImageNet, then adapt the model by adding layers/connections; this is called using an image-classification ‘backbone’
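A small sketch of loading a pre-trained classifier from the repository linked above (which installs as the `timm` package); the choice of ResNet-50 and the 5-class example task are assumptions.

```python
import timm
import torch

# Pre-trained ImageNet classifier.
model = timm.create_model("resnet50", pretrained=True)
logits = model(torch.randn(1, 3, 224, 224))    # 1000 ImageNet class scores

# Reuse the same weights as a backbone: swap the classification head for a new 5-class task.
finetune_model = timm.create_model("resnet50", pretrained=True, num_classes=5)
```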
Semantic/dense segmentation:
- Per-pixel classification
- General scene understanding
- Works with irregularly-shaped objects
Segmentation performance measures:
- Percent correct
- Has issues with class imbalance
- Intersection over union (IOU):
- Precision = area of intersection of prediction and ground truth divided by area of prediction
- Recall = area of intersection of prediction and ground truth divided by area of ground truth
- IoU = area of intersection of prediction and ground truth, divided by union
- Problem: these measures are not differentiable, although differentiable substitutes are available (a sketch of the plain versions is shown below)
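A minimal sketch of these measures for boolean masks in NumPy, assuming the prediction and ground truth are same-shaped boolean arrays.

```python
import numpy as np

def segmentation_scores(pred: np.ndarray, truth: np.ndarray):
    """Precision, recall and IoU for boolean masks of the same shape."""
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    precision = intersection / pred.sum()   # intersection / predicted area
    recall = intersection / truth.sum()     # intersection / ground-truth area
    iou = intersection / union              # intersection / union
    return precision, recall, iou
```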
Object detection:
- Localization + classification
- Difficult as there are a variable number of outputs
- How to measure accuracy?
- Average precision: area under precision vs recall graph
- Built by adding predictions in order of confidence, highest to lowest
- mAP @ t: mean average precision across all classes of the dataset, using an IoU matching threshold of t (e.g. 0.5)
- COCO AP: average across mAP for thresholds [0.5, 0.55, …, 0.95]
- Usually output object locations (heatmaps) and classes
- Anchor boxes/prior boxes:
- A dense, overlapping set of candidate boxes; the network predicts whether an object is present at each
- Regression on box displacement and sizing on set of anchor boxes; resize and reduce to a single high-confidence box
- Non-maximum suppression (keep the highest-confidence box, discard overlapping lower-confidence ones; see the sketch after this list): does not work well if two real objects overlap
- Train by matching anchor boxes to ground truth boxes
- e.g. find boxes where IoU > 0.5 or some other threshold
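A rough sketch of greedy non-maximum suppression in NumPy; the [x1, y1, x2, y2] box format and the 0.5 IoU threshold are assumptions.

```python
import numpy as np

def box_iou(box, boxes):
    """IoU between one box and an array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    intersection = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return intersection / (area + areas - intersection)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it, repeat."""
    order = np.argsort(scores)[::-1]          # indices sorted by confidence, high to low
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        overlaps = box_iou(boxes[best], boxes[order[1:]])
        order = order[1:][overlaps < iou_threshold]
    return keep
```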
Keypoint recognition:
- Detecting landmarks (e.g. human skeleton/pose) - want single points rather than box
- Bottom up:
- Keypoint detector per part
- Match key-points to find full skeletons
- Top down:
- Detect full objects first
- Estimate key-points given object location
Image matching and correspondence:
- Dense rectified stereo matching (images row-aligned, so corresponding points lie on the same horizontal line), e.g. HSMNet, PSMNet
- Dense multi-view stereo
- Sparse image matching: keypoint detection, descriptor extraction
- Dense optical flow
- Estimate motion of pixels
- 2D correspondence search - movement not constrained to one axis
- RAFT
- 6DoF pose matching
- Determining object pose (e.g. orientation) from 3D model
- Key point detector (identifiable point)/descriptor (vector uniquely defining that point)
- Can’t realistically label correspondence data by hand - perhaps for a single image, but definitely not for video
- Self-supervised by warp data augmentation: given a known image warp, we know how the keypoints should be transformed (a rough sketch follows below)
- Direct matching e.g. LoFTR
- Detect, then extract descriptors: Superpoint, R2D2
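A rough sketch of the warp-consistency idea using OpenCV; `detect` is a hypothetical keypoint detector returning (N, 2) float32 locations. Methods such as Superpoint build their training signal from the same idea, with differentiable losses.

```python
import cv2
import numpy as np

def warp_consistency_error(image, detect, H):
    """Warp the image with a known homography H (3x3) and check that detected
    keypoints land where H says the original keypoints should go."""
    h, w = image.shape[:2]
    warped = cv2.warpPerspective(image, H, (w, h))

    pts = detect(image)                     # keypoints in the original image
    pts_warped = detect(warped)             # keypoints detected in the warped image

    # Where the original keypoints *should* appear after the warp - free supervision,
    # because we chose H ourselves.
    expected = cv2.perspectiveTransform(pts.reshape(-1, 1, 2), H).reshape(-1, 2)

    # Mean distance from each expected location to its nearest detection.
    dists = np.linalg.norm(expected[:, None, :] - pts_warped[None, :, :], axis=2)
    return dists.min(axis=1).mean()
```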
- Object tracking
- Multi-object tracking: MOT
- Track known types of objects e.g. people, seals
- Supervised learning
- Detect-and-track paradigm, e.g. SORT, ByteTrack (https://github.com/ifzhang/ByteTrack)
- Difficulties: occlusion, changing shape, fast movement
- Generic object tracking (GOT)
- One shot object tracking
- Generic: not trained on a specific type of object (but probably a wide variety of moving objects)
- Makes it easy to use, as long as the target is similar to something in the training set
- Track object from template in the first frame (bounding box)
- e.g. vision4robotics/SiamAPN, got-10k/toolkit
- Can help with labelling data
- Neural 3D reconstruction
- Not just for images, although this is a common case
- Input: set of calibrated images
- Structure from Motion, COLMAP
- Sparse point cloud through correspondence search: feature extraction -> matching -> geometric verification
- Dense multiview stereo e.g. PatchMatch to generate depth maps that can be filtered or fused
- Reconstruct 3D scenes using differentiable volumetric ray-tracing
- Inverse of computer graphics
- Synthesize new images of a scene from a different orientation
- Neural radiance fields: NeRF
- Represent 3D scene using a NN by mapping a 3D coordinate to a density and color
- Problem: neural networks are biased towards smooth functions and struggle to represent high-frequency detail and discontinuities
- Solution: encode the input coordinates as Fourier features (positional encoding); a short sketch is given after this list
- Problem: view-dependent effects (e.g. specular highlights) - handled by also conditioning colour on the viewing direction
- e.g. Google MipNeRF, nerf-pytorch
- instant-ngp: massively reduces the required time/processing power
- More recent NeRF methods: direct voxel grids instead of a pure coordinate-to-value mapping
- Neural hash tables: multi-resolution hash encodings whose collisions are not explicitly resolved - the multiple resolution levels combined with a tiny MLP absorb them
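A minimal sketch of the Fourier feature (positional) encoding of input coordinates; the number of frequency bands and the [-1, 1] input range are assumptions.

```python
import math
import torch

def fourier_encode(x: torch.Tensor, num_bands: int = 10) -> torch.Tensor:
    """Map (..., D) coordinates in [-1, 1] to sin/cos features at frequencies 2^k * pi.

    The high-frequency features let a small MLP fit sharp detail that it could not
    represent from the raw coordinates alone."""
    freqs = (2.0 ** torch.arange(num_bands).float()) * math.pi   # 2^k * pi
    angles = x[..., None] * freqs                                # (..., D, num_bands)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(-2)                                       # (..., D * 2 * num_bands)

features = fourier_encode(torch.rand(4, 3) * 2 - 1)              # 4 sample 3D points -> (4, 60)
```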
Image features:
- Description of a small region/point in the image
- Independent of attributes such as location, orientation and scale
- Handcrafted dense image features also exist, e.g. histogram of gradients (HoG), dense SIFT
- From NN:
- Intermediate activations from any type of NN
- Typically dense (for CNNs)
- Usually much lower resolution than input
- Vector length depends on layer
- Sometimes modified for matching e.g. normalization, reduced dimension with PCA
- Can arise from training a NN on an auxiliary task (e.g. image classification)
- Handcrafted vs learned:
- NN features usually perform better on tasks near the training domain
- Handcrafted features often generalize better
- Simpler classifiers with extracted features
- Fine tuning on frozen networks
- SVM (support vector machine), random forest, nearest neighbor etc.
- Depends on the quality of the feature extractor
- If the features come from a NN trained on an auxiliary task, they may not match the use case
- Extremely common use cases:
- Feature matching (e.g. nearest neighbor)
- Well tested, less likely to fall down when it encounters unexpected data (see the feature-matching sketch after this list)
- Image retrieval, face recognition (approximate nearest neighbor)
- Shortest path (e.g. skeletonization)
- Graph cut (e.g. stereo, segmentation)
- Homography estimation, Perspective-n-Point (PnP)
- Tracking
- Detect and track e.g. SORT, one-shot detection
- Kalman filter
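A short sketch of pulling dense features from an intermediate layer of a pre-trained CNN and matching them by nearest neighbour; the choice of ResNet-18 and cosine similarity are assumptions.

```python
import torch
import torchvision

# Pre-trained classifier truncated before the average pool and classifier,
# so it outputs a spatial feature map instead of class scores.
resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")
backbone = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()

@torch.no_grad()
def dense_features(image):                     # image: (3, H, W), already normalised
    fmap = backbone(image[None])[0]            # (C, H/32, W/32): much lower resolution
    feats = fmap.flatten(1).T                  # (N, C): one descriptor per location
    return torch.nn.functional.normalize(feats, dim=1)

# Nearest-neighbour matching between two images' descriptors (cosine similarity).
a = dense_features(torch.randn(3, 224, 224))
b = dense_features(torch.randn(3, 224, 224))
matches = (a @ b.T).argmax(dim=1)              # best match in b for each descriptor in a
```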
Applying models to new tasks:
- Find an off-the-shelf model
- Many are available; sometimes it is more useful to search for the dataset rather than the model
- Look for widely-used, actively-maintained ones
- Research-quality code is often not maintained after publication; it may need updating to work with current versions of frameworks
- Check for conda/pip packages before building from source
- May not generalize
- Check similarity of images to ones in paper or those provided as examples
- Neural networks can be weak to unexpected inputs or domain shifts
- Use NN features with non-learning algorithms
- Open world visual recognition models
- Detects objects in general
- Often trained by big organizations with excessive resources
- Often self-supervised learning or from commonly available associations (e.g. OpenAI CLIP uses image data and captions from websites) - may be low quality
- Few shot image classification: create a new class using a few exemplars
- Zero shot image classification: classify images based on textual descriptions
- Instance segmentation e.g. Detic (https://huggingface.co/spaces/akhaliq/Detic)
- Fine tuning
- Hand annotate a few different examples
- Interactive segmentation: using a NN to help produce training annotations
- DiscoBox: human draws bounding box, NN generates segmentation mask
- Human in the loop
- Active learning: algorithm lets human annotate the data it is most uncertain on
- Verification-based annotation
- Partially-trained model makes suggestion
- Human selects/edits the best suggestions
- Question answering:
- Weak annotation
- Humans answer yes/no questions; much faster than full manual annotation
- Take an existing model and fine-tune the model or classifier with a smaller set of annotations
- Catastrophic forgetting/interference: when a model is retrained on a new domain, it tends to lose what it learned from the previous data
- Use data augmentation judiciously
- Apply transformations to which the labels are invariant/equivariant; this enlarges the training set and forces the model to generalize (see the sketch after this list)
- Randomly perturb the data
- Geometric: translation, scale, rotation
- Photometric: brightness, contrast, hue
- Mixing: mix images (of different classes!) and labels
- Noise: gaussian, salt and pepper, synthetic rain, cutout
- e.g. albumentations, BBAug, Torchvision, Detectron2
- Considerations:
- Too much may be harmful
- Sometimes may be better to standardize scale/orientation in input data rather than augment it (e.g. people not usually walking upside down)
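A short sketch of a typical augmentation pipeline with Torchvision; the specific transforms and parameter ranges are arbitrary examples.

```python
import torchvision.transforms as T

# Random geometric and photometric perturbations applied on the fly during training.
train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.6, 1.0)),            # scale / translation
    T.RandomHorizontalFlip(),                               # geometric (only if labels are flip-invariant)
    T.RandomRotation(degrees=10),                           # small rotations only (see note above)
    T.ColorJitter(brightness=0.2, contrast=0.2, hue=0.05),  # photometric
    T.ToTensor(),
])
```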
- Synthetic data
- 3D rendering to generate your own training data
- Self-supervised learning: use known physical properties to provide supervision
- Correspondence tasks cannot be hand-labelled: disparity maps need precise per-pixel disparities, far beyond the accuracy needed for segmentation masks
- Pseudo ground-truths
- Use device for high accuracy capture (LiDAR, structured light)
- Use more information e.g. higher-resolution or more images
- Use 3D reconstruction to generate higher-quality images and train on these
- Today: this is how much stereo-matching training data tends to be produced; not as clean as synthetic data, but more robust
- Self-supervised learning
- Consistency (left/right, inter-frame) using image warping: minimize the photometric warping error in place of ground truth to train the model (a rough sketch follows below)
- Feature matching with tracking
- TODO
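A rough sketch of the left/right photometric-consistency idea for self-supervised stereo using PyTorch `grid_sample`; the disparity sign convention and the L1 error are assumptions.

```python
import torch
import torch.nn.functional as F

def photometric_loss(left, right, disparity):
    """Reconstruct the left image by sampling the right image at x - d, then compare.

    left, right: (B, 3, H, W) images; disparity: (B, 1, H, W) predicted left disparity
    in pixels. The images themselves supervise the model - no ground truth needed."""
    b, _, h, w = left.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs[None].float().to(left) - disparity[:, 0]        # shift sample position by disparity
    ys = ys[None].float().to(left).expand_as(xs)

    # grid_sample expects sampling coordinates normalised to [-1, 1].
    grid = torch.stack([2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1], dim=-1)
    reconstructed_left = F.grid_sample(right, grid, align_corners=True)

    return (reconstructed_left - left).abs().mean()         # L1 warping error
```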