Dr. Oliver Batchelor
Neural networks, differentiable programming, applications to CV/image processing
History
Artificial neural networks:
- Perceptron (1958)
- Linear classifier based on weighted inputs and thresholding
- Not differentiable
- Unable to learn the XOR function (not linearly separable)
- Backpropagation (1975) - applies the chain rule to compute gradients, enabling gradient descent on multi-layer networks
- Enables training of networks composed of multiple layers of differentiable functions
- Long training times
- Vanishing gradient problem - gradients for layers far from the output become very small, so their weights are barely updated (worst with sigmoid activations)
- Convolutional neural networks (CNN)
- Invented for classifying handwritten digits
- Image convolution is differentiable
- Convolutional layer - many different convolutions, each with a different filter
- Weights associated with the filters are updated
- 2009: ImageNet classification problem
- 1000 image classes, 1.4 million images
- AlexNet (2012): used GPUs for computation, ~64% accuracy
- Now ~90%
- Now:
- Neural nets are part of almost every state-of-the-art computer vision application
- 3D reconstruction: stereo, multi-view stereo, optical flow
- 2D/3D pose estimation
- Image generation, super resolution, style transform, texture synthesis
- Point cloud segmentation, object detection
Introduction
Neural networks:
- ‘Differentiable programming’ vs ‘deep learning’: emphasises more flexible use - a model is no longer just a fixed stack of layers
- Supervised machine learning
- Objective function with data $x_i$ and target label $y_i$: $L(\theta) = \sum_i \ell(f(x_i; \theta), y_i)$, where the model has parameters $\theta$, the function $f$ makes predictions, and the loss function $\ell$ gives the error of the prediction
- Minimize objective function
- Compute gradient of the loss with respect to parameters, $\nabla_\theta L(\theta)$, for a batch of examples
- Update the weights: $\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)$, where $\eta$ is the learning rate
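A minimal sketch of one such update step in PyTorch; the linear model, dummy batch and learning rate are assumptions for illustration.

```python
import torch

# Hypothetical model f(x; theta), loss l, and optimiser with learning rate eta.
model = torch.nn.Linear(16, 10)
loss_fn = torch.nn.CrossEntropyLoss()
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)

# One dummy batch of 8 examples with 16 features and integer class labels.
x, y = torch.randn(8, 16), torch.randint(0, 10, (8,))

prediction = model(x)              # forward pass: f(x; theta)
loss = loss_fn(prediction, y)      # objective L(theta) on this batch
optimiser.zero_grad()
loss.backward()                    # gradient of L with respect to theta (backpropagation)
optimiser.step()                   # theta <- theta - eta * gradient
```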
What’s wrong with fully-connected neural networks?
- Images are extremely high-dimensional
- If the pixels of an image are flattened into a vector, spatial information is lost and shifting the image by a single pixel can change the input drastically
- High dimensional inputs require a lot of data; large images impractical
- The ‘curse of dimensionality’ - data becomes sparse
Solutions:
- Inductive bias: set of assumptions used by a learner to predict results for inputs it has not yet encountered
- Generalizations
- Invariance and equivariance
- Invariance = shift inputs, outputs remain the same
- Equivariance = shift inputs, outputs shift in the same way
- Images can be transformed (translated, scaled, rotated etc.) without changing their content/class label
- Images can be transformed and their segmentation is transformed the same way
- Hence, use operations that exploit these properties
- Usually convolutions (CNNs) - translationally equivariant, multi-scale approaches
- Much less data required
- Domain specific - invariances/symmetries different between domains
Building blocks of CNNs:
- Convolution
- Non-linear stage e.g. rectified linear
- Pooling: reduce a small window of the image to a single value
And repeat until you get a single output or a few outputs.
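A minimal sketch of these stages stacked in PyTorch; the layer sizes and number of repeats are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Convolution -> non-linearity -> pooling, repeated, then reduced to a few outputs.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer: 16 learned filters
    nn.ReLU(),                                    # non-linear stage (rectified linear)
    nn.MaxPool2d(2),                              # pooling: halve the spatial resolution
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1),                      # pool down to a single spatial location
    nn.Flatten(),
    nn.Linear(32, 10),                            # a few outputs, e.g. 10 class scores
)

logits = cnn(torch.randn(1, 3, 64, 64))           # dummy RGB image -> (1, 10) scores
```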
Other methods also available:
- Attention methods (from NLP) - transformers
- Divide the image into patches and treat them like tokens (symbols)
- Attention forms associations between two sets of elements
- Not obviously ‘better’ than CNNs
Applications
Output types:
- Classification:
- Categorical variables
- Softmax output with cross entropy loss
- Softmax: exponentiate each component and normalise so the components sum to 1: $\mathrm{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}$
- Sigmoid with binary cross entropy loss
- Sigmoid: $\sigma(x) = 1 / (1 + e^{-x})$ (shaped like an S); see the short loss sketch after this list
- Regression
- Continuous variables (e.g. depth of a pixel)
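A short sketch of these output/loss pairings in PyTorch (shapes and class counts are assumed); note that `cross_entropy` and `binary_cross_entropy_with_logits` apply softmax/sigmoid to raw scores internally.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                    # 4 examples, 10 classes (raw scores)
labels = torch.randint(0, 10, (4,))            # categorical targets
probs = torch.softmax(logits, dim=1)           # exponentiate and normalise to sum to 1
ce = F.cross_entropy(logits, labels)           # softmax output with cross entropy loss

binary_logits = torch.randn(4)                 # one score per example
binary_labels = torch.randint(0, 2, (4,)).float()
p = torch.sigmoid(binary_logits)               # sigma(x) = 1 / (1 + exp(-x))
bce = F.binary_cross_entropy_with_logits(binary_logits, binary_labels)

pred_depth, true_depth = torch.rand(4, 1, 32, 32), torch.rand(4, 1, 32, 32)
mse = F.mse_loss(pred_depth, true_depth)       # regression: continuous per-pixel targets
```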
Visual recognition:
- Classification - whether the image belongs to a class
- Segmentation (dense/semantic) - classifying each pixel
- Object detection: classification + localization (bounding box)
- Instance segmentation: classification + localization + segmentation (precise mask)
- Keypoint detection (e.g. joints in hand)
- Panoptic segmentation: combines dense and instance segmentation - ‘stuff’ classes like ‘grass’ plus individual object instances
Image classification:
- Pre-trained models available: https://github.com/rwightman/pytorch-image-models
- Can be used as the basis of other tasks - train on ImageNet, then adapt the model by adding layers/connections; this is called using an image-classification ‘backbone’
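A small sketch of loading a pre-trained classifier from the repository linked above (which installs as the `timm` package); the choice of ResNet-50 and the 5-class example task are assumptions.

```python
import timm
import torch

# Pre-trained ImageNet classifier.
model = timm.create_model("resnet50", pretrained=True)
logits = model(torch.randn(1, 3, 224, 224))    # 1000 ImageNet class scores

# Reuse the same weights as a backbone: swap the classification head for a new 5-class task.
finetune_model = timm.create_model("resnet50", pretrained=True, num_classes=5)
```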
Semantic/dense segmentation:
- Per-pixel classification
- General scene understanding
- Works with irregularly-shaped objects
Segmentation performance measures:
- Percent correct
- Has issues with class imbalance
- Intersection over union (IOU):
- Precision = area of intersection of prediction and ground truth divided by area of prediction
- Recall = area of intersection of prediction and ground truth divided by area of ground truth
- IoU = area of intersection of prediction and ground truth, divided by union
- Problem: these measures are not differentiable, although differentiable substitutes are available (a sketch of the plain versions is shown below)
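A minimal sketch of these measures for boolean masks in NumPy, assuming the prediction and ground truth are same-shaped boolean arrays.

```python
import numpy as np

def segmentation_scores(pred: np.ndarray, truth: np.ndarray):
    """Precision, recall and IoU for boolean masks of the same shape."""
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    precision = intersection / pred.sum()   # intersection / predicted area
    recall = intersection / truth.sum()     # intersection / ground-truth area
    iou = intersection / union              # intersection / union
    return precision, recall, iou
```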
Object detection:
- Localization + classification
- Difficult as there are a variable number of outputs
- How to measure accuracy?
- Average precision: area under precision vs recall graph
- Built by adding predictions in order of confidence, highest to lowest
- mAP @ t: mean average precision across all classes of the dataset, using an IoU matching threshold of t (e.g. 0.5)
- COCO AP: average across mAP for thresholds [0.5, 0.55, …, 0.95]
- Usually output object locations (heatmaps) and classes
- Anchor boxes/prior boxes:
- A dense, overlapping set of candidate boxes; the network predicts whether an object is present at each
- Regression on box displacement and sizing on set of anchor boxes; resize and reduce to a single high-confidence box
- Non-maximum suppression (keep the highest-confidence box, discard overlapping lower-confidence ones; see the sketch after this list): does not work well if two real objects overlap
- Train by matching anchor boxes to ground truth boxes
- e.g. find boxes where IoU > 0.5 or some other threshold
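A rough sketch of greedy non-maximum suppression in NumPy; the [x1, y1, x2, y2] box format and the 0.5 IoU threshold are assumptions.

```python
import numpy as np

def box_iou(box, boxes):
    """IoU between one box and an array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    intersection = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return intersection / (area + areas - intersection)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it, repeat."""
    order = np.argsort(scores)[::-1]          # indices sorted by confidence, high to low
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        overlaps = box_iou(boxes[best], boxes[order[1:]])
        order = order[1:][overlaps < iou_threshold]
    return keep
```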
Keypoint recognition:
- Detecting landmarks (e.g. human skeleton/pose) - want single points rather than box
- Bottom up:
- Keypoint detector per part
- Match key-points to find full skeletons
- Top down:
- Detect full objects first
- Estimate key-points given object location
Image matching and correspondence:
- Dense rectified stereo matching (images row-aligned, so corresponding points lie on the same horizontal line), e.g. HSMNet, PSMNet
- Dense multi-view stereo
- Sparse image matching: keypoint detection, descriptor extraction
- Dense optical flow
- Estimate motion of pixels
- 2D correspondence search - movement not constrained to one axis
- RAFT
- 6DoF pose matching
- Determining object pose (e.g. orientation) from 3D model
- Key point detector (identifiable point)/descriptor (vector uniquely defining that point)
- Can’t realistically label correspondence data by hand - perhaps for a single image, but definitely not for video
- Self-supervised by warp data augmentation: given a known image warp, we know how the keypoints should be transformed (a rough sketch follows below)
- Direct matching e.g. LoFTR
- Detect, then extract descriptors: Superpoint, R2D2
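A rough sketch of the warp-consistency idea using OpenCV; `detect` is a hypothetical keypoint detector returning (N, 2) float32 locations. Methods such as Superpoint build their training signal from the same idea, with differentiable losses.

```python
import cv2
import numpy as np

def warp_consistency_error(image, detect, H):
    """Warp the image with a known homography H (3x3) and check that detected
    keypoints land where H says the original keypoints should go."""
    h, w = image.shape[:2]
    warped = cv2.warpPerspective(image, H, (w, h))

    pts = detect(image)                     # keypoints in the original image
    pts_warped = detect(warped)             # keypoints detected in the warped image

    # Where the original keypoints *should* appear after the warp - free supervision,
    # because we chose H ourselves.
    expected = cv2.perspectiveTransform(pts.reshape(-1, 1, 2), H).reshape(-1, 2)

    # Mean distance from each expected location to its nearest detection.
    dists = np.linalg.norm(expected[:, None, :] - pts_warped[None, :, :], axis=2)
    return dists.min(axis=1).mean()
```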
- Object tracking
- Multi-object tracking: MOT
- Track known types of objects e.g. people, seals
- Supervised learning
- Detect-and-track paradigm, e.g. SORT, ByteTrack (https://github.com/ifzhang/ByteTrack)
- Difficulties: occlusion, changing shape, fast movement
- Generic object tracking (GOT)
- One shot object tracking
- Generic: not trained on a specific type of object (but probably a wide variety of moving objects)
- Makes it easy to use, as long as the target is similar to something in the training set
- Track object from template in the first frame (bounding box)
- e.g. vision4robotics/SiamAPN, got-10k/toolkit
- Can help with labelling data
- Neural 3D reconstruction
- Not just for images, although this is a common case
- Input: set of calibrated images
- Structure from Motion, COLMAP
- Sparse point cloud through correspondence search: feature extraction -> matching -> geometric verification
- Dense multiview stereo e.g. PatchMatch to generate depth maps that can be filtered or fused
- Reconstruct 3D scenes using differentiable volumetric ray-tracing
- Inverse of computer graphics
- Synthesize new images of a scene from a different orientation
- Neural radiance fields: NeRF
- Represent 3D scene using a NN by mapping a 3D coordinate to a density and color
- Problem: neural networks are biased towards smooth functions and struggle to represent high-frequency detail and discontinuities
- Solution: encode the input coordinates as Fourier features (positional encoding); a short sketch is given after this list
- Problem: view-dependent effects (e.g. specular highlights) - handled by also conditioning colour on the viewing direction
- e.g. Google MipNeRF, nerf-pytorch
- instant-ngp: massively reduces the required time/processing power
- More recent NeRF methods: direct voxel grids instead of a pure coordinate-to-value mapping
- Neural hash tables: multi-resolution hash encodings whose collisions are not explicitly resolved - the multiple resolution levels combined with a tiny MLP absorb them
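A minimal sketch of the Fourier feature (positional) encoding of input coordinates; the number of frequency bands and the [-1, 1] input range are assumptions.

```python
import math
import torch

def fourier_encode(x: torch.Tensor, num_bands: int = 10) -> torch.Tensor:
    """Map (..., D) coordinates in [-1, 1] to sin/cos features at frequencies 2^k * pi.

    The high-frequency features let a small MLP fit sharp detail that it could not
    represent from the raw coordinates alone."""
    freqs = (2.0 ** torch.arange(num_bands).float()) * math.pi   # 2^k * pi
    angles = x[..., None] * freqs                                # (..., D, num_bands)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(-2)                                       # (..., D * 2 * num_bands)

features = fourier_encode(torch.rand(4, 3) * 2 - 1)              # 4 sample 3D points -> (4, 60)
```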
Image features:
- Description of a small region/point in the image
- Independent of attributes such as location, orientation and scale
- Handcrafted dense image features also exist, e.g. histogram of gradients (HoG), dense SIFT
- From NN:
- Intermediate activations from any type of NN
- Typically dense (for CNNs)
- Usually much lower resolution than input
- Vector length depends on layer
- Sometimes modified for matching e.g. normalization, reduced dimension with PCA
- Can arise from training a NN on an auxiliary task (e.g. image classification)
- Handcrafted vs learned:
- NN features usually perform better on tasks near the training domain
- Handcrafted features often generalize better
- Simpler classifiers with extracted features
- Fine tuning on frozen networks
- SVM (support vector machine), random forest, nearest neighbor etc.
- Depends on the quality of the feature extractor
- If the features come from a NN trained on an auxiliary task, they may not match the use case
- Extremely common use cases:
- Feature matching (e.g. nearest neighbor)
- Well tested, less likely to fall down when it encounters unexpected data (see the feature-matching sketch after this list)
- Image retrieval, face recognition (approximate nearest neighbor)
- Shortest path (e.g. skeletonization)
- Graph cut (e.g. stereo, segmentation)
- Homography estimation, Perspective-n-Point (PnP)
- Tracking
- Detect and track e.g. SORT, one-shot detection
- Kalman filter
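A short sketch of pulling dense features from an intermediate layer of a pre-trained CNN and matching them by nearest neighbour; the choice of ResNet-18 and cosine similarity are assumptions.

```python
import torch
import torchvision

# Pre-trained classifier truncated before the average pool and classifier,
# so it outputs a spatial feature map instead of class scores.
resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")
backbone = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()

@torch.no_grad()
def dense_features(image):                     # image: (3, H, W), already normalised
    fmap = backbone(image[None])[0]            # (C, H/32, W/32): much lower resolution
    feats = fmap.flatten(1).T                  # (N, C): one descriptor per location
    return torch.nn.functional.normalize(feats, dim=1)

# Nearest-neighbour matching between two images' descriptors (cosine similarity).
a = dense_features(torch.randn(3, 224, 224))
b = dense_features(torch.randn(3, 224, 224))
matches = (a @ b.T).argmax(dim=1)              # best match in b for each descriptor in a
```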
Applying models to new tasks:
- Find an off-the-shelf model
- Many are available; sometimes it is more useful to search for the dataset rather than the model
- Look for widely-used, actively-maintained ones
- Research-quality code is often not maintained after publication; it may need updating to work with current versions of frameworks
- Check for conda/pip packages before building from source
- May not generalize
- Check similarity of images to ones in paper or those provided as examples
- Neural networks can be weak to unexpected inputs or domain shifts
- Use NN features with non-learning algorithms
- Open world visual recognition models
- Detects objects in general
- Often trained by big organizations with excessive resources
- Often self-supervised learning or from commonly available associations (e.g. OpenAI CLIP uses image data and captions from websites) - may be low quality
- Few shot image classification: create a new class using a few exemplars
- Zero shot image classification: classify images based on textual descriptions
- Instance segmentation e.g. Detic (https://huggingface.co/spaces/akhaliq/Detic)
- Fine tuning
- Hand annotate a few different examples
- Interactive segmentation: using a NN to help produce training annotations
- DiscoBox: human draws bounding box, NN generates segmentation mask
- Human in the loop
- Active learning: algorithm lets human annotate the data it is most uncertain on
- Verification-based annotation
- Partially-trained model makes suggestion
- Human selects/edits the best suggestions
- Question answering:
- Weak annotation
- Humans answer yes/no questions; much faster than full manual annotation
- Take an existing model and fine-tune the model or classifier with a smaller set of annotations
- Catastrophic forgetting/interference: when a model is retrained on a new domain, it tends to lose what it learned from the previous data
- Use data augmentation judiciously
- Apply transformations to which the labels are invariant/equivariant; this enlarges the training set and forces the model to generalize (see the sketch after this list)
- Randomly perturb the data
- Geometric: translation, scale, rotation
- Photometric: brightness, contrast, hue
- Mixing: mix images (of different classes!) and labels
- Noise: gaussian, salt and pepper, synthetic rain, cutout
- e.g. albumentations, BBAug, Torchvision, Detectron2
- Considerations:
- Too much may be harmful
- Sometimes may be better to standardize scale/orientation in input data rather than augment it (e.g. people not usually walking upside down)
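A short sketch of a typical augmentation pipeline with Torchvision; the specific transforms and parameter ranges are arbitrary examples.

```python
import torchvision.transforms as T

# Random geometric and photometric perturbations applied on the fly during training.
train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.6, 1.0)),            # scale / translation
    T.RandomHorizontalFlip(),                               # geometric (only if labels are flip-invariant)
    T.RandomRotation(degrees=10),                           # small rotations only (see note above)
    T.ColorJitter(brightness=0.2, contrast=0.2, hue=0.05),  # photometric
    T.ToTensor(),
])
```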
- Synthetic data
- 3D rendering to generate your own training data
- Self-supervised learning: use known physical properties to provide supervision
- Correspondence tasks cannot be hand-labelled: disparity maps need precise per-pixel disparities, far beyond the accuracy needed for segmentation masks
- Pseudo ground-truths
- Use device for high accuracy capture (LiDAR, structured light)
- Use more information e.g. higher-resolution or more images
- Use 3D reconstruction to generate higher-quality images and train on these
- Today: this is how much stereo-matching training data tends to be produced; not as clean as synthetic data, but more robust
- Self-supervised learning
- Consistency (left/right, inter-frame) using image warping: minimize the photometric warping error in place of ground truth to train the model (a rough sketch follows below)
- Feature matching with tracking
- TODO
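A rough sketch of the left/right photometric-consistency idea for self-supervised stereo using PyTorch `grid_sample`; the disparity sign convention and the L1 error are assumptions.

```python
import torch
import torch.nn.functional as F

def photometric_loss(left, right, disparity):
    """Reconstruct the left image by sampling the right image at x - d, then compare.

    left, right: (B, 3, H, W) images; disparity: (B, 1, H, W) predicted left disparity
    in pixels. The images themselves supervise the model - no ground truth needed."""
    b, _, h, w = left.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs[None].float().to(left) - disparity[:, 0]        # shift sample position by disparity
    ys = ys[None].float().to(left).expand_as(xs)

    # grid_sample expects sampling coordinates normalised to [-1, 1].
    grid = torch.stack([2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1], dim=-1)
    reconstructed_left = F.grid_sample(right, grid, align_corners=True)

    return (reconstructed_left - left).abs().mean()         # L1 warping error
```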