01. Introduction

Weighting:

50%: research project presented as a conference style paper, including a relevant literature review (due week 8)
- 15-25 references in the paper
- Deep learning: use when it will give you a better result
- No novelty required
- Possible projects:
  - Automatic timecode/GPS coordinate extraction from dashcam/surveillance cameras?
  - Censoring faces in real-time (except specific people)
  - AR angry birds with finger dragging on table
  - Automatically turning screen off when no one is looking at the screen
  - Othello/chess state detection
  - Russian flag replacement (with Ukraine)
  - No-mask detection (stretch goal: if holding drink/food, they get a pass)
  - Finger tracking for cursor movement (e.g. hand wet)
  - Bread baking? Does rate at which bread rises drop near the end of the baking cycle
  - Port/cable detection (e.g. USB-C, HDMI)
  - Virtual touch bar: mirror on webcam to view keyboard area, add stickers and detect touch to change brightness
  - Drink detection: detect when you take a drink, alert if you haven’t taken a sip in an hour
  - Deep learning: feel free to use it where beneficial
    - Classification: is there an x (and possibly a y or z) in the image or not?
    - Object detection: where is the object in the image?
    - Dense/semantic segmentation: recognizing what ‘object’ (e.g. sky) each pixel belongs to
    - Instance segmentation: recognizing and locating multiple instances of an object
    - Tracking: tracking the motion of objects
10%: class participation/presentations (presentations in final weeks of semester)
40%: final exam

Goal: to recognize objects and their motions.

Signal processing on 1D data; computer vision on 2D data.

Image processing on still images; computer vision on video and still images.

Difficulties

The sensory gap: gap between reality and what a recording of the scene can capture.

The semantic gap: lack of contextual knowledge about a scene.

The human visual system is very good:

~50% of cerebral cortex used for vision
Emualating the operation of human behaviours:
- Perception based on relative brightness (e.g. Checker shadow illusion)
- Logarithmic response to brightness
- Low-light performance (for passive systems)

Recovering 3D information

Several cues available:

Motion
Stereo (< 3m)
Texture
Shading
Contours

Or actual depth hardware:

Active stereo (IR - brightness limited for safety)
Structured light: dot projection
Time of flight
- RF-modulated
- Direct timing (LIDAR)
  - Brightness not limited as eye exposure duration so short

Labs: Intel Realsense D435.

Processing

Pre-processing: noise reduction, contrast etc.
Low-level: color, boundary/edge, shape, texture detection
High-level: object detection, determining spatial relationships, meanings

The higher-level the processing, the less generic and the more domain-specific knowledge is required.

Low-Level Image Processing:

Image compression
Noise reduction (while maximizing information kept)
Edge extraction
Contrast enhancement (only helps humans)
Segmentation
Thresholding
Morphology
Image restoration (e.g. deblurring using knowledge of speed of subject)

Approaches

Build a simple model of the world; constrain the environment (e.g. fixed lighting environment)
- Focus on definite tasks with clear requirements
Find provably good algorithms
- Try ideas based on theory
Experiment on real world
- Solutions may not generalize to a more complex environment
Update the model