01-02. Introduction

Course

James Atlas. Office hours: Monday 1-4 pm

Groups of 2-4. Three roles:

Weighting:

Variety: different formats of data from many sources (e.g. country-specific regulations, language barriers)
Velocity: how quickly the data can be ingested and processed
Volume: the sheer amount of data to store and process
Veracity: inconsistencies/uncertainty in data
Value: extracting useful information from the data

Scalability (volume, velocity): distributing computation/storage

Reliability (veracity): ensuring results survive when machines fail

Productivity (variety, value): making writing distributed programs easy

Look at the problem from three levels of abstraction:

Scalability:

Storage/IO: ensuring data durability/consistency (RAID, DFS)

Distributed file system:

Chunk servers:
- Files split into contiguous chunks (16-64 MB)
- Chunks replicated 2-3 times, hopefully on different racks
Master node:
- Called ‘Name Node’ in Hadoop
- Metadata about where files are stored
Client library that accesses the data
- Operations should run on the machines that hold the data
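
The chunking and replication described above can be sketched as follows. This is a minimal illustration, not how Hadoop or any real DFS implements it: the names `Server`, `split_into_chunks`, `place_replicas` and `build_metadata` are hypothetical, the 64 MB chunk size and 3 replicas are taken from the ranges in the notes, and the rack-spreading rule is an assumed simple heuristic.

```python
from dataclasses import dataclass

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, upper end of the 16-64 MB range
REPLICAS = 3                   # upper end of the 2-3 replicas range


@dataclass(frozen=True)
class Server:
    """A hypothetical chunk server, identified by name and rack."""
    name: str
    rack: str


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) pairs that cover the file contiguously."""
    return [(off, min(chunk_size, file_size - off))
            for off in range(0, file_size, chunk_size)]


def place_replicas(chunk_index, servers, replicas=REPLICAS):
    """Choose servers for one chunk, preferring racks not used yet."""
    n = len(servers)
    # Rotate the starting server so chunks spread across the cluster.
    order = [servers[(chunk_index + i) % n] for i in range(n)]
    chosen, used_racks = [], set()
    # First pass: at most one replica per rack.
    for s in order:
        if len(chosen) < replicas and s.rack not in used_racks:
            chosen.append(s)
            used_racks.add(s.rack)
    # Second pass: if there are fewer racks than replicas, reuse racks.
    for s in order:
        if len(chosen) < replicas and s not in chosen:
            chosen.append(s)
    return chosen


def build_metadata(file_name, file_size, servers):
    """Build the kind of chunk-to-location map a master node would keep."""
    return {
        (file_name, i): {
            "offset": off,
            "length": length,
            "servers": [s.name for s in place_replicas(i, servers)],
        }
        for i, (off, length) in enumerate(split_into_chunks(file_size))
    }
```

A client library would ask the master for this metadata and then read each chunk directly from one of the listed servers, ideally one on the machine where the computation is running.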

RDD: Resilient Distributed Dataset

Parallelism: $P = W/D$, where $W$ is the work done (e.g. number of operations) and $D$ is the depth of computation (depth of dependency tree).
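
As a worked example of $P = W/D$, consider summing $n$ numbers by adding them pairwise in rounds. The sketch below counts the additions ($W$) and the number of rounds ($D$). The function name `tree_sum_work_depth` is hypothetical, and the pairwise tree sum is chosen only as an illustration.

```python
def tree_sum_work_depth(values):
    """Sum values pairwise in rounds; return (total, work, depth)."""
    level = list(values)
    work = 0   # W: total number of additions performed
    depth = 0  # D: number of rounds, i.e. the longest dependency chain
    while len(level) > 1:
        nxt = []
        # Additions within one round are independent of each other,
        # so they could all run in parallel.
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] + level[i + 1])
            work += 1
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd element carries over unchanged
        level = nxt
        depth += 1
    total = level[0] if level else 0
    return total, work, depth


total, W, D = tree_sum_work_depth(range(8))
print(total, W, D, W / D)
```

For 8 values this gives $W = 7$ and $D = 3$, so $P = 7/3 \approx 2.3$. A plain left-to-right loop does the same 7 additions but each depends on the previous one, so $D = 7$ and $P = 1$.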

Scalability: given a fixed problem size, does adding more processors make it faster?
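
One way to explore this question is to estimate the running time on $p$ processors from $W$ and $D$. The sketch below assumes the estimate $T_p \approx W/p + D$ (the upper bound that a greedy scheduler achieves), which is not stated in these notes. The function names and the values $W = 1023$, $D = 10$ are hypothetical, chosen only for illustration.

```python
def estimated_time(work, depth, processors):
    """Assumed estimate of running time on p processors: W/p + D."""
    return work / processors + depth


def estimated_speedup(work, depth, processors):
    """Sequential time (W) divided by the estimated parallel time."""
    return work / estimated_time(work, depth, processors)


W, D = 1023, 10  # illustrative values for a fixed problem size
for p in (1, 2, 8, 64, 1024):
    print(p, round(estimated_speedup(W, D, p), 1))
```

Under this estimate the speedup keeps growing with $p$ but flattens out below $W/D$, so for a fixed problem size extra processors eventually stop helping.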