01-02. Introduction

Course

James Atlas. Office hours: Monday 1-4 pm

Groups of 2-4. Three roles:

Weighting:

The Five Challenges

Scalability (volume, velocity): distributing computation/storage

Reliability (veracity): ensuring results survive when machines fail

Productivity (variety, value): making distributed programs easy to write

Three Perspectives

Look at the problem from three levels of abstraction:

Architecture

Scalability:

Storage/IO: ensuring data durability/consistency (RAID, DFS)

Distributed file system:

RDD: Resilient Distributed Dataset

Parallelism vs Scalability

Parallelism: $P = W/D$, where $W$ is the work done (e.g. number of operations) and $D$ is the depth of computation (depth of the dependency tree).
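To make the definition concrete, here is a minimal sketch that counts work and depth for one particular computation: summing n numbers with a balanced pairwise reduction tree. The choice of example and the function name `tree_sum_work_depth` are assumptions for illustration and are not part of the notes.

```python
def tree_sum_work_depth(n):
    """Return (work, depth) for summing n numbers with pairwise reduction.

    Work  = total number of additions performed.
    Depth = number of levels in the dependency tree, i.e. the longest
            chain of additions that must run one after another.
    """
    work = 0
    depth = 0
    remaining = n
    while remaining > 1:
        pairs = remaining // 2   # additions at this level, all independent
        work += pairs
        depth += 1               # the next level depends on this one
        remaining -= pairs       # each pair collapses into a single value
    return work, depth


work, depth = tree_sum_work_depth(1024)
parallelism = work / depth       # P = W / D
print(f"W={work}, D={depth}, P={parallelism:.1f}")
```

For n = 1024 the tree has 1023 additions spread over 10 levels, so on average about 100 operations per level could run at the same time. A purely sequential loop over the same data would have the same work but depth equal to the work, giving a parallelism of 1.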

Scalability: given a fixed problem size, does adding more processors make it faster?
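One rough way to explore this question is to estimate running time on p processors from the work and depth. The sketch below assumes the simple estimate T(p) ≈ W/p + D (a Brent-style bound); that estimate, the function names, and the sample values of W and D are assumptions for illustration and do not come from the notes.

```python
def estimated_time(work, depth, processors):
    """Assumed model: time on p processors is roughly W/p + D."""
    return work / processors + depth


def speedup(work, depth, processors):
    """Speedup over one processor for a fixed problem size."""
    return estimated_time(work, depth, 1) / estimated_time(work, depth, processors)


# Hypothetical fixed problem: W = 1023 operations, D = 10 levels.
W, D = 1023, 10
for p in (1, 2, 8, 64, 1024):
    print(f"p={p:5d}  speedup={speedup(W, D, p):.2f}")
```

Under this model the speedup keeps growing with p but levels off below (W + D) / D, so for a fixed problem size the depth eventually limits how much extra processors can help.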