Course
James Atlas. Office hours: Monday 1-4 pm
Groups of 2-4. Three roles:
- Manager - ensures the group stays on task
- Recorder - records answers/discussions
- Reporter - summarizes findings to instructor/other teams
Weighting:
- Labs: 30% (Learn quiz + 9 assignments)
- Project: 40%
- Final exam: 30%
The Five Challenges
- Variety: different formats of data from many sources (e.g. country-specific regulations, language barriers)
- Velocity: how quickly the data can be ingested and processed
- Volume: large volume of data
- Veracity: inconsistencies/uncertainty in data
- Value: extracting useful information from the data
Scalability (volume, velocity): distributing computation/storage
Reliability (veracity): ensuring results survive when machines fail
Productivity (variety, value): making writing distributed programs easy
Three Perspectives
Look at the problem from three levels of abstraction:
- Architecture
- Algorithms
- Programming
Architecture
Scalability:
- Datasets do not fit in memory
- Time consuming to process large datasets
- Parallel processing/storage required
Storage/IO: ensuring data durability/consistency (RAID, DFS)
Distributed file system:
- Chunk servers:
  - Files split into contiguous chunks (16-64 MB)
  - Chunks replicated 2-3 times, hopefully on different racks
- Master node:
  - Called ‘Name Node’ in Hadoop
  - Stores metadata about where files are stored
- Client library that accesses the data
- Operations should run on the machines that store the data
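The chunking and replication described above can be sketched in a few lines of Python. This is only an illustrative model, not how Hadoop or any real DFS is implemented: the function names `split_into_chunks` and `place_replicas`, the rack and server names, and the round-robin placement rule are all assumptions made for the example. The `master_metadata` dictionary stands in for the master node's record of where each chunk lives.

```python
MB = 1024 * 1024


def split_into_chunks(file_size, chunk_size=64 * MB):
    """Return contiguous (offset, length) pairs that cover file_size bytes."""
    chunks = []
    offset = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


def place_replicas(num_chunks, servers_by_rack, replicas=3):
    """Map each chunk index to a list of (rack, server) pairs.

    Assumed policy: every replica of a chunk goes to a different rack,
    chosen round-robin. With fewer racks than replicas, this sketch
    simply stores fewer copies.
    """
    racks = sorted(servers_by_rack)
    copies = min(replicas, len(racks))
    placement = {}
    for chunk_id in range(num_chunks):
        chosen = []
        for r in range(copies):
            rack = racks[(chunk_id + r) % len(racks)]
            servers = servers_by_rack[rack]
            chosen.append((rack, servers[chunk_id % len(servers)]))
        placement[chunk_id] = chosen
    return placement


# Hypothetical cluster: three racks of chunk servers.
servers_by_rack = {
    "rack-a": ["a1", "a2"],
    "rack-b": ["b1"],
    "rack-c": ["c1", "c2"],
}

chunks = split_into_chunks(150 * MB)
# The master node keeps only metadata: file name -> chunk locations.
master_metadata = {"logs.txt": place_replicas(len(chunks), servers_by_rack)}

for chunk_id, replicas in master_metadata["logs.txt"].items():
    print(chunk_id, chunks[chunk_id], replicas)
```

Because each chunk has a copy on several racks, losing one machine or one whole rack still leaves a readable replica, and a job can be scheduled on any server that already holds the chunk it needs.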
RDD: Resilient Distributed Dataset
Parallelism vs Scalability
Parallelism:
Scalability: given a fixed problem size, does adding more processors make it faster?
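One common way to put a number on the scalability question above is speedup (serial time divided by parallel time) and efficiency (speedup divided by the number of processors). The sketch below assumes those standard definitions; the function names and the timings are made up for illustration and are not measurements from the course.

```python
def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, processors):
    """Fraction of ideal (linear) speedup achieved with `processors` workers."""
    return speedup(t_serial, t_parallel) / processors


# Hypothetical timings (seconds) for the same fixed-size problem.
t_one_processor = 120.0
timings = {2: 65.0, 4: 40.0, 8: 30.0}

for p, t in timings.items():
    s = speedup(t_one_processor, t)
    e = efficiency(t_one_processor, t, p)
    print(f"{p} processors: speedup {s:.2f}, efficiency {e:.2f}")
```

If efficiency falls as processors are added, the program is parallel but does not scale well for that problem size.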