What is consensus?
- Agreement on shared state
- Recovers from server failure autonomously:
- Minority of servers failing: no problem
- Half the servers failing is the threshold
- Majority fail: lose availability but retain consistency
- Minority of servers failing: no problem
Eliminating the single point of failure (e.g. master node) requires either:
- An ad-hoc algorithm (rare case that typically occurs due to network partition with replication lag) OR
- Consensus algorithm: Paxos, Raft etc.
- Paxos was too complicated; every single implementation had bugs
Replicated State Machines
- Replicated log => replicated state machines: all servers execute the same commands in the same order.
- The consensus module ensure proper log replication
- System makes progress as log as a majority of servers are up
- Failure model: fail-stop, delayed/lost messages
State vs replication log: assuming the size of the changes in a reasonable interval (e.g. day) is an order of magnitude or more smaller than the size of the state.
Low level: replicate entire database state, bit-for-bit
High level: transactional level, most common
Raft
Designed to be easy to understand
Leader Election
Select one server to act as a cluster leader.
Election process:
- After there have been no heartbeats after some random timeout, call for an election
- The randomization needed to avoid split votes
- Timeout should be ~10x times the network latency
- Once election request is received, timeout is reset
- Vote for leader if they are as or more up-to-date than you
- Majority rules (needs to know total number of nodes)
If multiple nodes call for an election at the same time, redo the election.
Log Replication
- Leader takes commands from clients and appends them to its in-progress logs
- Leader sends request to all other servers
- Once it gets confirmation from a majority of nodes (in the entire system), it commits the transaction
- The leader sends its confirmation to all other nodes, which commit their own transactions
- Any inconsistencies are overwritten
Safety
Only a server with an up-to-date log (entries all committed) can become the leader.
If a system becomes partitioned, the smaller partition will never commit its transactions as it can never reach confirmation from a majority of nodes.
Hence, when the network rejoins, all of the in-progress items will be overwritten.
If the partitioning leads to no partition having more than half the nodes, nothing will be committed.