21. Raft Consensus Algorithm

What is consensus?

Agreement on shared state
Recovers from server failure autonomously:
- Minority of servers failing: no problem
  - Half the servers failing is the threshold
- Majority fail: lose availability but retain consistency

Eliminating the single point of failure (e.g. master node) requires either:

An ad-hoc algorithm (rare case that typically occurs due to network partition with replication lag) OR
Consensus algorithm: Paxos, Raft etc.
- Paxos was too complicated; every single implementation had bugs

Replicated State Machines

Replicated log => replicated state machines: all servers execute the same commands in the same order.
The consensus module ensure proper log replication
System makes progress as log as a majority of servers are up
Failure model: fail-stop, delayed/lost messages

State vs replication log: assuming the size of the changes in a reasonable interval (e.g. day) is an order of magnitude or more smaller than the size of the state.

Low level: replicate entire database state, bit-for-bit

High level: transactional level, most common

Raft

Designed to be easy to understand

Visualization of Algorithm

Leader Election

Select one server to act as a cluster leader.

Election process:

After there have been no heartbeats after some random timeout, call for an election
- The randomization needed to avoid split votes
- Timeout should be ~10x times the network latency
Once election request is received, timeout is reset
Vote for leader if they are as or more up-to-date than you
Majority rules (needs to know total number of nodes)

If multiple nodes call for an election at the same time, redo the election.

Log Replication

Leader takes commands from clients and appends them to its in-progress logs
Leader sends request to all other servers
Once it gets confirmation from a majority of nodes (in the entire system), it commits the transaction
The leader sends its confirmation to all other nodes, which commit their own transactions
- Any inconsistencies are overwritten

Safety

Only a server with an up-to-date log (entries all committed) can become the leader.

If a system becomes partitioned, the smaller partition will never commit its transactions as it can never reach confirmation from a majority of nodes.

Hence, when the network rejoins, all of the in-progress items will be overwritten.

If the partitioning leads to no partition having more than half the nodes, nothing will be committed.