An attempt to understand and annotate Raft. I’ll also be linking to other resources that helped me get a better grip of the material in this paper. I will not contrast Raft and Paxos as I do not understand Paxos. Also I’ve left out 3 small topics (Cluster Changes, Log Compaction and Client Interaction) that are quite easy to grok once you understand the crux of the algorithm.
Paper:- Raft
Recommended By:- Shrayas Rajagopal
Recommended Reading
- Replicated State Machines - A common practice in distributed systems to replicate data across multiple systems. So even if one or a few of them go down, the system as a whole can function with the help of the copies. Instead of copying data, the replica and the primary system execute the same exact commands in the same order.
- Consensus Problem Explained
- Quick Look into Paxos
- In case you don’t know what a RPC is - Understanding RPCs
Algorithm Explained
You can kind of see Raft in action here - Raft Visually Explained. I’m pretty sure I won’t be able to explain it better than them. You can come back to this article for some of the smaller not so important details that isn’t covered by this visualisation.
Basics
- To increase understandability they focused on two general approaches of problem simplification. The consensus problem is broken into three relatively independent subproblems (Leader Election, Log or State Replication, Safety). They’ve tried reducing the state space whenever possible by using randomised approaches
- They’ve emphasised on the importance of having a leader. It greatly simplifies decision making
- Leaders typically operate until they fail
- At any given time each server is a leader, follower or a candidate. When a node first comes to life it always is a follower. If it doesn’t receive a heartbeat(an acknowledgement from the Leader) message for a specified time, it assumes there is no leader and gets promoted to candidate. It nominates itself and starts an election. If that particular node wins majority it becomes the leader for that term
- Each server can vote for one candidate on a first-come-first-serve basis. Assume node 1 and 2 have timed out and are requesting for votes. Node 1 reaches out to node 3 before node 2, node 3 will vote in favour of node 1
- Time is divided into units called terms. Terms are represented by positive consecutive integers. Each node keeps track of
- Terms act as a logical clock. Say a node becomes the leader for term 2 and becomes unavailable (network cable unplugged or whatever) after some time. One of the functional follower nodes timeout and enter candidate state and eventually becomes the leader. When the unavailable leader return to the cluster it notices that it’s term 3 (remember it was the leader for term 2), it steps down and becomes a follower
- Terms are also used to find and fix anomalies in data between different nodes
- Servers communicate using Remote Procedure Calls. They are issued in parallel for performance gains
- There can only be one leader for a particular term. Say servers become lose connectivity and are split into two groups that can communicate amongst themselves. One of the groups is now void of a leader. One of the nodes in the group without a leader initiates an election by becoming a candidate. It’s then chosen as a leader and all the nodes of that group increment their term by one. Two different leaders can exist for these groups given that they are leaders for different terms
Leader Election
Log Replication
Safety
Miscellaneous
- Follower and Candidate Crashes
They are quite easily handled. Say a follower crashes, the leader indefinitely tries sending it the same RPC. When the follower is back up, it then receives it and responds thus completing the cycle. Keep in mind that this wont’ effect performance as the leader only waits for the majority to respond. So a single or a few nodes don’t matter
Say a follower receives the RPC, appends the entries but fails before responding - When the follower is backup, it will receive the same RPC from the leader. Raft RPCs are idempotent. It notices the entry in it’s log and rejects it
- Timing and Availability
This section explains something critical but very obvious for the proper functioning of the system. Say it takes 10 milliseconds to broadcast a message from the leader to all the followers the election timeout as you guessed it must be larger than 10 milliseconds (if it’s less than the broadcastTime, servers will keep timing out on a continuous basis assuming the absence of a leader)
MTBF is the average time between failures for a single server. It’s typically several months or more
MTBF and broadcastTime are properties of the underlying system. electionTimeout can however be manually set
In short this should be broadcastTime < electionTimeout < MTBF
MTBF should be much larger than the electionTimeout. When a leader crashes, the system will be unavailable for a duration roughly equivalent to the electionTimeout. Therefore the electionTimeout should be a very small portion of overall time. Say the leader crashes every 6 months or so and the electionTimeout is 500 milliseconds. The system is therefore unavailable for 500 ms every 6 months.