Raft (A Consensus Algorithm) Explained

September 23, 2017 by Abhirath Mahipal


An attempt to understand and annotate Raft. I’ll also be linking to other resources that helped me get a better grip of the material in this paper. I will not contrast Raft and Paxos as I do not understand Paxos. Also I’ve left out 3 small topics (Cluster Changes, Log Compaction and Client Interaction) that are quite easy to grok once you understand the crux of the algorithm.

Paper:- Raft
Recommended By:- Shrayas Rajagopal

Recommended Reading

  • Replicated State Machines - A common practice in distributed systems to replicate data across multiple systems. So even if one or a few of them go down, the system as a whole can function with the help of the copies. Instead of copying data, the replica and the primary system execute the same exact commands in the same order.
  • Consensus Problem Explained
  • Quick Look into Paxos
  • In case you don’t know what a RPC is - Understanding RPCs

Algorithm Explained

You can kind of see Raft in action here - Raft Visually Explained. I’m pretty sure I won’t be able to explain it better than them. You can come back to this article for some of the smaller not so important details that isn’t covered by this visualisation.

Basics

  • To increase understandability they focused on two general approaches of problem simplification. The consensus problem is broken into three relatively independent subproblems (Leader Election, Log or State Replication, Safety). They’ve tried reducing the state space whenever possible by using randomised approaches
  • They’ve emphasised on the importance of having a leader. It greatly simplifies decision making
  • Leaders typically operate until they fail
  • At any given time each server is a leader, follower or a candidate. When a node first comes to life it always is a follower. If it doesn’t receive a heartbeat(an acknowledgement from the Leader) message for a specified time, it assumes there is no leader and gets promoted to candidate. It nominates itself and starts an election. If that particular node wins majority it becomes the leader for that term
  • Each server can vote for one candidate on a first-come-first-serve basis. Assume node 1 and 2 have timed out and are requesting for votes. Node 1 reaches out to node 3 before node 2, node 3 will vote in favour of node 1
  • Time is divided into units called terms. Terms are represented by positive consecutive integers. Each node keeps track of
  • Terms act as a logical clock. Say a node becomes the leader for term 2 and becomes unavailable (network cable unplugged or whatever) after some time. One of the functional follower nodes timeout and enter candidate state and eventually becomes the leader. When the unavailable leader return to the cluster it notices that it’s term 3 (remember it was the leader for term 2), it steps down and becomes a follower
  • Terms are also used to find and fix anomalies in data between different nodes
  • Servers communicate using Remote Procedure Calls. They are issued in parallel for performance gains
  • There can only be one leader for a particular term. Say servers become lose connectivity and are split into two groups that can communicate amongst themselves. One of the groups is now void of a leader. One of the nodes in the group without a leader initiates an election by becoming a candidate. It’s then chosen as a leader and all the nodes of that group increment their term by one. Two different leaders can exist for these groups given that they are leaders for different terms

Leader Election

  • The entries (or change in state) flow through the leader to the other nodes in the system. Greatly reduces complexity and increased understandability
  • As explained earlier a candidate initiates an election if it doesn’t get a heartbeat for a said duration. One of the following can happen when an election is initiated
    • It wins the vote if it secures a majority. Once it wins, it sends out heartbeats to other servers to claim authority and also to prevent elections unnecessarily. There are additionally safety checks before a candidate can take charge, I’ll cover them in a bit
    • It receives a heartbeat or appendEntry (RPC to change state) from a leader during the voting process. It checks the term. If the term matches it’s current term or is higher, it realises that a new leader isn’t needed and transitions to become a follower. If the term is dated, it rejects the RPC as it probably is from the leader of an older term
    • There’s a split vote or tie between two or more servers. Winning isn’t enough to become a leader, you need majority support. All the servers increment their term (this term had no leader) and initiate a re-election
  • Randomised timeouts are used to prevent split votes. If they had the same timeout, if a leader fails, many of the nodes might timeout in quick succession and the distributed system will witness many candidates. However if they had different timeouts, one of the node times out and becomes the leader even before other nodes get a chance to timeout

Log Replication

  • Once the troop has a leader, it begins responding to clients. Each client request contains a command to be executed by the replicated state machines (modify an existing entry or add new entries etc)
  • The leader makes the changes (or appends the client command to it’s log) and sends out RPCs (AppendEntries) in parallel to all the other nodes ordering them to do the same. Please note that the leader doesn’t commit the the entry just yet. The log entry is only committed when it is able to replicate the said change by the client in more than half the number of servers. It then returns the result of that execution to the client
  • A single or a few slow nodes will not impact performance at all. Once majority is achieved the entry is committed, the leader indefinitely tries sending RPCs to slow nodes until it gets a response
  • Log Matching Property (every command or entry has an index and term)
    • If two entries have the same index and term, they store the same command
    • If two entries in different logs have the same index and term, then the logs are identical in all preceding entries
  • The first property is a result of the fact that the leader creates at most one entry with a said index in a term. Also log entries never change position
  • The second property is maintained quite easily as well. When the leader sends out an AppendEntries RPC, it also includes the index and term of the preceding entry in it’s own log. If the follower’s latest entry does not match the preceding entry given in the RPC, it rejects it. It refuses new entries. This is a safety check mechanism. So if a follower responds to an AppendEntry the leader knows for sure that the particular follower’s log matches it’s log until that entry
  • If the follower rejects it’s AppendEntries RPC (implying inconsistencies), the leader looks up the logs of the follower. It finds the latest entry upto which they agree upon and delete every entry after that. It then sends all it’s entries to the follower after that point in time (they’ve also pointed a possible optimisation. The follower can send the conflicting entry instead of the leader having to search for it)
  • A leader never overwrites or deletes entries in it’s own log

Safety

  • Election Restriction

    • The leader must have all committed entries
    • When a candidate sends out a RequestVoteRPC, it also sends out details about it’s logs (last committed entry). The voter can deny to vote for that candidate if it’s logs are more up to date than the candidate
    • Raft uses terms and entry index to determine the more recent log. If between two logs, one of them ends with a term 1 and the other has term 2, obviously the one with term 2 is more recent. If the logs have the same term, the longer log (therefore greater index) is more recent
  • Committing Entries From Previous Terms

    • Say a leader copies entries to a majority of the servers but crashes before committing the entry
    • A new leader is chosen. It is now implied that at least one of the server with the new uncommitted entry has voted for the new leader (because a new entry is accepted only with majority and the same goes for a new leader)
    • It is important to keep in mind that only entries from the current term are committed by counting replicas across nodes
    • In short previous term’s uncommitted entry if accepted by a majority of the nodes are automatically handled and committed when a new entry in a new term is received. They’ve explained the argument on page 9 of the paper

Miscellaneous

  • Follower and Candidate Crashes
    They are quite easily handled. Say a follower crashes, the leader indefinitely tries sending it the same RPC. When the follower is back up, it then receives it and responds thus completing the cycle. Keep in mind that this wont’ effect performance as the leader only waits for the majority to respond. So a single or a few nodes don’t matter
    Say a follower receives the RPC, appends the entries but fails before responding - When the follower is backup, it will receive the same RPC from the leader. Raft RPCs are idempotent. It notices the entry in it’s log and rejects it
  • Timing and Availability
    This section explains something critical but very obvious for the proper functioning of the system. Say it takes 10 milliseconds to broadcast a message from the leader to all the followers the election timeout as you guessed it must be larger than 10 milliseconds (if it’s less than the broadcastTime, servers will keep timing out on a continuous basis assuming the absence of a leader)
    MTBF is the average time between failures for a single server. It’s typically several months or more
    MTBF and broadcastTime are properties of the underlying system. electionTimeout can however be manually set In short this should be broadcastTime < electionTimeout < MTBF MTBF should be much larger than the electionTimeout. When a leader crashes, the system will be unavailable for a duration roughly equivalent to the electionTimeout. Therefore the electionTimeout should be a very small portion of overall time. Say the leader crashes every 6 months or so and the electionTimeout is 500 milliseconds. The system is therefore unavailable for 500 ms every 6 months.