Paxos and Raft Consensus Algorithms - Building Reliable Distributed Systems
A deep dive into consensus algorithms, their applications in distributed systems, and how quorums ensure consistency across unreliable networks.
Consensus algorithms are the backbone of reliable distributed systems. They enable multiple nodes to agree on a single value or sequence of values, even when some nodes fail or network partitions occur. In this comprehensive guide, we’ll explore two of the most important consensus algorithms: Paxos and Raft, along with the concept of quorums that make them work.
The Consensus Problem
Before diving into specific algorithms, let’s understand the consensus problem:
Given a set of nodes that can fail or become unreachable, how do we ensure all nodes agree on a single value?
This seemingly simple problem becomes incredibly complex when you consider:
- Network partitions: Nodes can’t communicate with each other
- Node failures: Some nodes may crash or become unresponsive
- Byzantine failures: Nodes may behave maliciously (though Paxos/Raft don’t handle this)
- Timing issues: Messages may be delayed or lost
What is a Quorum?
A quorum is a subset of nodes whose agreement is sufficient to make a decision. The key insight is that if two quorums have any overlap, they cannot make conflicting decisions.
Quorum Properties
- Intersection Property: Any two quorums must have at least one node in common
- Majority Quorum: Most common approach where a quorum is any majority of nodes
- Fault Tolerance: A majority-quorum system can tolerate up to ⌊(n-1)/2⌋ failures in an n-node system
Quorum Examples
public class QuorumExample {
    private final int totalNodes;
    private final int quorumSize;

    public QuorumExample(int totalNodes) {
        this.totalNodes = totalNodes;
        this.quorumSize = (totalNodes / 2) + 1; // Majority
    }

    public int getQuorumSize() {
        return quorumSize;
    }

    public boolean isQuorum(int respondingNodes) {
        return respondingNodes >= quorumSize;
    }

    public int getMaxFailures() {
        return totalNodes - quorumSize; // equals ⌊(n-1)/2⌋
    }
}

// Example usage
QuorumExample quorum = new QuorumExample(5); // 5 nodes
System.out.println("Quorum size: " + quorum.getQuorumSize()); // 3
System.out.println("Max failures: " + quorum.getMaxFailures()); // 2
Paxos Algorithm
Paxos, first described by Leslie Lamport in 1989 (and formally published in 1998), is one of the most influential consensus algorithms. It's used in systems like Google's Chubby and Spanner.
Paxos Phases
Paxos operates in two phases:
Phase 1: Prepare Phase
- Proposer sends a prepare request with proposal number n
- Acceptors respond with:
  - A promise not to accept proposals numbered less than n
  - The highest-numbered proposal they've accepted (if any)
Phase 2: Accept Phase
- Proposer sends an accept request with value v; if any acceptor reported a previously accepted value in Phase 1, the proposer must use the value from the highest-numbered such proposal instead of its own
- Acceptors accept the proposal if they haven't promised a higher-numbered proposal
Paxos Acceptor Implementation
public class PaxosNode {
    private final int nodeId;
    private int promisedNumber;          // highest proposal number this acceptor has promised
    private Object acceptedValue;        // value of the highest-numbered proposal accepted so far
    private int acceptedProposalNumber;  // number of that proposal (0 = none accepted yet)

    public PaxosNode(int nodeId) {
        this.nodeId = nodeId;
        this.promisedNumber = 0;
    }

    // Phase 1: promise to ignore proposals numbered below proposalNum
    public PrepareResponse prepare(int proposalNum) {
        if (proposalNum > promisedNumber) {
            promisedNumber = proposalNum;
            return new PrepareResponse(true, acceptedValue, acceptedProposalNumber);
        }
        return new PrepareResponse(false, null, 0);
    }

    // Phase 2: accept unless we have promised a higher-numbered proposal
    public boolean accept(int proposalNum, Object value) {
        if (proposalNum >= promisedNumber) {
            promisedNumber = proposalNum;
            acceptedValue = value;
            acceptedProposalNumber = proposalNum;
            return true;
        }
        return false;
    }
}

class PrepareResponse {
    final boolean promised;
    final Object acceptedValue;
    final int acceptedProposalNumber;

    public PrepareResponse(boolean promised, Object value, int proposalNum) {
        this.promised = promised;
        this.acceptedValue = value;
        this.acceptedProposalNumber = proposalNum;
    }
}
Paxos Proposer Implementation
import java.util.List;

public class PaxosProposer {
    private int proposalNumber;
    private final List<PaxosNode> acceptors;

    public PaxosProposer(List<PaxosNode> acceptors) {
        this.acceptors = acceptors;
    }

    public Object propose(Object value) {
        // In a real deployment, proposal numbers must be unique across proposers
        // (e.g., a round counter combined with the proposer's node ID)
        proposalNumber++;

        // Phase 1: Prepare
        int promises = 0;
        Object highestAcceptedValue = null;
        int highestAcceptedNumber = 0;
        for (PaxosNode acceptor : acceptors) {
            PrepareResponse response = acceptor.prepare(proposalNumber);
            if (response.promised) {
                promises++;
                if (response.acceptedProposalNumber > highestAcceptedNumber) {
                    highestAcceptedNumber = response.acceptedProposalNumber;
                    highestAcceptedValue = response.acceptedValue;
                }
            }
        }

        // Check if we have a quorum of promises
        if (promises < (acceptors.size() / 2) + 1) {
            return null; // No consensus this round
        }

        // Phase 2: Accept. If any acceptor already accepted a value, we must
        // propose that value instead of our own -- this is what makes Paxos safe.
        Object valueToPropose = (highestAcceptedValue != null)
                ? highestAcceptedValue : value;
        int accepts = 0;
        for (PaxosNode acceptor : acceptors) {
            if (acceptor.accept(proposalNumber, valueToPropose)) {
                accepts++;
            }
        }
        if (accepts >= (acceptors.size() / 2) + 1) {
            return valueToPropose; // Consensus reached
        }
        return null; // No consensus this round
    }
}
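With the constructor above, a single round against an in-process cluster looks like this. It's a toy, single-threaded harness (no network, no failures), but it demonstrates the safety property: once a value is chosen, later proposals converge on it.

import java.util.List;

public class PaxosDemo {
    public static void main(String[] args) {
        // Five in-process acceptors; a majority quorum is 3
        List<PaxosNode> acceptors = List.of(
                new PaxosNode(1), new PaxosNode(2), new PaxosNode(3),
                new PaxosNode(4), new PaxosNode(5));
        PaxosProposer proposer = new PaxosProposer(acceptors);

        System.out.println(proposer.propose("value-A")); // value-A
        // A later proposal must adopt the already-chosen value
        System.out.println(proposer.propose("value-B")); // value-A, not value-B
    }
}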
Raft Algorithm
Raft, published by Diego Ongaro and John Ousterhout in 2014, was designed to be more understandable than Paxos while providing equivalent safety guarantees.
Raft Key Concepts
- Leader Election: One node becomes leader, others are followers
- Log Replication: Leader replicates log entries to followers
- Safety: Only committed entries are applied to state machine
Raft Node States
public enum RaftState {
    FOLLOWER,
    CANDIDATE,
    LEADER
}

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RaftNode {
    private int nodeId;
    private RaftState state;
    private int currentTerm;
    private int votedFor;   // -1 means no vote cast this term
    private int leaderId;

    // Log entries
    private List<LogEntry> log;
    private int commitIndex;  // highest log index known to be committed
    private int lastApplied;  // highest log index applied to the state machine

    // Leader-only state, reinitialized after each election
    private Map<Integer, Integer> nextIndex;   // next log index to send to each follower
    private Map<Integer, Integer> matchIndex;  // highest index known replicated on each follower

    public RaftNode(int nodeId) {
        this.nodeId = nodeId;
        this.state = RaftState.FOLLOWER;
        this.currentTerm = 0;
        this.votedFor = -1;
        this.log = new ArrayList<>();
        this.commitIndex = 0;
        this.lastApplied = 0;
        this.nextIndex = new HashMap<>();
        this.matchIndex = new HashMap<>();
    }
}
Raft Leader Election
import java.util.Random;

public class RaftLeaderElection {
    private RaftNode node;
    private Random random;        // used to randomize election timeouts
    private int electionTimeout;

    public void startElection() {
        node.setState(RaftState.CANDIDATE);
        node.setCurrentTerm(node.getCurrentTerm() + 1);
        node.setVotedFor(node.getNodeId());

        // Guard against an empty log when computing the last log index/term
        int lastLogIndex = node.getLog().size() - 1;
        int lastLogTerm = lastLogIndex >= 0
                ? node.getLog().get(lastLogIndex).getTerm() : 0;

        // Request votes from all other nodes
        int votes = 1; // Vote for self
        for (RaftNode otherNode : getAllNodes()) {
            if (otherNode.getNodeId() != node.getNodeId()) {
                VoteRequest request = new VoteRequest(
                        node.getCurrentTerm(),
                        node.getNodeId(),
                        lastLogIndex,
                        lastLogTerm
                );
                VoteResponse response = otherNode.requestVote(request);
                if (response.isVoteGranted()) {
                    votes++;
                }
            }
        }

        // Check if we have a quorum
        if (votes >= (getAllNodes().size() / 2) + 1) {
            becomeLeader();
        } else {
            // Election failed; wait a randomized timeout and try again
            scheduleElectionTimeout();
        }
    }

    private void becomeLeader() {
        node.setState(RaftState.LEADER);
        node.setLeaderId(node.getNodeId());
        // Initialize leader state: optimistically assume followers are up to date
        for (RaftNode otherNode : getAllNodes()) {
            if (otherNode.getNodeId() != node.getNodeId()) {
                node.getNextIndex().put(otherNode.getNodeId(), node.getLog().size());
                node.getMatchIndex().put(otherNode.getNodeId(), 0);
            }
        }
        // Start sending heartbeats so followers don't start new elections
        startHeartbeat();
    }
}
Raft Log Replication
public class RaftLogReplication {
    private RaftNode leader;

    public void replicateLog(Object command) {
        // Append to the leader's log first
        LogEntry entry = new LogEntry(leader.getCurrentTerm(), command);
        leader.getLog().add(entry);

        // Send AppendEntries RPCs to all followers
        for (RaftNode follower : getFollowers()) {
            sendAppendEntries(follower, entry);
        }
    }

    private void sendAppendEntries(RaftNode follower, LogEntry entry) {
        // Index/term of the entry preceding the new one; -1/0 when the
        // new entry is the very first entry in the log
        int prevLogIndex = leader.getLog().size() - 2;
        int prevLogTerm = prevLogIndex >= 0
                ? leader.getLog().get(prevLogIndex).getTerm() : 0;

        AppendEntriesRequest request = new AppendEntriesRequest(
                leader.getCurrentTerm(),
                leader.getNodeId(),
                prevLogIndex,
                prevLogTerm,
                entry,
                leader.getCommitIndex()
        );
        AppendEntriesResponse response = follower.appendEntries(request);
        if (response.isSuccess()) {
            // Follower's log now matches the leader's up to this entry
            leader.getMatchIndex().put(follower.getNodeId(),
                    leader.getLog().size() - 1);
            leader.getNextIndex().put(follower.getNodeId(),
                    leader.getLog().size());
            tryCommit();
        } else {
            // Logs diverge: decrement nextIndex and retry with earlier entries
            int nextIndex = leader.getNextIndex().get(follower.getNodeId());
            leader.getNextIndex().put(follower.getNodeId(), nextIndex - 1);
        }
    }

    private void tryCommit() {
        // Find the highest index replicated on a majority. Raft's safety rule:
        // the leader may only commit entries from its own current term by counting.
        for (int i = leader.getLog().size() - 1; i > leader.getCommitIndex(); i--) {
            if (leader.getLog().get(i).getTerm() != leader.getCurrentTerm()) {
                continue;
            }
            int count = 1; // Leader has this entry
            for (RaftNode follower : getFollowers()) {
                if (leader.getMatchIndex().get(follower.getNodeId()) >= i) {
                    count++;
                }
            }
            // Commit if a majority has this entry
            if (count >= (getAllNodes().size() / 2) + 1) {
                leader.setCommitIndex(i);
                break;
            }
        }
    }
}
Applications of Consensus Algorithms
1. Distributed Databases
Apache Cassandra
- Uses a variant of Paxos for lightweight transactions
- Ensures consistency across multiple data centers
- Handles network partitions gracefully
// Simplified sketch of a Cassandra-style lightweight transaction flow
// (real Cassandra adds a read phase for the IF condition and commit-log
// details; the prepare/propose/commit/hasQuorum helpers are elided here)
public class CassandraPaxos {
    public boolean executeLightweightTransaction(String key, Object value) {
        // Phase 1: Prepare
        List<PrepareResponse> responses = prepare(key);
        if (!hasQuorum(responses)) {
            return false;
        }

        // Phase 2: Propose
        List<ProposeResponse> proposeResponses = propose(key, value);
        if (!hasQuorum(proposeResponses)) {
            return false;
        }

        // Phase 3: Commit
        return commit(key, value);
    }
}
MongoDB
- Uses a Raft-based protocol for replica set elections
- Ensures data consistency across replicas
- Automatic failover when the primary fails (a sketch follows below)
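As a rough illustration of how failover gets triggered, here is a minimal heartbeat-watchdog sketch, reusing the hypothetical RaftLeaderElection class from earlier. Real MongoDB tracks per-member heartbeats with its own protocol; the class and timing values here are illustrative assumptions.

// Minimal sketch: a follower that starts an election when heartbeats stop.
// Assumes the RaftLeaderElection class above; timings are illustrative.
public class FailoverDetector {
    private final RaftLeaderElection election;
    private volatile long lastHeartbeatMillis = System.currentTimeMillis();
    private final long electionTimeoutMillis = 1000;

    public FailoverDetector(RaftLeaderElection election) {
        this.election = election;
    }

    // Called whenever a heartbeat arrives from the current primary/leader
    public void onHeartbeat() {
        lastHeartbeatMillis = System.currentTimeMillis();
    }

    // Polled periodically (e.g., every 100 ms) by a scheduler
    public void checkLeader() {
        if (System.currentTimeMillis() - lastHeartbeatMillis > electionTimeoutMillis) {
            election.startElection(); // no leader heard from: stand for election
        }
    }
}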
2. Distributed Key-Value Stores
etcd
- Built on Raft consensus
- Used by Kubernetes for configuration storage
- Provides strong consistency guarantees
// Sketch of an etcd-like key-value store layered on Raft
import java.util.Map;

public class EtcdStore {
    private RaftNode raftNode;
    private Map<String, String> keyValueStore;

    public void put(String key, String value) {
        // Writes must go through the leader and the Raft log
        if (raftNode.isLeader()) {
            LogEntry entry = new LogEntry(
                    raftNode.getCurrentTerm(),
                    new PutCommand(key, value)
            );
            replicateLog(entry);
        }
        // A non-leader would forward the request to the current leader
    }

    public String get(String key) {
        // Reading local state is fast but may be stale;
        // linearizable reads must first confirm leadership (see below)
        return keyValueStore.get(key);
    }
}
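etcd serves linearizable reads by having the leader confirm it is still the leader before answering (its ReadIndex mechanism). A minimal sketch of that idea as a method that could be added to the EtcdStore above; confirmLeadershipWithQuorum and waitUntilApplied are hypothetical helpers standing in for a heartbeat round and apply-loop synchronization:

// Sketch of a ReadIndex-style linearizable read: record the commit index,
// confirm leadership with a quorum of heartbeats, wait until the state
// machine has applied up to that index, then read locally.
public String linearizableGet(String key) {
    if (!raftNode.isLeader()) {
        throw new IllegalStateException("redirect to leader");
    }
    int readIndex = raftNode.getCommitIndex();
    if (!confirmLeadershipWithQuorum()) {   // hypothetical heartbeat round
        throw new IllegalStateException("leadership lost");
    }
    waitUntilApplied(readIndex);            // hypothetical: lastApplied >= readIndex
    return keyValueStore.get(key);
}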
3. Service Discovery and Configuration
Apache ZooKeeper
- Uses Zab (ZooKeeper Atomic Broadcast) protocol
- Provides coordination services for distributed applications
- Used by Kafka, Hadoop, and many other systems
// Sketch of a ZooKeeper-like coordination service. ZooKeeper itself uses
// Zab rather than Raft; this sketch reuses the RaftNode type for simplicity.
import java.util.TreeMap;

public class ZooKeeperService {
    private RaftNode consensusNode;
    private TreeMap<String, NodeData> dataTree;

    public void createNode(String path, String data) {
        CreateNodeCommand command = new CreateNodeCommand(path, data);
        LogEntry entry = new LogEntry(
                consensusNode.getCurrentTerm(),
                command
        );
        replicateLog(entry);
    }

    public String getNodeData(String path) {
        NodeData node = dataTree.get(path);
        return node != null ? node.getData() : null; // avoid NPE on missing path
    }
}
4. Message Queues
Apache Kafka
- Historically uses ZooKeeper for controller election (newer releases replace ZooKeeper with KRaft, Kafka's built-in Raft implementation); a sketch of the ZooKeeper pattern follows below
- Ensures message ordering and durability within a partition
- Handles partition leadership changes
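The classic ZooKeeper-based pattern is leader election via an ephemeral znode: whoever creates it first becomes the controller, and the znode's disappearance (session loss) signals everyone else to retry. A minimal sketch using the ZooKeeper client API; the /controller path matches Kafka's convention, but the surrounding class and retry logic are illustrative assumptions:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

// Minimal controller-election sketch: the first broker to create the
// ephemeral znode wins; the znode vanishes when that broker's session
// dies, which triggers re-election.
public class ControllerElection {
    private final ZooKeeper zk;
    private final String brokerId;

    public ControllerElection(ZooKeeper zk, String brokerId) {
        this.zk = zk;
        this.brokerId = brokerId;
    }

    public boolean tryBecomeController() throws KeeperException, InterruptedException {
        try {
            zk.create("/controller", brokerId.getBytes(),
                    Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true; // we won the election
        } catch (KeeperException.NodeExistsException e) {
            // Another broker is controller; watch the znode to retry on deletion
            zk.exists("/controller", event -> tryAgainLater());
            return false;
        }
    }

    private void tryAgainLater() { /* schedule another tryBecomeController() */ }
}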
5. Blockchain Systems
Hyperledger Fabric
- Early versions used PBFT (Practical Byzantine Fault Tolerance); newer releases use Raft-based ordering
- Ensures consensus among network participants
- Byzantine-tolerant variants can handle malicious nodes
Why These Applications Use Consensus
1. Data Consistency
- Ensures all nodes see the same data
- Prevents split-brain scenarios
- Provides the ordering and durability guarantees that transactional systems build on
2. Fault Tolerance
- System continues operating despite node failures
- Automatic recovery from failures
- No single point of failure
3. High Availability
- System remains available during failures
- Automatic failover mechanisms
- Load distribution across nodes
4. Linearizability
- Operations appear to execute atomically
- Strong consistency guarantees
- Predictable behavior
Quorum in Practice
Read Quorums vs Write Quorums
public class QuorumSystem {
    private int totalNodes;
    private int readQuorum;
    private int writeQuorum;

    public QuorumSystem(int totalNodes) {
        this.totalNodes = totalNodes;
        // Overlap rule: readQuorum + writeQuorum > totalNodes guarantees that
        // every read quorum intersects every write quorum
        this.readQuorum = (totalNodes / 2) + 1;
        this.writeQuorum = (totalNodes / 2) + 1;
    }

    public boolean read(Object key) {
        int responses = 0;
        Object value = null;
        for (Node node : getAllNodes()) {
            Object nodeValue = node.read(key);
            if (nodeValue != null) {
                responses++;
                if (value == null) {
                    value = nodeValue;
                }
                // A real system reconciles conflicting replies by version
                // or timestamp and returns the newest value
            }
        }
        return responses >= readQuorum;
    }

    public boolean write(Object key, Object value) {
        int responses = 0;
        for (Node node : getAllNodes()) {
            if (node.write(key, value)) {
                responses++;
            }
        }
        return responses >= writeQuorum;
    }
}
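The overlap rule R + W > N is the knob for trading read latency against write latency. A tiny sketch for checking candidate configurations (the values below are illustrative, not tied to any particular system):

// Verify that a candidate (R, W) configuration overlaps for N nodes.
// R = 1, W = N favors reads; R = N, W = 1 favors writes; majorities balance both.
public static boolean quorumsOverlap(int n, int readQuorum, int writeQuorum) {
    return readQuorum + writeQuorum > n;
}

// quorumsOverlap(5, 1, 5) -> true  (read-one / write-all)
// quorumsOverlap(5, 3, 3) -> true  (balanced majorities)
// quorumsOverlap(5, 2, 3) -> false (a read may miss the latest write)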
Quorum Types
- Majority Quorum: Most common, requires >50% of nodes
- Weighted Quorum: Nodes have different weights
- Grid Quorum: Nodes arranged in a grid; a quorum is one full row plus one full column (sketched after this list)
- Hierarchical Quorum: Nodes organized in a hierarchy
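As a concrete example of a non-majority scheme, here is a minimal grid-quorum sketch: with n nodes in a √n x √n grid, a quorum of one row plus one column has 2√n - 1 members, yet any two such quorums still intersect, because one quorum's row always crosses the other's column.

import java.util.HashSet;
import java.util.Set;

// Grid quorum sketch: nodes 0..n-1 arranged in a side x side grid.
// A quorum = every node in one row plus every node in one column.
public class GridQuorum {
    private final int side; // grid is side x side, so n = side * side

    public GridQuorum(int side) {
        this.side = side;
    }

    public Set<Integer> quorum(int row, int col) {
        Set<Integer> q = new HashSet<>();
        for (int c = 0; c < side; c++) q.add(row * side + c); // the row
        for (int r = 0; r < side; r++) q.add(r * side + col); // the column
        return q; // size is 2 * side - 1
    }

    public static void main(String[] args) {
        GridQuorum grid = new GridQuorum(4); // 16 nodes; quorum size 7 vs 9 for majority
        Set<Integer> q1 = grid.quorum(0, 0);
        Set<Integer> q2 = grid.quorum(3, 3);
        q1.retainAll(q2); // q1's row meets q2's column (and vice versa)
        System.out.println("Overlap: " + q1); // non-empty: nodes 3 and 12
    }
}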
Performance Considerations
Latency vs Consistency
public class ConsistencyLevel {
    public enum Level {
        ONE,    // Response from a single node
        QUORUM, // Responses from a majority of nodes
        ALL     // Responses from every node
    }

    public Object readWithConsistency(Object key, Level level) {
        switch (level) {
            case ONE:
                return readFromAnyNode(key);   // lowest latency, weakest consistency
            case QUORUM:
                return readWithQuorum(key);    // balanced latency and consistency
            case ALL:
                return readFromAllNodes(key);  // highest latency, strongest consistency
            default:
                throw new IllegalArgumentException();
        }
    }
}
Network Partition Handling
public class NetworkPartitionHandler {
    public void handlePartition(List<Node> availableNodes) {
        // The side of the partition that still holds a quorum keeps operating;
        // the minority side must stop accepting writes to avoid split-brain
        if (availableNodes.size() >= getQuorumSize()) {
            continueOperation();
        } else {
            enterReadOnlyMode();
        }
    }

    private void enterReadOnlyMode() {
        // Reject write operations
        // Serve (possibly stale) reads from available nodes
        // Resume normal operation once the partition heals
    }
}
Best Practices
1. Choose the Right Algorithm
- Paxos: When you need a battle-tested protocol with decades of production use
- Raft: When understandability and ease of implementation matter most
- PBFT: When you need Byzantine fault tolerance
2. Configure Appropriate Timeouts
public class TimeoutConfig {
    private int electionTimeout = 1000;  // ms; base value, randomized per node (see below)
    private int heartbeatInterval = 100; // ms; must be well below the election timeout
    private int requestTimeout = 5000;   // ms; per-RPC timeout

    public void setTimeouts(int electionTimeout, int heartbeatInterval, int requestTimeout) {
        this.electionTimeout = electionTimeout;
        this.heartbeatInterval = heartbeatInterval;
        this.requestTimeout = requestTimeout;
    }
}
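Raft specifically calls for randomizing each node's election timeout within a range, so that candidates rarely time out simultaneously and split the vote. A small sketch of that idea; the 150-300 ms range is the example used in the Raft paper, and real deployments tune it to their network:

import java.util.concurrent.ThreadLocalRandom;

// Each node picks a fresh random timeout for every election cycle.
// One node usually times out first, wins the election, and starts
// heartbeating before the others become candidates.
public final class ElectionTimeout {
    private static final int MIN_MS = 150;
    private static final int MAX_MS = 300;

    public static int next() {
        return ThreadLocalRandom.current().nextInt(MIN_MS, MAX_MS + 1);
    }
}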
3. Monitor Consensus Health
public class ConsensusMonitor {
    public void monitorHealth() {
        // Check leader election frequency (frequent elections signal instability)
        // Monitor log replication lag between leader and followers
        // Track commit rates and end-to-end commit latency
        // Alert when the cluster drops below quorum
    }
}
Conclusion
Consensus algorithms like Paxos and Raft are fundamental to building reliable distributed systems. They provide the foundation for:
- Data consistency across multiple nodes
- Fault tolerance in the face of failures
- High availability through automatic recovery
- Strong guarantees for critical applications
Understanding quorums and how they ensure consistency is crucial for designing robust distributed systems. Whether you’re building a database, message queue, or blockchain, consensus algorithms provide the reliability guarantees your system needs.
The choice between Paxos and Raft often depends on your specific requirements:
- Paxos: More complex but battle-tested
- Raft: Easier to understand and implement
Both algorithms provide the same safety guarantees, but Raft’s simplicity makes it more accessible for new implementations.
Remember: Consensus is not just about agreeing on values; it's about building systems that can withstand the chaos of distributed computing while maintaining consistency and availability.