Requirements and Scope

When designing a distributed ID generator, it’s crucial to define key requirements:

  • Uniqueness: Each generated ID must be globally unique.
  • Numerical: IDs should consist only of numerical values.
  • Size: IDs should fit within a 64-bit integer.
  • Order: IDs should be roughly ordered by creation time (e.g., IDs generated later in the day are larger than those from earlier).
  • Scale: The system must support high throughput, e.g., generating over 10,000 IDs per second.

High-Level Design Approaches

Several strategies exist for distributed unique ID generation, each with its own advantages and drawbacks:

Multi-Master Replication

This approach leverages database auto_increment by configuring multiple master databases. To avoid conflicts, each master is assigned an increment greater than 1, based on the number of masters. For example, with k masters, each server increments its ID by k.

Pros:

  • Leverages existing database features.

Cons:

  • Difficult to scale across multiple data centers.
  • IDs are not globally time-ordered across servers.
  • Poor scalability when servers are added or removed dynamically.

Universally Unique Identifier (UUID)

UUIDs are 128-bit numbers designed for global uniqueness. They can be generated independently by any system without central coordination.

Pros:

  • Simple generation without server coordination.
  • Highly scalable as each server generates its own IDs.
  • Extremely low probability of collisions.

Cons:

  • 128-bit length may exceed the 64-bit requirement.
  • IDs are not inherently time-ordered.
  • Can contain non-numeric characters.

Ticket Server

Inspired by solutions like Flickr’s ticket servers, this method centralizes ID generation to a single database server (the “Ticket Server”) that uses auto_increment to dispense blocks of IDs. Clients request a block of IDs from the ticket server and then generate unique IDs within that block locally.

Pros:

  • Generates numeric, sequential IDs.
  • Simple to implement for small to medium scale.

Cons:

  • Single point of failure. Redundancy introduces complexities like data synchronization.

Snowflake-like Approach

Inspired by Twitter’s Snowflake system, this approach constructs a 64-bit ID by combining different components. This method is highly flexible and can meet all our specified requirements.

64-bit ID Structure:

Sign bit (1 bit) | Timestamp (41 bits) | Datacenter ID (5 bits) | Machine ID (5 bits) | Sequence Number (12 bits)

Total: 1 + 41 + 5 + 5 + 12 = 64 bits

Component Breakdown:

  • Sign bit (1 bit): Always 0, reserved for future use or distinguishing signed/unsigned numbers.
  • Timestamp (41 bits): Represents milliseconds since a custom epoch (e.g., Twitter’s epoch: Nov 04, 2010, 01:42:54 UTC). This ensures IDs are time-ordered.
    • 41 bits provide enough range for approximately 69 years of unique timestamps.
  • Datacenter ID (5 bits): Allows for 2^5 = 32 distinct data centers.
  • Machine ID (5 bits): Allows for 2^5 = 32 machines per data center.
  • Sequence Number (12 bits): A counter that increments for each ID generated within the same millisecond on a specific machine. It resets to 0 every millisecond.
    • 12 bits support 2^12 = 4096 unique IDs per millisecond per machine.

Detailed Design: Snowflake-like ID Generator

Datacenter IDs and machine IDs are typically configured at system startup and remain static. Timestamp and sequence numbers are dynamic during operation.

Timestamp Generation: The 41-bit timestamp ensures chronological ordering. This field measures milliseconds since a chosen epoch. Using a recent epoch maximizes the lifespan of the ID generator before the timestamp overflows (approx. 69 years).

Sequence Number Management: The 12-bit sequence number allows for rapid ID generation within a single millisecond on a given machine. If multiple IDs are requested within the same millisecond, this counter increments. If the counter reaches its maximum (4095) within a millisecond, the system must wait for the next millisecond to reset the counter.

Considerations and Extensions

  • Clock Synchronization: This design assumes synchronized clocks across ID generation servers. In practice, Network Time Protocol (NTP) or similar solutions are crucial for maintaining consistent timestamps across distributed machines.
  • Section Length Tuning: The bit allocation for each section can be adjusted based on specific needs. For example, if concurrency is low but a longer lifespan is desired, more bits can be allocated to the timestamp and fewer to the sequence number.
  • High Availability: As a critical service, the ID generator itself should be highly available, often achieved through redundant deployments and failure detection mechanisms.

Summary

The Snowflake-like approach provides a robust and scalable solution for generating unique, time-ordered, numerical 64-bit IDs in distributed systems. Its segmented structure offers flexibility and addresses the limitations of simpler methods like multi-master replication or UUIDs, fulfilling the requirements for high-throughput, distributed ID generation.

Reference Materials