Design Lyft
Lyft is a ride-sharing platform that connects passengers with nearby drivers for on-demand transportation. Unlike traditional taxi services, Lyft operates as a two-sided marketplace where independent drivers use their personal vehicles to provide rides. The system must handle millions of ride requests daily, match passengers with optimal drivers in real-time, track live locations, dynamically adjust pricing based on supply-demand, support shared rides, ensure safety, and process payments seamlessly.
This design focuses on Lyft’s unique characteristics: aggressive driver-passenger matching algorithms optimized for pickup ETA, sophisticated shared ride routing that maximizes vehicle utilization, zone-based surge pricing that’s more granular than competitors, and a driver-centric approach that emphasizes driver earnings optimization.
Step 1: Understand the Problem and Establish Design Scope
Before diving into the design, it’s crucial to define the functional and non-functional requirements. For user-facing applications like this, functional requirements are the “Users should be able to…” statements, whereas non-functional requirements define system qualities via “The system should…” statements.
Functional Requirements
Core Requirements (Priority 1-3):
- Passengers should be able to request rides by specifying pickup and destination locations.
- The system should match passengers with nearby available drivers based on ETA, rating, and vehicle type.
- Passengers should receive real-time location tracking for both drivers and active rides with sub-second update latency.
- Drivers should be able to accept or reject ride requests within a timeout window.
- The system should support multiple ride types: standard, XL (up to 6 passengers), Lux, and shared rides (Lyft Line).
- Passengers should receive pre-ride fare estimates with dynamic pricing based on distance, time, demand, and traffic.
Lyft Line (Shared Rides) - Priority 2:
- The system should allow multiple passengers to share a ride along similar routes.
- The system should intelligently optimize routes to minimize detours for all passengers.
- The system should support dynamic matching, adding passengers to in-progress rides if route alignment is acceptable.
- The system should calculate individual fares based on actual route taken and time saved.
Below the Line (Out of Scope):
- Driver availability zones and preferred operating areas.
- Driver earnings dashboard showing fares, tips, bonuses, and surge multipliers.
- Two-way rating system (passenger rates driver, driver rates passenger).
- Share ride status with friends/family (live tracking link).
- Emergency assistance button with direct contact to safety team.
- Ride splitting among multiple passengers.
Non-Functional Requirements
Core Requirements:
- The system should support 10M+ daily rides across 300+ cities.
- The system should handle 500K concurrent riders and 200K active drivers during peak hours.
- The system should achieve ride matching latency under 2 seconds from request to driver assignment.
- The system should process location updates with less than 500ms end-to-end latency.
- The system should ensure 99.99% uptime for core ride request and matching services.
- The system should maintain strong consistency for ride state (only one driver can accept a given ride request).
Below the Line (Out of Scope):
- The system should support geographic redundancy with multi-region deployments for disaster recovery.
- The system should ensure end-to-end encryption for sensitive data (payment info, phone numbers).
- The system should provide graceful degradation with fallback to basic matching if ML models fail.
- The system should comply with PCI-DSS for payment processing.
Clarification Questions & Assumptions:
- Platform: Mobile apps for both passengers and drivers (iOS and Android).
- Scale: 10 million rides per day, approximately 140K concurrent rides during peak hours.
- Location Update Frequency: Drivers update their location roughly every 1 second when active.
- Geographic Coverage: 300+ cities globally, with focus on major metropolitan areas in North America.
- Payment: Third-party payment processors like Stripe or Braintree handle PCI compliance.
- Traffic Estimates: 200K location updates per second from active drivers.
Back-of-the-Envelope Estimations:
- 10M rides per day equals approximately 115 rides per second on average, with peaks around 500 rides per second.
- An average ride duration of 20 minutes implies about 140K concurrent rides (115 rides/s times 1,200 seconds per ride).
- Allowing roughly 1.5 active drivers per concurrent ride (to account for idle and en-route drivers) gives approximately 200K active drivers during peak times.
- Location updates: 200K drivers multiplied by 1 update per second equals 200K writes per second.
- Each location update is approximately 100 bytes, resulting in 20 MB per second ingestion rate.
Storage Estimates:
- Each ride record is approximately 5 KB (pickup/dropoff locations, timestamps, fare details).
- Daily ride data: 10M rides times 5 KB equals 50 GB per day, or 18 TB per year.
- Location tracking retained for 90 days: approximately 155 TB for historical location data.
- Driver and passenger profiles: 60 million users times 10 KB equals 600 GB.
- Total active storage requirement: approximately 200 TB for hot and warm data combined.
Step 2: Propose High-Level Design and Get Buy-in
Planning the Approach
Before moving on to designing the system, it’s important to plan your strategy. For user-facing product-style questions, the plan should be straightforward: build your design up sequentially, going one by one through your functional requirements. This will help you stay focused and ensure you don’t get lost in the weeds.
Defining the Core Entities
To satisfy our key functional requirements, we’ll need the following entities:
User: Any user on the platform, either as a passenger or driver. Contains personal information such as phone number, email, name, profile photo, user type (passenger or driver), and rating. The user_id serves as the primary identifier across all platform interactions.
Driver: Users registered as drivers on the platform. Contains license information, vehicle details, current status (available, en-route to pickup, on-ride, offline), acceptance rate, cancellation rate, total rides completed, and current location with last update timestamp.
Ride: An individual ride from request to completion. Records the passenger and driver identities, ride type (standard, XL, lux, shared), status (requested, matched, accepted, driver arriving, ride started, completed, paid), pickup and dropoff coordinates, various timestamps, estimated and final fares, and surge multiplier.
Shared Ride: A ride with multiple passengers sharing the same vehicle. Contains the driver assignment, current status, ordered list of pickups and dropoffs stored as waypoints, and timestamps for creation and completion. Links to individual passenger rides through a junction table.
Fare: An estimated or final fare for a ride. Includes base price, distance-based charges, time-based charges, surge multiplier, discounts for shared rides, and traffic adjustments. Helps passengers make informed decisions before confirming rides.
Location: Real-time and historical location data for drivers. Includes latitude and longitude coordinates, speed, bearing, and timestamp. Critical for matching riders with nearby drivers and tracking ride progress. Historical data supports dispute resolution and ML training.
API Design
Fare Estimate Endpoint: Used by passengers to get an estimated fare for their ride before confirming the request.
POST /fare -> Fare
Body: {
pickupLocation: { lat, long },
destination: { lat, long },
rideType: "standard" | "xl" | "lux" | "shared"
}
Request Ride Endpoint: Used by passengers to confirm their ride request after reviewing the estimated fare. Initiates the ride matching process.
POST /rides -> Ride
Body: {
fareId: string
}
Update Driver Location Endpoint: Used by drivers to update their location in real-time. Called periodically by the driver client (every 1-3 seconds when active).
POST /drivers/location -> Success/Error
Body: {
lat: number,
long: number,
bearing: number,
speed: number
}
The driverId is extracted from the authentication session token, not from the request body, for security reasons.
Accept/Decline Ride Endpoint: Allows drivers to accept or decline a ride request. Upon acceptance, the system updates the ride status and provides navigation details.
PATCH /rides/:rideId -> Ride
Body: {
action: "accept" | "decline"
}
Update Ride Status Endpoint: Used to track ride progression through various states.
PATCH /rides/:rideId/status -> Ride
Body: {
status: "driver_arriving" | "ride_started" | "ride_completed"
}
High-Level Architecture
Let’s build up the system sequentially, addressing each functional requirement:
1. Passengers should be able to request rides and receive fare estimates
The core components necessary to fulfill fare estimation and ride requests are:
- Passenger Mobile App: The primary touchpoint for riders on iOS and Android. Interfaces with backend services to request rides, view estimates, track drivers, and complete payments.
- Driver Mobile App: Interface for drivers to receive ride requests, provide location updates, navigate to pickups, and manage availability status.
- API Gateway: Entry point for all client requests, handling cross-cutting concerns like authentication, rate limiting, request routing, and load balancing across service instances.
- Ride Service: Orchestrates the entire ride lifecycle from request to completion. Manages ride state transitions through a well-defined state machine, stores ride metadata, and coordinates with other services.
- Pricing Service: Calculates fare estimates and final charges. Applies base fare formulas, surge multipliers, distance and time rates, traffic adjustments, and shared ride discounts.
- Maps Integration: Third-party mapping service (Google Maps or Mapbox) providing geocoding, routing, ETA calculations, and real-time traffic data.
- PostgreSQL Database: Stores transactional data including users, drivers, rides, and fares with strong consistency guarantees. Sharded by geographic region for performance.
Fare Estimation Flow:
- The passenger enters pickup location and destination in the mobile app, which sends a POST request to the fare endpoint.
- The API Gateway authenticates the request and forwards it to the Pricing Service.
- The Pricing Service queries the Maps Integration API to calculate distance and travel time between locations.
- The service applies the pricing formula: base fare plus distance-based charges plus time-based charges.
- The service checks current surge multipliers for the pickup zone and applies them to the base fare.
- The service considers real-time traffic conditions to adjust duration estimates.
- A Fare entity is created in the database with all calculation details.
- The fare estimate is returned to the passenger app for review.
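The pricing formula in the flow above can be sketched as follows. All rates, the minimum fare, and the surge default are illustrative placeholders, not Lyft's actual values:

```python
def estimate_fare(distance_miles: float, duration_min: float,
                  surge: float = 1.0,
                  base: float = 2.50, per_mile: float = 1.50,
                  per_min: float = 0.25, minimum: float = 5.00) -> float:
    """Base fare + distance charge + time charge, scaled by the zone's
    surge multiplier, with a minimum-fare floor."""
    fare = (base + per_mile * distance_miles + per_min * duration_min) * surge
    return round(max(fare, minimum), 2)

# A 5-mile, 20-minute trip at 1.2x surge:
# (2.50 + 1.50 * 5 + 0.25 * 20) * 1.2 = 18.00
print(estimate_fare(5.0, 20.0, surge=1.2))
```

In production, `distance_miles` and `duration_min` would come from the Maps Integration response and `surge` from the Redis lookup for the pickup zone.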
2. The system should match passengers with nearby available drivers
We introduce new components to facilitate real-time driver matching:
- Location Service: Manages the massive write throughput of location updates (200K+ per second). Stores driver locations in Redis using geospatial indexing, streams updates to real-time tracking systems, and persists location history for analytics.
- Ride Matching Service: Core algorithm that pairs passengers with optimal drivers. Receives ride requests, queries nearby available drivers, applies scoring algorithms considering pickup ETA, driver ratings, and vehicle type compatibility, and manages the driver notification process.
- Redis Cache: In-memory data store providing geospatial indexing via GEOADD and GEORADIUS commands. Enables sub-10ms proximity searches for nearby drivers. Also caches driver session data and surge pricing zones.
- Notification Service: Dispatches real-time notifications to drivers when ride requests are matched. Uses Firebase Cloud Messaging for Android and Apple Push Notification Service for iOS.
Driver Matching Flow:
- Drivers continuously send location updates to the Location Service every 1-3 seconds while active.
- The Location Service writes to Redis using GEOADD commands, organizing drivers by geographic region for performance.
- When a passenger requests a ride, the request goes to the Ride Matching Service.
- The matching service creates a ride record in the database with status “requested”.
- The service queries Redis using GEORADIUS to find available drivers within a 2-3 mile radius of the pickup location.
- If fewer than 5 drivers are found, the search radius expands to 5 miles, then 10 miles if still insufficient.
- The service filters drivers by status (must be available), vehicle type (matches passenger request), acceptance rate (minimum 70%), and rating (prefer above 4.5 stars).
- Remaining drivers are scored based on pickup ETA (most important factor), driver rating, acceptance rate, and experience level.
- The top 3-5 drivers are selected and sent ride notifications simultaneously.
- The first driver to accept gets assigned the ride through an optimistic locking mechanism.
- If no driver accepts within 15 seconds, the search radius expands and the process repeats.
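The expanding-radius search in the flow above can be sketched as a small loop. `query_nearby` stands in for the actual GEORADIUS call; the radii and the minimum-driver threshold come from the flow description:

```python
from typing import Callable, List, Tuple

def find_candidates(query_nearby: Callable[[float], List[str]],
                    radii_miles: Tuple[float, ...] = (3.0, 5.0, 10.0),
                    min_drivers: int = 5) -> Tuple[List[str], float]:
    """Query progressively larger radii until enough drivers are found;
    return whatever the widest search produced otherwise."""
    drivers: List[str] = []
    for radius in radii_miles:
        drivers = query_nearby(radius)  # GEORADIUS in production
        if len(drivers) >= min_drivers:
            return drivers, radius
    return drivers, radii_miles[-1]
```

The result would then feed the filtering and scoring phases described in Deep Dive 2.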
3. Passengers should receive real-time location tracking
We add components for real-time bidirectional communication:
- WebSocket Server: Dedicated server pool handling persistent connections for real-time location streaming. Each server handles approximately 10K concurrent connections with session affinity for efficiency.
- Kafka Message Queue: Event streaming platform handling location updates, ride status changes, and other real-time events. Partitioned by geographic region for scalability.
- Cassandra Database: Time-series database storing historical location data with high write throughput. Optimized for time-range queries with partition by driver_id and clustering by timestamp.
Real-Time Tracking Flow:
- When a ride is matched, the passenger app establishes a WebSocket connection to receive live updates.
- Driver location updates flow through Location Service to Kafka topics partitioned by region.
- WebSocket servers subscribe to relevant Kafka partitions based on active rides they manage.
- Location updates are streamed to passengers showing driver position, bearing, speed, and updated ETA.
- Ride status changes (driver arriving, ride started) are also pushed through WebSocket connections.
- All location updates are persisted to Cassandra for audit trails and dispute resolution.
- Connection drops are handled gracefully with automatic reconnection and catch-up buffers.
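The catch-up buffer mentioned in the last step can be a bounded queue per ride. A minimal sketch, where the buffer size of 10 follows the connection-resilience behavior described later and everything else is illustrative:

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

LocationUpdate = Tuple[float, float, float]  # (lat, long, timestamp)

class CatchUpBuffer:
    """Keeps the last N location updates per ride so a reconnecting
    client can fill the gap it missed while offline."""

    def __init__(self, size: int = 10):
        self._buffers: Dict[str, Deque[LocationUpdate]] = {}
        self._size = size

    def record(self, ride_id: str, update: LocationUpdate) -> None:
        # deque(maxlen=N) silently drops the oldest entry when full
        self._buffers.setdefault(ride_id, deque(maxlen=self._size)).append(update)

    def replay(self, ride_id: str) -> List[LocationUpdate]:
        """Sent to a client immediately after it reconnects."""
        return list(self._buffers.get(ride_id, ()))
```

Each WebSocket server would keep one such buffer for the rides it manages, populated from the Kafka stream.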
4. The system should support Lyft Line shared rides
We extend the architecture to handle multi-passenger ride optimization:
- Routing Service: Integrates with Maps APIs for turn-by-turn navigation. For Lyft Line, implements multi-stop route optimization using constraint-based algorithms to find optimal pickup and dropoff ordering.
- Shared Ride Coordinator: Manages the lifecycle of shared rides, tracks current waypoints, evaluates new passenger additions, validates detour constraints, and coordinates route recalculations.
Shared Ride Flow:
- Passenger requests a Lyft Line ride with pickup and destination.
- The system checks for active shared rides heading in similar directions within acceptable pickup radius.
- If no match exists, a new shared ride is created and a driver is dispatched normally.
- While the shared ride is active, the system monitors for new Lyft Line requests in the area.
- For each potential match, the Routing Service calculates modified route with new waypoints inserted.
- The system validates constraints: maximum 2 additional pickups, each existing passenger’s detour under 10 minutes, new passenger’s detour under 10 minutes versus direct route.
- If constraints are satisfied, the driver receives an offer notification showing additional earnings and estimated detour time.
- Upon driver acceptance, the route is updated and all passengers are notified of the new ETA.
- Fares are calculated individually based on actual route taken, with discounts reflecting time spent sharing the vehicle.
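The detour-constraint check in the flow above reduces to comparing each passenger's shared-route time against their direct-route time. A sketch, with the 10-minute limit taken from the requirements:

```python
from typing import Dict

def detours_acceptable(direct_min: Dict[str, float],
                       shared_min: Dict[str, float],
                       max_detour_min: float = 10.0) -> bool:
    """Every passenger's time on the candidate shared route may exceed
    their direct-route time by at most max_detour_min minutes."""
    return all(shared_min[p] - direct_min[p] <= max_detour_min
               for p in direct_min)
```

A candidate route that fails this check is rejected before the driver is ever offered the new passenger.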
Step 3: Design Deep Dive
With the core functional requirements met, it’s time to dig into the non-functional requirements via deep dives. These are the critical areas that separate good designs from great ones.
Deep Dive 1: How do we handle 200K location updates per second efficiently?
Managing the massive volume of location updates from active drivers presents both write throughput and cost challenges. Traditional databases would either fail under this load or become prohibitively expensive.
Problem 1: Write Throughput
With 200K drivers sending location updates every second, we need a data store optimized for high-frequency writes. PostgreSQL or DynamoDB would require massive over-provisioning. For DynamoDB specifically, 200K writes per second at 100 bytes each would cost approximately $15,000 per day just for write capacity.
Problem 2: Geospatial Query Performance
Finding nearby drivers requires efficient proximity searches. A naive approach using latitude and longitude columns would require full table scans and distance calculations for millions of records. Traditional B-tree indexes don’t work well for multi-dimensional geographic data.
Solution: Redis Geospatial Indexing
Redis provides built-in geospatial commands that internally use sorted sets with geohash encoding:
- GEOADD Command: Adds driver locations to a sorted set with O(log N) complexity. The command takes longitude, latitude, and a member identifier, in that order (longitude first is a common gotcha).
- GEORADIUS Command: Finds all drivers within a specified radius of a point with O(N + log M) complexity, where N is the number of elements inside the bounding box of the search area and M is the number of items in the index. Results are returned sorted by distance. (Redis 6.2+ deprecates GEORADIUS in favor of the equivalent GEOSEARCH.)
- Geohash Encoding: Internally, Redis encodes lat/long coordinates into geohash strings where nearby locations share common prefixes, enabling efficient range queries.
Implementation Strategy:
When drivers update locations, the Location Service executes GEOADD to add coordinates to region-specific sorted sets. For example, drivers in San Francisco are added to a “driver:locations:sf” key. This regional partitioning prevents single keys from growing too large and enables horizontal scaling.
To query nearby drivers during matching, the Ride Matching Service executes GEORADIUS with the pickup location and desired radius. Redis returns driver IDs sorted by distance in under 10 milliseconds even with hundreds of thousands of drivers.
Handling Stale Data:
Since Redis sorted set members don’t support individual TTLs, we implement time-bucketed keys. Each bucket represents a time window (e.g., “driver:locations:sf:2024-03-28-18-01”) and the entire key expires after 5 minutes. A background process periodically migrates active drivers to new buckets. This ensures stale location data is automatically cleaned up.
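The bucketing scheme above can be sketched as a key-naming function; the window and key format follow the example in the text, and the expiry value is illustrative:

```python
import time

def bucket_key(region: str, ts: float, window_s: int = 60) -> str:
    """Name of the time bucket containing ts. In production the whole key
    would get an EXPIRE (e.g. 300 s) so stale positions age out together:
        GEOADD {key} {long} {lat} {driver_id}
        EXPIRE {key} 300
    """
    bucket_start = int(ts) - int(ts) % window_s
    stamp = time.strftime("%Y-%m-%d-%H-%M", time.gmtime(bucket_start))
    return f"driver:locations:{region}:{stamp}"
```

Queries then hit the current bucket (and optionally the previous one, to cover drivers that have not yet been migrated).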
Alternative Approaches:
Elasticsearch with geo_point data types provides excellent geospatial capabilities and can handle high write throughput when properly configured. PostGIS extensions for PostgreSQL offer R-tree spatial indexing but require careful scaling for write-heavy workloads. QuadTree implementations can be built in-memory but require custom development.
Deep Dive 2: How do we optimize the matching algorithm for sub-2-second latency?
Ride matching must complete quickly to provide excellent user experience. We need to balance comprehensiveness (finding the best driver) with speed (responding within 2 seconds).
Matching Algorithm Components:
Spatial Search Phase (200ms): The system executes GEORADIUS on Redis to find all available drivers within initial radius (3 miles). If fewer than 5 drivers are found, the radius expands to 5 miles, then 10 miles. This multi-radius approach balances match quality with availability.
Filtering Phase (100ms): Retrieved drivers are filtered based on hard constraints. Status must be “available” (not already on a ride or offline). Vehicle type must match the passenger’s request (standard, XL, lux). Acceptance rate must exceed 70% to ensure responsive drivers. Driver rating threshold is applied if sufficient high-rated drivers are available.
Scoring Phase (300ms): Remaining drivers are ranked using a composite score. Pickup ETA receives the highest weight (negative 10 points per minute) since passengers prioritize fast pickups. Driver rating contributes positive points (2 points per star). Acceptance rate adds credibility (1.5 points per percentage point). Total rides completed provides slight preference for experienced drivers. Drivers operating in their preferred zones receive bonus points.
ETA Calculation: Pickup ETA is estimated using a hybrid approach. For quick estimates, straight-line distance is multiplied by 1.4 to account for city routing. For better accuracy, historical travel times for the specific zone and time-of-day are factored in. An ML model trained on millions of historical trips predicts travel time based on coordinates, time-of-day, day-of-week, and current traffic conditions. Popular routes (like airport to downtown) are cached to reduce API calls.
Parallel Dispatch Phase (1000ms): The top 5 drivers receive ride offer notifications simultaneously via push notifications. Each notification includes estimated pickup distance, passenger rating, and destination direction (not exact address for privacy). Drivers have 15 seconds to accept or decline. The system uses optimistic locking to ensure only one driver can claim the ride.
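The scoring phase above can be sketched as a single function. The ETA, rating, and acceptance weights are the ones stated in the text; the experience cap and zone bonus are illustrative values, since the text only says they contribute "slight preference" and "bonus points":

```python
def driver_score(eta_min: float, rating: float, acceptance_pct: float,
                 total_rides: int, in_preferred_zone: bool) -> float:
    """Composite driver score; higher is better."""
    score = -10.0 * eta_min                   # fast pickup dominates
    score += 2.0 * rating                     # 2 points per star
    score += 1.5 * acceptance_pct             # 1.5 points per percentage point
    score += min(total_rides / 1000.0, 5.0)   # slight experience preference, capped
    if in_preferred_zone:
        score += 5.0                          # preferred-zone bonus
    return score
```

With these weights, a driver 3 minutes away comfortably outranks an otherwise stronger driver 8 minutes away, which matches the stated intent that pickup ETA is the dominant factor.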
Optimistic Locking Implementation:
When a driver accepts, the Ride Service attempts to update the ride record in PostgreSQL, setting the driver_id only if it’s currently null. The database returns the number of rows affected. If zero rows were updated, another driver already claimed the ride and this driver receives a “ride no longer available” message. If the update succeeds, the driver status changes to “busy” and other drivers are notified the ride was taken.
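The claim step above is a single conditional update. A sketch with the SQL shown as a comment and an in-memory stand-in for the database, so that only the first acceptance succeeds:

```python
from typing import Dict, Optional

# The production query is one conditional UPDATE, e.g.:
#   UPDATE rides SET driver_id = %(driver)s, status = 'accepted'
#   WHERE ride_id = %(ride)s AND driver_id IS NULL;
# rows_affected == 1 means this driver won the race.

Ride = Dict[str, Optional[str]]

def try_claim(rides: Dict[str, Ride], ride_id: str, driver_id: str) -> bool:
    """Mimics the conditional UPDATE: succeed only if no driver is set yet."""
    ride = rides[ride_id]
    if ride["driver_id"] is None:      # the WHERE ... IS NULL guard
        ride["driver_id"] = driver_id
        ride["status"] = "accepted"
        return True
    return False                       # zero rows affected: ride already taken
```

The losing drivers get the "ride no longer available" message described above.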
Timeout Handling:
If no driver accepts within 15 seconds, the timeout handler expands the search radius or increases surge pricing to attract more drivers. The request can retry up to 3 times before notifying the passenger that no drivers are currently available.
Deep Dive 3: How does Lyft Line route optimization work while maintaining detour constraints?
Shared rides present a complex combinatorial optimization problem: given N passengers with pickup and dropoff pairs, find the optimal order of 2N waypoints that minimizes total trip time while keeping each passenger’s detour under 10 minutes.
Problem Complexity:
This is a variant of the Traveling Salesman Problem (TSP) with precedence constraints (each pickup must occur before its corresponding dropoff). With 3 passengers there are 6 waypoints and 6! = 720 possible orderings; precedence constraints prune these to 720 / 2³ = 90 valid ones, but exhaustive search still becomes intractable quickly as passengers are added.
Constraint-Based Approach:
The Routing Service uses a heuristic algorithm that balances solution quality with computation time. When a new passenger requests to join an existing shared ride, the system tries all valid insertion positions for the new pickup and dropoff waypoints. Valid positions maintain precedence constraints (pickup before dropoff) and other ordering requirements.
For each candidate route, the service queries the Maps API to calculate total duration with current traffic. It then validates detour constraints by comparing each passenger’s actual trip time against their direct route time. If any passenger would experience more than 10 minutes of additional travel time, that candidate route is rejected.
Route Optimization Details:
Starting with the driver’s current location, the algorithm builds a waypoint sequence. For existing passengers, it maintains their already-optimized positions. For the new passenger, it systematically tries inserting their pickup after each existing waypoint and their dropoff after the pickup. This generates (N+1)(N+2)/2, roughly N²/2, candidates, where N is the current number of waypoints.
The Maps API provides duration estimates considering current traffic. The system caches recent route calculations to avoid redundant API calls when multiple passengers are being evaluated. The algorithm selects the candidate route with minimum total duration that satisfies all detour constraints.
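The candidate generation described above can be sketched as a nested loop over insertion positions; the Maps API call that would then score each candidate is out of scope here:

```python
from typing import List, Tuple

Waypoint = Tuple[str, str]  # (passenger_id, "pickup" | "dropoff")

def insertion_candidates(route: List[Waypoint],
                         passenger: str) -> List[List[Waypoint]]:
    """All ways to insert the new passenger's pickup and dropoff into the
    existing route while keeping the pickup strictly before the dropoff.
    Produces (N+1)(N+2)/2 candidates for a route of N waypoints."""
    pickup, dropoff = (passenger, "pickup"), (passenger, "dropoff")
    candidates: List[List[Waypoint]] = []
    n = len(route)
    for i in range(n + 1):                    # pickup position
        with_pickup = route[:i] + [pickup] + route[i:]
        for j in range(i + 1, n + 2):         # dropoff strictly after pickup
            candidates.append(with_pickup[:j] + [dropoff] + with_pickup[j:])
    return candidates
```

Each candidate would be sent to the Maps API for a duration estimate, then filtered by the detour constraints before selecting the fastest valid route.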
Fare Calculation for Shared Rides:
Each passenger’s fare is calculated individually based on their actual travel distance and time. The base fare for a direct route is computed first. Then a discount is applied based on time spent sharing the vehicle with other passengers. The minimum discount is 20% guaranteed, with additional savings based on the sharing ratio. If a passenger shares the vehicle for 60% of their trip duration, they might receive a 30-35% discount versus a private ride.
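One plausible reading of the discount rule above is a linear interpolation from the 20% floor, which lands at 35% for a 60% sharing ratio; the exact interpolation is an assumption, not a documented formula:

```python
def shared_ride_fare(direct_fare: float, sharing_ratio: float,
                     base_discount: float = 0.20,
                     max_extra: float = 0.25) -> float:
    """sharing_ratio is the fraction of the passenger's trip spent with
    other passengers aboard (0.0 to 1.0). Linear interpolation between the
    20% floor and a hypothetical 45% maximum discount."""
    ratio = max(0.0, min(1.0, sharing_ratio))
    discount = base_discount + max_extra * ratio
    return round(direct_fare * (1.0 - discount), 2)
```

A passenger whose direct-route fare would be $20 pays $16 even if they end up riding alone, and about $13 if they share the vehicle for 60% of the trip.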
Dynamic Passenger Addition:
While a shared ride is active, the system continuously monitors for new Lyft Line requests in the area. For each potential match, it recalculates the optimal route including the new passenger. The system enforces a maximum of 3 total passengers to maintain service quality. Drivers are presented with optional additions showing incremental earnings and estimated delay. They can accept or decline based on their preference.
Data Structure for Active Shared Rides:
Redis stores each active shared ride with current location, ordered waypoint list (with type and passenger association), passenger count, and route vector (bearing and distance remaining). A separate geospatial index tracks shared rides by current location and bearing, enabling efficient matching with new requests heading in the same direction.
Deep Dive 4: How does zone-based surge pricing balance supply and demand?
Dynamic pricing is critical for marketplace balance. When demand exceeds supply, prices must increase to incentivize more drivers and moderate passenger demand. The system needs real-time supply-demand tracking and smooth price adjustments.
Hexagonal Zone System:
Cities are divided into hexagonal zones using Uber’s H3 library at resolution 8, creating hexagons of approximately 0.74 square kilometers each. Hexagons are superior to square grids for spatial analysis because all neighbors are equidistant from the center, providing more uniform coverage.
Demand-Supply Tracking:
Every 30 seconds, a background job recalculates surge multipliers for each zone. Demand is measured by counting ride requests in the zone over the last 5 minutes. Supply is measured by counting available drivers currently in the zone. The demand-supply ratio determines the surge multiplier.
Surge Multiplier Calculation:
The ratio maps to surge levels: below 1.2 means no surge (1.0x pricing). Ratios from 1.2 to 1.5 trigger 1.2x surge. Ratios from 1.5 to 2.0 result in 1.5x surge. Higher ratios progressively increase surge up to a maximum cap of 3.5x. When no drivers are available in a zone, maximum surge applies immediately.
To prevent abrupt price jumps that frustrate passengers, surge changes are smoothed. The multiplier can only change by plus or minus 0.2 per 30-second update cycle. If demand suddenly spikes, surge gradually increases rather than jumping from 1.0x to 3.0x instantly.
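The tier mapping and smoothing described above can be sketched as two small functions. The tiers below 2.0 and the 3.5x cap come from the text; the progression between 2.0 and the cap is an illustrative choice:

```python
def target_surge(demand_supply_ratio: float, cap: float = 3.5) -> float:
    """Map the zone's demand/supply ratio to a target multiplier."""
    if demand_supply_ratio < 1.2:
        return 1.0
    if demand_supply_ratio < 1.5:
        return 1.2
    if demand_supply_ratio < 2.0:
        return 1.5
    # beyond 2.0: illustrative progression, capped at 3.5x
    return min(cap, round(demand_supply_ratio, 1))

def smoothed_surge(current: float, target: float,
                   max_step: float = 0.2) -> float:
    """Limit each 30-second update to a +/-0.2 change to avoid price jumps."""
    delta = max(-max_step, min(max_step, target - current))
    return round(current + delta, 2)
```

A sudden spike from ratio 1.0 to 3.0 thus raises surge to 1.2x on the first cycle, then 1.4x, and so on, rather than jumping straight to 3.0x.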
Passenger Communication:
Before requesting a ride in a surge zone, passengers see a clear warning showing the elevated price range. They must acknowledge understanding surge pricing before confirming. Some jurisdictions require explicit consent for surge pricing as a regulatory requirement.
Driver Incentivization:
The driver app displays a heatmap showing current demand and surge zones. High-demand areas are highlighted in red with the active surge multiplier. The app suggests nearby surge zones with messages like “2.5x surge 0.8 miles away at Downtown area.” This helps drivers position themselves strategically for higher earnings.
Machine Learning for Demand Forecasting:
A Gradient Boosted Trees model (using LightGBM) predicts demand 15-30 minutes ahead. Training features include time-of-day, day-of-week, weather conditions, local events (concerts, sports games, conferences), and historical ride patterns. The model outputs expected ride requests per zone per 5-minute interval. This enables proactive driver positioning notifications like “High demand expected near Stadium at 10 PM (game ending).”
Storage in Redis:
Surge multipliers for each zone are stored in Redis hash structures, with zone H3 IDs as keys and multipliers as values. Demand tracking uses sorted sets with timestamps as scores, enabling sliding window queries (keep only last 5 minutes of requests). Driver availability is maintained in geospatial sorted sets for efficient counting per zone.
Deep Dive 5: How do we ensure ride requests aren’t lost during failures or peak traffic?
System reliability requires handling instance failures, network issues, and traffic spikes without losing ride requests or corrupting state.
Durable Message Queue:
Ride requests are immediately written to Apache Kafka when received. Kafka provides durable storage with replication across multiple brokers. If the Ride Matching Service crashes after receiving a request but before processing it, the message remains in Kafka for processing by another instance.
Consumer Group Architecture:
Multiple instances of the Ride Matching Service form a consumer group reading from Kafka partitions. Each partition is consumed by exactly one instance, providing parallelism while maintaining order within partitions. If an instance fails, Kafka automatically rebalances partitions to healthy instances.
Exactly-Once Processing:
Kafka’s exactly-once semantics (when configured) ensures each ride request is processed once despite retries or failures. The consumer commits offsets to Kafka only after successfully matching a driver and persisting state. If processing fails mid-way, the offset is not committed and another consumer picks up the message.
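The commit-after-process discipline above can be sketched without Kafka itself. Note this simplified loop gives at-least-once delivery (a crash mid-batch causes redelivery, so processing must be idempotent); true exactly-once additionally requires Kafka's transactional offset commits:

```python
from typing import Callable, List

def consume_batch(messages: List[str], committed_offset: int,
                  process: Callable[[str], None]) -> int:
    """Process messages in order, advancing the committed offset only after
    each message succeeds. A crash mid-batch leaves the offset at the last
    fully processed message, so a replacement consumer re-reads from there."""
    for offset in range(committed_offset, len(messages)):
        process(messages[offset])       # match driver + persist ride state
        committed_offset = offset + 1   # commit only after success
    return committed_offset
```

If the consumer crashes before returning, the caller's last known committed offset is unchanged, and already processed messages from the failed batch are redelivered on retry; this is why the ride-matching side must deduplicate by ride ID.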
Priority Handling During Peaks:
During surge pricing or special events, premium ride types can be routed to priority queues. The system processes high-value rides first while ensuring standard rides aren’t indefinitely delayed. Time-based prioritization ensures requests aren’t stuck in the queue beyond acceptable wait times.
Human-in-the-Loop with Workflow Orchestration:
Ride matching involves human drivers who may not respond immediately. Temporal (or similar workflow engines like AWS Step Functions or Cadence) manages these long-running workflows. Each ride matching workflow is represented as durable state that survives service restarts. If a driver doesn’t respond within the timeout, the workflow automatically continues with the next candidate driver. If the entire service crashes, workflows resume from their last checkpoint when the service restarts.
Deep Dive 6: How do we handle real-time location streaming at scale?
Streaming live location updates from 200K drivers to their matched passengers with sub-second latency while handling connection instability requires careful architecture.
WebSocket Connection Management:
When a ride is matched, the passenger app establishes a WebSocket connection to a dedicated WebSocket server. The connection URL includes the ride ID and authentication token. Each WebSocket server handles approximately 10K concurrent connections. For 140K active rides (each with passenger and driver connected), we need about 28 WebSocket server instances.
Event Streaming Pipeline:
Driver location updates flow through a multi-stage pipeline. The Location Service receives updates via REST API and immediately writes to Redis for spatial indexing. Simultaneously, it publishes location events to Kafka topics partitioned by geographic region (16 partitions for major cities). This partitioning keeps related events together and enables parallel processing.
WebSocket to Kafka Mapping:
Each WebSocket server subscribes to specific Kafka partitions based on the rides it’s managing. Consistent hashing on ride IDs ensures connections for the same ride always route to the same WebSocket server, minimizing the number of Kafka partitions each server must monitor. When a location update arrives via Kafka, the server forwards it to the relevant WebSocket connection.
Connection Resilience:
Mobile networks are inherently unreliable. When a connection drops, the client automatically reconnects with exponential backoff. Upon reconnection, the server sends the last known location plus a buffer of the last 10 updates to fill any gaps. Heartbeat ping/pong messages every 30 seconds detect dead connections early.
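The reconnect schedule described above is standard capped exponential backoff. A small sketch, with assumed base, factor, and cap values (the text does not specify them):

```python
import random

def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=6, jitter=False):
    """Reconnect delays in seconds: 1, 2, 4, 8, ... capped at `cap`.

    With jitter=True each delay is drawn uniformly from [0, delay], which
    spreads out reconnect storms when many clients drop at once (e.g. a
    cell tower handoff).
    """
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(random.uniform(0, delay) if jitter else delay)
        delay = min(delay * factor, cap)
    return delays

schedule = backoff_delays()  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
```

On a successful reconnect the client resets the schedule to the base delay, and the server replays the 10-update buffer to cover the gap.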
Low-Latency Bypass:
For critical updates (driver arriving in 1 minute, ride started), the system bypasses Kafka and publishes directly to Redis Pub/Sub channels. WebSocket servers subscribe to these channels for ultra-low-latency notifications that must reach passengers immediately.
Historical Storage:
All location updates are asynchronously written to Cassandra for audit trails and dispute resolution. Cassandra’s write-optimized architecture handles 200K writes per second easily. Data is partitioned by driver_id and clustered by timestamp, enabling efficient time-range queries like “show me all locations for driver 123 during ride 456.” After 90 days, data is archived to S3 in Parquet format for long-term analytics.
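The partition-and-cluster layout can be modeled in plain Python to show why the time-range query is cheap: rows within a driver's partition are kept sorted by timestamp, so a range read is two binary searches and a slice. This is an in-memory sketch of the access pattern, not Cassandra itself:

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Sketch of the layout described above: partition key = driver_id,
# clustering key = timestamp.
table = defaultdict(list)  # driver_id -> [(timestamp, (lat, lng)), ...] sorted by time

def record(driver_id, ts, lat, lng):
    """Insert an update, keeping the driver's partition sorted by timestamp."""
    rows = table[driver_id]
    rows.insert(bisect_left([t for t, _ in rows], ts), (ts, (lat, lng)))

def locations_between(driver_id, start_ts, end_ts):
    """All locations for a driver in [start_ts, end_ts], e.g. during one ride."""
    rows = table[driver_id]
    ts_list = [t for t, _ in rows]
    return rows[bisect_left(ts_list, start_ts):bisect_right(ts_list, end_ts)]

record("driver-123", 100, 37.77, -122.42)
record("driver-123", 200, 37.78, -122.41)
record("driver-123", 300, 37.79, -122.40)
```

In Cassandra the sorting is maintained on disk per partition, so "all locations for driver 123 between these timestamps" reads one contiguous slice from one node.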
Deep Dive 7: How do we prevent payment failures from disrupting rides?
Payment processing must be reliable, but payment service failures shouldn’t prevent ride completion. The system needs graceful degradation and idempotency.
Pre-Authorization at Ride Start:
When a ride starts, the Payment Service places an authorization hold on the passenger’s credit card for the estimated fare plus a 20% buffer. This is not a charge, just a reservation of funds. If authorization fails due to insufficient funds or invalid card, the ride is immediately cancelled and the passenger is notified to update their payment method.
Capture at Ride Completion:
After the ride ends, the final fare is calculated based on actual distance, time, surge multiplier, and any tips. The Payment Service captures the authorized amount (or less if final fare is lower). The excess authorization automatically releases within 3-7 days depending on the bank.
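The authorize-then-capture arithmetic from the two paragraphs above can be made concrete. A minimal sketch; function names are illustrative, and it assumes the capture never exceeds the hold (a real system would request an incremental authorization if the final fare overran the buffer):

```python
AUTH_BUFFER = 0.20  # 20% buffer over the estimate, as described above

def authorization_amount(estimated_fare: float) -> float:
    """Amount to place on hold at ride start (estimate plus buffer)."""
    return round(estimated_fare * (1 + AUTH_BUFFER), 2)

def capture_amount(authorized: float, final_fare: float) -> float:
    """Capture the final fare, never more than what was authorized; the
    unused remainder of the hold is released by the bank in 3-7 days."""
    return round(min(final_fare, authorized), 2)

hold = authorization_amount(25.00)      # $30.00 held at ride start
charged = capture_amount(hold, 27.50)   # $27.50 captured; $2.50 released
```

Floats are used here for readability; a production payment system would work in integer cents (or a decimal type) to avoid rounding drift.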
Asynchronous Payment with Retries:
If the Payment Service is temporarily unavailable when a ride completes, the ride is marked “payment_pending” and the payment request is queued. The system retries with exponential backoff: after 1 minute, 5 minutes, 30 minutes, 2 hours, and 24 hours. If all retries fail, the passenger receives an email invoice and the issue is escalated to manual review.
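The retry schedule above is fixed rather than strictly exponential, so it can be encoded as a list. A sketch with an injectable `sleep` (a hypothetical seam so the waits can be skipped in tests; a real system would schedule these retries via a queue with delayed delivery, not an in-process sleep):

```python
RETRY_DELAYS_SECONDS = [60, 300, 1800, 7200, 86400]  # 1m, 5m, 30m, 2h, 24h

def process_with_retries(charge, sleep):
    """Try the charge once, then retry on the schedule above.

    Returns True on success; False means all retries were exhausted and the
    ride falls back to an email invoice and manual review.
    """
    if charge():
        return True
    for delay in RETRY_DELAYS_SECONDS:
        sleep(delay)
        if charge():
            return True
    return False

# Example: the charge succeeds on the third attempt (after 1m and 5m waits).
attempts = iter([False, False, True])
waited = []
ok = process_with_retries(lambda: next(attempts), waited.append)
```

Throughout the retry window the ride stays in "payment_pending", so the passenger can keep riding and the driver is unaffected.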
Idempotency Keys:
All payment operations use idempotency keys to prevent double-charging. The key format is “ride_payment_{ride_id}”, guaranteeing exactly one key per ride. If the Payment Service retries a request after a timeout, Stripe (or Braintree) recognizes the duplicate idempotency key and returns the original result without creating a new charge. This protects against network-level retries causing multiple charges.
Split Payment Handling:
For ride splitting, the primary passenger pays the driver first. Then reimbursement requests are sent to other passengers who shared the ride. When they confirm and pay, funds are deposited into the primary passenger’s Lyft wallet or refunded to their original payment method. If any passenger fails to pay their share, the primary passenger is notified but the driver still receives full payment.
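Splitting a fare evenly has one subtlety worth showing: the shares must still sum exactly to the total, so remainder pennies have to go somewhere. A minimal sketch, assuming (as an illustration, not from the text) that the primary passenger absorbs the leftover cents:

```python
def split_fare(total_cents: int, num_passengers: int):
    """Split a fare evenly in integer cents.

    Remainder pennies are assigned to the earliest passengers (index 0 is
    the primary passenger), so the shares always sum to the total and no
    cent is lost or invented.
    """
    base, remainder = divmod(total_cents, num_passengers)
    return [base + (1 if i < remainder else 0) for i in range(num_passengers)]

shares = split_fare(2750, 3)  # $27.50 split three ways -> [917, 917, 916]
```

Working in integer cents (rather than floats) is what makes the "shares sum to the total" invariant easy to guarantee.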
Driver Payouts:
Lyft takes a 20-25% commission and drivers receive 75-80% of each fare. Earnings are aggregated weekly and processed via ACH batch transfer every Monday. For immediate cash flow needs, drivers can use instant cashout (paying a $0.50 fee) to receive funds within hours via Stripe Express. Year-end tax forms (1099-K in the U.S.) are generated automatically.
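The payout arithmetic is straightforward but worth pinning down. A sketch using the figures from the text, with the commission assumed at the high end of the stated range:

```python
COMMISSION_RATE = 0.25      # within the stated 20-25% range; high end assumed here
INSTANT_CASHOUT_FEE = 0.50  # flat instant-cashout fee from the text

def weekly_payout(fares):
    """Driver's share of a week of gross fares, rounded to cents."""
    return round(sum(fares) * (1 - COMMISSION_RATE), 2)

def instant_cashout(balance):
    """Amount received via instant cashout after the flat fee."""
    return round(balance - INSTANT_CASHOUT_FEE, 2)

earned = weekly_payout([25.00, 40.00, 35.00])  # $100.00 gross -> $75.00 net
cashed = instant_cashout(earned)               # $74.50 after the fee
```

The weekly ACH path pays the full `earned` amount; instant cashout trades the flat fee for same-day liquidity.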
Deep Dive 8: How do we scale the system geographically?
Supporting 300+ cities globally requires geographic partitioning and localization.
Geographic Sharding:
Data is partitioned by city or region. Each major metro area has dedicated PostgreSQL and Redis clusters. Ride requests are routed to the nearest regional API Gateway based on pickup location. This reduces cross-region database queries and minimizes latency.
Data Locality:
Driver locations, active rides, and surge pricing data are inherently local to specific cities. A ride in San Francisco never needs data from New York’s clusters. This natural partitioning enables linear horizontal scaling by adding new regional clusters as service expands to new cities.
Cross-Region Considerations:
Some data requires global consistency or visibility. User profiles (passenger and driver accounts) are replicated across regions for authentication. Payment transaction records are centralized for financial reporting and fraud detection. Analytics pipelines aggregate data from all regions into a central data warehouse for business intelligence.
Disaster Recovery:
Critical regional clusters have standby replicas in geographically separate data centers. If the primary San Francisco cluster fails, traffic automatically fails over to the replica with minimal downtime. Ride requests in progress are preserved in Kafka and resume processing once the failover completes.
Step 4: Wrap Up
In this chapter, we proposed a system design for a ride-sharing platform like Lyft. If there is extra time at the end of the interview, here are additional points to discuss:
Additional Features:
- Driver background checks and vehicle inspections for safety and trust.
- Emergency assistance button connecting to safety team with live location tracking.
- Share ride status with friends/family through web-based tracking links.
- Two-way rating system with quality thresholds for drivers and passengers.
- Promotional codes, referral bonuses, and loyalty credits for user acquisition and retention.
- Ride scheduling allowing passengers to book rides in advance with guaranteed pricing.
Technology Stack Summary:
Application Layer: Backend services in Go for low-latency, high-concurrency operations (matching, location services), Python for ML services and data processing pipelines. Mobile apps using React Native for shared iOS/Android codebase with native modules for critical features. API Gateway using Envoy or Kong for routing, rate limiting, and authentication.
Data Storage: PostgreSQL for transactional data with read replicas and geographic sharding. Redis for geospatial indexing, caching, and pub/sub messaging. Cassandra for time-series location history with high write throughput. S3 for object storage of receipts, documents, and ML training data.
Messaging: Kafka for event streaming with durability and replay capabilities. RabbitMQ for task queues (payment processing, notifications).
External Services: Google Maps or Mapbox for geocoding, routing, and traffic data. Stripe or Braintree for payment processing and PCI compliance. Twilio for SMS, SendGrid for email. Firebase Cloud Messaging and Apple Push Notification Service for mobile push notifications.
ML Platform: TensorFlow or PyTorch for training, TensorFlow Serving for real-time inference. Snowflake or BigQuery for data warehousing and analytics. Apache Airflow for workflow orchestration and ETL pipelines.
Scaling Strategies:
Microservice Isolation: Critical services (matching, location tracking, ride management) are isolated from non-critical services (promotions, analytics). Circuit breakers prevent cascade failures when dependencies fail. Bulkheads limit concurrent requests to protect system stability.
Caching Layers: Three-tier caching with L1 in-memory application cache, L2 Redis distributed cache with sub-5ms p99 latency, and L3 database read replicas with eventual consistency.
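The three-tier read path can be sketched as a read-through lookup that populates the faster tiers on a miss. The dicts here are hypothetical stand-ins for the in-process cache, Redis, and the database read replica:

```python
# Minimal read-through sketch of the three cache tiers described above.
l1, l2, db = {}, {}, {"driver:7": {"rating": 4.9}}  # illustrative data

def get(key):
    """Check L1, then L2, then the replica, promoting the value as it goes."""
    if key in l1:
        return l1[key]                 # in-process hit: no network round trip
    if key in l2:
        l1[key] = l2[key]              # promote Redis hit into L1
        return l2[key]
    value = db.get(key)                # L3: read replica, eventually consistent
    if value is not None:
        l2[key] = value                # backfill both cache tiers
        l1[key] = value
    return value

first = get("driver:7")   # misses both caches, hits the replica
second = get("driver:7")  # now served from L1
```

A real implementation would add TTLs and invalidation on writes; the sketch only shows the lookup-and-promote order.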
Horizontal Scaling: Stateless services auto-scale based on CPU and request rate metrics. WebSocket servers scale based on active connection counts (target 10K per instance). Kafka consumers scale to match partition counts for maximum parallelism.
Database Optimization: Composite indexes on status and timestamp columns for common queries. GIST indexes for geospatial columns. Automatic archival of completed rides older than 90 days to S3 in Parquet format for analytics while keeping operational database lean.
Monitoring and Observability:
Business Metrics: Rides per second, match success rate, average pickup ETA, surge multiplier distribution by zone, driver utilization rates, revenue per ride.
Technical Metrics: API latency percentiles (p50, p99, p999), database query execution time, cache hit rates, Kafka consumer lag, WebSocket connection stability.
Alerting: Automatic alerts for match latency exceeding 3 seconds, ride request error rate above 1%, location update lag exceeding 5 seconds, payment processing failures above threshold.
Distributed Tracing: Jaeger or Zipkin traces requests across microservices (API Gateway to Ride Service to Matching Service to Database) to identify bottlenecks in the critical path.
Failure Scenarios and Resilience:
Database Primary Failure: Automatic failover promotes a standby (or read replica) to primary; RDS Multi-AZ failover completes in under 60 seconds. During failover, ride requests queue in Kafka and are processed once the new primary is ready. Impact: 1-2 minute delay in ride matching with no data loss.
Redis Cluster Failure: Driver locations are lost since Redis is in-memory without persistence for geospatial data. Fallback to PostgreSQL drivers table with last known location (slower and potentially stale). Request drivers refresh their location. Impact: 2-3 minute recovery with degraded matching performance.
Kafka Outage: Real-time location streaming breaks (passengers don’t see driver movement). Fallback to REST API polling every 5 seconds for driver location (higher load but functional). The notification system falls back to synchronous push instead of event-driven delivery. Impact: Increased API load and slower updates, but no data loss.
Maps API Failure: Cannot calculate routes or ETAs for new requests. Fallback to cached historical routes for popular origin-destination pairs. Use straight-line distance times 1.5 as a rough ETA estimate. Impact: Inaccurate ETAs and suboptimal routes.
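The straight-line fallback is a haversine distance scaled by a road factor. A minimal sketch; the 1.5 multiplier comes from the text, while the assumed average speed is an illustrative value the text does not specify:

```python
import math

EARTH_RADIUS_KM = 6371.0
ROAD_FACTOR = 1.5          # straight-line distance x 1.5, per the fallback above
ASSUMED_SPEED_KMH = 30.0   # assumed average urban speed (illustrative)

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def fallback_eta_minutes(lat1, lng1, lat2, lng2):
    """Rough ETA when the Maps API is down: scaled distance over assumed speed."""
    road_km = haversine_km(lat1, lng1, lat2, lng2) * ROAD_FACTOR
    return road_km / ASSUMED_SPEED_KMH * 60

# Example: San Francisco to San Jose, roughly 67 km great-circle.
eta = fallback_eta_minutes(37.7749, -122.4194, 37.3382, -121.8863)
```

The estimate ignores traffic and routing entirely, which is why the impact line above flags inaccurate ETAs; it only needs to be good enough to keep matching functional during the outage.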
Payment Service Failure: Cannot charge passengers at ride completion. Mark ride as “payment_pending” and queue for retry with exponential backoff. If all retries fail, send email invoice. Impact: Delayed payment capture with potential revenue loss if card expires.
Future Enhancements:
Autonomous Vehicles: Partnership with self-driving car companies like Waymo or Cruise. Matching algorithm considers AV versus human driver trade-offs (AVs may have longer ETA but lower cost). Passenger preferences for trying autonomous vehicles or preferring human drivers.
Multimodal Transportation: Integration with scooters, bikes, and public transit for comprehensive journey planning. “Take Lyft to train station, then train to destination, then scooter last mile” with unified payment.
Predictive Ride Booking: ML model predicts when passengers need rides based on historical patterns (every weekday at 8 AM to office). Proactive notifications offering pre-positioned drivers. Schedule rides in advance with locked-in pricing to avoid surge.
Carbon Neutrality: Track emissions based on vehicle type and trip distance. Offer carbon offset option to plant trees and neutralize impact. Incentivize electric vehicle drivers with lower commission rates and priority matching.
Advanced Safety Features: In-cabin cameras with AI-based incident detection (sudden braking, shouting, violence). Continuous driver monitoring for drowsiness and phone usage. Passenger health monitoring for unresponsive passengers.
Summary
This comprehensive guide covered the design of a ride-sharing platform like Lyft, including:
- Core Functionality: Fare estimation, ride requests, driver matching, real-time tracking, shared rides, dynamic pricing, and payment processing.
- Key Challenges: Handling 200K location updates per second, sub-2-second matching latency, shared ride route optimization with detour constraints, zone-based surge pricing, and real-time WebSocket streaming.
- Solutions: Redis geospatial indexing for proximity searches, optimistic locking for ride assignment, constraint-based route optimization for Lyft Line, hexagonal zone system for surge pricing, Kafka for durable message queues, workflow orchestration for human-in-the-loop processes, and geographic sharding for global scale.
- Scalability: Horizontal scaling of stateless services, multi-tier caching, database sharding by region, event streaming with Kafka, and WebSocket servers with consistent hashing.
- Resilience: Graceful degradation for each critical dependency, automatic failover for database failures, durable queuing for ride requests, idempotent payment operations, and comprehensive monitoring with distributed tracing.
The design demonstrates how to build a real-time marketplace platform balancing supply and demand, optimizing for both passenger experience (fast pickups, accurate ETAs, fair pricing) and driver earnings (efficient routing, surge incentives, reliable payouts), while handling massive scale with 10M+ daily rides and 200K concurrent active drivers.