Design DoorDash

DoorDash is a food delivery platform connecting customers, restaurants, and delivery drivers (Dashers) in real-time. The system handles millions of daily orders, real-time location tracking, dynamic pricing, and intelligent order-dasher matching. This design covers the architecture required to build a production-grade system at scale.

Designing DoorDash presents unique challenges including real-time location tracking, efficient order-to-dasher matching, multi-stop route optimization, dynamic surge pricing, and coordinating the complex state machine across three independent actors (customer, restaurant, and Dasher).

Step 1: Understand the Problem and Establish Design Scope

Before diving into the design, it’s crucial to define the functional and non-functional requirements. For user-facing applications like this, functional requirements are the “Users should be able to…” statements, whereas non-functional requirements define system qualities via “The system should…” statements.

Functional Requirements

Core Requirements:

  1. Customers should be able to search and browse restaurants by location, cuisine type, rating, and delivery time.
  2. Customers should be able to place orders with customization options and receive real-time status updates.
  3. The system should intelligently match orders to available Dashers based on proximity, capacity, and ratings.
  4. Customers should be able to track their Dasher’s real-time location and receive ETA updates.
  5. The system should optimize delivery routes for Dashers handling multiple orders simultaneously.
  6. The system should support payment processing, tips, and automated payouts.

Below the Line (Out of Scope):

  • Customers should be able to rate restaurants and Dashers post-delivery.
  • Customers should be able to schedule orders in advance.
  • The system should support group orders and split payments.
  • Restaurants should be able to manage menus, hours, and promotional campaigns.
  • The system should provide analytics dashboards for restaurants and Dashers.

Non-Functional Requirements

Core Requirements:

  • The system should prioritize low latency for restaurant search (< 200ms) and order placement (< 500ms).
  • The system should ensure strong consistency for order state transitions and payment transactions to prevent double-assignment or payment errors.
  • The system should handle high throughput, supporting 10M+ daily active users and hundreds of thousands of orders in flight during peak hours.
  • The system should provide accurate ETA predictions with minimal error (< 10 minutes variance).

Below the Line (Out of Scope):

  • The system should ensure 99.99% uptime for core services with graceful degradation during partial outages.
  • The system should be PCI DSS compliant for payment processing.
  • The system should comply with data privacy regulations (GDPR, CCPA).
  • The system should have comprehensive monitoring, logging, and alerting.

Clarification Questions & Assumptions:

  • Platform: Mobile apps for customers and Dashers, web portal for restaurants.
  • Scale: 50M monthly active users, 10M daily active users, 500K restaurants, 100K active Dashers per major city.
  • Location Update Frequency: Dashers update their location every 5-10 seconds while on delivery.
  • Order Volume: Approximately 1M orders per day, with 5x peak traffic during lunch and dinner hours.
  • Batching: Dashers can handle up to 2-3 orders simultaneously if routes are optimized.

Capacity Estimation

Traffic Estimates:

  • Orders per day: 10M DAU, each averaging 3 orders/month = 30M orders/month ≈ 1M orders/day
  • Orders per second (average): 1M / 86,400 = approximately 12 orders/sec
  • Peak orders per second: 12 x 5 = 60 orders/sec

Storage Estimates:

  • Order data: 1M orders/day x 10 KB/order = 10 GB/day = 3.6 TB/year
  • Location tracking: 100K active Dashers x 10 updates/min x 200 bytes = 200 MB/min = 288 GB/day
  • Menu data: 500K restaurants x 50 items x 2 KB = 50 GB

Bandwidth Estimates:

  • Location updates: 100K Dashers x 10 updates/min x 200 bytes = 200 MB/min ≈ 3.3 MB/sec
  • Order state updates: 1M orders/day x 10 updates/order x 1 KB / 86,400 = 115 KB/sec
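The arithmetic above is easy to sanity-check mechanically. A short Python sketch using the same assumed constants:

```python
# Back-of-envelope capacity check using the assumptions stated above.
ORDERS_PER_DAY = 1_000_000
PEAK_MULTIPLIER = 5
DASHERS = 100_000           # active Dashers in a major city
UPDATES_PER_MIN = 10        # one location update every ~6 seconds
UPDATE_BYTES = 200
ORDER_BYTES = 10_000        # ~10 KB per order record

avg_orders_per_sec = ORDERS_PER_DAY / 86_400                    # ~12
peak_orders_per_sec = round(avg_orders_per_sec) * PEAK_MULTIPLIER  # ~60

order_storage_gb_per_day = ORDERS_PER_DAY * ORDER_BYTES / 1e9   # 10 GB/day
location_bytes_per_min = DASHERS * UPDATES_PER_MIN * UPDATE_BYTES  # 200 MB/min
location_bw_mb_per_sec = location_bytes_per_min / 60 / 1e6      # ~3.3 MB/sec
```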

Step 2: Propose High-Level Design and Get Buy-in

Planning the Approach

Before moving on to designing the system, it’s important to plan your strategy. For user-facing product-style questions, the plan should be straightforward: build your design up sequentially, going one by one through your functional requirements. This will help you stay focused and ensure you don’t get lost in the weeds.

Defining the Core Entities

To satisfy our key functional requirements, we’ll need the following entities:

Customer: Any user who uses the platform to order food. Includes personal information such as name, contact details, delivery addresses, preferred payment methods, and order history.

Restaurant: Any business registered on the platform that provides food for delivery. Contains business details, location, cuisine type, operating hours, menu items with prices and availability, average preparation time, and performance metrics.

Dasher: Any user registered as a delivery driver on the platform. Contains their personal details, vehicle information, current location, availability status, current capacity (number of active orders), rating, total deliveries, and acceptance rate.

Order: An individual food order from the moment a customer places it until delivery completion. Records all pertinent details including the customer, restaurant, assigned Dasher, order status, items with customizations, pricing breakdown (subtotal, delivery fee, tax, tip, total), delivery address, estimated and actual delivery times, and timestamps for each state transition.

Menu Item: Individual dishes or products offered by restaurants. Includes name, description, price, category, availability status, and customization options (size, ingredients, add-ons).

Location: The real-time location of Dashers. Includes latitude and longitude coordinates, timestamp of the last update, and accuracy information. This entity is crucial for matching orders with nearby Dashers and providing real-time tracking to customers.

API Design

Search Restaurants Endpoint: Used by customers to discover restaurants based on location, cuisine type, rating, and other filters.

GET /restaurants/search -> List<Restaurant>
Query Params: {
  lat, lng, cuisine, sort, minRating, maxDeliveryTime
}

Get Restaurant Menu Endpoint: Retrieves the full menu for a specific restaurant with real-time availability.

GET /restaurants/:restaurantId/menu -> Menu

Place Order Endpoint: Used by customers to create a new order after adding items to their cart.

POST /orders -> Order
Body: {
  restaurantId, items[], deliveryAddress, paymentMethodId, tipCents
}

Track Order Endpoint: Provides real-time order status and Dasher location via WebSocket for live updates.

GET /orders/:orderId/track -> OrderStatus
WebSocket: wss://api.doordash.com/orders/:orderId/track

Update Dasher Location Endpoint: Used by Dashers to send periodic location updates while active.

POST /dashers/location -> Success/Error
Body: {
  lat, lng, timestamp
}

Note: The dasherId is present in the session cookie or JWT and not in the body or path params for security reasons.

Get Available Orders Endpoint: Shows Dashers a list of available orders they can accept based on their location and capacity.

GET /dashers/available-orders -> List<OrderOffer>

Accept Order Endpoint: Allows Dashers to accept an order assignment.

POST /dashers/orders/:orderId/accept -> Order

Update Order Status Endpoint: Used by Dashers to update order status as they progress through pickup and delivery.

POST /dashers/orders/:orderId/status -> Success
Body: {
  status, timestamp
}

High-Level Architecture

Let’s build up the system sequentially, addressing each functional requirement:

1. Customers should be able to search and browse restaurants

The core components necessary to fulfill restaurant search and browsing are:

  • Customer Client: The primary touchpoint for users, available on iOS and Android. Interfaces with the system’s backend services.
  • API Gateway: Acts as the entry point for client requests, routing requests to appropriate microservices. Manages cross-cutting concerns such as authentication, rate limiting, and request validation. Can be implemented using Kong or AWS API Gateway.
  • Restaurant Service: Manages restaurant profiles, menus, hours, and availability. Handles restaurant search queries using geographic indexes for efficient proximity-based searches. Caches frequently accessed menu data in Redis for fast retrieval.
  • Database: Stores Restaurant and MenuItem entities. Uses geographic indexing (GIST indexes in PostgreSQL or geo_point in Elasticsearch) for efficient location-based queries.
  • Search Service (Optional): Uses Elasticsearch for advanced search capabilities with full-text search on restaurant names, cuisine types, and menu items. Provides filtering, sorting, and relevance scoring.

Restaurant Search Flow:

  1. The customer enters their delivery location and optionally applies filters (cuisine, rating, delivery time) in the client app, which sends a GET search request to the backend.
  2. The API gateway receives the request, handles authentication and rate limiting, then forwards to the Restaurant Service.
  3. The Restaurant Service queries the database using geospatial indexes to find restaurants within delivery range, applies filters, and sorts results by relevance or user preference.
  4. Results are returned with estimated delivery times, ratings, and delivery fees.
  5. When the user selects a restaurant, their menu is fetched from cache (if available) or the database, showing real-time item availability.
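In production this filter-and-sort step runs against a geospatial index, but the semantics can be illustrated with a small in-memory sketch (field names like `cuisine` and `rating` are assumptions, and the haversine distance stands in for the index query):

```python
import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two (lat, lng) points in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def search_restaurants(restaurants, lat, lng, radius_km=5.0, cuisine=None, min_rating=0.0):
    """Filter restaurants by delivery radius, cuisine, and rating; sort by distance."""
    hits = []
    for rest in restaurants:
        d = haversine_km(lat, lng, rest["lat"], rest["lng"])
        if d > radius_km:
            continue
        if cuisine and rest["cuisine"] != cuisine:
            continue
        if rest["rating"] < min_rating:
            continue
        hits.append({**rest, "distance_km": round(d, 2)})
    return sorted(hits, key=lambda hit: hit["distance_km"])
```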

2. Customers should be able to place orders with customization options

We extend our existing design to support order placement:

  • Add an Order Service: Manages order creation, lifecycle, and state transitions. Implements the order state machine (CREATED -> CONFIRMED -> PREPARING -> READY -> PICKED_UP -> IN_TRANSIT -> DELIVERED).
  • Add an Order table to our database to track orders and their status.
  • Add a Payment Service: Integrates with third-party payment processors (Stripe, Braintree) to handle customer charges, refunds, and disputes.

Order Placement Flow:

  1. The customer adds items to their cart with customizations, reviews the total (including taxes, fees, tip), and confirms the order, sending a POST request with order details.
  2. The API gateway forwards the request to the Order Service.
  3. The Order Service validates the order (checks restaurant is open, items are available, delivery address is in range).
  4. The Payment Service pre-authorizes the payment amount with the customer’s payment method.
  5. The Order Service creates a new order in the database with status “CREATED”, then immediately transitions it to “CONFIRMED” once payment is authorized.
  6. An event is published to a message queue (Kafka) to trigger the dispatch process.
  7. The Order Service returns the order confirmation to the customer with an estimated delivery time.
  8. The restaurant receives a notification to start preparing the order.
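The Order Service's state machine can be encoded as a transition table so illegal jumps are rejected at the service boundary. A minimal sketch (CANCELLED is an assumed extra terminal state, not part of the happy path listed above):

```python
# Order lifecycle state machine; CANCELLED is an assumed terminal state.
VALID_TRANSITIONS = {
    "CREATED": {"CONFIRMED", "CANCELLED"},
    "CONFIRMED": {"PREPARING", "CANCELLED"},
    "PREPARING": {"READY"},
    "READY": {"PICKED_UP"},
    "PICKED_UP": {"IN_TRANSIT"},
    "IN_TRANSIT": {"DELIVERED"},
    "DELIVERED": set(),
    "CANCELLED": set(),
}

def transition(order, new_status):
    """Apply a state transition, rejecting illegal jumps (e.g. CREATED -> DELIVERED)."""
    if new_status not in VALID_TRANSITIONS[order["status"]]:
        raise ValueError(f"illegal transition {order['status']} -> {new_status}")
    order["status"] = new_status
    return order
```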

3. The system should intelligently match orders to available Dashers

We need to introduce new components to facilitate order-to-Dasher matching:

  • Dasher Service: Manages Dasher profiles, authentication, availability status, current capacity, earnings, and performance metrics (acceptance rate, on-time percentage, rating).
  • Location Service: Manages real-time location data of Dashers. Receives location updates from Dasher clients, stores this information in Redis using geospatial data structures (GEOADD), and provides the Dispatch Engine with location data for matching.
  • Dispatch Engine (Order Matching Service): The brain of DoorDash. Handles incoming order events from Kafka and uses a sophisticated algorithm to match orders with the best available Dashers. Considers multiple factors including proximity to restaurant, Dasher rating, acceptance rate, current capacity, vehicle type, and estimated pickup time. Implements order batching logic to assign multiple orders to one Dasher when routes are compatible.
  • Notification Service: Dispatches real-time notifications to Dashers when orders are assigned. Uses APNs (Apple Push Notification service) for iOS and FCM (Firebase Cloud Messaging) for Android to ensure timely delivery.

Dasher Matching Flow:

  1. When an order is confirmed, the Order Service publishes an order event to Kafka.
  2. Meanwhile, active Dashers continuously send their current location (every 5-10 seconds) to the Location Service, which updates their position in Redis using GEOADD commands.
  3. The Dispatch Engine consumes the order event from Kafka and begins the matching workflow.
  4. It queries the Location Service to find all available Dashers within a certain radius (e.g., 5km) of the restaurant using GEORADIUS.
  5. For each eligible Dasher, it calculates a match score based on multiple weighted factors: distance to restaurant (40%), Dasher rating (20%), acceptance rate (15%), vehicle suitability (10%), and estimated time to pickup (15%).
  6. The Dispatch Engine sorts Dashers by score and attempts to acquire a distributed lock on the top-ranked Dasher using Redis SET with NX and EX flags.
  7. If successful, it sends a notification to the Dasher with order details and pickup location.
  8. The Dasher has 10-15 seconds to accept or decline. If they decline or timeout occurs, the lock is released and the next Dasher is notified.

4. Customers should be able to track their Dasher’s real-time location

We add WebSocket support for real-time updates:

  • Tracking Service: Manages real-time tracking connections. Maintains WebSocket connections with customers, receives location updates from the Location Service, calculates ETAs, and pushes updates to connected clients.
  • Routing Service: Integrates with Google Maps Directions API to calculate optimal routes, estimate travel times considering traffic, and provide turn-by-turn navigation to Dashers.

Real-Time Tracking Flow:

  1. After a Dasher accepts an order, the customer client establishes a WebSocket connection to the Tracking Service.
  2. The Tracking Service verifies the customer owns the order and begins streaming updates.
  3. As the Dasher sends location updates to the Location Service, the Tracking Service receives these updates via Kafka events.
  4. The Tracking Service calculates the updated ETA using the Routing Service, which considers current traffic conditions.
  5. Location and ETA updates are pushed to the customer client via WebSocket every 5-10 seconds.
  6. The customer sees the Dasher’s position on a map with an animated marker and updated ETA.

5. The system should optimize delivery routes for multiple orders

For Dashers handling multiple orders, route optimization is critical:

  • Enhance the Dispatch Engine with order batching logic that identifies compatible orders (nearby restaurants, similar delivery directions).
  • Enhance the Routing Service with Vehicle Routing Problem (VRP) solver that determines the optimal sequence of pickups and deliveries.

Order Batching Flow:

  1. When considering a Dasher for assignment, the Dispatch Engine checks if they’re already handling another order and have capacity (most Dashers can handle 2-3 orders).
  2. It evaluates whether the new order is compatible for batching by checking: restaurant proximity (within 2km), delivery address direction (same general area), and time impact (batching shouldn’t delay either order by more than 5-10 minutes).
  3. If compatible, the Dispatch Engine calls the Routing Service to solve the multi-stop routing problem using constraint optimization (pickup must occur before delivery for each order).
  4. The Routing Service uses Google OR-Tools or a similar constraint solver to find the optimal route sequence that minimizes total delivery time.
  5. If batching provides time savings or minimal delay, the order is assigned to the Dasher with updated routing instructions.
  6. The Dasher receives the optimized route with all pickup and delivery stops in the correct sequence.
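The compatibility check in step 2 boils down to three thresholds. A hedged sketch of that gate (the inputs are assumed to come from the Routing Service; `solo_minutes` is the existing order's unbatched delivery time, `batched_minutes` its time if batched):

```python
def batching_compatible(restaurant_gap_km, delivery_gap_km,
                        solo_minutes, batched_minutes,
                        max_added_delay_min=5.0):
    """Heuristic batching gate: nearby restaurants, nearby drop-offs,
    and bounded time impact on the order already in progress."""
    if restaurant_gap_km > 2.0:        # restaurants too far apart to batch
        return False
    if delivery_gap_km > 3.0:          # drop-offs not in the same general area
        return False
    added_delay = batched_minutes - solo_minutes
    return added_delay <= max_added_delay_min
```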

6. The system should support payment processing and payouts

Payment flows involve multiple parties:

  • The Payment Service handles customer charges, Dasher payouts, and restaurant settlements.
  • Uses third-party payment processors with PCI DSS compliance.
  • Implements fraud detection algorithms to identify suspicious transactions.

Payment Flow:

  1. When an order is placed, the customer’s payment method is pre-authorized for the total amount.
  2. After successful delivery, the Payment Service captures the payment.
  3. The system calculates the distribution: restaurant receives order subtotal minus commission (typically 20-30%), Dasher receives base pay plus tip, DoorDash keeps delivery fee and commission.
  4. Restaurant settlements occur weekly or biweekly via ACH transfer.
  5. Dasher earnings are available for instant payout (for a small fee) or automatic weekly deposit.
  6. If a customer requests a refund, the Payment Service processes it and adjusts payouts accordingly.
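The distribution in step 3 is plain arithmetic. A sketch in cents (the 25% commission is an assumed mid-range value of the stated 20-30%, and funding the Dasher's base pay out of the delivery fee and commission is an assumption about how the platform share is computed):

```python
def split_payment(subtotal_cents, delivery_fee_cents, tip_cents,
                  dasher_base_pay_cents, commission_rate=0.25):
    """Split an order's money: restaurant gets subtotal minus commission,
    Dasher gets base pay plus tip, platform keeps the remainder."""
    commission = round(subtotal_cents * commission_rate)
    restaurant = subtotal_cents - commission
    dasher = dasher_base_pay_cents + tip_cents
    platform = delivery_fee_cents + commission - dasher_base_pay_cents
    return {"restaurant": restaurant, "dasher": dasher, "platform": platform}
```

The three shares always sum to what the customer paid (subtotal + delivery fee + tip), which is a useful invariant to assert in reconciliation jobs.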

Step 3: Design Deep Dive

With the core functional requirements met, it’s time to dig into the non-functional requirements via deep dives. These are the critical areas that separate good designs from great ones.

Deep Dive 1: How do we implement the order-to-Dasher matching algorithm to optimize for multiple objectives?

The Dispatch Engine is the most complex component of DoorDash, responsible for intelligently matching orders to Dashers while balancing multiple competing objectives: fast delivery times, Dasher earnings, customer satisfaction, and system efficiency.

Multi-Factor Scoring System:

The matching algorithm uses a weighted scoring system that evaluates each potential Dasher based on several factors. First, we find eligible Dashers using geospatial queries. The Location Service uses Redis GEORADIUS to find all Dashers within a certain radius of the restaurant (starting with 5km, expanding to 10km if no matches found). This query filters for Dashers with status “AVAILABLE” and current capacity below their maximum (typically 2-3 orders).

For each eligible Dasher, we calculate a composite match score. Distance is the most important factor, weighted at 40%. We calculate the distance from the Dasher’s current location to the restaurant, with closer Dashers receiving higher scores. The score decreases linearly with distance, starting at 100 points for very close Dashers and decreasing by 10 points per kilometer.

Dasher rating contributes 20% to the score. Higher-rated Dashers provide better customer experiences, so we normalize their rating (typically 1-5 stars) to a 0-100 scale. A Dasher with a 5.0 rating gets 100 points, while a 4.0 rating gets 80 points.

Acceptance rate is weighted at 15%. Dashers who consistently accept orders demonstrate reliability and commitment. We use their historical acceptance rate as a percentage directly as the score component.

Vehicle suitability adds 10% to the score. Large orders (many items or large physical size) should be assigned to Dashers with cars rather than bikes or scooters. Orders with temperature-sensitive items might prefer Dashers with insulated bags.

Finally, estimated time to pickup contributes 15%. Using the Routing Service, we calculate how long it would take the Dasher to reach the restaurant considering current traffic. Shorter pickup times receive higher scores, decreasing by 5 points per minute of travel time.

The total score is the weighted sum of all these factors. We then sort Dashers by their scores in descending order and attempt to assign the order to the highest-scoring Dasher.
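Putting the per-factor formulas and weights together, the composite score can be sketched directly (the 0-100 `vehicle_suitability` input is an assumption; the text describes it only qualitatively):

```python
def match_score(distance_km, rating, acceptance_pct, vehicle_suitability, pickup_min):
    """Composite Dasher score: weighted sum of the five factors above,
    each normalized to a 0-100 scale."""
    distance_score = max(0.0, 100.0 - 10.0 * distance_km)  # -10 points per km
    rating_score = rating * 20.0                           # 1-5 stars -> 20-100
    pickup_score = max(0.0, 100.0 - 5.0 * pickup_min)      # -5 points per minute
    return (0.40 * distance_score
            + 0.20 * rating_score
            + 0.15 * acceptance_pct       # historical acceptance rate, 0-100
            + 0.10 * vehicle_suitability  # assumed 0-100 suitability input
            + 0.15 * pickup_score)
```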

Order Batching for Efficiency:

To maximize Dasher efficiency and system throughput, we implement intelligent order batching. When a Dasher is already handling one order and a new order comes in, we evaluate whether batching makes sense.

The batching algorithm first checks restaurant proximity. If two orders are from restaurants more than 2km apart, batching is not considered because the detour would be too large. Next, it checks delivery address proximity and direction. The delivery locations should be in the same general area (within 3km) and ideally in the same direction from the restaurants.

The critical factor is time impact. We use the Routing Service to calculate two scenarios: the time to complete both orders separately versus the time to complete them as a batch. If batching adds more than 5 minutes to either order’s delivery time, we reject it. However, if batching provides time savings (because restaurants are close together) or adds minimal time, it’s approved.

When batching is approved, we solve the Vehicle Routing Problem to determine the optimal sequence of stops. This is a constrained optimization problem where each order’s pickup must occur before its delivery. We use constraint solvers like Google OR-Tools to find the shortest total route that satisfies all constraints.

The final route might look like: Current Location -> Restaurant A (pickup Order 1) -> Restaurant B (pickup Order 2) -> Delivery Address 1 -> Delivery Address 2. The algorithm ensures food quality by minimizing time between pickup and delivery and considers temperature-sensitive items.
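For the 2-3 order batches Dashers actually carry, the precedence-constrained routing problem is small enough to brute-force, which makes the structure easy to see. A toy sketch with Euclidean distances (a real system would use road distances and a solver like OR-Tools, as noted above):

```python
from itertools import permutations

def best_route(start, stops, precedence):
    """Shortest stop sequence where each order's pickup precedes its delivery.
    `stops` maps stop name -> (x, y); `precedence` is (pickup, delivery) pairs."""
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    def legal(route):
        pos = {stop: i for i, stop in enumerate(route)}
        return all(pos[p] < pos[d] for p, d in precedence)

    def length(route):
        pts = [start] + [stops[s] for s in route]
        return sum(dist(pts[i], pts[i + 1]) for i in range(len(pts) - 1))

    return min((r for r in permutations(stops) if legal(r)), key=length)
```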

Deep Dive 2: How do we handle high-frequency location updates efficiently while maintaining accuracy?

With 100,000 active Dashers in a large city sending location updates every 5-10 seconds, we’re dealing with approximately 10,000-20,000 writes per second in that city alone. Traditional relational databases would struggle with this write load, and the cost would be prohibitive.

Geospatial Data Store with Redis:

Redis provides excellent geospatial capabilities through its sorted sets with geohash encoding. When a Dasher sends a location update, we use the GEOADD command to add their position to a sorted set named “dasher-locations”. This operation is extremely fast (sub-millisecond) and Redis can handle hundreds of thousands of writes per second with proper hardware.

The geohash encoding is key to efficient proximity searches. Geohash converts a two-dimensional latitude/longitude coordinate into a single string. Nearby locations share longer common prefixes in their geohashes, allowing for fast range queries. Redis uses this property to implement GEORADIUS commands that find all members within a certain radius of a point in O(log N) time.

For each location update, we also update a secondary key-value pair tracking the Dasher’s last update timestamp. This helps us identify stale data and remove Dashers who haven’t updated their location recently (indicating they went offline).

Handling Stale Data with TTL:

Location data has a short shelf life. If a Dasher hasn’t updated their location in 5 minutes, we consider them potentially offline and shouldn’t assign orders to them. Redis doesn’t support TTL on individual sorted set members, so we use a multi-key approach.

We create time-bucketed sorted sets based on the minute timestamp (e.g., “dasher-locations:2024-03-28:14:35”). Each bucket has a 10-minute TTL, ensuring automatic cleanup. When querying for nearby Dashers, we query multiple recent buckets and deduplicate based on the most recent update per Dasher.

Alternatively, we can maintain a separate key-value store where each Dasher has their own key with a TTL. When the key expires, Redis can emit a keyspace notification that triggers cleanup from the geospatial sorted set.
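Whichever keying scheme is used, the read path is the same: deduplicate to the latest fix per Dasher and drop anyone past the freshness window. A sketch of that step (the Redis plumbing is replaced by a plain list of updates):

```python
FRESHNESS_SECONDS = 300  # 5 minutes, per the staleness rule above

def live_dasher_positions(updates, now):
    """Reduce bucketed location updates to the most recent fix per Dasher,
    dropping Dashers whose last update is older than the freshness window.
    `updates` is a list of (dasher_id, epoch_seconds, (lat, lng)) tuples."""
    latest = {}
    for dasher_id, ts, pos in updates:
        if dasher_id not in latest or ts > latest[dasher_id][0]:
            latest[dasher_id] = (ts, pos)
    return {d: pos for d, (ts, pos) in latest.items()
            if now - ts <= FRESHNESS_SECONDS}
```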

Client-Side Optimization:

We can dramatically reduce server load by optimizing location updates on the client side. The Dasher app uses on-device sensors to implement adaptive update frequency. When the Dasher is stationary, updates are sent every 30 seconds. When moving slowly (less than 10 mph), updates are sent every 10 seconds. When moving quickly (highway speeds), updates are sent every 5 seconds.

Additionally, the client only sends updates if the Dasher has moved a significant distance (e.g., 50 meters) from the last update. This prevents unnecessary updates when the Dasher is stuck in traffic or at a red light.

The client also batches updates when network conditions are poor and implements exponential backoff on failures. During low battery conditions, update frequency is reduced further to preserve battery life.

These optimizations can reduce location update volume by 60-80% while maintaining sufficient accuracy for matching and tracking.
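The client-side policy above amounts to two small decisions: how often to sample, and whether a sample is worth sending. A sketch (the halved cadence on low battery is an assumed concrete form of "reduced further"):

```python
def update_interval_seconds(speed_mph, battery_low=False):
    """Adaptive update cadence: stationary -> 30s, slow -> 10s, fast -> 5s.
    On low battery the interval is doubled (assumed policy)."""
    if speed_mph < 1:        # effectively stationary
        interval = 30
    elif speed_mph < 10:     # walking pace / slow traffic
        interval = 10
    else:                    # driving at speed
        interval = 5
    return interval * 2 if battery_low else interval

def should_send(distance_since_last_m, min_distance_m=50):
    """Suppress updates when the Dasher hasn't moved meaningfully."""
    return distance_since_last_m >= min_distance_m
```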

Deep Dive 3: How do we prevent race conditions in order assignment with distributed locking?

Ensuring strong consistency in order assignment is critical. We must guarantee that each order is assigned to exactly one Dasher and that each Dasher receives only one order request at a time (unless they’re already handling orders and have capacity for batching).

Distributed Locking with Redis:

When the Dispatch Engine identifies a candidate Dasher, it attempts to acquire a distributed lock before sending the order notification. We use Redis SET command with the NX (only set if not exists) and EX (expiration time) flags for atomic lock acquisition.

The lock key is based on the Dasher ID (e.g., “lock:dasher:12345”) and the value contains the order ID being offered. The expiration is set to 15 seconds, matching the time window for the Dasher to respond.

If the SET command returns success, we’ve acquired the lock and can send the notification to the Dasher. If it returns failure, another process already holds the lock (the Dasher is considering a different order), so we skip this Dasher and try the next one.

When the Dasher accepts or declines, we explicitly release the lock using DEL. If the Dasher doesn’t respond within the timeout, the lock automatically expires due to the TTL, making the Dasher available for new requests.

Sequential Matching with While Loop:

The Dispatch Engine uses a sequential matching approach rather than parallel broadcasts. It maintains a ranked list of candidate Dashers and processes them one at a time in a while loop.

For each iteration, we select the next highest-scoring Dasher and attempt to acquire their lock. If successful, we send them the order notification and wait for a response. If they accept, we update the order status and assign the Dasher, completing the matching process. If they decline or timeout, we release the lock and continue to the next Dasher in the list.

If we fail to acquire the lock (the Dasher is already considering another order), we immediately skip to the next Dasher without waiting. This ensures we don’t waste time on unavailable Dashers.

We set a maximum number of attempts (typically 5-10 Dashers) before giving up and notifying the customer that no Dasher is available. This triggers surge pricing or other demand management strategies.
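The lock-then-offer loop can be sketched end to end. The `FakeRedis` class below is an in-memory stand-in for Redis `SET key value NX EX` and `DEL` so the example is self-contained; `offer` is an injected callable representing the notify-and-wait step:

```python
import time

class FakeRedis:
    """Tiny in-memory stand-in for Redis SET NX EX / DEL, for illustration only."""
    def __init__(self):
        self._store = {}  # key -> (value, expiry_epoch)

    def set_nx_ex(self, key, value, ttl_s):
        entry = self._store.get(key)
        if entry and entry[1] > time.time():
            return False                      # lock already held
        self._store[key] = (value, time.time() + ttl_s)
        return True

    def delete(self, key):
        self._store.pop(key, None)

def assign_order(redis, order_id, ranked_dashers, offer, max_attempts=5):
    """Sequential matching loop: lock the Dasher, make the offer,
    release on decline/timeout, skip Dashers already locked."""
    for dasher_id in ranked_dashers[:max_attempts]:
        lock_key = f"lock:dasher:{dasher_id}"
        if not redis.set_nx_ex(lock_key, order_id, ttl_s=15):
            continue                          # Dasher considering another offer
        if offer(dasher_id, order_id):        # push notification + await response
            return dasher_id                  # accepted; caller releases the lock
        redis.delete(lock_key)                # declined or timed out: free the Dasher
    return None                               # no match; trigger fallback/surge
```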

Preventing Double Assignment:

We also implement an order-level lock to prevent the same order from being processed by multiple Dispatch Engine instances simultaneously. When an order event is consumed from Kafka, we attempt to acquire a lock on the order ID. Only the instance that successfully acquires this lock processes the order.

This is particularly important in a distributed system where multiple instances might consume the same order event due to rebalancing or retry logic. The order lock ensures exactly-once processing semantics.

Deep Dive 4: How do we ensure order requests are not lost during system failures or peak demand?

System reliability is paramount. If we drop orders during crashes or peak demand, we lose customer trust and revenue. We need mechanisms to ensure durable order processing with retry capabilities.

Durable Message Queue with Kafka:

Instead of processing orders synchronously when they’re placed, we enqueue them to a durable message queue. Apache Kafka is ideal for this use case because it provides high throughput, durability, and strong delivery guarantees.

When a customer places an order, the Order Service creates the order in the database and publishes an order event to the “order-events” Kafka topic. This event contains all necessary information for matching: order ID, restaurant location, customer location, estimated preparation time, and priority.

The Dispatch Engine runs as a Kafka consumer group with multiple instances for parallel processing. Kafka partitions the topic across multiple instances, ensuring load distribution. Each order is assigned to a specific partition based on a partition key (typically the order ID or restaurant ID).

Consumer Group and Offset Management:

Kafka’s consumer group mechanism ensures that each order event is consumed by exactly one Dispatch Engine instance. If an instance crashes, Kafka automatically rebalances the partition assignments to healthy instances, ensuring no messages are lost.

The critical aspect is offset management. The Dispatch Engine only commits the Kafka offset after successfully matching the order and sending the notification to a Dasher. If processing fails midway (e.g., the instance crashes), the offset is not committed, and the message remains in Kafka.

When the partition is reassigned to a healthy instance, it resumes processing from the last committed offset, reprocessing the failed order. This provides at-least-once delivery semantics.

To achieve exactly-once semantics and prevent duplicate notifications, we implement idempotency. Each order processing attempt is idempotent: if the order has already been assigned (checked in the database), we skip processing and commit the offset.
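The idempotent consume step looks like this in outline (a sketch: `assigned_orders` stands in for the database lookup, and `match_fn`/`commit_fn` for the matching workflow and the Kafka offset commit):

```python
def process_order_event(event, assigned_orders, match_fn, commit_fn):
    """Idempotent consume step: skip orders already assigned (e.g. after a
    redelivery caused by a crash before the offset commit), otherwise match,
    record the assignment, and only then commit the offset."""
    order_id = event["order_id"]
    if order_id not in assigned_orders:       # idempotency check (DB lookup in prod)
        assigned_orders[order_id] = match_fn(event)
    commit_fn(event["offset"])                # commit only after the work is durable
    return assigned_orders[order_id]
```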

Handling Timeouts with Workflow Orchestration:

Matching is a long-running, human-in-the-loop workflow. We might need to wait 15 seconds for each Dasher response, potentially trying 5-10 Dashers before finding a match. If our Dispatch Engine instance crashes during this process, we need to resume from where we left off.

We use a workflow orchestration system like Temporal (or AWS Step Functions) to manage these durable workflows. The matching workflow is defined as a state machine with explicit states: PENDING, NOTIFYING_DASHER, AWAITING_RESPONSE, TRYING_NEXT_DASHER, COMPLETED, FAILED.

Temporal provides durable execution: the workflow state is persisted, and if the worker crashes, another worker can pick up the workflow from its last checkpoint. Timeouts are handled automatically—if a Dasher doesn’t respond within 15 seconds, the workflow automatically transitions to TRYING_NEXT_DASHER without manual intervention.

This approach ensures that every order is eventually matched or explicitly marked as unassignable, with no silent failures.

Deep Dive 5: How do we calculate accurate ETAs using machine learning?

Accurate delivery time estimation is crucial for customer satisfaction. If we consistently overestimate, customers wait unnecessarily; if we underestimate, we create disappointment and frustration. Traditional approaches using distance and average speed are too simplistic and don’t account for real-world factors.

Feature Engineering for ETA Prediction:

We train a machine learning model (typically XGBoost or LightGBM) on historical delivery data to predict delivery times. The model takes numerous features as input:

Distance features include the straight-line distance from restaurant to delivery address, the distance from the Dasher’s current location to the restaurant, and the actual driving distance along roads (from Google Maps API).

Time-based features capture patterns: hour of day (lunch and dinner rushes have higher traffic), day of week (weekends differ from weekdays), whether it’s a holiday, and whether it’s currently a peak hour (11am-2pm and 5pm-9pm).

Restaurant-specific features include the average preparation time for this restaurant (some restaurants are consistently faster), the restaurant’s current order queue size (more orders mean longer prep times), and historical delay patterns for this restaurant.

Order characteristics matter too: the number of items in the order (larger orders take longer to prepare), order complexity (customizations slow down preparation), and total order value (larger orders might require more packaging).

Dasher features include their current capacity (are they handling multiple orders?), their historical average speed, their vehicle type (bikes are slower than cars but better in traffic), and their rating (higher-rated Dashers tend to be more efficient).

External factors like current traffic conditions (from Google Maps Traffic API) and weather conditions (rain slows deliveries) are incorporated.

Finally, we include historical features: the average delay for this restaurant in the past hour, the average delivery time for this route in the past week, and surge pricing indicators (high demand often correlates with longer times).
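A minimal sketch of assembling these features into a model input, covering one or two features from each group above. All field names on the input dicts are illustrative assumptions, not a real schema:

```python
def build_eta_features(order, dasher, restaurant, context):
    """Assemble the feature dict fed to the ETA model.
    Inputs are plain dicts; field names are made up for illustration."""
    hour = context["hour"]
    return {
        # distance features
        "rest_to_dropoff_km": order["rest_to_dropoff_km"],
        "dasher_to_rest_km": dasher["dist_to_rest_km"],
        "driving_distance_km": order["driving_km"],
        # time-based features (peak = 11am-2pm and 5pm-9pm, per the text)
        "hour_of_day": hour,
        "day_of_week": context["dow"],
        "is_peak": int(hour in range(11, 14) or hour in range(17, 21)),
        # restaurant-specific features
        "avg_prep_minutes": restaurant["avg_prep_min"],
        "queue_size": restaurant["open_orders"],
        # order characteristics
        "item_count": order["item_count"],
        "has_customizations": int(order["customizations"] > 0),
        # Dasher features
        "active_orders": dasher["active_orders"],
        "vehicle_is_bike": int(dasher["vehicle"] == "bike"),
        # external factors
        "traffic_factor": context["traffic_factor"],
        "is_raining": int(context["raining"]),
    }
```

A real pipeline would add the historical aggregates (past-hour restaurant delay, past-week route time) from a feature store, but the shape is the same: one flat numeric dict per prediction.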

Model Training and Deployment:

We train the model on millions of historical completed deliveries. For each delivery, we calculate the actual delivery time (from order placement to delivery) and use the features that were present at the time of order placement.

The model learns complex patterns, such as: restaurants near stadiums are slower on game days, certain restaurants are consistently faster during lunch than dinner, specific neighborhoods have predictable traffic patterns, and Dashers with higher ratings deliver faster on average.

The trained model is deployed as a microservice that the Order Service and Tracking Service call for predictions. When an order is placed, we predict the initial ETA. As the Dasher makes progress (picks up the order, starts driving), we continuously update the ETA with real-time location data.

We add a conservative buffer during peak hours (multiply prediction by 1.15) to account for increased variability. We also track prediction accuracy in production and retrain the model weekly with fresh data to capture changing patterns.
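The peak-hour buffer is simple enough to show directly. `model_predict` here is a stand-in for the deployed model's scoring call; the 1.15 factor and the peak windows come from the text, everything else is a sketch:

```python
# Peak windows from the text: lunch 11am-2pm, dinner 5pm-9pm.
PEAK_HOURS = set(range(11, 14)) | set(range(17, 21))
PEAK_BUFFER = 1.15  # conservative multiplier applied during peaks

def predict_eta_minutes(model_predict, features, hour_of_day):
    """Wrap the raw ML prediction with the peak-hour buffer so that
    high-variability periods get a conservative estimate."""
    eta = model_predict(features)
    if hour_of_day in PEAK_HOURS:
        eta *= PEAK_BUFFER
    return round(eta, 1)
```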

Continuous ETA Updates:

ETAs are not static. As the Dasher progresses, we update the prediction. When the order is confirmed, we predict the total time. When the Dasher accepts, we refine the prediction based on their specific location and characteristics. When the Dasher picks up the order, we recalculate based on their current position and traffic to the delivery address.

The Tracking Service receives location updates every 5-10 seconds and recalculates the ETA by combining the ML model prediction with real-time routing data from Google Maps. The updated ETA is pushed to the customer via WebSocket.
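How exactly the ML prediction and the live routing ETA are combined isn't specified, so the following is one plausible scheme, not DoorDash's actual formula: weight the ML model heavily early (it knows prep time and restaurant behavior) and the live routing ETA heavily late (once the Dasher is driving, the map knows best).

```python
def blended_eta_minutes(ml_eta, routing_eta, progress):
    """Blend the ML prediction with the live routing ETA.
    `progress` in [0, 1] is how far along the delivery is (0 = just
    confirmed, 1 = en route to the door). The linear weighting is an
    assumption for illustration."""
    w = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    return (1 - w) * ml_eta + w * routing_eta
```

On each 5-10 second location update, the Tracking Service would recompute `routing_eta`, re-blend, and push the result to the customer over the WebSocket.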

Deep Dive 6: How do we optimize multi-stop delivery routes with the Vehicle Routing Problem?

When a Dasher handles multiple orders simultaneously, determining the optimal sequence of pickups and deliveries is a complex optimization problem known as the Vehicle Routing Problem (VRP). Poor routing can significantly delay deliveries and degrade food quality.

Constraint Optimization with OR-Tools:

We use Google OR-Tools, a suite of optimization tools, to solve this problem. The VRP solver takes as input all the locations (Dasher’s current position, all pickup locations, all delivery locations) and finds the optimal sequence that minimizes total travel time.

The key constraint is that each order’s pickup must occur before its delivery. We can’t deliver food before picking it up from the restaurant. This precedence constraint is enforced by the solver.

We build a distance matrix that contains the travel time between every pair of locations. Instead of straight-line distances, we use Google Maps Distance Matrix API to get real driving times considering current traffic. This matrix is fed to the solver.

The solver uses a combination of heuristics and optimization algorithms. It starts with a greedy solution (e.g., nearest neighbor), then iteratively improves it through local search operations like 2-opt (swapping pairs of edges) and relocate (moving a stop to a different position in the route).
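As an illustration of the greedy seed the solver starts from, here is a nearest-neighbor construction in plain Python (not OR-Tools; names and the 1-D `travel_time` are ours). The precedence constraint is enforced by only making a dropoff eligible after its pickup has been visited:

```python
def greedy_route(start, orders, travel_time):
    """Nearest-neighbor route construction for a multi-order Dasher.
    `orders` maps order_id -> (pickup_loc, dropoff_loc); `travel_time`
    is the pre-fetched travel-time lookup between two locations (the
    distance matrix). Returns a list of (kind, order_id) stops.
    A real solver would then improve this seed with 2-opt and
    relocate moves."""
    here = start
    picked = set()       # orders whose food is on board
    remaining = dict(orders)
    route = []
    while remaining:
        candidates = []
        for oid, (pickup, dropoff) in remaining.items():
            if oid not in picked:
                # not yet picked up: only the pickup is eligible
                candidates.append((travel_time(here, pickup), "pickup", oid, pickup))
            else:
                candidates.append((travel_time(here, dropoff), "dropoff", oid, dropoff))
        _, kind, oid, loc = min(candidates)
        route.append((kind, oid))
        here = loc
        if kind == "pickup":
            picked.add(oid)
        else:
            del remaining[oid]
    return route
```

With orders A (pickup at 1, dropoff at 5) and B (pickup at 2, dropoff at 3) on a number line starting at 0, the greedy route interleaves the two orders rather than serving them one at a time.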

Time Windows and Deadlines:

We can add soft time windows, which penalize late visits in the objective rather than forbidding them outright. For example, if one order has been waiting longer, we can add a preference to pick it up sooner. If one delivery address is on the way to another, we can bias the solution toward visiting it first.

Hard deadlines can be enforced: if an order promises delivery by 7:00 PM, the solver ensures the route completes that delivery before the deadline. If it’s impossible to meet all deadlines, the solver reports this, and the system might need to reassign orders.

Real-Time Re-optimization:

Routes aren’t set in stone. If traffic conditions change dramatically or if a Dasher gets delayed at a restaurant, we re-optimize the route on the fly. The Routing Service continuously monitors progress and can recompute the optimal route based on the current situation.

If a new order becomes available that’s compatible with the Dasher’s current route, we can insert it into the sequence without significantly impacting existing deliveries. The solver evaluates whether adding the new stop increases total time by an acceptable amount.
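The cheapest-insertion check can be sketched for a single stop (a real check would insert the new order's pickup and dropoff as a precedence-linked pair, but the delta computation is the same idea; names are ours):

```python
def best_insertion(route, stop, travel_time, max_extra_minutes):
    """Find the cheapest position to insert `stop` into `route` (a list
    of locations, Dasher's current position first). Inserting between
    consecutive stops a and b adds t(a,stop) + t(stop,b) - t(a,b).
    Returns (index, extra_minutes), or None if every option exceeds
    the acceptable delay budget."""
    best = None
    for i in range(len(route) - 1):
        a, b = route[i], route[i + 1]
        extra = travel_time(a, stop) + travel_time(stop, b) - travel_time(a, b)
        if best is None or extra < best[1]:
            best = (i + 1, extra)
    # also consider appending after the final stop
    extra_end = travel_time(route[-1], stop)
    if best is None or extra_end < best[1]:
        best = (len(route), extra_end)
    return best if best[1] <= max_extra_minutes else None
```

If the cheapest insertion still blows the delay budget, the new order goes to a different Dasher instead.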

Deep Dive 7: How do we implement surge pricing to balance supply and demand?

During peak hours or in areas with high demand and low Dasher supply, surge pricing increases delivery fees to incentivize more Dashers to work and to reduce demand to manageable levels.

Supply and Demand Monitoring:

We partition the service area into geographic zones using H3, a hexagonal hierarchical geospatial indexing system. Each zone is about 2km across at resolution 7. Uniform hexagons give consistently sized zones, rather than irregular administrative boundaries like zip codes.

For each zone, we track supply and demand in real-time. Supply is the number of available Dashers in the zone (queried from Redis geospatial data). Demand is the number of pending orders awaiting assignment in the zone (queried from the database or cached in Redis).

We calculate a demand-to-supply ratio. A ratio above 1.0 indicates more orders than Dashers, suggesting we should increase prices. A ratio below 1.0 indicates excess capacity, suggesting prices can remain at baseline.

Dynamic Multiplier Calculation:

Based on the ratio, we apply a surge multiplier to the delivery fee. The multiplier ranges from 1.0 (no surge) to 2.0 (maximum surge). For example:

  • Ratio below 1.0: Multiplier = 1.0 (no surge)
  • Ratio 1.0 up to 2.0: Multiplier = 1.25 (mild surge)
  • Ratio 2.0 up to 3.0: Multiplier = 1.5 (moderate surge)
  • Ratio 3.0 up to 5.0: Multiplier = 1.75 (high surge)
  • Ratio 5.0 and above: Multiplier = 2.0 (maximum surge)

We also apply time-based adjustments. During known peak hours (lunch: 11am-2pm, dinner: 6pm-9pm), we multiply the surge by an additional 1.1 to preemptively incentivize Dasher supply.

The surge multiplier is cached in Redis with a 60-second TTL. This balances freshness (prices update quickly) with stability (customers don’t see wild price fluctuations).
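The tiered multiplier plus the peak-hour adjustment fits in one small function. The tiers and the 1.1x factor come from the text; capping the combined value at 2.0 and treating a zone with zero Dashers as supply of one are our assumptions:

```python
def surge_multiplier(demand, supply, hour, cap=2.0):
    """Map a zone's demand-to-supply ratio to the tiered surge
    multiplier, apply the peak-hour adjustment, and cap the result.
    This is the value that would be cached in Redis with a 60s TTL."""
    ratio = demand / max(supply, 1)  # avoid dividing by zero Dashers
    if ratio < 1.0:
        mult = 1.0
    elif ratio < 2.0:
        mult = 1.25
    elif ratio < 3.0:
        mult = 1.5
    elif ratio < 5.0:
        mult = 1.75
    else:
        mult = 2.0
    # peak windows per this section: lunch 11am-2pm, dinner 6pm-9pm
    if hour in range(11, 14) or hour in range(18, 21):
        mult *= 1.1
    return min(round(mult, 2), cap)
```

For example, 25 pending orders against 10 available Dashers (ratio 2.5) yields a 1.5x multiplier off-peak and 1.65x during lunch.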

Applying Surge to Orders:

When a customer requests a fare estimate, the Order Service queries the surge multiplier for the delivery zone. The base delivery fee (typically $2.99) is multiplied by the surge factor. If surge is 1.5x, the delivery fee becomes $4.49.

The customer sees the surged price before placing their order, ensuring transparency. The UI clearly indicates when surge pricing is active.

Surge only applies to the delivery fee, not the food subtotal. This ensures restaurants aren’t affected and the pricing increase is directly tied to delivery logistics.

Incentivizing Dasher Supply:

The increased delivery fees translate to higher Dasher payouts. During surge periods, Dashers earn more per delivery, incentivizing them to log in and accept orders. Over time, the increased supply brings the demand-to-supply ratio back down, reducing surge multipliers.

We can also send targeted notifications to nearby offline Dashers informing them of surge pricing opportunities, further increasing supply during peak demand.

Step 4: Wrap Up

In this design, we proposed a comprehensive system architecture for a food delivery platform like DoorDash. The design handles millions of daily orders with real-time tracking, intelligent dispatch, and optimized delivery routes. If there is extra time at the end of the interview, here are additional points to discuss:

Additional Features:

  • Rating System: Allow customers to rate restaurants and Dashers post-delivery, and allow Dashers to rate customers. Ratings feed into matching algorithms and identify problematic accounts.
  • Scheduled Orders: Enable customers to schedule orders for future delivery (e.g., lunch tomorrow at noon). The system queues these orders and begins matching at the appropriate time.
  • DashPass Subscription: Offer a subscription service with free delivery and reduced service fees. Requires tracking subscription status and applying appropriate pricing.
  • Group Orders: Allow multiple people to add items to a shared cart for office lunches or family dinners. Requires shared cart state management and split payment handling.
  • Restaurant Analytics Dashboard: Provide restaurants with insights into order volume, peak hours, popular items, and customer reviews. Helps restaurants optimize menus and staffing.
  • DoorDash Drive: White-label API for businesses to integrate DoorDash’s logistics network for their own deliveries. Requires separate authentication, pricing, and SLA management.
  • Ghost Kitchens: Support virtual restaurants that exist only on DoorDash without physical storefronts. Enables restaurants to experiment with new concepts and expand to new markets cheaply.

Scaling Considerations:

  • Geographic Sharding: Deploy region-specific clusters for major metropolitan areas. Orders are routed to the nearest regional cluster based on delivery address. Restaurant and menu data are replicated across regions for fast local access.
  • Database Sharding: Shard orders by customer ID or restaurant ID depending on access patterns. Use Vitess or Citus for transparent sharding with cross-shard query support.
  • Caching Layers: Implement multi-level caching: application-level cache (in-memory), distributed cache (Redis), and CDN for static assets. Cache restaurant menus with 5-minute TTL, Dasher locations with 1-minute TTL, and user session data.
  • Read Replicas: Use database read replicas for analytics queries, restaurant dashboards, and historical reporting. Write to primary, read from replicas with eventual consistency.
  • Autoscaling: Configure autoscaling groups for all stateless services based on CPU, memory, and request rate metrics. Ensure the system can handle 5x peak traffic automatically.
  • Connection Pooling: Use connection poolers (PgBouncer for PostgreSQL) to manage database connections efficiently and prevent exhaustion during traffic spikes.

Error Handling and Resilience:

  • Circuit Breakers: Implement circuit breakers for third-party API calls (Google Maps, payment processors). If they fail, fall back to cached data or gracefully degrade functionality.
  • Retry Logic: Use exponential backoff with jitter for retrying failed operations. Distinguish between retriable errors (network timeouts) and non-retriable errors (invalid data).
  • Idempotency: Ensure all critical operations are idempotent. Use idempotency keys for payment processing to prevent double charges on retries.
  • Graceful Degradation: During partial outages, disable non-critical features while keeping core order placement and tracking functional. For example, disable advanced search filters but keep basic restaurant browsing.
  • Chaos Engineering: Regularly test system resilience by intentionally injecting failures (kill instances, introduce latency, simulate third-party API failures) to validate recovery mechanisms.

Monitoring and Observability:

  • Key Metrics: Track order placement latency (p50, p95, p99), Dasher assignment time, order completion rate, average delivery time, ETA accuracy (actual vs predicted), Dasher utilization rate, surge pricing frequency, and customer satisfaction scores.
  • Distributed Tracing: Use Jaeger or Zipkin to trace requests across microservices. Track the full order lifecycle from placement through matching, pickup, and delivery to identify bottlenecks.
  • Real-Time Dashboards: Build operations dashboards showing current order volume by region, active Dashers, matching success rate, average wait times, and system health indicators.
  • Alerting: Configure alerts for critical issues: order matching failures, payment processing errors, high order cancellation rates, ETA inaccuracies exceeding thresholds, and service latency spikes.
  • Logging: Use structured logging with correlation IDs to trace requests across services. Centralize logs with ELK stack or similar for searchability.

Security Considerations:

  • Authentication and Authorization: Use JWT tokens for API authentication. Implement role-based access control (RBAC) for different user types (customers, Dashers, restaurants, admins). Validate tokens on every request at the API gateway.
  • Data Encryption: Encrypt sensitive data in transit using TLS 1.3. Encrypt payment information at rest using AES-256. Use separate encryption keys per tenant and rotate keys regularly.
  • PCI DSS Compliance: For payment processing, ensure PCI DSS Level 1 compliance. Tokenize credit card numbers and never store raw card data. Use certified third-party payment processors.
  • Rate Limiting: Implement rate limiting at the API gateway to prevent abuse. Use different limits for different endpoints and user types. Apply stricter limits during DDoS attacks.
  • Input Validation: Validate and sanitize all user inputs to prevent SQL injection, XSS, and other injection attacks. Use parameterized queries and escape user-generated content.
  • Privacy Compliance: Comply with GDPR, CCPA, and other privacy regulations. Implement data retention policies, provide data export capabilities, and support account deletion with proper data cleanup.

Future Improvements:

  • Predictive Dasher Positioning: Use machine learning to predict demand hotspots (based on historical data, events, weather) and proactively incentivize Dashers to position themselves in high-demand areas before orders come in.
  • Autonomous Delivery: Integrate with delivery robots and drones for contactless, last-mile delivery in select markets. Requires coordination with robot fleets and regulatory compliance.
  • Demand Forecasting: Predict order volume for restaurants to help them optimize ingredient purchasing and staffing. Reduces food waste and improves preparation times.
  • Smart Packaging Recommendations: Suggest optimal packaging based on order contents to maintain food quality during delivery. Cold items stay cold, hot items stay hot.
  • Expansion Beyond Food: Extend the platform to grocery delivery, pharmacy fulfillment, and retail logistics. Requires inventory management systems and partnerships with grocery stores.
  • Kitchen Optimization: Provide restaurants with tablets that integrate with their kitchen display systems, automatically prioritizing orders based on pickup times to ensure food is ready when Dashers arrive.
  • Carbon Footprint Reduction: Optimize routes not just for time but also for fuel efficiency. Partner with electric vehicle and bike Dashers to reduce environmental impact.

Data and Analytics:

  • A/B Testing Framework: Implement experimentation infrastructure to test changes to the matching algorithm, pricing models, UI/UX, and notification strategies. Measure impact on key metrics before full rollout.
  • Recommendation Engine: Use collaborative filtering and machine learning to recommend restaurants and dishes to customers based on their order history and preferences.
  • Fraud Detection: Build ML models to detect fraudulent orders, fake accounts, and Dasher behavior anomalies. Flag suspicious activity for manual review.
  • Supply and Demand Prediction: Forecast Dasher supply and order demand by hour and location to optimize surge pricing proactively and incentivize Dashers ahead of peak periods.
  • Menu Optimization: Analyze which menu items are most popular, which have high return rates, and which are most profitable. Provide insights to restaurants for menu engineering.

This design provides a solid foundation for building a production-grade food delivery platform capable of handling millions of orders per day with real-time tracking, intelligent dispatch, optimized routing, and dynamic pricing. The system is designed for scale, reliability, and excellent customer experience.


Summary

This comprehensive guide covered the design of a food delivery platform like DoorDash, including:

  1. Core Functionality: Restaurant search, order placement, intelligent Dasher matching, real-time tracking, multi-stop route optimization, and payment processing.
  2. Key Challenges: High-frequency location updates, efficient proximity searches, distributed locking for order assignment, durable order processing, accurate ETA prediction, and multi-stop route optimization.
  3. Solutions: Redis geospatial data structures (GEOADD, GEORADIUS), multi-factor scoring for matching, distributed locking with Redis, durable message queues with Kafka, machine learning for ETA prediction, constraint optimization for routing (Google OR-Tools), and dynamic surge pricing.
  4. Scalability: Geographic sharding, database sharding, multi-level caching, read replicas, autoscaling, and connection pooling.

The design demonstrates how to build a complex three-sided marketplace with real-time systems, strong consistency requirements, machine learning-driven predictions, and sophisticated optimization algorithms to balance multiple objectives.