Design Booking.com

Booking.com is one of the world’s largest online travel platforms, processing millions of hotel reservations daily across 200+ countries. The system must handle complex inventory management, real-time availability checks, concurrent bookings, payment processing in multiple currencies, and a sophisticated search engine that filters through millions of properties. This design explores how to build a production-grade hotel booking platform that maintains data consistency during high-concurrency booking scenarios while providing sub-second search results.

Step 1: Understand the Problem and Establish Design Scope

Before diving into the design, it’s crucial to define the functional and non-functional requirements. For user-facing applications like this, functional requirements are the “Users should be able to…” statements, whereas non-functional requirements define system qualities.

Functional Requirements

Core Requirements (Priority 1-4):

Users should be able to search for hotels with complex filters including location, dates, price, amenities, ratings, and distance.
Users should be able to check real-time availability for specific room types across date ranges.
Users should be able to create bookings with atomic reservation guarantees and payment processing.
Users should be able to cancel or modify bookings according to cancellation policies.

Below the Line (Out of Scope):

Users should be able to submit reviews and ratings after completing their stay.
Hotel managers should be able to manage properties, room types, and pricing rules.
Users should be able to track loyalty points and redeem rewards.
Users should be able to receive personalized recommendations based on booking history.
Users should be able to set price alerts for specific hotels or destinations.

Non-Functional Requirements

Core Requirements:

The system should prioritize search performance with results returned in under 500ms at p99.
The system should ensure strong consistency for inventory and bookings to prevent double-booking.
The system should handle high throughput with 200 million searches per day and 1.5 million bookings per day.
The system should maintain 99.99% uptime with zero data loss for confirmed bookings.

Below the Line (Out of Scope):

The system should comply with PCI DSS for payment processing and GDPR for user data.
The system should support multiple currencies and payment methods globally.
The system should gracefully degrade during partial outages while maintaining core functionality.
The system should provide comprehensive monitoring and alerting for operational issues.

Clarification Questions & Assumptions:

Platform: Web and mobile applications for users; separate portal for hotel managers.
Scale: 1.5 million total properties with 29 million rooms globally; 100K concurrent users during peak.
Geographic Coverage: Global coverage with regional data centers for low latency.
Inventory Updates: Room availability changes frequently due to bookings and cancellations.
Payment: Integration with third-party payment processors supporting 100+ currencies.

Step 2: Propose High-Level Design and Get Buy-in

Planning the Approach

We’ll build the system sequentially, addressing each functional requirement. This ensures we cover the core booking workflow from search through payment completion while maintaining focus on the critical path.

Defining the Core Entities

To satisfy our key functional requirements, we’ll need the following entities:

Hotel: Represents a property available for booking. Includes information such as name, location coordinates, description, star rating, amenities, contact details, and verification status.

Room Type: Defines categories of rooms within a hotel. Contains the room name (e.g., Deluxe, Suite), base price, maximum occupancy, total inventory count, and room features. A hotel can have multiple room types.

Inventory: Tracks daily room availability for each room type. Records the date, total rooms, available rooms, blocked rooms, and overbooking limits. This is the source of truth for availability.

Booking: Represents a confirmed reservation. Includes the user details, hotel and room type identifiers, check-in and check-out dates, guest information, total price, payment status, and booking status (pending, confirmed, cancelled, completed).

Fare: An estimated price quote for a potential booking. Contains the room type, date range, base price, taxes, fees, total amount, and currency. This allows users to review pricing before confirming.

Review: Guest feedback submitted after stay completion. Includes rating scores, written review, photos, verification status, and hotel owner responses.

API Design

Search Hotels Endpoint: Used by users to find hotels matching their criteria. Returns a paginated list of hotels with availability and pricing information.

POST /hotels/search -> HotelSearchResults
Body: {
  location: string,
  checkIn: date,
  checkOut: date,
  guests: number,
  rooms: number,
  filters: { priceRange, starRating, amenities, reviewScore },
  sort: string
}

Check Availability Endpoint: Used to verify real-time availability for specific room types at a hotel.

POST /hotels/{hotelId}/availability -> AvailabilityResponse
Body: {
  checkIn: date,
  checkOut: date,
  roomTypes: string[]
}

Get Fare Estimate Endpoint: Calculates the total price including all taxes and fees for a potential booking.

POST /fare -> Fare
Body: {
  hotelId: string,
  roomTypeId: string,
  checkIn: date,
  checkOut: date,
  promoCode?: string
}

Create Booking Endpoint: Initiates a new booking with payment processing and inventory reservation.

POST /bookings -> Booking
Body: {
  fareId: string,
  guestDetails: object,
  paymentMethod: string,
  specialRequests?: string
}

Cancel Booking Endpoint: Cancels an existing booking and processes refunds according to the cancellation policy.

POST /bookings/{bookingId}/cancel -> CancellationResponse

High-Level Architecture

Let’s build up the system sequentially, addressing each functional requirement:

1. Users should be able to search for hotels with complex filters

The core components necessary for hotel search are:

User Client: Web and mobile applications serving as the primary interface for users to search hotels and make bookings.
API Gateway: Entry point handling authentication, rate limiting, and request routing to appropriate microservices.
Search Service: Manages hotel discovery using Elasticsearch for full-text search and geo-spatial queries. Handles complex filtering on price, amenities, ratings, and distance.
Elasticsearch Cluster: Distributed search engine storing hotel documents with indexed fields for fast retrieval. Contains hotel details, location coordinates, amenities, and review aggregations.
Redis Cache: Caches popular search results with short TTL (5 minutes) to reduce load on Elasticsearch and improve response times.
Database: PostgreSQL storing the source of truth for hotel data, synchronized to Elasticsearch via change data capture.

Search Flow:

User enters search criteria in the client app, which sends a POST request to the search endpoint.
API Gateway authenticates the request and forwards it to the Search Service.
Search Service generates a cache key from the search parameters and checks Redis for cached results.
If not cached, it constructs an Elasticsearch query with geo-distance filters, amenity filters, and price range filters.
Elasticsearch returns matching hotels ranked by relevance score combining popularity, review ratings, and distance.
Search Service enriches results with real-time availability and pricing by batching requests to the Inventory Service.
Results are cached in Redis and returned to the client.

2. Users should be able to check real-time availability for specific room types

We extend the design to support availability checks:

Inventory Service: Manages room availability as the single source of truth. Tracks daily inventory for each room type and handles reservation allocations.
Inventory Database: PostgreSQL with a table tracking available rooms per date. Uses row-level locking to prevent race conditions during concurrent bookings.
Redis Cache: Stores availability data with 30-second TTL for fast reads. Invalidated when bookings are created or cancelled.

Availability Check Flow:

User selects a hotel and specifies check-in/check-out dates, sending a request to the availability endpoint.
API Gateway routes the request to the Inventory Service.
Inventory Service checks Redis cache for availability data for the requested dates.
If cached data is fresh, it returns immediately. Otherwise, it queries the database for room inventory across the date range.
The service calculates available rooms considering confirmed bookings and overbooking limits.
Results are cached in Redis and returned to the client showing which room types are available.

3. Users should be able to create bookings with payment processing

We introduce additional services to handle the booking workflow:

Booking Service: Orchestrates the multi-step booking process including inventory reservation, pricing calculation, payment authorization, and booking confirmation.
Payment Service: Integrates with third-party payment processors (Stripe, Adyen) to handle payment authorization, currency conversion, fraud detection, and refund processing.
Pricing Service: Calculates dynamic pricing based on demand, seasonality, occupancy rates, and promotional offers. Applies taxes and fees according to local regulations.
Notification Service: Sends booking confirmations, cancellation receipts, and reminders via email and SMS using SendGrid and SNS.
Message Queue: Apache Kafka for asynchronous event processing. Publishes booking events consumed by notification, analytics, and auditing services.

Booking Flow:

User reviews fare estimate and confirms booking, sending a POST request with fare ID and payment details.
API Gateway forwards the request to the Booking Service.
Booking Service initiates a distributed transaction saga with multiple steps.
First, it validates the fare hasn’t expired and retrieves pricing details from the Pricing Service.
Next, it calls Inventory Service to reserve the room inventory with a pessimistic lock on the date range.
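The Inventory Service locks the relevant rows in the database, decrements available rooms atomically in one short transaction, and records a pending reservation. The row locks are released before the payment call, so they are never held across a slow external request.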
Booking Service then calls Payment Service to authorize the payment amount.
Payment Service performs fraud checks, currency conversion if needed, and authorizes the charge with the payment provider.
If payment succeeds, Booking Service creates the booking record with status “confirmed” and asks the Inventory Service to confirm the pending inventory reservation.
An event is published to Kafka triggering async processes for confirmation emails and analytics updates.
If any step fails, the saga executes compensation logic rolling back previous steps.

4. Users should be able to cancel or modify bookings

We add cancellation policy enforcement:

Cancellation Policy Engine: Calculates refund amounts based on policy rules defined per room type. Policies specify refund percentages at different time thresholds before check-in.

Cancellation Flow:

User requests to cancel their booking, sending a POST request with the booking ID.
Booking Service retrieves the booking details and the associated cancellation policy.
Policy Engine calculates the refund amount based on days remaining until check-in.
If a refund is due, Payment Service processes the refund to the original payment method.
Inventory Service releases the reserved rooms back into available inventory.
Booking record is updated to “cancelled” status.
Notification Service sends cancellation confirmation to the user.
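
The refund step of this flow can be sketched as a lookup over policy tiers. This is a minimal sketch, not the actual policy engine: the tier values in EXAMPLE_POLICY (full refund 7 or more days out, 50% from 2 to 6 days out, nothing later) and the function name refund_amount are assumptions made for illustration.

```python
from datetime import date
from decimal import Decimal

# Hypothetical policy: (minimum days before check-in, refund fraction).
EXAMPLE_POLICY = [(7, Decimal("1.00")), (2, Decimal("0.50"))]


def refund_amount(total_paid: Decimal, check_in: date, cancelled_on: date,
                  policy=EXAMPLE_POLICY) -> Decimal:
    """Return the refund owed for a cancellation under a tiered policy."""
    days_left = (check_in - cancelled_on).days
    # Check the most generous tier (largest day threshold) first.
    for min_days, fraction in sorted(policy, reverse=True):
        if days_left >= min_days:
            return (total_paid * fraction).quantize(Decimal("0.01"))
    # Past the last threshold: no refund is due.
    return Decimal("0.00")
```

Amounts use Decimal rather than float so that refund totals do not pick up binary rounding errors.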

Step 3: Design Deep Dive

With the core functional requirements met, it’s time to dig into the non-functional requirements and critical system challenges.

Deep Dive 1: How do we achieve sub-500ms search performance across millions of hotels?

Hotel search must filter through 1.5 million properties with complex criteria while delivering results fast enough to maintain a smooth user experience.

Challenge: Traditional relational databases struggle with multi-dimensional queries combining geo-spatial distance, text matching, numerical ranges, and array containment checks.

Solution: Elasticsearch with Optimized Indexing

Elasticsearch is designed for full-text search and provides excellent geo-spatial capabilities. We structure the hotel documents with carefully indexed fields.

Each hotel document contains the hotel ID, name, description, location as a geo_point type (latitude and longitude), star rating, aggregated review score, review count, amenities array, and nested room type objects with base prices. The location field uses geo_point mapping enabling efficient geo-distance queries.

Query Optimization Strategy:

When a search request arrives, we construct a compound query using Elasticsearch’s bool query. All yes/no criteria go in the filter clause, which skips scoring and lets Elasticsearch cache the results: a geo-distance filter matching hotels within a radius of the search location, range filters for review scores and price ranges, and term filters for amenities and star ratings. The must clause is reserved for criteria that should contribute to the relevance score, such as free-text matches on the hotel name.

For ranking, we use function_score queries to boost results based on multiple factors. Popularity score derived from booking volume affects ranking. Geo-distance decay gives preference to closer hotels. Review scores and recency of reviews influence the final score. The scoring combines these factors multiplicatively to generate the final relevance ranking.
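
The query shape described above can be sketched as a plain request body. This is an illustrative sketch under stated assumptions: the field names (location, review_score, popularity, amenities, room_types.base_price), the 5km decay scale, and the helper name build_hotel_query are hypothetical and not taken from a real index mapping. Because a filter-only bool query gives every hit the same constant score, the sketch uses boost_mode "replace" so the final score comes from the multiplied functions alone.

```python
def build_hotel_query(lat, lon, radius_km, amenities, min_review, max_price):
    """Build an Elasticsearch request body for a filtered, ranked hotel search."""
    filters = [
        # Only hotels within the search radius.
        {"geo_distance": {"distance": f"{radius_km}km",
                          "location": {"lat": lat, "lon": lon}}},
        {"range": {"review_score": {"gte": min_review}}},
        # Room types are nested objects, so price needs a nested query.
        {"nested": {"path": "room_types",
                    "query": {"range": {"room_types.base_price": {"lte": max_price}}}}},
    ]
    # One term filter per amenity: the hotel must offer all of them.
    filters += [{"term": {"amenities": a}} for a in amenities]

    return {
        "size": 100,
        "query": {
            "function_score": {
                "query": {"bool": {"filter": filters}},
                "functions": [
                    {"field_value_factor": {"field": "popularity",
                                            "modifier": "log2p", "missing": 0}},
                    {"field_value_factor": {"field": "review_score", "missing": 1}},
                    # Closer hotels score higher; the score decays with distance.
                    {"gauss": {"location": {"origin": f"{lat},{lon}", "scale": "5km"}}},
                ],
                "score_mode": "multiply",   # combine the factors multiplicatively
                "boost_mode": "replace",    # ignore the constant filter-only score
            }
        },
    }
```

The returned dictionary can be passed as the search body to any Elasticsearch client.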

Caching Strategy:

Popular search queries are cached in Redis with a cache key generated by hashing the search parameters. Cache TTL is set to 5 minutes balancing freshness with cache hit ratio. During peak travel planning seasons, cache hit rates can reach 40-50% for common destinations, significantly reducing Elasticsearch load.
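
One way to derive such a cache key is to canonicalize the parameters before hashing, so that logically identical searches map to the same entry. This is a sketch under assumptions: the "search:v1:" prefix and the function name search_cache_key are hypothetical, and list values are assumed to hold simple scalars such as amenity names.

```python
import hashlib
import json


def search_cache_key(params: dict) -> str:
    """Build a deterministic Redis key from search parameters."""

    def normalize(value):
        if isinstance(value, dict):
            return {k: normalize(value[k]) for k in sorted(value)}
        if isinstance(value, (list, tuple, set)):
            # Order of filter values should not change the key.
            return sorted(normalize(v) for v in value)
        if isinstance(value, str):
            return value.strip().lower()
        return value

    canonical = json.dumps(normalize(params), sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"search:v1:{digest}"
```

Without normalization, two requests that differ only in key order or amenity order would occupy separate cache entries and lower the hit rate.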

For availability enrichment, we only fetch real-time availability for the top 100 search results rather than all matches. This prevents the search endpoint from being blocked by slow availability checks. Results are returned with availability status showing whether rooms are available for the requested dates.

Indexing Strategy:

We use near real-time indexing from PostgreSQL to Elasticsearch. When hotel data changes in the primary database, change data capture triggers updates to the search index. We maintain separate indices for different geographic regions allowing parallel query execution. Index aliases enable zero-downtime reindexing when schema changes are required.

Deep Dive 2: How do we prevent double-booking under high concurrency?

The inventory service must guarantee that no room is booked twice, even when hundreds of users attempt to book the last available room simultaneously.

Challenge: Without proper locking, race conditions can occur where multiple booking requests read the same available count, then all attempt to decrement it, resulting in negative inventory or overbooking.

Solution: Pessimistic Locking with Row-Level Locks

PostgreSQL provides row-level locking with the SELECT FOR UPDATE statement. When checking and reserving inventory, we acquire exclusive locks on the inventory rows for all dates in the booking range.

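The reservation process works as follows: When a booking request arrives, we begin a database transaction. We execute a SELECT ... FOR UPDATE query to lock all inventory rows for the room type across the check-in to check-out date range. Other transactions that try to update, delete, or lock these rows (including competing SELECT ... FOR UPDATE calls) block until our transaction completes. Plain reads are not blocked under PostgreSQL’s MVCC, which is why every booking attempt must go through the same locking query.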

With locks acquired, we check if all dates have sufficient available rooms. If any date has insufficient inventory, we rollback the transaction and release locks immediately, returning an error to the user. If all dates have availability, we execute an UPDATE statement decrementing the available room count for each date and incrementing the version number for optimistic concurrency control.

Finally, we insert a reservation record linking to the booking and commit the transaction, releasing the locks. The entire process typically completes in under 100ms for a 3-night stay.

Handling Lock Contention:

During high-demand periods, multiple requests may wait for locks on the same inventory rows. PostgreSQL queues these requests and processes them serially. To prevent indefinite waiting, we set a lock timeout of 5 seconds (PostgreSQL’s lock_timeout setting). If a lock cannot be acquired within this time, the request fails fast. Because a timeout signals contention rather than a confirmed sell-out, the user is asked to retry instead of being told the room is gone.
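
The lock, check, and decrement sequence can be sketched as follows. This is a minimal sketch, assuming a psycopg-style DB-API connection (with %s placeholders) and a hypothetical table inventory(room_type_id, stay_date, available_rooms, version); the function name reserve_rooms is also an assumption.

```python
from datetime import date

# Hypothetical schema: inventory(room_type_id, stay_date, available_rooms, version).
LOCK_SQL = """
    SELECT stay_date, available_rooms
    FROM inventory
    WHERE room_type_id = %s AND stay_date >= %s AND stay_date < %s
    ORDER BY stay_date
    FOR UPDATE
"""
UPDATE_SQL = """
    UPDATE inventory
    SET available_rooms = available_rooms - %s, version = version + 1
    WHERE room_type_id = %s AND stay_date >= %s AND stay_date < %s
"""


def reserve_rooms(conn, room_type_id: str, check_in: date, check_out: date,
                  rooms: int = 1) -> bool:
    """Reserve rooms for every night in [check_in, check_out) or change nothing."""
    nights = (check_out - check_in).days
    cur = conn.cursor()
    try:
        # Give up instead of queueing forever behind other bookings.
        cur.execute("SET LOCAL lock_timeout = '5s'")
        # Lock one row per night, in date order.
        cur.execute(LOCK_SQL, (room_type_id, check_in, check_out))
        rows = cur.fetchall()
        if len(rows) != nights or any(avail < rooms for _, avail in rows):
            conn.rollback()  # releases the row locks immediately
            return False
        cur.execute(UPDATE_SQL, (rooms, room_type_id, check_in, check_out))
        conn.commit()  # releases the row locks
        return True
    except Exception:
        conn.rollback()
        raise
```

The check-out date is excluded from the range because the guest does not occupy the room on that night.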

Distributed Locking with Redis:

For additional protection against race conditions, we implement distributed locking using Redis before even attempting the database operation. When a booking request arrives, we attempt to acquire locks on Redis keys for each date in the range using SET with NX and EX options.

The Redis lock key follows the pattern “lock:inventory:{roomTypeId}:{date}” with a unique lock ID as its value. We set an expiration of 10 seconds preventing locks from being held indefinitely if a service crashes. If we successfully acquire all date locks, we proceed with the database reservation. Otherwise, we release any locks already acquired for the earlier dates and immediately return unavailable to the user.

After completing or failing the database operation, we release the Redis locks using a Lua script that verifies we own the lock before deleting it. This prevents accidentally releasing another request’s lock.
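
The acquire and release steps can be sketched as below. This is a sketch under assumptions: the client argument is expected to behave like a redis-py Redis object (set with nx and ex, and eval), and the helper names lock_key, acquire_date_locks, and release_locks are hypothetical.

```python
import uuid

# Delete the key only if it still holds our token (compare-and-delete).
RELEASE_SCRIPT = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
end
return 0
"""


def lock_key(room_type_id, day):
    return f"lock:inventory:{room_type_id}:{day}"


def release_locks(client, keys, token):
    """Release each lock, but only where we are still the owner."""
    for key in keys:
        client.eval(RELEASE_SCRIPT, 1, key, token)


def acquire_date_locks(client, room_type_id, days, ttl_seconds=10):
    """Try to lock every date; return (token, keys) or None if any is taken."""
    token = str(uuid.uuid4())
    held = []
    for day in sorted(days):
        key = lock_key(room_type_id, day)
        # SET key token NX EX ttl: succeeds only if the key does not exist.
        if client.set(key, token, nx=True, ex=ttl_seconds):
            held.append(key)
        else:
            # Partial acquisition: give back what we took and fail fast.
            release_locks(client, held, token)
            return None
    return token, held
```

A caller would wrap the database reservation in try/finally and call release_locks with the returned token and keys in the finally branch.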

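This two-tier locking approach (Redis + PostgreSQL) provides defense in depth. The Redis lock is a fast pre-filter that sheds contending requests before they reach the database; because it can expire while its holder is still working, it is not a correctness guarantee on its own. The PostgreSQL row locks remain the authority that ensures inventory consistency, even under extreme load.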

Deep Dive 3: How do we handle the booking workflow as a distributed transaction?

Creating a booking involves coordinating multiple services: inventory reservation, pricing calculation, payment authorization, booking record creation, and notification dispatch. Each step can potentially fail, requiring careful coordination.

Challenge: Distributed transactions across microservices are complex. Traditional two-phase commit is slow and susceptible to coordinator failure. We need a pattern that ensures consistency while maintaining high availability.

Solution: Saga Pattern with Compensation

We implement the booking workflow as an orchestrated saga where the Booking Service coordinates the sequence of operations. Each step in the saga has a corresponding compensation action to undo its effects if a later step fails.

The saga executes the following steps in order:

Step 1: Validate Request - Verify the fare hasn’t expired, check user is authenticated, and validate guest details. Compensation: None needed as this is read-only.

Step 2: Reserve Inventory - Call Inventory Service to atomically check and reserve rooms for the date range. The service creates a pending reservation with a 10-minute expiration. Compensation: Release the reservation by calling a cancellation endpoint.

Step 3: Calculate Final Price - Call Pricing Service to get the current price including taxes and fees. Verify it matches the fare estimate within tolerance. Compensation: None needed as this is read-only.

Step 4: Authorize Payment - Call Payment Service to authorize (but not capture) the payment. This places a hold on funds in the user’s account. Compensation: Void the payment authorization to release the hold.

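Step 5: Create Booking Record - Insert the booking record in the database with status “confirmed”. This is the commit point of the saga: failures in later steps should normally be retried rather than rolled back. Compensation (only if those retries are exhausted): Update booking status to “cancelled” and record the reason.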

Step 6: Confirm Reservation - Call Inventory Service to convert the pending reservation to confirmed status. Compensation: Cancel the reservation.

Step 7: Publish Event - Publish a “BookingConfirmed” event to Kafka for async processing. Compensation: Publish “BookingCancelled” event.

If any step fails, we execute compensations in reverse order. For example, if payment authorization fails in Step 4, we release the inventory reservation from Step 2. Each compensation is idempotent, so it’s safe to retry if a compensation itself fails.
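
The reverse-order compensation logic can be sketched as a small orchestrator. This is a minimal in-memory sketch, not a production saga engine: the names SagaStep and run_saga are hypothetical, and persistence of saga state, retries, and timeouts are left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SagaStep:
    name: str
    action: Callable[[dict], None]
    # Read-only steps have no compensation.
    compensate: Optional[Callable[[dict], None]] = None


def run_saga(steps, ctx: dict) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed = []
    for step in steps:
        try:
            step.action(ctx)
            completed.append(step)
        except Exception as exc:
            ctx["failed_step"] = step.name
            ctx["error"] = str(exc)
            # The failed step itself is not compensated, only earlier ones.
            for done in reversed(completed):
                if done.compensate is not None:
                    done.compensate(ctx)
            return False
    return True
```

In this sketch a failed payment authorization triggers only the inventory release, because validation and pricing are read-only and register no compensation.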

Saga Orchestration:

We track saga state in the database with a saga_state table recording which steps have completed. If the Booking Service crashes mid-saga, another instance can recover the saga and continue or compensate from the last checkpoint.

Each booking request carries a client-generated idempotency key, which we use as the saga ID. If a client retries the same booking request (due to timeout or network error), it resends the same key, so we detect the duplicate saga ID and return the original result rather than creating a duplicate booking.

Temporal Workflow Alternative:

For production systems, we recommend using a workflow orchestration framework like Temporal or AWS Step Functions. These systems provide durable execution ensuring workflows survive service restarts. They handle timeouts, retries, and compensation automatically. The workflow is defined as code, making the saga logic explicit and testable.

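Temporal maintains workflow state in a distributed database. If a workflow task worker crashes, another worker picks up the workflow and continues from the last completed step. Each completed step is recorded durably, so the workflow as a whole is not re-run from the start. Individual activities (such as the payment call) may still be retried and can therefore execute more than once, so they need to remain idempotent.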

Deep Dive 4: How do we implement overbooking strategies while managing risk?

Hotels intentionally overbook by 5-10% to compensate for expected cancellations and no-shows. This maximizes revenue but requires careful calibration to avoid guest dissatisfaction.

Challenge: Static overbooking limits can lead to either excessive rebooking costs (too aggressive) or lost revenue (too conservative). We need dynamic limits based on historical data.

Solution: Data-Driven Overbooking Limits

The Inventory Service calculates overbooking limits per room type per date based on statistical analysis of historical booking patterns.

Cancellation Rate Analysis:

For each room type, we analyze cancellations over the past 90 days with the same lead time and day of week. For example, for a booking made 30 days in advance for a Friday night, we look at historical Friday bookings made 30 days prior and calculate what percentage were ultimately cancelled.

Cancellation rates vary significantly by factors like booking lead time (last-minute bookings cancel less), day of week (weekend bookings are stickier), season, and price point. We segment the data across these dimensions to get accurate estimates.

Conservative Overbooking Formula:

Given the cancellation rate estimate, we calculate the overbooking limit as: overbooking_limit = floor(total_rooms * cancellation_rate * 0.8)

The 0.8 multiplier provides a safety margin, only allowing overbooking for 80% of expected cancellations. This conservative approach reduces the risk of actually having too many guests arrive.

We also cap the maximum overbooking at 10% of total inventory regardless of the calculated value. This hard limit prevents extreme overbooking if the data is anomalous.
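
The formula and the hard cap combine into a short calculation. This is a sketch that follows the numbers given above (the 0.8 safety multiplier and the 10% cap); the function and parameter names are assumptions.

```python
import math


def overbooking_limit(total_rooms: int, cancellation_rate: float,
                      safety: float = 0.8, cap: float = 0.10) -> int:
    """Extra rooms that may be sold beyond physical inventory for one date."""
    if not 0.0 <= cancellation_rate <= 1.0:
        raise ValueError("cancellation_rate must be between 0 and 1")
    # Overbook only a fraction (safety) of the expected cancellations.
    estimate = math.floor(total_rooms * cancellation_rate * safety)
    # Hard ceiling regardless of what the historical data suggests.
    hard_cap = math.floor(total_rooms * cap)
    return min(estimate, hard_cap)
```

For a 100-room type, a 10% cancellation rate allows 8 extra bookings, while a 20% rate would compute 16 and is cut back to the cap of 10.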

Handling Overbooked Situations:

Despite precautions, situations may arise where more guests arrive than we have rooms. The system handles this through a tiered response strategy.

First, we attempt to upgrade guests to a higher room category if available at the same hotel. The guest receives the upgrade at no additional cost, often turning a potential complaint into a positive experience.

If no upgrade is available, we relocate the guest to a partner hotel nearby (within 5km). The system searches for available comparable rooms at partner properties and automatically creates a booking. The guest receives compensation such as a 50-dollar voucher plus coverage of any price difference.

As a last resort, we provide a full refund plus substantial compensation (100-200 dollars) and help the guest find alternative accommodation. All overbooked situations are logged and analyzed to refine the overbooking algorithm.

Real-Time Adjustment:

The overbooking limits are recalculated daily based on rolling windows of historical data. As check-in date approaches and the cancellation risk decreases, we may reduce the overbooking allowance. During major events or holidays with historically low cancellation rates, overbooking limits are automatically reduced.

Deep Dive 5: How do we implement dynamic pricing that adapts to demand?

Hotel prices should fluctuate based on market conditions, maximizing revenue during high demand while attracting bookings during slow periods.

Challenge: Pricing must balance multiple competing factors: occupancy optimization, revenue maximization, competitive positioning, and perceived fairness to customers.

Solution: Multi-Factor Dynamic Pricing Engine

The Pricing Service calculates prices in real-time using a base price multiplied by various adjustment factors.

Base Price: Each room type has a configured base price representing the minimum profitable rate. This covers operating costs plus a minimum margin.

Demand Multiplier: Based on current occupancy rate for the date range. When occupancy is below 50%, we apply a 0.9 multiplier (10% discount) to stimulate demand. At 50-70% occupancy, we use the base price (1.0 multiplier). As occupancy reaches 70-85%, we apply a 1.15 premium. At 85-95% we charge a 1.3 premium. Above 95% occupancy (almost full), we apply a maximum 1.5 multiplier.

Seasonality Multiplier: Certain dates have inherently higher demand. Summer beach destinations have higher multipliers from June-August. Ski resorts peak in winter months. The multiplier is based on historical booking volume for the same date in previous years.

Day of Week Multiplier: Weekend stays typically command higher prices than weekday stays, especially in leisure destinations. Business hotels reverse this pattern with higher weekday rates.

Event-Based Multiplier: We integrate with event databases to detect conferences, concerts, festivals, and sporting events near the hotel. When a major event is scheduled, prices automatically increase. A large conference might trigger a 1.3x multiplier. A major concert or championship game could justify 1.5-2.0x.

Lead Time Multiplier: Booking lead time affects pricing strategy. Very early bookings (90+ days out) might receive discounts to secure early commitment. Bookings made 14-30 days in advance typically pay base price. Last-minute bookings (within 7 days) can go either way: discounts if we need to fill rooms, or premiums if occupancy is high.

Competitor-Based Adjustment: We monitor pricing at comparable nearby hotels through web scraping and partnership data feeds. If our price is more than 20% higher than competitors for similar quality, we apply a downward adjustment to remain competitive. If we’re priced significantly lower, we may raise prices to capture additional margin.

Calculation Example:

For a deluxe room with base price of 200 dollars on a Saturday night in peak season with 82% occupancy and a nearby concert: Final price = 200 * 1.15 (demand) * 1.2 (season) * 1.1 (weekend) * 1.3 (event) * 1.0 (lead time) * 1.0 (competitor) = 394.68 dollars, or about 395 dollars

The engine applies minimum and maximum price constraints to prevent extreme prices. A room cannot be priced below the configured minimum (e.g., base price * 0.7) or above the maximum (e.g., base price * 3.0).
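
The multiplier chain and the price bounds can be sketched in a few lines. This is an illustrative sketch: the occupancy tiers and the 0.7 and 3.0 bounds come from the text above, while the handling of exact tier boundaries (for example, treating 95% as still in the 1.3 tier) and the function names are assumptions.

```python
def demand_multiplier(occupancy: float) -> float:
    """Map an occupancy rate (0.0 to 1.0) to the demand multiplier tier."""
    if occupancy < 0.50:
        return 0.9
    if occupancy < 0.70:
        return 1.0
    if occupancy < 0.85:
        return 1.15
    if occupancy <= 0.95:
        return 1.3
    return 1.5


def final_price(base_price: float, occupancy: float, season: float = 1.0,
                day_of_week: float = 1.0, event: float = 1.0,
                lead_time: float = 1.0, competitor: float = 1.0,
                floor_factor: float = 0.7, ceiling_factor: float = 3.0) -> float:
    """Multiply all factors, then clamp to the configured price bounds."""
    raw = (base_price * demand_multiplier(occupancy) * season * day_of_week
           * event * lead_time * competitor)
    clamped = min(max(raw, base_price * floor_factor), base_price * ceiling_factor)
    return round(clamped, 2)
```

Calling final_price(200, 0.82, season=1.2, day_of_week=1.1, event=1.3) reproduces the worked example at roughly 394.68 dollars.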

Machine Learning Enhancement:

Advanced implementations use machine learning models trained on historical booking data to predict optimal prices. Features include all the factors above plus weather forecasts, local economic indicators, and customer segment data. The model predicts booking probability at different price points, allowing optimization for either occupancy or revenue targets.

Reinforcement learning approaches can continuously adjust pricing strategy based on booking conversion rates, learning which price points maximize long-term revenue.

Deep Dive 6: How do we handle payment processing with multiple currencies and fraud prevention?

Booking.com operates globally with users paying in their local currency while hotels receive payment in their currency, requiring complex payment flows.

Challenge: Currency conversion, fraud detection, payment authorization, and refund processing must work reliably across different payment methods and currencies while maintaining PCI DSS compliance.

Solution: Multi-Stage Payment Processing with Third-Party Integration

The Payment Service integrates with payment gateways like Stripe and Adyen supporting global payment methods.

Payment Authorization Flow:

When a booking is created, we authorize rather than capture payment immediately. Authorization places a hold on the customer’s funds without actually transferring money. This protects against cancellations between booking and check-in.

The flow starts with currency conversion if needed. If the user’s currency differs from the hotel’s base currency, we use the exchange rate that was retrieved from the currency service and locked when the fare was calculated (see Multi-Currency Handling below). We calculate the amount in the user’s currency with a small markup (1-2%) to cover exchange rate fluctuations between authorization and capture.

Next, we perform fraud risk assessment using multiple signals. New users with no booking history receive higher scrutiny. Large booking amounts (above 1000 dollars) trigger additional checks. We verify the billing address matches the payment method through address verification service (AVS). Device fingerprinting helps detect suspicious patterns. If the user’s IP geolocation significantly differs from their billing address, we flag for review.

The fraud detection system assigns a risk score from 0 to 1. Scores above 0.8 automatically reject the transaction. Scores between 0.6 and 0.8 require additional authentication via 3D Secure (3DS), where the user must authenticate with their bank. Scores below 0.6 proceed with standard authorization.
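
The routing rule based on these thresholds can be sketched as a single function. The thresholds are the ones stated above; the function name, the returned labels, and the choice to send scores of exactly 0.6 and 0.8 to 3D Secure are assumptions.

```python
def route_payment(risk_score: float) -> str:
    """Decide how to proceed with a payment given a fraud risk score in [0, 1]."""
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be between 0 and 1")
    if risk_score > 0.8:
        return "reject"
    if risk_score >= 0.6:
        return "require_3ds"  # step-up authentication with the issuing bank
    return "authorize"
```

Keeping the thresholds in one place makes them easy to tune as fraud patterns change.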

We then call the payment gateway API to authorize the payment. For credit cards, this creates an authorization hold. For alternative payment methods like PayPal or Apple Pay, this initiates their specific authorization flow. The authorization remains valid for a set period (typically 7-30 days depending on the payment method).

Payment details are stored in tokenized form. We never store full credit card numbers. The payment gateway provides a token representing the payment method which we store in our database. This maintains PCI DSS compliance by keeping sensitive card data out of our systems.

Payment Capture:

Funds are actually captured (transferred) at check-in time or up to 24 hours before check-in. This timing protects the hotel against no-shows while keeping the gap between authorization and capture short, which reduces the risk of an expired authorization or insufficient funds.

When capturing payment, we use the stored payment token to request capture from the gateway. If the original authorization has expired, we attempt a new authorization and immediate capture. If this fails, we notify both the user and hotel to resolve the payment issue.

Refund Processing:

For cancellations, the refund amount is calculated by the Cancellation Policy Engine, as described in the Cancellation Flow in Step 2. We initiate a refund through the payment gateway, which typically processes within 5-10 business days depending on the payment method and bank.

Partial refunds are supported for modifications where the user changes to a cheaper room type or shorter stay duration. The refund goes back to the original payment method to prevent fraud.

Multi-Currency Handling:

The system maintains prices in three currencies: the hotel’s base currency, the user’s display currency, and the payment currency. When displaying prices to users, we convert from the hotel’s base currency to the user’s preferred currency using current exchange rates. At payment time, we may need to convert again to the payment method’s currency.

To protect against exchange rate volatility, we lock the exchange rate at the time of fare calculation. This rate is stored with the Fare entity and used for payment authorization and capture, ensuring the user pays the amount they saw during booking.
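
Storing the locked rate on the fare can be sketched as follows. This is a sketch under assumptions: the Fare fields shown here, the 1.5% default markup (within the 1-2% range mentioned earlier), and rounding to two decimal places are illustrative choices, and currencies without minor units would need different rounding.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


@dataclass(frozen=True)
class Fare:
    hotel_amount: Decimal   # total in the hotel's base currency
    hotel_currency: str
    user_currency: str
    locked_rate: Decimal    # user-currency units per hotel-currency unit, fixed at fare time


def charge_amount(fare: Fare, markup: Decimal = Decimal("0.015")) -> Decimal:
    """Amount to authorize and later capture, using the rate locked on the fare."""
    if fare.hotel_currency == fare.user_currency:
        return fare.hotel_amount.quantize(CENT, rounding=ROUND_HALF_UP)
    converted = fare.hotel_amount * fare.locked_rate * (Decimal("1") + markup)
    return converted.quantize(CENT, rounding=ROUND_HALF_UP)
```

Because the fare is immutable and carries its own rate, authorization and capture compute the same amount even if the market rate moves in between.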

For international payments, we use payment processors with local acquiring banks in major markets. This reduces cross-border interchange fees and improves authorization rates by processing payments domestically.

Deep Dive 7: How can we scale the system to handle 200M searches and 1.5M bookings per day?

At this scale, every component must be optimized for throughput and latency.
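
To put the daily totals in per-second terms, a quick back-of-the-envelope calculation helps. The peak multiplier below is an assumed figure for illustration, not a number from this design.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

searches_per_day = 200_000_000
bookings_per_day = 1_500_000

# Average load if traffic were spread evenly across the day.
avg_search_qps = searches_per_day / SECONDS_PER_DAY    # roughly 2,300 per second
avg_booking_qps = bookings_per_day / SECONDS_PER_DAY   # roughly 17 per second

# Assumption: peak traffic is about 3x the daily average.
PEAK_FACTOR = 3
peak_search_qps = avg_search_qps * PEAK_FACTOR
peak_booking_qps = avg_booking_qps * PEAK_FACTOR

# Read/write skew: how many searches happen per completed booking.
searches_per_booking = searches_per_day / bookings_per_day

print(f"search:  avg {avg_search_qps:,.0f}/s, peak {peak_search_qps:,.0f}/s")
print(f"booking: avg {avg_booking_qps:,.1f}/s, peak {peak_booking_qps:,.1f}/s")
print(f"searches per booking: {searches_per_booking:,.0f}")
```

The workload is heavily read-dominated (over a hundred searches per booking), which is why the search path is scaled with replicas and caches while the booking path stays on strongly consistent primaries.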

Database Sharding Strategy:

We shard the primary PostgreSQL database by geographic region. Hotels in North America are in the US shard, European hotels in the EU shard, and so on. This aligns with data residency regulations and reduces cross-region latency.

Within each regional shard, we further partition by hotel ID ranges. Each partition handles approximately 100,000 hotels. The booking and inventory tables are co-located with their hotels (partitioned by hotel ID) to keep related data together and avoid distributed queries.
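
A minimal sketch of the two-level routing described above: region first, then a hotel ID range. The region codes, shard names, and the assumption that hotel IDs are integers allocated roughly contiguously are all illustrative.

```python
HOTELS_PER_PARTITION = 100_000

# Hypothetical mapping from a hotel's region code to its regional shard.
REGION_TO_SHARD = {
    "NA": "us",
    "EU": "eu",
    "APAC": "apac",
}


def route(region: str, hotel_id: int) -> tuple[str, int]:
    """Return (regional shard, partition index) for a hotel.

    Bookings and inventory rows use the same function with the hotel's ID,
    so they land in the same partition as the hotel itself.
    """
    if hotel_id < 0:
        raise ValueError("hotel_id must be non-negative")
    shard = REGION_TO_SHARD[region]
    partition = hotel_id // HOTELS_PER_PARTITION  # IDs 0..99,999 -> partition 0, etc.
    return shard, partition
```

For example, `route("EU", 250_123)` resolves to the EU shard and the partition covering IDs 200,000 through 299,999. Because every booking query carries a hotel ID, the booking workflow touches a single partition.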

Read replicas provide horizontal scaling for read-heavy workloads. Each primary database has 5-10 read replicas distributed across availability zones. Search result enrichment, user dashboards, and reporting queries hit read replicas. The booking workflow uses the primary database for strong consistency.

Connection Pooling:

To handle 100,000 concurrent users, we deploy PgBouncer as a connection pooler in front of PostgreSQL. Each application server maintains a small pool (10-20 connections) to PgBouncer, which multiplexes the combined client connections from all application servers onto a bounded pool of 100-500 server connections to PostgreSQL. This prevents overwhelming the database with too many concurrent connections.

Elasticsearch Scaling:

The search cluster consists of 3 dedicated master nodes for cluster management and 15+ data nodes holding index shards. Hotel indices are partitioned by region with each region’s index having 10 shards for parallel query execution.

During peak search load, we can scale horizontally by adding more data nodes. Elasticsearch automatically rebalances shards across nodes. Read replicas (replica shards) are distributed across nodes ensuring high availability and query throughput.

Redis Cluster:

We deploy Redis in cluster mode with 6 master nodes and 6 replica nodes spread across 3 availability zones. Data is automatically sharded across master nodes using hash slots. Replicas provide high availability and read scaling for cache hits.

Cache eviction uses the LRU (Least Recently Used) policy, automatically evicting the least recently accessed keys once the memory limit is reached. Maximum memory is set to 80% of available RAM, with warnings triggered at 70% to allow proactive capacity planning.

Async Processing with Kafka:

Non-critical operations like sending confirmation emails, updating analytics, and generating reports are handled asynchronously through Kafka. The booking workflow publishes events to Kafka topics after completing critical operations.

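Kafka topics are partitioned (32 partitions per topic), allowing parallel consumption. Within a consumer group, each partition is assigned to exactly one consumer instance, so each event is handled by a single consumer in that group. Because an event can be redelivered after a crash or rebalance, consumers should handle duplicates idempotently. If a consumer crashes, Kafka’s consumer group protocol automatically rebalances its partitions to healthy consumers.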
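
The partitioning and rebalancing behavior can be modeled in a few lines. This is a toy model, not Kafka client code: `zlib.crc32` stands in for the key hash (Kafka's Java client hashes keys with murmur2 by default), the round-robin assignment is a simplification of the real assignor strategies, and keying events by hotel ID is an assumption.

```python
import zlib

NUM_PARTITIONS = 32


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map an event key to a partition; the same key always maps to the same one.

    crc32 is a stand-in hash for illustration only.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(
    consumers: list[str], num_partitions: int = NUM_PARTITIONS
) -> dict[str, list[int]]:
    """Give every partition exactly one owner within the consumer group."""
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


# Three healthy consumers share the 32 partitions.
before = assign_partitions(["worker-1", "worker-2", "worker-3"])

# worker-2 crashes: a rebalance redistributes its partitions to the survivors.
after = assign_partitions(["worker-1", "worker-3"])
```

Keying by hotel ID would keep all events for one hotel in a single partition, so they are consumed in order by one worker at a time.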

Content Delivery Network:

Static assets like hotel images, logos, and client-side application bundles are served through CloudFront CDN. This reduces origin server load and provides low-latency asset delivery globally. Images are optimized with multiple resolutions for different device sizes and formats (WebP for modern browsers, JPEG for older clients).

Rate Limiting and DDoS Protection:

The API Gateway implements rate limiting per user (100 requests per minute) and per IP address (1000 requests per minute) to prevent abuse. During DDoS attacks, we employ AWS Shield and WAF to filter malicious traffic before it reaches our servers.
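
One simple way to enforce such limits is a fixed-window counter, sketched below. The source does not say which algorithm the gateway uses, so this is an assumed choice; token buckets and sliding windows are common alternatives. The in-memory dictionary stands in for a shared store such as Redis, and old windows are never purged in this sketch.

```python
import time
from collections import defaultdict
from typing import Callable


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per key in each fixed time window."""

    def __init__(
        self,
        limit: int,
        window_seconds: int = 60,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        # (key, window index) -> number of requests seen in that window
        self._counts: dict[tuple[str, int], int] = defaultdict(int)

    def allow(self, key: str) -> bool:
        window = int(self.clock() // self.window_seconds)
        bucket = (key, window)
        if self._counts[bucket] >= self.limit:
            return False  # over the limit for this window
        self._counts[bucket] += 1
        return True


# Limits from the text: 100/min per user, 1000/min per IP address.
per_user = FixedWindowRateLimiter(limit=100)
per_ip = FixedWindowRateLimiter(limit=1000)


def allow_request(user_id: str, ip: str) -> bool:
    """A request must pass both the per-IP and the per-user check."""
    return per_ip.allow(ip) and per_user.allow(user_id)
```

A fixed window permits short bursts at window boundaries (up to twice the limit across two adjacent windows), which is the usual reason to prefer a sliding window or token bucket in production.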

Geographic Load Distribution:

We deploy the application across multiple regions: US-East, US-West, EU-Central, and Asia-Pacific. Users are routed to the nearest region via geographic DNS routing, reducing latency. Each region is a complete deployment capable of serving requests independently.

Cross-region replication keeps hotel data synchronized globally. The replication lag is typically under 1 second, with eventual consistency acceptable for most operations. Critical operations like payment processing and inventory reservation target the primary region for the hotel to ensure strong consistency.

Step 4: Wrap Up

In this design, we proposed a comprehensive system for a hotel booking platform like Booking.com. The architecture handles millions of daily searches and bookings while maintaining strong consistency guarantees for inventory and payments.

Additional Features to Discuss

Review System: Allow verified guests to submit reviews with ratings, photos, and written feedback. Implement moderation to detect fake or inappropriate reviews. Aggregate scores and display them in search results to help users make informed decisions. Hotel owners can respond to reviews, creating a dialog with guests.

Loyalty Program: Track booking history and award points based on spending. Allow users to redeem points for discounts on future bookings. Implement tier levels (Silver, Gold, Platinum) with progressively better benefits like free upgrades, late checkout, or priority support.

Recommendation Engine: Use collaborative filtering to recommend hotels based on similar users’ preferences. Factor in booking history, search patterns, and demographic data. Personalize search result ranking based on predicted user preference.

Property Management Portal: Provide hotel managers with tools to update room inventory, set pricing rules, manage promotions, respond to reviews, and view analytics on occupancy and revenue. Support bulk operations for hotel chains managing hundreds of properties.

Mobile Optimization: Implement offline capabilities allowing users to view saved bookings without internet connectivity. Use push notifications for booking confirmations, check-in reminders, and price drop alerts. Optimize the mobile booking flow to minimize steps and form fields.

Scaling Considerations

Horizontal Scaling: All services are stateless, storing session data in Redis. This allows automatic scaling based on CPU and memory metrics. During peak booking periods, we can scale out to hundreds of service instances.

Database Scaling: As data grows beyond the capacity of single database instances, we implement time-based partitioning. Historical bookings older than 2 years are moved to cold storage (S3 with Athena for querying). Active data remains in hot storage (PostgreSQL) for fast access.

Caching Layers: Implement multi-tier caching with in-memory application cache (local to each service instance), distributed cache (Redis cluster), and CDN for static content. With a high cache hit ratio, this can reduce database queries by 80-90% for read-heavy workloads.

Async Processing: Offload all non-critical operations to background workers. Sending notifications, updating search indices, generating invoices, and processing analytics all happen asynchronously through Kafka event streams.

Error Handling and Resilience

Circuit Breakers: Implement circuit breaker patterns for calls to external services. If the payment gateway is experiencing issues, after 5 consecutive failures, we open the circuit and immediately return errors rather than waiting for timeouts. This prevents cascading failures and allows faster recovery.
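
A minimal circuit breaker along these lines might look like the sketch below. The threshold of 5 consecutive failures comes from the text; the 30-second reset timeout, the half-open trial behavior, and the class and exception names are assumptions.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised immediately while the circuit is open, without calling the service."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 30.0,  # assumed value
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0                      # consecutive failures so far
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping gateway calls as `breaker.call(gateway.authorize, ...)` means that once the circuit opens, callers get an immediate `CircuitOpenError` instead of waiting on a timeout for each request.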

Graceful Degradation: During partial outages, non-essential features are disabled while core booking functionality remains available. If the review service is down, searches still work but don’t display review scores. If the pricing service fails, we fall back to cached base prices.

Retry Logic: Implement exponential backoff for retrying failed operations. Transient network failures are retried up to 3 times with increasing delays (1s, 2s, 4s). Idempotency keys prevent duplicate bookings when requests are retried.
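
The retry schedule and the role of idempotency keys can be illustrated together. In this sketch, `TransientError`, `retry_with_backoff`, and the in-memory `InMemoryBookingService` are hypothetical stand-ins; a real service would persist idempotency keys in the database alongside the booking.

```python
import time
from typing import Any, Callable


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def retry_with_backoff(
    operation: Callable[[], Any],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Run operation(); on TransientError retry with delays of 1s, 2s, 4s."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted
            sleep(base_delay * (2 ** attempt))


class InMemoryBookingService:
    """Stand-in service that deduplicates requests by idempotency key."""

    def __init__(self) -> None:
        self.bookings: dict[str, str] = {}  # idempotency key -> booking ID

    def create_booking(self, idempotency_key: str, payload: dict) -> str:
        if idempotency_key in self.bookings:
            # A retried request returns the original booking, not a new one.
            return self.bookings[idempotency_key]
        booking_id = f"bk_{len(self.bookings) + 1}"
        self.bookings[idempotency_key] = booking_id
        return booking_id
```

The client generates one key per booking attempt (for example a UUID) and reuses it on every retry, so a request whose response was lost in transit cannot create a second booking.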

Health Checks: Each service exposes health check endpoints queried by load balancers. Unhealthy instances are automatically removed from rotation. Deep health checks verify connectivity to dependencies like databases and message queues.

Security Considerations

Data Encryption: All data is encrypted in transit using TLS 1.3 and at rest using AES-256, including database storage volumes. Payment tokens are encrypted with separate encryption keys stored in a key management service.

Authentication and Authorization: Users authenticate via JSON Web Tokens (JWTs) with short expiration times (1 hour). Refresh tokens allow re-authentication without requiring password re-entry. Role-based access control ensures users can only access their own bookings while hotel managers can access their properties.

Input Validation: All API inputs are validated against strict schemas. SQL injection is prevented by using parameterized queries. Cross-site scripting (XSS) attacks are mitigated by sanitizing user-generated content in reviews.
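
To make the parameterized-query point concrete, the sketch below uses Python's built-in `sqlite3` with an in-memory database purely so the example is runnable; the design itself uses PostgreSQL, whose drivers have their own placeholder syntax. The table layout is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings (id INTEGER PRIMARY KEY, user_id TEXT, hotel_name TEXT)"
)
conn.executemany(
    "INSERT INTO bookings (user_id, hotel_name) VALUES (?, ?)",
    [("u1", "Hotel A"), ("u2", "Hotel B")],
)


def bookings_for_user(conn: sqlite3.Connection, user_id: str) -> list[str]:
    """Look up bookings with the user ID bound as a parameter, never concatenated."""
    cur = conn.execute(
        "SELECT hotel_name FROM bookings WHERE user_id = ? ORDER BY id",
        (user_id,),
    )
    return [row[0] for row in cur.fetchall()]


# A classic injection payload is treated as a literal (non-matching) user ID.
legit = bookings_for_user(conn, "u1")
attack = bookings_for_user(conn, "u1' OR '1'='1")
```

Because the driver sends the value separately from the SQL text, the payload cannot change the structure of the query; had the string been concatenated into the statement, the `OR '1'='1'` clause would have returned every user's bookings.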

Audit Logging: All critical operations (bookings, payments, cancellations) are logged with user ID, timestamp, IP address, and operation details. Logs are immutable and retained for 7 years for compliance and fraud investigation.

Monitoring and Analytics

Key Metrics: Track search latency (p50, p95, p99), booking success rate, payment authorization rate, inventory lock contention, cache hit ratio, and database query performance. Set alerts when metrics deviate from expected ranges.

Distributed Tracing: Implement OpenTelemetry to trace requests across microservices. Each request has a trace ID propagated through all service calls. This enables identifying bottlenecks and debugging failures in distributed workflows.

Real-Time Dashboards: Provide operations teams with dashboards showing request rates, error rates, and system health in real-time. Display booking funnel metrics (searches → availability checks → bookings → payments) to identify conversion drop-offs.

A/B Testing: Implement an experimentation framework for testing pricing strategies, search ranking algorithms, and user interface changes. Randomly assign users to control and treatment groups, measuring impact on key metrics like booking conversion rate and revenue per user.
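
Group assignment is often done by hashing rather than by drawing a random number per request, so that a user sees the same variant on every visit. The sketch below assumes that approach; the function name, the experiment naming scheme, and the 10,000-bucket granularity are illustrative choices.

```python
import hashlib

BUCKETS = 10_000  # assumed granularity: 0.01% steps


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically place a user in 'treatment' or 'control'.

    Hashing the experiment name together with the user ID keeps assignments
    stable within an experiment but independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    threshold = int(treatment_share * BUCKETS)
    return "treatment" if bucket < threshold else "control"
```

Calling `assign_variant("user-123", "search-ranking-v2")` repeatedly always yields the same group, and ramping an experiment up only requires raising `treatment_share`; users already in treatment stay there because their bucket does not change.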

Future Improvements

Machine Learning Enhancements: Train models for demand forecasting to predict future booking patterns and optimize inventory allocation. Use natural language processing to analyze review sentiment and automatically categorize feedback. Implement computer vision to assess hotel photo quality and detect policy violations.

Blockchain for Loyalty: Explore using blockchain for a decentralized loyalty point system that can be redeemed across partner networks. Smart contracts could automatically award points and enforce redemption rules.

Augmented Reality: Allow users to view 360-degree virtual tours of hotel rooms and amenities before booking. Implement AR features for navigation within large hotel properties.

Sustainability Tracking: Add carbon footprint calculations for bookings based on travel distance and hotel environmental practices. Allow users to filter for eco-friendly properties and offset carbon emissions.

This design demonstrates a production-grade hotel booking platform handling massive scale while maintaining data consistency, high availability, and excellent user experience. The architecture leverages proven technologies and patterns to build a reliable system that can evolve with business needs.

Summary

This comprehensive guide covered the design of a hotel booking platform like Booking.com, including:

Core Functionality: Hotel search with complex filters, real-time availability checking, booking creation with payment processing, and cancellation handling.
Key Challenges: Sub-second search performance, preventing double-booking under concurrency, distributed transaction coordination, dynamic pricing, and payment processing across currencies.
Solutions: Elasticsearch for search, pessimistic locking for inventory consistency, saga pattern for distributed transactions, data-driven overbooking strategies, multi-factor dynamic pricing, and global payment processing.
Scalability: Database sharding, read replicas, Elasticsearch clustering, Redis caching, Kafka for async processing, and multi-region deployment.

The design demonstrates how to build a complex e-commerce system with strong consistency requirements, high throughput, and sophisticated business logic while maintaining operational excellence.
