Design Weather App

Designing a production-grade weather application involves orchestrating real-time data from multiple sources, providing accurate forecasts, delivering timely severe weather alerts, and serving millions of users with low-latency responses. This document outlines the architecture for a system similar to Weather.com or Dark Sky, capable of handling global weather data ingestion, processing, and delivery.

The unique challenges in this design include managing high-frequency weather data updates from numerous sources, performing efficient geospatial queries for location-based weather, implementing intelligent caching strategies to minimize API costs, and ensuring real-time alert delivery to affected users during severe weather events.

Step 1: Understand the Problem and Establish Design Scope

Before diving into the design, it’s crucial to define the functional and non-functional requirements. For weather applications, functional requirements are the features users interact with, whereas non-functional requirements define system qualities around performance, scale, and reliability.

Functional Requirements

Core Requirements:

  1. Users should be able to view current weather conditions for any location (temperature, humidity, wind speed, pressure, visibility).
  2. Users should be able to access hourly forecasts for the next 48 hours and daily forecasts for the next 10-14 days.
  3. Users should receive severe weather alerts (tornadoes, hurricanes, floods, heat advisories) for their saved locations.
  4. Users should be able to view weather radar and satellite imagery for their region.

Below the Line (Out of Scope):

  • Users should be able to access minute-by-minute precipitation forecasts for the next hour.
  • Users should be able to view historical weather data for any location.
  • Users should be able to customize alert preferences and manage multiple saved locations.
  • Users should be able to access weather widgets for home screens and lock screens.

Non-Functional Requirements

Core Requirements:

  • The system should provide current weather queries with p99 latency under 200ms.
  • The system should ensure weather data freshness within 5 minutes for current conditions.
  • The system should deliver severe weather alerts within 30 seconds of detection.
  • The system should maintain 99.99% uptime SLA with graceful degradation when data sources are unavailable.

Below the Line (Out of Scope):

  • The system should minimize third-party API calls through intelligent caching to reduce costs.
  • The system should support multi-region deployment for disaster recovery.
  • The system should implement data quality validation through multi-source verification.
  • The system should optimize storage with appropriate retention policies for historical data.

Clarification Questions & Assumptions:

  • Platform: iOS, Android, and Web applications.
  • Scale: 100 million daily active users globally.
  • Data Sources: Integration with multiple weather APIs (OpenWeatherMap, NOAA, Weather Underground).
  • Geographic Coverage: Global coverage with 100,000+ weather station locations.
  • Update Frequency: Current weather updated every 5 minutes, forecasts updated every 1-3 hours.

Capacity Estimations

Traffic:

  • 100M DAU, each user checking weather ~5 times per day
  • 500M requests per day ≈ 6K QPS average
  • Peak traffic during morning hours: 3× average ≈ 18K QPS

Storage:

  • Current weather: 100K locations × 5KB = 500MB (refreshed frequently)
  • Hourly forecasts: 100K locations × 48 hours × 2KB = 9.6GB
  • Daily forecasts: 100K locations × 14 days × 1KB = 1.4GB
  • Historical data: 100K locations × 365 days × 24 hours × 500B × 5 years ≈ 2TB
  • Weather radar tiles: 1M tiles × 50KB per tile × 20 time steps = 1TB (rotating cache)
  • Total: ~3TB plus user data and logs

Bandwidth:

  • Average response size: 20KB (current + forecast + metadata)
  • 6K QPS × 20KB = 120MB/s ≈ 960 Mbps
  • Peak: ~2.88 Gbps
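
As a sanity check, the arithmetic behind these estimates can be written out directly (a sketch; the bandwidth figures use the rounded 6K QPS, matching the list above):

```python
# Back-of-envelope check of the traffic and bandwidth estimates above.
DAU = 100_000_000
CHECKS_PER_DAY = 5
SECONDS_PER_DAY = 86_400
RESPONSE_KB = 20

requests_per_day = DAU * CHECKS_PER_DAY        # 500M requests/day
avg_qps = requests_per_day / SECONDS_PER_DAY   # ~5.8K, rounded to 6K
peak_qps = 3 * avg_qps                         # morning peak, ~18K

avg_mbps = 6_000 * RESPONSE_KB * 8 / 1_000     # using the rounded 6K QPS
peak_gbps = 3 * avg_mbps / 1_000               # 3x peak multiplier
```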

Step 2: Propose High-Level Design and Get Buy-in

Planning the Approach

For weather application design, we’ll build up our system sequentially, going one by one through our functional requirements. We’ll start with basic weather data retrieval, then add forecasting capabilities, followed by alert systems, and finally radar/map visualizations.

Defining the Core Entities

To satisfy our key functional requirements, we’ll need the following entities:

Location: Represents a geographic location where weather data is collected or requested. Includes latitude and longitude coordinates, elevation, location name, country code, and station type. This is fundamental for all weather queries.

Weather Data: Current weather conditions at a specific location and time. Contains temperature, feels-like temperature, humidity, pressure, wind speed and direction, weather condition (sunny, cloudy, rainy), visibility, and observation timestamp. This represents the real-time snapshot of weather.

Forecast: Future weather predictions for a location. Includes hourly forecasts for the next 48 hours and daily forecasts for the next 10-14 days. Each forecast entry contains predicted temperature ranges, precipitation probability, wind conditions, and general weather conditions.

Alert: Severe weather warnings or advisories. Contains alert type (tornado, hurricane, heat advisory), severity level (advisory, warning, emergency), affected geographic area, description, issue time, and expiration time. Critical for user safety.

User Preferences: User-specific settings including saved locations, preferred units (Fahrenheit/Celsius), enabled alert types, notification preferences, and quiet hours settings. This personalizes the weather experience.

API Design

Get Current Weather Endpoint: Used by clients to retrieve current weather conditions for a specific location.

GET /weather?lat={lat}&lon={lon} -> WeatherData
Query Params: {
  lat: number,
  lon: number,
  units: "metric" | "imperial" (optional)
}

Get Forecast Endpoint: Used by clients to retrieve weather forecasts for a location.

GET /forecast?lat={lat}&lon={lon}&type={type} -> Forecast
Query Params: {
  lat: number,
  lon: number,
  type: "hourly" | "daily",
  units: "metric" | "imperial" (optional)
}

Ingest Weather Data Endpoint: Used by backend services to receive weather observations from weather data providers.

POST /internal/weather-data -> Success/Error
Body: {
  locationId: string,
  temperature: number,
  humidity: number,
  pressure: number,
  windSpeed: number,
  windDirection: number,
  condition: string,
  observedAt: timestamp,
  source: string
}

Note: This is an internal endpoint not exposed to public clients, used only for data ingestion from trusted weather data providers.

Get Weather Alerts Endpoint: Used by clients to retrieve active weather alerts for a location or region.

GET /alerts?lat={lat}&lon={lon}&radius={radius} -> Alert[]
Query Params: {
  lat: number,
  lon: number,
  radius: number (in km, optional)
}

High-Level Architecture

Let’s build up the system sequentially, addressing each functional requirement:

1. Users should be able to view current weather conditions for any location

The core components necessary to fulfill current weather retrieval are:

  • Client Applications: iOS, Android, and Web apps that provide the primary user interface. They send HTTP requests to fetch weather data and display it in user-friendly formats.
  • API Gateway: Acts as the entry point for all client requests, handling authentication, rate limiting, request routing, and API composition. Protects backend services from direct exposure.
  • Weather Service: Manages current weather data retrieval and unit conversions. Implements multi-layer caching to minimize database queries and provide fast responses.
  • Data Ingestion Service: Continuously polls multiple weather APIs on scheduled intervals, collecting raw weather data from various providers like OpenWeatherMap, NOAA, and Weather Underground.
  • Aggregation Service: Normalizes data from different providers into a unified format, validates data quality, and resolves conflicts when sources disagree on values.
  • Cache Layer (Redis): Multi-tier caching system that stores frequently accessed weather data in memory for ultra-fast retrieval. Implements TTL-based invalidation.
  • Database (PostgreSQL): Stores location metadata, current weather readings, and serves as the source of truth when cache misses occur.

Current Weather Flow:

  1. The user opens the app and requests weather for their current location, sending a GET request to the Weather Service.
  2. The API Gateway receives the request, validates the authentication token, applies rate limiting, and forwards to the Weather Service.
  3. The Weather Service first checks the L1 in-memory cache for the location’s weather data.
  4. If cache miss occurs, it checks the L2 Redis cache. If still missing, it queries the PostgreSQL database.
  5. The service applies unit conversion if needed (Celsius to Fahrenheit) and returns the weather data.
  6. Meanwhile, the Data Ingestion Service periodically fetches fresh data from multiple weather APIs and publishes to a message queue.
  7. The Aggregation Service consumes these events, normalizes the data, and updates both the database and cache.

2. Users should be able to access hourly and daily forecasts

We extend our existing design to support forecast retrieval:

  • Add a Forecast Service: Dedicated service for managing forecast data with specialized caching strategies since forecasts change less frequently than current conditions.
  • Add Forecast Tables: Separate database tables for hourly and daily forecasts, optimized for time-series queries.

Forecast Retrieval Flow:

  1. The user swipes to view the forecast, sending a GET request with the forecast type (hourly or daily).
  2. The API Gateway routes the request to the Forecast Service.
  3. The Forecast Service implements aggressive caching since forecasts are expensive to compute and change infrequently.
  4. For hourly forecasts, cache TTL is 30 minutes. For daily forecasts, cache TTL is 1 hour.
  5. On cache miss, the service queries the database and populates all cache layers.
  6. The Data Ingestion Service fetches forecast data from third-party APIs every 1-3 hours and updates the system.

3. Users should receive severe weather alerts for their saved locations

We need to introduce new components to facilitate real-time alert delivery:

  • Alert Detection Service: Continuously monitors incoming weather data streams, evaluating conditions against alert criteria thresholds to identify severe weather situations.
  • Alert Service: Manages the lifecycle of weather alerts including creation, updates, and expiration. Stores active alerts and determines affected geographic regions.
  • Notification Service: Dispatches push notifications to users via APNs (Apple Push Notification Service) for iOS and FCM (Firebase Cloud Messaging) for Android.
  • Message Queue (Kafka): Provides event-driven architecture for alert propagation, ensuring no alerts are dropped during high-traffic events.

Alert Detection and Delivery Flow:

  1. The Data Ingestion Service receives new weather data from providers and publishes it to a Kafka topic.
  2. The Alert Detection Service consumes these events and evaluates them against alert criteria (e.g., temperature above 40°C, wind speed above 25 m/s, severe conditions like tornadoes).
  3. When a severe condition is detected, it creates an Alert entity with appropriate severity level and affected area.
  4. The Alert Service stores the alert in the database and publishes an alert event to Kafka.
  5. The Notification Service consumes alert events and queries for affected users (users who have saved locations within the alert area).
  6. It filters users based on their notification preferences (alert types enabled, severity threshold, quiet hours).
  7. Push notifications are sent via APNs and FCM to user devices.
  8. The mobile client receives the notification and displays an alert banner, even if the app is in the background.

4. Users should be able to view weather radar and satellite imagery

We add components for map tile serving:

  • Map Tile Service: Manages weather radar and satellite imagery as pre-rendered tiles for efficient delivery. Implements tile generation and caching strategies.
  • Object Storage (S3): Stores pre-rendered map tiles organized by layer type, zoom level, coordinates, and timestamp.
  • CDN (CloudFront): Distributes map tiles from edge locations worldwide for low-latency delivery to users.

Weather Map Flow:

  1. The user opens the radar view, and the client app requests map tiles based on the visible map viewport.
  2. Requests go through the CDN first. If the tile is cached at the edge location, it’s returned immediately (sub-50ms).
  3. On CDN cache miss, the request is forwarded to the Map Tile Service.
  4. The Map Tile Service checks its Redis cache for the tile key (layer, zoom, x, y, timestamp).
  5. If not in Redis, it retrieves the tile from S3 object storage.
  6. If the tile doesn’t exist in S3 (new timestamp or location), it’s generated on-demand by overlaying weather data on the map tile.
  7. The generated tile is uploaded to S3, cached in Redis, and returned to the CDN, which caches it at the edge.
  8. As weather data updates, new tiles are pre-generated for the latest timestamp and uploaded to S3.

Step 3: Design Deep Dive

With the core functional requirements met, it’s time to dig into the non-functional requirements and address key architectural challenges. These deep dives separate good designs from production-ready systems.

Deep Dive 1: How do we efficiently ingest weather data from multiple sources while minimizing costs?

Integrating with multiple third-party weather APIs presents challenges around cost management, rate limiting, and reliability. Most weather APIs charge per request, and with 100,000 locations updating every 5-10 minutes, costs can escalate quickly.

Problem: API Cost Management

If we naively fetch weather data for all 100,000 locations every 5 minutes from a paid API like OpenWeatherMap, that's 20,000 requests per minute, or 28.8 million requests per day. At $0.0001 per request, this amounts to $2,880 per day, or roughly $87,000 per month, for a single data source.

Solution: Intelligent Quota Management and Smart Polling

We implement a multi-layered approach to minimize API costs while maintaining data freshness:

Popularity-Based Polling: Not all locations need to be updated with the same frequency. We track which locations users actually query and prioritize updates for popular locations. Cities like New York, London, and Tokyo are updated every 5 minutes, while rural locations with few users might be updated every 30 minutes or on-demand.

Freshness Checking: Before making an API call, we check Redis to see when the location was last updated. If the data is still fresh (within the configured threshold), we skip the API call. This freshness threshold is configurable per location based on popularity.

On-Demand Fetching: For locations that haven’t been queried recently, we don’t proactively poll. Instead, when a user requests weather for that location and the data is stale, we fetch it on-demand and cache it for future requests.

Multi-Source Strategy with Failover: We integrate with multiple weather providers (OpenWeatherMap, NOAA, Weather Underground) and use them strategically. NOAA provides free data for US locations, so we prioritize it for US queries. For global coverage, we use paid APIs but implement circuit breakers to detect and avoid unreliable sources.

Batch API Calls: Many weather APIs support bulk requests where you can fetch data for multiple locations in a single API call. We batch locations together (e.g., 100 locations per request) to reduce the number of API calls and associated costs.

Data Quality Validation: To avoid wasting API quota on bad data, we implement validation checks. If an API returns obviously incorrect data (temperature of 100°C, negative humidity), we mark it as invalid and try an alternative source rather than storing and distributing bad data.

This intelligent polling strategy can reduce API costs by 70-80% compared to naive polling while maintaining data freshness for actively used locations.
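
The freshness-checking and on-demand fetching described above can be sketched as follows. This is a minimal illustration: the cache class stands in for Redis, `fetch_from_api` stands in for a provider client, and the tier names and thresholds are the illustrative values from this section.

```python
import time

# Freshness thresholds per location tier (illustrative values from the text).
FRESHNESS_SECONDS = {"popular": 300, "rural": 1800}  # 5 min vs 30 min

class WeatherCache:
    """Stand-in for Redis: maps location_id -> (data, fetched_at)."""
    def __init__(self):
        self._store = {}
    def get(self, location_id):
        return self._store.get(location_id)
    def put(self, location_id, data, now):
        self._store[location_id] = (data, now)

def get_weather(location_id, tier, cache, fetch_from_api, now=None):
    """Return cached data if still fresh for this tier; otherwise fetch
    on demand (a paid API call) and cache the result."""
    now = time.time() if now is None else now
    entry = cache.get(location_id)
    if entry is not None:
        data, fetched_at = entry
        if now - fetched_at < FRESHNESS_SECONDS[tier]:
            return data, False          # fresh cache hit: no API call made
    data = fetch_from_api(location_id)  # stale or missing: hit the provider
    cache.put(location_id, data, now)
    return data, True
```

Repeated queries within the freshness window cost nothing; only the first query after expiry triggers a provider call.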

Deep Dive 2: How do we aggregate and normalize weather data from different providers?

Different weather APIs return data in varying formats with different units, field names, and levels of detail. We need to normalize this into a consistent internal schema.

Problem: Data Heterogeneity

OpenWeatherMap might return temperature in Kelvin with field name “temp”, NOAA returns it in Fahrenheit as “temperature.value”, and Weather Underground uses Celsius as “temp_c”. Wind direction might be in degrees, cardinal directions (N, NE, E), or radians. Weather conditions use different taxonomies (“thunderstorm” vs “tstorm” vs “electrical_storm”).

Solution: Normalization Pipeline

The Aggregation Service implements provider-specific adapters that transform raw API responses into our internal WeatherData schema:

Unit Standardization: All temperatures are converted to Celsius internally. Wind speeds are normalized to meters per second. Pressure is standardized to hectopascals (hPa). This internal consistency simplifies processing and storage.

Condition Mapping: We maintain mapping tables that translate provider-specific condition codes to our standardized condition taxonomy. For example, OpenWeatherMap condition codes 200-232 all map to our "thunderstorm" category, while NOAA's "Light Rain Showers" maps to "light_rain".

Timestamp Normalization: All timestamps are converted to Unix epoch time in UTC to avoid timezone confusion and enable consistent time-based queries.

Field Mapping: We extract and rename fields to match our schema. Missing optional fields are set to null rather than omitted, ensuring consistent JSON structure.

Weighted Aggregation: When multiple sources provide data for the same location, we don’t simply take the first response. Instead, we compute a weighted average based on source reliability and data recency. Each source has a reliability score (based on historical accuracy), and more recent readings are weighted higher.

For example, if OpenWeatherMap (reliability 0.95) reports 22°C and NOAA (reliability 0.98) reports 23°C, both from 5 minutes ago, we compute (0.95 × 22 + 0.98 × 23) / (0.95 + 0.98) ≈ 22.5°C.

Conflict Resolution: For categorical data like weather conditions, we can’t average. Instead, we use the most severe condition if there’s disagreement. If one source says “cloudy” and another says “thunderstorm”, we report “thunderstorm” to err on the side of caution.

Confidence Scoring: The aggregated data includes a confidence score reflecting the agreement between sources and the quality of data. High confidence (above 0.9) means multiple sources agree closely. Low confidence indicates disagreement or stale data, which might trigger additional data fetching.
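
The weighted-average and most-severe-condition rules above can be sketched as below. The reliability scores and severity ordering are illustrative, not a fixed taxonomy:

```python
# Conditions ordered from least to most severe (illustrative ordering).
SEVERITY_ORDER = ["clear", "cloudy", "light_rain", "thunderstorm", "tornado"]

def aggregate_temperature(readings):
    """readings: list of (value_celsius, reliability). Returns the
    reliability-weighted average across sources."""
    total = sum(value * weight for value, weight in readings)
    return total / sum(weight for _, weight in readings)

def resolve_condition(conditions):
    """Categorical data can't be averaged: err on the side of caution and
    report the most severe condition any source observed."""
    return max(conditions, key=SEVERITY_ORDER.index)
```

With the worked example from the text, `aggregate_temperature([(22, 0.95), (23, 0.98)])` gives roughly 22.5°C, and a "cloudy" vs "thunderstorm" disagreement resolves to "thunderstorm".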

Deep Dive 3: How do we implement efficient geospatial queries for location-based weather?

Users often request weather for arbitrary coordinates that don’t exactly match weather station locations. We need to find nearby stations and interpolate values.

Problem: Spatial Querying

A user might request weather for coordinates (40.7589, -73.9851) in New York City, but our nearest weather station might be a kilometer away at (40.7614, -73.9776). We need to efficiently find the nearest stations to any query point and interpolate weather conditions.

Solution: Geospatial Indexing with PostGIS

We use PostgreSQL with the PostGIS extension to enable efficient spatial queries:

Geographic Data Type: The locations table has a geography column that stores latitude and longitude coordinates using the WGS84 spatial reference system (SRID 4326). This enables distance calculations that account for Earth’s curvature.

Spatial Index: We create a GiST (Generalized Search Tree) index on the geography column. This R-tree-based index dramatically accelerates proximity searches from O(n) full table scans to O(log n) indexed lookups.

Proximity Query Function: When a user queries weather for arbitrary coordinates, we execute a spatial query to find the nearest 5-8 weather stations within a reasonable radius (e.g., 50km). This query uses the KNN (K-Nearest Neighbor) operator which leverages the spatial index.

Inverse Distance Weighting Interpolation: With multiple nearby stations, we interpolate the weather conditions using the Inverse Distance Weighting (IDW) algorithm. Closer stations have more influence on the interpolated value than distant ones.

The weight for each station is calculated as 1 / distance². For a point with three nearby stations at distances 5km, 10km, and 15km with temperatures 20°C, 22°C, and 19°C respectively, the weights are 0.04, 0.01, and 0.0044. The weighted average is (0.04 × 20 + 0.01 × 22 + 0.0044 × 19) / (0.04 + 0.01 + 0.0044) ≈ 20.3°C.
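
A minimal IDW implementation matching this worked example:

```python
# Inverse Distance Weighting: nearer stations contribute more to the
# interpolated value. power=2 matches the 1/d^2 weighting above.
def idw(samples, power=2):
    """samples: list of (distance_km, value) pairs from nearby stations."""
    weighted = [(1 / d ** power, v) for d, v in samples]
    return sum(w * v for w, v in weighted) / sum(w for w, _ in weighted)
```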

Special Handling for Categorical Data: Wind direction and weather conditions can’t be averaged numerically. For wind direction, we convert to unit vectors, average the vectors, and convert back to a direction. For weather conditions, we use the condition from the nearest station.
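
The vector-averaging trick for wind direction can be sketched as follows; note that naively averaging 350° and 10° would give 180° (due south), while the vector mean correctly yields north:

```python
import math

def average_wind_direction(degrees):
    """Average compass bearings by converting to unit vectors, summing,
    and converting the resultant vector back to a bearing in [0, 360)."""
    x = sum(math.cos(math.radians(d)) for d in degrees)
    y = sum(math.sin(math.radians(d)) for d in degrees)
    return math.degrees(math.atan2(y, x)) % 360
```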

Optimization - Direct Match: If there’s a weather station within 1km of the query point, we skip interpolation and use that station’s data directly, as the difference would be negligible.

Caching Interpolated Results: Interpolated weather for popular coordinate pairs is cached in Redis with the coordinates as part of the cache key. This avoids recomputing interpolations for repeated queries to the same or nearby locations.

This approach provides accurate weather data for any location on Earth while keeping query latency under 200ms even for interpolated locations.

Deep Dive 4: How do we implement multi-layer caching to achieve sub-200ms latency?

With 6,000 QPS average and peaks of 18,000 QPS, hitting the database for every request would overwhelm it and create unacceptable latency. Caching is critical.

Problem: Cache Strategy Complexity

We need to balance freshness with performance. Weather data becomes stale quickly (5-minute freshness requirement), so we can’t cache indefinitely. Different data types have different update frequencies: current weather changes every few minutes, hourly forecasts every 30 minutes, daily forecasts every few hours.

Solution: Multi-Layer Cache Architecture

We implement a three-tier caching system with different characteristics at each layer:

L1 Cache - Application-Level In-Memory Cache: Each Weather Service instance maintains a local LRU (Least Recently Used) cache in memory using a library like Caffeine or Guava Cache. This cache stores the most frequently accessed weather data with a maximum size limit (e.g., 10,000 entries). Cache hits at this layer return in under 1ms with zero network latency. TTL is kept short (5 minutes) to ensure reasonable freshness.

L2 Cache - Distributed Redis Cache: All service instances share a Redis cluster that stores weather data with configurable TTLs based on data type. Current weather has a 5-minute TTL, hourly forecasts have 30-minute TTL, daily forecasts have 1-hour TTL. Cache hits at this layer take 5-10ms (network round trip to Redis). Redis stores data in a structured format (JSON or MessagePack) with compression to reduce memory usage.

L3 Cache - Database: PostgreSQL acts as the ultimate source of truth. When both L1 and L2 miss, we query the database. Database queries typically take 20-50ms with proper indexing. After retrieval, we populate both L2 and L1 caches for subsequent requests.

Cache Population Strategy: When new weather data is ingested, we follow a write-through cache pattern: data is written to the database first, then immediately written to the L2 Redis cache. The L1 cache is not directly populated on writes; instead, it’s populated on reads (cache-aside pattern) to avoid wasting memory on data that might not be requested.

Cache Invalidation: When updated weather data arrives from providers, we need to invalidate stale cache entries. For L2 Redis, we simply overwrite the key with new data and reset the TTL. For L1 caches across multiple service instances, we publish a cache invalidation event to a Pub/Sub channel (Redis Pub/Sub or Kafka topic). Each service instance subscribes to this channel and removes invalidated keys from its local cache.

Cache Key Design: Cache keys are structured as “datatype:locationid:params” for example “current:40.7589_-73.9851:metric” or “forecast:hourly:NYC:metric”. This allows targeted invalidation and easy debugging.

Monitoring Cache Performance: We track cache hit rates for each layer. A healthy system sees 60-70% L1 hit rate, 25-30% L2 hit rate, and only 5-10% database hits. If L1 hit rate drops significantly, it might indicate the cache size is too small or TTLs are too short.

This multi-layer approach typically achieves p99 latencies under 200ms with 90-95% of requests served from cache, dramatically reducing database load and costs.
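
A simplified sketch of the L1/L2/database read path, with plain dicts standing in for Redis and PostgreSQL (the class and hit-rate bookkeeping are illustrative):

```python
from collections import OrderedDict

class MultiLayerCache:
    """L1: per-instance LRU dict. L2: shared dict standing in for Redis.
    db: dict standing in for PostgreSQL, the source of truth."""
    def __init__(self, l2, db, l1_max=10_000):
        self.l1 = OrderedDict()
        self.l2 = l2
        self.db = db
        self.l1_max = l1_max
        self.hits = {"l1": 0, "l2": 0, "db": 0}

    def get(self, key):
        if key in self.l1:
            self.l1.move_to_end(key)     # refresh LRU position, <1ms path
            self.hits["l1"] += 1
            return self.l1[key]
        if key in self.l2:
            self.hits["l2"] += 1         # ~5-10ms path in practice
            value = self.l2[key]
        else:
            self.hits["db"] += 1         # ~20-50ms path in practice
            value = self.db[key]
            self.l2[key] = value         # populate L2 on the way back
        self.l1[key] = value             # cache-aside into L1 on reads
        if len(self.l1) > self.l1_max:
            self.l1.popitem(last=False)  # evict least recently used
        return value
```

A second service instance sharing the same L2 gets an L2 hit on its first read, since writes and earlier misses have already populated Redis.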

Deep Dive 5: How do we detect severe weather conditions and deliver alerts in real-time?

Severe weather alerts are safety-critical and must be delivered quickly to all affected users, potentially millions during major events like hurricanes.

Problem: Real-Time Alert Processing at Scale

When a severe weather event is detected (e.g., a tornado), we need to identify all users within the affected area (potentially a 50km radius) and send push notifications within 30 seconds. With millions of users, spatial queries and notification delivery must be highly optimized.

Solution: Event-Driven Alert Architecture

We implement a streaming architecture using Kafka for real-time event processing:

Continuous Weather Monitoring: The Data Ingestion Service publishes every weather data update to a Kafka topic called “weather-raw-data”. This creates a real-time stream of all weather observations.

Alert Detection with Stream Processing: The Alert Detection Service uses a stream processing framework like Apache Flink or Kafka Streams to consume the weather data stream in real-time. It maintains alert criteria in memory (temperature thresholds, wind speed limits, dangerous conditions) and evaluates every incoming reading against these criteria.

Alert Criteria Evaluation: For temperature alerts, we check if the value exceeds dangerous thresholds (e.g., above 40°C for extreme heat, below -30°C for extreme cold). For wind, we check speeds above 25 m/s. For conditions, we match against a set of severe weather types (tornado, hurricane, blizzard, thunderstorm). Each alert has an associated severity level: advisory, warning, or emergency.

Alert Creation and Deduplication: When a severe condition is detected, we create an Alert entity. To avoid duplicate alerts for the same event (multiple readings showing the same tornado), we implement deduplication based on alert type, location proximity, and time window. If an active alert of the same type exists for a nearby location (within 10km) and was created recently (within 30 minutes), we update the existing alert rather than creating a new one.
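
This deduplication rule can be sketched as below, using a haversine distance check and the 10km / 30-minute thresholds stated above (the alert record shape is illustrative):

```python
import math

DEDUP_KM = 10            # proximity threshold from the text
DEDUP_SECONDS = 30 * 60  # recency window from the text

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(a))

def find_duplicate(new_alert, active_alerts, now):
    """Return an existing alert to update instead of creating a new one,
    or None if no same-type alert is close enough in space and time."""
    for alert in active_alerts:
        if (alert["type"] == new_alert["type"]
                and now - alert["created_at"] < DEDUP_SECONDS
                and haversine_km(alert["lat"], alert["lon"],
                                 new_alert["lat"], new_alert["lon"]) < DEDUP_KM):
            return alert
    return None
```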

Geospatial User Lookup: To find affected users, we perform a spatial query to find all users who have saved a location within the alert’s affected area (typically a radius around the event location). This query uses a geospatial index on the user_locations table to quickly identify potentially millions of affected users.

User Preference Filtering: Not all affected users should receive notifications. We filter based on user preferences: Are alerts enabled? Is this alert type in their enabled list? Does the severity meet their threshold? Is it currently quiet hours for them? This filtering might reduce the notification list by 50-70%.

Batched Push Notification Delivery: Push notifications are sent in batches to APNs and FCM. For iOS, we use APNs HTTP/2 API which supports sending to thousands of devices in parallel. For Android, FCM supports multicast messages to up to 1,000 devices per request. We batch users by platform and send notifications concurrently.

Notification Rate Limiting: To avoid overwhelming notification services and being rate-limited by APNs/FCM, we implement flow control. If we need to send to 10 million users, we send in waves of 100,000 users every few seconds rather than all at once.
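
The wave-based flow control can be sketched as a simple batching generator; the 100,000-user wave size is the illustrative figure from above, and the pacing delay between waves would be applied by the caller:

```python
WAVE_SIZE = 100_000  # users per wave (illustrative value from the text)

def notification_waves(user_ids, wave_size=WAVE_SIZE):
    """Yield successive fixed-size batches of user IDs so the caller can
    send one wave, sleep a few seconds, then send the next."""
    for start in range(0, len(user_ids), wave_size):
        yield user_ids[start:start + wave_size]
```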

Alert Lifecycle Management: Alerts have an expiration time. A tornado warning might expire after 2 hours. A separate service periodically scans for expired alerts and marks them as inactive, ensuring users don’t continue receiving notifications for past events.

Reliability Guarantees: Kafka provides durability and at-least-once delivery. If the Alert Detection Service crashes mid-processing, Kafka's consumer group rebalancing ensures another instance picks up where it left off. Offsets are committed only after successful processing, preventing lost alerts; occasional reprocessing is absorbed by the deduplication step above, and Kafka transactions can provide exactly-once semantics where that is insufficient.

This architecture achieves alert delivery within 30 seconds from detection to user notification, even for events affecting millions of users.

Deep Dive 6: How do we efficiently store and query historical weather data?

Historical weather data is valuable for trend analysis, climate studies, and machine learning model training. However, storing years of hourly weather data for 100,000 locations is a significant data management challenge.

Problem: Time-Series Data at Scale

Storing 5 years of hourly weather data for 100,000 locations means 100,000 locations × 365 days × 24 hours × 5 years = over 4 billion rows. Traditional relational databases struggle with this volume, and queries become slow without proper optimization.

Solution: TimescaleDB for Time-Series Optimization

We use TimescaleDB, a PostgreSQL extension optimized for time-series data:

Hypertables: TimescaleDB converts our historical_weather table into a hypertable, which automatically partitions data by time. Instead of one massive table, data is split into chunks (e.g., one chunk per week). Queries that filter by time only scan relevant chunks, dramatically improving performance.

Time-Based Partitioning: Data is automatically partitioned by the observed_at timestamp column. When inserting data, TimescaleDB determines the appropriate chunk and inserts there. When querying historical data for a specific date range, only chunks within that range are scanned.

Continuous Aggregates: For common queries like “average daily temperature” or “monthly precipitation totals”, computing on-the-fly from hourly data is slow. TimescaleDB’s continuous aggregates are materialized views that automatically update as new data arrives. We create aggregates for daily statistics (average temperature, max temperature, min temperature, total precipitation), monthly statistics, and yearly statistics.

Retention Policies: Not all historical data needs to be kept forever. We implement retention policies that automatically delete data older than a certain threshold. For example, hourly data is kept for 2 years, after which only daily aggregates are retained. This balances storage costs with data availability.

Compression: TimescaleDB supports native compression for older data chunks. Data older than 1 month is automatically compressed, reducing storage by 90-95% while still remaining queryable. Compression groups similar values together and uses algorithms optimized for time-series patterns.

Indexing Strategy: We create indexes on location_id and observed_at columns for fast filtering. TimescaleDB automatically creates indexes on time dimensions of hypertables. For queries like “show me temperature trend for New York in January 2024”, the index allows rapid location and time filtering.

Historical Query Service: When users request historical data, we first check whether the request covers recent data (the last 2 years), which is served from the full-resolution hourly data, or older data, which is served from the daily aggregates. This ensures queries remain fast even over large time ranges.

For example, a query for “average temperature for each day in 2020” would use the daily_weather_stats continuous aggregate rather than scanning millions of hourly records, returning results in milliseconds instead of seconds.
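The routing rule can be sketched as a small helper (table and view names such as historical_weather and daily_weather_stats follow the examples above and are illustrative):

```python
from datetime import date, timedelta

HOURLY_RETENTION = timedelta(days=365 * 2)  # hourly data kept for 2 years

def pick_source(range_start: date, today: date) -> str:
    """Route a historical query to full-resolution data or daily aggregates.

    Table/view names are illustrative, matching the examples in the text.
    """
    if range_start >= today - HOURLY_RETENTION:
        return "historical_weather"    # full-resolution hourly hypertable
    return "daily_weather_stats"       # pre-computed continuous aggregate
```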

This approach efficiently stores terabytes of historical weather data while maintaining fast query performance and controlling storage costs through compression and retention policies.

Deep Dive 7: How do we serve weather map tiles with low latency globally?

Weather radar and satellite imagery require serving potentially millions of map tiles to users worldwide with sub-100ms latency.

Problem: Large-Scale Tile Distribution

A typical radar view might display a 10x10 grid of tiles, meaning 100 tiles per page view. With millions of users, this translates to tens of millions of tile requests per minute. Each tile is 20-50KB, creating significant bandwidth requirements. Additionally, tiles need to be updated frequently (every 5-10 minutes) as new radar data arrives.

Solution: CDN-Based Tile Delivery with Pre-Generation

We implement a multi-tiered tile delivery architecture:

Pre-Generated Tiles: Rather than generating tiles on-demand for every request, we pre-generate tiles for common zoom levels (typically zoom 0-12 for global coverage) and popular regions. A background job runs whenever new radar data arrives (roughly every 5 minutes), rendering tiles for the latest timestamp.

Tile Storage in S3: Generated tiles are uploaded to S3 (or similar object storage) organized by a hierarchical key structure: layer/zoom/x/y/timestamp.png. For example, “radar/8/123/456/1640995200.png” represents a radar layer tile at zoom level 8, coordinates x=123 y=456, for timestamp 1640995200.
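The key scheme is a one-line helper; stating it as code makes the layout unambiguous:

```python
def tile_key(layer: str, zoom: int, x: int, y: int, timestamp: int) -> str:
    """Build the hierarchical S3 object key for a pre-generated tile."""
    return f"{layer}/{zoom}/{x}/{y}/{timestamp}.png"
```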

CloudFront CDN Distribution: S3 is configured as the origin for a CloudFront distribution. The CDN caches tiles at edge locations worldwide. When a user in Tokyo requests a tile, it’s served from the nearest Asia-Pacific edge location rather than crossing the Pacific to US data centers.

Cache Headers: Tiles are served with appropriate cache headers. Since tiles are immutable (tiles for a specific timestamp never change), we set long cache TTLs (e.g., Cache-Control: max-age=3600). This allows aggressive caching at both the CDN and client levels.

Client-Side Tile Request Flow: The mobile app or web client determines which tiles are needed based on the visible map viewport and zoom level. It generates tile URLs and makes concurrent requests (typically 10-20 in parallel). The browser or HTTP client automatically handles connection pooling and parallel downloads.
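The viewport-to-tile mapping typically uses the standard Web Mercator ("slippy map") tiling scheme; a minimal sketch of the coordinate math the client would run:

```python
import math

def latlon_to_tile(lat: float, lon: float, zoom: int) -> tuple[int, int]:
    """Convert a WGS84 coordinate to slippy-map tile x/y at a given zoom."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def tiles_for_viewport(lat_min, lat_max, lon_min, lon_max, zoom):
    """Enumerate every tile covering a rectangular map viewport."""
    x0, y1 = latlon_to_tile(lat_max, lon_min, zoom)  # top-left corner
    x1, y0 = latlon_to_tile(lat_min, lon_max, zoom)  # bottom-right corner
    return [(x, y) for x in range(x0, x1 + 1) for y in range(y1, y0 + 1)]
```

The client builds one URL per (x, y) pair returned and fetches them concurrently.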

CDN Cache Hit Optimization: The first user to request a newly generated tile experiences a cache miss and waits for the tile to be fetched from S3 (100-200ms). Subsequent users benefit from the cached tile at the edge location (sub-50ms). During peak usage, cache hit rates exceed 95%.

On-Demand Tile Generation: For less common zoom levels or regions, we generate tiles on-demand when requested. The Map Tile Service checks if the tile exists in S3. If not, it generates it by overlaying weather data on a base map tile, uploads to S3, and returns it to the user. This on-demand tile is then cached for future requests.

Tile Format Optimization: We use PNG format with transparency for weather overlays, allowing them to be layered over base map tiles. Compression is tuned to balance quality and file size, typically achieving 30-50KB per tile.

Animated Radar: For animated radar showing weather movement, we pre-generate tiles for the last 20 time steps (covering the last 2 hours if updated every 6 minutes). The client downloads all time steps and plays them as an animation locally, creating a smooth weather loop without additional server requests.
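Deriving the frame timestamps for the animation loop is simple arithmetic on the update interval; a sketch using the 20-step, 6-minute figures from above:

```python
UPDATE_INTERVAL = 360   # seconds between radar updates (6 minutes)
ANIMATION_STEPS = 20    # frames in the loop, covering roughly 2 hours

def animation_timestamps(now_epoch: int) -> list[int]:
    """Return the 20 most recent radar timestamps, oldest first,
    aligned to the 6-minute generation boundary."""
    latest = now_epoch - (now_epoch % UPDATE_INTERVAL)
    return [latest - i * UPDATE_INTERVAL for i in range(ANIMATION_STEPS - 1, -1, -1)]
```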

Bandwidth Optimization: With CDN distribution, most tile bandwidth is served from edge locations, reducing origin server load to near zero. S3 data transfer costs are minimized since most requests are cache hits at CloudFront.

This architecture delivers weather map tiles with p99 latency under 100ms globally while handling millions of requests per minute efficiently.

Step 4: Wrap Up

In this chapter, we proposed a comprehensive system design for a weather application serving 100 million daily active users. If there is extra time at the end of the interview, here are additional points to discuss:

Final Architecture Summary

We’ve designed a production-grade weather application with the following key components:

Data Ingestion Layer: Intelligent polling from multiple weather APIs (OpenWeatherMap, NOAA, Weather Underground) with quota management, circuit breakers, and failover. Cost-optimized through popularity-based polling and freshness checking.

Aggregation and Normalization: Multi-source data fusion with weighted aggregation, conflict resolution, and quality scoring. Normalizes heterogeneous API responses into a unified schema.

Caching Strategy: Three-tier caching (L1 in-memory, L2 Redis, L3 database) with TTL-based invalidation, achieving 90-95% cache hit rates and sub-200ms p99 latency.

Geospatial Capabilities: PostGIS-powered spatial indexing for efficient proximity searches and Inverse Distance Weighting interpolation for accurate weather at any coordinates.

Real-Time Alerts: Event-driven architecture using Kafka for real-time weather monitoring and alert detection, delivering push notifications to affected users within 30 seconds.

Historical Data Storage: TimescaleDB hypertables with automatic time-based partitioning, continuous aggregates, compression, and retention policies for efficient long-term storage.

Map Tile Delivery: CDN-distributed pre-generated weather map tiles served from global edge locations with sub-100ms latency and 95%+ cache hit rates.

Key Design Decisions

Multi-Source Redundancy: Integrating multiple weather APIs provides redundancy if one source fails and improves accuracy through data fusion. The weighted aggregation approach prevents bad data from a single source from affecting users.

Aggressive Caching: With weather data changing every 5 minutes, we can cache aggressively within that window. The three-tier cache dramatically reduces database load and API costs while meeting latency requirements.

Event-Driven Architecture: Using Kafka for weather data streams enables real-time processing, decouples services, provides durability during failures, and scales horizontally by adding consumer instances.

Geospatial Optimization: PostGIS spatial indexes transform O(n) proximity searches into O(log n) indexed lookups, making location-based queries feasible at scale. Interpolation provides accurate weather anywhere on Earth.

Cost Management: Intelligent quota management with popularity-based polling, on-demand fetching, and batch requests reduces API costs by 70-80% compared to naive polling.

Scalability Characteristics

Horizontal Scaling: All services are stateless and can scale horizontally by adding instances behind load balancers. The API Gateway, Weather Service, Forecast Service, and Alert Service all scale independently based on load.

Database Sharding: While not implemented in the initial design, PostgreSQL can be sharded by geographic region or location_id hash for further scaling. North America weather data could be in one shard, Europe in another.

Cache Scaling: Redis can be clustered with data partitioned across multiple nodes. For even higher throughput, we can use Redis Cluster with automatic sharding.

Message Queue Scaling: Kafka topics are partitioned, allowing parallel processing across multiple consumer instances. Adding partitions increases throughput linearly.

CDN Edge Scaling: CloudFront automatically scales to handle traffic spikes and serves content from 200+ edge locations worldwide.

Error Handling and Resilience

Circuit Breakers: When a weather API becomes unreliable or slow, circuit breakers trip to prevent cascading failures. After a cool-down period, the system retries to see if the service recovered.
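A minimal circuit-breaker state machine looks roughly like this (thresholds and the injectable clock are illustrative choices for the sketch):

```python
import time

class CircuitBreaker:
    """Trip open after consecutive failures; allow a probe after a cool-down."""
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow a probe request once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

Each external weather API gets its own breaker instance, so one failing provider never blocks calls to the others.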

Graceful Degradation: If all external weather APIs fail, the system serves cached data even if slightly stale, showing users a “last updated” timestamp. This is better than showing nothing.

Retry Logic: Failed API requests are retried with exponential backoff. If all retries fail, the system falls back to alternative data sources.
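The retry-then-fallback flow can be sketched as follows (the jitter amount, retry count, and injectable sleep are illustrative simplifications):

```python
import random
import time

def fetch_with_retry(fetch, fallbacks=(), max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on failure, retry with exponential backoff plus jitter,
    then fall back to alternative sources in order."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt < max_retries - 1:
                # Delays grow 1s, 2s, 4s, ... with a little jitter.
                sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
    for alternative in fallbacks:
        try:
            return alternative()
        except Exception:
            continue
    raise RuntimeError("all weather sources failed")
```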

Data Validation: Invalid data (impossible temperatures, negative wind speeds) is filtered out before storage to prevent garbage data from reaching users.
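A plausibility filter of this kind is a handful of range checks; the bounds below are illustrative (production ranges would come from climatology):

```python
def is_plausible(obs: dict) -> bool:
    """Reject physically impossible observations before they reach storage."""
    checks = {
        "temp_c": (-90.0, 60.0),        # Earth's record extremes are ~-89C / ~57C
        "humidity_pct": (0.0, 100.0),
        "wind_speed_ms": (0.0, 120.0),  # negative wind speed is impossible
        "pressure_hpa": (850.0, 1090.0),
    }
    for field, (lo, hi) in checks.items():
        value = obs.get(field)
        if value is None or not (lo <= value <= hi):
            return False
    return True
```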

Multi-Region Deployment: Critical components are deployed in multiple AWS regions for disaster recovery. If us-east-1 goes down, traffic fails over to us-west-2.

Monitoring and Observability

Key Metrics to Track:

  • API latency percentiles (p50, p95, p99) for all endpoints
  • Cache hit rates for L1, L2, and CDN layers
  • Weather data freshness lag (time since last update)
  • Alert delivery time from detection to notification
  • External API error rates and response times
  • Database query performance and connection pool utilization

Critical Alerts:

  • API error rate exceeding 1%
  • Cache hit rate dropping below 80%
  • Data staleness exceeding 10 minutes
  • Alert delivery delay exceeding 60 seconds
  • Database replica lag exceeding 30 seconds

Distributed Tracing: Using tools like Jaeger or DataDog, we trace requests across service boundaries to identify bottlenecks and latency spikes in the distributed system.

Logging: Structured logging with correlation IDs allows debugging issues across multiple services and identifying patterns in errors.

Security Considerations

API Authentication: All client requests require JWT tokens obtained through OAuth 2.0 authentication. Tokens expire after 24 hours and must be refreshed.

Rate Limiting: API Gateway implements rate limiting per user (e.g., 100 requests per minute) to prevent abuse and ensure fair resource distribution.
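One common way to implement this limit is a per-user token bucket; a minimal in-memory sketch using the 100 requests/minute figure above (a production gateway would keep the buckets in Redis so all instances share state):

```python
import time

class TokenBucket:
    """Per-user token bucket: capacity of one minute's quota, refilled continuously."""
    def __init__(self, rate_per_minute=100, clock=time.monotonic):
        self.capacity = rate_per_minute
        self.refill_per_sec = rate_per_minute / 60.0
        self.tokens = float(rate_per_minute)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```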

Data Encryption: All data is encrypted in transit using TLS 1.3 and at rest using AES-256. Database encryption is enabled for PostgreSQL and Redis.

API Key Management: Third-party weather API keys are stored in AWS Secrets Manager and rotated periodically. Services retrieve keys at runtime rather than hardcoding.

Input Validation: All location coordinates and query parameters are validated to prevent injection attacks. Latitude must be between -90 and 90, longitude between -180 and 180.

Future Enhancements

Machine Learning for Forecasting: Train custom ML models on historical weather data to improve forecast accuracy for specific regions. Ensemble models combining third-party forecasts with our models could outperform any single source.

Hyperlocal Weather: Integrate crowdsourced data from personal weather stations and IoT devices. Services like Weather Underground’s PWS (Personal Weather Station) network provide hyperlocal data at street-level granularity.

Predictive Analytics: Use historical weather patterns and ML to predict severe weather events before official warnings are issued by agencies. Early warnings could save lives.

Weather-Based Recommendations: Integrate with other services to provide context-aware recommendations. Suggest indoor activities when rain is forecast, recommend sunscreen when the UV index is high.

AR Weather Visualization: Augmented reality features showing weather overlays through smartphone cameras. Point your camera at the sky and see real-time precipitation, temperature gradients, or cloud types labeled.

Climate Change Analytics: Long-term historical data enables climate trend analysis. Show users how temperatures in their city have changed over decades, or project future trends.

Multi-Modal Alerts: Beyond push notifications, integrate with smart home devices (Alexa, Google Home) for voice alerts, smartwatch notifications, and even automated actions (closing smart blinds when a severe storm is approaching).

Additional Features

Minute-Cast Precipitation: Similar to Dark Sky’s minute-by-minute precipitation forecasts, use radar data and ML to predict exactly when rain will start and stop at a specific location over the next hour.

Air Quality Integration: Expand beyond weather to include air quality index (AQI), pollen counts, and UV index. These are closely related to weather and valuable for users with allergies or respiratory conditions.

Weather Widgets: Customizable home screen and lock screen widgets for mobile devices showing at-a-glance weather information without opening the app.

Social Features: Allow users to submit weather reports and photos, creating a crowdsourced verification layer for conditions. “Is it really snowing in your area?” could be answered by user-submitted photos.

Weather API for Third Parties: Expose our aggregated and normalized weather data as an API for other developers, creating a revenue stream. Our multi-source aggregation and quality scoring could be valuable to other applications.

This architecture successfully meets all functional and non-functional requirements, serving 100 million users with sub-200ms latency, ensuring 99.99% uptime, delivering real-time alerts within 30 seconds, and maintaining cost efficiency through intelligent caching and API quota management. The design scales horizontally across all layers and gracefully handles failures through redundancy, circuit breakers, and fallback mechanisms.