Design Slack

Slack is a team collaboration platform that connects people across organizations through real-time messaging, channels, and integrations. It allows teams to communicate instantly through direct messages and organized channels, share files, search conversation history, and integrate with countless third-party tools. Designing Slack presents unique challenges including real-time message delivery at scale, managing millions of concurrent connections, ensuring message ordering, and providing powerful search across billions of messages.

Step 1: Understand the Problem and Establish Design Scope

Before diving into the design, it’s crucial to define what we’re building and the constraints we’re working within. For a complex platform like Slack, we need to carefully prioritize features and establish clear boundaries.

Functional Requirements

Core Messaging (Priority 0 - Must Have):

  1. Users should be able to send and receive messages in real-time with sub-second latency.
  2. Users should be able to create and join channels for organizing conversations by topic.
  3. Users should be able to send direct messages to other users for private conversations.
  4. Users should be able to mention other users with @ symbols to notify them.
  5. Users should be able to organize discussions within threads attached to parent messages.

Essential Features (Priority 1):

  • Users should be able to edit and delete their own messages after sending.
  • Users should be able to add emoji reactions to messages.
  • Users should be able to search across all message history with filters.
  • Users should be able to upload and share files with team members.
  • Users should be able to see online/offline/away status of other users.
  • Users should be able to integrate external services via webhooks and bots.

Nice to Have (Priority 2 - Below the Line):

  • Users should be able to see typing indicators when others are composing messages.
  • Users should be able to archive channels for later reference.
  • Users should be able to pin important messages to channels.
  • Users should be able to customize notification preferences per channel.

Non-Functional Requirements

Core Requirements:

  • The system should deliver messages with less than 100ms latency at p99 under normal load.
  • The system should ensure strong consistency for message ordering within a channel to prevent confusion.
  • The system should handle 10 million daily active users sending 100 million messages per day.
  • The system should maintain 99.95% uptime, allowing only about 4.4 hours of downtime per year.
  • The system should support 500,000 concurrent WebSocket connections per region.

Below the Line (Out of Scope):

  • The system should provide end-to-end encryption for enterprise customers.
  • The system should comply with data protection regulations like GDPR and CCPA.
  • The system should maintain detailed audit logs for compliance purposes.
  • The system should facilitate zero-downtime deployments and updates.

Clarification Questions & Assumptions:

  • Platform: Web, desktop (Electron), and mobile apps for iOS and Android.
  • Scale: Average workspace has 100 users and 50 channels. Large enterprise workspaces can have 100,000 users and 10,000 channels.
  • Message Size: Average message is 1KB including metadata. Maximum message size is 40,000 characters.
  • Peak Load: System experiences 3x average load during business hours in major time zones.
  • File Storage: Users can upload files up to 1GB in size.

Back-of-the-Envelope Estimation

Storage Requirements:

  • Messages: 100 million messages per day at 1KB average equals 100GB per day or roughly 36TB per year.
  • With indexes and metadata, multiply by 3x for approximately 100TB per year.
  • Files: Assuming 10% of messages include files averaging 500KB, that’s 5TB per day or approximately 1.8PB per year.
  • Total storage needed: Approximately 2PB per year including messages, files, indexes, and backups.

Throughput Requirements:

  • Message writes: 100 million messages per day equals about 1,157 messages per second on average.
  • During peak hours (3x average): approximately 3,500 writes per second.
  • Message reads are typically 10x writes: approximately 35,000 reads per second during peak.
  • WebSocket bandwidth: 70,000 message deliveries per second at 1KB each equals roughly 70MB/s or 560Mbps.

Connection Management:

  • 500,000 concurrent WebSocket connections per region.
  • With 10 gateway servers, each handles 50,000 connections.
  • Heartbeat every 30 seconds generates 16,666 ping/pong messages per second per region.
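
The arithmetic above is easy to sanity-check with a quick script; the constants are the assumptions stated in this section, not measured figures:

```python
# Back-of-the-envelope numbers from the assumptions in this section.
MESSAGES_PER_DAY = 100_000_000
AVG_MESSAGE_BYTES = 1_000            # ~1KB per message including metadata
SECONDS_PER_DAY = 86_400

avg_writes_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY           # ~1,157
peak_writes_per_sec = avg_writes_per_sec * 3                      # ~3,472 at 3x peak
peak_reads_per_sec = peak_writes_per_sec * 10                     # reads are ~10x writes

storage_per_day_gb = MESSAGES_PER_DAY * AVG_MESSAGE_BYTES / 1e9   # 100 GB/day
storage_per_year_tb = storage_per_day_gb * 365 / 1_000            # ~36.5 TB/year

CONNECTIONS_PER_REGION = 500_000
HEARTBEAT_INTERVAL_S = 30
pings_per_sec = CONNECTIONS_PER_REGION / HEARTBEAT_INTERVAL_S     # ~16,666/sec/region

assert round(avg_writes_per_sec) == 1157
assert round(peak_writes_per_sec) == 3472
assert storage_per_year_tb == 36.5
```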

Step 2: Propose High-Level Design and Get Buy-in

Planning the Approach

We’ll build up the system sequentially, going through each functional requirement. For each requirement, we’ll identify the necessary components and data flows. This structured approach helps ensure we don’t miss critical pieces while keeping the design conversation organized.

Defining the Core Entities

To satisfy our key functional requirements, we’ll need the following entities:

User: Any person who uses Slack. Contains authentication credentials, profile information, preferences, and session data for multiple devices. Users can be members of multiple workspaces simultaneously.

Workspace: An isolated organization environment that contains users, channels, and messages. Represents a company, team, or community. Each workspace has its own members, channels, billing plan, and security settings. Workspaces are completely isolated from each other for security and privacy.

Channel: A persistent chat room within a workspace for organizing conversations by topic. Can be public (visible to all workspace members) or private (invitation only). Contains metadata like name, description, topic, member list, and creation timestamp.

Message: A single piece of content sent by a user in a channel or direct message. Includes the text content, author, timestamp, channel reference, thread reference if part of a thread, attachments, reactions, and edit history. Messages are immutable at the storage layer (edits create new versions).

Thread: A collection of reply messages attached to a parent message. Allows organizing side conversations without cluttering the main channel. Contains reference to the parent message and maintains ordering of replies.

File: An uploaded document, image, video, or other file shared in a channel or direct message. Stored separately from messages with metadata including filename, size, MIME type, uploader, upload timestamp, and associated message.

API Design

Send Message: Used by clients to send a new message to a channel or direct message conversation.

POST /messages -> Message
Body: {
  channelId: string,
  text: string,
  threadId?: string,
  attachments?: Array
}

Get Channel Messages: Retrieves message history for a channel with pagination support.

GET /channels/:channelId/messages -> Array<Message>
Query: {
  limit: number,
  before?: timestamp,
  after?: timestamp
}

Search Messages: Performs full-text search across all messages the user has access to.

GET /search -> SearchResults
Query: {
  query: string,
  channelId?: string,
  from?: date,
  to?: date
}

Update Presence: Sent periodically by clients to maintain online status.

POST /presence -> Success
Body: {
  status: "online" | "away" | "offline"
}

Upload File: Handles multipart file uploads with metadata.

POST /files -> File
Body: multipart/form-data {
  file: binary,
  channelId: string
}

High-Level Architecture

Let’s build up the system sequentially, addressing each functional requirement:

1. Users should be able to send and receive messages in real-time

The core components necessary for real-time messaging are:

  • Client Applications: Available on web, desktop, iOS, and Android. Maintain persistent WebSocket connections for real-time updates while also supporting HTTP REST APIs for request/response operations.

  • Load Balancer: Distributes incoming traffic across multiple gateway servers. Handles TLS termination and provides health checks. For WebSocket connections, implements sticky sessions to ensure clients reconnect to the same gateway server when possible.

  • API Gateway: Handles HTTP REST API requests for operations that don’t require real-time updates. Manages authentication using JWT tokens, enforces rate limiting per user and workspace, validates request payloads, and routes requests to appropriate microservices.

  • WebSocket Gateway: Maintains persistent bidirectional connections with clients for real-time message delivery. Each gateway instance can handle tens of thousands of concurrent connections. Stateful service that tracks which users are connected to which gateway server. Implements heartbeat mechanism to detect dead connections.

  • Message Service: Core service that processes incoming messages, assigns unique IDs, validates content, stores messages in the database, and publishes events for downstream processing. Handles message mutations like edits and deletes.

  • Message Database (Cassandra): Stores message content and metadata. Cassandra is chosen for its ability to handle high write throughput and excellent support for time-series data. Messages are partitioned by channel ID and ordered by timestamp within each partition.

  • Event Bus (Kafka): Distributes message events to multiple consumers for fanout, search indexing, notification processing, and analytics. Provides durability and replay capability.

  • Connection Registry (Redis): Maps user IDs to their connected gateway server instances. Enables the fanout system to locate which gateway to push messages to for each recipient. Uses TTL to automatically clean up stale connections.

Message Send Flow:

  1. User composes a message in their client and clicks send; the client transmits the message as a WebSocket frame to its connected gateway.
  2. WebSocket Gateway validates the authentication token embedded in the connection and forwards the message payload to the Message Service.
  3. Message Service generates a unique message ID using a combination of timestamp and snowflake algorithm to ensure uniqueness and sortability.
  4. The service validates the user has permission to send messages in the specified channel by checking with the Channel Service.
  5. Message is written to Cassandra using channel ID as the partition key, ensuring all messages for a channel are co-located for efficient retrieval.
  6. Message Service publishes a message created event to Kafka, decoupling the send path from the delivery path.
  7. Multiple Kafka consumers process the event: Search Indexer adds to Elasticsearch, Notification Service updates unread counts, and Fanout Worker prepares for delivery.
  8. Fanout Worker queries the Channel Service to get the list of channel members, then looks up each member’s connection in the Connection Registry.
  9. For online members, the fanout system pushes the message to their respective gateway servers, which forward over WebSocket to the connected clients.
  10. For offline members, push notifications are queued for delivery via APNs and FCM.
  11. Clients receive the message in under 100ms from send time and update their UI.
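
Step 3's ID scheme can be sketched as a Snowflake-style generator. The field widths and custom epoch below are illustrative assumptions, not Slack's actual layout:

```python
import threading, time

class SnowflakeIds:
    """Time-sortable 64-bit IDs: 41 bits of milliseconds since a custom epoch,
    10 bits of worker ID, 12 bits of per-millisecond sequence (a sketch)."""
    EPOCH_MS = 1_577_836_800_000  # 2020-01-01 UTC (assumed custom epoch)

    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024   # must fit in 10 bits
        self.worker_id = worker_id
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000) - self.EPOCH_MS
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit sequence
                if self.sequence == 0:  # sequence exhausted: wait for next millisecond
                    while now <= self.last_ms:
                        now = int(time.time() * 1000) - self.EPOCH_MS
            else:
                self.sequence = 0
            self.last_ms = now
            return (now << 22) | (self.worker_id << 12) | self.sequence

gen = SnowflakeIds(worker_id=7)
a, b = gen.next_id(), gen.next_id()
assert b > a  # IDs are strictly increasing, hence sortable by creation time
```

Because the timestamp occupies the high bits, sorting messages by ID is equivalent to sorting by creation time, which is what the ordering guarantees later in this design rely on.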

2. Users should be able to create and join channels

We extend our existing design with channel management:

  • Channel Service: Manages all channel operations including creation, member management, permissions, and metadata. Stores channel data in PostgreSQL for ACID transaction support since channel operations often involve multiple related updates.

  • Channel Database (PostgreSQL): Stores workspace, channel, and membership information. Provides relational integrity, transactional guarantees, and rich querying capabilities needed for complex permission checks and member operations.

Channel Creation Flow:

  1. User creates a new channel by providing a name, description, and privacy setting (public or private).
  2. Client sends a POST request through the API Gateway to the Channel Service.
  3. Channel Service validates the channel name is unique within the workspace and that the user has permission to create channels.
  4. A new channel record is created in PostgreSQL with a unique ID, along with initial membership adding the creator as an admin.
  5. For public channels, a message is posted to a default announcement channel notifying workspace members.
  6. Client receives the channel object and can immediately start sending messages.

Channel Join Flow:

  1. For public channels, users can discover them through a channel browser or search.
  2. User clicks join, sending a request to add themselves as a member.
  3. Channel Service validates the channel is public or the user has an invitation, then adds a membership record.
  4. The user’s client subscribes to that channel’s message stream and fetches recent message history.

3. Users should be able to organize discussions within threads

Threads are a critical feature for keeping conversations organized:

  • Thread Service: Manages threaded conversations including reply counts, participant lists, and thread metadata. Works closely with the Message Service since thread replies are special messages with parent references.

Threading Implementation:

  • Thread replies are stored as regular messages but include a thread timestamp field pointing to the parent message’s timestamp.
  • When displaying a channel, only parent messages are shown with a reply count indicator.
  • When a user clicks to expand a thread, the client fetches all messages with that thread timestamp.
  • Thread metadata (reply count, last reply time, participants) is cached in Redis and updated via Kafka consumers as new replies arrive.
  • This approach avoids expensive aggregation queries and provides fast thread summary information.
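
The cached thread metadata described above can be sketched as a small consumer-side update. A dict stands in for Redis here (in production this would be a hash per thread, updated by a Kafka consumer):

```python
from dataclasses import dataclass, field

@dataclass
class ThreadSummary:
    reply_count: int = 0
    last_reply_ts: float = 0.0
    participants: set = field(default_factory=set)

# Stand-in for the Redis cache keyed by the parent message's thread timestamp.
thread_cache = {}

def on_reply_event(thread_ts, author_id, ts):
    """Consume a 'thread reply created' event and refresh the cached summary,
    avoiding an aggregation query over the messages table."""
    summary = thread_cache.setdefault(thread_ts, ThreadSummary())
    summary.reply_count += 1
    summary.last_reply_ts = max(summary.last_reply_ts, ts)
    summary.participants.add(author_id)

on_reply_event("1700000000.000100", "U123", 1700000010.0)
on_reply_event("1700000000.000100", "U456", 1700000020.0)
s = thread_cache["1700000000.000100"]
assert s.reply_count == 2 and s.participants == {"U123", "U456"}
```
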

4. Users should be able to search across all message history

Search is one of Slack’s most valuable features:

  • Search Service: Handles all search queries, interacting with Elasticsearch for full-text search capabilities. Enforces access control by filtering results to only channels the user is a member of.

  • Search Index (Elasticsearch): Stores inverted indexes of message content, enabling fast full-text search. Supports complex queries with filters, aggregations, and highlighting. Provides near real-time indexing.

  • Search Indexer: Kafka consumer that reads message events and indexes them into Elasticsearch. Batches messages for efficient indexing with 5-second flush intervals.

Search Flow:

  1. User enters a search query with optional filters for channel, date range, or sender.
  2. Client sends the query to the Search Service via the API Gateway.
  3. Search Service first fetches the list of channels the user has access to from the Channel Service (with aggressive caching).
  4. Constructs an Elasticsearch query that includes the user’s search terms and filters by accessible channels for security.
  5. Elasticsearch returns matching messages with highlighted snippets and relevance scores.
  6. Results are returned to the client with snippets showing context around the match.

5. Users should be able to upload and share files

File sharing requires careful handling of large objects:

  • File Service: Handles file uploads, virus scanning coordination, preview generation, and serving. Manages metadata in PostgreSQL while storing actual file content in object storage.

  • Object Storage (S3): Stores uploaded files with high durability and availability. Integrated with CDN for fast global delivery.

  • CDN (CloudFront): Caches frequently accessed files at edge locations worldwide, reducing latency and origin load.

File Upload Flow:

  1. User selects a file in their client, which initiates a multipart upload to the File Service through the API Gateway.
  2. File Service streams the upload directly to S3 with server-side encryption, validating size limits (max 1GB) during the upload.
  3. File metadata (name, size, type, uploader) is stored in PostgreSQL with a reference to the S3 key.
  4. The File Service publishes a file uploaded event to Kafka and returns a file ID to the client.
  5. Virus Scanner (Kafka consumer) downloads the file from S3, scans it with antivirus software, and marks the result in the database. Infected files are deleted and users notified.
  6. Preview Generator (Kafka consumer) creates thumbnails for images and previews for documents, storing them back in S3.
  7. A message is automatically posted to the channel with a file attachment, allowing members to click and download.
  8. When users access files, signed URLs are generated with expiration times and channel membership is verified.

6. Users should be able to see online status and integrate external services

Additional supporting services round out the platform:

  • Presence Service: Tracks user online/offline/away status using Redis for fast updates and retrieval. Implements TTL-based expiration where connections that don’t heartbeat expire automatically.

  • Integration Service: Manages webhooks, bots, slash commands, and OAuth flows for third-party applications. Provides extensibility without modifying core services.

  • Notification Service: Handles push notifications for mobile devices when users are offline or are mentioned. Integrates with APNs for iOS and FCM for Android.

Step 3: Design Deep Dive

With the core functional requirements met, it’s time to dig into the non-functional requirements and critical system challenges. These deep dives separate good designs from great ones.

Deep Dive 1: How do we efficiently maintain 500K concurrent WebSocket connections and deliver messages in real-time?

Maintaining hundreds of thousands of concurrent connections while delivering messages with minimal latency presents significant challenges.

Challenge 1: Connection Management at Scale

Each WebSocket connection consumes memory for buffers, state, and metadata. A naive implementation would struggle to handle more than a few thousand connections per server. We need an efficient architecture for connection management.

Solution - Connection Gateway Fleet:

Deploy a fleet of specialized WebSocket Gateway servers optimized for connection handling. Each gateway server is configured to handle approximately 50,000 concurrent connections. With 500,000 total connections, we need about 10 gateway servers per region.

Each gateway maintains a mapping of connected users in a local data structure for fast lookup. When a connection is established, the gateway registers this mapping in a shared Connection Registry implemented in Redis. The key is the user ID and workspace ID combination, and the value includes the gateway server ID and connection timestamp. This registry uses a TTL that is refreshed by periodic heartbeats.
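
The registry's TTL behavior can be sketched as follows. A dict stands in for Redis (where this would be a SET with an EX option, refreshed on heartbeat); the explicit `now` parameter just makes the expiry logic testable:

```python
import time

class ConnectionRegistry:
    """Maps (user_id, workspace_id) -> gateway server ID with a TTL that is
    refreshed by heartbeats; expired entries read as 'offline' (a sketch)."""
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self._entries = {}  # (user_id, workspace_id) -> (gateway_id, expires_at)

    def register(self, user_id, workspace_id, gateway_id, now=None):
        now = time.time() if now is None else now
        self._entries[(user_id, workspace_id)] = (gateway_id, now + self.ttl)

    def heartbeat(self, user_id, workspace_id, now=None):
        entry = self._entries.get((user_id, workspace_id))
        if entry:
            now = time.time() if now is None else now
            self._entries[(user_id, workspace_id)] = (entry[0], now + self.ttl)

    def lookup(self, user_id, workspace_id, now=None):
        now = time.time() if now is None else now
        entry = self._entries.get((user_id, workspace_id))
        if entry is None or entry[1] <= now:
            return None  # expired or missing: treat the user as disconnected
        return entry[0]

reg = ConnectionRegistry(ttl_seconds=60)
reg.register("U1", "W1", "gateway-3", now=0.0)
assert reg.lookup("U1", "W1", now=30.0) == "gateway-3"
assert reg.lookup("U1", "W1", now=61.0) is None  # TTL lapsed without a heartbeat
```

The TTL means a crashed client is automatically cleaned up: no heartbeat, no refresh, and the fanout path simply stops seeing the user as online.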

Clients send heartbeat pings every 30 seconds over the WebSocket connection. The gateway responds with pongs, and this bidirectional heartbeat ensures both sides can detect dead connections. If a heartbeat times out, the gateway closes the connection and removes the registry entry. If a client detects missing pongs, it attempts to reconnect with exponential backoff.

When a client connects, they authenticate with a JWT token that includes user ID, workspace ID, and permissions. The gateway validates this token and refuses connections with invalid or expired tokens. The connection itself becomes a secure channel for bidirectional communication without needing to reauthenticate each message.

Challenge 2: Message Fanout to Thousands of Recipients

When a message is sent to a large channel with thousands of members, the system needs to deliver it to all online members efficiently. A naive approach of sequentially sending to each recipient would create unacceptable delays for later recipients.

Solution - Hybrid Push/Pull Model:

For small to medium channels (under 1,000 members), we use a push model. The Fanout Worker reads the message event from Kafka, queries the Channel Service for the member list, then looks up each member’s gateway connection in Redis. It sends the message to each gateway server in parallel using batching. Each gateway server then pushes to its locally connected clients over WebSocket.

For very large channels (1,000+ members), a pure push model becomes inefficient. Instead, we use a hybrid approach. The system identifies the most recently active members in the channel (those who have sent messages or read the channel recently) and pushes to them. This might be the top 500 most active members. Other members retrieve messages when they next open the channel using a pull model, fetching recent messages via the API.

This hybrid approach ensures active participants get real-time updates while avoiding the overhead of pushing to thousands of potentially inactive recipients. The system can also batch delivery to the same gateway server, sending multiple messages in a single network call to reduce overhead.
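
The push-vs-pull decision above can be sketched as a simple planner. The 1,000-member cutoff and 500-member push cap are the figures assumed in this section:

```python
PUSH_THRESHOLD = 1_000     # channels below this size get a full push
ACTIVE_PUSH_LIMIT = 500    # in large channels, push only to the most active

def plan_fanout(member_ids, recent_activity):
    """Split channel members into push recipients (real-time WebSocket delivery)
    and pull recipients (fetch on next channel open).
    recent_activity maps member_id -> last-active timestamp."""
    if len(member_ids) < PUSH_THRESHOLD:
        return list(member_ids), []
    ranked = sorted(member_ids,
                    key=lambda m: recent_activity.get(m, 0.0),
                    reverse=True)
    return ranked[:ACTIVE_PUSH_LIMIT], ranked[ACTIVE_PUSH_LIMIT:]

members = [f"U{i}" for i in range(2_000)]
activity = {f"U{i}": float(i) for i in range(2_000)}  # U1999 is most recently active
push, pull = plan_fanout(members, activity)
assert len(push) == 500 and "U1999" in push
assert len(pull) == 1_500
```
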

Challenge 3: Guaranteed Delivery and Ordering

Network issues, client restarts, and server failures can cause message loss. We need mechanisms to ensure messages are delivered exactly once and in the correct order.

Solution - Client-Side Message Tracking:

Each client maintains a per-channel cursor tracking the last message ID they have received. Message IDs are designed to be sortable by combining a timestamp with a unique sequence, enabling clients to determine if they’ve missed messages.

When a client reconnects after being offline, it sends its last seen message ID for each channel. The server responds with all messages since that ID. The client deduplicates by message ID in case it receives the same message via both push and pull mechanisms.

For ordering, messages within a channel are strictly ordered by their timestamp. Cassandra’s clustering columns naturally maintain this order. When displaying messages, clients sort by timestamp to handle any out-of-order delivery over the network.

The WebSocket gateway maintains a small buffer of recent messages per channel. If message fanout fails or a client reconnects within seconds, the gateway can serve these from memory without hitting the database, providing fast recovery.
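
The reconnect path above — resume from a cursor, then deduplicate — can be sketched as a pure function (message shape and field names are illustrative):

```python
def resume_channel(server_messages, last_seen_id, already_have):
    """On reconnect, fetch everything after the client's per-channel cursor and
    deduplicate by message ID, since a message may arrive via both the push
    path and this pull path."""
    missed = [m for m in server_messages if m["id"] > last_seen_id]
    fresh = [m for m in missed if m["id"] not in set(already_have)]
    return sorted(fresh, key=lambda m: m["id"])  # sortable IDs restore channel order

channel_log = [{"id": i, "text": f"msg {i}"} for i in (1, 2, 3, 4, 5)]
# Client last saw message 2, and message 4 already arrived via push:
got = resume_channel(channel_log, last_seen_id=2, already_have={4})
assert [m["id"] for m in got] == [3, 5]
```
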

Deep Dive 2: How do we ensure strong consistency in message ordering while maintaining high throughput?

Message ordering is critical for usability. Out-of-order messages create confusion, especially in fast-moving conversations.

Challenge: Distributed Systems and Ordering

In a distributed system with multiple Message Service instances and multiple Cassandra nodes, ensuring consistent ordering is complex. Network delays, clock skew, and concurrent processing can lead to messages being stored or delivered out of order.

Solution - Single-Writer Per Channel with Timestamp Ordering:

While we have multiple Message Service instances for scalability and availability, we use a partitioning scheme to ensure all messages for a given channel are processed by the same instance at any point in time.

We implement this using consistent hashing on the channel ID. When a message arrives at the API Gateway or WebSocket Gateway, it forwards to a specific Message Service instance based on hashing the channel ID. This ensures serialization of message processing for each channel while still allowing parallel processing across different channels.
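
A minimal consistent-hash ring for this routing might look like the following sketch; instance names and virtual-node count are illustrative:

```python
import hashlib
from bisect import bisect_right

class ChannelRouter:
    """Consistent-hash ring mapping channel IDs to Message Service instances,
    so every message for a given channel lands on the same instance (a sketch).
    Virtual nodes smooth out the load distribution across instances."""
    def __init__(self, instances, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{inst}#{v}"), inst)
            for inst in instances for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def route(self, channel_id):
        # First ring position clockwise of the channel's hash (wrapping around).
        i = bisect_right(self.keys, self._hash(channel_id)) % len(self.ring)
        return self.ring[i][1]

router = ChannelRouter(["msg-svc-1", "msg-svc-2", "msg-svc-3"])
assert router.route("C123") == router.route("C123")  # stable per-channel routing
```

Compared with plain modulo hashing, adding or removing an instance only remaps the channels adjacent to its ring positions rather than reshuffling every channel.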

Each Message Service instance generates message IDs that combine the current timestamp (in milliseconds or microseconds) with a sequence number. If multiple messages arrive in the same millisecond, the sequence number ensures they get unique, sortable IDs.

When writing to Cassandra, we use the channel ID as the partition key and message timestamp as the clustering key with descending order. This ensures messages are physically stored in order on disk and range queries return messages in the correct sequence without sorting.

For the fanout mechanism, Kafka maintains message ordering within a partition. We partition Kafka topics by channel ID, ensuring all messages for a channel are processed in order by downstream consumers. This guarantees that search indexing, notifications, and delivery happen in the correct sequence.

Challenge: Clock Synchronization

Relying on timestamps requires synchronized clocks across servers. Clock skew between servers could cause ordering issues.

Solution - Clock Synchronization and Hybrid Clocks:

All servers run NTP (Network Time Protocol) clients synchronized to reliable time sources. This keeps clock skew under 100ms, which is acceptable given the latency of message delivery.

For additional protection, we can implement hybrid logical clocks (HLC) which combine physical timestamps with logical counters. If the system detects a timestamp earlier than the previous message, it uses the previous timestamp plus an incremented counter. This prevents backward time travel in the message sequence.
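
The HLC behavior described above is small enough to sketch directly — when the physical clock fails to advance (or moves backward), the logical counter keeps timestamps monotonic:

```python
class HybridLogicalClock:
    """Hybrid logical clock sketch: physical milliseconds plus a logical
    counter that breaks ties and prevents backward time travel."""
    def __init__(self):
        self.last_ms = 0
        self.counter = 0

    def next(self, physical_ms):
        if physical_ms > self.last_ms:
            self.last_ms, self.counter = physical_ms, 0
        else:
            self.counter += 1  # clock stalled or skewed backward: bump the counter
        return (self.last_ms, self.counter)

hlc = HybridLogicalClock()
a = hlc.next(1_000)
b = hlc.next(999)    # physical clock appears to jump backward
c = hlc.next(1_001)
assert a < b < c     # HLC timestamps still strictly increase
```
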

Deep Dive 3: How do we implement efficient full-text search with access control across billions of messages?

Search is complex when you have billions of messages across millions of channels with fine-grained permissions.

Challenge 1: Search Performance at Scale

Searching billions of messages needs to be fast (under 500ms) even with complex filters. Additionally, we must ensure users only see results from channels they have access to.

Solution - Elasticsearch with Smart Sharding:

Elasticsearch provides distributed full-text search with horizontal scalability. We configure it with 10 shards per index to distribute the load across multiple nodes. Each shard contains a subset of messages, and queries are parallelized across all shards.

The index schema includes the message text (with custom analyzers for tokenization), metadata like user ID, channel ID, and timestamp, and special fields for mentions and file attachments. We use a custom analyzer that handles code snippets, mentions, and emojis specially.

For optimal performance, we route queries by workspace ID. Since users only search within their workspace, this allows Elasticsearch to query specific shards rather than all shards, reducing query time.

We also implement a multi-tier storage strategy. Recent messages (last 90 days) are stored on hot nodes with SSDs for fast access. Older messages are moved to warm nodes with HDDs, which are slower but cheaper. This optimizes cost while maintaining performance for common searches.

Challenge 2: Access Control in Search Results

Unlike a simple blog where all content is public, Slack has complex permissions. Users can only see messages from channels they’re members of, and some channels are private.

Solution - Query-Time Permission Filtering:

When a user performs a search, the Search Service first fetches the list of channel IDs the user has access to. This involves querying the Channel Service with the user ID and workspace ID. The response is aggressively cached in Redis with a TTL of several minutes since channel memberships don’t change frequently.

The search query sent to Elasticsearch includes a filter clause that restricts results to only the channels in the user’s accessible list. Elasticsearch applies this filter efficiently before scoring and ranking results. This happens at the query stage, so documents from inaccessible channels are never returned.

The query structure includes multiple components: a match clause for full-text search on the message content, a terms filter for the channel ID list, and optional filters for date ranges, users, or file types. We also apply highlighting to show matching snippets in context.
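
That query structure maps onto the Elasticsearch query DSL roughly as below. The field names (`text`, `channel_id`, `timestamp`) are assumed index fields, not a confirmed schema; note the access-control filter sits in the `filter` context, which is applied before scoring:

```python
def build_search_query(terms, accessible_channels, date_from=None, date_to=None):
    """Sketch of the search query described above: full-text match on content,
    a terms filter restricting results to the user's channels, optional date
    range filters, and snippet highlighting."""
    filters = [{"terms": {"channel_id": accessible_channels}}]
    if date_from or date_to:
        rng = {}
        if date_from:
            rng["gte"] = date_from
        if date_to:
            rng["lte"] = date_to
        filters.append({"range": {"timestamp": rng}})
    return {
        "query": {"bool": {
            "must": [{"match": {"text": terms}}],
            "filter": filters,  # documents from inaccessible channels never match
        }},
        "highlight": {"fields": {"text": {}}},
    }

q = build_search_query("deploy failed", ["C1", "C9"], date_from="2024-01-01")
assert q["query"]["bool"]["filter"][0]["terms"]["channel_id"] == ["C1", "C9"]
```
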

For search performance, we batch common searches and cache results in Redis. If multiple users in a workspace search for the same term, subsequent searches within a few minutes return cached results (filtered for their specific permissions).

Challenge 3: Real-Time Indexing

Users expect to search for recent messages immediately. There’s a tension between indexing speed and search performance.

Solution - Near Real-Time Indexing Pipeline:

Messages flow from the Message Service through Kafka to the Search Indexer consumer. The indexer batches messages for efficient indexing, using a buffer size of 500 messages or 5 seconds, whichever comes first.

This creates a small delay (average 2-3 seconds) between message send and searchability, but this is acceptable given the performance benefits. Elasticsearch’s refresh interval is set to 5 seconds, providing near real-time search.

The indexer includes retry logic for transient failures and a dead letter queue for messages that consistently fail to index. We monitor the indexing lag (time between message creation and indexing) to detect pipeline issues.
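
The size-or-time batching policy above can be sketched as follows (the 500-message and 5-second thresholds are the figures from this section; `flush_fn` stands in for an Elasticsearch bulk call):

```python
import time

class BatchingIndexer:
    """Buffers message events and flushes when either the batch size or the
    maximum wait is reached, whichever comes first (a sketch)."""
    def __init__(self, flush_fn, max_batch=500, max_wait_s=5.0):
        self.flush_fn = flush_fn      # e.g. an Elasticsearch bulk-index call
        self.max_batch = max_batch
        self.max_wait = max_wait_s
        self.buffer = []
        self.first_ts = None          # arrival time of the oldest buffered event

    def add(self, event, now=None):
        now = time.time() if now is None else now
        if not self.buffer:
            self.first_ts = now
        self.buffer.append(event)
        if len(self.buffer) >= self.max_batch or now - self.first_ts >= self.max_wait:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer, self.first_ts = [], None

batches = []
idx = BatchingIndexer(batches.append, max_batch=3, max_wait_s=5.0)
for i in range(3):
    idx.add({"id": i}, now=float(i))   # third add hits the size threshold
assert len(batches) == 1 and len(batches[0]) == 3
```

A real consumer would also call `flush()` on a timer so a partially filled buffer is not stranded when traffic is quiet.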

Deep Dive 4: How do we efficiently track read/unread status and display accurate unread counts?

Unread tracking is critical for user experience but challenging at scale.

Challenge: Scalability of Unread Counts

With 10 million users and an average of 50 channels each, we have 500 million user-channel combinations. Storing and updating unread counts for each would require hundreds of millions of database operations.

Solution - Cursor-Based Tracking with Redis Cache:

Rather than storing an unread count directly, we store only a single timestamp per user per channel representing the last message they read. This is stored in the channel members table in PostgreSQL as a simple timestamp column.

To calculate unread count, we count messages in that channel with timestamps after the user’s last read timestamp. However, running this query for every channel on every page load would be expensive.

Instead, we maintain a cache in Redis with the key structure being unread followed by the user ID and channel ID. The value is the unread count. This cache is populated on demand and invalidated when the user reads the channel.

When a new message arrives in a channel, a Kafka consumer increments the cached unread count for all channel members except the sender using Redis increment commands, which are extremely fast. Note that a plain Redis INCR creates a missing key (initializing it to 1), so the consumer uses a guarded increment, such as a small Lua script or an EXISTS check, that only bumps entries already present. This keeps the cache sparse: counts for cold channels are simply computed on demand.

When a user opens a channel and reads messages, the client sends a mark read request with the timestamp of the latest message viewed. The server updates the last read timestamp in PostgreSQL and resets the cached unread count to zero in Redis.

To get the total unread count across all channels (for the badge number), we batch fetch the cached counts for all of the user’s channels using Redis MGET. For any cache misses, we calculate on demand and populate the cache. The counts are summed client-side.

This approach means that frequently checked channels have hot caches, while rarely viewed channels don’t consume cache memory. The system handles cache failures gracefully by falling back to database queries.
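
The sparse-counter behavior can be sketched with a dict standing in for Redis; `incr_if_exists` mirrors the guarded increment (in real Redis, a short Lua script rather than a bare INCR, which would create missing keys):

```python
# (user_id, channel_id) -> cached unread count; absent keys mean "cold channel,
# compute from the database on demand".
unread = {}

def incr_if_exists(user_id, channel_id):
    """Increment only hot cache entries, keeping the cache sparse."""
    key = (user_id, channel_id)
    if key in unread:
        unread[key] += 1

def mark_read(user_id, channel_id):
    # In production this also persists the last-read timestamp in PostgreSQL.
    unread[(user_id, channel_id)] = 0

def on_new_message(channel_members, sender_id, channel_id):
    for member in channel_members:
        if member != sender_id:
            incr_if_exists(member, channel_id)

mark_read("U1", "C1")                        # U1 has the channel open: hot entry
on_new_message(["U1", "U2", "U3"], "U3", "C1")
assert unread[("U1", "C1")] == 1             # hot entry incremented
assert ("U2", "C1") not in unread            # cold entry stays absent (sparse)
```
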

Read Receipts for Direct Messages:

For one-on-one direct messages and small group conversations, we show read receipts indicating who has seen each message. This is implemented using Redis sorted sets where the key identifies the message and members are user IDs with their read timestamp as the score.

This data is ephemeral with a TTL of 7 days. It’s not critical if lost since it’s only used for UI indicators, not core functionality. The sorted set structure allows efficiently querying which users have read a message and when.

Deep Dive 5: How do we handle presence and typing indicators without overwhelming the system?

Presence and typing indicators enhance real-time collaboration but generate massive amounts of ephemeral data.

Challenge: High Update Frequency

With millions of users, presence updates every 30 seconds and typing indicators every few seconds would generate enormous traffic and database load.

Solution - Redis-Based Ephemeral State with Pub/Sub:

All presence and typing state is stored exclusively in Redis, never touching persistent databases. This data is ephemeral by nature and can be lost without major impact.

For presence, when a user connects via WebSocket, the gateway calls the Presence Service to mark them online. This sets a Redis key with a 60-second TTL. The WebSocket heartbeat refreshes this TTL. When the TTL expires (because the user disconnected or their client crashed), the key automatically vanishes, marking them offline.

The presence key is set in a workspace-specific keyspace. We also maintain a Redis set of all currently online user IDs in the workspace for efficient lookup. When the Presence Service marks a user online or offline, it publishes an event to a Redis pub/sub channel specific to that workspace.

Clients interested in presence updates (typically those viewing the same channel or direct message conversation) subscribe to the workspace presence channel. When they receive a presence change event, they update their UI accordingly. This pub/sub mechanism allows broadcasting presence changes to interested clients without the server tracking who’s interested.
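The TTL-driven presence flow can be sketched as below, with an expiry dict standing in for Redis SETEX and a callback list standing in for the workspace pub/sub channel. Expiry-driven offline events (which Redis keyspace notifications would provide in practice) are omitted; the clock is injectable for testing:

```python
class PresenceService:
    """Sketch of TTL-based presence, assuming a 60-second key TTL."""

    TTL = 60.0

    def __init__(self, clock):
        self.clock = clock       # injectable time source
        self.expiry = {}         # user_id -> expiry timestamp
        self.subscribers = []    # pub/sub stand-in: presence callbacks

    def heartbeat(self, user_id):
        # SETEX presence:{ws}:{user} 60 "online", refreshed on every
        # WebSocket heartbeat; publish only on offline -> online edges.
        was_online = self.is_online(user_id)
        self.expiry[user_id] = self.clock() + self.TTL
        if not was_online:
            self._publish({"user": user_id, "status": "online"})

    def is_online(self, user_id):
        # Key present and unexpired means online; a lapsed TTL
        # (disconnect or crash) makes the user offline automatically.
        return self.expiry.get(user_id, 0) > self.clock()

    def _publish(self, event):
        for cb in self.subscribers:   # PUBLISH presence:{workspace}
            cb(event)
```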

For typing indicators, the implementation is similar but even more ephemeral. When a user types in a channel, the client sends a start typing event at most once every few seconds (using client-side debouncing). The server sets a Redis key with a 5-second TTL and publishes to a channel-specific pub/sub topic.

Other users currently viewing that channel are subscribed to its typing pub/sub topic and receive the typing indicator. The short TTL means if the user stops typing or closes their client, the indicator disappears automatically within seconds.
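The client-side debouncing can be sketched as follows (the 3-second interval is an assumption standing in for "at most once every few seconds"; server-side, each sent event would set the 5-second-TTL Redis key and publish to the channel's typing topic):

```python
class TypingDebouncer:
    """Sketch of client-side debouncing for start-typing events."""

    def __init__(self, clock, send, interval=3.0):
        self.clock = clock            # injectable time source
        self.send = send              # e.g., the WebSocket send function
        self.interval = interval      # minimum gap between typing events
        self.last_sent = float("-inf")

    def on_keystroke(self, channel_id):
        # Emit at most one start-typing event per interval while the
        # user keeps typing; intermediate keystrokes are swallowed.
        now = self.clock()
        if now - self.last_sent >= self.interval:
            self.send({"type": "typing", "channel": channel_id})
            self.last_sent = now
```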

Optimization for Large Channels:

For channels with thousands of members, broadcasting every presence change to all members would create a firehose of events. We optimize by only broadcasting presence changes for users who have recently been active in the channel or are currently viewing it.

The client can also poll presence for specific users they care about rather than receiving a broadcast stream. This works well for direct messages where you only care about one other person’s presence.

Deep Dive 6: How do we design integrations, webhooks, and bots to be extensible and reliable?

Third-party integrations are core to Slack’s value proposition but introduce reliability challenges.

Challenge 1: Webhook Reliability

When an integration sends data to Slack via incoming webhooks or Slack sends data to external services via outgoing webhooks, network failures and slow services can cause issues.

Solution - Async Processing with Retries:

Incoming webhooks post messages into Slack. Rather than processing these synchronously, we queue them in Kafka for async processing. This decouples accepting a webhook from processing it, protects the system against traffic spikes, and lets rate limiting and validation happen in a worker.

The Webhook Service validates the webhook token, checks rate limits (typically 1 request per second per webhook), and validates the payload structure. Valid webhooks result in messages being posted via the Message Service.

For outgoing webhooks (Slack notifying external services), we use a dedicated Webhook Delivery Worker that reads events from Kafka and makes HTTP requests to registered webhook URLs. These requests have a timeout (typically 3 seconds) and include retry logic with exponential backoff. After 3 failed attempts, the webhook is placed in a dead letter queue for manual investigation.
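The delivery loop above can be sketched as a small function; `post` is a hypothetical stand-in for an HTTP client call honoring the 3-second timeout, and `dead_letter` for the dead letter queue:

```python
import time

def deliver_with_retries(event, post, dead_letter,
                         max_attempts=3, base_delay=1.0, timeout=3.0):
    """Sketch of the Webhook Delivery Worker's retry loop:
    up to 3 attempts with exponential backoff, then dead-letter."""
    for attempt in range(max_attempts):
        try:
            post(event["url"], event["payload"], timeout=timeout)
            return True                           # delivered
        except Exception:
            if attempt < max_attempts - 1:
                time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, ...
    dead_letter.append(event)   # DLQ for manual investigation
    return False
```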

Webhook payloads include a verification token that the receiving service can check to ensure authenticity. We also support signature verification using HMAC for enhanced security.
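The HMAC verification can be sketched with the standard library; the "v0" base-string format mirrors Slack's published request-signing scheme, and binding a timestamp into the signature also defends against replay:

```python
import hashlib
import hmac
import time

def sign(secret: bytes, timestamp: str, body: bytes) -> str:
    # HMAC-SHA256 over "v0:{timestamp}:{body}", hex-encoded.
    base = b"v0:" + timestamp.encode() + b":" + body
    return "v0=" + hmac.new(secret, base, hashlib.sha256).hexdigest()

def verify(secret: bytes, timestamp: str, body: bytes,
           signature: str, max_age: float = 300.0) -> bool:
    # Reject stale timestamps (replay defense), then compare the
    # recomputed signature in constant time.
    if abs(time.time() - float(timestamp)) > max_age:
        return False
    return hmac.compare_digest(sign(secret, timestamp, body), signature)
```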

Challenge 2: Bot Message Delivery

Bots subscribe to specific event types (like all messages in channels they’re invited to). Delivering every message’s events to potentially thousands of bot endpoints would create enormous load.

Solution - Event Subscription with Filtering:

Bots register event subscriptions specifying which event types they want (message posted, reaction added, user joined channel, etc.). These subscriptions are stored in PostgreSQL with the bot’s webhook URL and event type filters.

When events occur, the Event Dispatcher service reads from Kafka and queries the subscription registry to find interested bots. Only matching subscriptions result in webhook deliveries, dramatically reducing the fanout.

Event delivery is async with retries. Failed deliveries are retried with exponential backoff up to a maximum of 3 attempts. Bots that consistently fail (high error rate over time) can be automatically disabled with notifications sent to the bot owner.

Bots also have rate limits (typically 1 request per second for posting messages). These limits are enforced at the API Gateway level using token bucket algorithms implemented in Redis.
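A minimal sketch of the token bucket, with per-token state held in a Python object; in production the `(tokens, last_refill)` pair would live in Redis and be updated atomically (e.g., via a Lua script):

```python
class TokenBucket:
    """Sketch of the API Gateway's token-bucket rate limiter."""

    def __init__(self, clock, rate=1.0, capacity=1.0):
        self.clock = clock            # injectable time source
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last = clock()

    def allow(self) -> bool:
        # Lazy refill: add rate * elapsed seconds, capped at capacity,
        # then spend one token if available.
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With `rate=1.0, capacity=1.0` this enforces the "1 request per second" bot limit; a larger capacity would allow short bursts at the same average rate.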

OAuth for Third-Party Apps:

Third-party apps integrate with Slack using OAuth 2.0 for secure authorization. When a user installs an app, they’re redirected to Slack’s authorization server where they review the requested permissions (scopes like reading channel messages or posting messages).

Upon approval, the app receives an access token which it stores securely and uses for API requests. These tokens have associated scopes that are checked on every API call, ensuring apps can only perform authorized actions. Tokens can be revoked by users at any time through their settings.

Slash Commands:

Slash commands like /remind me to review PR in 2 hours are implemented as registered command handlers. When a user types a slash command, it’s intercepted by the client and sent to the Integration Service.

The service looks up the registered handler for that command in the workspace and makes an HTTP request to the handler’s URL with the command text and context. The handler (which could be a bot or external service) processes it and returns a response message to post in the channel.

The response can be immediate (within the 3-second timeout) or delayed by posting to a response URL provided in the initial request. This allows for long-running commands that need to do background processing before responding.
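As a concrete illustration of what a command handler does with the command text, here's a sketch that parses the /remind example above. The grammar is deliberately simplified and hypothetical; the real /remind command accepts far richer phrasing:

```python
import re

def parse_remind(text: str):
    """Parse command text like "me to review PR in 2 hours" into a
    reminder description and a delay in seconds (simplified grammar)."""
    m = re.fullmatch(
        r"me to (?P<what>.+) in (?P<n>\d+) (?P<unit>minutes?|hours?)",
        text,
    )
    if m is None:
        return None  # handler would respond with a usage hint
    seconds = int(m["n"]) * (3600 if m["unit"].startswith("hour") else 60)
    return {"what": m["what"], "delay_seconds": seconds}
```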

Deep Dive 7: How do we handle file storage, virus scanning, and preview generation at scale?

File sharing creates challenges around large object storage, security, and user experience.

Challenge 1: Large File Uploads

Uploading large files (up to 1GB) through traditional request/response APIs would timeout and tie up connections.

Solution - Streaming Uploads to Object Storage:

When a client uploads a file, it streams the multipart form data to the File Service. The service immediately streams this data to S3 without buffering the entire file in memory. This allows handling large files efficiently with minimal server memory.

During the upload, the service tracks the size to enforce the 1GB limit. If the limit is exceeded, the upload is aborted and the client receives an error. Upon successful upload, the file metadata is stored in PostgreSQL with a reference to the S3 key (which includes the workspace ID, channel ID, and file ID in its path for organization).

S3 is configured with versioning disabled (files are immutable) and server-side encryption enabled for security. The bucket has lifecycle policies to transition infrequently accessed files to cheaper storage classes over time.
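The streaming size enforcement can be sketched as follows; `upload_part` is a hypothetical stand-in for forwarding a chunk to S3 (e.g., one part of a multipart upload), so no chunk is ever buffered beyond the one in flight:

```python
class FileTooLarge(Exception):
    pass

def stream_upload(chunks, upload_part, limit=1024 ** 3):
    """Sketch of the File Service's streaming path: forward each chunk
    to object storage while counting bytes, aborting the moment the
    1 GB limit is exceeded rather than buffering the whole file."""
    total = 0
    for chunk in chunks:
        total += len(chunk)
        if total > limit:
            # In practice we'd also abort the S3 multipart upload here.
            raise FileTooLarge(f"upload exceeded {limit} bytes")
        upload_part(chunk)
    return total
```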

Challenge 2: Security and Malware

User-uploaded files could contain malware, creating risk for other users who download them.

Solution - Async Virus Scanning:

After a file upload completes, the File Service publishes a file uploaded event to Kafka. A Virus Scanner consumer picks up this event, downloads the file from S3, and scans it using antivirus software.

If malware is detected, the file is marked as dangerous in the database and deleted from S3. A notification is sent to the uploader informing them their file was removed. The message referencing the file shows a warning that the attachment was removed for security reasons.

Clean files are marked as scanned in the database, allowing downloads. This async approach means there’s a brief window (typically a few seconds) where an unscanned file exists, but we prevent downloads of unscanned files through the File Service download endpoint.

Challenge 3: Preview Generation

Images, PDFs, and videos should have previews and thumbnails for better user experience, but generating these is CPU and memory intensive.

Solution - Async Preview Generation Pipeline:

Another Kafka consumer, the Preview Generator, processes file uploaded events. It downloads the file from S3 and generates appropriate previews based on the MIME type.

For images, it generates a thumbnail (200x200 pixels) and a preview (800x600 pixels) using image processing libraries. For PDFs, it converts the first page to an image. For videos, it extracts a frame at 1 second for the thumbnail.

These generated previews are uploaded back to S3 with keys derived from the original file key. The file metadata in PostgreSQL is updated with the preview URLs. This entire process happens asynchronously, so users see a loading indicator briefly before previews appear.
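The Preview Generator's dispatch logic can be sketched by deciding which artifacts to produce from the MIME type; the actual rendering (resizing, PDF rasterization, frame extraction) would be delegated to tools like Pillow or ffmpeg and is omitted here, and the artifact names are illustrative:

```python
def plan_previews(mime_type: str):
    """Return (artifact_name, width, height) tuples to generate for a
    file, based on its MIME type (sketch of the dispatch only)."""
    if mime_type.startswith("image/"):
        return [("thumbnail", 200, 200), ("preview", 800, 600)]
    if mime_type == "application/pdf":
        return [("first_page_image", 800, 600)]   # rasterize page 1
    if mime_type.startswith("video/"):
        return [("frame_at_1s", 200, 200)]        # thumbnail from 1s mark
    return []  # unsupported type: client falls back to a generic icon
```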

For unsupported file types, generic file type icons are used. The system degrades gracefully rather than failing if preview generation fails.

Challenge 4: Fast Global File Access

Users worldwide need fast access to shared files.

Solution - CDN Distribution:

All files are served through a CDN (CloudFront) rather than directly from S3. The CDN caches frequently accessed files at edge locations worldwide, providing low-latency access.

For access control, we generate signed URLs with expiration times (typically 1 hour). When a user requests to download a file, the File Service first checks that the user has access to the channel where the file was shared. If authorized, it generates a signed CloudFront URL that grants temporary access without exposing the underlying S3 structure or requiring authentication at the CDN level.

This approach combines security (channel membership verification) with performance (CDN caching) effectively.
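The idea behind signed URLs (expiry plus tamper protection) can be sketched with an HMAC over the URL and expiry time. Note this is illustrative only: real CloudFront signed URLs use an RSA-signed policy generated via the AWS SDK, not this scheme:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

def make_signed_url(base_url: str, key: bytes, ttl=3600, now=None) -> str:
    # Bind the URL and its expiry time into an HMAC-SHA256 signature.
    expires = int((now if now is not None else time.time()) + ttl)
    sig = hmac.new(key, f"{base_url}?expires={expires}".encode(),
                   hashlib.sha256).hexdigest()
    return f"{base_url}?{urlencode({'expires': expires, 'sig': sig})}"

def check_signed_url(url: str, key: bytes, now=None) -> bool:
    base, _, query = url.partition("?")
    params = dict(p.split("=", 1) for p in query.split("&"))
    if int(params["expires"]) < (now if now is not None else time.time()):
        return False  # link has expired
    expected = hmac.new(key, f"{base}?expires={params['expires']}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, params["sig"])
```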

Step 4: Wrap Up

In this design, we’ve created a comprehensive team collaboration platform similar to Slack. If there is extra time at the end of the interview, here are additional points to discuss:

Additional Features to Consider:

  • Message Reactions: Implement emoji reactions using a mapping structure in the message record or a separate table, aggregated for display.
  • Threaded Notification Preferences: Allow users to subscribe to or mute specific threads independently of channel notifications.
  • Channel Archiving: Move old channels to an archived state where they’re read-only but searchable, freeing up the active channel list.
  • Multi-Workspace Support: Allow users to be members of multiple workspaces simultaneously with a workspace switcher in the client.
  • Voice and Video Calls: Integrate WebRTC for peer-to-peer audio/video calls, with TURN servers for NAT traversal.
  • Analytics and Insights: Track workspace activity, message patterns, and engagement metrics for workspace administrators.

Scaling Considerations:

  • Database Sharding: As workspaces grow, shard PostgreSQL by workspace ID to distribute load. Large individual workspaces might need further sharding by channel ID.
  • Cassandra Scaling: Add nodes to the Cassandra cluster as message volume grows. Use separate data centers for multi-region deployments with tunable consistency.
  • Cache Warming: Pre-populate caches for frequently accessed channels and users during deployment or restart to avoid thundering herd.
  • Geographic Distribution: Deploy infrastructure in multiple regions with geo-based routing to reduce latency for global teams.
  • Elasticsearch Clusters: Use separate Elasticsearch clusters per region or per large workspace for isolation and performance.

Monitoring and Observability:

  • Key Metrics: Message delivery latency (p50, p95, p99), WebSocket connection count, search query latency, file upload success rate, and error rates per service.
  • Distributed Tracing: Trace message flow from client send through all services to delivery, identifying bottlenecks.
  • Alerting: Set up alerts for degraded performance, elevated error rates, Kafka lag, database replication lag, and WebSocket gateway failures.
  • Dashboards: Create real-time dashboards showing workspace activity, system health, resource utilization, and user engagement.

Disaster Recovery and High Availability:

  • Multi-Region Active-Active: Deploy the full stack in multiple regions with users assigned to their nearest region. Cassandra and S3 replicate across regions.
  • Database Failover: Use PostgreSQL with automatic failover (like RDS Multi-AZ) to handle primary database failures with minimal downtime.
  • Kafka Replication: Configure Kafka with replication factor 3 within the region and cross-region replication for disaster recovery.
  • Recovery Objectives: Target a Recovery Time Objective (RTO) of 5 minutes and Recovery Point Objective (RPO) of 1 minute or less.

Security Hardening:

  • End-to-End Encryption: For enterprise workspaces, implement E2EE where messages are encrypted on the client and only decrypted by recipients, with server unable to read content.
  • Data Retention Policies: Support configurable message retention with automatic deletion of old messages for compliance.
  • Audit Logging: Log all administrative actions, permission changes, and sensitive operations for security audits and compliance.
  • DDoS Protection: Use services like AWS Shield and rate limiting at multiple layers to protect against denial of service attacks.
  • Input Sanitization: Sanitize all user input to prevent XSS, SQL injection, and other injection attacks.

Cost Optimization:

  • Tiered Storage: Move old messages from Cassandra to cheaper cold storage (S3 with Parquet format) after a certain period (like 1 year for paid plans).
  • Elasticsearch Hot-Warm-Cold: Use hot nodes with SSDs for recent data, warm nodes with HDDs for older data, and cold storage for archival.
  • S3 Lifecycle Policies: Automatically transition infrequently accessed files to S3 Glacier after a year to reduce storage costs.
  • Right-Sizing: Continuously monitor resource utilization and adjust instance sizes and counts based on actual load patterns.
  • Reserved Instances: Purchase reserved or spot instances for predictable baseline load to reduce compute costs.

Data Consistency Guarantees:

  • Message Ordering: Strong consistency within a channel, ensuring all users see messages in the same order.
  • Channel Membership: Strong consistency for channel membership changes to prevent security issues.
  • Presence and Typing: Eventual consistency is acceptable, with TTL-based expiration.
  • Unread Counts: Eventual consistency with cache refresh, tolerating brief inconsistencies.

Performance Optimization:

  • Connection Pooling: Maintain pools of database connections across services to reduce connection overhead.
  • Batch Operations: Batch database writes and cache operations where possible to reduce network round trips.
  • Query Optimization: Use database indexes effectively, monitor slow queries, and optimize hot paths.
  • Async Everything: Move non-critical operations (notifications, indexing, analytics) to async processing via Kafka.

Summary

This design covers a production-grade team collaboration platform handling:

Scale Characteristics:

  • 10 million daily active users sending 100 million messages per day
  • 500,000 concurrent WebSocket connections per region
  • Sub-100ms message delivery latency at p99
  • Billions of searchable messages with complex access control
  • Petabytes of file storage with global CDN distribution

Key Architectural Components:

  • WebSocket Gateway fleet for real-time bidirectional communication
  • Cassandra for scalable message storage with time-series partitioning
  • PostgreSQL for structured metadata and transactional operations
  • Redis for ephemeral state (presence, typing, caching, connection registry)
  • Elasticsearch for powerful full-text search with permission filtering
  • Kafka for event-driven architecture and async processing
  • S3 and CloudFront for scalable file storage and delivery

Critical Design Decisions:

  • Hybrid push/pull model for efficient message fanout
  • Cursor-based unread tracking with Redis caching
  • Single-writer per channel for message ordering consistency
  • Query-time permission filtering in search with cached channel lists
  • Async processing for all non-critical paths (scanning, indexing, notifications)
  • Ephemeral state in Redis with TTL-based expiration for presence and typing

Reliability Features:

  • No single point of failure with horizontal scaling across all services
  • Multi-region deployment for disaster recovery
  • Message durability through Kafka and database replication
  • Automatic failover for databases and stateful services
  • Graceful degradation when optional services fail

This architecture demonstrates how to build a real-time collaboration platform that scales to millions of users while maintaining low latency, strong consistency where needed, and rich feature sets that make it indispensable for modern teams.