Design GitHub
GitHub is a web-based platform that provides hosting for software development and version control using Git. It offers distributed version control and source code management functionality, along with collaboration features such as pull requests, code review, issue tracking, and continuous integration/deployment through GitHub Actions.
Designing GitHub presents unique challenges including handling massive-scale git operations, generating efficient code diffs, implementing sophisticated code search across petabytes of data, orchestrating distributed CI/CD workflows, and managing complex access control across millions of repositories.
Step 1: Understand the Problem and Establish Design Scope
Before diving into the design, it’s crucial to define the functional and non-functional requirements. For a platform like GitHub, we need to balance core git functionality with collaboration features and developer productivity tools.
Functional Requirements
Core Requirements (Priority 1-3):
- Users should be able to perform full git operations including clone, fetch, push, and pull on repositories.
- Users should be able to create pull requests with code review capabilities including inline comments and review approvals.
- Users should be able to search code across repositories with full-text and symbol search.
- Users should be able to define and execute CI/CD workflows through GitHub Actions with automated testing and deployment.
Below the Line (Out of Scope):
- Users should be able to create and manage issues with labels, milestones, and projects.
- Users should be able to fork repositories and sync changes with upstream.
- Users should be able to receive notifications for mentions, reviews, and comments.
- Users should be able to configure webhooks for external integrations.
- Users should be able to manage organization and team-based permissions.
Non-Functional Requirements
Core Requirements:
- The system should provide strong consistency for git operations to prevent any lost commits or data corruption.
- The system should optimize for low latency with git clone taking under 5 seconds for small repos and pull request diff generation under 2 seconds.
- The system should handle massive scale with 100M+ repositories, 1B+ git operations per day, and 100M+ CI/CD jobs per month.
- The system should maintain 99.95% uptime with multi-region active-active deployment where possible.
Below the Line (Out of Scope):
- The system should ensure security and privacy of private repositories with encryption at rest and in transit.
- The system should be resilient to failures with graceful degradation for non-critical features.
- The system should have comprehensive monitoring and alerting to quickly identify performance bottlenecks.
Clarification Questions & Assumptions:
- Platform: Web interface and git command-line clients for developers, mobile apps for notifications and light browsing.
- Scale: 100 million active repositories with 50 million concurrent users during peak hours.
- Repository Size: Support repositories up to 100GB with recommended limit of 5GB.
- Geographic Coverage: Global with major data centers in North America, Europe, and Asia-Pacific.
- Payment: GitHub accounts and billing handled separately (out of scope for this design).
Step 2: Propose High-Level Design and Get Buy-in
Planning the Approach
For a complex platform like GitHub, we’ll build our design sequentially through each functional requirement. This ensures we establish solid foundations before layering in advanced features.
Defining the Core Entities
To satisfy our key functional requirements, we’ll need the following entities:
Repository: The fundamental unit containing a git repository with all its commits, branches, tags, and git objects. Includes metadata such as repository name, owner, visibility (public/private), default branch, size, and timestamps. The repository entity tracks where the actual git data is stored in the distributed file system.
User: Any developer who uses the platform. Contains personal information, authentication credentials, preferences, and access tokens. Users can own repositories, create pull requests, submit reviews, and trigger workflows.
Pull Request: A request to merge code changes from one branch into another. Contains references to the base and head branches, the commit SHAs, author information, title, description, reviewers, status (open/closed/merged), mergeable state, and timestamps for creation, updates, and merge events.
Review: A code review submitted on a pull request. Includes the reviewer identity, review state (approved, changes requested, commented), body text, the commit SHA being reviewed, and timestamps. Reviews can have associated comments on specific lines of code.
Workflow Run: An execution of a GitHub Actions workflow. Records the triggering event, workflow definition, job graph, overall status, start and end times, and links to logs and artifacts. Each workflow run contains multiple jobs that may execute in parallel or sequentially.
Commit: A git commit object representing a snapshot of the repository at a point in time. Contains the commit SHA, parent commit references, author and committer information, timestamp, commit message, and tree SHA pointing to the file structure.
API Design
Create Repository Endpoint: Used by developers to create a new repository with a name and visibility setting.
POST /repositories -> Repository
Body: {
name: string,
visibility: "public" | "private",
description: string
}
Clone Repository Endpoint: Initiates a git clone operation. The smart HTTP protocol first performs ref discovery with a GET, then fetches the pack via a POST; the git proxy handles the protocol negotiation and pack file streaming.
GET /repositories/:owner/:name/info/refs?service=git-upload-pack
POST /repositories/:owner/:name/git-upload-pack
Create Pull Request Endpoint: Used by developers to create a new pull request after pushing their feature branch.
POST /repositories/:owner/:name/pulls -> PullRequest
Body: {
title: string,
body: string,
head: string,
base: string
}
Submit Review Endpoint: Allows reviewers to submit their code review with approval or change requests.
POST /repositories/:owner/:name/pulls/:number/reviews -> Review
Body: {
event: "approve" | "request_changes" | "comment",
body: string,
comments: Array<{path, position, body}>
}
Search Code Endpoint: Performs full-text search across code repositories with filtering and ranking.
GET /search/code -> SearchResults
Query: {
q: string,
language: string,
repo: string
}
Trigger Workflow Endpoint: Manually triggers a workflow run for a repository.
POST /repositories/:owner/:name/actions/workflows/:id/dispatches
Body: {
ref: string,
inputs: object
}
High-Level Architecture
Let’s build up the system sequentially, addressing each functional requirement:
1. Users should be able to perform full git operations including clone, fetch, push, and pull
The core components necessary to fulfill git operations are:
- Git Client: The standard git command-line tool or git libraries used by developers. Communicates with GitHub over HTTP or SSH using the git protocol.
- Load Balancer: Distributes incoming git protocol requests across multiple git proxy servers using anycast routing for geographic proximity.
- Git Proxy: Specialized servers that terminate git protocol operations over HTTP and SSH. They validate authentication, check access permissions, and route requests to the appropriate storage shards.
- Repository Service: Manages repository metadata and coordinates git operations. Determines which storage shard contains the repository data and orchestrates pack file generation.
- Distributed File System: A custom file system optimized for storing git objects and pack files. Implements 3x replication across availability zones and supports efficient random access to git objects.
- Database: Stores repository metadata including owner, name, visibility, storage shard ID, and size. Enables quick lookups without accessing the file system.
Git Clone Flow:
- Developer runs git clone command which sends a request to the load balancer.
- The load balancer routes to the nearest git proxy server based on geographic location.
- The git proxy authenticates the request and validates the user has read access to the repository.
- The proxy queries the Repository Service to determine which storage shard contains the repository data.
- The proxy initiates pack file generation, which traverses the commit graph and compresses git objects.
- The pack file is streamed back to the client in chunks, allowing incremental download.
- Git objects are cached in Redis and CDN for subsequent clone requests.
2. Users should be able to create pull requests with code review capabilities
We extend our existing design to support pull requests and code review:
- Add Pull Request Service to manage PR lifecycle including creation, updates, and merging.
- Add Code Review Service to handle review submissions, comment placement, and review requirements.
- Add Diff Generation Service to compute code diffs between branches efficiently.
Pull Request Creation Flow:
- Developer pushes their feature branch to GitHub, which stores the commits in the distributed file system.
- Developer creates a pull request via the web UI or API, specifying base and head branches.
- The Pull Request Service validates the branches exist and creates a PR record in the database.
- The service triggers asynchronous diff generation to compute the changes between base and head.
- It also triggers mergeable state computation to detect potential merge conflicts.
- The generated diff is cached in Redis and the rendered HTML is stored for quick retrieval.
- The service sends notifications to relevant users and triggers any configured webhooks.
Code Review Flow:
- Reviewers view the pull request diff in the web UI, with syntax highlighting and split view.
- They add inline comments on specific lines by clicking the line numbers in the diff view.
- Comments are stored with their position (file path and line number) and associated commit SHA.
- Reviewers submit their overall review with approval, change requests, or general comments.
- The system validates branch protection rules to check if required reviews are met.
- When reviews are submitted, it updates the PR’s mergeable state and notifies the author.
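A minimal sketch of the required-review check, assuming a simplified branch-protection rule (an approval threshold plus a block on outstanding change requests; the field names are illustrative, not GitHub's API shape):

```python
def reviews_satisfied(reviews, required_approvals=1):
    """Evaluate a simplified branch-protection rule for a PR.

    `reviews` is a chronological list of {"reviewer", "state"} dicts;
    only each reviewer's latest review counts.
    """
    latest = {}
    for review in reviews:  # later reviews supersede earlier ones
        latest[review["reviewer"]] = review["state"]
    if any(state == "changes_requested" for state in latest.values()):
        return False  # an unresolved change request blocks the merge
    approvals = sum(1 for state in latest.values() if state == "approved")
    return approvals >= required_approvals
```

Note that a reviewer's later approval supersedes their earlier change request, which matches how re-reviews unblock a pull request.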
3. Users should be able to search code across repositories
We need to introduce specialized search infrastructure:
- Search Service: Handles search queries from users and routes them to the appropriate backend (Elasticsearch or Zoekt).
- Indexing Pipeline: Background workers that monitor repository changes and incrementally update search indexes.
- Elasticsearch Cluster: Provides full-text search across code with custom analyzers optimized for code syntax.
- Message Queue: Kafka topics that stream repository change events from git operations to indexing workers.
Code Search Flow:
- A developer enters a search query in the web UI with optional filters (language, repository, organization).
- The Search Service parses the query and applies boosting rules (exact matches ranked higher).
- It queries the Elasticsearch cluster which performs multi-field matching across file content, file names, and file paths.
- Results are aggregated from multiple index shards and ranked by relevance score.
- The service applies access control filtering to ensure users only see repositories they have permission to access.
- Results are returned with code snippets highlighted around matching terms.
- The query and results are cached to optimize repeated searches.
Incremental Indexing Flow:
- When a developer pushes commits, the git operation publishes a change event to Kafka.
- Indexing workers consume from Kafka and clone or fetch the updated repository.
- Workers use git diff to identify which files changed in the push.
- Only modified files are re-indexed, avoiding full repository scans.
- Workers extract file content, metadata, and symbols (function/class names).
- Data is indexed into Elasticsearch with appropriate fields and custom code analyzers.
- Old versions of modified files are removed from the index.
4. Users should be able to define and execute CI/CD workflows through GitHub Actions
We add comprehensive workflow orchestration infrastructure:
- Actions Service: Orchestrates workflow execution including parsing YAML files, resolving dependencies, and queuing jobs.
- Workflow Orchestrator: Manages the workflow state machine, handling retries, timeouts, and status updates.
- Job Queue: Kafka topics partitioned by priority, runner type, and organization for fair job distribution.
- Runner Pools: Collections of compute instances (VMs or containers) that execute workflow jobs. Separate pools for different operating systems (Ubuntu, Windows, macOS).
- Log Service: Collects and streams logs from runners in real-time using WebSocket connections.
- Artifact Storage: S3 buckets for storing build artifacts and caches with lifecycle policies.
Workflow Execution Flow:
- A triggering event occurs (push, pull request, schedule, manual dispatch).
- The Actions Service detects the event and searches for workflow YAML files in the repository.
- It parses the workflow files and validates syntax, permissions, and event filters.
- The orchestrator builds a job dependency graph based on the needs keyword in the workflow.
- Independent jobs (no dependencies) are immediately queued in Kafka for parallel execution.
- The job queue partitions jobs by runner requirements and organization to ensure fair scheduling.
- Available runners from the appropriate pool poll the job queue and claim jobs.
- Each runner clones the repository at the specific commit SHA that triggered the workflow.
- The runner executes each step sequentially, streaming logs in real-time to the Log Service.
- Steps can upload artifacts to S3 which are tracked and accessible from the workflow run.
- As jobs complete, the orchestrator updates the workflow run status and triggers dependent jobs.
- When all jobs finish, the workflow completes and status checks are updated on the pull request.
- Notifications are sent to users and dependent workflows may be triggered.
Step 3: Design Deep Dive
With the core functional requirements met, it’s time to dig into the non-functional requirements and critical design decisions that enable GitHub to operate at massive scale.
Deep Dive 1: How do we efficiently store and shard billions of git repositories?
Git repositories are the core of GitHub, and storing them efficiently while supporting high-throughput read and write operations is critical. Traditional file systems don’t scale to billions of repositories.
Problem Statement:
With 100 million active repositories and growing, we need a storage architecture that provides high availability, efficient space utilization through deduplication, fast random access to git objects, and geographic distribution for low latency access.
Solution: Custom Distributed File System with Sharding
We implement a custom distributed file system similar to HDFS but optimized specifically for git workloads:
Shard Selection Strategy:
Repositories are distributed across storage shards using consistent hashing on the repository ID. Each shard is a separate cluster of storage nodes running the distributed file system. This provides horizontal scalability as we can add more shards as repository count grows.
When a repository is created, we hash the repository ID and locate it on the consistent hash ring to determine its home shard. The shard assignment is stored in the repository metadata table for quick lookups.
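Shard selection can be sketched with a small consistent-hash ring (shard names and virtual-node count are illustrative):

```python
import bisect
import hashlib

class ShardRing:
    """Consistent-hash ring mapping repository IDs to storage shards.
    Virtual nodes smooth out the key distribution across shards."""

    def __init__(self, shards, vnodes=100):
        self._ring = []  # sorted (hash, shard) points on the ring
        for shard in shards:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{shard}:{i}"), shard))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, repo_id):
        # The owner is the first ring point clockwise from the key's hash.
        idx = bisect.bisect(self._ring, (self._hash(str(repo_id)), ""))
        return self._ring[idx % len(self._ring)][1]
```

Unlike plain modulo hashing, adding or removing a shard only remaps the keys owned by the affected virtual nodes, so a rebalance touches roughly 1/N of repositories rather than nearly all of them.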
Storage Layout and Organization:
Within each shard, repositories are organized in a hierarchical directory structure. Each repository has its own directory containing the standard git layout: an objects directory with pack files and loose objects, a refs directory with branches and tags, and an info directory with cached reference lists.
Pack files are the primary storage mechanism, containing compressed git objects with delta compression. We maintain multiple pack files per repository rather than a single monolithic file. This allows incremental updates and parallel access.
Hot, Warm, and Cold Storage Tiers:
To optimize costs and performance, repositories are automatically tiered based on access patterns:
Hot tier repositories (accessed within the last 30 days) are stored on high-performance SSDs in the primary data centers. This tier handles about 20% of repositories but 80% of traffic.
Warm tier repositories (accessed within 180 days) are stored on standard HDDs with lower performance but higher capacity. Access latency is acceptable for occasional operations.
Cold tier repositories (rarely accessed archives) are stored in object storage like S3. When accessed, they’re fetched on-demand with 5-10 second latency and temporarily promoted to warm tier.
Replication and Availability:
Every repository is replicated three times across different availability zones within a region. Replicas are placed with rack awareness to tolerate rack failures. Write operations require acknowledgment from at least two replicas (quorum).
For critical repositories (popular open source projects), we maintain cross-region replicas to ensure global availability and reduce clone latency for international users.
Pack File Optimization:
Git’s pack file format uses delta compression where objects are stored as deltas against base objects. We optimize this by:
- Running automatic repacking when loose objects exceed 10,000 to maintain read performance.
- Using aggressive compression settings for cold tier repositories since they're read infrequently.
- Limiting pack file size to 2GB for efficient streaming to clients.
- Generating bitmap indexes for fast reachability queries during clone operations.
Caching Strategy:
Multiple caching layers reduce load on the distributed file system:
- Redis caches frequently accessed git objects (commits, trees, small blobs) with LRU eviction.
- CDN caches pack file chunks for popular repositories, serving clone requests directly from edge locations.
- Client-side caching uses HTTP ETags and conditional requests to avoid re-downloading unchanged data.
Deep Dive 2: How do we prevent lost commits and maintain strong consistency for git operations?
Git operations must be strongly consistent - a push operation that succeeds must guarantee the commit is durably stored and visible to all subsequent reads. This is challenging in a distributed system with replication.
Problem Statement:
With distributed storage and caching, we need to ensure that a successful push guarantees durability, prevent concurrent pushes from creating divergent branch histories, handle partial failures during multi-step operations, and maintain consistency across replicas.
Solution: Distributed Locking and Quorum Writes
For push operations, we implement distributed locking at the branch level to serialize conflicting updates:
When a developer pushes to a branch, the git proxy acquires a distributed lock for that specific branch reference (e.g., refs/heads/main) using Redis with the Redlock algorithm. This prevents concurrent pushes to the same branch from proceeding simultaneously.
The lock is acquired with a reasonable timeout (e.g., 30 seconds) that’s longer than typical push duration but prevents indefinite lock holding on failures. If lock acquisition fails, the push is rejected with a retry message.
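The acquire/release semantics can be sketched with an in-memory stand-in for Redis's SET NX PX pattern (a single-node simplification; Redlock coordinates the same idea across several Redis nodes):

```python
import time
import uuid

class BranchLockManager:
    """In-memory stand-in for per-branch distributed locks.
    Production would use Redis SET NX PX (or Redlock across nodes)."""

    def __init__(self):
        self._locks = {}  # ref -> (holder_token, expires_at)

    def acquire(self, ref, ttl=30.0):
        now = time.monotonic()
        held = self._locks.get(ref)
        if held and held[1] > now:
            return None  # another push holds the lock: reject with retry
        token = uuid.uuid4().hex  # unique token identifies the holder
        self._locks[ref] = (token, now + ttl)
        return token

    def release(self, ref, token):
        held = self._locks.get(ref)
        if held and held[0] == token:  # only the holder may release
            del self._locks[ref]
            return True
        return False
```

The TTL doubles as the failure safety net: a crashed proxy simply lets its lock expire, and the token check prevents a slow client from releasing a lock it no longer owns.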
Atomic Reference Updates:
Git references (branches and tags) must be updated atomically. We use compare-and-swap operations to ensure the reference update only succeeds if the previous value matches expectations. This prevents lost updates when multiple operations race.
The push operation verifies that the current branch SHA matches what the client expects as the parent. If another push already updated the branch, this verification fails and the operation is rejected.
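The compare-and-swap rule, sketched in-process (the real check-and-update must itself execute atomically on the metadata store):

```python
class RefStore:
    """Branch refs updated via compare-and-swap: the update succeeds
    only if the ref still points at the SHA the pusher last saw."""

    def __init__(self):
        self._refs = {}  # e.g. "refs/heads/main" -> commit SHA

    def compare_and_swap(self, ref, expected_old, new_sha):
        if self._refs.get(ref) != expected_old:
            return False  # another push won the race: reject this one
        self._refs[ref] = new_sha
        return True
```

A rejected push surfaces to the developer as the familiar "fetch first" error, prompting a pull and retry.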
Quorum Writes for Durability:
When new objects are written to the distributed file system, we require acknowledgment from a quorum of replicas (typically 2 out of 3) before considering the write successful. This ensures durability even if one replica immediately fails.
The write coordinator sends objects to all three replicas in parallel and waits for quorum responses. If quorum can’t be achieved within a timeout, the entire push is rolled back and the client receives an error.
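The quorum rule itself is simple; a sketch with replica writers modeled as callables (invoked sequentially here, whereas the real coordinator fans out in parallel under a timeout):

```python
def quorum_write(replica_writers, git_object, quorum=2):
    """Write `git_object` to every replica; succeed only if at least
    `quorum` acknowledge. Each writer returns True on acknowledgment."""
    acks = sum(1 for write in replica_writers if write(git_object))
    return acks >= quorum
```

With 3 replicas and a write quorum of 2, a single failed or slow replica does not block the push, yet the object survives any one replica loss.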
Write-Ahead Logging:
Critical metadata updates (like reference changes) are first written to a durable write-ahead log before being applied to the repository. This allows recovery from partial failures and ensures we can reconstruct the exact sequence of operations.
Consistency for Clone and Fetch:
Read operations (clone, fetch) have slightly relaxed consistency requirements. We can serve reads from any replica, accepting that it might be slightly behind due to replication lag. However, we track replication lag and fail over to the master if lag exceeds acceptable thresholds (e.g., 5 seconds).
For critical operations like merge validation, we enforce read-your-writes consistency by directing reads to the master or using session stickiness to ensure the same replica serves related operations.
Deep Dive 3: How do we generate pull request diffs efficiently and detect merge conflicts?
Generating code diffs quickly is essential for pull request performance. Large pull requests with thousands of changed lines across hundreds of files must render in under 2 seconds.
Problem Statement:
Computing diffs requires comparing git trees, fetching blob objects, running diff algorithms, and rendering syntax-highlighted HTML. Doing this on-demand for every PR view would be prohibitively expensive.
Solution: Asynchronous Diff Generation with Caching
When a pull request is created or updated, we immediately enqueue an asynchronous job to generate the diff rather than computing it synchronously in the request path.
Diff Computation Pipeline:
The first step is identifying the commit range. We find the merge base (common ancestor) between the base and head branches using git’s lowest common ancestor algorithm. This determines which commits are unique to the feature branch.
Next, we perform a tree diff by comparing the git tree objects at the merge base and head commits. This efficiently identifies added, modified, and deleted files without examining file contents. We filter out binary files and very large files (over 1MB) that shouldn’t be rendered inline.
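With trees flattened to path-to-blob-SHA maps, this classification is a set comparison (a sketch; real tree diffing walks nested tree objects and skips unchanged subtrees whose SHAs match):

```python
def tree_diff(base_tree, head_tree):
    """Classify files as added/modified/deleted by comparing two
    flattened git trees ({path: blob_sha}) without reading contents."""
    added = sorted(p for p in head_tree if p not in base_tree)
    deleted = sorted(p for p in base_tree if p not in head_tree)
    modified = sorted(p for p in head_tree
                      if p in base_tree and head_tree[p] != base_tree[p])
    return {"added": added, "modified": modified, "deleted": deleted}
```

Because blob SHAs are content hashes, equal SHAs mean identical contents, so unchanged files are excluded without ever fetching their blobs.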
For each modified text file, we fetch the blob objects from git storage and run a diff algorithm. Git defaults to the Myers algorithm, which runs in O(ND) time, where N is the combined length of the two files and D is the edit distance. The histogram algorithm is an alternative that often produces more readable diffs on real-world code.
The raw diff output is enriched with metadata: syntax highlighting based on file extension, split into hunks (contiguous change regions), line numbers added, and change indicators (+ for additions, - for deletions).
Intelligent Caching:
The generated diff HTML is cached in Redis with a cache key based on the hash of base SHA, head SHA, and file path. This means if someone force-pushes the same commits, we reuse the cached diff. The cache has a one-hour TTL for memory efficiency.
Raw diff data is also stored in S3 for historical access, allowing us to reconstruct old PR views without recomputing.
Incremental Diff Updates:
When new commits are pushed to a pull request branch, we only need to recompute diffs for files that changed in the new commits. We use git diff between the previous head and new head to identify affected files and selectively invalidate cache entries.
Merge Conflict Detection:
Detecting whether a pull request is mergeable requires simulating the merge:
We perform a three-way merge simulation in memory using the merge base, base branch tip, and head branch tip. For each file modified in both branches, we apply git’s three-way merge algorithm which attempts to automatically resolve non-conflicting changes.
If the merge completes without conflicts, we mark the PR as having “clean” mergeable state. If conflicts are detected, the state becomes “dirty”. If we can’t determine the state (e.g., base branch is still being updated), it’s marked “unknown”.
This computation is expensive, so we cache the mergeable state for five minutes and recompute it only when relevant commits are pushed or the base branch updates.
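At file granularity, the decision reduces to a three-way comparison (a simplification: git merges hunk-by-hunk within files, so two edits to the same file often still merge cleanly):

```python
def merge_file(base, ours, theirs):
    """Three-way merge decision for one file's content.
    Returns (merged_content, conflict)."""
    if ours == theirs:   # both sides agree (or neither changed)
        return ours, False
    if ours == base:     # only their side changed: take theirs
        return theirs, False
    if theirs == base:   # only our side changed: take ours
        return ours, False
    return None, True    # both sides changed differently: conflict

def mergeable_state(files):
    """files: list of (base, ours, theirs) content tuples, one per file
    modified in both branches. Maps to the PR's clean/dirty state."""
    conflicts = any(merge_file(b, o, t)[1] for b, o, t in files)
    return "dirty" if conflicts else "clean"
```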
Optimizations for Large PRs:
For pull requests with thousands of files, we implement progressive rendering:
- Initially load only the first 10-20 files with full diffs.
- Render remaining files as collapsed placeholders with file paths and change statistics.
- Lazy-load additional files on-demand when users scroll or click to expand.
- Set hard limits of 1,000 files and 10,000 lines per file for diff rendering.
Deep Dive 4: How do we implement code search across petabytes of source code?
Searching across millions of repositories containing petabytes of code requires sophisticated indexing and query optimization. Traditional full-text search isn’t optimized for code syntax and semantics.
Problem Statement:
Code search has unique requirements: exact substring matching for API names, regular expression support for patterns, language-aware tokenization, symbol search (find all usages of a function), ranking by repository popularity and recency, and access control filtering.
Solution: Elasticsearch with Custom Code Analyzers
We use Elasticsearch as the primary search engine, with custom analyzers tuned for code:
Index Schema Design:
Each indexed file creates a document in Elasticsearch containing:
- Repository metadata (repo ID, name, owner, visibility, stars)
- File metadata (path, name, extension, language, size)
- File content as searchable text
- Extracted symbols (function and class names with types)
- Commit SHA and last modified timestamp
- A boolean indicating public or private access
Custom Code Tokenizer:
Standard tokenizers split on whitespace and punctuation, which doesn’t work well for code. Our custom tokenizer:
- Preserves camelCase and snake_case identifiers as single tokens while also indexing the individual words (e.g., "getUserName" indexes as both "getUserName" and ["get", "User", "Name"]).
- Preserves dots in package names (e.g., "com.github.api").
- Handles special characters common in code (underscores, dollar signs).
- Keeps original tokens alongside normalized versions for exact matching.
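A sketch of the identifier-splitting behavior (the regex is illustrative; inside Elasticsearch this would be a custom analyzer built from word-delimiter and pattern filters):

```python
import re

def code_tokens(identifier):
    """Emit the original identifier plus its camelCase/snake_case parts,
    so both exact and per-word queries hit the index."""
    words = []
    for part in identifier.split("_"):
        # lowercase runs, capitalized words, acronyms, and digit runs
        words += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
    return [identifier] + words
```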
Search Query Processing:
When a user submits a search query, the Search Service parses it and constructs an Elasticsearch query with multiple strategies:
- Multi-field matching searches across content, file name, and file path with different boost factors: file names are boosted 3x, content 2x, and file paths 1.5x, prioritizing matches in file names.
- Boolean filters apply language, repository, and access control restrictions efficiently without affecting scoring.
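The query body might be assembled like this (a sketch of the Elasticsearch query DSL; the field names are illustrative, not GitHub's actual schema):

```python
def build_search_query(q, language=None, repo=None, accessible_repo_ids=None):
    """Build an Elasticsearch query body: boosted multi-field matching
    in `must` (affects scoring) plus non-scoring boolean filters."""
    bool_query = {
        "must": [{
            "multi_match": {
                "query": q,
                # caret suffix = per-field boost factor
                "fields": ["file_name^3", "content^2", "file_path^1.5"],
            }
        }],
        "filter": [],
    }
    if language:
        bool_query["filter"].append({"term": {"language": language}})
    if repo:
        bool_query["filter"].append({"term": {"repo_name": repo}})
    if accessible_repo_ids is not None:  # access control as a filter
        bool_query["filter"].append({"terms": {"repo_id": accessible_repo_ids}})
    return {"query": {"bool": bool_query}}
```

Putting restrictions in `filter` rather than `must` keeps them out of relevance scoring and lets Elasticsearch cache them across queries.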
Ranking and Relevance:
Search results are ranked using multiple signals:
- Text relevance score from Elasticsearch's BM25 algorithm measures how well the document matches the query.
- Repository popularity boost multiplies the score by a factor based on stars and forks.
- Recency boost gives higher scores to recently modified files.
- Exact match boost significantly increases the score for exact string matches versus partial matches.
We also train a machine learning model on historical search click data to learn which results users find most relevant, continually improving ranking quality.
Access Control Filtering:
Before returning results, we filter based on repository visibility and user permissions:
- Public repositories are accessible to everyone.
- Private repositories require checking whether the user is a collaborator, organization member, or team member with access.
- Permission checks are cached in Redis to avoid database queries for every search result.
Alternative: Zoekt for Regex Search
For regex-heavy searches, Elasticsearch can be slow. GitHub likely also uses Zoekt, a fast trigram-based code search engine:
- Zoekt builds trigram indexes where every three-character substring is indexed, making substring and regex searches extremely fast.
- Index files are memory-mapped for direct access without deserialization overhead.
- Zoekt is particularly good at exact substring matching and simple regex patterns.
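The trigram filtering idea in miniature (Zoekt's actual index, posting lists, and verification machinery are far more elaborate):

```python
def trigrams(text):
    """All three-character substrings: the unit a trigram index stores."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def may_contain(query, doc_trigrams):
    """A document can only contain `query` as a substring if it contains
    every trigram of the query. Surviving candidates are then verified
    against the actual file content."""
    return trigrams(query) <= doc_trigrams
```

The index cheaply eliminates the vast majority of files; only the small candidate set is scanned exactly, which is why substring and simple regex queries stay fast.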
Incremental Indexing:
When commits are pushed, we publish events to Kafka. Indexing workers consume these events and:
- Clone or fetch the updated repository.
- Use git diff to identify changed files since the last indexed commit.
- Delete old versions of modified files from Elasticsearch.
- Index new and modified files with their latest content.
- Update repository-level metadata such as commit count and last updated timestamp.
This incremental approach is far more efficient than re-indexing entire repositories on every push.
Index Sharding:
Elasticsearch indexes are sharded across multiple nodes for parallelism. We use hash-based sharding on repository ID to distribute load evenly. Each shard is replicated twice for availability. Queries are scattered to all shards and results are gathered and ranked globally before returning to users.
Deep Dive 5: How do we orchestrate GitHub Actions workflows with high reliability?
GitHub Actions must execute millions of workflows daily with high reliability. Workflows can have complex dependencies, long-running jobs, and require coordination across distributed runner pools.
Problem Statement:
Workflows are triggered by events (push, PR, schedule) and must be parsed, validated, queued, and executed reliably. We need to handle job dependencies and parallelization, scale runner pools dynamically, stream logs in real-time, handle timeouts and retries, and ensure no workflow runs are lost even if services crash.
Solution: Event-Driven Architecture with Durable Orchestration
The workflow execution pipeline is built as an event-driven system with durable state management:
Trigger Detection and Parsing:
When a triggering event occurs (e.g., git push), the webhook or event handler publishes to Kafka. The Actions Service consumes these events and checks if the repository contains workflow YAML files in the .github/workflows directory.
It fetches the workflow files at the commit SHA that triggered the event and parses the YAML definitions. Parsing validates syntax, checks that the event type matches the workflow triggers, and verifies permissions.
Job Dependency Resolution:
Workflows can define job dependencies using the needs keyword. The orchestrator builds a directed acyclic graph (DAG) of jobs where edges represent dependencies.
Jobs without dependencies (the DAG's source nodes) can start immediately and are queued in parallel. Jobs with dependencies wait for their prerequisites to complete successfully. The DAG is stored in the workflow run record for tracking execution progress.
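The scheduling rule over the DAG can be sketched as follows (`needs` maps each job to its prerequisites; the job names are illustrative):

```python
def runnable_jobs(needs, completed, running=frozenset()):
    """Return jobs whose prerequisites have all completed and that are
    not themselves already running or done. Called each time a job
    finishes to release newly unblocked jobs into the queue."""
    return sorted(
        job for job, prereqs in needs.items()
        if job not in completed and job not in running
        and all(p in completed for p in prereqs)
    )
```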
Job Queueing:
Each job is published to a Kafka topic partitioned by runner requirements (OS, labels), priority level, and organization ID. This partitioning ensures fair scheduling and prevents any single organization from monopolizing runners.
Jobs contain all necessary metadata: repository information, commit SHA, workflow file path, job definition (steps, environment variables, secrets), timeout settings, and retry configuration.
Runner Assignment:
Runner pools are pre-provisioned with thousands of VMs or containers:
GitHub-hosted runners are ephemeral - each job gets a fresh VM that’s destroyed after execution. This ensures isolation but requires fast provisioning (under 30 seconds). Auto-scaling mechanisms monitor queue depth and provision additional runners during peak demand.
Self-hosted runners are customer-managed and persistent. They poll the job queue filtering by their configured labels. When a matching job is available, they claim it and begin execution.
Job Execution:
Once a runner claims a job, it:
- Clones the repository at the exact commit SHA that triggered the workflow.
- Sets up the execution environment with required tools and dependencies.
- Injects secrets as environment variables (masked in logs).
- Executes each step sequentially, capturing stdout and stderr.
- Streams logs in real-time to the Log Service via a WebSocket connection.
- Uploads artifacts to S3 if any steps produce build outputs.
- Updates job status in the database as steps complete.
Real-Time Log Streaming:
Logs are crucial for debugging workflow failures. Runners stream logs line-by-line over WebSocket connections to a Log Aggregator service. The aggregator:
- Writes logs to durable storage (S3) for persistence.
- Broadcasts to connected WebSocket clients (developers watching the workflow in the UI).
- Handles ANSI color codes and formats logs for rendering.
- Indexes logs for search functionality.
If the WebSocket connection drops, the runner buffers logs and resends when reconnected, ensuring no log lines are lost.
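The buffer-and-resend behavior can be modeled with sequence numbers: lines stay buffered until acknowledged, and an entire unacknowledged tail is replayed after a reconnect. The transport itself is abstracted away here; this sketch only shows the at-least-once delivery logic.

```python
class LogStreamer:
    """Buffers log lines with sequence numbers; on connection failure
    the buffer is kept and replayed on the next flush. Duplicates are
    possible (at-least-once), so the receiver deduplicates by seq."""
    def __init__(self, send):
        self.send = send       # callable(seq, line); raises ConnectionError if down
        self.buffer = []       # (seq, line) pairs not yet acknowledged
        self.next_seq = 0

    def write(self, line: str):
        self.buffer.append((self.next_seq, line))
        self.next_seq += 1
        self.flush()

    def flush(self):
        try:
            for seq, line in self.buffer:
                self.send(seq, line)
        except ConnectionError:
            return             # keep buffer; resend after reconnect

    def ack(self, seq: int):
        """Aggregator confirms everything up to and including seq."""
        self.buffer = [(s, l) for s, l in self.buffer if s > seq]
```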
Artifact Management:
Workflow steps can upload build artifacts and caches:
Artifacts are compressed with gzip before uploading to S3 using multipart upload for large files. Each artifact is associated with the workflow run ID for retrieval. Artifacts have a 90-day retention policy by default. Download URLs are signed to enforce access control. Caching allows subsequent workflow runs to restore dependencies quickly (e.g., node_modules, build caches).
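A minimal sketch of the compress-then-sign flow. The HMAC-based URL signing below is a hand-rolled illustration; a real deployment would use S3 presigned URLs, and all names here are hypothetical.

```python
import gzip
import hashlib
import hmac
import time

def compress_artifact(data: bytes) -> bytes:
    """Artifacts are gzip-compressed before upload to cut storage and
    transfer costs."""
    return gzip.compress(data)

def signed_download_url(bucket: str, key: str, secret: bytes, ttl: int = 3600) -> str:
    """Illustrative signed URL: the signature covers the object path and
    an expiry timestamp, so the link can't be tampered with or reused
    after it expires."""
    expires = int(time.time()) + ttl
    payload = f"{bucket}/{key}:{expires}"
    sig = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"https://{bucket}.example.com/{key}?expires={expires}&sig={sig}"
```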
Timeout and Retry Handling:
Each job has a configurable timeout (default 6 hours). If a job exceeds the timeout, the runner is forcefully terminated and the job is marked as failed. Workflows can configure automatic retries for transient failures, with exponential backoff between attempts.
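The retry policy described above can be sketched as a small wrapper; `TransientError` and the parameter defaults are assumptions for illustration (real policies usually add jitter):

```python
import time

class TransientError(Exception):
    """Stand-in for a retryable failure (network blip, runner loss)."""

def run_with_retries(step, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry transient failures with exponential backoff: the delay
    before attempt n+1 is base_delay * 2**n."""
    for attempt in range(max_attempts):
        try:
            return step()
        except TransientError:
            if attempt == max_attempts - 1:
                raise          # retries exhausted; mark job failed
            sleep(base_delay * 2 ** attempt)
```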
Durable Orchestration:
The critical challenge is ensuring workflow runs survive service crashes. We use a workflow orchestration framework like Temporal:
Each workflow run is modeled as a durable workflow execution. The workflow state machine tracks which jobs have completed and which are pending. If the orchestrator service crashes, Temporal’s durable execution model ensures the workflow resumes from its last checkpoint. Timeouts, retries, and compensation logic are declaratively defined in the workflow code.
This approach gives effectively-once semantics for orchestration: even in the face of failures, each job is scheduled exactly once, though a job's side effects may still execute more than once if it fails mid-run and is retried.
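The checkpoint-and-resume behavior that a framework like Temporal provides can be illustrated with a toy model (this is not Temporal's API, just the underlying idea): completed job results are persisted, so a restarted orchestrator replays past the work already done instead of rerunning it.

```python
class DurableWorkflow:
    """Toy durable-execution model: job results are checkpointed to a
    durable store as they complete; on restart, checkpointed jobs are
    skipped and their recorded results returned."""
    def __init__(self, store: dict):
        self.store = store          # stands in for durable storage

    def run_job(self, job_id: str, fn):
        if job_id in self.store:    # already checkpointed: don't rerun
            return self.store[job_id]
        result = fn()
        self.store[job_id] = result # checkpoint before moving on
        return result
```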
Concurrency Control:
Workflows can define concurrency groups to prevent multiple runs from executing simultaneously:
When a workflow defines a concurrency group (e.g., keyed by branch name), the orchestrator checks if another run with the same group is active. If so, the new run either waits or cancels the previous run (based on cancel-in-progress setting). This prevents wasted compute on outdated runs when rapid commits are pushed.
Implementation uses distributed locks in Redis with the concurrency group as the lock key.
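The concurrency-group logic can be sketched with an in-memory stand-in for the Redis lock (the real system would use `SET key value NX PX ttl`); the method names and return shapes are hypothetical:

```python
import time

class ConcurrencyGroups:
    """One active run per concurrency group. Expiry handles crashed
    runs, mirroring a Redis TTL on the lock key."""
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.active = {}   # group -> (run_id, expires_at)

    def try_start(self, group, run_id, ttl=3600.0, cancel_in_progress=False):
        """Returns (started, cancelled_run_id)."""
        now = self.clock()
        holder = self.active.get(group)
        cancelled = None
        if holder and holder[1] > now:
            if not cancel_in_progress:
                return False, None     # queue behind the active run
            cancelled = holder[0]      # supersede the outdated run
        self.active[group] = (run_id, now + ttl)
        return True, cancelled
```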
Deep Dive 6: How do we manage access control across millions of repositories with complex permission hierarchies?
GitHub’s permission model supports individual users, organizations, teams, and fine-grained repository permissions. Checking access for every API request must be fast while remaining accurate.
Problem Statement:
Permission checks occur on nearly every API request and git operation. The permission model has multiple levels: repository visibility (public/private/internal), repository-level permissions (admin, maintain, write, triage, read), organization ownership, team-based permissions, and nested teams with inheritance. We need to check permissions in milliseconds while handling frequent permission changes.
Solution: Hierarchical Permission Model with Aggressive Caching
Permission Hierarchy:
At the top level, repository owners have full admin access. Organizations can own repositories, with organization owners having admin access to all org repositories. Teams within organizations can be granted specific permissions to repositories. Teams can be nested, with child teams inheriting parent permissions. Individual users can be direct collaborators on repositories with specific permission levels.
Permission Check Algorithm:
When a user attempts to access a repository, we evaluate permissions in a specific order:
First, check if the user is the repository owner - if so, grant full access immediately. Second, check if the repository is public and the required permission is read - grant access to all authenticated users. Third, if the repository belongs to an organization, check if the user is an organization owner - grant full access. Fourth, query the user’s team memberships within the organization and check if any team has the required repository permission. Fifth, check for direct collaborator access where the user is explicitly granted permissions. If none of these checks pass, deny access.
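The evaluation order above can be expressed directly in code. All of the data shapes (lookup dicts, role names) are assumptions chosen to keep the sketch self-contained:

```python
ROLE_RANK = {"read": 0, "triage": 1, "write": 2, "maintain": 3, "admin": 4}

def check_permission(user, repo, needed, org_owners, user_team_roles, collaborators):
    """Evaluate the permission hierarchy in order.
    org_owners:      {user: set(org_names)}
    user_team_roles: {(user, repo_id): [role, ...]} from team membership
    collaborators:   {(user, repo_id): role} for direct grants
    """
    def grants(role):
        return ROLE_RANK[role] >= ROLE_RANK[needed]

    if repo["owner"] == user:                                  # 1. repo owner
        return True
    if repo["visibility"] == "public" and needed == "read":    # 2. public read
        return True
    if repo.get("org") in org_owners.get(user, set()):         # 3. org owner
        return True
    for role in user_team_roles.get((user, repo["id"]), []):   # 4. team grants
        if grants(role):
            return True
    role = collaborators.get((user, repo["id"]))               # 5. direct collaborator
    return role is not None and grants(role)
```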
Permission Caching:
Evaluating the full permission hierarchy on every request would be too slow. We implement aggressive caching:
Cache keys are structured as “perm:userId:repoId:permission” and store a boolean result plus metadata about why access was granted. TTL is set to 5 minutes, balancing freshness with performance. When permissions are modified (team changes, collaborator added, repository transferred), we invalidate relevant cache entries.
We use Redis for the cache with a write-through strategy - permission changes first update the database, then invalidate the cache. This ensures consistency even if cache invalidation lags slightly.
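The cache layer can be sketched with an in-memory dict standing in for Redis; the key format matches the one described above, and the invalidation helper shows why the key structure matters (it lets us target all entries for one repository):

```python
import time

class PermissionCache:
    """TTL cache for permission results, keyed as
    perm:{userId}:{repoId}:{permission}. A miss returns None so the
    caller knows to evaluate the full hierarchy."""
    TTL = 300  # 5 minutes

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.entries = {}  # key -> (allowed, expires_at)

    def get(self, user_id, repo_id, perm):
        hit = self.entries.get(f"perm:{user_id}:{repo_id}:{perm}")
        if hit and hit[1] > self.clock():
            return hit[0]
        return None        # miss or expired

    def put(self, user_id, repo_id, perm, allowed: bool):
        key = f"perm:{user_id}:{repo_id}:{perm}"
        self.entries[key] = (allowed, self.clock() + self.TTL)

    def invalidate_repo(self, repo_id):
        """On team changes, transfers, etc., drop every entry for the repo."""
        self.entries = {k: v for k, v in self.entries.items()
                        if k.split(":")[2] != str(repo_id)}
```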
Branch Protection Rules:
Protected branches add another layer of access control. Even users with write access can’t push directly to protected branches. The rules specify:
- Required status checks that must pass before merging.
- A required number of approving reviews before merging.
- Dismissal of stale reviews when new commits are pushed.
- Enforcement even for administrators.
- Restrictions on who can push (specific users or teams).
When a push or merge is attempted, the system evaluates all branch protection rules. This evaluation happens synchronously and can’t be cached since it depends on current PR state (reviews, status checks).
CODEOWNERS File:
Repositories can contain a CODEOWNERS file that maps file paths to required reviewers:
When a pull request is created, the system parses the CODEOWNERS file and matches changed file paths against the patterns. Matched owners are automatically added as required reviewers. Branch protection rules can enforce that code owner approval is required before merging.
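A simplified matcher for this flow, approximating CODEOWNERS glob semantics with `fnmatch` (real CODEOWNERS patterns are richer, and as in the real format, the last matching rule wins for each path):

```python
import fnmatch

def required_reviewers(codeowners: str, changed_paths: list[str]) -> set[str]:
    """Parse a CODEOWNERS-style file and return the owners of any
    changed path. Comments and blank lines are skipped."""
    rules = []
    for line in codeowners.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, *owners = line.split()
        rules.append((pattern, owners))

    reviewers = set()
    for path in changed_paths:
        for pattern, owners in reversed(rules):   # last matching rule wins
            if fnmatch.fnmatch(path, pattern.lstrip("/")):
                reviewers.update(owners)
                break
    return reviewers
```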
Access Tokens and OAuth Scopes:
Many API requests use personal access tokens or OAuth apps rather than user sessions. Each token has associated scopes that limit permissions:
Tokens can have read-only or read-write scopes for repositories. Tokens can be scoped to specific repositories or organizations. Fine-grained tokens can have per-resource permissions. The permission check system evaluates token scopes in addition to user permissions, using the most restrictive combination.
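The "most restrictive combination" rule is just a minimum over an ordered scale. The role names and ordering below are an assumption for illustration:

```python
# Ordered from least to most privileged (illustrative scale)
ROLE_RANK = {"none": 0, "read": 1, "write": 2, "admin": 3}

def effective_permission(user_role: str, token_scope: str) -> str:
    """A token can never grant more than the user holds, and the user
    check can never exceed the token's scope: take the minimum."""
    rank = min(ROLE_RANK[user_role], ROLE_RANK[token_scope])
    return next(r for r, v in ROLE_RANK.items() if v == rank)
```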
Audit Logging:
All access control decisions and permission changes are logged to an audit trail. This helps with security investigations and compliance. Logs include: who accessed what resource, what permission was checked, whether access was granted or denied, timestamp and request context.
Deep Dive 7: How do we handle merge operations safely with concurrent pull requests?
Multiple pull requests might target the same base branch, and merging them concurrently could create conflicts or inconsistent states. We need to serialize merge operations while maintaining low latency.
Problem Statement:
When a developer clicks the merge button, several operations must happen atomically: validate PR state (reviews, status checks), perform the actual git merge, push the merge commit to the target branch, and update PR status. Concurrent merges to the same branch must be serialized to prevent race conditions.
Solution: Distributed Locking with Optimistic Validation
Merge Lock Acquisition:
Before starting a merge operation, we acquire a distributed lock on the target branch using the Redis Redlock algorithm. The lock key is structured as “merge-lock:repoId:branchName”.
If lock acquisition fails (another merge is in progress), we return an error to the user indicating the branch is currently being updated. The lock has a timeout (e.g., 60 seconds) to handle service crashes - if the merge operation doesn’t complete and release the lock, it automatically expires.
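The lock behavior can be sketched with an in-memory stand-in for Redis (the real system would use Redlock across multiple Redis nodes). The random token is what prevents a slow merge from releasing a lock that has already expired and been re-acquired by another merge:

```python
import time
import uuid

class MergeLocks:
    """Per-branch merge lock with a TTL. Mirrors the semantics of
    SET merge-lock:{repoId}:{branch} <token> NX PX <ttl>."""
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.locks = {}  # key -> (token, expires_at)

    def acquire(self, repo_id, branch, ttl=60.0):
        key = f"merge-lock:{repo_id}:{branch}"
        held = self.locks.get(key)
        if held and held[1] > self.clock():
            return None                  # another merge is in progress
        token = uuid.uuid4().hex
        self.locks[key] = (token, self.clock() + ttl)
        return token

    def release(self, repo_id, branch, token):
        key = f"merge-lock:{repo_id}:{branch}"
        held = self.locks.get(key)
        if held and held[0] == token:    # only the current holder may release
            del self.locks[key]
            return True
        return False
```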
Pre-Merge Validation:
Once we hold the lock, we re-validate the pull request state to guard against time-of-check to time-of-use (TOCTOU) races:
- Verify all required reviews are still valid and not dismissed.
- Verify all required status checks still pass.
- Verify the base branch hasn’t been updated since the mergeable state was computed.
- Verify the PR is still open (not closed or already merged).
If any validation fails, we release the lock and return an error. This prevents merging PRs that no longer meet requirements.
Merge Execution:
The actual merge operation depends on the configured merge strategy:
For merge commits, we create a merge commit with two parents (base branch tip and PR head). For squash and merge, we squash all PR commits into a single commit and apply it to the base branch. For rebase and merge, we rebase the PR commits onto the base branch then fast-forward.
The merge is performed on the git storage layer and must complete atomically. We use git’s atomic reference updates to ensure the branch pointer updates only if the previous value matches expectations.
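The atomic reference update is a compare-and-swap on the branch pointer, the same check `git update-ref <ref> <new> <old>` performs. A minimal model:

```python
import threading

class RefStore:
    """Branch pointers move only if the current value matches the
    expected old SHA; otherwise the update is rejected and the caller
    must re-validate and retry."""
    def __init__(self):
        self.refs = {}
        self._lock = threading.Lock()

    def compare_and_swap(self, ref: str, old_sha, new_sha: str) -> bool:
        with self._lock:
            if self.refs.get(ref) != old_sha:
                return False   # branch moved under us; abort the merge
            self.refs[ref] = new_sha
            return True
```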
Post-Merge Updates:
After the merge commit is written, we:
- Update the PR status to “merged” in the database.
- Record the merge commit SHA and timestamp.
- Trigger webhooks and notifications.
- Trigger status checks and workflows configured to run on push.
- Update cache entries and invalidate affected data.
- Release the distributed lock.
These steps happen asynchronously where possible to minimize lock hold time.
Merge Conflict Handling:
If a merge conflict occurs during the actual merge attempt (even though pre-merge checks passed), we:
- Roll back any partial changes.
- Release the lock.
- Update the PR mergeable state to “dirty”.
- Notify the PR author that they need to resolve conflicts.
This can happen if the base branch was updated between mergeable state computation and merge execution.
Step 4: Wrap Up
In this design, we proposed a comprehensive system architecture for GitHub, covering version control hosting, collaboration features, code search, and continuous integration. If there is extra time at the end of the interview, here are additional points to discuss:
Additional Features:
Repository forking and syncing with upstream repositories. Issue tracking with labels, milestones, and project boards. Notifications across multiple channels (web, email, mobile) with batching to reduce spam. Webhooks for integrating with external services and automation. GitHub Packages for hosting software packages alongside source code. Security features including Dependabot for automated dependency updates and secret scanning in commits.
Scaling Considerations:
Horizontal scaling for all stateless services behind load balancers. Database sharding by repository ID for locality and by user ID for user data. Read replicas for scaling read-heavy workloads like repository browsing. Cross-region replication for globally distributed repositories. CDN usage for static assets, repository clones, and artifact downloads. Message queue partitioning in Kafka for parallel processing of events.
Error Handling:
Circuit breakers to prevent cascading failures when dependencies are degraded. Graceful degradation where non-critical features (like notifications, analytics) can be disabled during incidents. Retry logic with exponential backoff for transient failures. Fallback mechanisms for dependent services (for example, a degraded search backend or a backup notification channel). Health checks and automatic failover for databases and caches.
Security Considerations:
Encryption at rest for repository data and user information. TLS encryption in transit for all API and git protocol communications. Authentication via SSH keys, personal access tokens, and OAuth apps with scope limiting. Rate limiting per user and per IP to prevent abuse. Input validation and sanitization to prevent injection attacks. Signed commits and tag verification for supply chain security. Vulnerability scanning in pull requests for security issues.
Monitoring and Analytics:
Key performance indicators: git operation latency percentiles, pull request diff generation time, code search query latency, Actions job queue depth and wait times, database query performance, cache hit rates for Redis and CDN.
Distributed tracing to track requests across microservices. Log aggregation with structured logging for debugging. Alerting on anomalies: error rate spikes, latency degradation, queue backups, storage capacity thresholds.
Analytics dashboards for repository activity, user engagement, and feature usage. A/B testing framework for ranking algorithms and UI changes.
Future Improvements:
AI-powered code review suggesting improvements and detecting bugs. Semantic code search using ML embeddings to find code by functionality rather than keywords. Copilot integration for AI-assisted coding within the platform. Advanced analytics and insights for repository health and productivity metrics. Real-time collaboration features like live cursor sharing and pair programming. Enhanced security scanning with supply chain analysis and provenance tracking.
Designing GitHub requires balancing consistency and availability in a distributed system while providing low-latency experiences for developers. The architecture leverages specialized data structures (git objects, inverted search indexes, distributed locks) and caching strategies to achieve massive scale. Strong consistency for critical operations (git pushes, merges) is balanced with eventual consistency for social features (stars, notifications) to optimize performance.
The key to success is understanding git’s data model, optimizing for code-specific use cases (diff generation, code search), and building robust orchestration for complex workflows like CI/CD. With these principles, GitHub can serve millions of developers collaborating on hundreds of millions of repositories.
Summary
This comprehensive guide covered the design of a distributed version control platform like GitHub, including:
- Core Functionality: Git operations (clone, push, pull), pull requests with code review, code search across repositories, and CI/CD workflow execution.
- Key Challenges: Storing billions of repositories efficiently, maintaining strong consistency for git operations, generating pull request diffs quickly, implementing code search at scale, and orchestrating distributed workflows reliably.
- Solutions: Custom distributed file system with sharding and tiering, distributed locking with quorum writes, asynchronous diff generation with caching, Elasticsearch with code-specific analyzers, event-driven workflow orchestration with durable execution, hierarchical permission model with aggressive caching.
- Scalability: Horizontal scaling of stateless services, database sharding by repository and user, read replicas for read-heavy workloads, CDN for static assets and artifacts, incremental indexing for code search.
The design demonstrates how to build a developer platform with millions of users, handling both high-throughput data operations (git pushes, CI/CD jobs) and low-latency interactive features (code browsing, search, collaboration).