Design Online Code Editor

Designing an online code editor like CodeSandbox or Repl.it requires building a sophisticated system that combines real-time collaboration, secure code execution, intelligent code assistance, and persistent file storage. This is essentially building VS Code in the browser, plus a sandboxed execution environment, plus real-time collaboration features.

At scale, we’d be looking at 10M+ monthly active users, 1M+ concurrent editing sessions, 100K+ code executions per minute, 100TB+ of user code storage, support for 20+ programming languages, and 99.9% uptime for a service that developers rely on for learning, prototyping, and production work.

Step 1: Understand the Problem and Establish Design Scope

Before diving into the design, it’s crucial to define the functional and non-functional requirements that will guide our architecture decisions.

Functional Requirements

Core Requirements:

  1. Users should be able to edit code in the browser with syntax highlighting and language support.
  2. Users should be able to execute code in sandboxed environments and see output.
  3. Multiple users should be able to edit the same file simultaneously with real-time synchronization.
  4. Users should be able to create, read, update, and delete files and folders in projects.

Below the Line (Out of Scope):

  • Users should be able to access an interactive terminal in the browser.
  • Users should be able to install dependencies automatically from package managers.
  • Users should be able to share and embed projects in external websites.
  • Users should be able to integrate with Git for version control.
  • Users should be able to receive code autocompletion and intelligent suggestions.

Non-Functional Requirements

Core Requirements:

  • The system should prioritize low latency for typing (< 50ms keystroke latency).
  • The system should ensure strong consistency for file operations to prevent lost edits.
  • The system should provide secure sandboxed code execution with no host system access.
  • The system should handle 100K+ code executions per minute with proper resource isolation.

Below the Line (Out of Scope):

  • The system should achieve 99.9% uptime with multi-region deployment.
  • The system should support collaboration sync latency < 200ms across global users.
  • The system should handle graceful degradation (read-only mode if execution unavailable).
  • The system should implement malware scanning and rate limiting for security.

Clarification Questions & Assumptions:

  • Platform: Web-based interface accessible from any modern browser.
  • Scale: Support for 1M+ concurrent editing sessions and 10M+ monthly active users.
  • Code Execution: 30 seconds default timeout, 5 minutes maximum for premium users.
  • Project Size: Support projects up to 500MB in size.
  • Language Support: JavaScript, TypeScript, Python, Java, Go, Rust, and other major languages.

Step 2: Propose High-Level Design and Get Buy-in

Planning the Approach

We’ll build the system sequentially, addressing each functional requirement step by step. This ensures a methodical approach that covers all core functionality before diving into optimizations.

Defining the Core Entities

To satisfy our key functional requirements, we’ll need the following entities:

User: Any authenticated user of the platform who can create and manage projects. Includes personal information, authentication credentials, and subscription tier (free, premium, enterprise).

Project: A collection of files and folders representing a coding project. Contains project metadata (name, description, owner, visibility settings), creation and modification timestamps, and total storage size.

File: Individual files within a project. Includes the file path, content hash (for deduplication), size, and last modified timestamp. Files are organized in a hierarchical folder structure.

Execution Session: Represents a code execution request. Contains the code to execute, language/runtime, execution status, output streams (stdout, stderr), exit code, resource usage metrics, and execution duration.

Collaboration Session: Manages real-time collaborative editing. Tracks active users in a session, their cursor positions, selections, and presence information. Coordinates document synchronization across multiple clients.

API Design

Create File Endpoint: Used to create a new file in a project.

POST /projects/:projectId/files -> File
Body: {
  path: string,
  content: string
}

Read File Endpoint: Retrieves the content of a specific file.

GET /projects/:projectId/files?path=string -> File

Update File Endpoint: Updates the content of an existing file.

PUT /projects/:projectId/files -> File
Body: {
  path: string,
  content: string
}

Execute Code Endpoint: Submits code for execution in a sandboxed environment.

POST /executions -> Execution
Body: {
  projectId: string,
  language: string,
  entryPoint: string,
  args: string[]
}

WebSocket Connection for Collaboration: Establishes a WebSocket connection for real-time collaborative editing.

WS /collaborate/:projectId/:fileId

The WebSocket connection handles bidirectional communication for document operations, cursor positions, and presence updates.

High-Level Architecture

Let’s build up the system sequentially, addressing each functional requirement:

1. Users should be able to edit code in the browser with syntax highlighting and language support

The core components necessary for code editing are:

  • Client Application: A rich browser-based interface built with a code editor framework like Monaco Editor (the same editor that powers VS Code). Handles syntax highlighting, theming, keyboard shortcuts, and basic editor features.
  • API Gateway: Entry point for all client requests, managing authentication, rate limiting, and routing to appropriate services.
  • Editor Service: Manages editor state, configuration, and file tree rendering. Serves the editor application assets from a CDN for fast load times.
  • File Storage Service: Handles all file operations (create, read, update, delete) and coordinates with object storage for persistence.
  • Database: Stores file metadata including paths, sizes, content hashes, and permissions.
  • Object Storage: Stores actual file content, typically using a service like S3 for durability and scalability.

Code Editing Flow:

  1. User opens a project in their browser, which loads the editor client application from the CDN.
  2. Client makes authenticated requests to the API Gateway to fetch the project file tree.
  3. File Storage Service retrieves metadata from the database and returns the file structure to the client.
  4. When user opens a file, the client requests file content through the API Gateway.
  5. File Storage Service retrieves the content from object storage (or cache if available) and returns it to the client.
  6. Monaco Editor renders the file with appropriate syntax highlighting based on the file extension.

2. Users should be able to execute code in sandboxed environments and see output

We introduce new components to support secure code execution:

  • Execution Service: Manages code execution requests, routes them to appropriate runtime environments, and enforces resource limits and timeouts.
  • Container Orchestrator: Uses Kubernetes or similar to manage a pool of sandboxed execution environments (Docker containers or Firecracker microVMs).
  • Execution Workers: Language-specific workers that run user code in isolated containers with strict resource limits and network policies.
  • Output Streaming: Uses WebSocket connections to stream execution output (stdout, stderr) to the client in real-time.

Code Execution Flow:

  1. User clicks the “Run” button, triggering a POST request to the Executions endpoint with project details.
  2. Execution Service receives the request, checks user quotas, and validates the execution parameters.
  3. Service selects an appropriate worker from the warm pool (pre-initialized containers for fast startup).
  4. User code and project files are copied into the isolated execution environment.
  5. Code runs with enforced resource limits: CPU (1 vCPU), memory (512MB), timeout (30s), and network restrictions.
  6. Output is captured and streamed back to the client via WebSocket in real-time.
  7. Upon completion, the execution environment is either cleaned and returned to the warm pool or terminated.
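The CPU, memory, and timeout limits in step 5 are enforced at the hypervisor level in this design, but the same idea can be sketched one layer down: a worker process applying OS-level resource limits before launching user code. The snippet below is a minimal, POSIX-only illustration using Python's `resource` module; `run_limited` and its defaults are hypothetical, not the actual Execution Service API.

```python
import resource
import subprocess
import sys

def run_limited(code: str, cpu_seconds: int = 2,
                mem_bytes: int = 512 * 1024**2,
                wall_timeout: int = 5) -> subprocess.CompletedProcess:
    """Run a Python snippet in a child process under OS-enforced rlimits.

    Illustrative only: rlimits are one isolation layer; the design above
    relies on microVM-level isolation, not just process limits.
    """
    def set_limits():
        # Applied in the child after fork, before exec
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))

    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=set_limits,
        capture_output=True, text=True,
        timeout=wall_timeout,  # wall-clock cutoff, analogous to the 30s default
    )
```

Note that `RLIMIT_CPU` caps CPU time, while `timeout` caps wall-clock time; a process that sleeps forever burns no CPU, so both limits are needed.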

3. Multiple users should be able to edit the same file simultaneously with real-time synchronization

We add collaboration-specific components:

  • Collaboration Service: Manages real-time collaborative editing sessions using Conflict-free Replicated Data Types (CRDTs) or Operational Transformation (OT) to merge concurrent edits.
  • WebSocket Gateway: Handles persistent WebSocket connections for real-time bidirectional communication between clients and the collaboration service.
  • CRDT State Manager: Maintains the document state using a CRDT library like Yjs or Automerge, ensuring all clients converge to the same state without conflicts.
  • Presence Service: Tracks which users are actively viewing or editing each file, including their cursor positions and selections.

Collaboration Flow:

  1. When a user opens a file, the client establishes a WebSocket connection to the Collaboration Service.
  2. Client downloads the initial document state and initializes a local CRDT document.
  3. As the user types, changes are applied locally immediately (for responsiveness) and sent to the Collaboration Service.
  4. Collaboration Service broadcasts the operation to all other connected clients editing the same file.
  5. Each client applies received operations to their local CRDT, which automatically handles merging without conflicts.
  6. Cursor positions and selections are shared separately through an “awareness” protocol that handles ephemeral data.
  7. The Collaboration Service periodically saves snapshots of the document state to object storage for persistence.

4. Users should be able to create, read, update, and delete files and folders in projects

We extend the File Storage Service to support full CRUD operations:

  • Database Schema: Uses PostgreSQL to store project and file metadata, including hierarchical folder structures with parent-child relationships.
  • Content Deduplication: Files are stored in object storage by content hash, allowing multiple files with identical content to share the same storage object.
  • Caching Layer: Redis is used to cache frequently accessed files (< 100KB) to reduce latency and object storage costs.

File Management Flow:

  1. Users interact with the file explorer in the client to create folders or files.
  2. Client sends requests to the File Storage Service through the API Gateway.
  3. For file creation, the service computes a SHA256 hash of the content and checks if it already exists in object storage.
  4. If the content is new, it’s uploaded to object storage with the hash as the key.
  5. A database record is created with the file path, content hash, size, and project association.
  6. For updates, the service creates a new object storage entry (files are immutable) and updates the database pointer.
  7. Hot files are cached in Redis with appropriate TTL for fast retrieval.
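The hash-based deduplication in steps 3 through 6 can be sketched with a small in-memory model. `ContentStore` and its dictionaries are illustrative stand-ins for the real object storage and metadata database:

```python
import hashlib

class ContentStore:
    """Toy content-addressable store: objects keyed by SHA-256 of content."""

    def __init__(self):
        self.objects = {}    # hash -> bytes (stand-in for object storage)
        self.metadata = {}   # (project_id, path) -> hash (stand-in for the DB row)

    def save_file(self, project_id: str, path: str, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in self.objects:
            self.objects[digest] = content          # upload only if content is new
        self.metadata[(project_id, path)] = digest  # update the DB pointer
        return digest

    def read_file(self, project_id: str, path: str) -> bytes:
        return self.objects[self.metadata[(project_id, path)]]
```

Two files with identical content produce the same digest and share a single stored object; an update simply repoints the metadata entry at a new hash.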

Step 3: Design Deep Dive

With the core functional requirements met, let’s dig into the critical challenges that make this system production-ready.

Deep Dive 1: How do we achieve conflict-free real-time collaboration?

Real-time collaboration is the most complex feature, requiring sophisticated algorithms to handle concurrent edits from multiple users without conflicts or lost data.

The Challenge:

Traditional approaches using simple diff-patch algorithms fail when multiple users edit the same location simultaneously. Locking mechanisms reduce the problem to single-user editing, defeating the purpose of collaboration.

Solution: Conflict-free Replicated Data Types (CRDTs)

CRDTs are specialized data structures that guarantee eventual consistency without requiring coordination. We’ll use the Yjs library, a mature CRDT implementation specifically designed for text editing.

How CRDTs Work:

Each character in the document has a unique identifier composed of the client ID and a logical clock value. When users insert or delete characters, these operations reference the character IDs rather than positions. This means concurrent operations can be applied in any order and all clients will converge to the same final state.

CRDT Document Structure:

The document is represented as a sequence of characters, where each character maintains its identity through a unique ID tuple consisting of the client identifier, a Lamport clock value, and an offset for fine-grained ordering. Deletions are handled with tombstones: characters are marked as deleted rather than removed, preserving the reference structure for concurrent operations.
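The ID-plus-tombstone idea can be illustrated with a drastically simplified sequence CRDT. This sketch is for intuition only; production libraries like Yjs use far more sophisticated structures and optimizations, and `TinySeqCRDT` is an invented name:

```python
class TinySeqCRDT:
    """Extremely simplified RGA-style sequence CRDT, for illustration only."""

    def __init__(self):
        # Each element: {"id": (client_id, clock), "ch": str, "deleted": bool}
        self.chars = []

    def _index_of(self, char_id):
        for i, c in enumerate(self.chars):
            if c["id"] == char_id:
                return i
        return -1

    def apply_insert(self, char_id, ch, after_id=None):
        if self._index_of(char_id) != -1:
            return  # idempotent: duplicate delivery is ignored
        i = self._index_of(after_id) + 1 if after_id else 0
        # Tie-break concurrent inserts at the same spot by ID, so every
        # replica orders them identically regardless of arrival order
        while i < len(self.chars) and self.chars[i]["id"] > char_id:
            i += 1
        self.chars.insert(i, {"id": char_id, "ch": ch, "deleted": False})

    def apply_delete(self, char_id):
        i = self._index_of(char_id)
        if i != -1:
            self.chars[i]["deleted"] = True  # tombstone, not removal

    def text(self) -> str:
        return "".join(c["ch"] for c in self.chars if not c["deleted"])
```

Because inserts reference stable IDs rather than indices, two replicas that receive the same operations in different orders still converge to the same text.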

Collaboration Architecture:

The Collaboration Server maintains active sessions mapped by project ID to connected client sockets. When a client performs an edit, it generates a CRDT operation, applies it locally, and sends it to the server. The server broadcasts this operation to all other clients in the session. Each client applies the operation to its local CRDT document, which automatically merges without conflicts.

Handling Ephemeral Data:

Cursor positions and selections don’t need the same durability guarantees as document content. We use a separate “awareness” protocol that shares this ephemeral state more efficiently. Awareness updates are throttled to every 100ms and are not persisted, reducing overhead significantly.

Persistence Strategy:

Active collaboration sessions are kept in memory for performance. Every 60 seconds, the server saves a snapshot of the document state to S3. All operations are also streamed to Kafka for replay capability. When a user rejoins a session, they load the latest snapshot and replay any operations that occurred after it.

Scalability Considerations:

Collaboration servers are sharded by project ID, ensuring all clients editing the same document connect to the same server (using sticky sessions). Each server can handle approximately 10,000 concurrent WebSocket connections. For projects with many collaborators, we can use Redis Pub/Sub for cross-server communication while maintaining consistency.

Deep Dive 2: How do we securely execute untrusted user code?

Executing arbitrary user code is inherently dangerous. We need multiple layers of isolation to prevent malicious code from compromising the host system, attacking other users, or consuming excessive resources.

The Challenge:

Users can submit any code, including attempts to access the file system, make network requests, fork bomb the system, or exploit kernel vulnerabilities. A single compromise could expose all user data or take down the entire execution infrastructure.

Solution: Multi-Layered Sandboxing

We use Firecracker microVMs as our primary sandboxing technology, with fallback to gVisor for certain use cases.

Why Firecracker:

Firecracker provides hardware-level isolation by running each execution in a separate lightweight virtual machine with its own kernel. Unlike Docker containers which share the host kernel, Firecracker gives each execution its own kernel, making kernel exploits ineffective. It achieves VM-level security with container-like performance: 125ms boot time and minimal overhead.

Execution Architecture:

The Execution Service maintains a pool of workers organized by language and runtime. Each worker manages a set of Firecracker microVMs. When an execution request arrives, the service assigns it to an available worker with the appropriate language runtime.

Resource Isolation:

Each microVM is configured with strict resource limits enforced at the hypervisor level. CPU is limited to one vCPU with burst capability for premium users. Memory is capped at 512MB default, 2GB maximum. Disk storage is ephemeral with a 1GB limit. Network access is outbound-only and rate-limited to 10Mbps, with private IP ranges blocked to prevent SSRF attacks.

Warm Pool Optimization:

Cold-starting a microVM and initializing the runtime environment takes 2-3 seconds, which is too slow for interactive use. We maintain a warm pool of pre-initialized microVMs for each language. When an execution completes, the VM is cleaned and returned to the pool rather than destroyed, reducing subsequent execution latency to under 500ms.

Network Security:

MicroVMs run in an isolated network namespace. Outbound traffic is filtered through egress rules that block private IP ranges (preventing SSRF), rate-limit requests (preventing DDoS), and optionally whitelist specific domains. Inbound traffic is completely blocked. This prevents malicious code from scanning internal networks or attacking other systems.

System Call Filtering:

Even with VM isolation, we apply seccomp filters to block dangerous system calls. Calls related to kernel modules, mounting file systems, or accessing hardware devices are blocked. The root filesystem is mounted read-only except for a small writable /tmp directory.

Execution Monitoring:

Resource usage is monitored continuously. If CPU usage exceeds 100% for more than 10 seconds, the process is killed. Memory usage is tracked and the execution is terminated if it exceeds the limit. All executions are logged with code hash, user ID, duration, and resource usage for anomaly detection.

Package Installation Security:

When users specify dependencies, we cache the installed packages by a hash of the manifest file. This prevents repeated installations but also means we need to scan cached packages for malware. New packages go through automated security scanning before being cached. Suspicious packages are quarantined and flagged for manual review.

Deep Dive 3: How do we provide IDE-like code intelligence features?

Modern developers expect autocomplete, error checking, and other intelligent features. Providing these in a browser-based editor requires running language analysis on the backend.

The Challenge:

Language analysis is computationally expensive and requires maintaining state about the entire project (AST, type information, symbol tables). Running this for millions of users requires significant resources, and latency must be low enough for interactive use.

Solution: Language Server Protocol

We use the Language Server Protocol (LSP), an open standard that separates language intelligence from the editor. Each programming language has a dedicated language server that provides autocomplete, hover information, error diagnostics, and other features.

LSP Architecture:

The client editor communicates with a Language Server Manager through the API Gateway. The manager routes requests to appropriate language servers based on the file type. Each project can have multiple language servers running (e.g., TypeScript for .ts files, Python for .py files).

Language Server Lifecycle:

When a user opens a project, the Language Server Manager checks if a language server is already running for that project and language. If not, it spawns a new LSP process with the project files mounted. The server indexes the project in the background, building an AST and symbol table. Once indexing completes, intelligent features become available.

Communication Protocol:

LSP uses JSON-RPC over WebSocket for communication. When a user requests autocomplete, the client sends a textDocument/completion request with the file URI and cursor position. The language server analyzes the code context and returns completion items with labels, types, and documentation. Similar request-response patterns exist for hover info, diagnostics, go-to-definition, and find references.
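The completion request described above looks like this on the wire. The sketch builds an LSP `textDocument/completion` message as JSON-RPC 2.0 with the Content-Length framing the protocol specifies; the helper name and file URI are illustrative:

```python
import json

def completion_request(request_id: int, file_uri: str,
                       line: int, character: int) -> str:
    """Frame an LSP textDocument/completion request as a JSON-RPC 2.0 message."""
    msg = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "textDocument/completion",
        "params": {
            "textDocument": {"uri": file_uri},
            # Positions are zero-based per the LSP specification
            "position": {"line": line, "character": character},
        },
    }
    body = json.dumps(msg)
    # LSP prefixes each message with an HTTP-style Content-Length header
    return f"Content-Length: {len(body)}\r\n\r\n{body}"
```

The language server replies with a matching `id` and a list of completion items (label, kind, detail, documentation) that the client renders in the autocomplete popup.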

Performance Optimizations:

To reduce overhead, we use incremental synchronization. Instead of sending the entire file content on every keystroke, we send only the changed text ranges (deltas). This dramatically reduces network traffic.

Requests are debounced on the client side. Diagnostics are debounced by 300ms to avoid running error checking on every keystroke. Autocomplete is debounced by 100ms to reduce unnecessary server calls while maintaining responsiveness.

Language servers are resource-intensive (100-500MB per language per project). To manage costs, we share language servers across multiple projects when possible and kill idle servers after 10 minutes of inactivity.

Caching Strategy:

We cache ASTs (abstract syntax trees) and symbol tables in memory, invalidated only when files change. Type checking results are cached per file with a hash of the file content as the key. Completion items for common prefixes are cached with a 60-second TTL, reducing repeated computation.

Deep Dive 4: How do we handle frequent file saves without overwhelming storage?

Users expect their work to be saved frequently, but saving thousands of files per second to a database and object storage would be expensive and slow.

The Challenge:

With 1M+ concurrent editing sessions, even if users only save once per minute, that’s 16,000+ save operations per second. Each save involves a database write (metadata) and an object storage upload (content), which can be slow and costly.

Solution: Write Coalescing and Caching

We implement several layers of optimization to reduce the actual number of writes:

Client-Side Debouncing:

The client only sends save requests after the user stops typing for 2 seconds (debouncing). This means rapid typing generates only one save request rather than dozens.
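Debouncing can be sketched in a few lines: each trigger cancels the pending timer and starts a new one, so the callback fires only after the input goes quiet. `Debouncer` is an illustrative name, and a real client would implement this in the browser rather than with Python threads:

```python
import threading

class Debouncer:
    """Invoke fn once, `delay` seconds after the most recent trigger."""

    def __init__(self, delay: float, fn):
        self.delay, self.fn = delay, fn
        self._timer = None
        self._lock = threading.Lock()

    def trigger(self):
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()  # rapid triggers keep resetting the countdown
            self._timer = threading.Timer(self.delay, self.fn)
            self._timer.start()
```

A burst of keystrokes produces a burst of `trigger()` calls but only a single save once typing pauses for the full delay.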

Content-Addressable Storage:

Files are stored by their content hash (SHA256) rather than by path. When a file is saved, we compute its hash and check if that content already exists in object storage. If it does, we only update the database metadata to point to the existing object. This deduplication significantly reduces storage costs and upload bandwidth.

Write Batching:

Multiple rapid saves for the same file are coalesced on the server side. If a file is saved three times in quick succession, only the latest version is actually written to object storage. The intermediate versions exist only in cache.

Redis Write-Through Cache:

File content is written to Redis immediately and returned successfully to the client. The actual object storage upload happens asynchronously in the background. This provides sub-100ms save latency while still ensuring durability through Redis persistence.

Database Optimization:

File metadata updates use optimistic locking to prevent lost updates. Each file record has a version number that’s incremented on update. Updates include the expected version number, and the database will reject the update if another update happened concurrently.
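The version-check pattern can be modeled with an in-memory stand-in for the metadata table; in production the check would be a conditional SQL `UPDATE ... WHERE version = ?`. Class and exception names here are illustrative:

```python
class VersionConflict(Exception):
    pass

class FileMetadataStore:
    """In-memory stand-in for the file metadata table with optimistic locking."""

    def __init__(self):
        self.rows = {}  # file_id -> {"content_hash": str, "version": int}

    def create(self, file_id: str, content_hash: str):
        self.rows[file_id] = {"content_hash": content_hash, "version": 1}

    def update(self, file_id: str, content_hash: str, expected_version: int):
        row = self.rows[file_id]
        if row["version"] != expected_version:
            # Another writer got there first; caller must re-read and retry
            raise VersionConflict(
                f"expected v{expected_version}, found v{row['version']}")
        row["content_hash"] = content_hash
        row["version"] += 1
```

On conflict the client re-fetches the latest metadata, re-applies its change, and retries with the new version number instead of silently overwriting the concurrent edit.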

Deep Dive 5: How do we provide browser-based terminal access?

Developers often need command-line access to install packages, run scripts, or debug issues. Providing a terminal in the browser requires streaming bidirectional I/O.

The Challenge:

Terminal interaction is inherently bidirectional and stateful. We need to capture keyboard input (including special keys), send it to a shell process in the container, capture the shell’s output (including ANSI escape codes for colors), and display it in the browser with low latency.

Solution: WebSocket with PTY

We use xterm.js in the browser for terminal rendering and WebSocket for communication with a Terminal Service that manages shell sessions.

Terminal Architecture:

The browser client loads xterm.js, which renders a terminal UI and captures keyboard input. It establishes a WebSocket connection to the Terminal Service, which creates a pseudo-terminal (PTY) in the user’s execution container and spawns a shell process attached to it.

Bidirectional I/O:

When the user types, xterm.js sends the keystrokes through the WebSocket. The Terminal Service writes these bytes to the master side of the PTY. The shell receives them on the slave side, processes the command, and writes output. The Terminal Service reads this output from the master PTY and sends it back through the WebSocket to the client, where xterm.js renders it.

Terminal Persistence:

A challenge is maintaining terminal state across page refreshes. We use tmux sessions to solve this. Instead of running bash directly, we run bash inside a tmux session. If the user refreshes the page or disconnects, the tmux session continues running. When they reconnect, we attach to the existing session and send the current buffer contents to restore the display.

Handling Resize Events:

When the browser window is resized, the terminal dimensions (rows and columns) change. The client sends a resize control message through the WebSocket. The Terminal Service uses a system call (ioctl with TIOCSWINSZ) to notify the PTY of the new dimensions, which triggers the shell to reflow its output.
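The `TIOCSWINSZ` call is a standard POSIX terminal ioctl and can be shown directly. This is a minimal, Linux/POSIX-only sketch; the helper names are illustrative, but the ioctl constants and the `winsize` struct layout (rows, cols, x-pixels, y-pixels) are from the tty interface:

```python
import fcntl
import pty
import struct
import termios

def resize_pty(fd: int, rows: int, cols: int) -> None:
    """Tell the kernel the terminal's new dimensions; the attached shell
    receives SIGWINCH and can reflow its output."""
    winsize = struct.pack("HHHH", rows, cols, 0, 0)
    fcntl.ioctl(fd, termios.TIOCSWINSZ, winsize)

def get_pty_size(fd: int) -> tuple:
    raw = fcntl.ioctl(fd, termios.TIOCGWINSZ, struct.pack("HHHH", 0, 0, 0, 0))
    rows, cols, _, _ = struct.unpack("HHHH", raw)
    return rows, cols
```

Because the window size is a property of the terminal itself, setting it through either end of the PTY pair is visible from the other end.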

Performance Optimization:

Terminal output can be extremely rapid (think of commands like grep on large files). To prevent overwhelming the browser, we throttle output to 60 frames per second. Output is buffered on the server and sent in batches every 16.7ms. If the buffer grows too large, we drop old output to maintain responsiveness (with a notice to the user).
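The frame-rate batching described above can be sketched as a small buffer that flushes at most once per frame and discards the oldest data when it overflows. `OutputBatcher` and its parameters are illustrative:

```python
import time

class OutputBatcher:
    """Buffer terminal output and flush at most once per frame (~16.7 ms),
    dropping the oldest data if the buffer outgrows `max_buffer` bytes."""

    def __init__(self, send, frame_seconds: float = 1 / 60,
                 max_buffer: int = 1_000_000):
        self.send = send                  # e.g. a WebSocket send callable
        self.frame_seconds = frame_seconds
        self.max_buffer = max_buffer
        self.buffer = b""
        self.last_flush = 0.0
        self.dropped = False              # set when old output was discarded

    def write(self, data: bytes):
        self.buffer += data
        if len(self.buffer) > self.max_buffer:
            self.buffer = self.buffer[-self.max_buffer:]  # keep newest output
            self.dropped = True
        now = time.monotonic()
        if now - self.last_flush >= self.frame_seconds:
            self.flush(now)

    def flush(self, now=None):
        if self.buffer:
            self.send(self.buffer)
            self.buffer = b""
        self.last_flush = now if now is not None else time.monotonic()
```

A command that emits thousands of small writes per second thus produces at most 60 WebSocket messages per second, and the `dropped` flag is where the "output truncated" notice to the user would hook in.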

Deep Dive 6: How do we efficiently handle package dependencies?

Modern projects depend on numerous external packages. Installing dependencies for every execution would be too slow, but caching presents challenges.

The Challenge:

A typical Node.js project might have hundreds of dependencies totaling hundreds of megabytes. Running “npm install” every time would take 30 seconds to 5 minutes, making the system unusable. But with millions of projects, we can’t cache all possible dependency combinations.

Solution: Content-Addressed Caching

We use a sophisticated caching system based on dependency manifest content.

Cache Key Generation:

When a project requires dependencies, we parse the package manifest (package.json for Node, requirements.txt for Python, Cargo.toml for Rust, etc.) and compute a SHA256 hash of the normalized content. This hash becomes the cache key.
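For a JSON manifest like package.json, normalization plus hashing can be sketched as follows. The function name, the choice of which fields to include, and the `language` prefix are illustrative assumptions:

```python
import hashlib
import json

def dependency_cache_key(manifest_text: str, language: str) -> str:
    """Normalize a JSON manifest so irrelevant differences (key order,
    whitespace, unrelated fields) don't produce distinct cache entries."""
    manifest = json.loads(manifest_text)  # e.g. a package.json
    deps = {
        "dependencies": manifest.get("dependencies", {}),
        "devDependencies": manifest.get("devDependencies", {}),
    }
    normalized = json.dumps(deps, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{language}:{normalized}".encode()).hexdigest()
```

Two projects declaring the same dependency versions hash to the same key and share one cached tarball, even if their manifests differ in formatting or in fields like `name` that don't affect installation.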

Cache Lookup:

Before installing dependencies, we check if this exact dependency set has been cached. The cache is stored in S3 as a tarball of the installed dependencies (node_modules, site-packages, etc.). If found, we download and extract it in the execution container, which takes 5-10 seconds compared to 30+ seconds for fresh installation.

Cache Population:

On cache miss, we proceed with fresh installation, streaming the installation logs to the user so they know what’s happening. After successful installation, we tar the dependencies and upload to S3 asynchronously.

Cache Eviction:

The cache uses LRU (Least Recently Used) eviction with a 1TB total storage limit. Popular dependency sets (like React + common libraries) stay cached. Inactive caches expire after 30 days. This ensures the most commonly used dependency combinations are available while managing storage costs.
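The eviction policy combines a byte budget with an idle TTL, which can be sketched with an ordered map. `DependencyCache` is an illustrative in-memory model; the real cache tracks tarballs in S3:

```python
import time
from collections import OrderedDict

class DependencyCache:
    """LRU cache of dependency tarball sizes with a byte budget and idle TTL."""

    def __init__(self, max_bytes: int, max_idle_seconds: float = 30 * 24 * 3600):
        self.max_bytes = max_bytes
        self.max_idle = max_idle_seconds
        self.entries = OrderedDict()  # key -> (size, last_used); oldest first
        self.total = 0

    def put(self, key: str, size: int, now=None):
        if now is None:
            now = time.time()
        if key in self.entries:
            self.total -= self.entries.pop(key)[0]
        self.entries[key] = (size, now)
        self.total += size
        while self.total > self.max_bytes:  # evict least recently used
            _, (evicted_size, _) = self.entries.popitem(last=False)
            self.total -= evicted_size

    def get(self, key: str, now=None):
        if now is None:
            now = time.time()
        entry = self.entries.get(key)
        if entry is None:
            return None
        size, last_used = entry
        if now - last_used > self.max_idle:  # idle too long: expire
            self.entries.pop(key)
            self.total -= size
            return None
        self.entries[key] = (size, now)      # mark as recently used
        self.entries.move_to_end(key)
        return size
```

Popular dependency sets keep being touched and survive; rarely used sets fall off either by LRU pressure against the byte budget or by the 30-day idle expiry.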

Private Registry Support:

For enterprise users, we support private package registries. Users can configure registry credentials in their project settings. These credentials are stored encrypted and injected into the execution environment at runtime, allowing access to private packages without exposing credentials in code.

Security Scanning:

New dependency combinations go through automated security scanning before being cached. This checks for known vulnerabilities (using databases like npm audit) and scans for suspicious code patterns. Flagged packages are quarantined for manual review.

Step 4: Wrap Up

In this design, we’ve proposed a comprehensive system for an online code editor with real-time collaboration and secure execution capabilities. If there’s extra time at the end of the interview, here are additional points to discuss:

Additional Features:

  • AI-powered code completion using models like GPT-4 or Codex for more intelligent suggestions beyond traditional LSP.
  • Collaborative debugging with shared breakpoints and variable inspection across multiple users.
  • Deployment integration to push projects directly to cloud providers like Vercel, Netlify, or AWS.
  • Mobile applications with offline editing capabilities and sync when reconnected.
  • Video and audio calls embedded in the editor for pair programming sessions.

Scaling Considerations:

  • Horizontal scaling: All services are stateless except the Collaboration Service, which uses consistent hashing to shard by project ID.
  • Geographic distribution: Deploy to multiple regions with intelligent routing based on user location.
  • Database sharding: Shard project data by user ID or geographic region to distribute load.
  • CDN for static assets: Editor bundles, Monaco Editor assets, and common libraries served from CDN for fast load times globally.
  • Autoscaling: Execution workers and collaboration servers autoscale based on queue depth and active sessions respectively.

Error Handling:

  • Network failures: Implement exponential backoff retry logic for transient failures.
  • Service failures: Use circuit breakers to prevent cascading failures when downstream services are unavailable.
  • Data loss prevention: Collaboration service maintains operation logs that can replay to recover state after crashes.
  • Graceful degradation: If execution service is unavailable, editor remains functional in read-only mode.

Security Considerations:

  • Encryption at rest: All file content in object storage is encrypted using AES-256.
  • Encryption in transit: All communications use TLS 1.3, including WebSocket connections.
  • Authentication: JWT tokens with short expiration times, refresh tokens stored in secure HttpOnly cookies.
  • Authorization: Row-level security in database ensures users can only access their own projects or explicitly shared ones.
  • Rate limiting: Per-user rate limits on API requests, executions, and file operations to prevent abuse.
  • Audit logging: All operations logged with user ID, action, timestamp, and IP address for security monitoring.

Monitoring and Observability:

  • Track key metrics: Editor load time (p50, p95, p99), execution startup latency, collaboration sync latency, WebSocket connection count.
  • Distributed tracing: Correlate requests across services to identify bottlenecks.
  • Real-time dashboards: Monitor system health, active users, execution queue depth, error rates.
  • Alerting: Automated alerts for high error rates, execution failures, service degradation, or unusual patterns.
  • A/B testing: Framework for testing new features and UI changes with controlled rollout.

Future Improvements:

  • Machine learning for predictive container pre-warming based on user patterns.
  • Improved CRDT algorithms for better handling of rich text formatting and embedded media.
  • WebAssembly-based language servers running in the browser to reduce server load for smaller projects.
  • Blockchain-based code verification for tamper-proof project history.
  • Quantum-resistant encryption for long-term data security.

Cost Optimization:

  • Execution container warm pools sized dynamically based on time-of-day patterns to avoid over-provisioning.
  • S3 lifecycle policies to move old project files to cheaper storage tiers (Glacier for projects inactive > 90 days).
  • Spot instances for non-critical execution workers that can tolerate interruption.
  • Intelligent caching to reduce database read load and lower replica count.
  • Compression of collaboration session data before storage to reduce costs.

Congratulations on getting this far! An online code editor is a complex system that combines multiple challenging domains: real-time collaboration, secure sandboxing, intelligent code analysis, and large-scale storage. The key is to start with core functionality, ensure correctness and security, then layer in optimizations for performance and scale.


Summary

This comprehensive guide covered the design of a browser-based online code editor, including:

  1. Core Functionality: Browser-based editing with syntax highlighting, secure code execution, real-time collaboration, and persistent file storage.
  2. Key Challenges: Conflict-free collaborative editing, secure sandboxing of untrusted code, IDE-like intelligence features, and efficient dependency management.
  3. Solutions: CRDTs for collaboration (Yjs), Firecracker microVMs for isolation, Language Server Protocol for code intelligence, content-addressed storage for files, and sophisticated caching for dependencies.
  4. Scalability: Horizontal scaling of stateless services, geographic distribution, database sharding, warm container pools, and intelligent caching strategies.

The design demonstrates how to handle complex real-time systems with strong consistency requirements, security constraints, and performance expectations while managing costs at scale.