Design Venmo
Venmo is a peer-to-peer payment platform that combines financial transactions with social networking features. At its core, it needs to handle billions of dollars in transactions annually while maintaining ACID guarantees, fraud prevention, and regulatory compliance (PCI-DSS, AML, KYC). This design focuses on building a production-grade system that handles 50M+ users, 200M+ transactions per month, with 99.99% uptime and sub-second payment confirmation.
Designing Venmo presents unique challenges including ensuring exactly-once payment processing, maintaining strong consistency for financial transactions, implementing real-time fraud detection, managing social feed privacy, and handling regulatory compliance requirements.
Step 1: Understand the Problem and Establish Design Scope
Before diving into the design, it’s crucial to define the functional and non-functional requirements. For a payment system like Venmo, functional requirements define what users can do, while non-functional requirements establish system qualities around reliability, security, and performance.
Functional Requirements
Core Requirements (Priority 1-4):
- Users should be able to send money to other Venmo users instantly within the ecosystem.
- Users should be able to link bank accounts and debit/credit cards as funding sources.
- Users should be able to maintain a Venmo balance and withdraw funds to linked bank accounts.
- Users should be able to view their transaction history and social feed showing friend transactions.
Below the Line (Out of Scope):
- Users should be able to request money from other users with optional notes.
- Users should be able to split payments among multiple people for group expenses.
- Users should be able to set transaction visibility (public, friends, private).
- Users should be able to like and comment on friend transactions.
- Users should be able to dispute transactions and request chargebacks.
- Users should be able to make recurring payments and scheduled transactions.
Non-Functional Requirements
Core Requirements:
- The system should ensure exactly-once payment processing with no duplicate charges.
- The system should maintain strong consistency for balance updates (no negative balances).
- The system should process payments with sub-second response time (p99 < 500ms).
- The system should provide 99.99% uptime for payment services (about 52.6 minutes of downtime per year).
- The system should detect fraud in real-time with >99% accuracy and <0.1% false positives.
Below the Line (Out of Scope):
- The system should comply with PCI-DSS Level 1 and maintain SOC 2 Type II certification.
- The system should encrypt all sensitive data at rest and in transit (for example, AES-256 at rest and TLS in transit).
- The system should retain transaction data for 7 years per regulatory requirements.
- The system should scale linearly to support 100M+ users and 500M+ monthly transactions.
- The system should provide automated suspicious activity reporting (SAR) for AML compliance.
Clarification Questions & Assumptions:
- Platform: Mobile apps for iOS and Android as primary touchpoints.
- Scale: 50 million monthly active users (MAU), 200 million transactions per month.
- Transaction Volume: Average 77 TPS, with peaks up to 500 TPS during high-traffic periods.
- Payment Processing: Instant for Venmo-to-Venmo transfers; ACH takes 1-3 days for bank transfers.
- Bank Integration: Leveraging third-party services like Plaid for bank linking and ACH processing.
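As a quick sanity check on the scale assumptions above, the average throughput can be reproduced with a few lines of arithmetic. This sketch assumes a 30-day month; the 500 TPS peak is the stated estimate, not a derived value.

```python
# Back-of-the-envelope throughput check, assuming a 30-day month.
MONTHLY_TRANSACTIONS = 200_000_000
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds

average_tps = MONTHLY_TRANSACTIONS / SECONDS_PER_MONTH
peak_tps = 500  # stated peak from the assumptions above
peak_to_average = peak_tps / average_tps

print(f"average: {average_tps:.1f} TPS, peak/average: {peak_to_average:.1f}x")
```

The peak is roughly six to seven times the average, which is the headroom the payment path needs to absorb.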
Step 2: Propose High-Level Design and Get Buy-in
Planning the Approach
For a payment system, we’ll build the design sequentially by addressing each core functional requirement. We’ll start with the basic payment flow, then layer in funding sources, balance management, and finally the social feed features.
Defining the Core Entities
To satisfy our key functional requirements, we’ll need the following entities:
User: Any individual registered on the platform. Contains personal information, KYC verification level, contact details, privacy preferences, and authentication credentials. Users can be both senders and recipients of payments.
Transaction: A payment record from initiation to completion. Includes sender and recipient identifiers, amount, currency, status (initiated, processing, completed, failed), timestamps, optional note, privacy setting, and an idempotency key to prevent duplicates.
Balance: The current Venmo balance for each user. Represents the amount of money stored in their Venmo account that can be used for instant payments or withdrawn to a bank account.
Ledger Entry: Individual debit or credit entries that implement double-entry bookkeeping. Every transaction creates two entries: a debit reducing the sender’s balance and a credit increasing the recipient’s balance. This provides an immutable audit trail.
Funding Source: Bank accounts or debit/credit cards linked to a user’s account. Contains encrypted access tokens, account identifiers, verification status, and metadata like bank name and account mask.
Fraud Score: A risk assessment generated for each transaction. Includes a numerical risk score (0-100), the features used for scoring, the model version, and the decision taken (approve, challenge, review, block).
API Design
Create Payment Endpoint: Used to initiate a payment from one user to another.
POST /v1/payments -> Transaction
Body: {
idempotencyKey: uuid,
recipientId: string,
amount: number,
currency: "USD",
note: string (optional),
privacy: "public" | "friends" | "private",
fundingSourceId: string
}
The idempotency key ensures duplicate API calls don’t create multiple charges. The senderId is extracted from the authenticated session.
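A minimal client-side sketch of how the key is meant to be used: it is generated once per payment intent and reused on every retry, so the server can recognize repeats. The `send_with_retries` helper and the `flaky_send` transport are hypothetical stand-ins for the mobile client's HTTP layer.

```python
import uuid
from typing import Callable


def send_with_retries(send: Callable[[dict], dict], body: dict, max_attempts: int = 3) -> dict:
    """Retry a payment request, reusing one idempotency key for every attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # The key is created once, outside the retry loop.
    request = {**body, "idempotencyKey": str(uuid.uuid4())}
    for attempt in range(1, max_attempts + 1):
        try:
            return send(request)
        except TimeoutError:
            if attempt == max_attempts:
                raise


# Hypothetical transport that times out twice before succeeding.
seen_keys = []

def flaky_send(request: dict) -> dict:
    seen_keys.append(request["idempotencyKey"])
    if len(seen_keys) < 3:
        raise TimeoutError("simulated network timeout")
    return {"status": "completed", "idempotencyKey": request["idempotencyKey"]}


result = send_with_retries(
    flaky_send, {"recipientId": "user_b", "amount": 25.00, "currency": "USD"}
)
```

All three attempts carry the same key, so the server sees one logical payment regardless of how many times the request is delivered.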
Link Bank Account Endpoint: Used to connect a user’s bank account via Plaid integration.
POST /v1/funding-sources/bank -> FundingSource
Body: {
publicToken: string,
accountId: string
}
The public token comes from Plaid’s Link widget and is exchanged server-side for a secure access token.
Get Balance Endpoint: Retrieves the current Venmo balance for the authenticated user.
GET /v1/balance -> Balance
Withdraw Funds Endpoint: Initiates a transfer from Venmo balance to a linked bank account.
POST /v1/withdrawals -> Withdrawal
Body: {
fundingSourceId: string,
amount: number,
type: "standard" | "instant"
}
Standard ACH withdrawals are free but take 1-3 days; instant transfers charge a 1.75% fee.
Get Transaction History Endpoint: Retrieves paginated transaction history for the user.
GET /v1/transactions?limit=50&offset=0 -> Transaction[]
Get Social Feed Endpoint: Retrieves the social feed showing friend transactions based on privacy settings.
GET /v1/feed?limit=50&offset=0 -> FeedItem[]
High-Level Architecture
Let’s build up the system sequentially, addressing each functional requirement:
1. Users should be able to send money to other Venmo users instantly
The core components necessary to fulfill payment processing are:
- Mobile Client: The primary user interface available on iOS and Android. Handles user input, displays transaction status, and manages local state.
- API Gateway: Entry point for all client requests. Manages authentication via JWT tokens, rate limiting (100 requests per minute per user), TLS termination, and routes requests to appropriate microservices.
- Payment Service: Orchestrates the payment transaction flow. Validates idempotency keys, checks transaction limits, coordinates with other services, and manages the payment state machine.
- Account Service: Manages user balances and ledger entries. Implements double-entry bookkeeping to ensure balance integrity and handles balance queries.
- Fraud Detection Service: Evaluates transaction risk in real-time using machine learning models. Extracts features from the transaction and user history, generates risk scores, and makes approval decisions.
- PostgreSQL Database: Stores critical transactional data including users, transactions, balances, and ledger entries. Provides ACID guarantees essential for financial operations.
- Redis Cache: Provides fast access to frequently accessed data like user sessions, idempotency tracking, rate limiting counters, and balance cache.
Payment Flow:
- The sender enters recipient details and amount in the mobile client, which sends a POST request to create a payment with a unique idempotency key.
- The API Gateway authenticates the request and forwards it to the Payment Service.
- The Payment Service checks the idempotency key in Redis to detect duplicates. If found, it returns the cached result.
- The Fraud Detection Service evaluates the transaction, extracting features like transaction velocity, account age, relationship between users, and device fingerprints. It returns a risk score.
- If the risk score is acceptable (below 50), the Payment Service proceeds. Medium-risk scores trigger additional verification like 2FA.
- The Account Service validates that the sender has sufficient balance or valid funding source.
- Within a single database transaction, the system atomically: locks the sender’s and recipient’s balance rows, creates the transaction record, inserts debit and credit ledger entries, and updates both user balances.
- Upon successful commit, the transaction status is marked as completed and the result is cached in Redis with a 24-hour TTL.
- An event is published to the message queue for asynchronous processing like notifications and feed updates.
2. Users should be able to link bank accounts and debit/credit cards as funding sources
We introduce components for external integrations:
- Plaid Integration: Third-party service providing secure bank authentication and ACH transfer capabilities. Handles the OAuth-style flow for bank linking and provides APIs for account verification and fund transfers.
- Funding Source Service: Manages linked bank accounts and cards. Stores encrypted access tokens using AWS KMS, handles verification workflows, and initiates ACH transfers.
Bank Linking Flow:
- The user initiates bank linking in the mobile client, which requests a Plaid Link token from the backend.
- The mobile client opens the Plaid Link widget, where the user authenticates with their bank credentials.
- Upon successful authentication, Plaid returns a public token and account metadata.
- The client sends the public token to the Funding Source Service.
- The service exchanges the public token for an access token via Plaid’s API.
- It retrieves account details and performs instant verification using Plaid’s Auth API.
- The encrypted access token is stored in the database along with account metadata (bank name, account mask, account type).
- The funding source is marked as verified and can now be used for payments and withdrawals.
When a payment is funded from a bank account, the system initiates an ACH pull transaction via Plaid. The ACH transfer takes 1-3 business days to settle. Plaid sends webhook notifications when the transfer completes or fails, allowing the system to update the transaction status accordingly.
3. Users should be able to maintain a Venmo balance and withdraw funds
We leverage the existing Account Service with additional withdrawal capabilities:
Balance Management:
- The Account Service maintains a user balances table with the current balance for each user.
- All balance changes are recorded in the ledger entries table using double-entry bookkeeping.
- Every transaction creates exactly two ledger entries: one debit (reducing sender balance) and one credit (increasing recipient balance).
- This provides an immutable audit trail and enables balance reconciliation.
Withdrawal Flow:
- The user requests a withdrawal from their mobile client, specifying the amount and destination bank account.
- The Funding Source Service validates the user has sufficient Venmo balance.
- For standard ACH withdrawals, it creates a withdrawal record and initiates an ACH push transaction via Plaid.
- The user’s Venmo balance is debited immediately, and the funds are marked as pending withdrawal.
- When Plaid’s webhook confirms the ACH transfer settled successfully (1-3 days), the withdrawal status is updated to completed.
- For instant transfers, a partner bank processes the transfer in real-time for a 1.75% fee.
Balance Reconciliation:
- A nightly batch job verifies balance integrity by summing all ledger entries for each user and comparing against the user balances table.
- Any discrepancies beyond rounding tolerance (1 cent) trigger alerts to the finance team for investigation.
- This ensures the system remains in a consistent state despite any potential bugs or race conditions.
4. Users should be able to view transaction history and social feed
We introduce components for social features:
- Social Feed Service: Manages the feed of transaction activities. Enforces privacy controls, generates personalized feeds based on friendship relationships, and handles likes and comments.
- Notification Service: Sends push notifications via FCM (Android) and APNs (iOS) for payment events, payment requests, and friend activity.
- Cassandra Database: Stores high-volume social feed data. Optimized for write-heavy workloads with wide-column storage for user activity timelines.
Transaction History Flow:
- The user opens the transactions tab in the mobile client, which requests their transaction history.
- The API Gateway routes the request to the Payment Service.
- The service queries the transactions table filtered by user ID (either as sender or recipient) and ordered by timestamp descending.
- Results are paginated (50 per page) and returned to the client.
- Frequently accessed recent transactions are cached in Redis to reduce database load.
Social Feed Flow:
- When a payment is completed with privacy set to “public” or “friends”, an event is published to the message queue.
- The Social Feed Service consumes these events and creates feed entries in Cassandra.
- When a user opens the feed, the service queries for visible transactions based on privacy rules:
- Public transactions are visible to everyone.
- Friends-only transactions are visible if the viewing user is friends with either the sender or recipient.
- Private transactions are only visible to the sender and recipient.
- The feed is sorted by recency and paginated.
- If the user has chosen to hide the amount, the feed shows the transaction note and participants but masks the dollar amount.
- User feeds are cached in Redis with a 5-minute TTL and invalidated when new relevant transactions occur.
Step 3: Design Deep Dive
With the core functional requirements met, it’s time to dig into the non-functional requirements via deep dives. These are the critical areas that ensure the system is production-ready for handling real money.
Deep Dive 1: How do we guarantee exactly-once payment processing with ACID properties?
The payment flow must guarantee that each payment is processed exactly once with full ACID properties. A failed payment must never partially deduct money, and network retries must not create duplicate charges.
Problem: Network Retries and Duplicate Payments
Mobile apps operate on unreliable networks. If a payment request times out, should the app retry? Without proper handling, retries could create duplicate charges - the worst possible user experience for a payment app.
Solution: Idempotency with Client-Generated Keys
Every payment request includes a unique idempotency key generated by the client (typically a UUID v4). The system uses a multi-layer approach to detect duplicates:
Layer 1: Redis Cache. The Payment Service first checks Redis for the idempotency key. If found, it immediately returns the cached response without processing the payment again. This provides an O(1) lookup with sub-millisecond latency. The cache entry has a 24-hour TTL, covering typical retry windows.
Layer 2: Database Lookup. If not found in cache, the service queries the transactions table for an existing transaction with this idempotency key. If found, it returns that transaction’s details. This handles cases where the Redis cache was cleared or the request comes after the TTL expired.
Layer 3: Database Uniqueness Constraint. The transactions table has a unique index on the idempotency key column. This provides a final safety net at the database level, preventing duplicate insertions even if multiple requests somehow reach this point simultaneously.
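The three layers can be sketched as follows. This is an illustration under stated assumptions, not the production implementation: a plain dict stands in for Redis (a real cache would also set the 24-hour TTL), an in-memory SQLite database stands in for PostgreSQL, and the table and function names are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE transactions ("
    " id INTEGER PRIMARY KEY,"
    " idempotency_key TEXT NOT NULL UNIQUE,"  # Layer 3: uniqueness constraint
    " amount_cents INTEGER NOT NULL)"
)
cache = {}  # stand-in for Redis: idempotency key -> transaction id


def find_existing(idempotency_key):
    row = db.execute(
        "SELECT id FROM transactions WHERE idempotency_key = ?", (idempotency_key,)
    ).fetchone()
    return None if row is None else row[0]


def create_payment(idempotency_key, amount_cents):
    """Return the transaction id, creating the row at most once per key."""
    # Layer 1: cache lookup.
    if idempotency_key in cache:
        return cache[idempotency_key]
    # Layer 2: database lookup (covers cache eviction or TTL expiry).
    txn_id = find_existing(idempotency_key)
    if txn_id is None:
        try:
            with db:  # commits on success, rolls back on error
                cursor = db.execute(
                    "INSERT INTO transactions (idempotency_key, amount_cents) VALUES (?, ?)",
                    (idempotency_key, amount_cents),
                )
            txn_id = cursor.lastrowid
        except sqlite3.IntegrityError:
            # Layer 3: a concurrent request inserted first; return its row.
            txn_id = find_existing(idempotency_key)
    cache[idempotency_key] = txn_id
    return txn_id


first = create_payment("key-1", 2500)
cache.clear()  # simulate a cache flush between retries
second = create_payment("key-1", 2500)
row_count = db.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
```

Even after the cache is cleared, the retry resolves to the same transaction and only one row exists.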
Transaction State Machine:
Payments progress through a well-defined state machine: INITIATED, FRAUD_CHECK, AUTHORIZED, PROCESSING, SETTLED, COMPLETED. Each state transition is atomic and recorded in the database with timestamps.
Failed states include: REJECTED (failed fraud check), DECLINED (insufficient funds), FAILED (processing error), and REVERSED (chargeback or dispute resolution).
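One way to enforce the state machine is an explicit transition table. The state names come from the text above; which failure state is reachable from which step is an assumption made for illustration.

```python
HAPPY_PATH = ["INITIATED", "FRAUD_CHECK", "AUTHORIZED", "PROCESSING", "SETTLED", "COMPLETED"]

# Allowed transitions. The failure edges are assumptions, not taken from the source.
ALLOWED_TRANSITIONS = {
    "INITIATED": {"FRAUD_CHECK", "FAILED"},
    "FRAUD_CHECK": {"AUTHORIZED", "REJECTED", "DECLINED", "FAILED"},
    "AUTHORIZED": {"PROCESSING", "FAILED"},
    "PROCESSING": {"SETTLED", "FAILED"},
    "SETTLED": {"COMPLETED", "REVERSED"},
    "COMPLETED": {"REVERSED"},
}


def advance(current: str, new: str) -> str:
    """Return the new state, or raise if the transition is not allowed."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return new


state = HAPPY_PATH[0]
for next_state in HAPPY_PATH[1:]:
    state = advance(state, next_state)
```

Terminal states such as FAILED or REJECTED have no outgoing edges, so any attempt to move a finished payment raises an error instead of silently corrupting its history.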
Distributed Transaction with Two-Phase Commit:
For Venmo-to-Venmo transfers, we use a two-phase, prepare-then-commit structure to ensure atomicity across multiple database operations. While both users’ rows live in the same PostgreSQL database, both phases run inside one local ACID transaction; a true two-phase commit across separate databases is only needed once balances are sharded across database instances.
Phase 1: Prepare. The transaction begins by acquiring row-level locks on the sender’s and recipient’s balance rows using SELECT FOR UPDATE, always in a consistent order (for example, ascending user ID) so that two users paying each other at the same time cannot deadlock. This prevents concurrent modifications. The system validates sufficient funds, creates the transaction record with PROCESSING status, and prepares ledger entries for both the debit and credit operations. All these operations occur within a single database transaction.
Phase 2: Commit. If all validations pass, the system proceeds to commit. It atomically updates both user balances, inserts both ledger entries, and marks the transaction as COMPLETED. If any operation fails during this phase, the entire transaction is rolled back, ensuring the system remains in a consistent state.
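A minimal sketch of the prepare-then-commit flow as one local transaction. It uses SQLite so it runs anywhere; PostgreSQL would take the row locks with SELECT ... FOR UPDATE, whereas SQLite simply serializes writers. Table names, column names, and the `transfer` function are hypothetical, and amounts are stored as integer cents.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE balances (
    user_id TEXT PRIMARY KEY,
    balance_cents INTEGER NOT NULL CHECK (balance_cents >= 0)
);
CREATE TABLE ledger_entries (
    txn_id TEXT, user_id TEXT, entry_type TEXT,
    amount_cents INTEGER, balance_after INTEGER
);
INSERT INTO balances VALUES ('alice', 10000), ('bob', 500);
""")


class InsufficientFunds(Exception):
    pass


def read_balance(user_id):
    # In PostgreSQL this read would be SELECT ... FOR UPDATE to lock the row.
    return db.execute(
        "SELECT balance_cents FROM balances WHERE user_id = ?", (user_id,)
    ).fetchone()[0]


def transfer(txn_id, sender, recipient, amount_cents):
    with db:  # commit on success, roll back on any exception
        # Prepare: read balances and validate funds.
        sender_after = read_balance(sender) - amount_cents
        if sender_after < 0:
            raise InsufficientFunds(sender)
        recipient_after = read_balance(recipient) + amount_cents
        # Commit: update both balances and insert both ledger entries together.
        db.execute("UPDATE balances SET balance_cents = ? WHERE user_id = ?",
                   (sender_after, sender))
        db.execute("UPDATE balances SET balance_cents = ? WHERE user_id = ?",
                   (recipient_after, recipient))
        db.executemany(
            "INSERT INTO ledger_entries VALUES (?, ?, ?, ?, ?)",
            [(txn_id, sender, "DEBIT", amount_cents, sender_after),
             (txn_id, recipient, "CREDIT", amount_cents, recipient_after)],
        )


transfer("txn-1", "alice", "bob", 2500)
try:
    transfer("txn-2", "bob", "alice", 999_999)  # more than bob holds
except InsufficientFunds:
    pass
```

The rejected transfer leaves both balances and the ledger untouched, while the successful one changes all four rows or none.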
Failure Recovery:
If Phase 2 fails (database crash, network partition, etc.), the transaction coordinator automatically rolls back the prepared transaction. The system marks the transaction as FAILED with an error code, publishes a failure event for monitoring, and returns an error to the client for retry with the same idempotency key.
When the client retries, the idempotency mechanism detects the failed transaction and can either return the error or, if the failure was transient, attempt processing again.
Deep Dive 2: How do we implement double-entry bookkeeping for financial integrity?
Every financial transaction must create two ledger entries to maintain balance integrity and provide an immutable audit trail. This is a fundamental accounting principle used by banks and financial institutions.
The Ledger Entry Model:
The ledger entries table stores every debit and credit operation. Each entry includes the transaction ID it’s part of, the user ID it affects, the entry type (DEBIT or CREDIT), the amount, the resulting balance after this operation, and a timestamp.
The user balances table maintains the current balance for each user. This is essentially a materialized view - the balance should always equal the sum of credits minus debits in the ledger entries table.
Why Double-Entry?
Double-entry bookkeeping has several critical benefits:
- Immutability: Ledger entries are never modified or deleted, only inserted. This provides a complete audit trail.
- Reconciliation: We can verify balance correctness by summing ledger entries and comparing to the balances table.
- Accounting Accuracy: The total of all credits must always equal the total of all debits across the entire system.
- Fraud Detection: Discrepancies between ledger entries and balances indicate potential bugs or fraudulent activity.
Transaction Processing with Ledger Entries:
When processing a payment, the system must insert two ledger entries atomically:
First, it creates a DEBIT entry for the sender, recording the amount being sent and calculating the balance_after value by subtracting from their current balance. Second, it creates a CREDIT entry for the recipient, recording the same amount and calculating the balance_after by adding to their current balance.
Both entries reference the same transaction ID, creating an unbreakable link between the two sides of the transaction. These insertions happen within the same database transaction as the balance updates, ensuring atomicity.
Balance Reconciliation Process:
A nightly batch job performs balance reconciliation for every user. It calculates the expected balance by summing all CREDIT entries minus all DEBIT entries from the ledger. It then compares this calculated balance against the current balance in the user balances table.
If the difference exceeds a small tolerance threshold (one cent, to account for rounding), the system triggers an alert to the finance team. The discrepancy is logged with full details including the user ID, calculated balance, current balance, and timestamp.
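The reconciliation process also verifies global accounting invariants. The sum of all credits must equal the sum of all debits across the entire platform, which holds only if money entering or leaving through bank transfers is booked against a system account rather than appearing from nowhere. The sum of all user balances must likewise equal the sum of credits minus debits over user accounts.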
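The per-user check can be sketched as a pure function over ledger rows. The tuple layout, the function name, and the sample data are hypothetical; amounts are integer cents, and the one-cent tolerance is the threshold described above.

```python
from collections import defaultdict


def reconcile(ledger_entries, balances, tolerance_cents=1):
    """Return {user_id: stored_balance - ledger_balance} for every mismatch.

    ledger_entries: iterable of (user_id, entry_type, amount_cents)
    balances: dict of user_id -> stored balance in cents
    """
    expected = defaultdict(int)
    for user_id, entry_type, amount_cents in ledger_entries:
        expected[user_id] += amount_cents if entry_type == "CREDIT" else -amount_cents

    discrepancies = {}
    for user_id in set(expected) | set(balances):
        difference = balances.get(user_id, 0) - expected[user_id]
        if abs(difference) > tolerance_cents:
            discrepancies[user_id] = difference
    return discrepancies


# Illustrative data: bob's stored balance is 100 cents higher than his ledger implies.
ledger = [
    ("alice", "CREDIT", 10000),
    ("alice", "DEBIT", 2500),
    ("bob", "CREDIT", 2500),
]
stored = {"alice": 7500, "bob": 2600}
mismatches = reconcile(ledger, stored)
```

Each entry in the result would be logged and alerted on; an empty result means the balances table agrees with the ledger.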
Deep Dive 3: How do we detect and prevent fraud in real-time?
Fraud detection is critical for a payment platform. We need to identify suspicious transactions before they’re processed, without creating too much friction for legitimate users.
Feature Engineering:
The fraud detection system extracts over 30 features from each transaction and the user’s history:
Transaction Velocity Features: Count of transactions from this user in the last hour, last day, and last week. Total amount sent in the last 24 hours. These catch velocity-based attacks where fraudsters try to drain an account quickly.
Account Age Features: Days since sender account creation and days since recipient account creation. Newer accounts are higher risk as fraudsters often create fresh accounts.
Relationship Features: Whether the sender and recipient are friends on the platform. Number of previous transactions between these specific users. First-time transactions to strangers are riskier.
Amount Features: The transaction amount itself. Whether it’s a round number (round amounts such as 10, 50, or 100 dollars are treated as slightly more suspicious). The amount relative to the sender’s average transaction size. Amounts much larger than the sender’s usual pattern suggest a compromised account.
Device and Location Features: Whether this is a new device for the sender. Geographic distance from their last transaction location. Impossible travel (e.g., transactions from New York and California minutes apart) indicates fraud.
Time Features: Hour of day and day of week. Transactions at unusual times (3 AM) have higher risk. Whether it’s a weekend or holiday when users might not notice fraudulent charges.
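A small sketch of how a few of these features could be computed from a sender's payment history. The function name, the feature names, and the input shape are assumptions for illustration; the round-amount check only uses the three example values from the text.

```python
from datetime import datetime, timedelta


def extract_features(history, amount, now):
    """history: list of (timestamp, amount) for the sender's past payments."""
    last_hour = [a for t, a in history if now - t <= timedelta(hours=1)]
    last_day = [a for t, a in history if now - t <= timedelta(days=1)]
    average = sum(a for _, a in history) / len(history) if history else 0.0
    return {
        "txn_count_1h": len(last_hour),
        "txn_count_24h": len(last_day),
        "amount_sent_24h": sum(last_day),
        "is_round_amount": amount in (10, 50, 100),
        "amount_vs_average": amount / average if average else None,
        "hour_of_day": now.hour,
        "is_weekend": now.weekday() >= 5,  # Saturday is 5, Sunday is 6
    }


now = datetime(2024, 6, 1, 3, 0)  # a Saturday at 3 AM
history = [
    (now - timedelta(minutes=30), 20),
    (now - timedelta(hours=5), 40),
    (now - timedelta(days=3), 30),
]
features = extract_features(history, amount=100, now=now)
```

In a real pipeline these values would be read from precomputed counters rather than by scanning the history on every request.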
Machine Learning Model:
The system uses an XGBoost gradient boosting model trained on historical fraud data. The training dataset includes transactions labeled as fraud by manual review teams, chargebacks, and user disputes.
Since fraud is rare (approximately 0.5% of transactions), the training pipeline uses SMOTE (Synthetic Minority Over-sampling Technique) to handle class imbalance. This creates synthetic fraud examples to balance the training data.
The model outputs a fraud probability between 0 and 1, which is scaled to a risk score of 0-100. The system retrains the model nightly on the latest 90 days of data to adapt to evolving fraud patterns.
Risk-Based Decision Making:
Based on the risk score, the system takes different actions:
- Low Risk (0-49): Auto-approve the transaction. Process immediately with no additional friction.
- Medium Risk (50-79): Challenge the user with additional verification. Send a 2FA code via SMS or require biometric authentication. This adds friction but prevents most fraud while allowing legitimate users to proceed.
- High Risk (80-94): Queue for manual review. The transaction is held pending investigation by the fraud operations team. User is notified of a security review.
- Critical Risk (95-100): Auto-block the transaction. Notify the user of suspicious activity. Flag the account for investigation. Notify the security team for immediate response.
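The tiers above map directly onto a small decision function. The thresholds are the ones listed; rounding the model probability to the nearest integer score is an assumption, since the text only says the probability is scaled to 0-100.

```python
def score_from_probability(probability: float) -> int:
    """Scale a fraud probability in [0, 1] to a 0-100 risk score (rounding is assumed)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return round(probability * 100)


def decide(risk_score: int) -> str:
    """Map a 0-100 risk score to approve, challenge, review, or block."""
    if not 0 <= risk_score <= 100:
        raise ValueError("risk score must be between 0 and 100")
    if risk_score < 50:
        return "approve"
    if risk_score < 80:
        return "challenge"
    if risk_score < 95:
        return "review"
    return "block"


decisions = [decide(score) for score in (12, 50, 80, 95)]
```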
Rule-Based Overrides:
In addition to the ML model, the system applies rules-based overrides:
If the account is less than 7 days old and the transaction is over 500 dollars, automatically boost the risk score to at least 85. If the user has sent more than 10 transactions in the last hour, boost to at least 90. These rules catch obvious attack patterns that might slip through the ML model.
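The two rules can be expressed as score floors applied after the model runs. The function and parameter names are hypothetical; the thresholds are the ones stated above.

```python
def apply_overrides(score: int, account_age_days: int, amount_dollars: float,
                    txns_last_hour: int) -> int:
    """Raise the model's risk score to a floor when a hard rule matches."""
    # New account sending a large amount: floor of 85.
    if account_age_days < 7 and amount_dollars > 500:
        score = max(score, 85)
    # Burst of activity in the last hour: floor of 90.
    if txns_last_hour > 10:
        score = max(score, 90)
    return score


boosted = apply_overrides(20, account_age_days=3, amount_dollars=600, txns_last_hour=0)
```

Using `max` means a rule can only raise the score, so a transaction the model already rates as critical is never downgraded by an override.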
Model Monitoring:
The system tracks model performance metrics including precision, recall, and AUC-ROC score. It monitors false positive rate (legitimate transactions blocked) and false negative rate (fraud that got through). If performance degrades below thresholds, alerts are sent to the data science team for model retraining.
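For reference, the rate metrics follow directly from confusion-matrix counts. The counts below are made-up numbers chosen to match a 0.5% fraud rate, not measured results; AUC-ROC is omitted because it needs the full score distribution rather than counts.

```python
def monitoring_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute monitoring rates from true/false positive/negative counts."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),            # flagged transactions that were fraud
        "recall": ratio(tp, tp + fn),               # fraud that was caught
        "false_positive_rate": ratio(fp, fp + tn),  # legitimate transactions flagged
        "false_negative_rate": ratio(fn, fn + tp),  # fraud that got through
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical day: 100,000 transactions, 500 of them fraudulent.
metrics = monitoring_metrics(tp=450, fp=100, tn=99_400, fn=50)
```

With fraud this rare, accuracy stays above 99% even when a tenth of the fraud slips through, which is why precision, recall, and the false positive rate are the numbers worth alerting on.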
Deep Dive 4: How do we handle bank integrations and ACH transfers?
Integrating with the banking system is complex. We leverage Plaid to simplify this integration and reduce our compliance burden.
Plaid Integration Architecture:
Plaid provides a widget that handles the bank authentication flow. Users see familiar bank login screens without Venmo ever handling their banking credentials. Keeping those credentials out of our systems narrows our security and compliance scope and improves user trust.
The Bank Linking Process:
First, the backend generates a link token by calling Plaid’s API with our client credentials. This token is scoped to a specific user and has a short expiration time. The mobile client uses this token to initialize the Plaid Link widget.
The user selects their bank and enters their login credentials directly into Plaid’s interface. Plaid authenticates with the bank using OAuth or screen scraping, depending on the bank’s capabilities.
Upon successful authentication, Plaid returns a public token and metadata about the linked accounts. The mobile client immediately sends this public token to our backend.
The backend exchanges the public token for a permanent access token via Plaid’s API. This access token allows us to initiate ACH transfers and query account information. We encrypt the access token using AWS KMS and store it in our database.
Instant Account Verification:
Traditional ACH verification requires sending micro-deposits (two small amounts like 0.17 and 0.23 dollars) and having the user verify the amounts. This takes 2-3 days.
Plaid’s Auth API provides instant verification for supported banks. It directly queries the bank to confirm the account is valid and the user has access. This enables immediate use of the linked bank account for funding payments.
ACH Transfer Processing:
When a user funds a payment from their bank account, we initiate an ACH pull transaction. The system creates a transfer record in our database and calls Plaid’s transfer creation API with the access token (decrypted only for the duration of the call), amount, and transfer type (debit).
ACH transfers are batched and settled through the ACH network, which operates on business days only. Standard ACH takes 1-3 business days to settle. During this time, the transfer status is PENDING.
Plaid sends webhook notifications as the transfer progresses through various states: pending, posted, settled, failed, or returned. Our webhook handler updates the transfer status in the database and takes appropriate action.
Webhook Handling:
When we receive a webhook indicating a transfer settled successfully, we credit the user’s Venmo balance by creating a ledger entry and updating their balance. When a transfer fails or is returned (insufficient funds, closed account, etc.), we reverse any pending transactions that depended on this funding source and notify the user.
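Webhook payloads are verified before processing to prevent spoofing: Plaid signs each webhook with a JWT carried in the Plaid-Verification header, which we validate against the verification key Plaid provides. We implement idempotency in webhook processing since Plaid may retry webhooks.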
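Idempotent webhook handling can be sketched as "record the event ID, apply the change once". The event fields (`event_id`, `transfer_id`, `status`) are hypothetical and do not reflect Plaid's actual payload schema, and the in-memory set stands in for a durable table with a unique constraint. Signature verification is assumed to have happened before this function is called.

```python
processed_event_ids = set()  # stand-in for a durable store of handled events
transfer_status = {}         # transfer_id -> latest status


def handle_webhook(event: dict) -> bool:
    """Apply a transfer status event once; return False for a redelivery."""
    if event["event_id"] in processed_event_ids:
        return False  # already handled, safe to acknowledge and ignore
    transfer_status[event["transfer_id"]] = event["status"]
    processed_event_ids.add(event["event_id"])
    return True


event = {"event_id": "evt-1", "transfer_id": "tr-9", "status": "settled"}
first_delivery = handle_webhook(event)
redelivery = handle_webhook(event)
```

In production the status update and the event-ID insert would share one database transaction, so a crash between the two cannot cause a balance to be credited twice.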
Instant Transfers:
For users who need immediate access to funds, we offer instant transfers for a 1.75% fee. These use a partner bank’s real-time payment rails rather than ACH. The partner bank advances the funds immediately and settles with the user’s bank later.
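The fee itself is simple arithmetic, but it should be computed in decimal rather than binary floating point. The half-up rounding to the cent is an assumption; the text specifies only the 1.75% rate.

```python
from decimal import Decimal, ROUND_HALF_UP

INSTANT_FEE_RATE = Decimal("0.0175")  # 1.75%


def instant_transfer_fee(amount: Decimal) -> Decimal:
    """Fee for an instant transfer, rounded half-up to the cent (assumed rounding)."""
    return (amount * INSTANT_FEE_RATE).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


fee = instant_transfer_fee(Decimal("250.00"))  # 4.375 rounds to 4.38
net_payout = Decimal("250.00") - fee
```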
Deep Dive 5: How do we implement social feed privacy controls?
The social feed is a unique feature that differentiates Venmo from traditional payment apps. However, privacy is critical - users must have full control over who sees their transactions.
Privacy Model:
Each transaction has a privacy field with three possible values: public, friends, or private. Users also have a show_amount flag to hide the dollar amount while still showing the transaction note and participants.
Friendship Graph:
The friendships table stores bidirectional friend relationships. When users become friends, we insert two rows to enable efficient queries from either direction. The relationship has a status field to support pending, accepted, and blocked states.
Feed Generation Algorithm:
When a user requests their feed, the system must determine which transactions they’re authorized to see. This requires joining transactions with the friendship graph and applying privacy rules.
First, retrieve all of the requesting user’s accepted friendships. Then query transactions where one of the following conditions is true:
- The privacy is set to public (visible to everyone).
- The privacy is friends AND the requesting user is friends with either the sender or recipient (or is the sender or recipient themselves).
- The privacy is private AND the requesting user is either the sender or recipient.
The results are ordered by creation timestamp descending to show newest transactions first. If show_amount is false for a transaction, the amount field is replaced with null in the response.
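The visibility rules reduce to a short predicate. This sketch evaluates them in application code over dictionaries; the field names are hypothetical, and a real implementation would push the same conditions into the query.

```python
def can_view(txn: dict, viewer_id: str, viewer_friends: set) -> bool:
    """Apply the public / friends / private rules for one viewer."""
    participants = {txn["sender_id"], txn["recipient_id"]}
    if viewer_id in participants:
        return True  # senders and recipients always see their own transactions
    if txn["privacy"] == "public":
        return True
    if txn["privacy"] == "friends":
        return bool(participants & viewer_friends)
    return False  # private, and the viewer is not a participant


def to_feed_item(txn: dict) -> dict:
    """Copy the transaction, masking the amount when show_amount is false."""
    item = dict(txn)
    if not txn.get("show_amount", True):
        item["amount"] = None
    return item


txn = {"sender_id": "alice", "recipient_id": "bob", "privacy": "friends",
       "amount": 25.0, "note": "pizza", "show_amount": False}
visible_to_carol = can_view(txn, "carol", viewer_friends={"bob"})
visible_to_dave = can_view(txn, "dave", viewer_friends={"erin"})
item = to_feed_item(txn)
```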
Feed Caching Strategy:
Generating feeds requires complex joins and can be expensive. To reduce database load, we cache feeds in Redis with a 5-minute TTL.
The cache key includes the user ID, limit, and offset to handle pagination. When a user requests their feed, we first check Redis. On a cache hit, we return the cached data immediately. On a miss, we query the database, cache the result, and return it.
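The cache-aside read path can be sketched with a small in-memory TTL cache standing in for Redis. The class, the key format, and the `load_from_db` callback are hypothetical; the clock is injectable so expiry can be exercised without waiting.

```python
import time


class TTLCache:
    """Tiny in-memory stand-in for Redis with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None or entry[0] <= self._clock():
            self._entries.pop(key, None)  # drop expired entries lazily
            return None
        return entry[1]

    def set(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)


def feed_cache_key(user_id, limit, offset):
    return f"feed:{user_id}:{limit}:{offset}"


def get_feed(user_id, limit, offset, cache, load_from_db):
    key = feed_cache_key(user_id, limit, offset)
    cached = cache.get(key)
    if cached is not None:
        return cached  # cache hit
    feed = load_from_db(user_id, limit, offset)  # cache miss: run the expensive query
    cache.set(key, feed)
    return feed


now = [0.0]  # fake clock, advanced manually
cache = TTLCache(ttl_seconds=300, clock=lambda: now[0])
db_calls = []

def load_from_db(user_id, limit, offset):
    db_calls.append((user_id, limit, offset))
    return [{"txn_id": "t1"}]

get_feed("u1", 50, 0, cache, load_from_db)  # miss, hits the database
get_feed("u1", 50, 0, cache, load_from_db)  # hit, served from cache
now[0] += 301                               # move past the 5-minute TTL
get_feed("u1", 50, 0, cache, load_from_db)  # expired, hits the database again
```

Because each page has its own key, invalidating a user's feed means deleting every key with that user's prefix, not just the first page.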
Cache Invalidation:
When a new transaction completes with public or friends privacy, we must invalidate relevant caches. We invalidate the sender’s and recipient’s feed caches since they’ll see the new transaction. For friends-only or public transactions, we also invalidate all friends’ feed caches since they may see the transaction in their feeds.
This invalidation is handled asynchronously by consuming events from the message queue. The Social Feed Service subscribes to payment completed events and performs cache invalidation in the background.
Performance Optimization:
For high-volume users with many friends, cache invalidation can be expensive. We implement several optimizations:
- Batch invalidations when multiple transactions occur in quick succession.
- Use probabilistic cache invalidation (only invalidate with some probability) for users with large friend networks, relying on the 5-minute TTL to bound staleness.
- Implement a bloom filter to quickly check if a user might see a transaction before doing expensive graph queries.
- Pre-generate feeds for highly active users during off-peak hours.
Deep Dive 6: How do we handle payment splitting among multiple users?
Payment splitting allows users to divide a bill among friends. This requires coordinating multiple payments and tracking who has paid their share.
Split Payment Data Model:
The split payments table stores the overall split request including the initiator, total amount, description note, and completion status. The split payment participants table stores each person’s share with their user ID, amount owed, amount paid (starts at zero), payment status, and the resulting transaction ID once they pay.
Split Algorithms:
For equal splits, we divide the total amount by the number of participants. Due to rounding, we may have remainders. The last participant in the list receives the adjusted amount to ensure the total matches exactly.
For example, splitting 100 dollars among 3 people yields 33.33, 33.33, and 33.34 dollars. The total remains exactly 100 dollars.
For custom splits, each participant is assigned a specific amount by the initiator. The system validates that the sum of individual amounts matches the declared total amount.
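Both split modes are easiest to get right in integer cents. In this sketch the function names are hypothetical; the remainder goes to the last participant, as described above.

```python
def split_equally(total_cents: int, participant_count: int) -> list:
    """Divide a total evenly; the last participant absorbs the remainder."""
    if participant_count <= 0:
        raise ValueError("need at least one participant")
    share, remainder = divmod(total_cents, participant_count)
    shares = [share] * participant_count
    shares[-1] += remainder
    return shares


def validate_custom_split(total_cents: int, shares: list) -> bool:
    """A custom split is valid when every share is positive and they sum to the total."""
    return all(s > 0 for s in shares) and sum(shares) == total_cents


equal_shares = split_equally(10_000, 3)  # 100 dollars among 3 people
```

For 100 dollars among three people this yields 3333, 3333, and 3334 cents, matching the example above. Working in cents avoids floating-point drift, so the shares always sum to the exact total.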
Split Payment Flow:
When a user initiates a split payment, the system creates a split payment record and participant records for each person included. It then sends payment requests to all participants via push notifications.
Each participant receives a notification with the amount they owe and can choose to pay immediately or later. When they pay, the system creates a regular payment transaction from them to the initiator.
The split payment participant record is updated with the transaction ID and status changed to PAID. After each payment, the system checks if all participants have paid. If so, the overall split payment status is marked as COMPLETED and the initiator is notified.
Reminder System:
The Notification Service tracks unpaid split payments and sends reminder notifications:
- First reminder: 24 hours after the initial request.
- Second reminder: 3 days after the initial request.
- Final reminder: 7 days after the initial request.
Users can configure whether they want to receive reminders for split payments. Initiators can also manually send a reminder to a specific participant.
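The reminder schedule can be expressed as a small lookup, sketched below. The stage names, the function signature, and the `already_sent` bookkeeping are assumptions. The Notification Service could track sent reminders in other ways.

```python
from datetime import datetime, timedelta
from typing import List, Set

# Offsets from the initial request, taken from the schedule above.
REMINDER_OFFSETS = [
    ("first", timedelta(hours=24)),
    ("second", timedelta(days=3)),
    ("final", timedelta(days=7)),
]


def due_reminders(
    requested_at: datetime,
    now: datetime,
    already_sent: Set[str],
    reminders_enabled: bool = True,
) -> List[str]:
    """Return the reminder stages that are due and have not been sent yet."""
    if not reminders_enabled:
        return []
    return [
        name
        for name, offset in REMINDER_OFFSETS
        if now >= requested_at + offset and name not in already_sent
    ]
```

If several stages are overdue at once, for example after an outage, the caller may prefer to send only the last one in the returned list rather than all of them.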
Handling Partial Payments:
If a participant doesn’t have enough Venmo balance to pay their full share, they can make a partial payment. The system updates the amount_paid field and keeps their status as PENDING. They can make additional payments until the full amount is paid.
The split payment only completes when all participants have paid their full amounts. The initiator can see a real-time view of who has paid and how much is still outstanding.
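The payment and completion checks described above can be sketched as one function. The `Share` record mirrors the participant fields from the data model. The function name, the use of integer cents, and the rejection of overpayments are assumptions. Since a participant may pay in several installments, this sketch keeps only the most recent transaction ID on the record.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Share:
    user_id: str
    amount_owed: int  # cents
    amount_paid: int = 0  # cents, starts at zero
    status: str = "PENDING"
    transaction_id: Optional[str] = None


def apply_payment(
    shares: List[Share], user_id: str, amount_cents: int, transaction_id: str
) -> str:
    """Record a full or partial payment and return the overall split status."""
    share = next((s for s in shares if s.user_id == user_id), None)
    if share is None:
        raise KeyError(user_id)

    remaining = share.amount_owed - share.amount_paid
    if amount_cents <= 0 or amount_cents > remaining:
        raise ValueError("payment must be positive and at most the outstanding amount")

    share.amount_paid += amount_cents
    share.transaction_id = transaction_id  # most recent payment only
    if share.amount_paid == share.amount_owed:
        share.status = "PAID"

    # The split completes only once every participant has paid in full.
    return "COMPLETED" if all(s.status == "PAID" for s in shares) else "PENDING"
```

In a real service, this update would run inside the same database transaction as the underlying payment so that the participant record and the ledger cannot diverge.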
Step 4: Wrap Up
In this design, we proposed a comprehensive system architecture for a peer-to-peer payment platform like Venmo. If there is extra time at the end of the interview, here are additional points to discuss:
Key Design Decisions Summary
Strong Consistency with PostgreSQL: We chose PostgreSQL with ACID transactions over eventual consistency because financial correctness is non-negotiable. A user’s balance must always reflect the exact sum of all ledger entries. We use two-phase commit for distributed transactions and row-level locking to prevent race conditions.
Double-Entry Ledger: Every transaction creates two ledger entries (debit plus credit), providing an immutable audit trail and enabling balance reconciliation. This is standard practice in financial systems and allows us to detect and correct discrepancies.
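A minimal in-memory sketch of the double-entry idea follows. The names and the sign convention (negative for a debit, positive for a credit) are assumptions chosen for illustration. A production ledger would live in PostgreSQL as append-only rows.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LedgerEntry:
    transaction_id: str
    account_id: str
    amount_cents: int  # negative = debit, positive = credit (assumed convention)


def record_transfer(
    ledger: List[LedgerEntry],
    transaction_id: str,
    sender: str,
    recipient: str,
    amount_cents: int,
) -> None:
    """Append one debit and one credit entry for a single transfer."""
    if amount_cents <= 0:
        raise ValueError("transfer amount must be positive")
    ledger.append(LedgerEntry(transaction_id, sender, -amount_cents))
    ledger.append(LedgerEntry(transaction_id, recipient, amount_cents))


def balances(ledger: List[LedgerEntry]) -> Dict[str, int]:
    """Recompute every account balance as the sum of its ledger entries."""
    totals: Dict[str, int] = {}
    for entry in ledger:
        totals[entry.account_id] = totals.get(entry.account_id, 0) + entry.amount_cents
    return totals


def is_balanced(ledger: List[LedgerEntry]) -> bool:
    """Debits and credits must cancel out across the whole ledger."""
    return sum(entry.amount_cents for entry in ledger) == 0
```

Reconciliation then amounts to comparing stored balances with `balances(ledger)` and checking that `is_balanced(ledger)` holds.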
Idempotency Guarantees: Client-generated idempotency keys prevent duplicate charges from network retries. We cache results in Redis for 24 hours and persist in the database for long-term deduplication.
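The core of the idempotency check can be sketched with an in-memory dictionary standing in for Redis and the database. The class and method names are hypothetical, and this version is not safe under concurrent requests. A real implementation would reserve the key atomically first, for example with Redis `SET` using the `NX` and `EX` options.

```python
from typing import Callable, Dict


class IdempotentExecutor:
    """Run an operation at most once per idempotency key (single-process sketch)."""

    def __init__(self) -> None:
        # Stand-in for the Redis cache plus the persistent database record.
        self._results: Dict[str, dict] = {}

    def execute(self, idempotency_key: str, operation: Callable[[], dict]) -> dict:
        # A retry with a known key returns the stored result without re-charging.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = operation()
        self._results[idempotency_key] = result
        return result
```

A client that retries after a network timeout resends the same key and receives the original result, while a new payment uses a fresh key.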
Plaid for Bank Integration: Rather than building ACH integration from scratch, we leverage Plaid’s infrastructure for bank authentication, account verification, and ACH transfers. This reduces compliance burden and time-to-market.
Machine Learning Fraud Detection: We score transactions in real time using XGBoost models trained on historical fraud patterns. We extract 30+ features including velocity, device fingerprinting, and behavioral signals, targeting greater than 99% accuracy with a false positive rate below 0.1%.
Event-Driven Architecture: Kafka enables asynchronous processing of non-critical tasks like notifications, analytics, and feed updates without blocking the payment flow. This improves throughput and allows independent scaling of services.
Multi-Layer Caching: Redis caches hot data including balances, sessions, and idempotency tracking. Cassandra stores social feed data for fast read access. This reduces database load and achieves sub-100ms response times.
Scaling Considerations
Database Sharding: Shard PostgreSQL by user ID using consistent hashing. Each shard handles 10 million users, allowing horizontal scaling to 100M+ users with 10+ shards. Transactions between users on different shards require distributed transactions coordinated by a transaction manager.
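As an illustration of routing by user ID, the sketch below implements a basic consistent hash ring with virtual nodes. The shard names, the number of virtual nodes, and the use of SHA-256 are assumptions. The design above does not prescribe a specific hash function or ring layout.

```python
import bisect
import hashlib
from typing import List, Tuple


class ConsistentHashRing:
    """Map user IDs to shards so that adding a shard moves only some keys."""

    def __init__(self, shards: List[str], vnodes: int = 100) -> None:
        if not shards:
            raise ValueError("at least one shard is required")
        # Each shard is placed at several points (virtual nodes) on the ring.
        self._ring: List[Tuple[int, str]] = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def shard_for(self, user_id: str) -> str:
        # Walk clockwise to the first virtual node after the key's position.
        idx = bisect.bisect_right(self._points, self._hash(user_id))
        return self._ring[idx % len(self._ring)][1]
```

When a shard is added, only the users whose nearest virtual node now belongs to the new shard are reassigned. Everyone else stays on their current shard, which keeps data migration bounded.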
Read Replicas: Provision 5+ read replicas per shard for analytics queries and reporting, keeping primary databases free for write transactions. Use connection pooling to efficiently manage database connections across the fleet.
Geographic Distribution: Deploy in multiple AWS regions (us-east-1, us-west-2, eu-west-1) with data residency compliance. Use Route 53 geo-routing to direct users to the nearest region. Maintain master databases in the primary region with asynchronous replication to other regions.
Auto-Scaling: Scale the payment service horizontally based on CPU and queue depth. Target 70% CPU utilization during peak hours with auto-scaling policies. Use Kubernetes for container orchestration and automatic pod scaling.
Message Queue Scaling: Use Kafka with partitioned topics for parallel processing. Partition by user ID to maintain ordering guarantees for transactions from the same user. Add consumer instances to scale processing throughput.
Monitoring and Observability
Key Metrics: Track payment success rate with a target above 99.9%, p50/p95/p99 latency for payment API, fraud detection accuracy and false positive rate, balance reconciliation discrepancies with a target of zero, and ACH settlement success rate.
Alerting: Page on-call if payment success rate drops below 99.5%. Alert if fraud model accuracy degrades below 98%. Critical alert for balance reconciliation failures. Monitor for unusual transaction patterns that might indicate DDoS attacks or coordinated fraud.
Logging: Use centralized logging with the ELK stack (Elasticsearch, Logstash, Kibana). Log every transaction state change for audit trail. Implement structured logging with trace IDs for distributed tracing across microservices. Store compliance logs for 7 years in S3 Glacier.
Distributed Tracing: Implement OpenTelemetry for end-to-end tracing across all services. Track request flow from client through API gateway, payment service, fraud detection, and database. Identify bottlenecks and optimize slow paths.
Security Hardening
Data Encryption: Use TLS 1.3 for all API communication. Implement field-level encryption for bank account tokens, SSNs, and other PII. Use AWS KMS for key management with annual key rotation. Encrypt database backups in S3 with versioning enabled.
Access Control: Use IAM roles that follow the least-privilege principle. Require MFA for production access. Maintain audit logs for all database and AWS console access. Conduct regular penetration testing and vulnerability scanning.
Compliance: Maintain PCI-DSS Level 1 compliance (never store CVV, tokenize card numbers). Achieve SOC 2 Type II certification. Ensure GDPR compliance for European users including right to erasure and data portability. Conduct regular compliance audits by third-party firms.
Secret Management: Store all API keys and tokens in AWS Secrets Manager. Rotate secrets automatically on a schedule. Use separate credentials for each environment (dev, staging, production). Never commit secrets to version control.
Disaster Recovery
Backup Strategy: Implement continuous WAL (Write-Ahead Log) archiving to S3 for PostgreSQL. Perform daily full backups with 90-day retention. Set up cross-region replication for critical data. Test restore procedures monthly to ensure backups are valid.
Failover: Enable automated failover to a standby replica within 30 seconds (RTO: 30s). Maintain zero data loss for committed transactions (RPO: 0), which requires synchronous replication to that standby. Use multi-AZ deployment for high availability. Run weekly chaos engineering tests with random instance termination.
Business Continuity: Document runbooks for common failure scenarios. Conduct quarterly disaster recovery drills. Maintain an incident response plan with clear escalation paths. Train on-call engineers on emergency procedures.
Additional Features to Discuss
Transaction Disputes: Implement a dispute resolution system where users can flag unauthorized transactions. Provide a workflow for support teams to investigate disputes, communicate with both parties, and issue refunds or reversals when appropriate. Maintain detailed audit logs for all dispute actions.
Merchant Payments: Extend the platform to support payments to businesses, not just individuals. Implement merchant accounts with higher transaction limits, business verification (EIN instead of SSN), and integration with point-of-sale systems via QR codes.
Recurring Payments: Allow users to set up automatic recurring payments for subscriptions or regular bills. Implement a scheduler that triggers payments on specified dates, handles failures gracefully, and notifies users of upcoming charges.
Request Money: Add functionality for users to request money from others with an optional note. Implement expiration for requests, reminder notifications, and the ability to decline or negotiate the requested amount.
QR Code Payments: Generate unique QR codes for each user that encode their user ID. Other users can scan the code to initiate a payment without typing the username, useful for in-person payments.
International Payments: Extend to support multiple currencies and cross-border transfers. Integrate with foreign exchange providers, handle currency conversion fees, and comply with international money transmission regulations.
This architecture provides a production-ready foundation for a Venmo-scale peer-to-peer payment platform, handling millions of users and billions in transaction volume with bank-level reliability and security.