Roadmap / System Design
A comprehensive system design roadmap covering scalability fundamentals, load balancing, caching strategies, database design, message queues, CDNs, API design, microservices, consistency models, reliability patterns, and real-world case studies — essential for senior engineering roles and technical interviews.
Step 1 • Foundations
System design builds on fundamentals. Networking: TCP vs UDP (TCP is reliable and ordered — HTTP runs on it; UDP is faster, less reliable — video streaming, DNS), HTTP/1.1 vs HTTP/2 (multiplexed streams, header compression) vs HTTP/3 (QUIC, UDP-based), TLS handshake, DNS resolution (recursive resolver → root → TLD → authoritative). OS: how processes and threads work, context switching cost, memory hierarchy (L1/L2 cache → RAM → disk — order of magnitude differences). Distributed systems fundamentals: a network call can always fail, clocks on different machines are not synchronized (NTP helps but doesn't guarantee), and you cannot have perfect consistency + availability simultaneously (CAP theorem preview).
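The memory-hierarchy gaps above are worth internalizing as numbers. This sketch encodes rough order-of-magnitude latencies adapted from the widely circulated "Latency Numbers Every Programmer Should Know" figures; treat them as approximations for reasoning, not measurements.

```python
# Rough order-of-magnitude latencies in nanoseconds. These are the
# well-known approximate figures, useful only for relative comparisons.
LATENCY_NS = {
    "L1 cache reference": 1,
    "L2 cache reference": 4,
    "main memory reference": 100,
    "SSD random read": 16_000,
    "round trip within same datacenter": 500_000,
    "HDD seek": 2_000_000,
    "round trip across the Atlantic": 150_000_000,
}

def slowdown(fast: str, slow: str) -> float:
    """How many times slower one access is than another."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]
```

A network hop inside a datacenter is thousands of times slower than a RAM read, which is why "a network call can always fail" pairs with "a network call is always expensive".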
Step 2 • Scalability
Scaling is the ability to handle increased load. Vertical scaling (scale up) — add more CPU/RAM to a single machine (simple, no code changes, but has a ceiling and creates a single point of failure). Horizontal scaling (scale out) — add more machines (scales indefinitely, requires stateless application design). A stateless application stores no user session data on the server — sessions go in Redis, uploaded files go to S3, not the local disk. The key insight: to scale horizontally, every instance must be able to handle any request. This drives most architectural decisions — why we use external session stores, shared databases, and object storage.
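The "any instance can handle any request" property can be sketched in a few lines. Here a plain dict stands in for Redis as the external session store; the class and method names are illustrative, not a real framework.

```python
# Sketch: two stateless app instances sharing an external session store.
# A dict stands in for Redis; a real deployment would use a Redis client
# with the same get/set interface.
class SessionStore:
    def __init__(self):
        self._data = {}
    def set(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

class AppInstance:
    """Holds no session state itself, so any instance can serve any request."""
    def __init__(self, store: SessionStore):
        self.store = store
    def login(self, session_id, user):
        self.store.set(session_id, user)
    def whoami(self, session_id):
        return self.store.get(session_id)

store = SessionStore()
a, b = AppInstance(store), AppInstance(store)
a.login("s1", "alice")   # login request lands on instance A
user = b.whoami("s1")    # next request lands on instance B: still works
```

If the session lived in instance A's memory instead, the second request would fail, and the load balancer would need sticky sessions, which complicates scaling out.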
Step 3 • Scalability
A load balancer distributes traffic across servers. Layer 4 (transport layer — routes based on IP and port, fast but no request awareness) vs Layer 7 (application layer — can route based on URL, headers, cookies — enables path-based and host-based routing). Algorithms: round-robin (equal distribution), weighted round-robin (route more to powerful servers), least connections (route to server with fewest active connections — best for variable-length requests), IP hash (same client always goes to same server — useful for stateful apps, but complicates scaling). Health checks remove unhealthy instances automatically. Global Server Load Balancing (GSLB) routes users to the nearest data center.
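Three of the algorithms above fit in a few lines each. This is a minimal sketch with hypothetical server names and connection counts; real load balancers also track health and weights.

```python
import itertools
import zlib

servers = ["s1", "s2", "s3"]

# Round-robin: cycle through servers in order.
_rr = itertools.cycle(servers)
def round_robin():
    return next(_rr)

# Least connections: pick the server with the fewest active connections.
active = {"s1": 5, "s2": 2, "s3": 7}
def least_connections():
    return min(active, key=active.get)

# IP hash: the same client IP always maps to the same server.
# crc32 is stable across processes, unlike Python's built-in hash().
def ip_hash(client_ip: str):
    return servers[zlib.crc32(client_ip.encode()) % len(servers)]
```

Note the IP-hash trade-off from the text: adding or removing a server changes `len(servers)` and remaps most clients, which is why stateful routing complicates scaling.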
Step 4 • Performance
Caching is placing data in a faster storage layer to reduce latency. Cache locations: client-side (browser cache — HTTP Cache-Control headers, ETags, Last-Modified), CDN edge cache, application cache (Redis/Memcached), database query cache, CPU caches. Strategies: cache-aside (app checks cache, then DB on miss), read-through (cache handles DB read), write-through (write to cache + DB together), write-back (cache first, DB later — risk of data loss). Cache eviction: LRU (least recently used), LFU (least frequently used), FIFO. Cache consistency: write-invalidate vs write-update. Cache stampede — multiple simultaneous misses on the same key hitting DB. Cache warming vs lazy loading.
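Cache-aside plus LRU eviction, the most common pairing above, can be sketched together. The `db` dict and its keys are illustrative stand-ins for a real database.

```python
from collections import OrderedDict

db = {"user:1": "alice", "user:2": "bob"}   # stand-in for the real database

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()
    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as recently used
        return self._data[key]
    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict least recently used

cache = LRUCache(capacity=2)

def get_user(key):
    value = cache.get(key)                   # 1. check the cache
    if value is None:
        value = db[key]                      # 2. on a miss, read the DB
        cache.put(key, value)                # 3. populate the cache
    return value
```

The application owns the miss path here; in read-through, step 2 and 3 move inside the cache layer itself.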
Step 5 • Storage
Database choices drive system design. SQL vs NoSQL — SQL for relational, transactional data (strong consistency, joins, ACID); NoSQL for scale, flexible schema, or specific access patterns (document, key-value, wide-column, graph). Indexing — B-tree indexes for range queries and equality, hash indexes for equality only, covering indexes (include all query columns — no table lookup). Replication — primary-replica for read scaling and high availability (async replication lag means eventual consistency on reads). Sharding (horizontal partitioning) — split data across multiple databases by a shard key (user ID, geography). Choose the wrong shard key and you get hotspots. ACID vs BASE trade-offs.
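Shard routing by key is short enough to sketch. The shard count and key format are illustrative; real systems often use consistent hashing so resharding moves fewer keys.

```python
import zlib

NUM_SHARDS = 4

def shard_for(shard_key: str) -> int:
    # crc32 gives a stable hash across runs, unlike Python's built-in hash().
    return zlib.crc32(shard_key.encode()) % NUM_SHARDS

# All of one user's rows land on one shard, so single-user queries touch
# one database. The flip side is the hotspot problem from the text: one
# very hot key (a celebrity user, one popular region) overloads one shard.
shard = shard_for("user:42")
```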
Step 6 • Async
Synchronous communication (API call waits for response) couples services — if the downstream service is slow or down, the caller blocks. Message queues decouple producers from consumers. At-least-once delivery vs exactly-once (harder, expensive). Queue vs Pub/Sub (topic): queues deliver each message to one consumer (work queues), topics deliver to all subscribers (event broadcast). Kafka is a distributed commit log — messages are persisted, consumers track their own offset, messages can be replayed (vs RabbitMQ where messages are deleted after consumption). Use queues for: email sending, image processing, payment processing, analytics events. Dead letter queues for failed messages.
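At-least-once redelivery and the dead letter queue combine naturally in one sketch. The handler, message values, and retry limit are illustrative, and a real broker tracks attempts per delivery rather than in a dict.

```python
from collections import deque

def consume(queue: deque, handler, max_attempts: int = 3):
    """Drain a work queue; failed messages are redelivered (at-least-once)
    until max_attempts, then moved to the dead letter queue."""
    dlq = []
    attempts = {}
    while queue:
        msg = queue.popleft()
        try:
            handler(msg)
        except Exception:
            attempts[msg] = attempts.get(msg, 0) + 1
            if attempts[msg] >= max_attempts:
                dlq.append(msg)    # give up: park it for human inspection
            else:
                queue.append(msg)  # redeliver: at-least-once semantics
    return dlq

def flaky_handler(msg):
    if msg == "bad":
        raise ValueError("cannot process")

dead = consume(deque(["a", "bad", "b"]), flaky_handler)
```

Note what at-least-once implies for the consumer: "a" and "b" could also have been delivered twice after a crash, so handlers should be idempotent.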
Step 7 • Performance
A Content Delivery Network (CDN) caches content at edge servers distributed globally — requests are served from the nearest edge node, not your origin server. Push CDN — you upload content to CDN in advance (large files you know the URL for). Pull CDN — CDN fetches from origin on first request, caches for subsequent requests (most common). CDN cache hit ratio depends on TTL and content uniqueness. CDNs also help with DDoS absorption (traffic is spread across edge nodes), TLS termination close to the user (faster handshake), and HTTP/3 support. CDN for static assets is easy — CDN for dynamic API responses (Cache-Control: no-store vs max-age with Vary) requires care.
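The pull-CDN behavior above is essentially a TTL cache in front of the origin. This sketch passes `now` in explicitly so the example is deterministic; a real edge node uses the wall clock and honors the origin's Cache-Control headers.

```python
class EdgeCache:
    """Pull CDN sketch: fetch from origin on first request, serve from
    cache until the TTL expires."""
    def __init__(self, origin_fetch, ttl_seconds: int):
        self.origin_fetch = origin_fetch
        self.ttl = ttl_seconds
        self._cache = {}          # url -> (body, fetched_at)
        self.origin_hits = 0

    def get(self, url: str, now: float) -> str:
        entry = self._cache.get(url)
        if entry and now - entry[1] < self.ttl:
            return entry[0]                   # edge cache hit
        body = self.origin_fetch(url)         # miss or expired: pull origin
        self.origin_hits += 1
        self._cache[url] = (body, now)
        return body

edge = EdgeCache(origin_fetch=lambda url: f"content of {url}", ttl_seconds=60)
edge.get("/logo.png", now=0)    # miss: pulled from origin
edge.get("/logo.png", now=30)   # hit: served from the edge
edge.get("/logo.png", now=90)   # TTL expired: pulled again
```

The hit ratio trade-off from the text falls out directly: a longer TTL means fewer origin pulls but staler content.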
Step 8 • API Design
API design shapes developer experience and system coupling. REST — resource-oriented URLs, HTTP methods as verbs (GET=read, POST=create, PUT=replace, PATCH=update, DELETE=remove), stateless, JSON responses. REST design principles: use nouns not verbs (/posts not /getPosts), use plural nouns, version the API (/v1/), return appropriate status codes, use pagination, and use HATEOAS for discoverability. GraphQL — client specifies exact data shape (solves over-fetching and under-fetching), single endpoint, strongly typed schema, excellent for complex UIs and mobile (less data). gRPC — Protocol Buffers, binary transport, typed contracts, HTTP/2, excellent for service-to-service internal APIs where performance matters.
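The REST conventions above (plural nouns, versioned paths, methods as verbs, appropriate status codes) can be made concrete without a framework. The routes, handlers, and status codes here are an illustrative sketch, not a real HTTP server.

```python
posts = {}
next_id = 1

def create_post(body):
    global next_id
    posts[next_id] = body
    next_id += 1
    return 201, {"id": next_id - 1, **body}   # 201 Created

def get_post(post_id):
    if post_id not in posts:
        return 404, {"error": "not found"}    # 404 Not Found
    return 200, posts[post_id]

# Versioned, noun-based routes: (method, path template) -> handler.
# Note /v1/posts, not /v1/createPost: the verb is the HTTP method.
routes = {
    ("POST", "/v1/posts"): create_post,
    ("GET", "/v1/posts/{id}"): get_post,
}

status, created = routes[("POST", "/v1/posts")]({"title": "hello"})
```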
Step 9 • Architecture
Microservices decompose an application into small, independently deployable services — each owns its data, runs its own process, and communicates via APIs. Benefits: independent deployability, technology heterogeneity, fault isolation, team autonomy. Costs: distributed systems complexity, network calls instead of in-process calls (latency, failure), data consistency across services (no distributed transactions — use eventual consistency + saga pattern), operational overhead (deploy N services instead of 1). The Monolith First rule — build a monolith until you feel the pain of coupling, then extract services. Modular monolith — well-organized code that could be split later. Service mesh (Istio, Linkerd) for service-to-service TLS, routing, and observability.
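The saga pattern mentioned above is a sequence of local transactions, each paired with a compensating action that undoes it if a later step fails. This is a minimal sketch; the step names and the failing payment step are illustrative.

```python
def run_saga(steps):
    """steps: list of (action, compensation) pairs of callables."""
    done = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for comp in reversed(done):   # undo completed steps in reverse
                comp()
            return "rolled back"
        done.append(compensation)
    return "committed"

log = []
def reserve(): log.append("reserve inventory")
def release(): log.append("release inventory")
def charge():  raise RuntimeError("payment failed")
def refund():  log.append("refund payment")

result = run_saga([(reserve, release), (charge, refund)])
```

Unlike a database rollback, the compensations are ordinary business operations that run after the fact, so intermediate states are briefly visible: this is the eventual consistency the text refers to.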
Step 10 • Distributed Systems
CAP Theorem: in a distributed system, you can guarantee at most two of Consistency (every read gets the most recent write), Availability (every request gets a response), and Partition Tolerance (the system keeps working despite network failures). Since network partitions always happen, you choose between CP (consistent, may be unavailable during partition — HBase, Zookeeper) and AP (available, eventually consistent — DynamoDB, Cassandra, CouchDB). ACID (SQL databases) — Atomicity, Consistency, Isolation, Durability. BASE (NoSQL) — Basically Available, Soft state, Eventually consistent. Eventual consistency models: monotonic reads (don't see older data after newer), read-your-writes, causal consistency.
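Why async replication yields eventual consistency is easy to see in code. This sketch makes replication an explicit method call instead of a background process, so the stale-read window is visible; the classes are illustrative.

```python
class Primary:
    def __init__(self):
        self.data = {}
        self.log = []            # replication log of (key, value) writes
    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Replica:
    def __init__(self, primary):
        self.primary = primary
        self.data = {}
        self.applied = 0
    def replicate(self):         # in reality: a continuous background process
        for key, value in self.primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.primary.log)
    def read(self, key):
        return self.data.get(key)

primary = Primary()
replica = Replica(primary)
primary.write("x", 1)
stale = replica.read("x")   # read races ahead of replication: sees nothing
replica.replicate()
fresh = replica.read("x")   # eventually consistent: now sees the write
```

Read-your-writes consistency would fix the stale read by routing this client's reads to the primary (or a sufficiently caught-up replica) after its own write.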
Step 11 • Reliability
Everything fails — design for it. Key patterns: Timeouts (always set them — a hanging request ties up a thread forever), Retries with exponential backoff and jitter (avoid retry storms — randomize delay), Circuit Breaker (stop calling a failing service to give it time to recover — closed/open/half-open states), Bulkhead (isolate failures — separate thread pools per downstream service), Fallback (return cached data or a default when a service is down). SLO (Service Level Objective) and SLI (Service Level Indicator) — define acceptable error rates and latency percentiles (p99 < 200ms). Error budgets. Chaos engineering (Netflix Chaos Monkey) — deliberately inject failures to test resilience. Observability: metrics, logs, and traces.
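The circuit breaker is the most interview-relevant of these patterns, and its state machine fits in a short sketch. This version handles only closed → open (fail fast); the half-open probe after a cooldown is omitted for brevity, and the threshold is illustrative.

```python
class CircuitBreaker:
    """Trips open after N consecutive failures; while open, calls fail
    fast instead of hitting the struggling downstream service."""
    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.state = "closed"

    def call(self, fn):
        if self.state == "open":
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.state = "open"      # stop calling the service
            raise
        self.failures = 0                # any success resets the count
        return result

breaker = CircuitBreaker(failure_threshold=2)

def down():
    raise ConnectionError("service unavailable")

for _ in range(2):                       # two failures trip the breaker
    try:
        breaker.call(down)
    except ConnectionError:
        pass
```

After the breaker opens, callers get an immediate error (or a fallback) rather than a timeout, which is exactly what prevents one slow dependency from exhausting the caller's threads.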
Step 12 • Operations
You cannot manage what you cannot measure. The three pillars of observability: Metrics (quantitative — Prometheus scrapes time-series metrics; Grafana visualizes; define SLIs/SLOs/SLAs), Logs (events — structured JSON logs, ELK stack or Loki, correlation IDs to link logs across services), Distributed Tracing (trace a request across microservices — OpenTelemetry, Jaeger, Zipkin; spans and traces). Alerting: alert on symptoms (high error rate, high latency) not causes (high CPU). On-call: PagerDuty, runbooks. The four golden signals (Google SRE Book): Latency, Traffic, Errors, Saturation. Health check endpoints (/health, /ready). Logging anti-patterns: logging PII, logging too much (noise), not logging correlation IDs.
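Structured logs with correlation IDs are simple to produce. In this sketch `log_event` returns the JSON line so the example is testable; a real service would write it to stdout for the log shipper. The service names and fields are illustrative.

```python
import json

def log_event(service: str, correlation_id: str, message: str, **fields) -> str:
    """Emit one structured JSON log line. Every field is a queryable key,
    unlike free-text logs."""
    record = {"service": service, "correlation_id": correlation_id,
              "message": message, **fields}
    return json.dumps(record)

# The same correlation_id appears in every service the request touches,
# so one grep (or one Loki/ELK query) reconstructs the whole request:
line1 = log_event("api-gateway", "req-123", "request received", path="/v1/posts")
line2 = log_event("posts-service", "req-123", "query executed", duration_ms=12)
```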
Step 13 • Case Studies
The URL shortener is a classic system design interview question — it's tractable but covers key concepts. Requirements: shorten a URL to a 7-character alias, redirect to original URL on access, ~100M URLs stored, ~10B reads/day (read-heavy, 1000:1 read/write ratio). Key decisions: how to generate unique short codes (MD5 hash + collision handling vs auto-increment ID + base62 encoding), database schema (just a key-value mapping), caching (cache most popular URLs in Redis — 80% traffic from 20% of URLs), read-heavy design (many replicas), avoiding hot partitions (random shard key), and handling expiration (TTL). Analytics as a secondary concern (Kafka for click events).
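The auto-increment-plus-base62 option is the cleaner of the two code-generation approaches and is worth being able to write. A minimal sketch: with the standard 62-character alphabet, 7 characters give 62^7 ≈ 3.5 trillion codes, far beyond the 100M requirement.

```python
import string

# 0-9, a-z, A-Z: 62 characters.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode_base62(n: int) -> str:
    """Turn an auto-increment row ID into a short code."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

def decode_base62(code: str) -> int:
    """Turn a short code back into the row ID for the DB lookup."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

Unlike the hash approach, there are no collisions to handle, but the codes are sequential and guessable, and generating unique IDs across shards needs its own solution (ID ranges per shard, or a ticket server).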
Step 14 • Case Studies
The news feed is one of the hardest system design problems — it looks simple but involves massive scale and complex trade-offs. Approaches: fan-out on write (pre-compute feed for every follower on post — fast reads, expensive writes, problematic for celebrities with 50M followers) vs fan-out on read (compute feed at read time — simple writes, slow reads). Hybrid approach: fan-out on write for regular users, fan-out on read for celebrities (pull when reading). Timeline storage: Redis sorted set per user (score = timestamp, value = tweet ID), retrieve tweet data separately. Pagination with cursor (not offset — row count changes). Push vs Pull for mobile clients.
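The hybrid approach can be sketched end to end. Plain dicts and lists stand in for the Redis sorted sets here, and the user names, timestamps, and follower graph are illustrative.

```python
import heapq

timelines = {}        # follower -> [(timestamp, post_id)], filled on write
celebrity_posts = {}  # celebrity -> [(timestamp, post_id)], pulled on read
followers = {"alice": ["bob", "carol"]}          # alice's followers
following_celebrities = {"bob": ["superstar"]}   # celebrities bob follows

def post(author, timestamp, post_id, is_celebrity=False):
    if is_celebrity:
        # No fan-out: pushing to 50M follower timelines would be too slow.
        celebrity_posts.setdefault(author, []).append((timestamp, post_id))
        return
    for follower in followers.get(author, []):   # fan-out on write
        timelines.setdefault(follower, []).append((timestamp, post_id))

def read_feed(user, limit=10):
    pushed = timelines.get(user, [])             # precomputed at write time
    pulled = [p for c in following_celebrities.get(user, [])
              for p in celebrity_posts.get(c, [])]
    merged = heapq.nlargest(limit, pushed + pulled)  # newest first
    return [post_id for _, post_id in merged]

post("alice", 1, "a1")
post("superstar", 2, "s1", is_celebrity=True)
post("alice", 3, "a2")
feed = read_feed("bob")
```

Reads stay cheap for regular follows (the list is precomputed) while celebrity posts cost one extra merge at read time, which is the whole point of the hybrid.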
Step 15 • Interview Preparation
System design interviews evaluate structured thinking, not a 'correct' answer. Framework: 1) Clarify requirements (functional: what it does, non-functional: scale, latency, availability), 2) Estimate scale (DAU, read/write ratio, storage per year, bandwidth), 3) Define data model (core entities, relationships, schema sketch), 4) High-level design (a box diagram — clients, LB, servers, cache, DB), 5) Deep dive into 1-2 components the interviewer cares about, 6) Identify bottlenecks and trade-offs. Estimation anchors: 1 million requests/day ≈ 12 req/sec, 1 character ≈ 1 byte, 1 photo is ~300KB, SSD sequential read is ~1MB/ms. Practice estimating out loud — interviewers assess reasoning, not arithmetic precision.
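Two of the anchors worked through, since interviewers expect this done out loud. The photo-upload volume is an assumed input for illustration; only the powers of ten matter.

```python
# Back-of-envelope estimation: exact figures don't matter, powers of ten do.
SECONDS_PER_DAY = 86_400

# Anchor 1: requests/day -> requests/sec.
requests_per_day = 1_000_000
rps = requests_per_day / SECONDS_PER_DAY      # ≈ 11.6, round to "about 12"

# Anchor 2: storage per year for a photo service.
photos_per_day = 2_000_000                    # assumed upload volume
photo_size_bytes = 300 * 1024                 # ~300 KB per photo
storage_per_year_tb = photos_per_day * photo_size_bytes * 365 / 1e12
# ≈ 224 TB/year, i.e. "a couple hundred TB" -- enough to justify
# object storage and sharding in the design that follows.
```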