System Design Fundamentals - SystemCraft

System Design Fundamentals

Everything you need to know before tackling specific high-level design (HLD) problems. These are the building blocks every design uses.


Scalability

Vertical scaling (scale up): Add more CPU/RAM to one machine. Simple but has a ceiling.

Horizontal scaling (scale out): Add more machines. Harder (state management, coordination) but practically unlimited.

| | Vertical | Horizontal |
|---|---|---|
| Cost | Expensive hardware | Cheap commodity servers |
| Limit | Hardware ceiling | Practically unlimited |
| Complexity | Low | High (distributed state) |
| Downtime | Requires restart | Zero-downtime rolling |
| When to use | DB primary, cache | Stateless services, web tier |

Load Balancing

Distributes traffic across multiple servers so no single server gets overloaded.

Where it sits:

Client → Load Balancer → Server 1
                       → Server 2
                       → Server 3

Algorithms:

| Algorithm | How it works | Best for |
|---|---|---|
| Round Robin | Rotate through servers 1→2→3→1→… | Equal-capacity servers |
| Weighted Round Robin | More traffic to stronger servers | Mixed hardware |
| Least Connections | Send to server with fewest active requests | Long-lived connections |
| IP Hash | Same client always hits same server | Session stickiness |
| Random | Pick randomly | Simple, surprisingly effective |
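Two of the algorithms in the table can be sketched in a few lines. This is a minimal illustration, not a production balancer: the class names and the server labels `s1`..`s3` are made up for the example, and health checks and concurrency are ignored.

```python
import itertools


class RoundRobinBalancer:
    """Cycles through servers in a fixed order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Picks the server with the fewest in-flight requests."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        # min() returns the first server with the lowest count, so ties
        # go to the server listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when the request finishes.
        self.active[server] -= 1


rr = RoundRobinBalancer(["s1", "s2", "s3"])
rr_picks = [rr.pick() for _ in range(4)]  # s1, s2, s3, s1

lc = LeastConnectionsBalancer(["s1", "s2", "s3"])
first = lc.pick()    # s1 (all tied at 0)
second = lc.pick()   # s2
lc.release(first)    # s1 finishes its request
third = lc.pick()    # s1 again: it is back to 0 active requests
```

Least Connections needs the balancer to track request completion, which is why it suits long-lived connections better than plain rotation.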

L4 vs L7:


Caching

Store frequently accessed data closer to the consumer. Trades freshness for speed.

Where to cache:

Client → CDN → API Gateway Cache → Application Cache → Database
         ↑         ↑                      ↑
     static     response-level       object-level
     assets     (full response)      (query results)

Cache strategies:

| Strategy | How | Best for |
|---|---|---|
| Cache-Aside | App checks cache, misses → read DB → write cache | General purpose, most common |
| Write-Through | Write to cache + DB together | Strong consistency needs |
| Write-Behind | Write to cache, async flush to DB later | High write throughput |
| Read-Through | Cache itself fetches from DB on miss | Simpler app code |
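The Cache-Aside row can be sketched as follows. This is a toy version under stated assumptions: plain dicts stand in for both the cache and the database, the `CacheAside` class and its TTL default are invented for illustration, and updates invalidate the cached entry rather than rewriting it.

```python
import time


class CacheAside:
    def __init__(self, db, ttl_seconds=60.0, clock=time.monotonic):
        self.db = db              # dict standing in for the database
        self.ttl = ttl_seconds
        self.clock = clock
        self.cache = {}           # key -> (value, expires_at)
        self.db_reads = 0         # counts how often we fell through to the DB

    def get(self, key):
        entry = self.cache.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0]                       # cache hit
        self.db_reads += 1                        # miss: go to the database
        value = self.db[key]
        self.cache[key] = (value, self.clock() + self.ttl)
        return value

    def update(self, key, value):
        self.db[key] = value                      # write the source of truth
        self.cache.pop(key, None)                 # invalidate; next read refills


store = CacheAside({"user:1": "Ada"})
store.get("user:1")               # miss, reads the DB
store.get("user:1")               # hit, served from cache
store.update("user:1", "Grace")   # DB write plus invalidation
latest = store.get("user:1")      # miss again, returns the new value
```

The application owns all the cache logic here; in Read-Through the same miss handling would live inside the cache layer instead.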

Cache eviction policies:
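LRU (least recently used) is one widely used eviction policy: when the cache is full, drop the entry that has gone longest without being read or written. A minimal sketch using the standard library's `OrderedDict`; the `LRUCache` name and its capacity are illustrative only.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()   # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self.items:
            return default
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" is now more recent than "b"
cache.put("c", 3)    # over capacity: "b" is evicted
```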

Cache invalidation (the hard problem):

Tools: Redis, Memcached, CDN (CloudFront, Cloudflare), local in-process (Caffeine, Guava)


Database Concepts

SQL vs NoSQL

| | SQL (Postgres, MySQL) | NoSQL (DynamoDB, Cassandra, MongoDB) |
|---|---|---|
| Schema | Fixed, enforced | Flexible, schema-on-read |
| Relationships | Joins, foreign keys | Denormalized, no joins |
| Scale | Vertical (hard to shard) | Horizontal (built for it) |
| Consistency | Strong (ACID) | Tunable (eventual to strong) |
| Best for | Transactions, complex queries | High throughput, simple access patterns |

Database Replication

Primary-Replica: One primary handles writes. Replicas handle reads. Read-heavy workloads scale horizontally.

Writes → Primary DB ──replicates──→ Replica 1 (reads)
                                  → Replica 2 (reads)
                                  → Replica 3 (reads)

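Replication lag: Replicas may trail the primary, typically by milliseconds but by seconds or more under heavy load. If you write and then immediately read from a replica, you might not see your write. Solutions: read-your-writes consistency (route a user's reads to the primary for a short window after they write) or sticky sessions to the primary after a write.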
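One way to get read-your-writes behavior is to pin a user's reads to the primary for a short window after that user writes. The sketch below assumes the window (`pin_seconds`) is set longer than the expected replication lag; the `ReadRouter` class and its method names are hypothetical.

```python
import time


class ReadRouter:
    """Sends a user's reads to the primary for a short window after they write."""

    def __init__(self, pin_seconds=1.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds   # assumed to exceed replication lag
        self.clock = clock
        self.last_write = {}             # user_id -> time of most recent write

    def record_write(self, user_id):
        self.last_write[user_id] = self.clock()

    def target_for_read(self, user_id):
        wrote_at = self.last_write.get(user_id)
        if wrote_at is not None and self.clock() - wrote_at < self.pin_seconds:
            return "primary"
        return "replica"
```

Only the writer pays the cost of hitting the primary; everyone else keeps reading from replicas and may briefly see stale data.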

Database Sharding (Partitioning)

Split data across multiple databases by a shard key.

User ID 1-1M    → Shard 1
User ID 1M-2M   → Shard 2
User ID 2M-3M   → Shard 3
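The range mapping above can be written as a lookup over sorted upper bounds. This sketch assumes each upper bound is inclusive (so ID 1,000,000 lands on Shard 1), which the diagram leaves open. A hash-based router is shown next to it for comparison; both function names are illustrative.

```python
import bisect
import hashlib

# Inclusive upper bound of each range shard (an assumption about the diagram).
RANGE_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]


def range_shard(user_id):
    """Return the 1-based shard number for a user ID using range partitioning."""
    index = bisect.bisect_left(RANGE_UPPER_BOUNDS, user_id)
    if user_id < 1 or index == len(RANGE_UPPER_BOUNDS):
        raise ValueError(f"user_id {user_id} is outside every shard range")
    return index + 1


def hash_shard(user_id, num_shards=3):
    """Return the 1-based shard number using a stable hash of the key."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards + 1
```

Range sharding keeps neighboring IDs together, which helps range scans but can concentrate new users on the last shard. Hash sharding spreads keys more evenly but scatters ranges across shards.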

Shard key choice is critical:

Problems with sharding:


CAP Theorem

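In a distributed system, you can only guarantee 2 of 3:

- Consistency (C): every read sees the latest write.
- Availability (A): every request gets a non-error response.
- Partition tolerance (P): the system keeps working when the network drops messages between nodes.

In practice: Network partitions WILL happen. So you choose between:

- CP: reject or delay some requests during a partition to stay consistent.
- AP: keep serving during a partition and accept possibly stale data.

Many systems are AP with tunable consistency: you choose per-operation whether you need strong or eventual consistency.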


Consistency Models

| Model | Guarantee | Example |
|---|---|---|
| Strong | Read always sees latest write | Single-node DB, ZooKeeper |
| Eventual | Read will eventually see latest write | DynamoDB (default), Cassandra |
| Read-your-writes | YOU see your own writes immediately; others might not | Social media feeds |
| Causal | If A caused B, everyone sees A before B | Chat messages |

Message Queues

Decouple producers from consumers. Enable async processing.

Producer → Queue → Consumer
           ↑
     (buffer, retry, ordering)

Why use queues:

Delivery guarantees:
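At-least-once delivery is a common guarantee: a message stays with the queue until the consumer acknowledges it, so a crash before the ack leads to redelivery and possible duplicates. The toy queue below illustrates that behavior in memory. `InMemoryQueue` and `redeliver_unacked` are invented names, and the redelivery call stands in for a real broker's timeout.

```python
from collections import deque


class InMemoryQueue:
    """Toy queue: a message stays pending until the consumer acks it."""

    def __init__(self):
        self.ready = deque()
        self.in_flight = {}     # message_id -> body, received but not acked
        self.next_id = 0

    def send(self, body):
        self.ready.append((self.next_id, body))
        self.next_id += 1

    def receive(self):
        if not self.ready:
            return None
        message_id, body = self.ready.popleft()
        self.in_flight[message_id] = body
        return message_id, body

    def ack(self, message_id):
        self.in_flight.pop(message_id, None)

    def redeliver_unacked(self):
        # Stand-in for a broker timeout expiring on unacked messages.
        for message_id, body in sorted(self.in_flight.items()):
            self.ready.append((message_id, body))
        self.in_flight.clear()
```

Because the same message can arrive twice, consumers of an at-least-once queue should be idempotent.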

Tools: Kafka (log-based, ordered, high throughput), SQS (simple queue, managed), RabbitMQ (routing, priority)

Kafka vs SQS:

| | Kafka | SQS |
|---|---|---|
| Ordering | Per-partition guaranteed | FIFO queue or best-effort |
| Retention | Days/weeks (replay possible) | 14 days max, once consumed gone |
| Throughput | Millions/sec | Nearly unlimited (standard); thousands/sec (FIFO) |
| Consumer model | Pull (consumer controls pace) | Pull (long-polling) |
| Use case | Event streaming, log aggregation | Task queues, decoupling |

API Design

REST

GET    /users/123       → fetch user
POST   /users           → create user
PUT    /users/123       → replace user
PATCH  /users/123       → partial update
DELETE /users/123       → delete user

Key principles:

Rate Limiting

Protect services from abuse or thundering herds.

Algorithms:
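Token bucket is one widely used rate-limiting algorithm: tokens refill at a fixed rate up to a cap, and each request spends one. A minimal single-process sketch follows; the `TokenBucket` class and its parameters are illustrative, and a shared limiter would need atomic updates in a store such as Redis.

```python
import time


class TokenBucket:
    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec          # tokens added per second
        self.capacity = capacity          # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)     # start with a full bucket
        self.updated_at = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        elapsed = now - self.updated_at
        self.updated_at = now
        # Refill for the time that has passed, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

`capacity` sets how large a burst is tolerated, while `rate_per_sec` sets the sustained limit.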


CDN (Content Delivery Network)

Cache static content at edge locations close to users.

User in India → CDN edge in Mumbai (cache hit) → fast!
                  ↓ (cache miss)
              Origin server in US → slow, but CDN caches for next time

What to put on CDN: Images, CSS, JS, videos, static HTML, API responses (with TTL)

Tools: CloudFront, Cloudflare, Fastly, Akamai


Consistent Hashing

Problem: you have N cache servers. hash(key) % N works until you add or remove a server; then almost ALL keys remap.

Consistent hashing: Only about K/N keys remap when a server is added or removed (K = number of keys, N = number of servers).

How: place servers on a ring (0 to 2^32). Hash the key → walk clockwise → first server you hit owns that key. Adding a server only steals keys from its clockwise neighbor.

Used in: DynamoDB, Cassandra, load balancers. Redis Cluster uses a related scheme based on a fixed set of hash slots.
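The ring can be sketched with a sorted list and binary search. This is a minimal version under stated assumptions: one ring position per server (real systems usually add many virtual nodes per server to even out load), SHA-256 truncated to 32 bits as the hash, and no handling of position collisions. `HashRing` and `ring_position` are names invented for the example.

```python
import bisect
import hashlib


def ring_position(value):
    """Map a string to a point on the ring [0, 2**32)."""
    digest = hashlib.sha256(value.encode()).digest()
    return int.from_bytes(digest[:4], "big")


class HashRing:
    def __init__(self, servers=()):
        self._points = []     # sorted ring positions
        self._owner = {}      # ring position -> server name
        for server in servers:
            self.add(server)

    def add(self, server):
        point = ring_position(server)
        bisect.insort(self._points, point)
        self._owner[point] = server

    def remove(self, server):
        point = ring_position(server)
        self._points.remove(point)
        del self._owner[point]

    def lookup(self, key):
        # First server at or after the key's position, wrapping past the end.
        index = bisect.bisect_left(self._points, ring_position(key))
        return self._owner[self._points[index % len(self._points)]]
```

With this structure, adding a server changes the owner only for keys that now belong to the new server; every other key stays where it was.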


Idempotency

An operation is idempotent if doing it 1 time or N times produces the same result.

Why it matters: In distributed systems, retries happen. If "charge $10" is retried, you don't want to charge $20.

How to achieve:
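One common approach is an idempotency key: the client attaches a unique ID to each logical request, and the server stores the result under that ID so a retry replays the stored result instead of repeating the side effect. A minimal in-memory sketch; `PaymentService` and its fields are hypothetical, and a real service would make the check-and-store step atomic (for example with a unique constraint).

```python
class PaymentService:
    def __init__(self):
        self.total_charged = 0
        self._results = {}    # idempotency key -> stored response

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._results:
            # Retry of a request we already handled: no second charge.
            return self._results[idempotency_key]
        self.total_charged += amount
        result = {"status": "charged", "amount": amount}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
service.charge("req-1", 10)
service.charge("req-1", 10)   # retried request, same key
```

After both calls the customer has been charged $10 once, not $20.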

Examples:


Heartbeat & Health Checks

How distributed systems detect dead nodes.

Failure detection trade-off:
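The trade-off shows up in a single parameter of a timeout-based detector. A short timeout spots dead nodes quickly but flags slow or briefly partitioned nodes as dead; a long timeout avoids false alarms but reacts slowly. A minimal sketch, with `HeartbeatMonitor` and its method names invented for illustration:

```python
import time


class HeartbeatMonitor:
    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock
        self.last_seen = {}    # node -> time of last heartbeat

    def heartbeat(self, node):
        self.last_seen[node] = self.clock()

    def dead_nodes(self):
        # A node is suspected dead once it has been silent longer than the timeout.
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```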


Leader Election

When multiple nodes exist, sometimes one must be the "leader" (coordinates work, makes decisions).

Algorithms and tools: ZooKeeper (ephemeral nodes), Raft (consensus), Bully algorithm

Why needed:


Back-of-Envelope Estimation

Quick math to validate design decisions.

Key numbers to memorize:

| Operation | Time |
|---|---|
| L1 cache read | 1 ns |
| RAM read | 100 ns |
| SSD read | 100 μs |
| HDD seek | 10 ms |
| Network round-trip (same DC) | 0.5 ms |
| Network round-trip (cross-continent) | 150 ms |

Data size rules:

Traffic rules:
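A useful traffic shortcut: a day has 86,400 seconds (roughly 10^5), so requests per day divided by about 100,000 gives average QPS. The sketch below works one example with made-up inputs: 10M daily users, 20 requests each, a 3x peak factor, and 5M writes of 1 KB per day. None of these numbers come from a real system.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400


def average_qps(requests_per_day):
    return requests_per_day / SECONDS_PER_DAY


def daily_storage_gb(writes_per_day, bytes_per_write):
    return writes_per_day * bytes_per_write / 1e9


# Hypothetical workload, chosen only to illustrate the arithmetic.
daily_users = 10_000_000
requests_per_user = 20
peak_factor = 3

avg = average_qps(daily_users * requests_per_user)   # about 2,315 QPS
peak = avg * peak_factor                             # about 6,944 QPS
storage = daily_storage_gb(5_000_000, 1_000)         # 5 GB per day
```

Rounding to one significant figure ("about 2K QPS average, 7K peak, 5 GB/day") is usually enough to decide whether one machine or a fleet is needed.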


How to Approach an HLD Interview

  1. Clarify requirements (2-3 min): functional + non-functional. Ask what's in scope.
  2. Back-of-envelope (2 min): traffic, storage, bandwidth estimates.
  3. High-level design (10-15 min): boxes and arrows. Client → LB → Service → DB.
  4. Deep dive (15-20 min): interviewer picks 2-3 areas. Show depth.
  5. Wrap up (2-3 min): trade-offs, what you'd change at 10× scale.

Don't:


Next: pick a specific design problem and see these concepts applied.
