Home › Design a Key-Value Store — System Design Interview

Designing a Distributed Key-Value Store

⚡ Difficulty: Beginner–Intermediate 📋 Prerequisites: System Design Fundamentals — especially Databases and Caching ⏱️ Reading time: 12 min 🏢 Asked at: Amazon, Google, Meta, Microsoft, Databricks

TL;DR

A key-value store is the simplest database: give it a key, get back a value. Distributed versions (Redis, DynamoDB, Cassandra) split data across many machines using consistent hashing and replicate it for durability.

flowchart LR
    CLIENT["Client<br/>GET/PUT"]:::client
    ROUTER["Router<br/>consistent hashing"]:::service
    N1["Node 1"]:::data
    N2["Node 2"]:::data
    N3["Node 3"]:::data

    CLIENT --> ROUTER
    ROUTER --> N1
    ROUTER --> N2
    ROUTER --> N3

    classDef client fill:#4c3a5e,stroke:#818cf8,color:#e2e8f0
    classDef service fill:#1a3a2a,stroke:#4ade80,color:#e2e8f0
    classDef data fill:#3b3520,stroke:#fbbf24,color:#e2e8f0

In 3 sentences: Client sends a key → router hashes it to find which node owns it → that node stores/returns the value. Data is replicated to N nodes for durability. Consistent hashing ensures adding/removing nodes only moves a small fraction of keys.

Understanding the Problem

What is a key-value store? The simplest database imaginable: a giant dictionary/hashmap. You PUT(key, value) to store data and GET(key) to retrieve it. No complex queries, no joins, no schemas — just keys and values.

Real examples:

Redis — in-memory, sub-millisecond reads, used for caching and sessions
DynamoDB — AWS managed, scales to millions of requests/sec
Cassandra — wide-column store with KV access patterns
etcd — distributed config store used by Kubernetes
Memcached — in-memory caching layer

Why is this asked? It’s a foundational distributed systems problem. A single-node hashmap is trivial, but distributing it across machines teaches: partitioning, replication, consistency, failure handling — the core of distributed systems.

Prior Art We’re Drawing From

Amazon DynamoDB (Dynamo Paper) — Introduced consistent hashing with virtual nodes, vector clocks for conflict resolution, and tunable consistency (W, R, N quorum). The foundational paper for distributed KV stores. (Amazon Dynamo Paper)
Google Bigtable / LevelDB — Pioneered the LSM-tree storage engine (memtable → SSTable → compaction) that powers most modern KV stores. LevelDB is the reference implementation. (Bigtable Paper)
Apache Cassandra — Combines Dynamo’s partitioning with Bigtable’s storage model. Demonstrates tunable consistency at massive scale (Apple runs 160K+ Cassandra nodes). (Cassandra Architecture)
etcd / Raft Consensus — Shows how strong consistency can be achieved in a distributed KV store using the Raft consensus algorithm. Used by Kubernetes for cluster state. (etcd.io)

Scale Estimation (Back-of-Envelope)

Users: Millions of clients (internal services + applications)
Write QPS: 500K writes/sec (mixed PUTs and DELETEs)
Read QPS: 1M reads/sec (GETs dominate in most workloads)
Storage: 10TB total data distributed across the cluster, 3x replication = 30TB raw
Bandwidth: Sub-10ms P99 latency target, ~100 Gbps aggregate cluster throughput

Naive First Cut

A single server with an in-memory hashmap:

flowchart LR
    CLIENT["Client"]:::client
    SERVER["Single Server<br/>HashMap in memory"]:::service

    CLIENT --> SERVER

    classDef client fill:#4c3a5e,stroke:#818cf8,color:#e2e8f0
    classDef service fill:#1a3a2a,stroke:#4ade80,color:#e2e8f0

Why this breaks:

❌ Memory limit — one server has maybe 64GB RAM. Your data might be 10TB.
❌ Single point of failure — server crashes, all data gone
❌ No scalability — can’t handle more reads by adding machines
❌ Data loss — in-memory means a restart wipes everything (unless you add persistence, but then writes are slow)

We need to split data across many machines and replicate it.

The Solution: Distributed Key-Value Store

New components:

Client Library / Router — decides which node a key belongs to using consistent hashing. Could be a smart client or a separate routing layer.
Storage Nodes — each node owns a portion of the key space. Stores data on disk (for durability) with an in-memory cache for speed.
Replication — each key is stored on N nodes (typically 3). If one node dies, the data still exists on the others.
Coordinator — handles cluster membership: which nodes are alive, which owns what range. Could be a gossip protocol (Cassandra) or a centralized service (etcd/ZooKeeper).

flowchart LR
    C["Client"]:::client
    R["Router<br/>consistent hash ring"]:::service
    N1["Node 1<br/>range A-F"]:::data
    N2["Node 2<br/>range G-M"]:::data
    N3["Node 3<br/>range N-T"]:::data
    N4["Node 4<br/>range U-Z"]:::data
    COORD["Coordinator<br/>cluster membership"]:::async

    C --> R
    R --> N1
    R --> N2
    R --> N3
    R --> N4
    COORD --> N1
    COORD --> N2
    COORD --> N3
    COORD --> N4

    classDef client fill:#4c3a5e,stroke:#818cf8,color:#e2e8f0
    classDef service fill:#1a3a2a,stroke:#4ade80,color:#e2e8f0
    classDef data fill:#3b3520,stroke:#fbbf24,color:#e2e8f0
    classDef async fill:#AB47BC,stroke:#4A148C,color:#fff

💡 Consistent hashing maps both keys AND nodes onto a circular ring (0 to 2³²). A key is assigned to the first node clockwise from its position on the ring. When a node is added/removed, only its neighbors’ keys move — everything else stays put.

API Design

PUT /v1/data/{key}
Body: { value: <any>, ttl?: 3600 }
→ 200 OK

GET /v1/data/{key}
→ { value: <any>, version: 42 }

DELETE /v1/data/{key}
→ 204 No Content

That’s it. The simplicity is the point — no SQL, no query language, just CRUD by key.

Flow: Write (PUT)

sequenceDiagram
    autonumber
    participant C as Client
    participant R as Router
    participant N1 as Primary Node
    participant N2 as Replica 1
    participant N3 as Replica 2

    C->>R: PUT key=user:42 value=json
    R->>R: hash key → maps to Node 1
    R->>N1: write key=user:42
    N1->>N1: write to local store
    N1->>N2: replicate async
    N1->>N3: replicate async
    N1-->>R: ACK
    R-->>C: 200 OK

Write path: Router hashes the key, finds the primary node, writes to it. The primary replicates to N-1 other nodes. Depending on consistency settings, the client might wait for 1 ACK (fast but risky) or all N ACKs (slow but safe).

Flow: Read (GET)

sequenceDiagram
    autonumber
    participant C as Client
    participant R as Router
    participant N1 as Primary Node

    C->>R: GET key=user:42
    R->>R: hash key → maps to Node 1
    R->>N1: read key=user:42
    N1->>N1: lookup in memory cache or disk
    N1-->>R: value + version
    R-->>C: 200 value=json

For stronger consistency, the router can read from multiple replicas and return the latest version.

Deep Dives

1. Consistent Hashing — How Keys Map to Nodes

Problem: If you use node = hash(key) % num_nodes, adding or removing a node changes EVERY key’s assignment. With 1 billion keys, all of them would need to move — catastrophic.

Solution: Consistent hashing.

flowchart LR
    subgraph Ring["Hash Ring 0 to 2^32"]
        direction LR
        A["Node A"]:::data
        B["Node B"]:::data
        C["Node C"]:::data
    end

    KEY["key: user:42<br/>hash → position 7821"]:::client
    KEY --> B

    classDef client fill:#4c3a5e,stroke:#818cf8,color:#e2e8f0
    classDef data fill:#3b3520,stroke:#fbbf24,color:#e2e8f0

How it works:

Hash each node’s ID to a position on a ring (0 to 2³²)
Hash each key to a position on the same ring
Walk clockwise from the key’s position — the first node you hit owns that key

Adding a new node D: Only keys between D and its predecessor move to D. Everything else stays put. On average, only 1/N of keys move (where N = number of nodes). With 4 nodes, adding a 5th moves only ~20% of keys instead of 100%.

💡 Virtual nodes: Real systems place each physical node at multiple positions on the ring (e.g., 150 virtual nodes per server). This ensures even distribution — otherwise, some nodes might get 60% of keys by luck of the hash.

2. Replication — What Happens When a Node Dies

Problem: Nodes fail. Disks die. Networks partition. If data lives on only one node, it’s gone when that node goes down.

Solution: Store each key on N nodes (typically N=3). The primary node replicates to the next N-1 nodes clockwise on the ring.

flowchart LR
    WRITE["Write key X"]:::client
    P["Primary<br/>Node 2"]:::data
    R1["Replica<br/>Node 3"]:::data
    R2["Replica<br/>Node 4"]:::data

    WRITE --> P
    P --> R1
    P --> R2

    classDef client fill:#4c3a5e,stroke:#818cf8,color:#e2e8f0
    classDef data fill:#3b3520,stroke:#fbbf24,color:#e2e8f0

Tunable consistency (W and R values):

Setting	Meaning	Trade-off
W=1	Write succeeds after 1 node ACKs	Fast writes but risk data loss if that node dies before replicating
W=N	Write succeeds after ALL nodes ACK	Slow writes but guaranteed durability
W=2, R=2 (quorum)	Majority must agree	Good balance of speed and safety

💡 Quorum rule: As long as W + R > N, reads always see the latest write. With N=3, W=2, R=2 → at least one node in the read set has the latest value. This is how DynamoDB and Cassandra achieve “tunable consistency.”

When a node comes back: It catches up by comparing its data versions with replicas (anti-entropy repair / Merkle tree comparison).

3. Storage Engine — How Data Lives on Each Node

Each node needs to actually store data on disk efficiently. Two main approaches:

Bad: Store each key as a file on disk. Millions of tiny files = filesystem chokes.

Good: Use a log-structured merge tree (LSM tree). All writes go to an in-memory buffer (memtable). When full, flush to disk as a sorted file (SSTable). Reads check memtable first, then SSTables.

flowchart LR
    WRITE["Write"]:::client
    MEM["Memtable<br/>in-memory sorted"]:::service
    SS1["SSTable L0<br/>on disk"]:::data
    SS2["SSTable L1<br/>on disk"]:::data
    READ["Read"]:::client
    BLOOM["Bloom Filter<br/>quick miss check"]:::service

    WRITE --> MEM
    MEM -->|"flush when full"| SS1
    SS1 -->|"compaction"| SS2
    READ --> MEM
    READ --> BLOOM
    BLOOM --> SS1

    classDef client fill:#4c3a5e,stroke:#818cf8,color:#e2e8f0
    classDef service fill:#1a3a2a,stroke:#4ade80,color:#e2e8f0
    classDef data fill:#3b3520,stroke:#fbbf24,color:#e2e8f0

💡 Bloom filter is a space-efficient data structure that tells you “definitely not here” or “maybe here.” Before reading an SSTable from disk, the bloom filter tells you if the key MIGHT exist in that file — saves expensive disk reads for keys that aren’t there.

Great: Add a Write-Ahead Log (WAL). Before writing to the memtable, append to a log on disk. If the node crashes before flushing, the WAL replays to recover the memtable. Zero data loss.

Why LSM tree over B-tree? LSM trees turn random writes into sequential writes (just append to a log). Sequential disk I/O is 100x faster than random I/O. For write-heavy workloads (like a KV store that handles millions of writes/sec), LSM trees win.

Key Technologies

Term	What it is
Consistent Hashing	Maps keys and nodes onto a ring. Adding/removing a node only moves neighboring keys. Used by DynamoDB, Cassandra, Redis Cluster.
Replication Factor (N)	Number of copies of each key. N=3 means 3 nodes hold the data. Survive up to N-1 failures.
Quorum (W + R > N)	Ensures reads always see latest writes. A tunable knob between speed and consistency.
LSM Tree	Log-Structured Merge Tree. Write-optimized storage engine. Sequential writes → fast. Used by LevelDB, RocksDB, Cassandra.
Bloom Filter	Probabilistic structure to quickly check “is key X in this file?” Avoids unnecessary disk reads.
WAL (Write-Ahead Log)	Append-only disk log written before in-memory changes. Crash recovery guarantee.
Vector Clock	Tracks causality between writes. Detects conflicts when two replicas have divergent values. Used in DynamoDB.

Interview Cheat Sheet

Question	Answer
“How to partition data?”	Consistent hashing with virtual nodes for even distribution
“How to handle node failure?”	Replication factor N=3. Quorum reads/writes (W=2, R=2).
“Consistency vs availability?”	Tunable: W=1 R=1 for speed (AP), W=N R=N for strong consistency (CP). Quorum is the sweet spot.
“How to detect conflicts?”	Vector clocks or last-writer-wins (simpler but loses data)
“How data is stored on disk?”	LSM tree: memtable → flush to SSTable → compaction. WAL for crash recovery.
“How to find data fast?”	Bloom filter to skip SSTables that don’t have the key
“What about hot keys?”	Cache popular keys in front. Or split hot keys across sub-keys (shard the shard).

What’s Expected at Each Level

This section helps you calibrate your depth. You don’t need to cover everything — just know what’s expected for your level.

Mid-level

Understand the basic API (GET/PUT/DELETE by key). Propose a hash map for storage. Recognize that a single server can’t hold all data — need to split across machines. With prompting, explain consistent hashing as a way to distribute keys across nodes without reshuffling everything when a node is added or removed.

Senior

Explain consistent hashing with virtual nodes for even distribution. Discuss replication factor and quorum reads/writes (W+R>N for strong consistency). Propose LSM tree storage engine for write optimization (memtable → SSTable flush → compaction). Articulate the CAP theorem trade-off for this system — which guarantee do you sacrifice?

Staff+

Address vector clocks for conflict detection in multi-master writes, Merkle trees for anti-entropy repair (detecting divergence between replicas), hinted handoff during temporary node failures (sloppy quorum), and read repair as a protocol for healing stale reads. Discuss tunable consistency per operation (some keys need strong, others can be eventual) and the operational cost of compaction in LSM trees (write amplification, space amplification).

🎯 Key Takeaways

Consistent hashing distributes data evenly and minimizes reshuffling when nodes change
Replication (N=3) provides fault tolerance — lose a node, data survives
Quorum (W+R>N) lets you tune the consistency-vs-speed tradeoff per operation
LSM trees make writes fast by converting random I/O to sequential appends
This pattern underpins almost every distributed database — learn it once, apply everywhere

Rate Limiter — uses Redis (a KV store) for counters
Leaderboard — Redis sorted sets are a specialized KV structure
URL Shortener — simple KV mapping of short code → URL