Distributed Job Scheduler — HLD

Difficulty: Intermediate–Advanced
📋 Prerequisites: System Design Fundamentals, especially Message Queues, Leader Election, and Databases
⏱️ Reading time: 25 min


TL;DR

A distributed job scheduler lets teams register recurring or one-off tasks (like cron, but across a fleet of machines). It runs each job on time, with retries and dependency management, and aims for exactly-once in effect: at-least-once delivery plus idempotent handlers.

flowchart LR
    USER["Users<br/>define jobs"]:::client
    API["Scheduler API"]:::service
    DB[("Job Store<br/>Postgres")]:::data
    TICKER["Ticker<br/>checks what's due"]:::service
    QUEUE["Job Queue<br/>Kafka"]:::async
    WORKERS["Worker Pool<br/>executes jobs"]:::service

    USER --> API
    API --> DB
    TICKER --> DB
    TICKER --> QUEUE
    QUEUE --> WORKERS

    classDef client fill:#4c3a5e,stroke:#818cf8,color:#e2e8f0
    classDef service fill:#1a3a2a,stroke:#4ade80,color:#e2e8f0
    classDef async fill:#AB47BC,stroke:#4A148C,color:#fff
    classDef data fill:#3b3520,stroke:#fbbf24,color:#e2e8f0

In four sentences: Users register jobs with a schedule (cron expression) or a one-time fire time. A “ticker” process (called the dispatcher in the rest of this doc) scans the database for due jobs and enqueues them to Kafka. Worker pods consume from Kafka, execute the job, and report success/failure back. Leader election ensures only one ticker runs per shard.


1. Understanding the Problem

A distributed job scheduler accepts jobs (run once at time T, or recurring on a cron schedule, or a DAG of dependent steps) and executes them reliably across a fleet of workers. Callers shouldn’t worry about which machine runs the job, what happens when a worker crashes mid-execution, or whether the job ran twice. The scheduler owns timing, dispatch, retries, isolation, and observability.


1.5. Naive First Cut

The 30-second whiteboard version:

flowchart LR
    CLIENT["Client App"]:::client
    API["Scheduler API"]:::service
    DB[("Jobs Table")]:::data
    CRON["Cron Loop<br/>single process"]:::service
    WORKER["Worker"]:::service

    CLIENT --> API
    API --> DB
    CRON --> DB
    CRON --> WORKER

    classDef client fill:#fed7aa,stroke:#c2410c,color:#431407
    classDef service fill:#bbf7d0,stroke:#16a34a,color:#052e16
    classDef data fill:#fde68a,stroke:#b45309,color:#451a03

Why it collapses under real use:

The rest of the doc evolves this into a horizontally scalable, HA, exactly-once-in-effect job platform.


1.6. Glossary — what these names mean

Quick reference so nothing in this doc is a black box:

Components we build

Infrastructure pieces we use

External / reference names in Prior Art

Observability tools


1.7. Prior Art We’re Drawing From


2. Functional Requirements

Core (top 3)

  1. Schedule a job — one-time (run at timestamp T), recurring (cron expression), or delayed (run in N seconds).
  2. Execute reliably — at-least-once delivery to a worker, with retries on failure, respecting timeouts.
  3. Inspect and cancel — query the status of a scheduled or running job, cancel a future run.

Below the line (out of scope)


3. Non-Functional Requirements

Core

Below the line

Scale Estimation (Back-of-Envelope)


4. Core Entities


5. API / System Interface

POST   /v1/jobs                     -> Job
  Header: Idempotency-Key: <uuid>
GET    /v1/jobs/:id                 -> Job + latest executions
PUT    /v1/jobs/:id                 -> update schedule or payload
DELETE /v1/jobs/:id                 -> cancel (and stop future fires)

POST   /v1/jobs/:id/pause           -> pause recurring job
POST   /v1/jobs/:id/resume          -> resume

GET    /v1/jobs/:id/executions      -> paginated history
POST   /v1/executions/:id/cancel    -> cancel a specific run
GET    /v1/executions/:id           -> status + logs pointer

Example create:

POST /v1/jobs
{
  "name": "daily-invoice-gen",
  "type": "CRON",
  "schedule": "0 3 * * *",
  "timezone": "America/Los_Angeles",
  "target": { "pool": "batch-etl", "handler": "generate_invoices" },
  "payload": { "tenantId": "acme", "dateRange": "yesterday" },
  "retryPolicy": { "maxAttempts": 3, "backoff": "EXPONENTIAL", "initialDelayMs": 30000 },
  "timeoutSec": 600,
  "priority": "NORMAL"
}

Response:

{
  "jobId": "job_a93f2",
  "nextFireAt": "2026-05-05T10:00:00Z",
  "state": "ACTIVE"
}

Security notes:


6. High-Level Design

Three passes, one per core functional requirement.

6.1 FR-1: Schedule a job

New components we need:

  1. Scheduler API — the HTTP interface users call to create, query, or cancel jobs. Validates inputs and stores job definitions.
  2. Postgres (jobs + schedules) — the durable source of truth. Stores job definitions, cron expressions, and computed next_fire_time. 💡 We use Postgres because job creation needs ACID transactions — if we write the job and its schedule, both must succeed or neither does.
  3. Redis Sorted Set (upcoming index) — holds jobs due within the next hour, scored by next_fire_time. 💡 A sorted set (ZSET) lets us ask “give me everything due before NOW” in O(log N + k), where k is the number of due entries returned. The dispatcher polls this instead of scanning millions of rows in Postgres every second.
flowchart LR
    CLIENT["Client"]:::client
    API["Scheduler API"]:::edge
    DB[("Postgres<br/>jobs + schedules")]:::data
    CACHE[("Redis<br/>upcoming index")]:::data

    CLIENT --> API
    API --> DB
    API --> CACHE

    classDef client fill:#fed7aa,stroke:#c2410c,color:#431407
    classDef edge fill:#bfdbfe,stroke:#1d4ed8,color:#0c1f4a
    classDef data fill:#fde68a,stroke:#b45309,color:#451a03

Legend

| Color | Role |
| --- | --- |
| Orange | Client |
| Blue | Edge / API |
| Green | Service |
| Purple | Async / broker |
| Yellow | Data store |
| Pink | External |

Step-by-step flow:

  1. Developer calls POST /v1/jobs with a cron expression like "0 3 * * *" (run daily at 3am) → hits the Scheduler API
  2. API validates: Is the cron expression valid? Does the target worker pool exist? Is the payload within size limits?
  3. API computes next_fire_time from the cron + timezone (e.g., 3:00 AM Pacific = 10:00 UTC during daylight saving time, 11:00 UTC in winter), then writes a jobs row AND a schedules row to Postgres in one atomic transaction
  4. For jobs due within the next hour, API also adds (next_fire_time, job_id) to the Redis sorted set — this is the “hot window” that the dispatcher polls. Jobs further out stay only in Postgres until a background hydrator promotes them
  5. Returns 201 Created with the job ID and next fire time
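
Step 3 can be sketched for the simplest case, a job that fires daily at a fixed local time. This is a minimal sketch using Python's standard `zoneinfo` module; `next_daily_fire_utc` is a hypothetical helper, not part of the design above, and a real service would hand full cron syntax to a parsing library.

```python
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo


def next_daily_fire_utc(after_utc: datetime, local_time: time, tz_name: str) -> datetime:
    """Return the next UTC instant at which `local_time` occurs in `tz_name`.

    Hypothetical helper: covers only "daily at HH:MM", not general cron.
    `after_utc` must be timezone-aware.
    """
    tz = ZoneInfo(tz_name)
    local_date = after_utc.astimezone(tz).date()
    # Try today, then the following days, in the job's own zone.
    for day_offset in (0, 1, 2):
        candidate = datetime.combine(
            local_date + timedelta(days=day_offset), local_time, tzinfo=tz
        )
        # Compare as UTC instants so the zone's offset is always applied.
        candidate_utc = candidate.astimezone(timezone.utc)
        if candidate_utc > after_utc:
            return candidate_utc
    raise AssertionError("unreachable for a daily schedule")


fire = next_daily_fire_utc(
    datetime(2026, 5, 4, 12, 0, tzinfo=timezone.utc), time(3, 0), "America/Los_Angeles"
)
print(fire.isoformat())  # 2026-05-05T10:00:00+00:00
```

The same call for a January date returns 11:00 UTC, which is why the UTC value has to be recomputed from the zone rather than stored as a fixed offset.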

Why Postgres for jobs: ACID for “create + schedule” atomicity, SQL flexibility for admin queries (“show all jobs by tenant X that fired in the last 24h”), indexes on (next_fire_time, state) so the hydrator can find jobs entering the hot window without a table scan.

Why Redis ZSET for the hot window: at 1M jobs/min peak, polling Postgres for “jobs due in the next 60s” every second would hammer the index. Redis ZSET gives O(log N) inserts and O(log N + k) range queries by score (timestamp). The ZSET holds only the next hour; everything further out lives only in Postgres, promoted to Redis one hour ahead.
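
To make the access pattern concrete, here is a small in-memory stand-in for the hot-window index. It is only an illustration of "insert by score, range-read everything due"; the `HotWindow` class and its method names are assumptions, and Redis itself uses a skip list rather than Python lists.

```python
import bisect


class HotWindow:
    """In-memory stand-in for the Redis ZSET (illustration only)."""

    def __init__(self) -> None:
        self._scores: list[float] = []  # fire times, kept sorted ascending
        self._job_ids: list[str] = []   # job ids, parallel to _scores

    def add(self, fire_at: float, job_id: str) -> None:
        # Binary search finds the slot in O(log N). The list insert itself
        # is O(N) here; a skip list avoids that cost in Redis.
        idx = bisect.bisect_right(self._scores, fire_at)
        self._scores.insert(idx, fire_at)
        self._job_ids.insert(idx, job_id)

    def due(self, now: float, limit: int = 100) -> list[str]:
        # Everything with score <= now, earliest first, at most `limit`.
        end = bisect.bisect_right(self._scores, now)
        return self._job_ids[: min(end, limit)]


window = HotWindow()
window.add(300.0, "job_c")
window.add(100.0, "job_a")
window.add(200.0, "job_b")
print(window.due(now=250.0))  # ['job_a', 'job_b']
```

The dispatcher only ever reads the front of the structure, which is why the cost depends on how many jobs are due rather than how many are scheduled.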

6.2 FR-2: Execute reliably (the dispatch + retry path)

This is where most of the complexity lives.

New components we need (in addition to the ones above):

  1. Dispatcher Pool (leader-elected shards) — the heartbeat of the system. Each dispatcher continuously polls its slice of the Redis ZSET for due jobs. 💡 Leader election ensures only ONE dispatcher owns each shard — without it, two dispatchers would both fire the same job, causing duplicate execution.
  2. Kafka (per-pool topics) — decouples dispatch timing from worker availability. When the dispatcher finds a due job, it publishes to Kafka rather than directly calling a worker.
  3. Workers — the processes that actually execute your job code. Each worker pool handles a specific class of jobs (e.g., email-sender, batch-etl).
  4. Executions table (Postgres) — one row per attempt to run a job. Append-only history so you can answer “did my job run? when? how long did it take?”
  5. Worker Heartbeats (Redis) — workers write a heartbeat every 10s. If a worker crashes, its heartbeat expires and a sweeper reschedules the stuck job. 💡 This is the “dead man’s switch” — if we don’t hear from a worker, we assume it’s dead and retry the job.
  6. Retry Queue — a delayed Kafka topic where failed jobs wait with exponential backoff before being retried.
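
The backoff delay for the retry queue can be computed from the job's retry policy. The sketch below assumes a doubling factor and a one-hour cap, neither of which is fixed by this doc; `retry_delay_ms` is a hypothetical helper whose default matches the `initialDelayMs` of 30000 in the example request.

```python
def retry_delay_ms(
    attempt: int,
    initial_delay_ms: int = 30_000,
    factor: float = 2.0,          # assumption: delay doubles per attempt
    max_delay_ms: int = 3_600_000,  # assumption: cap at one hour
) -> int:
    """Delay before the retry that follows the `attempt`-th failure (1-based)."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    delay = initial_delay_ms * factor ** (attempt - 1)
    return int(min(delay, max_delay_ms))


print([retry_delay_ms(a) for a in (1, 2, 3)])  # [30000, 60000, 120000]
```

The cap keeps a job with many attempts from drifting days into the future. Production systems often add random jitter so retries from one outage do not all land at the same instant.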
flowchart LR
    CACHE[("Redis ZSET<br/>hot window")]:::data
    DISPATCH["Dispatcher pool<br/>leader-elected shards"]:::service
    KAFKA["Kafka<br/>per-pool topics"]:::async
    WORKER["Workers<br/>pool A"]:::service
    EXEC[("Postgres<br/>executions")]:::data
    HEARTBEAT[("Redis<br/>worker heartbeats")]:::data
    RETRY["Retry Queue<br/>delayed topic"]:::async

    CACHE --> DISPATCH
    DISPATCH --> KAFKA
    DISPATCH --> EXEC
    KAFKA --> WORKER
    WORKER --> EXEC
    WORKER --> HEARTBEAT
    WORKER -.failure.-> RETRY
    RETRY --> KAFKA

    classDef service fill:#bbf7d0,stroke:#16a34a,color:#052e16
    classDef async fill:#e9d5ff,stroke:#7c3aed,color:#3b0764
    classDef data fill:#fde68a,stroke:#b45309,color:#451a03

Step-by-step flow:

  1. Dispatcher shards continuously poll their slice of the Redis ZSET: “Give me all jobs with score <= now()” — runs every 100-500ms
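  2. When a shard finds due jobs, it claims each one with ZREM, which returns 1 only to the caller that actually removed the member. Redis executes commands one at a time, so only one dispatcher wins the race. ZREMRANGEBYSCORE alone is not a safe claim because it returns only a count, not the members it removed; a Lua script that wraps ZRANGEBYSCORE and ZREM is the usual way to do read-and-remove in one atomic step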
  3. For each claimed job, the dispatcher:
    • Creates an executions row in Postgres with status PENDING
    • Publishes a message to Kafka on the target pool’s topic (e.g., jobs.batch-etl)
    • For cron jobs, computes the NEXT fire time and re-adds it to the ZSET (or Postgres if > 1 hour out)
  4. A worker in the target pool picks up the Kafka message, marks the execution as RUNNING, and starts writing heartbeats to Redis every 10s
  5. Worker invokes the job handler with the payload — this is where YOUR code actually runs
  6. On success → worker marks execution SUCCEEDED, commits Kafka offset, moves on
  7. On failure → worker marks FAILED, reads the retry policy, and publishes to the retry queue with an increasing backoff (e.g., 30s → 2min → 10min → 1h), stopping once maxAttempts is reached
  8. A sweeper periodically checks: “any executions stuck in RUNNING with expired heartbeats?” If yes → the worker crashed. Mark FAILED_WORKER_LOST and trigger a retry
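
The claim in step 2 relies on a remove operation that reports whether the caller was the one who removed the member. The sketch below simulates that with an in-memory set and a lock standing in for Redis's one-command-at-a-time execution; `ClaimableSet` and `dispatch_due` are hypothetical names used only for illustration.

```python
import threading
from typing import Callable, Iterable


class ClaimableSet:
    """Stand-in for a Redis ZSET that supports only an atomic remove."""

    def __init__(self, members: Iterable[str]) -> None:
        self._members = set(members)
        self._lock = threading.Lock()  # models Redis running one command at a time

    def zrem(self, member: str) -> int:
        # Returns 1 only to the caller that actually removed the member.
        with self._lock:
            if member in self._members:
                self._members.remove(member)
                return 1
            return 0


def dispatch_due(zset: ClaimableSet, due_job_ids: Iterable[str],
                 publish: Callable[[str], None]) -> list[str]:
    """Publish only the jobs this dispatcher managed to claim."""
    claimed = []
    for job_id in due_job_ids:
        if zset.zrem(job_id) == 1:
            publish(job_id)
            claimed.append(job_id)
    return claimed


zset = ClaimableSet(["j1", "j2", "j3"])
published: list[str] = []
racers = [
    threading.Thread(target=dispatch_due, args=(zset, ["j1", "j2", "j3"], published.append))
    for _ in range(2)
]
for t in racers:
    t.start()
for t in racers:
    t.join()
print(sorted(published))  # ['j1', 'j2', 'j3']
```

Both dispatchers see the same due list, yet each job is published once, whichever thread gets there first.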

Why Kafka between dispatcher and workers? If the dispatcher called workers directly, a pool restart would lose all in-flight jobs. Kafka gives us durability (messages survive worker crashes), replay (reprocess an hour of jobs if a worker had a bug), and independent scaling per pool.

Why a separate execution row, not just status on the job row: one job may produce many executions (cron fires daily, retries add more). Executions are append-only, cheap to partition by day, and joinable by job_id for history views.

6.3 FR-3: Inspect and cancel

New components we need (in addition to the ones above):

  1. Cancel Set (Redis) — a short-lived set of executionIds that have been cancelled. Workers check this before starting and periodically during execution. 💡 We can’t “un-send” a Kafka message, so instead we let the worker check a cancel flag before it starts working.
flowchart LR
    CLIENT["Client"]:::client
    API["API"]:::edge
    DB[("Postgres<br/>jobs + executions")]:::data
    CANCEL[("Redis<br/>cancel set")]:::data
    WORKER["Worker"]:::service

    CLIENT --> API
    API --> DB
    API --> CANCEL
    WORKER --> CANCEL

    classDef client fill:#fed7aa,stroke:#c2410c,color:#431407
    classDef edge fill:#bfdbfe,stroke:#1d4ed8,color:#0c1f4a
    classDef service fill:#bbf7d0,stroke:#16a34a,color:#052e16
    classDef data fill:#fde68a,stroke:#b45309,color:#451a03

Step-by-step flow (read path):

The read path is straightforward: GET /v1/jobs/:id hits Postgres with an indexed query by job_id — returns the job definition plus recent executions. No drama.

Step-by-step flow (cancellation — the tricky part):

Cancellation sounds simple but has three distinct timing windows:

  1. Cancel a future job (not yet due) — easiest case. API sets the job state to CANCELLED and removes it from the Redis ZSET. It’ll never fire.
  2. Cancel a PENDING execution (dispatched but worker hasn’t started yet) — write the executionId to the Redis cancel set. Worker checks this set before starting; if present, it skips the job entirely.
  3. Cancel a RUNNING execution (worker is mid-flight) — worker polls the cancel set every few seconds during execution. On hit, it sends an interrupt to the handler code. Handlers must cooperate — we can’t force-kill without risking data corruption.

Why lazy cancellation via a flag instead of “delete from the queue”? Kafka doesn’t support targeted message deletion. And even if it did, there’s a race between the cancel request and the worker consuming the message. A cancel flag checked at execution time is simpler, and it shrinks the race to the worker’s polling interval.
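
The PENDING and RUNNING cases above can share one mechanism: check the flag before every unit of work. This is a minimal sketch that checks between handler steps rather than on a timer; `run_with_cancel_checks`, `ExecutionCancelled`, and the `is_cancelled` callback are hypothetical names, with the callback standing in for a Redis lookup.

```python
from typing import Callable, Iterable


class ExecutionCancelled(Exception):
    """Raised when the execution's cancel flag is found set."""


def run_with_cancel_checks(
    execution_id: str,
    steps: Iterable[Callable[[], None]],
    is_cancelled: Callable[[str], bool],
) -> str:
    """Run `steps` in order, checking the cancel flag before each one."""
    try:
        for step in steps:
            # Checked before the first step too, which covers the case
            # where the cancel arrived while the execution was PENDING.
            if is_cancelled(execution_id):
                raise ExecutionCancelled(execution_id)
            step()
    except ExecutionCancelled:
        return "CANCELLED"
    return "SUCCEEDED"


cancel_flags: set[str] = set()
done: list[int] = []
steps = [
    lambda: done.append(1),
    lambda: (done.append(2), cancel_flags.add("e1")),  # cancel arrives mid-run
    lambda: done.append(3),
]
print(run_with_cancel_checks("e1", steps, cancel_flags.__contains__), done)
# CANCELLED [1, 2]
```

The handler stops at a step boundary, so work already done stays consistent. That is the cooperation the design asks of handlers.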


Technology Choices

Vendor-agnostic with alternatives listed.

| Tier / purpose | What it stores | Access pattern | Primary pick | Alternatives |
| --- | --- | --- | --- | --- |
| Job definitions | jobs, schedules — immutable plus a state column, cron expression, payload | low write, point reads + admin queries | PostgreSQL sharded by tenant_id | MySQL, CockroachDB, Aurora |
| Hot schedule index | upcoming-hour ZSET keyed by next_fire_time | O(log N) insert, range query by score, atomic pop | Redis sorted sets | DynamoDB with sort key on timestamp; Google Cloud Tasks for managed |
| Executions | one row per run, heavy insert | insert-heavy, query by job_id + time range | PostgreSQL partitioned monthly | Cassandra for 100M+ executions/day, ClickHouse for analytics replica |
| Worker heartbeats | workerId -> lastSeenAt, currentExecutionId | writes every 10s per worker, TTL eviction | Redis with TTL | ZooKeeper for small fleets, etcd |
| Event backbone | execution.dispatched, execution.succeeded, etc. | ordered per-pool, replayable, at-least-once | Kafka | Kinesis, Google Pub/Sub, Pulsar |
| Delayed retry queue | messages with ready_at timestamp | insert-then-pop-when-ready | Redis sorted sets (reuse the ZSET pattern) | Kafka timer topic, SQS delay queues, RabbitMQ delayed exchange |
| Leader election | who owns each dispatcher shard | heartbeat-based, fast failover | ZooKeeper / etcd / Consul | Kubernetes Lease, Redis Redlock |
| Cancellation signal | cancelled:{executionId} — short-lived set | fast write from API, fast check from worker | Redis set with TTL | Postgres with LISTEN/NOTIFY |
| Cron parsing | turn cron expressions into next_fire_time | pure compute | Library (quartz-cron, croniter) | Custom impl |
| Observability | metrics, traces, audit | high write, OLAP queries | Prometheus + OpenTelemetry + ClickHouse | Datadog, Honeycomb |

Why Postgres and Redis together for scheduling

Postgres is durable truth. Redis is the fast index. The split is the key insight:

This keeps the dispatcher hot path lightning-fast while Postgres handles the scale-at-rest (100M jobs easily on a single partitioned table).

Why not just put everything in Redis?

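Redis is not a good fit as the durable store here: replication is asynchronous, so a failover can drop recently acknowledged writes, and keeping every future job in RAM costs far more per GB than disk. Losing scheduled jobs on a Redis failover is unacceptable. Postgres + WAL gives us “if the commit returned, the job will run.”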

Why Kafka for the worker-facing bus?


7. Potential Deep Dives

Self-audit surfaces these weak spots. Eight deep dives.

Deep Dive 1 — Hot dispatcher: scale beyond one leader

Bad: single dispatcher leader polling one Redis ZSET. At 1M jobs/min the single thread is saturated, polling latency creeps into seconds.

Good: shard the ZSET by hash(jobId) % N. Each shard has its own leader (picked via etcd). N dispatchers poll N ZSETs in parallel.
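
A sketch of the shard function, assuming job IDs are strings. It uses SHA-256 rather than Python's built-in `hash()`, because string hashing in Python is randomized per process and every dispatcher and API node must agree on the shard; `shard_for` is a hypothetical helper name.

```python
import hashlib


def shard_for(job_id: str, num_shards: int) -> int:
    """Map a job id to a shard index that is stable across processes."""
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    # The first 8 bytes give more than enough bits for a modulo.
    return int.from_bytes(digest[:8], "big") % num_shards


print(shard_for("job_a93f2", 8) == shard_for("job_a93f2", 8))  # True
```

Plain modulo remaps most jobs whenever N changes, which is the motivation for the consistent-hashing variant.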

Great — dynamic sharding with consistent hashing and work stealing by idle dispatchers:

flowchart LR
    ZSET1[("Redis ZSET shard 1")]:::data
    ZSET2[("Redis ZSET shard 2")]:::data
    ZSET3[("Redis ZSET shard N")]:::data
    D1["Dispatcher 1<br/>leader"]:::service
    D2["Dispatcher 2<br/>leader"]:::service
    D3["Dispatcher N<br/>leader"]:::service
    ETCD["etcd<br/>leader locks"]:::service
    KAFKA["Kafka pool topics"]:::async

    D1 --> ETCD
    D2 --> ETCD
    D3 --> ETCD
    ZSET1 --> D1
    ZSET2 --> D2
    ZSET3 --> D3
    D1 --> KAFKA
    D2 --> KAFKA
    D3 --> KAFKA

    classDef service fill:#bbf7d0,stroke:#16a34a,color:#052e16
    classDef async fill:#e9d5ff,stroke:#7c3aed,color:#3b0764
    classDef data fill:#fde68a,stroke:#b45309,color:#451a03

Deep Dive 2 — Exactly-once-in-effect execution

Exactly-once delivery is impossible across a network boundary. The achievable goal: at-least-once delivery + idempotent handlers = “exactly-once in effect.”

Bad: no execution ID. If dispatch retries the same message, the handler runs twice with no way to dedupe.

Good: generate an executionId on dispatch. Hand it to the worker. The worker’s first action is to check a “processed” table; if present, skip; else run the handler, record completion.
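
The "processed table" check can be sketched with SQLite standing in for Postgres. The table name `processed` and the helpers `open_store` and `run_once` are assumptions for illustration.

```python
import sqlite3
from typing import Callable


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with a hypothetical `processed` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed (execution_id TEXT PRIMARY KEY)")
    return conn


def run_once(conn: sqlite3.Connection, execution_id: str,
             handler: Callable[[], None]) -> bool:
    """Run `handler` unless `execution_id` is already recorded. True if it ran."""
    row = conn.execute(
        "SELECT 1 FROM processed WHERE execution_id = ?", (execution_id,)
    ).fetchone()
    if row is not None:
        return False  # duplicate delivery: skip
    handler()
    # Recorded only after the handler returns, so a crash in between
    # still leads to a re-run on retry.
    conn.execute("INSERT INTO processed (execution_id) VALUES (?)", (execution_id,))
    conn.commit()
    return True


conn = open_store()
calls = []
print(run_once(conn, "e1", lambda: calls.append("ran")))  # True
print(run_once(conn, "e1", lambda: calls.append("ran")))  # False
```

This removes duplicates caused by redelivery, but not the window between the handler finishing and the record being written. Closing that window is what the next level addresses.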

Great — fencing tokens + transactional completion:

This is the Outbox + fencing pattern borrowed from Stripe’s idempotency work and Temporal’s activity-retry model.
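
One way to read the fencing part: every reassignment of an execution bumps a token, and a completion is accepted only if it carries the current token. The sketch below uses SQLite in place of Postgres; the `fencing_token` column and the helper names are assumptions, not a schema defined in this doc.

```python
import sqlite3


def open_executions_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE executions (id TEXT PRIMARY KEY, state TEXT, fencing_token INTEGER)"
    )
    return conn


def assign(conn: sqlite3.Connection, execution_id: str) -> int:
    """(Re)assign an execution to a worker and return its new fencing token."""
    row = conn.execute(
        "SELECT fencing_token FROM executions WHERE id = ?", (execution_id,)
    ).fetchone()
    token = 1 if row is None else row[0] + 1
    conn.execute(
        "INSERT OR REPLACE INTO executions (id, state, fencing_token) VALUES (?, 'RUNNING', ?)",
        (execution_id, token),
    )
    conn.commit()
    return token


def complete_execution(conn: sqlite3.Connection, execution_id: str, token: int) -> bool:
    """Mark SUCCEEDED only if the caller still holds the current token."""
    cur = conn.execute(
        "UPDATE executions SET state = 'SUCCEEDED' "
        "WHERE id = ? AND state = 'RUNNING' AND fencing_token = ?",
        (execution_id, token),
    )
    conn.commit()
    return cur.rowcount == 1


conn = open_executions_store()
stale = assign(conn, "e1")   # first worker, later presumed dead
fresh = assign(conn, "e1")   # sweeper reassigns to a second worker
print(complete_execution(conn, "e1", stale))  # False
print(complete_execution(conn, "e1", fresh))  # True
```

A worker that was paused and wakes up after being replaced cannot overwrite the result, because its token no longer matches.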

Deep Dive 3 — Worker crash during execution

Bad: worker crashes mid-run. Execution sits in RUNNING forever. Nobody retries.

Good: heartbeat-based liveness. Workers write to Redis every 10s. A sweeper scans executions stuck in RUNNING with expired heartbeats and marks them FAILED_WORKER_LOST, triggering retry.
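
The sweeper's check can be sketched with plain dictionaries standing in for Redis and Postgres. The `Execution` dataclass and `sweep` function are hypothetical, and the 30-second TTL is an assumption (three missed 10-second heartbeats).

```python
from dataclasses import dataclass


@dataclass
class Execution:
    execution_id: str
    worker_id: str
    state: str


def sweep(executions: list[Execution], last_heartbeat: dict[str, float],
          now: float, ttl_sec: float = 30.0) -> list[str]:
    """Mark RUNNING executions whose worker heartbeat has expired.

    Returns the ids that should be handed to the retry path.
    """
    lost = []
    for ex in executions:
        if ex.state != "RUNNING":
            continue
        seen = last_heartbeat.get(ex.worker_id)
        # A missing key models a heartbeat that already expired from Redis.
        if seen is None or now - seen > ttl_sec:
            ex.state = "FAILED_WORKER_LOST"
            lost.append(ex.execution_id)
    return lost


execs = [Execution("e1", "w1", "RUNNING"), Execution("e2", "w2", "RUNNING")]
print(sweep(execs, {"w1": 95.0, "w2": 60.0}, now=100.0))  # ['e2']
```

The TTL trades detection speed against false positives: too short and a worker in a long GC pause gets declared dead while it is still running the job.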

Great — heartbeat TTL + at-most-once interpretation + retry with backoff:

Edge case: worker finished the work but crashed before ACKing. Job effectively ran; our retry will run it again. That’s why the idempotency contract with handlers matters.

Deep Dive 4 — Cron drift, DST, and timezones

Bad: interpret every cron expression in UTC. A user in India who schedules a “3 AM IST” job sees it run at 3 AM UTC, which is 8:30 AM IST.

Good: store cron + timezone. Compute next_fire_time in the user’s zone, convert to UTC for the ZSET score.

Great — recompute every fire, handle DST discontinuities:
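
One building block for handling DST is detecting whether a local wall-clock time is skipped (spring forward) or occurs twice (fall back) before deciding what the job should do. The sketch below is one possible approach using the standard `zoneinfo` module and the `fold` attribute; `classify_local_time` is a hypothetical helper, and the policy for each case (skip, shift, run once) is left open.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def classify_local_time(naive_local: datetime, tz_name: str) -> str:
    """Return 'skipped', 'ambiguous', or 'normal' for a wall-clock time."""
    tz = ZoneInfo(tz_name)
    aware = naive_local.replace(tzinfo=tz)
    # A skipped time does not survive a round trip through UTC.
    round_trip = aware.astimezone(timezone.utc).astimezone(tz).replace(tzinfo=None)
    if round_trip != naive_local:
        return "skipped"
    # An ambiguous time has two valid UTC offsets, selected by `fold`.
    if aware.utcoffset() != aware.replace(fold=1).utcoffset():
        return "ambiguous"
    return "normal"


print(classify_local_time(datetime(2026, 3, 8, 2, 30), "America/Los_Angeles"))   # skipped
print(classify_local_time(datetime(2026, 11, 1, 1, 30), "America/Los_Angeles"))  # ambiguous
```

A scheduler that recomputes next_fire_time on every fire can run this check on each candidate and apply its chosen policy, instead of inheriting whatever a fixed offset would produce.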

Deep Dive 5 — Multi-tenant isolation and fairness

Bad: tenant A schedules 10M cron jobs all firing at 0 0 * * * (midnight UTC). At midnight, the dispatcher is overwhelmed, tenant B’s urgent jobs miss their SLA.

Good: per-tenant quotas enforced at job-submit time. Cap at N concurrent executions per tenant.

Great — weighted fair queuing in the dispatcher:
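
A simple approximation of fair queuing is weighted round robin over per-tenant queues: each pass gives every tenant up to its weight in dispatch slots. This is a sketch of that simpler technique, not a full weighted-fair-queuing implementation; the function name, weights, and per-tick `budget` are assumptions.

```python
from collections import deque


def weighted_round_robin(queues: dict[str, deque], weights: dict[str, int],
                         budget: int) -> list[str]:
    """Pick up to `budget` jobs, visiting tenants in turn by weight."""
    dispatched: list[str] = []
    while len(dispatched) < budget and any(queues.values()):
        for tenant, queue in queues.items():
            # At least one slot per pass, so the loop always makes progress.
            slots = max(1, weights.get(tenant, 1))
            for _ in range(slots):
                if not queue or len(dispatched) >= budget:
                    break
                dispatched.append(queue.popleft())
    return dispatched


queues = {
    "tenantA": deque(["a1", "a2", "a3", "a4", "a5", "a6"]),  # the noisy tenant
    "tenantB": deque(["b1", "b2"]),
}
print(weighted_round_robin(queues, {"tenantA": 1, "tenantB": 1}, budget=4))
# ['a1', 'b1', 'a2', 'b2']
```

Tenant B's jobs go out in the first two passes even though tenant A has a much deeper backlog.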

Deep Dive 6 — Leader election and failover

Bad: single dispatcher process. Dies → no jobs scheduled until ops restarts. 5-minute outage.

Good: two dispatchers with a DB-based lock row (UPDATE scheduler_leader SET leader = $me WHERE leader IS NULL). Quartz’s classic approach.
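
As written, that UPDATE never lets go of the lock if the leader dies while holding it. A common fix is to store an expiry and let a candidate take over once it passes. The sketch below uses SQLite in place of the real database; the `expires_at` column, the single-row layout, and the 10-second lease are assumptions.

```python
import sqlite3


def init_leader_table(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE scheduler_leader (id INTEGER PRIMARY KEY, leader TEXT, expires_at REAL)"
    )
    conn.execute("INSERT INTO scheduler_leader VALUES (1, NULL, 0)")
    conn.commit()


def try_acquire(conn: sqlite3.Connection, candidate: str, now: float,
                lease_sec: float = 10.0) -> bool:
    """Take or renew the lease. True if `candidate` is leader afterwards."""
    cur = conn.execute(
        "UPDATE scheduler_leader SET leader = ?, expires_at = ? "
        "WHERE id = 1 AND (leader IS NULL OR expires_at < ? OR leader = ?)",
        (candidate, now + lease_sec, now, candidate),
    )
    conn.commit()
    # Exactly one row changes when the lease was free, expired, or ours.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
init_leader_table(conn)
print(try_acquire(conn, "A", now=0.0))   # True
print(try_acquire(conn, "B", now=5.0))   # False
print(try_acquire(conn, "B", now=16.0))  # True
```

The lease length sets the worst-case failover time, and it assumes the candidates' clocks roughly agree. That is one reason to move to a consensus store.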

Great — consensus-backed leases with sub-second failover:

Deep Dive 7 — Handling long-running jobs (heartbeat + restart-safe)

Bad: a 4-hour ETL job. Worker crashes at hour 3.5. Retry restarts from zero. Massive waste.

Good: handlers checkpoint progress periodically. On retry, they read the checkpoint and resume.
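
Checkpointing can be sketched as "persist the count of completed items every few items, and start from that count on retry." The dictionary below stands in for a durable checkpoint store, and `process_with_checkpoints` is a hypothetical helper.

```python
from typing import Callable, Sequence


def process_with_checkpoints(items: Sequence, handle: Callable[[object], None],
                             checkpoints: dict[str, int], execution_id: str,
                             every: int = 2) -> None:
    """Process `items`, saving progress every `every` items and at the end."""
    start = checkpoints.get(execution_id, 0)  # resume point; 0 on first run
    for index in range(start, len(items)):
        handle(items[index])
        done = index + 1
        if done % every == 0 or done == len(items):
            checkpoints[execution_id] = done


checkpoints: dict[str, int] = {}
processed: list[int] = []
process_with_checkpoints([10, 20, 30], processed.append, checkpoints, "e1")
print(processed, checkpoints)  # [10, 20, 30] {'e1': 3}
```

Items handled after the last checkpoint are replayed on retry, so each per-item step still has to be idempotent. A shorter interval means less replay and more checkpoint writes.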

Great — activity/heartbeat pattern (from Temporal):

Deep Dive 8 — Observability and debuggability

Bad: “did my job run?” — grep logs on 200 worker hosts.

Good: executions table indexed by job_id. UI shows history.

Great — metrics + traces + audit + replay:


7.5. Design Self-Audit

Weak spots checked:


6.5. Core Flows

Flow 1 — One-shot delayed job (run in 5 minutes)

sequenceDiagram
    actor Client
    participant API
    participant PG as Postgres
    participant Redis
    participant Disp as Dispatcher
    participant Kafka
    participant Worker
    participant Attempts as executions

    Client->>API: POST /v1/jobs (runAt = now + 5min)
    API->>PG: INSERT job + schedule
    API->>Redis: ZADD hot_window next_fire job_id
    API-->>Client: 201 (jobId)

    loop every 100ms
        Disp->>Redis: ZRANGEBYSCORE 0 now LIMIT 100
    end
    Note over Disp,Redis: at T+5min, job surfaces
    Disp->>Redis: ZREM hot_window job_id (atomic claim)
    Disp->>Attempts: INSERT execution PENDING
    Disp->>Kafka: publish to pool topic

    Kafka->>Worker: consume
    Worker->>Attempts: UPDATE state=RUNNING, worker_id=me
    Worker->>Worker: run handler
    Worker->>Attempts: UPDATE state=SUCCEEDED
    Worker->>Kafka: commit offset

Walkthrough:

  1. Client posts a job with runAt = now + 5min.
  2. API persists the job and schedule rows, then adds to the Redis hot-window ZSET.
  3. Dispatcher polls the ZSET every 100ms.
  4. At the fire time, the dispatcher claims the entry atomically (ZREM returns 1 only to the winning caller), creates an execution row, and publishes to Kafka.
  5. Worker consumes, marks RUNNING, runs, marks SUCCEEDED, commits.

Non-obvious failure path: if the dispatcher crashes after claiming the ZSET entry (step 4) but before publishing to Kafka, the entry is already gone from the ZSET when it restarts. Safety net: the executions row is created before the publish, so a sweeper that finds an execution stuck in PENDING with no Kafka publish re-publishes it.

Flow 2 — Recurring cron job with a failure and retry

sequenceDiagram
    participant Disp as Dispatcher
    participant Kafka
    participant Worker
    participant Retry as retry topic
    participant Exec as executions
    participant PG as Postgres

    Disp->>Exec: INSERT execution e1 PENDING
    Disp->>Kafka: publish execution e1
    Kafka->>Worker: consume
    Worker->>Exec: UPDATE RUNNING
    Worker->>Worker: handler throws
    Worker->>Exec: UPDATE FAILED attempt=1
    Worker->>Retry: publish with readyAt = now + 30s
    Note over Disp: meanwhile schedule next cron fire
    Disp->>PG: UPDATE schedules SET next_fire_time = + 1 day
    Note over Retry,Kafka: 30s later
    Retry->>Kafka: re-publish execution e1 attempt=2
    Kafka->>Worker: consume
    Worker->>Exec: UPDATE RUNNING attempt=2
    Worker->>Worker: handler succeeds
    Worker->>Exec: UPDATE SUCCEEDED

Walkthrough:

  1. Dispatcher fires the cron job, creates execution e1, publishes to Kafka.
  2. Worker picks up, runs handler, handler throws.
  3. Worker writes FAILED with attempt=1, publishes to retry topic with a 30-second delay.
  4. Dispatcher also advances the next fire time for the cron (+1 day) — retries don’t affect the schedule.
  5. 30 seconds later, the retry topic re-publishes. A worker picks it up as attempt=2.
  6. Succeeds, marked SUCCEEDED.

Non-obvious failure: the worker crashes after the handler throws (step 2) but before it records FAILED and publishes to the retry topic (step 3). Safety net: the execution row is still RUNNING in Postgres. The heartbeat TTL expires after 30s → sweeper marks FAILED_WORKER_LOST → enqueues retry.

Flow 3 — Running execution cancellation

sequenceDiagram
    actor Ops
    participant API
    participant Redis
    participant Worker
    participant Exec as executions

    Ops->>API: POST /v1/executions/e1/cancel
    API->>Redis: SET cancelled:e1 1 EX 600
    API-->>Ops: 202

    Note over Worker: currently running e1
    Worker->>Redis: check every 5s
    Redis-->>Worker: cancelled
    Worker->>Worker: send interrupt to handler
    Worker->>Exec: UPDATE state=CANCELLED

Walkthrough:

  1. Ops calls the cancel API for a running execution.
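  2. API writes a per-execution cancel key (cancelled:e1) to Redis with a short TTL. Ten minutes matches the 600-second timeout of the example job; in general the TTL should be at least the execution’s timeout so the flag outlives the run.
  3. Worker checks for the cancel key every 5 seconds (cheap: a single GET).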
  4. On hit, it sends an interrupt signal to the handler (Java Thread.interrupt, or a cancel-token check in the handler).
  5. Handler cooperatively stops, updates the execution to CANCELLED.

Non-cooperative handlers (native code, infinite CPU loop) ignore the interrupt. We document cancellation as “best effort” and kill the worker process after a grace period.

State machine — an execution’s lifecycle

stateDiagram-v2
    [*] --> PENDING
    PENDING --> RUNNING: worker picks up
    PENDING --> CANCELLED: cancelled before worker starts
    RUNNING --> SUCCEEDED: handler returns
    RUNNING --> FAILED: handler throws
    RUNNING --> TIMED_OUT: exceeded timeout
    RUNNING --> CANCELLED: cancel signal
    RUNNING --> FAILED_WORKER_LOST: heartbeat TTL
    FAILED --> PENDING: retry scheduled
    FAILED_WORKER_LOST --> PENDING: retry scheduled
    TIMED_OUT --> PENDING: retry scheduled
    FAILED --> DEAD: max retries
    SUCCEEDED --> [*]
    CANCELLED --> [*]
    DEAD --> [*]
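
The diagram translates directly into a transition table that the API and workers could use to reject illegal state changes. This is a sketch; `ALLOWED` and `can_transition` are hypothetical names, and the table copies only the edges drawn above.

```python
# Allowed transitions, copied edge for edge from the state diagram.
ALLOWED: dict[str, set[str]] = {
    "PENDING": {"RUNNING", "CANCELLED"},
    "RUNNING": {"SUCCEEDED", "FAILED", "TIMED_OUT", "CANCELLED", "FAILED_WORKER_LOST"},
    "FAILED": {"PENDING", "DEAD"},
    "FAILED_WORKER_LOST": {"PENDING"},
    "TIMED_OUT": {"PENDING"},
    "SUCCEEDED": set(),  # terminal
    "CANCELLED": set(),  # terminal
    "DEAD": set(),       # terminal
}


def can_transition(current: str, target: str) -> bool:
    """True if the diagram has an edge from `current` to `target`."""
    return target in ALLOWED.get(current, set())


print(can_transition("RUNNING", "FAILED_WORKER_LOST"))  # True
print(can_transition("SUCCEEDED", "RUNNING"))           # False
```

Enforcing the table in the UPDATE path (for example, as a WHERE clause on the current state) keeps a late message from a stale worker from moving a finished execution backwards.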

8. Final Architecture Diagram

flowchart LR
    CLIENT["Clients and CI systems"]:::client

    subgraph "Control Plane"
        API["Scheduler API"]:::edge
        PG[("Postgres<br/>jobs + schedules + executions")]:::data
        REDIS[("Redis<br/>hot ZSET + heartbeats + cancel")]:::data
        CRON["Cron Parser"]:::service
    end

    subgraph "Dispatch"
        ETCD["etcd<br/>leader leases"]:::service
        DISP["Dispatcher shards<br/>leader-elected"]:::service
        SWEEPER["Sweeper<br/>stuck executions"]:::service
        HYDRATOR["Hydrator<br/>PG -> Redis"]:::service
    end

    subgraph "Execution"
        KAFKA["Kafka<br/>per-pool topics + retry + DLQ"]:::async
        WORKER_A["Worker pool A<br/>email-sender"]:::service
        WORKER_B["Worker pool B<br/>batch-etl"]:::service
        WORKER_C["Worker pool C<br/>general-purpose"]:::service
    end

    subgraph "Observability"
        METRICS["Prometheus"]:::service
        TRACES["OpenTelemetry + Jaeger"]:::service
        CH[("ClickHouse<br/>executions OLAP replica")]:::data
    end

    CLIENT --> API
    API --> PG
    API --> REDIS
    API --> CRON
    HYDRATOR --> PG
    HYDRATOR --> REDIS
    REDIS --> DISP
    DISP --> ETCD
    DISP --> PG
    DISP --> KAFKA
    SWEEPER --> PG
    SWEEPER --> KAFKA
    KAFKA --> WORKER_A
    KAFKA --> WORKER_B
    KAFKA --> WORKER_C
    WORKER_A --> PG
    WORKER_B --> PG
    WORKER_C --> PG
    WORKER_A --> REDIS
    WORKER_B --> REDIS
    WORKER_C --> REDIS
    WORKER_A -.failure.-> KAFKA
    WORKER_B -.failure.-> KAFKA
    WORKER_C -.failure.-> KAFKA
    DISP --> METRICS
    WORKER_A --> METRICS
    WORKER_A --> TRACES
    PG --> CH

    classDef client fill:#fed7aa,stroke:#c2410c,color:#431407
    classDef edge fill:#bfdbfe,stroke:#1d4ed8,color:#0c1f4a
    classDef service fill:#bbf7d0,stroke:#16a34a,color:#052e16
    classDef async fill:#e9d5ff,stroke:#7c3aed,color:#3b0764
    classDef data fill:#fde68a,stroke:#b45309,color:#451a03

Key Technologies Mentioned

| Term | What it is |
| --- | --- |
| Redis Sorted Set (ZSET) | In-memory data structure scored by fire-time timestamp — O(log N) insert and range queries power the “what’s due now?” dispatcher hot path. |
| Timing Wheel | Alternative scheduling structure with O(1) insert and fire for time-bucketed events — used in some dispatcher implementations for high-volume ticks. |
| Leader Election | Coordination mechanism (via etcd/ZooKeeper/Consul) ensuring only one dispatcher owns each shard — prevents duplicate job firing. |
| Kafka | Durable message broker decoupling dispatch timing from worker execution — survives worker restarts and enables per-pool topic scaling. |
| Cron Expression | Standard syntax (e.g., 0 3 * * *) for defining recurring schedules, parsed with timezone-aware libraries to compute next fire times. |
| Heartbeat | Periodic signal (every 10s) from workers to Redis with TTL — missed heartbeats trigger sweeper-based retry of stuck jobs. |
| Dead Letter Queue | Holding queue for jobs that exhausted all retry attempts — surfaced to an ops dashboard for manual investigation. |
| Temporal | Durable workflow engine for complex multi-step job orchestration with built-in retries, timeouts, and crash recovery. |

What’s Expected at Each Level

This section helps you calibrate your depth. You don’t need to cover everything — just know what’s expected for your level.

Mid-level

Design a system that stores jobs with execution times and triggers them when due. Propose a polling mechanism or priority queue for finding due jobs. Understand why a single timer thread doesn’t scale — one machine crashing means jobs don’t fire.

Senior

Propose Redis ZSET for the hot window of upcoming jobs (score = execution timestamp). Explain leader election for preventing duplicate execution across multiple scheduler instances. Discuss retry logic with exponential backoff and dead-letter queues for permanently failed jobs. Articulate the difference between at-least-once and exactly-once execution guarantees.

Staff+

Address multi-tenant fair scheduling (one user’s million jobs shouldn’t starve others) using weighted queues with per-tenant token buckets. Discuss timing wheel data structures for sub-second precision without polling overhead, sharding strategies for the job store (partition by tenant + time bucket), and exactly-once execution guarantees using fencing tokens to prevent stale workers from completing zombie executions.


🎯 Key Takeaways

