ho Notification System — HLD — SystemCraft

Notification System — HLD

Difficulty: Intermediate 📋 Prerequisites: System Design Fundamentals — especially Message Queues and Event-Driven Architecture ⏱️ Reading time: 20 min


TL;DR

A multi-channel notification platform that delivers push, SMS, email, and in-app messages. It decouples “something happened” from “tell the user” using an event bus.

flowchart LR
    SERVICES["Internal Services<br/>order placed and payment etc"]:::client
    BUS["Event Bus<br/>Kafka"]:::async
    NS["Notification Service<br/>template and routing"]:::service
    PUSH["Push<br/>FCM APNs"]:::external
    SMS["SMS<br/>Twilio"]:::external
    EMAIL["Email<br/>SES"]:::external
    USER["User"]:::client

    SERVICES --> BUS
    BUS --> NS
    NS --> PUSH
    NS --> SMS
    NS --> EMAIL
    PUSH --> USER
    SMS --> USER
    EMAIL --> USER

    classDef client fill:#FF7043,stroke:#BF360C,color:#fff
    classDef service fill:#66BB6A,stroke:#1B5E20,color:#fff
    classDef async fill:#AB47BC,stroke:#4A148C,color:#fff
    classDef external fill:#EC407A,stroke:#880E4F,color:#fff

In 3 sentences: Backend services emit events (“order confirmed”) to Kafka. The notification service consumes these, renders a template, picks the channel (push/SMS/email based on user preferences), and dispatches. Delivery tracking, retries, and send-time optimization ensure messages reach users when they’re most likely to engage.


1. Understanding the Problem

A notification system lets product surfaces across a company send messages to users across multiple channels — push (mobile), email, SMS, in-app — without every team re-implementing delivery, preferences, retries, and rate limiting. The system must handle billions of events per day, respect user preferences, dedupe noisy senders, and prove delivery.


1.5. Naive First Cut

The whiteboard sketch before any real thought:

flowchart LR
    APP["Product Service"]:::service
    NS["Notification Sender"]:::service
    APNS["APNs / FCM / SES / Twilio"]:::external
    DB[("Users DB")]:::data

    APP --> NS
    NS --> DB
    NS --> APNS

    classDef service fill:#bbf7d0,stroke:#16a34a,color:#052e16
    classDef external fill:#fbcfe8,stroke:#be185d,color:#500724
    classDef data fill:#fde68a,stroke:#b45309,color:#451a03

Why it breaks under real load:

The rest of the doc evolves this into a queue-based, multi-channel, preference-aware notification platform.


1.7. Prior Art We’re Drawing From


2. Functional Requirements

Core (top 3)

  1. Send a notification to a user through one or more channels (push, email, SMS, in-app) given a template ID and template variables.
  2. Respect user preferences — channel opt-ins, category opt-ins (marketing vs transactional), quiet hours, locale.
  3. Guaranteed at-least-once delivery with retries for transient failures, and exposed per-attempt status for debugging.

Below the line (out of scope)


3. Non-Functional Requirements

Core

Below the line


4. Core Entities


5. API / System Interface

One primary send API. Callers are internal product services, authenticated via service-to-service JWT.

Send a notification

POST /v1/notifications                    -> NotificationReceipt
  Header: Idempotency-Key: <uuid>
  Header: Authorization: Bearer <service-jwt>

Request body:

{
  "userId": "u_1293847",
  "templateId": "order_shipped_v3",
  "variables": {
    "orderId": "A-88273",
    "trackingUrl": "https://tr.ck/X7Y8Z",
    "carrier": "BlueDart"
  },
  "channels": ["PUSH", "EMAIL"],
  "category": "TRANSACTIONAL",
  "priority": "HIGH",
  "dedupKey": "order-A-88273-shipped",
  "deeplink": "myapp://orders/A-88273",
  "overrides": {
    "sendAt": null,
    "expireAt": "2026-05-05T10:00:00Z"
  }
}

Response (202 Accepted):

{
  "notificationId": "n_7a3f2e91",
  "state": "ACCEPTED",
  "createdAt": "2026-05-04T12:01:33.412Z"
}

Get status and attempts (for support / debugging)

GET /v1/notifications/:id                 -> Notification + attempts

Response:

{
  "id": "n_7a3f2e91",
  "userId": "u_1293847",
  "templateId": "order_shipped_v3",
  "state": "DELIVERED",
  "category": "TRANSACTIONAL",
  "attempts": [
    { "channel": "PUSH", "provider": "APNs", "status": "SENT",      "at": "2026-05-04T12:01:33.900Z", "providerResp": "200 APNs accepted" },
    { "channel": "PUSH", "provider": "APNs", "status": "DELIVERED", "at": "2026-05-04T12:01:34.102Z", "receipt": "apns-receipt-abc" },
    { "channel": "EMAIL","provider": "SES",  "status": "SENT",      "at": "2026-05-04T12:01:34.200Z", "providerResp": "250 OK" },
    { "channel": "EMAIL","provider": "SES",  "status": "OPENED",    "at": "2026-05-04T12:04:11.512Z", "userAgent": "Mail.app iOS 17" }
  ]
}

User preferences

GET  /v1/users/:id/preferences            -> Preference
PUT  /v1/users/:id/preferences            -> Preference

Preference shape:

{
  "userId": "u_1293847",
  "channels": { "PUSH": true, "EMAIL": true, "SMS": false, "IN_APP": true },
  "categories": {
    "TRANSACTIONAL": { "enabled": true,  "channels": ["PUSH","EMAIL","SMS"] },
    "SECURITY":      { "enabled": true,  "channels": ["PUSH","EMAIL","SMS"] },
    "MARKETING":     { "enabled": false, "channels": ["EMAIL"] }
  },
  "quietHours": { "start": "22:00", "end": "07:00", "timezone": "Asia/Kolkata" },
  "frequencyCaps": { "MARKETING": 5, "SOCIAL": 20 },
  "locale": "en-IN"
}

Device registration (push)

POST   /v1/users/:id/devices              -> register token
DELETE /v1/users/:id/devices/:token       -> revoke
{
  "deviceToken": "apns-token-base64=",
  "platform": "IOS",
  "appVersion": "12.3.0",
  "timezone": "Asia/Kolkata"
}

Campaigns

POST /v1/campaigns                        -> Campaign
{
  "name": "diwali_offers_2026",
  "templateId": "promo_diwali_v1",
  "segment": { "query": "country=IN AND active_last_7d=true AND age BETWEEN 18 AND 34" },
  "scheduledAt": "2026-10-28T09:00:00+05:30",
  "rateLimit": { "perSec": 20000, "perMin": 600000 },
  "useSendTimeOptimization": true
}

Real-time subscription (in-app)

WSS /v1/users/:id/stream                  -> WebSocket
    Header: Authorization: Bearer <user-jwt>

Server pushes JSON frames as notifications fire:

{ "type": "notification", "id": "n_7a3f2e91", "title": "Your order shipped", "body": "...", "at": "2026-05-04T12:01:33.412Z" }

Security notes:


6. High-Level Design

We’ll grow the architecture in three passes — one per core functional requirement.

6.1 FR-1: Send a notification through multiple channels

Start with the minimum viable pipeline: accept, enqueue, fan out per channel, dispatch to the provider.

flowchart LR
    APP["Product Service"]:::service
    API["Notification API"]:::edge
    Q["Message broker<br/>Kafka or Kinesis"]:::async
    ROUTE["Router"]:::service
    PUSH["Push Worker"]:::service
    EMAIL["Email Worker"]:::service
    SMS["SMS Worker"]:::service
    APNS["APNs / FCM"]:::external
    SES["SES / Mailgun"]:::external
    TWIL["Twilio"]:::external

    APP --> API
    API --> Q
    Q --> ROUTE
    ROUTE --> PUSH
    ROUTE --> EMAIL
    ROUTE --> SMS
    PUSH --> APNS
    EMAIL --> SES
    SMS --> TWIL

    classDef edge fill:#bfdbfe,stroke:#1d4ed8,color:#0c1f4a
    classDef service fill:#bbf7d0,stroke:#16a34a,color:#052e16
    classDef async fill:#e9d5ff,stroke:#7c3aed,color:#3b0764
    classDef external fill:#fbcfe8,stroke:#be185d,color:#500724

Legend

Color Role
Blue Edge / API gateway
Green Service
Purple Async / message broker
Pink External dependency
Yellow Data store
Orange Client

Flow:

  1. Product service calls POST /v1/notifications with a template ID, variables, and category.
  2. The Notification API validates the payload, resolves the user, and enriches with user preferences (channels + locale).
  3. It persists the intent to a notifications table (audit + dedup) and publishes an event to the message broker.
  4. The Router reads the event, picks the channels based on template rules and user preferences, and fans out — one message per (user, channel) pair — onto a channel-specific topic.
  5. Each channel worker picks up its topic, renders the template in the user’s locale, and calls the relevant provider SDK.
  6. Provider response (accepted / rejected) is written as a delivery attempt record for observability.

Why async? The product service call path has a strict latency budget (checkout is running). Hitting APNs synchronously is a timebomb — provider slowness becomes product slowness. Enqueueing gives us 10-50ms end-to-end on the hot path; the actual send happens on the worker’s clock.

Why persist before publish? Safety net. If the broker is down, we still have the row. A reconciler (covered later) sweeps notifications stuck in PENDING_PUBLISH and re-publishes.

6.2 FR-2: Respect user preferences

Preferences live in a dedicated service. Both the Notification API (at intake) and the Router (before fan-out) consult it.

flowchart LR
    API["Notification API"]:::edge
    ROUTE["Router"]:::service
    PREFS["Preference Service"]:::service
    PREFDB[("Postgres<br/>user_preferences")]:::data
    PREFCACHE[("Redis<br/>pref cache")]:::data

    API --> PREFS
    ROUTE --> PREFS
    PREFS --> PREFCACHE
    PREFCACHE -. miss .-> PREFDB

    classDef edge fill:#bfdbfe,stroke:#1d4ed8,color:#0c1f4a
    classDef service fill:#bbf7d0,stroke:#16a34a,color:#052e16
    classDef data fill:#fde68a,stroke:#b45309,color:#451a03

What preferences capture:

Flow (Router consulting prefs):

  1. Router receives the event with category=MARKETING, channels=[PUSH,EMAIL].
  2. Queries Preference Service.
  3. Filters: user has SMS=off (not requested anyway), Marketing=on, Push=on, Email=on — keep both channels.
  4. Checks quiet hours: user is in quiet hours → enqueue to delayed queue with wake-up time = end of quiet hours.
  5. Checks frequency cap: “user got 4/5 marketing notifications today” — OK.
  6. Fans out to push and email topics.

Why cache preferences in Redis? Every notification triggers a preference lookup. 50k/sec sustained = 50k/sec reads minimum. Postgres can do it, but Redis drops the latency from millis to microseconds and takes load off the DB for campaigns.

Write path: user updates prefs → API writes Postgres → invalidates Redis entry (write-through not worth the complexity; read-through handles the miss).

6.3 FR-3: Guaranteed at-least-once delivery with retries

Channel workers own the retry logic. The key mechanism is the outbox pattern between provider state and our own DB.

flowchart LR
    PUSH["Push Worker"]:::service
    APNS["APNs"]:::external
    ATTEMPT[("Postgres<br/>delivery_attempts")]:::data
    RETRY["Retry Queue<br/>delayed topic"]:::async
    DLQ["Dead Letter Queue"]:::async

    PUSH --> APNS
    PUSH --> ATTEMPT
    PUSH -.timeout or 5xx.-> RETRY
    RETRY --> PUSH
    PUSH -.permanent failure.-> DLQ

    classDef service fill:#bbf7d0,stroke:#16a34a,color:#052e16
    classDef async fill:#e9d5ff,stroke:#7c3aed,color:#3b0764
    classDef external fill:#fbcfe8,stroke:#be185d,color:#500724
    classDef data fill:#fde68a,stroke:#b45309,color:#451a03

What a worker does for each message:

  1. Read from channel topic.
  2. Render template with user variables + locale.
  3. POST to provider (APNs / SES / Twilio) with a timeout (2s for push, 5s for email/SMS).
  4. Write a delivery_attempts row: status (SENT / FAILED / THROTTLED), provider response code, timestamp.
  5. Classify the outcome:
    • Success → commit Kafka offset, move on.
    • Transient failure (5xx, timeout, 429 throttle) → publish to a delayed retry topic with exponential backoff (2s → 10s → 60s → 5min → 30min; max 5 retries).
    • Permanent failure (400 bad token, invalid phone) → revoke the token in our user device table, send to DLQ.

Why at-least-once, not exactly-once? Distributed systems can’t do exactly-once delivery across a network boundary — only at-least-once + idempotency on the receiving side. The Idempotency-Key header upstream and the dedupKey on the notification row let us detect dups on retry. Providers also dedupe on apns-collapse-id / FCM collapse_key — we pass our notification ID as collapse key so a retry doesn’t produce two banners on the device.

Why persist every attempt? Debugging (“why didn’t my user get the OTP?”) requires per-attempt receipts. When someone files a ticket, ops need to see the exact provider response.

DLQ ownership: a sweeper runs every 5 minutes, reviews DLQ messages, and either retries with longer backoff, or surfaces to a human dashboard after N tries.


Technology Choices

Vendor-agnostic with alternatives. Swap to a specific cloud’s services if you’re targeting one.

Tier / purpose What it stores Access pattern Primary pick Alternatives
Notification primary notifications — intent, template ID, state, dedupKey high write on send, index by (user, createdAt), point-read by id PostgreSQL partitioned by day MySQL, CockroachDB, Aurora
Delivery attempts one row per provider call, with response insert-heavy, query by notification_id for support PostgreSQL monthly-partitioned, or Cassandra at very high rates DynamoDB with TTL
User preferences channel + category opt-ins, quiet hours, frequency state low write, very high read (every send) PostgreSQL + Redis cache DynamoDB + DAX
Device registry push tokens per user per device medium write (registrations), high read PostgreSQL sharded by user_id DynamoDB, Cassandra
Event backbone notification.requested, notification.dispatched, etc. ordered per user_id, replayable, at-least-once Kafka Kinesis, Google Pub/Sub, Pulsar
Delayed queue (quiet hours, retries) messages keyed by wake-up time producer writes with a readyAt; consumer only pulls ready ones Redis sorted sets keyed by timestamp, or Kafka with timer-topic wheel SQS with visibility delay, RabbitMQ delayed exchange
Template store rendered templates per channel per locale read-heavy, versioned S3 / object storage + Postgres metadata Git-backed templates (GitOps)
Rate-limit counters per-user / per-tenant counters very high read+write, TTL’d Redis token bucket / sliding window DynamoDB with atomic counters
Analytics / reporting daily sends, delivery rates, opt-out trends OLAP scans, dashboards Snowflake / BigQuery / ClickHouse via CDC Redshift, Druid
Secrets provider API keys, APNs certs very low read, high sensitivity Vault / AWS Secrets Manager 1Password Connect, Parameter Store

Why Postgres for notifications + attempts, not Cassandra or DynamoDB?

You get ACID within a send: persist the notification + initial state atomically. Support queries hit an indexed point lookup, not a full table scan. Partition by day to keep hot data in tiny tables. For companies sending 10B+/day, Cassandra becomes the right call because append-only writes beat Postgres WAL. Below that, Postgres is simpler and has full SQL.

Why Redis for the delayed queue?

Redis sorted sets (ZSET) give O(log N) insert and O(log N) range query by score. “Give me everything with score ≤ now()” → pop them. This is the cleanest delayed queue pattern. Kafka timer-topic wheels work too but are harder to get right; use them only when volumes exceed what a Redis cluster handles.

Why Kafka for the event backbone?

Kinesis and Pub/Sub are equivalent on managed clouds.


7. Potential Deep Dives

Running the self-audit against the checklist surfaces eleven worth doing. Deep Dives 1-6 cover the core delivery mechanics; 7-11 address real-time delivery, template management, send-time optimization, engagement tracking, and broadcast.

Deep Dive 1 — Hot write path: notification intake at scale

Bad: Product service inserts directly into notifications + publishes to Kafka + writes an audit log. Three writes on the critical path. At 500k/sec burst, the DB is the bottleneck.

Good: Notification API does one insert with INSERT ... RETURNING id, then publishes. Two writes, still DB-bound. Campaigns that fire 10M sends in a minute still melt Postgres.

Great — outbox + CDC:

flowchart LR
    API["Notification API"]:::edge
    DB[("Postgres<br/>notifications + outbox")]:::data
    CDC["Debezium"]:::async
    KAFKA["Kafka"]:::async
    ROUTE["Router"]:::service

    API --> DB
    DB --> CDC
    CDC --> KAFKA
    KAFKA --> ROUTE

    classDef edge fill:#bfdbfe,stroke:#1d4ed8,color:#0c1f4a
    classDef service fill:#bbf7d0,stroke:#16a34a,color:#052e16
    classDef async fill:#e9d5ff,stroke:#7c3aed,color:#3b0764
    classDef data fill:#fde68a,stroke:#b45309,color:#451a03

Deep Dive 2 — Fan-out amplification: one event to N devices

Bad: Router looks up a user’s device tokens inline. User has 4 devices (phone, tablet, desktop, kiosk). Push worker fans out 4x. Fine for one user — not for a campaign that hits 50M users = 200M push sends.

Good: Router fans out once per device into the push topic. Each device is an independent delivery attempt. Works until we need to dedupe across devices for the same notification (e.g., user opens on phone, don’t ring the tablet 30s later).

Great — logical notification + per-device attempts + device-collapse:

Deep Dive 3 — Provider throttling and backpressure

Bad: Blast sends at APNs’ max rate. APNs rate-limits the whole tenant, legitimate transactional sends also get 429’d.

Good: Token bucket per provider, per channel. Workers pull from the channel topic only when a token is available.

Great — weighted bucket + priority lanes:

flowchart LR
    KAFKA1["push.transactional<br/>(high priority)"]:::async
    KAFKA2["push.marketing<br/>(bulk)"]:::async
    BUCKET1["Bucket: 10k/s"]:::service
    BUCKET2["Bucket: 2k/s"]:::service
    WORKERS1["Txn Workers"]:::service
    WORKERS2["Mkt Workers"]:::service
    APNS["APNs"]:::external

    KAFKA1 --> BUCKET1
    BUCKET1 --> WORKERS1
    KAFKA2 --> BUCKET2
    BUCKET2 --> WORKERS2
    WORKERS1 --> APNS
    WORKERS2 --> APNS

    classDef service fill:#bbf7d0,stroke:#16a34a,color:#052e16
    classDef async fill:#e9d5ff,stroke:#7c3aed,color:#3b0764
    classDef external fill:#fbcfe8,stroke:#be185d,color:#500724

Deep Dive 4 — Deduplication and frequency capping

Bad: Two product teams both emit welcome_user for a new signup. User gets 2 welcomes.

Good: dedupKey on the notifications table with a unique constraint. Second insert fails → second team’s send is dropped. Works for exact dups.

Great — Air Traffic Control layer (from LinkedIn’s playbook):

This is the layer Uber calls the CCG. It’s where the ML logic for send-time optimization would eventually slot in.

Deep Dive 5 — Quiet hours and scheduled delivery

Bad: Check quiet hours synchronously, if in-quiet-hours, Thread.sleep(untilSomeTime). Worker threads pile up.

Good: If in quiet hours, compute readyAt = end_of_quiet_hours_in_user_tz and insert into a delayed queue. A scheduler wakes up and re-injects when ready.

Great — Redis ZSET with a dequeue poller:

ZADD delayed:notifications <readyAtEpoch> <notificationId>

A scheduler service polls every second:

ZRANGEBYSCORE delayed:notifications 0 <now> LIMIT 0 1000

For each returned ID, re-publish to the channel topic and ZREM. O(log N) inserts, O(log N + k) poll where k = batch size. Scales to hundreds of millions of scheduled notifications.

Alternatives: Kafka timer-topic tumbling wheel, DynamoDB TTL streams, SQS visibility timeout tricks. Pick Redis for simplicity and latency, others if volumes push past a single Redis cluster.

Edge case: user changes time zone mid-wait. Two reasonable answers:

Most teams pick option 1 and accept occasional mis-timing.

Deep Dive 6 — Observability and delivery proof

Bad: “Did user X get their OTP?” — grep logs across 50 hosts. Hope someone logged what we need.

Good: Structured logs in ELK. Search by notification_id.

Great — first-class attempt history + delivery webhooks + dashboards:



Deep Dive 7 — In-app notifications in real time

Bad: In-app notifications rely on polling. The mobile app hits GET /notifications?since=... every 30 seconds. Users see a 30-second lag; 1M DAUs = 33k req/sec of wasted polling.

Good: Server-sent events (SSE) over HTTP/2. The server keeps a unidirectional stream open; when a notification arrives for this user, the server writes a frame. SSE is one-way, text-only, and works through most proxies without configuration.

Great — WebSocket gateway with a presence layer + a fallback poll:

flowchart LR
    APP["Mobile or Web"]:::client
    LB["L4 Load Balancer"]:::edge
    WS["WebSocket Gateway<br/>(sticky sessions)"]:::service
    PRESENCE[("Redis<br/>presence: userId -> nodeId")]:::data
    KAFKA["Kafka<br/>in-app topic"]:::async
    FANOUT["In-App Fan-out"]:::service

    APP --> LB
    LB --> WS
    WS --> PRESENCE
    FANOUT --> KAFKA
    KAFKA --> WS
    WS --> APP

    classDef client fill:#fed7aa,stroke:#c2410c,color:#431407
    classDef edge fill:#bfdbfe,stroke:#1d4ed8,color:#0c1f4a
    classDef service fill:#bbf7d0,stroke:#16a34a,color:#052e16
    classDef async fill:#e9d5ff,stroke:#7c3aed,color:#3b0764
    classDef data fill:#fde68a,stroke:#b45309,color:#451a03

How it works:

  1. Client opens a WebSocket to /v1/users/:id/stream. Load balancer uses consistent hashing on userId to pin the connection to a specific WS gateway node (sticky sessions). This means a given user is always on the same node, simplifying routing.
  2. On connect, the WS gateway writes userId -> nodeId to Redis (TTL 30s, refreshed by heartbeat every 10s).
  3. When the In-App Fan-out worker consumes a notification, it looks up the target user’s nodeId in Redis. If present, it publishes to a Kafka topic keyed by node; each WS gateway node consumes its own keyed partition and delivers to the live socket.
  4. If nodeId is missing (user offline), the worker writes the notification to an “inbox” — a Redis list inbox:{userId} capped at 100 items + a Postgres backup. Next time the app connects, it drains the inbox as the first thing.
  5. Fallback poll: even with WS, the app periodically (every 60s) calls GET /notifications?since=<lastSeenId> as a safety net. This catches any notification lost to a transient WS hiccup. It’s rare, but belt-and-suspenders.

Why WS over SSE: bidirectional frames give us acknowledgments (client says “got it, showed badge”), and modern load balancers + browsers handle WebSocket fine. Also: we can multiplex multiple event types over one WS (notifications, typing indicators, presence).

Scale numbers: a single modern WS gateway node handles 50k-100k open sockets. For 10M concurrent users, 100-200 gateway nodes behind a consistent-hash LB.

Trade-off: sticky sessions complicate rolling deploys. Mitigation: graceful drain — new deploy tells existing sockets to reconnect, they get routed to the new node via the LB.


Deep Dive 8 — Template service: versioning, localization, rendering

Bad: Templates as hardcoded strings in worker code. Every copy change requires a deploy. Marketing can’t iterate. Translators need a developer.

Good: Templates in a database, fetched by ID at render time. Versioned. Marketing uses a console.

Great — immutable template versions + pre-compiled renderer + per-locale cache:

flowchart LR
    CONSOLE["Marketing Console"]:::client
    TMPL["Template Service"]:::service
    S3[("Object Storage<br/>template artifacts")]:::data
    DB[("Postgres<br/>template metadata")]:::data
    CACHE[("Redis<br/>compiled templates")]:::data
    WORKER["Channel Worker"]:::service

    CONSOLE --> TMPL
    TMPL --> DB
    TMPL --> S3
    WORKER --> TMPL
    TMPL --> CACHE
    CACHE -. miss .-> S3

    classDef client fill:#fed7aa,stroke:#c2410c,color:#431407
    classDef service fill:#bbf7d0,stroke:#16a34a,color:#052e16
    classDef data fill:#fde68a,stroke:#b45309,color:#451a03

Model:

Publishing flow:

  1. Marketer drafts a template in the console.
  2. Console uploads the template file (Mustache/Handlebars/MJML for email) to object storage at templates/order_shipped_v4/en.mustache.
  3. Pre-flight validation: variables referenced in the template must all exist in a registered variableSchema. Catches typos before a real send fails.
  4. Compliance reviewer approves (required for MARKETING category).
  5. Activate atomically: Postgres row updates the active_version pointer for (templateId, locale).

Render flow (from a channel worker):

  1. Look up (templateId, userLocale) via Template Service.
  2. Fetch compiled template from Redis. Miss → pull from S3 → compile (parse Mustache to AST) → cache.
  3. Render with the user’s variables. Run output sanitization (HTML escape for email, length-cap for SMS, JSON-safe for push payload).
  4. Return the rendered content to the worker.

Caching: compiled template objects stay in Redis with a long TTL (24h) because versions are immutable — there’s no staleness risk. On activation, the service bumps a global version number which workers check cheaply to detect new active versions.

Locale fallback: hi-IN not found → try hi → try template’s declared default locale → fail send. All falls through Template Service so workers don’t reinvent fallback logic.

Why pre-compile: a template is parsed once per node per version; subsequent renders are ~10-20 μs instead of parsing the template string each time. At 500k/sec we can’t afford the parser on every send.

Integration with channels:


Deep Dive 9 — Send-time optimization (Uber CCG-style)

Bad: All marketing fires immediately when the campaign is scheduled. Users get a 9am blast, half ignore it. Open rates tank.

Good: Default quiet hours + frequency cap. Better, but still one-size-fits-all. Your “marketing hits at 9am local” misses the user who opens the app at 7pm every day.

Great — per-user send-time prediction + constrained ranking:

flowchart LR
    EVENTS["User Engagement Events<br/>Kafka"]:::async
    FEATURE["Feature Store"]:::data
    MODEL["Send-Time Model<br/>(trained offline)"]:::service
    SERVING["Model Serving<br/>online inference"]:::service
    ATC["Policy ATC"]:::service
    RANKER["Ranker"]:::service
    DELAYQ[("Redis ZSET<br/>per-user delayed queue")]:::data

    EVENTS --> FEATURE
    FEATURE --> MODEL
    MODEL --> SERVING
    ATC --> SERVING
    SERVING --> RANKER
    RANKER --> DELAYQ

    classDef service fill:#bbf7d0,stroke:#16a34a,color:#052e16
    classDef async fill:#e9d5ff,stroke:#7c3aed,color:#3b0764
    classDef data fill:#fde68a,stroke:#b45309,color:#451a03

Three layers:

  1. Feature store — per-user features: open-rate-by-hour histogram, click-rate-by-hour, last-active-hour, timezone, days since last notification. Updated in near real time from the engagement Kafka topic.

  2. Send-time model — offline-trained (weekly) gradient-boosted model that predicts P(open user, hour-of-day, category). Lightweight: ~KBs per user, serves in <1ms. For each incoming notification with useSendTimeOptimization=true, inference returns the best hour in the next 24h.
  3. Ranker — multiple notifications compete for attention. Per-user ranker runs a linear program (Uber’s actual approach — linear programming) that picks at most N notifications per day subject to:
    • category priority (transactional > security > social > marketing),
    • frequency caps,
    • minimum spacing between notifications (15 min),
    • predicted open probability.

    The output is a scheduled list: (notificationId, sendAt) pairs. Each is enqueued in the delayed Redis ZSET from Deep Dive 5.

Why linear programming rather than greedy: the “max 5 marketing per day + min 15 min spacing + top-N by predicted open” problem has conflicting constraints. Greedy picks the highest-scoring notification first and loses optimal coverage. LP gets the globally best schedule in milliseconds for per-user problems of this size (~dozens of candidates per user per day).

Only marketing and social categories go through this layer. Transactional and security bypass — they fire immediately.

Cost control: inference serving is the hot spot. At 100M users × 10 marketing candidates per day = 1B inferences/day. A small Redis-cached “best hour per user per category” result valid for 24h absorbs 95% of those.

Trade-off: adds latency to marketing sends. That’s fine — they’re not time-sensitive. For “this product just restocked” you’d still fire with a shorter urgency window.


Deep Dive 10 — Engagement tracking (opens, clicks)

Bad: “Did the user see it?” — no clue. Only the provider knows they accepted it.

Good: Client-side reporting. App SDK pings POST /v1/notifications/:id/opened when user taps. Email has a tracking pixel. But: no reliable way to know for push without client SDK, and clients can lie or double-report.

Great — multi-source engagement ingestion + deduped event store:

flowchart LR
    APP["Mobile SDK"]:::client
    EMAIL["Email Pixel<br/>& tracking links"]:::client
    PROVIDER["Provider Webhooks<br/>APNs Feedback / SES SNS"]:::external
    COLLECT["Engagement Collector"]:::service
    KAFKA["Kafka<br/>engagement topic"]:::async
    DEDUP["Deduper"]:::service
    DWH[("ClickHouse<br/>events — 90d hot")]:::data
    COLD[("Parquet on S3<br/>cold — 2y")]:::data

    APP --> COLLECT
    EMAIL --> COLLECT
    PROVIDER --> COLLECT
    COLLECT --> KAFKA
    KAFKA --> DEDUP
    DEDUP --> DWH
    DWH --> COLD

    classDef client fill:#fed7aa,stroke:#c2410c,color:#431407
    classDef service fill:#bbf7d0,stroke:#16a34a,color:#052e16
    classDef async fill:#e9d5ff,stroke:#7c3aed,color:#3b0764
    classDef data fill:#fde68a,stroke:#b45309,color:#451a03
    classDef external fill:#fbcfe8,stroke:#be185d,color:#500724

Event types tracked:

Pipeline:

  1. Sources normalize to a common envelope: {notificationId, userId, event, at, source}.
  2. Collector publishes to the engagement Kafka topic, partitioned by notificationId.
  3. Deduper drops duplicates per (notificationId, event) within a 7-day window using a Bloom filter in Redis. Fixes double-reports from SDK + provider webhook for the same open.
  4. Events land in ClickHouse for real-time dashboards (campaign open rate, per-locale performance). Daily roll-up to S3 Parquet for long-term analysis and ML training.
  5. Events also feed back into the feature store (Deep Dive 9) within minutes.

Why ClickHouse: column-store gives fast aggregation for SELECT category, hour, COUNT(*) FROM events WHERE date=today GROUP BY ... dashboards. Postgres would be too slow at the event volume (10B events/day).

Why a Bloom filter for dedup: exact dedup would require storing 10B IDs per week. Bloom filter accepts ~0.1% false positives (we occasionally drop a real second open) but uses ~100x less memory.

Engagement data flows back into:


Deep Dive 11 — Broadcast to large segments (optional)

Bad: “Send this to all 500M users” — the Router iterates the user list one-by-one. Takes hours.

Good: Parallelize the iteration. Shard the user list into batches of 10k, each processed by a worker pool. Still bounded by provider rate limits.

Great — pre-computed segment materialized view + push-time content personalization:

  1. For any large segment (country = IN, engaged_users, power_users_top_10pct), Segment Service materializes the user list nightly into an S3 object or a dedicated table.
  2. Broadcast Worker reads the segment in parallel partitions (e.g., 1000 partitions for 500M users) — each partition processed by a separate consumer group.
  3. Content is static across the segment (same template ID, same variables) but rendered per-user-locale at worker time.
  4. Provider rate limit is the ceiling: even fully parallelized, APNs caps total throughput at ~1M/sec. 500M users → ~8 minutes wall-clock minimum.

For truly instant large-audience cases (emergency civic alerts, security advisories), broadcast via channels that support topic-based fan-out natively:

For normal business broadcast, the segmented-queue approach is what you want.


7.5. Design Self-Audit

Weak spots checked:


6.5. Core Flows

Flow 1 — Transactional send (OTP)

sequenceDiagram
    actor User
    participant AuthSvc as Auth Service
    participant NotifAPI as Notification API
    participant DB as Postgres
    participant Kafka
    participant Router
    participant Prefs as Preference Svc
    participant Worker as SMS Worker
    participant Twilio
    participant Attempts as attempts table

    AuthSvc->>NotifAPI: POST /notifications (OTP, category=SECURITY)
    NotifAPI->>DB: INSERT notification + outbox
    DB-->>NotifAPI: id
    NotifAPI-->>AuthSvc: 202 accepted (id)
    DB-->>Kafka: CDC → notification.requested
    Kafka->>Router: consume
    Router->>Prefs: get prefs(userId)
    Prefs-->>Router: prefs (security always-on, ignore quiet hours)
    Router->>Kafka: publish sms.transactional (renderedPayload)
    Kafka->>Worker: consume
    Worker->>Twilio: POST /Messages
    alt success
        Twilio-->>Worker: 201
        Worker->>Attempts: INSERT status=SENT
    else transient 5xx
        Twilio-->>Worker: 503
        Worker->>Attempts: INSERT status=FAILED
        Worker->>Kafka: publish retry with backoff
    end

Walkthrough:

  1. Auth service calls the Notification API with the OTP template and category=SECURITY.
  2. API writes the notification and its outbox entry in one transaction.
  3. It returns 202 in under 20ms — the user sees “code sent” immediately.
  4. CDC picks up the commit and publishes to Kafka.
  5. Router consults Preference Service. Security overrides quiet hours and marketing opt-out.
  6. Router fans out to sms.transactional (high-priority topic).
  7. SMS worker renders and hits Twilio with a 5s timeout.
  8. On success, we record the attempt; on 5xx, we retry with exponential backoff; on 4xx we mark permanent failure and alert.

Failure case: Twilio webhooks tell us 15s later the SMS was actually undelivered (number disconnected). The reconciler joins the webhook to our delivery_attempts, flips the status to UNDELIVERED, and notifies Auth Service via its own outbound webhook so it can offer the user an alternate channel.

Flow 2 — Marketing campaign send (10M users)

sequenceDiagram
    participant Marketer
    participant CampaignAPI
    participant SegBuilder as Segment Builder
    participant DB as Postgres
    participant Kafka
    participant Router
    participant ATC as Policy (ATC)
    participant Prefs
    participant DelayQ as Redis ZSET
    participant Worker as Push Worker
    participant APNs

    Marketer->>CampaignAPI: POST /campaigns (segment, templateId)
    CampaignAPI->>SegBuilder: resolve segment → user IDs
    SegBuilder-->>CampaignAPI: 10M user IDs (stream)
    CampaignAPI->>DB: bulk COPY notifications + outbox
    DB-->>Kafka: CDC (batched)
    loop per user
        Kafka->>Router: consume
        Router->>Prefs: get prefs
        Prefs-->>Router: marketing=on, quiet=22-07 in Sydney
        alt in quiet hours
            Router->>DelayQ: ZADD readyAt=7am-Sydney
        else not in quiet hours
            Router->>ATC: check dedup + freq cap
            ATC-->>Router: allowed
            Router->>Kafka: push.marketing
            Kafka->>Worker: consume
            Worker->>APNs: POST /push (rate-limited bucket)
            APNs-->>Worker: 200
        end
    end
    Note over DelayQ,Router: scheduler polls every 1s<br/>ZRANGEBYSCORE 0 now

Walkthrough:

  1. Marketer calls CampaignAPI with a segment definition (e.g., “Indian users, 18-34, active in last 7 days”).
  2. Segment Builder streams the user IDs out of the user data warehouse.
  3. CampaignAPI bulk-writes notifications using COPY — seconds, not minutes.
  4. CDC streams events to Kafka in order.
  5. Router consults prefs + ATC for each. Users in quiet hours get deferred via Redis ZSET.
  6. Rate-limited workers drain the topic, respecting APNs throughput caps.
  7. Failures → retry topic with backoff.

Non-obvious failure: campaign writes succeed, CDC is behind by 10 minutes. We don’t block. Marketer sees “campaign queued” because the row is committed; the delay is at most CDC lag, which alerting monitors. Acceptable for marketing.

Flow 3 — User updates preferences

sequenceDiagram
    actor User
    participant App as Mobile App
    participant PrefAPI as Preference API
    participant DB as Postgres
    participant Cache as Redis
    participant Kafka

    User->>App: toggle marketing off
    App->>PrefAPI: PUT /users/:id/preferences
    PrefAPI->>DB: UPDATE preferences
    PrefAPI->>Cache: DEL user:prefs:{id}
    PrefAPI->>Kafka: publish preference.changed
    PrefAPI-->>App: 200
    Note over Kafka: downstream ATC listens<br/>to reset per-user counters
  1. App calls PUT.
  2. Preference API updates Postgres.
  3. Invalidates Redis cache (next read rebuilds).
  4. Publishes a preference.changed event — downstream consumers (ATC, counters) can react.
  5. Returns 200.

Within milliseconds of the update, the next notification fan-out sees the new preference on cache miss → DB hit → re-cache.

State machine — a notification’s lifecycle

stateDiagram-v2
    [*] --> ACCEPTED
    ACCEPTED --> QUEUED: published to Kafka
    QUEUED --> DEFERRED: in quiet hours
    DEFERRED --> QUEUED: wake up
    QUEUED --> SUPPRESSED: policy drop
    QUEUED --> DISPATCHING: worker picks up
    DISPATCHING --> SENT: provider 200
    DISPATCHING --> RETRYING: transient failure
    RETRYING --> DISPATCHING: backoff elapsed
    RETRYING --> FAILED: retries exhausted
    SENT --> DELIVERED: provider receipt
    SENT --> UNDELIVERED: provider receipt (failure)
    FAILED --> [*]
    SUPPRESSED --> [*]
    DELIVERED --> [*]
    UNDELIVERED --> [*]

8. Final Architecture Diagram

flowchart LR
    CLIENT["Product Services"]:::client
    USER["End Users<br/>mobile and web"]:::client

    subgraph "Ingress"
        API["Notification API"]:::edge
        CAMPAIGN["Campaign API"]:::edge
        WSGW["WebSocket Gateway<br/>sticky per user"]:::edge
    end

    subgraph "Core"
        DB[("Postgres<br/>notifications + outbox")]:::data
        CDC["Debezium"]:::async
        KAFKA["Kafka<br/>per-channel topics"]:::async
        ROUTE["Router"]:::service
        ATC["Policy / ATC"]:::service
        RANKER["Ranker + Send-Time<br/>marketing only"]:::service
        PREFS["Preference Svc"]:::service
        PREFDB[("Postgres<br/>user_preferences")]:::data
        PREFCACHE[("Redis<br/>pref cache")]:::data
        DELAYQ[("Redis<br/>delayed ZSET")]:::data
        PRESENCE[("Redis<br/>user presence")]:::data
        TMPL["Template Svc"]:::service
        TMPLDB[("Postgres + S3<br/>templates")]:::data
    end

    subgraph "Workers"
        PUSH["Push Worker"]:::service
        EMAIL["Email Worker"]:::service
        SMS["SMS Worker"]:::service
        INAPP["In-App Fan-out"]:::service
        ATTEMPTS[("Postgres<br/>delivery_attempts")]:::data
        DLQ["DLQ"]:::async
    end

    subgraph "Engagement"
        COLLECT["Engagement Collector"]:::service
        EVENTS["Kafka<br/>engagement"]:::async
        CH[("ClickHouse<br/>90d hot")]:::data
        FS["Feature Store"]:::data
    end

    subgraph "External"
        APNS["APNs / FCM"]:::external
        SES["SES / Mailgun"]:::external
        TWIL["Twilio"]:::external
    end

    CLIENT --> API
    CLIENT --> CAMPAIGN
    USER --> WSGW
    WSGW --> PRESENCE
    API --> DB
    CAMPAIGN --> DB
    DB --> CDC
    CDC --> KAFKA
    KAFKA --> ROUTE
    ROUTE --> PREFS
    PREFS --> PREFCACHE
    PREFCACHE -. miss .-> PREFDB
    ROUTE --> ATC
    ROUTE --> RANKER
    RANKER --> DELAYQ
    DELAYQ --> KAFKA
    KAFKA --> PUSH
    KAFKA --> EMAIL
    KAFKA --> SMS
    KAFKA --> INAPP
    PUSH --> TMPL
    EMAIL --> TMPL
    SMS --> TMPL
    INAPP --> TMPL
    TMPL --> TMPLDB
    INAPP --> WSGW
    PUSH --> APNS
    EMAIL --> SES
    SMS --> TWIL
    PUSH --> ATTEMPTS
    EMAIL --> ATTEMPTS
    SMS --> ATTEMPTS
    PUSH -.permanent fail.-> DLQ
    EMAIL -.permanent fail.-> DLQ
    SMS -.permanent fail.-> DLQ
    USER -.opens / clicks.-> COLLECT
    APNS -.webhooks.-> COLLECT
    SES -.webhooks.-> COLLECT
    COLLECT --> EVENTS
    EVENTS --> CH
    EVENTS --> FS
    FS --> RANKER

    classDef client fill:#fed7aa,stroke:#c2410c,color:#431407
    classDef edge fill:#bfdbfe,stroke:#1d4ed8,color:#0c1f4a
    classDef service fill:#bbf7d0,stroke:#16a34a,color:#052e16
    classDef async fill:#e9d5ff,stroke:#7c3aed,color:#3b0764
    classDef data fill:#fde68a,stroke:#b45309,color:#451a03
    classDef external fill:#fbcfe8,stroke:#be185d,color:#500724
Privacy Policy — Syste... →

💬 Comments