
Designing a News Aggregator (Google News / Apple News)

⚡ Difficulty: Intermediate
🏷️ Topics: Crawling, NLP Deduplication, Feed Ranking, Trending Detection, Caching
🏢 Asked at: Google, Amazon, Microsoft, Apple, Flipkart
📋 Prerequisites: Fundamentals - especially Caching, Message Queues, and Database Indexing

1. Understanding the Problem

A news aggregator continuously collects articles from thousands of publishers around the world, removes duplicates (50 sources might cover the same event), ranks them by relevance and freshness, and serves personalized feeds to millions of users. The hard parts: ingesting content from unreliable sources at scale, detecting that 200 articles are about the same event, handling breaking news spikes (traffic can jump 10x during elections or disasters), and personalizing for new users who have no reading history yet (the cold-start problem).

Real examples: Google News (aggregates from 50K+ sources), Apple News, Flipboard, Inshorts, Microsoft Start.

1.5. Naive First Cut

flowchart LR
    CRAWLER["Crawler"]:::service
    DB[("Single DB<br/>all articles")]:::data
    USER["User"]:::client

    CRAWLER --> DB
    USER --> DB

    classDef client fill:#4c3a5e,stroke:#818cf8,color:#e2e8f0
    classDef service fill:#1a3a2a,stroke:#4ade80,color:#e2e8f0
    classDef data fill:#3b3520,stroke:#fbbf24,color:#e2e8f0

Crawl RSS feeds periodically, dump articles into a database, serve them sorted by time.

Why this breaks:

No deduplication - same story from 50 sources clutters the feed
No personalization - everyone sees the same feed regardless of interests
Crawling 50K sources sequentially takes hours - stale news
Breaking news takes too long to surface (waiting for next crawl cycle)
Single DB can’t handle 100M+ articles + millions of feed queries
No concept of “topics” or “stories” - just a flat list of articles

1.7. Prior Art We’re Drawing From

Google News Clustering - Groups articles about the same event into “story clusters” using NLP similarity. A story cluster has one headline, multiple source links, and a freshness score that decays over time. (Google Blog)
Facebook News Feed Ranking - Multi-stage ranking pipeline: candidate generation (1000s) → lightweight ranker (100s) → heavy ranker (top 50). Balances engagement prediction with content quality signals. (Facebook Engineering)
Twitter Trends Detection - Detects trending topics by comparing current mention velocity against historical baseline. A topic “trends” when its current rate exceeds the expected rate by a statistical threshold, not just when volume is high. (Twitter Engineering)
Apache Kafka + Flink at LinkedIn - Real-time content processing pipeline that ingests millions of events, enriches them, deduplicates, and routes to multiple downstream consumers (feed, notifications, search index). (LinkedIn Engineering)

2. Functional Requirements

Core (Top 3)

Ingest articles from thousands of sources - continuously crawl/receive articles from 50K+ publishers via RSS, APIs, and webhooks
Deduplicate and cluster - group articles about the same event into story clusters, surface the best source as the headline
Serve personalized feed - each user sees a ranked feed based on their interests, reading history, and location

Below the Line

Breaking news push notifications
Topic following and custom sections
Publisher credibility scoring
Fact-check labels
Offline reading / save for later
Comments and social sharing

3. Non-Functional Requirements

NFR	Target
Freshness	Breaking news appears within 2-5 minutes of first publication
Scale	50K sources, 1M+ new articles/day, 100M+ DAU
Feed latency	Personalized feed served in < 200ms P99
Availability	99.99% - news is time-sensitive, downtime means missed events

Below the Line

Multi-language support (50+ languages)
Regional content laws compliance (right to be forgotten)
Publisher analytics dashboard

Scale Estimation

Sources: 50K publishers, crawled every 5-15 minutes
Ingestion: ~1M new articles/day, ~42K/hour on average (about 12/sec), with bursts several times higher during breaking events
Storage: ~500GB new article content/month (title + body + metadata), 5TB with media links
Read QPS: ~5.8K feed requests/sec on average (100M DAU x 5 opens/day / 86,400 s); provision for ~50K/sec at peak, roughly 9x the average
Story clusters: ~50K active clusters at any time, 500K total/month
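
As a quick sanity check, the per-second rates can be derived from the daily totals. This is a back-of-the-envelope sketch only; the variable names are illustrative, and the 5 opens/day and 50K/sec peak are the assumptions stated in the estimate.

```python
SECONDS_PER_DAY = 86_400


def per_second(events_per_day: float) -> float:
    """Average events per second for a given daily total."""
    return events_per_day / SECONDS_PER_DAY


articles_per_day = 1_000_000
dau = 100_000_000
opens_per_user_per_day = 5   # assumption from the estimate above
peak_read_qps = 50_000       # peak provisioning target from the estimate above

avg_ingest_per_hour = articles_per_day / 24
avg_ingest_per_sec = per_second(articles_per_day)
avg_read_qps = per_second(dau * opens_per_user_per_day)
peak_to_avg = peak_read_qps / avg_read_qps

print(f"ingest: {avg_ingest_per_hour:,.0f}/hour, {avg_ingest_per_sec:.1f}/sec")
print(f"reads:  {avg_read_qps:,.0f}/sec average, peak is {peak_to_avg:.1f}x average")
```

The write path is tiny compared with the read path (about 12 writes/sec against thousands of reads/sec), which is why most of the design effort goes into caching and ranking rather than raw ingestion throughput.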

4. Core Entities

Article - URL, title, body, publisher, publish time, language, category, media
Publisher - name, domain, credibility score, crawl frequency, RSS/API endpoint
Story Cluster - a group of articles about the same event, with a representative headline, summary, source count, and freshness score
Topic - a category or tag (Politics, Tech, Sports, etc.) that stories belong to
User Profile - interests (topics followed), reading history, location, language preferences

5. API / System Interface

GET /v1/feed?userId={id}&page={n}
  Response: [{ storyCluster: { headline, summary, sources[], topic, publishedAt, imageUrl } }, ...]

GET /v1/story/{clusterId}
  Response: { headline, summary, articles: [{ title, publisher, url, publishedAt }], relatedStories[] }

GET /v1/topics
  Response: [{ id, name, articleCount }]

GET /v1/trending
  Response: [{ storyCluster, velocity, region }]

POST /v1/user/interests
  Body: { topics: ["tech", "sports"], publishers: ["bbc", "reuters"] }
  Response: 200 OK

Security note: Feed content is read-only for users; the only user write is the interests update. Article ingestion is internal only (no user-submitted content). Rate-limit the feed API to prevent scraping.

6. High-Level Design

FR1: Ingest Articles from Thousands of Sources

The first challenge: 50K publishers, each publishing 10-100 articles/day. We need to crawl them continuously, extract content, and store it. Some publishers offer RSS feeds, some have APIs, some need HTML scraping. Sources are unreliable - they go down, change formats, or throttle us.

New components:

Crawl Scheduler - maintains a priority queue of sources to crawl. High-priority sources (BBC, Reuters) crawled every 5 min; smaller blogs every 30 min. Adjusts frequency based on publisher’s historical update rate.
Crawler Workers - stateless workers that fetch content from assigned URLs. Handle retries, rate limiting per publisher, and format parsing (RSS, Atom, HTML scraping).
Content Extractor - parses raw HTML/RSS into structured data: title, body text, publish time, author, images. Strips ads and navigation.
Article Store (Cassandra) - stores all articles durably. Partitioned by publish date for efficient time-range queries.
Kafka - decouples crawling from downstream processing. Crawlers publish raw articles; multiple consumers process them independently.

flowchart LR
    SCHED["Crawl Scheduler<br/>priority queue"]:::service
    WORKERS["Crawler Workers<br/>stateless pool"]:::service
    EXTRACT["Content Extractor"]:::service
    KF["Kafka<br/>raw articles"]:::async
    STORE[("Article Store<br/>Cassandra")]:::data
    SOURCES["50K Publishers"]:::external

    SCHED --> WORKERS
    WORKERS --> SOURCES
    WORKERS --> EXTRACT
    EXTRACT --> KF
    KF --> STORE

    classDef client fill:#4c3a5e,stroke:#818cf8,color:#e2e8f0
    classDef service fill:#1a3a2a,stroke:#4ade80,color:#e2e8f0
    classDef data fill:#3b3520,stroke:#fbbf24,color:#e2e8f0
    classDef async fill:#3b1f5e,stroke:#c084fc,color:#e2e8f0
    classDef external fill:#4a1942,stroke:#f472b6,color:#e2e8f0

Step-by-step flow:

Crawl Scheduler pops the next source due for crawling from its priority queue
Assigns it to a Crawler Worker (round-robin across the worker pool)
Worker fetches the RSS feed or webpage, respecting robots.txt and rate limits
Content Extractor parses the raw content into structured article fields
Deduplicates at the URL level (skip if we’ve already seen this exact URL)
Publishes the new article to Kafka topic articles.raw
Downstream consumers (clustering, indexing) read from Kafka independently
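
The scheduling step at the top of this flow can be sketched with a min-heap keyed on each source's next due time. This is a minimal in-memory illustration, not a production scheduler: `CrawlScheduler`, `CrawlTask`, and the interval values are hypothetical names and numbers, and a real deployment would persist the queue and distribute it across workers.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class CrawlTask:
    next_crawl_at: float                      # heap key: earliest due first
    source: str = field(compare=False)
    interval_s: float = field(compare=False)


class CrawlScheduler:
    """Priority queue of sources ordered by when they are next due."""

    def __init__(self) -> None:
        self._heap: list[CrawlTask] = []

    def add_source(self, source: str, interval_s: float, now: float) -> None:
        if interval_s <= 0:
            raise ValueError("interval_s must be positive")
        heapq.heappush(self._heap, CrawlTask(now, source, interval_s))

    def pop_due(self, now: float) -> list[str]:
        """Return every source due at `now` and reschedule each one."""
        due = []
        while self._heap and self._heap[0].next_crawl_at <= now:
            task = heapq.heappop(self._heap)
            due.append(task.source)
            heapq.heappush(
                self._heap,
                CrawlTask(now + task.interval_s, task.source, task.interval_s),
            )
        return due


scheduler = CrawlScheduler()
scheduler.add_source("bbc.com/rss", interval_s=300, now=0)        # every 5 min
scheduler.add_source("smallblog.example/rss", interval_s=1800, now=0)  # every 30 min
print(scheduler.pop_due(now=300))
```

Popping by due time rather than iterating over all sources keeps each scheduling decision at O(log N), which matters once the queue holds 50K entries.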

Why Kafka? Crawling speed varies wildly (some sources respond in 50ms, some in 5s). Kafka buffers the stream so downstream processing isn’t coupled to crawl speed. If the clustering service goes down for maintenance, articles queue up and are processed when it’s back.

FR2: Deduplicate and Cluster Articles into Stories

This is the hardest part. When a major event happens (election results, earthquake), 200 publishers write about it within minutes. We need to detect that these 200 articles are about the same event and group them into one “story cluster.” The user should see one headline with “200 sources” - not 200 separate cards.

New components:

Clustering Service - consumes articles from Kafka, computes text similarity against existing clusters, and either assigns the article to an existing cluster or creates a new one.
Embedding Store (Redis) - stores vector embeddings of recent story clusters for fast similarity lookup. When a new article arrives, we compare its embedding against existing cluster centroids.
Story Cluster DB (Postgres) - stores cluster metadata: representative headline, source list, topic, freshness score, article count.

flowchart LR
    KF["Kafka<br/>raw articles"]:::async
    CLUSTER["Clustering Service"]:::service
    EMBED[("Embedding Store<br/>Redis vectors")]:::data
    CLUSTERDB[("Cluster DB<br/>Postgres")]:::data
    STORE[("Article Store")]:::data

    KF --> CLUSTER
    CLUSTER --> EMBED
    CLUSTER --> CLUSTERDB
    CLUSTER --> STORE

    classDef service fill:#1a3a2a,stroke:#4ade80,color:#e2e8f0
    classDef data fill:#3b3520,stroke:#fbbf24,color:#e2e8f0
    classDef async fill:#3b1f5e,stroke:#c084fc,color:#e2e8f0

Step-by-step flow:

Clustering Service consumes a new article from Kafka
Generates a text embedding (vector) from the article’s title + first paragraph
Queries Embedding Store: “find clusters whose centroid has a cosine similarity above 0.85 with this article”
Match found? → Add article to that cluster. Update cluster metadata (source count, freshness, representative headline if this source is more authoritative).
No match? → Create a new cluster with this article as the seed. Store its embedding as the cluster centroid.
Assign topic(s) to the cluster based on content classification (Politics, Tech, Sports, etc.)
Update freshness score: score = article_count * recency_weight (more sources + newer = hotter story)
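
The freshness formula in the last step leaves `recency_weight` undefined. One plausible choice, shown here purely as an assumption, is exponential decay with a half-life; the 6-hour half-life and the function names are illustrative, not values from the design.

```python
def recency_weight(age_hours: float, half_life_hours: float = 6.0) -> float:
    """Exponential decay: the weight halves every `half_life_hours` (assumed form)."""
    return 0.5 ** (age_hours / half_life_hours)


def freshness_score(
    article_count: int,
    hours_since_last_article: float,
    half_life_hours: float = 6.0,
) -> float:
    """score = article_count * recency_weight, as in the step above."""
    return article_count * recency_weight(hours_since_last_article, half_life_hours)


# A 200-source story that went quiet two days ago vs a fresh 10-source story.
print(freshness_score(200, hours_since_last_article=48))
print(freshness_score(10, hours_since_last_article=0))
```

With this shape, a large but stale cluster eventually drops below a small but active one, which matches the "more sources + newer = hotter" intent.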

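Why embeddings over keyword matching? “Biden wins election” and “US Presidential race results announced” are about the same event but share few keywords. Semantic embeddings capture meaning, not just words, so a good sentence-embedding model should place these two headlines close together. The exact similarity depends on the model, so the 0.85 threshold needs tuning on labeled pairs.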

FR3: Serve Personalized Feed

When a user opens the app, they need a ranked feed of story clusters tailored to their interests. A tech enthusiast in Bangalore should see different stories than a sports fan in Mumbai - even during the same news cycle.

New components:

Feed Service - the API layer users hit. Fetches candidate stories, applies personalization ranking, returns the final feed.
User Profile Store (Redis) - stores each user’s interests, reading history (last 100 stories read), location, and language.
Feed Cache (Redis) - pre-computed feeds for active users. Refreshed every 5-10 minutes. Avoids re-ranking on every request.
Ranking Service - scores each candidate story for a specific user based on: topic relevance, freshness, source authority, diversity (don’t show 5 politics stories in a row).

flowchart LR
    USER["User"]:::client
    GW["API Gateway"]:::edge
    FEED["Feed Service"]:::service
    CACHE[("Feed Cache<br/>Redis")]:::data
    RANK["Ranking Service"]:::service
    PROFILE[("User Profile<br/>Redis")]:::data
    CLUSTERDB[("Cluster DB")]:::data

    USER --> GW
    GW --> FEED
    FEED --> CACHE
    FEED --> RANK
    RANK --> PROFILE
    RANK --> CLUSTERDB

    classDef client fill:#4c3a5e,stroke:#818cf8,color:#e2e8f0
    classDef edge fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    classDef service fill:#1a3a2a,stroke:#4ade80,color:#e2e8f0
    classDef data fill:#3b3520,stroke:#fbbf24,color:#e2e8f0

Step-by-step flow:

User opens app → GET /v1/feed?userId=42
Feed Service checks Feed Cache: is there a fresh pre-computed feed? (< 5 min old)
Cache hit? → Return immediately. Sub-10ms.
Cache miss? → Call Ranking Service to build a fresh feed:
- Fetch top 500 active story clusters from Cluster DB (sorted by freshness + article count)
- Fetch user profile: interests, reading history, location
- Score each cluster: score = w1*topic_match + w2*freshness + w3*source_authority + w4*diversity_penalty
- Filter out stories user already read (from reading history)
- Return top 50 ranked clusters
Cache the result for this user (TTL = 5 min)
Return feed to user
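
The cache-hit and cache-miss branches above follow the cache-aside pattern. The sketch below uses an in-memory dictionary as a stand-in for Redis so that it runs on its own; `FeedCache`, `get_feed`, and the injectable `clock` are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Optional


class FeedCache:
    """In-memory stand-in for the Redis feed cache, with a per-entry TTL."""

    def __init__(self, ttl_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: dict[int, tuple[float, list[str]]] = {}

    def get(self, user_id: int) -> Optional[list[str]]:
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        stored_at, feed = entry
        if self._clock() - stored_at >= self._ttl_s:
            del self._entries[user_id]      # expired: treat as a miss
            return None
        return feed

    def put(self, user_id: int, feed: list[str]) -> None:
        self._entries[user_id] = (self._clock(), feed)


def get_feed(user_id: int, cache: FeedCache,
             rank: Callable[[int], list[str]]) -> list[str]:
    """Cache-aside: serve from cache, otherwise rank, store, and return."""
    feed = cache.get(user_id)
    if feed is None:
        feed = rank(user_id)                # expensive path (Ranking Service)
        cache.put(user_id, feed)
    return feed
```

Passing the clock in makes the TTL behavior easy to test without sleeping; in Redis the same effect comes from setting an expiry on the key.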

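Why cache feeds? At 50K feed requests/sec, running the ranking model on every request is expensive. A user who reopens the app within the 5-minute TTL gets the cached feed, and pre-computing feeds for active users (as the Feed Cache does) pushes the hit rate higher still. The goal is for the large majority of requests to be served without running the ranker; the actual hit rate depends on how often users return within the TTL.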

6.5. Core Flows

Flow 1: Article Ingestion (Breaking News)

sequenceDiagram
    participant S as Crawl Scheduler
    participant W as Crawler Worker
    participant E as Extractor
    participant K as Kafka
    participant C as Clustering Service
    participant DB as Cluster DB

    S->>W: Crawl bbc.com/rss (high priority)
    W->>W: Fetch RSS, find 3 new articles
    W->>E: Parse article content
    E-->>K: Publish to articles.raw
    K->>C: Consume new article
    C->>C: Generate embedding
    C->>C: Find matching cluster (cosine > 0.85)
    alt Cluster exists
        C->>DB: Add article to cluster, update freshness
    else New story
        C->>DB: Create new cluster
    end

Non-obvious failure: If the Clustering Service is slow during a breaking news spike (100 articles/min about the same event), articles queue in Kafka. This is fine - Kafka handles backpressure naturally. The feed might show the story 30-60 seconds later than ideal, but no data is lost.

Flow 2: Personalized Feed Load

sequenceDiagram
    participant U as User
    participant F as Feed Service
    participant Cache as Feed Cache
    participant R as Ranking Service
    participant P as User Profile
    participant DB as Cluster DB

    U->>F: GET /v1/feed
    F->>Cache: Check cache for user:42
    alt Cache hit (< 5min old)
        Cache-->>F: Cached feed
        F-->>U: Return feed (sub-10ms)
    else Cache miss
        F->>R: Rank stories for user:42
        R->>P: Get interests + history
        R->>DB: Get top 500 active clusters
        R->>R: Score and rank
        R-->>F: Top 50 clusters
        F->>Cache: Store (TTL 5min)
        F-->>U: Return feed
    end

7. Deep Dives

Deep Dive 1: Story Clustering - Detecting Same Event Across Sources

Problem: 50 publishers write about the same event with different headlines, different angles, different details. We need to detect they’re the same “story” and group them.

Bad: Keyword matching. “Biden” AND “election” → same cluster. Fails because: “Biden election victory” and “Biden election campaign funding scandal” are completely different stories sharing the same keywords.

Good: TF-IDF cosine similarity on article titles. Compute term-frequency vectors, compare cosine similarity. Threshold > 0.7 = same cluster. Works for obvious duplicates but misses paraphrased content (“Stock market crashes” vs “Wall Street sees worst day in a decade”).
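
To see why TF-IDF misses paraphrases, here is a small standard-library sketch. It uses a smoothed IDF variant and whitespace tokenization, both simplifying assumptions; the helper names are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(titles: list[str]) -> list[dict[str, float]]:
    """Sparse TF-IDF vectors using a smoothed IDF: ln((1+n)/(1+df)) + 1."""
    docs = [t.lower().split() for t in titles]
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


titles = [
    "Stock market crashes",
    "Wall Street sees worst day in a decade",
    "Stock market crashes as investors panic",
]
v = tfidf_vectors(titles)
print(cosine(v[0], v[1]))   # paraphrase with no shared words
print(cosine(v[0], v[2]))   # near-duplicate with shared words
```

The paraphrased pair shares no tokens, so its similarity is exactly zero no matter how the threshold is tuned. That gap is what the embedding approach below is meant to close.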

Great: Sentence embeddings (BERT/sentence-transformers) + incremental clustering.

Each article’s title + first paragraph → 768-dim vector via a pre-trained model
New article’s vector compared against all active cluster centroids using approximate nearest neighbor (FAISS or Redis Vector Search)
If cosine similarity > 0.85 → assign to cluster. Update centroid as running average.
If no match → new cluster.
Clusters decay: if no new article joins for 24h, cluster moves to archive.
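
The assignment and centroid-update steps above can be sketched as follows. Two simplifications are assumed: embeddings are passed in as plain lists (the model call is out of scope), and the nearest centroid is found by a linear scan rather than an ANN index such as FAISS or Redis Vector Search. `IncrementalClusterer` is a hypothetical name.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalClusterer:
    """Assign each vector to the nearest centroid above a threshold, else start a new cluster."""

    def __init__(self, threshold: float = 0.85) -> None:
        self.threshold = threshold
        self.centroids: list[list[float]] = []
        self.counts: list[int] = []

    def assign(self, vec: list[float]) -> int:
        best_id, best_sim = -1, self.threshold
        for cid, centroid in enumerate(self.centroids):   # linear scan stands in for ANN
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id == -1:                                  # no match: seed a new cluster
            self.centroids.append(list(vec))
            self.counts.append(1)
            return len(self.centroids) - 1
        n = self.counts[best_id]                           # match: running-average update
        self.centroids[best_id] = [
            (c * n + x) / (n + 1) for c, x in zip(self.centroids[best_id], vec)
        ]
        self.counts[best_id] = n + 1
        return best_id


clusterer = IncrementalClusterer()
print(clusterer.assign([1.0, 0.0, 0.0]))   # seeds cluster 0
print(clusterer.assign([0.9, 0.1, 0.0]))   # close to cluster 0
print(clusterer.assign([0.0, 1.0, 0.0]))   # unrelated: new cluster
```

The running average keeps the centroid update at O(d) per article with no need to re-read earlier members. The trade-off is that a cluster can drift if early assignments were wrong.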

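Latency: Embedding generation ~10ms (GPU), ANN search ~5ms, total clustering decision < 20ms per article. At 20ms each, one worker processes about 50 articles/sec, while 1M articles/day averages about 12/sec, so a single machine can keep up with average load. Breaking-news bursts are buffered in Kafka or absorbed by adding consumers.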

Deep Dive 2: Breaking News - How to Surface Events in Under 5 Minutes

Problem: A major event happens. The first publisher posts about it. Our crawler might not check that source for another 10 minutes. By then, users have already seen it on Twitter.

Bad: Crawl all 50K sources every 5 minutes. At 50K sources with a 2s average response time, one full pass costs 100K seconds of fetch time. Even spread across 100 parallel workers, that is 1,000 seconds (~17 minutes) per cycle, so the 5-minute target is missed, and most of the work is wasted on sources that rarely update.

Good: Adaptive crawl frequency. Track how often each source publishes. BBC publishes every 2 minutes → crawl every 3 min. A local blog publishes weekly → crawl every 6 hours. Prioritize sources by historical freshness.
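
One simple way to turn a source's publish history into a crawl interval is to take the mean gap between its recent articles and clamp it. The 3-minute floor and 6-hour ceiling below mirror the examples in the paragraph above; the function name and the choice of a plain mean are assumptions.

```python
def crawl_interval_s(
    publish_times_s: list[float],
    min_s: float = 180.0,          # 3 minutes, floor for very active sources
    max_s: float = 6 * 3600.0,     # 6 hours, ceiling for rarely updated sources
) -> float:
    """Crawl interval = mean gap between recent publishes, clamped to [min_s, max_s]."""
    if len(publish_times_s) < 2:
        return max_s               # not enough history: poll conservatively
    ordered = sorted(publish_times_s)
    mean_gap = (ordered[-1] - ordered[0]) / (len(ordered) - 1)
    return min(max(mean_gap, min_s), max_s)


busy_wire = [0, 120, 240, 360]      # publishes every 2 minutes
weekly_blog = [0, 7 * 24 * 3600]    # publishes once a week
print(crawl_interval_s(busy_wire))
print(crawl_interval_s(weekly_blog))
```

The clamp is what keeps this safe: a source that suddenly goes quiet is still revisited within 6 hours, and a very chatty source cannot push the crawler below the politeness floor.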

Great: Adaptive crawling + webhook push + velocity detection.

Push for top publishers: Major publishers (Reuters, AP, BBC) send webhooks when they publish. Instant - zero crawl delay.
Adaptive polling for the rest: Crawl frequency = f(publish_rate). High-velocity sources crawled every 3-5 min, low-velocity every 1-6 hours.
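Velocity spike detection: If the clustering service sees new clusters being created, or a single cluster growing, far faster than the historical baseline for that time window (for example, several times the usual 5-minute rate), trigger an emergency re-crawl of the top-100 sources. Something big is happening. A fixed count such as “10+ new clusters in 5 minutes” is too low to be a signal here: 500K clusters/month already averages about 58 new clusters every 5 minutes.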
Breaking news flag: Stories with cluster growth rate > 20 articles/hour get flagged as “Breaking” and boosted to the top of all feeds regardless of personalization.
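
One way to make "unusual" concrete is to compare the current window against a historical baseline, in the spirit of the trend-detection approach cited under Prior Art. The sketch below uses a z-score over recent 5-minute windows; the threshold of 3 and the sample counts are illustrative assumptions, not tuned values.

```python
import statistics


def is_velocity_spike(
    baseline_counts: list[int],
    current_count: int,
    z_threshold: float = 3.0,
) -> bool:
    """True when the current window is far above the historical baseline."""
    if len(baseline_counts) < 2:
        return False                            # not enough history to judge
    mean = statistics.fmean(baseline_counts)
    stdev = statistics.stdev(baseline_counts)
    if stdev == 0:
        return current_count > mean
    return (current_count - mean) / stdev >= z_threshold


# New clusters created per 5-minute window over the recent past (made-up numbers).
baseline = [55, 60, 58, 62, 57, 59]
print(is_velocity_spike(baseline, 61))    # ordinary fluctuation
print(is_velocity_spike(baseline, 120))   # roughly double the usual rate
```

A relative test like this adapts to time of day and overall growth, whereas a fixed count either fires constantly or never fires as volume changes.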

Deep Dive 3: Feed Ranking - Personalization Without Being a Filter Bubble

Problem: Pure personalization creates filter bubbles - a user who reads only tech news never sees important political events. Pure chronological is noisy - most stories aren’t relevant to any specific user.

Bad: Sort by publish time only. User drowns in irrelevant content.

Good: Topic-based filtering. User follows “Tech” and “Sports” → only show stories with those topics. Simple but misses cross-topic stories the user might care about and provides no ranking within a topic.

Great: Multi-signal scoring with diversity constraints.

Scoring formula per story cluster for a user:

score = 0.3 * topic_relevance
      + 0.25 * freshness_decay
      + 0.2 * story_importance (source_count * authority_avg)
      + 0.15 * engagement_signals (CTR from similar users)
      + 0.1 * diversity_bonus (penalize 3rd story on same topic)
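
As a minimal sketch, the formula above translates to a weighted sum over per-story signals. It assumes every signal is already normalized to [0, 1]; the `story_importance` normalization (capping the source count at 200) is a hypothetical choice made so that `source_count * authority_avg` stays in that range.

```python
from dataclasses import dataclass

# Weights copied from the formula above; they sum to 1.0.
WEIGHTS = {
    "topic_relevance": 0.30,
    "freshness_decay": 0.25,
    "story_importance": 0.20,
    "engagement_signals": 0.15,
    "diversity_bonus": 0.10,
}


@dataclass(frozen=True)
class Signals:
    topic_relevance: float
    freshness_decay: float
    story_importance: float
    engagement_signals: float
    diversity_bonus: float


def story_importance(source_count: int, authority_avg: float,
                     source_cap: int = 200) -> float:
    """Assumed normalization: cap the source count, then scale by average authority."""
    return min(source_count, source_cap) / source_cap * authority_avg


def score(signals: Signals) -> float:
    return sum(weight * getattr(signals, name) for name, weight in WEIGHTS.items())


niche_tech = Signals(0.9, 0.5, story_importance(40, 1.0), 0.4, 1.0)
major_event = Signals(0.1, 1.0, story_importance(200, 1.0), 0.8, 1.0)
print(score(niche_tech), score(major_event))
```

With these weights, a major, fresh story can outrank a better topic match, which is the intended counterweight to pure personalization.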

Diversity constraint: After ranking, apply a post-processing pass:

No more than 2 consecutive stories from the same topic
At least 1 “serendipity” story per page (topic the user doesn’t usually read, but is nationally important)
Breaking news always ranks in top 3 regardless of personalization
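
The first constraint (no more than 2 consecutive stories on one topic) can be applied as a greedy pass over the ranked list. This sketch covers only that rule; serendipity and breaking-news pinning would be separate passes. The function name and the `(story_id, topic)` tuple shape are assumptions.

```python
def enforce_topic_diversity(
    ranked: list[tuple[str, str]],   # (story_id, topic), best first
    max_run: int = 2,
) -> list[tuple[str, str]]:
    """Greedy re-order so that no topic appears more than `max_run` times in a row."""
    remaining = list(ranked)
    result: list[tuple[str, str]] = []
    while remaining:
        pick = 0
        last_topics = {topic for _, topic in result[-max_run:]}
        if len(result) >= max_run and len(last_topics) == 1:
            blocked = result[-1][1]
            # Take the highest-ranked story on a different topic, if one exists.
            pick = next(
                (i for i, (_, topic) in enumerate(remaining) if topic != blocked),
                0,
            )
        result.append(remaining.pop(pick))
    return result


ranked = [("a", "politics"), ("b", "politics"), ("c", "politics"),
          ("d", "tech"), ("e", "politics")]
print(enforce_topic_diversity(ranked))
```

When only one topic is left, the pass falls back to the original order rather than dropping stories, so the feed length never shrinks.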

Cold start (new users): Use location + language to serve a “trending in your region” feed. After 10 clicks, enough signal to personalize.

Deep Dive 4: Handling Traffic Spikes During Breaking Events

Problem: Normal peak traffic is about 50K QPS. During election night or a natural disaster, traffic spikes to 500K QPS in minutes. The same 3 stories are requested by everyone simultaneously.

Bad: Every user’s feed request triggers a fresh ranking computation. At 500K QPS, ranking service melts.

Good: Feed cache with 5-min TTL absorbs most reads. But during breaking news, users want the LATEST - a 5-min-old cache feels stale.

Great: Tiered caching + push invalidation.

Global trending cache: Top 10 stories for each region, updated every 30 seconds. Served to users whose personal feed cache is stale. Super cheap (one cache entry per region, millions of reads).
Breaking news override: When a story is flagged “Breaking,” it’s injected at the top of ALL cached feeds without regenerating the entire feed.
Graceful degradation: If ranking service is overloaded, fall back to the global trending feed + user’s topic preferences (simple filter, no ML ranking). “Good enough” feed in 5ms vs perfect feed timing out.
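
The breaking-news override and the degradation path can be sketched as two small functions. Both are illustrative: the names are hypothetical, a `TimeoutError` stands in for whatever overload signal the ranking client raises, and the fallback puts followed topics first instead of filtering strictly, so the feed is never empty.

```python
from typing import Callable

Story = tuple[str, str]   # (cluster_id, topic)


def inject_breaking(cached_feed: list[str], breaking_ids: list[str],
                    max_len: int = 50) -> list[str]:
    """Pin breaking clusters to the top of a cached feed without re-ranking it."""
    pinned = list(dict.fromkeys(breaking_ids))     # de-duplicate, keep order
    pinned_set = set(pinned)
    rest = [cid for cid in cached_feed if cid not in pinned_set]
    return (pinned + rest)[:max_len]


def feed_with_fallback(rank: Callable[[], list[Story]],
                       trending: list[Story],
                       followed_topics: set[str]) -> list[Story]:
    """Use the ranked feed when available, else trending stories with followed topics first."""
    try:
        return rank()
    except TimeoutError:                           # stand-in for "ranking overloaded"
        followed = [s for s in trending if s[1] in followed_topics]
        others = [s for s in trending if s[1] not in followed_topics]
        return followed + others


print(inject_breaking(["a", "b", "c"], ["c", "x"]))
```

Both paths are pure list operations over data that is already cached, which is what keeps them cheap enough to run at spike traffic.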

7.5. Design Self-Audit

Question	Answer
Single points of failure?	Kafka is replicated. Crawlers are stateless. Cluster DB has read replicas. Feed cache is Redis Cluster.
Stale content?	Feed cache TTL = 5 min. Breaking news bypasses cache. Acceptable for a news feed.
Duplicate articles?	URL-level dedup at ingestion + semantic clustering catches paraphrases.
Hot stories?	Trending cache absorbs 95% of reads for popular stories.
Publisher goes down?	Crawler retries with backoff. Missing one crawl cycle is acceptable.

8. Final Architecture

flowchart LR
    SOURCES["50K Publishers"]:::external
    SCHED["Crawl Scheduler"]:::service
    WORKERS["Crawler Pool"]:::service
    KF["Kafka"]:::async
    CLUSTER["Clustering Service"]:::service
    EMBED[("Embeddings<br/>Redis Vectors")]:::data
    CLUSTERDB[("Cluster DB<br/>Postgres")]:::data
    ARTICLES[("Article Store<br/>Cassandra")]:::data
    FEED["Feed Service"]:::service
    RANK["Ranking Service"]:::service
    CACHE[("Feed Cache<br/>Redis")]:::data
    PROFILE[("User Profiles<br/>Redis")]:::data
    USER["Users"]:::client
    GW["API Gateway"]:::edge

    SOURCES --> WORKERS
    SCHED --> WORKERS
    WORKERS --> KF
    KF --> CLUSTER
    KF --> ARTICLES
    CLUSTER --> EMBED
    CLUSTER --> CLUSTERDB
    USER --> GW
    GW --> FEED
    FEED --> CACHE
    FEED --> RANK
    RANK --> CLUSTERDB
    RANK --> PROFILE

    classDef client fill:#4c3a5e,stroke:#818cf8,color:#e2e8f0
    classDef edge fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    classDef service fill:#1a3a2a,stroke:#4ade80,color:#e2e8f0
    classDef data fill:#3b3520,stroke:#fbbf24,color:#e2e8f0
    classDef async fill:#3b1f5e,stroke:#c084fc,color:#e2e8f0
    classDef external fill:#4a1942,stroke:#f472b6,color:#e2e8f0

Key Technologies

Term	What it is
Sentence Embeddings	ML models (BERT, sentence-transformers) that convert text into fixed-size vectors capturing semantic meaning. Similar texts have high cosine similarity.
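Approximate Nearest Neighbor (ANN)	Index structures (such as HNSW graphs) and libraries (such as FAISS) that find similar vectors without comparing against every stored vector, trading a little recall for speed. Query cost grows sublinearly (roughly O(log N) for HNSW) instead of the O(N) of a brute-force scan.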
Story Clustering	Grouping articles about the same event. The cluster has one headline, N sources, a freshness score, and decays over time.
Adaptive Crawling	Adjusting crawl frequency per source based on how often they actually publish. Saves resources, improves freshness for active sources.
Feed Ranking	Multi-signal scoring that balances personalization, freshness, importance, and diversity to produce a ranked feed.
Cascade Ranking	Two-stage: lightweight filter (1000→100) then expensive ML ranker (100→50). Saves compute.

What’s Expected at Each Level

Mid-level

Design the basic pipeline: crawl sources, store articles, serve chronologically. Propose RSS parsing and a database. With prompting, recognize the deduplication problem (same story from multiple sources). Propose keyword matching or URL-based dedup.

Senior

Proactively identify story clustering as the core challenge. Propose embedding-based similarity for grouping articles. Design the feed with personalization (topic preferences + freshness). Discuss adaptive crawl scheduling and why uniform polling wastes resources. Explain caching strategy for feed reads.

Staff+

Address breaking news latency (webhook push + velocity detection + emergency re-crawl). Discuss feed diversity constraints to avoid filter bubbles. Propose cascade ranking (lightweight filter → ML ranker) for cost efficiency. Cover graceful degradation during traffic spikes (fall back to trending feed). Discuss cold-start personalization and the tension between engagement optimization and editorial quality.

Key Takeaways

Story clustering with embeddings groups 200 articles about the same event into one card
Adaptive crawling balances freshness vs resource cost across 50K sources
Feed cache (5-min TTL) absorbs most read traffic without re-ranking
Breaking news bypass injects urgent stories into cached feeds without full regeneration
Diversity constraints prevent filter bubbles while still personalizing

Related Designs

Twitter Feed - fan-out and personalized timeline ranking
Notification System - multi-channel delivery for breaking news alerts
Instagram - media-heavy feed with CDN and ranking
