Design a Web Crawler

An end-to-end system design interview walkthrough covering URL frontier management, async fetching, headless rendering, deduplication with Bloom filters and SimHash, politeness policies, storage at petabyte scale, and adaptive re-crawling strategies.

Problem Statement & Requirements

Requirements & Scale Estimation

The Basics

System Architecture & Data Flow

Deep Dive: URL Frontier

Priority Queues, BFS vs DFS, Back Queues

Deep Dive: Fetching

Async I/O, Semaphores, DNS & Timeouts

Deep Dive: Parsing

Parsing Strategies & URL Resolution

Deep Dive: Rendering

JS Rendering Tier, Playwright, CDP, Cost

Deep Dive: Deduplication

Bloom Filters, SimHash, Canonicalization

Deep Dive: Politeness

Crawl-Delay, robots.txt, Per-Host Rate Limits

Deep Dive: Storage

Content Storage, Metadata DB, Inverted Index

Deep Dive: Fault Tolerance

Retries, Redirect Loops, Observability

Deep Dive: Freshness

Change Detection, ETags, Adaptive Scheduling