Cursor · March 2025

How Cursor Built Fast Regex Search with N-Gram Indexing

10 min read · Advanced · Search · Indexing · System Design · Performance · Developer Tools

Cursor's AI-powered code editor needs sub-second regex search across entire codebases to keep agents productive. Ripgrep alone takes 15+ seconds on large monorepos because it scans every file. This post covers how Cursor built a client-side n-gram inverted index that narrows candidates before ripgrep runs — covering trigrams, bloom filter masks, sparse n-grams with frequency-based weight functions, memory-mapped file formats, and why the index lives on the client rather than a server.

TL;DR

  • Ripgrep scans every file sequentially — in large monorepos this routinely takes 15+ seconds, which is unacceptable for agentic workflows where search runs continuously.
  • A trigram inverted index maps every 3-character sequence to the files containing it; querying intersects posting lists to narrow candidates before running ripgrep on just those files.
  • GitHub's "Project Blackbird" approach adds bloom filter masks (locMask/nextMask) to each trigram entry, enabling "3.5-gram" matching without the index bloat of full quadgrams.
  • Sparse n-grams use a deterministic weight function (e.g. character-pair frequency) to extract variable-length tokens at natural boundaries — dramatically reducing both index size and query-time posting list intersections.
  • The index is client-side (not server-side) for four reasons: completeness (regex needs actual file content for final verification), latency, freshness (stale indexes cause agents to speculate), and privacy.

Why this matters for interviews

Search infrastructure is a recurring system design topic — from autocomplete to full-text search to code search. This post gives you concrete indexing techniques (inverted indexes, n-grams, bloom filters) with real tradeoffs. Use these when asked to design a search system, explain indexing strategies, or discuss client-side vs. server-side architecture tradeoffs.

Breakdown

1. The problem: grep is too slow for agents

Ripgrep is fast at pattern matching within individual files, but it must scan every file in the repository sequentially. For large monorepos, this means 15+ second search times. In traditional development, a 15-second search is annoying but tolerable. For AI agents that search continuously as part of their workflow — investigating bugs, finding references, exploring unfamiliar code — it becomes a critical bottleneck. Each search blocks the agent's next action, and agents may issue dozens of searches per task. The cumulative latency turns what should be a fast iteration cycle into minutes of waiting.

Interview angle: When designing search systems, the first question is whether you can afford a full scan or need an index. This is a clean example of when brute-force scanning hits a wall: when the consumer is automated and issues many queries in tight succession. In interviews, frame the problem as "at what query volume does scanning become unacceptable?" — it's not just about data size, it's about query frequency.

2. Inverted indexes and trigrams

An inverted index is a data structure that maps tokens to the documents containing them — like the index at the back of a textbook. For regex search, the question is what tokens to use. Cursor uses n-grams: overlapping character sequences of length n extracted from every file. Trigrams (n=3) are the sweet spot. Bigrams (n=2) produce posting lists that are too large — most 2-character pairs appear in nearly every file, so the index doesn't filter well. Quadgrams (n=4) create too many unique keys, bloating the index. During indexing, every overlapping 3-character sequence from every file becomes an entry. During querying, a regex pattern is decomposed into its constituent trigrams, and the posting lists for those trigrams are intersected to find candidate files that contain all required sequences.
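The index-then-intersect flow above can be sketched in a few lines. This is a simplified illustration, not Cursor's implementation: it indexes whole-file strings, and it queries with a plain literal rather than decomposing a full regex into its required trigrams.

```python
from collections import defaultdict

def trigrams(text):
    """Extract every overlapping 3-character sequence from text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(files):
    """Inverted index: map each trigram to the set of file IDs containing it."""
    index = defaultdict(set)
    for file_id, content in files.items():
        for gram in trigrams(content):
            index[gram].add(file_id)
    return index

def candidates(index, literal):
    """Intersect posting lists for the literal's trigrams.
    A file missing any required trigram cannot match the literal."""
    grams = trigrams(literal)
    if not grams:
        return set()  # pattern too short to filter; would need a full scan
    result = None
    for gram in grams:
        posting = index.get(gram, set())
        result = posting if result is None else result & posting
        if not result:
            break  # intersection already empty
    return result
```

The surviving candidate set is what would then be handed to ripgrep for the final verification pass.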

Interview angle: The n-gram size tradeoff is a classic precision-vs-recall-vs-cost question. Smaller n-grams give high recall but low precision (too many candidates). Larger n-grams give high precision but cost more storage and miss partial matches. Interviewers may ask you to reason about this tradeoff for different data types — the trigram sweet spot applies specifically to source code, where character diversity is high enough to make 3-grams discriminative.

3. Trigram queries with probabilistic masks

Plain trigram indexes have a precision problem: a 3-character match is often not selective enough. GitHub's Project Blackbird approach augments each trigram posting list entry with two 8-bit bloom filter masks. The locMask encodes positions where the trigram appears within a file (position mod 8). The nextMask is a bloom filter of all characters that immediately follow that trigram in the file. Together, these enable "3.5-gram" matching — you get most of the filtering power of quadgrams without actually storing quadgram keys. Adjacent trigrams can be verified by rotating one trigram's locMask left by one bit and intersecting with the next trigram's locMask to confirm they appear at adjacent positions. The tradeoff: bloom filters saturate as files are modified, gradually losing their filtering power and requiring periodic rebuilds.
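A toy version of the two masks can make the rotate-and-intersect trick concrete. Everything here is an assumption for illustration: 8-bit masks, `position mod 8` for locMask, and a single hash (`ord(char) mod 8`) for nextMask, which is not necessarily the hashing Blackbird uses.

```python
def masks_for(content, gram):
    """Hypothetical 8-bit locMask / nextMask for one trigram in one file:
    locMask sets bit (position mod 8) for each occurrence; nextMask sets
    a bit derived from each character that immediately follows."""
    loc, nxt = 0, 0
    start = content.find(gram)
    while start != -1:
        loc |= 1 << (start % 8)
        if start + 3 < len(content):
            nxt |= 1 << (ord(content[start + 3]) % 8)
        start = content.find(gram, start + 1)
    return loc, nxt

def rotl8(mask):
    """Rotate an 8-bit mask left by one position."""
    return ((mask << 1) | (mask >> 7)) & 0xFF

def may_be_adjacent(loc_a, loc_b):
    """If trigram B follows trigram A, rotating A's locMask left by one
    must overlap B's locMask. Zero overlap rules the file out for sure;
    nonzero overlap is only a 'maybe' (bloom-style false positives)."""
    return rotl8(loc_a) & loc_b != 0
```

Because positions are folded mod 8, two occurrences 8 apart collide onto the same bit, which is exactly the saturation behavior described above.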

Interview angle: Bloom filters are a frequent interview topic, but candidates rarely discuss their use inside inverted indexes. This is a concrete example of augmenting an index with probabilistic metadata to improve precision without increasing key cardinality. The saturation tradeoff is important: bloom filters have no false negatives, but their false-positive rate grows over time, which means the index stays correct but gets slower.

4. Sparse n-grams

Instead of extracting every overlapping trigram, sparse n-grams use a deterministic weight function to select variable-length tokens at natural boundaries. A weight function (such as CRC32 or character-pair frequency) assigns a score to each position in the text. An n-gram boundary is placed wherever the weight exceeds all interior weights — this produces longer n-grams in common regions and shorter ones in rare regions. At query time, the same weight function generates a minimal "covering" set of n-grams from the search pattern. A frequency-based weight function assigns high weights to rare character pairs, which produces fewer but more discriminative n-grams compared to uniform trigrams. This dramatically reduces both the number of index lookups needed per query and the size of posting lists that must be intersected.
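A toy sketch of frequency-weighted boundary selection follows. The tiny `PAIR_FREQ` table and the local-maximum cut rule are stand-ins for the real weight function and boundary condition; the point is only the mechanism: rare pairs get high weight, and boundaries land where weight peaks.

```python
# Assumed toy frequency table: common character pairs get high counts.
PAIR_FREQ = {"th": 50, "he": 45, "in": 40, "er": 35, "re": 30, "or": 25}

def weight(a, b):
    """Rare pairs score high; unseen pairs are treated as maximally rare."""
    return 1.0 / (1 + PAIR_FREQ.get(a + b, 0))

def sparse_grams(text):
    """Cut the text at local maxima of the pair-weight sequence, yielding
    variable-length grams: long grams across common regions, boundaries
    at rare pairs. A simplified sketch, not Cursor's exact algorithm."""
    if len(text) < 3:
        return [text]
    w = [weight(text[i], text[i + 1]) for i in range(len(text) - 1)]
    cuts = [0]
    for i in range(1, len(w) - 1):
        if w[i] > w[i - 1] and w[i] >= w[i + 1]:  # local max -> boundary
            cuts.append(i + 1)
    cuts.append(len(text))
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]
```

With this table, "therefore" splits at the rare pair "ef" into `["there", "fore"]` — two discriminative grams instead of seven overlapping trigrams, which is the lookup reduction the section describes.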

Interview angle: Sparse n-grams are an advanced indexing technique that shows deep understanding of search infrastructure. The key insight is that not all n-grams are equally useful — rare character sequences are far more discriminative than common ones. In interviews, this maps to the broader principle of information-theoretic token selection: index the tokens that carry the most information, not all tokens uniformly.

5. Why client-side indexing

Cursor chose to build and store the index on the user's machine rather than a server, for four reasons. First, completeness: regex search ultimately needs the actual file content for final pattern verification, and keeping files synchronized between client and server adds complexity and failure modes. Second, latency: Cursor's agent operates at high tokens-per-second and issues searches in tight loops — network roundtrips for each search would negate the speed gains from indexing. Third, freshness: unlike semantic search indexes that tolerate some staleness, regex search must reflect the latest file state or agents will miss recently written code and waste time speculating. Fourth, privacy: client-side indexing avoids sending user codebases to a centralized server.

Interview angle: Client-side vs. server-side is a fundamental architecture decision in search. Most candidates default to server-side because that's what web search engines do. Being able to articulate when client-side is better — low-latency requirements, freshness constraints, privacy sensitivity, need for complete local data — shows architectural maturity. This maps to broader distributed systems questions about where to place computation relative to data.

6. Implementation: memory-efficient file format

The index is stored in two files on disk. The lookup table is a sorted array of entries mapping n-gram hashes to offsets in the postings file. It is memory-mapped (mmap), so the OS loads only the pages that are actually accessed rather than reading the entire table into memory. Lookups use binary search on the hash values to find the offset, then seek directly to that position in the postings file to read the posting list. The postings file stores all posting lists sequentially on disk. Importantly, only the n-gram hashes are stored — not the full n-gram strings. Hash collisions are safe because they only broaden a posting list (adding extra candidate files), never causing false negatives. A file that shouldn't match might appear as a candidate, but it will be filtered out during the final ripgrep verification pass. Index freshness is keyed to Git commits, with user and agent changes layered on top as incremental updates.
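The lookup-table half can be sketched as a sorted array of fixed-size records that is memory-mapped and binary-searched. The 16-byte little-endian `(hash, offset)` layout here is an assumption for illustration, not Cursor's actual on-disk format.

```python
import mmap
import struct

ENTRY = struct.Struct("<QQ")  # assumed record: (ngram hash, postings offset)

def write_lookup_table(path, entries):
    """Write a sorted array of fixed-size (hash, offset) entries."""
    with open(path, "wb") as f:
        for h, off in sorted(entries):
            f.write(ENTRY.pack(h, off))

def find_offset(path, target_hash):
    """mmap the table and binary-search it by hash; the OS pages in only
    the few entries the search actually touches, not the whole table."""
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lo, hi = 0, len(mm) // ENTRY.size
        while lo < hi:
            mid = (lo + hi) // 2
            h, off = ENTRY.unpack_from(mm, mid * ENTRY.size)
            if h == target_hash:
                return off  # seek here in the postings file
            if h < target_hash:
                lo = mid + 1
            else:
                hi = mid
        return None
```

Storing only hashes keeps records fixed-width, which is what makes the binary search over a raw mmap possible in the first place.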

[Figure] Architecture diagram: the n-gram indexing pipeline, query flow with posting-list intersection, and client-side architecture with memory-mapped access.
Top: source files are tokenized into n-grams and stored in a lookup table + postings file. Middle: queries decompose into n-grams, intersect posting lists, and verify candidates with ripgrep. Bottom: everything runs client-side with memory-mapped I/O.

Interview angle: Memory-mapped I/O is an underappreciated technique in system design interviews. It lets you work with data structures larger than available RAM while the OS handles paging transparently. The hash collision insight is also important: in systems where false positives are acceptable but false negatives are not (search, spam filters, content deduplication), you can use smaller hashes and simpler data structures because collisions only hurt performance, not correctness.

7. Performance impact and agentic workflows

With the n-gram index in place, regex search in Cursor skips the full-codebase scan entirely. The index narrows candidates to a small subset of files, and ripgrep only runs on those candidates for final verification. In benchmarks on large codebases like Chromium, the "grep phase" duration dropped substantially. For agent workflows specifically, the impact is qualitative, not just quantitative: removing search latency from the agent's iteration loop means it can investigate more hypotheses, explore more of the codebase, and converge on solutions faster. The difference between a 15-second search and a sub-second search is the difference between an agent that feels stuck and one that feels fluid.

Interview angle: This section illustrates a key system design principle: optimizing for the bottleneck in a pipeline. The bottleneck was not pattern matching (ripgrep is already fast at that) but candidate selection (which files to match against). The index moves candidate selection from O(n) file scanning to O(k) posting list intersection where k is the number of query n-grams. In interviews, always identify the actual bottleneck before proposing optimizations.

Key Concepts

Inverted Index

A data structure that maps content tokens (words, n-grams, terms) to the documents or files that contain them. Enables fast lookup of "which documents contain this token?" without scanning every document.

Analogy: Like the index at the back of a textbook — instead of reading every page to find "binary search," you look up the term in the index and jump directly to the relevant pages.

Trigram

A 3-character overlapping subsequence extracted from text. For the string "cursor", the trigrams are "cur", "urs", "rso", "sor". Used as tokens in inverted indexes for regex search because they balance selectivity (filtering power) against index size.

Posting List

The list of documents (or file IDs) associated with a single token in an inverted index. For example, the trigram "err" might have a posting list of all files containing that 3-character sequence. Query processing intersects posting lists for multiple tokens to find candidate documents.

Bloom Filter

A space-efficient probabilistic data structure that tests whether an element is in a set. It can return false positives ("maybe in set") but never false negatives ("definitely not in set"). In Cursor's context, bloom filter masks augment trigram entries to approximate quadgram matching.

Analogy: Like a bouncer with a guest list who might accidentally let in someone not on the list, but will never turn away someone who is on the list. Over time (saturation), the bouncer gets more lenient and lets in more uninvited guests.
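A minimal single-hash, 8-bit variant (mirroring the per-trigram masks above, not a general bloom filter) can be sketched as follows. Note that Python's built-in `hash` on strings is randomized per process, so masks are only comparable within one run.

```python
def bloom_add(mask, item, bits=8):
    """Set the bit chosen by hashing item into the mask (one hash
    function; real bloom filters typically use several)."""
    return mask | (1 << (hash(item) % bits))

def bloom_maybe_contains(mask, item, bits=8):
    """True means 'maybe present' (false positives possible);
    False means 'definitely absent' (no false negatives)."""
    return mask & (1 << (hash(item) % bits)) != 0
```

As more items are added, more of the 8 bits fill in and the "maybe" answers become less informative — the saturation the bouncer analogy describes.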

Sparse N-Gram

A variable-length n-gram extracted at positions determined by a deterministic weight function rather than at every overlapping position. Produces fewer, more discriminative tokens than uniform trigrams by placing boundaries where character-pair rarity is highest.

Memory-Mapped I/O (mmap)

An OS feature that maps a file on disk into virtual memory, allowing programs to access file data as if it were in RAM. The OS loads pages on demand and manages caching transparently. Useful for working with data structures larger than available memory without explicit read/write calls.

Covering Algorithm

At query time, the algorithm that decomposes a search pattern into the minimal set of n-grams needed to look up in the index. Uses the same weight function as the indexing phase to ensure the generated query n-grams match what was indexed. Fewer covering n-grams means fewer posting list intersections.

N-Gram Weight Function

A deterministic function (e.g., CRC32 hash or character-pair frequency table) that assigns a numerical weight to each position in a text. N-gram boundaries are placed at local maxima of this function. Frequency-based weights assign high scores to rare character pairs, producing n-grams that are more selective during search.
