Understanding Regex Search: From grep to Sparse N-grams
An interactive deep-dive into how text search evolved over 50 years — from scanning every file to scanning almost nothing.
When an AI coding agent needs to find something in your codebase — say, every file that defines getUserById — it runs a search. In a small project, that’s instant. In a monorepo with half a million files, it can take 15 seconds. During that time, the agent just… waits.
This post traces how developers solved that problem over 50 years — from the earliest Unix tools to the indexing algorithms used by Cursor, GitHub Code Search, and others today. Each concept builds on the last, and you can interact with every one of them.
1. grep — where it all started
In 1973, Ken Thompson built grep for Unix. The name stands for globally search for a regular expression and print matching lines. You give it a pattern and a file, and it checks every line, one at a time.
A regular expression (regex) is a mini-language for describing text patterns. Instead of searching for the exact word cat, you could write c[aou]t to match cat, cot, or cut.
grep compiles your regex into a non-deterministic finite automaton (NFA) — a state machine where each state can have multiple possible transitions for the same input character. Unlike a deterministic automaton (DFA) that follows exactly one path, an NFA can explore several paths simultaneously. This is what makes regex features like alternation (cat|dog) and quantifiers (a*b) possible — the machine tracks all possible states at once and accepts if any path reaches the final state.
For example, the pattern c[aou]t compiles into an NFA with a start state, a state after matching c, three parallel transitions for a, o, or u, a state after matching one of those, and a final accepting state after t. The NFA processes each character of input and maintains the set of all states it could currently be in.
See how the NFA processes text — watch it track multiple possible states simultaneously:
The critical thing to notice: grep has no choice but to read every line. It doesn’t know what’s in a file until it opens it. Even if the pattern is getUserById and a file clearly contains nothing but CSS — grep still reads every byte.
Try it — type a pattern and watch grep scan every line:
Notice the comparison counter. grep checked every single line regardless of whether there was any hope of a match. This is the cost of a linear scan.
2. ripgrep — making the scan faster
For 40 years, the core approach didn’t change: read every file, check every line. But the implementation got dramatically better.
In 2016, Andrew Gallant built ripgrep (rg), arguably the fastest grep implementation ever written. It’s roughly 10x faster through a stack of optimizations:
Boyer-Moore skipping. Classic grep checks every position in the file. Boyer-Moore reads the pattern right-to-left. When it hits a mismatch and the mismatched character doesn’t appear in the pattern at all, it can skip forward by the entire pattern length. For a 12-character pattern like getUserById, that means jumping 12 positions in one step. In the best case, you read fewer bytes than the file contains.
SIMD acceleration. Modern CPUs have instructions that compare 16 or 32 bytes simultaneously. ripgrep’s Teddy algorithm exploits this — it loads 16 bytes from the file into a CPU register and checks all of them in a single instruction. Most bytes are eliminated 16 at a time. The actual regex engine only runs on the tiny fraction that could match.
See the difference — traditional scanning checks one byte at a time, while SIMD processes an entire row in a single operation:
Parallelism. ripgrep walks directories across multiple CPU cores. While one core is reading src/, another is already scanning lib/.
.gitignore awareness. ripgrep automatically skips node_modules/, .git/, dist/, and anything in your .gitignore. grep reads all of them.
Watch both tools scan the same set of files:
searching for getUserById
ripgrep is faster and smarter about which files to open. But look at the “files read” count for non-ignored files — both tools still open every one. Boyer-Moore and SIMD make the per-file scan faster, but they can’t eliminate the scan itself. They are optimizations within the same paradigm: open file, read bytes, check pattern, next file.
3. The wall
Now scale this up. Enterprise monorepos — think Google, Meta, or a large fintech — have 500,000+ files. Even with ripgrep’s optimizations, scanning all of them takes 15+ seconds. Every single search. Every single time.
For a human typing Ctrl+Shift+F, that’s annoying. For an AI agent that runs dozens of searches per task — investigating the codebase, finding definitions, checking usage patterns, verifying its own changes — it’s catastrophic. The agent spends more time waiting for grep than actually thinking.
0/500,000
scanning file 0 of 500,000
The fundamental insight: we need to stop scanning files entirely. Instead of reading every file to check if it matches, what if we already knew which files to check?
This is what an index does. You pay a one-time cost to read all files upfront and build a lookup structure. After that, every search is just a dictionary lookup — no files need to be opened.
4. Inverted indexes — flip the question
The idea behind an inverted index is beautifully simple. Instead of asking “does file X contain word W?” (which requires opening file X), flip it to “which files contain word W?” (which requires only a dictionary lookup).
Here’s how it works:
Step 1 — Tokenize. Read every document and split it into individual words. The word "the" in document D0 becomes an entry. This process is called tokenization.
Step 2 — Build the index. For each unique word, record which documents contain it. This list of document IDs is called a posting list. The complete dictionary of word → posting list is the inverted index.
Step 3 — Search. Look up your search term in the dictionary and get its posting list instantly. For multi-word searches like "cat sat", look up both words’ posting lists and intersect them — keep only document IDs that appear in both lists.
Edit the documents below and watch the index rebuild in real time:
Documents
Inverted Index
This is the foundation of every search engine — Google, Elasticsearch, Sourcegraph. The index lookup is nearly instantaneous because it’s just a dictionary access, not a file scan.
But there’s a problem. This approach works for English text because English has natural word boundaries (spaces). Code doesn’t. What’s a “word” in MAX_FILE_SIZE = buffer.read()? Is it MAX_FILE_SIZE? Or MAX, FILE, SIZE? And even if you got tokenization right, you couldn’t match patterns like MAX_\w+ — only exact words.
We need a way to tokenize anything — not just English words. That’s where trigrams come in.
5. Trigrams — tokenizing anything
In 2006, Russ Cox built Google Code Search using a clever insight: don’t try to split code into meaningful tokens. Just use every overlapping window of 3 characters.
Here’s the algorithm. Take the string catch. Slide a window of width 3 across it:
- Position 0:
cat - Position 1:
atc - Position 2:
tch
That’s it. Three trigrams. The window doesn’t care about word boundaries, spaces, or syntax. It captures every possible 3-character sequence in the file.
The formula: a string of length L produces exactly L − 2 trigrams. catch has 5 characters, so 5 − 2 = 3 trigrams. The last valid starting position for a window of width 3 is position L − 3.
Type any string and see how the sliding window works. Toggle between n=2, n=3, and n=4 to feel the tradeoff:
11 trigrams·13 − 2 = 11
A file is only a candidate if it contains every single trigram. The index doesn't give final answers — it gives a shortlist.
Why 3? The key count tradeoff
The choice of 3 isn’t arbitrary — it’s a mathematical sweet spot:
-
Bigrams (n=2): Only 26² = 676 possible keys (for letters alone). Almost every file in existence contains the bigrams
th,he,er,in. The posting lists are enormous — pointing to nearly every file — which makes them useless for narrowing down candidates. -
Trigrams (n=3): 26³ = 17,576 possible keys. Now
theis specific enough that only some files contain it._FIis rare enough that very few files contain it. The posting lists stay small enough to be useful as filters. -
Quadgrams (n=4): 26⁴ = 456,976 possible keys. The posting lists would be tiny and very precise — but the index itself becomes the problem. You’d need to store hundreds of thousands of keys, and with digits, symbols, and mixed case (real code), that explodes into billions.
Each extra character in the window multiplies the key space by 26. Trigrams sit at the sweet spot: one step more specific than bigrams (17× more keys), one step cheaper than quadgrams (26× fewer keys).
The trigram guarantee
This is the logical heart of the system. If the word cat exists somewhere in a file, then the three characters c, a, t appear consecutively in that file. The sliding window visits every possible 3-character position — it literally cannot miss any consecutive triple. So the trigram cat is physically embedded inside the word cat, and the index will have recorded it.
Flip it around: if the trigram cat is not in the index for a file, then c-a-t never appeared consecutively anywhere in that file. The word cat cannot be there. We can safely skip that file — without opening it.
The word “candidate” is critical here. The index doesn’t give you final answers — it gives you a shortlist. A file containing concatenate also contains the trigram cat (buried inside it), so it would be a false positive. That’s fine — you run the real regex on candidates and filter them out. What matters is that 499,960 files are eliminated entirely without any disk access.
The three shortcomings
Trigram indexes work, but they have real problems that developers spent the next 15 years solving:
1. The index is enormous. Every trigram from every file is stored. getUserById alone produces 9 trigrams, and every file has thousands. On a large repo this can be gigabytes of index data.
2. Too many lookups per query. getUserById (11 chars) produces 9 trigrams — that’s 9 separate posting list lookups, each returning a potentially large list, all 9 intersected. And notice: adjacent trigrams like get, etU, tUs share 2 of 3 characters. You’re doing nearly the same lookup three times.
3. No adjacency information. The index knows a file contains the trigrams cat and sat but has no idea if they appear near each other. A file with “cat” on line 1 and “sat” on line 800 incorrectly becomes a candidate for the phrase “cat sat”.
6. Bloom filter masks — GitHub’s clever fix
GitHub’s Code Search team (internally called “Project Blackbird”) attacked problems 2 and 3 by adding just 2 extra bytes per index entry.
Before understanding their fix, you need to understand a bloom filter — the simplest probabilistic data structure:
- You have N bits, all starting at 0
- Add an item: hash it to a bit position, set that bit to 1
- Check an item: hash it, check the bit
- Bit is 0 → the item is definitely not in the set (guaranteed, no false negatives)
- Bit is 1 → the item is probably in the set (could be a false positive — another item hashed to the same bit)
Play with it — add characters, check others, and drag the saturation slider to see the fatal flaw:
A bloom filter can tell you something is definitely NOT in the set, but only probably in the set. As more data is added, bits saturate — every query returns "probably yes" and the filter becomes useless. This is the fundamental trade-off of probabilistic data structures.
GitHub’s two masks:
nextMask solves the precision problem. When indexing, for every trigram, they hash the character that follows it into an 8-bit mask. Now when searching for the 4-character sequence e·fo, they look up trigram e·f and check whether o’s bit is set in the nextMask. If not — this file cannot contain e·fo, skip it. They get quadgram precision at trigram storage cost.
locMask solves the adjacency problem. They record where in the file each trigram appears, using position mod 8 as the bit. To check if two trigrams are actually adjacent (not just both present somewhere in the file), rotate one mask by 1 bit and AND them. Non-zero means they appear at consecutive positions somewhere.
The fatal flaw: as more data is added to a file, more bits get set. Eventually, with enough data, all 8 bits are 1 — and the filter says “yes” to every query. It’s completely saturated. This is unavoidable with a fixed-size bloom filter. The approach works for a point-in-time snapshot but degrades as codebases grow over time.
7. Sparse n-grams — the smart approach
This is the approach Cursor actually uses, and it’s the most elegant of all. It fixes the fundamental redundancy in trigram indexing.
The redundancy problem
Look at the trigrams for getUserById: get, etU, tUs, Use, ser, erB, rBy, ByI, yId. That’s 9 trigrams, and every adjacent pair shares 2 of 3 characters. You’re storing the same information multiple times and doing 9 lookups for what should be 2 or 3.
The algorithm — step by step
Here’s how sparse n-grams work. Imagine you have the string getUserById and you want to index it.
Step 1 — Extract character pairs. Take every adjacent pair of characters: ge, et, tU, Us, se, er, rB, By, yI, Id. There are L−1 pairs for a string of length L.
Step 2 — Assign weights. Each pair gets a numeric weight. The weight function must be deterministic — the same pair always produces the same weight, on every machine, every time. This is critical: both the indexer and the searcher must agree on weights.
Two approaches exist:
- CRC32 hash — produces pseudo-random but deterministic weights. Simple, but peaks land randomly.
- Frequency table — trained on terabytes of real code. Common pairs like
erandth(which appear everywhere) get low weight. Rare pairs likerB(how often does a lowercaserprecede an uppercaseB?) get high weight. Peaks land on the most distinctive parts of the string.
Step 3 — Find peaks. A peak is a local maximum: a pair whose weight is higher than both its left and right neighbor. These peaks become the anchor points for n-gram boundaries. With frequency weights on getUserById, the peaks are rB (weight 92) and yI (weight 85) — both rare camelCase transitions.
Step 4 — Build-all (indexing). Now generate all valid n-grams. The rule: a span from position i to position j is valid if both endpoint weights dominate every weight strictly between them. In other words, no weight inside the span is higher than the span’s endpoints.
All 9 trigrams are automatically valid — adjacent pairs have nothing between them, so the domination condition is trivially satisfied. That’s the easy part.
The interesting part is the 5 longer spans. Let’s derive each one using the actual frequency weights (ge=18, et=12, tU=35, Us=28, se=15, er=14, rB=92, By=22, yI=85, Id=42):
getU — endpoints ge(18) and tU(35), interior is just et(12). Is 18 > 12? Yes. Is 35 > 12? Yes. Both endpoints dominate the interior. Valid.
serB — endpoints se(15) and rB(92), interior is er(14). Is 15 > 14? Yes. Is 92 > 14? Yes. Valid.
UserB — endpoints Us(28) and rB(92), interior is max(se=15, er=14) = 15. Is 28 > 15? Yes. Is 92 > 15? Yes. Valid.
tUserB — endpoints tU(35) and rB(92), interior is max(Us=28, se=15, er=14) = 28. Is 35 > 28? Yes. Is 92 > 28? Yes. Both peaks see over everything between them. Valid. This is the longest span — 6 characters — and one of the most selective n-grams in the index.
rByI — endpoints rB(92) and yI(85), interior is By(22). Is 92 > 22? Yes. Is 85 > 22? Yes. Valid.
Why not longer? Try getUserB — the interior contains tU(35), but the left endpoint ge is only 18. 18 < 35, so ge can’t dominate tU. The peak at tU acts as a wall that blocks longer spans from crossing it. Similarly, rById fails because Id(42) can’t dominate yI(85) in the interior. Peaks are barriers — you can span between them but not through them.
Step 5 — Covering (search time). When a query arrives, the algorithm extracts the minimum set of n-grams needed to cover the entire query string. It uses a greedy approach: pick the longest available n-gram, mark those characters as covered, repeat.
The mathematical guarantee: covering always produces a subset of what build-all stored. They run the same weight function, so they identify the same peaks and the same valid spans. A covering n-gram can never ask for something the index doesn’t have.
Step through the algorithm yourself:
Why this beats everything
- Fewer lookups. Instead of 9 trigram lookups for
getUserById, covering produces 2-3 lookups with longer, more specific n-grams. - Smaller posting lists. A 6-character n-gram like
tUserBmatches far fewer files than a 3-character trigram likeget. The posting lists are tiny. - No bloom filter saturation. No probabilistic structures that degrade over time.
- No adjacency problem. The n-grams are anchored on specific positions, so adjacency is inherent.
- Smaller index overall. Despite build-all storing overlapping spans, the total is less than storing every trigram, because longer spans replace multiple short ones.
8. The full pipeline
The sparse n-gram index and ripgrep are not alternatives — they’re two phases of one pipeline:
Phase 1 — Index lookup (microseconds). The query getUserById arrives. Covering mode extracts 2 sparse n-grams. The index does 2 binary searches on a memory-mapped lookup table, reads 2 posting lists from disk, and intersects them. Result: 37 candidate files out of 500,000. The other 499,963 files are eliminated without being opened. No regex runs. No files read.
Phase 2 — Verification (milliseconds). ripgrep receives only the 37 candidate file paths. It opens each one, runs the full SIMD regex engine, finds exact line numbers, eliminates false positives from the index, and returns confirmed matches with surrounding context.
Why you need both: The index can only tell you which files might match. It can’t tell you the exact line number, the surrounding context, or whether complex patterns like get\w+ById actually match. ripgrep does all of that — but it has to open every file it checks, which is why the index’s job of eliminating 499,963 files first is so critical.
9. All of this, on your machine
The index needs to be fast to query and small enough to store locally. Here’s how it’s laid out on disk.
Two files, almost nothing in RAM
The entire index consists of just two files:
The lookup table is a sorted array of entries, each containing an n-gram’s hash and a byte offset into the postings file. It’s tiny — typically 2-4 KB for a large codebase — and is memory-mapped (mmap), which means the operating system loads it into RAM automatically and keeps it there. Looking up an n-gram is a binary search over this array.
The postings file sits on disk and contains the actual posting lists — the file IDs that contain each n-gram. It’s larger (tens of KB to a few MB) but is never loaded entirely. When the lookup table says “the posting list for tUserB starts at byte 0x0400”, the system seeks directly to that offset and reads just those bytes. Everything else stays untouched on disk.
Hover over the lookup entries to see how they map to disk offsets:
Lookup Table (mmap'd, in RAM)
Lookup: 2.4 KB (in RAM)
Postings File (on disk)
file_042, file_187, file_903
file_042, file_187
file_042, file_187, file_903, file_1204, file_3891
file_042, file_903
file_042, file_187, file_903
Postings: 48 KB (on disk, mmap'd)
A search for getUserById reads approximately 200 bytes total from disk: 2 binary searches on the mmap’d lookup table, 2 seeks into the postings file, 2 small posting lists read and intersected. That’s it. On an SSD, this takes microseconds.
Why local beats server
You might wonder: why not put the index on a server? The answer is that verification must be co-located with files. After the index narrows 500,000 files to 37 candidates, ripgrep needs to actually open those 37 files and run the regex. If the files are on your local disk, that’s milliseconds. If they’re on a remote server, every file open is a network round-trip — the verification step alone could take seconds, destroying the entire speedup the index provides.
10. Keeping the index fresh
There’s an obvious problem: if you edit a file, the index becomes stale. The trigrams stored for that file no longer reflect its current content. Rebuilding the entire index on every keystroke would be absurd.
Git as the anchor
Cursor’s solution is elegant. The index is built from your last Git commit — the stable snapshot. Every file that has been modified since that commit is tracked in a small dirty list.
When you search:
- The index handles the hundreds of thousands of committed files instantly (the fast path)
- The handful of dirty files (the ones you’re currently editing) are scanned directly with ripgrep (the slow path, but only on a tiny set)
At any given moment, you’re typically editing 2-5 files out of hundreds of thousands. The dirty layer is almost always negligible.
When does the index rebuild?
Three scenarios trigger a rebuild:
- You commit. The dirty files are now part of the committed snapshot. The index incorporates them and the dirty list resets.
- You pull new changes. Same as a commit — new files enter the stable snapshot.
- Cursor updates its version. The weight function might have changed (maybe they improved their frequency table). Since the weight function must be identical between indexing and querying, a new weight function means the old index would produce wrong n-gram boundaries at query time. So the index rebuilds from scratch. This is the “determinism” requirement in action — the same input must always produce the same output, and a version change breaks that contract.
The rebuild runs in the background while you keep working. Until it finishes, the system falls back to scanning all files with ripgrep — slower, but correct.
11. Where you already use this
These algorithms aren’t just academic — they power tools you use every day:
VS Code “Find in Files” (Cmd+Shift+F) uses ripgrep as its search backend. When you search across your project, you’re running ripgrep. This is why VS Code’s search is significantly faster than most editors.
GitHub Code Search uses sparse n-gram indexing (evolved from their earlier trigram + bloom filter “Project Blackbird” approach). When you search code on github.com, you’re querying a sparse n-gram index.
Cursor uses a local sparse n-gram index combined with ripgrep verification — exactly the pipeline described above. The index lives on your machine, is anchored to your last git commit (dirty files since the commit are scanned directly), and rebuilds automatically on version updates.
Sourcegraph and Zoekt use trigram indexing for code search across large codebases.
Google Code Search (2006-2013, and internally still) was the original trigram-indexed code search, built by Russ Cox.
In your own codebase
If you’re building search functionality, the decision tree is simple:
- < 10K files: ripgrep is fast enough. Don’t bother with an index.
- 10K - 100K files: ripgrep still works, but consider indexing if search is on the hot path (CI, AI agents, real-time IDE features).
- 100K+ files: You need an index. Trigram indexing is the simplest to implement. Sparse n-grams are better but harder.
- Any size, if you need sub-100ms search: Build an index. The one-time cost pays for itself on the second search.
The key insight from this whole post: the fastest search isn’t the one that reads files fastest — it’s the one that reads the fewest files.
The 50-year arc
| Year | Tool | Innovation | Files read |
|---|---|---|---|
| 1973 | grep | NFA regex engine | All of them |
| 1988 | GNU grep | Boyer-Moore skipping | All of them |
| 2006 | Google Code Search | Trigram inverted index | Candidates only |
| 2016 | ripgrep | SIMD + parallelism | All of them (fast) |
| 2023 | GitHub Code Search | Bloom filter masks | Candidates only |
| 2024 | Cursor | Sparse n-gram index + ripgrep | ~37 out of 500,000 |
From scanning everything to scanning almost nothing. The search didn’t get faster by reading files faster — it got faster by reading fewer files.
Try it on your own data
Want to see these algorithms compete on your document? Upload a PDF or text file and watch grep, ripgrep, and trigram search run side by side — with real timing on your real data.
Inspired by Cursor’s “Fast Regex Search” blog post by Vicent Marti. Built with interactive components so you can explore each algorithm yourself.