You do not need a large language model for every content quality problem.
Sometimes the useful question is smaller and more practical:
Are these two pages, titles, keywords, descriptions, or snippets too similar to ship without review?
That question comes up constantly in SEO and content QA. It is also one of the easiest places to overcomplicate the stack.
Simple similarity methods are fast, explainable, cheap to run, and good enough for a first pass. The point is not to automate editorial judgment. The point is to shrink a messy content inventory into a review queue a human can actually work through.
Similarity Is A Smoke Alarm, Not A Verdict
A high similarity score does not automatically mean two pages are bad, duplicate, cannibalized, or ready to merge.
It means they deserve a look.
That distinction matters because SEO decisions usually depend on more than text:
search intent,
query demand,
backlinks,
conversions,
internal links,
canonical tags,
business purpose,
page freshness,
whether the overlap is useful or lazy.
Google's canonicalization guidance talks about duplicate and very similar pages in terms of consolidating signals, crawl efficiency, and choosing which URL should appear in Search. A similarity score can help find candidates. It should not decide canonicals, redirects, or deletions by itself.
Start With The Job
Text similarity is not one thing. The right method depends on the job.
| Job | Question | Useful input |
|---|---|---|
| Duplicate detection | Are these too close? | Titles, H1s, meta descriptions, body extracts |
| Cannibalization review | Are these pages competing for the same intent? | Queries, titles, headings, summaries, URLs |
| Metadata QA | Did templating create near-identical snippets? | Title tags and meta descriptions |
| Content rewrite QA | Did the rewrite actually change the page? | Before/after drafts |
| Keyword clustering | Which terms belong in the same review batch? | Keyword lists and SERP notes |
| Boilerplate detection | Which pages are mostly template copy? | Main content with navigation removed |
The goal is not to replace judgment. The goal is to make review possible.
Compare The Right Text
Before choosing an algorithm, decide what you are comparing.
Comparing full HTML pages is often noisy because navigation, footer copy, cookie banners, CTAs, product grids, and sidebars can dominate the score. For SEO QA, you usually want cleaner fields.
| Field | Why compare it | Watch out for |
|---|---|---|
| Title tag | Finds repeated templates and unclear ownership | Location or product variants may be intentionally similar |
| H1 | Catches page-purpose overlap | H1s are short, so one shared word can distort the score |
| Meta description | Finds generated or repetitive snippets | Descriptions can be similar without pages overlapping |
| Main content extract | Finds genuine content overlap | Remove boilerplate before scoring |
| Heading outline | Shows structural duplication | Common guide formats can look similar |
| Keyword set | Helps clustering | Query intent still needs SERP review |
| Before/after draft | Checks whether a rewrite changed enough | Higher similarity may be fine for compliance or legal copy |
Good inputs beat clever scoring. A clean 500-word main-content summary is usually more useful than a noisy full-page scrape.
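One way to get a cleaner field out of raw HTML is to skip the tags that usually hold boilerplate. The sketch below uses Python's standard-library `html.parser`. The tag list in `SKIP_TAGS` and the helper name `main_text` are assumptions for illustration, and real templates often need site-specific rules.

```python
from html.parser import HTMLParser

# Tags whose contents are usually boilerplate. This list is an assumption;
# adjust it to the site's actual templates.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style", "form"}


class MainTextParser(HTMLParser):
    """Collects text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside skipped elements.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def main_text(html):
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = """
<html><body>
<nav>Home | Services | Contact</nav>
<main><h1>SEO migration checklist</h1><p>Map every old URL before launch.</p></main>
<footer>Copyright Example Co</footer>
</body></html>
"""
print(main_text(page))
```

Here the navigation and footer text never reach the output, so they cannot inflate the similarity score. Boilerplate that lives in generic `div` elements would need a different rule, such as matching class names.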
Normalize Before You Score
Tiny differences can make text look less similar than it is. Normalize first.
At minimum:
lowercase,
strip punctuation,
collapse whitespace,
remove common stopwords,
remove boilerplate,
decide whether to stem or lemmatize,
keep numbers only when they matter,
keep brand, product, location, and entity names.
Do not remove everything interesting. For SEO work, named entities are often the difference between "similar" and "same intent."
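As a rough illustration of that checklist, here is a minimal Python sketch. The stopword list is deliberately tiny, and `PROTECTED` is a hypothetical set of terms you want to keep even when another rule would drop them. The regular expression assumes English text, since it strips every character outside `a-z` and `0-9`.

```python
import re

# A tiny stopword list for illustration; real projects usually use a longer one.
STOPWORDS = {"the", "a", "an", "and", "or", "for", "to", "of", "in"}

# Hypothetical terms to keep even if another rule would drop them,
# for example a number that is part of a product name.
PROTECTED = {"365"}


def normalize(text, keep_numbers=False):
    """Lowercase, strip punctuation, collapse whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # punctuation becomes a space
    kept = []
    for word in text.split():  # split() also collapses repeated whitespace
        if word in PROTECTED:
            kept.append(word)
        elif word in STOPWORDS:
            continue
        elif word.isdigit() and not keep_numbers:
            continue  # drop bare numbers unless the caller says they matter
        else:
            kept.append(word)
    return kept


print(normalize("Microsoft 365 Migration Guide for 2024!"))
print(normalize("Guide for 2024", keep_numbers=True))
```

In the first call the year is dropped but the protected product number survives, which is the kind of distinction the checklist asks you to decide up front. Stemming and lemmatization are left out here because they need an external library.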
Method 1: Keyword Overlap
Keyword overlap is the plainest useful method. It answers: how many meaningful words do these strings share?
For short fields, a simple Jaccard-style overlap can be enough.
```typescript
// A small stopword list for illustration; extend it for real content.
const stopwords = new Set([
  'the', 'a', 'an', 'and', 'or', 'for', 'to', 'of', 'in',
]);

// Lowercase, strip punctuation, split on whitespace, and drop stopwords.
function tokens(value: string): Set<string> {
  return new Set(
    value
      .toLowerCase()
      .replace(/[^a-z0-9\s]/g, ' ')
      .split(/\s+/)
      .filter((word) => word && !stopwords.has(word)),
  );
}

// Jaccard overlap: shared words divided by all distinct words.
function overlapScore(a: string, b: string): number {
  const wordsA = tokens(a);
  const wordsB = tokens(b);
  const intersection = [...wordsA].filter((word) => wordsB.has(word)).length;
  const union = new Set([...wordsA, ...wordsB]).size;
  // Two empty inputs share nothing, so return 0 instead of dividing by zero.
  return union === 0 ? 0 : intersection / union;
}
```

This is crude, but it is excellent for title tags, H1s, navigation labels, short page summaries, and keyword lists. It is also easy to explain to a client: these two fields share most of their meaningful words.
Keyword overlap breaks down when two pages use different words for the same concept. It also overreacts to short strings. Use it as a cheap first filter, not a final answer.
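Both failure modes are easy to reproduce. The following Python sketch restates the same Jaccard idea without stopword removal; the function name `jaccard` and the example strings are illustrative only.

```python
def jaccard(a, b):
    """Jaccard overlap of lowercase word sets: shared words / all distinct words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


# Short strings: one shared word out of three distinct words.
print(jaccard("SEO audit", "SEO pricing"))

# Longer, clearly related titles: three shared words out of seven.
print(jaccard(
    "technical SEO audit checklist 2024",
    "technical SEO audit service pricing",
))

# Same concept, different words: no overlap at all.
print(jaccard("cheap flights", "low cost airfare"))
```

The two unrelated H1s score about 0.33, not far below the 0.43 of the clearly related titles, while the synonym pair scores zero. That is why short fields need stricter thresholds and why overlap alone cannot detect shared intent.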
Method 2: TF-IDF
TF-IDF helps when some words are more informative than others.
In a set of SEO pages, words like "service," "guide," "business," or "marketing" may appear everywhere. They do not tell you much. More specific words like "canonical," "checkout," "robots," "migration," or "schema" carry more signal.
TF-IDF weights words based on how often they appear in one document and how rare they are across the wider set. The scikit-learn TF-IDF documentation is a useful reference if you want the implementation details.
That makes TF-IDF useful for comparing:
article drafts,
product descriptions,
service pages,
location pages,
category descriptions,
extracted body summaries.
It is still lexical, not magical. It does not understand meaning the way a person does. It just gives more weight to distinctive terms.
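To make the weighting concrete, here is a small standard-library sketch of the textbook formula, where term frequency is the count divided by document length and inverse document frequency is `log(N / df)`. This is an illustrative variant, not what scikit-learn computes: its `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so the numbers will differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """Textbook TF-IDF for a list of tokenized documents.

    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


docs = [
    "seo guide canonical tags".split(),
    "seo guide checkout pages".split(),
    "seo guide schema markup".split(),
]
for weights in tfidf(docs):
    print(weights)
```

Because "seo" and "guide" appear in every document, their weight is zero in this variant, while "canonical", "checkout", and "schema" keep a positive weight. That matches the intuition above: words that appear everywhere carry little signal.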
Method 3: Cosine Similarity
Cosine similarity compares two vectors by direction instead of raw length. That matters because one page may be much longer than another.
The scikit-learn cosine similarity documentation defines it as a normalized dot product. In practice, with TF-IDF vectors, it is a common way to ask whether two documents point in a similar topical direction.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "SEO migration checklist for preserving rankings",
    "Website migration SEO checklist before launch",
    "Schema markup examples for product pages",
]

# Build TF-IDF vectors, dropping common English stopwords.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)

# scores[i][j] is the cosine similarity between docs[i] and docs[j].
scores = cosine_similarity(matrix)
print(scores)
```

You can use the score matrix to sort review queues. High similarity means "inspect this pair first." It does not mean "delete one."
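Turning a square score matrix into a sorted queue takes only a few lines. The sketch below is one possible approach using plain nested lists; the helper name `ranked_pairs`, the URLs, and the scores are all hypothetical. A dense NumPy array indexed as `scores[i][j]` should work the same way.

```python
from itertools import combinations


def ranked_pairs(labels, scores, min_score=0.0):
    """Return (score, label_a, label_b) for each unique pair, highest score first."""
    pairs = [
        (scores[i][j], labels[i], labels[j])
        # combinations() skips self-pairs and mirrored duplicates.
        for i, j in combinations(range(len(labels)), 2)
        if scores[i][j] >= min_score
    ]
    return sorted(pairs, reverse=True)


# Hypothetical URLs and scores standing in for a real similarity matrix.
urls = ["/seo-migration", "/site-migration-checklist", "/schema-examples"]
scores = [
    [1.00, 0.62, 0.00],
    [0.62, 1.00, 0.05],
    [0.00, 0.05, 1.00],
]
for score, a, b in ranked_pairs(urls, scores, min_score=0.1):
    print(f"{score:.2f}  {a}  <->  {b}")
```

Only the upper triangle of the matrix is read, so each pair appears once and no page is compared with itself.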
Thresholds Are Local
There is no universal score where a page becomes a duplicate.
A 0.72 cosine similarity between two title tags may be boring. A 0.72 similarity between two 1,200-word service pages may be a serious review candidate. A 0.9 score between two legal disclaimers may be expected.
Start with loose thresholds, then calibrate against real examples from the site.
| Pair type | First-pass flag | Human question |
|---|---|---|
| Title tags | High word overlap | Are these pages targeting the same query? |
| Meta descriptions | Very high overlap | Did templating produce generic snippets? |
| Service pages | High TF-IDF cosine | Are the offers actually distinct? |
| Product pages | High similarity plus similar attributes | Is this a variant strategy or duplicate content? |
| Blog posts | High similarity plus shared intent | Should these be merged, differentiated, or cross-linked? |
| Location pages | High similarity with only city swaps | Is there real local detail, proof, or usefulness? |
The threshold is a triage rule. The decision is editorial and strategic.
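In code, a local threshold can be as simple as a lookup table per field. The numbers below are placeholders rather than recommendations, and the field names and `needs_review` helper are assumptions for illustration.

```python
# Starting thresholds per field. These numbers are placeholders, not
# recommendations; calibrate them against labeled pairs from the site.
THRESHOLDS = {
    "title": 0.80,
    "meta_description": 0.90,
    "service_page": 0.70,
}
DEFAULT_THRESHOLD = 0.75  # used for any field without its own entry


def needs_review(field, score):
    """True when a pair's score meets the first-pass threshold for its field."""
    return score >= THRESHOLDS.get(field, DEFAULT_THRESHOLD)


print(needs_review("title", 0.72))
print(needs_review("service_page", 0.72))
```

With these placeholder values, the same 0.72 score is ignored for title tags and flagged for service pages. Keeping the table in one place makes it easy to adjust after each manual review round.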
Practical SEO Uses
Duplicate Title And Description QA
Run title tags and meta descriptions through a similarity check before launch.
This catches:
title templates used too aggressively,
location pages with only the city changed,
product pages with weak differentiators,
generated meta descriptions that sound unique but say the same thing,
old and new drafts that did not actually diverge.
The fix is not always "make everything wildly different." Sometimes similar metadata is fine. The question is whether the field helps searchers and systems understand why this URL deserves to exist.
Content Cannibalization Review
If two pages are similar and target similar queries, they may be splitting relevance.
Similarity scoring can help find candidates for:
merging,
redirecting,
differentiating,
internal link cleanup,
clearer keyword ownership,
canonical review.
It will not decide the strategy for you. Use Search Console queries, rankings, conversions, backlinks, and business priority before changing URLs.
Generated Content QA
Similarity checks are useful when a team is using templates, AI drafts, programmatic SEO, or bulk metadata generation.
Flag:
paragraphs repeated across pages,
overused intros,
near-identical conclusion blocks,
product copy that only changes the model name,
city pages with no real local specificity,
AI rewrites that preserve the same thin structure.
Google's helpful, reliable, people-first content guidance is a good sanity check here. If a page only exists as a lightly varied version of another page, similarity scoring may be showing a deeper usefulness problem.
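The first item on that list, paragraphs repeated across pages, does not even need a similarity score. A minimal sketch, assuming paragraphs are separated by blank lines and using hypothetical page data:

```python
from collections import defaultdict


def repeated_paragraphs(pages, min_pages=2):
    """Map each paragraph found on at least `min_pages` URLs to those URLs."""
    seen = defaultdict(set)
    for url, body in pages.items():
        for paragraph in body.split("\n\n"):
            key = " ".join(paragraph.lower().split())  # normalize case and whitespace
            if key:
                seen[key].add(url)
    return {
        paragraph: sorted(urls)
        for paragraph, urls in seen.items()
        if len(urls) >= min_pages
    }


# Hypothetical city pages that share one intro paragraph.
pages = {
    "/plumber-austin": "We are your trusted local experts.\n\nServing Austin since 2009.",
    "/plumber-dallas": "We are your trusted local experts.\n\nServing Dallas since 2014.",
}
print(repeated_paragraphs(pages))
```

This catches exact repeats only. Paragraphs that differ by a city or model name would need one of the similarity scores described earlier, applied at paragraph level.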
Keyword Clustering
Simple similarity can group obvious variants before a human reviews intent.
For example:
"technical SEO audit"
"technical SEO checklist"
"SEO audit service"
"site audit for SEO"
These may not all deserve the same page, but they probably belong in the same review batch. Similarity helps organize the work; SERP intent decides the page strategy.
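A rough way to build those review batches is greedy grouping by word overlap. The sketch below is one simple approach, not a standard clustering algorithm: each keyword joins the first cluster whose first member it overlaps with enough, so the result depends on input order. The `threshold` value and the stopword list are assumptions.

```python
STOPWORDS = {"for", "the", "a", "of", "in"}


def words(keyword):
    """Lowercase word set with stopwords removed."""
    return frozenset(w for w in keyword.lower().split() if w not in STOPWORDS)


def cluster_keywords(keywords, threshold=0.5):
    """Greedy grouping: join the first cluster whose seed keyword overlaps enough."""
    clusters = []
    for keyword in keywords:
        current = words(keyword)
        for cluster in clusters:
            seed = words(cluster[0])  # compare against the cluster's first keyword
            union = seed | current
            if union and len(seed & current) / len(union) >= threshold:
                cluster.append(keyword)
                break
        else:  # no existing cluster was close enough
            clusters.append([keyword])
    return clusters


keywords = [
    "technical SEO audit",
    "technical SEO checklist",
    "SEO audit service",
    "site audit for SEO",
    "schema markup examples",
]
for batch in cluster_keywords(keywords):
    print(batch)
```

With this threshold, the four audit-related terms land in one batch and the schema keyword gets its own. The grouping says nothing about intent; it only decides which terms a reviewer looks at together.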
A Sensible Workflow
Use text similarity as a QA pipeline:
1. Export URLs with title, H1, meta description, canonical, target query, traffic, and conversions.
2. Extract main content, then remove navigation, footer, and repeated boilerplate.
3. Normalize the fields you want to compare.
4. Score likely pairs instead of every possible pair when the site is large.
5. Sort by highest similarity inside the same content type or section.
6. Review the top results manually.
7. Label the issue: duplicate, overlap, template copy, legitimate variant, or no problem.
8. Decide whether to merge, rewrite, redirect, canonicalize, improve internal links, or leave alone.
9. Re-run the check after edits so the QA loop has evidence.
The score is a flashlight, not a verdict.
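The "score likely pairs" step matters because the number of possible pairs grows quadratically with page count. One common shortcut is to compare pages only within the same group. The sketch below assumes each exported row carries a `section` value, which is a hypothetical field that could come from the URL path or the CMS content type.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(pages):
    """Pair URLs only within the same section instead of across the whole site."""
    by_section = defaultdict(list)
    for page in pages:
        by_section[page["section"]].append(page["url"])
    pairs = []
    for urls in by_section.values():
        pairs.extend(combinations(sorted(urls), 2))
    return pairs


# Hypothetical export rows.
pages = [
    {"url": "/services/seo-audit", "section": "services"},
    {"url": "/services/seo-migration", "section": "services"},
    {"url": "/blog/seo-audit-guide", "section": "blog"},
    {"url": "/blog/migration-checklist", "section": "blog"},
]
for pair in candidate_pairs(pages):
    print(pair)
```

Four pages produce two candidate pairs here instead of six. The trade-off is that cross-section overlap, such as a blog post competing with a service page, is never scored, so it may be worth running a second pass grouped by target query.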
What Simple Similarity Misses
Simple methods do not understand meaning the way people do.
They struggle with:
synonyms,
sarcasm,
entity relationships,
intent differences,
multilingual content,
pages that use different wording for the same concept,
pages that share boilerplate but solve different problems.
For semantic matching, embeddings are stronger. But embeddings are not always the first tool to reach for. They add cost, complexity, and harder-to-explain results.
For audits, launch QA, metadata review, and content inventories, simple methods are often enough.
Where This Fits In A Content System
Text similarity is useful because it turns a messy content inventory into a smaller review problem.
The output should not be "these 180 pages are bad." It should be something a team can act on:
| Finding | Better handoff |
|---|---|
| 47 pages have similar titles | Review the top 15 pairs by traffic and business value |
| 22 descriptions are near-identical | Rewrite the shared template and regenerate affected pages |
| 8 service pages overlap heavily | Assign one primary intent to each page or consolidate |
| 3 articles cover the same topic | Merge, redirect, and preserve the strongest sections |
| Location pages are mostly boilerplate | Add real proof, local details, photos, testimonials, or cut the page |
That is the real value: not the algorithm by itself, but the handoff from vague concern to focused action.
The Bottom Line
Text similarity is not advanced SEO theater. It is a practical QA layer.
Use keyword overlap for short fields. Use TF-IDF and cosine similarity for larger text. Use embeddings only when lexical methods are not enough. Then let humans make the actual SEO decision.
For SEO and content QA, that is often enough to move from "we should review this site" to "these are the twenty things to check first."