Simple Text Similarity for SEO and Content QA

A practical guide to using keyword overlap, TF-IDF, cosine similarity, and human review to find duplicate metadata, cannibalization risks, and weak content overlap.

Gabriel Luis • May 8, 2026

You do not need a large language model for every content quality problem.

Sometimes the useful question is smaller and more practical:

Are these two pages, titles, keywords, descriptions, or snippets too similar to ship without review?

That question comes up constantly in SEO and content QA. It is also one of the easiest places to overcomplicate the stack.

Simple similarity methods are fast, explainable, cheap to run, and good enough for a first pass. The point is not to automate editorial judgment. The point is to shrink a messy content inventory into a review queue a human can actually work through.

Similarity Is A Smoke Alarm, Not A Verdict

A high similarity score does not automatically mean two pages are bad, duplicate, cannibalized, or ready to merge.

It means they deserve a look.

That distinction matters because SEO decisions usually depend on more than text:

  • search intent,

  • query demand,

  • backlinks,

  • conversions,

  • internal links,

  • canonical tags,

  • business purpose,

  • page freshness,

  • whether the overlap is useful or lazy.

Google's canonicalization guidance talks about duplicate and very similar pages in terms of consolidating signals, crawl efficiency, and choosing which URL should appear in Search. A similarity score can help find candidates. It should not decide canonicals, redirects, or deletions by itself.

Start With The Job

Text similarity is not one thing. The right method depends on the job.

| Job | Question | Useful input |
| --- | --- | --- |
| Duplicate detection | Are these too close? | Titles, H1s, meta descriptions, body extracts |
| Cannibalization review | Are these pages competing for the same intent? | Queries, titles, headings, summaries, URLs |
| Metadata QA | Did templating create near-identical snippets? | Title tags and meta descriptions |
| Content rewrite QA | Did the rewrite actually change the page? | Before/after drafts |
| Keyword clustering | Which terms belong in the same review batch? | Keyword lists and SERP notes |
| Boilerplate detection | Which pages are mostly template copy? | Main content with navigation removed |

The goal is not to replace judgment. The goal is to make review possible.

Compare The Right Text

Before choosing an algorithm, decide what you are comparing.

Comparing full HTML pages is often noisy because navigation, footer copy, cookie banners, CTAs, product grids, and sidebars can dominate the score. For SEO QA, you usually want cleaner fields.

| Field | Why compare it | Watch out for |
| --- | --- | --- |
| Title tag | Finds repeated templates and unclear ownership | Location or product variants may be intentionally similar |
| H1 | Catches page-purpose overlap | H1s are short, so one shared word can distort the score |
| Meta description | Finds generated or repetitive snippets | Descriptions can be similar without pages overlapping |
| Main content extract | Finds genuine content overlap | Remove boilerplate before scoring |
| Heading outline | Shows structural duplication | Common guide formats can look similar |
| Keyword set | Helps clustering | Query intent still needs SERP review |
| Before/after draft | Checks whether a rewrite changed enough | Higher similarity may be fine for compliance or legal copy |

Good inputs beat clever scoring. A clean 500-word main-content summary is usually more useful than a noisy full-page scrape.
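One way to get a cleaner field out of a full page is to drop text inside obvious template elements before scoring. The sketch below uses only Python's standard-library `html.parser`. The `SKIP_TAGS` set, the `MainTextExtractor` class, and the sample page are assumptions for illustration; real templates often need site-specific rules.

```python
from html.parser import HTMLParser

# Assumption: boilerplate lives inside these elements. Adjust per site.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style", "form"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # number of skipped elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page used only to show the effect.
page = """
<html><body>
<nav><a href="/">Home</a> <a href="/blog">Blog</a></nav>
<main><h1>SEO migration checklist</h1><p>Map every old URL before launch.</p></main>
<footer>Copyright Example Co</footer>
</body></html>
"""

print(main_text(page))
```

The navigation and footer text never reach the score, so two pages that share a template no longer look similar for that reason alone.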

Normalize Before You Score

Tiny differences can make text look less similar than it is. Normalize first.

At minimum:

  • lowercase,

  • strip punctuation,

  • collapse whitespace,

  • remove common stopwords,

  • remove boilerplate,

  • decide whether to stem or lemmatize,

  • keep numbers only when they matter,

  • keep brand, product, location, and entity names.

Do not remove everything interesting. For SEO work, named entities are often the difference between "similar" and "same intent."
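A minimal sketch of the first few steps, assuming a tiny illustrative stopword list and a hypothetical `normalize` helper. It covers lowercasing, punctuation, whitespace, stopwords, and the number decision. Stemming and boilerplate removal are left out on purpose.

```python
import re

# Assumption: a deliberately small stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "or", "for", "to", "of", "in"}


def normalize(text: str, keep_numbers: bool = False) -> list[str]:
    """Return the meaningful lowercase words of a field, in order."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes a space
    words = text.split()                  # split() also collapses whitespace
    kept = []
    for word in words:
        if word in STOPWORDS:
            continue
        if word.isdigit() and not keep_numbers:
            continue  # drop bare numbers unless the caller says they matter
        kept.append(word)
    return kept


print(normalize("The Best SEO Audit Checklist for 2026!"))
print(normalize("The Best SEO Audit Checklist for 2026!", keep_numbers=True))
```

Brand, product, and location names pass through untouched because nothing here removes them. If you use a larger stopword list, check that it does not contain words that act as entity names on your site.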

Method 1: Keyword Overlap

Keyword overlap is the plainest useful method. It answers: how many meaningful words do these strings share?

For short fields, a simple Jaccard-style overlap can be enough.

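TS
// Small illustrative stopword list; extend it for your own content.
const stopwords = new Set([
  'the',
  'a',
  'an',
  'and',
  'or',
  'for',
  'to',
  'of',
  'in',
]);

// Lowercase, turn anything that is not an ASCII letter, digit, or whitespace
// into a space, split on whitespace, then drop empty strings and stopwords.
function tokens(value: string): Set<string> {
  return new Set(
    value
      .toLowerCase()
      .replace(/[^a-z0-9\s]/g, ' ')
      .split(/\s+/)
      .filter((word) => word && !stopwords.has(word)),
  );
}

// Jaccard overlap: shared words divided by all distinct words, from 0 to 1.
function overlapScore(a: string, b: string): number {
  const wordsA = tokens(a);
  const wordsB = tokens(b);
  const intersection = [...wordsA].filter((word) => wordsB.has(word)).length;
  const union = new Set([...wordsA, ...wordsB]).size;

  // Two empty inputs have no words to compare, so report no overlap.
  return union === 0 ? 0 : intersection / union;
}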

This is crude, but it is excellent for title tags, H1s, navigation labels, short page summaries, and keyword lists. It is also easy to explain to a client: these two fields share most of their meaningful words.

Keyword overlap breaks down when two pages use different words for the same concept. It also overreacts to short strings. Use it as a cheap first filter, not a final answer.
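Both failure modes are easy to see with a few made-up strings. The sketch below is a Python rendering of the same Jaccard idea; the function names and sample phrases are assumptions for illustration.

```python
import re

STOPWORDS = {"the", "a", "an", "and", "or", "for", "to", "of", "in"}


def tokens(value: str) -> set[str]:
    words = re.sub(r"[^a-z0-9\s]", " ", value.lower()).split()
    return {word for word in words if word not in STOPWORDS}


def overlap_score(a: str, b: str) -> float:
    words_a, words_b = tokens(a), tokens(b)
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


# Short strings: one extra shared word moves the score a long way.
print(overlap_score("SEO Guide", "SEO Checklist"))              # 1 shared of 3 distinct words
print(overlap_score("Local SEO Guide", "Local SEO Checklist"))  # 2 shared of 4 distinct words

# Different words for the same concept: nothing is shared at all.
print(overlap_score("cheap flights", "low cost airfare"))
```

With only two or three tokens per string, the score can only take a handful of values, which is why short fields need looser interpretation than body text.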

Method 2: TF-IDF

TF-IDF helps when some words are more informative than others.

In a set of SEO pages, words like "service," "guide," "business," or "marketing" may appear everywhere. They do not tell you much. More specific words like "canonical," "checkout," "robots," "migration," or "schema" carry more signal.

TF-IDF weights words based on how often they appear in one document and how rare they are across the wider set. The scikit-learn TF-IDF documentation is a useful reference if you want the implementation details.

That makes TF-IDF useful for comparing:

  • article drafts,

  • product descriptions,

  • service pages,

  • location pages,

  • category descriptions,

  • extracted body summaries.

It is still lexical, not magical. It does not understand meaning the way a person does. It just gives more weight to distinctive terms.
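To see the weighting at work, here is a hand-rolled sketch using the plain textbook formula, term frequency times `log(N / document frequency)`. The three tiny documents are invented, and libraries such as scikit-learn use smoothed variants, so their numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical mini corpus: "seo" and "guide" appear on every page.
docs = {
    "/seo-guide": "seo guide canonical tags guide",
    "/migration": "seo guide site migration",
    "/schema": "seo guide schema markup",
}

tokenized = {url: text.split() for url, text in docs.items()}
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(word for words in tokenized.values() for word in set(words))


def tfidf(words):
    """Weight each word by (count / length) * log(N / df)."""
    counts = Counter(words)
    return {
        word: (count / len(words)) * math.log(n_docs / df[word])
        for word, count in counts.items()
    }


weights = tfidf(tokenized["/seo-guide"])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:10s} {weight:.3f}")
```

"guide" appears twice on the page but still gets a weight of zero, because it appears in every document. "canonical" and "tags" appear once each and carry all the signal.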

Method 3: Cosine Similarity

Cosine similarity compares two vectors by direction instead of raw length. That matters because one page may be much longer than another.

The scikit-learn cosine similarity documentation defines it as a normalized dot product. In practice, with TF-IDF vectors, it is a common way to ask whether two documents point in a similar topical direction.
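The length point can be checked with a small sketch. It uses raw word counts rather than TF-IDF weights to keep the arithmetic visible, and the `cosine` helper and sample texts are assumptions for illustration.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Normalized dot product of two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


short_page = Counter("seo migration checklist".split())
long_page = Counter(("seo migration checklist " * 10).split())  # same words, ten times longer
other_page = Counter("schema markup examples".split())

print(cosine(short_page, long_page))   # same direction, very different length
print(cosine(short_page, other_page))  # no shared words
```

The long page scores as close to identical to the short one because only the mix of words matters, not how many there are.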

PYTHON
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "SEO migration checklist for preserving rankings",
    "Website migration SEO checklist before launch",
    "Schema markup examples for product pages",
]

# Build TF-IDF vectors, dropping scikit-learn's built-in English stopwords.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)

# scores[i][j] is the cosine similarity between docs[i] and docs[j].
# The diagonal compares each document with itself.
scores = cosine_similarity(matrix)

print(scores)

You can use the score matrix to sort review queues. High similarity means "inspect this pair first." It does not mean "delete one."
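A sketch of that sorting step, assuming a square score matrix and a parallel list of URLs. The `review_queue` helper, the URLs, the `min_score` default, and the matrix values are all made up for illustration; in practice the matrix would be the `cosine_similarity` output.

```python
from itertools import combinations

# Hypothetical URLs, in the same order as the documents that were scored.
urls = [
    "/seo-migration-checklist",
    "/website-migration-seo",
    "/schema-markup-examples",
]

# Illustrative values only, not real output.
scores = [
    [1.0, 0.6, 0.0],
    [0.6, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]


def review_queue(urls, scores, min_score=0.3):
    """Return (score, url_a, url_b) for each distinct pair, highest first."""
    pairs = [
        (scores[i][j], urls[i], urls[j])
        for i, j in combinations(range(len(urls)), 2)  # skips self-pairs and mirrored pairs
        if scores[i][j] >= min_score
    ]
    return sorted(pairs, reverse=True)


for score, url_a, url_b in review_queue(urls, scores):
    print(f"{score:.2f}  {url_a}  <->  {url_b}")
```

Using `combinations` keeps the queue free of the diagonal and of duplicate mirrored pairs, so each pair appears once.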

Thresholds Are Local

There is no universal score where a page becomes a duplicate.

A 0.72 cosine similarity between two title tags may be boring. A 0.72 similarity between two 1,200-word service pages may be a serious review candidate. A 0.9 score between two legal disclaimers may be expected.

Start with loose thresholds, then calibrate against real examples from the site.

| Pair type | First-pass flag | Human question |
| --- | --- | --- |
| Title tags | High word overlap | Are these pages targeting the same query? |
| Meta descriptions | Very high overlap | Did templating produce generic snippets? |
| Service pages | High TF-IDF cosine | Are the offers actually distinct? |
| Product pages | High similarity plus similar attributes | Is this a variant strategy or duplicate content? |
| Blog posts | High similarity plus shared intent | Should these be merged, differentiated, or cross-linked? |
| Location pages | High similarity with only city swaps | Is there real local detail, proof, or usefulness? |

The threshold is a triage rule. The decision is editorial and strategic.
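One way to keep thresholds local is to store them per pair type and check them against pairs a reviewer has already labeled. Everything in this sketch is an assumption: the threshold numbers are placeholders, and `needs_review`, `flag_stats`, and the labeled sample are hypothetical.

```python
# Placeholder starting points, not recommendations. Calibrate per site.
THRESHOLDS = {
    "title": 0.8,
    "meta_description": 0.9,
    "service_page": 0.7,
}
DEFAULT_THRESHOLD = 0.75


def needs_review(pair_type: str, score: float) -> bool:
    return score >= THRESHOLDS.get(pair_type, DEFAULT_THRESHOLD)


def flag_stats(labeled, threshold):
    """labeled: list of (score, reviewer_said_problem) tuples."""
    flagged = [is_problem for score, is_problem in labeled if score >= threshold]
    missed = sum(1 for score, is_problem in labeled if is_problem and score < threshold)
    return {
        "flagged": len(flagged),
        "true_problems": sum(flagged),
        "missed_problems": missed,
    }


# The same 0.72 score is ignored for titles and flagged for service pages.
print(needs_review("title", 0.72), needs_review("service_page", 0.72))

# Made-up reviewer labels, used to compare two candidate thresholds.
labeled = [(0.95, True), (0.85, False), (0.72, True), (0.40, False)]
print(flag_stats(labeled, 0.8))
print(flag_stats(labeled, 0.7))
```

Lowering the threshold catches the missed problem at the cost of one more pair to review. That trade-off is the calibration decision.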

Practical SEO Uses

Duplicate Title And Description QA

Run title tags and meta descriptions through a similarity check before launch.

This catches:

  • title templates used too aggressively,

  • location pages with only the city changed,

  • product pages with weak differentiators,

  • generated meta descriptions that sound unique but say the same thing,

  • old and new drafts that did not actually diverge.

The fix is not always "make everything wildly different." Sometimes similar metadata is fine. The question is whether the field helps searchers and systems understand why this URL deserves to exist.
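City-swapped titles are a case where a targeted check can be clearer than a score. The sketch below masks known location names and groups titles that become identical. The `CITIES` list, the helper names, and the sample titles are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: you already have a list of locations the site targets.
CITIES = ["austin", "dallas", "houston"]
city_pattern = re.compile(r"\b(" + "|".join(map(re.escape, CITIES)) + r")\b")


def template_key(title: str) -> str:
    """Lowercase the title and replace any known city with a placeholder."""
    masked = city_pattern.sub("{city}", title.lower())
    return re.sub(r"\s+", " ", masked).strip()


def group_by_template(titles: dict) -> dict:
    """Map each shared template to the URLs that use it."""
    groups = defaultdict(list)
    for url, title in titles.items():
        groups[template_key(title)].append(url)
    return {key: urls for key, urls in groups.items() if len(urls) > 1}


titles = {
    "/austin": "Plumber in Austin | Acme",
    "/dallas": "Plumber in Dallas | Acme",
    "/about": "About Acme Plumbing",
}

print(group_by_template(titles))
```

A group is not automatically a problem. It is a list of pages where the reviewer should ask whether anything besides the city name differs.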

Content Cannibalization Review

If two pages are similar and target similar queries, they may be splitting relevance.

Similarity scoring can help find candidates for:

  • merging,

  • redirecting,

  • differentiating,

  • internal link cleanup,

  • clearer keyword ownership,

  • canonical review.

It will not decide the strategy for you. Use Search Console queries, rankings, conversions, backlinks, and business priority before changing URLs.

Generated Content QA

Similarity checks are useful when a team is using templates, AI drafts, programmatic SEO, or bulk metadata generation.

Flag:

  • paragraphs repeated across pages,

  • overused intros,

  • near-identical conclusion blocks,

  • product copy that only changes the model name,

  • city pages with no real local specificity,

  • AI rewrites that preserve the same thin structure.

Google's helpful, reliable, people-first content guidance is a good sanity check here. If a page only exists as a lightly varied version of another page, similarity scoring may be showing a deeper usefulness problem.
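Repeated paragraphs are the simplest of these to catch. This sketch only finds paragraphs that are identical after lowercasing and whitespace cleanup, so lightly reworded copies need one of the scoring methods above. The function names, the `min_words` cutoff, and the sample pages are assumptions.

```python
import re
from collections import defaultdict


def paragraph_key(paragraph: str) -> str:
    return re.sub(r"\s+", " ", paragraph.lower()).strip()


def repeated_paragraphs(pages: dict, min_pages: int = 2, min_words: int = 8) -> dict:
    """Map each repeated paragraph to the sorted URLs that contain it."""
    seen = defaultdict(set)
    for url, body in pages.items():
        for paragraph in body.split("\n\n"):  # assumes blank lines separate paragraphs
            key = paragraph_key(paragraph)
            if len(key.split()) >= min_words:  # ignore very short lines
                seen[key].add(url)
    return {key: sorted(urls) for key, urls in seen.items() if len(urls) >= min_pages}


# Hypothetical page bodies with one shared closing block.
pages = {
    "/a": "Unique intro about widgets.\n\nContact our friendly team today for a free quote on your project.",
    "/b": "Another intro.\n\nContact our friendly team today for a  free quote on your project.",
    "/c": "Nothing shared here at all, just different copy for this page entirely.",
}

for paragraph, urls in repeated_paragraphs(pages).items():
    print(urls, paragraph)
```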

Keyword Clustering

Simple similarity can group obvious variants before a human reviews intent.

For example:

  • "technical SEO audit"

  • "technical SEO checklist"

  • "SEO audit service"

  • "site audit for SEO"

These may not all deserve the same page, but they probably belong in the same review batch. Similarity helps organize the work; SERP intent decides the page strategy.
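A rough sketch of that batching, using word overlap and a greedy first-match rule. The `batch_keywords` helper and the 0.5 threshold are assumptions, and "schema markup generator" is an extra keyword added to show a separate batch. A greedy pass depends on input order and never merges batches, so treat the output as a starting point.

```python
import re

STOPWORDS = {"the", "a", "an", "and", "or", "for", "to", "of", "in"}


def tokens(text: str) -> set[str]:
    words = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    return {word for word in words if word not in STOPWORDS}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def batch_keywords(keywords, threshold=0.5):
    """Put each keyword in the first batch that has a similar enough member."""
    batches = []
    for keyword in keywords:
        for batch in batches:
            if any(jaccard(tokens(keyword), tokens(member)) >= threshold for member in batch):
                batch.append(keyword)
                break
        else:
            batches.append([keyword])  # no batch matched, so start a new one
    return batches


keywords = [
    "technical SEO audit",
    "technical SEO checklist",
    "SEO audit service",
    "site audit for SEO",
    "schema markup generator",
]

for batch in batch_keywords(keywords):
    print(batch)
```

The four audit-related phrases land in one batch and the schema keyword stands alone. Whether the batch becomes one page or four is still a SERP question.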

A Sensible Workflow

Use text similarity as a QA pipeline:

  1. Export URLs with title, H1, meta description, canonical, target query, traffic, and conversions.

  2. Extract main content, then remove navigation, footer, and repeated boilerplate.

  3. Normalize the fields you want to compare.

  4. Score likely pairs instead of every possible pair when the site is large.

  5. Sort by highest similarity inside the same content type or section.

  6. Review the top results manually.

  7. Label the issue: duplicate, overlap, template copy, legitimate variant, or no problem.

  8. Decide whether to merge, rewrite, redirect, canonicalize, improve internal links, or leave alone.

  9. Re-run the check after edits so the QA loop has evidence.

The score is a flashlight, not a verdict.
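Steps 4 and 5 matter on large sites because the number of possible pairs grows as n(n-1)/2. One simple way to cut that down is to compare pages only within the same section. In this sketch the `section` field, the `candidate_pairs` helper, and the sample pages are assumptions about how the export is shaped.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(pages):
    """Yield URL pairs that share a section instead of every possible pair."""
    by_section = defaultdict(list)
    for page in pages:
        by_section[page["section"]].append(page["url"])
    for urls in by_section.values():
        yield from combinations(sorted(urls), 2)


# Hypothetical export rows.
pages = [
    {"url": "/services/seo-audit", "section": "services"},
    {"url": "/services/technical-seo", "section": "services"},
    {"url": "/blog/seo-audit-guide", "section": "blog"},
    {"url": "/blog/migration-checklist", "section": "blog"},
]

pairs = list(candidate_pairs(pages))
all_pairs = len(pages) * (len(pages) - 1) // 2

print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
for pair in pairs:
    print(pair)
```

The trade-off is that cross-section overlap, such as a blog post competing with a service page, is never scored. If that is a known risk on the site, add a second pass that pairs pages by shared target query.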

What Simple Similarity Misses

Simple methods do not understand meaning the way people do.

They struggle with:

  • synonyms,

  • sarcasm,

  • entity relationships,

  • intent differences,

  • multilingual content,

  • pages that use different wording for the same concept,

  • pages that share boilerplate but solve different problems.

For semantic matching, embeddings are stronger. But embeddings are not always the first tool to reach for. They add cost, complexity, and harder-to-explain results.

For audits, launch QA, metadata review, and content inventories, simple methods are often enough.

Where This Fits In A Content System

Text similarity is useful because it turns a messy content inventory into a smaller review problem.

The output should not be "these 180 pages are bad." It should be something a team can act on:

| Finding | Better handoff |
| --- | --- |
| 47 pages have similar titles | Review the top 15 pairs by traffic and business value |
| 22 descriptions are near-identical | Rewrite the shared template and regenerate affected pages |
| 8 service pages overlap heavily | Assign one primary intent to each page or consolidate |
| 3 articles cover the same topic | Merge, redirect, and preserve the strongest sections |
| Location pages are mostly boilerplate | Add real proof, local details, photos, testimonials, or cut the page |

That is the real value: not the algorithm by itself, but the handoff from vague concern to focused action.

The Bottom Line

Text similarity is not advanced SEO theater. It is a practical QA layer.

Use keyword overlap for short fields. Use TF-IDF and cosine similarity for larger text. Use embeddings only when lexical methods are not enough. Then let humans make the actual SEO decision.

For SEO and content QA, that is often enough to move from "we should review this site" to "these are the twenty things to check first."
