Implementing Levenshtein Distance for Product Matching

This guide solves one focused problem: given a scraped competitor product title or vendor SKU that has no usable GTIN/UPC, how do you score its character-level similarity against an internal catalog entry reliably enough to drive automated price comparison? Levenshtein (edit) distance is the deterministic baseline for that decision. It is a sub-stage of Fuzzy Matching Algorithms for SKU Alignment — the parent guide that owns the full match cascade — and it only behaves well when its inputs have already been cleaned by the Data Normalization & Promo Parsing Pipelines stage and when its outputs feed deterministic tie-breakers in Price Hierarchy & Rule-Based Fallback Routing. Used in isolation it produces phantom matches; used as a calibrated, pre-filtered scorer it is fast, transparent, and auditable.

Prerequisites & Input Contract

This implementation assumes records have already passed normalization upstream and arrive as a flat, typed structure. The minimum input contract per record is:

{
  "raw_sku": "string (required, non-empty)",
  "normalized_sku": "string (required, lowercased, punctuation-stripped)",
  "brand": "string (nullable)",
  "category": "enum[electronics, apparel, home, pharma, accessories]",
  "variant_attrs": { "capacity": "string?", "color": "string?", "pack_size": "int?" }
}
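
As a minimal sketch of enforcing this contract at the stage boundary, the snippet below rejects records that violate the required fields. The names `MatchRecord` and `validate_record` are hypothetical, and the specific checks are an assumption about how strictly "required" should be read; adapt them to whatever the upstream normalization stage actually guarantees.

```python
from dataclasses import dataclass, field
from typing import Optional

# Allowed values copied from the contract's category enum.
CATEGORIES = {"electronics", "apparel", "home", "pharma", "accessories"}

@dataclass
class MatchRecord:  # hypothetical name for one contract-conforming record
    raw_sku: str
    normalized_sku: str
    category: str
    brand: Optional[str] = None
    variant_attrs: dict = field(default_factory=dict)

def validate_record(payload: dict) -> MatchRecord:
    """Build a MatchRecord or raise ValueError naming the violated rule."""
    raw = payload.get("raw_sku")
    norm = payload.get("normalized_sku")
    if not isinstance(raw, str) or not raw:
        raise ValueError("raw_sku must be a non-empty string")
    if not isinstance(norm, str) or not norm:
        raise ValueError("normalized_sku must be a non-empty string")
    if norm != norm.lower():
        raise ValueError("normalized_sku must be lowercased")
    if payload.get("category") not in CATEGORIES:
        raise ValueError("category is outside the contract enum")
    return MatchRecord(
        raw_sku=raw,
        normalized_sku=norm,
        category=payload["category"],
        brand=payload.get("brand"),                      # nullable by contract
        variant_attrs=dict(payload.get("variant_attrs") or {}),
    )
```

Failing fast here keeps malformed rows out of the scorer, where they would otherwise surface as unexplained low scores.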

Environment assumptions:

Python ≥ 3.10 (the examples avoid match statements but assume modern typing syntax).
rapidfuzz ≥ 3.0 — the C-optimized library that supplies Levenshtein, process.extract, and process.cdist. Pure-Python edit distance is O(N·M) per pair with interpreter overhead and is unsuitable for production volume.
numpy ≥ 1.24 for the vectorized cdist path.
Inbound strings must be Unicode-normalized (NFKC) before they reach this stage; scraped feeds routinely inject zero-width joiners and full-width characters that silently inflate edit distance.

Edit distance measures the minimum number of single-character insertions, deletions, or substitutions to turn string a into string b. It is computed with a dynamic-programming matrix where each cell holds the cost of aligning the prefixes that meet at it; the bottom-right cell is the answer. The normalized similarity used throughout this guide is 1 - distance / max(len(a), len(b)), yielding a value in [0.0, 1.0].

Filling the matrix for CABEL (a transposed-letter typo) against the correct CABLE, the optimal alignment runs along the diagonal and the final cell shows a distance of 2: the swapped E and L resolve as two separate substitutions, the same penalty that makes plain Levenshtein under-score vendor typos (see Edge Cases below).
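
The matrix for that pair can be reproduced with a few lines of pure Python. This is an illustrative sketch for inspecting the table only; the function name `levenshtein_matrix` is hypothetical, and production scoring should stay on the C-backed library named in the prerequisites.

```python
def levenshtein_matrix(a: str, b: str) -> list[list[int]]:
    """Full DP table; cell [i][j] is the distance between a[:i] and b[:j]."""
    rows, cols = len(a) + 1, len(b) + 1
    grid = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        grid[i][0] = i          # delete i characters of a
    for j in range(cols):
        grid[0][j] = j          # insert j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            grid[i][j] = min(
                grid[i - 1][j] + 1,         # deletion
                grid[i][j - 1] + 1,         # insertion
                grid[i - 1][j - 1] + cost,  # match or substitution
            )
    return grid

grid = levenshtein_matrix("cabel", "cable")
for row in grid:
    print(" ".join(f"{cell:2d}" for cell in row))

distance = grid[-1][-1]                       # bottom-right cell
similarity = 1 - distance / max(len("cabel"), len("cable"))
print(distance, round(similarity, 2))         # 2 0.6
```

Reading the main diagonal gives 0, 0, 0, 0, 1, 2: the first three letters align for free and each of the two swapped letters costs one substitution.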

Step-by-Step Implementation

Step 1 — Normalize the input strings

Raw titles and SKUs carry structural noise (trademark glyphs, abbreviations, separators) that inflates edit distance artificially. Apply a deterministic, stateless, cacheable normalizer before any distance call. This mirrors the contract established in the unified product catalog schema, so identifiers compare on semantic content rather than formatting.

import re

STOP_WORDS = {"the", "new", "official", "store", "brand", "product", "item"}

# Order matters: longer keys must expand before their prefixes
# (e.g. "w/o" before "w/") so "w/o" is never matched as "w/".
ABBREVIATIONS = {
    "w/o": "without",
    "w/": "with",
    "oz": "ounce",
    "pk": "pack",
    "gen": "generation",
}

def normalize_sku_string(raw: str) -> str:
    text = raw.lower().strip()
    text = re.sub(r"[™®©]", "", text)

    # Expand abbreviations BEFORE stripping punctuation: some keys contain '/'.
    # \b cannot anchor a non-word char, so use lookarounds.
    for abbr, full in ABBREVIATIONS.items():
        text = re.sub(rf"(?<!\w){re.escape(abbr)}(?!\w)", full, text)

    text = re.sub(r"[-_/]", " ", text)            # separators -> space
    text = re.sub(r"\s+", " ", text).strip()      # collapse whitespace
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

# >>> normalize_sku_string("Acme™ 6-Pk Gen3 Cable w/ Case")
# 'acme 6 pack gen3 cable with case'
# "gen3" is left alone: the trailing digit is a word character, so (?!\w) blocks the match.

Step 2 — Compute a length-penalized similarity

Raw normalized similarity rewards short overlapping prefixes too generously, so subtract a bounded length-disparity penalty. The penalty caps at 0.15 so it nudges rather than dominates the score:

$$s_{\text{adj}} = \operatorname{sim}(a, b) - \min\!\left(\frac{\bigl\lvert\,\lvert a\rvert - \lvert b\rvert\,\bigr\rvert}{\max(\lvert a\rvert, \lvert b\rvert)},\; 0.15\right)$$

from rapidfuzz.distance import Levenshtein

def compute_weighted_levenshtein(source: str, target: str) -> float:
    """Length-penalized Levenshtein similarity in [0.0, 1.0]."""
    base = Levenshtein.normalized_similarity(source, target)
    max_len = max(len(source), len(target), 1)
    length_penalty = min(abs(len(source) - len(target)) / max_len, 0.15)
    return max(base - length_penalty, 0.0)

# >>> compute_weighted_levenshtein("apple usb c cable 1m", "apple usb c cable 2m")
# 0.95
# >>> compute_weighted_levenshtein("apple usb c cable", "apple usb c cable for macbook pro 16 2023")
# ≈ 0.265  # base similarity 1 - 24/41 ≈ 0.415, minus the capped 0.15 length penalty
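
As a worked instance of the formula, take $a$ = usb c cable (11 characters) and $b$ = usb c cable 2m (14 characters). These strings are illustrative and not drawn from a real feed. Because $b$ only appends three characters, the edit distance is 3:

$$\operatorname{sim}(a, b) = 1 - \frac{3}{14} \approx 0.786, \qquad \min\!\left(\frac{\lvert 11 - 14\rvert}{14},\; 0.15\right) = \min(0.214,\; 0.15) = 0.15, \qquad s_{\text{adj}} \approx 0.786 - 0.15 = 0.636$$

The raw length ratio (0.214) exceeds the cap, so the penalty saturates at 0.15 rather than growing with the gap.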

Step 3 — Batch-match a query against candidates with early cutoff

Never score a query against every catalog row in Python. Pass a score_cutoff so the C backend terminates early on hopeless pairs, and cap results with limit:

from typing import List, Tuple

from rapidfuzz.distance import Levenshtein
from rapidfuzz.process import extract

def batch_match(
    candidates: List[str], query: str, threshold: float = 0.85
) -> List[Tuple[str, float]]:
    """Return up to 5 (candidate, score) pairs above threshold, best first."""
    matches = extract(
        query,
        candidates,
        scorer=Levenshtein.normalized_similarity,
        score_cutoff=threshold,   # early termination in the C backend
        limit=5,
    )
    return [(m[0], m[1]) for m in matches]

# >>> batch_match(["apple usb c cable 1m", "anker usb c cable"], "apple usb c cable 1 m", 0.8)
# [('apple usb c cable 1m', 0.952...)]  # 'anker usb c cable' scores ~0.62 and is cut off

Step 4 — Gate the score on a category-specific threshold

A single global threshold fails across heterogeneous catalogs: apparel tolerates color/size variance, while electronics and pharmaceuticals demand strict character alignment. Drive the accept/review/reject decision from a category matrix, and route anything below the floor into Price Hierarchy & Rule-Based Fallback Routing (UPC/EAN exact match, brand+model token intersection, or human review) rather than guessing.

Tier	Score band	Categories	Optimized for
High-precision	$0.88 \le s \le 0.95$	Electronics, pharma, serialized HW	Minimizing false positives
Balanced	$0.80 \le s \le 0.87$	Home goods, apparel, consumables	Tolerating size/color suffix variance
Recall-optimized	$0.72 \le s \le 0.79$	Generic accessories, bundle kits	Maximizing candidate coverage

CATEGORY_FLOORS = {
    "electronics": 0.88, "pharma": 0.90, "home": 0.82,
    "apparel": 0.80, "accessories": 0.74,
}

def decide(query: str, candidate: str, category: str) -> str:
    score = compute_weighted_levenshtein(query, candidate)
    floor = CATEGORY_FLOORS.get(category, 0.85)
    if score >= floor + 0.05:
        return "auto_accept"
    if score >= floor:
        return "review"          # hand off to fallback routing / analyst queue
    return "reject"

Verification & Testing

Lock the scorer’s behavior with table-driven assertions so threshold tweaks cannot silently regress. Run with pytest:

import pytest
from matcher import normalize_sku_string, compute_weighted_levenshtein, decide

@pytest.mark.parametrize("raw, expected", [
    ("Acme™ 6-Pk", "acme 6 pack"),
    ("Cable w/o Case", "cable without case"),  # w/o expands before w/
    ("  The  NEW  Widget ", "widget"),         # stop-words + whitespace
])
def test_normalization_is_deterministic(raw, expected):
    assert normalize_sku_string(raw) == expected

def test_identical_strings_score_one():
    assert compute_weighted_levenshtein("apple usb c", "apple usb c") == 1.0

def test_length_penalty_is_bounded():
    s = compute_weighted_levenshtein("a", "a" * 100)
    assert 0.0 <= s <= 1.0  # penalty never drives the score negative

def test_category_floor_routing():
    # one substituted char in an electronics SKU must not auto-accept
    assert decide("sku ab12", "sku ab13", "electronics") in {"review", "reject"}

For a regression baseline, freeze a labeled sample of confirmed query/candidate pairs to a CSV, score the whole set, and assert that precision and recall stay within tolerance of the last known-good run. Diffing against that golden file catches drift introduced by normalization-dictionary edits before it reaches production repricing.
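
A minimal sketch of that golden-file check follows. The column names (`query`, `candidate`, `is_match`), the baseline figures, and the tolerance are all assumptions for illustration, and `exact_scorer` is a stand-in so the snippet runs without third-party packages; in practice you would pass the real length-penalized scorer and read the CSV from disk.

```python
import csv
import io
from typing import Callable

def precision_recall(
    rows: list[dict], scorer: Callable[[str, str], float], threshold: float
) -> tuple[float, float]:
    """Score each labeled pair and compare predictions with the labels."""
    tp = fp = fn = 0
    for row in rows:
        predicted = scorer(row["query"], row["candidate"]) >= threshold
        actual = row["is_match"] == "1"
        tp += predicted and actual
        fp += predicted and not actual
        fn += actual and not predicted
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Hypothetical golden file, inlined here instead of being read from disk.
GOLDEN_CSV = """query,candidate,is_match
apple usb c cable 1m,apple usb c cable 1m,1
apple usb c cable 1m,apple usb c cable 2m,0
anker usb c cable,apple usb c cable,0
"""

def exact_scorer(a: str, b: str) -> float:
    """Stand-in scorer: 1.0 for identical strings, otherwise 0.0."""
    return 1.0 if a == b else 0.0

rows = list(csv.DictReader(io.StringIO(GOLDEN_CSV)))
precision, recall = precision_recall(rows, exact_scorer, threshold=0.85)

BASELINE = (1.0, 1.0)   # assumed values from the last known-good run
TOLERANCE = 0.02        # assumed drift budget
assert abs(precision - BASELINE[0]) <= TOLERANCE
assert abs(recall - BASELINE[1]) <= TOLERANCE
```

Keeping the scorer as a parameter lets the same harness compare two normalization dictionaries or two thresholds against one frozen sample.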

Edge Cases & Gotchas

Transpositions cost two edits. Plain Levenshtein treats keybaord → keyboard as two operations, deflating the score for fat-finger vendor typos. Switch to rapidfuzz.distance.DamerauLevenshtein (which counts an adjacent swap as one edit) or blend in a Jaro-Winkler component when transposition noise dominates a source.
```
from rapidfuzz.distance import DamerauLevenshtein
DamerauLevenshtein.normalized_similarity("keybaord", "keyboard")  # 0.875, not 0.75
```
Variant suffixes collide. iphone 15 128gb and iphone 15 256gb differ by only three characters and score 0.80; a one-character variant difference such as apple usb c cable 1m versus apple usb c cable 2m scores 0.95, high enough to auto-accept in electronics, which is a margin-destroying false match. Strip and store variant identifiers (-BLK, 128GB, V2) before scoring, then re-assert them as a hard gate after; capacity or pack-size mismatch must cap the score regardless of string overlap.
Unicode artifacts inflate distance. Feeds inject zero-width joiners and full-width digits that look identical but are distinct code points. Normalize to NFKC and strip zero-width characters (U+200B through U+200D) before Step 1, or two visually identical titles will never match.
```
import re
import unicodedata
# NFKC folds full-width forms; zero-width characters survive it, so strip them explicitly.
clean = re.sub(r"[\u200b-\u200d]", "", unicodedata.normalize("NFKC", raw))
```
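Empty or single-token inputs. A blank normalized_sku after stop-word removal (e.g. a title that was entirely brand noise) is dangerous: the max_len = 1 guard only prevents a division by zero, so two empty strings compare as identical and score 1.0, while very short strings swing sharply with every edit. Guard for empty strings and route them straight to the review queue instead of scoring.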
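
The variant and empty-input guards above can be sketched as a single check that runs before any distance call. The function name `pre_score_gate`, the attribute list, and the choice to return "reject" on a variant mismatch are assumptions for illustration; the record shape follows the input contract from the prerequisites.

```python
from typing import Optional

# Attributes that must agree exactly before a string score is trusted.
HARD_GATE_ATTRS = ("capacity", "color", "pack_size")

def pre_score_gate(query: dict, candidate: dict) -> Optional[str]:
    """Return a routing decision, or None when scoring may proceed."""
    # Empty strings carry no signal; send them to review instead of scoring.
    if not query.get("normalized_sku") or not candidate.get("normalized_sku"):
        return "review"
    q_attrs = query.get("variant_attrs") or {}
    c_attrs = candidate.get("variant_attrs") or {}
    for name in HARD_GATE_ATTRS:
        q_val, c_val = q_attrs.get(name), c_attrs.get(name)
        # Only a known-vs-known disagreement counts as a hard mismatch.
        if q_val is not None and c_val is not None and q_val != c_val:
            return "reject"
    return None

a = {"normalized_sku": "iphone 15", "variant_attrs": {"capacity": "128gb"}}
b = {"normalized_sku": "iphone 15", "variant_attrs": {"capacity": "256gb"}}
print(pre_score_gate(a, b))   # reject
```

A missing attribute on either side is treated as unknown rather than as a mismatch, which keeps sparse scraped records eligible for scoring.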

Performance Notes

A single Levenshtein comparison is O(N·M) in time and O(min(N, M)) in space with the band-optimized C implementation, where N and M are string lengths. The bottleneck is never one pair; it is the quadratic blow-up of comparing every query against every catalog row (Q × C comparisons for Q queries and C rows). Levenshtein scales only behind a candidate-reduction step.
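
One simple form of candidate reduction is blocking: index the catalog by a cheap key and score only the rows that share the query's key. The sketch below uses the first token as that key, which is an assumption chosen for brevity; the function names are hypothetical, and a real pipeline would more likely block on brand or category from the input contract.

```python
from collections import defaultdict

def build_blocks(catalog: list[str]) -> dict[str, list[str]]:
    """Index catalog strings by their first token (a crude blocking key)."""
    blocks: dict[str, list[str]] = defaultdict(list)
    for title in catalog:
        tokens = title.split()
        if tokens:
            blocks[tokens[0]].append(title)
    return blocks

def candidates_for(query: str, blocks: dict[str, list[str]]) -> list[str]:
    """Return only the catalog rows that share the query's blocking key."""
    tokens = query.split()
    return list(blocks.get(tokens[0], [])) if tokens else []

catalog = ["apple usb c cable 1m", "apple usb c cable 2m", "anker usb c cable"]
blocks = build_blocks(catalog)
print(candidates_for("apple usb c cable 1 m", blocks))
# ['apple usb c cable 1m', 'apple usb c cable 2m']
```

The trade-off is recall: a typo in the key token sends the query to the wrong block or to none, so the key should come from the cleanest field available.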

Vectorize the inner loop. For aligning a scraped batch against a 100k+ SKU master, replace Python iteration with rapidfuzz.process.cdist, which computes a full score matrix in the C layer and cuts Python-level overhead by roughly 70%.
Cache deterministic results. Edit distance is pure: identical normalized query/candidate pairs must never be recomputed. Cache normalized strings and hot score pairs in Redis with a TTL aligned to your catalog refresh cycle (7–14 days).
Know when to stop using it. Edit distance is a character-level metric; it cannot tell that “16GB RAM laptop” and “laptop with 16 gigabytes memory” are the same product. Once word-order and synonym variance dominate over typos, that is the signal to move past raw Levenshtein toward token-set scoring, and ultimately to the blocking and embedding-based candidate generation owned by Fuzzy Matching Algorithms for SKU Alignment. Treat this technique as the transparent, auditable baseline — not the whole matcher.
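
To illustrate the word-order point, the sketch below scores strings by token overlap instead of character edits. It uses plain Jaccard similarity as a stand-in for token-set scoring; that is an assumption for illustration and not the same computation as a library token-set ratio. The function name `token_jaccard` is hypothetical.

```python
def token_jaccard(a: str, b: str) -> float:
    """Share of distinct tokens common to both strings, ignoring order."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 0.0   # no tokens at all: treat as no evidence, not a match
    return len(set_a & set_b) / len(set_a | set_b)

# Same tokens in a different order: a perfect token-level match.
print(token_jaccard("apple usb c cable", "cable usb c apple"))   # 1.0

# Synonyms and unit spellings still defeat token overlap (1 shared token of 7).
print(round(token_jaccard("16gb ram laptop", "laptop with 16 gigabytes memory"), 2))   # 0.14
```

The second pair shows why token scoring is only an intermediate step: once synonyms dominate, neither characters nor tokens carry enough signal.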

Related

Fuzzy Matching Algorithms for SKU Alignment: the parent guide whose match cascade this edit-distance scorer plugs into.
Building a Unified Product Catalog Schema: the canonical entity model that consumes accepted matches and re-asserts variant attributes.
Price Hierarchy & Rule-Based Fallback Routing: the deterministic destination for review-band and sub-floor scores.
Cross-Platform Category Taxonomy Mapping: supplies the category signal that selects which threshold floor applies.
