Cross-Platform Category Taxonomy Mapping

Cross-platform category taxonomy mapping is the component that reconciles the wildly divergent category trees of competing retailers — Amazon’s Electronics > Headphones > Over-Ear, a boutique’s Sound & Vision / Cans, a marketplace’s flat Audio bucket — onto one canonical hierarchy your pricing models can trust. It sits inside the Core Architecture & Catalog Matching Fundamentals layer, downstream of ingestion and upstream of repricing. Without it, category-level price elasticity, share-of-voice, and margin-floor logic all read from incompatible buckets and silently produce wrong answers. This guide is written for Python engineers and pricing analysts who need a resolver that is deterministic first, probabilistic only as a fallback, and fully auditable. Category mapping reuses the canonical model defined in Building a Unified Product Catalog Schema and borrows similarity techniques from Fuzzy Matching Algorithms for SKU Alignment; read those first if you have not built the schema or the SKU matcher yet.

Problem Framing & Prerequisites

When category alignment is missing or naive, downstream pricing signals become statistically noisy: a competitor’s “Smart Watches” listing lands in your “Wearables” cohort on one retailer and your “Fitness” cohort on another, so elasticity curves average across unrelated demand. Repricing rules keyed on category (margin floors, brand eligibility, region locks) fire against the wrong inventory, and SKU match rates fall because category context — one of the strongest blocking keys — is unreliable.

This component assumes three upstream stages already exist:

Ingestion has emitted raw category breadcrumbs per listing. Competitor category trees are extracted via headless browser automation, structured API feeds, or sitemap parsers, as covered in Scraping & Data Ingestion Workflows. Output is serialized to JSON or Parquet containing raw breadcrumbs, canonical URLs, locale identifiers, and extraction timestamps.
Normalization has stripped HTML artifacts and units, per Data Normalization & Promo Parsing Pipelines. Taxonomy mapping consumes already-cleaned text, not raw DOM.
A canonical catalog schema exists with a stable category_path field, so resolved categories have somewhere consistent to land.

The input contract is a single normalized record. Validate it with Pydantic before anything touches the resolver, so malformed breadcrumbs never cascade into the matching core:

from datetime import datetime
from pydantic import BaseModel, field_validator


class RawCategoryRecord(BaseModel):
    source_platform: str          # e.g. "amazon_de"
    listing_url: str
    breadcrumb: list[str]         # ["Electronics", "Headphones", "Over-Ear"]
    locale: str                   # BCP-47, e.g. "de-DE"
    scraped_at: datetime

    @field_validator("breadcrumb")
    @classmethod
    def non_empty(cls, v: list[str]) -> list[str]:
        if not v or any(not seg.strip() for seg in v):
            raise ValueError("breadcrumb must be non-empty with no blank segments")
        return v

The output contract is a ResolvedCategory carrying the canonical path, the method that produced it, a confidence score, and a routing decision: the four fields every audit and every downstream consumer depends on.

Architecture Detail: The Canonical Graph and the Resolver

The foundation of reliable mapping is a rigorously defined internal taxonomy. Rather than a flat category list, model the canonical taxonomy as a directed acyclic graph (DAG) — or a hierarchical trie — that captures parent/child relationships, synonym clusters, and exclusion rules. A graph lets a node like Audio > Headphones carry aliases, a deterministic primary path, and policy flags simultaneously, while still supporting fast ancestor/descendant queries.

Each canonical node should store:

Primary path — the deterministic breadcrumb sequence that defines the node’s identity.
Alias cluster — synonyms, abbreviations, and regional variants (“Cans”, “Kopfhörer”, “Over-Ear”, “Headphone”).
Exclusion flags — categories blocked from pricing feeds for compliance or margin policy reasons.
Confidence weights — historical match-success rate per node, used to calibrate fallback thresholds.

import networkx as nx

taxonomy = nx.DiGraph()
taxonomy.add_node(
    "audio.headphones",
    primary_path=("Audio", "Headphones"),
    aliases={"cans", "kopfhorer", "over-ear", "ear phones", "casque"},
    excluded=False,
    confidence_weight=0.97,
)
taxonomy.add_edge("audio", "audio.headphones")  # parent -> child

The resolver evaluates a normalized breadcrumb against this graph in tiers, escalating only when the cheaper, higher-confidence method fails. This tiered execution model keeps accuracy and coverage from trading off against each other:

from dataclasses import dataclass
from enum import Enum


class Method(str, Enum):
    EXACT = "exact_path"
    HEURISTIC = "rule_heuristic"
    PROBABILISTIC = "vector_fallback"


@dataclass
class ResolvedCategory:
    canonical_id: str | None
    method: Method | None
    confidence: float
    routing: str  # "auto" | "review" | "quarantine"


def resolve(tokens: tuple[str, ...], graph, alias_index, embedder) -> ResolvedCategory:
    # Tier 1 — exact / canonical path match. Fastest, highest confidence.
    node = alias_index.get(tokens)
    if node is not None:
        return ResolvedCategory(node, Method.EXACT, 1.0, "auto")

    # Tier 2 — rule-based heuristics: leaf-token alias lookup + overrides.
    leaf = tokens[-1]
    node = alias_index.get((leaf,))
    if node is not None:
        return ResolvedCategory(node, Method.HEURISTIC, 0.93, "auto")

    # Tier 3 — probabilistic fallback via embedding similarity.
    node, score = embedder.nearest(tokens)
    routing = "auto" if score >= 0.92 else "review" if score >= 0.75 else "quarantine"
    return ResolvedCategory(node, Method.PROBABILISTIC, score, routing)

Tier 1 is an exact lookup of the sanitized token path: O(1) against a precomputed alias index, with no graph walk and zero hallucination risk. Tier 2 covers structural variation (“Electronics > Audio” vs “Sound & Vision”) with merchant-specific override tables and leaf-token matching. Tier 3 (string similarity or vector embeddings) only engages when the deterministic routes miss, which is exactly where the methods from the Fuzzy Matching Algorithms for SKU Alignment guide (token overlap, edit distance, semantic scoring) carry over to category text. Here embedder is any object whose nearest(tokens) returns a (node, score) pair; how it encodes the tokens is left to the implementation. The complexity trade-off is deliberate: the cheap deterministic tiers absorb the bulk of traffic, so the expensive embedding lookup runs only on the residual that genuinely needs it.
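
The precomputed alias index that Tiers 1 and 2 consult could be built along these lines. This is a minimal sketch, not the resolver's actual index: the `NODES` layout, the `build_alias_index` helper, and the choice to drop ambiguous leaf aliases rather than guess are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical canonical nodes: id -> (primary path, leaf aliases), pre-normalized.
NODES = {
    "audio.headphones": (("audio", "headphones"), {"cans", "kopfhorer", "over-ear"}),
    "audio.speakers": (("audio", "speakers"), {"loudspeakers", "boxen"}),
    "home.accessories": (("home", "accessories"), {"extras"}),
    "audio.accessories": (("audio", "accessories"), {"extras"}),
}


def build_alias_index(nodes: dict) -> dict[tuple[str, ...], str]:
    """Map full primary paths and unambiguous leaf aliases to a node id."""
    claims: dict[tuple[str, ...], set[str]] = defaultdict(set)
    for node_id, (path, aliases) in nodes.items():
        claims[path].add(node_id)             # Tier 1 key: the full path
        claims[(path[-1],)].add(node_id)      # Tier 2 key: the primary leaf
        for alias in aliases:
            claims[(alias,)].add(node_id)     # Tier 2 key: each leaf alias
    # Assumption: a key claimed by several nodes is ambiguous, so it is left
    # out of the index and falls through to a later tier.
    return {key: next(iter(ids)) for key, ids in claims.items() if len(ids) == 1}


alias_index = build_alias_index(NODES)
```

With this layout, `alias_index.get(("audio", "headphones"))` serves Tier 1 and `alias_index.get(("cans",))` serves Tier 2, while a shared leaf such as `("accessories",)` is deliberately absent.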

Normalization feeds this resolver and must account for linguistic drift across markets. Apply Unicode normalization (NFC/NFD), retail-specific stopword removal (“Shop”, “Deals”, “2024 Collection”), and locale-aware tokenization that preserves compound terms critical to pricing — “refurbished”, “open-box”, “OEM” must survive tokenization intact, because they gate entire margin policies.
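
A rough sketch of that normalization step, using only the standard library. The stopword set, the protected-term set, and the `normalize_segment` name are illustrative assumptions; real lists would be maintained per locale.

```python
import re
import unicodedata

# Assumed examples only; real stopword and protected-term lists are per locale.
STOPWORDS = {"shop", "deals", "sale"}
PROTECTED = {"open-box", "refurbished", "oem"}
YEAR_COLLECTION = re.compile(r"\b\d{4}\s+collection\b")


def normalize_segment(segment: str) -> tuple[str, ...]:
    """Normalize one breadcrumb segment into lowercase tokens."""
    # Compose characters (NFC) so "o" + combining diaeresis equals "ö".
    text = unicodedata.normalize("NFC", segment).casefold()
    text = YEAR_COLLECTION.sub(" ", text)
    tokens = []
    for raw in text.split():
        # Keep hyphens inside tokens so "open-box" survives as one term.
        token = re.sub(r"[^\w-]", "", raw).strip("-")
        if not token:
            continue
        # Protected terms always survive, even if a stopword list names them.
        if token in PROTECTED or token not in STOPWORDS:
            tokens.append(token)
    return tuple(tokens)
```

For example, `normalize_segment("Shop Open-Box Deals")` keeps only `"open-box"`, because the retail stopwords are dropped and the hyphenated term is never split.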

Candidate Generation & Compute Optimization

Tier 3 is the cost center. Comparing every unmatched competitor breadcrumb against every canonical node is O(n·m) and collapses under real catalog volume. Reduce the search space before invoking any embedding or string metric, exactly as SKU matching uses blocking:

Alias inverted index. Build a dict from every alias token to the set of candidate node IDs once at startup. Tiers 1 and 2 become hash lookups instead of graph walks.
Ancestor blocking. If the breadcrumb’s top segment already maps confidently (e.g. “Electronics” → electronics), restrict Tier 3 candidates to that subtree. This typically prunes 90%+ of nodes before scoring.
Locality-sensitive hashing (LSH). Encode canonical node labels as character n-gram MinHash signatures and bucket them; query breadcrumbs only score against same-bucket nodes, turning a quadratic scan into near-linear retrieval.
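
The LSH option above can be sketched with character n-gram MinHash signatures and banding. Everything named here (`char_ngrams`, `minhash`, `LshIndex`, the use of salted BLAKE2b as the hash family, and the 8 x 4 band layout) is an assumption chosen for illustration; a production system would more likely use a dedicated library. With `b` bands of `r` rows, the Jaccard similarity at which two labels become likely to share a bucket is approximately `(1/b)^(1/r)`, roughly 0.6 for the layout below.

```python
import hashlib
from collections import defaultdict


def char_ngrams(label: str, n: int = 3) -> set[str]:
    """Character n-grams of the padded, case-folded label."""
    text = f" {label.casefold()} "
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def _hash(gram: str, seed: int) -> int:
    # One salted hash per seed stands in for one random permutation.
    digest = hashlib.blake2b(
        gram.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash(label: str, num_perm: int = 32) -> tuple[int, ...]:
    """MinHash signature: the minimum hash of the n-gram set under each seed."""
    grams = char_ngrams(label)
    return tuple(min(_hash(g, seed) for g in grams) for seed in range(num_perm))


class LshIndex:
    """Banded LSH: labels sharing any full band land in the same bucket."""

    def __init__(self, bands: int = 8, rows: int = 4):
        self.bands, self.rows = bands, rows
        self.buckets: dict[tuple, set[str]] = defaultdict(set)

    def _keys(self, label: str):
        sig = minhash(label, self.bands * self.rows)
        for b in range(self.bands):
            yield (b, sig[b * self.rows:(b + 1) * self.rows])

    def add(self, node_id: str, label: str) -> None:
        for key in self._keys(label):
            self.buckets[key].add(node_id)

    def candidates(self, label: str) -> set[str]:
        found: set[str] = set()
        for key in self._keys(label):
            found |= self.buckets.get(key, set())
        return found
```

Only the node ids returned by `candidates()` would then be scored by the expensive metric; fewer rows per band raises recall at the cost of larger candidate sets.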

import numpy as np


class EmbeddingIndex:
    """Pre-encode canonical nodes once; query against the in-subtree subset."""

    def __init__(self, node_ids, vectors: np.ndarray):
        self.node_ids = node_ids
        # L2-normalize rows once so a dot product equals cosine similarity.
        self.matrix = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    def nearest(self, query_vec: np.ndarray, candidate_mask: np.ndarray):
        if not candidate_mask.any():
            return None, 0.0                     # empty subtree: nothing to score
        q = query_vec / np.linalg.norm(query_vec)
        scores = self.matrix @ q                 # cosine, vectorized
        # Mask with -inf rather than -1.0: a genuine cosine can equal -1.0.
        scores = np.where(candidate_mask, scores, -np.inf)
        best = int(scores.argmax())
        return self.node_ids[best], float(scores[best])
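
The `candidate_mask` argument is where ancestor blocking plugs in. The following is a sketch under stated assumptions: the `CHILDREN` adjacency dict, the node ids, and the `subtree` and `candidate_mask` helpers are hypothetical, and the tree is kept as a plain dict so the example needs no third-party packages.

```python
# Hypothetical canonical tree as parent -> children; ids are illustrative.
CHILDREN = {
    "root": ["electronics", "home"],
    "electronics": ["electronics.audio", "electronics.tv"],
    "electronics.audio": ["electronics.audio.headphones"],
    "home": ["home.kitchen"],
}
NODE_IDS = [
    "electronics", "electronics.audio", "electronics.audio.headphones",
    "electronics.tv", "home", "home.kitchen",
]


def subtree(root: str, children: dict[str, list[str]]) -> set[str]:
    """Return root plus all of its descendants (iterative depth-first walk)."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, []))
    return seen


def candidate_mask(top_node, top_confidence, node_ids, children, block_min=0.85):
    """Boolean mask aligned with node_ids; keeps every node when unsure."""
    if top_node is None or top_confidence < block_min:
        return [True] * len(node_ids)
    allowed = subtree(top_node, children)
    return [node_id in allowed for node_id in node_ids]
```

The returned list would be converted with `np.asarray(mask, dtype=bool)` before being passed to `nearest`. Falling back to an all-true mask when the top segment is uncertain trades compute for recall, which is the safer failure direction.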

For high-throughput tokenization use polars or duckdb; for deterministic string distance use rapidfuzz (C-optimized); for graph traversal use networkx or igraph. Encode canonical embeddings in a single batch at startup and cache them — they change only when the taxonomy version changes, so re-encoding per request is pure waste. Process competitor feeds in micro-batches and materialize intermediate scores in a key-value store so a worker restart replays cleanly rather than recomputing.

Configuration & Threshold Tuning

Routing thresholds are the dial that balances automation velocity against margin protection. Calibrate them against a hand-labeled ground-truth sample (a few hundred breadcrumbs per vertical is usually enough). Then tighten the automated cutoff where a misroute is expensive, such as electronics, where a wrong category misprices a high-AOV item, and relax it where categories are coarse or commoditized and errors are cheap.

| Parameter | Default | Tighten when | Relax when |
| --- | --- | --- | --- |
| `auto_route_min` (confidence for hands-off routing) | `0.92` | High-margin / regulated categories; volatile competitor navigation | Coarse, low-risk categories with stable trees |
| `review_band` (analyst queue) | `0.75` up to, but not including, `0.92` | Brand-restricted or parity-agreement SKUs | Mature taxonomy with high historical match rate |
| `quarantine_max` (drop below) | `0.75` | After a competitor redesign inflates false positives | Never below `0.6`: silent misroutes corrupt feeds |
| `ancestor_block_min` (subtree pruning) | `0.85` | Deep taxonomies (>5 levels) to cut compute | Shallow trees where pruning gains are marginal |
| `lsh_bucket_threshold` (n-gram Jaccard) | `0.5` | Recall is high but compute is over budget | Recall gaps appear on long compound labels |
| `embedding_refresh` | on version bump | n/a | n/a |

Treat these as versioned configuration, not constants in code. A confidence cutoff that changed silently is indistinguishable from a regression when match rates drift.
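
One way to hold the routing cutoffs as versioned configuration is a frozen dataclass that validates itself on construction. This is a sketch: the `RoutingConfig` name, its `version` field, and the two example instances are assumptions; the default values and the 0.6 floor come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingConfig:
    """Immutable, versioned routing thresholds."""

    version: str
    auto_route_min: float = 0.92
    quarantine_max: float = 0.75

    def __post_init__(self) -> None:
        # Reject configs that invert the bands or drop below the 0.6 floor.
        if not 0.6 <= self.quarantine_max < self.auto_route_min <= 1.0:
            raise ValueError("need 0.6 <= quarantine_max < auto_route_min <= 1.0")

    def route(self, confidence: float) -> str:
        if confidence >= self.auto_route_min:
            return "auto"
        if confidence >= self.quarantine_max:
            return "review"
        return "quarantine"


DEFAULT = RoutingConfig(version="1.0.0")
STRICT = RoutingConfig(version="1.1.0", auto_route_min=0.95)  # e.g. a high-margin vertical
```

Logging `config.version` next to each decision makes a changed cutoff visible in the audit trail instead of looking like a regression.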

Failure Modes & Mitigations

Competitor navigation redesign (taxonomy drift). A retailer restructures Sound & Vision into Audio + TV & Home Cinema, and yesterday’s deterministic paths miss en masse. Detect it by diffing the freshly scraped tree against the last stored version; a spike in Tier-3 fallbacks or quarantine volume is the alarm. Trigger an automated alert and re-map, rather than letting probabilistic guesses silently reroute traffic.

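class TaxonomyDriftAlert(Exception):
    """Raised when too many previously stored source paths disappear."""


def detect_drift(old_paths: set[tuple], new_paths: set[tuple], alert_ratio: float = 0.15) -> None:
    # Paths in the last stored tree that are missing from the fresh scrape.
    removed = old_paths - new_paths
    if old_paths and len(removed) / len(old_paths) >= alert_ratio:
        raise TaxonomyDriftAlert(f"{len(removed)} of {len(old_paths)} stored paths vanished")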
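
The second signal mentioned above, a spike in Tier-3 fallbacks, can be watched with a simple window comparison. This is a sketch under assumptions: the `fallback_share` and `fallback_spike` helpers and the 0.10 absolute-increase cutoff are hypothetical, and the method strings follow the `Method` enum values used by the resolver.

```python
from collections import Counter


def fallback_share(methods: list[str], fallback: str = "vector_fallback") -> float:
    """Fraction of decisions in a window that needed the fallback tier."""
    if not methods:
        return 0.0
    return Counter(methods)[fallback] / len(methods)


def fallback_spike(baseline: list[str], current: list[str], max_increase: float = 0.10) -> bool:
    """True when the fallback share rose by more than max_increase (absolute)."""
    return fallback_share(current) - fallback_share(baseline) > max_increase
```

Comparing per source platform rather than globally keeps one retailer's redesign from being averaged away by stable feeds.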

Cross-locale collision. “Bras” (apparel) in en-GB versus “BRAS” as an acronym, or German “Gift” (poison) versus English “gift”. Locale-aware tokenization plus a per-locale stopword set prevents an embedding model from forcing a confident but wrong match; never share one tokenizer across locales.
Polysemous leaf tokens. A bare “Accessories” or “Sale” leaf is ambiguous without its ancestors. Require the full ancestor context for Tier-2 matches on a configurable denylist of generic leaves; route bare generics straight to review.
State leakage between stages. Mapping that mutates shared state introduces silent failures that corrupt downstream price feeds. Keep the sanitization and resolver stages idempotent and stateless, and enforce backpressure and replayability with a broker (Apache Kafka or RabbitMQ) between them.
Embedding model version skew. Re-encoding canonical nodes with a new model while query vectors come from the old one silently degrades every score. Pin the model version into the taxonomy schema version and refuse to serve mismatched pairs.
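
The polysemous-leaf mitigation could look like the following Tier-2 guard. It is a sketch, not the resolver's actual rule set: the `GENERIC_LEAVES` denylist and the `tier2_lookup` helper are hypothetical, and "ancestor context" is interpreted here as a parent-plus-leaf key, which is an assumption.

```python
# Assumed denylist; in practice this would be configurable per locale.
GENERIC_LEAVES = {"accessories", "sale", "other", "new"}


def tier2_lookup(tokens, alias_index, generic_leaves=GENERIC_LEAVES):
    """Leaf-alias lookup that refuses to resolve a generic leaf on its own.

    Returns (node_id, routing). routing is "auto" on a hit, "review" for a
    bare generic leaf, and None on a miss so the caller can try Tier 3.
    """
    leaf = tokens[-1]
    if leaf in generic_leaves:
        if len(tokens) == 1:
            return None, "review"             # bare generic: send to an analyst
        # Generic leaf with ancestors: only a parent + leaf key may match.
        node = alias_index.get(tuple(tokens[-2:]))
        return (node, "auto") if node is not None else (None, None)
    node = alias_index.get((leaf,))
    return (node, "auto") if node is not None else (None, None)
```

Even if the index happens to contain a bare `("accessories",)` key, the guard never consults it, so a generic leaf cannot resolve without its parent.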

Compliance & Auditability

Every mapping decision must be logged as an immutable event — this is what defends a pricing strategy in a regulatory audit or supplier dispute. Record the source competitor domain and extraction timestamp, the raw breadcrumb versus the resolved canonical path, the applied transformation rules and confidence score, and the routing outcome (accepted, quarantined, escalated):

import json, logging

def emit_decision(rec: RawCategoryRecord, result: ResolvedCategory, schema_version: str):
    logging.getLogger("taxonomy.audit").info(json.dumps({
        "source_platform": rec.source_platform,
        "scraped_at": rec.scraped_at.isoformat(),
        "raw_breadcrumb": rec.breadcrumb,
        "canonical_id": result.canonical_id,
        "method": result.method.value if result.method else None,
        "confidence": round(result.confidence, 4),
        "routing": result.routing,
        "taxonomy_version": schema_version,
    }))

Apply semantic versioning to taxonomy schemas so version drift is detected via DAG diffing rather than discovered through mispriced inventory. Embed policy-as-code in the validation stage: region-specific data-residency constraints, brand restriction lists, and pricing-parity agreements must be enforced before a category reaches any pricing model, and excluded nodes must never route to auto. Category breadcrumbs rarely contain personal data, but redact or hash any incidental PII (seller names, user-generated path segments) before it lands in audit logs, and respect each source’s robots.txt and data-use terms upstream in ingestion.

Deployment Checklist

Related

Core Architecture & Catalog Matching Fundamentals — the parent architecture this resolver plugs into, defining stage isolation and the matching half of the platform.
Building a Unified Product Catalog Schema — the canonical schema whose category_path field receives resolved categories.
Fuzzy Matching Algorithms for SKU Alignment — the similarity techniques reused by the probabilistic fallback tier.
Price Hierarchy & Rule-Based Fallback Routing — the downstream consumer that reprices against resolved categories.
Data Normalization & Promo Parsing Pipelines — the upstream cleansing stage that hands clean breadcrumbs to this resolver.
