API Fallback & Official Data Source Integration

In production price monitoring, treating a competitor’s storefront as the only source of truth is a liability: DOM parsing breaks on every template change, invites anti-bot escalation, and carries real legal exposure. A resilient ingestion design instead prefers an official data source — a vendor pricing API, partner feed, or affiliate catalog — and reserves rendering-based scraping as a deterministic fallback. This guide details how to build that dual-path pipeline as a single stage under Scraping & Data Ingestion Workflows, with both paths converging on one validated contract before anything reaches the pricing database. It sits alongside Configuring Headless Browsers for Dynamic Pricing, which owns the fallback’s rendering layer, and Async Data Pipelines with Python & Scrapy, which carries the normalized output downstream.

The orchestrator routes each request to the official API first; a tripped breaker or recurring fault diverts to the ToS-gated scrape fallback, while a contract violation is quarantined. Both successful paths converge on one validated canonical contract before the pricing database.

Problem Framing & Prerequisites

Without an explicit fallback architecture, a price feed has exactly one point of failure per vendor. When a retailer ships a frontend redesign, all selector-based extraction for that vendor goes dark simultaneously, and the pricing team discovers it only when repricing acts on stale numbers. Conversely, a pipeline that scrapes a vendor who does publish an API is paying rendering cost, accepting anti-bot risk, and tolerating DOM-mutation fragility for data it could have fetched cleanly. The job of this stage is to make the source a runtime decision, not a hardcoded assumption, so each vendor is read by the cheapest, most reliable path currently available.

Three upstream contracts are non-negotiable before this stage can do its job. First, the source must be decoupled into interchangeable strategies: API ingestion and rendering-based scraping operate as independent stages that share one output schema, each with its own connection pool, retry budget, and credential scope, so a failure cascade on one path never contaminates the other. Second, the fallback path depends on the rendering layer already being correct — explicit wait conditions, network interception, and anti-fingerprinting are owned by Configuring Headless Browsers for Dynamic Pricing, and this stage only invokes it, never reimplements it. Third, the stage must be a stateless transformation with an explicit input/output contract: it consumes a fetch request (vendor_id, sku, region) and emits a validated, canonical record, with no shared mutable state between the two paths.

The minimal contract a vendor declares to the orchestrator, validated with Pydantic before any fetch runs:

from pydantic import BaseModel, HttpUrl
from typing import Optional
from enum import Enum

class SourceKind(str, Enum):
    rest = "rest"
    graphql = "graphql"
    scrape = "scrape"

class VendorSource(BaseModel):
    vendor_id: str
    region: str                          # ISO-3166, drives currency + locale
    kind: SourceKind                     # primary path declaration
    endpoint: Optional[HttpUrl] = None   # None for scrape-only vendors
    auth_ref: Optional[str] = None       # secrets-manager key, never an inline token
    rate_limit_rps: float = 2.0          # vendor SLA ceiling
    fallback_after: int = 3              # consecutive failures before scraping
    scrape_allowed: bool = True          # ToS-derived, gates the fallback path

Validating this descriptor at load time means a vendor with scrape_allowed=False and a dead API surfaces as a configuration error, not a silent ToS violation at 3 a.m. The same canonical output schema is consumed downstream by the catalog matcher in Core Architecture & Catalog Matching Fundamentals, so the contract emitted here must be stable regardless of which path produced it.
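
The load-time check described above might be sketched as follows. This is a minimal, stdlib-only illustration rather than the Pydantic model itself: `SourceConfig` and `config_errors` are hypothetical names, and the two rules shown are assumptions about what "routable" means for a descriptor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceConfig:
    """Hypothetical stdlib stand-in for the VendorSource fields the check needs."""
    vendor_id: str
    kind: str                        # "rest" | "graphql" | "scrape"
    endpoint: Optional[str] = None   # None for scrape-only vendors
    scrape_allowed: bool = True      # ToS-derived gate


def config_errors(src: SourceConfig) -> list[str]:
    """Return load-time problems; an empty list means the descriptor is routable."""
    errors = []
    if src.kind in ("rest", "graphql") and not src.endpoint:
        errors.append("api source declared without an endpoint")
    if src.kind == "scrape" and not src.scrape_allowed:
        errors.append("scrape-only vendor but scraping is not permitted")
    return errors
```

Running a check like this when configuration is loaded, and refusing to start on a non-empty result, is one way to turn an unroutable vendor into a deploy-time failure.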

Algorithm or Architecture Detail

The core is a routing orchestrator implementing the strategy pattern: it selects a fetch strategy per request, evaluates the result against the contract, and decides whether to escalate to the fallback. The decision is deterministic and side-effect-free with respect to downstream state — a fallback transition mutates nothing the pricing database can see until a record passes validation.

The orchestrator evaluates source priority at the task level. An authenticated official endpoint receives primary routing. If the endpoint returns 4xx/5xx status codes or authentication failures, or breaches the configurable fallback_after threshold (for example, three consecutive timeouts), the orchestrator transitions to the scraping fallback. A payload that violates the contract is the exception: it is quarantined rather than retried through the fallback. Each path returns the same envelope, so the convergence logic downstream never branches on provenance:

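import asyncio
from dataclasses import dataclass

class TransientApiError(Exception):
    """Recoverable API fault (timeout, 4xx/5xx, auth failure); counts toward the breaker."""

class ContractError(Exception):
    """Payload failed schema validation; quarantined, never sent to the fallback."""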

@dataclass
class FetchResult:
    ok: bool
    payload: dict | None
    path: str            # "api" | "scrape" — provenance for the audit log
    error: str | None = None

class IngestionOrchestrator:
    def __init__(self, source: VendorSource, api, scraper, breaker):
        self.source = source
        self.api = api            # REST/GraphQL strategy
        self.scraper = scraper    # headless strategy
        self.breaker = breaker    # per-vendor circuit breaker

    async def fetch(self, sku: str) -> FetchResult:
        # Primary path: only attempt the API if the breaker is closed.
        if self.source.endpoint and self.breaker.closed:
            try:
                payload = await self.api.fetch(self.source, sku)
                self.breaker.record_success()
                return FetchResult(True, payload, path="api")
            except (TransientApiError, asyncio.TimeoutError) as exc:
                self.breaker.record_failure()       # may trip the breaker
            except ContractError as exc:
                # A schema violation is NOT a transient fault — quarantine it.
                return FetchResult(False, None, "api", error=f"contract:{exc}")

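        # Fallback path: render-and-extract, only where ToS permits it.
        if not self.source.scrape_allowed:
            return FetchResult(False, None, "api", error="no_fallback_allowed")
        try:
            payload = await self.scraper.extract(self.source, sku)
        except Exception as exc:
            # An unresolvable fallback failure is reported, not raised, so the
            # caller can route it to the dead-letter queue.
            return FetchResult(False, None, "scrape", error=f"scrape:{exc}")
        return FetchResult(True, payload, path="scrape")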

Two data-structure choices define the trade-offs. The circuit breaker is a small per-vendor state machine (closed → open → half-open) keyed on vendor and region; it converts a stream of individual failures into a single routing decision, so the pipeline stops hammering a degraded endpoint and pays the cost of the fallback exactly once per outage window rather than per request. The contract envelope keeps provenance (path) attached to the payload all the way to the audit log, which is what later lets an analyst distinguish an API-sourced price from a scraped one during a dispute. Complexity-wise, the API path is dominated by network latency (single round trip plus validation), while the fallback path is an order of magnitude more expensive — full browser hydration — which is precisely why the breaker exists to keep that cost rare.
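
The breaker state machine described above could look roughly like this. It is a sketch under stated assumptions, not the pipeline's actual breaker: it trips on consecutive failures only (the error-rate window is omitted), the injectable `clock` parameter exists purely for testing, and the half-open state does not limit itself to a single in-flight probe.

```python
import time


class CircuitBreaker:
    """Per-vendor breaker: closed -> open -> half-open -> closed."""

    def __init__(self, fallback_after=3, cooldown_s=90.0, clock=time.monotonic):
        self.fallback_after = fallback_after
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None   # None means the breaker has not tripped

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "half-open"   # cool-down elapsed: a probe may go through
        return "open"

    @property
    def closed(self) -> bool:
        # The orchestrator only asks "may I call the API?"; half-open says yes.
        return self.state != "open"

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None   # a successful call or probe restores primary routing

    def record_failure(self) -> None:
        if self.state == "half-open":
            self._opened_at = self._clock()   # failed probe: restart the cool-down
            return
        self._failures += 1
        if self._failures >= self.fallback_after:
            self._opened_at = self._clock()
```

The `closed`, `record_success`, and `record_failure` members match what `IngestionOrchestrator` calls, so an instance keyed on vendor and region could be passed in as `breaker`.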

Official API contract enforcement

Official endpoints return structured JSON, but structure is not correctness. Every payload is validated against a strict model before it can enter the database; a violation is quarantined, never silently coerced. Mandatory price and SKU presence, currency normalization, and type enforcement all happen at the boundary:

from pydantic import BaseModel, field_validator
from decimal import Decimal

class CanonicalPrice(BaseModel):
    vendor_id: str
    sku: str
    region: str
    price: Decimal                 # never float — avoids binary rounding drift
    currency: str                  # ISO-4217, normalized upstream
    in_stock: bool
    captured_at: str               # ISO-8601 UTC
    source_path: str               # "api" | "scrape"

    @field_validator("price")
    @classmethod
    def positive(cls, v: Decimal) -> Decimal:
        if v <= 0:
            raise ValueError("non-positive price")
        return v
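
A short illustration of why the model uses Decimal rather than float. The `to_price` helper is hypothetical, and rounding to two places is an assumption that does not hold for every ISO-4217 currency (some use zero or three minor-unit digits).

```python
from decimal import Decimal, ROUND_HALF_UP


def to_price(raw: str) -> Decimal:
    """Parse a numeric price string and round half-up to two decimal places."""
    return Decimal(raw).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum drifts.
float_sum_is_exact = (0.1 + 0.2 == 0.3)
# Decimal arithmetic on the same literals is exact.
decimal_sum_is_exact = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
```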

Credentials are referenced by auth_ref and resolved through a secrets manager (AWS Secrets Manager, HashiCorp Vault) with rotation hooks — OAuth2, API keys, and HMAC signatures never appear inline. The OAuth 2.0 Authorization Framework (RFC 6749) governs token exchange and scope; adaptive rate limiting (token-bucket with exponential backoff and jitter) respects vendor SLAs and protects IP reputation.
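
The adaptive rate limiting mentioned above can be sketched with a token bucket plus jittered exponential backoff. This is a minimal, single-threaded illustration: `TokenBucket` and `backoff_delay` are hypothetical names, the default base and cap are assumptions, and the "full jitter" variant shown is one common choice rather than the only one.

```python
import random
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_rps` tokens per second."""

    def __init__(self, rate_rps: float, capacity: float, clock=time.monotonic):
        self.rate = rate_rps
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_acquire(self) -> bool:
        now = self._clock()
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def backoff_delay(attempt: int, base_s: float = 0.5, cap_s: float = 30.0, rng=random) -> float:
    """Full-jitter backoff: uniform in [0, min(cap_s, base_s * 2**attempt)]."""
    return rng.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))
```

A caller would typically sleep for `backoff_delay(attempt)` after a throttled or failed request and call `try_acquire()` before each new one, with `rate_rps` taken from the vendor's `rate_limit_rps`.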

Dynamic schema discovery for GraphQL vendors

When a vendor exposes GraphQL, hardcoded queries break silently on schema migrations. Use schema introspection to map available pricing fields, inventory states, and promotional flags at runtime, then build selection sets dynamically against the discovered fields rather than a frozen query string. Cache introspection payloads with a short TTL (15–30 minutes) in a Redis layer, validate the generated query against your Pydantic models, and run a dry-run execution before committing it to production routing so schema drift surfaces as a caught error instead of a column of nulls. This same hidden-endpoint discipline underpins the network-interception approach in Extracting Hidden Price Data from JSON-LD.
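
A rough sketch of building a selection set from introspection results. The `__type(name: ...) { fields { name } }` query is standard GraphQL introspection, but the `Product` type name, the `product(sku: ...)` root field, and the field names in the usage note are hypothetical; `TtlCache` is an in-memory stand-in for the Redis layer.

```python
import time

# Standard introspection query for one type; "Product" is an assumed type name.
INTROSPECTION_QUERY = '{ __type(name: "Product") { fields { name } } }'


def discovered_fields(introspection: dict) -> set[str]:
    """Extract field names from a __type introspection response body."""
    type_info = (introspection.get("data") or {}).get("__type") or {}
    return {field["name"] for field in type_info.get("fields") or []}


def build_query(wanted: list[str], available: set[str], required: set[str]) -> str:
    """Select only fields the schema still exposes; fail loudly on required ones."""
    missing = required - available
    if missing:
        raise ValueError(f"schema drift: missing required fields {sorted(missing)}")
    selection = [name for name in wanted if name in available]
    return "query($sku: ID!) { product(sku: $sku) { " + " ".join(selection) + " } }"


class TtlCache:
    """Tiny in-memory TTL cache standing in for the Redis layer."""

    def __init__(self, ttl_s: float, clock=time.monotonic):
        self.ttl_s = ttl_s
        self._clock = clock
        self._items = {}

    def put(self, key, value) -> None:
        self._items[key] = (self._clock() + self.ttl_s, value)

    def get(self, key):
        expires_at, value = self._items.get(key, (0.0, None))
        return value if self._clock() < expires_at else None
```

With this shape, a vendor renaming `price` to `unitPrice` raises a `ValueError` at query-build time when `price` is listed as required, while an optional field that disappears is simply dropped from the selection.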

Routing Optimization & Fallback Compute Budgeting

The fallback path is the expensive path, so production feasibility hinges on invoking it as rarely as correctness allows. The optimization here is a tiered routing budget that keeps the pipeline’s cost dominated by cheap API round trips.

Circuit breaking over per-request retries. A tripped breaker short-circuits the whole vendor to the fallback (or to a skip) for a cool-down window, collapsing thousands of doomed API attempts into one decision. Half-open probes re-test the endpoint with a single request before restoring primary routing.
Introspection and response caching. GraphQL introspection and slow-changing reference data (category trees, currency tables) are cached so per-request setup cost is amortized, and unchanged ETag/Last-Modified responses are served from cache rather than re-fetched.
Batched and coalesced fetches. Where a vendor API supports multi-SKU queries, the orchestrator coalesces pending requests for the same (vendor_id, region) into a single call, cutting both latency and rate-limit pressure.
Concurrency governed by asyncio semaphores. Both paths run on a non-blocking event loop, but the fallback pool is capped far tighter than the API pool because each headless context costs ~150 MB; the semaphore enforces that the cheap path scales freely while the expensive path stays bounded.

import asyncio

class RoutingBudget:
    def __init__(self, api_concurrency=64, scrape_concurrency=6):
        self.api_sem = asyncio.Semaphore(api_concurrency)
        self.scrape_sem = asyncio.Semaphore(scrape_concurrency)  # deliberately small

    async def run_api(self, coro):
        async with self.api_sem:
            return await coro

    async def run_scrape(self, coro):
        async with self.scrape_sem:      # backpressure on the expensive path
            return await coro

The result is a pipeline whose throughput tracks API capacity, with the fallback acting as a bounded safety valve rather than a parallel firehose. When fallback rendering itself consistently fails or triggers blocking, the rendering guidance in Configuring Headless Browsers for Dynamic Pricing covers proxy failover and challenge handling.
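
The batched and coalesced fetches listed above might be grouped like this before dispatch. It is a sketch only: `coalesce` is a hypothetical helper, and the `max_batch` default is an assumption, since the real limit depends on what each vendor API accepts per call.

```python
from collections import defaultdict


def coalesce(requests, max_batch=50):
    """Group (vendor_id, region, sku) requests into de-duplicated per-vendor batches."""
    grouped = defaultdict(dict)   # dict keys act as an insertion-ordered set of SKUs
    for vendor_id, region, sku in requests:
        grouped[(vendor_id, region)][sku] = None
    batches = []
    for (vendor_id, region), skus in grouped.items():
        sku_list = list(skus)
        # Split oversized groups so no single call exceeds the vendor's batch limit.
        for start in range(0, len(sku_list), max_batch):
            batches.append((vendor_id, region, sku_list[start:start + max_batch]))
    return batches
```

Each returned tuple corresponds to one multi-SKU API call, so duplicate requests for the same SKU cost nothing extra against the rate limit.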

Configuration & Threshold Tuning

Routing thresholds are vendor- and tier-specific. A stable, high-volume vendor with a generous API SLA should fail over slowly and rarely; a flaky endpoint guarding high-value competitor data warrants a tighter breaker and a faster fallback. Calibrate against observed success-rate and latency distributions, not round numbers, and re-check whenever a vendor changes its API tier or rate policy.

Vendor tier	`fallback_after`	Breaker open threshold	Cool-down	API rate (rps)	Cache TTL	Notes
Tier-1 stable API	5	50% errors / 60s	120s	8.0	30 min	Fail over slowly; API is the trusted source
Tier-2 standard	3	40% errors / 30s	90s	4.0	20 min	Default profile for most catalog vendors
Tier-3 flaky API	2	30% errors / 30s	60s	2.0	15 min	Fast fallback; expect frequent scrape paths
GraphQL (introspected)	3	40% errors / 30s	90s	3.0	15 min	Dry-run queries; short TTL absorbs schema drift
Scrape-only (no API)	n/a	n/a	n/a	n/a	10 min	`scrape_allowed=True` mandatory; rendering owns rate

Store these as versioned configuration, never inline constants. A threshold change must bump a version that propagates into the audit log, so a later investigation can reconstruct exactly which routing profile produced a given price. Currency-sensitive vendors should confirm that price-side normalization — currency conversion and exchange-rate sync — has run before any cross-region price is trusted, since a fallback path may capture a localized price in a different currency than the API path reports.
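
One way to hold the table above as versioned configuration is sketched below. The threshold values come from the table; the `RoutingProfile` and `bump` names and the version numbers are hypothetical (`tier1` is set to version 4 only so its identifier matches the `tier1-v4` string used in the audit example).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RoutingProfile:
    name: str
    version: int
    fallback_after: int
    error_rate_open: float   # breaker opens at this error fraction...
    window_s: int            # ...measured over this window
    cooldown_s: int
    api_rps: float
    cache_ttl_s: int

    @property
    def profile_id(self) -> str:
        """Identifier written to the audit log, e.g. 'tier1-v4'."""
        return f"{self.name}-v{self.version}"


PROFILES = {
    "tier1": RoutingProfile("tier1", 4, 5, 0.50, 60, 120, 8.0, 30 * 60),
    "tier2": RoutingProfile("tier2", 1, 3, 0.40, 30, 90, 4.0, 20 * 60),
    "tier3": RoutingProfile("tier3", 1, 2, 0.30, 30, 60, 2.0, 15 * 60),
}


def bump(profile: RoutingProfile, **changes) -> RoutingProfile:
    """Return a new profile with the changes applied and the version incremented."""
    return replace(profile, version=profile.version + 1, **changes)
```

Because the dataclass is frozen, a threshold cannot change without producing a new object, and routing every change through `bump` keeps the version in step with the values.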

Failure Modes & Mitigations

Dual-path ingestion fails in characteristic, repeatable ways. Each has a concrete mitigation that belongs in code, not a runbook footnote.

Silent schema drift on the API path. A vendor adds a nullable field or renames price to unitPrice and the old query returns nulls. Strict Pydantic validation rejects the payload as a ContractError, which quarantines rather than scrapes — a contract failure is a data-quality incident, not a transient fault, and must not silently trigger the fallback.
Currency and locale drift between paths. The API reports a region’s price in one currency while the rendered storefront shows another. Pin currency and region in the canonical contract and reconcile both paths against Data Normalization & Promo Parsing Pipelines before upsert.
Fallback retry storms producing duplicates. A breaker flapping between open and half-open re-fetches the same SKU repeatedly. Idempotent upserts keyed on (vendor_id, sku, region) plus a deduplication guard make retries safe.
DOM mutations degrading the fallback. A template change silently lowers extraction yield on the scrape path. Monitor per-vendor fallback success rate; a sudden drop signals a parser regression handed off to the rendering layer, not a routing bug.

async def safe_ingest(orch: IngestionOrchestrator, sku: str, db) -> str:
    result = await orch.fetch(sku)
    if not result.ok:
        await db.dead_letter(orch.source.vendor_id, sku, reason=result.error)
        return "quarantined"
    try:
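        # Merge provenance into one dict so a payload that already carries
        # source_path cannot raise a duplicate-keyword TypeError.
        record = CanonicalPrice(**{**result.payload, "source_path": result.path})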
    except ValueError as exc:
        await db.dead_letter(orch.source.vendor_id, sku, reason=f"contract:{exc}")
        return "quarantined"
    # Idempotent upsert — identical inputs never create duplicate rows.
    await db.upsert(record, key=("vendor_id", "sku", "region"))
    return record.source_path

Malformed payloads, schema violations, and unresolvable fallback failures all land in a dead-letter queue rather than the pricing database. Route the normalized output through the async broker described in Async Data Pipelines with Python & Scrapy, and expose structured observability: API success rate, fallback activation frequency, extraction latency, and contract-violation counts, with alerts when fallback latency breaches the SLA or data freshness degrades.
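
The per-vendor fallback success rate mentioned above could be tracked with a rolling window, as in this sketch. `FallbackYieldMonitor` is a hypothetical name, and the window size, minimum sample count, and alert threshold are assumed defaults to be tuned per vendor.

```python
from collections import deque


class FallbackYieldMonitor:
    """Rolling success rate for one vendor's scrape path."""

    def __init__(self, window: int = 200, min_samples: int = 50, alert_below: float = 0.8):
        self._outcomes = deque(maxlen=window)   # oldest outcomes drop off automatically
        self.min_samples = min_samples
        self.alert_below = alert_below

    def record(self, ok: bool) -> None:
        self._outcomes.append(bool(ok))

    def success_rate(self):
        """Return the rate over the window, or None until enough samples exist."""
        if len(self._outcomes) < self.min_samples:
            return None
        return sum(self._outcomes) / len(self._outcomes)

    def degraded(self) -> bool:
        rate = self.success_rate()
        return rate is not None and rate < self.alert_below
```

Feeding `record(result.ok)` for every scrape-path result and alerting when `degraded()` turns true gives an early signal of a parser regression, separate from the API-side breaker metrics.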

Compliance & Auditability

Dual-path ingestion makes a deliberate trade-off between data fidelity, infrastructure cost, and legal exposure, and every routing decision must be reconstructable to defend it. Official APIs deliver high-fidelity, legally sanctioned data but carry rate limits, licensing fees, and sometimes incomplete field coverage; the scraping fallback offers broader coverage and real-time DOM visibility but adds compute overhead, anti-bot maintenance, and higher legal risk if vendor terms are violated. The scrape_allowed gate encodes that boundary in configuration so the fallback can never silently engage where a vendor’s ToS forbids it.

The stage writes a deterministic audit record for every fetch: the vendor and SKU, the path taken, the routing-profile version, the breaker state at decision time, and any contract violation. That record is what reconstructs a pricing decision during a regulatory audit or supplier dispute.

audit_record = {
    "vendor_id": "acme_eu",
    "sku": "A2890-256-BLK",
    "region": "DE",
    "path": "api",                 # provenance: api vs scrape
    "routing_profile": "tier1-v4",
    "breaker_state": "closed",
    "contract_ok": True,
    "captured_at": "2026-06-27T09:14:02Z",
}

Maintain a vendor-specific compliance matrix documenting scraping permissions, rate limits, and data-use restrictions; consult legal counsel before scraping restricted or authenticated endpoints. Logs redact or hash any PII, routing profiles are version-controlled so a given price is reproducible across pipeline iterations, and audit records are retained for the full period your jurisdiction requires. Scraping-side compliance — respecting robots.txt, Retry-After, and crawl delays — is enforced by the rendering layer, but this stage inherits the obligation to preserve provenance end to end.
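
Hashing PII before it reaches a log could be sketched as below. The `redact` helper and the `operator_email` field in the usage note are hypothetical; which keys count as PII, and whether a truncated keyed hash is acceptable, are decisions for your own compliance review.

```python
import hashlib
import hmac


def redact(record: dict, pii_keys: set[str], secret: bytes) -> dict:
    """Return a copy of the record with PII values replaced by a keyed hash."""
    out = {}
    for key, value in record.items():
        if key in pii_keys:
            digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
            # A keyed hash keeps values correlatable across logs without exposing them.
            out[key] = f"hmac-sha256:{digest[:16]}"
        else:
            out[key] = value
    return out
```

For example, `redact({"vendor_id": "acme_eu", "operator_email": "ops@example.test"}, {"operator_email"}, secret)` leaves `vendor_id` intact and replaces the email with a stable token; the secret would be resolved through the same secrets manager as `auth_ref`.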

Deployment Checklist

A production price feed is an exercise in disciplined routing, not just extraction scripts. By isolating the two paths, breaking the circuit on a degraded API, enforcing one contract regardless of provenance, and gating the fallback on documented permissions, retail tech teams deliver reliable, legally defensible competitor intelligence at scale.

Related

Scraping & Data Ingestion Workflows — the parent guide that frames how every ingestion stage feeds the price feed.
Configuring Headless Browsers for Dynamic Pricing — the rendering layer this stage invokes on the fallback path.
Async Data Pipelines with Python & Scrapy — carries the normalized output downstream through the broker.
Handling Infinite Scroll & Pagination Logic — the navigation strategy the fallback path relies on for catalog-scale pages.
Data Normalization & Promo Parsing Pipelines — reconciles currency, tax, and promo structure so both paths converge on one canonical price.
