Extracting Hidden Price Data from JSON-LD

The visible price on a modern retail page is the worst place to read a number from: it is reflowed by CSS frameworks, swapped by A/B overlays, and split across lazy-loaded spans. The same page almost always ships the canonical figure in a <script type="application/ld+json"> block — a machine-readable schema.org/Product payload that search engines and affiliate networks consume. This page solves one focused problem: given a rendered product page, reliably extract and normalize the price, priceCurrency, availability, and priceValidUntil fields out of its JSON-LD. It is a hands-on companion to the parent guide on Configuring Headless Browsers for Dynamic Pricing, and it assumes the page has already loaded cleanly — if a challenge interstitial stands in the way, clear it first with Bypassing Cloudflare Turnstile with Playwright. Once a clean payload is in hand, the extracted string is handed straight to normalization — never parsed into a float here.

Prerequisites & Input Contract

This recipe assumes a Python 3.11+ environment with Playwright as the rendering engine, because most storefronts inject JSON-LD asynchronously via Next.js/Nuxt streaming or React/Vue hydration, and a static HTTP fetch returns an empty or placeholder <script> block. Pin the versions — selector timing and networkidle semantics drift between Playwright releases.

playwright==1.44.0          # Chromium 124 bundled; run `playwright install chromium`
python>=3.11
# Standard library only for parsing: json, re, hashlib, decimal — no third-party JSON-LD lib needed.

The function contract is deliberately narrow. The caller supplies a target URL; the extractor returns a list of structured offers, one per valid Offer found, so the orchestration layer can apply buy-box selection downstream.

Field	Direction	Type	Notes
`url`	in	`str`	Fully-qualified product page URL.
`timeout_ms`	in	`int`	Per-navigation budget; default `15000`.
`price_raw`	out	`str` \| `None`	Raw `offers.price` string — never pre-parsed to float.
`currency`	out	`str`	ISO 4217 code, e.g. `USD`; `UNKNOWN` if absent.
`availability`	out	`str`	Normalized enum: `IN_STOCK`, `OUT_OF_STOCK`, `PRE_ORDER`, `UNKNOWN`.
`valid_until`	out	`str` \| `None`	ISO 8601 `priceValidUntil`, for flash-sale decay tracking.
`payload_hash`	out	`str`	SHA-256 of the cleaned payload, for dedup and audit.

The canonical price path is $.offers.price, but enterprise implementations nest offers as an array, attach a priceSpecification object, or list multiple merchant offers under one Product. When several JSON-LD tags exist, prioritize the one whose "@type" is "Product"; the rest are usually breadcrumbs, AggregateRating, or Organization metadata with no pricing fields. Validate field names against the official Schema.org Product Vocabulary.

Step-by-Step Implementation

Step 1 — Render the page and wait for the payload to attach. Navigate, then wait specifically for the JSON-LD selector rather than trusting networkidle alone, because hydration frequently injects the script after the network goes quiet.

import hashlib
import json
import re
from typing import List, Dict, Any
from playwright.sync_api import sync_playwright

def fetch_jsonld_payloads(url: str, timeout_ms: int = 15000) -> List[str]:
    """Render the page and return the raw innerText of every JSON-LD script tag."""
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
                "--disable-dev-shm-usage",
                "--disable-gpu",
            ],
        )
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
            ),
            locale="en-US",
            timezone_id="America/New_York",
        )
        page = context.new_page()
        try:
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            # Block until at least one JSON-LD tag is present, even if hydration is late.
            page.wait_for_selector('script[type="application/ld+json"]', timeout=timeout_ms)
            scripts = page.query_selector_all('script[type="application/ld+json"]')
            return [s.inner_text() for s in scripts]
        finally:
            browser.close()

Step 2 — Sanitize before deserializing. Retail CMS platforms emit JSON-LD that real parsers reject: CDATA wrappers, trailing commas, and stray comment markers. Clean the string first, then parse.

AVAILABILITY_MAP = {
    "instock": "IN_STOCK", "in_stock": "IN_STOCK",
    "outofstock": "OUT_OF_STOCK", "soldout": "OUT_OF_STOCK",
    "preorder": "PRE_ORDER", "backorder": "PRE_ORDER",
    "discontinued": "OUT_OF_STOCK",
}

def _clean(raw: str) -> str:
    """Strip CDATA wrappers and trailing commas that break json.loads()."""
    cleaned = re.sub(r"<!--\s*CDATA\[\s*|\s*\]\s*-->", "", raw).strip()
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned)   # trailing comma before } or ]
    return cleaned

def _norm_availability(value: str) -> str:
    token = (value or "").rsplit("/", 1)[-1].lower().replace("-", "")
    return AVAILABILITY_MAP.get(token, "UNKNOWN")

Step 3 — Walk Product/Offer and emit normalized records. Handle both single-object and array-wrapped graphs, and both single-Offer and multi-Offer products. Keep the price as a raw string — float conversion belongs in the normalization stage, not here.

def parse_offers(payloads: List[str]) -> List[Dict[str, Any]]:
    """Extract one normalized record per valid Offer across all payloads."""
    results: List[Dict[str, Any]] = []
    for raw in payloads:
        cleaned = _clean(raw)
        try:
            data = json.loads(cleaned)
        except json.JSONDecodeError:
            continue  # log raw payload to object storage for offline forensics

        items = data if isinstance(data, list) else data.get("@graph", [data])
        for item in items:
            if item.get("@type") != "Product":
                continue
            offers = item.get("offers", [])
            offers = [offers] if isinstance(offers, dict) else offers
            payload_hash = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()

            for offer in offers:
                # priceSpecification overrides a top-level price when present.
                spec = offer.get("priceSpecification") or {}
                price_raw = offer.get("price") or spec.get("price")
                seller = offer.get("seller")
                results.append({
                    "price_raw": str(price_raw) if price_raw is not None else None,
                    "currency": offer.get("priceCurrency")
                                or spec.get("priceCurrency", "UNKNOWN"),
                    "availability": _norm_availability(offer.get("availability", "")),
                    "valid_until": offer.get("priceValidUntil"),
                    "seller": seller.get("name") if isinstance(seller, dict) else None,
                    "payload_hash": payload_hash,
                })
    return results

Step 4 — Compose and inspect the output.

def extract_jsonld_pricing(url: str, timeout_ms: int = 15000) -> List[Dict[str, Any]]:
    return parse_offers(fetch_jsonld_payloads(url, timeout_ms))

A page with two competing marketplace offers yields a list the buy-box selector can rank:

>>> extract_jsonld_pricing("https://example-retailer.test/p/widget")
[{'price_raw': '129.99', 'currency': 'USD', 'availability': 'IN_STOCK',
  'valid_until': '2026-07-04', 'seller': 'Direct', 'payload_hash': 'a1b2…'},
 {'price_raw': '134.50', 'currency': 'USD', 'availability': 'IN_STOCK',
  'valid_until': None, 'seller': 'ThirdPartyCo', 'payload_hash': 'a1b2…'}]

The raw price_raw strings flow on to Converting Multi-Currency Prices to Base Currency, which owns locale separator resolution and fixed-point arithmetic.

Verification & Testing

JSON-LD extraction fails silently — an empty result looks identical to a genuinely delisted product — so assert the parser against fixed payloads rather than against live sites. These tests exercise the pure parsing path, with no browser required.

import unittest

class TestParseOffers(unittest.TestCase):
    def test_single_offer_product(self):
        payload = '''{"@type":"Product","name":"Widget",
            "offers":{"@type":"Offer","price":"129.99","priceCurrency":"USD",
            "availability":"https://schema.org/InStock"}}'''
        out = parse_offers([payload])
        self.assertEqual(out[0]["price_raw"], "129.99")
        self.assertEqual(out[0]["availability"], "IN_STOCK")

    def test_trailing_comma_is_recovered(self):
        payload = '{"@type":"Product","offers":{"price":"10.00","priceCurrency":"EUR",}}'
        self.assertEqual(parse_offers([payload])[0]["currency"], "EUR")

    def test_multiple_offers_all_extracted(self):
        payload = '''{"@type":"Product","offers":[
            {"price":"129.99","priceCurrency":"USD"},
            {"price":"134.50","priceCurrency":"USD"}]}'''
        self.assertEqual(len(parse_offers([payload])), 2)

    def test_non_product_tag_ignored(self):
        payload = '{"@type":"BreadcrumbList","itemListElement":[]}'
        self.assertEqual(parse_offers([payload]), [])

if __name__ == "__main__":
    unittest.main()

Run with python -m unittest -v. For live monitoring, store the payload_hash per scrape: when the hash is unchanged but your visible-DOM scraper reports a different price, the discrepancy points at a stale render or a client-side overlay, not a real price move.

Edge Cases & Gotchas

Hydration race conditions. If networkidle fires before React/Vue hydration completes, the JSON-LD tag is empty. Fall back to wait_until="domcontentloaded" and wait for a known pricing node before reading scripts:

page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
page.wait_for_selector('[data-testid="price"], [itemprop="price"]', timeout=timeout_ms)

Anti-bot payload stripping. Cloudflare, Akamai, and DataDome may inject placeholder JSON-LD or strip the tag entirely. Compare the observed tag count against an expected baseline and treat a shortfall as a block, not a missing price — route it back through the parent headless-browser guide for re-rendering rather than recording null.

count = page.evaluate(
    "() => document.querySelectorAll('script[type=\"application/ld+json\"]').length")
if count == 0:
    raise RuntimeError("JSON-LD stripped — likely a challenge, retry via stealth path")

Multiple merchant offers. Marketplaces nest several Offer objects under one Product. Extract them all (Step 3 does), then apply buy-box rules — filter by seller.name and currency, prefer IN_STOCK, and break ties on lowest price_raw after currency normalization, never before.
Regional price variance. The same URL returns different priceCurrency depending on Accept-Language and geo-IP. Pin the browser locale and timezone_id (Step 1 does) and route through a fixed-region proxy so a historical series is comparable rather than mixing currencies silently.

Performance Notes

The browser navigation dominates wall-clock time — typically 1–4 seconds per page — while parse_offers is O(n) in total payload length and runs in microseconds. That asymmetry is the whole optimization story: never pay for a full render when a cheaper path exists. Where a storefront seeds JSON-LD from a JSON XHR/Fetch response, intercept that response directly and skip rendering entirely; for category and infinite-scroll pages, the same structured data is regenerated per viewport, so see Handling Infinite Scroll & Pagination Logic before brute-forcing scrolls. At fleet scale, decouple rendering from parsing: queue render jobs through the async architecture in Async Data Pipelines with Python & Scrapy so a slow headless session never blocks the parse workers. When JSON-LD is absent or deliberately obfuscated, stop scraping the DOM and switch to a first-party contract via API Fallback & Official Data Source Integration.

Configuring Headless Browsers for Dynamic Pricing — the parent guide: stealth context, hydration waits, and render budgets this recipe depends on.
Bypassing Cloudflare Turnstile with Playwright — clear the challenge interstitial first, so the JSON-LD tag actually renders.
Converting Multi-Currency Prices to Base Currency — the next stage that turns the raw price_raw string into an exact base-currency number.
API Fallback & Official Data Source Integration — the path to take when a retailer ships no usable JSON-LD at all.

Extracting Hidden Price Data from JSON-LD #

Prerequisites & Input Contract #

Step-by-Step Implementation #

Verification & Testing #

Edge Cases & Gotchas #

Performance Notes #

Related #