Handling Infinite Scroll & Pagination Logic

In e-commerce price monitoring and competitor intelligence, product catalogs almost never expose a complete dataset in a single HTTP response. Pagination and infinite scroll exist to bound payload size and client-side rendering cost, but for automated ingestion they turn catalog traversal into a stateful problem: every additional product page is gated behind a cursor, an offset, or a scroll-triggered fetch. This guide covers how to navigate both mechanisms reliably as one stage of the Scraping & Data Ingestion Workflows pipeline. It sits directly downstream of the rendering layer in Configuring Headless Browsers for Dynamic Pricing and feeds the broker-backed stages described in Async Data Pipelines with Python & Scrapy, so traversal must emit deterministic, deduplicated page payloads without leaking browser-session complexity downstream.

Problem Framing & Prerequisites

Without a disciplined traversal stage, a catalog scrape silently captures whatever fraction of products the frontend happened to render before the script stopped. Offset drift, a misfired scroll, or a single timed-out fetch leaves gaps that surface much later as missing price drops or phantom out-of-stock signals — the kind of error that is invisible until repricing acts on it. The job of this stage is to walk a vendor’s full result set exactly once, in a resumable way, and hand off a clean stream of product nodes to parsing.

Pagination logic must be strictly decoupled from parsing, enrichment, and storage. Treat URL/cursor generation, page traversal, and payload extraction as discrete micro-stages. This isolation prevents a DOM parsing failure or malformed JSON from cascading into queue exhaustion, runaway memory growth, or corrupted checkpoints. Three upstream contracts must hold before traversal runs:

A rendering capability it can call but does not own. Explicit wait conditions, network interception, and anti-fingerprinting belong to Configuring Headless Browsers for Dynamic Pricing; this stage invokes a ready browser context and never reimplements stealth or launch logic.
A durable state store. Cursor position, last offset, and processed-node hashes must persist in a key-value store (Redis, or a state table) so a crash resumes mid-catalog instead of restarting.
An output contract. Traversal emits raw product nodes (vendor_id, page_token, node_payload) and nothing else — currency, tax, and promo reconciliation happen later in Data Normalization & Promo Parsing Pipelines.
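The state and output contracts above can be pinned down as two small records. This is a minimal sketch, assuming the checkpoint is stored as a JSON string; the class names RawNode and Checkpoint and the exact checkpoint layout are illustrative, and only vendor_id, page_token, and node_payload come from the contract itself.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Any, Optional


@dataclass(frozen=True)
class RawNode:
    """Output contract: the only shape traversal hands to parsing."""
    vendor_id: str
    page_token: str               # cursor, offset, or scroll round that produced the node
    node_payload: dict[str, Any]  # untouched vendor payload; no normalization here


@dataclass
class Checkpoint:
    """Resumable traversal state (hypothetical layout for a key-value store)."""
    vendor_id: str
    cursor: Optional[str] = None
    last_offset: int = 0
    processed_hashes: list[str] = field(default_factory=list)

    def dumps(self) -> str:
        # sort_keys keeps the stored value byte-stable across runs
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def loads(cls, raw: str) -> "Checkpoint":
        return cls(**json.loads(raw))


# Round-trip through the string form a store such as Redis would hold.
saved = Checkpoint("acme_eu", cursor="abc123", last_offset=200).dumps()
restored = Checkpoint.loads(saved)
```

Keeping the checkpoint as one serialized value means a resume needs a single read, at the cost of rewriting the whole record on every update.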

Align the traversal model with your concurrency model up front. Offset-based pagination maps cleanly to parallel HTTP requests; cursor-driven and infinite-scroll architectures demand sequential state tracking and session persistence. Choosing the wrong concurrency shape for the pagination type is the most common source of duplicate and missed rows.

Pick the traversal strategy from the source's shape: an exposed API parallelizes cleanly, numbered HTML pages template into a URL fan-out, and only the remainder needs the sequential headless scroll loop. All three converge on one dedup-and-checkpoint boundary before parsing.
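The routing rule above fits in a few lines. The sketch below assumes two boolean capability flags per source; the names Strategy, pick_strategy, has_api, and has_numbered_pages are hypothetical and would come from whatever source-profiling step the pipeline already has.

```python
from enum import Enum


class Strategy(Enum):
    API = "api"
    OFFSET = "offset"
    SCROLL = "scroll"


def pick_strategy(has_api: bool, has_numbered_pages: bool) -> Strategy:
    """Route a source to the cheapest traversal that can cover it."""
    if has_api:
        return Strategy.API       # exposed endpoint: parallelizes cleanly
    if has_numbered_pages:
        return Strategy.OFFSET    # numbered HTML pages: URL-template fan-out
    return Strategy.SCROLL        # remainder: sequential headless scroll loop
```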

Algorithm & Architecture Detail

Prefer an API-first traversal

RESTful storefront endpoints usually expose pagination via query parameters (?page=, ?offset=, or ?cursor=). For production traversal, always prefer cursor- or token-based pagination over offsets. Offsets degrade under concurrent writes, inventory shifts, and dynamic sorting, producing duplicate SKUs or skipped price updates as the underlying result set reorders between requests. Extract pagination metadata from response headers (X-Total-Pages, Link: rel="next") or the embedded JSON payload before issuing the next request, and parse the Link header per the RFC 8288 Web Linking specification rather than hand-rolling string splits.

import httpx

async def traverse_cursor_api(client: httpx.AsyncClient, url: str,
                              max_pages: int = 500):
    """Walk a cursor-paginated JSON endpoint. Yields product nodes once each."""
    cursor, pages = None, 0
    while pages < max_pages:                 # hard depth guardrail
        params = {"limit": 100}
        if cursor:
            params["cursor"] = cursor
        resp = await client.get(url, params=params, timeout=20.0)
        resp.raise_for_status()
        body = resp.json()

        for node in body.get("products", []):
            yield node

        page_info = body.get("pageInfo", {})
        if not page_info.get("hasNextPage"):
            break                            # clean termination
        cursor = page_info["endCursor"]
        pages += 1
    else:
        raise RuntimeError(f"max_pages={max_pages} hit — suspect hasNextPage loop")

The max_pages guardrail is mandatory: a misconfigured or adversarial hasNextPage flag that never flips false will otherwise pin a worker forever. When a storefront fronts its data with GraphQL, use schema introspection to map pageInfo, hasNextPage, and endCursor directly into the request orchestrator and request only price, SKU, and availability fields. This eliminates brittle DOM scraping and cuts network overhead. Where native APIs are restricted or rate-limited, the fallback should still prefer structured extraction — for example reading embedded JSON-LD via Extracting Hidden Price Data from JSON-LD — over raw HTML parsing, in line with the routing rules in API Fallback & Official Data Source Integration.
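For the GraphQL case, a request body that asks only for the fields traversal needs might look like the sketch below. The schema is an assumption: the products connection and the sku, price, and availability field names are placeholders to be replaced with whatever introspection reveals for a given storefront.

```python
import json
from typing import Optional

# Hypothetical schema: field and connection names are placeholders,
# not taken from any real storefront API.
PRODUCTS_QUERY = """
query Products($first: Int!, $after: String) {
  products(first: $first, after: $after) {
    nodes { sku price availability }
    pageInfo { hasNextPage endCursor }
  }
}
"""


def build_graphql_payload(cursor: Optional[str], page_size: int = 100) -> str:
    """Serialize one page request; `after` is null on the first page."""
    return json.dumps({
        "query": PRODUCTS_QUERY,
        "variables": {"first": page_size, "after": cursor},
    })
```

The returned string can be posted as the request body, and the endCursor from each response feeds the next call, mirroring the cursor loop shown earlier.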

Headless execution & infinite-scroll mechanics

When pagination is entirely client-side and triggered by viewport scroll events, you must transition to a headless browser execution model. Traversal then requires careful orchestration of viewport dimensions, scroll cadence, and DOM-mutation observation. The cardinal rule is to wait on signals, not on sleep(): tie progress to network-idle states or to the insertion of a specific DOM element, never to arbitrary delays that are simultaneously too slow on a fast connection and too fast on a slow one.

Modern storefronts trigger lazy-loaded product cards through the Intersection Observer API; the most robust strategy is to detect when the loader sentinel enters the viewport and fetch the next batch precisely when the frontend expects it. For observer thresholds and callback behaviour, the MDN Intersection Observer reference is authoritative. Avoid aggressive window.scrollTo() loops; dispatch controlled scroll steps and re-evaluate the DOM after each one.

from playwright.async_api import Page, TimeoutError as PlaywrightTimeoutError

async def scroll_until_settled(page: Page, item_selector: str,
                               sentinel_selector: str,
                               max_rounds: int = 200,
                               idle_rounds: int = 3) -> int:
    """Scroll-load a catalog until the item count stops growing.

    Returns the final rendered item count. `idle_rounds` guards against
    a slow loader: we only stop after N consecutive rounds with no growth.
    """
    seen, stable, rounds = 0, 0, 0
    while rounds < max_rounds and stable < idle_rounds:
        await page.locator(sentinel_selector).scroll_into_view_if_needed()
        # wait on the network, not a fixed sleep
        try:
            await page.wait_for_load_state("networkidle", timeout=8000)
        except PlaywrightTimeoutError:
            pass                     # Playwright raises its own TimeoutError, not the builtin
        count = await page.locator(item_selector).count()
        stable = stable + 1 if count == seen else 0
        seen, rounds = count, rounds + 1
    return seen

Two thresholds carry the logic: max_rounds caps total work so a never-terminating feed cannot hang a worker, and idle_rounds distinguishes a genuinely exhausted catalog from a loader that is merely slow. Simulating human-like scroll cadence reduces anti-bot flagging but introduces non-deterministic latency — which is exactly why the stop condition is count-stability rather than a round count alone.

Candidate Generation & Compute Optimization

Traversal is the cheapest place to eliminate redundant work, and the most expensive place to get deduplication wrong. Infinite-scroll DOMs frequently re-render overlapping product cards during rapid viewport transitions, and offset endpoints re-serve rows when the result set shifts. Deduplicate at the traversal boundary using a deterministic key so duplicates never reach parsing or storage:

import hashlib

def node_key(vendor_id: str, sku: str, price_point: str) -> str:
    raw = f"{vendor_id}|{sku}|{price_point}".encode()
    return hashlib.blake2b(raw, digest_size=16).hexdigest()

async def dedup_stream(nodes, seen: set[str], vendor_id: str):
    async for node in nodes:     # nodes is an async iterator, e.g. traverse_cursor_api(...)
        key = node_key(vendor_id, node["sku"], str(node["price"]))
        if key in seen:          # already emitted this exact observation
            continue
        seen.add(key)
        yield node

Hold the seen set per traversal run; for catalogs too large to keep in memory, back it with a Redis set or a Bloom filter sized to the expected node count and tolerate a small false-positive rate. Persist traversal state — last cursor, offset, or scroll round — to the same store so a crashed run resumes idempotently: if a worker dies mid-catalog it should restart from the last checkpoint without re-fetching processed nodes. This checkpoint contract is what lets the broker-backed workers in Optimizing Scrapy for 10k SKUs per Hour retry safely.
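Where the seen set outgrows memory, a Bloom filter is one of the options named above. The following is a minimal in-process sketch using the standard sizing formulas; the class name and the default false-positive rate are assumptions, and a production deployment would more likely use a shared store than this single-process structure.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, expected_items: int, fp_rate: float = 0.001):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes
        self.size = max(8, math.ceil(
            -expected_items * math.log(fp_rate) / math.log(2) ** 2))
        self.hashes = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, key: str):
        # Double hashing: derive k positions from two 64-bit halves of one digest
        digest = hashlib.blake2b(key.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1
        for i in range(self.hashes):
            yield (h1 + i * h2) % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

Because a false positive here means a genuinely new node is dropped as a duplicate, the acceptable rate should be set against how many missed observations the downstream stages can tolerate.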

The largest compute saving is structural: parallelize only what is safe to parallelize. Offset and page-number traversal fan out across a worker pool because each page URL is independent. Cursor and scroll traversal are inherently sequential per catalog — the next request needs the previous response — so scale them by sharding across vendors or categories, not within a single result set. A tiered crawl strategy keeps the headless budget bounded: run high-velocity API traversal for top-revenue SKUs and reserve the expensive scroll loop for the minority of vendors that genuinely require it.
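The safe-to-parallelize case, page-number traversal, can be sketched as a URL-template expansion plus a bounded gather. The function names and the injected fetch callable are hypothetical; the point is only that each page URL is independent, so a semaphore is enough to cap concurrency.

```python
import asyncio
from typing import Awaitable, Callable


def page_urls(template: str, last_page: int) -> list[str]:
    """Expand a URL template such as '.../shoes?page={page}' into page URLs."""
    return [template.format(page=n) for n in range(1, last_page + 1)]


async def fan_out(urls: list[str],
                  fetch: Callable[[str], Awaitable[dict]],
                  concurrency: int = 8) -> list[dict]:
    """Fetch independent pages in parallel, never more than `concurrency` at once."""
    gate = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> dict:
        async with gate:
            return await fetch(url)

    # gather preserves input order, so results line up with `urls`
    return await asyncio.gather(*(bounded(u) for u in urls))
```

The same helper does not apply to cursor or scroll traversal, where each request depends on the previous response; those scale by running one sequential loop per vendor or category.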

Configuration & Threshold Tuning

Traversal thresholds are not universal constants; they are tuned per source class and recalibrated against ground-truth samples (a known full catalog count for a few vendors). Start from the table below and tighten when you observe duplicate inflation or anti-bot escalation, relax when you see truncated catalogs.

Parameter	API / offset source	Infinite-scroll source	Notes
`page_size` / `limit`	100–250	n/a (DOM-driven)	Larger pages cut request count but raise per-response parse cost and ban risk.
`max_pages` / `max_rounds`	500	150–250	Hard depth guardrail; set to ~1.5× the largest observed catalog.
`idle_rounds`	n/a	3	Consecutive no-growth rounds before declaring the catalog exhausted.
`networkidle` timeout	n/a	6–10 s	Per scroll step; lower for fast feeds, raise for image-heavy grids.
Request concurrency	8–16 / vendor	1 / catalog	Scroll is sequential per result set; parallelize across vendors instead.
Base inter-request delay	250–750 ms	800–1500 ms	Add full jitter; never a fixed delay.
Dedup key TTL	per-run	per-run	Persist for the catalog pass; clear on a fresh full crawl.
Resume checkpoint interval	every page	every 10 rounds	Trade-off between write cost and re-fetch cost on crash.

Calibrate max_pages and idle_rounds against a vendor whose true product count you can verify from a sitemap or category counter. If traversal terminates well below the known count, idle_rounds is too aggressive or the networkidle timeout is starving the loader; if it never terminates, the hasNextPage/sentinel signal is unreliable and you should switch that vendor to URL-template pagination or an API path.
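The calibration rule above can be expressed as a small classifier over one run. This is a sketch only: the function name, the result labels, and the 2% tolerance are assumptions, and known_count is expected to be a verified, non-zero catalog size.

```python
def diagnose_run(emitted: int, known_count: int, hit_guardrail: bool,
                 tolerance: float = 0.02) -> str:
    """Classify a calibration run against a verified catalog size."""
    if hit_guardrail:
        # never terminated on its own: the hasNextPage/sentinel signal is suspect
        return "unreliable_termination_signal"
    coverage = emitted / known_count
    if coverage < 1 - tolerance:
        return "truncated"            # raise idle_rounds or the networkidle timeout
    if coverage > 1 + tolerance:
        return "duplicate_inflation"  # tighten dedup or lower concurrency
    return "ok"
```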

Failure Modes & Mitigations

Offset drift / duplicate SKUs. Concurrent inventory writes reorder offset result sets between requests. Mitigation: prefer cursors; where only offsets exist, snapshot a stable sort key (e.g. created_at,id) and deduplicate on the node_key above.
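hasNextPage / sentinel that never terminates. A bugged or adversarial flag pins a worker indefinitely. Mitigation: the max_pages/max_rounds guardrail bounds the loop (traverse_cursor_api raises, scroll_until_settled returns once max_rounds is reached), and the run is recorded as terminated by the guardrail and quarantined for inspection.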
Loader slower than the scroll loop. Stopping after one no-growth round truncates the catalog. Mitigation: require idle_rounds consecutive stable counts and wait on networkidle, not sleep().
DOM/selector drift mid-crawl. Storefronts refactor templates and break CSS/XPath selectors without warning. Mitigation: prefer structured extraction (embedded JSON-LD) over fragile selectors via Extracting Hidden Price Data from JSON-LD, and fail soft to a fallback parser instead of aborting the whole job.
Anti-bot escalation (CAPTCHA / 429 / IP reputation). Aggressive traversal trips rate limits or challenge pages such as Cloudflare Turnstile. Mitigation: parse Retry-After, honour HTTP 429 with exponential backoff and full jitter, and route challenged sessions through the approach in Bypassing Cloudflare Turnstile with Playwright.
Phantom price points from overlapping cards. Rapid re-render emits the same product at a transient mid-load price. Mitigation: dedup on (vendor_id, sku, price) to drop exact repeats, then push the surviving observations, including any transient price that forms a distinct key, to Statistical Outlier Detection for Price Data before they influence repricing.

import asyncio, random

async def fetch_with_backoff(client, url, params, attempts=5):
    for i in range(attempts):
        resp = await client.get(url, params=params, timeout=20.0)
        if resp.status_code == 429:
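            try:
                # Retry-After may be delta-seconds or an HTTP-date
                wait = float(resp.headers.get("Retry-After", ""))
            except ValueError:
                wait = 2 ** i                # header missing or a date: exponential fallback
            await asyncio.sleep(wait + random.random())   # honour the server, add jitter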
            continue
        resp.raise_for_status()
        return resp
    raise RuntimeError(f"rate-limited after {attempts} attempts: {url}")
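The backoff above adds a sub-second random offset to the server's hint. The full-jitter variant named in the mitigation list instead draws the entire delay uniformly between zero and a capped exponential ceiling. A minimal sketch, with the base and cap values chosen purely for illustration:

```python
import random
from typing import Optional


def full_jitter_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                      rng: Optional[random.Random] = None) -> float:
    """Sleep duration drawn uniformly from [0, min(cap, base * 2**attempt)]."""
    source = rng or random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)
```

Passing a seeded random.Random makes retry timing reproducible in tests, while the default uses the module-level generator.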

When violation thresholds are repeatedly breached for a vendor, an automated emergency pause should halt that vendor’s traversal, flush in-flight requests, and alert operations rather than burning the IP pool — degrade gracefully instead of escalating.
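One way to implement the emergency pause is a per-vendor sliding-window breaker. The sketch below covers only the pause decision; flushing in-flight requests and alerting are left to the caller, and the class name and default thresholds are assumptions rather than recommended values.

```python
import time
from collections import deque
from typing import Optional


class VendorBreaker:
    """Pause a vendor when too many violations land inside a sliding window."""

    def __init__(self, max_violations: int = 5, window_s: float = 300.0,
                 pause_s: float = 1800.0):
        self.max_violations = max_violations
        self.window_s = window_s
        self.pause_s = pause_s
        self._events = deque()        # timestamps of recent violations
        self._paused_until = 0.0

    def record_violation(self, now: Optional[float] = None) -> bool:
        """Register a 429 or challenge page; return True if this trips the pause."""
        now = time.monotonic() if now is None else now
        self._events.append(now)
        # drop violations that have aged out of the window
        while self._events and now - self._events[0] > self.window_s:
            self._events.popleft()
        if len(self._events) >= self.max_violations:
            self._paused_until = now + self.pause_s
            self._events.clear()
            return True
        return False

    def allows(self, now: Optional[float] = None) -> bool:
        """True when the vendor is not currently paused."""
        now = time.monotonic() if now is None else now
        return now >= self._paused_until
```

A traversal worker would call allows() before each request and record_violation() on every 429 or CAPTCHA response, stopping that vendor's loop as soon as the breaker trips.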

Compliance & Auditability

Automated catalog traversal sits at the intersection of technical execution and platform terms, so provenance is not optional. Respect robots.txt directives and any declared crawl delay, honour Retry-After, and maintain a per-vendor permission matrix recording allowed paths, rate limits, and data-use restrictions; consult legal counsel before traversing authenticated or explicitly restricted endpoints. Trade-offs between data freshness and access preservation are unavoidable — high-frequency polling yields real-time competitor pricing but raises infrastructure cost and ban risk, while conservative intervals preserve access at the cost of missing flash sales — so make the polling tier a documented, versioned decision per vendor rather than a hardcoded default.
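The robots.txt check can be done with the Python standard library before a vendor's traversal is scheduled. The sketch below parses an inline sample so it runs offline; the user-agent string, sample rules, and URLs are made up for illustration, and a real run would fetch the vendor's actual robots.txt first.

```python
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


SAMPLE = """\
User-agent: *
Crawl-delay: 2
Disallow: /checkout/
"""

AGENT = "price-monitor-bot"   # hypothetical user-agent token
rules = build_robots(SAMPLE)
allowed = rules.can_fetch(AGENT, "https://shop.example.com/category/shoes?page=2")
blocked = rules.can_fetch(AGENT, "https://shop.example.com/checkout/cart")
delay = rules.crawl_delay(AGENT)   # seconds, or None when not declared
```

The declared crawl delay, when present, can serve as a floor for the base inter-request delay for that vendor.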

Write a deterministic audit record for every traversal run so any captured price is reproducible during a supplier dispute or regulatory review:

audit_record = {
    "vendor_id": "acme_eu",
    "strategy": "scroll",            # api | offset | scroll
    "pages_or_rounds": 142,
    "nodes_emitted": 3187,
    "duplicates_dropped": 219,
    "terminated": "idle_rounds",     # idle_rounds | hasNextPage | guardrail
    "robots_checked": True,
    "config_version": "tier1-v4",
    "captured_at": "2026-06-27T09:14:02Z",
}

Version the traversal configuration so a given price observation is reproducible across pipeline iterations, redact or hash any PII that appears incidentally in payloads, and retain audit records for the period your jurisdiction requires. Because traversal emits raw nodes, it carries no pricing judgement of its own — its compliance duty is to preserve exactly which path produced which node.
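To make an audit record tamper-evident and to hash incidental PII, a canonical serialization plus a digest is usually enough. This is a sketch under assumptions: the function names are hypothetical, and the salted truncated hash is one possible redaction scheme, not a compliance recommendation.

```python
import hashlib
import json


def audit_digest(record: dict) -> str:
    """Stable SHA-256 over a canonical JSON form of the audit record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def redact(value: str, salt: str) -> str:
    """Replace an incidental PII string with a salted, truncated hash."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]
```

Sorting keys before hashing means two records with the same content produce the same digest regardless of how the dictionary was built.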

Deployment Checklist

Navigating pagination and infinite scroll reliably is a discipline of bounded, resumable, deduplicated traversal — not a scroll script. By preferring cursor APIs, gating scroll loops on stability signals, checkpointing state, and respecting rate and permission boundaries, retail tech teams capture complete catalogs at scale without sacrificing data integrity or access.

Related

Scraping & Data Ingestion Workflows: the parent guide that frames how this traversal stage fits the end-to-end ingestion pipeline.
Configuring Headless Browsers for Dynamic Pricing: the rendering layer this stage invokes for scroll-driven catalogs.
API Fallback & Official Data Source Integration: routing rules for preferring an API path over scroll traversal.
Extracting Hidden Price Data from JSON-LD: selector-resilient extraction that survives template drift.
Optimizing Scrapy for 10k SKUs per Hour: the concurrency model that consumes deduplicated traversal output downstream.
