Optimizing Scrapy for 10k+ SKUs per Hour

Sustaining 10,000+ SKU extractions per hour is a single, measurable target: roughly 2.78 successful price records committed to the database every second for the duration of a crawl window. At this rate the bottleneck is almost never raw network I/O; it is reactor thread starvation, synchronous database writes blocking the event loop, unbounded memory growth, and anti-bot rate limiting that silently throttles you back to 3k/hour. This page is the focused tuning recipe for hitting and holding that number. It is a child task of the Async Data Pipelines with Python & Scrapy runtime, and it assumes acquisition is already handled: when storefronts render prices client-side you will first need Configuring Headless Browsers for Dynamic Pricing, and when the DOM is hostile you fall back to API Fallback & Official Data Source Integration.
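
The per-second figure is plain arithmetic on the hourly target. A minimal sketch of the conversion, where the helper name per_second is an illustrative assumption rather than anything from Scrapy:

```python
def per_second(records_per_hour: float) -> float:
    """Convert an hourly record target into a per-second budget."""
    return records_per_hour / 3600


# Target rate versus the throttled rate mentioned above.
print(f"target:    {per_second(10_000):.2f} records/s")
print(f"throttled: {per_second(3_000):.2f} records/s")
```

Seen this way, the gap between a healthy and a throttled crawl is under two records per second, which is why it is easy to miss without measuring.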

Prerequisites & Input Contract

This recipe tunes an existing working spider; it does not teach acquisition. The component it optimizes is the path from “response parsed” to “row committed”, which is where high-volume crawls stall. Each parsed Item must already satisfy a fixed contract before it reaches the pipeline below — validating it earlier keeps the hot path branch-free.

Python: 3.9+ with scrapy>=2.11 (required for the stable asyncio reactor integration) and asyncpg>=0.29.
Datastore: PostgreSQL 13+ with a unique constraint on (sku, source) so the batched upsert can use ON CONFLICT.
Item contract — one parsed item in, one committed row out:

Field	Type	Notes
`sku`	`str`	Stable identifier; must already be resolved upstream, never derived in the pipeline.
`price`	`Decimal`	Pass as a string-backed `Decimal`, never a `float` — see the currency recipe below.
`timestamp`	`datetime`	Timezone-aware UTC capture time, used for staleness windows.
`source`	`str`	Retailer/domain key; half of the upsert conflict target.

Environment assumption: SKU resolution and currency normalization happen before this stage. Prices should already be converted by Converting Multi-Currency Prices to Base Currency so the pipeline only writes, never transforms.
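
The contract in the table can be checked upstream with a few type tests. The following is a minimal sketch using only the standard library; the function name contract_violations and the sample values are hypothetical, and a real project might prefer a schema library instead.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal

REQUIRED_FIELDS = ("sku", "price", "timestamp", "source")


def contract_violations(item):
    """Return a list of problems; an empty list means the item satisfies the contract."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in item]
    if problems:
        return problems
    if not isinstance(item["sku"], str) or not item["sku"]:
        problems.append("sku must be a non-empty str")
    if not isinstance(item["price"], Decimal):
        problems.append("price must be a Decimal")
    ts = item["timestamp"]
    # A naive datetime has utcoffset() of None, so one comparison covers both cases.
    if not isinstance(ts, datetime) or ts.utcoffset() != timedelta(0):
        problems.append("timestamp must be a timezone-aware UTC datetime")
    if not isinstance(item["source"], str) or not item["source"]:
        problems.append("source must be a non-empty str")
    return problems


good = {
    "sku": "ABC-1",
    "price": Decimal("19.99"),
    "timestamp": datetime.now(timezone.utc),
    "source": "example-shop",
}
bad = dict(good, price=19.99, timestamp=datetime(2024, 1, 1))

print(contract_violations(good))
print(contract_violations(bad))
```

Running a check like this in the spider or an earlier pipeline stage is what lets the write path stay limited to a cheap key-presence test.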

Step-by-Step Implementation

Step 1 — Switch to the asyncio reactor and set concurrency. The default Twisted reactor multiplexes well, but the asyncio reactor is what lets your item pipeline use async def against non-blocking drivers. Set it in settings.py together with the concurrency ceiling that delivers ~2.78 req/s of committed throughput once retries and drops are accounted for.

# settings.py
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

CONCURRENT_REQUESTS = 64            # global in-flight ceiling
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # politeness per retailer
DOWNLOAD_DELAY = 0.15
RANDOMIZE_DOWNLOAD_DELAY = True

# The reactor thread pool services blocking work such as DNS lookups and
# threaded storage backends. Scrapy's default of 10 is small for this
# concurrency; raise it so lookups do not queue behind each other.
REACTOR_THREADPOOL_MAXSIZE = 96
DNS_TIMEOUT = 5
DNSCACHE_ENABLED = True
DNSCACHE_SIZE = 10000

Step 2 — Calibrate AutoThrottle instead of a fixed delay. A static DOWNLOAD_DELAY either leaves throughput on the table or trips 429/503 responses. AutoThrottle adapts the delay to each retailer’s observed latency, holding you just under their rate ceiling.

# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.5
AUTOTHROTTLE_MAX_DELAY = 4.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 12.0   # average parallel requests to aim for
AUTOTHROTTLE_DEBUG = False               # set True once, read the throttle log, then disable
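
To see how these values interact, the adjustment rule described in Scrapy's AutoThrottle documentation can be modelled in a few lines. This is a simplified model under stated assumptions, not the extension's code: next_delay is a hypothetical helper, and its defaults are copied from the settings on this page.

```python
def next_delay(current_delay, latency, status=200, *,
               target_concurrency=12.0, min_delay=0.15, max_delay=4.0):
    """Simplified model of one AutoThrottle adjustment (hypothetical helper)."""
    # Delay that would keep roughly target_concurrency requests in flight.
    target_delay = latency / target_concurrency
    # Move halfway toward the target, but never below the target itself.
    new_delay = max(target_delay, (current_delay + target_delay) / 2.0)
    # Clamp between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY.
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Non-200 responses are not allowed to lower the delay.
    if status != 200 and new_delay <= current_delay:
        return current_delay
    return new_delay


delay = 0.5  # AUTOTHROTTLE_START_DELAY
for latency, status in [(0.6, 200), (0.6, 200), (0.6, 200), (6.0, 200), (0.1, 429)]:
    delay = next_delay(delay, latency, status)
    print(f"latency={latency:.1f}s status={status} -> delay={delay:.3f}s")
```

In this model a fast site drives the delay down to the DOWNLOAD_DELAY floor, a latency spike raises it immediately, and a quick 429 does not lower it.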

Step 3 — Cap memory so a leak degrades gracefully instead of OOM-killing. A crawl that the kernel kills at hour two loses its in-flight batch. A self-imposed limit triggers an orderly spider close that flushes first.

# settings.py
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 2048
MEMUSAGE_WARNING_MB = 1800
MEMUSAGE_NOTIFY_MAIL = ["ops@example.com"]

Step 4 — Replace the synchronous pipeline with a batched async one. A synchronous process_item that writes one row per call caps out around 3k SKUs/hour because every commit blocks the reactor. Buffer items and flush with a single executemany upsert; this is the change that actually unlocks the target rate.

import asyncpg
from scrapy import Item
from scrapy.exceptions import DropItem

class AsyncPricePipeline:
    def __init__(self):
        self.pool = None
        self._batch = []
        self._batch_size = 500          # tune against round-trip latency (Step 6)

    async def open_spider(self, spider):
        self.pool = await asyncpg.create_pool(
            dsn="postgresql://user:pass@db-host:5432/pricing_db",
            min_size=10,
            max_size=50,
            statement_cache_size=0,     # disable the per-connection prepared-statement cache
        )

    async def process_item(self, item: Item, spider) -> Item:
        # Contract is validated cheaply on the hot path; bad rows are dropped, not written.
        if not all(k in item for k in ("sku", "price", "timestamp", "source")):
            raise DropItem("Missing required pricing fields")

        self._batch.append(
            (item["sku"], item["price"], item["timestamp"], item["source"])
        )
        if len(self._batch) >= self._batch_size:
            await self._flush_batch()
        return item

    async def _flush_batch(self):
        if not self._batch:
            return
        # Detach the buffer before awaiting: process_item can run for other items
        # while the write is in flight, and clearing afterwards would discard them.
        batch, self._batch = self._batch, []
        async with self.pool.acquire() as conn:
            await conn.executemany(
                """
                INSERT INTO sku_prices (sku, price, timestamp, source)
                VALUES ($1, $2, $3, $4)
                ON CONFLICT (sku, source)
                DO UPDATE SET price = EXCLUDED.price,
                              timestamp = EXCLUDED.timestamp
                """,
                batch,
            )

    async def close_spider(self, spider):
        await self._flush_batch()       # never lose the final partial batch
        await self.pool.close()
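
The buffer-and-flush pattern can be exercised without a database. The sketch below is an illustration under assumptions: BatchBuffer and fake_write are hypothetical stand-ins for the pipeline and the asyncpg call. It detaches the buffer before awaiting the write, so rows appended by other tasks while a flush is in flight are kept.

```python
import asyncio


class BatchBuffer:
    """Collects rows and hands them to an async writer in fixed-size batches."""

    def __init__(self, write_batch, batch_size=500):
        self._write_batch = write_batch  # hypothetical async callable taking a list of rows
        self._batch_size = batch_size
        self._rows = []

    async def add(self, row):
        self._rows.append(row)
        if len(self._rows) >= self._batch_size:
            await self.flush()

    async def flush(self):
        if not self._rows:
            return
        # Swap in a fresh list first; concurrent add() calls write to the new one.
        rows, self._rows = self._rows, []
        await self._write_batch(rows)


async def demo():
    written = []

    async def fake_write(rows):
        await asyncio.sleep(0.01)  # stands in for the database round-trip
        written.extend(rows)

    buf = BatchBuffer(fake_write, batch_size=100)
    # Many producers add rows concurrently, as concurrent process_item calls would.
    await asyncio.gather(*(buf.add(i) for i in range(1050)))
    await buf.flush()  # final partial batch
    return len(written)


print(asyncio.run(demo()))  # 1050: every row is written exactly once
```

Clearing the shared list after the awaited write, instead of swapping it out beforehand, would drop whatever arrived during the write.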

Step 5 — Enable the pipeline and run.

# settings.py
ITEM_PIPELINES = {"myproject.pipelines.AsyncPricePipeline": 300}

$ scrapy crawl prices -s LOG_LEVEL=INFO
...
[scrapy.extensions.logstats] Crawled 10240 pages (at 172 pages/min), scraped 10180 items (at 171 items/min)

A steady ~170 items/min works out to about 10,200/hour, just above the target. Watch the items rate, not the pages rate, because dropped and retried responses inflate the page count.

Verification & Testing

Throughput is easy to fake and easy to misread, so verify it two ways: assert the pipeline math in a unit test, and confirm the live rate from Scrapy’s own stats.

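import asyncio
import unittest
from unittest.mock import AsyncMock, MagicMock

from myproject.pipelines import AsyncPricePipeline  # module path from ITEM_PIPELINES


class TestThroughputBudget(unittest.TestCase):
    def test_concurrency_supports_target_rate(self):
        # Upper bound on rate = concurrency / mean_latency_seconds.
        # 64 in-flight requests at a 0.6 s mean RTT clears the 2.78 req/s target.
        concurrency, mean_rtt = 64, 0.6
        req_per_sec = concurrency / mean_rtt
        self.assertGreaterEqual(req_per_sec, 2.78)

    def test_batch_flushes_on_threshold(self):
        # A full batch must flush exactly once, leaving the buffer empty.
        conn = MagicMock()
        conn.executemany = AsyncMock()
        pipe = AsyncPricePipeline()
        pipe.pool = MagicMock()  # stand-in pool; acquire() yields the fake connection
        pipe.pool.acquire.return_value.__aenter__.return_value = conn
        row = {"sku": "A1", "price": "9.99", "timestamp": None, "source": "shop"}
        pipe._batch = [tuple(row.values())] * (pipe._batch_size - 1)

        asyncio.run(pipe.process_item(row, spider=None))

        conn.executemany.assert_awaited_once()
        self.assertEqual(pipe._batch, [])

if __name__ == "__main__":
    unittest.main()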

For the live signal, read item_scraped_count and the elapsed time straight from the crawl’s stats dump and divide:

# In a spider_closed signal handler:
stats = spider.crawler.stats.get_stats()
# CoreStats writes elapsed_time_seconds at close; guard in case this handler runs first.
elapsed = stats.get("elapsed_time_seconds")
if elapsed:
    rate_per_hour = stats.get("item_scraped_count", 0) / elapsed * 3600
    spider.logger.info("Sustained rate: %d SKUs/hour", rate_per_hour)

If the reported rate_per_hour undershoots the target while CPU sits idle, the constraint is the database flush, not the crawler — widen _batch_size before touching CONCURRENT_REQUESTS.
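
The same division can be wrapped in a small check that compares the result against the target. This is a sketch over a plain dict shaped like Scrapy's stats dump; sustained_rate and verdict are hypothetical helpers, and the sample numbers are made up for illustration.

```python
TARGET_PER_HOUR = 10_000


def sustained_rate(stats):
    """Return items per hour from a stats dict, or None if elapsed time is missing."""
    elapsed = stats.get("elapsed_time_seconds")
    if not elapsed:
        return None
    return stats.get("item_scraped_count", 0) / elapsed * 3600


def verdict(stats, target=TARGET_PER_HOUR):
    """Summarise whether a finished crawl held the target rate."""
    rate = sustained_rate(stats)
    if rate is None:
        return "no elapsed time recorded"
    dropped = stats.get("item_dropped_count", 0)
    status = "OK" if rate >= target else "BELOW TARGET"
    return f"{status}: {rate:.0f} SKUs/hour, {dropped} items dropped"


sample = {
    "elapsed_time_seconds": 3600.0,
    "item_scraped_count": 10180,
    "item_dropped_count": 60,
}
print(verdict(sample))  # OK: 10180 SKUs/hour, 60 items dropped
```

Reporting the dropped count next to the rate helps separate a slow crawl from one that is fast but discarding items at validation.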

Edge Cases & Gotchas

Thread-pool starvation masquerading as slow sites. Scrapy resolves DNS in the reactor thread pool, and REACTOR_THREADPOOL_MAXSIZE defaults to 10. With 64 requests in flight and a cold DNS cache, lookups queue behind each other and throughput collapses even though every retailer is fast. Keep the pool generously sized (Step 1) and confirm with scrapy crawl prices -s AUTOTHROTTLE_DEBUG=1 that observed latency is genuinely high before blaming the target.
The dropped final batch. Items accumulated since the last flush live only in self._batch. If close_spider does not flush — or the process is OOM-killed — up to 499 records vanish with no error. The MEMUSAGE_LIMIT_MB ceiling (Step 3) exists precisely so the spider closes itself and runs _flush_batch instead of dying mid-buffer.
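asyncpg prepared-statement caching. asyncpg keeps an LRU cache of prepared statements on every connection (statement_cache_size, 100 by default), so a pipeline that generates many differently shaped queries holds cached statements on each pooled connection, and the cache fails outright behind PgBouncer in transaction-pooling mode. The single static upsert in Step 4 gains little from the cache, so pinning it to 0 there is a cheap safeguard.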
Retry storms inflating the queue. Retrying 404/410 responses bloats the scheduler with permanently dead URLs and starves live ones. Retry only transient failures:

# settings.py
RETRY_TIMES = 3
RETRY_HTTP_CODES = [429, 500, 502, 503, 504, 408]  # never 404/410

When 429/403 rates spike despite this, the fix is upstream pagination and session handling — see Handling Infinite Scroll & Pagination Logic — not a higher retry count.
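
The retry policy above reduces to a membership test plus a retry budget. A minimal sketch of that decision, where should_retry is a hypothetical helper and not part of Scrapy's retry middleware:

```python
RETRY_TIMES = 3
RETRY_HTTP_CODES = frozenset({429, 500, 502, 503, 504, 408})


def should_retry(status, retries_so_far):
    """Retry only transient statuses, and only while the retry budget lasts."""
    return status in RETRY_HTTP_CODES and retries_so_far < RETRY_TIMES


for status, retries in [(503, 0), (503, 3), (404, 0), (410, 1), (429, 2)]:
    print(status, retries, should_retry(status, retries))
```

Permanent failures such as 404 and 410 never re-enter the scheduler under this rule, so they cannot crowd out live URLs.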

Performance Notes

Effective throughput is concurrency ÷ mean_latency, so doubling concurrency only helps until you saturate the per-domain politeness limit or the database flush. The async pipeline turns a latency-bound workload (one blocking write per item, ~3k/hour) into a throughput-bound one (one executemany per 500 items), and batching cuts database round-trips by the batch factor, which makes it the single highest-leverage change on this page. Memory, not CPU, is the ceiling at scale: each in-flight response plus its buffered item costs RAM, so pairing a 500-row batch with the 2 GB MEMUSAGE_LIMIT_MB cap is a deliberate trade between flush frequency and footprint.
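
The batch-size trade can be put in numbers. The sketch below is worked arithmetic only; batch_budget is a hypothetical helper, and it assumes items arrive at a steady rate.

```python
def batch_budget(items_per_hour, batch_size):
    """Flush count and buffer fill time for a steady item rate."""
    items_per_second = items_per_hour / 3600
    return {
        "flushes_per_hour": items_per_hour / batch_size,
        "seconds_to_fill_batch": batch_size / items_per_second,
    }


for size in (1, 100, 500):
    budget = batch_budget(10_000, size)
    print(
        f"batch_size={size:>3}: "
        f"{budget['flushes_per_hour']:.0f} flushes/hour, "
        f"{budget['seconds_to_fill_batch']:.1f} s to fill"
    )
```

At the target rate a 500-row batch means about 20 flushes per hour, but it also takes roughly three minutes to fill, so a row can sit in memory that long before it is committed. That wait is worth weighing against any staleness window when choosing the batch size.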

A single process eventually hits OS file-descriptor and CPU limits somewhere past 15–20k SKUs/hour. The next step is not more tuning but horizontal scale: decouple request generation from execution behind a Redis or RabbitMQ job queue, push prioritized URL batches, and run ephemeral worker processes that each carry this same pipeline. That orchestration pattern belongs to the parent Async Data Pipelines with Python & Scrapy stage, and the broader idempotency and dead-letter contracts live in Scraping & Data Ingestion Workflows. Export item_scraped_count, downloader/request_bytes, and your flush duration to Prometheus so a degradation surfaces as a metric before it surfaces as a missed pricing window.

Related

Async Data Pipelines with Python & Scrapy — the parent runtime: reactor design, queue orchestration, and the failure-handling contracts this recipe plugs into.
Configuring Headless Browsers for Dynamic Pricing — when prices render client-side and a plain Scrapy request returns an empty price node.
API Fallback & Official Data Source Integration — the lower-latency, higher-accuracy path when DOM scraping degrades under anti-bot pressure.
Handling Infinite Scroll & Pagination Logic — deterministic cursor tracking that prevents duplicate-SKU ingestion from inflating your item count.
