How to Scrape Amazon Product Data Without Getting Blocked
Amazon runs one of the most aggressive anti-scraping infrastructures on the web. It fingerprints browser behavior, rotates CAPTCHAs, flags datacenter IP ranges, and rate-limits at the session level. Getting product data reliably means understanding exactly what triggers those defenses and building around them systematically.
Here is what actually works in production.
- Use residential IPs, not datacenter IPs. Amazon's detection layer starts with IP reputation. Datacenter subnets are flagged almost immediately because they have no organic browsing history associated with them. Residential IPs are assigned to real ISP customers, so they carry legitimate reputation. Geonode operates a residential proxy network across 140+ countries, which means you can source IPs that look like organic shoppers from the same geography as your target listings.
- Rotate IPs per request, not per session. A single IP hammering product pages is the fastest path to a block. Per-request rotation ensures each HTTP call comes from a different address. When you need session continuity — for example, navigating paginated search results or following a product variant chain — sticky sessions let you hold the same IP for up to 30 minutes via a session ID in the proxy username string. Use rotating endpoints for isolated product lookups; use sticky sessions when you need to maintain state across a sequence of requests.
- Match your request headers to real browser behavior. Amazon checks User-Agent, Accept-Language, Accept-Encoding, and a cluster of other headers. A bare Python requests call with the library's default User-Agent will get you blocked within minutes. Populate the full header set that a real Chrome or Firefox browser would send, and keep the header order consistent with that browser: HTTP/2 fingerprinting looks at header ordering, not just presence.
- Handle JavaScript rendering. A significant portion of Amazon's product data (reviews, pricing on variant SKUs, availability flags) is injected by JavaScript after the initial HTML loads. Static HTML scrapers miss this content entirely. You need either a headless browser (Playwright or Puppeteer) or a scraping API that handles rendering server-side. The trade-off: headless browsers are slower and more resource-intensive; a rendering API adds latency but removes infrastructure overhead.
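The rotation and header points above can be sketched in a few lines of Python. This is a minimal illustration, not any provider's documented API: the gateway host and port, the "-session-<id>" username suffix, and the helper names (build_proxy_url, proxies_for, new_session_id) are all assumptions made for the example, and the header values are illustrative rather than copied from a live browser. Check your provider's documentation for the real username format.

```python
import uuid
from urllib.parse import quote

# Hypothetical gateway; substitute the host and port your provider gives you.
PROXY_HOST = "proxy.example.com"
PROXY_PORT = 8000


def build_proxy_url(username, password, session_id=None):
    """Build a proxy URL. Passing a session_id asks for a sticky session.

    The "-session-<id>" suffix is an assumed username format for this sketch.
    """
    user = username if session_id is None else f"{username}-session-{session_id}"
    # Percent-encode credentials so characters like "@" or ":" do not break the URL.
    return (
        f"http://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{PROXY_HOST}:{PROXY_PORT}"
    )


def proxies_for(username, password, session_id=None):
    """Return a mapping in the shape the requests library accepts for `proxies`."""
    url = build_proxy_url(username, password, session_id)
    return {"http": url, "https": url}


def new_session_id():
    """Generate a short random ID to reuse across one sequence of related requests."""
    return uuid.uuid4().hex[:12]


# Illustrative browser-like header set, listed in one fixed order.
# Only advertise encodings your HTTP client can actually decode.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Upgrade-Insecure-Requests": "1",
}
```

For an isolated product lookup, call proxies_for(username, password) with no session ID so the rotating endpoint picks the address. For a paginated search, generate one ID with new_session_id() and pass it on every request in that sequence. With the requests library, both pieces plug into a call such as requests.get(url, headers=BROWSER_HEADERS, proxies=proxies_for(username, password), timeout=30).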