I've scraped 2 Billion+ keywords on Amazon. Here's what I learned.

Scraping Amazon is easy. Scraping accurately is hard. At scale it's even harder. The stuff nobody tells you about residential IPs, retry mechanisms, zipcode locking, and why Bright Data/Oxylabs aren't gods.

Mert Zorlu

Do not disturb, scraping Amazon


I've scraped 2 billion+ keywords / search result pages (~100B product pages) on Amazon so far.

This is the stuff I learned the hard way. Send it to your data engineers.

Bright Data / Oxylabs are not gods

You'll be surprised how often the providers everyone treats as GODS can be inaccurate. I was too.

The lesson: always validate results. Don't trust a provider's name. Trust what you can verify against Amazon's live page. If you're not checking your own accuracy, you have no idea how wrong your data is.
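One way to put a number on this, as a minimal sketch: take the ASIN order a provider returned and compare it position by position with the order you captured yourself from the live page for the same keyword. The function name, the list-of-ASINs input shape, and the sample ASINs are assumptions for illustration, not any provider's API.

```python
def rank_accuracy(provider_asins, reference_asins):
    """Fraction of reference positions where the provider returned the same ASIN."""
    if not reference_asins:
        raise ValueError("reference_asins must not be empty")
    matches = sum(
        1
        for position, asin in enumerate(reference_asins)
        if position < len(provider_asins) and provider_asins[position] == asin
    )
    return matches / len(reference_asins)


# Hypothetical sample: the provider swapped ranks 2 and 3 and dropped rank 4.
reference = ["B0AAA00001", "B0AAA00002", "B0AAA00003", "B0AAA00004"]
provider = ["B0AAA00001", "B0AAA00003", "B0AAA00002"]
print(rank_accuracy(provider, reference))  # 0.25
```

Run something like this on a small daily sample and track the score over time. A strict positional match is a deliberately harsh metric; loosen it if you only care about presence on the page.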

No DC IPs. Residential or nothing.

If you do not use residential proxies you won't get accurate data. Period.

Datacenter IPs are cheap and they feel fine until you actually compare the results side by side with what a real shopper sees. Different prices, different ranks, different buybox. NO DC IPs.
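A side-by-side check could look roughly like the sketch below. It assumes you already have two parsed snapshots of the same keyword, one fetched through a residential IP and one through a datacenter IP, each as a dict keyed by ASIN. The field names (`price`, `rank`, `buybox_seller`) and the sample values are hypothetical.

```python
def diff_snapshots(residential, datacenter, fields=("price", "rank", "buybox_seller")):
    """Return {asin: {field: (residential_value, datacenter_value)}} for every mismatch."""
    mismatches = {}
    for asin, res_row in residential.items():
        dc_row = datacenter.get(asin)
        if dc_row is None:
            # The ASIN is visible to the residential IP but not the datacenter one.
            mismatches[asin] = {"present": (True, False)}
            continue
        changed = {
            field: (res_row.get(field), dc_row.get(field))
            for field in fields
            if res_row.get(field) != dc_row.get(field)
        }
        if changed:
            mismatches[asin] = changed
    return mismatches


# Hypothetical snapshots of the same keyword.
residential = {
    "B0AAA00001": {"price": 19.99, "rank": 1, "buybox_seller": "SellerA"},
    "B0AAA00002": {"price": 24.50, "rank": 2, "buybox_seller": "SellerB"},
}
datacenter = {
    "B0AAA00001": {"price": 21.49, "rank": 3, "buybox_seller": "SellerA"},
    "B0AAA00002": {"price": 24.50, "rank": 2, "buybox_seller": "SellerB"},
}
print(diff_snapshots(residential, datacenter))
# {'B0AAA00001': {'price': (19.99, 21.49), 'rank': (1, 3)}}
```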

Not ALL products show up the first time

This one bites everyone. Sponsored / ad ranked ASINs do not all appear on the first request. The page loads a partial set and if you take it at face value you're underreporting the competition.

You need retry mechanisms to get ALL of them. If you're not retrying until the full sponsored set is captured, your rank data is a flattering lie.
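Here is one possible shape for such a retry loop. You can't know the "full" sponsored set in advance, so this sketch uses a stand-in stopping rule that is my assumption, not a documented Amazon behavior: keep requesting and merging until several consecutive requests add no new ASINs. `fetch_sponsored` is a hypothetical callable that performs one request and returns the sponsored ASINs it saw.

```python
def collect_sponsored(fetch_sponsored, max_attempts=10, stable_rounds=3):
    """Union sponsored ASINs across requests until no new ones appear.

    Stops after `stable_rounds` consecutive requests add nothing new,
    or after `max_attempts` requests, whichever comes first.
    Returns (set_of_asins, attempts_used).
    """
    seen = set()
    rounds_without_new = 0
    attempts = 0
    while attempts < max_attempts and rounds_without_new < stable_rounds:
        attempts += 1
        new = set(fetch_sponsored()) - seen
        if new:
            seen |= new
            rounds_without_new = 0
        else:
            rounds_without_new += 1
    return seen, attempts


# Stubbed responses standing in for real requests (hypothetical ASINs).
responses = iter([["A", "B"], ["B", "C"], ["A"], ["C", "D"], ["A", "D"], ["B"], ["C"]])
asins, attempts = collect_sponsored(lambda: next(responses))
print(sorted(asins), attempts)  # ['A', 'B', 'C', 'D'] 7
```

Note that the first request alone would have reported two sponsored ASINs out of four. Tune `stable_rounds` against your own validation data; every extra round costs proxy bandwidth.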

Browser automation at scale is suicide

Amazon's anti-bot measures and block rates are high right now, especially because of the AI scraping wave. Everyone and their dog is scraping.

Using browser automation (Puppeteer, Playwright, headless Chrome) for this is suicide. It's slow, it's heavy, and you can't get scale. You'll hit walls fast.

In-house vs outsourcing is a cost math problem

Scraping in-house vs outsourcing depends entirely on how cost-effective you are at using the proxies / IPs you buy.

Proxy bandwidth is billed per gigabyte / terabyte, and it piles up fast. My math is no better than GPT's so run your own numbers, but a lot of teams underestimate how quickly the proxy bill balloons when they scale daily volume.
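To run your own numbers, a back-of-the-envelope estimate can be as small as the sketch below. Every input in the example call is a made-up placeholder (page size, price per GB, retry overhead), not a quote from any provider, and it assumes metered billing with 1 GB = 1000 MB.

```python
def monthly_proxy_cost(pages_per_day, avg_page_mb, price_per_gb,
                       requests_per_page=1.0, days=30):
    """Estimated monthly bandwidth bill in dollars for metered proxies.

    requests_per_page > 1 accounts for retries and blocked requests.
    """
    gb_per_day = pages_per_day * requests_per_page * avg_page_mb / 1000
    return gb_per_day * days * price_per_gb


# Placeholder inputs: 1M pages/day, 0.5 MB per page, $4 per GB,
# and 1.5 requests per page once retries are included.
print(monthly_proxy_cost(1_000_000, 0.5, 4.0, requests_per_page=1.5))  # 90000.0
```

The cost is linear in every input, so 10x the daily volume is 10x the bill, and retry overhead multiplies straight through. Compare the result against an outsourced per-request price for the same volume.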

Zipcodes are not optional

Zipcodes are nice for buybox & search result pages. They let you see different organic ranks and buybox winners based on shipping times.

But here's the part people miss: zipcodes are also necessary because random ones make your results inconsistent from one request to the next. You need to lock one every time as a control variable.

Example: 1 keyword, page 1 and page 2. If you scrape both with random zips, you can't do any analysis. You're stitching together two different Amazon experiences and calling it a ranking. Lock the zip. Always.
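In code, locking the zip mostly means making it part of the job definition instead of something picked per request, then refusing to analyze a result set that mixes zips. A minimal sketch, with hypothetical function names, job fields, keyword, and zip code:

```python
def build_jobs(keyword, pages, zipcode):
    """One job per result page, all pinned to the same zip code."""
    return [{"keyword": keyword, "page": page, "zipcode": zipcode} for page in pages]


def assert_single_zip(results):
    """Return the zip shared by all rows; raise if the set mixes zip codes."""
    zips = {row["zipcode"] for row in results}
    if len(zips) != 1:
        raise ValueError(f"mixed zip codes in one result set: {sorted(zips)}")
    return next(iter(zips))


jobs = build_jobs("wireless earbuds", [1, 2], "10001")
print(assert_single_zip(jobs))  # 10001
```

Running the guard before any rank stitching turns a silent data-quality problem into a loud failure.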

The one line that matters

Scraping Amazon is easy. Scraping accurately is hard. At scale it's even harder.

A normal scraper can't survive Amazon anymore. You need big guns.

☞ https://asgardata.com/
