6 Signs Your Web Crawler Code Is Going To Break At Scale
Here are 6 ways to sense the upcoming disaster:
1. You have not used a framework
We have not seen a single project survive in the wild that was not built on a robust web crawling and scraping framework like Scrapy. Ignoring years of community-tested code (handling rogue HTML and server vagaries), abstractions (concurrency, sessions, cookie handling, rate-limit handling, multiple spiders), and mature parsing tools (Beautiful Soup, CSS and XPath selectors) is foolhardy. You are setting yourself up for massive headaches and, most likely, failure.
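To give a feel for how much a framework hands you for free, here is a rough sketch of a Scrapy settings.py fragment. The setting names are Scrapy's own; the values are illustrative assumptions, not recommendations, and need tuning per target site.

```python
# settings.py fragment for a Scrapy project (values are illustrative assumptions)

# Let Scrapy adapt the request rate to how fast the server responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0   # seconds before the first adjustment
AUTOTHROTTLE_MAX_DELAY = 30.0    # upper bound when the server is slow

# Cap parallel requests per site and add a randomized base delay.
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_DELAY = 0.5
RANDOMIZE_DOWNLOAD_DELAY = True

# Keep cookies across requests and retry transient failures.
COOKIES_ENABLED = True
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
```

Each of these lines replaces code you would otherwise write, debug, and maintain yourself.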
2. You have no checks and balances
a. Will you know when a website changes its code?
b. Will you know when a spider crashes?
c. Will you know when a website rate-limits you?
d. Will you know when a website issues a CAPTCHA challenge?
e. Will you know when a website blocks your IP?
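The five questions above can be turned into a small health check that runs on every response. This is a minimal sketch, not a definitive implementation: the function names, the outcome labels, and the CAPTCHA marker strings are all hypothetical, and the status codes each site uses for blocking vary.

```python
# Hypothetical marker strings; real CAPTCHA pages differ from site to site.
CAPTCHA_MARKERS = ("captcha", "are you a robot")


def classify_response(status, body, required_fields, extracted):
    """Label one fetch so problems can be counted and alerted on."""
    if status == 429:
        return "rate_limited"
    if status in (401, 403):
        return "blocked"  # assumption: the site signals an IP block this way
    if any(marker in body.lower() for marker in CAPTCHA_MARKERS):
        return "captcha"
    if status >= 500:
        return "server_error"
    # A page that loads but yields no data usually means the markup changed.
    missing = [name for name in required_fields if not extracted.get(name)]
    if missing:
        return "layout_changed"
    return "ok"


def run_spider_safely(spider_fn, alert):
    """Run a spider callable; report a crash instead of dying silently."""
    try:
        spider_fn()
        return True
    except Exception as exc:  # broad on purpose: any crash should alert
        alert(f"spider crashed: {exc!r}")
        return False
```

Counting these labels per site and alerting when anything other than "ok" spikes is usually enough to catch a problem before a whole day of data is lost.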
3. You have no measures to overcome IP blocks
You have to use a rotating proxy service like Proxies API if you are serious about scaling your web scraper and doing it consistently.
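A rotating proxy service does the rotation for you behind a single endpoint. To show the underlying idea only, here is a sketch that cycles through a local pool using the Python standard library. The proxy addresses and the ProxyRotator class are hypothetical, and a hand-rolled pool like this is a stand-in for illustration, not a replacement for a managed service.

```python
import itertools
import urllib.request

# Hypothetical proxy addresses; substitute real ones from your provider.
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
]


class ProxyRotator:
    """Hand out proxies round-robin so consecutive requests use different IPs."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        return next(self._cycle)

    def opener(self):
        """Return (proxy, opener) where the opener routes through that proxy."""
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return proxy, urllib.request.build_opener(handler)
```

Calling rotator.opener() before each request and then opener.open(url) sends every fetch through the next address in the pool.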
4. Your scraping logic is built on RegEx or some such rudimentary way of extracting data
No! RegEx is a no. Even small changes to the page can and will break your code. It’s better to use CSS selectors or XPath to define the data you want. Those will also break, but nowhere near as often as other methods.
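Here is a small illustration of the difference. The two HTML snippets and the function names are made up for this example. To stay dependency-free it uses the limited XPath support in Python's xml.etree.ElementTree, which only works because the snippets are well-formed; on real pages you would use your framework's selectors instead.

```python
import re
import xml.etree.ElementTree as ET

# The same price before and after a harmless markup change (made-up snippets).
OLD_PAGE = '<div><span class="price">19.99</span></div>'
NEW_PAGE = '<div><span id="p1" class="price" data-cur="USD">19.99</span></div>'

# Regex tied to the exact text of the opening tag.
PRICE_RE = re.compile(r'<span class="price">([\d.]+)</span>')


def price_by_regex(html):
    match = PRICE_RE.search(html)
    return match.group(1) if match else None


def price_by_xpath(html):
    # Matches any span whose class attribute is "price", whatever else is on the tag.
    node = ET.fromstring(html).find(".//span[@class='price']")
    return node.text if node is not None else None
```

Both functions find the price on OLD_PAGE. On NEW_PAGE the regex returns None because the tag text no longer matches character for character, while the XPath lookup still returns the price.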
5. You are ignoring cookies, not rotating the User-Agent string, and giving away many other “tells” that you are not human.
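As a rough sketch of the basics, the snippet below builds a session that keeps cookies and picks a User-Agent per session, using only the Python standard library. The make_session name and the User-Agent strings are assumptions for illustration; keep your own list current with real browser releases.

```python
import random
import urllib.request
from http.cookiejar import CookieJar

# Example User-Agent strings (illustrative; refresh them regularly).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.5 Safari/605.1.15",
]


def make_session(rng=random):
    """Return (opener, jar): an opener that stores cookies and sends browser-like headers."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # One User-Agent per session: a browser does not change identity mid-visit.
    opener.addheaders = [
        ("User-Agent", rng.choice(USER_AGENTS)),
        ("Accept-Language", "en-US,en;q=0.9"),
    ]
    return opener, jar
```

Every opener.open(url) call on the returned opener then sends back whatever cookies the site has set, the way a browser would.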
6. Your bot runs at the same time every day, at precisely the same intervals, in the same order. You know, like a bot!
Add some random intervals between requests, run at a random time, change the order in which you fetch the URLs a bit, spread the requests over the day, and use a goddamn rotating proxy service, for God’s sake!
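The randomization part can be sketched in a few lines. The function names and default values here are hypothetical; the point is only that order, timing, and spacing all come from a random source rather than a fixed schedule.

```python
import random


def plan_crawl(urls, window_seconds=24 * 60 * 60, rng=None):
    """Shuffle the URLs and give each a random start offset within the window.

    Returns a list of (offset_seconds, url) pairs sorted by offset.
    """
    rng = rng or random.Random()
    shuffled = list(urls)
    rng.shuffle(shuffled)  # a different fetch order on every run
    offsets = sorted(rng.uniform(0, window_seconds) for _ in shuffled)
    return list(zip(offsets, shuffled))


def jittered_delay(base_seconds=5.0, spread=0.5, rng=None):
    """Return base_seconds varied by up to +/- spread (a fraction of the base)."""
    rng = rng or random.Random()
    return base_seconds * rng.uniform(1 - spread, 1 + spread)
```

A crawl loop would walk the plan in order, wait until each offset arrives, and sleep for jittered_delay() between requests to the same site.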