1. Use a crawling and scraping framework like Scrapy:
Don’t try to reinvent the wheel. Frameworks like Scrapy abstract away much of the hard work of web crawling — concurrency, rate limiting, cookie handling, link extraction, file pipelines, and dealing with broken or inconsistent encodings — to make life easier.
Scrapy also makes scraping content easier by providing built-in support for selectors.
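To give a sense of how much the framework handles for you, here is a sketch of a Scrapy `settings.py` fragment toggling the concurrency, rate-limiting, cookie, and retry machinery mentioned above. The values are illustrative, not recommendations — tune them for your target site.

```python
# settings.py — a sketch of the knobs Scrapy exposes so you don't have
# to build this plumbing yourself (illustrative values only).

CONCURRENT_REQUESTS = 16            # parallel requests managed by the framework
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # per-domain politeness cap
DOWNLOAD_DELAY = 0.5                # built-in rate limiting between requests
AUTOTHROTTLE_ENABLED = True         # back off automatically if the site slows down
COOKIES_ENABLED = True              # cookie handling is done for you
RETRY_ENABLED = True
RETRY_TIMES = 2                     # retry transient failures automatically
```

Writing any one of these features by hand — a polite concurrent downloader with retries alone — is a project in itself.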
2. Learn and use XPath or CSS selectors
Instead of using regular expressions or other ad-hoc methods to reach the data you want to scrape, use CSS selectors, XPath, or a combination of both. They make your code more stable and more resilient to small changes in a website’s markup.
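As a minimal illustration, the standard library’s `xml.etree.ElementTree` supports a limited XPath subset, enough to show how a structural query targets elements by meaning rather than by raw text position (Scrapy’s own `response.xpath()`/`response.css()`, built on parsel, support full XPath 1.0 and CSS selectors). The HTML snippet here is invented for the example:

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed page fragment (hypothetical example data).
html = """<html><body>
<div class="product"><span class="price">19.99</span></div>
<div class="product"><span class="price">5.49</span></div>
</body></html>"""

root = ET.fromstring(html)
# An attribute-based XPath query: it keeps working even if the site
# reorders elements or changes surrounding whitespace, where a regex
# anchored to the raw text would silently break.
prices = [span.text for span in root.findall(".//span[@class='price']")]
print(prices)  # → ['19.99', '5.49']
```

The same query in Scrapy would be `response.css("span.price::text").getall()` or `response.xpath("//span[@class='price']/text()").getall()`.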
3. Scale using Scrapyd and Rotating Proxies
Scrapyd allows you to deploy and run multiple spiders at the same time and manage them easily. Combining it with a rotating proxy service like Proxies API means you can scale your project dramatically, getting past the concurrency limits of a single machine and a single IP without incurring usage restrictions or IP blocks.
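The core idea behind proxy rotation can be sketched in a few lines: spread requests across a pool of exit IPs so no single address attracts attention. The proxy URLs below are placeholders; a managed service like Proxies API typically gives you one endpoint that rotates IPs server-side instead.

```python
from itertools import cycle

# Hypothetical proxy pool (placeholder hosts, not real endpoints).
PROXIES = cycle([
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
])

def next_proxy():
    """Round-robin rotation: each outgoing request uses the next proxy,
    so per-IP rate limits apply to only a fraction of your traffic."""
    return next(PROXIES)

# Four requests cycle back to the first proxy.
print([next_proxy() for _ in range(4)])
```

In a Scrapy spider you would apply this per request, e.g. by setting `request.meta["proxy"]` in a downloader middleware.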
4. Take measures to counter usage restrictions and IP blocks
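Beyond proxies, two cheap countermeasures are rotating your User-Agent header and randomizing the delay between requests, since uniform headers and fixed intervals are easy to fingerprint. A minimal sketch (the User-Agent strings are illustrative, and the helper names are my own):

```python
import random

# A small pool of browser-like User-Agent strings (illustrative values).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/124.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def request_headers():
    """Pick a different User-Agent per request so traffic looks less uniform."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_delay(base=1.0, jitter=0.5):
    """Randomized inter-request delay: base seconds plus up to `jitter` more.
    A fixed interval between hits is a classic bot signature."""
    return base + random.uniform(0, jitter)

print(request_headers()["User-Agent"] in USER_AGENTS)  # → True
```

In Scrapy, the equivalent knobs are a User-Agent middleware plus `RANDOMIZE_DOWNLOAD_DELAY` (on by default) alongside `DOWNLOAD_DELAY`.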
5. Put in checks and balances
There are many failure points in a web crawling project that you have no control over. It’s best to put in a bunch of checks and balances, starting by identifying the failure modes:
a. Loss of internet connectivity on both ends.
b. Usage restrictions imposed.
c. IP blocks imposed.
d. The target website changes its HTML.
e. The target website is down.
f. The target website issues a CAPTCHA challenge.
g. The target website changes the rules of pagination.
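Several of these failure modes (b through g) can be caught with a single validation gate run on every response before you trust the scraped data. A minimal sketch — the function name, marker, and classification labels are my own, not from any particular library:

```python
def validate_response(status, body, expected_marker='class="price"'):
    """Classify a response before parsing it, so failures are logged and
    retried deliberately instead of silently producing bad data."""
    if status in (403, 429):
        return "blocked_or_rate_limited"  # IP block or usage restriction (b, c)
    if status >= 500:
        return "site_down"                # target website is down (e)
    if "captcha" in body.lower():
        return "captcha_challenge"        # CAPTCHA interstitial served (f)
    if expected_marker not in body:
        return "layout_changed"           # HTML no longer matches selectors (d, g)
    return "ok"

print(validate_response(200, '<span class="price">19.99</span>'))  # → ok
```

Connectivity loss (a) is better handled at the transport layer with timeouts and retries, which Scrapy’s retry middleware already covers.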
The author is the founder of Proxies API, a proxy rotation api service.