1. Make sure you pick the right framework. A framework like Scrapy comes with compelling features such as rate limiting, crawling policies, concurrent requests, support for distributed crawling, and selector syntax support like XPath and CSS selectors to make the job easier.
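To get a feel for the selector idea without installing anything, here is a minimal sketch using Python's standard-library ElementTree, which supports a limited XPath subset. (Scrapy's own selectors support full XPath and CSS; the HTML fragment and field names below are made up for illustration.)

```python
import xml.etree.ElementTree as ET

# A toy fragment standing in for a crawled page.
html = """
<html>
  <body>
    <h1>Example Product</h1>
    <span class="price">19.99</span>
  </body>
</html>
"""

root = ET.fromstring(html)

# XPath-style queries pull out exactly the fields we want,
# instead of fragile string slicing or regexes.
title = root.find(".//h1").text
price = root.find(".//span[@class='price']").text

print(title, price)  # Example Product 19.99
```

In Scrapy you would write the same queries as `response.xpath(...)` or `response.css(...)` inside a spider's parse callback.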
2. Always rate limit your web crawlers. If you are using a framework like Scrapy, you can cap the number of concurrent connections in the settings.
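In Scrapy this is a few lines in `settings.py`. The values below are illustrative, not recommendations; tune them to the target site:

```python
# settings.py (excerpt): throttle the crawler so you don't hammer the target.
CONCURRENT_REQUESTS = 8             # total simultaneous requests
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # per-domain cap
DOWNLOAD_DELAY = 1.5                # seconds between requests to the same site

# AutoThrottle adjusts the delay dynamically based on server response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
```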
3. Think about how you will overcome IP blocks early. Invest in a rotating proxy service like Proxies API to route your requests through a pool of millions of residential proxies, so it's almost impossible to get blocked.
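Rotating-proxy services typically work by wrapping each target URL in a call to their endpoint, which then fetches it from a fresh IP. The endpoint shape and parameter names below are assumptions for illustration; check your provider's documentation for the real interface:

```python
from urllib.parse import urlencode

# Hypothetical endpoint shape for a rotating-proxy API;
# the real URL and parameter names come from the provider's docs.
PROXY_ENDPOINT = "http://api.proxiesapi.com/"

def via_proxy(target_url: str, auth_key: str) -> str:
    """Wrap a target URL so the request is routed through the proxy pool."""
    query = urlencode({"auth_key": auth_key, "url": target_url})
    return f"{PROXY_ENDPOINT}?{query}"

print(via_proxy("https://example.com/page", "MY_KEY"))
# http://api.proxiesapi.com/?auth_key=MY_KEY&url=https%3A%2F%2Fexample.com%2Fpage
```

In Scrapy, you would apply this wrapping in a downloader middleware so every outgoing request is rerouted automatically.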
4. Build in monitors where you expect the crawler to break down. Take these areas into account:
a. What happens if there is no or slow internet?
b. What happens if the root page doesn’t load?
c. What happens if the page changes its template?
d. What happens if you get IP blocked?
e. What happens if you encounter a CAPTCHA?
It is best to actively check for all these failure points and build in logging and alerting mechanisms so you can detect a lousy crawl or bad data early.
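One way to make these checks concrete is a small health-check function run on every response, feeding your logging and alerting. The status codes and marker strings below are illustrative assumptions you would tune per site:

```python
def classify_response(status: int, body: str, expected_marker: str) -> str:
    """Classify a fetched page into a health bucket for logging/alerting.

    `expected_marker` is a string that should appear in a healthy page
    (e.g. a known CSS class name); its absence suggests a template change.
    """
    if status in (403, 429):
        return "possible_ip_block"        # blocked or rate-limited
    if status >= 500 or status == 0:
        return "server_or_network_error"  # no/slow internet, root page down
    if "captcha" in body.lower():
        return "captcha_challenge"
    if expected_marker not in body:
        return "template_changed"
    return "ok"

# Example: a page that loads fine but no longer contains the expected markup.
print(classify_response(200, "<html><body>New layout</body></html>", "price"))
# template_changed
```

Anything other than "ok" can be counted per crawl run, so a spike in any bucket triggers an alert before a whole night of bad data accumulates.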
5. If you want more concurrency than a single Scrapy process allows, consider using the Scrapyd daemon, which can run multiple spiders simultaneously. Combine this with a rotating proxy like Proxies API, and you can scale concurrent connections quite quickly.
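As a sketch of the Scrapyd workflow ("myproject" and "myspider" are placeholder names), you start the daemon once and then schedule spiders through its JSON HTTP API:

```shell
# Install and start the Scrapyd daemon (listens on port 6800 by default).
pip install scrapyd
scrapyd &

# Schedule a spider run via the JSON API; repeat to run several in parallel.
curl http://localhost:6800/schedule.json -d project=myproject -d spider=myspider

# Check what's currently pending, running, and finished.
curl "http://localhost:6800/listjobs.json?project=myproject"
```

Because scheduling is just an HTTP call, a cron job or orchestration script can fan out many spiders across one or more Scrapyd hosts.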