Making Your Web Crawling Project Blazingly Fast


1. Use an asynchronous crawler like Scrapy.

Scrapy fetches requests asynchronously, so it can keep many requests in flight at once. You can cap overall concurrency with the CONCURRENT_REQUESTS setting, and cap concurrency per target with CONCURRENT_REQUESTS_PER_IP (or CONCURRENT_REQUESTS_PER_DOMAIN).
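As a sketch, these knobs live in your project's settings.py. The values below are illustrative, not recommendations; tune them for your targets:

```python
# settings.py -- illustrative concurrency settings for a Scrapy project.

# Total number of requests Scrapy keeps in flight across all sites.
CONCURRENT_REQUESTS = 32

# Cap per remote IP. When non-zero, this overrides the per-domain cap,
# and download delays are enforced per IP instead of per domain.
CONCURRENT_REQUESTS_PER_IP = 8

# Per-domain cap (ignored while CONCURRENT_REQUESTS_PER_IP is non-zero).
CONCURRENT_REQUESTS_PER_DOMAIN = 8
```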

2. Use Scrapyd to run multiple spiders concurrently. Suppose you want to crawl a few thousand pages from 3 different websites every day. You can first configure Scrapy with the settings above to fetch multiple URLs in parallel, and then with Scrapyd you can run all 3 spiders at the same time.

Scrapyd is a daemon that spawns a process for each spider you ask it to run. It runs multiple processes in parallel, allocating them across a fixed number of slots, and starts as many processes as those slots allow to handle the overall load.
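A minimal sketch of kicking off all three spiders through Scrapyd's HTTP API. The project and spider names here are placeholders, and Scrapyd is assumed to be listening on its default port 6800:

```python
from urllib import parse, request

# Scrapyd's default scheduling endpoint.
SCRAPYD_URL = "http://localhost:6800/schedule.json"

def schedule_payload(project: str, spider: str) -> bytes:
    """Build the form-encoded body that schedule.json expects."""
    return parse.urlencode({"project": project, "spider": spider}).encode()

def schedule_all(project: str, spiders: list) -> None:
    # One POST per spider; Scrapyd queues each run in its own process slot.
    for spider in spiders:
        request.urlopen(
            request.Request(SCRAPYD_URL, data=schedule_payload(project, spider))
        )

# Hypothetical project and spider names for the three sites:
# schedule_all("daily_crawl", ["site_a", "site_b", "site_c"])
```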

3. Use XPath instead of regex or other pattern-matching techniques for scraping. XPath queries run against the parsed document tree, so they are far more robust against markup quirks than patterns matched over raw HTML.
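A small illustration of the difference. Scrapy's own Selector supports full XPath; to keep this sketch dependency-free it uses the XPath subset in Python's standard-library ElementTree, which is enough to make the point:

```python
import re
import xml.etree.ElementTree as ET

html = """<div>
  <a href="/page1">First</a>
  <p>Not a link: href="/fake"</p>
  <a href="/page2">Second</a>
</div>"""

# Regex over the raw markup happily matches the href-like text inside <p>.
regex_hrefs = re.findall(r'href="([^"]+)"', html)

# An XPath-style query runs against the parsed tree, so only real <a>
# elements match.
tree = ET.fromstring(html)
xpath_hrefs = [a.get("href") for a in tree.findall(".//a")]
```

Here `regex_hrefs` picks up the bogus `/fake` value, while `xpath_hrefs` returns only the two real links.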

4. Here is the catch: most web servers limit the number of concurrent requests per IP, which can render Scrapy's concurrency settings moot. To get around this, use a rotating proxy service like Proxies API, which routes requests through a large pool of IPs automatically. You can then run multiple Scrapy batches simultaneously and scale concurrency until the only limit is how many requests the target web server can handle at a time.
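One way to wire this up in Scrapy, sketched without any vendor-specific details: a downloader middleware that assigns each request a proxy from a pool. The proxy URLs below are placeholders, and in a real project the class would be registered under DOWNLOADER_MIDDLEWARES in settings.py:

```python
import random

# Placeholder pool; a service like Proxies API would supply real endpoints.
PROXY_POOL = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
    "http://proxy3.example.com:8000",
]

class RotatingProxyMiddleware:
    """Scrapy-style downloader middleware that picks a proxy per request.

    Scrapy's HTTP downloader honors request.meta["proxy"], so assigning a
    different pool member to each outgoing request spreads the load
    across many IPs.
    """

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(PROXY_POOL)
        return None  # let Scrapy continue downloading through the proxy
```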