In one of our first-ever web scraping projects, we had to scrape Yelp and other local business directories for details about businesses and their reviews. This was a humongous task running into millions of records, hundreds of GBs of data, and about 60 varied ‘fields’ of information.
When we started building it, we would literally take one web page and build a Regex pattern matcher around it to extract the data we wanted.
It worked beautifully. Then we would hook up a multithreaded Python URL fetcher using the requests module.
This was going to be easy. The crawler would crawl a web page and hand it to the scraper, and the extracted information would go into a database. We put the crawler on a cronjob and we were done.
We hadn’t used a framework.
Remember, we had done this not just for Yelp but for about 50 other websites, each with its own pattern match. The great foresight we had was to load all these Regex patterns dynamically from separate files so we could change them whenever we wanted. We were proud of this ‘modular design’.
We had taken a month to get all this going, and then we turned on the engine. It was in production, well on time.
Over the next few weeks, everything that could go wrong did. We tried HARD to fix it, but it took us well over a month to realize and accept that all these problems probably originated in the fundamental decision to build everything ourselves.
That fundamental mistake led to all the other mistakes, because there was no framework enforcing best practices on us.
- The ‘multithreaded’ Python URL fetcher kept leaking memory and hanging, and did not know what to do when URLs returned errors.
- There was no Beautiful Soup to tame rogue HTML.
- We were using Regex to extract data, which broke whenever there was an extra space or a CSS change in the HTML of the target website, because the HTML no longer matched the ‘pattern’.
- There was no reporting of how many links were fetched successfully, how many failed, or how many patterns were no longer working. The crawler and scraper expected the internet never to change and never to behave erratically.
- We kept getting CAPTCHA challenges and the crawler would just stall on all its multiple threads. We didn’t know how to overcome that.
- We didn’t realize we had to rotate User-Agent strings and do a bunch of other stuff to make the web servers believe we were human.
- We kept getting IP blocked as we didn’t know about Rotating Proxy Servers.
- We didn’t know about external tools we could use to make our crawler more stable.
- The scraping job could have been decoupled from the crawling job to make things easier and more distributed.
- We even had to use term extractors on the reviews, and we did it in the same process as the crawling and scraping instead of handing it to a worker script fed by a message queue like RabbitMQ.
- We didn’t build a monitoring and alerting mechanism to detect any of the above problems and alert us. Many times, the crawler would run for a few days before we detected that it was producing wrong data or no data at all.
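To make the Regex problem from the list above concrete, here is a minimal sketch of how a pattern written against one snapshot of a page breaks on a harmless markup change, while a parser-based extractor keeps working. It is an illustration only, not our original code: the `biz-name` class, the sample HTML, and the function names are all made up, and Python's built-in `html.parser` stands in for a library like Beautiful Soup so the example has no dependencies.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern, written against one exact snapshot of a page.
NAME_PATTERN = re.compile(r'<h1 class="biz-name">(.*?)</h1>')


class BizNameParser(HTMLParser):
    """Collects the text of the first <h1> whose class list contains 'biz-name'."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._chunks = []
        self.name = None

    def handle_starttag(self, tag, attrs):
        # Split the class attribute so extra classes do not matter.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1" and "biz-name" in classes and self.name is None:
            self._inside = True

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._inside:
            self._inside = False
            self.name = "".join(self._chunks).strip()


def extract_with_regex(html):
    """Return the business name if the exact pattern matches, else None."""
    match = NAME_PATTERN.search(html)
    return match.group(1) if match else None


def extract_with_parser(html):
    """Return the business name by walking the parsed tags, else None."""
    parser = BizNameParser()
    parser.feed(html)
    return parser.name


# The same content before and after a cosmetic change on the target site:
# an extra space, an extra CSS class, and some added whitespace.
original = '<h1 class="biz-name">Joe\'s Diner</h1>'
changed = '<h1  class="biz-name large">\n  Joe\'s Diner\n</h1>'
```

On `original` both extractors return the name. On `changed` the Regex version returns `None`, which is exactly the silent failure we kept running into, while the parser version still finds the name.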
We didn’t use a framework like Scrapy or Nutch, which would have forced us to consider all these factors beforehand, reduced our build time, and exposed us to a community of developers who knew what they were doing.
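To give a sense of the plumbing a framework takes care of, here is a rough sketch of a fetch loop that handles errors, retries with a backoff, and reports how many links succeeded or failed. This is not how Scrapy or Nutch implement it, and it is not what we had. The names `fetch_url` and `crawl`, the retry count, and the backoff values are assumptions chosen for illustration, and the standard library's `urllib` is used instead of requests to keep the example self-contained.

```python
import time
import urllib.request
from collections import Counter


def fetch_url(url, timeout=10):
    """Fetch one URL with a timeout; network and HTTP errors raise OSError subclasses."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(urls, fetch=fetch_url, max_retries=2, backoff_seconds=1.0):
    """Fetch each URL with retries and return (pages, stats).

    stats counts 'fetched' URLs, individual 'errors', and URLs that
    'failed' after all retries were used up.
    """
    stats = Counter()
    pages = {}
    for url in urls:
        for attempt in range(max_retries + 1):
            try:
                pages[url] = fetch(url)
                stats["fetched"] += 1
                break
            except OSError:
                # URLError, HTTPError and socket timeouts are all OSError subclasses.
                stats["errors"] += 1
                if attempt == max_retries:
                    stats["failed"] += 1
                else:
                    # Exponential backoff before the next attempt.
                    time.sleep(backoff_seconds * (2 ** attempt))
    return pages, stats
```

Calling `crawl(list_of_urls)` returns the fetched pages plus a counter that can be logged or fed to an alert, so a run that suddenly fails on most links is visible right away instead of days later. The `fetch` parameter is there so the loop can be exercised with a fake fetcher, without touching the network.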
The blog was originally posted at https://www.proxiesapi.com/blog/the-biggest-mistake-we-ever-did-in-web-crawling.php