Ugly Truths About Web Scraping

Typically, people think that most of the work in web scraping lies in setting up the web crawlers and perfecting the pattern recognition for the extraction. In reality, that part is a given. There are almost no websites cryptic enough to resist being crawled and scraped at will by a reasonably skilled programmer. In over 20 years of experience building web scrapers of all sizes and kinds, we have found that web scraping projects broadly stumble in the following 3 areas: knowledge of the constraints the target website imposes, knowledge of the workarounds, and the degree of de-risking enforced.

If you are new to web scraping, here are 3 ugly truths about the journey you are about to undertake.

1. The legality of web crawling and web scraping will always be a grey area, even if your data mining operation somehow cures world hunger. This is because people are protective of their data. There have been instances where aggregators that bring enormous traffic, including Google, have been sued by overprotective data owners who don't understand the long-term benefits of an open data world.

2. It’s not about the code.

Web scraping is about checks and balances. Your web crawler will fail, a lot. It may not fail while you are coding it now, but once you deploy it and throw thousands of URLs at it, it will fail often.

a. You will run into constraints and policies such as CAPTCHAs, rate limiting, and IP blocks that hinder your scraping project; these are rarely triggered while you are getting your basic code right.

b. The web is a wild place, and things slow down. Web servers become sluggish at different times of the day. The HTML DOM structure breaks because of bad coding on the site's part. There will be CSS changes, infinite scrolls, JavaScript-based content rendering, and so on.

c. You will need to expect failures at every point and build in alerting mechanisms and redundancy to help you debug and recover from these exceptions, which are not that exceptional (a minimal retry sketch follows below).
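As an illustration only, here is a minimal sketch of that idea in Python, assuming you fetch pages with the requests library. The retry count, backoff numbers, and the logging-based "alert" are placeholder choices you would adapt to your own stack.

```python
import logging
import random
import time

import requests

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("crawler")


def fetch(url, max_retries=4, timeout=15):
    """Fetch a URL, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_retries + 1):
        try:
            resp = requests.get(url, timeout=timeout)
            # 429 (rate limited) and 5xx responses are usually transient
            if resp.status_code in (429, 500, 502, 503, 504):
                raise requests.HTTPError(f"transient status {resp.status_code}")
            resp.raise_for_status()
            return resp.text
        except requests.RequestException as exc:
            wait = 2 ** attempt + random.random()  # backoff with jitter
            log.warning("attempt %d/%d for %s failed: %s (retrying in %.1fs)",
                        attempt, max_retries, url, exc, wait)
            time.sleep(wait)
    # Alert instead of silently dropping the URL; swap the log call for
    # email, Slack, or whatever paging mechanism you actually use.
    log.error("giving up on %s after %d attempts", url, max_retries)
    return None
```

The point is less the exact numbers than the shape: every fetch is wrapped, every failure is recorded, and a URL that keeps failing surfaces somewhere a human will see it.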

3. No free lunches. Web scraping is expensive if you want reliability and scale.

a. You will probably have to spend on decent server infrastructure, like Amazon AWS, that can scale.

b. You are also better off going for their premium plans, which have superior bandwidth, so your crawlers run at the maximum speeds possible.

c. Eventually, almost everyone falls for the lure of the free public proxy servers seemingly available in the thousands on the dozens of websites that list them. But the truth is, you will probably have to invest in a professional rotating proxy service like Proxies API (full disclosure: I am the founder of Proxies API) if you want to route your requests through a large pool of high-speed residential proxies (millions in our case) and have a peaceful night. A rough sketch of what that routing looks like follows below.
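Purely as an illustration, here is how routing through a rotating proxy gateway can look with Python's requests library. The gateway address, credentials, and port are placeholders, not any provider's actual endpoint; substitute whatever your proxy service documents.

```python
import requests

# Placeholder gateway address and credentials; replace with the endpoint
# your proxy provider gives you.
PROXY = "http://username:password@proxy.example.com:8000"


def fetch_via_proxy(url, timeout=15):
    """Fetch a URL through a rotating proxy gateway.

    With a rotating service, each request through the gateway typically
    exits from a different IP, which spreads the load across many
    addresses and makes blanket IP blocks far less effective.
    """
    resp = requests.get(
        url,
        proxies={"http": PROXY, "https": PROXY},
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.text


if __name__ == "__main__":
    # httpbin echoes the IP it sees, so this shows which exit IP was used
    print(fetch_via_proxy("https://httpbin.org/ip"))
```

Combine this with the retry sketch above and you have the skeleton of a crawler that survives the real web, rather than one that only works on your laptop.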