6 Signs Your Web Crawler Code Is Going To Break At Scale

Did you ignore the signs?

Here are 6 ways to sense the upcoming disaster:

1. You have not used a framework

We have not seen a single project survive in the wild that was not built on a robust web crawling and scraping framework like Scrapy. Ignoring years of community-tested code (handling rogue HTML and server vagaries), abstractions (concurrency, sessions, cookie handling, rate-limit handling, multiple spiders), and mature parsing tools (Beautiful Soup, CSS and XPath selectors) is foolhardy. You are setting yourself up for massive headaches and, most likely, failure.
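To give a feel for what a framework hands you for free, here is a minimal sketch of a Scrapy settings.py. The setting names are Scrapy's own; the values are illustrative assumptions, not tuned recommendations.

```python
# settings.py (sketch): values are illustrative assumptions; tune them per site.

# Concurrency and politeness
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # cap parallel requests to a single host
DOWNLOAD_DELAY = 1.0                 # base delay in seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True      # jitter the delay between 0.5x and 1.5x

# Back off automatically when the server starts responding slowly
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0

# Retry transient failures instead of letting them kill the crawl
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]

# Keep session cookies between requests
COOKIES_ENABLED = True
```

Every one of these lines replaces code you would otherwise have to write, debug, and maintain yourself.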

2. You have no checks and balances

a. Do you know when a website changes its code?

b. Do you know when a spider crashes?

c. Do you know when a website rate limits you?

d. Do you know when a website issues a CAPTCHA challenge?

e. Do you know when a website blocks your IP?
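The checks above can be sketched as a single diagnosis step that runs after every fetch. This is a rough illustration using only the standard library; `FetchResult`, `diagnose`, `checked_fetch`, the status-code mapping, and the required field names are all hypothetical choices, and real sites signal blocks and CAPTCHAs in many other ways.

```python
from dataclasses import dataclass, field


@dataclass
class FetchResult:
    """Hypothetical container for one fetch: HTTP status, raw body, parsed items."""
    status: int
    body: str
    items: list = field(default_factory=list)


def diagnose(result, required_fields=("title", "price")):
    """Return a short label for what went wrong, or "ok"."""
    if "captcha" in result.body.lower():
        return "captcha"              # d. CAPTCHA challenge (naive text check)
    if result.status == 429:
        return "rate_limited"         # c. rate limiting
    if result.status == 403:
        return "blocked"              # e. likely IP block
    if result.status >= 500:
        return "server_error"
    if not result.items:
        return "layout_changed"       # a. selectors matched nothing
    for item in result.items:
        if any(not item.get(name) for name in required_fields):
            return "layout_changed"   # a. selectors matched, fields are empty
    return "ok"


def checked_fetch(fetch, url):
    """Run a fetch callable and report crashes instead of dying silently."""
    try:
        result = fetch(url)
    except Exception as exc:          # b. spider crash
        return "spider_crashed", repr(exc)
    return diagnose(result), None
```

Send anything other than "ok" to a log or an alert. The point is that you find out on the same day, not three weeks later when someone notices the data is empty.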

3. You have no measures to overcome IP blocks

You have to use a rotating proxy service like Proxies API if you are serious about scaling your web scraper and running it consistently.
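To show what "rotating" means in practice, here is a minimal client-side sketch using only the standard library. The proxy addresses are hypothetical placeholders from a documentation-only IP range, and `next_proxy`, `opener_for`, and `fetch` are names made up for this example. A hosted rotating proxy service would typically do this rotation for you behind a single endpoint.

```python
import itertools
import urllib.request

# Hypothetical proxy endpoints (documentation-only IP range); replace with real ones.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

# Round-robin iterator over the pool
_proxy_cycle = itertools.cycle(PROXIES)


def next_proxy():
    """Return the next proxy in round-robin order."""
    return next(_proxy_cycle)


def opener_for(proxy_url):
    """Build a urllib opener that routes HTTP and HTTPS through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, timeout=10):
    """Fetch a URL through the next proxy; return (proxy_used, body_bytes)."""
    proxy = next_proxy()
    with opener_for(proxy).open(url, timeout=timeout) as resp:
        return proxy, resp.read()
```

Round-robin is the simplest possible policy. A real pool also needs to retire dead or banned proxies, which is a large part of what you pay a service for.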

4. Your scraping logic is built on RegEx or some such rudimentary way of extracting data

No! RegEx is a no, because even small changes to a page can and will break your code. It is better to use CSS selectors or XPath to define the data you want. Those will also break, but nowhere near as often as regex-based extraction.
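Here is a small sketch of why. The two HTML snippets and the function names are made up for illustration, and the standard library's ElementTree is used only because it needs no install: it supports a limited XPath subset and requires well-formed markup, so on real pages you would use the selectors that ship with a framework like Scrapy.

```python
import re
import xml.etree.ElementTree as ET

# The same price, before and after the site adds one harmless attribute.
OLD_HTML = '<div><span class="price">19.99</span></div>'
NEW_HTML = '<div><span id="p1" class="price">19.99</span></div>'

# Regex tied to the exact characters of the opening tag
PRICE_RE = re.compile(r'<span class="price">([^<]+)</span>')


def price_by_regex(html):
    """Extract the price by matching the raw tag text."""
    match = PRICE_RE.search(html)
    return match.group(1) if match else None


def price_by_xpath(html):
    """Extract the price from the first span whose class attribute is 'price'."""
    root = ET.fromstring(html)
    node = root.find(".//span[@class='price']")
    return node.text if node is not None else None


print(price_by_regex(OLD_HTML), price_by_regex(NEW_HTML))   # 19.99 None
print(price_by_xpath(OLD_HTML), price_by_xpath(NEW_HTML))   # 19.99 19.99
```

The regex dies the moment an attribute is added before `class`. The XPath query describes the structure you care about, so it shrugs off the change.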

5. You are ignoring cookies, not rotating the User-Agent string, and giving away many other "tells" that you are not human.
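A minimal sketch of fixing the first two tells with the standard library follows. The User-Agent strings are examples assumed for illustration (keep your own list current), and `new_session` is a hypothetical helper name.

```python
import random
import urllib.request
from http.cookiejar import CookieJar

# Example browser User-Agent strings (assumed for illustration).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def new_session(rng=random):
    """Return (opener, jar): an opener that stores cookies and sends browser-like headers."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # One User-Agent per session; pick a new one when you start a new session.
    opener.addheaders = [
        ("User-Agent", rng.choice(USER_AGENTS)),
        ("Accept-Language", "en-US,en;q=0.9"),
    ]
    return opener, jar
```

Requests made with `opener.open(url)` will store cookies the site sets and send them back on later requests, the way a browser would. Keeping one User-Agent for the life of a session also matters: a "visitor" whose browser changes on every request is its own tell.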

6. Your bot runs at the same time every day, at precisely the same intervals, and in the same order. You know, like a bot!

Add some random intervals between requests, run at a random time, change the order in which you fetch the URLs, spread the requests over the day, and use a goddamn rotating proxy service, for God's sake!
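The randomization part of that advice can be sketched in a few lines. The function names, the delay range, and the six-hour start window are assumptions picked for illustration.

```python
import random


def plan_crawl(urls, min_delay=2.0, max_delay=10.0, rng=None):
    """Return [(url, delay_seconds), ...] in shuffled order with a random pause per URL."""
    rng = rng or random.Random()
    order = list(urls)        # copy so the caller's list is left untouched
    rng.shuffle(order)        # different fetch order on every run
    return [(url, rng.uniform(min_delay, max_delay)) for url in order]


def random_start_offset(window_seconds=6 * 3600, rng=None):
    """Return how many seconds to wait before starting, somewhere inside the window."""
    rng = rng or random.Random()
    return rng.uniform(0, window_seconds)
```

In the crawl loop, sleep for `random_start_offset()` once, then for each `(url, delay)` pair from `plan_crawl(urls)` sleep for `delay` before fetching `url`.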




Founder @ ProxiesAPI.com

Mohan Ganesan
