Web scraping nightmare — Can you scrape these websites?

It is always fun and challenging to come across websites that are tough to scrape.

Here are a few that we have found to be consistently challenging. You can use them to test your web crawling and web scraping skills. You will thank us later.

1. Pixabay

Pixabay blocks bots straight away. No amount of User-Agent string rotation or browser spoofing is enough, even if you send every header that Chrome sends. You are presented with a CAPTCHA page the moment you use a crawler.

Even this code didn’t work.

# -*- coding: utf-8 -*-
import requests

url = 'https://pixabay.com/images/search/crazy/'

# Browser-like request headers, copied from a Chrome session
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Referer': 'https://google.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Pragma': 'no-cache',
}

# Cookies captured from a browser session
cookies = {
    'is_human': '1',
    '__cfduid': 'dd0dbf52b454e1cbbc4f46fecc72f997f1581756368',
    'anonymous_user_id': 'e9a26f0a-d5f4-4e21-9af0-8c0dc1ddd441',
}

# Send the headers and the cookies together
response = requests.get(url, headers=headers, cookies=cookies)
print(response.status_code)

This request still triggers a CAPTCHA challenge.
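When testing against a site like this, it helps to tell a challenge page apart from real content before you try to parse it. The following is only a rough heuristic sketch: the function name, the status codes, and the marker strings are our own assumptions for illustration, not anything Pixabay documents.

```python
# Assumed signals of a block or challenge page; tune these per site.
BLOCK_STATUS_CODES = {403, 429, 503}
CAPTCHA_MARKERS = ("captcha", "are you a human", "verify you are human")


def looks_like_captcha(status_code: int, body: str) -> bool:
    """Return True when a response looks like a challenge rather than content."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


if __name__ == "__main__":
    # A normal page and a page that mentions a CAPTCHA widget.
    print(looks_like_captcha(200, "<html><h1>Search results</h1></html>"))
    print(looks_like_captcha(200, "<html><div id='captcha-box'></div></html>"))
```

With requests, you could call it as looks_like_captcha(response.status_code, response.text) right after the request.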

2. Reddit

Scraping Reddit can be annoying because, since the new version was released, none of the class names make any sense. This may not have been done to stop web scraping, but the class names appear to be generated by an algorithm.
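One common way around generated class names is to ignore them and select on tag names and more stable attributes instead. Here is a minimal sketch using only the standard library. The sample markup, the class names in it, and the data-click-id attribute are made up for illustration and may not match what Reddit actually serves.

```python
from html.parser import HTMLParser

# Hypothetical markup with machine-generated class names.
SAMPLE_HTML = """
<div class="_1poyrkZ7g36PawDueRza-J">
  <h3 class="_eYtD2XCVieq6emjKBH3m">First post</h3>
  <a class="_1UoeAeSRhOKSNdY_h3iS1O" data-click-id="comments" href="/r/python/comments/abc/">12 comments</a>
</div>
<div class="_3xX726aBn29LDbsDtzr_6E">
  <h3 class="_eYtD2XCVieq6emjKBH3m">Second post</h3>
  <a class="_1UoeAeSRhOKSNdY_h3iS1O" data-click-id="comments" href="/r/python/comments/def/">3 comments</a>
</div>
"""


class PostParser(HTMLParser):
    """Collect post titles and comment links without looking at class names."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.comment_links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "h3":
            # Every <h3> is treated as a post title in this sketch.
            self._in_title = True
            self.titles.append("")
        elif tag == "a" and attributes.get("data-click-id") == "comments":
            self.comment_links.append(attributes.get("href"))

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


def parse_posts(html):
    """Return (title, comment_link) pairs found in the given HTML."""
    parser = PostParser()
    parser.feed(html)
    return list(zip(parser.titles, parser.comment_links))


if __name__ == "__main__":
    for title, link in parse_posts(SAMPLE_HTML):
        print(title, link)
```

The same idea carries over to BeautifulSoup: match on tags and attributes rather than on the generated class strings.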

Plus, it does this.

3. Yelp

Yelp is easy to scrape at first, but the moment you begin to scale, it tends to block your IP address. We have seen this far too many times to count.

The most probable solution

While it is fun to wrestle with these websites and come up with clever solutions, the lasting way to solve these problems is to address them at the fundamental IP level. You will need to use proxies. That is just the truth about web scraping at scale.

Our rotating proxy server, Proxies API, provides a simple API that can solve all IP blocking problems instantly.
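To make the idea concrete, here is a minimal sketch of rotating through a proxy list by hand and moving to the next proxy when a request looks blocked. Everything named here is an assumption for illustration: the proxy addresses are placeholders from a documentation range, the status codes are a guess at what a block looks like, and fetch is a callable you supply.

```python
from itertools import cycle
from typing import Callable, Dict, Iterable, Optional, Tuple

# Placeholder proxy addresses; replace with proxies you actually control.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

# Assumed signals of an IP block.
BLOCK_STATUS_CODES = {403, 429}

# fetch(url, proxy_config) must return (status_code, body).
Fetch = Callable[[str, Dict[str, str]], Tuple[int, str]]


def fetch_with_rotation(
    url: str,
    proxies: Iterable[str],
    fetch: Fetch,
    max_attempts: int = 3,
) -> Optional[str]:
    """Try the URL through successive proxies until one is not blocked."""
    pool = cycle(proxies)
    for _ in range(max_attempts):
        proxy = next(pool)
        # Same mapping shape that requests accepts in its proxies argument.
        proxy_config = {"http": proxy, "https": proxy}
        status, body = fetch(url, proxy_config)
        if status not in BLOCK_STATUS_CODES:
            return body
    # Every attempt was blocked.
    return None
```

With requests, fetch could wrap requests.get(url, proxies=proxy_config, timeout=10) and return (response.status_code, response.text). Passing the fetcher in keeps the rotation logic easy to test without touching the network.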

a) With millions of high speed rotating proxies located all over the world.

b) With our automatic IP rotation.

c) With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions).

d) With our automatic CAPTCHA solving technology.

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed through a simple API, like the one below, from any programming language.

You don’t even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can fetch the data and parse it in any language, such as Node or PHP, or with any framework, such as Scrapy or Nutch. In all these cases, you can call the URL with render support like so.

curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
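The same call can be put together in Python. The sketch below only builds the request URL from the parameters shown in the curl command (key, render, url). The function name is our own, and percent-encoding the target URL is an assumption that matters once that URL carries its own query string.

```python
from urllib.parse import urlencode

API_ENDPOINT = "http://api.proxiesapi.com/"


def build_request_url(api_key: str, target_url: str, render: bool = True) -> str:
    """Build the API URL for a target page, optionally with rendering."""
    params = {"key": api_key}
    if render:
        params["render"] = "true"
    # urlencode percent-encodes the target so its own ?, & and = survive.
    params["url"] = target_url
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    # API_KEY is a placeholder for your own key.
    print(build_request_url("API_KEY", "https://example.com"))
```

You can then pass the resulting URL to requests.get, just as the curl command does.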

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.

The author is the founder of Proxies API, the rotating proxies service.