Building Fail-Safe Scrapers: Example of Writing a Self-Monitoring Scraper

  • URL fetch errors: the website is down
  • URL fetch errors: our internet connection is down
  • URL fetch errors: the website times out
  • URL fetch errors: too many redirects
  • URL fetch errors: our IP is blocked
  • URL fetch errors: a CAPTCHA challenge is issued
  • Scraping errors: the website changed its page pattern
  • Scraping errors: some data pieces are missing
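Two of the failures in the list above, an IP block and a CAPTCHA challenge, still come back as an ordinary HTTP response, so they have to be recognised from the response itself. The sketch below is one possible way to label such responses, not code from this article: the helper name `classify_response`, the choice of 403 and 429 as "blocked" status codes, and the "captcha" keyword check are all assumptions to tune for the site being scraped.

```python
def classify_response(status_code, body):
    """Return a short label describing a fetched page (hypothetical helper)."""
    text = body.lower()
    # Assumption: a CAPTCHA page mentions the word "captcha" somewhere in its HTML
    if "captcha" in text:
        return "captcha challenge"
    # Assumption: 403 and 429 are the codes the site uses when it blocks an IP
    if status_code in (403, 429):
        return "ip blocked"
    if status_code >= 500:
        return "website down"
    if status_code != 200:
        return "unexpected status " + str(status_code)
    return "ok"


if __name__ == "__main__":
    print(classify_response(200, "<html>listings</html>"))
    print(classify_response(403, "Access denied"))
    print(classify_response(200, "<div>Please solve this CAPTCHA</div>"))
```

Any label other than "ok" could then be handed to an alerting function such as the `monitor` function defined later in this article.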
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url = 'https://www.airbnb.co.in/s/New-York--NY--United-States/homes?query=New York, NY, United States&checkin=2020-03-12&checkout=2020-03-19&adults=4&children=1&infants=0&guests=5&place_id=ChIJOwg_06VPwokRYv534QaPC8g&refinement_paths[]=/for_you&toddlers=0&source=mc_search_bar&search_type=unknown'
response=requests.get(url,headers=headers)
soup=BeautifulSoup(response.content,'lxml')
# Each search result on the page is marked up as an itemListElement
for item in soup.select('[itemprop=itemListElement]'):
    try:
        print('----------------------------------------')
        print(item.select('a')[0]['aria-label'])
        print(item.select('a')[0]['href'])
        print(item.select('._krjbj')[0].get_text())
        print(item.select('._krjbj')[1].get_text())
        print(item.select('._16shi2n')[0].get_text())
        print(item.select('._zkkcbwd')[0].get_text())
        print('----------------------------------------')
    except Exception as e:
        # A selector matched nothing for this listing, so skip it
        #raise e
        print('')
import smtplib
import sys
from datetime import datetime

errorCount = 0

def monitor(eventType, email):
    # This code saves the errors to a log file and also sends an email alerting the developers of a failure point
    # You can write code that logs this to a server as well
    global errorCount
    errorCount = errorCount + 1
    now = datetime.now()
    dt_string = now.strftime("%d/%m/%Y %H:%M:%S")
    # Save to log
    with open("monitor.txt", "a") as myfile:
        myfile.write("[" + dt_string + "] " + "Scraper Error: " + eventType + "\r\n")
    # Now send an email
    s = smtplib.SMTP('smtp.gmail.com', 587)
    s.starttls()
    s.login("mohan@proxiesapi.com", "mypassword")
    message = "Scraper Error: " + eventType
    s.sendmail("mohan@proxiesapi.com", email, message)
    s.quit()
    # Too many errors. Something is very wrong. Abort the script
    if errorCount > 3:
        with open("monitor.txt", "a") as myfile:
            myfile.write("[" + dt_string + "] " + "Full abort after " + str(errorCount) + " errors\r\n")
        sys.exit()
  • The code maintains a running count of errors. This is an overall trigger to abort if the total number of errors in a script goes beyond an acceptable limit.
  • It records a timestamp and a description of the failure with each entry (you can extend it to capture other payloads, such as the URL).
  • It appends everything to a log file.
  • It also sends an email to the developer so they can act on it immediately.
  • You can add database logging to it very easily.
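The database logging mentioned in the last point could look something like the sketch below. It is only an illustration under stated assumptions: the helper name `log_event_to_db`, the table name `scraper_errors`, and the choice of SQLite are not part of the article's code, and the timestamp simply reuses the format from the monitoring function.

```python
import sqlite3
from datetime import datetime


def log_event_to_db(conn, event_type):
    """Store one scraper error with a timestamp (hypothetical helper)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scraper_errors (logged_at TEXT, event TEXT)"
    )
    logged_at = datetime.now().strftime("%d/%m/%Y %H:%M:%S")
    # Parameter placeholders keep the event text from being read as SQL
    conn.execute(
        "INSERT INTO scraper_errors (logged_at, event) VALUES (?, ?)",
        (logged_at, event_type),
    )
    conn.commit()


if __name__ == "__main__":
    # An in-memory database keeps the example self-contained;
    # pass a file path such as "monitor.db" to keep the log between runs
    connection = sqlite3.connect(":memory:")
    log_event_to_db(connection, "Airbnb timed out")
    rows = connection.execute("SELECT event FROM scraper_errors").fetchall()
    print(rows)
    connection.close()
```

A call to such a helper could sit next to the log-file write inside `monitor`, so that every error lands in both places.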
try:
    # requests only raises Timeout when a timeout is set; 10 seconds is an arbitrary choice
    response = requests.get(url, headers=headers, timeout=10)
except requests.exceptions.Timeout:
    monitor('Airbnb timed out', 'xxx@gmail.com')
    sys.exit(1)  # there is no response to parse, so stop here
except requests.exceptions.TooManyRedirects:
    monitor('Too many redirects Airbnb', 'xxx@gmail.com')
    sys.exit(1)
except requests.exceptions.RequestException as e:
    monitor('Catastrophic error requesting Airbnb', 'xxx@gmail.com')
    print(e)
    sys.exit(1)
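# A very short body usually means an error page rather than real listings
if len(response.content) < 200:
    monitor('Airbnb returned unusual results', 'xxx@gmail.com')

# No listing elements at all suggests the page markup has changed
if len(soup.select('[itemprop=itemListElement]')) < 1:
    monitor('Airbnb pattern changed. Cant fetch anything', 'xxx@gmail.com')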
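The checks above cover an unusually short response and a page with no listings at all, but not the "some data pieces are missing" case from the list of failure types. One way to detect that is to run each field extraction separately and collect the names of the fields that fail. The sketch below assumes a hypothetical helper `extract_fields` and uses a plain dictionary in place of a parsed listing so that it runs on its own; in the scraper, the lambdas would wrap calls such as `item.select('a')[0]['href']`.

```python
def extract_fields(item, extractors):
    """Run each extractor on item; return (values, names of missing fields)."""
    values = {}
    missing = []
    for name, extractor in extractors.items():
        try:
            values[name] = extractor(item)
        except (IndexError, KeyError, AttributeError, TypeError):
            # The selector matched nothing or the attribute was absent
            missing.append(name)
    return values, missing


if __name__ == "__main__":
    # A plain dict stands in for a parsed listing to keep the example runnable
    listing = {"title": "Loft in Brooklyn", "links": []}
    extractors = {
        "title": lambda item: item["title"],
        "href": lambda item: item["links"][0],  # fails: the list is empty
        "price": lambda item: item["price"],    # fails: the key is absent
    }
    values, missing = extract_fields(listing, extractors)
    print(values)
    print(missing)
```

A non-empty `missing` list could then be reported through `monitor`, which keeps a partial layout change from going unnoticed.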
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import smtplib
from datetime import datetime
import sys
errorCount = 0

def monitor(eventType, email):
    # This code saves the errors to a log file and also sends an email alerting the developers of a failure point
    # You can write code that logs this to a server as well
    global errorCount
    errorCount = errorCount + 1
    now = datetime.now()
    dt_string = now.strftime("%d/%m/%Y %H:%M:%S")
    # Save to log
    with open("monitor.txt", "a") as myfile:
        myfile.write("[" + dt_string + "] " + "Scraper Error: " + eventType + "\r\n")
    # Now send an email
    s = smtplib.SMTP('smtp.gmail.com', 587)
    s.starttls()
    s.login("mohan@proxiesapi.com", "mypassword")
    message = "Scraper Error: " + eventType
    s.sendmail("mohan@proxiesapi.com", email, message)
    s.quit()
    # Too many errors. Something is very wrong. Abort the script
    if errorCount > 3:
        with open("monitor.txt", "a") as myfile:
            myfile.write("[" + dt_string + "] " + "Full abort after " + str(errorCount) + " errors\r\n")
        sys.exit()

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url = 'https://www.airbnb.co.in/s/New-York--NY--United-States/homes?query=New York, NY, United States&checkin=2020-03-12&checkout=2020-03-19&adults=4&children=1&infants=0&guests=5&place_id=ChIJOwg_06VPwokRYv534QaPC8g&refinement_paths[]=/for_you&toddlers=0&source=mc_search_bar&search_type=unknown'

try:
    # requests only raises Timeout when a timeout is set; 10 seconds is an arbitrary choice
    response = requests.get(url, headers=headers, timeout=10)
except requests.exceptions.Timeout:
    monitor('Airbnb timed out', 'xxx@gmail.com')
    sys.exit(1)  # there is no response to parse, so stop here
except requests.exceptions.TooManyRedirects:
    monitor('Too many redirects Airbnb', 'xxx@gmail.com')
    sys.exit(1)
except requests.exceptions.RequestException as e:
    monitor('Catastrophic error requesting Airbnb', 'xxx@gmail.com')
    print(e)
    sys.exit(1)

# A very short body usually means an error page rather than real listings
if len(response.content) < 200:
    monitor('Airbnb returned unusual results', 'xxx@gmail.com')

soup = BeautifulSoup(response.content, 'lxml')

# No listing elements at all suggests the page markup has changed
if len(soup.select('[itemprop=itemListElement]')) < 1:
    monitor('Airbnb pattern changed. Cant fetch anything', 'xxx@gmail.com')

# Each search result on the page is marked up as an itemListElement
for item in soup.select('[itemprop=itemListElement]'):
    try:
        print('----------------------------------------')
        print(item.select('a')[0]['aria-label'])
        print(item.select('a')[0]['href'])
        print(item.select('._krjbj')[0].get_text())
        print(item.select('._krjbj')[1].get_text())
        print(item.select('._16shi2n')[0].get_text())
        print(item.select('._zkkcbwd')[0].get_text())
        print('----------------------------------------')
    except Exception as e:
        # A selector matched nothing for this listing, so skip it
        #raise e
        print('')
  • With millions of high-speed rotating proxies located all over the world
  • With our automatic IP rotation
  • With our automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA-solving technology
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
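
The same request can be prepared from Python. The sketch below only builds the request URL in the shape shown in the curl example; the helper name `build_proxy_url` is hypothetical, and it assumes the API accepts a percent-encoded `url` parameter, which is the usual convention for query strings but is not confirmed by this article.

```python
from urllib.parse import urlencode


def build_proxy_url(api_key, target_url):
    """Build the request URL in the same shape as the curl example."""
    # urlencode percent-encodes the target URL so its own query string survives
    query = urlencode({"key": api_key, "url": target_url})
    return "http://api.proxiesapi.com/?" + query


if __name__ == "__main__":
    print(build_proxy_url("API_KEY", "https://example.com"))
```

The resulting string could be passed to `requests.get` in place of the direct Airbnb URL, leaving the rest of the scraper and its monitoring unchanged.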

Founder @ ProxiesAPI.com

Mohan Ganesan
