New To Web Scraping? Here’s What You Need To Know
--
Many developers think that web scraping is just about the coding part. In our experience at Proxies API, coding is the smaller, more accessible part. Much of the work is about dealing with the vagaries of the internet, building in fail-safes, and scaling. Let’s go through some of those in this article. But first…
First, let’s dive into the code.
Here is how you extract some data from a website.
Let’s choose a simple one: the New York Times.
This code, written in Python with Beautiful Soup, gets you the headlines, links, and summaries of today’s stories.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

# Present a browser-like User-Agent so the request looks human
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}

url = 'https://www.nytimes.com/'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')

# Each homepage story sits inside an element with the class 'assetWrapper'
for item in soup.select('.assetWrapper'):
    try:
        print('----------------------------------------')
        headline = item.find('h2').get_text()
        link = item.find('a')['href']
        summary = item.find('p').get_text()
        print(headline)
        print(link)
        print(summary)
    except Exception:
        # Some wrappers lack a headline, link, or summary; skip them
        print('')
Here is what is going on in the code.
We import the Beautiful Soup library, which lets us use CSS selectors to point at the data we want in the HTML. The requests library fetches the page for us.
The code…
for item in soup.select('.assetWrapper'):
…selects all the HTML elements with the class name assetWrapper, which nytimes.com uses to wrap each story.
Then it pulls out the individual pieces of info, the headline, link, and summary, with this:
headline = item.find('h2').get_text()
link = item.find('a')['href']
summary = item.find('p').get_text()
Then you run it like this:
python3 scrapeNYT.py
and the headline, link, and summary of each story print to your terminal.
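The exact stories change every day, but each one comes out in this shape (placeholder values, not real output):

----------------------------------------
An Example Headline Goes Here
https://www.nytimes.com/2020/01/01/us/example-story.html
A one-sentence summary of the story appears here.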
So that’s just the code part. Start there and explore. But you need to know more than code, so continue reading.
You need to know about some frameworks
In addition to this, get familiar with some frameworks, like Nutch, Goutte, or Scrapy, depending on the language you are comfortable in.
If you want to scrape websites that load content with JavaScript (AJAX), learn Selenium or Puppeteer.
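As a quick taste, here is a minimal Selenium sketch in Python. It assumes you have the selenium package installed and a Chrome driver available; it simply lets a real browser render the page before handing the final HTML to Beautiful Soup as before:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # recent Selenium versions can fetch the driver for you
driver.get('https://www.nytimes.com/')

# page_source holds the DOM *after* JavaScript has run
soup = BeautifulSoup(driver.page_source, 'lxml')
for item in soup.select('.assetWrapper'):
    print(item.get_text(strip=True))

driver.quit()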
Frameworks make your job much easier. Custom code like the example above is great for learning and for quick one-off scrapes, but if you use custom code for a large scraper project, you will never be able to scale reliably.
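For comparison, here is a minimal sketch of the same extraction as a Scrapy spider (the spider name and file name are made up for illustration):

import scrapy

class NYTSpider(scrapy.Spider):
    name = 'nyt'
    start_urls = ['https://www.nytimes.com/']

    def parse(self, response):
        # Same CSS class as the Beautiful Soup version above
        for item in response.css('.assetWrapper'):
            yield {
                'headline': item.css('h2::text').get(),
                'link': item.css('a::attr(href)').get(),
                'summary': item.css('p::text').get(),
            }

Save it as nyt_spider.py and run scrapy runspider nyt_spider.py -o stories.json to get the results as JSON, with retries, throttling, and concurrency handled by the framework.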
Pretend to be Human
Learn how to pretend to be human as a crawler. It’s a big part of not getting blocked. Read up on spoofing User-Agent strings, passing cookies, rate limiting, CAPTCHA solving, and so on.
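Here is a minimal sketch of two of those ideas, rotating User-Agent strings and rate limiting, using plain requests (the User-Agent values and delay range are just illustrative):

import random
import time
import requests

# A small pool of browser User-Agent strings (sample values)
USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

session = requests.Session()  # a Session carries cookies across requests

def polite_get(url):
    # Pick a different User-Agent each time and pause like a human would
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    response = session.get(url, headers=headers)
    time.sleep(random.uniform(2, 5))
    return response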
Learn the scraping part
Learn about XPath and CSS selectors. You will use them constantly to point at the data you want.
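The two are often interchangeable. Here is a small sketch using lxml, which supports both (the HTML snippet is made up, and the CSS path requires the cssselect package):

from lxml import html

snippet = '<div class="assetWrapper"><h2>Some headline</h2></div>'
tree = html.fromstring(snippet)

# CSS selector
print(tree.cssselect('.assetWrapper h2')[0].text)

# The equivalent XPath expression
print(tree.xpath('//div[@class="assetWrapper"]/h2/text()')[0])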
Other things
Some common turns in the road you will come across are:
Navigating multiple pages in a paginated system (see the sketch after this list).
Handling pages with infinite scrolling.
Logging in to websites with a tool like Puppeteer.
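Here is a minimal pagination sketch, assuming a site that pages with a ?page=N query parameter (the URL pattern and CSS class here are hypothetical):

import requests
from bs4 import BeautifulSoup

base_url = 'https://example.com/stories?page={}'

for page in range(1, 6):  # walk the first five pages
    response = requests.get(base_url.format(page))
    soup = BeautifulSoup(response.content, 'lxml')
    for item in soup.select('.story'):
        print(item.get_text(strip=True))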
Advanced Stuff
Scaling the crawling to millions of URLs
For a more advanced understanding, learn how to scale your web crawlers.
Learn about asynchronous, concurrent connections using a framework like Scrapy, about monitoring the progress of your spiders with signals, and about exporting data with pipelines.
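As a sketch of what that looks like in Scrapy, concurrency is mostly a settings change, and a pipeline is just a class with a process_item method (the numbers below are illustrative, not recommendations):

import json

# settings.py: Scrapy downloads asynchronously and concurrently for you
CONCURRENT_REQUESTS = 32      # requests in flight at once
DOWNLOAD_DELAY = 0.25         # base delay between requests
AUTOTHROTTLE_ENABLED = True   # back off automatically if the site slows down

# pipelines.py: every scraped item flows through this
class JsonLinesPipeline:
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()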
Overcome IP Blocks
Also, learn about IP blocks and how to overcome them using some sort of rotating proxy server like Proxies API. (Full disclosure: I am the founder of Proxies API.)
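With plain requests, routing traffic through a rotating proxy is usually just the proxies parameter. The endpoint and credentials below are placeholders, not the actual Proxies API format:

import requests

# Placeholder rotating-proxy endpoint; substitute your provider's real
# host, port, and credentials
proxies = {
    'http': 'http://username:password@proxy.example.com:8080',
    'https': 'http://username:password@proxy.example.com:8080',
}

# Each request can exit from a different IP address
response = requests.get('https://www.nytimes.com/', proxies=proxies)
print(response.status_code)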