How to Scrape Wikipedia using Python Scrapy

Mohan Ganesan
4 min read · Sep 6, 2020

Scrapy is one of the most accessible tools you can use to scrape, and also spider, a website with very little effort.

Today, let's see how we can scrape Wikipedia data for any topic.

Here is the URL we are going to scrape: https://en.wikipedia.org/wiki/List_of_common_misconceptions, which provides a list of common misconceptions in life!

First, install Scrapy if you haven't already.

pip install scrapy
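
If the install went through, the scrapy command-line tool should now be available. You can quickly check it like this (the exact version printed will depend on your setup):

scrapy version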

Once installed, go ahead and create a project by invoking the startproject command.

scrapy startproject scrapingproject

This will output something like this.

New Scrapy project 'scrapingproject', using template directory '/Library/Python/2.7/site-packages/scrapy/templates/project', created in:
/Applications/MAMP/htdocs/scrapy_examples/scrapingproject
You can start your first spider with:
cd scrapingproject
scrapy genspider example example.com

And create a folder structure like this.
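
The generated project typically looks roughly like this (the exact files can vary a little between Scrapy versions):

scrapingproject/
    scrapy.cfg
    scrapingproject/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py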

Now cd into scrapingproject. Because startproject creates a nested directory with the same name, you will need to do it twice, like this.

cd scrapingproject
cd scrapingproject

Now we need a spider to crawl through the Wikipedia page, so we use the genspider command to tell Scrapy to create one for us. We call the spider ourfirstbot and pass it the URL of the Wikipedia page.

scrapy genspider ourfirstbot https://en.wikipedia.org/wiki/List_of_common_misconceptions

This should run successfully, with output like this.

Created spider 'ourfirstbot' using template 'basic' in module:
scrapingproject.spiders.ourfirstbot

Great. Now open the file ourfirstbot.py in the spiders folder. It should look like this.

# -*- coding: utf-8 -*-
import scrapy


class OurfirstbotSpider(scrapy.Spider):
    name = 'ourfirstbot'
    start_urls = ['https://en.wikipedia.org/wiki/List_of_common_misconceptions']

    def parse(self, response):
        pass

Let’s examine this code before we proceed.

The allowed_domains array restricts all further crawling to the domains specified there.

start_urls is the list of URLs to crawl. For us, in this example, we only need one URL.

The def parse(self, response): function is called by scrapy after every successful URL crawl. Here is where we can write our code to extract the data we want.
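
As a quick illustration (this is not part of the generated file), replacing the pass with something like the snippet below would pull the page title and yield it as an item:

    def parse(self, response):
        # response holds the downloaded page; .css() runs a CSS selector against it
        title = response.css('title::text').get()
        yield {'title': title}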

We now need to find the CSS selectors of the elements we want to extract. Open the URL https://en.wikipedia.org/wiki/List_of_common_misconceptions in Chrome, right-click on one of the section headings, and click Inspect. This will open the Chrome developer tools inspector.

You can see that the CSS class name of the heading element is mw-headline, so we are going to ask Scrapy to get the contents of this class, like this.

headings = response.css('.mw-headline').extract()
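
A handy way to experiment with selectors before editing the spider is the Scrapy shell. Something like this should print the heading texts (the ::text pseudo-selector drops the surrounding HTML tags):

scrapy shell 'https://en.wikipedia.org/wiki/List_of_common_misconceptions'
>>> response.css('.mw-headline::text').getall()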

We can also see that the content pieces themselves sit in bulleted lists (ul elements), so let's get those with the selector below.

datas = response.css('ul').extract()
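
Putting the two selectors together, a first working version of the spider might look something like this. BeautifulSoup is used only to strip the HTML tags from the extracted fragments, so install it with pip install beautifulsoup4 if you don't have it:

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup


class OurfirstbotSpider(scrapy.Spider):
    name = 'ourfirstbot'
    start_urls = [
        'https://en.wikipedia.org/wiki/List_of_common_misconceptions',
    ]

    def parse(self, response):
        # grab the section headings and the bulleted lists as raw HTML
        headings = response.css('.mw-headline').extract()
        datas = response.css('ul').extract()

        # pair them up, strip the HTML tags, and yield one item per pair
        for item in zip(headings, datas):
            all_items = {
                'headings': BeautifulSoup(item[0], 'html.parser').text,
                'datas': BeautifulSoup(item[1], 'html.parser').text,
            }
            yield all_items

You can run the spider from the project directory and write the results to a JSON file like this:

scrapy crawl ourfirstbot -o data.json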

If you want to run this in production and scale to thousands of links, you will soon find that Wikipedia starts blocking your IP. A rotating proxy service like Proxies API takes care of that:

  • With millions of high speed rotating proxies located all over the world
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

A simple API can access the whole thing like below in any programming language.

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.

Once you have an API_KEY from Proxies API, you just have to change your code to this.

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup


class OurfirstbotSpider(scrapy.Spider):
    name = 'ourfirstbot'
    start_urls = [
        'http://api.proxiesapi.com/?key=API_KEY&url=https://en.wikipedia.org/wiki/List_of_common_misconceptions',
    ]

    def parse(self, response):
        # grab the section headings and the bulleted lists as raw HTML
        headings = response.css('.mw-headline').extract()
        datas = response.css('ul').extract()

        # pair them up, strip the HTML tags, and yield one item per pair
        for item in zip(headings, datas):
            all_items = {
                'headings': BeautifulSoup(item[0], 'html.parser').text,
                'datas': BeautifulSoup(item[1], 'html.parser').text,
            }
            yield all_items

We have only changed one line, in the start_urls array, and that makes sure we never have to worry about IP rotation, user-agent-string rotation, or rate limits again.

The blog was originally posted at: https://www.proxiesapi.com/blog/how-to-scrape-wikipedia-using-python-scrapy.html.php
