Web scraping from Wikipedia using Python

Last Updated : 13 Jan, 2026

Web scraping is the process of automatically extracting data from websites. It enables programmers to collect structured information from web pages without manual effort. In this article, we will demonstrate how to scrape data from the Wikipedia homepage using Python and discuss commonly used techniques and libraries for efficient web scraping.

Process of Web Scraping

Python Libraries Required

The following Python libraries are commonly used to fetch, parse, and automate web pages during the web scraping process:

  1. Requests: The foundation of web scraping in Python. It allows you to send HTTP requests such as GET and POST to download web pages quickly and reliably.
  2. lxml: A high-performance library for parsing HTML and XML, lxml is extremely fast and ideal for large-scale scraping tasks (a short parsing sketch follows this list).
  3. BeautifulSoup: A beginner-friendly HTML parser that creates a parse tree for easy data extraction. It works seamlessly with content fetched using Requests.
  4. Selenium: Automates a real browser, enabling scraping of dynamic, JavaScript-loaded websites. It is slower than other tools and less suitable for large-scale scraping.
  5. Scrapy: A full-featured, asynchronous web scraping framework built for speed and scalability. It is ideal for large crawling projects, providing pipelines and advanced data handling capabilities.
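
The later steps in this article use Requests together with BeautifulSoup. As a quick illustration of the lxml option mentioned above, the sketch below parses the same Wikipedia page with lxml.html; this is a minimal sketch assuming lxml is installed, and the queries used here are just examples.

Python
import requests
from lxml import html

# Download the Wikipedia main page and build an lxml element tree
page = requests.get("https://en.wikipedia.org/wiki/Main_Page", timeout=10)
tree = html.fromstring(page.content)

# Print the page <title> text and count the links as a quick sanity check
print(tree.findtext(".//title"))
print(len(tree.xpath("//a/@href")))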

Python Web Scraping on Wikipedia

The following diagram shows the basic workflow followed when scraping data from Wikipedia using Python:

Steps in Web Scraping

It explains how a web page is loaded, parsed, and processed to extract useful information before converting it into a structured form.

Step 1: Setting Up Python for Web Scraping

Before scraping a website, we must set up our Python environment.

Requirements:

  • A Python IDE (PyCharm, VS Code, Jupyter, etc.)
  • Basic understanding of Python programming
  • virtualenv, for creating isolated Python environments
  • Web scraping libraries: requests, urllib3, selenium

Using virtualenv, we can install packages inside an isolated environment without affecting global system installations.

Install Required Libraries

pip install virtualenv
python -m pip install selenium
python -m pip install requests
python -m pip install urllib3
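
Once virtualenv is installed, an isolated environment can be created and activated before running the remaining installs inside it. The commands below are a minimal sketch; the environment name scraping-env is just a placeholder, and the activation command differs on Windows.

virtualenv scraping-env
source scraping-env/bin/activate    # on Windows: scraping-env\Scripts\activate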

Step 2: Fetching Web Pages Using Requests

Requirements:

  • Python IDE
  • Python installed with the required modules
  • Requests library

This code sends a request to Wikipedia and downloads the HTML of the page.

Python
import requests
page = requests.get("https://en.wikipedia.org/wiki/Main_Page")

print(page.status_code)
print(page.content)

Output

Explanation:

  • requests.get(): sends an HTTP GET request to the website.
  • page.status_code: returns 200 if the page loaded successfully.
  • page.content: returns the raw HTML of the page as bytes.
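
In practice, it is also common to guard against failed requests and to identify the client with a User-Agent header. The sketch below shows one way to do this; the header string and timeout value are just example choices, not requirements of the Wikipedia page.

Python
import requests

# Identify the scraper and avoid hanging indefinitely on a slow response
headers = {"User-Agent": "example-scraper/0.1"}
page = requests.get("https://en.wikipedia.org/wiki/Main_Page",
                    headers=headers, timeout=10)

# Raise an exception for 4xx/5xx responses instead of continuing silently
page.raise_for_status()
print(page.status_code)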

Step 3: Parsing HTML Using BeautifulSoup

You can install BeautifulSoup using the command below:

pip install beautifulsoup4

This code converts the downloaded HTML into a structured format that Python can read.

Python
from bs4 import BeautifulSoup
import requests

page = requests.get("https://en.wikipedia.org/wiki/Main_Page")
soup = BeautifulSoup(page.content, 'html.parser')

print(soup.prettify()) 

Output

Explanation:

  • BeautifulSoup cannot fetch pages by itself; it only parses HTML content.
  • soup.prettify(): prints the HTML in a readable, structured format.
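
Once the HTML has been parsed, the soup object can be navigated like a tree. The sketch below shows a couple of common accessors; it is a minimal example and assumes the page contains <title> and <h1> tags.

Python
from bs4 import BeautifulSoup
import requests

page = requests.get("https://en.wikipedia.org/wiki/Main_Page")
soup = BeautifulSoup(page.content, 'html.parser')

# The <title> tag and its text
print(soup.title)
print(soup.title.string)

# The first level-1 heading, if any
first_h1 = soup.find('h1')
if first_h1 is not None:
    print(first_h1.get_text(strip=True))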

Step 4: Extracting Data with BeautifulSoup

This shows how to use BeautifulSoup to parse a webpage and extract all <p> tags, then retrieve the text content of the first paragraph.

Python
from bs4 import BeautifulSoup
import requests

page = requests.get("https://en.wikipedia.org/wiki/Main_Page")
soup = BeautifulSoup(page.content, 'html.parser')

print(soup.find_all('p'))
print("\n\n")
print(soup.find_all('p')[0].get_text())

Output

Explanation:

  • find_all('p'): returns a list of all <p> tags.
  • get_text(): extracts only text, removing HTML tags.
  • Use find() if you only need the first matching tag.
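
A closely related task is collecting links. The sketch below gathers the href attribute of every anchor tag on the page; it is a minimal example that simply prints the count and the first few links.

Python
from bs4 import BeautifulSoup
import requests

page = requests.get("https://en.wikipedia.org/wiki/Main_Page")
soup = BeautifulSoup(page.content, 'html.parser')

# Collect the href attribute of every <a> tag that has one
links = [a['href'] for a in soup.find_all('a', href=True)]
print(len(links))
print(links[:5])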

Step 5: Exploring Page Structure Using Chrome DevTools

Before scraping, analyzing the structure of the webpage is essential.

How to Inspect Elements:

  • Open Chrome
  • Press Ctrl+Shift+I (or right-click -> Inspect)
  • Use the Elements tab to see the HTML layout
  • Identify IDs, classes, and HTML tags that contain the required data.

This helps us locate the exact tag or class to extract.

Exploring Page Structure Using Chrome DevTools

This code extracts a specific section of the Wikipedia page using its id and class.

Python
from bs4 import BeautifulSoup
import requests

page = requests.get("https://en.wikipedia.org/wiki/Main_Page")
soup = BeautifulSoup(page.content, 'html.parser')

# The id and class below come from inspecting the page in DevTools;
# they are specific to Wikipedia's current layout and may change over time.
container = soup.find(id="mp-left")
items = container.find_all(class_="mp-h2")
result = items[0]
print(result.prettify())

Output

Explanation:

  • id="mp-left" identifies the parent section.
  • find_all(class_="mp-h2"): searches for all elements with a specific class.
  • prettify(): formats the HTML for easy reading.
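
The same elements can also be targeted with a CSS selector using soup.select(). The sketch below is an alternative to the find()/find_all() approach above; note that the #mp-left id and .mp-h2 class depend on Wikipedia's current layout and may change.

Python
from bs4 import BeautifulSoup
import requests

page = requests.get("https://en.wikipedia.org/wiki/Main_Page")
soup = BeautifulSoup(page.content, 'html.parser')

# CSS selector: elements with class "mp-h2" inside the element with id "mp-left"
for heading in soup.select("#mp-left .mp-h2"):
    print(heading.get_text(strip=True))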