Web Scraping using lxml and XPath in Python
Last Updated: 17 Oct, 2022
Prerequisites: Introduction to Web Scraping
In this article, we will discuss the lxml Python library for scraping data from a webpage. lxml is built on top of libxml2, an XML parsing library written in C, which gives it a performance advantage over other Python web scraping libraries such as BeautifulSoup and Selenium: reading and writing even large XML files takes very little time, making data processing easier and much faster.
We will be using the lxml library for Web Scraping and the requests library for making HTTP requests in Python. These can be installed in the command line using the pip package installer for Python.
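For example, both can be installed with a single command:
pip install lxml requests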
Getting data from an element on a webpage with lxml requires the use of XPath.
Using XPath
XPath works very much like a traditional file system.
Diagram of a File System
To access file 1,
C:/File1
Similarly, to access file 2,
C:/Documents/User1/File2
Now consider a simple web page,
HTML
<html>
  <head>
    <title>My page</title>
  </head>
  <body>
    <h2>Welcome to my page</h2>
    <a href="www.example.com">page</a>
    <p>This is the first paragraph</p>
    <h2>Hello World</h2>
  </body>
</html>
This can be represented as an XML Tree as follows,
XML Tree of the Webpage
For getting the text inside the <p> tag,
XPath: /html/body/p/text()
Result: This is the first paragraph
For getting the value of the href attribute in the anchor or <a> tag,
XPath: /html/body/a/@href
Result: www.example.com
For getting the value inside the second <h2> tag,
XPath: /html/body/h2[2]/text()
Result: Hello World
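The queries above can be verified directly with lxml. The short sketch below (a minimal illustration; the variable names are our own) parses the sample page from this section with html.fromstring and evaluates the three XPath expressions:
Python
from lxml import html

# The sample web page from above, as a string
doc = """
<html>
  <head>
    <title>My page</title>
  </head>
  <body>
    <h2>Welcome to my page</h2>
    <a href="www.example.com">page</a>
    <p>This is the first paragraph</p>
    <h2>Hello World</h2>
  </body>
</html>
"""

tree = html.fromstring(doc)

# Text inside the <p> tag
print(tree.xpath('/html/body/p/text()'))     # ['This is the first paragraph']

# Value of the href attribute on the <a> tag
print(tree.xpath('/html/body/a/@href'))      # ['www.example.com']

# Text inside the second <h2> tag
print(tree.xpath('/html/body/h2[2]/text()')) # ['Hello World']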
To find the XPath for a particular element on a page:
- Right-click the element on the page and select Inspect.
- In the Elements tab, right-click the highlighted element.
- Choose Copy > Copy XPath.
Using lxml
Step-by-step Approach
- Use requests.get to retrieve the web page containing our data.
- Parse the content with html.fromstring, which uses the lxml parser.
- Write the appropriate XPath query and pass it to the lxml xpath function to get the required elements.
Example 1:
Below is a program based on the above approach, which scrapes data from http://econpy.pythonanywhere.com/ex/001.html.
Python
# Import required modules
from lxml import html
import requests
# Request the page
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
# Parsing the page
# (We need to use page.content rather than
# page.text because html.fromstring implicitly
# expects bytes as input.)
tree = html.fromstring(page.content)
# Get element using XPath
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
print(buyers)
Output: a list of the buyer names scraped from the page.

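In real scrapes it also helps to confirm that the request succeeded before parsing, and to handle the case where the XPath matches nothing. Below is a hedged variation of Example 1 (same URL and XPath; the timeout value and the status/empty-result checks are our own additions, not part of the original example):
Python
# Import required modules
from lxml import html
import requests

# Request the page with a timeout and fail fast on HTTP errors (4xx/5xx)
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html', timeout=10)
page.raise_for_status()

# Parsing the page
tree = html.fromstring(page.content)

# xpath() returns an empty list when nothing matches
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
if buyers:
    print(buyers)
else:
    print('No buyer names found - the page structure may have changed')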
Example 2:
Another example, scraping product prices from the e-commerce test site https://webscraper.io/test-sites/e-commerce/allinone.
Python
# Import required modules
from lxml import html
import requests
# Request the page
page = requests.get('https://webscraper.io/test-sites/e-commerce/allinone')
# Parsing the page
tree = html.fromstring(page.content)
# Get element using XPath
prices = tree.xpath(
'//div[@class="col-sm-4 col-lg-4 col-md-4"]/div/div[1]/h4[1]/text()')
print(prices)
Output: a list of the product price strings scraped from the page.

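Text nodes returned by text() often carry leading or trailing whitespace. As a small follow-up to Example 2 (it reuses the prices list from the code above; the cleaning step is a suggestion rather than part of the original example), the values can be stripped before use:
Python
# Continues Example 2: strip surrounding whitespace from each scraped price string
cleaned_prices = [price.strip() for price in prices]
print(cleaned_prices)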