How to parse a local HTML file in Python?
Last Updated :
16 Mar, 2021
Prerequisites: Beautifulsoup
Parsing means breaking a file or other input into pieces of information that can be stored and used later. When the data we need is in an HTML file already stored on our computer, it can be parsed directly from disk. This article covers several techniques for working with such a file using Beautiful Soup: formatting the parsed code, removing a tag, finding and printing tags, traversing the tree with the recursive child generator, reading the name and text of a tag, finding the children and descendants of a tag, and finding elements with find_all and CSS selectors. The first example also shows how to parse a page fetched from a link.
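The examples in this article read a local file named index.html. As a minimal sketch of that workflow, the code below writes a small stand-in file to a temporary folder and parses it from disk. The file contents, the helper name parse_local_html, and the use of the built-in 'html.parser' (instead of 'lxml') are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical stand-in for index.html (not the file used in the article)
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head><title>Sample page</title></head>
<body>
<h1>Heading</h1>
<p>First paragraph</p>
<ul>
<li>One</li>
<li>Two</li>
<li>Three</li>
</ul>
</body>
</html>
"""


def parse_local_html(path):
    """Read an HTML file from disk and return the parsed tree."""
    # The with statement closes the file as soon as it has been read
    with open(path, "r", encoding="utf-8") as html_file:
        return BeautifulSoup(html_file.read(), "html.parser")


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "index.html"
    sample_path.write_text(SAMPLE_HTML, encoding="utf-8")
    soup = parse_local_html(sample_path)

title_text = soup.title.string
item_count = len(soup.find_all("li"))
print(title_text, item_count)
```

The parsed tree stays usable after the file is closed, because the whole file has already been read into memory.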
Modifying the file
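The prettify method returns the parsed HTML as a neatly indented string, with each tag and each string on its own line, similar to the output of a code formatter in an editor such as VS Code. It does not change the parsed document itself. The example below fetches the page at https://festive-knuth-1279a2.netlify.app/ and prints its formatted code.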
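The same method can be tried without a network connection. The sketch below runs prettify on a short inline HTML string, which is made up for illustration, and uses the built-in 'html.parser' as an assumption so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line HTML used only for this sketch
html = "<html><body><h1>Title</h1><p>Text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Keep a copy of the unformatted markup for comparison
before = str(soup)

# prettify returns a new, indented string
pretty = soup.prettify()
print(pretty)

# The tree itself is unchanged by the call
unchanged = str(soup) == before
```

Because prettify only returns a string, the formatted code must be written back to a file explicitly if the file on disk should change.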
Example:
Python3
# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
# Importing the HTTP library
import requests as req
# Requesting for the website
Web = req.get('https://festive-knuth-1279a2.netlify.app/')
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(Web.text, 'lxml')
# Using the prettify method
print(S.prettify())
Output:


Removing a tag
A tag can be removed with the decompose method. The example below uses the select_one method with a CSS selector to pick the second li element from the index.html file, removes it with decompose, and then prints the remaining body with the prettify method.
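A minimal sketch of the removal step on its own, assuming a made-up inline list and the built-in 'html.parser'. It also guards against select_one returning None, which happens when nothing matches the selector.

```python
from bs4 import BeautifulSoup

# Hypothetical list used only for this sketch
html = "<ul><li>One</li><li>Two</li><li>Three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match, or None if nothing matches
second_item = soup.select_one("li:nth-of-type(2)")
if second_item is not None:
    # decompose removes the tag and its contents from the tree
    second_item.decompose()

remaining = [li.get_text() for li in soup.find_all("li")]
print(remaining)  # → ['One', 'Three']
```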
Example:
File Used:
Python3
# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
# Opening and reading the html file
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
# Using the select_one method to find the second element from the li tag
Tag = S.select_one('li:nth-of-type(2)')
# Using the decompose method
Tag.decompose()
# Using the prettify method to modify the code
print(S.body.prettify())
Output:


Finding tags
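A tag can be accessed by using its name as an attribute of the BeautifulSoup object, for example Parse.h1. This returns the first tag with that name, or None if the file has no such tag, and the result can be printed with print().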
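The sketch below illustrates both cases on a made-up inline document, again assuming the built-in 'html.parser': a tag name that occurs twice and a tag name that does not occur at all.

```python
from bs4 import BeautifulSoup

# Hypothetical document with two h1 tags and no h3 tag
html = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>First</h1><h1>Second</h1></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Attribute access returns only the first matching tag
first_h1 = soup.h1
print(first_h1)  # → <h1>First</h1>

# A missing tag gives None instead of raising an error
missing = soup.h3
print(missing)  # → None
```

Checking for None before using the result avoids an AttributeError when a tag is absent from the file.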
Example:
Python3
# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
# Opening and reading the html file
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()
# Creating a BeautifulSoup object and specifying the parser
Parse = BeautifulSoup(index, 'lxml')
# Printing html code of some tags
print(Parse.head)
print(Parse.h1)
print(Parse.h2)
print(Parse.h3)
print(Parse.li)
Output:

Traversing tags
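The recursiveChildGenerator method traverses the parsed file recursively, visiting every tag and every string nested inside other tags. In Beautiful Soup 4 it is a legacy name kept for backward compatibility, and the descendants attribute is its current equivalent. Strings have no tag name, so the example checks the name attribute and prints only the tags.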
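A minimal sketch of the same traversal written with the descendants attribute, on a made-up inline fragment parsed with the built-in 'html.parser' (both are assumptions for illustration). It also records how deeply each tag is nested, which is one way to see that the traversal is depth-first.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment used only for this sketch
html = "<div><p>Hello <b>world</b></p><span>!</span></div>"
soup = BeautifulSoup(html, "html.parser")

outline = []
for node in soup.descendants:
    # Skip strings and keep only real tags
    if isinstance(node, Tag):
        # Count the ancestors, excluding the BeautifulSoup object itself
        depth = len(list(node.parents)) - 1
        outline.append((node.name, depth))

for name, depth in outline:
    print("  " * depth + name)
```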
Example:
Python3
# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
# Opening and reading the html file
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
# Using the recursiveChildGenerator method to traverse the html file
for TraverseTags in S.recursiveChildGenerator():
    # Traversing the names of the tags
    if TraverseTags.name:
        # Printing the names of the tags
        print(TraverseTags.name)
Output:

Parsing name and text attributes of tags
The name attribute of a tag gives its name and the text attribute gives all the text inside it. The example below prints the HTML code, the name, and the text of the first ul tag from the file.
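The sketch below shows these attributes on a made-up inline list, assuming the built-in 'html.parser'. It also uses get_text with a separator, since the plain text attribute joins the text of nested tags with nothing in between.

```python
from bs4 import BeautifulSoup

# Hypothetical list used only for this sketch
html = "<ul><li>One</li><li>Two</li></ul>"
soup = BeautifulSoup(html, "html.parser")

ul = soup.ul
tag_name = ul.name
# text concatenates all nested strings directly
raw_text = ul.text
# get_text can insert a separator between the pieces
joined_text = ul.get_text(separator=", ", strip=True)

print(tag_name)     # → ul
print(raw_text)     # → OneTwo
print(joined_text)  # → One, Two
```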
Example:
Python3
# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
# Opening and reading the html file
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
# Printing the Code, name, and text of a tag
print(f'HTML: {S.ul}, name: {S.ul.name}, text: {S.ul.text}')
Output:

Finding Children of a tag
The children attribute is used to get the direct children of a tag. It yields not only the child tags but also the whitespace strings between them, and those strings have a name of None. The condition e.name is not None therefore keeps only the names of the tags from the file.
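To make the whitespace strings visible, the sketch below lists the children of a made-up inline ul tag before and after filtering. The HTML string and the built-in 'html.parser' are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical list with a newline between each tag
html = "<ul>\n<li>One</li>\n<li>Two</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")

# Every direct child, including the newline strings
all_names = [child.name for child in soup.ul.children]
print(all_names)  # → [None, 'li', None, 'li', None]

# Only the tags remain after filtering on name
tag_names = [child.name for child in soup.ul.children if child.name is not None]
print(tag_names)  # → ['li', 'li']
```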
Example:
Python3
# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
# Opening and reading the html file
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
# Providing the source
Attr = S.html
# Using the Children attribute to get the children of a tag
# Keeping only the tag names and skipping the whitespace strings
Attr_Tag = [e.name for e in Attr.children if e.name is not None]
# Printing the children
print(Attr_Tag)
Output:
Finding Children at all levels of a tag:
The descendants attribute is used to get all the descendants (children at all levels) of a tag from the file.
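The difference from the children attribute can be seen side by side. The sketch below compares the two on a made-up nested fragment, assuming the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

# Hypothetical nested fragment used only for this sketch
html = "<body><div><p>Text <a href='#'>link</a></p></div></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# children stops at the first level
child_tags = [e.name for e in body.children if e.name is not None]
print(child_tags)  # → ['div']

# descendants walks every level in document order
descendant_tags = [e.name for e in body.descendants if e.name is not None]
print(descendant_tags)  # → ['div', 'p', 'a']
```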
Example:
Python3
# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
# Opening and reading the html file
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
# Providing the source
Des = S.body
# Using the descendants attribute
Attr_Tag = [e.name for e in Des.descendants if e.name is not None]
# Printing the descendants
print(Attr_Tag)
Output:

Finding all elements of tags
Using find_all():
The find_all method returns every tag that matches the given name. The example below finds all the p tags in the file and prints the name and text of each one.
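A minimal sketch of find_all on a made-up inline fragment, assuming the built-in 'html.parser'. Besides the tag name, it shows two optional arguments of find_all: class_ to filter by CSS class and limit to cap the number of results.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for this sketch
html = "<p class='intro'>First</p><div>Skip</div><p>Second</p>"
soup = BeautifulSoup(html, "html.parser")

# Every p tag, in document order
all_texts = [p.get_text() for p in soup.find_all("p")]
print(all_texts)  # → ['First', 'Second']

# Only p tags with the CSS class "intro"
intro_texts = [p.get_text() for p in soup.find_all("p", class_="intro")]
print(intro_texts)  # → ['First']

# At most one result
limited = soup.find_all("p", limit=1)
print(len(limited))  # → 1
```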
Example:
Python3
# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
# Opening and reading the html file
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
# Using the find_all method to find all elements of a tag
for tag in S.find_all('p'):
    # Printing the name, and text of p tag
    print(f'{tag.name}: {tag.text}')
Output:
CSS selectors to find elements:
The select method takes a CSS selector and returns a list of all matching tags. The example below uses the selector li:nth-of-type(2) to find every li tag that is the second li inside its parent.
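Since select returns a list, a selector such as li:nth-of-type(2) can match more than one tag. The sketch below shows this on a made-up fragment with two lists, assuming the built-in 'html.parser', and contrasts it with select_one, which returns only the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two separate lists
html = "<ul><li>One</li><li>Two</li></ul><ol><li>A</li><li>B</li></ol>"
soup = BeautifulSoup(html, "html.parser")

# select returns every match: the second li of each list
matches = soup.select("li:nth-of-type(2)")
match_texts = [tag.get_text() for tag in matches]
print(match_texts)  # → ['Two', 'B']

# select_one returns only the first match
first_text = soup.select_one("li:nth-of-type(2)").get_text()
print(first_text)  # → Two
```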
Example:
Python3
# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
# Opening and reading the html file
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
# Using the select method
# Prints a list of every li tag that is the second li inside its parent
print(S.select('li:nth-of-type(2)'))
Output: