Extract hyperlinks from PDF in Python
Last Updated: 16 Oct, 2021
Prerequisite: PyPDF2, Regex
In this article, we are going to extract hyperlinks from a PDF file in Python. This can be done in a couple of ways:
Method 1: Using PyPDF2.
PyPDF2 is a Python library built as a PDF toolkit. It is capable of extracting document information, text, and more.
Approach:
- Read the PDF file and convert it into text
- Get the URLs from the text using a regular expression
Let's implement this step by step:
Step 1: Open and Read the PDF file.
Python3
import PyPDF2

# Name of the PDF file to read (placeholder)
file = "Enter PDF File Name"

# Open the PDF file in binary mode and create a reader object
pdfFileObject = open(file, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)

# Read every page and print its text
for page_number in range(pdfReader.numPages):
    pageObject = pdfReader.getPage(page_number)
    pdf_text = pageObject.extractText()
    print(pdf_text)

# Close the PDF file
pdfFileObject.close()
Output: the extracted text of each page is printed to the console.
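Note: PyPDF2 2.0 and later renamed this API (PdfFileReader became PdfReader, numPages/getPage() became the pages list, and extractText() became extract_text()). A minimal sketch of the same step using the newer names, assuming PyPDF2 >= 2.0 and the same placeholder file name:
Python3
# Same step with the renamed PyPDF2 >= 2.0 API
from PyPDF2 import PdfReader

reader = PdfReader("Enter PDF File Name")

# reader.pages replaces numPages/getPage()
for page in reader.pages:
    # extract_text() replaces extractText()
    print(page.extract_text())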
Step 2: Use a regular expression to find URLs in the extracted text.
Python3
# Import modules
import PyPDF2
import re

# Name of the PDF file to read (placeholder)
file = "Enter PDF File Name"

# Open the PDF file and create a reader object
pdfFileObject = open(file, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)

# Regular expression to get URLs from a string
def Find(string):
    # findall() with a simple pattern that matches http/https URLs
    regex = r"(https?://\S+)"
    url = re.findall(regex, string)
    return [x for x in url]

# Iterate through all pages
for page_number in range(pdfReader.numPages):
    pageObject = pdfReader.getPage(page_number)

    # Extract text from the page
    pdf_text = pageObject.extractText()

    # Print all URLs found on the page
    print(Find(pdf_text))

# Close the PDF
pdfFileObject.close()
Output:
['https://docs.python.org/', 'https://pythonhosted.org/PyPDF2/', 'https://www.geeksforgeeks.org/']
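The regular expression above only finds URLs that appear as visible text on the page. Hyperlinks stored as clickable link annotations (with no URL written in the text) can instead be read from each page's /Annots entry. Below is a minimal sketch using the same PyPDF2 1.x API as above; the file name is a placeholder and pages without annotations are skipped:
Python3
import PyPDF2

# Open the PDF (placeholder name) and create a reader object
pdfFileObject = open("Enter PDF File Name", 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)

for page_number in range(pdfReader.numPages):
    pageObject = pdfReader.getPage(page_number)

    # Skip pages that carry no annotations
    if '/Annots' not in pageObject:
        continue

    for annot in pageObject['/Annots']:
        annot_object = annot.getObject()
        # Link annotations store the target URL under /A -> /URI
        if '/A' in annot_object and '/URI' in annot_object['/A']:
            print(annot_object['/A']['/URI'])

# Close the PDF
pdfFileObject.close()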
Method 2: Using pdfx.
In this method, we will use the pdfx module. pdfx is used to extract URLs, metadata, and plain text from a given PDF file or PDF URL; it can extract both the references and the metadata of the document. Install it with:
pip install pdfx
Below is the implementation:
Python3
# Import module
import pdfx

# Read the PDF file (placeholder name)
pdf = pdfx.PDFx("File Name")

# Get a dictionary of references (URLs) found in the PDF
print(pdf.get_references_as_dict())
Output:
{'url': ['https://www.geeksforgeeks.org/',
'https://docs.python.org/',
'https://pythonhosted.org/PyPDF2/',
'GeeksforGeeks.org']}
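pdfx also accepts a PDF URL instead of a local file and exposes the document's metadata and plain text as well. A short sketch assuming the same pdfx API (the URL below is a placeholder):
Python3
import pdfx

# pdfx can be pointed at a PDF URL instead of a local file (placeholder URL)
pdf = pdfx.PDFx("https://example.com/sample.pdf")

# Metadata and plain text of the document
print(pdf.get_metadata())
print(pdf.get_text())

# References grouped by type (e.g. 'url')
print(pdf.get_references_as_dict())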