Extract hyperlinks from PDF in Python
Last Updated: 16 Oct, 2021
Prerequisite: PyPDF2, Regex
In this article, we are going to extract hyperlinks from a PDF file in Python. This can be done in a couple of ways:
Method 1: Using PyPDF2.
PyPDF2 is a Python library built as a PDF toolkit. It is capable of extracting document information, text, and more.
Approach:
- Read the PDF file and convert it into text
- Get the URLs from the text using a regular expression
Let's implement this step by step:
Step 1: Open and Read the PDF file.
Python3
import PyPDF2

# Name of the PDF file to read (placeholder)
file = "Enter PDF File Name"

# Open the PDF file in binary mode and create a reader object
pdfFileObject = open(file, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)

# Read every page and print its text
for page_number in range(pdfReader.numPages):
    pageObject = pdfReader.getPage(page_number)
    pdf_text = pageObject.extractText()
    print(pdf_text)

# Close the PDF file
pdfFileObject.close()
Output: the extracted text of each page is printed to the console.
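Note: PyPDF2 2.0 and later renamed this API (PdfFileReader became PdfReader, numPages/getPage() became the pages list, and extractText() became extract_text()). A minimal sketch of the same step using the newer names, assuming PyPDF2 >= 2.0 and the same placeholder file name:
Python3
# Same step with the renamed PyPDF2 >= 2.0 API
from PyPDF2 import PdfReader

reader = PdfReader("Enter PDF File Name")

# reader.pages replaces numPages/getPage()
for page in reader.pages:
    # extract_text() replaces extractText()
    print(page.extract_text())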
Step 2: Use a regular expression to find URLs in the extracted text.
Python3
# Import modules
import PyPDF2
import re

# Name of the PDF file to read (placeholder)
file = "Enter PDF File Name"

# Open the PDF file and create a reader object
pdfFileObject = open(file, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)

# Regular expression to get URLs from a string
def Find(string):
    # findall() with a simple pattern that matches http/https URLs
    regex = r"(https?://\S+)"
    url = re.findall(regex, string)
    return [x for x in url]

# Iterate through all pages
for page_number in range(pdfReader.numPages):
    pageObject = pdfReader.getPage(page_number)

    # Extract text from the page
    pdf_text = pageObject.extractText()

    # Print all URLs found on the page
    print(Find(pdf_text))

# Close the PDF
pdfFileObject.close()
Output:
['https://docs.python.org/', 'https://pythonhosted.org/PyPDF2/', 'https://www.geeksforgeeks.org/']
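The regular expression above only finds URLs that appear as visible text on the page. Hyperlinks stored as clickable link annotations (with no URL written in the text) can instead be read from each page's /Annots entry. Below is a minimal sketch using the same PyPDF2 1.x API as above; the file name is a placeholder and pages without annotations are skipped:
Python3
import PyPDF2

# Open the PDF (placeholder name) and create a reader object
pdfFileObject = open("Enter PDF File Name", 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)

for page_number in range(pdfReader.numPages):
    pageObject = pdfReader.getPage(page_number)

    # Skip pages that carry no annotations
    if '/Annots' not in pageObject:
        continue

    for annot in pageObject['/Annots']:
        annot_object = annot.getObject()
        # Link annotations store the target URL under /A -> /URI
        if '/A' in annot_object and '/URI' in annot_object['/A']:
            print(annot_object['/A']['/URI'])

# Close the PDF
pdfFileObject.close()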
Method 2: Using pdfx.
In this method, we will use the pdfx module. pdfx is used to extract URLs, metadata, and plain text from a given PDF file or PDF URL; it can extract both the references and the metadata of the document. Install it with:
pip install pdfx
Below is the implementation:
Python3
# Import module
import pdfx

# Read the PDF file (placeholder name)
pdf = pdfx.PDFx("File Name")

# Get a dictionary of references (URLs) found in the PDF
print(pdf.get_references_as_dict())
Output:
{'url': ['https://www.geeksforgeeks.org/',
'https://docs.python.org/',
'https://pythonhosted.org/PyPDF2/',
'GeeksforGeeks.org']}
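pdfx also accepts a PDF URL instead of a local file and exposes the document's metadata and plain text as well. A short sketch assuming the same pdfx API (the URL below is a placeholder):
Python3
import pdfx

# pdfx can be pointed at a PDF URL instead of a local file (placeholder URL)
pdf = pdfx.PDFx("https://example.com/sample.pdf")

# Metadata and plain text of the document
print(pdf.get_metadata())
print(pdf.get_text())

# References grouped by type (e.g. 'url')
print(pdf.get_references_as_dict())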