Check for URL in a String - Python
Last Updated: 12 Apr, 2025
We are given a string that may contain one or more URLs and our task is to extract them efficiently. This is useful for web scraping, text processing, and data validation. For example:
Input:
s = "My Profile: https://auth.geeksforgeeks.org/user/Prajjwal%20/articles in the portal of https://www.geeksforgeeks.org/"
Output:
['https://auth.geeksforgeeks.org/user/Prajjwal%20/articles', 'https://www.geeksforgeeks.org/']
Using re.findall()
Python's regular expressions (regex) module, re, lets us extract patterns such as URLs from text. The re.findall() function finds all non-overlapping occurrences of a pattern in a given string and returns them as a list.
Python
import re
s = 'My Profile: https://auth.geeksforgeeks.org/user/Rayyyy%20/articles in the portal of https://www.geeksforgeeks.org/'
pattern = r'https?://\S+|www\.\S+'
print("URLs:", re.findall(pattern, s))
Output:
URLs: ['https://auth.geeksforgeeks.org/user/Rayyyy%20/articles', 'https://www.geeksforgeeks.org/']
Explanation:
- r'https?://\S+|www\.\S+' is a regex pattern to match URLs starting with http://, https://, or www.
- findall() extracts all matches in a list.
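One caveat worth noting: because \S+ matches every non-whitespace character, trailing sentence punctuation can stick to a matched URL. A minimal sketch (using a made-up sample sentence, not the article's example) that strips common trailing punctuation after matching:

```python
import re

s = "Visit https://www.geeksforgeeks.org/, then http://example.com."

# \S+ is greedy, so the trailing comma and period become part of
# each match; rstrip() removes common sentence punctuation afterwards.
pattern = r'https?://\S+|www\.\S+'
urls = [u.rstrip('.,;:!?') for u in re.findall(pattern, s)]
print(urls)  # ['https://www.geeksforgeeks.org/', 'http://example.com']
```

Without the rstrip() step, the first match would be 'https://www.geeksforgeeks.org/,' with the comma attached.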
Using urlparse()
The urlparse() function from Python's urllib.parse module breaks a URL into its key parts, such as the scheme (http, https), domain name, path, query parameters, and fragment. It is useful for validating and extracting URLs from text by checking whether a word follows a proper URL structure.
Python
from urllib.parse import urlparse
s = 'My Profile: https://auth.geeksforgeeks.org/user/Rayyyy%20/articles in the portal of https://www.geeksforgeeks.org/'
# Split the string into words
split_s = s.split()
# Empty list to collect URLs
urls = []
for word in split_s:
    parsed = urlparse(word)
    if parsed.scheme and parsed.netloc:
        urls.append(word)
print("URLs:", urls)
Output:
URLs: ['https://auth.geeksforgeeks.org/user/Rayyyy%20/articles', 'https://www.geeksforgeeks.org/']
Explanation:
- s.split() splits the string into words.
- urlparse(word) parses each word; a word counts as a URL only if it has both a scheme (http/https) and a network location (domain).
- Matching words are added to the urls list with append().
Using urlextract()
urlextract is a third-party library, so we first need to install it by running "pip install urlextract" in the terminal. It offers a pre-built solution for finding URLs in text: its URLExtract class identifies URLs without requiring custom patterns, making it a convenient choice even for tricky extraction cases.
Python
from urlextract import URLExtract
s = 'My Profile: https://auth.geeksforgeeks.org/user/Prajjwal%20/articles in the portal of https://www.geeksforgeeks.org/'
extractor = URLExtract()
urls = extractor.find_urls(s)
print("URLs:", urls)
Output:
URLs: ['https://auth.geeksforgeeks.org/user/Prajjwal%20/articles', 'https://www.geeksforgeeks.org/']
Explanation:
- import URLExtract from the urlextract library.
- URLExtract() creates an extractor object to scan the string.
- find_urls() detects all URLs in s and returns them as a list; no manual splitting or validation is needed.
Using startswith()
One simple approach is to split the string into words with .split() and check whether each word starts with "http://" or "https://" using the built-in .startswith() method. If it does, we add it to our list of extracted URLs.
Python
s = 'My Profile: https://auth.geeksforgeeks.org/user/Rayyyy%20/articles in the portal of https://www.geeksforgeeks.org/'
x = s.split()
# Empty list to extract the URL
res=[]
for i in x:
    if i.startswith("https:") or i.startswith("http:"):
        res.append(i)
print("Urls: ", res)
Output:
Urls: ['https://auth.geeksforgeeks.org/user/Rayyyy%20/articles', 'https://www.geeksforgeeks.org/']
Explanation:
- s.split() splits the string into words.
- we then check whether each word starts with "https:" or "http:" using an "if" statement.
- if it does, we add it to the list of URLs using the .append() method.
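As a side note, startswith() also accepts a tuple of prefixes, so the two checks can be collapsed into one call and the whole loop into a list comprehension. A minimal sketch of that variant:

```python
s = 'My Profile: https://auth.geeksforgeeks.org/user/Rayyyy%20/articles in the portal of https://www.geeksforgeeks.org/'

# startswith() can take a tuple of prefixes, checking both
# schemes in a single call.
urls = [w for w in s.split() if w.startswith(("http://", "https://"))]
print(urls)
```

This prints the same two URLs as the loop-based version above.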
Using find() method
find() is a built-in string method that returns the index of the first occurrence of a substring, or -1 if the substring is not found. A return value of 0 means the word begins with that substring, so we can use it to identify and extract URLs from a string. Here's how:
Python
s = 'My Profile: https://auth.geeksforgeeks.org/user/Rayyyy%20/articles in the portal of https://www.geeksforgeeks.org/'
split_s = s.split()
res=[]
for i in split_s:
    if i.find("https:") == 0 or i.find("http:") == 0:
        res.append(i)
print("Urls: ", res)
Output:
Urls: ['https://auth.geeksforgeeks.org/user/Rayyyy%20/articles', 'https://www.geeksforgeeks.org/']
Explanation:
- s.split() splits the string into words.
- i.find() identifies words that begin with a URL scheme (an index of 0 means the word starts with "https:" or "http:").
- matching words are added to the list res using .append().
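The key detail here is find()'s return value, which the comparison `== 0` relies on. A tiny sketch illustrating both outcomes:

```python
word = 'https://www.geeksforgeeks.org/'

# find() returns the index of the first occurrence of the
# substring, or -1 when it is absent; an index of 0 means
# the string begins with that substring.
print(word.find('https:'))      # 0  -> word starts with the scheme
print('portal'.find('https:'))  # -1 -> substring not present
```

This is why the condition is written as `i.find("https:") == 0` rather than just truth-testing the result: a truthy check would wrongly accept -1.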