Ujeebu is a set of powerful APIs for web data scraping and automatic content extraction. This SDK provides an easy-to-use interface for interacting with the Ujeebu API. It is written in Python and uses the requests library to make HTTP requests.
You can install the SDK using pip:

```bash
pip install ujeebu-python
```

To use the SDK, you first need to create a client instance with your API credentials:
```python
from ujeebu_python import UjeebuClient
import json

ujeebu = UjeebuClient(api_key="__YOUR-API-KEY__")

url = "https://ujeebu.com/blog/scraping-javascript-heavy-pages-using-puppeteer/"
response = ujeebu.extract(url=url)

if response.status_code == 200:
    result = response.json()
    print(json.dumps(result['article'], indent=2))
else:
    print("Error:\n", json.dumps(response.json(), indent=2))
```

Each method returns the response object produced by the requests library, so the usual `status_code`, `json()`, and `content` accessors apply.

The SDK provides the following methods:
- `scrape(url, params, headers)`
  - `url`: the URL to scrape (required).
  - `params`: dict of Scrape API params (optional).
  - `headers`: dict of headers to forward (optional).
- `extract(url, params, headers)`
  - `url`: the URL to extract (required).
  - `params`: dict of Extract API params (optional).
  - `headers`: dict of headers to forward (optional).
- `preview(url, params, headers)` (see the sketch after this list)
  - `url`: the URL to preview (required).
  - `params`: dict of Preview API params (optional).
  - `headers`: dict of headers to forward (optional).
- `serp(params, headers)` (see the sketch after this list)
  - `params`: dict of SERP API params (optional).
  - `headers`: dict of headers to forward (optional).
- `account()`
  - Returns account information including usage, balance, and plan details.
- `get_pdf(url, params, headers)`
  - Gets a PDF of a web page using the Scrape API.
  - `url`: the URL to create a PDF from (required).
  - `params`: additional parameters for the PDF generation (optional).
  - `headers`: headers to forward to the request (optional).
  - Automatically sets `response_type` to 'pdf' and `json` to True.
- `get_screenshot(url, params, headers)`
  - Gets a screenshot of a web page using the Scrape API.
  - `url`: the URL to take a screenshot of (required).
  - `params`: additional parameters for the screenshot (optional).
  - `headers`: headers to forward to the request (optional).
  - Automatically sets `response_type` to 'screenshot' and `json` to True.
- `get_html(url, params, headers)`
  - Gets the HTML of a web page using the Scrape API.
  - `url`: the URL to get HTML from (required).
  - `params`: additional parameters for the request (optional).
  - `headers`: headers to forward to the request (optional).
  - Automatically sets `response_type` to 'html' and `json` to True.
- `scrape_with_rules(url, extract_rules, params, headers)`
  - Extracts data from a web page using extraction rules with the Scrape API.
  - `url`: the URL to extract data from (required).
  - `extract_rules`: the rules to extract data with (required).
  - `params`: additional parameters for the extraction (optional).
  - `headers`: headers to forward to the request (optional).
  - Automatically sets `json` to True.
- `search_text(search, params)`
  - Performs a Google text search using the SERP API.
  - `search`: the search query to perform on Google (required).
  - `params`: additional parameters for the search (optional).
- `search_news(search, params)`
  - Performs a Google News search using the SERP API.
  - `search`: the search query to perform on Google News (required).
  - `params`: additional parameters for the search (optional).
- `search_images(search, params)`
  - Performs a Google Images search using the SERP API.
  - `search`: the search query to perform on Google Images (required).
  - `params`: additional parameters for the search (optional).
- `search_videos(search, params)`
  - Performs a Google Videos search using the SERP API.
  - `search`: the search query to perform on Google Videos (required).
  - `params`: additional parameters for the search (optional).
- `search_maps(search, params)`
  - Performs a Google Maps search using the SERP API.
  - `search`: the search query to perform on Google Maps (required).
  - `params`: additional parameters for the search (optional).
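`preview` and the generic `serp` method are not shown in the examples below. A minimal sketch of how they can be called, following the same pattern as the other methods; note that the response fields and the `"search"` param name passed to `serp` are assumptions, so check the Preview and SERP API docs for the exact shapes:

```python
# Generate a link preview; the method takes the same
# (url, params, headers) arguments as scrape/extract.
response = ujeebu.preview(url="https://ujeebu.com")
if response.status_code == 200:
    # Assumption: the Preview API returns a JSON document;
    # print it to inspect the available fields.
    print(json.dumps(response.json(), indent=2))

# Call the SERP API directly; the search_* helpers below wrap this.
# Assumption: a "search" param carries the query, as in the helpers.
response = ujeebu.serp(params={"search": "Nikola Tesla", "results_count": 10})
if response.status_code == 200:
    print(json.dumps(response.json(), indent=2))
```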
- Example: scrape the HTML of a URL with infinite scroll:

```python
url = "https://scrape.li/load-more"
response = ujeebu.scrape(url, params={
    # what to return: 'raw', 'html', 'screenshot' or 'pdf'
    "response_type": "html",
    # return the response in JSON format (True) or raw (False)
    "json": False,
    # user-agent header to forward
    "useragent": "Ujeebu-Node",
    # cookies to forward
    "cookies": {"Cookie1": "Cookie Value"},
    # execute JS
    "js": True,
    # wait for a selector or a time in ms
    "wait_for": ".products-list",
    # if the selector doesn't appear within 5000ms, ignore it and continue
    "wait_for_timeout": 5000,
    # scroll the page down
    "scroll_down": True,
    # wait 2000ms between two scrolls
    "scroll_wait": 2000,
    # scroll to this element on each scroll
    "scroll_to_selector": ".load-more-section",
    # scroll condition: while this is true the page will keep scrolling
    "scroll_callback": "() => (document.querySelector('.no-more-products') === null)",
    "proxy_type": "premium",
    # proxy country code
    "proxy_country": "US",
    # device type: "desktop" or "mobile"
    "device": "desktop",
    "window_width": 1200,
    "window_height": 900,
    "block_ads": True,
    "block_resources": True,
}, headers={
    # forwarded headers
    "Authorization": "Basic eWSjaW5lnlhY4luZUdxMDE2"
})

if response.status_code == 200:
    print(response.content)
else:
    print("Error:\n", json.dumps(response.json(), indent=2))
```
url = "https://scrape.li/load-more";
response = ujeebu.scrape(url, {
"response_type": "screenshot",
"screenshot_fullpage": True,
"js": True,
# CSS selector to screenshot or coordinates of the rect to screenshot
# screenshot_partial: {
# x: 0,
# y: 0,
# },
# If json is set the true the screenshot will be sent in base64 encoding
"json": False,
"wait_for": 4000,
"block_ads": True,
})
if(response.status_code == 200):
# from pathlib import Path
Path('screenshot.png').write_bytes(response.content)
else:
print("Error:\n", json.dumps(response.json(), indent=2))- Example of extracting list of products from a page
- Example: extract a list of products from a page:

```python
url = "https://scrape.li/load-more"
response = ujeebu.scrape(url=url, params={
    "wait_for": 5000,
    "block_resources": False,
    "js": True,
    "extract_rules": {
        "products": {
            "selector": ".product-card",
            "type": "obj",
            "multiple": True,
            "children": {
                "name": {
                    "selector": ".title",
                    "type": "text"
                },
                "description": {
                    "selector": ".description",
                    "type": "text"
                },
                "price": {
                    "selector": ".price",
                    "type": "text"
                },
                "image": {
                    "selector": ".card__image > img",
                    "type": "image",
                }
            }
        }
    }
})

if response.status_code == 200:
    print(json.dumps(response.json(), indent=2))
else:
    print("Error:\n", json.dumps(response.json(), indent=2))
```
url = "https://thenextweb.com/news/european-space-agency-unveils-new-plan-for-growing-plants-on-the-moon"
response = ujeebu.extract(url=url, params={
"js": True
})
if(response.status_code == 200):
result = response.json()
print(json.dumps(result['article'], indent=2))
else:
print("Error:\n", json.dumps(response.json(), indent=2))- Get PDF using helper function:
- Get a PDF using the helper function:

```python
from ujeebu_python import UjeebuClient

ujeebu = UjeebuClient(api_key="__YOUR-API-KEY__")

# Get PDF
response = ujeebu.get_pdf(
    "https://ujeebu.com/blog/scraping-javascript-heavy-pages-using-puppeteer/"
)
if response.status_code == 200:
    result = response.json()
    # PDF is base64 encoded
    print(result['pdf'][:100])
```
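Since the `pdf` field comes back base64-encoded, it can be decoded and written to disk. A small follow-up to the example above (assuming a plain base64 payload with no data-URI prefix); the same approach works for the base64-encoded screenshots returned by `get_screenshot`:

```python
import base64
from pathlib import Path

# Decode the base64 payload from the JSON response and save it as a file.
Path("page.pdf").write_bytes(base64.b64decode(result['pdf']))
```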
- Get a screenshot using the helper function:

```python
# Take a full-page screenshot
response = ujeebu.get_screenshot(
    "https://ujeebu.com",
    params={
        "screenshot_fullpage": True
    }
)
if response.status_code == 200:
    result = response.json()
    # Screenshot is base64 encoded
    print(result['screenshot'][:100])
```
- Get HTML using the helper function:

```python
# Get HTML with JavaScript execution
response = ujeebu.get_html(
    "https://ujeebu.com",
    params={
        "js": True,
        "wait_for": 2000
    }
)
if response.status_code == 200:
    result = response.json()
    print(result['html'][:100])
```
- Scrape with extraction rules using the helper function:

```python
# Extract product data using the helper function
extract_rules = {
    "products": {
        "selector": ".product-card",
        "type": "obj",
        "multiple": True,
        "children": {
            "name": {"selector": ".title", "type": "text"},
            "price": {"selector": ".price", "type": "text"}
        }
    }
}

response = ujeebu.scrape_with_rules(
    "https://example.com/products",
    extract_rules=extract_rules,
    params={"js": True, "wait_for": 3000}
)
if response.status_code == 200:
    result = response.json()
    print(json.dumps(result['result'], indent=2))
```
- Google text search:

```python
# Perform a text search
response = ujeebu.search_text(
    "Nikola Tesla",
    params={"results_count": 10, "lang": "en"}
)
if response.status_code == 200:
    result = response.json()
    for item in result['organic_results']:
        print(f"{item['title']}: {item['link']}")
```
- Google news search:

```python
# Search for news articles
response = ujeebu.search_news(
    "Donald Trump",
    params={"results_count": 20}
)
if response.status_code == 200:
    result = response.json()
    for news in result['news']:
        print(f"{news['title']}: {news['link']}")
```
- Google images search:

```python
# Search for images
response = ujeebu.search_images(
    "Coffee",
    params={"results_count": 10}
)
if response.status_code == 200:
    result = response.json()
    for image in result['images']:
        print(f"{image['title']}: {image['image']}")
```
- Google videos search:

```python
# Search for videos
response = ujeebu.search_videos(
    "Bitcoin",
    params={"results_count": 10}
)
if response.status_code == 200:
    result = response.json()
    for video in result['videos']:
        print(f"{video['title']}: {video['url']}")
```
- Google Maps search:

```python
# Search for places on Google Maps
response = ujeebu.search_maps(
    "Italian restaurant",
    params={"results_count": 10, "location": "ca"}
)
if response.status_code == 200:
    result = response.json()
    for place in result['maps_results']:
        print(f"{place['title']} - Rating: {place['rating']}")
```

- Get account information:
```python
# Get account information
response = ujeebu.account()
if response.status_code == 200:
    account_info = response.json()
    print(f"Plan: {account_info['plan']}")
    print(f"Used: {account_info['used']} / {account_info['quota']}")
    print(f"Used Percent: {account_info['used_percent']}%")
```

Contributions are welcome! If you find a bug or have a feature request, please open an issue or submit a pull request.
This library is licensed under the MIT License. See the LICENSE file for more information.
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.