Data Scraping Techniques
CYBER SECURITY-1
What is Data Scraping?
Data scraping is the process of
extracting
information from websites,
documents, APIs, or databases. It is
commonly used for:
Market research
Competitor analysis
SEO monitoring
Email & contact extraction
Machine learning datasets
Techniques for Data
Scraping
Manual Copy-Pasting (Basic)
Simply copying and pasting data from a
website into a file.
Useful for small-scale tasks but inefficient
for large datasets.
Web Scraping with Python (Automated)
Uses libraries like BeautifulSoup, Scrapy,
and Selenium to extract data
programmatically.
Example with BeautifulSoup:
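A minimal sketch of the BeautifulSoup approach, assuming the third-party beautifulsoup4 package is installed. To stay runnable offline it parses an inline HTML string rather than a downloaded page. The markup, the class names, and the extract_products helper are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical page markup, standing in for HTML downloaded from a site.
SAMPLE_HTML = """
<html><body>
  <div class="product"><h2>Widget</h2><span class="price">9.99</span></div>
  <div class="product"><h2>Gadget</h2><span class="price">19.50</span></div>
</body></html>
"""

def extract_products(html):
    """Return a list of (name, price) tuples found in the markup."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for block in soup.find_all("div", class_="product"):
        name = block.find("h2").get_text(strip=True)
        price = float(block.find("span", class_="price").get_text(strip=True))
        products.append((name, price))
    return products

if __name__ == "__main__":
    for name, price in extract_products(SAMPLE_HTML):
        print(f"{name}: {price}")
```

On a real site the selectors depend entirely on that site's markup, so they usually need adjusting after inspecting the page source.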
Web Scraping with Selenium
(For Dynamic Sites)
Automates browser interactions to scrape
JavaScript-rendered content.
API Scraping (More Efficient)
Many websites provide APIs to access
data legally.
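A rough sketch of working with a JSON API, using only the Python standard library. The payload shape ("items" with "name" fields) and both helper names are assumptions for illustration. fetch_json is defined but not called here, so the example runs without network access.

```python
import json
import urllib.request

def fetch_json(url, timeout=10):
    """Download a URL and decode the body as JSON (not called below)."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

def extract_names(payload):
    """Pull the 'name' field out of each record in a hypothetical payload."""
    return [item["name"] for item in payload.get("items", [])]

# Hypothetical response body, shaped like what a JSON API might return.
SAMPLE_BODY = '{"items": [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]}'

if __name__ == "__main__":
    print(extract_names(json.loads(SAMPLE_BODY)))
```

Because the API already returns structured data, no HTML parsing is needed, which is the main reason this route tends to be more efficient.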
Headless Browsers & Puppeteer
Uses Puppeteer (Node.js) for scraping
JavaScript-heavy websites.
What is a Headless
Browser and Puppeteer?
A headless browser is a web browser
that runs without a graphical user
interface (GUI). It behaves just like a
regular browser but operates in the
background, making it useful for
automation, web scraping, and testing.
Puppeteer is a Node.js library that
provides a high-level API to control Chrome
or Chromium using the DevTools Protocol.
By default, Puppeteer runs in headless
mode, but it can also run with a visible UI.
Reverse Engineering APIs
Many websites do not document their APIs,
but the calls are visible in browser requests
(inspect via DevTools → Network tab).
Example: Scraping stock market data from
a hidden API instead of parsing HTML.
Proxy Rotation & User Agents
Prevents detection & blocking by using:
Rotating IP addresses with proxy
services (e.g., BrightData, ScraperAPI).
CAPTCHA Solving
Some sites use CAPTCHAs to block
scrapers. Solutions include:
CAPTCHA-solving services (e.g.,
2Captcha, Anti-Captcha).
Headless browser automation
(simulating human behavior).
Legal & Ethical Considerations
Scraping can be illegal if it violates
a website's Terms of Service (ToS).
Always check robots.txt before
scraping
(https://example.com/robots.txt).
Use official APIs whenever possible.
Respect privacy laws (GDPR, CCPA).
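The robots.txt check can be automated with Python's standard urllib.robotparser module. This is a minimal sketch: the robots.txt content, the "my-scraper" user agent, and the is_allowed helper are made up for illustration, and the rules are parsed from a string so nothing is fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at /robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_text, user_agent, url):
    """Return True if the rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    for path in ("/public/page.html", "/private/data.html"):
        url = "https://example.com" + path
        print(path, is_allowed(SAMPLE_ROBOTS, "my-scraper", url))
```

For a live site, RobotFileParser also offers set_url() and read() to download the file before calling can_fetch().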
Google Dorks & Advanced Search Queries
What is Google Dorking?
Google Dorking (also called Google Hacking) is a
technique that uses advanced search operators
to find hidden or sensitive information on websites.
This includes:
Exposed emails & passwords
Confidential files (PDFs, DOCs, XLS)
Login pages & admin panels
Google Dorking Examples
Find Exposed Login Pages
Search for admin login portals:
inurl:adminlogin intitle:"admin panel"
Example: inurl:login.php site:example.com
Discover Exposed Files
Find PDF documents on a government site:
filetype:pdf site:gov
Example: filetype:xlsx "financial report"
Find Open Directories
Search for directories with publicly available
files:
intitle:"index of" "parent directory"
Discover Cameras & IoT Devices
Search for exposed security
cameras:
inurl:/view/view.shtml
Example: inurl:top.htm
inurl:currenttime
Find Websites with SQL
Vulnerabilities
Detect pages vulnerable to SQL
injection:
Advanced Google Search
Queries
You can combine multiple operators for
precise searches.
Find a PDF about cybersecurity from a
university website:
site:.edu filetype:pdf
intext:"cybersecurity"
Search for leaked passwords on
Pastebin:
site:pastebin.com "email" "password"
Find GitHub repositories with API keys:
site:github.com "API_KEY"
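Combining operators is just string assembly, so it can be sketched in a few lines of Python. The build_query helper and its parameter names are hypothetical. It only covers the operators used in the examples above and does not send anything to a search engine.

```python
def build_query(site=None, filetype=None, intext=None, phrases=()):
    """Join the given search operators into one query string."""
    parts = []
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    if intext:
        parts.append(f'intext:"{intext}"')
    # Bare quoted phrases must appear exactly as written on the page.
    parts.extend(f'"{phrase}"' for phrase in phrases)
    return " ".join(parts)

if __name__ == "__main__":
    print(build_query(site=".edu", filetype="pdf", intext="cybersecurity"))
    print(build_query(site="pastebin.com", phrases=["email", "password"]))
```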
Ethical & Legal
Considerations
Use Google Dorks for cybersecurity
research & OSINT.
Avoid accessing unauthorized or
private data.
Respect robots.txt and website terms
of service.
Report security vulnerabilities
What is Geolocation & IP
Tracing?
GeoIP Location is the process of determining
the geographical location of an IP
address. This can be used for security,
analytics, content personalization, and
fraud prevention.
IP Tracing tracks the origin of an IP
address, revealing ISP details, approximate
location, and possible activity.
How Does GeoIP Location
Work?
An IP address is mapped to a geographical
database.
The database contains information such as:
Country
City
Region/State
Latitude & Longitude
ISP (Internet Service Provider)
Organization
Timezone
Using APIs or libraries, we can retrieve this
data.
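The mapping idea can be illustrated with a toy lookup table built on Python's standard ipaddress module. Everything in the table is made up: the networks are reserved documentation ranges and the locations are placeholders. Real GeoIP databases hold far more ranges and use faster lookup structures than this linear scan.

```python
import ipaddress

# Hypothetical records keyed by network range (documentation-only ranges).
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"),
     {"country": "Exampleland", "city": "Sample City", "isp": "Example ISP"}),
    (ipaddress.ip_network("198.51.100.0/24"),
     {"country": "Testonia", "city": "Demo Town", "isp": "Demo Networks"}),
]

def lookup(ip_text):
    """Return the record whose network contains the address, or None."""
    address = ipaddress.ip_address(ip_text)
    for network, record in GEO_TABLE:
        if address in network:
            return record
    return None

if __name__ == "__main__":
    print(lookup("192.0.2.55"))
    print(lookup("10.0.0.1"))
```

Because the lookup works on whole ranges, every address in a range gets the same answer, which is one reason results are approximate.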
Common Uses of
Geolocation & IP Tracing
Cybersecurity – Detecting suspicious
activity, preventing fraud.
Website Analytics – Tracking visitor
locations for better targeting.
Marketing – Serving location-based ads
and content.
Network Troubleshooting – Diagnosing
connectivity issues.
Ethical Hacking & OSINT – Gathering
intelligence in cybersecurity
investigations.
Methods for IP
Geolocation & Tracing
Using Online IP Lookup Tools
Several online services provide IP location
data:
ipinfo.io
iplocation.net
whatismyipaddress.com
Example Output for an IP Lookup (illustrative values):
IP: 203.0.113.10
Country: USA
City: New York
ISP: Verizon Communications
Latitude/Longitude: 40.7128° N, 74.0060° W
Using Python for IP
Tracing
Python can automate IP tracking
using requests and ipinfo.io:
import requests

ip = "8.8.8.8"  # Example IP (Google Public DNS)
url = f"https://ipinfo.io/{ip}/json"
response = requests.get(url, timeout=10)  # fail fast instead of hanging
response.raise_for_status()  # raise on HTTP errors such as rate limiting
print(response.json())
Tracing IP Addresses
Using Command-Line Tools
Find an IP Address’s Route
(Traceroute)
Traceroute maps the path packets take to
reach an IP.
Windows:
tracert example.com
Linux/macOS:
traceroute example.com
Shows all intermediary servers
(hops) along the route.
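The platform difference above can be wrapped in a small Python helper. This is a sketch only: trace_command and run_trace are hypothetical names, and run_trace needs network access and the system tool installed, so it is defined but not called here.

```python
import platform
import subprocess

def trace_command(host, system=None):
    """Return the route-tracing command for the given OS name."""
    system = system or platform.system()
    tool = "tracert" if system == "Windows" else "traceroute"
    return [tool, host]

def run_trace(host):
    """Run the trace and return its text output (not called below)."""
    result = subprocess.run(
        trace_command(host), capture_output=True, text=True, check=False
    )
    return result.stdout

if __name__ == "__main__":
    print(trace_command("example.com", "Windows"))
    print(trace_command("example.com", "Linux"))
```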
Finding Server IP Using nslookup
Find the IP address of a domain:
nslookup example.com
Ping an IP to Check Connectivity
ping 8.8.8.8
Checks if an IP is reachable and
how long packets take to travel.
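A domain lookup similar to the nslookup step can be done from Python with the standard socket module. One difference worth noting: this goes through the operating system's resolver (which may consult the hosts file), whereas nslookup queries a DNS server directly. The resolve helper is a hypothetical name, and the demo uses a literal address so it runs without network access.

```python
import socket

def resolve(hostname):
    """Return the sorted, de-duplicated IP addresses for hostname."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry ends with a sockaddr tuple whose first item is the IP.
    return sorted({info[4][0] for info in infos})

if __name__ == "__main__":
    # A literal address resolves to itself without any DNS query.
    print(resolve("127.0.0.1"))
```

With network access, resolve("example.com") would return that domain's current addresses.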
Advanced Geolocation
Techniques
Using MaxMind’s GeoIP Database
MaxMind’s GeoLite2 provides detailed
geolocation based on IP:
Install the geoip2 library:
pip install geoip2
Python Example:
import geoip2.database
reader = geoip2.database.Reader('GeoLite2-City.mmdb')
ip = "8.8.8.8"
response = reader.city(ip)
print(f"City: {response.city.name}, Country: {response.country.name}")
Used in cybersecurity and forensic investigations.
Ethical & Legal
Considerations
IP geolocation is NOT 100% accurate
– it estimates the city or region, not exact
addresses.
Use responsibly for cybersecurity,
marketing, and research.
Avoid using IP tracing for illegal
activities (stalking, hacking, etc.).
Social Engineering &
Information Gathering on
Mobile Apps
What is Social Engineering?
Social engineering is the psychological
manipulation of people to obtain
confidential information, such as:
Passwords & Login Credentials
Personal Data (Emails, Phone
Numbers, Addresses)
Financial Information
Company Secrets & Internal
Documents
Common Techniques:
Phishing – Fake emails or messages
tricking users into revealing credentials.
Pretexting – Creating a fabricated
scenario to gain access to information.
Baiting – Offering free software or gifts
that contain malware.
Shoulder Surfing – Watching people
enter credentials in public places.
Impersonation – Pretending to be an
authority figure (IT support, HR, etc.).
Algorithms & Techniques for
Information Gathering
OSINT (Open-Source Intelligence) Tools
OSINT tools are used for data collection from public sources.
Tools for Gathering Information:
theHarvester – Gathers emails, subdomains, and
usernames.
Maltego – Visualizes social connections and information
sources.
Sherlock – Finds usernames across different social media
platforms.
Google Dorks – Uses advanced search queries to find
hidden data.
Shodan – Scans for exposed IoT devices and networks.
Example: Google Dork to Find Email Addresses
site:example.com intext:"@example.com"
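Once public pages have been collected, addresses for a given domain can be pulled out with a regular expression. A minimal sketch using the standard re module follows. The find_emails helper and the sample text are hypothetical, and the pattern is deliberately simple rather than a full email-address grammar.

```python
import re

def find_emails(text, domain):
    """Return sorted, lower-cased addresses at exactly the given domain."""
    pattern = re.compile(
        r"[A-Za-z0-9._%+-]+@" + re.escape(domain) + r"(?!\.?[A-Za-z0-9-])",
        re.IGNORECASE,
    )
    return sorted({match.lower() for match in pattern.findall(text)})

# Hypothetical text, standing in for content gathered from public pages.
SAMPLE_TEXT = (
    "Contact Alice@example.com or bob@example.com. "
    "Sales: carol@other.org, again bob@example.com, eve@example.community"
)

if __name__ == "__main__":
    print(find_emails(SAMPLE_TEXT, "example.com"))
```

The trailing lookahead keeps longer domains such as example.community from being counted as example.com.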
Gathering Information
from Mobile Apps
Reverse Engineering Mobile Apps
Mobile apps often store sensitive
information in hidden files or databases.
Techniques for Extracting Data:
Decompiling APKs – Extracting source
code from Android apps.
Analyzing API Calls – Inspecting how
apps communicate with servers.
Inspecting Permissions – Checking for
excessive data access requests.
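Permission inspection can be sketched with Python's standard XML parser, since Apktool decodes AndroidManifest.xml into readable XML. The sample manifest, the helper names, and the short watch list of permissions are illustrative assumptions. Which permissions count as excessive depends on what the app is supposed to do.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "http://schemas.android.com/apk/res/android"

# Hypothetical decoded manifest for a made-up app.
SAMPLE_MANIFEST = """\
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.app">
    <uses-permission android:name="android.permission.INTERNET" />
    <uses-permission android:name="android.permission.READ_CONTACTS" />
    <uses-permission android:name="android.permission.ACCESS_FINE_LOCATION" />
</manifest>
"""

# Illustrative selection of permissions that expose personal data.
WATCH_LIST = {
    "android.permission.READ_CONTACTS",
    "android.permission.READ_SMS",
    "android.permission.ACCESS_FINE_LOCATION",
}

def list_permissions(manifest_xml):
    """Return the permission names requested in the manifest, in order."""
    root = ET.fromstring(manifest_xml)
    key = f"{{{ANDROID_NS}}}name"  # namespaced android:name attribute
    return [element.get(key) for element in root.iter("uses-permission")]

def flag_sensitive(permissions):
    """Return the requested permissions that appear on the watch list."""
    return [name for name in permissions if name in WATCH_LIST]

if __name__ == "__main__":
    print(flag_sensitive(list_permissions(SAMPLE_MANIFEST)))
```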
Extracting API Data from a Mobile
App
Use Frida or Mitmproxy to monitor an
app’s API calls.
mitmproxy -p 8080
Can capture API tokens, user IDs, and
hidden endpoints.
Extracting User Data from APK Files
Android apps store data in SQLite
databases, which can be extracted:
Steps to Extract User
Data
Decompile APK using Apktool:
apktool d app.apk
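Look for databases in the app's private storage
on the device (not inside the APK itself; reading
it normally requires root or a debuggable build):
/data/data/com.appname/databases/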
Use SQLite to read stored user data:
sqlite3 database.db "SELECT * FROM users;"
Can reveal user credentials, chat logs, or
sensitive tokens.
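The same inspection can be scripted with Python's built-in sqlite3 module. This sketch builds a small in-memory database so it runs anywhere. The users table, its columns, and the helper names are hypothetical, and a real app's schema will differ.

```python
import sqlite3

def make_demo_db():
    """Create an in-memory database with one made-up table."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)"
    )
    connection.execute("INSERT INTO users (username) VALUES ('demo_user')")
    return connection

def dump_tables(connection):
    """Return {table_name: rows} for every table in the database."""
    cursor = connection.cursor()
    cursor.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    tables = [row[0] for row in cursor.fetchall()]
    return {
        name: cursor.execute(f'SELECT * FROM "{name}"').fetchall()
        for name in tables
    }

if __name__ == "__main__":
    print(dump_tables(make_demo_db()))
```

Listing the tables first via sqlite_master avoids guessing table names, which is useful when the schema is unknown.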
Ethical & Legal Considerations
Social engineering can be illegal if used for
hacking or unauthorized access. Use it for
cybersecurity research, penetration testing,
and awareness training. Always obtain
Thank You