Data Collection
Sources of Data
Sites: data science online communities, e.g., Kaggle and Zindi Africa
Create your own dataset
Scrape data from the web (web scraping)
Web Scraping
Definition: an automatic process of extracting information from the web.
Two questions drive it: what to get from the web, and how to get it.
We may get data from a database, a data file, or other sources, but what if we need a large amount of data that is only available online?
One option is to search manually (clicking away in a web browser) and save the required data by hand (copy-pasting into a spreadsheet or file).
Web scraping is the process of constructing an agent that can extract, parse, download, and organize useful information from the web automatically; the web scraping agent loads and extracts data from multiple websites as per our requirement.
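As a first illustration, the sketch below does exactly this on a small scale: it downloads a page, parses it, and pulls out the title and links. It assumes the third-party requests and beautifulsoup4 packages are installed, and the URL is a placeholder rather than a site named in these notes.

```python
# Minimal web scraping sketch: download a page, parse it, extract data.
# Assumes requests and beautifulsoup4 are installed; the URL is a placeholder.
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url, timeout=10)            # download the raw HTML
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")  # parse the HTML
print(soup.title.string)                            # extract the page title
for link in soup.find_all("a", href=True):          # extract every hyperlink
    print(link["href"])
```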
Web Scraping vs Web Crawling
A web crawler (also called a crawler or web spider) is a computer program used to search and automatically index website content and other information over the internet. These programs are most commonly used to create entries for search engines.
Web crawling refers to downloading and storing the contents of a large number of websites; web scraping refers to extracting individual data elements from a website by using its site-specific structure.
Web crawling is mostly done on a large scale; web scraping can be implemented at any scale.
Web crawling yields generic information; web scraping yields specific information.
Web crawling is used by major search engines like Google, Bing, and Yahoo (Googlebot is an example of a web crawler); the information extracted using web scraping can be used to replicate content on some other website or to perform data analysis.
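To make the contrast concrete, here is a minimal crawler sketch in the same spirit: it follows links and stores whole pages, rather than extracting specific fields the way the scraping sketch above does. The start URL is again a placeholder, and a real crawler would also respect robots.txt.

```python
# Minimal crawler sketch: follow links breadth-first and store whole pages,
# in contrast with a scraper, which pulls out specific data elements.
# The start URL is a placeholder; real crawlers also respect robots.txt.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=10):
    seen, queue, pages = set(), deque([start_url]), {}
    domain = urlparse(start_url).netloc
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        pages[url] = html                        # store the whole page (crawling)
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == domain:  # stay on the same site
                queue.append(link)
    return pages

pages = crawl("https://example.com")
print(len(pages), "pages downloaded")
```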
Web Scraping Use Cases
E-commerce Websites: web scrapers can collect price data for a specific product from various e-commerce websites for price comparison.
Content Aggregators: web scraping is widely used by content aggregators, such as news aggregators and job aggregators, to provide up-to-date data to their users.
Marketing and Sales Campaigns: web scrapers can be used to gather data such as email addresses and phone numbers for sales and marketing campaigns.
Data for Machine Learning Projects: many machine learning projects depend on web scraping to assemble their datasets.
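As a concrete instance of the e-commerce case above, the sketch below pulls a price out of a product page. The URL and the "price" CSS class are hypothetical; every store marks up its pages differently, which is exactly why scrapers end up being site-specific.

```python
# Hypothetical price extraction: the URL and the "price" CSS class are
# placeholders; real stores each use their own markup.
import requests
from bs4 import BeautifulSoup

url = "https://shop.example.com/product/123"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

tag = soup.find("span", class_="price")      # locate the price element
if tag is not None:
    print("Current price:", tag.get_text(strip=True))
```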
Components of a Standard Web Scraper
Web Crawler Module: a very necessary component of a web scraper, the web crawler module is used to navigate the target website by making HTTP or HTTPS requests to its URLs. The crawler downloads the unstructured data (HTML content) and passes it to the extractor, the next module.
Extractor: the extractor processes the fetched HTML content and extracts the data into a semi-structured format. It is also called the parser module, and it uses parsing techniques such as regular expressions, HTML parsing, DOM parsing, or artificial intelligence.
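A sketch of these first two modules as plain functions is shown below. The function names, the assumed page structure (an h1 heading), and the email regex are illustrative assumptions, not a fixed API.

```python
# Sketch of the crawler and extractor modules of a standard scraper.
# Function names and the assumed page structure are illustrative.
import re

import requests
from bs4 import BeautifulSoup

def crawler_module(url):
    """Fetch the unstructured HTML content of a page over HTTP(S)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text

def extractor_module(html):
    """Parse HTML into a semi-structured record (here, a dict)."""
    soup = BeautifulSoup(html, "html.parser")               # HTML/DOM parsing
    heading = soup.find("h1")
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html)  # regex parsing
    return {
        "heading": heading.get_text(strip=True) if heading else None,
        "emails": emails,
    }

record = extractor_module(crawler_module("https://example.com"))
print(record)
```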
Components of a Standard Web Scraper (cont.)
Data Transformation and Cleaning Module: the data extracted above is not ready for immediate use; it must pass through a cleaning module first. Methods such as string manipulation or regular expressions can be used for this purpose. Note that extraction and transformation can also be performed in a single step.
Storage Module: after extracting the data, we need to store it as per our requirement. The storage module outputs the data in a standard format, for example to a database or to a JSON or CSV file.
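Continuing the sketch, the two remaining modules might look like this. The sample record and field names are made up for illustration; only the techniques (regex cleaning, JSON/CSV output) come from the notes.

```python
# Sketch of the cleaning and storage modules. The sample record and field
# names are illustrative; the techniques (regexes, JSON/CSV) are the point.
import csv
import json
import re

def cleaning_module(record):
    """Normalize extracted fields using string manipulation and regexes."""
    price = record.get("price", "")
    record["price"] = float(re.sub(r"[^\d.]", "", price) or 0)  # "$1,299.00" -> 1299.0
    record["name"] = record.get("name", "").strip().title()
    return record

def storage_module(records, basename="output"):
    """Persist cleaned records in standard formats (JSON and CSV)."""
    with open(basename + ".json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    with open(basename + ".csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)

raw = [{"name": "  usb cable ", "price": "$1,299.00"}]
storage_module([cleaning_module(r) for r in raw])
```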
Exercise:
Use at least three web scraping tools (e.g., browser extensions).
Ponder on the following questions:
Is web scraping legal?
Is web scraping the same as hacking?
Is web scraping the same as stealing data?
Research Required Prior to Scraping
Analyzing robots.txt (see the sketch after this list)
Analyzing sitemap files
Content of a sitemap file
What is the size of the website? Checking the website's size
Which technology is used by the website?
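Python's standard library covers the robots.txt part of this checklist directly, as sketched below with a placeholder site. Sitemap files declared there list a site's pages and so hint at its size.

```python
# Pre-scraping check sketch: read robots.txt with the standard library
# and list any sitemap files it declares. The site URL is a placeholder.
from urllib.robotparser import RobotFileParser

site = "https://example.com"
rp = RobotFileParser(site + "/robots.txt")
rp.read()                                        # fetch and parse robots.txt

# Are we allowed to fetch a given path with our user agent?
print("Can fetch /:", rp.can_fetch("*", site + "/"))

# robots.txt often declares sitemap files, whose contents list the site's
# pages and so give a rough sense of the website's size.
for sitemap_url in rp.site_maps() or []:
    print("Sitemap:", sitemap_url)
```

For the technology question, the third-party builtwith package (its parse() function takes a URL) is one common way to guess a site's stack; a site's size is also often estimated with a search engine's site: operator.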
Developing Our Own Web Scraping Tool
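Putting the four modules together, a first version of our own tool could look like the sketch below. The target URL, the .item/.name/.price markup, and the output filename are all hypothetical; the structure (crawl, extract, clean, store) follows the component breakdown above.

```python
# End-to-end sketch: crawl one page, extract items, clean them, store as CSV.
# The URL, the CSS selectors, and the output filename are hypothetical.
import csv
import re

import requests
from bs4 import BeautifulSoup

def scrape(url):
    html = requests.get(url, timeout=10).text           # crawler module
    soup = BeautifulSoup(html, "html.parser")           # extractor module
    rows = []
    for item in soup.select(".item"):                   # hypothetical markup
        name = item.select_one(".name")
        price = item.select_one(".price")
        if name and price:
            rows.append({                               # cleaning module
                "name": name.get_text(strip=True),
                "price": float(re.sub(r"[^\d.]", "", price.get_text()) or 0),
            })
    with open("items.csv", "w", newline="", encoding="utf-8") as f:  # storage
        writer = csv.DictWriter(f, fieldnames=["name", "price"])
        writer.writeheader()
        writer.writerows(rows)
    return rows

print(scrape("https://example.com/products"))
```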