Data Collection
Sources of Data
Sites: data science online communities, e.g., Kaggle and Zindi Africa
Create your own dataset
Scrape data from the web (web scraping)
Web Scraping
Definition: an automatic process of extracting information from the web.
Two questions drive it: what to get from the web, and how to get it.
We may get data from a database, a data file, or other sources, but what if we need a large amount of data that is only available online?
One option is to search manually (clicking away in a web browser) and save the required data by hand (copy-pasting into a spreadsheet or file).
Web scraping is the process of constructing an agent that can extract, parse, download, and organize useful information from the web automatically; the web scraping agent loads and extracts data from multiple websites as per our requirement.
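As a first illustration, the sketch below does exactly this on a small scale: it downloads a page, parses it, and pulls out the title and links. It assumes the third-party requests and beautifulsoup4 packages are installed, and the URL is a placeholder rather than a site named in these notes.

```python
# Minimal web scraping sketch: download a page, parse it, extract data.
# Assumes requests and beautifulsoup4 are installed; the URL is a placeholder.
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url, timeout=10)            # download the raw HTML
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")  # parse the HTML
print(soup.title.string)                            # extract the page title
for link in soup.find_all("a", href=True):          # extract every hyperlink
    print(link["href"])
```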
Web Scraping vs Web Crawling
A web crawler (also called a crawler or web spider) is a computer program used to search and automatically index website content and other information over the internet. These programs are most commonly used to create entries for search engines.
Web crawling refers to downloading and storing the contents of a large number of websites; web scraping refers to extracting individual data elements from a website by using its site-specific structure.
Web crawling is mostly done on a large scale; web scraping can be implemented at any scale.
Web crawling yields generic information; web scraping yields specific information.
Web crawling is used by major search engines like Google, Bing, and Yahoo (Googlebot is an example of a web crawler); the information extracted using web scraping can be used to replicate content on some other website or to perform data analysis.
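To make the contrast concrete, here is a minimal crawler sketch in the same spirit: it follows links and stores whole pages, rather than extracting specific fields the way the scraping sketch above does. The start URL is again a placeholder, and a real crawler would also respect robots.txt.

```python
# Minimal crawler sketch: follow links breadth-first and store whole pages,
# in contrast with a scraper, which pulls out specific data elements.
# The start URL is a placeholder; real crawlers also respect robots.txt.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=10):
    seen, queue, pages = set(), deque([start_url]), {}
    domain = urlparse(start_url).netloc
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        pages[url] = html                        # store the whole page (crawling)
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == domain:  # stay on the same site
                queue.append(link)
    return pages

pages = crawl("https://example.com")
print(len(pages), "pages downloaded")
```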
Web Scraping Use Cases
E-commerce Websites: web scrapers can collect price data for a specific product from various e-commerce websites for price comparison.
Content Aggregators: web scraping is widely used by content aggregators, such as news aggregators and job aggregators, to provide up-to-date data to their users.
Marketing and Sales Campaigns: web scrapers can be used to gather data such as email addresses and phone numbers for sales and marketing campaigns.
Data for Machine Learning Projects: many machine learning projects depend on web scraping to assemble their datasets.
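As a concrete instance of the e-commerce case above, the sketch below pulls a price out of a product page. The URL and the "price" CSS class are hypothetical; every store marks up its pages differently, which is exactly why scrapers end up being site-specific.

```python
# Hypothetical price extraction: the URL and the "price" CSS class are
# placeholders; real stores each use their own markup.
import requests
from bs4 import BeautifulSoup

url = "https://shop.example.com/product/123"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

tag = soup.find("span", class_="price")      # locate the price element
if tag is not None:
    print("Current price:", tag.get_text(strip=True))
```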
Components of a Standard Web Scraper
Web Crawler Module: a very necessary component of a web scraper, the web crawler module is used to navigate the target website by making HTTP or HTTPS requests to its URLs. The crawler downloads the unstructured data (HTML content) and passes it to the extractor, the next module.
Extractor: the extractor processes the fetched HTML content and extracts the data into a semi-structured format. It is also called the parser module, and it uses parsing techniques such as regular expressions, HTML parsing, DOM parsing, or artificial intelligence.
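A sketch of these first two modules as plain functions is shown below. The function names, the assumed page structure (an h1 heading), and the email regex are illustrative assumptions, not a fixed API.

```python
# Sketch of the crawler and extractor modules of a standard scraper.
# Function names and the assumed page structure are illustrative.
import re

import requests
from bs4 import BeautifulSoup

def crawler_module(url):
    """Fetch the unstructured HTML content of a page over HTTP(S)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text

def extractor_module(html):
    """Parse HTML into a semi-structured record (here, a dict)."""
    soup = BeautifulSoup(html, "html.parser")               # HTML/DOM parsing
    heading = soup.find("h1")
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html)  # regex parsing
    return {
        "heading": heading.get_text(strip=True) if heading else None,
        "emails": emails,
    }

record = extractor_module(crawler_module("https://example.com"))
print(record)
```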
Components of a Standard Web Scraper (cont.)
Data Transformation and Cleaning Module: the data extracted above is not ready for immediate use; it must pass through a cleaning module first. Methods such as string manipulation or regular expressions can be used for this purpose. Note that extraction and transformation can also be performed in a single step.
Storage Module: after extracting the data, we need to store it as per our requirement. The storage module outputs the data in a standard format, for example to a database or to a JSON or CSV file.
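Continuing the sketch, the two remaining modules might look like this. The sample record and field names are made up for illustration; only the techniques (regex cleaning, JSON/CSV output) come from the notes.

```python
# Sketch of the cleaning and storage modules. The sample record and field
# names are illustrative; the techniques (regexes, JSON/CSV) are the point.
import csv
import json
import re

def cleaning_module(record):
    """Normalize extracted fields using string manipulation and regexes."""
    price = record.get("price", "")
    record["price"] = float(re.sub(r"[^\d.]", "", price) or 0)  # "$1,299.00" -> 1299.0
    record["name"] = record.get("name", "").strip().title()
    return record

def storage_module(records, basename="output"):
    """Persist cleaned records in standard formats (JSON and CSV)."""
    with open(basename + ".json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    with open(basename + ".csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)

raw = [{"name": "  usb cable ", "price": "$1,299.00"}]
storage_module([cleaning_module(r) for r in raw])
```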
Exercise:
Use at least three web scraping tools (e.g., browser extensions).
Ponder on the following questions:
Is web scraping legal?
Is web scraping the same as hacking?
Is web scraping the same as stealing data?
Research Required Prior to Scraping
Analyzing robots.txt (see the sketch after this list)
Analyzing sitemap files
Content of a sitemap file
What is the size of the website? Checking the website's size
Which technology is used by the website?
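Python's standard library covers the robots.txt part of this checklist directly, as sketched below with a placeholder site. Sitemap files declared there list a site's pages and so hint at its size.

```python
# Pre-scraping check sketch: read robots.txt with the standard library
# and list any sitemap files it declares. The site URL is a placeholder.
from urllib.robotparser import RobotFileParser

site = "https://example.com"
rp = RobotFileParser(site + "/robots.txt")
rp.read()                                        # fetch and parse robots.txt

# Are we allowed to fetch a given path with our user agent?
print("Can fetch /:", rp.can_fetch("*", site + "/"))

# robots.txt often declares sitemap files, whose contents list the site's
# pages and so give a rough sense of the website's size.
for sitemap_url in rp.site_maps() or []:
    print("Sitemap:", sitemap_url)
```

For the technology question, the third-party builtwith package (its parse() function takes a URL) is one common way to guess a site's stack; a site's size is also often estimated with a search engine's site: operator.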
Developing Our Own Web Scraping Tool
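Putting the four modules together, a first version of our own tool could look like the sketch below. The target URL, the .item/.name/.price markup, and the output filename are all hypothetical; the structure (crawl, extract, clean, store) follows the component breakdown above.

```python
# End-to-end sketch: crawl one page, extract items, clean them, store as CSV.
# The URL, the CSS selectors, and the output filename are hypothetical.
import csv
import re

import requests
from bs4 import BeautifulSoup

def scrape(url):
    html = requests.get(url, timeout=10).text           # crawler module
    soup = BeautifulSoup(html, "html.parser")           # extractor module
    rows = []
    for item in soup.select(".item"):                   # hypothetical markup
        name = item.select_one(".name")
        price = item.select_one(".price")
        if name and price:
            rows.append({                               # cleaning module
                "name": name.get_text(strip=True),
                "price": float(re.sub(r"[^\d.]", "", price.get_text()) or 0),
            })
    with open("items.csv", "w", newline="", encoding="utf-8") as f:  # storage
        writer = csv.DictWriter(f, fieldnames=["name", "price"])
        writer.writeheader()
        writer.writerows(rows)
    return rows

print(scrape("https://example.com/products"))
```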