Session 3: Data Acquisition

This document provides an introduction to web scraping, including:
- An overview of HTML page structure and how web pages are delivered to browsers.
- An outline of the basic steps for web scraping with the Requests and BeautifulSoup Python modules: sending requests, parsing responses, and saving extracted data.
- An explanation of how Selenium can scrape sites with non-static or JavaScript-rendered content by programmatically controlling a browser.
- Some challenges of web scraping, such as the fragility of code when sites change and the potential blocking of scrapers.
- Resources for learning more about the HTML, BeautifulSoup, Selenium, and Scrapy web scraping tools.


Innovation and Marketing Analytics
Prof. Qiaoni Shi
Questions?
Today’s Plan
• Introduction to Web scraping
• Web scraping with requests & BeautifulSoup
• Web scraping with Selenium
Introduction to Web scraping
Web page
• Webpages are (mostly) written in HTML
• Web pages are delivered to the user's browser exactly as stored
• Each webpage is a separate HTML file
Web page
• HTML: HyperText Markup Language
Tree-like Structure of an HTML Page
HTML tags

Tag          Name/function
<head>       Head of an HTML document; contains elements describing the document
<body>       Body of an HTML document; the content of the web page
<h1>…<h6>    Headings
<p>          Paragraph
<div>        A block/section
<span>       An inline section
<a>          A link
<li>         A list item
<ul>         An unordered list
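
To make the tree structure concrete, here is a minimal sketch: a tiny invented page built from the tags above, parsed with BeautifulSoup (covered later in this session) and printed with one level of indentation per level of nesting.

from bs4 import BeautifulSoup

# a small invented page, purely for illustration
html = """
<html>
  <head><title>Example</title></head>
  <body>
    <h1>Heading</h1>
    <div>
      <p>A paragraph with <a href="https://example.com">a link</a>.</p>
      <ul><li>First item</li><li>Second item</li></ul>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

def show_tree(tag, depth=0):
    # print each tag's name, indented by its depth in the tree
    print("  " * depth + tag.name)
    for child in tag.find_all(recursive=False):
        show_tree(child, depth + 1)

show_tree(soup.html)   # html > head, body > h1, div > p, ul > a, li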
HTML Resources
• https://www.youtube.com/watch?v=UB1O30fR-EE
• https://www.codecademy.com/learn/learn-html
• https://www.w3schools.com/html/html_intro.asp
• Mac OS
  • Chrome -> Developer -> View Source
  • Command + Shift + Option
• Windows
  • Chrome -> right click -> Inspect

Source: Chris Bail
Exercise

• Pick a webpage and check the following items:
  • How is it organized?
  • Where is the head and where is the body?
  • Is it a tree-like structure?
Outline of basic web scraping
Web scraping with requests & BeautifulSoup
Steps
Step 1 Request Information

You need to request information from the URL and get the HTML text data.

https://m.imdb.com/title/tt1160419/
Send request

import requests

url = "https://m.imdb.com/title/tt1160419/"
page = requests.get(url)   # send an HTTP GET request to the URL
page.content               # raw bytes of the HTML response
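
Before parsing, it is worth confirming that the request actually succeeded. A minimal sketch continuing the snippet above (raise_for_status, status_code, and text are all standard parts of the requests response API):

page.raise_for_status()   # raises an HTTPError for 4xx/5xx responses
print(page.status_code)   # 200 means the request succeeded
html_text = page.text     # decoded HTML as a string (page.content is raw bytes)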
Tree-like Structure of an HTML Page
Step 2 Parse Information

What we want
Parsing Data

from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')

• Locate the information we want
soup.find("h1").get_text()   # text of the first <h1> tag
soup.find_all()              # returns all matching tags as a list
Parsing Data

soup.find("div", {"class": "…"})
soup.find(id="…")
soup.find_all("span", class_="…")
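
Putting Steps 1 and 2 together, a minimal end-to-end sketch for the IMDb page from the earlier slide. The assumption that the movie title sits in the page's first <h1> tag is for illustration; tag and class names vary by site and can change whenever the page is redesigned, and some sites reject requests that lack a browser-like User-Agent header.

import requests
from bs4 import BeautifulSoup

url = "https://m.imdb.com/title/tt1160419/"
page = requests.get(url)

soup = BeautifulSoup(page.content, "html.parser")
title = soup.find("h1").get_text()   # assumes the title is in the first <h1>
spans = soup.find_all("span")        # every <span> tag, returned as a list
print(title, len(spans))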
Web scraping with Selenium
Outline of basic web scraping

Selenium
Selenium is a Python module that controls a browser to open a webpage and extract data from it.
A unique advantage of Selenium

• Selenium can handle non-static webpages whose content is hidden behind code (e.g. JavaScript)
• How? Selenium can interact with the browser. For example, Selenium can click on a button / link / dropdown menu, etc.
Step 0 Import Modules

# install firefox, geckodriver, and selenium (a Colab-style setup)
!apt-get update
!pip install selenium
!apt install firefox-geckodriver
!cp /usr/lib/geckodriver /usr/bin
!cp /usr/lib/firefox /usr/bin

from selenium import webdriver

binary = '/usr/bin/firefox'
options = webdriver.FirefoxOptions()
options.binary = binary
options.add_argument('--headless')   # run the browser without a visible window
driver = webdriver.Firefox(options=options, executable_path='/usr/bin/geckodriver')
Step 1&2 Send Requests, Parse Data

driver = webdriver.Firefox(options=options, executable_path='/usr/bin/geckodriver')
driver.get(url)        # open the page in the controlled browser
driver.page_source     # the rendered HTML, including JavaScript-generated content
Selenium

from selenium.webdriver.common.by import By

.find_element(By.CLASS_NAME, "")     # first element matching a class name
.find_element(By.XPATH, "")          # first element matching an XPath expression
.find_elements(By.CLASS_NAME, "")    # all matching elements, as a list
.find_elements(By.XPATH, "")

e.g.,
driver.find_elements(By.CLASS_NAME, "review-container")
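
A short sketch of what one typically does with the matches, assuming a driver is already open (Step 0) on a page that uses the "review-container" class from the example above:

reviews = driver.find_elements(By.CLASS_NAME, "review-container")
review_texts = [r.text for r in reviews]   # .text is an element's visible text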
Selenium locator
Selenium
Browser interaction

.click() instructs the browser to click on the element

Select() (from selenium.webdriver.support.ui) wraps a dropdown box so one of its options can be chosen

Example:

from selenium.webdriver.support.ui import Select

dropdown_box = Select(elem)
dropdown_box.select_by_visible_text('Most recent')

The first line wraps the dropdown box referred to by elem; the second instructs the browser to choose the option with the text 'Most recent'.
Selenium
Browser interaction

.back() instructs the browser to go back one page

.forward() instructs the browser to go forward one page
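
A sketch tying the Selenium pieces together. The URL reuses the earlier IMDb example, and the idea that the sort control is a plain <select> element is an assumption for illustration (many real pages use custom dropdown widgets instead, which need .click() rather than Select):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

# headless Firefox, set up as in Step 0
options = webdriver.FirefoxOptions()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options, executable_path='/usr/bin/geckodriver')

driver.get("https://m.imdb.com/title/tt1160419/")

# hypothetical: sort the reviews, if the page exposes a real <select> dropdown
elem = driver.find_element(By.TAG_NAME, "select")
Select(elem).select_by_visible_text('Most recent')

reviews = driver.find_elements(By.CLASS_NAME, "review-container")
texts = [r.text for r in reviews]

driver.back()    # go back one page
driver.quit()    # close the browser when finished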


Step 3 Save Data (Pandas)

import pandas as pd

final_dict = {'v1': list1, 'v2': list2}   # one scraped list per column
df = pd.DataFrame(final_dict)
df.to_csv()                               # pass a file path to write to disk
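
A concrete version of the same step with invented data, just to show the shape (the column names and file name are illustrative):

import pandas as pd

titles = ["Movie A", "Movie B"]   # hypothetical scraped lists
ratings = [8.1, 7.4]

df = pd.DataFrame({"title": titles, "rating": ratings})
df.to_csv("scraped.csv", index=False)   # index=False drops the row numbers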
More tools
Comparison of Web Scraping Tools

Tool            Pros                                              Cons
BeautifulSoup   Easy to learn; extensive documentation            Slow performance
Selenium        Easy to learn; can scrape non-static pages        Slow performance
                (e.g. JavaScript) via browser automation
Scrapy          Good integration with data pipelines, proxies,    More complex
                and VPNs; fast performance

Documentation of Scrapy: https://docs.scrapy.org/en/latest/


More tools

https://www.youtube.com/watch?v=n7fob_XVsbY
Challenges in Web Scraping
• Time investment
  • Each website is different and requires custom-made web scraping code
• Fragility of code
  • Web scraping code may break when the website is redesigned (even slightly)
  • Requires continual monitoring and maintenance for ongoing / production data sources
  • Website may block / IP-ban your scraper
Resources
• HTML
• https://www.youtube.com/watch?v=UB1O30fR-EE
• https://www.codecademy.com/learn/learn-html
• BeautifulSoup
• Filters applied to search the tree
• https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all
• Documentation
• https://www.crummy.com/software/BeautifulSoup/bs4/doc
• Selenium
• Documentation
• https://selenium-python.readthedocs.io/
• https://www.selenium.dev/
Questions?
