Session 3: Data Acquisition

This document provides an introduction to web scraping, including:
- An overview of HTML page structure and how web pages are delivered to browsers.
- An outline of the basic steps for web scraping with the Requests and BeautifulSoup Python modules: sending requests, parsing responses, and saving extracted data.
- An explanation of how Selenium can scrape sites with non-static or JavaScript-rendered content by programmatically controlling a browser.
- Some challenges of web scraping, such as the fragility of code when sites change and the potential blocking of scrapers.
- Resources for learning more about the HTML, BeautifulSoup, Selenium, and Scrapy web scraping tools.


Innovation and Marketing Analytics
Prof. Qiaoni Shi
Questions?
Today’s Plan
• Introduction to Web scraping
• Web scraping with requests & BeautifulSoup
• Web scraping with Selenium
Introduction to Web scraping
Web page
• Webpages are (mostly) written in HTML
• Web pages are delivered to the user's browser exactly as stored
• Each webpage is a separate HTML file
Web page
• HTML: HyperText Markup Language
Tree-like Structure of an HTML Page
HTML tags

Tag          Name/function
<head>       Head of an HTML document; contains elements describing the document
<body>       Body of an HTML document; the content of the web page
<h1>…<h6>    Headings
<p>          Paragraph
<div>        A block/section
<span>       An inline section
<a>          A link
<li>         A list item
<ul>         An unordered list
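
To make the tree structure concrete, here is a minimal sketch: a tiny invented page built from the tags above, parsed with BeautifulSoup (covered later in this session) and printed with one level of indentation per level of nesting.

from bs4 import BeautifulSoup

# a small invented page, purely for illustration
html = """
<html>
  <head><title>Example</title></head>
  <body>
    <h1>Heading</h1>
    <div>
      <p>A paragraph with <a href="https://example.com">a link</a>.</p>
      <ul><li>First item</li><li>Second item</li></ul>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

def show_tree(tag, depth=0):
    # print each tag's name, indented by its depth in the tree
    print("  " * depth + tag.name)
    for child in tag.find_all(recursive=False):
        show_tree(child, depth + 1)

show_tree(soup.html)   # html > head, body > h1, div > p, ul > a, li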
HTML Resources
• https://www.youtube.com/watch?v=UB1O30fR-EE
• https://www.codecademy.com/learn/learn-html
• https://www.w3schools.com/html/html_intro.asp
• Mac OS
  • Chrome -> Developer -> View Source
  • Command + Shift + Option
• Windows
  • Chrome -> right click -> Inspect

Source: Chris Bail
Exercise

• Pick a webpage and check the following items:
  • How is it organized?
  • Where is the head and where is the body?
  • Is it a tree-like structure?
Outline of basic web scraping
Web scraping with requests & BeautifulSoup
Steps
Step 1 Request Information

You need to request information from the URL and get the HTML text data.

https://m.imdb.com/title/tt1160419/
Send request

import requests

url = "https://m.imdb.com/title/tt1160419/"
page = requests.get(url)   # send an HTTP GET request to the URL
page.content               # raw bytes of the HTML response
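
Before parsing, it is worth confirming that the request actually succeeded. A minimal sketch continuing the snippet above (raise_for_status, status_code, and text are all standard parts of the requests response API):

page.raise_for_status()   # raises an HTTPError for 4xx/5xx responses
print(page.status_code)   # 200 means the request succeeded
html_text = page.text     # decoded HTML as a string (page.content is raw bytes)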
Tree-like Structure of an HTML Page
Step 2 Parse Information

What we want
Parsing Data

from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')

• Locate the information we want
soup.find("h1").get_text()   # text of the first <h1> tag
soup.find_all()              # returns all matching tags as a list
Parsing Data

soup.find("div", {"class": "…"})
soup.find(id="…")
soup.find_all("span", class_="…")
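
Putting Steps 1 and 2 together, a minimal end-to-end sketch for the IMDb page from the earlier slide. The assumption that the movie title sits in the page's first <h1> tag is for illustration; tag and class names vary by site and can change whenever the page is redesigned, and some sites reject requests that lack a browser-like User-Agent header.

import requests
from bs4 import BeautifulSoup

url = "https://m.imdb.com/title/tt1160419/"
page = requests.get(url)

soup = BeautifulSoup(page.content, "html.parser")
title = soup.find("h1").get_text()   # assumes the title is in the first <h1>
spans = soup.find_all("span")        # every <span> tag, returned as a list
print(title, len(spans))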
Web scraping with Selenium
Outline of basic web scraping

Selenium
Selenium is a Python module that controls a browser to open a webpage and extract data from it.
A unique advantage of Selenium

• Selenium can handle non-static webpages whose content is hidden behind code (e.g. JavaScript)
• How? Selenium can interact with the browser. For example, Selenium can click on a button / link / dropdown menu, etc.
Step 0 Import Modules

# install firefox, geckodriver, and selenium (a Colab-style setup)
!apt-get update
!pip install selenium
!apt install firefox-geckodriver
!cp /usr/lib/geckodriver /usr/bin
!cp /usr/lib/firefox /usr/bin

from selenium import webdriver

binary = '/usr/bin/firefox'
options = webdriver.FirefoxOptions()
options.binary = binary
options.add_argument('--headless')   # run the browser without a visible window
driver = webdriver.Firefox(options=options, executable_path='/usr/bin/geckodriver')
Step 1&2 Send Requests, Parse Data

driver = webdriver.Firefox(options=options, executable_path='/usr/bin/geckodriver')
driver.get(url)        # open the page in the controlled browser
driver.page_source     # the rendered HTML, including JavaScript-generated content
Selenium

from selenium.webdriver.common.by import By

.find_element(By.CLASS_NAME, "")     # first element matching a class name
.find_element(By.XPATH, "")          # first element matching an XPath expression
.find_elements(By.CLASS_NAME, "")    # all matching elements, as a list
.find_elements(By.XPATH, "")

e.g.,
driver.find_elements(By.CLASS_NAME, "review-container")
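
A short sketch of what one typically does with the matches, assuming a driver is already open (Step 0) on a page that uses the "review-container" class from the example above:

reviews = driver.find_elements(By.CLASS_NAME, "review-container")
review_texts = [r.text for r in reviews]   # .text is an element's visible text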
Selenium locator
Selenium
Browser interaction

.click() instructs the browser to click on the element

Select() (from selenium.webdriver.support.ui) wraps a dropdown box so one of its options can be chosen

Example:

from selenium.webdriver.support.ui import Select

dropdown_box = Select(elem)
dropdown_box.select_by_visible_text('Most recent')

The first line wraps the dropdown box referred to by elem; the second instructs the browser to choose the option with the text 'Most recent'.
Selenium
Browser interaction

.back() instructs the browser to go back one page

.forward() instructs the browser to go forward one page
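
A sketch tying the Selenium pieces together. The URL reuses the earlier IMDb example, and the idea that the sort control is a plain <select> element is an assumption for illustration (many real pages use custom dropdown widgets instead, which need .click() rather than Select):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

# headless Firefox, set up as in Step 0
options = webdriver.FirefoxOptions()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options, executable_path='/usr/bin/geckodriver')

driver.get("https://m.imdb.com/title/tt1160419/")

# hypothetical: sort the reviews, if the page exposes a real <select> dropdown
elem = driver.find_element(By.TAG_NAME, "select")
Select(elem).select_by_visible_text('Most recent')

reviews = driver.find_elements(By.CLASS_NAME, "review-container")
texts = [r.text for r in reviews]

driver.back()    # go back one page
driver.quit()    # close the browser when finished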


Step 3 Save Data (Pandas)

import pandas as pd

final_dict = {'v1': list1, 'v2': list2}   # one scraped list per column
df = pd.DataFrame(final_dict)
df.to_csv()                               # pass a file path to write to disk
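
A concrete version of the same step with invented data, just to show the shape (the column names and file name are illustrative):

import pandas as pd

titles = ["Movie A", "Movie B"]   # hypothetical scraped lists
ratings = [8.1, 7.4]

df = pd.DataFrame({"title": titles, "rating": ratings})
df.to_csv("scraped.csv", index=False)   # index=False drops the row numbers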
More tools
Comparison of Web Scraping Tools

Tool            Pros                                              Cons
BeautifulSoup   Easy to learn; extensive documentation            Slow performance
Selenium        Easy to learn; can scrape non-static pages        Slow performance
                (e.g. JavaScript) via browser automation
Scrapy          Good integration with data pipelines, proxies,    More complex
                and VPNs; fast performance

Documentation of Scrapy: https://docs.scrapy.org/en/latest/


More tools

https://www.youtube.com/watch?v=n7fob_XVsbY
Challenges in Web Scraping
• Time investment
  • Each website is different and requires custom-made web scraping code
• Fragility of code
  • Web scraping code may break when the website is redesigned (even slightly)
  • Requires continual monitoring and maintenance for ongoing / production data sources
  • Website may block / IP-ban your scraper
Resources
• HTML
• https://www.youtube.com/watch?v=UB1O30fR-EE
• https://www.codecademy.com/learn/learn-html
• BeautifulSoup
• Filters applied to search the tree
• https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all
• Documentation
• https://www.crummy.com/software/BeautifulSoup/bs4/doc
• Selenium
• Documentation
• https://selenium-python.readthedocs.io/
• https://www.selenium.dev/
Questions?
