
High Powered Data and Development Economics

Scraping the Web to Generate Unique Datasets

Damian Clarke

November 24, 2013


Why Python?

- Free
- Power over the whole operating system
- Imagine if Stata had control over Firefox, image editing, Google Earth, better scientific libraries, . . .
- Quite easy to get up and scraping the web (we'll do it in 20 mins)
- If you decide you like it, it can do everything for you
- Kevin Sheppard's course, John Stachurski and Sargent's course
- Signalling?
What Do You Need?

- Unix or OS X: nothing!
- Windows: Python is not installed by default in many distributions
- For complete packages, install Anaconda (http://continuum.io/)
- It may also be useful to install a standalone text editor with syntax highlighting (e.g. gedit)
How to Run Python

- A number of ways: from the command line, interactively, or using ipython
- In the interests of time, we'll just run from the command line
- However, if you're going to run this frequently, ipython is worth checking out
- If you're interested in following along online (without downloading Python to your local machine), go to http://py-ide-online.appspot.com/
What is Web Scraping?

Essentially, the process of harvesting data that is stored directly on the web in an irregular or highly dispersed format.

- When undertaking econometric analysis, we of course want very regular data, formatted into lines and columns
- Generally two steps (a skeleton sketch follows below):
  - Looping through nested urls to get to (many) source html pages
  - Taking html and formatting into a useful structure
- There are a number of tools people use for this sort of analysis: Python, R, RapidMiner, even Matlab . . .
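To make the two steps concrete, here is a bare skeleton of the pattern we will follow (a minimal sketch only; the page range and the title pattern are illustrative, not the real example worked through later):

# Skeleton of the two scraping steps: (1) loop over urls, (2) turn html into rows.
import urllib2
import re

rows = []
for page in range(1, 4):                                  # step 1: loop through nested urls
    url = 'http://www.xkcd.com/' + str(page)
    html = urllib2.urlopen(url).read()
    titles = re.findall('<title>(.*)</title>', html)      # step 2: format the html
    for title in titles:
        rows.append([page, title])

print(rows)                                               # regular data: one row per observation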
Why do we care?

- Often (particularly in developing country settings) data is not stored directly as a csv
- In some cases, data does not yet exist in any centralised form
- This opens up many entirely different types of data we mightn't have previously thought about
- The majority of economics papers are now using 'novel' data (ie not survey based)
What can we do with it?

- It has come in handy for me many times
  - Download, unzip and merge 1000+ DHS surveys, up to date at the second that scraping takes place
  - Download all (30,000+) papers on NBER for text analysis
  - Download election results: India, Philippines
  - Repeated calls to World Bank Data Bank
- And turns up frequently in cool development papers
  - Looking at effects of natural disasters
  - Looking at effects of ports
  - Night lights, geography, bombs, weather, . . .
Figure 1: And it can look quite cool. . .

Hansen, M.C. et al. (2013) High-Resolution Global Maps of 21st-Century Forest Cover Change. Science 342 (6160): 850-853.
Coding

We will go through a relatively simple (and contrived) example.

- For this process, there are a number of tools we will use:
  - Ideally, a web browser that lets us look at source code (pretty much any of them)
  - Regular Expressions (Python's re module)
  - If this is a big job, we should think about error capture (Python's try statement; a minimal sketch of both tools follows below)
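As a quick taste of those two tools before the real example, here is a minimal sketch; the sample html line and the deliberately broken url are invented purely for illustration:

# Minimal sketch of re and try; the sample line and broken url are invented.
import re
import urllib2

line = 'Permanent link to this comic: http://xkcd.com/614/'
if re.search('Permanent link', line) is not None:   # re.search returns None when there is no match
    print(re.findall('\d+', line))                  # prints ['614']

try:                                                 # error capture for a big job
    urllib2.urlopen('http://www.xkcd.com/not-a-real-page')
except urllib2.HTTPError:
    print('request failed, but the script keeps running')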
Basic Code

# Scrape_xkcd 0.01 damiancclarke yyyy-mm-dd:2013-11-21
#---|----1----|----2----|----3----|----4----|----5----|----6----|----7----|----8
#

#*******************************************************************************
# (1) Import required packages, set-up names used in urls
#*******************************************************************************
import urllib2
import re

target = 'http://www.xkcd.com'

#*******************************************************************************
# (2) Scrape target url and print source code
#*******************************************************************************
response = urllib2.urlopen(target)
print response.read()  # read() returns the page's html source as one string

If you want to download the source code for the example we'll go through, go to http://users.ox.ac.uk/~ball3491/Python/
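The slides use Python 2's urllib2 throughout. If you happen to be following along on Python 3, the equivalent of the basic request above looks like this (a minimal sketch; nothing else about the example changes conceptually):

# Python 3 equivalent of the basic request (urllib2 became urllib.request)
from urllib.request import urlopen

target = 'http://www.xkcd.com'
response = urlopen(target)
print(response.read().decode('utf-8'))  # read() returns bytes, so decode before doing regex work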
Complete Code

# (1) Import required packages, set-up names used in urls
import urllib2
import re
target = 'http://www.xkcd.com'

# (2) Scrape target url and find the last comic number (num)
response = urllib2.urlopen(target)

for line in response:
    search = re.search('Permanent link to this comic:', line)
    if search != None:
        lastcomic = re.findall('\d*', line)

for item in lastcomic:
    if len(item) > 0:
        num = int(item)

# (3) Loop through all comics, finding each comic's title or capturing errors
for append in range(1, num+1):
    url = target + '/' + str(append)
    response = urllib2.urlopen(url)
    for line in response:
        search = re.search('ctitle', line)
        if search != None:
            print line[17:-7]
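One fragile piece above is the slice line[17:-7], which only works while the title line's html looks exactly as it did when the slides were written. A slightly more robust alternative (a sketch, not part of the original slides) is to let the regular expression capture the title itself:

# Sketch: capture the title with a regex group instead of the fixed slice line[17:-7].
# The sample line assumes xkcd's markup looks roughly like this; check the live source.
import re

line = '<div id="ctitle">Some Comic Title</div>'
match = re.search('<div id="ctitle">(.*)</div>', line)
if match is not None:
    print(match.group(1))  # prints: Some Comic Title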
Or, With Error Capture

#*******************************************************************************
# (3) Loop through all comics, finding each comic's title or capturing errors
#*******************************************************************************
for append in range(1, num+1):
    url = target + '/' + str(append)
    try:
        response = urllib2.urlopen(url)
        for line in response:
            search = re.search('ctitle', line)
            if search != None:
                print line[17:-7]
    except urllib2.HTTPError, e:
        print('%s has http error' % url)
    except urllib2.URLError, e:
        print('%s has url error' % url)
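For a long job it can also be worth collecting the urls that failed so they can be retried in a second pass. A small sketch of that idea, reusing num, target and urllib2 exactly as defined above (this is not part of the original slides):

# Sketch: store failing urls for a later retry instead of only printing a message.
failed = []
for append in range(1, num+1):
    url = target + '/' + str(append)
    try:
        response = urllib2.urlopen(url)
    except (urllib2.HTTPError, urllib2.URLError):
        failed.append(url)  # keep the url for a second pass

print('%d urls failed on the first pass' % len(failed))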
Exporting Our ‘Data’

Python is extremely capable at editing text to create output files:

#*******************************************************************************
# (4) Loop through all comics, writing each comic's number and title to a file
#*******************************************************************************
output = open('xkcd_names.txt', 'w')
output.write('Comic, Number, Title \n')

for append in range(1, num+1):
    url = target + '/' + str(append)
    response = urllib2.urlopen(url)
    for line in response:
        search = re.search('ctitle', line)
        if search != None:
            print line[17:-7]
            output.write('xkcd,' + str(append) + ',' + line[17:-7] + '\n')

output.close()
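The file this produces is just comma-separated text, so it can be pulled straight into Stata, pandas, or anything else that reads csv. A minimal sketch of reading it back with Python's standard csv module (file name as in the script above):

# Sketch: read the exported file back in with the standard csv module.
import csv

with open('xkcd_names.txt') as infile:
    reader = csv.reader(infile)
    header = next(reader)       # the 'Comic, Number, Title' header row
    for row in reader:
        print(row)              # each row is ['xkcd', number, title]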
Where to From Here

- You can actually get remarkably far with Python + a web browser + Regular Expressions!
- Sometimes you may want a more structured approach: Beautiful Soup (a short sketch follows below)
- Python can do much, much, much more
- Further applied examples at: bitbucket.org/damiancclarke
- Questions/comments?
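For reference, a minimal sketch of the same title lookup done with Beautiful Soup rather than regular expressions (this assumes the beautifulsoup4 package is installed; it is not part of the original slides):

# Sketch: find the xkcd title div with Beautiful Soup instead of a regex.
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.xkcd.com/614').read()
soup = BeautifulSoup(html, 'html.parser')

title = soup.find(id='ctitle')   # the <div id="ctitle"> element, or None if absent
if title is not None:
    print(title.get_text())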
