Daniel Burseth
Co-president MIT Big Data Explorers
dburseth@[Link]
@dmbnyc
Github: dburseth
Acronyms abound
Tremendous complexity
Use building blocks not code
This is easy
EPPM of 10 requires 500 professionals
[Link]
[Link]?emc=eta1&_r=0
Data preparation and cleansing:
Missing
Duplicative
Conventions (dates, time, geographies)
Spacing
Can we measure data cleanliness?
What's our Pareto point?
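One way to put a number on cleanliness, as a rough sketch only: count the share of rows with a blank field and the share whose key repeats an earlier row. The function name, the field names, and the sample rows are all hypothetical and are not taken from the Craigslist data set.

```python
def cleanliness_report(rows, key_field):
    """Return the share of rows with a blank field and the share that repeat a key."""
    total = len(rows)
    if total == 0:
        return {"missing_rate": 0.0, "duplicate_rate": 0.0}

    # A row counts as "missing" if any field is None or blank after stripping.
    missing = sum(
        1 for row in rows
        if any(v is None or str(v).strip() == "" for v in row.values())
    )

    # A row counts as a duplicate if its key was already seen,
    # ignoring case and runs of whitespace (the "spacing" problem).
    seen = set()
    duplicates = 0
    for row in rows:
        key = " ".join(str(row.get(key_field) or "").lower().split())
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)

    return {"missing_rate": missing / total, "duplicate_rate": duplicates / total}


# Hypothetical sample rows, for illustration only.
sample = [
    {"title": "Sunny 2BR in Cambridge", "price": "2400"},
    {"title": "sunny 2BR  in Cambridge", "price": "2400"},
    {"title": "Studio near Kendall", "price": ""},
    {"title": "3BR Somerville", "price": "3100"},
]
report = cleanliness_report(sample, "title")
print(report)
```

Tracking these two rates before and after each cleaning pass gives a crude sense of when further effort stops paying off.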
AWS -> EC2
Launch instance: ami-c6b61fae (US-EAST)
Instance type [Link]
Connect
You should see some software on the desktop
Scrape all of Craigslist's Boston apartment listings using WebHarvy
Examine, clean, and prepare the data set using OpenRefine
Map our data and apply filters using Tableau
all without writing a single line of code.
A hyper-intelligent utility to scrape website data.
SysNucleus, makers of USBTrace
Heavy-duty alternatives: Scrapy ([Link]), Beautiful Soup
HTTP://[Link]/WIRE
1. Start Config
2. Click on Hungry Mother, capture text
3. Click on Hungry Mother, capture URL
4. Click on Kendall Square/MIT, capture text
5. Click on the last review, capture text
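For the curious, the "capture text" and "capture URL" clicks above roughly correspond to pulling a link's label and its href out of the page markup. The sketch below uses only Python's standard-library HTML parser. The HTML snippet and the LinkCollector class are invented for illustration and do not reflect the real page structure or how WebHarvy works internally.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, url) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link currently being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


# Hypothetical markup; a real listing page will look different.
html = (
    '<ul><li><a href="/biz/hungry-mother">Hungry Mother</a> '
    "<span>Kendall Square/MIT</span></li></ul>"
)
parser = LinkCollector()
parser.feed(html)
print(parser.links)
```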
CLEAR
1. Mine -> Scrape a list of similar links
2. Click on Hungry Mother
Let's start collecting information in the first sub-page.
Edit Clear
Navigate into a sub-page
Start Config
Set as Next Page Link
Scheduler
Input keywords
Pause Inject (word of caution: scraping often violates TOS. Potentially not viable for apps or commercial purposes!)
TRY VISITING CRAIGSLIST IN AWS BTW!!
Proxy
Database export
Download Craigslist Boston from [Link]
Look at our data: open Boston [Link] (20k rows of mess!)
Time to CLEAN: Launch [Link]
Within MOZILLA, navigate to [Link]
Create Project -> This Computer -> Browse
Parse by tab
Create Project
1. First, sort your column.
2. Then, invoke "Re-order rows permanently" in the "Sort" dropdown menu that appears on top of the middle of the data table.
3. Then invoke Edit cells and Blank down on the Title column.
4. Then on that column, invoke menu Facet > Custom facets and Facet by blank.
5. Select true in that facet, and invoke Remove matching rows in the leftmost "All" dropdown menu.
6. Remove the facet.
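The six steps above amount to "sort by title, then keep only the first row of each run of identical titles". A minimal sketch of the same idea in plain Python, assuming each row is a dict with a "title" key (the function name and sample rows are hypothetical):

```python
def drop_duplicate_titles(rows):
    """Sort rows by title and keep the first row of each run of equal titles."""
    # sorted() is stable, so rows with the same title keep their original order.
    ordered = sorted(rows, key=lambda row: row["title"])
    kept = []
    previous_title = None
    for row in ordered:
        # "Blank down" marks a cell that repeats the one above it;
        # here we simply skip such rows instead of blanking and removing them.
        if row["title"] != previous_title:
            kept.append(row)
        previous_title = row["title"]
    return kept


# Hypothetical rows, for illustration only.
rows = [
    {"title": "Studio near Kendall", "price": "1900"},
    {"title": "2BR in Cambridge", "price": "2400"},
    {"title": "Studio near Kendall", "price": "1950"},
]
deduped = drop_duplicate_titles(rows)
print(deduped)
```

Note that this only catches exact title matches, just like the OpenRefine recipe; near-duplicates need clustering.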
Then run the To Number transform again
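As a rough picture of what a to-number pass does to a price column: strip the currency formatting, convert what parses, and leave the rest empty for review. This is a sketch with a hypothetical helper, not OpenRefine's actual transform.

```python
def to_number(text):
    """Convert a price string such as "$2,400" to a float, or None if it does not parse."""
    cleaned = (text or "").strip().replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        # Anything that is still not a plain number is left for manual review.
        return None


# Hypothetical cell values, for illustration only.
values = ["$2,400", " 1950 ", "call for price", ""]
numbers = [to_number(v) for v in values]
print(numbers)
```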
Increment the radius to 7 and make judgment calls along the way.
Change the Distance Function and do the same thing.
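To make "radius" and "distance function" concrete: the sketch below groups spellings whose Levenshtein (edit) distance to a cluster's first member is within a chosen radius. It is a simple greedy stand-in, not OpenRefine's actual nearest-neighbor implementation, and the place names are made-up examples.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def cluster(values, radius):
    """Greedily assign each value to the first cluster whose seed is within radius."""
    clusters = []
    for value in values:
        for group in clusters:
            if levenshtein(value.lower(), group[0].lower()) <= radius:
                group.append(value)
                break
        else:
            clusters.append([value])
    return clusters


# Hypothetical spellings, for illustration only.
places = ["Cambridge", "Cambrdige", "Somerville", "Sommerville", "Boston"]
groups = cluster(places, radius=2)
print(groups)
```

A larger radius merges more aggressively, which is why each suggested merge still needs a human judgment call.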
Looks like we have SOME really expensive real estate. Data errors?
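One quick way to surface suspicious prices is to flag anything far above the median. The sketch below does that with the standard library; the factor of 5 and the sample prices are arbitrary assumptions, not values from the workshop data.

```python
from statistics import median


def flag_errant_prices(prices, factor=5):
    """Return the prices that exceed factor times the median price."""
    if not prices:
        return []
    threshold = factor * median(prices)
    return [price for price in prices if price > threshold]


# Hypothetical monthly rents, with one likely sale price mixed in.
prices = [1800, 2400, 2100, 3100, 2500000]
suspects = flag_errant_prices(prices)
print(suspects)
```

The median is used rather than the mean because a single huge value barely moves it.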
Boston [Link]
Load Boston [Link]
Go to Worksheet
Great semantic example. Tableau understands that this text translates to a lat/long
Look on the map in the lower right corner
Let's Filter Data
Under Measures, drag Price onto size in Marks
Change sum(Price) to avg(Price)
Drag Price into Filters, change it to max(Price), and select At Most
Right-click on the filter and select Show Quick Filter
Drag City onto Label
Menu Map -> Map Options
Click on a node for info and drill-down options
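Behind the map, the view is essentially "average price per city, ignoring listings above a ceiling". A minimal sketch of that aggregation follows. It applies the ceiling row by row, which is a simplification of Tableau's max(Price) filter, and the field names and rows are hypothetical.

```python
from collections import defaultdict


def average_price_by_city(rows, max_price):
    """Average the price per city, skipping rows priced above max_price."""
    totals = defaultdict(lambda: [0.0, 0])  # city -> [sum of prices, row count]
    for row in rows:
        if row["price"] <= max_price:
            totals[row["city"]][0] += row["price"]
            totals[row["city"]][1] += 1
    return {city: total / count for city, (total, count) in totals.items()}


# Hypothetical listings, for illustration only.
listings = [
    {"city": "Boston", "price": 2000},
    {"city": "Boston", "price": 3000},
    {"city": "Cambridge", "price": 2500},
    {"city": "Boston", "price": 2500000},
]
averages = average_price_by_city(listings, max_price=10000)
print(averages)
```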
1. Explored various webpage structures and scraped them
2. Exported the data to Refine
3. Parsed columns to extract critical price and location information
4. Used clustering algorithms to merge related geographies
5. Applied filters to identify errant prices
6. Exported the data to Tableau
7. Completed a cursory mapping visualization
Please come talk to me