Defining a Digital Storytelling Discipline: Learning,
Skills, and Knowledge
John Wihbey
Northeastern University
@wihbey
Case study:
Northeastern University
undergrads working on Boston
Police Department data, “as is” -
in a general digital skills course
Murder data from the 1960s
City of Boston - Homicide data obtained through public
records request
Murder data from the 1990s
City of Boston - Homicide data obtained through public
records request
Murder data from the 2000s
City of Boston - Homicide data obtained through public records
request
(http://the-accidental-housewife.blogspot.com/)
29 problems.
1 assignment.
List of problems/errors in structure and format of homicide data
1. Inconsistencies in case column, e.g. “01/06” vs “ ’09/06 ”
2. No indication of meaning of red text
3. No key for case column IDs
4. Different text formats/styles for entire rows and cells
5. Inconsistent descriptions of intersection addresses, e.g. “Washington @ Cedar St” vs “Willowood & Woodrow Ave” vs
“Shawmut Ave. / Dwight”
6. No key for weapon codes
7. Race and gender are collapsed into single column
8. No key for race/gender codes (race: “W”, “B”, gender: “M”, “F”, “H”)
9. Some R/G codes are “W/H/M”, “B/N/H” making it impossible to systematically split columns into 2 using the only
delimiting character (/)
10. Some R/G codes have NO delimiter (e.g., 2009 sheet), so cannot split at all
11. Data for 2007 and later have two additional columns not in 2006: “defendant” and “DOA” (no indication of what DOA
means)
12. Some rows have merged cells
13. Some merged cells have multiple values
14. Missing data/empty cells – what do these mean?
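Problems 7 through 10 can be made concrete with a minimal Python sketch of splitting the combined race/gender cell. The function name `split_rg` is hypothetical, and because the data ships with no key, the sketch assumes only that “M” and “F” are sex codes. Every other token is kept verbatim rather than interpreted.

```python
# Assumption: "M" and "F" are the only sex codes. The source data has no key,
# so all other tokens (e.g. "W", "B", "H", "N") are passed through untouched.
SEX_CODES = {"M", "F"}


def split_rg(raw):
    """Split a combined R/G cell into (other_codes, sex_code_or_None)."""
    text = (raw or "").strip().upper()
    if not text:
        return [], None
    if "/" in text:
        tokens = [t.strip() for t in text.split("/") if t.strip()]
    elif len(text) == 2:
        # No delimiter (as in the 2009 sheet): split two-letter codes only.
        tokens = list(text)
    else:
        # Longer undelimited values (e.g. "UNK") are kept whole, not guessed at.
        tokens = [text]
    sex = [t for t in tokens if t in SEX_CODES]
    other = [t for t in tokens if t not in SEX_CODES]
    # Zero or several sex tokens is ambiguous: return None instead of guessing.
    return other, (sex[0] if len(sex) == 1 else None)


for cell in ["W/M", "W/H/M", "B/N/H", "bm", "Unk", ""]:
    print(repr(cell), "->", split_rg(cell))
```

A value such as “B/N/H” comes back with no sex code at all, which is the point of problem 9: the cell cannot be split reliably without a codebook from the data provider.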
List of problems/errors in structure and format of homicide data (cont.)
15. Location data is incomplete – no zip code information and Boston assumed as city (except in cases where “Dor” is appended at end of
address)
16. Only the first couple of sheets have column headers; for the remaining sheets, headers have to be assumed to follow those of the
labeled sheets
17. Mysterious unexplained extra characters in date columns (e.g., (w) and xxx)
18. Inconsistent syntax for times: 12:00am, 7:10pm, 02:16am, 2:56hrs, 15oo hrs (Letter “o” instead of number 0), 1:49 AM, 7:24 PM,
21:25, 21:32 PM, 12:39P.M., 2:22p.m., 1:49:pm, 12;24AM, :14 am
19. Inconsistent syntax for dates: “07/24/2006” vs “2006/7/24” vs “6/31/64” vs “08/31/06”
20. Inconsistent syntax for age: “1’7”
21. Sheets for 2012/13/14 have new columns not in previous sheets
22. Motive/Relation columns look identical but are not both present in all sheets; in sheets without column headers it is impossible to
know which column is which
23. Simple spelling errors: “Tauma”
24. Inconsistent coding: Unk, UNK, unk
25. Unexplained “DV” column that only appears in 2013
26. No explanation for meaning of row breaks – are these separating data rows into groups of some sort? Are these data that once existed
but were removed?
27. Multiple columns with same (non-unique) headers – “R/G”, “Age”, “DOB” for both VICTIM and DEFENDANTS
28. Inconsistent district labels and squadron personnel names
29. For merged cells containing multiple values/names, have to assume the corresponding values in adjacent cells are provided in the same
order
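The time formats in problem 18 can be normalized with a short Python sketch like the one below. It is illustrative only, not the cleaning the students actually did. The function name `normalize_time` is hypothetical, and it rests on two loudly stated assumptions: a value with an hour above 12 is read as 24-hour time even if “PM” is attached (as in “21:32 PM”), and a value with no recoverable hour (as in “:14 am”) is returned as `None` rather than guessed.

```python
import re

# Optional hour, optional colon, two-digit minute, optional stray colon,
# optional am/pm/hrs suffix.
_TIME = re.compile(r"(\d{0,2}):?(\d{2})\s*:?\s*(am|pm|hrs)?")


def normalize_time(raw):
    """Return the time as 'HH:MM' (24-hour), or None if it cannot be recovered."""
    s = (raw or "").strip().lower()
    # Letter "o" typed instead of zero, e.g. "15oo hrs" -> "1500 hrs".
    s = re.sub(r"(?<=\d)o+", lambda m: "0" * len(m.group()), s)
    # "p.m." -> "pm", and ";" used in place of ":".
    s = s.replace(".", "").replace(";", ":")
    m = _TIME.fullmatch(s)
    if not m or not m.group(1):
        return None  # unparseable, or hour missing
    hour, minute, suffix = int(m.group(1)), int(m.group(2)), m.group(3)
    if hour > 23 or minute > 59:
        return None
    # Assumption: am/pm only applies to 12-hour values; "21:32 pm" stays 21:32.
    if suffix in ("am", "pm") and 1 <= hour <= 12:
        hour = hour % 12 + (12 if suffix == "pm" else 0)
    return f"{hour:02d}:{minute:02d}"


for value in ["12:00am", "15oo hrs", "21:32 PM", "12:39P.M.", "1:49:pm", ":14 am"]:
    print(repr(value), "->", normalize_time(value))
```

Returning `None` for unrecoverable values keeps them visible as missing data instead of silently inventing a time.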
Existential experience of:
#datafail & #GIGO risk
Good student requests for clarification
Many noble student attempts at
cleaning, analysis, exploration,
viz:
Data viz using Plot.ly
Data viz using Carto
Google Maps
Fun with line graphs - an attempt to look at time-of-day
patterns
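A time-of-day line graph of this kind needs one count per hour. The sketch below shows one way to build that series in Python, assuming the times have already been cleaned into 24-hour “HH:MM” strings. The function name `incidents_by_hour` and the sample values are hypothetical, not taken from the Boston data.

```python
from collections import Counter


def incidents_by_hour(times):
    """Count incidents per hour of day from 'HH:MM' strings; skips missing values."""
    counts = Counter(int(t[:2]) for t in times if t)
    # Always return 24 values so that empty hours plot as zero.
    return [counts.get(hour, 0) for hour in range(24)]


# Hypothetical sample values, for illustration only.
sample = ["19:10", "19:24", "00:24", None, "21:32"]
print(incidents_by_hour(sample))
```

The resulting 24-element list can be passed as the y-values of a line chart in whichever plotting tool is in use.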
Experiments in viz for exploratory purposes
D'où Venons Nous / Que Sommes Nous / Où Allons Nous - Paul Gauguin, 1897
2000
2005
2016
Wikimedia
1774
(Nielsen 2006 via
G. Chaucer, The Canterbury Tales (courtesy: library.arizona.edu)
1400s
https://projects.propublica.org/docdollars/
Fatal Force
https://www.washingtonpost.com/graphics/national/police-shootings/
http://www.nytimes.com/interactive/2012/01/15/business/one-percent-map.html
http://www.npr.org/news/graphics/2011/10/toxic-air/#4.00/39.00/-84.00
https://offshoreleaks.icij.org/#_ga=1.76851094.2020983486.1475355003
http://projects.latimes.com/value-added/
Pew Research Center
Polarized Crowd: Two large dense groups with little interconnection
Pew Research Center
Tight Crowd: Highly interconnected group with few isolated participants
Pew Research Center
Brand clusters: Products, services, celebrities discussed by disparate persons
Pew Research Center
Community clusters: Popular topics attracting multiple smaller groups
Pew Research Center
Broadcast networks: Media-centric, with audience proliferating information
Pew Research Center
Support network: Customer complaints, with hub-and-spoke dynamics
Six degrees - Wikimedia
Facebook friends network
Nebraska local politicians; network graph - Matt Waite
http://www.poynter.org/2013/how-to-visually-explore-local-politics-with-network-graphs/218543/
Chicago community - homicide network (Andrew Papachristos et al.)
NYTimes.com
Gary King, et al, IQSS, Harvard
Global supply chain, Sourcemap.com
Opte Project
D'où Venons Nous / Que Sommes Nous / Où Allons Nous - Paul Gauguin, 1897
John Wihbey
Northeastern University
@wihbey
j.wihbey@northeastern.edu

Social Media Academy 2016 Keynote Presentation, John Wihbey