Data Science and Journalism:
Mining Data from Wikipedia’s Coverage of Current Events
Professor Brian Keegan
Department of Information Science
brian.keegan@colorado.edu
@bkeegan
But Wikipedia isn’t a newspaper…
Not a new development…
Most-Edited Articles by Month in 2015
• Jan: Charlie Hebdo shooting
• Feb: Super Bowl XLIX
• Mar: Germanwings Flight 9525
• Apr: Nepal earthquake
• May: UK general election
• Jun: Jurassic World
• Jul: Donald Trump
• Aug: Tianjin explosions
• Sep: Jeremy Corbyn
• Oct: Umpqua College shooting
• Nov: Paris attacks
• Dec: Star Wars VII
https://stats.wikimedia.org/EN/TablesWikipediaEN.htm
How do these collaborations work?
High tempo collaboration dynamics
In the first 24 hours:
• First shots fired at 2:02 am
• WP article created at 2:52 am
• 1694 revisions
• 292 unique editors
• 31 seconds between edits
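Figures like the ones above can be derived from a list of revision timestamps and editor names. The following is a minimal sketch, not the method used for the slide: the `(timestamp, editor)` input shape, the function name `first_day_stats`, and the sample data are all hypothetical, and the median is used for the gap between edits only as an assumption, since the slide does not say which average it reports.

```python
from datetime import datetime, timedelta
from statistics import median


def first_day_stats(revisions):
    """Summarize the first 24 hours of a revision history.

    `revisions` is a list of (timestamp, editor) pairs; this input
    shape is an assumption made for the example.
    """
    if not revisions:
        raise ValueError("need at least one revision")
    # Sort by time so the first element is the article's creation.
    ordered = sorted(revisions, key=lambda rev: rev[0])
    created = ordered[0][0]
    cutoff = created + timedelta(hours=24)
    window = [rev for rev in ordered if rev[0] < cutoff]
    # Seconds elapsed between each pair of consecutive revisions.
    gaps = [
        (later[0] - earlier[0]).total_seconds()
        for earlier, later in zip(window, window[1:])
    ]
    return {
        "revisions": len(window),
        "unique_editors": len({editor for _, editor in window}),
        "median_gap_seconds": median(gaps) if gaps else None,
    }


# Made-up sample data: four edits in the first day, one a day later.
start = datetime(2020, 1, 1, 3, 0, 0)
sample = [
    (start, "EditorA"),
    (start + timedelta(seconds=30), "EditorB"),
    (start + timedelta(seconds=60), "EditorA"),
    (start + timedelta(seconds=100), "EditorC"),
    (start + timedelta(hours=25), "EditorD"),
]
stats = first_day_stats(sample)
```

The 24-hour window here starts at the first revision rather than at the event itself, which is one of several reasonable choices.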
Complete revision history going back to 2002
First revision – 2001
First revision – 2004
Hourly pageview statistics going back to 2008
http://www.brianckeegan.com/2014/12/the-news-on-wikipedia-in-2014/
Semantic web adoption
Dozens of language editions
Free and powerful API
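As an illustration of what the API offers, the sketch below builds a MediaWiki Action API request for an article's earliest revisions. It assumes the English Wikipedia endpoint and the standard `action=query` / `prop=revisions` parameters; the helper names (`revisions_query`, `build_url`, `fetch_revisions`), the example title, and the User-Agent string are hypothetical choices made for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# English Wikipedia endpoint; other language editions use their own host.
API_URL = "https://en.wikipedia.org/w/api.php"


def revisions_query(title, limit=50):
    """Parameters asking for a page's oldest revisions (time and user)."""
    return {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user",
        "rvlimit": limit,
        "rvdir": "newer",  # oldest revisions first
        "format": "json",
        "formatversion": 2,
    }


def build_url(title, limit=50):
    """Return the full request URL for the query above."""
    return API_URL + "?" + urlencode(revisions_query(title, limit))


def fetch_revisions(title, limit=50):
    """Fetch (timestamp, user) pairs; requires network access."""
    request = Request(
        build_url(title, limit),
        # Placeholder User-Agent; replace with real contact details.
        headers={"User-Agent": "slide-example/0.1 (hypothetical)"},
    )
    with urlopen(request) as response:
        data = json.load(response)
    page = data["query"]["pages"][0]
    return [
        (rev["timestamp"], rev.get("user"))
        for rev in page.get("revisions", [])
    ]


# Building the URL needs no network access.
example_url = build_url("Jurassic World", limit=10)
```

Calling `fetch_revisions("Jurassic World", limit=10)` would issue the request. Longer histories need the API's continuation mechanism, which this sketch leaves out.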
What were the biggest stories of the year?
http://www.brianckeegan.com/2014/12/the-news-on-wikipedia-in-2014/
Forecasting and prediction
Is Wikipedia’s information supply meeting demand?
http://www.brianckeegan.com/2014/12/the-news-on-wikipedia-in-2014/
Who is contributing to these articles?
All collaborators on 2014 events
First 24 hours after each article’s creation
http://www.brianckeegan.com/2014/12/the-news-on-wikipedia-in-2014/
Non-breaking / Breaking
How do other Wikipedians contribute?
data journalism
information science
computational social science
human-centered data science
statistics
programming
network analysis
machine learning
experimental design
natural language processing
information visualization
participatory design
action research
crowdsourcing
survey design
ethnography
data mining
Thank you!
www.brianckeegan.com
@bkeegan
http://www.dutiee.com/wp-content/uploads/nonprofitdata11.jpg

#ONA16 - Data Science and Journalism