Secondary data analysis
  with digital trace data

Examples from FLOSS research

         Andrea Wiggins
         13 July 2011
Secondary Data Analysis
•   Uses existing data produced or collected by
    someone else, usually for a different purpose
    •   Databases
    •   Repositories
    •   Surveys
    •   Emails
    •   Social networks
Digital Trace Data
•   Records of activity (trace data) undertaken through
    an online information system (thus digital)
•   Increasingly common in studies of online
    phenomena
    •   Large volumes of available data
    •   Can be complete: a census, not a sample
    •   May be more reliably recorded than other data

Characteristics


1. Found data (not produced for research)
2. Event-based data (not summary data)
3. Events occur over time, so the data are longitudinal




Requirements
•   Understand the original data source
    •   How it was collected, potential problems
    •   Limitations of the sample
    •   What the data describe
•   Match with appropriate analysis methods and measures
    •   New types of data may require new measures
    •   Theoretical coherence is very important
Advantages
•   Data may be “complete”
    •   Usually no response bias (exception: cookies)
    •   May cover long periods of time and large groups
    •   Multiple different data types, but mostly textual
•   Data are often easy to acquire
    •   APIs or scraping web pages (with caution)
    •   Databases, archives, or repositories of research data
•   But remember: you usually get what you pay for!
Disadvantages
•   Often difficult to know limitations of data
    •   Data may be poorly documented
    •   Original creator may not be available for comment
•   Volume of data can be overwhelming
    •   Sampling strategies needed, e.g., temporal, random
    •   Substantial time required for data preparation: 90% of effort
    •   Exceptions are everywhere and will break analyses, but can
        only be discovered through trial and error
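The two sampling strategies named above (temporal and random) can be sketched in a few lines. This is a minimal illustration, assuming each event is a `(timestamp, payload)` tuple; the function names `temporal_sample` and `random_sample` and the example events are hypothetical and do not come from the study.

```python
import random
from datetime import datetime

def temporal_sample(events, start, end):
    """Keep only events whose timestamp falls in the half-open interval [start, end)."""
    return [e for e in events if start <= e[0] < end]

def random_sample(events, k, seed=0):
    """Draw up to k events uniformly at random, without replacement.

    The seed is fixed so that the same sample can be drawn again later.
    """
    rng = random.Random(seed)
    return rng.sample(events, min(k, len(events)))

# Hypothetical data: one event on the first day of each month of 2005.
events = [(datetime(2005, m, 1), f"msg-{m}") for m in range(1, 13)]
first_quarter = temporal_sample(events, datetime(2005, 1, 1), datetime(2005, 4, 1))
subset = random_sample(events, 5)
```

Fixing the seed is a design choice that keeps a random sample reproducible, which matters when exceptions found later force the analysis to be rerun.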

Example: Email Networks
•   Data source: email listservs for FLOSS projects
•   Analysis approach: create social networks
    •   Within discussion threads, individuals are nodes, and links
        are reply-to messages
    •   Some conceptual issues for interpretation, choice of
        measures
•   Technical challenges
    •   Temporal aggregation
    •   Identity resolution
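The network construction described above (individuals as nodes, reply-to messages as links) and the identity resolution step can be sketched as follows. This is only an illustration under stated assumptions: the message fields (`id`, `sender`, `in_reply_to`), the alias table, and the addresses are all hypothetical, and real identity resolution usually needs more than a lookup table.

```python
from collections import Counter

# Hypothetical alias table: maps each observed address to one canonical person.
ALIASES = {
    "alice@example.org": "alice",
    "alice@users.example.net": "alice",
    "bob@example.org": "bob",
    "carol@example.org": "carol",
}

def resolve(address):
    """Identity resolution: normalise case and whitespace, then look up the alias table."""
    key = address.strip().lower()
    return ALIASES.get(key, key)

def reply_edges(messages):
    """Build weighted directed edges (replier -> original sender) from reply-to links."""
    sender_of = {m["id"]: resolve(m["sender"]) for m in messages}
    edges = Counter()
    for m in messages:
        parent = m["in_reply_to"]
        if parent is None or parent not in sender_of:
            continue  # thread starter, or parent message missing from the archive
        src, dst = sender_of[m["id"]], sender_of[parent]
        if src != dst:  # ignore replies to oneself
            edges[(src, dst)] += 1
    return edges

messages = [
    {"id": 1, "sender": "alice@example.org", "in_reply_to": None},
    {"id": 2, "sender": "Bob@example.org", "in_reply_to": 1},
    {"id": 3, "sender": "alice@users.example.net", "in_reply_to": 2},
    {"id": 4, "sender": "carol@example.org", "in_reply_to": 1},
    {"id": 5, "sender": "bob@example.org", "in_reply_to": 3},
]
edges = reply_edges(messages)
```

Without the alias table, the two addresses used by "alice" would appear as two separate nodes, which would distort any measure computed on the network.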
Figures from Howison et al., 2006


Temporal Aggregation
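Temporal aggregation means cutting the event stream into time windows and building one network snapshot per window. A minimal sketch follows; the window length, the step, and the example events are assumptions chosen for illustration, not values taken from Howison et al.

```python
from datetime import datetime, timedelta

def windows(events, start, end, length, step):
    """Yield (window_start, events_in_window) for windows of `length` advanced by `step`.

    With step < length the windows overlap (a sliding window); with step == length they tile.
    """
    t = start
    while t < end:
        yield t, [e for e in events if t <= e[0] < t + length]
        t += step

# Hypothetical reply events: (timestamp, replier, original sender).
events = [
    (datetime(2005, 1, 10), "bob", "alice"),
    (datetime(2005, 2, 20), "carol", "alice"),
    (datetime(2005, 4, 5), "alice", "bob"),
]
snapshots = list(windows(events, datetime(2005, 1, 1), datetime(2005, 5, 1),
                         length=timedelta(days=60), step=timedelta(days=30)))
counts = [len(evts) for _, evts in snapshots]
```

The choice of window length is itself an analytic decision: short windows give sparse networks, long windows blur change over time.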
Network Workflow
Network Results
•   Different levels of correlation between venues, suggesting different
    types of interactions
•   User venues more decentralized than developer venues, reflecting
    greater number of participants
•   Overall trend toward decentralization could be result of different
    influences
•   Observed anomalous patterns in trackers for both projects: periodic
    centralization spikes
Cleaning up before shutting down
•   A single user makes batch bug closings (up to 279!)
    •   Fire’s (feature request) tracker housekeeping appears to be
        preparation for project closure
    •   Gaim’s tracker housekeeping was more regular and repeated
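To see why a single user closing many bugs in a batch produces a centralization spike, consider one common measure, Freeman's degree centralization. The slides do not state which measure was used, so this choice is an assumption, as are the node names in the example.

```python
def degree_centralization(edges):
    """Freeman degree centralization of an undirected simple graph given as (u, v) pairs.

    Returns 1.0 for a star (one hub linked to every other node) and 0.0 when
    all nodes have the same degree.
    """
    neighbours = {}
    for u, v in edges:
        if u == v:
            continue  # skip self-loops
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    n = len(neighbours)
    if n < 3:
        return 0.0  # assumption: treat graphs too small for the measure as 0.0
    degrees = [len(s) for s in neighbours.values()]
    top = max(degrees)
    return sum(top - d for d in degrees) / ((n - 1) * (n - 2))

# One user touching every report in a period looks like a star.
star = [("closer", f"reporter{i}") for i in range(5)]
# Evenly spread interaction looks like a ring.
ring = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
```

Computed per time window, a period of batch housekeeping would resemble the star and push the value toward 1, while ordinary periods would sit lower.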
Example: Classification
•   Replication of success-tragedy classification
    •   Classification criteria originally drawn from
        interviews with community members
    •   Data extracted from repositories
•   Technical challenges
    •   Merging data from two repositories
    •   Processing large volume of data in multiple steps
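Merging data from two repositories comes down to joining records on a shared key and keeping track of what fails to match. A minimal sketch, assuming records are dictionaries keyed by project name; the normalisation rule, the field names, and the project names are hypothetical.

```python
def normalise(name):
    """Hypothetical key normalisation: lower-case and strip surrounding whitespace."""
    return name.strip().lower()

def merge_sources(primary, secondary):
    """Outer-join two lists of project records on the normalised project name.

    Fields from `primary` win on conflict. Returns (merged_by_key, unmatched_keys),
    where unmatched_keys are projects found in only one of the two sources.
    """
    a = {normalise(r["name"]): r for r in primary}
    b = {normalise(r["name"]): r for r in secondary}
    merged = {k: {**b.get(k, {}), **a.get(k, {})} for k in a.keys() | b.keys()}
    unmatched = a.keys() ^ b.keys()
    return merged, unmatched

repo_a = [{"name": "Alpha", "downloads": 120}, {"name": "beta", "downloads": 0}]
repo_b = [{"name": "alpha ", "releases": 3}, {"name": "gamma", "releases": 1}]
merged, unmatched = merge_sources(repo_a, repo_b)
```

Reporting the unmatched keys, rather than silently dropping them, makes it possible to tell whether a change in totals comes from the data or from the merge.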
Variables
•   Inputs: project names and 5 threshold values for
    classification tests, e.g. number of downloads
•   Project statistics retrieved from repositories
    •   Founding date
    •   Data collection date
    •   Dates for all releases
    •   Number of downloads
    •   URL
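A threshold-based classifier over these variables can be sketched as below. Everything specific here is an assumption: the five threshold names and values are placeholders, the rules are simplified, and the class labels (II, TI, IG, TG, SG) are read as indeterminate, tragedy, or success in the initiation or growth stage. This is not the classification procedure used in the study.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class Thresholds:
    # All five values are illustrative placeholders, not the ones used in the study.
    initiation_max_days: int = 365
    success_min_releases: int = 3
    success_min_span_days: int = 180
    success_min_downloads: int = 10
    growth_max_idle_days: int = 365

@dataclass
class Project:
    name: str
    founded: Optional[date]   # founding date (None if missing)
    collected: date           # data collection date
    downloads: int = 0
    releases: List[date] = field(default_factory=list)

def classify(p, t=None):
    """Apply simplified, hypothetical threshold tests and return a class label."""
    t = t or Thresholds()
    if p.founded is None:
        return "unclassifiable"  # missing data: no test can be applied
    releases = sorted(p.releases)
    if not releases:
        # Initiation stage: no release yet.
        age = (p.collected - p.founded).days
        return "TI" if age >= t.initiation_max_days else "II"
    # Growth stage: at least one release.
    span = (releases[-1] - releases[0]).days
    if (len(releases) >= t.success_min_releases
            and span >= t.success_min_span_days
            and p.downloads >= t.success_min_downloads):
        return "SG"
    idle = (p.collected - releases[-1]).days
    return "TG" if idle >= t.growth_max_idle_days else "IG"

when = date(2006, 6, 1)
stalled = Project("stalled", date(2005, 1, 1), when)
active = Project("active", date(2003, 1, 1), when, downloads=500,
                 releases=[date(2004, 1, 1), date(2004, 6, 1), date(2005, 3, 1)])
labels = [classify(stalled), classify(active)]
```

Keeping the thresholds in one object makes it easy to rerun the classification with different values and compare the resulting class counts.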
Classification workflow
Classification Results
Class            Original        Our results     Difference
unclassifiable   3 186           3 296           +110
II               13 342 (12%)    16 252 (14%)    +2 910 (+2%)
IG               10 711 (10%)    12 991 (11%)    +2 280 (+1%)
TI               37 320 (35%)    36 507 (31%)    -813 (-4%)
TG               30 592 (28%)    32 642 (28%)    +2 050 (0%)
SG               15 782 (15%)    16 045 (14%)    +263 (-1%)
other            8 422           0
Total            119 355         117 733
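The difference column and the totals in the table can be rechecked with a short script. The counts are copied from the table; the variable names are hypothetical.

```python
# (original count, replication count) per class, copied from the results table.
rows = {
    "unclassifiable": (3186, 3296),
    "II": (13342, 16252),
    "IG": (10711, 12991),
    "TI": (37320, 36507),
    "TG": (30592, 32642),
    "SG": (15782, 16045),
    "other": (8422, 0),
}

# Difference = replication count minus original count.
diffs = {cls: ours - orig for cls, (orig, ours) in rows.items()}
total_original = sum(orig for orig, _ in rows.values())
total_ours = sum(ours for _, ours in rows.values())
```

The recomputed differences and both totals agree with the figures reported in the table.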

Thanks!
•   Questions?
