Data Analysis
  Andrea Wiggins
  IST 400/600
  April 14, 2008
Data Analysis
  • Data are collected, created, and kept for the purpose of analysis
  • Without analysis, it’s just a bunch of bits
  • Data managers need familiarity with analysis practices
  http://flickr.com/photos/techne/100055322/
Overview
  • Types of data analysis
  • Requirements for analysis
  • Basic steps in data analysis
  • Types of tools
  • Scientific analysis workflows
  • Types of analysis output
Requirements for Analysis
  • Data
  • Analysis design
  • Analysis tools
  • Computing resources to run the analysis
  • Human expertise
  http://www.flickr.com/photos/anikarenina/369089979/
Computational Resources
  • The computing resources required for analysis will depend on scale and complexity
  • Scale refers to the data
  • Complexity refers to the analysis
Scale of Analysis
  • Physical locations of the data and the analysis machines
  • Communication networks
  • Volume of data, number of data streams
  • Time to complete reflects size of data and complexity of analysis
  • Hours or days may be required
Complexity
  • Both analytic and computational complexity are relevant
  • Some operations are “cheap” and others are “expensive”
  • Number of calculations required: every function is made of other functions
  • Execution in serial versus parallel processing: how many tasks at once?
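To make "cheap" versus "expensive" concrete, the sketch below counts the basic operations two illustrative analyses need as the data grows. The function names and the two example analyses (a mean and an all-pairs comparison) are assumptions chosen for illustration; the slides do not name specific operations.

```python
def mean_operation_count(n: int) -> int:
    """A mean touches each of the n values once: roughly n additions."""
    return n


def pairwise_operation_count(n: int) -> int:
    """Comparing every value with every other value needs n*(n-1)/2 comparisons."""
    return n * (n - 1) // 2


# The "cheap" analysis grows linearly; the "expensive" one grows quadratically.
for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}: mean ~{mean_operation_count(n):>12,} ops, "
          f"all pairs ~{pairwise_operation_count(n):>16,} ops")
```

A tenfold increase in data means roughly ten times the work for the mean but about a hundred times the work for the all-pairs comparison, which is why the same data set can be trivial for one analysis and costly for another.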
Serial & Parallel Processes
  [Figure: serial versus parallel processes]
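As a rough illustration of the serial versus parallel distinction, the sketch below runs the same independent tasks one at a time and then several at once. The `summarize` function and the chunked data are hypothetical stand-ins for real analysis work; only Python's standard `concurrent.futures` module is used.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize(chunk):
    """Stand-in for one unit of analysis work: sum of squares of a chunk."""
    return sum(x * x for x in chunk)


# Hypothetical data split into four independent chunks.
chunks = [list(range(i, i + 1000)) for i in range(0, 4000, 1000)]

# Serial: one task at a time, in order.
serial_results = [summarize(chunk) for chunk in chunks]

# Parallel: several tasks in flight at once; map() returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_results = list(pool.map(summarize, chunks))

# Both strategies must give the same answer; only the scheduling differs.
assert serial_results == parallel_results
print(sum(serial_results))
```

Threads are used here only because they run anywhere without extra setup. In CPython, threads mainly help with I/O-bound tasks; for CPU-bound work, `ProcessPoolExecutor` offers the same `map` interface and is usually the better fit.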
Small Scale Computing
  • Regular microcomputers like your laptop
  • Ordinary consumer PCs are able to do some significant computational work
  http://www.flickr.com/photos/cayusa/431036565/
Moderate Scale Computing
  • Relatively small, locally-managed clusters
  • Google’s smallest cluster: 13 servers
  • Reservoir Simulation Joint Industry Project’s cluster -> http://www.cpge.utexas.edu/rsjip/
Macro Scale Computing
  • High performance and grid computing
      • NYSGrid
      • TeraGrid
      • NEESGrid
      • Etc.
  http://ajlopez.wordpress.com/2007/12/03/grid-computing-programming/
Types of Data Analysis
  • Two primary types for quantitative data
      • EDA: exploratory data analysis
      • CDA: confirmatory data analysis
  • Third type for non-numerical data
      • QDA: qualitative data analysis
      • Photographs, words, observations
      • Traditionally found in social sciences
Confirmatory Data Analysis
  • Uses statistical tests to confirm or falsify hypotheses
  • You know what you’re looking for
  • Analysis is usually carefully planned in advance
  http://flickr.com/photos/activitystory/105110622/
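One example of a statistical test that could serve confirmatory analysis is a two-sample permutation test, sketched below with the standard library only. The slides do not name a particular test, so the choice of test, the function name, and the sample values are all assumptions made for illustration.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements planned in advance for a specific hypothesis.
control = [1, 2, 3, 4, 5]
treated = [101, 102, 103, 104, 105]
print(permutation_test(control, treated))
```

The hypothesis (the two groups differ in their means) is fixed before the data are examined, which is what separates this from exploratory work.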
Exploratory Data Analysis
  • Methods used for data mining
      • Nontrivial knowledge discovery from data
  • Looking at data to form hypotheses for CDA testing (on a different data set)
  • Don’t always know what you’re looking for, analysis evolves over time
  • Caution: sometimes you find what you’re looking for, even if it isn’t there!
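The point about testing hypotheses "on a different data set" can be sketched as a simple hold-out split: explore one portion freely, then confirm on the portion that was never looked at. The function name, the 50/50 split, and the placeholder records are assumptions for illustration.

```python
import random


def split_for_exploration(records, explore_fraction=0.5, seed=42):
    """Shuffle a copy of the records and split into (explore, confirm) sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * explore_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(100))  # placeholder for real observations
explore, confirm = split_for_exploration(records)

# Form hypotheses using `explore` only; reserve `confirm` for the later CDA test.
print(len(explore), len(confirm))
```

Keeping the confirmation set untouched is one guard against the caution above: a pattern found by searching one sample is less likely to be an artifact if it also holds in data that played no part in finding it.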
Qualitative Data Analysis
  • Most common in social sciences, where data sets are usually smaller
  • Uses a variety of methods to analyze non-numerical data
  • Many qualitative analysis methods are difficult or impossible to automate
  http://flickr.com/photos/valix/939388335/
Context of Analysis
  • Scientific inquiry
  • Business intelligence
  • Monitoring
  • Carefully planned regular reporting
  • As-needed ad hoc analysis
  http://flickr.com/photos/makou0629/1145908929/
Data Analysis: Design
  • Examine (some of) the raw data
      • Especially important with data meshing, when multiple different data sources are used together
  • Design the analysis & instantiate it
      • Test existing hypotheses
      • Explore data to form hypotheses
      • Use databases & analysis tools
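When several sources are meshed, examining the raw data includes checking which records actually line up. The sketch below joins two hypothetical sources on a shared key and reports the keys that appear in only one of them; the source names, fields, and the `mesh_sources` helper are invented for illustration.

```python
def mesh_sources(source_a, source_b):
    """Join two {key: record} mappings and report keys found in only one source."""
    shared = source_a.keys() & source_b.keys()
    merged = {key: {**source_a[key], **source_b[key]} for key in sorted(shared)}
    only_a = sorted(source_a.keys() - source_b.keys())
    only_b = sorted(source_b.keys() - source_a.keys())
    return merged, only_a, only_b


# Hypothetical sources keyed by a shared site identifier.
temperatures = {"site1": {"temp_c": 11.5}, "site2": {"temp_c": 9.0}}
rainfall = {"site2": {"rain_mm": 3.2}, "site3": {"rain_mm": 0.0}}

merged, only_temps, only_rain = mesh_sources(temperatures, rainfall)
print(merged)      # records present in both sources
print(only_temps)  # keys that would silently drop out of a naive join
print(only_rain)
```

Reporting the unmatched keys explicitly makes it a design decision, rather than an accident, whether those records are excluded from the analysis.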
Data Analysis: Prepare Data
  • Clean the data - preprocessing
      • Remove “noise”
  • Sample the data
      • Select the portion to use for analysis
  • Validate the analysis
      • Use a subsample to check your analysis
      • Do the results make sense? Can you check intermediate values?
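The three preparation steps above can be sketched as small functions. This is a minimal illustration under stated assumptions: "noise" is taken to mean entries that cannot be parsed as numbers, the analysis is a simple mean, and all names and values are hypothetical.

```python
import random


def clean(raw_values):
    """Drop entries that cannot be parsed as numbers (one simple notion of 'noise')."""
    cleaned = []
    for value in raw_values:
        try:
            cleaned.append(float(value))
        except (TypeError, ValueError):
            continue
    return cleaned


def sample(values, k, seed=0):
    """Select a reproducible random portion of the data."""
    return random.Random(seed).sample(values, min(k, len(values)))


def analyze(values):
    """Return intermediate values alongside the result so they can be checked."""
    count, total = len(values), sum(values)
    return {"count": count, "total": total, "mean": total / count}


raw = ["3.5", "4.0", "n/a", None, "5.5", "", "7.0"]
cleaned = clean(raw)
subsample = sample(cleaned, k=3)
print(analyze(subsample))  # small enough to verify by hand
```

Returning the count and total along with the mean is a deliberate choice: intermediate values are what let you answer "do the results make sense?" on a subsample before committing to the full run.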
Data Analysis: Revise & Run
  • Revise the analysis & test again
      • Also known as debugging
      • Good idea to compare manually and automatically computed results when possible to verify that everything works
  • Repeat as needed
  • Run the full analysis when ready
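Comparing manual and automatic results can be as simple as a tiny fixture worked by hand. In the sketch below, `automated_variance` is a hypothetical stand-in for the analysis step under test, and the fixture values are invented so the arithmetic is easy to follow.

```python
import math
from statistics import pvariance


def automated_variance(values):
    """The 'automatic' computation under test (hypothetical analysis step)."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)


# Small fixture worked by hand: mean = 5, squared deviations = 9, 1, 1, 9.
fixture = [2.0, 4.0, 6.0, 8.0]
manual_result = 5.0  # (9 + 1 + 1 + 9) / 4

# Check against the hand calculation and against an independent implementation.
assert math.isclose(automated_variance(fixture), manual_result)
assert math.isclose(automated_variance(fixture), pvariance(fixture))
```

If either assertion fails, the analysis goes back for another round of revision before the full run.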
Data Analysis: Save Output
  • Save/export the analysis results, artifacts, and appropriate metadata
      • Data selection criteria, sample, analysis design version
      • When analysis was run, by whom
      • System details
      • Time to run, exceptions
      • Other relevant details dictated by your context of inquiry
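One way to keep the metadata listed above together with the results is a single JSON record per run. The field names, the `build_run_record` and `save_run_record` helpers, and the example values are assumptions for illustration; a real project would choose fields to match its own context of inquiry.

```python
import json
import platform
from datetime import datetime, timezone


def build_run_record(results, selection_criteria, design_version,
                     analyst, started, finished, exceptions=()):
    """Bundle analysis results with the metadata needed to interpret them later."""
    return {
        "results": results,
        "selection_criteria": selection_criteria,
        "design_version": design_version,
        "run_by": analyst,
        "run_started_utc": started.isoformat(),
        "seconds_to_run": (finished - started).total_seconds(),
        "system": {"platform": platform.platform(),
                   "python": platform.python_version()},
        "exceptions": [repr(e) for e in exceptions],
    }


def save_run_record(record, path):
    """Write the record as human-readable JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2)


started = datetime.now(timezone.utc)
results = {"mean": 5.0}  # placeholder for real analysis output
finished = datetime.now(timezone.utc)

record = build_run_record(results, "year == 2008", "v1", "analyst-name",
                          started, finished)
print(json.dumps(record, indent=2))
```

Calling `save_run_record(record, "analysis_run.json")` (a hypothetical file name) would store the record next to the other outputs, so questions that come up during write-up can be answered from the record rather than from memory.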
Data Analysis: Use
  • Write up the results
      • Often requires returning to the raw data, analyzed data, and other information about the analysis
  • Questions always arise…
      • Something looks out of place, doesn’t make sense, can’t possibly be true
  • Double-check everything: results, analysis records, analysis metadata
Very Important Details
  • Data formats
      • Format/s of raw data in source/s
      • Format/s required for analysis
      • Format/s of outputs: image, csv, statistics, descriptive text, etc.
  • Data manipulation
      • Moving from source to analysis to usable results, without losing/abusing anything
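A small sketch of moving data between formats without losing anything: a hypothetical CSV source is parsed into typed records, exported as JSON, and checked at each step. The column names and values are invented for illustration.

```python
import csv
import io
import json

# Hypothetical raw source in CSV form.
raw_csv = "site,temp_c\nsite1,11.5\nsite2,9.0\n"


def csv_to_records(text):
    """Parse CSV text into typed records suitable for analysis."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"site": row["site"], "temp_c": float(row["temp_c"])}
            for row in reader]


records = csv_to_records(raw_csv)
as_json = json.dumps(records)

# Guard against silent loss: same number of rows in, same values back out.
assert len(records) == len(raw_csv.splitlines()) - 1  # minus the header row
assert json.loads(as_json) == records
print(as_json)
```

The explicit `float` conversion is where format problems tend to surface: a malformed value raises an error at the boundary instead of quietly propagating into the results.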
Data Analysis Tools
  • Vary by preferences, skills, demands of the data
  • Custom solutions
      • Collections of small modular scripts
      • Customized vendor software installations
  • Stats packages
      • SPSS, SAS, R
  • eScience tools
  http://flickr.com/photos/alistairmcmillan/2102898220/
Open Science Movement
  • Not just open data, also open analysis: Open Notebook Science
  [Figure: spectrum from Closed to Open: Traditional Lab Notebook (unpublished), Traditional Journal Article, Open Access Journal Article, Open Notebook Science (full transparency)]
  From Jean-Claude Bradley in Nature Proceedings: doi:10.1038/npre.2007.39.1, posted 11 Jun 2007
Analysis Workflows
  • Scientists “need access to tools and services that help ensure that metadata are automatically captured or created in real-time” (Cyberinfrastructure Vision for 21st Century Discovery)
  • Taverna Workbench demo video
      • Example of a scientific workflow analysis tool, used for genetics and social science!
  http://floss.syr.edu/Presentations/TavernaDemoRedux.m4v
Analysis Outputs
  • Most analysis starts as numbers and ends up as words
      • Scholarly articles
      • White papers
      • Technical reports
  • Visualizations
      • More on Wednesday
  http://flickr.com/photos/thiru/278930492/
Dashboard Reports
  • At-a-glance reports for regular, ongoing monitoring
  • Uses many visualizations
  • Usually intended for managers & executives
  http://flickr.com/photos/jauladeardilla/345883088/
Concluding Thoughts
  • Understanding how data is used will help you manage it better
  • Planning ahead makes data analysis go more smoothly
  • Data analysis almost never goes perfectly
  • Analysis is the fun part of research, when discoveries are made
Questions for Discussion
  • What can data managers contribute to data analysis?
  • What are some of the factors that are relevant to designing data analysis?
  • How is metadata relevant to designing data analysis?
  • How is metadata relevant to reporting data analysis results?
