Multivariate Data Analysis and Visualization Tools for Understanding Biological Data   Dmitry Grapov
Introduction: Systems. Reductionist vs. emergent views; deterministic systems vs. complex systems. Disciplines: chemical analysis, physiology, biochemistry, graph theory, modeling, informatics. (Oltvai, et al. Science 25 October 2002: 763-764.)
Introduction:  Inference
Overview (http://www.thefullwiki.org/Hypercube)
Types | Properties | Representations | Central Idea
Univariate (1-D) | vector | histograms, densities | mean
Bivariate (2-D) | matrix | scatter plots | correlation
Multivariate (n-D) | matrix | dendrograms, heatmaps, biplots, networks | many
Univariate: Properties. A vector of length m, summarized by its mean and variance.
Univariate:  Representations
Univariate: Assumptions. Normality.
Univariate: Utility. Hypothesis testing: α = type I error (false positive); β = type II error (false negative); power = 1 - β; effect size = standardized difference in means.
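The effect size named on this slide can be sketched as follows. This is a minimal illustration that assumes the "standardized difference in mean" is Cohen's d with a pooled standard deviation; the slide does not name a specific estimator, and the two groups are made-up values.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Standardized difference in means using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical measurements for two groups (illustrative only).
control = [4.0, 5.0, 6.0, 5.0, 5.0]
treated = [6.0, 7.0, 8.0, 7.0, 7.0]
d = cohens_d(treated, control)
print(f"effect size d = {d:.3f}")
```

A larger d means fewer samples are needed to reach a given power at a fixed α.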
Univariate: Limitations. Biological definition of the mean? Relationship between sample size and test power. Multiple hypothesis testing and the false discovery rate.
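One common way to control the false discovery rate across many univariate tests is the Benjamini-Hochberg adjustment. The sketch below assumes that procedure (the slide names only the false discovery rate, not a method) and uses made-up p-values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from five univariate tests.
p = [0.001, 0.008, 0.039, 0.041, 0.20]
q = benjamini_hochberg(p)
significant = [qi <= 0.05 for qi in q]
print(list(zip(p, q, significant)))
```

Note that two tests with raw p < 0.05 no longer pass once the adjustment accounts for the number of hypotheses tested.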
Old Faithful Data. 272 observations: time between eruptions 70 ± 14 min; duration of eruption 3.5 ± 1 min. (Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357–365.)
Bivariate: Properties. A matrix of 2 vectors of length m.
Bivariate: Representations. (X, Y)
Bivariate: Utility. (X, Y): bivariate distribution; correlation; linear fit, Variable 2 = m * Variable 1 + b.
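The correlation and the line Variable 2 = m * Variable 1 + b can be computed from the same sums of squares. The sketch below assumes an ordinary least squares fit and the Pearson coefficient; the five data points are illustrative and are not the Old Faithful measurements.

```python
import math


def sums_of_squares(x, y):
    """Return (sxx, syy, sxy, mean_x, mean_y) for paired observations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxx, syy, sxy, mean_x, mean_y


def pearson_r(x, y):
    sxx, syy, sxy, _, _ = sums_of_squares(x, y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Ordinary least squares estimate of y = m * x + b."""
    sxx, _, sxy, mean_x, mean_y = sums_of_squares(x, y)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Illustrative values: eruption duration (min) and waiting time (min).
duration = [1.8, 2.0, 3.5, 4.0, 4.5]
waiting = [55.0, 55.0, 72.0, 75.0, 82.0]
r = pearson_r(duration, waiting)
m, b = fit_line(duration, waiting)
print(f"r = {r:.3f}, waiting = {m:.2f} * duration + {b:.2f}")
```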
Bivariate: Limitations. The correlation coefficient is a measure of linear or monotonic relationship. (http://en.wikipedia.org/wiki/Correlation)
Bivariate: Limitations. The correlation coefficient is sensitive to outliers. (http://en.wikipedia.org/wiki/Correlation)
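Both limitations can be seen in a small example. The sketch below assumes the "linear" measure is Pearson's r and the "monotonic" measure is Spearman's rank correlation (the slides do not name them), and uses made-up data in which a single extreme value is added without changing the ordering.

```python
import math


def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1 to n (assumes no ties, for brevity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = float(rank)
    return result


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_clean = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
y_outlier = y_clean[:-1] + [60.0]  # one extreme value, same ordering

r_clean = pearson_r(x, y_clean)
r_outlier = pearson_r(x, y_outlier)
rho_outlier = spearman_rho(x, y_outlier)
print(f"Pearson clean={r_clean:.3f}, with outlier={r_outlier:.3f}")
print(f"Spearman with outlier={rho_outlier:.3f}")
```

The linear coefficient drops sharply with one outlier, while the rank-based coefficient is unchanged because the relationship is still monotonic.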
Old Faithful. (Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357–365.)
Old Unfaithful?
Old Unfaithful? Additional variables: nearby hydrofracking. Improve inference based on more information.
Multivariate: Properties. A matrix of n vectors of length m; correlation matrix. Challenges: data often wide, structured, integration, noise. Rewards: robust inference, signal amplification, holistic/systems approach.
Multivariate: Dimensional Reduction. Principal Components Analysis (PCA): a linear n-dimensional encoding of the original data in which the dimensions are orthogonal (uncorrelated) and the top k dimensions are ordered by variance explained. (Plot axes: PC 1, PC 2.)
Multivariate: Dimensional Reduction. Calculating PCs: singular value decomposition (SVD) of the original data into scores (m x PC), explained variance (PC x PC), and loadings (n x PC). Eigenvalues: explained variance. Scores: sample representation based on all variables. Loadings: variable contribution to scores. (Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha. "Singular value decomposition and principal component analysis", in A Practical Approach to Microarray Data Analysis. D.P. Berrar, W. Dubitzky, M. Granzow, eds. pp. 91-109, Kluwer: Norwell, MA (2003). LANL LA-UR-02-4001.)
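The decomposition into scores, loadings, and explained variance can be sketched with NumPy's SVD. This is a minimal illustration rather than the code behind the slides: the function name `pca_svd` is hypothetical, and the input is synthetic (272 rows, two correlated columns plus six columns of random noise).

```python
import numpy as np


def pca_svd(data, scale=True):
    """PCA of an (m samples x n variables) array via SVD."""
    X = np.asarray(data, dtype=float)
    X = X - X.mean(axis=0)                  # center each variable
    if scale:
        X = X / X.std(axis=0, ddof=1)       # scale to unit variance
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U * s                          # m x PC: sample coordinates
    loadings = Vt.T                         # n x PC: variable contributions
    eigenvalues = s ** 2 / (X.shape[0] - 1)
    explained = eigenvalues / eigenvalues.sum()
    return scores, loadings, explained


# Synthetic stand-in data: 2 correlated "real" variables + 6 noise variables.
rng = np.random.default_rng(0)
signal = rng.normal(size=(272, 1))
real = np.hstack([signal, signal + 0.3 * rng.normal(size=(272, 1))])
noise = rng.normal(size=(272, 6))
scores, loadings, explained = pca_svd(np.hstack([real, noise]))
print(np.round(explained, 3))
```

The components come out ordered by explained variance, and the loadings on the first component indicate which variables drive it.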
Multivariate: Representations. Old Faithful 2.0: 272 measurements of 8 variables (2 real, 6 random noise); a matrix of n vectors of length m.
Multivariate: Representation. Identify outliers using all measurements. Use known values to impute missing values. Identify interesting groups. Evaluate uni- and bivariate observations. The number of PCs can be used to estimate true data complexity.
PCA: Considerations (data pre-treatment, outliers, noise, unsupervised projection). Data pre-treatment: no pre-treatment vs. centered and scaled to unit variance.
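Centering and scaling to unit variance (often called autoscaling) keeps variables measured on large scales from dominating the projection. The sketch below is a plain-Python illustration with made-up rows; the helper name `autoscale` is hypothetical.

```python
import statistics


def autoscale(rows):
    """Center each column to mean 0 and scale it to unit sample variance."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return [
        [(value - mu) / sd for value, mu, sd in zip(row, means, sds)]
        for row in rows
    ]


# Illustrative rows: (waiting time in min, eruption duration in min).
rows = [[70.0, 3.5], [54.0, 1.8], [80.0, 4.3], [62.0, 2.0], [85.0, 4.5]]
scaled = autoscale(rows)
for row in scaled:
    print([round(v, 3) for v in row])
```

After scaling, both columns contribute equally to the total variance, regardless of their original units.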
PCA: Considerations (data pre-treatment, outliers, linear reconstruction, noise, unsupervised projection). Independent components analysis (ICA): use ICA to calculate statistically independent components.
PCA: Considerations (data pre-treatment, outliers, linear reconstruction, noise, supervised projection). Non-negative matrix factorization (NMF): NMF uses an additive, parts-based encoding. (Learning the parts of objects by nonnegative matrix factorization, D.D. Lee, H.S. Seung; Zhipeng Zhao, ppt.)
PCA: Considerations (data pre-treatment, outliers, linear reconstruction, noise, supervised projection). Partial Least Squares Projection to Latent Structures (PLS/-DA): identify a projection correlated with class assignment (classification) or continuous variables (regression).
PLS/-DA: Utility. Strengths: predicts multiple dependent variables; avoids issues of multicollinearity; independent measure of variable importance. Weaknesses: need to derive an empirical reference for model performance; poorly established model optimization methods.
PLS-DA: Example. Data: Old Faithful 2.0, 272 observations on 8 variables. Latent variables (LVs) are analogous to PCs. Important statistics (CV): Q2 = fit; RMSEP = error of prediction; AU(RO)C = specificity vs. sensitivity. Select the appropriate number of LVs to maximize Q2.
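Two of these statistics can be computed directly from cross-validated predictions. The sketch below assumes the usual definitions, Q2 = 1 - PRESS / TSS and RMSEP as the root mean squared error of prediction; the class labels and predictions are made-up values standing in for the output of a cross-validated model.

```python
import math


def q2(observed, cv_predicted):
    """Q2 = 1 - PRESS / TSS, computed from cross-validated predictions."""
    mean_obs = sum(observed) / len(observed)
    press = sum((y - p) ** 2 for y, p in zip(observed, cv_predicted))
    tss = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - press / tss


def rmsep(observed, predicted):
    """Root mean squared error of prediction."""
    squared = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


# Hypothetical class labels (0/1) and cross-validated predictions.
y = [0, 0, 0, 1, 1, 1]
y_cv = [0.1, 0.2, 0.0, 0.9, 0.7, 1.0]
q2_value = q2(y, y_cv)
rmsep_value = rmsep(y, y_cv)
print(f"Q2 = {q2_value:.3f}, RMSEP = {rmsep_value:.3f}")
```

In practice these values would be recomputed for each candidate number of latent variables, keeping the number that maximizes Q2.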
PLS-DA: Performance. Use permutation tests to empirically determine model performance.
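A permutation test builds the empirical reference by recomputing a performance statistic after shuffling the class labels many times. The sketch below shows the general mechanism only: to stay self-contained it swaps in a simple statistic (absolute difference in class means) in place of a PLS-DA model's Q2, and the data and function names are hypothetical.

```python
import random


def permutation_p_value(values, labels, statistic, n_permutations=1000, seed=0):
    """Compare the observed statistic with its label-shuffled distribution."""
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if statistic(values, shuffled) >= observed:
            count += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return observed, (count + 1) / (n_permutations + 1)


def mean_difference(values, labels):
    """Stand-in performance statistic: absolute difference in class means."""
    ones = [v for v, lab in zip(values, labels) if lab == 1]
    zeros = [v for v, lab in zip(values, labels) if lab == 0]
    return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))


# Hypothetical scores for two classes of five samples each.
values = [1.0, 1.2, 0.9, 1.1, 1.3, 3.0, 3.2, 2.9, 3.1, 3.3]
labels = [0] * 5 + [1] * 5
observed, p_value = permutation_p_value(values, labels, mean_difference)
print(f"observed = {observed:.2f}, permutation p = {p_value:.4f}")
```

With a real model, `statistic` would refit the model on the shuffled labels and return its cross-validated performance.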
PLS: Predictive Performance. Split data into training (2/3) and test (1/3) sets. Generate the model using the training set, then predict class assignment for the test set. Use permutation tests to generate confidence bounds for future predictions.
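The 2/3 training and 1/3 test split can be sketched as below. Splitting within each class (stratification) is an assumption added here so that both sets keep the same class balance; the slide specifies only the proportions, and the labels are hypothetical.

```python
import random


def stratified_split(labels, test_fraction=1 / 3, seed=0):
    """Return (train_idx, test_idx), holding out test_fraction of each class."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical class labels for 270 observations.
labels = ["short"] * 90 + ["long"] * 180
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

The model is then fit only on the rows in `train_idx` and evaluated on the rows in `test_idx`.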
PLS: Predictive Performance
PLS: Feature Selection. Use PLS-DA as an objective function to identify the most informative variables.
Networks. Network: a representation of relationships among objects. Utility: project statistical results into a biological context; explore informative data aspects in the context of all that was observed; identify emergent patterns.
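As one concrete reading of "relationships among objects", a network can be stored as a list of edges between variables. The sketch below assumes that an edge means a strong pairwise Pearson correlation; the slides do not say how edges were defined, and the variable names, values, and threshold are hypothetical.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_network(variables, threshold=0.8):
    """Return edges (name_a, name_b, r) for pairs with |r| >= threshold."""
    edges = []
    for (name_a, a), (name_b, b) in combinations(variables.items(), 2):
        r = pearson_r(a, b)
        if abs(r) >= threshold:
            edges.append((name_a, name_b, round(r, 3)))
    return edges


# Hypothetical measurements of four variables across five samples.
variables = {
    "metabolite_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "metabolite_B": [2.1, 3.9, 6.2, 8.1, 9.8],
    "metabolite_C": [5.0, 4.1, 2.9, 2.2, 1.0],
    "metabolite_D": [3.0, 1.0, 4.0, 1.0, 5.0],
}
edges = correlation_network(variables)
for name_a, name_b, r in edges:
    print(f"{name_a} -- {name_b} (r = {r})")
```

Building one edge list per group (for example, per phenotype) and comparing them is one way to highlight changes in patterns of relationships.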
Networks. Interpret statistical results within a biological context.
Networks. Highlight changes in patterns of relationships (non-diabetics vs. type 2 diabetics).
Networks. Display complex interactions (non-diabetics vs. type 2 diabetics).
imDEV: interactive modules for Data Exploration and Visualization. An integrated environment for systems level analysis of multivariate data. http://sourceforge.net/apps/mediawiki/imdev
Acknowledgements. Newman Lab; Designated Emphasis in Biotechnology (DEB); NIH. This project is funded in part by NIH grant NIGMS-NIH T32-GM008799, USDA-ARS 5306-51530-019-00D, and NIH-NIDDK R01DK078328-01.
