Exploratory Data Analysis_v3_part1
Exploratory Data Analysis
Benefits, Techniques, and Examples
Recall
Identifying the Right Data
What is in the data?
What is Exploratory Data Analysis?
Exploratory Data Analysis: EDA
An approach to analyzing data that serves as a precursor to more complex data analysis techniques,
such as regression analysis, cluster analysis, and machine learning.
Self-Study: https://www.itl.nist.gov/div898/handbook/eda/eda.htm
Why EDA?
Goal
• A good-fitting, parsimonious model
Secondary Goals
• Outliers
• Robustness of conclusions
• Estimates for parameters
• Uncertainties for those estimates
• A ranked list of important factors
• Statistical significance of conclusions
• Optimal settings
What are other data analysis approaches?
• All three approaches (Classical, Bayesian, Exploratory) start with a general
science/engineering problem and all yield science/engineering conclusions.
• Classical
  • uses techniques such as summary statistics and hypothesis testing
  • analyzes data in a structured manner
  • useful for answering specific research questions
  • does not provide a comprehensive understanding of the data
• Bayesian
  • analyzes data by incorporating prior knowledge and beliefs
  • useful for making predictions and updating beliefs
  • requires a strong understanding of probability theory and can be computationally intensive
• Exploratory (EDA)
How Does Exploratory Data Analysis Differ from Classical Data Analysis?
In the real world, data analysts freely mix elements of all three of the
above approaches (and other approaches).
Techniques for EDA
• Univariate Analysis
• Bivariate Analysis
• Multivariate Analysis
• Correlation and Regression Analysis
• Cluster Analysis
• Time Series Analysis
• Principal Component Analysis
• Factor Analysis
• Discriminant Analysis
• Canonical Correlation Analysis
• Spatial Analysis
• Text Analysis
• Network Analysis
Techniques for EDA
Next Time!
Univariate Analysis
To find the average horsepower across the population of cars, we compute the mean: the sum of the
horsepower values divided by the number of cars.
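As a minimal sketch of this calculation in Python, assuming a small made-up list of horsepower values (the numbers are illustrative and not taken from any real car dataset):

```python
# Hypothetical horsepower values for a handful of cars (illustrative only).
horsepower = [110, 110, 93, 175, 105, 245, 62, 95]

# Mean = sum of the values divided by the number of values.
mean_hp = sum(horsepower) / len(horsepower)
print(f"Mean horsepower: {mean_hp:.3f}")
```

The standard library's `statistics.mean` gives the same result.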
Measures of Central Tendency: Median
Measures of Central Tendency: Mode
Measures of Central Tendency
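The median and mode can be sketched with Python's standard `statistics` module. The horsepower values below are hypothetical and chosen only so that the three measures differ:

```python
import statistics

# Hypothetical horsepower values (illustrative only).
horsepower = [110, 110, 93, 175, 105, 245, 62, 95]

mean_hp = statistics.mean(horsepower)      # arithmetic average
median_hp = statistics.median(horsepower)  # middle value of the sorted data
mode_hp = statistics.mode(horsepower)      # most frequent value

print("mean:", mean_hp)
print("median:", median_hp)
print("mode:", mode_hp)
```

In this sample the mean sits above the median because the two large values (175 and 245) pull it upward, which is why the median is often preferred for skewed data.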
Measures of Spread: Range
Measures of Spread: Quartiles
Measures of Spread: Quartiles (IQR Calculation)
Measures of Spread: Variance
Measures of Spread: Standard Deviation
Measures of Spread
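The measures of spread listed above can be sketched as follows. This is an illustration under stated assumptions: the horsepower values are hypothetical, the quartiles use the `"inclusive"` interpolation method of `statistics.quantiles` (other tools may use a different quartile convention and give slightly different values), and the data is treated as a whole population, so `pvariance` and `pstdev` are used rather than the sample versions `variance` and `stdev`.

```python
import statistics

# Hypothetical horsepower values (illustrative only).
horsepower = [110, 110, 93, 175, 105, 245, 62, 95]

# Range: distance between the largest and smallest value.
value_range = max(horsepower) - min(horsepower)

# Quartiles: the three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(horsepower, n=4, method="inclusive")

# Interquartile range: spread of the middle 50% of the data.
iqr = q3 - q1

# Population variance: mean squared deviation from the mean.
variance = statistics.pvariance(horsepower)

# Population standard deviation: square root of the variance.
std_dev = statistics.pstdev(horsepower)

print("range:", value_range)
print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("variance:", variance)
print("standard deviation:", std_dev)
```

The standard deviation is expressed in the same unit as the data (horsepower), while the variance is in squared units.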
Data Understanding
• Data understanding in univariate analysis is a crucial step.
• Univariate analysis focuses on examining and understanding a single
variable at a time
• Here are the key steps for data understanding in univariate analysis:
Data Collection
Data Inspection
Data Cleaning
Data Visualization
• Create visual representations of the variable to understand its
distribution and characteristics.
• Common types of plots and charts include:
Boxplot: graphic display of the five-number summary
Histogram: the x-axis represents values, the y-axis represents frequencies
Quantile plot: each value x_i is paired with f_i, indicating that approximately
100·f_i% of the data are ≤ x_i
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate
distribution against the corresponding quantiles of another
Scatter plot: each pair of values is treated as a pair of coordinates and plotted as a point
in the plane
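The pairing behind a quantile plot can be sketched in a few lines. The plotting position f_i = (i - 0.5) / n used here is one common convention and is an assumption, as are the function name `quantile_plot_points` and the sample values:

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs for a quantile plot.

    x_i is the i-th smallest value and f_i = (i - 0.5) / n, so roughly
    100 * f_i percent of the data are less than or equal to x_i.
    """
    data = sorted(values)
    n = len(data)
    return [((i - 0.5) / n, x) for i, x in enumerate(data, start=1)]


# Hypothetical horsepower values (illustrative only).
horsepower = [110, 110, 93, 175, 105, 245, 62, 95]
quantile_points = quantile_plot_points(horsepower)

for f, x in quantile_points:
    print(f"f = {f:.4f}  x = {x}")
```

Plotting f_i on the x-axis against x_i on the y-axis gives the quantile plot.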
Box Plot
• Five-number summary of a distribution
• Min, Q1, median, Q3, Max
• Data is represented as a box
• Ends of the box are at the first and third
quartiles
• Height of the box is IQR
• Median is marked by a line within the box
• Whiskers extend up to 1.5*IQR below Q1 and up to 1.5*IQR above Q3
• Some versions instead draw the whiskers at the max
and min data points (footnote 1)
• Points beyond the whiskers (potential outliers) are
plotted individually
1 https://asq.org/quality-resources/box-whisker-plot
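The quantities a box plot draws can be computed directly. The sketch below assumes the 1.5*IQR whisker rule, the `"inclusive"` quartile method of `statistics.quantiles`, and hypothetical horsepower values; the helper name `box_plot_stats` is made up for this illustration.

```python
import statistics


def box_plot_stats(values):
    """Compute the five-number summary, whisker ends, and outliers."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1

    # Fences: the farthest a whisker may reach under the 1.5*IQR rule.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    # Whiskers end at the most extreme data points inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]

    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical horsepower values (illustrative only).
horsepower = [110, 110, 93, 175, 105, 245, 62, 95]
summary = box_plot_stats(horsepower)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With these values, the two largest observations fall above the upper fence and would be drawn as individual points rather than covered by the whisker.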
Histogram
• Graph display of tabulated frequencies,
shown as bars
• Shows what proportion of cases fall into
each of several categories
• Differs from a bar chart
  • it is the area of the bar that denotes the value
  • not the height as in bar charts
  • a crucial distinction when the categories are
    not of uniform width
• The categories are usually specified as
non-overlapping intervals of some
variable. The categories (bars) must be
adjacent.
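To illustrate why area rather than height carries the value when bins have unequal widths, here is a small sketch. The bin edges, the data, and the helper name `histogram_density` are all assumptions made for this example, and the edges are assumed to cover every value.

```python
def histogram_density(values, edges):
    """Bin values into adjacent intervals and return (left, right, count, density).

    density = count / (n * width), so that width * density (the bar's area)
    equals the proportion of cases in that bin.
    """
    n = len(values)
    bins = []
    last = len(edges) - 2
    for i in range(len(edges) - 1):
        left, right = edges[i], edges[i + 1]
        # Bins are half-open [left, right), except the last, which includes its right edge.
        count = sum(
            1 for v in values
            if left <= v < right or (i == last and v == right)
        )
        density = count / (n * (right - left))
        bins.append((left, right, count, density))
    return bins


# Hypothetical horsepower values and non-uniform bin edges (illustrative only).
horsepower = [110, 110, 93, 175, 105, 245, 62, 95]
edges = [50, 100, 125, 250]

bins = histogram_density(horsepower, edges)
for left, right, count, density in bins:
    proportion = (right - left) * density
    print(f"[{left}, {right}): count={count}, height={density:.4f}, area={proportion:.3f}")

total_area = sum((right - left) * density for left, right, _, density in bins)
print("total area:", total_area)
```

The first two bins hold the same number of cars, but the narrower bin is drawn twice as tall, so the two bars have equal area.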
Quantile-quantile (Q-Q) Plot
• Q-Q plot: Plots quantiles of one
univariate distribution against the
corresponding quantiles of another.
• Points on the plot along a diagonal line
indicate a good fit, while deviations
suggest differences.
• Typically used to:
  • compare observed data to a theoretical
    distribution (typically the normal distribution)
  • identify outliers
• Q-Q plots help assess data's fit to an
expected distribution
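A normal Q-Q plot can be sketched by pairing each sorted observation with the matching quantile of a normal distribution. The choices below are assumptions for illustration: the plotting position (i - 0.5) / n, a normal distribution fitted with the sample mean and sample standard deviation, the helper name `normal_qq_points`, and the hypothetical data.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(values):
    """Return (theoretical_quantile, observed_value) pairs for a normal Q-Q plot."""
    data = sorted(values)
    n = len(data)
    # Normal distribution with the same mean and sample standard deviation as the data.
    fitted = NormalDist(mean(data), stdev(data))
    points = []
    for i, observed in enumerate(data, start=1):
        f = (i - 0.5) / n  # fraction of data at or below the i-th value
        points.append((fitted.inv_cdf(f), observed))
    return points


# Hypothetical horsepower values (illustrative only).
horsepower = [110, 110, 93, 175, 105, 245, 62, 95]
qq_points = normal_qq_points(horsepower)

for theoretical, observed in qq_points:
    print(f"theoretical = {theoretical:8.2f}   observed = {observed}")
```

If the data were close to normal, each observed value would be near its theoretical quantile, so the points would lie near the diagonal; large gaps at the upper end point to a heavy right tail or possible outliers.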
Bivariate Analysis
• A statistical method used in data analysis to examine and understand
the relationship, association, or interaction between two different
variables.
• It involves the simultaneous analysis of two variables, which can be
numerical or categorical.
• It explores how changes in one variable are associated with changes
in another variable.
Types of Bivariate Analysis
Bivariate analysis involves examining the relationships or associations
between two different variables.
The type of bivariate analysis you choose depends on the nature of the
variables you're working with.
Here are the common types of bivariate analysis:
1. Numerical-Numerical Analysis
2. Categorical-Categorical Analysis
3. Categorical-Numerical Analysis
Numerical-Numerical Analysis
• Numerical-numerical bivariate analysis, also known as correlation
analysis, is a statistical technique used to examine the relationship
between two numerical variables.
• The primary goal is to understand whether and to what extent the
two variables are related and the nature of that relationship.
• Correlation analysis quantifies the strength, direction, and linearity of
the relationship between these variables.
Steps for Numerical-Numerical Analysis
• Choose the variables
• Visualize the data
• Calculate the Correlation Coefficient
  • Pearson Correlation Coefficient (r)
  • Spearman Rank Correlation
Pearson Correlation Coefficient (r)
• Measures the linear relationship between two variables. It ranges from -1
(perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating
no linear correlation.
Pearson Correlation Coefficient (r): Example
Calculating Pearson correlation coefficient for two variables:
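A minimal sketch of the calculation, written out from the definition (covariance divided by the product of the standard deviations). The function name `pearson_r` and the paired horsepower and miles-per-gallon values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical paired observations (illustrative only).
horsepower = [62, 93, 95, 105, 110, 175, 245]
mpg = [24.4, 22.8, 21.5, 18.1, 21.0, 18.7, 14.3]

r = pearson_r(horsepower, mpg)
print(f"Pearson r = {r:.3f}")
```

For this made-up sample r is negative: cars with more horsepower tend to have lower miles per gallon.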
Spearman Rank Correlation
• Measures the monotonic relationship (not necessarily linear) between two
variables based on their ranked values. It is more robust to outliers than
Pearson's r and also captures non-linear monotonic relationships.
Spearman Rank Correlation: Example
Calculating Spearman Rank Correlation for two variables:
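One way to sketch the calculation is to replace each value by its rank (tied values share the average of their ranks) and then compute the Pearson coefficient of the ranks. The helper names `average_ranks`, `pearson_r`, and `spearman_rho`, as well as the data, are assumptions made for this illustration.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Hypothetical data with a monotonic but non-linear relationship (y = x ** 3).
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

rho = spearman_rho(x, y)
r = pearson_r(x, y)
print(f"Spearman rho = {rho:.3f}")
print(f"Pearson r    = {r:.3f}")
```

Because y always increases with x, the ranks agree exactly and Spearman's coefficient is 1, while Pearson's r is below 1 since the relationship is not a straight line.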
To be continued!