EDA
• Box (IQR: Interquartile Range) → Shows the middle 50% of the data (Q1 to
Q3).
• Line inside the box → Represents the median (Q2).
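• Whiskers → Extend to the lowest and highest values that lie within 1.5 times the
IQR of the box (no lower than Q1 - 1.5×IQR, no higher than Q3 + 1.5×IQR).
• Dots outside the whiskers → Indicate outliers (values more than 1.5×IQR below Q1 or above Q3).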
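The box plot components listed above can be computed by hand. The following is a minimal sketch using NumPy on a small made-up sample (the numbers are hypothetical, and NumPy's default linear interpolation is assumed for the quartiles; plotting libraries may use slightly different quartile conventions).

```python
import numpy as np

# Hypothetical sample; 40 is deliberately far from the rest.
values = np.array([2, 4, 5, 6, 7, 8, 9, 10, 12, 40])

# Box: Q1 to Q3, with the median (Q2) as the line inside it.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences sit 1.5 * IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points still inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an outlier dot.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"outliers: {outliers.tolist()}")
```

Note that the whiskers stop at actual data points, not at the fence values themselves.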
Bivariate Analysis
• Bivariate analysis is a statistical method used to
analyze the relationship between two variables
simultaneously.
• It explores the association, correlation, or
dependency between two variables to understand
how changes in one variable are related to changes
in another.
• When one variable is thought to influence the
other, the two are typically referred to as the
independent variable and the dependent variable.
• The independent variable is the one that is
manipulated or controlled by the researcher (or, in
observational data, treated as the explanatory
variable), while the dependent variable is observed
and measured to see how it responds to changes in
the independent variable.
• Bivariate analysis can involve various statistical techniques and
visualizations depending on the types of variables being studied:
• Categorical vs. Categorical Variables: For two categorical variables, bivariate
analysis often involves a contingency table (cross-tabulation) or a heatmap of it.
• A heatmap of a contingency table uses a color gradient to show the count in each
combination of categories, so frequent and rare combinations stand out.
• sns.heatmap(pd.crosstab(titanic['pclass'], titanic['survived']), annot=True, fmt='d')
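To see what the heatmap is coloring, it can help to look at the contingency table itself. The sketch below builds one with pd.crosstab on a tiny hypothetical stand-in for the Titanic data (the rows are invented for illustration, not real records) and also shows row-normalized proportions.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data (not the real records).
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Contingency table: one row per class, one column per outcome, cells are counts.
counts = pd.crosstab(toy["pclass"], toy["survived"])

# Same table with each row scaled to sum to 1 (share of each outcome per class).
rates = pd.crosstab(toy["pclass"], toy["survived"], normalize="index")

print(counts)
print(rates.round(2))
```

Passing either table to sns.heatmap would color the cells by their values; the row-normalized version is often easier to compare when class sizes differ.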
• Categorical vs. Numerical Variables: When one variable
is categorical and the other is numerical (also known as
quantitative or continuous), bivariate analysis may
involve comparing the distribution of the numerical
variable across different categories of the categorical
variable using techniques such as box plots or
per-category histograms.
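A numerical summary per category gives the same information a grouped box plot would draw. The following is a minimal sketch with pandas on hypothetical fares per passenger class (the values are invented for illustration).

```python
import pandas as pd

# Hypothetical fares per passenger class (illustrative values only).
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "fare":   [80.0, 60.0, 100.0, 20.0, 30.0, 25.0, 8.0, 7.0, 9.0],
})

# Five-number summary of the numerical variable within each category.
summary = toy.groupby("pclass")["fare"].describe()[["min", "25%", "50%", "75%", "max"]]
print(summary)
```

The visual counterpart would be a call such as sns.boxplot(x='pclass', y='fare', data=titanic), which draws one box per category.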
• Numerical vs. Numerical Variables: For two numerical
variables, bivariate analysis focuses on exploring the
relationship between the variables through techniques
such as scatter plots and correlation analysis.
• Scatter plots visually represent the relationship between the
variables.
• sns.scatterplot(x='age', y='fare', data=titanic)
• Correlation analysis quantifies the strength and
direction of the linear relationship between the
variables using a correlation coefficient (e.g., the
Pearson correlation coefficient, which ranges from
-1 to +1).
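As an illustration, the Pearson coefficient can be computed from its definition and checked against pandas. This is a sketch on a few hypothetical age and fare pairs (invented values); Series.corr uses the Pearson method by default.

```python
import numpy as np
import pandas as pd

# Hypothetical age/fare pairs (illustrative values only).
toy = pd.DataFrame({
    "age":  [22, 35, 41, 50, 63],
    "fare": [7.0, 15.0, 30.0, 28.0, 60.0],
})

x = toy["age"].to_numpy(dtype=float)
y = toy["fare"].to_numpy(dtype=float)

# Pearson r: sum of products of deviations, divided by the
# square root of the product of the summed squared deviations.
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same quantity (method='pearson' is the default).
r_pandas = toy["age"].corr(toy["fare"])

print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
```

A value near +1 or -1 indicates a strong linear relationship; a value near 0 indicates little linear relationship, although a non-linear one may still exist.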
Multivariate Analysis
• Multivariate analysis is a statistical technique used to
analyze three or more variables simultaneously.
• It aims to understand the relationships and interactions
between these variables, making it possible to identify
patterns, correlations, and underlying structures in the
data.
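One simple way to look at three variables at once is a pivot table: two categorical variables define the rows and columns, and a third is aggregated in the cells. The sketch below uses a small hypothetical dataset (invented rows, column names borrowed from the Titanic examples) and is only one of many possible multivariate views.

```python
import pandas as pd

# Hypothetical passengers (illustrative values only).
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of 'survived' (the survival rate) for every class/sex combination.
rate = toy.pivot_table(index="pclass", columns="sex",
                       values="survived", aggfunc="mean")
print(rate)
```

Reading across a row shows the effect of one variable while the other is held fixed, which a pair of separate bivariate tables cannot show.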
Pandas Profiling
• Pandas Profiling (now ydata-profiling) is a Python library that
generates an exploratory analysis report for a pandas
DataFrame.
• This report provides a comprehensive summary of the
dataset's characteristics, including descriptive statistics, data
types, missing values, correlations, and visualizations.
• Pandas Profiling aims to automate and streamline the
process of data exploration and preliminary analysis.
• Here are some key features of Pandas Profiling:
• Descriptive Statistics: Pandas Profiling generates descriptive
statistics for each variable in the DataFrame. For numerical variables
these include count, mean, median, minimum, maximum, standard
deviation, and quantiles; for categorical variables, counts of distinct
values and category frequencies.
• Data Types: It provides an overview of the data types of each
column in the DataFrame, including numerical, categorical, and
datetime types.
• Missing Values: Pandas Profiling identifies missing values in the
dataset and provides information about the percentage of missing
values for each column.
• Correlations: It calculates pairwise correlations between
numerical variables and visualizes them using correlation
matrices and heatmap plots. This helps identify linear
relationships between variables.
• Distribution Plots: Pandas Profiling generates histograms
and kernel density plots to visualize the distribution of
numerical variables. For categorical variables, it provides bar
plots showing the frequency of each category.
• Unique Values: It lists the unique values and their
frequencies for each categorical variable, helping identify
potential data quality issues or inconsistencies.
• Interactions: Pandas Profiling highlights potential
interactions between variables by generating pairwise
scatter plots for numerical variables.
• Exporting Reports: The generated analysis report can be
exported to an HTML file or to JSON, or displayed
interactively inside a Jupyter notebook.
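# In a Jupyter notebook; from a terminal, run the command without the leading '!'
!pip install ydata-profiling
import pandas as pd
from ydata_profiling import ProfileReport
df = pd.read_csv('titanic.csv')  # path to your copy of the Titanic data
prof = ProfileReport(df)  # build the profiling report
prof.to_file(output_file='eda.html')  # write the report as an HTML file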
Data Preparation
• Data Preparation is a critical step in Exploratory
Data Analysis (EDA) where the main objective is to
preprocess and transform the raw data into a
format suitable for analysis.
• This step ensures that the data is of high quality,
consistent, and ready for further exploration and
modeling. It typically includes:
• Data Imputation
• Feature Transformation
• Feature Encoding
• Feature Scaling
• Feature Creation
• Feature Selection
Data Imputation
• Data imputation in exploratory data analysis (EDA)
refers to the process of filling in missing values in a
dataset.
• Missing data is a common issue in real-world
datasets and can arise due to various reasons such
as human error, malfunctioning sensors, or data
corruption during collection or storage.
• Imputing missing values is important because many
statistical and machine learning algorithms cannot
handle missing data.
• Therefore, imputation aims to replace missing values
with estimated or predicted values based on the
available data.
• There are several common methods for data
imputation in EDA:
• Mean/Median/Mode Imputation:
• Replace missing values with the mean, median, or mode of the
non-missing values in the same column (mean or median for
numerical columns, mode for categorical columns).
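# Imports needed by the transformer below
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
# Apply ColumnTransformer: impute 'Age' with its mean, one-hot encode 'Sex' and 'Embarked'
preprocessor = ColumnTransformer(transformers=[
    ('age_imputer', SimpleImputer(strategy='mean'), ['Age']),
    ('ohe', OneHotEncoder(), ['Sex', 'Embarked'])
], remainder='passthrough')  # Pass through other columns like 'Fare'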
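For quick exploration, the mean, median, and mode strategies can also be sketched directly in pandas. The example below uses a small hypothetical frame (invented values; the column names mirror the ones above) and applies one strategy per column purely for illustration; which strategy suits which column is a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps (illustrative values only).
toy = pd.DataFrame({
    "Age":      [22.0, np.nan, 30.0, 40.0, np.nan],
    "Fare":     [7.0, 8.0, np.nan, 100.0, 9.0],
    "Embarked": ["S", "C", None, "S", "S"],
})

filled = toy.copy()

# Mean imputation: reasonable when the column has no extreme values.
filled["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Median imputation: less sensitive to outliers such as the fare of 100.
filled["Fare"] = toy["Fare"].fillna(toy["Fare"].median())

# Mode imputation: the most frequent category, for categorical columns.
filled["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(filled)
```

When the data is later split into training and test sets, the fill values should be computed on the training portion only, which is what fitting a SimpleImputer inside a pipeline takes care of.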