
Programming for AI

Exploratory Data Analysis


Today’s Lecture
• Exploratory Data Analysis (EDA)
• EDA and Machine Learning Pipeline
• EDA Steps
• Data Analysis
• Data Understanding
• Univariate Analysis
• Bivariate Analysis
• Multivariate Analysis
• Pandas Profiler (ydata-Profiling)
• Data Preparation
• Data Imputation
• Feature Transformation
• Feature Encoding
• Feature Scaling
• Feature Creation
• Feature Selection
• Column Transformer
• Function Transformer
Exploratory Data Analysis (EDA)
• Exploratory Data Analysis (EDA) is a crucial initial
step in any data science project.
• It involves investigating, summarizing, and
visualizing a dataset to gain insights into its
characteristics, patterns, and potential
relationships between variables.
• Overall, EDA is an iterative process where you
continuously explore and refine your
understanding of the data.
• It sets the stage for more sophisticated analysis by
helping you ask the right questions and choose the
most appropriate techniques for your data.
EDA and Machine Learning Pipeline
EDA Steps
• Data Analysis
• Data Understanding
• Univariate Analysis
• Bivariate Analysis
• Multivariate Analysis
• Pandas Profiler (ydata-Profiling)
• Data Preparation
• Data Imputation
• Feature Transformation
• Feature Encoding
• Feature Scaling
• Feature Creation
• Feature Selection
Data Reading
• import pandas as pd
• df = pd.read_csv("file.csv")
Data Understanding
• Data Understanding is a crucial step in Exploratory Data
Analysis (EDA) where the main objective is to gain
insights into the dataset and understand its structure,
content, and characteristics.
• This step involves examining and familiarizing oneself
with the data to form a solid foundation for subsequent
analysis.
• The main questions we ask in data understanding are:
• How big is the data?
• What are features and their values in the data?
• What is the data type of the features?
• Are there any missing values?
Data Understanding
• How big is the data?
• df.shape

• What are features and their values in the data?
• df.head() and df.tail()
Data Understanding
• What is the data type of the features?
• df.info()
Data Understanding
• Are there any missing values?
• df.isnull().sum()
Data Understanding
• What does the data look like statistically?
• df.describe()
Data Understanding
• Are there any duplicate values?
• df.duplicated().sum()
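The checks above can be run in a few lines (a minimal sketch, assuming the Titanic CSV used in later slides; the file name is hypothetical):

import pandas as pd

df = pd.read_csv('titanic.csv')

print(df.shape)               # How big is the data?
print(df.head())              # What are the features and their values?
df.info()                     # What are the data types of the features?
print(df.isnull().sum())      # Are there any missing values?
print(df.describe())          # What does the data look like statistically?
print(df.duplicated().sum())  # Are there any duplicate rows?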
Univariate Analysis
• Univariate analysis explores each variable in a dataset separately. A variable is typically one of the following types:
• Categorical Data (such as country, gender, etc.)
• Numerical Data (such as weight, age, etc.)
• Mixed Type (such as name, ticket number, etc.)
• Univariate analysis for categorical data involves
examining the distribution and characteristics of a
single categorical variable.
• It aims to understand the frequency of each
category within the variable and any patterns or
trends that may exist.
Univariate Analysis (Categorical Data)
• The univariate analysis can be performed for
categorical data in the following ways:
• Bar Chart: Calculate the frequency or count of each
category in the categorical variable.
sns.countplot(x='Survived', data=df)
Univariate Analysis (Categorical Data)
• Percentage Distribution: Calculate the percentage
or proportion of each category relative to the total
number of observations.
df['Sex'].value_counts().plot(kind='pie', autopct='%0.1f%%')
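Putting both plots together (a minimal sketch, assuming df is the Titanic DataFrame with 'Survived' and 'Sex' columns):

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='Survived', data=df)            # frequency of each category
plt.show()

print(df['Sex'].value_counts(normalize=True))   # proportions as a table
df['Sex'].value_counts().plot(kind='pie', autopct='%0.1f%%')
plt.show()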
Univariate Analysis (Numerical Data)
• Univariate analysis for numerical data involves
exploring the distribution and characteristics of a
single numerical variable.
• Here's how univariate analysis can be performed
for numerical data:
• Mean: Calculate the arithmetic average of the data
points. It gives an indication of the central tendency of
the data.
• Median: Determine the middle value of the dataset
when it is sorted in ascending order. It is less sensitive to
outliers than the mean and provides a robust measure
of central tendency.
Univariate Analysis (Numerical Data)
• Skewness: Assess the asymmetry and tail heaviness of the
distribution. Skewness measures the deviation from
symmetry.
• Histogram:
• Create a histogram to visualize the frequency distribution of the
numerical variable.
• It divides the range of the variable into bins and displays the count
or density of observations in each bin.
• sns.histplot(titanic['age'], bins=27, color='blue', alpha=0.6)
• sns.histplot(titanic['fare'], bins=30, color='orange', alpha=0.6)
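The summary statistics described above map directly onto pandas methods (a minimal sketch, assuming the Titanic DataFrame is named titanic):

print(titanic['age'].mean())    # arithmetic average (central tendency)
print(titanic['age'].median())  # middle value, robust to outliers
print(titanic['age'].skew())    # asymmetry of the distribution
print(titanic['fare'].skew())   # fares are typically strongly right-skewed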
Univariate Analysis (Numerical Data)
• Box Plot: Construct a box plot (box-and-whisker plot) to
display the distribution of the numerical variable, including
its median, quartiles, and outliers.
• Box plots are useful for detecting skewness, outliers, and
variability in the data.
• sns.boxplot(y=titanic['age'])
• sns.boxplot(y=titanic['fare'])
How to Interpret a Boxplot?
• A boxplot is a graphical representation of the distribution of a dataset
based on five key summary statistics:
• Minimum – The smallest value (excluding outliers).
• First Quartile (Q1) – The 25th percentile (lower quartile).
• Median (Q2) – The 50th percentile (middle value).
• Third Quartile (Q3) – The 75th percentile (upper quartile).
• Maximum – The largest value (excluding outliers).
• It also displays outliers as individual points beyond the whiskers.

• Box (IQR: Interquartile Range) → Shows the middle 50% of the data (Q1 to
Q3).
• Line inside the box → Represents the median (Q2).
• Whiskers → Extend to the lowest and highest values within 1.5 times the
IQR.
• Dots outside the whiskers → Indicate outliers (values beyond 1.5×IQR).
Bivariate Analysis
• Bivariate analysis is a statistical method used to
analyze the relationship between two variables
simultaneously.
• It explores the association, correlation, or
dependency between two variables to understand
how changes in one variable are related to changes
in another.
• In bivariate analysis, the two variables being
studied are typically referred to as the independent
variable and the dependent variable.
• The independent variable is the variable that is
manipulated or controlled by the researcher, while
the dependent variable is the variable that is
observed and measured to see how it responds to
changes in the independent variable.
Bivariate Analysis
• Bivariate analysis can involve various statistical techniques and
visualizations depending on the types of variables being studied:
• Categorical vs. Categorical Variables: For categorical variables, bivariate
analysis often involves heatmap, contingency table etc.
• Heatmaps for bivariate analysis typically display the magnitude or
strength of the relationship between variables using color gradients.
• sns.heatmap(pd.crosstab(titanic['pclass'], titanic['survived']),annot=True,fmt='d')
Bivariate Analysis
• Categorical vs. Numerical Variables: When one variable
is categorical and the other is numerical (also known as
quantitative or continuous), bivariate analysis may
involve comparing the distribution of the numerical
variable across different categories of the categorical
variable using techniques such as box plots or
histograms etc.
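One possible sketch, assuming seaborn is imported and the Titanic DataFrame is named titanic:

sns.boxplot(x='pclass', y='age', data=titanic)                  # age distribution per passenger class
sns.histplot(data=titanic, x='age', hue='survived', kde=True)   # age histogram split by survival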
Bivariate Analysis
• Numerical vs. Numerical Variables: For two numerical
variables, bivariate analysis focuses on exploring the
relationship between the variables through techniques
such as scatter plots and correlation analysis.
• Scatter plots visually represent the relationship between the
variables.
• sns.scatterplot(x='age', y='fare', data=titanic)
Bivariate Analysis
• Correlation analysis quantifies the strength and
direction of the linear relationship between the
variables using correlation coefficients (e.g., Pearson
correlation coefficient).
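A minimal sketch, assuming the Titanic DataFrame:

titanic[['age', 'fare']].corr()        # pairwise Pearson correlations (the default method)
titanic['age'].corr(titanic['fare'])   # single coefficient between two columns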
Multivariate Analysis
• Multivariate analysis is a statistical technique used to
analyze data that involves multiple variables
simultaneously.
• It aims to understand the relationships and interactions
between these variables, making it possible to identify
patterns, correlations, and underlying structures in the
data.
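One possible sketch with seaborn, assuming the Titanic DataFrame with numeric 'age', 'fare', and 'pclass' columns:

sns.pairplot(titanic[['age', 'fare', 'pclass']], hue='pclass')       # pairwise scatter plots colored by class
sns.heatmap(titanic[['age', 'fare', 'pclass']].corr(), annot=True)   # correlation matrix of several variables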
Pandas Profiling
• Pandas Profiling (now ydata-profiling) is a Python library that
generates an exploratory analysis report for a DataFrame in
pandas.
• This report provides a comprehensive summary of the
dataset's characteristics, including descriptive statistics, data
types, missing values, correlations, and visualizations.
• Pandas Profiling aims to automate and streamline the
process of data exploration and preliminary analysis.
• Here are some key features of Pandas Profiling:
• Descriptive Statistics: Pandas Profiling generates descriptive
statistics for each numerical and categorical variable in the
DataFrame, including count, mean, median, minimum, maximum,
standard deviation, and quantiles.
• Data Types: It provides an overview of the data types of each
column in the DataFrame, including numerical, categorical, and
datetime types.
• Missing Values: Pandas Profiling identifies missing values in the
dataset and provides information about the percentage of missing
values for each column.
Pandas Profiling
• Correlations: It calculates pairwise correlations between
numerical variables and visualizes them using correlation
matrices and heatmap plots. This helps identify linear
relationships between variables.
• Distribution Plots: Pandas Profiling generates histograms
and kernel density plots to visualize the distribution of
numerical variables. For categorical variables, it provides bar
plots showing the frequency of each category.
• Unique Values: It lists the unique values and their
frequencies for each categorical variable, helping identify
potential data quality issues or inconsistencies.
• Interactions: Pandas Profiling identifies potential
interactions between variables by generating scatter plots,
pair plots, and parallel coordinates plots for numerical
variables.
• Exporting Reports: The generated analysis report can be
exported to various formats, including HTML, JSON, and
interactive HTML formats.
Pandas Profiling
!pip install ydata-profiling
import pandas as pd
from ydata_profiling import ProfileReport

df=pd.read_csv('titanic.csv')

prof = ProfileReport(df)
prof.to_file(output_file='eda.html')
Data Preparation
• Data Preparation is a critical step in Exploratory
Data Analysis (EDA) where the main objective is to
preprocess and transform the raw data into a
format suitable for analysis.
• This step ensures that the data is of high quality,
consistent, and ready for further exploration and
modeling.
• Data Imputation
• Feature Transformation
• Feature Encoding
• Feature Scaling
• Feature Creation
• Feature Selection
Data Imputation
• Data imputation in exploratory data analysis (EDA)
refers to the process of filling in missing values in a
dataset.
• Missing data is a common issue in real-world
datasets and can arise due to various reasons such
as human error, malfunctioning sensors, or data
corruption during collection or storage.
• Imputing missing values is important because many
statistical and machine learning algorithms cannot
handle missing data.
Data Imputation
• Therefore, imputation aims to replace missing values
with estimated or predicted values based on the
available data.
• There are several common methods for data
imputation in EDA:
• Mean/Median/Mode Imputation:
• Replace missing values with the mean, median, or mode of the
non-missing values in the same column.

• # Mean imputation for the 'Age' column (numerical)
• titanic['age'].fillna(titanic['age'].mean(), inplace=True)
• # Mode imputation for the 'Embarked' column (categorical)
• titanic['embarked'].fillna(titanic['embarked'].mode()[0], inplace=True)
Data Imputation Techniques
• Forward Fill (or Backward Fill) Imputation:
• Propagate the last (or next) observed value forward (or backward)
to fill missing values.
• This method is suitable for time-series data where the missing
values are assumed to follow the trend of the observed data.
• titanic['age'].fillna(method='ffill', inplace=True)
• Interpolation:
• Estimate missing values based on the values of neighboring data
points.
• Linear interpolation, spline interpolation, or other interpolation
techniques can be used depending on the nature of the data.
• titanic['age'].interpolate(method='linear',inplace=True)
• K-Nearest Neighbors (KNN) Imputation:
• Estimate missing values by averaging the values of the nearest neighbors in the feature space.
• This method considers the similarity between data points and can give better estimates than simple statistics when features are correlated (a sketch follows below).
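A minimal sketch with scikit-learn's KNNImputer, assuming numeric 'age' and 'fare' columns in the Titanic DataFrame:

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)   # each missing value is the average of the 5 nearest neighbours
titanic[['age', 'fare']] = imputer.fit_transform(titanic[['age', 'fare']])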
Feature Transformation
• Feature transformation is the process of converting
or modifying the features (variables) in a dataset to
make them more suitable for analysis or modeling.
• It involves applying various mathematical or statistical
transformations to the original features to extract
more useful information, improve the performance of
machine learning models, or meet specific
requirements of the analysis.
• Feature transformation can include a wide range of
techniques, including:
• Feature Encoding
• Feature Scaling
• Outlier Detection
• Renaming Columns
• Identifying Duplicate Rows
• Handling Irrelevant Columns and Rows
Feature Encoding
• In the context of exploratory data analysis (EDA),
feature encoding refers to the process of
converting categorical or textual data into a
numerical format that can be used for analysis and
modeling.
• Feature encoding is essential because many
machine learning algorithms and statistical
techniques require numerical inputs.
• Ordinal Encoding
• One-Hot Encoding
• Label Encoding
Ordinal Encoding
• OrdinalEncoder is specifically designed to handle
ordinal data where the order of categories is important.
• It encodes the categories in the given order, ensuring
that the numbers reflect the correct ranking of the
data.
• For example, "low", "medium", and "high" have a
natural order, and ordinal encoding ensures this order
is preserved in the numerical values.
• oe = OrdinalEncoder(categories=[['Poor','Average','Good'], ['School','UG','PG']])
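A fuller sketch of the same encoder, assuming a small DataFrame with hypothetical 'review' and 'education' columns:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'review': ['Poor', 'Good', 'Average'],
                   'education': ['UG', 'PG', 'School']})
oe = OrdinalEncoder(categories=[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']])
encoded = oe.fit_transform(df[['review', 'education']])
# 'Poor' -> 0, 'Average' -> 1, 'Good' -> 2 and 'School' -> 0, 'UG' -> 1, 'PG' -> 2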
One-Hot Encoding
• One-Hot Encoding is a technique used to convert
categorical variables into a binary (0 or 1)
representation, where each unique category in
the original feature is transformed into a new
binary feature (column).

• Using the pandas library:
• pd.get_dummies(df, columns=['fuel','owner'])
• Using the sklearn library:
• ohe = OneHotEncoder()
• X_train_new = ohe.fit_transform(X_train[['fuel','owner']])
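A slightly fuller sketch with scikit-learn, assuming a DataFrame X_train with 'fuel' and 'owner' columns:

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')  # sparse_output requires scikit-learn >= 1.2
X_train_new = ohe.fit_transform(X_train[['fuel', 'owner']])
print(ohe.get_feature_names_out())                                 # names of the generated binary columns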
Label Encoding
• Label Encoding is a technique used in machine
learning to convert categorical data (features or
target variables) into numerical labels.

• Often used when the target variable has categorical classes (e.g., binary classification: 'Yes', 'No' or multi-class classification: 'Dog', 'Cat', 'Rabbit').
• Assigns a unique numerical value to each category.
For example, if you have categories like "Red,"
"Green," and "Blue," you might assign them 0, 1, and
2 respectively.

• from sklearn.preprocessing import LabelEncoder
• label_encoder = LabelEncoder()
• colors = ['red', 'blue', 'green']
• # Fit and transform the data
• encoded_colors = label_encoder.fit_transform(colors)   # [2, 0, 1]: classes are sorted alphabetically
Feature Scaling
• Feature scaling is a technique used in
machine learning to normalize the range of
independent variables or features of the data.
• In many machine learning algorithms, the
distance between data points influences the
model's performance.
• Features on different scales can
disproportionately affect the model, leading to
suboptimal performance.
• Min-Max Scaling (Normalization)
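A minimal sketch of min-max scaling with scikit-learn, assuming numeric 'age' and 'fare' columns in the Titanic DataFrame:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()   # rescales each feature to the [0, 1] range
titanic[['age', 'fare']] = scaler.fit_transform(titanic[['age', 'fare']])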
Feature Creation
• Feature creation is the process of generating new
features from existing data to improve the
performance of machine learning models.
• It involves transforming raw data into meaningful
features that make patterns more visible to the
model, thus increasing its predictive power.
• Consider the Titanic dataset. Some of the
passengers' names contain titles like "Mr.",
"Mrs.", or "Dr.", which can provide additional
information about the passenger's social status or
occupation.
• # Example of creating a new feature 'FamilySize' by summing
SibSp and Parch
• data['FamilySize'] = data['SibSp'] + data['Parch'] + 1
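The title example above can also be turned into a feature (a possible sketch, assuming a 'Name' column formatted like "Braund, Mr. Owen Harris"):

# Extract the word ending in a period as the passenger's title
data['Title'] = data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
data['Title'].value_counts()   # e.g. Mr, Mrs, Miss, Dr, ...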
Column Transformer
• ColumnTransformer is a class in the
sklearn.compose module of the scikit-learn
library that allows you to apply different
preprocessing steps to different columns of a dataset
within a single pipeline.
• It is particularly useful for datasets that contain both
numerical and categorical features, as it enables you
to handle each type of data appropriately.
• You can specify different transformers for different
columns or sets of columns. For instance, you might
want to standardize numerical features while
encoding categorical features.
• ColumnTransformer applies transformations in parallel to the different columns or groups of columns specified in its configuration.
Column Transformer (Example)
# Imports
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Sample DataFrame
data = pd.DataFrame({
    'Age': [22, None, 24, 22, None, 24],
    'Sex': ['male', 'female', 'female', 'male', 'female', 'female'],
    'Embarked': ['B', 'B', 'C', 'C', 'C', 'S'],
    'Fare': [7.25, 71.83, 8.05, 7.25, 71.83, 8.05]
})

# Apply ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('age_imputer', SimpleImputer(strategy='mean'), ['Age']),   # impute 'Age' with its mean
    ('ohe', OneHotEncoder(), ['Sex', 'Embarked'])               # one-hot encode the categorical columns
], remainder='passthrough')  # Pass through other columns like 'Fare'

# Transform the data
transformed_data = preprocessor.fit_transform(data)
Output
Original DataFrame:
Age Sex Embarked Fare
0 22.0 male B 7.25
1 NaN female B 71.83
2 24.0 female C 8.05
3 22.0 male C 7.25
4 NaN female C 71.83
5 24.0 female S 8.05

Transformed Data Shape: (6, 7)
[[22. 0. 1. 1. 0. 0. 7.25]
[23. 1. 0. 1. 0. 0. 71.83]
[24. 1. 0. 0. 1. 0. 8.05]
[22. 0. 1. 0. 1. 0. 7.25]
[23. 1. 0. 0. 1. 0. 71.83]
[24. 1. 0. 0. 0. 1. 8.05]]
Function Transformer
FunctionTransformer is a utility in Scikit-learn that
allows you to create a transformer from a custom
function.
It is particularly useful when you want to apply a
specific function to your data during the
preprocessing step.
This flexibility enables you to integrate custom
transformations into Scikit-learn’s pipeline and
ColumnTransformer workflows.
Function Transformer (Example)
# Additional import (plus the imports from the previous example)
from sklearn.preprocessing import FunctionTransformer

# Sample DataFrame
data = pd.DataFrame({
    'Age': [22, None, 24, 22, None, 24],
    'Sex': ['male', 'female', 'female', 'male', 'female', 'female'],
    'Embarked': ['B', 'B', 'C', None, 'C', 'S'],
    'Fare': [7.25, 71.83, 8.05, 7.25, 71.83, 8.05]
})

# Custom function to impute 'Embarked' with its most frequent value
def impute_embarked(X):
    X['Embarked'] = X['Embarked'].fillna(X['Embarked'].mode()[0])
    return X
Function Transformer (Contd..)
# Apply ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('age_imputer', SimpleImputer(strategy='mean'), ['Age']),
    ('embarked_imputer', FunctionTransformer(impute_embarked), ['Embarked']),
    ('ohe', OneHotEncoder(), ['Sex', 'Embarked'])
], remainder='passthrough')  # Pass through other columns like 'Fare'

# Transform the data
transformed_data = preprocessor.fit_transform(data)
Pipeline
• In scikit-learn (sklearn), a pipeline is a sequence of
data processing steps combined into a single
object.
• Pipeline allows you to sequentially apply a list of
transformers to preprocess the data and, if desired,
conclude the sequence with a final predictor for
predictive modeling.
• Pipelines are useful for encapsulating multiple
preprocessing steps and modeling steps into a
single entity, making it easier to manage and apply
the entire workflow consistently.
• This can include feature encoding, feature scaling,
feature selection, model training, and prediction.
Pipeline (Example)
# Additional import (plus the imports from the previous examples)
from sklearn.pipeline import Pipeline

# Sample DataFrame
data = pd.DataFrame({
    'Age': [22, None, 24, 22, None, 24],
    'Sex': ['male', 'female', 'female', 'male', 'female', 'female'],
    'Embarked': ['B', 'B', 'C', None, 'C', 'S'],
    'Fare': [7.25, 71.83, 8.05, 7.25, 71.83, 8.05]
})

# Custom function to impute 'Embarked' with its most frequent value
def impute_embarked(X):
    X['Embarked'] = X['Embarked'].fillna(X['Embarked'].mode()[0])
    return X
Pipeline Example (Contd..)
# Apply ColumnTransformer with a nested Pipeline for 'Embarked'
preprocessor = ColumnTransformer(transformers=[
    ('age_imputer', SimpleImputer(strategy='mean'), ['Age']),
    ('embarked_encoder', Pipeline(steps=[
        ('imputer', FunctionTransformer(impute_embarked)),   # fill missing 'Embarked' values
        ('onehot', OneHotEncoder())                          # then one-hot encode them
    ]), ['Embarked']),
    ('ohe', OneHotEncoder(), ['Sex'])
], remainder='passthrough')  # Pass through other columns like 'Fare'

# Transform the data
transformed_data = preprocessor.fit_transform(data)
Output
Original DataFrame:
Age Sex Embarked Fare
0 22.0 male B 7.25
1 NaN female B 71.83
2 24.0 female C 8.05
3 22.0 male C 7.25
4 NaN female C 71.83
5 24.0 female S 8.05

Transformed Data Shape: (6, 7)
[[22. 1. 0. 0. 0. 1. 7.25]
[23. 1. 0. 0. 1. 0. 71.83]
[24. 0. 1. 0. 1. 0. 8.05]
[22. 0. 1. 0. 0. 1. 7.25]
[23. 0. 1. 0. 1. 0. 71.83]
[24. 0. 0. 1. 1. 0. 8.05]]
Today’s Lecture
• Exploratory Data Analysis (EDA)
• EDA and Machine Learning Pipeline
• EDA Steps
• Data Analysis
• Data Understanding
• Univariate Analysis
• Bivariate Analysis
• Multivariate Analysis
• Pandas Profiler (ydata-Profiling)
• Data Preparation
• Data Imputation
• Feature Transformation
• Feature Encoding
• Feature Scaling
• Feature Creation
• Feature Selection
• Column Transformer
• Function Transformer
References
• https://pypi.org/project/pandas-profiling/
• https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
• https://scikit-learn.org/1.5/modules/generated/sklearn.compose.ColumnTransformer.html
• https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.FunctionTransformer.html
• https://scikit-learn.org/1.5/modules/generated/sklearn.pipeline.Pipeline.html
