Course Title: Data Pre-Processing and Visualization

This document discusses exploratory data analysis techniques in R. It describes functions like describe(), normality(), correlate(), target_by(), relate(), plot_relate(), summary(), and eda_report() to calculate descriptive statistics, test normality, correlate variables, analyze target variables, and generate comprehensive reports. The objectives are to explore a dataset and understand the relationships between variables. Techniques include visualizing distributions, correlations, and relationships with target variables to gain insights from data.

Uploaded by

Intekhab Aslam

Course Title: Data Pre-Processing and Visualization
Ram Mohan Dhara
IMTG / PGDM / Term VI / 2017-2019
Session 3 : EDA (Exploratory Data Analysis)
Session objectives
After completing this session, you will be able to:
• Carry out a comprehensive exploration of a dataset in R
Exploratory Data Analysis
• The dlookr package includes the following EDA functions:
• describe() - provides descriptive statistics for numerical variables
• normality() and plot_normality() - test the normality of numerical variables and visualize it
• correlate() and plot_correlate() - calculate the correlation coefficient between pairs of numerical variables and plot the correlation matrix
• target_by() - defines the target variable
• relate() - describes the relationship between the target variable and a variable of interest
• plot.relate() - visualizes the relationship between the target variable and a variable of interest
• summary() - gives a detailed summary of the analysis
• eda_report() - performs an exploratory data analysis and reports the results
Calculating descriptive statistics using describe() in R
• n : number of observations excluding missing values
• na : number of missing values
• mean : arithmetic average
• sd : standard deviation
• se_mean : standard error of the mean, sd/sqrt(n)
• IQR : interquartile range (Q3-Q1)
• skewness : skewness
• kurtosis : kurtosis
• p25 : Q1, 25th percentile
• p50 : Q2, median, 50th percentile
• p75 : Q3, 75th percentile
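The statistics listed above can be reproduced by hand to see what each column means. The following is a minimal base R sketch, not the dlookr implementation: it uses the built-in mtcars data with one artificially added missing value (an assumption made for illustration), and it leaves out skewness and kurtosis because their exact formula depends on the package. The describe() call at the end runs only if dlookr is installed.

```r
# Base R sketch of the statistics listed above, for a single numeric vector.
x <- c(mtcars$mpg, NA)            # built-in data plus one artificial missing value

n       <- sum(!is.na(x))         # number of observations excluding missing values
na      <- sum(is.na(x))          # number of missing values
avg     <- mean(x, na.rm = TRUE)  # arithmetic average
s       <- sd(x, na.rm = TRUE)    # standard deviation
se_mean <- s / sqrt(n)            # standard error of the mean
q       <- quantile(x, probs = c(0.25, 0.50, 0.75), na.rm = TRUE, names = FALSE)
iqr     <- q[3] - q[1]            # interquartile range (Q3 - Q1)

print(c(n = n, na = na, mean = avg, sd = s, se_mean = se_mean,
        p25 = q[1], p50 = q[2], p75 = q[3], IQR = iqr))

# The dlookr call itself, run only if the package is installed.
if (requireNamespace("dlookr", quietly = TRUE)) {
  print(dlookr::describe(mtcars))
}
```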
Test of normality of numeric variables using normality()
• statistic : statistic of the Shapiro-Wilk test
• p_value : p-value of the Shapiro-Wilk test
• sample : number of sample observations used in the Shapiro-Wilk test
Normality visualization of numerical variables using plot_normality()
• Histogram of original data
• Q-Q plot of original data
• Histogram of log transformed data
• Histogram of square root transformed data
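As a rough illustration of the Shapiro-Wilk output described here, the sketch below runs base R's shapiro.test() on one variable and on its log and square root transformations, the same three views of the data that the plots cover. The choice of mtcars$mpg and the layout of the result table are assumptions for the example; the normality() call runs only if dlookr is installed.

```r
# Shapiro-Wilk test on the original, log transformed and square root
# transformed values of one numeric variable (all values are positive).
x <- mtcars$mpg

sw_raw  <- shapiro.test(x)
sw_log  <- shapiro.test(log(x))
sw_sqrt <- shapiro.test(sqrt(x))

result <- data.frame(
  data      = c("original", "log", "sqrt"),
  statistic = unname(c(sw_raw$statistic, sw_log$statistic, sw_sqrt$statistic)),
  p_value   = c(sw_raw$p.value, sw_log$p.value, sw_sqrt$p.value),
  sample    = length(x)           # number of observations tested
)
print(result)

# The dlookr call itself, run only if the package is installed.
if (requireNamespace("dlookr", quietly = TRUE)) {
  print(dlookr::normality(mtcars))
}
```

A small p_value suggests a departure from normality; comparing the three rows hints at whether a transformation brings the variable closer to a normal distribution.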
Calculation of correlation coefficient using correlate()
• r : Pearson's correlation coefficient
Visualization of the correlation matrix using plot_correlate()
• Visualizes the correlation matrix
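The Pearson coefficient can be checked with base R's cor(). This is a small sketch on three mtcars columns chosen only for illustration; the correlate() call runs only if dlookr is installed.

```r
# Pearson correlation between numeric variables using base R.
num <- mtcars[, c("mpg", "wt", "hp")]

r_mpg_wt <- cor(num$mpg, num$wt, method = "pearson")  # one pair of variables
r_matrix <- cor(num, method = "pearson")              # full correlation matrix

print(r_mpg_wt)
print(round(r_matrix, 2))

# The dlookr call itself, run only if the package is installed.
if (requireNamespace("dlookr", quietly = TRUE)) {
  print(dlookr::correlate(num))
}
```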


EDA on target variable using target_by(), relate(), plot.relate(), summary()
• target_by() - creates a target object (variable)
• relate() - establishes a relationship between target and predictors
• plot.relate() - plots the relationship between target and predictors
• summary() - summarizes the analysis carried out in the background
EDA on target variable
Target variable | Predictor variable | target_by() nomenclature | relate() | plot.relate() | summary()
Categorical | Continuous | tar_cat_pred_cont | Description of predictor at different levels of target | Density plot of predictor at different levels of target | Summary of target
Categorical | Categorical | tar_cat_pred_cat | Creates a contingency table | Mosaic plot between target and predictor | Chi-square statistic, tests independence between target and predictor
Continuous | Continuous | tar_cont_pred_cont | Runs a simple linear regression | Scatter plot with a trend line between target and predictor | Linear regression output between target and predictor
Continuous | Categorical | tar_cont_pred_cat | Creates an ANOVA table | Box plots of target at different levels of predictor | F-statistic
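The four target and predictor combinations correspond to familiar base R analyses. The sketch below shows those analogues on mtcars; the choice of variables is an assumption made for illustration, and these are not the dlookr internals. The last block shows the target_by(), relate(), summary() and plot() sequence for the continuous target and continuous predictor case, and runs only if dlookr is installed.

```r
# Base R analogues of the four target/predictor combinations.
cars <- mtcars
cars$am  <- factor(cars$am, labels = c("automatic", "manual"))
cars$cyl <- factor(cars$cyl)

# Categorical target, continuous predictor:
# describe the predictor at each level of the target.
by_level <- tapply(cars$mpg, cars$am, summary)
print(by_level)

# Categorical target, categorical predictor:
# contingency table and chi-square test of independence.
# (Warnings about small expected counts are suppressed for this toy data.)
tab <- table(cars$am, cars$cyl)
chi <- suppressWarnings(chisq.test(tab))
print(tab)
print(chi)

# Continuous target, continuous predictor: simple linear regression.
fit_lm <- lm(mpg ~ wt, data = cars)
print(summary(fit_lm))

# Continuous target, categorical predictor: ANOVA table with F-statistic.
fit_aov <- aov(mpg ~ cyl, data = cars)
print(summary(fit_aov))

# The dlookr sequence for one combination, run only if the package is installed.
if (requireNamespace("dlookr", quietly = TRUE)) {
  tgt <- dlookr::target_by(cars, mpg)   # define the target variable
  rel <- dlookr::relate(tgt, wt)        # relate the target to one predictor
  print(summary(rel))
  plot(rel)
}
```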
A comprehensive report of EDA using eda_report()
• Introduction
• Information of Dataset
• Information of Variables
• Numerical Variables
• Univariate Analysis
• Descriptive Statistics
• Normality Test of Numerical Variables
• Statistics and Visualization of (Sample) Data
• Relationship Between Variables
• Correlation Coefficient
• Correlation Coefficient by Variable Combination
• Correlation Plot of Numerical Variables
• Target based Analysis
• Numerical Variables and Categorical Variables
• Correlation and Correlation Plots
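A report like this might be generated as sketched below. The wrapper run_eda_report() and its default file name are hypothetical and introduced only for illustration. The arguments output_format, output_file and output_dir are assumed to match the installed dlookr version, and eda_report() may not be available in every release, so the wrapper checks for it before calling.

```r
# Hypothetical wrapper around dlookr::eda_report(); nothing runs until it is called.
run_eda_report <- function(data, file = "EDA_Report.html") {
  if (!requireNamespace("dlookr", quietly = TRUE)) {
    stop("Package 'dlookr' is not installed.")
  }
  # eda_report() may be missing from some dlookr releases, so check first.
  if (!("eda_report" %in% getNamespaceExports("dlookr"))) {
    stop("This dlookr version does not export eda_report().")
  }
  # Assumed arguments: HTML output written to a temporary directory.
  dlookr::eda_report(data,
                     output_format = "html",
                     output_file   = file,
                     output_dir    = tempdir())
}

# Example call (not run here):
# run_eda_report(mtcars)
```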
Summary : what we have learnt
• Essential steps in exploration of data using R

• describe()
• normality() and plot_normality()
• correlate() and plot_correlate()
• target_by()
• relate() and plot.relate()
• summary()
• eda_report()
This concludes the session :
EDA (Exploratory Data Analysis)

Next session :
Introduction to Visual Analytics and Tableau
