Business Club
Basic Statistics

Business Club Analytics Team
October 2017

INDEX

Cleaning the Data
Conditional Probability
Bayes Theorem
Example of Bayes Theorem
Histograms
Box Plots
Strip Charts
Scatter Plots
Normal QQ Plots
Gaussian Distribution

  
 


 

CLEANING THE DATA:


1. Variable Identification
2. Univariate Analysis
3. Multivariate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation *
7. Variable creation *

Variable Identification :

Continuous variable : These variables can take on any numeric value within a range.

Eg : temperature

Categorical variable : These variables take a finite number of categories or groups.

Eg : Gender

For any dataset, perhaps the most important step is identifying the Predictor (input) and
Target (output) variables. Next is identifying the data type and category of each variable.

Let’s understand this step more clearly by taking an example.

Example: Suppose we want to predict whether students will play cricket or not (refer to
the data set below). Here you need to identify the predictor variables, the target
variable, the data types of the variables and the categories of the variables.


 

Univariate Analysis :

Univariate analysis is the simplest form of analyzing data. It deals with only one
variable at a time and is mostly used to find patterns in the data.

Here also, the approaches for analysing categorical and continuous variables are
separate:

Continuous Variables:
Here, we need to understand the central tendency of the data and the spread of the
variable. These are measured using various statistical metrics and visualization
methods as shown below.

Most terms in the table are self explanatory, except for some which we’ll discuss here:


 

Quartile and InterQuartile Range (IQR):

Quartiles are the data values that divide an ordered dataset into four equal parts. The
first quartile (Q1) is the middle value between the smallest value and the median, Q2 is
the median of the data, and the third quartile (Q3) is the middle value between the
median and the highest value.

The IQR, as the name suggests, is the measure of dispersion between the upper and
lower quartiles:
IQR = Q3 - Q1
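
As a quick illustration, here is a minimal R sketch using a made-up numeric vector x
(hypothetical data, not from any dataset in this document):

> x <- c(2, 4, 4, 5, 7, 9, 11, 15)
> quantile(x)   # minimum, Q1, median (Q2), Q3 and maximum
> IQR(x)        # Q3 - Q1, the interquartile range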

Kurtosis: In a similar way to the concept of skewness, kurtosis is a descriptor of the
shape of a probability distribution and, just as for skewness, there are different ways of
quantifying it for a theoretical distribution and corresponding ways of estimating it from
a sample.

Mathematical tools for quantifying kurtosis will be discussed later.


 

Categorical Variables:
For categorical variables, we can use a frequency table to understand the distribution of
each category. We can also read it as the percentage of values under each category. It
can be measured using two metrics, Count and Count%, against each category. A bar
chart can be used as visualization.
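
A minimal R sketch, using a hypothetical vector of categories:

> gender <- factor(c("Male", "Female", "Female", "Male", "Female"))
> table(gender)               # Count of each category
> prop.table(table(gender))   # Count% (proportion) of each category
> barplot(table(gender))      # bar chart visualization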

Multivariate Analysis :

Any method that is used to analyse data with more than one variable is called
multivariate analysis.

Within multivariate analysis, we’ll discuss bi-variate analysis in greater detail as it is the
most frequently used:

Bi-Variate Analysis:

Bi-variate analysis finds out the relationship between two variables. Here, we look for
association and dissociation between the variables at a pre-defined significance level.
We can perform bi-variate analysis for any combination of categorical and continuous
variables: Categorical & Categorical, Categorical & Continuous, and Continuous &
Continuous. Different methods are used to tackle these combinations during the
analysis process.

Correlation:
It is a statistical measure that indicates the extent to which two or more variables
fluctuate together. A positive correlation indicates that the variables increase or
decrease in parallel; a negative correlation indicates that one variable increases as the
other decreases.
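
In R, the cor function computes this directly; a small sketch with made-up vectors:

> x <- c(1, 2, 3, 4, 5)
> y <- c(2, 4, 5, 4, 6)
> cor(x, y)   # Pearson correlation; values near +1 or -1 indicate a strong association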


 

Continuous & Continuous:

While doing bi-variate analysis between two continuous variables, we should look at a
scatter plot. The pattern of the scatter plot indicates the relationship between the
variables, which can be linear or nonlinear.

Categorical & Categorical:

To find the relationship between two categorical variables, we can use the following
methods:
● Two-way table: We can start analyzing the relationship by creating a two-way
table of count and count%. The rows represent the categories of one variable and
the columns represent the categories of the other variable. We show the count or
count% of observations available in each combination of row and column
categories.


 

● Stacked Column Chart: This method is more of a visual form of the two-way table.

● Chi-Square Test: This test is used to derive the statistical significance of the
relationship between the variables. It also tests whether the evidence in the
sample is strong enough to generalize the relationship to a larger population
as well. Chi-square is based on the difference between the expected and
observed frequencies in one or more categories of the two-way table. It returns
the probability for the computed chi-square statistic with the corresponding
degrees of freedom.

Probability of 0: It indicates that both categorical variables are dependent.

Probability of 1: It shows that both variables are independent.
Probability less than 0.05: It indicates that the relationship between the
variables is significant at 95% confidence. The chi-square test statistic for
a test of independence of two categorical variables is found by:

χ² = Σ (O − E)² / E

where O represents the observed frequency and E is the expected frequency
under the null hypothesis, computed for each cell as:

E = (row total × column total) / grand total

Different data science languages and tools have specific methods to perform
the chi-square test. In SAS, we can use Chisq as an option with Proc freq to
perform this test.
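
A minimal sketch of the test in R, using made-up vectors for the cricket example
(hypothetical data, so the counts carry no real meaning):

> gender <- c("Male", "Male", "Female", "Female", "Male", "Female", "Male", "Female")
> plays  <- c("Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No")
> tbl <- table(gender, plays)   # the two-way table of counts
> chisq.test(tbl)               # a p-value below 0.05 suggests the variables are dependent

With such a tiny sample R will warn that the expected counts are small; on a real
dataset this is rarely an issue.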


 

Categorical & Continuous:

While exploring the relation between a categorical and a continuous variable, we can
draw box plots for each level of the categorical variable. If the number of levels is small,
the plot alone will not show statistical significance. To look at statistical significance we
can perform a Z-test, T-test or ANOVA. These will be discussed in detail later.

Missing values treatment 

Missing values are categorised broadly into four types :

● Missing completely at random:​​ This is a case when the probability of missing


variable is same for all observations. For example: respondents of data collection
process decide that they will declare their earning after tossing a fair coin. If an
head occurs, respondent declares his / her earnings & vice versa. Here each
observation has equal chance of missing value.
● Missing at random: ​This is a case when variable is missing at random and
missing ratio varies for different values / level of other input variables. For
example: We are collecting data for age and female has higher missing value
compare to male.
● Missing that depends on unobserved predictors:​​ This is a case when the
missing values are not random and are related to the unobserved input variable.
For example: In a medical study, if a particular diagnostic causes discomfort,
then there is higher chance of drop-out from the study.
● Missing that depends on the missing value itself: ​This is a case when the
probability of missing value is directly correlated with missing value itself. For
example: People with higher or lower income are likely to provide non-response
to their earning.

1. Deletion
Here, we delete observations where any of the variables is missing. This method
is easy to apply, but it reduces the sample size.

Deletion methods are used when the nature of the missing data is “Missing
completely at random”; otherwise, non-random missing values can bias the model
output.


 

2. Mean/Mode/Median Imputation:

Imputation is a method to fill in the missing values with estimated ones. In this
method the missing values are replaced with the mean/mode/median of the known
values of that variable.

It can be of two types:

1. Generalized Imputation: In this case, we calculate the mean, median or mode
of all non-missing values of the variable and replace the missing values with it.
2. Similar case Imputation: In this case, we calculate the average separately for
each group. For example, we calculate the average manpower for gender
“Male” (29.75) and “Female” (25) individually from the non-missing values,
then replace the missing values based on gender: for “Male“ we replace
missing values of manpower with 29.75 and for “Female” with 25.
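
A minimal R sketch of both approaches, using a hypothetical data frame (the column
names are illustrative only):

> df <- data.frame(gender   = c("Male", "Female", "Male", "Female", "Male"),
                   manpower = c(30, 25, 29.5, NA, NA))
> # Generalized imputation: replace every NA with the overall median
> overall <- median(df$manpower, na.rm = TRUE)
> # Similar case imputation: replace each NA with its gender-group mean instead
> grp <- ave(df$manpower, df$gender, FUN = function(v) mean(v, na.rm = TRUE))
> df$manpower[is.na(df$manpower)] <- grp[is.na(df$manpower)]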

3. Prediction Model:
A prediction model is one of the more sophisticated methods for handling missing
data. Here, we create a predictive model to estimate the values that will substitute
for the missing data.

In this case the data is divided into two sets – one with no missing values for the
variable and another with missing values. The first is treated as the training data
set and used to predict the missing values of the other set. The major drawbacks
of this method are:

1. The model-estimated values are usually better behaved than the true
values.

2. If there is no relationship between the other attributes in the data set and
the attribute with missing values, then the model will not be precise in
estimating the missing values.
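
A rough sketch of this idea in R using lm, assuming a hypothetical data frame df with a
numeric column income (partly missing) and fully observed predictors age and gender:

> known   <- df[!is.na(df$income), ]   # training set: rows where income is present
> unknown <- df[ is.na(df$income), ]   # rows whose income we want to estimate
> fit <- lm(income ~ age + gender, data = known)
> df$income[is.na(df$income)] <- predict(fit, newdata = unknown)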

DEALING WITH MISSING VALUES IN R AND PYTHON : 


http://www.statmethods.net/input/missingdata.html
https://www.r-bloggers.com/missing-value-treatment/


 

 
OUTLIERS :
An outlier is an observation that appears far away from, and diverges from, the overall
pattern in a sample.

What is the impact of outliers on a dataset?

Outliers can drastically change the results of data analysis and statistical modeling.
There are numerous unfavourable impacts of outliers in a data set:

● They increase the error variance and reduce the power of statistical tests.
● They can bias or influence estimates that may be of substantive interest.
● They can also violate the basic assumptions of regression and other statistical
models.

To understand the impact more deeply, let’s take an example to check what happens to
a data set with and without outliers.


The table probably emphasizes enough how important handling outliers is. Now we can
move on to detecting and handling outliers.

HOW TO DETECT OUTLIERS:

Visualization through box plots and histograms: Outliers are easily detected through
visualisations such as histograms and box plots.

Thumb rules to detect outliers:

1. Any value that lies outside the range Q1 − 1.5 × IQR to Q3 + 1.5 × IQR (see the R
sketch below).
2. Any value outside the range of the 5th and 95th percentiles can be considered an
outlier.
3. Data points three or more standard deviations away from the mean are considered
outliers.
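
The IQR rule translates directly into R; a small sketch with made-up data:

> x <- c(5, 7, 8, 9, 10, 11, 12, 48)       # 48 is an obvious outlier
> q <- quantile(x, c(0.25, 0.75))
> fence <- 1.5 * (q[2] - q[1])
> x[x < q[1] - fence | x > q[2] + fence]   # values outside the fences
> boxplot.stats(x)$out                     # base R applies a similar rule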


Dealing with outliers

1. Deletion: We delete outlier values if they are due to a data entry or data processing
error, or if the outlier observations are very small in number. We can also trim the
data at both ends to remove outliers.

2. Transforming and binning values: Transforming variables can also eliminate
outliers. Taking the natural log of a value reduces the variation caused by extreme
values. Binning is also a form of variable transformation. The Decision Tree
algorithm deals with outliers well because it bins the variables. We can also use
the process of assigning weights to different observations.

3. Treating them separately: If there is a significant number of outliers, we should
treat them separately in the statistical model. One approach is to treat the two
groups as different, build an individual model for each group, and then combine
the outputs.

4. Imputing: As with the imputation of missing values, we can also impute outliers.
We can use mean, median or mode imputation methods. Before imputing values,
we should analyse whether an outlier is natural or artificial. If it is artificial, we can
go ahead with imputing values. We can also use a statistical model to predict the
values of outlier observations and then impute them with the predicted values.


*FEATURE ENGINEERING
Feature engineering is the science of extracting more information from existing data.

Two important steps in feature engineering are:

● Variable transformation
● Variable / feature creation

Conditional Probability
Conditional probabilities arise naturally in the investigation of experiments where an
outcome of a trial may affect the outcomes of subsequent trials.
We try to calculate the probability of the second event (event B) given that the first
event (event A) has already happened. If the probability of the event changes when we
take the first event into consideration, we can safely say that the probability of event B
is dependent on the occurrence of event A.
Let’s think of cases where this happens:
● Drawing a second ace from a deck given we got the first ace
● Finding the probability of having a disease given you tested positive
● Finding the probability of liking Harry Potter given we know the person likes
fiction
And so on....
Here we can define 2 events:
● Event A is the event whose probability we’re trying to calculate.
● Event B is the condition that we know, or the event that has already happened.

We can write the conditional probability as P(A | B) = P(A and B) / P(B), the probability
of the occurrence of event A given that B has already happened.

Let’s play a simple game of cards for you to understand this. Suppose you draw two
cards from a deck and you win if you get a jack followed by an ace (without
replacement). What is the probability of winning, given we know that you got a jack in
the first turn?
Let event A be getting a jack in the first turn.
Let event B be getting an ace in the second turn.


We need to find P(B | A), the probability of an ace in the second turn given a jack in the
first:

P(A) = 4/52
P(B | A) = 4/51 (no replacement)
P(A and B) = P(A) * P(B | A) = 4/52 * 4/51 ≈ 0.006

Here we are determining probabilities when we know some conditions, instead of
calculating unconditional probabilities. Here we knew that a jack came up in the first
turn.
Let’s take another example.
Suppose you have a jar containing 6 marbles – 3 black and 3 white. What is the
probability of getting a black marble in the second draw, given the first one was black
too?
Let A be getting a black marble in the first turn.
Let B be getting a black marble in the second turn.
P(A) = 3/6
P(B | A) = 2/5
P(A and B) = 1/2 * 2/5 = 1/5
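
The marble calculation can be reproduced in R as simple arithmetic:

> p_a       <- 3/6         # P(A): first marble is black
> p_a_and_b <- 3/6 * 2/5   # P(A and B): both marbles are black
> p_a_and_b / p_a          # P(B | A) = 2/5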

Bayes Theorem
The Bayes theorem describes the probability of an event based on prior knowledge of
the conditions that might be related to the event. If we know the conditional probability
P(B | A), we can use Bayes’ rule to find out the reverse probability P(A | B).

How can we do that?

P(A | B) = P(B | A) × P(A) / P(B)

The above statement is the general representation of the Bayes rule.

For the previous example – if we now wish to calculate the probability of having a pizza
for lunch given that you had a bagel for breakfast, it would be 0.7 × 0.5 / 0.6.
We can generalize the formula further.
If multiple events Ai form an exhaustive set with another event B, we can write the
equation as:

P(Ai | B) = P(B | Ai) × P(Ai) / Σk P(B | Ak) × P(Ak)

Example of Bayes Theorem

Let’s take the example of breast cancer patients. The patients were tested thrice before
the oncologist concluded that they had cancer. The general belief is that 1.48 out of
1000 people in the US had breast cancer at the time this test was conducted. The
patients were tested over multiple tests. Three sets of tests were done and a patient
was only diagnosed with cancer if she tested positive in all three of them.
Let’s examine the test in detail.
Sensitivity of the test (93%) – true positive rate
Specificity of the test (99%) – true negative rate

Let’s first compute the probability of having cancer given that the patient tested positive
in the first test:
P (has cancer | first test +)
P (cancer) = 0.00148
Sensitivity can be denoted as P (+ | cancer) = 0.93
Specificity can be denoted as P (− | no cancer) = 0.99
Since we do not have any other information, we treat the patient as a randomly
sampled individual. Hence our prior belief is that there is a 0.148% probability of the
patient having cancer.
The complement is that there is a 100% − 0.148% = 99.852% chance that the patient
does not have cancer. We can draw a probability tree to denote these probabilities.

Let’s now try to calculate the probability of having cancer given that the patient tested
positive on the first test, i.e. P (cancer | +).

P (cancer and +) = P (cancer) × P (+ | cancer) = 0.00148 × 0.93 ≈ 0.00138

P (no cancer and +) = P (no cancer) × P (+ | no cancer) = 0.99852 × 0.01 ≈ 0.00999
To calculate the probability of testing positive, note that the person can either have
cancer and test positive, or not have cancer and still test positive:

P (+) = 0.00138 + 0.00999 ≈ 0.01137
P (cancer | +) = 0.00138 / 0.01137 ≈ 0.12

This means that there is a 12% chance that the patient has cancer given that he tested
positive in the first test. This is known as the posterior probability.
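
The same posterior can be checked numerically in R:

> prior <- 0.00148                                   # P(cancer)
> sens  <- 0.93                                      # P(+ | cancer), sensitivity
> spec  <- 0.99                                      # P(- | no cancer), specificity
> p_pos <- prior * sens + (1 - prior) * (1 - spec)   # total probability of testing positive
> prior * sens / p_pos                               # posterior P(cancer | +), roughly 0.12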

 
TYPES OF PLOTS 
 

Histograms
A histogram is a very common plot. It plots the frequencies with which the data appear
within certain ranges.

Datasets used in the examples:

1. (w1)
https://docs.google.com/spreadsheets/d/1uYfR4Oi7FlgkjtyVgTl7uPYpDmBhIrtP6O2hGvLFjHM/edit?usp=sharing

2. Trees
https://docs.google.com/spreadsheets/d/1WhtwoyDd7Xkkeou_NcQkBd_ygQ5ABdjdEgbF_cilKQM/edit?usp=sharing
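
Assuming the two sheets have been downloaded as CSV files named w1.csv and
trees91.csv (hypothetical filenames), they can be loaded into R with:

> w1   <- read.csv("w1.csv")        # data frame with a column named vals
> tree <- read.csv("trees91.csv")   # data frame with columns including STBM and LFBM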


Implementation in R

To plot a histogram of the data use the “hist” command:

> hist(w1$vals)

The title and axis label can be set with the main and xlab arguments:

> hist(w1$vals, main="Distribution of w1", xlab="w1")


Box Plots 
A box plot provides a graphical view of the median, quartiles, maximum, and minimum
of a data set.

Implementation in R 

We first use the w1 data set and look at the boxplot of this data set:

> boxplot(w1$vals)

Again, this is a very plain graph; the title and labels can be specified in exactly the same
way as in the hist and stripchart commands:

> boxplot(w1$vals, main='Leaf BioMass in High CO2 Environment', ylab='BioMass of Leaves')


Note that the default orientation is to plot the boxplot vertically. Because of this we used the ylab
option to specify the axis label. There are a large number of options for this command. To see
more of the options see the help page:

> help(boxplot)

Strip Charts 
A strip chart is the most basic type of plot available. It plots the data in order along a
line, with each data point represented as a box.

Implementation in R 

Here we provide examples using the w1 data frame mentioned above; the one column
of the data is w1$vals.

To create a strip chart of this data use the stripchart command:


> help(stripchart)
> stripchart(w1$vals)
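
The title and labels can be specified just as with the other plotting commands, for
example:

> stripchart(w1$vals, main="Leaf BioMass in High CO2 Environment", xlab="BioMass of Leaves")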

Scatter Plots 
 
A scatter plot provides a graphical view of the relationship between two sets of
numbers. Here we provide examples using the tree data frame from the trees91.csv
data file mentioned above. In particular we look at the relationship between the stem
biomass (“tree$STBM”) and the leaf biomass (“tree$LFBM”).

The command to plot each pair of points as an x-coordinate and a y-coordinate is
“plot”:

> plot(tree$STBM,tree$LFBM)

You should always annotate your graphs. The title and labels can be specified in exactly
the same way as with the other plotting commands:

> plot(tree$STBM,tree$LFBM,
main="Relationship Between Stem and Leaf Biomass",
xlab="Stem Biomass",
ylab="Leaf Biomass")


Normal QQ Plots 
The final type of plot that we look at is the normal quantile plot. This plot is used to
determine whether your data are close to being normally distributed. You cannot be
sure that the data are normally distributed, but you can rule out normality when the plot
deviates clearly from a straight line.

Implementation in R 

Here we provide examples using the w1 data frame mentioned above; the one column
of data is w1$vals.

The command to generate a normal quantile plot is qqnorm. You can give it one
argument, the univariate data set of interest:

> qqnorm(w1$vals)

You can annotate the plot in exactly the same way as all of the other plotting commands
given here:

> qqnorm(w1$vals,
main="Normal Q-Q Plot of the Leaf Biomass",
xlab="Theoretical Quantiles of the Leaf Biomass",
ylab="Sample Quantiles of the Leaf Biomass")
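
A straight reference line makes it easier to judge how close the points lie to a normal
distribution; base R provides qqline for this:

> qqline(w1$vals)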


 
Gaussian Distribution

A general standard distribution of data which is often taken as a reference in many
applications.

The normal distribution, also known as the Gaussian distribution, is the probability
distribution that plots its values in a symmetrical fashion, with most of the results
situated around the mean. Values are equally likely to plot either above or below the
mean. Grouping takes place at values close to the mean and the curve then tails off
symmetrically away from the mean.


Standard Deviation

It depicts how much the data deviate from the central measures (the mean).

PROPERTIES
❏ Mean = Median = Mode
❏ Symmetry about the center
❏ 50% of the data is greater than the mean, 50% of the data is less than the mean
❏ 68% of the data is within 1 standard deviation of the mean (likely)
❏ 95% of the data is within 2 standard deviations of the mean (very likely)
❏ 99.7% of the data is within 3 standard deviations of the mean (almost certainly)
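
These percentages can be verified in R with the standard normal CDF:

> pnorm(1) - pnorm(-1)   # ~0.683, within 1 standard deviation
> pnorm(2) - pnorm(-2)   # ~0.954, within 2 standard deviations
> pnorm(3) - pnorm(-3)   # ~0.997, within 3 standard deviations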

Skewness

Used to depict the asymmetry of the data relative to the symmetric normal distribution.


 
 
Kurtosis
Measures how pointed the peak is relative to that of the normal distribution.

SUGGESTED VIDEO
http://www.investopedia.com/terms/n/normaldistribution.asp

