Business Club: Basic Statistics
October 2017
INDEX
Conditional Probability
Bayes Theorem
Histograms
Box Plots
Strip Charts
Scatter Plots
Normal QQ Plots
Normal Distribution
Variable Identification:
For any dataset, perhaps the most important step is identifying the Predictor (input) and Target (output) variables. The next step is identifying the data type and category of each variable.
Example: Suppose we want to predict whether students will play cricket or not (refer to the data set below). Here you need to identify the predictor variables, the target variable, the data type of each variable, and the category of each variable.
Univariate Analysis:
Univariate analysis is the simplest form of analyzing data. It deals with only one variable at a time and is mostly used to find patterns in the data.
Here too, the approaches for analysing categorical and continuous variables are different:
Continuous Variables:
Here, we need to understand the central tendency and spread of the variable. These are measured using various statistical metrics and visualization methods, as shown below:
Most terms in the table are self-explanatory, except for a few which we discuss here:
The IQR (interquartile range), as the name suggests, is the measure of dispersion between the upper and lower quartiles:
IQR = Q3 - Q1
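To make this concrete, here is a minimal sketch in R (the language used for the plotting examples later in this document); the vector x is a hypothetical sample, and only base R functions are used for the central tendency and spread metrics mentioned above:
> x <- c(12, 15, 9, 20, 15, 18, 22, 11, 14, 30)  # hypothetical sample values
> mean(x)      # central tendency: mean
> median(x)    # central tendency: median
> sd(x)        # spread: standard deviation
> quantile(x)  # minimum, Q1, median, Q3, maximum
> IQR(x)       # interquartile range, i.e. Q3 - Q1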
Kurtosis: Like skewness, kurtosis is a descriptor of the shape of a probability distribution and, just as for skewness, there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample drawn from a population.
Categorical Variables:
For categorical variables, we can use a frequency table to understand the distribution of each category. We can also read it as the percentage of values falling under each category. It can be measured using two metrics, Count and Count%, against each category. A bar chart can be used as the visualization.
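As a rough illustration in R (the play variable below is a hypothetical categorical vector, not taken from the data set above):
> play <- c("Yes", "No", "Yes", "Yes", "No")  # hypothetical categorical variable
> counts <- table(play)       # Count of each category
> counts
> prop.table(counts) * 100    # Count% of each category
> barplot(counts)             # bar chart of the frequencies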
Any method used to analyse data with more than one variable is called multivariate analysis.
Within multivariate analysis, we'll discuss bi-variate analysis in greater detail, as it is the most frequently used:
Bi-Variate Analysis:
Bi-variate analysis finds out the relationship between two variables. Here, we look for association and dissociation between the variables at a pre-defined significance level. We can perform bi-variate analysis for any combination of categorical and continuous variables: Categorical & Categorical, Categorical & Continuous, and Continuous & Continuous. Different methods are used to tackle these combinations during the analysis process.
Correlation:
It is a statistical measure that indicates the extent to which two or more variables
fluctuate together. A positive correlation indicates the extent to which those variables
increase or decrease in parallel; a negative correlation indicates the extent to which one
variable increases as the other decreases.
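In R, the correlation between two numeric variables can be computed with cor(); the vectors below are made-up illustrative values:
> x <- c(1, 2, 3, 4, 5)
> y <- c(2, 4, 5, 4, 6)
> cor(x, y)                       # Pearson correlation coefficient
> cor(x, y, method = "spearman")  # rank-based alternative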
● Stacked Column Chart: This method is more of a visual form of the two-way table.
The chi-square statistic is computed as
χ² = Σ (O - E)² / E
where O represents the observed frequency and E is the expected frequency under the null hypothesis, computed by:
E = (row total * column total) / grand total
Different data science languages and tools have specific methods to perform the chi-square test. In SAS, we can use CHISQ as an option with PROC FREQ to perform this test.
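In R, a minimal sketch looks like the following (the counts in the two-way table below are hypothetical):
> obs <- matrix(c(20, 30, 25, 25), nrow = 2)  # hypothetical two-way table of observed frequencies O
> chisq.test(obs)                             # chi-square test of independence
> chisq.test(obs)$expected                    # expected frequencies E under the null hypothesis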
1. Deletion
Here, we delete the observations where any of the variables is missing. Although this method is easy to implement, it reduces the sample size.
Deletion methods should be used only when the missing data are "missing completely at random"; otherwise, the non-random missing values can bias the model output.
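A minimal sketch of listwise deletion in R, on a hypothetical data frame:
> df <- data.frame(age = c(23, NA, 31, 40), income = c(50, 60, NA, 80))  # hypothetical data with NAs
> na.omit(df)               # drop every row that has any missing value
> df[complete.cases(df), ]  # equivalent, using complete.cases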
3. Prediction Model:
A prediction model is one of the more sophisticated methods for handling missing data. Here, we build a predictive model to estimate the values that will substitute for the missing data.
In this case the data is divided into two sets: one with no missing values for the variable and another with missing values. The first set is treated as a training data set and used to predict the missing values of the other set (see the sketch after the list below). The major drawbacks of this method are:
1. The model-estimated values are usually better behaved than the true values.
2. If there is no relationship between the other attributes in the data set and the attribute with missing values, the model will not be precise in estimating the missing values.
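A rough sketch of this idea in R, assuming a hypothetical data frame in which income has missing values and age is the only predictor:
> df <- data.frame(age = c(23, 35, 31, 40, 28),
                   income = c(50, 62, NA, 80, NA))               # hypothetical data
> train  <- df[!is.na(df$income), ]                              # set with no missing values
> unseen <- df[is.na(df$income), ]                               # set with missing values
> fit <- lm(income ~ age, data = train)                          # model trained on the complete set
> df$income[is.na(df$income)] <- predict(fit, newdata = unseen)  # substitute the estimated values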
OUTLIERS:
An outlier is an observation that appears far away from, and diverges from, the overall pattern in a sample.
Outliers can drastically change the results of data analysis and statistical modeling. There are numerous unfavourable impacts of outliers on a data set:
● They increase the error variance and reduce the power of statistical tests.
● They can bias or influence estimates that may be of substantive interest.
● They can also violate the basic assumptions of regression and other statistical models.
To understand the impact more deeply, let's take an example and check what happens to a data set with and without outliers.
The table above probably emphasizes enough how important it is to handle outliers. Now we move on to detecting and treating them.
Visualization through box plots and histograms: Outliers are easily detected through visualisations such as histograms, box plots and other plots. Some common thumb rules for flagging outliers are (see the R sketch after this list):
1. Any value beyond the range Q1 - 1.5 x IQR to Q3 + 1.5 x IQR is considered an outlier.
2. Any value outside the range of the 5th and 95th percentiles can be considered an outlier.
3. Data points three or more standard deviations away from the mean are considered outliers.
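A sketch of these thumb rules in R, on a hypothetical sample containing one extreme value:
> x <- c(10, 12, 11, 13, 12, 14, 11, 95)                # hypothetical sample
> q <- quantile(x, c(0.25, 0.75))
> x[x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x)]  # rule 1: beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR
> x[x < quantile(x, 0.05) | x > quantile(x, 0.95)]      # rule 2: outside the 5th-95th percentile range
> x[abs(x - mean(x)) >= 3 * sd(x)]                      # rule 3: three or more SDs from the mean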
3. Treating them separately: If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat the outliers and the rest of the data as two different groups, build an individual model for each group, and then combine the outputs.
FEATURE ENGINEERING
Feature engineering is the science of extracting more information from existing data. Broadly, it involves two steps (see the sketch after this list):
● Variable transformation.
● Variable / Feature creation.
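A couple of toy examples of these two steps in R (the data frame and the derived features are purely illustrative):
> df <- data.frame(income = c(20, 55, 130), spend = c(5, 20, 60))  # hypothetical data
> df$log_income  <- log(df$income)        # variable transformation
> df$spend_ratio <- df$spend / df$income  # new feature created from existing variables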
Conditional Probability
Conditional probabilities arise naturally in the investigation of experiments where an
outcome of a trial may affect the outcomes of the subsequent trials.
We try to calculate the probability of the second event (event B) given that the first event (event A) has already happened. If the probability of event B changes when we take event A into consideration, we can safely say that the probability of event B is dependent on the occurrence of event A.
Let’s think of cases where this happens:
● Drawing a second ace from a deck given we got the first ace
● Finding the probability of having a disease given you were tested positive
● Finding the probability of liking Harry Potter given we know the person likes
fiction
And so on….
Here we can define two events:
● Event B is the event whose probability we're trying to calculate.
● Event A is the condition that we know, i.e., the event that has already happened.
Let’s play a simple game of cards for you to understand this. Suppose you draw two
cards from a deck and you win if you get a jack followed by an ace (without
replacement). What is the probability of winning, given we know that you got a jack in
the first turn?
Let event A be getting a jack in the first turn
Let event B be getting an ace in the second turn.
We need to find P(B|A), the probability of getting an ace in the second turn given that the first card was a jack:
P(A) = 4/52
P(B|A) = 4/51 (no replacement; all four aces are still among the remaining 51 cards)
P(A and B) = P(A) * P(B|A) = 4/52 * 4/51 ≈ 0.006
Here we are determining probabilities under a known condition rather than calculating unconditional probabilities: since we know that you got a jack in the first turn, the probability of winning is P(B|A) = 4/51.
Let's take another example.
Suppose you have a jar containing 6 marbles: 3 black and 3 white. What is the probability of getting a black marble in the second draw, given that the first one was black?
Let event A be getting a black marble in the first draw.
Let event B be getting a black marble in the second draw.
P(A) = 3/6 = 1/2
P(B|A) = 2/5 (one black marble is already gone, leaving 2 black among the 5 remaining)
P(A and B) = 1/2 * 2/5 = 1/5
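We can check this answer with a quick simulation in R (a sketch, not part of the original example):
> set.seed(1)                                 # for reproducibility
> jar <- c(rep("black", 3), rep("white", 3))
> draws <- replicate(100000, sample(jar, 2))  # many two-marble draws without replacement
> first_black <- draws[1, ] == "black"
> both_black  <- first_black & draws[2, ] == "black"
> mean(both_black) / mean(first_black)        # estimate of P(B | A), close to 2/5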
Bayes Theorem
Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. If we know the conditional probability, we can use Bayes' rule to find the reverse probability.
How can we do that? Bayes' theorem states:
P(A|B) = P(B|A) * P(A) / P(B)
Let's now try to calculate the probability of having cancer given that the patient tested positive on the first test, i.e. P(cancer | +).
This means that there is a 12% chance that the patient has cancer, given that he tested positive in the first test. This is known as the posterior probability.
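Bayes' rule can be wrapped in a small R function as a sketch; the prior, sensitivity and false positive rate below are illustrative assumptions, not necessarily the numbers behind the 12% figure above:
> posterior <- function(prior, sensitivity, false_pos) {
      sensitivity * prior / (sensitivity * prior + false_pos * (1 - prior))
  }
> posterior(prior = 0.01, sensitivity = 0.90, false_pos = 0.08)  # P(cancer | +) for these assumed inputs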
TYPES OF PLOTS
Histograms
A histogram is a very common plot. It plots the frequencies with which the data fall within certain ranges (bins).
1. (w1) https://docs.google.com/spreadsheets/d/1uYfR4Oi7FlgkjtyVgTl7uPYpDmBhIrtP6O2hGvLFjHM/edit?usp=sharing
2. Trees https://docs.google.com/spreadsheets/d/1WhtwoyDd7Xkkeou_NcQkBd_ygQ5ABdjdEgbF_cilKQM/edit?usp=sharing
Implementation in R
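Assuming the two sheets above have been downloaded as CSV files (the file names w1.csv and trees91.csv are assumptions; trees91.csv is the file referenced in the Scatter Plots section), they can be loaded with read.csv:
> w1   <- read.csv(file = "w1.csv", header = TRUE)        # assumed file name
> tree <- read.csv(file = "trees91.csv", header = TRUE)   # used later for the scatter plots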
> hist(w1$vals)
Box Plots
A box plot provides a graphical view of the median, quartiles, maximum, and minimum
of a data set.
Implementation in R
We first use the w1 data set and look at the boxplot of this data set:
> boxplot(w1$vals)
Again, this is a very plain graph, and the title and labels can be specified in exactly the same
way as in the stripchart and hist commands:
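For example (the title and label text here are illustrative):
> boxplot(w1$vals,
          main = "Boxplot of w1 Values",
          ylab = "vals")    # ylab labels the vertical axis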
Note that the default orientation is to plot the boxplot vertically. Because of this we used the ylab
option to specify the axis label. There are a large number of options for this command. To see
more of the options see the help page:
> help(boxplot)
Strip Charts
A strip chart is the most basic type of plot available. It plots the data in order along a line, with each data point represented as a box.
Implementation in R
Here we provide examples using the w1 data frame mentioned above; the one column of data is w1$vals.
> help(stripchart)
> stripchart(w1$vals)
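The strip chart can be annotated in the same way; a sketch with illustrative labels, where method = "stack" stacks tied values instead of overplotting them:
> stripchart(w1$vals,
             method = "stack",
             main = "Strip Chart of w1 Values",
             xlab = "vals")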
Scatter Plots
A scatter plot provides a graphical view of the relationship between two sets of numbers. Here we provide examples using the tree data frame from the trees91.csv data file mentioned above. In particular, we look at the relationship between the stem biomass ("tree$STBM") and the leaf biomass ("tree$LFBM").
The command to plot each pair of points as an x-coordinate and a y-coordinate is "plot":
> plot(tree$STBM,tree$LFBM)
You should always annotate your graphs. The title and labels can be specified in exactly
the same way as with the other plotting commands:
> plot(tree$STBM,tree$LFBM,
main="Relationship Between Stem and Leaf Biomass",
xlab="Stem Biomass",
ylab="Leaf Biomass")
Normal QQ Plots
The final type of plot that we look at is the normal quantile plot. This plot is used to determine whether your data are close to being normally distributed. You cannot be sure from the plot that the data are normally distributed, but it lets you rule that out when the data are clearly not normal.
Implementation in R
Here we provide examples using the w1 data frame mentioned above; the one column of data is w1$vals.
The command to generate a normal quantile plot is qqnorm. You can give it one
argument, the univariate data set of interest:
> qqnorm(w1$vals)
You can annotate the plot in exactly the same way as all of the other plotting commands
given here:
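For example, using an illustrative title, with qqline() adding a reference line that the points should roughly follow if the data are close to normal:
> qqnorm(w1$vals,
         main = "Normal Q-Q Plot of w1 Values")
> qqline(w1$vals)   # reference line; points close to it suggest approximate normality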
Gaussian Distribution
The normal distribution, also known as the Gaussian distribution, is the probability distribution that plots all of its values in a symmetrical fashion, with most of the results situated around the mean. Values are equally likely to plot above or below the mean. Grouping takes place at values close to the mean and then tails off symmetrically away from the mean.
Standard Deviation
It depicts how the data deviate from the central measure (the mean).
PROPERTIES
❏ Mean = Median = Mode
❏ Symmetry about the center
❏ 50% of the data is greater than the mean, 50% of the data is less than the mean
❏ 68% of the data is within 1 standard deviation of the mean (likely)
❏ 95% of the data is within 2 standard deviations of the mean (very likely)
❏ 99.7% of the data is within 3 standard deviations of the mean (almost certainly)
These percentages can be checked directly in R, as sketched below.
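A quick check using the standard normal CDF:
> pnorm(1) - pnorm(-1)  # P(within 1 SD of the mean), about 0.68
> pnorm(2) - pnorm(-2)  # P(within 2 SD of the mean), about 0.95
> pnorm(3) - pnorm(-3)  # P(within 3 SD of the mean), about 0.997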
Skewness
Skewness depicts the asymmetry of the data relative to a symmetric (normal) distribution.
Kurtosis
Kurtosis measures how pointed the peak is relative to the normal distribution.
SUGGESTED VIDEO
http://www.investopedia.com/terms/n/normaldistribution.asp