22CSC202- Learning Materials
Foundation of Data Science/22CSC202 Unit IV
School of Physical Sciences
Department of Mathematics
Course Materials
Course Name : Foundation of Data Science
Course Code : 22CSC202
Programme Name : Int. [Link]. Data Science
Year : II
Semester : III
Course Coordinator : Dr. P. Sriramakrishnan
Syllabus
Unit I
Introduction, Causality and Experiments, Data Preprocessing: Data cleaning, Data reduction, Data
transformation, Data discretization. Visualization and Graphing: Visualizing Categorical
Distributions, Visualizing Numerical Distributions, Overlaid Graphs, plots, and summary statistics of
exploratory data analysis
Unit-II
Randomness, Probability, Sampling, Sample Means and Sample Sizes
Unit-III
Introduction to Statistics, Descriptive statistics – Central tendency, dispersion, variance, covariance,
kurtosis, five point summary, Distributions, Bayes Theorem;
Unit-IV
Statistical Inference; Hypothesis Testing, P-Values, Error Probabilities, Assessing Models, Decisions
and Uncertainty, Comparing Samples, A/B Testing, Causality.
Unit-V
Estimation, Prediction, Confidence Intervals, Inference for Regression, Classification, Graphical
Models, Updating Predictions.
Text Books:
1. Ani Adhikari and John DeNero, “Computational and Inferential Thinking: The Foundations
of Data Science”, e-book.
Reference Books:
1. Data Mining for Business Analytics: Concepts, Techniques and Applications in R, by Galit
Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, Kenneth C. Lichtendahl Jr., Wiley
India, 2018.
2. Rachel Schutt & Cathy O’Neil, “Doing Data Science” O’ Reilly, First Edition, 2013.
Unit IV
1. Inferential Statistics
Inferential statistics is a branch of statistics that makes use of various analytical
tools to draw inferences about the population from sample data.
Inferential statistics help to draw conclusions about the population while descriptive
statistics summarizes the features of the data set.
There are two main types of inferential statistics - hypothesis testing and regression
analysis.
The samples chosen in inferential statistics need to be representative of the entire
population.
Inferential Statistics Definition
Inferential statistics can be defined as a field of statistics that uses analytical tools for
drawing conclusions about a population by examining random samples. The goal of inferential
statistics is to make generalizations about a population. In inferential statistics, a statistic is
taken from the sample data (e.g., the sample mean) and is used to make inferences about the
population parameter (e.g., the population mean).
What is Inferential Statistics?
Inferential statistics helps to develop a good understanding of the population data by
analyzing the samples obtained from it. It helps in making generalizations about the
population by using various analytical tests and tools. In order to pick out random samples
that represent the population accurately, many sampling techniques are used. Some of the
important methods are simple random sampling, stratified sampling, cluster sampling, and
systematic sampling, which were already discussed in Unit II.
Inferential statistics makes inferences and predictions about a population based on a sample
of data taken from that population. It generalizes from the sample and applies probabilities to
draw a conclusion. It is used to analyze and interpret results and to draw conclusions that go
beyond what descriptive statistics alone can provide.
Inferential statistics is closely associated with hypothesis testing, whose main target is to
decide whether the null hypothesis can be rejected. Hypothesis testing is a type of inferential
procedure that uses sample data to evaluate and assess the credibility of a hypothesis about a
population. Inferential statistics is also used to determine how strong a relationship is within
the sample, although it is often difficult to obtain a full population list and draw a truly
random sample.
Process
Inferential statistics is carried out through the steps given below:
1. Start with a theory.
2. Generate a research hypothesis.
3. Operationalize the variables.
4. Identify the population to which the results of the study can be applied.
5. Form a null hypothesis for this population.
6. Collect a sample from the population and run the study.
7. Perform statistical tests to see whether the characteristics of the sample are sufficiently
different from what would be expected under the null hypothesis, so that the null
hypothesis can be rejected.
Types of Inferential Statistics
Inferential statistics can be classified into hypothesis testing and regression analysis.
Hypothesis testing also includes the use of confidence intervals to test the parameters of a
population. The two branches of statistics are contrasted below.
Inferential Statistics vs Descriptive Statistics
The table given below lists the differences between inferential statistics and descriptive statistics.
Inferential Statistics | Descriptive Statistics
Used to make conclusions about the population by applying analytical tools to the sample data. | Used to quantify the characteristics of the data.
Hypothesis testing and regression analysis are the analytical tools used. | Measures of central tendency and measures of dispersion are the important tools used.
Used to make inferences about an unknown population. | Used to describe the characteristics of a known sample or population.
Examples of measures: t-test, z-test, linear regression, etc. | Examples of measures: variance, range, mean, median, etc.
2. Hypothesis Testing
Hypothesis Testing is a type of statistical analysis in which you put your assumptions
about a population parameter to the test.
It is used to assess the relationship between two statistical variables.
Steps of Hypothesis Testing
Step 1: Specify Your Null and Alternate Hypotheses
Step 2: Set the criteria for a decision
Step 3: Conduct a Statistical Test (Compute z score and p value)
Step 4: Determine Rejection of Your Null Hypothesis
Step 1: Specify Null and Alternate Hypotheses
Null Hypothesis (H0)
A null hypothesis is a type of conjecture in statistics that proposes that there is no
difference between certain characteristics of the population and the sample data.
The alternative hypothesis proposes that there is a difference.
Hypothesis testing provides a method to reject a null hypothesis within a certain
confidence level.
If you can reject the null hypothesis, it provides support for the alternative hypothesis.
The Alternative Hypothesis (H1)
An important point to note is that we are testing the null hypothesis because there is an
element of doubt about its validity. Any information that goes against the stated null
hypothesis is captured in the alternative (alternate) hypothesis (H1).
Example 1:
Null Hypothesis: H0: There is no difference in the salary of factory workers based on
gender.
Alternative Hypothesis: Ha: Male factory workers have a higher salary than female factory
workers.
Null Hypothesis: H0: There is no relationship between height and shoe size.
Alternative Hypothesis: Ha: There is a positive relationship between height and shoe size.
Example 2:
A school principal claims that the students (300) in her school score an average of 7 out of 10 in
exams. The null hypothesis is that the population mean is 7.0. To test this null hypothesis, we
record the marks of, say, 30 students (the sample) out of the entire student population of the
school (say 300) and find that the sample mean is 6.5.
Null Hypothesis (H0) = no difference between population and sample.
Alternate Hypothesis (H1) = difference between population and sample.
Here, the population mean = 7.0 and the sample mean = 6.5. Whether this difference is large
enough to reject the null hypothesis and select the alternate hypothesis is decided by the
statistical test carried out in the following steps.
Step 2: Set the criteria for a decision
For a statistical test to be legitimate, sampling and data collection must be done in a way that is
meant to test your hypothesis. You cannot draw statistical conclusions about the population you
are interested in if your data is not representative.
Example: population mean, sample mean, sample size, standard deviation, any specific
condition, confidence level, level of significance, etc.
Step 3: Conduct a Statistical Test (Compute p value)
Compute P- Value
A p-value is a statistical measure used in hypothesis testing to validate a hypothesis against
sample data.
It measures the probability of obtaining results at least as extreme as those observed in the
sample, assuming that the null hypothesis is true.
The lower the p-value, the greater the statistical significance of the observed difference.
A p-value below 0.05 (5%) is generally considered statistically significant, and the null hypothesis
is rejected.
A p-value above 0.05 (5%) is not statistically significant, and the null hypothesis is retained.
P-value Table
The P-value table shows the hypothesis interpretations:
P-value | Null Hypothesis | Alternate Hypothesis | Decision
P-value > 0.05 | Retain | Not selected | The result is not statistically significant; hence, do not reject the null hypothesis.
P-value < 0.05 | Reject | Selected | The result is statistically significant; generally, reject the null hypothesis in favor of the alternative hypothesis.
P-value < 0.01 | Reject | Selected | The result is highly statistically significant; reject the null hypothesis in favor of the alternative hypothesis.
P-value Formula through Z score
We know that the p-value is a statistical measure that helps to determine whether the hypothesis is
correct or not. A p-value is a number that lies between 0 and 1. The level of significance (α) is a
predefined threshold set by the researcher based on the desired confidence level; it is generally
fixed at 0.05 (5%) when the confidence level is 95%. The p-value is calculated as follows.
First compute the test statistic (Z score):
Z = (X̄ − μ) / (σ / √n)
Here, X̄ is the sample mean, μ the population mean, σ the population standard deviation, and n the sample size.
Then look up the Z score in the Z-score table to find the corresponding p-value, or compute it directly as shown below.
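Instead of looking up a printed Z table, the cumulative probability can also be obtained in Python. A minimal sketch using scipy.stats (the z value here is only illustrative):

from scipy import stats

z = 1.58                              # an illustrative z-score
left_tail = stats.norm.cdf(z)         # P(Z < z), the value a Z table gives (about 0.9429)
right_tail = 1 - stats.norm.cdf(z)    # P(Z >= z), the p-value for a right-tailed test (about 0.0571)
print(left_tail, right_tail)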
Step 4: Determine Rejection of Your Null Hypothesis
The null hypothesis is retained or rejected based on the p-value, as per the p-value table above.
3. Case Studies – Hypothesis Testing
Case Study I: Let’s say we are drawing samples from a population with a mean of 50.
What is the probability that we get a random sample of size 10 and standard deviation 10
with a mean greater than or equal to 55?
Note: confidence level = 95%, level of significance = 5%.
Population mean (μ) = 50, sample mean (X̄) = 55, sample size (n) = 10, standard deviation (σ) = 10
Z = (X̄ − μ) / (σ / √n) = (55 − 50) / (10 / √10) = 1.58
The area of interest under the normal curve is the region corresponding to sample means greater than
55, i.e., z-scores greater than 1.58.
From the Z-score table, Φ(1.58) = 0.9429; that is, for samples of size 10 from this population,
P(X̄ < 55) = 94.29%.
The hypothesis concerns sample means greater than or equal to 55:
P(X̄ ≥ 55) = 1.0000 − 0.9429 = 0.0571
p-value = 0.0571
So the probability of randomly drawing a sample of 10 people from a population with a mean of
50 and standard deviation of 10 whose sample mean is 55 or more is p = 0.0571, or 5.71%.
As per the p-value table, p > 0.05, so we retain the null hypothesis.
Result: The result is not statistically significant, and hence we do not reject the null hypothesis. At
this sample size, the observed difference between the sample mean and the population mean is not
large enough to be significant at the 95% confidence level.
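Case Study I can be reproduced in Python with a few lines; a small sketch (the variable names are chosen only for illustration):

import numpy as np
from scipy import stats

mu, x_bar, sigma, n = 50, 55, 10, 10           # population mean, sample mean, std. deviation, sample size
z = (x_bar - mu) / (sigma / np.sqrt(n))        # about 1.58
p = 1 - stats.norm.cdf(z)                      # right-tailed p-value, about 0.057
print(z, p)
if p < 0.05:
    print("Reject the null hypothesis")
else:
    print("Retain the null hypothesis")        # this branch runs, since p > 0.05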
Case Study II
Now let’s do the same thing, but assume that instead of a sample of 10 people we take a sample of
50 people. First, we find the z-score and the p-value.
Note: confidence level = 95%, level of significance = 5%.
Population mean (μ) = 50, sample mean (X̄) = 55, sample size (n) = 50, standard deviation (σ) = 10
Z = (X̄ − μ) / (σ / √n) = (55 − 50) / (10 / √50) = 3.54
From the Z-score table, Φ(3.54) = 0.9998; that is, for samples of size 50 from this population,
P(X̄ < 55) = 99.98%.
The hypothesis concerns sample means greater than or equal to 55:
P(X̄ ≥ 55) = 1.0000 − 0.9998 = 0.0002
p-value = 0.0002
So the probability of randomly drawing a sample of 50 people from a population with a mean of
50 and standard deviation of 10 whose sample mean is 55 or more is p = 0.0002, or 0.02%.
As per the p-value table, p < 0.05, so we reject the null hypothesis.
Result: The result is highly statistically significant, and hence we reject the null hypothesis. With
the larger sample size, the same observed difference between the sample mean and the population
mean is significant at the 95% confidence level.
Case Study III
Let’s look at this one more way. For the same population (mean 50, standard deviation 10), what
proportion of sample means falls between 47 and 53 when the sample size is 50?
Note: confidence level = 95%, level of significance = 5%.
Population mean (μ) = 50, lower sample mean (X̄₁) = 47, upper sample mean (X̄₂) = 53, sample
size (n) = 50, standard deviation (σ) = 10
Z₁ = (X̄₁ − μ) / (σ / √n) = (47 − 50) / (10 / √50) = −2.12
Z₂ = (X̄₂ − μ) / (σ / √n) = (53 − 50) / (10 / √50) = +2.12
P(Z < −2.12) = Φ(−2.12) = 0.0170
P(Z < +2.12) = Φ(+2.12) = 0.9830
P(47 < X̄ < 53) = Φ(+2.12) − Φ(−2.12) = 0.9830 − 0.0170 = 0.9660
P(47 < X̄ < 53) = 96.6% ≈ 97%
So about 97% of sample means fall between 47 and 53; a sample mean in this range is consistent
with the null hypothesis, which would therefore be retained.
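The proportion between the two z-scores can be checked directly in Python; a minimal sketch:

import numpy as np
from scipy import stats

mu, sigma, n = 50, 10, 50
se = sigma / np.sqrt(n)                        # standard error of the sample mean
z1 = (47 - mu) / se                            # about -2.12
z2 = (53 - mu) / se                            # about +2.12
proportion = stats.norm.cdf(z2) - stats.norm.cdf(z1)
print(proportion)                              # about 0.966, i.e. roughly 97%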
4. Hypothesis Testing using Python
Case I: Example of a right-tailed test
A school claims that its students are more intelligent than average. On calculating the IQ scores of 50
students, the average turns out to be 110. The population mean IQ is 100 and the population standard
deviation is 30. State whether the school’s claim is right or not at a 5% significance level.
Hypothesis Framing
Null Hypothesis H0: no difference between the sample mean and the population mean, μ = 100
Alternate Hypothesis H1: μ > 100 (the school’s students score above the population average)
Parameters: level of significance = 0.05, μ = 100, X̄ = 110, σ = 30 and n = 50
Z = (X̄ − μ) / (σ / √n) = (110 − 100) / (30 / √50) = 2.357
Φ(2.357) ≈ 0.9908. This is a right-tailed test, so the p-value = 1 − Φ(Z) = 1 − 0.9908 = 0.0092.
As per the p-value table, p < 0.05, so we reject the null hypothesis and choose the alternate hypothesis.
Code:
# Import the necessary libraries
import numpy as np
import scipy.stats as stats
# Given input
sample_mean = 110
population_mean = 100
population_std = 30
sample_size = 50
# Compute the z-score
z_score = (sample_mean - population_mean) / (population_std / np.sqrt(sample_size))
print('Z-Score :', z_score)
# P-value for a right-tailed test: probability of getting more than the observed z-score
p_value = 1 - stats.norm.cdf(z_score)
print('p-value :', p_value)
# Hypothesis decision
if p_value < 0.05:
    print("Reject Null Hypothesis go with alternate hypothesis")
else:
    print("Retain Null Hypothesis")
Output:
Z-Score : 2.3570226039551585
p-value : 0.009211062727049501
Reject Null Hypothesis go with alternate hypothesis
Case II: Example of a left-tailed test
Suppose instead that the average IQ of the 50 sampled students turns out to be 80, i.e., below the
population mean of 100 (population standard deviation 30). State, at a 5% significance level, whether
the school’s students score below the population average (left-tailed test).
Z = (X̄ − μ) / (σ / √n)
population_mean = 100
population_std = 30
sample_size = 50
sample_mean = 80
Note: for a left-tailed test, p-value = P(Z ≤ z-score), i.e., stats.norm.cdf(z_score).
Result: Z-Score ≈ -4.71
p-value ≈ 1.2e-06
Reject Null Hypothesis go with alternate hypothesis
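A minimal sketch of the left-tailed version, with the same structure as the Case I code (the only change is that the p-value is taken from the lower tail):

import numpy as np
from scipy import stats

sample_mean, population_mean = 80, 100
population_std, sample_size = 30, 50
z_score = (sample_mean - population_mean) / (population_std / np.sqrt(sample_size))
# Left-tailed test: p-value is the probability of observing a z-score this low or lower
p_value = stats.norm.cdf(z_score)
print('Z-Score :', z_score)                    # about -4.71
print('p-value :', p_value)                    # about 1.2e-06
if p_value < 0.05:
    print("Reject Null Hypothesis go with alternate hypothesis")
else:
    print("Retain Null Hypothesis")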
5. Error Probability
Sometimes an error may occur when retaining or rejecting the null hypothesis. The four possibilities are:
1. The decision to retain the null hypothesis could be correct.
2. The decision to retain the null hypothesis could be incorrect (Type II Error).
3. The decision to reject the null hypothesis could be correct.
4. The decision to reject the null hypothesis could be incorrect (Type I Error).
Final Result → | Retain the Null | Reject the Null
Actual Result: Null is True | Correct result | Type I Error (α)
Actual Result: Alternative is True | Type II Error (β) | Correct result
Type I Error (𝜶)
Actual state: the null hypothesis is true (the correct decision would be to retain it)
Decision obtained from the study: reject the null
Type I error is the probability of rejecting a null hypothesis that is actually true. Researchers directly
control for the probability of committing this type of error.
To minimize this error, we assume a defendant is innocent when beginning a trial. Similarly, to
minimize making a Type I error, we assume the null hypothesis is true when beginning a hypothesis
test. This criterion is usually set at .05 (α = 0.05), and we compare the alpha level to the p-value.
If p < α (i.e., the probability of a Type I error is less than 5%, p < .05), we decide to reject the null
hypothesis; otherwise, we retain the null hypothesis.
Type II Error (𝛽)
Actual state: the null hypothesis is false (the correct decision would be to reject it)
Decision obtained from the study: retain the null
The incorrect decision is to retain a false null hypothesis. This decision is an example of a Type II
error, or 𝛽 error. With each test we make, there is always some probability that the decision could be
a Type II error. In this decision, we decide to retain previous notions of truth that are in fact false.
While it’s an error, we still did nothing; we retained the null hypothesis. We can always go back and
conduct more studies.
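A small simulation can make the meaning of a Type I error concrete: when the null hypothesis is actually true, a test at α = 0.05 should (incorrectly) reject it in roughly 5% of repeated experiments. A sketch using the one-sample t-test from scipy.stats, with illustrative parameters:

import numpy as np
from scipy import stats

np.random.seed(0)
alpha, n_experiments, n = 0.05, 10000, 30
false_rejections = 0
for _ in range(n_experiments):
    # The null hypothesis is TRUE here: the samples really come from a population with mean 100
    sample = np.random.normal(loc=100, scale=15, size=n)
    t_stat, p_val = stats.ttest_1samp(sample, popmean=100)
    if p_val < alpha:                          # a Type I error: rejecting a true null
        false_rejections += 1
print(false_rejections / n_experiments)        # close to 0.05, i.e. about 5% of the time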
6. Hypothesis Testing Models
Hypothesis testing requires creating the null and alternate hypotheses. Inferences are drawn by
considering the critical value, the test statistic, and the confidence interval. A hypothesis test can be
two-tailed, left-tailed, or right-tailed. The hypothesis testing models consist of the following tools:
a) Z-test
Z-test is used when the sample size is greater than or equal to 30 and the data set follows a normal
distribution. The population standard deviation is known to the researcher. The formulas are given as
follows:
Null hypothesis: H0 : μ= x̄
Alternate hypothesis: H1: μ > x̄ or μ < x̄
Z test = (x̄ − μ) / (σ / √n)
where,
x̄ = sample mean, μ = population mean, σ = standard deviation of the population, n = sample size
b) T-test
T-test is used when the sample size is less than 30 and the data set follows a t-distribution. The
population standard deviation is not known to the researcher. The formulas are given as follows:
Null Hypothesis: H0: μ= x̄
Alternate Hypothesis: H1: μ > x̄ or μ < x̄
T test = (x̄ − μ) / (s / √n)
The representations x̄, μ, and n are the same as stated for the z-test. The letter “s” represents the
standard deviation of the sample.
One-Sample T Test: stats.ttest_1samp(a, popmean, alternative='two-sided')
Two-Sample T Test: stats.ttest_ind(a, b, alternative='two-sided')
a – first set of sample observations
b – second set of sample observations
popmean – expected population mean specified in the null hypothesis
alternative{‘two-sided’, ‘less’, ‘greater’}, optional
Defines the alternative hypothesis. The following options are available (default is ‘two-sided’):
‘two-sided’: the mean of the underlying distribution of the sample is different from the given
population mean (popmean)
‘less’: the mean of the underlying distribution of the sample is less than the given population
mean (popmean)
‘greater’: the mean of the underlying distribution of the sample is greater than the given
population mean (popmean)
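For example, a right-tailed one-sample test (alternative hypothesis: the underlying mean is greater than the claimed population mean) could be run as follows; the data values are purely illustrative, and the alternative argument requires a reasonably recent SciPy version:

import numpy as np
from scipy import stats

x = np.array([10.8, 11.2, 10.9, 11.5, 10.7, 11.1])   # illustrative measurements
t_stat, p_val = stats.ttest_1samp(x, popmean=10, alternative='greater')
print(t_stat, p_val)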
Example:
A company claims to produce ball bearings of 10 cm diameter (null hypothesis), and you decide to
audit if that is true. You collect 21 ball bearings from multiple shops around the city and measure
their diameter. You must determine if the company’s claim is false (alternate hypothesis) based on
the measurements.
The test is designed as follows:
Null Hypothesis: H0 : Sample mean (x) = population mean (𝝁)
Alternate Hypothesis: Ha : sample (x) ≠ population mean (𝝁)
To declare the claim false, you need to statistically prove that the measurements from the sample
of bearings are different from the 10 cm claimed by the company. As the sample size is 21 (which is
less than 30), we use a t-test. Here the sample mean is 11 cm, and the standard deviation is 1 cm.
One-sample t-test
import numpy as np
from scipy import stats
# Population Mean
mu = 10
# Sample Size
N1 = 21
# Sample mean and standard deviation
mean, std_dev = 11, 1
# Generate a random sample with mean = 11 and standard deviation = 1
x = np.random.normal(mean, std_dev, N1)
# Using the Stats library, compute t-statistic and p-value
t_stat, p_val = stats.ttest_1samp(a=x, popmean=mu)
print("t-statistic = " + str(t_stat))
print("p-value = " + str(p_val))
if p_val < 0.05:
    print("Reject null hypo")
else:
    print("Retain null hypo")
Output:
t-statistic = 4.689539773390642
p-value = 0.0001407967502139183
Interpretation of the test results
The t-statistic comes out to about 4.69, which is very far from the mean of zero in a t-distribution
(the exact value varies from run to run because the sample is random). To quantify such an extreme
value, we refer to the p-value introduced earlier. The p-value is less than the default significance
level of 0.05, which indicates that the probability of such an extreme outcome under the null
hypothesis is close to zero and that the null hypothesis can be rejected.
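As a cross-check, the t-statistic can be recomputed directly from the formula t = (x̄ − μ) / (s / √n), reusing x, mu and N1 from the code above; for the same random sample it matches scipy’s output:

# Manual t-statistic; np.std with ddof=1 gives the sample standard deviation s
t_manual = (np.mean(x) - mu) / (np.std(x, ddof=1) / np.sqrt(N1))
print("manual t-statistic = " + str(t_manual))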
Two-sample t-test example
Let’s consider that the first factory shares 21 samples of ball bearings where the mean diameter of
the sample comes out to be 10.5 cm. On the other hand, the second factory shares 25 samples with a
mean diameter of 9.5 cm. Both have a standard deviation of 1 cm.
import numpy as np
from scipy import stats
# Sample Sizes
N1, N2 = 21, 25
# Degrees of freedom
dof = min(N1, N2) - 1
# Gaussian distributed data with mean = 10.5 and sd = 1 (first factory)
mean, std_dev = 10.5, 1
x = np.random.normal(mean, std_dev, N1)
# Gaussian distributed data with mean = 9.5 and sd = 1 (second factory)
mean, std_dev = 9.5, 1
y = np.random.normal(mean, std_dev, N2)
## Using the internal function from SciPy Package
t_stat, p_val = stats.ttest_ind(x, y)
print("t-statistic = " + str(t_stat))
print("p-value = " + str(p_val))
if p_val < 0.05:
    print("Reject null hypo")
else:
    print("Retain null hypo")
Output:
t-statistic = 2.310472984971306
p-value = 0.025611000716066114
Reject null hypo
Interpretation of the test results
Referring to the p-value of about 0.026, which is less than the significance level of 0.05, we reject
the null hypothesis, concluding that the bearings from the two factories are not identical.
c) F-test
The F-test checks whether or not a difference exists between the variances of two samples or
populations; it compares variability rather than means. The formulas are given as follows:
Null Hypothesis H0: σ₁² = σ₂²
Alternate Hypothesis H1: σ₁² > σ₂² or σ₁² < σ₂²
F test = σ₁² / σ₂²
where σ₁² is the variance of the first population (or sample) and σ₂² is the variance of the second
population (or sample).
Example 1:
Perform an F Test for the following samples.
1. Sample 1 with variance equal to 109.63 and sample size equal to 41.
2. Sample 2 with variance equal to 65.99 and sample size equal to 21.
Solution:
Step 1:
The hypothesis statements are written as:
H0: No difference in variances.
Ha: Difference in variances.
Step 2:
Calculate the F value. The larger variance is taken as the numerator and the smaller variance as the
denominator: F = 109.63 / 65.99 = 1.66.
Step 3:
The next step is the calculation of degrees of freedom.
The degrees of freedom are calculated as the sample size − 1.
The degrees of freedom for sample 1 are 41 − 1 = 40.
The degrees of freedom for sample 2 are 21 − 1 = 20.
Step 4:
There is no alpha level given in the question, so a standard alpha level of 0.05 is chosen. Because
the test is two-tailed, the alpha level is halved, and hence it becomes 0.025.
Step 5:
Using the F table, the critical F value is determined with alpha at 0.025. The critical value for (40,
20) at alpha equal to 0.025 is 2.287.
Step 6:
Now compare the calculated value with the critical value from the table. Generally, the null
hypothesis is rejected if the calculated value is greater than the table value. In this example, the
calculated F value is 1.66 and the table value is 2.287.
Since 1.66 < 2.287, the null hypothesis cannot be rejected.
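The table lookup in Step 5 can also be done with scipy.stats.f; a minimal sketch for this example:

from scipy import stats

var1, var2 = 109.63, 65.99                     # sample variances (larger one in the numerator)
df1, df2 = 40, 20                              # degrees of freedom: sample size - 1 for each sample
F = var1 / var2                                # calculated F value, about 1.66
alpha = 0.05 / 2                               # two-tailed test, so alpha is halved to 0.025
F_critical = stats.f.ppf(1 - alpha, df1, df2)  # about 2.287
print(F, F_critical)
if F > F_critical:
    print("Reject the null hypothesis")
else:
    print("Retain the null hypothesis")        # this branch runs, since 1.66 < 2.287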
8. Comparing Samples – A/B Testing
A/B testing is a basic randomized controlled experiment. It is a way to compare two versions of
a variable to find out which performs better in a controlled environment.
This is often done in marketing when two different types of content—whether it be email copy,
a display ad, a call-to-action (CTA) on a web page, or any other marketing asset—are being
compared. It is usually done before launching a product in the market so that the company can
get better results.
It also helps in comparing the performance of two or more variants of an email and then
selecting the best among them based on the audience's response.
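As a concrete (hypothetical) illustration, suppose version A of an email gets 200 clicks out of 2,000 sends and version B gets 240 clicks out of 2,000 sends. A two-proportion z-test is one common way to compare the two conversion rates; a minimal sketch, with invented numbers:

import numpy as np
from scipy import stats

clicks_a, sends_a = 200, 2000                  # hypothetical results for version A
clicks_b, sends_b = 240, 2000                  # hypothetical results for version B
p_a, p_b = clicks_a / sends_a, clicks_b / sends_b
# Pooled conversion rate under the null hypothesis that A and B perform the same
p_pool = (clicks_a + clicks_b) / (sends_a + sends_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / sends_a + 1 / sends_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))     # two-tailed p-value (about 0.04 here)
print(z, p_value)
if p_value < 0.05:
    print("The difference between versions A and B is statistically significant")
else:
    print("No significant difference between versions A and B")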
9. Comparing Two Samples – Causality
Causality means that there is a clear cause-and-effect relationship between two variables: a change
in one variable ‘causes’ a change in the other. When two variables have a causal relationship, a
change in the independent variable (the ‘cause’) produces a change in the dependent variable (the
‘effect’).
Correlation Vs Causality
Correlation:
Correlation is a statistical term that depicts the degree of association between two random variables.
In data analysis it is often used to determine the extent to which two variables relate to one another.
There are three types of correlation:
Positive correlation –
As random variable A increases, random variable B increases too (and vice versa).
Negative correlation –
An increase in random variable A leads to a decrease in B (and vice versa).
No correlation –
The two variables are completely unrelated, and a change in one leads to no change in the other.
Correlation and causation can exist at the same time, but correlation by itself does not imply
causation. The example below shows the difference more clearly: a dead battery causes the computer
to shut down and also causes the video player to stop, which is causality (the battery is the cause of
both effects). The moment the computer shuts down, the video player also stops, which shows that
the two events are correlated, more specifically positively correlated.
shuts shows both are correlated. More specifically positively correlated.