Statistics 2
In [ ]:
"""
Explain the concept of convenience sampling and its limitations.
"""
In [ ]:
"""
What are the limitations of using covariance to measure the relationship
between variables?
"""
In [ ]:
"""
In what scenarios is covariance particularly useful, and how can it be
applied in data analysis?
Answer:
Covariance is useful for understanding the relationship between two variables
and identifying patterns in data. It can be applied in various fields such as
finance, where it helps analyze the relationship between asset returns, or in
genetics, where it assists in studying the co-occurrence of genetic traits.
"""
In [ ]:
"""
What are some techniques for visualizing and interpreting correlation in a
dataset? Provide examples of graphical representations commonly used to
depict correlation relationships.
"""
In [ ]:
"""
What is the difference between long format and wide format data?
"""
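A small pandas sketch of the two layouts (hypothetical sales figures):

```python
import pandas as pd

# Wide format: one row per city, one column per year
wide = pd.DataFrame({
    "city": ["Pune", "Delhi"],
    "2021": [100, 150],
    "2022": [120, 160],
})

# Wide -> long: each (city, year) pair becomes its own row
long = wide.melt(id_vars="city", var_name="year", value_name="sales")
print(long.shape)  # (4, 3)

# Long -> wide: pivot back to one column per year
back = long.pivot(index="city", columns="year", values="sales")
print(back.shape)  # (2, 2)
```

Long format suits grouped aggregation and plotting libraries; wide format suits side-by-side comparison and matrix-style computation.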
In [ ]:
"""
What is a Probability Density Function (PDF), and how does it differ from a
Probability Mass Function (PMF)? Provide an example of a continuous random
variable and its associated PDF.
"""
In [ ]:
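The PDF/PMF contrast can be illustrated with the standard formulas (pure stdlib; a standard normal PDF next to a Binomial(10, 0.5) PMF, numbers illustrative):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # PDF of a continuous variable: a density, not a probability
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def binomial_pmf(k, n, p):
    # PMF of a discrete variable: an actual probability P(X = k)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

print(round(normal_pdf(0), 4))             # 0.3989 — density at the mean
print(round(binomial_pmf(5, 10, 0.5), 4))  # 0.2461 — P(X = 5)
```

Note that a density can exceed 1 (try `normal_pdf(0, sigma=0.1)`); only the area under the PDF is a probability, whereas each PMF value is itself a probability.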
"""
Explain the concept of Cumulative Distribution Function (CDF) and its
significance in probability theory. How is the CDF related to the PDF?
The Cumulative Distribution Function (CDF) gives the probability that a
random variable takes on a value less than or equal to a given point.
It is the integral of the PDF up to that point. The CDF provides a complete
summary of the distribution's properties and is essential for various
statistical calculations.
"""
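For the normal distribution, the integral of the PDF has a closed form via the error function; a minimal sketch:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    # CDF = integral of the PDF up to x; for the normal distribution
    # this is expressible with the error function erf
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

print(normal_cdf(0))                # 0.5 — half the mass lies below the mean
print(round(normal_cdf(1.96), 3))  # 0.975 — basis of the familiar 95% interval
```

Probabilities of intervals follow directly: P(a < X ≤ b) = CDF(b) − CDF(a).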
In [ ]:
"""
Distribution in statistics refers to the way values are spread out or
arranged within a dataset. It provides information about the frequency or
probability of different outcomes or events occurring. Understanding the
distribution of data is essential for making inferences, modeling, and
analyzing statistical properties.
Types:
A common example is the normal distribution, whose parameters are the mean
(μ) and the standard deviation (σ).
The mean represents the central tendency of the distribution, indicating the
average value around which the data are centered. The standard deviation
represents the dispersion or spread of the data points around the mean.
A larger standard deviation indicates greater variability, while a smaller
standard deviation indicates less variability.
"""
Topic:
Normal distribution, z-score, standardisation and normalisation, CLT,
estimation, hypothesis testing (basic)
In [ ]:
"""
Discuss the central limit theorem and its significance in relation to the
normal distribution.
"""
In [ ]:
"""
Explain the concept of z-scores in the context of the normal distribution.
How are z-scores used to interpret and compare data points?
"""
In [ ]:
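A z-score sketch with hypothetical exam scores (z = (x − mean) / standard deviation):

```python
from statistics import mean, stdev

scores = [65, 70, 75, 80, 90]  # hypothetical exam scores
mu, sigma = mean(scores), stdev(scores)

# z measures how many standard deviations each value lies from the mean,
# putting values from different scales on a common footing
z_scores = [(x - mu) / sigma for x in scores]
print(round(z_scores[-1], 2))  # 1.46 — the score of 90 is well above average
```

Because z-scores are unit-free, they let you compare, say, a score of 90 on this test with a score from a differently scaled test.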
"""
How can you standardize data using the normal distribution? Why is
standardization useful in statistical analysis?
Problem:
The company wants to determine if the average lifespan of the light bulbs produced
is close to the target lifespan of 1,000 hours.
Sampling:
Since testing every light bulb produced is impractical, you take random samples of,
say, 50 light bulbs at a time from the production line.
You repeat this process multiple times, calculating the average lifespan of each sample.
According to the Central Limit Theorem, even if the original distribution of individual
light bulb lifespans is not normal, the distribution of the sample means
(as you collect more samples) will approximate a normal distribution.
As the sample size (n = 50) is large enough, the sample means will tend to cluster
around the true population mean.
You calculate the mean and standard deviation of the sample means.
With the sample means normally distributed, you can now use this distribution to create
confidence intervals, perform hypothesis testing, or conduct other analyses to make
inferences about the overall population mean.
Decision-Making:
Using the sample data, you can determine whether the true average lifespan of the
light bulbs likely meets the 1,000-hour target.
For example, you might perform a hypothesis test to see if the population mean is
significantly different from 1,000 hours or construct a confidence interval to
estimate the population mean lifespan.
Outcome:
By leveraging the Central Limit Theorem, you can make informed decisions about the
production quality without needing to test every single light bulb. The CLT allows
you to draw conclusions about the population mean based on the sample means, even
when the underlying population distribution is unknown or non-normal.
"""
In [ ]:
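The light-bulb scenario can be simulated; a hedged sketch in which exponential lifespans with mean 1,000 hours stand in for the unknown non-normal population:

```python
import random
from statistics import mean, stdev

random.seed(42)  # deterministic for reproducibility

# One production-line sample: the mean lifespan of n = 50 bulbs,
# drawn from a deliberately skewed (exponential) distribution
def sample_mean(n=50):
    return mean(random.expovariate(1 / 1000) for _ in range(n))

# Repeat the sampling many times, as described above
sample_means = [sample_mean() for _ in range(2000)]

# CLT in action: the sample means cluster near the true mean of 1000 hours,
# with spread close to sigma / sqrt(n) = 1000 / sqrt(50) ≈ 141
print(abs(mean(sample_means) - 1000) < 25)
print(abs(stdev(sample_means) - 141) < 25)
```

A histogram of `sample_means` would look approximately bell-shaped even though the individual lifespans are strongly right-skewed.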
"""
Can you explain the limitations of the Central Limit Theorem? How would you
address these limitations in practical data analysis?
Central Limit Theorem (CLT):
The CLT states that the distribution of the sample means of a sufficiently
large number of independent and identically distributed random variables
approaches a normal distribution, regardless of the original distribution
of the variables themselves.
Purpose:
The primary purpose of the CLT is to provide a theoretical foundation for
statistical inference. It allows statisticians to make inferences about
population parameters based on sample statistics, even when the population
distribution is unknown or non-normal. Additionally, it enables the use of
parametric statistical methods in situations where the data may not strictly
adhere to normality.
Limitations:
The CLT assumes that the random variables are independent and identically
distributed, which may not always hold true in practice. It requires a
sufficiently large sample size for the sample means to approximate a normal
distribution accurately. For small sample sizes or skewed distributions,
the approximation may be poor. The CLT applies asymptotically, meaning it
becomes increasingly accurate as the sample size grows indefinitely, but
there's no fixed threshold for a "sufficiently large" sample size.
Real-world Applications:
"""
In [ ]:
"""
What does the empirical rule (68-95-99.7 rule) state? How is it useful in
understanding data distributions?
"""
In [ ]:
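The rule can be checked empirically on simulated normal data (seeded for reproducibility):

```python
import random

random.seed(0)
data = [random.gauss(0, 1) for _ in range(100_000)]

# Empirical rule: roughly 68%, 95%, and 99.7% of normal data fall within
# 1, 2, and 3 standard deviations of the mean
fractions = {}
for k in (1, 2, 3):
    fractions[k] = sum(abs(x) <= k for x in data) / len(data)
    print(k, round(fractions[k], 3))
```

The printed fractions land close to 0.683, 0.954, and 0.997; repeating this with skewed data (e.g. `random.expovariate`) shows how far the rule can drift on non-normal distributions.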
"""
Discuss scenarios where the empirical rule may not hold true. How would
you adapt your analysis in such cases?
Non-Normal Distributions:
The empirical rule is specifically applicable to normal distributions.
If the data follows a non-normal distribution, such as skewed or multimodal
distributions, the rule may not accurately represent the spread of data.
"""
In [ ]:
"""
Compare and contrast different methods of normalization and standardization.
When would you choose one method over another?
Normalization:
Rescales values to a fixed range (typically [0, 1]); sensitive to outliers,
since the minimum and maximum define the scale.
Standardization:
Maintains the shape of the original distribution while centering the data
around 0 and scaling to unit variance. Less affected by outliers than
normalization. Suitable for algorithms assuming normally distributed features,
like linear regression and logistic regression.
Choosing a Method:
Normalization:
Choose when the algorithm or model requires features to be within a specific
range, or when you want to preserve the original data distribution.
Suitable for scenarios where the range of values is known and meaningful.
Standardization:
Opt for standardization when the algorithm assumes normally distributed data
or when robustness against outliers is essential. Useful in situations where
the mean and standard deviation have statistical significance.
"""
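Both transforms in a minimal stdlib sketch (hypothetical data with an outlier to make the contrast visible):

```python
from statistics import mean, stdev

data = [10, 20, 30, 40, 100]  # note the outlier at 100

# Min-max normalization: rescale to [0, 1]; the outlier defines the scale
lo, hi = min(data), max(data)
normalized = [(x - lo) / (hi - lo) for x in data]
print(normalized[0], normalized[-1])  # 0.0 1.0

# Z-score standardization: mean 0, unit variance; shape is preserved
mu, sigma = mean(data), stdev(data)
standardized = [(x - mu) / sigma for x in data]
print(round(mean(standardized), 10))  # 0.0
```

With the outlier present, the four non-outlier values are squashed into the lower third of the [0, 1] range, illustrating why standardization is often the safer default.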
In [ ]:
"""
Discuss the importance of normalization and standardization in machine
learning algorithms. How do these techniques affect model performance?
Improving Convergence:
Features on a common scale help gradient-based optimizers converge faster
and more reliably.
Robustness to Outliers:
Scaling features can make the model more robust to outliers and noisy data.
Outliers can disproportionately influence the model's behavior,
but normalization and standardization reduce their impact by ensuring that
extreme values are not overly dominant.
"""
In [ ]:
"""
What are the limitations of the empirical rule (68-95-99.7) in the context
of data analysis?
The rule only works well for data that follows a bell-shaped curve, i.e. a
normal distribution. It might not give accurate estimates if your data is not
approximately normal. Outliers, or extreme values, can distort the estimates.
The rule assumes that your data points are independent and identically
distributed, which might not always be true. It's not very precise, especially
for small or non-standard datasets. It doesn't provide detailed information
about specific percentiles or ranges, just general guidelines.
"""
In [ ]:
"""
Estimation, Hypothesis Testing, Significance Values, P-values:
Easy:
Point Estimation:
Uses sample data to compute a single value (e.g., the sample mean) as the
best estimate of a population parameter.
Interval Estimation:
Uses sample data to compute a range of values (e.g., a confidence interval)
that is likely to contain the population parameter.
"""
In [ ]:
"""
Define null hypothesis and alternative hypothesis. How are they used in
hypothesis testing?
Null Hypothesis:
Represents the default assumption that there is no significant difference or
effect in the population. Denoted H₀.
Tested against the alternative hypothesis.
Typically the hypothesis being challenged or tested.
Alternative Hypothesis:
Represents the presence of a significant difference or effect in the
population. Denoted H₁ (or Hₐ); supported only when the data provide
sufficient evidence against H₀.
"""
In [ ]:
"""
What is hypothesis testing, and why is it important in data science?
"""
In [ ]:
"""
Explain the difference between the null hypothesis and the alternative
hypothesis.
What is a Type I error, and how does it relate to the significance level?
A Type I error occurs when we reject the null hypothesis when it is actually
true. The significance level (α) represents the probability of making a
Type I error.
"""
In [ ]:
"""
Can you describe a Type II error and its implications?
A Type II error happens when we fail to reject the null hypothesis when it
is actually false. It means we miss detecting a true effect or difference in
the population.
"""
In [ ]:
"""
What is a p-value, and how is it used in hypothesis testing?
The p-value is the probability of observing data at least as extreme as the
observed result, assuming the null hypothesis is true. It quantifies the
strength of evidence against the null hypothesis: the smaller the p-value,
the stronger the evidence.
"""
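A sketch of computing a two-sided p-value for a z-statistic, using the normal CDF via the error function (the z value of 2.0 is illustrative):

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Two-sided p-value: probability of a result at least this extreme
# in either direction, assuming H0 is true
z = 2.0
p_value = 2 * (1 - normal_cdf(abs(z)))
print(round(p_value, 4))  # 0.0455 — below 0.05, so reject H0 at alpha = 0.05
```

If z were 1.5 instead, the p-value would exceed 0.05 and we would fail to reject H0.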
In [ ]:
"""
How do you interpret a p-value in the context of hypothesis testing?
"""
In [ ]:
"""
Describe the significance level (alpha) and its role in hypothesis testing.
The significance level (α) sets the threshold for rejecting the null
hypothesis. It represents the maximum acceptable probability of making a
Type I error.
"""
In [ ]:
"""
What is the difference between a one-tailed and a two-tailed hypothesis test?
"""
In [ ]:
"""
How do you choose the appropriate significance level for a hypothesis test?
The significance level is chosen based on the desired balance between Type I
and Type II errors, along with domain-specific considerations.
"""
In [ ]:
"""
What is a critical value, and how is it used in hypothesis testing?
A critical value is the threshold value used to determine the rejection
region for a hypothesis test. It is compared against the test statistic to
decide whether to reject the null hypothesis.
"""
In [ ]:
"""
Can you explain the concept of power in hypothesis testing?
"""
In [ ]:
"""
What factors influence the power of a hypothesis test?
The effect size (magnitude of the difference), sample size, significance
level, and variability of the data all affect the power of a hypothesis test.
A larger effect size, sample size, and significance level increase power,
while higher variability decreases power.
"""
In [ ]:
"""
Explain the steps involved in conducting a hypothesis test.
"""
In [ ]:
"""
What is the difference between parametric and non-parametric hypothesis tests?
"""
In [ ]:
"""
Describe the assumptions underlying parametric hypothesis tests.
"""
In [ ]:
"""
When would you use a t-test instead of a z-test?
"""
In [ ]:
"""
What is ANOVA, and how is it used in hypothesis testing?
ANOVA (Analysis of Variance) is a statistical technique used to compare
means across multiple groups. It tests whether there are significant
differences among the group means, accounting for variability within and
between groups.
"""
In [ ]:
"""
How do you perform a hypothesis test for proportions?
"""
In [ ]:
"""
What is the purpose of a z-test, and when would you use it in data analysis?
"""
In [ ]:
"""
In what situations would you use a one-tailed z-test versus a two-tailed
z-test?
A one-tailed z-test is used when the hypothesis specifies the direction of
the difference (e.g., greater than or less than).
A two-tailed z-test is used when the hypothesis does not specify the
direction of the difference.
"""
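The one- vs two-tailed distinction in code, using `statistics.NormalDist` (Python 3.8+; the observed z of 1.8 is hypothetical):

```python
from statistics import NormalDist

z = 1.8  # hypothetical observed z-statistic
std_normal = NormalDist()

p_one_tailed = 1 - std_normal.cdf(z)             # H1: mean is greater
p_two_tailed = 2 * (1 - std_normal.cdf(abs(z)))  # H1: mean differs

print(round(p_one_tailed, 4))  # 0.0359 — significant at alpha = 0.05
print(round(p_two_tailed, 4))  # 0.0719 — not significant at alpha = 0.05
```

The same data can thus be significant one-tailed but not two-tailed, which is why the direction of the hypothesis must be fixed before looking at the data.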
In [ ]:
"""
Describe the assumptions underlying the t-test. When is it appropriate to
use a t-test instead of a z-test?
"""
In [ ]:
"""
How do you interpret the results of a t-test in terms of statistical
significance?
If the calculated t-statistic falls within the rejection region (determined
by the significance level), it suggests that the sample means are
significantly different from each other or from a known population mean.
"""
In [ ]:
"""
When would you use ANOVA (Analysis of Variance) in data analysis, and what
insights does it provide?
"""
In [ ]:
"""
Can you explain the relationship between ANOVA and t-tests?
"""
In [ ]:
"""
What are the assumptions underlying the use of ANOVA, and how can violations
of these assumptions affect the results?
Assumptions include normality of data, homogeneity of variances, and
independence of observations. Violations of these assumptions can lead to
inflated Type I error rates or decreased power, affecting the validity of
the ANOVA results.
"""
In [ ]:
"""
1. What is a z-test and when is it used?
A z-test is a statistical test used to determine whether the means of two
populations are significantly different when the population standard
deviations are known and the sample sizes are large. It is based on the
standard normal distribution, which has a mean of 0 and a standard deviation
of 1.
"""
In [ ]:
"""
4. What is a t-test and when is it used?
7. Compare Test Statistic and Critical Value: Compare the test statistic
with the critical value. If the test statistic falls within the critical
region, reject the null hypothesis; otherwise, fail to reject the null
hypothesis.
8. Draw Conclusion: Based on the comparison, draw a conclusion regarding
the statistical significance of the observed difference between sample means.
"""
In [ ]:
"""
How is the t-statistic calculated?
Here are the formulas for calculating the t-statistic for different t-tests:
1. One-Sample t-test:
The one-sample t-test compares the mean of a single sample to a known or
hypothesized value.
Formula:
t = (x̄ - μ) / (s / √n)
Where:
- t is the t-statistic
- x̄ is the sample mean
- μ is the hypothesized population mean
- s is the sample standard deviation
- n is the sample size
"""
In [ ]:
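The one-sample t formula t = (x̄ − μ) / (s / √n) computed directly on a hypothetical sample of bulb lifespans:

```python
import math
from statistics import mean, stdev

# Hypothetical lifespans (hours); H0: population mean = 1000
sample = [998, 1012, 987, 1005, 995, 1010, 1001, 994]
mu0 = 1000

x_bar = mean(sample)       # sample mean
s = stdev(sample)          # sample standard deviation (n - 1 denominator)
n = len(sample)

t = (x_bar - mu0) / (s / math.sqrt(n))
print(round(t, 3))  # a t near 0: the sample mean is close to 1000 hours
```

In practice the resulting t would be compared to a critical value from the t-distribution with n − 1 degrees of freedom; here the small |t| would not lead to rejecting H0.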
"""
What is A/B testing and why is it important?
"""
In [ ]:
"""
What is the chi-square test and when is it used?
"""
In [ ]:
"""
ANOVA (Analysis of Variance):
F-test:
The F-test, on the other hand, is a statistical test used to compare the
variances of two or more groups or populations. In the context of ANOVA, the
F-test is used to test the overall significance of the model by comparing the
variance explained by the group means (between-group variance) to the residual
variance (within-group variance). Specifically, the F-test in ANOVA compares
the ratio of the mean square between groups to the mean square within groups.
If the F-statistic is large and the associated p-value is small (typically
less than a chosen significance level, often 0.05), it indicates that there
are significant differences among the group means, and the null hypothesis
of equal means across groups is rejected.
"""
In [ ]: