
Unit-2

Data Analytics Approaches


Data Analytics
• Data Analytics is the science of analyzing data to convert information into useful knowledge.
• Data analytics is the process of collecting, transforming, and organizing data in order to draw conclusions and make predictions.
• It is the process of analyzing raw data to discover patterns and trends, and to make better decisions.
Supervised and Unsupervised Learning

Figure 1. Working of supervised and unsupervised learning.


Difference between Supervised and Unsupervised Learning

Criteria     | Supervised Learning                                                            | Unsupervised Learning
Goal         | Predict outcomes for new data based on labeled data.                          | Extract insights from large volumes of unclassified data.
Applications | Spam detection, sentiment analysis, weather forecasting, pricing predictions. | Anomaly detection, recommendation engines, customer personas, medical imaging.
Complexity   | Relatively simple, uses tools like R or Python.                               | Computationally complex, requires powerful tools and large datasets.
Drawbacks    | Time-consuming training, requires labeled data and expertise.                 | Results may be inaccurate without human validation.
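To make the distinction concrete, here is a minimal sketch contrasting the two approaches, assuming scikit-learn is available (the toy data and labels below are fabricated for illustration):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    X = np.array([[1, 1], [1, 2], [8, 8], [9, 8]])  # four 2-D data points

    # Supervised: labels are provided, and the model learns to predict them.
    y = np.array([0, 0, 1, 1])                      # known label for each point
    clf = LogisticRegression().fit(X, y)
    print(clf.predict([[9, 9]]))                    # predicts class 1

    # Unsupervised: no labels; the model discovers structure on its own.
    km = KMeans(n_clusters=2, n_init=10).fit(X)
    print(km.labels_)                               # two discovered clusters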
Population and Sample
• A population refers to the entire set of individuals, objects, or data points that you want to study.
• It can be large or small depending on the scope of the research, for example, all students in a school or all people in a country.
• A sample is a subset of the population that is selected for analysis. It is used when studying the entire population is impractical or impossible. Sampling allows for inferences about the population using statistical techniques.
Statistical concepts in data science
1. Descriptive Statistics (Measure of Central Tendency)
Descriptive statistics describe the basic features of data, providing a summary of a given data set that can represent either the entire population or a sample of it.
They are derived from calculations that include:
• Mean: The central value, commonly known as the arithmetic average.
• Mode: The value that appears most often in a data set.
• Median: The middle value of the ordered set, dividing it exactly in half.
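As a quick illustration, all three measures can be computed with Python's built-in statistics module (a minimal sketch; the sample data is made up):

    import statistics

    data = [2, 3, 3, 5, 7, 8, 3]

    print(statistics.mean(data))    # arithmetic average, about 4.43
    print(statistics.median(data))  # middle value of the sorted data -> 3
    print(statistics.mode(data))    # most frequent value -> 3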
2. Variability
Variability includes the following parameters:
• Standard Deviation: A statistic that measures the dispersion of a data set relative to its mean.
• Variance: A statistical measure of the spread between the numbers in a data set; in general terms, the average squared difference from the mean. A large variance indicates that the numbers are far from the mean, a small variance indicates that they are close to it, and zero variance indicates that all the values are identical.
• Range: The difference between the largest and smallest values in a dataset.
• Percentile: A measure that indicates the value below which a given percentage of observations in the dataset falls.
• Quartile: A value that divides the data points into quarters.
• Interquartile Range (IQR): The middle half of the data, i.e., the middle 50% of the dataset.
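These measures can all be computed with NumPy, as in the following minimal sketch (the array values are arbitrary):

    import numpy as np

    data = np.array([4, 8, 15, 16, 23, 42])

    std_dev = data.std()                    # dispersion of the data around the mean
    variance = data.var()                   # average squared deviation from the mean
    value_range = data.max() - data.min()   # largest value minus smallest value
    p90 = np.percentile(data, 90)           # 90% of observations fall below this value
    q1, q3 = np.percentile(data, [25, 75])  # first and third quartiles
    iqr = q3 - q1                           # interquartile range: middle 50% of the data

    print(std_dev, variance, value_range, p90, iqr)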
3. Correlation
Correlation is one of the major statistical techniques used to measure the relationship between two variables. The correlation coefficient indicates the strength of the linear relationship between them.

• A correlation coefficient greater than zero indicates a positive relationship.

• A correlation coefficient less than zero indicates a negative relationship.

• A correlation coefficient of zero indicates no linear relationship between the two variables.
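For example, the Pearson correlation coefficient can be computed with NumPy (a minimal sketch; x and y are made-up measurements with a roughly linear relationship):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # roughly y = 2x

    r = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
    print(r)                     # close to +1: strong positive relationship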
4. Regression
Regression is a method used to determine the relationship between one or more independent variables and a dependent variable. Regression is mainly of two types:
• Linear regression mathematically models an unknown factor as a function of one or more known factors in order to estimate its value (prediction).
• Logistic regression uses mathematics to find the relationship between two data factors, then uses this relationship to predict the value of one factor based on the other. The prediction usually has a finite number of outcomes, like yes or no (classification).
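The following sketch shows both types side by side, assuming scikit-learn is installed (the toy data is fabricated):

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    X = np.array([[1], [2], [3], [4], [5]])       # one known factor

    # Linear regression: predict a continuous value.
    y_cont = np.array([1.9, 4.1, 6.0, 8.2, 9.9])  # roughly y = 2x
    lin = LinearRegression().fit(X, y_cont)
    print(lin.predict([[6]]))                     # close to 12 (prediction)

    # Logistic regression: predict a yes/no outcome.
    y_bin = np.array([0, 0, 0, 1, 1])             # label flips for larger x
    log = LogisticRegression().fit(X, y_bin)
    print(log.predict([[6]]))                     # predicts class 1 (classification)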
5. Bias and Variance
• Bias refers to errors in data or models that incline results in a specific direction rather than leaving them unbiased. In machine learning, bias arises when a model makes wrong assumptions about the data, leading to poor learning (underfitting).

• Variance refers to error due to the model’s sensitivity to small fluctuations in the training data. A model with high variance learns too much from the training data, capturing noise rather than the actual pattern. This leads to overfitting.
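One way to see the trade-off is to fit polynomials of increasing degree to noisy data (a minimal NumPy sketch; the sine-based data-generating function is invented for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 20)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.size)  # noisy target

    for degree in (1, 3, 9):
        coeffs = np.polyfit(x, y, degree)                 # fit polynomial of given degree
        mse = np.mean((y - np.polyval(coeffs, x)) ** 2)   # training error
        print(degree, mse)
    # Degree 1 underfits (high bias); degree 9 chases the noise (high variance),
    # so its low training error would not carry over to new data (overfitting).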
6. Normal Distribution
• The bell curve, or normal distribution, is a symmetrical probability distribution characterized by a specific shape.
• It has two parameters: the mean and the standard deviation.
• The distribution is essential in data science for analyzing scenarios such as measurement errors, test scores, and heights.
• The normal distribution simplifies calculations and is foundational in hypothesis testing, inferential statistics, and parameter estimation.
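To see the two parameters in action, one can draw samples from a normal distribution and recover them (a minimal NumPy sketch; mean 170 and standard deviation 10 are arbitrary "height" values):

    import numpy as np

    rng = np.random.default_rng(42)
    heights = rng.normal(loc=170, scale=10, size=100_000)  # mean = 170, std = 10

    print(heights.mean())  # close to 170
    print(heights.std())   # close to 10
    # Roughly 68% of values fall within one standard deviation of the mean:
    print(np.mean((heights > 160) & (heights < 180)))  # close to 0.68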
Types of Variable
• Qualitative (Categorical) Variables: These describe attributes or characteristics that do not have numerical values. Instead, they represent categories or groups and are typically used to classify data into different classes or groups.
• Two subtypes:
  • Nominal Variables: Qualitative variables that represent different categories or groups without any inherent order.
    • Examples:
      • Gender (Male, Female, Non-binary)
      • Hair Color (Black, Brown, Blonde, Red)
  • Ordinal Variables: Qualitative variables with a meaningful order or ranking among the categories, although the differences between the categories are not measurable.
    • Examples:
      • Education Level (High School, Bachelor’s, Master’s, PhD)
      • Socioeconomic Status (Low, Middle, High)
Types of Variable
• Quantitative (Numerical) Variables: These variables have numerical values that can be counted or measured.
• Two subtypes:
  • Discrete Variables: Quantitative variables that take on a finite or countable number of values. These values are typically whole numbers, with no intermediate values between them.
    • Examples:
      • Number of children (0, 1, 2, 3…)
      • Number of books on a shelf (10, 15, 20…)
  • Continuous Variables: Quantitative variables that can take on infinitely many values within a given range. These values are not restricted to whole numbers and can include fractions or decimals.
    • Examples:
      • Height (5.8 ft, 6.1 ft, 5.75 ft)
      • Weight (150.5 lbs, 175.2 lbs)
Coefficient of Variation (CV)
It is a statistical measure that expresses the amount of variability in a dataset relative to its mean. It is used to compare the relative dispersion between datasets with different units or magnitudes.
Formula:
CV = (σ / μ) × 100

Key Points:
• Expressed as a percentage (%), not in absolute units.
• Useful for comparing variability across datasets with different scales.
• Higher CV → more variability relative to the mean.
• Lower CV → less variability (more consistency).

Applications:
• Risk assessment in stock returns.
• Measuring consistency in production.
• Comparing the variation in patient responses.
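As a quick sketch, the CV lets us compare the consistency of two production lines measured in different units (the helper function and numbers below are invented for illustration):

    import numpy as np

    def coefficient_of_variation(data):
        # CV = (standard deviation / mean) * 100, expressed as a percentage
        data = np.asarray(data, dtype=float)
        return data.std() / data.mean() * 100

    line_a = [100, 102, 98, 101, 99]       # output in units/day
    line_b = [10.4, 9.1, 11.2, 8.8, 10.5]  # output in tons/day

    print(coefficient_of_variation(line_a))  # about 1.4% -> very consistent
    print(coefficient_of_variation(line_b))  # about 9.1% -> more variable, despite smaller numbers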
Skewness
• Skewness is the attribute of a frequency distribution that extends further on one side of the class with the highest frequency than on the other.
• A frequency distribution is said to be skewed if the frequencies decrease with markedly greater rapidity on one side of the central maximum than on the other. This characteristic of a frequency distribution is known as skewness, and measures of asymmetry are usually called measures of skewness.
• Types: 1) Positive skewness. 2) Negative skewness.
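Skewness can be estimated from data with SciPy (a minimal sketch; the exponential sample is chosen because it is right-skewed by construction):

    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(0)
    right_skewed = rng.exponential(scale=2.0, size=10_000)  # long right tail
    symmetric = rng.normal(size=10_000)

    print(skew(right_skewed))  # clearly positive (positive skewness)
    print(skew(symmetric))     # close to zero (no skewness)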
Kurtosis
• Kurtosis is a statistical measure that describes the shape of a distribution’s tails in relation to its overall shape. It measures the degree to which a given distribution is more or less ‘peaked’ relative to the normal distribution, and it helps to identify outliers or extreme observations.

• The interpretation of kurtosis focuses on the excess kurtosis value:
  • Positive Excess Kurtosis (Leptokurtic): Indicates heavy tails in the distribution, suggesting a higher likelihood of outliers. The kurtosis is greater than 3.
  • Negative Excess Kurtosis (Platykurtic): Implies lighter tails, indicating fewer extreme values than a normal distribution. The kurtosis is less than 3.
  • Zero Excess Kurtosis (Mesokurtic): Suggests a distribution similar to the normal distribution in terms of tail weight, indicating a balanced distribution of outliers. The kurtosis of a normal distribution is 3.
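Note that SciPy's kurtosis function reports excess kurtosis by default (normal distribution → 0); passing fisher=False gives the plain kurtosis (normal distribution → 3). A minimal sketch:

    import numpy as np
    from scipy.stats import kurtosis

    rng = np.random.default_rng(1)
    heavy_tailed = rng.standard_t(df=4, size=10_000)  # t-distribution: heavy tails
    normal = rng.normal(size=10_000)

    print(kurtosis(heavy_tailed))          # positive excess kurtosis (leptokurtic)
    print(kurtosis(normal))                # close to 0 (mesokurtic)
    print(kurtosis(normal, fisher=False))  # close to 3: plain kurtosis of a normal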
Hypothesis Testing
• Hypothesis testing is a statistical method used to make decisions or inferences about a population based on sample data. In machine learning (ML), it helps validate assumptions, compare models, and assess statistical significance in experimental results.

• It is a structured method used to determine whether the findings of a study provide evidence to support a specific theory relevant to a larger population.

• It is a type of statistical analysis in which you put your assumptions about a population parameter to the test; it is used to estimate the relationship between two statistical variables.

• The method compares two opposite statements about a population and uses sample data to decide which one is more likely to be correct. To test an assumption, we first take a sample from the population, analyze it, and use the results of the analysis to decide whether the claim is valid.
Hypothesis Testing
• Suppose a company claims that its website gets an average of 50 user visits per day. To verify this, we use hypothesis testing to analyze past website traffic data and determine whether the claim is accurate. This helps us decide whether the observed data supports the company’s claim or whether there is a significant difference.

Defining Hypotheses

• Null hypothesis (H₀): The null hypothesis is the starting assumption in statistics; it says there is no effect or no difference between groups. For the website example above:

  H₀: The mean number of daily visits (μ) = 50.

• Alternative hypothesis (H₁): The alternative hypothesis is the opposite of the null hypothesis; it suggests there is a difference. If the website’s traffic is not equal to 50 visits per day, the alternative hypothesis is:

  H₁: The mean number of daily visits (μ) ≠ 50.
Hypothesis Testing
In hypothesis testing, Type I and Type II errors are the two possible errors that can occur when drawing conclusions about a population based on a sample of data. These errors are associated with the decisions we make regarding the null hypothesis and the alternative hypothesis.

• Type I error: We reject the null hypothesis although it is true. The Type I error rate is denoted by alpha (α).

• Type II error: We accept the null hypothesis although it is false. The Type II error rate is denoted by beta (β).

Decision                  | Null Hypothesis is True       | Null Hypothesis is False
Accept H₀ (H₀ retained)   | Correct Decision              | Type II Error (False Negative)
Reject H₀ (H₁ accepted)   | Type I Error (False Positive) | Correct Decision
Hypothesis Testing
The significance level (α) is the probability of rejecting the null hypothesis (H₀) when it is actually true. It represents the threshold for statistical significance, determining how much uncertainty we are willing to accept in our decision-making process.
Mathematical interpretation:
α = P(Type I Error) = P(Reject H₀ | H₀ is true)
This means that if we set α = 0.05 (5%), we accept a 5% chance of incorrectly rejecting the null hypothesis when it is actually true.

Significance Level (α) | Confidence Level (1 − α) | Usage
0.10 (10%)             | 90%                      | Used when a small effect size is acceptable.
0.05 (5%)              | 95%                      | Most commonly used in hypothesis testing.
0.01 (1%)              | 99%                      | Used in high-stakes experiments (e.g., medical trials, security systems).
0.001 (0.1%)           | 99.9%                    | Used when making critical decisions with minimal error tolerance.
Hypothesis Testing
Step 1: Define the hypotheses: the null hypothesis (H₀) and the alternative hypothesis (H₁).
Step 2: Define the significance level (α).
Step 3: Compute the test statistic. Using a paired t-test, analyze the data to obtain a test statistic:
t = m / (s / √n)
where m = mean of the differences, s = standard deviation of the differences, n = sample size.
Step 4: Find the p-value, using the t-distribution with the appropriate degrees of freedom.
Step 5: Result: The significance level (α) is a fixed threshold set before testing, while the p-value is calculated from the data.
• p-value < α → Reject H₀ (statistically significant result)
• p-value ≥ α → Fail to reject H₀ (not statistically significant)
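Applying these steps to the website-visits example, a one-sample t-test (a close relative of the paired test above) can be run with SciPy; the daily visit counts below are fabricated:

    from scipy import stats

    # Step 1: H0: mean daily visits = 50, H1: mean daily visits != 50
    visits = [46, 55, 51, 48, 52, 47, 49, 53, 50, 44]  # hypothetical sample
    alpha = 0.05                                       # Step 2: significance level

    # Steps 3-4: compute the test statistic and the p-value
    t_stat, p_value = stats.ttest_1samp(visits, popmean=50)
    print(t_stat, p_value)

    # Step 5: compare the p-value with alpha
    if p_value < alpha:
        print("Reject H0: mean visits differ significantly from 50")
    else:
        print("Fail to reject H0: no significant difference from 50")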
Central Limit Theorem
• The Central Limit Theorem explains that the sampling distribution of the sample mean resembles the normal distribution regardless of whether the variables themselves are normally distributed.

• The Central Limit Theorem states that:

When large samples (usually greater than thirty) are taken, the distribution of the sample arithmetic mean approaches the normal distribution, irrespective of whether the random variables were originally normally distributed.

• The Central Limit Theorem is important because it helps us make accurate predictions about a population just by analyzing a sample.

• Central Limit Theorem problems are solved by finding the Z-score of the sample mean, calculated using the formula:

Z = (x̄ − μ) / (σ / √n)

where x̄ is the sample mean, μ is the population mean, σ is the population standard deviation, and n is the sample size.
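A quick simulation makes the theorem concrete: sample means of a decidedly non-normal (exponential) population still pile up in a bell shape (a minimal NumPy sketch):

    import numpy as np

    rng = np.random.default_rng(7)

    # Population: exponential with mean 2.0, i.e., strongly right-skewed
    population = rng.exponential(scale=2.0, size=1_000_000)

    # Draw many samples of size n > 30 and record each sample mean
    n = 50
    sample_means = [rng.choice(population, size=n).mean() for _ in range(2_000)]

    print(np.mean(sample_means))  # close to the population mean (2.0)
    print(np.std(sample_means))   # close to sigma / sqrt(n) = 2.0 / sqrt(50), about 0.28
    # A histogram of sample_means would be approximately bell-shaped (normal).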
