
Unit-2

Data Analytics Approaches


Data Analytics
• Data Analytics is the science of analyzing data to convert information into useful knowledge.
• Data analytics is the process of collecting, transforming, and organizing data in order to draw conclusions and make predictions.
• It is the process of analyzing raw data to discover patterns and trends, and to make better decisions.
Supervised and Unsupervised Learning

Figure 1. Working of supervised and unsupervised learning.


Difference between Supervised and Unsupervised Learning

Criteria     | Supervised Learning                                                            | Unsupervised Learning
Goal         | Predict outcomes for new data based on labeled data.                          | Extract insights from large volumes of unclassified data.
Applications | Spam detection, sentiment analysis, weather forecasting, pricing predictions. | Anomaly detection, recommendation engines, customer personas, medical imaging.
Complexity   | Relatively simple, uses tools like R or Python.                               | Computationally complex, requires powerful tools and large datasets.
Drawbacks    | Time-consuming training, requires labeled data and expertise.                 | Results may be inaccurate without human validation.
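To make the distinction concrete, here is a minimal sketch contrasting the two approaches, assuming scikit-learn is available (the toy data and labels below are fabricated for illustration):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    X = np.array([[1, 1], [1, 2], [8, 8], [9, 8]])  # four 2-D data points

    # Supervised: labels are provided, and the model learns to predict them.
    y = np.array([0, 0, 1, 1])                      # known label for each point
    clf = LogisticRegression().fit(X, y)
    print(clf.predict([[9, 9]]))                    # predicts class 1

    # Unsupervised: no labels; the model discovers structure on its own.
    km = KMeans(n_clusters=2, n_init=10).fit(X)
    print(km.labels_)                               # two discovered clusters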
Population and Sample
• A population refers to the entire set of individuals, objects, or data points that you want to study.
• It can be large or small depending on the scope of the research, for example, all students in a school or all people in a country.
• A sample is a subset of the population that is selected for analysis. It is used when studying the entire population is impractical or impossible. Sampling allows for inferences about the population using statistical techniques.
Statistical concepts in data science
1. Descriptive Statistics (Measure of Central Tendency)
Descriptive statistics describe the basic features of data, providing a summary of a given data set that can represent either the entire population or a sample of it.
They are derived from calculations that include:
• Mean: The central value, commonly known as the arithmetic average.
• Mode: The value that appears most often in a data set.
• Median: The middle value of the ordered set, dividing it exactly in half.
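As a quick illustration, all three measures can be computed with Python's built-in statistics module (a minimal sketch; the sample data is made up):

    import statistics

    data = [2, 3, 3, 5, 7, 8, 3]

    print(statistics.mean(data))    # arithmetic average, about 4.43
    print(statistics.median(data))  # middle value of the sorted data -> 3
    print(statistics.mode(data))    # most frequent value -> 3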
2. Variability
Variability includes the following parameters:
• Standard Deviation: A statistic that measures the dispersion of a data set relative to its mean.
• Variance: A statistical measure of the spread between the numbers in a data set; in general terms, the average squared difference from the mean. A large variance indicates that the numbers are far from the mean, a small variance indicates that they are close to it, and zero variance indicates that all the values are identical.
• Range: The difference between the largest and smallest values in a dataset.
• Percentile: A measure that indicates the value below which a given percentage of observations in the dataset falls.
• Quartile: A value that divides the data points into quarters.
• Interquartile Range (IQR): The middle half of the data, i.e., the middle 50% of the dataset.
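These measures can all be computed with NumPy, as in the following minimal sketch (the array values are arbitrary):

    import numpy as np

    data = np.array([4, 8, 15, 16, 23, 42])

    std_dev = data.std()                    # dispersion of the data around the mean
    variance = data.var()                   # average squared deviation from the mean
    value_range = data.max() - data.min()   # largest value minus smallest value
    p90 = np.percentile(data, 90)           # 90% of observations fall below this value
    q1, q3 = np.percentile(data, [25, 75])  # first and third quartiles
    iqr = q3 - q1                           # interquartile range: middle 50% of the data

    print(std_dev, variance, value_range, p90, iqr)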
3. Correlation
Correlation is one of the major statistical techniques used to measure the relationship between two variables. The correlation coefficient indicates the strength of the linear relationship between them.

• A correlation coefficient greater than zero indicates a positive relationship.

• A correlation coefficient less than zero indicates a negative relationship.

• A correlation coefficient of zero indicates no linear relationship between the two variables.
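For example, the Pearson correlation coefficient can be computed with NumPy (a minimal sketch; x and y are made-up measurements with a roughly linear relationship):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # roughly y = 2x

    r = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
    print(r)                     # close to +1: strong positive relationship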
4. Regression
Regression is a method used to determine the relationship between one or more independent variables and a dependent variable. Regression is mainly of two types:
• Linear regression mathematically models an unknown factor as a function of one or more known factors in order to estimate its value (prediction).
• Logistic regression uses mathematics to find the relationship between two data factors, then uses this relationship to predict the value of one factor based on the other. The prediction usually has a finite number of outcomes, like yes or no (classification).
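The following sketch shows both types side by side, assuming scikit-learn is installed (the toy data is fabricated):

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    X = np.array([[1], [2], [3], [4], [5]])       # one known factor

    # Linear regression: predict a continuous value.
    y_cont = np.array([1.9, 4.1, 6.0, 8.2, 9.9])  # roughly y = 2x
    lin = LinearRegression().fit(X, y_cont)
    print(lin.predict([[6]]))                     # close to 12 (prediction)

    # Logistic regression: predict a yes/no outcome.
    y_bin = np.array([0, 0, 0, 1, 1])             # label flips for larger x
    log = LogisticRegression().fit(X, y_bin)
    print(log.predict([[6]]))                     # predicts class 1 (classification)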
5. Bias and Variance
• Bias refers to errors in data or models that incline results in a specific direction rather than leaving them unbiased. In machine learning, bias arises when a model makes wrong assumptions about the data, leading to poor learning (underfitting).

• Variance refers to error due to the model’s sensitivity to small fluctuations in the training data. A model with high variance learns too much from the training data, capturing noise rather than the actual pattern. This leads to overfitting.
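One way to see the trade-off is to fit polynomials of increasing degree to noisy data (a minimal NumPy sketch; the sine-based data-generating function is invented for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 20)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.size)  # noisy target

    for degree in (1, 3, 9):
        coeffs = np.polyfit(x, y, degree)                 # fit polynomial of given degree
        mse = np.mean((y - np.polyval(coeffs, x)) ** 2)   # training error
        print(degree, mse)
    # Degree 1 underfits (high bias); degree 9 chases the noise (high variance),
    # so its low training error would not carry over to new data (overfitting).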
6. Normal Distribution
• The bell curve, or normal distribution, is a symmetrical probability distribution characterized by a specific shape.
• It has two parameters: the mean and the standard deviation.
• The distribution is essential in data science for analyzing scenarios such as measurement errors, test scores, and heights.
• The normal distribution simplifies calculations and is foundational in hypothesis testing, inferential statistics, and parameter estimation.
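To see the two parameters in action, one can draw samples from a normal distribution and recover them (a minimal NumPy sketch; mean 170 and standard deviation 10 are arbitrary "height" values):

    import numpy as np

    rng = np.random.default_rng(42)
    heights = rng.normal(loc=170, scale=10, size=100_000)  # mean = 170, std = 10

    print(heights.mean())  # close to 170
    print(heights.std())   # close to 10
    # Roughly 68% of values fall within one standard deviation of the mean:
    print(np.mean((heights > 160) & (heights < 180)))  # close to 0.68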
Types of Variable
• Qualitative (Categorical) Variables: These describe attributes or characteristics that do not have numerical values. Instead, they represent categories or groups and are typically used to classify data into different classes or groups.
• Two subtypes:
  • Nominal Variables: Qualitative variables that represent different categories or groups without any inherent order.
    • Examples:
      • Gender (Male, Female, Non-binary)
      • Hair Color (Black, Brown, Blonde, Red)
  • Ordinal Variables: Qualitative variables with a meaningful order or ranking among the categories, although the differences between the categories are not measurable.
    • Examples:
      • Education Level (High School, Bachelor’s, Master’s, PhD)
      • Socioeconomic Status (Low, Middle, High)
Types of Variable
• Quantitative (Numerical) Variables: These variables have numerical values that can be counted or measured.
• Two subtypes:
  • Discrete Variables: Quantitative variables that take on a finite or countable number of values. These values are typically whole numbers, with no intermediate values between them.
    • Examples:
      • Number of children (0, 1, 2, 3…)
      • Number of books on a shelf (10, 15, 20…)
  • Continuous Variables: Quantitative variables that can take on infinitely many values within a given range. These values are not restricted to whole numbers and can include fractions or decimals.
    • Examples:
      • Height (5.8 ft, 6.1 ft, 5.75 ft)
      • Weight (150.5 lbs, 175.2 lbs)
Coefficient of Variation (CV)
It is a statistical measure that expresses the amount of variability in a dataset relative to its mean. It is used to compare the relative dispersion between datasets with different units or magnitudes.
Formula:
CV = (σ / μ) × 100

Key Points:
• Expressed as a percentage (%), not in absolute units.
• Useful for comparing variability across datasets with different scales.
• Higher CV → more variability relative to the mean.
• Lower CV → less variability (more consistency).

Applications:
• Risk assessment in stock returns.
• Measuring consistency in production.
• Comparing the variation in patient responses.
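As a quick sketch, the CV lets us compare the consistency of two production lines measured in different units (the helper function and numbers below are invented for illustration):

    import numpy as np

    def coefficient_of_variation(data):
        # CV = (standard deviation / mean) * 100, expressed as a percentage
        data = np.asarray(data, dtype=float)
        return data.std() / data.mean() * 100

    line_a = [100, 102, 98, 101, 99]       # output in units/day
    line_b = [10.4, 9.1, 11.2, 8.8, 10.5]  # output in tons/day

    print(coefficient_of_variation(line_a))  # about 1.4% -> very consistent
    print(coefficient_of_variation(line_b))  # about 9.1% -> more variable, despite smaller numbers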
Skewness
• Skewness is the attribute of a frequency distribution that extends further on one side of the class with the highest frequency than on the other.
• A frequency distribution is said to be skewed if the frequencies decrease with markedly greater rapidity on one side of the central maximum than on the other. This characteristic of a frequency distribution is known as skewness, and measures of asymmetry are usually called measures of skewness.
• Types: 1) Positive skewness. 2) Negative skewness.
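Skewness can be estimated from data with SciPy (a minimal sketch; the exponential sample is chosen because it is right-skewed by construction):

    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(0)
    right_skewed = rng.exponential(scale=2.0, size=10_000)  # long right tail
    symmetric = rng.normal(size=10_000)

    print(skew(right_skewed))  # clearly positive (positive skewness)
    print(skew(symmetric))     # close to zero (no skewness)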
Kurtosis
• Kurtosis is a statistical measure that describes the shape of a distribution’s tails in relation to its overall shape. It measures the degree to which a given distribution is more or less ‘peaked’ relative to the normal distribution, and it helps to identify outliers or extreme observations.

• The interpretation of kurtosis focuses on the excess kurtosis value:
  • Positive Excess Kurtosis (Leptokurtic): Indicates heavy tails in the distribution, suggesting a higher likelihood of outliers. The kurtosis is greater than 3.
  • Negative Excess Kurtosis (Platykurtic): Implies lighter tails, indicating fewer extreme values than a normal distribution. The kurtosis is less than 3.
  • Zero Excess Kurtosis (Mesokurtic): Suggests a distribution similar to the normal distribution in terms of tail weight, indicating a balanced distribution of outliers. The kurtosis of a normal distribution is 3.
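Note that SciPy's kurtosis function reports excess kurtosis by default (normal distribution → 0); passing fisher=False gives the plain kurtosis (normal distribution → 3). A minimal sketch:

    import numpy as np
    from scipy.stats import kurtosis

    rng = np.random.default_rng(1)
    heavy_tailed = rng.standard_t(df=4, size=10_000)  # t-distribution: heavy tails
    normal = rng.normal(size=10_000)

    print(kurtosis(heavy_tailed))          # positive excess kurtosis (leptokurtic)
    print(kurtosis(normal))                # close to 0 (mesokurtic)
    print(kurtosis(normal, fisher=False))  # close to 3: plain kurtosis of a normal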
Hypothesis Testing
• Hypothesis testing is a statistical method used to make decisions or inferences about a population based on sample data. In machine learning (ML), it helps validate assumptions, compare models, and assess statistical significance in experimental results.

• It is a structured method used to determine whether the findings of a study provide evidence to support a specific theory relevant to a larger population.

• It is a type of statistical analysis in which you put your assumptions about a population parameter to the test; it is used to estimate the relationship between two statistical variables.

• The method compares two opposite statements about a population and uses sample data to decide which one is more likely to be correct. To test an assumption, we first take a sample from the population, analyze it, and use the results of the analysis to decide whether the claim is valid.
Hypothesis Testing
• Suppose a company claims that its website gets an average of 50 user visits per day. To verify this, we use hypothesis testing to analyze past website traffic data and determine whether the claim is accurate. This helps us decide whether the observed data supports the company’s claim or whether there is a significant difference.

Defining Hypotheses

• Null hypothesis (H₀): The null hypothesis is the starting assumption in statistics; it says there is no effect or no difference between groups. For the website example above:

  H₀: The mean number of daily visits (μ) = 50.

• Alternative hypothesis (H₁): The alternative hypothesis is the opposite of the null hypothesis; it suggests there is a difference. If the website’s traffic is not equal to 50 visits per day, the alternative hypothesis is:

  H₁: The mean number of daily visits (μ) ≠ 50.
Hypothesis Testing
In hypothesis testing, Type I and Type II errors are the two possible errors that can occur when drawing conclusions about a population based on a sample of data. These errors are associated with the decisions we make regarding the null hypothesis and the alternative hypothesis.

• Type I error: We reject the null hypothesis although it is true. The Type I error rate is denoted by alpha (α).

• Type II error: We accept the null hypothesis although it is false. The Type II error rate is denoted by beta (β).

Decision                  | Null Hypothesis is True       | Null Hypothesis is False
Accept H₀ (H₀ retained)   | Correct Decision              | Type II Error (False Negative)
Reject H₀ (H₁ accepted)   | Type I Error (False Positive) | Correct Decision
Hypothesis Testing
The significance level (α) is the probability of rejecting the null hypothesis (H₀) when it is actually true. It represents the threshold for statistical significance, determining how much uncertainty we are willing to accept in our decision-making process.
Mathematical interpretation:
α = P(Type I Error) = P(Reject H₀ | H₀ is true)
This means that if we set α = 0.05 (5%), we accept a 5% chance of incorrectly rejecting the null hypothesis when it is actually true.

Significance Level (α) | Confidence Level (1 − α) | Usage
0.10 (10%)             | 90%                      | Used when a small effect size is acceptable.
0.05 (5%)              | 95%                      | Most commonly used in hypothesis testing.
0.01 (1%)              | 99%                      | Used in high-stakes experiments (e.g., medical trials, security systems).
0.001 (0.1%)           | 99.9%                    | Used when making critical decisions with minimal error tolerance.
Hypothesis Testing
Step 1: Define the hypotheses: the null hypothesis (H₀) and the alternative hypothesis (H₁).
Step 2: Define the significance level (α).
Step 3: Compute the test statistic. Using a paired t-test, analyze the data to obtain a test statistic:
t = m / (s / √n)
where m = mean of the differences, s = standard deviation of the differences, n = sample size.
Step 4: Find the p-value, using the t-distribution with the appropriate degrees of freedom.
Step 5: Result: The significance level (α) is a fixed threshold set before testing, while the p-value is calculated from the data.
• p-value < α → Reject H₀ (statistically significant result)
• p-value ≥ α → Fail to reject H₀ (not statistically significant)
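Applying these steps to the website-visits example, a one-sample t-test (a close relative of the paired test above) can be run with SciPy; the daily visit counts below are fabricated:

    from scipy import stats

    # Step 1: H0: mean daily visits = 50, H1: mean daily visits != 50
    visits = [46, 55, 51, 48, 52, 47, 49, 53, 50, 44]  # hypothetical sample
    alpha = 0.05                                       # Step 2: significance level

    # Steps 3-4: compute the test statistic and the p-value
    t_stat, p_value = stats.ttest_1samp(visits, popmean=50)
    print(t_stat, p_value)

    # Step 5: compare the p-value with alpha
    if p_value < alpha:
        print("Reject H0: mean visits differ significantly from 50")
    else:
        print("Fail to reject H0: no significant difference from 50")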
Central Limit Theorem
• The Central Limit Theorem explains that the sampling distribution of the sample mean resembles the normal distribution regardless of whether the variables themselves are normally distributed.

• The Central Limit Theorem states that:

When large samples (usually greater than thirty) are taken, the distribution of the sample arithmetic mean approaches the normal distribution, irrespective of whether the random variables were originally normally distributed.

• The Central Limit Theorem is important because it helps us make accurate predictions about a population just by analyzing a sample.

• Central Limit Theorem problems are solved by finding the Z-score of the sample mean, calculated using the formula:

Z = (x̄ − μ) / (σ / √n)

where x̄ is the sample mean, μ is the population mean, σ is the population standard deviation, and n is the sample size.
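A quick simulation makes the theorem concrete: sample means of a decidedly non-normal (exponential) population still pile up in a bell shape (a minimal NumPy sketch):

    import numpy as np

    rng = np.random.default_rng(7)

    # Population: exponential with mean 2.0, i.e., strongly right-skewed
    population = rng.exponential(scale=2.0, size=1_000_000)

    # Draw many samples of size n > 30 and record each sample mean
    n = 50
    sample_means = [rng.choice(population, size=n).mean() for _ in range(2_000)]

    print(np.mean(sample_means))  # close to the population mean (2.0)
    print(np.std(sample_means))   # close to sigma / sqrt(n) = 2.0 / sqrt(50), about 0.28
    # A histogram of sample_means would be approximately bell-shaped (normal).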
