
SUBJECT: Data Science and Its Applications (21AD62)

MODULE-1 INTRODUCTION

Syllabus: What is Data Science? Visualizing Data, matplotlib, Bar Charts, Line Charts,
Scatterplots, Linear Algebra, Vectors, Matrices, Statistics, Describing a Single Set of Data,
Correlation, Simpson's Paradox, Some Other Correlational Caveats, Correlation and Causation,
Probability, Dependence and Independence, Conditional Probability, Bayes's Theorem, Random
Variables, Continuous Distributions, The Normal Distribution, The Central Limit Theorem.

Introduction

Data science is the study of data to extract meaningful insights for business; it combines tools,
methods, and technology to generate meaning from data. It is a multidisciplinary approach that
combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and
computer engineering to analyze large amounts of data.

Data science is used to study data in four main ways:

1. Descriptive analysis
Descriptive analysis examines data to gain insights into what happened or what is happening in
the data environment. It is characterized by data visualizations such as pie charts, bar charts, line
graphs, tables, or generated narratives.
For example, a flight booking service may record data like the number of tickets booked each day.
Descriptive analysis will reveal booking spikes, booking slumps, and high-performing months for this
service.

2. Diagnostic analysis
Diagnostic analysis is a deep-dive or detailed data examination to understand why something
happened. It is characterized by techniques such as drill-down, data discovery, data mining, and
correlations. Multiple data operations and transformations may be performed on a given data set to
discover unique patterns in each of these techniques.
For example, the flight service might drill down on a particularly high-performing month to better
understand the booking spike. This may lead to the discovery that many customers visit a particular
city to attend a monthly sporting event.

3. Predictive analysis
Predictive analysis uses historical data to make accurate forecasts about data patterns that may
occur in the future. It is characterized by techniques such as machine learning, forecasting, pattern
matching, and predictive modeling. In each of these techniques, computers are trained to reverse
engineer causality connections in the data.
For example, the flight service team might use data science to predict flight booking patterns for the
coming year at the start of each year. The computer program or algorithm may look at past data and
predict booking spikes for certain destinations in May. Having anticipated their customer’s future
travel requirements, the company could start targeted advertising for those cities from February.

4. Prescriptive analysis

Prescriptive analytics takes predictive data to the next level. It not only predicts what is likely
to happen but also suggests an optimum response to that outcome. It can analyze the potential
implications of different choices and recommend the best course of action. It uses graph analysis,
simulation, complex event processing, neural networks, and recommendation engines from machine
learning.

Back to the flight booking example, prescriptive analysis could look at historical marketing campaigns
to maximize the advantage of the upcoming booking spike. A data scientist could project booking
outcomes for different levels of marketing spend on various marketing channels. These data forecasts
would give the flight booking company greater confidence in their marketing decisions.

Benefits of data science

• Discover unknown transformative patterns
• Innovate new products and solutions
• Real-time optimization

Data Preprocessing

Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the specific data
mining task.

Some common steps in data preprocessing include:

Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such
as missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as
imputation, removal, and transformation.

Data Integration: This involves combining data from multiple sources to create a unified dataset.
Data integration can be challenging as it requires handling data with different formats, structures,
and semantics. Techniques such as record linkage and data fusion can be used for data integration.

Data Transformation: This involves converting the data into a suitable format for analysis.
Common techniques used in data transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common range, while standardization is
used to transform the data to have zero mean and unit variance. Discretization is used to convert
continuous data into discrete categories.

Data Reduction: This involves reducing the size of the dataset while preserving the important
information. Data reduction can be achieved through techniques such as feature selection and feature
extraction. Feature selection involves selecting a subset of relevant features from the dataset, while
feature extraction involves transforming the data into a lower-dimensional space while preserving
the important information.


Data Discretization: This involves dividing continuous data into discrete categories or intervals.
Discretization is often used in data mining and machine learning algorithms that require categorical
data. Discretization can be achieved through techniques such as equal width binning, equal
frequency binning, and clustering.
Data Normalization: This involves scaling the data to a common range, such as between 0 and 1 or
-1 and 1. Normalization is often used to handle data with different units and scales. Common
normalization techniques include min-max normalization, z-score normalization, and decimal
scaling.
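
For example, min-max normalization and z-score standardization can be sketched with NumPy (an illustrative example with assumed sample values):

import numpy as np

data = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max normalization: scales values to the range [0, 1]
min_max = (data - data.min()) / (data.max() - data.min())

# Z-score standardization: zero mean and unit variance
z_score = (data - data.mean()) / data.std()

print("Min-max:", min_max)    # [0.   0.25 0.5  0.75 1.  ]
print("Z-score:", z_score)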
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the
analysis results. The specific steps involved in data pre processing may vary depending on the
nature of the data and the analysis goals.
By performing these steps, the data mining process becomes more efficient and the results become
more accurate.
Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that involves examining
datasets to summarize their main characteristics, often using visual methods. The primary goal of EDA
is to uncover patterns, spot anomalies, test hypotheses, and check assumptions through the use of
summary statistics and graphical representations.

Objectives of EDA

1. Understand Data Structure: Determine the types of variables and their relationships.
2. Summarize Data Characteristics: Use summary statistics to get an overall sense of the data.
3. Detect Outliers and Anomalies: Identify any unusual observations that may need further
investigation.
4. Discover Patterns and Relationships: Reveal trends, correlations, and other patterns.
5. Check Assumptions: Validate the assumptions of statistical models you may use later.

Steps in EDA

1. Data Collection and Loading: Import the data from various sources (e.g., CSV files, databases).
2. Data Cleaning: Handle missing values (e.g., imputation, removal); correct errors and
inconsistencies (e.g., duplicate records, incorrect data types).
3. Descriptive Statistics: a) Central Tendency: Mean, median, mode.
b) Dispersion: Range, variance, standard deviation
c) Shape: Skewness, kurtosis.
4. Data Visualization:
1. Univariate Analysis: Analyzing a single variable.
1. Histograms: For continuous variables.
2. Bar Plots: For categorical variables.
3. Box Plots: For detecting outliers and understanding the distribution.
2. Bivariate Analysis: Analyzing relationships between two variables.
1. Scatter Plots: To study correlations.
2. Line Graphs: For time series data.
3. Heatmaps: To visualize correlation matrices.



3. Multivariate Analysis: Analyzing relationships among more than two variables.

1. Pair Plots: To visualize relationships between multiple pairs of variables.


2. 3D Scatter Plots: For understanding the interaction between three variables.

Correlation Analysis:

1. Correlation Matrix: To understand the pairwise correlations between variables.


2. Heatmap: To visualize the correlation matrix.

Hypothesis Testing:

1. T-tests, Chi-square tests, ANOVA: To test assumptions and hypotheses about data
relationships.

Examples of EDA
Sales Data Analysis:
1. Summary Statistics: Calculate the average, minimum, and maximum sales.
2. Histograms: Show the distribution of sales amounts.
3. Box Plots: Identify outliers in sales data.
4. Scatter Plots: Analyze the relationship between advertising spend and sales.

Customer Data Analysis:


1. Descriptive Statistics: Understand the demographics of customers (e.g., age, income).
2. Bar Plots: Visualize the frequency distribution of customer segments.
3. Heatmap: Correlation between different customer attributes (e.g., age, income, spending
score).
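
As an illustrative sketch, a few of these EDA steps can be carried out with pandas on a small, assumed customer dataset (the column names and values below are hypothetical):

import pandas as pd

# Hypothetical customer data
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 23, 38],
    "income": [30000, 45000, 80000, 62000, 28000, 52000],
    "spending_score": [70, 55, 40, 45, 80, 60],
})

print(df.describe())          # central tendency, dispersion, and shape of each column
print(df.isnull().sum())      # missing values per column
print(df.corr())              # pairwise correlation matrix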

Data visualization
Data visualization is the graphical representation of information and data, where data is
represented in the form of graphs or charts. It helps to understand large and complex amounts of data
very easily. It allows the decision-makers to make decisions very efficiently and also allows them in
identifying new trends and patterns very easily. It is also used in high-level data analysis for Machine
Learning and Exploratory Data Analysis (EDA). Data visualization can be done with various tools
like Tableau, Power BI, Python.

There are two primary uses for data visualization:


• To explore data
• To communicate data

1. Purpose of Data Visualization

• To simplify complex data: bringing a large amount of data into one graph makes it easier to analyze.
• To highlight relationships, patterns, and trends.
• To support data-driven decision-making.


2. Common Visualization Techniques:


• Charts: Bar charts, line charts, pie charts.
• Graphs: Scatter plots, histograms, box plots.
• Maps: Geographic maps to represent spatial data.
• Interactive Dashboards: Tools that allow users to interact with the data, such as filtering and zooming.
3. Tools and Software:

• Libraries: Matplotlib, Seaborn, Plotly (Python); ggplot2 (R).
• Software: Tableau, Power BI, D3.js for web-based visualizations.

4. Benefits:
• Enhances comprehension by converting data into a visual context.
• Makes it easier to spot trends and outliers.
• Facilitates communication of insights to stakeholders.

Advantages
• Data visualization is a form of visual art that grabs our interest and keeps our eyes on the message.
• Easily sharing information.
• Interactively explore opportunities.
• Visualize patterns and relationships.

Disadvantages

• When viewing a visualization with many different data points, it's easy to make an inaccurate assumption.
• Biased or inaccurate information.
• Correlation doesn't always mean causation.
• Core messages can get lost in translation.

matplotlib
Matplotlib is a low-level Python library used for data visualization. It is easy to use
and produces MATLAB-like graphs and visualizations. The library is built on top of NumPy arrays
and provides several plot types such as line charts, bar charts, and histograms. It offers a lot of flexibility,
but at the cost of writing more code.
Here are some key aspects of Matplotlib:
1. Core Features (2D Plotting): Create line plots, scatter plots, bar charts, histograms, pie
charts, and more.
2. Customization: Extensive options to customize plots, including colors, labels, line styles,
and annotations (notes added to a plot).
3. Subplots: Support for creating complex figures with multiple subplots in a single figure.
4. Integration: Works seamlessly with other scientific libraries like NumPy, Pandas, and
SciPy.


5. Scripting: Suitable for scripting and quick plotting from the Python shell or Jupyter
notebooks.
6. Publication-Quality Figures: Capable of producing high-quality figures for publications
and presentations.

To install Matplotlib type the below command in the terminal.

pip install matplotlib

For example,
from matplotlib import pyplot as plt
years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]
# create a line chart, years on x-axis, gdp on y-axis
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')
# add a title
plt.title("Nominal GDP")
# add a label to the y-axis
plt.ylabel("Billions of $")
plt.show()


Line Charts
Write Python program to plot Line chart by assuming your own data and explain the various
attributes of line chart.

import matplotlib.pyplot as plt


# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 20, 15, 25, 30]
# Creating the line chart
plt.plot(x, y, color='blue', marker='o', linestyle='-', linewidth=2, markersize=8)
# Adding title and labels
plt.title('Sample Line Chart')
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
# Adding grid
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
# Adding legend
plt.legend(['Sample Data'], loc='upper left')
# Display the line chart
plt.show()

Explanation of Various Attributes


1. Data for Line Chart:
x: List of values for the x-axis ([1, 2, 3, 4, 5]).
y: Corresponding values for each point on the y-axis ([10, 20, 15, 25, 30]).

2. plt.plot( ) Function:
plt.plot(x, y):
This function creates the line chart with the x values on the x-axis and their corresponding y values on
the y-axis.
color='blue': Sets the color of the line to blue.
marker='o': Uses a circle marker for each data point.
linestyle='-': Sets the style of the line to solid.
linewidth=2: Sets the width of the line to 2 points.
markersize=8: Sets the size of the markers to 8 points.

3. Title and Labels:


plt.title('Sample Line Chart'): Adds a title to the line chart.
plt.xlabel('X-axis Label'): Labels the x-axis as 'X-axis Label'.
plt.ylabel('Y-axis Label'): Labels the y-axis as 'Y-axis Label'.

4.Grid:
plt.grid(True, which='both', linestyle='--', linewidth=0.5):
Adds a grid to the chart to improve readability.
True: Enables the grid.
which='both': Applies the grid to both major and minor ticks.
linestyle='--': Sets the style of the grid lines to dashed.
linewidth=0.5: Sets the width of the grid lines to 0.5 points.


5.Legend:
plt.legend(['Sample Data'], loc='upper left'):
Adds a legend to the chart.
['Sample Data']: List of labels for the legend.
loc='upper left': Positions the legend in the upper left corner of the chart.

6.plt.show():
plt.show(): Displays the line chart.

Bar Charts

A bar chart is a good choice to show how some quantity varies among some discrete set of items. For
Example:

import matplotlib.pyplot as plt


import numpy as np
x=np.array(["A", "B", "C", "D"])
y=np.array([3, 8, 1, 10])
plt.bar(x,y)
plt.show()

Scatterplots
A scatterplot is the right choice for visualizing the relationship between two paired sets of data.
Example
import matplotlib.pyplot as plt
import numpy as np

x=np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y=np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
colors=np.array(["red","green","blue","yellow","pink","black","orange","purple","beige","brown",
                 "gray","cyan","magenta"])
plt.scatter(x,y,c=colors)

plt.show()


Histogram
Write Python program to plot histogram by assuming your own data and explain the various
attributes of histogram.

import numpy as np
import matplotlib.pyplot as plt

# 100 samples drawn from a standard normal distribution
Z = np.random.normal(0, 1, 100)
fig, ax = plt.subplots()
ax.hist(Z)
plt.show()

Scatterplots
Write Python program to plot Scatterplot by assuming your own data and explain the various
attributes of Scatterplot.
import numpy as np
import matplotlib.pyplot as plt

X = np.random.uniform(0, 1, 100)
Y = np.random.uniform(0, 1, 100)
fig, ax = plt.subplots()
ax.scatter(X, Y)
plt.show()

# Image plot (heatmap) of a random 8x8 matrix
Z = np.random.uniform(0, 1, (8, 8))
fig, ax = plt.subplots()
ax.imshow(Z)
plt.show()

# Filled contour plot of a random 8x8 matrix
Z = np.random.uniform(0, 1, (8, 8))
fig, ax = plt.subplots()
ax.contourf(Z)
plt.show()

# Pie chart from four random values
Z = np.random.uniform(0, 1, 4)
fig, ax = plt.subplots()
ax.pie(Z)

# Error bar plot with error bars of size Y/4
X = np.arange(5)
Y = np.random.uniform(0, 1, 5)
fig, ax = plt.subplots()
ax.errorbar(X, Y, Y / 4)

# Box plot of three columns of normally distributed data
Z = np.random.normal(0, 1, (100, 3))
fig, ax = plt.subplots()
ax.boxplot(Z)
plt.show()

Linear Algebra for data science

• Linear algebra is the branch of mathematics that deals with vector spaces.
• Linear algebra is a foundational component of data science, underpinning many of the algorithms and techniques used for data manipulation, analysis, and machine learning.
• Linear algebra deals with vector spaces and linear transformations between them. It involves the study of vectors, matrices, and systems of linear equations, and explores concepts such as vector addition, scalar multiplication, matrix operations, determinants, eigenvalues, and eigenvectors.
• It forms the backbone of machine learning algorithms, enabling operations like matrix multiplication, which are essential to model training and prediction.
• Linear algebra techniques facilitate dimensionality reduction, enhancing the performance of data processing and interpretation.
• Eigenvalues and eigenvectors help understand the variability in data records, influencing clustering and pattern recognition.
• Solving systems of equations is crucial for optimization tasks and parameter estimation.
• Linear algebra supports image and signal processing strategies critical in data analysis.
• Proficiency in linear algebra empowers data scientists to effectively represent, manipulate, and extract insights from data, ultimately driving the development of accurate models and informed decision-making.

Applications of linear algebra in data science:


Linear Algebra Concepts

1. Scalar: A single number.

2. Vectors:
A vector is a list of numbers representing data points or features in a dataset.
Vector: An ordered array of numbers, e.g., v = [v1, v2, ..., vn]. Operations on vectors include addition, subtraction, scalar multiplication, and the dot product.


Vector operations

1. Addition: a + b = [a1 + b1, a2 + b2, ..., an + bn]
2. Subtraction: a − b = [a1 − b1, a2 − b2, ..., an − bn]
3. Scalar multiplication: c·a = [c·a1, c·a2, ..., c·an]
4. Dot product: a · b = a1·b1 + a2·b2 + ... + an·bn
5. Norm (magnitude): ∥a∥ = √(a · a) = √(a1² + a2² + ... + an²)

Write Python program to add two vectors and multiply a vector by a scalar

def vector_add(v, w):
    """Adds corresponding elements of two vectors."""
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def scalar_multiply(c, v):
    """Multiplies every element of vector v by the scalar c."""
    return [c * v_i for v_i in v]

# Example vectors

v = [1, 2, 3]

w = [4, 5, 6]

# Scalar value

c=2

# Adding two vectors

result_addition = vector_add(v, w)

print(f"Vector addition of {v} and {w} is {result_addition}")

# Multiplying vector by a scalar

result_scalar_multiplication = scalar_multiply(c, v)

print(f"Scalar multiplication of {v} by {c} is {result_scalar_multiplication}")

Output
Vector addition of [1, 2, 3] and [4, 5, 6] is [5, 7, 9]
Scalar multiplication of [1, 2, 3] by 2 is [2, 4, 6]
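
The dot product and norm from the list of vector operations above can be implemented in the same style; this is an illustrative sketch using the same example vectors:

import math

v = [1, 2, 3]
w = [4, 5, 6]

def dot(v, w):
    """Computes v_1*w_1 + v_2*w_2 + ... + v_n*w_n."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def magnitude(v):
    """Computes the norm (length) of vector v."""
    return math.sqrt(dot(v, v))

print(f"Dot product of {v} and {w} is {dot(v, w)}")  # 1*4 + 2*5 + 3*6 = 32
print(f"Magnitude of {v} is {magnitude(v):.4f}")     # sqrt(1^2 + 2^2 + 3^2) ≈ 3.7417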


Data Representation becomes an important aspect of data science and data is represented usually in a
matrix form.
To uncover the relations between variables linear algebraic tools are used.

Rank of a Matrix
It refers to the number of linearly independent rows or columns of the matrix

Null space and Nullity


• The question of linear relationships among attributes is answered by the concepts of null space and nullity.
• The null space of a matrix A consists of all vectors B such that AB = 0; the non-zero vectors in it describe linear relationships among the attributes.
• It can be obtained from AB = 0, where A is a known matrix of size m x n and B is the matrix to be found, of size n x k.
• Every null space vector corresponds to one linear relationship.
• Nullity is the number of linearly independent vectors in the null space of a given matrix.
• In other words, the dimension of the null space of the matrix A is called the nullity of A.

Rank nullity Theorem


Consider a data matrix A with a null space and nullity. The rank-nullity theorem helps us relate the nullity of a data matrix to its rank and the number of attributes in the data.


According to the rank-nullity theorem, for a data matrix A:
Rank(A) + Nullity(A) = number of columns (attributes) of A

Statistics
Statistics is the science of collecting and analyzing data, and it helps us understand the data. Statistics
play a crucial role in data science, enabling professionals to make sense of data, draw valid conclusions,
and make informed decisions.
There are two types of statistics:
1. Descriptive Statistics: These are used to summarize and describe the main features of a dataset.
2. Inferential Statistics: These are used to draw conclusions from the data.

The purpose of descriptive and inferential statistics is to analyze different types of data using different
tools. Descriptive statistics helps to describe and organize known data using charts, bar graphs, etc.,
while inferential statistics aims at making inferences and generalizations about the population data.

Descriptive Statistics
Descriptive statistics are a part of statistics that can be used to describe data. It is used to summarize
the attributes of a sample in such a way that a pattern can be drawn from the group. It enables
researchers to present data in a more meaningful way such that easy interpretations can be made.
Descriptive statistics uses two tools to organize and describe data. These are given as follows:

• Measures of Central Tendency - These help to describe the central position of the data by using measures such as mean, median, and mode.
• Measures of Dispersion - These measures help to see how spread out the data is in a distribution with respect to a central point. Range, standard deviation, variance, quartiles, and absolute deviation are the measures of dispersion.
Need of descriptive statistics in data analysis:

1. Summarizing Data

Descriptive statistics allow us to condense large datasets into meaningful summaries, making it easier
to understand the overall characteristics of the data.

Example:
• A company surveys 1,000 customers about their satisfaction with a product. The average satisfaction score (mean) can be calculated to provide a quick summary of customer sentiment. If the mean satisfaction score is 4.2 out of 5, this indicates that customers are generally satisfied.


2. Identifying Patterns and Trends

Descriptive statistics help in detecting patterns and trends within the data, which can inform further
analysis and decision-making.

Example:
• In a dataset of monthly sales figures over the past year, a time series plot can reveal trends such as seasonality (higher sales in December) or growth (increasing sales month-over-month). A histogram of sales data might show a normal distribution, indicating consistent sales performance.

3. Facilitating Comparison

Descriptive statistics enable easy comparison between different groups or datasets, helping to identify
differences and similarities.

Example:
• In a clinical trial, researchers compare the average reduction in blood pressure between two groups: those receiving a new medication and those receiving a placebo. If the mean reduction in blood pressure is 15 mmHg for the medication group and 5 mmHg for the placebo group, it clearly shows the effectiveness of the medication.

4. Detecting Outliers and Anomalies

Descriptive statistics are essential for identifying outliers and anomalies, which can indicate data entry
errors, exceptional cases, or areas needing further investigation.

Example:
• A box plot of students' test scores can highlight outliers, such as a few students scoring significantly lower than the rest. Identifying these outliers can lead to further investigation into whether these scores were due to errors, misunderstandings, or other factors.

5. Informing Further Analysis

Descriptive statistics provide a necessary foundation for more complex analyses, guiding the choice of
appropriate statistical methods and models.

Example:
• Before performing a regression analysis, a data analyst examines the descriptive statistics of the variables involved. If the standard deviation of the dependent variable is very high, it might indicate the need for data transformation or the inclusion of additional predictor variables.

6. Communicating Results

Descriptive statistics are crucial for presenting data insights in a clear and understandable manner to
stakeholders who may not have a statistical background.
Example:
• In a business report, presenting the mean and standard deviation of customer satisfaction scores, along with visual aids like bar charts and pie charts, helps executives quickly grasp the key findings and make informed decisions.


7. Ensuring Data Quality

Descriptive statistics help in assessing the quality of the data, identifying inconsistencies, and ensuring
the data is suitable for analysis.

Example:
• When analyzing survey data, an analyst might calculate the mean, median, and mode of responses to check for unusual patterns. If the mean age of respondents is 150 years, it indicates a data entry error that needs correction.

Inferential Statistics

Inferential statistics is a branch of statistics that is used to make inferences about the population by
analyzing a sample. When the population data is very large it becomes difficult to use it. In such cases,
certain samples are taken that are representative of the entire population. Inferential statistics draws
conclusions regarding the population using these samples. Sampling strategies such as simple random
sampling, cluster sampling, stratified sampling, and systematic sampling, need to be used in order to
choose correct samples from the population. Some methodologies used in inferential statistics are as
follows:

• Hypothesis Testing - This technique involves the use of hypothesis tests such as the z test, f test, t test, etc., to make inferences about the population data. It requires setting up the null hypothesis, the alternative hypothesis, and testing the decision criteria.
• Regression Analysis - This technique is used to check the relationship between dependent and independent variables. The most commonly used type of regression is linear regression.

Important Concepts

Mean: The average of a dataset and is calculated by summing all the numbers in the dataset and then
dividing by the count of numbers.

Median: The middle value separating the higher half from the lower half of the dataset.
For an odd number of observations, it is the middle value.
For an even number of observations, it is the average of the two middle values.

Mode: The most frequently occurring value in the dataset. There can be more than one mode if multiple
values have the same highest frequency.

Example: For the dataset [1, 2, 2, 3, 4], the mode is 2. For the dataset [1, 2, 2, 3, 3, 4], the modes are
2 and 3.
Range: The difference between the maximum and minimum values.
Variance and Standard Deviation: Measures of the dispersion or spread of the data. Variance is the
average of the squared differences from the mean, and the standard deviation is the square root of the
variance.
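
As an illustrative sketch with assumed sample values, these measures can be computed with Python's statistics module:

import statistics

data = [1, 2, 2, 3, 3, 4, 7]

print("Mean:", statistics.mean(data))             # 3.142857...
print("Median:", statistics.median(data))         # 3
print("Mode(s):", statistics.multimode(data))     # [2, 3]
print("Range:", max(data) - min(data))            # 6
print("Variance:", statistics.pvariance(data))    # population variance
print("Std deviation:", statistics.pstdev(data))  # population standard deviation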
Probability Distributions:
Uniform Distribution: All outcomes are equally likely.
Normal Distribution: The bell curve; characterized by its mean and standard deviation.
Binomial Distribution: The number of successes in a fixed number of independent Bernoulli trials.


Correlation
• A measure of the relationship between two variables, indicating how one variable changes with respect to another.
• Variance measures how a single variable deviates from its mean, whereas covariance measures how two variables vary together from their means.
• Correlation refers to the statistical relationship between two entities.
• It measures the extent to which two variables are linearly related.
• For example, the height and weight of a person are related: taller people tend to be heavier than shorter people.

There are three types of correlation:

Positive Correlation: A positive correlation means that this linear relationship is positive, and the two
variables increase or decrease in the same direction.

Negative Correlation: A negative correlation is just the opposite. The relationship line has a negative
slope, and the variables change in opposite directions, i.e., one variable decreases while the other
increases.

No Correlation: No correlation simply means that the variables behave very differently and thus,
have no linear relationship
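
As an illustrative sketch with made-up height and weight values, the Pearson correlation coefficient between two variables can be computed with NumPy:

import numpy as np

heights = np.array([160, 165, 170, 175, 180, 185])   # cm
weights = np.array([55, 60, 66, 70, 78, 83])          # kg

r = np.corrcoef(heights, weights)[0, 1]   # Pearson correlation coefficient
print(f"Correlation between height and weight: {r:.2f}")   # close to +1: positive correlation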

Correlation and causation


Correlation depicts the degree of association between two random variables. In data analysis it is often
used to determine the extent to which they are related to one another.
Causation is a relationship where one event causes another event to occur; it is often more challenging to
establish than correlation. Causation between A and B implies that A and B have a cause-and-effect
relationship with one another, i.e., A depends on B or vice versa.

Example: In a mobile phone, a low battery causes the video call to disconnect and the system to shut down;
both relationships are examples of causation.

“Correlation is Not Causation"

The statement "correlation is not causation" means that just because two variables are correlated, it
does not mean that one variable causes the other to change. There are three main reasons for this:

1. Coincidence: The correlation might be due to random chance.


2. Third Variable Problem: Another variable (a confounding variable) might be influencing both
variables, creating a false impression of a direct relationship.


3. Reverse Causality: It might be that the supposed effect is actually the cause.

Example

Consider a scenario where there is a high correlation between ice cream sales and drowning incidents.
Based on correlation alone, one might mistakenly conclude that ice cream sales cause drowning.
However, the actual explanation involves a third variable: temperature.

• During hot weather, more people buy ice cream to cool off.
• During the same hot weather, more people go swimming to cool off.
• As more people swim, the likelihood of drowning incidents increases.

In this case, the hot weather (the third variable) causes both the increase in ice cream sales and the
increase in drowning incidents. Therefore, the correlation between ice cream sales and drowning does
not imply that one causes the other.

Quantile:
The concept of quantiles generalizes the median.
• Quantiles describe the values that divide a dataset into intervals with equal probabilities.
• Quantiles are points taken at regular intervals from the cumulative distribution function (CDF) of a random variable. They are used to divide the range of a dataset into contiguous intervals with equal probabilities.
• The median is the 50th percentile, which divides the dataset into two equal parts, whereas quartiles divide the dataset into four equal parts. The first quartile (Q1) is the 25th percentile, the second quartile (Q2) is the median (50th percentile), and the third quartile (Q3) is the 75th percentile. Percentiles divide the dataset into 100 equal parts.

CODE:
import numpy as np

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
quantiles = np.quantile(data, [0.25, 0.5, 0.75])
print("25th percentile:", quantiles[0])
print("50th percentile (median):", quantiles[1])
print("75th percentile:", quantiles[2])

Dispersion
Dispersion in statistics refers to the extent to which a distribution is stretched or squeezed. Commonly
used measures of dispersion include the range, variance, standard deviation, and interquartile range
(IQR). These measures provide insights into the variability or spread of a dataset. Dispersion indicates how
much the values differ from the average (mean) value. High dispersion means the data points are
spread out over a wide range of values, while low dispersion indicates that the data points are closely
clustered around the mean.

1. Range: The difference between the maximum and minimum values in a dataset.
2. Variance: The average of the squared differences from the mean.
3. Standard Deviation: The square root of the variance, representing the average distance from the mean.
4. Interquartile Range (IQR): The difference between the 75th percentile (Q3) and the 25th percentile (Q1).


Code:
import numpy as np

# Sample data

data = [12, 15, 14, 10, 18, 20, 25, 30, 22, 23]

# Range

data_range = np.max(data) - np.min(data)

# Variance

variance = np.var(data)

# Standard Deviation

std_deviation = np.std(data)

# Interquartile Range (IQR)

Q1 = np.percentile(data, 25)

Q3 = np.percentile(data, 75)

IQR = Q3 - Q1

# Output the results

print(f"Range: {data_range}")

print(f"Variance: {variance}")

print(f"Standard Deviation: {std_deviation}")

print(f"Interquartile Range (IQR): {IQR}")

Variance and Covariance:

1. Variance:
o Variance measures the dispersion of a single variable.
o It calculates the average of the squared differences between each data point and the
mean of the data set.
o Variance is always non-negative.
o It provides an idea of how much the values of a single variable differ from the mean
value of that variable.
2. Covariance:
o Covariance measures the degree to which two variables change together.
o It calculates the average of the product of the deviations of each pair of data points from their respective means.
o Covariance can be positive, negative, or zero.


o It provides an idea of the direction of the linear relationship between two variables (i.e.,
whether the variables tend to increase or decrease together).


CODE:

import numpy as np

# Sample data

x = [12, 15, 14, 10, 18, 20, 25, 30, 22, 23]

y = [22, 25, 24, 20, 28, 30, 35, 40, 32, 33]

# Calculating covariance

covariance_matrix = np.cov(x, y, bias=True)

covariance = covariance_matrix[0, 1]

# Output the results

print(f"Covariance between x and y: {covariance}")

Simpson's Paradox
Simpson's paradox occurs when groups of data show one particular trend, but this trend is reversed
when the groups are combined together. Understanding and identifying this paradox is important for
correctly interpreting data.

Consider n groups of data such that group i has Ai trials and 0 ≤ ai ≤ Ai "successes". Similarly, consider an
analogous n groups of data such that group i has Bi trials and 0 ≤ bi ≤ Bi "successes". Then Simpson's
paradox occurs if

ai / Ai < bi / Bi for every group i, and yet (∑ ai) / (∑ Ai) > (∑ bi) / (∑ Bi)

Consider an example

Let's consider an example involving a case study of two students, A and B, assigned to solve some
problems over two days. We want to determine which student has the better success rate.

Data:

Day        Student A   Success Rate   Student B   Success Rate
Saturday   7/8         87.5%          2/2         100%
Sunday     1/2         50%            5/8         62.5%
Total      8/10        80%            7/10        70%


Thus Simpson's paradox is a phenomenon in probability and statistics in which a trend appears in
several groups of data but disappears or reverses when the groups are combined.
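
The reversal can be verified with a short calculation that uses the numbers from the table above (an illustrative sketch):

# Successes / trials for each student, per day (from the table above)
a = {"Saturday": (7, 8), "Sunday": (1, 2)}    # Student A
b = {"Saturday": (2, 2), "Sunday": (5, 8)}    # Student B

for day in a:
    rate_a = a[day][0] / a[day][1]
    rate_b = b[day][0] / b[day][1]
    print(f"{day}: A = {rate_a:.1%}, B = {rate_b:.1%}")   # B is better on each day

total_a = sum(s for s, _ in a.values()) / sum(n for _, n in a.values())
total_b = sum(s for s, _ in b.values()) / sum(n for _, n in b.values())
print(f"Combined: A = {total_a:.1%}, B = {total_b:.1%}")  # yet A is better overall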

Probability
Probability Theory is the backbone of many Data Science concepts, such as Inferential Statistics,
Machine Learning, Deep Learning, etc. Probability is an estimation of how likely a certain event or
outcome will occur. It is typically expressed as a number between 0 and 1. Probability can be
calculated by dividing the number of favorable outcomes by the total number of outcomes of an event.

Probability Formula
P(A)=Number of favorable outcomes to A/Total number of possible outcomes

For example, if there is an 80% chance of rain tomorrow, then the probability of it raining tomorrow is
0.8.

Dependence and Independence

Two events E and F are dependent if knowing something about whether E happens gives us
information about whether F happens, or vice versa; otherwise they are independent.

If we flip a fair coin twice, knowing whether the first flip is Heads gives us no information about
whether the second flip is Heads. These events are independent.

Independent events are events that are not affected by the occurrence of other events.
The formula for the Independent Events is,

P(A and B) = P(A)×P(B)

Dependent events are events that are affected by the occurrence of other events.
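
As an illustrative sketch, a simple simulation of two fair coin flips shows that P(A and B) is approximately P(A) × P(B) for independent events (the simulation parameters below are assumed):

import random

random.seed(0)
trials = 100_000
first_heads = second_heads = both_heads = 0

for _ in range(trials):
    a = random.random() < 0.5   # event A: first flip is heads
    b = random.random() < 0.5   # event B: second flip is heads
    first_heads += a
    second_heads += b
    both_heads += (a and b)

p_a = first_heads / trials
p_b = second_heads / trials
p_ab = both_heads / trials
print(f"P(A) = {p_a:.3f}, P(B) = {p_b:.3f}")
print(f"P(A and B) = {p_ab:.3f}  vs  P(A)*P(B) = {p_a * p_b:.3f}")   # approximately equal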

Bayes' theorem
Bayes' theorem is a powerful tool in data science for updating the probability of a hypothesis based on
new evidence. It's widely used in various applications such as spam filtering, medical diagnosis, and
machine learning.

Bayes' Theorem Formula

Bayes' theorem relates the conditional and marginal probabilities of random events. The formula is:

P(A|B) = P(B|A) · P(A) / P(B)

where:
• P(A|B) is the posterior probability: the probability of event A occurring given that B is true.
• P(B|A) is the likelihood: the probability of event B occurring given that A is true.
• P(A) is the prior probability: the initial probability of event A before any evidence is considered.
• P(B) is the marginal probability: the total probability of event B occurring.


Probability distribution

A probability distribution describes how the values of a random variable are distributed. It
provides the probabilities of occurrence of different possible outcomes in an experiment.
Probability distributions can be categorized into two main types: discrete and continuous.

Discrete Probability Distribution:

This type applies to discrete random variables, which are variables that take on a finite or
countable number of distinct values.

Example: The number of heads in 10 coin flips.

Expected Value and Variance:


Expected Value (Mean): For a discrete random variable X: E(X) = ∑ xi · P(xi), summing over all possible values xi.
Variance: Measures the spread of the random variable's values: Var(X) = E((X − E(X))²)

Continuous Distribution
A continuous distribution is a type of probability distribution in which the variable can take an
infinite number of values within a given range. It describes the probabilities of the possible values
of a continuous random variable. Unlike discrete variables, which have specific, countable
outcomes, continuous variables can take on any value within a given range.

In a continuous distribution there are infinitely many possible values (for example, infinitely many numbers
between 0 and 1). Instead of assigning probabilities to individual points, we use a Probability Density
Function (PDF). A key property of a PDF is that it must integrate to 1 over the entire range of possible
values, ensuring the total probability is 1.

Examples:

1. Height of Adults: The height of adults can be measured to any level of precision and is
typically described by a continuous distribution. For instance, someone might be 170.2 cm tall,
while another person might be 170.25 cm tall.
2. Temperature: Temperature readings are continuous as they can take on any value within a
range, such as 23.5°C or 23.56°C.

Normal Distribution
The normal distribution, also known as the Gaussian distribution, is a specific type of continuous
distribution. It is characterized by its bell-shaped curve, which is symmetric around the mean. The
properties of a normal distribution include:

• The mean, median, and mode are all equal.
• It is fully described by two parameters: the mean (µ) and the standard deviation (σ).
• Approximately 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations (empirical rule).


Example: The distribution of heights of adult men in a specific population is often modeled as a
normal distribution with a mean of around 175 cm and a standard deviation of about 7 cm. This means
most men's heights will be around 175 cm, with fewer men being significantly shorter or taller.

The normal distribution is the king of distributions: determined by two parameters - its mean μ (mu)
and its standard deviation σ (sigma).

Code:
import math
import matplotlib.pyplot as plt

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

xs = [x / 10.0 for x in range(-50, 50)]
plt.plot(xs, [normal_cdf(x, sigma=1) for x in xs], '-', label='mu=0,sigma=1')
plt.plot(xs, [normal_cdf(x, sigma=2) for x in xs], '--', label='mu=0,sigma=2')
plt.plot(xs, [normal_cdf(x, sigma=0.5) for x in xs], ':', label='mu=0,sigma=0.5')
plt.plot(xs, [normal_cdf(x, mu=-1) for x in xs], '-.', label='mu=-1,sigma=1')
plt.legend(loc=4)  # bottom right
plt.title("Various Normal cdfs")
plt.show()

Common Distributions
Binomial Distribution: Describes the number of successes in a fixed number of independent
Bernoulli trials.
Parameters: n (number of trials) and p (probability of success).
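
As an illustrative sketch with assumed parameters, samples from a binomial distribution can be drawn with NumPy:

import numpy as np

rng = np.random.default_rng(seed=42)
samples = rng.binomial(n=10, p=0.5, size=1000)   # successes in 10 trials, repeated 1000 times

print("Mean number of successes:", samples.mean())   # close to n*p = 5
print("Variance:", samples.var())                    # close to n*p*(1-p) = 2.5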

Conditional Probability

Conditional probability refers to the probability of an event occurring given that another event has
already occurred. The conditional probability of event A given that event B has occurred is denoted as
P(A/B) and is defined as:

P (A/B)= P(A and B) / P(B)


This formula assumes that P(B)≠0. It is used to update our probability estimates based on new
information.

Example: Medical Test

Consider a scenario involving a medical test for a particular disease. Consider the following
information:
The probability of having the disease (D) is P (D)=0.01.

The probability of testing positive (T) given that one has the disease is

P (T/D)=0.99.

The probability of testing positive given that one does not have the disease is P(T/¬D)=0.05.
We want to find the probability that a person has the disease given that they tested positive (P(D/T)).

Applying Bayes' Theorem

Bayes' Theorem provides a way to update our beliefs based on new evidence. The theorem states:


P(D/T) = P(T/D)⋅P(D) / P(T)
We need to calculate P (T), the total probability of testing positive. This can be done using the law of
total probability:

P (T)=P(T/D)⋅P(D)+P(T/¬D)⋅P(¬D)

Where: P(¬D)=1−P(D)
Substituting the given values:
P(T)=(0.99⋅0.01)+(0.05⋅0.99)
Calculating this:
P(T)=0.0099+0.0495=0.0594
Now we can use Bayes' Theorem to find P(D/T):
P(D/T)= P(T/D)⋅P(D) / P(T)= 0.99⋅0.01/0.0594
P(D/T)= 0.0099/0.0594 ≈0.1667
Interpretation: Despite the high sensitivity P(T/D)=0.99 of the test, the probability of actually having
the disease given a positive test result is only about 16.67%. This example illustrates the importance of
considering the base rate (prevalence) of the disease in the population when interpreting test results. In
this case, the low prevalence of the disease significantly affects the conditional probability.
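
The same calculation can be scripted as a minimal sketch using the numbers above:

p_d = 0.01              # P(D): prior probability of having the disease
p_t_given_d = 0.99      # P(T/D): probability of testing positive given the disease
p_t_given_not_d = 0.05  # P(T/¬D): probability of testing positive without the disease

# Law of total probability: P(T)
p_t = p_t_given_d * p_d + p_t_given_not_d * (1 - p_d)

# Bayes' theorem: P(D/T)
p_d_given_t = p_t_given_d * p_d / p_t
print(f"P(T) = {p_t:.4f}")              # 0.0594
print(f"P(D/T) = {p_d_given_t:.4f}")    # approximately 0.1667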

Random Variables and Probability Distributions

1. Random Variables:

Definition: A random variable is a variable whose value is subject to variations due to randomness.
Types: Discrete and continuous random variables. A random variable is a variable whose possible
values have an associated probability distribution.

Example:
A very simple random variable equals 1 if a coin flip turns up heads and 0 if the flip turns up tails.
A more complicated one might measure the number of heads observed when flipping a coin 10 times
or a value picked from range (10) where each number is equally likely. The associated distribution
gives the probabilities that the variable realizes each of its possible values. The coin flip variable
equals 0 with probability 0.5 and 1 with probability 0.5. The range(10) variable has a distribution that
assigns probability 0.1 to each of the numbers from 0 to 9.

The expected value of a random variable is the average of its values weighted by their
probabilities. The coin flip variable has an expected value of
1/2 (= 0 * 1/2 + 1 * 1/2)

and the range(10) variable has an expected value of 4.5. Random variables can be conditioned on
events just as other events can.
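
These expected values can be checked with a short illustrative sketch:

def expected_value(outcomes_and_probs):
    """Average of the values weighted by their probabilities."""
    return sum(value * prob for value, prob in outcomes_and_probs)

coin_flip = [(0, 0.5), (1, 0.5)]             # 1 if heads, 0 if tails
range_10 = [(x, 0.1) for x in range(10)]     # each value 0..9 with probability 0.1

print("E[coin flip] =", expected_value(coin_flip))    # 0.5
print("E[range(10)] =", expected_value(range_10))     # 4.5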


Central Limit Theorem (CLT)

The Central Limit Theorem is a fundamental principle in statistics that describes the characteristics of
the sampling distribution of the sample mean. It states that, regardless of the original distribution of
the data, the distribution of the sample means approaches a normal distribution as the sample size
becomes larger, provided the samples are independent and identically distributed.

Example: Suppose you want to estimate the average height of all students in a large university. If you
take multiple random samples of students and calculate the average height for each sample, the
distribution of these sample means will approximate a normal distribution as the sample size increases,
regardless of the original distribution of student heights.
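
A small simulation can illustrate this behaviour (an illustrative sketch with assumed parameters):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)

# Original data: a skewed (exponential) population, clearly not normal
population = rng.exponential(scale=2.0, size=100_000)

# Draw many samples and record each sample's mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(2000)]

plt.hist(sample_means, bins=40)
plt.title("Distribution of sample means (approximately normal)")
plt.xlabel("Sample mean")
plt.show()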
