Stats Unit I Notes
Statistics Terminologies:
Some of the most common terms you might come across in statistics are:
Population: It is the collection of individual objects or events whose
properties are to be analyzed.
Sample: It is a subset of a population.
Variable: It is a characteristic that can have different values.
Parameter: It is a numerical characteristic of a population.
Statistics Examples:
Some real-life examples of statistics that you might have seen:
Example 1: In a class of 45 students, we calculate the mean marks to evaluate the
performance of that class.
Example 2: During elections, you might have seen exit polls. Exit polls collect the
opinions of a sample of voters and are used to predict election results.
Types of Statistics
There are 2 types of statistics:
Descriptive Statistics
Inferential Statistics
Scope of Statistics
Statistics is a branch of mathematics that deals with the collection, organization,
analysis, interpretation, and presentation of data. It is used in a wide variety of fields,
including:
Science: Statistics is used to design experiments, analyze data, and draw
conclusions about the natural world.
Business: Statistics is used to market products, track sales, and make financial
decisions.
Government: Statistics is used to track economic trends, measure the effectiveness
of government programs, and allocate resources.
Healthcare: Statistics is used to develop new drugs, track the spread of
diseases, and assess the effectiveness of medical treatments.
Sports: Statistics is used to analyze player performance, scout new talent, and
predict the outcome of games.
A population is the entire set of individuals, objects, or data points that you want to
study and draw conclusions about. It can be large or small depending on the scope of
your research, for example all students in a school or all people in a country. The
population size is usually denoted by N.
A sample is a subset of the population that is selected for analysis. It is used when
studying the entire population is impractical or impossible. A sample should be a
representative portion of the population, so that inferences about the entire population
can be drawn from it using statistical techniques. The sample size is denoted by n.
The population gives a complete picture, while the sample provides an estimate.
Parameters (like the population mean) describe the population; statistics (like the
sample mean) describe the sample.
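The difference between a parameter and a statistic can be illustrated with a small sketch. The marks below are randomly generated for illustration only (they are not real data), and the names population, sample, population_mean and sample_mean are assumptions made for this example.

```python
import random
import statistics

# Hypothetical population: marks of N = 45 students (made-up values).
random.seed(0)  # fixed seed so the run is repeatable
population = [random.randint(30, 100) for _ in range(45)]

# Draw a simple random sample of n = 10 students without replacement.
sample = random.sample(population, 10)

N = len(population)  # population size
n = len(sample)      # sample size

population_mean = statistics.mean(population)  # parameter
sample_mean = statistics.mean(sample)          # statistic (estimates the parameter)

print(f"N = {N}, population mean = {population_mean:.2f}")
print(f"n = {n}, sample mean     = {sample_mean:.2f}")
```

The sample mean will usually be close to, but not exactly equal to, the population mean, which is why it is treated as an estimate.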
Population Sample
Populations are used when your research question requires, or when you have
access to, data from every member of the population. Usually, it is only
straightforward to collect data from a whole population when it is small,
accessible and cooperative.
Data is a record or collection of numbers, characters, images, and other items that
are processed to form information. In statistics, we analyze data to obtain meaningful
information, so categorizing data into different types is very important. Knowing the
data type helps us make an informed decision about which method to use to analyze
the data.
Nominal Data
Nominal data is a type of data that consists of categories or names that cannot be
ordered or ranked. Nominal data is often used to categorize observations into groups,
and the groups are not comparable. In other words, nominal data has no inherent order
or ranking. Examples of nominal data include gender (male, female), race (White,
Black, Asian), religion (Hinduism, Christianity, Islam, Judaism), and blood type (A, B,
AB, O).
Nominal data can be represented using frequency tables and bar charts, which display
the number or proportion of observations in each category. For example, a frequency
table for gender might show the number of males and females in a sample of people.
Nominal data is analyzed using non-parametric tests, which do not make any
assumptions about the underlying distribution of the data. Common non-parametric tests
for nominal data include Chi-Squared Tests and Fisher’s Exact Tests. These tests are
used to compare the frequency or proportion of observations in different categories.
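As a minimal sketch of the frequency table described above, the following counts how often each category appears. The list of blood types is made up for illustration, and the variable names are assumptions.

```python
from collections import Counter

# Hypothetical sample of blood types (nominal data: categories with no order).
blood_types = ["A", "O", "B", "O", "AB", "A", "O", "B", "O", "A"]

counts = Counter(blood_types)  # frequency of each category
total = sum(counts.values())

print("Blood type  Frequency  Proportion")
for category, frequency in counts.most_common():
    print(f"{category:<11} {frequency:<10} {frequency / total:.2f}")
```

The rows are listed from most to least frequent only for readability; since nominal categories have no inherent order, any row order is equally valid.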
Ordinal Data
Ordinal data is a type of data that consists of categories that can be ordered or ranked.
However, the distance between categories is not necessarily equal. Ordinal data is often
used to measure subjective attributes or opinions, where there is a natural order to the
responses. Examples of ordinal data include education level (Elementary, Middle, High
School, College), job position (Manager, Supervisor, Employee), etc.
Ordinal data can be represented using bar charts or line charts. These displays show the
order or ranking of the categories, but they do not imply that the distances between
categories are equal.
Ordinal data is analyzed using non-parametric tests, which make no assumptions about
the underlying distribution of the data. Common non-parametric tests for ordinal data
include the Wilcoxon Signed-Rank test and Mann-Whitney U test.
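The idea that ordinal categories can be ranked, but not measured, can be sketched as follows. The order of the levels comes from the education-level example above; the survey responses are made up, and the names rank, ordered and median_level are assumptions for this illustration.

```python
# Ordered categories taken from the education-level example above.
levels = ["Elementary", "Middle", "High School", "College"]
rank = {level: position for position, level in enumerate(levels)}

# Hypothetical survey responses (made-up data).
responses = ["College", "Middle", "High School", "Elementary",
             "High School", "College", "Middle"]

# Sorting by rank respects the natural order of the categories.
ordered = sorted(responses, key=rank.get)

# With an odd number of responses, the median is the middle item.
# A mean would not be meaningful here, because the distances between
# categories are not known to be equal.
median_level = ordered[len(ordered) // 2]

print(ordered)
print("Median level:", median_level)
```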
Discrete Data
Discrete data is quantitative data that can take only distinct, separate values. These
values can be counted, usually as whole numbers. Examples of discrete data are:
Number of students in a class
Marks of the students in a class test (when only whole marks are awarded)
Number of members in a family, etc.
Continuous Data
Continuous data is quantitative data that can take any value within a range. Between
any two values in the data set, further values are always possible. Examples of
continuous data are:
Temperature
Height of students in a class
Weight of different members of a family
Salary of workers in a factory, etc.
Quantitative data can be shown in numbers and variables like ratio, percentage, and
more, whereas qualitative data could be about the behavioral attributes of a person or
thing.
Discrete data has distinct or different values, whereas continuous data includes every
value within a range.
Scales of Measurement
Data can be classified as being on one of four scales: nominal, ordinal, interval or
ratio.
1. Nominal Scale –
Nominal variables can be placed into categories. These don’t have a numeric value
and so cannot be added, subtracted, divided or multiplied. These also have no order,
and nominal scale of measurement only satisfies the identity property of
measurement.
For example, gender is a variable that is measured on a nominal scale.
Individuals may be classified as “male” or “female”, but neither value represents more
or less “gender” than the other.
2. Ordinal Scale –
The ordinal scale contains things that you can place in order. It measures a variable
in terms of magnitude, or rank. Ordinal scales tell us relative order, but give us no
information regarding differences between the categories. The ordinal scale has the
property of both identity and magnitude.
For example, if Ram takes first place and Vidur takes second place in a race, we know
the order in which they finished but not how many seconds separated them.
3. Interval Scale –
An interval scale has ordered numbers with meaningful divisions: the distances
between consecutive values are equal. Interval scales do not have a true zero,
e.g. 0 degrees Celsius does not mean the absence of heat.
Interval scales have the properties of:
Identity
Magnitude
Equal distance
For example, temperature on a Fahrenheit or Celsius thermometer: 90° is hotter than
45°, and the difference between 10° and 30° is the same as the difference between
60° and 80°.
4. Ratio Scale –
The ratio scale of measurement is similar to the interval scale in that it also
represents quantity and has equality of units with one major difference: zero is
meaningful (no numbers exist below the zero). The true zero allows us to know how
many times greater one case is than another. Ratio scales have all of the
characteristics of the nominal, ordinal and interval scales. The simplest example of a
ratio scale is the measurement of length. Having zero length or zero money means
that there is no length and no money, but a temperature of zero on the Celsius or
Fahrenheit scale is not an absolute zero.
Properties of Ratio Scale:
Identity
Magnitude
Equal distance
Absolute/true zero
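The practical effect of a true zero can be sketched numerically. On an interval scale such as Celsius or Fahrenheit, the ratio of two readings changes with the unit, so "twice as hot" has no meaning; on a ratio scale such as length, the ratio is the same in every unit. The Kelvin conversion is an addition not mentioned in the notes, included only because Kelvin does have a true zero; the function and variable names are assumptions.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

def celsius_to_kelvin(c):
    return c + 273.15

a_c, b_c = 10.0, 20.0  # two temperatures in degrees Celsius

# On an interval scale, ratios depend on the arbitrary zero point of the unit.
ratio_c = b_c / a_c
ratio_f = celsius_to_fahrenheit(b_c) / celsius_to_fahrenheit(a_c)

# Kelvin has a true zero, so it behaves as a ratio scale.
ratio_k = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

print(f"Celsius ratio:    {ratio_c:.3f}")
print(f"Fahrenheit ratio: {ratio_f:.3f}")
print(f"Kelvin ratio:     {ratio_k:.3f}")

# Length is a ratio scale: the ratio is the same in any unit.
length_a_cm, length_b_cm = 50.0, 100.0
ratio_cm = length_b_cm / length_a_cm
ratio_inch = (length_b_cm / 2.54) / (length_a_cm / 2.54)
print(f"Length ratio in cm: {ratio_cm:.3f}, in inches: {ratio_inch:.3f}")
```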
Presentation of Data:
Tabular Form
It is a table that helps to represent even a large amount of data in an
engaging, easy to read, and coordinated manner. The data is arranged in
rows and columns. This is one of the most popularly used forms of
presentation of data as data tables are simple to prepare and read.
Tabulated data can be classified in four ways:
1. Qualitative
2. Quantitative
3. Temporal
4. Spatial
Graphical Form
Line Graphs
A line graph is used to show how the value of a particular variable changes with time.
We plot this graph by connecting the points at different values of the variable. It can be
useful for analyzing the trends in the data and predicting further trends.
Bar Graphs
A bar graph is a type of graphical representation of the data in which bars of uniform width
are drawn with equal spacing between them on one axis (x-axis usually), depicting the
variable. The values of the variables are represented by the height of the bars.
Histograms
A histogram is similar to a bar graph, but it is based on the frequency of numerical
values rather than their actual values. The data is organized into intervals and the bars
represent the frequency of the values in each interval. That is, it counts how many values
of the data lie in a particular range.
Line Plot
It is a plot that displays data as points or check marks above a number line, showing
the frequency of each value.
Stem and Leaf Plot
This is a type of plot in which each value is split into a “leaf” (in most cases, the last
digit) and a “stem” (the remaining digits). For example, the number 42 is split into
leaf (2) and stem (4).
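The stem and leaf split described above can be sketched in a few lines. The scores are made up for illustration and are assumed to be non-negative whole numbers; the name stems is an assumption.

```python
from collections import defaultdict

# Hypothetical two-digit test scores (made-up data).
scores = [42, 47, 35, 51, 58, 42, 36, 49, 53, 31]

stems = defaultdict(list)
for value in sorted(scores):
    stem, leaf = divmod(value, 10)  # 42 -> stem 4, leaf 2
    stems[stem].append(leaf)

print("Stem | Leaf")
for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem:>4} | {leaves}")
```

Sorting the values first keeps the leaves in increasing order within each stem, which is the usual way such plots are drawn.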
For example, let’s say we have a dataset of students’ test scores in a class.
Test Score Frequency
0-20 6
20-40 12
40-60 22
60-80 15
80-100 5
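1. Grouped Frequency Distribution
Class Interval Frequency
10 – 20 5
20 – 30 8
30 – 40 12
40 – 50 6
50 – 60 3
2. Ungrouped Frequency Distribution
Value Frequency
10 4
15 3
20 2
25 3
30 2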
3. Relative Frequency Distribution
This distribution displays the proportion or percentage of observations in each interval
or class. It is useful for comparing different data sets or for analyzing the distribution of
data within a set.
Relative Frequency is given by:
Relative Frequency = (Frequency of Event)/(Total Number of Events)
Example: Make the Relative Frequency Distribution Table for the following data:
Score Range 0-20 21-40 41-60 61-80 81-100
Frequency 5 10 20 10 5
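Solution:
To create the Relative Frequency Distribution table, we need to calculate the Relative
Frequency for each class interval by dividing its frequency by the total frequency, 50.
Thus the Relative Frequency Distribution table is given as follows:
Score Range Frequency Relative Frequency
0-20 5 0.10
21-40 10 0.20
41-60 20 0.40
61-80 10 0.20
81-100 5 0.10
Total 50 1.00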
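The calculation for this example can be sketched as follows, using the score ranges and frequencies given in the example. The variable names are assumptions made for this illustration.

```python
# Class intervals and frequencies from the example above.
score_ranges = ["0-20", "21-40", "41-60", "61-80", "81-100"]
frequencies = [5, 10, 20, 10, 5]

total = sum(frequencies)
# Relative Frequency = (Frequency of Event) / (Total Number of Events)
relative = [f / total for f in frequencies]

print(f"{'Score Range':<12}{'Frequency':<10}{'Relative Frequency'}")
for label, f, r in zip(score_ranges, frequencies, relative):
    print(f"{label:<12}{f:<10}{r:.2f}")
print(f"{'Total':<12}{total:<10}{sum(relative):.2f}")
```

The relative frequencies always add up to 1, which is a quick check that no class has been missed.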
4. Cumulative Frequency Distribution
Less than Type: We sum the frequency of the current interval and all the intervals before it.
More than Type: We sum the frequency of the current interval and all the intervals after it.
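Example: Make the cumulative frequency distribution tables of less than type and more
than type for the following runs scored by Virat Kohli:
45 34 50 75 22
56 63 70 49 33
0 8 14 39 86
92 88 70 56 50
57 45 42 12 39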
Solution:
Since there are a lot of distinct values, we’ll express this in the form of grouped
distributions with intervals like 0-10, 10-20 and so. First let’s represent the data in the
form of grouped frequency distribution.
Runs Frequency
0-10 2
10-20 2
20-30 1
30-40 4
40-50 4
50-60 5
60-70 1
70-80 3
80-90 2
90-100 1
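The grouping step can be sketched as below, using the runs listed in the data. Each interval is taken to include its lower bound and exclude its upper bound, which matches the table (for example, 50 is counted in 50-60). Placing a score of exactly 100 in the last interval is an assumption, since no such score occurs in the data.

```python
# Runs from the data above.
runs = [45, 34, 50, 75, 22, 56, 63, 70, 49, 33, 0, 8, 14, 39, 86,
        92, 88, 70, 56, 50, 57, 45, 42, 12, 39]

width = 10
# One counter per interval 0-10, 10-20, ..., 90-100. The lower bound is
# included and the upper bound excluded, so 50 falls in 50-60.
frequency = [0] * 10
for r in runs:
    index = min(r // width, 9)  # assumption: exactly 100 would go in 90-100
    frequency[index] += 1

print("Runs     Frequency")
for i, f in enumerate(frequency):
    print(f"{i * width}-{(i + 1) * width:<6} {f}")
```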
Now we will convert this frequency distribution into cumulative frequency distribution
by summing up the values of current interval and all the previous intervals.
Runs scored by Virat Kohli Cumulative Frequency
Less than 10 2
Less than 20 4
Less than 30 5
Less than 40 9
Less than 50 13
Less than 60 18
Less than 70 19
Less than 80 22
Less than 90 24
Less than 100 25
This table represents the cumulative frequency distribution of less than type.
Runs scored by Virat Kohli Cumulative Frequency
More than 0 25
More than 10 23
More than 20 21
More than 30 20
More than 40 16
More than 50 12
More than 60 7
More than 70 6
More than 80 3
More than 90 1
This table represents the cumulative frequency distribution of more than type.
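Both cumulative tables can be derived from the grouped frequencies with a running total, as in the sketch below. The frequencies are those of the grouped table for the intervals 0-10 to 90-100; the variable names are assumptions made for this illustration.

```python
from itertools import accumulate

# Grouped frequencies for the intervals 0-10, 10-20, ..., 90-100.
frequency = [2, 2, 1, 4, 4, 5, 1, 3, 2, 1]
upper_bounds = list(range(10, 101, 10))  # 10, 20, ..., 100
lower_bounds = list(range(0, 100, 10))   # 0, 10, ..., 90

# Less than type: running total from the first interval forwards.
less_than = list(accumulate(frequency))

# More than type: running total from the last interval backwards.
more_than = list(accumulate(reversed(frequency)))[::-1]

for bound, cf in zip(upper_bounds, less_than):
    print(f"Less than {bound:<4} {cf}")
for bound, cf in zip(lower_bounds, more_than):
    print(f"More than {bound:<4} {cf}")
```

The last value of the less than type and the first value of the more than type both equal the total number of observations, 25.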