Data Management
Libeeth B. Guevarra
Department of Mathematics and Natural Sciences
Review: Basic
Areas of Statistics
Classification of data:
Qualitative Data (categorical)
Example
Marital Status, Socio-Economic Status, Religious Sector, zip
code, and military rank
Levels of Measurement
Nominal level is characterized by data that consist of
names, labels, or categories only. The data cannot be
arranged in an ordering scheme (such as low to high).
Examples: gender, race, color, and savings account
number.
Ordinal level involves data that may be arranged in some
order, but differences between data values either cannot
be determined or are meaningless.
Examples: socioeconomic status of families, Class
Standing (A to D), and Teacher’s Evaluation (Excellent to
Poor)
Interval level is like the ordinal level, with the additional
property that the difference between any two data values
is meaningful. However, there is no natural zero starting
point (where none of the quantity is present).
Examples: temperature, score in an exam, and IQ.
Ratio level possesses the characteristics of the interval level,
and there exists a true zero. Differences and ratios are
meaningful.
Examples: measures of time or space, such as
height, weight, width, area, age, and also monthly income.
Organizing Data
1 Textual Method
2 Tabular Method
   Parts of a Statistical Table
   1 Table Heading includes the table number and the
     title of the table.
   2 Body is the main part of the table that contains the
     information or figures.
   3 Stubs or Classes are the classifications or categories
     describing the data.
   4 Caption is a designation or identification of the
     information contained in a column.
3 Graphical Method
Categorical Distribution
Twenty-five inductees were given a blood test to determine their
blood type.
A B B AB O
B AB B B B
O A O O O
AB AB A O B
O O O B A
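The tally for this categorical distribution can be sketched in a few lines of Python. This is only an illustration using the standard library's `Counter`; the variable names are assumptions, and the percentages are the values a pie chart of the same data would display.

```python
from collections import Counter

# Blood types of the 25 inductees, row by row as listed above.
blood_types = (
    "A B B AB O "
    "B AB B B B "
    "O A O O O "
    "AB AB A O B "
    "O O O B A"
).split()

counts = Counter(blood_types)
total = len(blood_types)

# Frequency and percentage for each class, most frequent first.
distribution = {
    blood: (freq, 100 * freq / total) for blood, freq in counts.most_common()
}

for blood, (freq, pct) in distribution.items():
    print(f"{blood:>2}  {freq:>2}  {pct:5.1f}%")
```

Reading the output as a table gives the class (blood type), its frequency, and its percentage of the 25 observations.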
Graphical
Pie Chart is used to visually depict qualitative data. It is a circle
divided into sections according to the percentage of
frequencies in each category of the distribution.
Time Series Graph shows data that have been collected at
different points in time.
Pareto Chart is a type of chart that contains both a bar graph and a
line graph, where individual values are represented in descending
order by bars and the cumulative total is represented by the
line.
Central Location
Example
Out of 100 numbers, 20 were 5’s, 40 were 4’s, 35
were 7’s, and 5 were 3’s. What is the mean of the
data set?
$\bar{x} = \frac{\sum_{i=1}^{n} f_i \cdot x_i}{n} = \frac{20(5) + 40(4) + 35(7) + 5(3)}{100} = 5.2$
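The same weighted-mean computation can be checked with a short Python sketch. The dictionary layout and variable names are illustrative assumptions; the values and frequencies are those of the example.

```python
# value -> frequency, taken from the example above
frequencies = {5: 20, 4: 40, 7: 35, 3: 5}

n = sum(frequencies.values())  # total number of observations
weighted_sum = sum(value * freq for value, freq in frequencies.items())

mean = weighted_sum / n
print(n, weighted_sum, mean)
```

Each value is multiplied by how often it occurs, and the total is divided by the number of observations, which is exactly the formula above.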
If n is odd:
$\tilde{x} = x_{\frac{n+1}{2}}$ (4)
If n is even:
$\tilde{x} = \frac{x_{\frac{n}{2}} + x_{\frac{n+2}{2}}}{2}$ (5)
Example
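1 Find the median of: 9, 3, 44, 17, 15
Answer: Sorted, the data are 3, 9, 15, 17, 44, so the median is 15.
2 Find the median of: 8, 3, 44, 17, 12, 6
Answer: Sorted, the data are 3, 6, 8, 12, 17, 44, so the median is (8 + 12)/2 = 10.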
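A minimal Python sketch of the two median formulas follows. The function name is an assumption for illustration; the key step, which the formulas take for granted, is sorting the data first.

```python
def median(values):
    """Median via formulas (4) and (5); the data must be sorted first."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value, x_{(n+1)/2} in 1-based indexing.
        return ordered[mid]
    # Even n: the mean of x_{n/2} and x_{(n+2)/2} in 1-based indexing.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([9, 3, 44, 17, 15]))
print(median([8, 3, 44, 17, 12, 6]))
```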
Percentile Ranking
The pth Percentile
A value x is called the pth percentile of a data set, provided that
p% of the data values are less than x.
$\text{Percentile rank of } x = \frac{(\text{number of data values} < x) + 0.5}{\text{total number of data values}} \cdot 100$
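The percentile-rank formula can be sketched in Python as below. The function name is an assumption, and the data are the ten test scores that appear in the quartile example of these slides, reused here only for illustration.

```python
def percentile_rank(x, data):
    """Percentile rank of x: (number of values below x + 0.5) / total * 100."""
    below = sum(1 for value in data if value < x)
    return (below + 0.5) * 100 / len(data)


# Ten test scores from the quartile example in these slides.
scores = [10, 20, 3, 5, 6, 8, 18, 12, 15, 2]

rank_of_12 = percentile_rank(12, scores)
print(rank_of_12)
```

Six scores (2, 3, 5, 6, 8, 10) are below 12, so the rank is (6 + 0.5)/10 · 100 = 65.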
Quartile Ranking
Quartiles are values that divide a set of data into 4 equal parts,
denoted by Q1, Q2, Q3.
Example
A teacher gives a 20-point test to 10 students. The scores are
as follows: 10, 20, 3, 5, 6, 8, 18, 12, 15 and 2.
Find the quartiles of the given scores.
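One way to work this example is sketched below in Python. The slides do not state which quartile convention is intended, so this sketch assumes the common "median of each half" method (Q2 is the median, Q1 and Q3 are the medians of the lower and upper halves); other conventions can give slightly different values. The function name is illustrative.

```python
from statistics import median


def quartiles(values):
    """Q1, Q2, Q3 using the median-of-halves convention (an assumption)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the middle value belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)


scores = [10, 20, 3, 5, 6, 8, 18, 12, 15, 2]
q1, q2, q3 = quartiles(scores)
print(q1, q2, q3)
```

With the sorted scores 2, 3, 5, 6, 8, 10, 12, 15, 18, 20, this convention gives Q1 = 5, Q2 = 9, and Q3 = 15.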
Dispersion
Things to consider:
mean of A and B
how far are the scores from each other?
how far are the scores from their mean?
Example
A = 5, 5, 5, 5, 5, 5, 5, 5
B = 4, 4, 4, 5, 5, 5, 5, 6, 6, 6
C = 0, 0, 0, 0, 10, 10, 10, 10
D = 5, 7, 10, 11, 11, 15, 16, 20
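The four data sets can be compared with a short Python sketch that reports the mean, the range, and the standard deviation of each. The slides do not say whether the population or the sample standard deviation is intended here; this sketch assumes the population form (`statistics.pstdev`), and the variable names are illustrative.

```python
from statistics import mean, pstdev

datasets = {
    "A": [5, 5, 5, 5, 5, 5, 5, 5],
    "B": [4, 4, 4, 5, 5, 5, 5, 6, 6, 6],
    "C": [0, 0, 0, 0, 10, 10, 10, 10],
    "D": [5, 7, 10, 11, 11, 15, 16, 20],
}

# name -> (mean, range, population standard deviation)
summary = {
    name: (mean(values), max(values) - min(values), pstdev(values))
    for name, values in datasets.items()
}

for name, (avg, spread, sd) in summary.items():
    print(f"{name}: mean={avg:.3f}  range={spread}  std={sd:.3f}")
```

Sets A, B, and C share the same mean of 5 but differ sharply in spread, which is the point of measuring dispersion in addition to central location.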
Example
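The mean time to download a PDF file is 12 min with a standard
deviation of 4 min. Belle's download time is 20 min. John's
download time is 6 min. How does Belle's download time
compare with John's?
$z_{\text{Belle}} = \frac{x - \mu}{\sigma} = \frac{20 - 12}{4} = 2$
$z_{\text{John}} = \frac{x - \mu}{\sigma} = \frac{6 - 12}{4} = -1.5$
Belle's time is 2 standard deviations above the mean, while John's
time is 1.5 standard deviations below the mean.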
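The z-score computation can be verified with a small Python sketch; the function and variable names are assumptions for illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma


mu, sigma = 12, 4  # mean and standard deviation of download time, in minutes

z_belle = z_score(20, mu, sigma)
z_john = z_score(6, mu, sigma)
print(z_belle, z_john)
```

A positive z-score means a value above the mean and a negative one a value below it, so the two download times fall on opposite sides of the mean.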
Frequency Distribution
Occupation               Frequency
Nuns                     17
Nursery teachers          3
Television presenters    23
Students                 20
Other                    17
Example
An ordered array is a listing of values from smallest to largest,
or vice versa.
25 29 30 32 36 36 39 40 40 44
45 48 49 50 50 51 54 55 55 55
55 56 57 57 59 60 60 60 61 61
61 63 65 65 65 67 68 70 71 74
74 76 77 77 80 81 81 83 84 90
$k = 1 + 3.322 \log(50) \approx 6.64$, rounded up to 7
$\text{class width} = \frac{90 - 25}{7} \approx 9.29$, rounded up to 10
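The class count and class width above can be turned into a grouped frequency distribution with the following Python sketch. The data are the 50 values of the ordered array. Two choices here are assumptions not stated in the slides: both results are rounded up, and the first class starts at the minimum value, 25.

```python
import math

data = [
    25, 29, 30, 32, 36, 36, 39, 40, 40, 44,
    45, 48, 49, 50, 50, 51, 54, 55, 55, 55,
    55, 56, 57, 57, 59, 60, 60, 60, 61, 61,
    61, 63, 65, 65, 65, 67, 68, 70, 71, 74,
    74, 76, 77, 77, 80, 81, 81, 83, 84, 90,
]

n = len(data)
k = math.ceil(1 + 3.322 * math.log10(n))        # number of classes
width = math.ceil((max(data) - min(data)) / k)  # class width

lower = min(data)  # assumed lower limit of the first class
classes = []
for i in range(k):
    lo = lower + i * width
    hi = lo + width - 1  # integer data, so class limits are inclusive
    freq = sum(1 for x in data if lo <= x <= hi)
    classes.append((lo, hi, freq))

for lo, hi, freq in classes:
    print(f"{lo}-{hi}: {freq}")
```

The frequencies of the resulting classes add up to 50, which is a quick check that every observation was counted exactly once.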
Graphical
A. Histogram
A histogram is a bar graph in which the horizontal scale represents
classes of data values and the vertical scale represents
frequencies. The heights of the bars correspond to the
frequency values, and the bars are drawn adjacent to each
other (without gaps).
B. Frequency Polygon
Frequency polygon uses line segments connected to points
located directly above class midpoint values.
Shape
Boxplot
A boxplot is also called a box-and-whisker plot. It is a
graphical representation of a summary of five important values:
minimum
first quartile
median
third quartile
maximum value
Example
B1: Construct a boxplot for the given data set:
Number of rooms occupied in a resort during a
10-day period
12 12 13 14 14
16 17 19 19 25
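Before drawing the boxplot, the five summary values are needed. The sketch below computes them in Python for the resort data. It assumes the "median of each half" convention for the quartiles, which the slides do not specify; the function name is illustrative.

```python
from statistics import median


def five_number_summary(values):
    """(minimum, Q1, median, Q3, maximum), with quartiles as medians of halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the middle value is left out of both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


rooms = [12, 12, 13, 14, 14, 16, 17, 19, 19, 25]
minimum, q1, med, q3, maximum = five_number_summary(rooms)
print(minimum, q1, med, q3, maximum)
```

Under this convention the box spans from Q1 = 13 to Q3 = 19 with the median line at 15, and the whiskers extend to 12 and 25.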
Measures of Skewness
Skewness measures the deviation from symmetry.
For a population:
$SK = \frac{3(\mu - \text{median})}{\sigma}$ (10)
For a sample:
$SK = \frac{3(\bar{x} - \text{median})}{s}$ (11)
Example
The scores of the students in the Prelim Exam have a median of
18 and a mean of 16. What does this indicate about the shape
of the distribution of the scores?
Mean < Median, and the standard deviation is always positive, hence
SK will be negative. The distribution is negatively skewed.
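The sign argument can be illustrated with a short Python sketch of formula (11). The mean and median are from the example; the standard deviation is not given in the slides, so the value s = 4 below is purely hypothetical and only serves to show that any positive s yields a negative coefficient.

```python
def skewness_coefficient(mean, median, std):
    """SK = 3 * (mean - median) / std, as in formulas (10) and (11)."""
    if std <= 0:
        raise ValueError("standard deviation must be positive")
    return 3 * (mean - median) / std


# Mean and median come from the Prelim Exam example.
# s = 4 is a HYPOTHETICAL value; the slides do not give the standard deviation.
sk = skewness_coefficient(mean=16, median=18, std=4)
print(sk)
```

Changing the hypothetical s changes the size of SK but never its sign, which is why the example can conclude negative skewness without knowing s.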
Normal Distribution
Example
ND1: A vegetable distributor knows that during the month of
August, the weights of its tomatoes are normally distributed
with a mean of 0.61 lb and a standard deviation of 0.15 lb.
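1 What percent of the tomatoes weigh less than 0.76 lb?
Answer: 0.76 lb is one standard deviation above the mean (z = 1),
so 0.5 + 0.3413 = 0.8413, or about 84%.
2 In a shipment of 6000 tomatoes, how many tomatoes can
be expected to weigh more than 0.31 lb?
Answer: 0.31 lb is two standard deviations below the mean (z = −2),
so by the empirical rule 0.5 + 0.475 = 0.975; 0.975 × 6000 = 5,850 tomatoes.
3 In a shipment of 4500 tomatoes, how many tomatoes can
be expected to weigh from 0.31 lb to 0.91 lb?
Answer: This interval is within two standard deviations of the mean,
so about 95% of 4500 = 4,275 tomatoes.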
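These probabilities can be cross-checked with Python's standard-library `statistics.NormalDist` (available from Python 3.8). This is only a sketch with illustrative variable names. Note that it uses the exact normal curve, whereas answers 2 and 3 above use the empirical rule's rounded 95%, so the exact counts come out slightly higher than 5,850 and 4,275.

```python
from statistics import NormalDist

# Tomato weights: mean 0.61 lb, standard deviation 0.15 lb.
weights = NormalDist(mu=0.61, sigma=0.15)

p_less = weights.cdf(0.76)                         # P(X < 0.76)
p_more = 1 - weights.cdf(0.31)                     # P(X > 0.31)
p_between = weights.cdf(0.91) - weights.cdf(0.31)  # P(0.31 < X < 0.91)

print(f"less than 0.76 lb:       {p_less:.4f}")
print(f"more than 0.31 lb:       {p_more:.4f} -> about {p_more * 6000:.0f} of 6000")
print(f"between 0.31 and 0.91 lb: {p_between:.4f} -> about {p_between * 4500:.0f} of 4500")
```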
$\varphi(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} z^2}$
ND3:
1 Find a z-score such that 10 percent of the area
under the standard normal curve is above that
score.
Answer: z = 1.28
2 Find a z-score such that 24 percent of the area
under the standard normal curve is below that
score.
Answer: z = −0.71
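Both z-scores can be recovered with the inverse cumulative distribution function of the standard normal curve. The sketch below uses `statistics.NormalDist.inv_cdf` from the Python standard library (Python 3.8+); the variable names are assumptions.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# 10 percent of the area above z means 90 percent of the area below z.
z_upper_10 = standard_normal.inv_cdf(1 - 0.10)

# 24 percent of the area below z.
z_lower_24 = standard_normal.inv_cdf(0.24)

print(round(z_upper_10, 2), round(z_lower_24, 2))
```

The only subtlety is the first case: the table and `inv_cdf` both work with area to the left, so an area of 0.10 above the score is looked up as 0.90 below it.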
Correlation and Regression
The scatter plot shows that the relationship between stress test score
and blood pressure is linear.
Example
Compute the correlation coefficient for the data:
$r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}} = \frac{8(49628) - (588)(665)}{\sqrt{[8(44550) - (588)^2][8(55799) - (665)^2]}}$
$r = 0.9010$
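The arithmetic behind this value of r can be checked with a short Python sketch that starts from the summary sums shown in the formula. The raw data table is not reproduced in this text, so only the sums are used; the variable names are illustrative.

```python
import math

# Summary sums as they appear in the worked formula above.
n = 8
sum_x, sum_y = 588, 665
sum_xy = 49628
sum_x2, sum_y2 = 44550, 55799

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

r = numerator / denominator
print(round(r, 4))
```

A value of r this close to 1 indicates a strong positive linear relationship between the two variables.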
y − ȳ = m(x − x̄)
where:
x̄ = mean of variable x
ȳ = mean of variable y
m = slope of the line
$m = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n(\bar{x})^2}$
Example
Find the equation of the regression line for the data. (The data
show a linear relationship, as seen in the scatter plot.)
$y - \bar{y} = m(x - \bar{x})$ and
$m = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n(\bar{x})^2} = \frac{49628 - 8(73.5)(83.125)}{44550 - 8(73.5)^2} = 0.56344$
$y - 83.125 = 0.56344(x - 73.5)$
$y = 0.56344x + 41.71$, or we may have
$y = 0.56x + 41.71$
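The slope and intercept can be reproduced from the summary sums with the Python sketch below. The sums of x and y (588 and 665) are the ones used in the correlation computation in these slides, which give the means 73.5 and 83.125; the variable names and the `predict` helper are illustrative assumptions.

```python
# Summary sums for the 8 data pairs, as used in the slides.
n = 8
sum_x, sum_y = 588, 665
sum_xy, sum_x2 = 49628, 44550

x_bar = sum_x / n  # 73.5
y_bar = sum_y / n  # 83.125

# Slope from the formula m = (sum xy - n*x_bar*y_bar) / (sum x^2 - n*x_bar^2).
m = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)

# Rearranging y - y_bar = m(x - x_bar) gives the intercept b = y_bar - m*x_bar.
b = y_bar - m * x_bar


def predict(x):
    """Predicted y on the regression line for a given x."""
    return m * x + b


print(round(m, 5), round(b, 2))
```

Because the line is written in point-slope form through the means, `predict(x_bar)` returns the mean of y, which is a handy sanity check on the computed coefficients.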