
Data Collection & Analysis

The document discusses the importance of data collection and analysis in business research, emphasizing that accurate and reliable data is crucial for effective decision-making. It outlines the research process, classification of data, methods of data collection, and the distinction between primary and secondary data. Additionally, it covers measures of central tendency, specifically the arithmetic mean, and its advantages and disadvantages in statistical analysis.

Uploaded by

Rifat Hoque
Copyright
© © All Rights Reserved

Data Collection and Analysis in Business Research

Introduction: The word research originates from the French word ‘recerchier’, which means
‘search again’. It assumes that the earlier search was not exhaustive and complete, and hence that a
repeated search is called for. Today, research means a scientific and systematic investigation or
inquiry that searches for new facts in any branch of knowledge. In short, research is the search for
knowledge through an objective and systematic method of finding a solution to a problem.
More broadly, research seeks knowledge for the development and better future of all living
beings. Data are the raw materials of all types of research.
To make a decision in any business situation we need data. Facts expressed in quantitative
form can be termed data. The collection of data and their statistical analysis are two important
operations in any research. Investigators generate data through some process of measurement,
counting or observation. The success of any statistical investigation depends on the availability of
accurate and reliable data, which in turn depends on the appropriateness of the method chosen for data
collection. Therefore, data collection is a very basic activity in decision-making. This chapter
first gives a brief critical overview of the important data collection methods and then
presents the important statistical methods used to analyse the collected data.
Research Process:
The research process consists of a series of actions or steps necessary to carry out research effectively,
together with the desired sequencing of these steps. The important steps of the research process are:
1. Identification of the problem
2. Literature review
3. Objective and developing hypothesis
4. Research design
5. Sample design
6. Collection of data
7. Analysis of data
8. Report writing.
Data: The raw materials of statistics consist of numbers or observations, usually obtained by
some process of counting or measurement; collectively they are referred to as data. Thus, ‘A set of
observations is called data’.
Classification of Data: Data can be classified in a number of ways. Now we can classify data
according to some distinct criteria:
1. Data according to origin: (a) Population data (b) Sample data.
2. Data according to variable: (a) Qualitative (categorical) data (b) Quantitative data.
Again quantitative data can be classified as (i) Discrete data (ii) Continuous data
3. Data according to time: (a) Time series data (b) Cross-section data (c) Panel data
4. Data according to scale of measurement: (a) Nominal data (b) Ordinal data (c) Interval
data (d) Ratio data.
5. Data according to subject (Discipline): Data are named according to different disciplines:
(a) Economic data (b) Agriculture data (c) Medical data (d) Business data (e) Meteorological
data (f) Import data (g) Export data etc.
Population data: A census inquiry requires population data. In this case we have to take
observations from all the experimental units of the population. Such data are also called census
data. Population censuses, agriculture censuses and animal censuses are some
examples where we study the population as a whole. A census inquiry involves a great deal of time,
money and energy. Sometimes experimental units may be destroyed while the observation is being
taken. Blood testing, bulb testing, rice testing etc. are some examples where experimental
units are destroyed. In all these cases a sample inquiry is the appropriate method for collecting data.
Sample data: When we take observations from the sampled experimental units under study we get
sample data. Sample data play a vital role in inferential statistics.
Qualitative data: In certain statistical investigations we are concerned only with the presence or
absence of some characteristic in a set of objects or individuals. This type of data is called
qualitative data. The characteristic used to classify an individual into different categories is called
an attribute. For example, honesty, sex, religion etc. constitute qualitative data.
Quantitative data: When the characteristic of interest is expressed numerically for each object or
individual in a set, the data are called quantitative data. For example, the export values of
Bangladesh from 1972 to 2012 constitute quantitative data.
Cross-sectional data: Data that are observed at one point in time are referred to as cross-sectional
data. The number of students of different universities for the year 2012 constitutes
cross-sectional data. The salaries of the workers of a factory in a particular month, the heights of the students
of a class at a particular time, the weights of the patients of a clinic at a particular time etc. are
examples of cross-section data.
Time series data: A set of figures observed over a period of time is called time series data.
Although it is not essential, it is common for these points to be equidistant in time. The GDP of
Bangladesh corresponding to different years constitutes time series data. The year-wise
production of a firm for the last 15 years, the month-wise prices of potatoes for the last six years etc. are
examples of time series data.
Panel Data: The combination of cross-section and time series data is called panel data. Year-wise
prices of different foodstuffs for the last ten years are an example of panel data. The year-wise
GDP growth rates of different countries for 1972-2012, laid out as in the following table
(values not shown), are another example.
Year / Country    Bangladesh    India    Nepal    Pakistan    Bhutan
1972
1973
1974
1975
...
2012

Experimental data: Data collected by experiment, as in the natural sciences, are called
experimental data. Sometimes investigators want to find the effect of
some factors on a given phenomenon while the effects of other factors are held constant; data
collected in this way are called experimental data. For example, to find the impact of iodine on
arsenic, doctors collect data while holding the eating, smoking, and drinking habits of the people
constant.
Sources of Data: The collection of data may range from a simple observation to a large scale
survey in any defined population. The tools and techniques to be employed to collect data depend
largely on the objectives of the study, the research design and the availability of time and money.
In any field of inquiry data can be collected from two sources, namely (i) primary sources and (ii)
secondary sources. The data collected from these sources are known as primary data and
secondary data respectively.
Primary Data: The primary data are those which are collected afresh and for the first time, and
thus happen to be original in character. Primary data are published by authorities who themselves
are responsible for their collection.
Methods of Collecting Primary Data: Questioning and observations are the two basic methods
of collecting primary data. In the observation method, the investigator asks no questions, but he
simply observes the phenomenon under consideration and records the necessary data. Sometimes
individuals make the observation, on other occasions, mechanical and electronic devices do the
job. In the observation method, it may be difficult to produce accurate data.
On the other hand, questioning as the name suggests is distinguished by the fact that data
are collected by asking questions of people who are thought to have the desired information.
Questions may be asked in person or in writing. A formal list of such questions is called a
questionnaire. Three different methods of communication with questionnaires are available: (a)
personal interview, (b) telephone and (c) mail. Personal interviews are those in which an
interviewer obtains information from respondents in face-to-face meetings. Telephone interviews
are similar except that communication between interviewer and respondent is via telephone
instead of direct personal contact. In most mail surveys, questionnaires are mailed to respondents
who also return them by mail.
Designing a Good Questionnaire:
(a) Number of questions should be kept to the minimum.
(b) Questions should be simple, short, and unambiguous.
(c) Questions of sensitive or personal nature should be avoided.
(d) Answers to questions should not require calculations.
(e) Questions should be capable of an objective answer.
(f) Questions should be arranged logically.
(g) Proper words should be used in the questionnaire.
(h) Questionnaire should look attractive.
(i) Questionnaire should be pre-tested to find out its shortcomings if any.
(j) Cross-Check and footnotes should be considered in the questionnaire.
(k) Necessary instructions should be given to the informant.
Editing of Primary Data: Editing involves reviewing the data collected by investigators to
ensure maximum accuracy and unambiguity. It should be done as soon as possible after the data
have been collected. If the size of the data is relatively small, it is desirable that only one person
edit all the data for the entire study. The different steps of editing are discussed below:
1. Checking legibility: Obviously, the data must be legible to be used. If a response is not
presented clearly, the concerned investigator should be asked to rewrite it.
2. Checking Completeness: An omitted entry on a fully structured questionnaire may mean
that no attempt was made to collect data from the respondent or that the investigator
simply did not record the data. If the investigator did not record the data, prompt editing
and questioning of the investigator may provide the missing item. If an entry is missing
because of the first possible cause, there is not much that can be done, except to make
another attempt to get the missing data. Obviously, this requires knowing why the entry is
missing.
3. Checking Consistency: The editor should examine each questionnaire to check
inconsistency or inaccuracy if any, in the statement. The income and expenditure figures
may be unduly inconsistent. The age and the date of birth may disagree. The area of an
agricultural plot may be unduly large. The concerned investigators should be asked to
make the necessary corrections. If there is any repetitive response pattern in the reports of
individual investigators they may represent investigator bias or perhaps attempted
dishonesty.
Secondary Data: When an investigator uses data which have already been collected
and processed by others to satisfy their own needs, such data are called secondary
data. Such data are primary for the agency that collected them and become secondary for
anyone else who uses them for his own purposes. Secondary data are available in various
published and unpublished documents. Generally, secondary data are obtained from books,
magazines, newspapers, journals, reports, government and international publications, publications of
research organizations and professional bodies etc. The suitability, reliability, adequacy and
accuracy of the secondary data should be ensured before they are used for the study.
Scrutiny of Secondary Data: Primary data are to be scrutinized after the questionnaires are
completed by the interviewers. Likewise, secondary data are to be scrutinized before they are
compiled from the source. The scrutiny should assess the suitability, reliability,
adequacy and accuracy of the data to be compiled and used for the proposed study.
1. Suitability: The compiler should satisfy himself that the data contained in the publication
will be suitable for his study. In particular, the conformity of the definitions, units of
measurement and time frame should be checked. For example, one US gallon is different
from one British gallon.
2. Reliability: The reliability of the secondary data can be ascertained from the collecting
agency, mode of collection and the time period of collection. For instance, secondary data
collected by a voluntary agency with unskilled investigators are unlikely to be reliable.
3. Adequacy: The source of data may be suitable and reliable but the data may not be
adequate for the proposed enquiry. The original data may cover a bigger or narrower
geographical region or the data may not cover suitable periods. For instance, per capita
income of Pakistan prior to 1971 is inadequate for reference during the subsequent periods
as it became separated into two different countries with considerable variation in standard
of living.
4. Accuracy: The user must be satisfied about the accuracy of the secondary data. The
process of collecting raw data, the reproduction of processed data in the publication, the
degree of accuracy desired and achieved should also be satisfactory and acceptable to the
researcher.

DESCRIBING DATA: NUMERICAL MEASURES


MEASURES OF LOCATION / CENTRAL TENDENCY
According to Professor Bowley, averages or measures of central tendency are “statistical
constants which enable us to comprehend in a single effort the significance of the whole”. They
give us an idea of where the observations are concentrated. The tendency of the observations of a
distribution to concentrate around a central value is known as central tendency, and a measure of
that central value is known as a measure of central tendency. The various measures of central
tendency are discussed below:

Measures of Central Tendency: Mean, Median, Mode
Types of Mean: Arithmetic Mean, Geometric Mean, Harmonic Mean

Requisites for an ideal Measure of central tendency:


According to Professor Yule, the following are the characteristics to be satisfied by an ideal
measure of central tendency. The measure should be;
i) Rigidly defined
ii) Readily comprehensible and easy to calculate.
iii) Based on all observations.
iv) Suitable for further mathematical treatment
v) Less affected by sampling fluctuations and
vi) Not affected by extreme values
The purpose of a measure of location is to pinpoint the center of a set of observations.

Measure of location: A single value that summarizes a set of data. It locates the center of the
values.

3.1.1 ARITHMETIC MEAN


The arithmetic mean, or simply mean, is the most widely used measure of location.

Arithmetic mean: The sum of observations divided by the total number of observations.
The mean is calculated as follows:

$$\text{Mean} = \frac{\text{Sum of all the observations}}{\text{Number of observations}}$$

In terms of symbols, the formula for the arithmetic mean of a population is:

Population Mean: $$\mu = \frac{\sum X}{N}$$ [3-1]
Where:
μ is the population mean.
N is the number of items in the population.
X is a particular value.
∑ indicates the operation of adding all the values. It is pronounced “sigma.”
∑X is the sum of the X values. It is pronounced “sigma X.”
[3-1] indicates the formula number from the text.

Any measurable characteristic of a population is called a parameter.

Parameter: A characteristic of a population. In other words, it is an unknown constant of the
population.

The Sample Mean


As explained in Chapter 1, we frequently select a sample from the population to find out something
about a specific characteristic of the population.
The mean of a sample and the mean of a population are computed in the same way, but the shorthand
notation is different.
In terms of symbols, the formula for the mean of a sample is:

Sample Mean: $$\bar{X} = \frac{\sum X}{n}$$ [3-2]
Where:
X̄ is the sample mean; it is read “X bar.”
n is the number of values in the sample.
X is a particular value.
∑ indicates the operation of adding all the values.
∑X is the sum of the X values.
[3-2] is the formula number from the text.

The mean of a sample, or any other measure based on sample data, is called a statistic.
Statistic: A characteristic of a sample. Any function of sample observations.
“The mean weight of a sample of laptop computers is 15 pounds,” is an example of a statistic.
Note that in both of the above formulas the mean is calculated by summing the observations and dividing
by the total number of observations.
As an example, the Kellogg Company had quarterly earnings per share of $0.89, $0.77, $1.05, $0.79, and
$0.95. The mean is found by:

$$\bar{X} = \frac{\sum X}{n} = \frac{\$0.89 + \$0.77 + \$1.05 + \$0.79 + \$0.95}{5} = \frac{\$4.45}{5} = \$0.89$$

The mean quarterly earnings per share is $0.89. In some situations the mean may not be representative of
the data.
As an example, the annual salaries of five executives are $40,000, $42,000, $44,000, $48,000, and
$300,000. The mean is:

$$\bar{X} = \frac{\$40{,}000 + \$42{,}000 + \$44{,}000 + \$48{,}000 + \$300{,}000}{5} = \frac{\$474{,}000}{5} = \$94{,}800$$

Notice how the one extreme value ($300,000) pulled the mean upward. Four of the five executives
earned less than the mean, raising the question of whether the arithmetic mean value of $94,800 is typical
of the salary of the five executives.
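
The two examples above can be checked with a few lines of Python. This is only an illustrative sketch; the function name `arithmetic_mean` is an assumption introduced here, not something from the text.

```python
def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    if not values:
        raise ValueError("at least one observation is required")
    return sum(values) / len(values)


# Data quoted in the text above.
earnings = [0.89, 0.77, 1.05, 0.79, 0.95]
salaries = [40_000, 42_000, 44_000, 48_000, 300_000]

mean_eps = arithmetic_mean(earnings)
mean_salary = arithmetic_mean(salaries)

# Count how many executives earn less than the mean salary.
below_mean = sum(1 for s in salaries if s < mean_salary)

print(round(mean_eps, 2), mean_salary, below_mean)
```

The count of salaries below the mean makes the effect of the single extreme value visible without any plotting.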

Advantages of Arithmetic Mean:


Arithmetic mean is
i) Rigidly defined
ii) Easy to understand and easy to calculate.
iii) Based on all observations.
iv) Suitable for further mathematical treatment
v) Least affected by sampling fluctuations

Disadvantages of Arithmetic Mean:


i) It cannot be determined by inspection or cannot be located graphically
ii) It cannot be used when dealing with qualitative characteristics that cannot be
measured quantitatively, such as intelligence, honesty, beauty, etc.
iii) It cannot be obtained if a single observation is missing or lost
iv) It cannot be calculated if the extreme class is open
v) It is affected very much by extreme values.

Properties of the Arithmetic Mean


As stated, the mean is a widely used measure of location. It has several important properties.
1. Every set of interval level and ratio level data has a mean.
2. All the data values are used in the calculation.
3. A set of data has only one mean, that is, the mean is unique.
4. The mean is a useful measure for comparing two or more populations.
5. The sum of the deviations of each value from the mean will always be zero, that is:

$$\sum (X - \bar{X}) = 0$$

6. Mean of composite series: If X̄i (i = 1, 2, ..., k) are the means of k component series of sizes
ni (i = 1, 2, ..., k) respectively, then the mean X̄ of the composite series obtained by combining
the component series is given by the formula:

$$\bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2 + \cdots + n_k \bar{X}_k}{n_1 + n_2 + \cdots + n_k} = \frac{\sum_{i=1}^{k} n_i \bar{X}_i}{\sum_{i=1}^{k} n_i}$$
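
Properties 5 and 6 can be verified numerically. The sketch below uses two small hypothetical component series chosen purely for illustration; the helper names `mean` and `combined_mean` are assumptions of this example.

```python
import math


def mean(values):
    return sum(values) / len(values)


def combined_mean(sizes, means):
    """Mean of the composite series: sum(n_i * mean_i) / sum(n_i)."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)


# Hypothetical component series (not from the text).
series_a = [2, 4, 6]
series_b = [10, 20]
pooled = series_a + series_b

# Property 5: deviations from the mean sum to zero (up to rounding error).
pooled_mean = mean(pooled)
deviation_sum = sum(x - pooled_mean for x in pooled)

# Property 6: the combined mean equals the mean of the pooled observations.
composite = combined_mean(
    [len(series_a), len(series_b)],
    [mean(series_a), mean(series_b)],
)

print(math.isclose(deviation_sum, 0.0, abs_tol=1e-9), composite, pooled_mean)
```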

3.1.2 GEOMETRIC MEAN


The geometric mean is used to determine the mean percent increase from one period to another. It is
also used in finding the average of ratios, indexes, and growth rates.

Geometric mean: The nth root of the product of n values.

The formula for finding the geometric mean is:

$$GM = \sqrt[n]{X_1 \times X_2 \times X_3 \times \cdots \times X_n}$$


Where:
X1, X2, etc. are data values.
n is the number of values.
The radical sign with index n denotes the nth root.

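The geometric mean can be used for averaging percents. Suppose the return on investment for
McDermoll International for the past 4 years is 0.4%, 2.9%, 2.1%, and 12.3%. The GM increase over the
period is 4.3 percent, found by taking the geometric mean of the growth factors (1 plus each return):

$$GM = \sqrt[4]{(1.004)(1.029)(1.021)(1.123)} \approx 1.043$$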
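
The 4.3 percent figure can be reproduced as follows. This is a sketch that assumes the returns are first converted to growth factors (1 plus the return as a fraction) before the nth root is taken; `geometric_mean` is a name introduced here for illustration.

```python
import math


def geometric_mean(values):
    """nth root of the product of n non-negative values."""
    if not values or any(v < 0 for v in values):
        raise ValueError("the geometric mean needs non-negative values")
    return math.prod(values) ** (1 / len(values))


# Returns quoted in the text, in percent.
returns_pct = [0.4, 2.9, 2.1, 12.3]

# Assumption of this sketch: average the growth factors, not the raw percents.
growth_factors = [1 + r / 100 for r in returns_pct]

gm = geometric_mean(growth_factors)
avg_increase_pct = (gm - 1) * 100

print(round(avg_increase_pct, 1))
```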

Advantages of Geometric Mean:


i) Rigidly defined
ii) Based on all observations.
iii) Suitable for further mathematical treatment
iv) Less affected by sampling fluctuations and
v) Gives comparatively more weight to small items

Disadvantages of Geometric Mean:


i) It is not easy for a non-mathematical student to understand and calculate.
ii) If any observation is zero, the geometric mean becomes zero.
iii) If any observation is negative, the GM may become imaginary or lose its meaning.

Uses of GM:
i) To find the rate of population growth and the rate of interest
ii) In the construction of index number

3.1.3 HARMONIC MEAN

The harmonic mean of a number of observations is the reciprocal of the AM of the reciprocals of the given
values:

$$HM = \frac{n}{\frac{1}{X_1} + \frac{1}{X_2} + \frac{1}{X_3} + \cdots + \frac{1}{X_n}}$$

Advantages of Harmonic Mean:


i) Rigidly defined
ii) Based on all observations.
iii) Suitable for further mathematical treatment
iv) Less affected by sampling fluctuations and
v) Gives greater importance to small items and is useful only when small items have to be
given a very high weight.

Disadvantages of Harmonic Mean:


i) It is not easy to understand and is difficult to calculate.
ii) If any observation is zero, the harmonic mean is undefined, because the reciprocal of zero does not exist.

Uses of HM:
It is used for calculating average speed of automobiles.
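
As an illustration of the average-speed use, consider a hypothetical car that covers the same distance at 40 km/h and then at 60 km/h (figures invented for this sketch). The harmonic mean gives the true average speed over the whole trip; `harmonic_mean` is a name assumed here.

```python
def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    if not values or any(v == 0 for v in values):
        raise ValueError("the harmonic mean is undefined for empty data or zeros")
    return len(values) / sum(1 / v for v in values)


# Hypothetical speeds over two equal distances (km/h).
speeds = [40, 60]

avg_speed = harmonic_mean(speeds)
simple_average = sum(speeds) / len(speeds)

print(round(avg_speed, 2), simple_average)
```

The harmonic mean comes out lower than the simple average of 50 km/h because more time is spent travelling at the slower speed.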
Weighted Mean
The weighted mean is a special case of the arithmetic mean. It is often useful when there are several
observations of the same value.

Weighted mean: The value of each observation is multiplied by the number of times it occurs.
The sum of these products is divided by the total number of observations to give the weighted
mean.
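In general, the weighted mean of a set of numbers, designated X1, X2, X3, ..., Xn, with the corresponding
weights w1, w2, w3, ..., wn, is computed by:

Weighted Mean: $$\bar{X}_w = \frac{w_1 X_1 + w_2 X_2 + w_3 X_3 + \cdots + w_n X_n}{w_1 + w_2 + w_3 + \cdots + w_n}$$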
The weighted mean is particularly useful when various classes or groups contribute differently to the
total. For example, the coronary care unit of a hospital consists of nurses' aides who are paid $12 per
hour, nurses' assistants who earn $15 per hour, and registered nurses who earn $21 per hour.
It would not be accurate to say that the average hourly wage for the coronary unit is $16 per hour, that is
($12 + $15 + $21) / 3, unless there were the same number of people in each group.
Suppose the coronary care unit has ten employees: two aides who earn $12 per hour, three nurses'
assistants who earn $15 per hour, and five registered nurses who earn $21 per hour. The weighted mean
is:

$$\bar{X}_w = \frac{2(\$12) + 3(\$15) + 5(\$21)}{2 + 3 + 5} = \frac{\$174}{10} = \$17.40$$

Thus the weighted mean is $17.40.
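
The coronary care unit example can be reproduced in a short sketch; `weighted_mean` is a name chosen here for illustration.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hourly wages and head counts quoted in the text.
wages = [12, 15, 21]
staff = [2, 3, 5]

unit_mean = weighted_mean(wages, staff)
unweighted = sum(wages) / len(wages)

print(unit_mean, unweighted)
```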

3.1.4 THE MEDIAN


It was pointed out that the arithmetic mean is often not representative of data with extreme values. The
median is a useful measure when we encounter data with an extreme value.
Median: The midpoint of the values after all observations have been ordered from the
smallest to the largest or from the largest to the smallest.

Fifty percent of the observations are above the median and 50 percent are below the median. To
determine the median, the values are ordered from low to high, or high to low, and the middle value
selected. Hence, half the observations are above the median and half are below it. For the executive
incomes, the middle value is $44,000, the median.
$40,000 $42,000 $44,000 $48,000 $300,000
(the middle value, $44,000, is the median)
Obviously, the median is a more representative value in this problem than the mean of $94,800.
Note that there were an odd number of executive incomes (5). For an odd number of ungrouped values
we just order them and select the middle value. To determine the median of an even number of
ungrouped values, the first step is to arrange them from low to high as usual, and then determine the
value half way between the two middle values.
As an example, the final grades of the six students in Mathematics 126 were 87, 62, 91, 58, 99, and 85.
Ordering these from low to high:

58 62 85 87 91 99
(the two middle values are 85 and 87)

The median grade is halfway between the two middle values of 85 and 87. The median grade is 86. Thus
we note that the median (86) may not be one of the values in a set of data.
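
Both cases, an odd and an even number of ungrouped values, can be handled by one small function. The sketch below is illustrative only; `median` is written out by hand so that each step of the procedure described above is visible.

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("at least one observation is required")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Data quoted in the text.
salaries = [40_000, 42_000, 44_000, 48_000, 300_000]
grades = [87, 62, 91, 58, 99, 85]

median_salary = median(salaries)
median_grade = median(grades)

print(median_salary, median_grade)
```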
The formula for finding the median of grouped data is given below:

$$\text{Median} = L_1 + \frac{\frac{N}{2} - C_f}{f_m} \times i$$

Where:
L1 is the lower limit of the median class.
N is the total number of observations.
Cf is the cumulative frequency of the class just preceding the median class.
fm is the frequency of the median class.
i is the width of the median class.
Advantages of Median:
i) Well defined
ii) Readily comprehensible and easy to calculate.
iii) Not affected by extreme values
iv) Can be calculated for a distribution when extreme class is open

Disadvantages of Median:
i) Not based on all observations.
ii) Not suitable for further mathematical treatment
iii) As compared to AM it is affected much by sampling fluctuations
Uses of Median:
i) It is the only average to be used while dealing with qualitative data, which cannot be
measured quantitatively but still can be arranged in ascending or descending order of
magnitude. e.g., to find the average intelligence, or average honesty among a group of
people.
ii) It is used to determine the typical value in problems concerning wages,
distribution of wealth, etc.
Properties of the Median
The major properties of the median are:
1. The median is a unique value, that is, like the mean, there is only one median for a set of data.
2. It is not influenced by extremely large or small values.
3. It can be computed for ratio level, interval level, and ordinal-level data.
4. Fifty percent of the observations are greater than the median and fifty percent of the
observations are less than the median.

3.1.5 THE MODE


A third measure of central tendency is the mode.
Mode: The value of the observation that appears most frequently.
The mode is the value that occurs most often in a set of raw data. The dividends per share declared on
five stocks were: $3, $2, $4, $5, and $4. Since $4 occurred twice, which was the most frequent, the mode
is $4.
Below is the formula for calculating the mode from grouped data:

$$\text{Mode} = L_1 + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times i$$

Where:
L1 is the lower limit of the modal class.
Δ1 is the difference between the frequency of the modal class and the frequency of the class just
preceding the modal class.
Δ2 is the difference between the frequency of the modal class and the frequency of the class just
succeeding the modal class.
i is the width of the modal class.
Advantages of Mode:
i) Rigidly defined
ii) Readily comprehensible and easy to calculate.
iii) Not affected by extreme values
Disadvantages of Mode:
i) Not based on all observations.
ii) Ill-defined: a set of data may have no mode or more than one mode
iii) Not suitable for further mathematical treatment
iv) As compared to AM it is affected much by sampling fluctuations

Properties of the Mode


i) The mode can be found for all levels of data (nominal, ordinal, interval, and ratio).
ii) The mode is not affected by extremely high or low values.
iii) A set of data can have more than one mode. If it has two modes, it is said to be bimodal.
iv) A disadvantage is that a set of data may not have a mode because no value appears more than
once.
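
The properties above (one mode, several modes, or no mode at all) can be illustrated with a short sketch. The dividends are taken from the text; the other two data sets are hypothetical, and returning an empty list when no value repeats is a convention assumed here.

```python
from collections import Counter


def modes(values):
    """Return every value with the highest frequency, or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


dividends = [3, 2, 4, 5, 4]      # from the text
bimodal_data = [1, 1, 2, 2, 3]   # hypothetical
no_repeat_data = [1, 2, 3]       # hypothetical

print(modes(dividends), modes(bimodal_data), modes(no_repeat_data))
```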

Other Location Measures:


The median and the mode are two widely used location measures. In addition to them, there are
other location measures. They are:
i) Quartiles, ii) Deciles, iii) Percentiles

Solved Problems

Problem 1: The monthly income (in Tk.) of 10 persons working in a firm is as follows:
1500, 1600, 1800, 1700, 1600, 1200, 1500, 2000, 1500, 1800. Calculate the Arithmetic Mean.

Solution:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{16200}{10} = 1620$$

Problem 2: Calculate the mean from the following data:


Record Numbers 1 2 3 4 5 6 7 8 9 10
Marks 40 50 55 78 58 60 73 35 43 48
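Solution:
R. Numbers Marks
1 40
2 50
3 55
4 78
5 58
6 60
7 73
8 35
9 43
10 48
Total ∑x = 540

We know that,

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{540}{10} = 54$$

Alternatively, taking A = 50 as an assumed mean and d = x − A as the deviation of each mark from A, we get ∑d = 40, so

$$\bar{x} = A + \frac{\sum d}{n} = 50 + \frac{40}{10} = 54$$
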
Problem 3: Calculate the mean from the following data:
Value 1 2 3 4 5 6 7 8 9 10
Frequency 21 30 28 40 26 34 40 9 15 57
Solution:
x f fx
1 21 21
2 30 60
3 28 84
4 40 160
5 26 130
6 34 204
7 40 280
8 9 72
9 15 135
10 57 570
Total ∑f = 300 ∑fx = 1716

We know that, with n = ∑f = 300,

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{1716}{300} = 5.72$$

Problem 4: Calculate the mean profits from the following data:


Profits per shop Number of Shops
100-200 10
200-300 18
300-400 20
400-500 26
500-600 30
600-700 28
700-800 18

Solution:
Profits per shop Mid-point Number of Shops(f) fm
100-200 150 10 1500
200-300 250 18 4500
300-400 350 20 7000
400-500 450 26 11700
500-600 550 30 16500
600-700 650 28 18200
700-800 750 18 13500
Total ∑f = 150 ∑fm = 72900

We know that

$$\bar{x} = \frac{\sum f m}{\sum f} = \frac{72900}{150} = 486$$

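
The calculations in Problems 3 and 4 follow the same pattern and can be sketched with one function. The name `frequency_mean` is an assumption of this example; for grouped data the class mid-points stand in for the individual values, as in the solution above.

```python
def frequency_mean(points, frequencies):
    """Mean of a frequency distribution: sum(f * x) / sum(f)."""
    total = sum(frequencies)
    if total == 0:
        raise ValueError("the frequencies must not sum to zero")
    return sum(f * x for x, f in zip(points, frequencies)) / total


# Problem 3: discrete values with their frequencies.
values = list(range(1, 11))
value_freqs = [21, 30, 28, 40, 26, 34, 40, 9, 15, 57]
mean_value = frequency_mean(values, value_freqs)

# Problem 4: class intervals represented by their mid-points.
classes = [(100, 200), (200, 300), (300, 400), (400, 500),
           (500, 600), (600, 700), (700, 800)]
shops = [10, 18, 20, 26, 30, 28, 18]
midpoints = [(low + high) / 2 for low, high in classes]
mean_profit = frequency_mean(midpoints, shops)

print(mean_value, mean_profit)
```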
Problem 5: Compute the median from the following series:
Daily Savings Number of Workers
30-35 3
36-41 10
42-47 18
48-53 25
54-59 8
60-65 6
Solution:
Daily Savings Number of Workers Cumulative frequency
30-35 3 3
36-41 10 13
42-47 18 31
48-53 25 56
54-59 8 64
60-65 6 70
Total 70
Here n = 70, so n/2 = 35 and the median item lies in the class 48-53. Therefore L = 48, fc = 31,
fm = 25 and i = 6.

We know that

$$\text{Median} = L + \frac{\frac{n}{2} - f_c}{f_m} \times i = 48 + \frac{35 - 31}{25} \times 6 = 48.96$$
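
The grouped-median calculation of Problem 5 can be sketched as below. The function name `grouped_median` and its input layout are assumptions of this example. It uses the stated lower class limit as L, exactly as the worked solution does.

```python
def grouped_median(classes, width):
    """classes: list of (lower_limit, frequency) pairs in ascending order."""
    n = sum(freq for _, freq in classes)
    if n == 0:
        raise ValueError("at least one observation is required")
    half = n / 2
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, freq in classes:
        if cumulative + freq >= half:
            # This is the median class: interpolate inside it.
            return lower + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("median class not found")


# Daily savings data from Problem 5.
savings = [(30, 3), (36, 10), (42, 18), (48, 25), (54, 8), (60, 6)]

median_savings = grouped_median(savings, width=6)
print(round(median_savings, 2))
```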

Problem 6: Compute the mode from the following series:


Daily Savings Number of Workers
25-30 7

31-36 21
37-42 47
43-48 62
49-54 37
55-60 16
61-66 5
Solution: By inspection, the mode lies in the class 43-48.
Here L = 43, Δ1 = 62 − 47 = 15, Δ2 = 62 − 37 = 25 and i = 6.

We know that

$$M_0 = L + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times i = 43 + \frac{15}{15 + 25} \times 6 = 45.25$$
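
The same steps can be sketched in code. The name `grouped_mode` is an assumption of this example, as is the convention of treating a missing neighbouring class as having frequency zero when the modal class is the first or last one.

```python
def grouped_mode(classes, width):
    """classes: list of (lower_limit, frequency) pairs in ascending order."""
    freqs = [freq for _, freq in classes]
    k = freqs.index(max(freqs))  # position of the modal class
    prev_f = freqs[k - 1] if k > 0 else 0
    next_f = freqs[k + 1] if k < len(freqs) - 1 else 0
    delta1 = freqs[k] - prev_f
    delta2 = freqs[k] - next_f
    if delta1 + delta2 == 0:
        raise ValueError("the mode formula is undefined for a flat distribution")
    return classes[k][0] + delta1 / (delta1 + delta2) * width


# Daily savings data from Problem 6.
savings = [(25, 7), (31, 21), (37, 47), (43, 62), (49, 37), (55, 16), (61, 5)]

mode_savings = grouped_mode(savings, width=6)
print(mode_savings)
```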

3.2 MEASURES OF DISPERSION


Why Study Dispersion?
A direct comparison of two sets of data based only on two measures of location such as the mean and
the median can be misleading since an average does not tell us anything about the spread of the data.
For example, the mean salary paid to baseball players for the New York Yankees is $4,342,365. However,
the range is $14,390,000, with a low of $210,000 and a high of $14,600,000. The Tampa Devil Rays have
a mean salary of $1,227,857. The range is $8,550,000, with a low of $200,000 and a high of $8,750,000.
As another example, suppose a statistics instructor had two classes, one in the morning and one in the
evening; each with six students. In the morning class (AM) the students’ ages are 18, 20, 21, 21, 23, and
23 years. In the evening class (PM) the ages are 17, 17, 18, 20, 25, and 29 years. Note that for both
classes the mean age is 21 years but there is more variation or dispersion in the ages of the evening
students.
A small value for a measure of dispersion indicates that the data are clustered closely, say, around the
arithmetic mean. Thus the mean is considered representative of the data, that is, it is reliable.
Conversely, a large measure of dispersion indicates that the mean is not reliable and is not representative
of the data.
There are several measures of dispersion. We will consider six: the range, the mean deviation, the
variance, the standard deviation, the interquartile range, and the quartile deviation.
3.2.1 RANGE
Perhaps the simplest measure of dispersion is the range.
Range: The difference between the highest and lowest value in a set of data.
The formula for range is:
Range = Highest value – Lowest value [3 – 4]
For example, suppose a statistics instructor had two classes with the ages indicated:
A.M. Class: 18, 20, 21, 21, 23, 23
P.M. Class: 17, 17, 18, 20, 25, 29
The range for the classes is:
A.M. Class: (23 − 18) = 5
P.M. Class: (29 − 17) = 12
Thus we can say that there is more spread in the ages of the students enrolled in the evening (P.M.) class
compared with the morning (A.M.) class.

The characteristics of the range are:


i) Only two values are used in the calculation.
ii) It is influenced by extreme values.
iii) It is easy to compute and understand.
iv) It can be distorted by an extreme value.
The range has two disadvantages. It can be distorted by a single extreme value. Suppose the same
statistics instructor has a third class of five students. The ages of these students are given below.

Ages of Students
20, 20, 21, 22, 60

The range of ages is 40 years, yet four of the five students’ ages are within two years of each other. The
60-year old student has distorted the spread. Another disadvantage is that only two values, the largest
and the smallest, are used in its calculation.
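
The three ranges discussed in this section can be computed in a couple of lines; `data_range` is a name introduced for this sketch.

```python
def data_range(values):
    """Highest value minus lowest value."""
    if not values:
        raise ValueError("at least one observation is required")
    return max(values) - min(values)


# Ages quoted in the text.
am_class = [18, 20, 21, 21, 23, 23]
pm_class = [17, 17, 18, 20, 25, 29]
third_class = [20, 20, 21, 22, 60]

ranges = [data_range(c) for c in (am_class, pm_class, third_class)]
print(ranges)
```

Only `max` and `min` enter the result, which is why a single unusual observation such as the 60-year-old student dominates it.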

3.2.2 MEAN DEVIATION


In contrast to the range, the mean deviation considers all the data.

Mean Deviation: The arithmetic mean of the absolute values of the deviations from the arithmetic mean.
In terms of symbols, the formula for the mean deviation is:

Mean Deviation: $MD = \dfrac{\sum |X - \bar{X}|}{n}$ [3-5]
Where:
X is the value of each observation.
$\bar{X}$ is the arithmetic mean.
n is the number of observations in the sample.
| | indicates the absolute value.
We disregard the signs of the deviations from the mean because, if we did not, the positive and negative
deviations from the mean would exactly offset each other, and the mean deviation would always be zero.
Such a measure (zero) would be a useless statistic.
The mean deviation is computed by first determining the difference between each observation and the mean.
These differences are then averaged without regard to their signs. For the PM statistics class the mean
deviation is 4.0 years, found from the following table:

X       X − X̄             Absolute Deviation
17      17 − 21 = −4       |−4| = 4
17      17 − 21 = −4       |−4| = 4
18      18 − 21 = −3       |−3| = 3
20      20 − 21 = −1       |−1| = 1
25      25 − 21 = +4       |+4| = 4
29      29 − 21 = +8       |+8| = 8
Total                      24

Then $MD = \dfrac{\sum |X - \bar{X}|}{n} = \dfrac{24}{6} = 4.0$ years.

The parallel lines indicate absolute value. To interpret, 4.0 years is the mean amount by which the ages
differ from the arithmetic mean age of 21.0 years for the PM students.
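
As a cross-check, the mean deviation of the PM class can be reproduced in a few lines of Python. This is a minimal sketch; the helper name `mean_deviation` is an illustrative choice.

```python
from statistics import mean

# PM class ages from the text.
pm_ages = [17, 17, 18, 20, 25, 29]


def mean_deviation(values):
    """Mean of the absolute deviations from the arithmetic mean (formula 3-5)."""
    centre = mean(values)
    return sum(abs(x - centre) for x in values) / len(values)


print(mean_deviation(pm_ages))  # 4.0
```

Dropping the `abs` call makes the deviations sum to zero, which is the reason the signs are disregarded.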

3.2.3 VARIANCE AND STANDARD DEVIATION


The disadvantage of the mean deviation is that the absolute values are difficult to manipulate
mathematically. Squaring the differences between each value and the mean eliminates the problem of
absolute values. These squared differences are used in the computation of both the variance and the
standard deviation.
Variance: The arithmetic mean of the squared deviations from the mean.
Note that the variance is non-negative and is zero only if all observations are the same.
Standard Deviation: The square root of the variance
Squaring units of measurement, such as dollars or years, makes the variance cumbersome to use since it
yields units like “dollars squared” or “years squared.” However, by calculating the standard deviation,
which is the positive square root of the variance, we can return to the original units, such as years or
dollars. Because the standard deviation is easier to interpret, it is more widely used than the mean
deviation or the variance.
Population Variance
The formulas for the population variance and the sample variance are slightly different. The formula for
the population variance is:

Population Variance: $\sigma^2 = \dfrac{\sum (X - \mu)^2}{N}$ [3-6]
Where:
$\sigma^2$ is the symbol for the population variance.
X is a value of an observation in the population.
$\mu$ is the arithmetic mean of the population.
N is the total number of observations in the population.

The major characteristics of the variance are:


i) All the observations are used in the calculations.
ii) It is not unduly influenced by extreme observations, although, because the deviations are squared, extreme values still affect it.
iii) The units are somewhat difficult to work with. (They are the original units squared.)
Population Standard Deviation
The standard deviation is the square root of the variance. The formula for the standard deviation of a
population is:

Population Standard Deviation: $\sigma = \sqrt{\dfrac{\sum (X - \mu)^2}{N}}$ [3-7]
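
The two population formulas can be sketched in Python as follows. The ages come from the earlier AM and PM classes; treating each class as a complete population is an assumption made here purely for illustration, and the function names are not from the original text.

```python
import math
from statistics import pvariance

# Ages from the text, treated here as complete populations (an assumption).
am_ages = [18, 20, 21, 21, 23, 23]
pm_ages = [17, 17, 18, 20, 25, 29]


def population_variance(values):
    """Formula [3-6]: mean of the squared deviations from the population mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def population_sd(values):
    """Formula [3-7]: positive square root of the population variance."""
    return math.sqrt(population_variance(values))


for label, ages in (("AM", am_ages), ("PM", pm_ages)):
    print(label, round(population_variance(ages), 2), round(population_sd(ages), 2))

# The standard library's pvariance uses the same divisor N.
assert math.isclose(population_variance(pm_ages), pvariance(pm_ages))
```

Both classes have a mean of 21 years, but the evening class shows the larger variance and standard deviation, in line with the ranges computed earlier.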

Sample Variance
The conversion of the population variance formula to the sample variance formula is not as direct as the
change made when we went from the population mean formula to the sample mean formula. Recall that
we replaced $\mu$ with $\bar{X}$ and N with n.
The conversion from population variance to sample variance requires a change in the denominator.
Instead of substituting n, the number in the sample, for N, the number in the population, we replace N
with (n – 1). Thus the formula for the sample variance is:

Sample Variance: $s^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1}$ [3-8]
Where:
$s^2$ is the symbol for the sample variance. It is pronounced as “s squared.”
X is the value of each observation in the sample.
$\bar{X}$ is the mean of the sample.
n is the total number of observations in the sample.

Changing the denominator to (n – 1) may seem insignificant; however, the use of n tends to underestimate the
population variance. The use of (n – 1) in the denominator provides an appropriate correction factor.

Interpretation and Uses of the Standard Deviation


Recall that the standard deviation is used to measure the spread of the data. A small standard deviation indicates
that the data are clustered close to the mean, thus the mean is representative of the data. A large standard deviation
indicates that the data are spread out from the mean and the mean is not representative of the data.

Variance of the Combined Series

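If $\sigma_i^2$ (i = 1, 2, …, k) are the variances of k component series of sizes $n_i$ (i = 1, 2, …, k) respectively,
then the variance $\sigma^2$ of the composite series obtained on combining the component series is given by
the formula:

$\sigma^2 = \dfrac{n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2) + \cdots + n_k(\sigma_k^2 + d_k^2)}{n_1 + n_2 + \cdots + n_k}$

where $d_i = \bar{X}_i - \bar{X}$ is the difference between the mean of the i-th component series and the mean
$\bar{X}$ of the combined series.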
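
A small Python sketch can confirm that the combined-series formula agrees with the variance computed directly from the pooled data. It assumes the usual definition of d_i as the difference between the mean of the i-th series and the mean of the combined series, and it reuses two sets of ages from earlier in the text as component series; the function name is illustrative.

```python
from statistics import mean, pvariance

# Two component series: ages from the earlier range examples.
am_ages = [18, 20, 21, 21, 23, 23]
third_class = [20, 20, 21, 22, 60]


def combined_variance(series_list):
    """Variance of the composite series from component sizes, means and variances.

    Assumes d_i = (mean of series i) - (mean of the combined series).
    """
    sizes = [len(s) for s in series_list]
    means = [mean(s) for s in series_list]
    variances = [pvariance(s) for s in series_list]
    total = sum(sizes)
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / total
    numerator = sum(
        n * (v + (m - grand_mean) ** 2)
        for n, m, v in zip(sizes, means, variances)
    )
    return numerator / total


from_formula = combined_variance([am_ages, third_class])
direct = pvariance(am_ages + third_class)
print(round(from_formula, 4), round(direct, 4))
```

The two printed values should match up to floating-point rounding, since the formula is an exact decomposition of the pooled variance.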
Relative Dispersion
Suppose we want to compare the variability of two sets of data that are measured in different units, such
as one in dollars and the other in years. How can this be done? Relative dispersion is the answer. Below
are four relative measures of dispersion, where L and S are the largest and smallest values and $Q_1$ and
$Q_3$ are the first and third quartiles:
Coefficient of Range = $\dfrac{L - S}{L + S}$
Coefficient of Quartile Deviation = $\dfrac{Q_3 - Q_1}{Q_3 + Q_1}$
Coefficient of Mean Deviation = $\dfrac{MD}{\bar{X}}$
Coefficient of Standard Deviation = $\dfrac{\sigma}{\bar{X}}$

The coefficient of variation is another relative measure of dispersion.


Coefficient of variation: The ratio of the standard deviation to the arithmetic mean, expressed as a
percent.

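The formula for the coefficient of variation for a sample is:

Coefficient of Variation: $CV = \dfrac{s}{\bar{X}} \times 100$ [3 – 9]
where s is the sample standard deviation and $\bar{X}$ is the sample mean.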
It is a measure of relative dispersion. To compute the coefficient of variation the standard deviation is
divided by the mean and the result is multiplied by 100. This measure reports the standard deviation as a
percent of the mean.
If, for example, in a study of executives the coefficient of variation for incomes is 29 percent and
for their ages it is 12 percent, we would conclude that there is more relative dispersion in the
incomes of the executives than in their ages.
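
The same kind of comparison can be sketched in Python. Because the executives' raw data are not given, the sketch below instead uses two data sets that appear elsewhere in this chapter and are measured in different units: the PM class ages (years) and the propane heating costs from Problem 2 (dollars). Using the sample standard deviation is an assumption, and the function name is illustrative.

```python
from statistics import mean, stdev

# Data from elsewhere in the chapter, measured in different units.
pm_ages = [17, 17, 18, 20, 25, 29]                       # years
heating_costs = [191, 212, 176, 129, 106, 92,
                 108, 109, 103, 121, 175, 194]           # dollars


def coefficient_of_variation(values):
    """CV = (standard deviation / mean) * 100, using the sample standard deviation."""
    return stdev(values) / mean(values) * 100


cv_ages = coefficient_of_variation(pm_ages)
cv_costs = coefficient_of_variation(heating_costs)
print(round(cv_ages, 2), round(cv_costs, 2))
```

Since both results are percentages of their own means, they can be compared directly even though one series is in years and the other in dollars.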
Characteristics of the coefficient of variation are:
i) It reports the variation relative to the mean.
ii) It is useful for comparing distributions with different units.

CHAPTER PROBLEMS
Problem 1
A comparison shopper employed by a large grocery chain recorded these prices for a 340-gram jar of Kraft
blackberry preserves at a sample of six supermarkets selected at random.

Supermarket    Price X
1              $1.31
2              1.35
3              1.26
4              1.42
5              1.31
6              1.33
Total          $7.98

a. Compute the arithmetic mean.
b. Compute the median.
c. Compute the mode.

Solution:

a. Determine the mean price of this raw data by summing the prices for the six jars and dividing the
total by six. Using the formula for the mean of a sample we get:
$\bar{X} = \dfrac{\sum X}{n} = \dfrac{\$7.98}{6} = \$1.33$
b. As noted above, the median is defined as the middle value of a set of data after the data are arranged
from smallest to largest. The prices for the six jars of blackberry preserves have been ordered from a
low of $1.26 up to $1.42. Because this is an even number of prices, the median price is halfway
between the third and the fourth price: ($1.31 + $1.33)/2 = $1.32.
Prices Arranged from Low to High:
$1.26 $1.31 $1.31 $1.33 $1.35 $1.42

Suppose there are an odd number of blackberry preserve prices, such as shown in the table.
$1.31 $1.31 $1.33 $1.35 $1.42
The median is the middle value ($1.33). To find the median, the values must first be ordered from low
to high.
c. The mode is the price that occurs most often. The price of $1.31 occurs twice in the original data and
is the mode.
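
The three answers to Problem 1 can be verified with Python's `statistics` module. Working in whole cents rather than dollars is a choice made here to avoid floating-point rounding; it is not part of the original solution.

```python
from statistics import mean, median, mode

# Prices from Problem 1, expressed in cents to keep the arithmetic exact.
prices_cents = [131, 135, 126, 142, 131, 133]

mean_price = mean(prices_cents) / 100      # 798 / 6 = 133 cents
median_price = median(prices_cents) / 100  # halfway between 131 and 133 cents
mode_price = mode(prices_cents) / 100      # 131 cents occurs twice

print(mean_price, median_price, mode_price)  # 1.33 1.32 1.31
```

Note that `median` sorts the data internally, so the prices do not need to be ordered beforehand.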
Problem 2
A sample of the amounts spent in November for propane gas to heat homes of similar sizes in Duluth
revealed these amounts (to the nearest dollar):
191 212 176 129 106 92 108 109 103 121 175 194
What is the range? Interpret your results.

Solution:
Recall that the range is the difference between the largest value and the smallest value.
Range = Highest Value – Lowest Value = $212 - $92 = $120
This indicates that there is a difference of $120 between the largest and the smallest heating cost.
Problem 3
Using the heating cost data in Problem 2, compute the mean deviation.
Solution:
The mean deviation is the mean of the absolute deviations from the arithmetic mean. For raw, or
ungrouped data, it is computed by first determining the mean. Next, the difference between each value
and the arithmetic mean is determined. Finally, the absolute values of these differences are totaled and
the total is divided by the number of observations.
The table below shows the data values, each data value minus the mean, and the absolute value of
the deviations from the mean. In other words, the signs of the deviations from the mean are disregarded.
Payment      Absolute Deviation
$191         |+48| = $48
212          |+69| = 69
176          |+33| = 33
129          |–14| = 14
106          |–37| = 37
92           |–51| = 51
108          |–35| = 35
109          |–34| = 34
103          |–40| = 40
121          |–22| = 22
175          |+32| = 32
194          |+51| = 51
Total $1,716       $466

The mean is $\bar{X} = \dfrac{\$1{,}716}{12} = \$143.00$, and the mean deviation is
$MD = \dfrac{\sum |X - \bar{X}|}{n} = \dfrac{\$466}{12} = \$38.83$.
The mean deviation of $38.83 indicates that the typical heating cost deviates $38.83 from the mean of
$143.00.

Problem 4
The hourly wages for a sample of plumbers were grouped into the following frequency distribution. Since
the wages have been grouped into classes, we refer to the following distribution as being grouped data.

Hourly Wages      Number f
$8 up to $10      3
$10 up to $12     6
$12 up to $14     12
$14 up to $16     10
$16 up to $18     7
$18 up to $20     2
Total             40

a. Compute the arithmetic mean.
b. Compute the mode.

Solution:

a. The arithmetic mean of this sample data, grouped into a frequency distribution, is computed by the
formula:

$\bar{X} = \dfrac{\sum fX}{n}$
Where:
$\bar{X}$ is the designation for the arithmetic mean.
X is the mid-value, or midpoint, of each class.
f is the frequency in each class.
fX is the frequency in each class times the midpoint of the class.
$\sum fX$ is the sum of these products.
n is the total number of frequencies.

It is assumed that the observations in each class are represented by the midpoint of the class. The
midpoint of the first class is $9.00, found by ($8.00 + $10.00)/2. For the next higher class, the midpoint is
$11.00.
Using the formula, the arithmetic mean hourly wage is $13.90, found by
$\bar{X} = \dfrac{\sum fX}{n} = \dfrac{\$556.00}{40} = \$13.90$, with the products taken from the table:

Wage Rate         Frequency f    Class Midpoint X    fX
$8 up to $10      3              $9.00               $27.00
$10 up to $12     6              11.00               66.00
$12 up to $14     12             13.00               156.00
$14 up to $16     10             15.00               150.00
$16 up to $18     7              17.00               119.00
$18 up to $20     2              19.00               38.00
Total             40                                 $556.00

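b. The mode is the value that occurs most often. The class $12 up to $14 has the largest frequency (12),
so the mode of this distribution lies in that class. For data grouped into a frequency distribution, the
mode is

$\text{Mode} = L_1 + \dfrac{\Delta_1}{\Delta_1 + \Delta_2} \times i = 12 + \dfrac{6}{6 + 2} \times 2 = 13.5$

where $L_1$ = 12 is the lower limit of the modal class, $\Delta_1$ = 12 – 6 = 6 is the modal-class frequency
minus the frequency of the preceding class, $\Delta_2$ = 12 – 10 = 2 is the modal-class frequency minus the
frequency of the following class, and i = 2 is the class width.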
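
Both parts of Problem 4 can be reproduced with a short Python sketch. The tuple layout and function names are illustrative, and the mode function assumes a single modal class and interprets Δ1 and Δ2 as the differences between the modal-class frequency and the frequencies of the neighbouring classes.

```python
# (lower limit, upper limit, frequency) for each wage class in Problem 4.
wage_classes = [
    (8, 10, 3), (10, 12, 6), (12, 14, 12),
    (14, 16, 10), (16, 18, 7), (18, 20, 2),
]


def grouped_mean(classes):
    """Mean of grouped data: sum of f * midpoint divided by the total frequency."""
    n = sum(f for _, _, f in classes)
    return sum(f * (lo + hi) / 2 for lo, hi, f in classes) / n


def grouped_mode(classes):
    """Mode of grouped data, assuming a single modal class."""
    freqs = [f for _, _, f in classes]
    m = freqs.index(max(freqs))                    # position of the modal class
    lower, upper, f_m = classes[m]
    f_prev = freqs[m - 1] if m > 0 else 0          # frequency of the class before
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0  # frequency of the class after
    d1, d2 = f_m - f_prev, f_m - f_next
    return lower + d1 / (d1 + d2) * (upper - lower)


print(round(grouped_mean(wage_classes), 2))  # 13.9
print(grouped_mode(wage_classes))            # 13.5
```

The mean relies on the assumption stated in the solution that each class is represented by its midpoint.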

Problem 5
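Determine the mean and SD of sales of 100 First Food Restaurants in the Eastern Districts (in $'000).

Sales            Number of Restaurants
700 - 800        4
800 - 900        7
900 - 1000       8
1000 - 1100      10
1100 - 1200      12
1200 - 1300      17
1300 - 1400      13
1400 - 1500      10
1500 - 1600      9
1600 - 1700      7
1700 - 1800      2
1800 - 1900      1

Solution:
Taking X as the class midpoints (750, 850, …, 1850) and f as the class frequencies:
$\sum fX = 125{,}000$, so $\bar{X} = \dfrac{125{,}000}{100} = 1{,}250$
$\sum f(X - \bar{X})^2 = 6{,}680{,}000$
Dividing by N = 100 gives a variance of 66,800 and SD = $\sqrt{66{,}800} \approx 258.46$; with the sample
denominator n – 1 = 99 the SD is approximately 259.76.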
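
The sums needed for Problem 5 can be checked with the sketch below. It uses class midpoints, as in Problem 4, and reports the standard deviation under both denominators because the problem statement does not say whether the 100 restaurants are a population or a sample; the variable names are illustrative.

```python
import math

# (lower limit, upper limit, number of restaurants); sales in thousands of dollars.
sales_classes = [
    (700, 800, 4), (800, 900, 7), (900, 1000, 8), (1000, 1100, 10),
    (1100, 1200, 12), (1200, 1300, 17), (1300, 1400, 13), (1400, 1500, 10),
    (1500, 1600, 9), (1600, 1700, 7), (1700, 1800, 2), (1800, 1900, 1),
]

n = sum(f for _, _, f in sales_classes)
sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in sales_classes)
mean_sales = sum_fx / n
sum_sq = sum(f * ((lo + hi) / 2 - mean_sales) ** 2 for lo, hi, f in sales_classes)

sd_population = math.sqrt(sum_sq / n)        # divisor N
sd_sample = math.sqrt(sum_sq / (n - 1))      # divisor n - 1

print(n, sum_fx, mean_sales, sum_sq)
print(round(sd_population, 2), round(sd_sample, 2))
```

With 100 observations the two standard deviations differ by only about 1.3, so the choice of denominator matters little here.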

Exercises: The Measures of Central Tendency


1. The annual exports of 50 medium-sized manufacturers were organized into a frequency distribution.
(Exports are in $ millions).
Exports Frequency
$6 up to $9 2
9 up to 12 8
12 up to 15 20
15 up to 18 14
18 up to 21 6
Compute the: i) mean ii) mode, and iii) Median

2. The mean marks obtained in an examination by a group of 100 students were found to be 49.96. The
mean marks obtained in the same examination by another group of 200 students were 52.32. Find
the mean of marks obtained by both groups of students taken together.[Ans. 51.53]
3. The mean marks obtained by 300 students in the subject statistics is 45. The mean of the top 100 of them
was found to be 70 and the mean of the last 100 was known to be 20. What is the mean of the remaining
100 students?
4. The mean weekly salary paid to all employees in a company is Tk. 500. The mean weekly salaries paid
to male and female employees are Tk. 520 and Tk. 420 respectively. Determine the percentage of males
and females employed by the company.

Exercises: The Measures of Dispersion


1. Calculate the coefficient of variation (CV) from the following data:
Profits (in Crores Tk.)    No. of Companies
Less than 10               8
Less than 20               20
Less than 30               40
Less than 40               70
Less than 50               90
Less than 60               100

2. Runs scored by two cricketers in 10 ODI matches are as follows:
Cricketer – A: 90, 27, 08, 80, 13, 105, 06, 60, 45, 00
Cricketer – B: 25, 50, 65, 43, 75, 56, 16, 67, 49, 37
Which cricketer may be considered a more consistent player?
3. The sums and sums of squares corresponding to length X (in cm.) and weight Y (in gm.) of 50 tapioca
tubers are given below:
$\sum X = 212$, $\sum X^2 = 902.8$, $\sum Y = 261$, and $\sum Y^2 = 1457.6$
Which is more varying, the length or the weight?

4. For a group containing 100 observations, the AM and SD are 8 and $\sqrt{10.5}$ respectively. For 50
observations selected from those 100 observations, the mean and SD are 10 and 2 respectively. Find
the AM and SD of the other 50 observations.
5. In two factories A and B engaged in the same industry, the average weekly wages and SD’s are as
follows:
Factory Ave. weekly wage SD of wage No. of wage earners
A 460 50 100
B 490 40 80
a) Which factory, A or B, pays the larger amount as weekly wages?
b) Which factory shows greater variability in the distribution of wages?
c) What are the mean and SD of all workers in the two factories taken together?

6. FundInfo provides information to its subscribers to enable them to evaluate the performance of
mutual funds they are considering as potential investment vehicles. A recent survey of funds whose
stated investment goal was growth and income produced the following data on total annual rate of
return over the five years:
Annual rate of return    11 - 12   12 - 13   13 - 14   14 - 15   15 - 16   16 - 17   17 - 18   18 - 19
Frequency                2         2         8         10        11        8         3         1
Calculate the mean, variance and SD of the annual rate of return for this sample of 45 funds.
