Research Methodology
October 11, 2018
Types of Data
Data Types
  Categorical (Qualitative)
    Nominal
    Ordinal
  Numerical (Quantitative)
    Discrete
    Continuous
Categorical Data
• Represent characteristics
• Can be coded as numbers, but the numbers have no
mathematical meaning
• Example:
Gender : Male Female
Communication mode: Phone Email
Categorical Data
Categorical: Nominal, Ordinal
Nominal
- Used to label variables
- Discrete values
- No order (can switch values)
- Example: Male 1, Female 2
  (1 and 2 are labels to represent gender category;
  you can label Male as 2 and Female as 1)
Ordinal
- Used to label variables
- Discrete values
- With order (cannot switch values)
- Example: UG 1, PG 2
  (1 and 2 are ordered labels to represent educational
  qualification)
Numerical Data
Discrete data
• Represent counts
• Can only take certain values
• Are counted, not measured
• Example:
1. Number of accidents in a month (= 15)
2. Number of shops in a mall (= 52)
Continuous data
• Represent measurements
• Are measured, not counted
• Example:
1. Height of the person (= 141.3)
2. Room Temperature (= 31.4)
Numerical Data
Numerical: Interval, Ratio
Interval data
• Ordered units
• Have the same difference
• No true Zero
• Example:
Temperature: -10, -5, 0, +5, +10
Ratio data
• Ordered units
• Have the same difference
• Have an absolute Zero
• Example:
Height of the plant (in cm): 0, +1, +2, +3
Statistical Methods
Nominal Data:
Frequency, proportion, percentage, pie chart, bar chart
Ordinal Data:
Frequency, proportion, percentage, percentiles, median,
mode, interquartile range, pie chart, bar chart
Continuous Data:
Percentiles, median, interquartile range, mean, mode,
standard deviation, range, histogram, box-plot
Sample data
S.No  Age (yrs)  Age Group  Gender  No. of times shopped  Amount spent in shopping
1     19         1          Male    6                     12001.50
2     24         2          Female  7                     16455.00
3     28         2          Male    4                     9800.75
4     35         3          Female  3                     4344.00
Age - Numeric, Continuous
Age group – Categorical, Ordinal
Gender – Categorical, Nominal
No of times shopped – Numeric, Discrete
Amount spent in shopping - Numeric, Continuous
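The data type of each column decides which summaries make sense for it. The following is a minimal sketch in Python using only the standard library and the four sample rows; the variable names are illustrative and not part of the slides.

```python
from collections import Counter
from statistics import mean, median

# The four rows of the sample table:
# (age, age_group, gender, times_shopped, amount_spent)
rows = [
    (19, 1, "Male", 6, 12001.50),
    (24, 2, "Female", 7, 16455.00),
    (28, 2, "Male", 4, 9800.75),
    (35, 3, "Female", 3, 4344.00),
]

# Nominal (Gender): frequencies and proportions are meaningful.
gender_freq = Counter(row[2] for row in rows)
gender_prop = {g: count / len(rows) for g, count in gender_freq.items()}

# Ordinal (Age group): the order is meaningful, so the median can be used.
age_group_median = median([row[1] for row in rows])

# Numeric (Amount spent): the mean is meaningful.
amount_mean = mean([row[4] for row in rows])

print("Gender frequency:", dict(gender_freq))
print("Gender proportion:", gender_prop)
print("Median age group:", age_group_median)
print("Mean amount spent:", amount_mean)
```

Taking the mean of the Gender codes (1 and 2) would run without error but would not mean anything, which is why the column type has to be decided before the method.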
Population and Sample
Population
• Any large collection of objects or individuals about
which information is sought
• Example – Indians, students, hospitals or trees
“A study of road accidents in India”
“A study of academic achievement of private
school students in America”
Population
Population - Parameter
• Summary number for population
• Pertains only to population
• Population mean - µ (Greek letter mu)
• Population proportion - p
• Example:
The average weight of all middle-aged female
Chinese - µ
The proportion of likely Indian students approving
the new education policy - p
Sample
• A group drawn from population that represents the
population
• Example:
To study Road accidents in India, take sample as:
Road accidents in 10 large cities across India
Road accidents in 10 medium size cities across India
Road accidents in 10 small size cities across India
Road accidents in 30 rural places across India
Sample
Sample - Statistic
• Summary number for sample
• Pertains only to sample
• Sample mean - x̄ (x-bar)
• Sample proportion - p̂ (p-hat)
• Example:
The average weight of a random sample of 100
middle-aged female Chinese - x̄
The proportion in a random sample of 1000 likely
Indian students approving the new education policy - p̂
Population and Sample
The main campus at XXX University has a population
of approximately 42,000 students. A research
question is "what proportion of these students
smoke regularly?" A survey was administered to a
sample of 987 XXX university students. Forty-three
percent (43%) of the sampled students reported that
they smoked regularly.
Identify the population, parameter, sample, and statistic.
Population and Sample
Assume that there exists a population of 7 million
college students in the United States today. The
average GPA of all of these college students is 2.7 (on
a 4-point scale). A random sample of 100 college
students was taken, and their average GPA was
found to be 2.9.
Identify the population, parameter, sample, and statistic.
Population and Sample
• Measuring every member of a population is rarely
practical, so the population mean is usually unknown
• It can be estimated from a sample and its statistics
• The estimate is reported as a confidence interval and
checked with hypothesis testing
Lower value < Population mean < Upper value
Confidence Interval
• Should using a hand-held cell phone while driving be
illegal?
• For example, an ABC News poll (May 16-20, 2001) asked
whether U.S. adults thought using a hand-held cell phone while
driving should be illegal. Of the 1,027 U.S. adults randomly
selected for participation in the poll, 69% thought that it
should be illegal. The reporter claimed that the poll's "margin
of error" was 3%. Therefore, the confidence interval for the
(unknown) population proportion p is 69% ± 3%. That is, we can
be reasonably confident that between 66% and 72% of all U.S.
adults think using a hand-held cell phone while driving a car
should be illegal.
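The reported 3% margin of error can be checked with the usual normal-approximation formula for a proportion, z * SQRT(p(1 - p)/n). The slides do not state the poll's confidence level; the sketch below assumes 95% (z = 1.96), which is an assumption and not something the report confirms.

```python
import math

p_hat = 0.69  # sample proportion from the poll
n = 1027      # sample size from the poll
z = 1.96      # ASSUMED: critical value for a 95% confidence level

# Normal-approximation margin of error for a proportion
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - margin, p_hat + margin

print(f"margin of error = {margin:.3f}")
print(f"interval = ({lower:.3f}, {upper:.3f})")
```

Under that assumption the margin comes out at about 2.8%, which rounds to the 3% quoted by the reporter.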
Confidence Interval
• Let's take an example of researchers who are
interested in the average heart rate of male college
students. Assume a random sample of 130 male
college students was taken for the study.
The following is the Minitab output of a one-sample
t-interval for this data.
[Minitab output: One-Sample T: Heart Rate]
Descriptive Statistics &
Inferential Statistics
Descriptive Statistics – Measures of
Central Tendency
• 3 Ms - Mean, Median, Mode
Mean - Example:
Heart beats per minute for 10 adults are given below:
58, 61, 72, 65, 68, 60, 75, 69, 69, 73
Rearrange them in the ascending order
58, 60, 61, 65, 68, 69, 69, 72, 73, 75
1. Mean is the sum of the values divided by the number of values
Mean = (58 + 60 + 61 + 65 + 68 + 69 + 69 + 72 + 73 + 75)/10 = 670/10 = 67.0
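The same calculation as a short Python sketch (the variable names are illustrative):

```python
# Heart beats per minute for the 10 adults
heart_rates = [58, 61, 72, 65, 68, 60, 75, 69, 69, 73]

# Mean = sum of the values / number of values
mean_rate = sum(heart_rates) / len(heart_rates)
print(mean_rate)  # 67.0
```

Sorting is not needed for the mean; the ascending order only matters for the median.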
Descriptive Statistics
• 3 Ms - Mean, Median, Mode
Median - Example with even number of data:
Heart beats per minute for 10 adults, arranged in
ascending order:
58, 60, 61, 65, 68, 69, 69, 72, 73, 75
(4 values below and 4 values above the two middle
values, 68 and 69)
2. Median is the value that divides the ordered data into
two halves (the middle value)
Median = Average of 68 and 69 (since there is an even
number of values) = 68.5
Descriptive Statistics
• 3 Ms - Mean, Median, Mode
Median - Example with odd number of data:
Heart beats per minute for 9 adults, arranged in
ascending order:
58, 60, 61, 65, 68, 69, 72, 73, 75
(4 values below and 4 values above the middle value, 68)
2. Median is the value that divides the ordered data into
two halves (the middle value)
Median = Middle value = 68 (since there is an odd number
of values)
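Both median cases (even and odd number of values) can be sketched in one small Python function. It assumes a non-empty list of numbers; the function and variable names are illustrative.

```python
def median(values):
    """Return the middle value of the data (assumes at least one value)."""
    ordered = sorted(values)  # the median is defined on ordered data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists
        return ordered[mid]
    # Even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

ten_adults = [58, 61, 72, 65, 68, 60, 75, 69, 69, 73]
nine_adults = [58, 60, 61, 65, 68, 69, 72, 73, 75]

print(median(ten_adults))   # 68.5
print(median(nine_adults))  # 68
```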
Descriptive Statistics
• 3 Ms - Mean, Median, Mode
Mode - Example:
Heart beats per minute for 10 adults are given below:
58, 61, 72, 65, 68, 60, 75, 69, 69, 73
Rearrange them in the ascending order
58, 60, 61, 65, 68, 69, 69, 72, 73, 75
3. Mode is the most frequently occurring value
Mode = 69 (it occurs twice; every other value occurs once)
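A short Python sketch of the mode, counting how often each value occurs. It returns a list because a data set can have more than one mode; the names are illustrative.

```python
from collections import Counter

heart_rates = [58, 61, 72, 65, 68, 60, 75, 69, 69, 73]

# Count the occurrences of each value
counts = Counter(heart_rates)
highest = max(counts.values())

# Every value that reaches the highest count is a mode
modes = [value for value, count in counts.items() if count == highest]
print(modes)  # [69]
```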
Statistical Significance & p-value
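Significance:
A measure of whether the results of research could
plausibly be due to chance alone
p-value:
The way in which significance is reported statistically:
the probability of getting results at least as extreme as
those observed if there were no real effect
Example:
p<0.01 means that, if there were no real effect, results
this extreme would occur by random chance less than 1%
of the time
Generally the significance level (the threshold the p-value
is compared with) is set at 0.05 or 0.01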
Statistical Significance & p-value
Example:
• A study had one group of students (Group A) study using
notes they took in class; the other group (Group B)
studied using notes they took after class using a
recording of the lecture. Students in Group A scored
higher on a test than Group B. The study reports a
significance of p<.01 for the results.
• This means that if the note-taking method made no real
difference, a gap in test scores this large would arise by
chance (for example, from which students happened to be
in each group) less than 1% of the time. The p-value does
not say why Group A did better.
Descriptive Statistics –
Measures of Dispersion
Measures of dispersion:
A measure of the spread of the data, or the variation
around the central value
- Range
- Variance
- Standard deviation
- Interquartile range
Descriptive Statistics –
Measures of Dispersion
Range:
• Difference between the largest and smallest sample
values
• Depends only on extreme values and provides no
information about how the remaining data is
distributed
Example:
Heart beats per minute for 10 adults are given below:
58, 60, 61, 65, 68, 69, 69, 72, 73, 75
Range = 75-58 = 17
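A short Python sketch of the range. The second list is a hypothetical data set, made up here only to illustrate that the range ignores everything except the two extremes.

```python
heart_rates = [58, 60, 61, 65, 68, 69, 69, 72, 73, 75]

# Range = largest value - smallest value
data_range = max(heart_rates) - min(heart_rates)
print(data_range)  # 17

# HYPOTHETICAL data (not from the slides): same extremes, but all
# other values are identical, so the data is distributed very differently.
clustered = [58, 66, 66, 66, 66, 66, 66, 66, 66, 75]
clustered_range = max(clustered) - min(clustered)
print(clustered_range)  # 17
```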
Descriptive Statistics –
Measures of Dispersion
Variance and Standard deviation (for sample):
Measures the degree of spread in a variable’s values
Variance: s² = Σ(x - x̄)² / (n - 1)
Standard deviation: s = SQRT of s²
n = number of observations
x = variable value
x̄ = mean value
Note: For a population, the variance becomes
σ² = Σ(x - µ)² / N, where N is the population size
Variance
Coefficient of variation is the ratio between the standard
deviation and the mean, expressed as a percentage. This
can be either (σ / µ)*100 or (s / x̄)*100
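The sample and population variance differ only in the divisor (n - 1 for a sample, the number of values for a whole population). The following is a minimal from-scratch sketch in Python, applied to the heart-rate data used earlier; the function names are illustrative.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)

def population_variance(values):
    """Sum of squared deviations from the mean, divided by n."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / n

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    m = sum(values) / len(values)
    return math.sqrt(sample_variance(values)) / m * 100

heart_rates = [58, 60, 61, 65, 68, 69, 69, 72, 73, 75]

s2 = sample_variance(heart_rates)
sigma2 = population_variance(heart_rates)
s = math.sqrt(s2)
cv = coefficient_of_variation(heart_rates)

print(f"sample variance     = {s2:.2f}")
print(f"population variance = {sigma2:.2f}")
print(f"sample std devn     = {s:.2f}")
print(f"coeff of variation  = {cv:.2f}%")
```

The sample variance is always slightly larger than the population variance computed from the same numbers, because it divides by a smaller number.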
Variance - Example
The temperature in chemical reactor A and chemical
reactor B were measured every half hour under the
same conditions.
For A: 78.1°C, 79.2°C, 78.9°C, 80.2°C, 78.3°C, 78.8°C,
79.4°C
For B: 78.5°C, 79.1°C, 80.1°C, 80.2°C, 78.6°C, 78.7°C,
78.1°C
Which reactor is better at maintaining the temperature
over the measured period?
Variance – Example - Solution
Chemical reactor A:
Temp x   x - Mean      x - Mean   (x - Mean)²
78.1     78.1 - 79.0   -0.89      0.78
79.2     79.2 - 79.0    0.21      0.05
78.9     78.9 - 79.0   -0.09      0.01
80.2     80.2 - 79.0    1.21      1.47
78.3     78.3 - 79.0   -0.69      0.47
78.8     78.8 - 79.0   -0.19      0.03
79.4     79.4 - 79.0    0.41      0.17
Sum of (x - Mean)² = 2.99
Mean Temp x̄ = 79.0 (rounded; the deviations above use
the unrounded mean, 78.99)
Variance – Example - Solution
Chemical reactor B:
Temp y   y - Mean      y - Mean   (y - Mean)²
78.5     78.5 - 79.0   -0.54      0.29
79.1     79.1 - 79.0    0.06      0.00
80.1     80.1 - 79.0    1.06      1.12
80.2     80.2 - 79.0    1.16      1.34
78.6     78.6 - 79.0   -0.44      0.20
78.7     78.7 - 79.0   -0.34      0.12
78.1     78.1 - 79.0   -0.94      0.89
Sum of (y - Mean)² = 3.96
Mean Temp ȳ = 79.0 (rounded; the deviations above use
the unrounded mean, 79.04)
Variance – Example - Solution
Chemical reactor A:
Mean Temp x̄ = 79.0
Sum of (x - Mean)² = 2.99
Variance = Sum of (x - Mean)² / (n - 1)
         = 2.99/(7-1)
         = 2.99/6
Variance = 0.4983
Std devn = SQRT of Variance
         = SQRT of 0.4983
Std devn = 0.7059
Chemical reactor B:
Mean Temp ȳ = 79.0
Sum of (y - Mean)² = 3.96
Variance = Sum of (y - Mean)² / (n - 1)
         = 3.96/(7-1)
         = 3.96/6
Variance = 0.6600
Std devn = SQRT of Variance
         = SQRT of 0.6600
Std devn = 0.8124
Variance – Example - Solution
Chemical reactor A:
Mean Temp x̄ = 79.0
Coeff of variation = (S.D/Mean)*100
                   = (0.7059/79.0)*100
                   = 0.89%
Chemical reactor B:
Mean Temp ȳ = 79.0
Coeff of variation = (S.D/Mean)*100
                   = (0.8124/79.0)*100
                   = 1.03%
A has a smaller variance, standard deviation and
coefficient of variation than B, which means the
temperature in A is less spread out. Hence Reactor A is
better, as its temperature deviates less than B's.
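The reactor comparison can be reproduced with Python's standard `statistics` module, where `variance` and `stdev` are the sample versions (dividing by n - 1). This is a sketch for checking the hand calculation; the function name `summarize` is illustrative. The results differ from the slide values in the third or fourth decimal place because the slides round intermediate figures.

```python
from statistics import mean, stdev, variance

# Temperatures (°C) measured every half hour
reactor_a = [78.1, 79.2, 78.9, 80.2, 78.3, 78.8, 79.4]
reactor_b = [78.5, 79.1, 80.1, 80.2, 78.6, 78.7, 78.1]

def summarize(temps):
    """Return mean, sample variance, sample std deviation and CV (%)."""
    m = mean(temps)
    var = variance(temps)  # sample variance: divides by n - 1
    sd = stdev(temps)      # square root of the sample variance
    return {"mean": m, "variance": var, "std_dev": sd, "cv_percent": sd / m * 100}

summary_a = summarize(reactor_a)
summary_b = summarize(reactor_b)

for name, s in (("A", summary_a), ("B", summary_b)):
    print(
        f"Reactor {name}: mean={s['mean']:.2f} variance={s['variance']:.4f} "
        f"std_dev={s['std_dev']:.4f} cv={s['cv_percent']:.2f}%"
    )

# The reactor with the smaller coefficient of variation holds its temperature better
steadier = "A" if summary_a["cv_percent"] < summary_b["cv_percent"] else "B"
print("Steadier reactor:", steadier)
```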