Introduction To Statistics 1 COD

This document outlines the topics that will be covered in a lecture series on introductory statistics. The lectures will cover descriptive statistics and graphical presentation of data, statistical inference including hypothesis testing and confidence intervals, and sample inferences such as t-tests, ANOVA, and regression. Specific topics to be covered include measures of central tendency, frequency distributions, the normal distribution, sampling distributions, p-values, and types of statistical tests. Examples and demonstrations of statistical techniques will be provided.

Uploaded by

Mangala Semage

Introduction to Statistics
Colm O'Dushlaine
Neuropsychiatric Genetics, TCD

Overview
Descriptive Statistics & Graphical Presentation of Data
Statistical Inference
Hypothesis Tests & Confidence Intervals
T-tests (Paired/Two-sample)
Regression (SLR & Multiple Regression)
ANOVA/ANCOVA
Intended as an overview. Will provide slides after lectures
What's in the lectures?

Lecture 1: Descriptive Statistics and Graphical Presentation
1. Terminology of Data
2. Frequency Distributions/Histograms
3. Measures of data location
4. Measures of data spread
5. Box-plots
6. Scatter-plots
7. Clustering (Multivariate Data)

Lecture 2: Statistical Inference

1. Distributions & Densities

2. Normal Distribution
3. Sampling Distribution & Central Limit Theorem
4. Hypothesis Tests
5. P-values
6. Confidence Intervals
7. Two-Sample Inferences
8. Paired Data

Lecture 3: Sample Inferences

1. Two-Sample Inferences
Paired t-test
Two-sample t-test
2. Inferences for more than two samples
One-way ANOVA
Two-way ANOVA
Interactions in Two-way ANOVA
3. DataDesk demo

Lecture 4

1. Regression
2. Correlation
3. Multiple Regression
4. ANCOVA
5. Normality Checks
6. Non-parametrics
7. Sample Size Calculations
8. Useful tools and websites
FIRST, A REALLY USEFUL SITE
Explanations of outputs
Videos with commentary
Help with deciding what test to use with what data

1. Terminology
Populations & Samples
Population: the complete set of individuals, objects or scores of interest.
  Often too large to sample in its entirety
  It may be real or hypothetical (e.g. the results from an experiment repeated ad infinitum)
Sample: a subset of the population.
  A sample may be classified as random (each member has an equal chance of being selected from the population) or convenience (what's available).
  Random selection attempts to ensure the sample is representative of the population.

Variables
Variables are the quantities measured in a sample. They may be classified as:
Quantitative, i.e. numerical
  Continuous (e.g. pH of a sample, patient cholesterol levels)
  Discrete (e.g. number of bacteria colonies in a culture)
Categorical
  Nominal (e.g. gender, blood group)
  Ordinal (ranked, e.g. mild, moderate or severe illness). Often ordinal variables are re-coded to be quantitative.

Variables
Variables can be further classified as:
Dependent/Response: variable of primary interest (e.g. blood pressure in an antihypertensive drug trial). Not controlled by the experimenter.
Independent/Predictor:
  called a Factor when controlled by the experimenter. It is often nominal (e.g. treatment)
  called a Covariate when not controlled.
If the value of a variable cannot be predicted in advance then the variable is referred to as a random variable

Parameters & Statistics
Parameters: Quantities that describe a population characteristic. They are usually unknown and we wish to make statistical inferences about parameters. Different to perimeters.
Descriptive Statistics: Quantities and techniques used to describe a sample characteristic or illustrate the sample data, e.g. mean, standard deviation, box-plot

2. Frequency Distributions

An (Empirical) Frequency Distribution or Histogram for a continuous variable presents the counts of observations grouped within pre-specified classes or groups
A Relative Frequency Distribution presents the corresponding proportions of observations within the classes
A Bar chart presents the frequencies for a categorical variable

Example: Serum CK
Blood samples were taken from 36 male volunteers as part of a study to determine the natural variation in CK concentration.
The serum CK concentrations, measured in U/l, are as follows:

Serum CK Data for 36 male volunteers

121   82  100  151   68   58
 95  145   64  201  101  163
 84   57  139   60   78   94
119  104  110  113  118  203
 62   83   67   93   92  110
 25  123   70   48   95   42

Relative Frequency Table

Serum CK (U/l)   Frequency   Relative Frequency   Cumulative Rel. Frequency
20-39                1             0.028                  0.028
40-59                4             0.111                  0.139
60-79                7             0.194                  0.333
80-99                8             0.222                  0.556
100-119              8             0.222                  0.778
120-139              3             0.083                  0.861
140-159              2             0.056                  0.917
160-179              1             0.028                  0.944
180-199              0             0.000                  0.944
200-219              2             0.056                  1.000
Total               36             1.000

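The table above can be rebuilt directly from the raw CK values. The following is a minimal sketch, assuming classes of width 20 starting at 20 U/l as in the table; the function name `frequency_table` and its parameters are illustrative only and not part of the lecture material.

```python
from collections import Counter

# Serum CK concentrations (U/l) for the 36 volunteers, as listed above.
ck = [121, 82, 100, 151, 68, 58, 95, 145, 64, 201, 101, 163,
      84, 57, 139, 60, 78, 94, 119, 104, 110, 113, 118, 203,
      62, 83, 67, 93, 92, 110, 25, 123, 70, 48, 95, 42]


def frequency_table(values, start=20, width=20, n_classes=10):
    """Return rows of (class label, count, relative freq, cumulative relative freq).

    Assumes every value lies inside the classes; values outside are ignored.
    """
    counts = Counter((v - start) // width for v in values)
    n = len(values)
    rows, cumulative = [], 0
    for k in range(n_classes):
        low = start + k * width
        count = counts.get(k, 0)
        cumulative += count
        rows.append((f"{low}-{low + width - 1}", count, count / n, cumulative / n))
    return rows


table = frequency_table(ck)
for label, count, rel, cum in table:
    print(f"{label:>8} {count:>3} {rel:6.3f} {cum:6.3f}")
```

Computing the cumulative column from the running count, rather than by adding up rounded relative frequencies, avoids small rounding drift in the last decimal place.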
Frequency Distribution
[Figure: histogram of CK concentration (U/l); x-axis from 20 to 220 U/l, y-axis Frequency; side panel of quantiles (minimum, quartiles, median, maximum)]

Relative Frequency Distribution
[Figure: relative frequency histogram of CK concentration (U/l); x-axis from 20 to 220 U/l, y-axis Relative Frequency; the mode, the left tail and the (skewed) right tail are labelled]
Shaded area is the percentage of males with CK values between 60 and 100 U/l, i.e. 42%.

3. Measures of Central Tendency (Location)
Measures of location indicate where on the number line the data are to be found. Common measures of location are:
(i) the Arithmetic Mean,
(ii) the Median, and
(iii) the Mode

The Mean
Let $x_1, x_2, x_3, \ldots, x_n$ be the realised values of a random variable $X$, from a sample of size $n$. The sample arithmetic mean is defined as:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Example
Example 2: The systolic blood pressures of seven middle-aged men were as follows:
151, 124, 132, 170, 146, 124 and 113.
The mean is

$$\bar{x} = \frac{151 + 124 + 132 + 170 + 146 + 124 + 113}{7} = 137.14$$

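As a quick check of the arithmetic, the mean of the blood pressure data can be computed in a few lines. This is only a sketch; the names `bp` and `sample_mean` are illustrative.

```python
# Systolic blood pressures (mm Hg) of the seven men in the example.
bp = [151, 124, 132, 170, 146, 124, 113]


def sample_mean(xs):
    """Arithmetic mean: the sum of the observations divided by their number."""
    return sum(xs) / len(xs)


mean_bp = sample_mean(bp)
print(round(mean_bp, 2))  # 137.14
```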
The Median and Mode
If the sample data are arranged in increasing order, the median is
(i) the middle value if n is an odd number, or
(ii) midway between the two middle values if n is an even number
The mode is the most commonly occurring value.

Example 1: n is odd
The reordered systolic blood pressure data seen earlier are:
113, 124, 124, 132, 146, 151, and 170.
The Median is the middle value of the ordered data, i.e. 132.
Two individuals have systolic blood pressure = 124 mm Hg, so the Mode is 124.

Example 2: n is even
Six men with high cholesterol participated in a study to investigate the effects of diet on cholesterol level. At the beginning of the study, their cholesterol levels (mg/dL) were as follows:
366, 327, 274, 292, 274 and 230.
Rearrange the data in numerical order as follows:
230, 274, 274, 292, 327 and 366.
The Median is halfway between the middle two readings, i.e. (274 + 292) / 2 = 283.
Two men have the same cholesterol level, so the Mode is 274.

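The odd and even cases of the median, and the mode, can be sketched as follows. The helper names `median` and `mode` are illustrative; Python's standard `statistics` module offers equivalents, but writing the steps out shows the rule being applied.

```python
from collections import Counter


def median(xs):
    """Middle value of the sorted data, or midway between the two middle values."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:          # n odd: a single middle value
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2  # n even: average the two middle values


def mode(xs):
    """Most commonly occurring value (the first one found if there is a tie)."""
    return Counter(xs).most_common(1)[0][0]


bp = [151, 124, 132, 170, 146, 124, 113]      # n = 7 (odd)
cholesterol = [366, 327, 274, 292, 274, 230]  # n = 6 (even)

print(median(bp), mode(bp))                    # 132 124
print(median(cholesterol), mode(cholesterol))  # 283.0 274
```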
Mean versus Median
Large sample values tend to inflate the mean. This will happen if the histogram of the data is right-skewed.
The median is not influenced by large sample values and is a better measure of centrality if the distribution is skewed.
Note: if the data are symmetrical (with a single peak) then mean = median = mode
e.g. In the CK measurement study, the sample mean = 98.28 and the median = 94.5, i.e. the mean is larger than the median, indicating that the mean is inflated by the two large data values 201 and 203.

4. Measures of Dispersion
Measures of dispersion characterise how spread out the distribution is, i.e., how variable the data are.
Commonly used measures of dispersion include:
1. Range
2. Variance & Standard deviation
3. Coefficient of Variation (or relative standard deviation)
4. Inter-quartile range

Range
The sample Range is the difference between the largest and smallest observations in the sample
Easy to calculate
Blood pressure example: min = 113 and max = 170, so the range = 57 mmHg
Useful for best or worst case scenarios
Sensitive to extreme values

Sample Variance
The sample variance, $s^2$, is the sum of the squared deviations from the sample mean divided by $n - 1$ (approximately their arithmetic mean):

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Standard Deviation
The sample standard deviation, $s$, is the square root of the variance:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

$s$ has the advantage of being in the same units as the original variable $x$

Example

Data          Deviation      Deviation^2
151             13.86           192.02
124            -13.14           172.73
132             -5.14            26.45
170             32.86          1079.59
146              8.86            78.45
124            -13.14           172.73
113            -24.14           582.88
Sum = 960.0   Sum = 0.00     Sum = 2304.86

$\bar{x} = 137.14$

Example (contd.)

$$\sum_{i=1}^{7} (x_i - \bar{x})^2 = 2304.86$$

Therefore,

$$s = \sqrt{\frac{2304.86}{7 - 1}} = 19.6$$

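The worked example can be reproduced step by step. The sketch below follows the definitions above (squared deviations from the mean, divisor n - 1); the function and variable names are illustrative.

```python
import math

bp = [151, 124, 132, 170, 146, 124, 113]  # systolic blood pressure, mm Hg


def sum_of_squared_deviations(xs):
    """Sum of (x_i - mean)^2 over the sample."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)


def sample_variance(xs):
    """Sample variance s^2, using the n - 1 divisor."""
    return sum_of_squared_deviations(xs) / (len(xs) - 1)


ss = sum_of_squared_deviations(bp)
s2 = sample_variance(bp)
s = math.sqrt(s2)  # sample standard deviation, same units as the data

print(round(ss, 2))  # 2304.86
print(round(s, 1))   # 19.6
```

The deviations themselves always sum to zero (up to rounding), which is why they are squared before being added.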
Coefficient of Variation
The coefficient of variation (CV) or relative standard deviation (RSD) is the sample standard deviation expressed as a percentage of the mean, i.e.

$$CV = \frac{s}{\bar{x}} \times 100\%$$

The CV is not affected by multiplicative changes in scale
Consequently, it is a useful way of comparing the dispersion of variables measured on different scales

Example
The CV of the blood pressure data is:

$$CV = \frac{19.6}{137.1} \times 100\% = 14.3\%$$

i.e., the standard deviation is 14.3% as large as the mean.

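A short sketch of the CV calculation, which also illustrates why it is unaffected by a multiplicative change of scale. The conversion of mm Hg to kPa (factor 0.133322) is added here purely as an illustration and is not part of the original example; the function name is hypothetical.

```python
from statistics import mean, stdev  # stdev uses the n - 1 divisor


def coefficient_of_variation(xs):
    """Sample standard deviation as a percentage of the sample mean."""
    return stdev(xs) / mean(xs) * 100


bp_mmhg = [151, 124, 132, 170, 146, 124, 113]
# Illustrative change of units: 1 mm Hg is about 0.133322 kPa.
bp_kpa = [x * 0.133322 for x in bp_mmhg]

cv_mmhg = coefficient_of_variation(bp_mmhg)
cv_kpa = coefficient_of_variation(bp_kpa)

print(round(cv_mmhg, 1))  # 14.3
print(round(cv_kpa, 1))   # 14.3, unchanged by the change of units
```

Both the standard deviation and the mean are multiplied by the same factor, so the factor cancels in the ratio.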
Inter-quartile range
The Median divides a distribution into two halves.
The first and third quartiles (denoted Q1 and Q3) are defined as follows:
25% of the data lie below Q1 (and 75% lie above Q1),
25% of the data lie above Q3 (and 75% lie below Q3)
The inter-quartile range (IQR) is the difference between the first and third quartiles, i.e.
IQR = Q3 - Q1

Example
The ordered blood pressure data are:
113 124 124 132 146 151 170
Q1 = 124, Q3 = 151
Inter Quartile Range (IQR) is 151 - 124 = 27

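One way to reproduce Q1 = 124 and Q3 = 151 is to take the medians of the lower and upper halves of the ordered data, leaving out the middle value when n is odd. That convention is an assumption inferred from the worked values; statistical software often interpolates instead and may report slightly different quartiles. The function name `quartiles` is illustrative.

```python
from statistics import median


def quartiles(xs):
    """Q1 and Q3 as the medians of the lower and upper halves of the sorted data.

    When n is odd, the middle value is excluded from both halves.
    """
    s = sorted(xs)
    half = len(s) // 2
    lower = s[:half]
    upper = s[half + 1:] if len(s) % 2 else s[half:]
    return median(lower), median(upper)


bp = [151, 124, 132, 170, 146, 124, 113]
q1, q3 = quartiles(bp)
iqr = q3 - q1
print(q1, q3, iqr)  # 124 151 27
```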
60% of slides complete!

5. Box-plots
A box-plot is a visual description of the distribution based on
Minimum
Q1
Median
Q3
Maximum
Useful for comparing large sets of data

Example 1
The pulse rates of 12 individuals arranged in increasing order are:
62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80
Q1 = (68 + 70) / 2 = 69, Q3 = (76 + 78) / 2 = 77
IQR = 77 - 69 = 8

Example 1: Box-plot

Example 2: Box-plots of intensities from 11 gene expression arrays
[Figure: box-plots of intensities, one per array; y-axis ticks from 8 to 14; visible array labels include AG_04659_AS.cel, AG_11745_AS.cel, KB_5828_AS.cel and KB_8840_AS.cel]

Outliers
An outlier is an observation which does not appear to belong with the other data
Outliers can arise because of a measurement or recording error or because of equipment failure during an experiment, etc.
An outlier might be indicative of a sub-population, e.g. an abnormally low or high value in a medical test could indicate presence of an illness in the patient.

Outlier Boxplot
Re-define the upper and lower limits of the boxplots (the whisker lines) as:
Lower limit = Q1 - 1.5 × IQR, and
Upper limit = Q3 + 1.5 × IQR
Note that the lines may not go as far as these limits
If a data point is < lower limit or > upper limit, the data point is considered to be an outlier.

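The outlier rule can be applied to the serum CK data from earlier. This is a sketch under one assumption: quartiles are taken as the medians of the lower and upper halves of the sorted data (the convention that matches the pulse-rate example); other quartile conventions would shift the limits slightly. All names are illustrative.

```python
from statistics import median

# Serum CK concentrations (U/l) for the 36 volunteers.
ck = [121, 82, 100, 151, 68, 58, 95, 145, 64, 201, 101, 163,
      84, 57, 139, 60, 78, 94, 119, 104, 110, 113, 118, 203,
      62, 83, 67, 93, 92, 110, 25, 123, 70, 48, 95, 42]


def quartiles(xs):
    """Q1 and Q3 as medians of the lower and upper halves (assumed convention)."""
    s = sorted(xs)
    half = len(s) // 2
    lower = s[:half]
    upper = s[half + 1:] if len(s) % 2 else s[half:]
    return median(lower), median(upper)


q1, q3 = quartiles(ck)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Points beyond either limit are flagged as outliers.
outliers = sorted(x for x in ck if x < lower_limit or x > upper_limit)
print(q1, q3, iqr)               # 67.5 118.5 51.0
print(lower_limit, upper_limit)  # -9.0 195.0
print(outliers)                  # [201, 203]
```

The two flagged values are the same two large observations that inflate the CK sample mean.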
Example: CK data
[Figure: box-plot of the CK data with the outliers marked]

6. Scatter-plot
Displays the relationship between two continuous variables
Useful in the early stage of analysis when exploring data and determining if a linear regression analysis is appropriate
May show outliers in your data

Example 1: Age versus Systolic Blood Pressure in a Clinical Trial

Example 2: Up-regulation/Down-regulation of gene expression across an array (Control Cy5 versus Disease Cy3)

Example of a Scatter-plot matrix (multiple pair-wise plots)

Other graphical representations
Dot-Plots, Stem-and-leaf plots
  Not visually appealing
Pie-chart
  Visually appealing, but hard to compare two datasets. Best for 3 to 7 categories. A total must be specified.
Violin-plots
  = box-plot + smoothed density
  Nice visual of data shape

Multivariate Data
Clustering is useful for visualising multivariate data and uncovering patterns, often reducing its complexity
Clustering is especially useful for high-dimensional data (p >> n): hundreds or perhaps thousands of variables
Obvious areas of application are gel electrophoresis and microarray experiments, where the variables are protein abundances or gene expression ratios

7. Clustering
Aim: Find groups of samples or variables sharing similarity
Clustering requires a definition of distance between objects, quantifying a notion of (dis)similarity
Points are grouped on the basis of minimum distance apart (distance measures)
Once a pair is grouped, its members are combined into a single point (using a linkage method), e.g. take their average. The process is then repeated.

Clustering
Clustering can be applied to rows or columns of a data set (matrix), i.e. to the samples or variables
A tree, called a Dendrogram, can be constructed with branch lengths proportional to the distances between linked clusters
Clustering is an example of unsupervised learning: no use is made of sample annotations, i.e. treatment groups, diagnosis groups

UPGMA
Unweighted Pair-Group Method with Arithmetic Mean
One of the most commonly used clustering methods
Procedure:
1. Each observation forms its own cluster
2. The two clusters with minimum distance are grouped into a single cluster representing a new observation: take their average
3. Repeat step 2 until all data points form a single cluster

Contrived Example
5 genes of interest on 3 replicate arrays/gels

             Array1   Array2   Array3
p53             9        3        7
mdm2           10        2        9
bcl2            1        9        4
cyclinE         6        5        5
caspase 8       1       10        3

$$d_{xy} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2}$$

Calculate distance between each pair of genes
e.g. $d(\text{p53}, \text{mdm2}) = \sqrt{(9 - 10)^2 + (3 - 2)^2 + (7 - 9)^2} = \sqrt{6} \approx 2.45$

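The pairwise Euclidean distances for the five genes can be computed directly from the table. A minimal sketch follows; the dictionary keys are shortened gene labels chosen for this illustration, and `euclidean` is a hypothetical helper name.

```python
import math
from itertools import combinations

# Expression values on Array1, Array2, Array3 (from the table above).
genes = {
    "p53": (9, 3, 7),
    "mdm2": (10, 2, 9),
    "bcl2": (1, 9, 4),
    "cyclinE": (6, 5, 5),
    "caspase8": (1, 10, 3),
}


def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# One entry per unordered pair of genes (the upper triangle of the matrix).
distances = {
    (g1, g2): euclidean(genes[g1], genes[g2])
    for g1, g2 in combinations(genes, 2)
}

for pair, d in distances.items():
    print(pair, round(d, 2))

# The pair with the smallest distance is the first to be clustered.
closest = min(distances, key=distances.get)
print("closest pair:", closest)
```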
Example
Construct a distance matrix of all pair-wise distances

             p53    mdm2    bcl2   cyclinE   caspase 8
p53           0     2.45   10.44     4.12      11.36
mdm2          -      0     12.45     6.40      13.45
bcl2          -      -       0       6.48       1.41
cyclinE       -      -       -        0         7.35
caspase 8     -      -       -        -          0

Cluster the 2 genes with smallest distance
Take their average & re-calculate distances to other genes

After merging caspase-8 and bcl-2:

                       p53    mdm2   cyclin E   {caspase-8 & bcl-2}
p53                     0     2.45     4.12           10.89
mdm2                           0       6.40           12.94
cyclin E                                0              6.89
{caspase-8 & bcl-2}                                     0

After merging p53 and mdm2:

                       {p53 & mdm2}   cyclin E   {caspase-8 & bcl-2}
{p53 & mdm2}                0           5.24           11.90
cyclin E                                 0              6.89
{caspase-8 & bcl-2}                                      0

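The merge-and-recalculate procedure can be sketched as a small loop. This follows the steps as described on the slides (replace the closest pair by the average of their profiles, then recompute Euclidean distances); it is an illustrative simplification, not a reference implementation. Textbook UPGMA instead averages the pairwise distances between cluster members, which gives similar values for this data. The function name `cluster` and the merged-label format are assumptions made for the example.

```python
import math
from itertools import combinations

genes = {
    "p53": (9, 3, 7),
    "mdm2": (10, 2, 9),
    "bcl2": (1, 9, 4),
    "cyclinE": (6, 5, 5),
    "caspase8": (1, 10, 3),
}


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cluster(points):
    """Repeatedly merge the two closest clusters; return the list of merges.

    Each merged cluster is represented by the plain average of the two
    profiles being joined (the simplification used in the worked example).
    """
    clusters = dict(points)
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: euclidean(clusters[pair[0]], clusters[pair[1]]),
        )
        d = euclidean(clusters[a], clusters[b])
        merged = tuple((u + v) / 2 for u, v in zip(clusters[a], clusters[b]))
        del clusters[a], clusters[b]
        clusters[f"{{{a} & {b}}}"] = merged
        merges.append((a, b, round(d, 2)))
    return merges


merges = cluster(genes)
for a, b, d in merges:
    print(f"merge {a} with {b} at distance {d}")
```

The order of the merges, together with the distance at which each happens, is exactly the information a dendrogram displays.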
Example (contd.)
...and the final cluster:

Example of a gene expression dendrogram

Variety of approaches to clustering
Clustering techniques
  agglomerative: start with every element in its own cluster, and iteratively join clusters together
  divisive: start with one cluster and iteratively divide it into smaller clusters
Distance Metrics
  Euclidean (as-the-crow-flies)
  Manhattan
  Minkowski (a whole class of metrics)
  Correlation (similarity in profiles: called similarity metrics)
Linkage Rules
  average: Use the mean distance between cluster members
  single: Use the minimum distance (gives loose clusters)
  complete: Use the maximum distance (gives tight clusters)
  median: Use the median distance
  centroid: Use the distance between the average member of each cluster

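To make the linkage rules concrete, the sketch below computes the distance between two small clusters under several of them. The two clusters reuse the gene profiles from the contrived example ({p53, mdm2} and {bcl2, caspase 8}); the function `linkage_distance` and its `rule` argument are hypothetical names for this illustration.

```python
import math
from statistics import mean, median


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def linkage_distance(cluster_a, cluster_b, rule="average"):
    """Distance between two clusters of points under a given linkage rule."""
    pairwise = [euclidean(x, y) for x in cluster_a for y in cluster_b]
    if rule == "single":      # closest pair of members
        return min(pairwise)
    if rule == "complete":    # farthest pair of members
        return max(pairwise)
    if rule == "average":     # mean of all between-cluster distances
        return mean(pairwise)
    if rule == "median":      # median of all between-cluster distances
        return median(pairwise)
    if rule == "centroid":    # distance between the cluster averages
        centre_a = tuple(mean(c) for c in zip(*cluster_a))
        centre_b = tuple(mean(c) for c in zip(*cluster_b))
        return euclidean(centre_a, centre_b)
    raise ValueError(f"unknown linkage rule: {rule}")


p53_mdm2 = [(9, 3, 7), (10, 2, 9)]
bcl2_caspase8 = [(1, 9, 4), (1, 10, 3)]

results = {
    rule: linkage_distance(p53_mdm2, bcl2_caspase8, rule)
    for rule in ("single", "complete", "average", "median", "centroid")
}
for rule, d in results.items():
    print(f"{rule:>8}: {d:.2f}")
```

Single linkage can never exceed average linkage, which in turn can never exceed complete linkage, so the choice of rule directly changes which clusters look closest.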
Clustering Summary
The clusters & tree topology often depend highly on the distance measure and linkage method used
Recommended to use two distance metrics, such as Euclidean and a correlation metric
A clustering algorithm will always yield clusters, whether the data are organised in clusters or not!