Xử Lý Số Liệu Trong Phân Tích Dược - 01

XỬ LÝ SỐ LIỆU TRONG PHÂN TÍCH DƯỢC
(DATA PROCESSING IN PHARMACEUTICAL ANALYSIS)


COURSE OVERVIEW (GIỚI THIỆU HỌC PHẦN)
• Regular assessment (multiple choice): 1 test (15%)
• Attendance: roll call + presence at the regular test (5%)
• Final exam: written, open book, 60 minutes (3-4 questions per paper) (80%)
FUNDAMENTALS OF STATISTICS (KIẾN THỨC CƠ BẢN TOÁN THỐNG KÊ)
The Branches of Statistics
• Descriptive Statistics: collecting, summarizing, and presenting a set of data.
• Inferential Statistics: analyzing sample data to draw conclusions about a population. E.g.?

Start with a hypothesis and look to see whether the data are consistent with that hypothesis.
E.g. Content: 350 ± 30 (mg) vs. 470 ± 30 (mg), n = 3; 6; 10. Are the data consistent with the hypothesis?
The expected values for normal fasting blood glucose concentration are between 70 mg/dL (3.9 mmol/L) and 100 mg/dL (5.6 mmol/L). When fasting blood glucose is between 100 and 125 mg/dL (5.6 to 6.9 mmol/L), changes in lifestyle and monitoring glycemia are recommended.
E.g. Concentration: 4.4 ± 0.4 (mmol/L); 6.3 ± 0.5 (mmol/L); 7.3 ± 0.4 (mmol/L), n = 6; 10. What can be concluded about each result?
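As a sketch of the descriptive side of this example (the replicate glucose values below are hypothetical, not from the text), the mean and SD of a set of measurements can be computed with Python's statistics module and compared against the normal fasting range quoted above:

```python
import statistics as st

# Hypothetical replicate fasting glucose measurements (mmol/L)
replicates = [4.1, 4.6, 4.3, 4.8, 4.2, 4.4]

mean = st.mean(replicates)   # point estimate of the concentration
sd = st.stdev(replicates)    # sample standard deviation (n - 1 denominator)

# Normal fasting range from the text: 3.9 - 5.6 mmol/L
within_normal = 3.9 <= mean <= 5.6
print(f"{mean:.1f} ± {sd:.1f} mmol/L, within normal range: {within_normal}")
```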
Data presentation (Biểu diễn dữ liệu)
• Classifying and presenting the types of data: interval scale, ordinal scale, nominal scale
• Presenting significant figures
1. Data types
Set out a system for describing different
types of data.
Explain why we need to identify the type of
data with which we are dealing.
1.2 Interval scale data
• The steps are of an exactly defined size.
• All the steps are of exactly the same size.
1.3 Ordinal scale data
• The name ‘Ordinal’ reflects the fact that the various outcomes form an ordered sequence going from one extreme to its opposite. Such data is sometimes referred to as ‘Ordered categorical’. In this case the data is usually discontinuous.
1.4 Nominal scale data
In this case there is no sense of measuring a
characteristic; we use a system of
classifications, with no natural ordering. For
example, one of the factors that influences
the effectiveness of treatment could be the
specific manufacturer of a medical device.
Quite commonly there are just two
categories in use. Obvious cases are
Male/Female, Alive/Dead or Success/Failure.
In these cases, the data is described as
“Dichotomous”.
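A minimal Python sketch of the three scale types (the variable names and values are illustrative, not from the text): nominal categories have no order, ordinal categories can be ranked but not subtracted, and interval data support arithmetic on equal-sized steps.

```python
# Nominal: labels only; no ordering is meaningful
manufacturer = {"A", "B", "C"}            # a set: order is irrelevant

# Dichotomous nominal data: just two categories
outcome = "Success"                       # "Success" / "Failure"

# Ordinal: ordered categories; ranks are meaningful, differences are not
pain_scale = ["none", "mild", "moderate", "severe"]
rank = pain_scale.index("moderate")       # comparisons like < make sense

# Interval: exactly defined, equal steps; differences are real quantities
temps_c = [36.5, 37.1, 38.0]
step = temps_c[1] - temps_c[0]            # 0.6 degrees C
```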
2. Data presentation
• Describe the use of numerical tables, bar charts, pie charts, and scattergrams.
• Consider which type of data is appropriate for each method of presentation.
• Stress the importance of considering the type of readership at which a presentation is targeted (scientifically literate or general public?).
• Assess the strengths and weaknesses of each method.
2.1 Numerical tables
• Good: raw data available for further analysis.
• Bad: unfriendly and poor immediacy.
2.2 Bar charts and histograms
2.2.1 Simple bar charts
2.2.2 Three-dimensional bar charts
2.2.3 Stacked bar charts
2.2.4 Histograms
2.3 Pie charts
2.4 Scatter plots
2.4.1 Dependent versus independent variable
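Bar charts and pie charts both start from a frequency table. A small sketch (the category counts are invented for illustration) using Python's collections.Counter:

```python
from collections import Counter

# Hypothetical dichotomous nominal data: treatment outcomes for 10 patients
outcomes = ["Success", "Failure", "Success", "Success", "Failure",
            "Success", "Success", "Failure", "Success", "Success"]

freq = Counter(outcomes)                 # frequency table, ready for a bar chart
total = sum(freq.values())
shares = {k: v / total for k, v in freq.items()}  # proportions, ready for a pie chart
print(freq["Success"], shares["Failure"])
```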
DATA PRESENTATION
• Precision, standard deviation and uncertainty:
• Precision: variability of results obtained under stated conditions (e.g. repeatability: same conditions; reproducibility: changed conditions); expressed by the SD.
• Uncertainty: lack of knowledge of the true value.
The significant figures (or significant digits) of a number are the digits that are known with some degree of confidence.
Rules for significant figures:
• All non-zero digits are significant.
  E.g. 325 has 3 significant figures.
• Zeros between non-zero digits are significant.
  E.g. 1009 has 4 significant figures.
• Leading zeros are insignificant.
  E.g. 0.0005 has 1 significant figure.
• Trailing zeros in a number containing a decimal point are significant.
  E.g. 25.00 has 4 significant figures.
• Trailing zeros in a number not containing a decimal point can be either significant or insignificant. To indicate which zeros are significant, place a bar over the zeros that are significant, or state the number of significant figures in brackets.
  E.g. 2500 has 2 significant figures; 35000 has 3 significant figures when a bar marks the first trailing zero; 12000 has 4 when bars mark the first two zeros; 800 (2 sf) has 2 significant figures.
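These rules can be sketched as a small Python helper (my own illustration, not from the text); ambiguous trailing zeros without a decimal point are counted as insignificant here, matching the "2500 has 2 significant figures" convention above:

```python
def sig_figs(s: str) -> int:
    """Count significant figures in a decimal numeral given as a string.

    Trailing zeros without a decimal point are treated as insignificant,
    matching the '2500 has 2 significant figures' convention.
    """
    s = s.lstrip("+-")
    if "." in s:
        # Leading zeros never count; trailing zeros after the point do
        return len(s.replace(".", "").lstrip("0"))
    digits = s.lstrip("0")          # drop leading zeros
    return len(digits.rstrip("0"))  # drop ambiguous trailing zeros

print(sig_figs("25.00"))  # 4
```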
When writing an uncertainty, how many significant figures?
• SD: 2 significant figures.
• The measurement result can then be written to the same number of decimal places. For example: (1.123 ± 0.032) M.
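As a sketch of this convention (the helper name and the example inputs are my own), the SD is rounded to two significant figures and the mean is then quoted to the same number of decimal places:

```python
from math import floor, log10

def report(mean: float, sd: float) -> str:
    """Format 'mean ± SD' with the SD rounded to 2 significant figures
    and the mean quoted to the same number of decimal places."""
    exponent = floor(log10(abs(sd)))   # decimal position of the SD's first digit
    decimals = max(1 - exponent, 0)    # decimal places that keep 2 sig figs
    return f"{mean:.{decimals}f} ± {sd:.{decimals}f}"

print(report(1.12345, 0.0321))  # 1.123 ± 0.032
```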

Is the uncertainty reasonable? There is NO simple answer.
• The more repeats done, the smaller the uncertainty.
• There is some uncertainty in any figure below the 1% level, so it is sensible to report at most four significant figures, i.e. to 0.1%.
• The RSD should not be reported below 0.1%.


INTERVAL SCALE DATA
3. Descriptive statistics for interval scale data
• Review the use of the mean, median or mode to indicate how small or large a set of values is, and consider when each is most appropriate.
• Describe the use of the standard deviation to indicate how variable a set of values is.
• Show how quartiles can be used to convey information similar to that mentioned above, even in the presence of extreme outlying values.
• Discuss the problem of describing ordinal data.
3.1 Summarising data sets
3.2 Indicators of central tendency: Mean,
median and mode
• Should we automatically use the median in such
cases?
It would be an overgeneralization to suggest that in
every case where the data has outliers, the median is
automatically the statistic to quote; If we were
dealing with the cost of a set of items and the
intention was to predict the cost of future sets of
such items, the mean would be appropriate even if
there were outliers.

• The median is robust to extreme outliers. The term ‘Robust’ is used to indicate that a statistic or a procedure will continue to give a reasonable outcome even if some of the data is aberrant.
• Calculating the median where there is an even number of observations: take the average of the two middle values.
Selecting an indicator of central tendency
• The mean is the industry standard and is the most useful for a whole range of further purposes. Unless there are specific problems (e.g. polymodality or marked outliers), the mean is the indicator of choice. The median is met with quite frequently, and the mode (or modes) tends only to be used when all else fails.
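A quick illustration of the median's robustness to an outlier (the data are invented), using Python's statistics module:

```python
import statistics as st

# Hypothetical assay results (mg); 9.8 is an aberrant outlier
data = [4.1, 4.3, 4.2, 4.4, 4.2, 9.8]

mean = st.mean(data)      # dragged upward by the outlier
median = st.median(data)  # barely affected: average of the two middle values
print(f"mean = {mean:.2f}, median = {median:.2f}")
```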
3.3 Describing variability – standard deviation and coefficient of variation
• Reporting the SD – the ‘±’ symbol
• Units of the SD: the SD is not a unitless number.
The coefficient of variation
• The precision of an HPLC analysis for blood imipramine levels can be expressed as the coefficient of variation.
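The coefficient of variation is the SD expressed as a percentage of the mean, which makes precision comparable across methods with different units. A sketch (the imipramine replicate values are hypothetical, not from the text):

```python
import statistics as st

# Hypothetical replicate HPLC results for blood imipramine (ng/mL)
results = [151.0, 148.0, 153.0, 149.0, 150.0, 152.0]

mean = st.mean(results)
sd = st.stdev(results)
cv_percent = 100 * sd / mean   # coefficient of variation (relative SD), unitless
print(f"CV = {cv_percent:.2f}%")
```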
3.4 Quartiles – Another way to describe data
• The second quartile (Median) as an indicator of
central tendency
• The inter‐quartile range as an indicator of
dispersion
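Both quartile-based indicators can be computed with statistics.quantiles (the data are invented; method="inclusive" treats the data set itself as the population of interest):

```python
import statistics as st

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

q1, q2, q3 = st.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1   # inter-quartile range: dispersion robust to extreme values
print(q2, iqr)  # q2 is the median
```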
3.5 Describing ordinal data
• The mean is not generally considered appropriate.
• The median will not always work either.
• The mode is too unstable for general use: "... it would be impossible to give a general recommendation for use of the modes. A particular problem is that modes are very unstable, that is, if we repeated this relatively small trial..."
• Dispersion in ordinal data
Bar charts will give a visual impression of the
variability in a set of ordinal data. The only
statistic that can really be quoted as a
measure of dispersal is the interquartile range.
In summary – How can we describe ordinal data?
• There is no single universal answer.
4. The normal distribution
• Describe the normal distribution.
• Suggest visual methods for detecting data that does not follow a normal distribution.
• Describe the proportion of individual values that should fall within specified ranges.
• Explain the terms ‘Skewness’ and ‘Kurtosis’.
4.1 What is a normal distribution?
4.2 Identifying data that are not normally
distributed
• Many of the statistical techniques only work
properly if the data being analysed follow a
normal distribution
• There are a number of statistical tests that
will allegedly help in deciding whether data
deviate from a normal distribution.
• ... what will be suggested is a simple visual inspection of histograms of the data, with the specific features that indicate non-normality highlighted.
Problem 1. Data with a polymodal distribution
Problem 2. Data that are severely skewed
Problem 3. Data that are sharply truncated
above and/or below the mean
4.3 Proportions of individuals within 1SD or 2SD
of the mean
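These proportions follow directly from the normal curve; Python's statistics.NormalDist gives them without recourse to tables:

```python
from statistics import NormalDist

z = NormalDist()                     # standard normal: mean 0, SD 1
within_1sd = z.cdf(1) - z.cdf(-1)    # proportion within 1 SD of the mean (~68%)
within_2sd = z.cdf(2) - z.cdf(-2)    # proportion within 2 SD of the mean (~95%)
print(f"{within_1sd:.1%}, {within_2sd:.1%}")
```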
4.4 Skewness and kurtosis
The nature of tests for normal distribution
• Shapiro–Wilk, Ryan–Joiner, Anderson–Darling and Kolmogorov–Smirnov tests.
How can these tests mislead us?
• Failure to detect serious non-normality in a small data set: even gross non-normality may not produce a significant result if the data set is small.
• Trivial non-normality within a large data set leading to a significant outcome: ANOVAs and t-tests are pretty robust, and moderate non-normality won't distort their outcomes to an extent that need be of practical concern. However, if the data set is large enough, even the most trivial departure from normality will be detected.
• Markedly non-normal, but the data set is too small to achieve significance.
• A large data set that produces a significant result despite being only marginally non-normal.
5. Sampling from populations: the standard error of the mean
• Distinguish between samples and populations.
• Describe the way in which sample size and the SD jointly influence random sampling error.
• Show how the Standard Error of the Mean (SEM) can be used to indicate the quality of a sampling scheme.
5.1 Samples and populations
5.3 Types of sampling error
• If we want to determine average alcohol consumption
among the citizens of a given city, we might recruit our
subjects by giving out questionnaires in a bar in the
town centre. The average at which we would arrive is
pretty obviously going to be higher than the true
average for all citizens of the city – very few of the
town’s teetotallers will be present in the bar.
• A drug causes raised blood pressure as a side‐effect in
some users and we want to see how large this effect is.
We might recruit patients who have been using the
drug for a year and measure their blood pressures. In
this case the bias is less blindingly obvious, but equally
inevitable. We will almost certainly understate the
hypertensive effect of the drug as all the most severely
effected patients will have been forced to withdraw
from using the drug. We will be left with a select group
who are relatively unaffected.
If we were to repeat the same sampling
procedure several times, we could pretty
much guarantee that we would make an
error in the same direction every time. With
the drink survey, we would always tend to
over‐estimate consumption; and with the
hypertension study, we would consistently
under‐estimate the side‐effect.
Bias arises from flaws in our experimental
design.
We can remove the bias by improving our
experimental design (always assuming that
we recognise the error of our ways!).
• Now take an essentially sound experimental
design. We want to determine the average
elimination half‐life of a new anti‐diabetic
drug in type II diabetics aged 50–75, within
Western Europe. We recruit several hospitals
scattered throughout Western Europe and
they draw up a census of all the
appropriately aged, type II diabetics under
their care. From these lists we then
randomly select potential subjects.
Over- and under-estimation are equally likely. Even the best designed experiments are subject to random error. It is impossible to get rid of random error.
What factors control the extent of random sampling error when estimating a population mean?
• Sample size
• Variability within the data
5.5 Estimating likely sampling error – the SEM
• For Table 5.1 (contrasting small versus large sample sizes)
• For Table 5.2 (contrasting more variable versus less variable data)
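The SEM combines both factors above: SEM = SD / sqrt(n), so it shrinks as the sample grows and grows as the data become more variable. A sketch in the spirit of Tables 5.1 and 5.2 (the numbers are invented to contrast the two cases):

```python
from math import sqrt

def sem(sd: float, n: int) -> float:
    """Standard error of the mean: SD / sqrt(n)."""
    return sd / sqrt(n)

# Larger sample -> smaller SEM (same SD), as in Table 5.1
small_n = sem(sd=2.0, n=4)    # 1.0
large_n = sem(sd=2.0, n=100)  # 0.2

# More variable data -> larger SEM (same n), as in Table 5.2
low_sd = sem(sd=1.0, n=25)    # 0.2
high_sd = sem(sd=5.0, n=25)   # 1.0
print(small_n, large_n, low_sd, high_sd)
```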
