HANDBOOK OF POLYMER TESTING
PLASTICS ENGINEERING
Founding Editor
Donald E. Hudgin
Professor
Clemson University
Clemson, South Carolina
edited by
ROGER BROWN
Rapra Technology Ltd.
Shawbury, Shrewsbury, England
MARCEL DEKKER
Headquarters
Marcel Dekker, Inc.
270 Madison Avenue, New York, NY 10016
tel: 212-696-9000; fax: 212-685-4540
The publisher offers discounts on this book when ordered in bulk quantities. For more information,
write to Special Sales/Professional Marketing at the headquarters address above.
Neither this book nor any part may be reproduced or transmitted in any form or by any means,
electronic or mechanical, including photocopying, microfilming, and recording, or by any
information storage and retrieval system, without permission in writing from the publisher.
It is essential for design, specification, and quality control to have data covering the
physical properties of materials. It is also essential that meaningful data are obtained by
using test methods relevant to the materials. The different characteristics and behavior of
materials dictate that particular test procedures be developed, and often standardized, for
each material type. Polymers, especially, have unique properties that require their own
measurement techniques.
There is a wide range of polymers from soft foams to rigid composites for which
separate industries have developed. Each has its own individual test methods and, for
the major types of polymers, texts exist that detail these procedures. There are, however,
many similarities between different polymer types and frequently it is necessary for
laboratories to consider a spectrum of materials. Consequently, there are advantages in a book
that comprehensively covers the whole polymer family, describing the individual methods
as well as discussing the approaches taken in different branches of the industry.
Handbook of Polymer Testing provides in one volume that comprehensive coverage of
physical test methods for polymers. The properties considered cover the whole range of
physical parameters, including mechanical, optical, electrical, and thermal as well as
resistance to degradation, nondestructive testing, and tests for processability. All the
main polymer classes are included: rubbers, plastics, foams, textiles, coated fabrics, and
composites. For each property, the fundamental principles and approaches are discussed,
and the particular requirements and the relevant international and national standards for the
different polymer classes are considered, together with the most up-to-date techniques.
This book will be of particular value to materials scientists and technologists, and to
all those who need to evaluate a spectrum of polymeric materials, including students,
design engineers, and researchers. Its structure allows reference for the main properties
at both the general and the detailed level, thus making it suitable for different levels of
knowledge.
Chapter 29 is based on material produced for the "Testing Knowledge Base" at Rapra
Technology, Ltd. Extracts from British Standards were reproduced with the permission of
BSI. Users of standards should always ensure that they have complete and current infor-
mation. Standards can be obtained from BSI Customer Services, 389 Chiswick High
Road, London W4 4AL, England.
The other contributors and I gratefully acknowledge the support, information, and
helpful advice given by our colleagues during the preparation of this book.
Roger Brown
Contents
Preface iii
1. Introduction 1
Roger Brown
4. Standardization 105
Paul P. Ashworth
6. Conditioning 141
Steve Hawley
Index 841
1
Introduction
Roger Brown
Rapra Technology Ltd., Shawbury, Shrewsbury, England
The physical properties of materials need to be measured for quality control, for predicting
service performance, to generate design data, and, on occasions, to investigate failures.
Without test results there is no proof of quality and no hope of successfully designing new
products. The group of materials classed as polymers generally have complicated behavior,
and as much as or more than with any material it is critical that their properties be evaluated,
and evaluated in a meaningful way. Their characteristics are such that methods used for
other materials such as metals or ceramics will not usually be suitable. There are also
distinct differences between the classes of materials that make up the polymeric group,
from flexible fabrics and soft foams through solid rubbers and thermoplastics to very rigid
thermosets and composites. Consequently, it is no surprise that particular procedures have
been developed and standardized to suit the needs of each material class.
There are excellent texts that deal in great detail with test methods for rubber and for
plastics, etc. In dealing with one field they recognize the unique requirements for each class
of materials and emphasize the particular procedures that have been standardized in each
industry. It is no criticism of such texts to say that by concentrating on a restricted scope
they do not bring out the similarities and the common themes that run through the testing
of all polymers. It is relatively recently that, rather than metallurgists and plastics
technologists, etc., the material scientist or technologist has emerged with an important role.
There is great need, for technical and commercial reasons, for many companies to consider
and use a spectrum of materials. For these interests, a book that covers the fundamentals
and the latest techniques for testing the whole polymer family will offer many advantages
to students, design engineers, researchers, and those who need to evaluate a wide range of
products and materials.
Broadly stated, the scope of this book is the physical testing of polymers. Polymers
have been taken to include rubbers, plastics, cellular materials, composites, textiles, and
coated fabrics: all the materials generally considered to make up the polymer industry
with the exception of adhesives. A great many adhesives are polymeric, but it is considered
that treatment of adhesive testing does not fit well with physical testing of the main
polymer classes and requires its own volume. The standardized adhesion tests for solid
polymers adhered to themselves or other substrates are, however, included.
Physical testing is used in its literal sense and hence does not include chemical analysis.
The distinction between physical and chemical is perhaps not completely clear-cut, in that
aging and chemical resistance are generally considered as physical tests but clearly involve
monitoring the effects of chemical changes. Thermal analysis, for example, straddles both
camps, and particular techniques have either been included or excluded depending on their
purpose.
The aim of this book is to present an up-to-date account of procedures for testing
polymers, indicating the similarities and the differences between the approaches taken for
the different materials. Within the restrictions mentioned above, it is intended to be
comprehensive. Hence it sets out to cover all the physical properties from dimensional
through mechanical, thermal, electrical, etc., to chemical resistance, weathering, and
nondestructive testing. In addition to all these tests on the formed material or product, processability
tests are also included. The focus is on testing materials rather than on finished products.
Indeed, the vast number of tests, many ad hoc, devised for evaluating performance of the
multitude of products made from polymers, would fill a volume, even supposing the
subject could be coherently treated. Comment on product testing, however, is made
where appropriate. It should also be noted that many tests used for products are adaptations
of the normal material tests, for example stress-strain properties on plastic film
products and geomembranes. Even more widely, it is the usual practice to cut test pieces
for standard material tests from such products as hose, conveyor belting, and containers.
A rather novel structure has been used, which is designed to give a progressive path
for the bulk of commonly measured properties, from background principles, through basic
established practice, to the particular requirements of the different materials and then to
less common and more advanced techniques. It hence allows ready reference at different
levels and largely avoids the complications of dealing with the details of different
procedures for several materials in one place.
The basic structure consists of five sets of chapters. Chapters 1 through 7 cover general
topics, sample preparation, conditioning, accuracy, reproducibility, etc., all materials
being considered together.
Processability tests differ from all the other physical properties included in the book
by virtue of being concerned with properties of relevance to the forming of materials and
not the performance of the finished material or product. Chapter 8 deals with the
processability tests in two parts, for rubbers and plastics respectively.
Chapters 9 through 14 are resumes of the principles and the basic approaches taken
for the more commonly tested parameters.
In Chapters 15 through 20 the particular requirements of each of the classes of polymeric
materials covered are considered in more detail, including reference to the standardized
procedures. The scope of properties covered is essentially the same as in Chapters 9
through 14.
The remaining chapters address selected topics. The topics have been chosen for one
or more of three reasons: it is convenient to cover all polymer classes together; the
parameters are not those most commonly measured; or the subject is of particular topical
interest.
The practicality of this for the reader is that if the subject of interest is in Chapters 1
through 8 or 21 through 32 then selection of the relevant chapter will find the main
coverage of the subject. For other properties, the procedures for a particular polymer
class can be found by selection of the appropriate chapter in the group 15 through 20.
If the principles of the more common tests and comparison of the approaches for different
polymer classes are required, then consult Chapters 9 through 14. It is suggested that these
chapters be read before the subsequent chapters, especially if the reader is relatively new to
polymer testing. It is also essential that the requirements for test piece preparation,
conditioning, and dimensional measurement covered in Chapters 5, 6, and 7 be considered in
conjunction with all the procedures discussed later.
All reasonable effort has been taken to make the book integrated rather than a series
of independent chapters by different contributors. Inevitably there will be some overlap
and repetition, but it is believed that this, and the relative complexity of the structure, is
outweighed by the confusion that could result from trying to weld discussion of common
tests for contrasting polymer types. It is inevitable also that there should be differences in
style adopted by the different authors, which perhaps illustrates that testing can be
approached in more than one way.
The emphasis is on standard test procedures, which by definition are those that have
become widely accepted. Where standardized methods exist, they should be used for
quality control purposes and for obtaining general material property data, to ensure
compatibility between results from different sources. It is counterproductive to invent
alternative procedures when satisfactory and well-tried methods exist, and doing so prevents
meaningful comparisons of data from being made. It has to be accepted that many
standard methods have severe limitations for the generation of design data, but nevertheless
they can often form a good basis for producing more extensive information.
Unfortunately, standard tests are not completely standard, in that different countries
and organizations each have their own standards. The situation has been steadily improving
in recent years as more national standards bodies adopt international methods, and
this is a trend that we should all encourage. In this book the ISO (and for electrical tests
the IEC) standards, together with those of two of the leading national English-language
standards-making organizations, the ASTM and the BSI, are considered, plus the
European regional (CEN) standards. In a great many cases British standards are identical
with ISO standards, but ASTM standards are at the very least editorially different. British
standards will always be identical with CEN standards where these exist and, in turn, CEN
standards are often identical with ISO.
It is not possible to claim that every type of test known for every property has been
included, but, within the defined scope, any omission is by accident rather than design. It is
also likely that not every standard from the standards bodies covered will have been
referenced. Standards are continually being developed and revised, so it can be
guaranteed that between writing and publication there will have been some changes;
thus it is essential that the latest catalogs from the ISO, etc., be consulted for the most
up-to-date position.
The apparatus needed for tests is considered in conjunction with the test procedures,
but in many cases it is not an easy matter to select from the range of apparatus available in
differing levels of sophistication or, indeed, to be able to find any supplier at all. The Test
Equipment and Services Directory (published by Rapra Technology, England, in hard
copy and on CD) contains both advice on selection and a comprehensive guide to
instrument suppliers.
2
Putting Testing in Perspective
Ivan James
Forncet, Wem, Shropshire, England
1 Philosophy
As a generality, technical people want to test to obtain knowledge, whereas commercial
people will test only when there is some pressure to do so. In an age of cost-cutting and
streamlining of production, it may seem that testing is an unnecessary expense, but the
reverse is true, since alongside an awareness of cost has grown an increasing customer
awareness of quality. The consequences that arise when testing is omitted are illustrated by
the following examples.
Some years ago a colleague who served on several SI committees once attempted to
buy a radio while he was abroad. The young assistant took one off the shelf, switched it
on, and it didn't work. She then unpacked one from its box just as it had arrived from the
factory, and that one didn't work either. Eventually she found one that did and was
surprised when he declined to buy it. What astounded him was that these products
could be made and packed without any testing whatever to prove fitness for purpose
until they reached the point of sale.
However, testing the product may not be sufficient, since in a complex product such
as a radio, the reliability of the assembly depends on the reliability of each of the
individual components. This was brought home to a supplier who asked a buyer what
failure rate he would accept. The curt answer "zero" failed to convince him, and 0.1%
was suggested. "No, zero," was the reply. "But that implies testing every component,"
said the supplier. "Exactly," answered the buyer. In practice, of course, virtually zero
reject levels can be achieved without 100% testing by extremely tight control of the
process, but the story illustrates the point. The increasing demand for product quality
brings in its train a requirement for component reliability, and that implies component
testing also.
Brown [1] has previously suggested that as well as these two reasons for testing, two
others can be listed, namely tests to establish material properties for design data and tests
to establish reasons for failure if a product proves to be unsatisfactory in service.
Polymers are complex materials, and aspects of their behavior are sometimes unexpected.
For this reason, tests on polymers need to be well chosen and wide ranging in order
to avoid embarrassing failures. It is important to establish early on that the grade of
material chosen fully matches the design criteria for the product. For example, a plastic
component, although initially of adequate strength, may on constant exposure to
detergents suffer from environmental stress cracking.
Examination of failed products or components is related to this, and testing may
reveal that the material did not meet the designer's specification or show that some
important property, such as creep, has been overlooked. The coating applied to surgeons'
gloves, for example, may be more important than the composition of the rubber.
Summarizing, then, and following the approach suggested by Brown, there are four
main areas of testing, namely:
1. Quality control
2. Predicting service performance
3. Design data
4. Investigating failures
Before undertaking any tests, and before considering which properties to measure, it is
essential to identify the purpose of testing, because the requirements for each of the
purposes are different. Failure to appreciate this can lead to time-wasting tests that do
not yield the required results. Similarly, a lack of understanding as to why another person
is carrying out particular tests can lead to misunderstanding and argument, say, between
the research department and the quality control department in a factory.
To the various attributes related to testing procedures (precision, reproducibility,
rapidity, and complexity) may be added the ability for tests to be automated and the
desirability for tests to be nondestructive. The balance of these various attributes, and
the related cost, differs according to the purpose of the test that is undertaken. These will
be considered in turn, but in all cases the precision and reproducibility must be
appropriate to the tests undertaken.
which is known. In turn the normal force depends on the dimensions of the two
components and the modulus of the stopper. However, for a quality control test (and a
performance test) all that is needed is that limits be set on the upper and lower levels of force
required to extract the stopper. Knowledge of the individual parameters is not necessary.
This is an example of a functional test.
Staying with this frictional analogy, the inclined plane method of measuring friction
would give an apparent coefficient of friction, since conditions are not tightly controlled
(velocity, for example, cannot be specified) and there is little chance of relating the result to
other conditions.
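For reference, the apparent coefficient from the inclined plane follows directly from the tilt angle θ at which sliding just begins:

$$\mu_{\text{apparent}} = \tan \theta$$

a single figure obtained under the one uncontrolled sliding condition at which slip occurred.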
If, on the other hand, the requirement is to measure the coefficient of friction between
two materials for design purposes, then shape, surface finish, normal load, velocity,
temperature, cleanliness, and humidity all become important parameters needing to be
controlled. Furthermore, this illustrates the shortage of truly fundamental tests in which the
rules for extrapolating to other conditions are well known, as in this case it would
probably be necessary to produce multipoint data.
These three types of test can be loosely related to the purpose of testing.
In establishing design data, it is mostly fundamental properties that are needed, but
these are in short supply. Many thermal and chemical tests are fundamental in nature, but
most mechanical tests give apparent properties. In the absence of established and verified
procedures for extrapolating results to other conditions, multipoint data have to be
produced at defined levels of all the parameters likely to influence the test result.
Consequently, reliable tables of properties for designers are difficult and expensive to
establish.
Standard test methods giving apparent properties are best suited to quality control,
and only in relatively few cases are they ideal for design data. Quality control tests are the
most easily established, and many existing methods fulfil this need. In seeking an
improvement in test procedure it is not always a more accurate test that is required. Depending on
the purpose for which testing is undertaken, it may be quicker or cheaper tests that are
required, or most important of all, tests that are relevant to service performance.
For predicting service performance, the most suitable tests would be functional ones.
For investigating failures, the most useful tests depend on the particular circumstances,
but fundamental mechanical methods are unlikely to be needed.
3 Trends
Because of the different reasons for testing and the consequent differences in test
requirements, developments in test methods do not follow one path. The basic themes are
constant enough: people want more efficient tests in terms of time and money, better
reproducibility, and tests more suited to design data and more relevant to service
performance. However, the emphasis depends on the particular individual needs.
In recent years, the drive towards international standards has led to a close
examination of long-established test methods, and it has been found that the reproducibility of
many of the tests was poor. This in turn has not led to new tests but rather to the
establishment of better standardization of test procedures. There has also been a growing
realization of the need to calibrate test equipment with proper documentation of calibra-
tion procedures and results.
Where different test methods associated with different countries have been in use for a
long time, it has sometimes been difficult to reach an acceptance of one method as a
standard test procedure. In these cases it has been necessary to present 'Method A' and
'Method B' as equally acceptable. Similarly, different test conditions have been allowed,
perhaps taking account of the difference between temperate and tropical conditions.
Although in the local environment this may be quite satisfactory, it leads to difficulties
if the results are presented in a database and used in a wider context. Figures presented in
a database all need to be produced in exactly the same way, and consequently there has
been a lobby for extremely tight standards, with no choice of method or test conditions,
specifically to yield completely comparable data for presentation in a database. Admirable
though this approach may seem, it has to be recognized that the freedom of action
previously allowed enabled the tests to be used over a wider range of industrial conditions
than that envisaged by those setting up databases.
Automation and, in particular, the application of computers to control tests and
handle the data produced have brought about vast changes in recent years. It is not
only a matter of automation saving time and labor; it also influences the test techniques
that are used. For example, these developments have allowed difficult procedures to
become routine and hence increased their field of application. There are many examples
of tests that would not exist without certain instrumentation, thermal analysis techniques
being one of the more obvious. Advances in instrumentation for an established test may
change the way in which it is carried out but do not generally change the basic concept or
change it to produce more fundamental data.
Whether automatic or advanced instrumentation really saves money is difficult to say.
Initially the equipment costs more, but this is offset by a saving in labor. However, the old
adage that if a thing can go wrong, it will, remains true, and maintenance costs of complex
equipment are high. Finally, the calibration of such equipment can be difficult, and the
software that so readily transforms the data can give rise to concern as to what has
happened between the transducer and the final output.
While improvements in tests in respect of their usefulness for generating design data
and predicting service performance are continually sought, the advances have perhaps
been less dramatic. The fundamental tests needed for design are often very difficult to
devise and are likely to be more expensive to carry out and required only by a minority. As
with so many things, the advances can be related to commercial pressures and the amount
of effort that is funded. Where better and more fundamental tests do exist they are not
always used as often as they should be because of the cost and complexity involved.
There has been an increase in tests on products, which has resulted from a greater
demand to prove product performance and from specifications more often including such
tests as part of the requirements.
For the future, it is highly probable that the same themes will continue. The quality
movement is still strong, and the generation of databases will probably ensure that greater
compatibility is achieved. Certainly there will be further developments in instrumentation
and the handling of data. It would be a brave person who predicted a surge in tests for
better design data, but there are signs that the sophistication of markets will lead to wider
needs in this direction.
4 Test Conditions
Under the broad heading of test conditions should be included the manner of preparation
of the material being tested and its storage history, as well as the more obvious parameters
such as test temperature, velocity of test, etc. While it is recognized that the result obtained
depends on the conditions of the test, it is not always obvious that some of these
conditions may have been established before the samples were received for testing. Sometimes
the history of the samples is part of the test procedure, as in aged and unaged samples for
example, but at other times it may not be at all clear that certain "new" samples are
already several months old, with their intervening history unknown. Degradative
influences such as the action of ozone on rubber samples cannot be compensated for, but
standard conditioning procedures are designed, as far as possible, to bring the test pieces
to an equilibrium state. The imposition of a standard thermal history before measuring the
density of a crystalline polymer is a good example.
In some cases, conditioning may involve temperature only, but where the material is
moisture sensitive it is likely that a standard atmosphere involving control of both
temperature and humidity will be called for. Occasionally, other methods of conditioning,
such as mechanical conditioning, are used, as will be discussed in a later chapter.
Even with careful conditioning, however, the results produced from specimens
manufactured by different methods may vary, and if there is to be a controlled comparison it is
important that the test pieces be prepared in exactly the same way. This is particularly
important for figures being presented in databases. For example, laboratory samples of a
rubber prepared on a mill may differ considerably from factory materials prepared in an
internal mixer, and often these differences are not sufficiently emphasized in tables of data.
Equally, test piece geometry is important, and again, if comparison is to be made, a
standard and specified geometry should be adhered to. Rarely is it possible to convert
from one geometry to another, since polymers are complex materials and the influence of
the various test parameters is often nonlinear. For example, it is difficult to scale up gas
transmission results obtained on thin sheets to thick sheets of the same material.
For these various reasons the simulation of service behavior is at best difficult and
often impossible. There are numerous examples of long-term tests over 20 or more years
that have shown that artificial ageing using heat or other means yields results that are
significantly different from those obtained with the passage of time. There has to be an
awareness of the limitations of any test procedure and an acknowledgement that the
results obtained apply only to the narrow range of conditions under which the test was
performed. For these reasons, with important and complex products such as tires, it is
often necessary to test them under the exact conditions under which they will be used.
Test procedures require careful attention to detail, as small and apparently innocent
deviations can produce significant changes in results. This implies that the test conditions
need to be accurately set initially and then monitored throughout the test. Sometimes it
arises that when testing according to a published standard some deviation from the set
procedure cannot be avoided (perhaps because of a limitation on the amount of material
available). In these cases such deviations should always be recorded. In any test report it is
important to state quite clearly which procedure has been followed.
have been done. Thought must be given to the design of the experiment in the very
beginning, and if help from a statistician is needed it should be brought in at that stage.
Because of the importance of applying statistical principles to test results, the subject is
comprehensively covered in Chapter 3.
Brief mention was made of the precision of tests as judged by interlaboratory trials,
and sometimes the quoted level of precision seems relatively poor. Usually the laboratories
taking part in these trials are experienced, and the precision levels quoted should be
representative of good practice. Poor figures may indicate snags with that method but,
whatever the quoted levels, there is no reason to suppose that it is "only the others" who
get divergent results. Interlaboratory comparisons sometimes lead to the elimination of
poor test procedures, and so bring about improved accuracy, as will be discussed later.
However, no measurement is exact, and there is always some uncertainty. Calibration
laboratories are required to make uncertainty estimates for all their measurements, and in
the future it may be that all accredited testing laboratories will also have to do so. This
involves estimating the uncertainty introduced by each factor in the measurement and is
not at all easy to do. At the very least, it is essential to be conscious of the order of
magnitude of the range within which the "true" result lies.
6 Sampling
Efficient sampling means selecting small quantities that are truly representative of a much
larger whole, and the significance of test results is closely related to the efficiency of
sampling.
Often, in the laboratory, one is limited by the amount of material available, and at
least there is then the excuse that the tests relate only to the material available at the time.
In a factory, where the whole output is available, the problem is a different one. Here the
quality control manager has to decide not only what is adequate, but also what is
reasonable, bearing in mind the production schedule and the profitability of the operation.
The frequency of sampling and the number of test pieces (or repeat tests) per item
sampled depend on circumstances, and obviously financial considerations play an
important part. Certain long-winded (and expensive) tests call for one test piece only, although if
multiple tests are done the method may be quite variable. The use of a single test piece is
hardly satisfactory, but it may be that multiple tests in numbers sufficient to increase
precision are totally uneconomic. This is the dilemma that quality control managers (and
the writers of specifications) have to face. In a continuous quality control scheme it may be
that the number of test pieces at each point is less important than the frequency of testing.
Where multiple test pieces are available, an odd number is advantageous if a median
is to be taken, and five seems to be the preferred number. This is just about large enough to
make a reasonable statistical assessment of variability. However, the current range of
standard methods is not consistent, and numbers between one and ten or more may be
called for.
The essence of efficient sampling is that the small quantity selected and tested (the
sample) be truly representative of the much larger whole. The test pieces should be
representative of the sample taken, the sample representative of the batch, and the batch
representative of the wider population of material. In many cases, this information is
not known to the tester, but there should be awareness of the limitations of the results
in this respect, and the best possible practice should be followed in selection of samples
and test pieces. This may include blending of several batches, randomizing the positions
from which test pieces are cut, and testing on test pieces cut in more than one direction.
Care should be taken when sampling from production that items be taken at random, and
that the time at which samples are taken does not always coincide with some factor such as
a shift change.
7 Quality Control
Quality control embraces the monitoring of incoming materials, the control of the
manufacturing processes, and checks of materials and products produced, so as to ensure and
maintain the quality of the output from the factory. Physical testing methods are
important in this regime, and most of the standardized test methods are intended for quality
control use; it is probable that the majority of tests carried out are undertaken in the first
place for quality assurance purposes. However, this book is about testing and is not a
quality control manual, so discussion here is restricted to the quality control of the testing
process.
Quality control is often thought of as applying only to products, since products affect the
lives of the entire population. However, those of us who work in laboratories must recognize
that correct and reproducible results are in a sense products, and that the application
of quality control to test laboratories is designed to improve the general reproducibility of
all test results.
Reliable results can only come from a laboratory where the apparatus, the procedures,
and the staff are all subject to a quality assurance system. ISO 9000 standards are applied
in a wide context to various companies, and their laboratories will be included under the
general umbrella of such a system, but a more focused scheme for test and calibration
laboratories may be found in ISO Guide 25 [3] and national equivalents. These standards
cover not only the calibration of equipment and the control of test pieces but also the
training of staff, an item tending to be overlooked in the general context of quality control.
The requirements listed set a high standard, and it has to be recognized that maintenance
of this standard is time-consuming and difficult. In the UK the accreditation of
laboratories is entrusted to the United Kingdom Accreditation Service (UKAS). Similar
organizations may be found in other countries, and some of these bodies have mutual
recognition agreements.
Undoubtedly the most expensive item in any system of laboratory control is the
calibration of equipment. All test equipment should be calibrated, and every parameter
relating to that machine requires formal calibration. For example, it is easy to see that the
force scale and speed of traverse of a tensile machine need calibrating, but it is less obvious
that the cutting dies for test pieces also need calibrating in order to ensure that the test
pieces conform to specification.
Calibration is based on the principle of traceability from a primary standard through
intermediate or transfer standards. A good example of a transfer standard would be boxes
of certified weights that are not in general use but whose sole purpose is to check the
accuracy of those that are in use.
Obviously, at each stage of measurement there is some degree of uncertainty, and
estimates of this uncertainty form part of the calibration procedure. It is perfectly
acceptable for a laboratory to carry out its own calibrations, provided it maintains
appropriate calibration standards and operates a suitable quality system. However, it is often
more convenient to buy in calibration services. Wherever possible the calibration
laboratory used should be accredited (UKAS or equivalent).
Calibration of apparatus in the polymer industry has to some degree been hampered
by the lack of definitive guidance, but a British standard has been developed covering the
Calibration of Rubber and Plastics Test Equipment [4]. This explains the principles of
calibration and gives details of the parameters to be calibrated and the frequency required,
together with an outline of the procedure to be used for all rubber test methods listed in
the ISO system.
The ASTM gave the lead in conducting systematic interlaboratory trials, and this has
been followed by the ISO and others. The variability obtained was far greater than was
expected, and in some cases it was so bad that it was doubtful whether certain tests were
worth doing at all. These interlaboratory comparisons and the drive towards improved
quality led to an abandonment of the complacent attitude that had formerly existed and
stimulated various initiatives to improve the situation.
On the whole, variability arises from malpractice rather than from a poorly expressed
standard, but if an interlaboratory trial reveals an excessive variability it is first necessary
to pinpoint the problem before a standards committee can correct it. Unfortunately this is
a slow and expensive procedure.
The demand for higher quality has produced pressures to make laboratory
accreditation commonplace, and as more laboratories reach this status it must be expected that
reproducibility will improve. The calibration of test machines, training, documentation of
test procedures, sample control, and formal audits all have an enormous influence, and the
discipline involved in maintaining an accredited status helps to minimize mistakes and
maintain reproducibility. International agreements undoubtedly widen the scope of
accreditation schemes and ensure uniform levels of accreditation. This is found to have
an influence on the standard of laboratories with a consequent improvement in
interlaboratory comparisons.
The essential requirements of any piece of test equipment are that it should satisfy the
requirements laid down in the standard relating to the test method under consideration
and that it should be properly calibrated. Convenience of use or the cost of running the
tests are not items that can be specified, but nevertheless they play a dominant role in the
selection of equipment. Increasingly computer control and data handling are becoming
standard.
The manipulation of data by computer is a particularly difficult operation to monitor,
since in a busy laboratory it is only too easy to accept the software as correct in all
circumstances. Obviously the accuracy of the quoted results depends not only on the
accuracy of the original measurements but also on the validity of the data handling.
Some standards bodies are now developing specifications giving rules and guidance on
software verification.
These changes in the basic concepts of laboratory testing bring with them both
advantages and disadvantages. While it is obvious that automation brings with it a saving
in staff time, perhaps enabling measurements to continue with the apparatus attended only
periodically rather than continuously, it is not clear what the effect on accuracy or
reproducibility will be. Noncontact extensometers, for example, ensure that there are
no unwanted stresses on the test piece, but the accuracy is related to the parameters
built into the extensometer (e.g., the response time in following the recorded signal). It
is important not to assume that more complex equipment necessarily means improved
accuracy, although it is frequently true. A simple example may illustrate the difficulty.
Increasingly doctors are using electronic sphygmomanometers to measure blood pressure.
Here the end points still rely on the doctors' skill in detecting a pulse, but rather than
reading the height of a mercury column, a pressure transducer gives a direct digital
readout. This gives a degree of confidence that is absent from a mercury manometer, but the
defects in the system are hidden. There is an obvious need for calibration, which may go
unrecognized in a busy surgery, but also the question of linearity of response is crucial. It
is easy to look critically at the equipment used in a different discipline, but the same
principles should apply in our own laboratories.
References
1. Brown, R. P., Physical Testing of Rubber, Chapman and Hall, London, 1996.
2. BS 903, Part 2, Guide to the application of statistics to rubber testing, 1997.
3. ISO Guide 25, General requirements for the competence of calibration and testing laboratories,
1990.
4. BS 7825, Parts 1-3, Calibration of rubber and plastics test equipment, 1995.
3
Quality Assurance of Physical Testing
Measurements
Alan Veith
Technical Development Associates, Akron, Ohio
1 Introduction
Measurement and testing play a key role in the current technologically oriented world.
Decisions for scientific, technical, commercial, environmental, medical, and health
purposes based on testing are everyday occurrences. The intrinsic value and fidelity of any
decision depends on the quality of the measurements used for the decision process. Quality
may be defined in terms of the uncertainty in the measured values for a specified test
system; high quality corresponds to a small or low uncertainty. Quality is contingent
upon whether the operational system is simple or complex. The equipment, the procedure,
the operators, the environment, the decision process itself, and the importance of these
decisions are all part of the system. A lower quality can be tolerated for less important
routine technical decisions than for decisions that have large commercial or financial
implications. Measurement and testing for fundamental and applied research and
development and also for producer-user transactions are important elements that are part of a
larger organized effort that is frequently called a technical project. Measurement and
testing play a key role in all technical projects, and the assurance that the output data
from any technical project are of the highest quality, consistent with the stipulated goals
and objectives, is of paramount importance in technical project organization.
Quality for a test system has two major components: (1) how well the measured
parameters relate to the properties that are involved in the decision process and (2) the
magnitude of the uncertainty of the measured parameter value or values; the higher the
uncertainty the lower the quality. The first component is usually more complex, since it
involves scientific expertise and some subjective judgments. If the measured parameters are
not highly related to the decision process properties, a fundamental scientific uncertainty
exists. The second component, the measurement uncertainty, is somewhat easier to address;
[Flowchart: Plan → Project Model → Response Model → Sampling Protocol → Measurement
System Calibration → Measurement and Data Report → Precision Model → Analysis and
Evaluation → Solution]
procedure and protocol, (3) setting up a defined capability measurement system with an
appropriate calibration procedure, (4) performing the measurements and reporting the
data, followed by (5) analysis and evaluation, which may make use of a test result
uncertainty model to arrive at a solution. All of these must be of the highest quality for the
successful execution of a complex project, especially if the project involves interlaboratory
testing.
attention are overall project organization, goals, and objectives, resources and constraints,
performance criteria, the selection of the measurement methodology, and the selection of
decision procedures. A set of well designed and coordinated standard operating
procedures (SOP) must be selected and put into place. All the elements of the project required
for a successful solution need to be specified.
A model may be defined as the simplified representation of a defined physical system.
The representation is developed in symbolic form and is frequently expressed in
mathematical equations and uses physical and/or chemical principles based on scientific
knowledge, experimental judgment, and intuition, all set on a logical foundation. A model may
be theoretical or empirical, but the formulation of an accurate model is a requirement for
the successful solution of any problem.
Planning Model
A planning model is a generalized statement of the strategy to be used in the solution of a
problem; it involves the selection of a coordinated set of SOPs and an assembly of the
required system elements to arrive at a solution. Planning models are more descriptive in
nature and are not as rigorous as response and analysis models.
Response Model
A response or analytical model is a mathematical formulation that describes one or more
complex measurement operations that are part of a particular project. Once the test
methodology has been selected, the response model should be constructed based on the
performance parameters of the system and the independent variables that influence these
performance parameters. This usually involves three steps: (1) formulation of the model,
(2) calibration of the model, and (3) validation of the model. There are three important
actions in the formulation: (i) identify the most important variables or factors that
influence a selected response, (ii) develop a proper symbolic or mathematical representation for
these variables, and (iii) identify the degree of interaction among the variables. Variables
may be selected on the basis of theoretical principles or on an empirical approach using
correlation and regression or principal components analysis techniques. Systematic
experimental designs may be employed to formulate an empirical model; see Section 6.6.
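As a minimal sketch of the empirical route, assuming illustrative data rather than values from any real project, a univariate response model can be formulated by least-squares regression:

```python
# Hedged sketch: formulating a simple empirical (univariate) response
# model by least-squares regression. All data values are assumed.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # independent variable levels
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])      # measured responses

slope, intercept = np.polyfit(x, y, 1)  # calibrate the linear form y = slope*x + intercept
y_hat = slope * x + intercept           # model predictions at the tested levels
residuals = y - y_hat                   # inspect residuals to validate the model
print(slope, intercept, residuals)
```

The fit, prediction, and residual steps correspond loosely to the formulation, calibration, and validation actions described above.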
The specified number and character of performance parameters and variables, i.e., the
operational conditions, is defined as a testing domain. Simple domains for any project may
require only a univariate model, while complex project measurement systems may require
multivariate models. Some projects may have multiple response parameters, each of which
may require a multivariate (independent) variable model. The calibration and validation
operations are discussed below in Section 2.3.
Test Uncertainty Model
This is a model that may be used to relate the variation in the measured parameter(s) to
sources of variation that are inherent in any testing operation. This is discussed in more
detail in Section 8 on precision, bias, and uncertainty in laboratory testing.
projects will involve some form of iteration involving successive measurement operations
to arrive at a satisfactory solution.
Random deviations: these are + and − differences about some central value that may be a
true or a reference value; each execution of the test gives a specific difference, and the
mean value of these differences is zero for a long-run series of repetitive measurements.
Bias (or systematic) deviations: these are offsets or constant differences from the true
value; these offsets may be + or − and are frequently unique for any particular set
of test conditions.
Random and bias deviations are usually additive. There are four concepts that are applied
to measurement variation: precision, bias, accuracy, and uncertainty. Frequently these are
incorrectly used in an interchangeable manner. One of the main purposes of this chapter is
to distinguish between these and show how each may be used correctly. Precision is
defined in terms of the degree of agreement for repeated measurements under specified
conditions and is customarily assumed to be caused by one or more random processes.
High precision implies close or good agreement. Bias has not been addressed and
investigated to the extent that has been devoted to precision. But as Section 8 of this chapter
will show, bias is very important in testing and needs more attention to improve test
quality. Accuracy is a concept that involves both precision and bias deviations. High
accuracy implies that the sum of both types of deviation is low, or in an ideal situation,
zero. The term uncertainty is of more recent origin and may be used in two different
contexts as previously described. The relationship among these variation concepts is
discussed in Section 8; in Annex B, which gives a statistical model for testing; and in Annex C
on the evaluation of bias.
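A minimal sketch consistent with this additive description (the full statistical model for testing is the subject of Annex B) is

$$x = \mu_0 + B + \varepsilon, \qquad E[\varepsilon] = 0$$

where μ₀ is the true or reference value, B is the bias offset for the particular set of test conditions, and ε is the random deviation whose long-run mean is zero.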
This section gives some of the more elementary statistical parameters that may be used
to characterize and analyze data and discover the underlying relationships among
variables that may be hidden by the overlaid variation or noise. Although everything discussed
in this section is available in standard texts, a review of the more elementary statistical
principles is presented to address the basic problems of measurement quality. This is given
All three of these perturbations will be discussed in succeeding sections of this chapter.
The mean (frequently known as the arithmetic average), variance, and standard deviation
can be realized in two ways: (1) as a true parameter value based on extensive measurement
or other knowledge of the entire population, in which case these parameters are designated
by the symbols μ, σ², and σ respectively, or (2) as estimates of the true values based on
samples from the population, in which case they are designated by the symbols x̄, S², and S
respectively. The equations to calculate the mean, variance, and standard deviation as
estimates from a sample are
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \qquad (6)$$

$$S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \qquad (7)$$

$$S = \left[\frac{\sum (x_i - \bar{x})^2}{n - 1}\right]^{1/2} \qquad (8)$$
where $x_i$ = any data value, n = the total number of sample data values, and Σ indicates
summation over all values. The degrees of freedom df for Eqs. 7 and 8 is (n − 1). The
degrees of freedom is the number of independent differences available for estimating the
variance or standard deviation from a set of data; it is one less than the total number of
data values n, since one degree of freedom (of the total degrees equal to n) is used to
estimate the mean.
The majority of statistical calculations and resulting decisions are based on sample
estimates for x̄, S², and S. In certain circumstances the true values are known and slightly
different procedures are used for statistical decisions. The standard deviation gives an
estimate of the dispersion about the central value or mean in measurement units. For a
normal distribution the interval of ±σ about the mean μ will contain 68.3 percent of all
values in the population; the ±2σ interval contains 95.5 percent, and the ±3σ interval
contains 99.7 percent. A relative or unit-free indicator of the dispersion is the coefficient of
variation:

$$\text{Coefficient of variation} = CV = \frac{S}{\bar{x}} \qquad (9)$$
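As a runnable sketch of Eqs. 6 through 9 (the data values are assumed purely for illustration):

```python
# Minimal sketch of Eqs. 6-9: sample mean, variance, standard
# deviation, and coefficient of variation. Data values are assumed.
data = [49.8, 50.2, 50.1, 49.6, 50.3]

n = len(data)
mean = sum(data) / n                                # Eq. 6
var = sum((x - mean) ** 2 for x in data) / (n - 1)  # Eq. 7, df = n - 1
std = var ** 0.5                                    # Eq. 8
cv = std / mean                                     # Eq. 9, unit-free
print(mean, var, std, cv)
```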
material. The pooling of these individual estimates will give a good overall testing variance
estimate. A procedure is given in the Statistical Analysis section to determine if the
variances obtained on a number of test materials can be considered as equal.
A second approach is the accumulation of individual variance estimates, each with
only a few degrees of freedom, on one or more reference materials or objects over a period
of time. The reference material should have the same measured property magnitude and
the same general test response as the experimental materials if the estimated variance is to
be used for decisions on the experimental materials. Both approaches may be used for a
more comprehensive effort.
The process of pooling or averaging the individual variance estimates (each with only
a few degrees of freedom) is equivalent to a weighted average calculation. Thus the pooled
variance is obtained from the sum of each individual variance estimate S²(i) multiplied by
the number of values in its replicate set $n_i$, divided by the sum of the individual numbers of
values in the replicate sets. The number of degrees of freedom df attached to the pooled
variance is equal to the sum of the individual df of each replicate data set.
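Written out with each estimate weighted by its degrees of freedom (the conventional form, which coincides with the description above when all replicate sets are the same size):

$$S^2_{\text{pooled}} = \frac{\sum_i (n_i - 1)\, S^2(i)}{\sum_i (n_i - 1)}, \qquad df = \sum_i (n_i - 1)$$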
n     Factor
2     0.886
3     0.591
4     0.486
5     0.430
6     0.395
7     0.370
8     0.351
9     0.337
10    0.325
in more detail below. Thus the probability that x will take on a value less than or equal to
a is given by finding the probability P in a standardized normal distribution table for a
value of Z = (a − μ)/σ. Annex A Table A1 lists the values of Z and associated "areas" for
each value of x expressed in terms of a difference from μ in σ units, where the area is equal
to the probability that Z will take on a value less than or equal to the specified Z. These
areas or probabilities are left-hand oriented, i.e., they begin at −∞. From the table, the
probability that a Z value is as low as −2.00 is 0.0228; only two chances out of a hundred.
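The left-hand oriented areas of Table A1 are values of the standard normal cumulative distribution function, which written in full is

$$P(Z \le z) = \Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^{2}/2}\, dt, \qquad \Phi(-2.00) = 0.0228$$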
A frequent use of Table A1 is the selection of a value of z for a specified probability
P = α to make certain statistical inferences. The $z_\alpha$ notation is used for this purpose,
and two separate examples of its use are as follows:

$$P(Z < -z_\alpha) = \alpha \qquad (14)$$

$$P(Z > z_\alpha) = 1 - \alpha \qquad (15)$$

The first of these expressions, Eq. 14, states that the probability that Z will fall in the
region or area from −∞ to $-z_\alpha$ is α. In the second application in the use of $z_\alpha$, the
probability that Z is equal to or greater than $z_\alpha$ is obtained by difference. First, the
probability that Z will fall in the range −∞ to +∞ is equal to the entire area
under the curve, or 1. The probability that Z will be equal to or less than $z_\alpha$ is equal to the
area from −∞ to $z_\alpha$. The area to the right of $z_\alpha$ is the difference between these two areas, or
1 − α.
Another application of the use of specific tabulated z values is in forming an interval.
Intervals are discussed in more detail in Section 3.6 below. If α is divided into two equal
regions at either extreme end of the distribution, then

$$P(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1 - \alpha \qquad (16)$$

which states that the probability that Z will be found in the region from $-z_{\alpha/2}$ to $z_{\alpha/2}$ is
also 1 − α, since the original area has been cut in half and $-z_{\alpha/2}$ and $z_{\alpha/2}$ are defined as the
two points on the z axis that cut off areas of α/2 at each end. Equations 15 and 16 both
have probabilities equal to 1 − α, but the z values are different in the two situations. Each
of the two areas is equal to α/2 at either end of the distribution. Equation 13 can be used
for the mean of n sample values $\bar{x}_{(n)}$ from a population, where x in the Z calculation
expression is replaced by $\bar{x}_{(n)}$ and σ is replaced by $\sigma/\sqrt{n}$.
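With that substitution, the standardized variable for a sample mean becomes

$$Z = \frac{\bar{x}_{(n)} - \mu}{\sigma/\sqrt{n}}$$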
The Z-distribution and Eq. 13 are applicable when both the population mean and the
variance are known. When the variance must be estimated from a sample, the Z-distribution
and the proportion of sampling values that fall within the − and + limits as given
above no longer apply. In such circumstances a distribution called Student's t-distribution
is used, and t is a random variable defined by Eq. 17:

$$t = \frac{\bar{x}_{(n)} - \mu}{S/\sqrt{n}} \qquad (17)$$

which applies to problems where the population mean is known or where a selected x value
is to be compared to $\bar{x}_{(n)}$ for a decision on a significant difference.
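A brief sketch (assuming scipy is available) of how the t critical values exceed the z value for small samples and approach it as the degrees of freedom grow:

```python
# Hedged sketch: two-sided 95% critical values of Student's t versus
# the normal z value, showing the penalty for estimating the variance.
from scipy.stats import norm, t

z_crit = norm.ppf(0.975)  # two-sided 95% z value, about 1.96
for n in (3, 5, 10, 30):
    df = n - 1  # one degree of freedom is used to estimate the mean
    print(f"n={n:2d}  t={t.ppf(0.975, df):.3f}  z={z_crit:.3f}")
```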
or significance level, the null hypothesis is rejected. The general expression for the probability
that a random normal variable x will take on a value between a and b is

$$P(a < x < b) = CA(b) - CA(a) \qquad (19)$$

where

P(a < x < b) = the probability that x will fall between a and b
CA(a) = the cumulative area under the standard normal z-distribution curve up to the value a
CA(b) = the cumulative area under the standard normal z-distribution curve up to the value b (both areas from −∞)

In this example, the z value of 2.3 is equivalent to a, and b is taken as +∞. Table A1 reveals
that CA(2.3) = 0.9893. This is the probability of finding a z value in the range −∞ to
2.3. The cumulative area or probability CA(b) of finding a z value less than +∞ is of
course exactly 1.0. Thus

$$P(a < x < b) = 1.000 - 0.9893 = 0.0107 \approx 0.011 \qquad (20)$$

The calculated probability of 0.011 is substantially less than 0.05, and the null hypothesis
is rejected and the alternative hypothesis accepted. Hypothesis testing may be applied to
any statistical parameter (t-distribution, F-distribution, etc.) for which a sampling distribution
may be calculated or otherwise evaluated.
An alternative method for reporting the results of significance calculations that has
gained acceptance is to use the calculated probability as an indicator of the weight of
evidence or the strength of an assertion about the parameter of interest. Instead of adopting
a critical P = α, with α = 0.05 or other, and making a yes or no decision to reject the
tentative null hypothesis, the calculations are made for P(calc), and this is used to indicate
the decisiveness of the act of potential rejection of the null hypothesis. For this procedure,
P is defined as the probability of committing a Type I error if the actual sample (measured)
value of the statistic is used as the rejection value. It is the smallest level of significance to
reject the null hypothesis based on the sample at hand and is usually called the attained or
empirical significance level. Many statistical software programs give these calculated P
values as output.
Confidence Intervals
The calculated values from any sample are considered point estimates. Any such estimate may be close to the true value of the population parameter (μ, σ, or other), or it may vary substantially from the true value. An indication of the interval around this point estimate, within which the true value is expected to fall with some stated probability, is called a confidence interval, and the lower and upper boundary values are called the confidence limits. The probability used to set the interval is called the level of confidence. This level is given by (1 - α), where α is the probability, as discussed above, of rejecting a null hypothesis when it is true. In most circumstances means are the most important point estimates, and confidence intervals for means are evaluated at some probability P = (1 - α) that the true population mean is within the stated confidence limits. This can be expressed for a population with a known standard deviation σ as given in Eq. 21:

P[x̄(n) - z_{α/2}(σ/√n) < μ < x̄(n) + z_{α/2}(σ/√n)] = (1 - α)   (21)
where
x̄(n) = a mean evaluated from a sample of n values
z_{α/2} = the z value at P = α/2
σ/√n = the standard error (standard deviation of means of n)
μ = true mean of the population

This equation states that the confidence interval is defined as the difference between the two limits about the point estimate of the mean x̄(n), i.e., from the lower limit x̄(n) - z_{α/2}σ/√n to the upper limit x̄(n) + z_{α/2}σ/√n. This difference is designated as the (1 - α)100% confidence interval. The true mean μ is a fixed number; it has no distribution, and it is either in the interval or it is not. The interpretation of the value (1 - α)100% is as follows: if the experiment of drawing a sample of n values from this population is repeated a large number of times, then in the long run (1 - α)100% of the intervals so constructed will contain the true value μ.
Confidence intervals may alternatively be formulated in terms of a factor k_con selected so that the calculated interval covers the mean μ a certain percent (proportion) of the time:

Con Interval = ±k_con σ/√n   (22)

where Con Interval is the confidence interval at a selected P for means of samples of size n. As an example, suppose that a sample of n = 4 is drawn from a population with σ = 1.5 and the estimated mean is 8.9. The 95% confidence interval (α = 0.05), where 1.96 is the z value at α/2 = 0.025 and 1.5/2 = 0.75 is the standard error of means of 4, is given by

95% Con Interval = ±1.96(1.5/2) = ±1.96(0.75) = ±1.47

or

95% Con Interval = [8.9 - 1.47] to [8.9 + 1.47] = 7.43 to 10.37, a width of 2.94   (23)
For a situation where the standard deviation is not known but must be estimated from a sample in the same manner as the mean, the t-distribution applies, and the probability expression is

P[x̄(n) - t_{α/2}S/√n < μ < x̄(n) + t_{α/2}S/√n] = (1 - α)   (24)

The lower limit is x̄(n) - t_{α/2}S/√n and the upper limit is x̄(n) + t_{α/2}S/√n. As an example, if S = 1.8 as calculated from the sample of 4 (df = 3), x̄(n) = 8.9, and (1 - α)100 = 95% or α = 0.05, the confidence interval is found using the tabulated t value of 3.18, which is found at P = 0.025 for df = 3:

95% Con Interval = [8.9 - 3.18(1.8/2)] to [8.9 + 3.18(1.8/2)] = 6.04 to 11.76, a width of 5.72, or ±2.86

This confidence interval is almost twice that of the previous example because the standard deviation, as well as the estimated mean, is known only to df = 3.
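Both worked intervals above are easy to reproduce; the minimal sketch below uses the same numbers (n = 4, mean 8.9, with σ = 1.5 known or S = 1.8 estimated).

    import numpy as np
    from scipy.stats import norm, t

    n, mean = 4, 8.9

    # Known sigma = 1.5: z-based interval of Eqs. 21-23.
    sigma = 1.5
    half_z = norm.ppf(0.975) * sigma / np.sqrt(n)     # 1.96 x 0.75 = 1.47
    print(f"z interval: {mean - half_z:.2f} to {mean + half_z:.2f}")

    # Sigma estimated as S = 1.8 with df = 3: t-based interval of Eq. 24.
    S = 1.8
    half_t = t.ppf(0.975, df=n - 1) * S / np.sqrt(n)  # 3.18 x 0.9 = 2.86
    print(f"t interval: {mean - half_t:.2f} to {mean + half_t:.2f}")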
Tolerance Intervals
The word tolerance is used in a number of ways in testing and measurement technology.
Engineering and design tolerances are usually designated as upper and lower limits on
certain dimensions or other numerical factors for an object or product. Tolerances can
also apply to the number of significant figures or digits to retain in a measurement. A third
type of tolerance is concerned with the percentage of population values falling within some
specified limits, and this type is considered here. In explaining this kind of tolerance it is
important to distinguish it from confidence intervals or limits.
Confidence intervals provide a value for the region of uncertainty about an estimated population parameter (a point value), usually a mean, with a certain degree of confidence. Frequently it is desirable to obtain an interval that will cover a fixed proportion or percentage of the population with a specified confidence. Such intervals are called tolerance intervals, and the two endpoints are called tolerance limits. Tolerance intervals can also be formulated in terms of a factor, k_tol, and the estimated standard deviation S:

Tol Interval = ±k_tol(S)   (25)

where k_tol is selected so that the interval will encompass a proportion p of the population with a stated confidence. As an example, if x̄(n) = 14.0, S = 1.5, and n = 15, the tolerance interval that will contain 99 percent of the population 95 percent of the time is given by referring to Table A3, where the confidence level is given by γ. Thus for p = 0.99, γ = 0.95, and n = 15, the tabulated value for k_tol is 3.88, and

Tol Interval = ±3.88(1.5) = ±5.82

or

x̄(n) = 14.0 ± 5.82 = 8.18 to 19.82   (26)
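Tabulated k_tol factors can also be approximated in software. The sketch below uses Howe's approximation for the two-sided tolerance factor, an assumption on our part (the chapter takes k_tol directly from Table A3), but it reproduces the tabulated 3.88 for this example.

    import numpy as np
    from scipy.stats import norm, chi2

    def k_tol(n, p=0.99, gamma=0.95):
        # Howe's approximation to the two-sided tolerance factor.
        z = norm.ppf((1 + p) / 2)             # covers a proportion p
        c2 = chi2.ppf(1 - gamma, n - 1)       # lower chi-square quantile
        return np.sqrt((n - 1) * (1 + 1 / n) * z**2 / c2)

    xbar, S, n = 14.0, 1.5, 15
    k = k_tol(n)                              # ~3.88, as in Table A3
    print(f"k_tol = {k:.2f}; interval = {xbar - k*S:.2f} to {xbar + k*S:.2f}")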
The distinction between confidence intervals and tolerance intervals is that confidence
intervals refer to estimates of the population statistics (usually the mean) while tolerance
intervals are concerned with proportions or fractiles of the population. Thus the term
tolerance as used here should be distinguished from the frequently used tolerance in
engineering design for dimensions and other factors in the construction or manufacture
of some object or structure.
The variance for x₁, x₂, etc. refers to individual measurement values in any population. With simple linear relationships for the function, the differentials become constants, and the equation for S_Y² takes the form

S_Y² = a₁²S_{x1}² + a₂²S_{x2}² + ···   (29)

For the simplest linear form for two variables, a sum or difference relationship is given by

Y = x₁ + x₂  or  Y = x₁ - x₂   (30)

and the value for S_Y² is

S_Y² = S_{x1}² + S_{x2}²   (31)

since the differentials are unity. Thus the act of adding or subtracting two measured values, each having a variance associated with its measurement, substantially increases the variance of the sum or difference. If both x₁ and x₂ have the same variance, the variance of the sum or difference is twice the individual variance.

With any functional form beyond a sum or difference, the variance of Y is influenced by the values of the differentials. For a ratio or quotient,

Y = x₁/x₂   (32)

the variance of Y is given by Eq. 33, and the evaluation of S_Y² has to be made at some selected values for x₁ and x₂:

S_Y² = (1/x₂²)S_{x1}² + (x₁²/x₂⁴)S_{x2}²   (33)
When mean values are used for the x variables, the variances of the means of the x variables, designated as S_{x̄i}², should be used, according to Eq. 34, where n is the number of values used for the mean of xᵢ:

S_{x̄i}² = S_{xi}²/n   (34)

Under these conditions, wherever Y appears in the variance expression, a mean value for Y is to be used.
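The propagation rules of Eqs. 31 and 33 can be verified numerically. In the sketch below the measured values and standard deviations are illustrative assumptions, and a Monte Carlo simulation serves as a check on the first-order formulas.

    import numpy as np

    x1, x2 = 10.0, 5.0        # measured values (illustrative)
    s1, s2 = 0.3, 0.2         # their standard deviations (illustrative)

    var_sum = s1**2 + s2**2                                    # Eq. 31
    var_ratio = (1 / x2**2) * s1**2 + (x1**2 / x2**4) * s2**2  # Eq. 33

    rng = np.random.default_rng(0)
    a = rng.normal(x1, s1, 200_000)
    b = rng.normal(x2, s2, 200_000)
    print(var_sum, np.var(a + b))      # both ~0.13
    print(var_ratio, np.var(a / b))    # both ~0.010 (first-order approx.)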
The importance of the first requirement is obvious: unstable systems are not acceptable for testing. The second requirement is the basis for conducting statistical tests where independence and randomness are assumed for probabilistic conclusions. When the first two of these specifications are met, the measurement system is in a state of statistical control. The third specification relates to how well the sampling is done.
The second set of conditions is related to this sampling operation and the test speci-
mens derived from the samples. The sampling procedure should
Be conducted on a stable (nonchanging) population
Produce individual samples appropriately selected and independent of each other
The phrase "appropriately selected" refers to the different types of sampling that can be
conducted; this topic is discussed in more detail in Section 5 on sampling principles. The
importance of these sampling requirements is self evident; both are necessary for proper
statistical analysis. When these sampling conditions are met, the sampling system is said to
be in a stable condition or in statistical control.
The attainment of these required characteristics is often not straightforward.
Conformance with the requirements is usually obtained by a twofold process. First, sub-
stantial experience with the system and attention to important technical details is required.
Second, certain statistical diagnostic tests may be used, the most important being control
charts, which are defined in Section 7, using standard or reference materials subjected to
the same testing protocol as the experimental materials. Independence of individual measurements can be compromised if there is any carryover effect or correlation between one test measurement (sample or specimen) and the next. Calibration operations, to be discussed later, can also be a source of problems in test measurement independence if they are not conducted in an organized or standardized manner.
Sensitivity
This is related to the ability to detect small differences in the measured property and/or the fundamental inherent property. Sensitivity has been defined in quantitative terms for physical property measurements by Mandel and Stiehler [2] as

Sensitivity = K/s(m)   (35)

where
K = the slope of the relationship between the measured parameter m and the inherent property of interest Q, where Q = f(m)
s(m) = the standard deviation of the measurement m
Sensitivity is high when the precision is good, i.e., s(m) is small, and when K is large. An example will clarify the factor K. The percentage of bound styrene in a butadiene-styrene polymer may be evaluated by a fairly rapid test, the refractive index. A curve of refractive index vs. bound styrene, with the styrene measured by an independent but more complex reference test, establishes an empirical relationship between the styrene content and the refractive index. Over some selected bound styrene range, the curve has a slope K, and this value divided by the precision standard deviation s(m) gives the sensitivity in this range. For polymers of this type the refractive index sensitivity may be compared to the sensitivity of alternative quick methods, such as density, by evaluating K and s(m) for each technique.
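A minimal sketch of such a comparison follows; the data, slopes, and repeatability standard deviations are illustrative assumptions, with K estimated as the slope of the measured parameter against the reference bound styrene values.

    import numpy as np

    bound_styrene = np.array([20.0, 22.5, 25.0, 27.5, 30.0])  # reference test, %

    # Assumed responses of two quick methods at those styrene levels.
    refractive_index = np.array([1.5345, 1.5370, 1.5395, 1.5420, 1.5445])
    density = np.array([0.9330, 0.9345, 0.9360, 0.9375, 0.9390])

    def sensitivity(m, q, s_m):
        K = np.polyfit(q, m, 1)[0]     # slope of m vs. inherent property Q
        return abs(K) / s_m            # Eq. 35

    print(sensitivity(refractive_index, bound_styrene, s_m=0.0002))
    print(sensitivity(density, bound_styrene, s_m=0.0004))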
Useful Range
This is the range over which there is an appropriate instrument response to the property being measured. Appropriate response is expressed in terms of two categories: (1) the presence of a linear relationship between instrument output and the level of the measured property, and (2) precision, bias (uncertainty), and sensitivity at an acceptable level.
Ruggedness Testing
Frequently there is the need to determine if a test is reasonably immune to the perturbing
effects of variation in the operating conditions such as ambient temperature, humidity,
electrical line voltage, specimen loading procedures, and other ordinary operator influ-
ences. A procedure called ruggedness testing is conducted by making systematic changes in
the factors that might influence the test and observing the outcome. Such testing is fre-
quently conducted as a new test is being developed and fine tuned for special purposes or
routine use. It can also be used to evaluate suspected operational factors for standardized
methods if environmental or other factors for conducting the test have changed.
A series of systematic changes are made on the basis of fractional factorial statistical
designs, which are discussed in more detail in Section 6. The early work was done by
Plackett and Burman [3], Youden [4], and Youden and Steiner [5]. These designs are quite
efficient. The most popular design evaluates the first-order or main effect of seven factors
in eight tests or test runs. One important caveat in using these designs is that the second-order and all higher-order interactions of the seven factors are confounded with the main effects. See Section 6 for additional discussion of interaction and confounding. If there are any large interactions of this type, they will perturb the main effect estimates. However, experience has shown that in the measurement of typical physical properties under laboratory conditions, first-order or main effects are usually much larger than interactions, so the use of these fractional designs has been found to be appropriate for ruggedness testing.
The Plackett-Burman statistical design for seven factors A, B, C, D, E, F, and G that might influence the test outcome is given in Table 3, where -1 indicates the low level (value) of any factor, +1 indicates the high level, and Y_i is the measured value or test result for any of the eight runs or combinations of factor levels. This design assumes that the potential influence of any factor on test response is linear. As indicated by the table, the design calls for the systematic variation of all seven factors across the eight test runs in a way that provides an orthogonal evaluation of the effect of each factor.
The design is evaluated by a procedure that sums the eight Y_i values in a specified way and expresses the results of the summing operation as the effect of each factor. Thus the effect of factor A, designated as E(A), is given by Eq. 36 as the difference of two sums divided by N/2. The first sum is the total of the products obtained by multiplying each value of Y_i by +1 for those rows (runs) that contain a +1 in column A, i.e., rows 1, 4, 6, and 7. The second sum of products is obtained in the same sense for all rows of column A that contain a -1, i.e., rows 2, 3, 5, and 8. An expression analogous to that for factor A may be used for all the other factors.
E(A) = [ΣY_i A(+1) - ΣY_i A(-1)]/(N/2)   (36)

where
ΣY_i A(+1) = sum of Y_i values for all runs (rows) that have +1 for factor A
ΣY_i A(-1) = sum of Y_i values for all runs (rows) that have -1 for factor A
N = total number of runs in the design (= 8); all sums are algebraic
The significance of the effects is evaluated on the basis of either (1) a separate estimate of
the standard deviation of measurements of the same type (materials, conditions) as con-
ducted for the Plackett-Burman design or (2) repetition of the design a second time to
provide for two estimates of each factor effect.
If S is the separate estimate of the standard deviation of individual Y_i measurements, then the standard deviation of the mean of four such measurements is S/2.

Table 3  Plackett-Burman Design for Seven Factors in Eight Runs

Run    A    B    C    D    E    F    G    Result
1      1    1    1   -1    1   -1   -1    Y1
2     -1    1    1    1   -1    1   -1    Y2
3     -1   -1    1    1    1   -1    1    Y3
4      1   -1   -1    1    1    1   -1    Y4
5     -1    1   -1   -1    1    1    1    Y5
6      1   -1    1   -1   -1    1    1    Y6
7      1    1   -1    1   -1   -1    1    Y7
8     -1   -1   -1   -1   -1   -1   -1    Y8
If no real factor effect exists, then the calculated effect E(i), which is a difference between two means of four values each, has an expected value of zero and a standard deviation of √2(S/2) = S/√2. If E(i) is significant (a real effect), it should exceed zero by an amount greater than two standard deviations based on means of four, i.e., greater than 2(S/√2) in absolute value, provided that S is known with a certainty of at least 18 to 20 df. The use of 2(S/√2) as the interval to indicate significance is based on a P = 0.05 or 95% confidence level. If S is not based on at least 18 df, then the value of Student's t (two-sided test) at P = 0.05 should be substituted for 2 in this interval expression, using the appropriate df.
If S is to be estimated from the ruggedness testing itself, the eight runs are repeated to produce two sets of estimates for each factor, where E(i) = the replication 1 value and E′(i) = the second replication value. Since the standard deviation of E(i) or E′(i) is S/√2, each value of the difference [E(i) - E′(i)], or d, for each of the seven factors is an estimate of √2(S/√2) = S. Hence an estimate of S based on 7 df is

S = {[d(A)² + d(B)² + ··· + d(G)²]/7}^{1/2}   (37)

where
d(A)² = [E(i) - E′(i)]² for factor A
d(B)² = [E(i) - E′(i)]² for factor B
and so on for all factors. With two replications of the eight runs now available for estimating the influence of the seven factors, the mean effect is E(i)m, or [E(i) + E′(i)]/2, for each factor. For the value of any E(i)m to be significant, it must exceed 2.37[√2 S/√8] = 1.18S, where S is given by Eq. 37 and 2.37 is the t value of the Student's distribution at 7 df and P = 0.05 (95% confidence). Factors that are found to be significant in any test need to be investigated and the test procedure or protocol revised to reduce the sensitivity to those factors.
For certain test methods, especially those that are more fully developed, only a few
factors may require an evaluation, perhaps 3 or 4. The Plackett-Burman seven factor
design may still be used with the remaining factors, say E, F, and G, being dummy factors,
i.e., factors that have no influence on the outcome. All eight test runs must be completed
for any design independent of the actual number of factors evaluated. Alternatively the
fractional factorial designs for 3 or 4 factors as given in Section 6 may be used.
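The effect calculation of Eq. 36 reduces to a signed sum over the columns of the design matrix. The sketch below uses the Table 3 matrix as reconstructed above; the eight response values are illustrative assumptions.

    import numpy as np

    # Plackett-Burman design matrix of Table 3 (rows = runs, cols = A..G).
    X = np.array([
        [ 1,  1,  1, -1,  1, -1, -1],
        [-1,  1,  1,  1, -1,  1, -1],
        [-1, -1,  1,  1,  1, -1,  1],
        [ 1, -1, -1,  1,  1,  1, -1],
        [-1,  1, -1, -1,  1,  1,  1],
        [ 1, -1,  1, -1, -1,  1,  1],
        [ 1,  1, -1,  1, -1, -1,  1],
        [-1, -1, -1, -1, -1, -1, -1],
    ])
    y = np.array([52.1, 50.3, 49.8, 51.6, 50.0, 51.9, 52.3, 49.5])

    N = len(y)
    effects = X.T @ y / (N / 2)        # Eq. 36 for every factor at once
    for name, e in zip("ABCDEFG", effects):
        print(f"E({name}) = {e:+.2f}")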
the test system, and (4) a fully documented protocol along with experienced personnel for the calibration. A realistic calibration schedule should be maintained. Decisions on the frequency of calibration are made by balancing the cost of calibration against the risk of biased test output. When in a state of statistical control, repetitive instrument responses for each true or standard value should be randomly distributed about a mean, and when a series of true values is used, the response means vs. true values should give a linear least-squares regression relationship. This gives confidence intervals on the slope, the intercept, and selected points on the line in the calibration range. Annex C outlines procedures for evaluating bias that are equivalent to this type of calibration operation; see that annex for more details. See also Section 6 for background on regression analysis.
Empirical relationships that appear to be linear can be tested for linearity by a number of approaches. Visual review of a plot is the most direct way to reveal departures from linearity. A plot of residuals (residual = observed - computed response value) with respect to the level of the response should not show any correlation or systematic behavior. Such a review requires at least 7 pairs of data points (response levels) to be useful. A simple F test may also be employed. If S_p² is the pooled variance for a set of repetitive instrument responses, each set of responses at one of a series of levels (true values) of the calibration standard, and S_fr² is the variance of points about the fitted function, when individual response values (not means or averages) are used for the least-squares calculation, then the variance ratio

F(calc) = S_fr²/S_p²   (38)

should not be significant, i.e., should not be greater than F(crit) at P = 0.05 for the respective df values of the two variances. See Section 6 for variance analysis procedures. If F(calc) equals or exceeds F(crit) under these conditions, there is some significant departure from linearity. This approach should be used with a sufficient number of points so that the df for each estimated variance is 8 to 10. When demonstrated nonlinearity exists, transformations may be used to linearize the response. See ISO 11095 in the bibliography for more details on calibration.
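The variance ratio of Eq. 38 can be sketched as follows; the calibration levels, the choice of three replicates per level, and the response values are illustrative assumptions.

    import numpy as np
    from scipy.stats import f

    levels = np.repeat([1.0, 2.0, 3.0, 4.0, 5.0], 3)   # true values
    resp = np.array([2.1, 2.0, 2.2, 4.0, 4.1, 3.9, 6.2, 6.0, 6.1,
                     8.1, 7.9, 8.0, 9.9, 10.1, 10.0])  # instrument responses

    # Variance about the fitted line, individual values used in the fit.
    b1, b0 = np.polyfit(levels, resp, 1)
    df_fr = len(resp) - 2
    S_fr2 = np.sum((resp - (b0 + b1 * levels))**2) / df_fr

    # Pooled variance of the replicate sets at each level.
    groups = resp.reshape(5, 3)
    df_p = groups.shape[0] * (groups.shape[1] - 1)
    S_p2 = np.sum((groups - groups.mean(axis=1, keepdims=True))**2) / df_p

    F_calc = S_fr2 / S_p2                               # Eq. 38
    print(F_calc, f.ppf(0.95, df_fr, df_p))             # compare with F(crit)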
Traceability
As the name implies, this is the ability to trace or accurately record the presence of an
unbroken, identifiable and documented pathway from fundamental standards or standard
values to the measurement system of interest. Traceability is a prerequisite for the assign-
ment of limits of uncertainty on the output or response measurement, but it does not imply
any level of quality, i.e., the magnitude of the uncertainty limits. Physical standards with
certified values as well as calibration services are available from national standardization
laboratories such as National Institute of Standards and Technology or NIST (formerly
NBS) in the USA or from the corresponding national metrology laboratories for all
developed countries. All of these standards are usually expressed in the SI units system.
making process, and good sampling technique ensures that all samples unquestionably
represent the population under consideration. Only the most elementary sampling issues
are addressed in this section. For more detailed information, various texts and standards
on sampling and sampling theory should be consulted; see the bibliography for statistical
texts and for standards on sampling.
5.1 Terminology
Sampling terminology systems vary to some degree among industrial and commercial
operations, which frequently involve complex mechanical systems to draw samples or
other increments from some large lot or mass of material. One of the important objectives
in such operations is the elimination of bias in the samples. Increased sampling frequency
can reduce the uncertainties of random sampling variation, but it cannot reduce bias. The
terminology given here is that which applies more directly to laboratory testing, where the
process of obtaining samples is reasonably straightforward. This type of sampling may be
thought of as static as opposed to dynamic sampling of the output of a production line.
Some important terms are:
Sample. A small fractional part of a material or some number of objects taken from a lot
or population; it is (they are) selected for testing, inspection, or specific observations of
particular characteristics.
Subsample. One of a sequence of intermediate fractional parts or intermediate sets of
objects, taken from a lot or population, that usually will be combined by a prescribed
protocol to form a sample.
Random sample. One of a sequence of samples (or subsamples), taken on a random basis
to give unbiased statistical estimates.
Systematic sample. One of a sequence of samples (or subsamples), each taken among all
possible samples by way of a selected alternating sequence with a random start; it is
intended to give unbiased statistical estimates.
Stratification. A condition of heterogeneity that exists for a lot or population that contains
a number of regions or strata, each of which may have substantially different properties.
Stratified sample. One of a sequence of samples (or subsamples), taken on a random basis
from each stratum in a series of strata in a (stratified) population.
that each unit of a population (or lot) have an equal chance 1/N of being selected for
testing.
Random sampling can be conducted in one of two ways: (1) with replacement of the
selected units, under the conditions that the test operation does not change or consume the
unit, or (2) without replacement, when the unit is changed in some way by the testing. For
large populations there is no essential distinction between these two types of random
sampling. For small populations (small N) a difference does exist. Since most physical
testing might in some way change the sample (or test specimen prepared from the sample)
the expressions given below are for the "without replacement" category.
A simple random sample is defined as n units drawn from the population such that each unit has the same probability of being drawn. The ideal procedure for doing this is to identify all the units in the population, i = 1, 2, ..., N, and select units from a table of random numbers. For some sampling operations this may have to be modified according to the manner in which individual units can be identified. Each of the N potential or actual units has a value y_i, and the unbiased estimate of the true mean Ȳ is given by the sample mean ŷ as

ŷ = Σy_i/n   (39)

where n is the number of units or size of the sample drawn from the population. The quantity n/N is referred to as the sampling fraction, and the reciprocal N/n is known as the expansion factor. The unbiased estimate of the variance of ŷ, designated as S_ŷ², is given by

S_ŷ² = [(N - n)/N](S_y²/n)   (40)

where
S_y² = the variance of the individual n units
The factor [(N - n)/N], which can be expressed as [1 - (n/N)], reduces the magnitude of the variance of the mean by the sampling fraction when compared to the infinite-population value. This reduction factor is called the finite population correction factor, or fpc, and it indicates the improved quality of the information about the population when n is large relative to N. As n grows larger, the variance of the estimated mean decreases, becoming zero when n = N, since at this point the mean is known exactly. In many situations the fpc has a minimal effect; it is usually ignored if n/N < 0.05, and then [(N - n)/N] is set equal to 1. The confidence interval at P = α for the estimated mean ŷ is given by

ŷ ± t_{α/2}[((N - n)/N)(S_y²/n)]^{1/2}   (41)
If the sample size used to estimate S_y² is fairly large, n ≥ 30, the value 2 may be used for t_{α/2} to give a P = α = 0.05 or 95% confidence interval.
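Equations 39 to 41 can be assembled as in the sketch below; the finite population of N = 200 values is simulated purely for illustration.

    import numpy as np
    from scipy.stats import t

    rng = np.random.default_rng(1)
    population = rng.normal(50.0, 4.0, 200)    # illustrative finite population

    N, n = len(population), 20
    sample = rng.choice(population, size=n, replace=False)

    y_hat = sample.mean()                            # Eq. 39
    fpc = (N - n) / N                                # finite population correction
    var_mean = fpc * sample.var(ddof=1) / n          # Eq. 40
    half = t.ppf(0.975, n - 1) * np.sqrt(var_mean)   # Eq. 41, alpha = 0.05
    print(f"mean = {y_hat:.2f} +/- {half:.2f}")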
Systematic Sampling
The actual process of drawing random samples can frequently be time consuming as well
as costly, especially for large populations and large samples. An alternative procedure that
is easier to conduct and that gives good estimates of the population properties is systematic
sampling. This type of sampling is conducted as follows:
Stratified Sampling
This type of sampling is a process generally applied to bulk lots or populations that are
known or suspected of having a particular type of nonhomogeneity. These populations
have strata or localized zones, and each stratum is usually expected to contain relatively
homogeneous objects or material properties. However these properties may vary substan-
tially among the strata, and samples are taken independently in each stratum. To apply
stratified sampling techniques a mechanism must exist to identify all the strata in the lot or
population. Once this is done the strata may be sampled by using proportional allocation
where the sample fraction is the same for each stratum. Another approach is optimum
allocation where the sample size or fraction may be increased in those strata with increased
variance if this information is known beforehand.
The calculation as applied above for ŷ, the estimated population mean for random sampling, may be applied to each stratum and these individual values used to obtain a population mean based on all stratum values. Similarly, the calculation for S_ŷ², the estimated population variance for random sampling, may be applied to each stratum and an analogous procedure used to obtain an overall variance based on all stratum values. If unequal samples are taken from the various strata, the overall population mean and variance values that represent the entire stratified population must be obtained on a weighted average basis; see Section 3.
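A sketch of such a weighted combination follows. It uses the standard formulas for a stratified mean and its variance (an assumption here, since the chapter defers the details to Section 3); the stratum sizes N_h and the sample data are illustrative.

    import numpy as np

    strata = {                    # N_h: stratum size, x: sample drawn in it
        "A": (100, np.array([9.8, 10.1, 10.3, 9.9])),
        "B": (300, np.array([12.0, 12.4, 11.8, 12.2, 12.1])),
        "C": (100, np.array([8.5, 8.9, 8.7])),
    }
    N = sum(Nh for Nh, _ in strata.values())

    # Stratum means weighted by the stratum share W_h = N_h/N.
    mean_st = sum(Nh / N * x.mean() for Nh, x in strata.values())

    # Variance of the stratified mean, with the fpc applied per stratum.
    var_st = sum((Nh / N)**2 * (1 - len(x) / Nh) * x.var(ddof=1) / len(x)
                 for Nh, x in strata.values())
    print(f"stratified mean = {mean_st:.2f}, s.e. = {np.sqrt(var_st):.3f}")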
E = t_{α/2} S_a/√n   (42)

where
E = |ŷ - Ȳ| = the maximum (absolute) difference between the estimated mean ŷ from the n samples and the true mean Ȳ
t_{α/2} = t value at a specified P = α, i.e., at a (1 - α)100% confidence level, where the df used for t_{α/2} is based on the df for S_a
S_a = the applicable standard deviation (among individual units tested), a function of the specific sampling and testing operation

This may be rearranged to solve for n to give

n = [t_{α/2} S_a/E]²   (43)
n_{T2} = [2S(m)/E_{T2}]²   (44)

and S_a is equal to S(m), the measurement error.

Type 3, Only S(sp) Significant: For this situation the number of samples n_{T3} is

and several combinations of n_sp and n_m may give equal E. Values for n_sp and n_m have to be selected based on their respective variance magnitudes and the costs associated with sampling and measurement.
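Because t_{α/2} itself depends on n, Eq. 43 is conveniently solved by iteration, as sketched below; S_a = 1.2 and E = 0.5 are illustrative assumptions.

    import numpy as np
    from scipy.stats import t

    S_a, E, alpha = 1.2, 0.5, 0.05

    n = 5                                             # starting guess
    for _ in range(20):
        t_crit = t.ppf(1 - alpha / 2, n - 1)
        n_new = int(np.ceil((t_crit * S_a / E)**2))   # Eq. 43
        if n_new == n:
            break
        n = n_new
    print(f"required sample size n = {n}")            # converges to 25 here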
Detecting Outliers
Outliers may be present in any size database. For a database of from several to 30--40 data
values, an analysis may be conducted using spreadsheet calculations. Data values may be
sorted from low to high and a plot of the sorted or ordered values will reveal any suspi-
cious high or low values. Tietjen and Moore [6] described a test that can be used for a
small database with a reasonable number of suspected extreme values or outliers (1 to 5).
The suspected outliers may be either low or high, and the statistical test may be used when
both types exist in the sample at the same time. The test is applicable to samples of 3 or
more, and for sample sizes of II or more, as many as five suspected extreme values may be
tested as potential outliers. The following procedure is used.
(1) The data values are denoted as X₁, X₂, ..., X_n. The mean of all values, designated as x̄(n), is calculated.
(2) The absolute residuals of all values are next calculated: R₁ = |X₁ - x̄(n)|, R₂ = |X₂ - x̄(n)|, etc.
(3) Sort the absolute residuals in ascending order and rename them as Z values, so that Z₁ is the smallest residual, Z₂ is next in magnitude, etc.
(4) The sample is inspected for extreme values, low and high. The most likely extreme values or potential outliers (those with the largest absolute residuals) are deleted from the sample (or database), and a new sample mean is calculated for the remaining (n - k) values, with k = the number of suspected extreme values. This new mean is designated as x̄_k. The critical test statistic E(k) is calculated as the ratio of the sum of squares of the remaining values about x̄_k to the sum of squares of all n values about x̄(n):

E(k) = Σ_{n-k}(X_i - x̄_k)²/Σ_n(X_i - x̄(n))²
Critical values are given in Table A4 for the test statistic E(k) at the P = 0.05 and the
P = 0.01 levels, for sample size n = 3 to 30 and for selected numbers of suspected outliers
k. If the calculated value of E(k) is less than the tabulated critical value in the table, the
suspected values are declared as outliers. This general approach may be used in an iterative
manner until all potential outliers have been evaluated by the procedure.
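The screening procedure can be scripted as below; the function name e_k and the data values are our own illustrative assumptions, the E(k) ratio is formed as described in step (4), and critical values must still be taken from Table A4.

    import numpy as np

    def e_k(data, k):
        # E(k): sum of squares after deleting the k largest absolute
        # residuals, divided by the sum of squares of the full sample.
        data = np.asarray(data, dtype=float)
        resid = np.abs(data - data.mean())
        keep = np.argsort(resid)[:len(data) - k]
        reduced = data[keep]
        return (np.sum((reduced - reduced.mean())**2)
                / np.sum((data - data.mean())**2))

    values = [8.9, 9.1, 9.0, 8.8, 9.2, 9.0, 12.7]   # one suspect high value
    print(f"E(1) = {e_k(values, k=1):.3f}")  # outlier if below Table A4 value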
The action to take when significant outliers have been identified is part of an ongoing
debate in the data analysis community. One recommendation is that only data values with
verified errors or mistakes should be removed. This ultraconservative approach overlooks
the situation where outright errors are made but no knowledge is available that they are
errors. The opposing view recommends removal if a data value is a significant outlier
(P = 0.05 or lower). A middle ground position is to make a judgment based on a reason-
able analysis considering technical and other issues that relate to the testing in addition to
the importance of the decisions to be made.
Case 1
In conducting an F-test for this situation, S₁² is assigned as the greater expected variance. The null hypothesis H₀, that there is no difference in variance, and the alternative hypothesis H_A, that σ₁² is larger than σ₂², are designated symbolically as

H₀: σ₁² = σ₂²   (48)
H_A: σ₁² > σ₂²

These hypotheses are tentatively adopted, a sample is drawn from each population (1 and 2), and the variances are calculated. The ratio S₁²/S₂² = F(calc) is evaluated. If this ratio is equal to or larger than F(crit), the ratio that would be expected by chance at a probability P = α (0.05 or other) of finding a value as large as F(calc) when the null hypothesis is true, the hypothesis of equality is rejected and the alternative hypothesis is accepted, i.e., that σ₁² > σ₂².
F-distribution tables give the F(crit) values that are equaled or exceeded a certain percentage of the time by chance alone; see Table A5. The F(crit) value cuts off an upper area under the F-distribution curve equal to α, and if F(calc) falls in this cutoff region, then the null hypothesis is rejected. Tables of F values are arranged for different
degrees of freedom df in the numerator and denominator, and F(crit) is usually listed for each P level as F(df_n, df_d), where df_n = df in the numerator and df_d = df in the denominator.
Case 2
In this situation there is no technical reason for expecting either variance to be greater than the other. A null hypothesis and an alternative hypothesis are adopted:

H₀: σ₁² = σ₂²   (49)
H_A: σ₁² ≠ σ₂²

and after both variances are calculated, the greater variance is placed in the numerator to evaluate F(calc). For making decisions at a P = 0.05 level, an allowance must be made for F(calc) to be greater than F(crit) if S₁² is greater than S₂², and conversely to be less than a different F(crit) at the other end of the F distribution if S₁² is less than S₂². One half of the P = 0.05 rejection region is assigned to potentially large values of F(calc) and one half to potentially small values. On this basis the P = 0.05 value for F(crit) is found in a P = 0.025 upper-tail F table; conversely, if the F(crit) value from a 95% confidence level upper-tail F table were used, the actual significance level for the decision would be P = 0.10, since there is a P = 0.05 probability of F(calc) falling in either critical region.
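The Case 2 calculation is sketched below on illustrative samples; note that the greater variance is placed in the numerator and the critical value is read at the P = 0.025 upper tail.

    import numpy as np
    from scipy.stats import f

    x1 = np.array([10.2, 10.8, 9.9, 10.5, 10.1, 10.7])
    x2 = np.array([10.0, 10.1, 10.2, 9.9, 10.0, 10.1])

    s1, s2 = np.var(x1, ddof=1), np.var(x2, ddof=1)
    F_calc = max(s1, s2) / min(s1, s2)       # greater variance on top

    alpha = 0.05
    df1 = df2 = len(x1) - 1
    F_crit = f.ppf(1 - alpha / 2, df1, df2)  # P = 0.025 upper-tail value
    print(F_calc, F_crit,
          "reject H0" if F_calc >= F_crit else "do not reject H0")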
One Variance Known
A less frequent situation is the case where one of the variances is a known or defined variance, represented by σ², while the other variance is from a sample and is equal to S². The ratio S²/σ² has a sampling distribution known as a "chi-squared over df", designated as (χ²/df). For this application, chi-square, χ², is a random variable given by the ratio of the product S²(df) to the known variance σ²:

χ² = S²(df)/σ²   (50)
H₀: all σᵢ² equal
H_A: at least two σᵢ² not equal   (52)

If F-max(calc) equals or exceeds F-max(crit), then a significant difference exists between the maximum and the minimum variance of the database. If F-max(calc) does not exceed F-max(crit), then all the variance estimates are equivalent.
values for t at various P levels. If |t(calc)| ≥ |t_{α/2}(crit)|, using absolute values, then the null hypothesis is rejected and the difference is declared significant at the (1 - P)100% confidence level. If |t(calc)| does not equal or exceed |t_{α/2}(crit)|, the difference is not significant at this level of testing for the selected P = α value.

Option 2. If the variances are not equal, there are two choices: use a transformation and conduct the analysis on the transformed data, or conduct the normal t-test and use the df for the smaller sample to select t_{α/2}(crit). The distribution of the t value so calculated is approximately the same as a true t distribution. This smaller-sample df recommendation is made because, with unequal variances, the exact number of degrees of freedom cannot be determined by the usual procedure given above.
t(calc) = d(av)/[S_d²/n]^{1/2}   (55)

where
d(av) = the average of the n paired differences (treated minus nontreated)
S_d² = the variance (estimate) of the individual differences
df = n - 1

The null and alternative hypothesis statement is

H₀: d(av) = 0
H_A: d(av) ≠ 0   (56)

The significance of d(av) is found by the same procedure as for a normal or standard t-test. The variance S_d² is not the variance of either population but of a constructed population of differences between the two conditions, treated vs. nontreated.
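The paired calculation of Eqs. 55 and 56 is sketched below with illustrative treated/nontreated pairs; scipy's ttest_rel returns the same statistic together with its P value.

    import numpy as np
    from scipy.stats import ttest_rel

    treated = np.array([14.2, 13.9, 14.8, 14.5, 14.1, 14.6])
    nontreated = np.array([13.8, 13.7, 14.1, 14.0, 13.9, 14.2])

    d = treated - nontreated
    t_calc = d.mean() / np.sqrt(d.var(ddof=1) / len(d))   # Eq. 55, df = n - 1
    print(f"t(calc) = {t_calc:.2f}")

    print(ttest_rel(treated, nontreated))   # same statistic, with P value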
and this begins with a database generated by a testing program, such as a series of treatments on a common material with some number of repeat tests for each treatment. This elementary one-way analysis procedure will illustrate the basic ANOVA concepts.
Basis of Variance Analysis
The database or data matrix, illustrated below as Table 5, consists of j columns and i rows of data. The j columns may be various treatments of a uniform material or levels of an adjustable or independent variable, and the values in each column are replicates or repeated measurements for each of the treatment levels.

The measurements or observations x_ij are the dependent variable. Two basic assumptions are made: all potentially different populations for each treatment have a normal distribution, and the variance of all populations is equal even if one or more of the treatments are significantly different. This latter assumption may be checked by use of the Hartley F-max test as previously described. Each data value x_ij may be expressed in terms of certain population means as given in Eq. 57, where each group as defined below represents a treatment.

x_ij = μ + (μ_j - μ) + (x_ij - μ_j)   (57)
where
μ = the grand mean, the mean of all values in the database
(μ_j - μ) = the variation among the x_ij values attributed to the difference of the group mean μ_j from the grand mean μ; there are k such differences
(x_ij - μ_j) = the variation attributed to the differences of the individual i replicate values in each group (or treatment) from the group mean μ_j
The equation may be rewritten by assigning symbols to the two differences:

x_ij = μ + β_j + ε_ij   (58)

The term β_j represents the series of differences, the jth treatment mean minus the grand mean; if there are no significant differences among the treatments, then β_j = 0 for all treatments. If one or more of the treatments are significant, then β_j ≠ 0 for those treatments. The term ε_ij is a random difference, i.e., the measured value x_ij minus the jth treatment mean. By subtracting μ from both sides we obtain

x_ij - μ = β_j + ε_ij   (59)

This model equation states that the magnitude of any deviation of the dependent variable x_ij from the grand mean is the sum of two components of variation: β_j, the component due to any presumed real response effect of a treatment, and a second component ε_ij, a within-treatment difference or error. The long-run mean value of all ε_ij equals zero, and the
variance of ε_ij is equal to the basic test error variance in making a measurement. The two components are called the between-treatments source of variation and the within-treatment source of variation. The analysis answers the question: is the between-treatments variation significant when compared to the within-treatment variation?
The analysis begins by setting up the null and alternative hypotheses

H₀: β_j = 0 for j = 1, 2, ..., k
H_A: at least one β_j is not 0   (60)

The between-treatments variation is calculated by noting that there are k treatment means, and the variance among the k group means, S_x̄², is given by

S_x̄² = Σ(x̄_j - x̄_g)²/(k - 1)   (61)
where
x̄_j = the treatment mean (across all i values) for the jth treatment
x̄_g = the grand mean across all i and j values
k = the number of treatments
In general, the standard error for any normal population mean is equal to σ/√n, and the variance is equal to σ²/n, where n is the number of values used to calculate the mean and σ² is the population variance. For this analysis the term S_x̄² is an estimate of the true variance σ_x̄² among the k different means. If the null hypothesis is true, S_x̄² can be used to evaluate the population variance. To evaluate the variance for individual population values, the calculated S_x̄² must be multiplied by the number of test values (replicates) used for each of the means, designated as n_j, to give n_j S_x̄², which is an estimate of n_j σ_x̄² = σ². The variance of the individual measurements for each of the k treatments is calculated by pooling the within-treatment sums of squares,

S_p² = ΣΣ(x_ij - x̄_j)²/[k(n_j - 1)]   (62)
We now have two estimates of the individual population variance: the first, n_j S_x̄², obtained from the variation in the treatment means, and the second, S_p², the pooled variance obtained from the replicates. If there are no real effects of the treatments (all β_j = 0), both of these estimate the underlying variance of the population. An F-test can be used to decide if the two estimates are equal. Using the hypotheses given above and the technically justified assertion that, if the variances are not really equal, the between-treatments variance should be larger than the within-treatment variance, the F-ratio is defined as

F(calc) = n_j S_x̄²/S_p²   (64)
The degrees of freedom df for the numerator are k - 1, and the df for the denominator are (n_j - 1)k. If F(calc) is equal to or greater than F(crit) at these respective df at a probability
level P = 0.05 or less, then the larger variance is significantly greater than the lesser variance, with the implication that at least one value of β_j is not equal to 0.
ANOVA Calculations
The classical ANOVA calculations are not usually performed as given above but by shortcut methods that were developed to reduce the burden of calculation before the use of computers. An understanding of this approach can be gained by considering that the total variation in a database, as given above, is evaluated by calculating the total variance S_tot² by

S_tot² = Σ(x_ij - x̄_g)²/(kn - 1)   (65)
with x_ij and x̄_g as defined above and the summation taken over all values. The numerator of Eq. 65 is a sum-of-squares called the total sum-of-squares and represents all the variation in the database. It may be shown that this total sum-of-squares may be partitioned into the two components on the right-hand side of Eq. 66:

Σ_(kn)(x_ij - x̄_g)² = Σ_(kn)(x_ij - x̄_j)² + n Σ_(k)(x̄_j - x̄_g)²   (66)

where
Σ_(kn) = summation over all kn values
Σ_(k) = summation over the k treatments
x̄_j = the mean of the jth treatment; other symbols as defined above

The first term on the right-hand side is called the error sum-of-squares, and the second right-hand term is the treatment sum-of-squares. A one-way ANOVA is performed by calculating the total sum-of-squares SS(tot) by way of a shortcut expression that can be shown to give the value defined by the left side of Eq. 66:
SS(tot) = Σx_ij² - C   (67)

where
Σx_ij² = the individual measured values x_ij squared and summed over all values in the database
C = T_gn²/kn = a constant called the correction term, where T_gn is the grand total of all measured values in the database, k is as defined above, and the subscript j has been dropped from n
The second ANOVA sum-of-squares is the treatment sum-of-squares SS(trt), given by Eq. 68, where ΣT_j² is the sum of the squared totals for each of the k treatments:

SS(trt) = ΣT_j²/n - C   (68)

The random variation or error sum-of-squares is obtained by difference:

SS(error) = SS(tot) - SS(trt)   (69)
The three sums-of-squares are used in a table with the layout of Table 6. The sums-of-squares are divided by the appropriate degrees of freedom df to give variances that are called mean squares. As indicated in the previous exposition of variance analysis, the treatment mean square is divided by the error mean square to make a decision on the significance of the treatments.

Table 6  Analysis of Variance Table

Source of variation    df          Sum of squares    Mean square                        F(calc)
Treatments             k - 1       SS(trt)           MS(trt) = SS(trt)/(k - 1)          MS(trt)/MS(error)
Error (within)         k(n - 1)    SS(error)         MS(error) = SS(error)/[k(n - 1)]
Total                  kn - 1      SS(tot)
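The shortcut sums-of-squares of Eqs. 67 to 69 and the resulting F(calc) can be computed as in the sketch below; the k = 3, n = 4 data matrix is an illustrative assumption, and scipy's one-way ANOVA serves as a cross-check.

    import numpy as np
    from scipy.stats import f_oneway

    data = np.array([[10.1, 11.4, 9.6],      # rows = replicates,
                     [10.4, 11.1, 9.9],      # columns = treatments
                     [ 9.8, 11.6, 9.5],
                     [10.2, 11.3, 9.8]])
    n, k = data.shape

    C = data.sum()**2 / (k * n)                       # correction term
    SS_tot = (data**2).sum() - C                      # Eq. 67
    SS_trt = (data.sum(axis=0)**2 / n).sum() - C      # Eq. 68
    SS_err = SS_tot - SS_trt                          # Eq. 69

    F_calc = (SS_trt / (k - 1)) / (SS_err / (k * (n - 1)))
    print(f"F(calc) = {F_calc:.2f}")

    print(f_oneway(*data.T))    # same F from scipy, one group per column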
Although the analysis of variance is a powerful tool for data analysis, especially for more complex situations that are beyond the scope of this chapter, it needs to be supplemented by additional analysis tools. If F(calc) is found to be significant, this implies that at least one of the treatments is significantly different from some other treatment. If several treatments are used, it is appropriate to determine what the exact relationship is among the means of all the treatments. This type of analysis is called multiple comparisons. A typical statistical test of this type is the Duncan multiple range test; see Part 2 of this section. This test makes all the pairwise comparisons among the treatment means based on a statistical parameter called the least significant range, LSR. For each pair of means an LSR is calculated and compared to a critical LSR for the number of groups compared, the df for the error term, and a selected P.
tion may, however, exist); and (3) the population coefficient is symmetric with respect to x and y, i.e., x on y gives the same value as y on x.

Analysis begins by calculating x̄, the mean of the x variable, ȳ, the mean of the y variable, and the deviations (x_i - x̄) and (y_i - ȳ). The estimated correlation coefficient r is the sum of the cross products of the deviations of x and y divided by the square root of the product of the sums of squares of the same respective deviations:

r = Σ(x_i - x̄)(y_i - ȳ)/[Σ(x_i - x̄)² Σ(y_i - ȳ)²]^{1/2}   (70)
For any xy scatter plot of two variables that have a high degree of positive correlation, a large number of points will fall in the upper right quadrant 1 and the lower left quadrant 3 when the origin (center) of the quadrants is at the mean values x̄ and ȳ. This clustering of the points ensures that quadrant 1 will have positive and relatively large x deviations associated with similar positive and large y deviations. Quadrant 3 will have a similar situation, with negative deviation values for both variables. When these deviation cross products are summed over all xy values, they will very nearly equal the square root of the product of the x and y deviations squared, and the ratio will be high, or near 1. For a negative or inverse association, the points will cluster in the upper left quadrant 2 and the lower right quadrant 4. A strong association of this type will give negative cross products and a large negative ratio, an r value approaching -1.
The significance of any calculated correlation coefficient is evaluated by adopting the usual two-sided hypothesis test,

H₀: ρ = 0
H_A: ρ ≠ 0   (71)

This hypothesis can be tested for any sample size n of 3 or greater by using a t-statistic given by

t(α/2) = r/[(1 - r²)/(n - 2)]^{1/2}   (72)
where t(α/2) is the value of a random variable whose distribution is approximately the usual t-distribution with n - 2 degrees of freedom. The correlation is considered significant if t(calc) is greater than the critical value t(α/2)(crit) at a level of significance designated by P = α. Table A9 gives precalculated critical r values at df = n - 2.
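Equations 70 and 72 are sketched below on illustrative data; scipy's pearsonr returns the same coefficient along with its P value.

    import numpy as np
    from scipy.stats import pearsonr, t

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 2.9, 3.6, 4.8, 5.1, 6.2])

    r = np.corrcoef(x, y)[0, 1]                   # Eq. 70
    n = len(x)
    t_calc = r / np.sqrt((1 - r**2) / (n - 2))    # Eq. 72
    t_crit = t.ppf(0.975, n - 2)                  # two-sided, P = 0.05
    print(r, t_calc, t_crit)

    print(pearsonr(x, y))                         # same r, with P value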
Regression Analysis
Simple regression analysis is also concerned with the association between two variables x and y, but the goal of this analysis is to develop a mathematical relationship between the two variables to permit a prediction of y from a knowledge of x. The potential use of regression analysis implies that there is a correlation between the two variables and that the distinction between correlation and causation as discussed above applies to this analysis as well. The linear mathematical relationship is called a regression model; it explains or predicts the response or y variable in terms of the other (independent) variable, designated as the x variable. Regression analysis is used to make inferences about the parameters of the regression model, which is given by

y = β₀ + β₁x + ε   (73)

This model, referred to as the regression of y on x, involves β₀, β₁, and ε, defined as
β₀ = the intercept, the value predicted by the model when x = 0; it has no practical meaning if x cannot equal zero but is necessary to specify the model
β₁ = the slope of the regression line, i.e., the change in y per unit change in x
ε = a random error term with population mean of 0 and variance of σ²

A term called the conditional mean, denoted μ_{y|x}, is a predicted value for the dependent variable y for some given x and is expressed as

μ_{y|x} = β₀ + β₁x   (74)
The regression model describes a line that is the locus of all values of the conditional mean, each conditional mean corresponding to one of a set of x values. Most regression problems are concerned with selecting a set of x values that span a reasonable operational range and measuring y at each of these x values. Each of the observed or measured values of the response variable (at a given x) comes from a normal population with mean μ_{y|x} and variance σ².
The purpose of a regression analysis is to use a set of measured or observed x and y values to estimate the parameters β₀, β₁, and the variance σ² of the ε terms, and to perform hypothesis tests and evaluate confidence intervals concerning β₁. Basic assumptions in this analysis are that the linear model is appropriate; that the ε error terms or deviations in the y variable are independent and normally distributed with a common variance σ² at all x levels; and that the uncertainty variance in setting each x level is small in comparison to σ². The analysis seeks estimates of β₀ and β₁ that produce a set of μ_{y|x} values that best fit the data. The regression line may be written in an alternative format as

μ̂_{y|x} = b₀ + b₁x   (75)

In this alternative format, μ̂_{y|x} is an estimate of the mean of y for any given x, and b₀ and b₁ are estimates of β₀ and β₁ respectively. How well the estimates agree with the observed y is evaluated by the magnitude of the differences y - μ̂_{y|x}, which are called residuals. Small residuals indicate good fit, and the best fit is the line that gives the smallest combined magnitude for the squares, or variance, of the residuals.
The minimization of the squares of the residuals is called the least-squares criterion, which requires that the estimates of β₀ and β₁ minimize the sum expressed by

Σ(y - μ̂_{y|x})² = Σ(y - b₀ - b₁x)²   (76)

These values are obtained by means of "normal equations," a set of simultaneous equations of the form

b₀n + b₁Σx = Σy
b₀Σx + b₁Σx² = Σxy   (77)
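The normal equations of Eq. 77 form a 2 x 2 linear system that can be solved directly, as in the sketch below with illustrative data.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.2, 3.9, 6.1, 8.0, 9.9])

    n = len(x)
    A = np.array([[n,       x.sum()],
                  [x.sum(), (x**2).sum()]])
    rhs = np.array([y.sum(), (x * y).sum()])
    b0, b1 = np.linalg.solve(A, rhs)         # Eq. 77
    print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")

    resid = y - (b0 + b1 * x)                # residuals y - mu_hat(y|x)
    MS_e = np.sum(resid**2) / (n - 2)        # cf. Eqs. 80-81
    print(f"MS_e = {MS_e:.4f}")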
SS_ε = Σ(y - μ̂_{y|x})²   (80)

which describes the remaining variation in y after estimating the linear relationship of y upon x. The degrees of freedom for SS_ε is n - 2, where n is the number of xy values, since two df are used for b₀ and b₁, and the mean square or variance estimate of σ² is

MS_ε = SS_ε/(n - 2)   (81)
The ratio

(b₁ - β₁)/[σ²/((n - 1)S_x²)]^{1/2}   (82)

is a standard normal (z) variable. When the estimated variance MS_ε is substituted for σ², the ratio becomes a random variable with a t distribution with n - 2 degrees of freedom, and this may be used for hypothesis testing. Thus for testing whether β₁ equals a specific value β₁*,

H₀: β₁ = β₁*
H_A: β₁ ≠ β₁*   (83)

the test statistic is

t_{α/2} = (b₁ - β₁*)/[MS_ε/((n - 1)S_x²)]^{1/2}   (84)
Letting β₁* = 0 provides a test of the null hypothesis

H₀: β₁ = 0
H_A: β₁ ≠ 0   (85)

and the confidence interval for b₁ is calculated as

b₁ ± t_{α/2}[MS_ε/((n - 1)S_x²)]^{1/2}   (86)
Inferences on the model estimates of the response variable are also important. There are two different but related inferences: (1) inferences on the mean response, how well the model estimates the conditional mean at some x, and (2) inferences on prediction, how well the model predicts the value of the response variable y for a randomly chosen future x value. The point estimate for the first of these is μ̂_{y|x}, the estimated mean response for any x, and the estimate for the second case is ŷ_{y|x}, the predicted individual response value for any x. For a specified value x*, the variance of the estimated mean is

S²(μ̂_{y|x*}) = σ²[1/n + (x* - x̄)²/((n - 1)S_x²)]   (87)

and the variance of a predicted individual value is

S²(ŷ_{y|x*}) = σ²[1 + 1/n + (x* - x̄)²/((n - 1)S_x²)]   (88)
Both of these variances have their minima when x* = x̄, i.e., the response is estimated with greatest precision when the independent variable is at its mean. At this location the conditional mean is ȳ and the variance of the estimated mean is the familiar σ²/n. Also note that S²(ŷ_{y|x}) > S²(μ̂_{y|x}), since a mean has greater precision than an individual value.
When MS_ε is substituted for σ², the estimated variances are given by these two equations. Letting x = 0 in the variance expression for μ̂_{y|x} gives the variance for b₀, which can be used for hypothesis and confidence interval testing of β₀. This has applications for some regression problems where β₀ has some intrinsic meaning other than that of an arbitrary fitting constant.
One simple diagnostic test for simple linear regression that may give an indication that some of the essential assumptions have been violated is the residual plot. If the residuals y - μ̂_{y|x} are plotted on the vertical axis and the predicted values μ̂_{y|x} or the x values on the horizontal axis, the plot should give a horizontal band of points, centered on zero on the vertical axis, with relatively uniform height (vertical spread) across the entire horizontal span if there are no serious problems with the regression. If the height increases from end to end, this may indicate nonuniform variance across the x factor range. If the band curves up or down, left to right, this may indicate a nonlinear underlying true relationship, i.e., the model has not been properly specified and may need a square term.
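A minimal residual-plot sketch follows (illustrative data; matplotlib assumed available).

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 13.9, 16.2])

    b1, b0 = np.polyfit(x, y, 1)
    pred = b0 + b1 * x
    resid = y - pred                        # residuals y - mu_hat(y|x)

    plt.scatter(pred, resid)
    plt.axhline(0.0, linestyle="--")        # band should straddle zero
    plt.xlabel("predicted response")
    plt.ylabel("residual")
    plt.show()                              # look for curvature or fanning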
Multiple regression, where a dependent y variable is a function of a number of independent x variables or factors, is a widely used analysis technique for multivariable problems. The use of a specialized, simplified multiple regression analysis, where all the independent factors are orthogonal, is discussed in the next section on experimental design. A detailed discussion of customary multiple regression analysis, where orthogonality may or may not be present for all variables, is beyond the scope of this chapter.
or qualitative, with the levels representing certain types or categories, e.g., reactor 1 vs. reactor 2. The response is the dependent or measured variable. The generic models for experiments are (1) a fixed-effects model, where only the selected levels of the factors in the experiment are of concern; (2) a random-effects model, where the factor levels chosen represent a sample from a larger population and conclusions from the program are applied to the population; and (3) a mixed-effects model, where both random- and fixed-effects factors are included in the design. Factors may be primary, i.e., the major factor(s) for evaluation that can have a direct influence, or secondary, i.e., factors that can have a less direct influence, e.g., environmental conditions, which may or may not be evaluated as secondary objectives.
In a typical experiment set up to evaluate three levels, 1, 2, and 3, of Factor A, a secondary factor may be a qualitative factor O: which of several operators conducts the test. The usual or classical experimental approach is to fix the level of O, i.e., select one operator, and evaluate the response when Factor A is varied over the three levels (1, 2, and 3) with perhaps three replications for each level. This approach has 6 df for error estimation and nine total test measurements. Testing will evaluate the influence of A with the selected operator but give no indication of operator effects. A more comprehensive approach is to use a block experimental design. Select two diverse operators, and for each operator evaluate the response for Factor A at levels 1, 2, and 3 with two replications per level. The operators are the blocks, and the influence of Factor A is evaluated independently in each block. This design also has 6 df for error, with now a total of 12 measurements. The investment of 3 more measurements in the second design provides much more information. The influence of Factor A is now evaluated for both operators, and any unusual influence or interaction of operators on Factor A response can also be determined.
In complex experimental programs, blocking may be done for potentially perturbing
factors that cannot be controlled, such as ambient temperature or humidity variations, by
conducting several evaluations of the response in short time periods, where temperature
and humidity are relatively constant. Dividing the total experiment into these blocks, or
relatively uniform groupings, improves the quality of the estimated effects by reducing the
standard error of the measurements.
Factorial Designs
A class of experimental designs called factorial designs is well suited to technical projects
involving physical measurements. Factorial designs are defined as a group of unique
combinations of levels across all the selected factors. When the factor levels are set at
the values called for in a particular combination and a response measurement is made for
this combination, this is called a (test) run. Each design has some number of specified runs,
and the total layout or list of these runs is called the design matrix. A complete factorial
design matrix is one where for each factor, all factor levels of the other factors appear in
some combinations of the design matrix. Thus for three factors investigated at two levels each, a complete factorial design would require eight (2³) response measurements or runs, each having a different combination of the two levels of the three factors. When the number of factors is large, a complete or full factorial requires too many test runs, and designs called fractional factorials are used, in which a certain fraction of the full factorial number of runs is selected on the basis that the design be balanced with respect to the number of selected levels of each factor.
Factorial designs may be divided into two types, (1) screening designs used to search
for important factors or to rank factors in order of importance, and (2) exploratory designs
used to explore and map out a region of technical interest in greater detail, thereby gaining
empirical understanding. The simpler screening designs, used for a preliminary search for
important factors in a system, have k factors each at two levels, designated as upper and
lower levels. Exploratory designs also have k factors, but now each factor usually has at
least three levels, upper, middle and lower; this permits the evaluation of nonlinear
response relationships.
The designs as given in this section are analyzed in terms of model equations that
simulate the system under study. The designs and the model equations are set up using
special coded units for the independent variables (factors) of the design. Thus for the
response variable y and two independent variables x1 and x2, the model equation that
allows for the evaluation of any interaction between x1 and x2 is
y = b0 + b1x1 + b2x2 + b12x1x2 (89)
where
b0 = a constant; in the system of units chosen it is the value of y when x1 = x2 = 0
b1 = change in y per unit change in x1
b2 = change in y per unit change in x2
b12 = an interaction term for the specific effects of combinations of x1 and x2; it
indicates how b1 changes as x2 changes by 1 unit
The coded units are obtained by selecting for each factor a value that constitutes a center
of interest or a reference value, and then selecting certain values that are below and above
that center of interest by an equivalent amount. This is a straightforward process for
quantitative factors but may not be possible for some qualitative factors that can exist
at only two levels. In this case the center of interest is considered as theoretical or con-
ceptual. The coded units for any xi are defined by
xi = (VE - cVE)/SU (90)
with
VE = selected factor value for xi, in physical units
cVE = center of interest factor value for xi, in physical units
SU = scaling unit, i.e., the change in physical units equal to 1 coded unit
When VE is higher than cVE by an amount equal to SU, then xi = 1; when VE is less than
cVE by an equal amount, xi = -1; and when VE = cVE, xi = 0. The center of interest
values for all factors constitute the central point in the multidimensional factor space for
the experiment. As indicated above, the constant b0 is the value for y at this center in the
factor space; it is the (grand) average of all responses and is an important analysis output
parameter for these designs.
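A minimal sketch of the coded-unit transformation of Eq. 90 (Python; the function name and the cure-temperature example are illustrative, not from the text):

def coded_unit(ve, cve, su):
    """Convert a physical factor value VE to coded units, given the
    center-of-interest value cVE and the scaling unit SU (Eq. 90)."""
    return (ve - cve) / su

# Example: a cure temperature centered at 160 C with a scaling unit of 10 C.
print(coded_unit(170.0, 160.0, 10.0))  #  1.0 (upper level)
print(coded_unit(150.0, 160.0, 10.0))  # -1.0 (lower level)
print(coded_unit(160.0, 160.0, 10.0))  #  0.0 (center of interest)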
The design matrix for a 2^3 full factorial design is given in Table A10 along with an
additional matrix called the independent variables matrix or analysis matrix that is used to
calculate the effects of the independent variables; this matrix has an equal number of rows
and columns. The analysis of factorial designs is a relatively straightforward hand calcu-
lator procedure that may be performed by using the analysis matrix generated from the
design matrix as follows: (1) the first analysis matrix column consists of 1 values and is
used to evaluate the grand mean; (2) the next three columns are the same as in the design
matrix; (3) the remaining (interaction) columns are generated by multiplying respective
values in the columns headed by xi and xj to give column entries for bij. The final column
is generated in the same way for three factors. In general this operation is conducted for all
factors up to k, for any 2^k design.
With this design the effects of the three factors x1, x2, and x3 may be evaluated in
terms of effect coefficients. There are three main effect coefficients, b1, b2, b3; three two-factor
interaction coefficients, b12, b13, b23; and one three-factor interaction coefficient, b123,
as given in Eq. 91.
y = b0 + b1x1 + b2x2 + b3x3 + b12x1x2 + b13x1x3 + b23x2x3 + b123x1x2x3 (91)
The coefficients are evaluated from the sum of the products obtained by multiplying
certain column values or elements on each row, where (col bi) indicates a specific row
value in column bi of the analysis matrix and yi is the same specific row value for the
response. The sum obtained over all N responses is divided by N/2, where N is the total
number of runs in the design.
bi = Σ[yi(col bi)]/(N/2)
bij = Σ[yi(col bij)]/(N/2)
bijk = Σ[yi(col bijk)]/(N/2) (92)
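By way of illustration, the following minimal sketch (Python with NumPy; the eight response values are invented) carries out the matrix procedure of Eq. 92 for a 2^3 design:

import numpy as np
from itertools import product

# Design matrix for the 2^3 full factorial: all sign combinations of x1, x2, x3.
X = np.array(list(product([-1.0, 1.0], repeat=3)))  # 8 runs, coded units
x1, x2, x3 = X.T

# Analysis matrix: a column of +1s, the three design columns, and the
# element-wise products that generate the interaction columns.
A = np.column_stack([np.ones(8), x1, x2, x3, x1*x2, x1*x3, x2*x3, x1*x2*x3])

y = np.array([52.1, 55.9, 53.0, 57.2, 49.8, 54.1, 50.7, 55.0])  # illustrative responses

N = len(y)
b = A.T @ y / (N / 2)   # Eq. 92: sum of products over N/2 for each column
b[0] = y.mean()         # b0, the grand mean, uses N rather than N/2
print(dict(zip(["b0", "b1", "b2", "b3", "b12", "b13", "b23", "b123"], b.round(3))))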
Screening Designs
Seven screening designs with two to five factors are listed in Table A11. These are a
collection of complete factorials or fractional factorials. The fractional factorials are
designated as one-half replicate or one-fourth replicate, i.e., a one-half or one-fourth
fraction of the full design. With the exceptions as noted below, all of these designs
allow for the evaluation of two-factor interactions, which are often important in many
technical investigations. Typically, three-factor interactions have no real significance in
such programs. Thus any design that allows for direct calculation of two-factor interactions
is usually sufficient to give a good evaluation of any system.
All of the designs are orthogonal in the independent variables or factors, i.e., there is
no correlation among these factors. This feature of orthogonality permits the use of the
matrix as discussed above for easy analysis. The designs are balanced, i.e., for any
level of factor i, the levels of all other factors appear at their upper and lower values the
same number of times. The analysis for each design in Table A11 can also be conducted by
multiple regression analysis with typical computer algorithms to give the values for all b
coefficients and the other typical output parameters for multiple regression. One virtue of
using a multiple regression analysis is the ability to evaluate the significance of the indi-
vidual coefficients on the basis of t tests and thus obtain a model with only significant
coefficients.
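As a sketch of this regression route (Python, assuming the statsmodels package is available; the response values are invented), fitting the main effects and two-factor interactions of a 2^3 screening design and reading off the t-test p-values:

import numpy as np
import statsmodels.api as sm
from itertools import product

# 2^3 design in coded units plus interaction columns; the three-factor term
# is omitted here so that 1 df remains for error in this small illustration.
X = np.array(list(product([-1.0, 1.0], repeat=3)))
x1, x2, x3 = X.T
A = np.column_stack([np.ones(8), x1, x2, x3, x1*x2, x1*x3, x2*x3])

y = np.array([52.1, 55.9, 53.0, 57.2, 49.8, 54.1, 50.7, 55.0])  # illustrative

fit = sm.OLS(y, A).fit()
print(fit.params)   # regression-scale coefficients (for +/-1 coding, half the Eq. 92 values)
print(fit.pvalues)  # t-test significance of each coefficient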
Table A11 gives alias and confounding information for the fractional or blocked
designs. This information is needed for proper design setup, analysis, and interpretation.
An alias exists in fractionated designs when the same sum of products Σyi(col bij) numerically
evaluates the sum for two different or separate coefficients; thus no unique evaluation
of each of the coefficients is possible. The aliases in the table are indicated by an equals
sign. Design 3, the three-factor design conducted in two blocks, can be used to lay out a 2^3
design into two blocks as well as to use either of the blocks to evaluate the three main
effects of a three-factor design with the indicated aliases. Thus each block is equivalent to a
three-factor one-half replicate design. If it can be shown on technical grounds alone that
no two-factor interactions are possible or are of negligible magnitude, then the two-block
three-factor design may be used for blocking or for main effect evaluation. If this cannot
be assured, then for all potential situations of this sort, designs with no "main effect-two-
factor interaction aliases" must be used. Confounding implies a similar unresolved situa-
tion where certain higher-order coefficients are equivalent to block effects in the same
sense as an alias. This is of lesser importance than aliases with two-factor interactions.
When assigning coefficient numbers to factors, certain features of the fractional
designs can be used to give a layout as free from conflicting interpretations as possible.
Design 5, a four-factor one-half replicate, has the two-factor interactions aliased with each
other. Note that x1 is part of each aliased combination or equivalency. If it can be
ascertained in advance that one factor of the four can be guaranteed to have minimal
interaction with the other three, then this one factor should be assigned as x1. On this basis
the left side of all the alias equations is zero or very close to zero, and the right-hand side
becomes the real interaction if this interaction is determined to be significant. A similar but
not exactly equivalent situation exists with regard to x5 for the five-factor one-half replicate
Design 7; x5 appears in the alias equations with main effects for factors 1 to 4. Thus
for this design the factor least likely to interact with the other factors should be assigned
as factor 5.
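A small sketch (Python with NumPy) of how such aliases arise, using a four-factor one-half replicate built from the generator x4 = x1·x2·x3 (an assumed generator chosen for illustration; Table A11's Design 5 may use a different one). The two-factor interaction columns coincide in pairs, so their coefficients cannot be separated:

import numpy as np
from itertools import product

base = np.array(list(product([-1.0, 1.0], repeat=3)))  # full 2^3 in x1, x2, x3
x1, x2, x3 = base.T
x4 = x1 * x2 * x3                      # defining relation I = x1 x2 x3 x4

print(np.array_equal(x1*x2, x3*x4))    # True: b12 aliased with b34
print(np.array_equal(x1*x3, x2*x4))    # True: b13 aliased with b24
print(np.array_equal(x1*x4, x2*x3))    # True: b14 aliased with b23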
Exploratory Designs
With the exception of the two-factor design, those designs illustrated in Table A12 consist
of a core group of runs that is equivalent to a screening design with certain added runs that
contain additional levels of the factors. The additional runs are located at the center of the
design (center of interest) and at upper and lower levels beyond those that appear in a
screening design. These added runs enable any nonlinear behavior to be evaluated and
allow for main effect and two-factor interactions to be evaluated with enhanced reliability
due to the increased number of runs. The center runs are replicated to give a small df
estimate of error.
The two-factor design is a layout in the shape of a hexagon with two center points.
The three-factor design is a full factorial augmented by lower and upper points selected to
give an efficient estimation of the potential effects of the three factors. The center point is
replicated four times. The four- and five-factor designs are built on a one-half replicate of
the respective full factorial again augmented by lower and upper points for all factors plus
replicated center points.
The selection of the physical units that represent the coded units of both screening and
exploratory designs should be carefully thought out. In a screening design, the lower and
upper physical unit levels should be as wide as possible to increase the sensitivity of the
evaluation. The values should also be in the range of direct interest to the technical
problem at hand. Exploratory designs, with perhaps five levels, also require careful selec-
tion of the range of values and the equivalence between physical and coded units.
Cross-tabulations
For two-way and multi-way tables: Pearson's r, Pearson's chi-square, likelihood-ratio chi-square,
Yates' corrected chi-square, Spearman's rho, contingency coefficient,
Goodman's and Kruskal's tau, eta coefficient, Cohen's kappa, relative risk estimate
One-Way ANOVA
Typical one-way ANOVA with post hoc tests: LSD, Bonferroni, Duncan's, Sidak's,
Scheffe, Tukey, Tukey's-b, R-E-G-W-F, R-E-G-W-Q, S-N-K, Waller-Duncan
Levene's homogeneity of variance
Bivariate Correlations
Pearson's correlation coefficient, Spearman's rho, Kendall's tau-b, univariate statistics,
covariances and cross-products, outlier screening prior to analysis
Linear Regression
Estimates of linear regression coefficients, standard errors of coefficients, significance of
coefficients, blocking of variables, residual calculation and residual analysis, standard
ANOVA, weighted least-squares analysis
Curve Estimation
Models available: linear, logarithmic, inverse, quadratic, cubic, power, compound, S-
curve, logistic, growth, exponential
For each model: regression coefficients, multiple R, R^2, adjusted R^2, standard error of
estimate, ANOVA table, predicted values, residuals, prediction curves
Nonparametric Tests
Chi-square, binomial test, runs test, one-sample Kolmogorov-Smirnov test, Mann-
Whitney U test, Moses test, Wald-Wolfowitz test, Kruskal-Wallis test, Wilcoxon
signed rank test, Friedman's test, Kendall's W test, Cochran's Q test
Multiple Response Analysis
Frequencies and frequency tables calculated, multiple response cross-tabs given
may be improved by reducing the inherent variation. Thus both production and test
systems can be in control at various quality levels.
Internal Techniques
Some frequently used internal techniques are
Repetitive measurements
Internal test or reference samples (or objects)
Control charts
Interchange of operators and/or equipment
Independent measurements
Definitive (alternative) methods or measurements
Audits
The purpose of all of these techniques is to determine how a system performs based on the
selection of one or more performance parameters. The first two techniques, repetitive
measurements and the use of internal test or reference samples, are the classical ways to
evaluate precision. However, this can be a time-consuming process if it is not carefully
planned, and quality assessment frequently provides a way to minimize the number of
measurements. The use of duplicate samples in routine testing, and the accumulation of
this information over a time period, is another approach to evaluating precision. More
details on this will be given in Section 8 on laboratory precision, which addresses both
intralaboratory and interlaboratory operations.
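As a sketch of the duplicate-sample approach just mentioned (Python with NumPy; the data are invented, and the pooling formula used is the standard one for paired duplicates rather than a formula quoted from this chapter):

import numpy as np

# Illustrative routine duplicates accumulated over a time period; each row
# is one test occasion, measured twice on the same sample.
pairs = np.array([[10.2, 10.5], [9.8, 10.1], [10.4, 10.3],
                  [10.0, 10.6], [9.9, 10.0]])

# Pooled within-pair standard deviation: sqrt(sum(d^2) / (2m)) for m pairs.
d = pairs[:, 0] - pairs[:, 1]
s_r = np.sqrt(np.sum(d**2) / (2 * len(d)))
print(round(float(s_r), 3))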
The use of control charts is a documented way to interpret sequential test data and
assess quality as well as to monitor or maintain quality. This is described in more detail
below. The interchange of operators and equipment (if possible) can also provide valuable
information about sources of variation and their influence on quality. Test data should be
independent of such factors as operators and individual test machines of the same design.
If output data are related to such factors, operator training and machine calibration
operations need attention. Independent measurements and definitive methods are related;
they both may be used to measure the same parameters by a different but equivalent
method. Although independent measurements may not be available for some methods,
the results of such testing may give information about any bias in a selected test method,
although this topic is more rigorously approached by way of external techniques as
described below. Audits by internal staff members of any standard operating procedure
(SOP) may also assist in assessing and improving quality.
External Techniques
Some frequently used external techniques that may be used are
Collaborative or interlaboratory testing
Exchange of test samples (or objects)
External Reference Materials (RM)
Standard Reference Materials (SRM)
Audits
All of these techniques establish a relationship between a laboratory and the external
world. Collaborative or interlaboratory testing consists of programs organized to test
samples from one or more selected materials (or objects) that have some documented
level of homogeneity and are supplied to a number of laboratories. Participation in col-
laborative programs allows a laboratory to determine how it stands in comparison to
other participants in the program and the accepted reference value for the measured
parameter(s). The exchange of test samples is a technique often used for producer-con-
sumer situations to resolve any testing disagreements. Bias may be evaluated by the use of
reference materials; these may be ad hoc reference materials or RMs, developed by a
standardization committee or other recognized group, that have an accepted reference
value; or they may be more formally developed standard reference materials or SRMs
from various national standardization laboratories or bodies such as the National
Institute of Standards and Technology (NIST) in the USA.
By comparing the values obtained in any laboratory to the standard reference value
the magnitude of any bias is clearly indicated. Biases may be dependent on a number of
factors concerning the testing operation; calibration procedures, operator technique, and
ambient laboratory conditions are some typical sources. See Annex C for more details on
bias evaluation. Interlaboratory testing is used to evaluate repeatability or within-labora-
tory precision and reproducibility or between-laboratory precision. External audits by any
number of accredited organizations are important and have been given increasing atten-
tion in the past decade as the interest in such standards as the ISO 9000 series listed in the
bibliography has risen.
time during the testing operation, and the results are examined. If problems appear when
specified analysis protocols are employed, the problem is resolved and a second quality
control process is established. This procedure is repeated until no further improvement can
be made with the technical evaluation procedures available. When this state is reached full
statistical control is achieved within the scope of the testing technology.
All operating systems have one common characteristic: the output is inherently vari-
able. When the variability present in any operating system is confined to indeterminate
random variation, the system is in a state of statistical control. This type of variation is also
called common cause variation. The magnitude of this random variation is a function of
the complexity of the testing system and the technology available to detect and eliminate
this unwanted variation. Statistical control is that system state after all sources of deter-
minate or assignable variation have been eliminated by the tools available to the experi-
menter. Assignable variation is variation that can be traced to a specific cause such as poor
calibration procedures, poorly trained operators, faulty machine settings, and similar
problems. This type of variation is also called specific cause variation. The discovery and
elimination of assignable variation is dependent on the skill and expertise of an experi-
menter and on the level of the technology available for searching out potential causes of
specific variation.
The basic approach to both assessing and controlling variation of either sort is the
control chart technique, as originally developed by Shewhart [8], which can be used (1) to
assess the level of achievable quality in a series of repetitive application steps to discover
and eliminate assignable causes of variation, and (2) to control the level of quality once a
certain level has been established. Control charts can provide a clear indication of the
repetitive nature of a measured parameter in the sense of evaluating the long-term varia-
tion and the short-term variation. The use of control charts is based on the assumption
that once all sources of the most easily recognized assignable variation have been elimi-
nated, the residual level of variation is represented by a normal distribution. It must be
recognized that any level of residual variation may contain some components that may at
some future time be identified as assignable variation. Quality assessment and control can
be based on one of two types of data: (1) attribute data, which are frequently defined as
count data, the number of defective items or units in a sample of specified size, or (2)
variable or measurement data, which are expressed on a continuous scale. Although the
basic ideas of quality assessment and control are the same for both types of data, the
specific details of calculation are somewhat different. Since attribute data are typical of
industrial production processes, the procedures for their use are not described here.
Control Charts
There are two basic types of control charts. One is a chart that illustrates the long-term
variation of the process or system; it consists of a mean value (of a set of n measurements)
designated as x̄n, for some measured parameter, plotted sequentially (hourly, daily,
weekly). This is called an x̄n chart (x-bar chart). A second type of chart is one that
indicates the short-term variation in the set of n individual measurements for the average
or mean x̄n. This usually takes the form of a range R of two or more measured parameter
values obtained in a specified time span (either side by side or within a brief specified time),
also plotted sequentially as above; this is called an R chart. Both of these charts have
certain characteristics or limits that are developed to aid in the interpretation of the x̄n and
R values as they are recorded in time.
Both types of charts have a central line, which can be defined as the mean quality level.
In the x̄n chart this equates to the mean value of the measured parameter set over some
stable time period. In the R chart this central line is the average range, also over some
stable established time period. When sequential values of either x̄n or R lie close to the
central line, there is some degree of confidence that the system is in statistical control.
More objective decisions on whether statistical control is achieved are made on the basis of
limits on x̄n and R.
The x̄n chart. This has values and limits defined according to the following:
Central line = mean of the x̄n values (the grand mean)
The lower and upper control limits, LCL and UCL, lie below and above the central line by
an amount A2(R̄), where R̄ is the average range of the sets; for the R chart the central line
is R̄ itself, with control limits D3(R̄) and D4(R̄). The constants A2, D3, and D4 depend on
the number of measurements n in each set:
Number in set   A2     D3     D4
2               1.88   0      3.27
3               1.02   0      2.58
4               0.73   0      2.28
5               0.58   0      2.12
6               0.48   0      2.00
7               0.42   0.08   1.92
8               0.37   0.14   1.86
The lower warning limit LWL and the upper warning limit UWL are defined as two-thirds of the control limits and
are calculated using the LCL and the UCL by
LWL = 0.667(LCL) and UWL = 0.667(UCL) (97)
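The sketch below (Python with NumPy; the data and the function name are illustrative, not from the text) sets up x̄n and R chart parameters from subgrouped data, using the tabulated constants and the warning-limit rule of Eq. 97:

import numpy as np

# Constants A2, D3, D4 from the table above, keyed by subgroup size n.
CONSTANTS = {2: (1.88, 0.00, 3.27), 3: (1.02, 0.00, 2.58),
             4: (0.73, 0.00, 2.28), 5: (0.58, 0.00, 2.12),
             6: (0.48, 0.00, 2.00), 7: (0.42, 0.08, 1.92),
             8: (0.37, 0.14, 1.86)}

def chart_parameters(subgroups):
    """subgroups: one row per time point, n measurements per row."""
    data = np.asarray(subgroups, dtype=float)
    n = data.shape[1]
    a2, d3, d4 = CONSTANTS[n]
    xbar = data.mean(axis=1)                    # points for the x-bar chart
    rng = data.max(axis=1) - data.min(axis=1)   # points for the R chart
    x_center, r_center = xbar.mean(), rng.mean()
    ucl_x = x_center + a2 * r_center            # control limits, x-bar chart
    lcl_x = x_center - a2 * r_center
    uwl_x = x_center + 0.667 * a2 * r_center    # warning limits at two-thirds
    lwl_x = x_center - 0.667 * a2 * r_center    # of the control-limit distance
    ucl_r, lcl_r = d4 * r_center, d3 * r_center  # control limits, R chart
    return dict(x_center=x_center, lcl_x=lcl_x, ucl_x=ucl_x,
                lwl_x=lwl_x, uwl_x=uwl_x,
                r_center=r_center, lcl_r=lcl_r, ucl_r=ucl_r)

# Illustrative use: ten daily sets of four repeat measurements.
gen = np.random.default_rng(0)
print(chart_parameters(gen.normal(100.0, 2.0, size=(10, 4))))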
The quality level of production processes can be improved by using factorial design
experimentation. Although such experimental designs have been used for various research
and development programs for several decades, their application to industrial production
processes was pioneered by G. Taguchi [11,12]. His work, which is in essence based on
fractional factorial experimentation, has shown that these techniques can frequently be
used (1) to select one production variable to minimize variation (by careful control) and
select another variable to hold response on target, and (2) to create products that are less
sensitive to the environmental conditions of the process. Cause-and-effect diagrams
obtained as the output of brainstorming sessions with experienced personnel can often
be used to good advantage to detect possible causes for poor quality. This type of infor-
mation can lead to improvements in the SOPs for such testing.
Quality improvement can also be attained by a number of concerted efforts such as
"ruggedness testing" procedures that evaluate the influence of operational factors, as
outlined in Section 4. The use of comprehensive quality manuals and periodic audits
and reviews is also helpful. When computers are used for data acquisition and processing,
the verification and integrity of both the hardware and the software are essential to
quality. All of these options for improving industrial production quality can be used
with some minor modifications for improving the quality of laboratory testing and mea-
surement programs.
zation into technology. In 1884 the American Society of Mechanical Engineers (as well as
Civil and Mining Engineers) began conducting interlaboratory testing, and the results
were a mass of conflicting data with very poor agreement. In Europe at about the same
time similar activity was taking place. In Germany a voluntary standardization organiza-
tion was created as the result of an international conference held in Munich in 1884.
As international trade has increased over the past several decades, those standardiza-
tion organizations that develop test method and specification standards have made policy
decisions that all test methods shall have, as part of the standard, a section on the typical
precision that can be expected. The American Society for Testing and Materials, ASTM,
took such action in 1976. Other national standardization organizations, such as the British
Standards Institution, BSI, and the Deutsches Institut für Normung, DIN, in Germany, have
also embraced the concept of test method precision. The International Organization for
Standardization, ISO, has adopted similar policies, and more than 30 ISO committees are
engaged in this work. To facilitate the work on evaluating precision for standard test
methods, these organizations have developed guideline standards on how such evaluations
shall be conducted, how the data are to be analyzed, and how the results are to be
expressed. See the bibliography for ASTM and ISO standards on precision. Annex D
gives the necessary background for evaluating precision; topics include organizing an
interlaboratory test program (ITP), a review of the terminology used, the assumptions
underlying the analysis, and the calculation algorithms for repeatability r and reproduci-
bility R. Although there may be some small differences in nomenclature when comparing
the ASTM and ISO precision standards, the calculation algorithms for both are identical.
Numerous precision evaluation programs have been conducted using these and similar
guideline standards for the past four or more decades. In almost all fields of technical
activity these precision evaluation programs have shown that many of the current and
well-developed test methods show very poor interlaboratory or reproducibility precision.
Bias
The major reason for the customary poor reproducibility for many test methods is the
existence of a nonnormal or biased between-laboratory data distribution. Bias exists
because each laboratory has its own testing culture, a unique environment and way of
conducting any test that is dependent on the operational conditions in the laboratory. This
occurs despite the use of standardized testing methods. This biased output causes a
laboratory to be almost always low or high compared to some reference value and to
other laboratories. Annex C gives the needed steps to evaluate bias, provided that one or
more reference materials or standards are available that have recognized (true) values.
One of the early pioneers in the analysis of ITP data, W. J. Youden, demonstrated
more than thirty years ago the dominant influence of interlaboratory bias (or systematic
error as he called it) in a series of publications [4,5, 15]. He showed with simple graphical
plotting techniques the unmistakable existence of bias. The existence of an essentially
constant bias for any laboratory invalidates the customary assumption in ITP analysis
that a random normal distribution adequately represents the between-laboratory varia-
tion. See Annex D.
Veith [13] reviewed the current state of precision testing using some ASTM test
methods in the rubber manufacturing industry in 1987. Mooney viscosity (ISO 289,
ASTM D1646), a widely used test for quality assessment of raw rubbers, gave reasonably
good relative precision, Type 1 (r) pooled values of 3.0 percent for several clear rubbers,
and good pooled (R) values of 3.8 percent on the same basis. See Annex D for the
definition of Type 1 and 2 precision and (r) and (R). For a widely used rate-of-cure
test, the oscillating disc curemeter (ISO 3417, ASTM D2084), the precision was substan-
tially worse with Type 1 (R) values, which depend on the material being tested, in the
range 20 to 81 percent. He also demonstrated that interlaboratory bias was responsible for
the poor agreement.
Brown [14] reviewed the results of interlaboratory precision evaluation programs in
ISO TC45 (Rubber Product Testing) in 1989. He found reasonable precision for hardness
tests (ISO 48, ASTM D2240) with Type 1 (r) values in the range 3.0 to 6.0 percent, and
similar (R) values in the range 6.0 to 11.0 percent. Tensile or stress-strain testing (ISO 37,
ASTM D412), a test with widespread usage, gave Type 1 (R) values in the range of 8.0 to
32 percent. Many other common tests, such as compression set and temperature rise in
flexometer testing, showed very poor precision. For compression set, Type 1 (R) values
were in the range 26 to 32 percent. For temperature rise, Type 1 (R) values were in the
range 80 to 97 percent. Brown called into question the wisdom of conducting some of the
tests at all, considering the wide variation in interlaboratory results.
Uncertainty
The generic concept of uncertainty has been used throughout this chapter because it is a word
that effectively conveys the sense of ambiguity about a measured result. The alternative
concept of specific uncertainty may be defined as "the estimate attached to a test result that
defines or characterizes the range of values within which the true value of the measured
property is asserted to lie." This definition is similar in principle to that for a confidence
interval as discussed in Section 3.6, but it lacks the instructions on how to calculate the
"range of values." The establishment of procedures to calculate this type of specific
uncertainty is currently under development by standardization organizations, and the
provisional standard ISO/CD 12102, which contains the above definition, is currently
under review. The major problem in this effort is the development of a comprehensive
method to express this range so as to encompass any selected testing domain with certain
specific factors that influence the range. The remainder of this section is devoted to specific
uncertainty, and for brevity the word "specific" is dropped.
There are a number of components to this uncertainty, each associated with particular
testing or other operational factors that influence the uncertainty range. These components
were previously addressed in Annex B on the statistical model for testing operations. As
discussed in the annex, there are two categories that influence the deviations that perturb
any measured value for an object or material: production variation and measurement
variation. Within each of these categories there are two additional types of components
that cause deviations about any measured value: random and bias. Current approaches to
uncertainty concentrate on the measurement variation and ignore the production variation
by implicitly assuming that this variation is or can be made to be negligible.
Uncertainty is frequently based on the concept of error populations, where for any of
these populations "error" implies a deviation δi from some true or reference value μr that
would be obtained for the measurement yi in the absence of any type of perturbation by a
physical cause. However, the word error can be defined on a much broader basis, and
deviation will be used rather than error.
δi = yi - μr (98)
There may be any number of causes (1, 2, ... , i) that contribute a deviation component.
This situation may be represented by Eq. B1 in Annex B, given here as Eq. 99.
yi = μ0 + μi + Σ(b) + Σ(e) + Σ(B) + Σ(E) (99)
Quality Assurance of Physical Testing Measurements 71
The reference or true value for any material or object class for a particular measurement
process is the sum of the first two terms of Eq. 99; thus
μr = μ0 + μi (100)
A rearrangement of Eq. 99 using μr = μ0 + μi shows that δi for any given measurement is
δi = Σ(b) + Σ(e) + Σ(B) + Σ(E) (101)
The terms Σ(b) + Σ(e) contribute bias and random deviation components attributable to
inherent or production process variation in the material (or object class), and the terms
Σ(B) + Σ(E) contribute bias and random components attributable to the operational
conditions of the measurement. Most discussions of uncertainty assume that the magnitude
of Σ(b) + Σ(e) is negligible compared to Σ(B) + Σ(E). However, for any realistic
appraisal of uncertainty this assumption may not be tenable. For the discussion to follow,
a negligible magnitude for Σ(b) + Σ(e) will be assumed for the sake of simplicity.
Uncertainty evaluation is concerned with calculating a ± range about any yi value that
has a high probability of including the reference or true value μr within the range. Just as
in the case of a confidence interval, this range is obtained on the basis of the standard
deviation of δi values for the defined testing domain. This standard deviation for δi is a
composite standard deviation obtained as a special sum of the variances of all individual
components contained in the four types of terms on the right-hand side of Eq. 101. Thus,
omitting the terms Σ(b) + Σ(e), a composite variance for δi may be defined as
var(δi) = Σ var(B) + Σ var(E) (102)
and expressed as a standard deviation, sd(δi)
sd(δi) = [var(δi)]^(1/2) (103)
The bias components may be divided into two categories: (1) global or inherent bias,
unique to the test and common to all locations and machines, and (2) local bias, unique
to a particular location and/or machine. The set of particular terms as given in Annex B,
Eq. B2, better illustrates bias components such as BL, a unique laboratory component; BM,
a machine component; and BOp, an operator component; as well as random components EM,
potential random differences among machines, and EOp, potential random differences
among operators. To these local components an additional potential global
bias, BGbl, must be added. Uncertainty u may be given in terms of sd(δi) as
u = k[sd(δi)] (104)
where k is a multiplying factor, and the measured value yi with its uncertainty may be
expressed as
yi ± u = yi ± k[sd(δi)] (105)
The key issue in calculating the uncertainty u is evaluating sd(δi) and adopting a value for
k. If all the components of sd(δi) are known on the basis of 18 to 20 df for each component,
then a value of 2 may be used for k. The operation of ensuring that all important
components of sd(δi) are fully evaluated is complex and beyond the scope of this chapter.
Guidance standards for evaluating sd(δi) are currently under development, and reference
should be made to this current activity; see ISO/CD 12102 in the bibliography.
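A minimal sketch of the bookkeeping in Eqs. 102 to 105 (Python with NumPy; all component values and the measured value are invented for illustration):

import numpy as np

# Illustrative variance components of the measurement operation:
bias_vars   = [0.010, 0.004]           # var(B) terms: e.g., laboratory, machine bias
random_vars = [0.020, 0.006, 0.003]    # var(E) terms: e.g., machine, operator, residual

var_delta = sum(bias_vars) + sum(random_vars)   # Eq. 102
sd_delta = np.sqrt(var_delta)                   # Eq. 103
k = 2.0                       # Eq. 104 coverage factor, assuming adequate df per component
u = k * sd_delta

y = 12.46                     # an illustrative measured value
print(f"{y} +/- {u:.3f}")     # Eq. 105: the value with its uncertainty range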
Improving Reproducibility Precision
Poor reproducibility precision has been one of the major reasons for the establishment
of laboratory accreditation systems and the organizations that administer such systems as