HANDBOOK OF POLYMER TESTING
PLASTICS ENGINEERING
Founding Editor
Donald E. Hudgin
Professor
Clemson University
Clemson, South Carolina
edited by
ROGER BROWN
Rapra Technology Ltd.
Shawbury, Shrewsbury, England
MARCEL DEKKER
Headquarters
Marcel Dekker, Inc.
270 Madison Avenue, New York, NY 10016
tel: 212-696-9000; fax: 212-685-4540
The publisher offers discounts on this book when ordered in bulk quantities. For more information,
write to Special Sales/Professional Marketing at the headquarters address above.
Neither this book nor any part may be reproduced or transmitted in any form or by any means,
electronic or mechanical, including photocopying, microfilming, and recording, or by any
information storage and retrieval system, without permission in writing from the publisher.
It is essential for design, specification, and quality control to have data covering the
physical properties of materials. It is also essential that meaningful data are obtained by
using test methods relevant to the materials. The different characteristics and behavior of
materials dictate that particular test procedures be developed, and often standardized, for
each material type. Polymers, especially, have unique properties that require their own
measurement techniques.
There is a wide range of polymers from soft foams to rigid composites for which
separate industries have developed. Each has its own individual test methods and, for
the major types of polymers, texts exist that detail these procedures. There are, however,
many similarities between different polymer types and frequently it is necessary for
laboratories to consider a spectrum of materials. Consequently, there are advantages in a book
that comprehensively covers the whole polymer family, describing the individual methods
as well as discussing the approaches taken in different branches of the industry.
Handbook of Polymer Testing provides in one volume that comprehensive coverage of
physical test methods for polymers. The properties considered cover the whole range of
physical parameters, including mechanical, optical, electrical, and thermal as well as
resistance to degradation, nondestructive testing, and tests for processability. All the
main polymer classes are included: rubbers, plastics, foams, textiles, coated fabrics, and
composites. For each property, the fundamental principles and approaches are discussed,
and the particular requirements and the relevant international and national standards for the
different polymer classes are considered, together with the most up-to-date techniques.
This book will be of particular value to materials scientists and technologists, and to
all those who need to evaluate a spectrum of polymeric materials, including students,
design engineers, and researchers. Its structure allows reference for the main properties
at both the general and the detailed level, thus making it suitable for different levels of
knowledge.
Chapter 29 is based on material produced for the "Testing Knowledge Base" at Rapra
Technology, Ltd. Extracts from British Standards were reproduced with the permission of
BSI. Users of standards should always ensure that they have complete and current infor-
mation. Standards can be obtained from BSI Customer Services, 389 Chiswick High
Road, London W4 4AL, England.
The other contributors and I gratefully acknowledge the support, information, and
helpful advice given by our colleagues during the preparation of this book.
Roger Brown
Contents
Preface iii
1. Introduction 1
Roger Brown
4. Standardization 105
Paul P. Ashworth
6. Conditioning 141
Steve Hawley
Index 841
1
Introduction
Roger Brown
Rapra Technology Ltd., Shawbury, Shrewsbury, England
The physical properties of materials need to be measured for quality control, for predicting
service performance, to generate design data, and, on occasions, to investigate failures.
Without test results there is no proof of quality and no hope of successfully designing new
products. The group of materials classed as polymers generally have complicated behavior,
and as much as or more than with any material it is critical that their properties be evaluated,
and evaluated in a meaningful way. Their characteristics are such that methods used for
other materials such as metals or ceramics will not usually be suitable. There are also
distinct differences between the classes of materials that make up the polymeric group,
from flexible fabrics and soft foams through solid rubbers and thermoplastics to very rigid
thermosets and composites. Consequently, it is no surprise that particular procedures have
been developed and standardized to suit the needs of each material class.
There are excellent texts that deal in great detail with test methods for rubber and for
plastics, etc. In dealing with one field they recognize the unique requirements for each class
of materials and emphasize the particular procedures that have been standardized in each
industry. It is no criticism of such texts to say that by concentrating on a restricted scope
they do not bring out the similarities and the common themes that run through the testing
of all polymers. It is relatively recently that, rather than metallurgists and plastics
technologists, etc., the material scientist or technologist has emerged with an important role.
There is great need, for technical and commercial reasons, for many companies to consider
and use a spectrum of materials. For these interests, a book that covers the fundamentals
and the latest techniques for testing the whole polymer family will offer many advantages
to students, design engineers, researchers, and those who need to evaluate a wide range of
products and materials.
Broadly stated, the scope of this book is the physical testing of polymers. Polymers
have been taken to include rubbers, plastics, cellular materials, composites, textiles, and
coated fabrics: all the materials generally considered to make up the polymer industry
with the exception of adhesives. A great many adhesives are polymeric, but it is considered
that treatment of adhesive testing does not fit well with physical testing of the main
polymer classes and requires its own volume. The standardized adhesion tests for solid
polymers adhered to themselves or other substrates are, however, included.
Physical testing is used in its literal sense and hence does not include chemical analysis.
The distinction between physical and chemical is perhaps not completely clear-cut, in that
aging and chemical resistance are generally considered as physical tests but clearly involve
monitoring the effects of chemical changes. Thermal analysis, for example, straddles both
camps, and particular techniques have either been included or excluded depending on their
purpose.
The aim of this book is to present an up-to-date account of procedures for testing
polymers, indicating the similarities and the differences between the approaches taken for
the different materials. Within the restrictions mentioned above, it is intended to be
comprehensive. Hence it sets out to cover all the physical properties from dimensional
through mechanical, thermal, electrical, etc., to chemical resistance, weathering, and
nondestructive testing. In addition to all these tests on the formed material or product, processability
tests are also included. The focus is on testing materials rather than on finished products.
Indeed, the vast number of tests, many ad hoc, devised for evaluating performance of the
multitude of products made from polymers, would fill a volume, even supposing the
subject could be coherently treated. Comment on product testing, however, is made
where appropriate. It should also be noted that many tests used for products are adaptations
of the normal material tests, for example stress-strain properties on plastic film
products and geomembranes. Even more widely, it is the usual practice to cut test pieces
for standard material tests from such products as hose, conveyor belting, and containers.
A rather novel structure has been used, which is designed to give a progressive path
for the bulk of commonly measured properties, from background principles, through basic
established practice, to the particular requirements of the different materials and then to
less common and more advanced techniques. It hence allows ready reference at different
levels and largely avoids the complications of dealing with the details of different
procedures for several materials in one place.
The basic structure consists of five sets of chapters. Chapters 1 through 7 cover general
topics, sample preparation, conditioning, accuracy, reproducibility, etc., all materials
being considered together.
Processability tests differ from all the other physical properties included in the book
by virtue of being concerned with properties of relevance to the forming of materials and
not the performance of the finished material or product. Chapter 8 deals with the
processability tests in two parts, for rubbers and plastics respectively.
Chapters 9 through 14 are resumes of the principles and the basic approaches taken
for the more commonly tested parameters.
In Chapters 15 through 20 the particular requirements of each of the classes of polymeric
materials covered are considered in more detail, including reference to the standardized
procedures. The scope of properties covered is essentially the same as in Chapters 9
through 14.
The remaining chapters address selected topics. The topics have been chosen for one
or more of three reasons: it is convenient to cover all polymer classes together; the
parameters are not those most commonly measured; or the subject is of particular topical
interest.
The practicality of this for the reader is that if the subject of interest is in Chapters 1
through 8 or 21 through 32 then selection of the relevant chapter will find the main
coverage of the subject. For other properties, the procedures for a particular polymer
class can be found by selection of the appropriate chapter in the group 15 through 20.
If the principles of the more common tests and comparison of the approaches for different
polymer classes are required, then consult Chapters 9 through 14. It is suggested that these
chapters be read before the subsequent chapters, especially if the reader is relatively new to
polymer testing. It is also essential that the requirements for test piece preparation,
conditioning, and dimensional measurement covered in Chapters 5, 6, and 7 be considered in
conjunction with all the procedures discussed later.
All reasonable effort has been taken to make the book integrated rather than a series
of independent chapters by different contributors. Inevitably there will be some overlap
and repetition, but it is believed that this, and the relative complexity of the structure, is
outweighed by the confusion that could result from trying to weld discussion of common
tests for contrasting polymer types. It is inevitable also that there should be differences in
style adopted by the different authors, which perhaps illustrates that testing can be
approached in more than one way.
The emphasis is on standard test procedures, which by definition are those that have
become widely accepted. Where standardized methods exist, they should be used for
quality control purposes and for obtaining general material property data, to ensure
compatibility between results from different sources. It is counterproductive to invent
alternative procedures when satisfactory and well-tried methods exist, and doing so prevents
meaningful comparisons of data from being made. It has to be accepted that many
standard methods have severe limitations for the generation of design data, but nevertheless
they can often form a good basis for producing more extensive information.
Unfortunately, standard tests are not completely standard, in that different countries
and organizations each have their own standards. The situation has been steadily improving
in recent years as more national standards bodies adopt international methods, and
this is a trend that we should all encourage. In this book the ISO (and for electrical tests
the IEC) standards, together with those of two of the leading national English-language
standards-making organizations, the ASTM and the BSI, are considered, plus the
European regional (CEN) standards. In a great many cases British standards are identical
with ISO standards, but ASTM standards are at the very least editorially different. British
standards will always be identical with CEN standards where these exist and, in turn, CEN
standards are often identical with ISO.
It is not possible to claim that every type of test known for every property has been
included, but, within the defined scope, any omission is by accident rather than design. It is
also likely that not every standard from the standards bodies covered will have been
referenced. Standards are continually being developed and revised, so it can be
guaranteed that between writing and publication there will have been some changes;
thus it is essential that the latest catalogs from the ISO, etc., be consulted for the most
up-to-date position.
The apparatus needed for tests is considered in conjunction with the test procedures,
but in many cases it is not an easy matter to select from the range of apparatus available in
differing levels of sophistication or, indeed, to be able to find any supplier at all. The Test
Equipment and Services Directory (published by Rapra Technology, England, in hard
copy and on CD) contains both advice on selection and a comprehensive guide to
instrument suppliers.
2
Putting Testing in Perspective
Ivan James
Forncet, Wem, Shropshire, England
1 Philosophy
As a generality, technical people want to test to obtain knowledge, whereas commercial
people will test only when there is some pressure to do so. In an age of cost-cutting and
streamlining of production, it may seem that testing is an unnecessary expense, but the
reverse is true, since alongside an awareness of cost has grown an increasing customer
awareness of quality. The consequences that arise when testing is omitted are illustrated by
the following examples.
Some years ago a colleague who served on several SI committees once attempted to
buy a radio while he was abroad. The young assistant took one off the shelf, switched it
on, and it didn't work. She then unpacked one from its box just as it had arrived from the
factory, and that one didn't work either. Eventually she found one that did and was
surprised when he declined to buy it. What astounded him was that these products
could be made and packed without any testing whatever to prove fitness for purpose
until they reached the point of sale.
However, testing the product may not be sufficient, since in a complex product such
as a radio, the reliability of the assembly depends on the reliability of each of the
individual components. This was brought home to a supplier who asked a buyer what
failure rate he would accept. The curt answer "zero" failed to convince him, and 0.1%
was suggested. "No, zero," was the reply. "But that implies testing every component,"
said the supplier. "Exactly," answered the buyer. In practice, of course, virtually zero
reject levels can be achieved without 100% testing by extremely tight control of the
process, but the story illustrates the point. The increasing demand for product quality
brings in its train a requirement for component reliability, and that implies component
testing also.
Brown [1] has previously suggested that as well as these two reasons for testing, two
others can be listed, namely tests to establish material properties for design data and tests
to establish reasons for failure if a product proves to be unsatisfactory in service.
Polymers are complex materials, and aspects of their behavior are sometimes unexpected.
For this reason, tests on polymers need to be well chosen and wide ranging in order
to avoid embarrassing failures. It is important to establish early on that the grade of
material chosen fully matches the design criteria for the product. For example, a plastic
component, although initially of adequate strength, may on constant exposure to
detergents suffer from environmental stress cracking.
Examination of failed products or components is related to this, and testing may
reveal that the material did not meet the designer's specification or show that some
important property, such as creep, has been overlooked. The coating applied to surgeons'
gloves, for example, may be more important than the composition of the rubber.
Summarizing, then, and following the approach suggested by Brown, there are four
main areas of testing, namely:
1. Quality control
2. Predicting service performance
3. Design data
4. Investigating failures
Before undertaking any tests, and before considering which properties to measure, it is
essential to identify the purpose of testing, because the requirements for each of the
purposes are different. Failure to appreciate this can lead to time-wasting tests that do
not yield the required results. Similarly, a lack of understanding as to why another person
is carrying out particular tests can lead to misunderstanding and argument, say, between
the research department and the quality control department in a factory.
To the various attributes related to testing procedures (precision, reproducibility,
rapidity, and complexity) may be added the ability for tests to be automated and the
desirability for tests to be nondestructive. The balance of these various attributes, and
the related cost, differs according to the purpose of the test that is undertaken. These will
be considered in turn, but in all cases the precision and reproducibility must be
appropriate to the tests undertaken.
which is known. In turn the normal force depends on the dimensions of the two
components and the modulus of the stopper. However, for a quality control test (and a
performance test) all that is needed is that limits be set on the upper and lower levels of force
required to extract the stopper. Knowledge of the individual parameters is not necessary.
This is an example of a functional test.
Staying with this frictional analogy, the inclined plane method of measuring friction
would give an apparent coefficient of friction, since conditions are not tightly controlled
(velocity, for example, cannot be specified) and there is little chance of relating the result to
other conditions.
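For reference, the apparent coefficient from the inclined plane follows directly from the tilt angle θ at which sliding just begins:

$$\mu_{\text{apparent}} = \tan \theta$$

a single figure obtained under the one uncontrolled sliding condition at which slip occurred.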
If, on the other hand, the requirement is to measure the coefficient of friction between
two materials for design purposes, then shape, surface finish, normal load, velocity,
temperature, cleanliness, and humidity all become important parameters needing to be
controlled. Furthermore, this illustrates the shortage of truly fundamental tests in which the
rules for extrapolating to other conditions are well known, as in this case it would
probably be necessary to produce multipoint data.
These three types of test can be loosely related to the purpose of testing.
In establishing design data, it is mostly fundamental properties that are needed, but
these are in short supply. Many thermal and chemical tests are fundamental in nature, but
most mechanical tests give apparent properties. In the absence of established and verified
procedures for extrapolating results to other conditions, multipoint data have to be
produced at defined levels of all the parameters likely to influence the test result.
Consequently, reliable tables of properties for designers are difficult and expensive to
establish.
Standard test methods giving apparent properties are best suited to quality control,
and only in relatively few cases are they ideal for design data. Quality control tests are the
most easily established, and many existing methods fulfil this need. In seeking an
improvement in test procedure it is not always a more accurate test that is required. Depending on
the purpose for which testing is undertaken, it may be quicker or cheaper tests that are
required, or most important of all, tests that are relevant to service performance.
For predicting service performance, the most suitable tests would be functional ones.
For investigating failures, the most useful tests depend on the particular circumstances,
but fundamental mechanical methods are unlikely to be needed.
3 Trends
Because of the different reasons for testing and the consequent differences in test
requirements, developments in test methods do not follow one path. The basic themes are
constant enough: people want more efficient tests in terms of time and money, better
reproducibility, and tests more suited to design data and more relevant to service
performance. However, the emphasis depends on the particular individual needs.
In recent years, the drive towards international standards has led to a close
examination of long-established test methods, and it has been found that the reproducibility of
many of the tests was poor. This in turn has not led to new tests but rather to the
establishment of better standardization of test procedures. There has also been a growing
realization of the need to calibrate test equipment with proper documentation of calibra-
tion procedures and results.
Where different test methods associated with different countries have been in use for a
long time, it has sometimes been difficult to reach an acceptance of one method as a
standard test procedure. In these cases it has been necessary to present 'Method A' and
'Method B' as equally acceptable. Similarly, different test conditions have been allowed,
perhaps taking account of the difference between temperate and tropical conditions.
Although in the local environment this may be quite satisfactory, it leads to difficulties
if the results are presented in a database and used in a wider context. Figures presented in
a database all need to be produced in exactly the same way, and consequently there has
been a lobby for extremely tight standards, with no choice of method or test conditions,
specifically to yield completely comparable data for presentation in a database. Admirable
though this approach may seem, it has to be recognized that the freedom of action
previously allowed enabled the tests to be used over a wider range of industrial conditions
than that envisaged by those setting up databases.
Automation and, in particular, the application of computers to control tests and
handle the data produced have brought about vast changes in recent years. It is not
only a matter of automation saving time and labor; it also influences the test techniques
that are used. For example, these developments have allowed difficult procedures to
become routine and hence increased their field of application. There are many examples
of tests that would not exist without certain instrumentation, thermal analysis techniques
being one of the more obvious. Advances in instrumentation for an established test may
change the way in which it is carried out but do not generally change the basic concept or
change it to produce more fundamental data.
Whether automatic or advanced instrumentation really saves money is difficult to say.
Initially the equipment costs more, but this is offset by a saving in labor. However, the old
adage that if a thing can go wrong, it will, remains true, and maintenance costs of complex
equipment are high. Finally, the calibration of such equipment can be difficult, and the
software that so readily transforms the data can give rise to concern as to what has
happened between the transducer and the final output.
While improvements in tests in respect of their usefulness for generating design data
and predicting service performance are continually sought, the advances have perhaps
been less dramatic. The fundamental tests needed for design are often very difficult to
devise and are likely to be more expensive to carry out and required only by a minority. As
with so many things, the advances can be related to commercial pressures and the amount
of effort that is funded. Where better and more fundamental tests do exist they are not
always used as often as they should be because of the cost and complexity involved.
There has been an increase in tests on products, which has resulted from a greater
demand to prove product performance and from specifications more often including such
tests as part of the requirements.
For the future, it is highly probable that the same themes will continue. The quality
movement is still strong, and the generation of databases will probably ensure that greater
compatibility is achieved. Certainly there will be further developments in instrumentation
and the handling of data. It would be a brave person who predicted a surge in tests for
better design data, but there are signs that the sophistication of markets will lead to wider
needs in this direction.
4 Test Conditions
Under the broad heading of test conditions should be included the manner of preparation
of the material being tested and its storage history, as well as the more obvious parameters
such as test temperature, velocity of test, etc. While it is recognized that the result obtained
depends on the conditions of the test, it is not always obvious that some of these
conditions may have been established before the samples were received for testing. Sometimes
the history of the samples is part of the test procedure, as in aged and unaged samples for
example, but at other times it may not be at all clear that certain "new" samples are
already several months old, with their intervening history unknown. Degradative
influences such as the action of ozone on rubber samples cannot be compensated for, but
standard conditioning procedures are designed, as far as possible, to bring the test pieces
to an equilibrium state. The imposition of a standard thermal history before measuring the
density of a crystalline polymer is a good example.
In some cases, conditioning may involve temperature only, but where the material is
moisture sensitive it is likely that a standard atmosphere involving control of both
temperature and humidity will be called for. Occasionally, other methods of conditioning,
such as mechanical conditioning, are used, as will be discussed in a later chapter.
Even with careful conditioning, however, the results produced from specimens
manufactured by different methods may vary, and if there is to be a controlled comparison it is
important that the test pieces be prepared in exactly the same way. This is particularly
important for figures being presented in databases. For example, laboratory samples of a
rubber prepared on a mill may differ considerably from factory materials prepared in an
internal mixer, and often these differences are not sufficiently emphasized in tables of data.
Equally, test piece geometry is important, and again, if comparison is to be made, a
standard and specified geometry should be adhered to. Rarely is it possible to convert
from one geometry to another, since polymers are complex materials and the influence of
the various test parameters is often nonlinear. For example, it is difficult to scale up gas
transmission results obtained on thin sheets to thick sheets of the same material.
For these various reasons the simulation of service behavior is at best difficult and
often impossible. There are numerous examples of long-term tests over 20 or more years
that have shown that artificial ageing using heat or other means yields results that are
significantly different from those obtained with the passage of time. There has to be an
awareness of the limitations of any test procedure and an acknowledgement that the
results obtained apply only to the narrow range of conditions under which the test was
performed. For these reasons, with important and complex products such as tires, it is
often necessary to test them under the exact conditions under which they will be used.
Test procedures require careful attention to detail, as small and apparently innocent
deviations can produce significant changes in results. This implies that the test conditions
need to be accurately set initially and then monitored throughout the test. Sometimes it
arises that when testing according to a published standard some deviation from the set
procedure cannot be avoided (perhaps because of a limitation on the amount of material
available). In these cases such deviations should always be recorded. In any test report it is
important to state quite clearly which procedure has been followed.
have been done. Thought must be given to the design of the experiment in the very
beginning, and if help from a statistician is needed it should be brought in at that stage.
Because of the importance of applying statistical principles to test results, the subject is
comprehensively covered in Chapter 3.
Brief mention was made of the precision of tests as judged by interlaboratory trials,
and sometimes the quoted level of precision seems relatively poor. Usually the laboratories
taking part in these trials are experienced, and the precision levels quoted should be
representative of good practice. Poor figures may indicate snags with that method but,
whatever the quoted levels, there is no reason to suppose that it is "only the others" who
get divergent results. Interlaboratory comparisons sometimes lead to the elimination of
poor test procedures, and so bring about improved accuracy, as will be discussed later.
However, no measurement is exact, and there is always some uncertainty. Calibration
laboratories are required to make uncertainty estimates for all their measurements, and in
the future it may be that all accredited testing laboratories will also have to do so. This
involves estimating the uncertainty introduced by each factor in the measurement and is
not at all easy to do. At the very least, it is essential to be conscious of the order of
magnitude of the range within which the "true" result lies.
6 Sampling
Efficient sampling means selecting small quantities that are truly representative of a much
larger whole, and the significance of test results is closely related to the efficiency of
sampling.
Often, in the laboratory, one is limited by the amount of material available, and at
least there is then the excuse that the tests relate only to the material available at the time.
In a factory, where the whole output is available, the problem is a different one. Here the
quality control manager has to decide not only what is adequate, but also what is
reasonable, bearing in mind the production schedule and the profitability of the operation.
The frequency of sampling and the number of test pieces (or repeat tests) per item
sampled depend on circumstances, and obviously financial considerations play an
important part. Certain long-winded (and expensive) tests call for one test piece only, although if
multiple tests are done the method may be quite variable. The use of a single test piece is
hardly satisfactory, but it may be that multiple tests in numbers sufficient to increase
precision are totally uneconomic. This is the dilemma that quality control managers (and
the writers of specifications) have to face. In a continuous quality control scheme it may be
that the number of test pieces at each point is less important than the frequency of testing.
Where multiple test pieces are available, an odd number is advantageous if a median
is to be taken, and five seems to be the preferred number. This is just about large enough to
make a reasonable statistical assessment of variability. However, the current range of
standard methods is not consistent, and numbers between one and ten or more may be
called for.
The essence of efficient sampling is that the small quantity selected and tested (the
sample) be truly representative of the much larger whole. The test pieces should be
representative of the sample taken, the sample representative of the batch, and the batch
representative of the wider population of material. In many cases, this information is
not known to the tester, but there should be awareness of the limitations of the results
in this respect, and the best possible practice should be followed in selection of samples
and test pieces. This may include blending of several batches, randomizing the positions
from which test pieces are cut, and testing on test pieces cut in more than one direction.
Care should be taken when sampling from production that items be taken at random, and
that the time at which samples are taken does not always coincide with some factor such as
a shift change.
7 Quality Control
Quality control embraces the monitoring of incoming materials, the control of the
manufacturing processes, and checks of materials and products produced, so as to ensure and
maintain the quality of the output from the factory. Physical testing methods are
important in this regime, and most of the standardized test methods are intended for quality
control use; it is probable that the majority of tests carried out are undertaken in the first
place for quality assurance purposes. However, this book is about testing and is not a
quality control manual, so discussion here is restricted to the quality control of the testing
process.
Quality control is often thought of as applying only to products, since products affect the
lives of the entire population. However, those of us who work in laboratories must recognize
that correct and reproducible results are in a sense products, and that the application
of quality control to test laboratories is designed to improve the general reproducibility of
all test results.
Reliable results can only come from a laboratory where the apparatus, the procedures,
and the staff are all subject to a quality assurance system. ISO 9000 standards are applied
in a wide context to various companies, and their laboratories will be included under the
general umbrella of such a system, but a more focused scheme for test and calibration
laboratories may be found in ISO Guide 25 [3] and national equivalents. These standards
cover not only the calibration of equipment and the control of test pieces but also the
training of staff, an item tending to be overlooked in the general context of quality control.
The requirements listed set a high standard, and it has to be recognized that maintenance
of this standard is time-consuming and difficult. In the UK the accreditation of
laboratories is entrusted to the United Kingdom Accreditation Service (UKAS). Similar
organizations may be found in other countries, and some of these bodies have mutual
recognition agreements.
Undoubtedly the most expensive item in any system of laboratory control is the
calibration of equipment. All test equipment should be calibrated, and every parameter
relating to that machine requires formal calibration. For example, it is easy to see that the
force scale and speed of traverse of a tensile machine need calibrating, but it is less obvious
that the cutting dies for test pieces also need calibrating in order to ensure that the test
pieces conform to specification.
Calibration is based on the principle of traceability from a primary standard through
intermediate or transfer standards. A good example of a transfer standard would be boxes
of certified weights that are not in general use but whose sole purpose is to check the
accuracy of those that are in use.
Obviously, at each stage of measurement there is some degree of uncertainty, and
estimates of this uncertainty form part of the calibration procedure. It is perfectly
acceptable for a laboratory to carry out its own calibrations, provided it maintains
appropriate calibration standards and operates a suitable quality system. However, it is often
more convenient to buy in calibration services. Wherever possible the calibration
laboratory used should be accredited (UKAS or equivalent).
Calibration of apparatus in the polymer industry has to some degree been hampered
by the lack of definitive guidance, but a British standard has been developed covering the
Calibration of Rubber and Plastics Test Equipment [4]. This explains the principles of
calibration and gives details of the parameters to be calibrated and the frequency required,
together with an outline of the procedure to be used for all rubber test methods listed in
the ISO system.
The ASTM gave the lead in conducting systematic interlaboratory trials, and this has
been followed by the ISO and others. The variability obtained was far greater than was
expected, and in some cases it was so bad that it was doubtful whether certain tests were
worth doing at all. These interlaboratory comparisons and the drive towards improved
quality led to an abandonment of the complacent attitude that had formerly existed and
stimulated various initiatives to improve the situation.
On the whole, variability arises from malpractice rather than from a poorly expressed
standard, but if an interlaboratory trial reveals an excessive variability it is first necessary
to pinpoint the problem before a standards committee can correct it. Unfortunately this is
a slow and expensive procedure.
The demand for higher quality has produced pressures to make laboratory
accreditation commonplace, and as more laboratories reach this status it must be expected that
reproducibility will improve. The calibration of test machines, training, documentation of
test procedures, sample control, and formal audits all have an enormous influence, and the
discipline involved in maintaining an accredited status helps to minimize mistakes and
maintain reproducibility. International agreements undoubtedly widen the scope of
accreditation schemes and ensure uniform levels of accreditation. This is found to have
an influence on the standard of laboratories with a consequent improvement in
interlaboratory comparisons.
The essential requirements of any piece of test equipment are that it should satisfy the
requirements laid down in the standard relating to the test method under consideration
and that it should be properly calibrated. Convenience of use or the cost of running the
tests are not items that can be specified, but nevertheless they play a dominant role in the
selection of equipment. Increasingly computer control and data handling are becoming
standard.
The manipulation of data by computer is a particularly difficult operation to monitor,
since in a busy laboratory it is only too easy to accept the software as correct in all
circumstances. Obviously the accuracy of the quoted results depends not only on the
accuracy of the original measurements but also on the validity of the data handling.
Some standards bodies are now developing specifications giving rules and guidance on
software verification.
These changes in the basic concepts of laboratory testing bring with them both
advantages and disadvantages. While it is obvious that automation brings with it a saving
in staff time, perhaps enabling measurements to continue with the apparatus attended only
periodically rather than continuously, it is not clear what the effect on accuracy or
reproducibility will be. Noncontact extensometers, for example, ensure that there are
no unwanted stresses on the test piece, but the accuracy is related to the parameters
built into the extensometer (e.g., the response time in following the recorded signal). It
is important not to assume that more complex equipment necessarily means improved
accuracy, although it is frequently true. A simple example may illustrate the difficulty.
Increasingly doctors are using electronic sphygmomanometers to measure blood pressure.
Here the end points still rely on the doctors' skill in detecting a pulse, but rather than
reading the height of a mercury column, a pressure transducer gives a direct digital
readout. This gives a degree of confidence that is absent from a mercury manometer, but the
defects in the system are hidden. There is an obvious need for calibration, which may go
unrecognized in a busy surgery, but also the question of linearity of response is crucial. It
is easy to look critically at the equipment used in a different discipline, but the same
principles should apply in our own laboratories.
References
1. Brown, R. P., Physical Testing of Rubber, Chapman and Hall, London, 1996.
2. BS 903, Part 2, Guide to the application of statistics to rubber testing, 1997.
3. ISO Guide 25, General requirements for the competence of calibration and testing laboratories,
1990.
4. BS 7825, Parts 1-3, Calibration of rubber and plastics test equipment, 1995.
3
Quality Assurance of Physical Testing
Measurements
Alan Veith
Technical Development Associates, Akron, Ohio
1 Introduction
Measurement and testing play a key role in the current technologically oriented world.
Decisions for scientific, technical, commercial, environmental, medical, and health
purposes based on testing are everyday occurrences. The intrinsic value and fidelity of any
decision depends on the quality of the measurements used for the decision process. Quality
may be defined in terms of the uncertainty in the measured values for a specified test
system; high quality corresponds to a small or low uncertainty. Quality is contingent
upon whether the operational system is simple or complex. The equipment, the procedure,
the operators, the environment, the decision process itself, and the importance of these
decisions are all part of the system. A lower quality can be tolerated for less important
routine technical decisions than for decisions that have large commercial or financial
implications. Measurement and testing for fundamental and applied research and
development and also for producer-user transactions are important elements that are part of a
larger organized effort that is frequently called a technical project. Measurement and
testing play a key role in all technical projects, and the assurance that the output data
from any technical project are of the highest quality, consistent with the stipulated goals
and objectives, is of paramount importance in technical project organization.
Quality for a test system has two major components: (1) how well the measured
parameters relate to the properties that are involved in the decision process and (2) the
magnitude of the uncertainty of the measured parameter value or values; the higher the
uncertainty the lower the quality. The first component is usually more complex, since it
involves scientific expertise and some subjective judgments. If the measured parameters are
not highly related to the decision process properties, a fundamental scientific uncertainty
exists. The second component, the measurement uncertainty, is somewhat easier to address;
[Flowchart: Plan → Project Model → Response Model → Sampling Protocol → Measurement
System Calibration → Measurement and Data Report → Precision Model → Analysis and
Evaluation → Solution]
procedure and protocol, (3) setting up a defined capability measurement system with an
appropriate calibration procedure, (4) performing the measurements and reporting the
data, followed by (5) analysis and evaluation, which may make use of a test result
uncertainty model to arrive at a solution. All of these must be of the highest quality for the
successful execution of a complex project, especially if the project involves interlaboratory
testing.
attention are overall project organization, goals, and objectives, resources and constraints,
performance criteria, the selection of the measurement methodology, and the selection of
decision procedures. A set of well designed and coordinated standard operating
procedures (SOP) must be selected and put into place. All the elements of the project required
for a successful solution need to be specified.
A model may be defined as the simplified representation of a defined physical system.
The representation is developed in symbolic form and is frequently expressed in
mathematical equations and uses physical and/or chemical principles based on scientific
knowledge, experimental judgment, and intuition, all set on a logical foundation. A model may
be theoretical or empirical, but the formulation of an accurate model is a requirement for
the successful solution of any problem.
Planning Model
A planning model is a generalized statement of the strategy to be used in the solution of a
problem; it involves the selection of a coordinated set of SOPs and an assembly of the
required system elements to arrive at a solution. Planning models are more descriptive in
nature and are not as rigorous as response and analysis models.
Response Model
A response or analytical model is a mathematical formulation that describes one or more
complex measurement operations that are part of a particular project. Once the test
methodology has been selected, the response model should be constructed based on the
performance parameters of the system and the independent variables that influence these
performance parameters. This usually involves three steps: (1) formulation of the model,
(2) calibration of the model, and (3) validation of the model. There are three important
actions in the formulation: (i) identify the most important variables or factors that
influence a selected response, (ii) develop a proper symbolic or mathematical representation for
these variables, and (iii) identify the degree of interaction among the variables. Variables
may be selected on the basis of theoretical principles or on an empirical approach using
correlation and regression or principal components analysis techniques. Systematic
experimental designs may be employed to formulate an empirical model; see Section 6.6.
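As a minimal sketch of the empirical route, assuming illustrative data rather than values from any real project, a univariate response model can be formulated by least-squares regression:

```python
# Hedged sketch: formulating a simple empirical (univariate) response
# model by least-squares regression. All data values are assumed.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # independent variable levels
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])      # measured responses

slope, intercept = np.polyfit(x, y, 1)  # calibrate the linear form y = slope*x + intercept
y_hat = slope * x + intercept           # model predictions at the tested levels
residuals = y - y_hat                   # inspect residuals to validate the model
print(slope, intercept, residuals)
```

The fit, prediction, and residual steps correspond loosely to the formulation, calibration, and validation actions described above.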
The specified number and character of performance parameters and variables, i.e., the
operational conditions, is defined as a testing domain. Simple domains for any project may
require only a univariate model, while complex project measurement systems may require
multivariate models. Some projects may have multiple response parameters, each of which
may require a multivariate (independent) variable model. The calibration and validation
operations are discussed below in Section 2.3.
Test Uncertainty Model
This is a model that may be used to relate the variation in the measured parameter(s) to
sources of variation that are inherent in any testing operation. This is discussed in more
detail in Section 8 on precision, bias, and uncertainty in laboratory testing.
projects will involve some form of iteration involving successive measurement operations
to arrive at a satisfactory solution.
Random deviations: these are + and − differences about some central value that may be a
true or a reference value; each execution of the test gives a specific difference, and the
mean value of these differences is zero for a long-run series of repetitive measurements.
Bias (or systematic) deviations: these are offsets or constant differences from the true
value; these offsets may be + or − and are frequently unique for any particular set
of test conditions.
Random and bias deviations are usually additive. There are four concepts that are applied
to measurement variation: precision, bias, accuracy, and uncertainty. Frequently these are
incorrectly used in an interchangeable manner. One of the main purposes of this chapter is
to distinguish between these and show how each may be used correctly. Precision is
defined in terms of the degree of agreement for repeated measurements under specified
conditions and is customarily assumed to be caused by one or more random processes.
High precision implies close or good agreement. Bias has not been addressed and
investigated to the extent that has been devoted to precision. But as Section 8 of this chapter
will show, bias is very important in testing and needs more attention to improve test
quality. Accuracy is a concept that involves both precision and bias deviations. High
accuracy implies that the sum of both types of deviation is low, or in an ideal situation,
zero. The term uncertainty is of more recent origin and may be used in two different
contexts as previously described. The relationship among these variation concepts is
discussed in Section 8; in Annex B, which gives a statistical model for testing; and in Annex C
on the evaluation of bias.
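A minimal sketch consistent with this additive description (the full statistical model for testing is the subject of Annex B) is

$$x = \mu_0 + B + \varepsilon, \qquad E[\varepsilon] = 0$$

where μ₀ is the true or reference value, B is the bias offset for the particular set of test conditions, and ε is the random deviation whose long-run mean is zero.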
This section gives some of the more elementary statistical parameters that may be used
to characterize and analyze data and discover the underlying relationships among
variables that may be hidden by the overlaid variation or noise. Although everything discussed
in this section is available in standard texts, a review of the more elementary statistical
principles is presented to address the basic problems of measurement quality. This is given
All three of these perturbations will be discussed in succeeding sections of this chapter.
The mean (frequently known as the arithmetic average), variance, and standard deviation
can be realized in two ways: (1) as a true parameter value based on extensive measurement
or other knowledge of the entire population, in which case these parameters are designated
by the symbols μ, σ², and σ respectively, or (2) as estimates of the true values based on
samples from the population, in which case they are designated by the symbols x̄, S², and S
respectively. The equations to calculate the mean, variance, and standard deviation as
estimates from a sample are
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \qquad (6)$$

$$S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \qquad (7)$$

$$S = \left[\frac{\sum (x_i - \bar{x})^2}{n - 1}\right]^{1/2} \qquad (8)$$
where $x_i$ = any data value, n = the total number of sample data values, and Σ indicates
summation over all values. The degrees of freedom df for Eqs. 7 and 8 is (n − 1). The
degrees of freedom is the number of independent differences available for estimating the
variance or standard deviation from a set of data; it is one less than the total number of
data values n, since one degree of freedom (of the total degrees equal to n) is used to
estimate the mean.
The majority of statistical calculations and resulting decisions are based on sample
estimates for x̄, S², and S. In certain circumstances the true values are known and slightly
different procedures are used for statistical decisions. The standard deviation gives an
estimate of the dispersion about the central value or mean in measurement units. For a
normal distribution the interval of ±σ about the mean μ will contain 68.3 percent of all
values in the population; the ±2σ interval contains 95.5 percent, and the ±3σ interval
contains 99.7 percent. A relative or unit-free indicator of the dispersion is the coefficient of
variation:

$$\text{Coefficient of variation} = CV = \frac{S}{\bar{x}} \qquad (9)$$
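As a runnable sketch of Eqs. 6 through 9 (the data values are assumed purely for illustration):

```python
# Minimal sketch of Eqs. 6-9: sample mean, variance, standard
# deviation, and coefficient of variation. Data values are assumed.
data = [49.8, 50.2, 50.1, 49.6, 50.3]

n = len(data)
mean = sum(data) / n                                # Eq. 6
var = sum((x - mean) ** 2 for x in data) / (n - 1)  # Eq. 7, df = n - 1
std = var ** 0.5                                    # Eq. 8
cv = std / mean                                     # Eq. 9, unit-free
print(mean, var, std, cv)
```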
material. The pooling of these individual estimates will give a good overall testing variance
estimate. A procedure is given in the Statistical Analysis section to determine if the
variances obtained on a number of test materials can be considered as equal.
A second approach is the accumulation of individual variance estimates, each with
only a few degrees of freedom, on one or more reference materials or objects over a period
of time. The reference material should have the same measured property magnitude and
the same general test response as the experimental materials if the estimated variance is to
be used for decisions on the experimental materials. Both approaches may be used for a
more comprehensive effort.
The process of pooling or averaging the individual variance estimates (each with only
a few degrees of freedom) is equivalent to a weighted average calculation. Thus the pooled
variance is obtained from the sum of each individual variance estimate S²(i) multiplied by
the number of values in its replicate set $n_i$, divided by the sum of the individual numbers of
values in the replicate sets. The number of degrees of freedom df attached to the pooled
variance is equal to the sum of the individual df of each replicate data set.
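Written out with each estimate weighted by its degrees of freedom (the conventional form, which coincides with the description above when all replicate sets are the same size):

$$S^2_{\text{pooled}} = \frac{\sum_i (n_i - 1)\, S^2(i)}{\sum_i (n_i - 1)}, \qquad df = \sum_i (n_i - 1)$$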
n     Factor
2     0.886
3     0.591
4     0.486
5     0.430
6     0.395
7     0.370
8     0.351
9     0.337
10    0.325
in more detail below. Thus the probability that x will take on a value less than or equal to
a is given by finding the probability P in a standardized normal distribution table for a
value of Z = (a − μ)/σ. Annex A Table A1 lists the values of Z and associated "areas" for
each value of x expressed in terms of a difference from μ in σ units, where the area is equal
to the probability that Z will take on a value less than or equal to the specified Z. These
areas or probabilities are left-hand oriented, i.e., they begin at −∞. From the table, the
probability that a Z value is as low as −2.00 is 0.0228; only two chances out of a hundred.
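The left-hand oriented areas of Table A1 are values of the standard normal cumulative distribution function, which written in full is

$$P(Z \le z) = \Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^{2}/2}\, dt, \qquad \Phi(-2.00) = 0.0228$$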
A frequent use of Table A1 is the selection of a value of z for a specified probability
P = α to make certain statistical inferences. The $z_\alpha$ notation is used for this purpose,
and two separate examples of its use are as follows:

$$P(Z < -z_\alpha) = \alpha \qquad (14)$$

$$P(Z > z_\alpha) = 1 - \alpha \qquad (15)$$

The first of these expressions, Eq. 14, states that the probability that Z will fall in the
region or area from −∞ to $-z_\alpha$ is α. In the second application in the use of $z_\alpha$, the
probability that Z is equal to or greater than $z_\alpha$ is obtained by difference. First, the
probability that Z will fall in the range −∞ to +∞ is equal to the entire area
under the curve, or 1. The probability that Z will be equal to or less than $z_\alpha$ is equal to the
area from −∞ to $z_\alpha$. The area to the right of $z_\alpha$ is the difference between these two areas, or
1 − α.
Another application of the use of specific tabulated z values is in forming an interval.
Intervals are discussed in more detail in Section 3.6 below. If α is divided into two equal
regions at either extreme end of the distribution, then

$$P(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1 - \alpha \qquad (16)$$

which states that the probability that Z will be found in the region from $-z_{\alpha/2}$ to $z_{\alpha/2}$ is
also 1 − α, since the original area has been cut in half and $-z_{\alpha/2}$ and $z_{\alpha/2}$ are defined as the
two points on the z axis that cut off areas of α/2 at each end. Equations 15 and 16 both
have probabilities equal to 1 − α, but the z values are different in the two situations. Each
of the two areas is equal to α/2 at either end of the distribution. Equation 13 can be used
for the mean of n sample values $\bar{x}_{(n)}$ from a population, where x in the Z calculation
expression is replaced by $\bar{x}_{(n)}$ and σ is replaced by $\sigma/\sqrt{n}$.
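With that substitution, the standardized variable for a sample mean becomes

$$Z = \frac{\bar{x}_{(n)} - \mu}{\sigma/\sqrt{n}}$$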
The Z-distribution and Eq. 13 are applicable when both the population mean and the
variance are known. When the variance must be estimated from a sample, the Z-distribution
and the proportion of sampling values that fall within the − and + limits as given
above no longer apply. In such circumstances a distribution called Student's t-distribution
is used, and t is a random variable defined by Eq. 17:

$$t = \frac{\bar{x}_{(n)} - \mu}{S/\sqrt{n}} \qquad (17)$$

which applies to problems where the population mean is known or where a selected x value
is to be compared to $\bar{x}_{(n)}$ for a decision on a significant difference.
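A brief sketch (assuming scipy is available) of how the t critical values exceed the z value for small samples and approach it as the degrees of freedom grow:

```python
# Hedged sketch: two-sided 95% critical values of Student's t versus
# the normal z value, showing the penalty for estimating the variance.
from scipy.stats import norm, t

z_crit = norm.ppf(0.975)  # two-sided 95% z value, about 1.96
for n in (3, 5, 10, 30):
    df = n - 1  # one degree of freedom is used to estimate the mean
    print(f"n={n:2d}  t={t.ppf(0.975, df):.3f}  z={z_crit:.3f}")
```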
or significance level, the null hypothesis is rejected. The general expression for the probability
that a random normal variable x will take on a value between a and b is

$$P(a < x < b) = CA(b) - CA(a) \qquad (19)$$

where

P(a < x < b) = the probability that x will fall between a and b
CA(a) = the cumulative area under the standard normal z-distribution curve up to the value a
CA(b) = the cumulative area under the standard normal z-distribution curve up to the value b (both areas from −∞)

In this example, the z value of 2.3 is equivalent to a, and b is taken as +∞. Table A1 reveals
that CA(2.3) = 0.9893. This is the probability of finding a z value in the range −∞ to
2.3. The cumulative area or probability CA(b) of finding a z value less than +∞ is of
course exactly 1.0. Thus

$$P(a < x < b) = 1.000 - 0.9893 = 0.0107 \approx 0.011 \qquad (20)$$

The calculated probability of 0.011 is substantially less than 0.05, and the null hypothesis
is rejected and the alternative hypothesis accepted. Hypothesis testing may be applied to
any statistical parameter (t-distribution, F-distribution, etc.) for which a sampling distribution
may be calculated or otherwise evaluated.
An alternative method for reporting the results of significance calculations that has
gained acceptance is to use the calculated probability as an indicator of the weight of
evidence or the strength of an assertion about the parameter of interest. Instead of adopting
a critical P = α, with α = 0.05 or other, and making a yes or no decision to reject the
tentative null hypothesis, the calculations are made for P(calc), and this is used to indicate
the decisiveness of the act of potential rejection of the null hypothesis. For this procedure,
P is defined as the probability of committing a Type I error if the actual sample (measured)
value of the statistic is used as the rejection value. It is the smallest level of significance to
reject the null hypothesis based on the sample at hand and is usually called the attained or
empirical significance level. Many statistical software programs give these calculated P
values as output.
Confidence Intervals
The calculated values from any sample are considered point estimates. Any such estimate may be close to the true value of the population parameter (μ, σ, or other), or it may vary substantially from the true value. An indication of the interval around this point estimate, within which the true value is expected to fall with some stated probability, is called a confidence interval, and the lower and upper boundary values are called the confidence limits. The probability used to set the interval is called the level of confidence. This level is given by (1 - α), where α is the probability, as discussed above, of rejecting a null hypothesis when it is true. In most circumstances means are the most important point estimates, and confidence intervals for means are evaluated at some probability P = (1 - α) that the true population mean is within the stated confidence limits. This can be expressed for a population with a known standard deviation σ as given in Eq. 21:

P[x̄(n) - z_{α/2}(σ/√n) < μ < x̄(n) + z_{α/2}(σ/√n)] = (1 - α)   (21)
where
x̄(n) = a mean evaluated from a sample of n values
z_{α/2} = the z value at P = α/2
σ/√n = the standard error (standard deviation of means of n)
μ = true mean of the population

This equation states that the confidence interval is defined as the difference between the two limits about the point estimate of the mean x̄(n), i.e., from the lower limit x̄(n) - z_{α/2}σ/√n to the upper limit x̄(n) + z_{α/2}σ/√n. This difference is designated as the (1 - α)100% confidence interval. The true mean μ is a fixed number; it has no distribution, and it is either in the interval or it is not. The interpretation of the value (1 - α)100% is as follows: if the experiment of drawing a sample of n values from this population is repeated a large number of times, then in the long run (1 - α)100% of the intervals so constructed will contain the true value μ.
Confidence intervals may alternatively be formulated in terms of a factor k_con selected so that the calculated interval covers the mean μ a certain percent (proportion) of the time:

Con Interval = ±k_con σ/√n   (22)

where Con Interval is the confidence interval at a selected P for means of samples of size n. As an example, suppose that a sample of n = 4 is drawn from a population with σ = 1.5 and the estimated mean is 8.9. The 95% confidence interval (α = 0.05), where 1.96 is the z value at α/2 = 0.025 and 1.5/2 = 0.75 is the standard error of means of 4, is given by

95% Con Interval = ±1.96(1.5/2) = ±1.96(0.75) = ±1.47

or

95% Con Interval = [8.9 - 1.47] to [8.9 + 1.47] = 7.43 to 10.37, a width of 2.94   (23)
For a situation where the standard deviation is not known but must be estimated from a sample in the same manner as the mean, the t-distribution applies, and the probability expression is

P[x̄(n) - t_{α/2}S/√n < μ < x̄(n) + t_{α/2}S/√n] = (1 - α)   (24)

The lower limit is x̄(n) - t_{α/2}S/√n and the upper limit is x̄(n) + t_{α/2}S/√n. As an example, if S = 1.8 as calculated from the sample of 4 (df = 3), x̄(n) = 8.9, and (1 - α)100 = 95% or α = 0.05, the confidence interval is found using the tabulated t value of 3.18, which is found at P = 0.025 for df = 3:

95% Con Interval = [8.9 - 3.18(1.8/2)] to [8.9 + 3.18(1.8/2)] = 6.04 to 11.76, a width of 5.72, or ±2.86

This confidence interval is almost twice that of the previous example because the standard deviation, as well as the estimated mean, is known only to df = 3.
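Both worked intervals above are easy to reproduce; the minimal sketch below uses the same numbers (n = 4, mean 8.9, with σ = 1.5 known or S = 1.8 estimated).

    import numpy as np
    from scipy.stats import norm, t

    n, mean = 4, 8.9

    # Known sigma = 1.5: z-based interval of Eqs. 21-23.
    sigma = 1.5
    half_z = norm.ppf(0.975) * sigma / np.sqrt(n)     # 1.96 x 0.75 = 1.47
    print(f"z interval: {mean - half_z:.2f} to {mean + half_z:.2f}")

    # Sigma estimated as S = 1.8 with df = 3: t-based interval of Eq. 24.
    S = 1.8
    half_t = t.ppf(0.975, df=n - 1) * S / np.sqrt(n)  # 3.18 x 0.9 = 2.86
    print(f"t interval: {mean - half_t:.2f} to {mean + half_t:.2f}")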
Tolerance Intervals
The word tolerance is used in a number of ways in testing and measurement technology.
Engineering and design tolerances are usually designated as upper and lower limits on
certain dimensions or other numerical factors for an object or product. Tolerances can
also apply to the number of significant figures or digits to retain in a measurement. A third
type of tolerance is concerned with the percentage of population values falling within some
specified limits, and this type is considered here. In explaining this kind of tolerance it is
important to distinguish it from confidence intervals or limits.
Confidence intervals provide a value for the region of uncertainty about an estimated population parameter (a point value), usually a mean, with a certain degree of confidence. Frequently it is desirable to obtain an interval that will cover a fixed proportion or percentage of the population with a specified confidence. Such intervals are called tolerance intervals, and the two endpoints are called tolerance limits. Tolerance intervals can also be formulated in terms of a factor, k_tol, and the estimated standard deviation S:

Tol Interval = ±k_tol(S)   (25)

where k_tol is selected so that the interval will encompass a proportion p of the population with a stated confidence. As an example, if x̄(n) = 14.0, S = 1.5, and n = 15, the tolerance interval that will contain 99 percent of the population 95 percent of the time is given by referring to Table A3, where the confidence level is given by γ. Thus for p = 0.99, γ = 0.95, and n = 15, the tabulated value for k_tol is 3.88, and

Tol Interval = ±3.88(1.5) = ±5.82

or

x̄(n) = 14.0 ± 5.82 = 8.18 to 19.82   (26)
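Tabulated k_tol factors can also be approximated in software. The sketch below uses Howe's approximation for the two-sided tolerance factor, an assumption on our part (the chapter takes k_tol directly from Table A3), but it reproduces the tabulated 3.88 for this example.

    import numpy as np
    from scipy.stats import norm, chi2

    def k_tol(n, p=0.99, gamma=0.95):
        # Howe's approximation to the two-sided tolerance factor.
        z = norm.ppf((1 + p) / 2)             # covers a proportion p
        c2 = chi2.ppf(1 - gamma, n - 1)       # lower chi-square quantile
        return np.sqrt((n - 1) * (1 + 1 / n) * z**2 / c2)

    xbar, S, n = 14.0, 1.5, 15
    k = k_tol(n)                              # ~3.88, as in Table A3
    print(f"k_tol = {k:.2f}; interval = {xbar - k*S:.2f} to {xbar + k*S:.2f}")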
The distinction between confidence intervals and tolerance intervals is that confidence
intervals refer to estimates of the population statistics (usually the mean) while tolerance
intervals are concerned with proportions or fractiles of the population. Thus the term
tolerance as used here should be distinguished from the frequently used tolerance in
engineering design for dimensions and other factors in the construction or manufacture
of some object or structure.
The variance for x₁, x₂, etc. refers to individual measurement values in any population. With simple linear relationships for the function, the differentials become constants, and the equation for S_Y² takes the form

S_Y² = a₁²S_{x1}² + a₂²S_{x2}² + ···   (29)

For the simplest linear form for two variables, a sum or difference relationship is given by

Y = x₁ + x₂  or  Y = x₁ - x₂   (30)

and the value for S_Y² is

S_Y² = S_{x1}² + S_{x2}²   (31)

since the differentials are unity. Thus the act of adding or subtracting two measured values, each having a variance associated with its measurement, substantially increases the variance of the sum or difference. If both x₁ and x₂ have the same variance, the variance of the sum or difference is twice the individual variance.

With any functional form beyond a sum or difference, the variance of Y is influenced by the values of the differentials. For a ratio or quotient,

Y = x₁/x₂   (32)

the variance of Y is given by Eq. 33, and the evaluation of S_Y² has to be made at some selected values for x₁ and x₂:

S_Y² = (1/x₂²)S_{x1}² + (x₁²/x₂⁴)S_{x2}²   (33)
When mean values are used for the x variables, the variances of the means of the x variables, designated as S_{x̄i}², should be used, according to Eq. 34, where n is the number of values used for the mean of xᵢ:

S_{x̄i}² = S_{xi}²/n   (34)

Under these conditions, wherever Y appears in the variance expression, a mean value for Y is to be used.
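The propagation rules of Eqs. 31 and 33 can be verified numerically. In the sketch below the measured values and standard deviations are illustrative assumptions, and a Monte Carlo simulation serves as a check on the first-order formulas.

    import numpy as np

    x1, x2 = 10.0, 5.0        # measured values (illustrative)
    s1, s2 = 0.3, 0.2         # their standard deviations (illustrative)

    var_sum = s1**2 + s2**2                                    # Eq. 31
    var_ratio = (1 / x2**2) * s1**2 + (x1**2 / x2**4) * s2**2  # Eq. 33

    rng = np.random.default_rng(0)
    a = rng.normal(x1, s1, 200_000)
    b = rng.normal(x2, s2, 200_000)
    print(var_sum, np.var(a + b))      # both ~0.13
    print(var_ratio, np.var(a / b))    # both ~0.010 (first-order approx.)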
The importance of the first requirement is obvious: unstable systems are not acceptable for testing. The second requirement is the basis for conducting statistical tests where independence and randomness are assumed for probabilistic conclusions. When the first two of these specifications are met, the measurement system is in a state of statistical control. The third specification relates to how well the sampling is done.
The second set of conditions is related to this sampling operation and the test speci-
mens derived from the samples. The sampling procedure should
Be conducted on a stable (nonchanging) population
Produce individual samples appropriately selected and independent of each other
The phrase "appropriately selected" refers to the different types of sampling that can be
conducted; this topic is discussed in more detail in Section 5 on sampling principles. The
importance of these sampling requirements is self evident; both are necessary for proper
statistical analysis. When these sampling conditions are met, the sampling system is said to
be in a stable condition or in statistical control.
The attainment of these required characteristics is often not straightforward.
Conformance with the requirements is usually obtained by a twofold process. First, sub-
stantial experience with the system and attention to important technical details is required.
Second, certain statistical diagnostic tests may be used, the most important being control
charts, which are defined in Section 7, using standard or reference materials subjected to
the same testing protocol as the experimental materials. Independence of individual measurements can be compromised if there is any carryover effect or correlation between one test measurement (sample or specimen) and the next. Calibration operations, to be discussed later, can also be a source of problems in test measurement independence if they are not conducted in an organized or standardized manner.
Sensitivity
This is related to the ability to detect small differences in the measured property and/or the fundamental inherent property. Sensitivity has been defined in quantitative terms for physical property measurements by Mandel and Stiehler [2] as

Sensitivity = K/s(m)   (35)

where
K = the slope of the relationship between the measured parameter m and the inherent property of interest Q, where Q = f(m)
s(m) = the standard deviation of the measurement m
Sensitivity is high when the precision is good, i.e., s(m) is small, and when K is large. An example will clarify the factor K. The percentage of bound styrene in a butadiene-styrene polymer may be evaluated by a fairly rapid test, the refractive index. A curve of refractive index vs. bound styrene, with the styrene measured by an independent but more complex reference test, establishes an empirical relationship between the styrene content and the refractive index. Over some selected bound styrene range, the curve has a slope K, and this value divided by the precision standard deviation s(m) gives the sensitivity in this range. For polymers of this type the refractive index sensitivity may be compared to the sensitivity of alternative quick methods, such as density, by evaluating K and s(m) for each technique.
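A minimal sketch of such a comparison follows; the data, slopes, and repeatability standard deviations are illustrative assumptions, with K estimated as the slope of the measured parameter against the reference bound styrene values.

    import numpy as np

    bound_styrene = np.array([20.0, 22.5, 25.0, 27.5, 30.0])  # reference test, %

    # Assumed responses of two quick methods at those styrene levels.
    refractive_index = np.array([1.5345, 1.5370, 1.5395, 1.5420, 1.5445])
    density = np.array([0.9330, 0.9345, 0.9360, 0.9375, 0.9390])

    def sensitivity(m, q, s_m):
        K = np.polyfit(q, m, 1)[0]     # slope of m vs. inherent property Q
        return abs(K) / s_m            # Eq. 35

    print(sensitivity(refractive_index, bound_styrene, s_m=0.0002))
    print(sensitivity(density, bound_styrene, s_m=0.0004))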
Useful Range
This is the range over which there is an appropriate instrument response to the property being measured. Appropriate response is expressed in terms of two categories: (1) the presence of a linear relationship between instrument output and the level of the measured property, and (2) precision, bias (uncertainty), and sensitivity at an acceptable level.
Ruggedness Testing
Frequently there is the need to determine if a test is reasonably immune to the perturbing
effects of variation in the operating conditions such as ambient temperature, humidity,
electrical line voltage, specimen loading procedures, and other ordinary operator influ-
ences. A procedure called ruggedness testing is conducted by making systematic changes in
the factors that might influence the test and observing the outcome. Such testing is fre-
quently conducted as a new test is being developed and fine tuned for special purposes or
routine use. It can also be used to evaluate suspected operational factors for standardized
methods if environmental or other factors for conducting the test have changed.
A series of systematic changes are made on the basis of fractional factorial statistical
designs, which are discussed in more detail in Section 6. The early work was done by
Plackett and Burman [3], Youden [4], and Youden and Steiner [5]. These designs are quite
efficient. The most popular design evaluates the first-order or main effect of seven factors
in eight tests or test runs. One important caveat in using these designs is that the second-order and all higher-order interactions of the seven factors are confounded with the main effects. See Section 6 for additional discussion of interaction and confounding. If there are any large interactions of this type, they will perturb the main effect estimates. However, experience has shown that in the measurement of typical physical properties under laboratory conditions, first-order or main effects are usually much larger than interactions, so the use of these fractional designs has been found to be appropriate for ruggedness testing.
The Plackett-Burman statistical design for seven factors A, B, C, D, E, F, and G that might influence the test outcome is given in Table 3, where -1 indicates the low level (value) of any factor, +1 indicates the high level, and Y_i is the measured value or test result for any of the eight runs or combinations of factor levels. This design assumes that the potential influence of any factor on test response is linear. As indicated by the table, the design calls for the systematic variation of all seven factors across the eight test runs in a way that provides an orthogonal evaluation of the effect of each factor.
The design is evaluated by a procedure that sums the eight Y_i values in a specified way and expresses the results of the summing operation as the effect of each factor. Thus the effect of factor A, designated as E(A), is given by Eq. 36 as the difference of two sums divided by N/2. The first sum is the total of the products obtained by multiplying each value of Y_i by +1 for those rows (runs) that contain a +1 in column A, i.e., rows 1, 4, 6, and 7. The second sum of products is obtained in the same sense for all rows of column A that contain a -1, i.e., rows 2, 3, 5, and 8. An expression analogous to that for factor A may be used for all the other factors.
E(A) = [ΣY_i A(+1) - ΣY_i A(-1)]/(N/2)   (36)

where
ΣY_i A(+1) = sum of Y_i values for all runs (rows) that have +1 for factor A
ΣY_i A(-1) = sum of Y_i values for all runs (rows) that have -1 for factor A
N = total number of runs in the design (= 8); all sums are algebraic
The significance of the effects is evaluated on the basis of either (1) a separate estimate of
the standard deviation of measurements of the same type (materials, conditions) as con-
ducted for the Plackett-Burman design or (2) repetition of the design a second time to
provide for two estimates of each factor effect.
If S is the separate estimate of the standard deviation of individual Y_i measurements, then the standard deviation of the mean of four such measurements is S/2.

Table 3  Plackett-Burman Design for Seven Factors in Eight Runs

Run    A    B    C    D    E    F    G    Result
1      1    1    1   -1    1   -1   -1    Y1
2     -1    1    1    1   -1    1   -1    Y2
3     -1   -1    1    1    1   -1    1    Y3
4      1   -1   -1    1    1    1   -1    Y4
5     -1    1   -1   -1    1    1    1    Y5
6      1   -1    1   -1   -1    1    1    Y6
7      1    1   -1    1   -1   -1    1    Y7
8     -1   -1   -1   -1   -1   -1   -1    Y8
If no real factor effect exists, then the calculated effect E(i), which is a difference between two means of four values each, has an expected value of zero and a standard deviation of √2(S/2) = S/√2. If E(i) is significant (a real effect), it should exceed zero by an amount greater than two standard deviations based on means of four, i.e., greater than 2(S/√2) in absolute value, provided that S is known with a certainty of at least 18 to 20 df. The use of 2(S/√2) as the interval to indicate significance is based on a P = 0.05 or 95% confidence level. If S is not based on at least 18 df, then the value of Student's t (two-sided test) at P = 0.05 should be substituted for 2 in this interval expression, using the appropriate df.
If S is to be estimated from the ruggedness testing itself, the eight runs are repeated to produce two sets of estimates for each factor, where E(i) = the replication 1 value and E′(i) = the second replication value. Since the standard deviation of E(i) or E′(i) is S/√2, each value of the difference [E(i) - E′(i)], or d, for each of the seven factors is an estimate of √2(S/√2) = S. Hence an estimate of S based on 7 df is

S = {[d(A)² + d(B)² + ··· + d(G)²]/7}^{1/2}   (37)

where
d(A)² = [E(i) - E′(i)]² for factor A
d(B)² = [E(i) - E′(i)]² for factor B
and so on for all factors. With two replications of the eight runs now available for estimating the influence of the seven factors, the mean effect is E(i)m, or [E(i) + E′(i)]/2, for each factor. For the value of any E(i)m to be significant, it must exceed 2.37[√2 S/√8] = 1.18S, where S is given by Eq. 37 and 2.37 is the t value of the Student's distribution at 7 df and P = 0.05 (95% confidence). Factors that are found to be significant in any test need to be investigated and the test procedure or protocol revised to reduce the sensitivity to those factors.
For certain test methods, especially those that are more fully developed, only a few
factors may require an evaluation, perhaps 3 or 4. The Plackett-Burman seven factor
design may still be used with the remaining factors, say E, F, and G, being dummy factors,
i.e., factors that have no influence on the outcome. All eight test runs must be completed
for any design independent of the actual number of factors evaluated. Alternatively the
fractional factorial designs for 3 or 4 factors as given in Section 6 may be used.
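The effect calculation of Eq. 36 reduces to a signed sum over the columns of the design matrix. The sketch below uses the Table 3 matrix as reconstructed above; the eight response values are illustrative assumptions.

    import numpy as np

    # Plackett-Burman design matrix of Table 3 (rows = runs, cols = A..G).
    X = np.array([
        [ 1,  1,  1, -1,  1, -1, -1],
        [-1,  1,  1,  1, -1,  1, -1],
        [-1, -1,  1,  1,  1, -1,  1],
        [ 1, -1, -1,  1,  1,  1, -1],
        [-1,  1, -1, -1,  1,  1,  1],
        [ 1, -1,  1, -1, -1,  1,  1],
        [ 1,  1, -1,  1, -1, -1,  1],
        [-1, -1, -1, -1, -1, -1, -1],
    ])
    y = np.array([52.1, 50.3, 49.8, 51.6, 50.0, 51.9, 52.3, 49.5])

    N = len(y)
    effects = X.T @ y / (N / 2)        # Eq. 36 for every factor at once
    for name, e in zip("ABCDEFG", effects):
        print(f"E({name}) = {e:+.2f}")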
the test system, and (4) a fully documented protocol along with experienced personnel for the calibration. A realistic calibration schedule should be maintained. Decisions on the frequency of calibration are made by balancing the cost of calibration against the risk of biased test output. When in a state of statistical control, repetitive instrument responses for each true or standard value should be randomly distributed about a mean, and when a series of true values is used, the response means vs. true values should give a linear least-squares regression relationship. This gives confidence intervals on the slope, the intercept, and selected points on the line in the calibration range. Annex C outlines procedures for evaluating bias that are equivalent to this type of calibration operation; see that annex for more details. See also Section 6 for background on regression analysis.
Empirical relationships that appear to be linear can be tested for linearity by a number of approaches. Visual review of a plot is the most direct way to reveal departures from linearity. A plot of residuals (residual = observed - computed response value) with respect to the level of the response should not show any correlation or systematic behavior. Such a review requires at least 7 pairs of data points (response levels) to be useful. A simple F test may also be employed. If S_p² is the pooled variance for a set of repetitive instrument responses, each set of responses at one of a series of levels (true values) of the calibration standard, and S_fr² is the variance of points about the fitted function, when individual response values (not means or averages) are used for the least-squares calculation, then the variance ratio

F(calc) = S_fr²/S_p²   (38)

should not be significant, i.e., should not be greater than F(crit) at P = 0.05 for the respective df values of the two variances. See Section 6 for variance analysis procedures. If F(calc) equals or exceeds F(crit) under these conditions, there is some significant departure from linearity. This approach should be used with a sufficient number of points so that the df for each estimated variance is 8 to 10. When demonstrated nonlinearity exists, transformations may be used to linearize the response. See ISO 11095 in the bibliography for more details on calibration.
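The variance ratio of Eq. 38 can be sketched as follows; the calibration levels, the choice of three replicates per level, and the response values are illustrative assumptions.

    import numpy as np
    from scipy.stats import f

    levels = np.repeat([1.0, 2.0, 3.0, 4.0, 5.0], 3)   # true values
    resp = np.array([2.1, 2.0, 2.2, 4.0, 4.1, 3.9, 6.2, 6.0, 6.1,
                     8.1, 7.9, 8.0, 9.9, 10.1, 10.0])  # instrument responses

    # Variance about the fitted line, individual values used in the fit.
    b1, b0 = np.polyfit(levels, resp, 1)
    df_fr = len(resp) - 2
    S_fr2 = np.sum((resp - (b0 + b1 * levels))**2) / df_fr

    # Pooled variance of the replicate sets at each level.
    groups = resp.reshape(5, 3)
    df_p = groups.shape[0] * (groups.shape[1] - 1)
    S_p2 = np.sum((groups - groups.mean(axis=1, keepdims=True))**2) / df_p

    F_calc = S_fr2 / S_p2                               # Eq. 38
    print(F_calc, f.ppf(0.95, df_fr, df_p))             # compare with F(crit)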
Traceability
As the name implies, this is the ability to trace or accurately record the presence of an
unbroken, identifiable and documented pathway from fundamental standards or standard
values to the measurement system of interest. Traceability is a prerequisite for the assign-
ment of limits of uncertainty on the output or response measurement, but it does not imply
any level of quality, i.e., the magnitude of the uncertainty limits. Physical standards with
certified values as well as calibration services are available from national standardization
laboratories such as National Institute of Standards and Technology or NIST (formerly
NBS) in the USA or from the corresponding national metrology laboratories for all
developed countries. All of these standards are usually expressed in the SI units system.
making process, and good sampling technique ensures that all samples unquestionably
represent the population under consideration. Only the most elementary sampling issues
are addressed in this section. For more detailed information, various texts and standards
on sampling and sampling theory should be consulted; see the bibliography for statistical
texts and for standards on sampling.
5.1 Terminology
Sampling terminology systems vary to some degree among industrial and commercial
operations, which frequently involve complex mechanical systems to draw samples or
other increments from some large lot or mass of material. One of the important objectives
in such operations is the elimination of bias in the samples. Increased sampling frequency
can reduce the uncertainties of random sampling variation, but it cannot reduce bias. The
terminology given here is that which applies more directly to laboratory testing, where the
process of obtaining samples is reasonably straightforward. This type of sampling may be
thought of as static as opposed to dynamic sampling of the output of a production line.
Some important terms are:
Sample. A small fractional part of a material or some number of objects taken from a lot
or population; it is (they are) selected for testing, inspection, or specific observations of
particular characteristics.
Subsample. One of a sequence of intermediate fractional parts or intermediate sets of
objects, taken from a lot or population, that usually will be combined by a prescribed
protocol to form a sample.
Random sample. One of a sequence of samples (or subsamples), taken on a random basis
to give unbiased statistical estimates.
Systematic sample. One of a sequence of samples (or subsamples), each taken among all
possible samples by way of a selected alternating sequence with a random start; it is
intended to give unbiased statistical estimates.
Stratification. A condition of heterogeneity that exists for a lot or population that contains
a number of regions or strata, each of which may have substantially different properties.
Stratified sample. One of a sequence of samples (or subsamples), taken on a random basis
from each stratum in a series of strata in a (stratified) population.
that each unit of a population (or lot) have an equal chance 1/N of being selected for
testing.
Random sampling can be conducted in one of two ways: (1) with replacement of the
selected units, under the conditions that the test operation does not change or consume the
unit, or (2) without replacement, when the unit is changed in some way by the testing. For
large populations there is no essential distinction between these two types of random
sampling. For small populations (small N) a difference does exist. Since most physical
testing might in some way change the sample (or test specimen prepared from the sample)
the expressions given below are for the "without replacement" category.
A simple random sample is defined as n units drawn from the population such that each unit has the same probability of being drawn. The ideal procedure for doing this is to identify all the units in the population, i = 1, 2, ..., N, and select units from a table of random numbers. For some sampling operations this may have to be modified according to the manner in which individual units can be identified. Each of the N potential or actual units has a value y_i, and the unbiased estimate of the true mean Ȳ is given by the sample mean ŷ as

ŷ = Σy_i/n   (39)

where n is the number of units or size of the sample drawn from the population. The quantity n/N is referred to as the sampling fraction, and the reciprocal N/n is known as the expansion factor. The unbiased estimate of the variance of ŷ, designated as S_ŷ², is given by

S_ŷ² = [(N - n)/N](S_y²/n)   (40)

where
S_y² = the variance of the individual n units
The factor [(N - n)/N], which can be expressed as [1 - (n/N)], reduces the magnitude of the variance of the mean by the sampling fraction when compared to the infinite-population value. This reduction factor is called the finite population correction factor, or fpc, and it indicates the improved quality of the information about the population when n is large relative to N. As n grows larger, the variance of the estimated mean decreases, becoming zero when n = N, since at this point the mean is known exactly. In many situations the fpc has a minimal effect; it is usually ignored if n/N < 0.05, and then [(N - n)/N] is set equal to 1. The confidence interval at P = α for the estimated mean ŷ is given by

ŷ ± t_{α/2}[((N - n)/N)(S_y²/n)]^{1/2}   (41)
If the sample size used to estimate S_y² is fairly large, n ≥ 30, the value 2 may be used for t_{α/2} to give a P = α = 0.05 or 95% confidence interval.
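Equations 39 to 41 can be assembled as in the sketch below; the finite population of N = 200 values is simulated purely for illustration.

    import numpy as np
    from scipy.stats import t

    rng = np.random.default_rng(1)
    population = rng.normal(50.0, 4.0, 200)    # illustrative finite population

    N, n = len(population), 20
    sample = rng.choice(population, size=n, replace=False)

    y_hat = sample.mean()                            # Eq. 39
    fpc = (N - n) / N                                # finite population correction
    var_mean = fpc * sample.var(ddof=1) / n          # Eq. 40
    half = t.ppf(0.975, n - 1) * np.sqrt(var_mean)   # Eq. 41, alpha = 0.05
    print(f"mean = {y_hat:.2f} +/- {half:.2f}")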
Systematic Sampling
The actual process of drawing random samples can frequently be time consuming as well
as costly, especially for large populations and large samples. An alternative procedure that
is easier to conduct and that gives good estimates of the population properties is systematic
sampling. This type of sampling is conducted as follows:
Stratified Sampling
This type of sampling is a process generally applied to bulk lots or populations that are
known or suspected of having a particular type of nonhomogeneity. These populations
have strata or localized zones, and each stratum is usually expected to contain relatively
homogeneous objects or material properties. However these properties may vary substan-
tially among the strata, and samples are taken independently in each stratum. To apply
stratified sampling techniques a mechanism must exist to identify all the strata in the lot or
population. Once this is done the strata may be sampled by using proportional allocation
where the sample fraction is the same for each stratum. Another approach is optimum
allocation where the sample size or fraction may be increased in those strata with increased
variance if this information is known beforehand.
The calculation as applied above for ŷ, the estimated population mean for random sampling, may be applied to each stratum and these individual values used to obtain a population mean based on all stratum values. Similarly, the calculation for S_ŷ², the estimated population variance for random sampling, may be applied to each stratum and an analogous procedure used to obtain an overall variance based on all stratum values. If unequal samples are taken from the various strata, the overall population mean and variance values that represent the entire stratified population must be obtained on a weighted average basis; see Section 3.
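A sketch of such a weighted combination follows. It uses the standard formulas for a stratified mean and its variance (an assumption here, since the chapter defers the details to Section 3); the stratum sizes N_h and the sample data are illustrative.

    import numpy as np

    strata = {                    # N_h: stratum size, x: sample drawn in it
        "A": (100, np.array([9.8, 10.1, 10.3, 9.9])),
        "B": (300, np.array([12.0, 12.4, 11.8, 12.2, 12.1])),
        "C": (100, np.array([8.5, 8.9, 8.7])),
    }
    N = sum(Nh for Nh, _ in strata.values())

    # Stratum means weighted by the stratum share W_h = N_h/N.
    mean_st = sum(Nh / N * x.mean() for Nh, x in strata.values())

    # Variance of the stratified mean, with the fpc applied per stratum.
    var_st = sum((Nh / N)**2 * (1 - len(x) / Nh) * x.var(ddof=1) / len(x)
                 for Nh, x in strata.values())
    print(f"stratified mean = {mean_st:.2f}, s.e. = {np.sqrt(var_st):.3f}")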
E = t_{α/2} S_a/√n   (42)

where
E = |ŷ - Ȳ| = the maximum (absolute) difference between the estimated mean ŷ from the n samples and the true mean Ȳ
t_{α/2} = t value at a specified P = α, i.e., at a (1 - α)100% confidence level, where the df used for t_{α/2} is based on the df for S_a
S_a = the applicable standard deviation (among individual units tested), a function of the specific sampling and testing operation

This may be rearranged to solve for n to give

n = [t_{α/2} S_a/E]²   (43)
n_{T2} = [2S(m)/E_{T2}]²   (44)

and S_a is equal to S(m), the measurement error.

Type 3, Only S(sp) Significant: For this situation the number of samples n_{T3} is

and several combinations of n_sp and n_m may give equal E. Values for n_sp and n_m have to be selected based on their respective variance magnitudes and the costs associated with sampling and measurement.
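Because t_{α/2} itself depends on n, Eq. 43 is conveniently solved by iteration, as sketched below; S_a = 1.2 and E = 0.5 are illustrative assumptions.

    import numpy as np
    from scipy.stats import t

    S_a, E, alpha = 1.2, 0.5, 0.05

    n = 5                                             # starting guess
    for _ in range(20):
        t_crit = t.ppf(1 - alpha / 2, n - 1)
        n_new = int(np.ceil((t_crit * S_a / E)**2))   # Eq. 43
        if n_new == n:
            break
        n = n_new
    print(f"required sample size n = {n}")            # converges to 25 here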
Detecting Outliers
Outliers may be present in any size database. For a database of from several to 30--40 data
values, an analysis may be conducted using spreadsheet calculations. Data values may be
sorted from low to high and a plot of the sorted or ordered values will reveal any suspi-
cious high or low values. Tietjen and Moore [6] described a test that can be used for a
small database with a reasonable number of suspected extreme values or outliers (1 to 5).
The suspected outliers may be either low or high, and the statistical test may be used when
both types exist in the sample at the same time. The test is applicable to samples of 3 or
more, and for sample sizes of II or more, as many as five suspected extreme values may be
tested as potential outliers. The following procedure is used.
(1) The data values are denoted as X₁, X₂, ..., X_n. The mean of all values, designated as x̄(n), is calculated.
(2) The absolute residuals of all values are next calculated: R₁ = |X₁ - x̄(n)|, R₂ = |X₂ - x̄(n)|, etc.
(3) Sort the absolute residuals in ascending order and rename them as Z values, so that Z₁ is the smallest residual, Z₂ is next in magnitude, etc.
(4) The sample is inspected for extreme values, low and high. The most likely extreme values or potential outliers (those with the largest absolute residuals) are deleted from the sample (or database), and a new sample mean is calculated for the remaining (n - k) values, with k = the number of suspected extreme values. This new mean is designated as x̄_k. The critical test statistic E(k) is calculated as the ratio of the sum of squares of the remaining values about x̄_k to the sum of squares of all n values about x̄(n):

E(k) = Σ_{n-k}(X_i - x̄_k)²/Σ_n(X_i - x̄(n))²
Critical values are given in Table A4 for the test statistic E(k) at the P = 0.05 and the
P = 0.01 levels, for sample size n = 3 to 30 and for selected numbers of suspected outliers
k. If the calculated value of E(k) is less than the tabulated critical value in the table, the
suspected values are declared as outliers. This general approach may be used in an iterative
manner until all potential outliers have been evaluated by the procedure.
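The screening procedure can be scripted as below; the function name e_k and the data values are our own illustrative assumptions, the E(k) ratio is formed as described in step (4), and critical values must still be taken from Table A4.

    import numpy as np

    def e_k(data, k):
        # E(k): sum of squares after deleting the k largest absolute
        # residuals, divided by the sum of squares of the full sample.
        data = np.asarray(data, dtype=float)
        resid = np.abs(data - data.mean())
        keep = np.argsort(resid)[:len(data) - k]
        reduced = data[keep]
        return (np.sum((reduced - reduced.mean())**2)
                / np.sum((data - data.mean())**2))

    values = [8.9, 9.1, 9.0, 8.8, 9.2, 9.0, 12.7]   # one suspect high value
    print(f"E(1) = {e_k(values, k=1):.3f}")  # outlier if below Table A4 value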
The action to take when significant outliers have been identified is part of an ongoing
debate in the data analysis community. One recommendation is that only data values with
verified errors or mistakes should be removed. This ultraconservative approach overlooks
the situation where outright errors are made but no knowledge is available that they are
errors. The opposing view recommends removal if a data value is a significant outlier
(P = 0.05 or lower). A middle ground position is to make a judgment based on a reason-
able analysis considering technical and other issues that relate to the testing in addition to
the importance of the decisions to be made.
Case 1
In conducting an F-test for this situation, S₁² is assigned as the greater expected variance. The null hypothesis H₀, that there is no difference in variance, and the alternative hypothesis H_A, that σ₁² is larger than σ₂², are designated symbolically as

H₀: σ₁² = σ₂²   (48)
H_A: σ₁² > σ₂²

These hypotheses are tentatively adopted, a sample is drawn from each population (1 and 2), and the variances are calculated. The ratio S₁²/S₂² = F(calc) is evaluated. If this ratio is equal to or larger than F(crit), the ratio that would be expected by chance at a probability P = α (0.05 or other) of finding a value as large as F(calc) when the null hypothesis is true, the hypothesis of equality is rejected and the alternative hypothesis is accepted, i.e., that σ₁² > σ₂².
F-distribution tables give the F(crit) values that are equaled or exceeded a certain percentage of the time by chance alone; see Table A5. The F(crit) value cuts off an upper area under the F-distribution curve equal to α, and if F(calc) falls in this cutoff region, then the null hypothesis is rejected. Tables of F values are arranged for different
degrees of freedom df in the numerator and denominator, and F(crit) is usually listed for each P level as F(df_n, df_d), where df_n = df in the numerator and df_d = df in the denominator.
Case 2
In this situation there is no technical reason for expecting either variance to be greater than the other. A null hypothesis and an alternative hypothesis are adopted:

H₀: σ₁² = σ₂²   (49)
H_A: σ₁² ≠ σ₂²

and after both variances are calculated, the greater variance is placed in the numerator to evaluate F(calc). For making decisions at a P = 0.05 level, an allowance must be made for F(calc) to be greater than F(crit) if S₁² is greater than S₂², and conversely to be less than a different F(crit) at the other end of the F distribution if S₁² is less than S₂². One half of the P = 0.05 rejection region is assigned to potentially large values of F(calc) and one half to potentially small values. On this basis the P = 0.05 value for F(crit) is found in a P = 0.025 upper-tail F table; conversely, if the F(crit) value from a 95% confidence level upper-tail F table were used, the actual significance level for the decision would be P = 0.10, since there is a P = 0.05 probability of F(calc) falling in either critical region.
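The Case 2 calculation is sketched below on illustrative samples; note that the greater variance is placed in the numerator and the critical value is read at the P = 0.025 upper tail.

    import numpy as np
    from scipy.stats import f

    x1 = np.array([10.2, 10.8, 9.9, 10.5, 10.1, 10.7])
    x2 = np.array([10.0, 10.1, 10.2, 9.9, 10.0, 10.1])

    s1, s2 = np.var(x1, ddof=1), np.var(x2, ddof=1)
    F_calc = max(s1, s2) / min(s1, s2)       # greater variance on top

    alpha = 0.05
    df1 = df2 = len(x1) - 1
    F_crit = f.ppf(1 - alpha / 2, df1, df2)  # P = 0.025 upper-tail value
    print(F_calc, F_crit,
          "reject H0" if F_calc >= F_crit else "do not reject H0")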
One Variance Known
A less frequent situation is the case where one of the variances is a known or defined variance, represented by σ², while the other variance is from a sample and is equal to S². The ratio S²/σ² has a sampling distribution known as a "chi-squared over df", designated as (χ²/df). For this application, chi-square, χ², is a random variable given by the ratio of the product S²(df) to the known variance σ²:

χ² = S²(df)/σ²   (50)
H₀: all σᵢ² equal
H_A: at least two σᵢ² not equal   (52)

If F-max(calc) equals or exceeds F-max(crit), then a significant difference exists between the maximum and the minimum variance of the database. If F-max(calc) does not exceed F-max(crit), then all the variance estimates are equivalent.
values for t at various P levels. If |t(calc)| ≥ |t_{α/2}(crit)|, using absolute values, then the null hypothesis is rejected and the difference is declared significant at the (1 - P)100% confidence level. If |t(calc)| does not equal or exceed |t_{α/2}(crit)|, the difference is not significant at this level of testing for the selected P = α value.

Option 2. If the variances are not equal, there are two choices: use a transformation and conduct the analysis on the transformed data, or conduct the normal t-test and use the df for the smaller sample to select t_{α/2}(crit). The distribution of the t value so calculated is approximately the same as a true t distribution. This smaller-sample df recommendation is made because, with unequal variances, the exact number of degrees of freedom cannot be determined by the usual procedure given above.
t(calc) = d(av)/[S_d²/n]^{1/2}   (55)

where
d(av) = the average of the n paired differences (treated minus nontreated)
S_d² = the variance (estimate) of the individual differences
df = n - 1

The null and alternative hypothesis statement is

H₀: d(av) = 0
H_A: d(av) ≠ 0   (56)

The significance of d(av) is found by the same procedure as for a normal or standard t-test. The variance S_d² is not the variance of either population but of a constructed population of differences between the two conditions, treated vs. nontreated.
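The paired calculation of Eqs. 55 and 56 is sketched below with illustrative treated/nontreated pairs; scipy's ttest_rel returns the same statistic together with its P value.

    import numpy as np
    from scipy.stats import ttest_rel

    treated = np.array([14.2, 13.9, 14.8, 14.5, 14.1, 14.6])
    nontreated = np.array([13.8, 13.7, 14.1, 14.0, 13.9, 14.2])

    d = treated - nontreated
    t_calc = d.mean() / np.sqrt(d.var(ddof=1) / len(d))   # Eq. 55, df = n - 1
    print(f"t(calc) = {t_calc:.2f}")

    print(ttest_rel(treated, nontreated))   # same statistic, with P value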
and this begins with a database generated by a testing program, such as a series of treatments on a common material with some number of repeat tests for each treatment. This elementary one-way analysis procedure will illustrate the basic ANOVA concepts.
Basis of Variance Analysis
The database or data matrix, illustrated below as Table 5, consists of j columns and i rows of data. The j columns may be various treatments of a uniform material or levels of an adjustable or independent variable, and the values in each column are replicates or repeated measurements for each of the treatment levels.

The measurements or observations x_ij are the dependent variable. Two basic assumptions are made: all potentially different populations for each treatment have a normal distribution, and the variance of all populations is equal even if one or more of the treatments are significantly different. This latter assumption may be checked by use of the Hartley F-max test as previously described. Each data value x_ij may be expressed in terms of certain population means as given in Eq. 57, where each group as defined below represents a treatment.

x_ij = μ + (μ_j - μ) + (x_ij - μ_j)   (57)
where
μ = the grand mean, the mean of all values in the database
(μ_j - μ) = the variation among the x_ij values attributed to the difference of the group mean μ_j from the grand mean μ; there are k such differences
(x_ij - μ_j) = the variation attributed to the differences of the individual i replicate values in each group (or treatment) from the group mean μ_j
The equation may be rewritten by assigning symbols to the two differences:

x_ij = μ + β_j + ε_ij   (58)

The term β_j represents the series of differences, the jth treatment mean minus the grand mean; if there are no significant differences among the treatments, then β_j = 0 for all treatments. If one or more of the treatments are significant, then β_j ≠ 0 for those treatments. The term ε_ij is a random difference, i.e., the measured value x_ij minus the jth treatment mean. By subtracting μ from both sides we obtain

x_ij - μ = β_j + ε_ij   (59)

This model equation states that the magnitude of any deviation of the dependent variable x_ij from the grand mean is the sum of two components of variation: β_j, the component due to any presumed real response effect of a treatment, and a second component ε_ij, a within-treatment difference or error. The long-run mean value of all ε_ij equals zero, and the
variance of ε_ij is equal to the basic test error variance in making a measurement. The two components are called the between-treatments source of variation and the within-treatment source of variation. The analysis answers the question: is the between-treatments variation significant when compared to the within-treatment variation?
The analysis begins by setting up the null and alternative hypotheses

H₀: β_j = 0 for j = 1, 2, ..., k
H_A: at least one β_j is not 0   (60)

The between-treatments variation is calculated by noting that there are k treatment means, and the variance among the k group means, S_x̄², is given by

S_x̄² = Σ(x̄_j - x̄_g)²/(k - 1)   (61)
where
x̄_j = the treatment mean (across all i values) for the jth treatment
x̄_g = the grand mean across all i and j values
k = the number of treatments
In general, the standard error for any normal population mean is equal to σ/√n, and the variance is equal to σ²/n, where n is the number of values used to calculate the mean and σ² is the population variance. For this analysis the term S_x̄² is an estimate of the true variance σ_x̄² among the k different means. If the null hypothesis is true, S_x̄² can be used to evaluate the population variance. To evaluate the variance for individual population values, the calculated S_x̄² must be multiplied by the number of test values (replicates) used for each of the means, designated as n_j, to give n_j S_x̄², which is an estimate of n_j σ_x̄² = σ². The variance of the individual measurements for each of the k treatments is calculated by pooling the within-treatment sums of squares,

S_p² = ΣΣ(x_ij - x̄_j)²/[k(n_j - 1)]   (62)
We now have two estimates of the individual population variance: the first, n_j S_x̄², obtained from the variation in the treatment means, and the second, S_p², the pooled variance obtained from the replicates. If there are no real effects of the treatments (all β_j = 0), both of these estimate the underlying variance of the population. An F-test can be used to decide if the two estimates are equal. Using the hypotheses given above and the technically justified assertion that, if the variances are not really equal, the between-treatments variance should be larger than the within-treatment variance, the F-ratio is defined as

F(calc) = n_j S_x̄²/S_p²   (64)
The degrees of freedom df for the numerator are k - 1, and the df for the denominator are (n_j - 1)k. If F(calc) is equal to or greater than F(crit) at these respective df at a probability
level P = 0.05 or less, then the larger variance is significantly greater than the lesser variance, with the implication that at least one value of β_j is not equal to 0.
ANOVA Calculations
The classical ANOVA calculations are not usually performed as given above but by shortcut methods that were developed to reduce the burden of calculation before the use of computers. An understanding of this approach can be gained by considering that the total variation in a database, as given above, is evaluated by calculating the total variance S_tot² by

S_tot² = Σ(x_ij - x̄_g)²/(kn - 1)   (65)
with x_ij and x̄_g as defined above and the summation taken over all values. The numerator of Eq. 65 is a sum-of-squares called the total sum-of-squares and represents all the variation in the database. It may be shown that this total sum-of-squares may be partitioned into the two components on the right-hand side of Eq. 66:

Σ_(kn)(x_ij - x̄_g)² = Σ_(kn)(x_ij - x̄_j)² + n Σ_(k)(x̄_j - x̄_g)²   (66)

where
Σ_(kn) = summation over all kn values
Σ_(k) = summation over the k treatments
x̄_j = the mean of the jth treatment; other symbols as defined above

The first term on the right-hand side is called the error sum-of-squares, and the second right-hand term is the treatment sum-of-squares. A one-way ANOVA is performed by calculating the total sum-of-squares SS(tot) by way of a shortcut expression that can be shown to give the value defined by the left side of Eq. 66:
SS(tot) = Σx_ij² - C   (67)

where
Σx_ij² = the individual measured values x_ij squared and summed over all values in the database
C = T_gn²/kn = a constant called the correction term, where T_gn is the grand total of all measured values in the database, k is as defined above, and the subscript j has been dropped from n
The second ANOVA sum-of-squares is the treatment sum-of-squares SS(trt), given by Eq. 68, where ΣT_j² is the sum of the squared totals for each of the k treatments:

SS(trt) = ΣT_j²/n - C   (68)

The random variation or error sum-of-squares is obtained by difference:

SS(error) = SS(tot) - SS(trt)   (69)
The three sums-of-squares are used in a table with the layout of Table 6. The sums-of-squares are divided by the appropriate degrees of freedom df to give variances that are called mean squares. As indicated in the previous exposition of variance analysis, the treatment mean square is divided by the error mean square to make a decision on the significance of the treatments.

Table 6  Analysis of Variance Table

Source of variation    df          Sum of squares    Mean square                        F(calc)
Treatments             k - 1       SS(trt)           MS(trt) = SS(trt)/(k - 1)          MS(trt)/MS(error)
Error (within)         k(n - 1)    SS(error)         MS(error) = SS(error)/[k(n - 1)]
Total                  kn - 1      SS(tot)
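The shortcut sums-of-squares of Eqs. 67 to 69 and the resulting F(calc) can be computed as in the sketch below; the k = 3, n = 4 data matrix is an illustrative assumption, and scipy's one-way ANOVA serves as a cross-check.

    import numpy as np
    from scipy.stats import f_oneway

    data = np.array([[10.1, 11.4, 9.6],      # rows = replicates,
                     [10.4, 11.1, 9.9],      # columns = treatments
                     [ 9.8, 11.6, 9.5],
                     [10.2, 11.3, 9.8]])
    n, k = data.shape

    C = data.sum()**2 / (k * n)                       # correction term
    SS_tot = (data**2).sum() - C                      # Eq. 67
    SS_trt = (data.sum(axis=0)**2 / n).sum() - C      # Eq. 68
    SS_err = SS_tot - SS_trt                          # Eq. 69

    F_calc = (SS_trt / (k - 1)) / (SS_err / (k * (n - 1)))
    print(f"F(calc) = {F_calc:.2f}")

    print(f_oneway(*data.T))    # same F from scipy, one group per column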
Although the analysis of variance is a powerful tool for data analysis, especially for more complex situations that are beyond the scope of this chapter, it needs to be supplemented by additional analysis tools. If F(calc) is found to be significant, this implies that at least one of the treatments is significantly different from some other treatment. If several treatments are used, it is appropriate to determine what the exact relationship is among the means of all the treatments. This type of analysis is called multiple comparisons. A typical statistical test of this type is the Duncan multiple range test; see Part 2 of this section. This test makes all the pairwise comparisons among the treatment means based on a statistical parameter called the least significant range, LSR. For each pair of means an LSR is calculated and compared to a critical LSR for the number of groups compared, the df for the error term, and a selected P.
tion may, however, exist); and (3) the population coefficient is symmetric with respect to x and y, i.e., x on y gives the same value as y on x.

Analysis begins by calculating x̄, the mean of the x variable, ȳ, the mean of the y variable, and the deviations (x_i - x̄) and (y_i - ȳ). The estimated correlation coefficient r is the sum of the cross products of the deviations of x and y divided by the square root of the product of the sums of squares of the same respective deviations:

r = Σ(x_i - x̄)(y_i - ȳ)/[Σ(x_i - x̄)² Σ(y_i - ȳ)²]^{1/2}   (70)
For any xy scatter plot of two variables that have a high degree of positive correlation, a large number of points will fall in the upper right quadrant 1 and the lower left quadrant 3 when the origin (center) of the quadrants is at the mean values x̄ and ȳ. This clustering of the points ensures that quadrant 1 will have positive and relatively large x deviations associated with similar positive and large y deviations. Quadrant 3 will have a similar situation, with negative deviation values for both variables. When these deviation cross products are summed over all xy values, they will very nearly equal the square root of the product of the x and y deviations squared, and the ratio will be high, or near 1. For a negative or inverse association, the points will cluster in the upper left quadrant 2 and the lower right quadrant 4. A strong association of this type will give negative cross products and a large negative ratio, an r value approaching -1.
The significance of any calculated correlation coefficient is evaluated by adopting the usual two-sided hypothesis test,

H₀: ρ = 0
H_A: ρ ≠ 0   (71)

This hypothesis can be tested for any sample size n of 3 or greater by using a t-statistic given by

t(α/2) = r/[(1 - r²)/(n - 2)]^{1/2}   (72)
where t(α/2) is the value of a random variable whose distribution is approximately the usual t-distribution with n - 2 degrees of freedom. The correlation is considered significant if t(calc) is greater than the critical value t(α/2)(crit) at a level of significance designated by P = α. Table A9 gives precalculated critical r values at df = n - 2.
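Equations 70 and 72 are sketched below on illustrative data; scipy's pearsonr returns the same coefficient along with its P value.

    import numpy as np
    from scipy.stats import pearsonr, t

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 2.9, 3.6, 4.8, 5.1, 6.2])

    r = np.corrcoef(x, y)[0, 1]                   # Eq. 70
    n = len(x)
    t_calc = r / np.sqrt((1 - r**2) / (n - 2))    # Eq. 72
    t_crit = t.ppf(0.975, n - 2)                  # two-sided, P = 0.05
    print(r, t_calc, t_crit)

    print(pearsonr(x, y))                         # same r, with P value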
Regression Analysis
Simple regression analysis is also concerned with the association between two variables x and y, but the goal of this analysis is to develop a mathematical relationship between the two variables to permit a prediction of y from a knowledge of x. The potential use of regression analysis implies that there is a correlation between the two variables and that the distinction between correlation and causation as discussed above applies to this analysis as well. The linear mathematical relationship is called a regression model; it explains or predicts the response or y variable in terms of the other (independent) variable, designated as the x variable. Regression analysis is used to make inferences about the parameters of the regression model, which is given by

y = β₀ + β₁x + ε   (73)

This model, referred to as the regression of y on x, involves β₀, β₁, and ε, defined as
β₀ = the intercept, the value predicted by the model when x = 0; it has no practical meaning if x cannot equal zero but is necessary to specify the model
β₁ = the slope of the regression line, i.e., the change in y per unit change in x
ε = a random error term with population mean of 0 and variance of σ²

A term called the conditional mean, denoted μ_{y|x}, is a predicted value for the dependent variable y for some given x and is expressed as

μ_{y|x} = β₀ + β₁x   (74)
The regression model describes a line that is the locus of all values of the conditional mean, each conditional mean corresponding to one of a set of x values. Most regression problems are concerned with selecting a set of x values that span a reasonable operational range and measuring y at each of these x values. Each of the observed or measured values of the response variable (at a given x) comes from a normal population with mean μ_{y|x} and variance σ².
The purpose of a regression analysis is to use a set of measured or observed x and y values to estimate the parameters β₀, β₁, and the variance σ² of the ε terms, and to perform hypothesis tests and evaluate confidence intervals concerning β₁. Basic assumptions in this analysis are that the linear model is appropriate; that the ε error terms or deviations in the y variable are independent and normally distributed with a common variance σ² at all x levels; and that the uncertainty variance in setting each x level is small in comparison to σ². The analysis seeks estimates of β₀ and β₁ that produce a set of μ_{y|x} values that best fit the data. The regression line may be written in an alternative format as

μ̂_{y|x} = b₀ + b₁x   (75)

In this alternative format, μ̂_{y|x} is an estimate of the mean of y for any given x, and b₀ and b₁ are estimates of β₀ and β₁ respectively. How well the estimates agree with the observed y is evaluated by the magnitude of the differences y - μ̂_{y|x}, which are called residuals. Small residuals indicate good fit, and the best fit is the line that gives the smallest combined magnitude for the squares, or variance, of the residuals.
The minimization of the squares of the residuals is called the least-squares criterion, which requires that the estimates of β₀ and β₁ minimize the sum expressed by

Σ(y - μ̂_{y|x})² = Σ(y - b₀ - b₁x)²   (76)

These values are obtained by means of "normal equations," a set of simultaneous equations of the form

b₀n + b₁Σx = Σy
b₀Σx + b₁Σx² = Σxy   (77)
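The normal equations of Eq. 77 form a 2 x 2 linear system that can be solved directly, as in the sketch below with illustrative data.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.2, 3.9, 6.1, 8.0, 9.9])

    n = len(x)
    A = np.array([[n,       x.sum()],
                  [x.sum(), (x**2).sum()]])
    rhs = np.array([y.sum(), (x * y).sum()])
    b0, b1 = np.linalg.solve(A, rhs)         # Eq. 77
    print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")

    resid = y - (b0 + b1 * x)                # residuals y - mu_hat(y|x)
    MS_e = np.sum(resid**2) / (n - 2)        # cf. Eqs. 80-81
    print(f"MS_e = {MS_e:.4f}")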
SS_ε = Σ(y - μ̂_{y|x})²   (80)

which describes the remaining variation in y after estimating the linear relationship of y upon x. The degrees of freedom for SS_ε is n - 2, where n is the number of xy values, since two df are used for b₀ and b₁, and the mean square or variance estimate of σ² is

MS_ε = SS_ε/(n - 2)   (81)
The ratio

(b₁ - β₁)/[σ²/((n - 1)S_x²)]^{1/2}   (82)

is a standard normal (z) variable. When the estimated variance MS_ε is substituted for σ², the ratio becomes a random variable with a t distribution with n - 2 degrees of freedom, and this may be used for hypothesis testing. Thus for testing whether β₁ equals a specific value β₁*,

H₀: β₁ = β₁*
H_A: β₁ ≠ β₁*   (83)

the test statistic is

t_{α/2} = (b₁ - β₁*)/[MS_ε/((n - 1)S_x²)]^{1/2}   (84)
Letting β₁* = 0 provides a test of the null hypothesis

H₀: β₁ = 0
H_A: β₁ ≠ 0   (85)

and the confidence interval for b₁ is calculated as

b₁ ± t_{α/2}[MS_ε/((n - 1)S_x²)]^{1/2}   (86)
Inferences on the model estimates of the response variable are also important. There are two different but related inferences: (1) inferences on the mean response, how well the model estimates the conditional mean at some x, and (2) inferences on prediction, how well the model predicts the value of the response variable y for a randomly chosen future x value. The point estimate for the first of these is μ̂_{y|x}, the estimated mean response for any x, and the estimate for the second case is ŷ_{y|x}, the predicted individual response value for any x. For a specified value x*, the variance of the estimated mean is

S²(μ̂_{y|x*}) = σ²[1/n + (x* - x̄)²/((n - 1)S_x²)]   (87)

and the variance of a predicted individual value is

S²(ŷ_{y|x*}) = σ²[1 + 1/n + (x* - x̄)²/((n - 1)S_x²)]   (88)
Both of these variances have their minima when x* = x̄, i.e., the response is estimated with greatest precision when the independent variable is at its mean. At this location the conditional mean is ȳ and the variance of the estimated mean is the familiar σ²/n. Also note that S²(ŷ_{y|x}) > S²(μ̂_{y|x}), since a mean has greater precision than an individual value.
When MS_ε is substituted for σ², the estimated variances are given by these two equations. Letting x = 0 in the variance expression for μ̂_{y|x} gives the variance for b₀, which can be used for hypothesis and confidence interval testing of β₀. This has applications for some regression problems where β₀ has some intrinsic meaning other than that of an arbitrary fitting constant.
One simple diagnostic test for simple linear regression that may give an indication that some of the essential assumptions have been violated is the residual plot. If the residuals y - μ̂_{y|x} are plotted on the vertical axis and the predicted values μ̂_{y|x} or the x values on the horizontal axis, the plot should give a horizontal band of points, centered on zero on the vertical axis, with relatively uniform height (vertical spread) across the entire horizontal span if there are no serious problems with the regression. If the height increases from end to end, this may indicate nonuniform variance across the x factor range. If the band curves up or down, left to right, this may indicate a nonlinear underlying true relationship, i.e., the model has not been properly specified and may need a square term.
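A minimal residual-plot sketch follows (illustrative data; matplotlib assumed available).

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 13.9, 16.2])

    b1, b0 = np.polyfit(x, y, 1)
    pred = b0 + b1 * x
    resid = y - pred                        # residuals y - mu_hat(y|x)

    plt.scatter(pred, resid)
    plt.axhline(0.0, linestyle="--")        # band should straddle zero
    plt.xlabel("predicted response")
    plt.ylabel("residual")
    plt.show()                              # look for curvature or fanning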
Multiple regression, where a dependent y variable is a function of a number of independent x variables or factors, is a widely used analysis technique for multivariable problems. The use of a specialized, simplified multiple regression analysis, where all the independent factors are orthogonal, is discussed in the next section on experimental design. A detailed discussion of customary multiple regression analysis, where orthogonality may or may not be present for all variables, is beyond the scope of this chapter.
or qualitative, with the levels representing certain types or categories, e.g., reactor 1 vs. reactor 2. The response is the dependent or measured variable. The generic models for experiments are (1) a fixed-effects model, where only the selected levels of the factors in the experiment are of concern; (2) a random-effects model, where the factor levels chosen represent a sample from a larger population and conclusions from the program are applied to the population; and (3) a mixed-effects model, where both random- and fixed-effects factors are included in the design. Factors may be primary, i.e., the major factor(s) for evaluation that can have a direct influence, or secondary, i.e., factors that can have a less direct influence, e.g., environmental conditions, which may or may not be evaluated as secondary objectives.
In a typical experiment set up to evaluate three levels, 1, 2, and 3, of Factor A, a secondary factor may be a qualitative factor O: which of several operators conducts the test. The usual or classical experimental approach is to fix the level of O, i.e., select one operator, and evaluate the response when Factor A is varied over the three levels (1, 2, and 3) with perhaps three replications for each level. This approach has 6 df for error estimation and nine total test measurements. Testing will evaluate the influence of A with the selected operator but give no indication of operator effects. A more comprehensive approach is to use a block experimental design. Select two diverse operators, and for each operator evaluate the response for Factor A at levels 1, 2, and 3 with two replications per level. The operators are the blocks, and the influence of Factor A is evaluated independently in each block. This design also has 6 df for error, with now a total of 12 measurements. The investment of 3 more measurements in the second design provides much more information. The influence of Factor A is now evaluated for both operators, and any unusual influence or interaction of operators on Factor A response can also be determined.
In complex experimental programs, blocking may be done for potentially perturbing
factors that cannot be controlled, such as ambient temperature or humidity variations, by
conducting several evaluations of the response in short time periods, where temperature
and humidity are relatively constant. Dividing the total experiment into these blocks, or
relatively uniform groupings, improves the quality of the estimated effects by reducing the
standard error of the measurements.
Factorial Designs
A class of experimental designs called factorial designs is well suited to technical projects
involving physical measurements. Factorial designs are defined as a group of unique
combinations of levels across all the selected factors. When the factor levels are set at
the values called for in a particular combination and a response measurement is made for
this combination, this is called a (test) run. Each design has some number of specified runs,
and the total layout or list of these runs is called the design matrix. A complete factorial
design matrix is one where for each factor, all factor levels of the other factors appear in
some combinations of the design matrix. Thus for three factors investigated at two levels each, a complete factorial design would require eight (2³) response measurements or runs, each having a different combination of the two levels of the three factors. When the number of factors is large, a complete or full factorial requires too many test runs, and designs called fractional factorials are used, in which a certain fraction of the full factorial number of runs is selected on the basis that the design be balanced with respect to the number of selected levels of each factor.
Factorial designs may be divided into two types, (1) screening designs used to search
for important factors or to rank factors in order of importance, and (2) exploratory designs
used to explore and map out a region of technical interest in greater detail, thereby gaining
empirical understanding. The simpler screening designs, used for a preliminary search for
important factors in a system, have k factors each at two levels, designated as upper and
lower levels. Exploratory designs also have k factors, but now each factor usually has at
least three levels, upper, middle and lower; this permits the evaluation of nonlinear
response relationships.
The designs as given in this section are analyzed in terms of model equations that
simulate the system under study. The designs and the model equations are set up using
special coded units for the independent variables (factors) of the design. Thus for the
response variable y and two independent variables x1 and x2, the model equation that
allows for the evaluation of any interaction between x1 and x2 is
y = b0 + b1x1 + b2x2 + b12x1x2 (89)
where
b0 = a constant; in the system of units chosen it is the value of y when x1 = x2 = 0
b1 = change in y per unit change in x1
b2 = change in y per unit change in x2
b12 = an interaction term for the specific effects of combinations of x1 and x2; it
indicates how b1 changes as x2 changes by 1 unit
The coded units are obtained by selecting for each factor a value that constitutes a center
of interest or a reference value, and then selecting certain values that are below and above
that center of interest by an equivalent amount. This is a straightforward process for
quantitative factors but may not be possible for some qualitative factors that can exist
at only two levels. In this case the center of interest is considered as theoretical or con-
ceptual. The coded units for any xi are defined by
xi = (VE - cVE)/SU (90)
with
VE = selected factor value for xi, in physical units
cVE = center of interest factor value for xi, in physical units
SU = scaling unit, i.e., the change in physical units equal to 1 coded unit
When VE is higher than cVE by an amount equal to SU, then xi = 1; when VE is less than
cVE by an equal amount, xi = -1; and when VE = cVE, xi = 0. The center of interest
values for all factors constitute the central point in the multidimensional factor space for
the experiment. As indicated above, the constant b0 is the value for y at this center in the
factor space; it is the (grand) average of all responses and is an important analysis output
parameter for these designs.
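A minimal sketch of the coded-unit transformation of Eq. 90 (Python; the function name and the cure-temperature example are illustrative, not from the text):

def coded_unit(ve, cve, su):
    """Convert a physical factor value VE to coded units, given the
    center-of-interest value cVE and the scaling unit SU (Eq. 90)."""
    return (ve - cve) / su

# Example: a cure temperature centered at 160 C with a scaling unit of 10 C.
print(coded_unit(170.0, 160.0, 10.0))  #  1.0 (upper level)
print(coded_unit(150.0, 160.0, 10.0))  # -1.0 (lower level)
print(coded_unit(160.0, 160.0, 10.0))  #  0.0 (center of interest)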
The design matrix for a 2^3 full factorial design is given in Table A10 along with an
additional matrix called the independent variables matrix or analysis matrix that is used to
calculate the effects of the independent variables; this matrix has an equal number of rows
and columns. The analysis of factorial designs is a relatively straightforward hand calcu-
lator procedure that may be performed by using the analysis matrix generated from the
design matrix as follows: (1) the first analysis matrix column consists of 1 values and is
used to evaluate the grand mean; (2) the next three columns are the same as in the design
matrix; (3) the remaining (interaction) columns are generated by multiplying respective
values in the columns headed by xi and xj to give column entries for bij. The final column
is generated in the same way for three factors. In general this operation is conducted for all
factors up to k, for any 2^k design.
With this design the effects of the three factors x1, x2, and x3 may be evaluated in
terms of effect coefficients. There are three main effect coefficients, b1, b2, b3; three two-factor
interaction coefficients, b12, b13, b23; and one three-factor interaction coefficient, b123,
as given in Eq. 91.
y = b0 + b1x1 + b2x2 + b3x3 + b12x1x2 + b13x1x3 + b23x2x3 + b123x1x2x3 (91)
The coefficients are evaluated from the sum of the products obtained by multiplying
certain column values or elements on each row, where (col bi) indicates a specific row
value in column bi of the analysis matrix and yi is the same specific row value for the
response. The sum obtained over all N responses is divided by N/2, where N is the total
number of runs in the design.
bi = Σ[yi(col bi)]/(N/2)
bij = Σ[yi(col bij)]/(N/2)
bijk = Σ[yi(col bijk)]/(N/2) (92)
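By way of illustration, the following minimal sketch (Python with NumPy; the eight response values are invented) carries out the matrix procedure of Eq. 92 for a 2^3 design:

import numpy as np
from itertools import product

# Design matrix for the 2^3 full factorial: all sign combinations of x1, x2, x3.
X = np.array(list(product([-1.0, 1.0], repeat=3)))  # 8 runs, coded units
x1, x2, x3 = X.T

# Analysis matrix: a column of +1s, the three design columns, and the
# element-wise products that generate the interaction columns.
A = np.column_stack([np.ones(8), x1, x2, x3, x1*x2, x1*x3, x2*x3, x1*x2*x3])

y = np.array([52.1, 55.9, 53.0, 57.2, 49.8, 54.1, 50.7, 55.0])  # illustrative responses

N = len(y)
b = A.T @ y / (N / 2)   # Eq. 92: sum of products over N/2 for each column
b[0] = y.mean()         # b0, the grand mean, uses N rather than N/2
print(dict(zip(["b0", "b1", "b2", "b3", "b12", "b13", "b23", "b123"], b.round(3))))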
Screening Designs
Seven screening designs with two to five factors are listed in Table A11. These are a
collection of complete factorials or fractional factorials. The fractional factorials are
designated as one-half replicate or one-fourth replicate, i.e., a one-half or one-fourth
fraction of the full design. With the exceptions as noted below, all of these designs
allow for the evaluation of two-factor interactions, which are often important in many
technical investigations. Typically, three-factor interactions have no real significance in
such programs. Thus any design that allows for direct calculation of two-factor interactions
is usually sufficient to give a good evaluation of any system.
All of the designs are orthogonal in the independent variables or factors, i.e., there is
no correlation among these factors. This feature of orthogonality permits the use of the
matrix as discussed above for easy analysis. The designs are balanced, i.e., for any
level of factor i, the levels of all other factors appear at their upper and lower values the
same number of times. The analysis for each design in Table A11 can also be conducted by
multiple regression analysis with typical computer algorithms to give the values for all b
coefficients and the other typical output parameters for multiple regression. One virtue of
using a multiple regression analysis is the ability to evaluate the significance of the indi-
vidual coefficients on the basis of t tests and thus obtain a model with only significant
coefficients.
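As a sketch of this regression route (Python, assuming the statsmodels package is available; the response values are invented), fitting the main effects and two-factor interactions of a 2^3 screening design and reading off the t-test p-values:

import numpy as np
import statsmodels.api as sm
from itertools import product

# 2^3 design in coded units plus interaction columns; the three-factor term
# is omitted here so that 1 df remains for error in this small illustration.
X = np.array(list(product([-1.0, 1.0], repeat=3)))
x1, x2, x3 = X.T
A = np.column_stack([np.ones(8), x1, x2, x3, x1*x2, x1*x3, x2*x3])

y = np.array([52.1, 55.9, 53.0, 57.2, 49.8, 54.1, 50.7, 55.0])  # illustrative

fit = sm.OLS(y, A).fit()
print(fit.params)   # regression-scale coefficients (for +/-1 coding, half the Eq. 92 values)
print(fit.pvalues)  # t-test significance of each coefficient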
Table A11 gives alias and confounding information for the fractional or blocked
designs. This information is needed for proper design setup, analysis, and interpretation.
An alias exists in fractionated designs when the same sum of products Σyi(col bij) numerically
evaluates the sum for two different or separate coefficients; thus no unique evaluation
of each of the coefficients is possible. The aliases in the table are indicated by an equals
sign. Design 3, the three-factor design conducted in two blocks, can be used to lay out a 2^3
design into two blocks as well as to use either of the blocks to evaluate the three main
effects of a three-factor design with the indicated aliases. Thus each block is equivalent to a
three-factor one-half replicate design. If it can be shown on technical grounds alone that
no two-factor interactions are possible or are of negligible magnitude, then the two-block
three-factor design may be used for blocking or for main effect evaluation. If this cannot
be assured, then for all potential situations of this sort, designs with no "main effect-two-
factor interaction aliases" must be used. Confounding implies a similar unresolved situa-
tion where certain higher-order coefficients are equivalent to block effects in the same
sense as an alias. This is of lesser importance than aliases with two-factor interactions.
When assigning coefficient numbers to factors, certain features of the fractional
designs can be used to give a layout as free from conflicting interpretations as possible.
Design 5, a four-factor one-half replicate, has the two-factor interactions aliased with each
other. Note that x1 is part of each aliased combination or equivalency. If it can be
ascertained in advance that one factor of the four can be guaranteed to have minimal
interaction with the other three, then this one factor should be assigned as x1. On this basis
the left side of all the alias equations is zero or very close to zero, and the right-hand side
becomes the real interaction if this interaction is determined to be significant. A similar but
not exactly equivalent situation exists with regard to x5 for the five-factor one-half replicate
Design 7; x5 appears in the alias equations with main effects for factors 1 to 4. Thus
for this design the factor least likely to interact with the other factors should be assigned
as factor 5.
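A small sketch (Python with NumPy) of how such aliases arise, using a four-factor one-half replicate built from the generator x4 = x1·x2·x3 (an assumed generator chosen for illustration; Table A11's Design 5 may use a different one). The two-factor interaction columns coincide in pairs, so their coefficients cannot be separated:

import numpy as np
from itertools import product

base = np.array(list(product([-1.0, 1.0], repeat=3)))  # full 2^3 in x1, x2, x3
x1, x2, x3 = base.T
x4 = x1 * x2 * x3                      # defining relation I = x1 x2 x3 x4

print(np.array_equal(x1*x2, x3*x4))    # True: b12 aliased with b34
print(np.array_equal(x1*x3, x2*x4))    # True: b13 aliased with b24
print(np.array_equal(x1*x4, x2*x3))    # True: b14 aliased with b23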
Exploratory Designs
With the exception of the two-factor design, those designs illustrated in Table A12 consist
of a core group of runs that is equivalent to a screening design with certain added runs that
contain additional levels of the factors. The additional runs are located at the center of the
design (center of interest) and at upper and lower levels beyond those that appear in a
screening design. These added runs enable any nonlinear behavior to be evaluated and
allow for main effect and two-factor interactions to be evaluated with enhanced reliability
due to the increased number of runs. The center runs are replicated to give a small df
estimate of error.
The two-factor design is a layout in the shape of a hexagon with two center points.
The three-factor design is a full factorial augmented by lower and upper points selected to
give an efficient estimation of the potential effects of the three factors. The center point is
replicated four times. The four- and five-factor designs are built on a one-half replicate of
the respective full factorial again augmented by lower and upper points for all factors plus
replicated center points.
The selection of the physical units that represent the coded units of both screening and
exploratory designs should be carefully thought out. In a screening design, the lower and
upper physical unit levels should be as wide as possible to increase the sensitivity of the
evaluation. The values should also be in the range of direct interest to the technical
problem at hand. Exploratory designs, with perhaps five levels, also require careful selec-
tion of the range of values and the equivalence between physical and coded units.
Cross-tabulations
For two-way and multi-way tables: Pearson's r, Pearson's chi-square, likelihood-ratio chi-square,
Yates' corrected chi-square, Spearman's rho, contingency coefficient,
Goodman's and Kruskal's tau, eta coefficient, Cohen's kappa, relative risk estimate
One-Way ANOVA
Typical one-way ANOVA with post hoc tests: LSD, Bonferroni, Duncan's, Sidak's,
Scheffe, Tukey, Tukey's-b, R-E-G-W-F, R-E-G-W-Q, S-N-K, Waller-Duncan
Levene's homogeneity of variance
Bivariate Correlations
Pearson's correlation coefficient, Spearman's rho, Kendall's tau-b, univariate statistics,
covariances and cross-products, outlier screening prior to analysis
Linear Regression
Estimates of linear regression coefficients, standard errors of coefficients, significance of
coefficients, blocking of variables, residual calculation and residual analysis, standard
ANOVA, weighted least-squares analysis
Curve Estimation
Models available: linear, logarithmic, inverse, quadratic, cubic, power, compound, S-
curve, logistic, growth, exponential
For each model: regression coefficients, multiple R, R^2, adjusted R^2, standard error of
estimate, ANOVA table, predicted values, residuals, prediction curves
Nonparametric Tests
Chi-square, binomial test, runs test, one-sample Kolmogorov-Smirnov test, Mann-
Whitney U test, Moses test, Wald-Wolfowitz test, Kruskal-Wallis test, Wilcoxon
signed rank test, Friedman's test, Kendall's W test, Cochran's Q test
Multiple Response Analysis
Frequencies and frequency tables calculated, multiple response cross-tabs given
may be improved by reducing the inherent variation. Thus both production and test
systems can be in control at various quality levels.
Internal Techniques
Some frequently used internal techniques are
Repetitive measurements
Internal test or reference samples (or objects)
Control charts
Interchange of operators and/or equipment
Independent measurements
Definitive (alternative) methods or measurements
Audits
The purpose of all of these techniques is to determine how a system performs based on the
selection of one or more performance parameters. The first two techniques, repetitive
measurements and the use of internal test or reference samples, are the classical ways to
evaluate precision. However, this can be a time-consuming process if it is not carefully
planned, and quality assessment frequently provides a way to minimize the number of
measurements. The use of duplicate samples in routine testing, and the accumulation of
this information over a time period, is another approach to evaluating precision. More
details on this will be given in Section 8 on laboratory precision, which addresses both
intralaboratory and interlaboratory operations.
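As a sketch of the duplicate-sample approach just mentioned (Python with NumPy; the data are invented, and the pooling formula used is the standard one for paired duplicates rather than a formula quoted from this chapter):

import numpy as np

# Illustrative routine duplicates accumulated over a time period; each row
# is one test occasion, measured twice on the same sample.
pairs = np.array([[10.2, 10.5], [9.8, 10.1], [10.4, 10.3],
                  [10.0, 10.6], [9.9, 10.0]])

# Pooled within-pair standard deviation: sqrt(sum(d^2) / (2m)) for m pairs.
d = pairs[:, 0] - pairs[:, 1]
s_r = np.sqrt(np.sum(d**2) / (2 * len(d)))
print(round(float(s_r), 3))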
The use of control charts is a documented way to interpret sequential test data and
assess quality as well as to monitor or maintain quality. This is described in more detail
below. The interchange of operators and equipment (if possible) can also provide valuable
information about sources of variation and their influence on quality. Test data should be
independent of such factors as operators and individual test machines of the same design.
If output data are related to such factors, operator training and machine calibration
operations need attention. Independent measurements and definitive methods are related;
they both may be used to measure the same parameters by a different but equivalent
method. Although independent measurements may not be available for some methods,
the results of such testing may give information about any bias in a selected test method,
although this topic is more rigorously approached by way of external techniques as
described below. Audits by internal staff members of any standard operating procedure
(SOP) may also assist in assessing and improving quality.
External Techniques
Some frequently used external techniques that may be used are
Collaborative or interlaboratory testing
Exchange of test samples (or objects)
External Reference Materials (RM)
Standard Reference Materials (SRM)
Audits
All of these techniques establish a relationship between a laboratory and the external
world. Collaborative or interlaboratory testing consists of programs organized to test
samples from one or more selected materials (or objects) that have some documented
level of homogeneity and are supplied to a number of laboratories. Participation in col-
laborative programs allows a laboratory to determine how it stands in comparison to
other participants in the program and the accepted reference value for the measured
parameter(s). The exchange of test samples is a technique often used for producer-con-
sumer situations to resolve any testing disagreements. Bias may be evaluated by the use of
reference materials; these may be ad hoc reference materials or RMs, developed by a
standardization committee or other recognized group, that have an accepted reference
value; or they may be more formally developed standard reference materials or SRMs
from various national standardization laboratories or bodies such as the National
Institute of Standards and Technology (NIST) in the USA.
By comparing the values obtained in any laboratory to the standard reference value
the magnitude of any bias is clearly indicated. Biases may be dependent on a number of
factors concerning the testing operation; calibration procedures, operator technique, and
ambient laboratory conditions are some typical sources. See Annex C for more details on
bias evaluation. Interlaboratory testing is used to evaluate repeatability or within-labora-
tory precision and reproducibility or between-laboratory precision. External audits by any
number of accredited organizations are important and have been given increasing atten-
tion in the past decade as the interest in such standards as the ISO 9000 series listed in the
bibliography has risen.
time during the testing operation, and the results are examined. If problems appear when
specified analysis protocols are employed, the problem is resolved and a second quality
control process is established. This procedure is repeated until no further improvement can
be made with the technical evaluation procedures available. When this state is reached full
statistical control is achieved within the scope of the testing technology.
All operating systems have one common characteristic: the output is inherently vari-
able. When the variability present in any operating system is confined to indeterminate
random variation, the system is in a state of statistical control. This type of variation is also
called common cause variation. The magnitude of this random variation is a function of
the complexity of the testing system and the technology available to detect and eliminate
this unwanted variation. Statistical control is that system state after all sources of deter-
minate or assignable variation have been eliminated by the tools available to the experi-
menter. Assignable variation is variation that can be traced to a specific cause such as poor
calibration procedures, poorly trained operators, faulty machine settings, and similar
problems. This type of variation is also called specific cause variation. The discovery and
elimination of assignable variation is dependent on the skill and expertise of an experi-
menter and on the level of the technology available for searching out potential causes of
specific variation.
The basic approach to both assessing and controlling variation of either sort is the
control chart technique, as originally developed by Shewhart [8], which can be used (1) to
assess the level of achievable quality in a series of repetitive application steps to discover
and eliminate assignable causes of variation, and (2) to control the level of quality once a
certain level has been established. Control charts can provide a clear indication of the
repetitive nature of a measured parameter in the sense of evaluating the long-term varia-
tion and the short-term variation. The use of control charts is based on the assumption
that once all sources of the most easily recognized assignable variation have been elimi-
nated, the residual level of variation is represented by a normal distribution. It must be
recognized that any level of residual variation may contain some components that may at
some future time be identified as assignable variation. Quality assessment and control can
be based on one of two types of data: (1) attribute data, which are frequently defined as
count data, the number of defective items or units in a sample of specified size, or (2)
variable or measurement data, which are expressed on a continuous scale. Although the
basic ideas of quality assessment and control are the same for both types of data, the
specific details of calculation are somewhat different. Since attribute data are typical of
industrial production processes, the procedures for their use are not described here.
Control Charts
There are two basic types of control charts. One is a chart that illustrates the long-term
variation of the process or system; it consists of a mean value (of a set of n measurements)
designated as x̄n, for some measured parameter, plotted sequentially (hourly, daily,
weekly). This is called an x̄n chart (x-bar chart). A second type of chart is one that
indicates the short-term variation in the set of n individual measurements for the average
or mean x̄n. This usually takes the form of a range R of two or more measured parameter
values obtained in a specified time span (either side by side or within a brief specified time),
also plotted sequentially as above; this is called an R chart. Both of these charts have
certain characteristics or limits that are developed to aid in the interpretation of the x̄n and
R values as they are recorded in time.
Both types of charts have a central line, which can be defined as the mean quality level.
In the x̄n chart this equates to the mean value of the measured parameter set over some
stable time period. In the R chart this central line is the average range, also over some
stable established time period. When sequential values of either x̄n or R lie close to the
central line, there is some degree of confidence that the system is in statistical control.
More objective decisions on whether statistical control is achieved are made on the basis of
limits on x̄n and R.
The x̄n chart. This has values and limits defined according to the following:
Central line = mean of the x̄n values (the grand mean)
The lower and upper control limits, LCL and UCL, lie below and above the central line by
an amount A2(R̄), where R̄ is the average range of the sets; for the R chart the central line
is R̄ itself, with control limits D3(R̄) and D4(R̄). The constants A2, D3, and D4 depend on
the number of measurements n in each set:
Number in set   A2     D3     D4
2               1.88   0      3.27
3               1.02   0      2.58
4               0.73   0      2.28
5               0.58   0      2.12
6               0.48   0      2.00
7               0.42   0.08   1.92
8               0.37   0.14   1.86
The lower warning limit LWL and the upper warning limit UWL are defined as two-thirds of the control limits and
are calculated using the LCL and the UCL by
LWL = 0.667(LCL) and UWL = 0.667(UCL) (97)
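The sketch below (Python with NumPy; the data and the function name are illustrative, not from the text) sets up x̄n and R chart parameters from subgrouped data, using the tabulated constants and the warning-limit rule of Eq. 97:

import numpy as np

# Constants A2, D3, D4 from the table above, keyed by subgroup size n.
CONSTANTS = {2: (1.88, 0.00, 3.27), 3: (1.02, 0.00, 2.58),
             4: (0.73, 0.00, 2.28), 5: (0.58, 0.00, 2.12),
             6: (0.48, 0.00, 2.00), 7: (0.42, 0.08, 1.92),
             8: (0.37, 0.14, 1.86)}

def chart_parameters(subgroups):
    """subgroups: one row per time point, n measurements per row."""
    data = np.asarray(subgroups, dtype=float)
    n = data.shape[1]
    a2, d3, d4 = CONSTANTS[n]
    xbar = data.mean(axis=1)                    # points for the x-bar chart
    rng = data.max(axis=1) - data.min(axis=1)   # points for the R chart
    x_center, r_center = xbar.mean(), rng.mean()
    ucl_x = x_center + a2 * r_center            # control limits, x-bar chart
    lcl_x = x_center - a2 * r_center
    uwl_x = x_center + 0.667 * a2 * r_center    # warning limits at two-thirds
    lwl_x = x_center - 0.667 * a2 * r_center    # of the control-limit distance
    ucl_r, lcl_r = d4 * r_center, d3 * r_center  # control limits, R chart
    return dict(x_center=x_center, lcl_x=lcl_x, ucl_x=ucl_x,
                lwl_x=lwl_x, uwl_x=uwl_x,
                r_center=r_center, lcl_r=lcl_r, ucl_r=ucl_r)

# Illustrative use: ten daily sets of four repeat measurements.
gen = np.random.default_rng(0)
print(chart_parameters(gen.normal(100.0, 2.0, size=(10, 4))))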
The quality level of production processes can be improved by using factorial design
experimentation. Although such experimental designs have been used for various research
and development programs for several decades, their application to industrial production
processes was pioneered by G. Taguchi [11,12]. His work, which is in essence based on
fractional factorial experimentation, has shown that these techniques can frequently be
used (1) to select one production variable to minimize variation (by careful control) and
select another variable to hold response on target, and (2) to create products that are less
sensitive to the environmental conditions of the process. Cause-and-effect diagrams
obtained as the output of brainstorming sessions with experienced personnel can often
be used to good advantage to detect possible causes for poor quality. This type of infor-
mation can lead to improvements in the SOPs for such testing.
Quality improvement can also be attained by a number of concerted efforts such as
"ruggedness testing" procedures that evaluate the influence of operational factors, as
outlined in Section 4. The use of comprehensive quality manuals and periodic audits
and reviews is also helpful. When computers are used for data acquisition and processing,
the verification and integrity of both the hardware and the software are essential to
quality. All of these options for improving industrial production quality can be used
with some minor modifications for improving the quality of laboratory testing and mea-
surement programs.
zation into technology. In 1884 the American Society of Mechanical Engineers (as well as
Civil and Mining Engineers) began conducting interlaboratory testing, and the results
were a mass of conflicting data with very poor agreement. In Europe at about the same
time similar activity was taking place. In Germany a voluntary standardization organiza-
tion was created as the result of an international conference held in Munich in 1884.
As international trade has increased over the past several decades, those standardiza-
tion organizations that develop test method and specification standards have made policy
decisions that all test methods shall have, as part of the standard, a section on the typical
precision that can be expected. The American Society for Testing and Materials, ASTM,
took such action in 1976. Other national standardization organizations, such as the British
Standards Institution, BSI, and the Deutsches Institut für Normung, DIN, in Germany, have
also embraced the concept of test method precision. The International Organization for
Standardization, ISO, has adopted similar policies, and more than 30 ISO committees are
engaged in this work. To facilitate the work on evaluating precision for standard test
methods, these organizations have developed guideline standards on how such evaluations
shall be conducted, how the data are to be analyzed, and how the results are to be
expressed. See the bibliography for ASTM and ISO standards on precision. Annex D
gives the necessary background for evaluating precision; topics include organizing an
interlaboratory test program (ITP), a review of the terminology used, the assumptions
underlying the analysis, and the calculation algorithms for repeatability r and reproduci-
bility R. Although there may be some small differences in nomenclature when comparing
the ASTM and ISO precision standards, the calculation algorithms for both are identical.
Numerous precision evaluation programs have been conducted using these and similar
guideline standards for the past four or more decades. In almost all fields of technical
activity these precision evaluation programs have shown that many of the current and
well-developed test methods show very poor interlaboratory or reproducibility precision.
Bias
The major reason for the customary poor reproducibility for many test methods is the
existence of a nonnormal or biased between-laboratory data distribution. Bias exists
because each laboratory has its own testing culture, a unique environment and way of
conducting any test that is dependent on the operational conditions in the laboratory. This
occurs despite the use of standardized testing methods. This biased output causes a
laboratory to be almost always low or high compared to some reference value and to
other laboratories. Annex C gives the needed steps to evaluate bias, provided that one or
more reference materials or standards are available that have recognized (true) values.
One of the early pioneers in the analysis of ITP data, W. J. Youden, demonstrated
more than thirty years ago the dominant influence of interlaboratory bias (or systematic
error as he called it) in a series of publications [4,5, 15]. He showed with simple graphical
plotting techniques the unmistakable existence of bias. The existence of an essentially
constant bias for any laboratory invalidates the customary assumption in ITP analysis
that a random normal distribution adequately represents the between-laboratory varia-
tion. See Annex D.
Veith [13] reviewed the current state of precision testing using some ASTM test
methods in the rubber manufacturing industry in 1987. Mooney viscosity (ISO 289,
ASTM D1646), a widely used test for quality assessment of raw rubbers, gave reasonably
good relative precision, Type 1 (r) pooled values of 3.0 percent for several clear rubbers,
and good pooled (R) values of 3.8 percent on the same basis. See Annex D for the
definition of Type 1 and 2 precision and (r) and (R). For a widely used rate-of-cure
test, the oscillating disc curemeter (ISO 3417, ASTM D2084), the precision was substan-
tially worse with Type 1 (R) values, which depend on the material being tested, in the
range 20 to 81 percent. He also demonstrated that interlaboratory bias was responsible for
the poor agreement.
Brown [14] reviewed the results of interlaboratory precision evaluation programs in
ISO TC45 (Rubber Product Testing) in 1989. He found reasonable precision for hardness
tests (ISO 48, ASTM D2240) with Type 1 (r) values in the range 3.0 to 6.0 percent, and
similar (R) values in the range 6.0 to 11.0 percent. Tensile or stress-strain testing (ISO 37,
ASTM D412), a test with widespread usage, gave Type 1 (R) values in the range of 8.0 to
32 percent. Many other common tests, such as compression set and temperature rise in
flexometer testing, showed very poor precision. For compression set, Type 1 (R) values
were in the range 26 to 32 percent. For temperature rise, Type 1 (R) values were in the
range 80 to 97 percent. Brown called into question the wisdom of conducting some of the
tests at all, considering the wide variation in interlaboratory results.
Uncertainty
The generic concept of uncertainty has been used throughout this chapter because it is a word
that effectively conveys the sense of ambiguity about a measured result. The alternative
concept of specific uncertainty may be defined as "the estimate attached to a test result that
defines or characterizes the range of values within which the true value of the measured
property is asserted to lie." This definition is similar in principle to that for a confidence
interval as discussed in Section 3.6, but it lacks the instructions on how to calculate the
"range of values." The establishment of procedures to calculate this type of specific
uncertainty is currently under development by standardization organizations, and the
provisional standard ISO/CD 12102, which contains the above definition, is currently
under review. The major problem in this effort is the development of a comprehensive
method to express this range so as to encompass any selected testing domain with certain
specific factors that influence the range. The remainder of this section is devoted to specific
uncertainty, and for brevity the word "specific" is dropped.
There are a number of components to this uncertainty, each associated with particular
testing or other operational factors that influence the uncertainty range. These components
were previously addressed in Annex B on the statistical model for testing operations. As
discussed in the annex, there are two categories that influence the deviations that perturb
any measured value for an object or material: production variation and measurement
variation. Within each of these categories there are two additional types of components
that cause deviations about any measured value: random and bias. Current approaches to
uncertainty concentrate on the measurement variation and ignore the production variation
by implicitly assuming that this variation is or can be made to be negligible.
Uncertainty is frequently based on the concept of error populations, where for any of
these populations "error" implies a deviation δi from some true or reference value μr that
would be obtained for the measurement yi in the absence of any type of perturbation by a
physical cause. However, the word error can be defined on a much broader basis, and
deviation will be used rather than error.
δi = yi - μr (98)
There may be any number of causes (1, 2, ... , i) that contribute a deviation component.
This situation may be represented by Eq. B1 in Annex B, given here as Eq. 99.
yi = μ0 + μi + Σ(b) + Σ(e) + Σ(B) + Σ(E) (99)
Quality Assurance of Physical Testing Measurements 71
The reference or true value for any material or object class for a particular measurement
process is the sum of the first two terms of Eq. 99; thus
μr = μ0 + μi (100)
A rearrangement of Eq. 99 using μr = μ0 + μi shows that δi for any given measurement is
δi = Σ(b) + Σ(e) + Σ(B) + Σ(E) (101)
The terms Σ(b) + Σ(e) contribute bias and random deviation components attributable to
inherent or production process variation in the material (or object class), and the terms
Σ(B) + Σ(E) contribute bias and random components attributable to the operational
conditions of the measurement. Most discussions of uncertainty assume that the magnitude
of Σ(b) + Σ(e) is negligible compared to Σ(B) + Σ(E). However, for any realistic
appraisal of uncertainty this assumption may not be tenable. For the discussion to follow,
a negligible magnitude for Σ(b) + Σ(e) will be assumed for the sake of simplicity.
Uncertainty evaluation is concerned with calculating a ± range about any yi value that
has a high probability of including the reference or true value μr within the range. Just as
in the case of a confidence interval, this range is obtained on the basis of the standard
deviation of δi values for the defined testing domain. This standard deviation for δi is a
composite standard deviation obtained as a special sum of the variances of all individual
components contained in the four types of terms on the right-hand side of Eq. 101. Thus,
omitting the terms Σ(b) + Σ(e), a composite variance for δi may be defined as
var(δi) = Σ var(B) + Σ var(E) (102)
and expressed as a standard deviation, sd(δi)
sd(δi) = [var(δi)]^(1/2) (103)
The bias components may be divided into two categories: (1) global or inherent bias,
unique to the test and common to all locations and machines, and (2) local bias, unique
to a particular location and/or machine. The set of particular terms as given in Annex B,
Eq. B2, better illustrates bias components such as BL, a unique laboratory component; BM,
a machine component; and BOp, an operator component; as well as random components EM,
potential random differences among machines, and EOp, potential random differences
among operators. To these local components an additional potential global
bias, BGbl, must be added. Uncertainty u may be given in terms of sd(δi) as
u = k[sd(δi)] (104)
where k is a multiplying factor, and the measured value yi with its uncertainty may be
expressed as
yi ± u = yi ± k[sd(δi)] (105)
The key issue in calculating the uncertainty u is evaluating sd(δi) and adopting a value for
k. If all the components of sd(δi) are known on the basis of 18 to 20 df for each component,
then a value of 2 may be used for k. The operation of ensuring that all important
components of sd(δi) are fully evaluated is complex and beyond the scope of this chapter.
Guidance standards for evaluating sd(δi) are currently under development, and reference
should be made to this current activity; see ISO/CD 12102 in the bibliography.
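A minimal sketch of the bookkeeping in Eqs. 102 to 105 (Python with NumPy; all component values and the measured value are invented for illustration):

import numpy as np

# Illustrative variance components of the measurement operation:
bias_vars   = [0.010, 0.004]           # var(B) terms: e.g., laboratory, machine bias
random_vars = [0.020, 0.006, 0.003]    # var(E) terms: e.g., machine, operator, residual

var_delta = sum(bias_vars) + sum(random_vars)   # Eq. 102
sd_delta = np.sqrt(var_delta)                   # Eq. 103
k = 2.0                       # Eq. 104 coverage factor, assuming adequate df per component
u = k * sd_delta

y = 12.46                     # an illustrative measured value
print(f"{y} +/- {u:.3f}")     # Eq. 105: the value with its uncertainty range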
Improving Reproducibility Precision
Poor reproducibility precision has been one of the major reasons for the establishment
of laboratory accreditation systems and the organizations that administer such systems as