1 VALIDITY, RELIABILITY AND USABILITY
2 Essential assessment characteristics
Validity
Reliability
Usability
3 Validity and reliability
Validity
adequacy and appropriateness of the interpretations and uses of assessment
results
E.g.
If the results are to be used as a measure of students’ reading skills
our interpretations are to be based on evidence that the scores actually reflect
reading skills
not impacted by irrelevant factors, such as vocabulary or linguistic complexity
4 Validity and reliability
Reliability
the consistency of assessment results
E.g.
we get similar scores when the same assessment procedure is used with the same
students on two different occasions
a high degree of reliability from one occasion to another
We get similar scores when different teachers independently rate student
performances on the same assessment task
a high degree of reliability from one rater to another
5 Validity and reliability
Reliability
we are concerned with consistency of the results
rather than with appropriateness of the interpretations made from the results
(which is validity).
Reliability (consistency) of measurement is needed to obtain valid results, but we can
have reliability without validity
6 Usability
Refers to the practicality of the procedure
Not about whether the other qualities (validity and reliability) are present
Assessment procedure should
Be economical in terms of time and money
Be easily administered
Be easily scored
Produce results that can be accurately interpreted
7 Nature of validity
Validity
The appropriateness of the interpretation and use of the results
A matter of degree
it does not exist on an all-or-none basis (high validity, low validity)
Specific to some particular use or interpretation for a specific population of test
takers
No assessment is valid for all purposes
When indicating computational skill
a mathematics test may have a high degree of validity for 3rd and 4th graders but a low degree of validity for 2nd and 5th graders
A reading test
may have high validity for skimming and scanning and low validity for
inferencing
Necessary to consider the specific interpretation or use to be made of the results
8 Major considerations in assessment validation
Content
The assessment content and specifications from which it was derived
Construct
The nature of the characteristics being measured
Assessment-criterion relationships
The relation of the assessment results to other measures
Consequences
The consequences of the uses and interpretations of the results
9 Content
How an individual performs on a domain of tasks that the assessment is supposed
to represent
E.g. knowledge of 200 words
we select 20 words and generalize it to the knowledge of 200
the extent to which our 20-word test constituted a representative sample of the
200 words
the goal in the consideration of content validation
to determine if a set of assessment tasks
provides a relevant and representative sample of the domain of tasks
10 Content
The definition of the domain to be assessed
derive from the identification of goals and objectives
The assessment begins with a content area that reflects the goals and objectives
Steps
Specifying the domain of instructionally relevant tasks
Specifying the emphasis according to the priority of goals and objectives
Constructing or selecting a representative set of assessment tasks
From what has been taught
to what is to be measured
to what should be emphasized in the assessment
to a representative sample of relevant tasks
11 Content
Assessment development to enhance validity
Table of specifications
Subject-matter content (topics to be learned)
Instructional objectives (types of performance)
12 Content
Assessment development to enhance validity
The percentages in the table
indicate the relative degree of emphasis that each content area and each instructional objective is to be given in the test
13 Content
Table of specifications
The specifications should be in harmony with what was taught
The weights assigned in the table reflect the emphasis that was given during
instruction
The more closely the questions match the specified sample
the more valid the measure of student learning will be
It can be used in selecting tests that publishers prepare
How well do they match with our table of specifications?
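A hypothetical table of specifications is sketched below; the content areas, objectives, and percentages are purely illustrative and not taken from any particular course.

Content area    Knows basic terms   Applies principles   Interprets data   Total
Fractions              10%                 10%                  5%           25%
Decimals                10%                 15%                  5%           30%
Percentages             10%                 20%                 15%           45%
Total                   30%                 45%                 25%          100%

Each cell gives the proportion of assessment tasks devoted to that combination of content area and objective, so the row and column totals show the overall emphasis placed on each topic and each type of performance.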
14 Construct
Is the test actually measuring the construct it claims to measure?
A construct is an individual characteristic or an abstract theoretical concept
assumed to exist to explain some aspect of behavior
Reading comprehension, inferencing, speaking proficiency, intelligence,
creativity, anxiety, mathematical reasoning, etc.
These are called constructs because they are theoretical constructions that are used
to explain performance on an assessment
15 Construct
Construct validation
the process of determining if the performance on an assessment can be
interpreted in terms of a construct(s)
Two questions are important in construct validation
Does the assessment adequately represent the intended construct? (construct
underrepresentation)
Problem-solving task turning into a memorization task
Is performance influenced by factors that are irrelevant to the construct?
(construct-irrelevant variance)
A mathematics test influenced by reading demands
16 Methods used in construct validation
Defining the domain (area) of tasks to be measured (also done in content validation)
Analyzing the response process required by the assessment tasks
Thinking aloud or interviewing (to check on mental process)
Comparing the scores of known groups
A prediction of differences for a particular test or assessment can be checked
against groups that are known to differ, and the results used as partial support
for construct validation (e.g. mathematics majors vs. English majors)
The test should be able to distinguish them
Comparing scores before and after a particular learning experience or experimental
treatment
Scores increase with instruction?
Comparing scores with other similar measures (also an assessment-criterion
consideration)
E.g. high correlation between like tests and lower correlation between unlike tests
17 Assessment-criterion considerations
When test scores are to be used
to predict future performance
to estimate current performance on some valued measure other than the test
itself (called a criterion)
Concerned with evaluating the relationship between the test and the criterion
18 Assessment-criterion considerations
For example, can ALES scores indicate success in master's program exams?
The degree of relationship can be described by statistically correlating the two sets of scores
The resulting correlation coefficient provides a numerical summary of the degree
of relationship between the two sets of scores
Scatter plots and expectancy tables can also be used.
19 Example (worked in Excel)
20 Interpretation of the correlation coefficient (table of value ranges)
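A minimal sketch of how such a coefficient might be computed, using made-up scores for illustration (any spreadsheet or statistics package would give the same result):

```python
# Illustrative only: invented scores for 8 students.
# x = test scores (e.g. an entrance exam), y = criterion scores (e.g. later exam average).
x = [55, 60, 62, 70, 75, 80, 85, 90]
y = [2.1, 2.4, 2.3, 2.8, 3.0, 3.2, 3.5, 3.7]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Pearson correlation coefficient: covariance of the two score sets
# divided by the product of their standard deviations.
cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
var_x = sum((xi - mean_x) ** 2 for xi in x)
var_y = sum((yi - mean_y) ** 2 for yi in y)
r = cov / (var_x ** 0.5 * var_y ** 0.5)

print(f"r = {r:.2f}")  # a value near +1 indicates a strong positive relationship
```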
21 Consideration of consequences
Assessments are intended to contribute to improved learning, but do they?
What impact do assessments have on teaching?
What are the possibly negative, unintended consequences of a particular use of
assessment results?
High importance attached to test results can lead teachers to focus narrowly on
what is on the test while ignoring important parts of the curriculum not covered by
the test
E.g. changing the construct taught from problem solving to memorization because of a high-stakes test
An example: college professors prepare for the YDS for several years, eventually pass the exam, but still cannot speak English
22 Factors influencing validity
Factors in the test or assessment itself
Unclear directions
Difficult language
Ambiguity
Inadequate time limit (construct-irrelevant variance)
Overemphasis of easy-to-assess aspects and disregard of difficult-to-assess aspects (construct underrepresentation)
Poorly constructed test items (e.g. providing clues)
Test too short (i.e. may not be representative)
Improper arrangement of test items (e.g. most difficult items first)
Identifiable pattern of answers (T, F, T, F, T, F, T, F)
23 Factors influencing validity
Factors in administration and scoring
Insufficient time
Unfair aid to students
Cheating
Unreliable scoring
Failing to follow directions
Adverse physical and psychological conditions
Factors in student responses (like motivation, fear, anxiety)
24 Reliability
The consistency of measurement
how consistent test scores or results are from one assessment to another
The more consistent the assessment results are from one measurement to another
the fewer errors there will be
Consequently, the greater the reliability
25 Reliability
An estimate of reliability refers to a particular type of consistency
Different periods of time
Different samples of tasks
Different raters
Low reliability means low validity
But high reliability does not mean high validity
26 Determining reliability with correlation methods
Consistency
over a period of time
over different forms of assessment
within the assessment itself
across different raters
27 Test-retest method
The same assessment
administered twice to the same group of students
with a given time interval between the two (a measure of stability)
The interval should be neither too long nor too short for the purpose
The longer the interval between the first and second assessments
the more the results are influenced by changes in the student characteristic being measured
and the smaller the reliability coefficient will be
28 Test-retest method
Stability is important when results are used for several years
like English test scores, but not as important for a unit test
The test-retest method is not very relevant for teacher-constructed classroom tests
Not desirable to readminister the same assessment
In choosing standardized tests, stability is an important criterion
29 Equivalent(parallel)-forms method
Uses two different but equivalent forms of an assessment
Two different tests are prepared based on the same set of specifications
Administered to the same group of students in a short period of time
The resulting assessment scores are correlated
It does not tell anything about long-term stability
30 Split-half method
The assessment is administered to a group of students in the usual manner and
then is divided in half for scoring purposes
E.g. to score the even-numbered and the odd-numbered tasks separately
This produces two scores for each student
When correlated, provides a measure of internal consistency
To estimate the reliability of scores based on the full-length assessment, the Spearman-Brown formula is applied
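A minimal sketch of the Spearman-Brown correction for a test of doubled length, with an assumed half-test correlation:

```python
def spearman_brown_full_test(r_half: float) -> float:
    """Estimate full-length reliability from the correlation between the two halves."""
    return (2 * r_half) / (1 + r_half)

# Assumed example: the odd- and even-numbered halves correlate at .60,
# so the estimated reliability of the full-length test is about .75.
print(round(spearman_brown_full_test(0.60), 2))  # 0.75
```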
31 Interrater consistency
When student work is judgmentally scored
whether the same scores are assigned by another judge
Consistency can be evaluated with correlation
the scores assigned by one judge with those assigned by another judge
To achieve acceptable levels of interrater consistency
Agreed-on scoring rubrics
Training of raters to use those rubrics with examples of student work
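One simple index of interrater consistency, the rate of exact agreement, is sketched below with invented rubric scores; correlating the two raters' scores, as in the other methods, is equally common:

```python
# Invented rubric scores (1-4 scale) given by two raters to the same ten essays.
rater_a = [3, 2, 4, 3, 1, 2, 4, 3, 2, 3]
rater_b = [3, 2, 3, 3, 1, 2, 4, 4, 2, 3]

# Proportion of essays on which the two raters agree exactly.
exact_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"exact agreement = {exact_agreement:.0%}")  # 80% for these invented scores
```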
32 Writing rubric (examples on the original slides)
40 Examples
41 Reliability methods
42 Standard error of measurement
The amount of variation in the scores is directly related to the reliability of the assessment procedure
Low reliability is indicated by large variation in a student's assessment results
High reliability by little variation from one assessment to another
To estimate the amount of variation to be expected in the scores
Standard error of measurement
The standard error of measurement is the standard deviation of the errors of
measurement
When the standard error of measurement is small, the confidence band is narrow
(indicating high reliability)
Greater confidence that the obtained score is near the true score
A teacher who is aware of the standard error of measurement realizes that it is
impossible to be dogmatic in interpreting minor differences in assessment scores
43 Standard error of measurement
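A common estimate, sketched here with assumed numbers, is SEM = SD × √(1 − reliability); the obtained score ± 1 SEM then gives a rough confidence band around the true score.

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = standard deviation of scores * sqrt(1 - reliability coefficient)."""
    return sd * math.sqrt(1 - reliability)

# Assumed example: scores with a standard deviation of 10 and a reliability of .91.
sem = standard_error_of_measurement(sd=10, reliability=0.91)
print(round(sem, 1))  # 3.0 -> an obtained score of 75 suggests a band of roughly 72-78
```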
44 Factors influencing reliability measures
Number of assessment tasks
The larger the number of assessment tasks (e.g. questions) on an assessment, the
higher its reliability will be
Spread of scores
The larger the spread of scores, the higher the estimate of reliability
With a larger spread, individuals tend to stay in the same relative position in the group from one assessment to another
Objectivity
Degree to which equally competent scorers obtain the same results
Objectivity can be increased by careful phrasing of the questions and by a
standard set of rules for scoring
45 Usability
Ease of administration
Easy directions? Complicated directions? Requires expertise to implement?
Time required for administration
Allot as much time as is needed to obtain valid and reliable scores, but no more
Ease of interpretation and application
If results are misinterpreted, they are of no use and may even be harmful to some individuals or groups
Availability of equivalent forms or comparable forms
Can also be useful in measuring development
Cost of testing
To save money, one should not prefer tests with lower validity and reliability
estimates