THE NITTY-GRITTY OF LANGUAGE TESTING:
IMPLICATIONS FOR TEST CONSTRUCTORS
MS. MARIA ZAHEER
PRINCE SULTAN UNIVERSITY
KSAALT Mini Conference 2015
Objectives
• What are the principles of language testing?
• How can we define them?
• What factors can influence them?
• How can we measure them?
What is a test?
‘A method of measuring a person’s ability, knowledge, or performance in a given domain’ (Brown, 2004:3).
Key terms: method, measure, ability, performance/competence.
[Diagram: tests as part of assessment, assessment as part of teaching (adapted from Brown, 2004:5)]
• Assessment is one component of teaching.
• Assessment helps teachers to gain information about every aspect of their students, especially their achievement.
• Tests play a crucial role in assessment.
A good test is constructed by considering the principles of language testing:
• Validity
• Reliability
• Practicality
• Authenticity
• Washback
Validity is the extent to which a test measures what it is supposed to measure (Hughes, 2003:26).
Five types of validity:
• Construct Validity
• Content Validity
• Consequential Validity
• Criterion Validity
• Face Validity
Content Validity
• What is meant to be measured has to be crystal clear.
• There should be a clear correlation between the contents of the test and the language skills and structures it covers.
• The test items should really represent the course objectives.
Source: http://www.slideshare.net/Samcruz5/validity-reliability-practicality?next_slideshow=1
Criterion Validity
• The relationship between the test score and the outcome.
• The test score should really represent the criterion that the test is intended to measure.
Criterion validity can be established in two ways.
Concurrent Validity
A test is said to have concurrent validity if its result is supported by other concurrent performance beyond the assessment itself (Brown, 2004:24).
Predictive Validity
Predictive validity concerns how well a test predicts a student’s possible future success (Alderson et al., 1995:180-183).
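As an added illustration (a sketch, not part of the original slides): both forms of criterion validity are commonly quantified as the correlation between test scores and scores on the criterion measure (for example, later course grades for predictive validity). Writing $x_i$ for the test scores and $y_i$ for the criterion scores of $n$ students, the validity coefficient is the Pearson correlation

$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the mean scores. A coefficient close to 1 suggests that the test scores closely track the criterion; a coefficient near 0 suggests they do not.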
Construct Validity
Construct validity refers to the concepts or theories underlying the use of a certain ability, including language ability.
Construct validity shows that the result of the test really represents the construct, i.e. the students’ ability that is being measured (Djiwandono, 1996:96).
Consequential Validity
Consequential validity refers to the social consequences of using a particular test for a particular purpose.
The use of a test is said to have consequential validity to the extent that society benefits from that use of the test.
Face Validity
A test is said to have face validity if it looks to other testers, teachers, moderators, and students as if it measures what it is supposed to measure (Heaton, 1990:159).
A test can be judged to have face validity simply by looking at its items.
Face validity can affect how students perform on the test (Brown, 2004:27; Heaton, 1988:160).
To address this, the test constructor has to consider the following:
• Students will be more confident if they face a well-constructed, expected format with familiar tasks.
• Students will be less anxious if the test is clearly doable within the allotted time limit.
• Students will be optimistic if the items are clear and uncomplicated.
• Students will find it easier to do the test if the directions are very clear.
• Students will be less worried if the tasks are related to their course work (content validity).
• Students will be at ease if the difficulty level presents a reasonable challenge.
Reliability
Reliability refers to the consistency of the scores obtained (Gronlund, 1977:138).
Reliability does not really deal with the test itself; it deals with the results of the test.
The test results should be consistent.
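As an added sketch (not on the original slides): one common way to quantify this consistency is test-retest reliability, the correlation between scores from two administrations of the same test to the same students. Writing $x_1$ and $x_2$ for the two sets of scores, the reliability coefficient is

$$r_{x_1 x_2} = \frac{\operatorname{Cov}(x_1, x_2)}{\sigma_{x_1}\,\sigma_{x_2}}$$

Values close to 1 indicate highly consistent results; values near 0 indicate that the scores cannot be depended on.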
Reliability falls into four kinds (Brown, 2004:21-22):
• Student-Related Reliability
• Test Administration Reliability
• Test Reliability
• Rater Reliability
Test Administration Reliability
This refers to the conditions and the situation in which the test is administered.
Student-Related Reliability
This kind of reliability concerns temporary illness, fatigue, a bad day, anxiety, and other physical or psychological factors affecting the students.
Thus, the score a student obtains may not be his/her actual score.
Test Reliability
The test should fit into the time constraints.
The items of the test should be crystal clear so that they do not end in ambiguity.
Rater Reliability
This kind of reliability falls into two categories:
1. Inter-rater reliability
Problems occur when two or more scorers yield inconsistent scores for the same test, possibly because of a lack of attention to scoring criteria, inexperience, inattention, or even bias.
2. Intra-rater reliability
Problems here are a common occurrence for classroom teachers, arising from unclear scoring criteria, fatigue, bias toward particular “good” or “bad” students, or simple carelessness.
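As an added illustration (a sketch, not from the original slides): when two raters assign categorical scores (e.g., pass/fail or band levels), inter-rater agreement is often summarized with Cohen’s kappa,

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is the observed proportion of agreement between the raters and $p_e$ is the agreement expected by chance; $\kappa = 1$ means perfect agreement, while values near 0 mean agreement is no better than chance. For continuous marks, a simple correlation between the two raters’ scores can serve the same purpose.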
Practicality
Practicality is the relationship between the resources available for the test (human resources, material resources, time, etc.) and the resources required in the design, development, and use of the test (Bachman & Palmer, 1996:35-36).
Brown (2004:19) defines practicality in terms of:
1) Cost - The test should not be too expensive to conduct.
2) Time - The test should stay within appropriate time constraints.
3) Administration - The test should not be too complicated or complex to conduct.
4) Scoring/Evaluation - The scoring/evaluation process should fit into the time allocated.
Authenticity
Authenticity is the degree of correspondence of the characteristics of a given language test task to the features of a target language task (Brown, 2004:28).
Brown (2004:28) also proposes considerations that might be helpful for presenting authenticity in a test:
• The language in the test is as natural as possible.
• Items are contextualized rather than isolated.
• Topics are meaningful (relevant, interesting) to the learners.
• Some thematic organization of items is provided, such as through a story or episode.
• Tasks represent, or closely approximate, real-world tasks.
Washback/Backwash
The term washback is commonly used in applied linguistics, yet it is rarely found in dictionaries. The word backwash, however, can be found in certain dictionaries, where it is defined as “an effect that is not the direct result of something” (Cambridge Advanced Learner’s Dictionary).
In dealing with the principles of language assessment, the two terms are used interchangeably: washback (Brown, 2004) or backwash (Heaton, 1990).
Washback is the influence of testing on teaching and learning. The influence itself can be positive or negative (Cheng et al. (Eds.), 2008:7-11).
Positive Washback
Teachers and students have a positive attitude toward the examination or test, and work willingly and collaboratively towards its objective (Cheng & Curtis, 2008).
Negative Washback
Negative washback does not give any beneficial influence on teaching and learning (Cheng & Curtis, 2008:9).
The quality of washback might be independent of the quality of the test (Fulcher & Davidson, 2007:225).
Teachers, as test constructors, need to consider the probable washback of the tests they construct and its future impact on teaching and learning.
Teaching and learning will be affected in many different ways depending upon the variables at play in specific contexts.
What these variables are, how they are to be weighted, and whether we can discover patterns of interaction that hold steady across contexts is a matter for ongoing research (Fulcher & Davidson, 2007:229).
Conclusion
• A test is good if it has practicality, good validity, high reliability, authenticity, and positive washback.
• These five principles provide guidelines for both constructing and evaluating tests.
• Teachers should apply these five principles when constructing or evaluating the tests used in their assessment activities.

Editor's Notes

• On validity: Brown (2004:22-27) proposes five ways to establish validity.
• On criterion validity: For example, the validity of a high score on the final examination of a foreign language course would be verified against the student’s actual proficiency in the language. Tests such as TOEFL® or IELTS are intended to indicate how well somebody will be able to use his/her English in the future.
• On face validity: In a speaking test, for instance, face validity can be shown by making speaking activities the main activities in the test. The test should focus on the students’ speaking, not anything else.
• On reliability: If the test is administered to the same students on different occasions (with no language practice work taking place between these occasions), then it should produce (almost) the same results.
• On test administration and rater reliability: To increase test administration reliability, teachers as administrators should consider everything related to how the test is administered. For instance, for a listening test we should provide a room with a comfortable listening environment: noise from outside should not enter the room, the audio system should be clear to all students, and even the lighting and the condition of the desks and chairs should be considered. Rater reliability deals with the scoring process; factors that can affect it include human error, subjectivity, and bias in scoring.
• On authenticity: Authenticity deals with the “real world”. Teachers should construct tests whose items are likely to be used or applied in the real contexts of daily life.
• On washback: Positive washback has a beneficial influence on teaching and learning, while tests with negative washback are considered to have a negative influence on teaching and learning.