PSYC 303: MEASUREMENT
AND EVALUATION IN
PSYCHOLOGY
LECTURE 3: Of Tests and Testing: Basics and Assumptions
MEYMUNE N. TOPÇU, PhD
Recap from Lecture 2
Measures of Skewness - Kurtosis
Variability
Standard Scores
Measures of central tendency
Correlation
Levels of Measurement
Meta-Analysis
Challenging Concepts from Pre-Quiz 2.2
Lecture Plan
Assumptions about psychological testing and assessment
What is a good test?
Reliability
Validity
Norms
Sampling to develop norms
Types of norms
Fixed reference groups scoring systems
Norm- vs. criterion-referenced Evaluation
Culture and Inference
What is a “good test”?
Assumptions about psychological testing and assessment
Assumption 1: Psychological traits and states exist
Components of stability and change in our behavior
Trait: A long-term characteristic of an individual that shows through their behavior, actions, and feelings
Based on a sample of behavior
Intelligence, cognitive style, interests, personality
State: A temporary condition that an individual is experiencing for a short period of time
Any examples?
https://oxford-review.com/oxford-review-encyclopaedia-terms/the-difference-between-an-state-and-a-trait/
Why is it important to distinguish between
traits and states in psychological testing &
assessment?
How do traits exist? Do they have a physical existence?
A psychological trait exists as a construct
Construct: An informed, scientific concept developed or
constructed to describe or explain behavior
The construct’s existence can be inferred from overt behavior
Observable action or the product of an observable action
Trait is not expected to be manifested 100% of the time
What determines if a trait will be manifested or not?
The strength of the trait & the nature of the situation
Situation-dependent, E.g., American football - Playground
https://www.mindgarden.com/145-state-trait-anxiety-inventory-for-adults
Attributions of a trait or state term are relative
E.g., “Özge is very shy” – an unstated comparison with degree of shyness in average person
The reference group can greatly influence one’s conclusions or judgments
Measuring sensation seeking (the need for varied, novel, and complex sensations)
Sensation seeking scale vs. performance-based measures
Assumption 2: Psychological traits and states can be
quantified and measured
If psychological traits and states vary by degree, they are quantifiable
Defining the trait/state
The same phenomenon can be defined in different ways
E.g., “Aggressive salesperson”, “Aggressive killer”, “Aggressive waiter”
How aggressiveness is defined by the test developer
“The number of self-reported acts of harming others”
“The number of observed acts of aggression”
The test developer should provide a clear “operational
definition”
After defining a trait/state a test-developer considers the
types of item content
Components of intelligence in US adults
If knowledge of American history: “Who was the second president of the US?”
If social judgment: “Why should guns in the home always be
inaccessible to children?”
Should all items have equal weight?
The social judgment item could be given more weight
Developing appropriate ways to score/interpret
Cumulative scoring: A trait is measured by a series of test
items
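A minimal sketch of cumulative scoring with optional item weights. The item names, responses, and weights are hypothetical and not taken from any real test. Giving the social judgment item double weight is one possible choice for illustration only.

```python
# Hypothetical item responses: 1 = keyed (correct/trait-consistent) answer, 0 = otherwise.
responses = {"history_item": 1, "social_judgment_item": 1, "vocabulary_item": 0}

# Hypothetical weights; the social judgment item counts double as an illustration.
weights = {"history_item": 1.0, "social_judgment_item": 2.0, "vocabulary_item": 1.0}


def cumulative_score(responses, weights=None):
    """Sum keyed responses across items, optionally weighting each item."""
    if weights is None:
        # Equal weighting: every item contributes the same amount.
        weights = {item: 1.0 for item in responses}
    return sum(weights[item] * value for item, value in responses.items())


unweighted = cumulative_score(responses)
weighted = cumulative_score(responses, weights)
print(unweighted, weighted)
```

With equal weights, the score is simply the count of keyed responses. With unequal weights, the same answers yield a different total, which is why the weighting decision has to be made by the test developer.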
Assumption 3: Test-related behavior predicts non-test-related
behavior
The obtained sample of behavior is typically used to make
predictions about future behavior
E.g., Predicting success in life from intelligence scores obtained in childhood
To postdict behavior: Understanding of behavior that has already taken place. E.g., Criminal’s state of mind
Assumption 4: All tests have limits and imperfections
Why?
Test users should understand the limitations of tests and how those limitations can be compensated for by data from other sources
Assumption 5: Various sources of error are part of the
assessment process
Error: Factors other than what a test attempts to measure will influence performance on the test
Does an intelligence test score truly reflect intelligence or factors other than intelligence?
Error variance: The component of a test score attributable to sources other than the trait or ability measured
Assessees, assessors, and instruments can all be sources of error variance
Random errors: Errors that happen as a matter of chance
E.g., the weather on the day of testing
Error is an element in the process of measurement
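A small simulation can make error variance concrete. It assumes the classical model in which an observed score is a true score plus random error. The slides do not state that model explicitly, and every number below is made up for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical true scores for 2000 test takers (mean 100, SD 15).
true_scores = [random.gauss(100, 15) for _ in range(2000)]

# Random error: chance factors unrelated to the trait (mean 0, SD 5).
errors = [random.gauss(0, 5) for _ in range(2000)]

# Assumed model: observed score = true score + error.
observed = [t + e for t, e in zip(true_scores, errors)]

true_var = statistics.pvariance(true_scores)
error_var = statistics.pvariance(errors)
observed_var = statistics.pvariance(observed)

# Share of observed-score variance that is error variance.
error_share = error_var / observed_var
print(round(true_var, 1), round(error_var, 1), round(observed_var, 1), round(error_share, 3))
```

Under these assumptions, the observed variance is roughly the sum of the true-score variance and the error variance, so a larger error SD means a larger share of each score reflects something other than the trait.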
Assumption 6: Unfair and biased assessment
procedures can be identified and reformed
There are sophisticated procedures to identify and correct test bias, and lists of ethical guidelines to ensure test fairness
Fairness-related questions and problems can still arise
E.g., The test is used with a person whose background/experience is different from the group the test was intended for
Tests are tools and they can be used properly or
improperly
Assumption 7: Testing and assessment offer powerful benefits to
society
Imagine a world without psychological tests/assessments.
How would it be?
In a world without tests…
People could easily trick others into believing that they are a surgeon
Personnel might be hired on the basis of nepotism rather than documented merit
It would be very difficult to offer treatments for educational
difficulties
The military/business sector would not have a tool to screen
applicants
We need good tests…
What is a “good test”?
Reliability
Psychometric soundness of a test
A good test/measuring tool is reliable
In theory, the perfectly reliable measuring tool consistently measures in the same way
[Figure: a 1 Kg object weighed three times on each of three scales]
Scale 1: 1 Kg, 1 Kg, 1 Kg
Scale 2: 1.3 Kg, 1.3 Kg, 1.3 Kg
Scale 3: 1.2 Kg, 0.9 Kg, 1 Kg
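The scale readings above can be summarized numerically. This sketch treats the spread of repeated readings as a rough index of consistency and the average deviation from the true weight as bias. The scale labels are added here for illustration, and using the standard deviation this way is a simplification, not a formal reliability coefficient.

```python
import statistics

TRUE_WEIGHT_KG = 1.0

# Readings from the slide; the scale labels are added here for illustration.
readings = {
    "scale_1": [1.0, 1.0, 1.0],
    "scale_2": [1.3, 1.3, 1.3],
    "scale_3": [1.2, 0.9, 1.0],
}

summary = {}
for scale, values in readings.items():
    spread = statistics.pstdev(values)  # 0 means perfectly consistent readings
    bias = statistics.mean(values) - TRUE_WEIGHT_KG  # systematic over- or under-reading
    summary[scale] = (round(spread, 3), round(bias, 3))
    print(scale, summary[scale])
```

The second scale is perfectly consistent yet always 0.3 Kg too high, which shows that consistency alone does not guarantee that a tool measures correctly.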
Why is it more difficult to achieve perfect
reliability for psychological tests?
How does calculating reliability differ
when you are measuring a trait vs. a state?
What is a “good test”?
Validity
Psychometric soundness of a test
A valid test measures what it claims to measure
E.g., Intelligence
Items that make up a test adequately sample the range of areas that must be sampled to adequately measure the construct
How are the scores interpreted? How do scores on this test relate to other scores measuring the same/opposite construct?
E.g., A valid test of introversion should be negatively
correlated with a valid test of extraversion
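One way to check the expected negative relationship is a Pearson correlation between the two sets of scores. The scores below are invented for six hypothetical test takers. A real validation study would need a much larger sample and established instruments.

```python
import math

# Hypothetical scores for six test takers on two tests (made-up numbers).
introversion = [12, 18, 25, 31, 36, 40]
extraversion = [41, 37, 30, 22, 19, 11]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


r = pearson_r(introversion, extraversion)
print(round(r, 2))  # strongly negative for these made-up data
```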
What is a “good test”?
Other Considerations
A good test is one that trained examiners can administer, score, and interpret with a minimum of difficulty
A good test is a useful test, one that yields actionable results that will ultimately benefit individual test takers or society at large
If the purpose of a test is to compare the performance of the test taker with the performance of other test takers, then a “good test” is one that contains adequate norms
Why choose one test over the other?
What is the objective of using a test? How does the test meet that objective?
How is the construct defined?
Who is the test designed for use with? (age, gender,
reading level etc.)
How appropriate is it for the targeted test takers?
What type of data will be generated from using this
test?
Will there be a need for other assessment tools?
Does the test require an expert test user?
Norms
Norm-referenced testing and assessment: a method of
evaluation and a way of deriving meaning from test
scores by evaluating an individual test taker’s score and
comparing it to scores of a group of test takers
Aim: To understand where the test taker stands among
other test takers
Norm (singular): Behavior that is usual/average/normal
Norms (Plural): the test performance data of a particular
group of test takers that are designed for use as a
reference
Normative sample: The group of people whose performance on a particular test is analyzed
The data may be in raw or converted scores
To norm (verb): Refers to the process of deriving norms
Norming a test is expensive
User norms: Descriptive statistics based on a group of
test takers in a given period of time
Sampling to develop norms
Standardization: The process of administering a test to a representative sample
Population: The complete set of individuals with at least
one common observable characteristic
Sample of the population: a portion of the universe of
people deemed to be representative of the whole
population
Sampling: The process of selecting a representative
group of people
Subgroups in a population may differ in terms of certain
characteristics. It can be essential to have these
differences proportionately represented
Stratified sampling helps prevent sampling bias and aids in the interpretation of results
Stratified-random sampling: When every member of the population has the same chance of being included in the sample
Purposive sampling: Arbitrarily select some sample because
we believe it to be representative of the population
The problem: The sample may no longer be representative
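Proportionate stratified-random sampling, as described above, can be sketched as follows. The population, the stratum variable ("region"), and the sample size are hypothetical. Rounding each stratum's share is a simplification that may not add up to the requested size for other proportions.

```python
import random

random.seed(1)  # repeatable draw

# Hypothetical population of 1000 people tagged with one stratum variable.
population = (
    [{"id": i, "region": "urban"} for i in range(700)]
    + [{"id": i, "region": "rural"} for i in range(700, 1000)]
)


def stratified_random_sample(people, key, sample_size):
    """Draw randomly within each stratum, in proportion to its population share."""
    strata = {}
    for person in people:
        strata.setdefault(person[key], []).append(person)
    sample = []
    for members in strata.values():
        n = round(sample_size * len(members) / len(people))
        sample.extend(random.sample(members, n))
    return sample


sample = stratified_random_sample(population, "region", 100)
counts = {}
for person in sample:
    counts[person["region"]] = counts.get(person["region"], 0) + 1
print(counts)
```

Because each stratum is sampled at the same rate (10% here), the subgroup proportions in the sample match those in the population, and every member has the same chance of being selected.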
Decision of sampling: Comparing what is ideal and what is
practical
Incidental/Convenience sampling: employ a sample that is not necessarily the most appropriate but is simply the most convenient
Budgetary or other limitations
E.g., PSYC 101 samples
Exclusionary criteria
People with uncorrected vision impairment
People taking medicine that can affect performance
People who are not fluent in English etc.
Recall your own experience as a research subject. How appropriate was it for the researcher to use students as a convenience sample?
Developing norms for a standardized test
Administering the test to the sample
Standard set of instructions
Recommended settings
Summarizing the data using descriptive
statistics (?)
Test developers provide information to support
recommended interpretations of the results: the
nature of the content, norms/comparison
groups, other technical evidence
Types of Norms
Percentile
Percentile norms: the raw data from a test’s standardization sample converted to percentile form
Dividing the distribution into 100 equal parts
Percentile: an expression of the percentage of people whose score on a test or measure falls below a particular raw score.
Percentage correct: What proportion of the items did the test taker get correct?
GRE example
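The difference between a percentile and percentage correct can be shown with a small sketch. The norm group, the number of items, and the score are hypothetical. The percentile here follows the definition above (the percentage of people scoring below a given raw score).

```python
# Hypothetical raw scores (items correct out of 50) for a small norm group.
norm_group = [22, 25, 27, 30, 30, 33, 35, 38, 41, 45]
TOTAL_ITEMS = 50


def percentage_correct(raw_score, total_items):
    """Share of items answered correctly; depends only on this test taker."""
    return 100 * raw_score / total_items


def percentile_rank(raw_score, group_scores):
    """Percentage of the norm group scoring below raw_score."""
    below = sum(1 for s in group_scores if s < raw_score)
    return 100 * below / len(group_scores)


score = 38
pct_correct = percentage_correct(score, TOTAL_ITEMS)
pct_rank = percentile_rank(score, norm_group)
print(pct_correct, pct_rank)
```

The same raw score gives two different numbers: percentage correct describes performance on the items, while the percentile describes standing relative to other test takers.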
What might be a problem of using percentiles?
With normally distributed scores, real differences between raw scores may be minimized near the ends of the distribution and exaggerated in the middle of the distribution
The highest frequency of raw scores is in the middle, so even the smallest differences will appear large in percentiles
In the tails, differences between raw scores may be great while percentile differences are very small
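The distortion can be checked numerically. This sketch assumes a normal distribution with mean 100 and SD 15, chosen only for illustration, and compares the percentile gap produced by the same 5-point raw-score difference in the middle and in the upper tail.

```python
from statistics import NormalDist

# Assumed score distribution for illustration: normal, mean 100, SD 15.
scores = NormalDist(mu=100, sigma=15)


def percentile(raw_score):
    """Percentage of the assumed distribution falling below raw_score."""
    return 100 * scores.cdf(raw_score)


# The same 5-point raw-score gap, once in the middle and once in the upper tail.
middle_gap = percentile(105) - percentile(100)
tail_gap = percentile(145) - percentile(140)
print(round(middle_gap, 1), round(tail_gap, 1))
```

Under this assumption, the 5-point gap near the mean spans more than ten percentile points, while the same gap in the tail spans less than one.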
Types of Norms
Age Norms
Age equivalent norms: Average performance of different samples of test takers who were at various ages at the time of the test
Carefully constructed age norms for physical characteristics are OK.
For psychological characteristics it is tricky
Identifying the mental age according to an intelligence test.
Problem with it: E.g., Young Sheldon
Technical ground: SD can be different for different ages
Types of Norms
Grade Norms
Grade Norms: Designed to indicate the average test performance of test takers in a given school grade
The test is administered to representative samples of children over a range of consecutive grade levels (1st to 6th)
The school year is 10 months: a 6th grade student performing
average for the 4th month of the school year receives 6.4
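The grade-equivalent arithmetic above can be written as a small function. The function name and the input check are illustrative choices. The 10-month school year comes from the slide.

```python
MONTHS_PER_SCHOOL_YEAR = 10  # as stated on the slide


def grade_equivalent(grade, month_of_school_year):
    """Combine grade and month into a grade-equivalent score such as 6.4."""
    if not 0 <= month_of_school_year < MONTHS_PER_SCHOOL_YEAR:
        raise ValueError("month must be between 0 and 9")
    return grade + month_of_school_year / MONTHS_PER_SCHOOL_YEAR


ge = grade_equivalent(6, 4)
print(ge)
```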
Types of Norms
National Anchor Norms
Using nationally representative samples to compare tests that measure the same construct
Reading tests: BRT & RAT
The 96th percentile = Raw score of 67 on BRT and 14 on RAT
The national anchor norms must be obtained by administering the two tests to the same sample
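Matching scores across two tests by percentile can be sketched as follows. The ten score pairs are invented, so with such a tiny sample the top scores fall at the 90th percentile rather than the 96th. The nearest-percentile lookup is a simplification of how equivalency tables are actually built.

```python
# Hypothetical raw scores of the SAME ten people on two reading tests.
brt_scores = [40, 45, 48, 52, 55, 58, 60, 63, 65, 67]
rat_scores = [3, 5, 6, 7, 8, 9, 10, 11, 12, 14]


def percentile_rank(raw_score, group_scores):
    """Percentage of the group scoring below raw_score."""
    below = sum(1 for s in group_scores if s < raw_score)
    return 100 * below / len(group_scores)


def equivalent_score(raw_score, from_scores, to_scores):
    """Find the score on the second test with the closest percentile rank."""
    target = percentile_rank(raw_score, from_scores)
    # Ties are broken by taking the lower score so the result is deterministic.
    return min(
        set(to_scores),
        key=lambda s: (abs(percentile_rank(s, to_scores) - target), s),
    )


rat_equivalent = equivalent_score(67, brt_scores, rat_scores)
print(rat_equivalent)
```

Because both percentile tables come from the same people, a raw score on one test can be read as the raw score on the other test that sits at the same percentile.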
Types of Norms
Subgroup and Local Norms
Subgroup Norms: Segmenting a normative sample by any of the criteria used in the initial selection of subjects (age, educational level, ethnicity, handedness etc.)
The manual can provide normative info for each subgroup
Local Norms: typically developed by test users to provide normative info on the local population’s performance
Norm vs. Criterion
Referenced Evaluation
What is the difference?
Norm-referenced: Evaluating the test
score in relation to other scores on the
same test
Criterion-referenced: Evaluating a test
score based on whether some criterion
is met
Criterion: A standard on which a
judgment or decision may be based
E.g., Diploma, driver license etc.
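The two kinds of evaluation can be applied to the same score. In this sketch the group scores and the passing cutoff are hypothetical. The norm-referenced reading reports standing among other test takers, and the criterion-referenced reading reports whether a fixed standard is met.

```python
# Hypothetical scores of other test takers and a hypothetical passing standard.
group_scores = [55, 60, 62, 68, 70, 74, 78, 81, 85, 90]
PASSING_SCORE = 70  # the criterion; an assumed cutoff for illustration


def norm_referenced(score, group):
    """Interpret a score by its standing among other test takers."""
    below = sum(1 for s in group if s < score)
    return 100 * below / len(group)


def criterion_referenced(score, cutoff):
    """Interpret a score by whether it meets a fixed standard."""
    return score >= cutoff


score = 68
standing = norm_referenced(score, group_scores)
passed = criterion_referenced(score, PASSING_SCORE)
print(standing, passed)
```

The criterion-referenced result would not change if the other test takers scored differently, whereas the norm-referenced result depends entirely on the comparison group.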
Norm vs. Criterion
Referenced Evaluation
Criticism of criterion-referenced evaluation: It may assess mastery of basic knowledge, skills, or both, but has little or no meaningful application at the upper end of the knowledge/skill continuum
These two are not mutually exclusive: a test can be both norm- and criterion-referenced
In a sense all testing is normative
http://www.edpsycinteractive.org/topics/measeval/crnmref.html
Can you think of norm vs. criterion
referenced tests you took before?
Test users should not lose sight of culture as a factor in test administration, scoring, and interpretation
Is the test appropriate for the targeted test taker population?
The do’s and don’ts regarding culture and psychological tests