
Journal of Educational and Behavioral Statistics

Vol. XX, No. X, pp. 1–33


DOI: 10.3102/1076998618825116
Article reuse guidelines: sagepub.com/journals-permissions
© 2019 AERA. http://jebs.aera.net

Detection and Treatment of Careless Responses to Improve Item Parameter Estimation
Jeffrey M. Patton
Financial Industry Regulatory Authority (FINRA)

Ying Cheng
Maxwell Hong
University of Notre Dame

Qi Diao
Educational Testing Service

In psychological and survey research, the prevalence and serious consequences of careless responses from unmotivated participants are well known. In this
study, we propose to iteratively detect careless responders and cleanse the data
by removing their responses. The careless responders are detected using
person-fit statistics. In two simulation studies, the iterative procedure leads to
nearly perfect power in detecting extremely careless responders and much
higher power than the noniterative procedure in detecting moderately careless
responders. Meanwhile, the false-positive error rate is close to the nominal
level. In addition, item parameter estimation is much improved by iteratively
cleansing the calibration sample. The bias in item discrimination and location
parameter estimates is substantially reduced. The standard error estimates,
which are spuriously small in the presence of careless responses, are corrected
by the iterative cleansing procedure. An empirical example is also presented to
illustrate the proposed procedure. These results suggest that the proposed
procedure is a promising way to improve item parameter estimation for tests of
20 items or longer when data are contaminated by careless responses.

Keywords: careless responses; item response theory; person fit; sample cleansing; item
parameter estimation

In psychological and survey research, the prevalence of careless responses from unmotivated participants has been repeatedly reported. For example, it was
found that over 50% of examinees responded to one or more items carelessly on
the Minnesota Multiphasic Personality Inventory (MMPI; Baer, Ballenger,
Berry, & Wetter, 1997; Berry et al., 1992). In the context of low-stakes educa-
tional testing (e.g., program evaluation or pretesting), the same problem persists
due to low motivation. For example, the National Assessment of Educational
Progress (NAEP) suffers from student inattentiveness; according to the National
Science Foundation (www.nsf.gov/statistics/seind93/chap1/doc/1s193.htm),
nearly half (45%) of the Grade 12 students reported that they did not try as hard
on the NAEP math test as they did on other math tests taken in school that year.
Data sets contaminated by careless responses may lead to serious conse-
quences. In psychological research, these consequences include low scale relia-
bility, attenuated effects between predictor and outcome variables, and erroneous
conclusions from hypothesis testing (Clark, Gironda, & Young, 2003). In edu-
cational testing, researchers have reported performance gaps between motivated
and unmotivated students (Wise & DeMars, 2003), biased item parameter esti-
mates (Oshima, 1994; Wise, Kingsbury, Thomason, & Kong, 2004), biased item
and test information functions (van Barneveld, 2007), and biased estimates of
students’ abilities (De Ayala, Plake, & Impara, 2001; Meijer & Sijtsma, 2001;
Nering & Meijer, 1998).
It is therefore not an overstatement that careless responding behavior is a
serious and important issue to address in psychological and educational measure-
ment. Note that careless responding behavior may occur in a low-stakes context,
but its consequences may manifest themselves in a high-stakes context. For
example, pretesting of items is usually considered low stakes for examinees in
a pretest sample. But item parameter estimates obtained from pretesting, which
may be biased due to careless responses (Oshima, 1994), may well be used in a
high-stakes certification or admission test.
Given the well-established consequences of careless responses, it is of critical
importance to identify such responses. Methods of detecting careless responses
can be put into one of two broad categories. The first category “requires
special items or scales to be inserted” into an assessment before administration,
such as bogus items or lie scales (Meade & Craig, 2012). The second category is
post hoc analysis performed after data collection. In this study, we focus on post
hoc analysis because it is not limited to data sets that include responses to bogus
items or lie scales. In some situations, the insertion of special item(s) may not be
possible or desirable; in that case, post hoc analysis may be the only option.
If the detection of careless responses is sufficiently accurate, these data may
be removed to avoid “contaminating” the results of subsequent analyses (e.g.,
item calibration). In large-scale testing programs, it is common to calibrate items
(i.e., estimate item parameters) from pretesting data under the item response
theory (IRT) framework. After these estimates are obtained, they are often
assumed to be the true parameter values when used in subsequent analyses, such
as estimation of a respondent’s latent trait value. One popular approach to iden-
tify aberrant responses under the IRT framework is to use person-fit statistics
(Conjin, Emons, & Sijtsma, 2014; Karabatsos, 2003). However, existing person-
fit statistics under IRT usually assume that the item parameters are known (Glas
& Dagohoy, 2007).1 When the item parameters are unknown, the usual person-fit

2
Patton et al.
statistics may not be able to reliability detect careless responders. In this article,
we therefore propose and evaluate an iterative cleansing method based on
person-fit statistics that takes into account the uncertainty of item parameter
estimates in order to (ultimately) improve item calibration.

Method
Detection of Aberrant Responses Using Person-Fit Statistics
Let us consider a simple scenario of careless responding behavior. In this
scenario, a psychological questionnaire of m binary items is administered to a
calibration sample of N respondents. For example, an item may ask whether a
respondent agrees (coded as 1) or disagrees (coded as 0) with a particular state-
ment. A suitable IRT model in this case is the two-parameter logistic model
(2PLM). The probability of respondent i endorsing “agree” to item j, i.e., u_ij = 1, is given by

$$P_{ij} = P(u_{ij} = 1 \mid \theta_i, \gamma_j) = \frac{e^{a_j(\theta_i - b_j)}}{1 + e^{a_j(\theta_i - b_j)}}, \qquad (1)$$

where the item parameters γ_j = (a_j, b_j)′ are the item discrimination and location
parameters, respectively. For careless responders, however, the probability of
endorsing “agree” or “disagree” does not depend on the latent trait. Instead, the
respondents randomly choose one of the two response categories for every item;
that is, the probability of an “agree” response is .5. A certain percentage (p) of the
respondents exhibit careless responding behavior. This hypothetical scenario is
certainly a simplified representation of reality. In practice, response style may be
related to the latent trait or a collateral variable. However, this simple scenario is
an appropriate first step to demonstrate the utility of the iterative cleansing
procedure.
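To make the scenario concrete, the following R sketch generates data of this form. It is illustrative only, not the authors' code; the function name simulate_responses and the item-parameter draws in the example are our own assumptions.

```r
# Illustrative data-generating sketch: normal responders follow the 2PLM;
# careless responders endorse "agree" on every item with probability .5.
simulate_responses <- function(theta, a, b, p_careless = 0.10) {
  N <- length(theta)
  m <- length(a)
  # 2PLM probabilities: P[i, j] = 1 / (1 + exp(-a_j * (theta_i - b_j)))
  P <- plogis(outer(theta, b, "-") * matrix(a, N, m, byrow = TRUE))
  # Randomly designate a proportion p_careless of respondents as careless
  careless <- rep(FALSE, N)
  careless[sample(N, round(p_careless * N))] <- TRUE
  P[careless, ] <- 0.5
  U <- matrix(rbinom(N * m, 1, P), N, m)   # binary (0/1) responses
  list(responses = U, careless = careless)
}

# Example: 500 respondents, 40 items, 10% careless (the item parameters here
# are arbitrary draws, not the retired item pool used later in the article)
set.seed(1)
dat <- simulate_responses(theta = rnorm(500),
                          a = rlnorm(40, meanlog = 0.4, sdlog = 0.3),
                          b = rnorm(40, 0, 0.9),
                          p_careless = 0.10)
```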
When the full, “contaminated” sample (i.e., consisting of both normal and
careless responders) is used for item calibration, the resulting item parameter
estimates will very likely be biased. If the identities of the careless responders are
known, an obvious way to remove the bias is to remove the careless responders
from the pretesting sample and perform item calibration on the remaining sample
(i.e., consisting only of normal responders).
In practice, however, one does not know with certainty which respondents
have responded carelessly. Thus, we propose to utilize the standardized log-
likelihood person-fit index lz to detect careless responders (Drasgow, Levine,
& Williams, 1985), which can be computed as follows:
$$l_{z_i} = \frac{\sum_{j=1}^{m} w_{ij}\,(u_{ij} - P_{ij})}{\sqrt{\sum_{j=1}^{m} w_{ij}^{2}\, P_{ij}\,(1 - P_{ij})}}, \qquad (2)$$

where w_ij = log[P_ij / (1 − P_ij)]. The numerator in Equation 2 is written as a weighted sum of residuals, but it is algebraically equivalent to the log likelihood of θ minus the expected value of the log likelihood (Snijders, 2001). The
denominator is the standard deviation of the log likelihood; thus, Equation 2 is
the standardized log likelihood. Respondents with response patterns that do not
conform to the IRT model tend to have large residuals, which in turn yield large
negative lz values. When evaluated with the true ability value, the lz statistic
follows a standard normal distribution (asymptotically) under the null hypothesis
of person-model fit. Thus, using α = .05, respondents with lz &lt; −1.64 are flagged as having an aberrant response pattern² (where “aberrant” describes any response
pattern that does not conform to the IRT model). The use of person-fit statistics to
detect aberrant responses is well known (Meijer & Sijtsma, 2001; Van Krimpen-
Stoop & Meijer, 1999). Careless responding behavior, as one type of aberrant
responding behavior, can therefore be detected.
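For reference, a direct transcription of Equation 2 into R might look as follows (an illustrative sketch, not the authors' code; lz_statistic is our own name):

```r
# lz for one respondent under the 2PLM (Equation 2).
# u: 0/1 response vector; theta: true or estimated latent trait; a, b: item parameters.
lz_statistic <- function(u, theta, a, b) {
  P <- plogis(a * (theta - b))          # 2PLM probabilities P_ij
  w <- log(P / (1 - P))                 # weights w_ij
  num <- sum(w * (u - P))               # weighted sum of residuals
  den <- sqrt(sum(w^2 * P * (1 - P)))   # standard deviation of the log likelihood
  num / den
}
```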
Note that the lz statistic in Equation 2 requires the true latent trait value.
Operationally, it is usually replaced by the latent trait estimate. When that hap-
pens, the asymptotic distribution of lz has variance smaller than 1. Thus, using
N(0,1) as the reference distribution produces too many false-negative errors (i.e.,
too many aberrant response patterns are undetected) and a false-positive error
rate below the nominal level. Snijders (2001) proposed a corrected statistic
(denoted by lz*) which does have an asymptotic N(0,1) distribution when eval-
uated with a latent trait estimate.
As pointed out by Magis, Raîche, and Béland (2012), the functional form of the modification in lz* depends on the choice of the latent trait estimator. In contrast with lz, which can be evaluated with any type of latent trait estimate (though, strictly, the true latent trait value is required), lz* must be evaluated with
a latent trait estimator that satisfies certain conditions (see Snijders, 2001, for
details). Such estimators include maximum likelihood (ML), weighted likelihood
(WL; Warm, 1989), and the Bayesian modal estimator (MAP). Take the ML
estimator as an example. The modification to the lz statistic in Equation 2
involves replacing the weight w_ij in the denominator by w*_ij:

$$w^{*}_{ij} = w_{ij} - c_m(\theta_i)\, r_j(\theta_i),$$

where

$$c_m(\theta_i) = \frac{\sum_{j=1}^{m} P'_{ij}\, w_{ij}}{\sum_{j=1}^{m} P'_{ij}\, r_j(\theta_i)} \quad \text{and} \quad r_j(\theta_i) = \frac{P'_{ij}}{P_{ij}\,(1 - P_{ij})}, \qquad j = 1, 2, \ldots, m,$$

with m being the test length. Here P′_ij is the first-order derivative of P_ij with respect to θ_i. With this modification, the resulting statistic will be the lz* for the ML estimator.
In this study, we will use both ML and WL latent trait estimates, and the corresponding fit statistics are denoted by lzM* and lzW*, respectively. The reason we do not use MAP is to avoid the confounding effect of the choice of prior.³ The
WL and ML estimates are both asymptotically unbiased; though for tests of finite
length, the WL estimate has been shown to have less bias than the ML estimate.
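Following the description above, a sketch of the corrected statistic for the ML estimator is given below. Only the denominator weights are modified, per the text; this is illustrative code under our own naming, not the authors' implementation.

```r
# Snijders-type correction for the ML estimator under the 2PLM (sketch).
lz_star_ml <- function(u, theta_hat, a, b) {
  P      <- plogis(a * (theta_hat - b))
  Pprime <- a * P * (1 - P)                    # dP_ij/dtheta for the 2PLM
  w      <- log(P / (1 - P))                   # original weights w_ij
  r      <- Pprime / (P * (1 - P))             # r_j(theta) for the ML estimator
  c_m    <- sum(Pprime * w) / sum(Pprime * r)  # c_m(theta)
  w_star <- w - c_m * r                        # corrected weights w*_ij
  num <- sum(w * (u - P))                      # numerator as in Equation 2
  den <- sqrt(sum(w_star^2 * P * (1 - P)))     # denominator uses w*_ij
  num / den
}
```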

Noniterative and Iterative Cleansing of the Calibration Sample
The computation of lz* still requires item parameters, which in reality are replaced by their estimates. It is not uncommon to use item parameter estimates from pretest data to compute lz* values for examinees in an operational data set. But what if the purpose of obtaining lz* is to cleanse the calibration sample in order to obtain better item parameter estimates? Given a pretest sample, item parameter estimates can be obtained as well as the latent trait estimates. These can subsequently be used to calculate lz*. Response patterns with lz* smaller than −1.64 are removed. The resulting “cleansed” sample is then used to obtain new
item parameter estimates, which are hopefully less biased than the original item
parameter estimates based on the full, contaminated sample. We will refer to this
procedure as the noniterative cleansing procedure.
If lz* is computed for a sample of respondents, it is presumably because the investigator suspects the presence of aberrant response patterns or person misfit. If the item parameter estimates used to compute lz* are based on that same sample, these estimates may be far from their true values, and lz* may no longer follow a N(0,1) distribution. This is exactly the case when pretest data are contaminated by careless responses. When evaluated with item parameter estimates that are close to their true values, lz* should be able to successfully identify misfitting responses. When such
responses are removed, item parameter estimation can be improved. The
improved item parameter estimates in turn can produce better person-fit statis-
tics, which then can be used to more accurately pinpoint the aberrant responders.
Therefore, we propose the following iterative cleansing procedure:

1. Use the full calibration sample X_0 to obtain item parameter estimates γ̂_0 using marginal ML. Use γ̂_0 to obtain latent trait estimates θ̂_0 for all respondents using ML or WL.
2. Using γ̂_k and θ̂_k (k = 0, 1, 2, . . .), compute lz* for every respondent in the full sample. Create a cleansed calibration sample X_(k+1) by removing aberrant response patterns whose lz* are below −1.65.
3. Obtain item parameter estimates γ̂_(k+1) based on the cleansed sample. Use γ̂_(k+1) to obtain latent trait estimates θ̂_(k+1) for all respondents in the full sample. Substitute γ̂_(k+1) and θ̂_(k+1) into Step 2 to compute lz*.
4. Repeat Steps 2 and 3 until the proportion of respondents that change classification (i.e., aberrant to normal or vice versa) does not exceed .01.⁴ Upon convergence, the most recent set of item parameter estimates are taken as the final values.

Presumably, the final item parameter estimates are based on a sample largely
cleansed of aberrant response patterns, so these estimates should be more accu-
rate than the estimates based on the full, contaminated sample. Note that in each
iteration, the iterative procedure computes lz* for every respondent in the original
full sample, instead of just for those in the most recent cleansed sample. In this
way, respondents who are removed during early iterations can be added back in
during later iterations; this keeps the sample size from continuously dropping.
The noniterative procedure simply stops at the cleansed sample X1. We therefore
expect the item parameter estimates from the iterative cleansing procedure to be
less biased than those based on the noniterative procedure.
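A skeleton of the four steps in R is sketched below. Here calibrate_2pl() and score_theta() are placeholders for whatever calibration and scoring routines are used (the authors calibrated with BILOG-MG using marginal ML), and lz_star() stands for the corrected person-fit statistic sketched earlier; none of this is the authors' actual code.

```r
iterative_cleanse <- function(X, max_iter = 20, tol = 0.01, cut = -1.65) {
  pars  <- calibrate_2pl(X)              # Step 1: calibrate on the full sample
  theta <- score_theta(X, pars)          # ML or WL latent trait estimates
  flagged_old <- rep(FALSE, nrow(X))
  for (k in seq_len(max_iter)) {
    # Step 2: person fit for EVERY respondent in the full sample
    lz <- sapply(seq_len(nrow(X)), function(i)
      lz_star(X[i, ], theta[i], pars$a, pars$b))
    flagged <- lz < cut
    # Step 4: stop when classifications change for no more than tol of the sample
    if (mean(flagged != flagged_old) <= tol) break
    flagged_old <- flagged
    # Step 3: recalibrate on the cleansed sample, then re-score everyone
    pars  <- calibrate_2pl(X[!flagged, , drop = FALSE])
    theta <- score_theta(X, pars)
  }
  list(item_parameters = pars, flagged = flagged)
}
```

The noniterative procedure corresponds to a single pass through Steps 1 to 3 of this loop.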
Next, the iterative and noniterative cleansing procedures are evaluated in three simulation studies. Specifically, we employ the lzM* and lzW* person-fit statistics in both single-step (noniterative) and iterative procedures for tests of different
lengths. We examine the accuracy and precision of the resulting item parameter
estimates, as well as the properties of the “cleansed” data sets that produce these
estimates. The latter entails an analysis of the success of the statistics in detecting
careless responders (e.g., an analysis of false-positive and false-negative error
rates). Furthermore, an empirical example is conducted, which highlights the
utility of the iterative approach.

Simulation Studies
Study 1: Extreme Carelessness
In this first study, we generated data corresponding to the hypothetical sce-
nario described previously. Specifically, a sample of N respondents are adminis-
tered a psychological test of 40 binary items, and the 2PLM is the underlying
item response model. However, a percentage (p) of these respondents are ran-
domly chosen to exhibit careless response behavior. 2PLM parameters for a test
of 40 items are drawn from a retired item pool of a large-scale achievement test.
The mean and standard deviation of the 40 location parameters are 0.11 and
0.90, respectively, and the mean and standard deviation of the 40 discrimination
parameters are 1.76 and 0.56, respectively. The distribution of location para-
meters is roughly symmetric (skewness = .33), whereas the distribution of discrimination has a slight positive skew (skewness = .73). Also, item discrimination and location are positively correlated (r = .51).
To generate data for item calibration, latent trait parameters for a calibration
sample of size N are drawn from a standard normal distribution. A percentage (p)
of these respondents are randomly selected to exhibit a careless response style.
For normal responders, the probability of an “agree” response is computed using
the 2PLM. For careless responders, this probability is set to .5 for all items.
We simulate two calibration sample sizes (N = 500 or 3,000) and three percentages of careless responders (p = 0, .1, or .3), for a total of six conditions.
Four hundred is considered the minimum recommended sample size for calibra-
tion with the 2PLM (Hulin, Lissak, & Drasgow, 1982), and a sample size larger
than 1,000 is generally considered adequate for reasonable test lengths even for
calibrating the 3PLM (Gao &amp; Chen, 2005; Hanson &amp; Béguin, 2002; Kim, 2006). Therefore, our choice of N = 500 or 3,000 represents the low and high
end of adequate sample sizes for the calibration of the 2PLM. For each condition,
we obtain item parameter estimates based on the full sample. For the conditions
with p > 0, we also obtain estimates based on the three cleansed samples: the (1 − p) × N normal responders (i.e., perfect cleansing) and two subsamples resulting from noniterative cleansing and iterative cleansing, respectively. Each type of cleansing procedure is applied with the lzM* or lzW* person-fit statistic. Item cali-
bration is performed with BILOG-MG 3 (Zimowski, Muraki, Mislevy, & Bock,
2003) using marginal ML assuming a standard normal density for the latent trait.
Finally, for each condition, the above procedure (data generation and item
calibration or cleansing/calibration) is repeated 100 times. To examine the recov-
ery of a given item parameter, we compute the empirical bias and standard error
(SE, i.e., standard deviation) based on the 100 parameter estimates. We also
record the person-fit statistics used to create the cleansed calibration data sets
that, in turn, yielded the final item parameter estimates. By comparing the esti-
mated respondent classification (i.e., normal or careless) with the true classifi-
cation, we were able to examine the false-positive/false-negative error rates as
well as classification accuracy across the 100 replications. Aside from item
calibration, all simulations and analyses are performed in R v2.15.2 ( R Core
Team, 2013).
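As a small illustration, the recovery summaries can be computed directly from the replicated estimates (a sketch under our own hypothetical object names, not the authors' code):

```r
# est: 100 x 40 matrix of discrimination (or location) estimates, one row per replication;
# true_a: the 40 true parameter values.
empirical_bias <- colMeans(est) - true_a   # mean estimate minus truth, per item
empirical_se   <- apply(est, 2, sd)        # SD of the estimates across replications
```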
Results are very similar for the N = 500 and N = 3,000 conditions, so only the N = 500 results are presented here, and the N = 3,000 results are available upon request. Additionally, latent trait estimation using ML or WL yielded very similar results (for both the iterative and noniterative procedures). This may be due to the
test length; the advantage of WL over ML estimation is reduced bias for finite tests,
but this advantage may be difficult to discern for a test as long as 40 items. Thus, only the results based on lzW* are presented, and this statistic is hereafter denoted simply by lz for clarity. (Results based on lzM* are available upon request.)

p = .1 condition. It is important to determine whether lz is able to detect careless responders without making too many false-positive errors. The left half of
Table 1 summarizes the performance of lz with 10% carelessness. First, we
consider the noniterative procedure. Recall that for N = 500 and p = .1 care-
lessness, each calibration sample contains 50 careless responders. On average,
55 responders are flagged by lz (i.e., 11% of the calibration sample), among
which 88% are actually careless responders. The false-positive error rate, that
is, the average proportion of normal responders that are erroneously flagged as
aberrant, is .014. Interestingly, the observed false-positive error rate is far
below the nominal level of .05. This suggests that the sampling distribution
of the person-fit statistics may not be N(0,1), even after applying the Snijders (2001) correction. This probably occurs because lz is evaluated with item parameter
estimates instead of the true parameter values. In addition, 96% of the careless
responders are flagged, that is, the power is .96. Thus, lz demonstrates adequate
power in this scenario but results in a false-positive error rate that is below the
nominal level of .05.

TABLE 1.
Performance of the lz Person-Fit Statistic in the Iterative and Noniterative Cleansing Procedures (Extreme Careless Behavior)

                                          10% Carelessness          30% Carelessness
                                          Noniterative  Iterative   Noniterative  Iterative
Number of iterations                          —            2.7          —            3.4
Proportion flagged (of total sample)         .11           .16         .23           .34
Proportion careless (of those flagged)       .88           .62         .99           .87
False-positive error rate                    .014          .070        .005          .065
Detection rate                               .96          >.99         .76          >.99

Next, we consider the iterative procedure. Occasionally, the iterative proce-
dure does not yield converged item parameter estimates (presumably, this is due
to the reduction in sample size). In this case, results for that replication are
omitted from the final results. However, the number of nonconvergent replica-
tions is rather small (2%). For those replications that do converge, only 2.7
iterations are needed (on average). The iterative procedure flags substantially
more respondents than does the noniterative procedure (82 vs. 55). And because
there are only 50 careless responders, the proportion of flagged respondents who
are actually careless is reduced relative to the noniterative procedure (.62 vs. .88).
In the meantime, the false-positive error rate for the iterative procedure (.07) is
slightly higher than the nominal level of .05. The iterative procedure also leads to
slightly higher power than the noniterative procedure (>.99 vs. .96). Thus, the
iterative procedure exhibits nearly perfect power at the price of a slightly inflated
false-positive error (i.e., slight overflagging).
A natural question that follows is how the performance of the lz in detecting
careless responders translates to item parameter recovery. The top row of Figure 1
displays the empirical bias of discrimination estimates against the true discrim-
ination values, and the bottom row displays the empirical bias of location para-
meter estimates versus the true location values. Each column displays results
based on a different type of calibration sample: Results based on the full sample
with p = 0 (i.e., no carelessness) and p = .1 are in the first and second columns,
respectively; results based on the subsample of normal responders (i.e., perfect
cleansing) are in the third column; and results based on the noniterative and
iterative cleansing procedures using lz are in the fourth and fifth columns, respec-
tively. Note that the results based on the subsample of normal responders (col-
umn 3) are the standard of comparison for the cleansing procedures, as it is the
best that can be achieved.
First, consider the bias of the discrimination estimates. For a full sample of
size 500, when there are no careless responders, the bias of the discrimination


FIGURE 1. Empirical bias of discrimination and location parameter estimates versus respective true parameter values (extreme careless behavior, 10% carelessness).

parameter estimates is negligible (see Figure 1, top row, column 1). On the other
hand, the full sample that is “contaminated” with 10% careless responders (i.e.,
450 normal responders and 50 careless responders) yields negative bias for all
items, and this bias becomes quite extreme (as large as −1.5) for items with large
true a values (see column 2). If perfect cleansing is performed, that is, with the
subsample of normal responders (which consists of 450 of the 500 original
respondents), the bias is virtually eliminated (column 3). So the reduction of
sample size from 500 to 450 does not seem to cause any issue in discrimination
parameter recovery when p = .1. In the next column, we see that the noniterative
lz procedure works quite well; the results are very similar to those based on the
normal responders alone. The iterative lz procedure eliminates the negative bias
but adds a slight positive bias for items with large true a values (column 5). This
likely has to do with overflagging—more responders than necessary are screened
from the calibration sample.
For a sample size of 500, when there are no careless responders, the bias of the
location parameter estimates is also negligible (see Figure 1, bottom row, column
1). But the bias of location parameter estimates based on the contaminated
sample shows an interesting pattern: a slight “inward” bias for moderate b values
and a larger “outward” bias for more extreme b values (column 2). This pattern is
virtually eliminated when only the subsample of normal responders is used
(column 3). And in contrast with the results for discrimination, the noniterative
and iterative procedures work equally well (columns 4 and 5), both yield results
very similar to those based on the normal responders only.
Next, the top row of Figure 2 displays the empirical SEs of discrimination
estimates against the true discrimination values, and the bottom row displays the
empirical SEs of location estimates versus the true location values. The first
figure on the top row shows that even with a full uncontaminated sample, the
SE of the discrimination parameter estimates increases with the true parameter
values. In the full, contaminated sample, the SE is very much underestimated for
items with high a parameters (which are also the items with the most extreme
negative bias). After perfect cleansing, the sample size drops from 500 to 450,
and the SEs resemble those based on the sample with no careless responders.
Results for the noniterative procedure are similar to those based on the normal
responders alone, but for the iterative procedure, the SEs for items with the
largest a values are inflated, which is again attributable to the increased false-
positive error rate. Concerning the SEs of location estimates, the type of calibra-
tion sample appears to have little effect.
In summary, when there are 10% careless responders, both iterative and non-
iterative cleansing procedures are very effective in (1) reducing the large nega-
tive bias of the discrimination parameter estimates for items with large true a
values and (2) reducing the large bias in the location parameter estimates for
items with extreme true b values. Between the two procedures, the noniterative


FIGURE 2. Empirical standard errors of discrimination and location parameter estimates versus respective true parameter values (extreme careless behavior, 10% carelessness).

procedure has a small edge concerning item parameter recovery, as the iterative
procedure adds a small positive bias to the discrimination parameter estimates of
highly discriminating items and slightly inflates their SEs.

p = .3 condition. The right half of Table 1 displays power and the false-positive error rate when using lz to detect careless responders in the 30% carelessness condition. First, we consider the noniterative procedure. Given N = 500 and p = .3, each calibration sample contains 150 careless responders. In contrast with the p = .1 results, the noniterative procedure flags far fewer respondents than the
target number (115 respondents or 23% of the sample). Accordingly, power is
relatively poor (.76). Ninety-nine percent of the flagged respondents are careless,
and the false-positive error rate is, accordingly, very small (.005). So when p = .3, the noniterative procedure leads to moderate power and a false-positive error
rate far below the nominal level.
Concerning the iterative procedure, only 5 of the 100 replications did not yield
a converged set of item parameter estimates. As might be expected, the iterative
procedure required slightly more iterations in the p = .3 condition than in the p = .1 condition (3.4 vs. 2.7). In contrast with the noniterative procedure, the iterative
procedure flags more than 150 respondents (170 or 34% of the sample). Accord-
ingly, power increases to nearly perfect (.99), and the false-positive error rate is
slightly inflated (.065).
For p = .3, Figure 3 displays the empirical bias of discrimination and location
estimates in the first and second rows, respectively. As shown earlier, with a
sample size of 500, the bias of item parameter estimates is negligible. But if the
sample is contaminated with 30% carelessness, the resulting discrimination para-
meter estimates exhibit negative bias for all items, particularly for highly dis-
criminating items. The negative bias can be as large as −2.5, which is noticeably greater in magnitude than that in the p = .1 condition. Using the subsample of normal respon-
ders largely eliminates this bias, though a small, positive bias remains for some
high-a items. This is different from the p = .1 condition, where perfect cleansing virtually eliminated the bias. This occurs because the sample consists of only 350 normal responders in the p = .3 condition. The marginal ML procedure yields consistent estimates, but in a small sample, some bias can be expected. Also in contrast with the p = .1 condition, the noniterative lz procedure reduces but does
not eliminate the negative bias. The iterative procedure, on the other hand, yields
results similar to those using only normal responders.
Concerning the bias of location estimates, the contaminated sample yields a
pattern of bias similar to, but more extreme than, that seen in the p = .1 condition.
Unsurprisingly, using the subsample of normal responders largely eliminates the
bias. The noniterative procedure yields bias less extreme than that using the full
sample; these results are clearly worse than those based on the normal responders
alone. In contrast, the iterative procedure yields estimates that are only slightly
more biased than those based on the normal responders. These results suggest


FIGURE 3. Empirical bias of discrimination and location parameter estimates versus respective true parameter values (extreme careless behavior, 30% carelessness).

that the iterative procedure is more advantageous than the noniterative procedure
when the percentage of aberrant responders is large.
Next, Figure 4 displays the empirical SEs of discrimination and location
estimates in the first and second rows, respectively. Similar to when p = .1,
using the subsample of normal responders yields a positive relationship between
the true a value and SE, whereas the full, contaminated sample yields substan-
tially underestimated SEs for higher-a items. And similar to the results for bias
when p = .3, the iterative procedure appears to be superior to the noniterative
procedure as performance of the former is closer to perfect cleansing. Concern-
ing the SEs of location estimates, the type of calibration sample appears to have
little effect.
In summary, when there are 30% careless responders, only the iterative
cleansing procedure is effective in reducing the bias of item parameter estimates.
We believe the advantage of the iterative procedure is attributable to its increased
power over the noniterative procedure in detecting careless responders (.99 vs.
.76), and the iterative procedure yields a false-positive error rate closer to the
nominal level.
Comparing the results of the p = .1 condition and the p = .3 condition, we find
that the noniterative procedure fares well in the former but not the latter. Previous
research has demonstrated that as the proportion of aberrant responders
increases, it becomes more difficult to distinguish between normal and aberrant
responding (Rupp, 2013), and person-fit statistics do not do well in such situa-
tions. However, it appears that the iterative procedure is able to account for the
value of p by using iterations to arrive at a set of item parameter estimates based
on a “cleansed” calibration sample, and it performs comparably for each value of
p: False-positive error rates are close to the nominal level of .05, and power is
nearly perfect.
Given the success of the iterative cleansing procedure in Study 1, we would
like to investigate a more realistic scenario. In Study 1, careless responders are
careless throughout the test. Certainly, some respondents in real-life situations
may behave this way, but it is more likely that careless behavior is manifested as
only a few careless responses. Thus, in Study 2, we simulate two different types
of careless responding: moderate and extreme. These two response styles are
explained in more detail in the next section.

Study 2: Realistic Scenario


In Study 2, we employ a similar design to that of Study 1, where either 10% or
30% of the respondents exhibit careless behavior on a 40-item test. However, in
the 10% condition, 8% of the total sample provides between 1 and 10 careless
responses, and 2% of the total sample provides between 30 and 40 careless
responses. Similarly, in the 30% condition, 24% of the total sample provides
between 1 and 10 careless responses, and 6% of the total sample provides


FIGURE 4. Empirical standard errors of discrimination and location parameter estimates versus respective true parameter values (extreme careless behavior, 30% carelessness).

between 30 and 40 careless responses. These scenarios are intended to represent
more realistic careless response behavior.
To generate the data for item calibration, 500 latent trait parameters are drawn
from a standard normal distribution. Ten percent or 30% of these respondents are
randomly selected to be careless responders; of those selected, 80% are randomly
chosen to exhibit “moderate” careless behavior (i.e., provide between 1 and 10
careless responses), and the remaining 20% exhibited “extreme” careless beha-
vior (i.e., provided between 30 and 40 careless responses). If a respondent is
selected to be a careless responder, we determine the exact number of careless
responses by randomly drawing a number between 1 and 10 (if moderately
careless) or between 30 and 40 (if extremely careless). The probability of an
“agree” response to these items is set to .5, and for all other items, the probability
of an agree response is computed using the 2PLM. Note that the response style
and the number of careless responses are independent of the respondent’s latent
trait. As in Study 1, 100 data sets are generated for each condition (10% or 30%
carelessness). Both the iterative and noniterative sample cleansing procedures
are applied to the generated data sets. All other details of the simulations are
identical to those in Study 1.
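A sketch of this contamination scheme, applied to an existing matrix of 2PLM-generated responses, could look like the following (illustrative code with our own names, not the authors'):

```r
# U: N x 40 matrix of responses already generated under the 2PLM.
contaminate <- function(U, prop_careless = 0.10) {
  N <- nrow(U); m <- ncol(U)
  careless_ids <- sample(N, round(prop_careless * N))
  n_extreme    <- round(0.20 * length(careless_ids))   # 20% extreme, 80% moderate
  extreme_ids  <- careless_ids[seq_len(n_extreme)]
  for (i in careless_ids) {
    # number of careless responses: 30-40 if extreme, 1-10 if moderate
    n_car <- if (i %in% extreme_ids) sample(30:40, 1) else sample(1:10, 1)
    items <- sample(m, n_car)
    U[i, items] <- rbinom(n_car, 1, 0.5)   # random 0/1 responses on those items
  }
  U
}
```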

Results. Figure 5 displays the empirical bias of item parameter estimates when a
total of 10% of respondents exhibit careless response behavior, but only 2%
exhibit extreme carelessness. Figure 6 displays the item parameter recovery
results when a total of 30% of respondents exhibit careless response behavior,
but only 6% exhibit extreme carelessness. Similar to Figures 1 and 3, the top row
of Figures 5 and 6 displays the empirical bias of discrimination estimates versus
the true discrimination values, and the bottom row displays the empirical bias of
location estimates versus the true location values. Also, just as in Study 1, the
sample sizes of 500 and 3,000 yield very similar results, and the performance of
the lz statistic does not change much when evaluated with ML versus WL latent
trait estimates. Thus, we only present results for N ¼ 500 and sample cleansing
based on WL latent trait estimates.
When the entire contaminated sample is used for item calibration, we can see
that the effect of “realistic” carelessness is much smaller than that of the
“extreme” carelessness in Study 1. For example, the largest bias for discrimina-
tion estimates in the more realistic scenario is −.75 and −1.5 in the 10% and 30% conditions, respectively (see plots in the top row, second column of Figures 5 and 6, respectively). In contrast, the largest bias in the extreme scenario of Study 1 is −1.5 and −2.5, respectively (see plots in the top row, second column of Figures 1
and 3, respectively). Similar to the results of Study 1, the magnitude of negative
bias of the discrimination estimates increases with the true parameter value.
Also, the effectiveness of noniterative and iterative cleansing with the more
realistic careless response behavior is very similar to that under extreme care-
lessness; the noniterative procedure works well if 10% of respondents are


FIGURE 5. Empirical bias of discrimination and location parameter estimates versus respective true parameter values (realistic careless behavior, 10% carelessness).

FIGURE 6. Empirical bias of discrimination and location parameter estimates versus respective true parameter values (realistic careless behavior, 30% carelessness).

careless but relatively poor if 30% are careless. And the iterative procedure
performs well regardless of the percentage of careless responders. The iterative
procedure is better than the noniterative procedure in reducing the bias of the
item parameter estimates when the total percentage of careless responses is
higher, but it also tends to “overcorrect” the negative bias slightly in the discrim-
ination parameter estimates when 10% of respondents are careless (see the plot in
the top row, last column of Figure 5).
Concerning the location parameter estimates, when 10% of respondents are
careless, the bias is not large (between −.1 and .1) when using the full, con-
taminated sample (see Figure 5, bottom row, column 2). After cleansing, either
iterative or noniterative, the bias is still in that range. When the total percentage
of careless responders increases to 30%, the full sample yields a somewhat larger
range of bias (between −.1 and .25; see the plot in the bottom row, second
column of Figure 6). Interestingly, the cleansing procedures actually lead to a
distinct pattern of “outward” bias of location parameter estimates; however, the
iterative procedure does manage to reduce the range of bias somewhat, to between −.1 and .1 (see the plot in the bottom row, last column of Figure 6). Lastly, the
effect of the different calibration samples on the SEs of the discrimination esti-
mates is very similar to that in Study 1 (compare Figures 7 and 2, and Figures 8
and 4). And again, the different calibration samples have little effect on the SEs
of the location estimates.
Table 2 provides more insight into the performance of the cleansing proce-
dures. When 10% of respondents are careless, the noniterative procedure
detected 37% of the careless respondents and the iterative procedure detected
46% of the careless respondents. When 30% of respondents are careless, the
noniterative procedure has power of .30, whereas the iterative procedure has
power of .44. For both 10% and 30% conditions, the iterative procedure yields
a false-positive error rate that is closer to the nominal level of 5%. Table 2 further
breaks down the performance of the noniterative versus iterative procedures in
correctly detecting careless responders who are moderately or extremely
careless.
Consistent with findings from Study 1, both noniterative and iterative proce-
dures are very successful at detecting extreme careless behavior; in the case of
the noniterative procedure, it does even better than in Study 1. This is attributable
to the fact that with overall less extreme careless responses in the data, the item
parameter estimates are less biased, and extreme careless responders become
more conspicuous. In contrast, neither cleansing procedure performs very well
in terms of detecting moderately careless responders. At best, the iterative
cleansing procedure is able to detect 33% of the moderately careless responders.
However, the iterative procedure does have an advantage here. When there are in
total 30% careless responders, the power in detecting moderately careless
responders by the iterative procedure is .30, more than twice as high as the
detection rate achieved by the noniterative procedure (i.e., .14). This explains

FIGURE 7. Empirical standard errors of discrimination and location parameter estimates versus respective true parameter values (realistic careless behavior, 10% carelessness).

FIGURE 8. Empirical standard errors of discrimination and location parameter estimates versus respective true parameter values (realistic careless behavior, 30% carelessness).

TABLE 2.
Performance of the lz Person-Fit Statistic in the Iterative and Noniterative Cleansing Procedures (Realistic Careless Behavior)

                                          10% Carelessness          30% Carelessness
                                          Noniterative  Iterative   Noniterative  Iterative
Number of iterations                          —            2.4          —            3.2
Proportion flagged (of total sample)         .06           .10         .10           .17
Proportion careless (of those flagged)       .59           .47         .91           .79
False-positive error rate                    .029          .059        .013          .050
Detection rate (moderate carelessness)       .23           .33         .14           .30
Detection rate (extreme carelessness)        .97           .98         .93           .99
Detection rate (all)                         .37           .46         .30           .44

the advantage of the iterative procedure in reducing the negative bias of
discrimination estimates (see Figure 6). When there is in total only 10% of
careless responders, the iterative procedure tends to flag more people than
necessary—only 47% of those who are flagged are truly careless. That helps
explain the small positive bias in the item discrimination parameter estimates
in Figure 5, that is, the slight overcorrection of the negative bias as well as
the inflation of SE at the high end of discrimination (see Figure 7, upper
right).
To better illustrate the contrast between extremely and moderately careless
responders, Figure 9 plots lz values against the latent trait estimates for the 10%
carelessness condition. Normal responders are represented by circles, moderately
careless responders by triangles, and extremely careless responders by boxes. In
the left panel, there is clear separation between normal and extremely careless
responders. The horizontal line at lz = −1.65 denotes the lower 5th percentile of the standard normal distribution. A respondent with lz below this line is flagged as aberrant. Nearly all normal responders have lz larger than −1.65, or above that line, and their latent trait estimates spread from very low to very high. The extremely careless responders, however, almost always fall below the line of −1.65. Their latent trait estimates are all around 0, which is to be expected,
because their chance of endorsing each item is 50%. The clear separation
between these two groups of respondents in the left panel of Figure 9 results
in the high power and low false-positive error rate when using lz to detect
careless responders in Study 1. Turning to the right panel of Figure 9, the
separation between normal and careless responders is no longer so clear, partic-
ularly because the normal responders and the moderately careless responders
have considerable overlap. A substantial number of moderately careless respon-
ders (triangles) mingle with the normal responders (circles) above the horizontal


FIGURE 9. lz versus ability estimates under extreme and realistic careless response conditions, respectively (10% carelessness, one replication).


line of −1.65. This is why the power drops sharply in Study 2 as compared with
that in Study 1.
The fact that the iterative procedure manages to nearly eliminate the negative
bias in the discrimination parameter estimates in spite of low power in Study 2
suggests that the bias was largely caused by extremely careless responders. Once
they are identified and their responses removed from the data, item parameter
recovery is substantially improved.

Study 3: Shorter Tests


In order to assess the utility of the iterative cleansing procedure for shorter
scales, Simulations 1 and 2 were repeated using a 10-, 20-, and 30-item scale
drawn from the same item pool. We found that (a) as with a 40-item scale, there
were minimal differences between sample size and choice of ability estimator;
(b) the type-1 error rates were similar for varying test lengths; and (c) shorter
scales exhibited the same general pattern in terms of parameter empirical bias
and SE as a 40-item scale. However, two findings were noteworthy. First, power
in detecting aberrant response patterns decreases with shorter scales, consistent
with findings in previous studies comparing test lengths (Meijer &amp; Sijtsma, 2001). More specifically, we found that our proposed procedure suits scales that contain approximately 20 items or more. Otherwise, the power is substantially reduced.
Second, when an item contains some categories that are sparsely endorsed, the
iterative procedure may lead to convergence issues. Shorter length scales require
a larger number of iterations before convergence for both extreme and realistic
careless behavior. One item exhibited a large bias and SE for the discrimination
parameter when using iterative cleansing in a few replications with a 10-item
test. Further investigation did not find anything strange concerning the item
(discrimination = 1.86, location = 0.62) other than the fact it was a difficult item
with a relatively large discrimination parameter, which caused a sparse number
of participants endorsing the item. This suggests that iterative cleansing may
cause estimation issues with items containing some response categories that are
rarely endorsed.
Overall, the proposed procedures work more effectively when the test length
is 20 or above and when items do not contain very sparse response categories. It
is also important to point out that in spite of the lower power, the iterative
procedures still result in visible improvement in parameter estimation with the
very short scale. For specific results, please see Online Appendix A. Tables A1
and A2 present the average number of iterations for the iterative cleansing
procedure. Tables A3 and A4 present detection rates for both extreme and rea-
listic careless responders, respectively. Figures A1 through A24 present the
average bias and empirical SE for different scale lengths.

Empirical Example
Add Health Scale
We applied both the data removal and iterative procedures to data from the
National Longitudinal Study of Adolescent Health (AddHealth), which is an
education and health-related study of adolescents in Grades 7 through 12 (Harris
& Udry, 2010). We used the base year in-home questionnaire administered
during 1994–1995. We analyzed the Feelings scale, which collects information
about the respondents’ current emotional state. The scale contains 19 items in
which participants respond to Likert-type scale items with 4 response categories
on how often each of the statements apply to the past week (e.g., “You were
bothered by things that usually don’t bother you”). However, participants rarely
endorsed the Categories 3 and 4, “a lot of the time” and “most of the time or all of
the time.” Therefore, we collapsed across both categories to create dichotomous
items consisting of “never or rarely” versus “sometimes or more.” This approach
mirrors similar studies on this scale where a unidimensional IRT model was fit to
the data and found to have adequate fit (Edelen & Reeve, 2007). We evaluated
the model fit again using a 2PLM, which suggests a single factor provides adequate fit (RMSEA = .066, SRMSR = .079, CFI = .951).
The original sample contains 6,504 respondents, but respondents with any
missing data in the Feelings scale were removed, leaving 6,457 respondents.
Three sets of item parameter estimates were obtained, one set from the data
without cleansing (i.e., N ¼ 6,457), one set with one round of cleansing (i.e.,
the noniterative method), and the third set with the iterative method. We eval-
uated all three methods based on the classification of respondents, differences
among the resulting item parameter estimates, and differences among their SEs.
The iterative cleansing procedure converged in five cycles. Without any cleans-
ing, 8.39% of the sample was flagged. With noniterative cleansing, 13.11% of the

TABLE 3.
Item Parameter Estimates for the Feelings Scale Using a Full Sample, Noniterative, and
Iterative Cleansing Procedures

              Full Sample          Noniterative Cleansing      Iterative Cleansing
Item          a (SE)     b (SE)     a (SE)      b (SE)          a (SE)      b (SE)

1 1.41 (.05) 0.41 (.03) 1.46 (.05) 0.41 (.03) 1.57 (.06) 0.43 (.03)
2 0.98 (.04) 0.74 (.04) 1.02 (.04) 0.75 (.04) 1.08 (.04) 0.74 (.04)
3 2.06 (.07) 0.76 (.02) 2.24 (.08) 0.77 (.02) 2.46 (.09) 0.78 (.02)
4 0.85 (.04) 0.77 (.04) 0.88 (.04) 0.93 (.05) 0.95 (.04) 1.00 (.05)
5 1.26 (.04) 0.40 (.03) 1.29 (.05) 0.47 (.03) 1.33 (.05) 0.48 (.03)
6 2.56 (.09) 0.36 (.02) 2.64 (.09) 0.36 (.02) 2.86 (.11) 0.37 (.02)
7 0.97 (.04) 0.38 (.03) 0.99 (.04) 0.45 (.03) 1.04 (.04) 0.45 (.04)
8 0.81 (.04) 1.12 (.06) 0.87 (.04) 1.31 (.06) 1.02 (.05) 1.38 (.06)
9 2.23 (.09) 1.27 (.03) 2.55 (.11) 1.3 (.03) 2.77 (.12) 1.32 (.03)
10 1.26 (.05) 1.01 (.04) 1.39 (.05) 1.01 (.04) 1.53 (.06) 1.03 (.04)
11 1.14 (.04) 0.58 (.03) 1.2 (.05) 0.71 (.03) 1.32 (.05) 0.78 (.03)
12 0.87 (.04) 0.33 (.03) 0.88 (.04) 0.32 (.04) 0.92 (.04) 0.30 (.04)
13 1.96 (.06) 0.48 (.02) 2.1 (.07) 0.49 (.02) 2.27 (.08) 0.49 (.02)
14 0.97 (.04) 0.84 (.04) 1.06 (.04) 0.85 (.04) 1.15 (.05) 0.88 (.04)
15 1.28 (.04) 0.06 (.03) 1.28 (.05) 0.14 (.03) 1.37 (.05) 0.21 (.03)
16 2.15 (.07) 0.08 (.02) 2.29 (.08) 0.07 (.02) 2.52 (.09) 0.09 (.02)
17 1.39 (.05) 0.63 (.03) 1.52 (.05) 0.65 (.03) 1.63 (.06) 0.68 (.03)
18 0.93 (.04) 0.09 (.03) 0.94 (.04) 0.15 (.03) 0.97 (.04) 0.16 (.03)
19 2.27 (.1) 1.51 (.04) 2.92 (.14) 1.53 (.04) 3.95 (.25) 1.51 (.03)

Note. Standard errors are presented in parentheses.

sample was flagged, and with iterative cleansing, 17.39% of the sample was
flagged. Item parameter estimates and their SEs based on the full, cleansed, and
iteratively cleansed samples are presented in Table 3.
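In outline, the three analyses could be produced with the placeholder routines sketched earlier (calibrate_2pl, score_theta, lz_star, iterative_cleanse); feelings below stands for the dichotomized 6,457 × 19 response matrix. This is a hypothetical workflow, not the authors' scripts.

```r
pars_full <- calibrate_2pl(feelings)                        # no cleansing
theta     <- score_theta(feelings, pars_full)
lz        <- sapply(seq_len(nrow(feelings)), function(i)
               lz_star(feelings[i, ], theta[i], pars_full$a, pars_full$b))
pars_once <- calibrate_2pl(feelings[lz >= -1.65, ])         # one round of cleansing
pars_iter <- iterative_cleanse(feelings)$item_parameters    # iterative cleansing
```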
Generally, the results of the empirical data analysis are consistent with the
findings of the simulations. First, cleansing the sample resulted in more respon-
dents being flagged, especially with iterative cleansing. This is presumably due
to the well-documented “masking effect” in outlier detection, that is, some out-
liers do not appear to be as extreme as they should because their presence has
biased the structural parameter estimates (Yuan, Fung, & Reise, 2004; Yuan &
Zhong, 2008). Second, cleansing the sample resulted in noticeable increases in
the discrimination estimates. For instance, for Item 16, the full sample yielded an
estimate of 2.16, whereas the cleansed sample yielded 2.29 and the iteratively cleansed sample yielded 2.52. Larger discrimination estimates were observed in
several other items, along with larger SEs. This was also consistent with what we
observed in the simulation studies. However, we must point out that in the
simulations, we found that the cleansing procedure (especially, the iterative
procedure) can “overcorrect” the negative bias in the discrimination estimates.
Third, the effect of careless responses on the b parameter estimates is not very
pronounced. Table 3 shows that in general, the location estimates varied in both
the magnitude and the direction of change across items. For instance, with Item
16, we can observe small changes from .08 using the full sample compared to
the noniterative method (.07) and the iterative method (.09). In addition, the
SEs for the location estimates did not change much across conditions. In that
case, it is important to balance the improvement in power and the increase in
false-positive errors.

Discussion
In this article, we proposed two methods to detect and treat careless responses
in a calibration sample in order to improve item parameter estimation. The first
method uses the person-fit statistic lz to detect careless responders and remove
them from the data set. The second method also uses lz to detect and remove
careless responders, but it does so by iteratively updating the item parameter
estimates. These two procedures are compared using simulations. We considered
two types of careless responding behavior (extreme and realistic), two total
percentages of careless responders (10% and 30%), two sample sizes (500 and
3,000), and two ways to estimate latent traits (ML and WL). We find that the first
two factors (type of behavior and the percentage of careless responders) had a
substantial impact on the results, whereas the latter two factors (sample size and
ability estimator) were much less influential in the conditions we simulated.
The results suggest that when the percentage of careless responders is small,
the noniterative procedure seems sufficient, and this is true regardless of whether
the careless responses are all extreme or largely moderate. However, when the
percentage of careless responders increases to 30%, the iterative procedure is
much more effective in reducing the large bias in item parameter estimates. The
performance gain in item parameter recovery is due to the increase in power
brought by the iterative procedure. When dealing with extremely careless respon-
ders, the power of the iterative procedure was close to perfect. When dealing with
moderately careless responders, power is reduced but is still much higher than
that of the noniterative procedure. This finding was consistent across varying
scale lengths. Also, the iterative procedure achieves high power while keeping
false-positive error rates close to 5% in all conditions. However, there are
limitations to the proposed iterative procedure. The iterative procedure appears
to suffer from convergence issues when an item contains some categories that are
sparsely endorsed. Power also greatly diminishes when the test length is 10 or
lower. Therefore, we advise researchers to apply the proposed procedures only
when test length is 20 or above, and when there are few items with sparsely
endorsed response categories.

The proposed procedures and our findings are likely to be useful to both
researchers and practitioners. The prevalence of careless responding is well-
documented in the literature. Even when only 10% of the sample consists of
careless responders and most of these respondents respond carelessly to only a
small percentage of items, the negative bias in item discrimination parameter
estimates can be substantial. Further, the magnitude of bias increases with the
true discrimination value and the percentage of careless responders. Such severe
negative bias can lead to serious consequences. In our empirical example, we can
clearly see that both the iterative and noniterative cleansing procedures lead to
larger, and in some cases, substantially larger discrimination parameter esti-
mates. In applications of IRT such as test assembly and computerized adaptive
testing, highly discriminating items are quite valuable: Large discrimination
values generally yield large values of item information, which in turn yield lower
SEs for ability estimates. If such items suffer from negative bias, they may be
erroneously neglected during item selection. With the iterative cleansing proce-
dure, this situation may very well be avoided; for this reason, we strongly suggest that practitioners examine data aberrancy before estimating item parameters.
In this study, we focus on carelessness or inattentiveness that is prevalent in
low-stakes testing. This is not unusual, as other studies have focused specifically
on careless responses (e.g., Huang, Curran, Keeney, Poposki, & DeShon, 2012;
Woods, 2006). However, the iterative cleansing procedure can be used to detect
and treat other types of aberrant responses, for example, speededness or cheating.
Examples of the use of person-fit statistics in detecting these other types of
aberrant responses abound (e.g., de la Torre & Deng, 2008; Goegebeur, De
Boeck, & Molenberghs, 2010; Karabatsos, 2003). When multiple types of aber-
rant response are expected, response style modeling may also provide additional
insight (Bolt & Newton, 2011; Falk & Cai, 2016), particularly when there is
some preknowledge of the type of aberrant behavior that one is looking for and
some idea of how it might manifest itself in the data. In addition, the iterative
procedure can be implemented with appropriateness measures or outlier indices
other than person-fit statistics such as lz or lz*. For instance, respondents with aberrant responses can be iteratively flagged if the Mahalanobis distance between observed and expected responses (Meade & Craig, 2012) is sufficiently large. We chose to focus on lz because of its wide use and success in recent applications (Felt, Castaneda, Tiemensma, & Depaoli, 2017; Tendeiro, 2017), but alternative measures and even nonparametric statistics are well worth exploring.
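As a rough sketch of how such an index could be plugged into the procedure, the following R code flags respondents whose residuals from a fitted 2PLM are far from the bulk of the sample in the Mahalanobis metric; the function name, the reliance on the mirt package, the WLE scoring choice, and the chi-square cutoff are all illustrative assumptions rather than part of the procedure evaluated in this article. The resulting flag could then replace the lz criterion inside the iterative loop.

library(mirt)

# Flag respondents using the Mahalanobis distance between observed responses
# and their model-expected values; `resp` is an N x J matrix of 0/1 responses.
md_flag <- function(resp, alpha = .05) {
  fit   <- mirt(resp, 1, itemtype = "2PL", verbose = FALSE)
  ipar  <- coef(fit, IRTpars = TRUE, simplify = TRUE)$items   # columns a, b, ...
  theta <- fscores(fit, method = "WLE")[, 1]                  # WLE avoids infinite estimates
  # Expected probabilities under the 2PL: P_ij = logistic(a_j * (theta_i - b_j))
  P  <- plogis(sweep(outer(theta, ipar[, "b"], "-"), 2, ipar[, "a"], "*"))
  E  <- resp - P                                              # person-by-item residuals
  d2 <- mahalanobis(E, center = colMeans(E), cov = cov(E))    # squared distances
  # A chi-square reference with J degrees of freedom is only a rough approximation
  d2 > qchisq(1 - alpha, df = ncol(resp))
}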
On the other hand, we caution readers against generalizing the results outside
the range of simulation conditions. For example, the comparison of the two
ability estimation methods, ML and WL, shows that they perform very similarly.
WL should provide less biased estimates with short tests. With our recommended
test length of 20 or longer, its advantage may be diminished. As another example,
in this study, the largest percentage of careless responders is 30%, that is, 30% of respondents responded carelessly to all items. Results indicated that our proposed method works very well in spite of this large amount of aberrant responses.
However, we do not expect the method to perform well when the majority of
responses are careless; in other words, the results should not be generalized to
50% or more. This is because of the well-known “masking effect” in the literature: a large proportion of outliers can bias the model parameter estimates or the model structure to the extent that the outliers are “masked” (Yuan et al., 2004; Yuan & Zhong, 2008). Therefore, any outlier detection method assumes that the majority of the data are valid, and this study is no exception. We
would also like to ask our readers to exercise extreme caution when it comes to
data removal. In this study, we are only using iterative cleansing for the purpose
of improving item parameter estimation and eventually improving the identifi-
cation of outliers. How to deal with the flagged cases in the end rests with the researchers. According to Allalouf, Gutentag, and Baumer (2017), typically human
review should follow statistical quality control procedures. Even for the purpose
of improving item parameter estimation, robust methods that down-weight
flagged response patterns can provide an alternative to removing responses
(Hong & Cheng, 2018). Another possibility is to remove only some responses
in a flagged response pattern instead of the entire response vector. This can be
done by detecting a change point in respondents’ response behavior (Shao, Li &
Cheng, 2016; Yu & Cheng, in press).
Furthermore, the current study can be extended in the following ways. First, it
is clear that the noniterative cleansing procedure yields inadequate power and a
false-positive error rate far below the nominal level when the percentage of
careless responders is large. These results suggest that the empirical distribution
of lz does not follow N(0,1) when the percentage of careless responders is large.
This is to be expected because even though lz takes into account the latent trait
estimate, it does not account for the fact that it is evaluated with item parameter
estimates instead of the true values. Resampling techniques have been used to
develop the empirical distribution of person-fit statistics when the latent trait is
estimated. Future research is warranted to examine the empirical distribution when item parameters are also estimated.
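One direction worth exploring is a parametric bootstrap along the lines sketched below; the replication count, the reliance on mirt's Zh (the standardized lz-type index of Drasgow et al., 1985), and the WLE scoring are illustrative assumptions, not a procedure evaluated here. Data are repeatedly simulated from the calibrated model, the items are recalibrated on each replicate, and the person-fit statistic is recomputed so that its pooled empirical quantiles can replace the N(0,1) critical value.

library(mirt)

# Parametric bootstrap of the person-fit null distribution when item
# parameters are re-estimated on every replicate; `fit` is a fitted 2PL model.
lz_bootstrap_null <- function(fit, n_rep = 200) {
  N <- nrow(extract.mirt(fit, "data"))
  unlist(lapply(seq_len(n_rep), function(r) {
    sim   <- simdata(model = fit, N = N)                # simulate from the fitted model
    refit <- mirt(sim, 1, itemtype = "2PL", verbose = FALSE)
    personfit(refit, method = "WLE")$Zh                 # standardized lz-type index
  }))
}

# e.g., an empirical 5% critical value in place of qnorm(.05):
# crit <- quantile(lz_bootstrap_null(fit), .05)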
Second, the current study adopts the 2PLM as the underlying true model for
item responses. When there are more than two response categories, a polytomous
model should be used instead; for example, many psychological tests include a
Likert-type response scale with more than two response options. For educational
assessments using multiple-choice items, the 3PLM may be more appropriate.
Additionally, we would like to emphasize here that model fit is another important
issue to consider. It is assumed in this study that a particular unidimensional IRT
model fits the data. In practice, model fit must be evaluated before adopting a
person-fit statistic based on any particular IRT model. R packages ltm (Rizopou-
los, 2006) and mirt (Chalmers, 2012) both provide the option to obtain person-fit
statistics for multidimensional IRT models. Once the fit of a particular IRT
model has been established, the iterative procedure can be used in the same
manner to improve item parameter recovery by iteratively identifying and
removing careless responders.
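By way of illustration (the object names and the 5% cutoff below are arbitrary), person-fit statistics for a unidimensional 2PLM fitted to a matrix of dichotomous responses `resp` can be obtained from these packages as follows:

library(ltm)
library(mirt)

fit_ltm  <- ltm(resp ~ z1)                       # 2PL fitted with ltm
pf_ltm   <- person.fit(fit_ltm)                  # L0, lz, and p values per response pattern

fit_mirt <- mirt(resp, 1, itemtype = "2PL", verbose = FALSE)
pf_mirt  <- personfit(fit_mirt, method = "WLE")  # Zh is the standardized lz-type index
flagged  <- pf_mirt$Zh < qnorm(.05)              # one-tailed flag at the 5% level (see Note 2)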
Last but not least, the careless responses studied in this article are generated
independently of respondents’ latent traits and the characteristics of the items.
This may not be the case in practice; for example, respondents with low ability
may be more likely to respond carelessly on a math exam, and they may be more
likely to respond carelessly to difficult questions (Rupp, 2013). In future studies,
the relationships among respondents’ latent traits, item characteristics, and the
propensity to respond carelessly should be considered.

Authors’ Note
This research uses data from AddHealth, a project directed by Kathleen Mullan Harris and
designed by J. Richard Udry, Peter S. Bearman, and Kathleen Mullan Harris at the
University of North Carolina at Chapel Hill and funded by Grant P01-HD31921 from the
Eunice Kennedy Shriver National Institute of Child Health and Human Development,
with cooperative funding from 23 other federal agencies and foundations.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research,
authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, author-
ship, and/or publication of this article: This research was supported by the CTB/McGraw-Hill R&D Research Grant.

Notes
1. The study by Glas and Dagohoy (2007) used the Lagrange multiplier test as a
person-fit test for polytomous item response theory (IRT) models. It takes into
account both the effects of latent trait estimation and item parameter estima-
tion. It is also applicable to multidimensional IRT models. It is therefore a
versatile method, but its focus is on the person-fit test itself, not on how to treat aberrant responses after they are detected in order to improve item calibration.
2. Note that only large negative values of lz are indicative of an aberrant response
pattern (or “misfit”), hence the use of a one-tailed hypothesis test. This is
typically how lz is used to detect person misfit. On the other hand, large
positive values are indicative of “overfit,” that is, the responses are more
consistent with the model than we would expect. But in spite of the negative
connotation of the term, overfit is rarely considered a problem.
3. Functionally, weighted likelihood (WL) estimation does incorporate a prior.
In fact, this is equivalent to using a Bayesian modal estimator with the Jeffreys prior (Warm, 1989). However, the WL prior is uninformative in the sense that it
is only a function of the test at hand and is otherwise independent of the user.
4. Alternatively, the iterative procedure could end when the maximum change in
item parameter estimates, ability estimates, or lz values does not exceed some
threshold. However, the final composition of the calibration sample (and thus
the final values of the item parameter estimates) depends entirely on the
classification of respondents. So a more practically motivated stopping criter-
ion is to stop the procedure when the change in classifications between itera-
tions is sufficiently small.
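To make this stopping rule concrete, a minimal sketch in R follows, assuming the 2PLM, a matrix `resp` of dichotomous responses, WL (Warm) ability estimates, and a 5% one-tailed flag; the function names, the reliance on the mirt package, and the tuning constants are illustrative assumptions rather than the exact implementation used in our simulations.

library(mirt)

# Standardized log-likelihood person-fit statistic for a single response vector
lz_stat <- function(x, theta, a, b) {
  P  <- plogis(a * (theta - b))
  P  <- pmin(pmax(P, 1e-10), 1 - 1e-10)            # numerical safeguard
  l0 <- sum(x * log(P) + (1 - x) * log(1 - P))
  mu <- sum(P * log(P) + (1 - P) * log(1 - P))
  v  <- sum(P * (1 - P) * log(P / (1 - P))^2)
  (l0 - mu) / sqrt(v)
}

# Iterative cleansing with the classification-based stopping rule described in Note 4
iterative_cleanse <- function(resp, alpha = .05, max_iter = 20) {
  colnames(resp) <- paste0("Item_", seq_len(ncol(resp)))
  flagged <- rep(FALSE, nrow(resp))
  for (i in seq_len(max_iter)) {
    # Recalibrate the items using only the currently retained respondents
    fit   <- mirt(resp[!flagged, , drop = FALSE], 1, itemtype = "2PL",
                  verbose = FALSE)
    ipar  <- coef(fit, IRTpars = TRUE, simplify = TRUE)$items
    # Re-score every respondent (retained or not) with the updated item parameters
    theta <- fscores(fit, method = "WLE", response.pattern = resp)[, "F1"]
    lz    <- vapply(seq_len(nrow(resp)), function(n)
               lz_stat(resp[n, ], theta[n], ipar[, "a"], ipar[, "b"]),
               numeric(1))
    new_flag <- lz < qnorm(alpha)                  # one-tailed flag (see Note 2)
    if (all(new_flag == flagged)) break            # classifications have stabilized
    flagged <- new_flag
  }
  list(flagged = flagged, item_parameters = ipar)
}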

References
Allalouf, A., Gutentag, T., & Baumer, M. (2017). Quality control for scoring tests admi-
nistered in continuous mode: An NCME instructional module. Educational Measure-
ment: Issues and Practice, 36, 58–68. doi:10.1111/emip.12140
Baer, R. A., Ballenger, J., Berry, D. T. R., & Wetter, M. W. (1997). Detection of random
responding on the MMPI-A. Journal of Personality Assessment, 68, 139–151.
Berry, D. T. R., Wetter, M. W., Baer, R. A., Larsen, L., Clark, C., & Monroe, K. (1992).
MMPI-2 random responding indices: Validation using a self-report methodology.
Psychological Assessment, 4, 340–345.
Bolt, D. M., & Newton, J. R. (2011). Multiscale measurement of extreme response style.
Educational and Psychological Measurement, 71, 814–833. doi:10.1177/
0013164410388411
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R
environment. Journal of Statistical Software, 48, 1–29.
Clark, M. E., Gironda, R. J., & Young, R. W. (2003). Detection of back random respond-
ing: Effectiveness of MMPI-2 and personality assessment inventory validity indices.
Psychological Assessment, 15, 223–234.
Conijn, J. M., Emons, W. H. M., & Sijtsma, K. (2014). Statistic lz-based person-fit
methods for noncognitive multiscale measures. Applied Psychological Measurement,
38, 122–136.
De Ayala, R. J., Plake, B., & Impara, J. C. (2001). The effect of omitted responses on the
accuracy of ability estimation in item response theory. Journal of Educational Mea-
surement, 38, 213–234.
de la Torre, J., & Deng, W. (2008). Improving person fit assessment by correcting the
ability estimate and its reference distribution. Journal of Educational Measurement,
45, 159–177.
Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement
with polytomous item response models and standardized indices. British Journal of
Mathematical and Statistical Psychology, 38, 67–86.
Edelen, M. O., & Reeve, B. B. (2007). Applying item response theory (IRT) modeling to
questionnaire development, evaluation, and refinement. Quality of Life Research, 16,
5–18.
Falk, C. F., & Cai, L. (2016). A flexible full-information approach to the modeling of
response styles. Psychological Methods, 21, 328–347. doi:10.1037/met0000059
Felt, J. M., Castaneda, R., Tiemensma, J., & Depaoli, S. (2017). Using person fit statistics
to detect outliers in survey research. Frontiers in Psychology, 8, 863.
Gao, F., & Chen, L. (2005). Bayesian or non-Bayesian: A comparison study of item
parameter estimation in the three-parameter logistic model. Applied Measurement in
Education, 18, 351–380.
Glas, C. A. W., & Dagohoy, A. V. T. (2007). A person fit test for IRT models for
polytomous items. Psychometrika, 72, 159–180.
Goegebeur, Y., De Boeck, P., & Molenberghs, G. (2010). Person fit for test speededness:
Normal curvatures, likelihood ratio tests and empirical Bayes estimates. Methodology,
6, 3–16.
Hanson, B. A., & Beguin, A. A. (2002). Obtaining a common scale for item
response theory item parameters using separate versus concurrent estimation
in the common item equating design. Applied Psychological Measurement,
26, 3–24.
Harris, K. M., & Udry, J. R. (2010). National Longitudinal Study of Adolescent Health
(Add Health), 1994-2008: Core files restricted use. Inter-university Consortium for
Political and Social Research.
Hong, M. R., & Cheng, Y. (2018). Robust maximum marginal likelihood (RMML) esti-
mation for item response theory models. Behavior Research Methods. Advance online
publication. doi:10.3758/s13428-018-1150-4
Huang, J. L., Curran, P. G., Keeney, J., Poposki, E. M., & DeShon, R. P. (2012). Detecting
and deterring insufficient effort responding to surveys. Journal of Business and Psy-
chology, 27, 99–114.
Hulin, C. L., Lissak, R. I., & Drasgow, F. (1982). Recovery of two- and three-parameter
logistic item characteristic curves: A Monte Carlo study. Applied Psychological Mea-
surement, 6, 249–260.
Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-
six person-fit statistics. Applied Measurement in Education, 16, 277–298.
Kim, S. (2006). A comparative study of IRT fixed parameter calibration methods. Journal
of Educational Measurement, 43, 355–381.
Magis, D., Raîche, G., & Béland, S. (2012). A didactic presentation of Snijders's lz* index
of person fit with emphasis on response model selection and ability estimation. Journal
of Educational and Behavioral Statistics, 37, 57–81.
Meade, A. W., & Craig, S. B. (2012). Identifying careless responses in survey data.
Psychological Methods, 17, 437–455.
Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied
Psychological Measurement, 25, 107–135.
Nering, M. L., & Meijer, R. R. (1998). A comparison of the person response function and
the lz person fit statistic. Applied Psychological Measurement, 22, 53–69.
Oshima, T. C. (1994). The effect of speededness on parameter estimation in item response
theory. Journal of Educational Measurement, 31, 200–219.
R Core Team. (2013). R: A language and environment for statistical computing (Version
2.15.2) [Computer software]. Vienna, Austria: R Foundation for Statistical Comput-
ing. Retrieved from http://www.R-project.org
Rizopoulos, D. (2006). ltm: An R package for latent variable modelling and item response
theory analyses. Journal of Statistical Software, 17, 1–25.
Rupp, A. A. (2013). A systematic review of the methodology for person fit research in
item response theory: Lessons about generalizability of inferences from the design of
simulation studies. Psychological Test and Assessment Modeling, 55, 3–38.
Shao, C., Li, J., & Cheng, Y. (2016). Detection of test speededness using change-point
analysis. Psychometrika, 81, 1118–1141.
Snijders, T. A. B. (2001). Asymptotic null distribution of person fit statistics with esti-
mated person parameters. Psychometrika, 66, 331–342.
Tendeiro, J. N. (2017). The lz(p)* person-fit statistic in an unfolding model context.
Applied Psychological Measurement, 41, 44–59.
van Barneveld, C. (2007). The effect of respondent motivation on test construction within
an IRT framework. Applied Psychological Measurement, 31, 31–46.
Van Krimpen-Stoop, E., & Meijer, R. (1999). The null distribution of person-fit statistics for
conventional and adaptive tests. Applied Psychological Measurement, 23, 327–345.
Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory.
Psychometrika, 54, 427–450.
Wise, S. L., & DeMars, C. E. (2003, June). Respondent motivation in low-stakes assess-
ment: Problems and potential solutions. Paper presented at the annual Assessment
Conference of the American Association of Higher Education, Seattle, WA.
Wise, S. L., Kingsbury, G. G., Thomason, J. T., & Kong, X. (2004, April). An investigation of
motivation filtering in a statewide achievement testing program. Paper presented at the
annual meeting of the National Council on Measurement in Education, San Diego, CA.
Woods, C. M. (2006). Careless responding to reverse-worded items: Implications for
confirmatory factor analysis. Journal of Psychopathology and Behavioral Assessment,
28, 186–191.
Yu, X., & Cheng, Y. (in press). A change-point analysis procedure based on weighted
residuals to detect back random responding. Psychological Methods.
Yuan, K.-H., Fung, W. K., & Reise, S. (2004). Three Mahalanobis-distances and their role
in assessing unidimensionality. British Journal of Mathematical and Statistical Psy-
chology, 57, 151–165.
Yuan, K.-H., & Zhong, X. (2008). Outliers, leverage observations and influential cases in
factor analysis: Minimizing their effect using robust procedures. Sociological Metho-
dology, 38, 329–368.
Zimowski, M., Muraki, E., Mislevy, R. J., & Bock, R. D. (2003). BILOG-MG 3: Item
analysis and test scoring with binary logistic models [Computer software]. Chicago,
IL: Scientific Software.

Authors

JEFFREY M. PATTON is a principal psychometrician at the Financial Industry Regulatory Authority, 9509 Key West Ave., Rockville, MD 20850; email: [email protected]. His research interests primarily focus on the application of computational and machine learning methods to psychometrics.

YING CHENG is an associate professor in the Department of Psychology at the University of Notre Dame, 390 Corbett Family Hall, Notre Dame, IN 46556; email: [email protected]. Her primary research interests include theoretical development of item response theory and its applications to large-scale assessment, and data mining of educational and psychological assessment data.
MAXWELL HONG is a graduate student in the Department of Psychology at the University of Notre Dame, E418 Corbett Family Hall, Notre Dame, IN 46556; email: [email protected]. He is interested in innovative psychometric methodologies and their application to both psychological and educational constructs.
QI DIAO is a senior psychometrician at Educational Testing Service, 660 Rosedale Road,
MS-06P, Princeton, NJ 08540; email: [email protected]. Her research interests are optimal
design, adaptive testing, and IRT models.

Manuscript received February 2, 2017
First revision received January 6, 2018
Second revision received August 3, 2018
Accepted December 6, 2018
