
Practical Assessment, Research, and Evaluation

Volume 23 (2018), Article 5

Scale Pretesting
Matt C. Howard

Follow this and additional works at: https://scholarworks.umass.edu/pare

Recommended Citation
Howard, Matt C. (2018) "Scale Pretesting," Practical Assessment, Research, and Evaluation: Vol. 23 ,
Article 5.
DOI: https://doi.org/10.7275/hwpz-jx61
Available at: https://scholarworks.umass.edu/pare/vol23/iss1/5

This Article is brought to you for free and open access by ScholarWorks@UMass Amherst. It has been accepted for
inclusion in Practical Assessment, Research, and Evaluation by an authorized editor of ScholarWorks@UMass
Amherst. For more information, please contact [email protected].

A peer-reviewed electronic journal.


Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission
is granted to distribute this article for nonprofit, educational purposes if it is copied in its entirety and the journal is credited. PARE has the
right to authorize third party reproduction of this article in print, electronic and database forms.
Volume 23 Number 5, April 2018 ISSN 1531-7714

Scale Pretesting
Matt C. Howard, University of South Alabama

Scale pretests analyze the suitability of individual scale items for further analysis, whether through
judging their face validity, wording concerns, and/or other aspects. The current article reviews scale
pretests, separated by qualitative and quantitative methods, in order to identify the differences,
similarities, and even existence of the various pretests. This review highlights the best practices and
objectives of each pretest, resulting in a guide for the ideal applications of each method. This is
followed by a discussion of eight questions that can direct future research and practice regarding scale
pretests. These questions highlight aspects of scale pretests that are still largely unknown, thereby
posing a barrier to their successful application.

Most guides for the scale development process suggest that researchers and practitioners should begin by generating an over-representative item list, which helps ensure adequate content coverage (Hinkin, 1995, 1998; MacKenzie et al., 2011; Meade & Craig, 2012). These guides typically suggest that the second step should be a reduction of this item list via exploratory (EFA) or confirmatory factor analysis (CFA) to minimize construct contamination. An increasing number of authors, however, have suggested that a distinct intermediate step should be taken between item development and EFA/CFA (Anderson & Gerbing, 1991; DeVellis, 2016; Hardesty & Bearden, 2004; MacKenzie et al., 2011). This intermediate step is the scale pretest.

Most often, scale pretests use a small number of participants (i.e., 5 to 30) to initially reduce the item list before reducing it further via EFA or CFA. As prior authors have suggested (DeVellis, 2016; Presser et al., 2004), scale pretests have arisen primarily for four reasons. First, many recommended sample sizes for EFA and CFA depend on the number of items, such as 10 participants for every item analyzed (Brown, 2015; Hinkin, 1995, 1998; Howard, 2016; Thompson, 2004). If the initial item list is large, it may be difficult – if not impossible – for some researchers to obtain a sufficient sample size, but an item-sort task can reduce the item list into a more manageable size for EFA or CFA. Second, even with a reduced item list, a sufficient sample size may still be unobtainable. In these instances, scale pretests have been used in place of EFA or CFA. Third, scales may need to be created for constructs that are not central to the research effort. In these cases, it may be unreasonable for a researcher or practitioner to undergo the full scale development process, but scale pretests can provide some inferences regarding the ability of a developed scale to gauge its intended construct. Fourth, some pretest methods can ascertain aspects of items that cannot be identified through EFA or CFA (Presser et al., 2004).

Discussions of pretest methods are beginning to appear in broader reviews of the scale development process, but focused reviews of pretests are still scarce (Hunt et al., 1982; Howard & Melloy, 2016; Presser et al., 2004). As shown below, the dearth of pretest reviews results in the application of many different pretest methods, but authors rarely provide justification for applying their chosen method. Likewise, notable differences can be seen between applications of the same pretest method. This suggests that pretest methods are possibly being used in a haphazard manner, and researchers may be applying pretest methods that are not ideal for their research needs. Due to these concerns, we contend that researchers and practitioners may be unaware of the differences, best practices, and even existence of the various pretest methods. To address these concerns and prompt a more systematic application of scale pretests, we review the best practices of scale pretesting and identify eight questions to direct future research.

Scale Pretests

The goal of a scale pretest is to identify items that may be justifiably retained for further testing. The manner in which pretest methods achieve this goal differs, but it is often consistent with whether the pretest is quantitative or qualitative. Most often, quantitative pretests obtain a numerical measure of face validity, which is assumed to contribute to the overall construct validity of the eventual scale (DeVellis, 2016; Hardesty & Bearden, 2004; Howard & Melloy, 2016). Construct validity is "the degree to which a test measures what it claims, or purports, to be measuring" (Brown, 1996, p. 231). The construct validity of a scale can never be known, but it is supported by the cumulative results of the scale development process. Face validity is the extent that a scale or item is subjectively judged to represent its intended construct. A scale consisting of items with adequate face validity is often assumed to have adequate construct validity (although this is not always the case). For this reason, items that are judged to have sufficient face validity are retained for further analysis when using quantitative pretest methods.

On the other hand, qualitative pretest methods do not judge the validity of items as directly as quantitative pretest methods (Blair et al., 2013; Fowler, 2013; Presser et al., 2004). Instead, qualitative pretest methods primarily identify whether items have certain wording concerns, such as being double-barreled, leading, or confusing (Leech, 2002). Some qualitative pretest methods are able to identify items with face validity concerns, but these pretest methods do not provide a direct numerical indicator that can, for example, be used to rank the items by their face validity. For this reason, these qualitative pretest methods may be able to remove items with large face validity concerns, but they cannot be used to solely retain the items with the greatest face validity. Below, both quantitative and qualitative pretest methods are reviewed.

Quantitative Pretest Methods

Three quantitative pretest methods are reviewed: item-rating tasks, item-sort tasks, and Hinkin and Tracey's (1999) ANOVA method. These methods were chosen for their popularity and importance, but we also provide brief summaries of lesser-used quantitative pretest methods.

Item-Rating Task

Item-rating tasks and item-sort tasks are among the most popular quantitative pretest methods (Anderson & Gerbing, 1991; DeVellis, 2016; Hardesty & Bearden, 2004; Howard & Melloy, 2016; Hunt et al., 1982; Lawshe, 1975). Despite the popularity of the former, many authors do not call item-rating tasks as such, instead only calling the procedure a pretest or assessment. We label this method an item-rating task to avoid any confusion.

To perform an item-rating task, participants are given a definition of the focal construct. Then, they are provided each item and asked to evaluate the extent that the item represents the focal construct. As noted by Hardesty and Bearden (2004), a common response scale consists of "clearly representative," "somewhat representative," and "not representative," but authors may also use other response scales, such as "very good," "good," "fair," and "poor." No firm rules exist for the recommended sample size for item-rating tasks, but researchers typically use sample sizes ranging from 10 to 30 (Anderson & Gerbing, 1991; Goetz et al., 2013; Heene et al., 2014).

Once responses have been collected, three approaches are most popular to make item retention decisions. First, a sumscore can be calculated for each item. Each response choice is assigned a corresponding value (e.g., very good – 4, good – 3, fair – 2, poor – 1); responses are summed for each item; and the highest scoring items are retained. Second, items that receive a certain percentage of the highest (e.g., very good) or two highest responses are retained. Third, items that receive any of the lowest response (e.g., poor) are discarded. In one of the few studies on item-rating tasks, Hardesty and Bearden (2004) provided support that the first and second approaches provide the most accurate item-rating task results, as defined by the likelihood that the approach replicated the item retention decisions of the entire scale development process.

When these three approaches are applied, authors often use a numerical cutoff that will retain a certain number of items, rather than an a priori chosen number (Hardesty & Bearden, 2004; Howard & Melloy, 2016). For instance, a researcher may be interested in creating a reduced item list of 30 items.

When using the sumscore approach, 11 items may have received a score of 24 or greater, 33 items may have received a score of 23 or greater, and 40 items may have received a score of 22 or greater. If this were the case, the researcher would likely use a sumscore cutoff of 23 in order to retain 33 items for subsequent analyses.
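For readers who want to script this step, the brief sketch below (Python; not part of the original article) applies the sumscore and percentage approaches to a hypothetical matrix of SME ratings. The response coding (very good = 4 through poor = 1) and the cutoff of 23 follow the examples above; the data themselves are simulated.

```python
import numpy as np

# Hypothetical item-rating data: one row per rater (n = 10), one column per
# item (k = 45). Responses are coded as in the example above:
# very good = 4, good = 3, fair = 2, poor = 1.
rng = np.random.default_rng(seed=1)
ratings = rng.integers(low=1, high=5, size=(10, 45))

# Approach 1: sumscore per item; retain items at or above a chosen cutoff
# (the cutoff of 23 mirrors the worked example above).
sumscores = ratings.sum(axis=0)
retained_by_sumscore = np.flatnonzero(sumscores >= 23)

# Approach 2: retain items that a chosen percentage of raters marked with
# one of the two highest response options (here, at least 80%).
prop_top_two = (ratings >= 3).mean(axis=0)
retained_by_percentage = np.flatnonzero(prop_top_two >= 0.80)

print(len(retained_by_sumscore), "items retained by the sumscore cutoff")
print(len(retained_by_percentage), "items retained by the percentage cutoff")
```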
While item-rating tasks have been successfully used in ample prior studies, the method poses certain concerns. Item-rating tasks may be poor at identifying items that represent more than one construct. If an item represents the focal construct and an alternative construct equally well, most researchers would not want this item in their final scale; however, an item-rating task may identify this item as adequately representing the focal construct.

Further, using item retention cutoffs that will retain a certain number of items may be useful, but this method goes against the notion of statistical testing. That is, statistical decisions are almost always made through a priori guidelines with statistical justifications, such as p-values, confidence intervals, and effect size guidelines (Bosco et al., 2015; Cohen, 1992, 1994; Nakagawa & Cuthill, 2007). Without such justifications, it should be questioned whether this approach is a true statistical method. More importantly, it should be questioned whether this method provides accurate and statistically-supported results. For instance, an item with 80% of respondents reporting "very good" may not be significantly more representative than an item with 75% of respondents reporting "very good." Also, using cutoffs to retain a certain number of items results in different values being used from study to study, even if the number of items and participants remains the same, which again draws into question the validity of this method.

Item-Sort Task

Fortunately, another method alleviates some of the concerns noted above: the item-sort task. To perform an item-sort task, participants are given a detailed definition of the focal construct(s) as well as several other theoretically similar constructs. Then, participants are instructed to indicate which construct they believe each item best represents. The list of choices should include the focal construct(s), other theoretically similar constructs, and an "any other construct" option. Typically, sample sizes for item-sort tasks include between 20 and 30 participants, but Howard and Melloy (2016) show that sample sizes as small as five can be used.

Once responses have been collected, authors calculate the number of times that each item was considered representative of the focal construct (Anderson & Gerbing, 1991). Items with a sufficient number of assignments to the focal construct are considered representative of that construct and not others. Two approaches can be used to make item retention decisions. First, authors can choose a cutoff that would result in the desired number of items being retained. Second, authors can use the cutoff values provided by Howard and Melloy (2016) that are based on traditional statistical significance testing. Using this latter approach, the results of item-sort tasks have a sound statistical justification and have been shown to replicate EFA results.

Further, no matter the approach, item-sort tasks can address the noted concern of item-rating tasks. Item-sort tasks are not only able to identify items that poorly represent the focal construct, but they are also able to identify items that may represent multiple constructs. If an item represents two constructs equally well, then only half of the participants would be expected to indicate that it represents the focal construct. Using the cutoff values provided by Howard and Melloy (2016), an item that is considered representative of the focal construct half of the time is not statistically significant no matter the sample size. Likewise, items that only partially represent other constructs can still be identified using item-sort tasks (Anderson & Gerbing, 1991; Howard & Melloy, 2016). Thus, item-sort tasks address the notable concerns of item-rating tasks, while still providing the benefits of this other method.
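A minimal sketch of the item-sort counting step is given below (Python; hypothetical data). The significance check uses a one-sided exact binomial test against a .50 chance rate, an assumption chosen here only because the article notes that an item sorted onto the focal construct half of the time is never significant; it is not Howard and Melloy's (2016) published formula, which should be consulted for actual cutoff values.

```python
import numpy as np
from scipy.stats import binomtest

# Hypothetical item-sort data: sorts[p, i] is the construct that participant p
# assigned to item i ("focal", "alt1", "alt2", or "other").
sorts = np.array([
    ["focal", "focal", "alt1",  "focal"],
    ["focal", "alt1",  "alt1",  "focal"],
    ["focal", "focal", "other", "focal"],
    ["focal", "alt2",  "alt1",  "focal"],
    ["focal", "focal", "alt1",  "focal"],
])
n_participants, n_items = sorts.shape

for item in range(n_items):
    hits = int((sorts[:, item] == "focal").sum())
    # One-sided exact binomial test against a .50 chance rate -- an assumption
    # used only for illustration; see Howard and Melloy (2016) for their formula.
    p_value = binomtest(hits, n_participants, p=0.5, alternative="greater").pvalue
    print(f"Item {item + 1}: {hits}/{n_participants} focal sorts, p = {p_value:.3f}")
```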
Hinkin and Tracey's ANOVA Method

A third quantitative pretest is Hinkin and Tracey's (1999) ANOVA method, which was intended to be an improvement beyond item-rating and item-sort tasks. Participants are given a detailed definition of the focal construct as well as several other theoretically-related constructs. Then, the participants are provided each item and asked to evaluate the extent that the item represents each of the construct choices. The typical response scale ranges from 1 (not at all) to 5 (completely). While no firm guideline exists for sample size requirements, Hinkin and Tracey (1999) used samples of 57 and 173, but they also noted that samples of 30 may be acceptable.

Once responses have been collected, a one-way ANOVA is performed for each item, comparing the item's mean value for each category. If an item has a significantly greater value for a certain category, then it is considered representative of that construct and not others.
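The sketch below (Python; simulated ratings for a single item) illustrates the per-item ANOVA in simplified form, treating construct category as an independent-groups factor even though the same participants rate every construct. Hinkin and Tracey's (1999) procedure also examines whether the focal construct's mean is the significantly higher one; here that is approximated by simply checking which category has the largest mean.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical ratings for ONE item: each row is a participant, each column is
# how well the item represents a construct, rated 1 (not at all) to 5 (completely).
construct_names = ["conscientiousness", "extraversion", "neuroticism"]
ratings = np.array([
    [5, 2, 1],
    [4, 3, 2],
    [5, 1, 1],
    [4, 2, 2],
    [5, 2, 1],
    [3, 2, 1],
])

# One-way ANOVA comparing the item's mean rating across construct categories.
f_stat, p_value = f_oneway(*(ratings[:, j] for j in range(ratings.shape[1])))
highest = construct_names[int(np.argmax(ratings.mean(axis=0)))]

print(f"F = {f_stat:.2f}, p = {p_value:.4f}; highest mean rating: {highest}")
# Illustrative retention rule: keep the item if the ANOVA is significant and
# the highest-rated construct is the intended focal construct.
keep = (p_value < .05) and (highest == "conscientiousness")
print("Retain item:", keep)
```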
Hinkin and Tracey (1999) suggested that their method could gauge the extent that an item represents multiple constructs, which was an improvement beyond item-rating tasks. They also suggested that item-sort tasks do not rely on statistical testing or take into account "the extent to which an item may correspond to a given dimension" (Hinkin & Tracey, 1999, p. 180). While their method achieves these goals, recent developments in item-sort tasks also satisfy these goals.

Despite the proposed benefits, Hinkin and Tracey's (1999) method is not as widespread as item-rating tasks or item-sort tasks. While the reason is unclear, some suggestions can be made. First, Hinkin and Tracey's (1999) sample sizes in the demonstration of their method were very large for scale pretests, and researchers may have been wary of their method's accuracy with samples smaller than their examples. Second, providing individual ratings for each item in regard to each possible construct is cognitively taxing for participants. Researchers may have felt that most participants would not be motivated or have the ability to provide accurate ratings. Third, researchers may have believed that Hinkin and Tracey's (1999) method was not a sufficient improvement beyond item-rating tasks and item-sort tasks, as the application of these two methods persisted after Hinkin and Tracey (1999). Fourth, this method is more involved than item-rating and item-sort tasks, and researchers may prefer the easier alternatives. Despite these possibilities, there seem to be no statistical concerns with Hinkin and Tracey's (1999) ANOVA method, and the method may still be able to provide insightful results regarding initial item lists.

Other Quantitative Methods

Most other quantitative pretest methods are variants of the item-rating task. For instance, researchers have asked participants to rate the importance or difficulty of items, rather than their ability to gauge the focal construct (Coste et al., 1997; Goetz et al., 2013; Smith et al., 2000). These studies typically use the same guidelines as standard item-rating tasks, but they are used when item relevance may not be the most important determinant to retaining items.

Schriesheim and colleagues (1993) also developed a pretest method. Participants are provided each item and asked to evaluate the extent that the item represents each of the construct choices. The data is then used to calculate a q-correlation matrix, and this matrix is subject to a principal components analysis. The item loadings can be used to determine whether an item is representative of a construct. Despite being more sophisticated, Schriesheim and colleagues' (1993) method has not seen as much use as item-rating and item-sort tasks. This may be because Hinkin and Tracey (1999) directly compared their method to Schriesheim and colleagues' (1993) method, and Hinkin and Tracey (1999) argued that their method was superior.

Lastly, other quantitative pretest methods have seen modest use and provide little beyond the methods detailed above (Blair et al., 2013; Clark & Watson, 1995; Fowler, 2013; Goetz et al., 2013; Hardesty & Bearden, 2004; Rea & Parker, 2014). We do not review these methods, and instead turn to another important category of scale pretests: qualitative methods.

Qualitative Methods

Three qualitative pretest methods are reviewed in the following: cognitive interviews, focus groups, and traditional interviews. These were also selected for their popularity and importance, but we also provide brief summaries of other qualitative pretest methods.

Cognitive Interviews

The origins of cognitive interviewing date back to between the 1940s and 1970s (Belson, 1981; Cantril & Fried, 1944), in which researchers applied variations of the method with little standardization in their approaches. It was not until the 1980s that researchers more strongly considered the utility and accuracy of the approach. This shift, paired with the creation of several federally-funded "cognitive laboratories," began a more systematic application of cognitive interviewing as a scale pretest method (see Presser et al., 2004 for a review).

To perform a cognitive interview, participants complete the over-representative item list, and information is collected regarding the process of answering each item. Most often, cognitive interviews involve verbal data collection (Beatty & Willis, 2007; Presser et al., 2004), which requires the researcher to be present.

The recorded information is then used to evaluate whether the participant perceives the item as intended and/or whether the participant had difficulty understanding the item, both of which may be indicative of item quality (Beatty & Willis, 2007). In the words of Presser and colleagues (2004), a cognitive interview is "essentially a dress rehearsal" (p. 110), but the nature of the "dress rehearsal" may differ in many regards.

When performing a cognitive interview, researchers may use think-alouds, probes, or a combination of both. A think-aloud is when a participant is asked to speak their thoughts while completing the items, which may uncover any item confusion. For an item intended to gauge conscientiousness, a participant may say, "The item reads, I am organized and a hard-working worker. Well, I am organized, but I am not a hard-worker. I guess that I will mark strongly disagree." This would indicate that the item has concerns. On the other hand, probes are prompts given to participants about the items. Beatty (2004) identified several types of probes, including re-orienting (asking for an answer), elaborating (asking for information), cognitive (asking for introspection), confirmatory (asking for confirmation), expansive (asking for elaboration), functional (asking for clarification), and feedback (providing information). Although each probe provides useful information, there seems to be no consensus regarding when to use them. Several authors have suggested, however, that trained interviewers are better at choosing the correct occasion than untrained interviewers (Beatty, 2004; Beatty & Willis, 2007; Presser et al., 2004).

Also, researchers may choose to apply concurrent or retrospective reporting. Proponents of concurrent reporting argue that participants may be unable to remember their thoughts about particular items after the fact, and only information about the overall item list may be accurate (Beatty & Willis, 2007; Willis, 2004). Alternatively, proponents of retrospective reporting argue that responding to prompts alters participants' thought processes while completing the survey, and the social interaction involved with prompting during administration may alter the response process (Beatty & Willis, 2007; Willis, 2004). It appears that more authors recommend the use of retrospective reporting, but it is always strongly recommended that researchers understand the benefits and detriments of each approach before performing a cognitive interview.

Researchers also need to determine how to analyze cognitive interview results. It is difficult to determine whether a participant interpreted an item correctly or whether they "missed the mark" altogether. Likewise, it is difficult to determine whether a participant struggled "too much," but it is up to the researcher to draw these lines. Resources exist to determine coding guidelines (Beatty, 2004; Willis, 2004), but no "hard and fast" rules exist.

Lastly, researchers must choose whether to use non-essential coding methods. Two of the most popular are behavior coding and response latency. Behavior coding involves coding the reports and/or behavior of participants and interviewers, such as whether an item was read incorrectly (Van der Zouwen & Smit, 2004). Items with many atypical behaviors should be removed. Response latency involves recording the time it takes to answer a question (Bassili & Scott, 1996; Draisma & Dijkstra, 2004). Items with longer latencies are believed to perform poorly, and they should be removed. Both methods need further research before they can be applied reliably (Beatty, 2004; Beatty & Willis, 2007; Presser et al., 2004; Willis, 2004).
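As a simple illustration of the response latency idea (not a procedure taken from the cited sources), the sketch below flags items whose median answering time is unusually long relative to the other items. The specific threshold is arbitrary, which reflects the point above that no reliable application guidelines yet exist.

```python
import numpy as np

# Hypothetical response latencies (seconds): one row per participant,
# one column per item.
latencies = np.array([
    [4.2, 3.8,  9.5, 4.0, 4.4],
    [3.9, 4.1,  8.7, 3.6, 4.0],
    [5.0, 4.4, 11.2, 4.8, 4.6],
    [4.1, 3.5, 10.1, 3.9, 4.2],
])

# Median latency per item, then an (arbitrary) outlier rule: flag items whose
# median lies more than 1.5 interquartile ranges above the third quartile.
item_medians = np.median(latencies, axis=0)
q1, q3 = np.percentile(item_medians, [25, 75])
threshold = q3 + 1.5 * (q3 - q1)

flagged = np.flatnonzero(item_medians > threshold)
print("Items flagged for long response latency:", (flagged + 1).tolist())
```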
While cognitive interviews are widely used, prior studies have discovered some concerns. DeMaio and Landreth (2004) showed that cognitive interviews vary greatly, and cognitive interviews performed by two separate organizations may produce very different results. Even when the same cognitive interviewing techniques are used, inter-rater agreement is often low (Conrad & Blair, 2004; DeMaio & Landreth, 2004; Presser & Blair, 1994). Likewise, little research has compared the utility of multiple qualitative pretest methods. Other less-cumbersome pretests may identify poor items at a similar, or even better, rate than cognitive interviewing.

Focus Groups

Focus groups are used to gather a wide range of experiences from several diverse participants. Often, focus groups are used during the item generation phase to produce items from multiple perspectives and ensure that the entire content domain of a construct is gauged (Brod et al., 2009; Sweeney & Soutar, 2001). The method can also be used immediately after the item generation phase to ensure that the items are free from wording concerns and represent the focal construct (DeVellis, 2016; Lynn, 1986; Kim et al., 1999). To perform a focus group, participants are gathered at a common location and provided the over-representative item list (Morgan, 1996). Then, they are asked to provide feedback regarding each item. They either provide the feedback as a group, individually, or a combination of both.

Focus groups may differ by the type of feedback elicited. Kim and colleagues (1999) performed a focus group that consisted of three phases: review the items for (1) grammatical accuracy and readability, (2) construct accuracy, and (3) construct deficiency. Many other authors have used focus groups that include a combination of these same phases, most commonly the first and second phases (Rosen et al., 2004; Yang et al., 2004). The second phase, gauging construct accuracy, requests participants to provide feedback about the ability of each item to gauge the focal construct, which largely forces them to judge the face validity of each item. While qualitative methods do not provide a numerical metric to rank items' face validity, focus groups still allow this aspect of validity to be included in item retention decisions.

Further, sample size suggestions vary, but all authors suggest that researchers should conduct focus groups until a saturation point is reached (Kim et al., 1999; Yang et al., 2004). That is, the focus groups fail to provide novel information. Brod and colleagues (2009) suggest creating a list of the novel information generated after each focus group and stopping the data collection process when the list from a focus group is notably smaller. Brod and colleagues (2009) also note that this often occurs after three or four focus groups of four to six participants.

While focus groups have several benefits, they also pose unique concerns. Participants in a focus group may feel unable to provide certain feedback, or they may even have their perceptions changed by others' feedback (Brod et al., 2009; Greenbaum, 2000; Kitzinger, 1995). Prior authors have also supported that participants in focus groups may provide more extreme responses than they normally would otherwise (Brod et al., 2009; Morgan, 1996). Focus groups also require multiple participants to gather together in a common location, and it may be almost impossible to gather participants from certain populations. Thus, while focus groups can provide important information, they may be more difficult to perform than other pretest methods.

Traditional Interviews

While focus groups can provide information regarding a wide range of experiences, interviews are typically able to provide more in-depth information (Brod et al., 2009; Greenbaum, 2000; Kitzinger, 1995). Some authors have also suggested that participants are more willing to provide honest feedback in interviews compared to focus groups, as they may feel less pressure from others to provide certain responses (Morgan, 1996). When performing a traditional interview, participants read the item list and provide feedback on each item. Items that are consistently identified as concerning are removed. Thus, this method can provide similar information as focus groups without needing to gather participants together.

Like most other qualitative pretest methods, it is still unclear whether this design can provide accurate feedback, and little research has investigated the ability of traditional interviews to identify problematic items. Also, many researchers include traditional interviews to reduce item lists, but these researchers rarely report applications of this method as a full study (Ferris et al., 2008; Howard et al., 2016). Instead, it is usually presented as a single sentence or paragraph after the item generation phase. This insinuates that researchers may not perceive this approach as important to the scale development process. Nevertheless, traditional interviews may provide important information regarding the items, and this method should be applied and studied.

Other Qualitative Pretest Methods

Other qualitative pretest methods exist aside from cognitive interviewing, focus groups, and traditional interviews. These methods have seen little discussion, and much is still unknown regarding their validity. One of these methods is free response prompts, which are brief questions such as "Did you find this item confusing? If so, why?" Participants are provided the item list and asked to respond to the prompt after each item. Items with several participant responses are removed. A benefit of free response prompts is their ease to administer, and they can be included in an online survey. It is still unclear, however, whether participants can accurately provide feedback regarding each item without using more intensive methods, such as cognitive interviewing.

Also, some researchers have used qualitative participant observations to directly ensure the face validity of each item (Brod et al., 2009). In these instances, researchers observe the behaviors of target participants to ensure that each item represents an observed behavior. Most often, participant observations are performed when participants are unable to provide the intensive self-reports required in cognitive interviews, focus groups, traditional interviews, and other qualitative methods. Beyond these, few other qualitative methods can be seen in research.

Discussion

Several aspects of scale pretests should be apparent from the above review (Table 1). Most notably, (1) an array of pretest methods exist, (2) these pretests may achieve various goals, (3) much remains unknown about these pretests, and (4) more research is needed to understand their similarities, differences, benefits, and detriments. With this in mind, the following presents eight research questions to guide the future study and application of scale pretesting methods.

Future Research Questions

1) Which method provides the best results?

Researchers always want to apply the best method possible, and it is natural to want a single pretest method that is best across all situations. Unfortunately, current pretest methods cannot provide this solution. Each method has particular strengths and weaknesses, and they should be applied when the research situation is suitable. Thus, researchers should not ask "which method provides the best results?" but rather "when should each method be used?"

2) When should each method be used?

To determine which method to use, a researcher should first determine whether they are most concerned with (a) face validity or (b) wording issues and (somewhat) face validity. If the former is the primary concern, a quantitative pretest method should be applied. If the latter is the primary concern, then a qualitative pretest method should be applied.

If a quantitative pretest method is chosen, then the researcher also needs to determine whether they are interested in items' relationship with (a) the focal construct alone or (b) the focal construct and other constructs. If the researcher is only interested in the focal construct, then item-rating tasks are ideal. If the researcher is interested in the focal construct and other constructs, then they should use either an item-sort task or Hinkin and Tracey's (1999) ANOVA method. Because current research has not directly compared these two methods to determine which provides more accurate results, the researcher can choose whichever of these two methods they prefer. It should be kept in mind, however, that research has yet to show that Hinkin and Tracey's (1999) ANOVA method performs well with sample sizes typical of pretests.

Table 1. Summary of Scale Pretest Method Attributes


| Attribute | Item-Rating Task | Item-Sort Task | ANOVA Method | Cognitive Interviews | Focus Groups | Interviews |
|---|---|---|---|---|---|---|
| 1.) Identify items that gauge focal construct? | Yes | Yes | Yes | No | Somewhat (a) | Somewhat (a) |
| 2.) Identify items that gauge multiple constructs? | No | Yes | Yes | No | Somewhat (a) | Somewhat (a) |
| 3.) Identify items with wording concerns? | No | No | No | Yes | Yes | Yes |
| 4.) Identify confusing items? | No | No | No | Yes | Yes | Yes |
| 5.) Able to be administered via online survey? | Yes | Yes | Yes | No | No | No |
| 6.) Typically use SMEs? | Yes | Yes | Yes | No | Yes | Yes |
| 7.) Typically use group settings to collect data? | No | No | No | No | Yes | No |
| 8.) Typical sample size? | 10-30 | 5-30 | 30-150 | 3-6 | 3-4 groups of 5-6 people | 3-6 |

(a) Focus groups and interviews can obtain some indicators of face validity, but not in a manner that allows the items to be rank-sorted on these attributes.


If a qualitative pretest method is chosen, the researcher needs to determine whether they are concerned with (a) wording issues alone or (b) wording issues and face validity. If wording issues are the primary concern, then cognitive interviewing is ideal. If both wording issues and face validity are concerns, then it should be determined whether the larger concern is (a) the breadth of responses or (b) the depth of responses. If the breadth of responses is the concern, then focus groups should be used. If the depth of responses is the concern, then traditional interviews should be used. Nevertheless, it should be kept in mind that cognitive interviewing has the most empirical support for its validity, although these prior results are mixed. To aid in future scale pretesting decisions, Figure 1 is included as a visual guide.
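Because Figure 1 is not reproduced here, the sketch below restates the decision guidance of this section as a small Python function. The parameter names and branching are paraphrased from the text above and are illustrative only.

```python
def suggest_pretest(primary_concern: str,
                    multiple_constructs: bool = False,
                    face_validity_also: bool = False,
                    depth_over_breadth: bool = False) -> str:
    """Suggest a scale pretest method, following the guidance in this article.

    primary_concern: "face_validity" or "wording".
    multiple_constructs: for quantitative pretests, whether relations to
        non-focal constructs also matter.
    face_validity_also: for wording concerns, whether face validity is also
        of interest.
    depth_over_breadth: whether depth of feedback matters more than breadth.
    """
    if primary_concern == "face_validity":
        # Face validity as the primary concern points to quantitative pretests.
        if multiple_constructs:
            return "item-sort task or Hinkin and Tracey's (1999) ANOVA method"
        return "item-rating task"
    # Wording issues as the primary concern point to qualitative pretests.
    if not face_validity_also:
        return "cognitive interviews"
    return "traditional interviews" if depth_over_breadth else "focus groups"


print(suggest_pretest("face_validity", multiple_constructs=True))
print(suggest_pretest("wording", face_validity_also=True, depth_over_breadth=True))
```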
3) Which methods can be effectively used in conjunction?

Researchers almost always apply a single pretest method when developing measures. Applying two or more methods from the same category (quantitative or qualitative) could benefit the development of scales, as it could provide a triangulation of results (Jick, 1979; Morse, 1991). More importantly, applying two or more methods from different categories could identify attributes of items that could not be discovered with one category alone, and applying both a quantitative and qualitative pretest method could address the weaknesses of the other. Perhaps the pretest methods that would provide the most utility, in regard to difficulty to implement and information obtained, would be the application of any quantitative pretest method and free response blanks. While free response blanks are only sparsely used, they are among the very few qualitative pretest methods that can be administered through an online survey. When applying the discussed quantitative pretest methods, the free response blank can be placed after the numerical rating for each item. A visual demonstration of this is provided in the supplemental material, in which free response blanks are applied alongside an item-sort task.

Table 2. Summary of Eight Questions, Answers, and Directions for Future Research
| Question | Answer | More research is needed on… |
|---|---|---|
| Which method provides the best results? | None; quantitative and qualitative methods have different goals. | |
| When should each method be used? | In general, quantitative methods should be used when face validity is a concern, whereas qualitative methods should be used when wording issues (and perhaps face validity) are a concern. Further decisions vary on the context. | Which methods with similar goals provide more accurate results. For instance, do item-sort tasks or the ANOVA method provide more accurate results? |
| Which methods can be effectively used in conjunction? | Using qualitative and quantitative methods in conjunction appears to be ideal. Also, using methods that use general participants and SMEs together may be ideal. | Which methods can perform well together. For instance, should focus groups or traditional interviews be used with item-sort tasks? |
| Are SMEs required for certain methods? | Perhaps, but many methods that traditionally use SMEs may not need to do so. | Whether SMEs provide more accurate results than general participants. |
| What is the required sample size for these methods? | The bottom range for most methods has yet to be identified, but 30 should be sufficient for most methods. | Whether prior sample size recommendations are supported by empirical and statistical research. |
| Must scale pretesting methods always precede traditional psychometric evaluations? | Not always. | Which methods can be used with and without follow-up evaluations. |
| What are some concerns with existing pretest methods? | Identifying repetitive items, removing repetitive items, and considering other types of validity. | The creation of new, and modification of old, pretest methods to address these concerns. |
| What is the future of scale pretesting? | The application of scale pretests will continue to thrive, and the study of the methods themselves will increase. | Empirically testing the accuracy of existing pretest methods and the creation of new methods that address old concerns. |

4) Are SMEs required for certain methods?

Before discussing which methods require SMEs, another question should be asked first: What exactly are SMEs in the context of scale pretests? Typically, SMEs are those with relevant academic experience. For instance, a researcher creating a conscientiousness scale may use graduate students or graduates of Ph.D. programs in Psychology. Many authors have also used undergraduates but noted that these SMEs were current or prior students of a relevant course and/or research lab. It is interesting to note, however, that researchers less frequently use target populations as SMEs for scale pretests, although they are regularly considered SMEs for the item generation phase. This may be because these SMEs are believed to have relevant knowledge of the behaviors that may compose the criterion space for a construct, but they are unable to identify the exact boundaries of a construct. Like most other aspects of scale pretests, it is unclear whether this notion is actually true without supporting research.

Further, when using quantitative pretest methods, the decision to use general participants or SMEs is often unclear. For item-rating methods and item-sort tasks, authors almost always use SMEs; however, neither Anderson and Gerbing (1991) nor Howard and Melloy (2016) used SMEs in their empirical studies on item-sort tasks, and little research has empirically shown that SMEs provide more accurate judgements. Further, Hinkin and Tracey (1999) used graduate and undergraduate students to test their ANOVA method, but these students were not specified to be in classes relevant to the item lists. Thus, it is unclear whether any quantitative method explicitly requires SMEs. When using general participants, provided construct definitions need to be clear and comprehensive, as this information may be their only exposure to certain constructs.

Regarding qualitative pretest methods, cognitive interviews are almost always performed with general participants. If SMEs were used to complete the item list, their prior knowledge of the focal construct may alter their responses. Alternatively, focus groups and interviews may or may not require SMEs. Most research has used SMEs to identify wording issues and construct contamination, but some authors have used target populations relevant to the focal construct. For instance, people with health conditions have been used as SMEs when creating a scale for severity of symptoms (Mangione et al., 2001; Olson, 2010). Like quantitative pretest methods, research has yet to show that SMEs provide more accurate results than general participants.

5) What is the required sample size for these methods?

The recommended sample sizes for the various pretest methods are more direct than the decision to use SMEs. Typically, 10 to 30 participants are recommended for item-rating tasks and item-sort tasks, although Howard and Melloy (2016) showed that statistical significance can be calculated with sample sizes of five for item-sort tasks. Hinkin and Tracey (1999) suggest that sample sizes as small as 30 can be used for their ANOVA method, but their examples included samples larger than 150. For qualitative methods, prior researchers have suggested that three to six participants may provide accurate results for cognitive interviews and traditional interviews. For focus groups, Brod and colleagues (2009) suggested that three or four focus groups of four to six participants can provide accurate results. Aside from item-sort tasks (Howard & Melloy, 2016), however, prior research has not provided empirical or statistical evidence for these sample size cutoffs. Instead, these findings are largely based on conjecture and prior experience.

6) Must scale pretesting methods always precede traditional psychometric evaluations?

Scale pretests almost always precede traditional psychometric evaluations, such as EFA and CFA, and many researchers may believe that scale pretests are useless without such follow-up investigations. The origin of this belief may have arisen from prior empirical studies on the ability of quantitative pretest methods to predict the results of EFA and CFA (Anderson & Gerbing, 1991; Howard & Melloy, 2016), and suggestions that quantitative pretest methods are able to identify items that perform well in an EFA or CFA. This tradition should be reconsidered.

Of course, the entire scale development process has several steps, and each should be followed to ensure a psychometrically sound scale that is valid for gauging the focal construct (Hinkin, 1995, 1998). Researchers are often unable to undergo the entire scale development process due to limited time and/or resources. In these instances, scale pretests can provide valuable information even in the absence of follow-up analyses.

In other words, providing some reassurance that administered items are adequate is better than providing no such evidence. We strongly suggest that future researchers should apply these discussed methods in these instances, which is only seen sparingly in current research (Howard & Melloy, 2016; Olson, 2010), and they should apply both a qualitative and quantitative pretest method when doing so.

Figure 1. Flowchart of Scale Pretest Applications

7) What are some concerns with existing pretest methods?

In general, quantitative pretest methods select items that are judged to be representative of the focal construct, and items that more accurately gauge the focal construct are more likely to be retained. Selecting the most accurate items may reduce content coverage, however, and only items that are closely-related synonyms may be retained. Similarly, qualitative pretest methods primarily select items that are free from wording concerns, but participants may also judge the face validity of each item during a focus group or traditional interview. It is again possible that participants may perceive certain items as being irrelevant that actually gauge important aspects of the focal construct, thereby reducing the content coverage of the item list. We suggest that researchers should apply methods and cutoffs that retain more items than needed when pretest methods are used with subsequent psychometric analysis. This would help ensure the content validity of the measure, and these items can also be further reduced in subsequent steps.

Further, overly repetitive items may pose other concerns aside from content validity issues. These items provide little information regarding the focal construct when included in the same scale, and they may also negatively influence model fit when performing a CFA (Brown, 2015; Fabrigar et al., 1999). Unfortunately, none of the discussed quantitative or qualitative methods are regularly used to identify repetitive items; however, focus groups and traditional interviews may achieve this objective – if a phase is added to specifically identify repetitive items. For this reason, it may be useful for researchers to more often apply focus groups and traditional interviews with these phases for their scale pretesting.

8) What is the future of scale pretesting?

Scale pretest methods provide valuable information, and researchers are increasingly recognizing their benefits. For these reasons, we believe that the application of scale pretests will continue, but three new directions will be seen. First, the application of pretest methods will continue in a more systematic manner. With the continued usage, authors will begin to recognize situations in which these methods are best applied, and more best practices will begin to emerge.

Second, more research will analyze the characteristics of scale pretests themselves. For instance, several pretests have similar objectives that are achieved in a similar manner, but it is unclear which of these pretests perform better. Likewise, future research should perform more detailed investigations into the manner that scale pretests retain items, such as whether quantitative methods actually have concerns with retaining repetitive items, and whether SMEs actually provide more accurate results for pretest methods. Similarly, future research should determine when the applications of these methods are most appropriate. While the current article suggested applying quantitative and qualitative pretest methods together, certain pretest methods may perform particularly well together. Certain methods may also perform poorly in the absence of subsequent psychometric evaluation, but these methods cannot be identified without further research. Together, these suggestions are only the beginning of further pretest investigations.

Third, discussions of pretest methods focus on their relation to face validity and ability to replicate EFA and CFA results, but it is important to consider each type of validity together. Face validity is interlinked with content, convergent, discriminant, and other types of validity. We suggest that new pretest methods should analyze multiple aspects of validity together.

References

Anderson, J. C., & Gerbing, D. W. (1991). Predicting the performance of measures in a confirmatory factor analysis with a pretest assessment of their substantive validities. Journal of Applied Psychology, 76(5), 732-740.

Bassili, J. N., & Scott, B. S. (1996). Response latency as a signal to question problems in survey research. Public Opinion Quarterly, 60(3), 390-399.

Beatty, P. (2004). The dynamics of cognitive interviewing. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, & E. Singer (Eds.), Methods for testing and evaluating survey questionnaires (pp. 45-66). Hoboken, NJ: John Wiley & Sons, Inc.

Beatty, P. C., & Willis, G. B. (2007). Research synthesis: The practice of cognitive interviewing. Public Opinion Quarterly, 71(2), 287-311.

Belson, W. A. (1981). The design and understanding of survey questions. Aldershot, UK: Gower.

Blair, J., Czaja, R. F., & Blair, E. A. (2013). Designing surveys: A guide to decisions and procedures. Tyne, UK: Sage.

Bosco, F. A., Aguinis, H., Singh, K., Field, J. G., & Pierce, C. A. (2015). Correlational effect size benchmarks. Journal of Applied Psychology, 100(2), 431-449.

Brod, M., Tesler, L., & Christensen, T. (2009). Qualitative research and content validity: Developing best practices based on science and experience. Quality of Life Research, 18(9), 1263-1278.

Brown, J. D. (1996). Testing in language programs. Upper Saddle River, NJ: Prentice Hall Regents.

Brown, T. A. (2015). Confirmatory factor analysis for applied research. New York, NY: Guilford Publications.

Cantril, H. (1944). The meaning of questions. In Gauging public opinion (pp. 3-22). Princeton, NJ: Princeton University Press.

Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7(3), 309-319.

Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155-159.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997-1003.

Conrad, F., & Blair, J. (2004). Data quality in cognitive interviews: The case of verbal reports. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, & E. Singer (Eds.), Methods for testing and evaluating survey questionnaires (pp. 67-87). Hoboken, NJ: John Wiley & Sons, Inc.

Coste, J., Guillemin, F., Pouchot, J., & Fermanian, J. (1997). Methodological approaches to shortening composite measurement scales. Journal of Clinical Epidemiology, 50(3), 247-252.

DeMaio, T. J., & Landreth, A. (2004). Do different cognitive interview techniques produce different results? In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, & E. Singer (Eds.), Methods for testing and evaluating survey questionnaires (pp. 89-108). Hoboken, NJ: John Wiley & Sons, Inc.

DeVellis, R. (2016). Scale development: Theory and applications (Vol. 26). Tyne, UK: Sage.

Dietrich, H., & Ehrlenspiel, F. (2010). Cognitive interviewing: A qualitative tool for improving questionnaires in sport science. Measurement in Physical Education and Exercise Science, 14(1), 51-60.

Draisma, S., & Dijkstra, W. (2004). Response latency and (para)linguistic expressions as indicators of response error. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, & E. Singer (Eds.), Methods for testing and evaluating survey questionnaires (pp. 131-147). Hoboken, NJ: John Wiley & Sons, Inc.

Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. (1999). Evaluating the use of exploratory factor analysis in psychological research. Psychological Methods, 4(3), 272-299.

Ferris, D. L., Brown, D. J., Berry, J. W., & Lian, H. (2008). The development and validation of the Workplace Ostracism Scale. Journal of Applied Psychology, 93(6), 1348-1366.

Fowler Jr, F. J. (2013). Survey research methods. Tyne, UK: Sage.

Goetz, C., Coste, J., Lemetayer, F., Rat, A. C., Montel, S., Recchia, S., ... & Guillemin, F. (2013). Item reduction based on rigorous methodological guidelines is necessary to maintain validity when shortening composite measurement scales. Journal of Clinical Epidemiology, 66(7), 710-718.

Greenbaum, T. L. (2000). Moderating focus groups: A practical guide for group facilitation. Tyne, UK: Sage.

Hardesty, D. M., & Bearden, W. O. (2004). The use of expert judges in scale development: Implications for improving face validity of measures of unobservable constructs. Journal of Business Research, 57(2), 98-107.

Heene, M., Bollmann, S., & Bühner, M. (2014). Much ado about nothing, or much to do about something? Effects of scale shortening on criterion validity and mean differences. Journal of Individual Differences, 35(4), 245-249.

Hinkin, T. R. (1995). A review of scale development practices in the study of organizations. Journal of Management, 21(5), 967-988.

Hinkin, T. R. (1998). A brief tutorial on the development of measures for use in survey questionnaires. Organizational Research Methods, 1(1), 104-121.

Hinkin, T. R., & Tracey, J. B. (1999). An analysis of variance approach to content validation. Organizational Research Methods, 2(2), 175-186.

Howard, M. C. (2016). A review of exploratory factor analysis decisions and overview of current practices: What we are doing and how can we improve? International Journal of Human-Computer Interaction, 32(1), 51-62.

Howard, M. C., & Melloy, R. C. (2016). Evaluating item-sort task methods: The presentation of a new statistical significance formula and methodological best practices. Journal of Business and Psychology, 31(1), 173-186.

Hunt, S. D., Sparkman Jr, R. D., & Wilcox, J. B. (1982). The pretest in survey research: Issues and preliminary findings. Journal of Marketing Research, 19(2), 269-273.

Jick, T. D. (1979). Mixing qualitative and quantitative methods: Triangulation in action. Administrative Science Quarterly, 24(4), 602-611.

Kim, B. S., Atkinson, D. R., & Yang, P. H. (1999). The Asian Values Scale: Development, factor analysis, validation, and reliability. Journal of Counseling Psychology, 46(3), 342.

Kitzinger, J. (1995). Qualitative research. Introducing focus groups. BMJ: British Medical Journal, 311(7000), 299.

Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel Psychology, 28(4), 563-575.

Leech, B. L. (2002). Asking questions: Techniques for semistructured interviews. Political Science & Politics, 35(4), 665-668.

Lynn, M. R. (1986). Determination and quantification of content validity. Nursing Research, 35(6), 382-386.

MacKenzie, S. B., Podsakoff, P. M., & Podsakoff, N. P. (2011). Construct measurement and validation procedures in MIS and behavioral research: Integrating new and existing techniques. MIS Quarterly, 35(2), 293-334.

Mangione, C. M., Lee, P. P., Gutierrez, P. R., Spritzer, K., Berry, S., & Hays, R. D. (2001). Development of the 25-list-item National Eye Institute Visual Function Questionnaire. Archives of Ophthalmology, 119(7), 1050-1058.

Meade, A. W., & Craig, S. B. (2012). Identifying careless responses in survey data. Psychological Methods, 17(3), 437-455.

Morgan, D. L. (1996). Focus groups as qualitative research (Vol. 16). Tyne, UK: Sage.

Morse, J. M. (1991). Approaches to qualitative-quantitative methodological triangulation. Nursing Research, 40(2), 120-123.

Nakagawa, S., & Cuthill, I. C. (2007). Effect size, confidence interval and statistical significance: A practical guide for biologists. Biological Reviews, 82(4), 591-605.

Olson, K. (2010). An examination of questionnaire evaluation by expert reviewers. Field Methods, 22(4), 295-318.

Presser, S., & Blair, J. (1994). Survey pretesting: Do different methods produce different results? Sociological Methodology, 24(1), 73-104.

Presser, S., Couper, M., Lessler, J., Martin, E., Martin, J., Rothgeb, J., & Singer, E. (2004). Methods for testing and evaluating survey questions. Public Opinion Quarterly, 68(1), 109-130.

Rea, L. M., & Parker, R. A. (2014). Designing and conducting survey research: A comprehensive guide. Hoboken, NJ: John Wiley & Sons.

Rosen, R. C., Catania, J., Pollack, L., Althof, S., O'Leary, M., & Seftel, A. D. (2004). Male Sexual Health Questionnaire (MSHQ): Scale development and psychometric validation. Urology, 64(4), 777-782.

Schriesheim, C. A., Powers, K. J., Scandura, T. A., Gardiner, C. C., & Lankau, M. J. (1993). Improving construct measurement in management research: Comments and a quantitative approach for assessing the theoretical content adequacy of paper-and-pencil survey-type instruments. Journal of Management, 19(2), 385-417.


Smith, G. T., McCarthy, D. M., & Anderson, K. G. (2000). On the sins of short-form development. Psychological Assessment, 12(1), 102-111.

Sweeney, J. C., & Soutar, G. N. (2001). Consumer perceived value: The development of a multiple item scale. Journal of Retailing, 77(2), 203-220.

Thompson, B. (2004). Exploratory and confirmatory factor analysis: Understanding concepts and applications. Washington, DC: American Psychological Association.

Van der Zouwen, J., & Smit, J. H. (2004). Evaluating survey questions by analyzing patterns of behavior codes and question–answer sequences: A diagnostic approach. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, & E. Singer (Eds.), Methods for testing and evaluating survey questionnaires (pp. 109-130). Hoboken, NJ: John Wiley & Sons, Inc.

Willis, G. B. (2004). Cognitive interviewing: A tool for improving questionnaire design. Tyne, UK: Sage.

Yang, Z., Jun, M., & Peterson, R. T. (2004). Measuring customer perceived online service quality: Scale development and managerial implications. International Journal of Operations & Production Management, 24(11), 1149-1174.

Supplemental Material – Item Sort Task with Free Response Blank Example
Instructions: In the following, you will be asked to indicate which construct that you believe several items
represent from the options provided. For this reason, it is very important that you are familiar with the constructs
of interest. Please read the following definitions to familiarize yourself with these constructs. Afterwards, using
the options provided, please indicate the construct that you believe the following items represent. If you believe
that the item does not represent any of the options provided, please mark “Other Construct.”
Conscientiousness - A fundamental trait that influences whether people adhere to long-range goals, avoid
acting impulsively, act carefully in their behaviors, desire performing well, and remain committed to social
obligations.
Extraversion – A fundamental trait that influences whether people are outgoing, talkative, social, seek new
sensations, and receive gratification outside of oneself.
Neuroticism – A fundamental trait that influences whether people are moody, experience negative
emotions, and respond more negatively to stressors.
Lastly, a final column is added that is labeled “Confusing / Wording Concerns.” If you believe that the item
is confusing or possesses any wording concerns, please write a brief note describing the concerns.
| Item | Conscientiousness | Extraversion | Neuroticism | Other Construct | Confusing / Wording Concerns? |
|---|---|---|---|---|---|
| 1.) I am talkative. | | | | | |
| 2.) I am hard working. | | | | | |
| 3.) I am emotionally stable. | | | | | |
| 4.) I enjoy running. | | | | | |
| 5.) I like to be orderly. | | | | | |
| … | … | … | … | … | … |

Citation:
Howard, Matt C. (2018). Scale Pretesting. Practical Assessment, Research & Evaluation, 23(5). Available online:
http://pareonline.net/getvn.asp?v=23&n=5


Corresponding Author
Matt C. Howard
Assistant Professor
Marketing and Quantitative Methods
Mitchell College of Business
University of South Alabama

email: mhoward [at] southalabama.edu

