Summative Assessment (Wing Institute Original Paper)
Preprint · March 2018
DOI: 10.13140/RG.2.2.16788.19844
Jack States, Ronnie Detrich, & R. Keyworth
The Wing Institute
Summative Assessment
Summative assessment is an appraisal of learning at the end of an
instructional unit or at a specific point in time. It compares student
knowledge or skills against standards or benchmarks. Summative
assessment evaluates the mastery of learning whereas its counterpart,
formative assessment, measures progress and functions as a diagnostic tool
to help specific students. Generally, summative assessment gauges how a
particular population responds to an intervention rather than focusing on an
individual. It often aggregates data across students to act as an independent
yardstick that allows teachers, administrators, and parents to judge the
effectiveness of the materials, curriculum, and instruction used to meet
national, state, or local standards. Summative assessment includes midterm
exams, final projects, papers, teacher-designed tests, standardized tests, and
high-stakes tests. As a subset of summative assessment, standardized tests
play a pivotal role in ensuring that schools are held to the same standards
and that all students, regardless of race or socioeconomic background,
perform to expectations. Summative assessment provides educators with
the metrics to know what is working and what is not.
Overview
Research supports the power of assessment to amplify learning and skill
acquisition (Başol & Johanson, 2009). Summative assessment is a form of
appraisal that occurs at the end of an instructional unit or at a specific point
in time, such as the end of the school year. It evaluates mastery of learning
and offers information on what students know and do not know. Frequently,
summative assessment consists of evaluation tools designed to measure
student performance against predetermined criteria based on specific
learning standards. Examples of commonly employed tools include Advanced
Placement exams, the National Assessment of Educational Progress (NAEP),
end-of-lesson tests, midterm exams, final projects, and term papers. These
assessments are routinely used for making high-stakes decisions; for this
purpose, student knowledge or skill acquisition is often compared with
standards or benchmarks (e.g., Common Core State Standards or high school
graduation tests).
Summative assessment carries particular weight because educators may use
the data from each high-stakes test for decisions with significant
long-term consequences for a student's future. Passing bestows
important benefits, such as receiving a high school diploma, a scholarship,
or entry into college, and failure can affect a child’s future employment
prospects and earning potential as an adult (Geiser & Santelices, 2007).
Additionally, summative assessment plays a role in improving future
instruction by providing educators with data on the effectiveness of
curriculum and instruction. Knowing what methods worked for a lesson or
semester may not help current students, but it can provide educators with
the necessary insights into how and where to redesign instructional practices
to elevate next year’s student scores (Moss, 2013).
Despite the important role of summative assessment in education, research
finds little evidence to support it as a critical factor in improved student
achievement (Rosenshine, 2003; Yeh, 2007). Figure 1 provides a
comparison of the effect size of formative assessment and high-stakes
testing (an instrument of summative assessment), gleaned from multiple
studies conducted over more than 40 years.
Figure 1. Comparison of formative assessment and summative
assessment impact on student achievement
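The effect sizes compared in Figure 1 can be read as standardized mean differences (commonly expressed as Cohen's d): the difference between the mean outcome of students who received a practice and the mean of those who did not, divided by the pooled standard deviation. The formula below is the standard textbook definition, offered here as orientation; the studies cited in this paper may compute their effects somewhat differently.

```latex
d = \frac{\bar{X}_{\text{treatment}} - \bar{X}_{\text{control}}}{SD_{\text{pooled}}},
\qquad
SD_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^{2} + (n_2 - 1)s_2^{2}}{n_1 + n_2 - 2}}
```

By the common convention, values near 0.2 are considered small, 0.5 medium, and 0.8 large, which helps put in perspective the 0.05 effect reported later in this paper for high-stakes accountability measures (Yeh, 2007).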
Because summative assessment happens after instruction is over, it has
little value as a diagnostic tool to guide teachers in making timely
adjustments to instruction aimed at catching students who are falling
behind. It does not provide teachers with vital information to use in crafting
remedial instruction. Formative assessment is a much more effective
instrument for adjusting instruction to help students master material
(Garrison & Ehringhaus, 2007; Harlen & James, 1997).
Despite these shortcomings, summative assessment plays a pivotal role in
education by troubleshooting weaknesses in the system. It provides
educators with valuable information to determine the effectiveness of
instruction for a particular unit of study, to make high-stakes decisions, and
to evaluate the effectiveness of schoolwide interventions. It works to
improve overall instruction (1) by providing feedback on progress measured
against benchmarks, (2) by helping teachers to improve, and (3) as an
accountability instrument for continuous improvement of systems (Hart et
al., 2015).
Types of Summative Assessment
Educators generally rely on two forms of summative assessment: teacher
constructed (informal) and standardized (systematic). Teacher-constructed
assessment is the most common form of assessment found in classrooms. It
can provide objective data for appraising student performance, but it is
vulnerable to bias. Standardized assessment is designed to overcome many
of the biases that can taint teacher-constructed tools, but this form of
assessment has its own limitations. Both types of summative assessment
have a place in an effective education system, but for maximum positive
effects they should be employed to meet the needs for which they were
designed.
Teacher Constructed (Informal)
Teacher-constructed assessment, the most common and frequently applied
type of summative assessment, is derived from teachers’ daily interactions
and observations of how students behave and perform in school. Since
schools began, teachers have depended predominantly on informal
assessment, which today includes teacher-constructed tests and quizzes,
grades, and portfolios, and relies heavily on a teacher’s professional
judgment. Teachers inevitably form judgments, often accurate, about
students and their performance (Barnett, 1988; Spencer, Detrich, & Slocum,
2012). Although many of these judgments help teachers understand where
students stand in mastering a lesson, a meaningful percentage result in false
understandings and conclusions. To be effective, a teacher-constructed
assessment must deliver vital information needed for the teacher to make
accurate conclusions about each student’s performance in a content area
and to feel confident that performance is linked to instruction. Ensuring that
a teacher-constructed instrument is reliable and valid is central to the
assessment design process.
Research suggests that the main weaknesses of informal assessment relate
to validity and reliability (AERA, 1999; Mertler, 1999). That is why it is
crucial for teachers to adopt assessment procedures that are valid indicators
of a student's performance (appraising what the assessment claims to
appraise) and reliable (providing information that can be replicated).
Validity is a measure of how well an instrument gauges the relevant skills of
a student. The research literature identifies three basic types of validity:
construct, criterion, and content. Students are best served when the teacher
focuses on content validity, that is, making sure the content being tested is
actually the content that was taught (Popham, 2014). Content validity
requires no statistical calculations whereas both construct validity and
criterion validity require knowledge of statistics and thus are not well suited
to classroom teachers (Allen & Yen, 2002).
Reliability is key in the development of any teacher-constructed assessment
because teachers must be assured that repeated testing of a student using
the same assessment will produce consistent results (AERA, 1999; Brennan,
2006; McMillan & Schumacher, 1997; Popham, 2014). Two independent
assessors, or the same assessor in a test-retest format, should arrive at
similar scores. A reliable assessment gives the teacher confidence that the
instrument provides a good depiction of a student's skills or knowledge.
Reliability and validity may sometimes be at odds. For example, a test may
yield consistent (reliable) scores for an English language learner yet not be
a valid measure of the skills of a student whose primary language is not
English.
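As a concrete illustration of the test-retest idea described above, the sketch below estimates reliability as the Pearson correlation between two administrations of the same assessment. The scores are invented for illustration, and the rough threshold mentioned in the comments is a common rule of thumb rather than a standard drawn from the sources cited.

```python
from statistics import correlation  # Python 3.10+

# Hypothetical scores for the same ten students on two administrations
# of the same teacher-constructed test, given a few weeks apart.
first_administration = [78, 85, 62, 90, 71, 88, 55, 94, 67, 80]
second_administration = [75, 88, 65, 92, 70, 85, 58, 95, 70, 78]

# Test-retest reliability: how strongly the two sets of scores agree.
reliability = correlation(first_administration, second_administration)

print(f"Test-retest reliability (Pearson r): {reliability:.2f}")
# Values approaching 1.0 indicate consistent measurement; values much
# below roughly 0.8 would suggest the scores are not stable enough to trust.
```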
Ultimately, speedy feedback on student performance after an assessment
enhances the value of all forms of assessment. To maximize the positive
impact, both student and teacher should be provided with detailed and
specific information on a student's achievement. Timely comments and
explanations from teachers that clarify how a student performed are
essential components of quality instruction and performance improvement.
This information tells students where they stand with regard to the teacher’s
expectations. Timely feedback is also essential for teachers (Gibbs &
Simpson, 2005). Otherwise, teachers remain in the dark about the
effectiveness of their instructional strategies and methods. Research
suggests that testing without feedback is likely to produce disappointing
results, and the quantity and quality of the research supports including
feedback as an integral part of assessment (Başol, 2003).
Designing Teacher-Constructed Assessments
The essential question to ask when developing an informal teacher-
constructed assessment is this: Does the assessment consistently assess
what the teacher intended to be evaluated based on the material being
taught? Best practices in assessment suggest that teachers start answering
this question by incorporating assessment design into the instructional
design process. Assessments are best generated at the same time as lesson
plans. Although teaching to the test has acquired negative overtones, it is
precisely what all student assessment is meant to accomplish. Teachers
cannot and should not assess every item they teach, but it is important that
they identify and prioritize the critical lesson elements for inclusion in a
summary assessment.
Instruction and assessment are meant to complement one another. When
this occurs it helps teachers, policymakers, administrators, and parents
know what students are capable of doing at specific stages in the education
process. A good match of assessment with instruction leads to more
effective scope and sequencing, enhancing the acquisition of knowledge and
the mastery of skills required for success in subsequent grades as well as
success after graduation from school (Reigeluth, 1999).
The following are guidelines that lead to increased effectiveness of teacher-
constructed assessment (Reynolds, Livingston, & Willson, 2010;
Shillingburg, 2016; Taylor & Nolen, 2005):
1. Clarify the purpose of the assessment and the intended use of its
results.
2. Define the domain (content and skills) to be assessed.
3. Match instruction to standards required of each domain.
4. Identify the characteristics of the population to be assessed and
consider how these data might influence the design of the assessment.
5. Ensure that all prerequisite skills required for the lesson have been
taught to the students.
6. Ensure that the assessment evaluates skills compatible with and
required for success in future lessons.
7. Review with the students the purpose of the assessment and the
knowledge and skills to be assessed.
8. Consider possible task formats, timing, and response modes and
whether they are compatible with the assessment as well as how the
scores will be used.
9. Outline how validity will be evaluated and measured.
• Methods include matching test questions to lesson plans, lesson
objectives, and standards, and obtaining student feedback after
the assessment.
• Content-related evidence often consists of deciding whether the
assessment methods are appropriate, whether the tasks or
problems provide an adequate sample of the student’s
performance, and whether the scoring system captures the
performance.
• When possible, review test items with colleagues and students;
revise as necessary.
10. Review issues of reliability.
• Make sure that the assessment includes enough items and tasks
(examples of performance) to report a reliable score.
• Evaluate the relative weight allotted to each task, to each
content category, and to each skill being assessed.
11. Pilot-test the assessment, then revise as necessary. Are the results
consistent with formative assessments administered on the content
being taught? (One way to analyze pilot results is sketched after this list.)
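The sketch below illustrates one way a teacher might work through steps 9-11: it checks how well a draft test's items cover the lesson objectives (a rough content-validity check) and computes Cronbach's alpha from pilot item scores as a simple internal-consistency estimate of reliability. The objectives, item mappings, and pilot data are hypothetical, and alpha is only one of several reliability indices that could be used.

```python
import numpy as np

# --- Content coverage check (step 9): map each test item to the lesson
# objective it is meant to assess. Objectives and mappings are hypothetical.
lesson_objectives = {"fractions", "decimals", "percentages", "ratios"}
item_to_objective = {
    "item1": "fractions", "item2": "fractions", "item3": "decimals",
    "item4": "percentages", "item5": "percentages", "item6": "decimals",
}
covered = set(item_to_objective.values())
missing = lesson_objectives - covered
print("Objectives with no test item:", missing or "none")

# --- Reliability check (steps 10-11): Cronbach's alpha on pilot scores.
# Rows are students, columns are item scores from a small pilot administration.
pilot_scores = np.array([
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0, 0],
], dtype=float)

k = pilot_scores.shape[1]                        # number of items
item_variances = pilot_scores.var(axis=0, ddof=1)
total_variance = pilot_scores.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

print(f"Cronbach's alpha: {alpha:.2f}")  # higher values = more internally consistent
```

A low alpha or an uncovered objective in a pilot run would point back to the revision called for in step 11 before the assessment is used with students.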
Standardized (Systematic)
Standardized testing is the second major category of summative assessment
commonly used in schools. Students and teachers are very familiar with
these standardized tests, which have become ubiquitous. Over the past 20
years, they have played an ever-increasing role in schools, especially since
the passage in 2001 of the No Child Left Behind Act (NCLB, 2002).
Standardized tests have increased not only in influence but also in quantity.
Typically, students are engaged in taking standardized tests between 20 and
25 hours each year (Bangert-Drowns, Kulik, Kulik, & Morgan, 1991; Hart et
al., 2015). The average 8th grader spends between 1.6% and 2.3% of
classroom time on standardized tests, not including test preparation
(Bangert-Drowns et al., 1991; Lazarín, 2014). A student will be required to
participate in approximately 112 mandatory standardized exams during his
or her academic career (Hart et al., 2015).
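The hours and percentage figures above come from different sources but are roughly consistent. The back-of-the-envelope calculation below assumes a school year on the order of 1,100 to 1,200 instructional hours, a figure not given in the sources.

```latex
\frac{20\ \text{hours of testing}}{1{,}200\ \text{instructional hours}} \approx 1.7\%,
\qquad
\frac{25\ \text{hours of testing}}{1{,}100\ \text{instructional hours}} \approx 2.3\%
```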
Although research finds that student performance increases with the
frequency of assessment, it also shows that improvement tapers off with
excess testing (Bangert-Drowns et al., 1991). Regardless of where educators
stand on the issue of standardized testing, most can agree that these
assessments should be reduced to the minimum number required to obtain
the critical information for which they were designed. The aim is to decrease
the number of standardized tests to those indispensable in providing
educators with the basic information to make high-stakes decisions and for
schools to implement a continuous improvement process. Ultimately,
everyone is best served by reducing redundancy in test taking in order to
maximize instructional time (Wang, Haertel, & Walberg, 1990).
The premise that past performance is a good indicator of future performance
underlies the use of standardized tests for high-stakes purposes such as
college entrance. However, research suggests that standardized tests such
as the SAT are not as effective as educators would like them to be in
predicting future academic performance of college students (Belfield &
Crosta, 2012; Espenshade & Chung, 2010). Standardized tests do provide a
measure of a student’s achievements independent of grades, thus enhancing
the overall information available for high-stakes decisions. These tests have
content validity to the extent that they gauge what they were designed to
measure, and reliability to the extent that test-retest scores correlate.
Unfortunately, they are often used in ways other than intended and are
invalid for those purposes.
The fact that standardized tests are rigorously vetted before use means they
can provide valuable information to help schools and classrooms improve
instruction in future years. Generally, a standardized assessment measures
how a particular population rather than an individual responds to an
intervention. It often aggregates data across students to act as an
independent yardstick that allows teachers, administrators, and parents to
judge the effectiveness of the materials, curriculum, and instruction used to
meet national, state, or local standards.
Standardized tests provide valuable data to be used by educators for school
reform and continuous improvement purposes. Data from these tests can
include early indicators that point to interventions for preventing potential
future problems. The data can also reveal when the system has broken down
or highlight exemplary performers that schools can emulate. Using such data
can be invaluable as a systemwide tool (Celio, 2013). Despite the potential
value of summative assessment as a tool to monitor and improve systems,
research finds minimal positive impact on student performance when the
tests are used for high-stakes purposes or to hold teachers and schools
accountable (Carnoy & Loeb, 2002; Hanushek & Raymond, 2005). The
increased use of incentives and other accountability measures, which have
cost enormous sums, reduced instruction time, and added stress to
teachers, can be linked to only an average effect size of 0.05 in
improvement of student achievement (Yeh, 2007).
As previously noted, formative assessment has been shown to be a much
more effective tool in helping individual students maintain progress toward
meeting accepted performance standards, and the rigor and cost required to
design valid and reliable standardized tests place them outside the realm of
tools that teachers can personally design. In the end, it is important to
understand what summative assessment is best suited to accomplish. When
it comes to improving systems, standardized assessment is well suited for
meeting a school’s needs. But for improving an individual student’s
performance, formative assessment is more appropriate.
Summary
Summative assessment is a commonplace tool used by teachers and school
administrators. It ranges from a simple teacher-constructed end-of-lesson
exam to standardized tests that determine graduation from high school and
entry into college. If used for the purposes for which it was designed,
summative assessment plays an important role in education. When used
appropriately, it can deliver objective data to support a teacher’s
professional judgment, to make high-stakes decisions, and as a tool for
acquiring the needed information for adjustments in curriculum and
instruction that will ultimately improve the education process. When used
incorrectly or for accountability purposes, summative assessment can take
valuable instruction time away from students and increase teacher and
student stress without producing notable results.
Citations
Allen, M. J., & Yen, W. M. (2002). Introduction to measurement theory. Long
Grove, IL: Waveland Press.
American Educational Research Association, American Psychological
Association, & National Council on Measurement in Education.
(1999). Standards for educational and psychological testing. Washington,
DC: American Educational Research Association (AERA).
Bangert-Drowns, R. L., Kulik, C. L. C., Kulik, J. A., & Morgan, M. (1991). The
instructional effect of feedback in test-like events. Review of educational
research, 61(2), 213–238.
Barnett, D. W. (1988). Professional judgment: A critical appraisal. School
Psychology Review, 17(4), 658–672.
Başol, G. (2003). Effectiveness of frequent testing over achievement: A
meta-analysis study. Unpublished doctoral dissertation, Ohio University,
Athens, OH.
Başol, G., & Johanson, G. (2009). Effectiveness of frequent testing over
achievement: A meta-analysis study. International Journal of Human
Sciences, 6(2), 99–121.
Belfield, C. R., & Crosta, P. M. (2012). Predicting success in college: The
importance of placement tests and high school transcripts. CCRC Working
Paper No. 42. New York, NY: Community College Research Center, Teachers
College, Columbia University.
Brennan, R. L. (Ed.) (2006). Educational measurement (4th ed.). Westport,
CT: Praeger Publishers.
Carnoy, M., & Loeb, S. (2002). Does external accountability affect student
outcomes? A cross-state analysis. Educational Evaluation and Policy
Analysis, 24(4), 305–331.
Celio, M. B. (2013). Seeking the magic metric: Using evidence to identify
and track school system quality. In Performance Feedback: Using Data to
Improve Educator Performance (Vol. 3, pp. 97–118). Oakland, CA: The Wing
Institute.
Espenshade, T. J., & Chung, C. Y. (2010). Standardized admission tests,
college performance, and campus diversity. Unpublished paper, Office of
Population Research, Princeton University, Princeton, NJ.
Fuchs, L. S. & Fuchs, D. (1986). Effects of systematic formative evaluation:
A meta-analysis. Exceptional Children, 53(3), 199–208.
Garrison, C., & Ehringhaus, M. (2007). Formative and summative
assessments in the classroom. Westerville, OH: Association for Middle Level
Education.
https://www.amle.org/portals/0/pdf/articles/Formative_Assessment_Article_Aug2013.pdf
Geiser, S., & Santelices, M. V. (2007). Validity of high-school grades in
predicting student success beyond the freshman year: High-school record
vs. standardized tests as indicators of four-year college outcomes. Research
and Occasional Paper Series. Berkeley, CA: Center for Studies in Higher
Education, University of California.
Gibbs, G., & Simpson, C. (2005). Conditions under which assessment
supports students’ learning. Learning and Teaching in Higher Education, 1,
3–31.
Hanushek, E. A., & Raymond, M. E. (2005). Does school accountability lead
to improved student performance? Journal of Policy Analysis and
Management, 24(2), 297–327.
Harlen, W., & James, M. (1997). Assessment and learning: Differences and
relationships between formative and summative assessment. Assessment in
Education: Principles, Policy & Practice, 4(3), 365–379.
Hart, R., Casserly, M., Uzzell, R., Palacios, M., Corcoran, A., & Spurgeon, L.
(2015). Student testing in America’s great city schools: An inventory and
preliminary analysis. Washington, DC: Council of the Great City Schools.
Lazarín, M. (2014). Testing overload in America’s schools. Washington, DC:
Center for American Progress.
McMillan, J. H., & Schumacher, S. (1997). Research in education: A
conceptual approach (4th ed.). New York, NY: Longman.
Mertler, C. A. (1999). Teachers’ (mis)conceptions of classroom test validity
and reliability. Paper presented at the annual meeting of the Mid-Western
Educational Research Association, Chicago, IL.
Moss, C. M. (2013). Research on classroom summative assessment. In J. H.
McMillan (Ed.), Handbook of research on classroom assessment (pp. 235–
255). Los Angeles, CA: Sage.
No Child Left Behind (NCLB) Act of 2001, Pub. L. No. 107-110, § 115, Stat.
1425. (2002).
Popham, W. J. (2014). Classroom assessment: What teachers need to know
(7th ed.). Boston, MA: Pearson Education.
Reigeluth, C. M. (1999). The elaboration theory: Guidance for scope and
sequence decisions. In C. M. Reigeluth (Ed.), Instructional design theories
and models: A new paradigm of instructional theory (Vol. II, pp. 425–453).
Mahwah, NJ: Lawrence Erlbaum.
Reynolds, C. R., Livingston, R. B., & Willson, V. (2010). Measurement and
assessment in education. Upper Saddle River, NJ: Pearson Education.
Rosenshine, B. (2003). High-stakes testing: Another analysis. Education
Policy Analysis Archives, 11(24), 1–8.
Shillingburg, W. (2016). Understanding validity and reliability in classroom,
school-wide, or district-wide assessments to be used in teacher/principal
evaluations. Retrieved from
https://cms.azed.gov/home/GetDocumentFile?id=57f6d9b3aadebf0a04b2691a
Spencer, T. D., Detrich, R., & Slocum, T. A. (2012). Evidence-based
practice: A framework for making effective decisions. Education and
Treatment of Children, 35(2), 127–151.
Taylor, C. S., & Nolen, S. B. (2005). Classroom assessment: Supporting
teaching and learning in real classrooms (2nd ed.). Upper Saddle River, NJ:
Pearson Prentice Hall.
Wang, M. C., Haertel, G. D., & Walberg, H. J. (1990). What influences
learning? A content analysis of review literature. The Journal of Educational
Research, 84(1), 30–43.
Yeh, S. S. (2007). The cost-effectiveness of five policies for improving
student achievement. American Journal of Evaluation, 28(4), 416–436.