5
Measurement Considerations
The assessments described in Chapters 2 through 4 have been
designed for a variety of purposes. Some, such as the assessment
components of Operation ARIES!, Packet Tracer, and the scenario-based
learning strategy described by Louise Yarnall, are designed primarily
for formative purposes. That is, the assessment results are used to
adapt instruction so that it best meets learners’ needs. Formative
assessments are intended to provide feedback that can be used both by
educators and by learners. Educators can use the results to gauge learning,
monitor performance, and guide day-to-day instruction. Students can use
the results to assist them in identifying their strengths and weaknesses
and focusing their studying. A key characteristic of formative assessments
is that they are conducted while students are in the process of learning
the material.1
Other assessments—referred to as summative—are conducted at the
conclusion of a unit of instruction (e.g., course, semester, school year).
Summative assessments provide information about students’ mastery
of the material after instruction is completed. They are designed to yield
information about the status of students’ achievement at a given point in
time, and their purpose is primarily to categorize the performance of a
student or a system. The PISA problem-solving assessment and the
portfolio assessments used at Envision Schools are examples of summative
assessments, as are the annual state achievement tests administered for
accountability purposes.
1 The reader is referred to Andrade and Cizek (2010) for further information about
formative assessment and the difference between formative and summative assessment.
ASSESSING 21ST CENTURY SKILLS
All assessments should be designed to be of high quality: to mea-
sure the intended constructs, provide useful and accurate information,
and meet technical and psychometric standards. For assessments used
to make decisions that have an important impact on test takers’ lives,
however, these issues are critical. When assessments are used to make
high-stakes decisions, such as promotion or retention, high school
graduation, college admissions, credentialing, job placement, and the like, they
must meet accepted standards to ensure that they are reliable, valid, and
fair to all the individuals who take them. A number of assessments used
for high-stakes decisions were discussed by workshop presenters,
including the Multistate Bar Exam used to award certification to lawyers,
the situational judgment test used for admitting Belgian students to
medical school, the tests of integrity used for hiring job applicants, and some
of the assessment center strategies used to make hiring and promotion
decisions.
For the workshop, the committee arranged for two presentations to
focus on technical measurement issues, particularly as they relate to high-
stakes uses of summative assessments. Deirdre Knapp, vice president
and director of the assessment, training, and policy studies division with
HumRRO, spoke about the fundamentals of developing assessments.
Steve Wise, vice president for research and development with Northwest
Evaluation Association, discussed the issues to consider in evaluating the
extent to which the assessments validly measure the constructs they are
intended to measure. This chapter summarizes their presentations and
lays out the steps they highlighted as fundamental for ensuring that the
assessments are of high quality and appropriate for their intended uses.2
Where appropriate, the reader is referred to other sources for more
in-depth technical information about test development procedures.
Defining the Construct
According to Knapp, assessment development should begin with
a “needs analysis.” A needs analysis is a systematic effort to determine
exactly what information users want to obtain from the assessment and
how they plan to use it. A needs analysis typically relies on
information gathered from surveys, focus groups, and other types of
discussions with stakeholders. Detailed information about conducting a
needs analysis can be found at http://www.needsassessment.org [August 2011].
2 Knapp’s presentation is available at
http://www7.national-academies.org/bota/21st_Century_Workshop_Knapp.pdf
[August 2011]. Wise’s presentation is available at
http://www7.national-academies.org/bota/21st_Century_Workshop_Wise.pdf
[August 2011].
Knapp emphasized that it is important to have a clear articulation of
the construct to be assessed: that is, the knowledge, skill, and/or behavior
the stakeholders would like to have measured. The construct definition
helps the test developer to determine how to measure it. She cautioned
that for the skills covered in this workshop, developing definitions and
operationalizing them in order to produce test items can be
challenging. For example, consider the variability in the definitions of
critical thinking that Nathan Kuncel presented or the definitions of
self-regulation that Rick Hoyle discussed. In order to develop an assessment
that meets appropriate technical standards, the definition needs to be
detailed and sufficiently precise to support the development of test items.
Test development is less challenging when the construct is more concrete
and discrete, such as specific subject-matter or job knowledge.
One of the more important issues to consider during the initial
development stage, Knapp said, is whether the assessment needs to measure
the skill itself or simply illustrate the skill. For instance, if the goal is to
measure teamwork skills, is it necessary to observe the test takers
actually performing their teamwork skills? Or is it sufficient that they simply
answer questions that show they know how to collaborate with others
and effectively work as a team? This is one of the issues that should be
covered as part of the needs analysis.
Knapp highlighted the importance of considering which aspects of
the construct can be measured by a standardized assessment and which
aspects cannot. If the construct being assessed is particularly broad and
the assessment cannot get at all components of it, what aspects of the
construct are the most important to capture? There are always tradeoffs in
assessment development, and careful prioritization of the most critical fea-
tures can help with decision making about the construct. Knapp advised
that once these decisions are made and the assessment is designed, the
developer should be absolutely clear on which aspects of the construct are
captured and which aspects are not.
Along with defining the construct, it is important to identify the
context or situation in which the knowledge, skills, or behaviors are to be
demonstrated. Identification of the specific way in which the construct is
to be demonstrated helps to determine the type of assessment items to
be used.
Determining the Item Types
As demonstrated by the examples discussed in Chapters 2, 3, and
4, there are many item types and assessment methods, ranging from
relatively straightforward multiple-choice items to more complex
simulations and portfolio assessments. Knapp noted that some of the recent
innovations in computer-based assessments allow for a variety of “glitzy”
options, but she cautioned that while these options may be attractive, they
may not be the best way to assess the targeted construct. The primary
focus in deciding on the assessment method is to consider the knowledge,
skill, and/or behavior that the test developer would like to elicit and then
to consider the best—and most cost-effective—way to elicit it.
Knapp discussed two decisions to make with regard to constructing
test items: the type of stimulus and the response mode. The stimulus is
what is presented to the test taker, the task to which he/she is expected
to respond. The stimulus can take a number of different forms such as a
brief question, a description of a problem to solve, a prompt, a scenario
or case study, or a simulation. The stimulus may be presented orally, on
paper, or using technology, such as a computer.
The response mode is the mechanism by which the test taker responds
to the item. Response modes might include choosing from among a set of
provided options, providing a brief written answer, providing a longer
written answer such as an essay, providing an oral answer, performing
a task or demonstrating a skill, or assembling a portfolio of materials.
Response modes are typically categorized as “selected-response”
or “constructed-response,” and constructed-response items are further
categorized as “short-answer constructed-response,” “extended-answer
constructed-response,” and “performance-based tasks.” Response modes
also include behavior checklists, such as those described by Candice
Odgers to assess conduct disorders, which may be completed by the test
taker or by an observer. The response may be provided orally, on paper,
through some type of performance or demonstration, or on a computer.
Knapp explained that choices about the stimulus type and the
response mode need to consider the skill to be evaluated, the level of
authenticity desired,3 how the assessment results will be used, and practi-
cal considerations. If the test is intended to measure knowledge of factual
information, a paper-and-pencil test with brief questions and multiple-
choice answer options may be all that is needed. If the test is intended to
measure more complex skills, such as solving complex, multipart prob-
lems, a response mode that requires the examinee to construct an answer
is likely to be more useful.
3 Authenticity refers to how closely the assessment task resembles the real-life situation
in which the test taker is required to use the skill being assessed. As described earlier, the
level of authenticity desired is an issue that should be addressed as part of a needs analysis.
Layered on top of these considerations about the best ways to elicit
the targeted skill are practical and technical constraints. Test questions
that use selected-response or short-answer constructed-response modes
can usually be scored relatively quickly by machine. Test questions that
use extended-answer constructed-response or performance-based tasks
are more complicated to score. Some may be scored by machine, by
programming the scoring criteria, but humans may need to score others.
Scoring by humans is usually more expensive than scoring by machine,
takes longer, and introduces subjectivity into the scoring process. Further-
more, constructed-response and performance-based tasks take longer to
answer, and fewer can be included on a single test administration. They
are more resource-intensive to develop and try out, and they usually
present some challenges in meeting accepted measurement standards for
technical quality. These practical and technical constraints are discussed
in more detail below.
Test Administration Issues
How will the assessment be administered to test takers? Where will
it be administered? When will it be administered and how often? Who
will administer it? There are numerous options for how the test may be
delivered to examinees and how they respond to it. Choosing among
these options requires consideration of practical constraints.
A small assessment program, with relatively few examinees and
infrequent administrations, has many options for administration, Knapp
advised. For example, performance-based tasks that involve role playing
or live performances, or that are administered one-on-one (one test
administrator to one examinee), are much more practical when the examinee
volume is small and test administrations are infrequent. When the examinee
volume is large, performance-based tasks may be impractical because of
the resources they require. The resources required for performance-based
tasks can be reduced if they can be presented and responded to via com-
puter, particularly if the scoring can be programmed as well.
Despite the resources required, several currently operating large
standardized testing programs make use of performance-based tasks. As
described in Chapter 2, the Multistate Bar Exam includes a performance-
based component with a written response and is administered to approx-
imately 40,000 candidates each year. Test takers pay $20 to take this
assessment.
Another example is Step 2 of the United States Medical Licensing
Examination (USMLE), which includes a performance-based component.
The Clinical Skills portion of the exam evaluates medical students’ ability
to gather information from patients, perform physical examinations, and
communicate their findings to patients and colleagues. The assessment
uses standardized patients to accomplish this. Standardized patients are
humans who are trained to pose as patients. They are trained in how they
should respond to the examinee’s questions in order to portray certain
symptoms and/or diseases, and they are trained to rate the examinee’s
skills in taking histories from patients with certain symptoms. (For more
information, see
http://www.usmle.org/examinations/step2/step2cs_content.html
[August 2011].) Approximately 33,300 individuals took the
exam between July 1, 2009, and June 30, 2010 (see
http://www.nbme.org/PDF/Publications/Annual-Report.pdf [August 2011]).
This exam is expensive for test takers; the fee to take the test is $1,100.
A third example is the portfolio component of the assessment used
to award advanced level certification for teachers by the National Board
for Professional Teaching Standards (NBPTS). This assessment evalu-
ates teachers’ ability to think critically and reflect on their teaching and
requires that teachers assemble a portfolio of written materials and
videotapes. Approximately 20,000 teachers take the assessment each year
(Mary Dilworth, vice president for research and higher education with the
NBPTS, personal communication, May 31, 2011), and scores are available
within 6 to 7 months (see http://www.nbpts.org/ [August 2011]). This
assessment is also expensive; examinees pay $2,500 to sit for the exam.
Scoring
Knapp noted that if the choice is to use extended constructed-response
or performance-based tasks, decisions must be made about how to score
them. These types of open-ended responses may be scored dichotomously
or polytomously. Dichotomous scoring means the answer is scored either
correct or incorrect. Polytomous scoring means a graded scale is used,
and points are awarded depending on the quality of the response or the
presence of certain attributes in the response. Either way, a scoring guide,
or “rubric,” must be developed to establish the criteria for earning a cer-
tain score. The scoring criteria may be programmed so a computer does
the scoring, or humans may be trained to do the scoring. When humans
do the scoring, substantial time must be spent on training them to apply
the scoring criteria appropriately. Since scoring constructed-response and
performance-based tasks requires that scorers make judgments about
the quality of the answer, the scorers need detailed instructions on how
to make these judgments systematically and in accord with the rubric.
Likewise, when constructed-response items are scored by computer, the
computer must be “trained” to score the responses correctly, and the
accuracy of this scoring must be closely monitored.
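The difference between dichotomous and polytomous scoring can be
illustrated with a minimal sketch. The rubric below, its criteria, its
point values, and the cut of 3 points are hypothetical and are not drawn
from any assessment discussed at the workshop. The sketch assumes that a
scorer (human or automated) has already judged which attributes are
present in a response.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricCriterion:
    """One attribute the scorer looks for, and the points it earns."""
    name: str
    points: int


# Hypothetical rubric for an extended constructed-response item.
RUBRIC = [
    RubricCriterion("states a claim", 1),
    RubricCriterion("cites evidence", 2),
    RubricCriterion("addresses a counterargument", 1),
]


def score_polytomous(attributes_present, rubric=RUBRIC):
    """Graded score: sum the points for each rubric attribute observed."""
    return sum(c.points for c in rubric if c.name in attributes_present)


def score_dichotomous(attributes_present, rubric=RUBRIC, cut=3):
    """Correct/incorrect score: 1 if the graded score reaches the cut, else 0."""
    return 1 if score_polytomous(attributes_present, rubric) >= cut else 0
```

A response judged to state a claim and cite evidence would earn 3 of 4
points under the graded scale and would count as correct under the
dichotomous rule. The same rubric serves both purposes; only the final
mapping from attributes to a score differs.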
For some purposes, it is useful to set “performance standards” for
the assessment. This might mean determining the level of performance
considered acceptable to pass the assessment. Or it may mean classifying
performance into three or more categories, such as “basic,” “proficient,”
and “advanced.” Making these kinds of performance-standards decisions
requires implementing a process called “standard setting.” For further
information about setting standards, see Cizek and Bunch (2007) or Zieky,
Perie, and Livingston (2008).
Technical Measurement Standards
Any assessment used to make important decisions about the test
takers should meet certain technical measurement standards. These technical
guidelines are laid out in documents such as the Standards for Educational
and Psychological Testing (American Educational Research Association,
American Psychological Association, and National Council on
Measurement in Education, 1999), hereafter referred to as the Standards. Knapp
and Wise focused on three critical technical qualities particularly relevant
for assessments of the kinds of skills covered in the workshop, given
the challenges in developing these assessments: reliability, validity, and
fairness.
Reliability
Reliability refers to the extent to which an examinee’s test score
reflects his or her true proficiency with regard to the trait being measured.
The concern of reliability is the precision of test scores, and, as explained
in more detail later in this section, the level of precision needed depends
on the intended uses of the scores and the consequences associated with
these uses (see also American Educational Research Association,
American Psychological Association, and National Council on Measurement in
Education, 1999, pp. 29-30).
Reliability is evaluated empirically using the test data, and several
different strategies can be used for collecting the data needed to calculate
the estimate of reliability. One strategy involves administering the same
form4 of the test or parallel forms of the test to the same group of exam-
inees at independent testing sessions. When multiple administrations are
impractical or unavailable, an alternative strategy involves estimating
reliability from a single test form given on a single occasion. For this type
of reliability estimate, the test form is divided into two or more
constituent parts, and the consistency across these parts is determined using an
estimate such as coefficient alpha or a split-half reliability coefficient.
Each of these strategies for estimating reliability examines the precision
of scores in relation to specific sources of error. Additional information
about estimating reliability is available in Haertel (2006) and Traub (1994).
4 A form is the specific collection of items or tasks that are included on the test.
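The single-administration estimates mentioned above can be sketched in a
few lines. This is an illustration under stated assumptions, not a
procedure described by the presenters: the data are invented, the
split-half estimate uses an odd-even split of the items, and the
half-test correlation is stepped up to full length with the
Spearman-Brown correction, a conventional choice that the chapter does
not name.

```python
from math import sqrt
from statistics import mean, pvariance


def coefficient_alpha(scores):
    """Coefficient alpha; rows are examinees, columns are items."""
    k = len(scores[0])
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_half(scores):
    """Odd-even split-half reliability, stepped up to full test length."""
    odd = [sum(row[0::2]) for row in scores]
    even = [sum(row[1::2]) for row in scores]
    r = pearson(odd, even)
    return 2 * r / (1 + r)  # Spearman-Brown correction for the halved length


# Invented data: five examinees, four items scored 0 or 1.
SCORES = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
```

For this small invented data set, coefficient_alpha(SCORES) is
approximately 0.80 and split_half(SCORES) is approximately 0.88. The two
estimates differ because each treats the division of the form into parts
differently, which is one reason the choice of estimate should be
reported along with the coefficient.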
For tests that are scored by humans, another type of reliability infor-
mation is commonly reported. When humans score examinee responses,
they must make subjective judgments based on comparing the scoring
guide and criteria to a particular test taker’s performance. This introduces
the possibility of scoring error associated with human judgment, and it is
important to estimate the impact of this source of error on test scores. One
estimate of reliability when human scoring is used is “inter-rater
agreement,” which is obtained by having two raters score each response and
calculating the correlation between these scores. Knapp indicated that an
estimate of inter-rater agreement provides basic reliability information,
but she cautioned that it is not the only type of reliability evidence that
should be collected when responses are scored by humans. A more com-
plete data collection strategy involves generalizability analysis, which can
be designed to examine the precision of test scores in relation to multiple
sources of error, such as testing occasion, test form, and rater. Additional
information about generalizability analysis is available in Shavelson and
Webb (1991).
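A minimal sketch of the inter-rater calculation follows. The ratings are
invented, and the exact-agreement rate is an added illustration rather
than something the presenters described. It is included because a
correlation alone cannot reveal a rater who is consistently more severe
or more lenient than another.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two raters' scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def exact_agreement(x, y):
    """Proportion of responses given the identical score by both raters."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


# Invented ratings of four responses on a 1-5 rubric.
RATER_A = [1, 2, 3, 4]
RATER_B = [2, 3, 4, 5]  # always one point higher than rater A
```

In this invented case the two raters rank the responses identically, so
the correlation is at its maximum, yet they never award the same score.
This is consistent with Knapp's caution that inter-rater agreement is
basic evidence and should be supplemented by other analyses.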
Reliability is typically reported as a coefficient that ranges from 0 to
1. The level of reliability needed depends on the nature of the test and
the intended use of the scores: there are no absolute levels of reliability
that are considered acceptable. When test results are used for high-stakes
purposes, such as with a high school exit exam, reliability coefficients in
the range of .90 or higher are typically expected. Lower reliability
coefficients may be acceptable for tests used for lower stakes purposes, such
as to determine next steps for instruction.
Generally, all else being equal, the more items on a test, the higher
the reliability. This is because longer tests obtain a more extensive
sample of the knowledge, skills, and behaviors being assessed than do
shorter tests. Tests that rely on open-ended questions, such as
extended-answer constructed-response and performance-based tasks, tend to
consist of fewer items because these types of questions take more time to
answer than do multiple-choice items. For practical reasons, such as the
amount of testing time available, and because of concerns about
examinee fatigue, tests can only include a limited number of these types of
questions. Thus, tests that make use of open-ended questions tend to be
less reliable than tests that primarily use multiple-choice questions, in
part, because they contain fewer test questions. In addition, tests that
require that judgments be made about the quality of the response—
either when humans do the scoring or when scoring is done by artificial
intelligence—introduce error associated with these judgments, which
also tends to reduce reliability levels. Knapp advised that these factors
should be considered in relation to the interpretations and uses of test
scores in making decisions about the types of questions used on the test.
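The link between test length and reliability is conventionally
summarized by the Spearman-Brown formula, which the chapter does not
name and which is offered here only as an illustration. It assumes that
any items added or removed are comparable in quality and content to the
existing ones, an assumption that may not hold for complex
performance-based tasks. The starting reliability values below are
hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor.

    Assumes the added (or removed) items are parallel to the existing ones.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Hypothetical example: doubling a test whose reliability is 0.60.
doubled = spearman_brown(0.60, 2)

# Hypothetical example: cutting a test with reliability 0.90 to one quarter
# of its length, as might happen when multiple-choice items are replaced by
# a few longer tasks in the same testing time.
quartered = spearman_brown(0.90, 0.25)
```

Under these assumptions, doubling the 0.60 test projects a reliability
of 0.75, while cutting the 0.90 test to a quarter of its length projects
a value below 0.70. The sketch shows only the length effect; it does not
capture the additional scoring error that judgment-based items
introduce.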
Two other measures of score precision to consider are the standard
error of measurement and classification consistency. The standard error
of measurement provides an estimate of precision that is on the same
scale as the test scores (i.e., as opposed to the 0 to 1 scale of a reliability
coefficient). The standard error of measurement can be used to calculate
a confidence band for an individual’s test score. Additional information
on standard errors of measurement and confidence bands can be found
in Anastasi (1988, pp. 133-137), Crocker and Algina (1986, pp. 122-124),
Popham (2000, pp. 135-138), and the Standards (American Educational
Research Association, American Psychological Association, and National
Council on Measurement in Education, 1999, pp. 28-31).
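As a hedged illustration, the standard error of measurement is commonly
computed in classical test theory as the standard deviation of the
scores multiplied by the square root of one minus the reliability. The
chapter does not state this formula; it is supplied here as the
conventional definition. The score scale, standard deviation, and
reliability below are hypothetical, and the band assumes normally
distributed measurement error.

```python
from math import sqrt


def standard_error_of_measurement(score_sd, reliability):
    """Classical SEM, expressed on the same scale as the test scores."""
    return score_sd * sqrt(1 - reliability)


def confidence_band(observed_score, score_sd, reliability, z=1.96):
    """Band of z SEMs around an observed score (z=1.96 gives roughly 95%)."""
    sem = standard_error_of_measurement(score_sd, reliability)
    return observed_score - z * sem, observed_score + z * sem


# Hypothetical scale: standard deviation of 10 points, reliability of 0.91.
sem = standard_error_of_measurement(10, 0.91)
low, high = confidence_band(50, 10, 0.91, z=1.0)
```

With these hypothetical values the standard error is about 3 score
points, so a band of one standard error around an observed score of 50
runs from about 47 to about 53. Expressing precision in score points
makes it easier to judge whether a difference between two examinees, or
between a score and a cut score, is larger than measurement error.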
The third measure of precision—classification consistency—is most
relevant when tests are used to classify the test takers into performance
categories, such as “basic,” “proficient,” or “advanced,” or simply as
“proficient” or “not proficient,” or “pass” or “fail.” When important
consequences are tied to test results, classification consistency should be
examined. Classification consistency estimates the proportion of test takers
who would be placed in the same category upon repeated administrations
of the test. In this case, the issue is the precision of measurement near
the cut score (the score used to classify test takers into the performance
categories). Additional information about classification consistency can
be found in the Standards (American Educational Research Association,
American Psychological Association, and National Council on
Measurement in Education, 1999, p. 30).
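The definition above can be sketched directly if scores from two
administrations are available for the same examinees. That is an
assumption made for illustration; in practice, consistency is often
estimated from a single administration using a statistical model. The
scores and cut scores below are invented.

```python
from bisect import bisect_right


def classify(score, cut_scores):
    """Return the index of the performance category for a score.

    cut_scores must be sorted in ascending order. A score equal to a cut
    score is placed in the higher category.
    """
    return bisect_right(cut_scores, score)


def classification_consistency(first, second, cut_scores):
    """Proportion of examinees placed in the same category both times."""
    same = sum(
        classify(a, cut_scores) == classify(b, cut_scores)
        for a, b in zip(first, second)
    )
    return same / len(first)


# Invented data: cut scores of 60 and 80 define three categories.
CUTS = [60, 80]
FIRST = [55, 62, 79, 85]
SECOND = [58, 59, 81, 90]
```

In this invented example, the two examinees whose scores fall far from
the cut scores are classified the same way both times, while the two
whose scores sit within a few points of a cut score change category.
This illustrates why the precision of measurement near the cut score
matters most for classification decisions.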
It is important to point out that for some of the more innovative
assessments, these measures of precision cannot be estimated. As Knapp
put it, “computer-based technology has gotten way ahead of the
capabilities of psychometric tools.” For example, at present there is no practical
way to estimate reliability for some of the computerized assessments,
such as those that are part of Operation ARIES! or Packet Tracer.
Validity
Validity refers to the extent to which the assessment scores measure
the skills that they purport to measure. As Steve Wise framed it, validity
refers to the “trustworthiness of the scores as being true representations
of a student’s proficiency in the construct being assessed.” Validation
involves the evaluation of the proposed interpretations and uses of the
test results. Validity is evaluated based on evidence—both rational and
empirical, qualitative and quantitative. This includes evidence based on
the processes and theory used to design and develop the test as well as
a variety of kinds of empirical evidence, such as analyses of the internal
structure of the test, analyses of the relationships between test results
and other outcome measures, and other studies designed to evaluate the
extent to which the intended interpretations of test results are justifiable
and appropriate. Wise and Knapp both emphasized that evaluation of
validity and collection of validity evidence is a continuing, ongoing
process that should be regularly conducted as part of the testing program. See
Messick (1989) and Kane (2006) for further information about validation.
Wise noted that many factors can affect the trustworthiness of the
scores, but two are particularly relevant for the issues raised in the work-
shop: motivation to perform well and construct irrelevant variance. One of
the most important influences on motivation to perform well is the ways in
which the scores are used—the interpretations made of them, the decisions
about actions to take based on those interpretations, and the consequences
(or stakes) attached to these decisions. When the stakes are high, Wise
explained, the incentive to perform well is strong. The more important
the consequences attached to the test results, the higher the motivation to
do well. Motivation to perform well is critical, Wise stressed, in obtaining
test results that are trustworthy as true representations of a student’s pro-
ficiency with the construct. If the test results do not matter or do not carry
consequences for students, they may not try their best, and the test results
may be a poor representation of their proficiency level.
Motivation to do well can also bring about perverse behaviors, Wise
cautioned. When test results have important consequences for students,
examinees may take a number of actions to improve their chances of
doing well—some appropriate and some inappropriate. For example,
some students may study extra hard and spend long hours preparing.
Others may find inappropriate short cuts that work to invalidate the test
results, such as finding out the test questions beforehand, copying from
another test taker, or bringing disallowed materials, such as study notes,
into the test administration. These types of behaviors can produce scores
that are not accurate representations of the students’ true skills.
For the kinds of skills discussed at this workshop, motivation to
do well can introduce a second source of error, which Wise described
as “fake-ability.” Some of the constructs have clearly socially
acceptable responses. For example, if the assessment is designed to measure
constructs such as adaptability, teamwork, or integrity, examinees may
be able to figure out the desired response and respond in the socially
acceptable way, regardless of whether it is a true representation of their
attitudes or behaviors. Another concern with these kinds of items is that
they may be particularly “coachable.” That is, those who are helping a
test taker prepare for the assessment can teach the candidate strategies
for scoring high on the assessment without having taught the candidate
the skill or construct being assessed. Thus, the score may be influenced
more by the candidate’s skill in test-taking strategies than by his or her
proficiency on the construct of interest.
A related issue is construct irrelevant variance. Problems with
construct irrelevant variance occur when something about the test questions
or administration procedures interferes with the ability to assess
the intended construct. For instance, if an assessment of teamwork is pre-
sented in English to students who are not fluent in English, the assessment
will measure comprehension of English as well as teamwork skills. This
may be acceptable if the test is intended to be an assessment of teamwork
skills in English. If not, it will be impossible to obtain a precise estimate
of the examinee’s ability on the intended construct because another factor
(facility with English) will interfere with demonstration of the true skill
level. This can be a particular concern with some of the more innovative
item types, such as those that are computer based or involve strategies
such as simulations or role-playing, Wise noted. If familiarity with the
item type or assessment strategy gives students an advantage that is not
related to the construct, the assessment will give a flawed portrayal of
the examinee’s skills. This influences the validity of the inferences being
made about the test scores.
Fairness
Fairness in testing means the assessment should be designed so that
test takers can demonstrate their proficiency on the targeted skill without
irrelevant factors interfering with their performance. As such, fairness
is an essential component of validity. Many attributes of test items can
contribute to construct irrelevant variance, as described above, and thus
require skills that are not the focus of the assessment. For instance,
suppose an assessment is intended to measure skill in mathematical
problem solving, but the test items are presented as word problems. Besides
assessing math skill, the items also require a certain level of reading skill.
Examinees who do not have sufficient reading skills will not be able to
read the items and thus will not be able to accurately demonstrate their
proficiency in mathematical problem solving. Likewise, if the word
problems are in English, examinees who do not have sufficient command of
the English language will not be able to demonstrate their proficiency in
the math skills that are the focus of the test.
Additional considerations about fairness may arise in relation to cul-
tural, racial, and gender issues. Test items should be written so that they
do not in some way disadvantage the test taker based on his or her racial/
ethnic identification or gender. For example, if the math word problem
discussed above uses an example more familiar or accessible to boys than
girls (e.g., an example drawn from sports), it may give the boys an unfair
advantage. The same may happen if the example is more familiar to stu-
dents from a white Anglo-Saxon culture than to racial/ethnic minority
students. Many of the skills covered in the workshop present considerable
challenges with regard to fairness. For example, cultural issues may cause
differential performance on assessments of skills in communication,
collaboration, or other interpersonal characteristics. Social inequities related
to income, family background, and home environment may also cause
differential performance on assessments. Students may not have equal
opportunities to learn these skills.
The measurement field has a number of ways to evaluate fairness
with assessments. The Standards (American Educational Research Asso-
ciation, American Psychological Association, and National Council on
Measurement in Education, 1999, pp. 71-106) provides a more complete
discussion.
The Relationship Between Test Uses and Technical Qualities
Knapp and Wise both emphasized that when test results are used for
summative purposes and high-stakes decisions are based on the results,
the tests are expected to meet high technical standards to ensure decisions
are based on accurate and fair information. For example, if a test is used
for pass/fail decisions to determine who graduates from high school
and who does not, the measurement accuracy of the scores needs to be
high. Meeting high technical standards can be challenging and expensive
because it requires a number of actions to be taken during the test
development, administration, and scoring stages. For example, when tests are
used for high-stakes purposes, reliability and classification consistency
should be high. Test items will need to be kept secure. They cannot be
reused multiple times because students remember them and pass the
information on to others. Having to continually replenish the item pool
is expensive and resource intensive, and it requires developing multiple
forms of the test.
If different forms of the test are used, efforts have to be made to create
test forms that are as comparable as possible. When tests are composed
of selected-response items or short constructed-response items,
quantitative methods can be used to ensure that the scores from different test
forms are equivalent. Statistical procedures—referred to as “equating”
or “linking”—can be used to put the scores from different forms on the
same scale and achieve this equivalence. For a number of reasons,
linking or equating is usually not possible when tests are composed solely
of extended constructed-response items. In this situation, there is no
straightforward way to ensure that the test forms are strictly comparable
and test scores equivalent across different forms. See Kolen and Brennan
(2004) or Holland and Dorans (2006) for additional explanation of linking.
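To make the idea of placing scores from different forms on the same
scale concrete, the sketch below uses linear equating, one of the
simplest methods. It is an illustration only and is not presented as the
procedure the presenters had in mind. It assumes that the two forms were
taken by randomly equivalent groups of examinees, and the score
distributions are invented.

```python
from statistics import mean, pstdev


def linear_equate(score_on_x, form_x_scores, form_y_scores):
    """Express a Form X score on the Form Y scale.

    Matches the mean and standard deviation of the two score
    distributions, assuming randomly equivalent groups took each form.
    """
    slope = pstdev(form_y_scores) / pstdev(form_x_scores)
    return mean(form_y_scores) + slope * (score_on_x - mean(form_x_scores))


# Invented score distributions for two forms of the same test.
FORM_X = [10, 20, 30]
FORM_Y = [30, 50, 70]
```

With these invented distributions, a score at the Form X mean maps to
the Form Y mean, and a score one standard deviation above the Form X
mean maps to one standard deviation above the Form Y mean. Operational
programs typically rely on more elaborate data collection designs and
methods, for which the sources cited above are the appropriate
reference.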
Thus, the test developer is often faced with a number of dilemmas.
Constructed-response and performance-based tasks may be the most
authentic way to assess 21st century skills. However, achieving high
technical standards with these item types is challenging. When tests do
not meet high technical standards, the results should not be used for
high-stakes decisions with important consequences for students. But,
when the results do not impact students’ lives in important ways (i.e.,
“they do not count”), students may not try their best. Raising the stakes
means increasing the technical quality of the tests. Test developers must
face these issues and set priorities as to the most important aspects of the
assessment. Is it more important to have authentic test items or to meet
high reliability standards? Test developers are often faced with competing
priorities and will need to make tradeoffs. Decisions about these tradeoffs
will need to be guided by the goals and purposes of the assessment as
well as practical constraints, such as the resources available.