
APPENDIX B

Depicting Changes in Reading Scores—An Example of a Usability Evaluation

To illustrate how the usability evaluation might work, we will focus on the redesign of a single data display from the report NAEP 1994 Reading: A First Look (Williams, Reese, Campbell, Mazzeo, & Phillips, 1995). This report is designed for a broad audience of policy makers, educators, and the press. Wainer and colleagues (1997a, 1999) redesigned several displays from the report in accord with specific usability standards described in Visual Revelations (Wainer, 1997b). These revisions were evaluated through formal usability trials in which preference and comprehension measures were taken (Wainer et al., 1999). We discuss the design modifications that resulted in one of Wainer's more successful redesigns and then illustrate how the processes shown in Box 6-1 (see Chapter 6) might be applied to make the illustration still more usable and accessible.

The original display appears in Figure B-1 and shows test scores as a function of administration date (1992 and 1994), grade (fourth, eighth, or twelfth), and geographic region (Central, Northeast, Southeast, and West). The format chosen is a perspective-view bar graph with region represented along the horizontal axis and grade represented in depth (on the z-axis). Scores for both years are shown, side by side, for each grade within each region. Numerical data values are placed above the tops of the individual bars. In his revision, Wainer selected a two-dimensional line graph for these data, and he removed the raw numerical values from the display. Year of administration was represented on the horizontal axis, and all other conditions were labeled by line grouping (grade) or by individual line (region) directly







by the relevant display objects. He also included a legend to help readers identify individual lines. The revision appears in Figure B-2.

FIGURE B-1 Source: Wainer, H., Hambleton, R.K., & Meara, K. (1999). Alternative displays for communicating NAEP results: A redesign and validity study. Journal of Educational Measurement, 36(4), 301-335. Copyright 1999 by the National Council on Measurement in Education; reproduced with permission from the publisher.

WHAT WOULD WE LEARN FROM A USER NEEDS ANALYSIS?

Before beginning to revise the display again, it is essential to have a list of user requirements based on the results of a user-needs analysis. This would involve bringing together small “user panels” composed of people representing the range of individuals who may be exposed to NAEP data reports. Note that the emphasis here is on the diversity rather than the typicality of potential group members. Thus, parents with limited educational backgrounds should be included, as well as educators who may have extensive backgrounds in educational testing. Policy makers with very different political agendas should be chosen, as well as members of the local and national press. Once user panels are established, focus groups, semi-structured brainstorming sessions, individual interviews, and other related methods

can be held to determine the expectations of group members. One of the most important questions in redesigning an existing display is what the users would like to know. What kinds of conclusions would they like to be able to draw? By giving panelists the data sets in a number of formats (numerical data tables and existing graphs in the present case), it would be possible to see which interpretations are made spontaneously, as well as the order in which these conclusions are drawn. Since the data presentation format will influence the nature of these spontaneous interpretations (Carswell and Ramzy, 1997), it is important to consider the conclusions drawn from the various formats.

FIGURE B-2 Source: Wainer, H., Hambleton, R.K., & Meara, K. (1999). Alternative displays for communicating NAEP results: A redesign and validity study. Journal of Educational Measurement, 36(4), 301-335. Copyright 1999 by the National Council on Measurement in Education; reproduced with permission from the publisher.

Alternatively, the data parameters could be described verbally, and panelists could be given the chance to ask questions. For instance, they could be told:

“We have average NAEP reading test scores from 1992 and 1994. These are reported separately for the 4th, 8th, and 12th grades. Data are also broken down by region: western, central, southeastern, or northeastern schools. What would you like to know about these data?”

Tracking panelists' questions is an effective method for eliciting the informational needs of potential users. To illustrate, suppose that these methods revealed that the following questions were asked of the 1992-1994 change data, in the following order:

(1) Were we (the United States as a whole) doing better or worse in 1994?
(2) Which regions were showing the most change, and in which direction?
(3) What kind of change occurred in my region?
(4) How does the change that occurred in my region compare to that found in other regions?

These questions should drive decisions about the content and structure of data displays. In addition, when performing usability tests on the comprehensibility of the data display, users' ability to answer these questions accurately should be a core criterion of design success. With the information needs of the users better understood, one or more usability analysts can perform a heuristic evaluation.

HEURISTIC EVALUATION OF THE ORIGINAL AND REVISED DISPLAYS

In the text that follows, we evaluate the original and revised displays (Figure B-1 and Figure B-2) of the 1992 and 1994 NAEP reading data by applying the heuristics proposed for the review of NAEP reports (Box 6-1). In addition, we propose changes to be made in the next design iteration.

Is the format compatible with the performance criterion selected?

Suppose that the questions raised during a hypothetical user-needs analysis revealed that users were primarily interested in ordinal information (e.g., “Did scores increase or decrease from 1992 to 1994?” “Did region X's scores increase or decrease more than region Y's?”). It is likely that the readers

would want quick access to this information. Thus, a graphical display, rather than a table, is the appropriate choice. This also suggests that displaying the exact data values in conjunction with the graph, as in the original bar chart, may be unnecessary and may even impede rapid access to the comparative information. Our revised display, like the two previous versions, will be graphical. And, as with the previously revised display, we will not report numeric values.

Is the structure of the display compatible with the structure of the data?

This heuristic is probably not relevant in the present case. Besides test scores, two (theoretically) continuous variables appear in the present data set: grade level and year of test administration. However, the present data describe only three grade levels and two test years. Thus, we can say very little about the relationship between either of these variables and test scores.

Is the perceptual grouping of information compatible with the mental grouping users must perform to extract the information they want and need?

The findings from our hypothetical user-needs analysis suggest that users clearly want to make comparisons and that they are most interested in comparing scores across test administration years. Thus, the two years for each of the region-grade combinations must be tightly grouped so that they can be perceived together. In the original graphic (Figure B-1), the two years were presented side by side, allowing grouping by proximity. In the revised graph (Figure B-2), the two data points were not close together relative to other data points, such as those showing test means for other regions; however, the two administrations for each region-grade condition were connected by a line. In the next revision of the graph, the 1992 and 1994 values should be connected by line segments, but they should also be closer together than in the first revision.
A second issue is the relative tightness of the grouping of data pairs for 1992 and 1994 values across the same region versus across the same grade level. That is, should all of the data for a region be grouped together, or should all of the data for a single grade be grouped together? In the original graph (Figure B-1), the data for a given year appeared in the same horizontal row perpendicular to the line of sight, while the data for a given region fell along a row parallel to the line of sight. Thus, grouping by region and grouping by grade are about equally strong. The first revision (Figure B-2) made grouping by grade stronger through spatial proximity, which allows easier access to comparisons among different regions within a grade level. Because our hypothetical user-needs analysis suggested that comparisons among regions were of greater importance, we would propose continuing to group by grade level so that data from different regions appear side by side. We would further highlight regional comparisons by adding a boundary (or “frame”) around the data from each grade level.

Is the level of numeric detail compatible with the reliability of the data and the needs of the reader?

Based on our hypothetical findings, we would drop the numeric means from the graph, as in the first revision (Figure B-2). Given the users' interest in the mean score changes from 1992 to 1994, reliability becomes important; that is, are the differences between the two mean scores reliable? Perhaps pairs of scores (i.e., pairs of bars in the original graph and line segments in the revised graph) could be coded as exceeding or not exceeding a specific reliability criterion. For example, in the original figure, pairs that were significantly different were coded with asterisks on one of the two bars.

Is data salience compatible with data importance?

As described above, statistically reliable changes in scores across test administrations should be differentiated from those that are not reliable. The asterisk used in the original figure (Figure B-1) is not highly attention getting. Color could be used for this purpose, and, possibly, a more saturated color could highlight the reliable differences.
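As a sketch of this coding step, the check below flags region-grade pairs whose 1992-to-1994 change exceeds a simple reliability criterion and routes them to a more salient color. The function, the z-criterion, and all score values are hypothetical placeholders for illustration, not actual NAEP statistics.

```python
# Sketch: flag reliable 1992-1994 changes so that only they receive a
# salient (saturated) color. All numbers are hypothetical placeholders,
# not published NAEP results.

def reliable_change(mean_1992, mean_1994, se_diff, z=1.96):
    """Return (delta, is_reliable): is the change larger than its
    sampling error at roughly the 95% confidence level?"""
    delta = mean_1994 - mean_1992
    return delta, abs(delta) > z * se_diff

# Hypothetical region-grade cells: (1992 mean, 1994 mean, SE of difference).
cells = {
    ("Grade 4", "West"): (216, 213, 1.2),
    ("Grade 4", "Central"): (219, 221, 1.1),
}

for cell, (m92, m94, se) in cells.items():
    delta, reliable = reliable_change(m92, m94, se)
    style = "saturated color" if reliable else "muted color"
    print(cell, f"delta={delta:+d}", style)
```

The same decision rule could drive any salience cue (saturation, weight, or an annotation), keeping the display's emphasis aligned with statistical reliability rather than visual accident.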
In terms of the relative salience of other graphic elements, the revised graph clearly highlights changes in scores from 1992 to 1994 that differ in magnitude or direction across the geographic regions. However, this salience may actually be misleading in making certain perceptual comparisons across grade levels. On the other hand, the original graph does not clearly highlight unusual changes in scores. Its placement of individual data points on the page tends to call attention to fourth-grade scores because they appear closer to the reader than the other scores in this “3-D” graph. This organization would be warranted if it were based on the perception that the audience is most interested in the fourth-grade scores. Otherwise, this organization could be a misuse of salience cues. In the revised graph (Figure B-2), the lengths of the lines connecting scores from the same grade-region combination will draw attention to the largest changes from 1992 to 1994.

Is the data display compatible with working memory limits?

One crude way of evaluating whether a data display is compatible with working memory limits is simply to count the number of groups of elements, as well as the number of elements in each of these groups. For example, the original graph (Figure B-1) could be described as 12 pairs of bars, or 12 groups of two elements. The revised graph (Figure B-2) could be described as three groups of five lines. A closer look should be taken whenever the number of major groups, or the number of elements within those groups, is greater than four. Thus, the “12 pairs” and “five lines” of the original and revised graphs, respectively, could pose some difficulties for working memory, depending on the tasks to be performed. If a reader is simply trying to count the number of times test scores appeared to decrease across the years, then exceeding the “rule of fours” is probably not a big problem. However, it might be different if an individual were trying to capture all instances of decreasing scores to generate causal hypotheses. One suggestion for the redesign of the original graph (Figure B-1) would be to create more distinctive groups for different grade levels. This would lead to three groups of four pairs of bars, which may help readers “chunk” information in working memory in a more manageable way. In the initial revision of the graph of reading scores (Figure B-2), two problems are evident. First, as noted, there are five lines in each of the three grade-level groupings.
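The counting heuristic just described can be sketched in a few lines. The function and its warning strings are our own illustrative invention, not an established procedure; the group counts come from the text above.

```python
# Sketch: the "rule of fours" working-memory check applied to the
# candidate displays. The threshold of four is the heuristic limit
# discussed in the text.

def rule_of_fours(n_groups, elements_per_group, limit=4):
    """Return warnings when the group structure exceeds the limit."""
    warnings = []
    if n_groups > limit:
        warnings.append(f"{n_groups} groups exceed the limit of {limit}")
    if elements_per_group > limit:
        warnings.append(f"{elements_per_group} elements per group exceed the limit of {limit}")
    return warnings

# Original graph: 12 pairs of bars; revised graph: 3 groups of 5 lines;
# proposed regrouping: 3 groups of 4 pairs of bars.
print(rule_of_fours(12, 2))  # too many groups
print(rule_of_fours(3, 5))   # too many lines per group
print(rule_of_fours(3, 4))   # within the limit
```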
In addition to scores from the four regions, a fifth line represents mean scores across the entire United States. This would seem to be important data to represent directly, given our hypothetical users' need to know how students in the United States are performing across the two years. However, it may not be necessary to show the mean value of test scores for both years to answer this question. Simply determining the overall pattern of the graphic—whether the lines seem to be mostly “going up” or “going down”—may suffice. Therefore, we would suggest removing the line showing the national means. A second problem relates to the use of legends to identify the regions on the revised graph (a number-of-lines-per-group problem). Different point symbols are used for each of the four regions, and the overall United States data are represented by a different line-style and point-symbol combination. Memorizing five symbols can be difficult, a problem that can often be remedied by placing labels directly beside the lines in a graph (Milroy & Poulton, 1978). An attempt was made to do this in the revised graph; nevertheless, because the lines overlap, the user must still rely on the symbols described in the legend. Again, dropping one of the lines would reduce the overlap that prevents use of the labels beside the lines. In addition, it would reduce the load on working memory by making it more likely that readers correctly identify the different lines, even when it is necessary to refer to the legend.

Are physical properties of the stimuli compatible with our ability to detect, discriminate, and recognize these properties?

Both the original graph and the revised graph use differences in position along an aligned scale to represent differences in performance between 1992 and 1994 for each region-grade combination. According to work by Cleveland and McGill (1984, 1985), this is one of the most accurate perceptual comparisons that can be made. Comparisons across different regions and grades within a given year are also made by comparing points along a common scale in the revised figure (Figure B-2). In the original figure, comparisons across grades are based on differences in the position of bar heights along nonaligned common scales. People are less accurate at these judgments. In the revised figure, comparisons of changes across region-grade conditions are to be made by comparing line slopes. Generally, people do not make accurate estimates of relative slopes. For a new revision of the graph, we would recommend devising a format that uses line lengths, which are more likely to be correctly interpreted.
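This recommendation rests on Cleveland and McGill's accuracy ordering of elementary perceptual tasks. The list below is a simplified paraphrase of that ordering for illustration, and the comparison helper is our own invention, not part of any graphics library.

```python
# Sketch: a simplified paraphrase of Cleveland and McGill's (1984)
# ordering of elementary perceptual tasks, from most to least
# accurately judged.

PERCEPTUAL_ACCURACY = [
    "position along a common aligned scale",
    "position along nonaligned scales",
    "length",
    "angle / slope",
    "area",
    "volume",
    "shading / color saturation",
]

def more_accurate(task_a, task_b):
    """True if task_a is judged more accurately than task_b."""
    return PERCEPTUAL_ACCURACY.index(task_a) < PERCEPTUAL_ACCURACY.index(task_b)

# Rationale for the next revision: replace slope comparisons with
# length comparisons.
print(more_accurate("length", "angle / slope"))  # True
```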
We should also be aware of the potential visual distortions or illusions that can occur in both the original and revised graphs. In the original graph, the use of linear perspective and other depth cues (e.g., occlusion) can lead to size illusions, with the size of the bars in the front of the graph underestimated relative to those in the back. With line graphs, designers should be aware that we often judge slope relative to nearby frameworks, such as other lines. The revised graph (Figure B-2) demonstrates this type of illusion. For example, the Central region appears to have a very large increase in fourth graders' performance across the two-year testing interval. This change is actually only one-fourth the size of the decrease in scores among twelfth graders for the same region. However, the line graph seems to show that the increase among fourth graders is at least as big as the decrease among twelfth graders. The reason for this misperception is that the slope of a line tends to be over- or underestimated depending on the slopes of surrounding lines (particularly lines that intersect the target line). Specifically, for the fourth-grade data, the positively sloping line for the Central region intersects the negatively sloping line for the Northeast region. This presentation tends to accentuate the slope of each. This is known as the Poggendorff illusion (Hubel & Wiesel, 1965, 1979). We will attempt to avoid the use of both perspective and line slope in our revision of the NAEP reading scores graph.

Is the organization of information in the display compatible with spatial metaphors and population stereotypes?

When the purpose is to show regional differences, the display should follow the cartographic conventions of representing north at the top of a map and west at the far left. A display that must order information about geographic regions across a page should conform to the left-for-West rule. In our case, this means that the following left-to-right arrangement of regions should be used: West, Central, Southeast, and Northeast. Neither the original nor the revised graph uses this ordering. In the original graph (Figure B-1), the map convention is reversed, with the most eastern region on the left of the page. In the revised graph (Figure B-2), the regions are ordered according to their mean scores.

Is the choice of display format and ornamentation compatible with the users' preferences and biases?

There is evidence that people are more likely to distrust data presented in perspective (3-D) displays (Carswell, Frankenberger, and Bernhard, 1991), such as the original graph.
Further, evidence suggests that people less familiar with graphs tend to feel less threatened by bar graphs than by line graphs (Vernon, 1952). In our revision of the graph, we will avoid the use of perspective and the use of traditional line graphs as well.

THE REVISED GRAPH

Based on the changes suggested by the heuristic evaluation described above, we produced the graph shown in Figure B-3. Note that position on common aligned scales is maintained for comparisons of scores across administrations and across regions within a grade level. However, absolute-score comparisons across grade levels cannot be made with this format. Since the hypothetical user-needs analysis indicated that few users would try to make such comparisons, we felt justified in sacrificing this piece of information. In return, the revised graph enables the use of length judgments for comparing the magnitude of changes among different regions and grades. The data are grouped into three clearly demarcated panels by grade level. Within each grade level there are four lines, each representing the two mean scores for a region. Rather than connecting two points that are offset horizontally, the revised graph uses two points along the same vertical grid line to represent the two test administration dates. The end of the line representing the second administration is marked by an arrowhead. For each grade level, the four regions are arranged from left to right using the west-to-east map convention.

In addition, several other changes simplify the presentation. The term “Midwest” was substituted for the term “Central” in order to streamline the axis labels. The grade-level panels were offset from left to right to mimic the spatial metaphor of moving through the grades as if climbing a staircase. Footnotes and legends were deleted; instead, a few explanatory comments were presented as part of the graph's title, where they are more likely to be read.

A USABILITY TEST: IS THE NEW GRAPH BETTER THAN THE EARLIER VERSIONS?

Even though we have a redesigned graph that incorporates findings from the user-needs analysis and the heuristic evaluation, we still would not know whether the new design is actually better or preferred by users.
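The panel layout just described can be sketched as data rather than graphics: grade panels, regions ordered west to east, and an arrow (or diamond) summarizing each 1992-to-1994 change. The region order and markers follow the description above, while the score deltas in the usage example are hypothetical placeholders, not actual NAEP values.

```python
# Sketch: a text rendering of the Figure B-3 layout logic. Deltas are
# hypothetical placeholders, not published NAEP results.

# Left-to-right order per the left-for-West map convention, with
# "Midwest" substituted for "Central" as in the revised graph.
REGION_ORDER = ["West", "Midwest", "Southeast", "Northeast"]

def change_marker(delta):
    """Arrowhead direction encodes the sign of the change; a diamond
    marks no change, as in the revised graph's caption."""
    if delta > 0:
        return "up arrow"
    if delta < 0:
        return "down arrow"
    return "diamond"

def render_panel(grade, deltas):
    """One grade-level panel: region labels left to right with markers."""
    cells = [f"{r}: {change_marker(deltas[r])}" for r in REGION_ORDER]
    return f"Grade {grade} | " + " | ".join(cells)

# Hypothetical deltas for a single panel.
print(render_panel(4, {"West": -2, "Midwest": 1, "Southeast": 0, "Northeast": -3}))
```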
Accordingly, the next step must be usability testing similar to that described by Wainer and colleagues (1999). The multiple versions of the graph should be viewed by different groups of subjects representative of the intended audiences. Users should be asked what they learned from the graph, and researchers should note whether or not users drew conclusions relevant to the major questions defined in the user-needs analysis. These interpretations can be timed, and follow-up questions can be asked to determine whether users can access important information. Preference data should also be collected after allowing participating users to view all three versions of the graph. There are many variations of usability tests, and many additional methods are described in Rubin (1994) and Nielsen (1993). If the graph were to be included in the next release of NAEP reports, then data on citations, requests for publication, and misinterpretations by the press could also be collected to gauge display comprehensibility and accessibility. These data should guide future revisions.

FIGURE B-3 Changes in Regional NAEP Reading Scores from 1992 to 1994. The direction and length of the arrows indicate the direction and size of the change in average scores. A diamond indicates that the average score remained the same.