PRACTITIONER NEEDS AND REACTIONS TO COMPUTER SCIENCE APPROACHES

Mark Pierzchala

CORK: And, finally, our capstone presentation for the day—and the capstone for the testing section of the workshop—is by Mark Pierzchala, who is a senior systems analyst at Westat. Previously, he was a mathematical statistician at the National Agricultural Statistics Service. Mark?

PIERZCHALA: I just want to make one comment, first, on the presentation by Thomas McCabe. Remember that slide he showed where there were four ways you could mess up code—like jumping out of a loop and so forth? That’s often what’s specified that I have to do … [laughter]

I was invited to do this talk, and I got some papers off the Internet, including some by Harry, there. I went through and tried to apply it to my experience in producing computer-assisted interviewing instruments. And, so I’m just going to go through these fairly quickly, but Pat and others have already given 90 percent of my talk. But some of the things we test for in computer-assisted interviewing—certainly, valid values and flow, but also the hard edits, the soft edits and the computations. But then I have a whole second slide, and there’s a reason I have a first slide and a second slide. But all I want to say is the stuff on the second slide—I’m not going to enumerate here—this is the fuzzier stuff [See Table II-1]. Things like usability and that kind of stuff, or getting the question text right when your question is all fills. That, to me, is a bit fuzzier. But I think that the point of this slide is that, often, when we arrive at a question, we want the tester not to merely verify that we’ve wound up in the right place but to verify 25 or 30 things, all at the same time. Or maybe they’ll do it in phases. But there’s a lot of stuff going on.

Table II-1 Areas of Testing Within Computer-Assisted Survey Instruments

First Slide:
• Valid values
• Flow
• Hard edits
• Soft edits
• Computations

Second Slide:
• Screen appearance
• Proper question and edit text
  – Static text
  – Dynamic text fills done properly
    • Pronouns, possessives, adjectives, etc.
    • In multiple languages
• Sample selection
• Multimedia display and play
• Data are correct
• Navigation
  – Non-linear navigation
  – Ad hoc navigation
• Pop-up help screens
• Interviewer understanding and usability

NOTES: Items from the first slide are more mechanical in nature and more readily suited to automated testing, while items from the second slide are context- and interface-specific features that may require human interaction for testing.
SOURCE: Workshop presentation by Mark Pierzchala.

A lot of the challenges that we meet have already been enumerated; I have a few ones here that are not enumerated. But let me just go through. We talked a little bit about scale, but I have an example—the Bladder Cancer Survey, where we … interviewed cancerous Spanish bladders [laughter]. This was actually conducted in Spain. I could have printed out the paper questionnaire, and I was going to just to be able to take the photograph, but then I figured out how many hundreds of dollars it would have cost and how many trees it would have killed to do it. So I never really printed it out. But it’s 16,000 pages in the questionnaire, and it’s in two languages, so it’s taller than a person if you were actually to print it out. And Westat did this survey all on a laptop. And I’ll say—and you can look it up, since this whole presentation is in your book—it actually worked pretty well. I’m not saying there weren’t any bugs, but I’m very pleased with this. But this gives you an idea of the scale we sometimes go through.

And then something I haven’t heard mentioned yet: versions of questionnaires. National Agricultural Statistics Service has offices in virtually every state and every questionnaire for the quarterly Ag survey in every other state is different. In NASS, they actually generate instruments. I mean, they have a spec database and they generate the instrument; there’s no programming after that generation. And that is what I call the Impossible CAI Program, and the only way to solve that was actually to generate the questionnaires. About half the questionnaires are CATI; half are on paper with field interviewers; and interactive editing on both modes of data collection. And that works. But that’s just another challenge that we have. Over a thousand production instruments have been produced so far; it’s been going on for seven or eight years. That’s another challenge.

Then, we had these longitudinal surveys. We’ve gone over that before, but we will often visit the same person—especially in agriculture, you’ll sometimes visit the same farmer, you know, 20 or 30 years running sometimes. And it’s all dependent interviewing.

How does the CAI industry test? Well, we test from specifications, and that’s manual testing. And we have some scripted testing, but that’s still manual. We have some ad hoc and targeted testing, and that’s manual; what I mean is that somebody is pounding away on the keyboard and then we do it over and over and over again. And I don’t think that’s so rare in the computer industry.

But I will say that [there have been] some recent advances in our industry. There is now more formal version control and build procedures; we have pop-up GUI error reporting dialogs from within our instruments, we’re using tracking databases. And we’re getting better at it.
I’ll say the last bullet here—I think that people are finally starting to catch on, what it really takes to test. Not always the case, but it’s starting to catch on.

There has been some automated regression testing. Westat, where I work, has experimented with WinRunner. RTI has a WinBatch playback utility. But the question that always pops back to us when we have these automated script-playing systems is: when do you implement it? How do you update a script when the specification changes? And what is the overhead to build scripts? So that’s sort of the down side to automated script testing, as far as I can see it, anyway.

I will say that one thing that I don’t think has been mentioned yet is that sometimes the specs are wrong. They’re inconsistent. And it doesn’t matter if you’ve got a database or not; you can have one line where this Access database says, “do A,” and the second line says, “do B,” and these can be inconsistent. So we often have updated specifications. And sometimes, of course, us programmers get it wrong and we … you know, I will admit that, and I’ve done it recently myself. You can read the spec wrong. So there’s a lot of iteration and rework.

My reading of model-based testing is that test scripts are generated automatically based on specifications. And one of the things that model-based testing tries to do is to keep the number of scripts reasonable. And then there’s something called a state table; we’ve heard a lot about state tables but, basically, here’s your state you begin at, here’s your action, and here’s the expected result. The state table can be hundreds, thousands of lines deep. Then you run the scripts in an automated system and you analyze the results. This is what I got from reading four or five articles on model-based testing.

I’ll say there are some advantages from my standpoint. You don’t have to hand-record the scripts. Therefore, you can execute many more scripts. Also, some scripts will exercise the application in ways that hand-written ones will not, and I think that’s very positive. And since the scripts are generated you can apply the tools much sooner, and you can overcome some of these robustness questions—for instance, what happens if you delete a question or insert a question. Scripts are not targeted, and that’s not necessarily a bad thing. But I will say that you’re probably going to want to have some targeted scripts, too. And one of the things I like is that you don’t have to test all combinations of valid values, and I have some examples here.

Now, where [might] model-based testing help? Let’s just take the valid values as being OK, and from that you can generate scripts. And one of the things, when you read the literature a bit, you keep hearing about constraints—you know, you have these valid values but then there are constraints. And, to me, the flow—I mean, the skip patterns—are a way of showing constraints.
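
As an illustration of the state table described above (the state you begin at, the action, and the expected result), here is a minimal sketch in Python. The table rows, the question names, and the `buggy_instrument` stand-in are all hypothetical and were not part of the presentation; the sketch also assumes a table with no loops, and it checks each step on its own rather than driving a real instrument.

```python
from collections import deque

# Hypothetical state table for a tiny instrument. Each row is
# (starting state, action, expected resulting state).
STATE_TABLE = [
    ("Q1", "answer_yes", "Q2"),
    ("Q1", "answer_no", "Q3"),
    ("Q2", "answer_any", "Q3"),
    ("Q3", "answer_any", "END"),
]


def generate_scripts(table, start="Q1", end="END"):
    """Enumerate every sequence of rows leading from start to end.

    Assumes the table has no loops; a looping table would never finish.
    """
    by_state = {}
    for state, action, result in table:
        by_state.setdefault(state, []).append((action, result))

    scripts = []
    queue = deque([(start, [])])
    while queue:
        state, steps = queue.popleft()
        if state == end:
            scripts.append(steps)
            continue
        for action, result in by_state.get(state, []):
            queue.append((result, steps + [(state, action, result)]))
    return scripts


def run_script(script, instrument):
    """Replay a script step by step and collect any mismatches."""
    failures = []
    for state, action, expected in script:
        actual = instrument(state, action)
        if actual != expected:
            failures.append((state, action, expected, actual))
    return failures


def buggy_instrument(state, action):
    """Hypothetical stand-in for an instrument; it wrongly skips from Q2 to END."""
    routes = {
        ("Q1", "answer_yes"): "Q2",
        ("Q1", "answer_no"): "Q3",
        ("Q2", "answer_any"): "END",  # deliberate error: the table expects Q3
        ("Q3", "answer_any"): "END",
    }
    return routes[(state, action)]


if __name__ == "__main__":
    for script in generate_scripts(STATE_TABLE):
        print(script, run_script(script, buggy_instrument))
```

Because the scripts come from the table rather than from a recording session, deleting or inserting a question only means editing the table and regenerating.
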
So you have scripts, you subject them to constraints, and then you can test some other things—you know, whether the hard edits pop appropriately or not, and so on.

I have a simple example; I like Harry’s diagrams better, but these will get the point across. And I actually programmed this in Blaise; it’s a four-question instrument. But, just say that—I did a very categorical example where you have 2 categories, 21 categories, 2 categories, and 21 categories. And there are essentially two paths through this, and these little symbols—these are edits, but I won’t be going into those today.

My perception of how model-based testing might be executed on such a simple example is, first, I would test the flow. Then, using the flow as constraints, I would test the hard edits. Using that as constraints, I would then test the soft edits, and so on. I’m trying to test intelligently; I’m trying to cut down on scripts. Because—as you’re going to see—even with such a small example there can be an enormous number of combinations of values. For instance, on the example I gave you, the number of possible combinations of question values is 1,764. The number of pairwise combinations is 613, and the reason I picked up on this is because I read it in the literature—it says that you don’t have to test all combinations, maybe if you test pairwise combinations you might be able to be as effective, and that’s certainly going to reduce the number of scripts, for example.

But we’ve seen that there are only three paths through the questionnaire. So how might we be more efficient about that? Perhaps we look only at the regions; they come from the diamonds in the flow graph, the decision points. For example, I might test the region boundaries. And then maybe I’ll just pick a point in the middle of these boundaries. And perhaps I can test effectively, to my comfort level, using just 12 scripts rather than, say, 613 scripts. This is the way I picked up from the literature how you have to start looking at this stuff.

So if flow testing is OK then look at the hard edit testing, and the idea is to cut down on the number of tests using the flow as the constraints. And remember the number of pairwise combinations was 613. But if I use the flow as constraints, then there are really only 104 combinations that are valid for some of this stuff.

Now, I will say that—in the articles I read—there are these algorithms, and they were mentioned earlier: the Chinese postman’s algorithm is a way of going through the state table with as few scripts as possible but still cover all the necessary actions. I think that this is a beautiful idea; I wish I knew how to do it.

I would say that, because of the advantages of model-based testing, it is worth investigation. Implementation will be a challenge because there are details to work out. I’d say that I read the articles, I listened to the presentations this afternoon, I still don’t understand it.
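
The two counts quoted above for the four-question example can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, assuming that "pairwise combinations" means every distinct pair of answer values taken from two different questions; the function names are made up for illustration. The flow-constrained figure of 104 is not reproduced, because the skip rules of the example instrument are not given in the text.

```python
from itertools import combinations
from math import prod

# Category counts for the four questions in the example: 2, 21, 2, 21.
CATEGORY_COUNTS = [2, 21, 2, 21]


def all_value_combinations(counts):
    """Number of complete answer sets: the product of the category counts."""
    return prod(counts)


def pairwise_value_combinations(counts):
    """Number of distinct value pairs drawn from two different questions."""
    return sum(a * b for a, b in combinations(counts, 2))


if __name__ == "__main__":
    print(all_value_combinations(CATEGORY_COUNTS))       # 2 * 21 * 2 * 21 = 1764
    print(pairwise_value_combinations(CATEGORY_COUNTS))  # 42 + 4 + 42 + 42 + 441 + 42 = 613
```

Under this reading, 613 counts value pairs to be covered. A single script answers all four questions and so covers up to six pairs at once, which means the number of scripts needed for pairwise coverage can be smaller than 613.
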
I think that the details left to work out are huge. And that’s probably because I haven’t done it yet. There are fuzzier things—and I don’t think that these are as amenable, but if somebody tells me that they are, I’ll rejoice. [laughter] I’m not saying take my word for it; I’m just saying that there are things where I have more doubts.

Now, my last slide is, I think, the most key slide of all: my questions. What form of specification? Because I think the thing that the computer scientists ought to know is that our specification is often a Word document. And somebody has described, in a Word document, everything. Valid values, flow, question text. And that does not seem amenable to model-based testing in the sense that you’re trying to generate test scripts from your specifications. So, some sort of database specification seems to be required, but what is the minimal information content? And one of the presentations just before this one got at, you know, putting some information in your database about what you should expect. But, still, I don’t have clearly formed in my mind how that database is going to look, for both specification and some of the results.

And then there are a lot of nicely-named algorithms: where do we find out about them? More importantly, is there yet a textbook about this? I mean, what is the maturity of model-based programming? I know that there’s a textbook on extreme programming, or at least a book. But is there one on model-based testing, because the articles I read—to be quite honest—extol the benefits of it and give some little examples but don’t really get into the nitty-gritty about how to do it. So, that’s it. Yes, Harry?

ROBINSON: I guess I can answer, I think, the last one of those, about textbooks. About the actual automation part of it, there aren’t yet. But there is a book from 1995 or so called Black Box Testing. In fact, what they do in that, their running example is the 1994 1040 form.

SMITH: Just a comment … I liked the slide you had on all the other things that come into play and are really relevant for some applications. In computer-assisted instruction, a huge amount of human time goes into the media and GUI…. I wonder whether these surveys have a major GUI portion.

PIERZCHALA: They can. We do these, this kind of interviewing where we turn the laptop around to the respondents and play, you know, radio ads—“have you heard this ad against drug abuse?” Or we play a clip from television against drug abuse. There’s more and more of this kind of GUI aspect to it. But most of it is still an interviewer reading a question.

MCCABE: I’d like to extend the notion, a little bit, about model testing. I’ve seen projects where the specification might be a Word document, a narrative document. But the exercise of coming up with scenarios that would provide a test can work off a Word document as well. And a lot of times it’s thought that you’re testing a product that’s built, but what’s missed is that it’s often useful to test the requirement.
For example, the requirement might be that the system be interactive. Well, what’s the test for that? You put the input in, and it comes up three days later, does that show that it’s interactive? But to specify the interval of time, for example … So what you find is that when you take any kind of document—narrative, state shell, whatever it is—and start fleshing out the tests, it often shows that the model itself is incorrect in the system, way ahead of the product. And the process of refining the test, and getting consensus that this is a test, it’s complete and robust and accurate, often in a project is very worthwhile, whereby you debug the specification in whatever form it’s in. And then you have consensus before you move any software or survey to apply those acceptance tests.

PIERZCHALA: Certainly, one of the things I advocate is to get test scenarios set very early.

MCCABE: My point is that it’s not so much technology-based as it is methodology-based.

PIERZCHALA: I understand that. I was reading in the literature where it says that these things are generated; what you’re saying is that you can also apply the methodology even if things aren’t fully in a “generable” form.

MCCABE: And you can do better.

PARTICIPANT: Mark, I also liked the fact that you reviewed for everybody that there’s a lot involved in the testing that isn’t going to be addressed through model-based testing. You still have to do the screen standards and the cosmetic, the navigation and the usability, all of which is very time-consuming and expensive. So I have a question for any of the presenters who talked today about model-based testing, and that is: if you have a client who gives you a project that is both time- and budget-constrained, will the model-based testing be cost-neutral? That is, do you save enough money from it, in perhaps avoiding some of the more iterative phases of the traditional sorts of testing, so that it would not cost the client any more?

ROBINSON: I can take a swing at that. We have a lot of teams that are light on budget, and their refrain is, “we can’t afford to do that.” And our response has been, “you can’t afford not to.” Because what they’re doing is, it looks like they’re saving money up front but they’ll actually be paying for it later on.

PIERZCHALA: Let me just say, one of my jobs this afternoon, as given to me, was to sort of pull everybody back to earth. And I think I did that a little bit. I would say that I like the idea of model-based testing, and I can even see myself applying it once I’ve learned more about it, how actually to do it, because I think I’ll eliminate the surprise factor—these combinations of data that nobody’s going to put into a scenario but that are going to blow up a calculation, for example.
If I can get rid of that kind of stuff, I’d be very happy. Yes?

SMITH: Just to respond to the question. I think that given a very appropriate sort of model, sort of Markovian model-based testing we were doing at Computer Curriculum, I think everyone would agree that it really saved a lot of time and money. There would have been no question about that, though we never tried any scenarios to see. Then we were getting killed by the other kinds of testing in this slide show. But we saved enough money with the model-based testing to test the GUI.

MCCABE: I have two comments. One is, you think about the process and think of getting a model early and a test early, and the irony is that a model, or a Word document, is often as wrong as the software. It’s inconsistent, it’s incomplete, it’s not regular … So when you think of the test, an acceptance test to apply to the spec itself, usually it’d be worth it. Now the cost of those errors multiplies if you don’t catch them earlier, so you’re catching very expensive errors early. That’s the good news. The bad news is that this often doesn’t work because of time frame. See, what gets in the way is agencies having an RFP before they have any requirement document. And the problem is that the contract is bid on, and the bid is for a fixed price and etc., and no one would go back to the sponsor and say, “these requirements are all wrong.” Because they want to get paid.

DOYLE: And they want to win the contract.

MCCABE: A lot of lifecycles go wrong because there’s a conflict, if not a conflict of interest then at least a conflict of problem. And it gets in the way of everything. But, otherwise, it saves as much money by avoiding doing the wrong thing.

PIERZCHALA: Yes, Harry?

ROBINSON: Just back to Tom’s comments on Word documents and such … What we’ve been doing is that the Word documents that are written up by our system engineers, those have been converted to models on the test side. And typically you run into what we call “specification rot,” because you kind of use the use cases and then nobody uses them again and they drop out. But we use our specs to drive the tests so that—rather than being a requirement that somebody keep the specs up to date—it actually serves our purposes to have them up to date.

PIERZCHALA: Yes, Mike?

COHEN: There is a textbook coming out on April 26 by a student of Jesse Poore’s, James Whittaker.

PIERZCHALA: Oh, good … did everybody hear that, on April 26, rush out to your local bookstore … Who’s the publisher?

ROBINSON: Addison-Wesley, I think.
PIERZCHALA: I’m glad to hear that because textbooks can gloss over some issues but they also tend to bring issues together and synthesize them better than just going one article after another.

ROBINSON: Just because I can’t let something go, that model-based can’t test something … [laughter] Just going on a thought here. Getting back to the problem of getting the grammar right, the fills, would it be feasible—if you’re going to generate something that will take you through all of the questions—that you could then automatically put all the questions through a grammar checker?

PIERZCHALA: I think that there are ways to do that because, after all, those fills are just a variable or a field. And I think semantically it’s harder to tackle but I’m not willing to say it can’t be done, either. In fact, I would really love it if it could be done.

DOYLE: That would take a lot of work off … we would love that.

PARTICIPANT: I think that you could use part of the model-based testing and combine it with the human factors, present these cases for review. Our difficulty is hammering the keyboard and getting it to a place where we can look at it again, that you could use this technique early on to identify cases that you want to take a look at and look at the human factors and the fuzzy concepts you identified.

GROVES: Let me make sure I’ve got it right … Does model-based testing integrate its own notions of complexity as priors on the model? So if you had a region of activity which was very complex … that could be subjected to more testing?

ROBINSON: It’s probably been done; we’ve never actually done that, but what we’ve done is that—as you make up the model—you suddenly begin to realize that all of the arrows go into one spot. Or there are way too many arrows coming out of one. So you look at it and say, “that’s untestable.” It would be wonderful to have some sort of complexity you could just run on the rest …

MCCABE: If I could add to that … I did a company’s research that built some stuff. At that time, there was a thing called “data flow diagrams” (DFDs), and I developed some mathematics that could develop a test based on those diagrams. And the beauty of that is the requirements specification and the test specification were the same. So what would happen is that the agencies would develop DFDs as a specification, and we would derive a test based on that. Now the interesting thing is that if you [take a walk through the program and compare with] the DFDs, about 30 percent of them are off because they’ve never been tested. Because the DFD can lead you to places, and it never went there.
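
Robinson’s remark about arrows piling into or out of one spot suggests a simple automatic check on a model. The Python sketch below counts incoming and outgoing arrows per node and flags nodes above a chosen limit; it also lists nodes that cannot be reached from the start, which is one way a diagram can be "off". The arrow list, the node names, and the limit of 2 are hypothetical, and this is not a measure that any of the speakers describes using.

```python
from collections import defaultdict

# Hypothetical model: each arrow is (from_node, to_node).
ARROWS = [
    ("Start", "Q1"), ("Q1", "Q2"), ("Q1", "Q3"), ("Q2", "Hub"),
    ("Q3", "Hub"), ("Q1", "Hub"), ("Hub", "End"), ("Orphan", "End"),
]


def arrow_counts(arrows):
    """Return (incoming, outgoing) arrow counts keyed by node."""
    incoming, outgoing = defaultdict(int), defaultdict(int)
    for src, dst in arrows:
        outgoing[src] += 1
        incoming[dst] += 1
    return dict(incoming), dict(outgoing)


def crowded_nodes(arrows, limit):
    """Nodes with more than `limit` arrows going in or coming out."""
    incoming, outgoing = arrow_counts(arrows)
    nodes = set(incoming) | set(outgoing)
    return sorted(
        n for n in nodes
        if incoming.get(n, 0) > limit or outgoing.get(n, 0) > limit
    )


def unreachable_nodes(arrows, start):
    """Nodes that no path from `start` ever enters."""
    successors = defaultdict(list)
    nodes = set()
    for src, dst in arrows:
        successors[src].append(dst)
        nodes.update((src, dst))
    seen, stack = {start}, [start]
    while stack:
        for nxt in successors[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return sorted(nodes - seen)


if __name__ == "__main__":
    print(crowded_nodes(ARROWS, limit=2))
    print(unreachable_nodes(ARROWS, start="Start"))
```

A flagged node is only a candidate for closer review or extra test scripts; the counts say nothing about whether the logic at that node is right.
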
And the way you’d find that out is that you derive the test, and you find things—for example—in the DFD like sections you can’t get into…. So we did that for a couple of years, published some things. Now there’s another set of work related to that, and that’s called “work flow testing,” developed by Musa at Bell Labs and used there and other places. It’s very much like what Harry’s talking about in that it adds what are called “characteristic nodes.” So you can take the clock, for example, and it might have a lot of usage scenarios, but it might be that scenarios one and two are used 90 percent of the time. So out of that [you can get a sort of stress test.] But … the fundamental idea is to test up front, whatever the medium is, and to build the requirements in.

DOYLE: The diagram you gave; is that some kind of software that could be run on our instruments, to see how complex it is? Highly