Wednesday, October 20, 2004
I have run across a few references to Paul Dietz's Common Lisp test suite over the past few months, the latest being from Juho Snellman's blog where he describes how the suite caught some errors in his new SBCL register allocator code.
Paul describes his work on the suite in his own blog, the 13-September entry of which had this to say:
To go beyond simple 'does function FOO do X when called on arguments Y' kinds of tests, I've also started experimenting with a random tester. I've taken the approach of William McKeeman (favorite quote: "Excellent testing can make you unpopular with almost everyone.") I generate random lisp forms that are guaranteed to be well-defined when a set of free variables have integer values. The forms are wrapped in two lambda expressions, one with type declarations for the free variables and one in which optimizations are inhibited. The functions are compiled and their outputs compared on a set of random inputs.
This approach is very simpleminded (and a lot easier to do in Lisp than in C, as McKeeman did), but it immediately found new bugs in every lisp implementation on which I tried it. The beauty of this approach is that it exploits increasingly cheap CPU cycles to save the increasingly relatively expensive programmer/tester cycles.
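The shape of the technique is easy to see in miniature. Here is a hedged Python sketch (my own illustration, not Paul Dietz's actual tester): generate random forms that are guaranteed to be well-defined for integer inputs, push each form through two independent evaluation paths, and compare the outputs on random inputs.

```python
import operator
import random

OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}

def random_form(depth=3):
    """Random expression tree over free variables x and y, closed
    under integer arithmetic so every form is well-defined."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(['x', 'y', random.randint(-9, 9)])
    op = random.choice(list(OPS))
    return (op, random_form(depth - 1), random_form(depth - 1))

def interpret(form, env):
    """Path 1: a direct tree-walking interpreter."""
    if isinstance(form, tuple):
        op, a, b = form
        return OPS[op](interpret(a, env), interpret(b, env))
    return env.get(form, form)          # variable name or integer literal

def to_source(form):
    """Render a form as Python source for the second path."""
    if isinstance(form, tuple):
        op, a, b = form
        return '(%s %s %s)' % (to_source(a), op, to_source(b))
    return str(form)

def differential_test(trials=200, inputs=20):
    """Path 2 compiles each form with Python's own compiler, then
    compares the two paths on random integer inputs."""
    mismatches = []
    for _ in range(trials):
        form = random_form()
        compiled = eval('lambda x, y: ' + to_source(form))
        for _ in range(inputs):
            env = {'x': random.randint(-100, 100),
                   'y': random.randint(-100, 100)}
            if compiled(env['x'], env['y']) != interpret(form, env):
                mismatches.append((to_source(form), dict(env)))
    return mismatches
```

Here both paths ride on the same Python runtime, so mismatches should never turn up; in Dietz's setting the two lambda expressions go through genuinely different compilation strategies, and any disagreement points at a compiler bug.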
In my experience, random testing of just about anything is a really good thing. I have always found it effective at finding those elusive corner cases that are so difficult for people to envision on their own. The computer can generate such perverse test cases that it really makes you wonder why people don't do this more often.
For instance, when I was at HP in the late 1980s working on the PA-RISC Series 700 workstation products, I knew one of the engineers in the CPU group who was designing the next round of IEEE floating point co-processors for PA-RISC. In case you aren't aware, the IEEE FP spec is very precise but has a number of very hairy corner cases; in particular, things like the rules for rounding in the last bit of precision are called out very clearly. This engineer, who I always thought was pretty smart, built himself a software simulation of his chip design and proceeded to generate random test vectors for the new design. Simultaneously, he ran those same vectors through his trusty Motorola 68K FP co-processor and had the system stop and dump any vectors that produced a mismatch.
When he first fired up the simulation, it ran for only a few seconds before it found a problem. He fixed the problem and restarted. It ran for a minute. He fixed the problem and restarted. It ran for a few minutes. He fixed the problem and restarted. It ran for an hour or two. He fixed the problem and restarted. It ran for 24 hours. He fixed the problem and restarted. It then ran non-stop.
Note that if he had just done this, he actually couldn't have been sure that his processor was bug-free, only that it had exactly the same bugs as his 68K FPU. To ensure that he didn't miss a bug, he also compared with earlier PA-RISC FPUs.
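The skeleton of that kind of comparison rig fits in a few lines. Below is a hedged Python sketch (my illustration, obviously not the engineer's chip simulation): random 64-bit patterns are reinterpreted as IEEE-754 doubles, so NaNs, infinities, and denormals all show up as test vectors, and the "device under test" is a stand-in function with a deliberately planted sign-of-zero bug so the mismatch path gets exercised.

```python
import math
import random
import struct

def random_double():
    """A random 64-bit pattern reinterpreted as an IEEE-754 double."""
    bits = random.getrandbits(64)
    return struct.unpack('<d', struct.pack('<Q', bits))[0]

def reference_mul(a, b):
    """Stands in for the trusted reference unit (the 68K in the story)."""
    return a * b

def dut_mul(a, b):
    """Hypothetical device under test, with a planted bug:
    it canonicalizes -0.0 results to +0.0."""
    r = a * b
    if r == 0.0:
        return 0.0
    return r

def same_value(x, y):
    """Bitwise comparison: distinguishes -0.0 from +0.0, and treats
    any two NaNs as a match, as a hardware test rig typically would."""
    if math.isnan(x) and math.isnan(y):
        return True
    return struct.pack('<d', x) == struct.pack('<d', y)

def run(vectors=100000):
    """Dump every vector on which the two units disagree."""
    mismatches = []
    for _ in range(vectors):
        a, b = random_double(), random_double()
        if not same_value(dut_mul(a, b), reference_mul(a, b)):
            mismatches.append((a, b))
    return mismatches
```

With fully random bit patterns, products that underflow to -0.0 occur often enough that the planted bug is caught within the first few dozen vectors.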
What is most interesting about this example is that if Intel had done the same basic testing on the original Pentium, they would not have suffered the massive FDIV bug debacle of 1994.
I have used the same technique in testing network protocol implementations. Protocol decoding is notoriously error-prone, and bugs there can result in huge security holes. In this case, I try to generate real data streams and then write a program to corrupt the data slightly. Because protocols involve various packet integrity algorithms, you often have to generate something that will make it all the way through different parts of the stack rather than getting thrown out at the lowest levels; otherwise you aren't testing much other than the lowest layers of your system. For instance, if you capture IP packets and corrupt them, recompute the IP and TCP checksums so the packets don't get bounced right at those checks but actually make it to different parts of the stack.
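As a concrete sketch of that last point (illustrative Python only, assuming a plain IPv4/TCP packet with no unusual options): flip one random bit in the TCP payload, then patch up both checksums so the mutation survives past the lowest layers.

```python
import random
import struct

def ones_complement_sum(data):
    """The 16-bit one's-complement checksum used by both IP and TCP."""
    if len(data) % 2:
        data += b'\x00'
    s = sum(struct.unpack('!%dH' % (len(data) // 2), data))
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return (~s) & 0xFFFF

def fuzz_packet(pkt):
    """Flip one random bit in the TCP payload of a raw IPv4 packet,
    then recompute the IP and TCP checksums. Sketch only: assumes
    IPv4 carrying TCP (protocol 6)."""
    pkt = bytearray(pkt)
    ihl = (pkt[0] & 0x0F) * 4                    # IP header length in bytes
    tcp_off = ihl + ((pkt[ihl + 12] >> 4) * 4)   # start of TCP payload
    if tcp_off < len(pkt):
        i = random.randrange(tcp_off, len(pkt))
        pkt[i] ^= 1 << random.randrange(8)       # the actual corruption
    # recompute the IP header checksum
    pkt[10:12] = b'\x00\x00'
    pkt[10:12] = struct.pack('!H', ones_complement_sum(pkt[:ihl]))
    # recompute the TCP checksum over pseudo-header + segment
    seg = pkt[ihl:]
    seg[16:18] = b'\x00\x00'
    pseudo = pkt[12:20] + b'\x00\x06' + struct.pack('!H', len(seg))
    seg[16:18] = struct.pack('!H', ones_complement_sum(bytes(pseudo + seg)))
    pkt[ihl:] = seg
    return bytes(pkt)
```

The corrupted packet now validates at the IP and TCP layers, so the corruption gets delivered to whatever sits above them.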
In short, I don't think I have ever seen a case where a random input tester didn't reveal a bug of some sort, unless the system had already been subjected to such a tester previously.
Update: Here is a good description of William McKeeman's work using random testing on the DEC C compiler. I think that Paul's blog referenced this but it dropped out when I did the cut/paste and I think Paul's link was also stale.
Random testing was used on the Pentium, but the incompleteness of testing (even random testing) allowed the bug to remain undetected.
A few weeks ago there was a bit of surprised activity around probably-exploitable crashes in popular web browsers found with random testing (the author calls his technique "razor sharp html fragments"):
http://www.securityfocus.com/archive/1/378632

The key is generating content that is random at the proper level. You touch on this issue when you mention getting the TCP and IP checksums correct; the web browser testing illustrates another instance of the problem. I've tried replicating the crashes shown in the Bugtraq post with a CGI script that just throws up random binary garbage, but that doesn't cause the crashes that appear almost immediately with the razor-sharp html test. Even random ASCII garbage doesn't trigger crashes. (When I have time, I guess the next step is some sort of non-deterministic FSM that's been trained on a large collection of real-world html files.)
I guess what I'm saying is that random testing is hard. Maybe not as hard as creating the product under test, but not necessarily easy either. You have to know how to get down to the appropriate level of encoding before injecting your randomness, which when testing a completely finished product can be difficult.
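For what it's worth, the "randomness at the proper level" intuition above can be sketched as a mutation-based generator (hypothetical code): instead of emitting raw garbage, start from a real document and apply a handful of structure-aware random edits.

```python
import random
import re

def mutate_html(doc, edits=5):
    """Mutation-based sketch: start from a real HTML document and apply
    a few structure-aware random edits, so the result is still
    html-shaped enough to get past the tokenizer and into the
    parser's more interesting code paths."""
    tokens = re.split(r'(<[^>]*>)', doc)      # crude split into tags/text
    for _ in range(edits):
        i = random.randrange(len(tokens))
        roll = random.random()
        if roll < 0.4 and tokens[i].startswith('<'):
            tokens[i] *= random.randint(2, 50)            # repeat a tag
        elif roll < 0.7:
            cut = random.randint(0, len(tokens[i]))
            tokens[i] = tokens[i][:cut]                   # truncate a token
        else:
            j = random.randrange(len(tokens))
            tokens[i], tokens[j] = tokens[j], tokens[i]   # swap two tokens
    return ''.join(tokens)
```

Pure random bytes rarely survive the tokenizer; edits like these keep enough structure that deeper layers of the parser actually see the malformed input.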
(The date this article was posted suggests little chance of this comment being seen, but) a quick search for the McKeeman paper on the DEC C compiler testing showed a bunch of 404 links; a hint on where it could be found would be appreciated.
Bogdan, there may not be a good source. I'm guessing that the originals got crunched as the result of the merger of Compaq with HP. Other than Googling for it, I have no idea where it might be found. If anybody is able to locate a reference, please post a comment.
Differential Testing for Software
This sort of testing is not a solution. The way it works is that if no bugs are found in the time allotted to bug finding, you can feel 'good' about it. It has nothing to do with actually finding all the bugs.
The only way I have heard of doing this in a manner that approaches 'right'-ness is to formally prove your program correct.
See "A Mechanically Checked Proof of the Correctness of the Kernel of the AMD5K86 Floating-Point Division Algorithm" http://citeseer.ist.psu.edu/53148.html
I'd be more willing to bet that FP division in AMD's K5 is better than the equivalent function in PA-RISC's or Intel's chips of the day.
Sure, provable algorithms are certainly preferable. But how many pieces of code do you know that have been proved correct? How many do you know that even can be?
Until provability makes huge leaps forward in usability, we're left with either human testing, where a test engineer tries to develop "interesting" inputs in an attempt to break the code, or techniques like random testing. I submit that no technique is foolproof and that both are required. The machine can sometimes be much more clever at finding the corner cases than a human can, but a human can also help the machine work past various limitations and can utilize the randomness in ways that the machine can't come up with itself.