Notes from the final discussion
Workshop: Trusted Extensions of Interactive Theorem Provers
Cambridge, UK, August 11-12, 2010

Konrad Slind asked: What stances do different systems take on evaluation? His current sense is that Coq's approach is similar to ACL2 metafunctions.

J Moore suggested that the system developers each contribute a paragraph describing their approach to trust. The sense was that this was a good idea and would be done.

Konrad reminded us that Larry Paulson had earlier mentioned three trust levels:
-- experienced user
-- newbie
-- malevolent
But no real further discussion of this took place.

J Moore wondered what enforces trust: what social pressures force us to give importance to trust? Well, after all, our goal in life is to mechanize the establishment of truths. More directly:
-- evaluators
-- our reputations

What is the role of open source? J thinks that releasing code is an important way to help guarantee the soundness of a system, by making inspection possible. He reminded us that he had mentioned the possibility of a warranty in one of the talks; kind of a scary thought -- but a warranty on the theorem, not on all aspects of the artifact. Georges Gonthier: But what would be given in return for a warranty? Maybe a service could be provided. J imagines there will come a day when organizations buy software that has been proved to have certain properties, and a warranty could support a higher price.

Shankar pointed out that lots can go wrong. The formalization might only approximate the artifact; he thinks maybe people shouldn't expect any sort of warranty. J replied that an advantage of thinking about a warranty, as a thought experiment, is that it can help us keep in mind that we are guaranteeing theorem-hood. Shankar expressed the concern that a given framework might not even be consistent.

Rob Arthan thinks of the warranty (and he likes J's idea) as a bet. People think about safety-critical systems this way -- what are the failure rates? Shankar said that Rushby et al. think about probabilities of perfection, which can be coupled with reliably engineered artifacts. Burkhart Wolff observed that systems that strive for perfection live longer. Shankar disagreed.

Mike Gordon has heard that Praxis offers warranties. Ian O'Neill replied that yes, with free bug fixes then provided for a period of time. He noted that the economic pressure is more to exhibit bugs than to say "verified". Mike asked about normal industry practice regarding warranties on critical system software; this seems rather rare. Ian pointed out that a warranty would only cover what has been specified.

Konrad asked if we should include in the summary the question of the importance of soundness. J Moore pointed out that asking whether his own system is sound helps keep his head straight -- thinking about that sort of thing helps keep us honest. Shankar mentioned that he has "layers" of objections to these sorts of thoughts: getting "QED" is a small part of what we do; it is also (more?) important for a system to help find bugs; and the proof of the four-color theorem contributed new mathematics.

J pointed out that we have found many times, in interfacing with hardware design people, that the presence of a formal-methods person -- e.g., Bishop Brock in some past work with Motorola -- helped a lot in clarifying the goals of the design, long before any proofs were done. Bishop would raise questions with the designers, just from writing specs in a careful way while making sense of what was involved. Shankar agreed that this is important.

"Trusted" vs. "trustworthy": we could use the latter, as the former often suggests "must be trusted". Larry confirmed this ambiguity. J pointed out that in our summary, we should be clear about these two different senses.
"trustworthy" -- we could use the latter, as the former often suggests "must be trusted". Larry confirmed this ambiguity. J pointed out that in our summary, we should be clear about these two different senses. Roger Jones mentioned the question of degrees of trust: might there be a systematic way of assigning degrees of trust. J responded with doubt about the feasibility of a numerical measure; Roger observed however that he had in mind some other (non-numeric) partial ordering. Konrad said that he thinks trust is binary. Roger pointed out that Konrad's tagging [as described in Konrad's talk] could be a way to assign degrees of trust, but all seemed to agree that this isn't numerically assignable. J said that he thinks of tags as names of people (which corresponds to his own partial order). David Hardin said that evaluators feel better about a trust story based on: -- Familiarity -- History -- Extensive documentation -- Frequent releases (indicates discipline? -- Mike reminded us that the argument was that then an official version could be used, not a patched version -- Rob pointed out that version control etc. is addressed in standards) Konrad pointed out that the social process has been moved... [ed.: Konrad, please finish this thought -- somehow I missed it]. Laurent Thery wondered if there might be more trust in a theorem if two systems were used independently for a proof. Burkhart suggested that increased trust would depend on independence of the efforts, e.g., syntactically different representations of proofs. Laurent wonders though if one can't simply get more trust just by having two systems used. There seemed to be a lot of controversy on this question. J thought that yes, independent efforts will increase confidence, in particular for algorithms. Konrad interpreted Laurent's question as restricted to proving the same theorem in different systems, but J says it's rare to be able to port a specification without change from one system to another. J and Matt Kaufmann both ported Nqthm proofs to ACL2 and this wasn't trivial, even though the logics were close. Shankar said that however, the program [being verified] needn't change from system to system. Roger pointed out that tags (in the sense of Konrad's proof) produce a lattice, but said that you don't know how much less trust you have just because you are lower in the lattice. He observed that having two proofs is dual to using two oracles in a single proof, and that in the natural ordering of sets of tags for oracles by inclusion (which probably corresponds to what is being done by isabelle and HOL4), the inequalities are non-strict. If the degrees are augmented to take account of multiple proofs then having two proofs would also be in general better than or equal (in point of trust) to having either one (better if the two originals are incomparable). Laurent pointed out that it's rare to have multiple efforts. Konrad and Tjark Weber pointed out that it's rarely worth the trouble, but all seemed to agree that you do get some more confidence, the question being whether it's worth the expense. Shankar pointed out that this kind of thing has been done for SMT and has found hundreds of bugs, often with short examples. Konrad mentioned Larry Paulson's and Grant Passmore's systems as being evidently useful, but questioned whether there is going to be more of a trust story. Mike said that Leonardo de Moura gave a talk that producing proofs for SMT was a distraction, the idea being that Z3 is useful for finding bugs. 
Laurent pointed out that it's rare to have multiple efforts. Konrad and Tjark Weber pointed out that it's rarely worth the trouble, but all seemed to agree that you do get some more confidence; the question is whether it's worth the expense. Shankar pointed out that this kind of thing has been done for SMT and has found hundreds of bugs, often with short examples. Konrad mentioned Larry Paulson's and Grant Passmore's systems as being evidently useful, but questioned whether there is going to be more of a trust story. Mike said that Leonardo de Moura gave a talk arguing that producing proofs for SMT was a distraction, the idea being that Z3 is useful for finding bugs. Mike and Shankar agreed that the issue is that Leonardo changes Z3 rapidly, and the need to produce proofs makes him less nimble. Shankar wants to support their being on the cutting edge.

J said that his motivation with Bob Boyer for their early work was to do program verification, with the idea that if eventually there were enough reason, they might work on producing certificates. But for now (and for 40 years) he wants to be nimble, even if the system is occasionally unsound (though that feels bad).

Rob asked about the relevance of ACL2 "proof" transcripts. J replied that these were initially just a sort of trace for understanding what the system was doing; then they became a pedagogical tool to help new users see what the system is doing. Now there are ways to turn that output off, printing only the checkpoints that are useful for proof debugging. Matt and J added that they want to discourage users from trying to understand what the prover is doing. Rob summarized that those transcripts aren't relevant for trust.

Larry suggested that sometimes it's important to look at models, in order to validate the reasonableness of the spec. J mentioned that when AMD hired him and Matt, it took nine weeks for them to satisfy themselves that the IEEE spec was satisfied. Then it took two months for AMD to agree that what had been proved was the theorem they needed, because J had to convince them that the ACL2 models of the microcode and of the IEEE standard were correct. And then for two years AMD wouldn't allow talking about this, since they thought it would be considered a sign of weakness -- but J convinced AMD that it would instead be a sign of strength. Then it took many rounds with reviewers about whether the theorem was right -- so J and Matt proved more theorems. J was tempted to stop, so Tom Lynch (the other author) stepped in.... Now, floating-point proofs are always done at AMD.

Shankar told two related stories. During his thesis work, he noticed that proofs were going through too smoothly, e.g., without induction. He showed this to J, and they found that one definition accidentally always returned false. The proof log showed this, but Shankar hadn't noticed it. J said it's good to show that a hypothesis is satisfiable. The other story: when integrating BDD-based model checking into PVS, they couldn't get things to work right while fighting a paper deadline, so they thought there was a problem with the model checker -- and indeed there was. The model-checker developer fixed it right away.

Matt pointed out that he's unaware of ACL2 users who do guard verification for theorems, even though that would be a way to provide added confidence in specs. Konrad thinks Dave Greve (at Rockwell) might do that. Shankar pointed out that in the case of PVS, TCCs are necessary in order to conclude theorem-hood. [That is not the case for ACL2, by the way.] Rob mentioned that Z/EVES has something like guard verification [TCCs]; Z users really seem to like them, as a way of seeing progress.
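[ed.: An illustrative example, not from the discussion, for readers unfamiliar with these mechanisms. In PVS, defining recip(x: real): real = 1/x generates the type-correctness condition (TCC) x /= 0, because the second argument of division must be a nonzero real; theorems in a theory are not regarded as established until all of its TCCs are discharged (here the TCC is unprovable, and the definition would have to be repaired, e.g. by typing x as nzreal). ACL2 guards play an analogous role in restricting functions to their intended domains, but verifying them is optional -- hence Matt's point that doing so anyway would add confidence in specs.]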
J observed, regarding trusting specs, that evaluation is important. At AMD there was a mechanical translator from RTL to ACL2, and after doing a proof about a translated design, they wanted to know whether the semantics of RTL (which was unspecified) were captured by the translation. The managers of the AMD group asked for something like 100M floating-point test vectors to be cosimulated, RTL simulator vs. ACL2. The cosimulation showed no discrepancy, which reassured the managers that the formalization had some meaning. J said that he thinks this sort of thing is done at Rockwell, too. Konrad pointed out that Anthony Fox has done something like this: Anthony said that he runs designs on a development board, using random testing, and compares the results with runs of the HOL spec.

Matt asked whether the sentiment here is that the bigger trust bottleneck pertains to specs more than to tools. Burkhart thinks that may be the case: validation of specs is important. Laurent pointed out that proving simple facts about specs is good for validation, not merely using evaluation. Georges suggested that a good verification technique is to approach several independent aspects, rather than going for a single big theorem.

Grant pointed out that ITP systems have definitional principles, but SMT solvers encourage arbitrary axioms, which might not be conservative or even consistent. Tjark pointed out that this issue can come up with ITPs too. Burkhart pointed out that even definitional extensions can be wrong. He added that we need more test-data-generation methods and model-based testing techniques for validating the models we are using, by testing them against real systems or their environments.
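[ed.: An illustrative example of Grant's point, not from the discussion. A definitional principle admits only extensions such as f(x) := e, where e is a previously defined term; such extensions are conservative, so they cannot introduce inconsistency. A free axiom can: asserting, say, forall x: int. f(x) = f(x) + 1 gives an axiom with no model (it implies 0 = 1), after which every conjecture is "provable" -- vacuously. Burkhart's further point is that even a legitimate definitional extension can be wrong in the weaker sense of defining something other than what was intended, which is why he called for testing models against real systems.]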
Shankar pointed out that we should look for applications of ITP to proving small properties of systems, e.g., by embedding in a compiler or by generating test cases and proving them. Z3 is used largely for bug-finding at Microsoft, in a dozen projects; sometimes we might want ITPs to disprove theorems, and in such cases we don't even care about trust. Grant pointed out that when an SMT solver finds a counter-model, that's really a proof too.

Larry has started saying that we shouldn't use the term "interactive theorem proving" when the main value of these systems is in creating complex models (as of the ARM chip), even when there aren't proofs about those models; "interactive model development environments" would be a good term. (Shankar reminded us of Peter Sewell's workshop talk a year ago.)

Regarding confidence in specs, J pointed to the IEEE floating-point standard as a critical development of the last 20 years. William Kahan told J and Bob Boyer that floating point would be one of the last areas of application for formal verification; but after Kahan did the IEEE FP spec and Intel then had the FDIV bug, we now have (Shankar said) a killer app for ITP. So, good specs are important. Grant asked for a killer app that needs a spec. J answered: "security". Shankar mentioned the Separation Kernel Protection Profile (which was mentioned in David Hardin's talk). Larry gave an example of an attack using timestamps, which wasn't clearly indicated as violating the spec that a timestamp field should contain a timestamp. One problem is that protocols have a lot of information (like hotel naming conventions) mixed in with the important stuff. He'd thus prefer layered specs, with the most abstract and critical material at the top.

Tjark would like to see standard proof formats, as exist for SMT, SAT, QBF, and first-order provers. Georges said that there's no point standardizing on proofs if we don't translate theorems correctly, respecting differences in logics. Larry said that there is a THF format ("typed higher-order formulas", for TPTP) for higher-order logic; a standard proof format could help in the development of checkers. But J pointed out that ITPs have different underlying mathematical logics, especially ACL2 vs. the HOL provers. Georges said that even Isabelle with type classes is distinct from HOL without type classes.

J pointed out that the ACL2/HOL connection reflects our expectation that ultimately we will use more than one system in a proof effort, and hence that we need ways to communicate theorems between systems. So J asked: How do we build trust when different systems are used for different parts of a proof? Shankar suggested that his Toolbus may provide an answer. Its design no longer anticipates a common proof format -- "let a thousand flowers bloom", and then a few may be supported.

Grant asked a "selfish question pertaining to his talk": RAHD has many different decision methods, including CAD (cylindrical algebraic decomposition), which is likely a long way from being formally verified. One route is to trust the hand proof of the CAD procedure and then verify the intermediate algebraic computations. One way to gain trust in those computations is to run them in many different computer algebra systems -- a form of social process. Grant said that Shankar doesn't think this much increases trust, and wondered how the rest of us feel about that. Mike suggested that it must increase trust at least a bit, since we could presumably find a bug this way. But Rob said that it could make things worse, by checking against a buggy computer algebra system that complains incorrectly. Mike has heard that physicists run computations on several computer algebra systems, say to simplify some huge polynomial. Laurent has heard of a Russian computer algebra system that people trust, and hence use, even though it is slow. Shankar mentioned a project of starting to specify a computer algebra system, at least proving some TCCs. He mentioned J's point that just doing a formalization can help.

Summary: J enjoyed conversing with others in ITP. Mike suggested that any worry about trust isn't with established tools, but with extensions. Shankar is concerned about the future because we are not passing on skills to future generations. J asked about the trend regarding students entering with a passion for formal methods / theorem proving -- there have been dramatically fewer in the last five years at the University of Texas at Austin. Konrad observed that this was perceived to be the case at Utah too, though he didn't himself notice much of a drop-off. J is worried about the manpower issue: we can find ample contracts to verify specific things, but there are few people to actually do the work. He does a certain amount of such work himself, but could use a lot more help to deal with the opportunities constantly being thrown at him. Mike has had a similar experience -- which could be for financial reasons in part. J wonders whether the golden age of theorem proving was the late 90s because industry then perceived that it had money and so was willing to hire people to do somewhat risky things -- no longer as true; he thinks the demand isn't there now like it was then. Shankar said that by "golden age" he meant golden in the technical sense (e.g., SAT and SMT). J senses a danger related to trust: now that our tools are being used more seriously in industry, what will happen when something is built and then recalled because we proved something that's not true? Shankar said that there is speculation that the FDIV bug was due to a mistake in a hand proof. Konrad has observed that at Rockwell, ACL2 theorem-proving work might take a while, but Rockwell will train people up to do it as necessary, including engineers; this was observed as far back as the 80s. So J said he was comforted: maybe students aren't going into formal methods, but the skills can be taught to practicing engineers.
But Shankar expressed worry not so much about the training, but rather about whether there are people being trained to develop theorem provers. Daryl Stewart suggested that industry is encouraged by the development of tools within industry (e.g., Jasper).