Notes from the final discussion
Workshop: Trusted Extensions of Interactive Theorem Provers
Cambridge, UK, August 11-12, 2010

Konrad Slind asked: What stances do different systems take on evaluation? His current sense is that Coq's approach is similar to ACL2 metafunctions.

J Moore suggested that the system developers each contribute a paragraph describing their approach to trust. The sense was that this was a good idea and would be done.

Konrad reminded us that Larry Paulson had earlier mentioned three trust levels:
-- experienced user
-- newbie
-- malevolent
But no real further discussion of this took place.

J Moore wondered what enforces trust: what social pressures force us to give importance to trust? Well, after all, our goal in life is to mechanize the establishment of truths. More directly:
-- evaluators
-- our reputations

What is the role of open source? J thinks that releasing code is an important way to help guarantee the soundness of a system, by making inspection possible. He reminded us that he had mentioned the possibility of a warranty in one of the talks; kind of a scary thought -- but a warranty on the theorem, not on all aspects of the artifact. Georges Gonthier: But what would be given in return for a warranty? Maybe a service could be provided. J imagines there will come a day when organizations buy software that has been proved to have certain properties, and a warranty could support a higher price.

Shankar pointed out that lots can go wrong. The formalization might only approximate the artifact; he thinks maybe people shouldn't expect any sort of warranty. J replied that an advantage of thinking about a warranty, as a thought experiment, is that it can help us keep in mind that we are guaranteeing theorem-hood. Shankar expressed the concern that a given framework might not even be consistent.

Rob Arthan thinks of the warranty (and he likes J's idea) as a bet. People think about safety-critical systems this way -- what are the failure rates? Shankar said that Rushby et al. think about probabilities of perfection, which can be coupled with reliably engineered artifacts. Burkhart Wolff observed that systems that strive for perfection live longer. Shankar disagreed.

Mike Gordon has heard that Praxis offers warranties. Ian O'Neill replied that yes, with free bug fixes then provided for a period of time. He noted that the economic pressure is more to exhibit bugs than to say "verified". Mike asked about normal industry practice regarding warranties on critical system software; this seems rather rare. Ian pointed out that a warranty would only cover what has been specified.

Konrad asked if we should include in the summary the question of the importance of soundness. J Moore pointed out that asking whether his own system is sound helps keep his head straight -- thinking about that sort of thing helps keep us honest. Shankar mentioned that he has "layers" of objections to these sorts of thoughts: getting "QED" is a small part of what we do; it is also (more?) important for a system to help find bugs; and the proof of the four-color theorem contributed new mathematics.

J pointed out that we have found many times, in interfacing with hardware design people, that the presence of a formal-methods person -- e.g., Bishop Brock in some past work with Motorola -- helped a lot in clarifying the goals of the design, long before any proofs were done. Bishop would raise questions with the designers, just from writing specs in a careful way while making sense of what was involved. Shankar agreed that this is important.

"Trusted" vs. "trustworthy": we could use the latter, as the former often suggests "must be trusted". Larry confirmed this ambiguity. J pointed out that in our summary, we should be clear about these two different senses.
"trustworthy" -- we could use the latter, as the former often suggests "must be trusted". Larry confirmed this ambiguity. J pointed out that in our summary, we should be clear about these two different senses. Roger Jones mentioned the question of degrees of trust: might there be a systematic way of assigning degrees of trust. J responded with doubt about the feasibility of a numerical measure; Roger observed however that he had in mind some other (non-numeric) partial ordering. Konrad said that he thinks trust is binary. Roger pointed out that Konrad's tagging [as described in Konrad's talk] could be a way to assign degrees of trust, but all seemed to agree that this isn't numerically assignable. J said that he thinks of tags as names of people (which corresponds to his own partial order). David Hardin said that evaluators feel better about a trust story based on: -- Familiarity -- History -- Extensive documentation -- Frequent releases (indicates discipline? -- Mike reminded us that the argument was that then an official version could be used, not a patched version -- Rob pointed out that version control etc. is addressed in standards) Konrad pointed out that the social process has been moved... [ed.: Konrad, please finish this thought -- somehow I missed it]. Laurent Thery wondered if there might be more trust in a theorem if two systems were used independently for a proof. Burkhart suggested that increased trust would depend on independence of the efforts, e.g., syntactically different representations of proofs. Laurent wonders though if one can't simply get more trust just by having two systems used. There seemed to be a lot of controversy on this question. J thought that yes, independent efforts will increase confidence, in particular for algorithms. Konrad interpreted Laurent's question as restricted to proving the same theorem in different systems, but J says it's rare to be able to port a specification without change from one system to another. J and Matt Kaufmann both ported Nqthm proofs to ACL2 and this wasn't trivial, even though the logics were close. Shankar said that however, the program [being verified] needn't change from system to system. Roger pointed out that tags (in the sense of Konrad's proof) produce a lattice, but said that you don't know how much less trust you have just because you are lower in the lattice. He observed that having two proofs is dual to using two oracles in a single proof, and that in the natural ordering of sets of tags for oracles by inclusion (which probably corresponds to what is being done by isabelle and HOL4), the inequalities are non-strict. If the degrees are augmented to take account of multiple proofs then having two proofs would also be in general better than or equal (in point of trust) to having either one (better if the two originals are incomparable). Laurent pointed out that it's rare to have multiple efforts. Konrad and Tjark Weber pointed out that it's rarely worth the trouble, but all seemed to agree that you do get some more confidence, the question being whether it's worth the expense. Shankar pointed out that this kind of thing has been done for SMT and has found hundreds of bugs, often with short examples. Konrad mentioned Larry Paulson's and Grant Passmore's systems as being evidently useful, but questioned whether there is going to be more of a trust story. Mike said that Leonardo de Moura gave a talk that producing proofs for SMT was a distraction, the idea being that Z3 is useful for finding bugs. 
Laurent pointed out that it's rare to have multiple efforts. Konrad and Tjark Weber pointed out that it's rarely worth the trouble, but all seemed to agree that you do get some more confidence; the question is whether it's worth the expense. Shankar pointed out that this kind of thing has been done for SMT and has found hundreds of bugs, often with short examples. Konrad mentioned Larry Paulson's and Grant Passmore's systems as being evidently useful, but questioned whether there is going to be more of a trust story. Mike said that Leonardo de Moura gave a talk arguing that producing proofs for SMT was a distraction, the idea being that Z3 is useful for finding bugs. Mike and Shankar agreed that the issue is that Leonardo changes Z3 rapidly, and the need to produce proofs makes him less nimble. Shankar wants to support their being on the cutting edge.

J said that his motivation with Bob Boyer for their early work was to do program verification, with the idea that if eventually there were enough reason, they might work on producing certificates. But for now (and for 40 years) he wants to be nimble, even if the system is occasionally unsound (though that feels bad).

Rob asked about the relevance of ACL2 "proof" transcripts. J replied that these were initially just a sort of trace for understanding what the system was doing; then they became a pedagogical tool to help new users see what the system is doing. Now there are ways to turn that output off, printing only the checkpoints that are useful for proof debugging. Matt and J added that they want to discourage users from trying to understand what the prover is doing. Rob summarized that those transcripts aren't relevant for trust.

Larry suggested that sometimes it's important to look at models, in order to validate the reasonableness of the spec. J mentioned that when AMD hired him and Matt, it took nine weeks for them to satisfy themselves that the IEEE spec was satisfied. Then it took two months for AMD to agree that what had been proved was the theorem they needed, because J had to convince them that the ACL2 models of the microcode and of the IEEE standard were correct. And then for two years AMD wouldn't allow talking about this, since they thought it would be considered a sign of weakness -- but J convinced AMD that it would instead be a sign of strength. Then it took many rounds with reviewers about whether the theorem was right -- so J and Matt proved more theorems. J was tempted to stop, so Tom Lynch (the other author) stepped in.... Now, floating-point proofs are always done at AMD.

Shankar told two related stories. During his thesis work, he noticed that proofs were going through too smoothly, e.g., without induction. He showed this to J, and they found that one definition accidentally always returned false. The proof log showed this, but Shankar hadn't noticed it. J said it's good to show that a hypothesis is satisfiable. The other story: when integrating BDD-based model checking into PVS, they couldn't get things to work right while fighting a paper deadline, so they thought there was a problem with the model checker -- and indeed there was. The model-checker developer fixed it right away.

Matt pointed out that he's unaware of ACL2 users who do guard verification for theorems, even though that would be a way to provide added confidence in specs. Konrad thinks Dave Greve (at Rockwell) might do that. Shankar pointed out that in the case of PVS, TCCs are necessary in order to conclude theorem-hood. [That is not the case for ACL2, by the way.] Rob mentioned that Z/EVES has something like guard verification [TCCs]; Z users really seem to like them, as a way of seeing progress.
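[ed.: An illustrative example, not from the discussion, for readers unfamiliar with these mechanisms. In PVS, defining recip(x: real): real = 1/x generates the type-correctness condition (TCC) x /= 0, because the second argument of division must be a nonzero real; theorems in a theory are not regarded as established until all of its TCCs are discharged (here the TCC is unprovable, and the definition would have to be repaired, e.g. by typing x as nzreal). ACL2 guards play an analogous role in restricting functions to their intended domains, but verifying them is optional -- hence Matt's point that doing so anyway would add confidence in specs.]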
J observed, regarding trusting specs, that evaluation is important. At AMD there was a mechanical translator from RTL to ACL2, and after doing a proof about a translated design, they wanted to know whether the semantics of RTL (which was unspecified) were captured by the translation. The managers of the AMD group asked for something like 100M floating-point test vectors to be cosimulated, RTL simulator vs. ACL2. The cosimulation showed no discrepancy, which reassured the managers that the formalization had some meaning. J said that he thinks this sort of thing is done at Rockwell, too. Konrad pointed out that Anthony Fox has done something like this: Anthony said that he runs designs on a development board, using random testing, and compares the results with runs of the HOL spec.

Matt asked whether the sentiment here is that the bigger trust bottleneck pertains to specs more than to tools. Burkhart thinks that may be the case: validation of specs is important. Laurent pointed out that proving simple facts about specs is good for validation, not merely using evaluation. Georges suggested that a good verification technique is to approach several independent aspects, rather than going for a single big theorem.

Grant pointed out that ITP systems have definitional principles, but SMT solvers encourage arbitrary axioms, which might not be conservative or even consistent. Tjark pointed out that this issue can come up with ITPs too. Burkhart pointed out that even definitional extensions can be wrong. He added that we need more test-data-generation methods and model-based testing techniques for validating the models we are using, by testing them against real systems or their environments.
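[ed.: An illustrative example of Grant's point, not from the discussion. A definitional principle admits only extensions such as f(x) := e, where e is a previously defined term; such extensions are conservative, so they cannot introduce inconsistency. A free axiom can: asserting, say, forall x: int. f(x) = f(x) + 1 gives an axiom with no model (it implies 0 = 1), after which every conjecture is "provable" -- vacuously. Burkhart's further point is that even a legitimate definitional extension can be wrong in the weaker sense of defining something other than what was intended, which is why he called for testing models against real systems.]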
Shankar pointed out that we should look for applications of ITP to proving small properties of systems, e.g., by embedding in a compiler or by generating test cases and proving them. Z3 is used largely for bug-finding at Microsoft, in a dozen projects; sometimes we might want ITPs to disprove theorems, and in such cases we don't even care about trust. Grant pointed out that when an SMT solver finds a counter-model, that's really a proof too.

Larry has started saying that we shouldn't use the term "interactive theorem proving" when the main value of these systems is in creating complex models (as of the ARM chip), even when there aren't proofs about those models; "interactive model development environments" would be a good term. (Shankar reminded us of Peter Sewell's workshop talk a year ago.)

Regarding confidence in specs, J pointed to the IEEE floating-point standard as a critical development of the last 20 years. William Kahan told J and Bob Boyer that floating point would be one of the last areas of application for formal verification; but after Kahan did the IEEE FP spec and Intel then had the FDIV bug, we now have (Shankar said) a killer app for ITP. So, good specs are important. Grant asked for a killer app that needs a spec. J answered: "security". Shankar mentioned the Separation Kernel Protection Profile (which was mentioned in David Hardin's talk). Larry gave an example of an attack using timestamps, which wasn't clearly indicated as violating the spec that a timestamp field should contain a timestamp. One problem is that protocols have a lot of information (like hotel naming conventions) mixed in with the important stuff. He'd thus prefer layered specs, with the most abstract and critical material at the top.

Tjark would like to see standard proof formats, as exist for SMT, SAT, QBF, and first-order provers. Georges said that there's no point standardizing on proofs if we don't translate theorems correctly, respecting differences in logics. Larry said that there is a THF format ("typed higher-order formulas", for TPTP) for higher-order logic; a standard proof format could help in the development of checkers. But J pointed out that ITPs have different underlying mathematical logics, especially ACL2 vs. the HOL provers. Georges said that even Isabelle with type classes is distinct from HOL without type classes.

J pointed out that the ACL2/HOL connection reflects our expectation that ultimately we will use more than one system in a proof effort, and hence that we need ways to communicate theorems between systems. So J asked: How do we build trust when different systems are used for different parts of a proof? Shankar suggested that his Toolbus may provide an answer. Its design no longer anticipates a common proof format -- "let a thousand flowers bloom", and then a few may be supported.

Grant asked a "selfish question pertaining to his talk": RAHD has many different decision methods, including CAD (cylindrical algebraic decomposition), which is likely a long way from being formally verified. One route is to trust the hand proof of the CAD procedure and then verify the intermediate algebraic computations. One way to gain trust in those computations is to run them in many different computer algebra systems -- a form of social process. Grant said that Shankar doesn't think this much increases trust, and wondered how the rest of us feel about that. Mike suggested that it must increase trust at least a bit, since we could presumably find a bug this way. But Rob said that it could make things worse, by checking against a buggy computer algebra system that complains incorrectly. Mike has heard that physicists run computations on several computer algebra systems, say to simplify some huge polynomial. Laurent has heard of a Russian computer algebra system that people trust, and hence use, even though it is slow. Shankar mentioned a project of starting to specify a computer algebra system, at least proving some TCCs. He mentioned J's point that just doing a formalization can help.

Summary: J enjoyed conversing with others in ITP. Mike suggested that any worry about trust isn't with established tools, but with extensions. Shankar is concerned about the future because we are not passing on skills to future generations. J asked about the trend regarding students entering with a passion for formal methods / theorem proving -- there have been dramatically fewer in the last five years at the University of Texas at Austin. Konrad observed that this was perceived to be the case at Utah too, though he didn't himself notice much of a drop-off. J is worried about the manpower issue: we can find ample contracts to verify specific things, but there are few people to actually do the work. He does a certain amount of such work himself, but could use a lot more help to deal with the opportunities constantly being thrown at him. Mike has had a similar experience -- which could be for financial reasons in part. J wonders whether the golden age of theorem proving was the late 90s because industry then perceived that it had money and so was willing to hire people to do somewhat risky things -- no longer as true; he thinks the demand isn't there now like it was then. Shankar said that by "golden age" he meant golden in the technical sense (e.g., SAT and SMT). J senses a danger related to trust: now that our tools are being used more seriously in industry, what will happen when something is built and then recalled because we proved something that's not true? Shankar said that there is speculation that the FDIV bug was due to a mistake in a hand proof. Konrad has observed that at Rockwell, ACL2 theorem-proving work might take a while, but Rockwell will train people up to do it as necessary, including engineers; this was observed as far back as the 80s. So J said he was comforted: maybe students aren't going into formal methods, but the skills can be taught to practicing engineers.
But Shankar expressed worry not so much about the training, but rather about whether there are people being trained to develop theorem provers. Daryl Stewart suggested that industry is encouraged by the development of tools within industry (e.g., Jasper).