Saturday, 2 July 2016

Some thoughts on the meaning of equality and cyclic proof - can CTT help were ITT hinders?

The CTT (Computational Type Theory) approach to type theory, as represented in NuPRL and described in [1][2][3], provides a contrasting methodology to what appears to be in vogue at the moment - an approach known as ITT (Intensional Type Theory) as represented in modern type-theory based proof assistants such as Coq and Agda.

There is a very good exposition by Jon Sterling which describes the more fundamental philosophical differences here [4]. It's somewhat difficult to convey the subtleties without such a complete exposition, but since its precisely this that I want to talk about, I'll try to recapitulate the essential differences.

When we talk about the elements of a type in ITT, we give the process of their construction in such a way that the elements which inhabit the type are made entirely explicit in the definition of the type.  So, for instance, in defining the data type for the natural numbers we give zero and succ explicitly as the elements of this type.  The type relation (n : ℕ) is then not an abstract description of what would fit into this type, but a description of the syntax which allows valid types to be formed.

data  : Set where
  zero : 
  succ :   

However, there is more to the story as well. As I mentioned in a previous blogpost here, the meaning of computation and equality in ITT is both intertwined and somewhat hidden from view. We say in ITT something along the lines of: "we mean for two things to be equal when they are the same thing, and the same thing is understood to mean identification of proof-trees under the reduction relation"

But why do we allow identity of proofs according to this reduction and not others and how do we know which and under what circumstances. And where exactly do we define the way that the natural numbers are treated in terms of their behaviour? It turns out that the elimination rules are derived from the above datatype automatically, and then wrapped up nicely for us behind the scenes.

The approach in CTT does these two things very differently. When defining the type ℕ, we instead describe what would constitute evidence that we were a natural number, and what would constitute evidence of equality between two natural numbers. Here, we have to explicitly describe the big-step evaluation judgement which acts on our elements so we know what constitutes a canonical element of the type.

There are a number of ramifications of this approach. One outcome which is of particularly interest to me (due to my current work in RDF, of which I'll describe more later) is that subtyping can be much more direct as data can be a canonical element of more than one type. In ITT this is impossible as the syntax itself describes what constitutes a type.

Circularity and Identity


Which leads me to something which has made me uncomfortable for quite a long time regarding what constitutes a proof. Syntax, in ITT is designed (especially after elaboration) to be the proof. There is virtually no difference between the proof-tree and a fully elaborated proof term and if there is, it's only up to some decidable procedure which fills in gaps which the designer of the syntax didn't think was worth bothering with as it could be elaborated just-in-time. Identification of proofs is done under reduction relation. Unlike in CTT, two proofs can have precisely the same abstract shape, and yet be utterly distinct due to naming (of for instance constructors) alone.

This might not seem like it constitutes much of a problem, but when it comes to cyclic proof, it turns out to be fairly irritating. The reason is that we have decided apriori that there is no possible way to identify two proofs as they will never have the same syntax because induction itself comes equipped with names. Since each proof step is in fact a syntactic step, terms in our term language acts as descriptions of how to build the rest of the proof tree. These names will get all messed up when we start manipulating cyclic proofs and our terms stop corresponding to the naming of the rest of the proof.

To see this we can look at the dpu (double-plus-unbad) function, its proof tree, and another potential cyclic proof which we will call dpg (double-plus-good) - which is semantically equivalent. By semantically equivalent we mean that they cannot be distinguished within the theory.

In order to define dpu and dpg we first need to write down a simple addition function. First we'll show these functions as they would be presented in Agda.
_+_ :     
zero + m = m
succ n + m = succ (n + m) 

double-plus-unbad :       
double-plus-unbad x y z = (x + y) + z

double-plus-good :       
double-plus-good zero y z = y + z
double-plus-good (succ x) y z = succ (double-plus-good x y z)
We are going to use an imaginary programming language to express our cyclic proofs whose syntax should hopefully be clear enough to the reader. In particular, we steal Agda's pattern lambda syntax to describe case distinction on inductively defined types (λ {zero → M ; succ n → P}).

                                                                  ─────────────────────────(*) 
                                                                  n : ℕ , m : ℕ ⊢ n + m : ℕ  
                                       ──────────────var      ───────────────────────────────succ
                                      m : ℕ ⊢ m : ℕ            n : ℕ , m : ℕ ⊢ succ (n + m) : ℕ  
─────────────var ─────────────var  ──────────────────────────────────────────────────────────case(†)
n : ℕ ⊢ n : ℕ     m : ℕ ⊢ m : ℕ        n : ℕ , m : ℕ ⊢ λ { zero → m ; succ n → succ (n + m)} n
───────────────────────────────────────────────────────────────────────────────────────────app
        n : ℕ , m : ℕ ⊢ (λ n m → λ { zero → m ; succ n → n + m} n) n m : ℕ 
  ──────────────────────────────────────────────────────────────────────────unfold(*) 
                             n : ℕ , m : ℕ ⊢ n + m : ℕ

We here demonstrate a cyclic proof presentation[5]. Under usual conditions we would have instead made appeal to some least-fixedpoint operator or have put _+_ in a function-variable-context or the like. Additionally, in order to show that this construction was a well defined (total) function we would have cause to make an appeal to some termination argument. Here we have structural recursion on our side. We can reuse this argument from the usual setting without alteration.

In addition, the circularity is intentionally made more explicit in this first example by virtue of an over-clever naming strategy and the shadowing behaviour of lexical binding, but in real practice of cyclic proof (and in the examples below) there would need to be a substitution applied to obtain the original expression. We will leave this substitution implicit as its not important here (but it would be very important in a more formal description).

You also may have noticed that I leave out a proof rule for the scrutinee of the "case"-rule in these proofs as it should be apparent from context (elaboration here should be performed by the reader).

But why should we want to display this proof as a cyclic proof instead of the more usual least fixed-point approach which is less obscure? Hopefully this will become more apparent with the inclusion of our dpu / dpg functions. We give the dpu function's cyclic proof here:

────────────────────────(*) ─────────────var  ────────────────────────────────────────────────────(†)
x : ℕ , y : ℕ ⊢ x + y : ℕ     z : ℕ ⊢ z : ℕ     n : ℕ , m : ℕ ⊢ λ { zero → m ; succ n → succ (n + m)} n
───────────────────────────────────────────────────────────────────────────────────────────app
     x : ℕ , y : ℕ , z : ℕ ⊢ (λ n m → λ { zero → m ; succ n → succ (n + m)} n)) (x + y) z : ℕ 
     ──────────────────────────────────────────────────────────────────────────────────unfold
                               x : ℕ , y : ℕ , z : ℕ ⊢ (x + y) + z : ℕ
                           ──────────────────────────────────────unfold
                             x : ℕ , y : ℕ , z : ℕ ⊢ dpu x y z : ℕ 

In this proof we make reference to two different entry points into the proof of plus. These correspond with the two different uses of plus which are syntatically apparent in the term, so it is not surprising that there should be two different cyclic references. However, in dpg, we only refer to _+_ once.
                                                      ──────────────────────────────────(§)                                                    
                                                     x : ℕ , y : ℕ , z : ℕ ⊢ dpg x y z : ℕ 
                    ───────────────────────(*) ──────────────────────────────────────────────────succ
                    y : ℕ , z ∶ ℕ ⊢ y + z : ℕ   x : ℕ , y : ℕ , z : ℕ ⊢ succ x → succ (dpg x y z) : ℕ 
                     ────────────────────────────────────────────────────────────────────────case                               
───────────────────var   x : ℕ , y : ℕ , z : ℕ ⊢ λ { zero → y + z                                        
x,y,z : ℕ ⊢ x,y,z : ℕ                                 ; succ x → succ (dpg x y z)} x : ℕ 
────────────────────────────────────────────────────────────────────────────────────app,lam
  x : ℕ , y : ℕ , z : ℕ ⊢ (λ x y z → λ { zero → y + z
                                         ; succ x → dpg x y z} x) x y z : ℕ
   ──────────────────────────────────────────────────────────────────────unfold (§)
                           x : ℕ , y : ℕ , z : ℕ ⊢ dpg x y z : ℕ

This dpg, I claim, is semantically equivalent to dpu. That is, it always gives the same result, but does so in one pass instead of two. However, not only is it semantically equivalent, but it is derivable according to a number of manipulations of a cyclic proof tree. The term dpg is the term which would be arrived at by carrying out a process of super-compilation [6].

To see this, let's watch how dpu would "unfold". In order to do this properly we are going to need an extended reduction relation. Let us extend reduction with "case-distribution" which is given in the following rule.
(λ { zero → M ; succ x → N }) ((λ {zero → P ; succ x' → Q}) R)
↝
((λ {zero → (λ { zero → M ; succ x → N }) P ; succ x' → (λ { zero → M ; succ x → N }) Q}) R)

This yields semantically equivalent terms for any closed terms, even in the case of non-termination.
 
                                                    ─────────────────────────────────────(Ϡ) 
─────────────────────────────────────(†)            n : ℕ , y : ℕ , z : ℕ ⊢ (n + y) + z : ℕ 
n : ℕ , m : ℕ ⊢ λ { zero → m                      ───────────────────────────────────────────succ
                     ; succ n → succ (n + m)} n    n : ℕ , y : ℕ , z : ℕ ⊢ succ ((n + y) + z) : ℕ
───────────────────────────────────────────────────────────────────────────────case
x : ℕ , y : ℕ , z : ℕ ⊢ λ { zero → (λ { zero → z
                                        ; succ n → succ (n + z)} y)
                            ; succ n → succ ((n + y) + z)} x : ℕ 
───────────────────────────────────────────────────────────────────────────────↝
x : ℕ , y : ℕ , z : ℕ ⊢ λ { zero → (λ { zero → z
                                        ; succ n → succ (n + z)} y)
                            ; succ n → (λ { zero → z
                                          ; succ n → succ (n + z)} (succ (n + y)))} x : ℕ 
──────────────────────────────────────────────────────────────────────────────↝
x : ℕ , y : ℕ , z : ℕ ⊢ λ { zero → z
                            ; succ n → succ (n + z)} (λ { zero → y
                                                         ; succ n → succ (n + y)} x) : ℕ 
───────────────────────────────────────────────────────────────────────────────↝
                x : ℕ , y : ℕ , z : ℕ ⊢ λ { zero → z
                                            ; succ n → succ (n + z)} (x + y)) : ℕ 
     ──────────────────────────────────────────────────────────────────────────↝
    x : ℕ , y : ℕ , z : ℕ ⊢ (λ n m → λ { zero → m ; succ n → n + m} n) (x + y) z : ℕ 
  ────────────────────────────────────────────────────────────────────────────unfold(Ϡ) 
                           x : ℕ , y : ℕ , z : ℕ ⊢ (x + y) + z : ℕ
                           ──────────────────────────────────────unfold
                             x : ℕ , y : ℕ , z : ℕ ⊢ dpu x y z : ℕ
Ok, so this isn't precisely the same, we return to (†) instead of (*) and there is an intermediate lambda application which is reduced (and could have been reduced in the original cyclic proof). However, modulo these few unobservable actions, they are the same, and they can be made precisely the same computationally, if we squint our eyes and ignore the inconsequential (↝)-rules.

The reductions that take place in our proof (the (↝)-rules) are "unobservable". This is because finite computations which take place without emitting constructors are not observable in type theory. What we have done is introduced a more flexible reduction relation, which always has the same semantics on closed terms, and then utilised this to fold the proof in a new way. Our folds are performed with simple substitutions of variables. It turns out that if you fold the proof in a new way using variable substition in such a way that the final term terminates, (or co-terminates)[7] you have absolute freedom in folding your proof and you can't go wrong. Here we've created a term which is structurally recursive and so we are safe from evil.

But now we have two strange things going on. We have the terms, which tell us what the rest of the proof tree is supposed to look like, meaning very little about what the rest of the proof tree is going to look like as they are supposed to do in ITT. The second thing is that we have cycles with totally different "names" at each step. That is, the terms at our cycles look pretty difficult to inter-relate and even when we have the same observable "edges" (such as succ) we have lots of intermediate cruft with apparently different terms. In fact, the only way to know whether one term name is like another is to see how it unfolds and refolds. But this is in general something which can't possibly have a decision procedure.

Should the fact that it doesn't have a general decision procedure stop us though? Perhaps there is a reasonable decision procedure that would recognise lots of "names" as being the same. Some process by which we could unfold a lot, but not too much that we don't terminate, and fold in a safe way. Since supercompilation terminates, this could give us our cycles, allow us to compare proofs and admit a larger class of equivalences. In fact, the interested reader might note that the above process will reduce x + (y + z) to essentially the same proof, giving us a proof of associativity for free.

But this leads us to further questions. When we started with the reduction relation in ITT we identified different proofs (such as (λ x. f x) y ≈ f y) but we consider this simple enough that we hope not to do anything dodgy. The rest of the proof we make really explicit with all of the steps written out for us. When we get to supercompilation we have something which is very complex indeed. How are we to trust the terms that come out the other side?

Well, it is possible to prove that case-↝ is a conservative extension of ↝. Similarly, it is conceivable that one could show that there is a semantic equivalence for folding, in the event that we can provide a termination argument.

In CTT we talk about canonical value elements by talking about their reduction behaviour under big-step evaluation. The above arguments feel morally similar and this syntactic cruft is noted in [7]. I'm not yet familiar enough with CTT to know precisely how to formulate things (or to what extent the same problems arise) in this setting. What is clear to me however, is that this argument is a very uncomfortable one to make in ITT without breaking the fundamental design principle which insists on the bureaucracy of syntax everywhere excepting in the original unalterable reduction scheme.

Is there a trick with which we can do this comfortably in ITT? Is there a cleaner way to explain these things in CTT?

EDIT: In general, we don't want proof irrelevance. We'd be quite unhappy in a computational setting to see our proofs of ℕ → ℕ → ℕ → ℕ being replaced arbitrarily (for instance, with one that returns zero in all cases). However, it seems to me that the our notion of proof relevance is sometimes too strong. A notion of proof relevance which would be more flexible would be one of observable equivalence - in which case we at least are assured that we compute to the same canonical forms in all closed contexts.

[1] Naïve Computational Type Theory, Robert L. Constable 2003
[2] The NuPRL Book, The NuPRL Group
[3] Constructing Type Systems over an Operational Semantics, Robert Harper, 1992
[4] Type Theory and its Meaning Explanations, Jon Sterling, 2015
[5] An Introduction to Cyclic Proof, James Brotherston
[6] Introduction to Supercompilation, Morten Heine, B. Sø rensen and Robert Glück
[7] Development of the Productive Forces
[8] Proof Term Assignments and Intuitionistic Truth, Jon Sterling