Friday, 19 October 2012

Formalisation of Tables in a Dependent Language

I've had an idea kicking about in my head for a while of making query plans explicit in SQL in such a way that one can be assured that the query plan corresponds to the SQL statement desired. The idea is something like a Curry-Howard in a relational setting. One could infer the plan from the SQL, the SQL from the plan, or do a sort of "type-checking" to make sure that the plan corresponds to the SQL.

The devil is always in the details however. When I started looking at the primitives that I would need, it turns out that the low level table joining operations are actually not that far from primitive SQL statement themselves. I decided to go ahead and formalise some of what would be necessary in Agda in order get a better feel for the types of objects I would need and the laws which would be required to demonstrate that a plan corresponded with a statement.

Dependent types are very powerful and give you plenty of rope to hang yourself. It's always something of a struggle for me to figure out what I want to live in the type world as opposed to what I want to live in the term world.

First things first. We're going to need rows if we want tables. In dynamically typed languages (a.k.a. languages with a [potentially open] universal type) you can easily make a heterogeneous collection of objects (numbers, strings etc). Modelling a row in lisp or python is very straight forward. However, in a strongly typed setting heterogeneous lists are more complicated. The solution I have written in Agda below uses a vector (really just a list with a guaranteed size) of types which lives in the type universe, and which is constructed as we push things into our row. This is basically just a heterogeneous list where the type advertises what kind of row it is.

module RowList where

open import Level renaming (suc to lsuc; zero to lzero)
open import Data.Nat hiding (__)
open import Data.Vec renaming (__ to __; _++_ to _++_; [] to []; [_] to [_])
open import Data.List 
open import Data.Fin renaming (_+_ to _+f_)
open import Data.Nat.Properties
open import Relation.Binary.PropositionalEquality as PropEq
  using (__; refl; sym; subst; trans; cong; cong₂; inspect)
open import Data.Bool

infixr 4 _,_
infixr 5 __

data Row {α : Level} : (n : )  Vec (Set α) n  Set (lsuc α) where 
  · : Row zero []
  _,_ :  {j} {A : Set α} {Aⱼ : Vec (Set α) j}  A  Row j Aⱼ -> Row (suc j) (A  Aⱼ)

In the data definition above we use (·) to terminate a row, and return a row which has an empty-vector in the rows type signature. In the
(,) case, we extend the row by giving an element of some type A and stick its type in our row signature vector. This will allow us to ensure that tables are constructed of homogeneous row types, and will allows us to present a lookup function for rows which knows what its return type will be.

You may notice all the bits about (Set α) etc. The α is just book keeping to avoid Russell-type paradoxes creeping in. They are otherwise not particularly enlightening and the above table definition might be more easily read if you ignore them.

example : Row 2 (  (  []))
example = (1 , 2 , ·)

pick :  {α i j A}  (v : Vec (Fin j) i)  Vec {α} A j  Vec A i
pick [] ys = []
pick (x  xs) ys = lookup x ys  pick xs ys 

__ :  {α j Aⱼ}  (r : Row {α} j Aⱼ)  (i : Fin j)  lookup i Aⱼ 
·  ()
(y , ys)  zero = y
(y , ys)  suc i = ys  i 

We call our row selection operation (r i) for a row (r) and index i. Notice that the return type is dependent on the type signature vector which is carried around in the type of the row. This selector will be useful for writing predicates over columns which will have to know what their input type is. We use a finite type (Fin j) to describe our index, since we need to make sure we don't index outside of the row. The finite type ensures us that whatever the index is, it's no greater than the size of the vector. These operations are now guaranteed to be type safe and can not lead to run-time out-of-range errors.

_++r_ :  {α i j Aᵢ Bⱼ}  Row {α} i Aᵢ  Row j Bⱼ  Row (i + j) (Aᵢ ++ Bⱼ)
· ++r r = r
(x , xs) ++r r = x , (xs ++r r)

We of course need to be able to combine rows from different tables in joins, and so we'll need a row concatenation operator, which we call (++r).

pickr :  {α i j Aⱼ}  (v : Vec (Fin j) i)  Row {α} j Aⱼ  Row i (pick v Aⱼ)
pickr [] r = · 
pickr (x  xs) r = r  x , pickr xs r

Additionally, it's useful if we can specify a list of indices we want to pick out of a row to construct a new row. This will allow us to make a "select" like statement where we can duplicate or ignore indices in order to create a new row, which when done over a list of rows will become more like the familiar table select.

Table :  {α} (m : ) (Aⱼ : Vec (Set α) m)  Set (lsuc α)
Table {α} m Aⱼ = List (Row m Aⱼ)

We define tables as simply a list of rows, and keep the indices and types as parameters to the type. The table header is now correct by construction and exists only as a type index.

Example table: The header is specified as the type
exampleTable : Table 3 (  (  (  [])))
exampleTable = (1 , 2 , 3 , ·)
             ((3 , 4 , 5 , ·)
             ((5 , 6 , 7 , ·)

__ :  {α i j Aᵢ Bⱼ}  Table {α} i Aᵢ  Table j Bⱼ  Table (i + j) (Aᵢ ++ Bⱼ)
[]  [] = []
[]  _  = []
_   [] = []
(x  xs)  (x'  xs') = x ++r x'  [ x ]  xs' ++ xs  (x'  xs') 
-- concatenate row, union row with the rest of the table

The familiar join operation. As you can see here, our primitive join operation does not take a restriction. In practice the primitive join would have to use the restriction in order to reduce the amount of row consing. In order to have a pragmatic optimiser, I'll have to change this later.

Example join
result : Table 6 (  (  (  (  (  (  []))))))
result = exampleTable  exampleTable

-- Restriction 
__ :  {α j Aⱼ}  Table {α} j Aⱼ   (Row j Aⱼ  Bool)  Table j Aⱼ
[]  p = []
(x  xs)  p = if p x then x  xs  p else xs  p 

Restrictions are just row predicates. They simply take a row and specify which rows to keep and which to throw away. This allows for most of the usual SQL WHERE clause shenanigans (minus aggregation of course) and gives us a fairly complete method of doing joins subject to restrictions [we don't deal with nulls at all].

Example restriction
_b_ :  (n m : )  Bool
zero b m = true
suc n b zero = false
suc n b suc n' = n b n'

restrict : Table 6 (  (  (  (  (  (  []))))))
restrict = result  (λ v  (v  (# 1)) b 1)
Finally we have restriction, which allows us to pick and choose columns by specifying their index. It would of course be more natural to have names for columns and allow that as a means of addressing the columns of interest and I may in fact add that to rows in order to simplify the debugging, but it's not of much interest logically speaking.
-- Selection 
__ :  {α i j Aⱼ}  (v : Vec (Fin j) i)  Table {α} j Aⱼ  Table i (pick v Aⱼ)
xs  [] = []
xs  (y  ys) = (pickr xs y)  xs  ys

Example selection 
-- Syntax of a query: (selector ∥ t1 ∪ t2 ∪ t3 ▹ restriction)

selection : Table 2 (  (  []))
selection = ((# 1)  ((# 3)  []))  result  (λ v  (v  (# 1)) b 4)

From this point we will need some form of relation which allows us to establish query equivalence. We will then want a way to compute whether two tables are equivalent up to this relation and we can continue on with the more pragmatic concerns of pushing the restrictions into the actual table joins.