Skip to main content

Strategies for automatic translation

I've been hacking away at my program to test a theory I have about machine translation. I wrote a bit about it in a previous post but I was fairly vague. I thought I'd describe in more detail exactly how the technique would work (I'm still in phase 1).

The idea is simple. The first phase is to take a corpus in a language. Take each sentence of the source (or some other sized chunk, currently I'm limited by computational tractability to a single sentence) and recombine each element of the sentence into every possible string of n-grams. If you play with it a bit you'll realise that there are 2(N-1) of these for a string of size N. One way to think about it is that there are N-1 indexes into the spaces between words in the string. You can then think of each sentence as being a collection of indexes at which we combine words. This is obviously the power set of the set of indexes {1,2,3...N-1} and hence there are 2(N-1). It turns out however that it is nice to have a special word meaning "beggining of sentence" and another for "end of sentence", so we end up starting with N+2 words, and getting 2(N+1). That can be a big number!

So now that we have our n-grams for each sentence we want to look at transition probabilities between n-grams. The reason for this is that various parts of a sentence have unpredictable size. In the absense of a full NL parsing system there is no way to figure out what a syntactic unit (a noun phrase for instance) will be. This process completely obviates the need for an NL parser. This in itself is a huge win since NL parsing is at least difficult and probably impossible to do correctly because of idioms and variations in dialect. With the n-grams in hand we can now look at transition frequencies amoung the various n-grams in each of the different patterns in which they were combined. At this point we enter the information into a database which stores the transition probability between every two n-grams. Let us assume that we ignore sentences larger than 12 words. This means that we have 213 or 8192 words for a large sentence. This gives us 67,000,000 entries in our transition frequency matrix. O.K. So this is looking fairly intractable. If we decided that we will only look at correlations between neighbors and next neighbors however, we are back in the realm of possibility. This limitation has a certain justification beyond making things computationally feasible in that every element of the sentence will be a next nearest neighbor with an element of one of the n-gram sentences therby relating every possible syntactic unit. It should even be possible given this information to "guess" a parse based on our frequencies given a large enough corpus.

Stage 2 revolves around extracting information from a parallel corpus. We will simply perform a nearly identical procedure between two parallel corpuses.

When stage 1 and stage 2 are completed, we can use the probabilities of co-occurance from the parallel corpus in conjunction with the intra-language transition frequencies to generate "most probable" sentences.

We'll see how it goes.

Comments

Popular posts from this blog

Teagrey

I was ironing my shirt today, which I almost never do. Because of this I don't have an ironing board so I tried to make a make-shift ironing board on my floor using a towel and some books. I grabbed the heaviest books on the nearest shelf, which happened to be Organic Chemistry, Stalingrad and an annotated study bible containing the old and new testament. As I pulled out the bible, a flower fell out which had been there for over 17 years now. I know that because it was put there by my first wife, Daniel, who killed herself in April about 17 years ago. It fell from Thessalonians to which it had been opened partially. Highlighted was the passage: "Ye are all sons of the light and sons of the day." I guess the passage gave her solace. Daniel was a complicated woman. She had serious mental health issues which plagued her for her entire life. None of them were her fault. She was dealt an absolutely awful hand in life, some truly nasty cards. She had some considerable c

Managing state in Prolog monadically, using DCGs.

Prolog is a beautiful language which makes a lot of irritating rudimentary rule application and search easy. I have found it is particularly nice when trying to deal with compilers which involve rule based transformation from a source language L to a target language L'. However, the management of these rules generally requires keeping track of a context, and this context has to be explicitly threaded through the entire application, which involves a lot of irritating and error prone sequence variables. This often leads to your code looking something a bit like this: compile(seq(a,b),(ResultA,ResultB),S0,S2) :- compile(a,ResultA,S0,S1), compile(b,ResultB,S1,S2). While not the worst thing, I've found it irritating and ugly, and I've made a lot of mistakes with incorrectly sequenced variables. It's much easier to see sequence made explicitly textually in the code. While they were not designed for this task, but rather for parsing, DCGs turn out to be a conveni

Decidable Equality in Agda

So I've been playing with typing various things in System-F which previously I had left with auxiliary well-formedness conditions. This includes substitutions and contexts, both of which are interesting to have well typed versions of. Since I've been learning Agda, it seemed sensible to carry out this work in that language, as there is nothing like a problem to help you learn a language. In the course of proving properties, I ran into the age old problem of showing that equivalence is decidable between two objects. In this particular case, I need to be able to show the decidability of equality over types in System F in order to have formation rules for variable contexts. We'd like a context Γ to have (x:A) only if (x:B) does not occur in Γ when (A ≠ B). For us to have statements about whether two types are equal or not, we're going to need to be able to decide if that's true using a terminating procedure. And so we arrive at our story. In Coq, equality is