Tuesday 10 January 2012

Probablistic Deshredder -in Python

The second exercise in the ai class was to recompose a shredded message, this is a bit harder than the Caesar Cypher -although still do-able by Mk1 eyeball. Here's the message :

de|  | f|Cl|nf|ed|au| i|ti|  |ma|ha|or|nn|ou| S|on|nd|on
ry|  |is|th|is| b|eo|as|  |  |f |wh| o|ic| t|, |  |he|h
ab|  |la|pr|od|ge|ob| m|an|  |s |is|el|ti|ng|il|d |ua|c
he|  |ea|of|ho| m| t|et|ha|  | t|od|ds|e |ki| c|t |ng|br
wo|m,|to|yo|hi|ve|u | t|ob|  |pr|d |s |us| s|ul|le|ol|e
 t|ca| t|wi| M|d |th|"A|ma|l |he| p|at|ap|it|he|ti|le|er
ry|d |un|Th|" |io|eo|n,|is|  |bl|f |pu|Co|ic| o|he|at|mm
hi|  |  |in|  |  | t|  |  |  |  |ye|  |ar|  |s |  |  |.

We want to put the columns in the correct order, the rows already are ordered. I made one cheat optimisation, I assumed that the first row of the first column wasn't indented and that it would begin with a capital letter. Which gives me column 4 as the first column.

Rather than trying to match words, which looked hard as we don't know where the boundaries are, I decided to match n-grams (sequences of letters) and in the end I used trigrams, as I haven't yet found a decent table of quad-grams.

A few Python first for me here, reading in a file, cleaning it up on the way :
f = open('shredded.txt')

regex = re.compile('[,."]')
input_table = []
output_table = []
for str in  f:
  str = regex.sub(' ', str.strip("\n"))
  str = str.split('|')
  input_table.append(str)
Transposing a table (this is nice) :
col_table = zip(*input_table)

Then I created a Trigram class to return the probabilities of any given sequence, I used data from the Brown Corpus , there was one 'gotcha' here, the sequence " s " -the corpus assumes that "'s " is transformed into " s " and assigns it a very high probability, I decided to change this to the same as that for " t ".

Other than that it's a straight forward naive Bayes sum of probabilities, although I do it for both 'foward' trigram sequences; i.e for |Cl|nf| I find the probabilities for 'Cln' and 'lnf' and sum them.

So how does it do?
Claude Shannon founded ition  manfor 
theory  which is the bas      f is o 
probabilistic language mand   s odel 
of the code breaking methat    thods 
you would use to solve toblem prhis  
with the paper titled  Amaticahe Matl
Theory of Communication ished bl  pu 
in this year 
not too shabby, you could answer the question from this, it gets the first 12 columns dead right and bits of the others in the correct order as well.

Longer columns would increase the chances of success, but what about algorithm tweaks? Things to try :
  • Find a 4-gram table
  • Post process for words (I was put off this as my WordFreq doesn't include names)
  • Process for spaces -there should only be one space between words, I have more as I have substituted spaces for punctuation. We could use this to detect 'illegal' orderings.

No comments:

Post a Comment

linkedin