Bucket of Sparks: Probablistic Deshredder -in Python

The second exercise in the ai class was to recompose a shredded message, this is a bit harder than the Caesar Cypher -although still do-able by Mk1 eyeball. Here's the message :

de| | f|Cl|nf|ed|au| i|ti| |ma|ha|or|nn|ou| S|on|nd|on
ry| |is|th|is| b|eo|as| | |f |wh| o|ic| t|, | |he|h
ab| |la|pr|od|ge|ob| m|an| |s |is|el|ti|ng|il|d |ua|c
he| |ea|of|ho| m| t|et|ha| | t|od|ds|e |ki| c|t |ng|br
wo|m,|to|yo|hi|ve|u | t|ob| |pr|d |s |us| s|ul|le|ol|e
t|ca| t|wi| M|d |th|"A|ma|l |he| p|at|ap|it|he|ti|le|er
ry|d |un|Th|" |io|eo|n,|is| |bl|f |pu|Co|ic| o|he|at|mm
hi| | |in| | | t| | | | |ye| |ar| |s | | |.

We want to put the columns in the correct order, the rows already are ordered. I made one ~~cheat~~ optimisation, I assumed that the first row of the first column wasn't indented and that it would begin with a capital letter. Which gives me column 4 as the first column.

Rather than trying to match words, which looked hard as we don't know where the boundaries are, I decided to match n-grams (sequences of letters) and in the end I used trigrams, as I haven't yet found a decent table of quad-grams.

A few Python first for me here, reading in a file, cleaning it up on the way :

f = open('shredded.txt')

regex = re.compile('[,."]')
input_table = []
output_table = []
for str in f:
str = regex.sub(' ', str.strip("\n"))
str = str.split('|')
input_table.append(str)

Transposing a table (this is nice) :

col_table = zip(*input_table)

Then I created a Trigram class to return the probabilities of any given sequence, I used data from the Brown Corpus , there was one 'gotcha' here, the sequence " s " -the corpus assumes that "'s " is transformed into " s " and assigns it a very high probability, I decided to change this to the same as that for " t ".

Other than that it's a straight forward naive Bayes sum of probabilities, although I do it for both 'foward' trigram sequences; i.e for |Cl|nf| I find the probabilities for 'Cln' and 'lnf' and sum them.

So how does it do?

Claude Shannon founded ition manfor
theory which is the bas      f is o
probabilistic language mand   s odel
of the code breaking methat    thods
you would use to solve toblem prhis
with the paper titled Amaticahe Matl
Theory of Communication ished bl pu
in this year

not too shabby, you could answer the question from this, it gets the first 12 columns dead right and bits of the others in the correct order as well.

Longer columns would increase the chances of success, but what about algorithm tweaks? Things to try :

Find a 4-gram table
Post process for words (I was put off this as my WordFreq doesn't include names)
Process for spaces -there should only be one space between words, I have more as I have substituted spaces for punctuation. We could use this to detect 'illegal' orderings.

Bucket of Sparks

Pages

Tuesday 10 January 2012

Probablistic Deshredder -in Python

No comments:

Post a Comment

linkedin