de| | f|Cl|nf|ed|au| i|ti| |ma|ha|or|nn|ou| S|on|nd|on
ry| |is|th|is| b|eo|as| | |f |wh| o|ic| t|, | |he|h
ab| |la|pr|od|ge|ob| m|an| |s |is|el|ti|ng|il|d |ua|c
he| |ea|of|ho| m| t|et|ha| | t|od|ds|e |ki| c|t |ng|br
wo|m,|to|yo|hi|ve|u | t|ob| |pr|d |s |us| s|ul|le|ol|e
t|ca| t|wi| M|d |th|"A|ma|l |he| p|at|ap|it|he|ti|le|er
ry|d |un|Th|" |io|eo|n,|is| |bl|f |pu|Co|ic| o|he|at|mm
hi| | |in| | | t| | | | |ye| |ar| |s | | |.
We want to put the columns in the correct order; the rows are already ordered. I made one
Rather than trying to match words, which looked hard as we don't know where the boundaries are, I decided to match n-grams (sequences of letters). In the end I used trigrams, as I haven't yet found a decent table of quad-grams.
A few Python firsts for me here. Reading in a file, cleaning it up on the way:
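To make the n-gram idea concrete, here is a minimal sketch of pulling overlapping letter sequences out of a string. The helper name `ngrams` is made up for this illustration and is not part of the code in this post.

```python
def ngrams(text, n):
    """Return every overlapping run of n characters in text, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

trigrams = ngrams("theory", 3)    # three letters at a time
quadgrams = ngrams("theory", 4)   # the same idea with n = 4
```

Spaces count as characters here, which is what lets an n-gram table say something about likely word boundaries.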
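import re

f = open('shredded.txt')
regex = re.compile('[,."]')  # punctuation to be replaced by spaces
input_table = []
output_table = []
for line in f:
    # blank out punctuation, then split the row into its two-character pieces
    line = regex.sub(' ', line.strip("\n"))
    input_table.append(line.split('|'))
f.close()

Transposing a table (this is nice):

col_table = list(zip(*input_table))  # one tuple per column; list() is needed on Python 3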
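For anyone who hasn't met the `zip(*...)` idiom, here is a toy illustration on a made-up two-row table (the values are just for show). It assumes Python 3, where `zip` is lazy and has to be wrapped in `list()`.

```python
# Toy table: two rows, three pieces each (illustrative values only).
rows = [["de", "  ", " f"],
        ["ry", "  ", "is"]]

# zip(*rows) pairs up the i-th piece of every row, giving one tuple per column.
columns = list(zip(*rows))

# Transposing a second time gets the rows back (as lists again).
rows_again = [list(r) for r in zip(*columns)]
```

Once the data is held column-wise, reordering the shredded strips is just reordering a list, and one more transpose turns the result back into readable rows.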
Then I created a Trigram class to return the probabilities of any given sequence, using data from the Brown Corpus. There was one 'gotcha' here: the sequence " s ". The corpus assumes that "'s " is transformed into " s " and assigns it a very high probability, so I decided to give it the same probability as " t ".
Other than that it's a straightforward naive Bayes sum of probabilities, although I do it for both 'forward' trigram sequences, i.e. for |Cl|nf| I find the probabilities for 'Cln' and 'lnf' and sum them.
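A rough sketch of what such a class might look like follows. It is not the actual Trigram class from this post: the name `TrigramSketch`, counting trigrams from a plain string instead of a Brown Corpus table, the lowercasing, and the crude add-one smoothing are all assumptions made for illustration.

```python
from collections import Counter

class TrigramSketch:
    """Toy character-trigram model (hypothetical, for illustration only)."""

    def __init__(self, text):
        text = text.lower()
        # Count every overlapping three-character sequence in the sample text.
        self.counts = Counter(text[i:i + 3] for i in range(len(text) - 2))

    def override(self, trigram, like):
        """Give `trigram` the same count as `like` (e.g. " s " like " t ")."""
        self.counts[trigram.lower()] = self.counts[like.lower()]

    def prob(self, trigram):
        # Crude add-one smoothing (an assumption) so unseen trigrams are not zero.
        total = sum(self.counts.values())
        return (self.counts[trigram.lower()] + 1) / (total + len(self.counts) + 1)

model = TrigramSketch("the cat sat on the mat")
model.override(" s ", " t ")
```

With a real frequency table the constructor would load the published counts rather than count a sample string, but the lookup and the " s " override would look much the same.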
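The scoring step described above can be sketched roughly like this. The function names and the `prob` callable are placeholders invented for the example (any function mapping a trigram to a number would do), so treat it as an outline of the idea, not the code actually used.

```python
def boundary_trigrams(left, right):
    """Return the two trigrams that span the join of two 2-character pieces."""
    joined = left + right            # e.g. "Cl" + "nf" -> "Clnf"
    return joined[:3], joined[1:]    # -> ("Cln", "lnf")

def pair_score(left_col, right_col, prob):
    """Sum prob() over both boundary trigrams, for every row of the two columns."""
    total = 0.0
    for left, right in zip(left_col, right_col):
        for tri in boundary_trigrams(left, right):
            total += prob(tri)
    return total

# Toy probability function (made up): only "Cla" and "lau" are considered likely.
toy_prob = lambda tri: 1.0 if tri in {"Cla", "lau"} else 0.0
good = pair_score(("Cl",), ("au",), toy_prob)
bad = pair_score(("Cl",), ("nf",), toy_prob)
```

A higher score for a (left, right) pair suggests that the right column is a more plausible neighbour for the left one.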
So how does it do?
Claude Shannon founded ition manfor
theory which is the bas f is o
probabilistic language mand s odel
of the code breaking methat thods
you would use to solve toblem prhis
with the paper titled Amaticahe Matl
Theory of Communication ished bl pu
in this year
Not too shabby, you could answer the question from this. It gets the first 12 columns dead right and bits of the others in the correct order as well.
Longer columns would increase the chances of success, but what about algorithm tweaks? Things to try :
- Find a 4-gram table
- Post-process for words (I was put off this as my WordFreq doesn't include names)
- Process for spaces: there should only be one space between words. I have more, as I have substituted spaces for punctuation. We could use this to detect 'illegal' orderings.
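As a rough sketch of the last idea, the check below flags a pair of columns whose join would put two spaces next to each other in any row. It assumes the punctuation has been left in place instead of being blanked out, and that an all-blank piece is just padding at the end of a row; both are assumptions, and the function names are made up for illustration.

```python
def illegal_join(left, right):
    """True if joining two pieces would put two spaces side by side inside the text."""
    if right.strip() == "":
        return False  # all-blank piece: assumed to be end-of-row padding
    return left.endswith(" ") and right.startswith(" ")

def ordering_allowed(left_col, right_col):
    """A column pair is allowed only if no row produces an illegal join."""
    return not any(illegal_join(l, r) for l, r in zip(left_col, right_col))

ok = ordering_allowed(("de", "ry"), (" S", " w"))        # "de S", "ry w"
rejected = ordering_allowed(("de", "s "), (" S", " t"))  # second row gives "s  t"
```

Pairs rejected this way could be skipped before any trigram scoring is done, which shrinks the search.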