Bucket of Sparks: January 2012

Tuesday 17 January 2012

Creating a dashboard with Python, MongoDB and Highcharts.

We were asked to create views onto some of our development and deployment processes so that management and the teams could see how they were doing. We would have to take in and extract data from several different systems, for several different projects and display statistics and trend lines for those systems.

Choices and Decisions

Some of the systems that we were going to monitor, such as subversion and anthill, weren't query-able and would push data to us, so we would need a data store. The facts that we weren't doing anything transactional (or critical), that the shape of the data may change for a source, and that it could, potentially, get quite large; lead us to MongoDB. There were a couple of other benefits too, the transition from SQL is supposed to be relatively painless, it uses javascript as it's scripting language, and the document structure is in BSON -which the languages drives return to you as JSON (which gives you the possibility of dumping it straight to the web page).

Highcharts -well it's just a damn fine javascript graphing package that's very well documented.

Python -just because we fancied trying something new really.

Having decided upon Python I wanted a really minimal MVC framework, as I felt we had enough new (to us) technologies on our plate, and settled upon Bottle. We set Python up to run in Apache using mod_wsgi, just because we are used to Apache, a more Python dedicated solution may have been better.

How did it all work out?

Pretty well, once we'd got the hang of it. Mongo is fairly friendly, although getting your head around map/reduce takes a little time, and groupby queries are really syntactic sugar on that. You may want to take a look at MongoDB aggregation post -this is easier to use than map/reduce in many cases.

Mongo sells itself as a document store and is schema- less. This is good and bad, good because things don't break when the data structures change, bad because things don't break when the data structures change, so we had feeds dying where the feeding program should have got an exception, but didn't.

Python -is Python, some things, such as class auto loading we missed, other features such as list comprehensions we appreciated.

Bottle I really liked, I set up an MVC structure, all it really supports nativley is routing and views via SimpleTemplateEngine, but creating controller and model directories is easy enough although it's a bit of a bore having to load the classes manually. We ended up with one big template that had as it's arguments the output of partial templates for each widget and in some cases we were abble to extract a data structure from Mongo and pass it as a JSON object to the template without any translation or mapping in tween.

Issues

With a system like this the issues are all in the maintenance, we had perhaps half a dozen feeds from different products and teams, and any change in configuration at the product end would have a downstream effect on us. It was rare that all widgets were working at the same time. We did have a large number of unit tests written that could be run to narrow down a problem, but 90% of the time it was a change in another system, JIRA or svn, that would cause the problem.

Tuesday 10 January 2012

Probablistic Deshredder -in Python

The second exercise in the ai class was to recompose a shredded message, this is a bit harder than the Caesar Cypher -although still do-able by Mk1 eyeball. Here's the message :

de| | f|Cl|nf|ed|au| i|ti| |ma|ha|or|nn|ou| S|on|nd|on
ry| |is|th|is| b|eo|as| | |f |wh| o|ic| t|, | |he|h
ab| |la|pr|od|ge|ob| m|an| |s |is|el|ti|ng|il|d |ua|c
he| |ea|of|ho| m| t|et|ha| | t|od|ds|e |ki| c|t |ng|br
wo|m,|to|yo|hi|ve|u | t|ob| |pr|d |s |us| s|ul|le|ol|e
t|ca| t|wi| M|d |th|"A|ma|l |he| p|at|ap|it|he|ti|le|er
ry|d |un|Th|" |io|eo|n,|is| |bl|f |pu|Co|ic| o|he|at|mm
hi| | |in| | | t| | | | |ye| |ar| |s | | |.

We want to put the columns in the correct order, the rows already are ordered. I made one ~~cheat~~ optimisation, I assumed that the first row of the first column wasn't indented and that it would begin with a capital letter. Which gives me column 4 as the first column.

Rather than trying to match words, which looked hard as we don't know where the boundaries are, I decided to match n-grams (sequences of letters) and in the end I used trigrams, as I haven't yet found a decent table of quad-grams.

A few Python first for me here, reading in a file, cleaning it up on the way :

f = open('shredded.txt')

regex = re.compile('[,."]')
input_table = []
output_table = []
for str in f:
str = regex.sub(' ', str.strip("\n"))
str = str.split('|')
input_table.append(str)

Transposing a table (this is nice) :

col_table = zip(*input_table)

Then I created a Trigram class to return the probabilities of any given sequence, I used data from the Brown Corpus , there was one 'gotcha' here, the sequence " s " -the corpus assumes that "'s " is transformed into " s " and assigns it a very high probability, I decided to change this to the same as that for " t ".

Other than that it's a straight forward naive Bayes sum of probabilities, although I do it for both 'foward' trigram sequences; i.e for |Cl|nf| I find the probabilities for 'Cln' and 'lnf' and sum them.

So how does it do?

Claude Shannon founded ition manfor
theory which is the bas      f is o
probabilistic language mand   s odel
of the code breaking methat    thods
you would use to solve toblem prhis
with the paper titled Amaticahe Matl
Theory of Communication ished bl pu
in this year

not too shabby, you could answer the question from this, it gets the first 12 columns dead right and bits of the others in the correct order as well.

Longer columns would increase the chances of success, but what about algorithm tweaks? Things to try :

Find a 4-gram table
Post process for words (I was put off this as my WordFreq doesn't include names)
Process for spaces -there should only be one space between words, I have more as I have substituted spaces for punctuation. We could use this to detect 'illegal' orderings.

Saturday 7 January 2012

Apple Cake

Apple and Cinnamon cake

I have a glut of apples on the allotment this year, especially the cookers, so I'm going through recipes trying to use them up. I came across this on the Internet, made a few alterations, and it has turned out well -so I though that I'd share.

Ingredients

500g of apples -peeled cored and chopped
125 ml sunflower oil
275g plain flour
2 eggs
200g of soft brown sugar
1 tbsp runny honey
1 tsp vanilla essence
1 tsp bicarb
2tsp cinnamon
1/2 tsp salt

Cooking time ~50 min

Method

Grease and flour a 20 * 20 c.m. tin, put the oven on at 180 C. Peel core and dice the apples. Beat the eggs and oil together until a bit frothy, they aren't going to foam. Add the rest of the ingredients, bar the apple, to the mixer and combine together until smooth (ish). The mixture will be pretty stiff at this point. Add the apples, I had the Kenwood going at 1 for this -some of the apples get squashed which loosens up the mixture. Turn out into the tin, levelling off the top and working the mixture into the corners.

Bake for 30 mins, try it with a skewer (you could try it with a Skua, but it will be noisy and the RSPB will hate you). I found that it needed more time so I knocked the heat down to 160 C and it needed 15 - 20 mins, but I got a cake that was just right in texture and not burnt on top.

Variations
Mixed spice, might be nice, or some ginger -perhaps the syrup from stem ginger?
You could add sultanas.
It would probably stand some treacle.
Change the apples -my cookers aren't Bramleys, they are Monarchs -which aren't as sharp as a Bramley and they hold their shape better -hence the chunks in the cake. I would recommend them as a tree for anyone wanting a cooker as you can buy Bramleys anywhere. Mine came from Adams Apples.

Thursday 5 January 2012

Caesar cypher solver in Python

An optional exercise in the ai class was to solve a Caesar cypher. The example given was pretty simple with only one shift, so the easiest way to solve it would be by inspection after printing out all 26 options.

But that isn't in the spirit of things, we want to solve it in an a.i. manner using probabilities, here's

the original cypher :

"Esp qtcde nzyqpcpynp zy esp ezatn zq Lcetqtntlw Tyepwwtrpynp hld spwo le Olcexzfes Nzwwprp ty estd jplc."

The first thing I noticed is that it has structure, if we assume that the punctuation has been left alone then we have a set of words. I can look up the probabilities of words in a table -I used the one from the British National Corpus: ftp://ftp.itri.bton.ac.uk/bnc/all.num.o5

I created a Python class with a dictionary of words and probabilities (word_freqs) and a class method to return the probability of a word :

@classmethod
def word_prob(self,word):
    if word in self.word_freqs:
      return math.log(self.word_freqs[word] / self.corpus_freq)
    return math.log(1 / self.corpus_freq)

I take the log because, even with the correct, english, sentence, the probabilities soon get too small for simple arithmetic as they are multiplied together for each word in the sentence to get a 'sentence probability'

Storing and sorting the probabilities stumped me for a bit, but I eventually decided upon a list of tuples :

prob_map.append((prob, string2))

Which I sort, using the operator module, with :
sorted(prob_map, key=operator.itemgetter(0), reverse=False):

This seems clunky compared to PHP, Perl or even C, but I'm still feeling my way around Python so I'm hoping that there's a better way.

To give me the answer :
(-7308.628573183837, 'the first conference on the topic of artificial intelligence was held at dartmouth college in this year.')

1956 BTW

There we go, a first bit of Natural Language Programming -which I have signed up for this term.

Bucket of Sparks

Pages