Tuesday 7 August 2012

Apache Logging to MongoDB Using a Named Pipe

Following on from the last article on remote logging, where I collected all the logs into one place, I wanted to be able to query them, so that I can ask questions like "which pages take longer than 5 seconds to produce?", "how many of our redirects fail?" and the like.

I believe that there are already MySQL modules for rsyslogd, but I had written a script in Python to interpret logs and put them into MongoDB, so I wanted to use that.

The obvious way to join the two programs together seemed to be to use a named pipe, a very basic type of interprocess communication where one program chucks bits down the pipe and the other pulls them out.

Chucking them in is easy enough: first create your pipe with mkfifo, then just change the line in the rsyslog server config from :

:programname, isequal, "apache2" /var/log/oneGiantHeapOfLogs.log
to
:programname, isequal, "apache2" |/tmp/logger_pipe



Reading from the pipe was pretty easy too as this Python snippet shows :

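import sys
import time

# Open the pipe (this blocks until a writer has it open too)
in_pipe = open(sys.argv[1], "r")
# Loop forever
while True:
  try:
    line = in_pipe.readline()         # Blocks while a writer holds the pipe open
    if len(line) == 0:                # EOF, no writer attached (happens in test)
      time.sleep(5)
      continue
    line = line.rstrip("\n")          # Strip the newline only after the EOF check
    # ... parse the line and store it in MongoDB here ...
  except Exception as e:
    sys.stderr.write("Bad log line: %s\n" % e)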


In short, open the pipe, loop forever reading lines. We want this to keep going so the program is wrapped in an exception handler to catch any log parsing errors.
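If you want to try the plumbing without rsyslog in the way, here is a minimal, self-contained sketch of both ends of a named pipe in Python. The path, the messages and the function names are made up for illustration, and it assumes a Unix-like system where os.mkfifo is available.

```python
import os
import tempfile
import threading


def write_lines(path, lines):
    # Opening a FIFO for writing blocks until a reader has it open too.
    with open(path, "w") as pipe:
        for line in lines:
            pipe.write(line + "\n")


def read_lines(path):
    # Opening for reading blocks until a writer appears; iteration ends at
    # EOF, i.e. once every writer has closed its end of the pipe.
    with open(path, "r") as pipe:
        return [line.rstrip("\n") for line in pipe]


def demo():
    # Hypothetical path and messages, for illustration only.
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "logger_pipe")
    os.mkfifo(path)  # same job as the mkfifo command (Unix only)
    messages = ["first log line", "second log line"]
    writer = threading.Thread(target=write_lines, args=(path, messages))
    writer.start()
    try:
        received = read_lines(path)
    finally:
        writer.join()
        os.remove(path)
        os.rmdir(workdir)
    return received


if __name__ == "__main__":
    print(demo())
```

The writer runs in a thread here only so that the demo fits in one process; with rsyslog doing the writing, the reading half is all you need.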

Having got the input into the program, it's just a problem of parsing that input - we can take a look at that in another post.

Tuesday 31 July 2012

Remote logging for Apache and Amazon AWS

One of the changes that I have started in my current job has been to move the environment to Amazon Web Services. By and large this suits us well, but one of the issues we do have is that, as instances spin up and down, we lose logs.

My current solution to this is to set up a logging server and route the logs from the web heads to that. If you're not interested in Linux plumbing, it's time to go away and read something else.

There are 3 areas that need configuring: Apache, the web head's logging system and the logging server. Our systems are running Ubuntu 12.04 LTS (hey, the Rezillos on Last.fm :) ), and that uses rsyslog rather than stock syslog, so YMMV.

Apache

This is just an edit of the httpd.conf file (or a file it includes) to repoint ErrorLog and CustomLog. We also set up a new blackbox logging format to help debugging.

ErrorLog "|/usr/bin/logger -p local1.info -t apache2"
LogFormat "%v:%a/0 %X %t \"%r\" %s/%>s %{pid}P/%{tid}P %T/%D %I/%O/%B" blackbox
CustomLog "|/usr/bin/logger -p local1.info -t apache2" blackbox

As you can see, this uses a pipe into logger (-p sets the priority and -t the tag). When you reload Apache you should find its log messages going into the local syslog.
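To show what the blackbox format carries, here is a rough sketch of pulling one of those lines apart in Python. The regular expression is my own reading of the LogFormat string above, the field names are invented, and the sample line is made up rather than taken from a real log, so treat it as a starting point.

```python
import re

# One regex group per field of the blackbox LogFormat shown above.
BLACKBOX = re.compile(
    r'(?P<vhost>[^:\s]+):(?P<client>\S+)/0 '
    r'(?P<conn>[X+-]) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3})/(?P<final_status>\d{3}) '
    r'(?P<pid>\d+)/(?P<tid>\d+) '
    r'(?P<seconds>\d+)/(?P<micros>\d+) '
    r'(?P<bytes_in>\d+)/(?P<bytes_out>\d+)/(?P<body_bytes>\d+)'
)

NUMERIC = ("status", "final_status", "pid", "tid", "seconds", "micros",
           "bytes_in", "bytes_out", "body_bytes")


def parse_blackbox(line):
    # search() rather than match(), so any syslog prefix is skipped.
    found = BLACKBOX.search(line)
    if found is None:
        return None
    record = found.groupdict()
    for name in NUMERIC:
        record[name] = int(record[name])
    return record


def is_slow(record, limit_seconds=5):
    # %D is the time taken to serve the request, in microseconds.
    return record["micros"] > limit_seconds * 1000000


# Hypothetical log line, made up to fit the format.
sample = ('www.example.com:203.0.113.7/0 + [07/Aug/2012:10:15:32 +0100] '
          '"GET /slow/page HTTP/1.1" 200/200 1234/140213 6/6012345 '
          '512/20480/20000')
record = parse_blackbox(sample)
```

A request line containing an escaped double quote would trip the simple "[^"]*" group, so a production parser would need to be a bit more careful there.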

Local Rsyslog


I basically trashed /etc/rsyslog.conf to send everything from local1 to the logging server. local1 probably isn't the right facility for this, but it's working.

$ModLoad imuxsock # provides support for local system logging
$ModLoad imklog   # provides kernel logging support (previously done by rklogd)
local1.* @ec2-46-137-82-138.eu-west-1.compute.amazonaws.com


Messages are sent via UDP, as we just want a 'fire and forget' system. If you want to be sure your messages arrive you can tell rsyslog to use TCP (@@ rather than @ in front of the host name).

One thing to note is the use of the DNS name for the Elastic IP of the server, rather than the IP address itself. This should ensure that traffic stays internal to Amazon, and is thus free; using the IP address will route it externally, which costs.

Another point is the use of the Elastic IP, rather than just the IP address of the server. This is another AWS funny: servers aren't guaranteed to keep their IP if they shut down and restart, whereas the Elastic IP should always be there, so using it saves you having to update all your clients if the server IP changes.

Server Rsyslog

Uncomment UDP syslog reception in rsyslog.conf :
# provides UDP syslog reception
$ModLoad imudp
$UDPServerRun 514
 

In /etc/rsyslog.d I created a .conf file containing :


:programname, isequal, "apache2" /var/log/oneGiantHeapOfLogs.log

This tests for the tag name 'apache2' that we set using logger in the Apache config. If it finds it, the message goes to the big log file.


Job Done.


Debugging

  • You can call logger directly on the command line with a message and tail the various log files you are looking at to see if the message arrives.
  • On the server 'netstat -nlp' will show you whether rsyslog is listening on the UDP logging port (514)
  • On the client you can write directly to the port on the server with netcat -u 10.229.70.230 514
  • Put rsyslog messages into debug mode with *.* /var/log/rsyslog_debug.log;RSYSLOG_DebugFormat early on in rsyslog.conf
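
If netcat isn't to hand, the same test message can be sent from Python. This is a minimal sketch: the function names are made up, and it sends a stripped-down BSD-syslog style datagram (priority, tag, message) on the assumption that the receiving rsyslog will fill in the timestamp and host itself.

```python
import socket


def syslog_datagram(message, tag="apache2", facility=17, severity=6):
    # Priority value is facility * 8 + severity; facility 17 is local1 and
    # severity 6 is info, matching the "local1.info" passed to logger.
    priority = facility * 8 + severity
    return ("<%d>%s: %s" % (priority, tag, message)).encode("utf-8")


def send_test_message(host, port=514, message="test from python"):
    # UDP is fire and forget: no error is raised if nothing is listening.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(syslog_datagram(message), (host, port))
    finally:
        sock.close()
```

Calling send_test_message("10.229.70.230") would aim a message at the logging server used in the netcat example; then tail the big log file to see whether it turns up.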
Eliza Carthy playing  'Moss Meg' now.

Tuesday 17 January 2012

Creating a dashboard with Python, MongoDB and Highcharts.

We were asked to create views onto some of our development and deployment processes so that management and the teams could see how they were doing. We would have to take in and extract data from several different systems, for several different projects and display statistics and trend lines for those systems.

Choices and Decisions

Some of the systems that we were going to monitor, such as Subversion and Anthill, weren't query-able and would push data to us, so we would need a data store. The facts that we weren't doing anything transactional (or critical), that the shape of the data might change for a source, and that it could, potentially, get quite large led us to MongoDB. There were a couple of other benefits too: the transition from SQL is supposed to be relatively painless, it uses JavaScript as its scripting language, and the document structure is in BSON, which the language drivers return to you as JSON (which gives you the possibility of dumping it straight to the web page).


Highcharts - well, it's just a damn fine JavaScript graphing package that's very well documented.

Python - just because we fancied trying something new really.

Having decided upon Python I wanted a really minimal MVC framework, as I felt we had enough new (to us) technologies on our plate, and settled upon Bottle. We set Python up to run in Apache using mod_wsgi, just because we are used to Apache; a more Python-dedicated solution may have been better.

How did it all work out? 

Pretty well, once we'd got the hang of it. Mongo is fairly friendly, although getting your head around map/reduce takes a little time, and group-by queries are really syntactic sugar on that. You may want to take a look at the MongoDB aggregation post - aggregation is easier to use than map/reduce in many cases.
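
To illustrate the point about group-by being sugar over map/reduce, here is a small sketch in plain Python. It is not MongoDB code, just the same idea in miniature, and the build records and field names are hypothetical rather than the dashboard's real schema.

```python
from collections import defaultdict
from functools import reduce


def map_reduce(docs, map_fn, reduce_fn):
    # Map phase: every document emits a (key, value) pair.
    emitted = defaultdict(list)
    for doc in docs:
        key, value = map_fn(doc)
        emitted[key].append(value)
    # Reduce phase: fold each key's values down to one result.
    return {key: reduce(reduce_fn, values) for key, values in emitted.items()}


# Hypothetical build records, not the dashboard's real schema.
builds = [
    {"project": "web", "failed": 1},
    {"project": "web", "failed": 0},
    {"project": "api", "failed": 1},
]

# "Group by project, sum failed" is a map that picks the key and the value,
# plus a reduce that adds the values up.
failures = map_reduce(
    builds,
    lambda doc: (doc["project"], doc["failed"]),
    lambda a, b: a + b,
)
```

Once the map picks the grouping key and the reduce does the summing, a group-by is all that is left, which is why the sugar is so welcome.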

Mongo sells itself as a document store and is schema-less. This is good and bad: good because things don't break when the data structures change, bad because things don't break when the data structures change, so we had feeds dying where the feeding program should have got an exception, but didn't.

Python is Python: some things, such as class autoloading, we missed; other features, such as list comprehensions, we appreciated.

Bottle I really liked. I set up an MVC structure; all Bottle really supports natively is routing and views via its SimpleTemplate engine, but creating controller and model directories is easy enough, although it's a bit of a bore having to load the classes manually. We ended up with one big template that had as its arguments the output of partial templates for each widget, and in some cases we were able to extract a data structure from Mongo and pass it as a JSON object to the template without any translation or mapping in between.

Issues

With a system like this the issues are all in the maintenance. We had perhaps half a dozen feeds from different products and teams, and any change in configuration at the product end would have a downstream effect on us. It was rare that all widgets were working at the same time. We did have a large number of unit tests written that could be run to narrow down a problem, but 90% of the time it was a change in another system, JIRA or svn, that would cause the problem.

Tuesday 10 January 2012

Probabilistic Deshredder - in Python

The second exercise in the AI class was to recompose a shredded message. This is a bit harder than the Caesar Cypher, although still do-able by Mk1 eyeball. Here's the message :

de|  | f|Cl|nf|ed|au| i|ti|  |ma|ha|or|nn|ou| S|on|nd|on
ry|  |is|th|is| b|eo|as|  |  |f |wh| o|ic| t|, |  |he|h
ab|  |la|pr|od|ge|ob| m|an|  |s |is|el|ti|ng|il|d |ua|c
he|  |ea|of|ho| m| t|et|ha|  | t|od|ds|e |ki| c|t |ng|br
wo|m,|to|yo|hi|ve|u | t|ob|  |pr|d |s |us| s|ul|le|ol|e
 t|ca| t|wi| M|d |th|"A|ma|l |he| p|at|ap|it|he|ti|le|er
ry|d |un|Th|" |io|eo|n,|is|  |bl|f |pu|Co|ic| o|he|at|mm
hi|  |  |in|  |  | t|  |  |  |  |ye|  |ar|  |s |  |  |.

We want to put the columns in the correct order; the rows are already ordered. I made one cheat optimisation: I assumed that the first row of the first column wasn't indented and that it would begin with a capital letter, which gives me column 4 as the first column.

Rather than trying to match words, which looked hard as we don't know where the boundaries are, I decided to match n-grams (sequences of letters) and in the end I used trigrams, as I haven't yet found a decent table of quad-grams.

A few Python firsts for me here. Reading in a file, cleaning it up on the way :
import re

regex = re.compile('[,."]')     # punctuation to be replaced by spaces
input_table = []
output_table = []
with open('shredded.txt') as f:
  for row in f:
    row = regex.sub(' ', row.strip("\n"))
    input_table.append(row.split('|'))    # one list of 2-character cells per row
Transposing a table (this is nice) :
col_table = list(zip(*input_table))       # list() so that it also works on Python 3

Then I created a Trigram class to return the probabilities of any given sequence, using data from the Brown Corpus. There was one 'gotcha' here, the sequence " s ": the corpus assumes that "'s " is transformed into " s " and assigns it a very high probability, so I decided to change this to the same as that for " t ".

Other than that it's a straightforward naive Bayes sum of probabilities, although I do it for both 'forward' trigram sequences, i.e. for |Cl|nf| I find the probabilities for 'Cln' and 'lnf' and sum them.
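
As a rough sketch of that scoring step, here is how the sum over the two straddling trigrams might look. This is not the original Trigram class: the function names, the lower-casing, the fallback value for unseen trigrams and the toy probability table are all assumptions made for illustration.

```python
def pair_score(left_col, right_col, trigram_prob, unseen=1e-6):
    # Sum, over every row, the probabilities of the two trigrams that
    # straddle the join between the two strips.
    total = 0.0
    for left, right in zip(left_col, right_col):
        joined = (left + right).lower()
        for trigram in (joined[0:3], joined[1:4]):
            total += trigram_prob.get(trigram, unseen)
    return total


def best_neighbour(left_col, candidates, trigram_prob):
    # Pick the candidate strip that scores highest to the right of left_col.
    return max(candidates,
               key=lambda col: pair_score(left_col, col, trigram_prob))


# Toy probabilities, invented for the example (not Brown Corpus figures).
toy_probs = {"cla": 0.002, "lau": 0.001, "the": 0.02, "heo": 0.0005,
             "thi": 0.005, "his": 0.004}

first_col = ("Cl", "th")                     # top two rows of column 4
candidates = [("nf", "is"), ("au", "eo")]    # top two rows of two other strips
chosen = best_neighbour(first_col, candidates, toy_probs)
```

Repeating best_neighbour greedily, removing each strip once it has been placed, gives one simple way to order the whole message.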

So how does it do?
Claude Shannon founded ition  manfor 
theory  which is the bas      f is o 
probabilistic language mand   s odel 
of the code breaking methat    thods 
you would use to solve toblem prhis  
with the paper titled  Amaticahe Matl
Theory of Communication ished bl  pu 
in this year 
Not too shabby: you could answer the question from this. It gets the first 12 columns dead right and bits of the others in the correct order as well.

Longer columns would increase the chances of success, but what about algorithm tweaks? Things to try :
  • Find a 4-gram table
  • Post process for words (I was put off this as my WordFreq doesn't include names)
  • Process for spaces - there should only be one space between words; I have more as I have substituted spaces for punctuation. We could use this to detect 'illegal' orderings.

Saturday 7 January 2012

Apple Cake

Apple and Cinnamon cake

I have a glut of apples on the allotment this year, especially the cookers, so I'm going through recipes trying to use them up. I came across this on the Internet, made a few alterations, and it has turned out well, so I thought that I'd share.

Ingredients

  • 500g of apples - peeled, cored and chopped
  • 125 ml sunflower oil
  • 275g plain flour 
  • 2 eggs
  • 200g of soft brown sugar
  • 1 tbsp runny honey
  • 1 tsp vanilla essence
  • 1 tsp bicarb
  • 2 tsp cinnamon
  • 1/2 tsp salt
Cooking time ~50 min

Method
  
Grease and flour a 20 x 20 cm tin and put the oven on at 180 C. Peel, core and dice the apples. Beat the eggs and oil together until a bit frothy; they aren't going to foam. Add the rest of the ingredients, bar the apple, to the mixer and combine together until smooth (ish). The mixture will be pretty stiff at this point. Add the apples; I had the Kenwood going at 1 for this, and some of the apples get squashed, which loosens up the mixture. Turn out into the tin, levelling off the top and working the mixture into the corners.

Bake for 30 mins, then try it with a skewer (you could try it with a Skua, but it will be noisy and the RSPB will hate you). I found that it needed more time, so I knocked the heat down to 160 C and gave it another 15 - 20 mins, and I got a cake that was just right in texture and not burnt on top.

Variations
Mixed spice might be nice, or some ginger - perhaps the syrup from stem ginger?
You could add sultanas.
It would probably stand some treacle.
Change the apples - my cookers aren't Bramleys, they are Monarchs, which aren't as sharp as a Bramley and hold their shape better, hence the chunks in the cake. I would recommend them as a tree for anyone wanting a cooker, as you can buy Bramleys anywhere. Mine came from Adams Apples.

Thursday 5 January 2012

Caesar cypher solver in Python

An optional exercise in the AI class was to solve a Caesar cypher. The example given was pretty simple with only one shift, so the easiest way to solve it would be by inspection after printing out all 26 options.
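
For reference, the by-inspection approach only takes a few lines. This is a minimal sketch in Python 3 using the cypher text from this post; shift_text is just a name I've picked for the helper.

```python
import string


def shift_text(text, shift):
    # Rotate letters by `shift` places, leaving punctuation and spaces alone.
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    table = str.maketrans(
        lower + upper,
        lower[shift:] + lower[:shift] + upper[shift:] + upper[:shift],
    )
    return text.translate(table)


cypher = ("Esp qtcde nzyqpcpynp zy esp ezatn zq Lcetqtntlw Tyepwwtrpynp "
          "hld spwo le Olcexzfes Nzwwprp ty estd jplc.")

# One candidate per possible shift; exactly one of them reads as English.
candidates = [shift_text(cypher, shift) for shift in range(26)]

if __name__ == "__main__":
    for shift, candidate in enumerate(candidates):
        print(shift, candidate)
```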

But that isn't in the spirit of things; we want to solve it in an AI manner using probabilities. Here's the original cypher :

"Esp qtcde nzyqpcpynp zy esp ezatn zq Lcetqtntlw Tyepwwtrpynp hld spwo le Olcexzfes Nzwwprp ty estd jplc."

The first thing I noticed is that it has structure: if we assume that the punctuation has been left alone then we have a set of words. I can look up the probabilities of words in a table - I used the one from the British National Corpus: ftp://ftp.itri.bton.ac.uk/bnc/all.num.o5

I created a Python class with a dictionary of words and probabilities (word_freqs) and a class method to return the probability of a word :

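  # Needs 'import math' at the top of the module
  @classmethod
  def word_prob(cls, word):
    # Log probability of a word; unseen words are treated as if seen once
    if word in cls.word_freqs:
      # float() avoids integer division truncating to 0 on Python 2
      return math.log(float(cls.word_freqs[word]) / cls.corpus_freq)
    return math.log(1.0 / cls.corpus_freq)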

I take the log because, even with the correct English sentence, the probabilities soon get too small for simple arithmetic, as they are multiplied together for each word in the sentence to get a 'sentence probability'. Adding the logs instead gives the same ordering without the numbers vanishing.
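
Here is a small standalone sketch of that 'sentence probability' as a sum of logs. The function name, the punctuation stripping and the word counts are all made up for the example; the real figures would come from the British National Corpus table.

```python
import math


def sentence_log_prob(sentence, word_freqs, corpus_freq):
    # Sum of per-word log probabilities; unseen words count as seen once.
    total = 0.0
    for word in sentence.lower().split():
        word = word.strip('.,;:!?"')
        count = word_freqs.get(word, 1)
        total += math.log(count / float(corpus_freq))
    return total


# Toy counts, invented for illustration (not the real corpus figures).
toy_freqs = {"the": 60000, "first": 1200, "conference": 100}
toy_corpus_size = 1000000

english = sentence_log_prob("The first conference.", toy_freqs, toy_corpus_size)
gibberish = sentence_log_prob("Esp qtcde nzyqpcpynp.", toy_freqs, toy_corpus_size)
```

The higher (less negative) score marks the more plausible sentence, so in this toy case english comes out ahead of gibberish.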

Storing and sorting the probabilities stumped me for a bit, but I eventually decided upon a list of tuples :

prob_map.append((prob, string2))


Which I sort, using the operator module, with :
sorted(prob_map, key=operator.itemgetter(0), reverse=False)


This seems clunky compared to PHP, Perl or even C, but I'm still feeling my way around Python so I'm hoping that there's a better way.
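
One less clunky option, sketched here with made-up scores: Python compares tuples element by element, so a list of (probability, text) tuples sorts on the probability without any key function, and max() picks out the best candidate directly.

```python
# Hypothetical (log probability, text) pairs, for illustration only.
prob_map = [
    (-42.7, "candidate b"),
    (-12.3, "candidate a"),
    (-57.1, "candidate c"),
]

# Tuples sort on their first element, then the second to break ties.
ranked = sorted(prob_map, reverse=True)

# If only the most probable candidate is wanted, skip the sort entirely.
best = max(prob_map)
```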

To give me the answer :
(-7308.628573183837, 'the first conference on the topic of artificial intelligence was held at dartmouth college in this year.')

1956 BTW

There we go, a first bit of Natural Language Processing, which I have signed up for this term.
