Bucket of Sparks: Data Munging with MongoDB Aggregation and Python

I am evaluating some named entity recognition systems for sorted.jobs, trying to improve our search by sorting the wheat from the chaff. Some of the initial results look encouraging, but just how encouraging? We need to do a bit of analysis to find out.

The most hopeful results come from the extraction of programming language and operating system entities from the text; see the figure below:

Entity Types

On to MongoDB

This table was generated from a MongoDB database using two collections, entities and missed_entities: entities contains the terms that the program found, and missed_entities the ones that I thought it missed. Each term it found was either right ('Hit'), wrong ('Miss') or a bit dubious ('Null'). To get the stats I used the new (to me) MongoDB aggregation operations, analogous to SQL's GROUP BY, HAVING, SUM, etc.

You could do all this in the old MongoDB map/reduce way, but aggregation seems a bit more intuitive.

So to get the 'Hit', 'Miss' and 'Null' columns the Python code looks like:

entity_c.aggregate([{"$group": {"_id": {"etype" : "$_type", "hit_miss" : "$hit_miss"} , "count": {"$sum": 1}}}])

which returns rows like:

{u'count': 55, u'_id': {u'etype': u'ProgrammingLanguage', u'hit_miss': u'1'}}

{u'count': 2, u'_id': {u'etype': u'ProgrammingLanguage'}}

and nothing for the misses because there weren't any.

The hard work occurs in the $group, where I create a compound '_id' field made up of the entity type field and the entity hit_miss field, and then count all the matching entities.
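To turn rows like these into the Hit/Miss/Null columns of the table, something along the following lines should do. It is only a sketch: the function name tabulate and the label mapping are my own assumptions. The value '1' for a hit is taken from the sample output, '0' for a miss is a guess, and a row with no hit_miss key is counted as 'Null'.

```python
from collections import defaultdict

# Assumed mapping from the stored hit_miss value to a column label.
# '1' appears in the sample output; '0' for a miss is a guess.
LABELS = {"1": "Hit", "0": "Miss"}


def tabulate(rows):
    """Pivot $group output rows into {etype: {'Hit': n, 'Miss': n, 'Null': n}}."""
    table = defaultdict(lambda: {"Hit": 0, "Miss": 0, "Null": 0})
    for row in rows:
        key = row["_id"]
        # A missing (or unrecognised) hit_miss value is counted as 'Null'.
        label = LABELS.get(key.get("hit_miss"), "Null")
        table[key["etype"]][label] += row["count"]
    return dict(table)


# The two rows shown above, written as plain Python 3 dicts.
rows = [
    {"count": 55, "_id": {"etype": "ProgrammingLanguage", "hit_miss": "1"}},
    {"count": 2, "_id": {"etype": "ProgrammingLanguage"}},
]
table = tabulate(rows)
print(table)
```

Starting every type at zero for all three labels means a column with no matching rows, like the misses here, still shows up in the table.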

The Aggregation Pipeline

But we can also look at the terms that the recogniser missed:


Missed entities

Here we only want the entities for the given type ('ProgrammingLanguage') and we want them in order, so our PyMongo aggregate call now becomes:

missed_c.aggregate([
    {"$match" : {"_type" : etype}},
    {"$group": {"_id": "$name", "count": {"$sum": 1}}},
    {"$sort" : {"count" : -1}}
])

We have two extra terms: '$match', which filters the documents so we only consider those with the passed-in type (etype), and '$sort', which orders by the count we generated in the group. MongoDB pipelines these, performing the match, then the group and then the sort, before returning the result.
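If it helps to see what the three stages are doing, here is a rough pure-Python equivalent that runs without a database. It is an illustration of the logic rather than how MongoDB implements it, and the function name missed_counts and the sample documents are made up for the example.

```python
from collections import Counter


def missed_counts(docs, etype):
    """Mimic $match -> $group -> $sort on a list of plain dicts."""
    # $match: keep only documents of the requested type.
    matched = (d for d in docs if d.get("_type") == etype)
    # $group with $sum: 1 -> count the documents per name.
    counts = Counter(d.get("name") for d in matched)
    # $sort on count, descending. Python's sort is stable, so ties keep
    # first-seen order; MongoDB makes no such promise for ties.
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [{"_id": name, "count": n} for name, n in ordered]


# Hypothetical documents, shaped like those in missed_entities.
docs = [
    {"_type": "ProgrammingLanguage", "name": "HTML5"},
    {"_type": "ProgrammingLanguage", "name": "CSS"},
    {"_type": "OperatingSystem", "name": "Linux"},
    {"_type": "ProgrammingLanguage", "name": "CSS"},
]
result = missed_counts(docs, "ProgrammingLanguage")
print(result)
```

Each line of the function corresponds to one stage, and each stage only sees what the previous one passed on.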

Finally, looking at the results we can see that there are some casing issues. We can make everything the same case by adding in a '$project':

{ "$project" : { "name":{"$toUpper":"$name"} } },

$project creates a new field (or overwrites an existing one in the pipeline, not the real one in the collection). In this instance I have told it to make all the names uppercase, and we get:


Normalised entities
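For reference, one arrangement of the full pipeline that should work is sketched below. The helper build_missed_pipeline and its normalise flag are hypothetical names of mine, not part of PyMongo. The '$project' sits after the '$match' because a '$project' that lists only name passes on just _id and name, so _type is no longer available to filter on. It sits before the '$group' so that the grouping sees the uppercased name.

```python
def build_missed_pipeline(etype, normalise=True):
    """Build the stage list for the missed-entities query (hypothetical helper)."""
    # Filter first, while the _type field is still present.
    pipeline = [{"$match": {"_type": etype}}]
    if normalise:
        # Overwrite name with its uppercase form, in the pipeline only.
        pipeline.append({"$project": {"name": {"$toUpper": "$name"}}})
    pipeline += [
        {"$group": {"_id": "$name", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
    ]
    return pipeline


pipeline = build_missed_pipeline("ProgrammingLanguage")
print(pipeline)
# With a live connection this would be passed to the collection, e.g.
# missed_c.aggregate(pipeline), where missed_c is the handle used above.
```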

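The position of each term in the array does matter: MongoDB runs the stages in the order given (its optimiser only reorders them where the result is unchanged). Here the '$project' has to come after the '$match', because it drops the _type field, and before the '$group', which reads the uppercased name.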

What does this tell us? Well, in this case, if I could persuade the tagger to recognise CSS and variants of terms it already knows with a number on the end, I would get a big jump in the overall quality of the results.

References

Angular Aggregation manual pages.


Thursday, 6 November 2014
