The most promising results come from extracting programming language and operating system entities from the text; see the figure below:
[Figure: Entity Types]
On to MongoDB
This table was generated from a MongoDB database using two collections, entities and missed_entities: entities contains the terms that the program found, and missed_entities the ones that I thought it missed. Of the ones it found, it either got it right ('Hit'), got it wrong ('Miss'), or the result was a bit dubious ('Null'). To get the stats I used the new (to me) MongoDB aggregation operations, analogous to the SQL GROUP BY, HAVING, SUM, etc. You could do all this the old MongoDB map/reduce way, but aggregation seems a bit more intuitive.
So to get the 'Hit', 'Miss' and 'Null' columns the Python code looks like:
entity_c.aggregate([{"$group": {"_id": {"etype": "$_type", "hit_miss": "$hit_miss"}, "count": {"$sum": 1}}}])
which returns me rows like:
{u'count': 55, u'_id': {u'etype': u'ProgrammingLanguage', u'hit_miss': u'1'}}
{u'count': 2, u'_id': {u'etype': u'ProgrammingLanguage'}}
and nothing for the misses because there weren't any.
The hard work occurs in the $group where I create a compound '_id' field made up of the entity type field and entity hit_miss field and then count all the matching entities.
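To make the compound '_id' idea concrete, here is a minimal pure-Python sketch of what the $group stage is doing, with no database involved. The sample documents and the function name group_by_type_and_outcome are made up for illustration; only the field names (_type, hit_miss) come from the query above.

```python
from collections import Counter

# Hypothetical sample documents: the field names follow the query above,
# the values are invented for illustration.
sample_entities = [
    {"_type": "ProgrammingLanguage", "name": "Python", "hit_miss": "1"},
    {"_type": "ProgrammingLanguage", "name": "Java", "hit_miss": "1"},
    {"_type": "ProgrammingLanguage", "name": "Go"},  # no hit_miss: a 'Null'
    {"_type": "OperatingSystem", "name": "Linux", "hit_miss": "1"},
]


def group_by_type_and_outcome(docs):
    """Count documents per (etype, hit_miss) pair, like the $group stage."""
    counts = Counter()
    for doc in docs:
        # The tuple plays the role of the compound '_id'.
        # A missing hit_miss becomes None here, whereas the MongoDB rows
        # shown above simply leave the field out of the '_id'.
        key = (doc.get("_type"), doc.get("hit_miss"))
        counts[key] += 1  # the equivalent of {"$sum": 1}
    return counts


counts = group_by_type_and_outcome(sample_entities)
```

Each key in counts corresponds to one row returned by the aggregation, and the value is that row's 'count'.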
The Aggregation Pipeline
But we can also look at the terms that the recogniser missed:
[Figure: Missed entities]
missed_c.aggregate([
{"$match": {"_type": etype}},
{"$group": {"_id": "$name", "count": {"$sum": 1}}},
{"$sort": {"count": -1}}
])
We have extra terms: '$match', which filters the documents so we only consider those with the passed-in type (etype), and '$sort', which orders by the count we generated in the group. MongoDB pipelines these, performing the match, then the group and then the sort, before returning you the result.
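The match, group, sort order can be sketched in plain Python to show what each stage contributes. This is only an emulation under stated assumptions: the helper names missed_pipeline and emulate_missed_pipeline are hypothetical, and the sample documents are invented. The list returned by missed_pipeline is the same pipeline as in the post and could be passed to missed_c.aggregate.

```python
from collections import Counter


def missed_pipeline(etype):
    """Build the three-stage pipeline from the post as plain Python data."""
    return [
        {"$match": {"_type": etype}},
        {"$group": {"_id": "$name", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
    ]


def emulate_missed_pipeline(docs, etype):
    """Pure-Python stand-in for the pipeline: match, then group, then sort."""
    # $match: keep only documents of the requested entity type.
    matched = [d for d in docs if d.get("_type") == etype]
    # $group with {"$sum": 1}: count the documents sharing each name.
    grouped = Counter(d.get("name") for d in matched)
    rows = [{"_id": name, "count": n} for name, n in grouped.items()]
    # $sort with count: -1 means highest count first.
    return sorted(rows, key=lambda r: r["count"], reverse=True)


# Hypothetical missed entities, invented for illustration.
sample_missed = [
    {"_type": "ProgrammingLanguage", "name": "CSS"},
    {"_type": "ProgrammingLanguage", "name": "CSS"},
    {"_type": "ProgrammingLanguage", "name": "css"},
    {"_type": "OperatingSystem", "name": "Windows 8"},
]

rows = emulate_missed_pipeline(sample_missed, "ProgrammingLanguage")
```

Note that 'CSS' and 'css' come back as separate rows, because the grouping key is the name exactly as stored.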
Finally, looking at the results we can see that there are some casing issues. We can make everything the same case by adding in a '$project':
{ "$project" : { "name":{"$toUpper":"$name"} } },
$project creates a new field (or overwrites an existing one in the pipeline, not the real one in the collection). In this instance I have told it to make all the names uppercase, and we get:
[Figure: Normalised entities]
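As a rough sketch of how the '$project' fits in, the pipeline below places it between '$match' and '$group', so the grouping sees the uppercased name. That placement is my assumption; the post does not show the full pipeline. The helper names and sample documents are hypothetical, and the emulation assumes every document has a name field.

```python
from collections import Counter


def normalised_missed_pipeline(etype):
    """Pipeline with the $project stage added (placement is an assumption)."""
    return [
        {"$match": {"_type": etype}},
        {"$project": {"name": {"$toUpper": "$name"}}},
        {"$group": {"_id": "$name", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
    ]


def emulate_normalised_pipeline(docs, etype):
    """Pure-Python stand-in: match, uppercase the name, group, then sort."""
    # $match followed by $project with $toUpper on the name field.
    names = [d["name"].upper() for d in docs if d.get("_type") == etype]
    # $group with {"$sum": 1}, now keyed on the uppercased name.
    grouped = Counter(names)
    rows = [{"_id": name, "count": n} for name, n in grouped.items()]
    # $sort with count: -1 means highest count first.
    return sorted(rows, key=lambda r: r["count"], reverse=True)


# Hypothetical missed entities, invented for illustration.
sample_missed_mixed_case = [
    {"_type": "ProgrammingLanguage", "name": "CSS"},
    {"_type": "ProgrammingLanguage", "name": "css"},
    {"_type": "ProgrammingLanguage", "name": "Css"},
    {"_type": "ProgrammingLanguage", "name": "HTML5"},
]

normalised_rows = emulate_normalised_pipeline(
    sample_missed_mixed_case, "ProgrammingLanguage"
)
```

With the casing variants folded together, the three spellings of CSS collapse into a single row, which is what makes the most frequently missed terms stand out.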
What does this tell us? Well, in this case, if I could persuade the tagger to recognise CSS, and variants of terms it already knows with a number on the end, I would get a big jump in the overall quality of the results.