Tuesday, 2 April 2024

GIGO - Check your data


A pothole?

One of the most important lessons I learnt in computing was Garbage In = Garbage Out, the GIGO law. When I got some less-than-perfect results from my pothole detector, I took a look at the training data.

The data had been taken from a couple of DuckDuckGo searches. Being lazy, I had reused the search terms from a notebook for finding birds in trees and just changed 'bird' to 'pothole' and 'tree' to 'road surface'. I displayed the first few images from each search and they looked reasonable, but then I took a closer look.



There were quite a few images of sunspots in the pothole search results, which was weird. I looked at my search strings and there were three variations: 'pothole', 'pothole in the sun' and 'pothole in the shade'. The last two had carried over from the bird search and I had left them in; what harm could they do? In this case they seemed to do quite a lot, finding 'holes in the sun' and also quite a few pictures of awnings (sun shades, maybe?).
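A minimal sketch of how the three search strings come about, and of printing them for a sanity check before anything is downloaded. The helper name `build_queries` is hypothetical and is not taken from the original notebook; it only illustrates the subject-plus-modifier pattern described above.

```python
def build_queries(subject, modifiers=()):
    """Return the bare subject followed by one query per modifier."""
    queries = [subject]
    queries += [f"{subject} {modifier}" for modifier in modifiers]
    return queries


# Modifiers that made sense for birds but were left in for potholes.
carried_over = build_queries("pothole", ["in the sun", "in the shade"])

# Print every query so a leftover modifier is visible before searching.
for query in carried_over:
    print(query)

# Dropping the modifiers leaves only the bare search term.
safer = build_queries("pothole")
```

Reading the printed list is a cheap check: 'pothole in the sun' looks odd as soon as it is seen on its own line.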


Then there were some images that had come up 'randomly' in the search, like the one above, presumably mislabelled, or maybe on a page about potholes. Of the 140 files downloaded only around 40 were usable.
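Around 40 usable files out of 140 is roughly 29%, so most of the download had to go. One possible way to do the weeding, sketched here with the standard library only, is to note the file names rejected during a manual look through the folder and move them aside rather than delete them. The function name `quarantine_rejects` and the folder layout are assumptions for illustration, not what was actually used.

```python
import shutil
from pathlib import Path


def quarantine_rejects(image_dir, reject_names, reject_dir):
    """Move hand-rejected files out of image_dir.

    Returns a (kept, moved) pair of counts.
    """
    image_dir = Path(image_dir)
    reject_dir = Path(reject_dir)
    reject_dir.mkdir(parents=True, exist_ok=True)
    rejects = set(reject_names)

    kept = moved = 0
    # sorted() takes a snapshot of the listing before any file is moved.
    for path in sorted(image_dir.iterdir()):
        if not path.is_file():
            continue  # skip sub-folders, including a nested reject folder
        if path.name in rejects:
            shutil.move(str(path), str(reject_dir / path.name))
            moved += 1
        else:
            kept += 1
    return kept, moved
```

Moving instead of deleting keeps the rejected images around in case a second look shows that some were usable after all.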

The road surface query produced much better results, inasmuch as all the photos were of road surfaces. The issue here was that many of them were variations on a theme.


Driving off into the sunset

This image has many strong features (the white lines, the grass on either side of the road and the skyline) that the algorithm might learn to associate with 'road surface', whereas I just want it to learn about the asphalt, or the lack of it. One image like this would be fine, but I felt that 25% or more of the set was too many.

To enhance the dataset, and perhaps tailor it to the UK country roads that I cycle around, I got on my bike and took more pictures of potholes and the road surface around them. I combined these with a selection of the downloaded images to get a better training set with around 30 images in each category.
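Combining the two sources could be sketched as below: take the hand-taken photos first, then top up each category with a random selection of the downloaded images until it reaches about 30. This is an illustration under assumptions, not the exact procedure used; the function name `build_category`, the folder arguments and the fixed seed are all hypothetical.

```python
import random
import shutil
from pathlib import Path


def build_category(own_dir, downloaded_dir, out_dir, target=30, seed=0):
    """Fill out_dir with up to `target` images for one category.

    Own photos are taken first (capped at `target`), then a random
    sample of downloaded images makes up the difference.
    Returns the number of files copied.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    own = sorted(p for p in Path(own_dir).iterdir() if p.is_file())
    downloaded = sorted(p for p in Path(downloaded_dir).iterdir() if p.is_file())

    chosen = own[:target]
    needed = target - len(chosen)
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    chosen += rng.sample(downloaded, min(needed, len(downloaded)))

    for index, src in enumerate(chosen):
        # Prefix with an index so identical file names cannot collide.
        shutil.copy2(src, out_dir / f"{index:03d}_{src.name}")
    return len(chosen)
```

Calling it once for 'pothole' and once for 'road surface' with the same `target` keeps the two categories the same size. A random sample does nothing about near-duplicate scenes, so the downloaded folder still needs a manual pass first.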

I reran the model training and got marginally better performance figures and a better fit with my validation images.

Further exploration in the next post.


