18k Videos over 10 Years weren’t Assigned to an Asset
Our customer has been collecting visual data using drones and helicopters for over a decade. Unfortunately, much of it isn't mapped to the asset it depicts; the only way to find out was to watch the footage. There is no EXIF data containing GPS coordinates either. So what can we do? How do we automatically assign each survey video to the asset it belongs to?
Luckily, most videos carry a caption overlay showing the GPS coordinates of the aircraft at the time of recording. We used a simple convolutional neural network to run Optical Character Recognition (OCR), "reading" each digit of the overlay, which we then convert into a GPS coordinate. Armed with this information, we map each video to the tower closest to the aircraft. The approach is not entirely foolproof: near substations, for example, the helicopter can be closer to a tower other than the one it is filming. Even so, it has worked really well: of the more than 18k videos, we were able to map 14k.
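The nearest-tower lookup can be sketched as a great-circle distance search. This is a minimal illustration, not our production code; the tower list and IDs are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two GPS points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def nearest_tower(lat, lon, towers):
    """Return (tower_id, distance_km) of the tower closest to the aircraft."""
    return min(
        ((tid, haversine_km(lat, lon, tlat, tlon)) for tid, tlat, tlon in towers),
        key=lambda t: t[1],
    )

# Illustrative tower register
towers = [("T1", 51.50, -0.10), ("T2", 52.00, 0.50)]
print(nearest_tower(51.51, -0.09, towers)[0])  # the aircraft is near T1
```

As noted above, "closest tower" is a heuristic: near substations the closest tower may not be the one being filmed.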
Each GPS coordinate consists of a latitude and a longitude, 14 digits in total. Two of these never change, leaving 12 to be recognised. We are aiming for 90% accuracy on the entire GPS string, which implies that the accuracy on each individual digit has to be about 99%. No publicly available model reached this level of accuracy on our data set, so we trained our own model using synthetic data.
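The per-digit requirement follows directly from compounding: for the whole string to be right 90% of the time, each of the 12 independent digit reads must succeed with probability at least the 12th root of 0.90.

```python
# Required per-digit accuracy p so that 12 independent digit reads
# are all correct at least 90% of the time: p**12 >= 0.90.
target_string_accuracy = 0.90
digits = 12
per_digit = target_string_accuracy ** (1 / digits)
print(f"{per_digit:.4f}")  # ~0.9913, i.e. roughly 99% per digit
```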
Synthetic Data for OCR Training
The size of the training set for an OCR model plays a fundamental role in its accuracy. Since collecting and labelling real data is a long and tedious process, synthetic data has emerged as a solution.
What Did We Do?
Using the camera maker's font files, our Data Scientist & Software Engineer, Tom, produced synthetic data to train the model, adding various backgrounds and noise for the model to learn to cope with. He then evaluated the model's performance on a test set made of real data, which the model had never seen before.
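A synthetic digit generator along these lines can be sketched with Pillow. This is an assumption-laden illustration: the real pipeline used the camera maker's font (loaded via something like `ImageFont.truetype`), whereas here we fall back to Pillow's default font so the sketch is self-contained; sizes, positions, and noise levels are made up.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def synth_digit(digit, size=(32, 48), font=None):
    """Render one digit on a noisy grey background.

    In production, `font` would be the camera maker's overlay font;
    the default font here is only a stand-in.
    """
    font = font or ImageFont.load_default()
    bg = random.randint(40, 120)          # vary background brightness
    img = Image.new("L", size, color=bg)  # greyscale image
    draw = ImageDraw.Draw(img)
    draw.text((8, 12), str(digit), fill=255, font=font)
    # Sprinkle salt-and-pepper noise so the model can't rely on a clean background.
    px = img.load()
    for _ in range(int(size[0] * size[1] * 0.05)):
        x, y = random.randrange(size[0]), random.randrange(size[1])
        px[x, y] = random.choice((0, 255))
    return img

# 100 noisy samples per digit class
dataset = [(d, synth_digit(d)) for d in range(10) for _ in range(100)]
```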
Tailoring The Model
A script crops out the area of the video containing the 10 longitude and latitude digits and feeds them into the OCR model, which analyses them individually. At 96% accuracy per digit, the model's chance of identifying all 10 digits correctly drops to 66%. Every digit must be correctly identified to map the video to the right tower, so every improvement in per-digit accuracy has an outsized effect on the model's usefulness.
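The compounding effect is easy to verify: raising the per-digit accuracy to the power of the number of digits gives the full-string accuracy, which is why a seemingly small per-digit gain matters so much.

```python
# Full-string accuracy is the per-digit accuracy compounded over 10 digits.
digits = 10
for per_digit in (0.96, 0.98):
    string_accuracy = per_digit ** digits
    print(f"{per_digit:.0%} per digit -> {string_accuracy:.0%} for the full string")
```

With 96% per digit the string is right about 66% of the time; at 98% per digit that rises to about 82%, matching the figures quoted here.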
Tom then tailored the model to the customer's footage format. The real-life data revealed that certain longitude and latitude digits appear 2-3 times more often than others. He therefore focused on improving the model's accuracy on these digits, based on their frequency of appearance. This, combined with biasing the model towards these digits, increased the accuracy.
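One way to bias training towards the most common digits is to sample synthetic examples in proportion to how often each digit appears in the real overlays. The counts below are purely illustrative, not the customer's actual frequencies.

```python
from collections import Counter

# Hypothetical digit counts observed in real overlay footage (illustrative only):
observed = Counter({0: 900, 1: 2400, 2: 800, 3: 950, 4: 2200,
                    5: 2600, 6: 1000, 7: 850, 8: 780, 9: 920})

total = sum(observed.values())
# Sample synthetic training digits in proportion to their real-world
# frequency, so the common digits dominate the training set.
sampling_weights = {d: n / total for d, n in observed.items()}
print(sampling_weights)
```

The same frequencies could instead be used as class weights in the loss function; either way, the model spends more of its capacity on the digits that matter most.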
By using synthetic data to train and refine its character recognition, the model's per-digit accuracy increased to 98%. With the model now identifying the full longitude and latitude 82% of the time, the likelihood of a video being mapped to the correct tower is greatly increased.
What Were The Challenges?
Comparing the synthetic data to the real data made Tom aware of differences in noise and digit placement. He therefore increased the noise in the synthetic backgrounds and slightly varied the digit placement. Training the model on this data, he achieved a higher success rate.
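The placement variation can be sketched as a small random jitter applied to the digit's draw position during synthetic rendering. The shift range here is a made-up assumption, not the value actually used.

```python
import random

def jitter_placement(x, y, max_shift=2):
    """Randomly shift a digit's draw position by up to +/- max_shift pixels,
    mimicking the slight placement variation seen in real overlay footage.

    max_shift=2 is an illustrative choice, not the production value."""
    return (x + random.randint(-max_shift, max_shift),
            y + random.randint(-max_shift, max_shift))

# Each synthetic sample gets its own slightly offset position.
positions = [jitter_placement(8, 12) for _ in range(5)]
print(positions)
```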