18k Videos over 10 Years weren’t Assigned to an Asset
Our customer has been collecting visual data using drones and helicopters for over a decade. Unfortunately, much of it isn't mapped to the asset it depicts; the only way to find out what a video shows is to watch it. There is no EXIF data containing GPS coordinates either. So what can we do? How do we automatically assign each survey to the asset it belongs to?

Luckily, most videos carry a caption overlay showing the GPS coordinates of the aircraft at the time of recording. We used a simple convolutional neural network to perform Optical Character Recognition (OCR): it "reads" each digit, and we then assemble the digits into a GPS coordinate. Armed with this information, we map each video to the tower closest to the aircraft.
Each GPS coordinate is made up of a latitude and a longitude. Together these contain 14 digits. Of these, 2 never change, leaving 12 to be recognised. We were aiming for 90% accuracy on the entire GPS string, which implies that the accuracy on each individual digit has to be roughly 99%. No publicly available model reached this level of accuracy on our data set, so we trained our own model using synthetic data.
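To make the arithmetic concrete, here is a quick back-of-the-envelope check (assuming digit errors are independent):

```python
# If digit errors are independent, string accuracy is
# (per-digit accuracy) ** (number of digits).
target_string_acc = 0.90
n_digits = 12

# Per-digit accuracy needed to hit the target on the full string.
required_per_digit = target_string_acc ** (1 / n_digits)
print(f"required per-digit accuracy: {required_per_digit:.4f}")  # ~0.9913

# Sanity check: 99% per digit over 12 digits gives roughly 89%.
print(f"string accuracy at 99% per digit: {0.99 ** n_digits:.4f}")  # ~0.8864
```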
Using synthetic data to bootstrap an OCR model
The size and quality of the data used to train an OCR model is often more important than the type of model or the choice of architecture. In the beginning we didn't have a great deal of time to create data manually by extracting crops and transcribing the coordinates they contained. What we could do was generate synthetic data, use it to train a model, and then use that model to bootstrap the creation of correctly labelled real data.
Using the camera maker's font files, the team produced synthetic training data by rendering "fake" GPS coordinates onto a variety of backgrounds. The model was then trained on this data and its performance evaluated against a test set of manually labelled real images.
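A minimal sketch of what that generation step can look like, using Pillow; the font path, crop size, and background sourcing here are illustrative assumptions rather than the team's actual code:

```python
import random
from pathlib import Path

from PIL import Image, ImageDraw, ImageFont

# Hypothetical inputs: the camera maker's font file and a folder of
# background crops taken from real footage.
FONT_PATH = "camera_overlay.ttf"
BACKGROUND_DIR = Path("backgrounds")

def random_coordinate_string() -> str:
    """Generate a fake latitude/longitude overlay string."""
    lat = random.uniform(-90, 90)
    lon = random.uniform(-180, 180)
    return f"{lat:+08.4f} {lon:+09.4f}"

def render_sample(text: str, size=(320, 32)) -> Image.Image:
    """Render a fake GPS overlay onto a randomly chosen background."""
    backgrounds = list(BACKGROUND_DIR.glob("*.png"))
    background = Image.open(random.choice(backgrounds)).convert("RGB").resize(size)
    draw = ImageDraw.Draw(background)
    font = ImageFont.truetype(FONT_PATH, size=24)
    # Slightly jitter the text position so the model does not
    # overfit to a single pixel-perfect placement.
    offset = (random.randint(0, 6), random.randint(0, 4))
    draw.text(offset, text, font=font, fill=(255, 255, 255))
    return background

Path("synthetic").mkdir(exist_ok=True)
label = random_coordinate_string()
render_sample(label).save(f"synthetic/{label.replace(' ', '_')}.png")
```

Because the label is known at generation time, every image comes with a perfect transcription for free, which is the whole appeal of the approach.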

Incorporating background knowledge helps improve accuracy
A script crops out the area of the video containing the 10 longitude and latitude digits and feeds them into the OCR model, which classifies each digit individually. With a per-digit accuracy of 96%, the probability of reading all 10 digits correctly drops to 0.96^10, or about 66%. Every digit needs to be correctly identified to map the video to the right tower, so every improvement in per-digit accuracy has an outsized effect on the end result.
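The cropping step might look roughly like the OpenCV sketch below; the overlay position and digit spacing are placeholder values, since the real geometry depends on the footage format:

```python
import cv2

# Placeholder geometry: where the GPS overlay sits in the frame and
# how wide each digit cell is. Real values depend on the footage format.
OVERLAY_X, OVERLAY_Y = 20, 680
DIGIT_WIDTH, DIGIT_HEIGHT = 18, 28
NUM_DIGITS = 10

def extract_digit_crops(frame):
    """Slice the GPS overlay region into one crop per digit."""
    crops = []
    for i in range(NUM_DIGITS):
        x = OVERLAY_X + i * DIGIT_WIDTH
        crop = frame[OVERLAY_Y:OVERLAY_Y + DIGIT_HEIGHT, x:x + DIGIT_WIDTH]
        crops.append(crop)
    return crops

capture = cv2.VideoCapture("survey.mp4")
ok, frame = capture.read()
if ok:
    digit_crops = extract_digit_crops(frame)
    # Each crop is then classified individually by the OCR model.
capture.release()
```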
Tom worked on tailoring the model to the customer's footage format. The real-life data made it apparent that certain longitude and latitude digits are 2-3 times more common than others. He therefore worked on improving the model's accuracy on these digits, and biased the model towards them in proportion to how often they appear. Together, these changes increased the accuracy.
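One way to fold in this kind of frequency knowledge is to treat the observed digit frequencies as a prior and reweight the model's softmax output, Bayes-style. The sketch below is our illustrative reconstruction, not necessarily the exact adjustment used:

```python
import numpy as np

# Hypothetical digit frequencies measured from real footage: some
# digits appear far more often than others at a given position.
digit_priors = np.array([0.02, 0.25, 0.08, 0.05, 0.12,
                         0.20, 0.09, 0.06, 0.07, 0.06])

def apply_prior(softmax_probs: np.ndarray) -> np.ndarray:
    """Reweight the model's per-digit probabilities by the observed
    digit frequencies (Bayes' rule with a non-uniform prior),
    then renormalise."""
    weighted = softmax_probs * digit_priors
    return weighted / weighted.sum()

# Example: the model is torn between '1' and '7', but '1' is far
# more common in this position, so the prior tips the balance.
model_output = np.array([0.01, 0.40, 0.02, 0.02, 0.02,
                         0.02, 0.02, 0.43, 0.03, 0.03])
print(apply_prior(model_output).argmax())  # -> 1
```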
Through training on synthetic data and the adjustments above, the model's per-digit accuracy increased to 98%. Since 0.98^10 ≈ 82%, the model now reads the full longitude and latitude correctly 82% of the time, which in turn increases the likelihood of a video being mapped to the correct tower.
Identifying and closing the gaps between real and synthetic data
At the risk of stating a truism: synthetic data is not the same as real data. It differs in ways that may not be obvious and/or are difficult to replicate. Models trained on synthetic data exhibiting this "real-to-synthetic variance" perform worse when exposed to real data, so one of the roles of a data scientist is to recognise these gaps and close them during the generation process. Comparing our synthetic data to the real data, we noticed differences in background noise and digit placement. We therefore increased the noise of the synthetic backgrounds and slightly varied the digit placement. Training the model on this data yielded higher accuracy.
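Concretely, the two tweaks amount to adding background noise and positional jitter during generation. A minimal numpy sketch, with parameter values that are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng()

def add_background_noise(image: np.ndarray, sigma: float = 12.0) -> np.ndarray:
    """Add Gaussian noise so synthetic backgrounds match the grain
    of real footage (sigma chosen by comparing against real crops)."""
    noisy = image.astype(np.float32) + rng.normal(0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def jitter_placement(image: np.ndarray, max_shift: int = 3) -> np.ndarray:
    """Shift the crop by a few pixels to mimic the slight variation
    in digit placement seen in the real data."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(image, shift=(dy, dx), axis=(0, 1))
```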
Results and future work
The team trained a model able to reliably "read" the GPS coordinates from a video; we then used those coordinates to determine the closest tower and attach the video to that asset. The approach is not entirely foolproof (near substations, for example, the helicopter can be closer to a different tower than the one it is filming), but it has worked very well: of the more than 18k videos, we were able to map 14k.
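For completeness, the nearest-tower lookup is essentially a great-circle distance search over the asset database. A sketch assuming a simple list of tower records (the field names are ours):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))

def closest_tower(lat, lon, towers):
    """Return the tower nearest to the aircraft's decoded position.

    `towers` is assumed to be a list of dicts with 'id', 'lat', 'lon'.
    """
    return min(towers, key=lambda t: haversine_km(lat, lon, t["lat"], t["lon"]))
```

The substation caveat above is exactly the failure mode of a pure nearest-neighbour rule: the closest tower is not always the one being filmed.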