In 2017, researchers at Syrian human rights group Mnemonic faced a daunting task. They had more than 350,000 hours of video containing evidence of war crimes, ranging from chemical attacks to the use of banned munitions, but could never comb through it all manually.
In particular, Mnemonic wanted to use AI to search the videos in the Syrian Archive, a repository of social media records of the war, for evidence that a specific “cluster” weapon called RBK-250 — a metal shell containing several hundred small explosives — had been used on civilians. RBK-250 shells also often remain unexploded and can be dangerous for decades after the end of a conflict.
But an AI program would require thousands of images of the RBK-250 to train it to recognise the weapon, from every angle and in any situation, whether it was partially destroyed or covered in rubble. And such images do not exist.
So the group turned to Adam Harvey, a computer scientist and artist in Berlin, to try a technique that is becoming increasingly widespread as the use of AI advances: using synthetic data instead of real images.
Harvey and the researchers spent two years creating 10,000 computer-simulated images of RBK-250s and used these to train the AI program. In a three-day trial in November, software trained on the synthetic images detected the use of RBK-250s more than 200 times in a cache of more than 100,000 videos, with 99 per cent accuracy.
Most of these videos had never been reviewed by humans. “It provides evidence of illegal use of munitions that were used in the Syrian conflict,” said Harvey, who is making his tools available on VFRAME, an open-source resource to help in human rights investigations and casework. “The more we find, the more precise the legal argument becomes that this is a large-scale human rights violation, a war crime.”
Synthetic data is becoming an increasingly attractive alternative to “big data”, the enormous real-world troves of input that are required to teach AI models how to perceive or understand information.
While real-world data needs to be labelled and annotated in detail by human beings, synthetic data comes with auto-generated labels and can be scaled up quickly. The innovation is particularly useful for smaller companies, which need hundreds of thousands of images to train their AI and often cannot afford to pay humans between $7 and $14 to label each one.
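The size of that labelling bill is simple arithmetic. The Python sketch below multiplies the per-image price range quoted above by a data set size. The figure of 100,000 images is an assumption chosen for illustration, at the low end of "hundreds of thousands", and the function name is hypothetical.

```python
def labelling_cost_range(num_images, low_per_image=7, high_per_image=14):
    """Return (low, high) total cost in dollars of hand-labelling num_images."""
    return num_images * low_per_image, num_images * high_per_image


# Assumed data set size, for illustration only.
low, high = labelling_cost_range(100_000)
print(f"${low:,} to ${high:,}")  # prints "$700,000 to $1,400,000"
```

On those assumptions, hand-labelling alone would cost between $700,000 and $1.4m before any training begins.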
“It’s a clever way of addressing the problem of not having sufficient training data, particularly with war crimes where this is a challenge,” said Alexa Koenig, executive director of the Human Rights Centre at the University of California, Berkeley.
“The primary limitation is the speed and scale of footage of visual atrocities makes it impossible for us to manually comb through and find signals in the noise,” she said. “It will be an enormous asset to make the data sets manageable for human review.”
Machine-learning algorithms are given several views of the same image when a model is trained to ‘see’ it. For instance, a ‘normal’ map encodes the orientation of each surface, a ‘depth’ map gauges distance, and a segmentation map marks out the discrete items comprising a scene or object, allowing individual components to be tagged automatically.
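To see how a segmentation map can yield labels with no human involved, consider the minimal Python sketch below. It assumes the renderer outputs a grid in which every pixel carries an integer class id, and it derives one bounding box per class from that grid. The function name, the class ids and the toy grid are all hypothetical; this is not the VFRAME code.

```python
def boxes_from_segmentation(seg_map, background=0):
    """Derive one bounding box per class id from a 2D segmentation map.

    seg_map is a list of rows; each cell holds an integer class id.
    Returns {class_id: (min_x, min_y, max_x, max_y)} in pixel coordinates.
    """
    boxes = {}
    for y, row in enumerate(seg_map):
        for x, class_id in enumerate(row):
            if class_id == background:
                continue
            if class_id not in boxes:
                # First pixel seen for this class starts its box.
                boxes[class_id] = (x, y, x, y)
            else:
                # Grow the box to include the new pixel.
                min_x, min_y, max_x, max_y = boxes[class_id]
                boxes[class_id] = (
                    min(min_x, x), min(min_y, y),
                    max(max_x, x), max(max_y, y),
                )
    return boxes


# Toy 4x6 render: 1 = munition, 2 = rubble, 0 = background (assumed ids).
seg_map = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
]
labels = boxes_from_segmentation(seg_map)
print(labels)  # prints {1: (1, 1, 2, 2), 2: (4, 1, 5, 2)}
```

Because the renderer knows which object produced each pixel, the boxes are exact by construction, which is the "pixel-accurate" quality that makes synthetic labels attractive.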
In recent years, large tech companies including Nvidia, Tesla, Apple, Google, Facebook and Amazon have developed their own commercial synthetic data sets for uses ranging from autonomous driving to smart speakers and medical diagnoses.
Maximilian Denninger, a computer vision scientist in the robotics institute at German space agency DLR, said even companies like Apple, which have plenty of consumer data, are employing synthetic data “because it’s so good”. He claimed Apple’s Hypersim synthetic data set has “perfect annotation, you can have pixel-accurate labels, and the best part is you can easily generate more data, which isn’t the same for real data”.
In September, Amazon researchers showed how synthetic data could be used to teach Alexa to recognise the names of various medications, a data set that is hard to come by. Janet Slifka, director of research science in Alexa’s AI Natural Understanding group, wrote that the synthetic data engines could “generate thousands of new, similar sentences” from analysing just a “handful” of key commands.
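The article does not describe how Amazon's engines work, so the Python sketch below swaps in a much cruder technique, template and slot substitution, purely to illustrate how a handful of seed commands can be expanded into many sentences that arrive already labelled. The templates, slot names and medication names are placeholder assumptions.

```python
from itertools import product


def expand_templates(templates, slots):
    """Fill every template with every combination of slot values.

    Returns a list of (sentence, labels) pairs, where labels records which
    slot values were used, so each sentence comes with its own annotation.
    """
    examples = []
    for template in templates:
        # Only vary the slots that actually appear in this template.
        names = [name for name in slots if "{" + name + "}" in template]
        for values in product(*(slots[name] for name in names)):
            labels = dict(zip(names, values))
            examples.append((template.format(**labels), labels))
    return examples


# Hypothetical seed commands and slot values.
templates = [
    "remind me to take {medication} at {time}",
    "refill my {medication} prescription",
]
slots = {
    "medication": ["ibuprofen", "aspirin", "paracetamol"],
    "time": ["8am", "noon"],
}
examples = expand_templates(templates, slots)
print(len(examples))  # prints 9 (3 x 2 for the first template, 3 for the second)
print(examples[0][0])  # prints "remind me to take ibuprofen at 8am"
```

Two templates and five slot values already give nine labelled sentences; adding one more medication name adds three more, which is why such data sets scale so cheaply.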
Meanwhile Tesla said in August that it had built more than 2,000 miles of synthetic road footage, almost the length of the roadway from the east to the west coasts of the US, to help train its Autopilot self-driving software. Tesla’s current cars already run on AI networks that have been trained on 371m synthetic images, which will be multiplied over the coming months, according to Ashok Elluswamy, Tesla’s director of Autopilot software.
The company also uses a combination of real and synthetic data to tackle instances where Autopilot fails in simulations. By combining the real video clip of the faulty incident with a synthetic reconstruction of the scene, Elluswamy’s team can simulate multiple scenarios to repeatedly test Autopilot on the same stretch of road, and fix the original fault.
But synthetic data is not a perfect representation of reality. The challenge for developers is to close what they call the “synth-real gap”. “This gap is always there. If you’ve ever played a video game, you can tell it’s not real. So we create methods to minimise this gap, so they still work in the real world,” Denninger said.
Despite this disadvantage, the convenience and affordability of synthetic data are hard to ignore. “The effort to create proper real data sets is so big that only huge companies where money is not an issue, where they can just hire 10,000 people to label their data, can do that,” Denninger said.
Researchers believe the most innovative advantage conferred by synthetic data will be its ability to solve problems that remain intractable, such as the challenge posed by the Syrian Archive. As Jeff Deutch, a researcher at the Syrian Archive who has worked on the VFRAME project alongside Harvey, said: “It’s very exciting because now we are at a point where teams like ours that are very small can be empowered to use the same tools that multinational corporations are using for very different aims.”