Top datasets used to train AI models and benchmark how the technology has progressed over time are riddled with labeling errors, a study shows.
Data is a vital resource in teaching machines how to complete specific tasks, whether that’s identifying different species of plants or automatically generating captions. Most neural networks are spoon-fed lots and lots of annotated samples before they can learn common patterns in data.
But these labels aren’t always correct; training machines using error-prone datasets can decrease their performance or accuracy. In the aforementioned study, led by MIT, analysts combed through ten popular datasets that have been cited more than 100,000 times in academic papers and found that on average 3.4 per cent of the samples are wrongly labeled.
The datasets they looked at range from photographs in ImageNet and sounds in AudioSet to reviews scraped from Amazon and sketches in QuickDraw. Examples of the mistakes compiled by the researchers show that some are clear blunders, such as a drawing of a light bulb tagged as a crocodile. Others are less obvious: should a picture of a bucket of baseballs be labeled as ‘baseballs’ or ‘bucket’?
Annotating each sample is laborious work. It is often outsourced to services like Amazon Mechanical Turk, where workers are paid the square root of sod all to sift through the data piece by piece, labeling images and audio to feed into AI systems. This process amplifies biases and errors, as Vice has documented.
Workers are pressured to agree with the status quo if they want to get paid: if a lot of them label a bucket of baseballs as a ‘bucket’, and you decide it’s ‘baseballs’, you may not be paid at all if the platform figures you’re wrong or deliberately trying to mess up the labeling. That means workers will choose the most popular label to avoid looking like they’ve made a mistake. It’s in their interest to stick to the narrative and avoid sticking out like a sore thumb. That means errors, or worse, racial biases and suchlike, snowball in these datasets.
The error rates vary across the datasets. In ImageNet, the most popular dataset used to train models for object recognition, the rate creeps up to six per cent. Considering it contains about 15 million photos, that means hundreds of thousands of labels are wrong. Some classes of images are more affected than others: ‘chameleon’, for example, is often mistaken for ‘green lizard’ and vice versa.
There are other knock-on effects: neural nets may learn to incorrectly associate features within data with certain labels. If, say, many images of the sea seem to contain boats and they keep getting tagged as ‘sea’, a machine might get confused and be more likely to incorrectly recognize boats as seas.
Problems don’t just arise when trying to compare the performance of models using these noisy datasets. The risks are higher if these systems are deployed in the real world, Curtis Northcutt, co-lead author of the study, a PhD student at MIT, and cofounder and CTO of ChipBrain, a machine-learning hardware startup, explained to The Register.
“Imagine a self-driving car that uses an AI model to make steering decisions at intersections,” he said. “What would happen if a self-driving car is trained on a dataset with frequent label errors that mislabel a three-way intersection as a four-way intersection? The answer: it might learn to drive off the road when it encounters three-way intersections.
“Maybe one of your AI self-driving models is actually more robust to training noise, so that it doesn’t drive off the road as much. You’ll never know this if your test set is too noisy because your test set labels won’t match reality. This means you can’t properly gauge which of your auto-pilot AI models drives best – at least not until you deploy the car out in the real-world, where it might drive off the road.”
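Northcutt's point about test sets can be shown with a toy example. The sketch below is purely illustrative: the ten intersections, their labels, and the two models' predictions are all made up, and `accuracy`, `model_a`, and `model_b` are hypothetical names rather than anything from the study. It shows how scoring against mislabeled test data can rank the worse model first.

```python
# Toy illustration: a noisy test set can rank two models in the wrong order.
# Every label and prediction here is invented for the sake of the example.

def accuracy(predictions, labels):
    """Fraction of predictions that match the given labels."""
    matches = sum(p == y for p, y in zip(predictions, labels))
    return matches / len(labels)

# What the ten intersections really are: 3 = three-way, 4 = four-way.
true_labels = [3, 3, 3, 3, 3, 4, 4, 4, 4, 4]

# The test set as annotated: three three-way intersections are tagged four-way.
noisy_labels = [3, 3, 4, 4, 4, 4, 4, 4, 4, 4]

# Hypothetical model A shrugged off the training noise; it makes one real mistake.
model_a = [3, 3, 3, 3, 3, 4, 4, 4, 4, 3]
# Hypothetical model B learned the annotators' mistake and reproduces it exactly.
model_b = [3, 3, 4, 4, 4, 4, 4, 4, 4, 4]

for name, preds in (("A", model_a), ("B", model_b)):
    print(
        f"model {name}: "
        f"{accuracy(preds, noisy_labels):.0%} on the noisy test set, "
        f"{accuracy(preds, true_labels):.0%} against reality"
    )
```

On the noisy test set model B looks perfect and model A looks poor; measured against the real intersections, the order flips.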
When the team working on the study trained some convolutional neural networks on portions of ImageNet that have been cleared of errors, their performance improved. The boffins believe that developers should think twice about training large models on datasets that have high error rates, and advise them to sort through the samples first. Cleanlab, the software the team developed and used to identify incorrect and inconsistent labels, can be found on GitHub.
“Cleanlab is an open-source python package for machine learning with noisy labels,” said Northcutt. “Cleanlab works by implementing all of the theory and algorithms in the sub-field of machine learning called confident learning, invented at MIT. I built cleanlab to allow other researchers to use confident learning – usually with just a few lines of code – but more importantly, to advance the progress of science in machine learning with noisy labels and to provide a framework for new researchers to get started easily.”
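For a rough feel of how confident learning flags suspect labels, here is a heavily simplified sketch of its thresholding idea. It is not cleanlab's code or API: the function `find_suspect_labels` and the sample data are hypothetical, and the sketch assumes every class has at least one labeled example and that the predicted probabilities come from a model that did not train on the samples it is scoring.

```python
# Simplified sketch of the thresholding idea behind confident learning.
# Not cleanlab's implementation; names and data are made up for illustration.

def find_suspect_labels(given_labels, pred_probs):
    """Return indices of samples whose given label looks wrong.

    given_labels: list of integer class labels, one per sample.
    pred_probs:   list of per-class probability lists, one per sample.
    """
    num_classes = len(pred_probs[0])

    # Per-class threshold: the model's average confidence in class j,
    # taken over the samples that annotators labeled as class j.
    thresholds = []
    for j in range(num_classes):
        probs_j = [p[j] for p, y in zip(pred_probs, given_labels) if y == j]
        thresholds.append(sum(probs_j) / len(probs_j))

    suspects = []
    for i, (probs, y) in enumerate(zip(pred_probs, given_labels)):
        # Classes the model is at least as sure about as it usually is.
        confident = [j for j in range(num_classes) if probs[j] >= thresholds[j]]
        if not confident:
            continue  # the model is not confident about anything; leave it alone
        likely = max(confident, key=lambda j: probs[j])
        if likely != y:
            suspects.append(i)
    return suspects


# Classes: 0 = 'bucket', 1 = 'baseballs'.
given_labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],  # tagged 'bucket', but the model is sure it is 'baseballs'
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]

suspects = find_suspect_labels(given_labels, pred_probs)
print(suspects)  # prints [2]
```

Using a per-class threshold rather than a fixed cut-off means a class the model is habitually unsure about is not penalized for it. The flagged samples are candidates for a human to review, not proven errors.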
And be aware that if a dataset’s labels are particularly shoddy, training large complex neural networks may not always be so advantageous. Larger models tend to overfit to data more than smaller ones.
“Sometimes using smaller models will work for very noisy datasets. However, instead of always defaulting to using smaller models for very noisy datasets, I think the main takeaway is that machine learning engineers should clean and correct their test sets before they benchmark their models,” Northcutt concluded. ®