Abstract

This paper presents a method for identifying label errors in natural language processing (NLP) datasets using the T5 model. T5 is a large-scale, multi-task language model that has achieved state-of-the-art performance on a variety of natural language understanding tasks. We used T5 to analyze a dataset of labeled NLP examples and identified instances where the model predicted a different label than the one provided in the dataset. We found that, after finetuning on the dataset, the T5 model was able to accurately identify label errors, demonstrating the potential of large-scale language models for improving the quality of NLP datasets.
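The core procedure the abstract describes, flagging examples where a finetuned model's prediction disagrees with the dataset label, can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the checkpoint name, task prefix, and example data are all hypothetical placeholders, and in practice the model would first be finetuned on the labeled dataset.

```python
# Minimal sketch of label-error detection via model disagreement,
# assuming a T5 checkpoint already finetuned on the labeled dataset.
# Checkpoint name, task prefix, and examples below are hypothetical.

from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = "t5-base"  # hypothetical: substitute the finetuned checkpoint

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

# Hypothetical labeled examples: (input text, dataset-provided label).
examples = [
    ("sst2 sentence: a gripping, beautifully shot film", "positive"),
    ("sst2 sentence: a dull and lifeless script", "positive"),  # suspect label
]

suspected_errors = []
for text, given_label in examples:
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=5)
    predicted = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
    # Flag examples where the model's prediction disagrees with the label.
    if predicted != given_label:
        suspected_errors.append((text, given_label, predicted))

for text, given, pred in suspected_errors:
    print(f"possible label error: {text!r} labeled {given!r}, model says {pred!r}")
```

Disagreements collected this way are candidates for human review rather than confirmed errors, since the model itself may be wrong on hard examples.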