Robust Learning for Text Classification with Multi-source Noise Simulation and Hard Example Mining

Abstract

Many real-world applications involve the use of Optical Character Recognition (OCR) engines to transform handwritten images into transcripts on which downstream Natural Language Processing (NLP) models are applied. In this process, OCR engines may introduce errors, and the inputs to downstream NLP models become noisy. Although pre-trained models achieve state-of-the-art performance on many NLP benchmarks, we show that they are not robust to noisy texts generated by real OCR engines. This greatly limits the application of NLP models in real-world scenarios. In order to improve model performance on noisy OCR transcripts, it is natural to train the NLP model on labelled noisy texts. However, in most cases there are only labelled clean texts. Since there are no handwritten pictures corresponding to the text, it is impossible to directly use a recognition model to obtain noisy labelled data. Human resources can be employed to copy texts and take pictures, but this is extremely expensive considering the size of the data needed for model training. Consequently, we are interested in making NLP models intrinsically robust to OCR errors in a low-resource manner. We propose a novel robust training framework which 1) employs simple but effective methods to directly simulate natural OCR noises from clean texts, 2) iteratively mines the hard examples from a large number of simulated samples for optimal performance, and 3) employs a stability loss to make the model learn noise-invariant representations. Experiments on three datasets show that the proposed framework boosts the robustness of pre-trained models by a large margin. We believe that this work can promote the application of NLP models in actual scenarios, although the algorithm is simple and straightforward. The code is publicly available (https://github.com/tal-ai/Robust-learning-MSSHEM).
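
The three components named in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper or its repository: the character confusion table, the choice of KL divergence as the stability loss, the rule of ranking candidates by cross-entropy, and the `toy_predict` stand-in for a real classifier.

```python
import math
import random

# Hypothetical table of visually similar characters (an assumption,
# not the noise model used in the paper).
CONFUSIONS = {"o": "0", "l": "1", "e": "c", "m": "rn", "s": "5"}


def simulate_ocr_noise(text, rate, rng):
    """Replace each confusable character with probability `rate`."""
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < rate:
            out.append(CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true label."""
    return -math.log(probs[label])


def stability_loss(clean_probs, noisy_probs):
    """KL(clean || noisy); zero when both predictions agree (assumed form)."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(clean_probs, noisy_probs)
        if p > 0
    )


def mine_hard_examples(candidates, label, predict, k):
    """Keep the k noisy candidates on which the model's loss is highest."""
    ranked = sorted(
        candidates,
        key=lambda t: cross_entropy(predict(t), label),
        reverse=True,
    )
    return ranked[:k]


def toy_predict(text):
    """Stand-in classifier: class 0 favours letters, class 1 favours digits."""
    n = max(len(text), 1)
    letters = sum(c.isalpha() for c in text)
    digits = sum(c.isdigit() for c in text)
    return softmax([letters / n, digits / n])


if __name__ == "__main__":
    rng = random.Random(0)
    clean = "hello small models"
    # Simulate many noisy variants, then keep the hardest ones.
    pool = [simulate_ocr_noise(clean, 0.5, rng) for _ in range(20)]
    hard = mine_hard_examples(pool, label=0, predict=toy_predict, k=3)
    for noisy in hard:
        loss = stability_loss(toy_predict(clean), toy_predict(noisy))
        print(noisy, round(loss, 4))
```

In an actual training loop the stability term would presumably be added to the classification loss and the mining step repeated as the model changes; the sketch only shows how the pieces relate to one another.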

Similar resources

Topic Modeling and Classification of Cyberspace Papers Using Text Mining

The global cyberspace networks provide individuals with platforms where they can interact, exchange ideas, share information, provide social support, conduct business, create artistic media, play games, engage in political discussions, and much more. The term cyberspace has become a conventional means to describe anything associated with the Internet and the diverse Internet culture. In fact, cyberspac...

A Robust Learning Approach for Text Classification

Previous learning approaches often assume that every part of a positive training document of a class is relevant to that class. However, in practice, it is often the case that only one or a few parts in the training document are really relevant to the class. To overcome this limitation, we propose another learning approach based on relevance-based topic model, an extension of well-known Latent ...

Source Free Transfer Learning for Text Classification

Transfer learning uses relevant auxiliary data to help the learning task in a target domain where labeled data is usually insufficient to train an accurate model. Given appropriate auxiliary data, researchers have proposed many transfer learning models. How to find such auxiliary data, however, has received little research attention so far. In this paper, we focus on the problem of auxiliary data retrieval, and...

Inductive and example-based learning for text classification

Text classification has been widely applied to many practical tasks. Inductive models trained from labeled data are the most commonly used technique. The basic assumption underlying an inductive model is that the training data are drawn from the same distribution as the test data. However, labeling such a training set is often expensive for practical applications. On the other hand, a large amo...

Adversarial Multi-task Learning for Text Classification

Neural network models have shown promise for multi-task learning, which focuses on learning shared layers to extract common and task-invariant features. However, in most existing approaches, the extracted shared features are prone to be contaminated by task-specific features or the noise brought by other tasks. In this paper, we propose an adversarial multi-task lear...

Journal

Journal title: Lecture Notes in Computer Science

Year: 2021

ISSN: 1611-3349, 0302-9743

DOI: https://doi.org/10.1007/978-3-030-86517-7_18