Script, Language, and Labels: Overcoming Three Discrepancies for Low-Resource Language Specialization

Abstract

Although multilingual pretrained models (mPLMs) have enabled support for various natural language processing tasks in diverse languages, their limited coverage of 100+ languages leaves 6500+ languages 'unseen'. One common approach for an unseen language is to specialize the model for it as the target, by performing additional masked language modeling (MLM) with a target-language corpus. However, we argue that, due to the discrepancy from multilingual MLM pretraining, such a naive specialization can be suboptimal. Specifically, we pose three discrepancies to overcome. Script and linguistic discrepancies between the target language and related seen languages hinder positive transfer, for which we propose to maximize representation similarity, unlike existing approaches that maximize overlaps. In addition, the label space for MLM prediction can vary across languages, for which we propose to reinitialize the top layers for more effective adaptation. Experiments on tasks across four different language families show that our method improves task performance with statistical significance, while the previous approach fails to.

Similar resources

Overcoming Procrastination: English Language Teachers’ and Learners’ Suggestions

Procrastination pervades the long and taxing process of foreign language learning and working against it is crucial. This study attempted to elicit and investigate the strategies and solutions from English teachers and learners which can help in dealing with procrastination over weekly assignments, term projects, and preparing for exams. To achieve this aim, suggestions were sought from 46 Engl...

Script Language for Image Processing

This paper proposes the design and structure of a script language intended for easy description and prototyping of high-level image processing operations. The image operations are meant to be composed from basic building blocks represented by either C/C++ functions or appropriate block connections in FPGA (Field-Programmable Gate Array) circuits. The proposed language is designed for use i...

A Low-Overhead Script Language for Tiny Networked Embedded Systems

With sensor networks starting to get mainstream acceptance, programmability is of increasing importance. Customers and field engineers will need to reprogram existing deployments and software developers will need to test and debug software in network testbeds. Script languages, which are a popular mechanism for reprogramming in general-purpose computing, have not been considered for wireless se...

Sublexical Translations for Low-Resource Language

Machine Translation (MT) for low-resource languages suffers from low coverage due to Out-Of-Vocabulary (OOV) words. In this research we propose a method using sublexical translation to achieve wide coverage in Example-Based Machine Translation (EBMT) from English to Bangla. For sublexical translation we divide the OOV words into sublexical units to obtain translation candidates. Previous ...

Selection Criteria for Low Resource Language Programs

This paper documents and describes the criteria used to select languages for study within programs that include low resource languages whether given that label or another similar one. It focuses on five US common task, Human Language Technology research and development programs in which the authors have provided information or consulting related to the choice of language. The paper does not des...

Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2023

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v37i11.26528