UNLT: Urdu Natural Language Toolkit

Abstract

This study describes a Natural Language Processing (NLP) toolkit, as the first contribution of a larger project, for an under-resourced language, Urdu. In previous studies, standard NLP toolkits have been developed for English and many other languages. There is also a dire need for standard text processing tools and methods for Urdu, despite it being widely spoken in different parts of the world with a large amount of Urdu digital text readily available. This study presents the first version of the UNLT (Urdu Natural Language Toolkit), which contains three key text processing tools required for an Urdu NLP pipeline: a word tokenizer, a sentence tokenizer, and a part-of-speech (POS) tagger. The UNLT word tokenizer employs a morpheme matching algorithm coupled with a state-of-the-art stochastic n-gram language model with back-off and smoothing characteristics for the space omission problem. The space insertion problem for compound words is tackled using a dictionary look-up technique. The UNLT sentence tokenizer is a combination of various machine learning, rule-based, regular-expression, and dictionary look-up techniques. Finally, the UNLT POS taggers are based on Hidden Markov Model and Maximum Entropy-based stochastic techniques. In addition, we have developed large gold-standard training and testing data sets to improve and evaluate the performance of new techniques for Urdu word tokenization, sentence tokenization, and POS tagging. For comparison purposes, we have compared the proposed approaches with several methods. Our proposed UNLT, the training and testing data sets, and supporting resources are all free and publicly available for academic use.
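The space omission step described in the abstract (dictionary matching to propose word boundaries, then an n-gram model with back-off and smoothing to rank them) can be illustrated with a small sketch. This is not the UNLT implementation: the romanized toy corpus, the extra lexicon entries, the back-off weight of 0.4, the add-one smoothing, and every function name below are assumptions chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy data (romanized placeholders, not UNLT resources).
CORPUS = [
    ["yeh", "kitab", "achi", "hai"],
    ["yeh", "achi", "kitab", "hai"],
    ["kitab", "achi", "hai"],
]

unigrams = Counter(w for sent in CORPUS for w in sent)
bigrams = Counter((a, b) for sent in CORPUS for a, b in zip(sent, sent[1:]))

# Extra entries create an ambiguity: "kitab" vs. "ki" + "tab".
LEXICON = set(unigrams) | {"ki", "tab"}
TOTAL = sum(unigrams.values())
VOCAB = len(LEXICON)


def log_prob(prev, word, backoff=0.4):
    """Bigram relative frequency; otherwise back off to an add-one unigram."""
    if prev is not None and bigrams[(prev, word)] > 0:
        return math.log(bigrams[(prev, word)] / unigrams[prev])
    unigram = (unigrams[word] + 1) / (TOTAL + VOCAB)
    # The back-off penalty applies only when a bigram context existed.
    return math.log(unigram if prev is None else backoff * unigram)


def candidates(text):
    """Every way of splitting text into lexicon entries (exhaustive matching)."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in LEXICON:
            results.extend([head] + rest for rest in candidates(text[end:]))
    return results


def score(words):
    """Log-probability of a word sequence under the backed-off bigram model."""
    total, prev = 0.0, None
    for word in words:
        total += log_prob(prev, word)
        prev = word
    return total


def segment(text):
    """Best-scoring segmentation, or None if the lexicon cannot cover text."""
    return max(candidates(text), key=score, default=None)


print(segment("yehkitabachihai"))
```

Exhaustive enumeration is exponential in the worst case; a practical tokenizer would more likely use dynamic programming over the same scores. The sketch is kept naive so that the two stages, candidate generation and language-model ranking, stay visibly separate.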

Similar resources

NLTK: The Natural Language Toolkit

The Natural Language Toolkit is a suite of program modules, data sets, tutorials and exercises, covering symbolic and statistical natural language processing. NLTK is written in Python and distributed under the GPL open source license. Over the past three years, NLTK has become popular in teaching and research. We describe the toolkit and report on its current state of development.

Multidisciplinary Instruction with the Natural Language Toolkit

The Natural Language Toolkit (NLTK) is widely used for teaching natural language processing to students majoring in linguistics or computer science. This paper describes the design of NLTK, and reports on how it has been used effectively in classes that involve different mixes of linguistics and computer science students. We focus on three key issues: getting started with a course, delivering i...

VnCoreNLP: A Vietnamese Natural Language Processing Toolkit

We present an easy-to-use and fast toolkit, namely VnCoreNLP—a Java NLP annotation pipeline for Vietnamese. Our VnCoreNLP supports key natural language processing (NLP) tasks including word segmentation, part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing, and obtains state-of-the-art (SOTA) results for these tasks. We release VnCoreNLP to provide rich linguistic...

Computational Semantics in the Natural Language Toolkit

NLTK, the Natural Language Toolkit, is an open source project whose goals include providing students with software and language resources that will help them to learn basic NLP. Until now, the program modules in NLTK have covered such topics as tagging, chunking, and parsing, but have not incorporated any aspect of semantic interpretation. This paper describes recent work on building a new sema...

PSI-Toolkit: A Natural Language Processing Pipeline

The paper presents the main ideas and the architecture of the open source PSI-Toolkit, a set of linguistic tools being developed within a project financed by the Polish Ministry of Science and Higher Education. The toolkit is intended for experienced language engineers as well as casual users not having any technological background. The former group of users is delivered a set of libraries that...

Journal

Journal title: Natural Language Engineering

Year: 2022

ISSN: 1469-8110, 1351-3249

DOI: https://doi.org/10.1017/s1351324921000425