Producing a Persian Text Tokenizer Corpus Focusing on Its Computational Linguistics Considerations

Abstract:

The main task of tokenization is to divide the sentences of a text into their constituent units and to remove punctuation marks (periods, commas, etc.). Each unit is a continuous lexical or grammatical string of characters that forms an independent semantic unit. Tokenization operates at the word level, and the extracted units can be used as input to other components such as a stemmer. Building this tool requires identifying the units that count as independent semantic units in Persian. For English, much work has been done on text tokenization and many tools have been developed, such as Stanford, Ragel, ANTLR, JFlex, JLex, Flex and Quex. In recent decades, valuable research has also been conducted on tokenization for Persian, all of which has worked on the lexical and syntactic layers. In the current research, we tried to focus on the semantic layer in addition to those two layers. Persian texts usually have two simple but important problems. The first is multi-word tokens, which result from attaching one word to the next. The second is polysyllabic units, which result from writing separately the words that together form a single lexical unit. The tokenizer is one of the language preprocessing tools most widely used in text analysis; it recognizes word boundaries in a text and turns the text into a sequence of words for later analysis. Because of the variety in Persian script and the non-observance of the rules for spacing and spelling words on the one hand, and the lexical complexity of Persian on the other, language processing tasks such as tokenization face many challenges.
Therefore, to obtain optimal performance from this tool, it is necessary first to specify the computational linguistics considerations of tokenization in Persian and then, based on these considerations, to provide a data set for training and testing. In this article, while explaining these considerations, we prepared such a data set. The prepared data set contains 21,183 tokens, and the average sentence length is 40.28.
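
The two problems named in the abstract (words attached to the next word, and parts of one lexical unit written separately) can be illustrated with a minimal Python sketch. This is not the authors' tool. The two small lexicons, the function name `tokenize`, and the choice to rejoin separated parts with a zero-width non-joiner (U+200C) are all assumptions made only for illustration.

```python
import re

# Hypothetical toy lexicons, invented for this example (not from the article).
# Pairs of parts that together form one lexical unit but are often typed apart.
SEPARATED_UNITS = {("می", "روم")}
# Strings that are really several words typed without a space between them.
ATTACHED_WORDS = {"بهخانه": ["به", "خانه"]}

# Latin and Persian punctuation marks to strip before splitting.
PUNCTUATION = re.compile(r'[.,!?:;،؛؟«»()"]+')

ZWNJ = "\u200c"  # zero-width non-joiner; joining with it is an assumption


def tokenize(text):
    """Split text on whitespace, drop punctuation, then repair both problems."""
    raw = PUNCTUATION.sub(" ", text).split()
    tokens = []
    i = 0
    while i < len(raw):
        pair = tuple(raw[i:i + 2])
        if pair in SEPARATED_UNITS:
            # Two written pieces, one lexical unit: merge them.
            tokens.append(pair[0] + ZWNJ + pair[1])
            i += 2
        elif raw[i] in ATTACHED_WORDS:
            # One written piece, several words: split it.
            tokens.extend(ATTACHED_WORDS[raw[i]])
            i += 1
        else:
            tokens.append(raw[i])
            i += 1
    return tokens


if __name__ == "__main__":
    print(tokenize("من بهخانه می روم."))
```

A lexicon lookup like this only covers listed cases; a tokenizer trained on an annotated data set, as the abstract proposes, would be expected to generalize beyond a fixed list.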

Similar resources

Text analysis meets corpus linguistics

In recent years, there has been rising interest in using evidence derived from automatic syntactic analysis in large-scale corpus studies. Ideally, of course, corpus linguists would prefer to have access to the wealth of structural and featural information provided by a full parser based on a complex grammar formalism. However, to date such parsers achieve neither the speed nor the robustness n...

A Text Alignment Corpus for Persian Plagiarism Detection

This paper describes how a Persian text alignment corpus was constructed to evaluate plagiarism detection systems. This corpus is in PAN format and contains 11,089 documents and more than 11,603 plagiarism cases. Efforts were made to simulate various types of plagiarism manually, semi-automatically, or automatically in this large-scale corpus.

Text Based Interactive Fiction and Computational Linguistics

Interactive fiction (IF) or text adventures are text-based computer games. After a short introduction I will explain some details of an IF authoring system. As an illustration I will give an outline of my own project. I will conclude by discussing some scientific attempts to improve the genre using theories of artificial intelligence and computational linguistics. 1 What is interactive fiction (IF)?...

Notes on computational linguistics

Components of Linguistics in Learning Disabilities Focusing on Reading Disorder

Background: Reading disorders are among the most common and significant learning disabilities. Reading is the foundation for all other learning, and children with weak reading skills are more vulnerable learners throughout their education and future, and thereby fail to show significant progress in academic learning outcomes. Conclusion: Reading covers a language learning system and it is a subset of...

The Future of Text-Meaning in Computational Linguistics

Writer-based and reader-based views of text-meaning are reflected by the respective questions “What is the author trying to tell me?” and “What does this text mean to me personally?” Contemporary computational linguistics, however, generally takes neither view. But this is not adequate for the development of sophisticated applications such as intelligence gathering and question answering. I dis...

Journal title

Signal and Data Processing (پردازش علائم و داده ها)

Volume 19, Issue 3

Pages 175-188

Publication date 2022-12
