Persian parallel corpus

Presenting a Religious Question-Answering Corpus in Persian

Journal: Signal and Data Processing 2018

Yasaman Boreshban, Seyed Abolghasem Mirroshandel, Hamed Yousefinasab

Question answering is a field of natural language processing and information retrieval that has attracted researchers' attention in recent decades. Growing interest in this field has created a need for appropriate data sources. Most research on developing question answering corpora has so far been done for English, but in other languages such as Persian, the lack of these co...


Design a Persian Automated Plagiarism Detector (AMZPPD)

Journal: CoRR 2014

Maryam Mahmoodi Mohammad Mahmoodi Varnamkhasti

Many plagiarism detection approaches exist, but few have been implemented and adapted for the Persian language. This paper describes our work on designing and implementing a plagiarism detection system based on preprocessing and NLP techniques, and presents the results of testing it on a corpus. Keywords: External Plagiarism, Plagiarism, Copy detection, natural ...


Discriminating Similar Languages: Persian and Dari

Journal: TinyToCS 2015

Shervin Malmasi

Although widely studied in recent years, Language Identification (LID) systems for determining the language of input texts often fail to discriminate between similar languages like Croatian-Serbian and Malay-Indonesian. This has brought attention to the task of discriminating similar languages, varieties and dialects, including a recent shared task [3]. Persian (also known as Farsi) and Dari (...


ISO-TimeML Event Extraction in Persian Text

2012

Yadollah Yaghoobzadeh Gholamreza Ghassem-Sani Seyed Abolghasem Mirroshandel Mahbaneh Eshaghzadeh Torbati

Recognizing TimeML events and identifying their attributes are important tasks in natural language processing (NLP). Several NLP applications, such as question answering, information retrieval, summarization, and temporal information extraction, need some knowledge about the events in their input documents. Existing methods developed for this task are restricted to a limited number of languages, an...


Supervised Lexical Acquisition for Persian from a Web Corpus

2007

Nick Pendar Serge Sharoff

This paper reports on the compilation of a large Persian Web corpus and the cyclic supervised development of a lexicon and lemmatizer. We discuss the strategies adopted in compiling the corpus as well as some of the challenges in processing and tokenizing it. We also present the word patterns developed for the lemmatizer and the algorithms designed for the supervised lexical acquisition.


Mahak Samim: A Corpus of Persian Academic Texts for Evaluating Plagiarism Detection Systems

2016

Morteza Rezaei Sharifabadi Seyed Ahmad Eftekhari

In this paper we introduce Mahak Samim, a plagiarism detection corpus that consists of Persian academic texts in which plagiarism cases are embedded. This corpus, which can be used for evaluating plagiarism detection systems, consists of more than five thousand artificial plagiarism cases with various lengths and diverse degrees of obfuscation. The development process and the features of the co...


Evaluation of Perstem: A Simple and Efficient Stemming Algorithm for Persian

2009

Amir Hossein Jadidinejad Fariborz Mahmoudi Jon Dehdari

Persian is a challenging language in the field of NLP. Right-to-left orthography, complex morphology, complicated grammatical rules, and different forms of letters make it an interesting language for NLP research. In this paper we measure the effectiveness of a simple and efficient stemming algorithm, Perstem, on Persian information retrieval. Our experiments on the Hamshahri corpus at CLEF2009 ...


Mahtab at SemEval-2017 Task 2: Combination of Corpus-based and Knowledge-based Methods to Measure Semantic Word Similarity

2017

Niloofar Ranjbar Fatemeh Mashhadirajab Mehrnoush Shamsfard Rayeheh Hosseini pour Aryan Vahid pour

In this paper, we describe our proposed method for measuring the semantic similarity of a given pair of words in the SemEval-2017 monolingual semantic word similarity task. We use a combination of knowledge-based and corpus-based techniques. We use FarsNet, the Persian WordNet, alongside deep learning techniques to extract the similarity of words. We evaluated our proposed approach on Persian (Farsi) tes...


Extracting Bilingual Persian Italian Lexicon from Comparable Corpora Using Different Types of Seed Dictionaries

Journal: CoRR 2017

Ebrahim Ansari Mohammad Hadi Sadreddini Lucio Grandinetti Mehdi Sheikhalishahi

Ebrahim Ansari et al. 2017. Extracting Bilingual Persian Italian Lexicon from Comparable Corpora Using Different Types of Seed Dictionaries. In "Applications of Comparable Corpora", edited book, Berlin Linguistic Press (ed.). Bilingual dictionaries are very important in various fields of natural language processing. In recent years, research on extracting new bilingual lex...


Text Recognition with k-means Clustering

Journal: Research in Computing Science 2014

Mohammad Iman Jamnejad Ali Heidarzadegan Mohsen Meshki

A thesaurus is a reference work that lists words grouped together according to similarity of meaning (containing synonyms and sometimes antonyms), in contrast to a dictionary, which contains definitions and pronunciations. This paper proposes an innovative approach to improve the classification performance of Persian texts considering a very large thesaurus. The paper proposes a flexible method...
