Plagiarism Alignment Detection by Merging Context Seeds
Abstract
We describe the algorithm we submitted to the text alignment sub-task of the plagiarism detection task in the PAN 2014 challenge, where it achieved a plagdet score of 0.855. By extracting contextual features for each document character and grouping those that are relevant for a given pair of documents, we generate seeds of atomic plagiarism cases. These seeds are then merged by an agglomerative single-linkage strategy using a defined distance measure.
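The merging step can be illustrated with a minimal sketch. The abstract does not specify the seed representation, the distance measure, or any threshold, so everything below is an assumption made for illustration: a seed is taken to be a pair of character ranges (one in the suspicious document, one in the source), the hypothetical `seed_distance` is the larger of the two character gaps, and `max_dist` is a hypothetical cut-off. Cutting a single-linkage hierarchy at a fixed distance is equivalent to finding connected components among seeds whose pairwise distance is within the cut-off, which is what the union-find loop does.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Assumed seed layout: (susp_start, susp_end, src_start, src_end), in characters.
Seed = Tuple[int, int, int, int]


def gap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Characters separating two ranges; 0 if they touch or overlap."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def seed_distance(a: Seed, b: Seed) -> int:
    """Hypothetical distance: the larger gap across the two documents."""
    return max(gap(a[0], a[1], b[0], b[1]), gap(a[2], a[3], b[2], b[3]))


def merge_seeds(seeds: List[Seed], max_dist: int) -> List[Seed]:
    """Single-linkage merge: seeds chained by links <= max_dist form one case."""
    parent = list(range(len(seeds)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of seeds that is close enough in both documents.
    for i, j in combinations(range(len(seeds)), 2):
        if seed_distance(seeds[i], seeds[j]) <= max_dist:
            parent[find(i)] = find(j)

    # Collect clusters and report each as the bounding ranges of its seeds.
    clusters: Dict[int, List[Seed]] = {}
    for i, seed in enumerate(seeds):
        clusters.setdefault(find(i), []).append(seed)
    return sorted(
        (
            min(s[0] for s in c),
            max(s[1] for s in c),
            min(s[2] for s in c),
            max(s[3] for s in c),
        )
        for c in clusters.values()
    )
```

With this sketch, three seeds that are each within the cut-off of their neighbour end up in one detected case even if the first and last are far apart, which is the chaining behaviour characteristic of single linkage. The pairwise loop is quadratic in the number of seeds; a real system would likely sort seeds by position first.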
Similar Resources
Approaches for Source Retrieval and Text Alignment of Plagiarism Detection Notebook for PAN at CLEF 2013
In this paper, we describe our approach at the PAN@CLEF2013 plagiarism detection competition. For the Source Retrieval sub-task, we propose a method that combines TF-IDF, PatTree, and Weighted TF-IDF to extract keywords from suspicious documents and use them as queries to retrieve the plagiarism source documents. For the Text Alignment sub-task, we present a method based on sentence similarity. Our text alignment...
Detecting High Obfuscation Plagiarism: Exploring Multi-Features Fusion via Machine Learning
Providing effective methods for identifying high-obfuscation plagiarism seeds is a significant research problem in the field of plagiarism detection. Conventional plagiarism detection methods rely on a single type of feature to capture plagiarism seeds. But for high-obfuscation plagiarism detection, these single-type features are not sufficient for identifying the plagiari...
Optimized Fuzzy Text Alignment for Plagiarism Detection
This paper describes a method for plagiarism detection based on a fuzzy alignment between a given pair of documents. The proposed method assigns a weight to each word of the suspicious document according to the straightness of its alignment to the source document; this weight is used as a kind of plagiarism probability measure for each word of the suspicious document. The paper also presents a ...
A Text Alignment Corpus for Persian Plagiarism Detection
This paper describes how a Persian text alignment corpus was constructed to evaluate plagiarism detection systems. This corpus is in PAN format and contains 11,089 documents and more than 11,603 plagiarism cases. Efforts were made to simulate various types of plagiarism manually, semi-automatically, or automatically in this large-scale corpus.
Overview of the 6th International Competition on Plagiarism Detection
This paper overviews 17 plagiarism detectors that have been evaluated within the sixth international competition on plagiarism detection at PAN 2014. We report on their performance in the two tasks of external plagiarism detection, source retrieval and text alignment. For the third year in a row, we invite software submissions instead of run submissions for this task, which allows for cross-ye...