Approaches for Candidate Document Retrieval and Detailed Comparison of Plagiarism Detection

Authors

  • Leilei Kong
  • Haoliang Qi
  • Shuai Wang
  • Cuixia Du
  • Suhong Wang
  • Yong Han
Abstract

In this paper we report on our plagiarism detection system, which we used to process the PAN plagiarism corpus for the tasks of Candidate Document Retrieval and Detailed Comparison. To retrieve candidate source documents through the ChatNoir API, we propose a tf*idf-based method that extracts keywords from the suspicious documents and uses them as queries. A Lucene ranking method reduces the set of candidate documents, and a detailed comparison algorithm then identifies the web pages that are the actual sources of plagiarized passages. To extract all plagiarized passages from a suspicious document together with their corresponding source passages in the source document, we propose a plagiarism detection method that combines semantic similarity and structural similarity. Semantic similarity is calculated with the Vector Space Model, while structural similarity is calculated with our own method. We use information retrieval to obtain candidate sentence pairs from the suspicious document and the potential source document. A method called Bilateral Alternating Sorting merges the sentence pairs, and the resulting candidate plagiarism pairs are screened in post-processing.
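The abstract does not give the exact weighting formulas, so the following is only a minimal sketch of two ideas it names: ranking the terms of a suspicious document by tf*idf to form a keyword query, and comparing two text fragments by cosine similarity in a Vector Space Model. The function names, the tokenizer, the smoothed idf variant, and the use of raw term counts as vector weights are all assumptions made for illustration. No ChatNoir or Lucene call is shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs only (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(document, background_docs, top_k=5):
    """Rank the terms of `document` by tf*idf against a background collection.

    Hypothetical helper: idf = log(N / df), where N and df both count the
    document itself, so df is never zero.
    """
    tf = Counter(tokenize(document))
    background_sets = [set(tokenize(d)) for d in background_docs]
    n_docs = len(background_sets) + 1  # background documents plus the document
    scores = {}
    for term, freq in tf.items():
        df = 1 + sum(term in s for s in background_sets)
        scores[term] = freq * math.log(n_docs / df)
    # Sort by descending score, then alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between raw term-count vectors of two texts."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    suspicious = "plagiarism detection plagiarism corpus the the"
    background = ["the cat sat on the mat", "the dog chased the corpus"]
    # The top-ranked terms would be joined into a search query.
    print(" ".join(tfidf_keywords(suspicious, background, top_k=2)))
    print(cosine_similarity("the source sentence", "the suspicious sentence"))
```

Terms that occur in every background document receive a score of zero under this idf variant, which is why common words such as "the" drop out of the query.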


Similar resources

Similarity Overlap Metric and Greedy String Tiling at PAN 2012: Plagiarism Detection

This paper reports the best-performing approach followed for the candidate document retrieval task and the approach used for the detailed comparison task of the plagiarism detection track at PAN 2012. The aim of the participation was to understand a few of the computer-assisted approaches used for plagiarism detection. Plagiarism detection depends on two broad tasks, (1) the candidate d...


Educated Guesses and Equality Judgments: Using Search Engines and Pairwise Match for External Plagiarism Detection

This paper describes the approaches taken to the two subtasks of Candidate Document Retrieval and Detailed Comparison, in the Plagiarism Detection track at PAN 12. For the first of these, we describe how we used a combination of frequency and a variation of a contrastive corpus measure to select keywords with which to make queries to the ChatNoir search system; for the second, we provide an ove...


Three Way Search Engine Queries with Multi-feature Document Comparison for Plagiarism Detection

In this paper, we describe our approach to the PAN 2012 plagiarism detection competition. Our candidate retrieval system extracts three different types of web queries and narrows their execution by skipping certain passages of an input document. Our detailed comparison system detects common features of an input document pair, computing valid intervals from them, and then mergin...


A Pairwise Document Analysis Approach for Monolingual Plagiarism Detection

The task of plagiarism detection entails two main steps: suspicious candidate retrieval and pairwise document similarity analysis, also called detailed analysis. In this paper we focus on the second subtask. We report on our monolingual plagiarism detection system, which is used to process the Persian plagiarism corpus for the task of pairwise document similarity. To retrieve plagiarised passag...


Overview of the 4th International Competition on Plagiarism Detection

This paper overviews 15 plagiarism detectors that have been evaluated within the fourth international competition on plagiarism detection at PAN’12. We report on their performances for two sub-tasks of external plagiarism detection: candidate document retrieval and detailed document comparison. Furthermore, we introduce the PAN plagiarism corpus 2012, the TIRA experimentation platform, and the ...




Publication date: 2012