Applying Specific Clusterization and Fingerprint Density Distribution with Genetic Algorithm Overall Tuning in External Plagiarism Detection

Authors

  • Yurii Palkovskii
  • Alexei Belov
Abstract

One of the biggest challenges encountered at PAN'11 External Plagiarism Detection was the need for different clusterization methods for the different types of plagiarism within the corpus. The coexistence of sparse sections of highly obfuscated, lightly obfuscated, and translated plagiarism alongside verbatim plagiarism made single-pass clusterization inefficient, as it produced negative effects in one of the above cases. At PAN'11 we used a single-pass, fixed-length clusterization algorithm with a fixed value defining the maximum distance for cluster formation. The main issue with a fixed clusterization value is that large values (1600-1800) perform best for high obfuscation, medium values (900) for translated sections, and low values (40) for verbatim sections. We decided to develop a system that would either dynamically adjust the clusterization distance depending on the type of detected section, or try multi-pass clusterization with different distance values, excluding already detected clusters and applying heuristic post-processing. For each cluster detected in the several clusterization runs we measured the Diagonal Density Distribution (DDD) and the Mean Average Diagonal Fingerprint Distance (MADFD). These two values reflect the relative distribution of the detected equal fingerprints along the cluster diagonal and allow us to tell effectively which type of plagiarism is actually present. Another important role of these values is to veto a cluster merge if the resulting DDD is lower than that of either of the two clusters being merged. This was particularly effective in preventing accidental fingerprint matches from merging the resulting clusters. Additionally, we found that the total number of parameters affecting system performance was already large, so we decided to apply a genetic algorithm to find the best possible meta values instead of picking them by hand.
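As a rough illustration of the multi-pass idea and the density-guarded merge described above, the following Python sketch may help. It is not the authors' implementation: the abstract does not define DDD or MADFD precisely, so `diagonal_density` and `mean_diagonal_distance` are assumed stand-ins, the tight-to-loose pass order is an assumption, and the merge rule assumes that a merge is vetoed only when the merged density falls below that of both inputs. Matches are assumed to be `(source_offset, suspicious_offset)` pairs, and all function names are hypothetical.

```python
from math import hypot

def diagonal_density(cluster):
    """Assumed stand-in for DDD: matches per unit of bounding-box diagonal."""
    if len(cluster) < 2:
        return 0.0
    xs = [p[0] for p in cluster]
    ys = [p[1] for p in cluster]
    diag = hypot(max(xs) - min(xs), max(ys) - min(ys))
    return (len(cluster) - 1) / diag if diag else 0.0

def mean_diagonal_distance(cluster):
    """Assumed stand-in for MADFD: mean gap between consecutive matches."""
    pts = sorted(cluster)
    if len(pts) < 2:
        return 0.0
    gaps = [hypot(b[0] - a[0], b[1] - a[1]) for a, b in zip(pts, pts[1:])]
    return sum(gaps) / len(gaps)

def cluster_pass(points, max_dist, min_size=2):
    """One pass: chain sorted matches whose gap on both axes is <= max_dist."""
    clusters, current = [], []
    for p in sorted(points):
        if current and (abs(p[0] - current[-1][0]) > max_dist
                        or abs(p[1] - current[-1][1]) > max_dist):
            if len(current) >= min_size:
                clusters.append(current)
            current = []
        current.append(p)
    if len(current) >= min_size:
        clusters.append(current)
    return clusters

def multi_pass(points, distances=(40, 900, 1700)):
    """Run several passes, removing already clustered matches after each one."""
    remaining = set(points)
    found = []
    for d in distances:  # assumed order: tight (verbatim) to loose (obfuscated)
        for c in cluster_pass(remaining, d):
            found.append((d, c))
            remaining.difference_update(c)
    return found

def try_merge(a, b):
    """Merge two clusters unless density drops below that of both inputs."""
    merged = sorted(a + b)
    if diagonal_density(merged) < min(diagonal_density(a), diagonal_density(b)):
        return None  # merge vetoed
    return merged

# Example: a dense verbatim-like run and a sparse obfuscated-like run.
matches = [(0, 0), (10, 10), (20, 20), (30, 30),
           (5000, 5000), (5800, 5700), (6500, 6600)]
for distance, cluster in multi_pass(matches):
    print(distance, len(cluster), round(mean_diagonal_distance(cluster), 1))
```

In this toy input the dense run is picked up by the pass with distance 40 and the sparse run only by the pass with distance 900, which mirrors the observation that different plagiarism types need different clusterization distances.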
In the PAN'12 prototype application we employed a dot plot visualization overlaying both the detected clusters and the master clusters, which allowed us to efficiently control the training process and to measure the overall progress for each separate document pair.
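The genetic-algorithm tuning of meta values mentioned in the abstract could look roughly like the Python sketch below. It is a generic, minimal genetic algorithm with elitism, uniform crossover, and Gaussian mutation, not the authors' tuner. The three tuned parameters, their bounds, and `toy_fitness` are hypothetical; in a real system the fitness would presumably be some detection-quality score measured on a training corpus.

```python
import random

def evolve(fitness, bounds, pop_size=20, generations=30,
           mutation_rate=0.2, seed=0):
    """Maximize `fitness` over real-valued parameter vectors within `bounds`."""
    rng = random.Random(seed)  # seeded for reproducible runs

    def random_individual():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def crossover(a, b):
        # Uniform crossover: each gene comes from either parent.
        return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

    def mutate(ind):
        out = []
        for gene, (lo, hi) in zip(ind, bounds):
            if rng.random() < mutation_rate:
                gene += rng.gauss(0, 0.1 * (hi - lo))
            out.append(min(hi, max(lo, gene)))  # clamp to bounds
        return out

    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                    for _ in range(pop_size - 2)]
        population = ranked[:2] + children  # elitism: keep the two best
    return max(population, key=fitness)

# Hypothetical setup: tune three clusterization distances.
TARGET = (40.0, 900.0, 1700.0)  # toy optimum, for illustration only
BOUNDS = [(10.0, 100.0), (500.0, 1500.0), (1000.0, 2500.0)]

def toy_fitness(params):
    """Toy stand-in for a corpus-level score: closer to TARGET is better."""
    return -sum(((p - t) / (hi - lo)) ** 2
                for p, t, (lo, hi) in zip(params, TARGET, BOUNDS))

best = evolve(toy_fitness, BOUNDS)
print([round(x, 1) for x in best])
```

Because the two best individuals are carried over unchanged in every generation, the best fitness never decreases from one generation to the next.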

Similar resources

Using Hybrid Similarity Methods for Plagiarism Detection Notebook for PAN at CLEF 2013

At PAN2013 we decided to focus entirely on the Text Alignment subtask. Following our previous experience at PAN2012 and CLINSS2012, we decided to combine the approaches we used in previous years to face the new challenges of PAN2013. This year's competition added a new form of plagiarism obfuscation via text summarization. This particular feature required represents a wide variety of typical cases o...


Plagiarism checker for Persian (PCP) texts using hash-based tree representative fingerprinting

With due respect to the authors’ rights, plagiarism detection is one of the critical problems in the field of text mining that many researchers are interested in. This issue is considered a serious one in high academic institutions. There exist language-free tools which do not yield reliable results, since they ignore the special features of each language. Considering the paucit...


External Plagiarism Detection based on Human Behaviors in Producing Paraphrases of Sentences in English and Persian Languages

With the advent of the internet and easy access to digital libraries, plagiarism has become a major issue. Applying search engines is one of the plagiarism detection techniques that converts plagiarism patterns to search queries. Generating suitable queries is the heart of this technique and existing methods suffer from lack of producing accurate queries, Precision and Speed of retrieved result...


Exploring Fingerprinting as External Plagiarism Detection Method

This paper outlines the main approach and the general design of the plagiarism detection prototype application we have developed to take part in the 2nd International Plagiarism Detection Competition. The developed system is a part of the larger application used at Zhytomyr State University as CMS Thesis Storage and comes under the title "Plagiarism Detector Accumulator". This application proto...


A Cluster-Based Plagiarism Detection Method - Lab Report for PAN at CLEF 2010

In this paper we describe a cluster-based plagiarism detection method, which we have used in the learning management system of SCUT to detect plagiarism in network engineering courses. We also used this method to detect external plagiarism in the PAN-10 competition. The method is divided into three steps: the first step, called pre-selecting, is to narrow the scope of detection ...


Publication date: 2012