Paragraph Clustering for Intrinsic Plagiarism Detection using a Stylistic Vector Space Model with Extrinsic Features

نویسندگان

  • Julian Brooke
  • Graeme Hirst
چکیده

Our approach to the task of intrinsic plagiarism detection uses a vectorspace model which eschews surface features in favor of richer extrinsic features, including those based on latent semantic analysis in a larger external corpus. We posit that the popularity and success of surface n-gram features is mostly due to the topic-biased nature of current artificial evaluations, a problem which unfortunately extends to the present PAN evaluation. One interesting of aspect of our approach is our way of dealing with small, imbalanced span sizes; we improved performance considerably in our development evaluation by countering these effect using the expected difference of sums of random variables.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

External and Intrinsic Plagiarism Detection Using Vector Space Models

Plagiarism detection can be divided in external and intrinsic methods. Naive external plagiarism analysis suffers from computationally demanding full nearest neighbor searches within a reference corpus. We present a conceptually simple space partitioning approach to achieve search times sub linear in the number of reference documents, trading precision for speed. We focus on full duplicate sear...

متن کامل

External & Intrinsic Plagiarism Detection: VSM & Discourse Markers based Approach - Notebook for PAN at CLEF 2011

This paper aims to explain the performance of plagiarism detection system which can detect External as well as Intrinsic Plagiarism in text. It reports the results on PAN-PC-2011 test corpus. We investigated Vector Space Model based techniques for detecting external plagiarism cases and discourse markers based features to detect intrinsic plagiarism cases.

متن کامل

Unsupervised Stylistic Segmentation of Poetry with Change Curves and Extrinsic Features

The identification of stylistic inconsistency is a challenging task relevant to a number of genres, including literature. In this work, we carry out stylistic segmentation of a well-known poem, The Waste Land by T.S. Eliot, which is traditionally analyzed in terms of numerous voices which appear throughout the text. Our method, adapted from work in topic segmentation and plagiarism detection, p...

متن کامل

Intrinsic Plagiarism Detection Using Character n-gram Profiles

The task of intrinsic plagiarism detection deals with cases where no reference corpus is available and it is exclusively based on stylistic changes or inconsistencies within a given document. In this paper a new method is presented that attempts to quantify the style variation within a document using character n-gram profiles and a style change function based on an appropriate dissimilarity mea...

متن کامل

Using Clustering to Identify Outlier Chunks of Text - Notebook for PAN at CLEF 2011

Intrinsic plagiarism detection is a sub-task of authorship identification in which outlier chunks must be detected solely on the basis of stylistic differences from the main body of the text. We present a first attempt at utilizing words that appear infrequently in a text as stylistic markers for distinguishing outlier chunks in the text. In the first phase of our method we cluster chunks of te...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012