Content-based Plagiarism Detection in Korean Document Using Ferret’s Trigram

Authors

  • Byung Ryul Ahn
  • Won-gyum Kim
  • Moon-Hyun Kim
Abstract

Document plagiarism is the unauthorized use of another author's original document without acknowledging the source. With the growth of the Internet, the volume of readily accessible digital information has increased massively, and detecting plagiarism manually has become very expensive in both time and effort. Although many copy detection techniques for digital documents have already been published, their performance is still unsatisfactory. This paper proposes a content-based copy detection method for Hangul (Korean character) documents that improves the detection accuracy of the existing Ferret’s trigram approach. The proposed system identifies plagiarism using two elements: first, the number of trigrams that match those in the original document, and second, a weighting factor applied to trigrams that match sequentially. We show that weighting the results according to the degree of trigram matching improves the accuracy of similarity detection in Hangul documents.
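The two elements named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formula, so everything specific here is an assumption: documents are tokenized on whitespace, `resemblance` is the set-overlap ratio usually associated with Ferret-style trigram comparison, and `weighted_score` with its `run_bonus` parameter is a hypothetical way to reward trigrams that match one after another. It is not the authors' actual weighting.

```python
def word_trigrams(text):
    """Return the list of overlapping word trigrams (assumes whitespace tokens)."""
    tokens = text.split()
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


def resemblance(doc_a, doc_b):
    """Shared distinct trigrams divided by all distinct trigrams in both documents."""
    set_a, set_b = set(word_trigrams(doc_a)), set(word_trigrams(doc_b))
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def weighted_score(original, suspect, run_bonus=0.5):
    """Hypothetical weighting: each matching trigram counts 1, plus a bonus
    that grows with the length of the current run of consecutive matches."""
    original_set = set(word_trigrams(original))
    suspect_trigrams = word_trigrams(suspect)
    n = len(suspect_trigrams)
    if n == 0:
        return 0.0
    score, run = 0.0, 0
    for trigram in suspect_trigrams:
        if trigram in original_set:
            run += 1
            score += 1.0 + run_bonus * (run - 1)
        else:
            run = 0  # a non-matching trigram breaks the sequence
    # Normalize by the score of a suspect text that matches in one unbroken run.
    max_score = n + run_bonus * n * (n - 1) / 2
    return score / max_score


if __name__ == "__main__":
    source = "a b c d e f g h"
    copied = "a b c d e x y z"
    print(resemblance(source, copied))
    print(weighted_score(source, copied))
```

Under these assumptions a long unbroken run of matches scores higher than the same number of scattered matches, which is the intuition behind weighting sequential trigrams. For Hangul text the tokenization unit (space-delimited word or syllable) would need to follow the paper's own choice.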

Similar resources

Plagiarism checker for Persian (PCP) texts using hash-based tree representative fingerprinting

With due respect to the authors’ rights, plagiarism detection is one of the critical problems in the field of text mining that many researchers are interested in. This issue is considered a serious one in high academic institutions. There exist language-free tools which do not yield reliable results, since they ignore the special features of each language. Considering the paucit...

Intrinsic Plagiarism Detection Using Character Trigram Distance Scores - Notebook for PAN at CLEF 2011

In this paper, we describe a novel approach to intrinsic plagiarism detection. Each suspicious document is divided into a series of consecutive, potentially overlapping ‘windows’ of equal size. These are represented by vectors containing the relative frequencies of a predetermined set of high-frequency character trigrams. Subsequently, a distance matrix is set up in which each of the document’s...

Analyzing Similarity in Mathematical Content To Enhance the Detection of Academic Plagiarism

Despite the effort put into the detection of academic plagiarism, it continues to be a ubiquitous problem spanning all disciplines. Various tools have been developed to assist human inspectors by automatically identifying suspicious documents. However, to our knowledge currently none of these tools use mathematical content for their analysis. This is problematic, because mathematical content po...

Detection of Plagiarism in Student Essays

This paper presents two methods for automatic detection of plagiarism in student essays, using Dutch text corpora to show their effectiveness. The first method is based on measuring the overlap in word trigrams between two essays, excluding all trigrams from the assignment text. This method proves efficient and robust, but relies on the availability of the plagiarized source. The second method ...

English-Persian Plagiarism Detection based on a Semantic Approach

Plagiarism which is defined as “the wrongful appropriation of other writers’ or authors’ works and ideas without citing or informing them” poses a major challenge to knowledge spread publication. Plagiarism has been placed in four categories of direct, paraphrasing (rewriting), translation, and combinatory. This paper addresses translational plagiarism which is sometimes referred to as cross-li...

Publication date: 2013