Copy detection in Chinese documents using the Ferret: a report on experiments
Abstract
The Ferret copy detector has been used for some years on English texts to find plagiarism in large collections of students’ coursework. This article reports on extending its application to Chinese, which differs from English in several respects: the sequence of characters that makes up a Chinese text has no marked word boundaries, the Chinese “alphabet” (the number of different characters) is vast, and the characters are represented with multi-byte encoding. We discuss issues of representation, focus on the effectiveness of a sub-symbolic approach, and show how the Ferret can circumvent the classic problem of finding word boundaries with an automated system. Corpora of students’ coursework from two Chinese universities have been collected, and we apply Ferret to them to investigate the detection of plagiarism. Our experiments show that Ferret can find both artificially constructed plagiarism and actually occurring, previously undetected plagiarism. We also investigate the parameters of the system and report on typical optimum settings. The experiments show that Ferret works well on Chinese texts and achieves consistent performance. The investigation into the representation of written Chinese is likely to be of use in other language processing tasks.
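The idea of comparing texts below the word level, so that no word segmentation is needed, can be illustrated with a minimal sketch. It assumes overlapping character trigrams and a Jaccard-style resemblance score; the abstract does not state Ferret's actual tokenisation or metric, so the function names, the choice of n = 3, and the sample sentences are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: NOT the Ferret implementation.
# Assumption: texts are compared as sets of overlapping character n-grams,
# and similarity is the Jaccard ratio of shared to total distinct n-grams.

def char_ngrams(text, n=3):
    """Return the set of overlapping character n-grams, skipping whitespace.

    Python strings are sequences of Unicode code points, so each Chinese
    character is one item regardless of its multi-byte encoding on disk.
    """
    chars = [c for c in text if not c.isspace()]
    return {"".join(chars[i:i + n]) for i in range(len(chars) - n + 1)}


def resemblance(a, b, n=3):
    """Jaccard resemblance of two texts over their character n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0  # both texts are shorter than n characters
    return len(grams_a & grams_b) / len(union)


# Hypothetical sample sentences (not taken from the article's corpora).
original = "我们今天去图书馆学习"
copied = "他们今天去图书馆学习"      # one character changed
unrelated = "天气很好适合出门散步"

print(resemblance(original, copied))     # high: most trigrams are shared
print(resemblance(original, unrelated))  # no trigrams in common
```

Because the n-grams slide over raw characters, the sketch never has to decide where one word ends and the next begins, which is the property the abstract attributes to a sub-symbolic representation.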
Similar resources
Copy detection in Chinese documents using the Ferret
The Ferret copy detector has been used for some years on English texts to find plagiarism in large collections of students’ coursework. This article reports on extending its application to Chinese. Corpora of coursework from two Chinese universities have been collected, and our experiments show that the Ferret can find both artificially constructed plagiarism and also actually occurring, previo...
Performance evaluation of block-based copy-move image forgery detection algorithms
Copy-move forgery is a particular type of distortion where a part or portions of one image is/are copied to other parts of the same image. This type of manipulation is done to hide a particular part of the image or to copy one or more objects into the same image. There are several methods for detecting copy-move forgery, including block-based and key point-based methods. In this paper, a method...
Content-based Plagiarism Detection in Korean Document Using Ferret’s Trigram
Document plagiarism means the unauthorized use of the original document of another author without recognition of the source. With the development of the Internet, the volume of digital information available and easily accessible has increased massively and detecting plagiarism manually is so expensive in terms of both time and effort. Although many copy detection techniques for digital document...
Development of an Alu-PCR Amplified YAC Probe Suitable for Enumeration of Chromosome 13 on Uncultured Lymphocytes and Amniocytes by Fluorescence in situ Hybridization
The main objective of the present study was to develop an efficient and reliable probe to be routinely used for detection of chromosome 13 copy numbers by interphase FISH. To achieve this, a Yeast Artificial Chromosome (YAC) containing sequences specific for human 13q12 (744D11), was cultured and the whole yeast genomic DNA was extracted. The human insert within the isolated DNA was amplified b...
Detection of Copy-Move Forgery in Digital Images Using Scale Invariant Feature Transform Algorithm and the Spearman Relationship
Increased popularity of digital media and image editing software has led to the spread of multimedia content forgery for various purposes. Undoubtedly, law and forensic medicine experts require trustworthy and non-forged images to enforce rights. Copy-move forgery is the most common type of manipulation of digital images. Copy-move forgery is used to hide an area of the image or to repeat a por...