Evaluating the Performance of LSA for Source-code Plagiarism Detection

Authors

  • Georgina Cosma
  • Mike Joy
Abstract

Latent Semantic Analysis (LSA) is an intelligent information retrieval technique that uses mathematical algorithms to analyze large corpora of text and reveal the underlying semantic information of documents. LSA is a highly parameterized statistical method, and its effectiveness is driven by the setting of its parameters, which are adjusted based on the task to which it is applied. This paper discusses and evaluates the importance of parameterization for LSA-based similarity detection of source-code documents, and the applicability of LSA as a technique for source-code plagiarism detection when its parameters are appropriately tuned. The parameters involve preprocessing techniques, weighting approaches, and parameter tweaking inherent to LSA processing, in particular the choice of dimensions for the step of reducing the original post-SVD matrix. The experiments revealed that the best retrieval performance is obtained after removing in-code comments (Java comment blocks) and applying a combined weighting scheme based on term frequencies, normalized term frequencies, and a cosine-based document normalization. Furthermore, the use of similarity thresholds (instead of mere rankings) requires the use of a higher number of dimensions.
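
The preprocessing and weighting steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it covers only comment removal, raw term frequencies, and cosine document normalization, and it omits the SVD and dimensionality-reduction step that defines LSA proper. The function names, the regular expressions, the sample files, and the 0.5 threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Matches /* ... */ block comments (including Javadoc) and // line comments.
# Simplification: comment markers inside string literals are not handled.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
# Identifiers and keywords only; numeric literals and operators are ignored.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def strip_java_comments(source):
    """Remove block and line comments from Java source text."""
    return COMMENT_RE.sub(" ", source)


def tf_vector(source):
    """Raw term frequencies scaled to unit Euclidean length (cosine normalization)."""
    counts = Counter(token.lower() for token in TOKEN_RE.findall(source))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {term: c / norm for term, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


def rank_by_similarity(query, corpus):
    """Return (name, similarity) pairs, most similar first."""
    q = tf_vector(strip_java_comments(query))
    scored = [
        (name, cosine(q, tf_vector(strip_java_comments(src))))
        for name, src in corpus.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical three-file corpus and query, for illustration only.
corpus = {
    "A.java": "/* sums */ int total = 0; for (int i = 0; i < n; i++) total += i; // loop",
    "B.java": "int sum = 0; for (int j = 0; j < n; j++) sum += j;",
    "C.java": "String name = reader.readLine(); System.out.println(name);",
}
query = "int total = 0; for (int i = 0; i < n; i++) total += i;"

ranked = rank_by_similarity(query, corpus)

# Ranking-based retrieval keeps the top entries; threshold-based retrieval
# keeps every file whose similarity reaches a cut-off (0.5 is an assumption).
THRESHOLD = 0.5
flagged = [name for name, score in ranked if score >= THRESHOLD]

for name, score in ranked:
    print(f"{name}: {score:.3f}")
print("flagged:", flagged)
```

The last lines contrast the two retrieval modes the abstract mentions: a ranking only orders the files, whereas a threshold depends on the absolute similarity values.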

Similar resources

COAT: Code ObfuscAtion Tool to evaluate the performance of code plagiarism detection tools

There exist many plagiarism detection tools to uncover plagiarized codes by analyzing the similarity of source codes. To measure how reliable those plagiarism detection tools are, we developed a tool named Code ObfuscAtion Tool (COAT) that takes a program source code as input and produces another source code that is exactly equivalent to the input source code in their functional behaviors but w...

PDetect: A Clustering Approach for Detecting Plagiarism in Source Code Datasets

Efficient detection of plagiarism in programming assignments of students is of a great importance to the educational procedure. This paper presents a clustering oriented approach for facing the problem of source code plagiarism. The implemented software, called PDetect, accepts as input a set of program sources and extracts subsets (the clusters of plagiarism) such that each program within a pa...

Overview and Comparison of Plagiarism Detection Tools

In this paper we have done an overview of effective plagiarism detection methods that have been used for natural language text plagiarism detection, external plagiarism detection, clustering-base plagiarism detection and some methods used in code source plagiarism detection, also we have done a comparison between five of software used for textual plagiarism detection: (PlagAware, PlagScan, Chec...

Cross-Language Plagiarism Detection System Using Latent Semantic Analysis and Learning Vector Quantization

Computerized cross-language plagiarism detection has recently become essential. With the scarcity of scientific publications in Bahasa Indonesia, many Indonesian authors frequently consult publications in English in order to boost the quantity of scientific publications in Bahasa Indonesia (which is currently rising). Due to the syntax disparity between Bahasa Indonesia and English, most of the...

Extending computer-assisted text analysis techniques to the detection of source code plagiarism and collusion: assisting manual inspection

Traditionally, a distinction has been drawn between software and services to detect text-based plagiarism or collusion, and products to detect such practices in computer programs. The Joint Information Services Committee (JISC) even commissioned two separate studies, the Technical Review of Plagiarism Detection Report (Bull et al, 2001) which focused on text-based assignments, and Source Code P...

Journal:
  • Informatica (Slovenia)

Volume 36, Issue

Pages  -

Publication date: 2012