Using latent semantic analysis to identify similarities in source code to support program understanding

Authors

  • Jonathan I. Maletic
  • Andrian Marcus
Abstract

The paper describes the results of applying Latent Semantic Analysis (LSA), an advanced information retrieval method, to program source code and associated documentation. Latent Semantic Analysis is a corpus-based statistical method for inducing and representing aspects of the meanings of words and passages (of natural language) as reflected in their usage. This methodology is assessed for application to the domain of software components (i.e., source code and its accompanying documentation). Here LSA is used as the basis for clustering software components. This clustering is used to assist in the understanding of a nontrivial software system, namely a version of Mosaic. Applying Latent Semantic Analysis to source code and internal documentation in support of program understanding is a new application of this method and a departure from its normal application domain of natural language.
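The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the four component snippets are invented for illustration, the term-by-document matrix uses raw counts with no weighting, the dimensionality `k=2` is an arbitrary choice, and NumPy is assumed to be available. The sketch only computes pairwise similarities between components in the reduced LSA space; a clustering step would then group components whose similarity is high.

```python
import re
import numpy as np


def tokenize(text):
    """Split source text (identifiers and comments) into lowercase word tokens."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def lsa_similarity(documents, k=2):
    """Return a document-by-document cosine similarity matrix in a k-dimensional LSA space."""
    tokenized = [tokenize(d) for d in documents]
    vocab = sorted({t for doc in tokenized for t in doc})
    index = {t: i for i, t in enumerate(vocab)}

    # Term-by-document matrix of raw counts (an assumption; weighting schemes are common).
    A = np.zeros((len(vocab), len(documents)))
    for j, doc in enumerate(tokenized):
        for t in doc:
            A[index[t], j] += 1.0

    # Truncated singular value decomposition: keep the k largest singular values.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one row per document

    # Cosine similarity between the reduced document vectors.
    norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = doc_vecs / norms
    return unit @ unit.T


# Hypothetical software components: two about fetching URLs, two about drawing widgets.
components = [
    "int open_url(char *url) /* open a url and fetch the document */",
    "void fetch_document(char *url) /* fetch document from url */",
    "void draw_widget(Widget w) /* draw the widget on the screen */",
    "void redraw_screen(Widget w) /* redraw widget and screen */",
]

similarity = lsa_similarity(components, k=2)
print(np.round(similarity, 2))
```

In this toy corpus the two URL-related components should score as more similar to each other than to the drawing components, which is the kind of signal a clustering step could build on.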

Similar resources

Support for Software Maintenance Using Latent Semantic Analysis

The paper describes the results of applying semantic (versus structural) methods to the problems of software maintenance and program comprehension. Here, the focus is on tools to assist programmers in understanding large legacy software systems. The method applied, Latent Semantic Analysis, is a corpus-based statistical method for inducing and representing aspects of the meanings of words and passa...

Semantic clustering: Identifying topics in source code

Many of the existing approaches in Software Comprehension focus on program structure or external documentation. However, by analyzing only formal information, the informal semantics contained in the vocabulary of source code are overlooked. To understand software as a whole, we need to enrich software analysis with the developer knowledge hidden in the code naming. This paper proposes the use...

Automatic Software Clustering via Latent Semantic Analysis

This paper appears in the 14th IEEE ASE'99, Cocoa Beach, FL, Oct. 12-15, pp. 251-254. Abstract: The paper describes the initial results of applying Latent Semantic Analysis (LSA) to program source code and associated documentation. Latent Semantic Analysis is a corpus-based statistical method for inducing and representing aspects of the meanings of words and passages (of natural language) reflecti...

Identification of High-Level Concept Clones in Source Code

Source code duplication occurs frequently within large software systems. Pieces of source code, functions, and data types are often duplicated in part, or in whole, for a variety of reasons. Programmers may simply be reusing a piece of code via copy and paste or they may be “reinventing the wheel”. Previous research on the detection of clones is mainly focused on identifying pieces of code with...

Using Traceability Links to Assess and Maintain the Quality of Software Documentation

The paper proposes an approach for using traceability links to assess and maintain the quality of software documentation. Our position is that quality documentation should accurately reflect the structure of the source code; hence elements of documentation that link to strongly coupled elements of the source code should also be strongly related. We use latent semantic indexing (LSI) to compute ...

Publication date: 2000