Improving the Tokenisation of Identifier Names
Authors
Abstract
Identifier names are the main vehicle for semantic information during program comprehension. Identifier names are tokenised into their semantic constituents by tools supporting program comprehension tasks, including concept location and requirements traceability. We present an approach to the automated tokenisation of identifier names that improves on existing techniques in two ways. First, it improves tokenisation accuracy for identifier names of a single case and those containing digits. Second, performance gains over existing techniques are achieved using smaller oracles. Accuracy was evaluated by comparing the output of our algorithm to manual tokenisations of 28,000 identifier names drawn from 60 open source Java projects totalling 16.5 MSLOC. We also undertook a study of the typographical features of identifier names (single case, use of digits, etc.) per object-oriented construct (class names, method names, etc.), thus providing an insight into naming conventions in industrial-scale object-oriented code. Our tokenisation tool and datasets are publicly available.
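To make the task concrete, the following is a minimal sketch of identifier tokenisation. It is not the algorithm described in the abstract. It is a simple baseline that splits on separators and internal capitalisation, then tries a greedy longest-prefix split of single-case tokens against a word list. The names `ORACLE`, `split_on_conventions`, `split_single_case` and `tokenise` are hypothetical, and the tiny word list stands in for the much larger oracles a real tool would use. Digits are left attached to their neighbouring letters, which is one of the ambiguous cases the abstract mentions.

```python
import re

# Hypothetical toy oracle; a real tool would load a far larger word list.
ORACLE = {"get", "set", "max", "size", "output", "stream", "file", "name",
          "http", "request"}

# Zero-width boundaries: lower/digit-to-upper (fooBar) and the end of an
# acronym followed by a capitalised word (HTTPRequest -> HTTP, Request).
_CAMEL = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def split_on_conventions(identifier):
    """Split on '_' and '$' separators, then on internal capitalisation."""
    tokens = []
    for part in re.split(r"[_$]+", identifier):
        if part:
            tokens.extend(_CAMEL.split(part))
    return tokens


def split_single_case(token, oracle=ORACLE):
    """Greedy longest-prefix split of a single-case token.

    Returns [token] unchanged if the whole token cannot be covered
    by words from the oracle.
    """
    lower = token.lower()
    pieces, start = [], 0
    while start < len(lower):
        for end in range(len(lower), start, -1):
            if lower[start:end] in oracle:
                pieces.append(token[start:end])
                start = end
                break
        else:
            return [token]  # unrecognised remainder: leave the token whole
    return pieces


def tokenise(identifier):
    """Tokenise an identifier name into candidate constituents."""
    tokens = []
    for tok in split_on_conventions(identifier):
        if tok.islower() or tok.isupper():
            tokens.extend(split_single_case(tok))
        else:
            tokens.append(tok)
    return tokens


if __name__ == "__main__":
    for name in ["getMaxSize", "outputstream", "OUTPUT_STREAM", "HTTPRequest"]:
        print(name, tokenise(name))
```

Greedy matching is chosen here only for brevity. It can pick a wrong split when a long oracle word is a prefix of the token, which is one reason single-case identifiers are hard to tokenise accurately.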
Related resources
A new nomenclature for fungi
Important changes brought about by the Melbourne International Code of Nomenclature for Algae, Fungi, and Plants are briefly reviewed concerning a clarification of the spelling and typification of sanctioned fungal names, the recognition of electronic publication for the validity of nomenclatural novelties, permission to use English diagnoses or descriptions for their valid publication, and the requ...
A comparative evaluation of the efficiency of metadata structures in digital identifier systems
The main solution to the problems of persistence and uniqueness in identifying digital objects in a web environment is to use digital identifiers instead of URLs. The basis of this solution is the resolution mechanism used in digital identifier systems. Resolution is the use of indirect names instead of URLs; what worked for the DNS (Domain Name System) in stabilizing i...
Analysing Java identifier names
Identifier names are the principal means of recording and communicating ideas in source code and are a significant source of information for software developers and maintainers, and the tools that support their work. This research aims to increase understanding of identifier name content types — words, abbreviations, etc. — and phrasal structures — noun phrases, verb phrases, etc. — by improvin...
Linguistic-prosodic processing for text-to-speech synthesis in Italian
The linguistic-prosodic processing applied to text-to-speech synthesis in Italian is described. It proceeds in 5 steps: tokenisation and normalisation of abbreviations, numbers, etc.; part-of-speech tagging, based on function words, terminations and contextual heuristics; shallow parsing, based on a chunk grammar; grapheme-to-phoneme conversion, lexical stress assignment and syllabification by ...
Using Workflows to Explore and Optimise Named Entity Recognition for Chemistry
Chemistry text mining tools should be interoperable and adaptable regardless of system-level implementation, installation or even programming issues. We aim to abstract the functionality of these tools from the underlying implementation via reconfigurable workflows for automatically identifying chemical names. To achieve this, we refactored an established named entity recogniser (in the chemist...