DKPro Similarity: An Open Source Framework for Text Similarity
Abstract
We present DKPro Similarity, an open source framework for text similarity. Our goal is to provide a comprehensive repository of text similarity measures which are implemented using standardized interfaces. DKPro Similarity comprises a wide variety of measures, ranging from ones based on simple n-grams and common subsequences to high-dimensional vector comparisons and structural, stylistic, and phonetic measures. To promote the reproducibility of experimental results and to provide reliable, permanent experimental conditions for future studies, DKPro Similarity additionally comes with a set of full-featured experimental setups which can be run out of the box and which future systems can build upon.
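To make the simplest measure families named in the abstract concrete, the following is a minimal Python sketch of a character n-gram overlap measure and a longest-common-subsequence measure. It is not DKPro Similarity code and does not use its API (the framework itself is Java-based); the function names `char_ngrams`, `ngram_jaccard`, `lcs_length`, and `lcs_similarity`, the default n-gram size, and the normalization choices are all assumptions made for illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of `text` (hypothetical helper)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_jaccard(a, b, n=3):
    """Jaccard overlap of the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        # Assumption: two texts too short to yield any n-gram count as identical.
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """LCS length normalized by the longer string (one of several possible normalizations)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))
```

Both functions return a score in the range 0 to 1, so they could sit behind a common "compare two texts, return a number" interface, which is the kind of standardization the abstract describes.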
Similar resources
DKPro Keyphrases: Flexible and Reusable Keyphrase Extraction Experiments
DKPro Keyphrases is a keyphrase extraction framework based on UIMA. It offers a wide range of state-of-the-art keyphrase extraction approaches. At the same time, it is a workbench for developing new extraction approaches and evaluating their impact. DKPro Keyphrases is publicly available under an open-source license.
CITREC: An Evaluation Framework for Citation-Based Similarity Measures based on TREC Genomics and PubMed Central
Citation-based similarity measures such as Bibliographic Coupling and Co-Citation are an integral component of many information retrieval systems. However, comparisons of the strengths and weaknesses of measures are challenging due to the lack of suitable test collections. This paper presents CITREC, an open evaluation framework for citation-based and text-based similarity measures. CITREC prep...
DKPro TC: A Java-based Framework for Supervised Learning Experiments on Textual Data
We present DKPro TC, a framework for supervised learning experiments on textual data. The main goal of DKPro TC is to enable researchers to focus on the actual research task behind the learning problem and let the framework handle the rest. It enables rapid prototyping of experiments by relying on an easy-to-use workflow engine and standardized document preprocessing based on the Apache Unstruc...
MayoClinicNLP-CORE: Semantic representations for textual similarity
The Semantic Textual Similarity (STS) task examines semantic similarity at the sentence level. We explored three representations of semantics (implicit or explicit): named entities, semantic vectors, and structured vectorial semantics. From a DKPro baseline, we also performed feature selection and used source-specific linear regression models to combine our features. Our systems placed 5th, 6th, an...
DKPro WSD: A Generalized UIMA-based Framework for Word Sense Disambiguation
Implementations of word sense disambiguation (WSD) algorithms tend to be tied to a particular test corpus format and sense inventory. This makes it difficult to test their performance on new data sets, or to compare them against past algorithms implemented for different data sets. In this paper we present DKPro WSD, a freely licensed, general-purpose framework for WSD which is both modular and ...