Authorship Verification with Entity Coherence and Other Rich Linguistic Features - Notebook for PAN at CLEF 2013
Abstract
We adopt Koppel et al.’s unmasking approach [5] as the main framework of our authorship verification system. We enrich Koppel et al.’s original word frequency features with a novel set of coherence features, derived from our earlier work [2], together with a full set of stylometric features. For texts written in languages other than English, some stylometric features are unavailable due to the lack of appropriate NLP tools, and the coherence features are derived from translations produced by the Google Translate service. Evaluated on the training corpus, we achieve an overall accuracy of 65.7%: 100.0% for both English and Spanish texts, but only 40% for Greek texts. Evaluated on the test corpus, we achieve an overall accuracy of 68.2%, with roughly the same performance across the three languages.
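The unmasking idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' system: it uses only word frequency features (no coherence or stylometric features), and it assumes a simple nearest-centroid linear classifier with leave-one-out evaluation in place of the classifier used in the original unmasking work. All function names and parameters (`unmasking_curve`, `chunk_size`, `n_features`, `drop`, `rounds`) are hypothetical. The sketch splits two texts into chunks, measures how well the chunks can be told apart, removes the most discriminating words, and repeats, producing an accuracy curve.

```python
from collections import Counter


def chunk_words(text, chunk_size):
    """Split a text into consecutive, non-overlapping chunks of chunk_size words."""
    words = text.lower().split()
    return [words[i:i + chunk_size]
            for i in range(0, len(words) - chunk_size + 1, chunk_size)]


def featurize(chunk, vocab):
    """Relative frequency of each vocabulary word within one chunk."""
    counts = Counter(chunk)
    return [counts[w] / len(chunk) for w in vocab]


def centroid(rows):
    """Column-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def loo_accuracy(xs_a, xs_b, active):
    """Leave-one-out accuracy of a nearest-centroid classifier
    restricted to the feature indices in `active`."""
    data = [(x, 0) for x in xs_a] + [(x, 1) for x in xs_b]
    correct = 0
    for i, (x, label) in enumerate(data):
        train_a = [v for j, (v, l) in enumerate(data) if j != i and l == 0]
        train_b = [v for j, (v, l) in enumerate(data) if j != i and l == 1]
        ca, cb = centroid(train_a), centroid(train_b)
        dist_a = sum((x[k] - ca[k]) ** 2 for k in active)
        dist_b = sum((x[k] - cb[k]) ** 2 for k in active)
        predicted = 0 if dist_a <= dist_b else 1
        correct += predicted == label
    return correct / len(data)


def unmasking_curve(text_a, text_b, chunk_size=50, n_features=20, drop=2, rounds=5):
    """Return the accuracy after each round of removing the `drop`
    most discriminating word features (assumed simplification of unmasking)."""
    chunks_a = chunk_words(text_a, chunk_size)
    chunks_b = chunk_words(text_b, chunk_size)
    if len(chunks_a) < 2 or len(chunks_b) < 2:
        raise ValueError("each text must yield at least two chunks")

    # Vocabulary: the most frequent words over both texts.
    freq = Counter(w for chunk in chunks_a + chunks_b for w in chunk)
    vocab = [w for w, _ in freq.most_common(n_features)]
    xs_a = [featurize(c, vocab) for c in chunks_a]
    xs_b = [featurize(c, vocab) for c in chunks_b]

    active = list(range(len(vocab)))
    curve = []
    for _ in range(rounds):
        if not active:
            break
        curve.append(loo_accuracy(xs_a, xs_b, active))
        # For a nearest-centroid classifier, a feature's influence grows
        # with the gap between the two class means on that feature.
        ca, cb = centroid(xs_a), centroid(xs_b)
        ranked = sorted(active, key=lambda k: abs(cb[k] - ca[k]), reverse=True)
        removed = set(ranked[:drop])
        active = [k for k in active if k not in removed]
    return curve
```

In the unmasking framework, a curve that degrades quickly as features are removed suggests the two texts share an author, while a curve that stays high suggests different authors. Turning the curve into a same-author decision requires a further step that this sketch leaves out.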
Similar Resources
Authorship Identification in Large Email Collections: Experiments Using Features that Belong to Different Linguistic Levels - Notebook for PAN at CLEF 2011
The aim of this paper is to explore the usefulness of features from different linguistic levels for email authorship identification. Using various email datasets provided by the PAN’11 lab, we tested several feature groups in both authorship attribution and authorship verification subtasks. The selected feature groups combined with Regularized Logistic Regression and One-Class SVM machine learni...
Authorship Verification Using the Impostors Method - Notebook for PAN at CLEF 2013
This paper describes the evaluation of the GenIM method, which participated in the PAN'13 authorship identification competition. The approach is based on comparing the similarity between the given documents and a number of external (impostor) documents, so that documents can be classified as having been written by the same author if they are shown to be more similar to each other than to the ...
Authorship Verification via k-Nearest Neighbor Estimation - Notebook for PAN at CLEF 2013
In this paper we describe our k-Nearest Neighbor (k-NN) based Authorship Verification method for the Author Identification (AI) task of the PAN 2013 challenge. The method follows an ensemble classification technique based on the combination of suitable feature categories. For each chosen feature category we apply a k-NN classifier to calculate a style deviation score between the training docume...
A Multitude of Linguistically-rich Features for Authorship Attribution - Notebook for PAN at CLEF 2011
This paper reports on the procedure and learning models we adopted for the ‘PAN 2011 Author Identification’ challenge targeting real-world email messages. The novelty of our approach lies in a design which combines shallow characteristics of the emails (word and trigram frequencies) with a large number of ad hoc linguistically-rich features addressing different language levels. For the autho...
Ensembles of Proximity-Based One-Class Classifiers for Author Verification - Notebook for PAN at CLEF 2014
We use ensembles of proximity-based one-class classifiers for the authorship verification task. For each document of known authorship, the one-class classifiers compare the dissimilarity between this document and the most dissimilar other document of the same authorship to the dissimilarity between this document and the questioned document. As the dissimilarity measure between documents we use Com...