Continuous N-gram Representations for Authorship Attribution
Abstract
This paper presents work on using continuous representations for authorship attribution. In contrast to previous work, which uses discrete feature representations, our model learns continuous representations for n-gram features via a neural network jointly with the classification layer. Experimental results demonstrate that the proposed model outperforms the state of the art on two datasets, while producing comparable results on the remaining two.
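As a rough illustration of what "learning continuous n-gram representations jointly with the classification layer" can look like, the sketch below trains an embedding table and a softmax layer together with plain SGD. It is not the paper's architecture, which the abstract does not specify: the averaging of n-gram embeddings, the use of character trigrams, the class name `NgramEmbeddingClassifier`, the hyperparameters, and the toy training sentences are all assumptions made for this example.

```python
import math
import random


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramEmbeddingClassifier:
    """Averaged n-gram embeddings followed by a softmax layer (hypothetical)."""

    def __init__(self, vocab, num_authors, dim=8, seed=0):
        rng = random.Random(seed)
        self.index = {g: i for i, g in enumerate(sorted(vocab))}
        self.dim = dim
        # Embedding table E: one continuous vector per n-gram.
        self.E = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                  for _ in self.index]
        # Classification layer W: one weight vector per author.
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                  for _ in range(num_authors)]

    def _ids(self, text):
        return [self.index[g] for g in char_ngrams(text) if g in self.index]

    def _forward(self, ids):
        n = len(ids) or 1
        # Document vector: mean of the embeddings of its n-grams.
        h = [sum(self.E[i][k] for i in ids) / n for k in range(self.dim)]
        scores = [sum(w[k] * h[k] for k in range(self.dim)) for w in self.W]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return h, [e / total for e in exps]

    def predict(self, text):
        _, probs = self._forward(self._ids(text))
        return probs.index(max(probs))

    def train_step(self, text, label, lr=0.5):
        ids = self._ids(text)
        if not ids:
            return
        h, probs = self._forward(ids)
        # Cross-entropy gradient w.r.t. the scores: probs - one_hot(label).
        delta = [p - (1.0 if a == label else 0.0)
                 for a, p in enumerate(probs)]
        # Gradient w.r.t. the document vector, taken before W changes.
        grad_h = [sum(delta[a] * self.W[a][k] for a in range(len(self.W)))
                  for k in range(self.dim)]
        # Update the classifier and the embeddings in the same step.
        for a, w in enumerate(self.W):
            for k in range(self.dim):
                w[k] -= lr * delta[a] * h[k]
        for i in ids:  # each occurrence contributes 1/len(ids) to h
            for k in range(self.dim):
                self.E[i][k] -= lr * grad_h[k] / len(ids)


# Toy data invented for this sketch: two "authors" with distinct styles.
train = [
    ("the cat sat on the mat and the cat slept", 0),
    ("the cat ran to the mat and sat there", 0),
    ("whereupon thou shalt verily go hence", 1),
    ("verily thou shalt not tarry hence", 1),
]
vocab = {g for text, _ in train for g in char_ngrams(text)}
model = NgramEmbeddingClassifier(vocab, num_authors=2)
for _ in range(300):
    for text, label in train:
        model.train_step(text, label)
predictions = [model.predict(text) for text, _ in train]
print(predictions)
```

The point of the sketch is the single `train_step`: the same loss gradient updates both `W` and `E`, so the n-gram vectors are shaped by the attribution task rather than fixed in advance as counts or one-hot indicators.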
Similar resources
N-gram-based Author Profiles for Authorship Attribution
We present a novel method for computer-assisted authorship attribution based on character-level n-gram author profiles, which is motivated by an almost-forgotten, pioneering method from 1976. The existing approaches to automated authorship attribution implicitly build author profiles as vectors of feature weights, as language models, or similar. Our approach is based on byte-level n-grams, it is l...
Authorship Attribution in Portuguese Using Character N-grams
For the Authorship Attribution (AA) task, character n-grams are considered among the best predictive features. In the English language, it has also been shown that some types of character n-grams perform better than others. This paper tackles the AA task in Portuguese by examining the performance of different types of character n-grams, and various combinations of them. The paper also experimen...
Authorship Attribution in Bengali Language
We describe authorship attribution of Bengali literary text. Our contributions include a new corpus of 3,000 passages written by three Bengali authors, an end-to-end system for authorship classification based on character n-grams, feature selection for authorship attribution, feature ranking and analysis, and a learning curve to assess the relationship between amount of training data and test accu...
N-Gram Based Authorship Attribution in Urdu Poetry
Authorship attribution is an interesting problem in Computational Linguistics. Traditional author recognition systems for electronic text rely on techniques which train the system to the specific vocabulary and writing style of the writer and apply stochastic methods to judge a given text at byte, letter or word levels. In this paper we have developed a software system to apply one existing and...
On the Robustness of Authorship Attribution Based on Character N-gram Features
A number of independent authorship attribution studies have demonstrated the effectiveness of character n-gram features for representing the stylistic properties of text. However, the vast majority of these studies examined the simple case where the training and test corpora are similar in terms of genre, topic, and distribution of the texts. Hence, there are doubts whether such a simple and lo...