Kernel Methods and String Kernels for Authorship Analysis

نویسندگان

  • Marius Popescu
  • Cristian Grozea
چکیده

This paper presents our approach to the PAN 2012 Traditional Authorship Attribution tasks and the Sexual Predator Identification task. We approached these tasks with machine learning methods that work at the character level. More precisely, we treated texts as just sequences of symbols (strings) and used string kernels in conjunction with different kernel-based learning methods: supervised and unsupervised. The results were extremely good, we ranked first in most problem and overall in the traditional authorship attribution task, according to the evaluation provided by the organizers.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Fast Kernel Methods for SVM Sequence Classifiers

In this work we study string kernel methods for sequence analysis and focus on the problem of species-level identification based on short DNA fragments known as barcodes. We introduce efficient sorting-based algorithms for exact string k-mer kernels and then describe a divide-and-conquer technique for kernels with mismatches. Our algorithm for the mismatch kernel matrix computation improves cur...

متن کامل

Kernel-based machine learning for fast text mining in R

Recent advances in the field of kernel-based machine learning methods allow fast processing of text using string kernels utilizing suffix arrays. kernlab provides both kernel methods’ infrastructure and a large collection of already implemented algorithms and includes an implementation of suffix-array-based string kernels. Along with the use of the text mining infrastructure provided by tm thes...

متن کامل

String kernels and similarity measures for information retrieval

Measuring a similarity between two strings is a fundamental step in many applications in areas such as text classification and information retrieval. Lately, kernel-based methods have been proposed for this task, both for text and biological sequences. Since kernels are inner products in a feature space, they naturally induce similarity measures. Information-theoretical approaches have also bee...

متن کامل

Kernels and Similarity Measures for Text Classification

Measuring similarity between two strings is a fundamental step in text classification and other problems of information retrieval. Recently, kernel-based methods have been proposed for this task; since kernels are inner products in a feature space, they naturally induce similarity measures. Information theoretic (dis)similarities have also been the subject of recent research. This paper describ...

متن کامل

A Randomized String Kernel and Its Application to RNA Interference

String kernels directly model sequence similarities without the necessity of extracting numerical features in a vector space. Since they better capture complex traits in the sequences, string kernels often achieve better prediction performance. RNA interference is an important biological mechanism with many therapeutical applications, where strings can be used to represent target messenger RNAs...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012