Evaluating Topic-Based Representations for Author Profiling in Social Media

Authors

  • Miguel Ángel Álvarez Carmona
  • Adrián Pastor López-Monroy
  • Manuel Montes-y-Gómez
  • Luis Villaseñor Pineda
  • Iván V. Meza
Abstract

The Author Profiling (AP) task aims to determine specific demographic characteristics, such as gender and age, by analyzing the language usage of groups of authors. Notwithstanding recent advances in AP, it remains an unsolved problem, especially in social media domains. According to the literature, most work has been devoted to the analysis of useful textual features, the most prominent being those related to content and style. Despite the success of using both kinds of features jointly, most authors agree that content features are much more relevant than style features, which suggests that some profiling aspects, such as age or gender, could be determined solely by observing thematic interests, concerns, moods, or other words related to events of daily life. Additionally, most research uses only traditional representations such as bag-of-words (BoW), rather than more sophisticated representations, to harness content features. In this regard, this paper evaluates the usefulness of some topic-based representations for the AP task. We mainly consider a representation based on Latent Semantic Analysis (LSA), which automatically discovers topics from a given document collection, and a simplified version of the Linguistic Inquiry and Word Count (LIWC), which consists of 41 features representing manually predefined thematic categories. We report promising results on several corpora, showing the effectiveness of the evaluated topic-based representations for AP in social media.
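
To make the idea of manually predefined thematic categories concrete, here is a minimal sketch of a category-count representation in the spirit of the simplified LIWC features mentioned in the abstract. It is an illustration only, not the authors' implementation: the lexicon `TOY_CATEGORIES`, its three categories, and the helper names `tokenize` and `category_features` are all hypothetical, and the real LIWC dictionary and the paper's 41 categories are not reproduced here.

```python
from collections import Counter
import re

# Hypothetical toy lexicon: these categories and words are illustrative
# placeholders, NOT the actual LIWC dictionary or the paper's 41 categories.
TOY_CATEGORIES = {
    "family": {"mom", "dad", "sister", "brother", "kids"},
    "work": {"job", "boss", "office", "meeting"},
    "school": {"class", "homework", "exam", "teacher"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def category_features(text, categories=TOY_CATEGORIES):
    """Return one value per category: the fraction of tokens in that category."""
    tokens = tokenize(text)
    total = len(tokens)
    counts = Counter()
    for token in tokens:
        for name, words in categories.items():
            if token in words:
                counts[name] += 1
    # Fixed (alphabetical) category order so every author gets a comparable vector.
    return [counts[name] / total if total else 0.0 for name in sorted(categories)]


# All posts by one author would normally be concatenated before this step.
posts = "My boss moved the meeting again, so I left the office late"
vector = category_features(posts)  # order: family, school, work
```

Each author is thus mapped to a short, fixed-length vector that a standard classifier could consume. An LSA-based representation differs in that its topics are discovered automatically from the document collection rather than listed by hand.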

Related resources

Subword-based Deep Averaging Networks for Author Profiling in Social Media

Author profiling aims at identifying the authors’ traits on the basis of their sociolect aspect, that is, how language is shared by them. This work describes the system submitted by Symanto Research for the PAN 2017 Author Profiling Shared Task. The current edition is focused on language variety and gender identification on Twitter. We address these tasks by exploiting the morphology and semant...

Improving Feature Representation Based on a Neural Network for Author Profiling in Social Media Texts

We introduce a lexical resource for preprocessing social media data. We show that a neural network-based feature representation is enhanced by using this resource. We conducted experiments on the PAN 2015 and PAN 2016 author profiling corpora and obtained better results when performing the data preprocessing using the developed lexical resource. The resource includes dictionaries of slang words...

Social Media Sources for Personality Profiling

Social media provide a rich source of author-identified text that can be used for personality profiling. However, differences in length and number of entries, syntax, abbreviations, spelling and grammar errors, and topics can affect type and difficulty of preprocessing to extract appropriate text, accuracy of training, time period sampling for training texts, and rate of degradation of accuracy...

Author Profiling en Social Media: Identificación de Edad, Sexo y Variedad del Lenguaje

The possibility of knowing people's traits on the basis of what they write is a field of growing interest named author profiling. To infer a user’s gender, age, native language or personality traits, simply by analysing her texts, opens a wide range of possibilities from the point of view of forensics, security and marketing. Furthermore, social media proliferation, which allows for new communica...

ESTEEM: A Novel Framework for Qualitatively Evaluating and Visualizing Spatiotemporal Embeddings in Social Media

Analyzing and visualizing large amounts of social media communications and contrasting short-term conversation changes over time and geolocations is extremely important for commercial and government applications. Earlier approaches for large-scale text stream summarization used dynamic topic models and trending words. Instead, we rely on text embeddings – low-dimensional word representations in ...

Publication date: 2016