Modeling intra-textual variation with entropy and surprisal: topical vs. stylistic patterns
Authors
Abstract
We present a data-driven approach to investigate intra-textual variation by combining entropy and surprisal. With this approach we detect linguistic variation based on phrasal lexico-grammatical patterns across sections of research articles. Entropy is used to detect patterns typical of specific sections. Surprisal is used to differentiate between more and less informationally-loaded patterns as well as types of information (topical vs. stylistic). While we here focus on research articles in biology/genetics, the methodology is especially interesting for digital humanities scholars, as it can be applied to any text type or domain and combined with additional variables (e.g. time, author or social group) to obtain insights on intra-textual variation.
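As a rough illustration of the two measures named in the abstract, the sketch below computes surprisal for each pattern and each pattern's contribution to the relative entropy (Kullback-Leibler divergence) between one section and the rest of the article. The abstract does not specify the exact formulation, so the choice of KL divergence, the add-one smoothing, the function names, and the toy counts are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def surprisal(counts):
    """Surprisal in bits, -log2 p(pattern), from raw pattern counts."""
    total = sum(counts.values())
    return {pat: -math.log2(c / total) for pat, c in counts.items()}


def kl_contributions(section_counts, rest_counts, alpha=1.0):
    """Per-pattern terms p * log2(p / q) of KL(section || rest).

    Additive smoothing (alpha) is an assumption made here so that patterns
    unseen in one of the two distributions do not yield log(0).
    """
    vocab = set(section_counts) | set(rest_counts)
    n_sec = sum(section_counts.values()) + alpha * len(vocab)
    n_rest = sum(rest_counts.values()) + alpha * len(vocab)
    contributions = {}
    for pat in vocab:
        p = (section_counts.get(pat, 0) + alpha) / n_sec
        q = (rest_counts.get(pat, 0) + alpha) / n_rest
        contributions[pat] = p * math.log2(p / q)
    return contributions


# Hypothetical toy counts of phrasal patterns, invented for illustration only.
methods = Counter({"was performed using": 6, "in this study": 1,
                   "as described previously": 3})
other = Counter({"was performed using": 1, "in this study": 6,
                 "as described previously": 1})

contrib = kl_contributions(methods, other)
most_typical = max(contrib, key=contrib.get)  # pattern most typical of the section
bits = surprisal(methods)                     # rarer patterns carry more bits
```

Patterns with a large positive contribution are those the section uses far more often than the rest of the text, while the surprisal values separate frequent, low-information patterns from rarer, more informative ones.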
Related resources
Modeling of Stylistic Variation in Social Media with Stretchy Patterns
In this paper we describe a novel feature discovery technique that can be used to model stylistic variation in sociolects. While structural features offer much in terms of expressive power over simpler features used more frequently in machine learning approaches to modeling linguistic variation, they frequently come at an excessive cost in terms of feature space size expansion. We propose a nov...
Effects of Surprisal and Entropy on Vowel Duration in Japanese.
Research on English and other languages has shown that syllables and words that contain more information tend to be produced with longer duration. This research is evolving into a general thesis that speakers articulate linguistic units with more information more robustly. While this hypothesis seems plausible from the perspective of communicative efficiency, previous support for it has come ma...
Textual Stylistic Variation: Choices, Genres and Individuals
This chapter argues for more informed target metrics for the statistical processing of stylistic variation in text collections. Much as operationalized relevance proved a useful goal to strive for in information retrieval, research in textual stylistics, whether application oriented or philologically inclined, needs goals formulated in terms of pertinence, relevance, and utility—notions that a...
Multivariate Surprisal Analysis of Gene Expression Levels
We consider here multivariate data which we understand as the problem where each data point i is measured for two or more distinct variables. In a typical situation there are many data points i while the range of the different variables is more limited. If there is only one variable then the data can be arranged as a rectangular matrix where i is the index of the rows while the values of the va...
Towards the detection and description of textual meaning indicators in spontaneous conversations
The description of textual and stylistic features has so far been largely neglected in the empirical study of conversational speech. In this paper we want to make a couple of strong initial points towards the use of textual meaning and stylistic features in language engineering: First of all we want to show that there are other besides the traditional features in spontaneous speech that are worth ...