Measuring Correlation Between Linguist's Judgments and Latent Dirichlet Allocation Topics
Authors
Abstract
Data annotated by linguists is often considered a gold standard for many tasks in NLP. However, linguists are expensive, so researchers seek automatic techniques that correlate well with human performance. Linguists working on the ScamSeek project were given the task of deciding how many, and which, document classes existed in a previously unseen corpus. This paper investigates whether the document classes identified by the linguists correlate significantly with Latent Dirichlet Allocation (LDA) topics induced from that corpus. Monte-Carlo simulation is used to measure the statistical significance of the correlation between LDA models and the linguists’ characterisations. In experiments, more than 90% of the linguists’ classes met the level required to declare the correlation between linguistic insights and LDA models significant. These results help verify the usefulness of the LDA model in NLP and are a first step towards showing that the LDA model can replace the efforts of linguists in certain tasks, such as subdividing a corpus into classes.
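As a rough illustration of the kind of procedure the abstract describes, the sketch below induces LDA topics, assigns each document its most probable topic, and uses a Monte-Carlo permutation test to ask whether the agreement with linguist-assigned classes could have arisen by chance. The choice of scikit-learn, adjusted mutual information as the association statistic, and all parameter values are assumptions made for this sketch, not details taken from the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics import adjusted_mutual_info_score

def lda_vs_linguist_significance(docs, linguist_labels, n_topics=20,
                                 n_permutations=1000, seed=0):
    """Estimate the significance of the association between induced LDA
    topics and linguist-assigned document classes via a permutation test."""
    rng = np.random.default_rng(seed)

    # Fit LDA and assign each document its single most probable topic.
    counts = CountVectorizer(stop_words="english", min_df=2).fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    doc_topic = lda.fit_transform(counts)
    topic_labels = doc_topic.argmax(axis=1)

    # Observed association between topic assignments and linguist classes.
    observed = adjusted_mutual_info_score(linguist_labels, topic_labels)

    # Monte-Carlo null distribution: shuffling the linguist labels destroys
    # any real association with the topics; re-score each shuffled run.
    null_scores = np.empty(n_permutations)
    for i in range(n_permutations):
        shuffled = rng.permutation(np.asarray(linguist_labels))
        null_scores[i] = adjusted_mutual_info_score(shuffled, topic_labels)

    # One-sided p-value: fraction of null runs at least as strong as observed.
    p_value = (1 + np.sum(null_scores >= observed)) / (1 + n_permutations)
    return observed, p_value
```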
Similar resources
Legal Documents Clustering using Latent Dirichlet Allocation
The availability of a large amount of legal judgments in digital form creates opportunities and challenges for both the legal community and information technology researchers. This development calls for assistance in organizing, analyzing, retrieving and presenting this content in a helpful and distributed manner. We propose an approach to cluster legal judgments based on th...
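For context, a minimal sketch of how documents can be clustered on their LDA topic mixtures is shown below; the use of scikit-learn, K-means over the document-topic matrix, and the parameter values are illustrative assumptions, not the clustering method of the cited work.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

def cluster_by_topics(docs, n_topics=10, n_clusters=5, seed=0):
    """Represent each document as an LDA topic-proportion vector, then cluster."""
    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    doc_topic = lda.fit_transform(counts)  # shape: (n_documents, n_topics)
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    return km.fit_predict(doc_topic)
```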
A Sequential Latent Topic-Based Readability Model for Domain-Specific Information Retrieval
In domain-specific information retrieval (IR), an emerging problem is how to provide different users with documents that are both relevant and readable, especially lay users. In this paper, we propose a novel document readability model to enhance domain-specific IR. Our model incorporates the coverage and sequential dependency of latent topics in a document. Accordingly, two topical...
Topic Correlations over Time
Topic models have proved useful for analyzing large clusters of documents. Most models developed, however, have paid little attention to the analysis of the latent topics themselves, particularly with regard to changes in their correlation over time. We present a novel, probabilistically well-founded extension to Latent Dirichlet Allocation (LDA) which can explicitly model topic drift over time...
Latent Dirichlet Allocation with Topic-in-Set Knowledge
Latent Dirichlet Allocation is an unsupervised graphical model which can discover latent topics in unlabeled data. We propose a mechanism for adding partial supervision, called topic-in-set knowledge, to latent topic modeling. This type of supervision can be used to encourage the recovery of topics which are more relevant to user modeling goals than the topics which would be recovered otherwise...
Measuring Topic Coherence through Optimal Word Buckets
Measuring topic quality is essential for scoring the learned topics and their subsequent use in information retrieval and text classification. To measure the quality of Latent Dirichlet Allocation (LDA) based topics learned from text, we propose a novel approach based on grouping topic words into buckets (TBuckets). A single large bucket signifies a single coherent theme, in turn indicating high...
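A toy sketch of the bucket intuition is given below: greedily group a topic's top words into buckets of mutually similar words and treat the relative size of the largest bucket as a coherence signal. This is only an illustration under assumed inputs (pre-computed word vectors and an arbitrary similarity threshold); it is not the TBuckets algorithm itself.

```python
import numpy as np

def bucket_coherence(top_words, word_vectors, threshold=0.5):
    """Greedily bucket a topic's top words by mutual cosine similarity and
    return (buckets, score), where score = largest bucket size / word count."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    buckets = []
    for word in top_words:
        vec = word_vectors[word]  # assumed: pre-computed embedding per word
        for bucket in buckets:
            # Join the first bucket whose members are all similar enough.
            if all(cosine(vec, word_vectors[w]) >= threshold for w in bucket):
                bucket.append(word)
                break
        else:
            buckets.append([word])

    score = max(len(b) for b in buckets) / len(top_words)
    return buckets, score
```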