Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Authors

Borna, Keyvan Kharazmi University

Montazer, Gholam Ali Tarbiat Modares University

RiahiNia, Nosrat Kharazmi University

Shadanpour, Farzaneh Kharazmi University

Abstract:

Purpose: This study investigates automatic keyword extraction from the tables of contents of Persian science e-books using LDA topic modeling, and evaluates the extracted keywords both by their similarity to the gold standard and by users' views of them.

Methodology: This is a mixed-methods text-mining study in which LDA topic modeling is used to extract keywords from the tables of contents of scientific e-books. The approach was evaluated in two ways: by computing cosine similarity and by qualitative user evaluation.

Findings: The tables of contents are medium-length texts with a trimmed mean of 260.02 words, about 20% of which are stop words. The cosine similarity between the gold standard keywords and the output keywords is 0.0932, which is very low. Users fully agreed that the keywords extracted with the LDA topic model represent the subject field of the whole corpus. In describing the subject of each individual document, however, the gold standard keywords, the keywords extracted with LDA from sub-domains of the corpus, and the keywords extracted from the whole corpus were successful, in that order.

Conclusion: Keywords extracted with the LDA topic model can be used on unspecified and unknown collections to uncover the hidden thematic content of the collection as a whole, but not to relate each topic accurately to each document when the themes are large and heterogeneous. In collections of texts from a single subject field, such as mathematics or physics, where the vocabulary is less diverse and more uniform, the keywords obtained are more coherent and relevant, but their relevance to each document still needs to be checked. In formal subject analysis of individual documents, this approach can serve as a keyword suggestion system for indexers and subject analysts.
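The cosine similarity measure mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each keyword list is treated as a bag-of-words count vector; the abstract does not say how the authors built their vectors, so this is only one plausible reading. The function name and the two keyword lists are hypothetical and invented for illustration; they are not data from the study.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(keywords_a, keywords_b):
    """Cosine similarity between two keyword lists treated as term-count vectors."""
    a, b = Counter(keywords_a), Counter(keywords_b)
    # Dot product over the terms the two lists share.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = sqrt(sum(count * count for count in a.values()))
    norm_b = sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty list has no direction, so report no similarity.
    return dot / (norm_a * norm_b)


# Hypothetical keyword lists (assumption: not taken from the paper).
gold = ["algebra", "matrix", "vector", "equation"]
extracted = ["matrix", "function", "theorem", "vector"]
score = cosine_similarity(gold, extracted)
```

A score near 0 means the two lists share almost no terms, and a score of 1 means they have identical term distributions. Under this reading, the reported value of 0.0932 would indicate very little overlap between the gold standard and the model output.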


Similar resources

Scalable Dynamic Topic Modeling with Clustered Latent Dirichlet Allocation (CLDA)

Topic modeling is an increasingly important component of Big Data analytics, enabling the sense-making of highly dynamic and diverse streams of text data. Traditional methods such as Dynamic Topic Modeling (DTM), while mathematically elegant, do not lend themselves well to direct parallelization because of dependencies from one time step to another. Data decomposition approaches that partition ...


Assignment 2: Twitter Topic Modeling with Latent Dirichlet Allocation Background

In this assignment we are going to implement a parallel MapReduce version of a popular topic modeling algorithm called Latent Dirichlet Allocation (LDA). Because it allows for exploring vast document collections, we are going to use this algorithm to see if we can automatically identify topics from a series of Tweets. For the purpose of this assignment, we are going to treat every tweet as a docu...


Topic Models - Latent Dirichlet Allocation


Decentralized Topic Modelling with Latent Dirichlet Allocation

Privacy preserving networks can be modelled as decentralized networks (e.g., sensors, connected objects, smartphones), where communication between nodes of the network is not controlled by a master or central node. For this type of networks, the main issue is to gather/learn global information on the network (e.g., by optimizing a global cost function) while keeping the (sensitive) information ...


Latent Dirichlet Allocation For Text And Image Topic Modeling

Latent Dirichlet allocation (LDA) is a popular unsupervised technique for topic modeling. It learns a generative model which can discover latent topics given a collection of training documents. In the unsupervised learning framework, where the class label is unavailable, it is less intuitive to evaluate the goodness-of-fit and degree of overfitting of learned model. We discuss two measurements ...


Latent Dirichlet Allocation For Text And Image Topic Modeling

Latent Dirichlet Allocation (LDA) is a generative model for text documents. It is an unsupervised method which can learn latent topics from documents. We investigate the task of topic modeling of documents using LDA, where the parameters are trained with collapsed Gibbs sampling. Since the training process is unsupervised and the true labels of the training documents are absent, it is hard to m...



Journal title

Human Information Interaction (تعامل انسان و اطلاعات)

volume 9 issue 3

pages 1-21

publication date 2022-10


