Highlighting latent structures in texts
Abstract
We have developed an original learning method to extract latent structures from raw texts. The induced structure is a data-driven tree, which may be unbalanced. It is obtained from successive partitions of the texts into clusters, with the number of classes increasing from 2 to K; each quasi-optimal partition is computed with an adaptation of k-means clustering. The paths followed by the texts through the successive partitions are the edges of a directed graph whose nodes are the clusters. Studying these paths shows that some clusters remain identical across successive partitions, so a tree can be extracted from the graph by merging nodes and clipping edges. A corpus of 1,100 tourist information leaflets is used to illustrate the method.
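The graph construction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the successive partitions have already been computed (the adapted k-means step is not reproduced), and the names `clusters_of`, `build_path_graph`, and `clip_edges`, as well as the "keep the heaviest incoming edge" clipping rule, are hypothetical choices made for the example.

```python
from collections import Counter, defaultdict


def clusters_of(labels):
    """Map each cluster label to the frozenset of text indices it contains."""
    groups = defaultdict(set)
    for text_id, label in enumerate(labels):
        groups[label].add(text_id)
    return {label: frozenset(ids) for label, ids in groups.items()}


def build_path_graph(partitions):
    """Count the texts travelling between clusters of consecutive partitions.

    partitions[i][t] is the cluster label of text t in the i-th partition.
    Nodes are frozensets of text indices, so a cluster that stays identical
    from one partition to the next is the same node (the merging step).
    """
    edges = Counter()
    for upper, lower in zip(partitions, partitions[1:]):
        upper_sets, lower_sets = clusters_of(upper), clusters_of(lower)
        for a, b in zip(upper, lower):
            src, dst = upper_sets[a], lower_sets[b]
            if src != dst:  # identical clusters are merged: no self-loop
                edges[(src, dst)] += 1
    return edges


def clip_edges(edges):
    """Assumed clipping rule: each node keeps only its heaviest incoming edge.

    Returns a child -> parent mapping, i.e. a tree (or forest) over clusters.
    """
    best = {}
    for (src, dst), weight in edges.items():
        if dst not in best or weight > best[dst][1]:
            best[dst] = (src, weight)
    return {dst: src for dst, (src, _) in best.items()}


# Toy input: 6 texts partitioned into 2, 3, then 4 clusters.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 2, 2, 2],
    [0, 0, 1, 2, 2, 3],
]
tree = clip_edges(build_path_graph(partitions))
for child, parent in tree.items():
    print(sorted(parent), "->", sorted(child))
```

In this toy input the cluster {3, 4, 5} is unchanged between the first two partitions, so it appears as a single node and only splits at the third level, which is how an unbalanced tree can arise.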
Similar resources
Latent Semantic Analysis for Notional Structures Investigation
The research on the effects of study is hindered by the limitations of the techniques and methods of registering, measuring and assessing the actually formed knowledge. The problem has been solved using latent semantic analysis for comparison and assessment of scientific texts and knowledge, expressed in the form of free verbal statements. Education at higher schools has the specific objective ...
Topic Modeling over Short Texts by Incorporating Word Embeddings
Inferring topics from the overwhelming amount of short texts becomes a critical but challenging task for many content analysis tasks, such as content charactering, user interest profiling, and emerging topic detecting. Existing methods such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA) cannot solve this problem very well since only very limited word co-o...
Towards Automatic Topical Question Generation
We address the challenge of automatically generating questions from topics. We consider that each topic is associated with a body of texts containing useful information about the topic. Questions are generated by exploiting the named entity information and the predicate argument structures of the sentences present in the body of texts. To measure the importance of the generated questions, we us...
Towards Topic-to-Question Generation
This paper is concerned with automatic generation of all possible questions from a topic of interest. Specifically, we consider that each topic is associated with a body of texts containing useful information about the topic. Then, questions are generated by exploiting the named entity information and the predicate argument structures of the sentences present in the body of texts. The importanc...
Highlighting Latent Structure in Documents
Extensible Markup Language (XML) is playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere. It is a simple, very flexible text format, used to annotate data by means of markup. XML documents can be checked for syntactic well-formedness and semantic coherence through DTD and schema validation which makes their processing easier. In particular, d...