"Draw My Topics": Find Desired Topics fast from large scale of Corpus
Authors
Abstract
We develop the “Draw My Topics” toolkit, which provides a fast way to incorporate social scientists’ concerns and interests into the standard topic model. Instead of taking a raw corpus with only primitive preprocessing as input, the toolkit uses an algorithm based on the Vector Space Model and conditional entropy to connect social scientists’ subjective interests with the output of unsupervised topic models. The algorithm leaves room for users to make adjustments for the specific corpus they are interested in. We demonstrate the toolkit on the Diachronic People’s Daily Corpus in Chinese. Several “central words” that may interest social scientists from different disciplines, such as “Enlai Zhou” (the first premier of the PRC) and “Cultural Revolution”, are supplied to the toolkit together with the original corpus, and the most related topics are then presented efficiently for further research.
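The abstract does not spell out the algorithm, so the following is only a minimal sketch of how a "central word" might be matched to topics using a vector-space similarity and an entropy measure. Everything in it is an assumption made for illustration: the function names `rank_topics` and `topic_entropy_given_word`, the toy documents, and the document-topic weights are hypothetical and are not taken from the toolkit.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_topics(docs, doc_topic, central_word):
    """Rank topics by similarity to a central word (hypothetical helper).

    docs: list of token lists, one per document.
    doc_topic: per-document topic weights from any topic model (assumed given).
    The word and each topic are both represented as vectors over documents.
    """
    word_vec = [doc.count(central_word) for doc in docs]
    n_topics = len(doc_topic[0])
    scores = []
    for k in range(n_topics):
        topic_vec = [row[k] for row in doc_topic]
        scores.append((k, cosine(word_vec, topic_vec)))
    return sorted(scores, key=lambda kv: kv[1], reverse=True)


def topic_entropy_given_word(docs, doc_topic, central_word):
    """Entropy (in bits) of the topic mix over documents containing the word.

    A low value means the word is concentrated in few topics.
    """
    n_topics = len(doc_topic[0])
    mass = [0.0] * n_topics
    for doc, row in zip(docs, doc_topic):
        if central_word in doc:
            for k, weight in enumerate(row):
                mass[k] += weight
    total = sum(mass)
    if total == 0:
        return 0.0
    return -sum((m / total) * math.log2(m / total) for m in mass if m > 0)


# Toy data, invented for illustration only.
docs = [["zhou", "premier"], ["harvest", "grain"], ["zhou", "visit"]]
doc_topic = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]

ranking = rank_topics(docs, doc_topic, "zhou")
entropy = topic_entropy_given_word(docs, doc_topic, "zhou")
print(ranking[0][0], round(entropy, 3))
```

In this toy setting, topic 0 ranks first for the word "zhou" because that word occurs in the two documents that topic 0 dominates. A real pipeline would obtain `doc_topic` from a fitted topic model rather than from hand-written weights.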
Similar resources
Sampled Weighted Min-Hashing for Large-Scale Topic Mining
We present Sampled Weighted Min-Hashing (SWMH), a randomized approach to automatically mine topics from large-scale corpora. SWMH generates multiple random partitions of the corpus vocabulary based on term cooccurrence and agglomerates highly overlapping inter-partition cells to produce the mined topics. While other approaches define a topic as a probabilistic distribution over a vocabulary, SW...
Stylex: a corpus of educational videos for research on speaking styles and their impact on engagement and learning
In the context of learning through educational videos, the material chosen for a given topic must not only be relevant but also engaging to the consumer—ensuring better understanding and retention of content. This paper focuses on the speaking style of instructors, which is an important aspect driving student engagement. We present StyleX, a corpus of 450 1-minute video clips featuring 50 instr...
Selected topics from 40 years of research on speech and speaker recognition
This paper summarizes my 40 years of research on speech and speaker recognition, focusing on selected topics that I have investigated at NTT Laboratories, Bell Laboratories and Tokyo Institute of Technology with my colleagues and students. These topics include: the importance of spectral dynamics in speech perception; speaker recognition methods using statistical features, cepstral features, an...
Scalable Dynamic Topic Modeling with Clustered Latent Dirichlet Allocation (CLDA)
Topic modeling is an increasingly important component of Big Data analytics, enabling the sense-making of highly dynamic and diverse streams of text data. Traditional methods such as Dynamic Topic Modeling (DTM), while mathematically elegant, do not lend themselves well to direct parallelization because of dependencies from one time step to another. Data decomposition approaches that partition ...
Topic Detection, Ranking and Modeling Evolution in Bibliographic Datasets
Topic detection in a text corpus is the detection of semantic units from the underlying texts that can function as building blocks of knowledge discovery. Topic detection provides a powerful tool for text summarization and information navigation across a corpus of documents. Topic detection from text documents using statistical models and natural language processing techniques has been extensiv...
Journal title:
- CoRR
Volume abs/1602.01428 Issue
Pages -
Publication date 2014