Profiling Topics on the Web for Knowledge

Authors

  • Aditya Kumar Sehgal
  • Padmini Srinivasan
Abstract

This paper presents the motivations for my dissertation research, which stem from gaps in current research in text and web mining, and outlines the research I propose to do for my Ph.D. dissertation. The overall goal is to explore methods for knowledge discovery from the web.
Text mining, also known as Text Data Mining (TDM) [27], Knowledge Discovery in Textual Databases (KDT) [18], and Literature-based Discovery (LBD) [60], can be described as the process of identifying novel ideas from a collection of texts (also known as a corpus). By novel we mean information that is not explicitly present in the text source being analyzed. The kinds of ideas of interest are those indicating associations, hypotheses, trends, etc. This view of text mining is consistent with the definition proposed by Hearst in her highly cited paper [27].
To illustrate, consider Swanson's research [59] on Raynaud's disease and fish oils. Swanson was interested in Raynaud's disease and read a number of research papers on the subject. He observed that Raynaud's was exacerbated by certain factors such as platelet aggregability, vasoconstriction, and blood viscosity. From an independent literature he also observed that these factors were mitigated by fish oils. Putting the two together, he postulated that fish oils may be beneficial for Raynaud's. This association was unknown at the time and was later confirmed by bioscientists.
In our research we agree with Hearst's view that novelty with respect to the text collection is a requirement in text mining. However, like many others [66, 32], we adopt a more flexible definition of what constitutes "novelty". Specifically, we see a subjective dimension in what is or is not perceived to be novel. Although it is not a requirement, text mining efforts tend to adopt a multi-document perspective, in which novel associations are inferred by combining evidence from more than one document.
Given the large amount of information available in text form today, we believe that tools that automatically find interesting relationships, hypotheses, or ideas, or that assist the user in finding them, are extremely useful. Interestingly, most existing research in text mining has been limited to biomedicine, which can be attributed in part to the early efforts of Swanson and [...]. Our focus in this thesis is on Web Mining, which can be thought of as an extension of text mining. As with text mining, …
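The multi-document inference in the Raynaud's and fish oil example can be sketched in a few lines of Python. This is a minimal illustration only, not the method proposed in the dissertation: the toy documents, the term sets, and the function names `cooccurrence_index` and `indirect_associations` are all assumptions made for the example. Each document is reduced to a set of terms, and a term is proposed as a hypothesis when it shares intermediate terms with the starting topic but never co-occurs with it directly.

```python
from collections import defaultdict

# Toy corpus (hypothetical data): each document is reduced to a set of terms.
documents = [
    {"raynaud's disease", "blood viscosity"},
    {"raynaud's disease", "platelet aggregability"},
    {"raynaud's disease", "vasoconstriction"},
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregability"},
    {"fish oil", "vasoconstriction"},
    {"aspirin", "platelet aggregability"},
    {"raynaud's disease", "aspirin"},
]


def cooccurrence_index(docs):
    """Map each term to the set of terms it shares at least one document with."""
    index = defaultdict(set)
    for doc in docs:
        for term in doc:
            index[term] |= doc - {term}
    return index


def indirect_associations(start, docs):
    """Return terms linked to `start` only through intermediate terms.

    The result maps each candidate term to the set of intermediate terms
    that connect it to `start`. Terms that already co-occur with `start`
    are excluded, since they are explicit in the corpus and so not novel.
    """
    index = cooccurrence_index(docs)
    direct = index[start]
    candidates = defaultdict(set)
    for intermediate in direct:
        for term in index[intermediate]:
            if term != start and term not in direct:
                candidates[term].add(intermediate)
    return dict(candidates)


hypotheses = indirect_associations("raynaud's disease", documents)
for term, links in hypotheses.items():
    print(term, "via", sorted(links))
```

In this toy corpus, aspirin is filtered out because it already co-occurs with the starting topic, while fish oil is reached only through the three intermediate factors. A realistic system would also need term extraction, ranking of candidates, and some notion of whether an intermediate term aggravates or mitigates the condition, none of which is modeled here.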

Similar resources

Visualizing Multiple System Atrophy Studies Based on Collaboration Network and Centrality Indices in Web of Science Database

Introduction: Social network analysis is an analytical method based on graph theory that identifies relationships between individuals or factors in order to analyze the social structures resulting from those relationships. The objective of this study was to analyze co-authorship and co-word networks based on scientometric indicators and centrality measures in the studies on multiple atrophy system dise...

Profiling topics on the Web for knowledge discovery

The availability of large-scale data on the Web motivates the development of automatic algorithms to analyze topics and to identify relationships between topics. Various approaches have been proposed in the literature. Most focus on specific topics, mainly those representing people, with little attention to topics of other kinds. They are also less flexible in how they represent topics. In this...

Systematic enrichment analysis of microRNA expression profiling studies in endometriosis

Objective(s): The purpose of this study was to conduct a meta-analysis of human microRNA (miRNA) expression data from endometriosis tissue profiles versus those of normal controls and to identify novel putative diagnostic markers. Materials and Methods: PubMed, Embase, Web of Science, and Ovid Medline were searched for miRNA expression profiling studies of endometriosis. The miRN...

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, downloading domain-specific web pages is not a simple task. An unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...

Analysis of the structure of vocabulary and concepts in Knowledge and Information Science articles based on social network analysis in the Web of Science database in two periods, before and after the emergence of the Web (1993-1997 and 2009-2013)

Purpose: This study aimed to identify and analyze the structure of “Knowledge and Information Science (KIS)” scientific articles using co-word analysis in the “Web of Science (WoS)” database (1993-1997 & 2009-2013). Through co-word analysis of the KIS articles, the subjects and concepts of KIS were identified. Methodology: This study is based on a descriptive and functional approach and on co-wor...

Publication date: 2006