Latent Topic Based Medical Data Classification

Abstract

This paper presents a classification process for medical data. Using the data set from ACM KDDCup 2008, we demonstrate a classification process based on latent topic discovery. The two classes in this data set are highly imbalanced: the target set makes up only 0.6% of the data, while the outliers account for the remaining 99.4%. We use this data set as an example to show how latent topic discovery and noise reduction techniques can handle such extremely biased data. Our experiment faces two major challenges: (1) extremely distributed outliers, and (2) far fewer positive samples than negative ones. We propose a process flow suited to these issues and obtain a best AUC of 0.98.

Keywords: classification, latent topics, outlier adjustment, feature scaling
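To illustrate why AUC, rather than accuracy, is the natural metric for a data set with only 0.6% positives, the following is a minimal sketch and not the authors' code. The function names `auc` and `accuracy`, the sample size of 1000, and the toy scores are all assumptions made for illustration. It computes AUC through the standard rank-sum (Mann-Whitney U) formulation and compares a trivial "always negative" classifier with a perfect ranker.

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formulation.

    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n = len(pairs)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the run of tied scores pairs[i:j]; it occupies ranks i+1 .. j.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0
        positives_in_run = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * positives_in_run
        i = j

    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_statistic / (n_pos * n_neg)


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Hypothetical toy data: 1000 samples, 6 positives (0.6%), mirroring the
# class ratio quoted in the abstract.
labels = [1] * 6 + [0] * 994

# A trivial classifier that scores every sample identically and predicts
# "negative" everywhere.
constant_scores = [0.0] * len(labels)
always_negative = [0] * len(labels)
trivial_accuracy = accuracy(always_negative, labels)
trivial_auc = auc(constant_scores, labels)

# A ranker that scores every positive above every negative.
perfect_scores = [0.9] * 6 + [0.1] * 994
perfect_auc = auc(perfect_scores, labels)

print(f"trivial accuracy = {trivial_accuracy:.3f}")
print(f"trivial AUC      = {trivial_auc:.3f}")
print(f"perfect AUC      = {perfect_auc:.3f}")
```

The trivial classifier reaches 99.4% accuracy while finding no positives at all, yet its AUC is only 0.5. This is why a ranking metric such as AUC is more informative at this class ratio.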

Similar resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

Statistical modeling of medical indexing processes for biomedical knowledge information discovery from text

The overwhelming amount of published literature in the biomedical domain and the growing number of collaborations across scientific disciplines result in an increasing topical complexity of research articles. This represents an immense challenge for efficient biomedical knowledge discovery from text. We present a new graphical model, the so-called Topic-Concept Model, which extends the basic La...

Improving Web Query Classification by Latent Topic Analysis

As Web search engines play an important role in helping people find required information from massive Web data, the Web query classification (WQC) problem has become an important research issue. In contrast to traditional classification problems, WQC classifies Web queries into relevant categories of a Web taxonomy. In addition to the typical challenge of processing short and ambiguous queries, W...

User Profiling based on Latent Topic Modeling

NTT DOCOMO Technical Journal Vol. 13 No. 3 ©2011 NTT DOCOMO, INC. Copies of articles may be reproduced only for personal, noncommercial use, provided that the name NTT DOCOMO Technical Journal, the name(s) of the author(s), the title and date of the article appear in the copies. *1 latent topic model: A model widely used in document categorization based on the concept that a document is generat...

Multi-label Classification Algorithm Based on Latent Dirichlet Allocation Model

The Vector Space Model (VSM) is used frequently in Text Classification (TC). However, it usually produces a high-dimensional feature space, which leads to a huge cost of computation and storage. Recently, statistical topic models have played an important role in the fields of Information Retrieval (IR), TC and Document Clustering. In this chapter, we try to use a kind of statistic model—Latent Dirichlet Allo...

Publication date: 2012