White Paper: Text Clustering on Patents

Author

  • Manish Sinha
Abstract

Overview

Among the various analyses performed on patents, the area where specialized software helps most is text mining. Two of the most popular text-mining techniques applied to patent data are:

  • Text segmentation / Tokenization
  • Text clustering / Topic identification

Text segmentation is the process of analyzing the patent text and identifying smaller meaningful segments within it. These segments are also called tokens or keywords. Various analyses can be performed using these tokens; however, for large patent sets the number of unique tokens can be very high, which makes them unsuitable for certain types of analysis.

Text clustering helps identify important topics or concepts (clusters) in a set of documents. Clustering of key patent text fields (such as Title, Abstract and Claims) has been used in various patent analysis tools and can help bring out otherwise hidden insights within patents. Analyzing relationships between generated clusters, or between patent classifications and clusters, is a popular approach among researchers, especially those in a competitive intelligence role.

Clustering Algorithms

Classical techniques used for clustering include TF-IDF term weighting, K-Means and Naïve Bayes. Their output is a set of topics (single level, or hierarchical with multiple levels), each of which contains a group of documents clustered under that topic. The label (or name) of a topic is derived from the text of the patent and can be a combination of multiple words. Advanced algorithms usually understand parts of speech and interpret names accordingly. For instance, "tip of the probe…", "using a probe tip…", "with a probe whose tip is used for…", and "probing the substrate with the tip…" will all be represented as "probe tip".

Algorithms can differ in their ability to cluster a patent under a single topic or under multiple topics. For patents it is desirable to have algorithms that place a patent under more than one topic, since an attempt to make a best match for the patent under a single topic may lead to errors in user interpretation.

One of the key challenges of clustering algorithms has been to make the overall process transparent for IP professionals, who usually find it hard to accept the results of "black-box" clustering engines. Newer clustering algorithms give the analyst control over most aspects of the clustering process, thereby allowing them to fine-tune the clustering output and train the algorithm to deliver more relevant …
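The ideas in this abstract (tokenization, TF-IDF weighting, K-Means grouping, and placing a patent under more than one topic) can be illustrated with a minimal sketch in plain Python. This is not the white paper's algorithm: the stopword list, the toy documents, the fixed seed documents, the cosine-based K-Means variant, and the 0.3 similarity threshold are all assumptions chosen for illustration, and the simple tokenizer does no part-of-speech analysis.

```python
import math
import re
from collections import Counter, defaultdict

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"a", "an", "and", "by", "for", "in", "is", "of",
             "the", "to", "using", "whose", "with"}


def tokenize(text):
    """Text segmentation: lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one unit-length {token: weight} TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        # Raw term count times a smoothed inverse document frequency.
        raw = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
               for t, c in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def kmeans(vectors, seed_indices, iterations=10):
    """K-Means with cosine similarity, seeded from the given documents."""
    centroids = [dict(vectors[i]) for i in seed_indices]
    labels = None
    for _ in range(iterations):
        new_labels = [max(range(len(centroids)),
                          key=lambda k: cosine(v, centroids[k]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if not members:
                continue  # keep the old centroid for an empty cluster
            total = defaultdict(float)
            for v in members:
                for t, w in v.items():
                    total[t] += w
            centroids[k] = {t: w / len(members) for t, w in total.items()}
    return centroids, labels


def assign_topics(vectors, centroids, threshold=0.3):
    """Multi-topic assignment: every topic whose centroid is similar enough."""
    return [[k for k, c in enumerate(centroids) if cosine(v, c) >= threshold]
            for v in vectors]


def top_terms(centroid, n=2):
    """Candidate label terms: the highest-weighted tokens of a centroid."""
    return sorted(centroid, key=lambda t: (-centroid[t], t))[:n]


# Toy stand-ins for patent titles (hypothetical data).
docs = [
    "probe tip scanning the substrate surface",
    "tip of the probe touches the substrate",
    "battery cell with lithium electrode",
    "lithium battery electrode coating",
    "probe tip powered by a lithium battery",
]
vectors = tfidf_vectors(docs)
centroids, labels = kmeans(vectors, seed_indices=[0, 2])
topics = assign_topics(vectors, centroids, threshold=0.3)
for doc, hard, multi in zip(docs, labels, topics):
    print(hard, multi, doc)
```

In this toy run, K-Means forces the last document into a single best-match cluster, while the threshold-based assignment lists it under both topics, which is the behaviour the abstract argues is preferable for patents. Exposing the seeds and the threshold as parameters is one simple way to give an analyst some control over the output.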

Related resources

Psalm – Patent Mining Tool for Competitive Intelligence

Original scientific paper. A patent document is a valuable source of information. However, it is neither easy to extract useful information from patents nor simple to track evidence about all patents that may be relevant. This paper describes PSALM (Patent Search and Analysis for Landscaping and Management), a recently developed software tool for competitive intelligence based on patent data. PSAL...

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcomings in capturing the semantic concepts of text motivated researchers to use...

Identifying Promising Research Frontiers Based on Patent Map

This research suggests a framework to identify promising research frontiers through a patent map. Core patents are extracted from a list of patents, using collected patent information. The research frontiers are identified by clustering the core patents. The patent map is generated through principal component analysis and network analysis. The promising research frontiers and promising researc...

Applications and Challenges of Text Mining with Patents

This paper gives insight into our current research on three text mining tools for patents designed for information professionals. The first tool identifies numeric properties in the patent text and normalises them, the second extracts a list of keywords that are relevant and reveal the invention in the patent text, and the third tool attempts to segment the patent’s description into its sectio...

Document Clustering Based on Ontology and a Fuzzy Approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering is one of the important techniques of text mining: the unsupervised classification of similar documents into different groups. The most important step...

Publication date: 2009