Bidirectional Extraction of Phrases for Expanding Queries in Academic Paper Retrieval
Abstract
This paper proposes a new method for query expansion based on the bidirectional extraction of phrases, as word n-grams, from research paper titles. The method aims to extract information relevant to users' needs and interests and thus to provide a useful system for technical paper retrieval. Its output is a set of trigrams that can be used as phrases for query expansion. First, word trigrams are extracted from research paper titles. Second, a co-occurrence graph of the extracted trigrams is constructed, with edge direction considered in two ways: forward and reverse. In the forward co-occurrence graph, each trigram points to the trigrams that appear after it in a paper title; in the reverse co-occurrence graph, each trigram points to the trigrams that appear before it. Third, the Jaccard similarity between trigrams is computed and used as the edge weight. Fourth, the weighted version of PageRank is applied. The trigrams with the higher PageRank scores then yield two types of phrases. Trigrams obtained from the forward co-occurrence graph can form a more specific query when users add a technical word or words before them. Trigrams obtained from the reverse co-occurrence graph can form a more specific query when users add a technical word or words after them. The extracted phrases are evaluated as additional features in a paper title classification task using an SVM. The experimental results show that classification accuracy improves over the accuracy achieved when only the standard TF-IDF text features are used. Moreover, the trigrams extracted by the proposed method can be used to expand query words in research paper retrieval.
Keywords: word n-grams; Jaccard similarity; PageRank; TF-IDF; query expansion; information retrieval; feature extraction
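The four steps in the abstract (trigram extraction, forward and reverse co-occurrence graphs, Jaccard edge weights, weighted PageRank) can be illustrated with a minimal sketch. This is not the authors' implementation. In particular, the abstract does not say which sets the Jaccard similarity is computed over; the sketch assumes the sets of titles in which each trigram occurs. The function names, the damping factor of 0.85, the fixed iteration count, and the sample titles are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def trigrams(title):
    """Return the word trigrams of a title, in order of appearance."""
    words = title.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def build_graph(titles, reverse=False):
    """Build a weighted, directed trigram co-occurrence graph.

    Forward: a trigram points to trigrams appearing after it in a title.
    Reverse: a trigram points to trigrams appearing before it.
    Edge weight (assumption): Jaccard similarity between the sets of
    titles that contain each of the two trigrams.
    """
    occurs = defaultdict(set)  # trigram -> ids of the titles containing it
    edges = set()
    for title_id, title in enumerate(titles):
        grams = trigrams(title)
        for gram in grams:
            occurs[gram].add(title_id)
        for i, j in combinations(range(len(grams)), 2):
            earlier, later = grams[i], grams[j]
            if earlier != later:
                edges.add((later, earlier) if reverse else (earlier, later))
    graph = defaultdict(dict)
    for src, dst in edges:
        shared = len(occurs[src] & occurs[dst])
        total = len(occurs[src] | occurs[dst])
        graph[src][dst] = shared / total
    return graph, set(occurs)


def weighted_pagerank(graph, nodes, damping=0.85, iterations=50):
    """Power iteration for PageRank with weight-proportional transitions."""
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    out_weight = {v: sum(graph.get(v, {}).values()) for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0)
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src, targets in graph.items():
            for dst, weight in targets.items():
                new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank


# Hypothetical sample titles, for illustration only.
sample_titles = [
    "deep learning for image retrieval",
    "deep learning for text retrieval",
]
for is_reverse in (False, True):
    g, vs = build_graph(sample_titles, reverse=is_reverse)
    scores = weighted_pagerank(g, vs)
    best = max(scores, key=scores.get)
    print("reverse" if is_reverse else "forward", " ".join(best))
```

In this toy input, the top-ranked trigram of the reverse graph is one that opens both titles, which matches the abstract's description of reverse-graph phrases as ones a user would extend by adding technical words after them.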
Similar resources
Automatic suggestion of phrasal-concept queries for literature search
Both general and domain-specific search engines have adopted query suggestion techniques to help users formulate effective queries. In the specific domain of literature search (e.g., finding academic papers), the initial queries are usually based on a draft paper or abstract, rather than short lists of keywords. In this paper, we investigate phrasal-concept query suggestions for literature sear...
Information Retrieval Based on Extraction of Domain Specific Significant Keywords and Other Relevant Phrases from a Conceptual Semantic Network Structure
This paper presents a functional approach to the problem of building an information retrieval system upon a narration-based search text. The presented system retrieves documents from the background collection by extracting the domain-specific significant keywords and other relevant phrases from a given narrative search text. The narrative search text can be a description or scenario which...
Source Retrieval Plagiarism Detection based on Weighted Noun phrase and Key phrase Extraction
This paper describes an approach for the source retrieval task of the PAN 2015 competition. We apply two methods to extract important terms, namely weighted noun phrases and keyword phrases, which are extracted from sentences that are long in terms of word count. Queries are constructed from top marked sentences. The prepared system tries to gather a complete dataset of downloaded sources and employ it in query ...
Efficient Technique to Retrieve Plagiarized Documents for Plagiarism Detection
This paper details the approach of implementing an English plagiarism source retrieval system. A given document is broken down into segments using the TextTiling algorithm. These segments are centered around certain topics within the document, and key phrases are generated using the KPMiner keyphrase extraction system. Segments and key phrases are used to create queries of the segment and document. Cha...
CLEF 2003 Experiments at UB: Automatically Generated Phrases and Relevance Feedback for Improving CLIR
This paper presents the results obtained by the University at Buffalo (UB) in CLEF 2003. Our efforts concentrated on the monolingual retrieval and large multilingual retrieval tasks. We used a modified version of the SMART system, a heuristic method based on bigrams to generate phrases that works across multiple languages, and pseudo relevance feedback. Query translation was performed using pub...