wikipedia mining

Mining and Ranking Biomedical Synonym Candidates from Wikipedia

2015

Abhyuday Jagannatha, Jinying Chen, Hong Yu

Biomedical synonyms are important resources for natural language processing in the biomedical domain. Existing synonym resources (e.g., the UMLS) are not complete. Manual efforts for expanding and enriching these resources are prohibitively expensive. We therefore develop and evaluate approaches for automated synonym extraction from Wikipedia. Using the inter-wiki links, we extracted the candidate ...

Mining for Domain-specific Parallel Text from Wikipedia

2013

Magdalena Plamada, Martin Volk

Previous attempts at extracting parallel data from Wikipedia were restricted by the monotonicity constraint of the alignment algorithm used for matching possible candidates. This paper proposes a method for exploiting Wikipedia articles without worrying about the position of the sentences in the text. The algorithm ranks the candidate sentence pairs by means of a customized metric, which combin...

Unsupervised Language-Independent Name Translation Mining from Wikipedia Infoboxes

2011

Wen-Pin Lin, Matthew Snover, Heng Ji

The automatic generation of entity profiles from unstructured text, such as Knowledge Base Population, if applied in a multi-lingual setting, generates the need to align such profiles from multiple languages in an unsupervised manner. This paper describes an unsupervised and language-independent approach to mine name translation pairs from entity profiles, using Wikipedia Infoboxes as a stand-i...

Wikipedia graph mining: dynamic structure of collective memory

2017

Volodymyr Miz, Kirell Benzi, Benjamin Ricaud, Pierre Vandergheynst

Wikipedia is the largest encyclopedia ever created and the fifth most visited website in the world. Tens of millions of people surf it every day, seeking answers to various questions. Collective user activity on the pages leaves publicly available footprints of human behavior, making Wikipedia a great source of data for large-scale analysis of collective dynamical patterns. The dyna...

Mining the Spoken Wikipedia for Speech Data and Beyond

2016

Arne Köhn, Florian Stegen, Timo Baumann

We present a corpus of time-aligned spoken data of Wikipedia articles as well as the pipeline that allows such corpora to be generated for many languages. There are initiatives to create and sustain spoken Wikipedia versions in many languages and hence the data is freely available, grows over time, and can be used for automatic corpus creation. Our pipeline automatically downloads and aligns this d...

Wikipedia Mining for an Association Web Thesaurus Construction

2007

Kotaro Nakayama, Takahiro Hara, Shojiro Nishio

Wikipedia has become a huge phenomenon on the WWW. As a corpus for knowledge extraction, it has various impressive characteristics such as a huge number of articles, live updates, a dense link structure, brief link texts and URL identification for concepts. In this paper, we propose an efficient link mining method, pfibf (Path Frequency Inversed Backward link Frequency), and the extension method ...

Mining Wikipedia Article Clusters for Geospatial Entities and Relationships

2009

Jeremy Witmer, Jugal K. Kalita

We present a method to extract geospatial entities and relationships from the unstructured text of the English-language Wikipedia. Using a novel approach that applies SVMs trained from purely structural features of text strings, we extract candidate geospatial entities and relationships. Using a combination of further techniques, along with an external gazetteer, the candidate ent...

Measuring Semantic Relatedness using Mined Semantic Analysis

Journal: CoRR 2015

Walid Shalaby, Wlodek Zadrozny

Mined Semantic Analysis (MSA) is a novel distributional semantics approach which employs data mining techniques. MSA embraces knowledge-driven analysis of natural languages. It uncovers implicit relations between concepts by mining for their associations in target encyclopedic corpora. MSA exploits not only target corpus content but also its knowledge graph (e.g., "See also" link graph of Wikip...

HIT2 Joint NLP Lab at the NTCIR-9 Intent Task

2011

Dongqing Xiao, Haoliang Qi, Jingbin Gao, Zhongyuan Han, Muyun Yang, Sheng Li

This report presents the principles, the search process, and the experimental results of our systems for the Intent task of NTCIR-9. The research evaluates the effectiveness of the proposed methods for query intent mining and result diversification in web search. In the subtopic mining subtask, we combine the extracted candidates from search logs...

An Iterative Approach for Mining Parallel Sentences in a Comparable Corpus

2014

Lise Rebout, Philippe Langlais

We describe an approach for mining parallel sentences in a collection of documents in two languages. While several approaches have been proposed for doing so, our proposal differs in several respects. First, we use a document level classifier in order to focus on potentially fruitful document pairs, an understudied approach. We show that mining less, but more parallel documents can lead to bett...
