Harvesting Entities from the Web Using Unique Identifiers – IBEX
Abstract
In this paper we study the prevalence of unique entity identifiers on the Web, such as ISBNs (for books), GTINs (for commercial products), DOIs (for documents), and email addresses. We show how these identifiers can be harvested systematically from Web pages and associated with human-readable names for the entities at large scale. Starting with a simple extraction of identifiers and names from Web pages, we show how the properties of unique identifiers can be used to filter out noise and clean up the extraction result over the entire corpus. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73–96% and very high coverage compared to existing knowledge bases. We use this database to compute novel statistics on the presence of products, people, and other entities on the Web. This work was published at WebDB 2015 [40].
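As a rough illustration of how a property of an identifier can filter extraction noise, the sketch below pulls 13-digit candidates out of page text and keeps only those whose GTIN-13/ISBN-13 check digit is valid. The abstract does not say which properties the authors rely on, so the choice of the check digit is an assumption, and the names `CANDIDATE`, `gtin13_is_valid`, and `extract_gtin13` are hypothetical.

```python
import re

# Assumed candidate pattern: exactly 13 ASCII digits, not part of a longer digit run.
CANDIDATE = re.compile(r"(?<!\d)\d{13}(?!\d)", re.ASCII)


def gtin13_is_valid(code: str) -> bool:
    """Return True if `code` has a correct GTIN-13 / ISBN-13 check digit."""
    if len(code) != 13 or not code.isdigit():
        return False
    # The first 12 digits are weighted 1, 3, 1, 3, ... from left to right.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(code[:12]))
    return (10 - total % 10) % 10 == int(code[12])


def extract_gtin13(text: str) -> list[str]:
    """Extract 13-digit candidates and drop those that fail the checksum."""
    return [c for c in CANDIDATE.findall(text) if gtin13_is_valid(c)]
```

For example, `extract_gtin13("ISBN 9780306406157, order no. 9780306406158")` keeps only the first number, since the second fails the checksum. A check like this removes only part of the noise: a random 13-digit string still passes about one time in ten.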
Similar resources
Harvesting Entities from the Web Using Unique Identifiers - IBEX
In this paper we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale. Starting with a simple extract...
Sequence analysis of peste des petits ruminants virus from ibexes in Xinjiang, China.
Peste des petits ruminants (PPR) is an infectious disease caused by peste des petits ruminants virus (PPRV). While PPR mainly affects domestic goats and sheep, it also affects wild ungulates such as ibex, blue sheep, and gazelle, although there are few reports regarding PPRV infection in wild animals. Between January 2015 and February 2015, it was found for the first time that wild ibexes died ...
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
Epizootiologic investigations of selected abortive agents in free-ranging Alpine ibex (Capra ibex ibex) in Switzerland.
In the early 2000s, several colonies of Alpine ibex (Capra ibex ibex) in Switzerland ceased growing or began to decrease. Reproductive problems due to infections with abortive agents might have negatively affected recruitment. We assessed the presence of selected agents of abortion in Alpine ibex by serologic, molecular, and culture techniques and evaluated whether infection with these agents m...