Heuristics for Fixing Common Errors in Deployed schema.org Microdata

Authors

  • Robert Meusel
  • Heiko Paulheim
Abstract

Being promoted by major search engines such as Google, Yahoo!, Bing, and Yandex, Microdata embedded in web pages, especially using schema.org, has become one of the most important markup languages for the Web. However, deployed Microdata is most often not free from errors, which limits its practical use. In this paper, we use the WebDataCommons corpus of Microdata extracted from more than 250 million web pages for a quantitative analysis of common mistakes in Microdata provision. Since it is unrealistic that data providers will provide clean and correct data, we discuss a set of heuristics that can be applied on the data consumer side to fix many of those mistakes in a post-processing step. We apply those heuristics to provide an improved knowledge base constructed from the raw Microdata extraction.
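
As a rough illustration of what a consumer-side post-processing heuristic might look like, the sketch below normalizes schema.org term URIs extracted from Microdata. The abstract does not list the individual heuristics, so the two repairs shown here (tolerating namespace variants and correcting capitalization) are assumptions chosen for illustration. The function name `repair_term`, the tiny `KNOWN_TERMS` set, and the choice of `http://schema.org/` as the canonical namespace are likewise hypothetical.

```python
import re
from typing import Optional

# Illustrative subset of schema.org terms (assumption); a real consumer
# would load the complete vocabulary instead of this hand-picked set.
KNOWN_TERMS = {"Product", "Offer", "Person", "PostalAddress", "name", "price"}
_BY_LOWER = {term.lower(): term for term in KNOWN_TERMS}

# Accepts common namespace variants: http or https, optional "www.",
# any letter case, and zero or more slashes after the host.
_NAMESPACE = re.compile(r"^https?://(www\.)?schema\.org/*", re.IGNORECASE)

# Canonical namespace assumed for this sketch.
CANONICAL_NS = "http://schema.org/"


def repair_term(uri: str) -> Optional[str]:
    """Return a canonical term URI, or None if the term cannot be repaired."""
    cleaned = uri.strip()
    match = _NAMESPACE.match(cleaned)
    if match is None:
        return None  # not recognizably a schema.org URI
    local_name = cleaned[match.end():]
    canonical = _BY_LOWER.get(local_name.lower())
    if canonical is None:
        return None  # unknown term: leave it for other heuristics
    return CANONICAL_NS + canonical


if __name__ == "__main__":
    for raw in ["http://schema.org/product",
                "https://www.Schema.org/PostalAddress",
                "http://schema.orgOffer",
                "http://example.org/Product"]:
        print(raw, "->", repair_term(raw))
```

Returning `None` for anything that cannot be matched keeps the repair conservative: a term is only rewritten when it maps unambiguously onto a known vocabulary entry.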

Similar resources

What the Adoption of schema.org Tells About Linked Open Data

schema.org is a common data markup schema, pushed by large search engine providers such as Google, Yahoo!, and Bing. To date, a few hundred thousand web site providers adopt schema.org annotations embedded in their web pages via Microdata. While Microdata and Linked Open Data are not 100% the same, there are some commonalities which make a joint analysis of the two valuable and reasonable. Prof...

A Computer-Guided Approach to Website Schema.org Design

Schema.org offers to web developers the opportunity to enrich a website’s content with microdata and schema.org. For large websites, implementing microdata can take a lot of time. In general, it is necessary to perform two main activities, for which we lack methods and tools. The first consists in designing what we call the website schema.org, which is the fragment of schema.org that is relevan...

HL7 FHIR and Schema.org

Schema.org was developed by a number of major search engine companies such as Bing, Google and Yahoo! as a common vocabulary for marking up web pages. The combination of HTML and Microdata, RDFa 1.1 Lite or JSON-LD enables a well-known set of semantic tags to be added to existing human-readable web pages. Schema.org has been widely adopted by public web sites and multiple extensions have been c...

A Quantitative Analysis of the Use of Microdata for Semantic Annotations on Educational Resources

A current trend in the semantic web is the use of embedded markup formats aimed to semantically enrich web content by making it more understandable to search engines and other applications. The deployment of Microdata as a markup format has increased thanks to the widespread of a controlled vocabulary provided by Schema.org. Recently, a set of properties from the Learning Resource Metadata Init...

Enriching Webpages with Semantic Information

This paper proposes a tool to automatically enrich webpages with semantic information by annotating keywords in the document with microdata markup. There are two case studies described and implemented in this paper. The first case study focuses on generating new webpages with microdata and the second case study focuses on enriching existing webpages with microdata. This paper also demonstrates ...

Publication date: 2015