Joint Learning of Structural and Textual Features for Web Scale Event Extraction
Abstract
The web has become the central platform and marketplace for organizing and promoting events and for selling tickets of any kind. Such events range from concerts, workshops, sports events, and professional events to small local events. Individuals' event choices vary tremendously with preferences and lifestyle. Online users use the web to inform themselves about new events near their location of choice, and potentially use a website to purchase tickets or make a reservation for such events. Event extraction from the web is a particularly difficult type of information extraction that deals with detecting specific types of events and their attributes mentioned in source-language data. Traditional research in event extraction focuses on extracting political, cultural, or other general-interest events from text. Such text is typically editorial news content, e.g., (Kuzey, Vreeken, and Weikum 2014), or, more recently, social media such as Twitter, e.g., (Ritter, Etzioni, and Clark 2012). This research, however, covers events presented in tables, in lists, or, most crucially, on single pages devoted to one event. This thesis focuses on both the discovery and the extraction of such “single event pages”. This work is inspired by a series of works on inducing wrappers for the extraction of specific document types from the web. For example, (Wang et al. 2009) proposes an approach for learning to extract news articles and their basic attributes from a very small training corpus. Though inspired by this work, the approach presented here differs considerably in both scope and techniques. In scope, I target events, which carry many more attributes than the document types targeted in the above work, and whose attributes may occur both in the template structure (as in (Wang et al. 2009)) and in the event descriptions. Further, my approach offsets the need for more training data, which arises from the more complex domain, with semi-supervised methods for acquiring that training data.
Similar resources
Automatic Extraction of Textual Elements from News Web Pages
In this paper we present an algorithm for automatic extraction of textual elements, namely titles and full text, associated with news stories in news web pages. We propose a supervised machine learning classification technique based on the use of a Support Vector Machine (SVM) classifier to extract the desired textual elements. The technique uses internal structural features of a webpage withou...
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
Analyzing new features of infected web content in detection of malicious web pages
Recent improvements in web standards and technologies enable attackers to hide and obfuscate malicious code with new methods and thus escape security filters. In this paper, we study the application of machine learning techniques in detecting malicious web pages. In order to detect malicious web pages, we propose and analyze a novel set of features including HTML, JavaScript (jQuery...
Towards Cross-Media Feature Extraction
In this paper we describe past and present work dealing with the use of textual resources, out of which semantic information can be extracted in order to provide for semantic annotation and indexing of associated image or video material. Since the emergence of semantic web technologies and resources, entities, relations and events extracted from textual resources by means of Information Extract...
EXTRACTION-BASED TEXT SUMMARIZATION USING FUZZY ANALYSIS
Due to the explosive growth of the world-wide web, automatic text summarization has become an essential tool for web users. In this paper we present a novel approach for creating text summaries. Using fuzzy logic and WordNet, our model extracts the most relevant sentences from an original document. The approach utilizes fuzzy measures and inference on the extracted textual information from the docu...