Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Abstract:

Extracting structured information about entities from web text is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications, including search engines, question-answering systems, recommender systems, and machine translation. An information extraction system aims to identify the entities in a text and extract their related information to form a profile of each target entity. In recent years, several methods have been proposed for extracting structured information from web text. Most existing methods for extracting entity-centric information require a predefined ontology that contains complete knowledge of the entities and their attributes. The main limitation of these methods is their inability to extract information about entities that are not already defined in the ontology. In addition, existing methods ignore semantic information extraction and do not link the extracted information to entries in a general ontology. Semantic information extraction therefore remains an open problem. In this work, we propose a new method for the automatic extraction of semantically structured information from Farsi web text. The proposed method does not require background knowledge about the entities and their properties. It consists of three main phases: pre-processing, semantic analysis, and frame extraction. To carry out these phases, we use a combination of language resources, text processing tools, and distant ontologies. The method focuses on enriching the predicate-argument frames with semantic information extracted from distant ontologies, extracting the entity-related information from the predicate-argument frames, and linking the extracted information to its corresponding sense in the DBpedia ontology. This facilitates the processing of Farsi texts by computers. To evaluate the proposed method, we created a small Farsi dataset containing 100 complete sentences and compared the proposed method with three information extraction methods on this dataset. The experimental results show that the proposed method outperforms the counterpart methods in terms of precision and F1 measure.
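As an illustration of the reported metrics, the following is a minimal sketch of how precision, recall, and F1 could be computed for an extraction system. The abstract does not state how extractions were matched against the gold annotations; exact matching of (entity, attribute, value) triples, the function name `precision_recall_f1`, and the toy data are all assumptions made for this example.

```python
from typing import Iterable, Tuple

# Hypothetical representation of one extracted fact: (entity, attribute, value).
Triple = Tuple[str, str, str]


def precision_recall_f1(predicted: Iterable[Triple],
                        gold: Iterable[Triple]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 under exact triple matching (an assumption)."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)  # extractions that also appear in the gold set
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data, invented purely for illustration.
gold = [
    ("Tehran", "country", "Iran"),
    ("Tehran", "type", "City"),
    ("Hafez", "occupation", "Poet"),
    ("Hafez", "birthPlace", "Shiraz"),
]
predicted = [
    ("Tehran", "country", "Iran"),
    ("Hafez", "occupation", "Poet"),
    ("Hafez", "birthPlace", "Tehran"),  # incorrect extraction
]

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Here two of the three predicted triples are correct and two of the four gold triples are recovered, so precision is 2/3, recall is 1/2, and F1 is 4/7.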

similar resources

Extracting Structured Data from Web Pages (Poster)

Many web sites contain a large collection of “structured” web pages. These pages encode data from an underlying structured source, and are typically generated dynamically. An example of such a collection is the set of book pages in Amazon. There are two important characteristics of such a collection: first, all the pages in the collection contain structured data conforming to a common schema; s...

A Framework for Extracting, Classifying, Analyzing, and Presenting Information from Semi-Structured Web Data Sources

Extracting information from web data sources has become very important because of the massive and increasing amount of diverse semi-structured information sources on the Internet that are available to users, and the variety of web pages makes information extraction from the web a challenging problem. This paper proposes a framework for extracting, classifying, analyzing, and presentin...

A Novel Method for Extracting Information from Web Pages with Multiple Presentation Templates

Web information extraction is a key part of web data integration. With the needs of e-commerce websites and the development of web design, web pages with multiple presentation templates have arisen. Current web information extraction systems are usually based on a single presentation template, so information cannot be extracted efficiently from web pages with multiple presentation templates. This paper focuses on th...

A Method for Extracting Task-related Information from Social Media based on Structured Domain Knowledge

Social media platforms have come into the focus of research as sources of information about the unfolding situation in disaster contexts. Incorporating information from social media into decision-making is still difficult though. One reason may be that the prevalent approach to data analysis works bottom-up, which has several limitations. In this paper, we adopt a top-down approach by means of ...

Bootstrapping Information Extraction from Semi-structured Web Pages

We consider the problem of extracting structured records from semi-structured web pages with no human supervision required for each target web site. Previous work on this problem has either required significant human effort for each target site or used brittle heuristics to identify semantic data types. Our method only requires annotation for a few pages from a few sites in the target domain. T...

Journal title

volume 19, issue 2

pages 133-146

publication date 2022-09
