Semantic Press
Abstract
This paper introduces Semantic Press, a tool for automatic press review. It is based on Text Mining technologies and is tailored to the needs of the eGovernment and eParticipation communities. First, a general description of the application demands emerging from the eParticipation and eGovernment sectors is offered. Then, the framework for the automatic analysis and classification of newspaper content is introduced, together with a description of the technologies underlying it.

1. The Framework of Semantic Press: the Linguistic Miner

The Semantic Press (SP) tool evolved from a complex system, the so-called Linguistic Miner (LM; Picchi et al., 2004), set up in 2003 with the aim of developing a framework for the automatic extraction of linguistic knowledge from very large amounts of text (from different sources and in different formats), to be exploited in didactic, editorial and cultural products. SP not only uses language resources extracted from LM but also adopts many of that system's distinctive operating modalities and analysis tools.

Building LM involved two fundamental steps: first, the data were gathered; second, they were linguistically analysed so that they could be further processed and classified. The first step produced a repository (a “mine”) of around 200 million words, together with an automatic topic classification of texts. This was achieved through procedures for updating and extending the textual data within the “mine” and for automatically acquiring data from the Web through the Spider technology, both with periodic updating and by means of user-defined paths. The “mine” is thus constantly growing.
The second step consists of the automatic linguistic processing of the collected textual material, using the modules of the PiSystem (Picchi, 1994), an integrated framework for the processing of textual and lexical material whose most important module is the so-called Data Base Testuale (Textual Data Base, DBT for short). The most effective procedures for further analyses of texts are POS tagging and lemmatization, which were performed on the whole repository. The linguistic analyses and their annotations were also recorded within the “mine”, which is therefore not only a repository of textual materials but also a linguistically annotated database.

Text Mining techniques are applied in many frameworks: for instance, Inxight’s LinguistX (Inxight White Paper, 2006), IBM’s Intelligent Miner (Intelligent Miner White Paper, v.8.2), TextWise, etc. In this scenario, LM stands out because it is based on tools and basic technologies developed to carry out sound linguistic analyses supporting language resources for news applications. LM is specifically tailored to the analysis of Italian, but is open to other languages.

In the last year, LM has been directed at the needs of political and institutional bodies (such as Regione Toscana), which expressed their interest in using a tool for intelligent access to the flow of news and information provided by Italian newspapers available on the Internet. This aim is in line with the current interest of CNR-ILC in the topics of eParticipation and eGovernment, also confirmed by its participation in the DEMO-net project (http://www.demo-net.org/).

Figure 1: the SP system schema.
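To make the idea of a linguistically annotated "mine" more concrete, the following is a minimal sketch, not the actual LM or DBT implementation. It assumes a tiny hand-written lexicon (`TOY_LEXICON`) in place of a real POS tagger and lemmatizer, and every name in it (`Token`, `Mine`, `annotate`, `search_lemma`) is hypothetical. It only illustrates how storing a lemma and a POS tag alongside each word form lets a query on a lemma retrieve texts that contain any of its inflected forms.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical toy lexicon: surface form -> (lemma, POS tag).
# A real system would use a full tagger and lemmatizer and would
# resolve ambiguous forms in context; this table ignores ambiguity.
TOY_LEXICON = {
    "il": ("il", "DET"),
    "la": ("il", "DET"),
    "le": ("il", "DET"),
    "governo": ("governo", "NOUN"),
    "governi": ("governo", "NOUN"),
    "approva": ("approvare", "VERB"),
    "approvano": ("approvare", "VERB"),
    "legge": ("legge", "NOUN"),
    "leggi": ("legge", "NOUN"),
    "nuove": ("nuovo", "ADJ"),
}


@dataclass(frozen=True)
class Token:
    form: str   # word as it appears in the text (lowercased)
    lemma: str  # dictionary form
    pos: str    # part-of-speech tag, "UNK" if not in the lexicon


def annotate(word):
    """Look up lemma and POS for one lowercased word form."""
    lemma, pos = TOY_LEXICON.get(word, (word, "UNK"))
    return Token(word, lemma, pos)


@dataclass
class Mine:
    """Stores annotated texts plus an index from lemma to document ids."""
    documents: dict = field(default_factory=dict)
    lemma_index: dict = field(default_factory=lambda: defaultdict(set))

    def add(self, doc_id, text):
        tokens = [annotate(w) for w in re.findall(r"\w+", text.lower())]
        self.documents[doc_id] = tokens
        for token in tokens:
            self.lemma_index[token.lemma].add(doc_id)

    def search_lemma(self, lemma):
        """Return ids of documents containing any form of the lemma."""
        return sorted(self.lemma_index.get(lemma, set()))


mine = Mine()
mine.add("a1", "Il governo approva la legge")
mine.add("a2", "I governi approvano le nuove leggi")
print(mine.search_lemma("approvare"))
```

Here both toy articles are found by the lemma "approvare" even though they contain different inflected forms, which is the practical benefit of recording lemmatization results in the repository rather than indexing raw word forms only.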