Xara: an XML aware tool for corpus searching
Abstract
Xara is the working name for a new version of SARA, the "SGML aware retrieval application" originally developed in 1994 for use with the British National Corpus (BNC). The system has been completely rewritten as a general-purpose tool for searching large XML corpora. It focuses on the needs of corpus linguists, pays close attention to new XML-based encoding standards, and benefits from the hindsight of a decade of feedback from hundreds of SARA users worldwide.
Related resources
Using the XARA XML-Aware Corpus Query Tool to Investigate the METER Corpus
The METER (MEasuring TExt Reuse) corpus is a corpus designed to support the study and analysis of journalistic text reuse. It consists of a set of news stories written by the Press Association (PA), the major UK news agency, and a set of stories about the same news events, as published in various British newspapers, some of which were derived from the PA version and some of which were written i...
XARA: An XML- and Rule-based Semantic Role Labeler
XARA is a rule-based PropBank labeler for Alpino XML files, written in Java. I used XARA in my research on semantic role labeling in a Dutch corpus to bootstrap a dependency treebank with semantic roles. Rules in XARA are based on XPath expressions, which makes it a versatile tool that is applicable to other treebanks as well. In addition to automatic role annotation, XARA is able to extract tr...
A Search Tool for Corpora with Positional Tagsets and Ambiguities
This article describes POLIQARP, a corpus indexing and query tool, which understands positional tagsets and which does not assume that word forms are annotated with unique morphosyntactic tags. POLIQARP is designed to be applicable to a variety of languages and tagsets: it works with XML-encoded texts, uses the UTF-8 character set, and allows for an external specification of the tagset. Current...
Anatomy of an XML-based Text Corpus Server
This document describes an XML-based data model for annotated, modular text corpora along with a WWW-interface for browsing such corpora, reading the texts, searching for examples, and extracting information of word usages. The interface is based solely on programs and techniques belonging to the XML-family. The corpus model is designed in such a way that new parts (texts, sub-corpora) can be e...
3LB-SAT: Una herramienta de anotación semántica
We present a tool, called 3LB-SAT, for the semantic tagging of multilingual corpora. Its main features are that it is word-oriented, accepts several formats for the input corpus (TBF format, Penn Treebank Bracketed Format and XML), and uses EuroWordNet to look up word senses in four languages.