Automating XML markup of text documents
Authors
Abstract
We present a novel system for automatically marking up text documents into XML and discuss the benefits of XML markup for intelligent information retrieval. The system uses the Self-Organizing Map (SOM) algorithm to arrange XML marked-up documents on a two-dimensional map so that similar documents appear closer to each other. It then employs an inductive learning algorithm, C5, to automatically extract markup rules from the nearest SOM neighbours of an unmarked document and apply them to that document. The system is designed to be adaptive, so that once a document is marked up, the system's behaviour is modified to improve accuracy. The automatically marked-up documents are again categorized on the Self-Organizing Map.
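The SOM stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`train_som`, `best_matching_unit`, `nearest_marked_documents`), the 3x3 grid, the learning-rate and neighbourhood schedules, and the toy term-frequency vectors are all assumptions made for illustration. The C5 rule-induction step is omitted, so the sketch only shows how marked-up documents might be placed on a map and how the nearest marked-up neighbours of an unmarked document could be looked up.

```python
import math
import random


def best_matching_unit(weights, v):
    """Return the grid cell whose weight vector is closest to v (squared Euclidean)."""
    return min(weights, key=lambda n: sum((a - b) ** 2 for a, b in zip(weights[n], v)))


def train_som(vectors, rows=3, cols=3, epochs=200, lr0=0.5, seed=0):
    """Train a tiny Self-Organizing Map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    for t in range(epochs):
        frac = t / epochs
        lr = lr0 * (1.0 - frac)                 # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 0.5     # neighbourhood radius shrinks
        v = vectors[rng.randrange(len(vectors))]
        bmu = best_matching_unit(weights, v)
        for node, w in weights.items():
            grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
            h = math.exp(-grid_d2 / (2 * sigma * sigma))  # Gaussian neighbourhood
            for i in range(dim):
                w[i] += lr * h * (v[i] - w[i])  # pull cell towards the sample
    return weights


def nearest_marked_documents(weights, marked, query, k=2):
    """Names of the k marked-up documents whose map cells lie closest to the query's cell."""
    q = best_matching_unit(weights, query)

    def grid_dist(name):
        n = best_matching_unit(weights, marked[name])
        return (n[0] - q[0]) ** 2 + (n[1] - q[1]) ** 2

    # Ties on grid distance are broken alphabetically by document name.
    return sorted(marked, key=lambda name: (grid_dist(name), name))[:k]


# Toy term-frequency vectors over a hypothetical four-term vocabulary.
marked = {
    "doc_a": [1.0, 0.9, 0.0, 0.0],
    "doc_b": [0.9, 1.0, 0.1, 0.0],
    "doc_c": [0.0, 0.1, 1.0, 0.9],
}
som = train_som(list(marked.values()))
unmarked = [0.8, 0.8, 0.1, 0.0]
print(nearest_marked_documents(som, marked, unmarked, k=2))
```

In a pipeline of the kind the abstract outlines, the documents returned by the neighbour lookup would serve as training examples for the rule learner, and the newly marked-up document would then be added to the map.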
Similar resources
A Method for Automating Text Markup
Markup languages based on XML are increasingly popular, and languages for other formats such as RDF are under active development. One of the problems involved in converting legacy documents to use XML or other markup formats is the insertion of tags into the document and the consequent rearrangement of text required when markup is added to an existing, un-marked-up document. This paper describe...
Automating XML Markup using Machine Learning Techniques
In this paper we present a novel system for automatically marking up text documents into XML. The system uses the techniques of the Self-Organising Map (SOM) algorithm in conjunction with an inductive learning algorithm, C5.0. The SOM algorithm clusters the XML marked-up documents on a two-dimensional map such that documents having similar content are placed close to each other. The C5.0 algori...
Searching Multi-hierarchical XML Documents: The Case of Fragmentation
To properly encode properties of textual documents using XML, multiple markup hierarchies must be used, often leading to conflicting markup in encodings. Text Encoding Initiative (TEI) Guidelines[1] recognize this problem and suggest a number of ways to incorporate multiple hierarchies in a single well-formed XML document. In this paper, we present a framework for processing XPath queries o...
Metaheuristic Clustering of Persian XML Documents Based on Structural and Content Similarity
Due to the increasing number of XML documents, organizing them effectively in order to retrieve useful information is essential. A possible solution is to cluster XML documents in order to discover knowledge. A key issue in clustering XML documents is how to measure the similarity between them. Conventional clustering of text documents using a do...
Unification of XML Documents with Concurrent Markup
Annotating multiple hierarchies with SGML-based markup systems is still one of the fundamental problems of text-technological research. Up to now, several solutions have been discussed (e.g. chapter 31 of the TEI-Guidelines (Sperberg-McQueen and Burnard 1994) and Barnard et al. (1995)). Furthermore, some non-SGML based approaches have been proposed (cf. Huitfeldt and Sperberg-McQueen (2001); T...