Building a historical corpus for Classical Portuguese: some technological aspects

نویسندگان

  • Maria Clara Paixão de Sousa
  • Thorsten Trippel
چکیده

This paper describes the restructuring process of a large corpus of historical documents and the system architecture that is used for accessing it. The initial challenge of this process was to get the most out of existing material, normalizing the legacy markup and harvesting the inherent information using widely available standards. This resulted in a conceptual and technical restructuring of the formerly existing corpus. The development of the standardized markup and techniques allowed the inclusion of important new materials, such as original 16th and 17th century prints and manuscripts; and enlarged the potential user groups. On the technological side, we were grounded on the premise that open standards are the best way of making sure that the resources will be accessible even after years in an archive. This is a welcomed result in view of the additional consequence of the remodeled corpus concept: it serves as a repository for important historical documents, some of which had been preserved for 500 years in paper format. This very rich material can from now on be handled freely for linguistic research goals.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Building a Corpus-based Historical Portuguese Dictionary: Challenges and Opportunities

Historical corpora are important resources for different areas. Philology, Human Language Technology, Literary Studies, History, and Lexicography are some that benefit from them. However, compiling historical corpora is different from compiling contemporary corpora. Corpus designers have to deal with several characteristics inherent in historical texts, such as: absence of a spelling standard, ...

متن کامل

Nature, Politics and Architecture; Reading Out the Interaction of Nature, Politics and Culture Components in the Architecture Creating Process of Tabriz Blue Mosque

Tabriz Blue Mosque is a valuable historical monument from the 9th century AH, which has been built during the Kara - Koyunlu of Turkomans rule on northwestern Iran and about 35 years before the beginning of the Safavid Iranian government. This building has some features that make it to be distinguished from other monuments of the Azerbaijan region and even Iran. These features have attracted th...

متن کامل

Grammatical Annotation of Historical Portuguese: Generating a Corpus-Based Diachronic Dictionary

In this paper, we present an automatic system for the morphosyntactic annotation and lexicographical evaluation of historical Portuguese corpora. Using rule-based orthographical normalization, we were able to apply a standard parser (PALAVRAS) to historical data (Colonia corpus) and to achieve accurate annotation for both POS and syntax. By aligning original and standardized word forms, our met...

متن کامل

Phonologic Patterns of Brazilian Portuguese: a grapheme to phoneme converter based study

This paper presents Brazilian Portuguese phoneme patterns of distribution, according to an automatic grammar rulesbased grapheme to phoneme converter. The software Nhenhém (Vasilévski, 2008) was used for treating data: written texts which were decoded into phonologic symbols, forming a corpus, and subjected to a statistical analysis. Results support the high level of predictability of Brazilian...

متن کامل

An Integrated Tool for Annotating Historical Corpora

E-Dictor is a tool for encoding, applying levels of editions, and assigning part-ofspeech tags to ancient texts. In short, it works as a WYSIWYG interface to encode text in XML format. It comes from the experience during the building of the Tycho Brahe Parsed Corpus of Historical Portuguese and from consortium activities with other research groups. Preliminary results show a decrease of at leas...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006