TGermaCorp - A (Digital) Humanities Resource for (Computational) Linguistics
Abstract
TGermaCorp is a German text corpus whose primary sources are drawn from German literary texts dating from the sixteenth century to the present. The corpus is intended to represent its target language (German) in its syntactic, lexical, stylistic and chronological diversity. For this purpose, it is hand-annotated on several linguistic layers, including POS, lemma, named entities, multiword expressions, clauses, sentences and paragraphs. To situate TGermaCorp relative to more homogeneous corpora of contemporary everyday language, quantitative assessments of syntactic and lexical diversity are provided. In this respect, TGermaCorp contributes to establishing characterising features for resource descriptions, which are needed to compare the ever-growing number of natural language resources in a meaningful way. The assessments confirm the special role of proper names, whose propagation in text may influence lexical and syntactic diversity measures in rather trivial ways. TGermaCorp will be made available via hucompute.org.
Similar resources
GutenTag: an NLP-driven Tool for Digital Humanities Research in the Project Gutenberg Corpus
This paper introduces a software tool, GutenTag, which is aimed at giving literary researchers direct access to NLP techniques for the analysis of texts in the Project Gutenberg corpus. We discuss several facets of the tool, including the handling of formatting and structure, the use and expansion of metadata which is used to identify relevant subcorpora of interest, and a general tagging frame...
Enhancing Access to Media Collections and Archives Using Computational Linguistic Tools
In this paper, we outline the strategies, methodology, and infrastructure needed to bring advanced computational linguistic tools to researchers and archivists in the humanities. We discuss three use cases involving the application of the Language Application Grid (LAPPS), an open, web-based infrastructure providing interoperable access to hundreds of computational linguistic (CL) component web...
Lessons from a Massive Open Online Course (MOOC) on Natural Language Processing for Digital Humanities
In this paper, we present the concept, content and experience with an actively running Massive Open Online Course (MOOC) on Natural Language Processing for Digital Humanities. This video-based course is held in German, does not require any programming skills, and serves as an introduction to automatic text analysis. The target audience is anyone who is interested in applying basic language tech...
Integration of Linguistic Markup into Semantic Models of Folk Narratives: The Fairy Tale Use Case
Propp’s influential structural analysis of fairy tales created a powerful schema for representing storylines in terms of character functions, which is straightforward to exploit in computational semantic analysis and procedural generation of stories of this genre. We tackle two resources that draw on the Proppian model – one formalizes it as a semantic markup scheme and the other as an ontology...