Construction and Annotation of a French Folkstale Corpus
Authors
Abstract
In this paper, we present the digitization and annotation of a corpus of tales composed of historical texts (with Old French parts), which is, to our knowledge, the only available French tales corpus classified according to the Aarne-Thompson classification. We first studied whether the pre-processing tools, namely OCR and PoS tagging, are accurate enough to allow automatic analysis. We also manually annotated the corpus with several types of information that could prove useful for future work: character references, episodes, and motifs. The contributions are the creation of a corpus of French tales from classical anthropology material, which will be made available to the community; the evaluation of OCR and NLP tools on this noisy corpus; and the annotation with anthropological information.
Similar resources
Syntactic annotation of medieval texts: the Syntactic Reference Corpus of Medieval French (SRCMF)
This article presents the Syntactic Reference Corpus of Medieval French (SRCMF). The corpus is composed of texts taken from the two major Old French corpora, the Base de Français Médiéval and the Nouveau Corpus d'Amsterdam. This contribution describes some of the core principles of the annotation model, which is based on dependency grammar, as well as the annotation procedure and representation...
MODAL: A Multilingual Corpus Annotated for Modality
We have produced a corpus annotated for modality which amounts to approximately 20,000 words in English, French, and Italian. The annotation scheme is based on the notion of epistemic construction and is virtually language-independent. The annotation is rigorously evaluated by means of a newly developed strategy based on the alignment of the entire epistemic constructions as identified and...
Annotation référentielle du Corpus Arboré de Paris 7 en entités nommées (Referential named entity annotation of the Paris 7 French TreeBank) [in French]
The French TreeBank developed at the University Paris 7 is the main source of morphosyntactic and syntactic annotations for French. However, it does not include explicit information related to named entities, which are among the most useful pieces of information for several natural language processing tasks and applications. Moreover, no ...
DEGELS1: A comparable corpus of French Sign Language and co-speech gestures
In this paper, we describe DEGELS1, a comparable corpus of French Sign Language and co-speech gestures that has been created to serve as a testbed corpus for the DEGELS workshops. This workshop series was initiated in France for researchers studying French Sign Language and co-speech gestures in French, with the aim of comparing methodologies for corpus annotation. An extract was used for the...
DeQue: A Lexicon of Complex Prepositions and Conjunctions in French
We introduce DeQue, a lexicon covering French complex prepositions (CPRE) like à partir de (from) and complex conjunctions (CCONJ) like bien que (although). The lexicon includes fine-grained linguistic description based on empirical evidence. We describe the general characteristics of CPRE and CCONJ in French, with special focus on syntactic ambiguity. Then, we list the selection criteria used ...