An Exploration of Language Identification Techniques for the Dutch Folktale Database

Authors

  • Dolf Trieschnigg
  • Djoerd Hiemstra
  • Mariët Theune
  • Franciska de Jong
  • Theo Meder

Abstract

The Dutch Folktale Database contains fairy tales, traditional legends, urban legends, and jokes written in a large variety and combination of languages, including (Middle and 17th-century) Dutch, Frisian, and a number of Dutch dialects. In this work we compare a number of approaches to automatic language identification for this collection. We show that, compared with typical language identification tasks, classification performance is low for highly similar languages with little training data. The studied dataset, consisting of over 39,000 documents in 16 languages and dialects, is available on request for follow-up research.



Publication date: 2012