SoMaJo: State-of-the-art tokenization for German web and social media texts
Authors
Abstract
In this paper we describe SoMaJo, a rule-based tokenizer for German web and social media texts that was the best-performing system in the EmpiriST 2015 shared task, with an average F1-score of 99.57. We give an overview of the system and the phenomena its rules cover, as well as a detailed error analysis. The tokenizer is available as
Similar resources
A POS Tagger for Social Media Texts Trained on Web Comments
The use of social media tools such as blogs and forums has become more and more popular in recent years. Hence, a huge collection of social media texts from different communities is available for accessing user opinions, e.g., for marketing studies or acceptance research. Typically, methods from Natural Language Processing are applied to social media texts to automatically recognize user opinions. ...
EmpiriST: AIPHES - Robust Tokenization and POS-Tagging for Different Genres
We present our system used for the AIPHES team submission in the context of the EmpiriST shared task on “Automatic Linguistic Annotation of Computer-Mediated Communication / Social Media”. Our system is based on a rule-based tokenizer and a machine learning sequence labelling POS tagger using a variety of features. We show that the system is robust across the two tested genres: German computer me...
LTL-UDE @ EmpiriST 2015: Tokenization and PoS Tagging of Social Media Text
We present a detailed description of our submission to the EmpiriST shared task 2015 for tokenization and part-of-speech tagging of German social media text. As relatively little training data is provided, neither tokenization nor PoS tagging can be learned from the data alone. For tokenization, our system uses regular expressions for general cases and word lists for exceptions. For PoS tagging...
The Role of the German Researchers in the Formation of Islamic Art Studies
At the beginning of the nineteenth century, with the increasing interest of Europeans in the culture of the East, the first articles on Islamic art and culture appeared in German-speaking countries. In the mid-nineteenth century, some entries in German encyclopedias were devoted to Islamic art, and from the end of the century, the first monographs on Islamic architecture and orname...
The TextPro Tool Suite
We present TextPro, a suite of modular Natural Language Processing (NLP) tools for analysis of Italian and English texts. The suite has been designed so as to integrate and reuse state-of-the-art NLP components developed by researchers at FBK. The current version of the tool suite provides functions ranging from tokenization to chunking and Named Entity Recognition (NER). The system's architect...