A Common Solution for Tokenization and Part-of-Speech Tagging

Authors

  • Jorge Graña Gil
  • Miguel A. Alonso
  • Manuel Vilares Ferro
Abstract

Cross-Language Access to Recorded Speech in the MALACH Project. Douglas W. Oard, Dina Demner-Fushman (University of Maryland, USA), Jan Hajič (Charles University, Prague, Czech Republic), Bhuvana Ramabhadran (IBM T. J. Watson Research Center, USA), Samuel Gustman (Survivors of the Shoah Visual History Foundation, Los Angeles, USA), William J. Byrne (Johns Hopkins University, Baltimore, USA), Dagobert Soergel, Bonnie Dorr, Philip Resnik (University of Maryland, USA), Michael Picheny (IBM T. J. Watson Research Center, USA)

Similar resources

Simultaneous Tokenization and Part-Of-Speech Tagging for Arabic without a Morphological Analyzer

We describe an approach to simultaneous tokenization and part-of-speech tagging that is based on separating the closed and open-class items, and focusing on the likelihood of the possible stems of the open-class words. By encoding some basic linguistic information, the machine learning task is simplified, while achieving state-of-the-art tokenization results and competitive POS results, although ...

Practical application of one-pass Viterbi algorithm in tokenization and part-of-speech tagging

Sentence word segmentation and Part-Of-Speech (POS) tagging are common preprocessing tasks for many Natural Language Processing (NLP) applications. This paper presents a practical application for POS tagging and segmentation disambiguation using an extension of the one-pass Viterbi algorithm called Viterbi-N. We introduce the internals of the developed system, which is based on lattices and a st...

LTL-UDE @ EmpiriST 2015: Tokenization and PoS Tagging of Social Media Text

We present a detailed description of our submission to the EmpiriST shared task 2015 for tokenization and part-of-speech tagging of German social media text. As relatively little training data is provided, neither tokenization nor PoS tagging can be learned from the data alone. For tokenization, our system uses regular expressions for general cases and word lists for exceptions. For PoS tagging...

Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop

We present an approach to using a morphological analyzer for tokenizing and morphologically tagging (including part-of-speech tagging) Arabic words in one process. We learn classifiers for individual morphological features, as well as ways of using these classifiers to choose among entries from the output of the analyzer. We obtain accuracy rates on all tasks in the

The Effect of Automatic Tokenization, Vocalization, Stemming, and POS Tagging on Arabic Dependency Parsing

We use an automatic pipeline of word tokenization, stemming, POS tagging, and vocalization to perform real-world Arabic dependency parsing. In spite of the high accuracy on the modules, the very few errors in tokenization, which reaches an accuracy of 99.34%, lead to a drop of more than 10% in parsing, indicating that no high quality dependency parsing of Arabic, and possibly other morphologica...

Publication date: 2002