Improved Arabic Base Phrase Chunking with a new enriched POS tag set
Abstract
Base Phrase Chunking (BPC), or shallow syntactic parsing, is proving to be a task of interest to many natural language processing applications. In this paper, a BPC system is introduced that improves over state-of-the-art performance in BPC using a new part-of-speech (POS) tag set. The new POS tag set, ERTS, reflects some of the morphological features specific to Modern Standard Arabic. ERTS explicitly encodes definiteness, number, and gender information, increasing the number of tags from 25 in the standard LDC reduced tag set to 75. For the BPC task, we introduce a more language-specific set of definitions for the base phrase annotations. We employ a support vector machine approach for both the POS tagging and the BPC processes. POS tagging with the enriched tag set, ERTS, reaches 96.13% accuracy. In the BPC experiments, we vary the feature set along two factors: the POS tag set and a set of explicitly encoded morphological features. Using the ERTS POS tag set, BPC achieves the highest overall Fβ=1 score of 96.33% on 10 different chunk types, outperforming the standard POS tag set even when explicit morphological features are present.
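For readers unfamiliar with the Fβ=1 measure quoted above, the following is a minimal sketch of how a chunk-level score of this kind is commonly computed. It assumes chunks are encoded with IOB-style tags such as `B-NP` and `I-NP`, which the abstract does not state; the function names `extract_chunks` and `f_beta` and the toy tag sequences are hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple

# A chunk is (type, start index, end index exclusive).
Chunk = Tuple[str, int, int]


def extract_chunks(tags: List[str]) -> Set[Chunk]:
    """Collect chunks from one sentence of IOB tags (assumed encoding)."""
    chunks: Set[Chunk] = set()
    start, ctype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes the last chunk
        # The open chunk continues only on an I- tag of the same type.
        inside = tag.startswith("I-") and ctype == tag[2:]
        if ctype is not None and not inside:
            chunks.add((ctype, start, i))
            start, ctype = None, None
        # Open a new chunk on B-, or on an I- tag with no chunk open.
        if tag.startswith("B-") or (tag.startswith("I-") and ctype is None):
            start, ctype = i, tag[2:]
    return chunks


def f_beta(gold: List[List[str]], pred: List[List[str]], beta: float = 1.0) -> float:
    """Chunk-level F score: a chunk counts only if type and span both match."""
    gold_chunks, pred_chunks = set(), set()
    for sent_idx, (g_tags, p_tags) in enumerate(zip(gold, pred)):
        gold_chunks |= {(sent_idx,) + c for c in extract_chunks(g_tags)}
        pred_chunks |= {(sent_idx,) + c for c in extract_chunks(p_tags)}
    correct = len(gold_chunks & pred_chunks)
    precision = correct / len(pred_chunks) if pred_chunks else 0.0
    recall = correct / len(gold_chunks) if gold_chunks else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy example (made-up tags, not data from the paper).
gold = [["B-NP", "I-NP", "O", "B-VP"]]
pred = [["B-NP", "I-NP", "O", "O"]]
score = f_beta(gold, pred)
```

With β = 1, precision and recall are weighted equally, so the score is their harmonic mean. In the toy example the predicted tags recover one of the two gold chunks with no spurious chunks.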
Similar resources
YADAC: Yet another Dialectal Arabic Corpus
This paper presents the first phase of building YADAC – a multi-genre Dialectal Arabic (DA) corpus – that is compiled using Web data from microblogs (i.e. Twitter), blogs/forums and online knowledge market services in which both questions and answers are user-generated. In addition to introducing two new genres to the current efforts of building DA corpora (i.e. microblogs and question-answer p...
A Hybrid Model for Phrase Chunking Employing Artificial Immunity System and Rule Based Methods
Natural Language Understanding (NLU), an important field of Artificial Intelligence (AI), is concerned with speech and language understanding between humans and computers. Understanding language means knowing what concept a word or phrase stands for and how to link them to form meaningful sentences. Identification of phrases or phrase chunking is an important step in natural language understand...
A supervised POS tagger for written Arabic social networking corpora
This paper presents an implementation of Brill's Transformation-Based Part-of-Speech (POS) tagging algorithm trained on a manually-annotated Twitter-based Egyptian Arabic corpus of 423,691 tokens and 70,163 types. Unlike standard POS morphosyntactic annotation schemes which label each word based on its word-level morphosyntactic features, we use a function-based annotation scheme in which words...
A New Method for Extracting Named Entities in Classical Arabic
In Natural Language Processing (NLP) studies, developing resources and tools contributes to the extension and effectiveness of research in each language. In recent years, Arabic Named Entity Recognition (ANER) has attracted the attention of NLP researchers because of its significant impact on improving other NLP tasks such as machine translation, information retrieval, question answering, query result...
Second Generation AMIRA Tools for Arabic Processing: Fast and Robust Tokenization, POS tagging, and Base Phrase Chunking
In this paper, we address the problem of processing Modern Standard Arabic. We present the second generation of tools that process Arabic (AMIRA). AMIRA is a successor suite to the ASVMTools. The AMIRA toolkit includes a clitic tokenizer (TOK), part of speech tagger (POS) and base phrase chunker (BPC) shallow syntactic parser. The technology of AMIRA is based on supervised learning with no expl...