Building a Word Segmenter for Sanskrit Overnight

Authors

  • Vikas Reddy
  • Amrith Krishna
  • Vishnu Dutt Sharma
  • Prateek Gupta
  • Vineeth M. R
  • Pawan Goyal

Abstract

There is an abundance of digitised texts available in Sanskrit. However, the word segmentation task in such texts is challenging due to the issue of Sandhi. In Sandhi, words in a sentence often fuse together to form a single chunk of text, where the word delimiter vanishes and sounds at the word boundaries undergo transformations, which is also reflected in the written text. Here, we propose an approach that uses a deep sequence-to-sequence (seq2seq) model that takes only the sandhied string as input and predicts the unsandhied string. The state-of-the-art models are linguistically involved and have external dependencies for the lexical and morphological analysis of the input. Our model can be trained "overnight" and used in production. In spite of this knowledge-lean approach, our system performs better than the current state of the art, achieving a percentage increase of 16.79% over it.
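
The listing gives no implementation detail beyond naming a deep seq2seq model, so the following is only a minimal sketch of a character-level encoder-decoder of the kind the abstract describes, written in PyTorch. The class name SandhiSplitter, the GRU layers, the toy vocabulary, and all hyperparameters are illustrative assumptions, not the authors' architecture.

    # Minimal character-level encoder-decoder sketch (illustrative, not the paper's code).
    import torch
    import torch.nn as nn

    PAD, SOS, EOS = 0, 1, 2  # assumed special character ids

    class SandhiSplitter(nn.Module):
        def __init__(self, vocab_size, emb_dim=128, hidden=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=PAD)
            self.encoder = nn.GRU(emb_dim, hidden, batch_first=True)
            self.decoder = nn.GRU(emb_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, src, tgt):
            # src: sandhied character ids; tgt: unsandhied character ids (teacher forcing)
            _, state = self.encoder(self.embed(src))        # final encoder state
            dec_out, _ = self.decoder(self.embed(tgt), state)
            return self.out(dec_out)                        # logits over the character vocabulary

    # Toy usage: a batch of 2 random "sentences" over a 40-character vocabulary.
    model = SandhiSplitter(vocab_size=40)
    src = torch.randint(3, 40, (2, 12))                     # sandhied input characters
    tgt = torch.randint(3, 40, (2, 14))                     # unsandhied target characters
    logits = model(src, tgt[:, :-1])
    loss = nn.CrossEntropyLoss(ignore_index=PAD)(
        logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
    loss.backward()

At inference time the decoder would start from the start-of-sequence symbol and decode greedily or with beam search instead of the teacher forcing used above.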

Similar resources

Building a Prototype Text to Speech for Sanskrit

This paper describes the work done in building a prototype text-to-speech system for Sanskrit. A basic prototype text-to-speech system is built using a simplified Sanskrit phone set and employing a unit selection technique, where prerecorded sub-word units are concatenated to synthesize a sentence. We also discuss the issues involved in building a full-fledged text-to-speech system for Sanskrit.
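
As a toy illustration of the unit-selection idea mentioned above (not the paper's system), the sketch below concatenates prerecorded sub-word units for a phone sequence; the phone set is invented and the unit inventory is synthetic noise.

    # Toy illustration of unit-selection concatenation (not the paper's system):
    # look up a prerecorded waveform for each phone and join the units.
    import numpy as np

    SAMPLE_RATE = 16_000
    # Hypothetical inventory: one stored waveform per phone (here just noise).
    unit_inventory = {
        phone: np.random.uniform(-0.1, 0.1, SAMPLE_RATE // 10).astype(np.float32)
        for phone in "namste"
    }

    def synthesize(phones):
        """Concatenate the stored waveform units for a phone sequence."""
        return np.concatenate([unit_inventory[p] for p in phones])

    waveform = synthesize(list("namaste"))                  # crude rendering of "namaste"
    print(len(waveform), "samples at", SAMPLE_RATE, "Hz")

A real unit-selection system keeps several candidate recordings per unit and chooses among them with target and join costs rather than a single lookup.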

Can Symbol Grounding Improve Low-Level NLP? Word Segmentation as a Case Study

We propose a novel framework for improving a word segmenter using information acquired from symbol grounding. We generate a term dictionary in three steps: generating a pseudo-stochastically segmented corpus, building a symbol grounding model to enumerate word candidates, and filtering them according to the grounding scores. We applied our method to game records of Japanese chess with commentar...
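
The filtering step of the three-step pipeline described above might look like the following outline; the candidate strings, scores, and threshold are invented for illustration and stand in for a real grounding model.

    # Outline of the filtering step (all names and scores invented for illustration).
    def filter_candidates(candidates, grounding_score, threshold=0.5):
        """Keep word candidates whose grounding score clears the threshold."""
        term_dict = {}
        for word in candidates:
            score = grounding_score(word)
            if score >= threshold:
                term_dict[word] = score
        return term_dict

    # A fixed lookup stands in for the real symbol grounding model.
    toy_scores = {"銀": 0.9, "銀冠": 0.8, "冠を": 0.1}
    print(filter_candidates(toy_scores, lambda w: toy_scores.get(w, 0.0)))
    # -> {'銀': 0.9, '銀冠': 0.8}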

Spotting Words in Latin, Devanagari and Arabic Scripts

A system for spotting words in scanned document images in three scripts, Devanagari, Arabic, and Latin, is described. The three main components of the system are a word segmenter, a shape-based matcher for words, and a search interface. The user gives a query, which can be either a word image or text. The candidate words that are searched for in the documents are retrieved and ranked, where the ranking cri...

A Maximum Entropy Approach to Chinese Word Segmentation

We participated in the Second International Chinese Word Segmentation Bakeoff. Specifically, we evaluated our Chinese word segmenter in the open track, on all four corpora, namely Academia Sinica (AS), City University of Hong Kong (CITYU), Microsoft Research (MSR), and Peking University (PKU). Based on a maximum entropy approach, our word segmenter achieved the highest F measure for AS, CITYU, ...

The Character-based CRF Segmenter of MSRA&NEU for the 4th Bakeoff

This paper describes the Chinese Word Segmenter for the fourth International Chinese Language Processing Bakeoff. Based on a Conditional Random Field (CRF) model, a basic segmenter is designed as a character-based tagging problem. To further improve the performance of our segmenter, we employ a word-based approach to increase the in-vocabulary (IV) word recall and a post-processing step to increase ...
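
As a rough sketch of the character-based tagging view described above (not the MSRA&NEU system itself), the snippet below converts a segmented sentence into per-character B/M/E/S labels and extracts simple context features; the feature template and the toy sentence are assumptions.

    # Character-based tagging sketch: B/M/E/S labels and toy context features.
    def bmes_tags(words):
        """Map a list of words to per-character B/M/E/S tags."""
        tags = []
        for w in words:
            tags.extend(["S"] if len(w) == 1 else ["B"] + ["M"] * (len(w) - 2) + ["E"])
        return tags

    def char_features(chars, i):
        """Unigram and bigram context features for character i (assumed template)."""
        c = lambda j: chars[j] if 0 <= j < len(chars) else "<PAD>"
        return {"c0": c(i), "c-1": c(i - 1), "c+1": c(i + 1),
                "c-1c0": c(i - 1) + c(i), "c0c+1": c(i) + c(i + 1)}

    words = ["我们", "参加", "了", "评测"]                   # a toy segmented sentence
    chars = list("".join(words))
    print(list(zip(chars, bmes_tags(words))))               # per-character training labels
    print(char_features(chars, 2))                          # features for the 3rd character

A CRF sequence labeller is then trained on such per-character feature and tag sequences, and segmentation is read off the predicted tags.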

Journal:
  • CoRR

Volume: abs/1802.06185  Issue: -

Pages: -

Publication date: 2018