Search results for: text segmentation

Number of results: 227918

2010
Mark Johnson Katherine Demuth Michael C. Frank Bevan K. Jones

This paper presents Bayesian non-parametric models that simultaneously learn to segment words from phoneme strings and learn the referents of some of those words, and shows that there is a synergistic interaction in the acquisition of these two kinds of linguistic information. The models themselves are novel kinds of Adaptor Grammars that are an extension of an embedding of topic models into PC...

Journal: TACL 2014
Benjamin Börschinger Mark Johnson

Stress has long been established as a major cue in word segmentation for English infants. We show that enabling a current state-of-the-art Bayesian word segmentation model to take advantage of stress cues noticeably improves its performance. We find that the improvements range from 10 to 4%, depending on both the use of phonotactic cues and, to a lesser extent, the amount of evidence available ...

Journal: IJCLCLP 2012
Chuan-Jie Lin Jia-Cheng Zhan Yen-Heng Chen Chien-Wei Pao

This paper proposes an approach to identify word candidates that are not Traditional Chinese, including Japanese names (written in Japanese Kanji or Traditional Chinese characters) and word variants, when doing word segmentation on Traditional Chinese text. When handling personal names, a probability model concerning formats of names is introduced. We also propose a method to map Japanese Kanji...

2008
Alexandre Labadié Violaine Prince

This paper presents a method for topic text segmentation based on semantic and syntactic distance and compares it to a well-known text segmentation algorithm, c99. To do so, we ran the two algorithms on a corpus of twenty-two French political discourses and compared their results.

2013
Atif Mahmood

Text segmentation is one of the critical steps in an OCR system for any language, because the accuracy of OCR depends on correctly segmented characters. Segmentation divides the text image into its constituent parts (i.e. lines, components or words, and individual characters). As Urdu and Arabic are highly cursive and context-sensitive in nature and have improper spacing between words, therefore,...

2014
Namisha Modi Ritu Dewan Vaneet Mohan Jashanpreet Kaur

Gurumukhi script, which is used for the Punjabi language, is a two-dimensional composition of symbols with connected and disconnected diacritics. Handwritten Gurumukhi script has complexities such as connected and overlapped text lines, words and characters. These are among the foremost sources of errors during the recognition process. Text segmentation is a challenging job in unconstrained writer indep...

2017
Noam Mor Omri Koshorek Adir Cohen

We train an LSTM-based model to predict structure in Wikipedia articles. This results in a model that is capable of segmenting any English text, is not constrained to a limited number of topics, and has much better runtime characteristics than previous methods. Finally, we introduce a new dataset which is much more extensive than current ones, and compare our method with previous methods in ter...

2017
Yan Shao Christian Hardmeier Joakim Nivre

We extensively analyse the correlations and drawbacks of conventionally employed evaluation metrics for word segmentation. Unlike in standard information retrieval, precision favours under-splitting systems and therefore can be misleading in word segmentation. Overall, based on both theoretical and experimental analysis, we propose that precision should be excluded from the standard evaluation ...
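For readers unfamiliar with the metrics discussed in this abstract, the following is a minimal sketch of how word-level precision, recall and F1 are commonly computed for word segmentation. It assumes that a predicted word counts as correct only when its character span exactly matches a gold word. The function names `word_spans` and `segmentation_prf` are illustrative and are not taken from the paper.

```python
def word_spans(words):
    """Convert a segmented sentence into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for w in words:
        end = start + len(w)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold_words, pred_words):
    """Word-level precision, recall and F1 for one sentence.

    Assumption: both segmentations cover exactly the same character string.
    """
    if "".join(gold_words) != "".join(pred_words):
        raise ValueError("gold and predicted segmentations must cover the same characters")
    gold = word_spans(gold_words)
    pred = word_spans(pred_words)
    correct = len(gold & pred)  # words whose span matches exactly
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Example: the prediction merges "the" and "dog" into one word.
p, r, f = segmentation_prf(["the", "dog", "barks"], ["thedog", "barks"])
```

In this example the prediction contains two words, one of which matches the gold segmentation, so precision is 1/2 while recall is 1/3: the denominator of precision shrinks when a system proposes fewer words.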

2004
Chooi-Ling Goh Masayuki Asahara Yuji Matsumoto

During the process of unknown word detection in Chinese word segmentation, many detected word candidates are invalid. These false unknown word candidates degrade the overall segmentation accuracy, as they also affect the segmentation accuracy of known words. Therefore, we propose to eliminate as many invalid word candidates as possible by a pruning process. Our experiments show that by cuttin...
