Word formation (word building)

Una Nueva Técnica de Construcción de Grafos Semánticos para la Desambiguación Bilingüe del Sentido de las Palabras

Journal: Procesamiento del Lenguaje Natural, 2013

Andres Duque Fernandez, Lourdes Araujo, Juan Martinez-Romo

In this paper we present preliminary results obtained by the application of a new technique for building semantic graphs to the task of cross-lingual word sense disambiguation. Through the use of this unsupervised technique, we induce the senses associated with the translations of the ambiguous word in the target language. For this purpose, we use the translation of the words in the context of ...

Unsupervised Morphological Expansion of Small Datasets for Improving Word Embeddings

Journal: CoRR, 2017

Syed Sarfaraz Akhtar, Arihant Gupta, Avijit Vajpayee, Arjit Srivastava, Manish Shrivastava

We present a language-independent, unsupervised method for building word embeddings using morphological expansion of text. Our model handles the problem of data sparsity and yields improved word embeddings by training them on artificially generated sentences. We evaluate our method using small-sized training sets on eleven test sets for the word similarity task across seve...

Ngram2vec: Learning Improved Word Representations from Ngram Co-occurrence Statistics

2017

Zhe Zhao, Tao Liu, Shen Li, Bofang Li, Xiaoyong Du

The existing word representation methods mostly limit their information source to word co-occurrence statistics. In this paper, we introduce ngrams into four representation methods: SGNS, GloVe, PPMI matrix, and its SVD factorization. Comprehensive experiments are conducted on word analogy and similarity tasks. The results show that improved word representations are learned from ngram cooccurre...

Identifying trends in word frequency dynamics

Journal: CoRR, 2013

Eduardo G. Altmann, Zakary L. Whichard, Adilson E. Motter

The word-stock of a language is a complex dynamical system in which words can be created, evolve, and become extinct. Even more dynamic are the short-term fluctuations in word usage by individuals in a population. Building on the recent demonstration that word niche is a strong determinant of future rise or fall in word frequency, here we introduce a model that allows us to distinguish persiste...

Word Graphs: The Third Set

2001

Cornelis Hoede, L. Zhang

This is the third paper in a series on natural language processing in terms of knowledge graphs. The word is a basic unit in natural language processing, which is why we study word graphs. Word graphs were already built for prepositions and adwords (including adjectives, adverbs and Chinese quantity words) in two other papers [2], [3]. In this paper, we propose the concept of the logic word and cla...

Word-Sense Disambiguation for Machine Translation

2004

In word sense disambiguation, a system attempts to determine the sense of a word from contextual features. Major barriers to building a high-performing word sense disambiguation system include the difficulty of labeling data for this task and of predicting fine-grained sense distinctions. In contrast, we can use parallel language corpora as a large supply of potential data. In this paper we presen...

Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints

2014

Xiaodong Zeng, Lidia S. Chao, Derek F. Wong, Isabel Trancoso, Liang Tian

This study investigates building a better Chinese word segmentation model for statistical machine translation. It aims to leverage word boundary information, automatically learned by bilingual character-based alignments, to induce a preferable segmentation model. We propose treating the induced word boundaries as soft constraints to bias the continuous learning of a supervised CRFs mod...

Word Co-occurrence Counts Prediction for Bilingual Terminology Extraction from Comparable Corpora

2013

Amir Hazem, Emmanuel Morin

Methods for bilingual lexicon extraction from comparable corpora are often based on observed word co-occurrences and are by nature more effective with large corpora. In most cases, specialized comparable corpora are small, and this has a direct impact on bilingual terminology extraction results. In order to overcome insufficient data coverage and to make ...

Philippine Languages Online Corpora: Status, issues, and prospects

2011

Shirley N. Dita, Rachel E. O. Roxas

This paper presents the work done so far on building an online corpus for Philippine languages. As for its status, the Philippine Languages Online Corpora (PLOC) now boasts a 250,000-word written corpus of the eight major languages in the archipelago. Some of the issues confronting the corpus building and future directions for this project are likewise discussed in this paper.

Building a Prototype Text to Speech for Sanskrit

2010

Baiju Mahananda, C. M. S. Raju, Ramalinga Reddy Patil, Narayana Jha, Shrinivasa Varakhedi, Prahallad Kishore

This paper describes the work done in building a prototype text-to-speech system for Sanskrit. A basic prototype text-to-speech system is built using a simplified Sanskrit phone set and a unit selection technique, in which prerecorded sub-word units are concatenated to synthesize a sentence. We also discuss the issues involved in building a full-fledged text-to-speech system for Sanskrit.
