Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words
Similar resources
Swahili Text-to-speech System
Text-to-speech (TTS) applications have been applied in diverse areas all over the world. Considering that Swahili pronunciation is not complicated and that the language is spoken by about 45–100 million people as a first or second language, we considered the feasibility of, and developed, a Swahili TTS system. This paper gives an account of the Swahili TTS system developed...
A MODEL FOR EVOLUTIONARY DYNAMICS OF WORDS IN A LANGUAGE
Human language, over its evolutionary history, has emerged as one of the fundamental defining characteristics of modern man. However, this milestone evolutionary process through natural selection has not left any 'linguistic fossils' that may enable us to trace back the actual course of development of language and its establishment in human societies. Lacking analytical tools to fathom the cr...
Exploring text datasets by visualizing relevant words
When working with a new dataset, it is important to first explore and familiarize oneself with it, before applying any advanced machine learning algorithms. However, to the best of our knowledge, no tools exist that quickly and reliably give insight into the contents of a selection of documents with respect to what distinguishes them from other documents belonging to different categories. In th...
Web-based corpus acquisition for Swahili language modelling
Finding large amounts of text data for use in natural language technology is difficult for under-resourced languages such as Swahili. The corpora that are readily accessible for these languages are not sufficient to be used in language technologies, whose requirements can run into the hundreds of millions of words. This paper describes how we can take advantage of search engines such as Google ...
Competitive Intelligence Text Mining: Words Speak
Competitive intelligence (CI) has become one of the major subjects for researchers in recent years. The present research aims to achieve a part of CI by investigating the scientific articles in this field through text mining in three interrelated steps. In the first step, a total of 1143 articles published between 1987 and 2016 were selected by searching the phrase "competitive intellige...
Journal
Journal title: Data in Brief
Year: 2020
ISSN: 2352-3409
DOI: 10.1016/j.dib.2020.106517