Freshman or Fresher? Quantifying the Geographic Variation of Internet Language

Authors

  • Vivek Kulkarni
  • Bryan Perozzi
  • Steven Skiena
Abstract

We present a new computational technique to detect and analyze statistically significant geographic variation in language. Our meta-analysis approach captures statistical properties of word usage across geographical regions and uses statistical methods to identify significant changes specific to regions. While previous approaches have primarily focused on lexical variation between regions, our method identifies words that demonstrate semantic and syntactic variation as well. We extend recently developed techniques for neural language models to learn word representations which capture differing semantics across geographical regions. In order to quantify this variation and ensure robust detection of true regional differences, we formulate a null model to determine whether observed changes are statistically significant. Our method is the first such approach to explicitly account for random variation due to chance while detecting regional variation in word meaning. To validate our model, we study and analyze two different massive online data sets: millions of tweets from Twitter spanning not only four different countries but also fifty states, as well as millions of phrases contained in the Google Book Ngrams. Our analysis reveals interesting facets of language change at multiple scales of geographic resolution – from neighboring states to distant continents. Finally, using our model, we propose a measure of semantic distance between languages. Our analysis of British and American English over a period of 100 years reveals that semantic variation between these dialects is shrinking.
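
The idea of testing an observed regional difference against a null model can be illustrated with a minimal sketch. This is not the authors' method (the abstract describes learned word representations, which are not reproduced here); it swaps in a simple permutation test on a word's relative frequency between two regions. The function names `freq` and `permutation_p_value`, the toy documents, and the parameters `n_perm` and `seed` are all assumptions made for illustration.

```python
import random


def freq(word, docs):
    """Relative frequency of `word` among all tokens in `docs`."""
    total = sum(len(d) for d in docs)
    hits = sum(d.count(word) for d in docs)
    return hits / total if total else 0.0


def permutation_p_value(word, docs_a, docs_b, n_perm=1000, seed=0):
    """Estimate how often a random split of the pooled documents
    yields a frequency gap at least as large as the observed one.

    Illustrative only: a frequency-based stand-in for a null model,
    not the embedding-based statistic described in the abstract.
    """
    rng = random.Random(seed)
    observed = abs(freq(word, docs_a) - freq(word, docs_b))
    pooled = list(docs_a) + list(docs_b)
    n_a = len(docs_a)
    extreme = 0
    for _ in range(n_perm):
        # Shuffling region labels simulates "no regional effect".
        rng.shuffle(pooled)
        diff = abs(freq(word, pooled[:n_a]) - freq(word, pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (extreme + 1) / (n_perm + 1)


# Toy, made-up documents (tokenized); not data from the paper.
region_a = [["freshman", "year"] for _ in range(20)]
region_b = [["fresher", "year"] for _ in range(20)]

p_regional = permutation_p_value("freshman", region_a, region_b)
p_shared = permutation_p_value("year", region_a, region_b)
```

In this toy setup a word used in only one region receives a small p-value, while a word used identically in both does not, which is the kind of distinction a null model is meant to draw between real regional variation and chance.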

Similar resources

Freshman or Fresher? Quantifying the Geographic Variation of Language in Online Social Media

In this paper we present a new computational technique to detect and analyze statistically significant geographic variation in language. While previous approaches have primarily focused on lexical variation between regions, our method identifies words that demonstrate semantic and syntactic variation as well. Our meta-analysis approach captures statistical properties of word usage across geogra...

A Review of Internet-Centered Language Assessment: Origins, Challenges, and Perspectives

This article defines the origin of internet-centered language assessment (ICLA), how ICLAs differ from traditional computer-oriented tests, and what uses and functions ICLAs have in different taxonomies of language testing. After a very short review of computer-oriented testing, ICLAs are defined and categorized into low-tech or high-tech categories. Since low-tech tests are ...

Quantifying Investment in Language Learning: Model and Questionnaire Development and Validation in the Iranian Context

The present exploratory study aimed to provide a more tangible and comprehensive picture of the construct of investment in language learning through investigating the issue from a quantitative perspective. To this end, the present researchers followed three main phases. First, a hypothesized model of investment in language learning with six components was developed for the Iranian English as a ...

Temporal and spatial variation of hardness and total dissolved solids concentration in drinking water resources of Ilam City using Geographic Information System

Background: In recent times, decreasing groundwater reserves due to over-consumption of water resources, together with an unprecedented reduction in precipitation during the past decade, have changed the volume and quality of water over time. The aim of this study was to determine the spatial and temporal variations of hardness and total dissolved solids in drinking water resource...

Survey on the Status of Persian-Language Health Services through the Internet

Background: The Internet has transformed the way people seek information and has changed users' approach to information, particularly in the health domain. In this regard, the number of Persian-language health service websites is increasing. Therefore, information about the variety of services they offer is very important. The present study was designed to describe ...

Journal:
  • CoRR

Volume: abs/1510.06786

Publication date: 2015