Two 1%s Don't Make a Whole: Comparing Simultaneous Samples from Twitter's Streaming API
Abstract
In the present work, we compare samples of tweets from the Twitter Streaming API that were constructed from different connections tracking the same popular keywords at the same time. We find that tweets from the Streaming API are not sampled at random; rather, on average over 96% of the tweets seen in one sample are seen in all others. Somewhat surprisingly, however, tweets found only in a subset of samples do not significantly differ from those found in all samples in terms of user popularity or tweet structure. We conclude they are likely the result of a technical artifact rather than any systematic bias. Practically, our results show that for those wanting more data, an infinite number of Streaming API samples are necessary to collect “most” of the tweets containing a popular keyword. Additionally, it is likely that any findings from one sample from the Streaming API will hold for all samples that could have been taken.
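The overlap comparison described in the abstract can be illustrated with a minimal sketch. It assumes each Streaming API connection has already been reduced to a set of tweet IDs; the function name `overlap_stats` and the toy IDs are hypothetical and are not taken from the paper.

```python
from collections import Counter


def overlap_stats(samples):
    """Compare simultaneous samples, each given as a set of tweet IDs.

    Returns (frac_in_all, by_n):
      frac_in_all[i] is the share of sample i's tweets found in every sample;
      by_n[k] is the number of distinct tweets seen in exactly k samples.
    """
    # Tweets present in every sample.
    common = set.intersection(*samples)
    frac_in_all = [len(common) / len(s) if s else 0.0 for s in samples]

    # For each tweet ID, count how many samples contain it.
    seen_count = Counter(tid for s in samples for tid in s)
    by_n = dict(Counter(seen_count.values()))
    return frac_in_all, by_n


# Toy data (made up for illustration): three connections, same keyword.
samples = [{1, 2, 3, 4, 5}, {1, 2, 3, 4, 6}, {1, 2, 3, 4, 5}]
frac_in_all, by_n = overlap_stats(samples)
print(frac_in_all)
print(by_n)
```

In this toy case four of the five tweets in each sample appear in all three samples; the tweets seen in only one or two samples are the ones the abstract compares against the rest.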
Similar resources
Is the Sample Good Enough? Comparing Data from Twitter's Streaming API with Twitter's Firehose
Twitter is a social media giant famous for the exchange of short, 140-character messages called “tweets”. In the scientific community, the microblogging site is known for openness in sharing its data. It provides a glance into its millions of users and billions of tweets through a “Streaming API” which provides a sample of all tweets matching some parameters preset by the API user. The API serv...
Design and Test of the Real-time Text mining dashboard for Twitter
One of today's major research trends in the field of information systems is the discovery of implicit knowledge hidden in datasets that are being produced at high speed, in large volumes, and in a wide variety of formats. Data with such features are called big data. Extracting, processing, and visualizing this huge amount of data has become one of the concerns of data science scholar...
Methods for Coding Tobacco-Related Twitter Data: A Systematic Review
BACKGROUND As Twitter has grown in popularity to 313 million monthly active users, researchers have increasingly been using it as a data source for tobacco-related research. OBJECTIVE The objective of this systematic review was to assess the methodological approaches of categorically coded tobacco Twitter data and make recommendations for future studies. METHODS Data sources included PsycIN...
DyVSoR: dynamic malware detection based on extracting patterns from value sets of registers
To control the exponential growth of malware files, security analysts pursue dynamic approaches that automatically identify and analyze malicious software samples. The obfuscation and polymorphism employed by malware make it difficult for signature-based systems to detect sophisticated malware files. Dynamic analysis of run-time behavior provides a better technique for identifying the threat. In t...
The comparison of PCR technique and API-20E kit with the conventional biomedical methods for the identification of Salmonella species in laboratory
Abstract Background and objectives: Salmonella is one of the most important agents of gastrointestinal infection and diarrhea in our country. Misdiagnosis of these bacteria leads to treatment failure. The aim of this study was to compare PCR, the API-20E kit, and conventional biochemical tests for the identification of Salmonella. Material and Methods: In this study 470 s...
Publication date: 2014