How Many Samples Required in Big Data Collection: A Differential Message Importance Measure

Authors

  • Shanyun Liu
  • Rui She
  • Pingyi Fan
Abstract

Information collection is a fundamental problem in big data, where the size of the sampling set plays a very important role. This work studies the information collection process by taking message importance into account. By analogy with differential entropy, we define the differential message importance measure (DMIM) as a measure of message importance for a continuous random variable. It is proved that the change in DMIM can describe the gap between the distribution of a set of sample values and a theoretical distribution. In fact, the deviation of DMIM is equivalent to the Kolmogorov-Smirnov statistic, but it offers a new way to characterize distribution goodness-of-fit. Numerical results show some basic properties of DMIM and the accuracy of the proposed approximate values. Furthermore, the empirical distribution approaches the real distribution as the DMIM deviation decreases, which helps in selecting suitable sampling points in practical systems.

Keywords: Differential message importance measure, Big Data, Kolmogorov-Smirnov test, Goodness of fit.
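As a rough illustration of the goodness-of-fit idea the abstract refers to, the sketch below computes the classical one-sample Kolmogorov-Smirnov statistic, that is, the largest gap between the empirical CDF of a sample and a theoretical CDF. It does not implement DMIM, whose definition is not given on this page. The function names `normal_cdf` and `ks_statistic`, the choice of a standard normal reference distribution, and the sample sizes are all assumptions made for the example.

```python
import math
import random


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution (assumed reference distribution)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(samples, cdf):
    """One-sample Kolmogorov-Smirnov statistic: sup |F_n(x) - F(x)|.

    The empirical CDF F_n jumps at each sorted sample, so the supremum
    is reached just before or just after one of those jumps.
    """
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Gap above the jump, and gap below the jump.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for n in (100, 1000, 10000):
        data = [rng.gauss(0.0, 1.0) for _ in range(n)]
        print(n, round(ks_statistic(data, normal_cdf), 4))
```

Running the script prints the statistic for growing sample sizes. For data that really come from the reference distribution, the statistic typically shrinks roughly like one over the square root of the sample size, which is the kind of behaviour a sample-size criterion can exploit.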

Similar resources

Differential Message Importance Measure: A New Approach to the Required Sampling Number in Big Data Structure Characterization

Data collection is a fundamental problem in the scenario of big data, where the size of sampling sets plays a very important role, especially in the characterization of data structure. This paper considers the information collection process by taking message importance into account, and gives a distribution-free criterion to determine how many samples are required in big data structure characte...

Non-parametric Message Important Measure: Storage Code Design and Transmission Planning for Big Data

Storage and transmission in big data are discussed in this paper, where message importance is taken into account. Similar to Shannon Entropy and Renyi Entropy, we define non-parametric message important measure (NMIM) as a measure for the message importance in the scenario of big data, which can characterize the uncertainty of random events. It is proved that the proposed NMIM can sufficiently ...

State Variation Mining: On Information Divergence with Message Importance in Big Data

Information transfer which reveals the state variation of variables can play a vital role in big data analytics and processing. In fact, the measure for information transfer can reflect the system change from the statistics by using the variable distributions, similar to KL divergence and Renyi divergence. Furthermore, in terms of the information transfer in big data, small probability events d...

A Fuzzy TOPSIS Approach for Big Data Analytics Platform Selection

Big data sizes are constantly increasing. Big data analytics is where advanced analytic techniques are applied on big data sets. Analytics based on large data samples reveals and leverages business change. The popularity of big data analytics platforms, which are often available as open-source, has not remained unnoticed by big companies. Google uses MapReduce for PageRank and inverted indexes....

Decentralized Multigrid for In-situ Big Data Computing

Modern seismic sensors are capable of recording high precision vibration data continuously for several months. Seismic raw data consists of information regarding earthquake’s origin time, location, wave velocity, etc. Currently, these high volume data are gathered manually from each station for analysis. This process restricts us from obtaining high-resolution images in real-time. A new in-netw...

Journal:
  • CoRR

Volume abs/1801.04063

Publication date 2018