A preliminary study on similarity-preserving digital book identifiers

Authors

Klemo Vladimir

Marin Silic

Nenad Romic

Goran Delac

Sinisa Srbljic

Abstract

Due to the proliferation of digital publishing, e-book catalogs are abundant but noisy and unstructured. Tools for the digital librarian rely on ISBNs, metadata embedded in digital files (for which there is no accepted standard), and cryptographic hash functions to identify coderivative or near-duplicate content. However, the unreliability of metadata and the sensitivity of hashing to even the smallest changes prevent efficient detection of coderivative or similar digital books. The study focuses on books with many versions that differ in the number of OCR errors and contain a number of sentence-length variations. Similar books are identified using small fingerprints that can be easily shared and compared. We created synthetic datasets to evaluate fingerprinting accuracy and report standard precision and recall measurements.
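The abstract does not name the fingerprinting scheme, so the following is only a minimal sketch of the general idea, assuming MinHash over character 5-grams as a stand-in technique. The function names (`shingles`, `minhash`, `similarity`) and the parameters `k` and `num_hashes` are hypothetical and are not taken from the paper. It contrasts with a cryptographic hash, where a single changed character yields an unrelated digest.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-grams of the lowercased, whitespace-normalized text."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + k] for i in range(max(1, len(norm) - k + 1))}


def minhash(text, num_hashes=64, k=5):
    """Build a fingerprint: for each seeded hash function, keep the minimum hash over all shingles.

    Assumption: MinHash is used here for illustration only; the paper's actual scheme is not stated.
    """
    grams = shingles(text, k)
    fingerprint = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # the seed selects one hash function from the family
        fingerprint.append(min(
            int.from_bytes(
                hashlib.blake2b(salt + g.encode("utf-8"), digest_size=8).digest(), "big"
            )
            for g in grams
        ))
    return fingerprint


def similarity(fp_a, fp_b):
    """Fraction of matching positions, which estimates the Jaccard similarity of the shingle sets."""
    matches = sum(a == b for a, b in zip(fp_a, fp_b))
    return matches / len(fp_a)
```

With 64 hash values of 8 bytes each, a fingerprint in this sketch occupies 512 bytes regardless of book length, which is what makes it cheap to share and compare. Two versions of a text that differ by a few OCR errors share most of their shingles and therefore most fingerprint positions, whereas their SHA-256 digests would have nothing in common.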

Similar resources

An Empirical Comparison of Approaches to Approximate String Matching in Private Record Linkage

Due to the frequency of spelling and typographical errors in practical applications, record linkage algorithms have to use string similarity functions. In many legal contexts, identifiers such as names have to be encrypted before a record linkage can be attempted. Therefore, algorithms for computing string similarity functions with encrypted identifiers are essential for approximating string ma...

Protocols and Systems for Privacy Preserving Protection of Digital Identity

Bhargav-Spantzel, Abhilasha Ph.D., Purdue University, December, 2007. Protocols and Systems for Privacy Preserving Protection of Digital Identity. Major Professor: Elisa Bertino. To support emerging online activities within the digital information infrastructure, such as commerce, healthcare, entertainment and scientific collaboration, it is increasingly important to verify and protect the digi...

Looking for the Best Historical Window for Assessing Semantic Similarity Using Human Literature

We describe the way to get benefit from broad cultural trends through the quantitative analysis of a vast digital book collection representing the digested history of humanity. Our research work has revealed that appropriately comparing the occurrence patterns of words in some periods of human literature can help us to accurately determine the semantic similarity between these words by means of...

Unsupervised data linking using a genetic algorithm

As commonly accepted identifiers for data instances in semantic datasets (such as ISBN codes or DOI identifiers) are often not available, discovering links between overlapping datasets on the Web is generally realised through the use of fuzzy similarity measures. Configuring such measures, i.e. deciding which similarity function to apply to which data properties with which parameters, is often ...

DOI: The "Big Brother" in the dissemination of scientific documentation.

Rapid growth in the availability and use of digital documents has prompted the development of instruments to handle them. A most important example of these instruments are digital identifiers, which provide a codification system that allows digital items, usually up to the level of a computer file, to be singled out and located. Digital identifiers make up standardized global systems applied to...

Publication date: 2015