A preliminary study on similarity-preserving digital book identifiers
Abstract
Due to the proliferation of digital publishing, e-book catalogs are abundant but noisy and unstructured. Tools for the digital librarian rely on ISBNs, metadata embedded in digital files (for which there is no accepted standard), and cryptographic hash functions to identify coderivative or near-duplicate content. However, the unreliability of metadata and the sensitivity of hashing to even the smallest changes prevent efficient detection of coderivative or similar digital books. The study focuses on books that exist in many versions, which differ in the number of OCR errors and contain a number of sentence-length variations. Similar books are identified using small fingerprints that can be easily shared and compared. We created synthetic datasets to evaluate fingerprinting accuracy and report standard precision and recall measurements.
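The contrast the abstract draws between cryptographic hashes and similarity-preserving fingerprints can be illustrated with a small sketch. The abstract does not name the fingerprinting scheme used in the study, so the example below swaps in SimHash over word shingles purely as an illustration. The function names (`shingles`, `simhash`, `hamming`, `sha256_hex`), the 64-bit width, and the shingle size `k=3` are all assumptions made for this sketch, not details from the paper.

```python
import hashlib

BITS = 64  # fingerprint width in bits (an assumption for this sketch)


def shingles(text, k=3):
    """Return the set of overlapping k-word shingles in a text."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def simhash(text, k=3):
    """Compute a 64-bit SimHash: every shingle votes on every bit."""
    votes = [0] * BITS
    for shingle in shingles(text, k):
        digest = hashlib.sha256(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for bit in range(BITS):
            votes[bit] += 1 if (value >> bit) & 1 else -1
    # A bit is set in the fingerprint when the majority of shingles set it.
    return sum(1 << bit for bit in range(BITS) if votes[bit] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def sha256_hex(text):
    """Cryptographic hash of the whole text, for contrast with the fingerprint."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    clean = "it was the best of times it was the worst of times " * 5
    noisy = clean.replace("worst", "w0rst", 1)  # one simulated OCR error
    print("SHA-256 equal:", sha256_hex(clean) == sha256_hex(noisy))
    print("SimHash Hamming distance:", hamming(simhash(clean), simhash(noisy)))
```

In a setup like this, two versions of a book would be treated as coderivative when the Hamming distance between their fingerprints falls below a threshold. That threshold is a tuning parameter, and a labeled dataset with precision and recall measurements, as described in the abstract, is one way to choose it.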
Similar resources
An Empirical Comparison of Approaches to Approximate String Matching in Private Record Linkage
Due to the frequency of spelling and typographical errors in practical applications, record linkage algorithms have to use string similarity functions. In many legal contexts, identifiers such as names have to be encrypted before a record linkage can be attempted. Therefore, algorithms for computing string similarity functions with encrypted identifiers are essential for approximating string ma...
Protocols and Systems for Privacy Preserving Protection of Digital Identity
Bhargav-Spantzel, Abhilasha Ph.D., Purdue University, December, 2007. Protocols and Systems for Privacy Preserving Protection of Digital Identity. Major Professor: Elisa Bertino. To support emerging online activities within the digital information infrastructure, such as commerce, healthcare, entertainment and scientific collaboration, it is increasingly important to verify and protect the digi...
Looking for the Best Historical Window for Assessing Semantic Similarity Using Human Literature
We describe the way to get benefit from broad cultural trends through the quantitative analysis of a vast digital book collection representing the digested history of humanity. Our research work has revealed that appropriately comparing the occurrence patterns of words in some periods of human literature can help us to accurately determine the semantic similarity between these words by means of...
Unsupervised data linking using a genetic algorithm
As commonly accepted identifiers for data instances in semantic datasets (such as ISBN codes or DOI identifiers) are often not available, discovering links between overlapping datasets on the Web is generally realised through the use of fuzzy similarity measures. Configuring such measures, i.e. deciding which similarity function to apply to which data properties with which parameters, is often ...
DOI: The "Big Brother" in the dissemination of scientific documentation.
Rapid growth in the availability and use of digital documents has prompted the development of instruments to handle them. A most important example of these instruments are digital identifiers, which provide a codification system that allows digital items, usually up to the level of a computer file, to be singled out and located. Digital identifiers make up standardized global systems applied to...