Organizing News Archives by Near-Duplicate Copy Detection in Digital Libraries
نویسندگان
چکیده
There are huge numbers of documents in digital libraries. How to effectively organize these documents so that humans can easily browse or reference is a challenging task. Existing classification methods and chronological or geographical ordering only provide partial views of the news articles. The relationships among news articles might not be easily grasped. In this paper, we propose a near-duplicate copy detection approach to organizing news archives in digital libraries. Conventional copy detection methods use word-level features which could be time-consuming and not robust to term substitutions. In this paper, we propose a sentence-level statistics-based approach to detect nearduplicate documents, which is language independent, simple but effective. It’s orthogonal to and can be used to complement word-based approaches. Also it’s insensitive to actual page layout of articles. The experimental results showed the high efficiency and good accuracy of the proposed approach in detecting near-duplicates in news archives.
منابع مشابه
News Topic Tracking and Re-ranking with Query Expansion Based on Near-Duplicate Detection
Increase of digital storage capacity enabled the creation of large-scale news video archives. To make full use of the archive, it is necessary to grasp the development and dependencies of news stories. Considering this problem, we investigate tracking and re-ranking methodologies of news stories. The archive used as a test-bed consists of more than 30,000 news stories. This paper proposes a nov...
متن کاملReassembling Multilingual Temporal News Datasets with Incomplete Information
Institutional investors are building increasingly more sophisticated algorithmic trading engines that account for textual as well as numerical information. To train these engines they need large datasets of information with highly accurate timestamps that cover long periods with differing trading conditions. Thus, the demand for temporal news datasets beyond the point where full archives are av...
متن کاملDetection of Copy-Move Forgery in Digital Images Using Scale Invariant Feature Transform Algorithm and the Spearman Relationship
Increased popularity of digital media and image editing software has led to the spread of multimedia content forgery for various purposes. Undoubtedly, law and forensic medicine experts require trustworthy and non-forged images to enforce rights. Copy-move forgery is the most common type of manipulation of digital images. Copy-move forgery is used to hide an area of the image or to repeat a por...
متن کاملA Systematic Review of Data Mining Applications in Digital Libraries
Purpose: Study aimed to identify the applications of data mining in the provision of services, collection and management of digital libraries. Methodology: This is an applied study in terms of purpose and in terms of method is qualitative research that have been done by systematic review method. For this purpose, articles have been obtained by searching databases of Springer, Emerald, ProQuest,...
متن کاملKohonen Self Organizing for Automatic Identification of Cartographic Objects
Automatic identification and localization of cartographic objects in aerial and satellite images have gained increasing attention in recent years in digital photogrammetry and remote sensing. Although the automatic extraction of man made objects in essence is still an unresolved issue, the man made objects can be extracted from aerial photos and satellite images. Recently, the high-resolution s...
متن کامل