Practical Challenge of Shredded Documents: Clustering of Chinese Homologous Pieces

Authors

  • Nan Xing
  • Jianqi Zhang
  • Furong Cao
  • Pengfei Liu
Abstract

When a shredded document consists of numerous mixed pieces, recovery can be made easier by clustering, that is, by grouping the pieces that originally belonged to the same page. Restoring homologous shredded documents (pieces from different pages of the same file) is a frequent problem, and because these pieces have nearly indistinguishable visual characteristics, grouping them is extremely difficult. Because homologous pieces are ubiquitous, research on clustering them has important practical significance for document recovery. Given the wide use of Chinese and the large demand for recovering shredded Chinese documents, our research focuses on Chinese homologous pieces. In this paper, we propose a method for completely clustering Chinese homologous pieces, in which the distribution features of the characters in the pieces and the document layout are used to correlate adjacent pieces and cluster them in different areas of a document. The experimental results show that the proposed method clusters real pieces well: on the dataset containing 10 page documents (462 pieces in total), its average accuracy is 97.19%.
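
The abstract does not give the details of the method, so the following is only a toy sketch of the general idea of grouping pieces by where their text falls, not the authors' algorithm. It assumes each piece is a small binary image (1 = ink), uses the per-row ink count as a stand-in for "character distribution", and groups pieces greedily by cosine similarity. The names `row_profile`, `cosine`, `cluster_pieces`, and the threshold value are all hypothetical.

```python
from math import sqrt


def row_profile(piece):
    """Ink count per pixel row: a crude proxy for where text lines fall."""
    return [sum(row) for row in piece]


def cosine(a, b):
    """Cosine similarity of two equal-length profiles (0.0 if either is empty of ink)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_pieces(pieces, threshold=0.8):
    """Greedy grouping: a piece joins the first cluster whose representative
    profile is similar enough, otherwise it starts a new cluster.
    The threshold is an assumed value chosen for the toy data below."""
    clusters = []  # list of (representative_profile, member_indices)
    for index, piece in enumerate(pieces):
        profile = row_profile(piece)
        for representative, members in clusters:
            if cosine(representative, profile) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((profile, [index]))
    return [members for _, members in clusters]


# Toy pieces: two with text on rows 0 and 2, one with text on rows 1 and 3.
piece_a = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
piece_b = [[0, 0, 0], [1, 1, 1], [0, 0, 0], [1, 1, 1]]
piece_c = [[1, 0, 1], [0, 0, 0], [1, 1, 0], [0, 0, 0]]

groups = cluster_pieces([piece_a, piece_b, piece_c])
print(groups)
```

Real homologous pieces share fonts and line spacing, which is why the paper describes them as nearly indistinguishable; a single row profile like this would not separate them, and the sketch only illustrates the notion of comparing layout features between pieces.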

Similar resources

Feature extraction and clustering for the computer-aided reconstruction of strip-cut shredded documents

Abstract. We propose a solution for the computer-aided reconstruction of strip-cut shredded documents. First of all, the visual content of the strips is automatically extracted and represented by a number of numerical features. Usually, the pieces of different pages have been mixed. A grouping of the strips belonging to a same page is thus realized by means of a clustering operator, to ease the suces...

Reconstructing shredded documents through feature matching.

We describe a procedure for reconstructing documents that have been shredded by hand, a problem that often arises in forensics. The proposed method first applies a polygonal approximation in order to reduce the complexity of the boundaries and then extracts relevant features of the polygon to carry out the local reconstruction. In this way, the overall complexity can be dramatically reduced bec...

Semi-Automatic Reconstruction of Cross-Cut Shredded Documents

We propose a new approach for cross-cut shredded document reconstruction and evaluate it on the DARPA Shredder Challenge dataset. We begin by pre-processing chads. A set of costs based on shape (gaps, overlaps, edge similarity), graphical content (ruling line alignment, text line alignment), and semantic content (character and letter combinations) is calculated and used to rank putative chad ma...

Enhancing a Genetic Algorithm with a Solution Archive to Reconstruct Cross Cut Shredded Text Documents

In this work the concept of a trie-based complete solution archive in combination with a genetic algorithm is applied to the Reconstruction of Cross-Cut Shredded Text Documents (RCCSTD) problem. This archive is able to detect and subsequently convert duplicates into new yet unvisited solutions. Cross-cut shredded documents are documents that are cut into rectangular pieces of equal size and sha...

An alternative clustering approach for reconstructing cross cut shredded text documents

In this paper, we propose a clustering approach for solving the problem of reconstructing cross-cut shredded documents. This problem is important in the field of forensic science. Unlike other clustering approaches which are applied as a preprocessing step before the actual reconstruction algorithms, our clustering approach is part of the reconstruction process itself. We define a new cost func...

Publication date: 2017