Index Compression through Document Reordering
Authors
Abstract
An important concern in the design of search engines is the construction of an inverted index. An inverted index, also called a concordance, contains a list of documents (or posting list) for every possible search term. These posting lists are usually compressed with difference coding. Difference coding yields the best compression when the lists to be coded have high locality. Coding methods have been designed to specifically take advantage of locality in inverted indices. Here, we describe an algorithm to permute the document numbers so as to create locality in an inverted index. This is done by clustering the documents. Our algorithm, when applied to the TREC ad hoc database (disks 4 and 5), improves the performance of the best difference coding algorithm we found by fourteen percent. The improvement increases as the size of the index increases, so we expect that greater improvements would be possible on larger datasets.
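To illustrate why locality helps difference coding, here is a minimal sketch in Python. It is not the paper's algorithm: the abstract names neither the coding methods nor the clustering procedure, so the sketch assumes a simple variable-byte code (7 payload bits per byte) and a hand-written renumbering map `perm`. The helper names and the toy posting list are all hypothetical.

```python
from typing import Dict, List


def gaps(postings: List[int]) -> List[int]:
    """Difference-code a sorted posting list: first ID, then successive gaps."""
    out, prev = [], 0
    for doc_id in postings:
        out.append(doc_id - prev)
        prev = doc_id
    return out


def vbyte_len(n: int) -> int:
    """Bytes needed for n under a 7-bits-per-byte variable-byte code (assumed coder)."""
    length = 1
    while n >= 128:
        n >>= 7
        length += 1
    return length


def coded_size(postings: List[int]) -> int:
    """Total bytes to store the gap-coded posting list."""
    return sum(vbyte_len(g) for g in gaps(postings))


def renumber(postings: List[int], perm: Dict[int, int]) -> List[int]:
    """Apply a document-number permutation and re-sort the posting list."""
    return sorted(perm[d] for d in postings)


# Hypothetical posting list for one term: the documents are spread far apart.
scattered = [3, 1000, 20000, 41000, 77000]

# Hypothetical renumbering that gives these similar documents adjacent IDs,
# standing in for the output of a clustering step.
perm = {3: 500, 1000: 501, 20000: 502, 41000: 503, 77000: 504}
clustered = renumber(scattered, perm)

print(gaps(scattered), coded_size(scattered))
print(gaps(clustered), coded_size(clustered))
```

With this toy data the renumbered list has gaps of 1 after the first entry, so every gap after the first fits in a single byte. In a real index the permutation must be applied consistently to every posting list, and whether the total size shrinks depends on how well the clustering groups documents that share terms.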
Similar resources
Evaluation of Some Reordering Techniques for Image VQ Index Compression
Frequently, it is observed that the sequence of indexes generated by a vector quantizer (VQ) contains a high degree of correlation, and, therefore, can be further compressed using lossless data compression techniques. In this paper, we address the problem of codebook reordering regarding the compression of the image of VQ indexes by general purpose lossless image coding methods, such as JPEG-LS...
RFC 4224 ROHC over Reordering Channels
RObust Header Compression (ROHC), RFC 3095, defines a framework for header compression, along with a number of compression protocols (profiles). One operating assumption for the profiles defined in RFC 3095 is that the channel between compressor and decompressor is required to maintain packet ordering. This document discusses aspects of using ROHC over channels that can reorder packets. It prov...
Compression Schemes with Data Reordering for Ordered Data
Although there have been many compression schemes for reducing data effectively, most schemes do not consider the reordering of data. In the case of unordered data, if the users change the data order in a given data set, the compression ratio may be improved compared to the original compression before reordering data. However, in the case of ordered data, the users need a mapping table that map...
Enhanced Compressed RTP (CRTP) for Links with High Delay, Packet Loss and Reordering
Status of this Memo: This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protoc...
JPEG 2000 coding of color-quantized images
The efficiency of compressing color-quantized images using general purpose lossless image coding methods depends on the degree of smoothness of the index images. A well-known and very effective approach for increasing smoothness relies on palette reordering techniques. In this paper, we show that these reordering methods may leave some room for further improvements in the compression performance...