Offline Selective Data Deduplication for Primary Storage Systems

نویسندگان

  • Sejin Park
  • Chanik Park
چکیده

Data deduplication is a technology that eliminates redundant data to save storage space. Most previous studies on data deduplication target backup storage, where the deduplication ratio and throughput are important. However, data deduplication on primary storage has recently been receiving attention; in this case, I/O latency should be considered equally with the deduplication ratio. Unfortunately, data deduplication causes high sequential-read-latency problems. When a file is created, the file system allocates physically contiguous blocks to support low sequential-read latency. However, the data deduplication process rearranges the block mapping information to eliminate duplicate blocks. Because of this rearrangement, the physical sequentiality of blocks in a file is broken. This makes a sequentialread request slower because it operates like a random-read operation. In this paper, we propose a selective data deduplication scheme for primary storage systems. A selective scheme can achieve a high deduplication ratio and a low I/O latency by applying different data-chunking methods to the files, according to their file access characteristics. In the proposed system, file accesses are characterized by recent access time and the access frequency of each file. No chunking is applied to update-intensive files since they are meaningless in terms of data deduplication. For sequential-readintensive files, we apply big chunking to preserve their sequentiality on the media. For random-read-intensive files, small chunking is used to increase the deduplication ratio. Experimental evaluation showed that the proposed method achieves a maximum of 86% of an ideal deduplication ratio and 97% of the sequential-read performance of a native file system. key words: data deduplication, selective deduplication, rank based deduplication

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

DEDIS: Exact Deduplication for Primary Distributed Storage∗

The removal of duplicate data from primary storage volumes in a cloud computing environment is increasingly desirable, as the resulting space savings contribute to the cost effectiveness of a large scale multi-tenant infrastructure. However, traditional archival and backup deduplication systems are not suited for large scale virtualized infrastructures and the I/O demanding applications there d...

متن کامل

DEDIS: Distributed Exact Deduplication for Primary Storage Infrastructures

Deduplication is now widely accepted as an efficient technique for reducing storage costs at the expense of some processing overhead, being increasingly sought in primary storage systems [7, 8] and cloud computing infrastructures holding Virtual Machine (VM) volumes [2, 1, 5]. Besides a large number of duplicates that can be found across static VM images [3], dynamic general purpose data from V...

متن کامل

Secure and Efficient Client and Server Side Data Deduplication to Reduce Storage in Remote Cloud Computing Systems

Duplication of data in storage systems is becoming increasingly common problem. The system introduces I/O Deduplication, a storage optimization that utilizes content similarity for improving I/O performance by eliminating I/O operations and reducing the mechanical delays during I/O operations and shares data with existing users if Deduplication found on the client or server side. I/O Deduplicat...

متن کامل

Distributed Exact Deduplication for Primary Storage Infrastructures

Deduplication of primary storage volumes in a cloud computing environment is increasingly desirable, as the resulting space savings contribute to the cost effectiveness of a large scale multi-tenant infrastructure. However, traditional archival and backup deduplication systems impose prohibitive overhead for latency-sensitive applications deployed at these infrastructures while, current primary...

متن کامل

Storage Deduplication and Management for Application Testing over a Virtual Network Testbed

With the virtual machine technologies, Virtual Ad hoc Network (VAN) testbed was designed to evaluate functional correctness and communication performance of Mobile Ad hoc Network (MANET) applications. When VAN is used for large-scale testing that requires hundreds of virtual machines, storage redundancy becomes an issue. Although Content Addressable Storage (CAS) techniques were designed to add...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • IEICE Transactions

دوره 99-D  شماره 

صفحات  -

تاریخ انتشار 2016