Data Deduplication and Chunking
Nowadays, the demand for data storage capacity is increasing drastically. Because of this growing
demand, the computing community is moving towards cloud storage, where data security and cost
are important challenges. A duplicate file not only wastes storage, it also increases access time,
so detecting and removing duplicate data is an essential task. Data deduplication, an efficient
approach to data reduction, has gained increasing attention and popularity in large-scale storage
systems. It eliminates redundant data at the file or subfile level and identifies duplicate content
by its cryptographically secure hash signature. Detecting duplicates is tricky because duplicate
files neither share a common key nor contain errors that would mark them as copies. There are two
main approaches to identifying and removing redundant data:
• at the file level
• at the chunk level.
Data deduplication is a technique to prevent the storage of redundant data on storage devices.
It is gaining much attention from researchers because it is an efficient approach to data
reduction. Deduplication identifies duplicate content at the chunk level using hash functions and
eliminates the redundant chunks. According to a Microsoft study on deduplication, about 50% of
the data in their production primary storage and about 85% in secondary (backup) storage is
redundant and could be removed by deduplication technology.
Chunking is the process of splitting a file into smaller pieces called chunks. In some
applications, such as remote data compression, data synchronization, and data deduplication,
chunking is important because it determines the duplicate-detection performance of the system.
Deduplication can be performed at chunk level or file level.
Chunk-level deduplication is preferred over file-level deduplication because it identifies and
eliminates redundancy at a finer granularity. Chunking is one of the main challenges in a
deduplication system, and efficient chunking is one of the key factors that determine overall
deduplication performance.
The chunking step also has a significant impact on the deduplication ratio and performance.
An effective chunking algorithm should satisfy two properties: chunks must be created in a
content-defined manner, and every chunk size should have an equal probability of being selected.
A chunk-level deduplication scheme divides the input stream into chunks and produces a hash value
(fingerprint) that uniquely identifies each chunk. A chunking algorithm can divide the input data
stream into chunks of either fixed size or variable size. If a chunk's fingerprint matches that of
a previously stored chunk, the duplicate chunk is discarded; otherwise the chunk is stored.
In short, data deduplication is the process of eliminating redundant data by tracking and removing
identical chunks within a storage unit, which makes storage more efficient.
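
To make the process concrete, the following is a minimal sketch of chunk-level deduplication in
Python. The function name, the in-memory store, and the example chunks are illustrative
assumptions, not part of any particular system: each chunk is fingerprinted with SHA-256, and a
chunk is stored only if its fingerprint has not been seen before.

import hashlib

def deduplicate(chunks):
    """Store each unique chunk once; return the chunk recipe and the store."""
    store = {}    # fingerprint -> chunk bytes (illustrative in-memory store)
    recipe = []   # ordered fingerprints needed to reconstruct the stream
    for chunk in chunks:
        fp = hashlib.sha256(chunk).hexdigest()  # cryptographically secure fingerprint
        if fp not in store:                     # new content: keep exactly one copy
            store[fp] = chunk
        recipe.append(fp)                       # a duplicate only adds a reference
    return recipe, store

# Example: three chunks, two of which are identical.
recipe, store = deduplicate([b"hello world", b"other data", b"hello world"])
print(len(recipe), "chunks referenced,", len(store), "stored")   # -> 3 chunks referenced, 2 stored
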
Data deduplication strategies are classified into two categories: file-level deduplication and
block-level (chunk-level) deduplication. In file-level deduplication, two files are considered
identical if their hash values are the same; only one copy of each file is kept, and duplicate or
redundant copies are removed. This type of deduplication is also known as single instance storage
(SIS). In block-level (chunk-level) deduplication, each file is fragmented into blocks (chunks),
and duplicate blocks across files are detected and removed, so only one copy of each block is
maintained. It saves more space than SIS. Block-level deduplication may be further divided into
fixed-length and variable-length deduplication: in fixed-length deduplication the size of every
block is constant, whereas in variable-length deduplication block sizes vary.
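
As an illustration of file-level deduplication (SIS), the sketch below hashes whole files and keeps
only one physical copy per distinct hash. The helper name and the content-addressed directory
layout are hypothetical; a real system would also handle metadata, reference counting, and hash
collisions.

import hashlib, os, shutil

def store_file_sis(path, store_dir):
    """File-level deduplication: keep one copy per distinct whole-file SHA-256 hash."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):  # hash the file in 1 MiB pieces
            h.update(block)
    digest = h.hexdigest()
    stored_copy = os.path.join(store_dir, digest)   # content-addressed file name
    if not os.path.exists(stored_copy):             # first time this content is seen
        shutil.copyfile(path, stored_copy)          # store the single physical copy
    return digest                                   # later duplicates just reuse this digest
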
Chunking-based Data Deduplication
In this method, each file is first divided into a sequence of mutually exclusive blocks, called
chunks. Each chunk is a contiguous sequence of bytes from the file and is processed independently.
The main approaches to generating chunks are discussed below.
Static Chunking
The static or fixed-size chunking mechanism divides a file into chunks of the same size. It is the
fastest of the chunking algorithms at detecting duplicate blocks, but its deduplication
performance is not satisfactory, and it suffers from a serious problem known as boundary shift.
The boundary-shifting problem arises when the data is modified: if a user inserts or deletes even
a single byte, static chunking generates different fingerprints for all subsequent chunks, even
though most of the data in the file is unchanged.
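
A minimal fixed-size chunker, and a small demonstration of the boundary-shift problem, might look
like this; the 4-byte chunk size is chosen only to make the shift visible.

def fixed_chunks(data, size):
    """Split data into consecutive chunks of a fixed size."""
    return [data[i:i + size] for i in range(0, len(data), size)]

original = b"abcdefghijkl"
modified = b"Xabcdefghijkl"          # one byte inserted at the front

print(fixed_chunks(original, 4))     # [b'abcd', b'efgh', b'ijkl']
print(fixed_chunks(modified, 4))     # [b'Xabc', b'defg', b'hijk', b'l']
# Every chunk after the insertion point changes, so none of the old
# fingerprints match even though almost all of the data is unchanged.
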
Content-Defined Chunking
To overcome the boundary-shift problem, content-defined chunking (CDC) is used. Instead of setting
boundaries at multiples of a fixed chunk size, CDC increases the amount of duplicate data that a
deduplication system can find by defining breakpoints wherever a specific condition on the content
becomes true. In simple words, the CDC approach determines chunk boundaries based on the content
of the file. Most deduplication systems use a CDC algorithm; it is also known as variable-size
chunking. CDC typically uses the Rabin fingerprinting algorithm as a rolling hash to locate
boundaries. It takes more processing time than fixed-size chunking but has better deduplication
efficiency.
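
The sketch below illustrates the idea of content-defined boundaries with a simple byte-wise
rolling hash. It is not the full Rabin fingerprint, and the mask, minimum, and maximum sizes are
illustrative choices: a boundary is declared wherever the low bits of the rolling hash are zero,
so the same content produces the same boundaries regardless of where it sits in the stream.

import random

random.seed(0)
GEAR = [random.getrandbits(32) for _ in range(256)]    # one random value per possible byte

def cdc_chunks(data, mask=0x1FFF, min_size=64, max_size=8192):
    """Content-defined chunking with a simple Gear-style rolling hash."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF        # update the rolling hash
        length = i - start + 1
        if length < min_size:
            continue                                    # enforce a minimum chunk size
        if (h & mask) == 0 or length >= max_size:       # content-defined breakpoint
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])                     # final partial chunk
    return chunks

Because the boundaries depend only on the content near them, inserting or deleting a byte changes
at most the chunks around the edit; the remaining chunks keep their old boundaries and fingerprints.
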
Frequency-Based Chunking Algorithm
The frequency-based chunking (FBC) approach is a hybrid chunking algorithm that divides the byte
stream according to chunk frequency. It first identifies high-frequency fixed-size chunks using
Bloom filters: each candidate chunk is looked up in a Bloom filter; if the chunk is already
present, it is passed to a parallel filter, otherwise it is recorded in one of the Bloom filters.
The parallel filter then counts the frequency of each chunk, and the frequently occurring chunks
are used to place chunk boundaries. A simplified sketch of the frequency-identification step
follows.
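
In this sketch a Python Counter stands in for the pair of Bloom filters described above (the Bloom
filters only save memory; the logic is the same), and the window size and frequency threshold are
illustrative assumptions.

from collections import Counter

def frequent_fixed_chunks(data, size=256, threshold=4):
    """Return the fixed-size chunks that occur at least `threshold` times.

    FBC uses two Bloom filters for this step (one records first appearances,
    a parallel one catches repeats whose frequency is then counted); a Counter
    keeps this sketch simple at the cost of extra memory.
    """
    counts = Counter(data[i:i + size] for i in range(0, len(data) - size + 1, size))
    return {chunk for chunk, n in counts.items() if n >= threshold}

The frequent chunks returned here would then be used as cut points, so that recurring content ends
up in chunks of its own and deduplicates well.
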