International Research Journal of Computer Science (IRJCS) ISSN: 2393-9842
Issue 06, Volume 4 (June 2017) www.irjcs.com
Duplicate File Analyzer using N-layer Hash and
Hash Table
Siladitya Mukherjee
Assistant Professor, Department of Computer Science,
St. Xavier’s College (Autonomous), Kolkata, India
siladitya.mukherjee@sxccal.edu
Pramod George Jose
M.Sc. Graduate, Department of Computer Science,
St. Xavier’s College (Autonomous), Kolkata, India
pramod.the.programmer@gmail.com
Soumick Chatterjee
M.Sc. Graduate, Department of Computer Science,
St. Xavier’s College (Autonomous), Kolkata, India
soumick@live.com
Priyanka Basak
M.Sc. Graduate, Department of Computer Science,
St. Xavier’s College (Autonomous), Kolkata, India
mailpriyankabasak20@gmail.com
Manuscript History
Number: IRJCS/RS/Vol.04/Issue06/JNCS10083
Received: 15, May 2017
Final Correction: 29, May 2017
Final Accepted: 26, May 2017
Published: June 2017
Citation: Mukherjee, S.; Jose, P. G.; Chatterjee, S. & Basak, P. (2017), 'Duplicate File Analyzer using N-layer Hash and
Hash Table', Master's thesis, St. Xavier’s College (Autonomous), Kolkata, India.
Editor: Dr. A. Arul L.S., Chief Editor, IRJCS, AM Publications, India
Copyright: ©2017 This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract— With the advancement in data storage technology, the cost per gigabyte has reduced significantly, causing users to negligently store redundant files on their systems. These may be created while taking manual backups or by improperly written programs. Often, files with exactly the same content have different file names, and files with different content may have the same name. Hence, an algorithm that identifies redundant files based on their file name and/or size is not enough. In this paper, the authors propose a novel approach in which the N-layer hash of each file is individually calculated and stored in a hash table data structure. If the N-layer hash of a file matches a hash that already exists in the hash table, that file is marked as a duplicate, which can be deleted or moved to a specific location as per the user's choice. The use of the hash table data structure helps achieve O(n) time complexity, and the use of N-layer hashes improves the accuracy of identifying redundant files. This approach can be used for a folder-specific, drive-specific or system-wide scan as required.
Index Terms— File Management, File Organization, Files Optimization, Files Structure, Secondary storage, Hash
table representations, File comparison, File signature, Hash, N-layer hash
I. INTRODUCTION
1.1 Problem definition
The storage space available on secondary memory, especially hard disks, has grown almost exponentially ever
since the first hard disk was invented by IBM in 1956 [1].
Due to this, more and more people have become negligent about what they store on their storage devices, and they often end up storing a lot of redundant files. These unnecessary files may result in greater fragmentation (since the number of files is larger), longer disk access times, longer running times for file-scanning programs such as anti-malware software, etc. This is also a major problem on Solid State Drives (SSDs), which are based on NAND-flash technology and are hence much faster than traditional Hard Disk Drives (HDDs), but are relatively costlier. Therefore, it becomes important to save as much disk space as possible.
Another major concern is associated with regular backups. The importance of taking regular backups of important files is well known. The problem is that, most of the time, files are backed up more than once. Apart from incremental backup solutions, a full backup copies all files, and a differential backup copies the files (or parts of files) that have changed since the last full backup, even if identical copies of those files already exist in a previous backup [2], thereby eating up precious storage space. This also applies when a user takes a full backup of his files before performing a clean installation of the computer's operating system. He may already have most of the files on his external media, but he copies all the files anyway and then forgets about the redundant copies. Manually going through all the files, identifying the redundant ones and then deleting or moving them is a herculean, impractical task.
1.2 Objective
Existing solutions try to identify identical files by comparing file meta-data, such as file size, file name, or some combination of these. This is not a reliable approach, as two files can have the same name and size but different content.
One approach to this problem is to perform a byte-by-byte comparison between each file and every other file in the system, but the time complexity of such an algorithm would be O(n²). To make things worse, such an algorithm would have to access the storage media heavily, further increasing the execution time. The problem is worse still when comparing multimedia files, which are generally much larger than text files. The objective is to design a system that can uniquely identify files and mark them as redundant or non-redundant in better than O(n²) time, while also using as little memory and processing power as possible.
II. PROPOSED SOLUTION
2.1 Hash functions and their utility
A hash function is a one-way mathematical function which transforms an input message of arbitrary length into a
unique hash value of fixed length in such a way that it is infeasible to compute the input message, given the hash
value. The resulting hash value is also called a message digest or checksum which serves as a “signature” of the
input message. Even a small change made to the input file or message will yield a completely different output hash; this is known as the avalanche effect [3]. In essence, a particular message will generate a unique message digest, which (for all practical purposes) can only be generated by that message or file [4]. Some examples of popular hash algorithms are the Secure Hash Algorithm (SHA) family, the RIPEMD (RACE Integrity Primitives Evaluation Message Digest) family, Message Digest 5 (MD5), Tiger, Whirlpool, etc.
The characteristic that can reliably distinguish files is the signature, or message digest, of each file. By using a hash table as the data structure to store the file signatures (hashes), a time complexity of O(n) can be achieved. These file signatures are of fixed size irrespective of the input file size - 48 bytes for the SHA-256-MD5 combination (32 bytes for SHA-256 plus 16 bytes for MD5) - which is much smaller than the original file, and they are stored in primary memory, which further reduces the execution time. For example, when comparing a movie file of, say, 2 GB, we would only have to consider 48 bytes (for the SHA-256-MD5 combination) instead of 2 GB.
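To make the idea of a fixed-size file signature concrete, the following is a minimal sketch (not the authors' implementation, which is a .NET program) using Python's hashlib; the 64 KiB read size is an arbitrary illustrative choice.

```python
# Minimal sketch: a fixed-size SHA-256 signature of an arbitrarily large file.
# The 64 KiB chunk size is an illustrative choice, not taken from the paper.
import hashlib

def sha256_signature(path: str) -> bytes:
    """Return the 32-byte SHA-256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(64 * 1024), b""):
            h.update(chunk)
    return h.digest()  # always 32 bytes, regardless of the file size
```

Whether the input is a 2 GB movie or a one-line text file, the value that has to be stored and compared is the same 32 bytes.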
2.2 Hash tables
A hash table [5] is a data structure that implements an unordered associative array [6]. An associative array is an
abstract data type which consists of a collection of key-value pairs. An associative array, to some extent, is similar
to an indexed one-dimensional array available in most programming languages. The search time complexity of an
indexed array is O(1) only if the index of the desired value is known ahead of time. Otherwise, an appropriate
search algorithm has to be used on the array.
The time complexity in such a case would depend upon the chosen search algorithm. This need for a search algorithm arises because there is no relation between an index and the value stored at that index. The insertion time complexity for an array remains O(1). Hash tables use a hash function to map a key to its corresponding value: the hash function is applied to the key to obtain an index, and the value is stored at that index in the hash table. In this way, the key used while inserting a value into the hash table can later be used to retrieve that value in O(1) time.
The key is analogous to the index in the case of an indexed array. For example, consider a situation where customer names and their locations have to be stored and retrieved. A 2D array can be used, where each element of the first column stores a customer's name and the element in the second column of the same row contains that customer's location.
Fig. 1. 2D array to store Customer Name & Location
This solution would allow insertions and deletions in O(1) time; but if we need to search for the location of a particular customer, we would first have to scan through the first column for the desired name, say using the simplest linear search algorithm. Once the customer name is matched, the corresponding customer's location can be fetched. In this case, the worst-case time complexity would be O(n). On the other hand, a hash table can be used, where the customer's name is treated as the key and the customer's location as the value. We can then use the customer's name as the key to insert the customer's location into the hash table, and during retrieval we would just need to provide the customer's name; the hash table would fetch the customer's location in O(1) time.
Fig. 2. Hash Table to store Customer Name & Location
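For concreteness, a toy sketch of this customer example is shown below; Python's built-in dict (a hash table) stands in for the data structure, and the customer names and locations are made-up examples.

```python
# Toy sketch of the customer example: Python's dict is a hash table, so
# insertion and lookup by key take O(1) expected time.
locations = {}
locations["Alice"] = "Kolkata"   # insert: key = customer name, value = location
locations["Bob"] = "Mumbai"      # (names and cities are hypothetical examples)
print(locations.get("Alice"))    # O(1) lookup by key; prints: Kolkata
```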
2.3 N-layer hashing for improved collision resistance
Hash functions map arbitrary-length messages to a fixed-size hash. The set of all such messages is much bigger than the set of possible fixed-length hashes, and this mapping of a larger input set onto a smaller output set must eventually produce collisions, as per the pigeonhole principle. Consider a hash function like SHA-256, which produces an output of 256 bits from an arbitrarily large input [7]. Since it must generate one of 2²⁵⁶ outputs for each member of a much larger set of inputs, some inputs will necessarily hash to the same output. This is called a collision.
The signature of a file should be unique, but there can be situations where two different files produce the same hash. For example, for the well-known MD5 algorithm, a collision is expected after roughly 2⁶⁴ hashes, as per the birthday paradox. To overcome this problem, two or more layers of hash algorithms can be employed; in other words, the output hashes of different hash functions, applied to the same file, are concatenated together. This improves collision resistance, as the probability of a collision occurring simultaneously in all the hash functions is very low. If hash algorithms that use the same block size are chosen, this solution can be further improved by reading each block from the input file only once and then feeding it to the respective hash functions. Thus, multiple hashes of the same file can be computed in a single pass over the input file. This does not increase the time complexity of the algorithm, which remains O(n). In this paper, the authors use the SHA-256 and MD5 combination as the file signature; both use 64-byte blocks.
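A minimal sketch of this single-pass, two-layer signature, assuming Python's hashlib, is shown below. The 64 KiB read size is an illustrative choice; hashlib accepts updates of any length, so the file only needs to be read once regardless of the chosen chunk size.

```python
# Sketch of the N-layer signature: each block is read from disk once and fed to
# both SHA-256 and MD5, and the two digests are concatenated (32 + 16 = 48 bytes).
import hashlib

def n_layer_signature(path: str, read_size: int = 64 * 1024) -> bytes:
    """Compute SHA-256 || MD5 of a file in a single pass over its contents."""
    layers = [hashlib.sha256(), hashlib.md5()]   # the N "layers" (here N = 2)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(read_size), b""):
            for h in layers:
                h.update(block)                  # same block reused for every layer
    return b"".join(h.digest() for h in layers)  # fixed 48-byte signature
```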
2.4 Brief description
The authors propose a solution to this problem that analyzes all the files in a computer system, or a part of it (such as a folder), and classifies each file as redundant (if an exact copy of the file already exists) or non-redundant, by comparing the file signatures. In its default mode, the proposed solution uses a single hash algorithm, SHA-256, to generate the file signature, and it uses a combination of SHA-256 and MD5 for improved accuracy if so specified by the user. Any other hash algorithm can be used instead of SHA-256 and MD5. A dictionary [8], [9] data structure in .NET is employed, which is a strongly typed hash table where the best-case search and insertion time complexity is O(1). The signature of each file (SHA-256 or the SHA-256-MD5 combination) is calculated and used as the key, and the path of that file is used as the value, making up a key-value pair.
The signature generated for each file is first looked up in the dictionary to check whether the file already exists; if it does, the file is classified as a redundant copy and its path is added to a list of redundant files. If the file does not exist, the signature and the path of the file are added to the dictionary as a key-value pair. At the end of the scanning process, the user can choose, from the list of redundant files, which files to delete, which to move, and which to keep as they are.
III. SOLUTION DESCRIPTION
3.1 Data flow diagram
Fig. 3. Level 0 DFD
Fig. 4. Level 1 DFD
Fig. 5. Level 2 DFD
3.2 Algorithm
Algorithm DuplicateFileAnalyzer:
1. Select the scan location, i.e., a folder, an internal or removable drive, or the entire computer.
2. Repeat until all files in the selected scan location, including all sub-folders, have been analyzed:
a. Compute the SHA-256 hash of the file.
b. Compute the MD5 hash of the file.
c. Compute any further hashes required, for a total of N different hashes.
d. Concatenate all the hashes.
e. If the concatenated hash already exists as a key in the files dictionary, add the full path of the current file
to the duplicate files list. Else, add a new key-value pair to the files dictionary, where the key is this
concatenated hash and the value is the full path of the current file.
3. For each file in the duplicate files list:
a. Select whether to delete or move the current file or to keep it as it is.
b. Perform the selected operation.
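The following is a minimal Python sketch of steps 1 and 2 of this algorithm (the authors' demo application is a .NET program, so this is an illustration rather than their code). It reuses the n_layer_signature() helper sketched in Section 2.3, and a plain dict plays the role of the .NET Dictionary.

```python
# Sketch of steps 1-2 of DuplicateFileAnalyzer: walk the scan location, compute
# the N-layer signature of every file, and collect paths whose signature repeats.
# Assumes n_layer_signature() from the Section 2.3 sketch is available.
import os

def find_duplicates(scan_root):
    files_dict = {}    # key: 48-byte signature, value: path of the first copy seen
    duplicates = []    # paths classified as redundant
    for dirpath, _dirnames, filenames in os.walk(scan_root):  # recursive scan
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            signature = n_layer_signature(full_path)
            if signature in files_dict:            # O(1) expected lookup
                duplicates.append(full_path)
            else:
                files_dict[signature] = full_path  # O(1) expected insertion
    return files_dict, duplicates
# Step 3 (delete, move or keep each path in `duplicates`) is left to the caller.
```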
3.3 Flowchart
Fig. 6. Flowchart of DuplicateFileAnalyzer Algorithm
IV. RESULTS AND DISCUSSIONS
4.1 Features
• Recursive directory file comparison.
• Use of two hashes for reducing the chances of hash collisions.
• Use of two hashes with the same block size, so that both hashes can be computed simultaneously.
• Use of dictionary data structure for O(1) lookup.
4.2 Tests and results
Consider the following folder hierarchy:
Fig. 7. File-Folder Structure which is used for Testing
In this structure, MyTest.txt, MyTest2.txt and fol2\MyTest.txt are files with unique contents. fol2\testing.mp3 and fol2\fol3\test.mp4 have the same content as MyTest.txt, but differ in their names and extensions. After this file structure is processed by the demo application built using the proposed solution, it correctly identifies testing.mp3 and test.mp4 as duplicate files, even though the file names and extensions are different.
Fig. 8. Screenshot of demo application before starting the analysis
Fig. 9. Screenshot of demo application after analyzing the test file structure
After further testing, it was confirmed that this also works correctly with other file types, as well as for selective drive scans and system-wide scans.
4.3 Time Complexity & Performance
The proposed solution can employ one or more hash functions, such as SHA-256, MD5, etc., to create an N-layer signature. It is recommended that hash functions with the same block size be used, so that each block of a file is read just once and then sent to all the hash functions. This reduces the number of disk accesses required when more than one hash function is used. The time taken for this read operation grows linearly with the number of files to be scanned and hence takes O(n) time.
The insertion and search time complexity of a hash table is O(1), as explained before, and can be ignored as it is constant. So, in effect, this solution has O(n) time complexity. Another way to increase the performance of this algorithm is to use multi-threading: multiple threads can be used to compute the signatures of different files in parallel.
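A brief sketch of the multi-threaded variant, assuming CPython, is given below; hashlib releases the global interpreter lock while hashing large buffers, so the signature computations of different files can genuinely overlap. The worker count of four is an arbitrary choice, and n_layer_signature() is the helper sketched in Section 2.3.

```python
# Sketch: compute file signatures in parallel with a thread pool. Assumes the
# n_layer_signature() helper from the Section 2.3 sketch; max_workers is arbitrary.
from concurrent.futures import ThreadPoolExecutor

def signatures_parallel(paths, max_workers=4):
    """Return a {path: signature} mapping, hashing several files concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(n_layer_signature, paths)))
```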
4.4 Applications and future prospects
• In general, this solution can be used to clean up any system of duplicate files.
• This algorithm can be made to work in real time, so that it analyzes new files added to the system, by means of copy-paste, downloads or simply saving, to check whether any of those files already exist. The hash table generated can be saved to secondary memory so that the N-layer hashes need not be computed all over again (a minimal persistence sketch is given after this list).
• If this is paired with a full or differential backup solution, then copying redundant files can be avoided, sav-
ing a lot of time and space. A custom backup script can be created where it first compares with the existing
backed up files, and then only copies those files which are new or have changed since the last backup.
• This process is not memory- or processor-intensive, so it can also be used on mobile phones or in environments which have limited memory and processing power.
• This solution can be used in online file storage services, where, if multiple users have stored the same file in their personal accounts, the server keeps only a single copy.
• This algorithm can be further extended, through image processing, to detect similar-looking photos and retain only the best one.
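As a sketch of the persistence idea mentioned in the second bullet above, the signature dictionary built during a scan can be written to disk and reloaded on the next run; the index file name "signatures.pkl" is hypothetical, and a real implementation would also have to detect files that changed after the index was saved.

```python
# Sketch: save/restore the signature dictionary (as built by find_duplicates)
# so the N-layer hashes need not be recomputed on every run.
import pickle

def save_index(files_dict, index_path="signatures.pkl"):
    with open(index_path, "wb") as f:
        pickle.dump(files_dict, f)   # signature -> path mapping written to disk

def load_index(index_path="signatures.pkl"):
    with open(index_path, "rb") as f:
        return pickle.load(f)        # reload the mapping for the next scan
```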
4.5 Drawbacks
Redundant copies are sometimes kept deliberately for backup purposes, so that files can be successfully restored after a system crash; such copies should not be deleted, and this algorithm is not recommended for those cases.
V. CONCLUSION
This paper addresses a major concern with data storage by identifying and managing duplicate files in an efficient
manner. This can be further extended with various other solutions as discussed.
REFERENCES
1. “Wikipedia – Hard Disk Drive”: https://en.wikipedia.org/wiki/Hard_disk_drive
2. “Difference between: Full, Differential, and Incremental Backup”: http://tinyurl.com/zpnnvv4
3. “Wikipedia – Avalanche Effect”: https://en.wikipedia.org/wiki/Avalanche_effect
4. Pramod George Jose, Soumick Chatterjee, Mayank Patodia, Sneha Kabra, and Asoke Nath, “Hash and Salt based Steganographic Approach with Modified LSB Encoding,” International Journal of Innovative Research in Computer and Communication Engineering, vol. 4, issue 6, pp. 10599 – 10610, June 2016.
5. “Wikipedia – Hash Table”: https://en.wikipedia.org/wiki/Hash_table
6. “Wikipedia – Associative Array”: https://en.wikipedia.org/wiki/Associative_array
7. “Wikipedia – Collision Resistance”: https://en.wikipedia.org/wiki/Collision_resistance
8. “Difference between Hashtable and Dictionary”: http://tinyurl.com/hljs2lf
9. “.NET data structures”: http://tinyurl.com/hqfauaf
Prof. Siladitya Mukherjee has more than 13 years of teaching experience, since January 2004. He has worked as a paper-setter and a moderator for Calcutta University and Jadavpur University. He is currently working as an Assistant Professor in the Department of Computer Science, St. Xavier’s College (Autonomous), Kolkata. His areas of interest include Web technologies, the Unix Operating System, Programming in Object Oriented Technologies and Genetic Algorithms. His research interests are Data Mining, Artificial Intelligence, Image Processing, Web Technologies and Genetic Algorithms.
Pramod George Jose is an M.Sc. Computer Science graduate from the Department of Computer Science, St. Xavier’s College (Autonomous), Kolkata; interested in Cryptography, Steganography, Computer Security and the low-level working of computer systems.
Soumick Chatterjee is an M.Sc. Computer Science graduate from the Department of Computer Science, St. Xavier’s College (Autonomous), Kolkata; interested in Web Technologies, Cross-platform Development, Cryptography, Steganography and Machine Intelligence, with successful experience in tech entrepreneurship.
Priyanka Basak is an M.Sc. Computer Science graduate from the Department of Computer Science, St. Xavier’s College (Autonomous), Kolkata; interested in Cryptography, Image Processing, Machine Learning and Computer Vision.