A RESEARCH WORK
ON
EFFICIENT STORAGE OF DATA IN CLOUD COMPUTING
A CASE STUDY OF GOOGLE DOCS FOR
INFORMATION SHARING BY UNIUYO ENGINEERING
STUDENTS
BY
NKWOCHA, CHINEDU SOLOMON
10/EG/CO/456
DEPARTMENT OF COMPUTER ENGINEERING
FACULTY OF ENGINEERING
UNIVERSITY OF UYO, UYO
SUBMITTED TO
DR. FRANCIS UDOH
COURSE LECTURER GRE 322
DEPARTMENT OF CHEM/PET ENGINEERING
FACULTY OF ENGINEERING
UNIVERSITY OF UYO, UYO
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS OF
GRE 322
MARCH 2014
ABSTRACT
Keeping critical data safe and accessible from several locations has become a global
preoccupation, whether this data is personal, organizational, or generated by applications. As a
consequence, we have witnessed the emergence of on-line storage services. In addition, there
is the new paradigm of Cloud Computing, which brings new ideas for building services that allow
users to store their data and run their applications in the Cloud. Through smart and efficient
management of these services’ storage, it is possible to improve the quality of service offered, as
well as to optimize the usage of the infrastructure on which the services run. This management is
even more critical and complex when the infrastructure is composed of thousands of nodes
running several virtual machines and sharing the same storage. Eliminating redundant data
in these services’ storage can simplify and enhance this management. This dissertation
presents a solution to detect and eliminate duplicated data between virtual machines that run on
the same physical host and write their virtual disks’ data to a shared storage. A prototype that
implements this solution is introduced and evaluated. Finally, a study comparing the
efficiency of two different approaches to eliminating redundant data in a personal data set is
described.
Cloud computing is a computing paradigm that has attracted many computer
users, businesses, and government agencies. It brings many advantages, especially
ubiquitous services that anyone can access through the Internet. With cloud computing,
there is no need for on-premises hardware or servers to support a company’s computer
systems, internet services, and networks. One of the core services provided by cloud
computing is data storage. Over the past decades, data storage has been recognized as one
of the main concerns of information technology. The benefits of network-based applications
have led to the transition from server-attached storage to distributed storage. Because data
security is the foundation of information security, a great deal of effort has been made in the
area of distributed storage security. This work also examines the threats and attacks that may
be launched against cloud data storage and the security mechanism proposed against them.
CHAPTER 1
INTRODUCTION
WHAT IS CLOUD COMPUTING? Key to the definition of cloud computing is the
“cloud” itself. Here, the cloud is a large group of interconnected computers, which can be
personal computers or network servers. The cloud extends beyond a single company or
entity. The applications and data served by the cloud are available to a broad group of
users, across enterprises and across platforms. Access is via the Internet: any authorized user
can reach these documents and applications from any computer connected to it.
Cloud computing moves application software and databases to large data centres,
where the handling of data may not be fully trustworthy; this leads to many new security
challenges. By utilizing a homomorphic token with distributed verification of erasure-coded
data, our scheme achieves the integration of storage correctness insurance and data error
localization, i.e., the identification of misbehaving servers. The new scheme supports secure
and efficient dynamic operations on data blocks, including data update, delete, and append.
Extensive security and performance analysis shows that the proposed scheme is highly
efficient, while restricting certain IP addresses and authenticating existing users helps to
prevent malicious data modification attacks.
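The scheme above is described only at a high level. As a rough, hypothetical illustration of the
underlying idea, the following Python sketch precomputes keyed tokens over challenged blocks
and uses a challenge-response check to verify storage correctness per server, which localizes a
misbehaving server. The names and the simplified setup are assumptions, not the actual
construction referred to in the text.

# A minimal sketch, not the actual scheme described above: it only illustrates
# how precomputed, keyed tokens can support a challenge-response check that
# verifies storage correctness per server and so localizes a misbehaving one.
# SECRET_KEY, BLOCK_SIZE, and the dict-of-servers layout are assumptions.
import hashlib
import hmac
import os
import random

SECRET_KEY = os.urandom(32)   # kept by the verifier, never given to servers
BLOCK_SIZE = 4096

def make_token(blocks, indices):
    """Precompute a keyed token over the challenged block indices."""
    mac = hmac.new(SECRET_KEY, digestmod=hashlib.sha256)
    for i in indices:
        mac.update(blocks[i])
    return mac.digest()

def verify(returned_blocks, token):
    """Recompute the token over the server's response and compare."""
    mac = hmac.new(SECRET_KEY, digestmod=hashlib.sha256)
    for b in returned_blocks:
        mac.update(b)
    return hmac.compare_digest(mac.digest(), token)

# Toy setup: three servers each hold a list of data blocks.
servers = {s: [os.urandom(BLOCK_SIZE) for _ in range(16)] for s in ("A", "B", "C")}
challenge = random.sample(range(16), 4)
tokens = {s: make_token(blocks, challenge) for s, blocks in servers.items()}

servers["B"][challenge[0]] = os.urandom(BLOCK_SIZE)   # simulate corruption on B

for s, blocks in servers.items():
    response = [blocks[i] for i in challenge]          # honest server behaviour
    if not verify(response, tokens[s]):
        print(f"server {s} failed the check (data error localized)")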
Cloud computing is a computing paradigm in which processing is performed
over the Internet and accessed through a standard web browser. Cloud computing builds on
established trends for driving the cost out of the delivery of services while increasing the
speed and agility with which services are deployed. It shortens the time from sketching out
application architecture to actual deployment. Cloud computing incorporates virtualization,
on-demand deployment, Internet delivery of services, and open source software.
The cloud computing architecture of a cloud solution is the structure of the
system: the on-premises and cloud resources, services, middleware, and software
components it comprises, their geo-location, the externally visible properties of those
components, and the relationships between them. The term also refers to the documentation
of a system’s cloud computing architecture.
BACKGROUND INFORMATION
Dependable backup services are increasingly important not only to enterprises but also to
ordinary users who want to keep their personal files safe. A traditional approach, for such
users, is to keep a copy of all their files on an external hard drive; one example of such a
system is Time Machine. For enterprises, the solution requires larger storage and a more
complex setup to back up their critical data. For some enterprises, data is so important that
several backup copies must be kept in different physical locations in order to avoid losing it
in case of natural catastrophes.
Consider an engineering student of the University of Uyo who has just finished typing a
fifty-page research project that must be submitted the next day. He walks into a business
centre to have his work printed and spiral-bound, but his file gets corrupted as it is being
opened for printing, because of a virus on the desktop computer at the business centre. He is
furious because he has no other backup of the file anywhere. His all-night labour has come
to nothing, and he is stranded because he can no longer meet the deadline for submission. If
he had typed his research work using a cloud computing service like Google Docs, he would
have had no cause to worry, as he could easily have downloaded his document in Microsoft
Word format with the click of a button. This is a common problem many engineering
students face in the University of Uyo.
Another important aspect, for both enterprises and ordinary users, is the need to
access their data remotely from different places. For this purpose, the web is a good solution,
given how easy it is to insert and retrieve information of any kind from it. This explains
the emergence and success of on-line backup services like Dropbox, Box.net, RapidShare and
Google Docs, which allow clients to keep their data safe on the web. These services are more
than just simple data archives. Some of them support other features, like collaborative work,
versioning, online editing, and synchronization of users’ data across several devices. With these
new functionalities, a storage service that can store and retrieve data efficiently becomes
necessary, something classical archival systems did not need to consider. As expected, all
these services limit the amount of data each user can back up and, therefore, clients must
pay a fee to expand these limits.
STATEMENT OF PROBLEM
Effective deduplication in a cloud computing scenario, however, raises a number of
challenges.
First, there are architectural challenges. In this scenario, at least one distinct VM is
running for each client’s application. This means that data is spread across several VMs’
virtual disks. Additionally, groups of VMs run on distinct physical machines. Finally, the
deduplication process must be kept transparent to the VMs and the applications
running on them.
Second, there are algorithmic challenges. An efficient method to detect modified data and
to share identical data is needed. This method must use metadata to compare the modified data
with the storage’s data in order to share it, and the size of this metadata must be taken into
account to achieve an efficient deduplication algorithm. To detect modified data without having
to scan the entire storage, a method that intercepts I/O requests to the disk is also necessary.
This approach reduces CPU usage but can introduce significant overhead on the I/O requests,
so this overhead must be kept as low as possible.
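As a minimal illustration of the metadata idea just described (not the dissertation’s actual
algorithm), the following Python sketch fingerprints fixed-size blocks with SHA-256 and keeps
an index from fingerprint to a single stored copy, so that duplicated writes are shared. The
class, the toy in-memory store, and the block size are assumptions.

# Minimal sketch of block-level deduplication metadata: a hash index maps each
# content fingerprint to one physical copy, so identical writes are shared.
import hashlib

BLOCK_SIZE = 4096

class DedupIndex:
    def __init__(self):
        self.by_digest = {}     # SHA-256 digest -> physical block address
        self.store = []         # toy "shared storage": list of unique blocks

    def write_block(self, data: bytes) -> int:
        """Return the physical address for this block, storing it only once."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.by_digest:
            return self.by_digest[digest]        # duplicate: share existing copy
        self.store.append(data)
        addr = len(self.store) - 1
        self.by_digest[digest] = addr
        return addr

index = DedupIndex()
payload = b"x" * BLOCK_SIZE
a1 = index.write_block(payload)
a2 = index.write_block(payload)                  # identical content
assert a1 == a2 and len(index.store) == 1        # only one physical copy kept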
PURPOSE OF STUDY
The main goal of this dissertation is to show how deduplication can be achieved in a
virtualized system, towards finding and eliminating redundant data in the context of cloud
computing services.
A second objective is to evaluate the impact of deduplication on personal data, towards
demonstrating the usefulness of the proposed solution.
SIGNIFICANCE OF STUDY
If every University of Uyo engineering student embraces the vast capabilities and
functionalities of the Google Docs cloud computing service, then:
 They can access their files and documents anywhere there is internet connectivity.
 It eliminates the need to carry flash drives.
 They can post their important files and documents to the web instantly.
 They can collaborate with others while maintaining only one copy of the document.
 They can download their files in Microsoft Word and Excel formats (a sketch of such an
export appears after this list).
 They can also use the forms in Google Docs as a survey instrument and as an assessment
instrument for their research work, as they are connected to a vast community of like-
minded people.
 They can handle their PowerPoint presentations with Google Presentations, which works
anywhere there is web access. This eliminates software compatibility issues, and there is
no need to carry thumb drives around, as their presentations can be easily downloaded to
Microsoft PowerPoint.
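As a hedged illustration of the export step mentioned in the list above, the following sketch uses
the Google Drive v3 Python client to download a Google Doc in Microsoft Word format. The
document ID is a placeholder and the OAuth credential setup is omitted; this is an example
pattern, not part of the study itself.

# Hedged sketch: export a Google Doc as .docx via the Drive v3 API.
# DOC_ID and the credentials object are placeholders; OAuth setup is omitted.
import io
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

DOC_ID = "YOUR_GOOGLE_DOC_ID"     # hypothetical document identifier
DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

def export_doc_as_word(creds, out_path="report.docx"):
    service = build("drive", "v3", credentials=creds)
    request = service.files().export_media(fileId=DOC_ID, mimeType=DOCX)
    with io.FileIO(out_path, "wb") as fh:
        downloader = MediaIoBaseDownload(fh, request)
        done = False
        while not done:
            _, done = downloader.next_chunk()   # download the exported file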
CHAPTER 2
LITERATURE REVIEW
CHAPTER 3
RESEARCH METHODS
CHAPTER 4
RESULTS & DISCUSSIONS
All the cloud services described above allow the client to stop worrying about problems
such as:
Dependability: Clients’ applications and data stored in these services must be accessible
24 hours a day, seven days a week. Besides that, clients’ data is stored redundantly in
several data centres in different geographical locations.
Elasticity: Clients have the illusion of having unlimited resources. For example, in
Amazon EC2, when clients’ applications need to scale, this can be done in a few minutes by
running an additional virtual machine. Clients also have the option of starting with few
resources and buying more only when they are needed.
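As a hedged sketch of this elasticity example, the following Python snippet uses boto3 to start
one additional virtual machine on Amazon EC2. The AMI identifier, instance type, and region
are placeholders, and AWS credentials are assumed to be configured elsewhere.

# Hedged sketch: scale out by launching one more EC2 instance with boto3.
import boto3

def scale_out(ami_id="ami-0123456789abcdef0", instance_type="t2.micro"):
    ec2 = boto3.client("ec2", region_name="us-east-1")   # region is a placeholder
    response = ec2.run_instances(
        ImageId=ami_id,          # placeholder AMI for the application image
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
    )
    return response["Instances"][0]["InstanceId"]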
Another important aspect is the use of virtualization [9, 20] technology by cloud services [1,
54]. Virtual machines (VMs) allow these services to have increased flexibility in terms of
deployment and redeployment of applications in the cloud. Deploying a new virtual machine or
redeploying it in another physical server is faster and simpler than deploying a new physical
server. Virtualization also allows having more control over cloud resources, like disk, network
and computation power; these resources can therefore be distributed according to the
applications’ needs. The use of virtual machines is a key aspect in achieving the elasticity
property, and the isolation property of virtual machines ensures that a failure in one VM does
not affect the other VMs running on the same physical server.
A cloud infrastructure is therefore composed of several data centres. In each data centre there are
several physical nodes running a certain number of virtual machines. The cloud storage has to be
large enough to accommodate these virtual machine images and the clients’ data that is stored
remotely.
Both cloud services and on-line backup services will have a large amount of data that needs to
be stored persistently and, as a consequence, a huge amount of duplicated data is expected to be
found.
In the case of on-line backup services, many users will have duplicated files, like music, videos,
text files, etc. As for enterprises that use backup services, identical data will probably be
found among the employees’ files. Yet another source of duplicated data that may exist in
some enterprises’ storage is e-mail: studies show that e-mail data sets have more than 32% of
their blocks duplicated [8].
Regarding a cloud computing provider’s storage, duplicated data is expected to be found among
the several virtual machine images and within the data stored remotely. In fact, recalling that
Dropbox uses Amazon S3 as its backend, Amazon’s storage will contain all the duplicated data
mentioned above for on-line backup applications. Additionally, applications that use databases
may run in the cloud and need to create several replicas in order to process a large number of
concurrent read requests. This further increases the amount of duplicated data.
If some of this duplicated data is eliminated, the storage space in use can be reduced: more
clients can be supported without adding extra storage resources, or each client can be offered a
better service. This reduction in storage space also allows these services to manage their data
more efficiently and simply.
A certain level of redundancy is always needed to have a service that is resilient to failures and
has efficient access to data from several locations. For this purpose, data must be replicated
amongst several nodes and these nodes should be located in several geographical locations.
Systems that reduce data duplication usually index the storage’s data so that data with the
same content can be shared. The elimination of redundant data is known as deduplication.
Deduplication can also be used to improve bandwidth usage for remote storage services: if the
storage’s data is indexed, it is possible to determine what data really needs to be transmitted to
the storage server and what data is already there. Bandwidth cost and data upload speed are
described as among the main obstacles to the growth of cloud computing [1].
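A minimal sketch of this bandwidth optimization follows: the client fingerprints its blocks, asks
the storage server which fingerprints are missing, and uploads only those blocks. The ToyServer
class stands in for the remote service and is an assumption, not any particular provider’s API.

# Sketch of client-side deduplication to save upload bandwidth.
import hashlib

BLOCK_SIZE = 4096

def split_blocks(data: bytes):
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def upload(data: bytes, server):
    blocks = split_blocks(data)
    digests = [hashlib.sha256(b).hexdigest() for b in blocks]
    missing = server.which_are_missing(digests)      # one small metadata request
    sent = 0
    for block, digest in zip(blocks, digests):
        if digest in missing:
            server.put_block(digest, block)          # only new content crosses the wire
            sent += 1
    return sent

class ToyServer:
    """Stand-in for the remote storage; a real service would expose similar calls."""
    def __init__(self):
        self.blocks = {}
    def which_are_missing(self, digests):
        return {d for d in digests if d not in self.blocks}
    def put_block(self, digest, block):
        self.blocks[digest] = block

server = ToyServer()
data = b"".join(bytes([i]) * BLOCK_SIZE for i in range(3))
assert upload(data, server) == 3      # first upload sends all three blocks
assert upload(data, server) == 0      # identical re-upload sends nothing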
From the perspective of data security, which has always been an important aspect of quality of
service, cloud computing inevitably poses new and challenging security threats for a number of
reasons.
1. Firstly, traditional cryptographic primitives for data security protection cannot be directly
adopted, because users lose control over their data under cloud computing.
2. Secondly, cloud computing is not just a third-party data warehouse. The data stored in the
cloud may be frequently updated by the users, including insertion, deletion, modification,
appending, reordering, etc.
We propose an effective and flexible distributed scheme with explicit dynamic data support to
ensure the correctness of users’ data in the cloud. We rely on an erasure-correcting code in the
file distribution preparation to provide redundancy and guarantee data dependability (a toy
sketch of the erasure-coding idea follows the list below). This reduces the communication and
storage overhead compared to traditional replication-based file distribution techniques.
1. Compared to many of its predecessors, which only provide binary results about the storage
state across the distributed servers, the challenge-response protocol in our work further provides
the localization of data errors.
2. Unlike most prior works for ensuring remote data integrity, the new scheme supports secure
and efficient dynamic operations on data blocks, including update, delete, and append.
3. Extensive security and performance analysis shows that the proposed scheme is highly
efficient and resilient against Byzantine failures, malicious data modification attacks, and even
server-colluding attacks.
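The erasure-correcting code used in the scheme above is not reproduced here. The following toy
sketch uses simple XOR parity, a degenerate erasure code tolerating a single lost fragment, only
to illustrate how coded redundancy allows reconstruction with less overhead than keeping full
replicas of a file on every server.

# Toy sketch of the erasure-coding idea (not the actual code used in the scheme):
# k data fragments plus one XOR parity fragment survive any single erasure.
from functools import reduce

def encode(fragments):
    """Return the fragments plus one XOR parity fragment of equal length."""
    parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), fragments)
    return fragments + [parity]

def recover(stored, lost_index):
    """Rebuild the fragment at lost_index from all the surviving ones."""
    survivors = [f for i, f in enumerate(stored) if i != lost_index and f is not None]
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), survivors)

data = [b"frag-one", b"frag-two", b"fragthre"]      # equal-length fragments
stored = encode(data)                                # distributed to four servers
stored[1] = None                                     # one server fails
assert recover(stored, 1) == data[1]                 # lost fragment rebuilt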
CHAPTER 5
CONCLUSIONS/RECOMMENDATIONS
Cloud computing provides supercomputing-scale power; that is, it extends beyond a single
company or enterprise. To ensure the correctness of users’ data in cloud data storage, we
proposed an effective and flexible distributed scheme with explicit dynamic data support,
including block update, delete, and append.
This dissertation introduces a solution to find and eliminate duplicated data in a virtualized
system. First, the effectiveness of two techniques for finding duplicated data was evaluated with
the GSD data set, which contains personal files from several researchers. One of the techniques
detects duplicated files and the other detects duplicated blocks of a fixed size. For our specific
scenario, we concluded that the block approach is better. This observation still holds when the
space overhead introduced by the metadata necessary to eliminate duplicated data is taken into
account. Three different block sizes were used and the results, in terms of space saved, were
similar. This study was essential to define a realistic benchmark to test our prototype.
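The GSD evaluation itself is not reproduced here. As a hedged illustration of the two techniques
being compared, the following sketch measures how much space whole-file versus fixed-size-block
deduplication would save on an arbitrary directory tree; the metadata overhead discussed above is
not counted, and the block size is an assumption.

# Hedged sketch comparing whole-file and fixed-size-block deduplication savings.
import hashlib
import os

BLOCK_SIZE = 4096

def dedup_savings(root):
    total = 0
    file_digests, block_digests = set(), set()
    unique_file_bytes = unique_block_bytes = 0
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                data = f.read()
            total += len(data)
            fd = hashlib.sha256(data).hexdigest()
            if fd not in file_digests:            # whole-file deduplication
                file_digests.add(fd)
                unique_file_bytes += len(data)
            for i in range(0, len(data), BLOCK_SIZE):
                block = data[i:i + BLOCK_SIZE]
                bd = hashlib.sha256(block).hexdigest()
                if bd not in block_digests:       # fixed-size block deduplication
                    block_digests.add(bd)
                    unique_block_bytes += len(block)
    return total, unique_file_bytes, unique_block_bytes

total, by_file, by_block = dedup_savings(".")
print(f"file-level saves  {total - by_file} of {total} bytes")
print(f"block-level saves {total - by_block} of {total} bytes")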
A solution was presented to detect and eliminate redundancy between the virtual disks of VMs
that run on the same physical machine and share a common storage. Our solution does
not use a typical scan approach to detect duplicated data in the storage. Instead, it uses a dynamic
approach that intercepts I/O write requests from the VMs to their virtual disks and uses this
information to share identical data. With this approach, we reduce the computational power that
a scan method would require. We also minimize the overhead introduced into the I/O write
requests by delaying the process of calculating the blocks’ signatures and sharing them.
Our architecture is composed of three modules, each with a different purpose. The I/O
Interception module intercepts VMs’ I/O requests and redirects them to the correct physical
address. This module keeps a list of all the blocks that were written, which is consumed by
our Share module. The Share module is responsible for processing each element of that list and
sharing it; at this module, an additional mechanism was introduced to prevent the sharing of
blocks that are modified frequently. Besides these modules, there is a Garbage Collector module
responsible for distributing free blocks to the I/O Interception module and collecting unused
blocks that result from the sharing process and from the copy-on-write operations. The copy-on-
write operations are fundamental to prevent VMs from modifying blocks that are being shared.
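The following Python sketch is one interpretation of the three-module design described above,
not the actual implementation: an interception module queues written blocks, a Share module
processes that queue lazily using content hashes, and a Garbage Collector recycles freed blocks;
copy-on-write is modelled by always allocating a fresh block on write. All class and method
names are assumptions.

# Interpretation of the three modules described above; names are assumptions.
import hashlib
from collections import deque

class GarbageCollector:
    """Hands out free physical blocks and reclaims unreferenced ones."""
    def __init__(self, capacity):
        self.free = deque(range(capacity))
    def allocate(self):
        return self.free.popleft()
    def release(self, addr):
        self.free.append(addr)

class IOInterceptor:
    """Intercepts VM writes, stores them, and records them for later sharing."""
    def __init__(self, gc):
        self.gc = gc
        self.storage = {}            # physical address -> block content
        self.dirty = deque()         # blocks written but not yet deduplicated
    def write(self, vm_table, vblock, data):
        addr = self.gc.allocate()    # copy-on-write: never modify a shared block
        self.storage[addr] = data
        vm_table[vblock] = addr
        self.dirty.append((vm_table, vblock, addr))

class ShareModule:
    """Processes the dirty queue lazily, sharing blocks with equal content."""
    def __init__(self, interceptor, gc):
        self.io, self.gc = interceptor, gc
        self.by_digest = {}          # content digest -> shared physical address
    def run_once(self):
        while self.io.dirty:
            vm_table, vblock, addr = self.io.dirty.popleft()
            digest = hashlib.sha256(self.io.storage[addr]).hexdigest()
            shared = self.by_digest.setdefault(digest, addr)
            if shared != addr:       # duplicate found: remap and free the copy
                vm_table[vblock] = shared
                del self.io.storage[addr]
                self.gc.release(addr)

gc = GarbageCollector(capacity=1024)
io = IOInterceptor(gc)
share = ShareModule(io, gc)
vm1, vm2 = {}, {}                    # per-VM virtual-to-physical block tables
io.write(vm1, 0, b"same content")
io.write(vm2, 0, b"same content")
share.run_once()
assert vm1[0] == vm2[0]              # both VMs now point at one shared block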
We also presented a prototype that uses Xen and implements the architecture described above.
This prototype includes an additional mechanism not present in the architecture description. This
mechanism is used to load the VMs’ images into the global storage and allows sharing data
contained in VM images that is never written. Two optimizations were introduced in our
prototype with the intent of increasing the throughput of VMs’ write and read requests. In fact,
these optimizations were crucial to achieving I/O throughput rates identical to the ones
obtained from a baseline approach where no deduplication is performed.
Finally, we benchmarked our prototype, generating the content to be written from a distribution
that resembles the distribution of the GSD data set. The TPC-C NURand function was used to
calculate the file positions where read and write operations are executed. From these
benchmarks, we concluded that our prototype shares identical data between several VMs without
introducing a significant amount of overhead in CPU usage or in I/O request throughput
and latency. Regarding RAM usage, the value is acceptable for our study purposes; in fact, RAM
usage can be improved significantly by using an optimized approach to store the metadata used
by our prototype.
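For reference, the TPC-C NURand (non-uniform random) function has the form
NURand(A, x, y) = (((random(0, A) | random(x, y)) + C) % (y - x + 1)) + x, where C is a
run-time constant. The sketch below applies it to pick skewed block positions for I/O; the
particular constants and its use as a block offset here are our assumptions.

# Sketch of the TPC-C NURand function used to pick skewed I/O positions.
import random

C = 123                                   # run-time constant, chosen once

def nurand(a, x, y):
    return (((random.randint(0, a) | random.randint(x, y)) + C) % (y - x + 1)) + x

FILE_BLOCKS = 1 << 20                     # hypothetical file size in blocks
def next_io_position():
    return nurand(1023, 0, FILE_BLOCKS - 1)

positions = [next_io_position() for _ in range(5)]
print(positions)                          # skewed, non-uniform block offsets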
To conclude, this document presents a solution to find and eliminate duplicated data in a
virtualized scenario, which is not addressed, as far as we know, by any commercial or open-
source product.
REFERENCES
APPENDICES