A RESEARCH WORK
ON
EFFICIENT STORAGE OF DATA IN CLOUD COMPUTING
A CASE STUDY OF GOOGLE DOCS FOR
INFORMATION SHARING BY UNIUYO ENGINEERING
STUDENTS
BY
NKWOCHA, CHINEDU SOLOMON
10/EG/CO/456
DEPARTMENT OF COMPUTER ENGINEERING
FACULTY OF ENGINEERING
UNIVERSITY OF UYO, UYO
SUBMITTED TO
DR. FRANCIS UDOH
COURSE LECTURER GRE 322
DEPARTMENT OF CHEM/PET ENGINEERING
FACULTY OF ENGINEERING
UNIVERSITY OF UYO, UYO
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS OF
GRE 322
MARCH 2014
ABSTRACT
Keeping critical data safe and accessible from several locations has become a global
preoccupation, whether this data is personal, organizational, or generated by applications. As a
consequence, we have witnessed the emergence of on-line storage services. In addition, there
is the new paradigm of Cloud Computing, which brings new ideas for building services that allow
users to store their data and run their applications in the Cloud. Through smart and efficient
management of these services’ storage, it is possible to improve the quality of service offered, as
well as to optimize the usage of the infrastructure on which the services run. This management is
even more critical and complex when the infrastructure is composed of thousands of nodes
running several virtual machines and sharing the same storage. Eliminating redundant data
in these services’ storage can simplify and enhance this management. This dissertation
presents a solution to detect and eliminate duplicated data between virtual machines that run on
the same physical host and write their virtual disks’ data to a shared storage. A prototype that
implements this solution is introduced and evaluated. Finally, a study comparing the
efficiency of two different approaches to eliminating redundant data in a personal data set is
described.
Cloud computing is a computing paradigm that has attracted many computer
users, businesses, and government agencies. It brings many advantages, especially
ubiquitous services that anyone can access through the Internet. With cloud computing,
there is no need for on-premises hardware or servers to support a company’s computer
systems, internet services, and networks. One of the core services provided by cloud
computing is data storage. Over the past decades, data storage has been recognized as one
of the main concerns of information technology. The benefits of network-based applications
have led to the transition from server-attached storage to distributed storage. Because data
security is the foundation of information security, a great deal of effort has been made in the
area of distributed storage security. This work also examines the threats and attacks that may
be launched against cloud data storage and the security mechanism proposed against them.
CHAPTER 1
INTRODUCTION
WHAT IS CLOUD COMPUTING? Key to the definition of cloud computing is the
“cloud” itself. Here, the cloud is a large group of interconnected computers, which can be
personal computers or network servers. The cloud extends beyond a single company or
entity. The applications and data served by the cloud are available to a broad group of
users, across enterprises and across platforms. Access is via the Internet: any authorized user
can reach these documents and applications from any computer connected to it.
Cloud computing moves application software and databases to large data centres,
where the handling of data may not be fully trustworthy; this leads to many new security
challenges. By utilizing a homomorphic token with distributed verification of erasure-coded
data, our scheme achieves the integration of storage correctness insurance and data error
localization, i.e., the identification of misbehaving servers. The new scheme supports secure
and efficient dynamic operations on data blocks, including data update, delete, and append.
Extensive security and performance analysis shows that the proposed scheme is highly
efficient, while restricting certain IP addresses and authenticating existing users helps to
prevent malicious data modification attacks.
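The scheme above is described only at a high level. As a rough, hypothetical illustration of the
underlying idea, the following Python sketch precomputes keyed tokens over challenged blocks
and uses a challenge-response check to verify storage correctness per server, which localizes a
misbehaving server. The names and the simplified setup are assumptions, not the actual
construction referred to in the text.

# A minimal sketch, not the actual scheme described above: it only illustrates
# how precomputed, keyed tokens can support a challenge-response check that
# verifies storage correctness per server and so localizes a misbehaving one.
# SECRET_KEY, BLOCK_SIZE, and the dict-of-servers layout are assumptions.
import hashlib
import hmac
import os
import random

SECRET_KEY = os.urandom(32)   # kept by the verifier, never given to servers
BLOCK_SIZE = 4096

def make_token(blocks, indices):
    """Precompute a keyed token over the challenged block indices."""
    mac = hmac.new(SECRET_KEY, digestmod=hashlib.sha256)
    for i in indices:
        mac.update(blocks[i])
    return mac.digest()

def verify(returned_blocks, token):
    """Recompute the token over the server's response and compare."""
    mac = hmac.new(SECRET_KEY, digestmod=hashlib.sha256)
    for b in returned_blocks:
        mac.update(b)
    return hmac.compare_digest(mac.digest(), token)

# Toy setup: three servers each hold a list of data blocks.
servers = {s: [os.urandom(BLOCK_SIZE) for _ in range(16)] for s in ("A", "B", "C")}
challenge = random.sample(range(16), 4)
tokens = {s: make_token(blocks, challenge) for s, blocks in servers.items()}

servers["B"][challenge[0]] = os.urandom(BLOCK_SIZE)   # simulate corruption on B

for s, blocks in servers.items():
    response = [blocks[i] for i in challenge]          # honest server behaviour
    if not verify(response, tokens[s]):
        print(f"server {s} failed the check (data error localized)")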
Cloud computing is a computing paradigm in which processing is performed
over the Internet and accessed through a standard web browser. Cloud computing builds on
established trends for driving the cost out of the delivery of services while increasing the
speed and agility with which services are deployed. It shortens the time from sketching out
application architecture to actual deployment. Cloud computing incorporates virtualization,
on-demand deployment, Internet delivery of services, and open source software.
The cloud computing architecture of a cloud solution is the structure of the
system: the on-premises and cloud resources, services, middleware, and software
components it comprises, their geo-location, the externally visible properties of those
components, and the relationships between them. The term also refers to the documentation
of a system’s cloud computing architecture.
BACKGROUND INFORMATION
Dependable backup services are increasingly important not only to enterprises but also to
ordinary users who want to keep their personal files safe. A traditional approach, for such
users, is to keep a copy of all their files on an external hard drive; one example of such a
system is Time Machine. For enterprises, the solution requires larger storage and a more
complex setup to back up their critical data. For some enterprises, data is so important that
several backup copies must be kept in different physical locations in order to avoid losing it
in case of natural catastrophes.
Consider an engineering student of the University of Uyo who has just finished typing a
fifty-page research project that must be submitted the next day. He walks into a business
centre to have his work printed and spiral-bound, but his file gets corrupted as it is being
opened for printing, because of a virus on the desktop computer at the business centre. He is
furious because he has no other backup of the file anywhere. His all-night labour has come
to nothing, and he is stranded because he can no longer meet the deadline for submission. If
he had typed his research work using a cloud computing service like Google Docs, he would
have had no cause to worry, as he could easily have downloaded his document in Microsoft
Word format with the click of a button. This is a common problem many engineering
students face in the University of Uyo.
Another important aspect, for both enterprises and ordinary users, is the need to
access their data remotely from different places. For this purpose, the web is a good solution,
given how easy it is to insert and retrieve information of any kind from it. This explains
the emergence and success of on-line backup services like Dropbox, Box.net, RapidShare and
Google Docs, which allow clients to keep their data safe on the web. These services are more
than just simple data archives. Some of them support other features, like collaborative work,
versioning, online editing, and synchronization of users’ data across several devices. With these
new functionalities, a storage service that can store and retrieve data efficiently becomes
necessary, something classical archival systems did not need to consider. As expected, all
these services limit the amount of data each user can back up and, therefore, clients must
pay a fee to expand these limits.
STATEMENT OF PROBLEM
Effective deduplication in a cloud computing scenario, however, raises a number of
challenges.
First, there are architectural challenges. In this scenario, at least one distinct VM is
running for each client’s application. This means that data is spread across several VMs’
virtual disks. Additionally, groups of VMs run on distinct physical machines. Finally, the
deduplication process must be kept transparent to the VMs and the applications
running on them.
Second, there are algorithmic challenges. An efficient method to detect modified data and
to share identical data is needed. This method must use metadata to compare the modified data
with the storage’s data in order to share it, and the size of this metadata must be taken into
account to achieve an efficient deduplication algorithm. To detect modified data without having
to scan the entire storage, a method that intercepts I/O requests to the disk is also necessary.
This approach reduces CPU usage but can introduce significant overhead on the I/O requests,
so this overhead must be kept as low as possible.
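As a minimal illustration of the metadata idea just described (not the dissertation’s actual
algorithm), the following Python sketch fingerprints fixed-size blocks with SHA-256 and keeps
an index from fingerprint to a single stored copy, so that duplicated writes are shared. The
class, the toy in-memory store, and the block size are assumptions.

# Minimal sketch of block-level deduplication metadata: a hash index maps each
# content fingerprint to one physical copy, so identical writes are shared.
import hashlib

BLOCK_SIZE = 4096

class DedupIndex:
    def __init__(self):
        self.by_digest = {}     # SHA-256 digest -> physical block address
        self.store = []         # toy "shared storage": list of unique blocks

    def write_block(self, data: bytes) -> int:
        """Return the physical address for this block, storing it only once."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.by_digest:
            return self.by_digest[digest]        # duplicate: share existing copy
        self.store.append(data)
        addr = len(self.store) - 1
        self.by_digest[digest] = addr
        return addr

index = DedupIndex()
payload = b"x" * BLOCK_SIZE
a1 = index.write_block(payload)
a2 = index.write_block(payload)                  # identical content
assert a1 == a2 and len(index.store) == 1        # only one physical copy kept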
PURPOSE OF STUDY
The main goal of this dissertation is to show how deduplication can be achieved in a
virtualized system, towards finding and eliminating redundant data in the context of cloud
computing services.
A second objective is to evaluate the impact of deduplication on personal data, towards
demonstrating the usefulness of the proposed solution.
SIGNIFICANCE OF STUDY
If every University of Uyo engineering student embraces the vast capabilities and
functionalities of the Google Docs cloud computing service, then:
 They can access their files and documents anywhere there is internet connectivity.
 It eliminates the need to carry flash drives.
 They can post their important files and documents to the web instantly.
 They can collaborate with others while maintaining only one copy of the document.
 They can download their files in Microsoft Word and Excel formats (a sketch of such an
export appears after this list).
 They can also use the forms in Google Docs as a survey instrument and as an assessment
instrument for their research work, as they are connected to a vast community of like-
minded people.
 They can handle their PowerPoint presentations with Google Presentations, which works
anywhere there is web access. This eliminates software compatibility issues, and there is
no need to carry thumb drives around, as their presentations can be easily downloaded to
Microsoft PowerPoint.
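As a hedged illustration of the export step mentioned in the list above, the following sketch uses
the Google Drive v3 Python client to download a Google Doc in Microsoft Word format. The
document ID is a placeholder and the OAuth credential setup is omitted; this is an example
pattern, not part of the study itself.

# Hedged sketch: export a Google Doc as .docx via the Drive v3 API.
# DOC_ID and the credentials object are placeholders; OAuth setup is omitted.
import io
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

DOC_ID = "YOUR_GOOGLE_DOC_ID"     # hypothetical document identifier
DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

def export_doc_as_word(creds, out_path="report.docx"):
    service = build("drive", "v3", credentials=creds)
    request = service.files().export_media(fileId=DOC_ID, mimeType=DOCX)
    with io.FileIO(out_path, "wb") as fh:
        downloader = MediaIoBaseDownload(fh, request)
        done = False
        while not done:
            _, done = downloader.next_chunk()   # download the exported file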
CHAPTER 2
LITERATURE REVIEW
CHAPTER 3
RESEARCH METHODS
CHAPTER 4
RESULTS & DISCUSSIONS
All the cloud services described above allow the client to stop worrying about problems
such as:
Dependability: Clients’ applications and data stored in these services must be accessible
24 hours a day, seven days a week. Besides that, clients’ data is stored redundantly in
several data centres in different geographical locations.
Elasticity: Clients have the illusion of having unlimited resources. For example, in
Amazon EC2, when clients’ applications need to scale, this can be done in a few minutes by
running an additional virtual machine. Clients also have the option of starting with few
resources and buying more only when they are needed.
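As a hedged sketch of this elasticity example, the following Python snippet uses boto3 to start
one additional virtual machine on Amazon EC2. The AMI identifier, instance type, and region
are placeholders, and AWS credentials are assumed to be configured elsewhere.

# Hedged sketch: scale out by launching one more EC2 instance with boto3.
import boto3

def scale_out(ami_id="ami-0123456789abcdef0", instance_type="t2.micro"):
    ec2 = boto3.client("ec2", region_name="us-east-1")   # region is a placeholder
    response = ec2.run_instances(
        ImageId=ami_id,          # placeholder AMI for the application image
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
    )
    return response["Instances"][0]["InstanceId"]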
Another important aspect is the use of virtualization [9, 20] technology by cloud services [1,
54]. Virtual machines (VMs) allow these services to have increased flexibility in terms of
deployment and redeployment of applications in the cloud. Deploying a new virtual machine or
redeploying it in another physical server is faster and simpler than deploying a new physical
server. Virtualization also allows having more control over cloud resources, like disk, network
and computation power; these resources can therefore be distributed according to the
applications’ needs. The use of virtual machines is a key aspect in achieving the elasticity
property, and the isolation property of virtual machines ensures that a failure in one VM does
not affect the other VMs running on the same physical server.
A cloud infrastructure is therefore composed of several data centres. In each data centre there are
several physical nodes running a certain number of virtual machines. The cloud storage has to be
large enough to accommodate these virtual machine images and the clients’ data that is stored
remotely.
Both cloud services and on-line backup services will have a large amount of data that needs to
be stored persistently and, as a consequence, a huge amount of duplicated data is expected to be
found.
In the case of on-line backup services, many users will have duplicated files, like music, videos,
text files, etc. As for enterprises that use backup services, identical data will probably be
found among the employees’ files. Yet another source of duplicated data that may exist in
some enterprises’ storage is e-mail: studies show that e-mail data sets have more than 32% of
their blocks duplicated [8].
Regarding a cloud computing provider’s storage, duplicated data is expected to be found among
the several virtual machine images and within the data stored remotely. In fact, recalling that
Dropbox uses Amazon S3 as its backend, Amazon’s storage will contain all the duplicated data
mentioned above for on-line backup applications. Additionally, applications that use databases
may run in the cloud and need to create several replicas in order to process a large number of
concurrent read requests. This further increases the amount of duplicated data.
If some of this duplicated data is eliminated, the storage space in use can be reduced: more
clients can be supported without adding extra storage resources, or each client can be offered a
better service. This reduction in storage space also allows these services to manage their data
more efficiently and simply.
A certain level of redundancy is always needed to have a service that is resilient to failures and
has efficient access to data from several locations. For this purpose, data must be replicated
amongst several nodes and these nodes should be located in several geographical locations.
Systems that reduce data duplication usually index the storage’s data so that data with the
same content can be shared. The elimination of redundant data is known as deduplication.
Deduplication can also be used to improve bandwidth usage for remote storage services: if the
storage’s data is indexed, it is possible to determine what data really needs to be transmitted to
the storage server and what data is already there. Bandwidth cost and data upload speed are
described as among the main obstacles to the growth of cloud computing [1].
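A minimal sketch of this bandwidth optimization follows: the client fingerprints its blocks, asks
the storage server which fingerprints are missing, and uploads only those blocks. The ToyServer
class stands in for the remote service and is an assumption, not any particular provider’s API.

# Sketch of client-side deduplication to save upload bandwidth.
import hashlib

BLOCK_SIZE = 4096

def split_blocks(data: bytes):
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def upload(data: bytes, server):
    blocks = split_blocks(data)
    digests = [hashlib.sha256(b).hexdigest() for b in blocks]
    missing = server.which_are_missing(digests)      # one small metadata request
    sent = 0
    for block, digest in zip(blocks, digests):
        if digest in missing:
            server.put_block(digest, block)          # only new content crosses the wire
            sent += 1
    return sent

class ToyServer:
    """Stand-in for the remote storage; a real service would expose similar calls."""
    def __init__(self):
        self.blocks = {}
    def which_are_missing(self, digests):
        return {d for d in digests if d not in self.blocks}
    def put_block(self, digest, block):
        self.blocks[digest] = block

server = ToyServer()
data = b"".join(bytes([i]) * BLOCK_SIZE for i in range(3))
assert upload(data, server) == 3      # first upload sends all three blocks
assert upload(data, server) == 0      # identical re-upload sends nothing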
From the perspective of data security, which has always been an important aspect of quality of
service, cloud computing inevitably poses new and challenging security threats for a number of
reasons.
1. Firstly, traditional cryptographic primitives for data security protection cannot be directly
adopted, because users lose control over their data under cloud computing.
2. Secondly, cloud computing is not just a third-party data warehouse. The data stored in the
cloud may be frequently updated by the users, including insertion, deletion, modification,
appending, reordering, etc.
We propose an effective and flexible distributed scheme with explicit dynamic data support to
ensure the correctness of users’ data in the cloud. We rely on an erasure-correcting code in the
file distribution preparation to provide redundancy and guarantee data dependability (a toy
sketch of the erasure-coding idea follows the list below). This reduces the communication and
storage overhead compared to traditional replication-based file distribution techniques.
1. Compared to many of its predecessors, which only provide binary results about the storage
state across the distributed servers, the challenge-response protocol in our work further provides
the localization of data errors.
2. Unlike most prior works for ensuring remote data integrity, the new scheme supports secure
and efficient dynamic operations on data blocks, including update, delete, and append.
3. Extensive security and performance analysis shows that the proposed scheme is highly
efficient and resilient against Byzantine failures, malicious data modification attacks, and even
server-colluding attacks.
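The erasure-correcting code used in the scheme above is not reproduced here. The following toy
sketch uses simple XOR parity, a degenerate erasure code tolerating a single lost fragment, only
to illustrate how coded redundancy allows reconstruction with less overhead than keeping full
replicas of a file on every server.

# Toy sketch of the erasure-coding idea (not the actual code used in the scheme):
# k data fragments plus one XOR parity fragment survive any single erasure.
from functools import reduce

def encode(fragments):
    """Return the fragments plus one XOR parity fragment of equal length."""
    parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), fragments)
    return fragments + [parity]

def recover(stored, lost_index):
    """Rebuild the fragment at lost_index from all the surviving ones."""
    survivors = [f for i, f in enumerate(stored) if i != lost_index and f is not None]
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), survivors)

data = [b"frag-one", b"frag-two", b"fragthre"]      # equal-length fragments
stored = encode(data)                                # distributed to four servers
stored[1] = None                                     # one server fails
assert recover(stored, 1) == data[1]                 # lost fragment rebuilt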
CHAPTER 5
CONCLUSIONS/RECOMMENDATIONS
Cloud computing provides supercomputing-scale power; that is, it extends beyond a single
company or enterprise. To ensure the correctness of users’ data in cloud data storage, we
proposed an effective and flexible distributed scheme with explicit dynamic data support,
including block update, delete, and append.
This dissertation introduces a solution to find and eliminate duplicated data in a virtualized
system. First, the effectiveness of two techniques for finding duplicated data was evaluated with
the GSD data set, which contains personal files from several researchers. One of the techniques
detects duplicated files and the other detects duplicated blocks of a fixed size. For our specific
scenario, we concluded that the block approach is better. This observation still holds when the
space overhead introduced by the metadata necessary to eliminate duplicated data is taken into
account. Three different block sizes were used and the results, in terms of space saved, were
similar. This study was essential to define a realistic benchmark to test our prototype.
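The GSD evaluation itself is not reproduced here. As a hedged illustration of the two techniques
being compared, the following sketch measures how much space whole-file versus fixed-size-block
deduplication would save on an arbitrary directory tree; the metadata overhead discussed above is
not counted, and the block size is an assumption.

# Hedged sketch comparing whole-file and fixed-size-block deduplication savings.
import hashlib
import os

BLOCK_SIZE = 4096

def dedup_savings(root):
    total = 0
    file_digests, block_digests = set(), set()
    unique_file_bytes = unique_block_bytes = 0
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                data = f.read()
            total += len(data)
            fd = hashlib.sha256(data).hexdigest()
            if fd not in file_digests:            # whole-file deduplication
                file_digests.add(fd)
                unique_file_bytes += len(data)
            for i in range(0, len(data), BLOCK_SIZE):
                block = data[i:i + BLOCK_SIZE]
                bd = hashlib.sha256(block).hexdigest()
                if bd not in block_digests:       # fixed-size block deduplication
                    block_digests.add(bd)
                    unique_block_bytes += len(block)
    return total, unique_file_bytes, unique_block_bytes

total, by_file, by_block = dedup_savings(".")
print(f"file-level saves  {total - by_file} of {total} bytes")
print(f"block-level saves {total - by_block} of {total} bytes")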
A solution was presented to detect and eliminate redundancy between the virtual disks of VMs
that run on the same physical machine and share a common storage. Our solution does
not use a typical scan approach to detect duplicated data in the storage. Instead, it uses a dynamic
approach that intercepts I/O write requests from the VMs to their virtual disks and uses this
information to share identical data. With this approach, we reduce the computational power that
a scan method would require. We also minimize the overhead introduced into the I/O write
requests by delaying the process of calculating the blocks’ signatures and sharing them.
Our architecture is composed of three modules, each with a different purpose. The I/O
Interception module intercepts VMs’ I/O requests and redirects them to the correct physical
address. This module keeps a list of all the blocks that were written, which is consumed by
our Share module. The Share module is responsible for processing each element of that list and
sharing it; at this module, an additional mechanism was introduced to prevent the sharing of
blocks that are modified frequently. Besides these modules, there is a Garbage Collector module
responsible for distributing free blocks to the I/O Interception module and collecting unused
blocks that result from the sharing process and from the copy-on-write operations. The copy-on-
write operations are fundamental to prevent VMs from modifying blocks that are being shared.
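The following Python sketch is one interpretation of the three-module design described above,
not the actual implementation: an interception module queues written blocks, a Share module
processes that queue lazily using content hashes, and a Garbage Collector recycles freed blocks;
copy-on-write is modelled by always allocating a fresh block on write. All class and method
names are assumptions.

# Interpretation of the three modules described above; names are assumptions.
import hashlib
from collections import deque

class GarbageCollector:
    """Hands out free physical blocks and reclaims unreferenced ones."""
    def __init__(self, capacity):
        self.free = deque(range(capacity))
    def allocate(self):
        return self.free.popleft()
    def release(self, addr):
        self.free.append(addr)

class IOInterceptor:
    """Intercepts VM writes, stores them, and records them for later sharing."""
    def __init__(self, gc):
        self.gc = gc
        self.storage = {}            # physical address -> block content
        self.dirty = deque()         # blocks written but not yet deduplicated
    def write(self, vm_table, vblock, data):
        addr = self.gc.allocate()    # copy-on-write: never modify a shared block
        self.storage[addr] = data
        vm_table[vblock] = addr
        self.dirty.append((vm_table, vblock, addr))

class ShareModule:
    """Processes the dirty queue lazily, sharing blocks with equal content."""
    def __init__(self, interceptor, gc):
        self.io, self.gc = interceptor, gc
        self.by_digest = {}          # content digest -> shared physical address
    def run_once(self):
        while self.io.dirty:
            vm_table, vblock, addr = self.io.dirty.popleft()
            digest = hashlib.sha256(self.io.storage[addr]).hexdigest()
            shared = self.by_digest.setdefault(digest, addr)
            if shared != addr:       # duplicate found: remap and free the copy
                vm_table[vblock] = shared
                del self.io.storage[addr]
                self.gc.release(addr)

gc = GarbageCollector(capacity=1024)
io = IOInterceptor(gc)
share = ShareModule(io, gc)
vm1, vm2 = {}, {}                    # per-VM virtual-to-physical block tables
io.write(vm1, 0, b"same content")
io.write(vm2, 0, b"same content")
share.run_once()
assert vm1[0] == vm2[0]              # both VMs now point at one shared block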
We also presented a prototype that uses Xen and implements the architecture described above.
This prototype includes an additional mechanism not present in the architecture description. This
mechanism is used to load the VMs’ images into the global storage and allows sharing data
contained in VM images that is never written. Two optimizations were introduced in our
prototype with the intent of increasing the throughput of VMs’ write and read requests. In fact,
these optimizations were crucial to achieving I/O throughput rates identical to the ones
obtained from a baseline approach where no deduplication is performed.
Finally, we benchmarked our prototype, generating the content to be written from a distribution
that resembles the distribution of the GSD data set. The TPC-C NURand function was used to
calculate the file positions where read and write operations are executed. From these
benchmarks, we concluded that our prototype shares identical data between several VMs without
introducing a significant amount of overhead in CPU usage or in I/O request throughput
and latency. Regarding RAM usage, the value is acceptable for our study purposes; in fact, RAM
usage can be improved significantly by using an optimized approach to store the metadata used
by our prototype.
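For reference, the TPC-C NURand (non-uniform random) function has the form
NURand(A, x, y) = (((random(0, A) | random(x, y)) + C) % (y - x + 1)) + x, where C is a
run-time constant. The sketch below applies it to pick skewed block positions for I/O; the
particular constants and its use as a block offset here are our assumptions.

# Sketch of the TPC-C NURand function used to pick skewed I/O positions.
import random

C = 123                                   # run-time constant, chosen once

def nurand(a, x, y):
    return (((random.randint(0, a) | random.randint(x, y)) + C) % (y - x + 1)) + x

FILE_BLOCKS = 1 << 20                     # hypothetical file size in blocks
def next_io_position():
    return nurand(1023, 0, FILE_BLOCKS - 1)

positions = [next_io_position() for _ in range(5)]
print(positions)                          # skewed, non-uniform block offsets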
To conclude, this document presents a solution to find and eliminate duplicated data in a
virtualized scenario, which is not addressed, as far as we know, by any commercial or open-
source product.
REFERENCES
APPENDICES