Unit 4
Cloud Programming and Software Environments: Features of Cloud and Grid Platforms, Parallel
& Distributed programming Paradigms, Programming Support of Google App Engine,
Programming on Amazon AWS and Microsoft Azure, Emerging Cloud Software Environments.
4.1 Features of Cloud and Grid Platforms
4.1.1 Capabilities of Cloud and Platform Features:
Commercial clouds need a large set of capabilities. They provide low-cost, flexible computing and, beyond raw infrastructure, offer additional capabilities collectively known as "Platform as a Service" (PaaS). The current platform features of Azure include tables, queues, blobs, SQL database, and web and worker roles. Amazon is usually described as "Infrastructure as a Service", but it also offers platform features including queues, notifications, monitoring, a content delivery network, relational databases, and MapReduce. The capabilities of cloud platforms are listed below.
• Physical or virtual computing platform: The cloud environment includes both physical and virtual platforms. Virtual platforms have the unique capability of providing isolated environments for different applications and users.
• Huge data storage service, distributed file system: Cloud data storage services offer large disk capacity for heavy data sets, together with a service interface for storing the data. The distributed file system underneath provides the massive data storage service.
• Huge database storage service: Clouds need a service similar to a DBMS so that developers can store data in a semantic way.
• Huge data processing method and programming model: Cloud infrastructure offers many nodes, even for simple applications, and programmers should be able to use all of these nodes without having to handle issues such as network or node failure and the scaling of running code; the platform takes care of these concerns.
• Support for workflow and data query languages: The programming model hides the cloud infrastructure; a workflow language and a data query language are provided for expressing application logic.
• Programming interface and service deployment: Cloud applications need web interfaces or special APIs such as J2EE, PHP, ASP, or Rails, and they can use Ajax technologies to improve the user experience when functions are accessed through web browsers.
• Runtime support: Runtime support is transparent to users and applications. It includes distributed monitoring services, a distributed task scheduler, distributed locking, and so on.
• Support services: Important support services include data services and computing services.
Infrastructure cloud features are listed below:
• Accounting: Has clear economic models and is an active area for commercial clouds.
• Authentication and authorization: Requires single sign-on across all the systems.
• Data transport: Supports data transfer among job components, both within and between clouds and grids.
• Registry: Provides a registry as an information resource for the system.
• Operating systems: Supports operating systems such as Android, Linux, Windows, and Apple platforms.
• Program library: Stores images and other program material.
• Scheduling and gang scheduling: Provides scheduling similar to the Azure worker role, in addition to Condor, Platform, Oracle Grid Engine, and so on. Gang scheduling assigns multiple tasks scalably.
• Software as a Service: Shared between clouds and grids; it has been successful and is used in clouds just as in other distributed systems.
• Virtualization: A basic feature that supports elasticity; it also includes virtual networking.
Traditional features in cluster, grid and parallel computing environments are given below:
• Cluster management: Clusters are built using tools such as ROCKS and related packages.
• Data management: Metadata support such as RDF triple stores is provided, in addition to SQL and NoSQL databases.
• Portals: Also termed gateways; the technology has shifted from portlets to HUBzero.
• Virtual organizations: Range from specialized grid solutions to popular Web 2.0 capabilities.
• Grid programming environments: Range from linking services together, as in the Open Grid Services Architecture, to GridRPC and SAGA.
• OpenMP/threading: Includes parallel compilers such as Cilk and, roughly, shared-memory technologies, in addition to transactional memory and fine-grained data flow.
Platform features supported by clouds, and in some cases by grids, are as follows:
• Blob: The basic storage concept, typified by Azure Blob storage and Amazon S3.
• DPFS: Supports file systems such as Google File System, HDFS, and Cosmos, with compute–data affinity optimized for data processing.
• MapReduce: Supports the MapReduce programming model, with Hadoop on Linux, Dryad on Windows HPCS, and Twister on Windows and Linux.
• Fault tolerance: A major feature of clouds.
• Notification: The basic function of publish–subscribe systems.
• Monitoring: Provided by many grid solutions such as Inca; can be based on publish–subscribe.
• Programming model: Cloud programming models are built on the other platform features and are related to web and grid models.
• Queues: Queuing systems based on publish–subscribe.
• SQL: Relational database support.
• Table: Supports table data structures modeled on Apache HBase or Amazon SimpleDB/Azure Table.
• Web role: Used in Azure to provide the web-facing link to the user (the portal front end).
• Worker role: An Azure concept; the equivalent functionality is used implicitly in Amazon and in grids.
• Scalable synchronization: Supports distributed locks; used by BigTable.
4.1.2 Traditional features common to grids and clouds:
Various features that are common to grids and clouds are as follows:
1. Workflows: Workflow research in the US and Europe has spawned many projects, such as Pegasus, Taverna, and Kepler. Commercial systems include Pipeline Pilot, AVS, and LIMS environments. Trident is a recent entry; when running on Azure or Windows, it can run workflow proxy services on external environments.
2. Data Transport: The cost of data transfer is a major issue in commercial clouds. If commercial clouds become major components of the national cyberinfrastructure, high-bandwidth links can be provided between clouds and TeraGrid. Cloud data can also be structured into tables and blocks so that high-performance parallel algorithms, in addition to HTTP mechanisms, can be used for data transfer between academic systems/TeraGrid and commercial clouds.
3. Security, Privacy and Availability: The following techniques related to security, privacy, and availability are used to develop good and dependable cloud programming environments:
• Using virtual clustering to achieve dynamic resource provisioning at low overhead cost
• Using special APIs for authenticating users and sending e-mail via commercial accounts
• Accessing cloud resources with security protocols such as HTTPS and SSL
• Using stable and persistent data storage with fast queries for data access
• Including features for improving availability and disaster recovery with live migration of VMs
• Using fine-grained access control to protect data integrity and deter intruders and hackers
• Protecting shared data from malicious alteration, deletion, and copyright violations
• Using reputation systems to protect data centers; such systems authorize only trusted clients and stop pirates
4.1.3 Data Features and Databases:
The features of data and databases are illustrated as follows,
1. Program Library: Efforts are under way to build VM image libraries to manage the images used in academic and commercial clouds.
2. Blobs and Drives: The basic cloud storage containers are blobs (organized into containers) for Azure and S3 for Amazon. Users can also attach storage directly to compute instances, as with Azure Drives and Amazon's Elastic Block Store. Cloud storage is fault tolerant, whereas TeraGrid requires backup storage.
3. DPFS: This supports file systems such as the Google File System, HDFS, and Cosmos, with compute–data affinity optimized for data processing. DPFS can be layered on top of blob- and drive-based architectures, but it is better to use DPFS as an application-centric storage model with optimized compute–data affinity, and to treat blobs and drives as the repository-centric view. DPFS file systems are designed to execute data-intensive applications efficiently.
4. SQL and Relational Databases: Both the Amazon and Azure clouds offer relational databases. As an example, a private computing model was built on FutureGrid for the Observational Medical Outcomes Partnership, which analyzes patient-related data using Oracle and SAS; FutureGrid adds Hadoop to scale various analysis methods. Database capabilities can be deployed in two ways: the database software can be added to a disk and executed as a database instance on the user's own VM, or the database offered by Amazon or Azure can be used as a hosted service installed on separate VMs — the latter is how "SQL as a service" is implemented.
5. Table and NoSQL Non-relational Databases: A large number of developments have centered on a simplified database structure called "NoSQL", which emphasizes distribution and scalability. It is used in clouds in the form of BigTable in Google, SimpleDB in Amazon, and Azure Table in Azure. Non-relational databases have also been used in many instances as triple stores built on MapReduce and tables, or on the Hadoop file system, with good success.
The cloud tables, Azure Table and Amazon SimpleDB, support lightweight storage for document stores. They are schema-free and are likely to gain importance in scientific computing.
6. Queuing Services: Both Amazon and Azure provide robust and scalable queuing services that allow the components of an application to communicate with one another. The messages are short and are accessed through a REST (Representational State Transfer) interface with "deliver at least once" semantics. The messages are controlled by timeouts that bound the amount of processing time granted to a client before the message becomes visible to other consumers again.
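As an illustration of this pattern, the sketch below uses Amazon SQS through the boto3 Python library; the queue name and region are illustrative choices, and AWS credentials are assumed to be configured in the environment.

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = sqs.create_queue(QueueName="unit4-demo-queue")["QueueUrl"]

# One application component posts a short message through the REST-based API.
sqs.send_message(QueueUrl=queue_url, MessageBody="process-record-42")

# Another component receives it; the visibility timeout bounds the processing
# time before the message is redelivered ("deliver at least once" semantics).
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, VisibilityTimeout=30)
for msg in resp.get("Messages", []):
    print("handling:", msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])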
4.1.4 Programming and Runtime support:
Parallel programming support and runtime support for major functions are offered in both grids and clouds, as described below.
1. Worker and Web Roles: Azure provides roles to support nontrivial functionality while preserving better platform affinity than non-virtualized environments offer. Roles are schedulable processes that are launched automatically. Queues are a key concept here because they offer a natural way to assign tasks in a fault-tolerant, distributed fashion. Web roles provide a convenient way to build the portal of an application.
2. MapReduce: There is great interest in data-parallel languages for loosely coupled computations that execute over different data samples. Such a language and its runtime can execute many grid applications efficiently. MapReduce is more advantageous than traditional implementations of these task problems because it supports dynamic execution, strong fault tolerance, and an easy-to-use high-level interface. Hadoop and Dryad are the main MapReduce implementations, and both can be executed with or without VMs; Hadoop is provided by Amazon, and Dryad was planned to be available on Azure.
3. Cloud Programming Models: The GAE and Manjrasoft Aneka environments represent two programming models that are applied on clouds, although neither model is specific to cloud architecture. Iterative MapReduce is an interesting programming model that offers portability among cloud, HPC, and cluster environments.
4. SaaS: Services are used in a similar fashion in commercial clouds and in recent distributed systems. Users can package their programs as services, so SaaS can be enabled without any additional support. A SaaS environment is nevertheless expected to provide useful tools for developing cloud applications over huge data sets. In addition, SaaS offers various protection features for achieving scalability, security, privacy, and availability.
4.2 PARALLEL AND DISTRIBUTED PROGRAMMING PARADIGMS
4.2.1 Parallel computing and Programming Paradigms
Distributed and parallel programs are considered here as parallel programs running on a set of computing engines, that is, on a distributed computing system. Distributed computing denotes a set of computational engines interconnected by a network and used to run a job or an application, whereas parallel computing denotes the use of one or more computational engines to run a job or an application. Parallel programs can therefore run on distributed computing systems, but doing so raises the issues described below; a brief sketch of the first two issues follows the list.
1. Partitioning: Partitioning is done in two ways:
i) Computation Partitioning: The given program or job is divided into tasks by identifying the portions that can be executed concurrently. Different parts of a program can process different data or copies of the same data.
ii) Data Partitioning: The input or intermediate data is divided into partitions that can be processed by different workers. A copy of the whole program, or different parts of it, processes the individual pieces of data.
2. Mapping: Assigning the parts of the program or the pieces of data to the available resources is called mapping. It is handled by the system's resource allocator.
3. Synchronization: Synchronization is required because different workers perform different tasks. Coordination among the workers prevents race conditions and manages data dependencies.
4. Communication: Communication becomes the major concern when intermediate data has to be sent to other workers; data dependency is the main reason for communication among workers.
5. Scheduling: A scheduler picks a set of jobs or programs and runs them on the distributed computing system. It is required when the resources are not sufficient to run all jobs or programs simultaneously, and it follows a defined scheduling policy.
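As a minimal, hedged sketch of the first two issues (data partitioning and mapping), the following Python fragment splits an input list into chunks and maps them onto a pool of worker processes; the data, chunk size, and worker count are arbitrary illustrative values.

from multiprocessing import Pool

def square_chunk(chunk):
    # The task executed by each worker on its partition of the data.
    return [x * x for x in chunk]

if __name__ == "__main__":
    data = list(range(100))
    # Data partitioning: divide the input into four partitions.
    chunks = [data[i:i + 25] for i in range(0, len(data), 25)]
    # Mapping: the pool assigns partitions to worker processes; scheduling
    # and communication of results are handled by the pool itself.
    with Pool(processes=4) as pool:
        results = pool.map(square_chunk, chunks)
    print(sum(sum(part) for part in results))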
4.2.1.1 Motivation for Programming Paradigms:
Handling the complete data flow of a parallel and distributed program by hand is time-consuming and requires specialized programming knowledge. These issues reduce programmer productivity and lengthen a program's time to market. Parallel and distributed programming paradigms, or models, are therefore used to hide the data-flow details from the user.
These models offer an abstraction layer that hides the implementation details of the data flow which users would otherwise have to code. An important metric for such paradigms is therefore how simply parallel programs can be written. The motivations behind parallel and distributed programming models are as follows:
1. Improve programmer productivity
2. Decrease program time to market
3. Leverage underlying resources more efficiently
4. Increase system throughput
5. Support for higher levels of abstraction
4.2.2 MapReduce Function and Framework:
MapReduce: MapReduce is a software framework for performing parallel and distributed computing on huge data sets. It hides the data flow of a parallel program running on a distributed computing system by exposing two interfaces to the user in the form of the functions Map and Reduce.
The data flow of a program is manipulated through these two functions. The figure below illustrates the flow of data from Map to Reduce.
Figure: MapReduce software framework
In the figure above, the abstraction layer hides the data-flow steps of partitioning, mapping, synchronization, communication, and scheduling from the user. The Map and Reduce functions can be overridden by the user to achieve specific goals, and the framework is invoked with the required parameters, such as a spec object and results. The structure of a user program containing Map, Reduce, and Main subroutines is sketched below:
Map Function (...) { ... }
Reduce Function (...) { ... }
Main Function (...)
{
    Initialize Spec object
    ...
    MapReduce (Spec, & Results)
}
The input to the Map function is a (key, value) pair, where the key is the line offset within the input file and the value is the content of that line. The output of the Map function is also a (key, value) pair, called an intermediate pair. The Reduce function receives the intermediate (key, value) pairs in the form (key, [set of values]), produced by sorting and grouping the pairs that share the same key; it processes them and emits a group of (key, value) pairs as output. The formal notation of the Map function is:

    Map Function: (key1, val1) → List(key2, val2)
The result is a set of intermediate (key, value) pairs, which are gathered by the MapReduce library and sorted by key. All occurrences of the same key are grouped together, and the Reduce function is applied to each group to produce another list:

    Reduce Function: (key2, List(val2)) → List(val2)
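A minimal word-count sketch of these two signatures is shown below in plain Python; the function and variable names are illustrative, and the driver loop simply imitates, in one process, the grouping that a MapReduce library performs across workers.

from collections import defaultdict

def map_fn(key, value):
    # key: line offset, value: line content -> list of (word, 1) pairs
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # key: word, values: list of counts -> list of (word, total) pairs
    return [(key, sum(values))]

def run_mapreduce(lines):
    intermediate = []
    for offset, line in enumerate(lines):          # Map phase
        intermediate.extend(map_fn(offset, line))
    groups = defaultdict(list)
    for k, v in intermediate:                      # sort and group by key
        groups[k].append(v)
    output = []
    for k, vs in sorted(groups.items()):           # Reduce phase
        output.extend(reduce_fn(k, vs))
    return output

print(run_mapreduce(["the cat sat", "the cat ran"]))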
MapReduce Actual Data and Control Flow:
The MapReduce framework is responsible for running a program efficiently on a distributed computing system. The process proceeds through the following steps:
1. Data Partitioning: The input data is retrieved from GFS and split into M pieces by the MapReduce library; these partitions correspond to the number of map tasks.
2. Computation Partitioning: Computation partitioning is handled implicitly by obliging users to write their programs as Map and Reduce functions. The resulting user program, containing the Map and Reduce functions, is distributed to and started on the available computation engines.
3. Determining the Master and Workers: The MapReduce architecture is based on a master–worker model. One copy of the user program becomes the master and the remaining copies become workers. The master assigns map and reduce tasks to idle workers, and each worker runs its map/reduce task by executing the Map/Reduce function.
4. Retrieving Input Data (Data Distribution): Each map worker reads its portion of the input data, splits it into (key, value) pairs, and passes them to the Map function.
5. Map Function: The Map function receives the input data as (key, value) pairs, processes them, and produces intermediate (key, value) pairs.
6. Combiner Function: An optional Combiner function, invoked from the user program, can be applied to the intermediate (key, value) pairs. It merges the local intermediate data of each map worker before the data is sent over the network, which reduces the communication cost.
7. Partitioning Function: The intermediate (key, value) pairs are partitioned by a partitioning function, typically a hash of the key, so that all pairs with the same key end up in the same region. The locations of these regions are sent to the master, which in turn forwards them to the reduce workers (a minimal sketch of this hash partitioning follows the figure below).
8. Synchronization: The synchronization policy of MapReduce coordinates the map and reduce workers: the reduce workers begin interacting with the map workers only after the map tasks complete.
9. Communication: Each reduce worker uses remote procedure calls to read the data from the map workers. This all-to-all communication between map and reduce workers can give rise to network congestion, so data transfer modules have been developed to schedule the transfers.
10. Sorting and Grouping: A reduce worker reads its input data and groups the intermediate (key, value) pairs by sorting the data on the keys. All occurrences of the same key are grouped together, yielding a set of unique keys.
11. Reduce Function: The reduce worker iterates over the grouped (key, value) pairs; for each unique key, the key and its set of values are passed to the Reduce function. The function processes the received data and stores its output in output files predetermined by the user program.
Figure: Linking Map and Reduce workers through MapReduce Partitioning Functions.
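The hash partitioning of step 7 can be sketched as follows; the number of regions R and the sample pairs are illustrative, and a real implementation would use a stable hash rather than Python's per-process hash().

R = 4  # number of reduce workers / regions

def partition(key, num_regions=R):
    # All pairs with the same key map to the same region (reduce worker).
    return hash(key) % num_regions

intermediate = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
regions = {r: [] for r in range(R)}
for key, value in intermediate:
    regions[partition(key)].append((key, value))
print(regions)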
Twister and Iterative MapReduce:
The performance of any runtime needs to be assessed, and MPI and MapReduce are worth comparing. Communication and load imbalance are the two main sources of parallel overhead. The communication overhead of MapReduce can be high for the following reasons:
• MapReduce reads and writes via files, whereas MPI transfers data directly between nodes over the network.
• MPI does not transfer all the data; it transfers only the data required for an update (a "δ flow"), whereas MapReduce ships the full data ("full data flow").
This difference shows up in all classic, loosely synchronous parallel applications, which exhibit an iterative structure of compute phases alternating with communication phases. The performance issues can be addressed with the following changes:
• Transferring data between steps without writing the intermediate results to disk.
• Using long-running threads or processes to communicate the δ (changed) flow.
These changes improve performance at the cost of fault tolerance and of support for dynamic changes such as the set of available nodes. The figure below depicts the Twister programming paradigm and its runtime architecture. Twister distinguishes static data, which is never reloaded, from the dynamic δ flow that is communicated.
Figure: Twister for Iterative MapReduce Programming
The Map and Reduce pair is executed iteratively in long-running threads. The figure below compares the thread and process structures of four parallel programming paradigms: Hadoop, Dryad, Twister, and MPI.
Figure: Four Parallel Programming Paradigms for Thread and Process Structure
Yahoo Hadoop: short-running processes communicating via disk and tracking processes.
Microsoft Dryad: short-running processes communicating via pipes, disk, or shared memory between cores.
Twister (iterative MapReduce): long-running processing with asynchronous distributed rendezvous synchronization.
MPI: long-running processes with rendezvous for message-exchange synchronization.
4.2.3 Hadoop Library from Apache:
Hadoop is an open source implementation of MapReduce, coded in Java by Apache, which uses the Hadoop Distributed File System (HDFS) as its internal layer. The core of Hadoop therefore has two layers: the MapReduce engine and HDFS. The MapReduce engine is the top layer and acts as the computation engine, running on top of HDFS, which acts as its data storage manager.
Architecture of MapReduce in Hadoop:
The MapReduce engine is the upper layer of Hadoop. It manages the data flow and control flow of MapReduce jobs over distributed computing systems. The engine has a master/slave architecture consisting of a single JobTracker (the master) and several TaskTrackers (the slaves). The JobTracker manages a MapReduce job over the cluster and monitors and assigns tasks to the TaskTrackers. Each TaskTracker manages the execution of map/reduce tasks on a computation node within the cluster.
Every TaskTracker has a number of execution slots for running map or reduce tasks. A map task running in a slot processes one data block, so there is a one-to-one correspondence between map tasks and the data blocks of the DataNodes.
Figure: Hadoop HDFS and MapReduce Architecture
Running the Job in Hadoop:
Three components are required to run a job in this system: a user node, a JobTracker, and a set of TaskTrackers. The data flow starts when the user program calls the function runJob(conf), where conf is an object holding the configuration parameters for the MapReduce framework and HDFS. This function is analogous to the MapReduce(Spec, & Results) call in the earlier pseudocode.
Figure: Data flow in running MapReduce job at various task trackers using Hadoop Library
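The runJob(conf) call above belongs to Hadoop's Java API. As a hedged Python analogue, the third-party mrjob library packages the Map and Reduce functions and can submit them to a Hadoop cluster (JobTracker and TaskTrackers); the script name and HDFS path below are placeholders.

from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):        # runs as map tasks in TaskTracker slots
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):  # runs as reduce tasks
        yield word, sum(counts)

if __name__ == "__main__":
    WordCount.run()

# Run locally:       python wordcount.py input.txt
# Run on a cluster:  python wordcount.py -r hadoop hdfs:///user/hduser/input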
HDFS:
HDFS is a distributed file system that organizes and stores data across a distributed computing system. Its architecture consists of a master and slaves: a single NameNode and a number of DataNodes, respectively. Files are divided into fixed-size blocks that are stored on the workers (DataNodes), and the mapping of blocks to DataNodes is decided by the NameNode. The NameNode manages the file system's metadata and namespace; the namespace is the area that holds the metadata, and the metadata is all the information the file system needs to manage its files.
The features of HDFS are as follows:
1. Fault Tolerance: Fault tolerance is an important characteristic of HDFS. Since Hadoop is meant to be deployed on low-cost hardware, it frequently encounters hardware failures. Hadoop therefore addresses the following issues to meet its reliability requirements:
(i) Block Replication: To ensure data reliability, replicas of each file block are maintained and distributed across the cluster.
(ii) Replica Placement: Replica placement is another issue in building fault tolerance. Storing replicas on nodes in different racks of the cluster is more reliable, but it is also more expensive, so HDFS trades off some of this reliability to remain cost-effective.
(iii) Heartbeat and Block Report Messages: Heartbeats and block reports are periodic messages that each DataNode sends to the NameNode. Receipt of a heartbeat implies that the DataNode is functioning properly, while a block report contains the list of all blocks on that DataNode.
2. High Throughput Access to Large Data Sets: Throughput matters most in HDFS because it is designed for batch processing. In addition, the applications that run on HDFS work with heavy data sets held in large files. These files are divided into large blocks so that HDFS can decrease the amount of metadata stored per file; the block list shrinks as the block size grows, and the large blocks also allow fast streaming reads.
Operations of HDFS: The main operations of HDFS are as follows:
1. File Read: To read a file, the user sends an "open" request to the NameNode to obtain the locations of the file's blocks. The NameNode responds with the addresses of the DataNodes holding replicas of each block; the number of addresses depends on the number of block replicas. The client then connects to the nearest DataNode and streams the block, terminating the connection when the block has been streamed. This process repeats until the whole file has been streamed to the user.
2. File Write: To write a file, the user first sends a "create" request to the NameNode to create the new file, and then writes data to it with a write function. The first block of data is held in an internal data queue while a data streamer monitors it and writes it to a DataNode; replicas of the data block are created in parallel on the other DataNodes.
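A hedged sketch of these two operations using the third-party hdfs (WebHDFS) Python client is given below; the NameNode URL, user name, and file path are hypothetical.

from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hduser")

# File write: the NameNode creates the file, then the data is streamed to a
# DataNode, which replicates the block in a pipeline to other DataNodes.
client.write("/user/hduser/sample.txt", data=b"hello hdfs\n", overwrite=True)

# File read: the NameNode returns block locations and the client streams each
# block from a nearby DataNode until the whole file has been read.
with client.read("/user/hduser/sample.txt") as reader:
    print(reader.read())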
4.3 PROGRAMMING SUPPORT OF GOOGLE APP ENGINE
4.3.1 Programming Google App Engine: The key features of the GAE programming model for the two supported languages, Java and Python, are illustrated in the following figure.
Figure: GAE Programming Environment
GAE applications can be debugged on the local machine using a client environment that includes an Eclipse plug-in for Java. Java web application developers are also provided with GWT (Google Web Toolkit), and other languages with JVM-based interpreters or compilers, such as JavaScript or Ruby, can be used as well. Python is often used with frameworks such as Django and CherryPy, but Google also provides a basic webapp Python environment of its own. Data is stored and accessed using a variety of constructs over the NoSQL data store, and entities can be retrieved by queries that filter and sort on property values. Java offers the JDO (Java Data Objects) and JPA (Java Persistence API) interfaces, implemented by the open source DataNucleus Access Platform, while Python is provided with an SQL-like query language called GQL.
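A hedged sketch of these constructs for the classic GAE Python runtime is shown below, combining a webapp-style handler, an ndb model, and a GQL query; the entity kind and property names are illustrative.

import webapp2
from google.appengine.ext import ndb

class Greeting(ndb.Model):
    content = ndb.StringProperty()
    date = ndb.DateTimeProperty(auto_now_add=True)

class MainPage(webapp2.RequestHandler):
    def get(self):
        # GQL query: filter and sort stored entities from the NoSQL data store.
        greetings = ndb.gql("SELECT * FROM Greeting ORDER BY date DESC LIMIT 10")
        self.response.write("\n".join(g.content for g in greetings))

app = webapp2.WSGIApplication([("/", MainPage)])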
An application can execute multiple data store operations in a single transaction, which either all succeed or all fail together. A GAE application can assign entities to entity groups. Google has also added a blobstore feature for large files.
The Google SDC (Secure Data Connector) can tunnel through the Internet and link an intranet to an external GAE application. The URL Fetch operation enables applications to fetch resources and communicate with other hosts over the Internet using HTTP and HTTPS requests; web resources are accessed through the same high-speed Google infrastructure that retrieves web pages for many other Google products.
4.3.2 Google File System(GFS):
GFS was designed as a storage service for Google's search engine, basically to store and process the huge amounts of data needed by Google.
Google File System is a distributed file system developed to support Google's applications. One reason for building GFS is that typical files are huge, on the order of 100 MB or more. GFS partitions each file into fixed-size segments called chunks; each chunk is a data block of 64 MB. It also ensures data reliability by distributing replicated copies of the data across multiple chunk servers.
GFS allows multiple append operations to proceed concurrently. It uses a single master to provide access to metadata, while the data itself is stored on the chunk servers. It provides an access interface similar to, but not identical to, a POSIX file system interface; notably, applications can see the physical locations of file blocks. It also uses a customized API to support and record the append operation.
The architecture of Google File System is shown below:
Figure: GFS Architecture
The architecture includes a single master that stores the metadata for the whole cluster. The other nodes act as chunk servers, and each chunk server is responsible for storing data chunks. The master also manages the file system namespace and locking facilities, and it interacts with the chunk servers to obtain management information from them and to instruct them to perform tasks such as load balancing or failure recovery.
A single master is capable of managing the whole cluster, and its use avoids complicated distributed algorithms in the GFS architecture design. However, relying on only a single master may limit the performance of the system.
To overcome this potential performance bottleneck, Google employs a shadow master, which replicates the data held on the master, and the design ensures that data operations are performed directly between the client and the chunk servers without further involvement of the master; only control messages are transferred between the client and the master, and they can be cached for later use. These facilities allow a single master to manage a cluster of around 1,000 nodes. The figure below illustrates the data mutation operations, such as write and append, in GFS.
Figure: Data Mutation in GFS
A data mutation (a write or an append) must be performed at every replica of the affected data chunk. The design goal is to minimize the master's involvement in mutations. The steps for performing a mutation are as follows:
1. The client asks the master which chunk server holds the current lease for the chunk and where the other replicas are located.
2. If no chunk server holds the lease, the master grants a lease to a replica it chooses and replies to the client with the identity of the primary and the locations of the other replicas.
3. The client caches this information for future mutations.
4. The client then pushes the data to all the replicas. The chunk servers accept the data and keep it in an internal LRU buffer cache. To improve performance, the data flow is decoupled from the control flow, and the expensive data flow is scheduled according to the network topology.
5. After receiving acknowledgments from all the replicas, the client sends a write request to the primary replica. The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, thereby serializing them. It then applies the mutations to its own local state in serial order.
6. The write request is then forwarded by the primary to the secondary replicas, which apply the mutations in the same serial order as the primary.
7. The secondary replicas reply to the primary when they have completed the operation.
8. The primary in turn replies to the client, reporting any errors encountered during the mutation process. The client code handles such errors and retries the failed mutation.
GFS allows users to perform the append operation, which adds a data block at the end of a file. GFS offers fast recovery from various system errors. Besides this, it also ensures:
1. High availability
2. High performance
3. High fault tolerance and
4. High scalability
4.4 PROGRAMMING ON AMAZON AWS AND MICROSOFT AZURE
4.4.1 Programming in Amazon:
Amazon pioneered the provision of VMs for application hosting. Rather than running their applications on physical machines, customers rent VMs; with VMs, customers can load any software of their choice, and they can create, launch, and terminate server instances as needed, paying for what they use. Amazon provides several types of VMs. The instances, called Amazon Machine Images (AMIs), are preconfigured with Linux or Windows plus additional software.
There are three types of AMIs defined as follows:
1. Private AMI: Images created by users are private AMIs. They are private by default, but the owner can grant other users permission to launch them.
2. Public AMI: Images created by users and released to the AWS community, allowing others to launch instances from them and use them.
3. Paid AMI: Images created by users with certain added functions, which others may launch by paying the creator for them.
The figure below shows the execution environment, in which AMIs serve as the templates for the running VM instances. Public, private, and paid AMIs all feed into this environment.
Figure: Execution Environment of Amazon EC2
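A hedged sketch of launching and terminating an instance from an AMI with the boto3 Python SDK follows; the AMI ID, key pair name, and region are placeholders rather than real values.

import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

# Launch one VM instance from a (private, public, or paid) AMI template.
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical AMI ID
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
    KeyName="my-keypair",              # hypothetical key pair for SSH access
)
instance = instances[0]
instance.wait_until_running()
instance.reload()
print("running:", instance.id, instance.public_dns_name)

# The customer pays only while the instance exists; terminate it when done.
instance.terminate()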
4.4.2 Amazon Simple Storage Service (S3):
Amazon S3 (Simple Storage Service) provides a simple web service interface for storing and retrieving data on the web at any time and from anywhere. The service takes the form of object-oriented storage; objects are accessible to users through SOAP (Simple Object Access Protocol) with supporting browsers or client programs. (A reliable messaging service between any two processes is provided separately by SQS.) The figure below shows the Amazon S3 execution environment.
The fundamental unit of S3 is the object. A bucket holds a number of objects, which are accessed by keys; each object also carries attributes such as values, access control information, and metadata. Users read, write, and delete objects through this key-value programming interface, and they can access data in the Amazon cloud through the REST and SOAP interfaces. The features of S3 are as follows:
• Authentication mechanisms secure the data from unauthorized use; users grant rights on objects by making them private or public.
• Every object has a URL and an ACL (access control list).
• Storage costs roughly $0.055 to $0.15 per GB per month.
• Transfer out of the S3 region costs $0.08 to $0.15 per GB.
• Redundancy is maintained through geographic dispersion.
• A BitTorrent protocol interface reduces the cost of high-scale distribution.
• S3 provides 99.999999999% durability and 99.99% availability of objects over a given year, with cheaper Reduced Redundancy Storage (RRS) offering lower durability.
• Data transfer between Amazon EC2 and S3 is not charged.
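A hedged sketch of the key-value object interface described above, using boto3, is shown below; the bucket is a placeholder name and is assumed to exist already.

import boto3

s3 = boto3.client("s3")
bucket = "my-example-bucket"  # hypothetical, pre-existing bucket

# Write, read, and delete an object through its key.
s3.put_object(Bucket=bucket, Key="notes/unit4.txt", Body=b"hello s3")
obj = s3.get_object(Bucket=bucket, Key="notes/unit4.txt")
print(obj["Body"].read())
s3.delete_object(Bucket=bucket, Key="notes/unit4.txt")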
4.4.3 Amazon Elastic Block Store (EBS):
Amazon Elastic Block Store offers a block-level volume interface for saving and restoring the virtual images of EC2 instances. Traditional EC2 instances lose their local state once their use is finished and they are deleted; with EBS, that state is saved when the machine is shut down, so the running data of EC2 instances is preserved. Users can create volumes ranging from 1 GB to 1 TB and mount them on EC2 instances; several volumes can be mounted on a single instance. Users may create a file system on top of an Amazon EBS volume or use it in any other way a block device can be used. Data is saved through snapshots, which also improve performance, and Amazon charges according to usage.
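A hedged sketch of the EBS volume life cycle with boto3 follows; the availability zone, instance ID, and device name are placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a 10 GB volume (volumes range from 1 GB up to very large sizes).
vol = ec2.create_volume(AvailabilityZone="us-east-1a", Size=10, VolumeType="gp2")
ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])

# Mount the volume on a running EC2 instance as a block device.
ec2.attach_volume(VolumeId=vol["VolumeId"],
                  InstanceId="i-0123456789abcdef0",   # hypothetical instance
                  Device="/dev/sdf")

# Save the volume's data as a snapshot.
ec2.create_snapshot(VolumeId=vol["VolumeId"], Description="unit4 example backup")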
Amazon SimpleDB Services:
The Amazon SimpleDB service provides a simplified data model relative to the relational database model. User input data is organized into domains, which can be thought of as tables. A domain contains items as rows and attribute values as the cells of the corresponding row; a single cell can hold multiple values.
Developers simply want to store, access, and query their data easily, and SimpleDB does not guarantee strong consistency. SimpleDB and Azure Table manage modest amounts of data in distributed tables, so they are sometimes called "little tables", whereas BigTable is meant to store big data. SimpleDB costs $0.140 per SimpleDB machine hour.
4.4.4 Microsoft Azure Programming:
The main features of the Azure cloud platform are its programming components, SQL Azure, client development tools, storage, and the programming subsystem. They are depicted in the following figure.
The lowest layer of the platform is the fabric, which consists of virtualized hardware together with a sophisticated control environment that dynamically assigns resources and implements fault tolerance; it also implements domain name system and monitoring capabilities. Service models are defined by XML templates, and multiple copies (instances) of each service can be initialized.
Services are monitored while the system is running, and users can access the event logs, trace/debug data, IIS web server logs, crash dumps, performance counters, and other log files. All of this data is held in Azure storage, where it can be used for debugging. Azure is connected to the Internet through a customized compute VM called a web role, which supports basic Microsoft web hosting; such VMs are called appliances. Roles that support HTTP(S) and TCP provide the following methods:
Figure: Azure Cloud platform features
OnStart(): Called by the fabric on startup; it allows the role to perform initialization tasks. It reports a Busy status to the load balancer until the method returns.
OnStop(): Invoked when the role is to be shut down; the role exits after it returns.
Run(): Contains the main logic of the role.
SQL Azure: SQL Azure provides SQL Server as a service. The storage modalities are accessed through a REST interface, except for the recently introduced drives, which are analogous to Amazon EBS and provide a file system interface as a durable NTFS volume backed by blob storage. The REST interfaces are automatically associated with URLs. All storage is replicated three times for fault tolerance and is guaranteed to be consistent in access.
The basic storage system is built from blobs, which are analogous to Amazon S3. Blobs are organized as:
Account → Containers → Page/Block Blobs.
Containers are analogous to the directories of a traditional file system, with the account acting as the root. A block blob is used for streaming data and is arranged as a sequence of blocks of up to 4 MB each, with a total size of up to 200 GB. Page blobs are intended for random read/write access and consist of a set of pages with a maximum size of 1 TB.
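A hedged sketch of working with block blobs through the azure-storage-blob Python SDK is shown below; the connection string, container name, and blob name are placeholders.

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage connection string>")

# Account -> Container -> Blob hierarchy: create a container, then a block blob.
container = service.get_container_client("notes")
container.create_container()

blob = container.get_blob_client("unit4.txt")
blob.upload_blob(b"hello azure blobs", overwrite=True)   # uploaded as a block blob
print(blob.download_blob().readall())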
Azure Tables: The Azure table and queue storage modes are intended for smaller volumes of data. Queues provide reliable message delivery and support work spooling between the web and worker roles; they do not restrict the number of messages. Azure supports PUT, GET, and DELETE message operations as well as CREATE and DELETE for queues. Each account can hold any number of Azure tables, which consist of rows and columns in the form of entities and properties, respectively.
The number of entities in a table is not restricted; a table can hold a huge number of entities spread across distributed computers. Entities carry general properties of the form <name, type, value>. Two special properties, PartitionKey and RowKey, are assigned to every entity: the RowKey gives each entity a unique label within its partition, while the PartitionKey is designed to be shared, so that entities with the same PartitionKey are stored together.
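A hedged sketch of inserting and querying entities with the azure-data-tables Python SDK follows; the connection string, table name, and property values are placeholders.

from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<storage connection string>")
table = service.create_table_if_not_exists("students")

# Every entity carries the two special keys plus arbitrary <name, value> properties.
table.create_entity({
    "PartitionKey": "unit4",   # shared key that groups related entities
    "RowKey": "entity-001",    # unique label within the partition
    "name": "example",
    "score": 95,
})

for entity in table.query_entities(query_filter="PartitionKey eq 'unit4'"):
    print(entity["RowKey"], entity["name"])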
4.5 EMERGING CLOUD SOFTWARE ENVIRONMENTS
4.5.1 Open Source Eucalyptus:
Eucalyptus is an open software environment, now a product of Eucalyptus Systems, that grew out of a research project at the University of California, Santa Barbara. Its purpose is to bring the cloud computing paradigm to academic supercomputers and clusters. To communicate with the cloud service, it provides an AWS-compliant, EC2-based web service interface. Apart from this, it also provides services such as the AWS-compliant Walrus storage service and a user interface for managing users and images.
Eucalyptus supports the construction of both compute clouds and storage clouds. VM images are stored in the Walrus storage system, which is similar to the Amazon S3 service; they can be uploaded and retrieved at any time, which helps users create special virtual appliances. The figure below depicts the Eucalyptus architecture for VM image management.
Figure: Eucalyptus Architecture for VM Image Management
Nimbus:
Nimbus is a set of open source tools that together provide an IaaS cloud computing solution. It offers a special web interface known as Nimbus Web, a Python Django web application that is installed independently of the Nimbus service. A storage cloud implementation known as Cumulus is integrated with the other central services; it is compatible with the Amazon S3 REST API.
Figure: Nimbus cloud infrastructure
Nimbus supports the below defined resource management strategies. They are,
1. Resource pool
2. Pilot
1. Resource Pool: The default mode, in which the service has direct control of a pool of virtual machine manager nodes.
2. Pilot: In this mode the service requests a cluster's Local Resource Management System (LRMS) to deploy the VM managers on which VMs are run.
Nimbus also implements Amazon's EC2 interface, which allows users to run clients developed for the real EC2 system against Nimbus-based clouds.
4.5.2 OpenNebula: OpenNebula is an open source toolkit that enables users to convert existing infrastructure into an IaaS cloud. It is designed to be flexible and modular so that it can integrate with different storage and network infrastructure configurations and hypervisor technologies.
It consists of three components. They are:
1. Core
2. Capacity manager or scheduler
3. Access drivers
1. Core: A centralized component that manages the complete life cycle of virtual machines, including setting up networks for groups of VMs and handling storage requirements such as VM disk image deployment and software environments.
2. Capacity Manager: Governs the functionality provided by the core. The default is a requirement/rank matchmaker, and more advanced scheduling policies can be developed using a lease model and reservations.
3. Access Drivers: Provide an abstraction of the underlying infrastructure, exposing the basic functionality of the monitoring, storage, and virtualization services available in the cluster.
Apart from this, OpenNebula provides management interfaces to integrate the core's functionality with other data-center tools such as accounting or monitoring frameworks. It implements the libvirt API and a command-line interface (CLI) for virtual machine management, and it offers two features for changing environments: live migration and VM snapshots.
It also includes an EC2 driver that can send requests to Amazon EC2, Eucalyptus, and ElasticHosts. Access control is applied to registered images, which eases multiuser environments and image sharing.
Figure: Architecture of OpenNebula
Sector/Sphere: Sector/Sphere is a software platform that supports very large distributed data storage and data processing on large clusters, within one data center or across multiple data centers. It consists of the Sector distributed file system and the Sphere parallel data processing framework. Using fast network connections, the Sector DFS can be deployed over wide areas, and it enables users to manage large data sets. Fault tolerance is achieved by replicating data in the file system and managing the replicas. Sector is aware of the network topology, which improves reliability, availability, and access throughput. Communication is carried out using UDP and UDT (UDP-based Data Transfer): UDP is used for message passing and UDT for data transfer.
Figure: Architecture of Sector/Sphere
Sphere is a parallel data processing engine designed to work with data managed by Sector. Developers can process the data stored in Sector using the programming framework provided by Sphere. Application inputs and outputs are Sector files, and multiple Sphere processing segments can be combined to support more complex applications.
The system consists of four components:
1. Security server
2. Slave nodes
3. Client
4. Space
1. Security Server: Responsible for authenticating the master servers, slave nodes, and users. The master servers maintain the file system metadata, schedule jobs, and respond to users' requests.
2. Slave Nodes: Store and process the data. They can be located within a single data center or across multiple data centers with high-speed network connections.
3. Client: Provides tools and programming APIs for accessing and processing Sector data.
4. Space: A framework that supports column-based distributed data tables. The tables are stored by columns and are partitioned across multiple slave nodes. Space supports a set of SQL operations.
Open Stack:
OpenStack was introduced by Rackspace and NASA in July 2010. The project aims to share resources and technologies in order to build a scalable and secure cloud infrastructure. The main components of OpenStack are as follows:
a) Open Stack Compute
b) Open Stack Storage
a) OpenStack Compute: This is the internal fabric of the cloud; it is used to create and manage large groups of virtual private servers.
Figure: Architecture of Open Stack Nova System
OpenStack develops a cloud computing fabric controller, a component of an IaaS system, called Nova. Nova is built on the ideas of a shared-nothing architecture and message-based information exchange, with communication carried out over message queues. To prevent components from blocking while waiting for a response from one another, deferred objects are used; a deferred object contains callbacks that are triggered when a response is received.
The shared-nothing paradigm is achieved by keeping the overall system state in a distributed data store. In this architecture, the API server receives HTTP requests from boto clients, converts the commands into the API format, and forwards the requests to the cloud controller. The cloud controller interacts with the user manager through the Lightweight Directory Access Protocol (LDAP). In addition, Nova integrates networking components to manage private networks, public IP addressing, VPN connectivity, and firewall rules. It includes the following node types:
1. Network Controller: It controls address and virtual LAN allocation
2. Routing Node: It governs the NAT(Network Address Translation) conversion of
public IPs and enforces firewall rules.
3. Addressing Node: It runs Dynamic Host Configuration Protocol(DHCP) services for
private networks.
4. Tunneling Node: It provides VPN (Virtual Private Network) connectivity.
The Network state consists of the following:
• VPN Assignment: It is for a project
• Private Subnet Assignment: It is for a security group in VLAN
• Private IP Allocation: It is for running instances
• Public IP Allocation: It is for a project
• Public IP Associations: It is for private IP or running instance.
b) OpenStack Storage: The storage solution is built from a number of interacting components and concepts, including a proxy server, the ring, an object server, a container server, an account server, replication, updaters, and auditors. The proxy server looks up the location of the accounts, containers, or objects in the storage rings and routes the requests accordingly. A ring represents the mapping between the names of entities and their physical locations; it contains zones, devices, partitions, and replicas.
An object server is a simple blob storage server that stores, retrieves, and deletes objects on local devices. A container server handles listings of the objects within a container, while the account server in turn handles listings of the containers themselves.
4.5.3 Manjrasoft Aneka Cloud and Appliances:
Aneka is a cloud application platform developed by Manjrasoft. It aims to support the development and deployment of parallel and distributed applications on private and public clouds. It provides a collection of APIs for exploiting distributed resources and expressing the business logic of applications through programming abstractions, together with tools that let system administrators monitor and control the deployed infrastructure. It works as a workload distribution and management platform that accelerates applications in both Microsoft .NET and Linux environments.
Some of the key advantages of Aneka over other workload distribution solutions include:
• It supports multiple programming and application environments
• It supports multiple runtime environments
• It uses various virtual and physical machines to accelerate application execution, depending on the user's quality-of-service agreement
• It is built on the Microsoft .NET framework, with support for Linux environments
Figure: Architecture of Aneka components
Aneka offers 3 types of capabilities which are essential for building, accelerating and
managing clouds and their applications:
1. Build: Aneka includes an SDK that combines APIs and tools to enable users to rapidly develop applications. Aneka also allows users to build different runtime environments, such as an enterprise/private cloud, by harnessing compute resources in networked or enterprise data centers.
2. Accelerate: Aneka supports rapid development and deployment of applications in multiple
runtime environments running different operating systems such as Windows or Linux/UNIX
etc. Aneka supports dynamic leasing of extra capabilities from public clouds such as EC2 to
meet QoS for users.
3. Manage: Management tools and capabilities supported by Aneka include GUI and APIs to
set up, monitor, manage, and maintain remote and global Aneka compute clouds.
In Aneka, the available services can be aggregated into three major categories:
1. Fabric Services
2. Foundation Services
3. Application Services
1. Fabric Services: These services implement the fundamental operations of the cloud infrastructure. They include high availability (HA) and failover for improved reliability, node membership and directory services, resource provisioning, and performance monitoring.
2. Foundation Services: These services constitute the core functionality of the Aneka middleware. They provide a basic set of capabilities that enhance application execution in the cloud, including storage management, resource reservation, reporting, accounting, billing, service monitoring, and licensing.
3. Application Services: These services deal directly with the execution of applications and
are in charge of providing the appropriate runtime environment for each application model.
At this level, Aneka can support different application models and distributed programming
patterns.