Unit 4
Cloud Programming and Software Environments: Features of Cloud and Grid Platforms, Parallel
& Distributed programming Paradigms, Programming Support of Google App Engine,
Programming on Amazon AWS and Microsoft Azure, Emerging Cloud Software Environments.
4.1 Features of Cloud and Grid Platforms
4.1.1 Capabilities of Cloud and Platform Features:
Commercial clouds need a large number of capabilities. They provide low-cost, flexible computing and, in addition, offer extra capabilities that are collectively marketed as "Platform as a Service" (PaaS). The current platform features of Azure include tables, queues, web and worker roles, SQL database, and blobs. Amazon's features centre on "Infrastructure as a Service" (IaaS) but also include queues, notifications, monitoring, a content delivery network, a relational database, and MapReduce. The capabilities of a cloud platform are listed below.
• Physical or virtual computing platform: The cloud environment provides both physical and virtual platforms. Virtual platforms have the unique capability of isolating environments for different applications and users.
• Huge data storage service, distributed file system: Cloud data storage services offer a wide range of disk capacity for heavy data sets. The distributed file system provides massive data storage together with a service interface for storing data.
• Huge database storage service: Clouds need a service similar to a DBMS so that developers can store data in a structured (semantic) way.
• Huge data processing method and programming model: The cloud infrastructure offers many nodes even for simple applications, so programmers must handle issues such as network failure and the scaling of running code in order to use all the services provided by the platform.
• Workflow and data query language support: The programming model hides the cloud infrastructure; workflow and data query languages are provided for expressing application logic.
• Programming interface and service deployment: Cloud applications need web interfaces or special APIs such as J2EE, PHP, ASP, or Rails. They can use Ajax technologies to improve the user experience when functions are accessed through web browsers.
• Runtime support: Runtime support is transparent to users and applications. It includes distributed monitoring services, a distributed task scheduler, distributed locking, and so on.
• Support services: Important support services include data and computing services.
The infrastructure cloud features are illustrated below:
• Accounting: Has economic importance and is an active development area for commercial clouds.
• Authentication and authorization: Requires single sign-on across the systems involved.
• Data transport: Supports data transfer among job components, within and between clouds and grids.
• Registry: Provides a registry as an information resource for the system.
• Operating systems: Supports OSes such as Android, Linux, Windows, and Apple (Mac OS).
• Program library: Stores VM images and other program material.
• Scheduling and gang scheduling: Provides scheduling similar to the Azure worker role, alongside Condor, Platform, Oracle Grid Engine, etc.; gang scheduling assigns multiple tasks scalably.
• Software as a Service (SaaS): Shared between clouds and grids; highly successful and used in clouds much as in earlier distributed systems.
• Virtualization: A basic feature that enables elasticity; it also includes virtual networking.
Traditional features found in cluster, grid, and parallel computing environments are given below:
• Cluster management: Clusters are built using tools such as ROCKS and related packages.
• Data management: Metadata support such as RDF triple stores is provided, in addition to SQL and NoSQL stores.
• Portals: Also termed gateways; the technology has shifted from portlets to HUBzero.
• Virtual organizations: Range from specialized grid solutions to popular Web 2.0 capabilities.
• Grid programming environments: Vary from linking services together, as in the Open Grid Services Architecture, to GridRPC and SAGA.
• OpenMP/threading: Includes parallel compilers such as Cilk and roughly shared-memory technologies, in addition to transactional memory and fine-grained data flow.
The platform features supported by clouds, and in some cases by grids, are as follows:
• Blob: The basic storage concept, typified by Azure Blob and Amazon S3.
• DPFS: Support for file systems such as the Google File System, HDFS, and Cosmos, with compute-data affinity optimized for data processing.
• MapReduce: Supports the MapReduce programming model, with Hadoop on Linux, Dryad on Windows HPCS, and Twister on Windows and Linux.
• Fault tolerance: A major feature of clouds.
• Notification: The basic function of publish-subscribe systems.
• Monitoring: Many grid solutions, such as Inca, could be built on publish-subscribe.
• Programming model: Cloud programming models build on the other platform features and are related to web and grid models.
• Queues: Queuing systems, based on publish-subscribe.
• SQL: Relational database support.
• Table: Supports table data structures modeled on Apache HBase (Hadoop) or Amazon SimpleDB/Azure Table.
• Web role: Used in Azure as the important link to the user; it can otherwise be supported with a portal framework.
• Worker role: A key concept in Azure; the pattern is implicit in Amazon and in grids.
• Scalable synchronization: Supports distributed locks and is used by BigTable.
4.1.2 Traditional features common to grids and clouds:
Various features that are common to grids and clouds are as follows:
1. Workflows: Workflow research in the US and Europe has produced many projects, such as Pegasus, Taverna, and Kepler. Commercial systems include Pipeline Pilot, AVS, and the LIMS environment. A recent entry is Trident from Microsoft Research, which runs on Azure or Windows and can run workflow proxy services on external environments.
2. Data Transport: The cost of data transfer is a major issue in commercial clouds. If commercial clouds become a major component of national cyberinfrastructure, high-bandwidth links between the clouds and TeraGrid could be provided. Cloud data can also be structured into tables and blocks to support high-performance parallel algorithms, in addition to HTTP mechanisms, for data transfer between academic systems/TeraGrid and commercial clouds.
3. Security, Privacy, and Availability: The following techniques related to security, privacy, and availability are used for developing a dependable cloud programming environment:
• Use virtual clustering to achieve dynamic resource provisioning with low overhead cost.
• Use special APIs for authenticating users and sending e-mail through commercial accounts.
• Access cloud resources using security protocols such as HTTPS and SSL.
• Use stable and persistent data storage with fast queries for data access.
• Include features for improving availability and disaster recovery with VM file migration.
• Use fine-grained access control to protect data integrity and deter intruders and hackers.
• Protect shared data from malicious alteration, deletion, and copyright violation.
• Use the protection mechanisms of proven systems to secure data centers, keeping out pirates and admitting only trusted, authorized clients.
4.1.3 Data Features and Databases:
The features of data and databases are illustrated as follows:
1. Program Library: Attempts are being made to develop VM image libraries for managing the images used in academic and commercial clouds.
2. Blobs and Drives: The basic storage containers in clouds are blobs for Azure and S3 for Amazon. Users can also attach storage directly to compute instances, as with Azure drives and Amazon's Elastic Block Store. Cloud storage is designed to be fault tolerant, whereas TeraGrid systems require separate backup storage.
3. DPFS: This supports file systems such as the Google File System, HDFS, and Cosmos, with optimized compute-data affinity for data processing. DPFS could be layered on a blob- and drive-based architecture, but it is better used directly as an application-centric storage model with compute-data affinity, with blobs and drives serving as a repository-centric view. DPFS file systems are designed to execute data-intensive applications efficiently.
4. SQL and Relational Databases: Both the Amazon and Azure clouds offer relational databases. As an example, a private cloud computing model was built on FutureGrid for the Observational Medical Outcomes Partnership, whose patient-related data uses Oracle and SAS; FutureGrid adds Hadoop to scale the various analysis methods. Such database capabilities can be deployed in two ways: the database software can simply be added to a VM's disk image and run as a database instance, or, as Amazon and Azure do, the database can be installed on separate VMs, which implements "SQL as a Service".
5. Tables and NoSQL Non-relational Databases: There have been many developments around simplified database structures, commonly termed "NoSQL", which emphasize distribution and scalability. They are used in clouds such as BigTable in Google, SimpleDB in Amazon, and Azure Table in Azure. Non-relational databases have also been used with good success to build triple stores on top of MapReduce and tables, or on the Hadoop file system.
The cloud tables, namely Azure Table and Amazon SimpleDB, support lightweight storage for document stores; they are schema-free and are likely to gain importance in scientific computing.
6. Queuing Services: Both Amazon and Azure provide robust and scalable queuing services that let the components of an application communicate with each other. The messages are short, are accessed through a REST (Representational State Transfer) interface, and have delivered-at-least-once semantics. They are controlled by timeouts, which bound the amount of processing time granted to the client holding a message.
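To make the queue pattern concrete, the following is a minimal sketch using the AWS SDK for Java (v1) against Amazon SQS; the queue name, the message body, and the 60-second visibility timeout are illustrative placeholders rather than values from these notes.

// Producer posts a short task message; a worker polls, processes, and deletes it.
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class QueueDemo {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

        // Create (or look up) a queue and post a short task message.
        String queueUrl = sqs.createQueue("demo-task-queue").getQueueUrl();
        sqs.sendMessage(queueUrl, "process-dataset-42");

        // The visibility timeout bounds how long a worker may hold a message
        // before SQS redelivers it (delivered-at-least-once semantics).
        ReceiveMessageRequest req = new ReceiveMessageRequest(queueUrl)
                .withMaxNumberOfMessages(1)
                .withVisibilityTimeout(60);   // seconds of processing time granted
        for (Message m : sqs.receiveMessage(req).getMessages()) {
            System.out.println("Working on: " + m.getBody());
            sqs.deleteMessage(queueUrl, m.getReceiptHandle()); // acknowledge completion
        }
    }
}

If the worker crashes before deleting the message, the timeout expires and the message becomes visible again, which is exactly the at-least-once behavior described above.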
4.1.4 Programming and Runtime support:
Programming and runtime support are offered for parallel programming and for the runtime of major functions in grids and clouds.
1. Worker and Web Roles: Azure provides roles that support nontrivial functionality while preserving the better affinity support possible in non-virtualized environments. Roles are schedulable processes that can be launched automatically. Queues are a key concept here, since they offer a natural way to assign tasks in a fault-tolerant, distributed fashion. Web roles additionally offer an important portal (user-facing) capability.
2. MapReduce: There is substantial interest in data-parallel languages for the loosely coupled computations that execute over many data samples. The language and its runtime should provide efficient execution of such grid applications. MapReduce has advantages over traditional implementations of these task-parallel problems because it supports dynamic execution, strong fault tolerance, and an easy-to-use high-level interface. Hadoop and Dryad are the major MapReduce implementations and can be run with or without VMs; Hadoop is offered by Amazon, and Dryad is to become available on Azure.
3. Cloud Programming Models: The GAE and Manjrasoft Aneka environments represent two programming models applied to clouds, although neither model is specific to cloud architectures. Iterative MapReduce is an interesting programming model that offers portability between cloud, HPC, and cluster environments.
4. SaaS: Services are used in a similar fashion in commercial clouds and in the latest distributed systems. Because users package their programs themselves, SaaS can be enabled without much additional support. A SaaS environment is nevertheless expected to provide useful tools for developing cloud applications over huge datasets, along with protection features for achieving scalability, security, privacy, and availability.
4.2 PARALLEL AND DISTRIBUTED PROGRAMMING PARADIGMS
4.2.1 Parallel computing and Programming Paradigms
Here, "parallel and distributed programs" are understood as parallel programs running on a set of computing engines, that is, on a distributed computing system. "Distributed computing" emphasizes that the computational engines are interconnected by a network in order to run a job or application, while "parallel computing" emphasizes the use of more than one computational engine to run a job or application. Parallel programs can thus be run on distributed computing systems, but doing so raises the issues described below.
1. Partitioning: Partitioning is done in two ways:
i) Computation Partitioning: The given program or job is divided into tasks by identifying the portions that can be executed concurrently. Different parts of a program can process different data or share the same data.
ii) Data Partitioning: The input or intermediate data is divided into partitions that can be processed by different workers. A copy of the whole program, or different parts of it, processes these pieces of data.
2. Mapping: The process of assigning the parts of a program, or the data pieces, to the available resources is called mapping. It is handled by the system's resource allocator.
3. Synchronization: Synchronization is required because different workers perform different tasks and must coordinate with one another. It prevents race conditions and allows data dependencies to be managed.
4. Communication: Communication becomes the major concern when intermediate data has to be sent to other workers, because data dependency is the main reason for communication among workers.
5. Scheduling: A scheduler picks a set of jobs or programs and runs them on the distributed computing system. It is required when the resources are not sufficient to run all jobs or programs simultaneously, and it follows a scheduling policy.
4.2.1.1 Motivation for Programming Paradigms:
Handling the complete data flow of a parallel and distributed program by hand is time consuming and requires specialized programming knowledge. These issues affect programmer productivity and a program's time to market. Parallel and distributed programming paradigms (models) are therefore used to hide the data-flow details from users.
These models offer an abstraction layer that hides the implementation details of the data flow which users would otherwise have to code. An important metric for such paradigms is how simple they make the coding of parallel programs. The motivations behind parallel and distributed programming models are as follows:
1. Improve programmer productivity
2. Decrease a program's time to market
3. Leverage underlying resources more efficiently
4. Increase system throughput
5. Support higher levels of abstraction
4.2.2 MapReduce Function and Framework:
MapReduce: MapReduce is a software framework for performing parallel and distributed computing on huge data sets. It hides the data flow of a parallel program on a distributed computing system by exposing two functions, Map and Reduce, as the user interface.
The data flow of a program is expressed through these two functions. The figure below illustrates the flow of data from Map to Reduce.
Figure: MapReduce software framework
In the figure, the abstraction layer hides the data-flow steps, such as partitioning, mapping, synchronization, communication, and scheduling, from the users. The Map and Reduce functions can be overridden by the user to achieve specific goals, and they are passed the required parameters, such as spec and results. The structure of a user program containing the Map and Reduce subroutines is illustrated below:

Map Function (….) { ……… }
Reduce Function (….) { ……… }
Main Function (….)
{
    Initialize Spec object
    …
    MapReduce (Spec & Results)
}
The input to the Map function is a (key, value) pair, where the key is the line offset within an input file and the value is the content of that line. The output of the Map function is also a set of (key, value) pairs, called intermediate pairs. The Reduce function receives the intermediate (key, value) pairs grouped as (key, [set of values]), obtained by sorting and grouping the pairs that share the same key; it processes them and produces a group of (key, value) pairs as output. The formal notation of the Map function is:

Map Function: (key1, val1) → List(key2, val2)
The result is a set of intermediate (key, value) pairs, which are gathered by the MapReduce library and sorted by key. All occurrences of the same key are grouped together, and the Reduce function is applied to them to produce another list:

Reduce Function: (key2, List(val2)) → List(val2)
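The classic word-count program makes these two signatures concrete. The sketch below, written against Apache Hadoop's Java MapReduce API, is an illustration added here and not part of the original notes; the class names are arbitrary.

// Word count: Map turns (line offset, line) into (word, 1) pairs;
// Reduce turns (word, [counts]) into (word, total).
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // key1 = line offset, val1 = line content; emits List(key2, val2) = (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // key2 = word, List(val2) = list of counts; emits the summed count per word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}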
MapReduce Actual Data and Control Flow:
The MapReduce framework is responsible for running the program efficiently on a distributed computing system. The process proceeds as follows:
1. Data Partitioning: The MapReduce library retrieves the input data from GFS and splits it into M pieces; these partitions correspond to the number of map tasks.
2. Computation Partitioning: Computation partitioning is handled implicitly by obliging users to code their programs as Map and Reduce functions. The result is a user program containing the Map and Reduce functions, which is distributed to and started on the available computation engines.
3. Determining the Master and Workers: The MapReduce architecture is based on a master-worker model. One copy of the user program becomes the master and the remaining copies become workers. The master assigns map and reduce tasks to idle workers, and each worker runs a map or reduce task by executing the Map or Reduce function.
4. Retrieving Input Data (Data Distribution): Each map worker reads its respective split of the input data, divides it into records, and passes them to the Map function.
5. Map Function: The Map function receives the input data as (key, value) pairs, processes them, and produces intermediate (key, value) pairs.
6. Combiner Function: This optional function, specified in the user program, is applied to the intermediate (key, value) pairs of each map worker. It merges the local data before it is sent over the network, which reduces the communication cost.
7. Partitioning Function: The intermediate (key, value) pairs are divided into regions by a partitioning function; a hash of the key ensures that identical keys end up in the same region. The locations of these regions are later sent to the master, which in turn forwards them to the reduce workers.
8. Synchronization: The synchronization policy of MapReduce coordinates the map and reduce workers; interaction between them takes place only after the map tasks complete.
9. Communication: A reduce worker uses remote procedure calls to read the data from the map workers. This all-to-all communication between map and reduce workers can cause network congestion, so a data transfer module is used to schedule the transfers.
10. Sorting and Grouping: A reduce worker reads its portion of the intermediate data and sorts it by key, grouping the intermediate (key, value) pairs so that all occurrences of the same key appear together as a unique key with its list of values.
11. Reduce Function: The reduce worker iterates over the grouped (key, value) data; for each unique key, it passes the key and its set of values to the Reduce function. The function processes the received data and appends its output to the output files determined by the user program.
Figure: Linking Map and Reduce workers through MapReduce Partitioning Functions.
Twister and Iterative MapReduce:
The performance of any runtime needs to be evaluated, and here MPI and MapReduce are compared. Communication and load imbalance are the important sources of parallel overhead. Communication overhead can be high in MapReduce for the following reasons:
• MapReduce reads and writes via files, whereas MPI transfers data directly between nodes over the network.
• MPI does not transfer all the data; it transfers only the data required for an update. The MPI approach is therefore called a δ (delta) flow, and the MapReduce approach a full data flow.
This behavior can be observed in all the classic parallel, loosely synchronous applications, which iterate over compute phases followed by communication phases. The performance issues can be addressed with the following changes:
• Stream data between steps without writing the intermediate results to disk.
• Use long-running threads or processes to communicate the δ (changed) flow.
These changes improve performance at the cost of weaker fault tolerance and weaker support for dynamic changes, such as in the set of available nodes. The figure below depicts the Twister programming paradigm and its runtime architecture. Twister distinguishes the static data, which is loaded once and never reloaded, from the dynamic δ flow that is communicated.
Figure: Twister for Iterative MapReduce Programming
The map and reduce pairs are executed iteratively in long-running threads. The figure below compares the thread and process structures of four parallel programming paradigms: Hadoop, Dryad, Twister, and MPI.
Figure: Four Parallel Programming Paradigms, Compared by Thread and Process Structure
Yahoo! Hadoop: uses short-running processes that communicate via disk, plus tracking processes.
Microsoft Dryad: uses short-running processes that communicate via pipes, disk, or shared memory between cores.
Twister (Iterative MapReduce): uses long-running processes with asynchronous distributed rendezvous synchronization.
MPI: uses long-running processes with rendezvous for message exchange and synchronization.
4.2.3 Hadoop Library from Apache:
Hadoop is an open-source implementation of MapReduce, coded in Java by Apache, which uses the Hadoop Distributed File System (HDFS) as its internal storage layer. The core of Hadoop has two layers: the MapReduce engine and HDFS. The MapReduce engine is the top layer and acts as the computation engine, while HDFS underneath acts as the data storage manager.
Architecture of MapReduce in Hadoop:
The MapReduce engine is the upper layer of Hadoop. It manages the data flow and control flow of MapReduce jobs over the distributed computing system. The engine has a master/slave architecture consisting of a single JobTracker (the master) and several TaskTrackers (the slaves). The JobTracker manages each MapReduce job over the cluster and assigns and monitors the jobs and tasks given to the TaskTrackers. A TaskTracker manages the execution of the map and reduce tasks on one computation node of the cluster.
Each TaskTracker has a number of execution slots for running map or reduce tasks. A map task running in a slot processes one data block; there is a one-to-one correspondence between a map task and a data block of a DataNode.
Figure: Hadoop HDFS and MapReduce Architecture
Running a Job in Hadoop:
Three components are required to run a job in this system: a user node, the JobTracker, and a set of TaskTrackers. The data flow starts when the user program calls the function runJob(conf). The conf parameter is an object that holds the settings for the MapReduce framework and HDFS; the call plays the same role as MapReduce(Spec & Results) in the earlier pseudocode. A sketch of an equivalent job driver is shown after the figure below.
Figure: Data flow in running MapReduce job at various task trackers using Hadoop Library
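The following driver is a minimal sketch of what the runJob-style submission looks like in Hadoop's newer Java API, reusing the WordCount mapper and reducer sketched earlier; the input and output paths are illustrative placeholders.

// Driver that configures and submits a MapReduce job to the cluster,
// playing the role of runJob(conf) described above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // the "conf" object: framework and HDFS settings
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenMapper.class);
        job.setCombinerClass(WordCount.SumReducer.class); // optional local merge before the shuffle
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // placeholder path
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // placeholder path
        System.exit(job.waitForCompletion(true) ? 0 : 1); // blocks until the tasks finish
    }
}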
HDFS:
HDFS is a distributed file system that organizes and stores data across a distributed computing system. Its architecture consists of a master and slaves: a single NameNode and a number of DataNodes, respectively. Files are split into fixed-size blocks that are stored on the workers (DataNodes), and the mapping of blocks to DataNodes is decided by the NameNode. The NameNode manages the file system's metadata and namespace; the metadata, which is the information about the files, is kept by the NameNode and made accessible for file management operations.
The features of HDFS are as follows:
1. Fault Tolerance: Fault tolerance is an important characteristic of HDFS. Because Hadoop is meant to be deployed on low-cost hardware, it frequently encounters hardware failures. For this reason HDFS addresses the following issues to meet its reliability requirements:
(i) Block Replication: To ensure data reliability, replicas of each file block are maintained and distributed across the cluster.
(ii) Replica Placement: Replica placement is another issue in building fault tolerance. It is more reliable to store replicas on nodes in different racks of the cluster, but doing so is costly, so the placement policy trades some reliability to keep HDFS cost effective.
(iii) Heartbeat and Block Report Messages: Heartbeats and block reports are periodic messages sent by each DataNode to the NameNode. A heartbeat indicates that the DataNode is functioning properly, while a block report contains the list of blocks stored on that DataNode.
2. High Throughput Access to Large Data Sets: Throughput is important for HDFS because it is designed for batch processing, and the applications that run on HDFS work with heavy data sets held in large files. Files are divided into large blocks so that HDFS can reduce the amount of metadata stored per file; the block list shrinks as the block size grows, and the large blocks also enable fast streaming reads.
Operations of HDFS: The main operations of HDFS are as follows (a code sketch of both appears after this list):
1. File Read: To perform a read, the user sends an "open" request to the NameNode to obtain the locations of the file's blocks. The NameNode responds with the addresses of the DataNodes that store replicas of each block. The user then connects to the nearest DataNode and streams the block; the connection is closed once the block has been streamed. The process repeats until the whole file has been streamed to the user.
2. File Write: The user first sends a "create" request to the NameNode to create a new file, and then writes data to it using the write function. The data is first held in an internal data queue; a data streamer monitors the queue and writes each data block to a DataNode, while replicas of the block are created on other DataNodes in parallel.
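The sketch below shows both operations through Hadoop's FileSystem Java API; the NameNode/DataNode interaction (block lookup, streaming, replication) happens behind these calls. The file path is an example and is not taken from the notes.

// Write then read a small file on HDFS using the FileSystem API.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt");

        // File write: "create" goes to the NameNode; data blocks are streamed
        // to DataNodes (and replicated) by the client library.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // File read: "open" asks the NameNode for block locations; the stream
        // then pulls each block from a nearby DataNode.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
        fs.close();
    }
}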
4.3 PROGRAMMING SUPPORT OF GOOGLE APP ENGINE
4.3.1 Programming Google App Engine: The key features of the GAE programming model for languages such as Java and Python are illustrated in the following figure.
Figure: GAE Programming Environment
GAE applications can be debugged on the local machine using the client environment, which includes an Eclipse plug-in for Java. Java web application developers are provided with GWT (Google Web Toolkit), which can also be used with JavaScript or Ruby. Python can be used with frameworks such as Django and CherryPy, but Google also provides its own webapp Python environment. Data is stored in and accessed from the NoSQL data store using various constructs; entities are retrieved by queries that filter and sort on property values. Java offers the JDO (Java Data Objects) and JPA (Java Persistence API) interfaces, implemented by the open source DataNucleus Access Platform, while Python is provided with an SQL-like query language called GQL.
An application can execute multiple data store operations in a single transaction, which either succeeds or fails as a whole; a GAE application can assign entities to entity groups for this purpose. Google later added the Blobstore feature for large files.
Google SDC (Secure Data Connector) can tunnel through the Internet and link an intranet to an external GAE application. The URL Fetch operation lets applications fetch resources and interact with other hosts on the Internet using HTTP and HTTPS requests; it accesses web resources through the same high-speed Google infrastructure that retrieves web pages for many Google products.
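As an illustration of the datastore model described above, here is a minimal sketch using GAE's low-level Datastore API for Java (the JDO/JPA interfaces sit above this layer). It must run inside a GAE application; the "Book" kind and its properties are invented for the example.

// Store an entity, then query with a filter and a sort order.
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.PreparedQuery;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.FilterPredicate;

public class DatastoreExample {
    public static void storeAndQuery() {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

        // Store an entity (schema-free: properties are just name/value pairs).
        Entity book = new Entity("Book");
        book.setProperty("title", "Distributed and Cloud Computing");
        book.setProperty("year", 2011);
        datastore.put(book);

        // Retrieve entities by filtering and sorting on property values.
        Query q = new Query("Book")
                .setFilter(new FilterPredicate("year", FilterOperator.GREATER_THAN_OR_EQUAL, 2010))
                .addSort("title");
        PreparedQuery pq = datastore.prepare(q);
        for (Entity e : pq.asIterable()) {
            System.out.println(e.getProperty("title"));
        }
    }
}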
4.3.2 Google File System (GFS):
GFS was designed as a storage service for Google's search engine, basically to store and process the huge amount of data that Google needs.
The Google File System is a distributed file system developed to support Google applications. GFS was chosen because it can hold very large files, typically 100 MB and larger. It partitions a file into fixed-size segments called chunks, each holding a data block of 64 MB, and it ensures data reliability by distributing replicated copies of the data across multiple chunk servers.
GFS allows multiple append operations to proceed concurrently. It uses a single master to provide access to metadata, while the data itself is stored on the chunk servers. It offers an access interface similar to a POSIX file system, which lets applications see the physical locations of file blocks, and it uses a customized API to capture and record append operations.
The architecture of Google File System is shown below:
Figure: GFS Architecture
The architecture includes a single master for storing the metadata of the whole cluster, while the other nodes act as chunk servers that store the data. The master also manages the file system namespace and locking facilities, and it interacts with the chunk servers to collect management information from them and to instruct them to perform tasks such as load balancing and failure recovery.
A single master is capable of managing the whole cluster, and using one master avoids complicated distributed algorithms in the GFS design. However, having only a single master can become a performance bottleneck.
To overcome this bottleneck, Google uses shadow masters, which replicate the data held on the master, and it ensures that data operations between clients and chunk servers are performed directly, with only lightweight control messages passing between the client and the master. This design allows a single master to manage a cluster of around 1,000 nodes. The figure below illustrates the data mutation operations (write and append) in GFS.
Figure: Data Mutation in GFS
A data mutation (write or append) must be applied to every replica of the affected chunk, and the design goal is to minimize the master's involvement. The steps for performing a mutation are as follows:
1. The client first asks the master which chunk server holds the current lease for the chunk and where the other replicas are located.
2. If no chunk server holds a lease, the master grants one to a chosen replica and replies to the client with the identity of the primary and the locations of the other replicas.
3. The client caches this information for future mutations.
4. The client then pushes the data to all the replicas. The chunk servers accept the data and keep it in an internal LRU buffer cache. To improve performance, the data flow is decoupled from the control flow, and the expensive data flow is scheduled according to the network topology.
5. After receiving acknowledgments from the replicas, the client sends the write request to the primary replica. The primary assigns consecutive serial numbers to all mutations it receives, possibly from multiple clients, thereby serializing them, and applies the mutations to its own local state in serial order.
6. The primary then forwards the write request to the secondary replicas, which apply the mutations in the same serial order as the primary.
7. The secondary replicas reply to the primary when they have completed the operation.
8. The primary replica in turn replies to the client, reporting any errors encountered during the mutation. The client code handles such errors by retrying the failed mutation.
GFS allows users to perform append operations, which add a data record to the end of a file. GFS offers fast recovery from various system errors and, in addition, ensures:
1. High availability
2. High performance
3. High fault tolerance
4. High scalability
4.4 PROGRAMMING ON AMAZON AWS AND MICROSOFT AZURE
4.4.1 Programming in Amazon:
Amazon pioneered the provision of VMs for application hosting. Rather than using physical machines to run their applications, customers rent VMs; with VMs they can load any software they want, and they can create, launch, and terminate server instances, paying for the capacity they actually use.
Amazon provides several types of preconfigured VM images, called Amazon Machine Images (AMIs), which bundle Linux or Windows together with additional software; running instances are launched from these images. There are three types of AMIs:
1. Private AMI: Images created by a user are private by default, but the owner can grant other users permission to launch them.
2. Public AMI: Images created by users and released to the AWS community, so that anyone can launch instances from them and use them.
3. Paid AMI: Images created by users, bundled with certain functions or services, that others may launch for a fee are called paid AMIs.
The figure below shows the execution environment, in which AMIs act as templates for the instances running as VMs; public, private, and paid AMIs all feed this environment. A sketch of launching an instance from an AMI follows the figure.
Figure: Execution Environment of Amazon EC2
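The following is a minimal sketch of launching (and later terminating) an instance from an AMI, using the AWS SDK for Java (v1); the AMI ID, instance type, and key pair name are placeholders rather than real values.

// Launch one EC2 instance from an AMI template, then terminate it.
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.Instance;
import com.amazonaws.services.ec2.model.RunInstancesRequest;
import com.amazonaws.services.ec2.model.RunInstancesResult;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

public class Ec2LaunchExample {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        // Launch one instance from a (public, private, or paid) AMI.
        RunInstancesRequest request = new RunInstancesRequest()
                .withImageId("ami-0123456789abcdef0")   // placeholder AMI ID
                .withInstanceType("t2.micro")
                .withKeyName("my-keypair")               // placeholder key pair
                .withMinCount(1)
                .withMaxCount(1);
        RunInstancesResult result = ec2.runInstances(request);
        Instance instance = result.getReservation().getInstances().get(0);
        System.out.println("Launched instance " + instance.getInstanceId());

        // The customer pays while the instance runs and may terminate it at any time.
        ec2.terminateInstances(new TerminateInstancesRequest()
                .withInstanceIds(instance.getInstanceId()));
    }
}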
4.4.2 Amazon Simple Storage Service (S3):
Amazon S3 (Simple Storage Service) provides a simple web service interface for storing and retrieving data on the web, at any time and from anywhere. The service takes the form of object-based storage: objects are accessible to users through SOAP (Simple Object Access Protocol), with support from browsers or client programs. (A reliable message service between any two processes is provided separately by SQS.) The figure below shows the Amazon S3 execution environment.
Figure: Execution Environment of Amazon S3
The fundamental unit of S3 is the object. A bucket holds many objects, which are accessed through keys; each object also carries a value (the data), access control information, and metadata. Users read, write, and delete objects through this key-value programming interface (sketched in code after the feature list below) and can access the data in the Amazon cloud through REST and SOAP interfaces. The features of S3 are as follows:
• Authentication mechanisms secure the data against unauthorized access; objects can be made private or public, and rights can be granted to specific users.
• Every object has a URL and an ACL (access control list).
• Storage costs range from roughly $0.055 to $0.15 per GB per month.
• Data transfer out of the S3 region costs from $0.08 to $0.15 per GB.
• Redundancy is maintained through geographic dispersion.
• A BitTorrent protocol interface is provided to lower the cost of large-scale distribution.
• S3 provides 99.999999999% durability and 99.99% availability of objects over a given year; a cheaper Reduced Redundancy Storage (RRS) option offers lower durability.
• Data transfer between Amazon EC2 and S3 is not charged.
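Here is a minimal sketch of the key-value interface mentioned above (create bucket, put, get, delete), using the AWS SDK for Java (v1); the bucket and key names are examples only.

// Basic S3 object lifecycle through the key-value interface.
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;

public class S3Example {
    public static void main(String[] args) throws Exception {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        String bucket = "demo-unit4-bucket";   // bucket names are globally unique
        String key = "notes/unit4.txt";        // the object's key within the bucket

        s3.createBucket(bucket);
        s3.putObject(bucket, key, "object contents stored by key");   // write

        S3Object obj = s3.getObject(bucket, key);                     // read
        System.out.println("Content type: " + obj.getObjectMetadata().getContentType());
        obj.close();

        s3.deleteObject(bucket, key);                                 // delete
        s3.deleteBucket(bucket);
    }
}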
4.4.3 Amazon Elastic Block Store (EBS):
Amazon Elastic Block Store offers a block-level volume interface for saving and restoring the virtual images of EC2 instances. Traditional EC2 instances lose their state when they are terminated; with EBS, the state is saved in the EBS system when the machine is shut down, so running data as well as EC2 images can be preserved. Users can create volumes from 1 GB to 1 TB in size and mount them on EC2 instances; multiple volumes can be mounted on the same instance. A user can build a file system on top of an EBS volume or use it in any other way a block device can be used. Snapshots are provided for saving data and improving performance, and Amazon charges according to usage.
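For illustration, the sketch below creates a volume, attaches it to a running instance, and takes a snapshot, using the AWS SDK for Java (v1); the availability zone, instance ID, and device name are placeholders.

// Create, attach, and snapshot an EBS volume.
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.AttachVolumeRequest;
import com.amazonaws.services.ec2.model.CreateSnapshotRequest;
import com.amazonaws.services.ec2.model.CreateVolumeRequest;
import com.amazonaws.services.ec2.model.CreateVolumeResult;

public class EbsExample {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        // Create a volume (sizes from 1 GB to 1 TB in the classic offering described above).
        CreateVolumeResult created = ec2.createVolume(new CreateVolumeRequest()
                .withSize(100)                          // size in GB
                .withAvailabilityZone("us-east-1a"));   // placeholder zone
        String volumeId = created.getVolume().getVolumeId();

        // Mount the volume on an instance as a block device.
        ec2.attachVolume(new AttachVolumeRequest()
                .withVolumeId(volumeId)
                .withInstanceId("i-0123456789abcdef0")  // placeholder instance ID
                .withDevice("/dev/sdf"));

        // Snapshots persist the volume's data for later restore.
        ec2.createSnapshot(new CreateSnapshotRequest()
                .withVolumeId(volumeId)
                .withDescription("nightly backup"));
    }
}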
Amazon SimpleDB Service:
Amazon SimpleDB provides a simplified data model inspired by the relational model. User data is organized into domains, which play the role of tables: a domain holds items as its rows, and the attribute values of an item form the cells of that row. A single cell may hold multiple values.
Developers simply need to store, access, and query their data easily; strong consistency is not guaranteed by SimpleDB. Azure Table and SimpleDB manage modest amounts of data in distributed tables and are therefore sometimes called "little tables", in contrast to BigTable, which is meant to store truly big data. SimpleDB costs $0.140 per Amazon SimpleDB machine hour.
4.4.4 Microsoft Azure Programming:
The main features of the Azure cloud platform are its programming components, SQL Azure, client development support, storage, and the programming subsystem. They are depicted in the following figure.
The lowest layer in the figure is the fabric, which consists of virtualized hardware together with a sophisticated control environment that dynamically assigns resources and implements fault tolerance; it also implements the domain name system and monitoring capabilities. Service models are defined with XML templates, and multiple copies (instances) of each service can be initialized.
Services are monitored while the system is running, and users can access event logs, trace/debug data, IIS web server logs, crash dumps, performance counters, and other log files. All of this data is held in Azure storage, where debugging can be performed. Azure connects to the Internet through a customized VM known as a web role, which supports basic Microsoft web hosting; such VMs are called appliances. Roles, which support HTTP(S) and TCP, provide the following methods.
Figure: Azure Cloud platform features
OnStart(): Called by the fabric on startup; it allows the user to perform initialization tasks. The instance is reported as busy to the load balancer until the method completes.
OnStop(): Invoked when the role is about to be shut down; the role then exits.
Run(): Contains the main logic of the role.
SQL Azure: SQL Azure provides SQL Server as a service. The storage modalities are accessed with a REST interface, except for the recently introduced drives, which are analogous to Amazon EBS and provide a file system interface as a durable NTFS volume backed by blob storage. The REST interfaces are automatically associated with URLs. All storage is replicated three times for fault tolerance and is guaranteed to be consistent in access.
The basic storage system is built from blobs, which are analogous to Amazon S3. Blobs are organized in a three-level hierarchy:
Account → Containers → Page/Block Blobs.
Containers are analogous to directories in a traditional file system, with the account acting as the root. A block blob is used for streaming data and is arranged as a sequence of blocks of up to 4 MB each, up to 200 GB per blob. Page blobs are intended for random read/write access and consist of a set of pages, up to 1 TB per blob.
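A minimal sketch of the Account → Container → Blob hierarchy, assuming the classic Azure Storage SDK for Java (com.microsoft.azure.storage); the connection string, container name, and blob name are placeholders.

// Upload a block blob to a container under a storage account.
import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.blob.CloudBlobClient;
import com.microsoft.azure.storage.blob.CloudBlobContainer;
import com.microsoft.azure.storage.blob.CloudBlockBlob;

public class AzureBlobExample {
    public static void main(String[] args) throws Exception {
        String connection = "DefaultEndpointsProtocol=https;"
                + "AccountName=demoaccount;AccountKey=...";   // placeholder credentials

        CloudStorageAccount account = CloudStorageAccount.parse(connection);
        CloudBlobClient blobClient = account.createCloudBlobClient();

        // Container: analogous to a directory under the account root.
        CloudBlobContainer container = blobClient.getContainerReference("unit4-notes");
        container.createIfNotExists();

        // Block blob: data uploaded as a sequence of blocks.
        CloudBlockBlob blob = container.getBlockBlobReference("hello.txt");
        blob.uploadText("stored in an Azure block blob");

        System.out.println("Blob URL: " + blob.getUri());
    }
}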
Azure Tables: The Azure table and queue storage modes are intended for smaller volumes of data. Queues provide reliable message delivery and naturally support work spooling between web roles and worker roles; there is no strict limit on the number of messages in a queue. Azure supports PUT, GET, and DELETE operations on messages, as well as CREATE and DELETE on the queues themselves. Each account can hold an unlimited number of tables, whose rows and columns take the form of entities and properties, respectively.
There is no limit on the number of entities in a table; a table can hold huge numbers of entities spread over distributed computers. General properties are stored with entities as <name, type, value> triples. Two special properties, PartitionKey and RowKey, are also assigned to entities: the RowKey gives each entity a unique label, while the PartitionKey is meant to be shared by many entities and determines how the table is partitioned and distributed.
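The sketch below illustrates an entity with its PartitionKey and RowKey, again assuming the classic Azure Storage SDK for Java; the table name, keys, and extra property are invented for the example.

// Define a table entity and insert it into an Azure table.
import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.table.CloudTable;
import com.microsoft.azure.storage.table.CloudTableClient;
import com.microsoft.azure.storage.table.TableOperation;
import com.microsoft.azure.storage.table.TableServiceEntity;

public class AzureTableExample {
    // An entity is a row; extra fields become <name, type, value> properties.
    public static class StudentEntity extends TableServiceEntity {
        private String course;

        public StudentEntity() { }                    // no-arg constructor required by the SDK
        public StudentEntity(String dept, String rollNo) {
            this.setPartitionKey(dept);               // shared by many entities in one partition
            this.setRowKey(rollNo);                   // unique label within the partition
        }
        public String getCourse() { return course; }
        public void setCourse(String course) { this.course = course; }
    }

    public static void main(String[] args) throws Exception {
        CloudStorageAccount account = CloudStorageAccount.parse(
                "DefaultEndpointsProtocol=https;AccountName=demoaccount;AccountKey=..."); // placeholder
        CloudTableClient tableClient = account.createCloudTableClient();

        CloudTable table = tableClient.getTableReference("students");
        table.createIfNotExists();

        StudentEntity entity = new StudentEntity("CSE", "19B81A0501");
        entity.setCourse("Cloud Computing");
        table.execute(TableOperation.insertOrReplace(entity));   // PUT-style insert
    }
}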
4.5 EMERGING CLOUD SOFTWARE ENVIRONMENTS
4.5.1 Open Source Eucalyptus:
Eucalyptus is an open software environment, commercialized by Eucalyptus Systems, that grew out of a research project at the University of California, Santa Barbara. Its purpose is to bring the cloud computing paradigm to academic supercomputers and clusters. For communicating with the cloud service it provides an AWS-compliant, EC2-based web service interface. It also provides Walrus, a storage service compatible with AWS S3, and a user interface for managing users and images.
Eucalyptus supports the construction of both compute clouds and storage clouds. VM images are stored in the Walrus storage system, which is similar to the Amazon S3 service, and can be uploaded and retrieved at any time; this lets users create specialized virtual appliances. The figure below depicts the Eucalyptus architecture for VM image management.
Figure: Eucalyptus Architecture for VM Image Management
Nimbus:
Nimbus is a set of open source tools that together provide an IaaS cloud computing solution. It offers a special web interface known as Nimbus Web, built around a Python Django web application that is installed independently of the Nimbus service. A storage cloud implementation called Cumulus is integrated with the other central services and is compatible with the Amazon S3 REST API.
Figure: Nimbus cloud infrastructure
Nimbus supports two resource management strategies:
1. Resource pool
2. Pilot
1. Resource Pool: In the default resource pool mode, the service has direct control of a pool of VM manager nodes.
2. Pilot: In pilot mode, the service requests resources from a cluster's Local Resource Management System (LRMS) to obtain VM manager nodes on which to deploy VMs.
Nimbus also implements Amazon's EC2 interface, so that clients developed for the real EC2 system can be used against Nimbus-based clouds.
4.5.2 OpenNebula: OpenNebula is an open source toolkit that enables users to transform existing infrastructure into an IaaS cloud. It is designed to be flexible and modular so that it can integrate with different storage and network infrastructure configurations and hypervisor technologies.
It consists of three components:
1. Core
2. Capacity manager or scheduler
3. Access drivers
1. Core: A centralized component that controls the complete life cycle of virtual machines, including setting up networks for groups of VMs and managing storage requirements such as VM disk image deployment and the software environment.
2. Capacity Manager: Governs the functionality provided by the core. The default capacity scheduler is a requirement/rank matchmaker, but scheduling policies based on leases and reservations can also be developed.
3. Access Drivers: Provide an abstraction of the underlying infrastructure, exposing the basic functionality of the monitoring, storage, and virtualization services available in the cluster.
In addition, OpenNebula offers management interfaces that integrate the core's functionality with other data-center tools such as accounting or monitoring frameworks. It implements the libvirt API and a command-line interface (CLI) for virtual machine management, and it offers two features for coping with a changing environment: live migration and VM snapshots.
It also includes cloud drivers, such as an EC2 driver that can send requests to Amazon EC2 or Eucalyptus, and an ElasticHosts driver. Access control is applied to registered images, which makes multi-user environments and image sharing easier.
Figure: Architecture of OpenNebula
Sector/Sphere: Sector/Sphere is a software platform that supports very large distributed data storage and data processing on large clusters, within one data center or across multiple data centers. It consists of the Sector distributed file system and the Sphere parallel data processing framework. Using fast network connections, the distributed file system can be deployed over wide areas and lets users manage large data sets. Fault tolerance is achieved by replicating data and managing the replicas within the file system. Sector is aware of the network topology and thereby provides better reliability, availability, and access performance. Communication is handled with UDP and UDT (UDP-based Data Transfer): UDP is used for message passing and UDT for transferring data.
Figure: Architecture of Sector/Sphere
Sphere is a parallel data processing engine designed to work with data managed by Sector. Developers can process the stored data using the programming framework that Sphere provides; application inputs and outputs are Sector files. To support complex applications, multiple Sphere processing segments can be combined.
The system consists of four components:
1. Security server
2. Slave nodes
3. Client
4. Space
1. Security Server: Responsible for authenticating master servers, slave nodes, and users. The master server, in turn, maintains the file system metadata, schedules jobs, and responds to users' requests.
2. Slave Nodes: Store the data and execute the processing of that data. They can be located within a single data center or across multiple data centers connected by high-speed networks.
3. Client: Provides the tools and programming APIs for accessing and processing Sector data.
4. Space: A framework that supports column-based distributed data tables. Tables are stored by column and partitioned across multiple slave nodes, and a set of SQL operations is supported.
OpenStack:
OpenStack was introduced by Rackspace and NASA in July 2010. The project aims to share resources and technologies for building a scalable and secure cloud infrastructure. Its main components are as follows:
a) OpenStack Compute
b) OpenStack Storage
a) OpenStack Compute: This is the internal fabric of the cloud, used to create and control large sets of virtual private servers.
Figure: Architecture of Open Stack Nova System
OpenStack develops a cloud computing fabric controller as part of an IaaS system called Nova. Nova is built on the ideas of shared-nothing design and message-based information exchange; components communicate through message queues. To prevent components from blocking while waiting for a response from each other, deferred objects are used, containing callbacks that are triggered when a response is received.
The shared-nothing paradigm is achieved by keeping the whole system state in a distributed data store. In this architecture, the API server receives HTTP requests from boto, converts the commands into the internal API format, and forwards the requests to the cloud controller. The cloud controller interacts with the user manager via the Lightweight Directory Access Protocol (LDAP). In addition, Nova integrates networking components to manage private networks, public IP addressing, VPN connectivity, and firewall rules. It includes the following node types:
1. Network Controller: Controls address and virtual LAN allocation.
2. Routing Node: Governs the NAT (Network Address Translation) of public IPs and enforces firewall rules.
3. Addressing Node: Runs Dynamic Host Configuration Protocol (DHCP) services for private networks.
4. Tunneling Node: Provides VPN (Virtual Private Network) connectivity.
The Network state consists of the following:
• VPN Assignment: It is for a project
• Private Subnet Assignment: It is for a security group in VLAN
• Private IP Allocation: It is for running instances
• Public IP Allocation: It is for a project
• Public IP Associations: It is for private IP or running instance.
b) OpenStack Storage: The storage solution is built from interacting components and concepts, including a proxy server, rings, an object server, a container server, an account server, replication, updaters, and auditors. The proxy server looks up the location of accounts, containers, or objects in the storage rings and routes the requests accordingly. A ring represents the mapping between the names of entities and their physical locations, and it holds information about zones, devices, partitions, and replicas.
An object server is a simple blob storage server that stores, retrieves, and deletes objects on local devices. A container server handles listings of objects, while listings of containers are handled by the account server.
4.5.3 Manjrasoft Aneka Cloud and Appliances:
Aneka is a cloud application platform developed by Manjrasoft. It aims to support the development and deployment of parallel and distributed applications on private and public clouds. It provides a collection of APIs for exploiting distributed resources and for expressing the business logic of applications through programming abstractions, while system administrators get tools to monitor and control the deployed infrastructure. Aneka works as a workload distribution and management platform that accelerates applications in both Linux and Microsoft .NET framework environments.
Some key advantages of Aneka over other workload distribution solutions include:
• It supports multiple programming and application environments.
• It supports multiple runtime environments.
• It can use various physical and virtual machines to accelerate applications, depending on the user's quality-of-service agreement.
• It is built on the Microsoft .NET framework, with support for Linux environments.
Figure: Architecture of Aneka components
Aneka offers 3 types of capabilities which are essential for building, accelerating and
managing clouds and their applications:
1. Build: Aneka includes a new SDK which combines APIs and tools to enable users to
rapidly develop applications. Aneka also allows users to build different runtime
environments such as enterprise/private cloud by harnessing compute resources in network or
enterprise data centers.
2. Accelerate: Aneka supports rapid development and deployment of applications across multiple runtime environments running different operating systems such as Windows or Linux/UNIX. It also supports dynamic leasing of extra capacity from public clouds such as EC2 to meet users' QoS requirements.
3. Manage: Management tools and capabilities supported by Aneka include GUI and APIs to
set up, monitor, manage, and maintain remote and global Aneka compute clouds.
In Aneka, the available services are grouped into three major categories:
1. Fabric Services
2. Foundation Services
3. Application Services
1. Fabric Services: These services implement the fundamental operations of the cloud infrastructure. They include high availability and failover for improved reliability, node membership and directory services, resource provisioning, and performance monitoring.
2. Foundation Services: These services constitute the core functionalities of the Aneka middleware. They provide a basic set of capabilities that enhance application execution in the cloud, including storage management, resource reservation, reporting, accounting, billing, service monitoring, and licensing.
3. Application Services: These services deal directly with the execution of applications and
are in charge of providing the appropriate runtime environment for each application model.
At this level, Aneka can support different application models and distributed programming
patterns.

More Related Content

PPTX
Module 2-Cloud Computing Architecture.pptx
PPTX
CSE2013-cloud computing-L3-L4.pptx
PPT
Cloud Computing
PPT
Cloud computing
PPTX
Cloud computing_Final
PDF
Cloud versus cloud
PDF
Cloud computing Basics
PPTX
Cloud computing: highlights
Module 2-Cloud Computing Architecture.pptx
CSE2013-cloud computing-L3-L4.pptx
Cloud Computing
Cloud computing
Cloud computing_Final
Cloud versus cloud
Cloud computing Basics
Cloud computing: highlights

Similar to cc_unit4_smce.pdf Cloud Computing Unit-4 (20)

PPT
Distributed_and_cloud_computing-unit-2.ppt
PPTX
cloud computing module3 CLOUD COMPUTING ARCHITECTURE
PPT
Cloud computing vs grid computing
PPT
Introduction To Cloud Computing By Beant Singh Duggal
PPT
Cloud computing by amazon
PPTX
module of Btech CSE student , subject is cloud computing, this is first modul...
PPTX
Clould Computing and its application in Libraries
PDF
DEVELOPING APPLICATION FOR CLOUD – A PROGRAMMER’S PERSPECTIVE
PPT
Cloud Computing By Pankaj Sharma
PDF
Not Your Father’s Web App: The Cloud-Native Architecture of images.nasa.gov
PPT
Azure Serrvices Platform Pro Dev Partners
PDF
Survey_Report_on_AWS_by_Praval_&_Arjun
PPTX
Grid and Cloud Computing Lecture-2a.pptx
PPTX
CC-Module 1-Update module 2 in cloud computing.pptx
PPTX
PPTX
Ppt on cloud computing
KEY
Introduction to Cloud Computing - CCGRID 2009
PPT
Computing Outside The Box
PDF
Sem rep edited
Distributed_and_cloud_computing-unit-2.ppt
cloud computing module3 CLOUD COMPUTING ARCHITECTURE
Cloud computing vs grid computing
Introduction To Cloud Computing By Beant Singh Duggal
Cloud computing by amazon
module of Btech CSE student , subject is cloud computing, this is first modul...
Clould Computing and its application in Libraries
DEVELOPING APPLICATION FOR CLOUD – A PROGRAMMER’S PERSPECTIVE
Cloud Computing By Pankaj Sharma
Traditional features in cluster, grid, and parallel computing environments are given below:

Cluster management: Clusters are built and managed using tools such as those provided by Rocks and related packages.
Data management: Metadata support such as RDF triple stores is provided, in addition to SQL and NoSQL stores.
Portals: Also termed gateways; the technology has evolved from portlets to HUBzero.
Virtual organizations: These range from specialized grid solutions to popular Web 2.0 capabilities.
Grid programming environments: These range from linking services together, as in the Open Grid Services Architecture, to Grid RPC and SAGA.
OpenMP/Threading: Incorporates parallel approaches such as Cilk and related shared-memory technologies, in addition to transactional memory and fine-grained data flow.

The platform features supported by clouds and grids are as follows:

Blob: The basic storage concept, typified by Azure Blob and Amazon S3.
DPFS: Support for file systems such as the Google File System, HDFS, and Cosmos, with compute-data affinity optimized for data processing.
MapReduce: Supports the MapReduce programming model, with Hadoop on Linux, Dryad on Windows HPCS, and Twister on Windows and Linux.
Fault tolerance: A major feature of clouds.
Notification: The basic function of publish-subscribe systems.
Monitoring: Many grid solutions, such as Inca, can be based on publish-subscribe.
Programming model: Cloud programming models are built on the other platform features and are related to web and grid models.
Queues: The queuing systems are based on publish-subscribe.
SQL: Relational database support.
Table: Supports table data structures modeled on Apache HBase or Amazon SimpleDB/Azure Table.
Web role: Used in Azure to provide the web-facing link to users (the portal).
Worker role: The Azure worker-role pattern; used in Amazon and grids as well.
Scalable synchronization: Supports distributed locks and is used, for example, by BigTable.
4.1.2 Traditional Features Common to Grids and Clouds:
Various features that are common to grids and clouds are as follows:

1. Workflows: Workflow has spawned many projects in the US and Europe, such as Pegasus, Taverna, and Kepler. Commercial systems include Pipeline Pilot, AVS, and the LIMS environment. Trident is a more recent entry; when running on Azure or Windows, it can run workflow proxy services on external environments.

2. Data Transport: Data transfer is a major issue in commercial clouds. If commercial clouds become major components of the national cyberinfrastructure, high-bandwidth links can be expected between clouds and TeraGrid. Cloud data can also be structured into tables and blocks, which enables high-performance parallel algorithms in addition to HTTP mechanisms for transferring data between academic systems/TeraGrid and commercial clouds.

3. Security, Privacy, and Availability: The following techniques related to security, privacy, and availability are used in developing a good and dependable cloud programming environment:
• Using virtual clustering to achieve dynamic resource provisioning at minimum overhead cost
• Using special APIs for authenticating users and sending e-mail through commercial accounts
• Accessing cloud resources with security protocols such as HTTPS and SSL
• Using stable and persistent data storage with fast queries for data access
• Including features for improving availability and disaster recovery with VM file migration
• Using fine-grained access control to protect data integrity and deter intruders and hackers
• Protecting shared data from malicious alteration, deletion, and copyright violations
• Using reputation systems to protect data centers; such systems authorize only trusted clients and stop pirates

4.1.3 Data Features and Databases:
The features of data and databases are illustrated as follows:

1. Program Library: An attempt is being made to develop a VM image library for managing the images to be used in academic and commercial clouds.

2. Blobs and Drives: Containers are responsible for arranging basic storage in the clouds, such as blobs for Azure and S3 buckets for Amazon. Users are also allowed to attach storage directly to compute instances, as with Azure drives and Amazon's Elastic Block Store (EBS). Cloud storage is fault tolerant, whereas TeraGrid requires backup storage.

3. DPFS: This supports file systems such as the Google File System, HDFS, and Cosmos, with compute-data affinity optimized for data processing. DPFS can be linked with blob- and drive-based architectures, but it is better used as an application-centric storage model with optimized compute-data affinity, while blobs and drives serve as the repository-centric view. The DPFS file systems are developed to execute data-intensive applications efficiently.
4. SQL and Relational Databases: Relational databases are offered by both the Amazon and Azure clouds. As an earlier example, a private computing model was built on FutureGrid for the Observational Medical Outcomes Partnership for patient-related data, using Oracle and SAS; FutureGrid adds Hadoop to scale the various analysis methods. Databases are expected to be used wherever applications need such capabilities. The database software can be added to a disk image and executed through a database instance; installing the database on a separate VM, as can be done on Amazon and Azure, is how "SQL as a service" is implemented.

5. Tables and NoSQL Non-relational Databases: A large number of developments have taken place around simplified database structures referred to as "NoSQL," which emphasize distribution and scalability. They are used in the clouds as BigTable in Google, SimpleDB in Amazon, and Azure Table in Azure. Non-relational stores such as triple stores have been built on MapReduce and tables, or on the Hadoop file system, with good success. The cloud tables, namely Azure Table and Amazon SimpleDB, support lightweight storage for document stores; they are schema-free and are expected to gain importance in scientific computing.

6. Queuing Services: Both Amazon and Azure provide robust and scalable queuing services so that the components of an application can interact with each other. The messages are short and are handled through a REST (Representational State Transfer) interface with "deliver at least once" semantics. Timeouts are used to control the amount of processing time granted to a client (a minimal queue-usage sketch appears at the end of this section).

4.1.4 Programming and Runtime Support:
Programming and runtime support is offered for parallel programming and for the major runtime functions in grids and clouds.

1. Worker and Web Roles: Azure provides roles for facilitating nontrivial functionality and for preserving better affinity support than in a non-virtualized environment. Roles are schedulable processes that can be launched automatically. Queues are an important concept here, since they offer a natural method for assigning tasks in a fault-tolerant and distributed fashion. The web roles offer a significant method for implementing the portal.

2. MapReduce: Parallel languages have generated great interest for loosely coupled computations that execute over different data samples. The language and its runtime provide efficient execution of such grid applications. MapReduce is more advantageous than traditional implementations for this class of problems because it supports dynamic execution, strong fault tolerance, and an easy-to-use high-level interface. Hadoop and Dryad are the major MapReduce implementations and can be executed with or without VMs; Hadoop is provided by Amazon, and Dryad is to be available on Azure.

3. Cloud Programming Models: GAE and the Manjrasoft Aneka environment are two further programming models applied to clouds, but these models are not specific to this architecture. Iterative MapReduce is an interesting programming model that offers portability between cloud, HPC, and cluster environments.

4. SaaS: Services are used in a similar fashion in commercial clouds and in the latest distributed systems. Users can package their programs as services as required; hence, SaaS can be enabled without any additional support. The SaaS environment is expected to provide many useful tools for developing cloud applications over huge data sets. In addition, various protection features are offered by SaaS for achieving scalability, security, privacy, and availability.
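To make the queuing semantics above concrete, here is a minimal, hedged sketch of posting and consuming a task message with Amazon SQS through the boto3 Python SDK. The queue name, region, and message body are illustrative placeholders, and the sketch assumes AWS credentials are already configured; the visibility timeout plays the role of the per-client processing time mentioned in the Queuing Services item.

    import boto3  # AWS SDK for Python; assumes credentials are configured locally

    sqs = boto3.client("sqs", region_name="us-east-1")

    # Create (or look up) a queue; the name is purely illustrative.
    queue_url = sqs.create_queue(QueueName="unit4-demo-tasks")["QueueUrl"]

    # A producer (e.g., a web role) posts a short task message.
    sqs.send_message(QueueUrl=queue_url, MessageBody="process-block-42")

    # A consumer (e.g., a worker role) receives the message. The visibility
    # timeout is the processing time granted to this client: if the message
    # is not deleted before it expires, the queue redelivers it.
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1,
                               VisibilityTimeout=30, WaitTimeSeconds=5)
    for msg in resp.get("Messages", []):
        print("working on:", msg["Body"])
        # Deleting the message acknowledges successful processing.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])

If the consumer crashes before calling delete_message, the message becomes visible again after the timeout, which is exactly the "deliver at least once" behavior noted above.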
4.2 PARALLEL AND DISTRIBUTED PROGRAMMING PARADIGMS

4.2.1 Parallel Computing and Programming Paradigms:
Distributed and parallel programs are here taken to be parallel programs running on a set of computing engines, that is, on a distributed computing system. Distributed computing denotes a set of computational engines interconnected by a network and used to run a job or application, whereas parallel computing denotes the use of one or more computational engines to run a job or application. Parallel programs can therefore be run on distributed computing systems, but doing so raises the issues described below (a small sketch at the end of this subsection illustrates the first three steps on a single machine).

1. Partitioning: Partitioning is done in two ways:
i) Computation partitioning: The given program or job is divided into tasks by identifying the portions that can be executed concurrently. Different parts of a program can process different data or share the same data.
ii) Data partitioning: The input or intermediate data is divided into partitions that can be processed by different workers. A copy of the program, or different parts of it, processes the individual pieces of data.

2. Mapping: The process of assigning parts of the program or pieces of data to the available resources is called mapping. It is handled by the system's resource allocator.

3. Synchronization: Synchronization is required because different workers perform different tasks and coordination among them is important. It prevents race conditions and manages data dependencies.

4. Communication: Communication is the major concern when intermediate data must be sent to other workers, because data dependency is the main reason for communication among workers.

5. Scheduling: A scheduler picks a set of jobs or programs and runs them on the distributed computing system. It is required when the resources are not sufficient to run all jobs or programs simultaneously, and it works according to a scheduling policy.

4.2.1.1 Motivation for Programming Paradigms:
Handling the complete data flow of a parallel and distributed program is time consuming and requires specialized programming knowledge. These issues affect the productivity of programmers and a program's time to market. Parallel and distributed programming paradigms, or models, are therefore used to hide the data-flow handling from users: they offer an abstraction layer that hides the implementation details of the data flow that users would otherwise have to code. An important metric here is how simple the coding for parallel programming becomes under such a paradigm. The motivation behind parallel and distributed programming models is as follows:

1. Improve the productivity of programmers
2. Decrease a program's time to market
3. Leverage underlying resources efficiently
4. Increase system throughput
5. Support higher levels of abstraction
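As a minimal, single-machine illustration of the partitioning, mapping, and synchronization steps listed above (a sketch added here for clarity, not tied to any particular cloud framework), the following Python fragment partitions input data, maps the chunks onto a pool of worker processes, and combines the partial results; the Pool object handles the scheduling and synchronization.

    from multiprocessing import Pool

    def square(chunk):
        # "Computation partitioning": the same code runs on every worker,
        # each processing its own piece of the data.
        return [x * x for x in chunk]

    def partition(data, n):
        # "Data partitioning": split the input into n roughly equal pieces.
        size = (len(data) + n - 1) // n
        return [data[i:i + size] for i in range(0, len(data), size)]

    if __name__ == "__main__":
        data = list(range(100))
        chunks = partition(data, 4)
        with Pool(processes=4) as pool:          # "mapping": chunks -> workers
            partial = pool.map(square, chunks)   # scheduling handled by the Pool
        result = [y for part in partial for y in part]  # combine partial results
        print(result[:10])

In a cloud setting, a framework such as MapReduce performs these same steps across many machines and additionally handles communication and failures.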
4.2.2 MapReduce Function and Framework:
MapReduce is a software framework that supports parallel and distributed computing on huge data sets. It hides the data flow of a parallel program on a distributed computing system by providing two interfaces to the user in the form of the functions Map and Reduce; the data flow of a program is manipulated through these functions. The figure below illustrates the flow of data from Map to Reduce.

Figure: MapReduce software framework

In the figure, the abstraction layer hides the data-flow steps, such as partitioning, mapping, synchronization, communication, and scheduling, from the user. The Map and Reduce functions can be overridden by the user in order to achieve specific goals, and they are passed the required parameters such as spec and results. The structure of a user program containing the Map and Reduce subroutines is illustrated below:

    Map function (…) { … }
    Reduce function (…) { … }
    Main function (…) {
        Initialize spec object
        …
        MapReduce(spec, &results)
    }

The input to the Map function is a (key, value) pair, where the key indicates the line offset within an input file and the value is the content of that line. The output returned by the Map function is also a (key, value) pair, called an intermediate pair. The Reduce function receives the intermediate (key, value) pairs in the form (key, [set of values]), obtained by sorting and grouping pairs with the same key; it processes them and generates a group of (key, value) pairs as output. The formal notation of the Map function is:

    Map(key1, val1) → list(key2, val2)

The result is a set of intermediate (key, value) pairs. These are gathered by the MapReduce library and sorted by key; the different occurrences of the same key are grouped, and the Reduce function is applied to them to produce another list:

    Reduce(key2, list(val2)) → list(val2)
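As a concrete, self-contained illustration of these two signatures, the classic word-count example is sketched below in Python. This is only a sketch, not the actual MapReduce library: the grouping that a real framework performs across many machines is emulated here with a dictionary on a single machine, and the function names are illustrative.

    from collections import defaultdict

    def map_function(key, value):
        # key: line offset in the input file, value: the line's content.
        # Emits an intermediate (word, 1) pair for every word on the line.
        return [(word, 1) for word in value.split()]

    def reduce_function(key, values):
        # key: a word, values: the list of counts collected for that word.
        return [(key, sum(values))]

    def map_reduce(lines):
        # Group intermediate pairs by key (the "sort and group" step),
        # then apply the user's Reduce function to each group.
        groups = defaultdict(list)
        for offset, line in enumerate(lines):
            for k, v in map_function(offset, line):
                groups[k].append(v)
        results = []
        for k, vals in groups.items():
            results.extend(reduce_function(k, vals))
        return results

    print(map_reduce(["to be or not to be", "to see or not to see"]))
    # e.g. [('to', 4), ('be', 2), ('or', 2), ('not', 2), ('see', 2)]

The map_function and reduce_function bodies are the only parts a MapReduce user would normally write; the surrounding map_reduce driver stands in for the framework.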
MapReduce Actual Data and Control Flow:
The MapReduce framework is responsible for running a program on the distributed computing system efficiently. The process is detailed as follows:

1. Data partitioning: The input data is retrieved from GFS and split into M pieces by the MapReduce library. These partitions correspond to the number of map tasks.

2. Computation partitioning: The user performs the computation partitioning by coding the Map and Reduce functions. The result is a user program containing Map and Reduce functions, which is distributed to and started on the available computation engines.

3. Determining the master and workers: The MapReduce architecture is based on a master-worker model. One copy of the user program becomes the master and the remaining copies become workers. The master assigns map and reduce tasks to idle workers, and each worker runs a map or reduce task by executing the Map or Reduce function.

4. Retrieving input data (data distribution): Each map worker reads its respective input data split and passes it to the Map function.

5. Map function: The Map function receives the input data as (key, value) pairs, processes them, and produces intermediate (key, value) pairs.

6. Combiner function: This function is applied to the intermediate (key, value) pairs and is invoked within the user program. It merges the local data of a map worker before it is sent over the network, which decreases the communication cost.

7. Partitioning function: The intermediate (key, value) pairs are partitioned using a partitioning function; a hash function ensures that all identical keys are stored in the same region. The locations are later sent to the master, which in turn forwards them to the reduce workers.

8. Synchronization: The synchronization policy of MapReduce coordinates the map workers and reduce workers and allows them to interact after task completion.

9. Communication: A reduce worker uses remote procedure calls to read the data from the map workers. An all-to-all communication occurs between the map and reduce workers, which can give rise to network congestion; for this reason a data transfer module is used to schedule the data transfers.

10. Sorting and grouping: A reduce worker, having read its input data, groups the intermediate (key, value) pairs by sorting the data according to the keys. All occurrences of identical keys are grouped, yielding the unique keys.
11. Reduce function: The reduce worker iterates over the grouped (key, value) pairs for all unique keys, and each key with its set of values is passed to the Reduce function. The function processes the received data and stores the output in files predetermined by the user program.

Figure: Linking Map and Reduce workers through MapReduce partitioning functions

Twister and Iterative MapReduce:
The performance of any runtime needs to be examined, and MPI and MapReduce in particular need to be compared. Communication and load imbalance are the important sources of parallel overhead. The communication overhead can be high in MapReduce for the following reasons:

• MapReduce reads and writes via files, whereas MPI transfers data directly between nodes over the network.
• MPI does not transfer the complete data; it transfers only the data required for updating. The MPI flow can therefore be called a delta (δ) flow, while the MapReduce flow is a full data flow.

This phenomenon can be observed in all the classic "loosely synchronous" parallel applications, which exhibit an iterative structure of alternating compute phases and communication phases. The performance issues can be addressed with the changes below:

• Stream data between the steps without writing the intermediate results to disk.
• Use long-running threads or processes to communicate the δ flow.

These changes improve performance at the cost of fault tolerance, and they also support dynamic changes such as in the number of available nodes. The figure below depicts the Twister programming paradigm along with its runtime architecture. Twister distinguishes the static data, which is never reloaded, from the dynamic δ flow that is communicated.
Figure: Twister for Iterative MapReduce Programming

The pairs of map and reduce tasks are executed iteratively in long-running threads. The figure below compares the thread and process structures of four parallel programming paradigms: Hadoop, Dryad, Twister, and MPI.

Figure: Four Parallel Programming Paradigms (Thread and Process Structure)

Yahoo! Hadoop: uses short-running processes that communicate via disk, together with tracking processes.
Microsoft Dryad: uses short-running processes that communicate via pipes, disk, or shared memory between cores.
Twister (iterative MapReduce): uses long-running processes with asynchronous distributed rendezvous synchronization.
MPI: uses long-running processes with rendezvous for message-exchange synchronization.

4.2.3 Hadoop Library from Apache:
Hadoop is an open source implementation of MapReduce. It is coded in Java by Apache and uses the Hadoop Distributed File System (HDFS) as its internal layer. The core of Hadoop has two layers, the MapReduce engine and HDFS: the MapReduce engine is the top layer and acts as the computation engine, running on top of HDFS, which serves as the data storage manager.
Architecture of MapReduce in Hadoop:
The MapReduce engine is the upper layer of Hadoop and is responsible for managing the data flow and control flow of MapReduce jobs over the distributed computing system. The engine has a master/slave architecture with a single JobTracker (the master) and several TaskTrackers (the slaves). The JobTracker manages a MapReduce job over the cluster and monitors and assigns the jobs and tasks to the TaskTrackers. A TaskTracker manages the execution of map and reduce tasks on a single computation node of the cluster. Every TaskTracker has a number of execution slots for running map or reduce tasks. A map task running in a slot processes one data block; there is a one-to-one correspondence between a map task and the data block of a DataNode.

Figure: Hadoop HDFS and MapReduce Architecture

Running a Job in Hadoop:
The components required to run a job in this system are a user node, a JobTracker, and a set of TaskTrackers. The data flow starts when the user program calls the function runJob(conf); the conf parameter is an object that configures the MapReduce framework and HDFS. This function is similar to the MapReduce(spec, &results) call described earlier.

Figure: Data flow in running a MapReduce job at various TaskTrackers using the Hadoop library
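For readers working in Python rather than against the Java runJob(conf) API, a job with the same structure can be submitted through Hadoop Streaming. The sketch below is illustrative only: the location of the streaming JAR and the HDFS paths vary between installations, the hadoop command is assumed to be on the PATH, and mapper.py/reducer.py are assumed to be scripts that read lines from standard input and emit tab-separated key/value lines, which is the streaming convention.

    import subprocess

    # Hypothetical path; the streaming JAR location varies by Hadoop distribution.
    streaming_jar = "/opt/hadoop/share/hadoop/tools/lib/hadoop-streaming.jar"

    subprocess.run([
        "hadoop", "jar", streaming_jar,
        "-input",  "/user/demo/books",          # HDFS input directory (split into map tasks)
        "-output", "/user/demo/wordcount-out",  # HDFS output directory (one file per reduce task)
        "-mapper",  "python3 mapper.py",        # user-supplied Map code
        "-reducer", "python3 reducer.py",       # user-supplied Reduce code
        "-file", "mapper.py", "-file", "reducer.py",
    ], check=True)

The JobTracker and TaskTrackers described above then take over: the job is split into map tasks (one per input block) and reduce tasks exactly as in the data-flow figure.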
HDFS:
HDFS is a distributed file system that organizes and stores data on the distributed computing system. Its architecture consists of a master and slaves: a single NameNode and a number of DataNodes, respectively. Files are divided into fixed-size blocks that are stored on the workers (DataNodes), and the mapping of blocks is maintained by the NameNode. The NameNode manages the file system's metadata and namespace; it keeps this metadata (the data about the files) in main memory and makes it accessible for file management.

Features of HDFS are as follows:

1. Fault tolerance: Fault tolerance is an important characteristic of HDFS. Because Hadoop is meant to be deployed on low-cost hardware, it frequently encounters hardware failures, and HDFS addresses the following issues to meet its reliability requirements:
(i) Block replication: To ensure data reliability, replicas of each file block are maintained and distributed across the cluster.
(ii) Replica placement: Replica placement is one issue in building fault tolerance. It is most reliable to store replicas on nodes in other racks of the cluster, but doing so for every replica is costly; reliability is therefore traded off to keep HDFS cost-effective.
(iii) Heartbeat and Blockreport messages: Heartbeats and Blockreports are the periodic messages that each DataNode sends to the NameNode. A heartbeat implies that the DataNode is functioning properly, while a Blockreport contains the list of blocks on that DataNode.

2. High throughput access to large data sets: Throughput matters for HDFS because it is designed for batch processing. In addition, applications that run on HDFS have large data sets, and their files are divided into large blocks so that HDFS can decrease the amount of metadata stored per file. The block list shrinks as the block size increases, and HDFS also provides fast streaming reads.

Operations of HDFS:
The main operations of HDFS are as follows:

1. File read: To perform a read, the user sends an "open" request to the NameNode to obtain the locations of the file's blocks. The response contains the addresses of the DataNodes storing the replicas. The user then connects to the nearest DataNode and reads; the connection is terminated after the block is streamed. The whole process repeats until the file has been streamed completely to the user.

2. File write: The user first sends a "create" request to the NameNode to create a new file, then writes data to it using a write function. The first data block is held in an internal data queue, and a data streamer monitors the queue and writes the block to a DataNode; replicas of the data block are created in parallel accordingly.
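The read and write paths above (ask the NameNode, then transfer data with a DataNode) can be observed directly through WebHDFS, the HTTP REST interface of HDFS. The sketch below is illustrative only: the NameNode address and port, the user name, and the file path are placeholders, and it assumes WebHDFS is enabled on the cluster.

    import requests

    # Assumed NameNode address/port; WebHDFS must be enabled on the cluster.
    NAMENODE = "http://namenode.example.com:9870"
    path = "/user/demo/hello.txt"

    # File write: the NameNode answers the "create" request with a redirect
    # to a DataNode, and the client then streams the data to that DataNode.
    r = requests.put(f"{NAMENODE}/webhdfs/v1{path}?op=CREATE&overwrite=true&user.name=demo",
                     allow_redirects=False)
    datanode_url = r.headers["Location"]            # address of the chosen DataNode
    requests.put(datanode_url, data=b"hello HDFS")  # data goes directly to the DataNode

    # File read: the "open" request is likewise redirected to a DataNode holding
    # a replica, which streams the block back (requests follows the redirect).
    r = requests.get(f"{NAMENODE}/webhdfs/v1{path}?op=OPEN&user.name=demo")
    print(r.content)

The explicit redirect on CREATE is the NameNode handing the client over to a DataNode, mirroring step 2 of the file write operation described above.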
4.3 PROGRAMMING SUPPORT OF GOOGLE APP ENGINE

4.3.1 Programming Google App Engine:
The key features of the GAE programming model for languages such as Java and Python are illustrated in the following figure.

Figure: GAE Programming Environment

The client environment, which includes an Eclipse plug-in for Java, allows GAE applications to be debugged on the local machine. Java web application developers are provided with GWT (the Google Web Toolkit), which can also be used with JavaScript or Ruby. Python is used with frameworks such as Django and CherryPy, but Google also provides a webapp Python environment.

Data is stored and accessed using various constructs of the NoSQL data store. Entities can be retrieved by queries that filter and sort on property values. For Java, the JDO (Java Data Objects) and JPA (Java Persistence API) interfaces are offered, implemented by the open source DataNucleus Access Platform; Python is provided with an SQL-like query language called GQL. An application can execute multiple datastore operations in a single transaction, which either all succeed or all fail together, and a GAE application can assign entities to entity groups. Google also added a blobstore feature for large files.

The Google SDC (Secure Data Connector) can tunnel through the Internet and connect an intranet to an external GAE application. The URL Fetch operation makes applications capable of fetching resources and interacting with other hosts on the Internet through HTTP and HTTPS requests; it accesses web resources through the same high-speed Google infrastructure that retrieves web pages for many other Google products.
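As a hedged sketch of the datastore, entity-group, and GQL features just described, the following Python fragment assumes the first-generation App Engine Python runtime with the ndb client library; the Greeting model, the Guestbook parent key, and the property names are illustrative choices, not part of the original text.

    from google.appengine.ext import ndb

    class Greeting(ndb.Model):
        # Entities of this "kind" are stored in the schema-free NoSQL datastore.
        author  = ndb.StringProperty()
        content = ndb.TextProperty()
        date    = ndb.DateTimeProperty(auto_now_add=True)

    def post_greeting(author, content, guestbook="default"):
        # Entities created with the same parent key form one entity group,
        # the unit over which GAE datastore transactions operate.
        parent = ndb.Key("Guestbook", guestbook)
        Greeting(parent=parent, author=author, content=content).put()

    def latest_greetings(guestbook="default", limit=10):
        # GQL: the SQL-like query language mentioned above (filter + sort).
        q = ndb.gql("SELECT * FROM Greeting WHERE ANCESTOR IS :1 ORDER BY date DESC",
                    ndb.Key("Guestbook", guestbook))
        return q.fetch(limit)

Because both functions use the same parent key, the entities they touch fall into one entity group and can therefore be updated together in a single transaction.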
4.3.2 Google File System (GFS):
GFS was designed as a storage service for Google's search engine, basically to store and process the huge amounts of data that Google needs. The Google File System is a distributed file system developed to support Google applications, and it is designed for very large files (typically 100 MB and upward). GFS partitions a file into fixed-size segments called chunks; each chunk is a 64 MB data block (internally divided into 64 KB blocks for checksumming). It ensures the reliability of data by distributing replicated copies of each chunk across multiple chunkservers, and it allows multiple append operations to proceed concurrently. A single master provides access to the metadata, while the data itself is stored on the chunkservers. GFS provides an accessing interface similar to the POSIX file system, a feature that allows applications to see the physical locations of file blocks, and it uses a customized API to support the record append operation.

The architecture of the Google File System is shown below:

Figure: GFS Architecture

The architecture includes only a single master for storing the metadata of the cluster; the other nodes act as chunkservers, each responsible for storing data. The master also manages the file system namespace and locking facilities; it interacts with the chunkservers to obtain management information from them and instructs them to perform tasks such as load balancing or failure recovery. A single master is thus capable of managing the whole cluster, and its use avoids complicated distributed algorithms in the GFS design. Nevertheless, a single master may become a performance bottleneck. To overcome this, Google employs shadow masters, which replicate the master's state so that data operations between clients and chunkservers can proceed directly without further interruption; only control messages pass between the client and the master, and copies of these control messages are kept for later use. With these facilities, a single master can manage a cluster of around 1,000 nodes.

The figure below illustrates the data mutation operations, such as write and append, in GFS.

Figure: Data Mutation in GFS
Data mutation must be performed by creating a data block for each replicated copy. The goal of the mutation protocol is to minimize the master's involvement in the cluster. The steps for performing a mutation are as follows:

1. The client first asks the master which chunkserver holds the current chunk lease and where the other replicas are located.
2. If no chunkserver holds the lease, the master grants a lease to a chosen replica and replies to the client with the identity of this primary and the locations of the other replicas.
3. The client caches this information for future mutations.
4. The client then forwards the data to all replicas. Chunkservers accept the data and keep it in an internal LRU buffer cache. To improve performance, the data flow is decoupled from the control flow, and the expensive data flow is scheduled based on the network topology.
5. After receiving acknowledgments from the replicas, the client sends the write request to the primary replica. The primary assigns consecutive serial numbers to all mutations received, possibly from multiple clients, thereby serializing them, and applies the mutations to its local state in serial order.
6. The primary then forwards the write request to the secondary replicas, which apply the mutations in the same order as the primary.
7. The secondary replicas reply to the primary when they have completed the operation.
8. The primary replica in turn replies to the client, reporting any errors encountered during the mutation. The client code handles recovering from errors and retrying failed mutations.

GFS allows users to perform the record append operation, which appends a data block at the end of a file. GFS offers fast recovery from various system errors and, besides this, ensures:
1. High availability
2. High performance
3. High fault tolerance
4. High scalability

4.4 PROGRAMMING ON AMAZON AWS AND MICROSOFT AZURE

4.4.1 Programming on Amazon:
Amazon was the company that started the use of VMs for application hosting. Instead of using physical machines to run their applications, customers rent VMs; with VMs, customers can load any desired software, and they can create, launch, and terminate server instances, paying for them as they go. Amazon provides several types of VMs. The instances are based on Amazon Machine Images (AMIs) preconfigured with Linux or Windows in addition to other software. There are three types of AMIs:

1. Private AMI: Images created by a user; they are private by default but can be shared so that others are allowed to launch them.
2. Public AMI: Images created by users and released to the AWS community, so that others can launch instances from them and use them.
3. Paid AMI: Images created by users with certain added functions that others can launch for a fee.

The figure below shows the execution environment, in which AMIs serve as the templates for instances of running VMs; public, private, and paid AMIs all support this environment.

Figure: Execution Environment of Amazon EC2
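A minimal sketch of launching and terminating an instance from an AMI using the boto3 Python SDK follows; the AMI ID, region, and instance type are placeholders, and AWS credentials are assumed to be configured on the client machine.

    import boto3  # AWS SDK for Python; assumes credentials and region are configured

    ec2 = boto3.resource("ec2", region_name="us-east-1")

    # Launch one instance from an AMI. The AMI ID below is a placeholder:
    # a private, public, or paid AMI ID from your account/region goes here.
    instances = ec2.create_instances(
        ImageId="ami-0123456789abcdef0",   # hypothetical AMI ID
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
    )
    instance = instances[0]
    instance.wait_until_running()
    print("launched", instance.id, "from AMI", instance.image_id)

    # Billing stops once the instance is terminated.
    instance.terminate()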
4.4.2 Amazon Simple Storage Service (S3):
Amazon S3 (Simple Storage Service) provides a simple web service interface for storing and retrieving data on the web at any time, from anywhere. The service is a form of object-oriented storage: objects are accessible to users through SOAP (Simple Object Access Protocol) with supporting browsers or client programs, while SQS provides a reliable message service between any two processes. The figure below shows the Amazon S3 execution environment.

Figure: Execution Environment of Amazon S3

The fundamental unit of S3 is the object. A bucket holds objects, which are accessed through keys; objects also carry other attributes such as values, access control information, and metadata. Users perform read, write, and delete operations on objects through a key-value programming interface, and they can access data in the Amazon cloud through the REST and SOAP interfaces. The features of S3 are as follows:

• Authentication mechanisms secure the data against unauthorized use; users grant rights on objects by making them private or public.
• Every object has a URL and an ACL (access control list).
• Storage costs range from $0.055 to $0.15 per GB per month.
• Data transfer out of the S3 region costs $0.08 to $0.15 per GB.
• Redundancy is maintained through geographic dispersion.
• An interface to the BitTorrent protocol is provided to decrease the cost of high-scale distribution.
• S3 provides 99.999999999% durability and 99.99% availability of objects over a given year, with a cheaper Reduced Redundancy Storage (RRS) option.
• Data transfer between Amazon EC2 and S3 is not charged.
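The bucket/key/object model above maps directly onto a few SDK calls. The following hedged sketch uses the boto3 Python SDK; the bucket name and key are placeholders (bucket names must be globally unique), and credentials plus the us-east-1 region are assumed to be configured.

    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")  # assumes AWS credentials exist
    bucket = "unit4-demo-bucket"                      # placeholder, must be globally unique

    # A bucket holds objects; each object is addressed by its key.
    s3.create_bucket(Bucket=bucket)

    # Write (PUT) an object under a key, then read (GET) it back.
    s3.put_object(Bucket=bucket, Key="notes/page15.txt", Body=b"hello S3")
    body = s3.get_object(Bucket=bucket, Key="notes/page15.txt")["Body"].read()
    print(body)

    # Access control: objects are private by default; an ACL can make one public.
    s3.put_object_acl(Bucket=bucket, Key="notes/page15.txt", ACL="public-read")

    # Delete the object and the bucket when done.
    s3.delete_object(Bucket=bucket, Key="notes/page15.txt")
    s3.delete_bucket(Bucket=bucket)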
4.4.3 Amazon Elastic Block Store (EBS):
The Amazon Elastic Block Store offers a block-level volume interface for saving and restoring the virtual images of EC2 instances. Traditional EC2 instances are deleted once their use is completed, but their state can be saved in the EBS system on machine shutdown, so the running data of EC2 instances is preserved using EBS. Users can create volumes ranging from 1 GB to 1 TB in size and mount them on EC2 instances; multiple volumes can be mounted on the same instance. Users can create a file system on top of an Amazon EBS volume or use it in any other desired way. Data is saved through snapshots, which also improve performance. Amazon charges according to usage.

Amazon SimpleDB Service:
Amazon SimpleDB provides a simple data model relative to the relational database model. User data is organized into domains, which play the role of tables; a domain contains items as its rows, and the attribute values of an item are the cells of the corresponding row. A single cell can hold multiple values. Developers can store, access, and query data easily, but SimpleDB does not provide the strict consistency of a relational database. SimpleDB and Azure Table manage smaller amounts of data in a distributed table, so they can be called "LittleTable," whereas BigTable is meant to store big data. SimpleDB costs $0.140 per Amazon SimpleDB machine hour.

4.4.4 Microsoft Azure Programming:
The main features of the Azure cloud platform are its programming components, SQL Azure, client development support, storage, and the programming subsystem; they are depicted in the figure below. The bottom layer of the figure is the fabric, which contains the virtualized hardware together with a sophisticated control environment that dynamically assigns resources and implements fault tolerance. This layer also implements the domain name system and monitoring capabilities. Service models can be defined by XML templates, and multiple copies of a service can be instantiated. While the system is running, the services are monitored, and users can access event logs, trace/debug data, IIS web server logs, crash dumps, performance counters, and other files. Azure storage holds all of this data, and debugging can be performed on it.
Azure connects to the Internet through a customized compute VM known as a web role, which supports basic Microsoft web hosting; such VMs are referred to as appliances. Roles that support HTTP(S) and TCP provide the methods listed below.

Figure: Azure Cloud Platform Features

OnStart(): called by the fabric on startup and allows the user to perform initialization tasks; the instance reports a Busy status to the load balancer until this method completes.
OnStop(): called when the role is to be shut down, after which the role exits.
Run(): contains the main logic of the role.

SQL Azure:
SQL Azure provides SQL Server as a service. All the storage modalities are accessed with a REST interface except for the recently introduced drives, which are analogous to Amazon EBS and provide a file system interface as a durable NTFS volume backed by blob storage. The REST interfaces are automatically associated with URLs. All storage is replicated three times for fault tolerance and is guaranteed to be consistent in access. The storage system is built up from blobs, which are analogous to Amazon's S3. Blobs are organized as Account → Containers → Page/Block Blobs: containers are analogous to the directories of a traditional file system, with the account as the root. Block blobs are intended for streaming data; each is arranged as a sequence of blocks of up to 4 MB each, with a block blob holding up to 200 GB. Page blobs are intended for random read/write access and consist of a set of pages with a size of up to 1 TB.

Azure Tables:
The Azure table and queue storage modes are intended for smaller volumes of data. The queues provide reliable message delivery and support work spooling between the web and worker roles; there is no restriction on the number of messages in a queue. Azure supports PUT, GET, and DELETE operations on queue messages, as well as CREATE and DELETE on the queues themselves. Each account can hold any number of Azure tables, which consist of rows called entities and columns called properties. There is no limit on the number of entities in a table; tables can scale to huge numbers of entities spread over distributed computers. Entities carry general properties of the form <name, type, value>. Two special properties, PartitionKey and RowKey, can be assigned to every entity: the RowKey gives each entity a unique label, while the PartitionKey is intended to be shared and determines how entities are grouped and partitioned across storage.
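A small sketch of the Account → Container → Blob hierarchy described above, assuming the azure-storage-blob Python package and a storage-account connection string; the container name, blob name, and connection string below are placeholders, not values from the original text.

    from azure.storage.blob import BlobServiceClient

    # Placeholder connection string for a storage account.
    conn_str = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;"
    service = BlobServiceClient.from_connection_string(conn_str)

    # Account -> container -> blob, mirroring the hierarchy described above.
    service.create_container("unit4-notes")           # container ~ directory
    blob = service.get_blob_client(container="unit4-notes", blob="page17.txt")

    blob.upload_blob(b"hello Azure blob storage", overwrite=True)  # block blob upload
    print(blob.download_blob().readall())                          # read it back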
4.5 EMERGING CLOUD SOFTWARE ENVIRONMENTS

4.5.1 Open Source Eucalyptus:
Eucalyptus is a product of Eucalyptus Systems and is an open source software environment. It was developed out of a research project at the University of California, Santa Barbara, and its purpose is to bring the cloud computing paradigm to academic supercomputers and clusters. To communicate with the cloud service it provides an AWS-compliant, EC2-based web service interface. Apart from this, it also provides services such as the AWS-compliant Walrus storage service and a user interface for managing users and images. It supports the development of both compute clouds and storage clouds. VM images are stored in the Walrus storage system, which is similar to the Amazon S3 service, and can be uploaded and retrieved at any time; this helps users create special virtual appliances. The figure below depicts the Eucalyptus architecture for VM image management.

Figure: Eucalyptus Architecture for VM Image Management

Nimbus:
Nimbus is a collection of open source tools that together aim to offer an IaaS cloud computing solution. It provides a special web interface known as Nimbus Web, built around a Python Django web application that is installed independently of the Nimbus service. A storage cloud implementation known as Cumulus is integrated with the other central services; it is compatible with the Amazon S3 REST API.
Figure: Nimbus Cloud Infrastructure

Nimbus supports the two resource management strategies defined below:
1. Resource pool
2. Pilot

1. Resource pool: This is the default mode, in which the service has direct control over a pool of VM manager nodes.
2. Pilot: In this mode the service requests a cluster's Local Resource Management System to deploy VMs on the VM manager nodes.

Nimbus also implements Amazon's EC2 interface, which allows users to run clients developed for the real EC2 system against Nimbus-based clouds.

4.5.2 OpenNebula:
OpenNebula is an open source toolkit that enables users to transform existing infrastructure into an IaaS cloud. It is designed to be flexible and modular so that it can integrate with various storage and network infrastructure configurations and hypervisor technologies. It consists of three components:
1. Core
2. Capacity manager or scheduler
3. Access drivers

1. Core: A centralized component that controls the complete life cycle of virtual machines, including setting up networks for groups of VMs and managing storage requirements such as deployment of VM disk images or the software environment.
2. Capacity manager: Governs the functionality provided by the core. By default it is a requirement/rank matchmaker, and it can also be used to develop scheduling policies using a lease model and reservations.
3. Access drivers: Provide an abstraction of the underlying infrastructure, exposing the basic monitoring, storage, and virtualization services available in the cluster.

Apart from this, OpenNebula provides management interfaces to integrate the core's functionality with other data-center tools such as accounting or monitoring frameworks. It implements the libvirt API and a command-line interface (CLI) for virtual machine management, and it offers two features useful in a changing environment: live migration and VM snapshots.
OpenNebula also includes cloud drivers, such as an EC2 driver that can send requests to Amazon EC2 and Eucalyptus, and an ElasticHosts driver. Access control is applied to registered images, which eases image sharing in multiuser environments.

Figure: Architecture of OpenNebula

Sector/Sphere:
Sector/Sphere is a software platform that supports very large distributed data storage and data processing on large clusters, within one data center or across multiple data centers. It consists of the Sector distributed file system and the Sphere parallel data processing framework. Using fast network connections, the Sector DFS can be deployed over wide areas and enables users to manage large data sets. Fault tolerance is achieved by replicating data in the file system and managing the replicas. Sector is aware of the network topology and thereby provides better reliability, availability, and access. Communication is handled using UDP and UDT (UDP-based Data Transfer): UDP is used for message passing and UDT is used for transferring data.

Figure: Architecture of Sector/Sphere

Sphere is a parallel data processing engine designed to work with data managed by Sector. Developers can process the data stored in Sector using the programming framework provided by Sphere. Application inputs and outputs are Sector files, and multiple Sphere processing segments can be combined to support more complex applications.
The Sector/Sphere system consists of four components:
1. Security server
2. Slave nodes
3. Client
4. Space

1. Security server: Responsible for authenticating master servers, slave nodes, and users. The master server maintains the file system metadata, schedules jobs, and responds to users' requests.
2. Slave nodes: Store and process the data. They can be located within a single data center or across multiple data centers with high-speed network connections.
3. Client: Provides the tools and programming APIs for accessing and processing Sector data.
4. Space: A framework that supports column-based distributed data tables. Tables are stored by column and are divided across multiple slave nodes. Space supports a set of SQL operations.

OpenStack:
OpenStack was introduced by Rackspace and NASA in July 2010. The project aims to share resources and technologies for building a scalable and secure cloud infrastructure. Its main components are:
a) OpenStack Compute
b) OpenStack Storage

a) OpenStack Compute: This is the internal fabric of the cloud, used to create and control large groups of virtual private servers.

Figure: Architecture of the OpenStack Nova System

OpenStack develops a cloud computing fabric controller, a part of an IaaS system, called Nova. It is built on the ideas of shared-nothing design and message-based information exchange. Communication is carried out using message queues. To prevent components from blocking while waiting for a response from one another, deferred objects are used; a deferred object contains callbacks that are triggered when the response is received. The shared-nothing paradigm is achieved by keeping the system state in a distributed data store. In this architecture, the API server receives HTTP requests from boto, converts the commands to and from the API format, and forwards the requests to the cloud controller. The cloud controller interacts with the user manager through the Lightweight Directory Access Protocol (LDAP). In addition, Nova integrates networking components to manage private networks, public IP addressing, VPN connectivity, and firewall rules.
The networking components include the following types:
1. Network controller: manages address and virtual LAN allocation.
2. Routing node: governs the NAT (Network Address Translation) conversion of public IPs and enforces firewall rules.
3. Addressing node: runs DHCP (Dynamic Host Configuration Protocol) services for the private networks.
4. Tunneling node: provides VPN (Virtual Private Network) connectivity.

The network state consists of the following:
• VPN assignment: for a project
• Private subnet assignment: for a security group in a VLAN
• Private IP allocation: for running instances
• Public IP allocation: for a project
• Public IP association: for a private IP or a running instance

b) OpenStack Storage: The storage solution is built from interacting components and concepts, including a proxy server, rings, an object server, a container server, an account server, replication, updaters, and auditors. The proxy server looks up the location of accounts, containers, or objects in the storage rings and routes the requests accordingly. A ring represents the mapping between the names of entities and their physical locations; it manages zones, devices, partitions, and replicas. The object server is a simple blob storage server that can store, retrieve, and delete objects held on local devices. The container server handles the listing of objects, while the account server handles the listing of containers.

4.5.3 Manjrasoft Aneka Cloud and Appliances:
Aneka is a cloud application platform developed by Manjrasoft. It aims to support the development and deployment of parallel and distributed applications on private and public clouds. It provides a collection of APIs for utilizing distributed resources and expressing the business logic of applications through programming abstractions, while system administrators get tools to monitor and control the deployed infrastructure. Aneka works as a workload distribution and management platform that accelerates applications in both Linux and Microsoft .NET framework environments. Some of the key advantages of Aneka over other workload distribution solutions include:

• It supports multiple programming and application environments.
• It supports multiple runtime environments.
• It can use various virtual and physical machines to accelerate application execution, depending on the user's quality-of-service agreement.
• It is built on top of the Microsoft .NET framework, with support for Linux environments.
Figure: Architecture of Aneka Components

Aneka offers three types of capabilities that are essential for building, accelerating, and managing clouds and their applications:

1. Build: Aneka includes a new SDK that combines APIs and tools to enable users to rapidly develop applications. Aneka also allows users to build different runtime environments, such as an enterprise/private cloud, by harnessing compute resources in networks or enterprise data centers.
2. Accelerate: Aneka supports rapid development and deployment of applications in multiple runtime environments running different operating systems such as Windows or Linux/UNIX. It supports dynamic leasing of extra capabilities from public clouds such as EC2 to meet the QoS requirements of users.
3. Manage: The management tools and capabilities supported by Aneka include a GUI and APIs to set up, monitor, manage, and maintain remote and global Aneka compute clouds.

In Aneka, the available services can be aggregated into three major categories:
1. Fabric services
2. Foundation services
3. Application services

1. Fabric services: These implement the fundamental operations of the cloud infrastructure, including high availability and failover for improved reliability, node membership and directory, resource provisioning, and performance monitoring.
2. Foundation services: These constitute the core functionality of the Aneka middleware and provide the basic set of capabilities that enhance application execution in the cloud, including storage management, resource reservation, reporting, accounting, billing, service monitoring, and licensing.
3. Application services: These deal directly with the execution of applications and are in charge of providing the appropriate runtime environment for each application model. At this level, Aneka can support different application models and distributed programming patterns.