OverFlow: Multi-Site Aware Big Data Management for Scientific Workflows
on Clouds
Abstract:
The global deployment of cloud datacenters is enabling large-scale scientific
workflows to improve performance and deliver fast responses. This
unprecedented geographical distribution of the computation is accompanied by an
increase in the scale of the data handled by such applications, bringing new
challenges related to efficient data management across sites. High
throughput, low latencies or cost-related trade-offs are just a few concerns for
both cloud providers and users when it comes to handling data across
datacenters. Existing solutions are limited to cloud-provided storage, which offers
low performance based on rigid cost schemes. In turn, workflow engines need to
improvise substitutes, achieving performance at the cost of complex system
configurations, maintenance overheads, reduced reliability and reusability. In this
paper, we introduce OverFlow, a uniform data management system for scientific
workflows running across geographically distributed sites, aiming to reap
economic benefits from this geo-diversity. Our solution is environment-aware, as
it monitors and models the global cloud infrastructure, offering high and
predictable data handling performance for transfer cost and time, within and
across sites. OverFlow proposes a set of pluggable services, grouped in a data
scientist cloud kit. They allow applications to monitor
the underlying infrastructure, to exploit smart data compression, deduplication
and geo-replication, to evaluate data management costs, to set a trade-off
between money and time, and to optimize the transfer strategy accordingly. The
system was validated on the Microsoft Azure cloud across its 6 EU and US
datacenters. The experiments were conducted on hundreds of nodes using
synthetic benchmarks and real-life bio-informatics applications (A-Brain, BLAST).
The results show that our system is able to accurately model
cloud performance and to leverage this for efficient data dissemination,
reducing monetary costs and transfer time by up to three times.
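
The trade-off between money and time mentioned above can be illustrated with a minimal sketch. It is not OverFlow's actual model or API: the `TransferOption` type, the `pick_strategy` function, the `time_weight` parameter, the strategy names and all numbers are hypothetical, chosen only to show how a single user-set weight could select among candidate transfer strategies once time and cost estimates are available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferOption:
    """A hypothetical candidate strategy with its estimated time and cost."""
    name: str
    est_seconds: float
    est_dollars: float


def pick_strategy(options, time_weight):
    """Return the option with the lowest weighted, normalized time/cost score.

    time_weight lies in [0, 1]: 1.0 optimizes purely for time,
    0.0 purely for money. This scoring rule is an assumption made
    for illustration, not the model described in the paper.
    """
    if not options:
        raise ValueError("no transfer options given")
    if not 0.0 <= time_weight <= 1.0:
        raise ValueError("time_weight must be in [0, 1]")

    # Normalize by the largest estimate so time and cost are comparable.
    max_t = max(o.est_seconds for o in options)
    max_c = max(o.est_dollars for o in options)

    def score(o):
        t = o.est_seconds / max_t if max_t > 0 else 0.0
        c = o.est_dollars / max_c if max_c > 0 else 0.0
        return time_weight * t + (1.0 - time_weight) * c

    return min(options, key=score)


# Illustrative estimates only; they are not measurements from the paper.
options = [
    TransferOption("direct", est_seconds=120.0, est_dollars=1.0),
    TransferOption("compressed", est_seconds=90.0, est_dollars=1.5),
    TransferOption("multi-path", est_seconds=40.0, est_dollars=4.0),
]

fastest = pick_strategy(options, time_weight=1.0)
cheapest = pick_strategy(options, time_weight=0.0)
balanced = pick_strategy(options, time_weight=0.5)
```

With these made-up estimates, a weight of 1.0 selects the fastest option, 0.0 selects the cheapest, and an intermediate weight can favor a middle option. In a real system the estimates would come from monitoring the infrastructure rather than from fixed constants.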
