EGI-InSPIRE RI-261323
Hadoop analytics provisioning based
on a virtual infrastructure
J. López Cacheiro, A. Simón, J. Villasuso, E. Freire,
Iván Díaz, C. Fernández, and B. Parak
Outline
●
Introduction
●
Methodology
●
Results
●
Conclusions
●
Future work
Introduction
●
Federation of resources poses serious challenges from both
the cloud and the Big Data perspectives.
●
FedCloud has been in production since mid-May 2014
●
Is it suitable for Big Data analytics using Hadoop?
Introduction: FedCloud
●
A federated cloud infrastructure developed inside EGI
●
EGI Federated Cloud Task Force started its activity in
September 2011
●
FedCloud has been in production since mid-May 2014
●
Over the last year it has already been available for
experimentation by the different user communities
Introduction: Big Data
●
Big Data is an emerging field
●
It can benefit from existing developments in cloud
technologies
●
One of the most popular solutions is Apache Hadoop
Introduction: Hadoop
●
A distributed filesystem (HDFS)
●
A framework for job scheduling
●
A MapReduce implementation for parallel processing of
large data sets
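To make the MapReduce model concrete, here is a minimal single-process Python sketch of a word count, the operation used later in the use cases. It only imitates the map, shuffle and reduce phases in memory; it is not Hadoop code, and all function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def wordcount(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(wordcount(["big data on the cloud", "the cloud and big data"]))
```

In Hadoop the map and reduce functions run on different nodes and the shuffle moves data over the network, which is why data placement in HDFS matters for the timings reported later.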
Leveraging FedCloud for Hadoop
The aim of our work is to make an initial assessment of the
suitability of FedCloud for running Hadoop jobs through a series
of real-world benchmarks.
Methodology
●
Cluster startup
●
Benchmarks
Methodology: Cluster startup
●
A Hadoop cluster consists of one master node and a variable
number of slave nodes, depending on the size of the cluster
●
The master node will run three services:
●
namenode
●
secondarynamenode
●
jobtracker
●
The slaves will run two services:
●
tasktracker
●
datanode
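The role assignment above can be sketched as a small Python helper. It assumes that the quoted cluster size includes the master node, which the slides do not state explicitly; the host names are hypothetical.

```python
MASTER_SERVICES = ("namenode", "secondarynamenode", "jobtracker")
SLAVE_SERVICES = ("tasktracker", "datanode")


def cluster_layout(total_nodes):
    """Return {hostname: services} for one master plus (total_nodes - 1) slaves."""
    if total_nodes < 2:
        raise ValueError("need one master and at least one slave")
    layout = {"master": list(MASTER_SERVICES)}
    for i in range(1, total_nodes):
        # Hypothetical naming scheme: slave001, slave002, ...
        layout[f"slave{i:03d}"] = list(SLAVE_SERVICES)
    return layout


if __name__ == "__main__":
    layout = cluster_layout(101)
    print(len(layout), "nodes;", layout["master"], layout["slave001"])
```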
Methodology: Cluster startup
●
A customized VM image was created including different
versions of Hadoop (up to 1.2.1) and the Oracle Java JDK
●
This image was registered in the EGI Marketplace
●
Each resource provider (RP) manually downloaded the image to its
local site in order to make it available at its endpoint
●
To instantiate the VMs that form the Hadoop cluster we
used rOCCI
●
FedCloud does not have a workload management system (WMS), so
each VM creation request must be sent to the appropriate endpoint
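Without a WMS, the client has to decide which endpoint receives each VM creation request. The following Python sketch shows one possible way to plan that, filling endpoints round-robin. It is an illustration only, not the procedure used in this work: the endpoint URLs and capacities are hypothetical and assumed to be known out of band, and no rOCCI call is made.

```python
def plan_requests(n_vms, capacity):
    """Assign n_vms creation requests to endpoints, filling them round-robin.

    capacity maps an endpoint URL to the number of VMs it can still host
    (hypothetical values, assumed to be known in advance).
    """
    remaining = dict(capacity)
    plan = {endpoint: 0 for endpoint in capacity}
    placed = 0
    while placed < n_vms:
        progress = False
        for endpoint in remaining:
            if placed == n_vms:
                break
            if remaining[endpoint] > 0:
                remaining[endpoint] -= 1
                plan[endpoint] += 1
                placed += 1
                progress = True
        if not progress:
            raise RuntimeError("not enough capacity for all requested VMs")
    return plan


if __name__ == "__main__":
    # Hypothetical endpoints and capacities, for illustration only.
    sites = {
        "https://site-a.example.org:11443": 60,
        "https://site-b.example.org:11443": 50,
    }
    print(plan_requests(101, sites))
```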
[Figure: Methodology: Cluster startup]
Methodology: Benchmarks
●
Two standard Big Data benchmarks:
●
TeraGen and TeraSort
●
Two use cases:
●
Representing typical small and medium-size jobs
●
Using real data sets and common MapReduce operations
Methodology: TeraGen/TeraSort
●
The “equivalent” of Linpack in HPC
●
The Sort Benchmark list: “equivalent” to the Top500
●
TeraGen generates an input data set in parallel
●
TeraSort is a standard MapReduce sort job
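TeraGen is sized by a number of rows rather than by bytes, and each row it writes is 100 bytes long. A minimal Python sketch of the conversion follows. It builds a command line without running it; the examples jar name, the HDFS output path and the use of decimal units (1 TB = 10^12 bytes) are assumptions.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_rows(target_bytes):
    """Number of 100-byte rows needed to produce target_bytes of input."""
    return target_bytes // ROW_BYTES


def teragen_command(target_bytes, out_dir, jar="hadoop-examples-1.2.1.jar"):
    """Build (not run) a TeraGen command line as a list of arguments.

    The jar name is an assumption; adjust it to the local installation.
    """
    return ["hadoop", "jar", jar, "teragen", str(teragen_rows(target_bytes)), out_dir]


if __name__ == "__main__":
    one_tb = 10**12  # decimal terabyte, an assumption
    print(" ".join(teragen_command(one_tb, "/benchmarks/teragen-1tb")))
```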
Methodology: Use cases
●
Small-size jobs:
●
Encyclopaedia Britannica: 176 MB
●
Medium-size jobs:
●
Wikipedia: 41 GB
Methodology: Use cases
●
Both use cases were run for different cluster sizes ranging
from 10 to 101 nodes
●
In all the executions we measured two parameters:
●
The put time
●
The wordcount time
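The two measurements can be sketched as separate wall-clock timings of the upload and of the job. This is a minimal Python sketch, not the measurement harness used in this work; the Hadoop commands shown in the comments are assumptions about how the steps might be invoked.

```python
import subprocess
import sys
import time


def timed_run(cmd):
    """Run cmd to completion and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def measure_use_case(put_cmd, wordcount_cmd):
    """Time the HDFS upload and the wordcount job separately."""
    return {
        "put_s": timed_run(put_cmd),
        "wordcount_s": timed_run(wordcount_cmd),
    }


if __name__ == "__main__":
    # Placeholder commands so the sketch runs without Hadoop installed.
    # On a cluster they might be replaced by, for example,
    # ["hadoop", "fs", "-put", "corpus.txt", "/input/corpus.txt"] and
    # ["hadoop", "jar", "hadoop-examples-1.2.1.jar", "wordcount", "/input", "/output"].
    noop = [sys.executable, "-c", "pass"]
    print(measure_use_case(noop, noop))
```

Keeping the two timings apart matters because the put time depends on data transfer into HDFS, while the wordcount time depends on the MapReduce job itself.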
Results
●
Cluster startup: including comparison with Amazon EC2
●
Benchmarks
Results: Cluster startup
●
Startup times for each cluster deployment varied
●
The most representative ones are those obtained for the
largest deployment: the 101-node cluster
●
The time needed by the rOCCI client to return all the
resource endpoints ranged from 71 to 86 minutes
●
Getting all the VMs running took around 80 minutes more
●
Total cluster startup time ranged from 2.5 to 3 hours
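The total follows from adding the two phases. A short Python check of the arithmetic, treating the boot phase as a fixed 80 minutes (the slides give it only as "around 80"):

```python
def total_startup_minutes(endpoint_minutes, boot_minutes=80):
    """Add the VM boot phase to the time rOCCI took to return the endpoints."""
    return endpoint_minutes + boot_minutes


low = total_startup_minutes(71)   # 151 minutes
high = total_startup_minutes(86)  # 166 minutes
print(f"{low / 60:.1f} h to {high / 60:.1f} h")
```

With these inputs the sum is roughly 2.5 to 2.8 hours, consistent with the reported range given that the boot time is approximate.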
OpenNebula Frontend Load
More than 20 simultaneous scp processes considerably
degrade system performance
Results: Startup Optimizations
●
We reduced the size of the VM image to 4 GB
●
An additional disk of 70 GB was created on the fly
●
We modified the datastore used in OpenNebula so that it uses
the shared transfer manager instead of ssh
●
Instances are declared as non-persistent
●
The image copy process is no longer done in a centralized
way from the front-end node but in parallel from each node
using a simple copy operation from the shared storage
Results: Optimized Startup Times
Cluster size (nodes)   Startup time (s)
10                     249
21                     269
51                     822
101                    1096
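To put these figures next to the 2.5 to 3 hours reported for the unoptimized 101-node startup, the following Python sketch computes the implied speedup and the startup time per node. The numbers are taken from the slides; the helper names are illustrative.

```python
OPTIMIZED_S = {10: 249, 21: 269, 51: 822, 101: 1096}
# Unoptimized 101-node startup: 2.5 to 3 hours, converted to seconds
UNOPTIMIZED_101_S = (2.5 * 3600, 3 * 3600)


def speedup_range(optimized_s, baseline_range_s):
    """Return (min, max) speedup of the optimized time over a baseline range."""
    low, high = baseline_range_s
    return low / optimized_s, high / optimized_s


def seconds_per_node(times):
    """Startup seconds divided by cluster size, per cluster size."""
    return {size: t / size for size, t in times.items()}


if __name__ == "__main__":
    lo, hi = speedup_range(OPTIMIZED_S[101], UNOPTIMIZED_101_S)
    print(f"101-node speedup: {lo:.1f}x to {hi:.1f}x")
    for size, per_node in seconds_per_node(OPTIMIZED_S).items():
        print(f"{size:>3} nodes: {per_node:.1f} s per node")
```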
Results: Startup Amazon EC2
Cluster size (nodes)   Startup time (s)   Comments
10                     191
21                     178
51                     583*               2 nodes not working
101                    276*               16 nodes not working
101                    297*               1 node not working
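A side-by-side ratio can help when reading the EC2 figures against the optimized FedCloud startup times. The Python sketch below uses only the numbers from the two tables; treating an empty comment as zero failed nodes is an assumption.

```python
FEDCLOUD_S = {10: 249, 21: 269, 51: 822, 101: 1096}
# EC2 runs as (cluster size, startup seconds, nodes not working);
# a blank comment in the table is assumed to mean 0 failed nodes.
EC2_RUNS = [(10, 191, 0), (21, 178, 0), (51, 583, 2), (101, 276, 16), (101, 297, 1)]


def startup_ratios(fedcloud, ec2_runs):
    """FedCloud startup time divided by EC2 startup time, per EC2 run."""
    return [
        (size, failed, round(fedcloud[size] / seconds, 2))
        for size, seconds, failed in ec2_runs
    ]


if __name__ == "__main__":
    for size, failed, ratio in startup_ratios(FEDCLOUD_S, EC2_RUNS):
        print(f"{size:>3} nodes, {failed:>2} failed on EC2: FedCloud/EC2 = {ratio}")
```

The starred EC2 runs finished with some nodes not working, so their ratios should be read with care.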
[Figure: Results: Benchmarks TeraGen/TeraSort]
[Figure: Results: Encyclopaedia Britannica]
[Figure: Results: Wikipedia]
Conclusions
●
FedCloud is especially suitable for:
●
small and medium-size Hadoop jobs
●
where the data set is already pre-deployed in HDFS
●
Optimized startup times are very close to those obtained in
Amazon EC2
●
The initial data set deployment (put time) imposes a large
overhead
●
Scalability: we were able to run the TeraGen benchmark up
to 2 TB and the TeraSort benchmark up to 500 GB
Future work: FedCloud Wish List
●
Adding a central workload management system
●
An automated way to distribute and synchronize the image
between all sites
●
A mechanism to query the resources available at a given site
Thanks For Your Attention!
Questions?
