GlusterFS For Hadoop – Overview
Vijay Bellur
GlusterFS co-maintainer
Lalatendu Mohanty
GlusterFS Community
Agenda
● What is GlusterFS?
● Overview
● Use Cases
● Hadoop on GlusterFS
● Q&A
What is GlusterFS?
● A general purpose scale-out distributed file system.
● Aggregates storage exports over a network interconnect to provide a single unified namespace.
● The filesystem is stackable and runs completely in userspace.
● Layered on disk file systems that support extended attributes.
Typical GlusterFS Deployment
● Global namespace
● Scale-out storage building blocks
● Supports thousands of clients
● Access using GlusterFS native, NFS, SMB and HTTP protocols
● Linear performance scaling
GlusterFS Architecture – Foundations
● Software only, runs on commodity hardware
● No external metadata servers
● Scale-out with Elasticity
● Extensible and modular
● Deployment agnostic
● Unified access
● Largely POSIX compliant
Concepts & Algorithms
GlusterFS concepts – Trusted Storage Pool
● A Trusted Storage Pool (cluster) is a collection of storage servers.
● A Trusted Storage Pool is formed by invitation: a new member is “probed” from within the cluster, and not vice versa.
● Membership information is used for determining quorum.
● Members can be dynamically added to and removed from the pool.
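For example, growing and shrinking a pool from an existing member (hostnames are hypothetical):
gluster peer probe server2      # invite server2 into the trusted pool
gluster peer probe server3
gluster peer status             # list members and their connection state
gluster peer detach server3     # remove a member from the pool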
GlusterFS concepts - Bricks
 A brick is the combination of a node and an export directory, e.g. hostname:/dir
 Each brick inherits the limits of the underlying filesystem
 No limit on the number of bricks per node
 Ideally, each brick in a cluster should be of the same size
[Diagram: three storage nodes exporting 3, 5 and 3 bricks respectively (/export1 through /export5)]
GlusterFS concepts - Volumes
● A volume is a logical collection of bricks.
● A volume is identified by an administrator-provided name.
● A volume is a mountable entity; the volume name is provided at the time of mounting.
– mount -t glusterfs server1:/<volname> /my/mnt/point
● Bricks from the same node can be part of different volumes.
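A minimal sketch of creating, starting and mounting a two-brick volume (host, brick and volume names are hypothetical):
gluster volume create myvol server1:/export/brick1 server2:/export/brick1
gluster volume start myvol
mount -t glusterfs server1:/myvol /my/mnt/point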
GlusterFS concepts - Volumes
[Diagram: volumes “music” and “videos” built from bricks /export/brick1 and /export/brick2 spread across Node1, Node2 and Node3]
Volume Types
➢ The type of a volume is specified at the time of volume creation
➢ The volume type determines how and where data is placed
➢ The following volume types are supported in GlusterFS:
a) Distribute
b) Stripe
c) Replicate
d) Distributed Replicate
e) Striped Replicate
f) Distributed Striped Replicate
Distributed Volume
➢ Distributes files across the bricks of the volume.
➢ Directories are present on all bricks of the volume.
➢ A single brick failure results in loss of availability for the data on that brick.
➢ Removes the need for an external metadata server.
How does a distributed volume work?
➢ Uses the Davies-Meyer hash algorithm.
➢ The 32-bit hash space is divided into N ranges for N bricks.
➢ At the time of directory creation, a range is assigned to each directory.
➢ During file creation or retrieval, a hash is computed on the file name; this hash value is used to place or locate the file.
➢ Different directories on the same brick end up with different hash ranges.
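The range assigned to a directory on a given brick can be inspected on the storage node through the directory's extended attributes; a sketch, assuming a hypothetical brick path (output details vary by version):
# on a storage node, inspect the DHT layout of a directory on a brick
getfattr -n trusted.glusterfs.dht -e hex /export/brick1/mydir
The returned value encodes the start and end of the 32-bit hash range this brick serves for files in that directory.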
Replicated Volume
● Synchronous replication of all directory and file updates.
● Provides high availability of data when node failures occur.
● Transaction driven, to ensure consistency.
● Changelogs are maintained for reconciliation.
● Any number of replicas can be configured.
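A sketch of creating a 2-way replicated volume (host and brick names are hypothetical):
gluster volume create rvol replica 2 server1:/export/brick1 server2:/export/brick1
gluster volume start rvol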
How does a replicated volume work?
Distributed Replicated Volume
● Distributes files across replicated sets of bricks.
● The number of bricks must be a multiple of the replica count.
● The ordering of bricks in the volume definition matters (see the sketch below).
● Provides both scaling and high availability.
● Reads get load balanced across replicas.
● Currently the most preferred deployment model.
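A sketch showing why ordering matters (hypothetical names): with replica 2, each consecutive pair of bricks forms one replica set, so the pairs should span different servers:
gluster volume create drvol replica 2 \
    server1:/export/brick1 server2:/export/brick1 \
    server1:/export/brick2 server2:/export/brick2
Here server1:/export/brick1 mirrors server2:/export/brick1, and files are distributed across the two resulting replica sets.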
Striped Volume
● Files are striped into chunks that are placed across the bricks of the volume.
● Recommended only when there are very large files, greater than the size of the individual disks.
● A brick failure can result in data loss; redundancy with replication is highly recommended (striped replicated volumes).
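A sketch of creating a striped volume across four bricks (hypothetical names; the stripe count here matches the number of bricks):
gluster volume create svol stripe 4 \
    server1:/export/brick1 server2:/export/brick1 \
    server3:/export/brick1 server4:/export/brick1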
Elastic Volume Management
Application-transparent operations that can be performed in the storage layer (see the CLI sketch after this list):
● Addition of bricks to a volume
● Removal of bricks from a volume
● Rebalancing the data spread within a volume
● Replacing a brick in a volume
● Performance / functionality tuning
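A sketch of the corresponding gluster CLI commands (volume, host and brick names are hypothetical; the exact replace-brick workflow varies by GlusterFS version):
gluster volume add-brick myvol server3:/export/brick1              # grow the volume
gluster volume rebalance myvol start                               # spread data onto the new brick
gluster volume remove-brick myvol server3:/export/brick1 start     # drain and shrink
gluster volume replace-brick myvol server1:/export/brick1 server4:/export/brick1 commit force
gluster volume set myvol performance.cache-size 256MB              # tuning example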
Access Mechanisms
Gluster volumes can be accessed via the following mechanisms:
– FUSE based Native protocol
– NFSv3 and v4
– SMB
– libgfapi
– ReST/HTTP
– HDFS
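For example, mounting the same volume over the native protocol and over NFSv3 (hypothetical names; the NFS mount assumes the Gluster NFS server or NFS-Ganesha is enabled for the volume):
mount -t glusterfs server1:/myvol /mnt/native     # FUSE based native protocol
mount -t nfs -o vers=3 server1:/myvol /mnt/nfs    # NFSv3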
Implementation
Translators in GlusterFS
● Building blocks for a GlusterFS process.
● Based on translators in GNU Hurd.
● Each translator is a functional unit.
● Translators can be stacked together to achieve the desired functionality.
● Translators are deployment agnostic: they can be loaded in either the client or server stacks.
Customizable Translator Stack
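For illustration, a hand-written volume file (volfile) sketch that stacks a replicate translator over two client translators; host and brick names are hypothetical, and in practice volfiles are generated by glusterd rather than written by hand:
volume myvol-client-0
    type protocol/client
    option remote-host server1
    option remote-subvolume /export/brick1
end-volume

volume myvol-client-1
    type protocol/client
    option remote-host server2
    option remote-subvolume /export/brick1
end-volume

volume myvol-replicate-0
    type cluster/replicate
    subvolumes myvol-client-0 myvol-client-1
end-volume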
Ecosystem Integration
Currently integrated with various ecosystems:
● OpenStack
● Samba
● Ganesha
● oVirt
● qemu
● Hadoop
● pcp
● Proxmox
● uWSGI
Use Cases - current
● Unstructured data storage
● Archival
● Disaster Recovery
● Virtual Machine Image Store
● Cloud Storage for Service Providers
● Content Cloud
● Big Data
● Semi-structured & Structured data
Hadoop And GlusterFS
● GlusterFS can be used as the storage backend for Hadoop.
● The GlusterFS Hadoop plugin replaces HDFS with GlusterFS.
● MapReduce jobs can be run on GlusterFS volumes.
● https://github.com/gluster/glusterfs-hadoop
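A minimal core-site.xml sketch for wiring Hadoop to the plugin; fs.glusterfs.impl and the GlusterFileSystem class come from the glusterfs-hadoop project, while the volume name, mount point, and the volume-related keys shown here are assumptions to be checked against the plugin's README for your version:
<configuration>
  <!-- route the glusterfs:// scheme to the plugin's FileSystem implementation -->
  <property>
    <name>fs.glusterfs.impl</name>
    <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>glusterfs:///</value>
  </property>
  <!-- assumed keys: name the volume and its FUSE mount point on each node -->
  <property>
    <name>fs.glusterfs.volumes</name>
    <value>myvol</value>
  </property>
  <property>
    <name>fs.glusterfs.volume.fuse.myvol</name>
    <value>/mnt/glusterfs</value>
  </property>
</configuration>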
Advantage Of Using GlusterFS
● The advantages of a POSIX compliant filesystem.
● The same volume/storage can be used for MapReduce and for storing application data, e.g. log files and other unstructured data.
● No need to copy data from storage into HDFS before running MapReduce.
● No need for a “NameNode”, i.e. a metadata server.
● The advantages of GlusterFS features such as geo-replication and erasure coding:
– Geo-replication is a distributed, continuous, asynchronous, and incremental replication service for disaster recovery (see the sketch after this list).
– It can replicate data from one site to another over Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
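A sketch of establishing a geo-replication session (master volume, slave host and slave volume names are hypothetical; key-distribution steps such as push-pem vary by GlusterFS version):
gluster volume geo-replication mastervol slavehost::slavevol create push-pem
gluster volume geo-replication mastervol slavehost::slavevol start
gluster volume geo-replication mastervol slavehost::slavevol status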
Advantage Of Using GlusterFS
● Erasure coding provides the fundamental technology for storage systems to add redundancy and tolerate failures (see the sketch after this list).
● On GlusterFS, MapReduce jobs use “data locality optimization”: Hadoop tries its best to run map tasks on the nodes where the data is locally present, reducing network traffic and inter-node communication latency.
● GlusterFS works with the Apache Spark and Apache Ambari projects.
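In GlusterFS, erasure coding is exposed as dispersed volumes; a sketch with four bricks, one of which is redundancy (hypothetical names):
gluster volume create dvol disperse 4 redundancy 1 \
    server1:/export/brick1 server2:/export/brick1 \
    server3:/export/brick1 server4:/export/brick1
This volume stores data equivalent to three bricks and tolerates the loss of any one brick, using less raw capacity than full replication.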
Apache Spark Project
● Apache Spark is an open-source data analytics cluster computing framework.
● Spark fits into the Hadoop open-source ecosystem, building on top of the Hadoop Distributed File System (HDFS).
● Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.
● Spark provides primitives for in-memory cluster computing.
● https://spark.apache.org/docs/0.8.1/cluster-overview.html
Apache Ambari Project
● The Apache Ambari project is for provisioning, managing, and monitoring Apache Hadoop clusters.
● It provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
● Ambari supports the automated deployment and configuration of Hadoop on top of GlusterFS.
● http://www.gluster.org/2013/10/automated-hadoop-deployment-o
● http://ambari.apache.org/
Hadoop access
Resources
Mailing lists:
gluster-users@gluster.org
gluster-devel@nongnu.org
IRC:
#gluster and #gluster-dev on freenode
Links:
http://www.gluster.org
http://hekafs.org
http://forge.gluster.org
http://www.gluster.org/community/documentation/index.php/Arch
http://hadoopecosystemtable.github.io/
Thank you!
Lalatendu Mohanty
lmohanty@redhat.com
Twitter: @lalatenduM