DataLogix Hadoop Solution

Hadoop & HDFS - version 1.0

File & Content Solutions

What is Hadoop?

§  Built and distributed as a project of the Apache Software
Foundation.

§  Hadoop ecosystem:
    §  Common – a set of components and interfaces for distributed
    file systems and general I/O;
    §  Avro – a serialization system for efficient, cross-language RPC
    and persistent data storage;
    §  MapReduce – a distributed data processing model and
    execution environment that runs on large clusters of commodity
    machines;
    §  HDFS – a distributed file system that runs on large clusters of
    commodity hardware.


Common Terms in Hadoop HDFS

§  Name node - manages the file system namespace. It
maintains the file system tree and the metadata for all
the files and directories in the tree.

This information is stored persistently on the local disk in
the form of two files: the namespace image and the edit
log.

§  Data node - the workhorses of the file system. Data nodes store
and retrieve blocks when they are told to (by clients or
the name node), and they report back to the name node
periodically with lists of the blocks that they are storing.
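
The division of labour between the name node and the data nodes can be sketched in a few lines of Python. This is a toy model only, not Hadoop code: the class `ToyNameNode`, its method names, and the block and node names are all made up for illustration.

```python
from collections import defaultdict


class ToyNameNode:
    """Toy model: file -> block IDs (metadata) and block -> data nodes holding it."""

    def __init__(self):
        self.file_blocks = {}                    # path -> ordered list of block IDs
        self.block_locations = defaultdict(set)  # block ID -> names of data nodes

    def add_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)

    def receive_block_report(self, datanode, block_ids):
        # Forget what this data node reported earlier, then record the new list.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        # For each block of the file, list the data nodes that currently hold it.
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]


namenode = ToyNameNode()
namenode.add_file("/logs/day1.txt", ["blk_1", "blk_2"])
namenode.receive_block_report("datanode-a", ["blk_1", "blk_2"])
namenode.receive_block_report("datanode-b", ["blk_2"])
print(namenode.locate("/logs/day1.txt"))
# [('blk_1', ['datanode-a']), ('blk_2', ['datanode-a', 'datanode-b'])]
```

Note that the name node holds only metadata here; the block contents themselves would live on the data nodes.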


Common Terms in Hadoop HDFS

§  Secondary name node - its main role is to periodically
merge the namespace image with the edit log to prevent
the edit log from becoming too large. The secondary
name node usually runs on a separate physical machine.
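
The merge step can be illustrated with a small Python sketch. It assumes a simplified edit log of `create`, `delete` and `rename` tuples and an image held as a plain dictionary; the real on-disk formats are different, and the function names `apply_edit` and `checkpoint` are hypothetical.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to the namespace (a dict of path -> metadata)."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = {"replication": edit[2]}
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "rename":
        namespace[edit[2]] = namespace.pop(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(image, edit_log):
    """Replay the edit log onto a copy of the image; return (new_image, empty_log)."""
    new_image = {path: dict(meta) for path, meta in image.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


image = {"/a.txt": {"replication": 3}}
edit_log = [
    ("create", "/b.txt", 3),
    ("rename", "/a.txt", "/c.txt"),
    ("delete", "/b.txt"),
]
image, edit_log = checkpoint(image, edit_log)
print(image)     # {'/c.txt': {'replication': 3}}
print(edit_log)  # []
```

After the checkpoint the image reflects every logged change and the log starts empty again, which is what keeps it from growing without bound.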
"


Hadoop Distributed File System - HDFS

§  HDFS is a file system designed for storing very large
files with streaming data access patterns, running on
clusters of commodity hardware.

§  HDFS has a permissions model for files and directories
that is much like POSIX.

POSIX is an acronym for Portable Operating System Interface.
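
As a rough illustration of what a POSIX-like permission check involves, the Python sketch below tests owner, group and other bits of an octal mode. It is a simplification under stated assumptions: the function `may_access` and the user and group names are hypothetical, and details such as the superuser are ignored.

```python
def may_access(mode, owner, group, user, user_groups, want):
    """Return True if `user` has every permission in `want` (letters from 'rwx')."""
    if user == owner:
        shift = 6          # owner bits
    elif group in user_groups:
        shift = 3          # group bits
    else:
        shift = 0          # "other" bits
    granted = (mode >> shift) & 0o7
    needed = sum({"r": 4, "w": 2, "x": 1}[c] for c in set(want))
    return (granted & needed) == needed


# A file with mode 640 (rw-r-----), owned by alice, group analysts.
print(may_access(0o640, "alice", "analysts", "alice", {"analysts"}, "rw"))  # True
print(may_access(0o640, "alice", "analysts", "bob", {"analysts"}, "r"))     # True
print(may_access(0o640, "alice", "analysts", "bob", {"analysts"}, "w"))     # False
print(may_access(0o640, "alice", "analysts", "carol", {"sales"}, "r"))      # False
```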


Writing data into Hadoop


Reading data from HDFS


MapReduce

§  "Map" step: the master node takes the input, divides it
into smaller sub-problems, and distributes them to
worker nodes. A worker node may do this again in turn,
leading to a multi-level tree structure. The worker node
processes the smaller problem and passes the answer
back to its master node.

§  "Reduce" step: the master node then collects the
answers to all the sub-problems and combines them in
some way to form the output – the answer to the problem
it was originally trying to solve.
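
The two steps can be mimicked on a single machine with a word count, the usual introductory example. The sketch below is plain Python rather than the Hadoop API; the function names and the sample input are made up, and the grouping step in between (often called the shuffle) is written out explicitly.

```python
from collections import defaultdict


def map_phase(line):
    # Map: turn one piece of the input into (key, value) pairs.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group all values that share a key, as happens between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Reduce: combine the partial answers for one key.
    return key, sum(values)


splits = ["the quick brown fox", "the lazy dog", "the fox"]  # one split per worker
mapped = [pair for line in splits for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

On a real cluster each split would be mapped on a different worker node; here the list comprehension stands in for that distribution.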
"

MapReduce


HDFS Storage Solution

§  The DataLogix Hadoop Storage Solution contains:
    §  An enterprise scale-out storage solution for Hadoop workflows.

    §  Native connectivity for Hadoop and ecosystem components:
        §  Hive
        §  HBase
        §  Pig
        §  Mahout

    §  No single point of failure at the Name Node;

    §  No 3x mirroring; native N+M protection is used instead;

    §  Snapshot, Sync and NDMP backup are supported.
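
To see why N+M protection can need less raw capacity than 3x mirroring, the Python sketch below compares the two. It assumes that N+M means N units of data protected by M additional units, and the values N=8 and M=2 are purely illustrative; the slides do not state which values the solution uses.

```python
def replication_raw_tb(usable_tb, copies=3):
    # Every byte is stored `copies` times.
    return usable_tb * copies


def n_plus_m_raw_tb(usable_tb, n, m):
    # Every n units of data are stored together with m protection units.
    return usable_tb * (n + m) / n


usable = 100  # TB of data to store (illustrative)
print(replication_raw_tb(usable))     # 300
print(n_plus_m_raw_tb(usable, 8, 2))  # 125.0
```

Under these assumed values, 100 TB of data needs 300 TB of raw capacity with 3x mirroring and 125 TB with 8+2 protection.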


Writing into Hadoop with the DataLogix solution

§  The storage system becomes the Name Node as well as the Data
Node.

§  It provides scalability and protection of the data.

§  The Hadoop cluster no longer has a single point of failure and no longer
writes multiple 64 MB to 128 MB chunks of data to data nodes.
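
For a sense of scale, the Python sketch below counts how many chunks a file is split into at the two chunk sizes mentioned, and how many chunk writes follow if each chunk is stored three times. The 1 GB file size is illustrative, and the factor of three is taken from the "3x mirroring" mentioned earlier in the deck.

```python
MB = 1024 * 1024


def block_count(file_size, block_size):
    # Number of chunks needed; the last chunk may be only partly filled.
    return (file_size + block_size - 1) // block_size


def chunk_writes(file_size, block_size, copies=3):
    # Total chunk writes when every chunk is stored `copies` times.
    return block_count(file_size, block_size) * copies


file_size = 1024 * MB  # a 1 GB file (illustrative)
print(block_count(file_size, 64 * MB), chunk_writes(file_size, 64 * MB))    # 16 48
print(block_count(file_size, 128 * MB), chunk_writes(file_size, 128 * MB))  # 8 24
```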


Reading Hadoop Data

§  Data is read off the cluster back to the compute nodes;

§  The Data Nodes are now compute nodes and are independent of
the data in the Hadoop cluster:
    §  The benefit is that Hadoop hardware can be upgraded without the need for
    migration of data.


More information?

§  Would you like more information about the Hadoop storage solutions?

Please contact us:

DataLogix
Phone: +31(0)30-7440710
e-mail: info@datalogix.nl

www.datalogix.nl

