Msquare Systems Inc.,
What is Hadoop?

Apache Hadoop is an open source project governed by the Apache Software Foundation (ASF) that allows you to gain
insight from massive amounts of structured and unstructured data quickly and without significant investment.

Hadoop is designed to run on commodity hardware and can scale up or down without system interruption. It consists
of three main functions: storage, processing and resource management.
Core services on Hadoop

MapReduce: MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in
parallel across a cluster of machines in a reliable and fault-tolerant manner.
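
The MapReduce model can be illustrated with a minimal word-count sketch in plain Python. It runs in a single process and does not use the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. On a real cluster the framework runs many mappers and reducers in parallel and handles the grouping step itself.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Single-process stand-in for a MapReduce job (illustration only)."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Each mapper only sees its own line and each reducer only sees the values for one key, which is what lets the real framework spread the work across machines.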

HDFS: The Hadoop Distributed File System is a Java-based file system that provides scalable and reliable data storage across large
clusters.
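
As a rough sketch of how a distributed file system of this kind gets scalability and reliability, the Python below splits a file into fixed-size blocks and assigns each block to several nodes. The round-robin placement and the function names are assumptions made for illustration; HDFS itself uses a more elaborate, rack-aware placement policy.

```python
def split_into_blocks(file_size, block_size):
    """Return the byte length of each block for a file of file_size bytes."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Assign every block to `replication` distinct nodes.

    Simplified round-robin placement, not the actual HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }
```

Because every block lives on more than one node, losing a single machine does not lose data, and adding machines adds capacity.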

Hadoop YARN: YARN is a next-generation framework for Hadoop data processing that extends MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.

Apache Tez: Tez generalizes the MapReduce paradigm into a more powerful framework for executing a complex DAG (directed acyclic
graph) of tasks for near real-time big data processing.
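
The idea of running a DAG of tasks can be sketched with Python's standard graphlib module (Python 3.9 or later). This is not the Tez API; run_dag and the task names are hypothetical, and the sketch runs tasks one at a time, whereas a real engine would run independent tasks in parallel.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies):
    """Run each task after all of its upstream tasks and return results by name.

    tasks: maps a task name to a callable.
    dependencies: maps a task name to the ordered list of tasks it depends on.
    """
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = [results[dep] for dep in dependencies.get(name, ())]
        results[name] = tasks[name](*upstream)
    return results


# Hypothetical four-task pipeline: one load step feeds two steps that join in a report.
example_tasks = {
    "load": lambda: [3, 1, 2],
    "sort": lambda rows: sorted(rows),
    "total": lambda rows: sum(rows),
    "report": lambda ordered, total: {"sorted": ordered, "total": total},
}
example_dependencies = {
    "sort": ["load"],
    "total": ["load"],
    "report": ["sort", "total"],
}
```

Unlike a fixed map-then-reduce pipeline, the graph can have any shape as long as it contains no cycles, so intermediate results can flow directly from one task to the next.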
Hadoop Data Services

Apache Pig: A platform for processing and analyzing large data sets.

Apache HBase: A column-oriented NoSQL data storage system that provides random, real-time read/write access to big data for user
applications.

Apache Hive: Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad hoc queries via
a SQL-like interface for large datasets stored in HDFS.

Apache Flume: Efficiently aggregates and moves large amounts of log data from many different sources into Hadoop.

Apache Mahout: Mahout provides scalable machine learning algorithms for Hadoop, supporting data science tasks such as clustering, classification
and batch-based collaborative filtering.
Hadoop Data Services (continued)

Apache Accumulo: Accumulo is a high-performance data storage and retrieval system with cell-level access control. It is a scalable
implementation of Google’s Bigtable design that works on top of Apache Hadoop and Apache ZooKeeper.
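
Cell-level access control can be pictured with the short Python sketch below: every cell carries its own set of required labels, and a scan returns only the cells the caller is authorized to see. This is a simplification and not the Accumulo API. The Cell type and scan function are hypothetical, and the sketch only supports "all labels required", whereas Accumulo visibility labels are boolean expressions.

```python
from typing import FrozenSet, NamedTuple


class Cell(NamedTuple):
    """One stored value together with the labels needed to read it."""
    row: str
    column: str
    required_labels: FrozenSet[str]
    value: str


def scan(cells, authorizations):
    """Return only the cells whose required labels the caller holds."""
    held = frozenset(authorizations)
    return [cell for cell in cells if cell.required_labels <= held]


# Hypothetical table: two cells in the same row with different visibility.
example_cells = [
    Cell("patient1", "name", frozenset(), "Alice"),
    Cell("patient1", "diagnosis", frozenset({"medical"}), "flu"),
]
```

Two users scanning the same row can therefore get different columns back, without the application having to filter the results itself.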

Apache Storm: Storm is a distributed real-time computation system for processing fast, large streams of data, adding reliable real-time data
processing capabilities to Apache Hadoop 2.x.

Apache HCatalog: A table and metadata management service that provides a centralized way for data processing systems to understand the
structure and location of the data stored within Apache Hadoop.

Apache Sqoop: Sqoop is a tool that speeds and eases the movement of data into and out of Hadoop. It provides a reliable parallel load for
various popular enterprise data sources.
Hadoop Operational Services

Apache ZooKeeper: A highly available system for coordinating distributed processes.

Apache Falcon: Falcon is a data management framework for simplifying data lifecycle management and processing pipelines on Apache
Hadoop.

Apache Ambari: An open-source installation, lifecycle management, administration, and monitoring system for Apache Hadoop clusters.

Apache Knox: The Knox Gateway is a system that provides a single point of authentication and access for Apache Hadoop services in a cluster.

Apache Oozie: Oozie is a Java web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical
unit of work.
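
The idea of chaining jobs into one logical unit of work can be sketched in Python as below. This is not how Oozie is used in practice (Oozie workflows are defined in XML); run_workflow and the result fields are hypothetical and only illustrate that the chain succeeds or fails as a whole.

```python
def run_workflow(jobs):
    """Run (name, callable) jobs in order and stop at the first failure."""
    completed = []
    for name, job in jobs:
        try:
            job()
        except Exception as exc:
            # A failed job fails the whole unit of work; later jobs are skipped.
            return {"status": "FAILED", "completed": completed, "failed": name, "error": str(exc)}
        completed.append(name)
    return {"status": "SUCCEEDED", "completed": completed, "failed": None, "error": None}
```

Reporting which jobs completed before a failure makes it possible to rerun only the remaining steps.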
What Hadoop can and can't do

What Hadoop can't do
Hadoop is a poor fit for:
• Structured data that needs low-latency relational queries, which a relational database usually serves better
• Transactional data

What Hadoop can do
Hadoop is a good fit for:
• Big Data
Support & Partner
Getting Started or Support:

Muthu Natarajan

muthu.n@msquaresystems.com

www.msquaresystems.com

Phone: 212-941-6000

Hadoop white papers
