re:Introduce Big Data and Hadoop Eco-system
Presented By:
Mohammed Shakir Ali
Oct 21st 2015.
What is Big Data?
Big data is a popular term used to describe the exponential growth and availability of
data, both structured and unstructured. [Ref : www.sas.com]
Big data is a broad term for data sets so large or complex that traditional data processing
applications are inadequate. [Ref: www.wikipedia.com]
Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been
created in the last two years alone. Since 10^18 bytes (1 exabyte) = 1,000 petabytes,
2.5 quintillion bytes = 2,500 petabytes. [Ref: www.ibm.com/software/au/data/bigdata/]
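The unit conversion above can be checked with a few lines of Python. This is only a sketch: it assumes decimal (SI) units, where 1 petabyte = 10^15 bytes, and the helper name bytes_to_petabytes is made up for illustration.

```python
# Decimal (SI) byte units, matching the slide's 10^18 bytes = 1000 petabytes.
PETABYTE = 10**15
EXABYTE = 10**18

def bytes_to_petabytes(n_bytes: int) -> float:
    """Convert a byte count to petabytes using decimal (SI) units."""
    return n_bytes / PETABYTE

# 2.5 quintillion bytes, written as an exact integer (25 x 10^17).
daily_bytes = 25 * 10**17
daily_petabytes = bytes_to_petabytes(daily_bytes)

print(bytes_to_petabytes(EXABYTE))  # 1000.0
print(daily_petabytes)              # 2500.0
```

With binary units (1 PiB = 2^50 bytes) the figures would come out slightly lower, so the slide's numbers only hold for the decimal definition.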
Characteristics of Big Data.
● Volume
● Variety
● Velocity
● Veracity
Is Big Data really new?
Let's check: Google search trends for Big Data vs. Data Analysis and Business Intelligence (BI).
https://www.google.com/trends/explore#q=Big%20Data%2C%20Data%20Analysis%2C%20Business%20Intelligence&geo=US&date=1%2F2005%20121m&cmpt=q&tz=Etc%2FGMT-10
Big Data Management Challenges.
Big data keeps growing. According to Forrester Research:
–The average organization will grow its data by 50 percent in the coming year.
–Overall corporate data will grow by a staggering 94 percent.
–Database systems will grow by 97 percent.
–Server backups for disaster recovery and continuity will expand by 89 percent.
Big Data Management Challenges.
Use case of a leading medical research facility:
-The facility generates 100 terabytes of data from various instruments.
-The data is copied by 10 different research departments.
-Each department further processes the data and adds 5 terabytes of synthesized data.
-The facility must now manage over a petabyte of data, of which less than 150 terabytes is unique.
-The entire petabyte is backed up and moved to a disaster recovery site, consuming additional power and storage space.
In the end, the medical center has used over 10 petabytes of storage to manage less than 150 terabytes of unique
data.
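The arithmetic behind this use case can be sketched in Python. The function name storage_footprint is hypothetical, and the sketch assumes that every department keeps a full copy of the raw data and that none of the derived data overlaps, so 150 terabytes is an upper bound on the unique data.

```python
RAW_TB = 100          # data generated by the instruments
DEPARTMENTS = 10      # each department keeps its own full copy (assumption)
DERIVED_TB_EACH = 5   # synthesized data added per department

def storage_footprint(raw_tb, departments, derived_tb_each):
    """Return (managed_tb, unique_tb) for the copy-per-department pattern."""
    derived_tb = departments * derived_tb_each
    managed_tb = departments * raw_tb + derived_tb  # every copy counts
    unique_tb = raw_tb + derived_tb                 # one raw copy plus all derived data
    return managed_tb, unique_tb

managed_tb, unique_tb = storage_footprint(RAW_TB, DEPARTMENTS, DERIVED_TB_EACH)
ratio = round(managed_tb / unique_tb, 1)

print(managed_tb, unique_tb)  # 1050 150
print(ratio)                  # 7.0
```

Under these assumptions the facility manages about seven times more data than is unique, before backups and disaster recovery copies are counted.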
Big Data Management Challenges.
Three basic challenges:
–Storing,
–Processing and
–Managing it efficiently.
Reference:
http://www.forbes.com/sites/ciocentral/2012/07/05/best-practices-for-managing-big-data/
Possible Solutions:
–Scale-out architectures to manage large data sets.
–Reduce the data to a unique set of data.
–Data virtualization to centralize management of the data set.
–Reuse the same data footprint to reduce data duplication.
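One common way to reduce data to a unique set is content-based deduplication. The sketch below is an illustration only, not the method of any product named in the slides: it assumes whole objects fit in memory and treats two objects as duplicates when their SHA-256 digests match. The function name deduplicate and the sample byte strings are made up.

```python
import hashlib

def deduplicate(blobs):
    """Keep one copy of each distinct byte string, keyed by its SHA-256 digest."""
    store = {}
    for blob in blobs:
        digest = hashlib.sha256(blob).hexdigest()
        store.setdefault(digest, blob)  # first copy wins; later duplicates are skipped
    return store

# Four stored objects, three of which are copies of the same scan.
copies = [b"scan-001", b"scan-002", b"scan-001", b"scan-001"]
unique = deduplicate(copies)

print(len(copies), len(unique))  # 4 2
```

Real storage systems usually deduplicate at the block or chunk level rather than per whole object, which also catches files that differ only in part.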
Project Open Data
● Several governments around the world are making data available to the public.
● Data is a valuable national resource and a strategic asset to the U.S.
Government, its partners, and the public.
● Managing this data as an asset and making it available, discoverable, and
usable – in a word, open – not only strengthens our democracy and
promotes efficiency and effectiveness in government, but also has the
potential to create economic opportunity and improve citizens’ quality of life.
● For example, when the U.S. Government released weather and GPS data to
the public, it fueled an industry that today is valued at tens of billions of
dollars per year.
Reference: https://project-open-data.cio.gov/
Benefits of Big Data.
● Cost reduction
Big data technologies like Hadoop and cloud-based analytics can provide substantial cost
advantages.
● Faster, better decision making
Analytics has always involved attempts to improve decision making. With the high speed of
Hadoop and in-memory analytics, several organizations have sped up their decision-making
processes.
● New products and services
Big data analytics can be used to create new products and services for customers.
Several organizations have come up with new products and services with the help of Big Data.
Reference: https://www.sas.com/fr_fr/news/sascom/2014q3/Big-data-davenport.html
Conclusion
● Interest in Big Data and the Hadoop ecosystem has increased in recent years.
● The recent growth in data has created new challenges
for data management, along with new opportunities.
● Several software products and solutions are available to
manage Big Data effectively.
Hadoop architecture Eco-system
What is Apache Hadoop?
Apache Hadoop is an open-source software framework written in Java for distributed
storage and distributed processing of very large data sets.
- It runs on computer clusters built from commodity hardware.
- All the modules in Hadoop are designed to withstand hardware failures.
Apache Hadoop Framework.
The Apache Hadoop framework is composed of the following modules:
1) Hadoop Distributed File System (HDFS) – a distributed file system that stores data on
commodity machines, providing very high aggregate bandwidth across the cluster.
2) Hadoop MapReduce – a programming model for large-scale data processing.
3) Hadoop YARN – a resource-management platform responsible for managing computing
resources in clusters and using them to schedule users' applications.
4) Hadoop Common – the libraries and utilities needed by other Hadoop modules.
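To make the MapReduce programming model concrete, here is a minimal single-process sketch in plain Python. It does not use the Hadoop API; the function names map_phase, shuffle, reduce_phase, and word_count are made up for illustration, and a real Hadoop job would run the map and reduce steps in parallel across the cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data", "Big Data and Hadoop"])
print(counts)  # {'big': 2, 'data': 2, 'and': 1, 'hadoop': 1}
```

Because each map call looks at one line and each reduce call looks at one key, both steps can be spread over many machines, which is the point of the model.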
Apache Hadoop Adoption
On February 19, 2008, Yahoo! Inc. launched a large Hadoop application running on a Linux
cluster with more than 10,000 cores; it produced data that was used in every Yahoo!
web search query.
In 2010, Facebook claimed that they had the largest Hadoop cluster in the world with 21
PB of storage.
As of 2013, Hadoop adoption had become widespread.
For example, more than half of the Fortune 50 used Hadoop.
Search trends about Big Data.
HPC vs Hadoop search trends:
https://www.google.com/trends/explore#q=HPC%2C%20Hadoop&geo=US&date=1%2F2005%20121m&cmpt=q&tz=Etc%2FGMT-10
[Figure: Big Data and Hadoop Architecture]
[Figure: Apache Hadoop Architecture]
[Figure: Hadoop Cluster Setup]
Apache Hadoop Projects
● Apache Pig: a high-level platform for creating MapReduce programs used with Hadoop.
● Apache Hive: a data warehouse infrastructure built on top of Hadoop.
● Apache Spark: an open-source cluster computing framework originally
developed in the AMPLab at UC Berkeley.
● Apache Storm: a distributed computation framework written
predominantly in the Clojure programming language.
● Apache HBase: an open-source, non-relational, distributed database modeled after
Google's BigTable and written in Java.
● Others: Apache ZooKeeper, Impala, Flume, Sqoop, and more.
Search trends about Big Data.
Apache Hadoop vs Apache Spark search trends:
https://www.google.com/trends/explore#q=Hadoop%2C%20Apache%20Spark&geo=US&date=1%2F2005%20121m&cmpt=q&tz=Etc%2FGMT-10
Prominent Hadoop Distributors
● Cloudera
● Hortonworks
● MapR
Hadoop preview:
Cloudera Quickstart VM:
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cloudera_quickstart_vm.html
Big Data workflow:
http://insightdataengineering.com/blog/pipeline_map.html
