Prepared By: Marwan A. Al-Wajeeh
Outline
Big Data: An Overview
Big Data Sources
What Is Big Data
Big Data Challenges
Big Data Analytics
Big Data: An Overview
More than 2.5 quintillion bytes of data are created every day.
IBM: 90 percent of the world's data today was produced in the last two years.
80% of the world's data is unstructured.
Facebook processes 500 TB per day.
Lots and lots of web pages (20 billion web pages in Google).
A billion Facebook users.
Billions of Facebook pages.
Hundreds of millions of Twitter accounts.
Hundreds of millions of tweets per day.
Billions of Google queries per day.
Millions of servers, petabytes of data.
Big Data
Internet of Events: 4 sources of event data
Big Data Sources
What Is Big Data?
Big data is a collection of data sets that are large and complex in nature.
Big data is any data that is expensive to manage and hard to extract value from.
These data sets contain both structured and unstructured data, and they grow so large so fast that they are not manageable by traditional relational database systems or conventional statistical tools.
Big Data: Four Challenges (4 V's)
Volume: the size of the data
 Google example:
  10 billion web pages
  Average size of a web page = 20 KB
  10 billion * 20 KB = 200 TB
  Disk read bandwidth = 50 MB/sec
  Time to read = 4 million seconds = 46+ days
 Airbus A380 example:
  The four engines of an A380 generate 1 PB of data on a flight, for example from London (LHR) to Singapore (SIN)
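The Google example's arithmetic can be checked with a short sketch. It assumes decimal units (1 KB = 1000 bytes, 1 MB = 1000^2 bytes) and a single disk read sequentially at a constant rate. The function and constant names are illustrative and do not come from the slides.

```python
# Back-of-the-envelope check of the slide's numbers (decimal units assumed).
PAGES = 10_000_000_000           # 10 billion web pages
PAGE_SIZE_BYTES = 20 * 1000      # 20 KB average page size
DISK_BANDWIDTH = 50 * 1000**2    # 50 MB/s sequential read from one disk


def scan_time_seconds(total_bytes, bytes_per_second):
    """Time to read total_bytes once at a constant rate."""
    return total_bytes / bytes_per_second


total_bytes = PAGES * PAGE_SIZE_BYTES
seconds = scan_time_seconds(total_bytes, DISK_BANDWIDTH)
days = seconds / 86_400          # seconds per day

print(f"{total_bytes / 1000**4:.0f} TB, {seconds:,.0f} s, {days:.1f} days")
```

Under these assumptions the script reproduces the slide's figures: 200 TB and 4 million seconds, or a little over 46 days.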
Big Data: Four Challenges (4 V's)
Velocity (speed of change)
 We are not only generating a large amount of data; data is continuously being added, and things are changing very rapidly.
Variety (different types of data sources)
 The diversity of sources, formats, quality, and structure.
Veracity (uncertainty of data)
 We cannot be completely sure that what we have recorded is accurate and complete.
Traditional vs Big Data
Big Data Analytics
Big data analytics is the process of:
 Collecting,
 Organizing, and
 Analyzing
large sets of data ("big data") to discover patterns and other useful information.
Traditional vs Big Data Analytics
Traditional Analytics | Big Data Analytics
Analytics on known data that is well understood | The data format is not well understood, because the data is largely unstructured or semi-structured
Built on relational data models | Data comes in various forms and formats from multiple disconnected systems; it is mostly flat, with no relationships
Analytical Challenges with Big Data
Traditional RDBMSs fail to handle big data:
 Big data (terabytes) cannot fit in the memory of a single computer.
 Processing big data on a single computer takes a long time.
 Scaling a traditional RDBMS is expensive.
Single Node Architecture
[Diagram: a single node with CPU, memory, and disk, running machine learning and statistics workloads]
The algorithm runs on the CPU and accesses data that is in memory.
The data is first brought from disk into memory.
What happens if the data is so big that it cannot all fit in memory at the same time?
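On a single node, one common answer to that question is to stream the data through memory a chunk at a time and keep only running aggregates. The sketch below is an illustration, not something taken from the slides: `read_in_chunks` and `streaming_mean` are hypothetical helper names, and a `range` stands in for a file on disk.

```python
from typing import Iterable, Iterator, List


def read_in_chunks(values: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield successive chunks so at most chunk_size values are held at once."""
    chunk: List[float] = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:                      # final, possibly shorter, chunk
        yield chunk


def streaming_mean(values: Iterable[float], chunk_size: int = 1000) -> float:
    """Compute the mean while keeping only a running sum and count in memory."""
    total = 0.0
    count = 0
    for chunk in read_in_chunks(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    return total / count if count else 0.0


# range() stands in for a data source too large to load at once.
mean = streaming_mean(range(1, 1_000_001), chunk_size=10_000)
print(mean)
```

This keeps memory use bounded, but the whole data set still has to pass through one disk and one CPU, so the scan time is unchanged.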
Google Example
10 billion web pages
Average size of a web page = 20 KB
10 billion * 20 KB = 200 TB
Disk read bandwidth = 50 MB/sec
Time to read = 4 million seconds = 46+ days
This is unacceptable, and we need a better solution.
Cluster computing emerged as a new solution.
The fundamental idea is to split the data into chunks: with 1000 disks and CPUs reading in parallel, the job finishes in about an hour (4 million seconds / 1000 = 4,000 seconds).
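The split-into-chunks idea can be sketched on one machine, with a thread pool standing in for the cluster's nodes. This is a toy illustration under stated assumptions: `split_into_chunks` and `count_bytes` are hypothetical helpers, the "pages" are 20-byte dummies, and the final estimate assumes perfectly even, overhead-free scaling across 1000 disks.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous slices of nearly equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_bytes(pages):
    """Stand-in for per-node work: total size of the pages in one chunk."""
    return sum(len(page) for page in pages)


pages = [b"x" * 20 for _ in range(1000)]      # toy "web pages", 20 bytes each
chunks = split_into_chunks(pages, 8)

# Each worker handles one chunk; partial results are combined at the end.
with ThreadPoolExecutor(max_workers=8) as pool:
    partial_sums = list(pool.map(count_bytes, chunks))
total = sum(partial_sums)

# Idealized scaling: scan time divides evenly by the number of disks.
single_disk_seconds = 4_000_000
parallel_seconds = single_disk_seconds / 1000
print(total, round(parallel_seconds / 60, 1), "minutes")
```

With 1000 disks, the idealized scan takes 4,000 seconds, or roughly 67 minutes, which matches the "about an hour" estimate.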
Cluster Architecture
[Diagram: racks of nodes, each node with its own CPU, memory, and disk; the nodes in a rack are connected by a switch, and the rack switches are connected by a backbone switch]
Each rack contains 16-64 nodes.
1 Gbps between any pair of nodes in a rack.
2-10 Gbps backbone between racks.
Multiple racks together form a data center.
Source: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Once we have this kind of cluster, the problem is still not completely solved.
Cluster Computing Challenges
Node failure
 A single server can stay up for about 3 years (1000 days).
 1000 servers in a cluster => 1 failure per day.
 1 million servers in a cluster => 1000 failures per day (Google has approximately a million servers).
 How do we store data persistently and keep it available if nodes can fail?
 How do we deal with node failure during a long-running computation?
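The failure-rate estimate follows from dividing the number of servers by the mean uptime. The sketch below also adds one assumption that is not on the slide: servers fail independently with a constant per-day probability, which gives a rough chance of seeing at least one failure in a given window. The function names are illustrative.

```python
MEAN_UPTIME_DAYS = 1000   # slide's figure: one failure per server per ~1000 days


def expected_failures_per_day(n_servers, mean_uptime_days=MEAN_UPTIME_DAYS):
    """Average number of node failures per day across the cluster."""
    return n_servers / mean_uptime_days


def prob_at_least_one_failure(n_servers, days, mean_uptime_days=MEAN_UPTIME_DAYS):
    """Chance of at least one failure, assuming independent failures
    with a constant per-day probability (a simplifying assumption)."""
    p_one_server_survives = (1 - 1 / mean_uptime_days) ** days
    return 1 - p_one_server_survives ** n_servers


small = expected_failures_per_day(1_000)
large = expected_failures_per_day(1_000_000)
p_day = prob_at_least_one_failure(1_000, days=1)
print(small, large, round(p_day, 3))
```

Even in the 1000-server cluster, a job that runs for a day is more likely than not to see a node fail, which is why the framework has to tolerate failures.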
Cluster Computing Challenges
Network bottleneck
 Network bandwidth = 1 Gbps.
 Moving 10 TB takes approximately 1 day.
 A complex computation might need to move a lot of data, and that can slow the computation down.
 We need a framework that does not move data around so much while it is doing computation.
Distributed programming is hard!
 It is hard to write distributed programs correctly.
 We need a simple model that hides most of the complexity of distributed programming.
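The network estimate can be checked the same way. This sketch assumes decimal units (10 TB = 10 * 1000^4 bytes), a fully utilized 1 Gbps link, and no protocol overhead. The function name is illustrative.

```python
def transfer_time_seconds(n_bytes, link_bits_per_second):
    """Time to push n_bytes over a link at its full nominal rate."""
    return n_bytes * 8 / link_bits_per_second


ten_tb = 10 * 1000**4                 # 10 TB in bytes (decimal units)
transfer_seconds = transfer_time_seconds(ten_tb, 1e9)   # 1 Gbps link
transfer_hours = transfer_seconds / 3600
print(f"{transfer_seconds:,.0f} s = {transfer_hours:.1f} hours")
```

That comes to 80,000 seconds, about 22 hours, in line with the slide's "approximately 1 day".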
Map-Reduce
Map-Reduce addresses the challenges of cluster computing:
 Store data redundantly on multiple nodes for persistence and availability.
 Move computation close to the data to minimize data movement.
 Provide a simple programming model that hides the complexity of all this magic.
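The slides do not spell out the programming model itself, so the following is the customary word-count illustration, run in a single Python process. It shows only the map, group-by-key, and reduce steps. Redundant storage, data locality, and fault tolerance are what a real framework such as Hadoop adds, and they are not modeled here. All function names are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return (key, sum(values))


def map_reduce(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


docs = ["big data is big", "data about data"]
counts = map_reduce(docs)
print(counts)
```

The user writes only `map_phase` and `reduce_phase`. In a real cluster, the framework would run many copies of each on the nodes that already hold the data.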
Big Data Analytics Tools and Technologies
Hadoop = MapReduce + HDFS
Pig, Hive, HBase, Flume, RHadoop, Sqoop, Oozie, Avro, ZooKeeper
Thank You
4 Types of Analytics
Descriptive: What happened?
Diagnostic: Why did it happen?
Predictive: What will happen?
Prescriptive: What is the best that can happen?
Analytics Tools:
SAS
IBM SPSS
Stata
R
MATLAB
The key aspects of the big data platform are: integration, analytics, visualization, development, workload optimization, security, and governance.
The 5 High Value Big Data Use Cases
Thank You

Introduction Big data
