Introduction to Big Data

Prepared By: Marwan A. Al-Wajeeh

Outline
Big Data: An Overview
Big Data Sources
What Is Big Data
Big Data Challenges
Big Data Analytics

Big Data: An Overview
More than 2.5 quintillion bytes of data are created every day
IBM: 90 percent of the world’s data today was produced in the last two years
80% of the world’s data is unstructured
Facebook processes 500 TB per day
Lots and lots of web pages (20 billion web pages in Google)
A billion Facebook users
Billions of Facebook pages
Hundreds of millions of Twitter accounts
Hundreds of millions of tweets per day
Billions of Google queries per day
Millions of servers, petabytes of data

Internet of Events: 4 sources of event data

What Is Big Data?
Big Data is a collection of data sets that are large and complex in nature.
Big Data is any data that is expensive to manage and hard to extract value from.
These data sets comprise both structured and unstructured data, and they grow so large so fast that they are not manageable by traditional relational database systems or conventional statistical tools.

Big Data: Four Challenges (4 V’s)
Volume: the size of data
  Google example:
    10 billion web pages
    Average size of a web page = 20 KB
    10 billion * 20 KB = 200 TB
    Disk read bandwidth = 50 MB/sec
    Time to read = 4 million seconds = 46+ days
  Airbus A380 example:
    Each four-engine A380 generates 1 PB of data on a flight, for example from London (LHR) to Singapore (SIN)

Velocity (speed of change)
  We are not only generating a large amount of data; the data is continuously being added to, and things are changing very rapidly.
Variety (different types of data sources)
  The diversity of sources, formats, quality, and structures
Veracity (uncertainty of data)
  We cannot be completely sure that what we have recorded is correct.

Big Data Analytics
Big data analytics is the process of collecting, organizing, and analyzing large sets of data (“big data”) to discover patterns and other useful information.

Traditional vs Big Data Analytics
Traditional Analytics | Big Data Analytics
Works on known, well-understood data | Data formats are not well understood, because the data is largely unstructured or semi-structured
Built on relational data models | Data comes in various forms and formats from multiple disconnected systems; it is mostly flat, with no relationships

Analytical Challenges with Big Data
Traditional RDBMSs fail to handle big data:
  Big data (terabytes) cannot fit in the memory of a single computer
  Processing big data on a single computer takes a lot of time
  Scaling a traditional RDBMS is expensive

Single Node Architecture
[Figure: a single node made up of CPU, memory, and disk, running machine learning and statistics workloads]
The algorithm runs on the CPU and accesses data that is in memory.
Data is first brought from disk into memory.
What happens if the data is so big that it cannot all fit in memory at the same time?
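One common single-node answer to this question, which the slide itself does not state, is to stream the data from disk in pieces so that only one piece is in memory at a time. The sketch below illustrates that idea under stated assumptions: `count_lines_streaming` and the chunk size are hypothetical names and values chosen for illustration, and an in-memory `io.BytesIO` stands in for a large file on disk.

```python
import io

def count_lines_streaming(stream, chunk_size=64 * 1024):
    """Count newline bytes while holding at most chunk_size bytes in memory."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)  # read the next piece from "disk"
        if not chunk:                    # empty bytes means end of data
            break
        total += chunk.count(b"\n")
    return total

# Stand-in for a file too large to load at once (assumption for the demo).
data = io.BytesIO(b"record\n" * 1000)
line_count = count_lines_streaming(data, chunk_size=128)
print(line_count)
```

Streaming keeps memory use bounded, but the total read time is still limited by the bandwidth of the single disk.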

Google Example
10 billion web pages
Average size of a web page = 20 KB
10 billion * 20 KB = 200 TB
Disk read bandwidth = 50 MB/sec
Time to read = 4 million seconds = 46+ days
This is unacceptable, and we need a better solution.
Cluster computing emerged as a new solution.
The fundamental idea is to split the data into chunks: with 1000 disks and CPUs, the job is done in about an hour.
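The arithmetic behind this estimate can be checked with a short script. This is a back-of-the-envelope sketch: it assumes decimal units (1 KB = 10^3 bytes, 1 TB = 10^12 bytes) and an idealized cluster in which the data is split evenly and the disks are read fully in parallel. The function name `read_time_seconds` is made up for illustration.

```python
# Figures taken from the slide; decimal units are an assumption.
NUM_PAGES = 10 * 10**9          # 10 billion web pages
PAGE_SIZE_BYTES = 20 * 10**3    # 20 KB average page size
DISK_BANDWIDTH = 50 * 10**6     # 50 MB/sec read bandwidth per disk

def read_time_seconds(total_bytes, bandwidth_bytes_per_sec, num_disks=1):
    """Time to read total_bytes split evenly across num_disks read in parallel
    (idealized: no coordination or network overhead)."""
    return total_bytes / (bandwidth_bytes_per_sec * num_disks)

total_bytes = NUM_PAGES * PAGE_SIZE_BYTES
single = read_time_seconds(total_bytes, DISK_BANDWIDTH)
cluster = read_time_seconds(total_bytes, DISK_BANDWIDTH, num_disks=1000)

print(f"Total data: {total_bytes / 10**12:.0f} TB")
print(f"One disk:   {single:,.0f} s (about {single / 86400:.1f} days)")
print(f"1000 disks: {cluster:,.0f} s (about {cluster / 3600:.1f} hours)")
```

With these numbers a single disk needs 4,000,000 seconds, while 1000 disks in parallel need 4,000 seconds, a little over an hour.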

Cluster Architecture
[Figure: cluster architecture. Each node has its own CPU, memory, and disk. Each rack contains 16-64 nodes connected by a switch, with 1 Gbps between any pair of nodes in a rack. The racks are connected by a 2-10 Gbps backbone.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
With multiple racks, we have a data center.

Once we have this kind of cluster, the problem is still not completely solved.


Cluster Computing Challenges
Node failure
  A single server can stay up for about 3 years (1000 days)
  1000 servers in a cluster => 1 failure/day
  1 million servers in a cluster => 1000 failures/day (Google has approximately a million servers)
  How do we store data persistently and keep it available if nodes can fail?
  How do we deal with node failure during a long-running computation?
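The failure rates quoted above follow from a simple ratio. The sketch below assumes each server fails on average once every 1000 days, independently of the others; the function name `expected_failures_per_day` is invented for illustration.

```python
def expected_failures_per_day(num_servers, mean_uptime_days=1000):
    """Expected number of failures per day across the cluster, assuming each
    server fails on average once every mean_uptime_days."""
    return num_servers / mean_uptime_days

for n in (1000, 1_000_000):
    rate = expected_failures_per_day(n)
    print(f"{n:>9,} servers -> {rate:,.0f} failure(s) per day")
```

At this scale failure is the normal case, so storage and computation both have to tolerate it.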

Network bottleneck
  Network bandwidth = 1 Gbps
  Moving 10 TB takes approximately 1 day
  A complex computation might need to move a lot of data, and that can slow the computation down
  We need a framework that doesn't move data around so much while it's doing computation
Distributed programming is hard!
  It is hard to write distributed programs correctly
  We need a simple model that hides most of the complexity of distributed programming
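The "10 TB takes approximately 1 day" figure can be verified the same way. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) and a link that runs at full speed with no protocol overhead; `transfer_time_seconds` is an illustrative name.

```python
def transfer_time_seconds(num_bytes, link_bits_per_sec):
    """Idealized time to push num_bytes over a link of the given bit rate."""
    return num_bytes * 8 / link_bits_per_sec

TEN_TB = 10 * 10**12   # bytes
ONE_GBPS = 10**9       # bits per second

seconds = transfer_time_seconds(TEN_TB, ONE_GBPS)
print(f"{seconds:,.0f} s = {seconds / 3600:.1f} hours = {seconds / 86400:.2f} days")
```

The result is 80,000 seconds, roughly 22 hours, which is why moving data between nodes has to be kept to a minimum.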

MapReduce
MapReduce addresses the challenges of cluster computing:
  Store data redundantly on multiple nodes for persistence and availability
  Move computation close to the data to minimize data movement
  Provide a simple programming model that hides the complexity of all this magic
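The slide does not spell out the programming model itself. As a rough illustration, the usual word-count example is sketched below in plain Python, running in a single process. The names `map_fn`, `shuffle`, `reduce_fn`, and `map_reduce` are hypothetical and are not the API of Hadoop or any other framework. In a real cluster the map step would run on the nodes that already hold each chunk, and the framework would handle grouping, redundancy, and failures.

```python
from collections import defaultdict

def map_fn(document):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(word, counts):
    """Reduce: combine all values seen for one key into a single result."""
    return word, sum(counts)

def map_reduce(documents):
    # Sequential stand-in for what a cluster would do in parallel.
    pairs = (pair for doc in documents for pair in map_fn(doc))
    return dict(reduce_fn(word, counts) for word, counts in shuffle(pairs).items())

chunks = ["big data is big", "data moves fast"]
word_counts = map_reduce(chunks)
print(word_counts)
```

The programmer writes only `map_fn` and `reduce_fn`; everything else is the part the framework is meant to hide.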

Big Data Analytics Tools and Technologies
Hadoop = MapReduce + HDFS
Related tools: Pig, Hive, HBase, Flume, RHadoop, Sqoop, Oozie, Avro, ZooKeeper

4 Types of Analytics
Descriptive: What happened?
Diagnostic: Why did it happen?
Predictive: What will happen?
Prescriptive: What is the best that can happen?
Analytics tools:
SAS
IBM SPSS
Stata
R
MATLAB

The key aspects of the big data platform are: integration, analytics, visualization, development, workload optimization, security, and governance.

The 5 High-Value Big Data Use Cases
