Introduction to Big Data
By: Haluan Mohammad Irsad
Definition
Big data can be defined by the 3Vs (three Vs):
– Volume: starts as low as 1 terabyte and has no upper limit.
– Velocity: data volume per unit of time; should be at least 30 KB/sec.
– Variety: unstructured and semi-structured data added to structured data.
Volume
Big data is composed of huge numbers of very small transactions that come in a variety of
formats.
The data produces true value only after it is aggregated and analyzed.
Velocity
Required latency is less than 100 ms, measured from the time the data is created to the
time a response is available.
Throughput requirements can easily be as high as 1,000 messages per second.
Variety
Composed of a combination of datasets with differing underlying structures (structured,
semi-structured, or unstructured).
Heterogeneous formats: graphics, JSON, XML, CSV, and log files.
Identifying by the Sources
Data nowadays is generated by:
• Humans
• Machines
• Sensors
Typical sources:
• Social media
• Financial transactions
• Health records
• Click streams
• Log files
• Internet of Things
Problem
■ Managing the volume of data, caused by the sheer amount of data coming in
■ Maintaining system performance, which suffers when the velocity of data access is low
■ Avoiding disjunction of data, caused by the variety of data structures and formats
How do we accomplish all of this?
Managing Volume
Use a scalable database: a NoSQL DBMS such as MongoDB, Cassandra, or Titan (see the sketch below).
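As an illustration only, a minimal sketch of storing a schemaless document with the MongoDB Java driver; the connection string, database, and collection names are assumptions, not part of the original slides:

```java
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;

public class EventStore {
    public static void main(String[] args) {
        // Connection string and names below are illustrative assumptions.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("bigdata_demo");
            MongoCollection<Document> events = db.getCollection("events");

            // Schemaless document: new fields can be added without migrations,
            // which is part of what makes scaling varied data practical.
            Document event = new Document("user", "u-1001")
                    .append("action", "click")
                    .append("payload", new Document("page", "/home").append("ms", 37));
            events.insertOne(event);

            System.out.println("Stored events: " + events.countDocuments());
        }
    }
}
```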
Maintaining Performance
■ For Batch Processing, use Hadoop MapReduce
■ For Stream Processing, use Apache Spark, Apache Storm, Apache Drill
Avoiding Disjunction
■ Use a flat storage architecture (a data lake) to hold huge volumes of multi-structured
data.
■ Use the Hadoop Distributed File System (HDFS) to distribute that storage across the
machines in a cluster.
Digging deeper into Hadoop
What is Hadoop?
A framework that allows for the distributed processing of large data sets across clusters
of computers using simple programming models.
Why Hadoop?
■ The most proven framework in industry today
■ Open source
■ Rich features & functionality
■ Rich ecosystem of supporting plugins
=> (https://hadoopecosystemtable.github.io/)
When to use Hadoop?
■ For processing large volumes of data
■ For parallel data processing
■ For storing a diverse set of data
When not to use Hadoop?
■ As a relational database system
■ As a general-purpose network file system
■ For non-parallel data processing
Core Functions
■ Data Storage
■ Data Processing
■ Resource Management
Data Storage
Hadoop uses HDFS (the Hadoop Distributed File System) to store data (see the sketch after
the goals below).
HDFS is a distributed file system designed to be fault tolerant and to run on low-cost
hardware.
HDFS’ Goals
■ Hardware Failure, detection of faults and quick automatic recovery.
■ Streaming Data Access, emphasis on high throughput of data access.
■ Large Data Sets, provide high aggregate data bandwidth and scale to hundreds of
nodes in a single cluster and support tens of millions of files in a single instance.
■ Simple Coherency Model, HDFS applications need a write-once-read-many access
model for files.
■ Portability, designed to be easily moved from one platform to another.
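A minimal sketch of writing a file into HDFS with Hadoop's Java FileSystem API; it assumes the Hadoop client libraries and the cluster configuration files are on the classpath, and the file paths are illustrative only:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
    public static void main(String[] args) throws Exception {
        // Assumes core-site.xml / hdfs-site.xml are on the classpath,
        // so the Configuration picks up the cluster's NameNode address.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path local = new Path("/tmp/clickstream.log");    // example local file (assumption)
            Path remote = new Path("/data/raw/clickstream.log");

            // Write once; HDFS replicates the blocks across DataNodes for fault tolerance.
            fs.copyFromLocalFile(local, remote);

            System.out.println("Replication factor: "
                    + fs.getFileStatus(remote).getReplication());
        }
    }
}
```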
Data Processing
Data processing is done in two ways: batch and real-time.
■ Batch processing: execution of a series of jobs over collected data.
– Use Hadoop MapReduce
■ Real-time processing: execution of jobs on data as soon as it arrives.
– Use Apache Spark
Hadoop MapReduce
• MapReduce is a parallel, distributed processing model that can be used to process large
amounts of data in batches and transform them into manageable-size results.
• The work is done in two steps (see the word-count sketch below):
1. Map the data: the input is divided into fragments and represented as key-value pairs,
which are then assigned to map tasks.
2. Reduce the data: a combination of the Shuffle stage and the Reduce stage. The goal of
this stage is to process the output of the map tasks and produce a new set of output,
which is stored in HDFS.
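The classic word-count job illustrates the two steps; this is a minimal sketch built on the standard Hadoop MapReduce API, with input and output HDFS paths supplied on the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: emit (word, 1) for every word in the input fragment.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: after the shuffle groups values by word, sum the counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);  // written to HDFS by the output format
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```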
Apache Spark
A compute engine for Hadoop data that provides an expressive programming model,
SQL queries (Spark SQL), stream processing, machine learning (MLlib), and graph
computation (GraphX).
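A minimal sketch using Spark's Java API and Spark SQL; the HDFS path and the shape of the "clicks" data are assumptions for illustration:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ClickStreamStats {
    public static void main(String[] args) {
        // Typically launched with spark-submit, which supplies the cluster master.
        SparkSession spark = SparkSession.builder()
                .appName("ClickStreamStats")
                .getOrCreate();

        // Read semi-structured JSON straight from HDFS; Spark infers the schema.
        // Path and fields ("page") are illustrative assumptions.
        Dataset<Row> clicks = spark.read().json("hdfs:///data/raw/clicks/*.json");

        // Spark SQL: the declarative side of the programming model.
        clicks.createOrReplaceTempView("clicks");
        Dataset<Row> topPages = spark.sql(
                "SELECT page, COUNT(*) AS hits FROM clicks GROUP BY page ORDER BY hits DESC LIMIT 10");

        topPages.show();
        spark.stop();
    }
}
```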
Resource Management
Manages all the resources in the Hadoop cluster: monitoring for faults, scheduling jobs,
and performing quick automatic recovery.
Hadoop uses YARN for this.
Hadoop YARN
• The	ResourceManager is	the	ultimate	authority	
that	arbitrates	resources	among	all	the	
applications	in	the	system	(cluster).
• The NodeManager is the per-machine framework agent that is responsible for containers,
monitoring their resource usage (CPU, memory, disk, network) and reporting it to the
ResourceManager.
YARN (cont'd)
■ The Scheduler is responsible for allocating resources to the various running
applications subject to familiar constraints of capacities, queues, etc.
– Performs no monitoring or tracking of status for the application
– Offers no guarantees about restarting failed tasks, whether the failure is due to the
application or to hardware
– Performs its scheduling function based on the resource requirements of the
applications
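As a sketch of how an application states its resource requirements to the Scheduler, the standard MapReduce configuration keys below request container sizes and a submission queue; the specific values and the queue name are assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnSizedJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Per-container resource requests; the YARN Scheduler allocates based on
        // these numbers, not on the application's internal progress.
        conf.set("mapreduce.map.memory.mb", "2048");     // memory per map container
        conf.set("mapreduce.reduce.memory.mb", "4096");  // memory per reduce container
        conf.set("mapreduce.map.cpu.vcores", "1");
        conf.set("mapreduce.reduce.cpu.vcores", "2");

        // Capacity-scheduler queue to submit into; queue names are cluster-specific (assumption).
        conf.set("mapreduce.job.queuename", "analytics");

        Job job = Job.getInstance(conf, "yarn sized job");
        // ... mapper/reducer/input/output setup as in the WordCount sketch above ...
        System.out.println("Submitting with map memory = "
                + conf.get("mapreduce.map.memory.mb") + " MB");
    }
}
```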
Analyzing Data
Process of inspecting, cleansing, transforming, and modeling data with the goal of
discovering useful information, suggesting conclusions, and supporting decision-
making.
The goal of analyzing data is to help your business grow.
Hadoop supports this activity with the help of Apache Mahout.
Apache Mahout
A library that helps create machine learning applications.
Its main functions help solve:
1. Classification: assigning data to a known category.
2. Clustering: grouping a set of objects based on their similarity.
3. Recommendation: producing a list of recommendations based on statistical analysis.
Mahout provides algorithms for all of the problems above and allows them to be customized
on demand (see the sketch below).
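A minimal recommendation sketch using Mahout's "Taste" collaborative-filtering API; the ratings file, the neighborhood size, and the user ID are assumptions chosen for illustration:

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class ProductRecommender {
    public static void main(String[] args) throws Exception {
        // ratings.csv format: userID,itemID,preference (file name is an assumption)
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Measure how similar users' rating patterns are.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

        // Treat the 10 most similar users as the "neighborhood".
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 42.
        List<RecommendedItem> items = recommender.recommend(42L, 3);
        for (RecommendedItem item : items) {
            System.out.println("item " + item.getItemID() + " score " + item.getValue());
        }
    }
}
```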
Visualization
Hadoop by default does not include data visualization.
To visualize the data, use Apache Zeppelin (http://zeppelin.apache.org/).
Apache Zeppelin
Apache Zeppelin runs on top of Apache Spark, but provides pluggable interpreter APIs to
support other data processing systems.
Benefits
Hadoop provides several benefits:
■ Ease of scaling
Hadoop is designed as a distributed system
■ Performance
Hadoop is designed for distributed and parallel processing
■ Availability & Reliability
The Hadoop platform provides data protection and automatic failover
configuration
Conclusion
■ Big data is not a barrier; it is simply data that needs to be managed properly.
■ Use the proper tools to manage it.
■ Prepare a strategy for processing the data (batch or stream).
■ Manage and maintain the system carefully.
■ Use the plugins that your functional requirements call for.
■ Grow your business with a data-driven approach.
FIN
Editor's Notes
  • #9: An RDBMS can scale for read operations, but to scale writes you would need to drop the ACID requirements, which violates the core rules of an RDBMS.
  • #13: Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
  • #20: Batch: "where data is collected and then processed as one unit, with processing completion times on the order of hours or days."