Collecting, Storing and
Analyzing Big Data
trieunt@fpt.com.vn
tantrieuf31@gmail.com
Big Data Process Development
Agenda
Collecting → Storing → Processing → Analyzing → Learning → Reacting
Data engineering process: 3 tasks
1. Collecting
a. Concepts
b. Technology
2. Storing
a. Big Data Storage Concepts
b. Big Data Storage Technology
3. Processing
a. Big Data Processing Concepts
b. Big Data Processing Technology
Data Science/Machine Learning process: 3 tasks
4) Analyzing → 5) Learning → 6) Reacting
Big Data Analytics Lifecycle
Collecting
Storing
Processing
(Collecting) → Storing → Processing → Analyzing
→ Learning → Reacting
Collecting
Collecting tools
Batch collecting: Apache Sqoop (from DBMS to Apache Hadoop)
Real-time collecting: RFX-tracking (from stream data to Apache Kafka)
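The difference between the two collecting modes can be sketched in a few lines. This is only an illustration under stated assumptions: it does not use Sqoop, RFX or Kafka. An in-memory sqlite3 database stands in for the source DBMS, a `queue.Queue` stands in for a Kafka topic, and the names `batch_collect`, `realtime_collect` and the `actions` table are hypothetical.

```python
import queue
import sqlite3


def batch_collect(conn, chunk_size=2):
    """Batch mode: pull rows that already sit in a database, chunk by chunk."""
    cursor = conn.execute("SELECT user_id, action FROM actions ORDER BY rowid")
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows


def realtime_collect(event, topic):
    """Real-time mode: push each event to a queue as soon as it happens."""
    topic.put(event)


# Stand-in for the source DBMS (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actions (user_id TEXT, action TEXT)")
conn.executemany(
    "INSERT INTO actions VALUES (?, ?)",
    [("u1", "login"), ("u2", "buy"), ("u1", "logout")],
)

# Batch: three stored rows arrive as chunks of at most two rows.
batches = list(batch_collect(conn))

# Real-time: one event is handed over immediately.
topic = queue.Queue()
realtime_collect(("u3", "login"), topic)
```

In batch mode the data is at rest before it is moved; in real-time mode each event travels on its own, so downstream consumers can react without waiting for the next scheduled import.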
Collecting → (Storing) → Processing → Analyzing
→ Learning → Reacting
Storing Concepts
Clusters
File Systems and Distributed File Systems
NoSQL
Sharding
Replication
Sharding and Replication
CAP Theorem
ACID
BASE
Clusters
NoSQL
Sharding
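As a rough illustration of sharding, the sketch below spreads records over three in-memory "shards" by hashing the record key. It is a simplification: real stores often use range-based or consistent-hash partitioning, and the names `shard_for`, `put` and `get` are hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a record key to a shard index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Each dict plays the role of one shard (one node holding part of the data).
shards = [dict() for _ in range(3)]


def put(key, value):
    shards[shard_for(key, len(shards))][key] = value


def get(key):
    # Only the shard that owns the key is consulted.
    return shards[shard_for(key, len(shards))].get(key)


for user_id in ["u1", "u2", "u3", "u4", "u5"]:
    put(user_id, {"id": user_id})
```

Each record lives on exactly one shard, so a lookup by key touches a single node while the total dataset is split across all of them.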
Replication (Master-Slave)
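A minimal sketch of master-slave replication, assuming synchronous copying and no failures: every write goes to the master, which forwards it to all slaves, and reads are served by the slaves. The classes `Node` and `MasterSlaveStore` are hypothetical and do not model any particular database.

```python
class Node:
    """One machine in the cluster, holding a full copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class MasterSlaveStore:
    """All writes go to the master; the master copies them to every slave."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves
        self._next = 0

    def write(self, key, value):
        self.master.data[key] = value
        for slave in self.slaves:  # synchronous replication, for simplicity
            slave.data[key] = value

    def read(self, key):
        # Spread reads over the slaves in round-robin order.
        slave = self.slaves[self._next % len(self.slaves)]
        self._next += 1
        return slave.data.get(key)


store = MasterSlaveStore(Node("master"), [Node("slave-1"), Node("slave-2")])
store.write("user:1", "Alice")
```

Real systems usually replicate asynchronously, so a slave may briefly return stale data; the sketch leaves that out to keep the write and read paths visible.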
Replication (Peer-to-Peer)
CAP Theorem
Collecting → Storing → (Processing) → Analyzing
→ Learning → Reacting
Processing concepts
Parallel Data Processing
Distributed Data Processing
Hadoop
Processing Workloads
Cluster
Processing in Batch Mode
Processing in Realtime Mode
Parallel Data Processing
Distributed Data Processing
Hadoop
Hadoop is a versatile framework that provides both processing and
storage capabilities
Batch processing (offline processing)
Transactional processing
Cluster
Map and Reduce Tasks
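The map and reduce tasks can be illustrated with the usual word-count example. This is a single-process sketch of the programming model, not Hadoop code: in a real cluster the map and reduce tasks run on different nodes and the framework performs the shuffle. The function names are hypothetical.

```python
from collections import defaultdict


def map_task(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


lines = ["big data big cluster", "data node"]
mapped = [pair for line in lines for pair in map_task(line)]
counts = dict(reduce_task(k, v) for k, v in shuffle(mapped).items())
# counts == {"big": 2, "data": 2, "cluster": 1, "node": 1}
```

Because each map call sees only one line and each reduce call sees only one key, both phases can be spread over many machines without coordination between tasks.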
Processing in Realtime Mode
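In contrast to batch mode, realtime processing handles each event as it arrives and keeps the result continuously up to date. A minimal sketch, assuming events are `(user_id, action)` tuples and using a plain Python iterator in place of a real stream; `process_stream` is a hypothetical name.

```python
from collections import Counter


def process_stream(events):
    """Consume events one at a time and yield the updated totals after each."""
    totals = Counter()
    for user_id, action in events:
        totals[action] += 1
        yield dict(totals)


events = [("u1", "login"), ("u2", "login"), ("u1", "buy")]
snapshots = list(process_stream(iter(events)))
# snapshots[-1] == {"login": 2, "buy": 1}
```

A result is available after every single event, instead of only after the whole dataset has been read.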
Tools
When a standard relational database
(Oracle, MySQL, ...) is not good enough
Example: the “analytics system” MySQL database at a startup, tracking all actions in
mobile games: iOS, Android, ...
3 common problems in Big Data systems
1. Size: the volume of the datasets is a critical factor.
2. Complexity: the structure, behaviour and permutations of the datasets are
a critical factor.
3. Technologies: the tools and techniques used to process a
sizable or complex dataset are a critical factor.
What is Apache Phoenix?
Apache Phoenix is a SQL skin over HBase.
This means Phoenix scales the same way HBase does:
by scaling the HBase cluster up or out.
Phoenix
SQL Engine
Interesting features of Apache Phoenix
● Embedded JDBC driver implements the majority of java.sql interfaces, including
the metadata APIs.
● Allows columns to be modeled as a multi-part row key or key/value cells.
● Full query support with predicate push down and optimal scan key formation.
● DDL support: CREATE TABLE, DROP TABLE, and ALTER TABLE for
adding/removing columns.
● Versioned schema repository. Snapshot queries use the schema that was in
place when data was written.
● DML support: UPSERT VALUES for row-by-row insertion, UPSERT SELECT for
mass data transfer between the same or different tables, and DELETE for
deleting rows.
● Limited transaction support through client-side batching.
● Single table only - no joins yet and secondary indexes are a work in progress.
● Follows ANSI SQL standards whenever possible
● Requires HBase v 0.94.2 or above
● 100% Java
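Two of the features above, the multi-part row key and UPSERT, can be illustrated without a running cluster. The sketch below does not talk to Phoenix or HBase: a Python dict emulates the table, the composite primary key becomes the row key, and `upsert` inserts a row or overwrites the row with the same key. The table `game_actions`, its columns and the helper names are hypothetical; the two SQL strings only show what the corresponding Phoenix statements might look like.

```python
# Phoenix statements this sketch mirrors (hypothetical table, for reference only).
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS game_actions (
    user_id     VARCHAR NOT NULL,
    action_time BIGINT  NOT NULL,
    action      VARCHAR,
    CONSTRAINT pk PRIMARY KEY (user_id, action_time)
)
"""
UPSERT_SQL = "UPSERT INTO game_actions VALUES ('u1', 1000, 'login')"

# Emulated table: row key (user_id, action_time) -> remaining columns.
table = {}


def upsert(user_id, action_time, action):
    """Insert the row, or overwrite it if the row key already exists."""
    row_key = (user_id, action_time)  # multi-part row key
    table[row_key] = {"action": action}


def scan_prefix(user_id):
    """Return all rows whose row key starts with user_id, in key order."""
    return [(key, cells) for key, cells in sorted(table.items())
            if key[0] == user_id]


upsert("u1", 1000, "login")
upsert("u1", 2000, "buy")
upsert("u2", 1500, "login")
upsert("u1", 2000, "refund")  # same row key: overwrites the "buy" row
```

Putting `user_id` first in the key keeps all actions of one user adjacent in key order, which is what makes a query filtered on the leading key column cheap to serve as a range scan.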
the Phoenix table schema
Setting up the Phoenix JDBC driver
Phoenix and SQL tool in Eclipse 4
Phoenix vs Hive
(running over HDFS and HBase)
http://phoenix.apache.org/performance.html
Performance: Phoenix vs Hive
Readings
1. https://medium.baqend.com/real-time-stream-processors-a-survey-and-decision-guidance-6d248f692056#.s00ac9xtu
2. https://medium.baqend.com/nosql-databases-a-survey-and-decision-guidance-ea7823a822d#.pn63unwx6
3. https://www.infoq.com/articles/apache-kafka
4. https://docs.google.com/document/d/1ZtEhLw3lrQSeNWVEJkKLLy8B8t9zA0MGRqmCOV3_hsA/edit?usp=sharing
