Concepts, use cases
and principles to build
big data systems
http://www.bigdatavietnam.org
https://www.facebook.com/bigdatavn
Compiled by Nguyễn Tấn Triều
Key Contents
1. Introduction to the key Big Data concepts
○ The Origins of Big Data
○ What is Big Data ?
○ Why is Big Data So Important ?
○ How Is Big Data Used In Practice ?
2. Introduction to the key principles of Big Data Systems
○ How to design Data Pipeline in 6 steps
○ Using Lambda Architecture for big data processing
3. Practical case study
○ Chat bot with Video Recommendation Engine
4. FAQ for students
Introduction to the key Big Data concepts
○ The Origins of Big Data
○ What is Big Data ?
○ Why is Big Data so important ?
○ How Is Big Data used in practice ?
The Origins of Big Data
https://www.kdnuggets.com/2017/02/origins-big-data.html
What is Big Data ?
Why is Big Data So Important ?
Source: https://internetofthingsagenda.techtarget.com/definition/Internet-of-Things-IoT
How Is Big Data Used In Practice ?
Device Analytics
Which devices are most commonly used ?
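A minimal sketch of this kind of device analytics, assuming each tracked event is a record with a hypothetical `device` field. The sample events are invented for illustration and are not from the deck.

```python
from collections import Counter

# Hypothetical tracking events; the "device" field name is an assumption.
EVENTS = [
    {"user": "u1", "device": "Android"},
    {"user": "u2", "device": "iOS"},
    {"user": "u3", "device": "Android"},
    {"user": "u4", "device": "Desktop"},
    {"user": "u5", "device": "Android"},
]


def device_ranking(events):
    """Return (device, count) pairs sorted from most to least used."""
    return Counter(event["device"] for event in events).most_common()


if __name__ == "__main__":
    for device, count in device_ranking(EVENTS):
        print(f"{device}: {count}")
```

On a real system the same count would usually run as a distributed aggregation, but the logic stays a group-by on the device field.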
How Is Big Data Used In Practice ?
Time-series Analytics
The peak hours of the system
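One way to find peak hours, sketched under the assumption that events carry ISO-8601 timestamps. The sample timestamps below are invented for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical event timestamps (assumption: ISO-8601 strings).
TIMESTAMPS = [
    "2018-03-01T08:15:00",
    "2018-03-01T20:05:00",
    "2018-03-01T20:40:00",
    "2018-03-02T20:10:00",
    "2018-03-02T12:30:00",
]


def events_per_hour(timestamps):
    """Count events for each hour of the day (0-23)."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


def peak_hour(timestamps):
    """Return the hour of the day with the most events."""
    hour, _count = events_per_hour(timestamps).most_common(1)[0]
    return hour
```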
How Is Big Data Used In Practice ?
GeoLocation Heatmap Analytics
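A heatmap can be approximated by binning coordinates into grid cells and counting the points per cell. The sketch below assumes events are (latitude, longitude) pairs; the points and the 0.1-degree cell size are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical (latitude, longitude) points, invented for illustration.
POINTS = [
    (10.776, 106.701),
    (10.779, 106.705),
    (10.823, 106.629),
    (21.028, 105.834),
]


def heatmap_cells(points, cell_size=0.1):
    """Bin points into square grid cells and count the points in each cell."""
    counts = Counter()
    for lat, lon in points:
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        counts[cell] += 1
    return counts
```

Cells with higher counts would be drawn in a hotter colour by whichever visualization tool is used.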
Introduction to the key principles of Big Data Systems
○ How to design Data Pipeline in 6 steps
○ Using Lambda Architecture for big data processing
How to design Data Pipeline Systems
Collecting → Storing → Processing → Analyzing → Learning → Visualizing
Data engineering process: 3 tasks
1. Collecting
a. Concepts
b. Technology
2. Storing
a. Big Data Storage Concepts
b. Big Data Storage Technology
3. Processing
a. Big Data Processing Concepts
b. Big Data Processing Technology
Data Science/Machine Learning process: 3 tasks
4) Analyzing → 5) Learning → 6) Visualizing
Data Engineer Tasks
Data Analyst Tasks
Big Data Analytics Lifecycle
Collecting
Storing
Processing
Analyzing
Learning
Visualizing
(Collecting) → Storing → Processing → Analyzing → Learning → Reacting
Collecting
Collecting tools
Batch collecting: Apache Sqoop (from DBMS to Apache Hadoop)
Real-time collecting: Log Collector with Apache Kafka
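The difference between the two collecting modes can be sketched with standard-library stand-ins. This does not use the Sqoop or Kafka APIs: an in-memory SQLite database plays the role of the source DBMS and a `queue.Queue` plays the role of a Kafka topic, both purely as assumptions for illustration.

```python
import queue
import sqlite3


def batch_collect(conn, table):
    """Batch mode: export every row of a table in one bulk read."""
    # The table name comes from trusted code here; never interpolate user input.
    return conn.execute(f"SELECT * FROM {table}").fetchall()


def realtime_collect(log_queue, sink):
    """Real-time mode: consume log events one by one as they arrive."""
    while True:
        try:
            event = log_queue.get_nowait()
        except queue.Empty:
            break
        sink.append(event)


def demo():
    # In-memory SQLite stands in for the source DBMS (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "an"), (2, "binh")])
    rows = batch_collect(conn, "users")

    # queue.Queue stands in for a Kafka topic (assumption).
    topic = queue.Queue()
    for line in ["GET /home", "GET /video/42"]:
        topic.put(line)
    collected = []
    realtime_collect(topic, collected)
    return rows, collected
```

Batch collecting moves a whole snapshot on a schedule, while real-time collecting handles each event shortly after it is produced.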
Collecting → (Storing) → Processing → Analyzing → Learning → Reacting
Storing Concepts
● Clusters
● Scale-Up vs Scale-Out
● File Systems and Distributed File Systems
● NoSQL
● Sharding
● Replication
● Sharding and Replication
● CAP Theorem
Clusters
Scale-Up vs Scale-Out
Database in Big Data
NoSQL
Sharding
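A minimal sketch of hash-based sharding, assuming a toy key-value store in which each shard is a plain dictionary. The class and method names are hypothetical, and real systems may use range-based or directory-based sharding instead.

```python
import hashlib


class ShardedStore:
    """Toy key-value store that spreads keys across N shards by hash."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def shard_for(self, key):
        # Use a stable hash rather than Python's per-process randomized hash(),
        # so the same key maps to the same shard on every run.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self.shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self.shard_for(key)].get(key)
```

Each key lives on exactly one shard, so reads and writes for different keys can be served by different machines.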
Replication (Master-Slave)
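A minimal sketch of master-slave replication, assuming synchronous copying of every write to every slave. The class names are hypothetical, and real systems often replicate asynchronously.

```python
class Replica:
    """A node that holds a full copy of the data and serves reads."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Master(Replica):
    """The only node that accepts writes; it copies each write to its slaves."""

    def __init__(self, slaves):
        super().__init__()
        self.slaves = slaves

    def write(self, key, value):
        self.data[key] = value
        # Synchronous replication (assumption): update every slave immediately.
        for slave in self.slaves:
            slave.data[key] = value
```

Writes go through the master only, while reads can be spread across the slaves.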
Replication (Peer-to-Peer)
CAP Theorem
Collecting → Storing → (Processing) → Analyzing → Learning → Reacting
Processing concepts
● Parallel Data Processing
● Distributed Data Processing
● Hadoop
● Processing Workloads
● Cluster
● Processing in Batch Mode
● Processing in Realtime Mode
Parallel Data Processing
Distributed Data Processing
Hadoop
Hadoop is a versatile framework that provides both processing and storage capabilities.
Batch processing (offline processing)
Transactional processing
Cluster
Map and Reduce Tasks
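The map and reduce tasks can be illustrated with the classic word-count example. This is a single-process sketch in plain Python, not Hadoop code; the function names are hypothetical.

```python
from collections import defaultdict


def map_task(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over the input lines."""
    pairs = [pair for line in lines for pair in map_task(line)]
    return dict(reduce_task(key, values) for key, values in shuffle(pairs).items())
```

In a cluster, map tasks run in parallel on different blocks of the input and reduce tasks run in parallel on different groups of keys.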
Processing in Realtime Mode
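In realtime mode, results are updated as each event arrives instead of after a full batch run. A minimal sketch, assuming events arrive in time order with numeric timestamps in seconds (the class name is hypothetical):

```python
from collections import deque


class SlidingWindowCounter:
    """Count the events seen during the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def add(self, timestamp):
        """Record one event; timestamps must arrive in non-decreasing order."""
        self.timestamps.append(timestamp)
        self._evict(timestamp)

    def count(self, now):
        """Return how many recorded events are still inside the window."""
        self._evict(now)
        return len(self.timestamps)

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self.timestamps and self.timestamps[0] <= now - self.window_seconds:
            self.timestamps.popleft()
```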
When a standard relational database (Oracle, MySQL, ...) is not good enough
Example: the “analytics system” MySQL database of a startup, tracking all actions in mobile games (iOS, Android, ...)
3 common problems in Big Data Systems
1. Size: the volume of the datasets is a critical factor.
2. Complexity: the structure, behaviour and permutations of the datasets are a critical factor.
3. Technologies: the tools and techniques used to process a sizable or complex dataset are a critical factor.
Key ideas of Lambda Architecture in Big Data System
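As commonly described, the Lambda Architecture combines a batch layer, a speed layer and a serving layer. A minimal sketch, assuming view-count events with a hypothetical `video` field (the function names are illustrative, not from the deck):

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute the full view from the immutable master dataset."""
    return Counter(event["video"] for event in master_dataset)


def realtime_view(recent_events):
    """Speed layer: cover only the events not yet absorbed by the batch layer."""
    return Counter(event["video"] for event in recent_events)


def query(batch, realtime, video):
    """Serving layer: answer a query by merging the batch and realtime views."""
    return batch[video] + realtime[video]
```

The batch view is accurate but stale, and the realtime view fills the gap until the next batch run completes.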
Practical case study
Chat bot with Video Recommendation Engine
Problem
● A company wants to develop a chat bot for news recommendation
● They want to classify data into 26 standard categories for user-friendly queries
● The engineering team has developed a data pipeline for the system
Solution Diagram
Big Data
is here
Author @tantrieuf31
Problem: Topic Classification for News
Solution Diagram
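The deck does not say which classification method was used. As a stand-in, the sketch below uses simple keyword matching; the three categories and their keywords are hypothetical, and a production system would more likely train a machine-learning text classifier on labelled articles.

```python
import re

# Hypothetical categories and keywords; the deck mentions 26 standard
# categories but does not list them.
CATEGORY_KEYWORDS = {
    "sports": {"match", "goal", "team", "league"},
    "technology": {"software", "data", "smartphone", "ai"},
    "business": {"market", "stock", "startup", "revenue"},
}


def classify(text, default="other"):
    """Return the category whose keywords overlap most with the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {
        category: len(words & keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```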
FAQ for students
How to learn Big Data ?
Job Opportunity
Reference resources
How to learn Big Data ?
1. Have lots of passion for and curiosity about data
2. Knowledge of data structures, statistics and basic maths
3. Love to solve complex problems with a data-driven mindset
4. Database knowledge: when to use NoSQL vs RDBMS
5. Knowledge about distributed computing
6. Linux / Open Source Tools
7. Programming language: Python / Java / SQL / JavaScript
8. English skills
Big Data Job Market is really hot
https://www.class-central.com/subject/big-data
Some good books for self-learning
● http://sachvui.com/ebook/du-lieu-lon-big-data.281.html
● https://drive.google.com/open?id=0B3dHGVpTXDOhQXJCR01PVkpQMGM
● https://drive.google.com/file/d/1rPvfio6EkaUvGtgfQoq9p9Fa2ljOMIn1/view?usp=sharing
● https://drive.google.com/open?id=0B3dHGVpTXDOhVTBKX09NUnlLcm8
Free MOOC
https://www.class-central.com/subject/big-data