Big Data Applications
Juan Pablo Paz Grau, PhD, PMP.
Juan Pablo Paz Grau, PhD, PMP
Systems Engineer
Specialist in Information Systems Management
PhD in Software Engineering
Certified in ITIL Foundation, PMP
Currently, I work at LG CNS Colombia
LG CNS Colombia is the IT partner of the SIRCI operation
The SIRCI Operation = Transmilenio Operation
Transmilenio is the world-renowned reference for BRT systems
The biggest public traffic system operation in Colombia
Presentation Agenda
1. What is Big Data?
2. Large Dataset Management Techniques
3. Hadoop Cluster Architecture
4. Closing the Loop: Real Time Cluster Architecture
5. The Development Process for Big Data Systems
6. Showcase of Big Data Tools for Public Traffic Systems
What is Big Data?
The DIKW Triangle
What is Big Data?
Information displayed to final users
Data generated to provide information displayed to final users
…
What is Big Data?
• Organizations produce lots of
data while they operate their
Information Systems
• Log files
• Access log files
• Debug log files
• Temporal, transient data
• Transactional data
• Usually, this data is stored
temporarily only for debugging
or incident analysis purposes
• With the increasing capacity to
store data, this data is being
reviewed and considered a
valuable source of information
Large Dataset Management Techniques
Very small intro to Hadoop
Cheap, reliable storage of
big datasets in commodity
hardware
A framework to parallelize
big data processing and
analysis
What is Hadoop?
Large Dataset
Large Dataset Management Techniques
Very small intro to Hadoop: Hadoop Distributed File System (HDFS)
File is split into
data blocks
File metadata and block
locations are stored in the
name node
Data blocks are physically
stored in data nodes
Block B:
• If Data Node 0 fails, there is another
copy in the same rack at Data Node 1
• If the rack fails, there is still another
copy in another rack at Data Node 2
Rack 1 Rack 2
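The block layout on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, assuming a hypothetical two-rack cluster with node names `dn0` to `dn3` and a file called `trips.log`; the placement rule mirrors the slide (two copies in one rack, one copy in another) and is not HDFS's actual placement policy or API.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Split a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(block_index: int, racks: dict) -> list:
    """Pick three data nodes: two in one rack, one in a different rack.

    Simplified rule that mirrors the slide; it assumes at least two racks
    with at least two nodes each.
    """
    rack_names = sorted(racks)
    local = rack_names[block_index % len(rack_names)]
    remote = rack_names[(block_index + 1) % len(rack_names)]
    return [racks[local][0], racks[local][1], racks[remote][0]]


def surviving_replicas(locations: list, failed_nodes: set) -> list:
    """Return the replica locations that are still reachable."""
    return [node for node in locations if node not in failed_nodes]


# Hypothetical cluster: two racks with two data nodes each.
racks = {"rack1": ["dn0", "dn1"], "rack2": ["dn2", "dn3"]}

# The "name node" keeps only metadata: file name -> replica locations per block.
blocks = split_into_blocks(b"x" * 10, block_size=4)
name_node = {"trips.log": [place_replicas(i, racks) for i in range(len(blocks))]}

block_b = name_node["trips.log"][0]
after_node_failure = surviving_replicas(block_b, {"dn0"})
after_rack_failure = surviving_replicas(block_b, set(racks["rack1"]))
```

Losing `dn0` still leaves a copy in the same rack, and losing all of `rack1` still leaves the copy in `rack2`, which are the two failure cases the slide describes.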
Large Dataset Management Techniques
• Very small intro to Hadoop: Map Reduce
Map: Select data that
matches a given criterion
(Status = Trip). The map
function returns a set of
{Key,Value} pairs
Shuffle: Collect and
sort the mapped pairs
Reduce: Apply a
reduce function (Sum
distance) for each key
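The three phases above can be sketched in plain Python, without Hadoop. The record layout (bus id, status, distance in km) and the sample values are assumptions made for illustration; a real job would run the map and reduce functions in parallel across the cluster.

```python
from collections import defaultdict

# Hypothetical input records: (bus_id, status, distance_km)
records = [
    ("bus-1", "Trip", 12.5),
    ("bus-2", "Idle", 0.0),
    ("bus-1", "Trip", 7.5),
    ("bus-2", "Trip", 3.0),
]


def map_phase(rows):
    """Map: keep rows with Status = Trip and emit (key, value) pairs."""
    for bus_id, status, distance in rows:
        if status == "Trip":
            yield bus_id, distance


def shuffle_phase(pairs):
    """Shuffle: collect the values of each key, then sort by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: sum the distances for each key."""
    return {key: sum(values) for key, values in grouped}


totals = reduce_phase(shuffle_phase(map_phase(records)))
print(totals)
```

The idle record is dropped in the map phase, so only trip distances reach the reducer.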
Large Dataset Management Techniques
Very small intro to Hadoop: The Hadoop ecosystem
• Currently, there is a plethora of tools to work
with Big Data on top of Hadoop.
• The selection of tools and frameworks will vary
depending on the implementation of the cluster.
Hadoop Cluster Architecture
The Lambda Architecture
Application
Data Access
Batch | Speed
Data
• Data layer: A data model and a set of data stored
following the data model. The data model should
be designed for the targeted subsystem.
• Batch layer: The computation layer that
processes data to turn facts into views for
querying the underlying stored data.
• Speed layer: A real-time computation layer that
compensates for the latency of the batch layer.
• Data Access layer: The engines, tools and
drivers that expose views to applications and
manage queries.
• Application layer: The front-end application or
applications that present information to users of
the Big Data system.
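How the batch and speed layers combine at query time can be sketched as follows. This is a minimal in-memory illustration, assuming hypothetical facts of the form (timestamp, route, passenger count) and a made-up batch cutoff; in a real cluster the batch view would come from Hadoop jobs and the speed view from a real-time store.

```python
from collections import defaultdict


def build_view(facts):
    """Turn raw facts into a queryable view: total value per key."""
    totals = defaultdict(int)
    for _timestamp, key, value in facts:
        totals[key] += value
    return dict(totals)


# Hypothetical facts: (timestamp, route, passengers)
facts = [
    (1, "route-A", 10),
    (2, "route-B", 5),
    (3, "route-A", 7),
    (4, "route-A", 2),
]

# Assumption: the last batch run only covered facts up to this timestamp.
BATCH_CUTOFF = 2

# Batch layer: slow, complete recomputation over the stored history.
batch_view = build_view(f for f in facts if f[0] <= BATCH_CUTOFF)

# Speed layer: covers only the facts that arrived after the last batch run.
speed_view = build_view(f for f in facts if f[0] > BATCH_CUTOFF)


def query(key):
    """Data access layer: merge both views to answer with up-to-date totals."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)
```

Here `query("route-A")` adds the batch total to the two recent facts the batch run has not yet seen, which is how the speed layer compensates for batch latency.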
Hadoop Cluster Architecture
Data Serialization
[Diagram labels: Source System (×4), Data Serialization (×3), Raw Data, Data Lake]
Data Access: Hive, Hadoop Data Warehouse
Hadoop Cluster Architecture
• Built on top of Hadoop
• Eases the tasks of managing data in Hadoop
• Manage files and schemas as tables
• Internal tables: Files managed by Hive
• External tables: Files located outside
of Hive but which can be analyzed with
Hive
• Provides a SQL-like language to query data
stored in files
• Translates HiveQL requests
into MapReduce jobs
HiveQL
Load Transform Dump
Data Access: Pig, Data Processing Language
Hadoop Cluster Architecture
• Built on top of Hadoop
• Eases the tasks of data processing and
analysis
• Capable of working with any type of data
source
• Provides a scripting language to process and
transform data
Pig Latin
Hadoop Cluster Architecture
Hive
• Works with structured data
• Can index data
• HiveQL, a SQL-like access language
• Turns the HiveQL input into MapReduce
jobs
Pig
• Works with structured/unstructured data
• Cannot index data
• Pig Latin, a scripting language
• Turns the Pig Latin input into MapReduce
jobs
Hive / Pig Comparison
Closing the Loop: Real Time Cluster Architecture
Why?
1. Hadoop is intended to store history, not changing data (write
once, read many times)
2. Batch processing of data usually takes a long time to produce
summarized output data
3. The capability to provide real-time processing of Big Data is also
desirable in the Lambda architecture
4. There is a need to implement a solution to cope with the time
between data being available in the Hadoop cluster and new data being
generated
[Timeline diagram labels: Data available in Hadoop; New data being created; New data stored in Hadoop; Data Gap; Time]
Closing the Loop: Real Time Cluster Architecture
Cassandra: Accessing the Cluster
CQL Driver
CQL
1. Access used to be through a Thrift client; now it is through a CQL client
2. CQL (Cassandra Query Language), a very small subset of SQL
3. The driver is not JDBC-like!
Cassandra: Data Model
1. Row oriented, instead of column oriented
2. Each row is identified by a key
3. Each key accesses a collection of columns
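The data model described above can be pictured as a dictionary of rows, where each row key leads to its own collection of columns. The sketch below is only an in-memory analogy in Python with hypothetical row keys and column names; it does not use the Cassandra driver or CQL.

```python
def upsert(table, row_key, **columns):
    """Write columns into the row identified by row_key, creating it if needed."""
    table.setdefault(row_key, {}).update(columns)


def select(table, row_key, *names):
    """Read all columns of a row, or only the named ones that exist."""
    row = table.get(row_key, {})
    if not names:
        return dict(row)
    return {name: row[name] for name in names if name in row}


# Hypothetical table keyed by bus id.
buses = {}
upsert(buses, "bus-1", route="A", speed=40)
upsert(buses, "bus-2", route="B")   # rows need not share the same columns
upsert(buses, "bus-1", speed=35)    # a later write overwrites the column

bus_1 = select(buses, "bus-1")
bus_2_speed = select(buses, "bus-2", "speed")
```

Every read and write starts from the row key, which is why key design matters so much in this kind of store.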
The Development Process for Big Data Systems
Development Process: System Implementation
Hadoop Cluster Architecture
Master Node
• Resource Manager
• Name Node
• Hive Server
• Sqoop
• Apache Tomcat
• MySQL Server
Worker Nodes (4), each running:
• Data Node
• Node Manager
• Cassandra Node
Now, we have the cluster services up and running,
and data is flowing into our Big Data repository.
What's next?
Showcase of Big Data Tools for Public Traffic Systems


