Big Data on AWS 
Johann Romefort
Agenda 
• What is Big Data? 
• What is AWS? 
• Presenting the tools: How Big Data and AWS fit 
together
What is Big Data? 
• It sits at the intersection of the three Vs of data:
• Velocity (Batch / Real time / Streaming)
• Volume (Terabytes / Petabytes)
• Variety (structured / semi-structured / unstructured)
Why is everybody talking about it? 
• Cost of generation of data has gone down 
• By 2015, 3B people will be online, pushing data 
volume created to 8 zettabytes 
• More data = More insights = Better decisions 
• Processing is getting easier and cheaper thanks to
cloud platforms
Data flow and constraints 
Generate 
Ingest / Store 
Process 
Visualize / Share 
The 3 Vs introduce
heterogeneity, which
makes each of these
steps hard to achieve
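The four steps above can be sketched as a chain of small functions. This is a minimal single-process illustration, not an AWS API: the event formats, field names, and function names are all assumptions made for the example. It shows how Variety (two input shapes) already complicates the ingest step.

```python
from collections import Counter
import json

def generate():
    """Generate: hypothetical events in mixed shapes (the Variety problem)."""
    return [
        '{"sensor": "a", "temp": 21.5}',  # structured JSON
        '{"sensor": "b", "temp": 23.0}',
        'sensor=a temp=22.5',             # semi-structured key=value text
    ]

def ingest(raw_events):
    """Ingest / Store: normalise every event to a dict before storing it."""
    stored = []
    for raw in raw_events:
        if raw.startswith("{"):
            stored.append(json.loads(raw))
        else:
            fields = dict(part.split("=") for part in raw.split())
            stored.append({"sensor": fields["sensor"], "temp": float(fields["temp"])})
    return stored

def process(records):
    """Process: average temperature per sensor."""
    totals, counts = Counter(), Counter()
    for record in records:
        totals[record["sensor"]] += record["temp"]
        counts[record["sensor"]] += 1
    return {sensor: totals[sensor] / counts[sensor] for sensor in totals}

def share(summary):
    """Visualize / Share: render a plain-text report."""
    return [f"{sensor}: {avg:.1f}" for sensor, avg in sorted(summary.items())]

report = share(process(ingest(generate())))
```

Each function stands in for a whole layer; in practice every layer has to cope with its own volume and velocity constraints as well.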
What is AWS? 
• AWS is a cloud computing platform 
• On-demand delivery of IT resources 
• Pay-as-you-go pricing model
Cloud Computing 
Compute + Storage + Networking 
Adapts dynamically to ever-changing 
needs, closely matching the 
requirements of user infrastructure 
and applications
How does AWS help 
with Big Data? 
• Removes constraints on the ingestion, storage, and 
processing layers, and adapts closely to demand. 
• Provides a collection of integrated tools to address 
the 3 Vs of Big Data 
• Virtually unlimited storage capacity and processing 
power suit changing data storage and analysis 
requirements.
Computing Solutions 
for Big Data on AWS 
EC2 EMR 
Kinesis 
Redshift
Computing Solutions 
for Big Data on AWS 
EC2 
All-purpose computing instances. 
Dynamic provisioning and resizing 
Lets you scale your infrastructure 
at low cost 
Use Case: Well suited for running custom or proprietary 
applications (e.g. SAP HANA, Tableau…)
Computing Solutions 
for Big Data on AWS 
EMR 
‘Hadoop in the cloud’ 
Adapts to the complexity of the analysis 
and the volume of data to process 
Use Case: Offline processing of very large volumes of data, 
possibly unstructured (variable Variety)
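Hadoop's core programming model is MapReduce. A minimal single-process sketch of its three phases, using word count as the job, is shown here. The function names are illustrative and this is not the EMR or Hadoop API; on a real cluster the map and reduce phases run in parallel across many nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data on AWS", "Big Data tools"])
```

Because each line is mapped independently and each key is reduced independently, the work splits naturally across machines, which is what lets the cluster size follow the data volume.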
Computing Solutions 
for Big Data on AWS 
Kinesis 
Stream Processing 
Real-time data 
Scales to adapt to the flow of 
inbound data 
Use Case: Complex event processing, clickstreams, 
sensor data, computation over time windows
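"Computation over a window of time" can be illustrated with a tumbling (fixed, non-overlapping) window that counts events per key. This is a plain-Python sketch, not the Kinesis API; the `(timestamp, key)` record shape and the function name are assumptions for the example.

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping time window.

    `events` is an iterable of (timestamp_seconds, key) pairs, a hypothetical
    record shape chosen for this sketch.
    """
    counts = Counter()
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)

clicks = [(0, "home"), (12, "home"), (59, "cart"), (61, "home"), (119, "cart")]
per_minute = tumbling_window_counts(clicks, window_seconds=60)
```

In a streaming setting the same logic runs incrementally as records arrive, and a window's result is emitted once the window closes.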
Computing Solutions 
for Big Data on AWS 
Redshift 
Data Warehouse in the cloud 
Scales to Petabytes 
Supports SQL Querying 
Start small for just $0.25/hour 
Use Case: BI analysis; use of legacy ODBC/JDBC software 
to analyze or visualize data
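To make the pay-as-you-go figure concrete, here is a rough cost sketch based on the $0.25/hour entry price quoted above. It assumes simple linear hourly billing for a single node; actual pricing depends on node type, region, and date, so treat the numbers as illustrative only.

```python
HOURLY_RATE_USD = 0.25  # entry price quoted on the slide

def on_demand_cost(hours, nodes=1, hourly_rate=HOURLY_RATE_USD):
    """Pay-as-you-go cost: billed only for the node-hours actually used."""
    return hours * nodes * hourly_rate

# Assumption: a 30-day month with the node running around the clock.
always_on_month = on_demand_cost(hours=24 * 30)
# Assumption: the node runs 8 hours a day on 22 working days.
business_hours_month = on_demand_cost(hours=8 * 22)
```

Under these assumptions an always-on node costs 180.0 USD a month, while running it only during business hours costs 44.0 USD.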
Storage Solution 
for Big Data on AWS 
DynamoDB Redshift 
S3 Glacier
Storage Solution 
for Big Data on AWS 
DynamoDB 
NoSQL Database 
Consistent 
Low latency access 
Column-based flexible 
data model 
Use Case: Offline processing of very large volumes of data, 
possibly unstructured (variable Variety)
Storage Solution 
for Big Data on AWS 
S3 
Versatile storage system 
Low-cost 
Fast data retrieval 
Use Case: Backups and disaster recovery, media storage, 
storage for data analysis
Storage Solution 
for Big Data on AWS 
Glacier 
Archive storage for cold data 
Extremely low-cost 
Optimized for infrequently 
accessed data 
Use Case: Storing raw logs. Storing media archives. 
Magnetic tape replacement
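The split between S3 (fast retrieval) and Glacier (cold archive) suggests a simple tiering rule based on how recently data was accessed. The sketch below is an assumption-laden illustration: the 90-day threshold, the function name, and the object names are hypothetical, not AWS defaults or APIs.

```python
def choose_storage_tier(days_since_last_access, archive_after_days=90):
    """Pick a tier for an object based on how recently it was read.

    The 90-day threshold is an illustrative assumption, not an AWS default.
    """
    if days_since_last_access < 0:
        raise ValueError("days_since_last_access must be non-negative")
    return "glacier" if days_since_last_access >= archive_after_days else "s3"

# Hypothetical objects mapped to the number of days since they were last read.
objects = {"raw-logs-2013.gz": 400, "dashboard-input.csv": 2}
placement = {name: choose_storage_tier(age) for name, age in objects.items()}
```

The trade-off is retrieval time against storage price: data that is still queried stays in S3, while raw logs kept only for reference move to the archive tier.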
What makes AWS different 
when it comes to big data?
Integrated Environment for Big Data 
Given the 3 Vs, you usually need a collection of tools 
for your data processing and storage. 
AWS Big Data solutions already come integrated with one 
another 
AWS Big Data solutions also integrate with the whole AWS 
ecosystem (Security, Identity Management, Logging, Backups, 
Management Console…)
Example of products interacting with 
each other.
Tightly integrated, rich 
environment of tools 
+ 
On-demand scaling that tracks 
processing requirements 
= 
Extremely cost-effective and easy-to-deploy 
solution for big data needs
Use Case: 
Real-time IoT Analytics 
Gathering data in real time from sensors deployed in a 
factory and sending it for immediate processing 
• Error detection: real-time detection of hardware 
problems 
• Optimization and energy management
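Real-time error detection of this kind can be sketched as a rule evaluated over a sliding time window of sensor readings. The class name, the 10-second window, and the threshold of 80.0 are hypothetical values chosen for the example; the talk does not specify the actual rules used.

```python
from collections import deque

class SlidingWindowRule:
    """Fire when the mean of recent readings exceeds a threshold."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.readings = deque()  # (timestamp, value) pairs, oldest first

    def observe(self, timestamp, value):
        """Add one reading and return True if the rule fires."""
        self.readings.append((timestamp, value))
        # Drop readings that have fallen out of the time window.
        while self.readings and self.readings[0][0] <= timestamp - self.window_seconds:
            self.readings.popleft()
        mean = sum(v for _, v in self.readings) / len(self.readings)
        return mean > self.threshold

rule = SlidingWindowRule(window_seconds=10, threshold=80.0)
readings = [(0, 70.0), (4, 75.0), (8, 90.0), (12, 95.0), (16, 99.0)]
alerts = [t for t, v in readings if rule.observe(t, v)]
```

Averaging over the window rather than testing single readings avoids alerting on one noisy sample, at the cost of a short detection delay.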
First Version of the 
infrastructure 
[Architecture diagram. Labels, in reading order: aggregate sensor data; Node.js stream processor; on customer site; evaluate rules over time window; MongoDB; feed algorithm; in-house Hadoop cluster; write raw data for further processing; backup]
Second Version of the 
infrastructure 
[Architecture diagram. Labels, in reading order: aggregate sensor data; on customer site; evaluate rules over time window; write raw data for archiving; Kinesis; Redshift for BI analysis; Glacier]
Thank You 
romefort@gmail.com 
follow me on @romefort
