2019
Data Lakes & Analytics on AWS
Laura Mariana Caicedo
Solutions Architect
Lauracai10
What to expect from this session
• The importance of data
• The big data process
• Data lakes and AWS Lake Formation
• Choosing the right tool for analysis
Data is a strategic asset for every organization
The world’s most valuable resource is data*
*Copyright: The Economist, 2017, David Parkins
The move toward data-centric companies
Five largest companies by market cap* in 2001, 2006, 2011, 2016, and 2018 (chart omitted; values range from $228B to $1.091T).
Thinking about data as an asset, not a cost
• Stop throwing data away
• Make it available to more users
• Arm users with more data processing technologies
There is more data than people think
• Data grows >10x every 5 years
• Data platforms need to scale 1,000x
• Data needs to live for 15 years
There are more people accessing data, and more requirements for making data available
• Consumers: data scientists, analysts, business users, applications
• Requirements: secure, real time, flexible, scalable
Big Data
Types of big data analytics
• Batch/interactive
• Stream processing
• Machine learning
Plethora of tools
Simplify big data processing across the pipeline: Collect → Store → Process/Analyze → Consume
Key dimensions: time to answer (latency), throughput, and cost
Analytics used to look like this
OLTP, ERP, CRM, and LOB sources feed a data warehouse, which feeds business intelligence
• Relational data
• TBs–PBs scale
• Schema defined prior to data load
• Operational reporting and ad hoc queries
• Large initial CAPEX plus $10K–$50K/TB/year
A data lake is a centralized repository that
allows you to store all your structured and
unstructured data at any scale
Why data lakes?
Data lakes provide:
• Relational and non-relational data
• Scale-out to exabytes (EBs)
• A diverse set of analytics and machine learning tools
• The ability to work on data without any data movement
• A design for low-cost storage and analytics
In this architecture, OLTP, ERP, CRM, and LOB systems feed the data warehouse, while devices, web, sensors, and social sources land in the data lake; a catalog sits on top, serving business intelligence, machine learning, DW queries, big data processing, and interactive and real-time workloads.
Data Stores: What’s the Difference?
• OLTP (Online Transaction Processing)
Characterized by a large number of short transactions (INSERT, UPDATE, DELETE) that serve as the persistence layer for applications, e.g. Aurora, MySQL, PostgreSQL, etc. Typically a row-store architecture.
• OLAP (Online Analytical Processing)
Characterized by a relatively low volume of transactions; queries are often complex and involve aggregations against large historical datasets for data-driven decision making, e.g. Amazon Redshift, Greenplum, etc. Typically a column-store architecture.
• Data Lake
An architectural paradigm that allows customers to store all of their data in a single unified place where they can collect and store any data, at any scale, and at low cost. Data lakes complement (not replace) other data stores such as data warehouses, e.g. an S3 data lake.
In the accompanying diagram, users and applications write to OLTP stores (PostgreSQL, Amazon Aurora, business applications on Amazon EC2); ETL tools such as AWS Glue move data from OLTP, ERP, CRM, and LOB sources into the OLAP tier (Amazon Redshift) and the data lake, feeding BI tools and dashboards such as Amazon QuickSight.
Amazon S3 | AWS Glue: any analytic workload, any scale, at the lowest possible cost
Data movement (on premises): AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service
Data movement (real time): AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams
Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight
Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly
Data lake: Amazon S3, AWS Glue
There are lots of ingestion tools
Data sources (transactions, web logs/cookies, ERP, connected devices) flow into Amazon S3, optionally via S3 Transfer Acceleration, before the process and consume stages.
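As a minimal sketch of two common ingestion paths into S3 (the bucket name `my-data-lake-raw` and the Firehose delivery stream `clickstream-to-s3` below are hypothetical), data can be uploaded directly or pushed through Amazon Kinesis Data Firehose, which buffers and delivers it to S3:

```python
import json
import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

# Batch path: upload a local extract (e.g. an ERP export) straight into the raw zone.
s3.upload_file(
    Filename="orders_2019-05-01.csv",
    Bucket="my-data-lake-raw",                 # hypothetical bucket name
    Key="erp/orders/dt=2019-05-01/orders.csv", # partition-style prefix
)

# Streaming path: push individual events (e.g. web log records) through Firehose.
event = {"user_id": 42, "page": "/checkout", "ts": "2019-05-01T12:00:00Z"}
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",    # hypothetical stream name
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```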
Typical steps of building a data lake
Setup storage1
Move data2
Cleanse, prep, and
catalog data
3
Configure and enforce
security and compliance
policies
4
Make data available
for analytics
5
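As a rough sketch of step 1 and part of step 3 (the bucket and Glue database names are hypothetical), the storage and catalog pieces can be created with boto3:

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
glue = boto3.client("glue", region_name="us-east-1")

# Step 1: set up storage for the data lake (raw zone).
s3.create_bucket(Bucket="my-data-lake-raw")  # hypothetical bucket name

# Step 3 (partially): create a Glue Data Catalog database that cleaned,
# cataloged tables will be registered into.
glue.create_database(
    DatabaseInput={
        "Name": "datalake_raw",              # hypothetical database name
        "Description": "Raw zone tables for the S3 data lake",
    }
)
```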
Building data lakes can still take months
Data preparation accounts for ~80% of the work: building training sets, cleaning and organizing data, collecting data sets, mining data for patterns, refining algorithms, and other tasks.
Sample of steps required
Find sources
Create Amazon Simple Storage Service (Amazon S3) locations
Configure access policies
Map tables to Amazon S3 locations
ETL jobs to load and clean data
Create metadata access policies
Configure access from analytics services
Rinse and repeat for other:
data sets, users, and end-services
And more:
manage and monitor ETL jobs
update metadata catalog as data changes
update policies across services as users and permissions change
manually maintain cleansing scripts
create audit processes for compliance
…
Manual | Error-prone | Time consuming
Build a secure data lake in days with AWS Lake Formation
• Identify, ingest, clean, and transform data
• Enforce security policies across multiple services
• Gain and manage new insights
How it works
Register existing data or import new
• Amazon S3 forms the storage layer for Lake Formation
• Register existing S3 buckets that contain your data
• Or ask Lake Formation to create the required S3 buckets and import data into them
• Data is stored in your account; you have direct access to it. No lock-in.
Lake Formation components: data lake storage, data catalog, access control, and data import (crawlers and ML-based data prep).
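A minimal sketch of registering an existing S3 location with Lake Formation (the bucket below is hypothetical; this variant assumes the service-linked role for S3 access):

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Register an existing S3 path as data lake storage so Lake Formation
# can manage access to it.
lakeformation.register_resource(
    ResourceArn="arn:aws:s3:::my-data-lake-raw",  # hypothetical bucket
    UseServiceLinkedRole=True,
)
```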
Easily load data to your data lake
Blueprints load data from sources such as logs and databases into data lake storage, as one-shot or incremental imports. The Lake Formation data import layer (crawlers and ML-based data prep) populates the data catalog, governed by access control.
With blueprints
You:
1. Point us to the source
2. Tell us the location to load to in your data lake
3. Specify how often you want to load the data
Blueprints:
1. Discover the source table(s) schema
2. Automatically convert to the target data format
3. Automatically partition the data based on the partitioning schema
4. Keep track of data that was already processed
5. You can customize any of the above
Easily de-duplicate your data with ML transforms
Secure once, access in multiple ways
An administrator defines permissions once in Lake Formation (over data lake storage, the data catalog, and access control), and they apply across the services that access the data.
Security permissions in Lake Formation
• Control data access with simple grant and revoke permissions
• Specify permissions on tables and columns rather than on buckets and objects
• Easily view the policies granted to a particular user
• Audit all data access in one place
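As a hedged sketch of a table- and column-level grant (the database, table, column, and IAM role names below are hypothetical):

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Grant SELECT on two columns of a catalog table to an analyst role,
# instead of granting S3 bucket/object permissions directly.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analysts"  # hypothetical role
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "datalake_raw",          # hypothetical database
            "Name": "orders",                        # hypothetical table
            "ColumnNames": ["order_id", "order_total"],
        }
    },
    Permissions=["SELECT"],
)
```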
Process/analyze
• Amazon Redshift & Amazon Redshift Spectrum
  • Managed data warehouse
  • Spectrum enables querying data in S3
• Amazon Athena
  • Serverless interactive query service
• Amazon EMR
  • Managed Hadoop framework for running Apache Spark, Flink, Presto, Tez, Hive, Pig, HBase, and others
Serverless Query Processing with Amazon Athena
• Serverless query service for querying data in S3 using standard SQL, with no infrastructure to manage
• No data loading required; query directly from Amazon S3
• Standard ANSI SQL with support for joins, JSON, and window functions
• Support for multiple data formats, including text, CSV, TSV, JSON, Avro, ORC, and Parquet
• Pay per query, based on data scanned; if you compress your data, you pay less and your queries run faster
Querying data in Amazon Athena
Either create a crawler to auto-generate the schema, or write DDL via the Athena console, API, or JDBC/ODBC driver, then start querying data.
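A minimal sketch of the second path (DDL plus a query submitted through the API); the database, table, and result bucket names are hypothetical, and the crawler path would instead register the schema automatically:

```python
import time
import boto3

athena = boto3.client("athena")

def run_query(sql: str) -> str:
    """Submit a query and wait for it to finish; returns the execution id."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "datalake_raw"},                 # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return qid
        time.sleep(1)

# DDL: declare a schema over data already sitting in S3 (no loading step).
run_query("""
    CREATE EXTERNAL TABLE IF NOT EXISTS orders (
        order_id bigint, order_total double, order_date string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-data-lake-raw/erp/orders/'
""")

# Query directly against S3 with standard SQL.
qid = run_query("SELECT order_date, SUM(order_total) FROM orders GROUP BY order_date")
rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
```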
Amazon Redshift
• Relational data warehouse
• Massively parallel; petabyte scale
• Fully managed
• HDD and SSD platforms
• $1,000/TB/year; starts at $0.25/hour
A lot faster, a lot simpler, a lot cheaper
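As an illustrative sketch (not from the deck) of how Spectrum extends the warehouse to S3, the SQL below defines an external schema over the Glue catalog and then queries S3-resident data; it is shown submitted through the Redshift Data API, with the cluster, database, user, role, and table names all hypothetical:

```python
import boto3

rsd = boto3.client("redshift-data")

def run_sql(sql: str) -> str:
    """Submit a statement to the cluster via the Redshift Data API."""
    return rsd.execute_statement(
        ClusterIdentifier="analytics-cluster",   # hypothetical cluster
        Database="dev",
        DbUser="awsuser",                        # hypothetical user
        Sql=sql,
    )["Id"]

# Expose Glue catalog tables (data in S3) as an external schema in Redshift.
run_sql("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'datalake_raw'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum'
""")

# Query S3 data through Spectrum; such tables can also be joined with
# tables stored locally in the cluster.
run_sql("""
    SELECT order_date, SUM(order_total)
    FROM spectrum.orders
    GROUP BY order_date
""")
```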
Amazon Redshift speed: three highlights
1. Machine-learning based acceleration
2. Result-set caching for local and data lake queries (sub-second repeat queries); Redshift Spectrum caches intermediate results that can benefit different queries
3. Constant improvements in performance for real-world workloads
Amazon Redshift is now 10x faster than it was two years ago, from over 200 features and enhancements released based on lessons learned from more than 10,000 customer deployments processing over 2 exabytes of data every day.
The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a
market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is
based on best available resources. Opinions reflect judgment at the time and are subject to change.
The Forrester Wave™: Enterprise Data Warehouse, Q4 2015 and Forrester Wave™: Big Data Warehouse, Q4 2018
AWS rated top in the leader bracket and received a score of 5/5 (the highest score possible) in a number of areas such as use cases, roadmap, market awareness, and ability to execute.
AWS positioned as a Leader in the Gartner Magic Quadrant for Data Management Solutions for Analytics (Gartner Magic Quadrant, 2018).
Semi-structured/Unstructured Data Processing with Amazon EMR
• Hadoop, Hive, Presto, Spark, Tez, Impala, etc.
• Release 5.2: Hadoop 2.7.3, Hive 2.1, Spark 2.0.2, Zeppelin, Presto, HBase 1.2.3 (including HBase on S3), Phoenix, Tez, Flink
• New applications added within 30 days of their open source release
• Fully managed, Auto Scaling clusters with support for on-demand and Spot pricing
• Support for HDFS and S3 file systems, enabling separated compute and storage; multiple clusters can run against the same data in S3
• HIPAA-eligible; support for end-to-end encryption, IAM/VPC, and S3 client-side encryption with customer-managed keys and AWS KMS
The Hadoop ecosystem can run in Amazon EMR
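A minimal sketch of launching a transient EMR cluster that runs one Spark step against data in S3 and then terminates (cluster name, script path, and buckets are hypothetical; the default EMR roles are assumed to exist):

```python
import boto3

emr = boto3.client("emr")

# Launch a transient cluster: it runs the Spark step and shuts down.
response = emr.run_job_flow(
    Name="datalake-spark-batch",                     # hypothetical cluster name
    ReleaseLabel="emr-5.20.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-emr-logs/",                      # hypothetical log bucket
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,        # terminate after the step
    },
    Steps=[{
        "Name": "aggregate-orders",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-scripts/aggregate_orders.py"],  # hypothetical script
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```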
Which analytics should I use? (Process/analyze)
Batch — takes minutes to hours
Example: daily/weekly/monthly reports
Amazon EMR (MapReduce, Hive, Pig, Spark)
Interactive — takes seconds
Example: self-service dashboards
Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)
Stream — takes milliseconds to seconds
Example: fraud alerts, one-minute metrics
Amazon EMR (Spark Streaming), Amazon Kinesis Data Analytics, Amazon KCL, AWS Lambda, and others (see the sketch after this list)
Predictive — takes milliseconds (real-time) to minutes (batch)
Example: fraud detection, forecasting demand, speech recognition
Amazon SageMaker, Amazon Polly, Amazon Rekognition, Amazon Transcribe, Amazon Translate, Amazon EMR (Spark ML), AWS Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK, and Caffe)
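For the streaming case, a hedged sketch of a producer pushing events into Amazon Kinesis Data Streams (the stream name and event fields are hypothetical); a consumer built with the KCL, Kinesis Data Analytics, or AWS Lambda would process the records downstream:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Push one event per call; records with the same partition key land on the
# same shard, preserving per-key ordering for downstream consumers.
event = {"card_id": "4111-xxxx", "amount": 1899.00, "ts": "2019-05-01T12:00:03Z"}
kinesis.put_record(
    StreamName="transactions",                     # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["card_id"],
)
```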
On a slow-to-fast latency spectrum: batch (Amazon EMR), interactive (Amazon Redshift & Spectrum, Amazon Athena, Presto on Amazon EMR, Amazon ES), stream/streaming (Amazon Kinesis Data Analytics, KCL apps, AWS Lambda), and predictive (Amazon ML).
Across the pipeline (collect, store, ETL, process/analyze, consume), Amazon QuickSight serves analysis and visualization for software engineers and business users, while data science and AI apps serve data scientists and data engineers.
Amazon QuickSight: Examples – Salesforce Analytics
Summary
• Data must be used in every organization
• Data lakes are very important for consuming structured and unstructured data
• Data lake governance (AWS Lake Formation)
• Analyze data with the right tool
• Serve different types of consumers