2019
Data Lakes & Analytics on AWS
Laura Mariana Caicedo
Solutions Architect
Lauracai10
What to expect from this session
• The importance of data
• The big data process
• Data lakes and AWS Lake Formation
• Choosing the right tool for analysis
Data is a strategic asset for every organization
The world’s most valuable resource is data*
*Copyright: The Economist, 2017, David Parkins
The move toward data-centric companies
Five largest companies by market cap* in 2001, 2006, 2011, 2016, and 2018 (chart omitted; values range from $228B to $1.091T).
Thinking about data as an asset, not a cost
• Stop throwing data away
• Make it available to more users
• Arm users with more data processing technologies
There is more data than people think
• Data grows >10x every 5 years
• Data platforms need to scale 1,000x
• Data needs to live for 15 years
There are more people accessing data, and more requirements for making data available
• Consumers: data scientists, analysts, business users, applications
• Requirements: secure, real time, flexible, scalable
Big Data
Types of big data analytics
• Batch/interactive
• Stream processing
• Machine learning
Plethora of tools
Simplify big data processing across the pipeline: Collect → Store → Process/Analyze → Consume
Key dimensions: time to answer (latency), throughput, and cost
Analytics used to look like this
OLTP, ERP, CRM, and LOB sources feed a data warehouse, which feeds business intelligence
• Relational data
• TBs–PBs scale
• Schema defined prior to data load
• Operational reporting and ad hoc queries
• Large initial CAPEX plus $10K–$50K/TB/year
A data lake is a centralized repository that
allows you to store all your structured and
unstructured data at any scale
Why data lakes?
Data lakes provide:
• Relational and non-relational data
• Scale-out to exabytes (EBs)
• A diverse set of analytics and machine learning tools
• The ability to work on data without any data movement
• A design for low-cost storage and analytics
In this architecture, OLTP, ERP, CRM, and LOB systems feed the data warehouse, while devices, web, sensors, and social sources land in the data lake; a catalog sits on top, serving business intelligence, machine learning, DW queries, big data processing, and interactive and real-time workloads.
Data Stores: What’s the Difference?
• OLTP (Online Transaction Processing)
Characterized by a large number of short transactions (INSERT, UPDATE, DELETE) that serve as the persistence layer for applications, e.g. Aurora, MySQL, PostgreSQL, etc. Typically a row-store architecture.
• OLAP (Online Analytical Processing)
Characterized by a relatively low volume of transactions; queries are often complex and involve aggregations against large historical datasets for data-driven decision making, e.g. Amazon Redshift, Greenplum, etc. Typically a column-store architecture.
• Data Lake
An architectural paradigm that allows customers to store all of their data in a single unified place where they can collect and store any data, at any scale, and at low cost. Data lakes complement (not replace) other data stores such as data warehouses, e.g. an S3 data lake.
In the accompanying diagram, users and applications write to OLTP stores (PostgreSQL, Amazon Aurora, business applications on Amazon EC2); ETL tools such as AWS Glue move data from OLTP, ERP, CRM, and LOB sources into the OLAP tier (Amazon Redshift) and the data lake, feeding BI tools and dashboards such as Amazon QuickSight.
Amazon S3 | AWS Glue: any analytic workload, any scale, at the lowest possible cost
Data movement (on premises): AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service
Data movement (real time): AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams
Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight
Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly
Data lake: Amazon S3, AWS Glue
There are lots of ingestion tools
Data sources (transactions, web logs/cookies, ERP, connected devices) flow into Amazon S3, optionally via S3 Transfer Acceleration, before the process and consume stages.
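As a minimal sketch of two common ingestion paths into S3 (the bucket name `my-data-lake-raw` and the Firehose delivery stream `clickstream-to-s3` below are hypothetical), data can be uploaded directly or pushed through Amazon Kinesis Data Firehose, which buffers and delivers it to S3:

```python
import json
import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

# Batch path: upload a local extract (e.g. an ERP export) straight into the raw zone.
s3.upload_file(
    Filename="orders_2019-05-01.csv",
    Bucket="my-data-lake-raw",                 # hypothetical bucket name
    Key="erp/orders/dt=2019-05-01/orders.csv", # partition-style prefix
)

# Streaming path: push individual events (e.g. web log records) through Firehose.
event = {"user_id": 42, "page": "/checkout", "ts": "2019-05-01T12:00:00Z"}
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",    # hypothetical stream name
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```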
Typical steps of building a data lake
Setup storage1
Move data2
Cleanse, prep, and
catalog data
3
Configure and enforce
security and compliance
policies
4
Make data available
for analytics
5
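As a rough sketch of step 1 and part of step 3 (the bucket and Glue database names are hypothetical), the storage and catalog pieces can be created with boto3:

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
glue = boto3.client("glue", region_name="us-east-1")

# Step 1: set up storage for the data lake (raw zone).
s3.create_bucket(Bucket="my-data-lake-raw")  # hypothetical bucket name

# Step 3 (partially): create a Glue Data Catalog database that cleaned,
# cataloged tables will be registered into.
glue.create_database(
    DatabaseInput={
        "Name": "datalake_raw",              # hypothetical database name
        "Description": "Raw zone tables for the S3 data lake",
    }
)
```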
Building data lakes can still take months
Data preparation accounts for ~80% of the work: building training sets, cleaning and organizing data, collecting data sets, mining data for patterns, refining algorithms, and other tasks.
Sample of steps required
Find sources
Create Amazon Simple Storage Service (Amazon S3) locations
Configure access policies
Map tables to Amazon S3 locations
ETL jobs to load and clean data
Create metadata access policies
Configure access from analytics services
Rinse and repeat for other:
data sets, users, and end-services
And more:
manage and monitor ETL jobs
update metadata catalog as data changes
update policies across services as users and permissions change
manually maintain cleansing scripts
create audit processes for compliance
…
Manual | Error-prone | Time consuming
Build a secure data lake in days with AWS Lake Formation
• Identify, ingest, clean, and transform data
• Enforce security policies across multiple services
• Gain and manage new insights
How it works
Register existing data or import new
• Amazon S3 forms the storage layer for Lake Formation
• Register existing S3 buckets that contain your data
• Or ask Lake Formation to create the required S3 buckets and import data into them
• Data is stored in your account; you have direct access to it. No lock-in.
Lake Formation components: data lake storage, data catalog, access control, and data import (crawlers and ML-based data prep).
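A minimal sketch of registering an existing S3 location with Lake Formation (the bucket below is hypothetical; this variant assumes the service-linked role for S3 access):

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Register an existing S3 path as data lake storage so Lake Formation
# can manage access to it.
lakeformation.register_resource(
    ResourceArn="arn:aws:s3:::my-data-lake-raw",  # hypothetical bucket
    UseServiceLinkedRole=True,
)
```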
Easily load data to your data lake
Blueprints load data from sources such as logs and databases into data lake storage, as one-shot or incremental imports. The Lake Formation data import layer (crawlers and ML-based data prep) populates the data catalog, governed by access control.
With blueprints
You:
1. Point us to the source
2. Tell us the location to load to in your data lake
3. Specify how often you want to load the data
Blueprints:
1. Discover the source table(s) schema
2. Automatically convert to the target data format
3. Automatically partition the data based on the partitioning schema
4. Keep track of data that was already processed
5. You can customize any of the above
Easily de-duplicate your data with ML transforms
Secure once, access in multiple ways
An administrator defines permissions once in Lake Formation (over data lake storage, the data catalog, and access control), and they apply across the services that access the data.
Security permissions in Lake Formation
• Control data access with simple grant and revoke permissions
• Specify permissions on tables and columns rather than on buckets and objects
• Easily view the policies granted to a particular user
• Audit all data access in one place
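As a hedged sketch of a table- and column-level grant (the database, table, column, and IAM role names below are hypothetical):

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Grant SELECT on two columns of a catalog table to an analyst role,
# instead of granting S3 bucket/object permissions directly.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analysts"  # hypothetical role
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "datalake_raw",          # hypothetical database
            "Name": "orders",                        # hypothetical table
            "ColumnNames": ["order_id", "order_total"],
        }
    },
    Permissions=["SELECT"],
)
```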
Process/analyze
• Amazon Redshift & Amazon Redshift Spectrum
  • Managed data warehouse
  • Spectrum enables querying data in S3
• Amazon Athena
  • Serverless interactive query service
• Amazon EMR
  • Managed Hadoop framework for running Apache Spark, Flink, Presto, Tez, Hive, Pig, HBase, and others
Serverless Query Processing with Amazon Athena
• Serverless query service for querying data in S3 using standard SQL, with no infrastructure to manage
• No data loading required; query directly from Amazon S3
• Standard ANSI SQL with support for joins, JSON, and window functions
• Support for multiple data formats, including text, CSV, TSV, JSON, Avro, ORC, and Parquet
• Pay per query, based on data scanned; if you compress your data, you pay less and your queries run faster
Querying data in Amazon Athena
Either create a crawler to auto-generate the schema, or write DDL via the Athena console, API, or JDBC/ODBC driver, then start querying data.
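A minimal sketch of the second path (DDL plus a query submitted through the API); the database, table, and result bucket names are hypothetical, and the crawler path would instead register the schema automatically:

```python
import time
import boto3

athena = boto3.client("athena")

def run_query(sql: str) -> str:
    """Submit a query and wait for it to finish; returns the execution id."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "datalake_raw"},                 # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return qid
        time.sleep(1)

# DDL: declare a schema over data already sitting in S3 (no loading step).
run_query("""
    CREATE EXTERNAL TABLE IF NOT EXISTS orders (
        order_id bigint, order_total double, order_date string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-data-lake-raw/erp/orders/'
""")

# Query directly against S3 with standard SQL.
qid = run_query("SELECT order_date, SUM(order_total) FROM orders GROUP BY order_date")
rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
```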
Amazon Redshift
• Relational data warehouse
• Massively parallel; petabyte scale
• Fully managed
• HDD and SSD platforms
• $1,000/TB/year; starts at $0.25/hour
A lot faster, a lot simpler, a lot cheaper
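As an illustrative sketch (not from the deck) of how Spectrum extends the warehouse to S3, the SQL below defines an external schema over the Glue catalog and then queries S3-resident data; it is shown submitted through the Redshift Data API, with the cluster, database, user, role, and table names all hypothetical:

```python
import boto3

rsd = boto3.client("redshift-data")

def run_sql(sql: str) -> str:
    """Submit a statement to the cluster via the Redshift Data API."""
    return rsd.execute_statement(
        ClusterIdentifier="analytics-cluster",   # hypothetical cluster
        Database="dev",
        DbUser="awsuser",                        # hypothetical user
        Sql=sql,
    )["Id"]

# Expose Glue catalog tables (data in S3) as an external schema in Redshift.
run_sql("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'datalake_raw'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum'
""")

# Query S3 data through Spectrum; such tables can also be joined with
# tables stored locally in the cluster.
run_sql("""
    SELECT order_date, SUM(order_total)
    FROM spectrum.orders
    GROUP BY order_date
""")
```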
Amazon Redshift speed: three highlights
1. Machine-learning based acceleration
2. Result-set caching for local and data lake queries (sub-second repeat queries); Redshift Spectrum caches intermediate results that can benefit different queries
3. Constant improvements in performance for real-world workloads
Amazon Redshift is now 10x faster than it was two years ago, from over 200 features and enhancements released based on lessons learned from more than 10,000 customer deployments processing over 2 exabytes of data every day.
The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a
market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is
based on best available resources. Opinions reflect judgment at the time and are subject to change.
The Forrester Wave™: Enterprise Data Warehouse, Q4 2015 and Forrester Wave™: Big Data Warehouse, Q4 2018
AWS rated top in the leader bracket and received a score of 5/5 (the highest score possible) in a number of areas such as use cases, roadmap, market awareness, and ability to execute.
AWS positioned as a Leader in the Gartner Magic Quadrant for Data Management Solutions for Analytics (Gartner Magic Quadrant, 2018).
Semi-structured/Unstructured Data Processing with Amazon EMR
• Hadoop, Hive, Presto, Spark, Tez, Impala, etc.
• Release 5.2: Hadoop 2.7.3, Hive 2.1, Spark 2.0.2, Zeppelin, Presto, HBase 1.2.3 (including HBase on S3), Phoenix, Tez, Flink
• New applications added within 30 days of their open source release
• Fully managed, Auto Scaling clusters with support for on-demand and Spot pricing
• Support for HDFS and S3 file systems, enabling separated compute and storage; multiple clusters can run against the same data in S3
• HIPAA-eligible; support for end-to-end encryption, IAM/VPC, and S3 client-side encryption with customer-managed keys and AWS KMS
The Hadoop ecosystem can run in Amazon EMR
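A minimal sketch of launching a transient EMR cluster that runs one Spark step against data in S3 and then terminates (cluster name, script path, and buckets are hypothetical; the default EMR roles are assumed to exist):

```python
import boto3

emr = boto3.client("emr")

# Launch a transient cluster: it runs the Spark step and shuts down.
response = emr.run_job_flow(
    Name="datalake-spark-batch",                     # hypothetical cluster name
    ReleaseLabel="emr-5.20.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-emr-logs/",                      # hypothetical log bucket
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,        # terminate after the step
    },
    Steps=[{
        "Name": "aggregate-orders",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-scripts/aggregate_orders.py"],  # hypothetical script
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```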
Which analytics should I use? (Process/analyze)
Batch — takes minutes to hours
Example: daily/weekly/monthly reports
Amazon EMR (MapReduce, Hive, Pig, Spark)
Interactive — takes seconds
Example: self-service dashboards
Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)
Stream — takes milliseconds to seconds
Example: fraud alerts, one-minute metrics
Amazon EMR (Spark Streaming), Amazon Kinesis Data Analytics, Amazon KCL, AWS Lambda, and others (see the sketch after this list)
Predictive — takes milliseconds (real-time) to minutes (batch)
Example: fraud detection, forecasting demand, speech recognition
Amazon SageMaker, Amazon Polly, Amazon Rekognition, Amazon Transcribe, Amazon Translate, Amazon EMR (Spark ML), AWS Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK, and Caffe)
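For the streaming case, a hedged sketch of a producer pushing events into Amazon Kinesis Data Streams (the stream name and event fields are hypothetical); a consumer built with the KCL, Kinesis Data Analytics, or AWS Lambda would process the records downstream:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Push one event per call; records with the same partition key land on the
# same shard, preserving per-key ordering for downstream consumers.
event = {"card_id": "4111-xxxx", "amount": 1899.00, "ts": "2019-05-01T12:00:03Z"}
kinesis.put_record(
    StreamName="transactions",                     # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["card_id"],
)
```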
On a slow-to-fast latency spectrum: batch (Amazon EMR), interactive (Amazon Redshift & Spectrum, Amazon Athena, Presto on Amazon EMR, Amazon ES), stream/streaming (Amazon Kinesis Data Analytics, KCL apps, AWS Lambda), and predictive (Amazon ML).
Across the pipeline (collect, store, ETL, process/analyze, consume), Amazon QuickSight serves analysis and visualization for software engineers and business users, while data science and AI apps serve data scientists and data engineers.
Amazon QuickSight: Examples – Salesforce Analytics
Summary
• Data must be used in every organization
• Data lakes are very important for consuming structured and unstructured data
• Data lake governance (AWS Lake Formation)
• Analyze data with the right tool
• Serve different types of consumers