© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
May 2014
Uses and Best Practices for Amazon Redshift
Eric Ferreira
Sr. Database Engineer
Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year
Amazon Redshift
[Diagram: Collect (Direct Connect, Amazon Kinesis) → Store (Amazon S3, DynamoDB, Glacier) → Analyze (Amazon Redshift, EMR, EC2)]
Petabyte scale
Massively parallel
Relational data warehouse
Fully managed; zero admin
Amazon
Redshift
a lot faster
a lot cheaper
a whole lot simpler
Common Customer Use Cases

Traditional Enterprise DW
• Reduce costs by extending DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to business

Companies with Big Data
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools

SaaS Companies
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW & SW costs by an order of magnitude
Amazon Redshift Customers
Growing Ecosystem
Data Loading Options
• Parallel upload to Amazon S3
• AWS Direct Connect
• AWS Import/Export
• Amazon Kinesis
• Systems integrators
Amazon Redshift Architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via
Amazon S3; load from
Amazon DynamoDB or SSH
• Two hardware platforms
– Optimized for data processing
– DW1: HDD; scale from 2TB to 1.6PB
– DW2: SSD; scale from 160GB to 256TB
[Architecture diagram: clients connect via JDBC/ODBC to the leader node; compute nodes sit on a 10 GigE (HPC) network and handle ingestion, backup, and restore]
Amazon Redshift Node Types
• Optimized for I/O intensive workloads
• High disk density
• On demand at $0.85/hour
• As low as $1,000/TB/Year
• Scale from 2TB to 1.6PB
DW1.XL: 16 GB RAM, 2 cores, 3 spindles, 2 TB compressed storage
DW1.8XL: 128 GB RAM, 16 cores, 24 spindles, 16 TB compressed storage, 2 GB/sec scan rate
• High performance at smaller storage size
• High compute and memory density
• On demand at $0.25/hour
• As low as $5,500/TB/Year
• Scale from 160GB to 256TB
DW2.L *New*: 16 GB RAM, 2 cores, 160 GB compressed SSD storage
DW2.8XL *New*: 256 GB RAM, 32 cores, 2.56 TB compressed SSD storage
Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage

• With row storage you do unnecessary I/O
• To get the total Amount, you have to read everything
ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375
• With column storage, you only
read the data you need
Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• COPY compresses automatically
• You can analyze and override (see the sketch below)
• More performance, less cost
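To override the automatic choices, encodings can be declared per column in the DDL. A minimal sketch mirroring the ANALYZE COMPRESSION output above (the column types are illustrative assumptions, not a tuned schema):

create table listing (
    listid         integer       encode delta,
    sellerid       integer       encode delta32k,
    eventid        integer       encode delta32k,
    dateid         smallint      encode bytedict,   -- bytedict suits few distinct values
    numtickets     smallint      encode bytedict,
    priceperticket decimal(8,2)  encode delta32k,
    totalprice     decimal(8,2)  encode mostly32,
    listtime       timestamp     encode raw         -- raw = no compression
);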
Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
[Diagram: a sorted column stored as blocks, e.g. one block spanning values 10–324, the next 375–623, the next 637–959; each block's min and max are kept in the zone map]
• Track the minimum and maximum
value for each block
• Skip over blocks that don’t contain
relevant data
Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Use local storage for performance
• Maximize scan rates
• Automatic replication and
continuous backup
• HDD & SSD platforms
Amazon Redshift parallelizes and distributes everything
• Query
• Load
• Backup/Restore
• Resize
• Load in parallel from Amazon S3, Amazon DynamoDB, or any SSH connection
• Data automatically distributed and
sorted according to DDL
• Scales linearly with number of nodes
Amazon Redshift parallelizes and distributes everything
• Query
• Load
• Backup/Restore
• Resize
• Backups to Amazon S3 are automatic, continuous
and incremental
• Configurable system snapshot retention period. Take
user snapshots on-demand
• Cross region backups for disaster recovery
• Streaming restores enable you to resume querying
faster
Amazon Redshift parallelizes and distributes everything
• Query
• Load
• Backup/Restore
• Resize
• Resize while remaining online
• Provision a new cluster in the background
• Copy data in parallel from node to node
• Only charged for source cluster
Amazon Redshift parallelizes and distributes everything
• Query
• Load
• Backup/Restore
• Resize
• Automatic SQL endpoint switchover
via DNS
• Decommission the source cluster
• Simple operation via Console or API
Amazon Redshift parallelizes and distributes everything
• Query
• Load
• Backup/Restore
• Resize
Amazon Redshift is priced to let you analyze all your data
• Number of nodes x cost per
hour
• No charge for leader node
• No upfront costs
• Pay as you go
DW1 (HDD)            Price per hour (DW1.XL single node)   Effective annual price per TB
On-Demand            $0.850                                $3,723
1 Year Reservation   $0.500                                $2,190
3 Year Reservation   $0.228                                $999

DW2 (SSD)            Price per hour (DW2.L single node)    Effective annual price per TB
On-Demand            $0.250                                $13,688
1 Year Reservation   $0.161                                $8,794
3 Year Reservation   $0.100                                $5,498
Amazon Redshift has security built-in
• SSL to secure data in transit; load encrypted files from S3; ECDHE for perfect forward secrecy
• Encryption to secure data at rest
– AES-256; hardware accelerated
– All blocks on disks & in Amazon S3 encrypted
– On-premises HSM & CloudHSM support
• No direct access to compute nodes
• Audit logging & AWS CloudTrail integration
• Amazon VPC support
• SOC 1/2/3, PCI-DSS Level 1, FedRAMP
[Architecture diagram: JDBC/ODBC access to the leader node inside the customer VPC; compute nodes in an internal VPC on 10 GigE (HPC), handling ingestion, backup, and restore]
Amazon Redshift continuously backs up your
data and recovers from failures
• Replication within the cluster and backup to Amazon S3 to maintain multiple
copies of data at all times
• Backups to Amazon S3 are continuous, automatic, and incremental
– Designed for eleven nines of durability
• Continuous monitoring and automated recovery from failures of drives and nodes
• Able to restore snapshots to any Availability Zone within a region
• Easily enable backups to a second region for disaster recovery
50+ new features since launch in Feb 2013
• Regions – N. Virginia, Oregon, Dublin, Tokyo, Singapore, Sydney
• Certifications – PCI, SOC 1/2/3, FedRAMP, PCI-DSS Level 1, others
• Security – Load/unload encrypted files, Resource-level IAM, Temporary credentials, HSM, ECDHE for perfect forward secrecy
• Manageability – Snapshot sharing, backup/restore/resize progress indicators, Cross-
region backups
• Query – Regex, Cursors, MD5, SHA1, Time zone, workload queue timeout, HLL,
Concurrency to 50 slots
• Ingestion – S3 Manifest, LZOP/LZO, JSON built-ins, UTF-8 4byte, invalid character
substitution, CSV, auto datetime format detection, epoch, Ingest from SSH, JSON, EMR
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
MicroStrategy
Industry’s best BI, web and mobile applications on-demand in the
cloud
May 2014
You’ve Got the Data. Now What?
Deriving Big Insights from Big Data
Trends in business analytics
Popular use cases
Agile Business Intelligence
Governed Self-Service
Information Driven Mobile Apps
Redshift Certified
Customer Success
Popular Use Cases

Customer Analytics
• Information driven purchasing: reviews, searches, pricing, comparisons, social networks, recommendations
• Omni-channel
• Customer 360

Sales Enablement
• Improve sales efficiency and effectiveness
• Data blending: sales, product, marketing
• CRM integration
• Mobility: BYOD, caching
• Information driven experience: interactions, videos, documents

Retail
• In-store apps: analytics, customer 360, personal shopper
• Store of the future
• Real-time decisions
• Inventory management
Self-Service Analytics Revolutionizes Traditional BI
Boost user satisfaction while massively increasing productivity
More Productive: more content per creator (5-10x more content)
More Producers: more users can create content (5-10x more content creators)
More Collaborative: peer-to-peer sharing (5-10x more sharing)
Together: >100x more content creation and consumption
Governed Self-Service
World-Class Information-Driven Mobile Apps
The future of dashboards
• More than graphs
• Multi-media
• Transaction enabled
• Live data
• Comprehensive data
• Intuitive
• Easy to use
• Guided workflow for consistent
user experience
• Personalized for each user
Business User Access to 1000s of Data Sources
Faster access to your data

• MicroStrategy Modeled Data: enterprise-certified single version of the truth
• Personal or Departmental: spreadsheets, Access databases, CSV, public data downloads, etc.
• Cloud-Based Data: Salesforce.com, NetSuite, Facebook, Eloqua, Google Docs, etc.
• Relational Databases: Redshift, Oracle, SQL Server, MySQL, Teradata, Netezza, etc. (certified Redshift integration)
• Big Data & Hadoop: EMR, MapR, Cloudera, Hortonworks, etc.
• Enterprise Applications: SAP, Oracle e-Business, Siebel, PeopleSoft, etc.
Customer Success
Netflix has deployed the MicroStrategy
business intelligence software platform on top
of Amazon Elastic MapReduce (Amazon EMR)
for interactive insights. Netflix analysts use
advanced visualizations to explore the
performance of its streaming service closer to
recorded time by directly accessing Hadoop
data on an ad-hoc basis and without writing
code.
Big Data Analytics: transform your growing Big Data resources into insight and profit.
Self-Service Analytics: see and understand data in minutes. No IT needed.
Enterprise-Grade Business Intelligence: produce and publish trusted analytics to improve your business operations.
AWS Partner
Comprehensive Analytics Platform
#1 Mobile Analytics
MicroStrategy Analytics Enterprise
Business Agility with Trusted, Governed Data
Experience MicroStrategy on AWS Today!
Ingestion – Best Practices
• Goal
• With 1 leader node and n compute nodes, leverage all the compute nodes and minimize overhead
• Best Practices
• Preferred method - COPY from S3
• Loads data in sorted order through the compute nodes
• Use a single COPY command; split the data into multiple files (see the sketch below)
• Strongly recommend that you gzip large datasets
copy time from 's3://mybucket/data/timerows.gz'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip delimiter '|';
• If you must ingest through SQL
• Multi-row inserts
• Avoid large numbers of singleton insert/update/delete operations
• To copy from another table
• CREATE TABLE AS or INSERT INTO SELECT
insert into category_stage values
(default, default, default, default),
(20, default, 'Country', default),
(21, 'Concerts', 'Rock', default);
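Because COPY reads files in parallel across the compute node slices, splitting the input pays off. A minimal sketch, assuming a hypothetical venue table whose source was split into parts sharing a key prefix:

-- Files uploaded as venue.txt.1, venue.txt.2, venue.txt.3, venue.txt.4;
-- a single COPY against the common prefix loads all of them in parallel
copy venue from 's3://mybucket/load/venue.txt'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
delimiter '|';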
Ingestion – Best Practices (Contd.)
• Best Practices
– Verifying load data files
• For US East, S3 provides eventual consistency
– Verify that the files are in S3 by listing the object keys
– Query Redshift after the load. This query returns entries for loading the tables in the TICKIT database:
select query, trim(filename), curtime, status
from stl_load_commits
where filename like '%tickit%' order by query;
query | btrim | curtime | status
-------+---------------------------+----------------------------+--------
22475 | tickit/allusers_pipe.txt | 2013-02-08 20:58:23.274186 | 1
22478 | tickit/venue_pipe.txt | 2013-02-08 20:58:25.070604 | 1
22480 | tickit/category_pipe.txt | 2013-02-08 20:58:27.333472 | 1
22482 | tickit/date2008_pipe.txt | 2013-02-08 20:58:28.608305 | 1
22485 | tickit/allevents_pipe.txt | 2013-02-08 20:58:29.99489 | 1
22487 | tickit/listings_pipe.txt | 2013-02-08 20:58:37.632939 | 1
22593 | tickit/allusers_pipe.txt | 2013-02-08 21:04:08.400491 | 1
22596 | tickit/venue_pipe.txt | 2013-02-08 21:04:10.056055 | 1
22598 | tickit/category_pipe.txt | 2013-02-08 21:04:11.465049 | 1
22600 | tickit/date2008_pipe.txt | 2013-02-08 21:04:12.461502 | 1
22603 | tickit/allevents_pipe.txt | 2013-02-08 21:04:14.785124 | 1
Ingestion – Best Practices (Contd.)
• Best Practices
– Redshift does not support an UPSERT statement. Use staging tables to perform an upsert: join the staging table with the target, UPDATE, then INSERT (see the sketch after this list).
– Redshift does NOT enforce primary key constraints; if you COPY the same data twice, you will get duplicate rows.
– Increase the memory available to a COPY or VACUUM by increasing
wlm_query_slot_count
set wlm_query_slot_count to 3;
– Run the ANALYZE command whenever you’ve made a non-trivial number of changes to
your data to ensure your table statistics are current
– The Amazon Redshift system table STL_LOAD_ERRORS can be helpful in troubleshooting data load issues: it records the errors that occurred during specific loads. Adjust the COPY MAXERROR option as needed.
– Check the character set: only UTF-8 is supported.
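A minimal upsert sketch under these constraints (the users and users_stage tables are hypothetical; the point is the update-then-insert pattern inside one transaction):

begin;

-- Step 1: update target rows that already exist in the staging table
update users
set name = s.name, email = s.email
from users_stage s
where users.userid = s.userid;

-- Step 2: insert rows that are new (column order must match the target)
insert into users
select s.*
from users_stage s
left join users u on s.userid = u.userid
where u.userid is null;

end;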
Choose a Sort key
• Goal
• Skip over data blocks to minimize I/O
• Best Practice
• Sort based on range or equality predicate (WHERE clause)
• If you access recent data frequently, sort based on TIMESTAMP (see the sketch below)
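A minimal sketch of the timestamp case (the events table is hypothetical): with starttime as the sort key, range filters on it can skip blocks via zone maps.

create table events (
    eventid   integer      not null,
    venueid   smallint     not null,
    eventname varchar(200),
    starttime timestamp    not null
)
sortkey (starttime);

-- This range-restricted scan touches only blocks whose zone map overlaps 2013
select count(*) from events
where starttime >= '2013-01-01' and starttime < '2014-01-01';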
Choose a Distribution Key
• Goal
• Distribute data evenly across nodes
• Minimize data movement among nodes: co-located joins and co-located aggregates
• Best Practice
• Consider using the join key as the distribution key (JOIN clause); see the sketch below
• With multiple joins, use the foreign key of the largest dimension as the distribution key
• Consider using a GROUP BY column as the distribution key (GROUP BY clause)
• Avoid
• Using a key that appears in equality filters as your distribution key
• For de-normalized tables with no aggregates, do not specify a distribution key; Redshift will use round robin
Query Performance – Best practices
• Store date and time values using the TIMESTAMP data type instead of CHAR
• Specify constraints
• Redshift does not enforce constraints (primary key, foreign key, unique) but the optimizer uses them
• Loading processes and/or applications therefore need to keep the data consistent
• Specify a redundant predicate on the sort column
SELECT * FROM tab1, tab2
WHERE tab1.key = tab2.key
AND tab1.timestamp > '2013-01-01'
AND tab2.timestamp > '2013-01-01';
Workload Manager
• Allows you to manage and adjust query concurrency
• WLM allows you to
• Increase query concurrency up to 50
• Define user groups and query groups (see the sketch below)
• Segregate short and long running queries
• Help improve performance of individual queries
• Be aware that query workload is distributed to every compute node.
• Increasing concurrency may not always help due to resource contention (cpu,
memory and I/O).
• Total throughput may increase by letting one query complete first while other queries wait
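A minimal sketch of steering a session to a WLM queue via a query group (the name 'short_queries' is hypothetical and must match a queue defined in your WLM configuration):

set query_group to 'short_queries';  -- subsequent queries run in the matching queue
select count(*) from sales;          -- executes in the 'short_queries' queue
reset query_group;                   -- return to default routing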
Workload Best Practices
• Organizing and keeping your load files in S3 allows for re-run or scenario
testing as you evolve your workflow in the platform.
– Keep in S3 or Glacier for fiscal/legal reasons
• Data updated over the short term
– Consider a short-term version of the table for staging and a long-term version once the data gets stable
• Round-robin distribution
– Use it when you don't have a good distribution key
– See Part 1 for a query that checks for distribution skew
– Trade-off: you give up co-located joins
• Loading the target (final) table
– Use a chronological date/timestamp column as the first sort key; VACUUM is needed less often and runs faster
– When the first sort column has low cardinality/resolution (e.g., date instead of timestamp), subsequent sort columns should match common filters and/or grouping columns
Workload Best Practices cont.
• Use the UNLOAD command to archive data that is not needed for day-to-day business (see the sketch after this list)
– Data that needs to exist only for fiscal/legal reasons can be re-loaded as needed
• Consider applying retention policies less often than the regular workflow
– Weekly/Monthly process during a less busy time
– Make space provision for the data growth
– Make sure all queries have date/timestamp range filters (> and <)
– Keep a sliding window of data to minimize block re-write during vacuum
• Take manual snapshots to save status at specific mileposts (year-end).
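A minimal archiving sketch (bucket, table, and cutoff date are hypothetical): UNLOAD writes compressed files to S3, after which the rows can be deleted and the space reclaimed.

unload ('select * from sales where saletime < ''2013-01-01''')
to 's3://mybucket/archive/sales_2012_'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip;

-- After verifying the archive, remove the rows and reclaim space
delete from sales where saletime < '2013-01-01';
vacuum sales;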
Workload Best Practices cont.
• Ratio between Load/Query performance needs
– Low ratio: Consider Load -> Snapshot -> Spin “Query” clusters -> Tear down
– High ratio: prioritize performance over space needs when choosing the number of nodes
• Normalization Rule of Thumb
– De-normalize only to avoid large non-collocated joins
– Slowly Changing Dimensions (Type II): keep normalized, match distkey with the fact table
Space Management
• Redshift has a single pool of space used for tables and temporary
segments.
– Loads need 2.5 times the space of the data being loaded if table has a
sortkey
– Vacuum may need 2.5 times the size of the table.
• Monitor the free space
– The Performance tab in the console
– CloudWatch alarms
– SQL (see the next slide)
Space Management cont.
• Table sizes

select trim(pgdb.datname) as Database,
       trim(pgn.nspname) as Schema,
       trim(a.name) as Table,
       b.mbytes, a.rows
from (select db_id, id, name, sum(rows) as rows
      from stv_tbl_perm a
      group by db_id, id, name) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes
      from stv_blocklist
      group by tbl) b on a.id = b.tbl
order by mbytes desc, a.db_id, a.name;

• Free space

select sum(capacity)/1024 as capacity_gbytes,
       sum(used)/1024 as used_gbytes,
       (sum(capacity) - sum(used))/1024 as free_gbytes
from stv_partitions
where part_begin = 0;
• Redshift allows you to resize your cluster up, down, and across node types; the resize happens online, with the cluster available read-only.
For more best practices, search YouTube for "Amazon Redshift Best Practices".
Thank You!