Big Data in the Cloud
@bensullins
Amazon Redshift

• Columnar DB
• MPP Architecture
• Speed!
2TB XL Node
High Storage Extra Large (XL) DW Node:
CPU: 2 virtual cores (Intel Xeon E5)
Memory: 15 GiB
Storage: 3 HDDs with 2TB of locally attached storage
Network: Moderate
Disk I/O: Moderate
API name: dw.hs1.xlarge

16TB 8XL Node
High Storage Eight Extra Large (8XL) DW Node:
CPU: 16 virtual cores (Intel Xeon E5)
Memory: 120 GiB
Storage: 24 HDDs with 16TB of locally attached storage
Network: 10 Gigabit Ethernet with support for cluster placement groups
Disk I/O: Very High
API name: dw.hs1.8xlarge
On-Demand Pricing

DW Node Class (On-Demand)           Hourly
XL Node - 2TB storage (Per Node)    $0.850 per Hour
8XL Node - 16TB storage (Per Node)  $6.800 per Hour

Reserved Instance 1yr (41% savings)

DW Node Class (Reserved)            Up front   Hourly
XL Node - 2TB storage (Per Node)    $2,500     $0.215 per Hour
8XL Node - 16TB storage (Per Node)  $20,000    $1.720 per Hour

Reserved Instance 3yr (73% savings)

DW Node Class (Reserved)            Up front   Hourly
XL Node - 2TB storage (Per Node)    $3,000     $0.114 per Hour
8XL Node - 16TB storage (Per Node)  $24,000    $0.912 per Hour
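The stated savings percentages can be sanity-checked by blending each plan's up-front fee into a per-hour cost. A quick sketch (not an official AWS calculator) using the XL-node figures above:

```python
# Compare the effective hourly cost of the Redshift XL-node pricing
# options listed above by amortizing the up-front fee over the term.

HOURS_PER_YEAR = 8766  # 24 * 365.25

def effective_hourly(upfront, hourly, years):
    """Blend the up-front fee into a per-hour cost over the full term."""
    return upfront / (HOURS_PER_YEAR * years) + hourly

on_demand = 0.850
reserved_1yr = effective_hourly(2_500, 0.215, 1)
reserved_3yr = effective_hourly(3_000, 0.114, 3)

print(f"{reserved_1yr:.3f}")  # 0.500 per hour, ~41% off on-demand
print(f"{reserved_3yr:.3f}")  # 0.228 per hour, ~73% off on-demand
```

The results line up with the slide's claims: roughly $0.50/hr (41% savings) for the 1-year plan and $0.228/hr (73% savings) for the 3-year plan, assuming the node runs continuously for the whole term.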
• Web Interface
• Fully Managed
• Automated Backups
• Fault Tolerant
• AES-256 Encryption
• Amazon VPC Firewall
BigQuery
• Columnar DB
• Tree Architecture
• Speed!
“Dremel can scan 35 billion rows without an index in tens of seconds.”
– Solutions Architect, Google Cloud Solutions Team
On-Demand Pricing

Resource             Pricing
Storage              $80 (per TB/month)
Interactive Queries  $35 (per TB processed)
Batch Queries        $20 (per TB processed)

Packaged Pricing

Data      Cost
100 TB    $3,300 per month ($33 per TB)
400 TB    $12,000 per month ($30 per TB)
1,500 TB  $40,500 per month ($27 per TB)
4,000 TB  $100,000 per month ($25 per TB)

Packages are billed in full at the end of each month, whether the package is used or not. If you use more data than the amount in your chosen package, on-demand rates apply for any additional data.
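The interaction between packages and overage can be sketched in a few lines, using the list prices above: a package is billed flat regardless of use, and anything beyond it falls back to the $35/TB interactive on-demand rate. The break-even volumes here are illustrative, not official Google guidance.

```python
# Monthly query cost under the pricing tables above: either pure
# on-demand, or a flat package plus on-demand overage.

PACKAGES = {100: 3_300, 400: 12_000, 1_500: 40_500, 4_000: 100_000}
ON_DEMAND_PER_TB = 35  # interactive query rate, $ per TB processed

def monthly_query_cost(tb_processed, package_tb=None):
    """Cost on-demand, or on a package plus on-demand overage."""
    if package_tb is None:
        return tb_processed * ON_DEMAND_PER_TB
    overage = max(0, tb_processed - package_tb)
    return PACKAGES[package_tb] + overage * ON_DEMAND_PER_TB

print(monthly_query_cost(95))        # on-demand: 3325
print(monthly_query_cost(95, 100))   # 100 TB package: 3300 (flat)
print(monthly_query_cost(150, 100))  # package + 50 TB overage: 5050
```

Around 95 TB/month of queries, the 100 TB package already edges out on-demand; below that, the flat bill means you pay for capacity you never use.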
Cloud Big Data Sources Comparison

Amazon Redshift                       Google BigQuery
Columnar + MPP                        Columnar + Tree
Petabytes in Scale                    Infinite Scalability
Easy management interface             No Management Required
Straightforward billing ($1K/TB/Yr)   Confusing Pricing Model
Great connectivity w/ BI Tools        Fair Connectivity w/ BI Tools
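The "$1K/TB/Yr" figure in the comparison can be checked against the Redshift reserved pricing earlier in the deck. A rough sketch, assuming a 2 TB XL node running continuously on the 3-year reserved plan:

```python
# Sanity check of the "$1K/TB/yr" Redshift figure: a 2 TB XL node on the
# 3-year reserved plan ($3,000 up front + $0.114/hr) amortized over the
# term comes to roughly $1,000 per TB per year.

HOURS_PER_YEAR = 8766  # 24 * 365.25
upfront, hourly, years, node_tb = 3_000, 0.114, 3, 2

total_cost = upfront + hourly * HOURS_PER_YEAR * years
per_tb_per_year = total_cost / (node_tb * years)
print(round(per_tb_per_year))  # 1000
```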

Editor's Notes

  • #4: Optimized for Data Warehousing – Amazon Redshift uses a variety of innovations to obtain very high query performance on datasets ranging in size from hundreds of gigabytes to a petabyte or more. It uses columnar storage, data compression, and zone maps to reduce the amount of I/O needed to perform queries. Amazon Redshift has a massively parallel processing (MPP) architecture, parallelizing and distributing SQL operations to take advantage of all available resources. The underlying hardware is designed for high-performance data processing, using locally attached storage to maximize throughput between the Intel Xeon E5 processors and drives, and a 10GigE mesh network to maximize throughput between nodes.
  • #5: Scalable – With a few clicks of the AWS Management Console or a simple API call, you can easily scale the number of nodes in your data warehouse up or down as your performance or capacity needs change. Amazon Redshift enables you to start with as little as a single 2TB XL node and scale up all the way to a hundred 16TB 8XL nodes for 1.6PB of compressed user data. Amazon Redshift will place your existing cluster into read-only mode, provision a new cluster of your chosen size, and then copy data from your old cluster to your new one in parallel. You can continue running queries against your old cluster while the new one is being provisioned. Once your data has been copied to your new cluster, Amazon Redshift will automatically redirect queries to your new cluster and remove the old cluster.
  • #6: No Up-Front Costs – You pay only for the resources you provision. You can choose On-Demand pricing with no up-front costs or long-term commitments, or obtain significantly discounted rates with Reserved Instance pricing. On-Demand pricing starts at just $0.85 per hour for a single node 2TB data warehouse, scaling linearly with cluster size. With Reserved Instance pricing, you can lower your effective price to $0.228 per hour for a single 2TB node, or under $1,000 per TB per year. To see more details, visit the Amazon Redshift Pricing page.
  • #7: Get Started in Minutes – With a few clicks in the AWS Management Console or simple API calls, you can create a cluster, specifying its size, underlying node type, and security profile. Amazon Redshift will provision your nodes, configure the connections between them, and secure the cluster. Your data warehouse should be up and running in minutes.
    Fully Managed – Amazon Redshift handles all the work needed to manage, monitor, and scale your data warehouse, from monitoring cluster health and taking backups to applying patches and upgrades. You can easily add or remove nodes from your cluster as your performance and capacity needs change. By handling these time-consuming, labor-intensive tasks, Amazon Redshift frees you up to focus on your data and business.
    Fault Tolerant – Amazon Redshift has multiple features that enhance the reliability of your data warehouse cluster. All data written to a node in your cluster is automatically replicated to other nodes within the cluster, and all data is continuously backed up to Amazon S3. Amazon Redshift continuously monitors the health of the cluster, automatically re-replicates data from failed drives, and replaces nodes as necessary.
    Automated Backups – Amazon Redshift's automated snapshot feature continuously backs up new data on the cluster to Amazon S3. Snapshots are automated, incremental, and continuous. Amazon Redshift stores your snapshots for a user-defined period of one to thirty-five days. You can also take your own snapshots at any time; these leverage all existing system snapshots and are retained until you explicitly delete them. Once you delete a cluster, its system snapshots are removed, but your user snapshots remain available until you explicitly delete them.
    Easy Restores – You can use any system or user snapshot to restore your cluster using the AWS Management Console or the Amazon Redshift APIs. Your cluster is available as soon as the system metadata has been restored, and you can start running queries while user data is spooled down in the background.
  • #8: Encryption – With just a couple of parameter settings, you can set up Amazon Redshift to use SSL to secure data in transit and hardware-accelerated AES-256 encryption for data at rest. If you choose to enable encryption at rest, all data written to disk is encrypted, as are any backups.
    Isolation – Amazon Redshift lets you configure firewall rules to control network access to your data warehouse cluster. You can also run Amazon Redshift inside Amazon Virtual Private Cloud (Amazon VPC) to isolate your cluster in your own virtual network and connect it to your existing IT infrastructure using an industry-standard encrypted IPsec VPN.
  • #9: SQL – Amazon Redshift is a SQL data warehouse that uses industry-standard ODBC and JDBC connections and Postgres drivers. Many popular software vendors are certifying Amazon Redshift with their offerings so you can continue to use the tools you use today. See the Amazon Redshift partner page for details.
    Designed for use with other AWS services – Amazon Redshift is integrated with other AWS services and has built-in commands to load data in parallel to each node from Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB. AWS Data Pipeline enables easy, programmatic integration between Amazon Redshift, Amazon Elastic MapReduce (Amazon EMR), and Amazon Relational Database Service (Amazon RDS).
  • #10: BigQuery is Google's cloud big data solution, based on the Dremel platform. Dremel has been in development for over six years and powers much of Google's Cloud Platform. It's worth mentioning that for this course I'm going to cover BigQuery at a high level, and later we'll connect Tableau to it to see how to use it in practice. If you'd like to dive deeper into BigQuery, Lynn Langit has a course here that goes into much greater detail and is definitely worth checking out. Let's start by taking a look at their homepage.
  • #11: On their homepage they proclaim you can analyze terabytes of data with just the click of a button. Sounds promising, if it weren't for Amazon Redshift offering petabytes in scale. You'll also notice a query editor and results pane previewed; this is encouraging, but for non-SQL developers it can be a scary sight.
  • #12: Like Amazon Redshift, Google BigQuery stores data in a columnar database format, which is great for data compression and query speed. BigQuery differs from Redshift, however, in that it uses a tree structure: similar to an MPP database, but it spreads the data extremely wide, and for each query it creates an execution "tree" that can scan tens of thousands of servers (the leaf nodes containing the data) and return results in milliseconds. Like Redshift, this all adds up to speed! Google is trying to differentiate BigQuery from MPP solutions by providing what they call full-scan results, essentially building a query tree for any query you can run rather than relying on indexes. In his whitepaper "An Inside Look at Google BigQuery," Kazunori Sato states that "BigQuery solves the parallel disk I/O problem by utilizing the cloud platform's economy of scale. You would need to run 10,000 disk drives and 5,000 processors simultaneously to execute the full scan of 1TB of data within one second." Impressive.
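The whitepaper's 10,000-drive figure checks out as back-of-the-envelope arithmetic, assuming roughly 100 MB/s of sequential read throughput per drive (my assumption, not a number from the paper):

```python
# Back-of-the-envelope check of the parallel-I/O claim: scanning 1 TB in
# one second at an assumed ~100 MB/s per drive needs ~10,000 drives.

TB = 10**12                   # bytes
MB = 10**6                    # bytes
DISK_THROUGHPUT = 100 * MB    # assumed per-drive sequential read, bytes/s
TIME_BUDGET = 1               # seconds

drives_needed = TB / (DISK_THROUGHPUT * TIME_BUDGET)
print(drives_needed)  # 10000.0
```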
  • #13: Dremel is the platform on which BigQuery is based.
  • #14: Scalability with Google BigQuery is a bit of a mystery, to be honest. Since Google handles all of the administration and data distribution for you, scalability is really only limited by how much you can afford. Once you upload your data to BigQuery, it handles the rest; you only need to worry about how much data your queries will process, which brings us to their pricing model.
  • #15: A big data analysis engine without operating a data center:
    Managed service – no additional capital costs, and the ability to terminate the service and remove your data at any time.
    Transparency in pricing and usage – simplicity: only two pricing components (query processing and storage); flexibility: the choice to pay by the month for what you use.
    Full visibility and control – monthly billing lets you monitor and throttle what you use, with tools to optimize usage and costs: best practices, tooling, samples.
    Since you're charged by the amount of data processed, this can be very expensive with a "chatty" query tool like Tableau. Google recommends sharding data into separate tables using a timestamp and setting your queries to filter to a specific date range to minimize query costs.
    In my view this is the only real issue with BigQuery. Say you have a query that pulls back sales for the west region by month for the past year. It returns 24 data points: 12 integers for sales and 12 date values corresponding to the months. To get those 24 data points, your query may have to scan millions or billions of rows (imagine Amazon's detailed sales transactions), aggregate the data, and then return the results. Since you're paying for all the data scanned, a single query can really run up the bill. If you're building a focused application rather than doing visual analytics with a tool like Tableau, you can probably manage this quite well; otherwise it can be cost-prohibitive to store your data here. I have a friend who was testing this, and one of his analysts ran a single query that cost them $400!
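The date-sharding recommendation above can be sketched as follows: keep one table per day and have queries name only the shards in the date range they need, so BigQuery bills only for the data those shards contain. The table names here are hypothetical.

```python
# Sketch of the per-day table-sharding pattern: generate the shard names
# covering a date range, e.g. sales_20130101 .. sales_20130103, so a
# query can reference (and pay for) only those tables.

from datetime import date, timedelta

def shard_tables(prefix, start, end):
    """Per-day table names covering the inclusive range [start, end]."""
    days = (end - start).days + 1
    return [f"{prefix}_{start + timedelta(d):%Y%m%d}" for d in range(days)]

tables = shard_tables("sales", date(2013, 1, 1), date(2013, 1, 3))
print(tables)  # ['sales_20130101', 'sales_20130102', 'sales_20130103']
```

A query scoped to three daily shards scans three days of data instead of the whole history, which is the difference between cents and the $400 anecdote above.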