Our Year in Google
Carmen Mason :: VitalSource Technologies
Allan Mason :: Pythian Group, Inc.
2
Who we are
Allan and Carmen Mason
Digital learning platforms that
drive outcomes.
3
We Provide Solutions For:
4
Institutions Corporate
Learning
Associations &
Certifying Bodies
Executive
Education
Campus
Stores
Publishers
Key Statistics: Last 12 Months
5
15 Million STUDENTS SERVED
7 Thousand INSTITUTIONS
22 Million BOOKS AND COURSES DELIVERED
6
VitalSource Global Scale
8
HELPING
BUSINESSES
BECOME MORE
DATA-DRIVEN
9
PYTHIAN
A global IT company that helps businesses leverage disruptive technologies to better compete.
Our services and software solutions unleash the power of cloud, data and analytics to drive better
business outcomes for our clients.
Our 20 years in data, commitment to hiring the best talent, and our deep technical and business expertise
allow us to meet our promise of using technology to deliver the best outcomes faster.
10
20 Years in Business
400+ Pythian Experts in 35 Countries
350+ Current Clients Globally
We’re Hiring!
https://www.pythian.com/careers/
12
Agenda
Things we’ll cover today:
• Motivations
• Considerations that drove us
• Decisions we faced
• Our current stack
• What’s next
Motivations
14
Local Data Center HA Configuration
Solution Provides
• Saved Binary Logs
• Differential Relay Logs
• VIP Handling
• Notification
15
Failover in the Data Center
• 10 – 20 seconds
• Automatically assigns the most up-to-date replica as the new master
• VIP tied to CNAME
• Replicas are slaved to the new master
• A CHANGE MASTER statement is logged to bring the old master back in line
once it’s fixed (see the sketch below).
• Alerts by email and notifications in Slack, detailing the decisions made
during the failover.
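For reference, the logged statement looks roughly like this; the host, user, and binary log coordinates below are illustrative placeholders, not values MHA actually emitted:
-- Run on the old master once it is healthy again, using the coordinates
-- recorded in the failover log (placeholders shown here):
CHANGE MASTER TO
  MASTER_HOST = 'new-master.example.com',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'replica_password',
  MASTER_LOG_FILE = 'mysql-bin.000123',
  MASTER_LOG_POS = 4;
START SLAVE;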
16
Ownership
• DNS – Managed by Parent Company’s Windows team.
• Load Balancer – Managed by Parent Company’s Network team.
• App Servers – VMs managed by Parent Company’s VMware team.
• Covers – An expensive Isilon managed by Parent Company’s Storage
team.
• All our stuff – Network managed by Parent Company’s Network team.
17
A Lazy Day in May
18
“Firemen just ran into our datacenter…
with a hose.”
19
Do things the right way
Decisions
21
Why The Cloud?
• Flexibility to scale
• “Always on” mentality
• Easy to manage
22
23
Why Google Cloud Platform
• Google Cloud Platform regions are connected using Google’s private
network.
• They wrote Kubernetes; they support it well.
• Excellent, hands on support.
• Better stability and performance, in our experience.
24
Cost Comparison
Cloud provider / instance type / vCPU / RAM / SSD / max IOPS / max throughput / cost per month:
Google Cloud, n1-standard-16: 16 cores, 60G RAM, 1024G SSD; max IOPS 32,000 based on storage, 25,000 based on vCPU; max throughput 480 at IO block sizes of 256 KB; $ 562.44 / € 489.88
AWS, m4.4xlarge: 16 cores, 64G RAM, 1024G SSD; max IOPS 30,000 based on storage; max throughput 320; $ 2,368.11 / € 2.062,68
as of 2018-10-22
Google Compute Engine
vs Google CloudSQL
26
Self Managed Instances vs. Google CloudSQL
Google Compute Engine
(GCE)
CloudSQL
27
Self Managed Instances: The Pros
• Customizable instances
• Full control of the OS and installed packages
• Full control of MySQL variables, up to MySQL’s built-in maximums
• e.g. max_connections (see the sketch below)
• Choose Percona Server, MariaDB, etc.
Google Compute Engine
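As a minimal sketch of that control, assuming we want to go past CloudSQL’s fixed 4,000-connection cap (the value below is illustrative, not our production setting):
-- On a self-managed GCE instance the ceiling is ours to set:
SET GLOBAL max_connections = 8000;
-- Persist it under [mysqld] in my.cnf (max_connections = 8000) so it
-- survives a restart.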
28
Self Managed Instance: The Cons
• We manage the backups
• We manage patches and upgrades
• We manage failovers
• Requires more time on Systems Administration
• Leaving less time for Database work
Google Compute Engine
29
CloudSQL: The Pros
• Routine patching
• Automatic failovers
• No resource management needed.
Google CloudSQL
30
CloudSQL: The Cons
• Can't customize CloudSQL instance types
• Configurable Limits vs. Fixed Limits
• Limitations based on Machine Type:
• Maximum Concurrent Connections: up to 4,000 concurrent users (see the check below).
• Storage Limits:
• 10,230 GB on standard and high memory machine types.
• 3,062 GB on micro and small machine types.
• https://cloud.google.com/sql/docs/quotas
• https://cloud.google.com/sql/docs/mysql/known-issues
Google CloudSQL
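A quick way to judge whether those fixed limits would bite is to check an existing master with plain MySQL queries (nothing CloudSQL-specific here):
-- Peak connection usage since the last restart vs. the 4,000-connection cap:
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
-- Data + index size per schema vs. the 10,230 GB storage limit:
SELECT table_schema,
       ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
FROM information_schema.tables
GROUP BY table_schema
ORDER BY size_gb DESC;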
31
CloudSQL:
High Availability Options
Failover Requirements
• Unresponsive for ~60 seconds or zone failure.
• Failover replica must be in the same region, in a different zone
• Create your own Region 2 slave!
• Replication lag must be < 10 minutes (see the check below).
Actions of a Failover
• Master fails or is unresponsive.
• Or Zone failure.
• Failover replication catches up.
• Replica is promoted. (name and IP moved)
• New failover replica created.
• Read replica recreated in same zone as new master.
(IP moved)
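Because the failover only fires when the failover replica is less than 10 minutes behind, lag is worth watching; on a plain MySQL replica the equivalent check is:
-- Seconds_Behind_Master should stay well under 600 seconds for the
-- failover guarantee to hold.
SHOW SLAVE STATUS\G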
32
Cost Comparisons – Compute Engine
Compute Engine Costs
2 x Database (Master, Replica): n1-standard-16, 1460 total hours per month – $776.72 / € 662.00
1 x Failover: n1-standard-16, 730 total hours per month – $388.36 / € 331.00
SSD Persistent disk: SSD Storage 3072 GB – $522.24 / € 445.11
Total: $1,687.32 / € 1.438,10
CloudSQL Costs
db-n1-standard-16, 2048 GB, 730 total hours per month – $2,274.80 / € 1.938,81
db-n1-standard-16, 1024 GB, 730 total hours per month – $963.32 / € 821.04
Total: $3,238.12 / € 2.759,85
as of 2018-10-22
33
Our Winner?
Compute Engine
Our Current Stack
35
ProxySQL - Why
• Written specifically for MySQL.
• Monitors topology, quickly recognizes changes, and forwards queries accordingly.
• Provides functionality that we liked, such as:
- Rewriting queries on the fly – bad developers!
- Query routing – read / write splitting
- Load balancing – balance reads between replicas
- Query throttling – If you won’t throttle the API…
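As a sketch of the last two items, both can be expressed as rows in ProxySQL’s mysql_query_rules table via the admin interface; the table name and patterns below are hypothetical, not our actual rules:
-- Rewrite on the fly: point queries at the renamed table while the code catches up.
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, replace_pattern, apply)
VALUES (10, 1, 'FROM books_old', 'FROM books', 1);
-- Throttle: add a 100 ms delay to a chatty query digest.
INSERT INTO mysql_query_rules (rule_id, active, match_digest, delay, apply)
VALUES (11, 1, '^SELECT .* FROM api_events', 100, 1);
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;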
36
Current HA Configuration:
Google Container Engine (GKE) & ProxySQL
select * from mysql_servers;
hostgroup_id hostname port status
0 app1-db-p01-use1b 3306 ONLINE
1 app1-db-p01-use1b 3306 ONLINE
1 app1-db-p02-use1c 3306 ONLINE
37
ProxySQL - Configuration
hostgroup_id hostname port status
0 app1-db-p01-use1b 3306 OFFLINE_HARD
1 app1-db-p01-use1b 3306 ONLINE
1 app1-db-p02-use1c 3306 ONLINE
• Monitor user in MySQL.
• GRANT REPLICATION CLIENT ON *.* TO 'monitor'@'%' IDENTIFIED BY 'password';
• ProxySQL monitors the read_only flag on the server to determine hostgroup.
• In our configuration:
• hostgroup 0 is writable.
• hostgroup 1 is read-only.
hostgroup = 0 (writers)
hostgroup = 1 (readers)
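Under the hood this mapping is a single row in ProxySQL’s mysql_replication_hostgroups table; a minimal sketch via the admin interface (the comment text is arbitrary):
-- Backends with read_only=0 land in hostgroup 0 (writers),
-- backends with read_only=1 land in hostgroup 1 (readers).
INSERT INTO mysql_replication_hostgroups (writer_hostgroup, reader_hostgroup, comment)
VALUES (0, 1, 'app1 read/write split');
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;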
38
ProxySQL - Troubleshooting
Install a mysql client so we can look at proxysql admin interface:
apt-get update -qq ; apt-get install -y lsof netcat vim-tiny mysql-client > /dev/null 2>&1
Log in to proxysql admin interface:
mysql -h127.0.0.1 -P6032 -uuser -p@pass
Get a list of servers and their statuses:
select * from mysql_servers;
See monitoring checks for read_only:
SELECT * FROM monitor.mysql_server_read_only_log ORDER BY time_start_us DESC ;
See monitoring checks for response time:
SELECT * FROM monitor.mysql_server_ping_log ORDER BY time_start_us DESC;
Show current connection stats:
SELECT * FROM stats.stats_mysql_connection_pool;
39
A Little Mystery, a Little Drama…
40
WHERE clause?
Who needs a WHERE clause?
• Long Running Query Alerts
• Disk Space Alerts
• Missing WHERE Clauses
• Forced Developer Re-education Programs
• Kill it with FIRE
41
Finger Pointing – Bad MySQL Optimizer!
1. It MUST be MySQL!
2. It HAS to be ProxySQL!
3. It MUST be the ORM!
42
ProxySQL - CYA
SELECT * FROM table WHERE id = 12;
43
ProxySQL - CYA
SELECT * FROM table WHERE id = 12;
44
ProxySQL - CYA
UPDATE table SET column = 'new_value' WHERE id = 12;
45
ProxySQL - CYA
DELETE FROM table WHERE id = 12;
46
ProxySQL to the Rescue!
47
ProxySQL – CYA – Block Query Rules
48
ProxySQL – CYA – Block Query Rules
mysql_query_rules: (
{ rule_id=1 active=1 match_digest="^UPDATE" flagIN=0 flagOUT=1 log=0 comment="flag UPDATEs for chain 1" },
{ rule_id=2 active=1 match_digest="^DELETE" flagIN=0 flagOUT=1 log=0 comment="flag DELETEs for chain 1" },
{ rule_id=3 active=1 match_digest="^SELECT" flagIN=0 flagOUT=1 log=0 comment="flag SELECTs for chain 1" },
{ rule_id=4 active=1 match_pattern="WHERE " flagIN=1 apply=1 log=0 comment="if there's a WHERE, apply it" },
{ rule_id=5 active=1 match_pattern="LIMIT " flagIN=1 apply=1 log=0 comment="if there's no WHERE but there is a LIMIT, apply it" },
{ rule_id=6 active=1 error_msg="All UPDATE/DELETE/SELECT queries *MUST* include a WHERE or a LIMIT!" flagIN=1 apply=1
log=1 comment="if we're here, we're missing WHERE and LIMIT - throw an error and log it" }
)
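The same chain can also be loaded at runtime through the admin interface rather than the config file; a sketch using the standard mysql_query_rules columns (values mirror the rules above):
INSERT INTO mysql_query_rules (rule_id, active, match_digest, flagIN, flagOUT, log, comment)
VALUES (1, 1, '^UPDATE', 0, 1, 0, 'flag UPDATEs for chain 1');
INSERT INTO mysql_query_rules (rule_id, active, match_digest, flagIN, flagOUT, log, comment)
VALUES (2, 1, '^DELETE', 0, 1, 0, 'flag DELETEs for chain 1');
INSERT INTO mysql_query_rules (rule_id, active, match_digest, flagIN, flagOUT, log, comment)
VALUES (3, 1, '^SELECT', 0, 1, 0, 'flag SELECTs for chain 1');
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, flagIN, apply, log, comment)
VALUES (4, 1, 'WHERE ', 1, 1, 0, 'if there''s a WHERE, apply it');
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, flagIN, apply, log, comment)
VALUES (5, 1, 'LIMIT ', 1, 1, 0, 'if there''s no WHERE but there is a LIMIT, apply it');
INSERT INTO mysql_query_rules (rule_id, active, error_msg, flagIN, apply, log, comment)
VALUES (6, 1, 'All UPDATE/DELETE/SELECT queries *MUST* include a WHERE or a LIMIT!', 1, 1, 1, 'missing WHERE and LIMIT - throw an error and log it');
-- Make the rules live and persist them:
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;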
49
ProxySQL – CYA – Show Rule Hits
SELECT hits,
mysql_query_rules.rule_id,
digest,
active,
username,
match_digest,
match_pattern,
replace_pattern,
cache_ttl,
apply
FROM mysql_query_rules
natural JOIN stats.stats_mysql_query_rules
ORDER BY mysql_query_rules.rule_id;
50
Any Ideas?
???
51
The Big Reveal – Maybe?
Latest idea:
NewRelic for Google Cloud combined with Ruby’s ActiveRecord
52
Current HA Configuration:
Master High Availability
53
Master High Availability (MHA) – Why?
• Easy to automate the configuration and installation
• Very fast failovers: 7 to 11 seconds of downtime, proven repeatedly in
production.
• Brings replication servers current using the logs from the most up to date
slave, or the master (if accessible).
54
Current HA Configuration
Solution Provides:
• Failover between Zones
• Failover between Regions (manual)
• Saved Binary Logs
• Differential Relay Logs…
• Notification
Things We’ve Learned
56
Why not MHA?
• Non-atomic relay log entries in RBR make relay log use scary.
• https://dev.mysql.com/doc/refman/5.6/en/replication-solutions-unexpected-slave-halt.html
• Cannot be used to fail over to the central slave with our current configuration.
• Requires manually updating config files.
• Can create a split-brain scenario.
• A network issue between zones means the MHA manager can’t see the master, so it will
perform a failover.
MHA
57
Big Disk = More Performance
Per-instance limits:
Instance vCPU #: Sustained Random IOPS (Read Max at IO block sizes <8 KB / Write), Sustained Throughput in MB/s (Read Max at IO block sizes >256 KB / Write)
>15 vCPUs: 15,000 / 15,000 IOPS; 240 / 240 MB/s
<32 vCPUs: 60,000 / 30,000 IOPS; 1,200 / 400 MB/s
Per-volume limits:
Volume Size: Sustained Random IOPS (Reads / Writes), Sustained Throughput in MB/s (Reads / Writes)
10 GB: 300 / 300 IOPS; 4.8 / 4.8 MB/s
2048 GB: 60,000 / 30,000 IOPS; 983 / 400 MB/s
4096 GB: 60,000 / 30,000 IOPS; 1,200 / 400 MB/s
SSD Persistent Disk Performance: Minimums and Maximums
58
IP Aliases
• VIP for MHA?
• “If you remove an alias IP range from one VM and assign it to another VM, it might
take up to a minute for the transfer to complete.”
- cloud.google.com/vpc/docs/configure-alias-ip-ranges
• Pods become natively routable on the GCP network.
• Regular network costs vs. egress bandwidth charges.
• Reduced latency.
• Improved Security
In The Works
60
Different Proof of Concepts in Progress
• CloudSQL
• Google Spanner – Google’s “NewSQL” Database (DBaaS)
• Orchestrator with ProxySQL
Beam me up,
Scotty!!!
61
Carmen Mason
Senior Database Administrator
VitalSource Technologies, LLC
https://www.vitalsource.com
@CarmenMasonCIC
62
Allan Mason
Database Consultant
Pythian Group
https://pythian.com/
@_digitalknight
63
Watkins
Mascot
VitalSource Technologies
https://www.vitalsource.com/
@vitalsource
64
Thank you!
Editor's Notes
  • #2: Carmen In this talk, we're going to review how we got to where we are now, and the challenges that we've faced outside of the datacenter.  We will talk about the decisions that we've made for our high availability, DR solution, and database hosting. 
  • #3: Carmen First, a little about us… We have been working together for over 15 years. Those years working together as a team have given us an understanding of each other's strengths. Today, we work for different companies, though it doesn't always seem like it. We share ideas, work through issues together, and assist when there's an emergency. There in the back is the next generation DBA.
  • #4: Carmen I work at VitalSource Technologies as the sole Database Administrator. This talk is specific to our experiences. A little about the company I work for before we get into it.
  • #5: Carmen Vitalsource provides digital learning platforms for kindergarten through 12th grade, universities, as well as corporate learning solutions. Most people that have used e-books for training of any kind have used our book format.
  • #6: Carmen In the last 12 months, we served over 15 million students worldwide, over 7 thousand institutions, and 22 million books and courses were delivered through our platform.
  • #7: Carmen We have offices around the world.
  • #8: Carmen Here is a heatmap that shows our users. More than 15 million active users, opening more than 1.5 million books a day, generating more than 75 million engagement activities a week, from 241 countries and territories, in 37 languages. And now, a little about the company that my co-host works for…
  • #9: Allan I work for The Pythian Group, and we help companies love their data.
  • #10: Allan We help companies compete. Providing innovative solutions to their needs.
  • #11: Allan We have been in business for over 20 years, have over 400 experts in over 35 Countries and we have over 350 clients globally. We have experience in a lot of environments solving every imaginable problem.
  • #12: Allan And we are hiring!
  • #13: Allan Let’s get to it. We're going to review our initial move to Google Cloud Platform and the motivations behind that move. We'll also discuss the considerations necessary to make informed decisions for this move, including those that were a surprise to us after the move. Then we'll share our current infrastructure with you, and the new challenges that we are facing. Finally we'll chat about our plans for the near term.
  • #14: Carmen Our motivations for moving to the cloud are pretty common. Let’s go over a little back story.
  • #15: Carmen In the local data center, we used MHA with MHA-Helper to provide a high availability solution for our databases. Application servers connected to the cname of the database master, which was set to the Virtual IP Address that MHA-helper was managing for us. This solution provided us with differential relay logs to update all slaves, saved binary logs when possible, as well as Virtual IP handling. We would be notified of the decisions that MHA made in email and chat. This solution provided us with 4 9’s uptime on the databases. Failovers typically took less than 11 seconds. Let’s talk about those failovers.
  • #16: Carmen In our parent company’s data center, we had a failover time of 10 – 20 seconds. MHA determined who would be master, based on the most up to date replica. The Virtual IP Address was removed from the old master and applied to the new one. MHA then slaved the Replicas to the new master. A CHANGE MASTER statement was then logged for me so that I could update the old master when it was ready to be brought back into replication as a new slave. Finally, we were notified in chat and emailed all of the decisions that were made during the failover. Until the end of last year, our entire infrastructure was housed in our parent company’s data center in Tennessee. Now, only our Vertica servers remain on their network.
  • #17: Carmen Another reason behind the move… Most of our architecture was owned and managed by our parent company. A change management process that often involved multiple teams from two different companies can be excruciatingly slow. This arrangement made maintenance windows a nightmare. Our development moves at start up speed and we need the freedom that the Cloud offers us.
  • #18: Carmen Our kick in the pants to finally move to the cloud came one lazy day in May. There was an announcement in the chat panic room…
  • #19: Carmen There's a fire alarm at the corporate headquarters. The UPS room is smoking. Next thing we know, we're being told that we need to stand by to switch to the DR site. There's a possible fire in the datacenter. This is a photo of the firemen in front of the entrance to our datacenter. Allan LEAKY HOSE. DATA CENTER. What could possibly go wrong?? This moved up our timeline for migrating to the Cloud. We were already discussing the move. This was just additional motivation. Other considerations were: The ability to manage our own resources and deployment chain The flexibility of developing in the cloud, such as quickly adding additional resources, creating test environments that can easily be discarded and rebuilt to fit new needs.
  • #20: Carmen Moving to the cloud gave us the opportunity to look at the way that we were doing things. We adopted a continuous deployment model in development, and our devs have thrived on it. With this chance to reevaluate how we did things, we decided to quickly adopt a lot of new technologies, including Kubernetes, Docker, Jenkins, all things Google Cloud, and more. There was a lot of learning, trial and error, for a small team of sys admins and one DBA in a very short amount of time. The pressure from the Business side was intense.
  • #21: Allan Let's talk about the decisions we needed to make.
  • #22: Allan Why move to the cloud? Flexibility to scale… is a pretty well known pro for moving to any cloud. Adding resources is a quick and easy process. Spinning up new clusters of servers is point and click, or a few lines of code. I can stand up an entire stack from client to application to database and back in the time it takes a Developer to write a Jira ticket and Business to throw money at it.  No more purchase orders, waiting for shipments, etc. Always on mentality… No one turns off the internet. No one at Google is going to tell us, hey we need to replace the F5, and it’s going to take 3 hours. Their mentality is that they are always on. Other options don’t even exist. Easy to manage… A few clicks of a button, a few lines of code, and suddenly we have an entire application stack to serve our platform. ----- Meeting Notes (11/6/18 12:27) ----- No manually hard drives, swapping out flakey server blades, replacing UPS batteries, or finding a bad network cable.
  • #23: Allan Quick show of hands. Who is here is using Amazon Web Services? Microsoft Azure? Google Cloud Platform? What about on premises? Anyone still in a local data center?
  • #24: Allan Let’s talk about why we chose Google Cloud. The connection between GCP regions is over Google’s private network, and it’s crazy fast.  They wrote Kubernetes, and they support it really well. Their support has been excellent. Our account manager is very responsive. We have met with the lead engineers for Kubernetes and Spanner. The help is there if we need it, and their engineers are excellent. We used AWS for years, on some of our non-core applications, giving us plenty of experience with it. With Google, our stability and performance overall has been better.
  • #25: Allan Google Cloud Platform seems noticeably less expensive than Amazon Web Services. Comparing apples to apples as much as possible on the two platforms, with near equal resources, we went with our standard GCE database build, an n1-standard-16 with 16 vCPUs, and 60G of ram. We added a 1Terabyte persistent SSD to get the best IOPS and throughput possible. This costs us a little over $562.44 When we compare that to a similar AWS offering, the price difference is significant. Now, we knew that we were moving to the cloud, and we knew which cloud.
  • #26: Carmen Google Compute Engine versus Google CloudSQL.
  • #27: Carmen Let’s talk about the Pros and Cons of going with <click> Self Managed instances with Percona Server installed <click> vs. <click> CloudSQL 2nd Gen. instances with MySQL.
  • #28: Carmen If the default instance types do not match your needs, Compute Engine allows you to customize the number of virtual CPU cores, and the amount of RAM. Compute Engine (basically) gives full control of the instance. It’s possible to tweak the operating system to improve performance, and meet needs, such as increasing the ulimit for backups. This includes having the full list of MySQL configuration options and values available, such as max concurrent connections, etc. This provides the option to use different flavors of MySQL such as Percona Server or MariaDB. So, we can benefit from the improvements, such as performance, that any of these offer. ----- Meeting Notes (11/6/18 12:27) ----- Change from improvements, such as perf To improvements and perf
  • #29: Carmen The cons are all about deciding where you want to spend your time. You have to manage the backups, patching the operating system, implementing and managing a failover solution. As a Database Administrator: You end up spending more time as a SysAdmin. But you have complete control
  • #30: Carmen The pros of CloudSQL include routine patching, automatic failovers, and no resource management needed. You specify the time that you are ok with your database being restarted for maintenance, and they do it all for you. Resources are added to the instance based on need. When the disk is within 5GB of being full, and if the option was chosen, the storage space for your instance will automatically be increased.
  • #31: Carmen The cons are significant for us: GCP has what they call “configurable limits”, or soft limits and “fixed limits”, which are hard limits. A configurable limit, you can just ask them to “pretty please” bump up the limit, using the correct forms. An example of this would be the number of instances per project. Max concurrent Connections and Storage limits on CloudSQL are both examples of fixed limits, which cannot be changed Max concurrent connections depends on machine type with a max of 4k concurrent users. During our peak times of the year, it’s not uncommon for us to have 7 thousand or more connections. The storage limits are based on the machine type as well, which means CloudSQL is not a good solution for us for some of our databases. For standard and high memory machine types, you are looking at about a 10 Terabyte limitation, and 3 terabytes on micro and small machine types.
  • #32: Allan The initial interest was to use CloudSQL, which is a Google Cloud managed database solution, similar to RDS We initially figured the "Second generation HA with CloudSQL.” would replace our current high availability solution without impacting our uptime. It doesn’t, though it comes close. It requires up to 60 seconds of the master being unresponsive, or a complete zone failure to initiate a failover. The failover replica needs to be in the same region, just a different zone. The configuration uses semi-synchronous replication. Which means replication lag can affect your master’s performance. You can help the lag, by adding memory or increasing storage to get the higher I/O throughput on the failover replica. ----- Meeting Notes (11/6/18 12:27) ----- That's a whole minute just to start the failover process.
  • #33: Allan Comparing apples to apples as much possible. Our standard database build is an N1-standard-16, which is 16 virtual cores and 60G of RAM, with three instances. A master, and two slaves, one of which is the disaster recovery failover candidate. These are the numbers we get.
  • #34: Allan Reviewing the costs, level of control, and failover scenarios, we decided to move our databases to Google Cloud Compute Engine.
  • #35: Allan Let's talk about our current stack
  • #36: Allan Why do we use ProxySQL? ProxySQL is written specifically for MySQL. The author, René, is quick to respond to bug reports, and requests for commits. It gives us what we need: It monitors our topology, and quickly recognizes changes to it. ProxySQL also provides other functionality, such as: Rewrites queries for when our developers need time to fix their code. Splits reads and writes between the master and slave. Load balances the heavy read loads during school registration season, and during finals week for our Notes app. Not to be left out, query throttling, for when the universities hit us hard with all of their student registrations at once.
  • #37: Allan We made a lot of changes to our infrastructure since the move. Our applications are now managed by Google Container Engine Kubernetes pods which auto scale to fit our needs. ProxySQL is installed in each node. We decided to install it in the same container as the application. This is known as a Tiled Approach. It gives us the lowest latency possible between the application and ProxySQL. There is no single point of failure with this solution.  The configuration for ProxySQL is mounted as a Kubernetes secret. The application connects locally to pass queries to ProxySQL, which in turns passes the queries to the databases depending the query. Our masters are in both hostgroup 1 and hostgroup 0, and we rely on Ruby’s ActiveSlave to decide which server to use.
  • #38: Allan ProxySQL only needs a database user with the Replication Client Grant, to monitor the replication status. ProxySQL’s Monitor then watches for the read_only flag on the database, which MHA handles for us. The “writers” and “readers” refers to the collection of instances that are writable or not. ProxySQL will automatically shun a database if it’s offline,
  • #39: Allan Troubleshooting ProxySQL is pretty easy. Here’s a few queries we use to check things. We first install the MySQL client, in order to work with the ProxySQL Admin interface. After that’s out of the way, the commands here give us some good information. We can loop through and keep checking server statuses. Get a list of servers and their statuses. Check the read_only flags, response times, and connection stats.
  • #40: Carmen ProxySQL became extremely important covering our butts while we were troubleshooting a huge issue that we were having.
  • #41: Carmen Let’s really talk about this… We started having a serious problem when we first moved to GCP. We began to get a couple of really nasty long running queries alerts. I realized I knew this query. This was a commonly run, complicated report. Looking at the database, I saw that it was missing it’s WHERE clause. I quickly killed the query, and contacted the Project Lead for that application, and said Dude, have you lost your mind?? You’re killing my database, and filling the disk. We need to talk about your query writing skills. The query was running a full table scan against some of our largest database tables. As a result, it was creating massive temp files. Believing this to be a one off, the developer did a some work to investigate, but it quickly got swept aside as a low priority. However, about a week later, I noticed another long running query. This time it was an UPDATE that was updating everyone’s bookmark to be on the same page for the same book for every single student in the database, because it was missing a WHERE clause… AGAIN. I killed it, and recovered from backup because this one had already started making changes to the data.
  • #42: The initial knee jerk reaction by everyone else, was that it was the database. But, last I checked MySQL’s Query Optimizer doesn’t randomly drop WHERE clauses that it doesn’t like. So, we looked to ProxySQL. It’s also an unlikely candidate, but was easier for us to remove from the equation than MySQL! After testing without ProxySQL, we were able to prove that it also wasn’t at fault.
  • #43: Carmen The problem was that we were having queries with dropped WHERE clauses. This isn’t that big a deal with a SELECT statement.
  • #44: Carmen Our long running query alerts will pop up before these are ever an issue. Worst case, it will create a monstrous temp file before we get to it.
  • #45: Carmen When this is a serious issue is with this… Which updates all of the rows with that column data
  • #46: Carmen If that doesn't make you feel a little nervous, let's consider this… How about inadvertently wiping out all the data in the table? This might be a great time to test your backups!
  • #47: With ProxySQL we were able to implement Pattern Matching to block this potentially dangerous situation.
  • #48: Carmen Here’s a flowchart to show the rules, and how they work for us. So, how do we do this? We use ProxyES-QUE-ELL’s mysql query rules. These rules check to see if the query is an UPDATE, DELETE, or a SELECT. If not, the query will be sent to the database. If yes, Does it have a WHERE clause? If yes… it goes through to the database, if no… Does it have a LIMIT? Yes? Free pass, no? 86 with an error message.
  • #49: Carmen Here’s what we have in the config file to set up these rules in ProxySQL. In this case, ProxySQL has been a great line of defense against this potentially catastrophic bug. ----- Meeting Notes (11/6/18 12:27) ----- Any questions so far?
  • #50: Carmen We use this query to show us how many times these rules have been applied to a query on this pod. In this example, when reviewing the results, the last "rule_id" is the one we cared about. It represented the number of queries being blocked by our block query rules.
  • #51: Anyone have any ideas as to what the problem might have been?
  • #52: After a lot of digging by some of VitalSource’s brightest minds, the problem seems to be NewRelic for Google Cloud. Our thinking is NewRelic’s hooks into Ruby’s ActiveRecord is somehow causing us to lose the WHERE clause. After removing NewRelic completely we have not had a single reoccurrence of the issue.
  • #53: Allan This is our MHA solution. We have two database instances in the East region, each in a separate zone, and one database in Central in case East goes down in a fiery inferno. This gives us failover options for both zone and region failure. It’s important to note for central to be master, the failover would be manual using a script we wrote. We currently have Central set to no-master in the config file. This is because of the approximately 10ms latency between our application cluster in East and the database in central.
  • #54: Allan We have a few years of experience with MHA. Therefore we decided to stick with it when we moved to GCP. It’s complicated to set up, but using automation for the setup resolves this nicely. It’s proven itself for us in the local data center repeatedly. At the time this decision was made, it was the only known solution to bring replication slaves up to date with the latest slave. We'll come back to this later, as it's a little complicated.
  • #55: Allan Here’s the bigger picture. Each pod in our kubernetes cluster has ProxySQL installed locally. ProxySQL uses the monitor user to monitor the read_only flag to determine if a host Is writable or not. This solution gives us Failovers between zones and regions. Saved bin logs from the master, if they’re available. Differential relay logs from the most up to date slave, which we’ll talk more about later. And notifications of failover to our Slack and email.
  • #56: These are a few of the things that we’ve learned since moving to Google Cloud
  • #57: Carmen We have decided to move away from MHA recently. Why? Well, there are a few reasons…. The possibility of non-atomic relay log entries in RBR makes relay log use scary. https://dev.mysql.com/doc/refman/5.6/en/replication-solutions-unexpected-slave-halt.html We cannot use it to fail over to the central slave with our current configuration. Requires manually updating config files. We can create a split brain scenario pretty easily with our current configuration. If there's a network issue between zones, and the MHA manager can't see the master, it will perform a failover. This can be remedied by implementing a secondary network check from the Central Region. ----- Meeting Notes (11/6/18 12:27) ----- And MHA doesn't seem to be supported any longer. Remove URL in our notes
  • #58: Allan In Google Cloud, larger disks, more vCPU, means better IOPS and Throughput. To understand i/o needs, you have to look at how you’re going to use the storage. Small reads and writes will be limited by input/output operations per second, or IOPS. Large reads and writes are limited by throughput. Since we’re talking databases, let’s stick to SSDs. The IOPS will scale until you reach either the limits of the volume or the limit of the Compute Engine instance. To get the most IOPS from persistent SSD, you need to start with vCPU. Change the machine type to increase the per-vm limits. This will cap out at 32 virtual CPU. Resize the disk to increase IOPS and throughput per disk. This will cap out at 2 Terabyte for Sustained Random IOPS and 4 Terabyte for sustained throughput.
  • #59: Carmen Initially we thought that we would be able to use IP Aliases to hack in a virtual IP in Google Cloud. Bit, it didn’t work. Briefly looking into it revealed that it would not be able to match our current failover times. More digging into it revealed that we knew nothing about IP Aliases. These things were awesome for our Kubernetes clusters. We have since rebuilt our kubernetes clusters to use them. What does that mean for us? Now, pods are natively routable on the GCP network, which means saving money! We’re looking at regular network costs now instead of paying egress bandwidth charges, which applied even within the same project. Native means reduced latency. Pods can now directly access hosted services without going through a NAT gateway. We no longer need to disable anti spoof protection on the VMs. We can now validate the IP of the source and destination Pod.
  • #60: We know we still have a lot of work to do.
  • #61: Allan We have a few proof of concepts in the works. We're considering CloudSQL for some of our databases, to lighten the administrative load. We're also considering Google Spanner during a rewrite of our large, high read/write database. And for the databases that we do not move to CloudSQL, we need a replacement for MHA as we have since revoked its awesome status. For this we're planning to use Orchestrator, which is a solid choice for many reasons. ----- Meeting Notes (11/6/18 12:27) ----- Orchestrator is very actively maintained and tested by its author, Shlomi.
  • #62: Here's my info
  • #63: And here's Allan's information
  • #64: And, of course, our very special guest star, Watkins
  • #65: Thanks everyone. Does anyone have any questions.