#3:Let's start with our customer's life before AWS. There were 6 clusters in a hardware DC. Each cluster had its own master MySQL database, a set of slave DBs, a set of web apps, its own portal app, tools, etc.
Drivers for changes:
BF preparation was too expensive/complicated
New cluster creation or disaster recovery was a sequence of manually executed, documented actions (runbooks).
A few outages during 2008 pushed us to become more redundant; we needed a failover DC
AWS technologies: none
#4:Two AWS DCs + the master on-premises hardware DC
AWS Benefits:
Rolling deployment: take a DC out of the pool and deploy
Fast and “unlimited” capacity scaling up/down
Lessons learned:
AWS is flexible and scalable
A lot of infra changes needed to be done
All components need to be cloud-ready (built to fail)
We needed to change and adapt to the cloud.
Moving forward:
The slaves routinely fell behind when we had to ingest lots of new data, sometimes by 10 hours or more. We needed to rethink our entire stack.
AWS technologies: EC2, EIP
#5:Full platform/organizational re-architecture. Agile/DevOps/…
Re-architecture drivers:
* app/organizational changes
* next level of flexibility, performance, and reliability
* solve our multi-region replication problem
* get rid of our individual clusters
App/Org changes:
* Monolithic Java app was broken up into a set of small services, each supported by a decentralized engineering team.
* The teams were responsible for the entire service life-cycle, from Development to QA to Operations.
* Engineering adopted Agile as a development methodology, where previously we were waterfall driven.
Shorter release cycle: from a coordinated release every 8-12 weeks to weekly, with coordinated releases possible at any time
Tools changes:
Puppet adoption
Zabbix/Nagios to Datadog
From distributed to centralized logging
Data stack:
For our DB system of record, we chose Cassandra for its multi-region replication abilities (DynamoDB did not have this feature) and cloud-native operational qualities.
ElasticSearch replaced Solr for similar reasons.
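To make the replication point concrete, here is a minimal sketch of how a multi-region keyspace is declared in Cassandra (using the Python cassandra-driver; the keyspace name, DC names, and contact point are made up for illustration):

# Hypothetical sketch: cross-region replication in Cassandra is declared
# per keyspace. NetworkTopologyStrategy replicates per datacenter, which
# is what MySQL could not give us across regions.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.10"])  # any contact point in the local region
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS app_ks
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us_east': 3,
        'eu_west': 3
    }
""")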
AWS technologies:
IaC: CloudFormation (see the sketch after this list)
3 VPCs: dev/qa/prod; 3 regions with 3 AZs each
Public and Private ELBs
AutoScaling
EBS – early adopter of big volumes
MySQL RDS
Route53
SWF/SQS/SNS/SES
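A minimal sketch of the IaC workflow, assuming a hypothetical service.yaml template and stack name (the real stacks were the customer's own CFN templates):

# Hypothetical sketch: every environment change goes through a
# CloudFormation template rather than manual console work.
import boto3

cfn = boto3.client("cloudformation")

with open("service.yaml") as f:
    template = f.read()

cfn.create_stack(
    StackName="portal-service-prod",  # placeholder name
    TemplateBody=template,
    Parameters=[{"ParameterKey": "Env", "ParameterValue": "prod"}],
    Capabilities=["CAPABILITY_IAM"],  # the template may create IAM roles
)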
#6:ECS – clustered Docker container orchestration service
Early ECS adoption:
Early ECS with an HAProxy balancer layer + Consul (service discovery) + consul-template (dynamic balancing via HAProxy) + Registrator (service registration in Consul) + a custom deployment tool (based on Thor + the AWS Ruby SDK)
Complex and hard to manage/troubleshoot
Additional layers added cost
Missing features
Current ECS implementation:
ALB with host-based and URL-path-based target rules (see the sketch after this list)
Clear and simple deployment process via YAML CFN templates
New features as Docker labels
Missing: multiple ELBs and Service Discovery
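To illustrate what replaced the HAProxy/Consul layer, a minimal boto3 sketch of the two ALB rule types; in the real setup these rules live in the YAML CFN templates, and all ARNs, hostnames, and priorities here are placeholders:

# Hypothetical sketch: one ALB listener, two routing rules forwarding to
# ECS target groups.
import boto3

elbv2 = boto3.client("elbv2")

# Host-based rule: api.example.com -> the API service's target group.
elbv2.create_rule(
    ListenerArn="arn:aws:elasticloadbalancing:...:listener/...",
    Priority=10,
    Conditions=[{"Field": "host-header", "Values": ["api.example.com"]}],
    Actions=[{"Type": "forward",
              "TargetGroupArn": "arn:...:targetgroup/api/..."}],
)

# URL-path-based rule: /images/* -> the image service's target group.
elbv2.create_rule(
    ListenerArn="arn:aws:elasticloadbalancing:...:listener/...",
    Priority=20,
    Conditions=[{"Field": "path-pattern", "Values": ["/images/*"]}],
    Actions=[{"Type": "forward",
              "TargetGroupArn": "arn:...:targetgroup/images/..."}],
)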
#7:Lambda + API Gateway + EFS + ECS example
Task: run an ML TensorFlow image trainer for a particular image class on request
AWS Batch does not support EFS – sad
API Gateway + Lambda can launch a task on the ECS cluster, but what if there are no free resources?
The Lambda increases the ASG size as well :) (see the sketch below)
CloudWatch decreases the ASG size back
Tokenizer ECS DEMO
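A minimal sketch of the Lambda logic above, assuming hypothetical cluster, task-definition, container, and ASG names; EFS mounting is assumed to be handled in the task definition / on the container instances and is not shown:

# Hypothetical sketch: try to start the trainer task on ECS; if the
# cluster has no free resources, grow the ASG and report back.
import boto3

ecs = boto3.client("ecs")
asg = boto3.client("autoscaling")

def handler(event, context):
    resp = ecs.run_task(
        cluster="ml-cluster",
        taskDefinition="tf-image-trainer",
        overrides={"containerOverrides": [
            {"name": "trainer",  # container name from the task definition
             "environment": [{"name": "IMAGE_CLASS",
                              "value": event["image_class"]}]}
        ]},
    )
    # run_task reports placement failures (e.g. RESOURCE:MEMORY) in the
    # failures list instead of raising an exception.
    if any(f.get("reason", "").startswith("RESOURCE") for f in resp["failures"]):
        group = asg.describe_auto_scaling_groups(
            AutoScalingGroupNames=["ml-cluster-asg"])["AutoScalingGroups"][0]
        asg.set_desired_capacity(
            AutoScalingGroupName="ml-cluster-asg",
            DesiredCapacity=group["DesiredCapacity"] + 1)
        return {"status": "scaling out, retry shortly"}
    return {"status": "task started"}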
#8:Data Pipeline example
Customer's actions -> S3 Parquet data -> EMR cluster -> Spark steps (joins, sorts, aggregations) -> S3 Parquet data -> Hive indexing -> ES
DP features (an EMR launch sketch follows this list):
Preconditions
Intermediate actions (scripts)
AMI with frameworks
Spot instances
Flexible scaling
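The pipeline itself is orchestrated by Data Pipeline; purely as an illustration of the EMR piece, here is a boto3 sketch that launches a Spark step on Spot instances (cluster name, S3 paths, instance types, bid price, and release label are all made up):

# Hypothetical sketch: transient EMR cluster on Spot capacity running one
# Spark step (the joins/sorts/aggregations), then terminating.
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="dp-aggregation",
    ReleaseLabel="emr-5.8.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m4.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE", "Market": "SPOT",
             "BidPrice": "0.20", "InstanceType": "m4.xlarge",
             "InstanceCount": 10},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when steps finish
    },
    Steps=[{
        "Name": "joins-sorts-aggregations",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/aggregate.py",
                     "s3://my-bucket/in/", "s3://my-bucket/out/"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)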
Additional talk – cost saving:
Spot/Reserved instances
T2 instances
S3 lifecycle policies + storage classes (see the sketch after this list)
Resource inspection
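A minimal sketch of the S3 lifecycle idea, assuming a hypothetical bucket and prefix:

# Hypothetical sketch: age data down through cheaper storage classes,
# then expire it.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-bucket",
    LifecycleConfiguration={"Rules": [{
        "ID": "age-out-raw-events",
        "Filter": {"Prefix": "raw/"},
        "Status": "Enabled",
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access
            {"Days": 90, "StorageClass": "GLACIER"},      # cold archive
        ],
        "Expiration": {"Days": 365},                      # delete after a year
    }]},
)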