#3:Let's start with our customer's life before AWS. There were 6 clusters in a hardware DC. Each cluster had its own master MySQL database, a set of slave DBs, a set of web apps, its own portal app, tools, etc.
Drivers for changes:
BF preparation was too expensive/complicated
New cluster creation or disaster recovery was a sequence of manually executed, documented actions (runbooks).
A few outages during 2008 pushed us to become more redundant; we needed a failover DC
AWS technologies: none
#4:Two AWS DCs + the master on-premises hardware DC
AWS Benefits:
Rolling deployment: take a DC out of the pool and deploy
Fast and “unlimited” capacity scaling up/down
Lessons learned:
AWS is flexible and scalable
A lot of infra changes needed to be done
All components need to be cloud-ready (built to fail)
We needed to change and adapt to the cloud.
Moving forward:
The slaves routinely fell behind when we had to ingest lots of new data, sometimes by 10 hours or more. We needed to rethink our entire stack.
AWS technologies: EC2, EIP
#5:Full platform/organizational re-architecture. Agile/DevOps/…
Re-architecture drivers:
* app/organizational changes
* next level of flexibility, performance, and reliability
* solve our multi-region replication problem
* get rid of our individual clusters
App/Org changes:
* Monolithic Java app was broken up into a set of small services, each supported by a decentralized engineering team.
* The teams were responsible for the entire service life-cycle, from Development to QA to Operations.
* Engineering adopted Agile as a development methodology, where previously we were waterfall driven.
Shorter release cycle: from a coordinated release every 8-12 weeks to weekly, with coordinated releases possible at any time
Tools changes:
Puppet adoption
Zabbix/Nagios to Datadog
From distributed to centralized logging
Data stack:
For our DB system of record, we chose Cassandra for its multi-region replication abilities (DynamoDB did not have this feature) and cloud-native operational qualities.
ElasticSearch replaced Solr for similar reasons.
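To make the replication point concrete, here is a minimal sketch of how a multi-region keyspace is declared in Cassandra (using the Python cassandra-driver; the keyspace name, DC names, and contact point are made up for illustration):

# Hypothetical sketch: cross-region replication in Cassandra is declared
# per keyspace. NetworkTopologyStrategy replicates per datacenter, which
# is what MySQL could not give us across regions.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.10"])  # any contact point in the local region
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS app_ks
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us_east': 3,
        'eu_west': 3
    }
""")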
AWS technologies:
IaC: CloudFormation (see the sketch after this list)
3 VPCs: dev/qa/prod; 3 regions with 3 AZs each
Public and Private ELBs
AutoScaling
EBS – early adopter of big volumes
MySQL RDS
Route53
SWF/SQS/SNS/SES
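A minimal sketch of the IaC workflow, assuming a hypothetical service.yaml template and stack name (the real stacks were the customer's own CFN templates):

# Hypothetical sketch: every environment change goes through a
# CloudFormation template rather than manual console work.
import boto3

cfn = boto3.client("cloudformation")

with open("service.yaml") as f:
    template = f.read()

cfn.create_stack(
    StackName="portal-service-prod",  # placeholder name
    TemplateBody=template,
    Parameters=[{"ParameterKey": "Env", "ParameterValue": "prod"}],
    Capabilities=["CAPABILITY_IAM"],  # the template may create IAM roles
)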
#6:ECS – clustered Docker container orchestration service
Early ECS adoption:
Early ECS with an HAProxy balancer layer + Consul (service discovery) + consul-template (dynamic balancing via HAProxy) + Registrator (service registration in Consul) + a custom deployment tool (based on Thor + the AWS Ruby SDK)
Complex and hard to manage/troubleshoot
Additional layers added cost
Missing features
Current ECS implementation:
ALB with host-based and URL-path-based target rules (see the sketch after this list)
Clear and simple deployment process via YAML CFN templates
New features as Docker labels
Missing: multiple ELBs and Service Discovery
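To illustrate what replaced the HAProxy/Consul layer, a minimal boto3 sketch of the two ALB rule types; in the real setup these rules live in the YAML CFN templates, and all ARNs, hostnames, and priorities here are placeholders:

# Hypothetical sketch: one ALB listener, two routing rules forwarding to
# ECS target groups.
import boto3

elbv2 = boto3.client("elbv2")

# Host-based rule: api.example.com -> the API service's target group.
elbv2.create_rule(
    ListenerArn="arn:aws:elasticloadbalancing:...:listener/...",
    Priority=10,
    Conditions=[{"Field": "host-header", "Values": ["api.example.com"]}],
    Actions=[{"Type": "forward",
              "TargetGroupArn": "arn:...:targetgroup/api/..."}],
)

# URL-path-based rule: /images/* -> the image service's target group.
elbv2.create_rule(
    ListenerArn="arn:aws:elasticloadbalancing:...:listener/...",
    Priority=20,
    Conditions=[{"Field": "path-pattern", "Values": ["/images/*"]}],
    Actions=[{"Type": "forward",
              "TargetGroupArn": "arn:...:targetgroup/images/..."}],
)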
#7:Lambda + API Gateway + EFS + ECS example
Task: run an ML TensorFlow image trainer for a particular image class on request
AWS Batch does not support EFS – sad
API Gateway + Lambda can launch a task on the ECS cluster, but what if there are no free resources?
The Lambda increases the ASG size as well :) (see the sketch below)
CloudWatch decreases the ASG size back
Tokenizer ECS DEMO
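A minimal sketch of the Lambda logic above, assuming hypothetical cluster, task-definition, container, and ASG names; EFS mounting is assumed to be handled in the task definition / on the container instances and is not shown:

# Hypothetical sketch: try to start the trainer task on ECS; if the
# cluster has no free resources, grow the ASG and report back.
import boto3

ecs = boto3.client("ecs")
asg = boto3.client("autoscaling")

def handler(event, context):
    resp = ecs.run_task(
        cluster="ml-cluster",
        taskDefinition="tf-image-trainer",
        overrides={"containerOverrides": [
            {"name": "trainer",  # container name from the task definition
             "environment": [{"name": "IMAGE_CLASS",
                              "value": event["image_class"]}]}
        ]},
    )
    # run_task reports placement failures (e.g. RESOURCE:MEMORY) in the
    # failures list instead of raising an exception.
    if any(f.get("reason", "").startswith("RESOURCE") for f in resp["failures"]):
        group = asg.describe_auto_scaling_groups(
            AutoScalingGroupNames=["ml-cluster-asg"])["AutoScalingGroups"][0]
        asg.set_desired_capacity(
            AutoScalingGroupName="ml-cluster-asg",
            DesiredCapacity=group["DesiredCapacity"] + 1)
        return {"status": "scaling out, retry shortly"}
    return {"status": "task started"}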
#8:Data Pipeline example
Customer's actions -> S3 Parquet data -> EMR cluster -> Spark steps (joins, sorts, aggregations) -> S3 Parquet data -> Hive indexing -> ES
DP features (an EMR launch sketch follows this list):
Preconditions
Intermediate actions (scripts)
AMI with frameworks
Spot instances
Flexible scaling
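The pipeline itself is orchestrated by Data Pipeline; purely as an illustration of the EMR piece, here is a boto3 sketch that launches a Spark step on Spot instances (cluster name, S3 paths, instance types, bid price, and release label are all made up):

# Hypothetical sketch: transient EMR cluster on Spot capacity running one
# Spark step (the joins/sorts/aggregations), then terminating.
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="dp-aggregation",
    ReleaseLabel="emr-5.8.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m4.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE", "Market": "SPOT",
             "BidPrice": "0.20", "InstanceType": "m4.xlarge",
             "InstanceCount": 10},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when steps finish
    },
    Steps=[{
        "Name": "joins-sorts-aggregations",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/aggregate.py",
                     "s3://my-bucket/in/", "s3://my-bucket/out/"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)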
Additional talk – cost saving:
Spot/Reserved instances
T2 instances
S3 lifecycle policies + storage classes (see the sketch after this list)
Resource inspection
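A minimal sketch of the S3 lifecycle idea, assuming a hypothetical bucket and prefix:

# Hypothetical sketch: age data down through cheaper storage classes,
# then expire it.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-bucket",
    LifecycleConfiguration={"Rules": [{
        "ID": "age-out-raw-events",
        "Filter": {"Prefix": "raw/"},
        "Status": "Enabled",
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access
            {"Days": 90, "StorageClass": "GLACIER"},      # cold archive
        ],
        "Expiration": {"Days": 365},                      # delete after a year
    }]},
)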