Db migrations equal pain

1. DB Migrations = Pain

2. Context ● Look is an application for live video streaming ● Backend, iOS and Android clients, admin page, customer-facing frontend ● Good management ● Good architecture

3. Context ● 3 environments: develop, qa, production (plus local) ● 3 core services: – web (aka api) – rtmp (video streaming) – cent (realtime messaging)

4. Context ● There are 2 backend developers ● We care about code quality: – very strict linter – tests: unit and behave – deploy in 1 command

5. Story ● Deployment after 3 months of development ● DB redesign: changed one of the core models to fit the business logic – Schema migration – Data migration ● Statistics on the admin page ● Successfully deployed to dev and qa

6. Story ● The data migrations ran for 40 minutes: – I was ready for that ● Production was down for 5 hours – Kernel Panic! ● I deployed the previous version and restored the DB from a snapshot – lost the last 3 hours of data

7. Plan ● Analyze ● Fix ● Learn the lesson

8. What were the symptoms? ● Django seemed not to respond to requests at all ● Memory usage was fine ● CPU was fine ● Network was fine ● Actually, Django was responding with HUGE latency – the best case was 5 minutes, even for the simplest request!

9. How did we investigate? ● Find bottlenecks: – analyze latencies locally – django-silk was the most useful tool for this ● Fix them one by one ● Test the fixes on the develop environment

10. How did we fix it? ● Speed up data migrations: 40 minutes → 7 minutes – select_related ● Move all long-running work to celery tasks ● To prevent a race between celery and django, we run them on separate instances
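
Slide 10 credits select_related for most of the migration speed-up. In Django, select_related pulls a foreign-key target in the same query through a JOIN instead of issuing one extra query per row. The sketch below shows that difference with the standard-library sqlite3 module rather than the Django ORM; the app_user and stream tables are hypothetical and are not the Look schema.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE app_user (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE stream (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES app_user(id),
            title TEXT NOT NULL
        );
    """)
    conn.executemany("INSERT INTO app_user (id, name) VALUES (?, ?)",
                     [(1, "ann"), (2, "bob")])
    conn.executemany("INSERT INTO stream (user_id, title) VALUES (?, ?)",
                     [(1, "a"), (1, "b"), (2, "c")])
    return conn


def owners_n_plus_one(conn):
    """One query for the streams, then one more query per stream."""
    queries = 1
    result = []
    streams = conn.execute(
        "SELECT user_id, title FROM stream ORDER BY id").fetchall()
    for user_id, title in streams:
        name = conn.execute(
            "SELECT name FROM app_user WHERE id = ?", (user_id,)).fetchone()[0]
        queries += 1
        result.append((title, name))
    return result, queries


def owners_joined(conn):
    """Same data in a single query, which is what a JOIN buys you."""
    rows = conn.execute("""
        SELECT s.title, u.name
        FROM stream s JOIN app_user u ON u.id = s.user_id
        ORDER BY s.id
    """).fetchall()
    return rows, 1
```

Both functions return the same rows; only the number of round trips to the database differs, and that number grows with the table in the first version.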

11. How did we fix it? ● Simplify the admin page – Calculate metrics in a periodic celery task ● every 10 minutes, with a 1-hour timeout – Keep them in the DB – Join with the metric table
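
The idea on slide 11 can be sketched as two functions: one that recomputes the metrics and stores them, and one that the admin page uses to read them back with a cheap join. This is a minimal sketch with sqlite3 and a hypothetical schema (app_user, stream, user_metric); the Celery scheduling itself is not shown, and refresh_metrics stands in for the body of the periodic task.

```python
import sqlite3


def make_db():
    """Hypothetical schema: users, their streams, and a precomputed metric table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE app_user (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE stream (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL);
        CREATE TABLE user_metric (
            user_id INTEGER PRIMARY KEY,
            stream_count INTEGER NOT NULL
        );
    """)
    conn.executemany("INSERT INTO app_user (id, name) VALUES (?, ?)",
                     [(1, "ann"), (2, "bob")])
    conn.executemany("INSERT INTO stream (user_id) VALUES (?)",
                     [(1,), (1,), (2,)])
    conn.commit()
    return conn


def refresh_metrics(conn):
    """Body of the periodic task: recompute and overwrite the stored metrics."""
    with conn:  # commit on success, roll back on error
        conn.execute("""
            INSERT OR REPLACE INTO user_metric (user_id, stream_count)
            SELECT u.id, COUNT(s.id)
            FROM app_user u LEFT JOIN stream s ON s.user_id = u.id
            GROUP BY u.id
        """)


def admin_rows(conn):
    """What the admin page reads: a plain join with the metric table."""
    return conn.execute("""
        SELECT u.name, COALESCE(m.stream_count, 0)
        FROM app_user u LEFT JOIN user_metric m ON m.user_id = u.id
        ORDER BY u.id
    """).fetchall()
```

The trade-off is staleness: the admin page shows the numbers from the last refresh, not the current state of the tables.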

12. What do we need to do? ● Zero-downtime deployment aka Continuous Deployment

13. Continuous Deployment ● Blue-Green Deployment

14. Our way ● Use 2 web instances: – Current – Staging ● Use 2 DB instances: – Current – Staging

15. Our way ● Deployment steps: – Deploy to staging – Run migrations – Wait – Swap the DNS
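
The deployment steps on slide 15 can be modelled as a small state machine. This is only an illustration of the ordering, not the team's deploy script: the BlueGreen class, the environment names, and the is_healthy callback are all hypothetical, and the "Wait" step is interpreted here as waiting for a health check on the idle environment.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BlueGreen:
    """Tracks which environment serves traffic (hypothetical model)."""
    live: str = "blue"
    idle: str = "green"
    log: List[str] = field(default_factory=list)

    def deploy(self, release: str, is_healthy: Callable[[str], bool]) -> bool:
        """Deploy to the idle side and swap only if it passes the health check."""
        self.log.append(f"deploy {release} to {self.idle}")
        self.log.append(f"run migrations on {self.idle}")
        # "Wait": the live side keeps serving until the idle side is ready.
        if not is_healthy(self.idle):
            self.log.append(f"abort: {self.idle} unhealthy, {self.live} keeps serving")
            return False
        self.live, self.idle = self.idle, self.live
        self.log.append(f"swap DNS to {self.live}")
        return True
```

The point of the ordering is that a failed or slow migration never touches the environment that is serving users; the swap is the only step they can observe.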

16. Deploying the fixes

17. Deploying the fixes ● Production was down for 4 hours – Panic! ● The same symptoms!

18. The guess ● Look at the whole stack: – The DB floods the disk space – The free disk space metric has a reverse sawtooth shape ● Super hot fix: turn off the metric task – The free disk space metric has the same period as the periodic task that calculates metrics

19. Investigation ● Use a clone of the production DB ● Run the raw query that collects metrics – It ran for 1 hour! ● This is the reason!

20. How did we fix it? ● The raw query looks like: – SELECT DISTINCT – 8 LEFT OUTER JOINs – 5 COUNTs – 3 CASEs – GROUP BY user.id ● Use EXPLAIN

21. How did we fix it? ● We did not try to use the raw query in django – There is no reason to do so ● Attempts: – Remove metrics that require CASEs – Reduce the number of COUNTs and JOINs – Remove DISTINCT – Fetch row by row – Use one query for each metric

22. How did we fix it? ● The fix is: – Use one query for each metric ● It gave the best performance in the production case
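
Why one query per metric can beat a single multi-join query: every extra LEFT OUTER JOIN multiplies the intermediate rows per user, and DISTINCT then has to undo the duplication. The sketch below reproduces that effect on a toy scale with sqlite3. The schema (app_user, stream, comment) and the two metrics are hypothetical; the real query had 8 joins and 5 counts.

```python
import sqlite3

JOINED_FROM = """
    FROM app_user u
    LEFT OUTER JOIN stream s ON s.user_id = u.id
    LEFT OUTER JOIN comment c ON c.user_id = u.id
"""

# One small GROUP BY query per metric (hypothetical metrics).
METRIC_QUERIES = {
    "streams": "SELECT user_id, COUNT(*) FROM stream GROUP BY user_id",
    "comments": "SELECT user_id, COUNT(*) FROM comment GROUP BY user_id",
}


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE app_user (id INTEGER PRIMARY KEY);
        CREATE TABLE stream (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL);
        CREATE TABLE comment (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL);
    """)
    conn.executemany("INSERT INTO app_user (id) VALUES (?)", [(1,), (2,)])
    conn.executemany("INSERT INTO stream (user_id) VALUES (?)", [(1,), (1,), (2,)])
    conn.executemany("INSERT INTO comment (user_id) VALUES (?)", [(1,), (1,), (1,)])
    return conn


def joined_row_count(conn):
    """Rows the database has to produce before grouping in the combined query."""
    return conn.execute("SELECT COUNT(*) " + JOINED_FROM).fetchone()[0]


def combined(conn):
    """All metrics in one query; DISTINCT is needed to undo the join fan-out."""
    rows = conn.execute(
        "SELECT u.id, COUNT(DISTINCT s.id), COUNT(DISTINCT c.id) "
        + JOINED_FROM + " GROUP BY u.id ORDER BY u.id").fetchall()
    return {uid: {"streams": s, "comments": c} for uid, s, c in rows}


def per_metric(conn):
    """One query per metric, merged in Python."""
    user_ids = [row[0] for row in conn.execute("SELECT id FROM app_user ORDER BY id")]
    result = {uid: {name: 0 for name in METRIC_QUERIES} for uid in user_ids}
    for name, sql in METRIC_QUERIES.items():
        for uid, count in conn.execute(sql).fetchall():
            result[uid][name] = count
    return result
```

With 2 streams and 3 comments for one user, the combined query already builds 6 intermediate rows for that user; the per-metric queries never build more rows than the tables contain.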

23. Did it help? Yes

24. The lesson ● Good management and good architecture matter ● Deploy more frequently ● Do not use data migrations as is – Use commands ● Django admin is not efficient for aggregation queries ● Analysis and synthesis matter
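
"Use commands" on slide 24 means moving the data migration out of the deploy-time migration step into something that can run separately, in batches, and be re-run safely. A minimal sketch of such a batched, resumable backfill follows. It uses sqlite3 and a hypothetical profile table with legacy_name and display_name columns; in a Django project the same loop would presumably live in the handle method of a custom management command.

```python
import sqlite3


def make_db(names):
    """Hypothetical table: display_name is the new column to backfill."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE profile (
            id INTEGER PRIMARY KEY,
            legacy_name TEXT NOT NULL,
            display_name TEXT
        )
    """)
    conn.executemany("INSERT INTO profile (legacy_name) VALUES (?)",
                     [(name,) for name in names])
    conn.commit()
    return conn


def backfill_display_name(conn, batch_size=1000):
    """Fill display_name in small committed batches; returns rows updated.

    Only rows that are still NULL are selected, so the function can be
    interrupted and re-run without redoing finished work.
    """
    total = 0
    while True:
        ids = [row[0] for row in conn.execute(
            "SELECT id FROM profile WHERE display_name IS NULL "
            "ORDER BY id LIMIT ?", (batch_size,))]
        if not ids:
            return total
        with conn:  # one short transaction per batch
            conn.executemany(
                "UPDATE profile SET display_name = TRIM(legacy_name) WHERE id = ?",
                [(row_id,) for row_id in ids])
        total += len(ids)
```

Short transactions keep locks brief, so the application can keep serving requests while the backfill runs next to it.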

25. A proof ● I have refactored another core model: – A schema migration – A command for the data migration ● I deployed it without downtime ● Look's production environment is still alive

26. Summary ● Analyze ● Fix ● Learn the lesson

27. References ● https://crystalnix.com/works/look/ ● http://martinfowler.com/bliki/BlueGreenDeployment.html ● https://gist.github.com/EvgeneOskin/99880b7b7e0cd2d0115f87b7eeb5ae57
