DB Migrations = Pain
Context
● Look is an application for live video streaming
● Backend, iOS and Android clients, admin page,
frontend for customers
● Good management
● Good architecture
Context
● 3 environments: develop, qa, production (and
local)
● 3 core services:
– web (aka api)
– rtmp (video streaming)
– cent (realtime messaging)
Context
● There are 2 backend developers
● We think about code quality:
– very strict linter
– tests: unit and behave
– deploy in 1 command
Story
● Deployment after 3 months of development
● DB redesign: changed one of the core models
to fit business logic
– Schema migration
– Data migration
● Statistics on the admin page
● Successfully deployed to dev and qa
Story
● The data migrations ran for 40 minutes:
– I was prepared for that
● Production was down for 5 hours
– Kernel Panic!
● I deployed the previous version and restored the DB
from a snapshot – lost the last 3 hours of data
Plan
● Analyze
● Fix
● Learn the lesson
What were the symptoms?
● Django was not responding to requests at all
● Memory usage was fine
● CPU was fine
● Network was fine
● Actually, Django was responding with HUGE
latency
– the best case was 5 minutes, to the simplest
request!
How did we investigate?
● Find bottlenecks:
– analyze latencies locally – django-silk is the best
● Fix them one by one
● Test the fixes on the develop environment
How did we fix it?
● Speed up data migrations: 40 minutes → 7
minutes
– select_related
● Move all long-running tasks to Celery tasks
● To prevent a race between Celery and Django, we
run them on separate instances
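
The select_related speed-up comes from replacing one extra query per row with a single JOIN. A minimal sketch of that effect, using the standard-library sqlite3 module instead of Django so it runs anywhere; the author and stream tables are hypothetical and are not the Look schema.

```python
import sqlite3


def build_db():
    """Create a tiny in-memory database with hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE stream (id INTEGER PRIMARY KEY, title TEXT,
                             author_id INTEGER REFERENCES author(id));
    """)
    conn.executemany("INSERT INTO author VALUES (?, ?)",
                     [(i, f"author-{i}") for i in range(1, 4)])
    conn.executemany("INSERT INTO stream VALUES (?, ?, ?)",
                     [(i, f"stream-{i}", (i % 3) + 1) for i in range(1, 10)])
    return conn


def titles_n_plus_one(conn):
    """One query for the streams, then one more query per stream."""
    queries = 1
    rows = conn.execute(
        "SELECT title, author_id FROM stream ORDER BY id").fetchall()
    result = []
    for title, author_id in rows:
        (name,) = conn.execute(
            "SELECT name FROM author WHERE id = ?", (author_id,)).fetchone()
        queries += 1
        result.append((title, name))
    return result, queries


def titles_joined(conn):
    """The same data in a single query, which is what a JOIN buys you."""
    rows = conn.execute("""
        SELECT s.title, a.name
        FROM stream s JOIN author a ON a.id = s.author_id
        ORDER BY s.id
    """).fetchall()
    return rows, 1
```

Both functions return the same rows; the first issues one query per stream plus one, the second issues exactly one. In a data migration that loops over every row of a core model, that difference is multiplied by the table size.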
How did we fix it?
● Simplify admin page
– Calculate metrics in periodic celery task
● every 10 minutes, with a 1-hour timeout
– Keep in DB
– Join with the metric table
What do we need to do?
● Zero-downtime deployment, aka Continuous
Deployment
Continuous Deployment
● Blue Green Deployment
Our way
● Use 2 web instances:
– Current
– Staging
● Use 2 DB instances:
– Current
– Staging
Our way
● Deployment steps:
– Deploy to staging
– Run migrations
– Wait
– Swap the DNS
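
The four deployment steps can be modelled as a small state machine. The sketch below is a simulation, not deployment tooling: Environment, Router, run_migrations and is_healthy are hypothetical names, and the DNS record is represented by a plain Python object.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Environment:
    name: str
    version: str
    migrated: bool = False


@dataclass
class Router:
    """Stands in for the DNS record that points at the live environment."""
    live: Environment
    idle: Environment

    def swap(self) -> None:
        self.live, self.idle = self.idle, self.live


def deploy(router: Router, version: str,
           run_migrations: Callable[[Environment], None],
           is_healthy: Callable[[Environment], bool]) -> bool:
    target = router.idle
    target.version = version     # 1. deploy to staging
    run_migrations(target)       # 2. run migrations against the staging DB
    if not is_healthy(target):   # 3. wait, then check before going live
        return False             #    production keeps serving the old version
    router.swap()                # 4. swap the DNS
    return True


def mark_migrated(env: Environment) -> None:
    env.migrated = True
```

Because the swap is the last step, a slow or failed migration only affects the idle environment; the live one is untouched until the check passes.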
The fixes deployment
● Production was down for 4 hours
– Panic!
● The same symptoms!
The guess
● Look at the whole stack:
– The DB floods the disk space
– The free disk space metric has a reverse sawtooth
shape
● Super hot fix: turn off the metric task
– The free disk space metric has the same period as
the periodic task for calculating metrics
Investigation
● Use the production DB clone
● Run the raw query that collects metrics
– It ran for 1 hour!
● This is the reason!
How did we fix it?
● The raw query looks like:
– SELECT DISTINCT
– 8 LEFT OUTER JOINs
– 5 COUNTs
– 3 CASEs
– GROUP BY user.id
● Use EXPLAIN
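
As an illustration of what EXPLAIN gives you, the sketch below asks SQLite for a query plan before and after adding an index. The slides do not say which database Look uses, so SQLite's `EXPLAIN QUERY PLAN` is an assumption here (other databases have their own EXPLAIN syntax and output), and the stream table and index name are hypothetical.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the plan steps SQLite reports for a query, as strings."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the fourth column is the description


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stream (id INTEGER PRIMARY KEY, user_id INTEGER)")
    sql = "SELECT COUNT(*) FROM stream WHERE user_id = ?"
    before = query_plan(conn, sql, (1,))   # full table scan
    conn.execute("CREATE INDEX idx_stream_user ON stream (user_id)")
    after = query_plan(conn, sql, (1,))    # search using the new index
    return before, after
```

Reading the plan for each JOIN of a slow query shows which steps scan whole tables, which is usually where to start.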
How did we fix it?
● We did not try to run the raw query from
Django
– There is no reason to do so
● Attempts:
– Remove metrics that require CASEs
– Reduce the number of COUNTs and JOINs
– Remove DISTINCT
– Fetch row by row
– Use one query for each metric
How did we fix it?
● The fix is:
– Use one query for each metric
● The best performance in the production case
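
One reason a single query with many LEFT OUTER JOINs gets expensive is that the joins multiply rows before the COUNTs run, which is also why DISTINCT is needed. A minimal sketch with sqlite3 and hypothetical app_user, stream and comment tables (not the Look schema), compared with one query per metric:

```python
import sqlite3


def make_db():
    """One user with 2 streams and 3 comments."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE app_user (id INTEGER PRIMARY KEY);
        CREATE TABLE stream (id INTEGER PRIMARY KEY, user_id INTEGER);
        CREATE TABLE comment (id INTEGER PRIMARY KEY, user_id INTEGER);
        INSERT INTO app_user VALUES (1);
        INSERT INTO stream VALUES (1, 1), (2, 1);
        INSERT INTO comment VALUES (1, 1), (2, 1), (3, 1);
    """)
    return conn


def combined(conn):
    """All metrics in one query: the two joins produce 2 x 3 = 6 rows."""
    return conn.execute("""
        SELECT u.id,
               COUNT(s.id),            -- inflated by the second join
               COUNT(DISTINCT s.id),   -- correct, but needs DISTINCT
               COUNT(DISTINCT c.id)
        FROM app_user u
        LEFT JOIN stream s ON s.user_id = u.id
        LEFT JOIN comment c ON c.user_id = u.id
        GROUP BY u.id
    """).fetchall()


def per_metric(conn, table):
    """One small query per metric; no row multiplication, no DISTINCT."""
    if table not in ("stream", "comment"):  # fixed whitelist, not user input
        raise ValueError(table)
    rows = conn.execute(
        f"SELECT user_id, COUNT(*) FROM {table} GROUP BY user_id").fetchall()
    return dict(rows)
```

With 8 joins instead of 2, the intermediate row count is the product of all the per-user counts, while the per-metric queries each touch one table once.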
Did it help?
Yes
The lesson
● Good management and good architecture
matter
● Deploy more frequently
● Do not use data migrations as is – use
commands
● Django admin is not efficient for aggregation
queries
● Analysis and synthesis matter
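
One reading of "use commands" is that a standalone command can run separately from the deploy, work in small committed batches, and be re-run safely. The slides do not show the actual command, so the sketch below is an assumption: it uses sqlite3 rather than a Django management command, and the stream table with its legacy_title and title columns is hypothetical.

```python
import sqlite3


def backfill_titles(conn, batch_size=1000):
    """Copy legacy_title into title, committing after every batch.

    Rows already migrated are skipped, so the function can be interrupted
    and run again without redoing finished work.
    """
    total = 0
    while True:
        with conn:  # one short transaction per batch
            cur = conn.execute("""
                UPDATE stream SET title = legacy_title
                WHERE id IN (
                    SELECT id FROM stream
                    WHERE title IS NULL AND legacy_title IS NOT NULL
                    ORDER BY id LIMIT ?
                )
            """, (batch_size,))
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
```

Short transactions keep locks brief, so the application can keep serving requests while the data is moved.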
A proof
● I have refactored another core model:
– A schema migration
– A command for data migration
● I have deployed it without downtime
● The Look production environment is still alive
Summary
● Analyze
● Fix
● Learn the lesson
References
● https://crystalnix.com/works/look/
● http://martinfowler.com/bliki/BlueGreenDeployment.html
● https://gist.github.com/EvgeneOskin/99880b7b7e0cd2d0115f87b7eeb5ae57