Cleaning the Park:
Reclaim your Logging
Matt Campbell
matthew.campbell@d2l.com
@beardedcoder
Matt Campbell
Engineering Director with D2L
Leading project to achieve proper web-scale in AWS
Previously led the move to monthly deployments
What can I expect from this talk?
• Who is D2L and what do they do?
• How we ended up with a dirty park
• How we tried (and failed) to clean up
• How we tried (and failed less) to clean up
• What I’ve learned along the way
Cleaning the Park: Reclaim your Logging
• Millions of logins daily at peak
• TBs of aggregate data
• PBs of aggregate content
• Clients with multi-TB DBs
Dev adds code
Code generates log
Logging
• Structured information written out during operation of code.
• Can be at various levels (info, debug, warn, error, fatal).
• Typically used to debug exceptional code behavior.
For the purposes of this talk, when I refer to logging, I mean error- and fatal-level logging (the really bad stuff).
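As a rough illustration of the levels above, here is a minimal sketch using Python's standard logging module as a stand-in (the talk does not say which logging framework D2L uses). The logger name and messages are hypothetical. Python calls the fatal level CRITICAL. The handler is set to keep only error-level records and above, which is the slice of logging this talk is about.

```python
import io
import logging

# Hypothetical logger name and messages, chosen only for illustration.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
# Only ERROR and above (ERROR, CRITICAL) pass through this handler.
handler.setLevel(logging.ERROR)

logger = logging.getLogger("park.example")
logger.setLevel(logging.DEBUG)   # the logger itself accepts every level
logger.propagate = False         # keep the example's output in our stream only
logger.addHandler(handler)

logger.debug("cache miss for course 42")
logger.info("user logged in")
logger.warning("retrying request")
logger.error("failed to save grade")
logger.critical("database unreachable")

# Lines that survived the ERROR threshold.
captured = stream.getvalue().splitlines()
for line in captured:
    print(line)
```

Only the last two calls produce output; the debug, info, and warning records are dropped by the handler's threshold.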
Broken Windows Theory
Users
Application Servers
Database
New Application Servers
Migrated Database
Partially Migrated Database
Made the daily count of generated logs visible
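The kind of number such a dashboard shows can be sketched as below. This is not D2L's implementation: the record format (ISO timestamp, level, message) and the sample data are assumptions made up for the example. It simply counts error- and fatal-level records per calendar day.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log records as (ISO timestamp, level, message) tuples.
records = [
    ("2017-05-01T09:15:00", "ERROR", "failed to save grade"),
    ("2017-05-01T09:16:30", "INFO", "user logged in"),
    ("2017-05-01T23:59:59", "FATAL", "database unreachable"),
    ("2017-05-02T00:00:01", "ERROR", "timeout calling service"),
]


def daily_error_counts(records):
    """Count ERROR and FATAL records per calendar day."""
    counts = Counter()
    for timestamp, level, _message in records:
        if level in {"ERROR", "FATAL"}:
            day = datetime.fromisoformat(timestamp).date()
            counts[day] += 1
    return counts


counts = daily_error_counts(records)
for day in sorted(counts):
    print(day.isoformat(), counts[day])
```

A count like this makes the volume visible, but on its own it gives no access to the logs themselves and no view of which errors matter.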
Users
Application Servers
Database
New Application Servers
Migrated Database
Dev and Ops were very separated at the time
Tried to solve logging for all things at once
Tried to solve for all regions and all instances at once
They didn’t have a strong why
Users
Application Servers
Migrated Database
Dev adds code
Code generates log
Dev reviews log and makes change
Resistance to Change
Unfamiliarity with Problem
Top Tips
Go for something sticky
Everyone has inertia to overcome
It’s OK to acknowledge this will take a while
Celebrate the small victories
Don’t be afraid to start
You don’t need to know how to finish

Editor's Notes

  • #2: https://www.soswildlifecontrol.com/wp-content/uploads/2017/01/qtq80-vB3b6f.jpeg
  • #5: Who is D2L? What is it that we do? Image Copyright 2016 D2L Inc.
  • #6: We’re global https://upload.wikimedia.org/wikipedia/commons/0/09/BlankMap-World-v2.png
  • #7: We’re data intensive <See if you can get stats on logins, amount of data, size of DBs>
  • #8: In the early days of D2L, there weren’t fancy technologies like Elasticsearch or fancy third-party logging vendors. So we dumped the logs in the spot we had: SQL Server. http://1.bp.blogspot.com/-M1MjcnIA7fU/Voz-EKMq5PI/AAAAAAAAL-w/xvywHnJugm8/s1600/rolling-logs.jpg
  • #9: When the company was small this worked just fine. Everyone had access to everything and people worked closely (as in physically) together.
  • #10: However, as the company grew, people had to specialize and we had to start restricting access to client data. This meant that developers no longer had easy access to the logs that were being generated https://i.ytimg.com/vi/TeYaQGbD6Xc/maxresdefault.jpg
  • #11: However, we knew logging was a good thing, so we kept adding logging to the code. For a time I imagine devs still had insight (through back channels), but over time that would have waned.
  • #13: However, we knew logging was a good thing, so we kept adding logging to the code. For a time I imagine devs still had insight (through back channels), but over time that would have waned.
  • #14: Eventually, we were adding logging because the code we saw had logging in it. As much as we don’t want to admit it, we copy other code to help get our work done faster, and when we copy that code we don’t stop to consider whether everything we are copying should remain. So more and more logging was added. https://i.redd.it/tjltyl5m332z.jpg
  • #15: This meant we had hundreds of thousands of logs monthly (at the error level alone) that were rarely looked at. https://upload.wikimedia.org/wikipedia/commons/c/c7/Logs.jpg
  • #16: However, all these logs were consuming disk space and IOPS on our databases. https://img.gawkerassets.com/img/19aanrziuk65tjpg/original.jpg
  • #20: https://www.soswildlifecontrol.com/wp-content/uploads/2017/01/qtq80-vB3b6f.jpeg
  • #21: https://www.soswildlifecontrol.com/wp-content/uploads/2017/01/qtq80-vB3b6f.jpeg
  • #23: To my knowledge we’ve had 3 previous attempts to correct this problem https://i0.wp.com/mobcblog.wpengine.com/wp-content/uploads/2015/09/52.jpg
  • #24: The first attempt spun off of my past project, moving D2L to continuous delivery. My manager at the time (and the architect of the CD solution) realized that the amount of noise in the logs meant we couldn’t use them to figure out if we had broken anything while moving the deployment model forward.
  • #25: So we built a widget on a dashboard that showed the number of error logs generated that day.
  • #26: And nothing happened. Well, something happened: we all felt really bad about how many logs we made.
  • #27: We hadn’t fixed the main problems blocking us: simple access to the logs and an aggregated view across all instances.
  • #28: This is when the idea of combining all the logs together into a centralized solution first came about. http://cdn1us.denofgeek.com/sites/denofgeekus/files/2016/10/captain-planet-movie.jpg
  • #29: http://teamgaffney.com/wp-content/uploads/2016/02/Return-to-Sender.png
  • #31: https://i.imgflip.com/y9sav.jpg
  • #32: We focused on solving logging for the sake of logging, but no one cared about logging.
  • #33: Full disclosure, we’re not actually done yet <gasp> https://fluffrick.files.wordpress.com/2012/01/richmond.jpg
  • #34: So why is this time any different?
  • #35: We have a stronger why behind the move. This time it is spurred by our large-scale migration into AWS. https://cdn.vox-cdn.com/thumbor/tlEpA9YH7R4FUtj2Bf347AoeP8I=/0x0:308x164/1200x800/filters:focal(95x65:143x113)/cdn.vox-cdn.com/uploads/chorus_image/image/54087325/images.0.jpg
  • #36: Moving into AWS provided us with access to technologies that we didn’t have before
  • #37: With our move to AWS we are also rearchitecting our application for multi-tenancy. The way we currently log is a blocker to that.
  • #38: All of this gave us a stronger vision for the project than just “fix logging.”
  • #50: https://nbchardballtalk.files.wordpress.com/2017/02/field-of-dreams-e1486999847754.jpg?w=1200
  • #58: The goal was still to get to zero logs. With error logs under control we can actually start to treat new error logs as critical events. However, there were far too many logs for one team to handle. Even with the new people who came on board once Kibana was out, it wasn’t enough to start making progress. http://i2.cdn.turner.com/money/dam/assets/150109042203-line-graph-ground-breaking-1024x576.png
  • #59: Recall that we want to get to zero error logs so that any new log can be treated as an incident.
  • #62: Enter LogDropper. We started an initiative to
  • #65: The emotional cycle of change
  • #74: http://www.fitcrazy.tv/wp-content/uploads/2016/09/blog-lesson-from-barbell-main.jpg
  • #75: http://static.tabatatimes.com/wp-content/uploads/2013/10/Sisyphus-Image-01C.jpg
  • #76: https://greatperformersacademy.com/images/images/Articles_images/small-wins-happiness-success.jpg
  • #77: https://upload.wikimedia.org/wikipedia/commons/6/63/African_elephant_warning_raised_trunk.jpg
  • #79: https://www.soswildlifecontrol.com/wp-content/uploads/2017/01/qtq80-vB3b6f.jpeg