Analysing Python Machine Learning Notebooks with Moose
Evref
Marius Mignard¹, Steven Costiou¹, Nicolas Anquetil¹, Anne Etien¹
¹ Univ. Lille, CNRS, Inria, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
2
Machine learning (ML) workflow
3
Notebooks
Markdown text
Python code
Code execution outputs
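A minimal sketch of how these three kinds of content show up in a notebook file, assuming the standard Jupyter `.ipynb` JSON layout (nbformat 4), where each cell carries a `cell_type`, its `source` and, for code cells, a list of `outputs`. The sample notebook and the `summarise` helper are made up for illustration and are not part of the presented tooling.

```python
import json

# A tiny hand-written notebook in the nbformat 4 JSON layout (illustrative data).
NOTEBOOK_JSON = """
{
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {},
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Load the data"]},
    {"cell_type": "code", "metadata": {}, "execution_count": 1,
     "source": ["a = 1 + 1\\n", "a"],
     "outputs": [{"output_type": "execute_result", "execution_count": 1,
                  "metadata": {}, "data": {"text/plain": ["2"]}}]}
  ]
}
"""


def summarise(notebook):
    """Return (cell_type, source_text, number_of_outputs) for every cell."""
    rows = []
    for cell in notebook["cells"]:
        source = "".join(cell["source"])   # here the source is a list of lines
        outputs = cell.get("outputs", [])  # markdown cells have no outputs
        rows.append((cell["cell_type"], source, len(outputs)))
    return rows


notebook = json.loads(NOTEBOOK_JSON)
for cell_type, source, n_outputs in summarise(notebook):
    print(f"{cell_type:8} outputs={n_outputs} source={source!r}")
```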
4
Notebooks benefits
- Accessible
- Default platform for ML development
- Centralise information
- Can be reused in whole or in part
5
Notebooks drawbacks
- Used by people without Software Engineering knowledge
- Lack of understanding of the underlying mechanisms
6
Machine learning (ML) usage
McKinney SM, Sieniek M, Godbole V, Godwin J, Antropova N, Ashrafian H, Back T, Chesus M, Corrado GC, Darzi A, Etemadi M, Garcia-Vicente F, Gilbert FJ, Halling-Brown M, Hassabis D, Jansen S, Karthikesalingam A, Kelly CJ, King D, Ledsam JR, Melnick D, Mostofi H, Peng L, Reicher JJ, Romera-Paredes B, Sidebottom R, Suleyman M, Tse D, Young KC, De Fauw J, Shetty S. International evaluation of an AI system for breast cancer screening. Nature. 2020 Jan;577(7788):89-94.
(A) A sample cancer case that was missed by all six readers in the US reader study, but correctly identified by the AI system.
(B) A sample cancer case that was caught by all six readers in the US reader study, but missed by the AI system.
7
Multi-level need
8
Multi-level need – Existing tools
9
Python metamodel
Famix
10
Notebook metamodel
14
Rule engine
Moose Critic
15
Rules example
Python – Keep the code clean
Context: all code cells; all imports
Condition: is reimported
Notebook – Enforce a modular design
Context: all cells
Condition: lines < 50
ML – Type inference error
Context: all code cells; read_csv() invocations
Condition: presence of required parameters
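The context/condition split above can be sketched in plain Python. This is only an illustration of the idea, not the Moose Critic rule engine: the `Cell` and `Rule` classes, the rule names and the helper functions are hypothetical. Each rule's predicate is written here as `violates`, so it is true when an entity breaks the rule, and `dtype` is assumed to stand in for the "required parameters" of `read_csv()`.

```python
import ast
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Cell:
    kind: str    # "code" or "markdown"
    source: str


@dataclass
class Rule:
    name: str
    context: Callable[[List[Cell]], List[Any]]  # selects the entities to inspect
    violates: Callable[[Any], bool]             # True when an entity breaks the rule


def code_trees(cells):
    """Parse every code cell into a Python AST."""
    return [ast.parse(c.source) for c in cells if c.kind == "code"]


def all_imports(cells):
    """Context: (module name, already imported earlier?) for each `import` statement."""
    seen, found = set(), []
    for tree in code_trees(cells):
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    found.append((alias.name, alias.name in seen))
                    seen.add(alias.name)
    return found


def read_csv_calls(cells):
    """Context: every call to a function or method named read_csv."""
    calls = []
    for tree in code_trees(cells):
        for node in ast.walk(tree):
            if isinstance(node, ast.Call):
                name = getattr(node.func, "attr", getattr(node.func, "id", None))
                if name == "read_csv":
                    calls.append(node)
    return calls


RULES = [
    Rule("reimport", all_imports, lambda item: item[1]),
    Rule("long-cell", lambda cells: cells,
         lambda cell: len(cell.source.splitlines()) >= 50),
    # Assumption: `dtype` stands in for the "required parameters" of read_csv.
    Rule("read_csv-without-dtype", read_csv_calls,
         lambda call: all(kw.arg != "dtype" for kw in call.keywords)),
]


def run(rules, cells):
    """Return one rule name per violating entity, in rule order."""
    return [r.name for r in rules for entity in r.context(cells) if r.violates(entity)]


cells = [
    Cell("markdown", "# Load data"),
    Cell("code", "import pandas as pd\ndf = pd.read_csv('data.csv')"),
    Cell("code", "import pandas as pd\ndf2 = pd.read_csv('other.csv', dtype=str)"),
]
print(run(RULES, cells))  # ['reimport', 'read_csv-without-dtype']
```

Keeping the context (which entities to look at) separate from the condition (what to check on each one) lets the same context be reused across several rules.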
16
Python rules – Literature extraction
Most common ML code violations detected by Pylint
(Chart of Pylint message categories: Error, Convention, Warning, Refactor)
17
Python rules – Literature extraction
Van Oort, B., Cruz, L., Aniche, M., & Van Deursen, A. (2021, May). The prevalence of code smells in machine learning projects. In 2021 IEEE/ACM 1st Workshop on AI Engineering - Software Engineering for AI (WAIN) (pp. 1-8). IEEE.
Siddik, M. S., & Bezemer, C. P. (2023, October). Do Code Quality and Style Issues Differ Across (Non-) Machine Learning Notebooks? Yes! In 2023 IEEE 23rd International Working Conference on Source Code Analysis and Manipulation (SCAM) (pp. 72-83). IEEE.
(Chart of Pylint message categories: Convention, Warning, Refactor)
20
Python rules
Script Notebook
a = 1 + 1
a
pointless-statement / W0104
Pylint message emitted when a statement doesn't have (or at least doesn't seem to have) any effect.
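The Script/Notebook contrast presumably comes from the fact that a bare expression at the end of a notebook cell is displayed as the cell's output, so it is not pointless there. Below is a minimal sketch of a notebook-aware check using the standard `ast` module. It is a simplified approximation, not Pylint's actual W0104 logic, and the `pointless_statements` function and its `is_notebook_cell` flag are hypothetical names.

```python
import ast


def pointless_statements(source, is_notebook_cell=False):
    """Return line numbers of top-level bare expressions whose value is discarded.

    Calls are skipped because they may have side effects, and string
    constants are skipped because they may be docstrings. In a notebook
    cell the last statement is exempt, since the notebook displays its value.
    """
    body = ast.parse(source).body
    flagged = []
    for index, stmt in enumerate(body):
        if not isinstance(stmt, ast.Expr):
            continue
        if isinstance(stmt.value, ast.Call):
            continue
        if isinstance(stmt.value, ast.Constant) and isinstance(stmt.value.value, str):
            continue
        if is_notebook_cell and index == len(body) - 1:
            continue
        flagged.append(stmt.lineno)
    return flagged


code = "a = 1 + 1\na"
print(pointless_statements(code))                         # [2]
print(pointless_statements(code, is_notebook_cell=True))  # []
```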
21
Python rules
22
Notebook rules
Derived from:
- best practices
- existing tools
23
ML rules
### Scikit-Learn
from sklearn.cluster import KMeans
- kmeans = KMeans()
+ kmeans = KMeans(n_clusters=8, random_state=0)
Hyperparameter not Explicitly Set
[1] :
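### Scikit-Learn
from sklearn.model_selection import KFold
+ rng = 0
- kf = KFold(random_state=None)
+ # shuffle=True is needed: recent scikit-learn versions reject a random_state when shuffle=False
+ kf = KFold(shuffle=True, random_state=rng)
Randomness Uncontrolled
[1] :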
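A rough sketch of how the two ML rules above could be checked statically with the standard `ast` module. This is not the Vespucci implementation: the `RANDOMISED` set and the `check_ml_calls` function are hypothetical, and the sketch only recognises constructors called by their bare name (as after a `from ... import ...`).

```python
import ast

# Assumption: a hand-picked set of class names that accept a random_state argument.
RANDOMISED = {"KMeans", "KFold", "RandomForestClassifier"}


def check_ml_calls(source):
    """Return sorted (line, message) pairs for constructor calls that look under-specified."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            continue
        name = node.func.id
        if name not in RANDOMISED:
            continue
        keywords = {kw.arg: kw.value for kw in node.keywords}
        # Rule 1: the call relies entirely on default hyperparameters.
        if not node.args and not keywords:
            findings.append((node.lineno, f"{name}: hyperparameters not explicitly set"))
        # Rule 2: random_state is missing or literally None.
        seed = keywords.get("random_state")
        if seed is None or (isinstance(seed, ast.Constant) and seed.value is None):
            findings.append((node.lineno, f"{name}: randomness uncontrolled"))
    return sorted(findings)


before = "kmeans = KMeans()\nkf = KFold(random_state=None)"
after = "kmeans = KMeans(n_clusters=8, random_state=0)\nkf = KFold(shuffle=True, random_state=0)"

for line, message in check_ml_calls(before):
    print(line, message)
print(check_ml_calls(after))  # []
```

The analysed snippets are only parsed, never executed, so the check runs without scikit-learn installed.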
24
ML rules
25
Vespucci linter process overview
26
Results
- 24 rules
- 3 levels: Python, Notebook, ML
- Analysis of 5000 notebooks
30
Future work (short term)
- Linter in notebooks/IDE using LSP
- MoTion usage for complex context queries
- Semantic rules
31
Future work (long term)
33
Conclusion
Marius Mignard : marius.mignard@inria.fr
