Talk from IWST 2025: Directing Generative AI for Pharo Documentation
PDF: https://archive.esug.org/ESUG2025/iwst-day2/iwst-208-zara-directing-generative-ai-pharo-docs.pdf
Directing Generative AI
for Pharo Documentation
How can we effectively use AI to help us write documentation?
Pascal Zaragoza
Nicolas Hlad
ESUG 2025
Context: Documentation in Pharo 12
Why documentation matters
§ ~58% of a developer's time is spent on code comprehension [1].
§ Bad documentation = more time lost.
§ Good documentation = less time lost.
Code documentation in Pharo
§ Package-, class- and method-level comments.
§ Class-Responsibility-Collaborator (CRC) definitions.
Problem
Package documentation
§ Only 16.7% of packages have comments.
§ 81.1% of classes have comments.
§ 41.9% of methods have comments.
§ Most package comments are very short (60.3% < 100 characters).
Conclusion: there is a strong need for improved and scalable documentation practices in Pharo.
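Class- and method-level coverage figures like the ones above can be reproduced with Pharo's reflective API. A minimal sketch; package-comment APIs vary across Pharo versions, so only class and method coverage are counted here, and the AST-based method check is approximate:

    "Count comment coverage over all classes and methods in the image.
     The AST check sees any comment in a method, not only a leading one."
    | classes methods |
    classes := Smalltalk globals allClasses.
    methods := classes flatCollect: [ :c | c methods ].
    Transcript
        show: 'Classes with comments: ';
        show: ((classes count: [ :c | c hasComment ]) / classes size * 100.0) printString; cr;
        show: 'Methods with comments: ';
        show: ((methods count: [ :m | m ast comments isNotEmpty ]) / methods size * 100.0) printString; cr.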
Overview of the Comment Generation Approach
Goal: improve Pharo package documentation using LLMs.
Method: Retrieval-Augmented Generation (RAG).
Focus: evaluate how different information sources affect the quality of the generated comments.
3-step process (sketched below):
§ Generate a model representation of the package (using Moose).
§ Extract/retrieve the relevant data from the model.
§ Generate the comment via the LLM.
https://github.com/pzaragoza93/AutoCodeDocumentator
[Figure: pipeline overview. 1) model generation with Moose; 2) extraction strategy; 3) comment generation from the assembled prompts by the LLM (mistral-small-2503).]
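As a shape for the whole pipeline, a hypothetical driver in Pharo; PackageDocumentator and its three selectors are illustrative names for the steps, not the actual AutoCodeDocumentator API:

    "Hypothetical 3-step driver; class and selector names are illustrative only."
    | model context comment |
    model := PackageDocumentator buildModelFor: 'MyPackage'.           "1) Moose model generation"
    context := PackageDocumentator extract: #commentBased from: model. "2) extraction strategy"
    comment := PackageDocumentator generateCommentFrom: context.       "3) LLM comment generation"
    Transcript show: comment; cr.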
Overview of the Comment Generation Approach
[Figure: the assembled prompt is sent to the LLM service (mistral-small-2503), which returns the generated comment.]
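The slides do not spell out the LLM call itself. One plausible shape, using Pharo's Zinc HTTP client against an OpenAI-style chat endpoint; the URL, token, and JSON payload shape are assumptions, not part of the talk:

    "Send a prompt to a chat-completion endpoint; URL and payload shape are assumed."
    | payload response |
    payload := '{"model": "mistral-small-2503",
      "messages": [{"role": "user", "content": "Write a CRC-style package comment."}]}'.
    response := ZnClient new
        url: 'https://llm.example.org/v1/chat/completions'; "hypothetical endpoint"
        headerAt: 'Authorization' put: 'Bearer <token>';
        entity: (ZnEntity with: payload type: ZnMimeType applicationJson);
        post.
    Transcript show: response; cr.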
Strategy 1 – Naive Extraction
Input: full source code of each class (.st files).
Process:
§ Summarize each class's responsibilities, collaborators, and key implementation details.
§ Use the LLM to generate a CRC-based package comment from the class summaries.
Pros:
§ Rich context.
§ Can infer detailed responsibilities and interactions.
Cons:
§ Risk of hallucinations (e.g., non-existent classes).
§ Computationally expensive due to the large context size.
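The talk gathers sources through the Moose model; as a rough plain-Pharo stand-in, the same per-class input can be collected reflectively (the package name is only an example):

    "Collect the full source of every class in a package: the raw input of Strategy 1."
    | package sources |
    package := PackageOrganizer default packageNamed: 'Zinc-HTTP'. "example package"
    sources := package definedClasses collect: [ :cls |
        String streamContents: [ :out |
            out nextPutAll: cls definitionString; cr.
            cls methods do: [ :m | out nextPutAll: m sourceCode; cr; cr ] ] ].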
Strategy 2 – Comment-Based Extraction
Input: existing class comments only.
Process:
§ Aggregate the class comments.
§ Generate the package-level CRC comment with the LLM.
Pros:
§ Leverages human-authored summaries.
§ Lower risk of hallucination.
Cons:
§ Limited by comment coverage (incomplete or missing comments).
§ Misses undocumented class behaviors and dependencies.
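A plain-Pharo sketch of the comment aggregation, again bypassing the Moose model for brevity (package name is illustrative):

    "Aggregate existing class comments: the raw input of Strategy 2."
    | package context |
    package := PackageOrganizer default packageNamed: 'Zinc-HTTP'. "example package"
    context := String streamContents: [ :out |
        (package definedClasses select: [ :cls | cls hasComment ]) do: [ :cls |
            out nextPutAll: cls name; nextPutAll: ': '; nextPutAll: cls comment; cr ] ].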
Strategy 3 – Comment & Outgoing Reference Extraction
Input: class comments + method-level outgoing references.
Process:
§ Extract collaborators through reference analysis.
§ Combine them with the existing class comments for CRC-based comment generation.
Pros:
§ Balances authored insights with structural dependency data.
§ Better captures inter-class collaboration context.
Cons:
§ Depends on the accuracy of reference extraction and structure parsing.
§ Limited by comment coverage (incomplete or missing comments).
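Outgoing references can be approximated in plain Pharo from compiled-method literals; this is a coarse stand-in for the model-based reference analysis (Famix reference queries would be more precise, and binding protocols differ between Pharo versions):

    "Approximate each class's outgoing references via its methods' literals."
    | package collaborators |
    package := PackageOrganizer default packageNamed: 'Zinc-HTTP'. "example package"
    collaborators := Dictionary new.
    package definedClasses do: [ :cls |
        | refs |
        refs := Set new.
        cls methods do: [ :m |
            m literals do: [ :lit |
                (lit isVariableBinding and: [ lit value isClass ])
                    ifTrue: [ refs add: lit value name ] ] ].
        collaborators at: cls name put: refs ].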
Experimentation
Purpose: assess the impact of the different LLM strategies on package comment generation.
Strategies tested:
§ Naive (source-code based)
§ Comment-based
§ Comment + dependency-based
Focus: identify strengths and weaknesses across strategies.
Research questions
§ RQ1: Impact on CRC structure quality?
§ RQ2: Accuracy of responsibility descriptions?
§ RQ3: Accuracy of collaborator descriptions?
§ RQ4: Overall quality vs. original comments?
§ RQ5: Effect of package size on comment quality?
Evaluation Dataset
Dataset: 21 Pharo packages
§ Grouped by size: small, medium, large (7 each).
Filtering:
§ Only packages with existing comments were included.
§ Test and baseline packages were excluded.
Each package was evaluated with all 3 strategies → 63 generated comments.
Large language model: mistral-small-2503
§ Apache 2.0 license
[Figure: evaluation pipeline. Packages 1…N are filtered down to packages 1…21; comment generation (LLM: mistral-small-2503) produces, per package, one comment for each of strategies 1–3; comment evaluation then yields evaluations 1…63.]
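The 63 comments follow directly from the design (21 packages × 3 strategies). A sketch of the generation loop, where packagesAfterFiltering and generate:using: are hypothetical stand-ins for the pipeline entry points:

    "21 packages x 3 strategies = 63 generated comments; selectors are illustrative."
    | packages comments |
    packages := self packagesAfterFiltering. "hypothetical accessor for the 21 packages"
    comments := OrderedCollection new.
    packages do: [ :pkg |
        #(naive commentBased commentAndReferences) do: [ :strategy |
            comments add: (AutoCodeDocumentator generate: pkg using: strategy) ] ].
    self assert: comments size = 63.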
Evaluation Method
Review process:
§ 6 Pharo users in 3 groups.
§ Each user reviewed 7 packages and their 3 generated comments (21 comments per group).
Manual scoring used 12 questions across 4 categories (3 questions per category):
§ CRC Structure (RQ1)
§ Responsibility Accuracy (RQ2)
§ Collaborator Accuracy (RQ3)
§ Comparison to Original (RQ4)
Scale: 7-point Likert (strongly disagree to strongly agree).
Table 1: List of questions, their category, and the question ID used in the questionnaire.
https://github.com/pzaragoza93/label-studio-pharo-evaluation
Results regarding RQ1–4
Comparison between strategies across the 12 statements:
§ No strategy offers significantly better results (RQ1, 2, 3, 4).
§ All strategies generate comments that are preferred over the existing comments.
Table 2: Average Likert score for each question across all 3 strategies.
Results regarding RQ5
Comparison of results between package sizes (small, medium, large):
§ Overall, small packages receive higher scores.
§ Small packages get clearer comments.
§ For small packages, collaborators are well mentioned and no key collaborators are missed.
§ Comments generated for small packages are rated more useful than the existing comments.
Table 3: Average Likert score for each question across the three package sizes.
Conclusion, Limitations, and Future Directions
Limitations
§ Limited number of evaluations per comment.
§ Needs more work on prompt tuning and document structure.
§ Weak solution for identifying collaborators.
Conclusions
§ Generated comments are more complete, clear, and useful than some human-written comments.
→ Perhaps use them when no comment exists?
Future Directions
§ Use heuristics to identify collaborators and GenAI to describe these collaborations.
§ Adapt to existing dynamic comment features (e.g., examples).
§ Automate a pipeline for comment suggestions in existing Pharo projects.