Co-Constructing Explanations
for AI Systems using Provenance
Prof. Paul Groth | @pgroth | pgroth.com | indelab.org
Thanks to Jan-Christoph Kalo, Fina Polat, Shubha Guha, Enrico Daga
XAI-KG Workshop - ESWC 2025
Explanations increasingly required for AI systems
“While the explanation process involves the assimilation of knowledge,
it also transforms the learner’s knowledge and is thus an
accommodative process.
Thus, rather than referring to "explanations" (and assuming that the
property of being an explanation is a property of statements), it might
be prudent to refer to explaining, and regard explaining as an active
exploration process.”
Explanation by Exploration or Self Explanation
S.T. Mueller, R.R. Hoffman, W. Clancey, A. Emrey and G. Klein. Explanation in Human-AI Systems: A Literature Meta-Review, Synopsis of Key Ideas and Publications, and Bibliography for Explainable AI. arXiv, 2019. https://arxiv.org/abs/1902.01876
“The process of explaining, and human-human communication
broadly, is a co-adaptive ‘tuning’ process, which requires that
the explainer and learner have a capacity to take each other's
perspective.”
A collaborative and co-adaptive process
S.T. Mueller, R.R. Hoffman, W. Clancey, A. Emrey and G. Klein. Explanation in Human-AI Systems: A Literature Meta-Review, Synopsis of Key Ideas and Publications, and Bibliography for Explainable AI. arXiv, 2019. https://arxiv.org/abs/1902.01876
Complex AI Systems
Trace-Based Explanation
Shruthi Chari, Daniel M Gruen, Oshani Seneviratne, and Deborah L McGuinness. 2020. Directions for explainable knowledge-enabled systems. In Knowledge Graphs for eXplainable Artificial Intelligence: Foundations, Applications and Challenges. IOS Press, 245–261.
Provenance is all you need!
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import label_binarize
from scikeras.wrappers import KerasClassifier  # or keras' scikit-learn wrapper

def load_data(...):
    input1 = pd.read_csv(...)
    input1 = input1[input1['attr'] > 10]      # selection: σ attr>10
    input2 = pd.read_csv(...)
    return input1.join(input2, ...)           # join: ⋈ id=fk_id

def featurise(...):
    return ColumnTransformer([
        ('categorical', ..., ...),            # e.g. one-hot encoding
        ('numerical', ..., ...)])             # e.g. scaling

all_data = load_data(...)
train = all_data[all_data['date'] < ...]      # temporal train/test split
test = all_data[all_data['date'] >= ...]
y_train = label_binarize(train['label'], classes=...)
y_test = label_binarize(test['label'], classes=...)

pipeline = Pipeline([
    ('features', featurise(...)),
    ('learner', KerasClassifier(...))])
model = pipeline.fit(train, y_train)
quality = model.score(test, y_test)
[Figure, three panels: (1) the user-defined ML pipeline above; (2) the DAG representation extracted from it, with relational operators (σattr>10, σdate<…, σdate>=…, πattr, πval, πlabel, ⋈id=fk_id) over inputs D1 and D2 feeding feature operators (1-hot, scale, binarize, concat) into FitClassifier and Score, producing Xtrain, ytrain, Xtest, ytest; (3) the materialised artifacts with their provenance, where each matrix row is annotated with the set of (input, row-id) pairs it derives from, e.g. {(1,1), (2,3)}.]
Grafberger, et al. Data distribution debugging in ML pipelines. VLDBJ (2022).
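To make the provenance annotations in panel (3) concrete, here is a toy sketch (not the mlinspect implementation from the paper above) of how row-level provenance can be propagated through a selection and a join; the dataframes, column names, and values are invented:

import pandas as pd

# Hypothetical inputs; each row is tagged with a (dataset, row) provenance id.
input1 = pd.DataFrame({'id': [1, 2], 'attr': [5, 20]})
input1['prov'] = [{('D1', i)} for i in input1.index]
input2 = pd.DataFrame({'fk_id': [1, 2], 'val': [0.1, 0.2]})
input2['prov'] = [{('D2', i)} for i in input2.index]

# A selection keeps the provenance of surviving rows unchanged.
filtered = input1[input1['attr'] > 10]

# A join unions the provenance sets of the rows it combines.
joined = filtered.merge(input2, left_on='id', right_on='fk_id',
                        suffixes=('_1', '_2'))
joined['prov'] = [a | b for a, b in zip(joined['prov_1'], joined['prov_2'])]

print(joined[['id', 'attr', 'val', 'prov']])
# the surviving row derives from D1 row 1 and D2 row 1: {('D1', 1), ('D2', 1)}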
Data Journeys
- Importance of the entire pipeline
- Data Journey: a multi-layered, semantic representation of a data processing activity, linked to the digital assets involved (code, components, data).
- Can we provide a compact representation?
Enrico Daga and Paul Groth. "Data journeys: Explaining AI workflows through abstraction." Semantic Web 15.4 (2024): 1057–1083.
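As a rough sketch of what a compact, multi-layered representation could look like (the activity types and graph encoding below are illustrative, not the ontology from the Daga & Groth paper):

# Hypothetical, simplified encoding of a data journey as a typed graph.
journey = {
    "nodes": {
        "load":      {"type": "Acquisition", "asset": "load_data()"},
        "filter":    {"type": "Preparation", "asset": "input1[attr > 10]"},
        "join":      {"type": "Integration", "asset": "input1.join(input2)"},
        "featurise": {"type": "Preparation", "asset": "ColumnTransformer"},
        "train":     {"type": "Modelling",   "asset": "pipeline.fit"},
        "score":     {"type": "Evaluation",  "asset": "model.score"},
    },
    "edges": [("load", "filter"), ("filter", "join"), ("join", "featurise"),
              ("featurise", "train"), ("train", "score")],
}

def abstract(journey):
    """Compact view: collapse the journey to its sequence of activity types."""
    order = [src for src, _ in journey["edges"]] + [journey["edges"][-1][1]]
    return [journey["nodes"][n]["type"] for n in order]

print(" -> ".join(abstract(journey)))
# Acquisition -> Preparation -> Integration -> Preparation -> Modelling -> Evaluation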
DAG representation of a Random Forests Python
Notebook
Example: Random Forests
Results
Initial user survey
Exploration to co-construction
Katharina J Rohlfing et al. 2020. Explanation as a social practice: toward a conceptual framework for the social design of AI systems. IEEE Transactions on Cognitive and Developmental Systems, 13, 3, 717–728.
What would it look like for this system?
Prototype
XAI tools and techniques are available
Sina Mohseni, Niloofar Zarei, and Eric D. Ragan. 2021. A Multidisciplinary Survey and Framework for Design and Evaluation of Explainable AI Systems.
ACM Trans. Interact. Intell. Syst. 11, 3–4, Article 24 (December 2021), 45 pages. https://doi.org/10.1145/3387166
XAI - Evaluation of Explanations
Argument Quality Assessment
▫ Metrics [1]:
▿ functionally-grounded - metrics that do not require human feedback and measure
properties of the explanation itself (e.g. faithfulness - how accurately the explanation
corresponds to the thing being explained; see the sketch below);
▿ human-grounded metrics - metrics that involve human participation either through
feedback, observation or proxy tasks (e.g. how interpretable is an explanation to an
end user);
▿ application-grounded - metrics that measure explanations through their usage in an
application (e.g. does the performance of the human-AI system improve on a
downstream task);
▫ Challenges:
▿ Interactivity
▿ Many personas
Evaluation of Explanations
[1] Gesina Schwalbe and Bettina Finzel. A comprehensive taxonomy for explainable artificial intelligence: a systematic survey of surveys on methods and concepts. Data Mining and Knowledge Discovery, 38(5):3043–3101, 2024.
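As a concrete toy example of a functionally-grounded metric, the sketch below computes a deletion-style faithfulness curve for a feature-attribution explanation: if the features the explanation ranks as most important really drive the prediction, removing them first should degrade the model output fastest. The linear model and its attribution are stand-ins, not a method from the surveys cited above:

import numpy as np

def deletion_faithfulness(predict, x, attribution, baseline=0.0):
    """Drop features in decreasing order of attributed importance and
    record how the model's score changes. A faster drop suggests a
    more faithful attribution."""
    order = np.argsort(-np.abs(attribution))   # most important first
    scores = [predict(x)]
    x_pert = x.copy()
    for i in order:
        x_pert[i] = baseline                   # "delete" the feature
        scores.append(predict(x_pert))
    return np.array(scores)

# Stand-in linear model and its exact attribution (weights * inputs).
w = np.array([3.0, -1.0, 0.5])
predict = lambda x: float(w @ x)
x = np.array([1.0, 2.0, 4.0])
curve = deletion_faithfulness(predict, x, attribution=w * x)
print(curve)  # [3.0, 0.0, 2.0, 0.0] - the score collapses as top features go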
Virtual Personas
Suhong Moon, Marwa Abdulhai, Minwoo Kang, Joseph
Suh, Widyadewi Soedarmadji, Eran Kohen Behar, and
David Chan. 2024. Virtual Personas for Language Models
via an Anthology of Backstories. In Proceedings of the 2024
Conference on Empirical Methods in Natural Language
Processing, pages 19864–19897, Miami, Florida, USA.
Association for Computational Linguistics.
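In the spirit of the backstory-conditioning idea above, a minimal sketch of persona-conditioned querying; the `llm` stub stands in for a real model call, and the backstory text is invented:

def llm(prompt: str) -> str:
    """Stand-in for a real model call (any chat/completions API)."""
    return "(model response)"

def ask_as_persona(backstory: str, question: str) -> str:
    # Prefix the question with a first-person backstory so the model
    # answers from that persona's perspective.
    prompt = (f"{backstory}\n\n"
              f"Answering as the person above, respond to the question below.\n"
              f"Question: {question}\nAnswer:")
    return llm(prompt)

backstory = ("I grew up in a small town and now work as a data engineer; "
             "I want to know where the numbers in my dashboards come from.")
print(ask_as_persona(backstory, "How useful is this pipeline explanation to you?"))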
Virtual Personas - Critique
Lindia Tjuatja, Valerie Chen, Tongshuang Wu, Ameet Talwalkar, Graham Neubig. Do LLMs Exhibit Human-like Response Biases? A Case Study in Survey Design. Transactions of the Association for Computational Linguistics 2024; 12: 1011–1026. doi: https://doi.org/10.1162/tacl_a_00685
LLMs as judges
Zhen Li, Xiaohan Xu, Tao Shen, Can Xu, Jia-Chen Gu, Yuxuan Lai, Chongyang Tao, and Shuai Ma. 2024.
Leveraging Large Language Models for NLG Evaluation: Advances and Challenges. In Proceedings of
the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16028–16045,
Miami, Florida, USA. Association for Computational Linguistics.
LLMs as judges
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang,
Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric
P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023.
Judging LLM-as-a-judge with MT-bench and Chatbot Arena. In
Proceedings of the 37th International Conference on Neural
Information Processing Systems (NIPS '23). Curran Associates Inc.,
Red Hook, NY, USA, Article 2020, 46595–46623.
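A minimal sketch of the pairwise judging setup in the MT-bench style; `llm` is an assumed str-to-str model call, and a real pipeline would also swap the A/B positions and reconcile the two verdicts to control for the position bias Zheng et al. report:

def pairwise_judge(llm, question, answer_a, answer_b):
    """Return 'A', 'B', or 'tie' for which answer the judge model prefers."""
    prompt = (f"Compare two answers to the question and say which is better.\n"
              f"Question: {question}\n[A]: {answer_a}\n[B]: {answer_b}\n"
              f"Reply with exactly one of: A, B, tie.")
    verdict = llm(prompt).strip()
    return verdict if verdict in {"A", "B", "tie"} else "tie"

print(pairwise_judge(lambda p: "A", "Why was row 7 dropped?",
                     "It failed the attr > 10 filter.", "No idea."))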
https://github.com/JanKalo/enexa_explanation/
1. Clarity & Structure: Does the explanation flow logically? Is it easy to follow?
2. Depth & Completeness: Does the explanation offer sufficient detail without
omitting crucial points?
3. Correctness & Fidelity: Are facts accurate, and does the explanation
remain faithful to the original query/context?
4. Relevance & Focus: Does the content stay on-topic and address user
queries directly?
5. Appropriateness for the Persona: Is the style/tone appropriate for the
user’s persona (e.g., an AI engineer, business strategist, etc.)?
6. Transparency: Does the explanation clarify its reasoning or highlight
uncertainties?
7. Engagement & Intuition: Is the conversation engaging, and does it address
the user’s interests intuitively?
Criteria
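A minimal sketch of how these criteria could drive an LLM-as-a-judge rubric (the prompt wording, 1-5 scale, and `llm` callable are assumptions; validating such judges is one of the steps forward below):

import json

CRITERIA = ["Clarity & Structure", "Depth & Completeness",
            "Correctness & Fidelity", "Relevance & Focus",
            "Appropriateness for the Persona", "Transparency",
            "Engagement & Intuition"]

def judge_explanation(llm, persona, question, explanation):
    """Score an explanation 1-5 on each criterion; `llm` is any str -> str
    call expected to return a JSON object mapping criterion to score."""
    prompt = (f"You are judging an explanation written for this persona: {persona}\n"
              f"Question: {question}\nExplanation: {explanation}\n"
              f"Score it from 1 (poor) to 5 (excellent) on each of: "
              f"{', '.join(CRITERIA)}. Reply with only a JSON object.")
    return json.loads(llm(prompt))

# Demo with a dummy judge that returns 4 for every criterion.
print(judge_explanation(lambda p: json.dumps({c: 4 for c in CRITERIA}),
                        "AI engineer", "Why did accuracy drop?",
                        "The date filter leaked test rows into training."))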
Results
Steps Forward
• A persona database
• Validation of LLM-as-a-judge
• Multi-modal provenance explanations
• Leverage detailed provenance information with visual and interactive outputs
• Interactive AI agents
• Develop agents that can illustrate, gesture to diagrams, and walk users
through data flow across AI system components in real-time
• Enhanced provenance systems:
• Implement retrieval augmented generation over provenance stores
• Integrate LLMs directly into provenance collection and preparation
processes
• Appropriate abstraction levels for user interaction
• Interactive explanation environments
Co-Constructing Explanations for AI Systems using Provenance
from https://www.dagstuhl.de/25051
Complex Use Cases & Common Evaluation Pitfalls
Use cases:
- Law
- Clinical Trials
- Science
Pitfalls:
- Missing or Incorrect Ground Truth Data
- Data Leakage
- Confirmation Bias
- Deployment mismatch
Desiderata for Evaluation Approaches
An evaluation approach should:
1. be able to evaluate such multifaceted and complex outputs
2. work when no ground truth is available
3. run in a continuous manner
4. cope with changes in outputs
5. efficiently make use of human effort
6. be readily applicable to new problems, tasks and domains with a minimal amount of effort
7. cope with variation in the tailored outputs
Defining performance in terms of explanation quality
AI System performance:
Given a set of tasks, and the corresponding outputs and explanations created by an AI system, AI system performance is the aggregation of the quality of those explanations.
Claim:
By evaluating through explanations, we cover the desiderata on the prior slide.
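One way to formalise this definition (the uniform average is an assumption; the slide only requires some aggregation): for a set of tasks T, with explanation e_t for each task t and a quality function q,

\mathrm{perf}(S) \;=\; \underset{t \in T}{\mathrm{agg}}\; q(e_t),
\qquad \text{e.g.} \qquad
\mathrm{perf}(S) \;=\; \frac{1}{|T|} \sum_{t \in T} q(e_t)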
[Architecture diagram, three phases:
- Design and Development: the objective, task, and context drive Explanation Dimension Selection and AI System Development.
- Deployment: AI System Execution produces execution results; the AI system also emits execution traces, from which Explanation Generation produces explanations.
- Evaluation: Explanation Assessment scores each explanation against the selected dimensions and metrics, yielding the evaluation result.]
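To make the flow through the three phases concrete, here is a skeletal, runnable rendering in which every function is a trivial stand-in for the corresponding box in the diagram, not a real implementation:

def select_dimensions(objective): return ["clarity", "fidelity"]
def develop_system(objective):    return lambda task: (f"out({task})", f"trace({task})")
def generate_explanation(trace):  return f"explanation from {trace}"
def assess(expl, dims):           return {d: 1.0 for d in dims}  # dummy scores

def evaluate_ai_system(objective, tasks):
    dims = select_dimensions(objective)            # Explanation Dimension Selection
    system = develop_system(objective)             # AI System Development
    scores = []
    for t in tasks:                                # Deployment
        output, trace = system(t)                  # execution results + traces
        expl = generate_explanation(trace)         # Explanation Generation
        scores.append(assess(expl, dims))          # Explanation Assessment
    return scores                                  # evaluation result

print(evaluate_ai_system("explainable pipeline", ["task1", "task2"]))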
Conclusion
• AI systems build explanations together with users. It’s a process.
• We need better ways to evaluate such processes.
• LLMs as proxy personas and LLMs as judges may allow for extensive
and reproducible evaluations of explanations.
• Explanation as evaluation for AI Systems
Paul Groth | @pgroth | pgroth.com | indelab.org