Evaluation Challenges in Using
Generative AI for Science &
Technical Content
Prof. Paul Groth | @pgroth | pgroth.com | indelab.org
Thanks to Bradley Allen, Fina Polat, Xue Li, Daniel Daza
SemTech4STLD Workshop - ESWC 2025
Outline
• A use case & where we are today
• The challenges of evaluation for information extraction and knowledge
graph construction
• Some routes forward & maybe a bold idea
Using AI to
Study Standards
• Provenance working group:
• 8820 public emails
• 666 issues
• 600 wiki pages
• 6000 Mercurial commits
• 152 teleconferences
Standards are hard
The rationale of PROV. L Moreau, P Groth, J Cheney, T Lebo, S Miles.
Web Semantics: Science, Services and Agents on the World Wide Web 35, 235-257
Standards are digital
Standard development leaves digital traces
New tools to analyze standards development
https://github.com/glasgow-ipl/ietfdata
https://github.com/datactive/bigbang
Nick Doty et al. https://github.com/IETF-Hackathon/ietf111-project-presentations/blob/main/ietf111-hackathon-bigbang.pdf
Questions one might like to ask
• Understand the content of email messages and their rhetorical
structure. (e.g. arguments were put forward but constantly ignored)
• Recover technical considerations and rationales behind the choices
made and ultimately documented in a standard
• More fine-grained quantitative and qualitative analysis
From: Michael Welzl, Stephan Oepen, Cezary Jaskula, Carsten Griwodz, and Safiqul Islam. 2021. Collaboration
in the IETF: an initial analysis of two decades in email discussions. SIGCOMM Comput. Commun. Rev. 51, 3
(July 2021), 29–32. DOI:https://doi.org/10.1145/3477482.3477488
Example uses of AI for standards analysis
From EUROCAE ED 133: FLIGHT OBJECT INTEROPERABILITY SPECIFICATION
Recognising entities in conversations
Predicting the success of a standard
Stephen McQuistin, Mladen Karan, Prashant Khare, Colin Perkins, Gareth Tyson, Matthew Purver, Patrick Healey, Waleed Iqbal, Junaid Qadir, and
Ignacio Castro. 2021. Characterising the IETF through the lens of RFC deployment. In Proceedings of the 21st ACM Internet Measurement
Conference (IMC '21). Association for Computing Machinery, New York, NY, USA, 137–149. DOI:https://doi.org/10.1145/3487552.3487821
Intelligent Interventions
Develop new natural language processing and machine learning
techniques to understand what’s going on within standards
development:
• How are people, organizations, topics, documents, priorities,
requirements, etc… connected?
• What are people and standards actually talking about?
Based on this understanding, develop intelligent tools to better
integrate public values.
Challenges in using AI for Standards Analysis
Email threads
https://lists.w3.org/Archives/Public/
● Long form conversations;
● Change of speaker;
● Lexical ambiguity;
● Specialized domain;
● Informal structures;
● Extensions across sessions;
● Lack of annotated data
● Complex entities
● Multiple perspectives
● Dynamic analyses
Xue Li, Sara Magliacane, and Paul Groth. 2021. The Challenges of Cross-Document Coreference
Resolution for Email. In Proceedings of the 11th on Knowledge Capture Conference
(K-CAP '21). Association for Computing Machinery, New York, NY, USA, 273–276.
DOI:https://doi.org/10.1145/3460210.3493573
Methods for building databases
of information from standards conversations
1 – knowledge graphs
Decoder-only representative large language models.
Source: S. Pan et al., Unifying Large Language Models and Knowledge Graphs: A Roadmap
https://arxiv.org/abs/2306.08302
LLMs and Generative AI
- Sustainability
- Security / Resilience
- Connecting the Unconnected
Evaluation
Challenges
The tale of SlotGAN
Daniel Daza, Michael Cochez, and Paul Groth. 2022. SlotGAN: Detecting Mentions in Text via Adversarial Distant
Learning. In Proceedings of the Sixth Workshop on Structured Prediction for NLP, pages 32–39, Dublin, Ireland.
Association for Computational Linguistics.
Relation Extraction & Instruction Tuning
Do Instruction-tuned Large Language Models Help with Relation Extraction?
Xue Li, Fina Polat and Paul Groth. LM-AKBC Workshop at ISWC 2023
https://ceur-ws.org/Vol-3577/paper15.pdf
Results on REBEL dataset
Results on Post-Hoc Human Eval
Can we preserve relation extraction performance
while preserving in-context capabilities?
Method: Instruction-tune the Dolly LLM with
LoRA using a relation extraction dataset
(REBEL)
▫ Prompt Engineering techniques:
▿ Zero-shot, one-shot, few-shot
▿ RAG - Retrieval Augmented Generation
▿ CoT - Chain of Thought
▿ CoT self consistency
▿ ReAct - Reasoning (e.g. chain-of-thought prompting) and Acting
(e.g. action plan generation)
▫ Polat F, Tiddi I, Groth P. Testing prompt engineering methods for knowledge
extraction from text. Semantic Web. 2025;16(2). doi:10.3233/SW-243719
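To make the first and fourth techniques in the list above concrete, here is a minimal sketch of how zero-shot, one-shot and few-shot prompts differ only in the number of worked examples, and how self-consistency can be approximated by voting over several sampled outputs. The prompt wording and the names `build_prompt` and `self_consistent` are illustrative assumptions, not the prompts or code used in the cited study.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]
Example = Tuple[str, List[Triple]]  # (input text, expected triples)


def format_triples(triples: Sequence[Triple]) -> str:
    """Render triples as '(s, r, o); (s, r, o)'."""
    return "; ".join(f"({s}, {r}, {o})" for s, r, o in triples)


def build_prompt(text: str, examples: Sequence[Example] = ()) -> str:
    """Zero-shot when `examples` is empty, one-/few-shot otherwise.

    The instruction wording is a hypothetical placeholder.
    """
    lines = ["Extract (subject, relation, object) triples from the text."]
    for ex_text, ex_triples in examples:
        lines.append(f"Text: {ex_text}")
        lines.append(f"Triples: {format_triples(ex_triples)}")
    lines.append(f"Text: {text}")
    lines.append("Triples:")
    return "\n".join(lines)


def self_consistent(samples: Sequence[Sequence[Triple]], min_votes: int) -> List[Triple]:
    """Keep triples that appear in at least `min_votes` sampled outputs."""
    votes = Counter(t for sample in samples for t in set(sample))
    return sorted(t for t, n in votes.items() if n >= min_votes)


if __name__ == "__main__":
    demo = [("Amsterdam is in the Netherlands.",
             [("Amsterdam", "country", "Netherlands")])]
    print(build_prompt("PROV is a W3C standard.", demo))
```

Keeping the example list as a parameter means the same test text can be run under each prompting condition, which is what a like-for-like comparison of techniques needs.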
Test and compare Prompt Engineering for Knowledge Extraction
Open Information Extraction
Performance on RED-FM
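Benchmark results for relation extraction are commonly reported as precision, recall and F1 over extracted triples. The sketch below assumes exact-match scoring against a gold set; the function name `triple_prf` is hypothetical, and the scoring protocol behind the results on this slide may differ.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]


def triple_prf(predicted: Iterable[Triple], gold: Iterable[Triple]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over sets of triples."""
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = [("PROV", "publisher", "W3C"),
            ("PROV", "instance of", "standard"),
            ("W3C", "instance of", "organization")]
    predicted = [("PROV", "publisher", "W3C"),
                 ("PROV", "developer", "W3C")]  # plausible, but not in gold
    print(triple_prf(predicted, gold))
```

Under exact matching, a correct triple that the annotators did not record counts as a false positive, which may be one reason automatic scores and post-hoc human judgments can diverge.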
Ontology Based Triple Assessment
Ontology Based
Assessment
Impressions
• Results appear to be really good qualitatively
• Annotation quality is varied
• Challenges in agreement
• Large scale is often automated
• Is everything in domain?
Routes
Forward
More complex tasks
User studies
E Papadopoulou. Retrieval Augmented Generation of Tabular Answers at Query
Time using Pre-trained Large Language Models. (2023)
https://scripties.uba.uva.nl/search?id=record_53599
LLMs as judges
Zhen Li, Xiaohan Xu, Tao Shen, Can Xu, Jia-Chen Gu, Yuxuan Lai, Chongyang Tao, and Shuai Ma. 2024.
Leveraging Large Language Models for NLG Evaluation: Advances and Challenges. In Proceedings of
the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16028–16045,
Miami, Florida, USA. Association for Computational Linguistics.
LLMs as judges
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan
Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin,
Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang,
Joseph E. Gonzalez, and Ion Stoica. 2023. Judging
LLM-as-a-judge with MT-bench and Chatbot Arena. In
Proceedings of the 37th International Conference on
Neural Information Processing Systems (NIPS '23).
Curran Associates Inc., Red Hook, NY, USA, Article
2020, 46595–46623.
Agreement
Problem statement
• We will focus on how LLMs can be used to
support the evaluation of class membership
relations in a KG
• Class membership represents
classification schemes
• Classification schemes
• Crucial to knowledge infrastructures
• Implications for social policy and scientific
consensus
• Class membership is important for data
governance
• "providing a set of mappings from a
representation language to agreed-upon
concepts in the real world" [Khatri and Brown]
Allen, B.P., Groth, P.T. (2025). Evaluating Class Membership Relations in Knowledge Graphs Using
Large Language Models. In: Meroño Peñuela, A., et al. The Semantic Web: ESWC 2024 Satellite
Events. ESWC 2024. Lecture Notes in Computer Science, vol 15344. Springer, Cham.
https://doi.org/10.1007/978-3-031-78952-6_2
Class membership relation evaluation
by an LLM
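[Figure: an LLM, pre-trained on a natural language corpus C of domain knowledge, is given class membership triples (e, instance-of, c) sampled from a knowledge graph G and outputs a decision. The diagram also carries a formula fragment: "= arg max L(T | (e, instance-of, o))".]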
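The flow in this diagram can be sketched as a small classifier that asks a language model whether an entity belongs to a class and records its rationale. Everything named here is an assumption made for illustration: the prompt wording, the `Judgement` type, the `classify_membership` function, and the `llm` callable, which stands in for whatever model client is used and is replaced by a stub so the example runs offline.

```python
from typing import Callable, NamedTuple


class Judgement(NamedTuple):
    positive: bool   # True if the model accepts (e, instance-of, c)
    rationale: str   # the model's stated reason


def classify_membership(entity_description: str,
                        class_definition: str,
                        llm: Callable[[str], str]) -> Judgement:
    """Ask `llm` whether the described entity is an instance of the class.

    The reply is assumed to hold a verdict on the first line
    ('positive' or 'negative') followed by a free-text rationale.
    """
    prompt = (
        "Class definition: " + class_definition + "\n"
        "Entity description: " + entity_description + "\n"
        "Is the entity an instance of the class? Answer 'positive' or "
        "'negative' on the first line, then give a rationale."
    )
    verdict, _, rationale = llm(prompt).strip().partition("\n")
    return Judgement(verdict.strip().lower().startswith("positive"),
                     rationale.strip())


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return "positive\nThe description states that the entity is a river."


if __name__ == "__main__":
    result = classify_membership(
        "The Amstel is a river in the Netherlands.",
        "A river is a natural flowing watercourse.",
        stub_llm,
    )
    print(result.positive, "-", result.rationale)
```

Passing the model in as a callable keeps the decision logic testable without network access and makes it straightforward to compare several models on the same sampled triples.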
Performance metrics
• Classifiers can exhibit good alignment with KGs (Q1)
• One LLM was in moderate agreement (κ > 0.60) with Wikidata
• Four were in moderate agreement with CaLiGraph
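Agreement figures like the κ above are Cohen's kappa: observed agreement corrected for the agreement expected by chance. The sketch below computes it for two label sequences, for example the KG's membership labels and a classifier's decisions; the function name and the toy labels are illustrative and are not data from the study.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items labelled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label]
              for label in set(freq_a) | set(freq_b)) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    kg_labels = [1, 1, 1, 1, 0, 0, 0, 0]
    llm_labels = [1, 1, 1, 0, 0, 0, 0, 1]
    print(cohen_kappa(kg_labels, llm_labels))
```

In the toy data the raters agree on 6 of 8 items (p_o = 0.75) and chance agreement is 0.5, so κ = 0.5, which shows how a raw agreement rate can look much better than the chance-corrected figure.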
Error analysis results
• Error analysis based on review by one of the authors
• Reviewed FNs and FPs with their rationales and assigned each error to the LLM or the KG
• LLM errors: incorrect reasoning, missing data
• KG errors: missing relation, incorrect relation
• Error analysis performed for gpt-4-0125-preview
• Classifiers can detect missing or incorrect relations (Q2)
• 40.9% of errors were due to problems with the KG
• 29.1% of errors were due to missing or insufficient data in the entity description
• 30.0% of errors were due to incorrect reasoning by the LLM
• Pairwise human-KG and human-LLM agreement differed between the KGs
• Human showed fair agreement with Wikidata and no agreement with the classifier
• Human showed slight agreement with the classifier and no agreement with CaLiGraph
Agents as Peers
• Rationales
• Based on provenance and
evidence
• Consensus formation
• Encoding consensus as sharable
knowledge (graphs)
Conclusion
• Gen AI allows for impressive capabilities for Scientific & Legal Content
• How do we know the results are good?
• Standard evaluations
• Approaches: complex tasks, user feedback, LLMs as judges
• consensus among peers - science!
Paul Groth | @pgroth | pgroth.com | indelab.org
