AI for Program Specifications
Shuvendu Lahiri
Research in Software Engineering (RiSE),
Microsoft Research
1
Collaborators: Sarah Fakhoury, Saikat Chakraborty, Markus Kuppe, Shan Lu, Tahina Ramananandro, Nikhil Swamy, ….
Interns: George Sakkas, Aaditya Naik, Madeline Endres, Livia Sun, ….
Trusted AI-assisted
Programming Project
UC Berkeley (May 12, 2025)
Natural Language
RFC
API Ref.
Docstring
Program
Informal + Ambiguous
Further compounded
by AI use
Software requirements are consumed to create code
2
Formal (yet operational)
Significant gap (“what” vs. “how”)
Consume
requirements
Write an
implementation
Define
requirements
Natural Language
RFC
API Ref.
Docstring
Program
Informal + Ambiguous
Formal requirements (specifications) can reduce the gap
3
Formal (yet operational)
Specifications
Enforceable by PL/FM
(tests, formal verification, refinement)
Formal (declarative)
Smaller gap (both capture “what”)
Write an
implementation
Consume
requirements
Define
requirements
User-intent formalization
User-intent formalization for programs:
Research Questions
• Problem statement:
• Can we automatically evaluate the correctness of a formal (requirement)
specification given an informal natural language intent?
• Applications:
• What new applications can be enabled by user-intent formalization for
programs?
4
User-intent formalization (UIF) for programs
• Observation
• NL2Code progress has been spurred by establishment of benchmarks and
automated metrics [MBPP (Google), HumanEval (OpenAI), …]
• Driving progress in latest large language models (LLMs) as a core task
• Uses a semantics-based metric, through tests, for evaluating correctness of generated code
(not NLP/syntactic metrics such as BLEU)
• Used in mainstream programming [GitHub Copilot, Cursor, CodeWhisperer, Tabnine, …]
5
• (Revised) Problem statement:
• Can we automatically evaluate the correctness of a formal (requirement) specification
given an informal natural language intent + a diverse set of (hidden) validation tests?
Overview
• UIF for mainstream languages (Python, Java), and use case
• Endres, Fakhoury, Chakraborty, Lahiri [FSE’24]
• UIF for verification-aware languages (Dafny, F*, Verus, …)
• Lahiri [FMCAD’24]
• Application to specification model training (for Verus) [ICLR’25]
• UIF and interactive code-generation (TiCoder)
• Fakhoury, Naik, Sakkas, Chakraborty, Lahiri [TSE’24]
• UIF for verified parser generation from RFCs
• 3DGen: Fakhoury, Kuppe, Lahiri, Ramananandro, Swamy [ICSE’25]
-----
• Loopy: AI for proof (inductive specification) generation
• Kamath, Senthilnathan, Chakraborty, Deligiannis, Lahiri, Lal, Rastogi, Roy, Sharma
[FMCAD’24]
6
UIF for mainstream languages (Python, Java)
7
Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions?
Endres, Fakhoury, Chakraborty, Lahiri FSE’24
https://github.com/microsoft/nl-2-postcond
Formal Specifications in Python
8
UIF for mainstream languages (Python, Java)
[1,2,3,2,4] -> [1,3,4]
assert all(i in return_list for i in numbers if numbers.count(i) == 1)
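As a minimal sketch (the function name and body below are assumptions for illustration, not taken from the benchmark), such a postcondition can be checked by executing it as an assertion against a reference implementation on a test input:

def remove_duplicates(numbers):
    # Hypothetical reference implementation: keep only elements that occur exactly once.
    return [x for x in numbers if numbers.count(x) == 1]

numbers = [1, 2, 3, 2, 4]
return_list = remove_duplicates(numbers)  # -> [1, 3, 4]

# LLM-generated postcondition, evaluated as an executable assertion.
assert all(i in return_list for i in numbers if numbers.count(i) == 1)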
Problem formulation (ideal)
• Given
• NL description nl for a method m
• (hidden) reference implementation I
• Generate a postcondition S of m from nl
• Evaluation metrics (intuition)
• Soundness: I satisfies S
• Completeness: S discriminates I from any buggy implementations
9
Problem formulation (based on tests)
• Given
• NL description nl for a method m
• (hidden) reference implementation I + a set of input/output tests T
• Generate a postcondition S of m from nl
• Evaluation metrics (intuition)
• Test-set Soundness: S is consistent with I for each test t in T
• Test-set Completeness: S discriminates I from any buggy implementations on some
test t in T
• Score = 0 if unsound, otherwise |buggy mutants discriminated| / |mutants|
10
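A minimal sketch of this scoring scheme (the function and parameter names are assumptions for illustration, not the paper's implementation): the specification scores 0 if it rejects the reference output on any test, and otherwise scores the fraction of buggy mutants it rejects on some test.

def score(spec, reference, mutants, tests):
    # Test-set soundness: the spec must accept the reference output on every test input.
    if not all(spec(t, reference(t)) for t in tests):
        return 0.0
    # Test-set completeness: fraction of buggy mutants rejected (discriminated) on some test.
    killed = sum(1 for m in mutants if any(not spec(t, m(t)) for t in tests))
    return killed / len(mutants)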
[Pipeline diagram] Docstring → LLM → Postconditions, evaluated against the
Reference Implementation on a Set of Test Inputs
12
RQ1: Postcondition Soundness
[Pipeline diagram] Docstring → LLM → Postconditions, evaluated against Buggy
Implementations on a Set of Test Inputs
13
RQ1: Postcondition Completeness
GPT-4 substantially better at complete specs
Common postcondition categories on
HumanEval
14
Evaluate on Defects4J dataset of real-world bugs and fixes in mature Java
projects
We use GPT-4 to generate 10 postconditions and 10 preconditions for each
buggy function.
We consider two ablations (33,600 total GPT-4 calls)
• NL + Buggy Method Code + Relevant File Context
• NL + Relevant File Context
For each, we measure:
15
Correctness
Does the spec pass the
tests on correct code?
Bug-discriminating
If it is correct, does the
spec fail any of the tests
on buggy code?
Application: Can GPT-4 generated specifications
find real-world bugs?
16
Defects4J results
Across ablations, 65 bugs (12.5% of all bugs) are plausibly caught by
generated specifications
• We manually verify a subset of bug catching conditions
Complementary to prior assertion generation approaches TOGA [Dinella, Ryan,
Mytkowicz, Lahiri, ICSE’22] and Daikon [Ernst et al. ICSE’99]
• TOGA mostly finds expected exceptional bugs. TOGA can only catch bugs during
testing, and cannot prevent bugs in production.
• Daikon specs overfit the regression tests and bug-discriminating specs are unsound
UIF for verification-aware languages
(Dafny, F*, Verus, …)
17
[Evaluating LLM-driven User-Intent Formalization for Verification-Aware Languages, Lahiri, FMCAD’24]
https://github.com/microsoft/nl-2-postcond
Motivating example
18
predicate InArray(a: array<int>, x: int)
reads a
{exists i :: 0 <= i < a.Length && a[i] == x}
method SharedElements(a: array<int>, b: array<int>) returns (result: seq<int>)
ensures forall x :: x in result ==> (InArray(a, x) && InArray(b, x))
ensures forall i, j :: 0 <= i < j < |result| ==> result[i] != result[j]
"Write a function to find the shared elements from the given two lists.“ [MBPP task#2]
GPT-4 generated
specification.
Labeled as
“strong” by authors
[Towards AI-Assisted Synthesis of Verified Dafny Methods. Misu, Lopes, Ma, Noble. FSE’24]
"test_cases": {
"test_1": "var a1 := new int[] [3, 4, 5, 6];
var a2 := new int[] [5, 7, 4, 10];
var res1 :=SharedElements(a1,a2);
//expected[4, 5];",
"test_2": “…",
"test_3": “…"
}
Hidden (not visible to spec generator)
Prior works
MBPP-DFY
[Misu, Lopes, Ma, Noble. FSE’24]
• Generates spec+code+proof in
Dafny from NL using GPT-4
• Creates 153 “verified”
examples
• Problem:
• How good are the
specifications?
• Manual/subjective (non-
automated) metric for
evaluation of specifications
19
nl2postcond
[Endres, Fakhoury, Chakraborty, Lahiri.
FSE’24]
• Creates automated metrics
for specification quality for
mainstream languages (Java,
Py) given tests and code
mutants
• Problem:
• Doesn’t work with rich
specifications (e.g.,
quantifiers, ghost variables)
in verification-aware
languages
This work
• An approach to evaluate the correctness of formal program
specifications (given tests)
• A step towards creating a benchmark for user-intent-formalization
(specification quality) for programs in Dafny
• Leverages the dataset from MBPP-DFY
• “Adapts” the metric from nl2postcond for verification-aware languages
• Outcomes
• A “benchmark” of 64/153 examples from MBPP-DFY
• Automated metric for spec evaluation based on symbolically testing
specifications
20
Our approach
• Symbolically test specifications (given tests as input/output
examples)
• Given
• A method signature
method Foo(x) returns (y) requires P(x) ensures Q(x, y)
• A set of input/output tests T
• Specification Soundness (for a test (i, o)) //Boolean metric
• {P} x := i; y := o; {Q} is valid
• Specification Completeness (for a test (i,o)) //Quantitative
metric (kill-set from mutation-testing)
• Fraction of mutants o’ of o, s.t. {P} x := i; y := o’; {Q} is not valid
21
predicate InArray(a: array<int>, x: int)
reads a
{exists i :: 0 <= i < a.Length && a[i] == x}
method SharedElements(a: array<int>, b: array<int>)
returns (result: seq<int>)
ensures forall x :: x in result ==> (InArray(a, x) &&
InArray(b, x))
ensures forall i, j :: 0 <= i < j < |result| ==>
result[i] != result[j]
method SharedElementsTest(){
var a1:= new int[] [3, 4, 5, 6];
var a2:= new int[] [5, 7, 4, 10];
var res1:=SharedElements(a1,a2);
//expected[4, 5];
}
22
"Write a function to find the shared
elements from the given two lists."
With the expected output [4, 5], the harness should verify; with a mutated output
such as [6], it should fail to verify (see the sketch below).
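A minimal sketch of how one such test is checked symbolically (an illustration of the {P} x := i; y := o; {Q} encoding from the previous slide, not the paper's exact generated harness): bind the test inputs, bind the output to the expected or mutated value, and assert the postcondition.

method SharedElementsCheck() {
  var a1 := new int[] [3, 4, 5, 6];
  var a2 := new int[] [5, 7, 4, 10];
  var res1: seq<int> := [4, 5];  // expected output; replace with a mutant such as [6] for the completeness check
  assert forall x :: x in res1 ==> (InArray(a1, x) && InArray(a2, x));
  assert forall i, j :: 0 <= i < j < |res1| ==> res1[i] != res1[j];
}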
Evaluation https://github.com/microsoft/nl-2-postcond
• Evaluate the soundness/completeness metrics against human labels
• Dataset: 153 Dafny specifications for MBPP-DFY dataset
• Implemented mutation operators for simple output values (Booleans, integers,
string, lists)
• Results
• Automated metrics achieve parity with the human labeling for most examples
• Can replace manual annotation with automated checks (given tests)
• Soundness check satisfied for 64/153 examples
• Finds instances where a “strong” specification is not complete
• Finds instances of “incorrect” specifications
23
predicate InArray(a: array<int>, x: int)
reads a
{exists i :: 0 <= i < a.Length && a[i] == x}
method SharedElements(a: array<int>, b: array<int>) returns (result: seq<int>)
ensures forall x :: x in result ==> (InArray(a, x) && InArray(b, x))
ensures forall i, j :: 0 <= i < j < |result| ==> result[i] != result[j]
method SharedElementsTest(){
var a1:= new int[] [3, 4, 5, 6];
var a2:= new int[] [5, 7, 4, 10];
var res1:=SharedElements(a1,a2);
//expected[4, 5];
}
24
"Write a function to find the shared elements from the given two lists."
GPT-4 generated.
Labeled as
“strong”
Our metric marks this as a
weak specification (wrt the test)
Changing ==> to <==> makes it a
strong invariant by our metric
Result 1: label weak specifications
2 more such examples in the paper detected automatically
Result 2: label incorrect specifications
"234": {
"task_id": "234",
"task_description": "Write a method in Dafny to find the
volume of a cube given its side length.",
"method_signature": "method CubeVolume(l:int) returns (int)",
"test_cases": {
"test_1": "var out1:= CubeVolume(3);nassert out1==27;",
"test_2": "var out2:= CubeVolume(2);nassert out2==8;",
"test_3": "var out3:= CubeVolume(5);nassert out3==25;"
}
25
Our metric marks this as an
incorrect specification (wrt the tests)
Problem: the authors introduced an error during copy-paste from Python -> Dafny!
2 more such examples in the paper detected automatically
Challenges: quantifiers, unrolling, …
• Parsing complex datatypes (e.g., 2-dimensional arrays)
• Need redundant assertions to trigger the quantifiers [DONE]
assume {:axiom} a[..a.Length] == a1[..a1.Length];
assert a[0] == a1[0] && .... && a[3] == a1[3];
• Unroll (or partially evaluate) recursive predicates based on concrete
input
• Trigger for nested quantifiers in specification predicates
• method IsNonPrime(n: int) returns (result: bool)
requires n >= 2
ensures result <==> (exists k :: 2 <= k < n && n % k == 0)
26
Alternate proposals for benchmark
• Other alternatives for automated metric
• Running the reference implementation with specifications on the tests
• Compare against a reference specification
• Verifying the specification against reference implementation
• Checking for the most precise specification (given the tests)
• Check the paper for details why they do not suffice …..
27
Application: Specification model training for Verus
28
Automated Proof Generation for Rust Code via Self-Evolution. Chen, Lu, (Shan) Lu, Gong, Yang, Li, Misu, Yu, Duan, Cheng, Yang, Lahiri,
Xie, Zhou [arXiv:2410.15756], ICLR 2025
The approach of scoring specifications given tests has been ported to Verus
for boosting synthetic (spec, code, proof) generation for model training
Problem: Fine-tune an LLM to produce high-quality Proof given <DocString, Code, Spec>
• Needs high-quality <DocString, Code, Spec, Proof> tuples
• Sub-problem: needs high-quality <DocString, Code, Spec> tuples
• Approach: Fine-tune an LLM to produce a high-quality Spec given DocString and Tests, using the
metric from the earlier slide
AutoVerus training workflow
29
[Workflow diagram] Seed: ~45K Rust programs (Code) from a code-synthesis dataset
→ model training, data augmentation, and data pruning
→ ~20K (Code, Spec) pairs → train a Specification model
AutoVerus training workflow
30
[Workflow diagram] Seed: ~45K Rust programs (Code) from a code-synthesis dataset
→ model training, data augmentation, and data pruning → ~20K (Code, Spec) pairs → train a Specification model
→ model training, data augmentation, and data pruning → ~4K (Code, Spec, Proof) examples → train a Proof model
Specification model training
for Verus
31
Automated Proof Generation for Rust Code via Self-Evolution. Chen, Lu, (Shan) Lu, Gong, Yang, Li, Misu, Yu, Duan, Cheng, Yang, Lahiri,
Xie, Zhou [arXiv:2410.15756], ICLR 2025
The approach of scoring specifications
given tests has been ported to Verus for
boosting synthetic (spec, code, proof)
generation for model training
UIF (tests as spec) can improve code generation
accuracy and developer productivity with a user
in the loop
32
[LLM-based Test-driven Interactive Code Generation: User Study and Empirical
Evaluation, Fakhoury, Naik, Sakkas, Chakraborty, Lahiri, TSE’24]
[Interactive Code Generation via Test-Driven User-Intent Formalization, Lahiri,
Naik, Sakkas, Choudhury, Veh, Musuvathi, Inala, Wang, Gao, arXiv:2208.05950]
https://github.com/microsoft/TiCoder
33
TiCoder: Interactive code generation with test-
driven user-intent formalization
Given: a set of code suggestions from an LLM for a
natural language description NL
Repeat
1. Generate checkable specifications from NL: (here
I/O tests)
2. Rank the test/specifications (using code
suggestions)
3. Query user if a (ranked) spec is consistent with
their intent
4. User responds Yes/No/Undef (and optionally
corrects output for No)
5. Prune code suggestions to conform to set of
approved tests
Present a ranked list of code and approved tests
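A minimal sketch of this interaction loop (all helper functions are passed in as parameters and are assumptions for illustration; this is not the actual TiCoder API):

def ticoder(nl, code_suggestions, generate_tests, rank_tests, ask_user, passes, max_queries=5):
    approved_tests = []
    for _ in range(max_queries):
        # Steps 1-2: generate candidate I/O tests from NL and rank them using the code suggestions.
        candidates = rank_tests(generate_tests(nl), code_suggestions)
        if not candidates:
            break
        # Steps 3-4: ask the user whether the top-ranked test matches their intent
        # ("yes", "no" with a corrected output, or "undef" to skip).
        answer, corrected = ask_user(candidates[0])
        if answer == "yes":
            approved_tests.append(candidates[0])
        elif answer == "no" and corrected is not None:
            approved_tests.append(corrected)
        # Step 5: prune code suggestions to those consistent with all approved tests.
        code_suggestions = [c for c in code_suggestions
                            if all(passes(c, t) for t in approved_tests)]
    # Present a ranked list of code suggestions together with the approved tests.
    return code_suggestions, approved_tests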
34
Intent formalization unlocks LLM utility: empirical
investigation with simulated user
• Improves pass@k numbers (with increasing user interaction)
• Provides tests for “free” (helpful for formalizing intent)
35
• Participants are significantly more likely to correctly evaluate AI generated code (40% vs 84%)
• TiCoder significantly reduces self-reported cognitive load (p = 0.007)
• TiCoder often reduces time on task across tasks, but not always (n.s.)
36
Intent formalization improves developer outcomes
Intent formalization unlocks LLM utility: user-study
Real world application of UIF: Verified Parser
Generation from RFCs
3DGen: AI-Assisted Generation of Provably Correct Binary Format Parsers. Fakhoury,
Kuppe, Lahiri, Ramananandro, Swamy [ICSE’25]
37
Background: verified parser generation from
declarative 3D specifications in Everparse
• Parsing and input validation failures: A
major root cause of software security
vulnerabilities
• ~80%, according to DARPA and MITRE
• Mostly due to handwritten parsing code
• Writing functionally correct parsers is hard
(Endianness, data dependencies, size
constraints, etc.)
• Especially disastrous in memory unsafe
languages
Safe high-performance formally verified
C (or Rust) code
Functional Specification: Data Format Description (3D)
[1] Swamy, Nikhil, et al. "Hardening attack surfaces with formally
proven binary format parsers." International Conference on
Programming Language Design and Implementation (PLDI). 2022.
[2] Ramananandro, Tahina, et al. "EverParse: Verified secure zero-
copy parsers for authenticated message formats." 28th USENIX
Security Symposium (USENIX Security 19). 2019
Problem: Automate verified parser generation through
Everparse
3DGen: AI-Assisted Generation of Verified Parsers
• Three Agent personas collaborating:
o Planner: dictates roles, orchestrates
conversation
o Domain Expert Agent: Extracts
constraints from NL or Code, provides
feedback about generated specification
o 3D Agent: translates extracted
specifications into 3D
• Implemented with AutoGen [1]
o Composable Retrieval-Augmented (RAG) agents
o GPT-4-32k model
Agent Implementation
[1] Wu, Qingyun, et al. "AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework." arXiv:2308.08155 (2023)
Automating verified parser generation with 3DGen
External
• Generated verified parsers
from RFCs
• 20 packet formats for standard
protocols
Currently being integrated
into product codebases
42
AI for Proof (Inductive Specification)
Generation
• Ranking LLM-Generated Loop Invariants for Program Verification
Chakraborty, Lahiri, Fakhoury, Musuvathi, Lal, Rastogi, Swamy, Sharma
[EMNLP’23]
• Leveraging LLMs for Program Verification. Kamath, Senthilnathan,
Chakraborty, Deligiannis, Lahiri, Lal, Rastogi, Roy, Sharma [FMCAD’24]
• Towards Neural Synthesis for SMT-Assisted Proof-Oriented
Programming Chakraborty, Ebner, Bhat, Fakhoury, Fatima, Lahiri,
Swamy [ICSE’25 Distinguished Paper]
• AutoVerus: Automated Proof Generation for Rust Code. Yang, Li,
Misu, Yao, Cui, Gong, Hawblitzel, Lahiri, Lorch, Lu, Yang, Zhou, Lu
[arXiv]
43
Loopy: Inferring invariants for program
verification (safety, termination)
44
Leveraging LLMs for Program Verification. Kamath, Senthilnathan, Chakraborty, Deligiannis, Lahiri, Lal,
Rastogi, Roy, Sharma [FMCAD’24]
Loopy technique
45
https://github.com/microsoft/loop-invariant-gen-experiments
Integrated into other tools:
AutoVerus: Automated Proof Generation for Rust Code
(arxiv.org/abs/2409.13082)
Other works in AI and verification/testing
46
• LLM and verified code translation
• LLM-Vectorizer: LLM-based Verified Loop Vectorizer. Taneja, Laird, Yan,
Musuvathi, Lahiri [CGO’25]
• QURE: AI-Assisted and Automatically Verified UDF Inlining. Siddiqui, König,
Cao, Yan, Lahiri [Sigmod '25]
• LLM and test generation
• TOGA: A Neural Method for Test Oracle Generation. Dinella, Ryan,
Mytkowicz, Lahiri [ICSE ‘22]
• CODAMOSA: Escaping Coverage Plateaus in Test Generation with Pre-
trained Large Language Models. Lemieux, Priya Inala, Lahiri, Sen [ICSE’23]
Summary
47
Trusted AI-assisted
Programming
Related research with LLM and verification/testing
• LLM and proof automation
• Ranking LLM-Generated Loop Invariants for Program Verification Chakraborty, Lahiri, Fakhoury, Musuvathi, Lal, Rastogi, Swamy, Sharma [EMNLP’23]
• Leveraging LLMs for Program Verification. Kamath, Senthilnathan, Chakraborty, Deligiannis, Lahiri, Lal, Rastogi, Roy, Sharma [FMCAD’24]
• Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming Chakraborty, Ebner, Bhat, Fakhoury, Fatima, Lahiri, Swamy [ICSE’25 to
appear]
• AutoVerus: Automated Proof Generation for Rust Code. Yang, Li, Misu, Yao, Cui, Gong, Hawblitzel, Lahiri, Lorch, Lu, Yang, Zhou, Lu [arXiv]
• LLM and verified code translation
• LLM-Vectorizer: LLM-based Verified Loop Vectorizer. Taneja, Laird, Yan, Musuvathi, Lahiri [CGO’25 to appear]
• QURE: AI-Assisted and Automatically Verified UDF Inlining. Siddiqui, König, Cao, Yan, Lahiri [Sigmod '25 to appear]
• LLM and test generation
• TOGA: A Neural Method for Test Oracle Generation. Dinella, Ryan, Mytkowicz, Lahiri [ICSE ‘22]
• CODAMOSA: Escaping Coverage Plateaus in Test Generation with Pre-trained Large Language Models. Lemieux, Priya Inala, Lahiri, Sen [ICSE’23]
RiSE
AI for Program Specifications
• UIF for mainstream languages (Python, Java), and use case
• Endres, Fakhoury, Chakraborty, Lahiri [FSE’24]
• UIF for verification-aware languages (Dafny, F*, Verus, …)
• Lahiri [FMCAD’24]
• Application to specification model training (for Verus) [ICLR’25]
• UIF for verified parser generation from RFCs
• 3DGen: Fakhoury, Kuppe, Lahiri, Ramananandro, Swamy [ICSE’25]
• Application to interactive code-generation (TiCoder)
• LLM-based Test-driven Interactive Code Generation: User Study and Empirical Evaluation. Fakhoury, Naik, Sakkas, Chakraborty, Lahiri [TSE’24]
Questions
48
Application of specification model training
• The approach has been ported to Verus for boosting synthetic
(spec, code, proof) generation for model training
• Automated Proof Generation for Rust Code via Self-Evolution. Chen,
Lu, (Shan) Lu, Gong, Yang, Li, Misu, Yu, Duan, Cheng, Yang, Lahiri, Xie,
Zhou [arXiv:2410.15756], ICLR 2025
49
Model accuracy increases with training data
size
50
Training Data (#Proved Programs) | Model Accuracy (Accuracy@10, VerusBench)
Start: ~4K | 17.3%
Round 1: ~8K | 41.7%
Round 2: ~10K | 49.6%
Round 3: ~11K | 54.0%
Adding “debugging” data into training set
51
Training Data (#Proved Programs) | Model Accuracy (Accuracy@10, VerusBench)
Start: ~4K | 17.3%
Round 1: ~8K | 41.7%
Round 2: ~10K | 49.6%
Round 3: ~11K | 54.0%
Training Data (# Debugging Pairs) | Model Accuracy (Accuracy@10, VerusBench)
Round 4: ~13K | 82.0%
All done in ~1.5 days
GPT: docstring → Verus specification
52
fn binary_search(v: &Vec<u64>, k: u64) -> (r: usize)
Given a sorted input vector and a search key that exists in the vector,
return a vector index that is valid and points to the search key
requires
forall|i:int, j:int| 0 <= i <= j < v.len() ==> v[i] <= v[j],
exists|i:int| 0 <= i < v.len() && k == v[i],
ensures
r < v.len(),
k == v[r as int],

requires
forall|i:int, j:int| 0 <= i <= j < v.len() ==> v[i] <= v[j],
exists|i:int| 0 <= i < v.len() && k == v[i],
ensures
r < v.len(),
ensures
k == r,
ensures
r >= 0,
How to score a specification?
53
[Diagram] A specification is scored against original test cases and their mutated variants:
xx% of original test cases are approved (the higher, the better)
yy% of mutated test cases are rejected (the higher, the better)
RQ1: How good are LLMs at generating specs
from Natural Language?
Evaluation Methodology: EvalPlus
[Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code
Generation. Liu et al. NeurIPS’23]
For each problem in HumanEval, we used LLMs to generate a set of
postconditions. We consider the following ablations:
1. Model (GPT-3.5, GPT-4, and StarCoder)
2. Prompting with NL only vs. NL + reference solution
54
Evaluate on Defects4J dataset of real-world bugs
and fixes in mature Java projects
Our postconditions leverage functional Java syntax introduced in
Java 8. Not all bugs in Defects4J are Java 8 syntax compatible.
Our NL2Spec Defects4J subset contains 525 bugs from 11
projects. These bugs implicate 840 buggy Java methods.
55
Application: Can GPT-4 generated specifications
find real-world bugs?
[Defects4J: a database of existing faults to enable controlled testing studies for Java programs. 2014. Rene Just, Darioush Jalali, Michael Ernst]
We use GPT-4 to generate 10 postconditions and 10
preconditions for each buggy function.
We consider two ablations (33,600 total GPT-4 calls)
• NL + Buggy Method Code + Relevant File Context
• NL + Relevant File Context
For each, we measure:
56
Correctness
Does the spec pass the
tests on correct code?
Bug-discriminating
If it is correct, does the
spec fail any of the tests
on buggy code?
Application: Can GPT-4 generated specifications
find real-world bugs?
microsoft/TiCoder
TODO Chart Example
https://issues.apache.org/jira/browse/CLI-151
57
RQ2: Example triggered bug from Defects4J