Semantic Scaffolds for Pseudocode-to-Code Generation
Ruiqi Zhong, Mitchell Stern, Dan Klein
Abstract
⦿ A method for program generation based on semantic scaffolds
⦿ Searches over plausible scaffolds, then uses them as constraints
for a beam search over programs
⦿ Applies this hierarchical search method to the SPoC dataset
Contribution
⦿ Proposed the use of semantic scaffolds to add semantic
constraints to models for long-form language-to-code generation
tasks
⦿ Introduced a hierarchical beam search algorithm
⦿ Achieved a new state-of-the-art accuracy of 55.1% on the SPoC
dataset
2.1 Data
⦿ Focused on the SPoC dataset
⦿ It consists of C++ solutions to problems from Codeforces
⦿ It contains 18,356 programs in total, with 14.7 lines per program
on average
2.2 Task
⦿ Suppose the target program has $L$ lines
⦿ Each line $l$ comes with a natural language pseudocode annotation $x_l$ and an indentation
level $i_l$
⦿ The goal is to find a candidate program $y$ based on $(x_1, i_1), \ldots, (x_L, i_L)$
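The input described above can be pictured as a list of (annotation, indentation) pairs, one per line. The following is only a minimal sketch: the `PseudocodeLine` class, the `assemble` helper, and the three example lines are hypothetical names and data invented for illustration, not part of the SPoC tooling.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PseudocodeLine:
    text: str    # natural language annotation x_l ("" when the line has none)
    indent: int  # indentation level i_l

# Hypothetical three-line input (x_1, i_1), ..., (x_L, i_L).
task = [
    PseudocodeLine("", 0),
    PseudocodeLine("create integer n", 1),
    PseudocodeLine("read n", 1),
]

def assemble(code_pieces, lines):
    """Join one chosen code piece per line into a candidate program y."""
    if len(code_pieces) != len(lines):
        raise ValueError("need exactly one code piece per line")
    return "\n".join(
        "    " * line.indent + piece for piece, line in zip(code_pieces, lines)
    )

program = assemble(["int main() {", "int n;", "cin >> n;"], task)
```

A candidate program is then just one code piece chosen for every line, which is what the search has to decide.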
3. Combination Constraints
⦿ Scoring the candidate code pieces for each line independently ignores any dependence between different lines
⦿ If we naively combine certain subsets of candidates, the
resulting program may be invalid due to the use of undeclared
variables or mismatched braces
⦿ The authors propose to enforce certain syntactic and semantic constraints
when combining candidate code pieces
3.1 Syntactic Constraints
⦿ Restrict attention to the set of “primary expressions”,
consisting of high-level control structures such as if and else
⦿ Parse the candidate code pieces for each line into a list of primary
expression symbols
⦿ Code pieces from different lines can be combined only if there exists a grammatical derivation that combines their respective
symbols
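To make the idea concrete, here is a deliberately simplified sketch. It reduces each code piece to control keywords and braces with a regular expression and only checks that braces match across lines. This is an assumption of the sketch and much weaker than the grammatical derivation check described above; the function names are hypothetical, and braces inside string literals would be miscounted.

```python
import re

# Control keywords and braces stand in for "primary expression" symbols.
SYMBOL = re.compile(r"\b(?:if|else|while|for)\b|[{}]")

def primary_symbols(code_piece):
    """Crude stand-in for a parser: keep only keywords and braces, in order."""
    return SYMBOL.findall(code_piece)

def braces_match(code_pieces):
    """Return True if braces opened and closed across all lines are balanced."""
    depth = 0
    for piece in code_pieces:
        for symbol in primary_symbols(piece):
            if symbol == "{":
                depth += 1
            elif symbol == "}":
                depth -= 1
                if depth < 0:  # a "}" with no matching "{"
                    return False
    return depth == 0

good = ["int main() {", "if (n > 0) {", "n--;", "}", "}"]
bad = ["int main() {", "if (n > 0)", "n--;", "}", "}"]
```

In this toy check, `good` passes while `bad` is rejected, because choosing the brace-less `if` candidate leaves one closing brace unmatched.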
3.2 Symbol Table Constraints
⦿ this approach ignores any dependence between different lines
⦿ Extract the variable names used or declared by each code piece, and check that:
⌾ undeclared variables are not used
⌾ variables are not re-declared within the same scope
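The two checks can be sketched with a stack of scopes. This is an illustrative sketch only: it assumes each program has already been reduced to a sequence of hypothetical `(kind, name)` events (`"open"`, `"close"`, `"declare"`, `"use"`), which is not how the original system represents code pieces.

```python
def symtable_valid(events):
    """Check declare/use events against a stack of scopes."""
    scopes = [set()]  # the innermost scope is the last element
    for kind, name in events:
        if kind == "open":
            scopes.append(set())
        elif kind == "close":
            if len(scopes) == 1:
                return False  # nothing left to close
            scopes.pop()
        elif kind == "declare":
            if name in scopes[-1]:
                return False  # re-declared within the same scope
            scopes[-1].add(name)
        elif kind == "use":
            if not any(name in scope for scope in scopes):
                return False  # used without a declaration
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return True

declare_then_use = [("declare", "N"), ("use", "N")]
use_only = [("use", "N")]
declared_twice = [("declare", "N"), ("declare", "N")]
inner_scope = [("declare", "N"), ("open", None), ("declare", "N"), ("close", None)]
```

Only `use_only` and `declared_twice` are rejected here; declaring the same name again inside a nested scope is allowed in this sketch, since the second check applies only within a single scope.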
4.1 Beam Search
⦿ The search begins with k randomly generated states
⦿ At each step, all the successors of all k states are generated
⦿ If any one of the successors is a goal, the algorithm halts
⦿ Otherwise, it selects the k best successors from the complete list
and repeats
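The four steps above can be sketched as a small generic function. This is a minimal sketch, not the search used in the paper: the start states are passed in rather than generated randomly so that the example is deterministic, and the `max_steps` cap and the toy integer problem are assumptions added for illustration.

```python
import heapq

def local_beam_search(start_states, successors, score, is_goal, max_steps=100):
    """Keep the k best states per step, where k = len(start_states)."""
    k = len(start_states)
    beam = list(start_states)
    for _ in range(max_steps):
        # Generate all the successors of all k states.
        candidates = [nxt for state in beam for nxt in successors(state)]
        if not candidates:
            return None  # no nodes left to explore
        # Halt as soon as any successor is a goal.
        for state in candidates:
            if is_goal(state):
                return state
        # Otherwise keep the k best successors and repeat.
        beam = heapq.nlargest(k, candidates, key=score)
    return None

# Toy problem (hypothetical): reach 10 from small integers using +1 or *2.
found = local_beam_search(
    start_states=[1, 2],
    successors=lambda n: [n + 1, n * 2],
    score=lambda n: -abs(10 - n),
    is_goal=lambda n: n == 10,
)
```

Because only k states survive each step, a state that would eventually lead to the goal can be pruned early, which is why the function may return `None` even when a goal exists.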
4.1 Beam Search
⦿ Advantages:
⌾ Potentially reduces the computation, and therefore the time, of a search
⌾ Uses less memory than a search that keeps every state, since only the k best states are retained at each step
⦿ Disadvantages:
⌾ The search may not find an optimal goal, and may not
even reach a goal at all
⌾ It terminates in one of two cases: a required goal node is reached, or
no goal node is reached and there are no nodes left to be
explored
5. Implementation (Empty Pseudocode)
⦿ 26% of the lines in the data set do not have pseudocode
annotations
⦿ These are lines such as “int main() {”, “{”, “}”, etc.
⦿ They did not use the gold code pieces for these lines
5. Implementation
⦿ Empty Pseudocode
⌾ “int main() {”, “{”, “}”, etc.
⦿ Model Training
⌾ OpenNMT
⦿ Parsing Code Pieces
⦿ Search Algorithm
5. Implementation (Search Algorithm Hyperparameters)
⦿ They consider the top C = 100 code pieces for each line
⦿ The default beam width is W = 50 for scaffold search
⦿ They keep the top K = 20 scaffolds for the subsequent generation
6.2 Comparison of Constraints
Figure: (a) Candidate code pieces and their syntactic/SymTable configuration for each line; (b)
use beam search to find the highest-scoring valid scaffolds; (c) given a scaffold, select code pieces
that have the same configuration for each line; (d) combine code pieces to form a full program.
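Steps (a) to (d) of the figure can be sketched on a toy example. Everything below is an assumption made for illustration: the three configuration labels (`open`, `close`, `stmt`) stand in for the real syntactic/SymTable configurations, validity is reduced to brace depth, and a configuration is scored by its best code piece, which may differ from the scoring used in the paper.

```python
import heapq

DELTA = {"open": 1, "close": -1, "stmt": 0}  # toy configurations (assumption)

def depth_after(configs):
    """Brace depth after these configurations, or None if it goes negative."""
    depth = 0
    for config in configs:
        depth += DELTA[config]
        if depth < 0:
            return None
    return depth

def search_scaffolds(candidates, beam_width):
    """(b) Beam search over per-line configurations; returns valid scaffolds."""
    beam = [(0.0, ())]
    for line in candidates:
        best = {}  # best log-probability seen for each configuration
        for _code, logp, config in line:
            if logp > best.get(config, float("-inf")):
                best[config] = logp
        extended = [
            (score + logp, prefix + (config,))
            for score, prefix in beam
            for config, logp in best.items()
            if depth_after(prefix + (config,)) is not None
        ]
        beam = heapq.nlargest(beam_width, extended)
    return [(score, s) for score, s in beam if depth_after(s) == 0]

def fill_scaffold(candidates, scaffold):
    """(c) + (d) Pick the best matching code piece per line and join them."""
    chosen = []
    for line, config in zip(candidates, scaffold):
        matching = [(logp, code) for code, logp, c in line if c == config]
        chosen.append(max(matching)[1])
    return "\n".join(chosen)

# (a) Candidates per line as (code piece, log-probability, configuration).
candidates = [
    [("int main() {", -0.1, "open"), ("int main()", -2.0, "stmt")],
    [("int n;", -0.2, "stmt")],
    [("return 0;", -0.3, "stmt"), ("}", -0.6, "close")],
]
scaffolds = search_scaffolds(candidates, beam_width=2)
program = fill_scaffold(candidates, scaffolds[0][1])
```

Picking the top candidate on every line independently would end with `return 0;` and an unclosed brace. The scaffold search instead settles on the `open`, `stmt`, `close` scaffold. With `beam_width=1` the valid scaffold is pruned and the search returns nothing, which illustrates why a fallback can be needed.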
6.2 Comparison of Constraints
⦿ No Constraints: the best-first search method that scores lines
independently.
⦿ Syntactic Constraints: the constraints on the primary expression
and indentation level.
⦿ Symbol Table Constraints: both the syntactic constraints and the
symbol table constraints described in 3.2.
⦿ Backoff: sometimes hierarchical beam search with the SymTable
constraints fails to return any valid scaffold. In that case, they back off to just
the Syntactic constraints.
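The backoff rule amounts to trying the stricter search first and falling back when it returns nothing. A minimal sketch follows, in which the search functions are hypothetical stubs supplied by the caller and the scaffold labels are toy values:

```python
def search_with_backoff(searches, lines):
    """Run (name, search_fn) pairs in order; return the first non-empty result."""
    for name, search_fn in searches:
        scaffolds = search_fn(lines)
        if scaffolds:
            return name, scaffolds
    return None, []

# Stub searches (hypothetical): the SymTable search finds nothing here.
searches = [
    ("symtable", lambda lines: []),
    ("syntactic", lambda lines: [("open", "stmt", "close")]),
]
used, scaffolds = search_with_backoff(searches, ["", "create integer n", ""])
```

Returning the name of the search that succeeded makes it easy to count how often the backoff path is taken.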
7.2 Rejection by Constraints
⦿ Syntactic Constraints:
⌾ “}”, “int main(){”, “{”, “return 0”, “};”
⌾ “if(...){” and “if(...)”
⦿ Symbol Table (SymTable) Constraints:
⌾ “Set N to 222222” -> “int N = 222222” or “N = 222222”
7.3 Code Piece Error Analysis
⦿ The model's generation is wrong despite clear pseudocode
⦿ Variable type clarification
⦿ Syntactic context
⦿ consists of variable name types
