Natural Is The Best: Model-Agnostic Code
Simplification for Pre-trained Large Language Models
Li XiaoNing
Central University of Finance and Economics
2022212378@email.cufe.edu.cn
March 17, 2024
Overview
1 Introduction
2 Methodology
3 Experiment
4 Conclusion
Pre-trained Language Models in Code Intelligence
Code Intelligence (CI)
The main goal of CI is to enable models to master automated code
understanding and generation, thereby advancing intelligent software
engineering.
Pre-trained Language Models (PLMs)
PLMs have strong semantic understanding abilities and have achieved strong
results on many NLP tasks.
CodeBERT: masked language modeling (MLM) + replaced token detection (RTD).
CodeT5: masked span prediction (MSP) + identifier tagging (IT) + masked
identifier prediction (MIP) + bimodal dual generation.
CodeGen is designed for automatic code generation, streamlining
development by reducing manual coding effort.
...
Challenges of PLM in CI
Challenges
1 Time-consuming: the computational complexity of self-attention grows
quadratically with the length of the input code sequence (see the sketch
after this list).
2 Input token limit: CodeBERT accepts at most 512 tokens. The word position
embedding matrix is 1,024 × 768 for GPT-2 and 2,048 × 12,288 for GPT-3.
3 Expensive: for gpt-4-32k, input costs $0.06 per 1K tokens and output costs
$0.12 per 1K tokens.
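A minimal back-of-the-envelope sketch of the first two challenges (not from the paper; the function names and the 512-token baseline are illustrative assumptions): self-attention cost grows with the square of the input length, and anything past the model's input limit is simply truncated.

def relative_attention_cost(n_tokens: int, baseline: int = 512) -> float:
    # Self-attention cost scales with the square of the sequence length.
    return (n_tokens / baseline) ** 2

def truncate_for_codebert(tokens: list, max_len: int = 512) -> list:
    # Tokens beyond the model's input limit are simply cut off.
    return tokens[:max_len]

print(relative_attention_cost(1024))  # 4.0: doubling the input roughly quadruples attention cost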
An Example of Code Simplification: DietCode [1]
1 Pre-training: obtain rich contextualized vector representations via
self-supervised learning.
2 Fine-tuning: adapt the pre-trained model to specific downstream tasks.
An Example of Code Simplification: DietCode [1]
1 Statement selection (a-b): formulated as a 0-1 knapsack problem, where C is
the target length plus the length of the longest statement, V holds the
attention weights of the statements, and W holds their lengths. With n
statements, the time complexity is O(n · C). A small sketch of this
formulation follows this list.
2 Token pruning (b-c): remove the tokens with the lowest attention weights
from the lowest-weighted statements.
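The statement-selection step can be sketched as a standard 0-1 knapsack dynamic program (an illustrative rendering of the formulation above, not DietCode's actual implementation; the variable names are assumptions): keep the subset of statements that maximizes total attention weight within the length budget C.

def select_statements(v, w, c):
    # 0-1 knapsack: maximize attention weight v of kept statements within budget c,
    # where w[i] is the length of statement i. Runs in O(n * C), as stated above.
    n = len(v)
    dp = [0.0] * (c + 1)                       # dp[b] = best total weight within budget b
    keep = [[False] * (c + 1) for _ in range(n)]
    for i in range(n):
        for b in range(c, w[i] - 1, -1):       # descending budgets: each statement used at most once
            if dp[b - w[i]] + v[i] > dp[b]:
                dp[b] = dp[b - w[i]] + v[i]
                keep[i][b] = True
    chosen, b = [], c
    for i in range(n - 1, -1, -1):             # backtrack to recover the selected statements
        if keep[i][b]:
            chosen.append(i)
            b -= w[i]
    return sorted(chosen)

print(select_statements([0.9, 0.2, 0.7], [5, 3, 4], 9))  # -> [0, 2]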
Attention is The Best?
Limitation
1 Inefficiency. The process of obtaining attention weights is relatively
complex and time-consuming.
2 Model dependency. The attention weights depend on the model
architecture and the pre-training datasets.
To ensure efficiency and generality, it is advisable to consider a
broader range of factors when deciding how to simplify programs.
Natural Factors
Lexical Level
1 Symbol tokens: brackets, separators, and operators.
2 Identifiers: variable names.
Syntactic Level
1 Control structures: for, if, try, catch, switch, while, do-while, and their
conditions.
2 Method signatures: the method declaration and its parameters.
3 Method invocations: calls to methods.
Semantic Level
Tokens not present in the Program Dependency Graph (PDG). (A heuristic
sketch of these categories follows.)
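As a rough illustration of the lexical and syntactic categories above (a heuristic sketch only; SlimCode itself would rely on a real parser and the PDG rather than the keyword and symbol sets assumed below):

SYMBOLS = set("{}()[];,.") | {"+", "-", "*", "/", "=", "==", "<", ">", "&&", "||"}
CONTROL = {"for", "if", "else", "try", "catch", "switch", "while", "do"}

def categorize(token, in_signature=False, is_invocation=False):
    # Bucket a code token into one of the lexical/syntactic categories above.
    if token in SYMBOLS:
        return "symbol token"
    if token in CONTROL:
        return "control structure"
    if in_signature:
        return "method signature"
    if is_invocation:
        return "method invocation"
    return "identifier" if token.isidentifier() else "other"

print(categorize("while"), categorize("{"), categorize("count"))  # control structure, symbol token, identifier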
Empirical Study
RQ-1: What is the impact of randomly removing code tokens
on the performance of PLM?
RQ-2: What is the impact of removing lexical tokens on the
performance of PLM?
RQ-3: What is the impact of removing syntactical tokens on the
performance of PLM?
RQ-4: What is the impact of removing semantic tokens on the
performance of PLM?
Tasks & Datasets
Downstream Tasks
1 Code search: find relevant code snippets from a codebase given a
query.
2 Code summarization: generate a natural language summary for a
given code snippet.
Datasets
1 Pre-train and fine-tune paradigm: we use the same dataset as
CodeBERT and DietCode.
2 Pre-train, prompt, and predict paradigm: we randomly selected
400 samples from the dataset.
Models & Metrics
Models
1 Pre-train and fine-tune paradigm: CodeBERT and CodeT5 are the
typical models used in this paradigm.
2 Pre-train, prompt, and predict paradigm: GPT-4 is the state-of-the-art
model for code-related downstream tasks in this paradigm.
Metrics
1 Pre-train and fine-tune paradigm:
MRR: the mean of the reciprocal rank of the first correct answer for each
query.
BLEU-4: n-gram precision (up to 4-grams) between the generated and the
reference sequence.
2 Pre-train, prompt, and predict paradigm:
Accuracy: the number of correct predictions divided by the total
number of predictions.
BLEU-4: as above. (Minimal implementations are sketched below.)
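Minimal reference implementations of the metrics described above (a sketch; ranks are 1-based, and the n-gram precision here omits the brevity penalty and smoothing that standard BLEU toolkits apply):

from collections import Counter

def mrr(first_correct_ranks):
    # Mean reciprocal rank: average of 1/rank of the first correct answer per query.
    return sum(1.0 / r for r in first_correct_ranks) / len(first_correct_ranks)

def clipped_ngram_precision(candidate, reference, n=4):
    # The 4-gram precision that BLEU-4 combines with 1-, 2-, and 3-gram precision.
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / max(1, sum(cand.values()))

print(mrr([1, 2, 4]))  # (1 + 1/2 + 1/4) / 3 ≈ 0.583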
RQ-1: What is the impact of randomly removing code
tokens on the performance of PLM?
Code search. Ratio | CodeBERT: Time R-Time MRR R-MRR | CodeT5: Time R-Time MRR R-MRR
0% 433m 0.00%– 0.743 0.00%– 434m 0.00%– 0.749 0.00%–
10% 388m 10.39%↓ 0.706 4.98%↓ 391m 9.08%↓ 0.737 1.60%↓
20% 354m 18.24%↓ 0.694 6.59%↓ 355m 18.20%↓ 0.726 3.07%↓
30% 309m 28.64%↓ 0.668 10.09%↓ 310m 28.57%↓ 0.696 7.08%↓
40% 252m 41.80%↓ 0.648 12.79%↓ 253m 41.71%↓ 0.679 9.35%↓
50% 210m 51.50%↓ 0.611 17.77%↓ 212m 51.52%↓ 0.641 14.42%↓
Code summarization. Ratio | CodeBERT: Time R-Time BLEU-4 R-BLEU | CodeT5: Time R-Time BLEU-4 R-BLEU
0% 910m 0.00%– 18.58 0.00%– 916m 0.00%– 20.49 0.00%–
10% 840m 7.69%↓ 17.35 6.62%↓ 845m 7.75%↓ 19.56 4.49%↓
20% 802m 11.87%↓ 17.18 7.53%↓ 807m 11.90%↓ 19.47 4.93%↓
30% 734m 19.34%↓ 16.75 9.85%↓ 740m 19.21%↓ 19.07 6.88%↓
40% 684m 24.84%↓ 16.36 11.95%↓ 689m 24.78%↓ 18.36 10.35%↓
50% 621m 31.76%↓ 15.06 18.95%↓ 627m 31.55%↓ 17.30 15.53%↓
Reducing the input code significantly cuts down training time.
RQ-2: What is the impact of removing lexical tokens on
the performance of PLM?
Code search. Method | Ratio | CodeBERT: Time R-Time MRR R-MRR | CodeT5: Time R-Time MRR R-MRR
Base 0.00% 433m 0.00%– 0.743 0.00%– 434m 0.00%– 0.749 0.00%–
Identifiers 15.48% 367m 15.24%↓ 0.650 12.52%↓ 375m 13.59%↓ 0.683 8.81%↓
Symbol tokens 51.38% 215m 50.35%↓ 0.722 2.83%↓ 193m 55.53%↓ 0.729 2.67%↓
Code summarization. Method | Ratio | CodeBERT: Time R-Time BLEU-4 R-BLEU | CodeT5: Time R-Time BLEU-4 R-BLEU
Base 0.00% 910m 0.00%– 18.58 0.00%– 916m 0.00%– 20.49 0.00%–
Identifiers 15.69% 829m 8.90%↓ 17.87 3.83% ↓ 828m 9.61% ↓ 19.49 4.88% ↓
Symbol tokens 52.31% 606m 33.41%↓ 18.47 0.59% ↓ 628m 31.44% ↓ 20.34 0.73% ↓
Identifiers > Symbol tokens
RQ-3: What is the impact of removing syntactical tokens
on the performance of PLM?
Code search. Method | Ratio | CodeBERT: Time R-Time MRR R-MRR | CodeT5: Time R-Time MRR R-MRR
Base 0.00% 433m 0.00%– 0.743 0.00%– 434m 0.00%– 0.749 0.00%–
Control structures 11.90% 381m 12.01%↓ 0.715 3.77%↓ 386m 11.06%↓ 0.730 2.54%↓
Method invocations 37.47% 266m 38.57%↓ 0.682 8.21%↓ 268m 38.25%↓ 0.698 6.81%↓
Method signature 15.96% 354m 18.24%↓ 0.649 12.65%↓ 354m 18.43%↓ 0.680 9.21%↓
Code summarization. Method | Ratio | CodeBERT: Time R-Time BLEU-4 R-BLEU | CodeT5: Time R-Time BLEU-4 R-BLEU
Base 0.00% 910m 0.00%– 18.58 0.00%– 916m 0.00%– 20.49 0.00%–
Control structures 16.48% 820m 9.89%↓ 18.57 0.05% ↓ 843m 7.97% ↓ 20.38 0.54% ↓
Method invocations 35.48% 692m 23.96%↓ 18.17 2.21% ↓ 703m 23.25% ↓ 20.31 0.88% ↓
Method signature 11.36% 813m 10.66%↓ 15.86 14.64% ↓ 817m 10.81% ↓ 16.61 18.94% ↓
Method signature > Control structure ≈ Method invocations
RQ-4: What is the impact of removing semantic tokens on
the performance of PLM?
Code search. Method | Ratio | CodeBERT: Time R-Time MRR R-MRR | CodeT5: Time R-Time MRR R-MRR
Base 0.00% 433m 0.00%– 0.743 0.00%– 434m 0.00%– 0.749 0.00%–
PDG 24.84% 332m 23.33%↓ 0.713 4.04%↓ 328m 24.42%↓ 0.729 2.67%↓
Code summarization. Method | Ratio | CodeBERT: Time R-Time BLEU-4 R-BLEU | CodeT5: Time R-Time BLEU-4 R-BLEU
Base 0.00% 910m 0.00%– 18.58 0.00%– 916m 0.00%– 20.49 0.00%–
PDG 23.62% 750m 17.58%↓ 18.46 0.65% ↓ 766m 16.38% ↓ 20.39 0.49% ↓
Tokens that are not in the PDG have little impact on downstream tasks
because they carry little semantic information.
SlimCode: Model-Agnostic Code Simplification for PLM
Importance levels of tokens
method signatures > identifiers > control structures ≈ method
invocations > symbol tokens
The principle of our model-agnostic code simplification technique is to
remove tokens with lower importance levels before those with higher levels;
the sketch below shows one way to encode this ordering as ranking scores.
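One way to turn the importance ordering above into the ranking scores V consumed by Algorithm 1 on the next slide (the relative order follows the empirical study; the concrete numbers are placeholders, not the paper's values):

REMOVAL_PRIORITY = {              # lower score = removed earlier
    "symbol token": 0,
    "method invocation": 1,
    "control structure": 1,       # roughly tied with invocations
    "identifier": 2,
    "method signature": 3,
}

def ranking_scores(token_categories):
    # Map each token's category to its importance score (unknown categories get a middle value).
    return [REMOVAL_PRIORITY.get(cat, 1) for cat in token_categories]

print(ranking_scores(["symbol token", "identifier", "method signature"]))  # [0, 2, 3]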
SlimCode: Model-Agnostic Code Simplification for PLMs
Algorithm 1: Code Simplification Algorithm
INPUT: D = {d1, ..., dm}, ranking scores V, SimplifiedRatio, and the original
input length L (nj denotes the number of tokens in dj)
OUTPUT: a simplified code dataset D′
PROCEDURE:
1: Initialize D′, a copy of D
2: for j from 1 to m do
3:   Initialize an empty dictionary removedTokens of positions and their tokens
4:   if nj > (1 − SimplifiedRatio) × L then
5:     W ← nj − (1 − SimplifiedRatio) × L
6:     currentWeight ← 0
7:     while currentWeight < W do
8:       Add {index: token with the lowest v} (∈ d′j, ∉ removedTokens) into removedTokens
9:       currentWeight ← sizeof(removedTokens)
10:    d′j ← d′j \ removedTokens[1 : W]
11: return D′
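A direct Python rendering of Algorithm 1 (a sketch under the assumption that each sample is a token list and scores holds one ranking score per token; the paper's actual data structures may differ):

def simplify_dataset(dataset, scores, simplified_ratio, max_len):
    simplified = []
    for tokens, token_scores in zip(dataset, scores):
        budget = (1 - simplified_ratio) * max_len      # allowed length after simplification
        if len(tokens) <= budget:
            simplified.append(list(tokens))            # already short enough: keep unchanged
            continue
        n_remove = int(round(len(tokens) - budget))    # W in Algorithm 1
        # Drop the n_remove tokens with the lowest ranking scores (most expendable first).
        removed = set(sorted(range(len(tokens)), key=lambda i: token_scores[i])[:n_remove])
        simplified.append([t for i, t in enumerate(tokens) if i not in removed])
    return simplified

# Example: with SimplifiedRatio = 0.5 and L = 8, the four lowest-scored tokens are dropped.
print(simplify_dataset([["int", "x", "=", "0", ";", "return", "x", ";"]],
                       [[3, 2, 0, 2, 0, 1, 2, 0]], 0.5, 8))  # [['int', 'x', '0', 'x']]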
An Example of SlimCode
1 a-b: remove simple symbol tokens.
2 b-c: remove moderate-impact tokens.
The Results of Comparing SlimCode with DietCode
SlimCode. Ratio | Code search: CodeBERT MRR R-M, CodeT5 MRR R-M, Pruning Time, Speedup (× over DietCode) | Code summarization: CodeBERT BLEU R-B, CodeT5 BLEU R-B, Pruning Time, Speedup
Base 0.743 0.00% – 0.754 0.00%– N/A —— 18.58 0.00%– 20.49 0.00%– N/A ——
10% 0.740 0.40% ↓ 0.749 0.66%↓ 17m 32.2 18.56 0.11%↓ 20.44 0.24%↓ 45s 133.1
20% 0.721 2.96% ↓ 0.748 0.80%↓ 17m 28.9 18.29 1.56%↓ 20.38 0.54%↓ 53s 101.4
30% 0.734 1.21% ↓ 0.751 0.40%↓ 20m 21.9 18.62 0.22%↑ 20.49 0.00%↓ 59s 80.3
40% 0.733 1.35% ↓ 0.757 0.40%↑ 21m 18.3 18.41 0.91%↓ 20.52 0.15%↑ 66s 64.3
50% 0.719 3.23% ↓ 0.745 1.19%↓ 21m 18.3 18.63 0.27%↑ 20.23 1.27%↓ 69s 53.2
DietCode. Ratio | Code search: CodeBERT MRR R-M, CodeT5 MRR R-M, Pruning Time | Code summarization: CodeBERT BLEU R-B, CodeT5 BLEU R-B, Pruning Time
Base 0.743 0%– 0.754 0.00%– N/A 18.58 0%– 20.49 0.00%– N/A
10% 0.702 5.52%↓ 0.730 3.18%↓ 9h24m 17.68 4.84%↓ 19.68 3.90%↓ 1h40m
20% 0.686 7.67%↓ 0.718 4.77%↓ 8h28m 17.94 3.44%↓ 19.77 3.51%↓ 1h30m
30% 0.693 6.73%↓ 0.714 5.31%↓ 7h37m 17.73 4.57%↓ 19.68 3.95%↓ 1h19m
40% 0.679 8.61%↓ 0.707 6.21%↓ 6h45m 17.53 5.65%↓ 19.42 5.22%↓ 1h11m
50% 0.651 12.38%↓ 0.676 10.34%↓ 5h59m 17.67 4.90%↓ 19.22 6.20%↓ 1h02m
Overall, SlimCode outperforms the state-of-the-art approach DietCode and is
more efficient in both code search and code summarization.
SlimCode vs. DietCode
Compared with DietCode, SlimCode removes far fewer tokens from
method signatures and identifiers.
The Results of Comparing SlimCode with DietCode
Code search (GPT-4). Removal method | IT R-IT | OT R-OT | TT R-TT | Accuracy R-P   (IT/OT/TT = input/output/total tokens; R-* = relative change)
Base 69820 0.00% 44584 0.00% 114404 0.00%– 0.85 0.00%
SlimCode(10%) 64776 7.22%↓ 40988 8.06%↓ 105769 7.55%↓ 0.78 8.24%↓
SlimCode(20%) 59941 14.15%↓ 39500 11.40%↓ 99441 13.08%↓ 0.71 16.47%↓
SlimCode(30%) 55036 21.17%↓ 38224 14.27%↓ 93260 18.48%↓ 0.74 12.94%↓
SlimCode(40%) 50224 28.06%↓ 38616 13.39%↓ 88840 22.34%↓ 0.74 12.94%↓
SlimCode(50%) 45388 34.99%↓ 37879 15.05%↓ 83267 27.22%↓ 0.87 2.35%↑
Code summarization (GPT-4). Removal method | IT R-IT | OT R-OT | TT R-TT | BLEU-4 R-B
Base 48037 0.00%- 15348 0.00%- 63385 0.00%- 5.50 0.00%-
SlimCode(10%) 44005 8.39%↓ 15868 3.62%↑ 59873 5.48%↓ 5.42 1.45%↓
SlimCode(20%) 40188 16.34%↓ 16544 7.79%↑ 56732 10.49%↓ 5.40 1.82%↓
SlimCode(30%) 36321 24.39%↓ 16092 4.85%↑ 52413 17.31%↓ 5.50 0.00%-
SlimCode(40%) 32492 32.36%↓ 16593 8.11%↑ 49085 22.56%↓ 5.20 5.45%↓
SlimCode(50%) 28724 40.20%↓ 15848 3.26%↑ 44572 29.68%↓ 5.51 2.04%↑
According to OpenAI’s pricing policy [2], the cost is proportional to the
number of tokens. For example, the total cost of the 400 code-search samples
is just 9.54 dollars; after SlimCode removes 50% of the input tokens, the cost
drops to 7.27 dollars, saving about 24%. The arithmetic is sketched below.
Conclusion
SlimCode: a model-agnostic code simplification method for PLMs
1 An empirical analysis of which information is critical to PLMs.
2 A program simplification approach for PLMs with the advantages of
generality and efficiency.
3 Cost savings in API usage of PLMs.
Future Work
1 More models.
2 More tasks.
References
[1] Zhaowei Zhang, Hongyu Zhang, Beijun Shen, Xiaodong Gu. 2022. Diet Code Is
Healthy: Simplifying Programs for Pre-trained Models of Code. In ESEC/FSE 2022.
[2] OpenAI Pricing. [n. d.]. https://openai.com/pricing.
THANK YOU!
Q & A