Natural Is The Best: Model-Agnostic Code
Simplification for Pre-trained Large Language Models
Li XiaoNing
Central University of Finance and Economics
2022212378@email.cufe.edu.cn
March 17, 2024
Overview
1 Introduction
2 Methodology
3 Experiment
4 Conclusion
Pre-trained Language Models in Code Intelligence
Code Intelligence (CI)
The main goal of CI is to enable models to master automated code
understanding and generation, thereby advancing intelligent software
engineering.
Pre-trained Language Models (PLMs)
PLMs have strong semantic understanding abilities and have achieved strong
results on many NLP tasks.
CodeBERT: masked language modeling (MLM) + replaced token detection (RTD).
CodeT5: masked span prediction (MSP) + identifier tagging (IT) + masked
identifier prediction (MIP) + bimodal dual generation.
CodeGen is designed for automatic code generation, streamlining
development by reducing manual coding effort.
...
Challenges of PLM in CI
Challenges
1 Time-consuming: the computational complexity of self-attention grows
quadratically with the length of the input code sequence (see the sketch
after this list).
2 Input token limit: CodeBERT accepts at most 512 tokens. The word position
embedding matrix is 1,024 × 768 for GPT-2 and 2,048 × 12,288 for GPT-3.
3 Expensive: for gpt-4-32k, input costs $0.06 per 1K tokens and output costs
$0.12 per 1K tokens.
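A minimal back-of-the-envelope sketch of the first two challenges (not from the paper; the function names and the 512-token baseline are illustrative assumptions): self-attention cost grows with the square of the input length, and anything past the model's input limit is simply truncated.

def relative_attention_cost(n_tokens: int, baseline: int = 512) -> float:
    # Self-attention cost scales with the square of the sequence length.
    return (n_tokens / baseline) ** 2

def truncate_for_codebert(tokens: list, max_len: int = 512) -> list:
    # Tokens beyond the model's input limit are simply cut off.
    return tokens[:max_len]

print(relative_attention_cost(1024))  # 4.0: doubling the input roughly quadruples attention cost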
An Example of Code Simplification: DietCode [1]
1 Pre-training: obtain rich contextualized vector representations via
self-supervised learning.
2 Fine-tuning: adapt the pre-trained model to specific downstream tasks.
An Example of Code Simplification: DietCode [1]
1 Statement selection (a-b): formulated as a 0-1 knapsack problem, where C is
the target length plus the length of the longest statement, V holds the
attention weights of the statements, and W holds their lengths. With n
statements, the time complexity is O(n · C). A small sketch of this
formulation follows this list.
2 Token pruning (b-c): remove the tokens with the lowest attention weights
from the lowest-weighted statements.
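The statement-selection step can be sketched as a standard 0-1 knapsack dynamic program (an illustrative rendering of the formulation above, not DietCode's actual implementation; the variable names are assumptions): keep the subset of statements that maximizes total attention weight within the length budget C.

def select_statements(v, w, c):
    # 0-1 knapsack: maximize attention weight v of kept statements within budget c,
    # where w[i] is the length of statement i. Runs in O(n * C), as stated above.
    n = len(v)
    dp = [0.0] * (c + 1)                       # dp[b] = best total weight within budget b
    keep = [[False] * (c + 1) for _ in range(n)]
    for i in range(n):
        for b in range(c, w[i] - 1, -1):       # descending budgets: each statement used at most once
            if dp[b - w[i]] + v[i] > dp[b]:
                dp[b] = dp[b - w[i]] + v[i]
                keep[i][b] = True
    chosen, b = [], c
    for i in range(n - 1, -1, -1):             # backtrack to recover the selected statements
        if keep[i][b]:
            chosen.append(i)
            b -= w[i]
    return sorted(chosen)

print(select_statements([0.9, 0.2, 0.7], [5, 3, 4], 9))  # -> [0, 2]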
Attention is The Best?
Limitation
1 Inefficiency. The process of obtaining attention weights is relatively
complex and time-consuming.
2 Model dependency. The attention weights depend on the model
architecture and the pre-training datasets.
To ensure efficiency and generality, it is advisable to consider a
broader range of factors when deciding how to simplify programs.
Natural Factors
Lexical Level
1 Symbol tokens: brackets, separators, and operators.
2 Identifiers: variable names.
Syntactic Level
1 Control structures: for, if, try, catch, switch, while, do-while, and their
conditions.
2 Method signatures: the method declaration and its parameters.
3 Method invocations: calls to methods.
Semantic Level
Tokens not present in the Program Dependency Graph (PDG). (A heuristic
sketch of these categories follows.)
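As a rough illustration of the lexical and syntactic categories above (a heuristic sketch only; SlimCode itself would rely on a real parser and the PDG rather than the keyword and symbol sets assumed below):

SYMBOLS = set("{}()[];,.") | {"+", "-", "*", "/", "=", "==", "<", ">", "&&", "||"}
CONTROL = {"for", "if", "else", "try", "catch", "switch", "while", "do"}

def categorize(token, in_signature=False, is_invocation=False):
    # Bucket a code token into one of the lexical/syntactic categories above.
    if token in SYMBOLS:
        return "symbol token"
    if token in CONTROL:
        return "control structure"
    if in_signature:
        return "method signature"
    if is_invocation:
        return "method invocation"
    return "identifier" if token.isidentifier() else "other"

print(categorize("while"), categorize("{"), categorize("count"))  # control structure, symbol token, identifier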
Empirical Study
RQ-1: What is the impact of randomly removing code tokens
on the performance of PLM?
RQ-2: What is the impact of removing lexical tokens on the
performance of PLM?
RQ-3: What is the impact of removing syntactical tokens on the
performance of PLM?
RQ-4: What is the impact of removing semantic tokens on the
performance of PLM?
Tasks & Datasets
Downstream Tasks
1 Code search: find relevant code snippets from a codebase given a
query.
2 Code summarization: generate a natural language summary for a
given code snippet.
Datasets
1 Pre-train and fine-tune paradigm: we use the same dataset as
CodeBERT and DietCode.
2 Pre-train, prompt, and predict paradigm: we randomly selected
400 samples from the dataset.
Models & Metrics
Models
1 Pre-train and fine-tune paradigm: CodeBERT and CodeT5 are the
typical models used in this paradigm.
2 Pre-train, prompt, and predict paradigm: GPT-4 is the state-of-the-art
model for code-related downstream tasks in this paradigm.
Metrics
1 Pre-train and fine-tune paradigm:
MRR: the mean of the reciprocal rank of the first correct answer for each
query.
BLEU-4: n-gram precision (up to 4-grams) between the generated and the
reference sequence.
2 Pre-train, prompt, and predict paradigm:
Accuracy: the number of correct predictions divided by the total
number of predictions.
BLEU-4: as above. (Minimal implementations are sketched below.)
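Minimal reference implementations of the metrics described above (a sketch; ranks are 1-based, and the n-gram precision here omits the brevity penalty and smoothing that standard BLEU toolkits apply):

from collections import Counter

def mrr(first_correct_ranks):
    # Mean reciprocal rank: average of 1/rank of the first correct answer per query.
    return sum(1.0 / r for r in first_correct_ranks) / len(first_correct_ranks)

def clipped_ngram_precision(candidate, reference, n=4):
    # The 4-gram precision that BLEU-4 combines with 1-, 2-, and 3-gram precision.
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / max(1, sum(cand.values()))

print(mrr([1, 2, 4]))  # (1 + 1/2 + 1/4) / 3 ≈ 0.583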
RQ-1: What is the impact of randomly removing code
tokens on the performance of PLM?
Code search. Ratio | CodeBERT: Time R-Time MRR R-MRR | CodeT5: Time R-Time MRR R-MRR
0% 433m 0.00%– 0.743 0.00%– 434m 0.00%– 0.749 0.00%–
10% 388m 10.39%↓ 0.706 4.98%↓ 391m 9.08%↓ 0.737 1.60%↓
20% 354m 18.24%↓ 0.694 6.59%↓ 355m 18.20%↓ 0.726 3.07%↓
30% 309m 28.64%↓ 0.668 10.09%↓ 310m 28.57%↓ 0.696 7.08%↓
40% 252m 41.80%↓ 0.648 12.79%↓ 253m 41.71%↓ 0.679 9.35%↓
50% 210m 51.50%↓ 0.611 17.77%↓ 212m 51.52%↓ 0.641 14.42%↓
Code summarization. Ratio | CodeBERT: Time R-Time BLEU-4 R-BLEU | CodeT5: Time R-Time BLEU-4 R-BLEU
0% 910m 0.00%– 18.58 0.00%– 916m 0.00%– 20.49 0.00%–
10% 840m 7.69%↓ 17.35 6.62%↓ 845m 7.75%↓ 19.56 4.49%↓
20% 802m 11.87%↓ 17.18 7.53%↓ 807m 11.90%↓ 19.47 4.93%↓
30% 734m 19.34%↓ 16.75 9.85%↓ 740m 19.21%↓ 19.07 6.88%↓
40% 684m 24.84%↓ 16.36 11.95%↓ 689m 24.78%↓ 18.36 10.35%↓
50% 621m 31.76%↓ 15.06 18.95%↓ 627m 31.55%↓ 17.30 15.53%↓
Reducing the input code significantly cuts down training time.
RQ-2: What is the impact of removing lexical tokens on
the performance of PLM?
Code search. Method | Ratio | CodeBERT: Time R-Time MRR R-MRR | CodeT5: Time R-Time MRR R-MRR
Base 0.00% 433m 0.00%– 0.743 0.00%– 434m 0.00%– 0.749 0.00%–
Identifiers 15.48% 367m 15.24%↓ 0.650 12.52%↓ 375m 13.59%↓ 0.683 8.81%↓
Symbol tokens 51.38% 215m 50.35%↓ 0.722 2.83%↓ 193m 55.53%↓ 0.729 2.67%↓
Code summarization. Method | Ratio | CodeBERT: Time R-Time BLEU-4 R-BLEU | CodeT5: Time R-Time BLEU-4 R-BLEU
Base 0.00% 910m 0.00%– 18.58 0.00%– 916m 0.00%– 20.49 0.00%–
Identifiers 15.69% 829m 8.90%↓ 17.87 3.83% ↓ 828m 9.61% ↓ 19.49 4.88% ↓
Symbol tokens 52.31% 606m 33.41%↓ 18.47 0.59% ↓ 628m 31.44% ↓ 20.34 0.73% ↓
Identifiers > Symbol tokens
RQ-3: What is the impact of removing syntactical tokens
on the performance of PLM?
Code search. Method | Ratio | CodeBERT: Time R-Time MRR R-MRR | CodeT5: Time R-Time MRR R-MRR
Base 0.00% 433m 0.00%– 0.743 0.00%– 434m 0.00%– 0.749 0.00%–
Control structures 11.90% 381m 12.01%↓ 0.715 3.77%↓ 386m 11.06%↓ 0.730 2.54%↓
Method invocations 37.47% 266m 38.57%↓ 0.682 8.21%↓ 268m 38.25%↓ 0.698 6.81%↓
Method signature 15.96% 354m 18.24%↓ 0.649 12.65%↓ 354m 18.43%↓ 0.680 9.21%↓
Code summarization. Method | Ratio | CodeBERT: Time R-Time BLEU-4 R-BLEU | CodeT5: Time R-Time BLEU-4 R-BLEU
Base 0.00% 910m 0.00%– 18.58 0.00%– 916m 0.00%– 20.49 0.00%–
Control structures 16.48% 820m 9.89%↓ 18.57 0.05% ↓ 843m 7.97% ↓ 20.38 0.54% ↓
Method invocations 35.48% 692m 23.96%↓ 18.17 2.21% ↓ 703m 23.25% ↓ 20.31 0.88% ↓
Method signature 11.36% 813m 10.66%↓ 15.86 14.64% ↓ 817m 10.81% ↓ 16.61 18.94% ↓
Method signature > Control structure ≈ Method invocations
RQ-4: What is the impact of removing semantic tokens on
the performance of PLM?
Code search. Method | Ratio | CodeBERT: Time R-Time MRR R-MRR | CodeT5: Time R-Time MRR R-MRR
Base 0.00% 433m 0.00%– 0.743 0.00%– 434m 0.00%– 0.749 0.00%–
PDG 24.84% 332m 23.33%↓ 0.713 4.04%↓ 328m 24.42%↓ 0.729 2.67%↓
Code summarization. Method | Ratio | CodeBERT: Time R-Time BLEU-4 R-BLEU | CodeT5: Time R-Time BLEU-4 R-BLEU
Base 0.00% 910m 0.00%– 18.58 0.00%– 916m 0.00%– 20.49 0.00%–
PDG 23.62% 750m 17.58%↓ 18.46 0.65% ↓ 766m 16.38% ↓ 20.39 0.49% ↓
Tokens that are not in the PDG have little impact on downstream tasks
because they carry little semantic information.
SlimCode: Model-Agnostic Code Simplification for PLM
Importance levels of tokens
method signatures > identifiers > control structures ≈ method
invocations > symbol tokens
The principle of our model-agnostic code simplification technique is to
remove tokens with lower importance levels before those with higher levels;
the sketch below shows one way to encode this ordering as ranking scores.
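One way to turn the importance ordering above into the ranking scores V consumed by Algorithm 1 on the next slide (the relative order follows the empirical study; the concrete numbers are placeholders, not the paper's values):

REMOVAL_PRIORITY = {              # lower score = removed earlier
    "symbol token": 0,
    "method invocation": 1,
    "control structure": 1,       # roughly tied with invocations
    "identifier": 2,
    "method signature": 3,
}

def ranking_scores(token_categories):
    # Map each token's category to its importance score (unknown categories get a middle value).
    return [REMOVAL_PRIORITY.get(cat, 1) for cat in token_categories]

print(ranking_scores(["symbol token", "identifier", "method signature"]))  # [0, 2, 3]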
SlimCode: Model-Agnostic Code Simplification for PLMs
Algorithm 1: Code Simplification Algorithm
INPUT: D = {d1, ..., dm}, ranking scores V, SimplifiedRatio, and the original
input length L (nj denotes the number of tokens in dj)
OUTPUT: a simplified code dataset D′
PROCEDURE:
1: Initialize D′, a copy of D
2: for j from 1 to m do
3:   Initialize an empty dictionary removedTokens of positions and their tokens
4:   if nj > (1 − SimplifiedRatio) × L then
5:     W ← nj − (1 − SimplifiedRatio) × L
6:     currentWeight ← 0
7:     while currentWeight < W do
8:       Add {index: token with the lowest v} (∈ d′j, ∉ removedTokens) into removedTokens
9:       currentWeight ← sizeof(removedTokens)
10:    d′j ← d′j \ removedTokens[1 : W]
11: return D′
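A direct Python rendering of Algorithm 1 (a sketch under the assumption that each sample is a token list and scores holds one ranking score per token; the paper's actual data structures may differ):

def simplify_dataset(dataset, scores, simplified_ratio, max_len):
    simplified = []
    for tokens, token_scores in zip(dataset, scores):
        budget = (1 - simplified_ratio) * max_len      # allowed length after simplification
        if len(tokens) <= budget:
            simplified.append(list(tokens))            # already short enough: keep unchanged
            continue
        n_remove = int(round(len(tokens) - budget))    # W in Algorithm 1
        # Drop the n_remove tokens with the lowest ranking scores (most expendable first).
        removed = set(sorted(range(len(tokens)), key=lambda i: token_scores[i])[:n_remove])
        simplified.append([t for i, t in enumerate(tokens) if i not in removed])
    return simplified

# Example: with SimplifiedRatio = 0.5 and L = 8, the four lowest-scored tokens are dropped.
print(simplify_dataset([["int", "x", "=", "0", ";", "return", "x", ";"]],
                       [[3, 2, 0, 2, 0, 1, 2, 0]], 0.5, 8))  # [['int', 'x', '0', 'x']]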
An Example of SlimCode
1 a-b: remove simple symbol tokens.
2 b-c: remove moderate-impact tokens.
The Results of Comparing SlimCode with DietCode
SlimCode. Ratio | Code search: CodeBERT MRR R-M, CodeT5 MRR R-M, Pruning Time, Speedup (× over DietCode) | Code summarization: CodeBERT BLEU R-B, CodeT5 BLEU R-B, Pruning Time, Speedup
Base 0.743 0.00% – 0.754 0.00%– N/A —— 18.58 0.00%– 20.49 0.00%– N/A ——
10% 0.740 0.40% ↓ 0.749 0.66%↓ 17m 32.2 18.56 0.11%↓ 20.44 0.24%↓ 45s 133.1
20% 0.721 2.96% ↓ 0.748 0.80%↓ 17m 28.9 18.29 1.56%↓ 20.38 0.54%↓ 53s 101.4
30% 0.734 1.21% ↓ 0.751 0.40%↓ 20m 21.9 18.62 0.22%↑ 20.49 0.00%↓ 59s 80.3
40% 0.733 1.35% ↓ 0.757 0.40%↑ 21m 18.3 18.41 0.91%↓ 20.52 0.15%↑ 66s 64.3
50% 0.719 3.23% ↓ 0.745 1.19%↓ 21m 18.3 18.63 0.27%↑ 20.23 1.27%↓ 69s 53.2
DietCode. Ratio | Code search: CodeBERT MRR R-M, CodeT5 MRR R-M, Pruning Time | Code summarization: CodeBERT BLEU R-B, CodeT5 BLEU R-B, Pruning Time
Base 0.743 0%– 0.754 0.00%– N/A 18.58 0%– 20.49 0.00%– N/A
10% 0.702 5.52%↓ 0.730 3.18%↓ 9h24m 17.68 4.84%↓ 19.68 3.90%↓ 1h40m
20% 0.686 7.67%↓ 0.718 4.77%↓ 8h28m 17.94 3.44%↓ 19.77 3.51%↓ 1h30m
30% 0.693 6.73%↓ 0.714 5.31%↓ 7h37m 17.73 4.57%↓ 19.68 3.95%↓ 1h19m
40% 0.679 8.61%↓ 0.707 6.21%↓ 6h45m 17.53 5.65%↓ 19.42 5.22%↓ 1h11m
50% 0.651 12.38%↓ 0.676 10.34%↓ 5h59m 17.67 4.90%↓ 19.22 6.20%↓ 1h02m
Overall, SlimCode outperforms the state-of-the-art approach DietCode and is
more efficient in both code search and code summarization.
SlimCode vs. DietCode
Compared with DietCode, SlimCode removes far fewer tokens from
method signatures and identifiers.
The Results of Comparing SlimCode with DietCode
Code search (GPT-4). Removal method | IT R-IT | OT R-OT | TT R-TT | Accuracy R-P   (IT/OT/TT = input/output/total tokens; R-* = relative change)
Base 69820 0.00% 44584 0.00% 114404 0.00%– 0.85 0.00%
SlimCode(10%) 64776 7.22%↓ 40988 8.06%↓ 105769 7.55%↓ 0.78 8.24%↓
SlimCode(20%) 59941 14.15%↓ 39500 11.40%↓ 99441 13.08%↓ 0.71 16.47%↓
SlimCode(30%) 55036 21.17%↓ 38224 14.27%↓ 93260 18.48%↓ 0.74 12.94%↓
SlimCode(40%) 50224 28.06%↓ 38616 13.39%↓ 88840 22.34%↓ 0.74 12.94%↓
SlimCode(50%) 45388 34.99%↓ 37879 15.05%↓ 83267 27.22%↓ 0.87 2.35%↑
Code summarization (GPT-4). Removal method | IT R-IT | OT R-OT | TT R-TT | BLEU-4 R-B
Base 48037 0.00%- 15348 0.00%- 63385 0.00%- 5.50 0.00%-
SlimCode(10%) 44005 8.39%↓ 15868 3.62%↑ 59873 5.48%↓ 5.42 1.45%↓
SlimCode(20%) 40188 16.34%↓ 16544 7.79%↑ 56732 10.49%↓ 5.40 1.82%↓
SlimCode(30%) 36321 24.39%↓ 16092 4.85%↑ 52413 17.31%↓ 5.50 0.00%-
SlimCode(40%) 32492 32.36%↓ 16593 8.11%↑ 49085 22.56%↓ 5.20 5.45%↓
SlimCode(50%) 28724 40.20%↓ 15848 3.26%↑ 44572 29.68%↓ 5.51 2.04%↑
According to OpenAI’s pricing policy [2], the cost is proportional to the
number of tokens. For example, the total cost of the 400 code-search samples
is just 9.54 dollars; after SlimCode removes 50% of the input tokens, the cost
drops to 7.27 dollars, saving about 24%. The arithmetic is sketched below.
Conclusion
SlimCode: a model-agnostic code simplification method for PLMs
1 An empirical analysis of which information is critical to PLMs.
2 A program simplification approach for PLMs with the advantages of
generality and efficiency.
3 Cost savings in API usage of PLMs.
Future Work
1 More models.
2 More tasks.
References
[1] Zhaowei Zhang, Hongyu Zhang, Beijun Shen, Xiaodong Gu. 2022. Diet Code Is
Healthy: Simplifying Programs for Pre-trained Models of Code. In ESEC/FSE 2022.
[2] OpenAI Pricing. [n. d.]. https://openai.com/pricing.
THANK YOU!
Q & A