Practical LLM inference
in modern Java
Alina Yurenko, Alfonso² Peterssen
Oracle Labs
Who we are
• Alina Yurenko, developer advocate @ GraalVM
• Alfonso² Peterssen, Java on Java @ GraalVM
alina-yur mukel
Questions, comments, pull requests? :)
What we will cover today
• Implementing a fast LLM inference engine in modern Java
• Running such a Java LLM engine locally
• Optimizing for the best CPU-based performance
• Performance optimizations: Java Vector API, GraalVM, AOT techniques
• Integrating with LangChain4j
Once upon a tweet...
Llama3.java
Based on llama2.c by Andrej Karpathy and his excellent educational videos, and llama.cpp
github.com/mukel/llama3.java
(Not so) Large Language Models
Model sizes: < 4B, 4B+, 16B+
Large Language Models are everywhere
(Not so) Large Language Models
• Meta Llama 3+ (1B & 3B & 8B)
• Mistral (7B)
• Microsoft Phi-3 (3.5B & 7B)
• Google Gemma 2 (2.6B & 9B)
• Alibaba Qwen2.5 (0.5B & 1.5B & 3B & 7B & 14B)
• … or fine-tune your own!
(Not so) Large Language Models are everywhere
Microsoft shipping RWKV.cpp
(Not so) Large Language Models are everywhere
Apple On-Device model (~3B)
(Not so) Large Language Models are everywhere
Gemini Nano, a powerful 3.25B parameter LLM, 100% locally in your browser!
(Not so) Large Language Models are everywhere
JetBrains Line Completion (100M)
(Not so) Large Language Models are everywhere
DevoxxGenie IDEA Plugin
Demo: Llama3.java
Local LLM inference
Cost
Privacy
Control
Local LLM inference in Java
No native dependencies
Cost
Privacy
Control
Developer productivity
Performance
LLM engines
[Token][ize][r]
Prompt Format
Inference
Sampler
GGUF File Parser (header-reading sketch below)
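To make the GGUF parser component concrete, here is a minimal sketch of reading the fixed-size GGUF header (field layout per the public GGUF spec; illustrative only, not the project's actual parser):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class GGUFHeader {
    static void read(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // Header: magic (u32) + version (u32) + tensor count (u64) + metadata kv count (u64)
            ByteBuffer buf = ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN);
            ch.read(buf);
            buf.flip();
            int magic = buf.getInt();             // 'GGUF' little-endian = 0x46554747
            if (magic != 0x46554747) throw new IOException("not a GGUF file");
            int version = buf.getInt();           // GGUF format version (2 or 3)
            long tensorCount = buf.getLong();     // number of tensors in the file
            long metadataKvCount = buf.getLong(); // number of metadata key/value pairs
            System.out.printf("GGUF v%d: %d tensors, %d metadata entries%n",
                    version, tensorCount, metadataKvCount);
        }
    }
}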
Transformer architecture implemented in Java
*Low Level Technicals of LLMs: Daniel Han
abstract class FloatTensor {
    abstract float getFloat(int index);
    abstract void setFloat(int index, float value);
    // ...
    void matmul(FloatTensor that, FloatTensor out, int dim0, int dim1) {
        for (int i = 0; i < dim0; ++i) {
            float result = 0f;
            for (int j = 0; j < dim1; j++) {
                result += this.getFloat(i * dim1 + j) * that.getFloat(j);
            }
            out.setFloat(i, result);
        }
    }
    // ...
}
// qkv matmuls for this position
weights.wq[l].matmul(state.xb, state.q, dim, dim);
weights.wk[l].matmul(state.xb, state.k, kvDim, dim);
weights.wv[l].matmul(state.xb, state.v, kvDim, dim);
Transformer architecture implemented in Java
*Low Level Technicals of LLMs: Daniel Han
// RoPE relative positional encoding: complex-valued rotate q and k in each head
for (int i = 0; i < dim; i += 2) {
    int head_dim = i % headSize;
    float fcr = weights.freq_cis_real.get(position * (headSize / 2) + (head_dim / 2));
    float fci = weights.freq_cis_imag.get(position * (headSize / 2) + (head_dim / 2));
    int rotn = i < kvDim ? 2 : 1; // how many vectors? 2 = q & k, 1 = q only
    for (int v = 0; v < rotn; v++) {
        FloatTensor vec = v == 0 ? state.q : state.k; // the vector to rotate (query or key)
        float v0 = vec.getFloat(i);
        float v1 = vec.getFloat(i + 1);
        vec.setFloat(i, v0 * fcr - v1 * fci);
        vec.setFloat(i + 1, v0 * fci + v1 * fcr);
    }
}
Transformer architecture implemented in Java
*Low Level Technicals of LLMs: Daniel Han
The model outputs NOT a single token, but a vector of probabilities over the entire vocabulary
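The sampler turns that probability vector into a concrete next token. A minimal temperature-based sampler sketch (illustrative only, not the project's exact Sampler; method and parameter names are ours):

import java.util.random.RandomGenerator;

// Sample one token index from raw logits at a given temperature.
static int sampleToken(float[] logits, float temperature, RandomGenerator rng) {
    // Stable softmax over temperature-scaled logits: lower temperature = more deterministic.
    float max = Float.NEGATIVE_INFINITY;
    for (float l : logits) max = Math.max(max, l);
    double sum = 0;
    double[] probs = new double[logits.length];
    for (int i = 0; i < logits.length; i++) {
        probs[i] = Math.exp((logits[i] - max) / temperature);
        sum += probs[i];
    }
    // Draw one index from the resulting categorical distribution.
    double r = rng.nextDouble() * sum;
    double acc = 0;
    for (int i = 0; i < probs.length; i++) {
        acc += probs[i];
        if (r <= acc) return i;
    }
    return probs.length - 1; // numerical fallback
}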
Anatomy of LLM weights
• Mostly matrices, quantized …
• During inference, every weight is read exactly once
https://bbycroft.net/llm
Is LLM inference the new Bitcoin mining?
What is the limiting factor of inference performance?
Memory bandwidth
*and the ability to fully utilize it
Project Panama
JEP 489: Vector API (Ninth Incubator)
Introduce an API to express vector computations that reliably compile at runtime to optimal vector instructions on supported CPU architectures, thus achieving performance superior to equivalent scalar computations.
Memory bandwidth
Across hardware, it spans orders of magnitude: ~50 GB/s, ~400 GB/s, ~1 TB/s, ~10 TB/s
AI and Memory Wall
• Inference is memory bound
• 90% of inference is spent on matrix × vector operations
"AI and Memory Wall"
void matrixVectorMul(FloatTensor m, FloatTensor v, FloatTensor out, int dim0, int dim1) {
    for (int i = 0; i < dim0; ++i) {
        float result = 0f;
        for (int j = 0; j < dim1; j++) {
            result += m.getFloat(i * dim1 + j) * v.getFloat(j);
        }
        out.setFloat(i, result);
    }
}
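This scalar inner loop is exactly where the Vector API pays off. A sketch of the same dot product vectorized with jdk.incubator.vector (illustrative; the real llama3.java kernels also handle quantized blocks, and this needs --add-modules jdk.incubator.vector):

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

// Dot product of one matrix row with v, SPECIES.length() lanes at a time.
static float dot(float[] row, float[] v) {
    var acc = FloatVector.zero(SPECIES);
    int i = 0;
    int upper = SPECIES.loopBound(v.length);
    for (; i < upper; i += SPECIES.length()) {
        var a = FloatVector.fromArray(SPECIES, row, i);
        var b = FloatVector.fromArray(SPECIES, v, i);
        acc = a.fma(b, acc); // acc += a * b, fused multiply-add per lane
    }
    float result = acc.reduceLanes(VectorOperators.ADD);
    for (; i < v.length; i++) result += row[i] * v[i]; // scalar tail
    return result;
}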
Quantizations
Analogy from images: Grayscale, 16 Colors, 256 Colors, 24-bit RGB
Smaller weights, lower accuracy, faster inference
Lossy Quantization
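As an illustration of the idea, a Q8_0-style block stores 32 int8 weights plus one shared scale, so dequantization is a single multiply per weight (a sketch; real GGUF Q8_0 stores the scale as fp16):

static final int BLOCK_SIZE = 32;

// Dequantize one Q8_0-style block: each int8 weight is scaled by the block's shared scale.
static void dequantizeBlock(byte[] quants, float scale, float[] out, int outOffset) {
    for (int j = 0; j < BLOCK_SIZE; j++) {
        out[outOffset + j] = quants[j] * scale; // w ≈ q * s, lossy but compact
    }
}

Per weight this is ~1.06 bytes (1 byte + the scale amortized over 32 weights) instead of 4 bytes for fp32.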
Theoretical throughput (tokens/s) of Llama 3.1 8B @ 50 GB/s
Approximate (!) comfortable human perception rate
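A back-of-the-envelope check (our estimate, from the numbers above): since every weight is read once per generated token, tokens/s ≈ memory bandwidth ÷ model size in bytes. Llama 3.1 8B at FP16 weighs ~16 GB, so 50 GB/s gives ~3 tokens/s; quantized to ~4 bits (~4.5 GB), the same bandwidth yields ~11 tokens/s, right around a comfortable human reading rate.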
LLM Inference Engine in Java
Why Java?
• High performance
• Modern APIs, such as Vector API and FFM API
• Rich ecosystem
• More control over the models and the inference process (caching, etc.)
🤝 Latest Java APIs
Foreign Function and Memory API
• Interface for interop between Java code and native code
• Just works™ on Graal JIT
• Experimental support in Native Image with -H:+ForeignAPISupport (upcalls, downcalls, foreign memory)
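As a taste of the API, a minimal FFM downcall invoking C's strlen with no JNI glue (a sketch; requires JDK 22+, where FFM is final):

import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class StrlenDemo {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        // Bind C's strlen: long strlen(const char*)
        MethodHandle strlen = linker.downcallHandle(
                linker.defaultLookup().find("strlen").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cString = arena.allocateFrom("mmap-friendly"); // NUL-terminated C string
            System.out.println((long) strlen.invokeExact(cString));      // prints 13
        }
    }
}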
Vector API
• Enables fast vector computations
• Initial support in GraalVM for JDK 21, more coming in GraalVM for JDK 24
• Works with Native Image!
Thank you Gergö! 🏆
github.com/gergo-
Faster LLM inference on GraalVM 🚀
• Local LLM inference, powered by modern Java APIs and GraalVM's performance optimizations
• ~15% faster inference on Oracle GraalVM (across several models, such as Llama 3+) 🔥
• Updates in GraalVM for JDK 24 – get the latest EA build (JDK 24 EA build 15+):
• sdk install java 24.ea.15-graal
Faster LLM inference on GraalVM
• AOT at the speed of JIT 🚀
• Inference code is AOT-friendly
• Combining the power of Java with native performance: you can pre-parse GGUF metadata and cache prompts at build time for instant inference
• <25ms time to first token
Native Image
Demo – Native llama3 & AOT optimizations
Demo – LangChain4j Integration
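Conceptually, the integration amounts to implementing LangChain4j's chat-model interface on top of the local engine. A rough sketch, assuming the pre-1.0 LangChain4j API; Llama3Engine and PromptFormat are hypothetical placeholders for the demo's actual classes:

import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.output.Response;
import java.util.List;

// Adapter from the local engine to LangChain4j's ChatLanguageModel (sketch).
class Llama3ChatModel implements ChatLanguageModel {
    private final Llama3Engine engine = new Llama3Engine("Llama-3.2-1B-Instruct-Q8_0.gguf"); // hypothetical

    @Override
    public Response<AiMessage> generate(List<ChatMessage> messages) {
        String prompt = PromptFormat.render(messages); // hypothetical prompt formatter
        String completion = engine.generate(prompt);   // hypothetical local inference call
        return Response.from(AiMessage.from(completion));
    }
}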
• Performance of llama3.java running on GraalVM Native Image is comparable to llama.cpp, and gets even closer as the model size increases. More optimizations coming soon :)
• Different approaches: idiomatic Java vs hand-tuned tensor operations
• Free performance boost: quantization helps, and larger models are resilient to it
Benchmark setup: llama3 native image, Oracle GraalVM 24 EA 15, Linux, Ryzen 3950X, 64 GB RAM @ 3800 MT/s
Other models?
• Meta Llama 3+ (1B & 3B & 8B)
… but also 30+ other models, such as:
• Mistral (7B)
• Microsoft Phi-3 (3.5B & 7B)
• Google Gemma 2 (2.6B & 9B)
• Alibaba Qwen2.5 (0.5B & 1.5B & 3B & 7B & 14B)
• … and specialized models for math, programming, and more
Llama3.java can be easily adapted to different models and vendors
huggingface.co/mukel
LLM Inference Engine
• Performance testing and tuning (ARM, Apple Silicon, AVX512...)
• Implement additional features (prompt caching, quantizations, YaRN...)
• Support for GPUs (HAT, TornadoVM?)
• Further integrations with Java libraries and frameworks (such as LangChain4j)
• Audio
• Vision
Help wanted 🙋
Wrap up
• FAST LLM inference in pure, modern Java
• No dependencies (73kB jar file)
• GraalVM makes it even faster 🚀
• Native Image support with AOT model pre-loading, for instant time-to-first-token
• Simple, accessible, and with high educational value for learning about LLMs
• Use it as a lightweight local inference engine or as inspiration for your next Java project
• Fun to hack and fun to use!
github.com/mukel/llama3.java
Please rate this session :)