Practical LLM inference
in modern Java
Alina Yurenko, Alfonso² Peterssen
Oracle Labs
Who we are
• Alina Yurenko, developer advocate @ GraalVM
• Alfonso² Peterssen, Java on Java @ GraalVM
alina-yur mukel
Questions, comments, pull requests? :)
What we will cover today
• Implementing a fast LLM inference engine in modern Java
• Running such a Java LLM engine locally
• Optimizing for the best CPU-based performance
• Performance optimizations: Java Vector API, GraalVM, AOT techniques
• Integrating with LangChain4j
Once upon a tweet...
Llama3.java
Based on llama2.c by Andrej Karpathy and his excellent educational videos, and llama.cpp
github.com/mukel/llama3.java
(Not so) Large Language Models
Model sizes: < 4B, 4B+, 16B+
Large Language Models are everywhere
(Not so) Large Language Models
• Meta Llama 3+ (1B & 3B & 8B)
• Mistral (7B)
• Microsoft Phi-3 (3.5B & 7B)
• Google Gemma 2 (2.6B & 9B)
• Alibaba Qwen2.5 (0.5B & 1.5B & 3B & 7B & 14B)
• … or fine-tune your own!
(Not so) Large Language Models are everywhere
Microsoft shipping RWKV.cpp
(Not so) Large Language Models are everywhere
Apple On-Device model (~3B)
(Not so) Large Language Models are everywhere
Gemini Nano, a powerful 3.25B parameter LLM, 100% locally in your browser!
(Not so) Large Language Models are everywhere
JetBrains Line Completion (100M)
(Not so) Large Language Models are everywhere
DevoxxGenie IDEA Plugin
Demo: Llama3.java
Local LLM inference
Cost
Privacy
Control
Local LLM inference in Java
No native dependencies
Cost
Privacy
Control
Developer productivity
Performance
LLM engines
[Token][ize][r]
Prompt Format
Inference
Sampler
GGUF File Parser (header-reading sketch below)
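To make the GGUF parser component concrete, here is a minimal sketch of reading the fixed-size GGUF header (field layout per the public GGUF spec; illustrative only, not the project's actual parser):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class GGUFHeader {
    static void read(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // Header: magic (u32) + version (u32) + tensor count (u64) + metadata kv count (u64)
            ByteBuffer buf = ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN);
            ch.read(buf);
            buf.flip();
            int magic = buf.getInt();             // 'GGUF' little-endian = 0x46554747
            if (magic != 0x46554747) throw new IOException("not a GGUF file");
            int version = buf.getInt();           // GGUF format version (2 or 3)
            long tensorCount = buf.getLong();     // number of tensors in the file
            long metadataKvCount = buf.getLong(); // number of metadata key/value pairs
            System.out.printf("GGUF v%d: %d tensors, %d metadata entries%n",
                    version, tensorCount, metadataKvCount);
        }
    }
}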
Transformer architecture implemented in Java
*Low Level Technicals of LLMs: Daniel Han
abstract class FloatTensor {
    abstract float getFloat(int index);
    abstract void setFloat(int index, float value);
    // ...
    void matmul(FloatTensor that, FloatTensor out, int dim0, int dim1) {
        for (int i = 0; i < dim0; ++i) {
            float result = 0f;
            for (int j = 0; j < dim1; j++) {
                result += this.getFloat(i * dim1 + j) * that.getFloat(j);
            }
            out.setFloat(i, result);
        }
    }
    // ...
}
// qkv matmuls for this position
weights.wq[l].matmul(state.xb, state.q, dim, dim);
weights.wk[l].matmul(state.xb, state.k, kvDim, dim);
weights.wv[l].matmul(state.xb, state.v, kvDim, dim);
Transformer architecture implemented in Java
*Low Level Technicals of LLMs: Daniel Han
// RoPE relative positional encoding: complex-valued rotate q and k in each head
for (int i = 0; i < dim; i += 2) {
    int head_dim = i % headSize;
    float fcr = weights.freq_cis_real.get(position * (headSize / 2) + (head_dim / 2));
    float fci = weights.freq_cis_imag.get(position * (headSize / 2) + (head_dim / 2));
    int rotn = i < kvDim ? 2 : 1; // how many vectors? 2 = q & k, 1 = q only
    for (int v = 0; v < rotn; v++) {
        FloatTensor vec = v == 0 ? state.q : state.k; // the vector to rotate (query or key)
        float v0 = vec.getFloat(i);
        float v1 = vec.getFloat(i + 1);
        vec.setFloat(i, v0 * fcr - v1 * fci);
        vec.setFloat(i + 1, v0 * fci + v1 * fcr);
    }
}
Transformer architecture implemented in Java
*Low Level Technicals of LLMs: Daniel Han
The model outputs NOT a single token, but a vector of probabilities over the entire vocabulary
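The sampler turns that probability vector into a concrete next token. A minimal temperature-based sampler sketch (illustrative only, not the project's exact Sampler; method and parameter names are ours):

import java.util.random.RandomGenerator;

// Sample one token index from raw logits at a given temperature.
static int sampleToken(float[] logits, float temperature, RandomGenerator rng) {
    // Stable softmax over temperature-scaled logits: lower temperature = more deterministic.
    float max = Float.NEGATIVE_INFINITY;
    for (float l : logits) max = Math.max(max, l);
    double sum = 0;
    double[] probs = new double[logits.length];
    for (int i = 0; i < logits.length; i++) {
        probs[i] = Math.exp((logits[i] - max) / temperature);
        sum += probs[i];
    }
    // Draw one index from the resulting categorical distribution.
    double r = rng.nextDouble() * sum;
    double acc = 0;
    for (int i = 0; i < probs.length; i++) {
        acc += probs[i];
        if (r <= acc) return i;
    }
    return probs.length - 1; // numerical fallback
}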
Anatomy of LLM weights
• Mostly matrices, quantized …
• During inference, every weight is read exactly once
https://bbycroft.net/llm
Is LLM inference the new Bitcoin mining?
What is the limiting factor of inference performance?
Memory bandwidth
*and the ability to fully utilize it
Project Panama
JEP 489: Vector API (Ninth Incubator)
Introduce an API to express vector computations that reliably compile at runtime to optimal vector instructions on supported CPU architectures, thus achieving performance superior to equivalent scalar computations.
Memory bandwidth
Across hardware, it spans orders of magnitude: ~50 GB/s, ~400 GB/s, ~1 TB/s, ~10 TB/s
AI and Memory Wall
• Inference is memory bound
• 90% of inference is spent on matrix × vector operations
"AI and Memory Wall"
void matrixVectorMul(FloatTensor m, FloatTensor v, FloatTensor out, int dim0, int dim1) {
    for (int i = 0; i < dim0; ++i) {
        float result = 0f;
        for (int j = 0; j < dim1; j++) {
            result += m.getFloat(i * dim1 + j) * v.getFloat(j);
        }
        out.setFloat(i, result);
    }
}
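This scalar inner loop is exactly where the Vector API pays off. A sketch of the same dot product vectorized with jdk.incubator.vector (illustrative; the real llama3.java kernels also handle quantized blocks, and this needs --add-modules jdk.incubator.vector):

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

// Dot product of one matrix row with v, SPECIES.length() lanes at a time.
static float dot(float[] row, float[] v) {
    var acc = FloatVector.zero(SPECIES);
    int i = 0;
    int upper = SPECIES.loopBound(v.length);
    for (; i < upper; i += SPECIES.length()) {
        var a = FloatVector.fromArray(SPECIES, row, i);
        var b = FloatVector.fromArray(SPECIES, v, i);
        acc = a.fma(b, acc); // acc += a * b, fused multiply-add per lane
    }
    float result = acc.reduceLanes(VectorOperators.ADD);
    for (; i < v.length; i++) result += row[i] * v[i]; // scalar tail
    return result;
}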
Quantizations
Analogy from images: Grayscale, 16 Colors, 256 Colors, 24-bit RGB
Smaller weights, lower accuracy, faster inference
Lossy Quantization
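As an illustration of the idea, a Q8_0-style block stores 32 int8 weights plus one shared scale, so dequantization is a single multiply per weight (a sketch; real GGUF Q8_0 stores the scale as fp16):

static final int BLOCK_SIZE = 32;

// Dequantize one Q8_0-style block: each int8 weight is scaled by the block's shared scale.
static void dequantizeBlock(byte[] quants, float scale, float[] out, int outOffset) {
    for (int j = 0; j < BLOCK_SIZE; j++) {
        out[outOffset + j] = quants[j] * scale; // w ≈ q * s, lossy but compact
    }
}

Per weight this is ~1.06 bytes (1 byte + the scale amortized over 32 weights) instead of 4 bytes for fp32.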
Theoretical throughput (tokens/s) of Llama 3.1 8B @ 50 GB/s
Approximate (!) comfortable human perception rate
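A back-of-the-envelope check (our estimate, from the numbers above): since every weight is read once per generated token, tokens/s ≈ memory bandwidth ÷ model size in bytes. Llama 3.1 8B at FP16 weighs ~16 GB, so 50 GB/s gives ~3 tokens/s; quantized to ~4 bits (~4.5 GB), the same bandwidth yields ~11 tokens/s, right around a comfortable human reading rate.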
LLM Inference Engine in Java
Why Java?
• High performance
• Modern APIs, such as Vector API and FFM API
• Rich ecosystem
• More control over the models and the inference process (caching, etc.)
🤝 Latest Java APIs
Foreign Function and Memory API
• Interface for interop between Java code and native code
• Just works™ on Graal JIT
• Experimental support in Native Image with -H:+ForeignAPISupport (upcalls, downcalls, foreign memory)
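As a taste of the API, a minimal FFM downcall invoking C's strlen with no JNI glue (a sketch; requires JDK 22+, where FFM is final):

import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class StrlenDemo {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        // Bind C's strlen: long strlen(const char*)
        MethodHandle strlen = linker.downcallHandle(
                linker.defaultLookup().find("strlen").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cString = arena.allocateFrom("mmap-friendly"); // NUL-terminated C string
            System.out.println((long) strlen.invokeExact(cString));      // prints 13
        }
    }
}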
Vector API
• Enables fast vector computations
• Initial support in GraalVM for JDK 21, more coming in GraalVM for JDK 24
• Works with Native Image!
Thank you Gergö! 🏆
github.com/gergo-
Faster LLM inference on GraalVM 🚀
• Local LLM inference, powered by modern Java APIs and GraalVM's performance optimizations
• ~15% faster inference on Oracle GraalVM (across several models, such as Llama 3+) 🔥
• Updates in GraalVM for JDK 24 – get the latest EA build (JDK 24 EA build 15+):
• sdk install java 24.ea.15-graal
Faster LLM inference on GraalVM
• AOT at the speed of JIT 🚀
• Inference code is AOT-friendly
• Combining the power of Java with native performance: you can pre-parse GGUF metadata and cache prompts at build time for instant inference
• <25ms time to first token
Native Image
Demo – Native llama3 & AOT optimizations
Demo – LangChain4j Integration
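Conceptually, the integration amounts to implementing LangChain4j's chat-model interface on top of the local engine. A rough sketch, assuming the pre-1.0 LangChain4j API; Llama3Engine and PromptFormat are hypothetical placeholders for the demo's actual classes:

import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.output.Response;
import java.util.List;

// Adapter from the local engine to LangChain4j's ChatLanguageModel (sketch).
class Llama3ChatModel implements ChatLanguageModel {
    private final Llama3Engine engine = new Llama3Engine("Llama-3.2-1B-Instruct-Q8_0.gguf"); // hypothetical

    @Override
    public Response<AiMessage> generate(List<ChatMessage> messages) {
        String prompt = PromptFormat.render(messages); // hypothetical prompt formatter
        String completion = engine.generate(prompt);   // hypothetical local inference call
        return Response.from(AiMessage.from(completion));
    }
}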
• Performance of llama3.java running on GraalVM Native Image is comparable to llama.cpp, and gets even closer as the model size increases. More optimizations coming soon :)
• Different approaches: idiomatic Java vs hand-tuned tensor operations
• Free performance boost: quantization helps, and larger models are resilient to it
Benchmark setup: llama3 native image, Oracle GraalVM 24 EA 15, Linux, Ryzen 3950X, 64 GB RAM @ 3800 MT/s
Other models?
• Meta Llama 3+ (1B & 3B & 8B)
… but also 30+ other models, such as:
• Mistral (7B)
• Microsoft Phi-3 (3.5B & 7B)
• Google Gemma 2 (2.6B & 9B)
• Alibaba Qwen2.5 (0.5B & 1.5B & 3B & 7B & 14B)
• … and specialized models for math, programming, and more
Llama3.java can be easily adapted to different models and vendors
huggingface.co/mukel
LLM Inference Engine
• Performance testing and tuning (ARM, Apple Silicon, AVX512...)
• Implement additional features (prompt caching, quantizations, YaRN...)
• Support for GPUs (HAT, TornadoVM?)
• Further integrations with Java libraries and frameworks (such as LangChain4j)
• Audio
• Vision
Help wanted 🙋
Wrap up
• FAST LLM inference in pure, modern Java
• No dependencies (73kB jar file)
• GraalVM makes it even faster 🚀
• Native Image support with AOT model pre-loading, for instant time-to-first-token
• Simple, accessible, and with high educational value for learning about LLMs
• Use it as a lightweight local inference engine or as inspiration for your next Java project
• Fun to hack and fun to use!
github.com/mukel/llama3.java
Please rate this session :)