Highly Scalable Java Programming
      for Multi-Core System

        Zhi Gan (ganzhi@gmail.com)

        http://ganzhi.blogspot.com
Agenda

 • Software Challenges

 • Profiling Tools Introduction

 • Best Practice for Java Programming

 • Rocket Science: Lock-Free Programming
Software challenges
• Parallelism
   – More threads per system = more parallelism needed to achieve
     high utilization
   – Thread-to-thread affinity (shared code and/or data)

• Memory management
   – Sharing of cache and memory bandwidth across more threads =
     greater need for memory efficiency
   – Thread-to-memory affinity (execute thread closest to associated
     data)

• Storage management
   – Allocate data across DRAM, Disk & Flash according to access
     frequency and patterns
Typical Scalability Curve
The First Step: Profiling Parallel Applications
Important Profiling Tools
• Java Lock Monitor (JLM)
  – helps developers understand how locks are used in their applications
  – similar tool: Java Lock Analyzer (JLA)
• Multi-core SDK (MSDK)
  – in-depth analysis of the complete execution stack
• AIX Performance Tools
  – Simple Performance Lock Analysis Tool (SPLAT)
  – XProfiler
  – prof, tprof and gprof
Tprof and VPA tool
Java Lock Monitor



• %MISS : 100 * SLOW / NONREC
• GETS : Lock Entries
• NONREC : Non Recursive Gets
• SLOW : Non Recursives that Wait
• REC : Recursive Gets
• TIER2 : SMP: total try-enter spin loop count (middle tier of 3)
• TIER3 : SMP: total yield spin loop count (outer tier of 3)
• %UTIL : 100 * Hold-Time / Total-Time
• AVER-HTM : Hold-Time / NONREC
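The derived columns above follow directly from the raw counters. The sketch below recomputes them in plain Java. The class name, method names and sample numbers are hypothetical and chosen only for illustration; they are not JLM output or part of any JLM API.

```java
// Hypothetical helper that recomputes the derived JLM columns from raw counters.
final class JlmMetrics {
    // %MISS = 100 * SLOW / NONREC
    static double missPercent(long slow, long nonRec) {
        return nonRec == 0 ? 0.0 : 100.0 * slow / nonRec;
    }

    // %UTIL = 100 * Hold-Time / Total-Time
    static double utilPercent(long holdTime, long totalTime) {
        return totalTime == 0 ? 0.0 : 100.0 * holdTime / totalTime;
    }

    // AVER-HTM = Hold-Time / NONREC
    static double averageHoldTime(long holdTime, long nonRec) {
        return nonRec == 0 ? 0.0 : (double) holdTime / nonRec;
    }

    public static void main(String[] args) {
        // Made-up sample: 1000 non-recursive gets, 250 of which had to wait.
        System.out.println(missPercent(250, 1000));      // prints 25.0
        // Made-up sample: lock held for 400 time units out of 2000.
        System.out.println(utilPercent(400, 2000));      // prints 20.0
        System.out.println(averageHoldTime(400, 1000));  // prints 0.4
    }
}
```

A high %MISS together with a high AVER-HTM suggests that threads often find the lock taken and that it is held for a long time, which is what the practices in the following slides try to reduce.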
Multi-core SDK
[Screenshots: Dead Lock View and Synchronization View]
Best Practices for Highly Scalable Java Programming
What Is Lock Contention?
[Figure from the JLM tool website]
Lock Operation Itself Is Expensive
• Locks are predominantly implemented with CAS
  (compare-and-swap) operations
• Locking itself can take up a large part of the execution time
Reduce Locking Scope
Before (the whole method is synchronized):

    public synchronized void foo1(int k) {
        String key = Integer.toString(k);
        String value = key + "value";
        if (null == key) {
            return;
        } else {
            maph.put(key, value);
        }
    }

    Execution Time: 16106 milliseconds

After (only the map update is synchronized):

    public void foo2(int k) {
        String key = Integer.toString(k);
        String value = key + "value";
        if (null == key) {
            return;
        } else {
            synchronized (this) {
                maph.put(key, value);
            }
        }
    }

    Execution Time: 12157 milliseconds

Improvement: 25%
Results from JLM report
[Screenshot: JLM report showing reduced AVER_HTM]
Lock Splitting
Before (both methods share the lock on this):

    public synchronized void addUser1(String u) {
        users.add(u);
    }

    public synchronized void addQuery1(String q) {
        queries.add(q);
    }

    Execution Time: 12981 milliseconds

After (one lock per collection):

    public void addUser2(String u) {
        synchronized (users) {
            users.add(u);
        }
    }

    public void addQuery2(String q) {
        synchronized (queries) {
            queries.add(q);
        }
    }

    Execution Time: 4797 milliseconds

Improvement: 64%
Result from JLM report
[Screenshot: JLM report showing reduced lock tries]
Lock Striping
Before (one lock for the whole array):

    public synchronized void put1(int indx, String k) {
        share[indx] = k;
    }

    Execution Time: 5536 milliseconds

After (N_LOCKS locks, each guarding a stripe of the array):

    public void put2(int indx, String k) {
        synchronized (locks[indx % N_LOCKS]) {
            share[indx] = k;
        }
    }

    Execution Time: 1857 milliseconds

Improvement: 66%
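The slide does not show how locks and share are declared. The following is a minimal, self-contained sketch of the same idea. The class name StripedArray, the stripe count of 16 and the get method are assumptions added for illustration and are not taken from the original benchmark.

```java
// Sketch of lock striping: element i is guarded by lock number (i % N_LOCKS).
final class StripedArray {
    private static final int N_LOCKS = 16;          // assumed stripe count
    private final Object[] locks = new Object[N_LOCKS];
    private final String[] share;

    StripedArray(int size) {
        share = new String[size];
        for (int i = 0; i < N_LOCKS; i++) {
            locks[i] = new Object();                // one monitor per stripe
        }
    }

    // indx must be non-negative and smaller than the array size.
    void put(int indx, String k) {
        synchronized (locks[indx % N_LOCKS]) {
            share[indx] = k;
        }
    }

    // Readers take the same stripe lock so they see the latest write.
    String get(int indx) {
        synchronized (locks[indx % N_LOCKS]) {
            return share[indx];
        }
    }
}
```

Threads that touch indexes in different stripes no longer block each other. An operation that needs the whole array would have to acquire every stripe lock, which is the usual cost of this design.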
Result from JLM report
[Screenshot: JLM report showing more locks with less AVER_HTM]
Split Hot Points : Scalable Counter
  – ConcurrentHashMap maintains an independent
    counter for each segment of the hash map, and uses
    a lock for each counter
  – the global count is obtained by summing all
    independent counters
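The per-segment counter idea can be sketched outside of a hash map as well. The class below is a simplified illustration and not the ConcurrentHashMap source (the segment design described here belongs to the Java 5 to 7 implementation). The class name, the choice of stripe by thread hash code and the demo helper are assumptions.

```java
// Sketch of a split counter: each stripe has its own lock and its own count.
final class StripedCounter {
    private final long[] counts;
    private final Object[] locks;

    StripedCounter(int stripes) {
        counts = new long[stripes];
        locks = new Object[stripes];
        for (int i = 0; i < stripes; i++) {
            locks[i] = new Object();
        }
    }

    // Each thread updates the stripe selected by its hash code (assumed policy).
    void increment() {
        int i = Math.floorMod(Thread.currentThread().hashCode(), counts.length);
        synchronized (locks[i]) {
            counts[i]++;
        }
    }

    // The global value is the sum of all stripes. It is not an atomic
    // snapshot if other threads keep incrementing while it runs.
    long sum() {
        long total = 0;
        for (int i = 0; i < counts.length; i++) {
            synchronized (locks[i]) {
                total += counts[i];
            }
        }
        return total;
    }

    // Demo helper: runs the given number of threads and returns the final sum.
    static long demo(int threads, int perThread) {
        StripedCounter counter = new StripedCounter(8);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) counter.increment();
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter.sum();
    }
}
```

On Java 8 and later, java.util.concurrent.atomic.LongAdder offers a ready-made counter built on a similar split-and-sum idea, so hand-written striping is mostly useful for understanding the technique.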
Alternatives to Exclusive Locks
• Duplicate shared resource if possible
• Atomic variables
  – counter, sequential number generator, head
    pointer of linked-list
• Concurrent container
  – java.util.concurrent package, Amino lib
• Read-Write Lock
  – java.util.concurrent.locks.ReadWriteLock
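As an illustration of the read-write lock alternative, the sketch below guards a plain HashMap with a ReentrantReadWriteLock, so that many readers can proceed together while writers get exclusive access. The class name RwCache and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch: a map protected by a read-write lock instead of one exclusive lock.
final class RwCache {
    private final Map<String, String> map = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Several threads may hold the read lock at the same time.
    String get(String key) {
        lock.readLock().lock();
        try {
            return map.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    // The write lock is exclusive: it blocks readers and other writers.
    void put(String key, String value) {
        lock.writeLock().lock();
        try {
            map.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

This tends to pay off only when reads clearly outnumber writes; with frequent writes the extra bookkeeping of a read-write lock can cost more than a plain exclusive lock.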
Example of AtomicLongArray
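Before (synchronized access to a plain long array d):

    public synchronized void set1(int idx, long val) {
        d[idx] = val;
    }

    public synchronized long get1(int idx) {
        long ret = d[idx];
        return ret;
    }

    Execution Time: 23550 milliseconds

After (no lock, atomic array a):

    private final AtomicLongArray a;

    public void set2(int idx, long val) {
        // Note: addAndGet adds val to the element; a.set(idx, val)
        // would mirror set1 exactly.
        a.addAndGet(idx, val);
    }

    public long get2(int idx) {
        long ret = a.get(idx);
        return ret;
    }

    Execution Time: 842 milliseconds

Improvement: 96%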
Using Concurrent Containers
• java.util.concurrent package
  – since Java 1.5
  – ConcurrentHashMap, ConcurrentLinkedQueue,
    CopyOnWriteArrayList, etc.
• Amino Lib is another good choice
  – LockFreeList, LockFreeStack, LockFreeQueue, etc.
• Thread-safe containers
• Optimized for common operations
• High performance and scalability on multi-core
  platforms
• Drawback: not all features are supported
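A small sketch of a concurrent container in use: a word counter backed by ConcurrentHashMap, with no explicit lock in the calling code. The class name WordCount is hypothetical, and the merge and getOrDefault methods used here require Java 8 or later, which is newer than the Java 1.5 baseline mentioned on the slide.

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch: thread-safe counting without a synchronized block.
final class WordCount {
    private final ConcurrentHashMap<String, Long> counts = new ConcurrentHashMap<>();

    // merge performs the read-modify-write atomically for the given key.
    void record(String word) {
        counts.merge(word, 1L, Long::sum);
    }

    long count(String word) {
        return counts.getOrDefault(word, 0L);
    }
}
```

The point of the design is that the container, not the caller, decides how finely to lock, so unrelated keys can usually be updated in parallel.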
Using Immutable and Thread Local data
• Immutable data
  – remains unchanged throughout its life cycle
  – always thread-safe
• Thread-local data
  – used by only a single thread
  – not shared among different threads
  – can replace a global waiting queue or object pool
  – used in work-stealing schedulers
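To illustrate the immutable-data bullet, here is a minimal sketch of an immutable value class. The class name ImmutablePoint and its methods are hypothetical. All fields are final and no method changes them, so instances can be shared between threads without any lock.

```java
// Sketch of an immutable value: state is fixed at construction time.
final class ImmutablePoint {
    private final int x;
    private final int y;

    ImmutablePoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    int x() { return x; }
    int y() { return y; }

    // "Modification" returns a new object and leaves this one untouched.
    ImmutablePoint withX(int newX) {
        return new ImmutablePoint(newX, y);
    }
}
```

The trade-off is extra allocation on every change, which is usually acceptable for small values that are read far more often than they are replaced.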
Reduce Memory Allocation
• JVM: two levels of memory allocation
  – first from a thread-local buffer
  – then from the global buffer
• The thread-local buffer is exhausted quickly
  if the allocation frequency is high
• The ThreadLocal class may help if a
  temporary object is needed in a loop
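The last bullet can be sketched as follows: instead of allocating a new StringBuilder on every call, each thread reuses its own instance through ThreadLocal. The class and method names are hypothetical, ThreadLocal.withInitial requires Java 8 or later, and whether this actually helps depends on the JVM and workload, so it should be measured rather than assumed.

```java
// Sketch: reuse one temporary buffer per thread instead of allocating in a loop.
final class KeyValueFormatter {
    private static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(() -> new StringBuilder(64));

    static String pair(String key, int value) {
        StringBuilder sb = BUFFER.get();   // this thread's private buffer
        sb.setLength(0);                   // clear content left by the previous call
        sb.append(key).append('=').append(value);
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            System.out.println(pair("k", i));
        }
    }
}
```

Because the buffer is never shared, no synchronization is needed; the result string is still a fresh allocation on every call.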
Rocket Science: Lock-Free Programming
Using Lock-Free/Wait-Free Algorithm
• Lock-free algorithms allow concurrent updates of
  shared data structures without using any
  locking mechanisms
  – solves some of the basic problems associated
    with using locks in the code
  – helps create algorithms that show good
    scalability
• Highly scalable and efficient
• Amino Lib
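A classic example of a lock-free structure is a stack whose head pointer is updated with compare-and-set. The sketch below follows the well-known Treiber stack pattern using AtomicReference; it is an illustration only and is not the Amino Lib implementation. The class name is an assumption.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a lock-free stack: push and pop retry until their CAS succeeds.
final class TreiberStack<T> {
    private static final class Node<T> {
        final T value;
        final Node<T> next;

        Node(T value, Node<T> next) {
            this.value = value;
            this.next = next;
        }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    void push(T value) {
        Node<T> oldHead;
        Node<T> newHead;
        do {
            oldHead = head.get();
            newHead = new Node<>(value, oldHead);
            // CAS fails if another thread changed head in the meantime.
        } while (!head.compareAndSet(oldHead, newHead));
    }

    // Returns null when the stack is empty.
    T pop() {
        Node<T> oldHead;
        do {
            oldHead = head.get();
            if (oldHead == null) {
                return null;
            }
        } while (!head.compareAndSet(oldHead, oldHead.next));
        return oldHead.value;
    }
}
```

No thread ever blocks while holding the structure: a thread that loses the race simply reads the new head and tries again.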
Why Lock-Free Often Means Better Scalability? (I)
[Figure: lock-based vs. lock-free access to a shared resource]
  Lock: all threads wait for one
  Lock-free: no waiting, but only one thread can succeed;
             the other threads need to retry
Why Lock-Free Often Means Better Scalability? (II)
[Figure: the same comparison, with one thread marked X in each case]
  Lock: all threads wait for one
  Lock-free: no waiting, but only one thread can succeed;
             the other threads often need to retry
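The "only one can succeed, others retry" behaviour can be sketched with a compare-and-set loop on a counter. The class name, the retry statistic and the run helper below are assumptions made for illustration; in real code AtomicInteger.incrementAndGet already does this work.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the CAS retry loop behind lock-free updates.
final class CasCounter {
    private final AtomicInteger value = new AtomicInteger();
    private final AtomicLong retries = new AtomicLong();   // how often a CAS lost the race

    int increment() {
        while (true) {
            int current = value.get();
            // Succeeds for exactly one of the threads that read the same value.
            if (value.compareAndSet(current, current + 1)) {
                return current + 1;
            }
            retries.incrementAndGet();   // lost the race: read again and retry
        }
    }

    int get() { return value.get(); }

    long retryCount() { return retries.get(); }

    // Demo helper: runs the given number of threads and returns the final value.
    static int run(int threads, int perThread) {
        CasCounter counter = new CasCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) counter.increment();
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter.get();
    }
}
```

A paused thread never holds anything the others need, so the remaining threads keep making progress; the price is the CPU time spent on retries, whose number varies from run to run.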
Performance of a Lock-Free Stack
[Figure: performance chart]
  Picture from: http://www.infoq.com/articles/scalable-java-components
References
• Amino Lib
  – http://amino-cbbs.sourceforge.net/
• MSDK
  – http://www.alphaworks.ibm.com/tech/msdk
• JLA
  – http://www.alphaworks.ibm.com/tech/jla
Backup


Editor's Notes

  • #6: What if the best practices above cannot meet your needs, and you want to optimize your application manually?
  • #7: MSDK: this tool can be used for detailed performance analysis of concurrent Java applications. It analyzes the complete execution stack in depth, from the hardware up to the application layer. Information is gathered from all four layers of the stack: hardware, operating system, JVM and application.
  • #28: For multi-threaded applications, the lock-free approach differs from the lock-based approach in several respects. When accessing a shared resource, the lock-based approach allows only one thread to enter the critical section, and the others wait for it. The lock-free approach, by contrast, allows every thread to attempt to modify the shared state, but only one of them can succeed; the other threads learn that their operation failed, so they retry or take some other action.
  • #29: The real difference appears when something bad happens to the running thread. If the running thread is paused by the OS scheduler, the two approaches behave differently. Lock-based approach: all other threads wait for this thread, and none of them can make progress. Lock-free approach: the other threads remain free to perform any operation, and the paused thread might fail its current operation. From this difference we can see that the lock-free approach has the advantage in a multi-core environment. It scales better because threads do not wait for each other. It does waste some CPU cycles under contention, but in most cases this is not a problem, since there is more than enough CPU resource.