PERFORMANCE AND 
PREDICTABILITY 
Richard Warburton 
@richardwarburto 
insightfullogic.com
Why care about low level rubbish? 
Branch Prediction 
Memory Access 
Storage 
Conclusions
Technology or Principles 
“60 messages per second.” 
“8 ElasticSearch servers on AWS, 26 front end proxy
servers. Double that in backend app servers.”
60 / (8 + 26 + 2 * 26) = 0.7 messages / server / second. 
http://highscalability.com/blog/2014/1/6/how-hipchat-stores-and-indexes-billions-of-messages-using-el.html
“What we need is less hipsters and more 
geeks” 
- Martin Thompson
Performance Discussion 
Product Solutions 
“Just use our library/tool/framework, and everything is web-scale!” 
Architecture Advocacy 
“Always design your software like this.” 
Methodology & Fundamentals 
“Here are some principles and knowledge, use your brain”
Case Study: Messaging 
● 1 Thread reading network data 
● 1 Thread writing network data 
● 1 Thread conducting admin tasks
Unifying Theme: Be Predictable 
An opportunity for an underlying system: 
○ Branch Prediction 
○ Memory Access 
○ Hard Disks
Do you care? 
Many problems are not predictability related
Networking 
Database or External Service 
Minimising I/O 
Garbage Collection 
Insufficient Parallelism 
Use an Optimisation Omen
Why care about low level rubbish? 
Branch Prediction 
Memory Access 
Storage 
Conclusions
What 4 things do CPUs actually do?
Fetch, Decode, Execute, Writeback
Pipelined
Super-pipelined & Superscalar
What about branches? 
    public static int simple(int x, int y, int z) {
        int ret;
        if (x > 5) {
            ret = y + z;
        } else {
            ret = y;
        }
        return ret;
    }
Branches cause stalls, stalls kill performance
Can we eliminate branches?
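Sometimes, for small cases like simple() above. A minimal sketch, assuming hypothetical names (BranchFree, branchless) that are not from the talk: the condition is turned into a bit mask, so no conditional jump is needed. The JIT may already emit a conditional move for the original, so this is only worth keeping if measurement says so.

```java
final class BranchFree {
    // The method from the slide, with a conditional branch.
    static int simple(int x, int y, int z) {
        int ret;
        if (x > 5) {
            ret = y + z;
        } else {
            ret = y;
        }
        return ret;
    }

    // Hypothetical branch-free variant: build a mask that is all ones
    // when x > 5 and all zeros otherwise, then use it to select z or 0.
    static int branchless(int x, int y, int z) {
        // 5L - x is negative exactly when x > 5; widening to long avoids overflow.
        int mask = (int) ((5L - x) >> 63); // -1 if x > 5, else 0
        return y + (z & mask);
    }

    public static void main(String[] args) {
        int[] samples = {Integer.MIN_VALUE, -1, 0, 5, 6, 42, Integer.MAX_VALUE};
        for (int x : samples) {
            System.out.println(x + ": " + simple(x, 10, 3) + " vs " + branchless(x, 10, 3));
        }
    }
}
```

Both methods return the same value for every input; the second trades readability for the absence of a jump.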
Strategy: predict branches and speculatively 
execute
Static Prediction 
A forward branch defaults to not taken 
A backward branch defaults to taken
Conditional Branches
    if (x == 0) {
        x = 1;
    }
    x++;

    mov eax, $x
    cmp eax, 0
    jne end
    mov eax, 1
    end:
    inc eax
    mov $x, eax
Static Hints (Pentium 4 or later) 
__emit 0x3E defaults to taken 
__emit 0x2E defaults to not taken 
Don’t use them; flip the branch instead
Dynamic prediction: record history and 
predict future
Branch Target Buffer (BTB) 
a log of the history of each branch 
also stores the program counter address 
it’s finite!
Local 
record per conditional branch histories 
Global 
record shared history of conditional jumps
Loop 
specialised predictor when there’s a loop (jumping in a 
cycle n times) 
Function 
specialised buffer for predicted nearby function returns 
N level Adaptive Predictor 
accounts for patterns of up to N+1 if statements
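To make "record history and predict future" concrete, here is a minimal sketch of a two-bit saturating counter, a common textbook scheme for a per-branch (local) predictor. It is an illustration only: real predictors are more elaborate and their details vary by CPU, and the class name and starting state are assumptions.

```java
import java.util.Arrays;

final class TwoBitPredictor {
    // Counter states: 0 = strongly not taken, 1 = weakly not taken,
    //                 2 = weakly taken,       3 = strongly taken.
    private int state = 1; // assumed starting state for this sketch

    boolean predictTaken() {
        return state >= 2;
    }

    // Move one step towards the observed outcome, saturating at 0 and 3.
    void update(boolean taken) {
        if (taken) {
            if (state < 3) state++;
        } else {
            if (state > 0) state--;
        }
    }

    // Replays a sequence of branch outcomes and counts wrong predictions.
    static int mispredictions(boolean[] outcomes) {
        TwoBitPredictor predictor = new TwoBitPredictor();
        int misses = 0;
        for (boolean taken : outcomes) {
            if (predictor.predictTaken() != taken) misses++;
            predictor.update(taken);
        }
        return misses;
    }

    public static void main(String[] args) {
        boolean[] alwaysTaken = new boolean[10];
        Arrays.fill(alwaysTaken, true);
        boolean[] alternating = new boolean[10];
        for (int i = 0; i < alternating.length; i++) alternating[i] = (i % 2 == 0);

        System.out.println("always taken: " + mispredictions(alwaysTaken));
        System.out.println("alternating:  " + mispredictions(alternating));
    }
}
```

A steady branch is learned after one miss, while a strictly alternating branch defeats this simple counter on every step. That gap is why predictors that track longer history patterns exist.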
Optimisation Omen 
Use Performance Event Counters (Model Specific 
Registers) 
Can be configured to store branch prediction 
information 
Profilers & Tooling: perf (Linux), VTune, AMD CodeAnalyst,
Visual Studio, Oracle Performance Studio
Demo perf
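The demo itself is not in the transcript. As a stand-in, a small workload of the kind one might run under a profiler such as perf, counting branch misses: the same loop over shuffled and sorted data. Everything here (class name, sizes, threshold, seed) is an assumption for illustration, and the timing difference depends on the CPU and on what the JIT generates.

```java
import java.util.Arrays;
import java.util.Random;

final class BranchWorkload {
    // Sums the values at or above a threshold; the if is the branch of interest.
    static long sumAbove(int[] data, int threshold) {
        long sum = 0;
        for (int value : data) {
            if (value >= threshold) {
                sum += value;
            }
        }
        return sum;
    }

    // Pseudo-random values in [0, 256), reproducible for a given seed.
    static int[] randomData(int size, long seed) {
        Random random = new Random(seed);
        int[] data = new int[size];
        for (int i = 0; i < size; i++) {
            data[i] = random.nextInt(256);
        }
        return data;
    }

    public static void main(String[] args) {
        int[] shuffled = randomData(1 << 20, 42L);
        int[] sorted = shuffled.clone();
        Arrays.sort(sorted); // sorted input makes the branch outcome predictable

        for (int[] data : new int[][] {shuffled, sorted}) {
            long start = System.nanoTime();
            long total = 0;
            for (int run = 0; run < 100; run++) {
                total += sumAbove(data, 128);
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("sum=" + total + " time=" + elapsedMs + "ms");
        }
    }
}
```

Both passes compute the same sum; only the order of the data, and therefore the predictability of the branch, differs.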
Summary 
CPUs are Super-pipelined and Superscalar 
Branches cause stalls 
Simplify your code! Especially branching logic and 
megamorphic callsites
Why care about low level rubbish? 
Branch Prediction 
Memory Access 
Storage 
Conclusions
The Problem
CPU: very fast
Main memory: relatively slow
The Solution: CPU Cache 
Core Demands Data, looks at its cache 
If present (a "hit") then data returned to register 
If absent (a "miss") then data looked up from 
memory and stored in the cache 
Fast memory is expensive, a small amount is affordable
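The hit/miss logic above can be sketched as a toy direct-mapped cache. This is a simplified model, not a description of any real CPU: the slot count (512) and line size (64 bytes) are assumed values, and real caches add associativity, multiple levels and prefetching.

```java
final class TinyCache {
    private final long[] tags;     // line number currently held by each slot
    private final boolean[] valid; // whether the slot holds anything yet
    private final int lineBytes;
    long hits;
    long misses;

    TinyCache(int slots, int lineBytes) {
        this.tags = new long[slots];
        this.valid = new boolean[slots];
        this.lineBytes = lineBytes;
    }

    // Look in the cache first; on a miss, "load" the line and store it.
    void access(long address) {
        long line = address / lineBytes;
        int slot = (int) (line % tags.length);
        if (valid[slot] && tags[slot] == line) {
            hits++;
        } else {
            misses++;
            valid[slot] = true;
            tags[slot] = line;
        }
    }

    // Touches `count` addresses, `strideBytes` apart, and returns the miss count.
    static long missesFor(int count, int strideBytes) {
        TinyCache cache = new TinyCache(512, 64); // assumed geometry
        for (int i = 0; i < count; i++) {
            cache.access((long) i * strideBytes);
        }
        return cache.misses;
    }

    public static void main(String[] args) {
        System.out.println("4-byte stride:  " + missesFor(1024, 4) + " misses");
        System.out.println("64-byte stride: " + missesFor(1024, 64) + " misses");
    }
}
```

With a 4-byte stride, sixteen consecutive accesses share one line, so only one in sixteen misses. With a 64-byte stride every access lands on a new line and misses.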
Multilevel Cache: Intel Sandybridge
[Diagram: each physical core (0 to N, 2 logical cores each via HT) has its own Level 1 instruction cache, Level 1 data cache and Level 2 cache; all cores share the Level 3 cache.]
How bad is a miss?
Location       Latency in clock cycles
Register       0
L1 Cache       3
L2 Cache       9
L3 Cache       21
Main Memory    150-400
Prefetching 
Eagerly load data 
Adjacent & Streaming Prefetches 
Arrange Data so accesses are predictable
Temporal Locality 
Repeatedly referring to same data in a short time span 
Spatial Locality 
Referring to data that is close together in memory 
Sequential Locality 
Referring to data that is arranged linearly in memory
General Principles 
Use smaller data types (-XX:+UseCompressedOops) 
Avoid 'big holes' in your data 
Make accesses as linear as possible
Primitive Arrays 
// Sequential Access = Predictable 
for (int i = 0; i < someArray.length; i++)
    someArray[i]++;
Primitive Arrays - Skipping Elements 
// Holes Hurt 
for (int i = 0; i < someArray.length; i += SKIP)
    someArray[i]++;
Multidimensional Arrays
Multidimensional arrays are really arrays of arrays in Java (unlike C).
Some people realign their accesses:
    for (int col = 0; col < COLS; col++) {
        for (int row = 0; row < ROWS; row++) {
            array[ROWS * col + row]++;
        }
    }
Bad Access Alignment
Strides the wrong way, bad locality:
    array[COLS * row + col]++;
Strides the right way, good locality:
    array[ROWS * col + row]++;
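The two index expressions can be put side by side in a runnable sketch. It assumes a flat int[] of ROWS * COLS elements and the col-outer, row-inner loop order from the slide; the class name and the sizes are made up for illustration.

```java
final class Strides {
    static final int ROWS = 512;  // assumed size
    static final int COLS = 1024; // assumed size

    // Inner loop advances the index by 1: sequential, cache-friendly.
    static void goodStride(int[] array) {
        for (int col = 0; col < COLS; col++) {
            for (int row = 0; row < ROWS; row++) {
                array[ROWS * col + row]++;
            }
        }
    }

    // Same loops, but the inner loop advances the index by COLS elements.
    static void badStride(int[] array) {
        for (int col = 0; col < COLS; col++) {
            for (int row = 0; row < ROWS; row++) {
                array[COLS * row + col]++;
            }
        }
    }

    public static void main(String[] args) {
        int[] array = new int[ROWS * COLS];
        long start = System.nanoTime();
        goodStride(array);
        long mid = System.nanoTime();
        badStride(array);
        long end = System.nanoTime();
        System.out.println("good: " + (mid - start) / 1_000 + "us, bad: " + (end - mid) / 1_000 + "us");
    }
}
```

Both methods touch every element exactly once; only the order of the visits differs, which is the whole point.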
Access pattern        L1D         L2           Memory
Full Random Access    5 clocks    37 clocks    280 clocks
Sequential Access     5 clocks    14 clocks    28 clocks
Data Layout Principles 
Primitive Collections (GNU Trove, GS-Coll, FastUtil, HPPC) 
Arrays > Linked Lists 
Hashtable > Search Tree 
Avoid Code bloating (Loop Unrolling)
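A small sketch of "Arrays > Linked Lists" using only the JDK: the same values stored in an int[] and in a LinkedList<Integer>. The class and method names are illustrative. The primitive collection libraries listed above aim to give collection-style APIs without the boxing shown in the list version.

```java
import java.util.LinkedList;
import java.util.List;

final class LayoutComparison {
    // Contiguous primitives: one linear walk over memory.
    static long sumArray(int[] values) {
        long sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Each element is a node pointing at a boxed Integer elsewhere on the heap.
    static long sumList(List<Integer> values) {
        long sum = 0;
        for (int value : values) { // unboxes each Integer
            sum += value;
        }
        return sum;
    }

    static int[] buildArray(int n) {
        int[] values = new int[n];
        for (int i = 0; i < n; i++) {
            values[i] = i;
        }
        return values;
    }

    static List<Integer> buildLinkedList(int n) {
        List<Integer> values = new LinkedList<>();
        for (int i = 0; i < n; i++) {
            values.add(i);
        }
        return values;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        System.out.println(sumArray(buildArray(n)));
        System.out.println(sumList(buildLinkedList(n)));
    }
}
```

The results are identical; the difference is in how many separate memory locations must be visited to produce them.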
Custom Data Structures 
Judy Arrays 
an associative array/map 
kD-Trees 
generalised Binary Space Partitioning 
Z-Order Curve 
multidimensional data in one dimension
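As an illustration of the Z-order curve, a minimal sketch that interleaves the bits of two 16-bit coordinates into one index, so that points close in 2D tend to be close in the 1D ordering. The class and method names are assumptions for this example.

```java
final class ZOrder {
    // Spreads the low 16 bits of v so that bit i moves to bit 2*i.
    static int spread(int v) {
        v &= 0xFFFF;
        v = (v | (v << 8)) & 0x00FF00FF;
        v = (v | (v << 4)) & 0x0F0F0F0F;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    }

    // x occupies the even bits of the result, y the odd bits.
    static int encode(int x, int y) {
        return spread(x) | (spread(y) << 1);
    }

    public static void main(String[] args) {
        for (int y = 0; y < 4; y++) {
            StringBuilder row = new StringBuilder();
            for (int x = 0; x < 4; x++) {
                row.append(encode(x, y)).append('\t');
            }
            System.out.println(row);
        }
    }
}
```

Printing the 4x4 grid shows the characteristic Z pattern: each 2x2 block holds four consecutive indices.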
Data Locality vs Java Heap Layout
    class Foo {
        Integer count;
        Bar bar;
        Baz baz;
    }

    // No alignment guarantees
    for (Foo foo : foos) {
        foo.count = 5;
        foo.bar.visit();
    }
[Diagram: array slots 0, 1, 2, 3, ... each reference a separate Foo object holding count, bar and baz.]
Data Locality vs Java Heap Layout 
Serious Java Weakness 
The location of objects in memory is hard to guarantee.
GC also interferes 
Copying 
Compaction
Optimisation Omen 
Again Use Performance Event Counters 
Measure for cache hit/miss rates 
Correlate with Pipeline Stalls to identify where this is 
relevant
Object Layout Control 
On Heap 
http://objectlayout.github.io/ObjectLayout
Off Heap 
- Data Structures: Chronicle or JCTools Experimental
- Serialisation: SBE, Cap’n Proto, FlatBuffers
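The off-heap idea can be sketched with nothing but java.nio.ByteBuffer: fixed-size records packed back to back in one direct buffer, so their relative layout is under the program's control rather than the garbage collector's. This is not how Chronicle, JCTools or SBE are implemented; the record layout and the class name are assumptions for illustration.

```java
import java.nio.ByteBuffer;

final class OffHeapCounters {
    // Hypothetical record layout: int count at offset 0, 4 bytes padding,
    // long total at offset 8. 16 bytes per record, stored contiguously.
    private static final int RECORD_BYTES = 16;
    private static final int TOTAL_OFFSET = 8;

    private final ByteBuffer buffer;

    OffHeapCounters(int capacity) {
        // Direct buffers live outside the Java heap and start zero-filled.
        this.buffer = ByteBuffer.allocateDirect(capacity * RECORD_BYTES);
    }

    void set(int index, int count, long total) {
        int base = index * RECORD_BYTES;
        buffer.putInt(base, count);
        buffer.putLong(base + TOTAL_OFFSET, total);
    }

    int count(int index) {
        return buffer.getInt(index * RECORD_BYTES);
    }

    long total(int index) {
        return buffer.getLong(index * RECORD_BYTES + TOTAL_OFFSET);
    }

    public static void main(String[] args) {
        OffHeapCounters counters = new OffHeapCounters(1024);
        counters.set(3, 5, 99L);
        System.out.println(counters.count(3) + " " + counters.total(3));
    }
}
```

Iterating records 0..n-1 walks memory linearly, which is the property the Foo example on the heap cannot guarantee.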
Summary 
Cache misses cause stalls, which kill performance 
Measurable via Performance Event Counters 
Common Techniques for optimizing code
Why care about low level rubbish? 
Branch Prediction 
Memory Access 
Storage 
Conclusions
Hard Disks 
Commonly used persistent storage 
Spinning Rust, with a head to read/write 
Constant Angular Velocity - rotations per minute stay
constant
Sector size differs between devices
A simple model 
Zone Constant Angular Velocity (ZCAV) / 
Zoned Bit Recording (ZBR) 
Operation Time =
    Time to process the command
  + Time to seek
  + Rotational latency
  + Sequential transfer time
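The model above can be written out as a small calculation. A sketch under stated assumptions: average rotational latency is taken as half a revolution, and the figures used in main (seek time, RPM, transfer rate) are made-up round numbers, not measurements of any device.

```java
final class DiskModel {
    // Average wait for the sector to come round: half a revolution.
    static double avgRotationalLatencyMs(int rpm) {
        return 0.5 * 60_000.0 / rpm;
    }

    static double transferMs(long bytes, double bytesPerSecond) {
        return bytes / bytesPerSecond * 1000.0;
    }

    // Operation time = command + seek + rotational latency + transfer.
    static double operationMs(double commandMs, double seekMs, int rpm,
                              long bytes, double bytesPerSecond) {
        return commandMs + seekMs + avgRotationalLatencyMs(rpm)
                + transferMs(bytes, bytesPerSecond);
    }

    public static void main(String[] args) {
        double rate = 100_000_000.0; // assumed 100 MB/s sequential transfer
        double small = operationMs(0.0, 10.0, 7200, 4 * 1024, rate);
        double large = operationMs(0.0, 10.0, 7200, 100_000_000L, rate);
        System.out.printf("4 KB:   %.3f ms total, %.3f ms transferring%n",
                small, transferMs(4 * 1024, rate));
        System.out.printf("100 MB: %.3f ms total, %.3f ms transferring%n",
                large, transferMs(100_000_000L, rate));
    }
}
```

For the small read almost all of the time is seek and rotation; for the large one the fixed costs are amortised over the transfer.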
ZBR implies faster transfer at the outer edge than at the
centre (~25%)
Seeking vs Sequential reads 
Seek and rotation times dominate for small amounts of
data
Random writes of 4 KB can be 300 times slower than the
theoretical max data transfer rate
Consider the impact of context switching between 
applications or threads
Fragmentation causes unnecessary seeks
Sector (Mis) Alignment
Optimisation Omen 
1. Application Spending time waiting on I/O 
2. I/O Subsystem not transferring much data
Summary 
Simple, sequential access patterns win 
Fragmentation is your enemy 
Alignment can be important
Why care about low level rubbish? 
Branch Prediction 
Memory Access 
Storage 
Conclusions
Speedups 
● Possible 20 cycle stall for a mispredict (example: 5x slowdown)
● 200x for L1 cache hit vs Main Memory 
● 300x for sequential vs random on disk 
● Theoretical Max
Latency Numbers
Operation                              Latency (ns)    Latency (ms)   Note
L1 cache reference                     0.5
Branch mispredict                      5
L2 cache reference                     7                              14x L1 cache
Mutex lock/unlock                      25
Main memory reference                  100                            20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy           3,000
Send 1K bytes over 1 Gbps network      10,000          0.01
Read 4K randomly from SSD*             150,000         0.15
Read 1 MB sequentially from memory     250,000         0.25
Round trip within same datacenter      500,000         0.5
Read 1 MB sequentially from SSD*       1,000,000       1
Disk seek                              10,000,000      10
Read 1 MB sequentially from disk       20,000,000      20
Send packet CA->Netherlands->CA        150,000,000     150
Stolen (cited) from https://gist.github.com/jboner/2841832
Common Themes 
● Principles over Tools 
● Data over Unsubstantiated Claims 
● Simple over Complex 
● Predictable Access over Random Access
More information 
Articles 
http://www.akkadia.org/drepper/cpumemory.pdf
https://gmplib.org/~tege/x86-timing.pdf
http://psy-lob-saw.blogspot.co.uk/
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
http://mechanical-sympathy.blogspot.co.uk
http://www.agner.org/optimize/microarchitecture.pdf
Mailing Lists:
https://groups.google.com/forum/#!forum/mechanical-sympathy
https://groups.google.com/a/jclarity.com/forum/#!forum/friends
http://gee.cs.oswego.edu/dl/concurrency-interest/
http://java8training.com
http://is.gd/javalambdas
Q & A 
@richardwarburto 
insightfullogic.com 
tinyurl.com/java8lambdas
