© 2009 IBM Corporation
Session IK: Parallel Batch Performance
Considerations
Martin Packer, IBM
martin_packer@uk.ibm.com
Abstract
With the laws of physics providing a nice brick wall for chip designers to run into on processor clock speed, we are heading into territory where simply buying a new machine won't necessarily make your batch go faster. So if you can't go short, go wide! This session looks at some of the performance issues and techniques involved in splitting your batch jobs into parallel streams to do more at once.
Motivations
Increased Window Challenge
● Workloads growing in an accelerated fashion:
● Business success
● Mergers and acquisitions
– Standardisation of processes and applications
● More processing
– Regulation
– Analytics
– “Just because”
● Shortened Window
● Challenge will outstrip “single actor” speed-up
● For SOME installations and applications
– CPU
– Disk
– Tape
– Etc
● Important to assess where on the (possibly) bell-shaped curve you are
This is still 10 – 15 years away. Maybe more. Maybe never.
© 2012 IBM Corporation
IBM System z
zEC12 – Overall Attributes Highlights (compared to z196)
• 50% more cores in a CP chip
– Up to 5.7% faster core running frequency
– Up to 25% capacity improvement over z196 uni-processor
• Bigger caches and shorter latency
– Total L2 per core is 33% bigger
– Total on-chip shared L3 is 100% bigger
– Unique private L2 designed to reduce L1 miss latency by up to 45%
• 3rd Generation High Frequency, 2nd Generation Out of Order Design
– Numerous pipeline improvements based on z10 and z196 designs
– # of instructions in flight is increased by 25%
• New 2nd level Branch Prediction Table for enterprise scale program footprint
– 3.5x more branches
• Dedicated Co-processor per core with improved performance and additional capability
– New hardware support for Unicode UTF8<>UTF16 bulk conversions
• Multiple innovative architectural extensions for software exploitation
Other Motivations
● Take advantage of capacity overnight
● Proactively move work out of the peak
● Resilience
● Consider what happens when some part of the batch fails
● Note: This presentation doesn't deal with the case of
concurrent batch
● Some challenges are different
● Some are the same
– You just have 24 hours to get stuff done, not e.g. 8
Issues
● Main issue is to break “main loop” down into
e.g. 8 copies acting on a subset of the data
● Assuming the program in question has this pattern *
● Not the only issue but necessary to drive
cloning
● With this pattern, dividing the Master File is the key thing (see the sketch below) …
* This is a simplification of the more general “the whole stream
needs cloning” case
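Purely as an illustration (not from the original deck), here is a minimal sketch of one clone of such a "main loop": the clone is told its own number and the total clone count, and keeps only the Master File records whose key hashes to its number. The record layout, the key position and the hash are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical clone of the "main loop": processes only the records whose key
// hashes to this clone's number. Invoked as, e.g.:
//   java CloneMain 3 8 master.txt   (clone 3 of 8)
public class CloneMain {
    public static void main(String[] args) throws IOException {
        int cloneNumber = Integer.parseInt(args[0]);       // 0 .. cloneCount-1
        int cloneCount  = Integer.parseInt(args[1]);
        String masterFile = args[2];

        try (BufferedReader in = new BufferedReader(new FileReader(masterFile))) {
            String record;
            while ((record = in.readLine()) != null) {
                String key = record.substring(0, 10).trim();       // assumed key position
                // Stable hash, so a given key always belongs to the same clone
                int owner = Math.floorMod(key.hashCode(), cloneCount);
                if (owner != cloneNumber) continue;                // not ours - skip
                process(record);                                   // unchanged per-record logic
            }
        }
    }

    private static void process(String record) {
        // ... the original program's business logic, untouched ...
    }
}
```

Pre-splitting the input instead of filtering it inside every clone avoids reading the whole Master File N times; a splitter sketch appears under "Making Changes" below.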
More Issues
● Reworking the “results”
● Probably some kind of merge process
● Handling inter-clone issues
● Locking
● I/O Bottlenecks
● Provisioning resource
● Concurrent use of memory and CPU greatly increased
● Scheduling and choreography
● Streams in lockstep or not
● Recovery boundaries
● Automation of cloning in the schedule
A Note On Reworking
● Consider the “merge at the end” portion:
● Probably valuable to separate data merge from
“presentation”
– “Presentation” here means e.g. reports, persistent output
● Consider an “architected” intermediate file
– XML or JSON or whatever (sketched below)
● Use the architected intermediate file for other
purposes
– e.g. PDF-format reporting
– Alongside original purpose
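A minimal sketch of that idea, assuming JSON lines as the "architected" format and with invented field names: each clone writes its own results file, a separate merge step concatenates them, and the report, PDF and any future presentation steps all read the one merged file.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only. Each clone calls writeResult() instead of producing the
// report directly; the main() here is the separate "fan in" merge step.
public class IntermediateFile {

    // Inside a clone: one JSON line per result record (field names are invented)
    static void writeResult(PrintWriter out, String account, double total) {
        out.printf("{\"account\":\"%s\",\"total\":%.2f}%n", account, total);
    }

    // Merge step: concatenate results.0.json, results.1.json, ... into one file
    public static void main(String[] args) throws IOException {
        Path merged = Paths.get("results.merged.json");
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(merged))) {
            for (String arg : args) {
                for (String line : Files.readAllLines(Paths.get(arg))) {
                    out.println(line);
                }
            }
        }
        // Report, PDF and any later presentation steps all read results.merged.json
    }
}
```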
Implementation
● Implementation consists of three obvious steps:
● Analysis
● Make changes – or Implementation :-)
● Monitoring
(and loop back around)
Analysis
● Look for e.g.
● CPU-intensive steps
● Database-I/O intensive steps
● Prefer other tune-ups
● Be clear whether other tune-ups get you there
– Some may effectively do cloning for you
● Take a forward-looking view
– Lead time
– Keep a list of potential jobs to clone later on (see the sketch below)
● Assess whether “code surgery” will be required
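Purely as an illustration, and assuming the per-step CPU and elapsed figures have already been extracted (for example from SMF type 30 records by whatever tooling you normally use), the candidate list can start out as a simple ranking. The job and step names and the figures below are invented.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical first pass at the candidate list: rank already-extracted
// step timings by CPU to see which jobs might repay the cost of cloning.
public class CloneCandidates {
    record StepTiming(String job, String step, double cpuMinutes, double elapsedMinutes) {}

    public static void main(String[] args) {
        List<StepTiming> steps = List.of(                       // invented figures
            new StepTiming("NIGHT01", "UPDATE",  28.0, 34.0),
            new StepTiming("NIGHT01", "EXTRACT",  6.5, 12.0),
            new StepTiming("NIGHT02", "REPORT",   2.0,  9.0));

        steps.stream()
             .sorted(Comparator.comparingDouble(StepTiming::cpuMinutes).reversed())
             .forEach(s -> System.out.printf("%-8s %-8s CPU %6.1f min  elapsed %6.1f min%n",
                                             s.job(), s.step(), s.cpuMinutes(), s.elapsedMinutes()));
    }
}
```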
Making Changes
● Splitting the transaction file (see the sketch below)
● Changing the program to expect a subset of the data
● Merging the results
● Refactoring JCL
● Changing the Schedule
● Reducing data contention
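To make the first item concrete, a hedged sketch of splitting a transaction file into key-range subsets, so that each clone's input lines up with one database partition. The ranges, the key position and the file naming are invented for the example.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

// Illustrative splitter: writes transaction records into one subset file per
// key range, so each clone's input lines up with a database partition.
public class SplitTransactionFile {
    // Upper bound of each key range (the last range is open-ended)
    private static final String[] UPPER_BOUNDS = { "C999999999", "K999999999", "R999999999" };

    public static void main(String[] args) throws IOException {
        String inputFile = args[0];
        int pieces = UPPER_BOUNDS.length + 1;

        PrintWriter[] subsets = new PrintWriter[pieces];
        for (int i = 0; i < pieces; i++) {
            subsets[i] = new PrintWriter("transactions." + i + ".txt");   // assumed naming
        }
        try (BufferedReader in = new BufferedReader(new FileReader(inputFile))) {
            String record;
            while ((record = in.readLine()) != null) {
                String key = record.substring(0, 10);                     // assumed key position
                subsets[rangeFor(key)].println(record);
            }
        } finally {
            for (PrintWriter w : subsets) w.close();
        }
    }

    private static int rangeFor(String key) {
        for (int i = 0; i < UPPER_BOUNDS.length; i++) {
            if (key.compareTo(UPPER_BOUNDS[i]) <= 0) return i;
        }
        return UPPER_BOUNDS.length;                    // everything above the last bound
    }
}
```

Hash-based splitting (as in the earlier clone sketch) balances volumes more evenly but loses the alignment with table partitions; which matters more depends on where the contention is.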
Monitoring
● Monitoring isn't terribly different from any other batch
monitoring.
● Usual tools, including:
● Scheduler-based monitoring tools - for how the clones are
progressing against the planned schedule.
● SMF - for timings, etc.
● Logs
● Need to demonstrate the application still functions correctly
● Work on “finding the sweet spot”:
● e.g. Is 2 best, or 4, or 8? *
● Work on “balance” (see the sketch below)
– * Note the bias to powers of two
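As a hypothetical example of a "balance" check: compare each clone's elapsed time against the longest leg. The clone names and times are invented; in practice the numbers would come from the scheduler or from SMF.

```java
import java.util.Map;

// Hypothetical balance check: how close is each clone to the longest leg?
// A badly trailing longest leg usually means the split itself needs rework.
public class BalanceCheck {
    public static void main(String[] args) {
        Map<String, Double> elapsedMinutes = Map.of(    // invented measurements
            "CLONE1", 11.0, "CLONE2", 12.5, "CLONE3", 10.5, "CLONE4", 18.0);

        double longestLeg = elapsedMinutes.values().stream()
                                          .mapToDouble(Double::doubleValue).max().orElse(0.0);
        double average = elapsedMinutes.values().stream()
                                       .mapToDouble(Double::doubleValue).average().orElse(0.0);

        elapsedMinutes.forEach((clone, minutes) ->
            System.out.printf("%-8s %5.1f min (%3.0f%% of longest leg)%n",
                              clone, minutes, 100.0 * minutes / longestLeg));
        System.out.printf("Skew: longest leg is %.2fx the average clone%n", longestLeg / average);
    }
}
```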
Case Study
● Our application
● Not meant to be identical to yours
● Scales nicely through iterations
● Process important
● Stepwise progress
● Use e.g. DB2 Accounting Trace to guide
● In the following:
● 0-Up is the original unmodified program
● 1-Up is prepared for 2-Up etc and has
– Reporting removed & replaced by writing a report data file
● Report writing, “fan out” and “fan in” add minimal elapsed time / CPU
● Two key metrics (see the sketch below):
● Total CPU cost
● Longest Leg Elapsed Time
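A sketch, with invented per-clone figures, of how the two metrics fall out of an N-Up run: Total CPU is summed across the clones (plus the fan-out / fan-in steps), while the elapsed time that matters is the longest single leg, since the clones run side by side.

```java
import java.util.List;

// Illustrative only: derive the two key metrics from per-clone measurements.
public class RunMetrics {
    record CloneRun(String name, double cpuMinutes, double elapsedMinutes) {}

    public static void main(String[] args) {
        List<CloneRun> clones = List.of(                 // invented 4-Up measurements
            new CloneRun("CLONE1", 6.1, 8.0),
            new CloneRun("CLONE2", 6.4, 8.6),
            new CloneRun("CLONE3", 5.9, 7.7),
            new CloneRun("CLONE4", 6.2, 8.1));
        double fanOutFanInCpu = 0.3;                     // assumed to be small

        double totalCpu = clones.stream().mapToDouble(CloneRun::cpuMinutes).sum() + fanOutFanInCpu;
        double longestLeg = clones.stream().mapToDouble(CloneRun::elapsedMinutes).max().orElse(0.0);

        System.out.printf("Total CPU:   %.1f minutes%n", totalCpu);
        System.out.printf("Longest leg: %.1f minutes elapsed%n", longestLeg);
    }
}
```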
Don't Commit: Won't Go Beyond 1-Up
45% DB2 CPU, 45% Non-DB2 CPU
[Chart: Total CPU and Max Elapsed, in minutes, for 0-Up and 1-Up]
Commit Every Update: Scales Nicely Up To 8-Up
50% DB2 CPU, 50% Non-DB2 CPU
[Chart: Total CPU and Max Elapsed, in minutes, for 0-Up through 32-Up]
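The contrast between the two charts above comes down to how long each clone holds its locks. Purely as an illustration (JDBC-style, with an invented table and SQL; the case study itself committed on every update), a clone's update loop needs an explicit commit scope along these lines so that sibling clones are not left queueing behind its locks:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Illustrative commit scope inside one clone. Table name, SQL and the commit
// interval are invented; the point is that locks are released often enough
// that sibling clones are not stuck queueing behind them.
public class CloneUpdateLoop {
    private static final int COMMIT_INTERVAL = 1;   // the case study committed every update

    static void applyUpdates(Connection conn, List<Update> updates) throws SQLException {
        conn.setAutoCommit(false);
        String sql = "UPDATE ACCOUNT_BALANCE SET BALANCE = BALANCE + ? WHERE ACCOUNT_ID = ?";
        int sinceCommit = 0;
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            for (Update u : updates) {
                stmt.setBigDecimal(1, u.amount());
                stmt.setString(2, u.accountId());
                stmt.executeUpdate();
                if (++sinceCommit >= COMMIT_INTERVAL) {
                    conn.commit();                  // release locks for the other clones
                    sinceCommit = 0;
                }
            }
            conn.commit();                          // pick up any tail-end updates
        } catch (SQLException e) {
            conn.rollback();                        // back out to the last commit point
            throw e;
        }
    }

    record Update(String accountId, java.math.BigDecimal amount) {}
}
```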
8 Balanced Partitions: 50% Elapsed / CPU Reduction Up To 32-Up
Up to 16-Up almost all time is Non-DB2 CPU; at 32-Up about 50% is “Queue”
Going to 16 or 32 partitions made no difference
[Chart: Total CPU and Max Elapsed, in minutes, for 0-Up through 32-Up]
Case Study Lessons
● Applications mustn't break when you try to clone them
● “Sweet spot” in our case is around 8-up
● Might still drive further if CPU increase acceptable
● Elapsed time got better at 16-up
● Data Management work can help
● Partitioning very nice in our case
● Environmental conditions matter
● In our case CPU contention limited scalability
● DB2 Accounting Trace guided us:
● Explained why preloading data into DB2 buffer pools did nothing

