© 2009 IBM Corporation
Session IK: Parallel Batch Performance
Considerations
Martin Packer, IBM
martin_packer@uk.ibm.com
Abstract
With the laws of physics providing a nice brick wall for chip designers to run into on processor clock speed, we are heading into territory where simply buying a new machine won't necessarily make your batch go faster. So if you can't go short, go wide! This session looks at some of the performance issues and techniques involved in splitting your batch jobs into parallel streams to do more at once.
Motivations
Increased Window Challenge
● Workloads growing in an accelerated fashion:
● Business success
● Mergers and acquisitions
– Standardisation of processes and applications
● More processing
– Regulation
– Analytics
– “Just because”
● Shortened Window
● Challenge will outstrip “single actor” speed-up
● For SOME installations and applications
– CPU
– Disk
– Tape
– Etc
● Important to assess where on the (possibly) bell-shaped curve you are
This is still 10 – 15 years away. Maybe more. Maybe never.
© 2012 IBM Corporation
IBM System z
zEC12 – Overall Attributes Highlights (compared to z196)
• 50% more cores in a CP chip
– Up to 5.7% faster core running frequency
– Up to 25% capacity improvement over z196 uni-processor
• Bigger caches and shorter latency
– Total L2 per core is 33% bigger
– Total on-chip shared L3 is 100% bigger
– Unique private L2 designed to reduce L1 miss latency by up to 45%
• 3rd Generation High Frequency, 2nd Generation Out of Order Design
– Numerous pipeline improvements based on z10 and z196 designs
– # of instructions in flight is increased by 25%
• New 2nd level Branch Prediction Table for enterprise scale program footprint
– 3.5x more branches
• Dedicated Co-processor per core with improved performance and additional capability
– New hardware support for Unicode UTF8<>UTF16 bulk conversions
• Multiple innovative architectural extensions for software exploitation
Other Motivations
● Take advantage of capacity overnight
● Proactively move work out of the peak
● Resilience
● Consider what happens when some part of the batch fails
● Note: This presentation doesn't deal with the case of
concurrent batch
● Some challenges are different
● Some are the same
– You just have 24 hours to get stuff done, not e.g. 8
Issues
● Main issue is to break “main loop” down into
e.g. 8 copies acting on a subset of the data
● Assuming the program in question has this pattern *
● Not the only issue but necessary to drive
cloning
● With this pattern, dividing the Master File is the key thing (see the sketch below) …
* This is a simplification of the more general “the whole stream
needs cloning” case
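Purely as an illustration (not from the original deck), here is a minimal sketch of one clone of such a "main loop": the clone is told its own number and the total clone count, and keeps only the Master File records whose key hashes to its number. The record layout, the key position and the hash are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical clone of the "main loop": processes only the records whose key
// hashes to this clone's number. Invoked as, e.g.:
//   java CloneMain 3 8 master.txt   (clone 3 of 8)
public class CloneMain {
    public static void main(String[] args) throws IOException {
        int cloneNumber = Integer.parseInt(args[0]);       // 0 .. cloneCount-1
        int cloneCount  = Integer.parseInt(args[1]);
        String masterFile = args[2];

        try (BufferedReader in = new BufferedReader(new FileReader(masterFile))) {
            String record;
            while ((record = in.readLine()) != null) {
                String key = record.substring(0, 10).trim();       // assumed key position
                // Stable hash, so a given key always belongs to the same clone
                int owner = Math.floorMod(key.hashCode(), cloneCount);
                if (owner != cloneNumber) continue;                // not ours - skip
                process(record);                                   // unchanged per-record logic
            }
        }
    }

    private static void process(String record) {
        // ... the original program's business logic, untouched ...
    }
}
```

Pre-splitting the input instead of filtering it inside every clone avoids reading the whole Master File N times; a splitter sketch appears under "Making Changes" below.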
More Issues
● Reworking the “results”
● Probably some kind of merge process
● Handling inter-clone issues
● Locking
● I/O Bottlenecks
● Provisioning resource
● Concurrent use of memory and CPU greatly increased
● Scheduling and choreography
● Streams in lockstep or not
● Recovery boundaries
● Automation of cloning in the schedule
A Note On Reworking
● Consider the “merge at the end” portion:
● Probably valuable to separate data merge from
“presentation”
– “Presentation” here means e.g. reports, persistent output
● Consider an “architected” intermediate file
– XML or JSON or whatever (sketched below)
● Use the architected intermediate file for other
purposes
– e.g. PDF-format reporting
– Alongside original purpose
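A minimal sketch of that idea, assuming JSON lines as the "architected" format and with invented field names: each clone writes its own results file, a separate merge step concatenates them, and the report, PDF and any future presentation steps all read the one merged file.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only. Each clone calls writeResult() instead of producing the
// report directly; the main() here is the separate "fan in" merge step.
public class IntermediateFile {

    // Inside a clone: one JSON line per result record (field names are invented)
    static void writeResult(PrintWriter out, String account, double total) {
        out.printf("{\"account\":\"%s\",\"total\":%.2f}%n", account, total);
    }

    // Merge step: concatenate results.0.json, results.1.json, ... into one file
    public static void main(String[] args) throws IOException {
        Path merged = Paths.get("results.merged.json");
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(merged))) {
            for (String arg : args) {
                for (String line : Files.readAllLines(Paths.get(arg))) {
                    out.println(line);
                }
            }
        }
        // Report, PDF and any later presentation steps all read results.merged.json
    }
}
```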
Implementation
● Implementation consists of three obvious steps:
● Analysis
● Make changes – or Implementation :-)
● Monitoring
(and loop back around)
Analysis
● Look for e.g.
● CPU-intensive steps
● Database-I/O intensive steps
● Prefer other tune-ups
● Be clear whether other tune-ups get you there
– Some may effectively do cloning for you
● Take a forward-looking view
– Lead time
– Keep a list of potential jobs to clone later on (see the sketch below)
● Assess whether “code surgery” will be required
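Purely as an illustration, and assuming the per-step CPU and elapsed figures have already been extracted (for example from SMF type 30 records by whatever tooling you normally use), the candidate list can start out as a simple ranking. The job and step names and the figures below are invented.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical first pass at the candidate list: rank already-extracted
// step timings by CPU to see which jobs might repay the cost of cloning.
public class CloneCandidates {
    record StepTiming(String job, String step, double cpuMinutes, double elapsedMinutes) {}

    public static void main(String[] args) {
        List<StepTiming> steps = List.of(                       // invented figures
            new StepTiming("NIGHT01", "UPDATE",  28.0, 34.0),
            new StepTiming("NIGHT01", "EXTRACT",  6.5, 12.0),
            new StepTiming("NIGHT02", "REPORT",   2.0,  9.0));

        steps.stream()
             .sorted(Comparator.comparingDouble(StepTiming::cpuMinutes).reversed())
             .forEach(s -> System.out.printf("%-8s %-8s CPU %6.1f min  elapsed %6.1f min%n",
                                             s.job(), s.step(), s.cpuMinutes(), s.elapsedMinutes()));
    }
}
```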
Making Changes
● Splitting the transaction file (see the sketch below)
● Changing the program to expect a subset of the data
● Merging the results
● Refactoring JCL
● Changing the Schedule
● Reducing data contention
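To make the first item concrete, a hedged sketch of splitting a transaction file into key-range subsets, so that each clone's input lines up with one database partition. The ranges, the key position and the file naming are invented for the example.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

// Illustrative splitter: writes transaction records into one subset file per
// key range, so each clone's input lines up with a database partition.
public class SplitTransactionFile {
    // Upper bound of each key range (the last range is open-ended)
    private static final String[] UPPER_BOUNDS = { "C999999999", "K999999999", "R999999999" };

    public static void main(String[] args) throws IOException {
        String inputFile = args[0];
        int pieces = UPPER_BOUNDS.length + 1;

        PrintWriter[] subsets = new PrintWriter[pieces];
        for (int i = 0; i < pieces; i++) {
            subsets[i] = new PrintWriter("transactions." + i + ".txt");   // assumed naming
        }
        try (BufferedReader in = new BufferedReader(new FileReader(inputFile))) {
            String record;
            while ((record = in.readLine()) != null) {
                String key = record.substring(0, 10);                     // assumed key position
                subsets[rangeFor(key)].println(record);
            }
        } finally {
            for (PrintWriter w : subsets) w.close();
        }
    }

    private static int rangeFor(String key) {
        for (int i = 0; i < UPPER_BOUNDS.length; i++) {
            if (key.compareTo(UPPER_BOUNDS[i]) <= 0) return i;
        }
        return UPPER_BOUNDS.length;                    // everything above the last bound
    }
}
```

Hash-based splitting (as in the earlier clone sketch) balances volumes more evenly but loses the alignment with table partitions; which matters more depends on where the contention is.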
Monitoring
● Monitoring isn't terribly different from any other batch
monitoring.
● Usual tools, including:
● Scheduler-based monitoring tools - for how the clones are
progressing against the planned schedule.
● SMF - for timings, etc.
● Logs
● Need to demonstrate the application still functions correctly
● Work on “finding the sweet spot”:
● e.g. Is 2 best, or 4, or 8? *
● Work on “balance” (see the sketch below)
– * Note the bias to powers of two
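As a hypothetical example of a "balance" check: compare each clone's elapsed time against the longest leg. The clone names and times are invented; in practice the numbers would come from the scheduler or from SMF.

```java
import java.util.Map;

// Hypothetical balance check: how close is each clone to the longest leg?
// A badly trailing longest leg usually means the split itself needs rework.
public class BalanceCheck {
    public static void main(String[] args) {
        Map<String, Double> elapsedMinutes = Map.of(    // invented measurements
            "CLONE1", 11.0, "CLONE2", 12.5, "CLONE3", 10.5, "CLONE4", 18.0);

        double longestLeg = elapsedMinutes.values().stream()
                                          .mapToDouble(Double::doubleValue).max().orElse(0.0);
        double average = elapsedMinutes.values().stream()
                                       .mapToDouble(Double::doubleValue).average().orElse(0.0);

        elapsedMinutes.forEach((clone, minutes) ->
            System.out.printf("%-8s %5.1f min (%3.0f%% of longest leg)%n",
                              clone, minutes, 100.0 * minutes / longestLeg));
        System.out.printf("Skew: longest leg is %.2fx the average clone%n", longestLeg / average);
    }
}
```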
Case Study
● Our application
● Not meant to be identical to yours
● Scales nicely through iterations
● Process important
● Stepwise progress
● Use e.g. DB2 Accounting Trace to guide
● In the following:
● 0-Up is the original unmodified program
● 1-Up is prepared for 2-Up etc and has
– Reporting removed & replaced by writing a report data file
● Report writing, “fan out” and “fan in” add minimal elapsed time / CPU
● Two key metrics (see the sketch below):
● Total CPU cost
● Longest Leg Elapsed Time
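A sketch, with invented per-clone figures, of how the two metrics fall out of an N-Up run: Total CPU is summed across the clones (plus the fan-out / fan-in steps), while the elapsed time that matters is the longest single leg, since the clones run side by side.

```java
import java.util.List;

// Illustrative only: derive the two key metrics from per-clone measurements.
public class RunMetrics {
    record CloneRun(String name, double cpuMinutes, double elapsedMinutes) {}

    public static void main(String[] args) {
        List<CloneRun> clones = List.of(                 // invented 4-Up measurements
            new CloneRun("CLONE1", 6.1, 8.0),
            new CloneRun("CLONE2", 6.4, 8.6),
            new CloneRun("CLONE3", 5.9, 7.7),
            new CloneRun("CLONE4", 6.2, 8.1));
        double fanOutFanInCpu = 0.3;                     // assumed to be small

        double totalCpu = clones.stream().mapToDouble(CloneRun::cpuMinutes).sum() + fanOutFanInCpu;
        double longestLeg = clones.stream().mapToDouble(CloneRun::elapsedMinutes).max().orElse(0.0);

        System.out.printf("Total CPU:   %.1f minutes%n", totalCpu);
        System.out.printf("Longest leg: %.1f minutes elapsed%n", longestLeg);
    }
}
```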
Don't Commit: Won't Go Beyond 1-Up
45% DB2 CPU, 45% Non-DB2 CPU
[Chart: Total CPU and Max Elapsed, in minutes, for 0-Up and 1-Up]
Commit Every Update: Scales Nicely Up To 8-Up
50% DB2 CPU, 50% Non-DB2 CPU
[Chart: Total CPU and Max Elapsed, in minutes, for 0-Up through 32-Up]
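The contrast between the two charts above comes down to how long each clone holds its locks. Purely as an illustration (JDBC-style, with an invented table and SQL; the case study itself committed on every update), a clone's update loop needs an explicit commit scope along these lines so that sibling clones are not left queueing behind its locks:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Illustrative commit scope inside one clone. Table name, SQL and the commit
// interval are invented; the point is that locks are released often enough
// that sibling clones are not stuck queueing behind them.
public class CloneUpdateLoop {
    private static final int COMMIT_INTERVAL = 1;   // the case study committed every update

    static void applyUpdates(Connection conn, List<Update> updates) throws SQLException {
        conn.setAutoCommit(false);
        String sql = "UPDATE ACCOUNT_BALANCE SET BALANCE = BALANCE + ? WHERE ACCOUNT_ID = ?";
        int sinceCommit = 0;
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            for (Update u : updates) {
                stmt.setBigDecimal(1, u.amount());
                stmt.setString(2, u.accountId());
                stmt.executeUpdate();
                if (++sinceCommit >= COMMIT_INTERVAL) {
                    conn.commit();                  // release locks for the other clones
                    sinceCommit = 0;
                }
            }
            conn.commit();                          // pick up any tail-end updates
        } catch (SQLException e) {
            conn.rollback();                        // back out to the last commit point
            throw e;
        }
    }

    record Update(String accountId, java.math.BigDecimal amount) {}
}
```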
8 Balanced Partitions: 50% Elapsed / CPU Reduction Up To 32-Up
Up to 16-Up almost all time is Non-DB2 CPU; at 32-Up about 50% is “Queue”
Going to 16 or 32 partitions made no difference
[Chart: Total CPU and Max Elapsed, in minutes, for 0-Up through 32-Up]
Case Study Lessons
● Applications mustn't break when you try to clone them
● “Sweet spot” in our case is around 8-up
● Might still drive further if CPU increase acceptable
● Elapsed time got better at 16-up
● Data Management work can help
● Partitioning very nice in our case
● Environmental conditions matter
● In our case CPU contention limited scalability
● DB2 Accounting Trace guided us:
● Explained why preloading data into DB2 buffer pools did nothing

