© 2014 IBM Corporation
Page 1
z/OS Thru-V2R1 Communications
Server Performance Functions Update
David Herr – dherr@us.ibm.com
IBM Raleigh, NC
Thursday, June 12, 2014
633
Trademarks
Notes:
Performance is in Internal Throughput Rate (ITR) ratio based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput that any user will experience will vary
depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that
an individual user will achieve throughput improvements equivalent to the performance ratios stated here.
IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.
All customer examples cited or described in this presentation are presented as illustrations of the manner in which some customers have used IBM products and the results they may have achieved. Actual environmental
costs and performance characteristics will vary depending on individual customer configurations and conditions.
This publication was produced in the United States. IBM may not offer the products, services or features discussed in this document in other countries, and the information may be subject to change without notice. Consult
your local IBM business contact for information on the product or services available in your area.
All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
Information about non-IBM products is obtained from the manufacturers of those products or their published announcements. IBM has not tested those products and cannot confirm the performance, compatibility, or any
other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
Prices subject to change without notice. Contact your IBM representative or Business Partner for the most current pricing in your geography.
This information provides only general descriptions of the types and portions of workloads that are eligible for execution on Specialty Engines (e.g., zIIPs, zAAPs, and IFLs) ("SEs"). IBM authorizes customers to use IBM
SE only to execute the processing of Eligible Workloads of specific Programs expressly authorized by IBM as specified in the “Authorized Use Table for IBM Machines” provided at
www.ibm.com/systems/support/machine_warranties/machine_code/aut.html (“AUT”). No other workload processing is authorized for execution on an SE. IBM offers SE at a lower price than General Processors/Central
Processors because customers are authorized to use SEs only to process certain types and/or amounts of workloads as specified by IBM in the AUT.
The following are trademarks or registered trademarks of other companies.
* Other product and service names might be trademarks of IBM or other companies.
The following are trademarks of the International Business Machines Corporation in the United States and/or other countries.
* Registered trademarks of IBM Corporation
Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries.
Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom.
Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel
Corporation or its subsidiaries in the United States and other countries.
IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency which is now part of the Office of Government Commerce.
ITIL is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office.
Java and all Java based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.
Linear Tape-Open, LTO, the LTO Logo, Ultrium, and the Ultrium logo are trademarks of HP, IBM Corp. and Quantum in the U.S. and other countries.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
OpenStack is a trademark of OpenStack LLC. The OpenStack trademark policy is available on the OpenStack website.
TEALEAF is a registered trademark of Tealeaf, an IBM Company.
Windows Server and the Windows logo are trademarks of the Microsoft group of countries.
Worklight is a trademark or registered trademark of Worklight, an IBM Company.
UNIX is a registered trademark of The Open Group in the United States and other countries.
AIX*
BladeCenter*
CICS*
Cognos*
DataPower*
DB2*
DFSMS
Easy Tier
FICON*
GDPS*
PowerHA*
PR/SM
PureSystems
Rational*
RACF*
RMF
Smarter Planet*
Storwize*
System Storage*
System x*
System z*
System z10*
Tivoli*
WebSphere*
XIV*
zEnterprise*
z10
z10 EC
z/OS*
z/VM*
z/VSE*
HiperSockets*
HyperSwap
IMS
InfiniBand*
Lotus*
MQSeries*
NetView*
OMEGAMON*
Parallel Sysplex*
POWER7*
Agenda
Disclaimer: All statements regarding IBM future direction or intent, including current product plans, are subject to
change or withdrawal without notice and represent goals and objectives only. All information is provided for
informational purposes only, on an “as is” basis, without warranty of any kind.
V2R1 Performance Enhancements
Optimizing inbound communications using
OSA-Express
Optimizing outbound communications
using OSA-Express
OSA-Express4
z/OS Communications Server Performance
Summaries
V2R1 Performance Enhancements
Shared Memory Communications – Remote (SMC-R)
V2R1
[Figure: two z/OS systems, System A and System B, each with a Middleware/Application → Sockets → TCP → IP → Interface stack plus an SMC-R layer with RMBe buffers, attached to both an OSA (IP network, Ethernet) and a RoCE adapter (RDMA network).]

SMC-R background:
TCP connection establishment flows over the IP network (Ethernet); the TCP SYN flows carry TCP options indicating SMC-R capability
Dynamic (in-line) negotiation for SMC-R is initiated by the presence of those TCP options
The TCP connection transitions to SMC-R, allowing application data to be exchanged using RDMA
Both the TCP and SMC-R “connections” remain active
SMC-R - RDMA
V2R1
Key attributes of RDMA
Enables a host to read or write directly from/to a remote host’s memory
without involving the remote host’s CPU
By registering specific memory for RDMA partner use
Interrupts still required for notification (i.e. CPU cycles are not
completely eliminated)
Reduced networking stack overhead by using streamlined, low-level RDMA
interfaces
Key requirements:
A reliable “lossless” network fabric (LAN for layer 2 data center network
distance)
An RDMA capable NIC (RNIC) and RDMA capable switched fabric
(switches)
SMC-R - Solution
V2R1
Shared Memory Communications over RDMA (SMC-R) is a protocol that
allows TCP sockets applications to transparently exploit RDMA (RoCE)
SMC-R is a “hybrid” solution that:
Uses TCP connection (3-way handshake) to establish SMC-R connection
Each TCP end point exchanges TCP options that indicate whether it
supports the SMC-R protocol
SMC-R “rendezvous” (RDMA attributes) information is then exchanged
within the TCP data stream (similar to SSL handshake)
Socket application data is exchanged via RDMA (write operations)
TCP connection remains active (controls SMC-R connection)
This model preserves many critical existing operational and network
management features of TCP/IP
SMC-R – Role of the RMBe (buffer size)
V2R1
The RMBe is a slot in the RMB buffer for a specific TCP connection
Based on TCPRCVBufrsize – but NOT necessarily equal to it
Can be controlled by application using setsockopt() SO_RCVBUF
5 sizes – 32K, 64K, 128K, 256K and 1024K (1MB)
Depending on the workload, a larger RMBe can improve performance
Streaming (bulk) workloads
Less wrapping of the RMBe = fewer RDMA writes
Less frequent “acknowledgement” interrupts to the sending side
Fewer write() blocks on the sending side
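An application influences the RMBe size indirectly through its receive buffer. A minimal sketch in standard C sockets (illustrative only – the function name is hypothetical, and on z/OS the stack maps the requested size onto one of the five RMBe sizes):

```c
#include <stdio.h>
#include <sys/socket.h>

/* Sketch: request a larger receive buffer with SO_RCVBUF.
   On z/OS, the receive buffer size (TCPRCVBufrsize, or an explicit
   SO_RCVBUF set by the application) influences which of the five
   RMBe sizes (32K, 64K, 128K, 256K, 1MB) is chosen for an
   SMC-R connection. */
int request_rcvbuf(int sock, int bytes) {
    /* Ask for 'bytes'; the stack may round or cap the value. */
    return setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
                      (void *)&bytes, sizeof(bytes));
}
```

For a streaming workload, requesting 256K or more per connection trades memory for fewer RMBe wraps, fewer RDMA writes, and fewer acknowledgement interrupts.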
[Figure: a 1MB RMB on the receiving TCP/IP side, with an RMBe slot for one TCP connection. Data arrives over the SMC link into the RMBe; “data waiting to be received” fills the slot toward the wrap point while available space lets the sender keep writing, so the pipe stays full.]
SMC-R – Micro benchmark performance results
V2R1
Response time/Throughput and CPU improvements
Workload:
Using AWM (Application Workload Modeler) to model “socket to socket”
performance using SMC-R
AWM very lightweight - contains no application/business logic
Stresses and measures the networking infrastructure
Real workload benefits will be smaller than the improvements seen
in AWM benchmarks!
MTU: RoCE (1K and 2K) OSA (1500 and 8000)
Large Send enabled for some of the TCP/IP streaming runs
RR1(1/1): Single interactive session with 1 byte request and 1 byte reply
RR10: 10 concurrent connections with various message sizes
STR1(1/20M): Single Streaming session with 1 byte request (Client) and
20,000,000 bytes reply (Server)
Used large RMBs – 1MB
SMC-R – Micro benchmark performance results V2R1
SMC-R (RoCE) vs. TCP/IP (OSA) performance summary, request/response micro-benchmark

[Chart: AWM IPv4 R/R workloads RR1(1/1) and RR10 with 1k/1k through 32k/32k message sizes, % relative to TCP/IP, plotting raw throughput, server CPU, client CPU, and response time. Raw throughput gains range from roughly +108% to +717%; response time reductions range from 51.78% to 88.38%; CPU deltas run from small increases at the 1-byte size down to savings of 55.58% at larger payloads. Annotated peaks: 15.7, 15.9, 14.6, and 9.1 Gb/sec; 177,972 trans/sec (R/R); 28 microsecond full-roundtrip latency.]

Significant latency reduction across all data sizes (52-88%)
Reduced CPU cost as payload increases (up to 56% CPU savings)
Impressive throughput gains across all data sizes (up to +717%)

Note: vs. typical OSA customer configuration – MTU (1500), large send disabled; RoCE MTU: 1K

June 4, 2013
Client, server: 4 CPs, 2827-791 (zEC12 GA2)
Interfaces: 10GbE RoCE Express and 10GbE OSA Express5
SMC-R – Micro benchmark performance results V2R1
Notes:
• Significant throughput benefits and CPU reduction benefits
• Up to 69% throughput improvement
• Up to 66% reduction in CPU costs
• 2K RoCE MTU does yield throughput advantages
• LS – Large Send enabled (Segmentation offload)
[Chart: z/OS V2R1 SMC-R vs. TCP/IP streaming data performance summary (AWM). Workloads STR1(1/20M) and STR3(1/20M) at MTU pairings (RoCE/OSA) of 2K/1500, 1K/1500, and 2K/8000 with large send, % relative to TCP/IP, plotting raw throughput, server CPU, client CPU, and response time. Raw throughput gains range from +16.15% to +68.82%; server and client CPU costs drop roughly 41-66%. Link saturation (8.8-8.9 Gb/sec) was reached. 1MB RMBs used.]

May 29, 2013
Client, server: 2827-791, 2 CPs
Interfaces: 10GbE RoCE Express and 10GbE OSA Express
SMC-R – Micro benchmark performance results
V2R1
Summary –
– Network latency for z/OS TCP/IP based OLTP (request/response)
workloads reduced by up to 80%*
• Networking related CPU consumption reduction for z/OS TCP/IP
based OLTP (request/response) workloads increases as payload size
increases
– Networking related CPU consumption for z/OS TCP/IP based workloads
with streaming data patterns reduced by up to 60% with a network
throughput increase of up to 60%**
– CPU consumption can be further optimized by using larger RMBe sizes
• Less “data consumed” notification processing
• Less data wrapping
• Less data queuing
* Based on benchmarks of modeled z/OS TCP sockets based workloads with request/response traffic patterns using SMC-R vs. TCP/IP. The
actual response times and CPU savings any user will experience will vary.
** Based on benchmarks of modeled z/OS TCP sockets based workloads with streaming data patterns using SMC-R vs. TCP/IP. The benefits
any user will experience will vary
SMC-R – Sysplex Distributor performance results
Line 1 - TCP/IP distributed connections without QDIO Accelerator
Line 2 - TCP/IP distributed connections utilizing QDIO Accelerator
Line 3 - SMC-R distributed connections
V2R1
With SMC-R the distributing stack
is bypassed for inbound data.
Connection setup and SMC-R
rendezvous packets will be the only
inbound traffic going through the
distributing stack.
Remember that all outbound traffic
bypasses the distributing stack for
all scenarios.
SMC-R – Sysplex Distributor performance results
[Chart: Sysplex Distributor request/response (100/800) RR20 workload, % improvement for SMC-R distributed connections, plotting throughput and distributor CPU cost/transaction: vs. QDIO Accelerator, +248% throughput and -89% CPU cost/transaction; vs. no accelerator, +295% throughput and -97% CPU cost/transaction.]
Workload – 20 simultaneous request/response
connections sending 100 and receiving 800 bytes.
Large data workloads would yield even bigger
performance improvements.
Results are from the Sysplex distributing stack’s perspective:
– SMC-R removes virtually all CPU processing on the distributing stack
– 250%+ throughput improvement
V2R1
SMC-R – FTP performance summary
V2R1
The performance measurements discussed in this document were collected using a dedicated system environment. The results
obtained in other configurations or operating system environments may vary significantly depending upon environments used.
[Chart: zEC12 V2R1 SMC vs. OSD performance summary, FTP performance, % relative to OSD. FTP1(1200M): raw throughput +0.73%, client CPU -20.86%, server CPU -48.18%. FTP3(1200M): raw throughput +1.06%, client CPU -15.83%, server CPU -47.49%.]
FTP Performance
FTP binary PUTs to z/OS FTP server, 1 and 3 sessions, transferring 1200
MB data
OSD – OSA Express4 10Gb interface
Reading from and writing to DASD datasets – Limits throughput
AWM FTP client
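For illustration, the measured transfer resembles a simple binary PUT from the client side (host and data set names here are hypothetical):

```
ftp zos2.example.com
binary
put 'USER1.FTP.TESTDATA' 'USER2.FTP.TESTDATA'
quit
```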
SMC-R - WebSphere MQ for z/OS performance
improvement V2R1
Latency improvements
Workload
Measurements using WebSphere MQ V7.1.0
MQ between 2 LPARs on zEC12 machine (10 processors each)
Request/Response workload
On each LPAR, a queue manager was started and configured with 50
outbound sender channels and 50 inbound receiver channels, with
default options for the channel definitions (100 TCP connections)
Each configuration was run with message sizes of 2KB, 32KB and
64KB where all messages were non-persistent
Results were consistent across all three message sizes
SMC-R - WebSphere MQ for z/OS performance
improvement
V2R1
WebSphere MQ for z/OS realizes up to a 3x increase in messages delivered per second
across z/OS systems when using SMC-R vs. standard TCP/IP for 64K messages over 1 channel *
Latency improvements
*Based on internal IBM benchmarks using a modeled WebSphere MQ for z/OS workload driving non-persistent messages across z/OS
systems in a request/response pattern. The benchmarks included various data sizes and number of channel pairs. The actual throughput
and CPU savings users will experience may vary based on the user workload and configuration.
2K, 32K and 64K message sizes
1 to 50 TCP connections each way

[Figure: WebSphere MQ for z/OS using SMC-R – queue managers on z/OS SYSA and z/OS SYSB exchange MQ messages either over standard TCP/IP (OSA Express4S) or over SMC-R (RoCE).]
SMC-R – CICS performance improvement V2R1
Response time and CPU utilization improvements
Workload - Each transaction
– Makes 5 DPL (Distributed Program Link) requests over an
IPIC connection
– Sends 32K container on each request
– Server program receives the data and sends back 32K
– Receives back a 32K container for each request
Note: Results based on internal IBM benchmarks using a modeled CICS workload driving a CICS transaction that performs 5 DPL calls to a
CICS region on a remote z/OS system, using 32K input/output containers. Response times and CPU savings measured on z/OS system initiating
the DPL calls. The actual response times and CPU savings any user will experience will vary.
IPIC - IP Interconnectivity
• Introduced in CICS TS 3.2 / CICS TG 7.1
• TCP/IP based communications
• Alternative to LU6.2/SNA for distributed program calls
SMC-R – CICS performance improvement V2R1
Benchmarks run on z/OS V2R1 with latest zEC12 and new 10GbE RoCE Express feature
– Compared use of SMC-R (10GbE RoCE Express) vs standard TCP/IP (10GbE OSA Express4S) with
CICS IPIC communications for DPL (Distributed Program Link) processing
– Up to 48% improvement in CICS transaction response time as measured on CICS system issuing
the DPL calls (CICS A)
– Up to 10% decrease in overall z/OS CPU consumption on the CICS systems
SMC-R – Websphere to DB2 communications
performance improvement
V2R1
Response time improvements

[Figure: WebSphere to DB2 communications using SMC-R – a Linux on x workload client simulator (JIBE) drives HTTP/REST over 40 concurrent TCP/IP connections to WAS Liberty (TradeLite) on z/OS SYSA; Liberty issues JDBC/DRDA requests (3 per HTTP connection) to DB2 on z/OS SYSB over SMC-R (RoCE) vs. TCP/IP.]

40% reduction in overall transaction response time, as seen from the client’s perspective – small data sizes (~100 bytes)

Based on projections and measurements completed in a controlled environment. Results may vary by customer based on
individual workload, configuration and software levels.
TCP/IP Enhanced Fast Path Sockets V2R1
[Figure: side-by-side receive paths for recv(s1,buffer,…) on z/OS, from the socket application down through the USS Logical File System (LFS), TCP/IP Physical File System (PFS), transport layer (TCP, UDP, RAW), IP, and the interface/device driver to the OSA.]

TCP/IP sockets (normal path):
Full function support for sockets, including support for Unix signals, POSIX compliance
Space switches to OMVS and then to TCP/IP on the way down
When TCP/IP needs to suspend a thread waiting for network flows, USS suspend/resume (wait/post) services are invoked

TCP/IP fast path sockets (pre-V2R1):
Streamlined path through the USS LFS for selected socket APIs, with a single space switch to TCP/IP
TCP/IP performs the wait/post or suspend/resume inline using its own services
Significant reduction in path length
TCP/IP Enhanced Fast Path Sockets V2R1
Pre-V2R1 fast path provided CPU savings but not widely
adopted:
No support for Unix signals (other than SIGTERM)
Only useful to applications that have no requirement for
signal support
No DBX support (debugger)
Must be explicitly enabled!
BPXK_INET_FASTPATH environment variable
Iocc#FastPath IOCTL
Only supported for UNIX System Services socket API or the
z/OS XL C/C++ Run-time Library functions
TCP/IP Enhanced Fast Path Sockets V2R1
[Figure: V2R1 enhanced fast path sockets – recv/recvfrom/recvmsg and send/sendto/sendmsg take a streamlined path through the USS LFS into the TCP/IP PFS, with TCP/IP’s own pause/release services replacing USS suspend/resume and the OMVS space switch eliminated.]
Fast path sockets performance without all
the conditions!
• Enabled by default
• Full POSIX compliance, signals support
and DBX support
• Valid for ALL socket APIs (with the
exception of the Pascal API)
TCP/IP Enhanced Fast Path Sockets V2R1
No new externals
Explicitly activating fast path (the pre-V2R1 method) is still
supported, to avoid migration issues:
– Provides the performance benefits of enhanced fast path
sockets
– Keeps the following restrictions:
Does not support POSIX signals (blocked by z/OS
UNIX)
Cannot use the dbx debugger
TCP/IP Enhanced Fast Path Sockets
V2R1
Note: The performance measurements discussed in this presentation are z/OS V2R1 Communications Server
numbers and were collected using a dedicated system environment. The results obtained in other
configurations or operating system environments may vary.
V2R1 IPv4 AWM primitives: V2R1 with fast path vs. V2R1 without fast path

[Chart: % relative to without fast path, plotting raw throughput, client CPU, and server CPU (approximate reading):
RR40 (1h/8h): raw throughput +12.48, CPU-client -22.32, CPU-server -23.77
CRR20 (64/8k): raw throughput +2.23, CPU-client -9.12, CPU-server -3.04
STR3 (1/20M): raw throughput -0.35, CPU-client -4.97, CPU-server -4.97]

May 2, 2013
Client and server LPARs: zEC12 with 6 CPs per LPAR
Interface: OSA-E4 10 GbE
V2R1
Background information: IP filtering basics
IP filtering at the z/OS IP Layer
Filter rules defined based on relevant
attributes
Used to control routed and local traffic
Defined actions taken when a filter rule is
matched
IP filter rules are defined in three ways:
TCPIP profile
Policy Agent
Defense Manager Daemon
“Not Filtering” routed traffic means that
all routed traffic is permitted by the
effective filter rules
[Figure: two stack views (applications, TCP/UDP, IPv4 & IPv6, interfaces) with defensive filters and filter policy (pagent or profile) applied both to routed traffic – traffic routed through this TCP/IP stack, not including Sysplex Distributor connection routing – and to local traffic – traffic going to or coming from applications on this TCP/IP stack only.]

QDIO Accelerator coexistence with IP Filtering
Background information: QDIO Accelerator
Provides fast path IP forwarding for these
DLC combinations
Inbound QDIO, outbound QDIO or HiperSockets
Inbound HiperSockets, outbound QDIO or
HiperSockets
Sysplex Distributor (SD) acceleration
Inbound packets over HiperSockets or OSA-E
QDIO
When SD gets to the target stack using either
• Dynamic XCF connectivity over HiperSockets
• VIPAROUTE over OSA-E QDIO
Improves performance and reduces processor
usage for such workloads
[Figure: without acceleration, forwarded packets climb from the DLC through the IP-layer routing table and connection routing table on the routing stack and back down. With QDIO Accelerator, the stack keeps shadow copies of selected entries from the connection routing table and IP-layer routing table at the DLC layer, so packets are forwarded between OSA and HiperSockets DLCs without full stack processing.]

V1R11
Problem
No support for acceleration when IP security enabled
• Even if stack processing is not needed for forwarded traffic
Solution
Allow QDIO accelerator for routed traffic when IPCONFIG IPSECURITY and
IPCONFIG QDIOACCELERATOR configured
• QDIO accelerator for non-Sysplex Distributor routed traffic requires that the
accelerating stack not filter or log routed traffic
• Always allow QDIO accelerator for Sysplex Distributor traffic
Not supported for HiperSockets accelerator
QDIO Accelerator coexistence with IP Filtering
QDIO Accelerator: IPCONFIG syntax and performance results
…
| _NOQDIOACCELerator________________________ |
|_|__________________________________________|_________|
| | _QDIOPriority 1________| |
| |_QDIOACCELerator__|_______________________| |
| |_QDIOPriority priority_| |
…
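Putting the syntax above together, a minimal IPCONFIG fragment enabling both IP security and the accelerator might look like this (a sketch; in practice it is combined with your other IPCONFIG operands):

```
IPCONFIG IPSECURITY QDIOACCELERATOR QDIOPRIORITY 1
```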
[Chart: CPU per transaction, % relative to no QDIO Accelerator, request-response workload RR20 (20 sessions, 100/800 bytes): z/OS client -0.9%, z/OS Sysplex Distributor -57.45%, z/OS target -0.33%.]
FTP using zHPF – Improving throughput
There are many factors that influence the transfer rates for z/OS FTP
connections. Some of the more significant ones are (in order of impact):
– DASD read/write access
– Data transfer type (Binary, ASCII..)
– Dataset characteristics (e.g., fixed block or variable)
*Note: the network characteristics (HiperSockets, OSA, 10Gb, SMC-R)
have very little impact when reading from, and writing to, DASD, as you
will see in the results section.
zHPF FAQ link
• http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/FQ127122
• Works with DS8000 storage systems
FTP using zHPF – Improving throughput
FTP Workload
z/OS FTP client GET or PUT 1200 MB data set from or to z/OS FTP
server
DASD to DASD (read from or write to)
zHPF enabled/disabled
Single file transfer
Used Variable block data set for the test
Organization .... PS
Record Format ...VB
Record Length …6140
Block size ...........23424
For HiperSockets:
Configure GLOBALCONFIG IQDMULTIWRITE
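As an illustration, the profile statement referenced above is a one-line GLOBALCONFIG operand (shown as a minimal fragment; it is combined with any other GLOBALCONFIG operands in the TCPIP profile):

```
GLOBALCONFIG IQDMULTIWRITE
```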
FTP using zHPF – Improving throughput
Throughput is improved by 43-49% by enabling zHPF
Optimizing inbound communications
using
OSA-Express
Timing Considerations for Various Inbound workloads…
Inbound Streaming Traffic Pattern
Interactive Traffic Pattern
For inbound streaming traffic, it’s most efficient to have OSA
defer interrupting z/OS until it sees a pause in the stream…. To accomplish
this, we’d want the OSA LAN-Idle timer set fairly high (e.g., don’t interrupt
unless there’s a traffic pause of at least 20 microseconds)
[Figure: inbound streaming traffic pattern at the receiving OSA-Express – packets arrive tightly spaced, then a pause before the next burst.]
But for interactive traffic, response time would be best if OSA would
interrupt z/OS immediately…. To accomplish this, we’d want the OSA LAN-Idle timer
set as low as it can go (e.g., 1 microsecond)
single packet (request) IN
single packet (response) OUT
For detailed discussion on inbound interrupt timing, please see Part 1 of “z/OS Communications Server V1R12 Performance Study: OSA-
Express3 Inbound Workload Queueing”. http://www-01.ibm.com/support/docview.wss?uid=swg27005524
Read-Side interrupt
frequency is all
about the LAN-Idle
timer!
Dynamic LAN Idle Timer –
Introduced in z/OS V1R9
With Dynamic LAN Idle, blocking times
are now dynamically adjusted by the host
in response to the workload
characteristics.
Optimizes interrupts and latency!
[Figure: the OSA-generated PCI interrupt path between LAN, OSA, and host. Under heavy workloads the OSA blocks more (fewer, batched interrupts); under light workloads it blocks less (lower latency).]
Dynamic LAN Idle Timer: Configuration
Configure INBPERF DYNAMIC on the INTERFACE statement
– BALANCED (default) - a static interrupt-timing value, selected to achieve
reasonably high throughput and reasonably low CPU
– DYNAMIC - a dynamic interrupt-timing value that changes based on current
inbound workload conditions
– MINCPU - a static interrupt-timing value, selected to minimize host interrupts
without regard to throughput
– MINLATENCY - a static interrupt-timing value, selected to minimize latency
Note: These values cannot be changed without stopping and restarting the interface
>>-INTERFace--intf_name----------------------------------------->
.
.-INBPERF BALANCED--------.
>--+-------------------------+-------->
'-INBPERF--+-DYNAMIC----+-'
+-MINCPU-----+
'-MINLATENCY-'
.
Generally recommended!
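For illustration, an INTERFACE definition requesting the dynamic timer might look like the sketch below (interface name, port name, and address are hypothetical):

```
INTERFACE OSAQDIOA DEFINE IPAQENET
  PORTNAME OSAQDIO
  IPADDR 172.16.1.1/24
  VMAC
  INBPERF DYNAMIC
```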
Dynamic LAN Idle Timer: But what about mixed workloads?
[Figure: receiving OSA-Express3 with a mixed inbound flow – connection A (streaming, tightly spaced packets) interleaved with connection B (interactive).]
INBPERF DYNAMIC (Dynamic LAN Idle) is great for EITHER streaming OR
interactive…but if BOTH types of traffic are running together,
DYNAMIC mode will tend toward CPU conservation (elongating
the LAN-Idle timer). So in a mixed (streaming + interactive) workload,
the interactive flows will be delayed, waiting for the OSA to detect a pause
in the stream…..
Inbound Workload Queuing
With OSA-Express3/4S IWQ and z/OS
V1R12, OSA now directs streaming traffic
onto its own input queue – transparently
separating the streaming traffic away from
the more latency-sensitive interactive
flows…
And each input queue has its own LAN-Idle
timer, so the Dynamic LAN Idle function can
now tune the streaming (bulk) queue to
conserve CPU (high LAN-idle timer setting),
while generally allowing the primary queue
to operate with very low latency (minimizing
its LAN-idle timer setting). So interactive
traffic (on the primary input queue) may see
significantly improved response time.
The separation of streaming traffic away
from interactive also enables new streaming
traffic efficiencies in Communications
Server. This results in improved in-order
delivery (better throughput and CPU
consumption).
[Figure: with IWQ, the OSA splits inbound traffic onto separate input queues – streaming, Sysplex Distributor, Enterprise Extender (added in V1R13), and the default (interactive) queue – each serviced on its own CPU, with a custom LAN-Idle timer and interrupt processing for each traffic pattern.]

V1R12
Improved Streaming Traffic Efficiency With IWQ
Before we had IWQ, multiprocessor races would degrade streaming performance:

[Figure: two QDIO read interrupts (t1, t2) dispatch SRB 1 on CP 0 and SRB 2 on CP 1, each servicing a batch of inbound packets for connections A through D.]

At the time CP1 (SRB 2) starts the TCP-layer processing for connection A’s first packet, CP0 (SRB 1) has progressed only into connection C’s packets. So the connection A packets carried by SRB 2 will be seen before those carried by SRB 1. This is out-of-order packet delivery, brought on by multiprocessor races through TCP/IP inbound code. Out-of-order delivery will consume excessive CPU and memory, and usually leads to throughput problems.

IWQ does away with MP-race-induced ordering problems! With streaming traffic sorted onto its own queue, it is now convenient to service streaming traffic from a single CP (i.e., using a single SRB). So with IWQ, we no longer have inbound SRB races for streaming data.
QDIO Inbound Workload Queuing – Configuration
INBPERF DYNAMIC WORKLOADQ enables QDIO Inbound Workload
Queuing (IWQ)
>>-INTERFace--intf_name----------------------------------------->
.
.-INBPERF BALANCED--------------------.
>--+-------------------------------------+-->
| .-NOWORKLOADQ-. |
‘-INBPERF-+-DYNAMIC-+-------------+-+-’
| ‘-WORKLOADQ---’ |
+-MINCPU------------------+
‘-MINLATENCY--------------’
– INTERFACE statements only - no support for DEVICE/LINK definitions
– QDIO Inbound Workload Queuing requires VMAC
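For illustration, enabling IWQ on the same style of INTERFACE definition (interface name, port name, and address are hypothetical; note that VMAC is required):

```
INTERFACE OSAQDIOA DEFINE IPAQENET
  PORTNAME OSAQDIO
  IPADDR 172.16.1.1/24
  VMAC
  INBPERF DYNAMIC WORKLOADQ
```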
QDIO Inbound Workload Queuing
Display OSAINFO command (V1R12) shows you what’s registered in OSA
BULKDATA queue registers 5-tuples with OSA (streaming connections)
SYSDIST queue registers Distributable DVIPAs with OSA
D TCPIP,,OSAINFO,INTFN=V6O3ETHG0
.
Ancillary Input Queue Routing Variables:
Queue Type: BULKDATA Queue ID: 2 Protocol: TCP
Src: 2000:197:11:201:0:1:0:1..221
Dst: 100::101..257
Src: 2000:197:11:201:0:2:0:1..290
Dst: 200::202..514
Total number of IPv6 connections: 2
Queue Type: SYSDIST Queue ID: 3 Protocol: TCP
Addr: 2000:197:11:201:0:1:0:1
Addr: 2000:197:11:201:0:2:0:1
Total number of IPv6 addresses: 2
36 of 36 Lines Displayed
End of report
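Conceptually, OSA uses these registrations to steer inbound traffic: a packet matching a registered 5-tuple lands on the BULKDATA queue, a packet destined to a registered distributable DVIPA lands on SYSDIST, and everything else goes to the primary queue. A rough Python sketch of that selection (illustrative only; the tuple layout and matching rules here are assumptions, not OSA microcode):

```python
# Illustrative model of OSA inbound-queue selection (not actual microcode).
# Registered values echo the D TCPIP,,OSAINFO display above.
bulkdata = {("tcp", "2000:197:11:201:0:1:0:1", 221, "100::101", 257)}
sysdist = {"2000:197:11:201:0:2:0:1"}  # registered distributable DVIPA

def select_queue(proto, src_ip, src_port, dst_ip, dst_port):
    """Match a registered 5-tuple first, then a distributable DVIPA, else primary."""
    if (proto, src_ip, src_port, dst_ip, dst_port) in bulkdata:
        return "BULKDATA"
    if dst_ip in sysdist:
        return "SYSDIST"
    return "PRIMARY"

print(select_queue("tcp", "2000:197:11:201:0:1:0:1", 221, "100::101", 257))
# BULKDATA
```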
© 2014 IBM Corporation
Page 42
QDIO Inbound Workload Queuing: Netstat DEvlinks/-d
Display TCPIP,,Netstat,DEvlinks to see whether QDIO inbound workload
queueing is enabled for a QDIO interface
D TCPIP,,NETSTAT,DEVLINKS,INTFNAME=QDIO4101L
EZD0101I NETSTAT CS V1R12 TCPCS1
INTFNAME: QDIO4101L INTFTYPE: IPAQENET INTFSTATUS: READY
PORTNAME: QDIO4101 DATAPATH: 0E2A DATAPATHSTATUS: READY
CHPIDTYPE: OSD
SPEED: 0000001000
...
READSTORAGE: GLOBAL (4096K)
INBPERF: DYNAMIC
WORKLOADQUEUEING: YES
CHECKSUMOFFLOAD: YES
SECCLASS: 255 MONSYSPLEX: NO
ISOLATE: NO OPTLATENCYMODE: NO
...
1 OF 1 RECORDS DISPLAYED
END OF THE REPORT
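Checking many interfaces by eye gets tedious; the report is easy to scan programmatically. A hedged Python sketch (field names are taken from the display above; how you capture the command output is left out):

```python
def workloadq_enabled(netstat_text):
    """Map INTFNAME -> True/False for WORKLOADQUEUEING in Netstat DEvlinks output."""
    result, current = {}, None
    for line in netstat_text.upper().splitlines():
        line = line.strip()
        if line.startswith("INTFNAME:"):
            current = line.split()[1]          # interface name follows the tag
        elif line.startswith("WORKLOADQUEUEING:") and current:
            result[current] = line.split()[1] == "YES"
    return result

sample = """INTFNAME: QDIO4101L INTFTYPE: IPAQENET INTFSTATUS: READY
WORKLOADQUEUEING: YES"""
print(workloadq_enabled(sample))  # {'QDIO4101L': True}
```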
© 2014 IBM Corporation
Page 43
QDIO Inbound Workload Queuing: Display TRLE
Display NET,TRL,TRLE=trlename to see whether QDIO inbound workload
queueing is in use for a QDIO interface
D NET,TRL,TRLE=QDIO101
IST097I DISPLAY ACCEPTED
...
IST2263I PORTNAME = QDIO4101 PORTNUM = 0 OSA CODE LEVEL = ABCD
...
IST1221I DATA DEV = 0E2A STATUS = ACTIVE STATE = N/A
IST1724I I/O TRACE = OFF TRACE LENGTH = *NA*
IST1717I ULPID = TCPCS1
IST2310I ACCELERATED ROUTING DISABLED
IST2331I QUEUE QUEUE READ
IST2332I ID TYPE STORAGE
IST2205I ------ -------- ---------------
IST2333I RD/1 PRIMARY 4.0M(64 SBALS)
IST2333I RD/2 BULKDATA 4.0M(64 SBALS)
IST2333I RD/3 SYSDIST 4.0M(64 SBALS)
...
IST924I -------------------------------------------------------------
IST314I END
© 2014 IBM Corporation
Page 44
QDIO Inbound Workload Queuing: Netstat ALL/-A
Display TCPIP,,Netstat,ALL to see whether the QDIO inbound BULKDATA workload queue is in use for a given connection
D TCPIP,,NETSTAT,ALL,CLIENT=USER1
EZD0101I NETSTAT CS V1R12 TCPCS1
CLIENT NAME: USER1 CLIENT ID: 00000046
LOCAL SOCKET: ::FFFF:172.16.1.1..20
FOREIGN SOCKET: ::FFFF:172.16.1.5..1030
BYTESIN: 00000000000023316386
BYTESOUT: 00000000000000000000
SEGMENTSIN: 00000000000000016246
SEGMENTSOUT: 00000000000000000922
LAST TOUCHED: 21:38:53 STATE: ESTABLSH
...
Ancillary Input Queue: Yes
BulkDataIntfName: QDIO4101L
...
APPLICATION DATA: EZAFTP0S D USER1 C PSSS
----
1 OF 1 RECORDS DISPLAYED
END OF THE REPORT
© 2014 IBM Corporation
Page 45
QDIO Inbound Workload Queuing: Netstat STATS/-S
Display TCPIP,,Netstat,STATS to see the total number of TCP segments
received on BULKDATA queues
D TCPIP,,NETSTAT,STATS,PROTOCOL=TCP
EZD0101I NETSTAT CS V1R12 TCPCS1
TCP STATISTICS
CURRENT ESTABLISHED CONNECTIONS = 6
ACTIVE CONNECTIONS OPENED = 1
PASSIVE CONNECTIONS OPENED = 5
CONNECTIONS CLOSED = 5
ESTABLISHED CONNECTIONS DROPPED = 0
CONNECTION ATTEMPTS DROPPED = 0
CONNECTION ATTEMPTS DISCARDED = 0
TIMEWAIT CONNECTIONS REUSED = 0
SEGMENTS RECEIVED = 38611
...
SEGMENTS RECEIVED ON OSA BULK QUEUES= 2169
SEGMENTS SENT = 2254
...
END OF THE REPORT
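From these counters you can gauge how much inbound TCP traffic is riding the BULKDATA queues; with the sample numbers above, 2169 of 38611 segments:

```python
# Sample counters from the Netstat STATS display above.
segments_received = 38611
bulk_queue_segments = 2169

bulk_share = 100.0 * bulk_queue_segments / segments_received
print(f"{bulk_share:.1f}% of TCP segments arrived on OSA bulk queues")  # 5.6%
```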
© 2014 IBM Corporation
Page 46
Quick INBPERF Review Before We Push On….
The original static INBPERF settings (MINCPU, MINLATENCY, BALANCED)
provide sub-optimal performance for workloads that tend to shift between
request/response and streaming modes.
We therefore recommend customers specify INBPERF DYNAMIC, since it self-tunes to provide excellent performance even when inbound traffic patterns shift.
Inbound Workload Queueing (IWQ) mode is an extension to the Dynamic LAN
Idle function. IWQ improves upon the DYNAMIC setting, in part because it
provides finer interrupt-timing control for mixed (interactive + streaming)
workloads.
© 2014 IBM Corporation
Page 47
Optimized Latency Mode (OLM)
z/OS software and OSA-Express3 and above microcode can further reduce latency via some aggressive
processing changes (enabled via the OLM keyword on the INTERFACE statement):
– Inbound
• OSA-Express signals host if data is “on its way” (“Early Interrupt”)
• Host may spin for a while, if the early interrupt is fielded before the inbound data is “ready”
– Outbound
• OSA-Express does not wait for SIGA to look for outbound data (“SIGA reduction”)
• OSA-Express microprocessor may spin for a while, looking for new outbound data to transmit
OLM is intended for workloads that have demanding QoS requirements for response time (transaction
rate)
– high volume request/response workloads (traffic is predominantly transaction oriented versus
streaming)
The latency-reduction techniques employed by OLM will limit the degree to which the OSA can be shared
among partitions
(Diagram: a Request flows from the application client through its TCP/IP stack and OSA, across the network, to the application server, and the Response returns along the same path; SIGA-write and PCI interrupts occur at each host-to-OSA crossing.)
V1R11
© 2014 IBM Corporation
Page 48
Optimized Latency Mode (OLM): How to configure
New OLM parameter
– IPAQENET/IPAQENET6
– Not allowed on
DEVICE/LINK
Enables Optimized Latency
Mode for this INTERFACE only
Forces INBPERF to DYNAMIC
Default NOOLM
INTERFACE NSQDIO411 DEFINE IPAQENET
IPADDR 172.16.11.1/24
PORTNAME NSQDIO1
MTU 1492 VMAC OLM
INBPERF DYNAMIC
SOURCEVIPAINTERFACE LVIPA1
D TCPIP,,NETSTAT,DEVLINKS,INTFNAME=LNSQDIO1
JOB 6 EZD0101I NETSTAT CS V1R11 TCPCS
INTFNAME: LNSQDIO1 INTFTYPE: IPAQENET INTFSTATUS: READY
.
READSTORAGE: GLOBAL (4096K) INBPERF: DYNAMIC
.
ISOLATE: NO OPTLATENCYMODE: YES
Use Netstat DEvlinks/-d to see current OLM configuration
© 2014 IBM Corporation
Page 49
Optimized Latency Mode (OLM): Performance Data
– Client and Server
• Have minimal application
logic
– RR1
• 1 session
• 1 byte in, 1 byte out
– RR20
• 20 sessions
• 128 bytes in, 1024 bytes out
– RR40
• 40 sessions
• 128 bytes in, 1024 bytes out
– RR80
• 80 sessions
• 128 bytes in, 1024 bytes out
(Test setup: AWM client and server, each a TCP/IP stack attached through an OSA-E3.)
(Charts: end-to-end latency [response time] in microseconds [lower is better] and transaction rate in transactions per second [higher is better] for RR1, RR20, RR40, and RR80, comparing DYNAMIC with DYN+OLM.)
Test environment: z10 (4 CP LPARs), z/OS V1R13, OSA-E3 1GbE.
Note: The performance measurements discussed in this presentation are z/OS V1R13 Communications Server numbers and were collected using a dedicated system environment. The results obtained in other configurations or operating system environments may vary.
© 2014 IBM Corporation
Page 50
Dynamic LAN Idle Timer: Performance Data
Dynamic LAN Idle improved RR1 TPS by 50% and RR10 TPS by 88%. Response time for these workloads improved by 33% and 47%, respectively.
1h/8h indicates 100 bytes in and 800 bytes out
Test environment: z10 (4 CP LPARs), z/OS V1R13, OSA-E3 1GbE.
(Chart: Dynamic LAN Idle vs. Balanced, percent change for RR1(1h/8h) and RR10(1h/8h): trans/sec +50.1 and +87.7; response time -33.4 and -47.4.)
Note: The performance measurements discussed in this presentation are z/OS V1R13 Communications Server
numbers and were collected using a dedicated system environment. The results obtained in other
configurations or operating system environments may vary.
© 2014 IBM Corporation
Page 51
Inbound Workload Queuing: Performance Data
Test bed: two z/OS V1R12 systems on a z10 (3 CP LPARs) and AIX 5.3 on a p570, connected through OSA-Express3s (in Dynamic or IWQ mode) over a 1GbE or 10GbE network.
For z/OS outbound streaming to another
platform, the degree of performance boost
(due to IWQ) is relative to receiving platform’s
sensitivity to out-of-order packet delivery. For
streaming INTO z/OS, IWQ will be especially
beneficial for multi-CP configurations.
IWQ: Mixed Workload Results vs DYNAMIC:
–z/OS<->AIX R/R Throughput improved 55% (Response
Time improved 36%)
–Streaming Throughput also improved in this test: +5%
(Chart: mixed workload, IWQ vs. DYNAMIC: RR30 request/response transactions/sec [z/OS to AIX] and STR1 streaming KB/sec [z/OS to z/OS], each higher with IWQ.)
© 2014 IBM Corporation
Page 52
Inbound Workload Queuing: Performance Data
Test bed: two z/OS V1R12 systems on a z10 (3 CP LPARs) and AIX 5.3 on a p570, connected through OSA-Express3s (in Dynamic or IWQ mode) over a 1GbE or 10GbE network.
IWQ: Pure Streaming Results vs DYNAMIC:
–z/OS<->AIX Streaming Throughput improved 40%
–z/OS<->z/OS Streaming Throughput improved 24%
(Chart: pure streaming throughput in MB/sec for z/OS to AIX and z/OS to z/OS, IWQ vs. DYNAMIC.)
For z/OS outbound streaming to another
platform, the degree of performance boost
(due to IWQ) is relative to receiving platform’s
sensitivity to out-of-order packet delivery. For
streaming INTO z/OS, IWQ will be especially
beneficial for multi-CP configurations.
© 2014 IBM Corporation
Page 53
IWQ Usage Considerations:
Minor ECSA Usage increase: IWQ will grow ECSA usage by 72KBytes (per
OSA interface) if Sysplex Distributor (SD) or EE is in use; 36KBytes if SD and EE
are not in use
IWQ requires OSA-Express3 in QDIO mode running on IBM System z10; OSA-Express3 or OSA-Express4 in QDIO mode running on zEnterprise 196; or OSA-Express5 in QDIO mode running on zEC12.
IWQ must be configured using the INTERFACE statement (not DEVICE/LINK)
IWQ is not supported when z/OS is running as a z/VM guest with simulated
devices (VSWITCH or guest LAN)
Make sure to apply z/OS V1R12 PTF UK61028 (APAR PM20056) for added
streaming throughput boost with IWQ
© 2014 IBM Corporation
Page 54
Optimizing outbound
communications using OSA-
Express
© 2014 IBM Corporation
Page 55
TCP Segmentation Offload
Segmentation consumes (high cost) host CPU cycles in the TCP stack
Segmentation Offload (also referred to as “Large Send”)
– Offload most IPv4 and/or IPv6 TCP segmentation processing to OSA
– Decrease host CPU utilization
– Increase data transfer efficiency
– Checksum offload also added for IPv6
(Diagram: the host passes a single large segment [1-4] to the OSA; TCP segmentation is performed in the OSA, which sends the individual segments 1, 2, 3, 4 onto the LAN.)
V1R13
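The work being moved off the host is conceptually simple; a minimal Python sketch of splitting one large send into MSS-sized pieces (illustrative only: real large send also builds and fixes up the TCP/IP headers for each segment in the OSA):

```python
def segment(payload: bytes, mss: int):
    """Split one large TCP payload into MSS-sized segments (headers omitted)."""
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]

# A 5000-byte send with a typical 1460-byte MSS (1500-byte MTU minus headers):
segs = segment(b"x" * 5000, 1460)
print([len(s) for s in segs])  # [1460, 1460, 1460, 620]
```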
© 2014 IBM Corporation
Page 56
z/OS Segmentation Offload performance measurements
(Chart: STR-3 streaming over an OSA-Express4 10Gb, relative to no offload: CPU/MB -35.8 percent, throughput +8.6 percent.)
Segmentation offload may significantly reduce CPU cycles when sending bulk data from z/OS!
Send buffer size: 180K for streaming workloads
Note: The performance measurements discussed in this presentation are z/OS V1R13 Communications Server
numbers and were collected using a dedicated system environment. The results obtained in other
configurations or operating system environments may vary.
© 2014 IBM Corporation
Page 57
TCP Segmentation Offload: Configuration
Enabled with IPCONFIG/IPCONFIG6 SEGMENTATIONOFFLOAD
Disabled by default
Previously enabled via GLOBALCONFIG
Segmentation cannot be offloaded for
– Packets to another stack sharing the OSA port
– IPSec encapsulated packets
– When multipath is in effect (unless all interfaces in the multipath group support segmentation offload)
>>-IPCONFIG----------------------------------------------------->
   .
   .
>----+-----------------------------------------------+-+-------><
     |  .-NOSEGMENTATIONOFFLoad-.                     |
     +-+-----------------------+----------------------+
        '-SEGMENTATIONOFFLoad---'
V1R13
Reminder! Checksum Offload is enabled by default
© 2014 IBM Corporation
Page 58
z/OS Checksum Offload performance measurements
V1R13
Note: The performance measurements discussed in this presentation are z/OS V2R1 Communications Server
numbers and were collected using a dedicated system environment. The results obtained in other
configurations or operating system environments may vary.
(Chart: effect of IPv6 checksum offload relative to no checksum offload on zEC12, 2 CPs, V2R1, OSA-Express4 10Gb interface; AWM IPv6 primitives workloads RR30(1h/8h), CRR20(64/8k), and STR3(1/20M); client and server CPU changes of -1.59/-2.66, -5.23/-8.06, and -13.65/-14.83 percent, respectively.)
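For reference, the computation being offloaded here is the standard Internet checksum (RFC 1071): a one's-complement sum of 16-bit words. A minimal Python version, just to show what the host no longer computes per packet:

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: one's-complement sum of 16-bit words."""
    if len(data) % 2:              # pad odd-length data with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold carry back in
    return (~total) & 0xFFFF

# The worked example from RFC 1071:
print(hex(internet_checksum(b"\x00\x01\xf2\x03\xf4\xf5\xf6\xf7")))  # 0x220d
```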
© 2014 IBM Corporation
Page 59
OSA-Express4
© 2014 IBM Corporation
Page 60
OSA-Express4 Enhancements – 10GB improvements
Improved on-card processor speed and memory bus provides better utilization of
10GB network
Test environment: z196 (4 CP LPARs), z/OS V1R13, OSA-E3/OSA-E4 10GbE.
(Chart: OSA 10GbE inbound bulk traffic throughput: OSA-E3 489 MB/sec vs. OSA-E4 874 MB/sec.)
Note: The performance measurements discussed in this presentation are z/OS V1R13 Communications Server
numbers and were collected using a dedicated system environment. The results obtained in other
configurations or operating system environments may vary.
© 2014 IBM Corporation
Page 61
OSA-Express4 Enhancements – EE Inbound Queue
Enterprise Extender queue provides internal optimizations
EE traffic is processed more quickly
Avoids memory copy of data
Test environment: z196 (4 CP LPARs), z/OS V1R13, OSA-E3/OSA-E4 1GbE.
Note: The performance measurements discussed in this presentation are z/OS V1R13 Communications Server
numbers and were collected using a dedicated system environment. The results obtained in other
configurations or operating system environments may vary.
(Chart: OSA 1GbE mixed TCP and EE workloads, MIQ vs. Dynamic: TCP STR1(1/20MB) +2.6 percent throughput and -0.4 percent CPU/trans; EE RR10(1h/8h) +32.9 percent trans/sec and -2.9 percent CPU/trans.)
© 2014 IBM Corporation
Page 62
OSA-Express4 Enhancements – Other improvements
Checksum Offload support for IPv6 traffic
Segmentation Offload support for IPv6 traffic
© 2014 IBM Corporation
Page 63
z/OS Communications Server
Performance Summaries
© 2014 IBM Corporation
Page 64
z/OS Communications Server Performance Summaries
Performance of each z/OS Communications Server release is studied by an
internal performance team
Summaries are created and published online
– http://www-01.ibm.com/support/docview.wss?rs=852&uid=swg27005524
Recently added:
– The z/OS V2R1 Communications Server Performance Summary
– Release to release comparisons
– Capacity planning information
– IBM z/OS Shared Memory Communications over RDMA: Performance
Considerations - Whitepaper
© 2014 IBM Corporation
Page 65
z/OS Communications Server Performance Website
http://www-01.ibm.com/support/docview.wss?uid=swg27005524
© 2014 IBM Corporation
Page 66
Detailed Usage Considerations for
IWQ and OLM
© 2014 IBM Corporation
Page 67
IWQ Usage Considerations:
Minor ECSA Usage increase: IWQ will grow ECSA usage by 72KBytes (per
OSA interface) if Sysplex Distributor (SD) is in use; 36KBytes if SD is not in use
IWQ requires OSA-Express3 in QDIO mode running on IBM System z10 or OSA-
Express3/OSA-Express4 in QDIO mode running on zEnterprise 196.
– For z10: the minimum field level recommended for OSA-Express3 is
microcode level- Driver 79, EC N24398, MCL006
– For z196 GA1: the minimum field level recommended for OSA-Express3 is
microcode level- Driver 86, EC N28792, MCL009
– For z196 GA2: the minimum field level recommended for OSA-Express3 is
microcode level- Driver 93, EC N48158, MCL009
– For z196 GA2: the minimum field level recommended for OSA-Express4 is
microcode level- Driver 93, EC N48121, MCL010
IWQ must be configured using the INTERFACE statement (not DEVICE/LINK)
IWQ is not supported when z/OS is running as a z/VM guest with simulated
devices (VSWITCH or guest LAN)
Make sure to apply z/OS V1R12 PTF UK61028 (APAR PM20056) for added
streaming throughput boost with IWQ
© 2014 IBM Corporation
Page 68
OLM Usage Considerations(1): OSA Sharing
The number of concurrent interfaces to an OSA-Express port using OLM is limited.
– If one or more interfaces operate OLM on a given port,
• Only four total interfaces allowed to that single port
• Only eight total interfaces allowed to that CHPID
– All four interfaces can operate in OLM
– An interface can be:
• Another interface (e.g. IPv6) defined for this OSA-Express port
• Another stack on the same LPAR using the OSA-Express port
• Another LPAR using the OSA-Express port
• Another VLAN defined for this OSA-Express port
• Any stack activating the OSA-Express Network Traffic Analyzer
(OSAENTA)
© 2014 IBM Corporation
Page 69
OLM Usage Considerations (2):
QDIO Accelerator or HiperSockets Accelerator will not accelerate traffic to or from
an OSA-Express operating in OLM
OLM usage may increase z/OS CPU consumption (due to “early interrupt”)
– Usage of OLM is therefore not recommended on z/OS images expected to
normally be running at extremely high utilization levels
– OLM does not apply to the bulk-data input queue of an IWQ-mode
OSA. From a CPU-consumption perspective, OLM is therefore a more
attractive option when combined with IWQ than without IWQ
Only supported on OSA-Express3 and above with the INTERFACE statement
Enabled via PTFs for z/OS V1R11
– PK90205 (PTF UK49041) and OA29634 (UA49172).
z/OS Through V2R1Communications Server Performance Functions Update

  • 1. © 2014 IBM Corporation Page 1 z/OS Thru-V2R1 Communications Server Performance Functions Update David Herr – dherr@us.ibm.com IBM Raleigh, NC Thursday, June 12 2014 633
  • 2. © 2014 IBM Corporation Page 2 Trademarks Notes: Performance is in Internal Throughput Rate (ITR) ratio based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput improvements equivalent to the performance ratios stated here. IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply. All customer examples cited or described in this presentation are presented as illustrations of the manner in which some customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics will vary depending on individual customer configurations and conditions. This publication was produced in the United States. IBM may not offer the products, services or features discussed in this document in other countries, and the information may be subject to change without notice. Consult your local IBM business contact for information on the product or services available in your area. All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Information about non-IBM products is obtained from the manufacturers of those products or their published announcements. IBM has not tested those products and cannot confirm the performance, compatibility, or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. Prices subject to change without notice. 
Contact your IBM representative or Business Partner for the most current pricing in your geography. This information provides only general descriptions of the types and portions of workloads that are eligible for execution on Specialty Engines (e.g, zIIPs, zAAPs, and IFLs) ("SEs"). IBM authorizes customers to use IBM SE only to execute the processing of Eligible Workloads of specific Programs expressly authorized by IBM as specified in the “Authorized Use Table for IBM Machines” provided at www.ibm.com/systems/support/machine_warranties/machine_code/aut.html (“AUT”). No other workload processing is authorized for execution on an SE. IBM offers SE at a lower price than General Processors/Central Processors because customers are authorized to use SEs only to process certain types and/or amounts of workloads as specified by IBM in the AUT. The following are trademarks or registered trademarks of other companies. * Other product and service names might be trademarks of IBM or other companies. The following are trademarks of the International Business Machines Corporation in the United States and/or other countries. * Registered trademarks of IBM Corporation Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries. Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency which is now part of the Office of Government Commerce. 
ITIL is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office. Java and all Java based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. Linear Tape-Open, LTO, the LTO Logo, Ultrium, and the Ultrium logo are trademarks of HP, IBM Corp. and Quantum in the U.S. and Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. OpenStack is a trademark of OpenStack LLC. The OpenStack trademark policy is available on the OpenStack website. TEALEAF is a registered trademark of Tealeaf, an IBM Company. Windows Server and the Windows logo are trademarks of the Microsoft group of countries. Worklight is a trademark or registered trademark of Worklight, an IBM Company. UNIX is a registered trademark of The Open Group in the United States and other countries. AIX* BladeCenter* CICS* Cognos* DataPower* DB2* DFSMS EASY Tier FICON* GDPS* PowerHA* PR/SM PureSystems Rational* RACF* RMF Smarter Planet* Storwize* System Storage* System x* System z* System z10* Tivoli* WebSphere* XIV* zEnterprise* z10 z10 EC z/OS* z/VM* z/VSE* HiperSockets* HyperSwap IMS InfiniBand* Lotus* MQSeries* NetView* OMEGAMON* Parallel Sysplex* POWER7*
  • 3. © 2014 IBM Corporation Page 3 Agenda Disclaimer: All statements regarding IBM future direction or intent, including current product plans, are subject to change or withdrawal without notice and represent goals and objectives only. All information is provided for informational purposes only, on an “as is” basis, without warranty of any kind. V2R1 Performance Enhancements Optimizing inbound communications using OSA-Express Optimizing outbound communications using OSA-Express OSA-Express4 z/OS Communications Server Performance Summaries
  • 4. © 2014 IBM Corporation Page 4 V2R1 Performance Enhancements
  • 5. © 2014 IBM Corporation Page 5 Shared Memory Communications – Remote (SMC-R) V2R1 OSA ROCE TCP IP Interface Sockets Middleware/Application z/OS System B SMC-R OSAROCE TCP IP Interface Sockets Middleware/Application z/OS System A SMC-R TCP connection establishment over IP IP Network (Ethernet) RDMA Network RoCE TCP connection transitions to SMC-R allowing application data to be exchanged using RDMA Dynamic (in-line) negotiation for SMC-R is initiated by presence of TCP Options TCP syn flows (with TCP Options indicating SMC-R capability) RMBe RMBe SMC-R Background App data App data Both TCP and SMC-R “connections” remain active
  • 6. © 2014 IBM Corporation Page 6 SMC-R - RDMA V2R1 Key attributes of RDMA Enables a host to read or write directly from/to a remote host’s memory without involving the remote host’s CPU By registering specific memory for RDMA partner use Interrupts still required for notification (i.e. CPU cycles are not completely eliminated) Reduced networking stack overhead by using streamlined, low level, RMDA interfaces Key requirements: A reliable “lossless” network fabric (LAN for layer 2 data center network distance) An RDMA capable NIC (RNIC) and RDMA capable switched fabric (switches)
  • 7. © 2014 IBM Corporation Page 7 SMC-R - Solution V2R1 Shared Memory Communications over RDMA (SMC-R) is a protocol that allows TCP sockets applications to transparently exploit RDMA (RoCE) SMC-R is a “hybrid” solution that: Uses TCP connection (3-way handshake) to establish SMC-R connection Each TCP end point exchanges TCP options that indicate whether it supports the SMC-R protocol SMC-R “rendezvous” (RDMA attributes) information is then exchanged within the TCP data stream (similar to SSL handshake) Socket application data is exchanged via RDMA (write operations) TCP connection remains active (controls SMC-R connection) This model preserves many critical existing operational and network management features of TCP/IP
  • 8. © 2014 IBM Corporation Page 8 SMC-R – Role of the RMBe (buffer size) V2R1 The RMBe is a slot in the RMB buffer for a specific TCP connection
– Based on TCPRCVBufrsize – NOT equal to it
– Can be controlled by the application using setsockopt() SO_RCVBUF
– 5 sizes: 32K, 64K, 128K, 256K and 1024K (1MB)
Depending on the workload, a larger RMBe can improve performance for streaming (bulk) workloads:
– Less wrapping of the RMBe = fewer RDMA writes
– Less frequent “acknowledgement” interrupts to the sending side
– Fewer write() blocks on the sending side
(Diagram: a 1MB RMB holding an RMBe for one TCP connection; as long as space remains available past the wrap point, the SMC link partner keeps writing and the pipe stays full)
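The setsockopt() control mentioned above can be sketched in portable Python; SO_RCVBUF is the standard sockets option the slide refers to. How the resulting receive buffer size maps onto the five RMBe sizes is internal to the z/OS stack, so this sketch only shows the application-side request (the 1MB value is illustrative, and some platforms round or double the granted size).

```python
import socket

def set_receive_buffer(sock, nbytes):
    # Request a larger receive buffer; on z/OS this influences which
    # of the five RMBe sizes (32K..1MB) the connection is assigned.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)
    # Read back what the stack actually granted (platforms may round
    # the requested value up, or double it, before applying it).
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    granted = set_receive_buffer(s, 1024 * 1024)  # ask for 1MB
    print("granted receive buffer:", granted)
    s.close()
```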
  • 9. © 2014 IBM Corporation Page 9 SMC-R – Micro benchmark performance results V2R1 Response time/Throughput and CPU improvements Workload: Using AWM (Application Workload Modeler) to model “socket to socket” performance using SMC-R AWM very lightweight - contains no application/business logic Stresses and measures the networking infrastructure Real workload benefits will be smaller than the improvements seen in AWM benchmarks! MTU: RoCE (1K and 2K) OSA (1500 and 8000) Large Send enabled for some of the TCP/IP streaming runs RR1(1/1): Single interactive session with 1 byte request and 1 byte reply RR10: 10 concurrent connections with various message sizes STR1(1/20M): Single Streaming session with 1 byte request (Client) and 20,000,000 bytes reply (Server) Used large RMBs – 1MB
  • 10. © 2014 IBM Corporation Page 10 SMC-R – Micro benchmark performance results V2R1 SMC-R (RoCE) vs. TCP/IP (OSA) Performance Summary – Request/Response micro-benchmark (AWM IPv4 R/R workload; chart plots Raw Tput, CPU-Server, CPU-Client and Resp Time, % relative to TCP/IP, for RR1(1/1) through RR10(32k/32k))
– Significant latency reduction across all data sizes (52-88%)
– Reduced CPU cost as payload increases (up to 56% CPU savings)
– Impressive throughput gains across all data sizes (up to +717%)
– 177,972 trans/sec (R/R); latency 28 mics (full roundtrip); streaming rates up to 15.9Gb/sec
Note: vs. typical OSA customer configuration – MTU (1500), Large Send disabled; RoCE MTU: 1K
June 4, 2013. Client, Server: 4 CPs 2827-791 (zEC12 GA2). Interfaces: 10GbE RoCE Express and 10GbE OSA Express5
  • 11. © 2014 IBM Corporation Page 11 SMC-R – Micro benchmark performance results V2R1 z/OS V2R1 SMC-R vs TCP/IP Streaming Data Performance Summary (AWM; STR1/STR3(1/20M), % relative to TCP/IP; MTU combinations 1K/1500, 2K/1500 and 2K/8000-LS) Notes:
• Significant throughput benefits and CPU reduction benefits
• Up to 69% throughput improvement
• Up to 66% reduction in CPU costs
• 2K RoCE MTU does yield throughput advantages
• LS – Large Send enabled (Segmentation offload)
• 1MB RMBs; saturation reached (8.8-8.9Gb/sec)
May 29, 2013. Client, Server: 2827-791 2CPs. Interfaces: 10GbE RoCE Express and 10GbE OSA Express
  • 12. © 2014 IBM Corporation Page 12 SMC-R – Micro benchmark performance results V2R1 Summary:
– Network latency for z/OS TCP/IP based OLTP (request/response) workloads reduced by up to 80%*
• Networking related CPU consumption reduction for z/OS TCP/IP based OLTP (request/response) workloads increases as payload size increases
– Networking related CPU consumption for z/OS TCP/IP based workloads with streaming data patterns reduced by up to 60%, with a network throughput increase of up to 60%**
– CPU consumption can be further optimized by using larger RMBe sizes
• Less data consumed processing
• Less data wrapping
• Less data queuing
* Based on benchmarks of modeled z/OS TCP sockets based workloads with request/response traffic patterns using SMC-R vs. TCP/IP. The actual response times and CPU savings any user will experience will vary.
** Based on benchmarks of modeled z/OS TCP sockets based workloads with streaming data patterns using SMC-R vs. TCP/IP. The benefits any user will experience will vary.
  • 13. © 2014 IBM Corporation Page 13 SMC-R – Sysplex Distributor performance results Line 1 - TCP/IP distributed connections without QDIO Accelerator Line 2 - TCP/IP distributed connections utilizing QDIO Accelerator Line 3 - SMC-R distributed connections V2R1 With SMC-R the distributing stack is bypassed for inbound data. Connection setup and SMC-R rendezvous packets will be the only inbound traffic going through the distributing stack. Remember that all outbound traffic bypasses the distributing stack for all scenarios.
  • 14. © 2014 IBM Corporation Page 14 SMC-R – Sysplex Distributor performance results V2R1 Request/Response 100/800 Sysplex Distributor workload – 20 simultaneous request/response connections sending 100 and receiving 800 bytes (chart: % improvement on the distributing stack – throughput +248 with Accel / +295 with no Accel, CPU cost/tran -89 / -97)
– Results from the Sysplex distributing stack perspective
– SMC-R removes virtually all CP processing on the distributing stack
– 250%+ throughput improvement
– Large data workloads would yield even bigger performance improvements
  • 15. © 2014 IBM Corporation Page 15 SMC-R – FTP performance summary V2R1 zEC12 V2R1 SMC vs. OSD Performance Summary – FTP Performance (chart, % relative to OSD: FTP1(1200M) Raw Tput +0.73, CPU-Client -20.86, CPU-Server -48.18; FTP3(1200M) Raw Tput +1.06, CPU-Client -15.83, CPU-Server -47.49)
– FTP binary PUTs to the z/OS FTP server, 1 and 3 sessions, transferring 1200 MB data; AWM FTP client
– OSD – OSA Express4 10Gb interface
– Reading from and writing to DASD datasets – limits throughput
The performance measurements discussed in this document were collected using a dedicated system environment. The results obtained in other configurations or operating system environments may vary significantly depending upon the environments used.
  • 16. © 2014 IBM Corporation Page 16 SMC-R - WebSphere MQ for z/OS performance improvement V2R1 Latency improvements Workload Measurements using WebSphere MQ V7.1.0 MQ between 2 LPARs on zEC12 machine (10 processors each) Request/Response workload On each LPAR, a queue manager was started and configured with 50 outbound sender channels and 50 inbound receiver channels, with default options for the channel definitions (100 TCP connections) Each configuration was run with message sizes of 2KB, 32KB and 64KB where all messages were non-persistent Results were consistent across all three message sizes
  • 17. © 2014 IBM Corporation Page 17 SMC-R - WebSphere MQ for z/OS performance improvement V2R1 Latency improvements
– WebSphere MQ for z/OS realizes up to a 3x increase in the messages per second it can deliver across z/OS systems when using SMC-R vs standard TCP/IP, for 64K messages over 1 channel*
– 2K, 32K and 64K message sizes; 1 to 50 TCP connections each way
(Diagram: MQ messages flowing between WebSphere MQ on z/OS SYSA and SYSB, over TCP/IP (OSA Express4S) and over SMC-R (RoCE))
* Based on internal IBM benchmarks using a modeled WebSphere MQ for z/OS workload driving non-persistent messages across z/OS systems in a request/response pattern. The benchmarks included various data sizes and number of channel pairs. The actual throughput and CPU savings users will experience may vary based on the user workload and configuration.
  • 18. © 2014 IBM Corporation Page 18 SMC-R – CICS performance improvement V2R1 Response time and CPU utilization improvements
Workload - each transaction:
– Makes 5 DPL (Distributed Program Link) requests over an IPIC connection
– Sends a 32K container on each request; the server program receives the data and sends back 32K
– Receives back a 32K container for each request
IPIC - IP Interconnectivity
• Introduced in CICS TS 3.2/TG 7.1
• TCP/IP based communications
• Alternative to LU6.2/SNA for distributed program calls
Note: Results based on internal IBM benchmarks using a modeled CICS workload driving a CICS transaction that performs 5 DPL calls to a CICS region on a remote z/OS system, using 32K input/output containers. Response times and CPU savings measured on the z/OS system initiating the DPL calls. The actual response times and CPU savings any user will experience will vary.
  • 19. © 2014 IBM Corporation Page 19 SMC-R – CICS performance improvement V2R1 Benchmarks run on z/OS V2R1 with latest zEC12 and new 10GbE RoCE Express feature – Compared use of SMC-R (10GbE RoCE Express) vs standard TCP/IP (10GbE OSA Express4S) with CICS IPIC communications for DPL (Distributed Program Link) processing – Up to 48% improvement in CICS transaction response time as measured on CICS system issuing the DPL calls (CICS A) – Up to 10% decrease in overall z/OS CPU consumption on the CICS systems
  • 20. © 2014 IBM Corporation Page 20 SMC-R – WebSphere to DB2 communications performance improvement V2R1 Response time improvements
Workload: a client simulator (JIBE) on Linux on x drives HTTP/REST over 40 concurrent TCP/IP connections into WAS Liberty (TradeLite) on z/OS SYSA, which connects to DB2 on z/OS SYSB via JDBC/DRDA (3 connections per HTTP connection) over SMC-R (RoCE)
– 40% reduction in overall transaction response time! – as seen from the client’s perspective
– Small data sizes ~ 100 bytes
Based on projections and measurements completed in a controlled environment. Results may vary by customer based on individual workload, configuration and software levels.
  • 21. © 2014 IBM Corporation Page 21 TCP/IP Enhanced Fast Path Sockets V2R1 (Diagram: a recv() travels from the socket application through the USS Logical File System (LFS) to the TCP/IP Physical File System (PFS), transport layer (TCP, UDP, RAW), IP and device driver, with space switches between OMVS and TCP/IP)
TCP/IP sockets (normal path):
– Full function support for sockets, including support for Unix signals, POSIX compliance
– When TCP/IP needs to suspend a thread waiting for network flows, USS suspend/resume services are invoked
TCP/IP fast path sockets (pre-V2R1):
– Streamlined path through the USS LFS for selected socket APIs
– TCP/IP performs the wait/post or suspend/resume inline using its own services
– Significant reduction in path length
  • 22. © 2014 IBM Corporation Page 22 TCP/IP Enhanced Fast Path Sockets V2R1 Pre-V2R1 fast path provided CPU savings but was not widely adopted:
– No support for Unix signals (other than SIGTERM); only useful to applications that have no requirement for signal support
– No DBX support (debugger)
– Must be explicitly enabled! BPXK_INET_FASTPATH environment variable or Iocc#FastPath IOCTL
– Only supported for the UNIX System Services socket API or the z/OS XL C/C++ Run-time Library functions
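Explicit enablement from the z/OS UNIX shell can be sketched as below. In z/OS UNIX documentation the variable carries a leading underscore, and its value names the TCP/IP stack to fast-path to; the stack job name TCPIP here is an assumption for illustration.

```shell
# Enable pre-V2R1 fast path sockets for programs started from this
# shell; the value names the TCP/IP stack (job name) to fast-path to.
export _BPXK_INET_FASTPATH=TCPIP
```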
  • 23. © 2014 IBM Corporation Page 23 TCP/IP Enhanced Fast Path Sockets V2R1 (Diagram: in V2R1, send/sendto/sendmsg and recv/recvfrom/recvmsg take a streamlined path through the LFS with no space switch into OMVS; TCP/IP uses Pause/Release services instead of USS suspend/resume)
Fast path sockets performance without all the conditions!:
• Enabled by default
• Full POSIX compliance, signals support and DBX support
• Valid for ALL socket APIs (with the exception of the Pascal API)
  • 24. © 2014 IBM Corporation Page 24 TCP/IP Enhanced Fast Path Sockets V2R1 No new externals Still supports “activating Fast path explicitly” to avoid migration issues Provides performance benefits of enhanced Fast Path sockets Keeps the following restrictions: Does not support POSIX signals (blocked by z/OS UNIX) Cannot use dbx debugger
  • 25. © 2014 IBM Corporation Page 25 TCP/IP Enhanced Fast Path Sockets V2R1 V2R1 IPv4 AWM Primitives – V2R1 with Fastpath vs. V2R1 without Fastpath (chart, % relative to without Fastpath):
– RR40 (1h/8h): Raw TPUT +12.48, CPU-Client -22.32, CPU-Server -23.77
– CRR20 (64/8k): Raw TPUT +2.23, CPU-Client -9.12, CPU-Server -3.04
– STR3 (1/20M): Raw TPUT -0.35, CPU-Client -4.97, CPU-Server -4.97
May 2, 2013. Client and server LPARs: zEC12 with 6 CPs per LPAR. Interface: OSA-E4 10 GbE
Note: The performance measurements discussed in this presentation are z/OS V2R1 Communications Server numbers and were collected using a dedicated system environment. The results obtained in other configurations or operating system environments may vary.
  • 26. © 2014 IBM Corporation Page 26 V2R1 Background information: IP filtering basics – QDIO Accelerator coexistence with IP Filtering
IP filtering at the z/OS IP Layer:
– Filter rules defined based on relevant attributes
– Used to control routed and local traffic
– Defined actions taken when a filter rule is matched
IP filter rules are defined in three ways: TCPIP profile, Policy Agent, Defense Manager Daemon
“Not Filtering” routed traffic means that all routed traffic is permitted by the effective filter rules
(Diagram: filter policy (pagent or profile) and defensive filters apply both to routed traffic – traffic routed through this TCP/IP stack, not Sysplex Distributor connection routing – and to local traffic going to or coming from applications on this TCP/IP stack only)
  • 27. © 2014 IBM Corporation Page 27 Background information: QDIO Accelerator V1R11
Provides fast path IP forwarding for these DLC combinations:
– Inbound QDIO, outbound QDIO or HiperSockets
– Inbound HiperSockets, outbound QDIO or HiperSockets
Sysplex Distributor (SD) acceleration:
– Inbound packets over HiperSockets or OSA-E QDIO
– When SD gets to the target stack using either dynamic XCF connectivity over HiperSockets or VIPAROUTE over OSA-E QDIO
Improves performance and reduces processor usage for such workloads
(Diagram: the accelerating stack keeps shadow copies of selected entries from the connection routing table and the IP-layer routing table, so forwarded packets bypass the stack’s IP and TCP/UDP layers)
  • 28. © 2014 IBM Corporation Page 28 QDIO Accelerator coexistence with IP Filtering
Problem: no support for acceleration when IP security is enabled
• Even if stack processing is not needed for forwarded traffic
Solution: allow QDIO Accelerator for routed traffic when IPCONFIG IPSECURITY and IPCONFIG QDIOACCELERATOR are configured
• QDIO Accelerator for non-Sysplex Distributor traffic requires that the accelerating stack must not filter or log routed traffic
• Always allow QDIO Accelerator for Sysplex Distributor traffic
Not supported for the HiperSockets Accelerator
  • 29. © 2014 IBM Corporation Page 29 QDIO Accelerator: IPCONFIG syntax and performance results
…
   .-NOQDIOACCELerator------------------------.
>--+------------------------------------------+-->
   |                  .-QDIOPriority 1------. |
   '-QDIOACCELerator--+----------------------+'
                      '-QDIOPriority priority'
…
Request-Response workload RR20: 20 sessions, 100 / 800 (chart: CPU/transaction relative to no QDIO Accelerator – z/OS client -0.9%, z/OS Sysplex Distributor -57.45%, z/OS target -0.33%)
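A minimal TCPIP profile sketch of the coexistence described above, combining IP security with the accelerator (the priority value shown is the default and is illustrative):

```
IPCONFIG IPSECURITY QDIOACCELERATOR QDIOPRIORITY 1
```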
  • 30. © 2014 IBM Corporation Page 30 FTP using zHPF – Improving throughput
There are many factors that influence the transfer rates for z/OS FTP connections. Some of the more significant ones are (in order of impact):
– DASD read/write access
– Data transfer type (Binary, ASCII, ...)
– Dataset characteristics (e.g., fixed block or variable)
*Note: the network (HiperSockets, OSA, 10Gb, SMC-R) characteristics have very little impact when reading from, and writing to, DASD, as you will see in our results section.
zHPF FAQ link (works with DS8000 storage systems):
• http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/FQ127122
  • 31. © 2014 IBM Corporation Page 31 FTP using zHPF – Improving throughput FTP Workload:
– z/OS FTP client GET or PUT of a 1200 MB data set from or to the z/OS FTP server
– DASD to DASD (read from or write to); zHPF enabled/disabled; single file transfer
– Used a variable block data set for the test: Organization .... PS, Record Format ... VB, Record Length ... 6140, Block size ... 23424
– For HiperSockets, configure GLOBALCONFIG IQDMULTIWRITE
  • 32. © 2014 IBM Corporation Page 32 FTP using zHPF – Improving throughput Throughput is improved by 43-49% when zHPF is enabled
  • 33. © 2014 IBM Corporation Page 33 Optimizing inbound communications using OSA-Express
  • 34. © 2014 IBM Corporation Page 34 Timing Considerations for Various Inbound Workloads
Read-side interrupt frequency is all about the LAN-Idle timer!
Inbound streaming traffic pattern: for inbound streaming traffic, it’s most efficient to have OSA defer interrupting z/OS until it sees a pause in the stream. To accomplish this, we’d want the OSA LAN-Idle timer set fairly high (e.g., don’t interrupt unless there’s a traffic pause of at least 20 microseconds). (Diagram: tightly spaced packets arriving at the receiving OSA-Express3, a pause, then the next burst)
Interactive traffic pattern: for interactive traffic (single packet request in, single packet response out), response time would be best if OSA would interrupt z/OS immediately. To accomplish this, we’d want the OSA LAN-Idle timer set as low as it can go (e.g., 1 microsecond).
For a detailed discussion on inbound interrupt timing, please see Part 1 of “z/OS Communications Server V1R12 Performance Study: OSA-Express3 Inbound Workload Queueing”: http://www-01.ibm.com/support/docview.wss?uid=swg27005524
  • 35. © 2014 IBM Corporation Page 35 Dynamic LAN Idle Timer – Introduced in z/OS V1R9
With Dynamic LAN Idle, blocking times are now dynamically adjusted by the host in response to the workload characteristics: block more under heavy workloads, block less (more OSA-generated PCI interrupts) under light workloads. Optimizes interrupts and latency!
  • 36. © 2014 IBM Corporation Page 36 Dynamic LAN Idle Timer: Configuration
Configure INBPERF DYNAMIC on the INTERFACE statement:
– BALANCED (default) - a static interrupt-timing value, selected to achieve reasonably high throughput and reasonably low CPU
– DYNAMIC - a dynamic interrupt-timing value that changes based on current inbound workload conditions (generally recommended!)
– MINCPU - a static interrupt-timing value, selected to minimize host interrupts without regard to throughput
– MINLATENCY - a static interrupt-timing value, selected to minimize latency
Note: These values cannot be changed without stopping and restarting the interface
>>-INTERFace--intf_name----------------------------------------->
   .-INBPERF BALANCED--------.
>--+-------------------------+-------->
   '-INBPERF--+-DYNAMIC----+-'
              +-MINCPU-----+
              '-MINLATENCY-'
  • 37. © 2014 IBM Corporation Page 37 Dynamic LAN Idle Timer: But what about mixed workloads? (Diagram: the receiving OSA-Express3 sees interleaved inbound packets from connection A - streaming - and connection B - interactive)
INBPERF DYNAMIC (Dynamic LAN Idle) is great for EITHER streaming OR interactive… but if BOTH types of traffic are running together, DYNAMIC mode will tend toward CPU conservation (elongating the LAN-Idle timer). So in a mixed (streaming + interactive) workload, the interactive flows will be delayed, waiting for the OSA to detect a pause in the stream.
  • 38. © 2014 IBM Corporation Page 38 Inbound Workload Queuing V1R12
With OSA-Express3/4S IWQ and z/OS V1R12, OSA now directs streaming traffic onto its own input queue – transparently separating the streaming traffic away from the more latency-sensitive interactive flows.
Each input queue has its own LAN-Idle timer, so the Dynamic LAN Idle function can now tune the streaming (bulk) queue to conserve CPU (high LAN-Idle timer setting), while generally allowing the primary queue to operate with very low latency (minimizing its LAN-Idle timer setting). So interactive traffic (on the primary input queue) may see significantly improved response time.
The separation of streaming traffic away from interactive also enables new streaming traffic efficiencies in Communications Server, resulting in improved in-order delivery (better throughput and CPU consumption).
(Diagram: separate input queues – default (interactive), streaming, Sysplex Distributor, and EE (V1R13) – each with a custom LAN-Idle timer and interrupt processing, serviced by different CPs)
  • 39. © 2014 IBM Corporation Page 39 Improved Streaming Traffic Efficiency With IWQ
Before we had IWQ, multiprocessor races would degrade streaming performance! (Diagram: two QDIO read interrupts, at times t1 and t2, dispatch SRB 1 on CP 0 and SRB 2 on CP 1, each carrying a batch of inbound packets for connections A-D.) At the time CP1 (SRB2) starts the TCP-layer processing for connection A’s 1st packet, CP0 (SRB1) has progressed only into connection C’s packets, so the connection A packets being carried by SRB 2 will be seen before those carried by SRB 1. This is out-of-order packet delivery, brought on by multiprocessor races through TCP/IP inbound code. Out-of-order delivery will consume excessive CPU and memory, and usually leads to throughput problems.
IWQ does away with MP-race-induced ordering problems! With streaming traffic sorted onto its own queue, it is now convenient to service streaming traffic from a single CP (i.e., using a single SRB). So with IWQ, we no longer have inbound SRB races for streaming data.
  • 40. © 2014 IBM Corporation Page 40 QDIO Inbound Workload Queuing – Configuration
INBPERF DYNAMIC WORKLOADQ enables QDIO Inbound Workload Queuing (IWQ)
>>-INTERFace--intf_name----------------------------------------->
   .-INBPERF BALANCED--------------------.
>--+-------------------------------------+-->
   |            .-NOWORKLOADQ-.          |
   '-INBPERF-+-DYNAMIC-+-------------+-+-'
   |            '-WORKLOADQ---'          |
   +-MINCPU------------------+
   '-MINLATENCY--------------'
– INTERFACE statements only - no support for DEVICE/LINK definitions
– QDIO Inbound Workload Queuing requires VMAC
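An INTERFACE statement sketch enabling IWQ, in the style of the syntax above (the interface name, port name and IP address are illustrative; VMAC is required, as noted):

```
INTERFACE ETH1 DEFINE IPAQENET
  PORTNAME OSAPORT1
  IPADDR 10.1.1.1/24
  VMAC
  INBPERF DYNAMIC WORKLOADQ
```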
  • 41. © 2014 IBM Corporation Page 41 QDIO Inbound Workload Queuing
Display OSAINFO command (V1R12) shows you what’s registered in OSA:
– The BULKDATA queue registers 5-tuples with OSA (streaming connections)
– The SYSDIST queue registers distributable DVIPAs with OSA
D TCPIP,,OSAINFO,INTFN=V6O3ETHG0
.
Ancillary Input Queue Routing Variables:
Queue Type: BULKDATA Queue ID: 2 Protocol: TCP
Src: 2000:197:11:201:0:1:0:1..221 Dst: 100::101..257
Src: 2000:197:11:201:0:2:0:1..290 Dst: 200::202..514
Total number of IPv6 connections: 2
Queue Type: SYSDIST Queue ID: 3 Protocol: TCP
Addr: 2000:197:11:201:0:1:0:1
Addr: 2000:197:11:201:0:2:0:1
Total number of IPv6 addresses: 2
36 of 36 Lines Displayed
End of report
  • 42. © 2014 IBM Corporation Page 42 QDIO Inbound Workload Queuing: Netstat DEvlinks/-d Display TCPIP,,Netstat,DEvlinks to see whether QDIO inbound workload queueing is enabled for a QDIO interface D TCPIP,,NETSTAT,DEVLINKS,INTFNAME=QDIO4101L EZD0101I NETSTAT CS V1R12 TCPCS1 INTFNAME: QDIO4101L INTFTYPE: IPAQENET INTFSTATUS: READY PORTNAME: QDIO4101 DATAPATH: 0E2A DATAPATHSTATUS: READY CHPIDTYPE: OSD SPEED: 0000001000 ... READSTORAGE: GLOBAL (4096K) INBPERF: DYNAMIC WORKLOADQUEUEING: YES CHECKSUMOFFLOAD: YES SECCLASS: 255 MONSYSPLEX: NO ISOLATE: NO OPTLATENCYMODE: NO ... 1 OF 1 RECORDS DISPLAYED END OF THE REPORT
  • 43. © 2014 IBM Corporation Page 43 QDIO Inbound Workload Queuing: Display TRLE Display NET,TRL,TRLE=trlename to see whether QDIO inbound workload queueing is in use for a QDIO interface D NET,TRL,TRLE=QDIO101 IST097I DISPLAY ACCEPTED ... IST2263I PORTNAME = QDIO4101 PORTNUM = 0 OSA CODE LEVEL = ABCD ... IST1221I DATA DEV = 0E2A STATUS = ACTIVE STATE = N/A IST1724I I/O TRACE = OFF TRACE LENGTH = *NA* IST1717I ULPID = TCPCS1 IST2310I ACCELERATED ROUTING DISABLED IST2331I QUEUE QUEUE READ IST2332I ID TYPE STORAGE IST2205I ------ -------- --------------- IST2333I RD/1 PRIMARY 4.0M(64 SBALS) IST2333I RD/2 BULKDATA 4.0M(64 SBALS) IST2333I RD/3 SYSDIST 4.0M(64 SBALS) ... IST924I ------------------------------------------------------------- IST314I END
  • 44. © 2014 IBM Corporation Page 44 QDIO Inbound Workload Queuing: Netstat ALL/-A Display TCPIP,,Netstat,ALL to see whether QDIO inbound workload BULKDATA queueing is in use for a given connection D TCPIP,,NETSTAT,ALL,CLIENT=USER1 EZD0101I NETSTAT CS V1R12 TCPCS1 CLIENT NAME: USER1 CLIENT ID: 00000046 LOCAL SOCKET: ::FFFF:172.16.1.1..20 FOREIGN SOCKET: ::FFFF:172.16.1.5..1030 BYTESIN: 00000000000023316386 BYTESOUT: 00000000000000000000 SEGMENTSIN: 00000000000000016246 SEGMENTSOUT: 00000000000000000922 LAST TOUCHED: 21:38:53 STATE: ESTABLSH ... Ancillary Input Queue: Yes BulkDataIntfName: QDIO4101L ... APPLICATION DATA: EZAFTP0S D USER1 C PSSS ---- 1 OF 1 RECORDS DISPLAYED END OF THE REPORT
  • 45. © 2014 IBM Corporation Page 45 QDIO Inbound Workload Queuing: Netstat STATS/-S Display TCPIP,,Netstat,STATS to see the total number of TCP segments received on BULKDATA queues D TCPIP,,NETSTAT,STATS,PROTOCOL=TCP EZD0101I NETSTAT CS V1R12 TCPCS1 TCP STATISTICS CURRENT ESTABLISHED CONNECTIONS = 6 ACTIVE CONNECTIONS OPENED = 1 PASSIVE CONNECTIONS OPENED = 5 CONNECTIONS CLOSED = 5 ESTABLISHED CONNECTIONS DROPPED = 0 CONNECTION ATTEMPTS DROPPED = 0 CONNECTION ATTEMPTS DISCARDED = 0 TIMEWAIT CONNECTIONS REUSED = 0 SEGMENTS RECEIVED = 38611 ... SEGMENTS RECEIVED ON OSA BULK QUEUES= 2169 SEGMENTS SENT = 2254 ... END OF THE REPORT
  • 46. © 2014 IBM Corporation Page 46 Quick INBPERF Review Before We Push On…. The original static INBPERF settings (MINCPU, MINLATENCY, BALANCED) provide sub-optimal performance for workloads that tend to shift between request/response and streaming modes. We therefore recommend customers specify INBPERF DYNAMIC, since it self-tunes, to provide excellent performance even when inbound traffic patterns shift. Inbound Workload Queueing (IWQ) mode is an extension to the Dynamic LAN Idle function. IWQ improves upon the DYNAMIC setting, in part because it provides finer interrupt-timing control for mixed (interactive + streaming) workloads.
  • 47. © 2014 IBM Corporation Page 47 Optimized Latency Mode (OLM) V1R11
z/OS software and OSA-Express3 and above microcode can further reduce latency via some aggressive processing changes (enabled via the OLM keyword on the INTERFACE statement):
– Inbound
• OSA-Express signals the host if data is “on its way” (“Early Interrupt”)
• Host may spin for a while, if the early interrupt is fielded before the inbound data is “ready”
– Outbound
• OSA-Express does not wait for SIGA to look for outbound data (“SIGA reduction”)
• OSA-Express microprocessor may spin for a while, looking for new outbound data to transmit
OLM is intended for workloads that have demanding QoS requirements for response time (transaction rate) – high volume request/response workloads (traffic is predominantly transaction oriented versus streaming)
The latency-reduction techniques employed by OLM will limit the degree to which the OSA can be shared among partitions
(Diagram: request/response flow between application client and server through the TCP/IP stacks and OSAs, showing SIGA-write and PCI on each side)
  • 48. © 2014 IBM Corporation Page 48 Optimized Latency Mode (OLM): How to configure New OLM parameter – IPAQENET/IPAQENET6 – Not allowed on DEVICE/LINK Enables Optimized Latency Mode for this INTERFACE only Forces INBPERF to DYNAMIC Default NOOLM INTERFACE NSQDIO411 DEFINE IPAQENET IPADDR 172.16.11.1/24 PORTNAME NSQDIO1 MTU 1492 VMAC OLM INBPERF DYNAMIC SOURCEVIPAINTERFACE LVIPA1 D TCPIP,,NETSTAT,DEVLINKS,INTFNAME=LNSQDIO1 JOB 6 EZD0101I NETSTAT CS V1R11 TCPCS INTFNAME: LNSQDIO1 INTFTYPE: IPAQENET INTFSTATUS: READY . READSTORAGE: GLOBAL (4096K) INBPERF: DYNAMIC . ISOLATE: NO OPTLATENCYMODE: YES Use Netstat DEvlinks/-d to see current OLM configuration
  • 49. © 2014 IBM Corporation Page 49 Optimized Latency Mode (OLM): Performance Data
Client and server have minimal application logic:
– RR1: 1 session, 1 byte in, 1 byte out
– RR20: 20 sessions, 128 bytes in, 1024 bytes out
– RR40: 40 sessions, 128 bytes in, 1024 bytes out
– RR80: 80 sessions, 128 bytes in, 1024 bytes out
(Charts: end-to-end latency (response time) in microseconds – lower is better – and transaction rate in transactions per second – higher is better – for DYNAMIC vs. DYNAMIC+OLM across RR1/RR20/RR40/RR80)
z10 (4 CP LPARs), z/OS V1R13, OSA-E3 1GbE
Note: The performance measurements discussed in this presentation are z/OS V1R13 Communications Server numbers and were collected using a dedicated system environment. The results obtained in other configurations or operating system environments may vary.
  • 50. © 2014 IBM Corporation Page 50 Dynamic LAN Idle Timer: Performance Data
Dynamic LAN Idle vs. Balanced (chart: RR1(1h/8h) trans/sec +50.1%, response time -33.4%; RR10(1h/8h) trans/sec +87.7%, response time -47.4%)
Dynamic LAN Idle improved RR1 TPS by 50% and RR10 TPS by 88%; response time for these workloads is improved by 33% and 47%, respectively.
1h/8h indicates 100 bytes in and 800 bytes out
z10 (4 CP LPARs), z/OS V1R13, OSA-E3 1GbE
Note: The performance measurements discussed in this presentation are z/OS V1R13 Communications Server numbers and were collected using a dedicated system environment. The results obtained in other configurations or operating system environments may vary.
  • 51. © 2014 IBM Corporation Page 51 Inbound Workload Queuing: Performance Data
Setup: z10 (3 CP LPARs), z/OS V1R12 to z/OS V1R12 and AIX 5.3 p570, OSA-Express3s in Dynamic or IWQ mode, 1GbE or 10GbE network
IWQ: mixed workload results vs DYNAMIC:
– z/OS<->AIX R/R throughput improved 55% (response time improved 36%)
– Streaming throughput also improved in this test: +5%
(Chart: RR30 (z/OS to AIX) trans/sec and STR1 (z/OS to z/OS) KB/sec, DYNAMIC vs. IWQ)
For z/OS outbound streaming to another platform, the degree of performance boost (due to IWQ) is relative to the receiving platform’s sensitivity to out-of-order packet delivery. For streaming INTO z/OS, IWQ will be especially beneficial for multi-CP configurations.
  • 52. © 2014 IBM Corporation Page 52 Inbound Workload Queuing: Performance Data
Setup: z10 (3 CP LPARs), z/OS V1R12 to z/OS V1R12 and AIX 5.3 p570, OSA-Express3s in Dynamic or IWQ mode, 1GbE or 10GbE network
IWQ: pure streaming results vs DYNAMIC:
– z/OS<->AIX streaming throughput improved 40%
– z/OS<->z/OS streaming throughput improved 24%
(Chart: MB/sec, DYNAMIC vs. IWQ, for z/OS to AIX and z/OS to z/OS)
For z/OS outbound streaming to another platform, the degree of performance boost (due to IWQ) is relative to the receiving platform’s sensitivity to out-of-order packet delivery. For streaming INTO z/OS, IWQ will be especially beneficial for multi-CP configurations.
  • 53. IWQ Usage Considerations
  Minor ECSA usage increase: IWQ grows ECSA usage by 72 KB per OSA interface if Sysplex Distributor (SD) or EE is in use; 36 KB if SD and EE are not in use
  IWQ requires OSA-Express3 in QDIO mode running on IBM System z10, or OSA-Express3/OSA-Express4/OSA-Express5 in QDIO mode running on zEnterprise 196/zEC12 (zEC12 for OSA-E5)
  IWQ must be configured using the INTERFACE statement (not DEVICE/LINK)
  IWQ is not supported when z/OS is running as a z/VM guest with simulated devices (VSWITCH or guest LAN)
  Make sure to apply z/OS V1R12 PTF UK61028 (APAR PM20056) for an added streaming throughput boost with IWQ
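  As a rough illustration of the INTERFACE requirement above, a minimal TCPIP profile fragment enabling IWQ might look like the following sketch. The interface name, port name, and IP address are hypothetical; WORKLOADQ is the IWQ parameter and requires a virtual MAC (VMAC):

```
INTERFACE OSAIWQ1                 ; hypothetical interface name
  DEFINE IPAQENET                 ; QDIO Ethernet - INTERFACE, not DEVICE/LINK
  PORTNAME OSAPRT1                ; hypothetical OSA port name
  IPADDR 10.1.1.1/24
  VMAC                            ; virtual MAC - required for WORKLOADQ
  WORKLOADQ                       ; enable Inbound Workload Queuing
```

  Consult the IP Configuration Reference for your release before adopting this; the point is simply that IWQ is a per-interface keyword on the INTERFACE statement, with no DEVICE/LINK equivalent.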
  • 54. Optimizing outbound communications using OSA-Express
  • 55. TCP Segmentation Offload (V1R13)
  Segmentation consumes (high-cost) host CPU cycles in the TCP stack
  Segmentation offload (also referred to as "Large Send"):
  – Offloads most IPv4 and/or IPv6 TCP segmentation processing to the OSA
  – Decreases host CPU utilization
  – Increases data transfer efficiency
  – Checksum offload also added for IPv6
  [Diagram: the host passes a single large segment to the OSA; TCP segmentation is performed in the OSA, which sends the individual segments (1-4) onto the LAN]
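  Conceptually, large send moves the loop below out of the host stack and into the adapter: the stack hands the OSA one big buffer, and the OSA produces the MSS-sized wire segments. This Python sketch is purely illustrative (not actual stack or OSA code) and just shows the split being offloaded:

```python
def segment(payload: bytes, mss: int, start_seq: int = 0):
    """Split one large TCP send into MSS-sized segments.

    With segmentation offload, the OSA performs this work on the
    host's behalf; without it, the host TCP stack spends CPU doing it.
    Returns a list of (sequence_number, chunk) pairs.
    """
    return [(start_seq + off, payload[off:off + mss])
            for off in range(0, len(payload), mss)]

# A 100-byte send with a 40-byte MSS becomes three segments
# covering sequence numbers 0, 40, and 80.
segs = segment(b"x" * 100, mss=40)
```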
  • 56. z/OS Segmentation Offload performance measurements
  Segmentation offload may significantly reduce CPU cycles when sending bulk data from z/OS! Send buffer size: 180K for streaming workloads
  [Chart: STR-3 workload, OSA-Express4 10 Gb, relative to no offload — CPU/MB -35.8%, throughput +8.6%]
  • 57. TCP Segmentation Offload: Configuration (V1R13)
  Enabled with IPCONFIG/IPCONFIG6 SEGMENTATIONOFFLOAD
  Disabled by default; previously enabled via GLOBALCONFIG
  Segmentation cannot be offloaded for:
  – Packets to another stack sharing the OSA port
  – IPSec-encapsulated packets
  – When multipath is in effect (unless all interfaces in the multipath group support segmentation offload)
  >>-IPCONFIG-------------------------------------------------->
   . . .
  >----+-----------------------------------------------+-------><
       |  .-NOSEGMENTATIONOFFLoad-.                    |
       +--+-----------------------+--------------------+
          '-SEGMENTATIONOFFLoad---'
  Reminder! Checksum offload is enabled by default
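  Concretely, the enablement in the syntax diagram above amounts to two profile statements (shown here as a minimal sketch; the default on both is NOSEGMENTATIONOFFLOAD):

```
IPCONFIG  SEGMENTATIONOFFLOAD   ; offload IPv4 TCP segmentation to the OSA
IPCONFIG6 SEGMENTATIONOFFLOAD   ; offload IPv6 TCP segmentation to the OSA
```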
  • 58. z/OS Checksum Offload performance measurements (V2R1)
  Note: The performance measurements discussed in this presentation are z/OS V2R1 Communications Server numbers and were collected using a dedicated system environment. The results obtained in other configurations or operating system environments may vary.
  [Chart: Effect of ChecksumOffload on IPv6, performance relative to NoChecksumOffload — zEC12 (2 CPs), V2R1, OSA-Express4 10 Gb interface; AWM IPv6 primitives workloads RR30(1h/8h), CRR20(64/8k), STR3(1/20M); client and server CPU deltas of -1.59, -2.66, -5.23, -8.06, -13.65, and -14.83 percent]
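  The per-packet work being offloaded here is the standard Internet one's-complement checksum. As a rough software illustration of that arithmetic (an RFC 1071-style sketch, not z/OS or OSA code), this is the kind of per-packet CPU cost the chart shows disappearing when the OSA computes checksums in hardware:

```python
def internet_checksum(data: bytes) -> int:
    """One's-complement checksum over 16-bit big-endian words (RFC 1071).

    Checksum offload moves this computation from host CPU cycles
    into the OSA adapter.
    """
    if len(data) % 2:              # pad odd-length data with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold carries back in
    return ~total & 0xFFFF

# Well-known IPv4 header example: with the checksum field zeroed,
# the computed checksum is 0xB861.
hdr = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
```

  Verifying a received packet uses the same routine: summing the header with its checksum field filled in yields zero.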
  • 59. OSA-Express4
  • 60. OSA-Express4 Enhancements – 10 GbE improvements
  Improved on-card processor speed and memory bus provide better utilization of the 10 GbE network
  z196 (4 CP LPARs), z/OS V1R13, OSA-E3/OSA-E4 10 GbE
  [Chart: OSA 10 GbE inbound bulk traffic throughput — OSA-E3: 489 MB/sec, OSA-E4: 874 MB/sec]
  • 61. OSA-Express4 Enhancements – EE Inbound Queue
  Enterprise Extender queue provides internal optimizations:
  – EE traffic is processed more quickly
  – Avoids a memory copy of the data
  z196 (4 CP LPARs), z/OS V1R13, OSA-E3/OSA-E4 1 GbE
  [Chart: OSA 1 GbE, mixed TCP and EE workloads, MIQ vs. Dynamic — TCP STR1(1/20MB): Trans/Sec +2.6%, CPU/trans -0.4%; EE RR10(1h/8h): Trans/Sec +32.9%, CPU/trans -2.9%]
  • 62. OSA-Express4 Enhancements – Other improvements
  Checksum Offload support for IPv6 traffic
  Segmentation Offload support for IPv6 traffic
  • 63. z/OS Communications Server Performance Summaries
  • 64. z/OS Communications Server Performance Summaries
  Performance of each z/OS Communications Server release is studied by an internal performance team
  Summaries are created and published online:
  – http://www-01.ibm.com/support/docview.wss?rs=852&uid=swg27005524
  Recently added:
  – The z/OS V2R1 Communications Server Performance Summary
  – Release-to-release comparisons
  – Capacity planning information
  – IBM z/OS Shared Memory Communications over RDMA: Performance Considerations – whitepaper
  • 65. z/OS Communications Server Performance Website
  http://www-01.ibm.com/support/docview.wss?uid=swg27005524
  • 66. Detailed Usage Considerations for IWQ and OLM
  • 67. IWQ Usage Considerations
  Minor ECSA usage increase: IWQ grows ECSA usage by 72 KB per OSA interface if Sysplex Distributor (SD) is in use; 36 KB if SD is not in use
  IWQ requires OSA-Express3 in QDIO mode running on IBM System z10, or OSA-Express3/OSA-Express4 in QDIO mode running on zEnterprise 196:
  – For z10: the minimum recommended field level for OSA-Express3 is microcode level Driver 79, EC N24398, MCL006
  – For z196 GA1: the minimum recommended field level for OSA-Express3 is microcode level Driver 86, EC N28792, MCL009
  – For z196 GA2: the minimum recommended field level for OSA-Express3 is microcode level Driver 93, EC N48158, MCL009
  – For z196 GA2: the minimum recommended field level for OSA-Express4 is microcode level Driver 93, EC N48121, MCL010
  IWQ must be configured using the INTERFACE statement (not DEVICE/LINK)
  IWQ is not supported when z/OS is running as a z/VM guest with simulated devices (VSWITCH or guest LAN)
  Make sure to apply z/OS V1R12 PTF UK61028 (APAR PM20056) for an added streaming throughput boost with IWQ
  • 68. OLM Usage Considerations (1): OSA Sharing
  Concurrent interfaces to an OSA-Express port using OLM are limited:
  – If one or more interfaces operate OLM on a given port:
  • Only four total interfaces are allowed to that single port
  • Only eight total interfaces are allowed to that CHPID
  – All four interfaces can operate in OLM
  – An interface can be:
  • Another interface (e.g., IPv6) defined for this OSA-Express port
  • Another stack on the same LPAR using the OSA-Express port
  • Another LPAR using the OSA-Express port
  • Another VLAN defined for this OSA-Express port
  • Any stack activating the OSA-Express Network Traffic Analyzer (OSAENTA)
  • 69. OLM Usage Considerations (2)
  QDIO Accelerator or HiperSockets Accelerator will not accelerate traffic to or from an OSA-Express operating in OLM
  OLM usage may increase z/OS CPU consumption (due to "early interrupt"):
  – Usage of OLM is therefore not recommended on z/OS images expected to run routinely at extremely high utilization levels
  – OLM does not apply to the bulk-data input queue of an IWQ-mode OSA. From a CPU-consumption perspective, OLM is therefore a more attractive option when combined with IWQ than without it
  Only supported on OSA-Express3 and above with the INTERFACE statement
  Enabled via PTFs for z/OS V1R11: PK90205 (PTF UK49041) and OA29634 (UA49172)
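  Like IWQ, OLM is requested per interface on the INTERFACE statement. A hypothetical profile fragment (names and addresses invented; assumes the V1R11 PTFs above or a later release) might look like:

```
INTERFACE OSAOLM1                 ; hypothetical interface name
  DEFINE IPAQENET                 ; OLM requires the INTERFACE statement
  PORTNAME OSAPRT2                ; hypothetical OSA port name
  IPADDR 10.1.2.1/24
  OLM                             ; request Optimized Latency Mode
```

  Given the sharing limits on slide 68 and the CPU-consumption notes above, treat this as a sketch to validate against the IP Configuration Reference rather than a drop-in configuration.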