OpenPOWER Acceleration of HPCC Systems
2015 LexisNexis® Risk Solutions HPCC Engineering Summit – Community Day
Presenters: JT Kellington & Allan Cantle
September 29, 2015
Delray Beach, FL
Accelerating the Pace of Innovation
Overview/Agenda
• The OpenPOWER Story
• Porting HPCC Systems to ppc64el
• POWER8 at a Glance
• POWER8 Data Centric Architectures with FPGA Acceleration
• Summary
The OpenPOWER Story
OpenPOWER, a catalyst for Open Innovation
Market Shifts
• Moore’s law no longer satisfies performance gains
• Growing workload demands
• Numerous IT consumption models
• Mature open software ecosystem

New Open Innovation
• Rich software ecosystem
• Spectrum of POWER servers
• Multiple hardware options
• Derivative POWER chips

• Performance of POWER architecture: amplified capability
• Open development: open software, open hardware
• Collaboration of thought leaders: simultaneous innovation across multiple disciplines

The OpenPOWER Foundation is an open development community,
using the POWER Architecture to serve the evolving needs of customers.
• 145+ members investing in the Power architecture
• Geographic, sector, and expertise diversity
• 1,500 ISVs serving Linux on POWER, now with easier porting from x86!
• Open from the chip through software
• 9 workgroups chartered, 15 hardware innovations revealed at the March 2015 summit
• Partner-announced plans, offerings, deployments
www.openpowerfoundation.org
Porting HPCC Systems to ppc64el
HPCC Systems and POWER8 Designed For Easy Porting
HPCC Systems Framework
• ECL is Platform Agnostic
• Hardware and architecture changes transparent to the end user
• CMake, gcc, binutils, Node.js, etc.
• Open Source!
• Quick and easy collaboration
POWER8 Designed for Open Innovation
• Little Endian OS
• More compatible with industry
• IBM Software Development Kit for Linux on Power
• Advance Toolchain
• Post-Link Optimization
• Linux performance analysis tools
• Full support in Ubuntu, RHEL, and SLES
Porting Details
Initial port took less than a week!
• Added ARCH defines for PPC64EL (system/include/platform.h)
• Added stack description for POWER8 (ecl/hql/hqlstack.hpp)
• Added binutils support for powerpc (ecl/hqlcpp/hqlres.cpp)
Deployment and testing took a bit longer
• A couple of bugs found:
• Stricter alignment required for atomic operations
• Newer compiler no longer allows comparing “this” to NULL (see the sketch below)
• Updates to the init system with the latest Ubuntu
• Monkey on the keyboard!
• Basic setup and running was easy, but harder to go to the “next level”
• Storage configuration, memory usage, thread support, etc.
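The slides do not reproduce the actual source changes, so the fragment below is only a hedged sketch of what the two fixes typically look like: an architecture define keyed off the compiler's predefined ppc64el macros (the HPCC-style macro name here is hypothetical), and removing a comparison of "this" against NULL that newer GCC releases reject or optimize away.

```cpp
// Hypothetical sketch only; the real changes live in system/include/platform.h,
// ecl/hql/hqlstack.hpp and ecl/hqlcpp/hqlres.cpp in the HPCC Systems sources.

// 1. Architecture define, keyed off GCC's predefined macros on ppc64el.
#if defined(__powerpc64__) && defined(__LITTLE_ENDIAN__)
  #define _ARCH_PPC64EL_ 1          // hypothetical HPCC-style ARCH macro
#endif

// 2. Newer compilers refuse (or silently remove) "this == NULL" checks,
//    because a member function can never legally be called on a null object.
//    The fix is to drop the check, or test the pointer at the call site.
class CExample
{
public:
    int value() const
    {
        // Before (no longer accepted):  if (this == NULL) return 0;
        return v;
    }
    int v = 42;
};

int main()
{
    CExample e;
    CExample *p = &e;
    return (p && p->value() == 42) ? 0 : 1;   // null test moved to the caller
}
```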
POWER8 at a Glance
POWER8 Is Designed For Big Data!
Technology
• 22 nm SOI with eDRAM, 15 metal layers, 650 mm²
Cores
• 12 cores (SMT8)
• 8 dispatch, 10 issue, 16 execution pipes
• 2x internal data flows/queues
• Enhanced prefetching
• 64 KB data cache, 32 KB instruction cache per core
Caches
• 512 KB SRAM L2 per core
• 96 MB eDRAM shared L3
Memory
• Up to 230 GB/s sustained bandwidth
Bus Interfaces
• Durable open memory attach interface
• Up to 48 integrated PCIe Gen3 lanes per socket
• SMP interconnect
• CAPI
Accelerators
• Crypto and memory expansion
• Transactional memory
• VMM assist
• Data move / VM mobility
POWER8 Scale-Out Dual Chip Module
(Diagram: two 6-core chips on one module, each core with its own L2 feeding a set of shared L3 banks over the chip interconnect; each chip has its own memory buses, SMP interconnect links, CAPI, and PCIe interfaces.)
POWER8 Introduces CAPI Technology
(Diagram: POWER8 processor with its CAPP unit talking over PCIe to an FPGA that contains the IBM-supplied POWER Service Layer and the Accelerated Functional Unit (AFU).)

Typical I/O Model Flow:
DD Call → Copy or Pin Source Data → MMIO Notify Accelerator → Acceleration → Poll / Interrupt Completion → Copy or Unpin Result Data → Return From DD Completion

Flow with a Coherent Model:
Shared Memory Notify Accelerator → Acceleration → Shared Memory Completion
Advantages of Coherent Attachment Over I/O Attachment
• Virtual addressing & data caching (significant latency reduction)
• Easier, natural programming model (avoids application restructuring)
• Enables apps not possible on I/O (pointer chasing, shared memory semaphores, …)
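To make the difference between the two flows concrete, here is a minimal, purely illustrative C++ sketch of the coherent model: the work element lives in ordinary process memory, the accelerator is notified, and completion is signalled back through the same shared memory. The struct, the names, and the std::thread standing in for the AFU are all assumptions for illustration; this is not the CAPI/libcxl API.

```cpp
#include <atomic>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

// Work element in ordinary, coherently shared process memory (hypothetical layout).
struct WorkElement {
    const int *data = nullptr;
    size_t     n = 0;
    long       result = 0;
    std::atomic<int> doorbell{0};   // "shared memory: notify accelerator"
    std::atomic<int> done{0};       // "shared memory: completion"
};

// A thread stands in for the AFU: it sees the same virtual addresses as the app.
void afu(WorkElement *we) {
    while (!we->doorbell.load(std::memory_order_acquire)) { }       // wait for notify
    we->result = std::accumulate(we->data, we->data + we->n, 0L);   // "acceleration"
    we->done.store(1, std::memory_order_release);                   // post completion
}

int main() {
    std::vector<int> input(1 << 16, 1);
    WorkElement we;
    we.data = input.data();
    we.n = input.size();

    std::thread accel(afu, &we);
    // Coherent flow: no driver call, no pinning, no copies; just notify and poll.
    we.doorbell.store(1, std::memory_order_release);
    while (!we.done.load(std::memory_order_acquire)) { }
    std::printf("result = %ld\n", we.result);
    accel.join();
    return 0;
}
```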
CAPI Attached Flash Optimization
• Attach IBM FlashSystem to POWER8 via CAPI
• Read/write commands issued via APIs from applications to eliminate 97% of code path length
• Saves 20-30 cores per 1M IOPS
(Diagram) Traditional path: Application → read/write syscall → file system → LVM → disk and adapter device driver (strategy()/iodone()), including pinning buffers, address translation, DMA mapping, starting the I/O, then interrupt handling, unmapping, unpinning, and iodone scheduling: roughly 20K instructions per I/O.

CAPI path: Application → user library exposing a POSIX-async-I/O-style API (aio_read()/aio_write()) → shared-memory work queue to the flash: reduced to fewer than 2,000 instructions.
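The slide names a POSIX-async-I/O-style API exposed by the user library. The CAPI Flash user library itself is not shown in the deck, so the sketch below uses the standard POSIX AIO calls as a stand-in, purely to show the application-level flow: submit, keep working, then collect the completion in the spirit of iodone(). On Linux this builds with a C++ compiler and, on older glibc, -lrt.

```cpp
#include <aio.h>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv) {
    const char *path = (argc > 1) ? argv[1] : "/etc/hostname";    // any readable file
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    struct aiocb cb;
    std::memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }     // submit, returns at once

    // ... the application keeps doing useful work while the read is in flight ...

    const struct aiocb *list[1] = { &cb };
    aio_suspend(list, 1, nullptr);            // block until the request completes
    ssize_t n = aio_return(&cb);              // completion status, like iodone()
    std::printf("read %zd bytes from %s\n", n, path);

    close(fd);
    return 0;
}
```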
CAPI Unlocks the Next Level of Performance for Flash
Identical hardware, an IBM POWER S822L attached to IBM FlashSystem, with two different paths to the data: conventional I/O over Fibre Channel vs. CAPI.

(Charts: IOPS per hardware thread and latency in microseconds, conventional vs. CAPI.)
• >5x better IOPS per hardware thread
• >2x lower latency
POWER8 Data Centric Architectures with FPGA Acceleration
• Allan Cantle – President & Founder, Nallatech
Nallatech at a glance
Server-qualified accelerator cards featuring FPGAs, network I/O, and an open-architecture software/firmware framework
» Energy-efficient high performance heterogeneous computing
» Real-time, low latency network and I/O processing
» Design services for application porting
» IBM OpenPOWER partner
» Altera OpenCL partner
» Xilinx Alliance partner
» 20-year history of FPGA acceleration, including IBM System Z qualified & 1000+ node deployments
Heterogeneous Computing – Motivation
• Efficient evolutionary compute performance
• 1998: DARPA Polymorphic Computing
• 2003: The Power Wall

(Diagram: application data types ranging from bit level, through streaming/SIMD/vector (DSP), to symbolic, plotted against SWEPT efficiency (Size, Weight, Energy, Performance, Time); FPGA processors, multicore microprocessors, and GPGPU processors each excel at part of the range. A polymorphic processor is a processor that is equally efficient at processing all of the different data types.)
FPGAs for “Software Accessible” Compute Acceleration
• FPGAs are the best processors for
• “In the pipe” stream computing operations
• Highly parallel bit- and byte-level computation
• FPGAs are NOW programmable by software engineers with OpenCL! (see the sketch below)
• Industry-standard language managed by Khronos
• Heterogeneous language of choice for all processor types
• Low-level, explicitly parallel language
• Heterogeneous support is evolving
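A minimal sketch of what “programmable by software engineers with OpenCL” looks like: a tiny kernel plus the standard OpenCL host calls. FPGA toolchains (e.g. Altera’s) normally compile the kernel offline and load it with clCreateProgramWithBinary() rather than building from source at run time, but the host-side flow is otherwise the same. Error handling is omitted, and the kernel and buffer sizes are made up for illustration.

```cpp
#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Explicitly parallel: each work-item processes one element of the stream.
static const char *kSource = R"CLC(
__kernel void threshold(__global const uchar *in, __global uchar *out, uchar limit)
{
    size_t i = get_global_id(0);
    out[i] = (in[i] > limit) ? in[i] : (uchar)0;
}
)CLC";

int main() {
    cl_platform_id plat; cl_device_id dev; cl_int err;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

    // For an FPGA: clCreateProgramWithBinary() with a precompiled bitstream instead.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, NULL, &err);
    clBuildProgram(prog, 1, &dev, "", NULL, NULL);
    cl_kernel kern = clCreateKernel(prog, "threshold", &err);

    std::vector<unsigned char> in(4096, 200), out(4096, 0);
    cl_mem dIn  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 in.size(), in.data(), &err);
    cl_mem dOut = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, out.size(), NULL, &err);

    cl_uchar limit = 128;
    clSetKernelArg(kern, 0, sizeof dIn, &dIn);
    clSetKernelArg(kern, 1, sizeof dOut, &dOut);
    clSetKernelArg(kern, 2, sizeof limit, &limit);

    size_t global = in.size();
    clEnqueueNDRangeKernel(q, kern, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dOut, CL_TRUE, 0, out.size(), out.data(), 0, NULL, NULL);
    std::printf("out[0] = %u\n", out[0]);

    clReleaseMemObject(dIn);  clReleaseMemObject(dOut);
    clReleaseKernel(kern);    clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
```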
Typical Thor & Roxie Cluster Node Configurations
• Large Thor scale-out cluster = 400 nodes
• Large Roxie scale-out cluster = 100 nodes

Thor node (diagram): Xeon E5-2670 v3, 12 cores (L1 64 KB/core, L2 256 KB/core, L3 30 MB), 16 GB memory at ~60 GB/s, 2x QPI links, a PCIe SAS RAID5 controller with 3x 1.2 TB HDDs (200 MB/s, 140 IOPs each), and a PCIe network controller with 10 GbE; roughly 400 MB/s at 140 IOPs to storage per node.

Roxie node (diagram): the same Xeon E5-2670 v3 and 16 GB memory configuration, with a SATA RAID5 controller and 3x 1 TB SSDs (560 MB/s, 100 KIOPs each) plus 10 GbE networking; roughly 1.1 GB/s at 100 KIOPs to storage per node.

(Legend: I/O bus vs. memory bus links.)
Typical Scale-Out Thor & Roxie Cluster Configurations

400-node Thor cluster (diagram): server nodes 1 through 400, each with direct-attached disc at 400 MB/s / 140 IOPs, connected through a hierarchical 400-port network switch at 1 GB/s per node.
• CPU-intensive operations on fragmented storage data
• Average 1 GB/s to storage
• Mapper-type parallel operations
• Aggregate 160 GBytes/s storage bandwidth
• Aggregate 56K IOPs

100-node Roxie cluster (diagram): server nodes 1 through 100, each with SSD at 1.1 GB/s / 100 KIOPs, connected through a hierarchical 100-port network switch at 1 GB/s per node.
• CPU-intensive operations on fragmented storage data
• Average 1 GB/s to storage
• Pre-indexing eases the limitation
• Mapper-type parallel operations
• Aggregate 110 GBytes/s storage bandwidth
• Aggregate 10M IOPs
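The aggregate figures follow directly from the per-node numbers above:

```latex
\begin{aligned}
\text{Thor (400 nodes):}  &\quad 400 \times 400\,\text{MB/s} = 160\,\text{GB/s},
                          &\quad 400 \times 140\,\text{IOPs} &= 56\,\text{K IOPs}\\
\text{Roxie (100 nodes):} &\quad 100 \times 1.1\,\text{GB/s} = 110\,\text{GB/s},
                          &\quad 100 \times 100\,\text{K IOPs} &= 10\,\text{M IOPs}
\end{aligned}
```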
Memory Coherency & Data Movement Conflict
• Mapper-type, needle-in-a-haystack operations
• Little to no compute on lots of data
• But hey, it’s not a problem for the Xeon, so why worry?
• It creates unnecessary traffic congestion across I/O interfaces
• I/O management consumes CPU threads
• Excessive cache thrashing consumes memory bandwidth
• E.g. only eight 4 KB I/O buffers fit in a 32 KB L1 data cache
• Power & efficiency issues from all the unnecessary data movement
(Diagram: data-movement example of a CPU operating on large data files with mapper-type functions. The 10 GbE network controller, the storage, and the I/O paths are each marked 1x, while the CPU’s L1/L2/L3 caches and memory are marked roughly 100x: the burden of using CPU-centric architectures for data-centric problems. Legend: I/O bus vs. memory bus links.)
Memory Coherency & Data Movement Conflict
• Processor-intensive operations on fragmented data across the cluster
• The CPU is burdened by passing lots of fragmented data from storage to the network
• Similarly, the CPU has to wait for fragmented data to arrive
• Network congestion & latency become big issues
• Fewer CPU cycles & less cache memory available for compute-intensive operations

(Diagram: example of a CPU moving stored data to the network for another CPU; again the storage, network, and I/O paths are marked 1x against roughly 100x of traffic through the caches and memory: the burden of using CPU-centric architectures for data-centric problems.)
Homogeneous Commodity Clusters still win out…right?
• Maybe, but storage is getting a LOT faster with NVMe!
(Diagram: example of a CPU moving stored data to the network for another CPU, now with 24 NVMe drives potentially delivering 72 GB/s over PCIe against roughly 60 GB/s of available memory bandwidth. The increased storage I/O bandwidth would saturate the CPU’s available memory bandwidth, and the CPU-centric architecture breaks down completely: the burden of using CPU-centric architectures for data-centric problems.)
IBM initiated Data Centric Transformation with CAPI
• CAPI = Coherent Accelerator Processor Interface
• Transport over the PCIe bus
• IBM’s CAPI Flash product
• Makes flash memory appear as block memory (BM)
• FPGA provides the “translation” from BM to flash commands
• CPU threads for storage management reduced from 40-50 down to 2
• CAPI-attached network adaptor
• 3 Symmetric Multiprocessing (SMP) interconnects for scale-up architectures
(Diagram: POWER8 with L1/L2/L3 caches, memory at 230 GB/s, 3x SMP links, a PCIe network controller with 10 GbE, and a CAPI-attached FPGA at 4 GB/s fronting 56 TB of storage. IBM standard flash drawer vs. CAPI flash drawer. Legend: I/O bus vs. memory bus links.)
Introducing In-Storage Acceleration over CAPI
• Extend CAPI Flash Product Innovations
• Increase CAPI Flash Bandwidth
• FPGA Acceleration
• Leverage NVMe IOPS & Sequential data rates
• Data Compression, Encryption, Filtering
• Bit/Byte Level Stream Computing
• Very power efficient architecture
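As a concrete (and entirely hypothetical) illustration of the filtering bullet above: the value of an in-storage accelerator is that a predicate like the one below runs next to the flash at the full local bandwidth, so only the matching records ever cross the 4 GB/s CAPI link to the CPU. The record layout, field names, and function are made up; on the FPGA this would be a pipelined bit/byte-level scan rather than C++.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct Record {                 // hypothetical fixed-size record
    uint32_t key;
    char     payload[60];
};

// What the in-storage accelerator conceptually does: scan everything locally,
// return only the matches to the host.
std::vector<Record> filterByKey(const Record *recs, size_t n, uint32_t wanted) {
    std::vector<Record> matches;
    for (size_t i = 0; i < n; ++i)
        if (recs[i].key == wanted)
            matches.push_back(recs[i]);
    return matches;
}

int main() {
    std::vector<Record> stored(100000);                 // stands in for data on flash
    for (size_t i = 0; i < stored.size(); ++i) stored[i].key = i % 1000;
    auto hits = filterByKey(stored.data(), stored.size(), 7u);
    std::printf("%zu of %zu records returned to the host\n", hits.size(), stored.size());
    return 0;
}
```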
(Diagram: POWER8 with memory at 230 GB/s and 3x SMP links; the CAPI-attached FPGA now sits in the storage path as an acceleration layer in front of the 56 TB storage and the network controller. Link bandwidths shown: 4 GB/s on the CAPI link to the CPU, 16 GB/s and 96 GB/s within the accelerated storage/network path, 10 GbE to the network. Legend: memory bus links.)
P8 Node with CAPI Flash – Available Today

(Diagram, maximum configuration shown: two POWER8 12-core sockets linked at 76.8 GB/s, each with 1 TB of memory at 230 GB/s; four CAPI-attached FPGA flash drawers of 56 TB each, at 4 GB/s and 1 MIOPS apiece; two 40/100 GbE NICs.)
• Possible configuration for scale-out Roxie & Thor clusters
Scale Out P8 Node with FPGA accelerated CAPI Flash

(Diagram, maximum configuration shown: the same two-socket node, 1 TB of memory at 230 GB/s per socket and a 76.8 GB/s socket interconnect, with two 40/100 GbE NICs; each of the four 56 TB FPGA flash drawers now delivers 60 GB/s and 14.4 MIOPs at the FPGA/storage side, while its CAPI link to the socket remains 4 GB/s at 1 MIOPS.)
• Possible configuration for scale-out Roxie & Thor clusters
Scale Up IBM E870/E880 with Scale out FPGA Acceleration

(Diagram: four POWER8 sockets, fully interconnected at 76.8 GB/s, each with 1 TB of memory at 230 GB/s and its own 56 TB FPGA-accelerated storage drawer at 4 GB/s over CAPI, with 60 GB/s and 14.4 MIOPs at the storage side.)
• Balanced scale-up & scale-out solution
• Scale-up processors are ideal for compute-intensive operations
• Scale-out FPGA-accelerated storage is ideal for parallel mapper-type operations
Scale Up IBM E880 with Scale out FPGA Acceleration

(Diagram: the same four-socket building block, each socket with 1 TB of memory and a 56 TB FPGA storage drawer, replicated across a full 16-socket E880 system.)
• 16-way, 192-core scale-up compute with aggregate:
• 16 TB memory @ 3.68 TBytes/s
• 896 TB storage @ 960 GBytes/s
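These aggregates are simply the per-socket figures (12 cores, 1 TB of memory at 230 GB/s, 56 TB of FPGA-accelerated storage at 60 GB/s) multiplied across the 16 sockets:

```latex
\begin{aligned}
16 \times 12\ \text{cores} &= 192\ \text{cores}\\
16 \times 1\,\text{TB} &= 16\,\text{TB memory}, & 16 \times 230\,\text{GB/s} &= 3.68\,\text{TB/s}\\
16 \times 56\,\text{TB} &= 896\,\text{TB storage}, & 16 \times 60\,\text{GB/s} &= 960\,\text{GB/s}
\end{aligned}
```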
In Summary: HPCC Systems and OpenPOWER Are Ready For Business!
• IBM POWER8’s data centric architecture is ideal for Big Data problems
• FPGAs provide an ideal streaming, near-storage accelerator
• Opportunity to re-architect HPCC’s ECL for POWER8 with FPGA acceleration
• Will provide a step change in performance at lower power & footprint
• A community effort will help to accelerate the transition
Contact information: JT Kellington | Allan Cantle
Phone: 512-286-6948 | 805-377-4993
Email: jwkellin@us.ibm.com | a.cantle@nallatech.com
Web: www.ibm.com | www.nallatech.com