We create knowledge – today for tomorrow
OpenCAPI-based image analysis pipeline for 18 GB/s kHz-framerate X-ray camera at the SLS synchrotron
Filip Leonarski :: Beamline Data Scientist :: Macromolecular Crystallography
Plan
• Introduction: Macromolecular crystallography at synchrotrons and X-ray detectors
• Technology: POWER + OpenCAPI
• Solution: Jungfraujoch
X-ray
1901 Nobel Prize
W. Röntgen
Discovery of X-rays
X-ray macromolecular crystallography (MX)
1962 Nobel Prize
F. Crick, J. Watson and M. Wilkins
Structure of DNA double helix solved with X-rays
(Photo 51 by R. Gosling and R. Franklin)
2009 Nobel Prize
V. Ramakrishnan*, T. Steitz, A. Yonath*
Structure of ribosome
(*) some of their structures were solved at PSI
Wikipedia:
X-ray crystallography is the experimental science determining the atomic and
molecular structure of a crystal, in which the crystalline structure causes a beam of
incident X-rays to diffract into many specific directions. By measuring the angles
and intensities of these diffracted beams, a crystallographer can produce a
three-dimensional picture of the density of electrons within the crystal.
MX at synchrotron
• Particle accelerators are the source of the brightest X-ray beams (multiple orders of magnitude brighter than conventional X-ray tubes): charged particles emit X-rays when they travel through a magnetic field
- The effect is a nuisance for high energy physics (undesirable energy loss),
- but it is a blessing for structural science => modern storage rings are built exclusively as light sources.
• Synchrotrons provide a continuous X-ray beam, while X-ray free electron lasers produce bright, femtosecond-long pulses
Paul Scherrer Institute
[Photo labels: SwissFEL, Swiss Light Source, Swiss Alps]
MX at Swiss Light Source and SwissFEL
• 3 experimental stations at the synchrotron
• 1 experimental station at SwissFEL
• Beamtime is shared between academic and industrial users
- Industrial customers are mostly pharmaceutical companies looking at how drugs bind to potential drug targets
- Academic users are universities and scientific institutes worldwide doing basic research in structural biology
Major upgrade in 2024/2025 for SLS 2.0
• New storage ring to be installed in 2024-2025
• Flux (photons/second) will increase by an order of magnitude
• Measurements can be done 10x faster
• Enables the fragment screening method, i.e. a single protein target is crystallized with hundreds or thousands of molecular fragments to find the best drug
- This is like molecular docking, but done fully experimentally
New detector for SwissFEL and SLS 2.0
• PSI is a major detector developer
- Hybrid pixel detectors were developed for CERN high energy physics experiments
- The design could be used for X-ray cameras: first PILATUS in the 2000s
- PSI start-up Dectris commercialized the PILATUS and EIGER detectors; most synchrotrons are equipped with their detectors
• Currently PSI is rolling out a new generation: JUNGFRAU
Adaptive gain detector to increase dynamic range
• A silicon sensor converts X-rays to electric charge
• Bump-bonded to the sensor is an ASIC with dedicated electronics for each pixel
• Each pixel has three capacitors, allowing different amplification
• They are dynamically switched during exposure to adjust to the incoming charge
Aim: measure reliably from 1 to 20,000,000 photons per second
Pixel output in JF: 0001010111110011 (one 16-bit word containing the gain bits and the ADC value)
Gain: 00:G0 01:G1 11:G2
Photon number: $N_\mathrm{photon} = \dfrac{\mathrm{ADC} - \mathrm{pedestal}}{\mathrm{gain} \times \mathrm{photon\ energy}}$
Gain and pedestal factors are specific for pixel and gain setting (gain: prior calibration; pedestal: dedicated dark run)
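The conversion formula above can be sketched in Python. This is only an illustration under stated assumptions: the bit positions (gain code in the two most significant bits, ADC value in the lower 14 bits) and all calibration numbers are hypothetical choices for the example and are not taken from the slides.

```python
# Gain codes as listed on the slide; the bit positions are an assumption of this sketch.
GAIN_CODES = {0b00: "G0", 0b01: "G1", 0b11: "G2"}


def decode_pixel(raw):
    """Split a 16-bit read-out word into (gain stage, ADC value)."""
    gain_code = (raw >> 14) & 0b11  # assumed: two most significant bits
    adc = raw & 0x3FFF              # assumed: lower 14 bits
    return GAIN_CODES[gain_code], adc


def photon_count(raw, pedestal, gain, photon_energy_kev):
    """Photon number = (ADC - pedestal) / (gain * photon energy).

    pedestal and gain are per-pixel dicts keyed by gain stage;
    gain is expressed in ADC units per keV (hypothetical values).
    """
    stage, adc = decode_pixel(raw)
    return (adc - pedestal[stage]) / (gain[stage] * photon_energy_kev)


# Hypothetical calibration constants for a single pixel.
pedestal = {"G0": 1000.0, "G1": 14000.0, "G2": 15000.0}
gain = {"G0": 40.0, "G1": 1.5, "G2": 0.1}

raw = 0b0001010111110011  # example word shown on the slide
stage, adc = decode_pixel(raw)
photons = photon_count(raw, pedestal, gain, photon_energy_kev=12.4)
```

Because the pedestal and gain tables are looked up per pixel and per gain stage, a full-frame version needs three pedestal and three gain values for every pixel.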
Modular detector
• Detector is modular
• 524,288 pixels per module
• 2.2 kHz * 524,288 pixels * 16 bit = 2.3 GB/s
- 2 x 10 Gbit/s links
• 4 Mpixel detector (2020)
- 16 x 10 Gbit/s
• 10 Mpixel detector (2022)
- 40 x 10 Gbit/s
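The data-rate arithmetic above can be checked with a few lines of Python. The constants come from the slide; the function names are made up for this sketch, and the module count is derived from the link count by assuming two 10 Gbit/s links per module.

```python
PIXELS_PER_MODULE = 524_288
FRAME_RATE_HZ = 2_200
BYTES_PER_PIXEL = 2          # 16 bit
LINKS_PER_MODULE = 2         # 2 x 10 Gbit/s per module
LINK_RATE_GB_S = 10 / 8      # 10 Gbit/s expressed in GB/s


def module_rate_gb_s():
    """Raw data rate of one module in GB/s."""
    return PIXELS_PER_MODULE * FRAME_RATE_HZ * BYTES_PER_PIXEL / 1e9


def detector_rate_gb_s(n_links):
    """Raw data rate of a detector read out over n_links 10 Gbit/s links."""
    modules = n_links // LINKS_PER_MODULE
    return modules * module_rate_gb_s()


per_module = module_rate_gb_s()           # about 2.3 GB/s
link_capacity = LINKS_PER_MODULE * LINK_RATE_GB_S  # 2.5 GB/s, enough for one module
rate_4m = detector_rate_gb_s(16)          # 4 Mpixel detector, roughly 18 GB/s
rate_10m = detector_rate_gb_s(40)         # 10 Mpixel detector, roughly 46 GB/s
```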
MX detector data rates double every 2 years
[Plot: data rate in GB/s (log scale, 0.1 to 100) versus year (2006 to 2024)]
Year  Detector            Size       Frame rate  Data rate
2007  PSI PILATUS         6 Mpixel   12.5 Hz     0.2 GB/s
2014  Dectris EIGER       16 Mpixel  133 Hz      3.4 GB/s
2019  Dectris EIGER 2 XE  16 Mpixel  400 Hz      13.5 GB/s
2020  PSI JUNGFRAU        4 Mpixel   2200 Hz     18.4 GB/s
2022  PSI JUNGFRAU        10 Mpixel  2200 Hz     46.1 GB/s
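The "double every 2 years" claim can be checked against the first and last rows of the list above. The helper below is a small sketch (its name is made up here) that assumes purely exponential growth between the two data points.

```python
import math


def doubling_time_years(year_start, rate_start, year_end, rate_end):
    """Doubling time implied by exponential growth between two data points."""
    doublings = math.log2(rate_end / rate_start)
    return (year_end - year_start) / doublings


# 2007 PILATUS 6 Mpixel (0.2 GB/s) to 2022 JUNGFRAU 10 Mpixel (46.1 GB/s)
t_double = doubling_time_years(2007, 0.2, 2022, 46.1)
```

With these two points the implied doubling time comes out slightly under two years, consistent with the slide title.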
First approach: scale conventional architecture
• Detector streams frames over UDP
- Receiver uses a Linux datagram socket
• Conversion of pixel read-out
- CPU SIMD code
• Compression
- CPU compression
Aim: 20 GB/s
Reached: 5 GB/s
POWER / OpenCAPI / FPGA architecture
FPGAs are perfect devices for data acquisition
• Real-time performance
- FPGA design is cycle-accurate, with fixed latency and throughput
• Large memory throughput
- FPGAs with HBM2 have 460 GB/s bandwidth to 8 GB of memory
• Ethernet on-board
- FPGAs are made to work with networks, often having dedicated “hard” cores for Ethernet
• Development for FPGAs is difficult and time consuming
- Hardware description languages
- PCI Express
• Virtex UltraScale+ HBM (XCVU33P and XCVU35P)
- Available as low-profile, half-length 75 W cards
High-level synthesis
• C/C++ compiler that produces a hardware description language (Verilog or VHDL)
• All code is valid C++: it can be executed on a CPU and is generally functionally equivalent
• Dedicated pragmas guide FPGA synthesis
• It is generally understandable for software developers, but may contain strange or suboptimal constructs from a software point of view
[Code listing on the slide: Bitshuffle for 16-bit numbers]
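The slide's own listing (high-level synthesis C++) is not preserved in this transcript. As a stand-in, the following Python sketch shows the idea of a bitshuffle transform for 16-bit numbers: the bits of a block are transposed so that bit 0 of every value is stored first, then bit 1 of every value, and so on. Function names and the bit order inside each output byte are choices made for this sketch and may differ from the bitshuffle library and from the FPGA core.

```python
def bitshuffle16(values):
    """Transpose a block of 16-bit integers into 16 consecutive bit-planes."""
    n = len(values)
    if n % 8:
        raise ValueError("block length must be a multiple of 8")
    out = bytearray(16 * n // 8)
    for bit in range(16):
        for i, value in enumerate(values):
            if (value >> bit) & 1:
                pos = bit * n + i          # position in the transposed bit stream
                out[pos // 8] |= 1 << (pos % 8)
    return bytes(out)


def bitunshuffle16(data):
    """Inverse of bitshuffle16."""
    n = len(data) // 2                     # 16 bits per value, 8 bits per byte
    values = [0] * n
    for bit in range(16):
        for i in range(n):
            pos = bit * n + i
            if (data[pos // 8] >> (pos % 8)) & 1:
                values[i] |= 1 << bit
    return values


block = [1, 0, 2, 3, 0, 0, 1, 0]           # small photon counts: only low bits are used
shuffled = bitshuffle16(block)
restored = bitunshuffle16(shuffled)
```

For low photon counts, all high bit-planes become runs of zero bytes, which is what makes the output easy for a general-purpose compressor to shrink.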
High-bandwidth memory
• For VU33P/VU35P:
- Size: 8 GB
- Bandwidth: up to 460 GB/s
- Latency: up to 120 cycles @ 200 MHz
• Complex architecture
- 32 x 256-bit AXI3 interfaces
- Operating either as 32 separate memories
- Or as a single memory with a crossbar (at the cost of up to 50% of throughput)
• The 256-bit width is a problem, as data are 512-bit (PCIe Gen3 x16) or 1024-bit (OpenCAPI, PCIe Gen4 x16)
• Simulation only with special tools (Cadence Xcelium), impossible with Xilinx tools
PCI Express DMA
• PCI Express is a CPU-centric bus, as it is designed to support peripherals
• This is a good model when the FPGA is a coprocessor to the CPU, which sends data and waits for a reply
=> but for data acquisition it is the FPGA that produces the data; the CPU has no prior knowledge of which packet will be processed at a given time
• DMA operates on physical addresses: virtual addresses need to be pinned by the kernel (so that they are not swapped or moved)
=> need to maintain own driver
=> address translation cache possible on FPGA, but requires memory
Xilinx QDMA is a robust but highly complex solution for PCI Express, used to interface FPGAs with x86 AMD and Intel CPUs
POWER architecture
• IBM POWER9 showed great numbers for I/O and memory throughput in the Summit and Sierra supercomputers
• IBM designed its own memory-coherent interface for accelerators (CAPI/OpenCAPI), which has advantages over PCIe
[Image source: Wikipedia]
OpenCAPI
[Photo labels: FPGA board, POWER9 CPU, OpenCAPI cable]
• Predecessor CAPI => proprietary IBM
• Communication over PCIe physical lanes (but with a different protocol)
• OpenCAPI => consortium model
• Dedicated cabling (8 x 25 Gbit/s lanes)
• For POWER10, this will be the default memory interface (allowing any type of memory to be attached to the CPU, and memory to be “shared” over the network)
What difference does OpenCAPI bring?
• A difference similar to what the 80286/80386 virtual mode brought to software development
• In OpenCAPI only a single kernel operation is needed
=> Attach the accelerator to a running process
• Then the accelerator has access to the virtual address space of the running process; it is the FPGA that initiates the communication
=> Address translation is handled by the TLB and OS
=> The FPGA sees memory in a fully cache-coherent way
• All security/reliability/efficiency mechanisms of the CPU and kernel are also present in OpenCAPI
[Image source: Wikipedia]
How to develop with OpenCAPI?
• The main function of the action takes a pointer to the virtual address space
- On the device, the pointer is synthesized as a 1024-bit master memory-mapped AXI interface
- On the CPU, this pointer just has to be set to zero (which is the first address of the virtual address space)
• Any cell in virtual memory is simply accessed as an offset from this pointer
• The only requirement is that memory is aligned to 128 bytes
- No special memory allocator; malloc or mmap is fine
- No pinning/registering
• The same memory buffer class is used for both simulation and working with the device
• For configuration, there is also a 4 MiB memory-mapped I/O space (like a BAR in PCIe)
- On the device it is implemented as a slave AXI-Lite interface (32-bit)
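The 128-byte alignment requirement can be illustrated with a short host-side check. This is only a sketch in Python with the standard `mmap` and `ctypes` modules; it does not use the OpenCAPI software stack, and the helper names are made up for the example. An anonymous `mmap` region is page-aligned, so it also satisfies the 128-byte requirement.

```python
import ctypes
import mmap

ALIGNMENT = 128  # bytes, the alignment requirement stated above


def align_up(offset, alignment=ALIGNMENT):
    """Round an offset up to the next multiple of the alignment."""
    return (offset + alignment - 1) // alignment * alignment


def is_aligned(address, alignment=ALIGNMENT):
    """True if the address is a multiple of the alignment."""
    return address % alignment == 0


# Anonymous mapping: no special allocator, no pinning or registering.
buf = mmap.mmap(-1, 4096)
address = ctypes.addressof(ctypes.c_char.from_buffer(buf))

# Offsets of consecutive records inside the buffer, each padded to 128 bytes.
record_size = 300
offsets = [i * align_up(record_size) for i in range(3)]  # 0, 384, 768
```

Since the accelerator addresses memory as an offset from a zero base pointer, the virtual address obtained here is the value the device would use directly.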
OC-Accel
• Open source “shell” maintained by IBM
• http://github.com/OpenCAPI/oc-accel
• Provides ready-made tools to work with OpenCAPI (from transceiver setup to AXImm bridge)
• Provides preconfigured interfaces for I/O peripherals (HBM, 100G, NVMe)
• Provides a simulation environment
- One can simulate both SW and HW in a single simulation (neither the user FPGA design nor the software is modified from its “real” implementation)
Jungfraujoch – FPGA implementation
Jungfraujoch server
[Diagram of FPGA cores: Ethernet UDP/IP, Dark current, Conversion, Frame summation, Strong pixel finder, Bitshuffle, Memory writer]
FPGA board with OpenCAPI interface
- Data acquisition
- Initial data analysis
- Pre-compression
(2.5 Mpixel/board for JF)
Up to 50 GB/s acquisition and data analysis in a single 2U IBM POWER9 server with 1-4 FPGA boards
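The per-board and per-server figures can be cross-checked with the module numbers given earlier. The sketch below assumes that 2.5 Mpixel per board corresponds to 5 modules of 524,288 pixels, and that the 4 Mpixel and 10 Mpixel detectors have 8 and 20 modules (derived from their 16 and 40 links at 2 links per module); the function names are made up for this example.

```python
import math

PIXELS_PER_MODULE = 524_288
FRAME_RATE_HZ = 2_200
BYTES_PER_PIXEL = 2      # 16 bit
MODULES_PER_BOARD = 5    # assumed: 5 * 524,288 pixels = 2.5 Mpixel per board


def boards_needed(modules):
    """Number of FPGA boards required to read out a detector."""
    return math.ceil(modules / MODULES_PER_BOARD)


def board_rate_gb_s():
    """Raw input data rate handled by one fully loaded board, in GB/s."""
    pixels = MODULES_PER_BOARD * PIXELS_PER_MODULE
    return pixels * FRAME_RATE_HZ * BYTES_PER_PIXEL / 1e9


boards_4m = boards_needed(8)              # 4 Mpixel detector
boards_10m = boards_needed(20)            # 10 Mpixel detector
per_board = board_rate_gb_s()             # about 11.5 GB/s
server_total = boards_10m * per_board     # about 46 GB/s, below the 50 GB/s quoted
```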
Jungfraujoch FPGA streaming design
Modular design
• A stream of data is handled by successive cores working in parallel
=> throughput and latency of each core are determined by the hardware design
• Extra stages can be added relatively simply, with the option to bypass cores
• All cores are C++ functions, connected with AXI-Stream FIFOs
• As buffering is expensive on an FPGA, the design is best suited for algorithms that have limited dependencies between frames
Jungfraujoch
Ethernet UDP/IP core
Processes Ethernet packets from the network, ignores unnecessary packets, and reads the frame header to get the frame number, module number, etc.
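The job of this core can be sketched in software as follows. The header layout used here (64-bit frame number, 16-bit module number, 16-bit packet number, little-endian, followed by a fixed-size payload) is entirely hypothetical and chosen only for illustration; the real JUNGFRAU packet format is not given in the slides.

```python
import struct
from typing import NamedTuple, Optional, Tuple

# Hypothetical header: frame number (u64), module number (u16), packet number (u16).
HEADER = struct.Struct("<QHH")
PAYLOAD_BYTES = 8192  # hypothetical payload size per packet


class PacketHeader(NamedTuple):
    frame_number: int
    module_number: int
    packet_number: int


def parse_packet(datagram: bytes) -> Optional[Tuple[PacketHeader, bytes]]:
    """Return (header, payload), or None for packets that should be ignored."""
    if len(datagram) != HEADER.size + PAYLOAD_BYTES:
        return None  # not a detector packet: drop it
    header = PacketHeader(*HEADER.unpack_from(datagram))
    return header, datagram[HEADER.size:]


# Build one synthetic packet and parse it again.
packet = HEADER.pack(1234, 3, 17) + bytes(PAYLOAD_BYTES)
parsed = parse_packet(packet)
ignored = parse_packet(b"unrelated traffic")
```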
Jungfraujoch
Dark current core
This core is responsible for calculating a moving average of detector frames. The calculated value is used as the dark current (pedestal) for subsequent frames.
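A per-pixel moving average of this kind can be sketched as below. The slides do not say which averaging scheme the FPGA core uses; this example assumes a simple sliding window over the last few dark frames, and the class name is made up for the sketch.

```python
from collections import deque


class MovingPedestal:
    """Per-pixel moving average over the last `window` frames."""

    def __init__(self, n_pixels, window):
        self.window = window
        self.frames = deque()
        self.sums = [0] * n_pixels

    def update(self, frame):
        """Add a new frame and drop the oldest one once the window is full."""
        self.frames.append(frame)
        for i, value in enumerate(frame):
            self.sums[i] += value
        if len(self.frames) > self.window:
            oldest = self.frames.popleft()
            for i, value in enumerate(oldest):
                self.sums[i] -= value

    def pedestal(self):
        """Current pedestal estimate for every pixel."""
        n = len(self.frames)
        return [s / n for s in self.sums]


tracker = MovingPedestal(n_pixels=2, window=2)
for dark_frame in ([10, 20], [12, 22], [14, 24]):
    tracker.update(dark_frame)
current = tracker.pedestal()  # average of the last two frames only
```

Keeping running sums means each new frame costs one addition and one subtraction per pixel, independent of the window length.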
Jungfraujoch
Conversion core
This core translates the JUNGFRAU read-out into units of energy or photon counts. It benefits from the very fast HBM2 memory on the FPGA (460 GB/s). Data leaving this core can be used for processing by data analysis software.
Jungfraujoch
Frame summation core (work in progress)
As data leaving the gain correction core are on a linear scale, frames can be summed to reduce the downstream data rate when a frame rate lower than that of the detector is needed.
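Frame summation amounts to adding groups of consecutive converted frames pixel by pixel. A minimal sketch, assuming frames are flat lists of photon counts and that an incomplete trailing group is simply dropped (a choice made for this example, not something stated in the slides):

```python
def sum_frames(frames, factor):
    """Sum every `factor` consecutive frames pixel by pixel."""
    summed = []
    for start in range(0, len(frames) - factor + 1, factor):
        group = frames[start:start + factor]
        summed.append([sum(pixel_values) for pixel_values in zip(*group)])
    return summed


frames = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 9]]
reduced = sum_frames(frames, factor=2)  # five input frames become two output frames
```

Summing by a factor of N reduces both the frame rate and the downstream data rate by N, provided the output pixel type is wide enough to hold the sums.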
Jungfraujoch
Strong pixel finder core
This is the first step of a spot finding algorithm (for example COLSPOT). It identifies pixels that are stronger than their neighborhood by a given number of standard deviations.
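The criterion described above can be written down directly. The sketch below is a plain, unoptimized illustration: the square neighborhood, the exclusion of the pixel itself from the statistics, and the function name are assumptions of this example, not the exact COLSPOT or FPGA implementation.

```python
import math


def strong_pixels(image, radius, n_sigma):
    """Return (row, col) of pixels exceeding mean + n_sigma * std of their neighborhood.

    The neighborhood is a square of the given radius (radius >= 1),
    clipped at the image edges and excluding the pixel itself.
    """
    rows, cols = len(image), len(image[0])
    found = []
    for r in range(rows):
        for c in range(cols):
            neighbors = [
                image[y][x]
                for y in range(max(0, r - radius), min(rows, r + radius + 1))
                for x in range(max(0, c - radius), min(cols, c + radius + 1))
                if (y, x) != (r, c)
            ]
            mean = sum(neighbors) / len(neighbors)
            variance = sum((v - mean) ** 2 for v in neighbors) / len(neighbors)
            if image[r][c] > mean + n_sigma * math.sqrt(variance):
                found.append((r, c))
    return found


# Flat background of 1 photon per pixel with one bright pixel in the middle.
image = [[1] * 5 for _ in range(5)]
image[2][2] = 100
spots = strong_pixels(image, radius=1, n_sigma=3)
```

Only the coordinates of strong pixels need to leave the FPGA, which is far less data than the image itself.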
Jungfraujoch
Bitshuffle
FPGAs are bit-order agnostic. Therefore, reordering bits for this popular compression prefilter is practically free on an FPGA.
Jungfraujoch
Host memory write
The address in the host memory buffer is calculated and data are forwarded to host memory via OpenCAPI. Additional image statistics are saved as well.
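An address calculation of this kind could look like the sketch below. The buffer layout is hypothetical: it assumes a ring buffer of whole frames in host memory, with the modules of one frame stored back to back and each module occupying 524,288 pixels of 16 bits. The actual Jungfraujoch layout is not described in the slides.

```python
MODULE_BYTES = 524_288 * 2  # one module of 16-bit pixels, 1 MiB
ALIGNMENT = 128             # OpenCAPI alignment requirement in bytes


def write_offset(frame_number, module_number, modules_per_frame, buffer_frames):
    """Byte offset of one module image inside a ring buffer of frames (hypothetical layout)."""
    slot = frame_number % buffer_frames  # reuse slots once the buffer wraps around
    return (slot * modules_per_frame + module_number) * MODULE_BYTES


first = write_offset(0, 0, modules_per_frame=4, buffer_frames=100)
later = write_offset(1, 2, modules_per_frame=4, buffer_frames=100)
wrapped = write_offset(100, 0, modules_per_frame=4, buffer_frames=100)
```

Because the module size is a multiple of 128 bytes, every offset produced this way keeps the required alignment as long as the buffer itself starts on an aligned address.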
Jungfraujoch implementation on VU33P FPGA
[FPGA floorplan with labelled regions: Spot finding, HBM, Gain, Pedestal, Write data, OpenCAPI, 100G, UDP]
Jungfraujoch FPGA power usage is 18 W/board for the whole streaming functionality (source: Xilinx Vivado Power Report)
2 boards for 4 Mpixel JUNGFRAU and 4 boards for 10 Mpixel JUNGFRAU
Alpha Data 9H3 board
• VU33P or VU35P with 8 GB of HBM2
• OpenCAPI link and PCIe Gen3 x16 (or two PCIe Gen4 x8)
• Small flash (2 kb) to store MAC address, board IR
• QSFP-DD optical socket (same as QSFP28, but with 8 lanes for 2x100G) => compatible with QSFP28 transceivers
• Up to 75 W
OpenCAPI programming: testing
• Software tests: Catch2
- 8 min
- Among other software tests, includes 13 FPGA action tests (whole SLS code)
- Automated tests cover 95% of the lines of high-level synthesis code
- Covers most of the functional correctness, including address calculation
- Main limitation is debugging the parallel behavior of FIFOs (deadlocks, etc.)
• Hardware simulation: Cadence Xcelium
- 4 hours
- Collection of 8 frames from a single module
- Checks whether the hardware description is correct; can find problems with synchronization and other, very rare, issues
- Too slow to verify functionality
Commissioning in KEK (Jan – May 2021)
• The detector and data acquisition system were sent in November for an experiment at Photon Factory, KEK
• More than 2,000 datasets were collected for protein targets; a few real-life native-SAD structures were solved
• Due to the pandemic, detector support and development (including deployment of new FPGA designs) was done fully remotely from Switzerland
BL-1A, Photon Factory: JUNGFRAU detector (top) tested in a helium chamber for native-SAD measurements with 3.75 keV X-rays
Structure of Nucleocapsid Phosphoprotein from SARS-CoV-2 solved in 1 second
• The crystal was previously measured with a conventional setup at our beamline, with the measurement taking longer than one minute
• With the JUNGFRAU detector and OpenCAPI readout, 2000 images collected in one second were enough to solve the structure of this protein
• Experimental team: Filip Leonarski, Sylvain Engilberge, Vincent Olieric, Meitian Wang (MX Group), Aldo Mozzanica (PSI Detector Group)
• The SARS-CoV-2 protein was produced by Zinzula, L., Basquin, J., Bracher, A., Baumeister, W. (MPI, Martinsried)
Possible gain from using FPGA-based system
[Figure courtesy of B. Mesnet (IBM)]
Acknowledgements
MX Group (PSI): Vincent Olieric, Takashi Tomizaki, Chia-Ying Huang, Sylvain Engilberge, Justyna Wojdyła, Meitian Wang
Detector Group (PSI): Aldo Mozzanica, Martin Brückner, Carlos Lopez-Cuenca, Bernd Schmitt
Science IT (PSI): Leonardo Sala
Controls (PSI): Andrej Babic, Leonardo Hax-Damiani
SLS management (PSI): Oliver Bunk
Photon Factory, KEK: Naohiro Matsugaki, Yusuke Yamada, Masahide Hikita
MAX IV: Jie Nan, Zdenek Matej
Uni Konstanz: Kay Diederichs
LBL: Aaron Brewster
DLS: Graeme Winter, DIALS Team
ESRF: Jerome Kieffer
IBM Systems (France): Alexandre Castellane, Bruno Mesnet
InnoBoost SA: Lionel Clavien

More Related Content

PDF
IBM HPC Transformation with AI
PDF
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
PDF
Summit workshop thompto
PPT
OpenPOWER Webinar
PDF
Deeplearningusingcloudpakfordata
PDF
OpenPOWER Latest Updates
PDF
Ac922 cdac webinar
PDF
Covid-19 Response Capability with Power Systems
IBM HPC Transformation with AI
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
Summit workshop thompto
OpenPOWER Webinar
Deeplearningusingcloudpakfordata
OpenPOWER Latest Updates
Ac922 cdac webinar
Covid-19 Response Capability with Power Systems

What's hot (20)

PDF
POWER10 innovations for HPC
PDF
TAU E4S ON OpenPOWER /POWER9 platform
PDF
MIT's experience on OpenPOWER/POWER 9 platform
PDF
OpenPOWER System Marconi100
PDF
Xilinx Edge Compute using Power 9 /OpenPOWER systems
PDF
Overview of HPC Interconnects
PDF
POWER9 for AI & HPC
PDF
Energy Efficient Computing using Dynamic Tuning
PDF
IBM BOA for POWER
PDF
OpenPOWER/POWER9 AI webinar
PDF
Hardware & Software Platforms for HPC, AI and ML
PDF
OpenPOWER Webinar on Machine Learning for Academic Research
PDF
High Performance Interconnects: Assessment & Rankings
PDF
NNSA Explorations: ARM for Supercomputing
PDF
CUDA-Python and RAPIDS for blazing fast scientific computing
PDF
OpenPOWER Acceleration of HPCC Systems
PDF
DOME 64-bit μDataCenter
PDF
State of ARM-based HPC
PDF
SGI: Meeting Manufacturing's Need for Production Supercomputing
PPTX
Ac922 watson 180208 v1
POWER10 innovations for HPC
TAU E4S ON OpenPOWER /POWER9 platform
MIT's experience on OpenPOWER/POWER 9 platform
OpenPOWER System Marconi100
Xilinx Edge Compute using Power 9 /OpenPOWER systems
Overview of HPC Interconnects
POWER9 for AI & HPC
Energy Efficient Computing using Dynamic Tuning
IBM BOA for POWER
OpenPOWER/POWER9 AI webinar
Hardware & Software Platforms for HPC, AI and ML
OpenPOWER Webinar on Machine Learning for Academic Research
High Performance Interconnects: Assessment & Rankings
NNSA Explorations: ARM for Supercomputing
CUDA-Python and RAPIDS for blazing fast scientific computing
OpenPOWER Acceleration of HPCC Systems
DOME 64-bit μDataCenter
State of ARM-based HPC
SGI: Meeting Manufacturing's Need for Production Supercomputing
Ac922 watson 180208 v1
Ad

Similar to OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray Camera at the Swiss Light Source synchrotron (20)

PDF
CEPH DAY BERLIN - CEPH ON THE BRAIN!
PDF
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
PPTX
QCT Ceph Solution - Design Consideration and Reference Architecture
PPTX
QCT Ceph Solution - Design Consideration and Reference Architecture
PDF
PCCC24(第24回PCクラスタシンポジウム):筑波大学計算科学研究センター テーマ2「スーパーコンピュータCygnus / Pegasus」
PDF
IBM and ASTRON 64-Bit Microserver Prototype Prepares for Big Bang's Big Data,...
PDF
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
PPTX
Ceph on 64-bit ARM with X-Gene
PDF
IBM and ASTRON 64bit μServer for DOME
PDF
Manycores for the Masses
PPTX
FPGAs in the cloud? (October 2017)
PPTX
Experiences in Application Specific Supercomputer Design - Reasons, Challenge...
PPTX
Introduction to DPDK
PPTX
Chips&toys
PPT
Semiconductor overview
PDF
OpenPOWER Summit 2020 - OpenCAPI Keynote
PDF
Future Commodity Chip Called CELL for HPC
PPTX
Disaggregating Ceph using NVMeoF
PPTX
supercomputer
PPT
Parallelism Processor Design
CEPH DAY BERLIN - CEPH ON THE BRAIN!
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
QCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference Architecture
PCCC24(第24回PCクラスタシンポジウム):筑波大学計算科学研究センター テーマ2「スーパーコンピュータCygnus / Pegasus」
IBM and ASTRON 64-Bit Microserver Prototype Prepares for Big Bang's Big Data,...
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
Ceph on 64-bit ARM with X-Gene
IBM and ASTRON 64bit μServer for DOME
Manycores for the Masses
FPGAs in the cloud? (October 2017)
Experiences in Application Specific Supercomputer Design - Reasons, Challenge...
Introduction to DPDK
Chips&toys
Semiconductor overview
OpenPOWER Summit 2020 - OpenCAPI Keynote
Future Commodity Chip Called CELL for HPC
Disaggregating Ceph using NVMeoF
supercomputer
Parallelism Processor Design
Ad

More from Ganesan Narayanasamy (20)

PDF
Empowering Engineering Faculties: Bridging the Gap with Emerging Technologies
PDF
Chip Design Curriculum development Residency program
PDF
Basics of Digital Design and Verilog
PDF
180 nm Tape out experience using Open POWER ISA
PDF
Workload Transformation and Innovations in POWER Architecture
PDF
OpenPOWER Workshop at IIT Roorkee
PDF
Deep Learning Use Cases using OpenPOWER systems
PDF
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
PDF
AI in healthcare - Use Cases
PDF
AI in Health Care using IBM Systems/OpenPOWER systems
PDF
AI in Healh Care using IBM POWER systems
PDF
Poster from NUS
PDF
SAP HANA on POWER9 systems
PPTX
Graphical Structure Learning accelerated with POWER9
PDF
AI in the enterprise
PDF
Robustness in deep learning
PDF
Perspectives of Frond end Design
PDF
A2O Core implementation on FPGA
PDF
OpenPOWER Foundation Introduction
PDF
Open Hardware and Future Computing
Empowering Engineering Faculties: Bridging the Gap with Emerging Technologies
Chip Design Curriculum development Residency program
Basics of Digital Design and Verilog
180 nm Tape out experience using Open POWER ISA
Workload Transformation and Innovations in POWER Architecture
OpenPOWER Workshop at IIT Roorkee
Deep Learning Use Cases using OpenPOWER systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare - Use Cases
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Healh Care using IBM POWER systems
Poster from NUS
SAP HANA on POWER9 systems
Graphical Structure Learning accelerated with POWER9
AI in the enterprise
Robustness in deep learning
Perspectives of Frond end Design
A2O Core implementation on FPGA
OpenPOWER Foundation Introduction
Open Hardware and Future Computing

Recently uploaded (20)

PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PDF
NewMind AI Monthly Chronicles - July 2025
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PPTX
MYSQL Presentation for SQL database connectivity
PDF
[발표본] 너의 과제는 클라우드에 있어_KTDS_김동현_20250524.pdf
PDF
solutions_manual_-_materials___processing_in_manufacturing__demargo_.pdf
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Advanced Soft Computing BINUS July 2025.pdf
PDF
GDG Cloud Iasi [PUBLIC] Florian Blaga - Unveiling the Evolution of Cybersecur...
PDF
Modernizing your data center with Dell and AMD
PPTX
breach-and-attack-simulation-cybersecurity-india-chennai-defenderrabbit-2025....
PDF
Approach and Philosophy of On baking technology
PDF
Electronic commerce courselecture one. Pdf
PDF
Empathic Computing: Creating Shared Understanding
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
PPTX
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Dropbox Q2 2025 Financial Results & Investor Presentation
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
NewMind AI Monthly Chronicles - July 2025
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
MYSQL Presentation for SQL database connectivity
[발표본] 너의 과제는 클라우드에 있어_KTDS_김동현_20250524.pdf
solutions_manual_-_materials___processing_in_manufacturing__demargo_.pdf
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Advanced Soft Computing BINUS July 2025.pdf
GDG Cloud Iasi [PUBLIC] Florian Blaga - Unveiling the Evolution of Cybersecur...
Modernizing your data center with Dell and AMD
breach-and-attack-simulation-cybersecurity-india-chennai-defenderrabbit-2025....
Approach and Philosophy of On baking technology
Electronic commerce courselecture one. Pdf
Empathic Computing: Creating Shared Understanding
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Advanced methodologies resolving dimensionality complications for autism neur...
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...

OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray Camera at the Swiss Light Source synchrotron

  • 1. WIR SCHAFFEN WISSEN – HEUTE FÜR MORGEN OpenCAPI-based image analysis pipeline for 18 GB/s kHz-framerate X- ray camera at the SLS synchrotron Filip Leonarski :: Beamline Data Scientist :: Macromolecular Crystallography Page 1
  • 2. • Introduction: Macromolecular crystallography at synchrotrons and X-ray detectors • Technology: POWER + OpenCAPI • Solution: Jungfraujoch Plan Page 2
  • 3. X-ray 1901 Nobel Prize W. Röentgen Discovery of X-rays
  • 4. X-ray macromolecular crystallography (MX) Page 4 1901 Nobel Prize W. Röentgen Discovery of X-rays (Photo 51 by R. Gosling and R. Franklin) 1962 Nobel Prize F. Crick, J. Watson and M. Wilkins Structure of DNA double helix solved with X-rays
  • 5. X-ray macromolecular crystallography (MX) Page 5 1901 Nobel Prize W. Röentgen Discovery of X-rays (Photo 51 by R. Gosling and R. Franklin) 1962 Nobel Prize F. Crick, J. Watson and M. Wilkins Structure of DNA double helix solved with X-rays 2009 Nobel Prize V. Ramakrishnan*, T. Steiz, A. Yonath* Structure of ribosome (*) some of their structures were solved at PSI
  • 6. Wikipedia: X-ray crystallography is the experimental science determining the atomic and molecular structure of a crystal, in which the crystalline structure causes a beam of incident X-rays to diffract into many specific directions. By measuring the angles and intensities of these diffracted beams, a crystallographer can produce a three- dimensional picture of the density of electrons within the crystal. X-ray macromolecular crystallography (MX) Page 6
  • 7. • Particle accelerators are source of the brightest X-ray beam (multiple orders of magnitudes as compared to conventional X- ray tubes), when charged particles travel through magnetic field - Effect is nuisance for high energy physics (undesirable energy loss), - but it is a blessing for structural science => modern storage rings are build exclusively as light sources. • Synchrotrons provide continuous X-ray beam, while X-ray free electron lasers produce femtosecond long bright pulses MX at synchrotron Page 7
  • 8. Paul Scherrer Institute Page 8 SwissFEL Swiss Light Source Swiss Alps
  • 9. • 3 experimental stations at the synchrotron • 1 experimental station at the SwissFEL • Beamtime is shared between academic and industrial users - Industrial customers are mostly pharmaceutical companies looking for drug binding to potential drug targets - Academic users are universities and scientific institutes worldwide doing basic research in structural biology MX at Swiss Light Source and SwissFEL Page 9
  • 10. • New storage ring to be installed in 2024-2025 • Flux (photons/second) will increase by order of magnitude • Measurements can be done 10x faster • Enabling fragment screening method – i.e. single protein target is crystallized with hundredths or thousands of molecular fragments to find best drug - This is like molecular docking, but fully experimentally Major upgrade in 2024/2025 for SLS 2.0 Page 10
  • 11. • PSI is major detector developer - Hybrid pixel detectors developed for CERN high energy physics experiments - Design could be used for X-ray cameras – first PILATUS in 2000s - PSI start-up Dectris, commercialized PILATUS and EIGER detectors, most synchrotrons are equipped with their detectors • Currently PSI is rolling out new generation: JUNGFRAU Page 11 New detector for SwissFEL and SLS 2.0
  • 12. • Silicon sensor converts X-ray to electric charge • Bump bonded to sensor is ASIC, with dedicated electronics for each pixel • Pixel has three capacitors allowing different amplification • They are dynamically switched during exposure to adjust for incoming charge Page 12 Adaptive gain detector to increase dynamic range Aim: measure reliably from 1 to 20,000,000 photons per second
  • 13. Page 13 Adaptive gain detector to increase dynamic range 0001010111110011 Pixel output in JF: 0001010111110011 Gain: 00:G0 01:G1 11:G2 ADC value: 0001010111110011 Photon number: = !"# $ %&'&()*+ ,*-.∗%01)1. &.&2,3 Gain and pedestal factors are specific for pixel and gain setting Prior calibration Dedicated dark run
  • 14. • Detector is modular • 524,288 pixels per module • 2.2 kHz * 524,288 pixels * 16 bit = 2.3 GB/s - 2 x 10 Gbit/s links • 4 Mpixel detector (2020) - 16 x 10 Gbit/s • 10 Mpixel (2022) - 40 x 10 Gbit/s Page 14 Modular detector 4 Mpixel (2020) 10 Mpixel (2022)
  • 15. Page 15 MX detector data rates double every 2 years 0.1 1 10 100 2006 2008 2010 2012 2014 2016 2018 2020 2022 2024 Frame rate [GB/s] Year 2007 PSI PILATUS 6 Mpixel 12.5 Hz 0.2 GB/s 2014 Dectris EIGER 16 Mpixel 133 Hz 3.4 GB/s 2019 Dectris EIGER 2 XE 16 Mpixel 400 Hz 13.5 GB/s 2020 PSI JUNGFRAU 4 Mpixel 2200 Hz 18.4 GB/s 2022 PSI JUNGFRAU 10 Mpixel 2200 Hz 46.1 GB/s
  • 16. • Detector is streaming frames over UDP - Receiver using Linux Datagram Socket • Conversion of pixel read-out - CPU SIMD code • Compression - CPU compression First approach: scale conventional architecture Page 16
  • 17. • Detector is streaming frames over UDP - Receiver using Linux Datagram Socket • Conversion of pixel read-out - CPU SIMD code • Compression - CPU compression First approach: scale conventional architecture Page 17 Aim 20 GB/s Reached 5 GB/s
  • 18. WIR SCHAFFEN WISSEN – HEUTE FÜR MORGEN POWER / OpenCAPI / FPGA architecture Page 18
  • 19. • Real-time performance - FPGA design is cycle-accurate, with fixed latency and throughput • Large memory throughput - FPGAs with HBM2 have 460 GB/s bandwidth to 8 GB large memory • Ethernet on-board - FPGA are made to work with network, often having dedicated “hard” cores for ethernet • Development of FPGAs is difficult and time consuming - Hardware description languages - PCI Express • Virtex Ultrascale+ HBM (XCVU33P and XCVU35P) - Availble as low-profile half-length 75W cards FPGA are perfect devices for data acquisition Page 19
  • 20. • C/C++ compiler to produce hardware design language (Verilog or VHDL) • All code is valid C++ code, it can be executed on CPU and functionally is generally equivalent • Dedicated pragma to guide FPGA synthesis • It is generally understandable for software developers, but may contain strange/inoptimal constructs from software point of view High-level synthesis Page 20 Bitshuffle for 16-bit numbers
  • 21. • For VU33/35P: - Size: 8 GB - Bandwidth: up to 460 GB/s - Latency: up to 120 cycles @ 200 MHz • Complex architecture - 32 x 256-bit AXI3 interfaces - Either operating as 32 separate memories - Or as single memory with crossbar (at the cost of up to 50% throughput) • 256-bit is a problem, as data are 512-bit (PCIe Gen3 x16) or 1024-bit (OpenCAPI, PCIe Gen4 x16) • Simulation only with special tools (Cadence Xcelium), impossible with Xilinx tools High-bandwidth memory Page 21
  • 22. • PCI Express is CPU-centric bus, as it is design to support peripherals • This is good model, when FPGA is a coprocessor to CPU – which sends data, and waits for reply => but for data acquisition, it is FPGA that is producing the data, CPU has no prior knowledge which packet will be processed at the time • DMA is operating on physical addresses: virtual addresses need to be pinned by kernel (so are not swapped and moved) Þ need to maintain own driver Þ address translation cache possible on FPGA, but requires memory PCI Express DMA Page 22 Xilinx QDMA is a robust but highly complex solution for PCI Express – used to interface FPGAs with x86 AMD and Intel CPUs
  • 23. • IBM POWER9 showed great numbers for I/O and memory throughput in the Summit and Sierra supercomputers • IBM designed its own memory-coherent interface for accelerators (CAPI/OpenCAPI), which has advantages over PCIe POWER architecture Page 23 Source: Wikipedia
  • 25. OpenCAPI Page 25 (photo labels: FPGA board, POWER9 CPU, OpenCAPI cable) • Predecessor CAPI => proprietary IBM • Communication over PCIe physical lines (but different protocol) • OpenCAPI => consortium model • Dedicated cabling (8 x 25 Gbit/s lanes) • For POWER10 this will be the default memory interface (allowing any type of memory to be attached to the CPU, and memory to be “shared” over the network)
  • 26. • A difference similar to what 80286/80386 virtual mode brought to software development • In OpenCAPI one needs a single kernel operation => Attach accelerator to running process • Then the accelerator has access to the virtual address space of the running process, and it is the FPGA that initiates the communication => Address translation is handled by TLB and OS => FPGA sees memory in a fully cache-coherent way • All security/reliability/efficiency mechanisms in the CPU and kernel are also present in OpenCAPI Page 26 What difference does OpenCAPI bring? Source: Wikipedia
  • 27. • Main function of the action takes a pointer to the virtual address space - On the device the pointer is synthesized as a 1024-bit master memory-mapped AXI interface - On the CPU this pointer just has to be set to zero (the first address of the virtual address space) • Any cell in virtual memory is accessed simply as an offset from this pointer • The only requirement is that memory is aligned to 128 bytes - No special memory allocator, malloc or mmap is fine - No pinning/registering • The same memory buffer class for both simulation and working with the device • For configuration there is also a 4 MiB memory-mapped I/O space (like BAR in PCIe) - On the device implemented as slave AXI-lite (32-bit) How to develop with OpenCAPI? Page 27
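The addressing rule above (base pointer at virtual address zero, every buffer reached as an offset, 128-byte alignment) can be sketched on the host side as follows. The helper names `alloc_aligned` and `word_offset` are hypothetical and are not part of OC-Accel. `std::aligned_alloc` (C++17) is used here only because plain `malloc` does not by itself guarantee 128-byte alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

constexpr std::size_t kAlign = 128;  // bytes, i.e. one 1024-bit AXI word

// Hypothetical helper: allocate a host buffer that satisfies the alignment
// requirement. The size is rounded up because std::aligned_alloc expects a
// multiple of the alignment.
void* alloc_aligned(std::size_t bytes) {
    std::size_t rounded = (bytes + kAlign - 1) / kAlign * kAlign;
    return std::aligned_alloc(kAlign, rounded);
}

// Offset of a buffer from virtual address zero, counted in 1024-bit words.
// This is the value a host program could hand to the accelerator, which
// then indexes its base pointer with it.
uint64_t word_offset(const void* p) {
    auto addr = reinterpret_cast<std::uintptr_t>(p);
    assert(addr % kAlign == 0 && "buffer must be 128-byte aligned");
    return static_cast<uint64_t>(addr / kAlign);
}
```

A host program would allocate with `alloc_aligned`, pass `word_offset(buffer)` through the memory-mapped configuration space, and free the buffer with `std::free` when done. No pinning or registration step appears anywhere in this sequence.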
  • 28. • Open source “shell” maintained by IBM • http://github.com/OpenCAPI/oc-accel • Provides ready-made tools to work with OpenCAPI (from transceiver setup to AXImm bridge) • Provides preconfigured interfaces for I/O peripherals (HBM, 100G, NVMe) • Provides simulation environment - One can simulate both SW and HW in a single simulation (neither the user FPGA design nor the software is modified from its “real” implementation) OC-Accel Page 28
  • 29. Jungfraujoch – FPGA implementation Page 29
  • 30. Page 30 Jungfraujoch server • FPGA board with OpenCAPI interface - Data acquisition - Initial data analysis - Pre-compression (2.5 Mpixel/board for JF) • Up to 50 GB/s acquisition and data analysis in a single 2U IBM POWER9 server with 1-4 FPGA boards • Pipeline stages (diagram): Ethernet UDP/IP, Dark current, Conversion, Frame summation, Strong pixel finder, Bitshuffle, Memory writer
  • 31. Page 31 Jungfraujoch FPGA streaming design Modular design • Stream of data handled by successive cores working in parallel => throughput and latency of each core are determined by the hardware design • Extra stages can be added relatively simply, with the option to bypass cores • All cores are C++ functions, connected with AXI-Stream FIFOs • As buffering is expensive on an FPGA, the design is best suited for algorithms that have limited dependencies between frames
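The idea of cores as C++ functions joined by FIFOs can be sketched as below. This is an illustration only: `std::queue` stands in for an AXI-Stream FIFO (an HLS stream type in a real design), and the stage names `subtract_offset`, `clamp_negative` and `bypass` are hypothetical, not Jungfraujoch cores.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

using Stream = std::queue<int32_t>;  // stand-in for an AXI-Stream FIFO

// Stage 1: subtract a constant from every sample.
void subtract_offset(Stream& in, Stream& out, int32_t offset) {
    while (!in.empty()) {
        out.push(in.front() - offset);
        in.pop();
    }
}

// Stage 2: replace negative samples with zero.
void clamp_negative(Stream& in, Stream& out) {
    while (!in.empty()) {
        out.push(in.front() < 0 ? 0 : in.front());
        in.pop();
    }
}

// Bypass: forward samples unchanged, so a stage can be switched off
// without changing how its neighbours are connected.
void bypass(Stream& in, Stream& out) {
    while (!in.empty()) {
        out.push(in.front());
        in.pop();
    }
}

// Top level: stages are chained through intermediate FIFOs.
void pipeline(Stream& in, Stream& out, int32_t offset, bool clamp_enabled) {
    Stream between;
    subtract_offset(in, between, offset);
    if (clamp_enabled) {
        clamp_negative(between, out);
    } else {
        bypass(between, out);
    }
}
```

On a CPU the stages run one after another. Synthesized for an FPGA, each function becomes its own block of logic and all of them process different samples at the same time.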
  • 32. Page 32 Jungfraujoch Ethernet UDP/IP core Processes Ethernet packets from the network, ignores unnecessary packets, and reads the frame header to get the frame number, module number, etc.
  • 33. Page 33 Jungfraujoch Dark current core This core calculates a moving average of detector frames. The calculated value is used as the dark current (pedestal) for subsequent frames.
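One hardware-friendly way to keep a moving average per pixel is an exponential moving average with a power-of-two window, which needs only an add, a subtract and a shift. The sketch below assumes this variant and a window of 128 frames. The deck does not say which kind of moving average or which window the actual core uses, and `PedestalTracker` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kShift = 7;  // assumed window of 2^7 = 128 frames

// Per-pixel pedestal estimate held in fixed point (value scaled by 2^kShift).
struct PedestalTracker {
    int32_t acc;

    explicit PedestalTracker(uint16_t first_value)
        : acc(static_cast<int32_t>(first_value) << kShift) {}

    // Move the accumulator 1/128 of the way towards the new sample.
    void update(uint16_t value) {
        acc += static_cast<int32_t>(value) - (acc >> kShift);
        assert(acc >= 0);
    }

    // Current pedestal estimate in detector units.
    uint16_t pedestal() const { return static_cast<uint16_t>(acc >> kShift); }
};
```

Starting from 1000, a single sample of 1128 moves the estimate to 1001. A long run of identical samples converges to that value exactly, with no division or per-pixel frame history needed.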
  • 34. Page 34 Jungfraujoch Conversion core This core translates the JUNGFRAU read-out into units of energy or photon counts. It benefits from the very fast HBM2 memory within the FPGA (460 GB/s). Data leaving this core can be used directly by data analysis software.
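A minimal sketch of such a conversion, under stated assumptions: each raw 16-bit value is taken to carry the gain level in its top two bits and a 14-bit ADC value below, and every pixel has one pedestal and one gain factor per gain level (these per-pixel constants are what would sit in HBM). The struct layout, function names and the mapping of gain bits to levels are illustrative, not taken from the Jungfraujoch source.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical per-pixel calibration: one pedestal and gain per gain level.
struct PixelCalibration {
    float pedestal[3];  // ADC units
    float gain[3];      // ADC units per keV
};

// Assumed encoding: top two bits 0 -> level 0, 1 -> level 1, else level 2.
int gain_level(uint16_t raw) {
    int bits = raw >> 14;
    return bits == 0 ? 0 : (bits == 1 ? 1 : 2);
}

// Energy deposited in the pixel, in keV.
float to_energy_keV(uint16_t raw, const PixelCalibration& c) {
    int level = gain_level(raw);
    float adc = static_cast<float>(raw & 0x3FFF);
    assert(c.gain[level] != 0.0f);
    return (adc - c.pedestal[level]) / c.gain[level];
}

// Photon count for a known photon energy, rounded to the nearest integer.
long to_photons(float energy_keV, float photon_energy_keV) {
    return std::lround(energy_keV / photon_energy_keV);
}
```

Every pixel needs its own six constants on every frame. That per-pixel lookup, repeated at the detector's frame rate, is the traffic the slide assigns to the HBM2 memory.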
  • 35. Page 35 Jungfraujoch Frame summation core (work in progress) Because data leaving the conversion (gain correction) core are on a linear scale, frames can be summed to reduce the downstream data rate when a frame rate lower than the detector's is sufficient.
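Summation itself is a per-pixel add. The sketch below assumes converted frames are stored as signed 16-bit values and accumulates into 32 bits so that sums cannot overflow. The data types and the function name `sum_frames` are assumptions, since the core is described as work in progress.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum a group of converted frames pixel by pixel. Summing N frames lowers
// the output frame rate by a factor of N.
std::vector<int32_t> sum_frames(const std::vector<std::vector<int16_t>>& frames) {
    assert(!frames.empty());
    std::vector<int32_t> sum(frames.front().size(), 0);
    for (const auto& frame : frames) {
        assert(frame.size() == sum.size());
        for (std::size_t i = 0; i < frame.size(); ++i) {
            sum[i] += frame[i];  // widened to 32 bits before adding
        }
    }
    return sum;
}
```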
  • 36. Page 36 Jungfraujoch Strong pixel finder core This is the first step of a spot finding algorithm (for example COLSPOT). It identifies pixels that exceed the mean of their neighborhood by more than a given number of standard deviations.
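The criterion "pixel exceeds the local mean by more than k standard deviations" can be evaluated with integer arithmetic only, which suits an FPGA. With N pixels in the window, sum S and sum of squares S2, the test v - mean > k·sigma is equivalent to N·v - S > 0 and (N·v - S)² > k²·(N·S2 - S²). The sketch below is an assumption-laden illustration: a square window clipped at the image edges that includes the tested pixel itself, and a hypothetical function name. It is not the COLSPOT or Jungfraujoch implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if pixel (x, y) is more than k standard deviations above the
// mean of its (2*radius+1)^2 neighborhood. No square root or division needed.
bool is_strong_pixel(const std::vector<int32_t>& img, int width, int height,
                     int x, int y, int radius, int k) {
    assert(static_cast<std::size_t>(width) * height == img.size());
    int64_t n = 0, s = 0, s2 = 0;
    for (int yy = y - radius; yy <= y + radius; ++yy) {
        for (int xx = x - radius; xx <= x + radius; ++xx) {
            if (xx < 0 || yy < 0 || xx >= width || yy >= height) continue;
            int64_t p = img[static_cast<std::size_t>(yy) * width + xx];
            n += 1;
            s += p;
            s2 += p * p;
        }
    }
    int64_t v = img[static_cast<std::size_t>(y) * width + x];
    int64_t diff = n * v - s;             // N * (v - mean)
    if (diff <= 0) return false;          // not above the local mean
    int64_t var_scaled = n * s2 - s * s;  // N^2 * variance
    return diff * diff > static_cast<int64_t>(k) * k * var_scaled;
}
```

For a 5 x 5 image filled with 10 and a single central pixel of 100, the center passes at k = 3 and fails at k = 5, because the bright pixel itself inflates the window's variance.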
  • 37. Page 37 Jungfraujoch Bitshuffle FPGAs are bit-order agnostic, so the bit reordering performed by this popular compression pre-filter is practically free on an FPGA.
  • 38. Page 38 Jungfraujoch Host memory write The address in the host memory buffer is calculated and the data are forwarded to host memory via OpenCAPI. Additional image statistics are saved as well.
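An address calculation of this kind might look like the sketch below. Everything about the layout is an assumption made for illustration: a ring buffer of whole frames, modules stored back to back within a frame, and a module of 1024 x 512 pixels at 2 bytes each. The real buffer layout is not described in the deck.

```cpp
#include <cassert>
#include <cstdint>

// Assumed module size: 1024 x 512 pixels, 16 bits per pixel.
constexpr uint64_t kModuleBytes = 1024ull * 512ull * 2ull;

// Byte offset of one module's data inside a hypothetical host ring buffer.
uint64_t module_offset_bytes(uint64_t frame, uint64_t module,
                             uint64_t modules_per_frame,
                             uint64_t frames_in_buffer) {
    assert(frames_in_buffer > 0);
    assert(module < modules_per_frame);
    uint64_t slot = frame % frames_in_buffer;  // wrap around the ring buffer
    return (slot * modules_per_frame + module) * kModuleBytes;
}
```

Because the assumed module size is a multiple of 128 bytes, every offset produced this way keeps the 128-byte alignment that OpenCAPI transfers require.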
  • 39. Page 39 Jungfraujoch implementation on VU33P FPGA (figure labels: Spot finding, HBM, Gain, Pedestal, Write data, OpenCAPI, 100G, UDP)
  • 40. Jungfraujoch FPGA power usage is 18 W/board for the whole streaming functionality Page 40 Xilinx Vivado Power Report 2 boards for 4 Mpixel JUNGFRAU and 4 boards for 10 Mpixel JUNGFRAU
  • 41. • VU33P or VU35P with 8 GB of HBM2 • OpenCAPI link and PCIe Gen3 x16 (or two PCIe Gen4 x8) • Small flash (2 kb) to store MAC address, board IR • QSFP-DD optical socket (same as QSFP28, but with 8 lanes for 2x100G) => compatible with QSFP28 transceivers • Up to 75W Alpha Data 9H3 board Page 41
  • 42. • Software tests – Catch2 - 8 min - Include, among other software tests, 13 FPGA action tests (whole SLS code) - Automated tests cover 95% of the lines of high-level synthesis code - Cover most functional correctness, including address calculation - Main limitation is debugging the parallel behavior of FIFOs (deadlocks, etc.) • Hardware simulation – Cadence Xcelium - 4 hours - Collection of 8 frames from a single module - Checks whether the hardware description is correct; can find problems with synchronization and other, very rare, issues - Too slow to verify functionality OpenCAPI programming - testing Page 42
  • 43. • The detector and data acquisition system were sent in November for an experiment at Photon Factory, KEK • More than 2,000 datasets collected for protein targets, a few real-life native-SAD structures solved • Due to the pandemic, detector support and development (including deployment of new FPGA designs) were done fully remotely from Switzerland Commissioning at KEK (Jan – May 2021) Page 43 BL-1A Photon Factory JUNGFRAU detector (top) tested in helium chamber for native-SAD measurements with 3.75 keV X-rays
  • 44. Page 44 Structure of Nucleocapsid Phosphoprotein from SARS-CoV-2 solved in 1 second • The crystal was previously measured with the conventional setup at our beamline, with the measurement taking longer than one minute • With the JUNGFRAU detector and OpenCAPI readout, 2000 images collected in one second were enough to solve the structure of this protein • Experimental team: Filip Leonarski, Sylvain Engilberge, Vincent Olieric, Meitian Wang (MX Group), Aldo Mozzanica (PSI Detector Group) • SARS-CoV-2 protein was produced by Zinzula, L., Basquin, J., Bracher, A., Baumeister, W. (MPI, Martinsried)
  • 45. Possible gain from using FPGA based system Page 45 Courtesy: B. Mesnet (IBM)
  • 46. Possible gain from using FPGA based system Page 46 Courtesy: B. Mesnet (IBM)
  • 47. MX Group (PSI) • Vincent Olieric • Takashi Tomizaki • Chia-Ying Huang • Sylvain Engilberge • Justyna Wojdyła • Meitian Wang Detector Group (PSI) • Aldo Mozzanica • Martin Brückner • Carlos Lopez-Cuenca • Bernd Schmitt Science IT (PSI) • Leonardo Sala Controls (PSI) • Andrej Babic • Leonardo Hax-Damiani SLS management (PSI) • Oliver Bunk Photon Factory, KEK • Naohiro Matsugaki • Yusuke Yamada • Masahide Hikita MAX IV • Jie Nan • Zdenek Matej Uni Konstanz • Kay Diederichs LBL • Aaron Brewster DLS • Graeme Winter • DIALS Team ESRF • Jerome Kieffer IBM Systems (France) • Alexandre Castellane • Bruno Mesnet InnoBoost SA • Lionel Clavien Acknowledgements Page 47