Calcul Québec - Université Laval
Building a Storage System for Genomics
HPCS 2014
Halifax, NS
Florent.Parent@calculquebec.ca
Frederick.Lefebvre@calculquebec.ca
Agenda
Genomics storage project background
Reviewing and optimizing the proposal
Network + politics issues
Writing the RFP
Lessons learned
Genomics storage project: background
CFI Leading Edge Fund (2012 competition)
“Human and Microbial Integrative Genomics”
Project lead: Dr. Jacques Simard, CRCHUQ (Université Laval)
16 researchers from Université Laval
Bioinformatics and Computational Infrastructure
Arnaud Droit
Large storage component
CRCHUQ
Genomics, proteomics and metabolomics
Data sources: HiSeq 2500, 2000 and MiSeq
Applications: RAY, genomics pipeline …
Some researchers are already active HPC users
Jacques Corbeil, Sébastien Boisvert (RAY), Arnaud Droit, Yohan Bossé
Site specifications: Physical
Number of racks in silo: 56 max
Floor loading capacity: 940 lb/ft²
Site specifications: Power
1 MW generator
Campus Data Center
CII: Centre des Infrastructure Informationelles
Silo:
1.1 MW available (~33% used)
72 kW UPS (+ generator)
25 kV hydro line
2 MVA transformer
Site specifications: Cooling
Rack cooling: 100% air
No CRAC units! Using the campus-wide chilled-water loop for cooling
Cooling capacity: 1.5 MW
Residual heat transferred to the campus hot-water loop
Partial free air cooling (up to 300 kW)
Figure labels: cooling coils, air blowers, free air cooling
Site specifications: Networking
Fibre optic network to Québec hospital research networks
Timeline
2012
Feb: first meeting
Nov: firewall (FW) issue identified
2013
Jan 15: MSSS meeting
July 9: MSSS derogation
Oct 3: RFP published
Nov 20: RFP winner announced
2014
Jan 6: physical installation
Jan 22: acceptance testing starts
Feb 26: all tests pass, system accepted
Year not recoverable from the extracted slide: March (finalize budget), April (CFI conditions met)
Chart series: number of meetings with vendors/manufacturers
Researcher LEF proposal
Initial contact: Researcher->VPR->CC staff
Initial meetings: Review proposal, discussions
Researchers planned to install storage at CRCHUQ facilities
Based on quote from local supplier
Review and optimize
Discuss possible optimization in proposal
Scheduled meetings with HPC storage suppliers
CFI LEF
Discussed option to host storage at CC site
Install some storage and compute at CRCHUQ
Bulk storage at UL/CQ/CC site
High speed connectivity already in place (10G UL-CRCHUQ)
Sounds simple, right? …
Concerns raised
Ease of access to CC hosted storage
Security
Refining the proposal
Evaluate the benefits of hosting at CQ/CC
Power, cooling infrastructure already in place
O&M handled by CQ staff.
Collaborate with CRCHUQ sysadmin
Initial budget planned for room renovations and extra A/C: that money is saved for more infrastructure
MOU
MOU sent to CFI (Jan 2013)
CRCHUQ and CQ/CC staff work on RFP and acquisition process
CQ/CC staff manage storage
Storage is for the exclusive use of CRCHUQ
Local storage for CRCHUQ, parallel FS at CQ/UL
Archival (tapes) will use the existing system available at a remote CQ/CC site
Genomics Storage Components
Interconnect
Université Laval owns a fibre-optic MAN
It interconnects all QC hospital research networks
Network testing
Figure labels: test node, test node (VM), 4 Gbps, 141 Mbps
“IS-QC” firewall limits flows to 1.2 Gbps
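To put the rates on the network testing slide in perspective, the sketch below estimates how long one week of sequencer output would take to move at each of them. The 10 TB weekly volume is the full-capacity figure quoted elsewhere in this deck; decimal units (1 TB = 10^12 bytes), zero protocol overhead, and the function name are assumptions made for illustration.

```python
def transfer_hours(volume_tb: float, rate_gbps: float) -> float:
    """Hours needed to move volume_tb terabytes at rate_gbps gigabits per second."""
    bits = volume_tb * 1e12 * 8          # decimal terabytes to bits
    seconds = bits / (rate_gbps * 1e9)   # assumes no protocol overhead
    return seconds / 3600


WEEKLY_VOLUME_TB = 10  # full-capacity sequencer output per week, from the deck

for label, rate_gbps in [("141 Mbps", 0.141), ("1.2 Gbps", 1.2), ("4 Gbps", 4.0)]:
    hours = transfer_hours(WEEKLY_VOLUME_TB, rate_gbps)
    print(f"{label:>8}: {hours:6.1f} h ({hours / 24:.1f} days)")
```

Under these assumptions, 141 Mbps needs about 6.6 days to move one week of data, so a path held at that rate would barely keep up with the sequencers.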
“Pile of firewalls”
CRCHUQ already manages its own security firewall at its periphery
IS-QC, under MSSS authority, acts as a “safety valve”
Work with CRCHUQ to request a derogation to remove IS-QC
MSSS Derogation
Document and prepare meeting in Dec 2012
Jan 2013: meeting with MSSS security staff
Jan 2013: regional security coordinator refuses
Feb 2013: CRCHUQ director writes to the deputy minister of MSSS IT
July 2013: deputy minister (MSSS IT) visits UL/CQ
Derogation granted.
Network Measurements
perfSONAR for periodic network measurements
Archival
Use existing tape archives at CQ
Plenty of network bandwidth (for now…)
RFP
Building the RFP
An iterative process
Based on multiple meetings with researchers
+ Expertise and market knowledge of local HPC team
Figure labels: vendors, RFP, researcher, HPC team
Premise
2 storage systems with different requirements in very different environments
Parallel storage
Large and high-speed, in a modern datacenter with plenty of power and cooling
On-site storage
Smaller capacity with a slower interconnect, in an air-conditioned server room
Challenges
Budget is limited. We want to get the most out of it!
But the most of what?
Parallel storage capacity
Parallel storage write speed
On-site storage capacity
etc.
Challenges (cont.)
Most of the budget is to be allocated to the parallel storage
To enable computing and mid-term storage
On-site storage must be large enough. No more.
A quality-based RFP allows for such distinctions
How large is large enough?
The sequencing platform could generate 10 TB of data per week when operating at full capacity
40 TB would provide 1 month of buffering
Figure labels: sequencers, on-site storage (buffering), automated data synchronisation, parallel storage
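The sizing argument on this slide is simple arithmetic, sketched below. It shows how many weeks of output a given on-site capacity can buffer and what average network rate the automated synchronisation must sustain to keep up. Decimal units and the function names are assumptions for illustration.

```python
def buffer_weeks(capacity_tb: float, weekly_output_tb: float) -> float:
    """Weeks of sequencer output the on-site storage can hold before it fills."""
    return capacity_tb / weekly_output_tb


def drain_rate_mbps(weekly_output_tb: float) -> float:
    """Average megabits per second needed to sync one week of output within one week."""
    bits = weekly_output_tb * 1e12 * 8
    return bits / (7 * 24 * 3600) / 1e6


print(buffer_weeks(40, 10))        # 4.0 weeks, roughly one month
print(round(drain_rate_mbps(10)))  # 132
```

A sustained average of about 132 Mbps is the floor for keeping pace with the sequencers at full capacity; any slower and the buffer fills regardless of its size.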
Quality-based RFP
We chose to publish a quality-based RFP
In contrast to a lowest-bidder process
Evaluated on cost + « quality criteria »
Vendors are asked to spend at least 95% of the budget
Challenges (solution)
Define 2 independent sets of requirements
Use the « quality criteria » to let vendors know what they should prioritize
More weight will be given to the parallel storage components
First things first: hardware only or integrated solution?
A) Hardware only: write an RFP to buy X TB of raw disk space + Y servers and the accompanying interconnect. Integrate everything in-house to deploy a storage system.
B) Integrated solution: ask for a complete system that meets size and performance requirements.
Integrated Solution
Cumbersome question: Lustre, GPFS or anything else?
Should we ask for a specific parallel FS?
Some parallel FS are tied to a specific vendor or a very small set of vendors
Went with Lustre because it is a multi-vendor ecosystem
… and our team is already familiar with it
Fostering competition
The RFP can be so specific as to open the door only to a single product
Or it can let bidders come up with their own solution to our problem
Figure labels: specific product, surprise…
Fostering competition (cont.)
Vendors know when an RFP is targeted to them
They will price accordingly
Conversely, vendors will not bid if they do not feel they have a fair chance
Fewer bids will often mean a « higher price »
A less constrained RFP will generally attract more proposals
Fostering competition (cont.)
Example of being too specific:
« Storage units with 60 drives in RAID 6, 8+2 configuration »
Such a statement could apply to a single vendor, while limiting the available technologies
Spec'ing a storage system
Power & Cooling capacity
Physical space and room topology
Compatibility with existing infrastructure
Software
Physical
Physical infrastructure
Document floor/rack plan
Maximum weight per square foot?
How much space do we actually have?
Where does the system need to connect?
Both power and interconnect
Cable length
Power & Cooling
How much electrical capacity is available?
Total?
Per rack?
UPS?
Can our room cooling system handle that much additional power?
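These questions reduce to a headroom check. The sketch below uses figures quoted elsewhere in this deck (1.1 MW available in the silo with about 33% used, and a 20 kW ceiling for the new parallel storage); the function names are illustrative, and per-rack and UPS limits are left out because the deck does not give them.

```python
def headroom_kw(capacity_kw: float, used_fraction: float) -> float:
    """Capacity still free, given the fraction already in use."""
    return capacity_kw * (1 - used_fraction)


def fits(new_load_kw: float, capacity_kw: float, used_fraction: float) -> bool:
    """True if the new load fits within the remaining capacity."""
    return new_load_kw <= headroom_kw(capacity_kw, used_fraction)


SILO_POWER_KW = 1100      # 1.1 MW available, from the site specifications
USED_FRACTION = 0.33      # approximate usage, from the site specifications
NEW_SYSTEM_KW = 20        # maximum draw allowed for the parallel storage

print(round(headroom_kw(SILO_POWER_KW, USED_FRACTION)))    # 737
print(fits(NEW_SYSTEM_KW, SILO_POWER_KW, USED_FRACTION))   # True
```

The same check applies to cooling if one assumes that essentially all electrical power drawn ends up as heat to remove.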
Requirements for parallel storage
1 PB usable (or more) Lustre FS
Compatible with Lustre clients 1.8.9 and 2.4.x
20 GB/s aggregate read/write speed (or more)
Redundancy for drives and Lustre servers
« how » is purposely left unspecified
InfiniBand interconnect
2:1 blocking factor with computing resources
Requirements for parallel storage (cont.)
Vendor to provide all interconnect
Leaf IB switch, Ethernet switch for management, and cabling
Site provides uplink to core switches
20 kW maximum electrical consumption
Vendor to supply PDUs (switched)
Site to connect PDUs to existing electrical infrastructure
Requirements for on-site storage
Export a network filesystem
Compatible with sequencers, Windows 7, Linux and Mac
10G Ethernet interconnect
50 TB usable capacity (or more), with the option to grow up to 300 TB
Redundancy for drives and servers
Requirements for on-site storage (cont.)
Site to provide all cabling and interconnect for on-site storage
PDUs and rack space provided by the site
Measuring the quality of a proposal
Final evaluation is based on an « adjusted price » calculated from the bid price and the rating of the « quality criteria » given by the evaluation committee
The « adjusted price » can vary from the real price by up to 30%
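The deck does not give the adjustment formula. As a hedged illustration only, the sketch below assumes one plausible form: the bid price is divided by a quality coefficient that equals 1 at the 70% pass mark mentioned in the quality criteria and grows linearly with the score. The formula, the parameter `k`, and its value of 0.30 are all assumptions, not the committee's actual method.

```python
def adjusted_price(bid_price: float, quality_score: float, k: float = 0.30) -> float:
    """Hypothetical adjusted price: the bid divided by a quality coefficient.

    A score of 70 (the pass mark) leaves the price unchanged; higher scores
    lower the adjusted price. Both the formula and k are assumptions.
    """
    if not 70 <= quality_score <= 100:
        raise ValueError("this sketch only covers scores from 70 to 100")
    coefficient = 1 + k * (quality_score - 70) / 30
    return bid_price / coefficient


# Two made-up bids: B costs more but scores higher on quality.
bid_a = adjusted_price(1_000_000, 70)
bid_b = adjusted_price(1_100_000, 90)
print(round(bid_a), round(bid_b))  # 1000000 916667
```

Under these assumptions bid B wins despite its higher sticker price, which is the behaviour a quality-based RFP is meant to produce.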
Quality criteria
Parallel storage: 45%
On-site storage: 20%
Interconnect & networking: 10%
Vendor’s experience & reputation: 25%
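A short sketch of how such weights steer vendors: it combines per-category marks into one score using the four percentages above. Treating the overall score as a plain weighted average is an assumption, and the category keys and example marks are made up for illustration.

```python
WEIGHTS = {  # percent, from the quality criteria slide
    "parallel_storage": 45,
    "onsite_storage": 20,
    "interconnect": 10,
    "vendor": 25,
}


def overall_score(marks: dict[str, float]) -> float:
    """Weighted average of per-category marks, each on a 0-100 scale."""
    if marks.keys() != WEIGHTS.keys():
        raise ValueError("marks must cover exactly the four categories")
    return sum(WEIGHTS[c] * marks[c] for c in WEIGHTS) / sum(WEIGHTS.values())


baseline = {c: 70 for c in WEIGHTS}
print(overall_score(baseline))                               # 70.0
print(overall_score({**baseline, "parallel_storage": 90}))   # 79.0
print(overall_score({**baseline, "onsite_storage": 90}))     # 74.0
```

The same 20-point improvement is worth 9 points overall on the parallel storage but only 4 on the on-site storage, so a vendor with a fixed budget is nudged toward the former.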
Quality criteria (cont.)
In the first three categories, meeting the base requirements gives a passing score of 70%.
Any specs or meaningful features above the base requirements will improve the mark.
Quality criteria (cont.)
In the « vendor » category, the score is based on the bidder’s experience in deploying similar systems, with a requirement for at least 1 such system in the past 18 months.
Support structure and the résumé of the lead architect for the project are also factors.
Benchmarks & stability tests
Acceptance tests
We define stability tests to validate that the system can operate in a real production environment.
We run synthetic benchmarks to make sure the system hits the performance targets set by the vendor, as requested by the quality criteria.
Stability tests
To validate normal operation
Homogeneous firmware and software versions everywhere
No errors or warnings
Verify the system reboots cleanly
Lustre mounts properly
Simulate drive failures
Verify rebuild process
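The homogeneity check in the list above can be sketched as a comparison of each node's reported version against the most common one. How the versions are collected is left out; the node names and version strings below are made up, and the function name is illustrative.

```python
from collections import Counter


def version_outliers(versions: dict[str, str]) -> dict[str, str]:
    """Return the nodes whose reported version differs from the most common one."""
    if not versions:
        return {}
    majority, _ = Counter(versions.values()).most_common(1)[0]
    return {node: v for node, v in versions.items() if v != majority}


# Hypothetical inventory; node names and version strings are invented.
reported = {"oss01": "2.4.1", "oss02": "2.4.1", "oss03": "2.4.0", "mds01": "2.4.1"}
print(version_outliers(reported))  # {'oss03': '2.4.0'}
```

An empty result means every node agrees, which is the condition the stability test looks for.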
Benchmarks
We set some base rules
No custom tools. Re-use existing software
Let the vendor tune the tests for their system
But tests must be large enough to avoid cache effects
What to benchmark
Read/write speed of a single target: IOzone
Maximum aggregate read/write speed: IOR
Maximum metadata I/O operations per second (IOPS): mdtest
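One way to make "large enough to avoid cache effects" concrete is to size the test data as a multiple of the combined RAM of the machines involved. The factor of 2 below is a common rule of thumb, not something the deck states, and the node counts, RAM sizes, and function names are hypothetical.

```python
def min_test_size_gib(node_ram_gib: list[float], factor: float = 2.0) -> float:
    """Aggregate data size that exceeds the combined RAM by `factor`."""
    return factor * sum(node_ram_gib)


def per_task_gib(total_gib: float, tasks: int) -> float:
    """Data each benchmark task must write so the tasks together reach total_gib."""
    return total_gib / tasks


# Hypothetical setup: 12 servers and 32 clients with 64 GiB of RAM each.
ram = [64.0] * (12 + 32)
total = min_test_size_gib(ram)
print(total)                     # 5632.0
print(per_task_gib(total, 256))  # 22.0
```

The per-task figure is what would be handed to the benchmark as the amount each process writes; the exact tool options are left to the vendor's tuning.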
RFP results
Bids
We got 6 valid proposals
Parallel storage capacity varied by more than 60% across bids
Aggregate speed for parallel storage varied by almost 50%
On-site capacity varied by almost 100%
On-site storage ranged from a NAS on ZFS to full-fledged Lustre or GPFS systems
System selected
Parallel storage: Xyratex CS6000
1.4 PB usable Lustre FS
12 OSS and 4 targets per OSS
4 TB NL-SAS drives + SSDs for journals
30 GB/s maximum aggregate R/W speed
On-site storage: Xyratex CS1500
120 TB usable Lustre FS (scales to 7 PB)
4 CIFS/NFS exporters
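A quick back-of-the-envelope check of the parallel storage figures, sketched below: it derives the per-target capacity and per-server throughput implied by the slide and compares the delivered system with the RFP minimums (1 PB usable, 20 GB/s). Decimal units and an even spread of capacity and load across servers are assumptions.

```python
OSS_COUNT = 12            # object storage servers, from the slide
TARGETS_PER_OSS = 4       # storage targets per OSS, from the slide
USABLE_TB = 1400          # 1.4 PB usable, decimal units assumed
PEAK_GB_PER_S = 30        # maximum aggregate read/write speed
REQUIRED_TB = 1000        # RFP minimum: 1 PB usable
REQUIRED_GB_PER_S = 20    # RFP minimum aggregate speed

target_count = OSS_COUNT * TARGETS_PER_OSS
tb_per_target = USABLE_TB / target_count          # assumes an even split
gb_per_s_per_oss = PEAK_GB_PER_S / OSS_COUNT      # assumes an even load
capacity_margin = USABLE_TB / REQUIRED_TB - 1
speed_margin = PEAK_GB_PER_S / REQUIRED_GB_PER_S - 1

print(target_count)                # 48
print(round(tb_per_target, 1))     # 29.2
print(gb_per_s_per_oss)            # 2.5
print(f"{capacity_margin:.0%} more capacity, {speed_margin:.0%} more speed than required")
```

Both margins sit above the base requirements, which is the kind of surplus the quality criteria were designed to reward.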
Deployment
Operation
Both systems in production since early February
Parallel storage dedicated to the research group, mounted on compute resources
Data transfers are enabled by Globus endpoints on dedicated DTNs at both sites.
To do: review network topology for transfers
perfSONAR nodes to be deployed at the research center
Operation (cont.)
Researchers need a CC account to access the parallel storage
Access control and allocations are a challenge
Shared spreadsheet filled in by the research center to allocate space on the parallel FS for their users (cumbersome!)
Integration with the CCDB would leverage an existing system to manage storage allocations
Lessons learned
Lessons learned
Time-consuming (2-year project)
Mostly trust and relationship building
Time needed to write an RFP should not be underestimated
Benefit for the research group
Access to a team of specialists to lead their project
Major cost savings on the infrastructure. No investment to upgrade an existing server room (UPS, power, cooling, etc.)
Cost to integrate CS6000
Installation: $900 (rack enclosure)
Power: $1,457 (new outlets)
Cooling: $0
InfiniBand: used existing cables
6 CXP-to-QSFP cables (18 QDR links)
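The integration figures can be tallied in a few lines. The sketch below sums the costs listed on the slide and estimates the peak data rate of the 18 QDR links, using the standard 4x QDR figure of 32 Gbit/s of data per link (40 Gbit/s signalling with 8b/10b encoding). Treating all 18 links as available to storage traffic at once is an assumption.

```python
COSTS = {"installation": 900, "power": 1457, "cooling": 0}  # dollars, from the slide

CXP_CABLES = 6
QSFP_PER_CXP = 3      # each CXP end fans out to three QSFP ends: 6 x 3 = 18 links
QDR_DATA_GBIT = 32    # 4x QDR: 40 Gbit/s signalling, 32 Gbit/s after 8b/10b encoding

total_cost = sum(COSTS.values())
links = CXP_CABLES * QSFP_PER_CXP
peak_gbyte_per_s = links * QDR_DATA_GBIT / 8   # theoretical, all links in use

print(total_cost)         # 2357
print(links)              # 18
print(peak_gbyte_per_s)   # 72.0
```

Under that assumption the links offer roughly 72 GB/s in theory, comfortably above the 30 GB/s peak quoted for the parallel storage.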
Improving the process
Sharing RFPs between Compute Canada sites could ease the process for new projects
Common benchmarks across Compute Canada would help when designing acceptance tests
Applies to both storage and computing