Fabric Monitor, Accounting, Storage and Reports experience at the INFN Tier1. Felice Rosso on behalf of INFN Tier1, [email_address]. Workshop sul calcolo e reti INFN, Otranto, 8-6-2006
Outline CNAF-INFN Tier1 FARM and GRID Monitoring Local Queues Monitoring Local and GRID accounting Storage Monitoring and accounting Summary
Introduction Location: INFN-CNAF, Bologna (Italy), one of the main nodes of the GARR network. Computing facility for the INFN HENP community. Participating in the LCG, EGEE and INFNGRID projects. Multi-experiment TIER1: LHC experiments (Alice, Atlas, CMS, LHCb), CDF, BABAR, VIRGO, MAGIC, ARGO, Bio, TheoPhys, Pamela ... Resources are assigned to experiments according to a yearly plan.
The Farm in a Nutshell - SLC 3.0.6, LCG 2.7, LSF 6.1 - ~720 WNs in the LSF pool (~1580 KSI2K). Common LSF pool: 1 job per logical CPU (slot), at most 1 process running at the same time per job. GRID and local submission allowed: GRID and non-GRID jobs can run on the same WN, and GRID and non-GRID jobs can be submitted to the same queue. For each VO/experiment there are one or more queues. Since 24 April 2005, 2,700,000 jobs have been executed on our LSF pool (~1,600,000 of them GRID). 3 CEs (main CE: 4 dual-core Opterons, 24 GB RAM) + 1 gLite CE
Access to Batch system [diagram: "legacy" non-Grid access goes from UIs with an LSF client straight to LSF; Grid access goes from UIs through the Grid to the CE and then to LSF; LSF dispatches to WN1 ... WNn, with an SE alongside]
Farm Monitoring Goals Scalability to Tier1 full size. Many parameters for each WN/server. Database and plots on web pages. Data analysis. Problems reported on web page(s). Share data with GRID tools. RedEye: the INFN-T1 monitoring tool. RedEye runs as a simple local user, no root needed!
Tier1 Fabric Monitoring What do we get? CPU load, status and jiffies. Ethernet I/O (MRTG, by the network team). Temperatures and fan RPM (IPMI). Total and type of active TCP connections. Processes created, running, zombie, etc. RAM and swap memory. Users logged in. SLC3 and SLC4 compatible
Tier1 Fabric Monitor
Local WN Monitoring On each WN, every 5 min (local crontab) the info is saved locally (<3 KBytes --> 2-3 TCP packets). 1 minute later a collector "gets" the info via socket; "gets" means a tidy parallel fork with timeout control. Getting and saving the data from 750 WNs takes ~6 sec in the best case, 20 sec in the worst case (timeout cut-off). Update the database (last day, week, month). For each WN --> 1 file (possibility of cumulative plots). Analysis of the monitoring data. Local thumbnail cache creation (web clickable). http://collector.cnaf.infn.it/davide/rack.php http://collector.cnaf.infn.it/davide/analyzer.html
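The collection step can be sketched roughly as follows. This is a minimal Python illustration, not the actual RedEye code: the host names, port and payload handling are assumptions, and RedEye itself uses a parallel fork rather than threads.

```python
# Hypothetical sketch of the collector's parallel "get" step: contact every WN in
# parallel and drop any node that does not answer within the timeout window.
import socket
from concurrent.futures import ThreadPoolExecutor

WN_PORT = 9999          # assumed port where each WN serves its 5-minute snapshot
TIMEOUT = 20            # seconds: the worst-case cut-off quoted above

def fetch_snapshot(host):
    """Read the <3 KB monitoring snapshot from one worker node."""
    try:
        with socket.create_connection((host, WN_PORT), timeout=TIMEOUT) as s:
            s.settimeout(TIMEOUT)
            chunks = []
            while True:
                data = s.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return host, b"".join(chunks)
    except OSError:
        return host, None   # node skipped this round (the "timeout knife")

def collect(hosts):
    with ThreadPoolExecutor(max_workers=100) as pool:
        results = dict(pool.map(fetch_snapshot, hosts))
    return {h: d for h, d in results.items() if d is not None}

if __name__ == "__main__":
    hosts = [f"wn{i:03d}.example.local" for i in range(1, 751)]   # placeholder names
    snapshots = collect(hosts)
    print(f"collected {len(snapshots)}/{len(hosts)} snapshots")
```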
Web Snapshot CPU-RAM
Web Snapshot TCP connections
Web Snapshot users logged
Analyzer.html
Fabric → GRID Monitoring Effort on exporting relevant fabric metrics to the Grid level, e.g.: # of active WNs, # of free slots, etc. GridICE integration. Configuration based on Quattor. Avoid duplication of sensors on the farm
Local Queues Monitoring Every 5 minutes the queue status is saved on the batch manager (snapshot). A collector gets the info and updates the local database (same logic as the farm monitoring). Daily / Weekly / Monthly / Yearly DB. DB: total and single queues, 3 classes of users for each queue. Plot generator: Gnuplot 4.0. http://tier1.cnaf.infn.it/monitor/LSF/
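A rough sketch of the 5-minute snapshot, under the assumption that the queue status is read from the standard LSF bqueues command and appended to a flat-file database; the column positions and file path below are assumptions, not the actual implementation.

```python
# Append one timestamped record per queue to a flat-file DB every 5 minutes (cron).
import subprocess
import time

DB_FILE = "/var/lib/lsfmon/queues.dat"   # placeholder path

def snapshot_queues():
    out = subprocess.run(["bqueues"], capture_output=True, text=True, check=True).stdout
    now = int(time.time())
    rows = []
    for line in out.splitlines()[1:]:             # skip the header line
        f = line.split()
        if len(f) < 11:                           # assumed default bqueues layout
            continue
        queue, njobs, pend, run, susp = f[0], f[7], f[8], f[9], f[10]
        rows.append(f"{now} {queue} {njobs} {pend} {run} {susp}")
    if rows:
        with open(DB_FILE, "a") as db:
            db.write("\n".join(rows) + "\n")

if __name__ == "__main__":
    snapshot_queues()
```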
Web Snapshot LSF Status
UGRID: general GRID user (lhcb001, lhcb030…) SGM: Software GRID Manager (lhcbsgm) OTHER: local user
UGRID: general GRID user (babar001, babar030…) SGM: Software GRID Manager (babarsgm) OTHER: local user
RedEye - LSF Monitoring Real-time slot usage. Fast, little CPU power needed, stable, works over WAN. RedEye runs as a simple user, not root. BUT... all slots have the same weight (future: Jeep solution) and jobs shorter than 5 minutes can be lost. SO: we need something that works for ALL jobs; we need to know who uses our FARM and how. Solution: offline parsing of the LSF log files once per day (Jeep integration)
Job-related metrics From the LSF log file we get the following non-GRID info: LSF JobID; local UID owning the job; all the relevant times (submission, WCT, etc.); max RSS and virtual memory usage; from which computer (hostname) the job was submitted (GRID CE / locally); where the job was executed (WN hostname). We complete this set with KSI2K & GRID info (Jeep). DGAS interface http://www.to.infn.it/grid/accounting/main.html http://tier1.cnaf.infn.it/monitor/LSF/plots/acct/
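As an illustration of the daily offline pass, the sketch below walks the LSF accounting file and extracts a few of the fields listed above. The field positions are hypothetical placeholders: real lsb.acct JOB_FINISH records are quoted, space-separated lines whose layout depends on the LSF version, so the indices must be checked against the cluster's own log format.

```python
# Hedged sketch: extract per-job metrics from LSF accounting records.
import shlex

ACCT_FILE = "/lsf/work/cluster/logdir/lsb.acct"   # placeholder path

def parse_acct(path):
    jobs = []
    with open(path) as fh:
        for line in fh:
            if not line.startswith('"JOB_FINISH"'):
                continue
            f = shlex.split(line)
            # HYPOTHETICAL indices, for illustration only:
            jobs.append({
                "job_id":    f[3],     # LSF JobID
                "user":      f[11],    # local user owning the job
                "submit_t":  f[7],     # submission time
                "end_t":     f[2],     # end time
                "exec_host": f[25],    # WN hostname where the job ran
            })
    return jobs

if __name__ == "__main__":
    print(len(parse_acct(ACCT_FILE)), "finished jobs parsed")
```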
Queues accounting report
Queues accounting report
Queues accounting report
Queues accounting report KSI2K [WCT] May 2006, All jobs
Queues accounting report CPUTime [hours] May 2006, GRID jobs
How do we use KSpecInt2K (KSI2K)? 1 slot -> 1 job http://tier1.cnaf.infn.it/monitor/LSF/plots/ksi/ For each job: (normalization formula shown on the slide; see the sketch below)
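The per-job normalization is presumably along the following lines: with one job per slot, a job is charged the per-slot KSI2K rating of its WN multiplied by the time it occupied the slot. The rating table and the choice of wall-clock time versus CPU time below are assumptions for illustration.

```python
# Hedged sketch of per-job KSI2K accounting (hypothetical per-slot ratings).
KSI2K_PER_SLOT = {"wn_opteron": 1.2, "wn_xeon": 0.9}   # placeholder WN classes

def job_ksi2k_hours(wn_class, wall_clock_seconds):
    """KSI2K-hours charged to one job occupying one slot for the given WCT."""
    return KSI2K_PER_SLOT[wn_class] * wall_clock_seconds / 3600.0

# Example: a 12-hour job on a hypothetical 1.2 KSI2K/slot node -> 14.4 KSI2K-hours
print(job_ksi2k_hours("wn_opteron", 12 * 3600))
```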
KSI2K T1-INFN Story
KSI2K T1-INFN Story
Job Check and Report lsb.acct had a big bug: randomly, CPU-user-time = 0.00 sec; the correct CPU time was recovered from bjobs -l <JOBID>. Fixed by Platform on 25 July 2005. CPUtime > WCT? --> possible spawned processes. RAM memory: is the job on the right WN? Is the WorkerNode a "black hole"? We have a daily report (web page)
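The daily checks could look something like the toy function below; it only illustrates the three conditions mentioned above (zero CPU time, CPU time above WCT, RSS above what the WN offers) and is not the actual report generator.

```python
# Illustrative per-job sanity checks for the daily report.
def check_job(job, wn_ram_kb, slack=1.05):
    alerts = []
    if job["cpu_time"] == 0.0:
        # the old lsb.acct bug: recover the real value with `bjobs -l <JOBID>`
        alerts.append(f"zero CPU time, re-check job {job['job_id']}")
    if job["wct"] > 0 and job["cpu_time"] > slack * job["wct"]:
        alerts.append("CPU time > WCT: possible spawned processes")
    if job["max_rss_kb"] > wn_ram_kb:
        alerts.append("max RSS above WN RAM: wrong WN or black-hole candidate")
    return alerts

# Example usage with made-up numbers:
print(check_job({"job_id": 42, "cpu_time": 95000.0, "wct": 86400, "max_rss_kb": 2_000_000},
                wn_ram_kb=4_000_000))
```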
Fabric and GRID monitoring Effort on exporting relevant queue and job metrics to the Grid level. Integration with GridICE. Integration with DGAS (done!). Grid (VO) level view of resource usage. Integration of local job information with Grid-related metrics, e.g.: DN of the user proxy, VOMS extensions of the user proxy, Grid Job ID
GRID ICE Dissemination: http://grid.infn.it/gridice GridICE server (development, with upcoming features): http://gridice3.cnaf.infn.it:50080/gridice GridICE server for the EGEE Grid: http://gridice2.cnaf.infn.it:50080/gridice GridICE server for INFN-Grid: http://gridice4.cnaf.infn.it:50080/gridice
GRID ICE For each site, check the GRID services (RB, BDII, CE, SE...). Service check --> does the PID exist? Summary and/or notification. From the GRID servers: summary of CPU and storage resources available per site and/or per VO; storage available on the SE per VO from the BDII; downtimes
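The "does the PID exist?" test can be illustrated with a toy check like the one below; the pidfile locations are placeholders and the real GridICE sensors work differently in detail.

```python
# Toy liveness check: read a service pidfile and probe the process with signal 0.
import os

PIDFILES = {
    "bdii":    "/var/run/bdii.pid",              # placeholder paths
    "gridftp": "/var/run/gridftp.pid",
}

def service_alive(pidfile):
    try:
        pid = int(open(pidfile).read().split()[0])
    except (OSError, ValueError, IndexError):
        return False
    try:
        os.kill(pid, 0)                          # signal 0: existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True                              # process exists under another user
    return True

for name, pf in PIDFILES.items():
    print(name, "up" if service_alive(pf) else "DOWN")
```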
GRID ICE GridICE as fabric monitor for "small" sites. Based on LeMon (server and sensors). Parsing of the LeMon flat-file logs. Plots based on RRDtool. Legnaro: ~70 WorkerNodes
GridICE screenshots
Jeep General-purpose data collector (push technology). DB-WNINFO: historical hardware DB (MySQL on the HLR node). KSI2K used by each single job (DGAS). Job monitoring (check RAM usage in real time, efficiency history). FS-INFO: enough available space on the volumes? AutoFS: are all dynamic mount points working? Matching UID/GID --> VO
The Storage in a Nutshell Different hardware (NAS, SAN, tapes). More than 300 TB of disk, 130 TB of tape. Different access methods (NFS/RFIO/Xrootd/GridFTP). Volume filesystems: EXT3, XFS and GPFS. Volumes bigger than 2 TBytes: RAID 50 (EXT3/XFS), direct (GPFS). Tape access: CASTOR (50 TB of disk as staging area). Volume management via a PostgreSQL DB. 60 servers export the filesystems to the WNs
Storage at T1-INFN Hierarchical Nagios servers check the service status: gridftp, srm, rfio, castor, ssh. A local tool sums the space used by the VOs; RRD for the plots (total and used volume space). Binary, proprietary (IBM/STEK) software checks some hardware status; it is very difficult to interface the proprietary software with the T1 framework. For now: only e-mail reports for bad blocks, disk failures and filesystem failures. Plots: intranet & on demand by the VO
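The "sum space used by VOs" step can be pictured as a walk over each VO's directories on the exported volumes, as in the sketch below; the /storage/<volume>/<vo> layout and the VO list are assumptions, not the real tool.

```python
# Hedged sketch: total the bytes owned by each VO on the exported volumes.
import os

VOLUMES = ["/storage/vol01", "/storage/vol02"]   # placeholder mount points
VOS = ["alice", "atlas", "cms", "lhcb", "babar", "cdf", "virgo"]

def vo_usage_bytes():
    usage = {vo: 0 for vo in VOS}
    for vol in VOLUMES:
        for vo in VOS:
            top = os.path.join(vol, vo)
            for root, _dirs, files in os.walk(top):
                for name in files:
                    try:
                        usage[vo] += os.lstat(os.path.join(root, name)).st_size
                    except OSError:
                        pass                     # file vanished or unreadable
    return usage

if __name__ == "__main__":
    for vo, used in sorted(vo_usage_bytes().items()):
        print(f"{vo:8s} {used / 1e12:8.2f} TB")
```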
Tape/Storage usage report
Summary Fabric-level monitoring with smart reports is needed to ease management. T1 already has a solution for the next 2 years! Not exportable due to lack of man-power (no support). Future at INFN? What is the T2s' man-power? LeMon & Oracle? What is the T2s' man-power? RedEye? What is the T2s' man-power? Real collaboration means more than mailing lists and phone conferences
