Software as Infrastructure
      at NSF/OCI

             Daniel S. Katz
          Program Director,
  Office of Cyberinfrastructure (OCI)
         Division of Advanced
       Cyberinfrastructure (ACI)
Big Science and Infrastructure
                    •  Higgs* boson discovery announced at CERN July 4, 2012
                    •  Instrument: Large Hadron Collider (LHC)
                    •  Infrastructure
                           –  Computing Hardware: Worldwide LHC Computing Grid (WLCG):
                              235,000 cores across 36 countries, including Open Science Grid
                              (OSG, US), European Grid Infrastructure (EGI, Europe), ...
                           –  Data: ~20 PB of data created in 2011-2012
                           –  Software: grid middleware, physics analysis applications, ...
                           –  Networks
                           –  Education &
                              Training
                    •  Data generated
                       centrally, moved
                       (~3 PB/week)
                       across multi-tiered
                       infrastructure to be
                       computed upon

See: http://www.isgtw.org/feature/how-grid-computing-helped-cern-hunt-higgs
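To put the ~3 PB/week figure above in network terms, the sketch below converts it to a sustained average rate. It assumes decimal petabytes (10^15 bytes) and perfectly even traffic over the week, neither of which the slide states; the function name is made up for this illustration.

```python
# Assumption: 1 PB = 10**15 bytes (decimal); the slide does not say which unit it uses.
PB = 10**15
SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def sustained_rate_gbit_per_s(petabytes_per_week):
    """Average rate in Gbit/s needed to move the given volume in one week."""
    bytes_per_second = petabytes_per_week * PB / SECONDS_PER_WEEK
    return bytes_per_second * 8 / 10**9


rate = sustained_rate_gbit_per_s(3)
print(f"{rate:.1f} Gbit/s")  # prints "39.7 Gbit/s"
```

Real traffic is bursty, so peak link capacity would have to be higher than this average.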
Big Science and Infrastructure
•  Hurricanes affect humans
•  Multi-physics: atmosphere, ocean, coast, vegetation, soil
    –  Sensors and data as inputs
•  Humans: what have they built, where are they, what will they do
    –  Data and models as inputs
•  Infrastructure:
    –  Urgent/scheduled processing, workflows
    –  Software applications, workflows
    –  Networks
    –  Decision-support systems,
       visualization
    –  Data storage,
       interoperability
Long-tail Science and Infrastructure
[Figure: NSF grant size, 2007. (“Dark data in the long tail of science”, B. Heidorn)]
•  Exploding data volumes &
   powerful simulation methods
   mean that more researchers
   need advanced infrastructure
•  Such “long-tail” researchers
   cannot afford expensive
   expertise and unique
   infrastructure
•  Challenge: Outsource and/or
   automate time-consuming
   common processes
   –  Tools, e.g., Globus Online
      and data management
       •  Note: much LHC data is moved
          by Globus GridFTP, e.g., May/June
          2012, >20 PB, >20M files
   –  Gateways, e.g., nanoHUB,
      CIPRES, access to scientific
      simulation software
Long-tail Science and Infrastructure
                    •  CIPRES Science Gateway for Phylogenetics
                            –  Study of diversification of life and relationships among
                               living things through time
                    •  Highly used
                            –    Cited in at least 400 publications, e.g., Nature, PNAS, Cell
                            –    More than 5000 unique users in 3 years
                            –    Used routinely in at least 68 undergraduate classes
                            –    45% US (including most states), 55% from 70 other countries
                    •  Infrastructure
                            –  Flexible web application
                                  •  A science gateway, uses software and lessons from XSEDE
                                     gateways team, e.g., identity management, HPC job control
                            –  Science software: tree inference and sequence alignment
                                  •  Parallel versions of MrBayes, RAxML, GARLI, BEAST, MAFFT
                                  •  PAUP*, Poy, ClustalW, Contralign, FSA, MUSCLE, ...
                            –  Data
                                  •  Personal user space for storing
                                     results
                                  •  Tools to transfer and view data

Credit: Mark Miller, SDSC
Infrastructure Challenges
•  Science
   –  Larger teams, more disciplines, more countries
•  Data
   –  Size, complexity, rates all increasing rapidly
   –  Need for interoperability (systems and policies)
•  Systems
   –    More cores, more architectures (GPUs), more memory hierarchy
   –    Changing balances (latency vs bandwidth)
   –    Changing limits (power, funds)
   –    System architecture and business models changing (clouds)
   –    Network capacity growing; increased networking -> increased security needs
•  Software
   –  Multiphysics algorithms, frameworks
   –  Programming models and abstractions for science, data, and hardware
   –  V&V, reproducibility, fault tolerance
•  People
   –  Education and training
   –  Career paths
   –  Credit and attribution
Cyberinfrastructure (e-Research)
•  “Cyberinfrastructure consists of computing systems,
   data storage systems, advanced instruments and
   data repositories, visualization environments, and
   people, all linked together by software and high
   performance networks to improve research
   productivity and enable breakthroughs not otherwise
   possible.”
                                    -- Craig Stewart

•  Infrastructure elements:
    –  parts of an infrastructure,
    –  developed by individuals and groups,
    –  international,
    –  developed for a purpose,
    –  used by a community
Software is Infrastructure
•  Software essential for the bulk of science
    -  About half the papers in recent issues of
       Science were software-intensive projects
    -  Research becoming dependent upon
       advances in software
    -  Significant software development being
       conducted across NSF: NEON, OOI,
       NEES, NCN, iPlant, etc.
•  Wide range of software types: system,
   applications, modeling, gateways,
   analysis, algorithms, middleware, libraries
•  Development, production and
   maintenance are people intensive
•  Software life-times are long compared to
   hardware
•  Under-appreciated value
[Figure labels: Science, Software, Computing Infrastructure; Scientific Discovery, Technological Innovation, Software, Education]
Cyberinfrastructure Framework for 21st Century
Science and Engineering (CIF21)
•    Cross-NSF portfolio of activities to provide integrated cyber resources
     that will enable new multidisciplinary research opportunities in all
     science and engineering fields by leveraging ongoing investments and
     using common approaches and components (http://www.nsf.gov/cif21)

•    ACCI task force reports (http://www.nsf.gov/od/oci/taskforces/index.jsp)
      –  Campus Bridging, Cyberlearning & Workforce Development, Data
         & Visualization, Grand Challenges, HPC, Software for Science &
         Engineering
      –  Included recommendation for NSF-wide CDS&E program
•    Vision and Strategy Reports
      –  ACI - http://www.nsf.gov/publications/pub_summ.jsp?ods_key=nsf12051
      –  Software - http://www.nsf.gov/publications/pub_summ.jsp?ods_key=nsf12113
      –  Data - http://www.nsf.gov/od/oci/cif21/DataVision2012.pdf
•    Implementation
      –  Implementation of Software Vision
         http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=504817
Software Vision
      NSF will take a leadership role in providing
      software as enabling infrastructure for
      science and engineering research and
      education, and in promoting software as a
      principal component of its comprehensive
      CIF21 vision
   •  ...
   •  Reducing the complexity of software will be a
      unifying theme across the CIF21 vision,
      advancing both the use and development of
      new software and promoting the ubiquitous
      integration of scientific software across all
      disciplines, in education, and in industry
          –  A Vision and Strategy for Software for Science,
             Engineering, and Education – NSF 12-113
Infrastructure Role & Lifecycle
•  Create and maintain a software ecosystem providing new
   capabilities that advance and accelerate scientific inquiry at
   unprecedented complexity and scale
•  Enable transformative, interdisciplinary, collaborative science
   and engineering research and education through the use of
   advanced software and services
•  Support the foundational research necessary to continue to
   efficiently advance scientific software
•  Transform practice through new policies for software addressing
   challenges of academic culture, open dissemination and use,
   reproducibility and trust, curation, sustainability, governance,
   citation, stewardship, and attribution of software authorship
•  Develop a next generation diverse workforce of scientists and
   engineers equipped with essential skills to use and develop
   software, with software and services used in both the research
   and education process
ACI Software Cluster Programs
•  Exploiting Parallelism and Scalability (XPS)
    –  New CISE & OCI program for foundational groundbreaking
       research leading to a new era of parallel (and distributed)
       computing
    –  Issued in Oct., proposals submitted in Feb.
•  Computational and Data-Enabled Science & Engineering
   (CDS&E)
    –  Virtual program (ENG, MPS, OCI) for science-specific proofing of
       algorithms and codes
    –  Identify and capitalize on opportunities for major scientific and
       engineering breakthroughs through new computational and data
       analysis approaches
•  Software Infrastructure for Sustained Innovation (SI2)
    –  Transform innovations in research and education into sustained
       software resources that are an integral part of the
       cyberinfrastructure
    –  Develop and maintain sustainable software infrastructure that can
       enhance productivity and accelerate innovation in science and
       engineering
Software Infrastructure Projects
SI2 Software Activities
•  Elements (SSE) & Frameworks (SSI)
    –  Past general solicitations, with most of NSF (BIO, CISE, EHR,
       ENG, MPS, SBE): NSF 10-551 (2011), NSF 11-539 (2012)
        •  About 27 SSE and 20 SSI projects (19 SSE & 13 SSI in FY12)
    –  Current focused solicitation, with MPS/CHE and EPSRC: US/UK
       collaborations in computational chemistry, NSF 12-576 (2012)
        •  Will fund 4 awards from 18 proposals
    –  Solicitation open (NSF 13-525)
•  Institutes (S2I2)
    –  Solicitation for conceptualization awards, NSF 11-589 (2012)
        •  13 projects (co-funded with BIO, CISE, ENG, MPS)
    –  Second solicitation for 3-5 more S2I2s (NSF 13-511)
    –  Full institute solicitation in late FY14
•  US/China DCL (with CISE/CNS, loosely with NSFC)
    –  NSF 12-096: will make decisions soon on small set of initial
       projects
    –  Included in future SSE & SSI solicitation
•  See http://bit.ly/sw-ci for current projects
SI2 Solicitation and Decision Process

•  Cross-NSF software working group with
   members from all directorates
•  Determined how SI2 fits with other NSF
   programs that support software
   –  See: Implementation of NSF Software Vision -
      http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=504817
•  Discusses solicitations, determines who will
   participate in each
•  Discusses and participates in review process
•  Work together to fund worthy proposals
SI2 Solicitation and Decision Process

•  Proposal reviews well -> my role becomes
   matchmaking
   –  I want to find program officers with funds, and convince them
      that they should spend their funds on the proposal
•  Unidisciplinary project (e.g. bioinformatics app)
   –  Work with single program officer, either likes the proposal or
      not
•  Multidisciplinary project (e.g., molecular
   dynamics)
   –  Work with multiple program officers, ...
•  Omnidisciplinary project (e.g., http, math library)
   –  Try to work with all program officers, often am told “it’s your
      responsibility”
•  In all cases, need to forecast impact
   –  Past performance does predict future results
Measuring Impact – Scenarios
1.  Developer of open source physics simulation
   –  Possible metrics
       •    How many downloads? (easiest to measure, least value)
       •    How many contributors?
       •    How many uses?
       •    How many papers cite it?
       •    How many papers that cite it are cited? (hardest to measure,
            most value)

2.  Developer of open source math library
   –  Possible metrics are similar, but citations are less
      likely
   –  What if users don’t download it?
       •    It’s part of a distro
       •    It’s pre-installed (and optimized) on an HPC system
       •    It’s part of a cloud image
       •    It’s a service
Vision for Metrics & Citation, part 1
•  Products (software, paper, data set) are
   registered
   –  Credit map (weighted list of contributors—people,
      products, etc.) is an input
   –  DOI is an output
   –  Leads to transitive credit
       •  E.g., paper 1 provides 25% credit to software A, and software A
          provides 10% credit to library X -> library X gets 2.5% credit for
          paper 1
       •  Helps developer – “my tools are widely used, give me tenure” or
          “NSF should fund my tool maintenance”
   –  Issues:
       •  Social: Trust in person who registers a product
            –  This seems to work for papers today (without weights) for both
               author lists and for citations
            –  Do weights require more than human memory?
       •  Technological: Registration system
            –  Where is it/them, what are interfaces, how do they work together?
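The transitive-credit arithmetic on this slide can be sketched in a few lines. This is only an illustration, not a proposed registration system: the `credit_maps` dictionary, the function name, and the "authors"/"developers" entries are assumptions made for the example. It also assumes each intermediate product is reported with the full weight that reaches it, which the slide leaves open.

```python
from collections import defaultdict


def transitive_credit(credit_maps, product):
    """Return the share of credit for `product` that reaches each contributor.

    credit_maps: {product: {contributor: weight}}; weights per product sum to 1.
    Contributors that have their own credit map pass credit further down.
    """
    totals = defaultdict(float)

    def walk(node, weight, path):
        for contributor, share in credit_maps.get(node, {}).items():
            if contributor in path:
                raise ValueError(f"cycle in credit maps at {contributor!r}")
            passed = weight * share
            totals[contributor] += passed
            walk(contributor, passed, path | {contributor})

    walk(product, 1.0, frozenset({product}))
    return dict(totals)


# Hypothetical credit maps matching the slide's example percentages.
credit_maps = {
    "paper 1": {"software A": 0.25, "authors of paper 1": 0.75},
    "software A": {"library X": 0.10, "developers of software A": 0.90},
}
credit = transitive_credit(credit_maps, "paper 1")
print(round(credit["library X"], 4))  # prints 0.025, i.e. 2.5%
```

Summing `credit["library X"]` over every registered paper would give the kind of figure a library developer could cite for tenure or maintenance funding.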
Vision for Metrics & Citation, part 2
•  Product usage is recorded
   –  Where?
       •  Both the developer and user want to track usage
       •  Privacy issues? (legal, competitive, ...)
       •  Via a phone home mechanism?
   –  What does “using” a data set mean? And how could
      it trigger a usage record?
   –  Can general code be developed for this, to be
      incorporated in software packages?
•  Ties to provenance
•  With user input, tie later products to usage
   –  User may not know science outcome when using tool
   –  After science outcome is known, may be hard to
      determine which product usages were involved
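As one possible answer to "can general code be developed for this", the sketch below appends opt-in usage records to a local file rather than phoning home. Everything here is hypothetical (the function names, the record fields, the JSON-lines format); it only shows how small such a shared component could be, and leaves privacy policy and central reporting open.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def record_usage(product, version, log_path, enabled=False):
    """Append one usage record to a local JSON-lines file, only if opted in."""
    if not enabled:
        return None
    record = {"product": product, "version": version, "timestamp": time.time()}
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def read_usage(log_path):
    """Return all usage records from the file, or [] if it does not exist."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo in a temporary directory: the first call is ignored (no opt-in).
with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "usage.jsonl")
    record_usage("example-tool", "1.0", log, enabled=False)
    record_usage("example-tool", "1.0", log, enabled=True)
    demo_records = read_usage(log)
print(len(demo_records))  # prints 1
```

Keeping the records local lets the user decide later, once the science outcome is known, which usages to report and tie to a product.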
Vision for Metrics & Citation, thoughts
•  Can this be done incrementally?
•  Lack of credit is a larger problem than often
   perceived
   –  Lack of credit is a disincentive for sharing software
      and data
   –  Providing credit would both remove a disincentive
      and add an incentive
   –  See Lewin’s principle of force field analysis (1943)
•  For commercial tools, credit is tracked by $
   –  But this doesn’t help understand what tools were used
      for what outcomes
   –  Does this encourage collaboration?
•  Could a more economic model be used?
   –  NSF gives tokens as part of science grants, users
      distribute tokens while/after using tools
Software Questions: Sustainability
•  My definition as a program officer:
   –  How will you support your software without me
      continuing to pay for it?
•  What does support mean?
   –  Can I build and run it on my current system?
       •  Adapt to changing underlying hardware/software
   –  Do I understand what it does?
       •  Documentation, training
   –  Does it do what it does correctly?
       •  Bug tracking and updates
       •  Verification and validation
   –  Does it do what I want?
       •  Requirement tracking and updates
   –  Is it changing?
       •  Heritage (legacy) vs. developing software

•  How can Apache help?
Software Questions: Governance
•  Why do we care?
   –  Governance tells users and contributors how the
      project makes decisions, how they can be involved
•  What are the issues?
   –  Community: Users? Developers? Both?
   –  Models: dictatorship (Linux kernel), meritocracy
      (Apache), other?
   –  Tie to development models: cathedral, bazaar
•  How can Apache help?
   –  Study how these work in smaller specialized projects?
Other Questions for Apache

•  Does the Apache Way work for science?
   –  Or just for underlying tools that are useful both for
      science and other applications?
•  How many users/developers are needed for
   success?
•  Incubator model
   –  Can it be used as is for general science software?
   –  Or forked and modified?
•  Open Source for understanding (available) vs
   Open Source for reuse/development
   (changeable)?
General Software Questions
•  Software that is intended to be infrastructure has
   challenges
   –  Unlike in business, more users means more work
   –  The last 20% takes 80% of the effort
   –  What can NSF do to make these things easier?
•  What fraction of funds should be spent on
   support of existing infrastructure vs.
   development of new infrastructure?
•  How do we decide when to stop supporting a
   software element?
•  How do we encourage reuse and discourage
   duplication?
•  How do we more effectively support career
   paths for software developers (with universities,
   labs, etc.)?
What Can You Do?
•  Look at the current set of SI2 software
   and institutes, and get involved with one
   –  http://bit.ly/sw-ci
•  Tell me what we should be doing
   differently
   –  Here or email: dkatz@nsf.gov

NSF Software @ ApacheConNA

  • 1. Software as Infrastructure at NSF/OCI Daniel S. Katz Program Director, Office of Cyberinfrastructure (OCI)
  • 2. Software as Infrastructure at NSF/OCI Daniel S. Katz Program Director, Office of Cyberinfrastructure (OCI) Division of Advanced Cyberinfrastructure (ACI)
  • 3. Big Science and Infrastructure •  Higgs* boson discovery announced at CERN July 4, 2012 •  Instrument: Large Hadron Collider (LHC) •  Infrastructure –  Computing Hardware: Worldwide LHC Computing Grid (WLCG): 235,000 cores across 36 countries, including OpenScience Grid (OSG, US), European Grid Infrastructure (EGI, Europe), ... –  Data: ~20 PB of data created in 2011-2012 –  Software: grid middleware, physics analysis applications, ... –  Networks –  Education & Training •  Data generated centrally, moved (~3 PB/week) across multi-tiered infrastructure to be computing upon See: http://www.isgtw.org/feature/how-grid-computing-helped-cern-hunt-higgs
  • 4. Big Science and Infrastructure •  Hurricanes affect humans •  Multi-physics: atmosphere, ocean, coast, vegetation, soil –  Sensors and data as inputs •  Humans: what have they built, where are they, what will they do –  Data and models as inputs •  Infrastructure: –  Urgent/scheduled processing, workflows –  Software applications, workflows –  Networks –  Decision-support systems, visualization –  Data storage, interoperability
  • 5. Long-tail Science and Infrastructure •  Exploding data volumes & powerful simulation methods NSF grant size, 2007. (“Dark data in the long tail mean that more researchers of science”, B. Heidorn) need advanced infrastructure •  Such “long-tail” researchers cannot afford expensive expertise and unique infrastructure •  Challenge: Outsource and/or automate time-consuming common processes –  Tools, e.g., Globus Online and data management •  Note: much LHC data is moved by Globus GridFTP, e.g., May/ June 2012, >20 PB, >20M files –  Gateways, e.g., nanoHUB, CIPRES, access to scientific simulation software
  • 6. Long-tail Science and Infrastructure •  CIPRES Science Gateway for Phylogenetics –  Study of diversification of life and relationships among living things through time •  Highly used –  Cited in at least 400 publications, e.g., Nature, PNAS, Cell –  More than 5000 unique users in 3 years –  Used routinely in at least 68 undergraduate classes –  45% US (including most states), 55% 70 other countries •  Infrastructure –  Flexible web application •  A science gateway, uses software and lessons from XSEDE gateways team, e.g., identify management, HPC job control –  Science software: tree inference and sequence alignment •  Parallel versions of MrBayes, RAxML, GARLI, BEAST, MAFFT •  PAUP*, Poy, ClustalW, Contralign, FSA, MUSCLE, ... –  Data •  Personal user space for storing results •  Tools to transfer and view data Credit: Mark Miller, SDSC
  • 7. Infrastructure Challenges •  Science –  Larger teams, more disciplines, more countries •  Data –  Size, complexity, rates all increasing rapidly –  Need for interoperability (systems and policies) •  Systems –  More cores, more architectures (GPUs), more memory hierarchy –  Changing balances (latency vs bandwidth) –  Changing limits (power, funds) –  System architecture and business models changing (clouds) –  Network capacity growing; increase networks -> increased security •  Software –  Multiphysics algorithms, frameworks –  Programing models and abstractions for science, data, and hardware –  V&V, reproducibility, fault tolerance •  People –  Education and training –  Career paths –  Credit and attribution
  • 8. Cyberinfrastructure (e-Research) •  “Cyberinfrastructure consists of computing systems, data storage systems, advanced instruments and data repositories, visualization environments, and people, all linked together by software and high performance networks to improve research productivity and enable breakthroughs not otherwise possible.” -- Craig Stewart •  Infrastructure elements: –  parts of an infrastructure, –  developed by individuals and groups, –  international, –  developed for a purpose, –  used by a community
  • 9. Software is Infrastructure Science   •  Software essential for the bulk of science -  About half the papers in recent issues of Science were software-intensive projects -  Research becoming dependent upon So(ware     advances in software -  Significant software development being conducted across NSF: NEON, OOI, Compu0ng   NEES, NCN, iPlant, etc Infrastructure   •  Wide range of software types: system, Scientific applications, modeling, gateways, Discovery Technological Innovation analysis, algorithms, middleware, libraries •  Development, production and maintenance are people intensive •  Software life-times are long compared to hardware •  Under-appreciated value Software Software Education
  • 10. Cyberinfrastructure Framework for 21st Century Science and Engineering (CIF21) •  Cross-NSF portfolio of activities to provide integrated cyber resources that will enable new multidisciplinary research opportunities in all science and engineering fields by leveraging ongoing investments and using common approaches and components (http://www.nsf.gov/cif21) •  ACCI task force reports (http://www.nsf.gov/od/oci/taskforces/index.jsp) –  Campus Bridging, Cyberlearning & Workforce Development, Data & Visualization, Grand Challenges, HPC, Software for Science & Engineering –  Included recommendation for NSF-wide CDS&E program •  Vision and Strategy Reports –  ACI - http://www.nsf.gov/publications/pub_summ.jsp?ods_key=nsf12051 –  Software - http://www.nsf.gov/publications/pub_summ.jsp?ods_key=nsf12113 –  Data - http://www.nsf.gov/od/oci/cif21/DataVision2012.pdf •  Implementation –  Implementation of Software Vision http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=504817
  • 11. Software Vision NSF will take a leadership role in providing software as enabling infrastructure for science and engineering research and education, and in promoting software as a principal component of its comprehensive CIF21 vision •  ... •  Reducing the complexity of software will be a unifying theme across the CIF21 vision, advancing both the use and development of new software and promoting the ubiquitous integration of scientific software across all disciplines, in education, and in industry –  A Vision and Strategy for Software for Science, Engineering, and Education – NSF 12-113
  • 12. Infrastructure Role & Lifecycle •  Create and maintain a software ecosystem providing new capabilities that advance and accelerate scientific inquiry at unprecedented complexity and scale •  Enable transformative, interdisciplinary, collaborative, science and engineering research and education through the use of advanced software and services •  Support the foundational research necessary to continue to efficiently advance scientific software •  Transform practice through new policies for software addressing challenges of academic culture, open dissemination and use, reproducibility and trust, curation, sustainability, governance, citation, stewardship, and attribution of software authorship •  Develop a next generation diverse workforce of scientists and engineers equipped with essential skills to use and develop software, with software and services used in both the research and education process
  • 13. ACI Software Cluster Programs •  Exploiting Parallelism and Scalability (XPS) –  New CISE & OCI program for foundational groundbreaking research leading to a new era of parallel (and distributed) computing –  Issued in Oct., proposals submitted in Feb. •  Computational and Data-Enabled Science & Engineering (CDS&E) –  Virtual program (ENG, MPS, OCI) for science-specific proofing of algorithms and codes –  Identify and capitalize on opportunities for major scientific and engineering breakthroughs through new computational and data analysis approaches •  Software Infrastructure for Sustained Innovation (SI2) –  Transform innovations in research and education into sustained software resources that are an integral part of the cyberinfrastructure –  Develop and maintain sustainable software infrastructure that can enhance productivity and accelerate innovation in science and engineering
  • 15. SI2 Software Activities •  Elements (SSE) & Frameworks (SSI) –  Past general solicitations, with most of NSF (BIO, CISE, EHR, ENG, MPS, SBE): NSF 10-551 (2011), NSF 11-539 (2012) •  About 27 SSE and 20 SSI projects (19 SSE & 13 SSI in FY12) –  Current focused solicitation, with MPS/CHE and EPSRC: US/UK collaborations in computational chemistry, NSF 12-576 (2012) •  Will fund 4 awards from 18 proposals –  Solicitation open (NSF 13-525) •  Institutes (S2I2) –  Solicitation for conceptualization awards, NSF 11-589 (2012) •  13 projects (co-funded with BIO, CISE, ENG, MPS) –  Second solicitation for 3-5 more S2I2s (NSF 13-511) –  Full institute solicitation in late FY14 •  US/China DCL (with CISE/CNS, loosely with NSFC) –  NSF 12-096: will make decisions soon on small set of initial projects –  Included in future SSE & SSI solicitation •  See http://bit.ly/sw-ci for current projects
  • 16. SI2 Solicitation and Decision Process •  Cross-NSF software working group with members from all directorates •  Determined how SI2 fits with other NSF programs that support software –  See: Implementation of NSF Software Vision - http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=504817 •  Discusses solicitations, determines who will participate in each •  Discusses and participates in review process •  Works together to fund worthy proposals
  • 17. SI2 Solicitation and Decision Process •  Proposal reviews well -> my role becomes matchmaking –  I want to find program officers with funds, and convince them that they should spend their funds on the proposal •  Unidisciplinary project (e.g. bioinformatics app) –  Work with single program officer, who either likes the proposal or not •  Multidisciplinary project (e.g., molecular dynamics) –  Work with multiple program officers, ... •  Omnidisciplinary project (e.g. http, math library) –  Try to work with all program officers, often am told “it’s your responsibility” •  In all cases, need to forecast impact –  Past performance does predict future results
  • 18. Measuring Impact – Scenarios 1.  Developer of open source physics simulation –  Possible metrics •  How many downloads? (easiest to measure, least value) •  How many contributors? •  How many uses? •  How many papers cite it? •  How many papers that cite it are cited? (hardest to measure, most value) 2.  Developer of open source math library –  Possible metrics are similar, but citations are less likely –  What if users don’t download it? •  It’s part of a distro •  It’s pre-installed (and optimized) on an HPC system •  It’s part of a cloud image •  It’s a service
  • 19. Vision for Metrics & Citation, part 1 •  Products (software, paper, data set) are registered –  Credit map (weighted list of contributors—people, products, etc.) is an input –  DOI is an output –  Leads to transitive credit •  E.g., paper 1 provides 25% credit to software A, and software A provides 10% credit to library X -> library X gets 2.5% credit for paper 1 •  Helps developer – “my tools are widely used, give me tenure” or “NSF should fund my tool maintenance” –  Issues: •  Social: Trust in person who registers a product –  This seems to work for papers today (without weights) for both author lists and for citations –  Do weights require more than human memory? •  Technological: Registration system –  Where is it/them, what are interfaces, how do they work together?
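The transitive credit example on slide 19 (paper 1 -> software A -> library X) can be worked through in a few lines. This is only a minimal sketch, not a proposed registration system: the `credit_maps` dictionary and the `transitive_credit` function are hypothetical names, the weights are the ones quoted on the slide, and the sketch assumes that credit passed downstream is not subtracted from the intermediate product (the slide does not say either way).

```python
from collections import defaultdict

# Hypothetical credit maps: each registered product lists the fraction of
# its credit that it assigns to other products (weights from the slide).
credit_maps = {
    "paper 1": {"software A": 0.25},
    "software A": {"library X": 0.10},
    "library X": {},
}


def transitive_credit(product, credit_maps):
    """Return the credit each upstream product receives from `product`.

    Credit is multiplied along each chain of credit maps. Assumption: an
    intermediate product keeps its full share (nothing is subtracted for
    the credit it passes on).
    """
    totals = defaultdict(float)

    def walk(node, weight, path):
        for dep, share in credit_maps.get(node, {}).items():
            if dep in path:
                # Skip cycles so mutually crediting products terminate.
                continue
            contribution = weight * share
            totals[dep] += contribution
            walk(dep, contribution, path | {dep})

    walk(product, 1.0, {product})
    return dict(totals)


credit = transitive_credit("paper 1", credit_maps)
for name, share in sorted(credit.items()):
    print(f"{name}: {share:.1%}")
```

With the slide's weights, library X ends up with 0.25 x 0.10 = 2.5% of the credit for paper 1, matching the example above.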
  • 20. Vision for Metrics & Citation, part 2 •  Product usage is recorded –  Where? •  Both the developer and user want to track usage •  Privacy issues? (legal, competitive, ...) •  Via a phone home mechanism? –  What does “using” a data set mean? And what could trigger a usage record? –  Can general code be developed for this, to be incorporated in software packages? •  Ties to provenance •  With user input, tie later products to usage –  User may not know science outcome when using tool –  After science outcome is known, may be hard to determine which product usages were involved
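As one illustration of the "general code" question on slide 20, a usage record could be as small as the sketch below. Everything here is an assumption made for illustration: the field names, the `make_usage_record` and `append_usage_record` helpers, and the placeholder product identifier are hypothetical, and the record is written to a local file rather than phoned home, which sidesteps (but does not answer) the privacy questions raised on the slide.

```python
import json
from datetime import datetime, timezone


def make_usage_record(product_id, version, action, user_id=None):
    """Build one usage record as a plain dict (hypothetical schema)."""
    return {
        "product": product_id,  # e.g. the identifier issued at registration
        "version": version,
        "action": action,       # what counted as "using" the product
        "user": user_id,        # None keeps the record anonymous
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def append_usage_record(record, log_path):
    """Append the record as one JSON line to a local log file."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


# "example-product-id" is a placeholder, not a real identifier.
record = make_usage_record("example-product-id", "1.2.0", "run")
print(json.dumps(record, indent=2))
```

Keeping the record a plain JSON line means a user could later attach it to a paper or data set once the science outcome is known, which is the tie to provenance the slide asks about.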
  • 21. Vision for Metrics & Citation, thoughts •  Can this be done incrementally? •  Lack of credit is a larger problem than often perceived –  Lack of credit is a disincentive for sharing software and data –  Providing credit would both remove a disincentive and add an incentive –  See Lewin’s principle of force field analysis (1943) •  For commercial tools, credit is tracked by $ –  But this doesn’t help understand what tools were used for what outcomes –  Does this encourage collaboration? •  Could a more economic model be used? –  NSF gives tokens as part of science grants, users distribute tokens while/after using tools
  • 22. Software Questions: Sustainability •  My definition as a program officer: –  How will you support your software without me continuing to pay for it? •  What does support mean? –  Can I build and run it on my current system? •  Adapt to changing underlying hardware/software –  Do I understand what it does? •  Documentation, training –  Does it do what it does correctly? •  Bug tracking and updates •  Verification and validation –  Does it do what I want? •  Requirement tracking and updates –  Is it changing? •  Heritage (legacy) vs. developing software •  How can Apache help?
  • 23. Software Questions: Governance •  Why do we care? –  Governance tells users and contributors how the project makes decisions, how they can be involved •  What are the issues? –  Community: Users? Developers? Both? –  Models: dictatorship (Linux kernel), meritocracy (Apache), other? –  Tie to development models: cathedral, bazaar •  How can Apache help? –  Study how these work in smaller specialized projects?
  • 24. Other Questions for Apache •  Does the Apache Way work for science? –  Or just for underlying tools that are useful both for science and other applications? •  How many users/developers are needed for success? •  Incubator model –  Can it be used as is for general science software? –  Or forked and modified? •  Open Source for understanding (available) vs Open Source for reuse/development (changeable)?
  • 25. General Software Questions •  Software that is intended to be infrastructure has challenges –  Unlike in business, more users means more work –  The last 20% takes 80% of the effort –  What can NSF do to make these things easier? •  What fraction of funds should be spent on support of existing infrastructure vs. development of new infrastructure? •  How do we decide when to stop supporting a software element? •  How do we encourage reuse and discourage duplication? •  How do we more effectively support career paths for software developers (with universities, labs, etc.)?
  • 26. What Can You Do? •  Look at the current set of SI2 software and institutes, and get involved with one –  http://bit.ly/sw-ci •  Tell me what we should be doing differently –  Here or email: dkatz@nsf.gov