www.bsc.es Barcelona, September 22nd, 2015
Toni Cortes
Leader of the storage-system research group
BSC and integrating persistent data
and parallel programming models
Barcelona Supercomputing Center
Centro Nacional de Supercomputación
• BSC-CNS objectives:
– R&D in Computer, Life, Earth and Engineering Sciences.
– Supercomputing services and support to Spanish and
European researchers.
• BSC-CNS is a consortium that includes:
– Spanish Government 60%
– Catalonian Government 30%
– Universitat Politècnica de Catalunya (UPC) 10%
• 425 people, 40 countries
BSC Scientific & Technical Departments
Mission of BSC R&D Departments
EARTH SCIENCES
To develop and
implement global and
regional state-of-the-art
models for short-term air
quality forecast and long-
term climate applications.
LIFE SCIENCES
To understand living
organisms by means of
theoretical and
computational methods
(molecular modeling,
genomics, proteomics).
CASE
To develop scientific and
engineering software to
efficiently exploit super-
computing capabilities
(biomedical, geophysics,
atmospheric, energy,
social and economic
simulations).
COMPUTER
SCIENCES
To influence the way
machines are built,
programmed and used:
programming models,
performance tools, Big
Data, computer
architecture, energy
efficiency.
From Research to Market
Embedded electronics
for improving safety in
time-critical applications
Middleware,
System
Software
Automotive
Aviation
HPC
Smart Cities
IoT
Cloud
Big Data
Bioinformatic tools
for target and drug
discovery
Pharma
Medical
Air quality, weather
and climate
modelling products
Weather
Services /
Climate
Agencies
Renewables
Agriculture
Simulations of
complex problems
Medical
Engineering
Smart Cities
Nautical
Automotive
Aviation
Renewables
Oil & Gas
Pharma
BSC Technologies
Programming models,
performance tools &
energy efficient
hardware
Mobile
HPC
Data Centres
Rail
Space
Consolidate Spanish Position: Severo Ochoa
Energy-efficiency
for Exascale and
Big Data
Multidisciplinary research
program to address the design and
use of future Exascale
supercomputers.
– Programming models for
energy-efficiency and Big Data.
– Three challenging applications
as a starting point for
interdepartmental
collaboration.
– Enhancing external
cooperation.
– Improving human resource
management.
– Building internal and external
training platforms.
– Articulating procedures for a
better internal and external
communication.
• Consolidating the Institution as a
world leader both in HPC research
and applications and in the scientific
and professional empowerment of its
members.
BSC Severo Ochoa collaborations
CS
MGT
OP
CASE LS
ES
International collaborations: Joint Laboratory on Extreme Scale Computing
• In June 2014, the University of Illinois at Urbana-Champaign, INRIA,
Argonne National Laboratory, Barcelona Supercomputing Center and
Jülich Supercomputing Centre formed the Joint Laboratory on Extreme
Scale Computing.
• The Joint Laboratory focuses on software challenges found in extreme
scale high-performance computers.
• Researchers from the different centres regularly meet for workshops,
and at the last one, organised by BSC in Barcelona in June 2015, over
100 researchers, from the six centres which are now members, took part.
JLESC: working together towards success
Resilience
I/O, Storage and
Visualisation
Parallel
Programming Tools
Applications and
Numerical
Algorithms
Link to EU & Spanish Large Industries
Repsol-BSC Research Center
Research into advanced technologies for the exploration of hydrocarbons, subterranean and subsea reserve modelling and fluid flows
Iberdrola Renovables
Model to estimate onshore and offshore wind production
Attract R&D projects from IT Corporations
Analysis of Hadoop workload performance
under different software parameters and
hardware configurations.
Results available online.
BSC-Microsoft Research Centre
Future challenges for supercomputers
including power efficiency and scalability,
new programming models, and tools for
analysis and optimization of applications.
BSC-IBM Technology Center
for Supercomputing
Training in Parallel Programming using CUDA
and StarSs. Optimising management of
execution resources in multi-GPU
environments with GMAC.
BSC-NVIDIA CUDA
Center of Excellence
Multi-year agreement focussing on
optimising efficiency through research into
Programming Models, Performance Tools
and Applications.
Intel-BSC Exascale Lab
Agreement on
memory performance
in HPC systems with
Help to define the future of global HPC
International Roadmapping Leadership in Exascale
Enabling the Data Revolution
Contributing to Standardisation
Increase Industry Collaboration.
BSC & Industry 2012
Increase Industry Collaboration.
BSC & Industry 2015
Spin-off creation
PELE project (Protein Energy Landscape Exploration)
NAR, 41, W322-8 (2013); J. Comp Chem 31, 1224-35 (2010)
Hormonal nuclear receptors
(joint project with AstraZeneca)
We can observe how a drug finds its
target and we can study, at an atomic
level, the way in which they get linked.
We can study different effects caused by
mutations, as well as new drugs.
BSC’s First Spin-Off: How a drug finds its target
BSC People
Agenda
The pillars
The dark side
The secret potential
Time to wake up!
From a different perspective…
“We cannot solve our problems
with the same thinking
we used when we created them”
Albert Einstein
Some of today’s thinking
–Data stored in
• Files
• Databases
–Data is a 2nd-class citizen
• Accessed with its own primitives
• Data and code are different
Agenda
The motivation
The pillars
The dark side
The secret potential
Time to wake up!
Before everything started
The pillars of dataClay
What ignited our research, our “big bang”
– Different data models: persistent vs. non-persistent
– New storage devices: byte addressable
– Coupling data and code
– Sharing is what really matters
And then dataClay came to life …
(more details on how it all fits together
in the next few minutes)
Two data models!
Why waste time doing it twice?
Today
– We have one data model for volatile data
– Traditional data structures and/or objects
– We have a different data model for the persistent data
– Relational database, NoSQL database, files
Future
– Store data in the same way as when volatile
– Store objects and their relations
New storage devices
Better to be prepared on time
New storage hardware is coming
– Storage class memory
– Non-volatile RAM
Main characteristics
– Performance between memory and SSDs
– Byte addressable
File systems and table-based databases are not the right abstraction
– Both were designed to use block devices
– Can be used, but would be a pity
• What a potential loss!!
• Imagine a horse-drawn Ferrari!
Coupling data and computation
They can live isolated, but …
Computation and data are two different abstractions
– They are separated
This raises two questions
– Should I move the data to the computation?
• Does not work for big data sets
– Should I move the computation to the data?
• Deployment is difficult
If data and code were the same thing …
– Using data would be much easier
– (and safer: see more in a few minutes)
Data sharing today
And why it is not enough
Download files
– Flexible
– Only for static data
– Avoid unneeded copies and transfers
– Data provider loses control over the downloaded data
“Data services”: an API to access the data
– Data provider keeps control
– Both dynamic and static data
– No unneeded copies or transfers
– API restricted to what the provider can do
Agenda
The motivation
The pillars
The technology
The dark side
The secret potential
Time to wake up!
Our vision
What dataClay does
dataClay is a platform that enables
– Apps to make objects and their relationships persistent
– 3rd parties to add more data or “change” the data model
– 3rd parties to upload computations to be shared
– Each user to see different “views” of the data
– Data owner to maintain control over its data
– Efficient access to data
Key technologies
– Self-contained objects
– Data enrichment by 3rd parties
Key technology
Self-contained objects
Push the idea of data services to the limit
– Based on the OO paradigm
[Figure: a data service, where client apps reach the data store through functions and security/integrity checks, compared with self-contained objects that bundle data, functions and security.]
Self-contained objects
But, what is really new?
Self-contained and data services
– Same concept, different implementation?
Then…
– … we need something else …
– … something to make it really flexible!
3rd-party enrichment
What is it exactly?
By enrichment we understand:
– Adding new information (fields or data) to existing datasets
– Adding new code to existing datasets
• New methods
• New implementations
This enrichment should
– Be possible during the life of data
– Not be limited to the data owner
– Enable different views of the data to different users/clients
• Not everybody should see the same enrichments
• Several enrichments should be available concurrently
– Enable the avoidance of queries
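The enrichment requirements above can be made concrete with a small sketch. This is not dataClay code: the class EnrichableObject and its methods enrich and view are hypothetical names chosen for illustration, and the field values are made up. It only shows the idea that a third party can attach new code to an existing object while it is alive, and that each user sees only the enrichments registered for them.

```python
class EnrichableObject:
    """Hypothetical object that third parties can extend per user."""

    def __init__(self, **fields):
        self._fields = dict(fields)
        self._enrichments = {}  # user -> {enrichment name: function(fields)}

    def enrich(self, user, name, function):
        """Register extra code that only `user` will see."""
        self._enrichments.setdefault(user, {})[name] = function

    def view(self, user):
        """Return the base fields plus the results of this user's enrichments."""
        result = dict(self._fields)
        for name, function in self._enrichments.get(user, {}).items():
            result[name] = function(self._fields)
        return result


# Illustrative values only.
city = EnrichableObject(name="Example", population=1000, area_km2=4)
city.enrich("analyst", "density", lambda f: f["population"] // f["area_km2"])

print(city.view("analyst"))  # base fields plus the "density" enrichment
print(city.view("visitor"))  # base fields only
```

Because the derived value is computed by code attached to the object, the analyst does not need a separate query to obtain it, which is one reading of "avoidance of queries".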
3rd-party enrichment
And now animated
Data can be enriched both with data and code
Code will be executed in the provider infrastructure
[Figure: a client app sending an enrichment to the data-provider infrastructure.]
Using a single infrastructure?
Killing the bottleneck
Using a “single” infrastructure may become a bottleneck
Security and privacy policies should be part of the data
– Thus, data could be offloaded to other infrastructures
• Without breaking the data policies
– Data owner enables 3rd party enrichment and …
… does not lose control
How is it implemented?
– Policies are defined using a declarative language
– Policies enforced as part of object methods
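A minimal sketch of what "policies enforced as part of object methods" could look like, assuming a very simple policy format (a mapping from user to granted permissions). The decorator requires and the class Record are hypothetical and are not the dataClay policy language; the point is only that the check travels with the object, so it still applies when the object is offloaded to another infrastructure.

```python
import functools


def requires(permission):
    """Hypothetical policy check wrapped around an object method."""
    def decorate(method):
        @functools.wraps(method)
        def guarded(self, user, *args, **kwargs):
            if permission not in self.policy.get(user, set()):
                raise PermissionError(f"{user} lacks '{permission}'")
            return method(self, user, *args, **kwargs)
        return guarded
    return decorate


class Record:
    def __init__(self, value, policy):
        self._value = value
        self.policy = policy  # declarative: user -> set of granted permissions

    @requires("read")
    def read(self, user):
        return self._value

    @requires("write")
    def write(self, user, value):
        self._value = value


record = Record(42, {"owner": {"read", "write"}, "guest": {"read"}})
print(record.read("guest"))  # 42
try:
    record.write("guest", 0)
except PermissionError as err:
    print(err)  # guest lacks 'write'
```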
Distributing objects
Efficient usage of resources
– Data and code can be offloaded
• to resources not accessible by the data provider
[Figure: self-contained objects (data, functions, security) distributed across the provider infrastructure, the client infrastructure and the cloud.]
Agenda
The motivation
The pillars
The technology
The dark side
The integration into the
parallel programming language
The secret potential
Time to wake up!
Task-based programming
Task is the unit of work
Data dependences between tasks
– Imply partial order
– Exhibit potential parallelism
– Imply local synchronization
• Not global!
Implicit workflow
COMPSs
Sequential programming
– General purpose programming language + annotations
• Currently Java and Python
Task based
– Builds a task graph at runtime
• Express potential concurrency
• Includes dependencies
– Simple linear address space
Unaware of computing platform
– Enabled by the runtime for clusters, clouds and grids
Python (PyCOMPSs) syntax
How to write PyCOMPSs code
Invoke tasks
– As functions/methods
API for data synchronization
Task definition in function declaration
– decorators
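Function definition

    from pycompss.api.task import task
    from pycompss.api.parameter import INOUT
    from pycompss.api.api import compss_wait_on

    @task(par=INOUT)
    def myFunction(par):
        …

    class Foo(object):
        @task()
        def myMethod(self):
            …

Main Program

    foo = Foo()
    myFunction(foo)            # task "myF"
    foo.myMethod()             # task "myM"
    …
    foo = compss_wait_on(foo)  # synchronization ("synch")
    foo.bar()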
Parallel execution
...
T1 (data1, out data2);
T2 (data4, out data5);
T3 (data2, data5, out data6);
T4 (data7, out data8);
T5 (data6, data8, out data9);
...
[Figure: task dependency graph for T1 to T5.]
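The task list above can be turned into a dependency graph mechanically. The sketch below is not the COMPSs runtime; it is a plain-Python illustration, using the standard graphlib module, of how read and write sets imply a partial order. The helper names build_graph and parallel_levels are hypothetical, and only read-after-write dependences are tracked.

```python
from graphlib import TopologicalSorter

# (task name, data read, data written), copied from the task list above.
tasks = [
    ("T1", ["data1"], ["data2"]),
    ("T2", ["data4"], ["data5"]),
    ("T3", ["data2", "data5"], ["data6"]),
    ("T4", ["data7"], ["data8"]),
    ("T5", ["data6", "data8"], ["data9"]),
]


def build_graph(tasks):
    """Map each task to the set of earlier tasks that produce its inputs."""
    last_writer = {}
    graph = {}
    for name, reads, writes in tasks:
        graph[name] = {last_writer[d] for d in reads if d in last_writer}
        for d in writes:
            last_writer[d] = name
    return graph


def parallel_levels(graph):
    """Group tasks into waves; tasks in the same wave can run concurrently."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        levels.append(ready)
        sorter.done(*ready)
    return levels


graph = build_graph(tasks)
levels = parallel_levels(graph)
print(levels)  # [['T1', 'T2', 'T4'], ['T3'], ['T5']]
```

T1, T2 and T4 share no data and can start immediately; T3 only waits for T1 and T2, and T5 only for T3 and T4, which is the local (not global) synchronization mentioned earlier.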
COMPSs framework
[Figure: COMPSs runtime components: Task Analyzer, Data Info Provider, Scheduler, Resource Manager and Job Manager. The application code yields a DAG, and tasks (ExecuteTask) run on workers and persistent workers in the computing infrastructure.]
dataClay as a COMPSs worker
Executes a method (possibly static) in a given backend
– Acts as COMPSs worker threads
As opposed to direct method execution
– You can decide the execution backend (executeTask)
– Asynchronous
• Result can be checked by using getResult
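The asynchronous pattern described above (start a method on a chosen backend, collect the result later) can be sketched with the standard concurrent.futures module. The Backend class and the functions execute_task and get_result below are hypothetical stand-ins named after the slide's executeTask and getResult; they are not the dataClay or COMPSs API.

```python
from concurrent.futures import ThreadPoolExecutor


class Backend:
    """Hypothetical backend: a named pool of worker threads."""

    def __init__(self, name, workers=2):
        self.name = name
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def execute_task(self, method, *args):
        """Start method(*args) on this backend and return a handle at once."""
        return self._pool.submit(method, *args)


def get_result(handle):
    """Block until the task behind the handle finishes and return its value."""
    return handle.result()


def count_words(text):
    return len(text.split())


backend_a = Backend("backend-a")
handle = backend_a.execute_task(count_words, "data and code together")
# The caller is free to do other work here before collecting the result.
word_count = get_result(handle)
print(word_count)  # 4
```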
[Figure: the Job Manager dispatching tasks to workers, a persistent worker and dataClay in the computing infrastructure.]
A trivial example to follow
Input: collection of persons
– Person
…
Integer age
…
Boolean isOlder (limit, outCollection) {
if (age>limit) add self into outCollection
}
Output: collection of persons older than a given age (limit)
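For reference, the trivial example can be written as ordinary sequential Python. This is a sketch: the name field and the sample people are illustrative assumptions, and the method is spelled is_older to follow Python naming conventions.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str  # assumed field, only used to make the output readable
    age: int

    def is_older(self, limit, out_collection):
        """Append self to out_collection when age exceeds limit."""
        if self.age > limit:
            out_collection.append(self)
            return True
        return False


people = [Person("Ada", 36), Person("Linus", 17), Person("Grace", 85)]
older = []
for person in people:
    person.is_older(30, older)
print([p.name for p in older])  # ['Ada', 'Grace']
```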
“Per object” parallelism
COMPSs “instantiates” one worker per object
– Iterates over a collection using a standard iterator
• Instantiates the method in the node where the object is
– Targeted at object methods
– getLocations
• Blocking may be needed
– Object-method granularity may be too small
– It implies grouping objects in the same backend
[Figure: Task 1 to Task 6, one task per object.]
“Per object” parallelism
Declare method isOlder as a parallel task
Code
For element in the collection
// For each element
// This method is executed in parallel
// in the node where the data is
element.isOlder(age)
Parallelizing for each element may be too small
–Blocking
“Per object” parallelism
Create a new method isOlderBlocking(age,ini,num)
For element between ini and ini+num
element.isOlder(age)
Code
For i in 0 .. (#elements in collection / block) - 1
// For each block
// This method is executed in parallel
collection.isOlderBlocking(age, i*block, block)
Now we have the right granularity
–The scientist needs to define blocking size
–And placement if locality is important!!!
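A plain-Python sketch of the blocking idea, using a thread pool in place of the COMPSs runtime. The function names is_older_blocking and older_than are hypothetical, and the collection is reduced to a list of ages; note that the block size is an explicit parameter the caller has to pick, which is exactly the burden mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor


def is_older_blocking(ages, limit, ini, num):
    """Process one block: indices in [ini, ini + num) whose age exceeds limit."""
    end = min(ini + num, len(ages))
    return [i for i in range(ini, end) if ages[i] > limit]


def older_than(ages, limit, block):
    """Run one task per block and merge the per-block results in order."""
    starts = range(0, len(ages), block)
    with ThreadPoolExecutor() as pool:
        parts = pool.map(
            lambda ini: is_older_blocking(ages, limit, ini, block), starts
        )
        return [i for part in parts for i in part]


ages = [12, 45, 33, 8, 70, 29, 51]
result = older_than(ages, 30, block=3)
print(result)  # [1, 2, 4, 6]
```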
“Per backend” parallelism
COMPSs “instantiates” one worker per backend
– Obtains all locations by calling getLocations on the collection
– Each task executes a collection method
• Iterates over a “local” iterator
– Will only return objects in the current backend
– Work stealing may be implemented if needed
[Figure: Task 1 to Task 3, one task per backend.]
“Per backend” parallelism
Create a new collection method isOlderCollection(age)
For element in collection using local iterator
// No parallelism here
element.isOlder(age)
Define this method as “parallel”
Code
// Parallelism: executed in all backends with
// elements
isOlderCollection (age)
Now we have the right granularity
–Scientists did not have to write “special” code
• Only encapsulated and used a “local” iterator
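The per-backend variant can be sketched in the same spirit. DistributedCollection, get_locations and local_iterator are hypothetical names modelled on the terms used in the slides, and backends are simulated by dictionary keys; the method body is sequential, and the parallelism comes only from running it once per backend.

```python
from concurrent.futures import ThreadPoolExecutor


class DistributedCollection:
    """Hypothetical collection whose elements are spread over named backends."""

    def __init__(self, placement):
        self._placement = placement  # backend name -> ages stored there

    def get_locations(self):
        return sorted(self._placement)

    def local_iterator(self, backend):
        """Yield only the elements stored in the given backend."""
        yield from self._placement[backend]

    def is_older_collection(self, limit, backend):
        # No parallelism here: a plain loop over the local elements.
        return [age for age in self.local_iterator(backend) if age > limit]


ages = DistributedCollection(
    {"node1": [12, 45], "node2": [33, 8, 70], "node3": [29]}
)
with ThreadPoolExecutor() as pool:
    futures = {
        backend: pool.submit(ages.is_older_collection, 30, backend)
        for backend in ages.get_locations()
    }
    per_backend = {backend: f.result() for backend, f in futures.items()}
print(per_backend)  # {'node1': [45], 'node2': [33, 70], 'node3': []}
```

Compared with explicit blocking, the caller never chooses a block size or a placement; both follow from where the data already lives.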
“Other” iterators
These are just examples, other iterators could be defined
– To exploit locality, e.g. by preferring a nearby backend
– To implement work stealing
– To take into account heterogeneity
The iterators are implemented generically in the collection
– Scientists only need to understand what they do
• And use them
Agenda
The motivation
The pillars
The technology
The dark side
The integration into the
parallel programming language
The secret potential
Conclusions
Time to wake up!
Conclusions
Ideas to take back home
Integrating persistent data into the programming model
– Unifies the model for both persistent and volatile data
– Simplifies the decision of where to compute
• Code is part of the data
– Enables the use of data parallelism
• Iterators can be adapted transparently to the programmer
– Enables data distribution
• Behavior policies are embedded
I talk, they do the work
Thanks to …
Current team
– Anna Queralt
– Jonathan Martí
– Daniel Gasull
– Juanjo Costa
– Alex Barceló
Master students
– David Gracia
– Christos Ioannidis
Former team members
– Ernest Artiaga
www.bsc.es
Thank you!

BSC and Integrating Persistent Data and Parallel Programming Models

  • 1. www.bsc.es Barcelona, September 22nd , 2015 Toni Cortes Leader of the storage-system research group BSC and integrating persistent data and parallel programming models
  • 2. 2 Barcelona Supercomputing Center Centro Nacional de Supercomputación • BSC-CNS objectives: – R&D in Computer, Life, Earth and Engineering Sciences. – Supercomputing services and support to Spanish and European researchers. • BSC-CNS is a consortium that includes: – Spanish Government 60% – Catalonian Government 30% – Universitat Politècnica de Catalunya (UPC) 10% • 425 people, 40 countries
  • 3. 3 BSC Scientific & Technical Departments
  • 4. 4 Mission of BSC R&D Departments EARTH SCIENCES To develop and implement global and regional state-of-the-art models for short-term air quality forecast and long- term climate applications. LIFE SCIENCES To understand living organisms by means of theoretical and computational methods (molecular modeling, genomics, proteomics). CASE To develop scientific and engineering software to efficiently exploit super- computing capabilities (biomedical, geophysics, atmospheric, energy, social and economic simulations). COMPUTER SCIENCES To influence the way machines are built, programmed and used: programming models, performance tools, Big Data, computer architecture, energy efficiency.
  • 5. 5 From Research to Market Embedded electronics for improving safety in time-critical applications Middleware, System Software Automotive Aviation HPC Smart Cities IoT Cloud Big Data Bioinformatic tools for target and drug discovery Pharma Medical Air quality, weather and climate modelling products Weather Services / Climate Agencies Renewables Agriculture Simulations of complex problems Medical Engineering Smart Cities Nautical Automotive Aviation Renewables Oil & Gas Pharma BSC Technologies Programming models, performance tools & energy efficient hardware Mobile HPC Data Centres Rail Space
  • 6. 8 Consolidate Spanish Position: Severo Ochoa Energy-efficiency for Exascale and Bigdata Energy-efficiency for Exascale and Bigdata Multidisciplinary research program to address the design and use of future Exascale supercomputers. – Programming models for energy-efficiency and Big Data. – Three challenging applications as a starting point for Interdepartamental collaboration. – Enhancing external cooperation. – Improving human resource management. – Building internal and external training platforms. – Articulating procedures for a better internal and external communication. • Consolidating the Institution as a world leader both in HPC research and applications and in the scientific and professional empowerment of its members.
  • 7. 9 BSC Severo Ochoa collaborations CS MGT OP CASE LS ES
  • 8. 10 International collaborations: Joint- Laboratory on Extreme Scale Computing • In June 2014, the University of Illinois at Urbana-Champaign, INRIA, Argonne National Laboratory, Barcelona Supercomputing Center and Jülich Supercomputing Centre formed the Joint Laboratory on Extreme Scale Computing. • The Joint Laboratory focuses on software challenges found in extreme scale high-performance computers. • Researchers from the different centres regularly meet for workshops, and at the last one, organised by BSC in Barcelona in June 2015, over 100 researchers, from the six centres which are now members, took part.
  • 9. 11 JLESC: working together towards success Resilience I/O, Storage and Visualisation Parallel ProgrammingTools Applications and Numerical Algorithms
  • 10. 12 Link to EU & Spanish Large Industries Research into advanced technologies for the exploration of hydrocarbons, subterranean and subsea reserve modelling and fluid flows Repsol-BSC Research Center Iberdrola Renovables Model to estimate onshore and offshore wind production
  • 11. 13 Attract R&D projects from IT Corporations Analysis of Hadoop workload performance under different software parameters and hardware configurations. Results available online. BSC-Microsoft Research Centre Future challenges for supercomputers including power efficiency and scalability, new programming models, and tools for analysis and optimization of applications. BSC-IBM Technology Center for Supercomputing Training in Parallel Programming using CUDA and StarSs Optimising management of execution resources in multi-GPU environments with GMAC. BSC-NVIDIA CUDA Center of Excellence Multi-year agreement focussing on optimising efficiency through research into Programming Models, Performance Tools and Applications. Intel-BSC Exascale Lab Agreement on memory performance in HPC systems with
  • 12. 14 Help to define the future of global HPC International Roadmapping Leadership in Exascale Enabling the Data Revolution Contributing to Standardisation
  • 15. 17 Spin-off creation PELE project (Protein Energy Landscape Exploration) NAR, 41, W322-8 (2013); J. Comp Chem 31, 1224-35 (2010) Hormonal nuclear receptors (joint project with AstraZeneca) We can observe how a drug finds its target and we can study, at an atomic level, the way in which they get linked. We can study different effects caused by mutations, as well as new drugs. First BSC’s Spin Off: How a drug finds its target
  • 17. www.bsc.es Barcelona, September 22nd , 2015 Toni Cortes Leader of the storage-system research group BSC and integrating persistent data and parallel programming models
  • 18. Agenda The pillars The dark side The secret potential Time to wake up!
  • 19. From a different perspective… “We cannot solve our problems with the same thinking we used when we created them” Albert Einstein Some of today’s thinking –Data stored in • Files • Databases –Data is a 2nd -class citizen • Accessed with its own primitives • Data and code are different
  • 20. Agenda The motivation The pillars The dark side The secret potential Time to wake up!
  • 21. Before everything started The pillars of dataClay What ignited our research, our “big bang” – Different data models: persistent vs. non persistent – New storage devices: byte addressable – Coupling data and code – Sharing is what really matters And then dataClay came to life … (more details on how all fits together in the next minutes)
  • 22. Two data models! Why waste time doing it twice? Today – We have one data model for volatile data – Traditional data structures and/or objects – We have a different data model for the persistent data – Relational database, NoSQL database, files Future – Store data in the same way as when volatile – Store objects and their relations
  • 23. New storage devices Better to be prepared on time New storage hardware is coming – Storage class memory – Non-volatile RAM Main characteristics – Performance between memory and SSDs – Byte addressable File systems or table based DB are not the right abstraction – Both were designed to use block devices – Can be used, but would be a pity • What a potential loss!! • Imagine a Horse-drawn Ferrari?
  • 24. Coupling data and computation They can live isolated, but … Computation and data are two different abstractions – They are separated This brings the problem of – Should I move the data to compute it? • Does not work for big data sets – Should I move computation to the data? • Deployment difficult If data and code were the same thing … – Using data would be much easier – (and safer  see more in a few minutes)
  • 25. Data sharing today And why it is not enough Download files – Flexible – Only for static data – Avoid unneeded copies and transfers – Data provider loses control over the downloaded data “Data services”: an API to access the data – Data provider keeps control – Both dynamic and static data – No unneeded copies or transfers – API restricted to what the provider can do
  • 26. Agenda The motivation The pillars The technology The dark side The secret potential Time to wake up!
  • 27. Our vision What dataClay does dataClay is a platform that enables – Apps to make objects and their relationships persistent – 3rd parties to add more data or “change” the data model – 3rd parties to upload computations to be shared – Each user to see different “views” of the data – Data owner to maintain control over its data – Efficient access to data Key technologies – Self-contained objects – Data enrichment by 3rd parties
  • 28. Key technology Self-contained objects Push the idea of data services to the limit – Based on the OO paradigm [Figure labels: Client App; Data service (Functions; Security, Integrity, …); Data store; self-contained object (Data; Functions; Security, ...)]
  • 29. Self-contained objects But, what is really new? Self-contained objects and data services – Same concept, different implementation? Then… – … we need something else … – … something to make it really flexible!
  • 30. 3rd-party enrichment What is it exactly? By enrichment we understand: – Adding new information (fields or data) to existing datasets – Adding new code to existing datasets • New methods • New implementations This enrichment should – Be possible during the life of data – Not be limited to the data owner – Enable different views of the data to different users/clients • Not everybody should see the same enrichments • Several enrichments should be available concurrently – Enable the avoidance of queries
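The enrichment requirements listed on this slide can be illustrated with a small Python sketch. It is not dataClay code: `EnrichedView`, its parameters and the `Person` fields are hypothetical names chosen for the illustration. The owner's object stays untouched, while each user gets a view that layers extra fields and methods on top of it, so different users can see different enrichments at the same time.

```python
import functools


class Person:
    """Original class published by the data owner (illustrative)."""

    def __init__(self, name, age):
        self.name = name
        self.age = age


class EnrichedView:
    """Hypothetical per-user view layering extra fields/methods on an object."""

    def __init__(self, target, extra_fields=None, extra_methods=None):
        self._target = target
        self._fields = dict(extra_fields or {})
        self._methods = dict(extra_methods or {})

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if name in self._fields:
            return self._fields[name]
        if name in self._methods:
            # Bind the view, so enrichment code sees original and added data.
            return functools.partial(self._methods[name], self)
        # Fall back to the owner's original object.
        return getattr(self._target, name)


alice = Person("Alice", 42)

# A third party adds a field and a method; only this view sees them.
view_a = EnrichedView(
    alice,
    extra_fields={"city": "Barcelona"},
    extra_methods={"is_older": lambda self, limit: self.age > limit},
)
# Another user keeps the plain, un-enriched view of the same object.
view_b = EnrichedView(alice)
```

Both views share the same underlying object, so no copy of the data is made when an enrichment is added.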
  • 31. 3rd-party enrichment And now animated Data can be enriched both with data and code Code will be executed in the provider infrastructure [Figure labels: Client App; Enrichment; Data-provider Infrastructure]
  • 32. Using a single infrastructure? Killing the bottleneck Using a “single” infrastructure may become a bottleneck Security and privacy policies should be part of the data – Thus, data could be offloaded to other infrastructures • Without breaking the data policies – Data owner enables 3rd-party enrichment and does not lose control How is it implemented? – Policies are defined using a declarative language – Policies are enforced as part of object methods
  • 33. Distributing objects Efficient usage of resources – Data and code can be offloaded • to resources not accessible by the data provider [Figure labels: self-contained object (Data; Security, ...; Functions); Provider Infrastructure; Client Infrastructure; Cloud]
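A minimal sketch of "policies enforced as part of object methods", written in plain Python rather than in dataClay's own declarative language (which the slides do not show). The `enforce` decorator, `PolicyError` and the dictionary-based policy format are assumptions made for this illustration. Because the policy is stored inside the object, a copy offloaded to another infrastructure still enforces it.

```python
import copy
import functools


class PolicyError(PermissionError):
    """Raised when a user is not allowed to perform an action (illustrative)."""


def enforce(action):
    """Hypothetical decorator: check the object's own policy before running."""
    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, user, *args, **kwargs):
            if user not in self.policy.get(action, set()):
                raise PolicyError(f"{user!r} may not {action}")
            return method(self, user, *args, **kwargs)
        return wrapper
    return decorator


class Record:
    def __init__(self, value, policy):
        self._value = value
        # Declarative policy: action name -> set of users allowed to do it.
        self.policy = policy

    @enforce("read")
    def read(self, user):
        return self._value

    @enforce("write")
    def write(self, user, value):
        self._value = value


record = Record(10, {"read": {"owner", "guest"}, "write": {"owner"}})

# "Offloading" is simulated with a deep copy: the policy travels with the data.
offloaded = copy.deepcopy(record)
```

In this sketch the only way to reach `_value` is through methods that consult the policy, which mirrors the idea that the owner keeps control even when the object lives elsewhere.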
  • 34. Agenda The motivation The pillars The technology The dark side The integration into the parallel programming language The secret potential Time to wake up!
  • 35. Task-based programming Task is the unit of work Data dependences between tasks – Imply partial order – Exhibit potential parallelism – Imply local synchronization • Not global! Implicit workflow
  • 36. COMPSs Sequential programming – General-purpose programming language + annotations • Currently Java and Python Task-based – Builds a task graph at runtime • Expresses potential concurrency • Includes dependencies – Simple linear address space Unaware of computing platform – Enabled by the runtime for clusters, clouds and grids
  • 37. Python (PyCOMPSs) syntax How to write PyCOMPSs code Invoke tasks – As functions/methods API for data synchronization Task definition in function declaration – decorators Function definition: @task( par = INOUT ) def myFunction(par): … Class definition: class Foo(object): @task() def myMethod(self): … Main program: foo = Foo() myFunction( foo ) foo.myMethod() … foo = compss_wait_on(foo) foo.bar() [Task-graph labels: myF, myM, synch]
  • 38. Parallel execution ... T1 (data1, out data2); T2 (data4, out data5); T3 (data2, data5, out data6); T4 (data7, out data8); T5 (data6, data8, out data9); ... [Task-graph figure: nodes T1_0, T2_0, T3_0, T4_0, T5_0]
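How a runtime could derive the task graph from the five calls on this slide can be sketched in a few lines of Python. This is an illustration of the idea, not the COMPSs Task Analyzer: it tracks only read-after-write dependencies, and the function names `build_task_graph` and `waves` are invented for the example.

```python
def build_task_graph(tasks):
    """tasks: list of (name, inputs, outputs) in program order.

    Returns {task name: set of task names it depends on}.
    """
    last_writer = {}  # datum -> task that produced it most recently
    deps = {}
    for name, inputs, outputs in tasks:
        deps[name] = {last_writer[d] for d in inputs if d in last_writer}
        for d in outputs:
            last_writer[d] = name
    return deps


def waves(deps):
    """Group tasks into waves; tasks in one wave do not depend on each other."""
    level = {}
    for name in deps:  # program order: producers appear before consumers
        level[name] = 1 + max((level[d] for d in deps[name]), default=0)
    grouped = {}
    for name, lv in level.items():
        grouped.setdefault(lv, []).append(name)
    return [grouped[lv] for lv in sorted(grouped)]


# The five tasks from the slide, as (name, inputs, outputs).
tasks = [
    ("T1", ["data1"], ["data2"]),
    ("T2", ["data4"], ["data5"]),
    ("T3", ["data2", "data5"], ["data6"]),
    ("T4", ["data7"], ["data8"]),
    ("T5", ["data6", "data8"], ["data9"]),
]

deps = build_task_graph(tasks)
schedule = waves(deps)
```

Under these assumptions T1, T2 and T4 are independent and can start together, T3 waits for T1 and T2, and T5 waits for T3 and T4.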
  • 39. COMPSs framework [Architecture figure labels: Application Code; COMPSs Runtime (Task Analyzer, DAG, Scheduler, Job Manager, Resource Manager, Data Info Provider); Computing Infrastructure (Workers, Persistent Worker)]
  • 40. ExecuteTask dataClay as a COMPSs worker Executes a method (possibly static) in a given backend – Acts as COMPSs worker threads As opposed to direct method execution – You can decide the execution backend executeTask – Asynchronous • Result can be checked by using getResult [Figure labels: Job Manager; Computing Infrastructure; Workers; Persistent Worker; dataClay]
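The asynchronous executeTask / getResult pattern can be imitated with Python futures. The slides give only the two operation names, so everything else here is an assumption: the `ToyRuntime` class, the snake_case method names, their signatures and the backend names are hypothetical, and thread pools stand in for dataClay backends.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyRuntime:
    """Hypothetical stand-in: one thread pool per named backend."""

    def __init__(self, backend_names):
        self._pools = {n: ThreadPoolExecutor(max_workers=1)
                       for n in backend_names}
        self.executed_on = []  # records which backend ran each task

    def execute_task(self, backend, method, *args):
        """Start `method(*args)` on the chosen backend; return a handle."""
        def run():
            self.executed_on.append(backend)
            return method(*args)
        return self._pools[backend].submit(run)

    def get_result(self, handle):
        """Block until the task behind `handle` has finished."""
        return handle.result()


runtime = ToyRuntime(["backend-1", "backend-2"])

# The caller picks the backend; both calls return immediately.
h1 = runtime.execute_task("backend-1", sum, [1, 2, 3])
h2 = runtime.execute_task("backend-2", str.upper, "dataclay")

results = (runtime.get_result(h1), runtime.get_result(h2))
```

The point of the pattern is that the caller, not the object's location, decides where the method runs, and that it only blocks when it actually needs the result.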
  • 41. A trivial example to follow Input: collection of persons – Person … Integer age … Boolean isOlder (limit, outCollection) { if (age>limit) add self into outCollection } Output: collection of persons older than a given age (limit)
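Written out in Python, the slide's pseudocode could look as follows. The `name` field and the sample data are assumptions added for the illustration; the slide only specifies `age` and `isOlder(limit, outCollection)`.

```python
class Person:
    def __init__(self, name, age):
        self.name = name  # assumed field, not on the slide
        self.age = age

    def is_older(self, limit, out_collection):
        """Add this person to out_collection if older than limit."""
        if self.age > limit:
            out_collection.append(self)


# Input: a collection of persons (sample data for the illustration).
people = [Person("Ana", 17), Person("Biel", 42), Person("Carla", 65)]

# Output: the persons older than the given limit.
older = []
for person in people:
    person.is_older(40, older)

names = [p.name for p in older]
```

This sequential version is the baseline that the following slides parallelise.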
  • 42. “Per object” parallelism COMPSs “instantiates” one worker per object – Iterates over a collection using a standard iterator • Instantiates the method in the node where the object is – Targeted at object methods – getLocations • Blocking may be needed – Object-method granularity may be too small – It implies grouping objects in the same backend [Figure: Task 1 … Task 6, one task per object]
  • 43. “Per object” parallelism Declare method isOlder as a parallel task Code For element in the collection // For each element // This method is executed in parallel // in the node where the data is element.isOlder(age) Parallelizing per element may yield tasks that are too small – Blocking
  • 44. “Per object” parallelism Create a new method isOlderBlocking(age,ini,num) For element between ini and ini+num element.isOlder(age) Code For i in (#elements in collection/block) // For each block // This method is executed in parallel collection.isOlderBlocking(age,i*block, block) Now we have the right granularity – The scientist needs to define the blocking size – And the placement, if locality is important!
  • 45. “Per backend” parallelism COMPSs “instantiates” one worker per backend – Obtains all locations of the collection using • getLocations – Each task executes a collection method • Iterates over a “local” iterator – Will only return objects in the current backend – Work stealing may be implemented if needed [Figure: Task 1 … Task 3, one task per backend]
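The blocking idea can be sketched with a thread pool standing in for COMPSs tasks. This is an illustration under assumptions, not COMPSs or dataClay code: `is_older` returns a boolean and each block returns its own matches (instead of appending to a shared output collection), and the helper names are invented. The loop also covers a final partial block, which a plain "#elements / block" count would miss when the sizes do not divide evenly.

```python
from concurrent.futures import ThreadPoolExecutor


class Person:
    def __init__(self, age):
        self.age = age

    def is_older(self, limit):
        # Variant of the slide's method that returns a boolean (assumption).
        return self.age > limit


def is_older_blocking(collection, limit, ini, num):
    """One task: check the elements in positions [ini, ini + num)."""
    return [p for p in collection[ini:ini + num] if p.is_older(limit)]


def older_than(collection, limit, block):
    """Run one task per block of `block` elements and merge the results."""
    starts = range(0, len(collection), block)  # includes a partial last block
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(is_older_blocking, collection, limit, s, block)
                   for s in starts]
        return [p for f in futures for p in f.result()], len(futures)


people = [Person(age) for age in range(0, 100, 10)]  # ages 0, 10, ..., 90
older, n_tasks = older_than(people, limit=45, block=4)
ages = [p.age for p in older]
```

With ten elements and a block size of four, three tasks are created instead of ten, which is the granularity trade-off the slide describes; the block size remains the programmer's choice.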
  • 46. “Per backend” parallelism Create a new collection method isOlderCollection(age) For element in collection using local iterator // No parallelism here element.isOlder(age) Define this method as “parallel” Code // Parallelism: executed in all backends with // elements isOlderCollection (age) Now we have the right granularity –Scientists did not have to write “special” code • Only encapsulated and used a “local” iterator
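A toy version of the per-backend scheme, again with a thread pool standing in for COMPSs tasks. `DistributedCollection`, its placement dictionary and the method signatures are hypothetical; only the names getLocations, isOlderCollection and the notion of a "local" iterator come from the slides.

```python
from concurrent.futures import ThreadPoolExecutor


class Person:
    def __init__(self, age):
        self.age = age

    def is_older(self, limit):
        return self.age > limit


class DistributedCollection:
    """Toy collection whose elements are partitioned across named backends."""

    def __init__(self, placement):
        self._placement = placement  # backend name -> list of Person

    def get_locations(self):
        return list(self._placement)

    def local_iterator(self, backend):
        # Only yields the objects stored in the given backend.
        return iter(self._placement[backend])

    def is_older_collection(self, limit, backend):
        # No parallelism here: a plain loop over the local elements.
        return [p for p in self.local_iterator(backend) if p.is_older(limit)]


def older_than(collection, limit):
    """Parallelism: one task per backend that holds elements."""
    backends = collection.get_locations()
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        parts = pool.map(
            lambda b: collection.is_older_collection(limit, b), backends)
        return [p for part in parts for p in part]


people = DistributedCollection({
    "backend-1": [Person(30), Person(70)],
    "backend-2": [Person(50)],
    "backend-3": [Person(10)],
})
ages = sorted(p.age for p in older_than(people, limit=40))
```

The user-level method is an ordinary sequential loop; the granularity comes from the data placement rather than from a block size chosen by the programmer.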
  • 47. “Other” iterators These are just examples, other iterators could be defined – To implement locality (e.g., prefer a close backend) – To implement work stealing – To take heterogeneity into account The iterators are implemented generically in the collection – Scientists only need to understand what they do • And use them
  • 48. Agenda The motivation The pillars The technology The dark side The integration into the parallel programming language The secret potential Conclusions Time to wake up!
  • 49. Conclusions Ideas to take back home Integrating persistent data into the programming model – Unifies the model for both persistent and volatile data – Simplifies the decision of where to compute • Code is part of the data – Enables the use of data parallelism • Iterators can be adapted transparently to the programmer – Enables data distribution • Behavior policies are embedded
  • 50. I talk, they do the work Thanks to … Current team – Anna Queralt – Jonathan Martí – Daniel Gasull – Juanjo Costa – Alex Barceló Master students – David Gracia – Christos Ioannidis Former team members – Ernest Artiaga