Ian Foster
Argonne National Laboratory and University of Chicago
foster@anl.gov
ianfoster.org
Taming Big Data!
Publish
results
Collect
data
Design
experiment
Test
hypothesis
Hypothesize
explanation
Identify
patterns
Analyze
data
Discovery is an iterative process
Pose
question
Janet Rowley, 1972
Publish
results
Collect
data
Design
experiment
Test
hypothesis
Hypothesize
explanation
Identify
patterns
Analyze
data
Discovery in the big data era:
Resource-intensive, expensive, slow
Pose
question
Three big data challenges
Channel massive flows
Automate management
Build discovery engines
Channel massive data flows
Data must move to be useful. We may optimize,
but we can never entirely eliminate distance.
• Sources: experimental facilities,
sensors, computations
• Sinks: analysis computers,
display systems
• Stores: impedance
matchers & time shifters
• Pipes: IO systems and
networks connect other elements
“We must think of data as a flowing river over time, not a static
snapshot. Make copies, share, and do magic” – S. Madhavan
Store
Transfer is challenging at many levels
Speed and reliability
• GridFTP protocol
• Globus implementation
Scheduling and modeling
• SEAL and STEAL algorithms
• RAMSES project
[Diagram: source data store, wide area network, destination data store]
File transfer is an end-to-end problem
[Diagram: the end-to-end transfer path. Source and destination data transfer nodes (each with application, OS, FS stack, HBA/HCA, TCP, IP, NIC) connect through LAN switches and routers to the wide area network; storage is a storage array and a Lustre file system (OSS, MDS, OST, MDT).]
+ diverse environments
+ diverse workloads
+ contention
File transfer is an end-to-end problem
GridFTP protocol and implementations:
Fast, reliable, secure 3rd-party data transfer
Extend legacy FTP protocol to enhance performance, reliability, security
Globus GridFTP provides a widely-used open source implementation.
Modular, pluggable architecture (different protocols, I/O interfaces).
Many optimizations: e.g., concurrency, parallelism, pipelining.
[Diagram: data transfer nodes at Site A and Site B run GridFTP server processes backed by a parallel file system. With parallelism = 3 and concurrency = 2, six TCP connections link the sites.]
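How concurrency and parallelism combine can be sketched in a few lines. This is an illustration only, not GridFTP code: it assumes concurrency means the number of files in flight at once and parallelism means the number of TCP streams per in-flight file. The names `Stream` and `plan_streams` and the file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stream:
    """One TCP data connection carrying part of one file."""
    file_name: str
    stream_index: int


def plan_streams(files, concurrency, parallelism):
    """Return the TCP streams open while the first batch of files moves.

    concurrency: number of files in flight at once (assumed meaning).
    parallelism: number of TCP streams used per in-flight file (assumed meaning).
    """
    in_flight = files[:concurrency]
    return [
        Stream(name, index)
        for name in in_flight
        for index in range(parallelism)
    ]


# Hypothetical file names; settings match the slide (concurrency 2, parallelism 3).
streams = plan_streams(["a.h5", "b.h5", "c.h5"], concurrency=2, parallelism=3)
print(len(streams))  # 6 connections: 2 files in flight x 3 streams each
```

Under these assumptions the total number of open data connections is the product of the two settings, which is why raising either one increases load on both endpoints.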
85 Gbps sustained disk-to-disk over 100
Gbps network, Ottawa—New Orleans
Raj Kettimuthu and team, Argonne, Nov 2014
Higgs discovery “only possible because
of the extraordinary achievements of …
grid computing”—Rolf Heuer, CERN DG
10s of PB, 100s of institutions, 1000s of
scientists, 100Ks of CPUs, Bs of tasks
One Advanced Photon Source data node: 125 destinations
Same node (1 Gbps link)
Transfer scheduling and optimization
• Science data traffic is
extremely bursty
• User experience can be
improved by scheduling to
minimize slowdown
• Traffic can be categorized:
interactive or batch
• Increased concurrency
tends to increase aggregate
throughput, to a point
Concurrency over 24 hours. Kettimuthu et al., 2015
Throughput vs. concurrency & parallelism. Kettimuthu et al., 2014
A load-aware, adaptive algorithm:
(1) Data-driven model of throughput
EP2
EP3
EP4
EP1
Collect many <s, d, cs, cd, v, a> data
E.g., <EP1, EP3, 3, 3, 20GB, 29sec>
Estimate throughput(s, d, cs, cd, v)
Adjust with estimate of external load
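A minimal sketch of such a data-driven model follows, under several assumptions the slide does not state: `a` is taken to be the achieved transfer time in seconds, `cs` and `cd` the concurrency at source and destination, and the model is a plain average of observed throughput per (s, d, cs, cd) key. For brevity the sketch ignores volume `v` at prediction time. The `external_load` fraction and both function names are hypothetical.

```python
from collections import defaultdict

GB = 1e9  # bytes


def fit_throughput_model(records):
    """Average observed throughput (bytes/s) per (s, d, cs, cd) key.

    Each record is (s, d, cs, cd, v, a); v is the volume in bytes and
    a is assumed to be the achieved transfer time in seconds.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for s, d, cs, cd, v, a in records:
        entry = totals[(s, d, cs, cd)]
        entry[0] += v / a
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def estimate_throughput(model, s, d, cs, cd, external_load=0.0):
    """Look up modelled throughput and scale it down for external load.

    external_load is a hypothetical fraction in [0, 1) of capacity
    consumed by traffic the scheduler does not control.
    """
    return model[(s, d, cs, cd)] * (1.0 - external_load)


# The first record is the slide's example; the second is made up.
records = [
    ("EP1", "EP3", 3, 3, 20 * GB, 29.0),
    ("EP1", "EP3", 3, 3, 20 * GB, 31.0),
]
model = fit_throughput_model(records)
est = estimate_throughput(model, "EP1", "EP3", 3, 3, external_load=0.25)
print(f"{est / GB:.3f} GB/s")
```

A real model would presumably interpolate across concurrency levels and volumes rather than require an exact key match; the lookup table is only the simplest form that fits the tuple layout shown.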
A load-aware, adaptive algorithm:
(2) Concurrency-constrained scheduling
Define transfer priority:
Schedule transfers if neither source nor destination
is saturated, using the model to decide concurrency
If source or destination is saturated, interrupt active
transfer(s) to service waiting requests, if doing so
can reduce overall average slowdown
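The first scheduling rule can be sketched as follows. The slide does not give the priority formula, so this sketch assumes priority is the request's current slowdown, defined here as (wait time + ideal run time) / ideal run time. Saturation is modelled as a fixed per-endpoint concurrency cap rather than a throughput model, and the preemption rule is left out. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    source: str
    destination: str
    waited_s: float  # time spent queued so far
    ideal_s: float   # estimated run time on an unloaded system

    def slowdown(self):
        """Assumed priority: total time so far relative to ideal run time."""
        return (self.waited_s + self.ideal_s) / self.ideal_s


def schedule(waiting, active_per_endpoint, max_concurrency):
    """Start waiting requests whose endpoints both have spare concurrency.

    Requests are considered in order of decreasing slowdown, so the
    transfers that have suffered most are served first.
    """
    started = []
    for req in sorted(waiting, key=Request.slowdown, reverse=True):
        src_busy = active_per_endpoint.get(req.source, 0)
        dst_busy = active_per_endpoint.get(req.destination, 0)
        if src_busy < max_concurrency and dst_busy < max_concurrency:
            active_per_endpoint[req.source] = src_busy + 1
            active_per_endpoint[req.destination] = dst_busy + 1
            started.append(req.name)
    return started


waiting = [
    Request("t1", "EP1", "EP3", waited_s=30, ideal_s=10),
    Request("t2", "EP1", "EP2", waited_s=5, ideal_s=50),
    Request("t3", "EP2", "EP4", waited_s=100, ideal_s=20),
]
active = {"EP1": 1}  # EP1 already runs one transfer
started = schedule(waiting, active, max_concurrency=2)
print(started)  # ['t3', 't1']: t2 is held back because EP1 is then full
```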
Gagan Agarwal1* Prasanna Balaprakash2 Ian Foster2* Raj Kettimuthu2
Sven Leyffer2 Vitali Morozov2 Todd Munson2 Nagi Rao3*
Saday Sadayappan1 Brad Settlemyer3 Brian Tierney4* Don Towsley5*
Venkat Vishwanath2 Yao Zhang2
1 Ohio State University 2 Argonne National Laboratory
3 Oak Ridge National Laboratory 4 ESnet 5 UMass Amherst (* Co-PIs)
Advanced Scientific Computing Research
Program manager: Rich Carlson
How to create more accurate, useful, and
portable models of distributed systems?
Simple analytical model:
T = α + β·l
[startup cost + sustained bandwidth]
Experiment + regression
to estimate α, β
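The regression step can be sketched with ordinary least squares. This assumes `l` is the transfer size in bytes, so α is a startup cost in seconds and β is seconds per byte (the inverse of sustained bandwidth). The data below are synthetic and the function name is hypothetical.

```python
def fit_alpha_beta(sizes, times):
    """Ordinary least-squares fit of T = alpha + beta * l."""
    n = len(sizes)
    mean_l = sum(sizes) / n
    mean_t = sum(times) / n
    sxx = sum((l - mean_l) ** 2 for l in sizes)
    sxy = sum((l - mean_l) * (t - mean_t) for l, t in zip(sizes, times))
    beta = sxy / sxx
    alpha = mean_t - beta * mean_l
    return alpha, beta


# Synthetic measurements generated with alpha = 2 s and beta = 1 ns per byte.
sizes = [1e9, 2e9, 4e9, 8e9]
times = [2.0 + 1e-9 * l for l in sizes]

alpha, beta = fit_alpha_beta(sizes, times)
print(f"startup cost = {alpha:.2f} s, bandwidth = {1 / beta / 1e9:.2f} GB/s")
```

With real measurements the fit will not be exact, and the residuals indicate how much behavior the two-parameter model fails to capture.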
First-principles modeling
to better capture details
of system & application
components
Data-driven modeling to
learn unknown details of
system & application
components
Model
composition
Model, data
comparison
Differential regression for combining
data from different sources
Example of use: Predict performance on connection length L
not realizable on physical infrastructure
E.g., IB-RDMA or HTCP throughput on 900-mile connection
1) Make multiple measurements of performance on path lengths d:
– $M_S(d)$: OPNET simulation
– $M_E(d)$: ANUE-emulated path
– $M_U(d_i)$: Real network (USN)
2) Compute measurement regressions on d: $\dot{M}_A(\cdot)$, $A \in \{S, E, U\}$
3) Compute differential regressions: $\Delta\dot{M}_{A,B}(\cdot) = \dot{M}_A(\cdot) - \dot{M}_B(\cdot)$, $A, B \in \{S, E, U\}$
4) Apply differential regression to obtain estimates, $C \in \{S, E\}$:
$\mathcal{M}_U(d) = M_C(d) - \Delta\dot{M}_{C,U}(d)$
where $M_C(d)$ is the simulated/emulated measurement and $\Delta\dot{M}_{C,U}(d)$ is the point regression estimate
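The four steps can be sketched as below. The slide does not say what form the regressions take, so straight-line least-squares fits are assumed. The throughput values are synthetic and the helper name `linear_fit` is hypothetical.

```python
def linear_fit(xs, ys):
    """Least-squares line through (xs, ys), returned as a callable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda d: intercept + slope * d


# Step 1: synthetic throughput (Gbps) keyed by path length in miles.
# The simulator (S) can reach 900 miles; the real network (U) cannot.
sim = {100: 9.6, 300: 8.8, 500: 8.0, 900: 6.4}
real = {100: 8.5, 300: 7.5, 500: 6.5}

# Step 2: measurement regressions for S and U.
fit_s = linear_fit(list(sim), list(sim.values()))
fit_u = linear_fit(list(real), list(real.values()))

# Step 3: differential regression between S and U at the target length.
target = 900
delta_s_u = fit_s(target) - fit_u(target)

# Step 4: correct the simulated measurement to estimate the real network.
estimate = sim[target] - delta_s_u
print(f"estimated real-network throughput at {target} miles: {estimate:.2f} Gbps")
```

The same pattern would apply with the emulated path (E) in place of the simulation, by swapping in emulated measurements and their regression.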
Source LAN
profile
WAN
profile
Destination LAN
profile
Configuration for
host and edge
devices
Configuration
for WAN
devices
Configuration for
host and edge
devices
composition
operations
End-to-end profile composition
Three big data challenges
Channel massive flows
Automate management
Build discovery engines
Registry
Staging
Store
Ingest
Store
Analysis
Store
Community
Store
Archive Mirror
Ingest
Store
Analysis
Store
Community
Store
Archive Mirror
Registry
Quota
exceeded
!
Expired
credentials
!
Network
failed. Retry.
!
Permission
denied
!
It should be trivial to Collect, Move, Sync, Share, Analyze,
Annotate, Publish, Search, Backup, & Archive BIG DATA
… but in reality it’s often very challenging
One researcher’s perspective
on data management challenges
Tripit exemplifies process automation
Me
Book flights
Book hotel
Record flights
Suggest hotel
Record hotel
Get weather
Prepare maps
Share info
Monitor prices
Monitor flight
Other services
How the “business cloud” works
Platform
services
Database, analytics, application, deployment, workflow, queuing
Auto-scaling, Domain Name Service, content distribution
Elastic MapReduce, streaming data analytics
Email, messaging, transcoding. Many more.
Infrastructure
services
Computing, storage, networking
Elastic capacity
Multiple availability zones
Process automation for science
Run experiment
Collect data
Move data
Check data
Annotate data
Share data
Find similar data
Link to literature
Analyze data
Publish data
Automate
and
outsource:
the
Discovery
cloud
Analysis
Staging Ingest
Community
Repository
Archive Mirror
Registry
Next-gen
genome
sequencer
Telescope
In millions of labs worldwide,
researchers struggle with massive
data, advanced software, complex
protocols, burdensome reporting
Globus research data
management services
www.globus.org
Simulation
Reliable, secure, high-performance file
transfer and synchronization
“Fire-and-forget”
transfers
Automatic fault
recovery
Seamless security
integration
Powerful GUI
and APIs
Data
Source
Data
Destination
User initiates
transfer
request
1
Globus
moves and
syncs files
2
Globus
notifies user
3
Data
Source
User A selects
file(s) to share,
selects user or
group, and sets
permissions
1
Globus tracks shared
files; no need to
move files to cloud
storage!
2
User B logs in
to Globus and
accesses
shared file
3
Easily share large
data with any user or
group
No cloud storage
required
Extreme ease of use
• InCommon, OAuth, OpenID, X.509, …
• Credential management
• Group definition and management
• Transfer management and optimization
• Reliability via transfer retries
• Web interface, REST API, command line
• One-click “Globus Connect Personal” install
• 5-minute Globus Connect Server install
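"Reliability via transfer retries" can be illustrated with a generic retry loop. This sketch is not Globus code and makes no claim about how the Globus service implements retries. The `TransientError` class, the exponential backoff policy, and every name below are assumptions made for illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, such as a network drop."""


def transfer_with_retries(attempt_transfer, max_attempts=5,
                          base_delay_s=1.0, sleep=time.sleep):
    """Call attempt_transfer until it succeeds or attempts run out.

    Waits base_delay_s, then twice that, and so on between attempts.
    The final failure is re-raised so the caller can notify the user.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt_transfer()
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))


# Usage: a stand-in transfer that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("network failed")
    return "done"


delays = []
result = transfer_with_retries(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)  # done [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the example fast to run and test; a "fire-and-forget" service would run a loop like this on the user's behalf and report only the final outcome.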
High-speed transfers to/from AWS cloud,
via Globus transfer service
• UChicago → AWS S3 (US region): Sustained 2 Gbps
– 2 GridFTP servers, GPFS file system at UChicago
– Multi-part upload via 16 concurrent HTTP connections
• AWS → AWS (same region): Sustained 5 Gbps
go#s3
Globus transfer & sharing; identity & group
management, data discovery & publication
25,000 users, 75 PB and 3B files transferred, 8,000 endpoints
Globus endpoints
Identity, group, profile
management services
…
Sharing service
Transfer service
Globus Toolkit
GlobusConnect
X
Identity, group, profile
management services
Sharing service
Transfer service
Globus Toolkit
GlobusConnect
Publication and discovery
X
Identity, group, profile
management services
Sharing service
Transfer service
Globus Toolkit
GlobusAPIs
GlobusConnect
Publication and discovery
X
The Globus Galaxies platform:
Science as a service
Globus
Galaxies
platform
Tool and workflow execution,
publication, discovery, sharing;
identity management; data
management; task scheduling
Infrastructure services
EC2, EBS, S3, SNS,
Spot, Route 53,
Cloud Formation
Ematter materials science
FACE-IT
PDACS
Three big data challenges
Channel massive flows
Automate management
Build discovery engines
Discovery engines: Integrate simulation,
experiment, and informatics
Informatics
Analysis
Tools
High-throughput
Experiments
Problem
Specification
Modeling and
Simulation
Analysis &
Visualization
Experimental
Design
Analysis &
Visualization
Integrated
Databases
metagenomics.anl.gov
A discovery engine for metagenomics
kbase.us
DOE Systems Biology Knowledge Base (KBase)
Source: Rick Stevens
A discovery engine
for the study of disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne
Sample
Experimental scattering
Material
composition
Simulated
structure
Simulated
scattering
La 60%
Sr 40%
Detect errors
(secs—mins)
Knowledge base
Past experiments;
simulations; literature;
expert knowledge
Select experiments
(mins—hours)
Contribute to knowledge base
Simulations driven by
experiments (mins—days)
Knowledge-driven
decision making
Evolutionary optimization
Immediate assessment of alignment quality in
near-field high-energy diffraction microscopy
5
Blue Gene/Q
Orthros
(All data in NFS)
3: Generate
Parameters
FOP.c
50 tasks
25s/task
¼ CPU hours
Uses Swift/K
Dataset
360 files
4 GB total
1: Median calc
75s (90% I/O)
MedianImage.c
Uses Swift/K
2: Peak Search
15s per file
ImageProcessing.c
Uses Swift/K
Reduced
Dataset
360 files
5 MB total
feedback to experiment
Detector
4: Analysis Pass
FitOrientation.c
60s/task (PC)
1667 CPU hours
60s/task (BG/Q)
1667 CPU hours
Uses Swift/T
GO Transfer
Up to
2.2 M CPU hours
per week!
ssh
Globus Catalog
Scientific Metadata
Workflow Progress
Workflow Control
Script
Bash
Manual
This is a
single
workflow
3: Convert bin L
to N
2 min for all files,
convert files to
Network Endian
format
Before
After
Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer
Integrate data movement, management, workflow,
and computation to accelerate data-driven
applications
New data, computational capabilities, and
methods create opportunities and challenges
Integrate statistics/machine learning to assess
many models and calibrate them against “all”
relevant data
New computer facilities enable on-demand
computing and high-speed analysis of large
quantities of data
Three big data challenges
Channel massive flows
– New protocols and
management algorithms
Automate management
– The Discovery Cloud
Build discovery engines
– MG-RAST, KBase, Materials
U.S. Department of Energy
