Unraveling mysteries of the Universe at
CERN, with OpenStack and Hadoop
There and back again
/Piotr Turek @rekurencja
Where does the story begin?
How did it all come to be?
How will it end?
What is the fundamental structure
of space and time?
"Somewhere, something incredible is
waiting to be known"
― Carl Sagan
Does size matter?
Not always ;)
Particle Physics is born
1954
A lot of...
2009
Can you see him now?
4 stories high
14,000 tons
+100 m underground
Let's smash some hadrons!
Proton Proton
One small bottle for a man...
Mind-boggling fact: one bottle can last for
many months
0.2 nanogram / day
~ 2 red blood cells / day
Accelerating Science
0.999999991 c
Mind-boggling facts continued...
10 km/h slower than light
-271.3°C (1.9 K): ~the coldest place in the Universe
Total kinetic energy of a train
Beam 1
Beam 2
Eventually...
The trains collide and data starts to pour in
lots of data ;)
~600 million times a second
How much data is too much?
1 MB
* 1,000,000,000 events / s
* 3600 s * 15 (hours)
* 100 days
=
1 petabyte / s
* 54,000 s
* 100 days = ...
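The arithmetic on this slide can be checked directly; a throwaway sketch, using the slide's own round numbers (1 MB per event, ~15 beam-hours per day):

```python
# Back-of-the-envelope: raw LHC data rate before any filtering.
event_size = 1_000_000             # ~1 MB per event, in bytes
events_per_second = 1_000_000_000  # ~10^9 events / s
seconds_per_day = 3600 * 15        # ~15 hours of beam = 54,000 s
run_days = 100

raw_bytes = event_size * events_per_second * seconds_per_day * run_days
print(f"{raw_bytes / 1e21:.1f} zettabytes")  # far beyond what anyone can store
```

Hence the need for aggressive filtering before anything touches storage.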
Noise. Your data is
probably full of it
If in doubt, filter it out
The LHC Trigger System
Custom-built
hardware (FPGAs)
1GHz (interaction rate)
<100kHz
40MHz resolution
The LHC Trigger System (2)
<100kHz
Software filtering and
reconstruction
~200 events / s
petabytes / year
Reconstructed Event
Let's do something
useful with data
a.k.a. offline analysis
Let's assume...
I'm a physicist @ the TOTEM experiment
What is my typical use-case
working with the data?
Typical use-case
Out of all data from run X,
give me only specific events that
fulfill certain criteria.
I will analyse
this sample manually
Rinse and Repeat
Remember this slide?
The old way
1. Write custom
scripts
2. Submit a job
to lxbatch
3. Select only events that
satisfy criteria
Files hundreds MB to
many GB each
Filtered sample
Up to a couple TB involved
Problem #1: latency variability
Camel Distribution
warning: xkcd graphs ;)
Problem #1: latency variability
Two-tier storage
1. Disk-based hot store
2. Tape-based cold store
Problem #1: latency variability
Job 1
1. CASTOR, give me files 1..100
2. Downloading file 1 from disk
3. ...
4. Downloading file 99 from disk
5. Bad luck, file 100 is on a tape
6. File 100 loaded onto a disk
7. Downloading file 100
Job 2
1. CASTOR, give me files 1..100
2. ...
3. ...
4. ...
5. Bad luck, file 100 is on a tape,
again
Sometime later...
Problem #2: work distribution
CASTOR
20 files
(different sizes)
40 workers available
sub-optimal performance
poor resource utilization
Problem #3: failure intolerance
Batch jobs like to fail, and when
they do ...
... it's completely up to you
Problem #4: the funny one
ROOT Data Analysis Framework
“ A cornerstone of High Energy Physics
software
Problem #4: the funny one
15 years of development
1,762,865 lines of code
46,308 commits
Object-oriented libraries for:
data analysis
statistics
visualization
simulation
reconstruction
event display
DAQ
C++ Interpreter Suite
CINT - the interpreter
close enough to standard
C++ extensions
rich RTTI
some syntactic sugar
ACLiC - automatic compiler
Why it's the best idea ever
the command language,
the scripting language,
the programming language
are all C++
Feature rich
Extremely performant
Specialized
storage formats
Why it's the worst idea ever
the command language,
the scripting language,
the programming language
are all C++
"C makes it easy to shoot yourself in the foot;
C++ makes it harder, but when you do, it
blows away your whole leg."
― Bjarne Stroustrup
"Especially when you use it as an interpreted
language with reflection."
― Captain Obvious
Key assumptions of happy analysis
1. Load once, analyze many times
2. Optimal granularity of jobs
3. Scalable
4. Little network and I/O overhead
5. Failure tolerant
6. Takes care of 2-5 automatically
7. Requires me to write less code
The new way
Create a cluster of machines (single click)
Request files from CASTOR to be loaded onto the
analysis cluster
System automatically loads and distributes the files
CASTOR
20 files
(different sizes)
20 workers evenly sized chunks
(small)
EOS
The new way
Request files stored on the cluster to be processed
Declare selection logic
System automatically processes the files
Rinse and Repeat
paths to files on
CASTOR
Overview of architecture
Building blocks: Hadoop 2
“ Apache Hadoop is an open-source software
framework for storage and large-scale processing of
data-sets on clusters of commodity hardware.
fault tolerant
scalable
designed for data locality
HDFS: A distributed file system
that provides high-throughput
access to application data.
MapReduce: A YARN-based
system for parallel processing
of large data sets.
Overview of architecture
Building blocks: OpenStack
“ OpenStack is a cloud operating system that controls
large pools of compute, storage, and networking
resources throughout a datacenter, all managed through
a dashboard
OpenStack vs AWS
Amazon EC2 vs Nova
Amazon S3 vs Swift
Elastic Block Storage vs Cinder
Amazon VPC vs Neutron
Amazon CloudWatch vs Ceilometer
Elastic MapReduce vs Sahara
AWS Console vs Horizon
Overview of architecture
Building blocks: Sahara
“ Sahara is an OpenStack data processing plugin,
which provides a simple means to provision a Hadoop
cluster on top of OpenStack.
template, launch, manage Hadoop clusters
with a single click (or a command)
add / remove nodes
submit, execute and track Hadoop jobs
Building blocks: Sahara
Are we done with
the infrastructure
then? NO
Challenge: CERN's OpenStack has
no Sahara
How do you use Sahara on an
OpenStack that ...
... does not support Sahara?
Solution: You need to go deeper
Separate Horizon on your host
Sahara with changes to authentication
Lesson learned:
OpenStack is flexible
has a nice Python code base
<3
clean APIs
easy to jump into
A riddle: What is it?
It's a hipster
Challenge: Hadoop on exotic distro
Solution: Prepare your own images
<template>
<name>SLC6 Sahara Icehouse CERN Server - x86_64</name>
<description>SLC6 Server with Cern-additions: AFS, Kerberos, user accounts, ... and
<os>
<name>SLC-6</name>
<version>5</version>
<arch>x86_64</arch>
<install type='iso'>
<iso>http://linuxsoft.cern.ch/cern/slc65/iso/SLC_6.5_x86_64_dvd.iso</iso>
</install>
</os>
<packages>
<package name='virt-what'/>
(...)
</packages>
<files>
<file name='/etc/init.d/firstboot_diskresize' type='raw'>
#!/bin/sh
(...)
</file>
(...)
</files>
<commands>
<command name='time-sync'>
# set up cron job to synchronize time
(...)
</command>
(...)
</commands>
</template>
Used CERN image builders
The Oz tool is cool ->
Upload to Glance
Your Oz customization file may look like this:
Lesson learned:
Debugging VM images
is difficult
Challenge: Cluster provisioning fails often
You : Sahara, give me 20 machines
Sahara : Nova, launch machine no1
Sahara : Nova, launch machine no2
...
Sahara : I'm waiting for all to be
Active, before configuring
Sahara : 6 failed, rolling back all!
You : Oh, for God's sake!
...
Sahara : I'm waiting for all to be
Active, before configuring them
... waits forever
Or even worse!
Solution: First try
Modified Direct Engine:
timeout for launching machines
simple retries for failed machines
removes completely failed machines
...
Sahara: Cluster provisioned.
Machines requested: 20. Machines
succeeded: 5
You: What the...
Solution: Exponential Backoff
Sleeping delay is a randomized, exponential
function of retry count
...
Sahara: Cluster provisioned.
Machines requested: 20. Machines
succeeded: 18
You: Thanks!
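The retry delay described above can be sketched in a few lines; this is an illustrative shape, not the actual Sahara patch (the function name and defaults are made up):

```python
import random

def backoff_delay(retry_count, base=1.0, cap=60.0):
    """Randomized exponential backoff: the sleep window grows as
    2**retry_count (capped), with jitter so that retried machine
    launches don't hit Nova in lockstep."""
    return random.uniform(0, min(cap, base * 2 ** retry_count))

# sampled delays for successive retries: roughly 0-1s, 0-2s, 0-4s, ...
delays = [backoff_delay(n) for n in range(5)]
```

The jitter matters as much as the exponent: without it, every failed machine retries at the same instant and the provisioning service gets hammered again.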
Lesson learned:
Be nice to systems you
depend on ...
... They will thank you with a 200
How to load the data using Hadoop
CASTOR
20 files
(different sizes)
20 workers evenly sized chunks
(small)
EOS
Map tasks
HDFS
We need a map-only job
How to load the data using Hadoop
CASTOR
EOS
path 1
path 2
path 3
...
path 1
C++
path 1
path 2
File 1
Map task 1
Map task 2
Map task 3
path 3
size = HDFS
block
TTree: the Apache Parquet of HEP
Events
row per event
Row oriented
Column oriented
Memory layouts
Compression unit per column
Read only the data you need
Much harder to partition evenly ;)
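The row-vs-column difference is easy to see in miniature; a hypothetical sketch where plain Python dicts and lists stand in for TTree/Parquet storage (the event fields are made up):

```python
# Row-oriented: one record per event, all fields stored together;
# reading any one field drags the whole event through I/O.
rows = [
    {"event_id": 1, "energy": 13.2, "n_tracks": 12},
    {"event_id": 2, "energy": 7.5,  "n_tracks": 4},
]

# Column-oriented: one array per field, each compressed independently;
# a cut on "energy" never touches the other columns.
columns = {
    "event_id": [1, 2],
    "energy":   [13.2, 7.5],
    "n_tracks": [12, 4],
}

energies = columns["energy"]  # reads exactly one column
```

The flip side, as the slide notes, is that splitting a columnar file into evenly sized chunks is much harder than cutting a stream of rows.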
Lesson learned:
Columnar storage formats
are great ...
... give Apache Parquet a try
How to filter the data using Hadoop
paths to files on
CASTOR
Map tasks
We need a map-only job*
How to filter the data using Hadoop
path 1
path 2
path 3
...
C++
Map task 1
Map task 2
Map task N
(...)
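Stripped of the Hadoop machinery, a map-only filter is just a predicate applied independently to each task's split; a hypothetical sketch (the event shape and cut are made up):

```python
def selected(event):
    # stand-in selection criteria, e.g. keep events with >= 2 tracks
    return event["n_tracks"] >= 2

def map_filter(events):
    # map-only: each task filters its own split and emits the survivors;
    # with no reduce phase, output goes straight back to HDFS
    return [e for e in events if selected(e)]

sample = [{"n_tracks": 1}, {"n_tracks": 3}, {"n_tracks": 7}]
kept = map_filter(sample)
```

Because no shuffle or reduce is needed, the job parallelizes trivially across however many map tasks the input splits allow.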
Challenge: It works too fast
SELECT two columns out of 100 ...
WHERE "complex criteria"
SELECT * ... WHERE "1=1"
Map task takes ~6s
Map task takes ~80s
Execution time depends on:
amount of data read
amount of data produced
cpu-heaviness of selection criteria
Increase the HDFS block size?
Solution: Optimize each query
Split the job in two:
Learning phaseLearning phase
1. Select a small sample of input
2. Run the job
3. Calculate avg time of map-task
r_heaviness = t_requested / t_avg
Mature phaseMature phase
Use CombineWholeFileInputFormat
maxInputSplitSize = r_heaviness ∗ blockSize
Result: Filtering up to 100 times faster
than loading
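The two-phase sizing above can be sketched as follows; the function names are illustrative (in the real job the result feeds CombineWholeFileInputFormat's maxInputSplitSize):

```python
def heaviness_ratio(t_requested, t_avg):
    # learning phase: how many times longer each map task should run
    # than the average measured on the small sample of input
    return t_requested / t_avg

def max_input_split_size(r_heaviness, block_size):
    # mature phase: hand each map task r_heaviness blocks' worth of input
    return int(r_heaviness * block_size)

r = heaviness_ratio(30.0, 6.0)                    # target 30s tasks, 6s measured
split = max_input_split_size(r, 128 * 1024 ** 2)  # 5 x 128 MB blocks per task
```

Light queries thus get fed several blocks per task, while CPU-heavy queries keep splits near a single block.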
Lesson learned:
Hadoop is not a low-latency
framework
... make your tasks heavier than 30s
Did it make any sense in the end?
YES
Much more performant
Much more scalable
Little to no code required
... but:
Some parts missing
change comes slowly
resources as well
What is the moral of
this story?
There are stories to tell,
go create them.
Thank You
turu-on-things.com
@rekurencja