Unraveling mysteries of the Universe at
CERN, with OpenStack and Hadoop
There and back again
/Piotr Turek @rekurencja
Where does the story begin?
How did it all come to be?
How will it end?
What is the fundamental structure
of space and time?
"Somewhere, something incredible is
waiting to be known"
― Carl Sagan
Does size matter?
Not always ;)
Particle Physics is born
1954
A lot of...
2009
Can you see him now?
4 stories high
14,000 tons
+100 m underground
Let's smash some hadrons!
Proton Proton
One small bottle for a man...
Mind-boggling fact: one bottle can last for
many months
0.2 nanogram / day
~ 2 red blood cells / day
Accelerating Science
0.999999991 c
Mind-boggling facts continued...
10 km/h slower than light
-271.3°C (1.9 K): ~the coldest place in the Universe
Total kinetic energy of a train
Beam 1
Beam 2
Eventually...
The trains collide and data starts to pour in
lots of data ;)
~600 million times a second
How much data is too much?
1 MB
* 1,000,000,000 events / s
* 3600 s * 15 (hours)
* 100 days
=
1 petabyte / s
* 54,000 s
* 100 days = ...
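The arithmetic on this slide can be checked directly; a throwaway sketch, using the slide's own round numbers (1 MB per event, ~15 beam-hours per day):

```python
# Back-of-the-envelope: raw LHC data rate before any filtering.
event_size = 1_000_000             # ~1 MB per event, in bytes
events_per_second = 1_000_000_000  # ~10^9 events / s
seconds_per_day = 3600 * 15        # ~15 hours of beam = 54,000 s
run_days = 100

raw_bytes = event_size * events_per_second * seconds_per_day * run_days
print(f"{raw_bytes / 1e21:.1f} zettabytes")  # far beyond what anyone can store
```

Hence the need for aggressive filtering before anything touches storage.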
Noise. Your data is
probably full of it
If in doubt, filter it out
The LHC Trigger System
Custom-built
hardware (FPGAs)
1GHz (interaction rate)
<100kHz
40MHz resolution
The LHC Trigger System (2)
<100kHz
Software filtering and
reconstruction
~200 events / s
petabytes / year
Reconstructed Event
Let's do something
useful with data
a.k.a. offline analysis
Let's assume...
I'm a physicist @ the TOTEM experiment
What is my typical use-case
working with the data?
Typical use-case
Out of all data from run X,
give me only specific events that
fulfill certain criteria.
I will analyse
this sample manually
Rinse and Repeat
Remember this slide?
The old way
1. Write custom
scripts
2. Submit a job
to lxbatch
3. Select only events that
satisfy criteria
Files hundreds MB to
many GB each
Filtered sample
Up to a couple TB involved
Problem #1: latency variability
Camel Distribution
warning: xkcd graphs ;)
Problem #1: latency variability
Two-tier storage
1. Disk-based hot store
2. Tape-based cold store
Problem #1: latency variability
Job 1
1. CASTOR, give me files 1..100
2. Downloading file 1 from disk
3. ...
4. Downloading file 99 from disk
5. Bad luck, file 100 is on a tape
6. File 100 loaded onto a disk
7. Downloading file 100
Job 2
1. CASTOR, give me files 1..100
2. ...
3. ...
4. ...
5. Bad luck, file 100 is on a tape,
again
Sometime later...
Problem #2: work distribution
CASTOR
20 files
(different sizes)
40 workers available
sub-optimal performance
poor resource utilization
Problem #3: failure intolerance
Batch jobs like to fail, and when
they do ...
... it's completely up to you
Problem #4: the funny one
ROOT Data Analysis Framework
“ A cornerstone of High Energy Physics
software
Problem #4: the funny one
15 years of development
1,762,865 lines of code
46,308 commits
Object-oriented libraries for:
data analysis
statistics
visualization
simulation
reconstruction
event display
DAQ
C++ Interpreter Suite
CINT - the interpreter
close enough to standard
C++ extensions
rich RTTI
some syntactic sugar
ACLiC - automatic compiler
Why it's the best idea ever
the command language,
the scripting language,
the programming language
are all C++
Feature rich
Extremely performant
Specialized
storage formats
Why it's the worst idea ever
the command language,
the scripting language,
the programming language
are all C++
"C makes it easy to shoot yourself in the foot;
C++ makes it harder, but when you do, it
blows away your whole leg."
― Bjarne Stroustrup
"Especially when you use it as an interpreted
language with reflection."
― Captain Obvious
Key assumptions of happy analysis
1. Load once, analyze many times
2. Optimal granularity of jobs
3. Scalable
4. Little network and I/O overhead
5. Failure tolerant
6. Takes care of 2-5 automatically
7. Requires me to write less code
The new way
Create a cluster of machines (single click)
Request files from CASTOR to be loaded onto the
analysis cluster
System automatically loads and distributes the files
CASTOR
20 files
(different sizes)
20 workers evenly sized chunks
(small)
EOS
The new way
Request files stored on the cluster to be processed
Declare selection logic
System automatically processes the files
Rinse and Repeat
paths to files on
CASTOR
Overview of architecture
Building blocks: Hadoop 2
“ Apache Hadoop is an open-source software
framework for storage and large-scale processing of
data-sets on clusters of commodity hardware.
fault tolerant
scalable
designed for data locality
HDFS: A distributed file system
that provides high-throughput
access to application data.
MapReduce: A YARN-based
system for parallel processing
of large data sets.
Overview of architecture
Building blocks: OpenStack
“ OpenStack is a cloud operating system that controls
large pools of compute, storage, and networking
resources throughout a datacenter, all managed through
a dashboard
OpenStack vs AWS
Amazon EC2 vs Nova
Amazon S3 vs Swift
Elastic Block Storage vs Cinder
Amazon VPC vs Neutron
Amazon CloudWatch vs Ceilometer
Elastic MapReduce vs Sahara
AWS Console vs Horizon
Overview of architecture
Building blocks: Sahara
“ Sahara is an OpenStack data processing plugin,
which provides a simple means to provision a Hadoop
cluster on top of OpenStack.
template, launch, manage Hadoop clusters
with a single click (or a command)
add / remove nodes
submit, execute and track Hadoop jobs
Building blocks: Sahara
Are we done with
the infrastructure
then? NO
Challenge: CERN's OpenStack has
no Sahara
How do you use Sahara on an
OpenStack that ...
... does not support Sahara?
Solution: You need to go deeper
Separate Horizon on your host
Sahara with changes to authentication
Lesson learned:
OpenStack is flexible
has a nice Python code base
<3
clean APIs
easy to jump into
A riddle: What is it?
It's a hipster
Challenge: Hadoop on exotic distro
Solution: Prepare your own images
<template>
<name>SLC6 Sahara Icehouse CERN Server - x86_64</name>
<description>SLC6 Server with Cern-additions: AFS, Kerberos, user accounts, ... and
<os>
<name>SLC-6</name>
<version>5</version>
<arch>x86_64</arch>
<install type='iso'>
<iso>http://linuxsoft.cern.ch/cern/slc65/iso/SLC_6.5_x86_64_dvd.iso</iso>
</install>
</os>
<packages>
<package name='virt-what'/>
(...)
</packages>
<files>
<file name='/etc/init.d/firstboot_diskresize' type='raw'>
#!/bin/sh
(...)
</file>
(...)
</files>
<commands>
<command name='time-sync'>
# set up cron job to synchronize time
(...)
</command>
(...)
</commands>
</template>
Used CERN image builders
The Oz tool is cool ->
Upload to Glance
Your Oz customization file may look like this:
Lesson learned:
Debugging VM images
is difficult
Challenge: Cluster provisioning fails often
You : Sahara, give me 20 machines
Sahara : Nova, launch machine no1
Sahara : Nova, launch machine no2
...
Sahara : I'm waiting for all to be
Active, before configuring
Sahara : 6 failed, rolling back all!
You : Oh, for God's sake!
...
Sahara : I'm waiting for all to be
Active, before configuring them
... waits forever
Or even worse!
Solution: First try
Modified Direct Engine:
timeout for launching machines
simple retries for failed machines
removes completely failed machines
...
Sahara: Cluster provisioned.
Machines requested: 20. Machines
succeeded: 5
You: What the...
Solution: Exponential Backoff
Sleeping delay is a randomized, exponential
function of retry count
...
Sahara: Cluster provisioned.
Machines requested: 20. Machines
succeeded: 18
You: Thanks!
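The retry delay described above can be sketched in a few lines; this is an illustrative shape, not the actual Sahara patch (the function name and defaults are made up):

```python
import random

def backoff_delay(retry_count, base=1.0, cap=60.0):
    """Randomized exponential backoff: the sleep window grows as
    2**retry_count (capped), with jitter so that retried machine
    launches don't hit Nova in lockstep."""
    return random.uniform(0, min(cap, base * 2 ** retry_count))

# sampled delays for successive retries: roughly 0-1s, 0-2s, 0-4s, ...
delays = [backoff_delay(n) for n in range(5)]
```

The jitter matters as much as the exponent: without it, every failed machine retries at the same instant and the provisioning service gets hammered again.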
Lesson learned:
Be nice to systems you
depend on ...
... They will thank you with a 200
How to load the data using Hadoop
CASTOR
20 files
(different sizes)
20 workers evenly sized chunks
(small)
EOS
Map tasks
HDFS
We need a map-only job
How to load the data using Hadoop
CASTOR
EOS
path 1
path 2
path 3
...
path 1
C++
path 1
path 2
File 1
Map task 1
Map task 2
Map task 3
path 3
size = HDFS
block
TTree: the Apache Parquet of HEP
Events
row per event
Row oriented
Column oriented
Memory layouts
Compression unit per column
Read only the data you need
Much harder to partition evenly ;)
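The row-vs-column difference is easy to see in miniature; a hypothetical sketch where plain Python dicts and lists stand in for TTree/Parquet storage (the event fields are made up):

```python
# Row-oriented: one record per event, all fields stored together;
# reading any one field drags the whole event through I/O.
rows = [
    {"event_id": 1, "energy": 13.2, "n_tracks": 12},
    {"event_id": 2, "energy": 7.5,  "n_tracks": 4},
]

# Column-oriented: one array per field, each compressed independently;
# a cut on "energy" never touches the other columns.
columns = {
    "event_id": [1, 2],
    "energy":   [13.2, 7.5],
    "n_tracks": [12, 4],
}

energies = columns["energy"]  # reads exactly one column
```

The flip side, as the slide notes, is that splitting a columnar file into evenly sized chunks is much harder than cutting a stream of rows.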
Lesson learned:
Columnar storage formats
are great ...
... give Apache Parquet a try
How to filter the data using Hadoop
paths to files on
CASTOR
Map tasks
We need a map-only job*
How to filter the data using Hadoop
path 1
path 2
path 3
...
C++
Map task 1
Map task 2
Map task N
(...)
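Stripped of the Hadoop machinery, a map-only filter is just a predicate applied independently to each task's split; a hypothetical sketch (the event shape and cut are made up):

```python
def selected(event):
    # stand-in selection criteria, e.g. keep events with >= 2 tracks
    return event["n_tracks"] >= 2

def map_filter(events):
    # map-only: each task filters its own split and emits the survivors;
    # with no reduce phase, output goes straight back to HDFS
    return [e for e in events if selected(e)]

sample = [{"n_tracks": 1}, {"n_tracks": 3}, {"n_tracks": 7}]
kept = map_filter(sample)
```

Because no shuffle or reduce is needed, the job parallelizes trivially across however many map tasks the input splits allow.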
Challenge: It works too fast
SELECT two columns out of 100 ...
WHERE "complex criteria"
SELECT * ... WHERE "1=1"
Map task takes ~6s
Map task takes ~80s
Execution time depends on:
amount of data read
amount of data produced
cpu-heaviness of selection criteria
Increase the HDFS block size?
Solution: Optimize each query
Split the job in two:
Learning phaseLearning phase
1. Select a small sample of input
2. Run the job
3. Calculate avg time of map-task
r_heaviness = t_requested / t_avg
Mature phaseMature phase
Use CombineWholeFileInputFormat
maxInputSplitSize = r_heaviness ∗ blockSize
Result: Filtering up to 100 times faster
than loading
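The two-phase sizing above can be sketched as follows; the function names are illustrative (in the real job the result feeds CombineWholeFileInputFormat's maxInputSplitSize):

```python
def heaviness_ratio(t_requested, t_avg):
    # learning phase: how many times longer each map task should run
    # than the average measured on the small sample of input
    return t_requested / t_avg

def max_input_split_size(r_heaviness, block_size):
    # mature phase: hand each map task r_heaviness blocks' worth of input
    return int(r_heaviness * block_size)

r = heaviness_ratio(30.0, 6.0)                    # target 30s tasks, 6s measured
split = max_input_split_size(r, 128 * 1024 ** 2)  # 5 x 128 MB blocks per task
```

Light queries thus get fed several blocks per task, while CPU-heavy queries keep splits near a single block.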
Lesson learned:
Hadoop is not a low-latency
framework
... make your tasks heavier than 30s
Did it make any sense in the end?
YES
Much more performant
Much more scalable
Little to no code required
... but:
Some parts missing
change comes slowly
resources as well
What is the moral of
this story?
There are stories to tell,
go create them.
Thank You
turu-on-things.com
@rekurencja