March 26, 2016
Real-time applications on Intel Xeon/Phi
Karel Ha
CERN High Throughput Computing collaboration
Summary:
The Intel Xeon/Phi platform is a powerful x86 multi-core engine with a very high-speed
memory interface. In its next version it will be able to operate as a stand-alone system with a
very high-speed interconnect. This makes it a very interesting candidate for (near) real-time
applications such as event-building, event-sorting and event preparation for subsequent
processing by high level trigger software algorithms.
Abstract
The following document is a report providing the first results on the performance of the Intel Xeon Phi computing accelerator in the context of the LHCb Online Data Acquisition system (DAQ).
The main focus is put on the event-sorting task: when data arrive from different sources corresponding to different parts of the LHCb detector, they are grouped by the source from which they originate. In the next stage of the DAQ, it is necessary to decide whether to store the given collision event or not. For this purpose, it is more convenient to group the data by the collision events they belong to (i.e. all data from one collision need to be placed together), so that the DAQ system can decide based on the “whole picture” of one event.
The Xeon Phi is an interesting candidate for the event-sorting task. It offers a large number of cores and a vast amount of memory. Furthermore, the task can be parallelized very well, which makes it especially suitable for the many-core architecture of the Xeon Phi. Thus, this report may be used to study the feasibility of the Intel Xeon Phi platform for the next upgrade of the LHCb detector in 2018-2019.
Contents
1 Introduction
  1.1 Description of the problem
  1.2 Algorithm
  1.3 The goal
2 Offload-bandwidth
3 Prefix-offset
4 Event-sort
  4.1 The distribution of iteration durations
  4.2 Comparison between event-sort and raw memcpy
  4.3 Blockschemes for memcpy
  4.4 ASLR on KNC and its effect on event-sort
  4.5 Fixation of input data
  4.6 Varying the number of copy-threads
5 Some ideas for future work
6 Conclusion
Appendix A Infrastructure
Appendix B Compilers
Appendix C Reproducing the event-sort results
  C.1 Source code and setup
  C.2 Offload-bandwidth
  C.3 Prefix-offset
  C.4 Event-sort
1 INTRODUCTION
Intel Xeon Phi, or Intel Many Integrated Core Architecture (MIC), is a promising x86 many-core computing accelerator. As such, it is suitable for highly parallelizable jobs such as event-sorting, a subtask of the LHCb Data Acquisition System (DAQ). In this report, we present our measurements of event-sorting on the Intel Xeon Phi card, specifically the “Knights Corner” (KNC) version.
There are 3 demo programs:
• offload-bandwidth
• prefix-offset
• event-sort
The first two serve as preliminary tools for baseline benchmarks and for testing the properties of the Xeon Phi, whereas the last one simulates the real conditions of event-sorting in the LHCb DAQ.
For details on the software and hardware used, consult the appendices; they also contain the instructions for reproducing the results.
There is also a shared CERNBox folder htcc_shared, which contains all the logs that I kept regularly during my internship. For full details (source code, bash and gnuplot scripts, figures, raw output files, results, etc.), acquire access to the shared folder and consult my logs.
1.1 DESCRIPTION OF THE PROBLEM
The LHCb detector at CERN is a complex instrument consisting of many subdetectors. Hence, there are also many (approximately 1000) sources of input channels for the DAQ system. Each of the readout boards keeps the fragments of information (so-called MEP fragments, or mep_contents in the source code) in its own buffer. The fragments come from different channels and different collisions. The number of collisions is called the MEP factor (by default 10000 fragments per source).
For further processing, however, it is much more favorable to re-arrange (transpose) the
fragments and group them together according to the collision they belong to:
FIGURE 1: TRANSPOSE OF FRAGMENTS
For better illustration, see the example below:
−−−−−−−−−−Input MEP contents−−−−−−−−−−
Source #0 111222333334444
Source #1 555566667777788888
Source #2 9999aaaaabbbcc
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
−−−−−−−−−−Output MEP contents−−−−−−−−−
Collision #0 11155559999
Collision #1 2226666aaaaa
Collision #2 3333377777bbb
Collision #3 444488888cc
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
In the “Input MEP contents”, source #0 stores 3 bytes from collision #0 (labeled by character “1”), 3 bytes from collision #1 (labeled by character “2”), 5 bytes from collision #2 (labeled by character “3”) and 4 bytes from collision #3 (labeled by character “4”).
Source #1 (corresponding to a different subdetector) stores 4 bytes from collision #0 (labeled by character “5”), followed by the data from collisions #1 to #3. Source #2 also stores 4 bytes from collision #0 (labeled by character “9”), and likewise for the remaining collisions.
At this point, the transposition re-shuffles the data so that all the information from one collision is placed together. Therefore, in the “Output MEP contents”, the buffer for collision #0 contains the previously mentioned 3 bytes from source #0 (labeled by character “1”), 4 bytes from source #1 (labeled by character “5”) and 4 bytes from source #2 (labeled by character “9”).
Here is another example of the transposition:
−−−−−−−−−−Input MEP contents−−−−−−−−−−
Source #0 11111222333334444
Source #1 5566667777788888
Source #2 99aaaaabbbcc
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
−−−−−−−−−−Output MEP contents−−−−−−−−−
Collision #0 111115599
Collision #1 2226666aaaaa
Collision #2 3333377777bbb
Collision #3 444488888cc
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
The lengths of MEP fragments (usually between 80 and 120 bytes per fragment) are represented as 16-bit integers and stored in a separate array. The reason for this is a performance improvement: more than one length value fits into a cache line, so we can read and process several fragment lengths with one cache load.
The buffers for MEP fragments are stored in an array of arrays. There is one array mep_contents[i] for each source #i. A contiguous block of memory is allocated for every such buffer mep_contents[i]. However, two consecutive buffers do not necessarily have to lie in one contiguous block of memory.
The output array is saved in one contiguous block of memory. It stores the “re-shuffled” copies of fragments, now grouped by collision into collision blocks. Furthermore, the collision blocks are concatenated according to the collision index. For instance, the second example above would produce this output array:
111115599 2226666aaaaa 3333377777bbb 444488888cc
The spaces were added for clarity, to separate the different collisions.
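To make this layout concrete, here is a minimal sketch of the data structures described above (the names mirror the report, but the exact types in the repository may differ):

#include <cstdint>
#include <vector>

// One buffer of MEP fragments per source; inside each buffer, the fragments of
// all collisions are concatenated back to back.
using SourceBuffer = std::vector<uint8_t>;
std::vector<SourceBuffer> mep_contents;   // mep_contents[i] = buffer of source #i

// Fragment lengths (80-120 B each) kept separately as 16-bit integers,
// indexed by (source, collision): lengths[i * mep_factor + j].
std::vector<uint16_t> lengths;

// Transposed result: one contiguous block, collision #0 first, then #1, etc.
std::vector<uint8_t> output;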
1.2 ALGORITHM
In order to copy the data for transposition (for each fragment of each source), two types of array offsets (represented as 32-bit integers) need to be computed:
• read_offsets[] is the array of offsets determining where to copy from. It is the number of bytes from the beginning of mep_contents[i], where i is the source the fragment comes from.
• write_offsets[] is the array of offsets determining where to copy to. It is the number of bytes from the beginning of the output array.
Offsets are computed by applying a prefix sum to the appropriate elements of the array of lengths. The prefix sum is the following problem: given an array of numbers a[], produce an array s[] of the same size, where s[0] = 0 and s[i] = a[0] + a[1] + ... + a[i − 1] for i > 0. The prefix-sum problem is the core part of event sorting.
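As an illustration, a minimal serial version of this (exclusive) prefix sum, matching the definition above (a sketch, not the exact implementation from prefix-sum.cpp):

#include <cstddef>
#include <cstdint>
#include <vector>

// s[0] = 0, s[i] = a[0] + a[1] + ... + a[i-1]
std::vector<uint32_t> prefix_sum(const std::vector<uint16_t>& a) {
  std::vector<uint32_t> s(a.size());
  uint32_t running = 0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    s[i] = running;    // offset of element i
    running += a[i];   // accumulate the length of element i
  }
  return s;
}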
Real-time applications on Intel Xeon/Phi 6
March 26, 2016
Since the prefix sum for read_offsets[] within one source buffer is independent of the computations in other source buffers, we may parallelize it using #pragma omp parallel for.
Similarly, the prefix sum for write_offsets[] can also be parallelized using #pragma omp parallel for (for details, see the function get_write_offsets_OMP_version() in prefix-sum.cpp).
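A sketch of this per-source parallelization (names are illustrative; the actual functions live in prefix-sum.cpp):

#include <cstddef>
#include <cstdint>
#include <vector>
#include <omp.h>

// The prefix sum of each source buffer is independent, so the outer loop over
// sources can be distributed among OpenMP threads.
void compute_read_offsets(int n_sources, int mep_factor,
                          const std::vector<uint16_t>& lengths,
                          std::vector<uint32_t>& read_offsets) {
  #pragma omp parallel for
  for (int i = 0; i < n_sources; ++i) {
    uint32_t running = 0;
    for (int j = 0; j < mep_factor; ++j) {
      const std::size_t idx = (std::size_t)i * mep_factor + j;
      read_offsets[idx] = running;   // byte offset inside mep_contents[i]
      running += lengths[idx];
    }
  }
}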
After the read_offsets and write_offsets are computed, the content of each fragment can be copied using the standard memcpy() function. The copy-tasks for different MEP fragments are independent of one another and hence can be run in parallel. Namely, #pragma omp parallel for has been used to parallelize the loop that iterates over all MEP fragments and performs the memcopies; a sketch of it follows below.
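A minimal sketch of that parallel copy loop, reusing the data-layout and offset sketches above (names are illustrative, not the exact ones from the repository):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>
#include <omp.h>

void copy_MEPs(int n_sources, int mep_factor,
               const std::vector<SourceBuffer>& mep_contents,
               const std::vector<uint16_t>& lengths,
               const std::vector<uint32_t>& read_offsets,
               const std::vector<uint32_t>& write_offsets,
               std::vector<uint8_t>& output) {
  const std::size_t n_fragments = (std::size_t)n_sources * mep_factor;

  // One iteration per MEP fragment; all iterations are independent.
  #pragma omp parallel for
  for (std::size_t idx = 0; idx < n_fragments; ++idx) {
    const int i = (int)(idx / mep_factor);   // source index of this fragment
    std::memcpy(output.data() + write_offsets[idx],
                mep_contents[i].data() + read_offsets[idx],
                lengths[idx]);
  }
}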
1.3 THE GOAL
The goal of the demos is to test the speed and the feasibility of the Xeon Phi for event-sorting. Possible performance improvements are studied, namely various parallelization techniques.
2 OFFLOAD-BANDWIDTH
This program measures the bandwidth between the host and the device using the #pragma offload directive, in two modes:
a) offloading only to the device:
$ make && ./offload-bandwidth.exe -i 20 -e 1500000000
icpc -lrt main.cpp -o offload-bandwidth.exe
Using MIC0 . . .
Transferred : 30 GB
Total time : 4.37726 secs
Bandwidth : 6.8536 GBps
b) offloading to the device and copying the result back:
$ make && ./offload-bandwidth.exe -i 20 -e 1500000000
icpc -lrt main.cpp -o offload-bandwidth.exe
Using MIC0 . . .
Transferred : 60 GB
Total time : 8.67822 secs
Bandwidth : 6.91386 GBps
This bandwidth corresponds to the speed of the 50 Gbit/s PCIe interface between the host and the device. Here, the host machine is lhcb-phi.cern.ch (see Appendix A). The speed remains the same even when offload-bandwidth is launched on all 4 Xeon Phi cards at the same time (as 4 concurrent processes). This means there are four 50 Gbit/s PCIe interfaces and each of them can be fully saturated during offloads.
For more details, consult the README at https://github.com/mathemage/xphi-lhcb/tree/master/src/offload-bandwidth#parallel-run-on-all-available-mics
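For reference, here is a minimal sketch of the kind of transfer the benchmark times with #pragma offload (the buffer size is illustrative and this is not the actual offload-bandwidth source; it assumes the Intel compiler with a KNC card available as mic:0):

#include <cstdio>
#include <cstdlib>

int main() {
  const long n = 100 * 1000 * 1000;          // bytes per transfer (illustrative)
  char* buffer = (char*)malloc(n);

  // inout(...) sends the buffer to MIC0 and copies it back after the region;
  // switching to in(...) would time the host-to-device direction only.
  #pragma offload target(mic:0) inout(buffer : length(n))
  {
    buffer[0] = 1;                           // trivial work on the device
  }

  std::printf("transferred %ld bytes to and from MIC0\n", n);
  free(buffer);
  return 0;
}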
3 PREFIX-OFFSET
This program implements and tests the speed of prefix sum calculation.
a) 1000 iterations for the array size of 40000000, short int numbers a[i] range from 0 to 100:
Total time : 521.639 secs
Processed : 7.66814e+07 elements per second
b) 100000 iterations for the array size of 40000000, short int numbers a[i] range from 0 to 65534:
Total elements : 6000000000
Total time : 77.8086 secs
Processed : 7.71123e+07 elements per second
This is the result from 1 KNC card with lhcb-phi.cern.ch as the host (see Appendix A).
For more details, see the README at https://github.com/mathemage/xphi-lhcb/tree/master/src/prefix-offset#output
4 EVENT-SORT
LHCb Online owns 4 Intel Xeon Phi “KNC” cards. They are available on the lhcb-phi.cern.ch machine (see Appendix A).
4.1 THE DISTRIBUTION OF ITERATION DURATIONS
The simulation is iterated many times to reduce statistical fluctuations. The number of iterations is controlled via the command-line argument -i.
a) The results for 200 iterations:
# ./event-sort.mic.exe -i 200
. . .
−−−−−−−−−−SUMMARY−−−−−−−−−−
Total elements : 2e+09
Time for computing read_offsets : 0.553636 secs
Time for computing write_offsets : 2.50423 secs
Time for copying : 17.4631 secs
Total time : 20.521 secs
Total size : 230.013 GB
Processed : 9.74612e+07 elements per second
Throughput : 11.2087 GBps
−−−−−−−−−−−−−−−−−−−−−−−−−−−
“Time for computing read_offsets” is the total time spent calculating prefix sums for read_offsets[], “time for computing write_offsets” is the total time spent calculating prefix sums for write_offsets[], and “time for copying” is the total time spent performing memcpy() of the MEP fragments.
b) The results and the histogram for 1000 iterations:
−−−−−−−−STATISTICS OF TIME INTERVALS ( in secs)−−−−−−−−−−−−
The initial iteration : 0.43506
min : 0.10139
max : 0.10303
mean : 0.10216
. . .
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
−−−−−−−−STATISTICS OF THROUGHPUTS ( in GBps)−−−−−−−−−−−−−−−
min : 11.16119
max : 11.34263
mean : 11.25702
. . .
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
−−−−−−−−−−SUMMARY−−−−−−−−−−
Total elements : 1e+10
Time for computing read_offsets : 3.14013 secs
Time for computing write_offsets : 12.2161 secs
Time for copying : 86.8014 secs
Total time : 102.158 secs
Total size : 1149.98 GB
Processed : 9.7888e+07 elements per second
Throughput : 11.2569 GBps
−−−−−−−−−−−−−−−−−−−−−−−−−−−
The histograms of these measurements are shown in the corresponding figures.
4.2 COMPARISON BETWEEN EVENT-SORT AND RAW MEMCPY
The program memcpy-bandwidth tests only the throughput of the memcpy() function on the Intel Xeon Phi. It copies chunks (arrays) of data from one place to another (with OpenMP parallelization). This process is iterated (50 times in the measurement below) and the final throughput is calculated.
The number of threads is varied using #pragma omp parallel for num_threads(). The corresponding plot is in Figure 2.
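A minimal sketch of such a memcpy-bandwidth measurement (chunk size, iteration count and thread count are illustrative; the real program is memcpy-bandwidth in the repository):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>
#include <omp.h>

int main() {
  const std::size_t chunk = 16UL * 1024 * 1024;   // 16 MB per thread (illustrative)
  const int iterations = 50;
  const int nthreads = 64;                        // varied in the benchmark

  std::vector<std::vector<char>> src(nthreads, std::vector<char>(chunk, 1));
  std::vector<std::vector<char>> dst(nthreads, std::vector<char>(chunk));

  const auto start = std::chrono::steady_clock::now();
  for (int it = 0; it < iterations; ++it) {
    #pragma omp parallel for num_threads(nthreads)
    for (int t = 0; t < nthreads; ++t)
      std::memcpy(dst[t].data(), src[t].data(), chunk);
  }
  const auto stop = std::chrono::steady_clock::now();

  const double secs = std::chrono::duration<double>(stop - start).count();
  const double gb = (double)chunk * nthreads * iterations / 1e9;
  std::printf("copied %.1f GB in %.2f s -> %.2f GB/s\n", gb, secs, gb / secs);
  return 0;
}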
FIGURE 2: EVENT-SORT COMPARED TO RAW MEMCPY(), WITH VARIABLE NUMBER OF
THREADS
4.3 BLOCKSCHEMES FOR MEMCPY
The memory access patterns for event-sort can be optimized by splitting the workload into blocks, or blockschemes, of fragments. The serial version of event-sort would process fragments as shown in Figure 3. Each circle represents one MEP fragment, indexed by its source and its event.
FIGURE 3: WITHOUT A BLOCKSCHEME
The previously mentioned parallelized event-sort would assign each circle to a single worker-thread. Since fragment sizes are typically 80-120 B, the memcpy is inefficient because the core caches are much larger and thus not fully used.
By assigning a whole block of the workload to every worker-thread, we reduce cache thrashing. There are 4 blocks of size 2x2 in the blockscheme of Figure 4, which would be processed by 4 worker-threads in parallel.
FIGURE 4: 2X2 BLOCKS
Moreover, the spatial locality of the data also plays an important role: fragments in the rows of the picture are stored in a contiguous block of memory. Thus, the blocks load from and store into only contiguous parts of memory.
The algorithm is given the block dimensions (in the picture: 2 sources per block, 2 events per block). The blocks are then distributed among the worker-threads (by an OpenMP parallel for loop). Within every block, each assigned worker performs the memcpys using the previously computed read_offsets[] and write_offsets[]; a sketch follows below.
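A minimal sketch of this block-wise copy (block dimensions and names are illustrative; the actual implementation is the function copy_MEPs_block_scheme() in the repository):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>
#include <omp.h>

// Each OpenMP iteration handles one block of sources_per_block x events_per_block
// fragments, so a single thread keeps reading from and writing to contiguous
// memory regions instead of hopping between tiny 80-120 B fragments.
void copy_MEPs_blockwise(int n_sources, int mep_factor,
                         int sources_per_block, int events_per_block,
                         const std::vector<SourceBuffer>& mep_contents,
                         const std::vector<uint16_t>& lengths,
                         const std::vector<uint32_t>& read_offsets,
                         const std::vector<uint32_t>& write_offsets,
                         std::vector<uint8_t>& output) {
  const int blocks_src = (n_sources + sources_per_block - 1) / sources_per_block;
  const int blocks_evt = (mep_factor + events_per_block - 1) / events_per_block;

  #pragma omp parallel for collapse(2)
  for (int bs = 0; bs < blocks_src; ++bs) {
    for (int be = 0; be < blocks_evt; ++be) {
      const int i_end = std::min(n_sources, (bs + 1) * sources_per_block);
      const int j_end = std::min(mep_factor, (be + 1) * events_per_block);
      for (int i = bs * sources_per_block; i < i_end; ++i) {
        for (int j = be * events_per_block; j < j_end; ++j) {
          const std::size_t idx = (std::size_t)i * mep_factor + j;
          std::memcpy(output.data() + write_offsets[idx],
                      mep_contents[i].data() + read_offsets[idx],
                      lengths[idx]);
        }
      }
    }
  }
}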
In order to find the optimal block dimensions, a series of benchmark tests has been carried out. The results are represented in the following heatmap:
FIGURE 5: EVENT-SORT WITH VARIOUS PARAMETERS OF BLOCKSCHEME (KNC)
The event-sort with optimal block dimensions (according to the heatmap on the right side):
# ./upload-to-MIC.sh -i 100 -1 5 -2 28
. . .
____________________________SUMMARY____________________________
Total elements : 1e+09
Time for computing read_offsets : 0.28435 secs
Time for computing write_offsets : 1.13954 secs
Time for copying : 3.1574 secs
Total time : 4.58129 secs
Total size : 114.998 GB
Processed : 2.18279e+08 elements per second
Throughput : 25.1016 GBps
________________________________________________________________
Comparing the times, about 69 % of all the time is spent doing memcopies. The rest is the
computation of offsets. Moreover, the overall throughput has been improved by a factor of > 2
(compare with Section 4.1).
4.4 ASLR ON KNC AND ITS EFFECT ON EVENT-SORT
Address Space Layout Randomization (ASLR) was suspected to cause large inconsistencies in results on the KNL Xeon Phi. This was pointed out by Wim Heirman; here is the e-mail conversation with him:
Hi Karel,
I did some more runs, now with Linux address randomization turned on (my
machine had it disabled previously). I do see some large variations now. Do you
have address randomization turned on for your machine? (see output of "sysctl
kernel.randomize_va_space", 0 means disabled while 1 and 2 enable different
parts of it). Can you do a few more runs with a disabled setting? (See [1], I
think the setarch -R option should work even if you don't have root access).
Regards,
Wim
[1] http://stackoverflow.com/questions/11238457/disable-and-re-enable-address-space-layout-randomization-only-for-myself
I have tried my application on KNCs with various settings of ASLR. There were 100 experiments (runs), each performing only 1 iteration.
For kernel.randomize_va_space = 0:
mean = 20.0434 min = 19.6947 max = 20.4567 standard deviation = 0.1267
For kernel.randomize_va_space = 1:
mean = 20.3565 min = 19.5846 max = 21.1473 standard deviation = 0.3669
For kernel.randomize_va_space = 2:
mean = 20.305 min = 19.555 max = 21.1037 standard deviation = 0.3641
In conclusion, it seems that ASLR does have some effect on the run-to-run variation.
4.5 FIXATION OF INPUT DATA
Rainer and I had a hypothesis that the throughput of event-sort may depend strongly on the input data sizes (e.g. whether the lengths fit cache lines). In order to test this idea, I have implemented the option --srand-seed. It sets a custom seed for the srand() function, which is used for randomizing the input data. Hence, by initializing srand() with a chosen seed, the input is always the same between different runs.
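A minimal sketch of how this seed fixation makes the generated input reproducible (the option name --srand-seed is from the report; the length range and helper function are illustrative):

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// With the same srand_seed, rand() produces the same sequence, so the whole
// generated input data set is identical between runs.
std::vector<uint16_t> generate_lengths(std::size_t count, unsigned srand_seed) {
  std::srand(srand_seed);                 // fixed seed -> reproducible input
  std::vector<uint16_t> lengths(count);
  for (auto& len : lengths)
    len = 80 + std::rand() % 41;          // illustrative 80-120 B fragment lengths
  return lengths;
}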
For the range of seeds from 0 to 100, I have studied the variability (mean, standard deviation, min, max) of the resulting throughputs. The results are shown in Figure 6. The mean, the (sample-based) standard deviation, the min and the max are always taken from 10 runs, each of which initializes srand() to the same seed (the one given in the first column). Blue and red cells mark the min and max, respectively, of the values in the corresponding column.
For comparison, here is an entirely serial version (i.e. copy_MEPs_serial_version()) with the
two chosen seeds:
• srand-seed == 83:
mean = 0.111149 standard deviation = 4.47532e-05 min = 0.111081 max = 0.111204
mean = 0.111167 standard deviation = 8.98804e-05 min = 0.111082 max = 0.111397
mean = 0.11108 standard deviation = 0.000120816 min = 0.110984 max = 0.111401
• srand-seed == 89:
mean = 0.111119 standard deviation = 5.10757e-05 min = 0.11104 max = 0.111186
mean = 0.111151 standard deviation = 5.33504e-05 min = 0.111079 max = 0.111227
mean = 0.111093 standard deviation = 0.000144087 min = 0.110992 max = 0.111487
There was no OpenMP in the copying part, but two OpenMP-parallel functions are still used for the offset computation; that is why the deviation is not exactly 0.
The conclusion is: even though the deviation is negligible, it is still clearly non-zero. This suggests that the variation has a different origin, possibly the non-determinism of thread scheduling.
FIGURE 6: EVENT-SORT (IN GBYTES/S) ON KNC FOR VARIOUS ASLR SETTINGS AND VARIOUS FIXED INPUT DATA (DEPENDENT ON THE SEED)
4.6 VARYING THE NUMBER OF COPY-THREADS
Another idea is to fix the input data and vary the number of threads performing the copying part. The number of threads is set by OpenMP here:
void copy_MEPs_block_scheme() {
  ...
  #pragma omp parallel for num_threads(nthreads)
  ...
}
Figure 7 shows the dependence of the (sample-based) standard deviation on the number of copying threads. The deviation is taken over 10 experiments (runs). The tested numbers of copy-threads are 1, 2, 4, 8, 16, 32 and 64.
Figure 8 shows the identical experiment for all numbers of copy-threads from 1 to 64. From the latter figure, it seems there is no apparent dependence between the number of copy-threads and the standard deviation of the runs.
FIGURE 7: NUMBER OF (COPYING) THREADS VS. STANDARD DEVIATION OF RESULTS
(1, 2, 4, 8, 16, 32, 64 THREADS)
FIGURE 8: NUMBER OF (COPYING) THREADS VS. STANDARD DEVIATION OF RESULTS
(1, 2, 3, 4, · · · , 64 THREADS)
5 SOME IDEAS FOR FUTURE WORK
• “Recompile” the event-sort project using the ispc compiler: https://ispc.github.io/. This compiler has promising auto-vectorization capabilities.
• Write unit tests for the project, for instance using the Google Test framework: https://github.com/google/googletest
• Use CMake instead of hand-written Makefiles: https://cmake.org/
• Consider (try, test and benchmark) the usage of Intel TBB for the prefix-sum functions: https://www.threadingbuildingblocks.org/
• Consider (try, test and benchmark) the usage of OpenCL for the prefix-sum functions: https://www.khronos.org/opencl/
• Run the high_performance_linpack_benchmark on the Xeon Phi: https://lbdokuwiki.cern.ch/doku.php?id=upgrade:high_performance_linpack_benchmark
• Participate in the CERN Concurrency Forum: http://concurrency.web.cern.ch/
6 CONCLUSION
The simulations of the event-sorting task show that KNC is capable of delivering a throughput of about 25 GB/s. Our aim was to reach 12.5 GB/s, so as to saturate a 100 Gbit/s Ethernet network, which is one of the candidate networks for the LHCb upgrade.
This has been accomplished by splitting the workload into blocks of fragments and letting the threads memcopy whole blocks of fragments rather than doing it fragment by fragment.
The excess throughput can be exploited as additional computing power. For example, some portion of the Xeon Phi cards (cores, number of threads) can be allocated to event-sorting (just enough for 12.5 GB/s), whereas the remaining capacity may be used for other algorithms, so as to start the reconstruction process already at this very early stage. Thus, the overall quality of the decisions whether to store or discard events would improve.
A INFRASTRUCTURE
The LHCb Online group provides the server machine lhcb-phi.cern.ch. This host machine contains 32 Intel(R) Xeon(R) 2.00 GHz (logical) processors:
[kha@lhcb-phi kha]$ less /proc/cpuinfo | tail -n 26
processor       : 31
vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
stepping        : 7
microcode       : 1808
cpu MHz         : 1200.000
cache size      : 20480 KB
physical id     : 1
siblings        : 16
core id         : 7
cpu cores       : 8
apicid          : 47
initial apicid  : 47
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips        : 4014.16
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
with the operating system:
[kha@lhcb-phi kha]$ uname -a
Linux lhcb-phi 2.6.32-504.el6.x86_64 #1 SMP Tue Oct 14 11:22:00 CDT 2014 x86_64 x86_64 x86_64 GNU/Linux
On top of that, there are also 4 Intel KNC Xeon Phi cards (the so-called “devices”, here MIC0, MIC1, MIC2 and MIC3). They are connected to the host via 50 Gbit/s PCIe lanes, and each of them has 228 logical processors:
[xeonphi@lhcb-phi-mic0 ~]$ less /proc/cpuinfo | tail -n 26
processor       : 31
vendor_id       : GenuineIntel
cpu family      : 11
model           : 1
model name      : 0b/01
stepping        : 3
cpu MHz         : 1100.000
cache size      : 512 KB
physical id     : 0
siblings        : 228
core id         : 56
cpu cores       : 57
apicid          : 227
initial apicid  : 227
fpu             : yes
fpu_exception   : yes
cpuid level     : 4
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr mca pat fxsr ht syscall nx lm nopl lahf_lm
bogomips        : 2205.22
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
each with the operating system:
[kha@lhcb-phi kha]$ uname -a
Linux lhcb-phi 2.6.32-504.el6.x86_64 #1 SMP Tue Oct 14 11:22:00 CDT 2014 x86_64 x86_64 x86_64 GNU/Linux
B COMPILERS
The source code is written in C++ and uses OpenMP for task-based parallelization. It requires the Intel compiler:
[kha@lhcb-phi event-sort]$ icpc -V
Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.3.187 Build 20150407
Copyright (C) 1985-2015 Intel Corporation. All rights reserved.
or Intel’s version of gcc compiler for cross-compilation on Xeon Phi:
[kha@lhcb-phi event-sort]$ /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-g++ -v
Using built-in specs.
COLLECT_GCC=/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-g++
COLLECT_LTO_WRAPPER=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/libexec/k1om-mpss-linux/gcc/k1om-mpss-linux/4.7.0/lto-wrapper
Target: k1om-mpss-linux
Configured with: /sandbox/build/tmp/tmp/work/x86_64-nativesdk-mpsssdk-linux/gcc-cross-canadian-k1om-4.7.0+mpss3.5.1-1/gcc-4.7.0+mpss3.5.1/configure --build=x86_64-linux --host=x86_64-mpsssdk-linux --target=k1om-mpss-linux --prefix=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr --exec_prefix=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr --bindir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/bin/k1om-mpss-linux --sbindir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/bin/k1om-mpss-linux --libexecdir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/libexec/k1om-mpss-linux --datadir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/share --sysconfdir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/etc --sharedstatedir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/com --localstatedir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/var --libdir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/lib/k1om-mpss-linux --includedir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/include --oldincludedir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/include --infodir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/share/info --mandir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/share/man --disable-silent-rules --disable-dependency-tracking --with-libtool-sysroot=/sandbox/build/tmp/tmp/sysroots/x86_64-nativesdk-mpsssdk-linux --with-gnu-ld --enable-shared --enable-languages=c,c++ --enable-threads=posix --disable-multilib --enable-c99 --enable-long-long --enable-symvers=gnu --enable-libstdcxx-pch --program-prefix=k1om-mpss-linux- --enable-target-optspace --enable-lto --enable-libssp --disable-bootstrap --disable-libgomp --disable-libmudflap --with-system-zlib --with-linker-hash-style=gnu --enable-cheaders=c_global --with-local-prefix=/opt/mpss/3.5.1/sysroots/k1om-mpss-linux/usr --with-gxx-include-dir=/opt/mpss/3.5.1/sysroots/k1om-mpss-linux/usr/include/c++ --with-build-time-tools=/sandbox/build/tmp/tmp/sysroots/x86_64-linux/usr/k1om-mpss-linux/bin --with-sysroot=/opt/mpss/3.5.1/sysroots/k1om-mpss-linux --with-build-sysroot=/sandbox/build/tmp/tmp/sysroots/knightscorner --disable-libunwind-exceptions --disable-libssp --disable-libgomp --disable-libmudflap --with-mpfr=/sandbox/build/tmp/tmp/sysroots/x86_64-nativesdk-mpsssdk-linux --with-mpc=/sandbox/build/tmp/tmp/sysroots/x86_64-nativesdk-mpsssdk-linux --enable-nls --enable-__cxa_atexit
Thread model: posix
gcc version 4.7.0 20110509 (experimental) (GCC)
C REPRODUCING THE EVENT-SORT RESULTS
C.1 SOURCE CODE AND SETUP
The source code is available on GitHub: https://github.com/mathemage/xphi-lhcb
$ git clone git@github.com:mathemage/xphi-lhcb.git
Cloning into 'xphi-lhcb'...
...
$ cd xphi-lhcb/
Then source the CERN setup script for Intel tools:
source /afs/cern.ch/sw/IntelSoftware/linux/all-setup.sh
To enable OpenMP, find the libiomp5.so file:
$ find /afs/cern.ch/sw/IntelSoftware/linux/x86_64/ -name libiomp5.so
/afs/cern.ch/sw/IntelSoftware/linux/x86_64/cce/10.1.008/lib/libiomp5.so
/afs/cern.ch/sw/IntelSoftware/linux/x86_64/Compiler/11.1/059/lib/ia32/libiomp5.so
/afs/cern.ch/sw/IntelSoftware/linux/x86_64/Compiler/11.1/059/lib/intel64/libiomp5.so
...
...and copy it into the xphi-lhcb/lib/ folder.
Note: the instructions below were done and are valid for the commit:
commit ae7bc6ff540fbbdc0c1b09382f5e821e0c40e6dc
Author: Karel Ha <mathemage@gmail.com>
Date:   Thu Oct 8 13:17:58 2015 +0200

    Change location of libiomp5.so
(The output produced by later versions of the repository may differ.)
C.2 OFFLOAD-BANDWIDTH
Change to the directory xphi-lhcb/src/offload-bandwidth/ and launch the program once for each MIC card (i.e. 4 processes in our case):
[kha@lhcb-phi offload-bandwidth]$ ./run-on-all-MICs.sh
icpc -lrt main.cpp -o offload-bandwidth.exe
Launching offload-bandwith on MIC 0 ...
Launching offload-bandwith on MIC 1 ...
Launching offload-bandwith on MIC 2 ...
Launching offload-bandwith on MIC 3 ...
After a while, when all processes finish, you may check the output in the following way...
[kha@lhcb-phi offload-bandwidth]$ cat *.out
Using MIC0 . . .
Transferred : 90 GB
Total time : 13.1119 secs
Bandwidth : 6.864 GBps
Using MIC1 . . .
Transferred : 90 GB
Total time : 13.5207 secs
Bandwidth : 6.65647 GBps
Using MIC2 . . .
Transferred : 90 GB
Total time : 13.1548 secs
Bandwidth : 6.84162 GBps
Using MIC3 . . .
Transferred : 90 GB
Total time : 25.9486 secs
Bandwidth : 3.4684 Gbps
C.3 PREFIX-OFFSET
Change to the directory xphi-lhcb/src/prefix-offset/ and run the script:
[kha@lhcb-phi prefix-offset]$ ./upload-to-MIC.sh
icpc -lrt -I../../include -openmp -std=c++14 -mmic main.cpp ../utils.cpp ../prefix-sum.cpp -o mic-prefix-offset.exe
mic-prefix-offset.exe    100%   64KB  64.4KB/s   00:00
libiomp5.so              100% 1268KB   1.2MB/s   00:00
Generated random lengths :
Too many numbers to display !
Offsets :
Too many numbers to display !
Total elements : 200000000
Total time : 2.57888 secs
Processed : 7.75531e+07 elements per second
Processed : 0 GBps
C.4 EVENT-SORT
Change to the directory xphi-lhcb/src/event-sort/ and run the script:
[kha@lhcb-phi event-sort]$ ./upload-to-MIC.sh
Using MIC0 ...
icpc -g -lrt -I../../include -openmp -std=c++14 -qopt-report3 -qopt-report-phase=vec -mmic main.cpp ../prefix-sum.cpp ../utils.cpp \
    -o event-sort.mic.exe
icpc: remark #10397: optimization reports are generated in *.optrpt files in the output location
event-sort.mic.exe    100%  143KB 142.6KB/s   00:00
benchmarks.sh         100%  898    0.9KB/s   00:00
libiomp5.so           100% 1268KB   1.2MB/s   00:00
−−−−−−−−STATISTICS OF TIME INTERVALS−−−−−−−−
The initial iteration : 0.47684 secs
min : 0.15831 secs
max : 0.15947 secs
mean : 0.15889 secs
Histogram :
[0.15831 , 0.15860): 2 times
[0.15860 , 0.15889): 4 times
[0.15889 , 0.15918): 2 times
[0.15918 , 0.15947): 2 times
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
−−−−−−−−−−SUMMARY−−−−−−−−−−
Total elements : 1e+08
Time for computing read_offsets : 0.159042 secs
Time for computing write_offsets : 0.288448 secs
Time for copying : 1.14138 secs
Total time : 1.58887 secs
Total size : 11.5004 GB
Processed : 6.29379e+07 elements per second
Throughput : 7.23812 GBps
−−−−−−−−−−−−−−−−−−−−−−−−−−−
This script cross-compiles the source code for the Intel Xeon Phi architecture and uploads the binaries and required libraries using scp. On the MIC, the binary is run with the default settings of the parameters.
You can also run several benchmark tests, varying the number of sources and the MEP factor, and varying the number of iterations:
[kha@lhcb-phi event-sort]$ ./upload-to-MIC.sh -b
Running benchmarks.sh
Using MIC0 ...
icpc -g -lrt -I../../include -openmp -std=c++14 -qopt-report3 -qopt-report-phase=vec -mmic main.cpp ../prefix-sum.cpp ../utils.cpp \
    -o event-sort.mic.exe
icpc: remark #10397: optimization reports are generated in *.optrpt files in the output location
event-sort.mic.exe    100%  143KB 142.6KB/s   00:00
benchmarks.sh         100%  898    0.9KB/s   00:00
libiomp5.so           100% 1268KB   1.2MB/s   00:00
Varying the number of sources and the MEP factor ...
./event-sort.mic.exe -s 1 -m 10000000
...
Varying the number of iterations ...
...
Real-time applications on Intel Xeon/Phi 27

More Related Content

PPT
Ch1 lecture slides Chenming Hu Device for IC
PPTX
Datapath design
PDF
Low power vlsi design ppt
PDF
VERILOG CODE FOR Adder
PPSX
A Comparison Of Vlsi Interconnect Models
PPTX
Electrical signal processing and transmission
PPT
Integrated circuits
PDF
Fundamentals of digital electronics
Ch1 lecture slides Chenming Hu Device for IC
Datapath design
Low power vlsi design ppt
VERILOG CODE FOR Adder
A Comparison Of Vlsi Interconnect Models
Electrical signal processing and transmission
Integrated circuits
Fundamentals of digital electronics

What's hot (20)

PDF
PowerArtist: RTL Design for Power Platform
PDF
Embedded For You - Online sample magazine
PPT
Layout design on MICROWIND
PDF
Discrete event system simulation control flow chart
PDF
Sram technology
DOCX
Half adder layout design
PDF
Automatisme) www.cours-online.com
PPTX
MOSFET fabrication 12
PDF
3. Cours 03 - Réglage des régulateurs PID.pdf
PDF
Kicad 101
PPT
Introduction to VLSI
PPTX
Encoders
PDF
Exercices vhdl
PPTX
Introduction to SILVACO and MOSFET Simulation technique
PPTX
Floating point ALU using VHDL implemented on FPGA
PDF
37248247 cours-hyperfrequences-parametres-s-antennes (1)
PPTX
Study of inter and intra chip variations
PDF
PDF
Lecture2 binary multiplication
PPT
Memristor
PowerArtist: RTL Design for Power Platform
Embedded For You - Online sample magazine
Layout design on MICROWIND
Discrete event system simulation control flow chart
Sram technology
Half adder layout design
Automatisme) www.cours-online.com
MOSFET fabrication 12
3. Cours 03 - Réglage des régulateurs PID.pdf
Kicad 101
Introduction to VLSI
Encoders
Exercices vhdl
Introduction to SILVACO and MOSFET Simulation technique
Floating point ALU using VHDL implemented on FPGA
37248247 cours-hyperfrequences-parametres-s-antennes (1)
Study of inter and intra chip variations
Lecture2 binary multiplication
Memristor
Ad

Similar to Real-time applications on IntelXeon/Phi (20)

PDF
Experiences building a distributed shared log on RADOS - Noah Watkins
PDF
Mirko Damiani - An Embedded soft real time distributed system in Go
PDF
Server Tips
PPTX
Realtime traffic analyser
PPTX
Static Memory Management for Efficient Mobile Sensing Applications
PPTX
Combining Phase Identification and Statistic Modeling for Automated Parallel ...
PPSX
PDF
PDF
Parallel Computing - Lec 4
PDF
Scalable Interconnection Network Models for Rapid Performance Prediction of H...
PDF
Van jaconson netchannels
PDF
Matloff programming on-parallel_machines-2013
PDF
Tips on High Performance Server Programming
PPTX
High performace network of Cloud Native Taiwan User Group
PDF
Reinventing the wheel: libmc
PDF
rooter.pdf
ODP
Nagios Conference 2012 - Dave Josephsen - Stop Being Lazy
PPT
Embedded systems
PDF
GEN: A Database Interface Generator for HPC Programs
PDF
Implementation of coarse-grain coherence tracking support in ring-based multi...
Experiences building a distributed shared log on RADOS - Noah Watkins
Mirko Damiani - An Embedded soft real time distributed system in Go
Server Tips
Realtime traffic analyser
Static Memory Management for Efficient Mobile Sensing Applications
Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Parallel Computing - Lec 4
Scalable Interconnection Network Models for Rapid Performance Prediction of H...
Van jaconson netchannels
Matloff programming on-parallel_machines-2013
Tips on High Performance Server Programming
High performace network of Cloud Native Taiwan User Group
Reinventing the wheel: libmc
rooter.pdf
Nagios Conference 2012 - Dave Josephsen - Stop Being Lazy
Embedded systems
GEN: A Database Interface Generator for HPC Programs
Implementation of coarse-grain coherence tracking support in ring-based multi...
Ad

More from Karel Ha (18)

PDF
transcript-master-studies-Karel-Ha
PDF
Schrodinger poster 2020
PDF
CapsuleGAN: Generative Adversarial Capsule Network
PDF
Dynamic Routing Between Capsules
PDF
AI Supremacy in Games: Deep Blue, Watson, Cepheus, AlphaGo, DeepStack and Ten...
PDF
AlphaZero
PDF
Solving Endgames in Large Imperfect-Information Games such as Poker
PDF
transcript-bachelor-studies-Karel-Ha
PDF
AlphaGo: Mastering the Game of Go with Deep Neural Networks and Tree Search
PDF
Mastering the game of Go with deep neural networks and tree search: Presentation
PDF
HTCC poster for CERN Openlab opendays 2015
PDF
Separation Axioms
PDF
Oddělovací axiomy v bezbodové topologii
PDF
Algorithmic Game Theory
PDF
Summer Student Programme
PDF
Summer @CERN
PDF
Tape Storage and CRC Protection
PDF
Question Answering with Subgraph Embeddings
transcript-master-studies-Karel-Ha
Schrodinger poster 2020
CapsuleGAN: Generative Adversarial Capsule Network
Dynamic Routing Between Capsules
AI Supremacy in Games: Deep Blue, Watson, Cepheus, AlphaGo, DeepStack and Ten...
AlphaZero
Solving Endgames in Large Imperfect-Information Games such as Poker
transcript-bachelor-studies-Karel-Ha
AlphaGo: Mastering the Game of Go with Deep Neural Networks and Tree Search
Mastering the game of Go with deep neural networks and tree search: Presentation
HTCC poster for CERN Openlab opendays 2015
Separation Axioms
Oddělovací axiomy v bezbodové topologii
Algorithmic Game Theory
Summer Student Programme
Summer @CERN
Tape Storage and CRC Protection
Question Answering with Subgraph Embeddings

Recently uploaded (20)

DOCX
Q1_LE_Mathematics 8_Lesson 5_Week 5.docx
PPTX
Introduction to Cardiovascular system_structure and functions-1
PDF
The scientific heritage No 166 (166) (2025)
PPTX
Introduction to Fisheries Biotechnology_Lesson 1.pptx
PPTX
ognitive-behavioral therapy, mindfulness-based approaches, coping skills trai...
PPTX
ANEMIA WITH LEUKOPENIA MDS 07_25.pptx htggtftgt fredrctvg
PPTX
ECG_Course_Presentation د.محمد صقران ppt
PPTX
cpcsea ppt.pptxssssssssssssssjjdjdndndddd
PPTX
neck nodes and dissection types and lymph nodes levels
PPTX
EPIDURAL ANESTHESIA ANATOMY AND PHYSIOLOGY.pptx
PPTX
INTRODUCTION TO EVS | Concept of sustainability
PDF
IFIT3 RNA-binding activity primores influenza A viruz infection and translati...
PPTX
Cell Membrane: Structure, Composition & Functions
PDF
Phytochemical Investigation of Miliusa longipes.pdf
PDF
Formation of Supersonic Turbulence in the Primordial Star-forming Cloud
PPTX
7. General Toxicologyfor clinical phrmacy.pptx
PDF
Sciences of Europe No 170 (2025)
PDF
HPLC-PPT.docx high performance liquid chromatography
PPT
protein biochemistry.ppt for university classes
PPTX
DRUG THERAPY FOR SHOCK gjjjgfhhhhh.pptx.
Q1_LE_Mathematics 8_Lesson 5_Week 5.docx
Introduction to Cardiovascular system_structure and functions-1
The scientific heritage No 166 (166) (2025)
Introduction to Fisheries Biotechnology_Lesson 1.pptx
ognitive-behavioral therapy, mindfulness-based approaches, coping skills trai...
ANEMIA WITH LEUKOPENIA MDS 07_25.pptx htggtftgt fredrctvg
ECG_Course_Presentation د.محمد صقران ppt
cpcsea ppt.pptxssssssssssssssjjdjdndndddd
neck nodes and dissection types and lymph nodes levels
EPIDURAL ANESTHESIA ANATOMY AND PHYSIOLOGY.pptx
INTRODUCTION TO EVS | Concept of sustainability
IFIT3 RNA-binding activity primores influenza A viruz infection and translati...
Cell Membrane: Structure, Composition & Functions
Phytochemical Investigation of Miliusa longipes.pdf
Formation of Supersonic Turbulence in the Primordial Star-forming Cloud
7. General Toxicologyfor clinical phrmacy.pptx
Sciences of Europe No 170 (2025)
HPLC-PPT.docx high performance liquid chromatography
protein biochemistry.ppt for university classes
DRUG THERAPY FOR SHOCK gjjjgfhhhhh.pptx.

Real-time applications on IntelXeon/Phi

  • 1. March 26, 2016 .. Real-time applications on Intel Xeon/Phi Karel Ha CERN High Throughput Computing collaboration Summary: The Intel Xeon/Phi platform is a powerful x86 multi-core engine with a very high-speed memory interface. In its next version it will be able to operate as a stand-alone system with a very high-speed interconnect. This makes it a very interesting candidate for (near) real-time applications such as event-building, event-sorting and event preparation for subsequent processing by high level trigger software algorithms. Real-time applications on Intel Xeon/Phi 1
  • 2. March 26, 2016 Abstract The following document is a report providing the first results on the performance of In- tel Xeon Phi computing accelerator in the context of LHCb Online Data Acquisition system (DAQ). Themainfocusisputintotheevent-sortingtask: whendataarrivefromdifferentsources corresponding to different parts of the LHCb detector, they are grouped by the source, from which they originate. In the next stage of DAQ, it is necessary to make a decision, whether to store the given collision event or not. For this purpose, it is more convenient to group the data by their memberships to collision events (i.e. all data from one collision need to be placed together), so that the DAQ system can decide based on the “whole picture” of one event. The Xeon Phi is an interesting candidate for event-sorting task. It offers a large number of cores and vast amount of memory. Furthermore, this task can also be very well paral- lelized, which can make it especially suitable for the many-core architecture of the Xeon Phi. Thus, this report may be used to study feasibility of the Intel Xeon Phi platform for the next upgrade of the LHCb detector in 2018-2019. Real-time applications on Intel Xeon/Phi 2
  • 3. March 26, 2016 Contents 1 Introduction 4 1.1 Description of the problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3 The goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2 Offload-bandwidth 8 3 Prefix-offset 9 4 Event-sort 10 4.1 The distribution of iteration durations . . . . . . . . . . . . . . . . . . . . . . . . . 10 4.2 Comparison between event-sort and raw memcpy . . . . . . . . . . . . . . . . . . 12 4.3 Blockschemes for memcpy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4.4 ASLR on KNC and its effect on event-sort . . . . . . . . . . . . . . . . . . . . . . . 16 4.5 Fixation of input data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 4.6 Varying of number of copy-threads . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 5 Some ideas for future work 22 6 Conclusion 23 Appendix A Infrastructure 24 Appendix B Compilers 25 Appendix C Reproducing the event-sort results 25 C.1 Source code and setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 C.2 Offload-bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 C.3 Prefix-offset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 C.4 Event-sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Real-time applications on Intel Xeon/Phi 3
  • 4. March 26, 2016 1 INTRODUCTION Intel Xeon Phi or Intel Many Integrated Core Architecture (MIC) is a promising x86 many- core computing accelerator. As such, it is suitable for highly parallelizable jobs such as event- sorting, a subtask of LHCb Data Acquisition System (DAQ). In this report, we present our mea- surementsofevent-sortingonIntelXeonPhicard, specifically“KnightsCorner”(KNC)version. There are 3 demo programs: • offload-bandwidth • prefix-offset • event-sort Thefirsttwo partsserveas preliminarytoolsfor baseline benchmarksandtesting theprop- erties of Xeon Phi, whereas the last one simulates the real conditions of event-sort in LHCb DAQ. For details on the used software and hardware, consult Appendix C. There are also the in- structions for reproducing the results. There is also a shared CERNBox folder htcc_shared, which contains all the logs that I regu- larly kept during my internship. For full details (source codes, bash and gnuplot scripts, figures, raw output files and results etc.), acquire an access to the shared folder and consult my logs. 1.1 DESCRIPTION OF THE PROBLEM The LHCb detector at CERN is a complex instrument consisting of many subdetectors. Hence, there are also many (approximately 1000) sources of input channels for the DAQ system. Each of the readout boards keeps the fragments of information (so called MEP fragments or also mep_contents in the source code) in its own buffer. The fragments come from different chan- nels and different collisions. The number of collisions is called MEP factor (by default 10000 fragments per source). For further processing, however, it is much more favorable to re-arrange (transpose) the fragments and group them together according to the collision they belong to: Real-time applications on Intel Xeon/Phi 4
  • 5. March 26, 2016 FIGURE 1: TRANSPOSE OF FRAGMENTS For better illustration, see the example below: −−−−−−−−−−Input MEP contents−−−−−−−−−− Source #0 111222333334444 Source #1 555566667777788888 Source #2 9999aaaaabbbcc −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− −−−−−−−−−−Output MEP contents−−−−−−−−− C o l l i s i o n #0 11155559999 C o l l i s i o n #1 2226666aaaaa C o l l i s i o n #2 3333377777bbb C o l l i s i o n #3 444488888cc −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− Inthe“InputMEPcontents”,source#0stores3bytesfromcollision#0(labeledbycharacter “1”), 3 bytes from collision #1 (labeled by character “2”), 5 bytes from collision #2 (labeled by character “3”) and 4 bytes from collision #3 (labeled by character “4”). Source #1 (corresponding to a different subdetector) stores 4 bytes from collision #0 (la- beled by character “5”) followed by the data from the collisions #1 to #3. Source #2 stores 4 bytes also from collision #0 (labeled by character “9”) and likewise for the remaining collisions. At this point, the transposition re-shuffles the data so that all the information from one col- lision is placed together. Therefore, in the “Output MEP contents”, buffer for collision #0 con- tains the previously mentioned 3 bytes from source #0 (labeled by character “1”), 4 bytes from source #1 (labeled by character “5”) and 4 bytes from source #2 (labeled by character “9”). Here is another example of the transposition: −−−−−−−−−−Input MEP contents−−−−−−−−−− Source #0 11111222333334444 Real-time applications on Intel Xeon/Phi 5
  • 6. March 26, 2016 Source #1 5566667777788888 Source #2 99aaaaabbbcc −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− −−−−−−−−−−Output MEP contents−−−−−−−−− C o l l i s i o n #0 111115599 C o l l i s i o n #1 2226666aaaaa C o l l i s i o n #2 3333377777bbb C o l l i s i o n #3 444488888cc −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− The lengths of MEP fragments (usually between 80-120 bytes per fragment) are repre- sented as 16bit integers and they are stored in a separate array. The reason for this is the per- formance improvement: more than one length value can be loaded into the cache line, so we can read and process several lengths of fragments with one cache load. ThebuffersforMEPfragmentsarestoredinanarrayofarrays. Thereisonearraymep_contents[i] foreachsource#i. Acontinuousblockofmemoryisallocatedforeverysuchbuffermep_contents[i]. However, two consecutive buffers do not necessarily have to be in a continuous block of mem- ory. The output array is saved in one continuous block of memory. It stores the “re-shuffled” copies of fragments, now grouped by collisions into collision blocks. Furthermore, the collision blocks are concatenated according to the collision index. For instance, the first example above would produce this output array: 111115599 2226666aaaaa 3333377777bbb 444488888cc The spaces were added for clarity, in order to separate different collisions. 1.2 ALGORITHM In order to copy the data for transposition (for each fragment of each source), two types of array offsets (represented as 32bit integers) need to be computed: • read_offsets[] is the array of offsets determining where to copy from. It is the number of bytes from the beginning of mep_contents[i] where source i is the source corresponding to the fragment. • write_offsets[] is the array of offsets determining where to copy to. It is the number of bytes from the beginning of the output array. Offsetsarecomputedbyapplyingprefixsumtoappropriateelementsofthearrayoflengths. The prefix sum is the following problem: given an array of numbers a[], produce an array s [] of the same size, where s[0] = 0 and s[i] = a[0] + a[1] + ... + a[i − 1] for i > 0. The prefix-sum problem is the core part of event sorting. Real-time applications on Intel Xeon/Phi 6
  • 7. March 26, 2016 Since prefix sum for read_offsets [] within one source buffer is independent of other com- putations in other source buffers, we may parallelize using #pragma omp parallel for. Similarly,prefixsumfor write_offsets [] canbealsoparallelizedusing#pragma omp parallel for (for details, see the function get_write_offsets_OMP_version() in prefix−sum.cpp). After the read_offsets and write_offsets are computed, the content of each fragment can be copied using the standard memcpy() function. For MEP fragments, this copy-task is inde- pendent of one another, and hence, can be run in parallel. Namely, #pragma omp parallel for has been used to parallelize the loop. This loop iterates over all MEP fragments and performs the memcopies. 1.3 THE GOAL The goal of the demos is to test the speed and the feasibility of the Xeon Phi for event- sorting. Possible performance improvements are studied, namely various parallelization techniques. Real-time applications on Intel Xeon/Phi 7
  • 8. March 26, 2016 2 OFFLOAD-BANDWIDTH This programmeasures the bandwidth between host and the deviceusing the #pragma offload directive... a) offloading only to the device: $ make && . / offload −bandwidth . exe −i 20 −e 1500000000 icpc −l r t main . cpp −o offload −bandwidth . exe Using MIC0 . . . Transferred : 30 GB Total time : 4.37726 secs Bandwidth : 6.8536 GBps b) offloading only to the device, and copying the result back: $ make && . / offload −bandwidth . exe −i 20 −e 1500000000 icpc −l r t main . cpp −o offload −bandwidth . exe Using MIC0 . . . Transferred : 60 GB Total time : 8.67822 secs Bandwidth : 6.91386 GBps This bandwidth corresponds to the speed of 50 Gbit/s PCIe interface between the host and the device. Here, the host machine is lhcb−phi.cern.ch (see Appendix A). The speed remains the same even when the offload-bandwidth is launched to all 4 Xeon Phi cards at the same time (as 4 concurrent processes). This means there are four 50 Gbit/s PCIe interfaces and each of them can be fully saturated during offloads. For more details, consult the README at https://github.com/mathemage/xphi-lhcb/ tree/master/src/offload-bandwidth#parallel-run-on-all-available-mics Real-time applications on Intel Xeon/Phi 8
  • 9. March 26, 2016 3 PREFIX-OFFSET This program implements and tests the speed of prefix sum calculation. a) 1000 iterations for the array size of 40000000, short int numbers a[i ] range from 0 to 100: Total time : 521.639 secs Processed : 7.66814e+07 elements per second b) 100000 iterations for the array size of 40000000, short int numbers a[i ] range from 0 to 65534: Total elements : 6000000000 Total time : 77.8086 secs Processed : 7.71123e+07 elements per second This is the result from 1 KNC card with lhcb−phi.cern.ch as the host (see Appendix A). For more details, see the README at https://github.com/mathemage/xphi-lhcb/tree/ master/src/prefix-offset#output Real-time applications on Intel Xeon/Phi 9
  • 10. March 26, 2016 4 EVENT-SORT LHCb Online owns 4 Intel Xeon Phi ”KNC” cards. They are available on lhcb−phi.cern.ch ma- chine (see Appendix A). 4.1 THE DISTRIBUTION OF ITERATION DURATIONS The simulation is iterated many times to avoid statistical fluctuations. Number of iterations is controlled via command-line argument −i. a) The results for 200 iterations: # . / event−sort . mic . exe −i 200 . . . −−−−−−−−−−SUMMARY−−−−−−−−−− Total elements : 2e+09 Time for computing read_offsets : 0.553636 secs Time for computing write_offse ts : 2.50423 secs Time for copying : 17.4631 secs Total time : 20.521 secs Total size : 230.013 GB Processed : 9.74612e+07 elements per second Throughput : 11.2087 GBps −−−−−−−−−−−−−−−−−−−−−−−−−−− Timeforcomputingread_offsetsisthetotaltimespentcalculatingprefixsumsforread_offsets [] , timeforcomputingwrite_offsetsisthetotaltimespentcalculatingprefixsumsfor write_offsets [] and time for copying is the total time of performing memcpy() of MEP fragments. b) The results and the histogram for 1000 iterations: −−−−−−−−STATISTICS OF TIME INTERVALS ( in secs)−−−−−−−−−−−− The i n i t i a l i t e r a t i o n : 0.43506 min : 0.10139 max : 0.10303 mean : 0.10216 . . . −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− −−−−−−−−STATISTICS OF THROUGHPUTS ( in GBps)−−−−−−−−−−−−−−− min : 11.16119 max : 11.34263 mean : 11.25702 Real-time applications on Intel Xeon/Phi 10
  • 11. March 26, 2016 . . . −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− −−−−−−−−−−SUMMARY−−−−−−−−−− Total elements : 1e+10 Time for computing read_offsets : 3.14013 secs Time for computing write_offse ts : 12.2161 secs Time for copying : 86.8014 secs Total time : 102.158 secs Total size : 1149.98 GB Processed : 9.7888e+07 elements per second Throughput : 11.2569 GBps −−−−−−−−−−−−−−−−−−−−−−−−−−− The histograms of the previous measurements: Real-time applications on Intel Xeon/Phi 11
  • 12. March 26, 2016 4.2 COMPARISON BETWEEN EVENT-SORT AND RAW MEMCPY The program memcpy-bandwidth tests only the throughput of the memcpy() function on the Intel Xeon Phi. It copies chunks (arrays) of data from one place to another (with OpenMP pa- rallelization). This process is iterated (50 times in the case below) and the final throughput is calculated. The number of threads is varied using #pragma omp parallel for num_threads(). The corre- sponding plot is in Figure 2. Real-time applications on Intel Xeon/Phi 12
  • 13. March 26, 2016 FIGURE 2: EVENT-SORT COMPARED TO RAW MEMCPY(), WITH VARIABLE NUMBER OF THREADS 4.3 BLOCKSCHEMES FOR MEMCPY The memory access patterns for event-sort can be optimized by splitting the workload into blocks or blockschemes of fragments. The serial version of event-sort would process frag- ments as shown in Figure 3. Each circle represents one MEP fragment, indexed by its source and its event. FIGURE 3: WITHOUT A BLOCKSCHEME Real-time applications on Intel Xeon/Phi 13
• 14. March 26, 2016
The previously mentioned parallelized event-sort would assign each circle to a single worker-thread. Since fragments are typically only 80-120 B in size, such a per-fragment memcpy is inefficient: the core caches are much larger and thus poorly utilized.
By assigning a whole block of the workload to every worker-thread, we reduce cache thrashing. There are 4 blocks of size 2x2 in the blockscheme of Figure 4, which would be processed by 4 worker-threads in parallel.
FIGURE 4: 2X2 BLOCKS
Moreover, the spatial locality of the data also plays an important role: fragments in the rows of the picture are stored in a contiguous block of memory. Thus, each block loads from and stores into only contiguous regions of memory.
The algorithm is given the block dimensions (in the picture: 2 sources per block, 2 events per block). The blocks are then distributed among worker-threads (by an OpenMP parallel for loop). Within every block, the assigned worker performs the memcpy calls using the previously computed read_offsets[] and write_offsets[]. A sketch of this blocked copy loop is given after the benchmark results below.
In order to find the optimal block dimensions, a series of benchmark tests has been carried out. The results are represented in the following heatmap:
Real-time applications on Intel Xeon/Phi 14
• 15. March 26, 2016
FIGURE 5: EVENT-SORT WITH VARIOUS PARAMETERS OF BLOCKSCHEME (KNC)
The event-sort with the optimal block dimensions (according to the heatmap on the right side):
# ./upload-to-MIC.sh -i 100 -1 5 -2 28
...
____________SUMMARY____________
Total elements : 1e+09
Time for computing read_offsets : 0.28435 secs
Time for computing write_offsets : 1.13954 secs
Time for copying : 3.1574 secs
Total time : 4.58129 secs
Total size : 114.998 GB
Processed : 2.18279e+08 elements per second
Throughput : 25.1016 GBps
_______________________________
Comparing the times, about 69 % of the total time is spent doing memcopies; the rest is the computation of offsets. Moreover, the overall throughput has improved by a factor of more than 2 (compare with Section 4.1).
Real-time applications on Intel Xeon/Phi 15
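A minimal sketch of the blocked copy loop described above — block dimensions passed in, blocks distributed over OpenMP threads, and a plain memcpy per fragment inside each block — might look as follows. The flattened offset arrays and all names are assumptions made for illustration, not the actual copy_MEPs_block_scheme() code.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>

// The (source, event) grid is tiled into blocks of
// sources_per_block x events_per_block fragments; each block is one unit of
// work for one OpenMP thread. Offsets are assumed to be stored flattened in
// source-major order: index = s * n_events + e.
void copy_blocked(const uint8_t *input, uint8_t *output,
                  const uint64_t *read_offsets, const uint64_t *write_offsets,
                  const uint32_t *lengths,
                  size_t n_sources, size_t n_events,
                  size_t sources_per_block, size_t events_per_block,
                  int nthreads) {
  const size_t s_blocks = (n_sources + sources_per_block - 1) / sources_per_block;
  const size_t e_blocks = (n_events + events_per_block - 1) / events_per_block;

  #pragma omp parallel for collapse(2) num_threads(nthreads)
  for (size_t sb = 0; sb < s_blocks; ++sb)
    for (size_t eb = 0; eb < e_blocks; ++eb) {
      const size_t s_end = std::min(n_sources, (sb + 1) * sources_per_block);
      const size_t e_end = std::min(n_events,  (eb + 1) * events_per_block);
      for (size_t s = sb * sources_per_block; s < s_end; ++s)
        for (size_t e = eb * events_per_block; e < e_end; ++e) {
          const size_t i = s * n_events + e;
          std::memcpy(output + write_offsets[i], input + read_offsets[i],
                      lengths[i]);
        }
    }
}

In the measured optimum above (-1 5 -2 28), each block would span 5 fragments along one dimension and 28 along the other (140 fragments per block), so a thread works through a sizeable contiguous region before moving on.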
• 16. March 26, 2016
4.4 ASLR ON KNC AND ITS EFFECT ON EVENT-SORT
Address Space Layout Randomization (ASLR) was suspected to cause large inconsistencies in the results on the KNL Xeon Phi. This was pointed out by Wim Heirman. This is the e-mail conversation with him:
Hi Karel,
I did some more runs, now with Linux address randomization turned on (my machine had it disabled previously). I do see some large variations now. Do you have address randomization turned on for your machine? (see output of "sysctl kernel.randomize_va_space", 0 means disabled while 1 and 2 enable different parts of it). Can you do a few more runs with a disabled setting? (See [1], I think the setarch -R option should work even if you don't have root access).
Regards, Wim
[1] http://stackoverflow.com/questions/11238457/disable-and-re-enable-address-space-layout-randomization-only-for-myself
I have tried my application on the KNCs with various settings of ASLR. There were 100 experiments (runs), each performing only 1 iteration.
For kernel.randomize_va_space = 0:
mean = 20.0434
min = 19.6947
max = 20.4567
standard deviation = 0.1267
For kernel.randomize_va_space = 1:
mean = 20.3565
min = 19.5846
max = 21.1473
standard deviation = 0.3669
For kernel.randomize_va_space = 2:
mean = 20.305
min = 19.555
max = 21.1037
standard deviation = 0.3641
In conclusion, ASLR does seem to have some effect on the variation: with randomization disabled, the standard deviation drops to roughly a third of its value with ASLR enabled.
4.5 FIXATION OF INPUT DATA
Rainer and I had a hypothesis that the throughput of event-sort may depend strongly on the input data sizes (e.g. whether the fragment lengths fit cache lines). In order to test this idea, I have implemented an option --srand-seed. It sets a custom seed for the srand() function, which is used for randomizing the input data. Hence, by initializing to a (chosen) custom seed, the input will always be the same between different runs. A minimal sketch of this seeding is given below.
Real-time applications on Intel Xeon/Phi 16
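The input fixation itself amounts to nothing more than seeding the pseudo-random generator that produces the fragment lengths; a sketch, with an illustrative length range and function name (not taken from the repository), is shown here.

#include <cstdint>
#include <cstdlib>
#include <vector>

// Sketch of input fixation via --srand-seed: with the same seed, rand()
// produces the same sequence of fragment lengths on every run, so the input
// data are identical between runs.
std::vector<uint32_t> generate_fragment_lengths(size_t n, unsigned srand_seed) {
  std::srand(srand_seed);                 // fixed seed => reproducible input
  std::vector<uint32_t> lengths(n);
  for (size_t i = 0; i < n; ++i)
    lengths[i] = 80 + std::rand() % 41;   // e.g. 80-120 B, as in Section 4.3
  return lengths;
}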
• 17. March 26, 2016
For the range of seeds from 0 to 100, I have studied the variability (mean, standard deviation, min, max) of the resulting throughputs. A screenshot of the results can be found in Figure 6. The mean, the (sample-based) standard deviation, the min and the max are always taken from 10 runs, each of which initializes srand() to the same seed (the one given in the first column). Blue and red cells mark the min and the max, respectively, of the values in the corresponding column.
For comparison, here is an entirely serial version (i.e. copy_MEPs_serial_version()) with two chosen seeds:
• srand-seed == 83:
mean = 0.111149, standard deviation = 4.47532e-05, min = 0.111081, max = 0.111204
mean = 0.111167, standard deviation = 8.98804e-05, min = 0.111082, max = 0.111397
mean = 0.11108, standard deviation = 0.000120816, min = 0.110984, max = 0.111401
• srand-seed == 89:
mean = 0.111119, standard deviation = 5.10757e-05, min = 0.11104, max = 0.111186
mean = 0.111151, standard deviation = 5.33504e-05, min = 0.111079, max = 0.111227
mean = 0.111093, standard deviation = 0.000144087, min = 0.110992, max = 0.111487
There was no OpenMP parallelization in the copying part, but there are still two OpenMP-parallel functions in the computation part; that is why the deviation is not exactly 0.
The conclusion is: even though the deviation is small, it is clearly not (almost) 0. This suggests that the variation has another cause, possibly the non-determinism of thread scheduling.
Real-time applications on Intel Xeon/Phi 17
• 18. March 26, 2016
FIGURE 6: EVENT-SORT (IN GBYTES/S) ON KNC FOR VARIOUS ASLR SETTINGS AND VARIOUS FIXATED INPUT DATA (DEPENDENT ON THE SEED)
Real-time applications on Intel Xeon/Phi 18
• 19. March 26, 2016
4.6 VARYING OF NUMBER OF COPY-THREADS
Another idea is to fixate the input data and vary the number of threads performing the copying part. This is done by OpenMP here:
void copy_MEPs_block_scheme() {
  ...
  #pragma omp parallel for num_threads(nthreads)
  ...
}
Figure 7 shows the dependency of the (sample-based) standard deviation on the number of copying threads. The deviation is computed over 10 experiments (runs). The tested numbers of copy-threads are 1, 2, 4, 8, 16, 32 and 64. Figure 8 shows the identical experiment for all numbers of copy-threads from 1 to 64.
From the latter figure, it seems there is no apparent dependency between the number of copy-threads and the standard deviation of the runs. A sketch of the sweep used for these measurements follows.
Real-time applications on Intel Xeon/Phi 19
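The sweep itself could be driven by a small harness like the sketch below, which reruns the copy with a fixed seed for each thread count and accumulates the sample-based standard deviation. run_event_sort_copy() is a hypothetical stand-in for one timed run of the real program, used here only to make the sketch self-contained.

#include <cmath>
#include <cstdio>
#include <vector>

// Placeholder for one timed run of the copying part with a fixed input seed
// and a given number of copy-threads; in reality this would launch event-sort
// and return the measured throughput. Hypothetical, for illustration only.
double run_event_sort_copy(int copy_threads, unsigned srand_seed) {
  (void) copy_threads; (void) srand_seed;
  return 11.2;   // dummy value standing in for a measured GBps figure
}

int main() {
  const int runs = 10;
  const unsigned seed = 83;                // fixed input data (Section 4.5)

  for (int threads = 1; threads <= 64; ++threads) {
    std::vector<double> x(runs);
    for (int r = 0; r < runs; ++r)
      x[r] = run_event_sort_copy(threads, seed);

    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= runs;

    double var = 0.0;                      // sample-based (n - 1) variance
    for (double v : x) var += (v - mean) * (v - mean);
    var /= (runs - 1);

    std::printf("%2d copy-threads: mean = %.4f, stddev = %.6f\n",
                threads, mean, std::sqrt(var));
  }
}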
  • 20. March 26, 2016 FIGURE 7: NUMBER OF (COPYING) THREADS VS. STANDARD DEVIATION OF RESULTS (1, 2, 4, 8, 16, 32, 64 THREADS) Real-time applications on Intel Xeon/Phi 20
  • 21. March 26, 2016 FIGURE 8: NUMBER OF (COPYING) THREADS VS. STANDARD DEVIATION OF RESULTS (1, 2, 3, 4, · · · , 64 THREADS) Real-time applications on Intel Xeon/Phi 21
• 22. March 26, 2016
5 SOME IDEAS FOR FUTURE WORK
• "Recompile" the event-sort project using the ispc compiler: https://ispc.github.io/. This compiler has promising auto-vectorization capabilities.
• Write unit tests for the project, for instance using the Google Test framework: https://github.com/google/googletest
• Use CMake instead of hand-written Makefiles: https://cmake.org/
• Consider (try, test and benchmark) the usage of Intel TBB for the prefix-sum functions: https://www.threadingbuildingblocks.org/ (a sketch of what this could look like follows this list)
• Consider (try, test and benchmark) the usage of OpenCL for the prefix-sum functions: https://www.khronos.org/opencl/
• Run high_performance_linpack_benchmark on the Xeon Phi: https://lbdokuwiki.cern.ch/doku.php?id=upgrade:high_performance_linpack_benchmark
• Participate in the CERN Concurrency Forum: http://concurrency.web.cern.ch/
Real-time applications on Intel Xeon/Phi 22
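As an illustration of the TBB idea, the prefix sums over fragment lengths could be expressed with tbb::parallel_scan roughly as in the sketch below. This is untested here and only indicates the intended direction; names are illustrative, and depending on the TBB version the functor-based interface may be required instead of the lambda form.

#include <cstdint>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_scan.h>

// Exclusive prefix sum of fragment lengths with tbb::parallel_scan, as a
// possible replacement for the hand-written OpenMP prefix-sum functions.
std::vector<uint64_t> prefix_offsets_tbb(const std::vector<uint32_t> &lengths) {
  std::vector<uint64_t> offsets(lengths.size());

  tbb::parallel_scan(
      tbb::blocked_range<size_t>(0, lengths.size()),
      uint64_t(0),
      [&](const tbb::blocked_range<size_t> &r, uint64_t sum, bool is_final) {
        for (size_t i = r.begin(); i != r.end(); ++i) {
          if (is_final)
            offsets[i] = sum;   // exclusive scan: store before adding length
          sum += lengths[i];
        }
        return sum;
      },
      [](uint64_t left, uint64_t right) { return left + right; });

  return offsets;
}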
• 23. March 26, 2016
6 CONCLUSION
The simulations of the event-sorting task show that the KNC is capable of delivering a throughput of about 25 GB/s. Our aim was to reach 12.5 GB/s, so as to saturate a 100 Gbit/s Ethernet network, which is one of the candidate networks for the LHCb upgrade. This has been accomplished by splitting the workload into blocks of fragments and letting the threads memcopy whole blocks of fragments rather than doing it fragment by fragment.
The excess throughput can be exploited as additional computing power! For example, some portion of the Xeon Phi cards' resources (cores, number of threads) can be allocated to event-sorting (just enough for 12.5 GB/s), whereas the remaining capacity may be used for other algorithms, so as to start the reconstruction process already at this very early stage. Thus, the overall quality of the decisions whether to store or discard the events would improve.
Real-time applications on Intel Xeon/Phi 23
• 24. March 26, 2016
A INFRASTRUCTURE
The LHCb Online group provides the server machine lhcb-phi.cern.ch. This host machine contains 32 Intel(R) Xeon(R) 2.00GHz processors:
[kha@lhcb-phi kha]$ less /proc/cpuinfo | tail -n 26
processor : 31
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
stepping : 7
microcode : 1808
cpu MHz : 1200.000
cache size : 20480 KB
physical id : 1
siblings : 16
core id : 7
cpu cores : 8
apicid : 47
initial apicid : 47
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 4014.16
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
with the operating system:
[kha@lhcb-phi kha]$ uname -a
Linux lhcb-phi 2.6.32-504.el6.x86_64 #1 SMP Tue Oct 14 11:22:00 CDT 2014 x86_64 x86_64 x86_64 GNU/Linux
On top of that, there are also 4 Intel KNC Xeon Phi cards (the so-called "devices", here MIC0, MIC1, MIC2 and MIC3). They are connected via PCIe 50 Gbit/s lanes to the host, and each of them has 228 processors:
[xeonphi@lhcb-phi-mic0 ~]$ less /proc/cpuinfo | tail -n 26
processor : 31
vendor_id : GenuineIntel
cpu family : 11
model : 1
model name : 0b/01
stepping : 3
cpu MHz : 1100.000
cache size : 512 KB
physical id : 0
siblings : 228
core id : 56
cpu cores : 57
apicid : 227
initial apicid : 227
fpu : yes
fpu_exception : yes
cpuid level : 4
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr mca pat fxsr ht syscall nx lm nopl lahf_lm
bogomips : 2205.22
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
each with the operating system:
[kha@lhcb-phi kha]$ uname -a
Linux lhcb-phi 2.6.32-504.el6.x86_64 #1 SMP Tue Oct 14 11:22:00 CDT 2014 x86_64 x86_64 x86_64 GNU/Linux
Real-time applications on Intel Xeon/Phi 24
• 25. March 26, 2016
B COMPILERS
The source code is written in C++ and uses OpenMP for task-based parallelization. It requires the Intel compiler:
[kha@lhcb-phi event-sort]$ icpc -V
Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.3.187 Build 20150407
Copyright (C) 1985-2015 Intel Corporation. All rights reserved.
or Intel's version of the gcc compiler for cross-compilation for the Xeon Phi:
[kha@lhcb-phi event-sort]$ /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-g++ -v
Using built-in specs.
COLLECT_GCC=/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-g++
COLLECT_LTO_WRAPPER=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/libexec/k1om-mpss-linux/gcc/k1om-mpss-linux/4.7.0/lto-wrapper
Target: k1om-mpss-linux
Configured with: /sandbox/build/tmp/tmp/work/x86_64-nativesdk-mpsssdk-linux/gcc-cross-canadian-k1om-4.7.0+mpss3.5.1-1/gcc-4.7.0+mpss3.5.1/configure --build=x86_64-linux --host=x86_64-mpsssdk-linux --target=k1om-mpss-linux --prefix=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr --exec_prefix=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr --bindir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/bin/k1om-mpss-linux --sbindir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/bin/k1om-mpss-linux --libexecdir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/libexec/k1om-mpss-linux --datadir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/share --sysconfdir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/etc --sharedstatedir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/com --localstatedir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/var --libdir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/lib/k1om-mpss-linux --includedir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/include --oldincludedir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/include --infodir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/share/info --mandir=/opt/mpss/3.5.1/sysroots/x86_64-mpsssdk-linux/usr/share/man --disable-silent-rules --disable-dependency-tracking --with-libtool-sysroot=/sandbox/build/tmp/tmp/sysroots/x86_64-nativesdk-mpsssdk-linux --with-gnu-ld --enable-shared --enable-languages=c,c++ --enable-threads=posix --disable-multilib --enable-c99 --enable-long-long --enable-symvers=gnu --enable-libstdcxx-pch --program-prefix=k1om-mpss-linux- --enable-target-optspace --enable-lto --enable-libssp --disable-bootstrap --disable-libgomp --disable-libmudflap --with-system-zlib --with-linker-hash-style=gnu --enable-cheaders=c_global --with-local-prefix=/opt/mpss/3.5.1/sysroots/k1om-mpss-linux/usr --with-gxx-include-dir=/opt/mpss/3.5.1/sysroots/k1om-mpss-linux/usr/include/c++ --with-build-time-tools=/sandbox/build/tmp/tmp/sysroots/x86_64-linux/usr/k1om-mpss-linux/bin --with-sysroot=/opt/mpss/3.5.1/sysroots/k1om-mpss-linux --with-build-sysroot=/sandbox/build/tmp/tmp/sysroots/knightscorner --disable-libunwind-exceptions --disable-libssp --disable-libgomp --disable-libmudflap --with-mpfr=/sandbox/build/tmp/tmp/sysroots/x86_64-nativesdk-mpsssdk-linux --with-mpc=/sandbox/build/tmp/tmp/sysroots/x86_64-nativesdk-mpsssdk-linux --enable-nls --enable-__cxa_atexit
Thread model: posix
gcc version 4.7.0 20110509 (experimental) (GCC)
C REPRODUCING THE EVENT-SORT RESULTS
C.1 SOURCE CODE AND SETUP
The source code is available on GitHub: https://github.com/mathemage/xphi-lhcb
$ git clone git@github.com:mathemage/xphi-lhcb.git
Cloning into 'xphi-lhcb'...
...
$ cd xphi-lhcb/
Then source the CERN setup script for the Intel tools:
source /afs/cern.ch/sw/IntelSoftware/linux/all-setup.sh
To enable OpenMP, find the libiomp5.so file:
$ find /afs/cern.ch/sw/IntelSoftware/linux/x86_64/ -name libiomp5.so
/afs/cern.ch/sw/IntelSoftware/linux/x86_64/cce/10.1.008/lib/libiomp5.so
Real-time applications on Intel Xeon/Phi 25
• 26. March 26, 2016
/afs/cern.ch/sw/IntelSoftware/linux/x86_64/Compiler/11.1/059/lib/ia32/libiomp5.so
/afs/cern.ch/sw/IntelSoftware/linux/x86_64/Compiler/11.1/059/lib/intel64/libiomp5.so
...
...and copy it into the xphi-lhcb/lib/ folder.
Note: the instructions below were carried out and are valid for the commit:
commit ae7bc6ff540fbbdc0c1b09382f5e821e0c40e6dc
Author: Karel Ha <mathemage@gmail.com>
Date: Thu Oct 8 13:17:58 2015 +0200
    Change location of libiomp5.so
(The output produced by later versions of the repository may differ.)
C.2 OFFLOAD-BANDWIDTH
Change to the directory xphi-lhcb/src/offload-bandwidth/ and launch the program once for each MIC card (i.e. 4 processes in our case):
[kha@lhcb-phi offload-bandwidth]$ ./run-on-all-MICs.sh
icpc -lrt main.cpp -o offload-bandwidth.exe
Launching offload-bandwith on MIC 0 ...
Launching offload-bandwith on MIC 1 ...
Launching offload-bandwith on MIC 2 ...
Launching offload-bandwith on MIC 3 ...
After a while, when all processes finish, you may check the output in the following way:
[kha@lhcb-phi offload-bandwidth]$ cat *.out
Using MIC0 ...
Transferred : 90 GB
Total time : 13.1119 secs
Bandwidth : 6.864 GBps
Using MIC1 ...
Transferred : 90 GB
Total time : 13.5207 secs
Bandwidth : 6.65647 GBps
Using MIC2 ...
Transferred : 90 GB
Total time : 13.1548 secs
Bandwidth : 6.84162 GBps
Using MIC3 ...
Transferred : 90 GB
Total time : 25.9486 secs
Bandwidth : 3.4684 Gbps
C.3 PREFIX-OFFSET
Change to the directory xphi-lhcb/src/prefix-offset/ and run the script:
[kha@lhcb-phi prefix-offset]$ ./upload-to-MIC.sh
icpc -lrt -I../../include -openmp -std=c++14 -mmic main.cpp ../utils.cpp ../prefix-sum.cpp -o mic-prefix-offset.exe
mic-prefix-offset.exe    100%   64KB  64.4KB/s   00:00
libiomp5.so              100% 1268KB   1.2MB/s   00:00
Generated random lengths : Too many numbers to display!
Offsets : Too many numbers to display!
Total elements : 200000000
Total time : 2.57888 secs
Processed : 7.75531e+07 elements per second
Processed : 0 GBps
C.4 EVENT-SORT
Change to the directory xphi-lhcb/src/event-sort/ and run the script:
Real-time applications on Intel Xeon/Phi 26
• 27. March 26, 2016
[kha@lhcb-phi event-sort]$ ./upload-to-MIC.sh
Using MIC0 ...
icpc -g -lrt -I../../include -openmp -std=c++14 -qopt-report3 -qopt-report-phase=vec -mmic main.cpp ../prefix-sum.cpp ../utils.cpp -o event-sort.mic.exe
icpc: remark #10397: optimization reports are generated in *.optrpt files in the output location
event-sort.mic.exe       100%  143KB 142.6KB/s   00:00
benchmarks.sh            100%  898    0.9KB/s   00:00
libiomp5.so              100% 1268KB   1.2MB/s   00:00
--------STATISTICS OF TIME INTERVALS--------
The initial iteration : 0.47684 secs
min : 0.15831 secs
max : 0.15947 secs
mean : 0.15889 secs
Histogram :
[0.15831, 0.15860): 2 times
[0.15860, 0.15889): 4 times
[0.15889, 0.15918): 2 times
[0.15918, 0.15947): 2 times
--------------------------------------------
----------SUMMARY----------
Total elements : 1e+08
Time for computing read_offsets : 0.159042 secs
Time for computing write_offsets : 0.288448 secs
Time for copying : 1.14138 secs
Total time : 1.58887 secs
Total size : 11.5004 GB
Processed : 6.29379e+07 elements per second
Throughput : 7.23812 GBps
---------------------------
This script cross-compiles the source code for the Intel Xeon Phi architecture and uploads the binaries and required libraries using scp. On the MIC, the binary is called with the default settings of the parameters.
You can also run several benchmark tests, varying the number of sources and the MEP factor as well as the number of iterations:
[kha@lhcb-phi event-sort]$ ./upload-to-MIC.sh -b
Running benchmarks.sh
Using MIC0 ...
icpc -g -lrt -I../../include -openmp -std=c++14 -qopt-report3 -qopt-report-phase=vec -mmic main.cpp ../prefix-sum.cpp ../utils.cpp -o event-sort.mic.exe
icpc: remark #10397: optimization reports are generated in *.optrpt files in the output location
event-sort.mic.exe       100%  143KB 142.6KB/s   00:00
benchmarks.sh            100%  898    0.9KB/s   00:00
libiomp5.so              100% 1268KB   1.2MB/s   00:00
Varying the number of sources and the MEP factor ...
./event-sort.mic.exe -s 1 -m 10000000
...
Varying the number of iterations ...
...
Real-time applications on Intel Xeon/Phi 27