Sasha Goldshtein
CTO, Sela Group
Task and Data Parallelism
Agenda
•Multicore machines have been a cheap
commodity for >10 years
•Adoption of concurrent programming is
still slow
•Patterns and best practices are scarce
•We discuss the APIs first…
•…and then turn to examples, best
practices, and tips
TPL Evolution
• 2008: incubated for 3 years as “Parallel Extensions for .NET”
• 2010: released in full glory with .NET 4.0
• 2012: DataFlow in .NET 4.5 (NuGet); augmented with language support (await, async methods)
• The Future: GPU parallelism? SIMD support? Language-level parallelism?
Tasks
•A task is a unit of work
–May be executed in parallel with other tasks by
a scheduler (e.g. Thread Pool)
–Much more than threads, and yet much
cheaper
//The .NET 4.0 version
Task<string> t = Task.Factory.StartNew(
  () => { return DnaSimulation(…); });
t.ContinueWith(r => Show(r.Exception),
  TaskContinuationOptions.OnlyOnFaulted);
t.ContinueWith(r => Show(r.Result),
  TaskContinuationOptions.OnlyOnRanToCompletion);
DisplayProgress();

//The C# 5.0 version
try {
  var task = Task.Run(DnaSimulation);
  DisplayProgress();
  Show(await task);
} catch (Exception ex) {
  Show(ex);
}
Parallel Loops
•Ideal for parallelizing work over a collection
of data
•Easy porting of for and foreach loops
–Beware of inter-iteration dependencies!
Parallel.For(0, 100, i => {
...
});
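Parallel.ForEach(urls, url => {
  //One WebClient per iteration: an instance does not support concurrent calls
  using (var webClient = new WebClient())
    webClient.UploadData(url, "POST", data);
});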
Parallel LINQ
•Mind-bogglingly easy parallelization of
LINQ queries
•Can introduce ordering into the pipeline, or
preserve order of original elements
var query = from monster in monsters.AsParallel()
where monster.IsAttacking
let newMonster = SimulateMovement(monster)
orderby newMonster.XP
select newMonster;
query.ForAll(monster => Move(monster));
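The ordering bullet above can be illustrated with a minimal sketch. It assumes a hypothetical helper, SquaresOfEvens, that is not part of the deck; the only PLINQ calls used are AsParallel, AsOrdered, Where, Select and ToArray.

```csharp
using System;
using System.Linq;

public static class OrderedQuery
{
    // Hypothetical helper: filters and squares in parallel.
    // AsOrdered() asks PLINQ to deliver results in source order.
    public static int[] SquaresOfEvens(int[] source)
    {
        return source.AsParallel()
                     .AsOrdered()
                     .Where(n => n % 2 == 0)
                     .Select(n => n * n)
                     .ToArray();
    }
}
```

Without AsOrdered() the same query may return the elements in any order. Note also that ForAll hands results to its delegate concurrently, so an orderby earlier in the query does not dictate the order in which the delegate runs.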
Measuring Concurrency
•Visual Studio Concurrency Visualizer to the
rescue
Recursive Parallelism Extraction
•Divide-and-conquer algorithms are often
parallelized through the recursive call
–Be careful with parallelization threshold and
watch out for dependencies
void FFT(float[] src, float[] dst, int n, int r, int s) {
  if (n == 1) {
    dst[r] = src[r];
  } else {
    FFT(src, dst, n/2, r, s*2);
    FFT(src, dst, n/2, r+s, s*2);
    //Combine the two halves in O(n) time
  }
}
//Parallel version: the two recursive calls are independent
Parallel.Invoke(
  () => FFT(src, dst, n/2, r, s*2),
  () => FFT(src, dst, n/2, r+s, s*2)
);
DEMO
Recursive parallel QuickSort
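The demo code is not part of the transcript, so what follows is only a sketch of what a recursive parallel QuickSort with a parallelization threshold might look like. The class name, the Threshold value of 4096 and the Lomuto partition scheme are all assumptions made for illustration.

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelSort
{
    // Assumed cutoff: ranges smaller than this are sorted sequentially,
    // because spawning tasks for tiny ranges costs more than it saves.
    const int Threshold = 4096;

    public static void QuickSort(int[] items)
    {
        QuickSort(items, 0, items.Length - 1);
    }

    static void QuickSort(int[] items, int left, int right)
    {
        if (left >= right) return;
        int pivot = Partition(items, left, right);
        if (right - left < Threshold)
        {
            QuickSort(items, left, pivot - 1);
            QuickSort(items, pivot + 1, right);
        }
        else
        {
            // The two halves touch disjoint index ranges, so they are independent.
            Parallel.Invoke(
                () => QuickSort(items, left, pivot - 1),
                () => QuickSort(items, pivot + 1, right));
        }
    }

    // Lomuto partition using the middle element as the pivot.
    static int Partition(int[] items, int left, int right)
    {
        Swap(items, left + (right - left) / 2, right);
        int pivotValue = items[right];
        int store = left;
        for (int i = left; i < right; ++i)
            if (items[i] < pivotValue) Swap(items, i, store++);
        Swap(items, store, right);
        return store;
    }

    static void Swap(int[] items, int a, int b)
    {
        int tmp = items[a]; items[a] = items[b]; items[b] = tmp;
    }
}
```

The right threshold depends on the element type and the machine, so it is worth measuring rather than guessing. This sketch is also not tuned for inputs with many duplicate keys, where the partitions become very uneven.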
Symmetric Data Processing
•For a large set of uniform data items that
need to be processed, parallel loops are usually
the best choice and lead to ideal work
distribution
•Inter-iteration dependencies complicate
things (think in-place blur)
Parallel.For(0, image.Rows, i => {
for (int j = 0; j < image.Cols; ++j) {
destImage.SetPixel(i, j, PixelBlur(image, i, j));
}
});
Uneven Work Distribution
•With non-uniform data items, use custom
partitioning or manual distribution
–Primes: 7 is easier to check than 10,320,647
var work = Enumerable.Range(0, Environment.ProcessorCount)
.Select(n => Task.Run(() =>
CountPrimes(start+chunk*n, start+chunk*(n+1))));
Task.WaitAll(work.ToArray());
versus
Parallel.ForEach(Partitioner.Create(Start, End, chunkSize),
chunk => CountPrimes(chunk.Item1, chunk.Item2)
);
DEMO
Uneven workload distribution
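A self-contained version of the range-partitioner approach might look like the sketch below. IsPrime, CountPrimes and CountPrimesParallel are hypothetical stand-ins for the demo's helpers, which the transcript does not show. Small chunks are handed out on demand, so a thread that draws cheap numbers simply comes back for more work.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class PrimeCounter
{
    // Trial division; the cost grows with n, which makes the work uneven.
    public static bool IsPrime(int n)
    {
        if (n < 2) return false;
        if (n % 2 == 0) return n == 2;
        for (int d = 3; (long)d * d <= n; d += 2)
            if (n % d == 0) return false;
        return true;
    }

    // Counts primes in the half-open range [from, to).
    public static int CountPrimes(int from, int to)
    {
        int count = 0;
        for (int i = from; i < to; ++i)
            if (IsPrime(i)) ++count;
        return count;
    }

    // Splits [start, end) into chunks of chunkSize that are distributed dynamically.
    public static int CountPrimesParallel(int start, int end, int chunkSize)
    {
        int total = 0;
        Parallel.ForEach(Partitioner.Create(start, end, chunkSize), chunk =>
        {
            int local = CountPrimes(chunk.Item1, chunk.Item2);
            Interlocked.Add(ref total, local); // one synchronized add per chunk
        });
        return total;
    }
}
```

With one large chunk per processor, as in the first snippet above, the task that receives the largest numbers finishes last while the others sit idle.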
Complex Dependency Management
•Must extract all dependencies and
incorporate them into the algorithm
–Typical scenarios: 1D loops, dynamic
programming algorithms
–Edit distance: each task depends on 2
predecessors, wavefront
C = x[i-1] == y[j-1] ? 0 : 1;
D[i, j] = min(
D[i-1, j] + 1,
D[i, j-1] + 1,
D[i-1, j-1] + C);
[Figure: wavefront sweeping the D matrix from cell (0,0) to cell (m,n)]
DEMO
Dependency management
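The demo code is not in the transcript. One way to sketch the wavefront idea, under the assumption that each matrix cell is its own unit of work, is to walk the anti-diagonals of D in order and fill every cell on one anti-diagonal in parallel. The class and method names are hypothetical.

```csharp
using System;
using System.Threading.Tasks;

public static class EditDistance
{
    public static int Compute(string x, string y)
    {
        int m = x.Length, n = y.Length;
        var D = new int[m + 1, n + 1];
        for (int i = 0; i <= m; ++i) D[i, 0] = i;
        for (int j = 0; j <= n; ++j) D[0, j] = j;

        // Cells with i + j == diag read only from diagonals diag-1 and diag-2,
        // so all cells on one anti-diagonal can be computed independently.
        for (int d = 2; d <= m + n; ++d)
        {
            int diag = d; // local copy captured by the lambda
            int iMin = Math.Max(1, diag - n);
            int iMax = Math.Min(m, diag - 1);
            Parallel.For(iMin, iMax + 1, i =>
            {
                int j = diag - i;
                int c = x[i - 1] == y[j - 1] ? 0 : 1;
                D[i, j] = Math.Min(
                    Math.Min(D[i - 1, j] + 1, D[i, j - 1] + 1),
                    D[i - 1, j - 1] + c);
            });
        }
        return D[m, n];
    }
}
```

A single cell is far too little work per parallel iteration, so this form mostly pays synchronization cost. A practical version would apply the same wavefront over square blocks of cells, where each block waits for its upper and left neighbours (the diagonal neighbour is already finished by then).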
Synchronization > Aggregation
•Excessive synchronization brings parallel
code to its knees
–Try to avoid shared state
–Aggregate thread- or task-local state and merge
Parallel.ForEach(
  Partitioner.Create(Start, End, ChunkSize),
  () => new List<int>(), //initial local state
  (range, pls, localPrimes) => { //aggregator
    for (int i = range.Item1; i < range.Item2; ++i)
      if (IsPrime(i)) localPrimes.Add(i);
    return localPrimes;
  },
  localPrimes => { lock (primes) //combiner
    primes.AddRange(localPrimes);
  });
DEMO
Aggregation
Creative Synchronization
• We implement a collection of stock prices,
initialized with 10^5 name/price pairs
– 10^7 reads/s, 10^6 “update” writes/s, 10^3 “add”
writes/day
– Many reader threads, many writer threads
GET(key):
  if safe contains key then return safe[key]
  else lock { return unsafe[key] }
PUT(key, value):
  if safe contains key then safe[key] = value
  else lock { unsafe[key] = value }
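The pseudocode leaves open how the "safe" collection tolerates concurrent updates. One possible reading, sketched below, is that its key set is frozen after construction and each value lives in a small box object that is updated atomically, while keys added later go to a second dictionary behind a lock. The StockPrices and PriceBox names and the use of double for prices are assumptions, not the speaker's implementation.

```csharp
using System.Collections.Generic;
using System.Threading;

public sealed class StockPrices
{
    // Mutable holder, so an update never modifies the dictionary itself.
    sealed class PriceBox { public double Value; }

    // Keys are fixed after construction: concurrent lookups need no lock.
    readonly Dictionary<string, PriceBox> _safe = new Dictionary<string, PriceBox>();
    // Keys added later are rare, so a plain lock is good enough here.
    readonly Dictionary<string, double> _unsafe = new Dictionary<string, double>();
    readonly object _gate = new object();

    public StockPrices(IEnumerable<KeyValuePair<string, double>> initial)
    {
        foreach (var pair in initial)
            _safe[pair.Key] = new PriceBox { Value = pair.Value };
    }

    public bool TryGet(string key, out double price)
    {
        PriceBox box;
        if (_safe.TryGetValue(key, out box))
        {
            price = Volatile.Read(ref box.Value);
            return true;
        }
        lock (_gate) return _unsafe.TryGetValue(key, out price);
    }

    public void Put(string key, double price)
    {
        PriceBox box;
        if (_safe.TryGetValue(key, out box))
        {
            Interlocked.Exchange(ref box.Value, price); // atomic 64-bit write
            return;
        }
        lock (_gate) _unsafe[key] = price;
    }
}
```

The frequent operations (reads and updates of existing keys) never take the lock; only the rare "add" path and lookups of late-added keys do.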
Lock-Free Patterns (1)
•Try to avoid Windows synchronization and
use hardware synchronization
–Primitive operations such as
Interlocked.Increment,
Interlocked.CompareExchange
–Retry pattern with
Interlocked.CompareExchange enables
arbitrary lock-free algorithms
int InterlockedMultiply(ref int x, int y) {
int t, r;
do {
t = x;
r = t * y;
}
while (Interlocked.CompareExchange(ref x, r, t) != t);
return r;
}
(Callouts: CompareExchange returns the old value of x; r is the new value, t is the comparand.)
Lock-Free Patterns (2)
•User-mode spinlocks (SpinLock class) can
replace locks you acquire very often, which
protect tiny computations
class __DontUseMe__SpinLock {
private volatile int _lck;
public void Enter() {
while (Interlocked.CompareExchange(ref _lck, 1, 0) != 0);
}
public void Exit() {
_lck = 0;
}
}
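In contrast to the hand-rolled lock above, here is a minimal sketch of how the framework's System.Threading.SpinLock might be used to protect a tiny computation. The SpinLockedCounter class is a hypothetical example, not code from the talk.

```csharp
using System.Threading;

public sealed class SpinLockedCounter
{
    // SpinLock is a struct: keep it in a non-readonly field and never copy it.
    private SpinLock _lock = new SpinLock(enableThreadOwnerTracking: false);
    private long _value;

    public void Add(long amount)
    {
        bool lockTaken = false;
        try
        {
            _lock.Enter(ref lockTaken);
            _value += amount; // the protected region stays tiny
        }
        finally
        {
            if (lockTaken) _lock.Exit();
        }
    }

    public long Value
    {
        get
        {
            bool lockTaken = false;
            try
            {
                _lock.Enter(ref lockTaken);
                return _value;
            }
            finally
            {
                if (lockTaken) _lock.Exit();
            }
        }
    }
}
```

For a plain counter Interlocked.Add would be simpler still; the point of the sketch is only the Enter/Exit pattern with the lockTaken flag.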
Miscellaneous Tips (1)
•Don’t mix several concurrency frameworks
in the same process
•Some parallel work is best organized in
pipelines – TPL DataFlow
BroadcastBlock<Uri> → TransformBlock<Uri, byte[]> → TransformBlock<byte[], string> → ActionBlock<string>
Miscellaneous Tips (2)
•Some parallel work can be offloaded to the
GPU – C++ AMP
void vadd_exp(float* x, float* y, float* z, int n) {
array_view<const float,1> avX(n, x), avY(n, y);
array_view<float,1> avZ(n, z);
avZ.discard_data();
parallel_for_each(avZ.extent, [=](index<1> i) ... {
avZ[i] = avX[i] + fast_math::exp(avY[i]);
});
avZ.synchronize();
}
Miscellaneous Tips (3)
•Invest in SIMD parallelization of heavy
math or data-parallel algorithms
–Already available on Mono (Mono.Simd)
•Make sure to take cache effects into
account, especially on MP systems
START:
movups xmm0, [esi+4*ecx]
addps xmm0, [edi+4*ecx]
movups [ebx+4*ecx], xmm0
sub ecx, 4
jns START
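The loop above adds two float arrays four elements at a time. As a rough managed counterpart, the sketch below uses System.Numerics.Vector<T>, an API that is not mentioned in the deck (which names only Mono.Simd), so treat the choice of API as an assumption made for illustration.

```csharp
using System;
using System.Numerics;

public static class SimdAdd
{
    // result[i] = a[i] + b[i]; assumes all three arrays have the same length.
    public static void Add(float[] a, float[] b, float[] result)
    {
        int width = Vector<float>.Count; // lanes per vector, hardware dependent
        int i = 0;
        for (; i <= a.Length - width; i += width)
        {
            var va = new Vector<float>(a, i);
            var vb = new Vector<float>(b, i);
            (va + vb).CopyTo(result, i);
        }
        // Scalar tail for the elements that do not fill a whole vector.
        for (; i < a.Length; ++i)
            result[i] = a[i] + b[i];
    }
}
```

Whether this is actually compiled to SIMD instructions depends on the runtime and hardware; Vector.IsHardwareAccelerated reports it at run time.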
Summary
• Avoid shared state and synchronization
• Parallelize judiciously and apply
thresholds
• Measure and understand performance
gains or losses
• Concurrency and parallelism are still hard
• A body of best practices, tips, patterns,
examples is being built
THANK YOU!
Sasha Goldshtein
CTO, Sela Group
blog.sashag.net
@goldshtn
