Dmitri Nesteruk
dmitri@activemesa.com
http://activemesa.com
http://spbalt.net
http://devtalk.net
“Premature optimization is the root of all evil.”

                                                    Donald Knuth
              Structured Programming with go to Statements,
         ACM Computing Surveys, Vol. 6, No. 4, Dec. 1974, p. 268.


“In practice, it is often necessary to keep performance goals in mind
when first designing software, but the programmer balances the goals
of design and optimization.”

                                                      Wikipedia
               http://en.wikipedia.org/wiki/Program_optimization
Brief intro
   Why unmanaged code?
   Why parallelize?
P/Invoke
SIMD
OpenMP
Intel stack: TBB, MKL, IPP
GPGPU: CUDA, Accelerator
Miscellanea
Today
   Threads & ThreadPool
   Sync structures
     Monitor.(Try)Enter/Exit
     ReaderWriterLock(Slim)
     Mutex
     Semaphore
   Wait handles
     Manual/AutoResetEvent
   Pulse & wait
   Async delegates
   Async simplifications
     F# async workflow
     AsyncEnumerator (PowerThreading)

Tomorrow
   Tasks and TaskManager
     Task
     Future
   Data-level parallelism
     Parallel.For/ForEach
   Parallel LINQ
     AsParallel()
Performance
   Low-level (fine-tuning) framework
   Instruction-level parallelism
   GPU SIMD
   General vectorization
   Simple cross-machine framework

Managed interfaces for SIMD/MPI-optimized libraries
Threading tools
   Debugging
   Profiling
   Inferencing
Cross-machine debugging
Task management UI
WTF?!? Isn’t C# 5% faster than C?
   It depends.
Why is there a difference?
   More safety (e.g., CLR array bound checking)
   JIT: No auto-parallelization
   JIT: No SIMD
   Lack of fine control
IL can be every bit as fast as C/C++
   But this is only true for simple problems
   The code is only as good as the JITter
Libraries (MKL, IPP)
   OpenMP
      Intel TBB, Microsoft PPL
         SIMD (CPU & GPGPU)
Part I
A way of calling unmanaged C++ from .Net
   Not the same as C++/CLI
For interaction with ‘legacy’ systems
Can pass data between managed and
unmanaged code
   Literals (int, string)
   Pointers (e.g., pointer to array)
   Structures
Marshalling is taken care of by the runtime
Make a Win32 C++ DLL
   MYLIB_API int Add(int first, int second)
   {
     return first + second;
   }
   Specify a post-build step to copy DLL to .Net assembly
     Important: default DLL location is solution root
   Build the DLL
Make a .Net application
   [DllImport("MyLib.dll")]
   public static extern int Add(
     int first, int second);
Call the method
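The MYLIB_API macro is not defined in the slides; a plausible definition (an assumption, shown here so the example is self-contained) is:

```cpp
// Sketch of the unmanaged side. extern "C" avoids C++ name mangling;
// __declspec(dllexport) is MSVC-only, hence the guard so this sketch
// also compiles with other toolchains.
#ifdef _WIN32
  #define MYLIB_API extern "C" __declspec(dllexport)
#else
  #define MYLIB_API extern "C"
#endif

MYLIB_API int Add(int first, int second)
{
  return first + second;
}
```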
Basic C# ↔ C++ Interop
DLL not found
   Make sure post-build step copies DLL to target folder
   Or that DLL is in PATH
An attempt was made to load DLL with incorrect format
   DLL relies on other DLLs which are not found
     Open Visual Studio command prompt
     Use dumpbin /dependents mylib.dll to find out
     Copy files to target dir
     This is common in Debug mode
   32-bit/64-bit mismatch
Entry point not found
   Make sure method names and signatures are equivalent
   Make sure calling convention matches
     [DllImport(…, CallingConvention=…)]
   On 64-bit systems, specify entry name explicitly
     Use dumpbin /exports
     [DllImport(…, EntryPoint = "?Add@@YAHHH@Z")]
     No, extern "C" does not help
It all works
   Congratulations!
Special cases
   String handling
     Unicode vs. ANSI
     LP(C)WSTR
   Arrays
   Memory allocation
   Calling convention
   “Bitness” issues
   … and lots more!

Handling them
   Marshal
   MarshalAsAttribute
   [In] and [Out]
   StructLayout
   fixed
   IntPtr
   … and lots more
   Handle on a case-by-case basis
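Passing an array is a typical special case: the runtime pins the managed array and hands the native side a raw pointer, plus whatever length you pass explicitly. A minimal sketch of the unmanaged half (the function SumArray and its C# declaration are hypothetical, for illustration only):

```cpp
// Hypothetical export: receives the pinned managed array as a raw pointer.
// The C# side would declare it as:
//   [DllImport("MyLib.dll")]
//   static extern double SumArray(double[] data, int count);
// The length must be passed separately; native code cannot see it.
extern "C" double SumArray(const double* data, int count)
{
  double sum = 0.0;
  for (int i = 0; i < count; ++i)
    sum += data[i];
  return sum;
}
```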
Make sure signatures match
   Including return types!
To debug
   If your OS is 64-bit, make sure .Net assemblies compile in 32-bit mode
   Make sure unmanaged code debugging is turned on
In 64-bit
   Launch target DLL with the .Net assembly as target
   Good luck! :)
Visit the P/Invoke wiki @ http://pinvoke.net
Part II
An API for multi-platform shared-memory parallel programming in C/C++ and Fortran.
Uses #pragma statements to decorate code
Easy!!!
   Syntax can be learned very quickly
   Can be turned off and on in project settings
Enable it (disabled by default)
Use it!
   No further action necessary
To use configuration API
   #include <omp.h>
   Call methods, e.g., omp_get_num_procs()
void MultiplyMatricesDoubleOMP(
  int size, double* m1[], double* m2[], double* result[])
{
  int i, j, k;
  #pragma omp parallel for shared(size,m1,m2,result) private(i,j,k)
  for (i = 0; i < size; i++)
  {
    for (j = 0; j < size; j++)
    {
      result[i][j] = 0;
      for (k = 0; k < size; k++)
      {
        result[i][j] += m1[i][k] * m2[k][j];
      }
    }
  }
}

#pragma omp parallel for
   Hints to the compiler that it’s worth parallelizing the loop
shared(size,m1,m2,result)
   Variables shared between all threads
private(i,j,k)
   Variables which have differing values in different threads
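The same pragma family covers simpler patterns too. A minimal sketch, assuming a sum over an array: the reduction(+:sum) clause gives each thread a private partial sum and combines them at the end. If OpenMP is not enabled, the pragma is ignored and the loop runs serially with the same result:

```cpp
// Parallel array sum via an OpenMP reduction. Without the reduction
// clause, concurrent writes to sum would be a data race; with it, each
// thread accumulates privately and OpenMP adds the partials together.
double SumOMP(const double* data, int n)
{
  double sum = 0.0;
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < n; i++)
    sum += data[i];
  return sum;
}
```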
Using OpenMP in your C++ app
Homepage
http://openmp.org
cOMPunity (community of OMP users)
http://www.compunity.org/
OpenMP debug/optimization article (Russian)
http://bit.ly/BJbPU
VivaMP (static analyzer for OpenMP code)
http://viva64.com/vivamp-tool
Part III
Libraries save you from reinventing the wheel
   Tested
   Optimized (e.g., for multi-core, SIMD)
These typically have C++ and Fortran
interfaces
   Some also have MPI support
   Of course, there are .Net libraries too :)
The ‘trick’ is to use these libraries from C#
   Fortran-compatible API is tricky!
   Data structure passing can be quite arcane!
Intel makes multi-core processors
Multi-core know-how
Intel Parallel Studio
   Parallel Composer
     C++ Compiler (autoparallelization, OpenMP 3.0)
     Libraries (Math Kernel Library, Integrated Performance Primitives,
     Threading Building Blocks)
     Parallel debugger extension
   Parallel Inspector (memory/threading errors)
   Parallel Amplifier (hotspots, concurrency, locks and waits)
   Parallel Advisor Lite
Low-level parallelization framework from Intel
Lets you fine-tune code for multi-core
Is a library
   Uses a set of primitives
Has OSS license
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
using namespace tbb;

// Functor
struct Average {
    float* input;
    float* output;
    void operator()( const blocked_range<int>& range ) const {
        for( int i=range.begin(); i!=range.end(); ++i )
            output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f);
    }
};

// Note: The input must be padded such that input[-1] and input[n]
// can be used to calculate the first and last output values.
void ParallelAverage( float* output, float* input, size_t n ) {
    Average avg;
    avg.input = input;
    avg.output = output;
    // Library call
    parallel_for(blocked_range<int>( 0, (int)n, 1000 ), avg);
}
Integrated Performance Primitives
   High-performance libraries for
     Signal processing
     Image processing
     Computer vision
     Speech recognition
     Data compression
     Cryptography
     String manipulation
     Audio processing
     Video coding
     Realistic rendering
   Also support codec construction

Math Kernel Library
   Optimized, multithreaded library for math
   Support for
     BLAS
     LAPACK
     ScaLAPACK
     Sparse Solvers
     Fast Fourier Transforms
     Vector Math
     … and lots more
Part IV
CPU support for performing operations on large registers
Normal-size data is loaded into 128-bit registers
Operation on multiple elements with a single instruction
   E.g., add 4 numbers at once
Requires special CPU instructions
   Less portable
Supported in C++ via ‘intrinsics’
SSE is an instruction set
   Initially called MMX (n/a on 64-bit CPUs)
   Now SSE and SSE2
Compiler intrinsics
   C++ functions that map to one or more SSE assembly instructions
Determining support
   Use cpuid
   Non-issue if you are a systems integrator or run your own servers
   (e.g., Asp.Net)
128-bit data types
   __m128
   __m128i (integer intrinsics)
   __m128d (double intrinsics)     } sse2
Operations for load and set
   __m128 a = _mm_set_ps(1,2,3,4);
   To get at data, dereference and choose type
     E.g., myValue.m128_f32[0] gets first float
Perform operations (add, multiply, etc.)
   E.g., _mm_mul_ps(first, second) multiplies two values yielding a third
Make or get data
   Either create with initialized values
   static __m128 factor =
      _mm_set_ps(1.0f, 0.3f, 0.59f, 0.11f);
   Or load it into a SIMD-sized memory location
   __m128 pixel;
   pixel.m128_f32[0] = s->m128i_u8[(p<<2)];
   Or convert an existing pointer
   __m128* pixel = (__m128*)(&my_array + p);
Perform a SIMD operation and get data
   pixel = _mm_mul_ps(pixel, factor);
Get the data
   const BYTE sum = (BYTE)(pixel.m128_f32[0]);
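The steps above can be combined into one routine. This sketch uses the portable _mm_loadu_ps/_mm_storeu_ps pair to move data in and out instead of the MSVC-only m128_f32 union member; the function name ScaleFour is made up for this example:

```cpp
#include <immintrin.h>  // SSE intrinsics

// Multiply four floats by four factors with a single SSE instruction.
// _mm_loadu_ps/_mm_storeu_ps handle unaligned memory, so no special
// allocation is needed for this sketch.
void ScaleFour(const float* input, const float* factors, float* output)
{
  __m128 v = _mm_loadu_ps(input);     // load 4 floats into a register
  __m128 f = _mm_loadu_ps(factors);   // load the 4 factors
  __m128 r = _mm_mul_ps(v, f);        // 4 multiplications, 1 instruction
  _mm_storeu_ps(output, r);           // write the 4 results back
}
```

Note that _mm_set_ps(1,2,3,4) fills the register in reverse argument order (4 lands in element 0); _mm_setr_ps fills it in memory order.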
Image processing with SIMD
Part V
Graphics cards have GPUs
These are highly parallelized
   Pipelining
   Useful for graphics
GPUs are programmable
   We can do math ops on vectors
   Mainly float, with double support emerging
GPUs have programmable parts
   Vertex shader (vertex position)
   Pixel shader (pixel colour)
Treat data as texture (render target)
   Load inputs as texture
   Use pixel shader
   Get data from result texture
Special languages used to program them
   HLSL (DirectX)
   GLSL (OpenGL)
High-level wrappers (CUDA, Accelerator)
A Microsoft Research project
   Not for commercial use
Uses a managed API
Employs data-parallel arrays
   Int
   Float
   Bool
   Bitmap-aware
Requires PS 2.0
Sorry! No demo.
Accelerator does not work on 64-bit :(
Unmanaged Parallelization via P/Invoke
If a library already exists, use it
If C# is fast enough, use it
To speed things up, try
   TPL/PLINQ
   Manual Parallelization
   unsafe (can be combined with TPL)
If you are unhappy, then
   Write in C++
   Speculatively add OpenMP directives
   Fine-tune with TBB as needed
System.Drawing.Bitmap is slow
  Has very slow GetPixel()/SetPixel() methods
Can fix bitmap in memory and manipulate it in
unmanaged code
What we need to pass in
  Pointer to bitmap image data
  Bitmap width and height
  Bitmap stride (horizontal memory space)
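A sketch of what such an unmanaged entry point could look like (the name InvertBitmap and the 32-bits-per-pixel assumption are mine, not from the slides). The key detail is addressing rows through the stride rather than width * 4, because each row may be padded:

```cpp
#include <cstdint>

// Hypothetical export matching the parameters listed above: pointer to
// the pixel data (e.g., from Bitmap.LockBits on the C# side), width,
// height, and stride. Assumes 32bpp BGRA layout.
extern "C" void InvertBitmap(uint8_t* data, int width, int height, int stride)
{
  for (int y = 0; y < height; ++y)
  {
    uint8_t* row = data + y * stride;  // stride, not width * 4
    for (int x = 0; x < width; ++x)
    {
      row[x * 4 + 0] = 255 - row[x * 4 + 0]; // B
      row[x * 4 + 1] = 255 - row[x * 4 + 1]; // G
      row[x * 4 + 2] = 255 - row[x * 4 + 2]; // R
      // row[x * 4 + 3] is alpha; left unchanged
    }
  }
}
```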
Image-rendered headings with subpixel postprocessing (http://bit.ly/10x0G8)
  WPF FlowDocument for initial generation
  C++/OpenMP for postprocessing
  Asp.Net for serving the result
Freeform rendered text with OpenType features (http://bit.ly/1cCP50)
  Bitmap rendering in Direct2D (C++/lightweight COM API)
  OpenType markup language
THE END
