Java parallel programming made simple

Ateji – the Company
Specialized in parallelism & language technologies
Founded by Patrick Viry in 2005
Java extensions for optimization (OptimJ, 2008) and parallelism (Ateji PX, 2010)
January 2010: 1st round of investment
Ateji PX selected as Disruptive Technology during SC10
Member of HiPEAC, OpenGPU

The Grand Challenge: Parallel Programming for All Application Developers
[Figure: core counts by year: 2008 (4 cores), 2010 (100 cores); label: enterprise servers]

Why Java?
Increasingly used for HPC because:
  Most popular language today
  Good runtime performance
  Much better productivity and code quality
  Faster time-to-market, fewer bugs, less maintenance
  Much easier staffing
Used in aerospace, bioinformatics, physics, finance, data mining, statistics, ...
Details and references in our latest blog posting: ateji.blogspot.com

How to parallelize Java code?

Sequential code:

    for(int i : I) {
        for(int j : J) {
            for(int k : K) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }

Threads:

    // In this version I, J and K stand for the matrix dimensions.
    final int nThreads = Runtime.getRuntime().availableProcessors();
    final int blockSize = I / nThreads;
    Thread[] threads = new Thread[nThreads];
    for(int n=0; n<nThreads; n++) {
        final int finalN = n;
        threads[n] = new Thread() {
            @Override
            public void run() {
                // Each thread handles one block of rows; the last thread takes the remainder.
                final int beginIndex = finalN*blockSize;
                final int endIndex = (finalN == (nThreads-1)) ? I : (finalN+1)*blockSize;
                for(int i=beginIndex; i<endIndex; i++) {
                    for(int j=0; j<J; j++) {
                        for(int k=0; k<K; k++) {
                            C[i][j] += A[i][k] * B[k][j];
                        }
                    }
                }
            }
        };
        threads[n].start();
    }
    // Wait for all threads to terminate.
    for(int n=0; n<nThreads; n++) {
        try {
            threads[n].join();
        } catch (InterruptedException e) {
            System.exit(-1);
        }
    }

Ateji PX:

    for||(int i : I) {
        for(int j : J) {
            for(int k : K) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
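For comparison, the same row-wise decomposition can be sketched in plain Java 8+ with a parallel stream. This is not Ateji PX syntax, and the class and method names (ParallelMatMul, multiply) are hypothetical names chosen for illustration.

```java
import java.util.stream.IntStream;

// Hypothetical helper, not part of Ateji PX: plain-Java row-parallel matrix product.
final class ParallelMatMul {
    // Computes C = A * B. Each row of C is written by exactly one task,
    // so no synchronization on C is needed.
    static double[][] multiply(double[][] a, double[][] b) {
        int rows = a.length;
        int inner = b.length;
        int cols = b[0].length;
        double[][] c = new double[rows][cols];
        IntStream.range(0, rows).parallel().forEach(i -> {
            for (int j = 0; j < cols; j++) {
                for (int k = 0; k < inner; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        });
        return c;
    }
}
```

Only the outer loop is parallelized, as in the for|| version above; the inner loops stay sequential within each row.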

It’s easy AND efficient: 12.5x speedup on 16 cores
See the whitepaper on www.ateji.com/px

Ateji PX:

    for||(int i : I) {
        for(int j : J) {
            for(int k : K) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }

“The problem with threads” [Technical Report, Edward A. Lee, EECS Berkeley]
Threads are a hardware-level concept, not a practical abstraction for programming
Threads do not compose
Code correctness requires intricate thinking and inspection of the whole program
Most multi-threaded programs are buggy ...
... and debuggers do not help
Not an option for most application programmers!

Introducing Parallelism at the Language Level
Sequential composition operator: “;”
Parallel composition operator: “||”

“Hello World!”

    [
      || System.out.println("Hello");
      || System.out.println("World");
    ]

Run two branches in parallel, wait for termination
Prints either "Hello" then "World", or "World" then "Hello"
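As a rough plain-Java analogue of the two-branch block (a sketch only, not how Ateji PX compiles it), two tasks can be started with CompletableFuture and joined before continuing. The class name ParallelHello is hypothetical, and the branches append to a list rather than print so that the result can be inspected.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical illustration: run two branches concurrently, then wait for both.
final class ParallelHello {
    static List<String> run() {
        List<String> out = new CopyOnWriteArrayList<>(); // thread-safe list
        CompletableFuture<Void> hello = CompletableFuture.runAsync(() -> out.add("Hello"));
        CompletableFuture<Void> world = CompletableFuture.runAsync(() -> out.add("World"));
        // Equivalent of the closing "]": block until both branches have terminated.
        CompletableFuture.allOf(hello, world).join();
        return out;
    }

    public static void main(String[] args) {
        // The order of the two elements is not deterministic.
        System.out.println(run());
    }
}
```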

Data Parallelism
Same operation on all elements:

    [
      // quantified branches
      || (int i : N) array[i]++;
    ]

Multiple dimensions and filters, e.g. update the upper left triangle of a matrix:

    [
      || (int i:N, int j:N, i+j<N) m[i][j]++;
    ]
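The upper-left-triangle update can be approximated in plain Java with a parallel stream over the rows, with the filter i+j<N folded into the inner loop bound. This is a sketch under the assumption of a square matrix; TriangleUpdate and incrementUpperLeft are hypothetical names.

```java
import java.util.stream.IntStream;

// Hypothetical illustration of a filtered two-dimensional data-parallel update.
final class TriangleUpdate {
    // Increments m[i][j] for every pair with i + j < n, one parallel task per row.
    // Assumes m is a square n x n matrix.
    static void incrementUpperLeft(int[][] m) {
        int n = m.length;
        IntStream.range(0, n).parallel().forEach(i -> {
            for (int j = 0; i + j < n; j++) {
                m[i][j]++;
            }
        });
    }
}
```

Rows are independent, so the tasks never write to the same element.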

Task Parallelism

    int fib(int n) {
        if(n <= 1) return 1;
        int fib1, fib2;
        [
          || fib1 = fib(n-1);
          || fib2 = fib(n-2);
        ];
        return fib1 + fib2;
    }

Note the recursion: || is compatible with all language constructs
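For readers more familiar with the standard library, a similar recursive decomposition can be written with the Fork/Join framework (java.util.concurrent, Java 7+). This is a sketch for comparison, not Ateji PX output; FibTask is a hypothetical class name, and the base case follows the slide (fib(0) = fib(1) = 1).

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical illustration: recursive task parallelism with Fork/Join.
final class FibTask extends RecursiveTask<Integer> {
    private final int n;

    FibTask(int n) {
        this.n = n;
    }

    @Override
    protected Integer compute() {
        if (n <= 1) return 1;                      // same base case as the slide
        FibTask first = new FibTask(n - 1);
        first.fork();                              // run fib(n-1) asynchronously
        int second = new FibTask(n - 2).compute(); // compute fib(n-2) in this thread
        return first.join() + second;              // wait for the forked branch
    }

    static int fib(int n) {
        return ForkJoinPool.commonPool().invoke(new FibTask(n));
    }
}
```

A production version would normally switch to a sequential computation below some threshold, since one task per call is far too fine-grained to pay off.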

Speculative Parallelism
Stop when the fastest algorithm succeeds:

    [
      || return algorithm1();
      || return algorithm2();
    ]

Stop sister branches, then return
Same behaviour for break, continue, throw
Non-local exit is very difficult to get right with threads
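The closest standard-library counterpart is probably ExecutorService.invokeAny, which returns the result of one task that completed successfully and cancels the others. The sketch below assumes the competing algorithms are supplied as Callable instances; Speculative and firstOf are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration: race several algorithms and keep the first result.
final class Speculative {
    static <T> T firstOf(List<Callable<T>> algorithms)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(algorithms.size());
        try {
            // Returns as soon as one task succeeds; tasks not yet completed are cancelled.
            return pool.invokeAny(algorithms);
        } finally {
            pool.shutdownNow(); // interrupt any branch that is still running
        }
    }
}
```

Cancellation here relies on thread interruption, so a branch that ignores interrupts keeps running in the background; this is one of the details that makes non-local exit hard to get right by hand.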

Parallel reductions
Same behaviour for break, continue, throw
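The slide's own reduction example did not survive in this transcript. As a stand-in, here is what a parallel reduction looks like in plain Java streams; it is not Ateji PX syntax, and ParallelReduction and its methods are hypothetical names.

```java
import java.util.stream.IntStream;

// Hypothetical illustration of parallel reductions with the standard stream API.
final class ParallelReduction {
    // Sums i*i for i in [0, n). Addition is associative, so partial sums
    // computed by different threads can be combined in any grouping.
    static long sumOfSquares(int n) {
        return IntStream.range(0, n).parallel().mapToLong(i -> (long) i * i).sum();
    }

    // Maximum of an array, using Integer.MIN_VALUE as the identity element.
    static int max(int[] values) {
        return IntStream.of(values).parallel().reduce(Integer.MIN_VALUE, Math::max);
    }
}
```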

Message Passing
Is an essential aspect of parallelism
Must be part of the language
Send a message: chan ! value
Receive a message: chan ? value
Typed channels:
  Chan<T>: synchronous (rendez-vous)
  AsyncChan<T>: asynchronous (buffered)
User-defined serialization (Java, XML, ASN.1, ...)
Can be mapped to I/O devices (files, sockets, MPI)
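In plain Java, the nearest standard equivalents are blocking queues: SynchronousQueue behaves like a rendez-vous channel and ArrayBlockingQueue like a bounded buffered one. The sketch below uses them as stand-ins for Chan<T> and AsyncChan<T>; this mapping is an assumption made for illustration, and ChannelDemo and sendAndSum are hypothetical names.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.SynchronousQueue;

// Hypothetical illustration: blocking queues as stand-ins for typed channels.
final class ChannelDemo {
    // A sender thread puts 1..count on the channel; the caller receives and sums them.
    static int sendAndSum(BlockingQueue<Integer> chan, int count) throws InterruptedException {
        Thread sender = new Thread(() -> {
            try {
                for (int v = 1; v <= count; v++) {
                    chan.put(v);            // roughly "chan ! v"
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        sender.start();
        int sum = 0;
        for (int i = 0; i < count; i++) {
            sum += chan.take();             // roughly "chan ? value"
        }
        sender.join();
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(sendAndSum(new SynchronousQueue<>(), 10));    // rendez-vous
        System.out.println(sendAndSum(new ArrayBlockingQueue<>(4), 10)); // buffered
    }
}
```

With the synchronous queue each put blocks until the matching take; with the buffered queue the sender can run up to four messages ahead of the receiver.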

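Data Flow and Stream Parallelism
[Diagram: channels in1 and in2 feed the adder, which writes to out]
An adder:

    void adder(Chan<Integer> in1, Chan<Integer> in2, Chan<Integer> out) {
        for(;;) {
            int value1, value2;
            // Read both inputs in parallel, in whichever order they arrive.
            [ in1 ? value1; || in2 ? value2; ];
            out ! (value1 + value2);
        }
    }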

Data Flow and Stream Parallelism
[Diagram: two sources write to c1 and c2; the adder reads c1 and c2 and writes to c3; the sink reads c3]
Compose processes:

    [
      || source(c1);          // generates values on c1
      || source(c2);          // generates values on c2
      || adder(c1, c2, c3);
      || sink(c3);            // read values from c3
    ]

Numeric values + sync = “data flow”
Strings or tuples + async = “stream programming”, e.g. MapReduce algorithm
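The source/adder/sink network can be imitated in plain Java with threads connected by blocking queues. The sketch below makes several simplifying assumptions that are not on the slides: the processes handle a fixed number of values instead of looping forever, the sources emit consecutive integers, and the adder reads its two inputs one after the other rather than in parallel. AdderNetwork and its method signatures are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.SynchronousQueue;

// Hypothetical illustration: a small data-flow network built from blocking queues.
final class AdderNetwork {
    // Emits start, start+1, ... (count values) on the given channel.
    static void source(BlockingQueue<Integer> out, int start, int count)
            throws InterruptedException {
        for (int i = 0; i < count; i++) {
            out.put(start + i);
        }
    }

    // Reads one value from each input and writes their sum, count times.
    // Simplification: in1 is read before in2 instead of both in parallel.
    static void adder(BlockingQueue<Integer> in1, BlockingQueue<Integer> in2,
                      BlockingQueue<Integer> out, int count) throws InterruptedException {
        for (int i = 0; i < count; i++) {
            out.put(in1.take() + in2.take());
        }
    }

    // Wires the network together; the calling thread plays the role of the sink.
    static List<Integer> run(int count) throws InterruptedException {
        BlockingQueue<Integer> c1 = new SynchronousQueue<>();
        BlockingQueue<Integer> c2 = new SynchronousQueue<>();
        BlockingQueue<Integer> c3 = new SynchronousQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            // "return null" makes each lambda a Callable, which may throw checked exceptions.
            pool.submit(() -> { source(c1, 0, count); return null; });
            pool.submit(() -> { source(c2, 100, count); return null; });
            pool.submit(() -> { adder(c1, c2, c3, count); return null; });
            List<Integer> results = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                results.add(c3.take());
            }
            return results;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Synchronous queues give the rendez-vous behaviour the slide calls "data flow"; swapping in bounded buffered queues would move the same network toward the asynchronous "stream programming" style.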

Expressing non-determinism
Note the parallel reads:

    [ in1 ? value1 || in2 ? value2 ]

Impossible to express in a sequential language
|| is for performance, but also for expressivity
See also the select construct

Distributing branches
Use indications:

    [
      || #Remote("192.168.20.1") source(c1);
      || #Remote("Amazon EC2") source(c2);
      || #Remote("GPU") adder(c1, c2, c3);
      || sink(c3);
    ]

Multicore desktop/server
Multicore CPU/GPU cluster

Compiler handles the boring stuff
Passing parameters
Returning results
Throwing exceptions
Accessing non-final fields
Performing non-local exits
Stopping branches properly

Making it easy is also about tools: Eclipse integration

Ateji PX Summary
Parallelism at the language level is simple and intuitive, efficient, compatible with source code and tools
Most patterns in a single language:
  data, task, recursive and speculative parallelism
  shared memory and distributed memory
  Covers OpenMP, Cilk, MPI, Occam, Erlang, etc.
Most hardware architectures from a single language:
  Manycore, grid, cloud, GPU

Roadmap as of February 2011
Ateji PX 1.1 (multicore version) available today
  Free evaluation version on www.ateji.com
GPU version coming soon
  OpenGPU project
Distributed version coming soon
  Grid / Cluster / Cloud
Interactive correctness proofs
Integration of profiling tools

Call to Action
Free download on www.ateji.com/px
Read the whitepapers
Play with the online demo
Look at the samples library
Benchmark your || code
Contact: info@ateji.com
Blog: ateji.blogspot.com
