An Introduction to SequenceL
Auto-Parallelizing Programming Language and Toolset
www.texasmulticore.com
Brad Nemanich, PhD
Chief Technology Officer
Why is SequenceL Needed?
“The way the processor industry is going is to add more and more cores, but nobody knows how to program those things. I mean, two, yeah; four, not really; eight, forget it.”
– Steve Jobs
This shift now affects every software company, large enterprise, and government agency that develops software.
© 2015 Texas Multicore Technologies, Inc. All Rights Reserved
Current (Manual) Approach to Multicore Programming
8 “Simple” Rules for Designing Threaded Applications
(0. Hire team of “Parallel Ninjas”, PhD experts in computer architecture.)
1. Be sure you identify truly independent computations.
2. Implement concurrency at the highest level possible.
3. Plan early for scalability to take advantage of increasing numbers of cores.
4. Make use of thread-safe libraries wherever possible.
5. Use the right threading model.
6. Never assume a particular order of execution.
7. Use thread-local storage whenever possible; associate locks to specific data, if needed.
8. Don’t be afraid to change the algorithm for a better chance of concurrency.
“The significant problems we face cannot be solved using the same level of thinking we used when we created them.”
– Albert Einstein
“Parallel Ninja” Approach Does Not Scale
• How do you:
  ─ find them?
  ─ afford them?
  ─ retain them?
  ─ support rapid innovation?
  ─ ensure accuracy and correctness?
  ─ keep them current on platform technologies?
  ─ do this for all your software?
Einstein was right; there’s a much better way…
It’s Time to Change the Game (Again)
[Figure: timeline of programming abstraction layers. Year markers: 1949, 1954, 1957, 1980, 2004 (Multicore), 2014. Layers shown: Wiring / Netlist; Machine Code; Assembly Language; HLL + Compiler (Fortran, COBOL, PL/I, Lisp, C, …); Object Oriented (Smalltalk, C++, Java, C#); and, at 2014, Functional, Auto-Parallelizing on top of Object Oriented C++ and Machine Code.]
SequenceL is a Game Changer
• Faster Performance; Uses all cores, GPUs
• 10X Faster Time to Innovation/Market
• Get it Right the First Time
• Quickly Leverage New Computing Platforms
• Built Upon Open Industry Standards; Works with Existing Tools & Methodologies
Customer Example: Industrial Control Networking
(WirelessHART, IEC 62591, IEEE 802.15.4)
• New algorithm, developed for large, noisy industrial process control environments
  ─ Presented white paper to IEEE
  ─ Won an award
• Asked TMT to implement for comparison purposes
  ─ Finished in SequenceL in 3 weeks
    • 10X faster performance and right the first time
  ─ Java finished by the inventors in 3 months
    • Had errors and was much slower; used SequenceL code to debug Java
    • Another month getting code correct
    • A 5th month improving performance that still fell short
• Bottom line
  ─ SL was finished in 15% of the time
  ─ SL was correct the first time
  ─ SL outperformed the Java code 1.5x-3.0x on a 2-core AMD APU
  ─ Robust and fast code, fast time to market
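The 15% figure is consistent with the durations quoted on the slide if a month is counted as roughly four weeks (an assumption; the slide does not state how the percentage was computed). Three weeks of SequenceL work against five months of Java work gives:

```latex
\frac{3\ \text{weeks}}{5\ \text{months} \times 4\ \text{weeks/month}} = \frac{3}{20} = 0.15
```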
Customer Example: Video Processing Using SequenceL
• Goal: 30 Hz to keep up with input video feed
• Best performance (8-core x86 platform)
  ─ 58 Hz: SequenceL
  ─ 21 Hz: Matlab (Interpreter)
  ─ 1.2 Hz: Matlab (Coder/C-out)
Input video feed (e.g., Apache helicopter gyro camera)
Processed video (Proprietary algorithms remove air turbulence, radiated heat, etc.)
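Read as ratios, the frame rates quoted on this slide put SequenceL at roughly 1.9 times the 30 Hz target, 2.8 times the Matlab interpreter, and 48 times the Matlab Coder output. These are simple quotients of the slide's own numbers, not additional measurements:

```latex
\frac{58}{30} \approx 1.9, \qquad \frac{58}{21} \approx 2.8, \qquad \frac{58}{1.2} \approx 48
```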
What is SequenceL?
SequenceL is a…
• High-Abstraction
• Functional
• Self-Parallelizing
…programming language and tool set
…designed to work in concert with other popular programming languages and tools
High-Abstraction, High Performance
• Most common programming languages are imperative
  ─ Detailed sequence of commands for carrying out the computation; i.e., tell the computer both “what” to do and “how” to do it
  ─ Inherently sequential, written for classic von Neumann computers
  ─ e.g., C/C++, Java, C#, Python, Fortran
  ─ Some add explicit “directives” to manually enable low-level parallelism
• SequenceL is declarative and functional: a higher level of abstraction
  ─ Describe the desired output in terms of the input, as functions; i.e., tell the computer only “what” to do, so the programmer does not have to think about parallelism
  ─ Abstracts away complex multicore and many-core platforms
• Best analogy is the SQL database language
  ─ A programmer could write their own database procedures in low-level C
  ─ But the result would be error-prone and would not perform as well as with Oracle or DB2
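The “what” versus “how” distinction can be sketched in C++, chosen here only because the deck's other examples are C-like; this is not SequenceL code. The function names `scale_imperative` and `scale_declarative` are hypothetical and exist purely for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Imperative style: the source spells out the iteration ("how")
// as well as the result ("what").
std::vector<double> scale_imperative(const std::vector<double>& in, double factor) {
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * factor;
    }
    return out;
}

// More declarative style: the source states only that each output element
// is the matching input element times factor; no loop is written down.
std::vector<double> scale_declarative(const std::vector<double>& in, double factor) {
    std::vector<double> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(),
                   [factor](double x) { return x * factor; });
    return out;
}
```

Both functions return the same values. The second leaves the traversal to the library, which is a small-scale version of the freedom a declarative language gives its compiler.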
Drops Into Your Current Design Flow
• Designed to work in concert with other programming languages, legacy code and libraries
• Additive: works with existing design flows, tools, and training
• Builds upon open industry standards
• Adds a multicore “power tool” to the programmer’s toolbox
• Complete add-on solution
  ─ IDE plug-ins, debugger, interpreter, auto-parallelizing compiler, runtime environment
• Easy to modernize legacy applications
  ─ Parallel C++ output enables just a portion to be refactored in SequenceL and linked in
  ─ Uses vector (SIMD) processor instructions
  ─ Automatic OpenCL generation averts the need to learn and incorporate low-level CUDA or OpenCL code and associated scaffolding to exploit systems with (GP)GPUs
  ─ Often faster to refactor portions of code in SequenceL than to find and fix bugs in old code
The Problem With Directive-Based Programming
Example: 3-body problem
//P1
a1 = grav(P1, P2, m2) + grav(P1, P3, m3);
dv1 = a1*dt;
v1 = v1 + dv1;
dp1 = v1*dt;
//P2
a2 = grav(P2, P1, m1) + grav(P2, P3, m3);
dv2 = a2*dt;
v2 = v2 + dv2;
dp2 = v2*dt;
//P3
a3 = grav(P3, P2, m2) + grav(P3, P1, m1);
dv3 = a3*dt;
v3 = v3 + dv3;
dp3 = v3*dt;
Each body can be calculated at the same time, giving in theory a 3x speedup.
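The slide's fragment leaves its types and the `grav` function undefined. A minimal self-contained sketch follows, under several assumptions that are not in the deck: positions are 2-D, `grav` returns the acceleration due to a point mass in units where G = 1, and the names `Vec2`, `Body`, and `step` are hypothetical. It keeps the slide's update order (velocity first, then position) and makes the independence of the three bodies explicit.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

struct Vec2 { double x, y; };

Vec2 operator+(Vec2 a, Vec2 b) { return {a.x + b.x, a.y + b.y}; }
Vec2 operator-(Vec2 a, Vec2 b) { return {a.x - b.x, a.y - b.y}; }
Vec2 operator*(Vec2 a, double s) { return {a.x * s, a.y * s}; }

// Hypothetical stand-in for the deck's grav(): acceleration of a body at
// `self` caused by a mass `m` at `other`, with G = 1.
// It is pure: it reads its arguments and touches no shared state.
Vec2 grav(Vec2 self, Vec2 other, double m) {
    Vec2 d = other - self;
    double r2 = d.x * d.x + d.y * d.y;
    double r = std::sqrt(r2);
    return d * (m / (r2 * r));
}

struct Body { Vec2 p, v; double m; };

// One time step. Each iteration of the outer loop reads only the old state
// and writes only next[i], so the three iterations are independent.
std::array<Body, 3> step(const std::array<Body, 3>& now, double dt) {
    std::array<Body, 3> next = now;
    for (std::size_t i = 0; i < 3; ++i) {
        Vec2 a{0.0, 0.0};
        for (std::size_t j = 0; j < 3; ++j) {
            if (j != i) a = a + grav(now[i].p, now[j].p, now[j].m);
        }
        next[i].v = now[i].v + a * dt;          // v = v + dv
        next[i].p = now[i].p + next[i].v * dt;  // p = p + dp
    }
    return next;
}
```

Because no iteration reads a value that another iteration writes, the outer loop could run its three iterations concurrently, which is where the theoretical 3x figure comes from.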
The Problem With Directive-Based Programming
Example: 3-body problem
#pragma omp parallel
#pragma omp single nowait
{
    #pragma omp task
    {
        a1 = grav(P1, P2, m2) + grav(P1, P3, m3);
        dv1 = a1*dt;
        v1 = v1 + dv1;
        dp1 = v1*dt;
    }
    #pragma omp task
    {
        a2 = grav(P2, P1, m1) + grav(P2, P3, m3);
        dv2 = a2*dt;
        v2 = v2 + dv2;
        dp2 = v2*dt;
    }
    #pragma omp task
    {
        a3 = grav(P3, P2, m2) + grav(P3, P1, m1);
        dv3 = a3*dt;
        v3 = v3 + dv3;
        dp3 = v3*dt;
    }
    #pragma omp taskwait
}
Using directive-based approaches like OpenMP, the burden is on the programmer to identify where the program can be safely parallelized. The programmer then has to add the correct pragmas.
But maybe you could parallelize other things…
The Problem With Directive-Based Programming
Example: 3-body problem
#pragma omp parallel
#pragma omp single nowait
{
    #pragma omp task
    g1 = grav(P1, P2, m2);
    #pragma omp task
    g2 = grav(P1, P3, m3);
    #pragma omp task
    g3 = grav(P2, P1, m1);
    #pragma omp task
    g4 = grav(P2, P3, m3);
    #pragma omp task
    g5 = grav(P3, P2, m2);
    #pragma omp task
    g6 = grav(P3, P1, m1);
    #pragma omp taskwait
}
a1 = g1 + g2;
dv1 = a1*dt;
v1 = v1 + dv1;
dp1 = v1*dt;
a2 = g3 + g4;
dv2 = a2*dt;
v2 = v2 + dv2;
dp2 = v2*dt;
a3 = g5 + g6;
dv3 = a3*dt;
v3 = v3 + dv3;
dp3 = v3*dt;
But now you have to start re-arranging the code, moving further away from the original description of the algorithm.
Possible Race Conditions!
If the grav function modifies its inputs or calls non-thread-safe functions, there could be hard-to-detect race conditions, leading to incorrect results.
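The race-condition warning can be made concrete with a small sketch. Everything here is an illustrative assumption and not code from the deck: positions are reduced to one dimension, and `grav_unsafe`, `grav_pure`, `scratch_r2`, and `accel_on_first` are hypothetical names. The unsafe variant parks an intermediate value in a variable shared by all callers, which is the kind of hidden state that makes concurrent tasks interfere with each other.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical non-thread-safe variant: the intermediate result lives in a
// variable shared by every caller. Two tasks running this at the same time
// can overwrite each other's value between the write and the read.
static double scratch_r2 = 0.0;
double grav_unsafe(double self, double other, double m) {
    double d = other - self;
    scratch_r2 = d * d;                          // write to shared state
    return m * d / (scratch_r2 * std::fabs(d));  // read it back
}

// Pure variant: every intermediate value is local to the call.
double grav_pure(double self, double other, double m) {
    double d = other - self;
    double r2 = d * d;
    return m * d / (r2 * std::fabs(d));
}

// The task structure from the slide, applied to the pure function.
// Built without OpenMP support, the pragmas are ignored and the two calls
// simply run one after the other.
double accel_on_first(double p1, double p2, double m2, double p3, double m3) {
    double g1 = 0.0, g2 = 0.0;
    #pragma omp parallel
    #pragma omp single nowait
    {
        #pragma omp task shared(g1)
        g1 = grav_pure(p1, p2, m2);
        #pragma omp task shared(g2)
        g2 = grav_pure(p1, p3, m3);
        #pragma omp taskwait
    }
    return g1 + g2;
}
```

Run on one thread, both variants return the same value, which is why this class of bug tends to surface only under real concurrency. Substituting `grav_unsafe` into the two tasks would introduce a data race on `scratch_r2` that neither the compiler nor the pragmas would flag.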
SequenceL: Self-Parallelizes, Race-Free, Readable
Example: 3-body problem
threeBody(P1, m1, P2, m2, P3, m3, dt) :=
    let
        a1 := grav(P1, P2, m2) + grav(P1, P3, m3);
        dv1 := a1*dt;
        v1 := v1 + dv1;
        dp1 := v1*dt;
        a2 := grav(P2, P1, m1) + grav(P2, P3, m3);
        dv2 := a2*dt;
        v2 := v2 + dv2;
        dp2 := v2*dt;
        a3 := grav(P3, P2, m2) + grav(P3, P1, m1);
        dv3 := a3*dt;
        v3 := v3 + dv3;
        dp3 := v3*dt;
    in
        [dp1, dp2, dp3];
With SequenceL the programmer does not add any parallel constructs or pragmas.
The program will self-parallelize if it is safe to do so (no race conditions).
Code clarity and intent remain, greatly improving correctness and quality.
Subsequent enhancements and innovations are rapid.
This ease of reading/writing is not by accident.
Ease of Reading/Writing SequenceL
• Matrix Multiply:
  ─ The product of an m×p matrix A with a p×n matrix B is an m×n matrix denoted AB whose entries are given by:
    $(AB)_{ij} = \sum_{k=1}^{p} A_{ik} B_{kj}$
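For reference, the definition above maps onto a conventional triple loop. The sketch below is in C++ and is not the listing from the slides; the `Matrix` alias and the `matmul` name are assumptions made for illustration, and both inputs are assumed to be non-empty and rectangular with matching inner dimension.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Direct transcription of (AB)_ij = sum over k of A_ik * B_kj,
// for an m x p matrix A and a p x n matrix B.
Matrix matmul(const Matrix& A, const Matrix& B) {
    const std::size_t m = A.size();
    const std::size_t p = B.size();
    const std::size_t n = B[0].size();
    Matrix C(m, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            for (std::size_t k = 0; k < p; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    return C;
}
```

The index bookkeeping and the accumulation into `C[i][j]` are the “how” that the one-line mathematical definition leaves out.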
• Matrix Multiply in Java:
  [Code listing not captured in this transcript]
• Matrix Multiply in SequenceL:
  [Two alternative code listings, separated by “- or -”, not captured in this transcript]
High-Abstraction, High Performance
• Parallel Matrix Multiply in SequenceL:
[Figure: Matrix Multiply Acceleration. Speedup (X, axis from 0 to 70) versus number of cores (C++ Ref., 1, 2, 4, 8, 16, 32). Reference = sequential C++.]
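The deck does not define how the chart's “X” values are computed. Assuming the usual definition of speedup against the stated sequential C++ reference, with $T_{\text{ref}}$ the reference run time and $T(n)$ the run time on $n$ cores (both symbols introduced here for illustration), the plotted quantity would be:

```latex
S(n) = \frac{T_{\text{ref}}}{T(n)}, \qquad \text{with linear scaling corresponding to } S(n) = n \cdot S(1)
```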
Sample SequenceL Performance Speedups
[Figure: Times Faster (axis from 0 to 12) versus Number of Processor Cores (axis from 0 to 16). Series: Matrix Multiply, Game of Life, 2D FFT, LU factorization, QuickSort, String Search, Barnes-Hut, n-Body, Matrix Inverse, Sparse Matrix, Compression, Adesk (DC), Adesk (LW), Matrix Multiply (blocking), Semblance, Speech filter, Perfect.]
To learn more:
Watch a short 3-part video tutorial at:
http://www.texasmulticoretechnologies.com/resources/videos/
Email: sales@texasmulticore.com for a free 45-day trial
www.texasmulticore.com
