Aggregation of Parallel Computing and Hardware/Software Co-Design Techniques for High-Performance Remote Sensing Applications
Presenter: Dr. Alejandro Castillo Atoche
IGARSS'11, 2011/07/25
School of Engineering, Autonomous University of Yucatan, Merida, Mexico.
Outline
- Introduction
- Previous Work
- HW/SW Co-design Methodology
- Case Study: DEDR-related RSF/RASF Algorithms
- Systolic Architectures (SAs) as Co-processors
- Integration in a Co-design Scheme
- New Design Perspective: Super-Systolic Arrays and VLSI Architectures
- Hardware Implementation Results
- Performance Analysis
- Conclusions
Introduction: Radar Imagery Facts
Advanced high-resolution remote sensing (RS) operations are computationally complex, and recently developed RS image reconstruction/enhancement techniques are far too slow for (near) real-time implementation. In previous works, the algorithms were implemented as conventional simulations on personal computers (typically in MATLAB), on digital signal processing (DSP) platforms, or on clusters of PCs.
Introduction: HW/SW Co-design Facts
Why hardware/software (HW/SW) co-design? HW/SW co-design is a hybrid method aimed at increasing the flexibility of implementations and improving the overall design process.
Why systolic arrays? They are extremely fast, and the architecture scales easily.
Why parallel techniques? They optimize the loops that generally take most of the execution time in RS algorithms.
Motivation
Novel RS imaging applications now require a (near) real-time response in areas such as target detection for military purposes, tracking wildfires, and monitoring oil spills. In previous works, virtual remote sensing laboratories were developed; we now intend to design efficient HW architectures that pursue real-time operation.
Contributions
First, the application of parallel computing techniques based on loop optimization transformations yields efficient super-systolic array (SSA)-based co-processor units for the selected reconstructive SP subtasks. Second, the addressed HW/SW co-design methodology targets an efficient HW implementation of the enhancement/reconstruction regularization methods using the proposed SSA-based co-processor architectures.
HW/SW Co-design: Methodology
The proposed co-design methodology encompasses the following general stages:
(i) algorithmic implementation of the DEDR RSF/RASF (reference simulation on MATLAB and C++ platforms);
(ii) computational task partitioning;
(iii) aggregation of parallel computing techniques;
(iv) architecture design of the addressed reconstructive SP computational tasks onto HW blocks (SSAs).
HW/SW Co-design: Methodology [figure slide]

Algorithmic Reference Implementation [figure slides]

Partitioning Phase [figure slide]
Aggregation of Parallel Computing Techniques
We consider several parallel optimization techniques used in high-performance computing (HPC) in order to exploit the maximum possible parallelism in the design (a sketch of two of them follows this list):
- Loop unrolling
- Nested loop optimization
- Loop interchange
- Tiling
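For concreteness, here is a minimal C++ sketch (our illustration, not taken from the slides) of two of these transformations, loop unrolling and tiling, applied to the matrix-vector product that serves as the case study below; the function names mv_unrolled and mv_tiled are ours:

    #include <algorithm>
    #include <vector>

    void mv_unrolled(const std::vector<std::vector<double>>& a,
                     const std::vector<double>& v, std::vector<double>& u,
                     std::size_t m, std::size_t n) {
        for (std::size_t i = 0; i < m; i++) {
            double acc = 0.0;
            std::size_t j = 0;
            for (; j + 4 <= n; j += 4)       // loop unrolling: 4 MACs per iteration
                acc += a[i][j]*v[j]   + a[i][j+1]*v[j+1]
                     + a[i][j+2]*v[j+2] + a[i][j+3]*v[j+3];
            for (; j < n; j++)               // remainder loop for n not divisible by 4
                acc += a[i][j]*v[j];
            u[i] = acc;
        }
    }

    void mv_tiled(const std::vector<std::vector<double>>& a,
                  const std::vector<double>& v, std::vector<double>& u,
                  std::size_t m, std::size_t n, std::size_t tile) {
        for (std::size_t i = 0; i < m; i++) u[i] = 0.0;
        for (std::size_t jj = 0; jj < n; jj += tile)   // tiling: reuse v[jj..jj+tile)
            for (std::size_t i = 0; i < m; i++)
                for (std::size_t j = jj; j < std::min(jj + tile, n); j++)
                    u[i] += a[i][j] * v[j];
    }

Unrolling reduces loop overhead and exposes independent multiply-accumulates; tiling bounds the working set of v so a strip of it can stay in fast local memory while every row consumes it.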
Case Study: Matrix-Vector Multiplication
The matrix-vector multiplication operation is described by the following sum (reconstructed here from the surrounding definitions):

    u[i] = sum_{j=0}^{n-1} a[i][j] * v[j],   0 <= i < m

where
a is the input matrix of dimensions m×n,
v is the input vector of dimensions n×1,
u is the result vector of dimensions m×1,
i is the row index with range 0 to m-1.
The matrix-vector multiplication is usually implemented in a sequential programming language such as C++ as:

    for (i = 0; i < m; i++) {
        u[i] = 0;
        for (j = 0; j < n; j++) {
            u[i] = u[i] + a[i][j] * v[j];
        }
    }

To find out whether we can speed up this algorithm, we first need to rewrite it in such a way that we can see all of its data dependencies. For this purpose, we use single-assignment notation.

Inputs:
    a[i,j] = A[i,j] : 0 <= i < m, 0 <= j < n
    v[j]   = V[j]   : 0 <= j < n
Outputs:
    U[i] = u[i]     : 0 <= i < m
Case Study: Matrix-Vector Multiplication -> Index Matching
First, we assign each operation in the matrix-vector multiplication algorithm a location in a space called the index space (a two-dimensional (i, j) grid). We also rewrite the algorithm so that we can assign a coordinate in this index space to each operation. This operation is called index matching.

    for (i = 0; i < m; i++) {
        u[i][0] = 0;
        for (j = 0; j < n; j++) {
    S(i,j): u[i][0] = u[i][0] + a[i][j] * v[0][j];
        }
    }

NOTE: The algorithm has not been changed in any way; the added coordinate [0] has no effect with respect to the previous form of the algorithm.

Inputs:
    a[i,j]  = A[i,j] : 0 <= i < m, 0 <= j < n
    v[0][j] = V[j]   : 0 <= j < n
Outputs:
    U[i] = u[i][0]   : 0 <= i < m
Case Study: Matrix-Vector Multiplication -> Single Assignment Stage
Now that each operation is assigned to a single point in the index space, we can rewrite the algorithm so that each variable is assigned only once per coordinate of the index space:

    for (i = 0; i < m; i++) {
        u[i][0] = 0;
        for (j = 0; j < n; j++) {
            u[i][j+1] = u[i][j] + a[i][j] * v[0][j];
        }
    }

In this version, one variable assignment is performed at each point (PE) of the index space. Note that the input vector must be visible to all the PEs for the algorithm to operate correctly.

Inputs:
    a[i,j]  = A[i,j] : 0 <= i < m, 0 <= j < n
    v[0][j] = V[j]   : 0 <= j < n
Outputs:
    U[i] = u[i][n]   : 0 <= i < m
Case Study: Matrix-Vector Multiplication -> Broadcast Removal
A broadcast signal implies large routing resources and big drivers, which can translate into many buffers being inserted in the final circuit. To avoid this, we remove the broadcast variable by passing it through each of the PEs:

    for (i = 0; i < m; i++) {
        u[i][0] = 0;
        for (j = 0; j < n; j++) {
            u[i][j+1] = u[i][j] + a[i][j] * v[i][j];
            v[i+1][j] = v[i][j];
        }
    }

This form of the algorithm not only complies with the single-assignment requirement but also has locality: each point depends only on data from its neighbors. The resulting graph is called a dependency graph (DG).
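The broadcast-removed form can be executed directly; here is a small self-contained C++ check (our addition, with arbitrary test data) that it reproduces the plain sequential product:

    #include <cassert>
    #include <vector>

    int main() {
        const int m = 5, n = 5;
        std::vector<std::vector<double>> a(m, std::vector<double>(n));
        std::vector<std::vector<double>> v(m + 1, std::vector<double>(n));
        std::vector<std::vector<double>> u(m, std::vector<double>(n + 1));
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) a[i][j] = i + 0.5 * j;   // arbitrary test data
        for (int j = 0; j < n; j++) v[0][j] = j + 1.0;           // V enters at row 0

        // Broadcast-removed single-assignment form from the slide.
        for (int i = 0; i < m; i++) {
            u[i][0] = 0;
            for (int j = 0; j < n; j++) {
                u[i][j+1] = u[i][j] + a[i][j] * v[i][j];  // local multiply-accumulate
                v[i+1][j] = v[i][j];                      // pass v down, no broadcast
            }
        }

        // Reference: plain sequential matrix-vector product.
        for (int i = 0; i < m; i++) {
            double ref = 0;
            for (int j = 0; j < n; j++) ref += a[i][j] * v[0][j];
            assert(u[i][n] == ref);   // same operations in the same order
        }
        return 0;
    }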
Case Study: Matrix-Vector Multiplication -> Scheduling
[Animated figure: the 5×5 index space, with matrix entries a[i][j] fed to the array, the vector values v[0][j] entering from the top, and outputs U[0]..U[4] emerging.]
Now let us see how the algorithm works in time.
In this processor array, it takes only 9 time cycles (m + n - 1 for m = n = 5) to run the entire matrix-vector multiplication, and in any given cycle at most 5 processors are active.
If we are only ever using a maximum of 5 processors, why should we build an array of 25?
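As a quick cross-check (our addition, not part of the original slides), the cycle count and peak processor usage can be reproduced in a few lines of C++, assuming the linear schedule t = i + j implied by the dependency graph:

    #include <algorithm>
    #include <iostream>
    #include <vector>

    int main() {
        const int m = 5, n = 5;
        std::vector<int> active(m + n - 1, 0);      // one counter per time step t
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                active[i + j]++;                    // operation (i,j) fires at t = i + j
        std::cout << "cycles: " << active.size() << "\n";                      // prints 9
        std::cout << "peak concurrent ops: "
                  << *std::max_element(active.begin(), active.end()) << "\n";  // prints 5
    }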
Case Study: Matrix-Vector Multiplication -> Allocation
[Figure: the index space projected along direction [1 0] onto a linear array of five processing elements P0..P4, with the rows of a[i][j] skewed in time and the outputs U[0]..U[4] emerging from the array.]
The circuit can operate with only 5 processors.
Case Study: Matrix-Vector Multiplication -> Space-Time Mapping
[Table slide.] In this table we can see which processor is in use at each instant t.
Now, if we plot the information from the table on a [t, p] axis, we can see that the polytope defined by this selection table is bounded by the inequalities p >= 0, p >= t - (n - 1), p <= t, and p <= m - 1, i.e.
    lower bound of p: p >= max(0, t - (n - 1))
    upper bound of p: p <= min(m - 1, t)
for all t,
where p is the position of the processing element in the transformed algorithm, and t is the time at which the processor at a given coordinate is activated.

If we analyze the transformations applied to our index space and describe the schedule for the new loop nest, we can rewrite the algorithm with the substitutions i = p, j = t - p. The original nest

    for (i = 0; i < m; i++) {
        u[i][0] = 0;
        for (j = 0; j < n; j++) {
            u[i][j+1] = u[i][j] + a[i][j] * v[i][j];
            v[i+1][j] = v[i][j];
        }
    }

becomes the space-time form

    for (t = 0; t < (m + n) - 1; t++) {
        forALL (p = max(0, t - (n - 1)); p <= min(m - 1, t); p++) {
            u[p][t-p+1] = u[p][t-p] + a[p][t-p] * v[p][t-p];
            v[p+1][t-p] = v[p][t-p];
        }
    }

where the inner forALL iterations are independent and can execute in parallel on the PEs.
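As a sanity check (ours, not the authors'), the following compilable C++ fragment runs both loop nests on the same data and asserts that they produce identical outputs, confirming that the substitution i = p, j = t - p preserves the dependencies:

    #include <algorithm>
    #include <cassert>
    #include <vector>

    int main() {
        const int m = 5, n = 5;
        std::vector<std::vector<double>> a(m, std::vector<double>(n));
        std::vector<std::vector<double>> v1(m + 1, std::vector<double>(n)), v2 = v1;
        std::vector<std::vector<double>> u1(m, std::vector<double>(n + 1)), u2 = u1;
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) a[i][j] = i * n + j + 1;   // arbitrary test data
        for (int j = 0; j < n; j++) v1[0][j] = v2[0][j] = j + 1;

        // Original (i, j) loop nest.
        for (int i = 0; i < m; i++) {
            u1[i][0] = 0;
            for (int j = 0; j < n; j++) {
                u1[i][j+1] = u1[i][j] + a[i][j] * v1[i][j];
                v1[i+1][j] = v1[i][j];
            }
        }
        // Space-time (t, p) loop nest with i = p, j = t - p.
        for (int t = 0; t < (m + n) - 1; t++)
            for (int p = std::max(0, t - (n - 1)); p <= std::min(m - 1, t); p++) {
                if (t - p == 0) u2[p][0] = 0;      // row p starts when j = 0
                u2[p][t-p+1] = u2[p][t-p] + a[p][t-p] * v2[p][t-p];
                v2[p+1][t-p] = v2[p][t-p];
            }

        for (int i = 0; i < m; i++) assert(u1[i][n] == u2[i][n]);  // identical results
        return 0;
    }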
Case Study: Matrix-Vector Multiplication -> Tiling + Strip Mining
Suppose we build an array of 4 processors and want to multiply a 10×10 matrix by a 10×1 vector. How can we solve a problem like this when we only have 4 processors? (A sketch follows below.)
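A plausible answer (our sketch, assuming strip mining of the processor loop; the helper name mv_strip_mined is ours) is to fold the 10 logical rows onto the 4 physical PEs in strips: each strip reuses P0..P3, and ceil(10/4) = 3 passes cover the whole matrix:

    #include <algorithm>
    #include <vector>

    // Hypothetical helper (ours): m logical rows folded onto P physical PEs.
    void mv_strip_mined(const std::vector<std::vector<double>>& a,
                        const std::vector<double>& v, std::vector<double>& u,
                        int m, int n, int P) {
        for (int i0 = 0; i0 < m; i0 += P) {             // one strip of (at most) P rows
            int rows = std::min(P, m - i0);
            for (int t = 0; t < rows + n - 1; t++)      // space-time schedule per strip
                for (int p = std::max(0, t - (n - 1));
                     p <= std::min(rows - 1, t); p++) { // p = physical PE index
                    int i = i0 + p, j = t - p;
                    if (j == 0) u[i] = 0;               // PE p takes over row i
                    u[i] += a[i][j] * v[j];
                }
        }
    }

For m = n = 10 and P = 4 this makes three passes over the array (strips of 4, 4, and 2 rows), trading time for a fixed-size processor array.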
Integration in a HW/SW Co-design Scheme [figure slide]

New Perspective: Super-Systolic Arrays
A super-systolic array (SSA) is a network of systolic cells in which each cell is itself conceptualized as another systolic array, in a bit-level fashion.
The bit-level super-systolic architecture is a high-speed, highly pipelined structure that can be implemented as a coprocessor unit or even as a stand-alone VLSI ASIC.
FPGA-based Super-Systolic Architecture [figure slide]
Bit-level SSA Design on a High-Speed VLSI Architecture [figure slide]
Bit-level SSA Design on a High-Speed VLSI Architecture
The chip was designed with a standard-cell library in a 0.6 µm CMOS process. The resulting integrated-circuit core measures 7.4 mm × 3.5 mm. The total gate count is about 32K, using approximately 185K transistors. The 72-pin chip will be packaged in an 80-lead CQFP package.
Performance Analysis: VLSI [results slide]
Performance Analysis: FPGA [results slide]
Conclusions
The principal result of this study is the aggregation of parallel computing with regularized RS techniques into super-systolic array (SSA) architectures, integrated via the HW/SW co-design paradigm on FPGA or VLSI platforms for the real-time implementation of RS algorithms. The authors consider that the bit-level implementation of specialized SSAs of processors, in combination with VLSI/FPGA platforms, represents an emerging research field for real-time RS data processing in newer geospatial applications.
Recent Selected Journal Papers
- A. Castillo Atoche, D. Torres, Y. V. Shkvarko, "Towards Real Time Implementation of Reconstructive Signal Processing Algorithms Using Systolic Arrays Coprocessors," Journal of Systems Architecture (Elsevier), vol. 56, no. 8, pp. 327-339, Aug. 2010, ISSN 1383-7621, doi:10.1016/j.sysarc.2010.05.004.
- A. Castillo Atoche, D. Torres, Y. V. Shkvarko, "Descriptive Regularization-Based Hardware/Software Co-Design for Real-Time Enhanced Imaging in Uncertain Remote Sensing Environment," EURASIP Journal on Advances in Signal Processing (Hindawi), vol. 2010, 31 pages, 2010, ISSN 1687-6172, e-ISSN 1687-6180, doi:10.1155/ASP.
- Y. V. Shkvarko, A. Castillo Atoche, D. Torres, "Near Real Time Enhancement of Geospatial Imagery via Systolic Implementation of Neural Network-Adapted Convex Regularization Techniques," Pattern Recognition Letters (Elsevier), 2011, in press.
Thanks for your attention.
Dr. Alejandro Castillo Atoche
Email: acastill@uady.mx
School of Engineering, Autonomous University of Yucatan, Merida, Mexico.