CSBP: A Fast Circuit Similarity-Based
          Placement for FPGA Incremental Design
               and Design Space Exploration

      Xiaoyu Shi (1), Dahua Zeng (1), Yu Hu (2), Guohui Lin (1), Osmar R. Zaiane (1)

      (1) Dept. of Computing Science, University of Alberta
      (2) Dept. of Electrical and Computer Engineering, University of Alberta

                           Presented by Xiaoyu Shi





                Please address comments to bryanhu@ece.ualberta.ca
Outline


      Introduction

      Circuit Similarity-Based Placement


      Experimental Results


      Conclusion and Future Work
Introduction
 Field Programmable Gate Array (FPGA)
    Ease of design, low start-up costs and fast manufacturing
     turnaround time.
    FPGA capacity has reached the million-gate level.
    Modern FPGA designs suffer from long compilation times.

 Figure: Xilinx SPARTAN-6 board
 FPGA placement
    Determines which logic block within an FPGA should implement each of the
     logic blocks required by the circuit.
    Has a significant impact on performance and routability in nanometer
     circuit designs.
    The optimization goal is to minimize criteria such as wire length,
     critical delay and area.
    Has become the bottleneck of modern FPGA circuit design [Chen’06].
 Up-to-date fast placement algorithms
    For decades, extensive studies have sought to improve the efficiency of
     placement as a single synthesis phase.
    State-of-the-art approaches include multi-core [Ludwin’08], embedding-based
     [Gopalakrishnan’06], partitioning-based [Maidee’05], multi-level
     [Sankar’99] and simulated annealing [Betz’97] placement.
Reusable Info in CAD
 Incremental design for FPGAs
    Design preservation is the key to incremental design.
    Similarity among circuits exists because functional changes or optimizations
     are small, and they generally result in a similar topology of the modified
     circuit compared to the original circuit [Krishnaswamy’09].


 Figure: Incremental design process for FPGAs. Steps: Iteration 1, Iteration 2, Iteration 3 …, Final iteration. Stages: initial design; changes due to verification, timing, etc.; optimizations, timing, etc. …; final design.
Reusable Info in CAD (Cont.)
 Design space exploration for FPGAs
    FPGA design offers a variety of customizations by varying design
     parameters.
    Local similarity and global similarity exist in design space exploration.




 Figure: Constant multiplier blocks by CMU SPIRAL [Puschel’04]
Data Mining
 Overview
    The goal of data mining is to extract patterns and useful information from
     data such as text, graphs and circuits.
    It has been studied extensively since the 1950s and is widely applied in
     domains such as business, science and health care.
    Graph mining, including graph pattern mining, graph classification and graph
     compression, is an active research area in data mining [Borgwardt’08].
 Graph similarity
    It quantitatively defines the topological similarity between two graphs.
    It has been applied to problems such as web searching
     [Kleinberg’99], social network mapping [Watts’99] and chemical structure
     matching [Hattori’03].
Graph Similarity
 Summary of graph similarity measures
 Measure               Description                                  Time complexity   Global topology
 Isomorphism           Identifying a bijection between the nodes    NP-Hard           Yes
 [Pelillo’02]          of two graphs which preserves (directed)
                       adjacency
 Edit distance         Given a cost function on edit operations,    NP-Hard           Yes
 [Bunke’99]            determine the minimum cost
                       transformation from one graph to another
 Common subgraph       Identifying the largest isomorphic           NP-Hard           Yes
 [Fernandez’01]        subgraphs of two graphs
 Iterative methods     Two graph elements are similar if their      Cubic             Yes
 [Blondel’04]          neighborhoods are similar
 Statistical methods   Assessing aggregate measures of graph        Linear            No
 [Alberta’02]          structure, degree distribution, diameter,
                       betweenness measures


 Iterative methods
      They have lower computational complexity than the NP-hard measures
       while still considering global topological information.
      They take advantage of graph sparsity.
Circuit Similarity
 Circuit similarity
     We define circuit similarity to describe the similar topological structures
      between two circuits.
     We adapt the iterative methods in graph similarity.
     It exists in several CAD phases, such as placement, routing and verification.
     It can be used to accelerate FPGA design tasks such as incremental
      design and design space exploration.
Motivating Example
 Figure: Graph G, Graph G’ and the circuit similarity algorithm

 Similarity score matrix for G and G’ (rows: nodes of G’; columns: nodes of G)

         V7     V8     V9     V10    V11    V12    V13    V14    V15    V16
 V’7     0.92   0.25   0.48   0.15   0      0      0      0.42   0.06   0
 V’8     0      0.73   0      0      0.05   0      0.39   0      0.17   0.06
 V’9     0      0.39   0      0      0.4    0      0.73   0      0.06   0.48
 V’10    0.48   0      0.89   0.25   0.3    0.12   0.14   0.06   0.33   0.09
 V’11    0      0      0.11   0.48   0      0.86   0      0.36   0.17   0
 V’12    0      0      0.3    0.34   0.64   0.25   0.39   0.34   0.15   0.42
 V’13    0.48   0.25   0.07   0.4    0      0.36   0      0.88   0.06   0
 V’14    0.4    0.39   0.29   0.15   0.15   0.18   0.12   0.46   0.59   0.06
 V’15    0      0.12   0.09   0      0.63   0      0.36   0      0.27   0.82
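
One way to turn a score matrix like this into a node correspondence is to solve an assignment problem that maximizes the total similarity. The sketch below is illustrative only: it uses a textbook Hungarian (Kuhn-Munkres) routine in pure Python, not the CS2 package named in the settings slide, and the function name `max_similarity_matching` is hypothetical.

```python
# Scores copied from the matrix: rows are V'7..V'15 (G'), columns are V7..V16 (G).
ROWS = ["V'%d" % i for i in range(7, 16)]
COLS = ["V%d" % i for i in range(7, 17)]
SCORES = [
    [0.92, 0.25, 0.48, 0.15, 0, 0, 0, 0.42, 0.06, 0],
    [0, 0.73, 0, 0, 0.05, 0, 0.39, 0, 0.17, 0.06],
    [0, 0.39, 0, 0, 0.4, 0, 0.73, 0, 0.06, 0.48],
    [0.48, 0, 0.89, 0.25, 0.3, 0.12, 0.14, 0.06, 0.33, 0.09],
    [0, 0, 0.11, 0.48, 0, 0.86, 0, 0.36, 0.17, 0],
    [0, 0, 0.3, 0.34, 0.64, 0.25, 0.39, 0.34, 0.15, 0.42],
    [0.48, 0.25, 0.07, 0.4, 0, 0.36, 0, 0.88, 0.06, 0],
    [0.4, 0.39, 0.29, 0.15, 0.15, 0.18, 0.12, 0.46, 0.59, 0.06],
    [0, 0.12, 0.09, 0, 0.63, 0, 0.36, 0, 0.27, 0.82],
]


def max_similarity_matching(scores):
    """Return {row: col} maximizing the summed score (needs rows <= cols)."""
    n, m = len(scores), len(scores[0])
    inf = float("inf")
    # Hungarian algorithm on negated scores; arrays are 1-indexed, 0 is a sentinel.
    u = [0.0] * (n + 1)
    v = [0.0] * (m + 1)
    p = [0] * (m + 1)      # p[j]: row currently assigned to column j
    way = [0] * (m + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = -scores[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip assignments back along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return {p[j] - 1: j - 1 for j in range(1, m + 1) if p[j]}


matching = max_similarity_matching(SCORES)
for r in sorted(matching):
    print(ROWS[r], "->", COLS[matching[r]])
```

For this matrix every node of G’ ends up paired with the highest score in its row, and V10 is left without a partner.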
Motivating Example (Cont.)
 Circuit similarity-based placement
    The initial placement of the new circuit design (G’) is generated by
     computing the similarity between the original (G) and modified circuits
     and finding the corresponding node matching.
    A low-temperature simulated annealing pass is then applied to refine
     the result.
    The proposed circuit similarity algorithm can be used to speed up
     placement, which allows faster incremental design and design space
     exploration.
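
The refinement step can be pictured with a toy annealer that starts from an already good placement and a low temperature. This is a minimal sketch under stated assumptions, not VPR's annealing schedule: the cost is plain half-perimeter wire length, moves are pairwise swaps, and the names `hpwl` and `refine` as well as the parameter values are hypothetical.

```python
import math
import random


def hpwl(placement, nets):
    """Sum of half-perimeter bounding boxes over all nets."""
    total = 0
    for net in nets:
        xs = [placement[b][0] for b in net]
        ys = [placement[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def refine(placement, nets, t_start=0.5, cooling=0.9, moves_per_temp=50, seed=0):
    """Low-temperature annealing: swap two blocks, keep the best placement seen."""
    rng = random.Random(seed)
    current = dict(placement)
    cost = hpwl(current, nets)
    best, best_cost = dict(current), cost
    blocks = sorted(current)
    t = t_start  # assumed low start temperature: few uphill moves are accepted
    while t > 0.01:
        for _ in range(moves_per_temp):
            a, b = rng.sample(blocks, 2)
            current[a], current[b] = current[b], current[a]
            new_cost = hpwl(current, nets)
            delta = new_cost - cost
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = dict(current), cost
            else:
                current[a], current[b] = current[b], current[a]  # undo the swap
        t *= cooling
    return best, best_cost


# Toy data (hypothetical): four blocks on a 2x2 grid, three two-pin nets.
initial = {"a": (0, 0), "b": (1, 1), "c": (0, 1), "d": (1, 0)}
nets = [["a", "b"], ["b", "c"], ["c", "d"]]
initial_cost = hpwl(initial, nets)
refined, refined_cost = refine(initial, nets)
print(initial_cost, refined_cost)
```

Because the best placement seen is tracked separately, the returned cost never exceeds the cost of the starting placement.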
Motivating Example (Cont.)




 Figure: Placement layouts comparison of circuit “des”: (a) placement of reference config, (b) initial placement using CS, (c) final placement using CS, (d) initial placement using VPR, (e) final placement using VPR.

 A real example
    For circuit “des”, the reference configuration (synthesized using the
     “resyn3” script in ABC) has 1245 CLBs and 1501 nets, while the new
     configuration (synthesized using the “rwsat2” script in ABC) has
     1215 CLBs and 1471 nets.
    The results show that CSBP successfully finds the internal node
     correspondence.

 Status of placement results of circuit “des”
                Wire    Delay (E-05)   Critical delay (E-08)   Runtime (s)
 CS-init        306     5.93           -                       -
 VPR-init       1087    14.00          -                       -
 CS-final       237     5.08           8.28                    13.38
 VPR-final      221     4.98           10.10                   28.42
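
The ratios implied by the “des” numbers can be checked with a few lines of arithmetic. The dictionary keys below are hypothetical names; the values are copied from the table.

```python
results = {
    "CS-init":   {"wire": 306,  "delay": 5.93e-5},
    "VPR-init":  {"wire": 1087, "delay": 14.00e-5},
    "CS-final":  {"wire": 237,  "delay": 5.08e-5, "crit_delay": 8.28e-8, "runtime_s": 13.38},
    "VPR-final": {"wire": 221,  "delay": 4.98e-5, "crit_delay": 10.10e-8, "runtime_s": 28.42},
}


def ratio(a, b, key):
    """Value of `key` for configuration a divided by the value for b."""
    return results[a][key] / results[b][key]


init_wire_reduction = 1 - ratio("CS-init", "VPR-init", "wire")
final_wire_overhead = ratio("CS-final", "VPR-final", "wire") - 1
crit_delay_gain = 1 - ratio("CS-final", "VPR-final", "crit_delay")
speedup = ratio("VPR-final", "CS-final", "runtime_s")

print("initial wire cost reduction: %.1f%%" % (100 * init_wire_reduction))
print("final wire cost overhead:    %.1f%%" % (100 * final_wire_overhead))
print("critical delay improvement:  %.1f%%" % (100 * crit_delay_gain))
print("placement speedup:           %.2fx" % speedup)
```

For this circuit the CS initial placement has roughly 72% lower wire cost than VPR's initial placement, and the final CS result trades about 7% more wire cost for about 18% lower critical delay at about 2.1x less placement time.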
Circuit Similarity CAD Flow




 Figures: CAD flow for incremental design; CAD flow for design space exploration
Circuit Similarity Algorithm
 Iterative similarity algorithm
     We employ the iterative similarity algorithm for undirected molecular
      graphs [Rupp’07].
     We adapt it to circuits: the graphs are directed, the I/O pins are
      fixed, and the similarities of fanin and fanout nodes are computed
      separately, based on constraints unique to circuits.

 Figure: algorithm pseudocode, including the test: If (|in(vi)| < |in(v’j)| and |out(vi)| < |out(v’j)|)

 Table: Summary of variables
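
The flavor of such an adaptation can be shown with a neighborhood-based update in the spirit of the iterative methods [Blondel’04]: a pair of nodes scores high when their fanins are similar and their fanouts are similar, while pre-matched I/O pins are held fixed. This is a sketch under assumptions, not the formulation on the slide (which is not preserved here); the function names, the max-normalization and the toy circuits are all hypothetical.

```python
def fanin_of(fanout):
    """Invert a fanout adjacency map into a fanin adjacency map."""
    fanin = {v: [] for v in fanout}
    for v, outs in fanout.items():
        for w in outs:
            fanin[w].append(v)
    return fanin


def iterative_similarity(out_g, out_h, fixed, iterations=10):
    """Score node pairs of two directed graphs; `fixed` maps pre-matched
    nodes of G (e.g. I/O pins) to their partners in H."""
    in_g, in_h = fanin_of(out_g), fanin_of(out_h)
    fixed_targets = set(fixed.values())

    def pinned(v, w):
        # Pairs touching a pre-matched node are frozen at 1 (match) or 0.
        if v in fixed or w in fixed_targets:
            return 1.0 if fixed.get(v) == w else 0.0
        return None

    pairs = [(v, w) for v in out_g for w in out_h]
    score = {}
    for v, w in pairs:
        pin = pinned(v, w)
        score[v, w] = 1.0 if pin is None else pin
    for _ in range(iterations):
        new = {}
        for v, w in pairs:
            pin = pinned(v, w)
            if pin is not None:
                new[v, w] = pin
                continue
            # Fanin and fanout neighborhoods contribute separately.
            s_in = sum(score[a, b] for a in in_g[v] for b in in_h[w])
            s_out = sum(score[a, b] for a in out_g[v] for b in out_h[w])
            new[v, w] = s_in + s_out
        top = max((new[p] for p in pairs if pinned(*p) is None), default=0.0)
        if top > 0:  # rescale the free pairs into [0, 1]
            for p in pairs:
                if pinned(*p) is None:
                    new[p] /= top
        score = new
    return score


# Toy circuits (hypothetical): two inputs, two internal nodes, one output.
G = {"x": ["a"], "y": ["b"], "a": ["z"], "b": ["z"], "z": []}
H = {"x2": ["a2"], "y2": ["b2"], "a2": ["z2"], "b2": ["z2"], "z2": []}
scores = iterative_similarity(G, H, fixed={"x": "x2", "y": "y2", "z": "z2"})
print(scores["a", "a2"], scores["a", "b2"])
```

Fixing the I/O pins gives the iteration anchor points, so internal nodes fed by matching inputs score higher than nodes that only share an output.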
Performance Enhancement
 Support constraint
    A support of a node is the set of nodes with predefined matchings
     in the transitive fanin or fanout cone of this node.
    Formally, if v ∈ G and v’ ∈ G’, the support constraint requires:

      where β ∈ (0,1].
 Level constraint
    A topological sort and reverse topological sort can label each
     internal node with two values.
    Formally, if v ∈ G and v’ ∈ G’, the level constraint requires:

      where Bl and Br are two nonnegative integers.

 Figure: Effectiveness of the pruning techniques
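
To illustrate how a level constraint can prune candidate pairs, the sketch below labels each node with a forward and a reverse topological level and keeps a pair only when both level differences are within the bounds. The exact inequality on the slide is not preserved, so the form used here (absolute difference at most Bl and at most Br) is an assumption, as are the function names and the toy graphs.

```python
from collections import deque


def levels(fanout):
    """Longest-path level of every node, counted from the sources of a DAG."""
    indeg = {v: 0 for v in fanout}
    for outs in fanout.values():
        for w in outs:
            indeg[w] += 1
    level = {v: 0 for v in fanout}
    queue = deque(v for v, d in indeg.items() if d == 0)
    while queue:  # Kahn-style topological traversal
        v = queue.popleft()
        for w in fanout[v]:
            level[w] = max(level[w], level[v] + 1)
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return level


def reverse(fanout):
    """Flip every edge so that levels() counts from the sinks instead."""
    rev = {v: [] for v in fanout}
    for v, outs in fanout.items():
        for w in outs:
            rev[w].append(v)
    return rev


def level_candidates(out_g, out_h, b_l, b_r):
    """Pairs (v, v') whose forward and reverse levels differ by at most
    b_l and b_r respectively (assumed form of the level constraint)."""
    lg, lh = levels(out_g), levels(out_h)
    rg, rh = levels(reverse(out_g)), levels(reverse(out_h))
    return {(v, w) for v in out_g for w in out_h
            if abs(lg[v] - lh[w]) <= b_l and abs(rg[v] - rh[w]) <= b_r}


# Toy graphs (hypothetical): a 4-node chain and a 3-node chain.
G = {"x": ["a"], "a": ["b"], "b": ["z"], "z": []}
H = {"x2": ["a2"], "a2": ["z2"], "z2": []}
loose = level_candidates(G, H, 1, 1)
tight = level_candidates(G, H, 0, 0)
print(len(loose), len(tight))
```

Tighter bounds leave fewer pairs to score, which is consistent with the turbo setting using Bl = Br = 0 and the high-quality setting using Bl = Br = 1.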
Incremental Design
 CAD flow
    Two-iteration CAD flow.
    CSBP flow (a) and from-scratch flow (b) are compared.
    Optimization “imfs” reduces the number of CLBs by 2%.
 Settings
    Two versions of CSBP are compared: a high-quality version (CS) with
     β = 0.5, inner_num = 1 and Bl = Br = 1; and a turbo version (CS-t)
     with β = 1, inner_num = 0.1 and Bl = Br = 0.
    CSBP is implemented in C and evaluated on the 20 largest MCNC
     benchmarks.
    The results are averaged over 5 runs on a Linux server with a
     dual-core 2.19GHz CPU and 5GB of memory.
    The CS2 package [Goldberg’97] is used to solve the maximum matching
     problem.

 Figure: CAD flow for incremental design
Results
 Initial placement results
    Bounding box cost (bb cost) and delay cost are compared.
    The initial placements generated by CS are much better than VPR’s
     initial placements and very close to VPR’s final results.

 Figure: Comparisons of initial bb cost (y-axis: percentage, 0% to 100%; one group per benchmark; series CS-init, VPR-final, VPR-init). CS reduces bb cost by 72% on avg. compared to VPR.
 Figure: Comparisons of initial delay cost (same axes and series). CS reduces delay cost by 53% on avg. compared to VPR.
 Benchmarks on the x-axis: alu4, apex2, apex4, bigkey, clma, des, diffeq, dsip, elliptic, ex1010, ex5p, frisc, misex3, pdc, s298, s38417, s38584, seq, spla, tseng.
Results (Cont.)
 Post-routing results comparison
      A low-temperature annealing is applied to the initial results.
      Wire length, critical delay and area are compared.
      The results demonstrate the effectiveness of the pruning
       techniques, which do not affect the quality significantly.

 Figure: Wire length per benchmark (series CS-t, CS, VPR). CS increases the wire length by 3% on avg.
 Figure: Area per benchmark (series CS-t, CS, VPR). CS increases the area by 2% on avg.
 Figure: Critical delay per benchmark (series CS-t, CS, VPR). CS increases the crit. delay by 6% on avg.
Results (Cont.)
 Runtime comparison
    Only placement time is compared.
    CS-t achieves 31x speedup on average, with up to 91x.
    More speedup is expected when circuits become larger.

 Figure: Speedups compared to VPR (y-axis: speedups, 0 to 100; series CS-t, CS, VPR)
Design Space Exploration
 CAD flow
    Study logic-level and algorithm-level design space, respectively.
    CSBP flow (a) and from-scratch flow (b) are compared.
 Settings
    The logic-level design space consists of 19 configurations generated
     by 19 ABC [1] synthesis scripts in abc.rc.
    The algorithm-level design space consists of 18 configurations of a
     constant multiplier generated by CMU SPIRAL [Puschel’04], varying
     bits from 7 to 25 [2].
    Both CS and CS-t are evaluated.
    The benchmarking environment is the same as for logic-level design
     space exploration.

 Figure: CAD flow for design space exploration

 [1] http://www.eecs.berkeley.edu/~alanmi/abc/
 [2] Bit = 16 is abandoned due to ABC crash
Logic-level Sample Synthesis Scripts
       Alias          Script
       resyn          "b; rw; rwz; b; rwz; b"
       resyn2         "b; rw; rf; b; rw; rwz; b; rfz; rwz; b"
       resyn2a        "b; rw; b; rw; rwz; b; rwz; b"
       src_rw         "st; rw -l; rwz -l; rwz -l"
       src_rs         "st; rs -K 6 -N 2 -l; rs -K 9 -N 2 -l; rs -K 12 -N 2 -l"
       choice         "fraig_store; resyn; fraig_store; resyn2; fraig_store; fraig_restore"
       rwsat          "st; rw -l; b -l; rw -l; rf -l"
       compress       "b -l; rw -l; rwz -l; b -l; rwz -l; b -l"
       share          "st; multi -m; fx; resyn2"

 Source: http://www.eecs.berkeley.edu/~alanmi/abc/
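
Some of these aliases are defined in terms of others (for example, “share” ends with “resyn2”). The short sketch below expands such nested definitions into a flat command list; the `expand` helper is hypothetical and only manipulates strings, it does not invoke ABC.

```python
# Alias definitions copied from the table above (subset).
ALIASES = {
    "resyn": "b; rw; rwz; b; rwz; b",
    "resyn2": "b; rw; rf; b; rw; rwz; b; rfz; rwz; b",
    "choice": "fraig_store; resyn; fraig_store; resyn2; fraig_store; fraig_restore",
    "share": "st; multi -m; fx; resyn2",
}


def expand(script, aliases):
    """Recursively replace alias names in a ';'-separated script."""
    commands = []
    for cmd in (part.strip() for part in script.split(";")):
        if not cmd:
            continue
        if cmd in aliases:
            commands.extend(expand(aliases[cmd], aliases))
        else:
            commands.append(cmd)
    return commands


share_cmds = expand("share", ALIASES)
choice_cmds = expand("choice", ALIASES)
print("; ".join(share_cmds))
print(len(choice_cmds))
```

Comparing the expanded lists shows how much two scripts overlap, which gives a rough idea of why configurations in the logic-level design space tend to share structure.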
Logic Level Results
 Initial results comparison
     The number of CLBs and levels vary widely in logic-level design space.
     Show circuit “dsip” as an example.
     Bounding box cost and delay cost are compared for initial placement
      results.

 Figure: Characteristics of logic-level design space
 Figure: Initial bb cost of “dsip” per synthesis script (series CS, CS-t, VPR). CS reduces bb cost by 76% on avg.
 Figure: Initial delay cost of “dsip” per synthesis script (y-axis label: Critical delay; series CS, CS-t, VPR). CS reduces delay cost by 48% on avg.
 Synthesis scripts on the x-axis: resyn, resyn2, resyn2a, resyn3, compress, compress2, choice, choice2, rwsat, rwsat2, shake, share, src_rw, src_rs, src_rws, resyn2rs, compress2rs, resyn2rsdc, compress2rsdc.
Logic Level Results (Cont.)
 Final placement results
    Wire length and critical delay of circuit “dsip” are compared.
    The final results produced by CS and CS-t are close to or better than
     VPR’s, with 32% overhead for wire length and 20% improvement for
     critical delay.


[Figures: Final wire length comparison of “dsip” (left) and final critical delay comparison of “dsip” (right). Y-axis: Percentage (0% to 100%). X-axis: the 19 designs resyn2a, resyn2, resyn3, compress2, shake, src_rws, resyn2rs, resyn, compress, rwsat2, share, compress2rs, resyn2rsdc, choice, compress2rsdc, choice2, rwsat, src_rs, src_rw. Series: CS-t, CS, VPR.]
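The overhead and improvement figures quoted for “dsip” are relative changes against the VPR baseline. A minimal sketch of that arithmetic follows; the function name and every number in it are hypothetical, chosen only so the percentages are easy to follow, and are not measurements from the slides.

```python
def relative_change(value: float, baseline: float) -> float:
    """Signed change of `value` relative to `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline * 100.0


# Hypothetical placement results, for illustration only.
vpr_wire, cs_wire = 1000.0, 1320.0      # total wire length (arbitrary units)
vpr_delay, cs_delay = 5.0e-8, 4.0e-8    # critical delay in seconds

# Positive overhead means CS is worse than the VPR baseline.
wire_overhead = relative_change(cs_wire, vpr_wire)

# Negating the change turns a reduction in delay into a positive improvement.
delay_improvement = -relative_change(cs_delay, vpr_delay)

print(f"wire length overhead: {wire_overhead:.1f}%")
print(f"critical delay improvement: {delay_improvement:.1f}%")
```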
Logic Level Results (Cont.)
 Design space shape characterization
        We compare the minimal, median and maximal wire length and critical delay
         produced by CS, CS-t and VPR.
        We also compare the shapes of each configuration over 19 designs.
        The almost identical curves show that CSBP is able to accurately depict the
         shape of a design space.

[Figure: Shape of final wire length of circuit “dsip” over the 19 designs; series: vpr, cs, cs-t]
[Figure: Shape of minimal wire length of 20 circuits over 19 designs; series: vpr-min, cs-min, cs-t-min; x-axis: alu4, apex2, apex4, bigkey, clma, des, diffeq, dsip, elliptic, ex1010, ex5p, frisc, misex3, pdc, s298, s38417, s38584, seq, spla, tseng]
[Figure: Shape of minimal crit. delay of 20 circuits over 19 designs; series: vpr-min, cs-min, cs-t-min; same x-axis]
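One way to make "almost identical curves" quantitative is sketched below: summarize each placer's results by minimum, median and maximum, then correlate the two curves point by point. The slides compare the curves visually, so the use of a Pearson correlation here is an assumption, and the sample values are hypothetical.

```python
from math import sqrt
from statistics import median


def summarize(values):
    """Return (min, median, max) of a metric over all designs."""
    return min(values), median(values), max(values)


def pearson(xs, ys):
    """Pearson correlation of two equally long series (1.0 means same shape)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical wire lengths of one circuit over five designs.
vpr_curve = [410.0, 395.0, 520.0, 480.0, 450.0]
cs_curve = [540.0, 520.0, 690.0, 630.0, 600.0]

print("VPR min/median/max:", summarize(vpr_curve))
print("CS  min/median/max:", summarize(cs_curve))
print("shape correlation:", round(pearson(vpr_curve, cs_curve), 4))
```

A correlation close to 1 means the faster placer ranks the designs the same way as the baseline even when its absolute values are offset, which is what matters when the goal is to locate the good region of a design space.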
Logic Level Results (Cont.)
                   Runtime comparison
                                    Only placement time is compared.
                                    CS-t achieves 30x speedup on
                                     average, with up to 100x.
                                    In practice, one can take
                                     advantage of the significant
                                     speedup of CS-t to perform quick
                                     design space exploration.
[Figure: Speedups compared to VPR; y-axis: Speedups (0 to 100); x-axis: alu4, apex2, apex4, bigkey, clma, des, diffeq, dsip, elliptic, ex1010, ex5p, frisc, misex3, pdc, s298, s38417, s38584, seq, spla, tseng]
[Runtime comparison of CS, CS-t and VPR (“*” marked time is measured with a timeout)]
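Per-circuit speedup is the baseline placement time divided by the candidate placement time. The sketch below computes it together with the average and the maximum; the slides do not say which kind of average they report, so the arithmetic mean is an assumption, and the circuit names and runtimes are hypothetical.

```python
def speedups(baseline_times, candidate_times):
    """Map each circuit to baseline_time / candidate_time."""
    result = {}
    for name, base in baseline_times.items():
        cand = candidate_times[name]
        if base <= 0 or cand <= 0:
            raise ValueError(f"non-positive runtime for {name}")
        result[name] = base / cand
    return result


# Hypothetical placement runtimes in seconds (placement time only).
vpr_times = {"circuit_a": 100.0, "circuit_b": 60.0, "circuit_c": 90.0}
cst_times = {"circuit_a": 1.0, "circuit_b": 6.0, "circuit_c": 3.0}

per_circuit = speedups(vpr_times, cst_times)
mean_speedup = sum(per_circuit.values()) / len(per_circuit)
max_speedup = max(per_circuit.values())

for name, s in per_circuit.items():
    print(f"{name}: {s:.1f}x")
print(f"mean: {mean_speedup:.1f}x, max: {max_speedup:.1f}x")
```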
Algorithm Level Results
     Experimental settings
           The algorithm-level design is a
            constant multiplier.
           The design parameter explored in our
            experiments is the number of fractional
            bits, varying from 7 to 25.¹
           CMU SPIRAL is used to generate the
            RTL design based on the Hcub algorithm
            [Voronenko’07].
     Experimental results
           The initial and final placement results
            are similar to those of the logic-level
            space exploration.
           CS and CS-t achieve 7x and 30x
            speedups compared to VPR,
            respectively.

Characteristics of algorithm-level design space generated by CMU SPIRAL
[Figure: An example of a constant parallel multiplier]
¹ Bit = 16 is abandoned due to ABC crash
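To see why the number of fractional bits changes the circuit, the sketch below quantizes a constant to a given number of fractional bits and multiplies by it using only shifts and adds. This is the naive binary method, used here as a stand-in; it is not the Hcub algorithm and not what CMU SPIRAL generates. The function names and the constant 0.7071 are hypothetical.

```python
def quantize(constant: float, frac_bits: int) -> int:
    """Fixed-point integer for `constant` with `frac_bits` fractional bits."""
    return round(constant * (1 << frac_bits))


def shift_add_multiply(x: int, coeff: int) -> int:
    """Multiply x by a non-negative integer coeff using only shifts and adds."""
    if coeff < 0:
        raise ValueError("coeff must be non-negative")
    result, shift = 0, 0
    while coeff:
        if coeff & 1:              # this bit contributes x * 2**shift
            result += x << shift
        coeff >>= 1
        shift += 1
    return result


def adder_count(coeff: int) -> int:
    """Adders needed by the naive method: one fewer than the set bits."""
    return max(bin(coeff).count("1") - 1, 0)


# More fractional bits give a more precise constant and, in general, more adders.
for frac_bits in (7, 8, 12):
    coeff = quantize(0.7071, frac_bits)
    print(frac_bits, coeff, adder_count(coeff), shift_add_multiply(3, coeff))
```

Each value of the parameter therefore yields a related but different netlist, which is the kind of design space that a similarity-based placer is meant to exploit.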
Algorithm Level Results (Cont.)
                                 Wire length-delay space comparison
                                           The Pareto points, which are the optimal configurations in a design space,
                                            are of most interest to IC designers.
                                           CS and VPR find the same Pareto points.
                                           Bits = 24 is used as the reference circuit.




[Figures: Wire length-delay space of VPR (left) and wire length-delay space of CS (right). X-axis: Wire length. Y-axis: Estimated critical delay. Points are labeled by bit width (B7 through B25).]
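A Pareto point is a configuration that no other configuration beats on both wire length and critical delay. The sketch below extracts the Pareto set from two placers' results and checks whether they agree; the configuration names and coordinates are hypothetical and are not read off the plots.

```python
def pareto_front(points):
    """Names of non-dominated points; both objectives are minimized.

    `points` maps a configuration name to (wire_length, critical_delay).
    """
    front = []
    for name, (w, d) in points.items():
        dominated = any(
            w2 <= w and d2 <= d and (w2 < w or d2 < d)
            for other, (w2, d2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical (wire length, critical delay) results from two placers.
vpr_points = {
    "cfg_a": (60.0, 1.5e-7),
    "cfg_b": (50.0, 1.9e-7),
    "cfg_c": (120.0, 2.2e-7),
    "cfg_d": (300.0, 3.5e-7),
}
cs_points = {
    "cfg_a": (90.0, 1.8e-7),
    "cfg_b": (80.0, 2.0e-7),
    "cfg_c": (200.0, 2.3e-7),
    "cfg_d": (420.0, 3.7e-7),
}

vpr_front = pareto_front(vpr_points)
cs_front = pareto_front(cs_points)
print("VPR Pareto points:", vpr_front)
print("CS  Pareto points:", cs_front)
print("same Pareto points:", vpr_front == cs_front)
```

The comparison is on configuration names, not coordinates: the fast placer may report larger absolute wire lengths and still identify the same configurations as optimal.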
Outline


      Introduction


      Circuit Similarity-Based Placement


      Experimental Results


      Conclusion and Future Work
Future Work
 Improvement to CSBP
    Integrate predefined matchings, for example, naming matching, into our
     CSBP to further enhance both the efficiency and the quality of the design.
 Other applications
    Study the effectiveness of applying the circuit similarity algorithm to other
     applications, such as routing and sequential verification for FPGAs.
Conclusion
 Proposed an efficient circuit similarity algorithm
 Developed CSBP, a fast circuit similarity-based placement for
  FPGAs
     Applied CSBP to incremental design and design space exploration.
     Open-source tool available at:
      http://webdocs.cs.ualberta.ca/~xshi/soft.html
 Applied CSBP to incremental design for FPGAs
     CSBP is able to reduce engineering effort by capturing the similarity from the
      previous design iterations.
     CSBP is 31x faster compared to VPR.
 Applied CSBP to design space exploration for FPGAs
     CSBP can precisely depict the shape of a design space and pinpoint the
      optimal designs.
     CSBP is 30x faster compared to VPR.
Xiaoyu Shi, Dahua Zeng, Yu Hu, Guohui Lin, Osmar R. Zaiane

 CSBP: A Fast Circuit Similarity-Based Placement for FPGA
    Incremental Design and Design Space Exploration





More Related Content

PDF
Id2514581462
PPT
study Streaming Multigrid For Gradient Domain Operations On Large Images
PDF
Design and implementation of Parallel Prefix Adders using FPGAs
PDF
Practical Spherical Harmonics Based PRT Methods
PDF
Ultrasound Modular Architecture
PDF
Ad24210214
PDF
Database slide
PDF
129966863283913778[1]
Id2514581462
study Streaming Multigrid For Gradient Domain Operations On Large Images
Design and implementation of Parallel Prefix Adders using FPGAs
Practical Spherical Harmonics Based PRT Methods
Ultrasound Modular Architecture
Ad24210214
Database slide
129966863283913778[1]

What's hot (14)

PDF
Direct tall-and-skinny QR factorizations in MapReduce architectures
PPT
Optimization of distributed generation of renewable energy sources by intelli...
PDF
Hz2514321439
PDF
10.1.1.2.9988
PDF
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
PDF
vasp-gpu on Balena: Usage and Some Benchmarks
PDF
A Novel CAZAC Sequence Based Timing Synchronization Scheme for OFDM System
PDF
SPU Optimizations - Part 2
PDF
On Data Mining in Inverse Scattering Problems: Neural Networks Applied to GPR...
PPTX
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
PDF
MULTI-POINT DESIGN OF A SUPERSONIC WING USING MODIFIED PARSEC AIRFOIL REPRESE...
PDF
Dynamic Texture Coding using Modified Haar Wavelet with CUDA
PDF
PDF
A Power Efficient Architecture for 2-D Discrete Wavelet Transform
Direct tall-and-skinny QR factorizations in MapReduce architectures
Optimization of distributed generation of renewable energy sources by intelli...
Hz2514321439
10.1.1.2.9988
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
vasp-gpu on Balena: Usage and Some Benchmarks
A Novel CAZAC Sequence Based Timing Synchronization Scheme for OFDM System
SPU Optimizations - Part 2
On Data Mining in Inverse Scattering Problems: Neural Networks Applied to GPR...
Vlsiphysicaldesignautomationonpartitioning 120219012744-phpapp01
MULTI-POINT DESIGN OF A SUPERSONIC WING USING MODIFIED PARSEC AIRFOIL REPRESE...
Dynamic Texture Coding using Modified Haar Wavelet with CUDA
A Power Efficient Architecture for 2-D Discrete Wavelet Transform
Ad

Viewers also liked (11)

PDF
Gradient-Based Multi-Objective Optimization Technology
PDF
The Multi-Objective Genetic Algorithm Based Techniques for Intrusion Detection
PDF
Gary Yen: "Multi-objective Optimization and Performance Metrics Ensemble"
PDF
Multi-Objective Optimization in Rule-based Design Space Exploration (ASE 2014)
PPTX
Multi-Objective Evolutionary Algorithms
PDF
Method of solving multi objective optimization problem in the presence of unc...
PDF
Multi objective optimization and Benchmark functions result
PDF
Cyber infrastructure in engineering design
DOCX
Pareto optimal
PDF
Multiobjective optimization and trade offs using pareto optimality
POT
Multi Objective Optimization
Gradient-Based Multi-Objective Optimization Technology
The Multi-Objective Genetic Algorithm Based Techniques for Intrusion Detection
Gary Yen: "Multi-objective Optimization and Performance Metrics Ensemble"
Multi-Objective Optimization in Rule-based Design Space Exploration (ASE 2014)
Multi-Objective Evolutionary Algorithms
Method of solving multi objective optimization problem in the presence of unc...
Multi objective optimization and Benchmark functions result
Cyber infrastructure in engineering design
Pareto optimal
Multiobjective optimization and trade offs using pareto optimality
Multi Objective Optimization
Ad

Similar to CSBP: A Fast Circuit Similarity-Based Placement for FPGA Incremental Design and Design Space Exploration (20)

PDF
09 placement
PDF
Applications of the PMP. Cell Formation in Group Technology
PDF
A Comparison of Computation Techniques for DNA Sequence Comparison
PPTX
Vlsi physical design automation on partitioning
PDF
Sketch sort sugiyamalab-20101026 - public
PDF
Evolutionary Algorithmical Approach for VLSI Physical Design- Placement Problem
PPT
PDF
VLSI-Physical Design- Tool Terminalogy
PDF
CMPP 2012 held in conjunction with ICNC&rsquo;12
PDF
A multithreaded method for network alignment
PDF
MS Thesis of Al Ameen 1.5 2010
PDF
RCIM 2008 - Modello Generale
PDF
Passive network-redesign-ntua
PDF
Introducing LCS to Digital Design Verification
PDF
Searching in metric spaces
PDF
Nanometer Testing: Challenges and Solutions
PDF
Abraham q3 2008
PPTX
Design for testability for Beginners PPT for FDP.pptx
PDF
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
PDF
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
09 placement
Applications of the PMP. Cell Formation in Group Technology
A Comparison of Computation Techniques for DNA Sequence Comparison
Vlsi physical design automation on partitioning
Sketch sort sugiyamalab-20101026 - public
Evolutionary Algorithmical Approach for VLSI Physical Design- Placement Problem
VLSI-Physical Design- Tool Terminalogy
CMPP 2012 held in conjunction with ICNC&rsquo;12
A multithreaded method for network alignment
MS Thesis of Al Ameen 1.5 2010
RCIM 2008 - Modello Generale
Passive network-redesign-ntua
Introducing LCS to Digital Design Verification
Searching in metric spaces
Nanometer Testing: Challenges and Solutions
CSBP: A Fast Circuit Similarity-Based Placement for FPGA Incremental Design and Design Space Exploration

  • 1. CSBP: A Fast Circuit Similarity-Based Placement for FPGA Incremental Design and Design Space Exploration
      Xiaoyu Shi (1), Dahua Zeng (1), Yu Hu (2), Guohui Lin (1), Osmar R. Zaiane (1)
      (1) Dept. of Computing Science, University of Alberta
      (2) Dept. of Electrical and Computer Engineering, University of Alberta
      Presented by Xiaoyu Shi
      Please address comments to bryanhu@ece.ualberta.ca
  • 2. Outline Introduction Circuit Similarity-Based Placement Experimental Results Conclusion and Future Work
  • 3. Introduction
      • Field Programmable Gate Array (FPGA)
          • Ease of design, low start-up costs and fast manufacturing turnaround time.
          • FPGA sizes have reached the million-gate level.
          • Modern FPGA designs suffer from long compilation times.
          [Figure: Xilinx SPARTAN-6 board]
      • FPGA placement
          • Determines which logic block within an FPGA should implement each of the logic blocks required by the circuit.
          • Has a significant impact on performance and routability in nanometer circuit designs.
          • The optimization goals are to minimize criteria such as wire length, critical delay and area.
          • Placement has become the bottleneck of modern FPGA circuit design [Chen’06].
      • Up-to-date fast placement algorithms
          • For decades, extensive studies have sought to improve placement efficiency as a single synthesis phase.
          • State-of-the-art work includes multi-core [Ludwin’08], embedding-based [Gopalakrishnan’06], partitioning-based [Maidee’05], multi-level [Sankar’99] and simulated annealing [Betz’97] approaches.
  • 4. Reusable Info in CAD
      • Incremental design for FPGAs
          • Design preservation is the key to incremental design.
          • Similarity among circuits exists because functional changes or optimizations are small, and they generally result in a similar topology of the modified circuit compared to the original circuit [Krishnaswamy’09].
      [Figure: Incremental design process for FPGAs. The initial design (iteration 1) passes through iterations 2, 3, ... to the final iteration and final design; changes are due to verification, timing, optimizations, etc.]
  • 5. Reusable Info in CAD (Cont.)
      • Design space exploration for FPGAs
          • FPGA design offers a variety of customizations by varying design parameters.
          • Local similarity and global similarity exist in design space exploration.
      [Figure: Constant multiplier blocks by CMU SPIRAL [Puschel’04]]
  • 6. Data Mining
      • Overview
          • The key to data mining is extracting patterns and useful information from data, including text, graphs and circuits.
          • It has been studied extensively since the 1950s and has been widely applied in many domains, such as business, science and health care.
          • Graph mining, including graph pattern mining, graph classification and graph compression, is a hot research area in data mining [Borgwardt’08].
      • Graph similarity
          • It quantitatively defines the topological similarity between two graphs.
          • It has been applied to many problems, such as web searching [Kleinberg’99], social network mapping [Watts’99] and chemical structure matching [Hattori’03].
  • 7. Graph Similarity
      • Summary of graph similarity measures

        Measure | Description | Time complexity | Global topology
        Isomorphism [Pelillo’02] | Identifying a bijection between the nodes of two graphs which preserves (directed) adjacency | NP-Hard | Yes
        Edit distance [Bunke’99] | Given a cost function on edit operations, determine the minimum cost transformation from one graph to another | NP-Hard | Yes
        Common subgraph [Fernandez’01] | Identifying the largest isomorphic subgraphs of two graphs | NP-Hard | Yes
        Iterative methods [Blondel’04] | Two graph elements are similar if their neighborhoods are similar | Cubic | Yes
        Statistical methods [Alberta’02] | Assessing aggregate measures of graph structure, degree distribution, diameter, betweenness measures | Linear | No

      • Iterative methods
          • They have lower computational complexity and consider global topological information.
          • They take advantage of graph sparsity.
  • 8. Circuit Similarity
      • Circuit similarity
          • We define circuit similarity to describe the similar topological structures between two circuits.
          • We adapt the iterative methods in graph similarity.
          • It exists in several CAD phases, such as placement, routing and verification.
          • It can be widely used to accelerate FPGA design tasks, such as incremental design and design space exploration.
  • 10. Motivating Example
      • Circuit similarity algorithm
      [Figures: Graph G; Graph G’]

        Similarity score matrix for G and G’ (rows: nodes of G’, columns: nodes of G)

               V7    V8    V9    V10   V11   V12   V13   V14   V15   V16
        V’7    0.92  0.25  0.48  0.15  0     0     0     0.42  0.06  0
        V’8    0     0.73  0     0     0.05  0     0.39  0     0.17  0.06
        V’9    0     0.39  0     0     0.4   0     0.73  0     0.06  0.48
        V’10   0.48  0     0.89  0.25  0.3   0.12  0.14  0.06  0.33  0.09
        V’11   0     0     0.11  0.48  0     0.86  0     0.36  0.17  0
        V’12   0     0     0.3   0.34  0.64  0.25  0.39  0.34  0.15  0.42
        V’13   0.48  0.25  0.07  0.4   0     0.36  0     0.88  0.06  0
        V’14   0.4   0.39  0.29  0.15  0.15  0.18  0.12  0.46  0.59  0.06
        V’15   0     0.12  0.09  0     0.63  0     0.36  0     0.27  0.82
  • 11. Motivating Example (Cont.)
      • Circuit similarity-based placement
          • The initial placement of the new circuit design (G’) is generated by computing the similarity between the original circuit (G) and the modified circuit, and finding the corresponding node matching.
          • A low-temperature simulated annealing is applied to further refine the results.
          • The proposed circuit similarity algorithm can be used to speed up placement, which allows faster incremental design and design space exploration.
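The matching step on this slide can be illustrated on the similarity score matrix from slide 10. The sketch below is only a minimal illustration: it uses a greedy highest-score-first matching as a stand-in (the deck itself reports using the CS2 package for the matching problem), and the helper name `greedy_match` is hypothetical.

```python
# Similarity scores copied from the slide-10 matrix:
# rows are nodes of G' (V'7..V'15), columns are nodes of G (V7..V16).
ROWS = ["V'%d" % i for i in range(7, 16)]
COLS = ["V%d" % i for i in range(7, 17)]
SIM = [
    [0.92, 0.25, 0.48, 0.15, 0, 0, 0, 0.42, 0.06, 0],
    [0, 0.73, 0, 0, 0.05, 0, 0.39, 0, 0.17, 0.06],
    [0, 0.39, 0, 0, 0.4, 0, 0.73, 0, 0.06, 0.48],
    [0.48, 0, 0.89, 0.25, 0.3, 0.12, 0.14, 0.06, 0.33, 0.09],
    [0, 0, 0.11, 0.48, 0, 0.86, 0, 0.36, 0.17, 0],
    [0, 0, 0.3, 0.34, 0.64, 0.25, 0.39, 0.34, 0.15, 0.42],
    [0.48, 0.25, 0.07, 0.4, 0, 0.36, 0, 0.88, 0.06, 0],
    [0.4, 0.39, 0.29, 0.15, 0.15, 0.18, 0.12, 0.46, 0.59, 0.06],
    [0, 0.12, 0.09, 0, 0.63, 0, 0.36, 0, 0.27, 0.82],
]


def greedy_match(sim, rows, cols):
    """Pair each new node with a reference node, best scores first.

    Assumption: a greedy stand-in for a proper maximum-weight matching.
    """
    scored = [
        (score, r, c)
        for r, row in enumerate(sim)
        for c, score in enumerate(row)
        if score > 0
    ]
    scored.sort(key=lambda t: -t[0])
    used_rows, used_cols, match = set(), set(), {}
    for score, r, c in scored:
        if r not in used_rows and c not in used_cols:
            match[rows[r]] = cols[c]
            used_rows.add(r)
            used_cols.add(c)
    return match


match = greedy_match(SIM, ROWS, COLS)
for new_node, ref_node in match.items():
    print(new_node, "->", ref_node)
```

Each matched node of G’ would then start at the location of its partner in the reference placement, with unmatched nodes placed in free locations before the low-temperature annealing pass.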
  • 12. Motivating Example (Cont.)
      [Figure: Placement layouts comparison of circuit “des”: (a) placement of reference config, (b) initial placement using CS, (c) final placement using CS, (d) initial placement using VPR, (e) final placement using VPR]
      • A real example
          • For circuit “des”, the reference configuration (synthesized using the “resyn3” script in ABC) has 1245 CLBs and 1501 nets, while the new configuration (synthesized using the “rwsat2” script in ABC) has 1215 CLBs and 1471 nets.
          • The results show that CSBP successfully finds the internal node correspondence.

        Status of placement results of circuit “des”

                    Wire   Delay (E-05)   Critical delay (E-08)   Runtime (s)
        CS-init     306    5.93           -                       -
        VPR-init    1087   14.00          -                       -
        CS-final    237    5.08           8.28                    13.38
        VPR-final   221    4.98           10.10                   28.42
  • 13. Circuit Similarity CAD Flow
      [Figures: CAD flow for incremental design; CAD flow for design space exploration]
  • 14. Circuit Similarity Algorithm
      • Iterative similarity algorithm
          • We employ the iterative similarity algorithm for undirected molecular graphs [Rupp’07].
          • We adapt the iterative similarity algorithm to consider directed circuit graphs, fix the I/O pins, and compute the similarity of fanin and fanout nodes respectively, based on unique circuit constraints.
      [Equations and table “Summary of variables”: only the case condition survives in the transcript: if (|in(vi)| < |in(v’j)| and |out(vi)| < |out(v’j)|)]
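The adaptation described on this slide (directed graphs, fixed I/O pins, fanin and fanout scored separately) can be sketched as follows. This is an illustrative simplification, not the update rule of [Rupp’07] or of the slides, whose equations are not in the transcript: neighbor sets are scored by a best-partner average, and the function names, the 0.5/0.5 weighting and the toy circuits are all assumptions.

```python
def fanouts(fanins):
    """Invert a node -> fanin-list map into a node -> fanout-list map."""
    out = {v: [] for v in fanins}
    for v, ins in fanins.items():
        for u in ins:
            out[u].append(v)
    return out


def neighbor_score(a, b, sim):
    """Score neighbor list a (from G) against b (from G').

    Assumed rule: every node of the smaller list takes its best partner
    in the larger list; the sum is normalised by the larger size.
    """
    if not a and not b:
        return 1.0
    if not a or not b:
        return 0.0
    if len(a) <= len(b):
        total = sum(max(sim[(x, y)] for y in b) for x in a)
        return total / len(b)
    total = sum(max(sim[(x, y)] for x in a) for y in b)
    return total / len(a)


def circuit_similarity(fin_g, fin_h, pins, iters=10):
    """Iterate similarity scores for all (G node, G' node) pairs.

    I/O pins are fixed: a pin only matches the pin of the same name.
    """
    fout_g, fout_h = fanouts(fin_g), fanouts(fin_h)
    sim = {}
    for v in fin_g:
        for w in fin_h:
            if v in pins or w in pins:
                sim[(v, w)] = 1.0 if v == w else 0.0
            else:
                sim[(v, w)] = 1.0
    for _ in range(iters):
        new = dict(sim)
        for v in fin_g:
            for w in fin_h:
                if v in pins or w in pins:
                    continue  # pin scores never change
                s_in = neighbor_score(fin_g[v], fin_h[w], sim)
                s_out = neighbor_score(fout_g[v], fout_h[w], sim)
                new[(v, w)] = 0.5 * (s_in + s_out)
        sim = new
    return sim


# Toy circuits (hypothetical): inputs a, b; output o; two internal nodes each.
G = {"a": [], "b": [], "g1": ["a", "b"], "g2": ["g1", "b"], "o": ["g2"]}
H = {"a": [], "b": [], "h1": ["a", "b"], "h2": ["h1", "b"], "o": ["h2"]}
scores = circuit_similarity(G, H, pins={"a", "b", "o"})
print(scores[("g1", "h1")], scores[("g1", "h2")])
```

In this toy case the fixed pins anchor the iteration, so g1 scores higher against h1 than against h2, which is the kind of internal node correspondence the placement step relies on.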
  • 15. Performance Enhancement
      • Support constraint
          • A support of a node is the set of nodes with predefined matchings in the transitive fanin or fanout cone of this node.
          • Formally, if v ∈ G and v’ ∈ G’, the support constraint requires: [formula not captured in the transcript], where β ∈ (0,1].
      • Level constraint
          • A topological sort and a reverse topological sort can label each internal node with two values.
          • Formally, if v ∈ G and v’ ∈ G’, the level constraint requires: [formula not captured in the transcript], where Bl and Br are two nonnegative integers.
      [Figure: Effectiveness of the pruning techniques]
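The two labels used by the level constraint can be computed with a forward and a reverse topological pass, as sketched below. The pruning inequality itself is not in the transcript, so the form used here (level differences bounded by Bl and Br) is an assumption, as are the function names and the toy circuits.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def levels(preds):
    """Longest-path depth of each node, given a node -> predecessors map."""
    lvl = {}
    for v in TopologicalSorter(preds).static_order():  # predecessors first
        lvl[v] = 1 + max((lvl[u] for u in preds[v]), default=-1)
    return lvl


def reverse_levels(fanins):
    """Longest-path depth of each node measured from the outputs."""
    fanouts = {v: [] for v in fanins}
    for v, ins in fanins.items():
        for u in ins:
            fanouts[u].append(v)
    return levels(fanouts)


def level_candidates(fin_g, fin_h, b_l, b_r):
    """Keep only node pairs whose two labels are close enough.

    ASSUMED form of the constraint: |l(v) - l(v')| <= Bl and
    |r(v) - r(v')| <= Br. The slide's exact formula is not available.
    """
    lg, rg = levels(fin_g), reverse_levels(fin_g)
    lh, rh = levels(fin_h), reverse_levels(fin_h)
    return {
        (v, w)
        for v in fin_g
        for w in fin_h
        if abs(lg[v] - lh[w]) <= b_l and abs(rg[v] - rh[w]) <= b_r
    }


# Toy circuits (hypothetical), given as node -> fanin list.
G = {"a": [], "b": [], "g1": ["a", "b"], "g2": ["g1", "b"], "o": ["g2"]}
H = {"a": [], "b": [], "h1": ["a", "b"], "h2": ["h1", "b"], "o": ["h2"]}
strict = level_candidates(G, H, b_l=0, b_r=0)
loose = level_candidates(G, H, b_l=1, b_r=1)
print(len(strict), len(loose))
```

Tighter bounds leave fewer pairs to score in each similarity iteration, which is the runtime/quality trade-off the slide's pruning techniques target.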
  • 17. Incremental Design
      • CAD flow
          • Two-iteration CAD flow.
          • CSBP flow (a) and from-scratch flow (b) are compared.
          • Optimization “imfs” reduces the number of CLBs by 2%.
      • Settings
          • Two versions of CSBP are compared: a high-quality version (CS) with β = 0.5, inner_num = 1 and Bl = Br = 1; and a turbo version (CS-t) with β = 1, inner_num = 0.1 and Bl = Br = 0.
          • CSBP is implemented in C and evaluated on the 20 largest MCNC benchmarks.
          • The results are averaged over 5 runs on a Linux server with a dual-core 2.19GHz CPU and 5GB memory.
          • The CS2 package [Goldberg’97] is used for the maximum matching problem.
      [Figure: CAD flow for incremental design]
  • 18. Results
      • Initial placement results
          • Bounding box cost (bb cost) and delay cost are compared.
          • The initial placement results generated using CS are much better than VPR’s initial results, and are very close to VPR’s final results.
      [Figure: Comparisons of initial bb cost (CS-init, VPR-final, VPR-init; percentage per benchmark). CS reduces bb cost by 72% on avg. compared to VPR.]
      [Figure: Comparisons of initial delay cost (CS-init, VPR-final, VPR-init; percentage per benchmark). CS reduces delay cost by 53% on avg. compared to VPR.]
      Benchmarks on the x-axes: alu4, apex2, apex4, bigkey, clma, des, diffeq, dsip, elliptic, ex1010, ex5p, frisc, misex3, pdc, s298, s38417, s38584, seq, spla, tseng.
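For readers unfamiliar with the bb cost metric compared above, the sketch below computes a plain half-perimeter bounding-box cost for a placement. It is a simplification: the cost function in VPR additionally scales each net by a correction factor that depends on the number of terminals, which is omitted here, and the net and block names are made up for illustration.

```python
def bb_cost(nets, place):
    """Sum of half-perimeter bounding boxes over all nets.

    nets:  list of nets, each a list of block names
    place: block name -> (x, y) grid location
    Simplification: no per-net correction factor is applied.
    """
    total = 0
    for net in nets:
        xs = [place[b][0] for b in net]
        ys = [place[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


# Hypothetical four-block example.
nets = [["a", "b", "c"], ["b", "d"]]
place = {"a": (0, 0), "b": (2, 1), "c": (1, 3), "d": (2, 4)}
cost = bb_cost(nets, place)
print(cost)
```

A good initial placement keeps connected blocks close together, so its bounding boxes, and hence this cost, start out small before any annealing is done.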
  • 19. Results (Cont.)
      • Post-routing results comparison
          • A low-temperature annealing is applied to the initial results.
          • Wire length, critical delay and area are compared.
          • The results demonstrate the effectiveness of the pruning techniques, which do not affect the quality significantly.
      [Figure: Wire length (CS-t, CS, VPR; per benchmark). CS increases the wire length by 3% on avg.]
      [Figure: Area (CS-t, CS, VPR; per benchmark). CS increases the area by 2% on avg.]
      [Figure: Critical delay (CS-t, CS, VPR; per benchmark). CS increases the crit. delay by 6% on avg.]
  • 20. Results (Cont.)
      • Runtime comparison
          • Only placement time is compared.
          • CS-t achieves 31x speedup on average, with up to 91x.
          • More speedup is expected when circuits become larger.
      [Figure: Speedups compared to VPR (CS-t, CS, VPR)]
  • 21. Design Space Exploration
      • CAD flow
          • Study logic-level and algorithm-level design space, respectively.
          • CSBP flow (a) and from-scratch flow (b) are compared.
      • Settings
          • The logic-level design space consists of 19 configurations generated by 19 ABC [1] synthesis scripts in abc.rc.
          • The algorithm-level design space consists of 18 configurations of a constant multiplier generated by CMU SPIRAL [Puschel’04], varying bits from 7 to 25 [2].
          • Both CS and CS-t are evaluated.
          • The benchmarking environments are the same as logic-level design space exploration.
      [Figure: CAD flow for design space exploration]
      [1] http://www.eecs.berkeley.edu/~alanmi/abc/
      [2] Bit = 16 is abandoned due to ABC crash
  • 22. Logic-level Sample Synthesis Scripts

        Alias      Script
        resyn      "b; rw; rwz; b; rwz; b"
        resyn2     "b; rw; rf; b; rw; rwz; b; rfz; rwz; b"
        resyn2a    "b; rw; b; rw; rwz; b; rwz; b"
        src_rw     "st; rw -l; rwz -l; rwz -l"
        src_rs     "st; rs -K 6 -N 2 -l; rs -K 9 -N 2 -l; rs -K 12 -N 2 -l"
        choice     "fraig_store; resyn; fraig_store; resyn2; fraig_store; fraig_restore"
        rwsat      "st; rw -l; b -l; rw -l; rf -l"
        compress   "b -l; rw -l; rwz -l; b -l; rwz -l; b -l"
        share      "st; multi -m; fx; resyn2"

      http://www.eecs.berkeley.edu/~alanmi/abc/
  • 23. Logic Level Results
      • Initial results comparison
          • The number of CLBs and levels vary widely in the logic-level design space.
          • Circuit “dsip” is shown as an example.
          • Bounding box cost and delay cost are compared for initial placement results.
      [Figure: Characteristics of logic-level design space]
      [Figure: Initial bb cost of “dsip” (CS, CS-t, VPR; per synthesis script). CS reduces bb cost by 76% on avg.]
      [Figure: Initial delay cost of “dsip” (CS, CS-t, VPR; per synthesis script). CS reduces delay cost by 48% on avg.]
      Synthesis scripts on the x-axes: choice, choice2, compress, compress2, compress2rs, compress2rsdc, resyn, resyn2, resyn2a, resyn2rs, resyn2rsdc, resyn3, rwsat, rwsat2, shake, share, src_rs, src_rw, src_rws.
  • 24. Logic Level Results (Cont.)
      • Final placement results
          • Wire length and critical delay of circuit “dsip” are compared.
          • The final results produced by CS and CS-t are very close to or better than VPR’s, with 32% overhead for wire length and 20% improvement for critical delay.
      [Figure: Final wire length comparison of “dsip” (CS-t, CS, VPR; percentage per synthesis script)]
      [Figure: Final critical delay comparison of “dsip” (CS-t, CS, VPR; percentage per synthesis script)]
  • 25. Logic Level Results (Cont.)
      • Design space shape characterization
          • We compare the minimal, median and maximal wire length and critical delay produced by CS, CS-t and VPR.
          • We also compare the shapes of each configuration over 19 designs.
          • The almost identical curves show that CSBP is able to accurately depict the shape of a design space.
      [Figure: Shape of final wire length of circuit “dsip” (vpr, cs, cs-t; per synthesis script)]
      [Figure: Shape of minimal wire length of 20 circuits over 19 designs (vpr-min, cs-min, cs-t-min)]
      [Figure: Shape of minimal crit. delay of 20 circuits over 19 designs (vpr-min, cs-min, cs-t-min)]
  • 26. Logic Level Results (Cont.)
      • Runtime comparison
          • Only placement time is compared.
          • CS-t achieves 30x speedup on average, with up to 100x.
          • In practice, one can take advantage of the significant speedup of CS-t to perform quick design space exploration.
      [Table: Runtime comparison (“*” marked time is measured with a timeout)]
      [Figure: Speedups compared to VPR (CS, CS-t, VPR; per benchmark)]
  • 27. Algorithm Level Results
      • Experimental settings
          • The algorithm-level design is a constant multiplier.
          • The design parameter explored in our experiments is the number of fractional bits, varying from 7 to 25 [1].
          • CMU SPIRAL is used to generate the RTL design based on the Hcub algorithm [Voronenko’07].
      [Figure: Characteristics of algorithm-level design space generated by CMU SPIRAL]
      • Experimental results
          • The initial and final placement results are similar to those of logic-level design space exploration.
          • CS and CS-t achieve 7x and 30x speedup compared to VPR, respectively.
      [Figure: An example of a constant parallel multiplier]
      [1] Bit = 16 is abandoned due to ABC crash
  • 28. Algorithm Level Results (Cont.)
      • Wire length-delay space comparison
          • The pareto-points, which are the optimal configurations in a design space, are of most interest to IC designers.
          • CS and VPR find the same pareto-points.
          • Bits = 24 is used as the reference circuit.
      [Figure: Wire length-delay space of VPR (x-axis: wire length; y-axis: estimated critical delay; points labeled by bit width, e.g. B7, B25)]
      [Figure: Wire length-delay space of CS (x-axis: wire length; y-axis: estimated critical delay; points labeled by bit width, e.g. B7, B25)]
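The pareto-points mentioned on this slide are the configurations that no other configuration beats on both wire length and critical delay. A minimal sketch of that filter follows; the function name is hypothetical and the numbers are made-up placeholders, not values read from the plots.

```python
def pareto_points(designs):
    """Return the names of designs not dominated by any other design.

    designs: name -> (wire_length, critical_delay), both to be minimised.
    A design is dominated if another one is no worse on both metrics
    and strictly better on at least one.
    """
    front = []
    for name, (w, d) in designs.items():
        dominated = any(
            w2 <= w and d2 <= d and (w2 < w or d2 < d)
            for other, (w2, d2) in designs.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical (wire length, delay) values for four configurations.
designs = {
    "B7": (100, 1.6e-07),
    "B8": (120, 1.9e-07),   # worse than B7 on both metrics
    "B9": (90, 1.8e-07),    # shorter wires than B7 but slower
    "B10": (150, 1.5e-07),  # longer wires than B7 but faster
}
front = pareto_points(designs)
print(front)
```

Running the same filter on the CS results and on the VPR results and comparing the two sets is one way to check the slide's claim that both flows identify the same pareto-points.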
  • 30. Future Work
      • Improvement to CSBP
          • Integrate predefined matchings, for example naming matching, into CSBP to further enhance both the efficiency and the quality of the design.
      • Other applications
          • Study the effectiveness of applying the circuit similarity algorithm to other applications, such as routing and sequential verification for FPGAs.
  • 31. Conclusion
      • Proposed an efficient circuit similarity algorithm
          • Developed CSBP, a fast circuit similarity-based placement for FPGAs.
          • Applied CSBP to incremental design and design space exploration.
          • Open-source tool available at: http://webdocs.cs.ualberta.ca/~xshi/soft.html
      • Applied CSBP to incremental design for FPGAs
          • CSBP is able to reduce engineering effort by capturing the similarity from the previous design iterations.
          • CSBP is 31x faster compared to VPR.
      • Applied CSBP to design space exploration for FPGAs
          • CSBP can precisely depict the shape of a design space and pinpoint the optimal designs.
          • CSBP is 30x faster compared to VPR.