VLSI Static Timing Analysis Setup And Hold Part 2

Static Timing Analysis, Part 2
Amr Adel Mohammady
/amradelm

Introduction
• In part 1 we went through the basic principles needed to understand all VLSI timing checks. In this part we will go through some of the checks in detail
• The timing checks covered in this part are:
o Setup timing
o Hold timing

Setup Timing Analysis

Setup Time
1. At time T = 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑒𝑑𝑔𝑒, data A is launched from FF1 to FF2. The data needs to make it to FF2 before the next clock edge arrives at FF2 at time 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑒𝑑𝑔𝑒. The next clock edge will arrive after a clock period.
   𝐿𝑎𝑢𝑛𝑐ℎ = 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑒𝑑𝑔𝑒
   𝐶𝑎𝑝𝑡𝑢𝑟𝑒 = 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑒𝑑𝑔𝑒
2. The clock takes some time to reach FF1 due to the buffers. The launch won’t happen exactly at T = 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑒𝑑𝑔𝑒 but after the delay/latency of the clock buffers.
   𝐿𝑎𝑢𝑛𝑐ℎ = 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑒𝑑𝑔𝑒 + 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑙𝑎𝑡𝑒𝑛𝑐𝑦
3. As we saw in part 1, once the clock reaches the FF it takes some time to push the data out to the Q pin. We called this time 𝑇𝑐𝑞. This is the 1st delay data A encounters on its way to FF2.
   𝐿𝑎𝑢𝑛𝑐ℎ = 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑒𝑑𝑔𝑒 + 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑙𝑎𝑡𝑒𝑛𝑐𝑦
   𝐷𝑒𝑙𝑎𝑦 = 𝑇𝑐𝑞

Setup Time
4. Data A will propagate through the combinational path to reach FF2. This is the 2nd delay it encounters.
   𝐷𝑒𝑙𝑎𝑦 = 𝑇𝑐𝑞 + 𝑇𝑐𝑜𝑚𝑏
5. As we saw in part 1, the FF requires the data to arrive some time before the clock edge in order to avoid metastability. We called this time 𝑇𝑠𝑒𝑡𝑢𝑝. Hence, the data must be stable at FF2 not at 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑒𝑑𝑔𝑒 but at 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑒𝑑𝑔𝑒 − 𝑇𝑠𝑒𝑡𝑢𝑝.
   𝐶𝑎𝑝𝑡𝑢𝑟𝑒 = 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑒𝑑𝑔𝑒 − 𝑇𝑠𝑒𝑡𝑢𝑝
6. The clock takes some time to reach FF2 due to the buffers. The capture won’t happen exactly at 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑒𝑑𝑔𝑒 − 𝑇𝑠𝑒𝑡𝑢𝑝 but after the delay/latency of the clock buffers.
   𝐶𝑎𝑝𝑡𝑢𝑟𝑒 = 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑒𝑑𝑔𝑒 − 𝑇𝑠𝑒𝑡𝑢𝑝 + 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑙𝑎𝑡𝑒𝑛𝑐𝑦

Setup Time
7. To make sure a setup violation doesn’t happen, data A must arrive at FF2 before the required capture time:
   𝐿𝑎𝑢𝑛𝑐ℎ + 𝐷𝑒𝑙𝑎𝑦 ≤ 𝐶𝑎𝑝𝑡𝑢𝑟𝑒
   𝐴𝑟𝑟𝑖𝑣𝑎𝑙 ≤ 𝑅𝑒𝑞𝑢𝑖𝑟𝑒𝑑
   𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑒𝑑𝑔𝑒 + 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑙𝑎𝑡𝑒𝑛𝑐𝑦 + 𝑇𝑐𝑞 + 𝑇𝑐𝑜𝑚𝑏 ≤ 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑒𝑑𝑔𝑒 − 𝑇𝑠𝑒𝑡𝑢𝑝 + 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑙𝑎𝑡𝑒𝑛𝑐𝑦
   The left-hand side is the point at which data arrives at FF2. The right-hand side is the point before which data is required to arrive at FF2.
The difference between the required and arrival times (required minus arrival) is called the slack. If the slack is positive we pass setup, and if it is negative we fail.
The launch FF is called the startpoint of the timing path and the capture FF is called the endpoint.
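The arrival/required/slack bookkeeping of steps 1 to 7 can be sketched in a few lines of Python. This is only an illustration: the class name `SetupPath` and all the numbers (in picoseconds) are made up for the example, and 𝑇𝑐𝑞 is counted as part of the data delay, as in step 4.

```python
from dataclasses import dataclass


@dataclass
class SetupPath:
    """One register-to-register path; all times in picoseconds (hypothetical values)."""
    launch_edge: int
    launch_latency: int
    t_cq: int
    t_comb: int
    capture_edge: int
    capture_latency: int
    t_setup: int

    def arrival(self) -> int:
        # Launch + Delay: when the data reaches the D pin of the capture FF
        return self.launch_edge + self.launch_latency + self.t_cq + self.t_comb

    def required(self) -> int:
        # Capture: the latest time the data may arrive
        return self.capture_edge + self.capture_latency - self.t_setup

    def slack(self) -> int:
        # Positive slack passes setup, negative slack fails
        return self.required() - self.arrival()


path = SetupPath(launch_edge=0, launch_latency=200, t_cq=50, t_comb=600,
                 capture_edge=1000, capture_latency=250, t_setup=40)
print("arrival :", path.arrival())
print("required:", path.required())
print("slack   :", path.slack())
```

With these example numbers the path passes setup. Raising `t_comb` or `launch_latency` eats into the slack, while raising `capture_edge` or `capture_latency` adds to it.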

Example Timing Report
[Figure: example setup timing report with the 𝐿𝑎𝑢𝑛𝑐ℎ, 𝐷𝑒𝑙𝑎𝑦 and 𝐶𝑎𝑝𝑡𝑢𝑟𝑒 sections marked, and arrows pointing at 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑒𝑑𝑔𝑒, 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑙𝑎𝑡𝑒𝑛𝑐𝑦, 𝑇𝑐𝑞, 𝑇𝑐𝑜𝑚𝑏, 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑒𝑑𝑔𝑒, 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑙𝑎𝑡𝑒𝑛𝑐𝑦 and 𝑇𝑠𝑒𝑡𝑢𝑝]
Reference : AN 554: How to Read HardCopy PrimeTime Timing Reports, by Intel

Setup Time
• The example we have shown is for a full cycle path where 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑒𝑑𝑔𝑒 comes one clock cycle after 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑒𝑑𝑔𝑒.
• This is not always the case. The capture edge could come half a cycle later, multiple cycles later, or from another clock.
o Half cycle paths occur when the launch and capture FFs use different clock edges (one rising, one falling)
o Multi cycle paths occur when the first capture edge is masked by a control circuit and a later edge is used.
o Multi clock paths occur when the launch and capture FFs use different clocks. The diagram shows that there can be more than one launch/capture edge combination. The STA tool will consider the worst case (the purple one) [1]
• Everything we learned still applies. We just plug different values for the clock edges into the setup equation:
𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑒𝑑𝑔𝑒 + 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑙𝑎𝑡𝑒𝑛𝑐𝑦 + 𝑇𝑐𝑞 + 𝑇𝑐𝑜𝑚𝑏 ≤ 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑒𝑑𝑔𝑒 − 𝑇𝑠𝑒𝑡𝑢𝑝 + 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑙𝑎𝑡𝑒𝑛𝑐𝑦
• We will now discuss how to fix a setup violation
[Figure: launch and capture waveforms for a half cycle path, a multi cycle path (first capture edge masked with control logic) and a multi clock path]
[1] The phase difference between the two clocks must be known in order to know exactly where the launch and capture edges are. If it is not, we can’t run STA on such paths and have to resort to clock domain crossing techniques.
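As a rough illustration of "the worst case" for a multi clock path, the sketch below enumerates launch edges over one common period of the two clocks and, for each one, picks the first capture edge that comes strictly after it. The pair with the smallest separation is the tightest setup requirement. The function name `worst_setup_edges` is hypothetical, only rising edges are modelled, times are integers in arbitrary units, and `capture_offset` is assumed to lie between 0 and one capture period. Real STA tools also apply clock constraints and exceptions that are not modelled here.

```python
from math import gcd


def worst_setup_edges(launch_period, capture_period, capture_offset=0):
    """Return the (launch_edge, capture_edge) pair with the smallest positive separation."""
    # Edge patterns repeat every least common multiple of the two periods
    hyper = launch_period * capture_period // gcd(launch_period, capture_period)
    capture_edges = range(capture_offset, 2 * hyper + capture_period, capture_period)
    best = None
    for launch in range(0, hyper, launch_period):
        # The setup check uses the first capture edge after the launch edge
        capture = next(c for c in capture_edges if c > launch)
        if best is None or capture - launch < best[1] - best[0]:
            best = (launch, capture)
    return best


print(worst_setup_edges(1000, 1000))       # same clock: full cycle path
print(worst_setup_edges(1000, 1000, 500))  # capture edge shifted by half a period
print(worst_setup_edges(6, 4))             # two unrelated periods
```

The chosen pair supplies the values of 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑒𝑑𝑔𝑒 and 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑒𝑑𝑔𝑒 that go into the setup equation.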

Overview of The Digital VLSI Flow
• Before we discuss how to fix a setup violation we need to have a quick overview of the digital design flow [1].
• Specifications: The design process starts with the requirements to build the system (Functionality, Performance, Power consumption, Cost
etc)
• Architecture : Based on the required specs, the architecture team will start building the system. They will answer questions and make
decisions such as: What blocks are needed in the system to perform the functionality? How to implement these blocks as a digital circuit?
Do we need memory or not? What is its size? What operating voltage do we use? What clock frequency do we need to meet the
performance specs? What is the expected area of the chip and fabrication cost?
• RTL Design : The RTL team will start writing RTL code to implement the architecture and blocks of the system
• Simulation : The implemented design is tested through simulations to make sure it does the required function correctly
• Synthesis : The RTL code is translated into actual logic gates and digital blocks
• PNR : The place and route step involves several sub steps
o Floorplan : Involves allocating space on the chip for various blocks and modules, including the placement of macros, and I/O ports
o Power Grid : Creating the metal structure that delivers the power supply to the standard cells and blocks inside the chip
o Placements : Placing the cells inside the chip
o Clock Tree Synthesis : Creating the clock networks to deliver the clocks from the ports to the registers in the chip
o Routing : Routing the metal interconnects (wires) between the cells
o Timing Closure : Running STA on the chip to make sure it meets the timing requirements
o DRC/LVS/EMIR : DRC ensures the final layout is compliant with the manufacturing rules. LVS ensures the final layout performs the same function as the schematic/logic description. EMIR ensures all cells get the required voltage without excessive drop and that the current flowing through the wires is within the required limits
[1] This is a very simplified view of the digital flow. There are more steps involved but we don’t mention them because they won’t affect STA

How to Fix a Setup Violation – Overview of The Digital VLSI Flow
• PNR :
o The PNR engineer starts the flow trying to meet the requirements with the help of automation tools
o The goal is to reach a good startpoint with good results before the manual work starts
o Once the manual work starts, the startpoint is saved and frozen. The PNR flow is said to be in ECO mode (Engineering Change Order)
o The manual work involves things like moving cells, changing their threshold voltage, manually routing wires, etc.
• Fixing timing:
o Each of the above flow steps involves several optimizations to enhance the timing and fix the violations
o The earlier steps solve larger timing violations that are difficult and sometimes impossible to fix in later stages
o As we go through the flow, the ability to fix large violations decreases and we become more focused on fixing small but tricky violations that involve lots of manual work.
• We will now go through some of the ways to fix a setup violation. We will start with the solutions done in the early stages and go down till
we see what can be done in later and final stages.

How to Fix a Setup Violation – Sol. 1
Reducing the Clock Frequency
• The easiest and simplest solution is to reduce the frequency (increase the period) of the clock, which moves the capture edge later and gives the data more time to arrive
• Doing this degrades the performance (data rate / CPU speed / operations per second / etc)
• The decision to reduce the clock frequency is left to the architecture team and can’t be made individually by RTL or PNR engineers
• Sometimes this solution is not acceptable because the product standard requires a specific data rate that needs to be met

Going To a Smaller Technology Node
• In part 1 we showed how the transistor length (tech node) affects the gate delay. A shorter length has smaller delay
• Going for a smaller tech node means higher fabrication cost and a longer design cycle. On-chip variations (OCV) and the physical design rule constraints (DRC) are more challenging to handle in smaller tech nodes, and preparing the design files (standard cell libraries, etc) for the new tech node takes time.
• Because of this, the target tech node is decided very early in the design process by doing experiments with the tech node to see if the target frequency will
be feasible or not
• These experiments could be :
o Quick hand calculations : By considering the average cell delay in the tech node and the average combinational path length. For example, 𝑇𝑎𝑣𝑔 = 5𝑛𝑠 and the average number of cells in a timing path = 20. So, on average, the combinational delay = 5 ∗ 20 = 100𝑛𝑠, meaning a maximum clock frequency of 1 / (100 × 10⁻⁹ s) = 10 MHz.
This is, of course, a very rough estimation as it doesn’t include the effects of wire delay, clock latencies, etc. But the more effort you put into these calculations the more accurate they get
o Doing a quick project : By synthesizing a small block or a previous project to get an estimate of the maximum clock frequency you can achieve on this
tech node
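The quick hand calculation above is easy to script so that different average delays and path lengths can be tried. This is a minimal sketch: the function name is made up for the example, and it shares the same rough assumptions (no wire delay, clock latency, 𝑇𝑐𝑞 or 𝑇𝑠𝑒𝑡𝑢𝑝).

```python
def max_clock_frequency_hz(avg_cell_delay_s, cells_per_path):
    """Rough upper bound on the clock frequency from the average path delay."""
    t_comb = avg_cell_delay_s * cells_per_path  # average combinational delay
    return 1.0 / t_comb


f_max = max_clock_frequency_hz(avg_cell_delay_s=5e-9, cells_per_path=20)
print(f"Estimated maximum frequency: {f_max / 1e6:.1f} MHz")
```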

Increasing the Supply Voltage
• In part 1 we showed how the supply voltage affects the gate delay. A higher voltage gives a smaller delay. However, the dynamic power consumption increases quadratically with the supply voltage.
• The higher voltage could be applied only to the parts of the chip that need high performance, leaving the other parts at the lower voltage to avoid higher power consumption. However, this adds several difficulties to the ASIC design process
$$t_{prop} = \frac{0.69 \, V_{DD} \, C_L}{\frac{W}{2L} \, \mu C_{ox} \, (V_{DD} - V_{th})^2}$$
$$Power_{dynamic} = \alpha f C_L V_{DD}^2$$
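To get a feel for the trade-off, the two formulas can be evaluated at two supply voltages. The sketch below is illustrative only: the function names, the lumped factor `k` (standing for μ·C_ox·W/(2L)) and every numeric value are assumptions picked for the example, not data from a real library.

```python
def t_prop(vdd, vth, c_load, k):
    """Propagation delay from the formula above; k lumps mu * Cox * W / (2L)."""
    return 0.69 * vdd * c_load / (k * (vdd - vth) ** 2)


def p_dynamic(alpha, freq, c_load, vdd):
    """Dynamic power: alpha * f * C_L * VDD^2."""
    return alpha * freq * c_load * vdd ** 2


VTH, C_LOAD, K = 0.3, 1e-15, 1e-4   # hypothetical values
low, high = 0.8, 1.0                # two candidate supply voltages in volts

delay_ratio = t_prop(high, VTH, C_LOAD, K) / t_prop(low, VTH, C_LOAD, K)
power_ratio = p_dynamic(0.1, 1e9, C_LOAD, high) / p_dynamic(0.1, 1e9, C_LOAD, low)
print(f"Delay at {high} V is {delay_ratio:.2f}x the delay at {low} V")
print(f"Dynamic power at {high} V is {power_ratio:.2f}x the power at {low} V")
```

Under these assumptions the delay shrinks while the dynamic power grows with the square of the voltage ratio.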

Changing the Architecture
• Digital blocks have a tradeoff between speed on one side and power and area on the other. The designer might choose an implementation that consumes more power or has a larger area but a higher speed.
• For example, there are different ways to implement binary adders. One implementation is the ripple adder, which has small area and power consumption but a high 𝑇𝑐𝑜𝑚𝑏, while a carry-look-ahead (CLA) adder has a smaller 𝑇𝑐𝑜𝑚𝑏 but takes a larger area.
[Figure: comparison of adder implementations. Values that survive from the figure: 𝑇𝑐𝑜𝑚𝑏 = 700𝑝𝑠, 𝐴𝑟𝑒𝑎 = 75𝜇𝑚2, 𝐴𝑟𝑒𝑎 = 130𝜇𝑚2]
Reference : Kamanga, Isaack. Design Optimization of the 64-Bit Carry Look-Ahead Adder Based on FPGA and Verilog HDL

Optimizing the RTL Code
• The way the RTL is written affects the structure of the logic gates
• The example below shows 2 circuits that perform the same function. The one on the left creates the adders in a chain fashion, resulting in a delay of 3 adders in series, while the one on the right is built as a parallel tree and only has a delay of 2 adders in series
[Figure: chain and tree adder structures with 100𝑝𝑠 per adder. For the tree structure, 𝑇𝑜𝑡𝑎𝑙 𝑇𝑐𝑜𝑚𝑏 = 200𝑝𝑠]
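The gap between the two structures grows with the number of operands. A small sketch, assuming every adder has the same 100 ps delay and ignoring wire delay (the function names are made up for the example):

```python
from math import ceil, log2

ADDER_DELAY_PS = 100  # per-adder delay taken from the example


def chain_depth(n_operands):
    """Adders in series when the sum is built as ((a + b) + c) + d ..."""
    return n_operands - 1


def tree_depth(n_operands):
    """Adders in series when the sum is built as a balanced tree (a + b) + (c + d) ..."""
    return ceil(log2(n_operands))


for n in (4, 8, 16):
    print(f"{n} operands: chain {chain_depth(n) * ADDER_DELAY_PS} ps, "
          f"tree {tree_depth(n) * ADDER_DELAY_PS} ps")
```

For 4 operands this gives the 3-adder and 2-adder depths described above.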

Pipelining
• The most common way to fix setup in RTL design is to add pipeline registers.
• The idea of pipelining is to split a large 𝑇𝑐𝑜𝑚𝑏 into multiple clock cycles.
• For example, to implement the equation 𝐴 + 𝐵 ∗ 𝐶, one can do all the operations in one cycle or do the multiplication in one cycle then the addition in the next
cycle as shown in the diagram
• The disadvantages of pipelining are:
o More area due to the pipeline registers
o More latency. Instead of finishing the operation in one cycle we finish it in multiple cycles.
o Synchronization. Since the data is delayed by the pipeline registers, the downstream logic that receives the data has to account for this delay. Notice also how we needed to add a pipeline register on A as well to synchronize 𝐴1 with 𝐵1 ∗ 𝐶1. Otherwise we would have added 𝐴2 from the next sample to 𝐵1 ∗ 𝐶1
[Figure: 𝐴 + 𝐵 ∗ 𝐶 without pipelining (𝑇𝑎𝑑𝑑 + 𝑇𝑚𝑢𝑙 = 100 + 300 = 400 in one cycle) and with pipelining (𝑇𝑚𝑢𝑙 = 300 in one cycle, 𝑇𝑎𝑑𝑑 = 100 in the next)]
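The synchronization issue can be reproduced with a tiny cycle-by-cycle model of the pipelined 𝐴 + 𝐵 ∗ 𝐶. This is a behavioural sketch only, not RTL: the function `pipelined` and its `delay_a` switch are hypothetical names, and the sample values are arbitrary.

```python
def pipelined(samples, delay_a=True):
    """Two-stage A + B*C. Stage 1 registers B*C, stage 2 adds A one cycle later."""
    prod_reg = None   # pipeline register holding B*C
    a_reg = None      # pipeline register holding A (used only when delay_a is True)
    outputs = []
    # One extra empty cycle flushes the last sample out of the pipeline
    for a, b, c in samples + [(None, None, None)]:
        # Stage 2 works on what the registers captured in the previous cycle
        if prod_reg is not None:
            a_operand = a_reg if delay_a else a
            outputs.append(None if a_operand is None else a_operand + prod_reg)
        # Stage 1 results are captured at the clock edge
        prod_reg = None if b is None else b * c
        a_reg = a
    return outputs


samples = [(1, 2, 3), (10, 4, 5), (100, 6, 7)]
print("A pipelined    :", pipelined(samples, delay_a=True))
print("A not pipelined:", pipelined(samples, delay_a=False))
```

With the register on A, each output matches A + B*C of its own sample. Without it, the adder pairs the product of one sample with the A of the next sample, which is the mismatch described above.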

Multi Cycle Path (MCP)
• This method has some similarity to pipelining. Similarly, we will let the combinational path finish in multiple cycles.
• The difference is we won’t add pipeline registers. Instead, we will capture the data at another capture clock edge
• This can be done in 2 ways [1]:
o Use a control circuit to mask the 1st capture edge and allow another one.
o Use a divided clock for the capture FF as shown in the diagram below
[Figure: launch clock and capture clock waveforms for a single cycle path and for a multi cycle path, where the first capture edge is masked with control logic]
[1] You need to inform the STA tool that you will mask the 1st edge since the tool has no knowledge about the functionality of the circuit. This is done using the “set_multicycle_path” command. See https://docs.amd.com/r/2021.2-English/ug903-vivado-using-constraints/Multicycle-Paths

Multi Cycle Path vs Pipelining
• At first it might appear that multi cycle path and pipelining are the same, but a closer look shows a big difference
• In the case of pipelining:
o In the 1st cycle A,B,C enter the 1st stage of the pipeline. In the 2nd cycle A,B,C enter the 2nd stage while a new sample enters the 1st stage of the pipeline
o We receive an output every clock cycle, and the added latency due to the pipeline registers affects us at the beginning only
• In the case of MCP:
o In the 1st cycle A,B,C enter the circuit. In the 2nd cycle, the circuit is still busy and we can’t insert a new sample until it finishes.
o We receive an output every 2 clock cycles
• This shows that pipelining fixes setup while keeping a high processing speed, whereas MCP slows down the processing speed
• You can think of MCP as reducing the clock frequency selectively in parts of the circuit rather than on the entire circuit
[Figure: samples moving through the 1st, 2nd and 3rd cycles for pipelining and for a multi cycle path]

Retiming
• In this method, if 𝑇𝑐𝑜𝑚𝑏 is too large to fit in the clock cycle, we split the logic and move part of it to another cycle.
• Consider the example below:
o The red and green logic combined make a 𝑇𝑐𝑜𝑚𝑏 = 𝟕𝟎𝟎𝑝𝑠 which causes a setup violation.
o We move the green logic to the next clock cycle to be combined with the blue logic.
o This reduces 𝑇𝑐𝑜𝑚𝑏 between FF1 and FF2 to 𝟓𝟎𝟎𝑝𝑠 instead of 𝟕𝟎𝟎𝑝𝑠 which passes setup.
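o However, it increases 𝑇𝑐𝑜𝑚𝑏 between FF2 and FF3 to 𝟑𝟎𝟎𝑝𝑠 instead of 𝟏𝟎𝟎𝑝𝑠. This is okay because that path still passes setup. If the blue logic were big, this method wouldn’t work
[Figure: before retiming, the red (500𝑝𝑠) and green (200𝑝𝑠) logic sit between FF1 and FF2 for a total of 700𝑝𝑠, and the blue logic (100𝑝𝑠) sits between FF2 and FF3. After retiming, FF1 to FF2 is 500𝑝𝑠 and FF2 to FF3 is 300𝑝𝑠]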

Retiming
• Retiming can be done manually by the RTL designer or automatically by the synthesis tools
o In the example below, the purple logic takes as input A and B. If we move the green logic to the next cycle, we get B one cycle later than what was
expected. When we wait for this one cycle, 𝑨𝟏 will be gone and a new 𝑨𝟐 will arrive which will get computed with sample 𝑩𝟏. This will break the
functionality of the circuit
o Synthesis tools will avoid any retiming that breaks the functionality as this example did.
o The RTL designer has full control over the code, so they can fix this issue by, for example, adding a pipeline register on the A input of the purple logic to delay A by one cycle, and then handling any new issues that appear due to this added register
o Hence, the RTL designer can do more aggressive retiming than the synthesis tools, but with extra effort.
[Figure: in the 1st cycle the purple logic sees 𝑨𝟏 and 𝑩𝟏. After retiming, in the 2nd cycle it sees 𝑨𝟐 together with 𝑩𝟏]

Retiming + Pipelining
• The previous example shows how retiming can be combined with pipelining.
• Let’s consider the same example of 𝑨 + 𝑩 ∗ 𝑪
o We can move the adder to the next clock cycle if there is margin there.
o However, we get the same issue as in the previous slide: A is not synchronized with B*C. So we add a pipeline register on A.
o This way we fixed the setup violation and saved the area of the 𝐵 ∗ 𝐶 pipeline registers
[Figure: the circuit with pipelining only and with pipelining + retiming]

Optimizing Synthesis
• Synthesis tools have lots of features and switches that the engineer can use to enhance the timing and control the trade-offs between the PPA metrics.
• This topic is very large and needs a tutorial on its own, so we will demonstrate just a few of what can be done.
o Increase the timing effort : Most synthesis tools have switches that control the effort the tool will put into fixing a certain PPA metric or doing a certain optimization. Higher effort leads to better optimization but longer runtime, while lower effort leads to less optimization but shorter runtime.
o Decrease or disable area and power efforts : Area and power optimizations usually degrade the timing of the circuit. Reducing the effort of these optimizations or disabling them altogether may enhance the timing but worsen the area and power of your chip
o Enable Flattening : The RTL code consists of several modules connected to each other. By default synthesis tools will synthesize each module separately and then connect them together in the top module, thus preserving the hierarchy and boundaries between the modules. Another approach is to remove the module boundaries and put all cells in one hierarchy. This is called flattening and generally produces better timing results [1]
[Figure: the same design with no flattening and with flattening]
[1] Flattening makes verification more difficult because the module boundaries are removed, which makes tracing signals and referencing cells more difficult.

Applying False Paths in the Constraints
• False paths are timing paths that can’t possibly occur due to the logic of the circuit
• In the example, both muxes have the same select signal. This means we have 2 possible timing paths: the one going through both red logics (200 + 300 = 500𝑝𝑠) and the one going through both blue logics (100 + 500 = 600𝑝𝑠)
• The paths going through a red logic then a blue logic (200 + 500 = 700𝑝𝑠) or a blue logic then a red logic (100 + 300 = 400𝑝𝑠) can never happen.
• Unless we instruct the tool to ignore these false paths, they will be considered for timing analysis, and the large 𝑇𝑐𝑜𝑚𝑏 of the red to blue path will violate setup.
• If we don’t apply correct constraints on these paths, not only do we get fake setup violations, but we also hinder the ability of the synthesis and PnR tools to optimize the real violating timing paths. The tools apply their most aggressive optimizations to the worst (most critical) paths and won’t consider the less critical paths for these optimizations until the most critical ones are solved.
[Figure: two muxes sharing the select signal sel. The first mux chooses between a 200𝑝𝑠 red logic and a 100𝑝𝑠 blue logic, the second between a 300𝑝𝑠 red logic and a 500𝑝𝑠 blue logic. The possible paths are highlighted]
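The effect of the shared select signal can be checked by enumerating the paths. The sketch below uses the delays from the example. The dictionary layout and the function name `path_delays` are just one hypothetical way to model it.

```python
from itertools import product

STAGE1_PS = {"red": 200, "blue": 100}  # logic in front of the first mux
STAGE2_PS = {"red": 300, "blue": 500}  # logic in front of the second mux


def path_delays(shared_select):
    """Map (first colour, second colour) to the total delay of that path.

    With a shared select both muxes pick the same colour, so mixed paths are false.
    """
    delays = {}
    for first, second in product(STAGE1_PS, STAGE2_PS):
        if shared_select and first != second:
            continue  # false path: cannot be sensitized
        delays[(first, second)] = STAGE1_PS[first] + STAGE2_PS[second]
    return delays


print("Worst path if false paths are timed  :", max(path_delays(False).values()), "ps")
print("Worst path if false paths are ignored:", max(path_delays(True).values()), "ps")
```

Without the false path constraint the tool would chase a 700 ps path that can never be exercised. With it, the real worst path is 600 ps.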

Optimizing the Floorplan
• Floorplanning is the 1st step in the PNR flow and involves things like defining the chip size and boundaries, manually placing the major blocks (analog, SRAM, etc) in the chip, and placing the chip ports
• Here are some of the things that affect setup in the circuit
o A small chip area might cause the cells to get closer to each other and closer to the ports, which in turn will reduce the wire delays. However, if the size is too small several issues will appear such as big voltage drop, cell congestion, routing detours, crosstalk, etc [1].
o The placement of the major blocks in the chip affects the timing. The example on the left shows how placing the SRAMs near the IO ports might block the standard cells from being placed near their relevant ports. Not only that, but the SRAMs will block the routing, resulting in longer wires (and longer wire delays) to go around them.
o The placement of the ports also affects the timing. The example on the right shows how a bad placement of the ports can lead to long wire delays and buffering, which will worsen 𝑇𝑐𝑜𝑚𝑏
[Figure: block placement example (left) and port placement example (right)]
[1] We won’t discuss these issues because they are out of the scope of this document. You are advised to research these topics to get a better understanding of the slides

Optimizing the wire delay
• In part 1 we showed how a signal propagating through an RC circuit will have a delay proportional to the resistance and the capacitance. Hence, to reduce this delay we need to reduce the resistance and capacitance of the wire.
• This will also decrease the load cap of the cell that drives the wire, which will speed up the cell too.
• Reducing the resistance $R = \frac{\rho L}{A}$:
1. Reducing the length 𝑳 of the wire will reduce the delay. We showed some examples of how to reduce it using a better floorplan.
2. Increasing the width will decrease the delay. Higher metal layers have a larger default width and also a bigger thickness, hence a larger cross-section area 𝑨. PNR tools will use these higher layers for long and critical nets to reduce their delay. The PNR engineer can manually move the wires to higher layers during ECO or apply non-default routing rules (NDR) on these nets to make the router route them in higher layers
• Reducing the capacitance $C = \frac{\epsilon A}{d}$:
1. Increasing the spacing 𝒅 by moving the two wires away from each other will reduce the capacitance between them. We can apply NDR on specific nets to tell the router that we want no nets to get routed very close to these nets
2. Reducing the common distance. When two wires run alongside each other for a long distance the common area 𝑨 will be big, leading to a bigger capacitance. We can move one of the two wires to another layer to reduce the capacitance and hence the delay
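The two formulas can be tried out numerically to see how much a wider wire or a larger spacing helps. This is a simplified parallel-plate sketch: the function names are made up, the geometry values are hypothetical, the material constants are only roughly those of copper and silicon dioxide, and effects such as fringing capacitance are ignored.

```python
def wire_resistance(rho, length, width, thickness):
    """R = rho * L / A, where A = width * thickness is the wire cross-section."""
    return rho * length / (width * thickness)


def coupling_capacitance(eps, common_length, thickness, spacing):
    """C = eps * A / d, where A = common_length * thickness is the facing sidewall area."""
    return eps * common_length * thickness / spacing


RHO = 1.7e-8           # ohm*m, roughly copper (illustrative)
EPS = 3.9 * 8.85e-12   # F/m, roughly silicon dioxide (illustrative)

base_r = wire_resistance(RHO, length=1e-3, width=50e-9, thickness=100e-9)
wide_r = wire_resistance(RHO, length=1e-3, width=100e-9, thickness=100e-9)
base_c = coupling_capacitance(EPS, common_length=1e-3, thickness=100e-9, spacing=50e-9)
far_c = coupling_capacitance(EPS, common_length=1e-3, thickness=100e-9, spacing=100e-9)

print(f"R: {base_r:.0f} ohm -> {wide_r:.0f} ohm when the width doubles")
print(f"C: {base_c * 1e15:.1f} fF -> {far_c * 1e15:.1f} fF when the spacing doubles")
print(f"RC product shrinks by about {base_r * base_c / (wide_r * far_c):.1f}x")
```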

Relaxing the Power Grid
• The power grid is the metal connection that delivers the power from higher metal layers down to the standard cells
• We showed how the wire delay is affected by things like spacing and width. A wide and compact power grid will leave few routing resources for the signal nets, leaving no room for increasing their spacing or width.
• However, relaxing the power grid will increase the resistance of the power network, causing a bigger voltage drop. So the PNR designer has to trade off between enhancing timing and fixing voltage drop.
[Figure: a compact power grid (PG) and a relaxed power grid]

Upsizing
• We showed in part 1 how the MOSFET size affects the propagation delay of the cell. So to fix setup we can use larger cells that have less propagation delay
• There are several considerations when using this method:
o Bigger cells mean more area and power consumption
o Bigger cells have larger gate capacitance. This will slow down the cell that drives them because it now has a larger load capacitance. The gain from upsizing the cell should outweigh the slowdown of the driving cell.
o Since big cells consume more power they are likely to cause a big voltage drop on the cells around them.
o During ECO flow there might not be enough area to accommodate the bigger cell, which requires you to move the cells around it and then reroute the nets to their pins. Moving the cells and rerouting could worsen the timing for these cells
[Figure: effect of upsizing on the driver and load, before and after upsizing, with the delay of each cell annotated. The big gate cap not only increased the delay of the driver but also caused a large output transition time. The large transition time led to a larger delay for the 2nd buffer]
/amradelm
• Similarly, the threshold voltage 𝑉𝑡ℎ of the MOSFET affects the propagation delay of the cell. So to fix setup we can use low 𝑉𝑡ℎ cells that have less propagation delay, but this will increase the leakage power consumption.
• Synthesis and PNR tools allow you to apply a limit on the percentage of low 𝑉𝑡ℎ cells in your chip. Relaxing this limit will lead to better overall timing [1].
• The gain from changing the flavor (threshold) is usually less than that of upsizing the cell. However, changing the cell flavor won’t increase the cell area, hence no moving of the cells or rerouting is required. This is why changing the flavor is the first go-to method for PNR engineers during ECO.
[Figure: a timing path before and after changing the cell flavor]
[1] You need to be careful when relaxing the limit because the tool might resort to the easy solution of using low 𝑉𝑡ℎ cells and ignore other optimizations in the logic and wire delay, leaving you with big leakage power consumption.

Increasing the Driving Strength
• When we discussed upsizing we showed that when a cell drives a large load capacitance its output transition time gets slower, which in turn will slow down the load cells. Increasing the driver strength will improve the transition time, which in turn will improve the delay of the load cells
• There are several ways to enhance the driving strength
o Upsizing the driver cell : Bigger cells produce larger current and hence charge the load capacitance faster. This method combines the benefit of speeding up the driver by upsizing and the benefit of speeding up the load cells because they see a better input transition time.
o Downsizing the load cells : This will decrease the load capacitance of the driver, which will improve its propagation and transition times, which in turn will speed up the load cells. However, smaller cells have larger delay, so for this method to work the gain from enhancing the driving strength should outweigh the increase in delay due to downsizing
o Fanout splitting : Instead of one cell driving all the fanout we can duplicate the driver and split the fanout among the copies as shown in the diagram. But note that the driver of the driver now sees double the load cap, which increases its delay. So you have to balance things to make the overall gain outweigh the increase in delay
o Side load isolation : Add a small buffer that isolates a large load from the driver. In the example shown, the driver now sees the small cap of the buffer instead of the large cap of the large NAND. This will fix the green paths but will worsen the red path because the small buffer adds a delay that increases the overall delay of the red path. For this method to work, the red path should be passing the setup check with a good margin to accommodate the increase in delay
[Figure: the original circuit, followed by upsizing the driver, downsizing the load, fanout splitting and side load isolation]

Breaking up Long Nets
• When a cell drives a very long wire with big capacitance it will have bad propagation and transition times. If we break the long wire with buffers, the overall improvement can outweigh the delay of the added buffers
• If the wire is very long we can split it with an inverter pair instead of a buffer. This is better because the delay of an inverter is less than that of a buffer of the same size [1]. This way we get more cuts in the wire (less load cap for each cell) with roughly the same delay as the added buffer
[Figure: a long net driven directly, split with a buffer, and split with an inverter pair, with the delay of each cell and wire segment annotated]
[1] Buffers are basically 2 inverters connected in series

Register Duplication
• By duplicating registers, the timing paths can be shortened, reducing the wire and cell propagation delays.
• Consider the example on the right :
o By duplicating the green register we managed to move each copy near one of the blue registers
o This first reduces the wire length between the green and blue registers, and second allows us to remove the buffers and inverter pairs on the nets. Both reduce the total combinational delay
o This shows that this method becomes more useful when the capture registers (the blue ones) are placed far away from each other in the chip.
o However, FF1 now drives double the fanout, so the delay of the timing path between FF1 and FF2 is increased. We need to make sure this increase doesn’t cause the path to violate setup timing.
• Duplication can be done manually in the RTL or automatically by the synthesis and PnR tools.
[Figure: the circuit before duplication and after duplication]
More details : https://community.intel.com/t5/FPGA-Wiki/Register-Duplication-for-Timing-Closure/ta-p/735917

Reducing Crosstalk
• When we discussed wire delays we showed that there is a capacitance between any two wires close to each other. This capacitance is called the coupling capacitance.
• When one of the two wires switches from 0->1 or 1->0, the coupling capacitance pulls the other wire in the same direction. We call the first wire the aggressor and the second the victim.
• If the victim is switching in the same direction as the aggressor at the same time, the aggressor speeds up the victim’s transition. This reduces the propagation delay of the victim.
• If the victim is switching in the opposite direction to the aggressor, the aggressor slows down the victim’s transition, which increases the propagation delay of the victim and therefore increases 𝑇𝑐𝑜𝑚𝑏.
• To decrease the effect of crosstalk and speed up the cell delay:
o Reduce the coupling capacitance by increasing the spacing between the wires. This combines the effect of wire delay optimizations and reducing crosstalk.
o Shielding the wires of the victim net with VSS wires will block the crosstalk.
o Downsizing the aggressor cell will reduce its effect on the victim.
o Upsizing the driver of the victim will help it overcome the aggressor’s effect.
[Figure: aggressor and victim nets with the driver of the victim, showing rising and falling victim transitions with and without the crosstalk effect]

Reducing Crosstalk
• The image below shows an aggressor switching from 0->1 vs the victim transition
• We can see that the stronger the driver, the less the effect of the crosstalk.
Reference : CMOS VLSI Design - https://pages.hmc.edu/harris/cmosvlsi/4e/index.html

Local Skew
• So far we have been discussing methods that reduce 𝑇𝑐𝑜𝑚𝑏. Now we will consider the launch and capture latencies 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑙𝑎𝑡𝑒𝑛𝑐𝑦 & 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑙𝑎𝑡𝑒𝑛𝑐𝑦
• From the setup equation we can see that decreasing 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑙𝑎𝑡𝑒𝑛𝑐𝑦 or increasing 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑙𝑎𝑡𝑒𝑛𝑐𝑦 will enhance the setup. The difference
𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑙𝑎𝑡𝑒𝑛𝑐𝑦 − 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑙𝑎𝑡𝑒𝑛𝑐𝑦 is called the skew and to fix a setup violation we can increase the skew
• To decrease the launch latency we can use any of the methods we discussed such as upsizing, changing flavor, etc
• To increase the capture latency we can use the opposite of the methods we discussed such as downsizing, changing flavor to high 𝑉𝑡ℎ, etc or by
adding buffers.
• Changing the skew to fix a timing path will affect the previous and next paths:
o The launch FF of the current timing path is the capture FF of the previous one. So if you decrease the launch latency to fix the current path, you also decrease the capture latency of the previous one, which might cause it to violate setup. The same applies to the next path: its launch FF is the capture FF of the current path.
o In other words, you are borrowing some of the positive slack from the previous and next paths.
o That’s why, before changing the skew, you have to check whether the previous and next paths are passing timing with a good margin
[Figure: three consecutive paths (previous, current, next). The FF that launches the current path captures the previous path, and the FF that captures the current path launches the next path]
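The slack borrowing can be seen with a small numeric sketch of a chain of four FFs. All names and values here are hypothetical (times in picoseconds), every path is assumed to be a full cycle path, and the slack expression is the setup equation with 𝑇𝑐𝑞 included in the data delay.

```python
PERIOD, T_CQ, T_SETUP = 1000, 50, 40  # hypothetical values in ps


def setup_slacks(latency, t_comb):
    """Setup slack of each path in a chain of FFs.

    latency[i] is the clock latency of FF i and t_comb[i] is the logic delay
    between FF i and FF i+1.
    """
    return [
        (PERIOD + latency[i + 1] - T_SETUP) - (latency[i] + T_CQ + t_comb[i])
        for i in range(len(t_comb))
    ]


latency = [200, 200, 200, 200]
t_comb = [700, 950, 800]            # previous, current and next path

before = setup_slacks(latency, t_comb)
latency[2] += 60                    # delay the clock of the current path's capture FF
after = setup_slacks(latency, t_comb)

print("slack before:", before)
print("slack after :", after)
```

In this example the current path gains 60 ps and now passes, while the next path loses the same 60 ps because its launch latency went up. The fix only works because the next path had enough margin to lend.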

Local Skew
• In general, increasing the delay is a lot easier than decreasing it because we can simply add buffers. That’s why ASIC engineers and PNR tools tend to focus
on increasing the capture latency instead of decreasing the launch latency.
• Another reason why increasing the capture latency is more favored :
o When the PnR tool builds the clock tree network, usually multiple FFs are driven by the same clock buffer. If we try to modify the launch latency network to fix one timing path we will affect the other timing paths that use the same clock buffer [1]
o This is not the case for the capture clock network because we can add a buffer just in front of the clock pin of the FF without affecting the rest of the FFs
[Figure: the original clock tree, decreasing the launch latency (all blue FFs are affected), and increasing the capture latency (only the 1st blue FF is affected)]
[1] We don’t want to affect the latencies of other timing paths because this may cause them to violate hold. More on this when we discuss hold.

Hold Timing Analysis

Hold Time
1
The waveform below shows the timing of two consecutive samples (A and B) going through the FFs.
To avoid metastability, A must be captured and then remain stable at FF2 for some time afterwards. We called this time the hold time 𝑇ℎ𝑜𝑙𝑑.
This means B must arrive only after A has been captured and its hold time has elapsed:
𝐿𝑎𝑢𝑛𝑐ℎ + 𝐷𝑒𝑙𝑎𝑦 ≥ 𝐶𝑎𝑝𝑡𝑢𝑟𝑒
𝐴𝑟𝑟𝑖𝑣𝑎𝑙 ≥ 𝑅𝑒𝑞𝑢𝑖𝑟𝑒𝑑
𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑒𝑑𝑔𝑒 + 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑙𝑎𝑡𝑒𝑛𝑐𝑦 + 𝑇𝑐𝑞 + 𝑇𝑐𝑜𝑚𝑏 ≥ 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑒𝑑𝑔𝑒 + 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑙𝑎𝑡𝑒𝑛𝑐𝑦 + 𝑇ℎ𝑜𝑙𝑑
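The hold check can be evaluated as a slack, arrival minus required. The snippet below is a minimal sketch, assuming 𝐷𝑒𝑙𝑎𝑦 = 𝑇𝑐𝑞 + 𝑇𝑐𝑜𝑚𝑏 as in the setup analysis. The function name `hold_slack` and all values (in ps) are hypothetical.

```python
def hold_slack(launch_edge, launch_lat, t_cq, t_comb,
               capture_edge, capture_lat, t_hold):
    """Hold slack in ps: a negative value means B arrives before A's hold window ends."""
    arrival = launch_edge + launch_lat + t_cq + t_comb   # Launch + Delay
    required = capture_edge + capture_lat + t_hold       # Capture
    return arrival - required

# Full cycle path: B is launched by the same clock edge that captures A,
# so both edge times are taken as 0 here.
violating = hold_slack(0, 200, 50, 30, 0, 250, 40)
passing = hold_slack(0, 200, 50, 60, 0, 250, 40)

print(violating)  # -10
print(passing)    # 20
```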
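[Figure: FF1 to FF2 waveform for samples A and B, with these annotations: "Data A arrived at FF2 at this point", "Data A is getting captured here", "Data B arrived at FF2 at this point" (𝑇𝑐𝑞 after the clock edge), and "Data A is required to be stable at FF2 till this time" (end of the 𝑇ℎ𝑜𝑙𝑑 window).]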

Hold Time
2
The example below violates this requirement because B arrives before A has remained stable for the required hold time.
A quick solution is to insert buffers in the combinational path to increase 𝑇𝑐𝑜𝑚𝑏 so that B arrives after the required hold time.
[Figure: FF1 to FF2 waveforms for two cases. Violation: data B arrives at FF2 (after 𝑇𝑐𝑞) before the time until which data A is required to be stable (end of 𝑇ℎ𝑜𝑙𝑑). Pass: the delay added by the buffers moves the arrival of B past the end of the 𝑇ℎ𝑜𝑙𝑑 window.]
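As a rough sizing exercise, the number of buffers to insert follows from the negative hold slack and the delay of one buffer. The helper `buffers_needed` and the numbers below are hypothetical, and the sketch assumes every inserted buffer adds the same delay.

```python
def buffers_needed(hold_slack_ps, buf_delay_ps):
    """Smallest number of identical buffers that brings the hold slack to >= 0."""
    if hold_slack_ps >= 0:
        return 0
    # Floor division of a negative slack, negated, is a ceiling division.
    return -(hold_slack_ps // buf_delay_ps)

hold_slack_ps = -70   # hypothetical violation
buf_delay_ps = 25     # hypothetical delay of one buffer
count = buffers_needed(hold_slack_ps, buf_delay_ps)
new_slack = hold_slack_ps + count * buf_delay_ps

print(count, new_slack)  # 3 5
```

The added delay also counts against the setup slack of the same path, so the setup margin should be checked before inserting the buffers.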

Hold Time
3
We also don’t want A to be captured by an earlier edge (labeled 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑒𝑑𝑔𝑒 − 1 in the waveform), as this would break the functionality.
𝐿𝑎𝑢𝑛𝑐ℎ + 𝐷𝑒𝑙𝑎𝑦 ≥ 𝐶𝑎𝑝𝑡𝑢𝑟𝑒
𝐴𝑟𝑟𝑖𝑣𝑎𝑙 ≥ 𝑅𝑒𝑞𝑢𝑖𝑟𝑒𝑑
[Figure: FF1 to FF2 waveform for sample A, with these annotations: "Data A arrived at FF2 at this point", "Data A should get captured here", and "Data A is required to arrive after this point [1]" (end of the 𝑇ℎ𝑜𝑙𝑑 window of the earlier edge).]
[1]: Not only does A need to arrive after the earlier edge, it also needs to arrive after the hold time of that edge, or it will cause metastability.

Example Timing Report
[Figure: example timing report annotated with the terms of the hold equation. 𝐿𝑎𝑢𝑛𝑐ℎ: 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑒𝑑𝑔𝑒, 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑙𝑎𝑡𝑒𝑛𝑐𝑦. 𝐷𝑒𝑙𝑎𝑦: 𝑇𝑐𝑞, 𝑇𝑐𝑜𝑚𝑏. 𝐶𝑎𝑝𝑡𝑢𝑟𝑒: 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑒𝑑𝑔𝑒, 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑙𝑎𝑡𝑒𝑛𝑐𝑦, 𝑇ℎ𝑜𝑙𝑑.]
Reference: Advanced HDL Synthesis and SOC Prototyping: RTL Design Using Verilog | SpringerLink

Hold Time
• Like setup, a hold timing path can be full cycle, half cycle, multicycle or multi clock.
• We consider the edge where A is captured and the edge where B (the next data) is launched, because B is what will overwrite A.
The red arrows in the waveforms show the launch-capture edges.
• If there are more launch-capture combinations, as in a multi clock path, the STA tool will consider
the worst of them.
• As with setup, we just plug different values for the clock edges into the hold equation; the concepts
remain unchanged [1].
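Plugging different edge values into the hold equation can be sketched as follows. This is an illustration under stated assumptions: a single clock with 50% duty cycle, zero clock latencies, and made-up cell delays in ps. The helper names are hypothetical. It also shows that hold on a half cycle path depends on the clock period, while hold on a full cycle path does not.

```python
T_CQ, T_COMB, T_HOLD = 50, 30, 40   # hypothetical values in ps

def hold_slack(launch_edge, capture_edge, launch_lat=0, capture_lat=0):
    """Hold slack: arrival of B minus the end of A's hold window (ps)."""
    arrival = launch_edge + launch_lat + T_CQ + T_COMB
    required = capture_edge + capture_lat + T_HOLD
    return arrival - required

def full_cycle(period):
    # A is captured at the rising edge at `period`; B is launched by that same edge.
    return hold_slack(launch_edge=period, capture_edge=period)

def half_cycle(period):
    # A is launched at the rising edge at 0 and captured at the falling edge at period/2.
    # B is launched at the next rising edge at `period`.
    return hold_slack(launch_edge=period, capture_edge=period // 2)

for period in (1000, 2000):
    print(period, full_cycle(period), half_cycle(period))
# 1000 40 540
# 2000 40 1040
```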
[Figure: waveforms for the different path types, including a Full Cycle Path, each marking "Launch of A", "Capture of A" and "Launch of B". One case shows two alternative capture edges: "Capture of A" OR "Another Capture of A".]
[1]: A common mistake is to say that hold is not affected by the clock period. This is only true for full cycle and multicycle paths, where the launch and capture edges occur at the same time. But since full cycle paths are the most common type of path and are also more susceptible to violations, engineers generalize and say hold is not affected by the clock period.

Hold Time
• We also don’t want A to be captured by an earlier edge.
• So we should also check hold between the launch of A and the capture edge that comes before A’s intended
capture edge.
• Now we know all the launch-capture combinations, and the tool will consider the worst of them [1].
[Figure: the same waveforms as on the previous slide, including a Full Cycle Path, each marking "Launch of A", "Capture of A" and "Launch of B". One case shows two alternative capture edges: "Capture of A" OR "Another Capture of A".]
[1]: Timing Analyzer Example: Clock Analysis Equations | Intel.

How to Fix a Hold Violation
• By comparing the setup equation with the hold equation, we find that fixing hold violations requires the opposite of the methods we discussed for setup.
• Instead of decreasing 𝑇𝑐𝑜𝑚𝑏, we try to increase it by adding buffers, increasing wire delay, downsizing, etc. And instead of increasing the capture latency
or decreasing the launch latency, we do the opposite.
• This shows that hold conflicts with setup, and fixing hold may worsen setup.
• We showed earlier that increasing delay is generally easier than decreasing it. This means that fixing hold is generally easier than fixing setup.
• This is why setup takes priority over hold. Hold is only considered in the PnR step, and fixing hold violations starts once all setup violations are fixed [1].
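Setup: 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑒𝑑𝑔𝑒 + 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑙𝑎𝑡𝑒𝑛𝑐𝑦 + 𝑇𝑐𝑞 + 𝑇𝑐𝑜𝑚𝑏 ≤ 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑒𝑑𝑔𝑒 + 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑙𝑎𝑡𝑒𝑛𝑐𝑦 − 𝑇𝑠𝑒𝑡𝑢𝑝
Hold: 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑒𝑑𝑔𝑒 + 𝑇𝑙𝑎𝑢𝑛𝑐ℎ_𝑙𝑎𝑡𝑒𝑛𝑐𝑦 + 𝑇𝑐𝑞 + 𝑇𝑐𝑜𝑚𝑏 ≥ 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑒𝑑𝑔𝑒 + 𝑇𝑐𝑎𝑝𝑡𝑢𝑟𝑒_𝑙𝑎𝑡𝑒𝑛𝑐𝑦 + 𝑇ℎ𝑜𝑙𝑑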
[1]: Hold is still monitored across the PnR stages. While we focus more on setup, we make sure hold is solvable and under control.
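The conflict between the two checks can be seen by evaluating both slacks for the same path. The sketch below assumes a full cycle path, with the setup check against the capture edge one period later and the hold check against the same edge. The function `slacks` and all values (in ps) are hypothetical.

```python
def slacks(period, launch_lat, capture_lat, t_cq, t_comb, t_setup, t_hold):
    """Return (setup slack, hold slack) of one full cycle path, in ps."""
    arrival = launch_lat + t_cq + t_comb
    setup = (period + capture_lat - t_setup) - arrival
    hold = arrival - (capture_lat + t_hold)
    return setup, hold

base = dict(period=1000, launch_lat=200, capture_lat=250,
            t_cq=50, t_comb=20, t_setup=30, t_hold=40)

original = slacks(**base)
more_comb_delay = slacks(**{**base, "t_comb": 50})          # add 30 ps of buffers
less_capture_lat = slacks(**{**base, "capture_lat": 220})   # capture clock 30 ps earlier

print(original)          # (950, -20)
print(more_comb_delay)   # (920, 10)
print(less_capture_lat)  # (920, 10)
```

In both fixes the hold slack improves by 30 ps and the setup slack drops by the same 30 ps.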

How to Fix a Hold Violation
o The STA engineer sees two violations, setup and hold, both with the same startpoint and endpoint. The engineer tries adding buffers in front of FF2 to fix
hold, but setup gets worse; then tries to fix setup by changing the cell flavor, but hold gets worse. It seems we have reached a dead end.
o If we investigate the violations in depth, we can see there are two paths: the upper long one, which violates setup, and the lower short one (blue), which violates
hold.
o So, to fix the setup violation we can change the flavor of the cells in the upper path, and to fix hold we can add buffers along the lower blue path.
• This example shows that some hold violations can be tricky and need a deep look into the timing path.
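The scenario can be reproduced with a small model in which setup is checked against the longest path between the two FFs and hold against the shortest. This is a sketch with hypothetical names and values (in ps), not a real timing report.

```python
PERIOD, LAUNCH_LAT, CAPTURE_LAT = 1000, 200, 250   # hypothetical values in ps
T_CQ, T_SETUP, T_HOLD = 50, 30, 40

def endpoint_slacks(path_delays):
    """Setup uses the longest FF1->FF2 path, hold uses the shortest one."""
    longest = max(path_delays.values())
    shortest = min(path_delays.values())
    setup = (PERIOD + CAPTURE_LAT - T_SETUP) - (LAUNCH_LAT + T_CQ + longest)
    hold = (LAUNCH_LAT + T_CQ + shortest) - (CAPTURE_LAT + T_HOLD)
    return setup, hold

original = {"upper": 1000, "lower": 20}
print(endpoint_slacks(original))        # (-30, -20)

# A buffer in front of FF2 sits on both paths: hold is fixed, setup gets worse.
shared_buffer = {name: delay + 30 for name, delay in original.items()}
print(endpoint_slacks(shared_buffer))   # (-60, 10)

# Targeted fix: faster cells on the upper path, buffers only on the lower path.
targeted = {"upper": 950, "lower": 50}
print(endpoint_slacks(targeted))        # (20, 10)
```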

Thank You!