An Area-efficient
Ternary CAM Design
using Floating Gate Transistors
Viacheslav Fedorov
Monther Abusultan
Sunil P. Khatri
Key Contributions
• First TCAM design using flash transistors
• 2 transistors per TCAM cell (17 for CMOS)
• 1 transistor per port cell (6 for CMOS)
• Layout and SPICE simulations
– 8 times more dense than CMOS TCAM
– 1.6x less power consumption
– Operates at today’s line rates (~ 400 Gb/s)
Outline
• Contribution
• Motivation
• TCAM operation
• Previous work
• Our approach
• Evaluation
• Conclusion
Motivation
• Internet backbone (core) operates at extreme
speeds
– 100s of Gb/s
• Fast IP routers crucial to sustain the internet
• Hardware Ternary Content-addressable Memory
used for core routers
– Enables lookup of IP addresses in parallel
– Increases routing speed dramatically
• Drawbacks: large area, high power consumption
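IP Routing
[Figure: a packet addressed "To: 01001" is forwarded through Router 1 and Router 2, which connect interfaces A, B, C, D and E. The figure also labels the addresses 01000 and 01001. Each router holds its own routing table:]
Address   Interface
01001     B
01010     C
01011     C

Address   Interface
01001     D
11000     E
11001     E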
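TCAM operation
• Ternary (entries can have “0”, “1” or “X”, where “X” is a don't-care bit)
• Content-addressable
[Figure: the key 01000 is looked up in a binary table and in the equivalent, shorter ternary table]
Binary entries:
Address   Interface
01000     A
01001     A
01010     A
01011     A
10000     B
Equivalent ternary entries:
Address   Interface
010XX     A
10000     B
• High-speed hardware-parallel lookups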
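The compression that a ternary entry provides can be illustrated with a minimal Python sketch. The helper names `expand` and `ternary_match` are hypothetical and used only for illustration; the sketch models the logical behaviour of an entry, not the circuit.

```python
from itertools import product


def expand(pattern: str) -> list[str]:
    """List every binary address covered by a ternary pattern ('X' = don't care)."""
    choices = [("0", "1") if c == "X" else (c,) for c in pattern]
    return ["".join(bits) for bits in product(*choices)]


def ternary_match(pattern: str, address: str) -> bool:
    """True if every non-X bit of the pattern equals the corresponding address bit."""
    return len(pattern) == len(address) and all(
        p in ("X", a) for p, a in zip(pattern, address)
    )


if __name__ == "__main__":
    # One ternary entry stands in for four binary entries.
    print(expand("010XX"))
    print(ternary_match("010XX", "01000"), ternary_match("10000", "01000"))
```

Under these assumptions, the single entry 010XX covers the four binary addresses 01000 through 01011, which is why the ternary table needs two rows instead of five.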
Longest Prefix Matching
• “010XX” = “010” (prefix) followed by “XX” (mask)
• An IP address might match more than one entry
– “01000” matches both “0100X” and “010XX” below
• Select the entry with the longest prefix (fewest “X”s)
• Longer prefix = more specific routing information
Address   Interface
010XX     A
0100X     C
000XX     D
1XXXX     E
110XX     B
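A minimal software sketch of longest prefix matching over the table above follows. It is a sequential model for illustration only: a TCAM compares all entries in parallel, and the function names here are hypothetical.

```python
from typing import Optional

# Routing table from the slide: (ternary pattern, interface).
TABLE = [
    ("010XX", "A"),
    ("0100X", "C"),
    ("000XX", "D"),
    ("1XXXX", "E"),
    ("110XX", "B"),
]


def matches(pattern: str, address: str) -> bool:
    """Ternary compare: an 'X' in the pattern matches either bit value."""
    return len(pattern) == len(address) and all(
        p in ("X", a) for p, a in zip(pattern, address)
    )


def prefix_length(pattern: str) -> int:
    """Number of specified (non-X) bits in the pattern."""
    return len(pattern) - pattern.count("X")


def longest_prefix_match(table, address: str) -> Optional[str]:
    """Return the interface of the matching entry with the fewest X bits."""
    candidates = [(p, iface) for p, iface in table if matches(p, address)]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: prefix_length(entry[0]))[1]


if __name__ == "__main__":
    for addr in ("01000", "01010", "11000", "00111"):
        print(addr, "->", longest_prefix_match(TABLE, addr))
```

For the address 01000, both 010XX and 0100X match, and the sketch selects interface C because 0100X has the longer prefix.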
Previous work
• TCAM research largely done using CMOS
• Monolithically stacked TCAM
– 3D stacking memory array on top of search circuitry
– Programmable vias replace SRAM
– 4x cell density, 3.5x dynamic power reduction
– Orthogonal to our ideas
• Resistive TCAM cells
– Utilizing PCM and STT-MRAM technology
– Up to 20x cell density
– Relatively high latency (several nanoseconds)
– Early stages of design
Previous work
• Research on Flash devices
– Device characterization
– Cell program/erase optimization
– Wear leveling algorithms
– Do not consider using them in TCAM circuits
Our approach: Overview
• Routing entries stored in blocks
– Fixed number of blocks for each mask length
• Single LPM block
• Shadow blocks
– Control route flaps
– Control burst updates
Our approach: TCAM Block
• Address is looked up in TCAM portion of the
block
– 256 entries looked up in parallel, at most one
matches (implemented using matchline)
• Matched entry has its port memory driven out
Our approach: TCAM Row
• Matchline (precharged) spans 256 TCAM
cells horizontally
– Large delay for any row
• Split the matchline into smaller (8-bit) sections
– Cascade mismatch propagation
– Use keepers to speed up the lookup
[Figure: a matchline spanning a row of 256 TCAM cells]
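The segmented matchline can be pictured with a logic-level Python sketch. It assumes a 256-bit row split into 8-bit sections (32 sections), each producing a local match result that is combined in a chain with the result of the preceding sections. Timing, precharge and keeper circuits are not modelled, and all names are hypothetical.

```python
SECTION_BITS = 8  # section width taken from the slide


def cell_mismatch(stored: str, key_bit: str) -> bool:
    """A cell mismatches only if it stores a definite bit that differs from the key."""
    return stored != "X" and stored != key_bit


def section_results(stored_row: str, key: str) -> list[bool]:
    """Local match result of each 8-bit section (True = no cell mismatched)."""
    if len(stored_row) != len(key):
        raise ValueError("row and key must have the same width")
    return [
        not any(
            cell_mismatch(s, k)
            for s, k in zip(
                stored_row[i:i + SECTION_BITS], key[i:i + SECTION_BITS]
            )
        )
        for i in range(0, len(key), SECTION_BITS)
    ]


def row_matches(stored_row: str, key: str) -> bool:
    """Chain the sections: a mismatch in any section propagates to the row output."""
    match = True  # models the precharged state
    for local_ok in section_results(stored_row, key):
        match = match and local_ok
    return match


if __name__ == "__main__":
    row = "01" * 128  # a 256-bit stored entry
    print(len(section_results(row, row)), row_matches(row, row))
```

In this model a single mismatching cell anywhere in the row clears the final match output, while each section only has to evaluate 8 cells locally.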
Our approach: Operation
[Figure: three flash TCAM cells storing “1”, “0” and “X”, each connected to the match line]
Our approach: Lookup “1”
• For lookup of “1”: a(i) = RH, b(i) = RL
– Stored “1”: match stays precharged
– Stored “0”: match pulled down
– Stored “X”: match stays precharged
Our approach: Lookup “0”
• For lookup of “0”: a(i) = RL, b(i) = RH
– Stored “1”: match pulled down
– Stored “0”: match stays precharged
– Stored “X”: match stays precharged
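Flash versus CMOS TCAM Cells
[Figure: schematic of the flash TCAM cell next to the CMOS TCAM cell, both connected to the match line; voltage labels in the figure: 0.2 V, 0.7 V, 0.7 V]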
Our approach: Proof of correctness
[Figure: threshold and read voltages (0.6 V, 0.21 V, 0.76 V, 1.1 V) for the cases Store “1”, Store “0”, Lookup “1” and Lookup “0”, relative to the match line]
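Our approach: Port cell
[Figure: 4 flash-based port cells compared with a CMOS port cell (SRAM)]
Our approach: Program
[Figure: programming scheme; signal labels in the figure: RH and Vp]
Our approach: Erase
[Figure: erase scheme; signal label in the figure: V Erase]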
Evaluation
• Implemented flash-based TCAM block
– Emulated flash model cards (45nm from IEDM)
– Developed cell layout
– Raphael parasitic extraction
– HSPICE simulation
• Compared to CMOS implementation
– Used PTM 45nm process
Evaluation
• Layout pictures
[Figure: flash-based TCAM cell layout and flash-based port cell layout]
Evaluation
                      TCAM part         Port memory part   Total
                      Delay    Power    Delay    Power     Delay    Power    Area
CMOS                  218 ps   96 mW    174 ps   33 mW     393 ps   129 mW   286655 µm²
Flash                 679 ps   65 mW    306 ps   14 mW     985 ps   79 mW    36130 µm²
Ratio (Flash/CMOS)                                         2.5x     0.6x     0.126x
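The ratio row can be re-derived from the Total columns of the table. The short Python check below uses only the numbers on the slide; the dictionary keys are illustrative names, and the area unit is assumed to be µm².

```python
# Total delay, power and area from the evaluation table.
cmos = {"delay_ps": 393, "power_mw": 129, "area_um2": 286655}
flash = {"delay_ps": 985, "power_mw": 79, "area_um2": 36130}

# Flash/CMOS ratios, as in the last row of the table.
ratios = {name: flash[name] / cmos[name] for name in cmos}

# The same comparison expressed as improvement factors (CMOS/Flash).
area_improvement = cmos["area_um2"] / flash["area_um2"]
power_improvement = cmos["power_mw"] / flash["power_mw"]

if __name__ == "__main__":
    for name, value in ratios.items():
        print(f"{name}: {value:.3f}x")
    print(f"area improvement:  {area_improvement:.2f}x")
    print(f"power improvement: {power_improvement:.2f}x")
```

The computed ratios agree with the table after rounding: the flash design is about 2.5x slower, draws about 0.6x the power, and occupies about 0.126x the area, which corresponds to the roughly 8x density and 1.6x power figures quoted in the deck.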
Lifetime Estimation
• In-house TCAM-based router simulator
• RIB snapshots of a real internet router
• Replayed UPDATE traces for 1 day
• Assumptions (0.5 in² chip):
– 1.5M FTCAM entries / 500K occupied
– Updating rewrites the whole 256-entry block
– Flash endurance 10^5 erase/program cycles
– Randomized wear leveling utilized
– Size of CMOS shadow: 48 blocks x 256 entries
Lifetime Estimation
[Figure: bar chart “Routing table size / updates” (routing table size breakdown). X axis: prefix length, 16 to 24. Y axis: number of flash entries that are updated, 0 to 350,000. Series: Base Size, UPDATES w/o Shadow, UPDATES w/ Shadow]
Lifetime Estimation
• 535K UPDATES to flash blocks, w/o CMOS shadow
• 210K UPDATES to flash blocks, w/ CMOS shadow
• Observations:
– CMOS shadow blocks filter out 61% of UPDATES
– Average time between flushes to flash blocks: ~5 min
– Several cases of 7 flushes within 1 second
• Can support this with double-buffering
– No packets are lost
• Estimated TCAM lifetime is 5 years (worst case)
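Some of the figures above can be cross-checked with simple arithmetic. The Python sketch below uses only numbers stated on the lifetime slides; the variable names are illustrative. It does not attempt to reproduce the 5-year lifetime estimate, which comes from the authors' simulator.

```python
import math

# UPDATE counts over the replayed 1-day trace.
updates_without_shadow = 535_000
updates_with_shadow = 210_000

# Fraction of UPDATES absorbed by the CMOS shadow blocks.
filtered_fraction = 1 - updates_with_shadow / updates_without_shadow

# Number of 256-entry blocks needed for 1.5M FTCAM entries.
total_entries = 1_500_000
entries_per_block = 256
flash_blocks = math.ceil(total_entries / entries_per_block)

# Rough number of flushes per day at one flush every ~5 minutes on average.
flushes_per_day = (24 * 60) // 5

if __name__ == "__main__":
    print(f"filtered by shadow blocks: {filtered_fraction:.1%}")
    print(f"flash blocks: {flash_blocks}")
    print(f"approx. flushes per day: {flushes_per_day}")
```

The filtered fraction comes out at about 61%, consistent with the observation on the slide.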
Conclusion
• First to design a TCAM using flash transistors
• Extremely high density
– TCAM cell: 2 transistors vs 17 with CMOS
– Port memory cell: 1 transistor vs 6 with CMOS
• Area improvement 8x
• Power improvement 1.64x
• Exceeds current internet backbone data rates
(~400 Gb/s)
• > 5-year lifetime
Questions?
Thank you!
