2/11/2020 Priorities Shift In IC Design
https://semiengineering.com/higher-performance-plus-low-power/ 1/12
Priorities Shift In IC Design
AI, edge applications are driving design teams to find new ways to achieve the best performance per watt.
The rush to the edge and new applications around AI are causing a shift in design strategies toward the highest
performance per watt, rather than the highest performance or lowest power.
This may sound like hair-splitting, but it has set a scramble in motion around how to process more data more quickly
without just relying on faster processors and accelerators. Several factors are driving these changes, including the
slowdown in Moore’s Law (https://semiengineering.com/knowledge_centers/standards-laws/laws/moores-law/),
which limits the number of traditional options, the rollout of AI
(https://semiengineering.com/knowledge_centers/artificial-intelligence/) everywhere, and a surge in data from more
sensors, cameras and images with higher resolutions. In addition, more data is being run through convolutional
neural networks (https://semiengineering.com/knowledge_centers/artificial-intelligence/neural-
networks/convolutional-neural-network/) or deep learning
(https://semiengineering.com/knowledge_centers/artificial-intelligence/deep-learning/) inferencing systems, which
bring huge data processing loads.
“As semiconductor scaling slows, but processing demands increase, designers are going to need to start working
harder for those performance and efficiency gains,” said Russell Klein, HLS platform director at Mentor, a Siemens
Business (https://semiengineering.com/entities/mentor-a-siemens-business/). “When optimizing any system, you
need to focus on the biggest inefficiencies first. For data processing on embedded systems, that will usually be
software.”
When Moore’s Law was in its prime, processor designers had so many gates they didn’t know what to do with them
all, Klein said. “One answer was to plop down more cores, but programmers were reluctant to adopt multi-core
programming paradigms. Another answer was to make the processor go as fast as possible without regard to area. A
feature that would add 10% to the speed of a processor was considered a win, even if it doubled the size of that
processor. Over time, high-end processors picked up a lot of bloat, but no one really noticed or cared. The
JANUARY 16TH, 2020 - BY: ANN STEFFORA MUTSCHLER (HTTPS://SEMIENGINEERING.COM/AUTHOR/ANN/)
processors were being stamped out on increasingly efficient and dense silicon. MIPS was the only metric that
mattered. But if you start to care about system-level efficiency, that bloated processor, and especially the software
running on it, might warrant some scrutiny.”
Software has a lot of very desirable characteristics, Klein pointed out, but even well-written software is neither fast
nor efficient when compared to the same function implemented in hardware. “Moving algorithms from software on
the processor into hardware can improve both performance and power consumption because software alone is not
going to deliver the performance needed to meet the demands of inferencing, high-resolution video processing, or
5G.”
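A back-of-the-envelope way to see the payoff of moving a hot algorithm into hardware is Amdahl's law. The sketch below is illustrative only; the offloaded fraction and accelerator speedup are hypothetical numbers, not figures from Klein.

```python
def amdahl_speedup(offload_fraction, accel_speedup):
    """Overall speedup when a fraction of the runtime moves to hardware
    that runs it accel_speedup times faster than the software version."""
    return 1.0 / ((1.0 - offload_fraction) + offload_fraction / accel_speedup)

# Hypothetical: an inferencing kernel is 80% of runtime, and a hardware
# implementation runs it 20x faster than software.
print(round(amdahl_speedup(0.80, 20.0), 2))  # 4.17x overall
```

The remaining 20% of un-offloaded software quickly dominates, which is why the biggest inefficiency has to be attacked first.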
The need for speed
At the same time, traffic data speeds are increasing, and there are new demands on high-speed interfaces to access
that data. “High-speed interfaces and SerDes (https://semiengineering.com/knowledge_centers/communications-
io/off-chip-communications/i-o-enabling-technology/serializer-deserializer-serdes/) are an integral part of the
networking chain, and these speed increases are required to support the latest technology demands of artificial
intelligence (AI), Internet of Things (IoT), virtual reality (VR) and many more technologies that have yet to be
envisioned,” noted Suresh Andani, senior director of IP cores at Rambus
(https://semiengineering.com/entities/rambus-inc/).
Best design practices for high-performance devices include defining and analyzing the solution space through
accurate full-system modeling; utilizing system design and concurrent engineering to maximize first-time-right
silicon; ensuring tight correlation between models and silicon results; leveraging a system-aware design
methodology; and including built-in test features to support bring-up, characterization and debug, he said.
There are many ways to improve performance per watt, and not just in hardware or software. Kunle Olukotun,
Cadence Design Systems Professor of electrical engineering and computer science at Stanford University, said that
relaxing precision, synchronization and cache coherence can reduce the amount of data that needs to be sent back
and forth. That can be reduced even further by domain-specific languages, which do not require translation.
“You can have restricted expressiveness for a particular domain,” said Olukotun in a recent presentation. “You also
can utilize parallel patterns and put functional data into parallel patterns based on representation. And you can
optimize for locality and exploit parallelism.”
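What a "parallel pattern" looks like can be sketched in a few lines. This is an illustration in plain Python, not code from an actual domain-specific language: a dot product written as a map followed by a reduce, so a DSL compiler could see the parallelism in the map stage and fuse the two stages for locality.

```python
from functools import reduce

def dot(xs, ys):
    """Dot product expressed as parallel patterns rather than an explicit loop."""
    products = map(lambda pair: pair[0] * pair[1], zip(xs, ys))  # map pattern
    return reduce(lambda acc, p: acc + p, products, 0.0)         # reduce pattern

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

The restricted structure is the point: because the pattern says nothing about execution order, the compiler is free to parallelize and tile it.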
He noted that flexible mapping of data is much more efficient. That can take advantage of data parallelism, model
parallelism, and dynamic precision as needed. In addition, the dataflow can be made hierarchical using a wider
interface between the algorithms and the hardware, allowing for parallel patterns, explicit memory hierarchies,
hierarchical control and explicit parameters, all of which are very useful in boosting performance per watt in
extremely performance-centric applications.
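One common way to relax precision and cut data movement is symmetric int8 quantization of a weight tensor. The sketch below uses hypothetical random weights purely to show the 4x reduction in bytes that must cross the memory interface.

```python
from array import array
import random

random.seed(0)
# Hypothetical weight tensor: 100,000 float32 values.
weights = array("f", (random.gauss(0.0, 1.0) for _ in range(100_000)))

# Symmetric int8 quantization: one shared scale factor for the tensor.
scale = max(abs(w) for w in weights) / 127.0
quantized = array("b", (max(-127, min(127, round(w / scale)))
                        for w in weights))

fp32_bytes = len(weights) * weights.itemsize      # 4 bytes per value
int8_bytes = len(quantized) * quantized.itemsize  # 1 byte per value
print(fp32_bytes // int8_bytes)  # 4: a quarter of the data to move
```

Every byte that stays on-chip is energy not spent on the memory interface, which is where much of the power in an inference accelerator goes.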
Flexibility in designs has been one of the tradeoffs in optimizing performance per watt, and many of the new AI chips
under development have been struggling to combine optimally tuned hardware and software into designs while still
leaving enough room for ongoing changes in algorithms and different compute tasks.
“You may spend 6 to 9 months mapping how to cut up work, and that provides a big impediment to embracing new
markets quickly,” said Stuart Biles, a fellow and director of research architecture at Arm
(https://semiengineering.com/entities/arm/) Research. “For large OSes, there is a set of functionality in the system
where a particular domain is likely to execute on a general-purpose core. But you can add in flexibility for how you
partition that and make the loop quicker. That basically comes down to how well you use an SoC’s
(https://semiengineering.com/knowledge_centers/integrated-circuit/ic-types/system-on-chip/) resources.”
2/11/2020 Priorities Shift In IC Design
https://semiengineering.com/higher-performance-plus-low-power/ 3/12
Biles noted that once a common subset is identified, then certain functions can be specialized with an eFPGA
(https://semiengineering.com/knowledge_centers/integrated-circuit/ic-types/fpga/embedded-fpga-efpga/) or using
3D integration. “We’ve moved from the initial 3D integration to the microarchitecture, where you can cut out cycles
and branch prediction. What you’re looking at is the time it takes to get from load/store to processor versus doing
that vertically, and you can change the microarchitectural assumptions based on specific assumptions in 3D. That
results in different delays.”
A different take on the same problem is to limit the amount of data that needs to be processed in the first place. This
is particularly important in edge systems such as cars, where performance per watt is critical due to limited battery
power and the need for real-time results. One way to change that equation is to sharply limit the amount of data
being sent to centralized processing systems in the vehicle by pre-screening it at the sensor level. So while not
actually speeding up the processing per watt, it achieves faster results using less power.
“You can provide a reasonable amount of compute power at the sensor, and you can reduce the amount of data that
the sensor identifies through pre-selection,” said Benjamin Prautsch, group manager for advanced mixed-signal
automation at Fraunhofer IIS’ (https://semiengineering.com/entities/fraunhofer-iis-eas/) Engineering of Adaptive
Systems Division. “So if you’re looking at what is happening in a room, the first layer can identify if there are people in
there. The same can be used on a manufacturing line. You also can run DNN calculations in a parallel way to be more
efficient.”
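The pre-selection idea can be sketched as a simple frame filter at the sensor: only frames that differ meaningfully from the last forwarded frame are sent upstream for full DNN processing. The frames, threshold, and scene below are hypothetical.

```python
def mean_abs_diff(a, b):
    """Average per-pixel absolute difference between two flat frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def prescreen(frames, threshold=10.0):
    """Yield only frames that differ enough from the last forwarded one."""
    ref = None
    for frame in frames:
        if ref is None or mean_abs_diff(frame, ref) > threshold:
            ref = frame
            yield frame  # forward upstream for full processing

# Hypothetical feed: 99 frames of a static room, then one with activity.
static = [50] * 64
active = [200] * 64
feed = [static] * 99 + [active]
forwarded = list(prescreen(feed))
print(len(forwarded))  # 2: the first frame plus the one with activity
```

The central compute cluster now processes 2 frames instead of 100; the result arrives faster and the system burns less power, even though nothing got faster per watt.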
Further, AI chips, like many high-performance devices, have a tendency to develop hotspots, noted Richard
McPartland, technical marketing manager at Moortec (https://semiengineering.com/entities/moortec-semiconductor-ltd/).
“AI chips are designed to tackle immense processing tasks for training and inference,” he said. “They are
typically very large in silicon area, with hundreds or even thousands of cores on advanced finFET
(https://semiengineering.com/knowledge_centers/integrated-circuit/transistors/3d/finfet-3/) processes consuming
high current – 100 amperes or more at supply voltages below 1 volt. With AI chip power consumption at a minimum
in the tens of watts, but often well over 100 watts, it should be no surprise that best design practices include in-chip
temperature monitoring. And it’s not just one sensor, but typically tens of temperature sensors distributed
throughout the clusters of processors and other blocks. In-chip monitoring should be considered early in the design
flow and included up front in floor planning, not added as an afterthought. At a minimum, temperature
monitoring can provide protection from thermal runaway. But accurate temperature monitoring also supports
maximizing data throughput by minimizing throttling of the compute elements.”
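The throttling loop such monitoring drives can be sketched as a hysteretic controller keyed to the hottest sensor. The temperature thresholds, clock limits, and step size below are hypothetical, chosen only to show the mechanism.

```python
def next_clock_mhz(sensor_temps_c, clock_mhz,
                   t_max=105.0, t_safe=95.0, step=50):
    """Hysteretic throttle driven by the hottest on-die sensor reading."""
    hottest = max(sensor_temps_c)
    if hottest > t_max:
        return max(clock_mhz - step, 400)   # back off toward a floor clock
    if hottest < t_safe:
        return min(clock_mhz + step, 1600)  # cool enough: recover headroom
    return clock_mhz                        # in the hysteresis band: hold

print(next_clock_mhz([88, 92, 107, 90], 1600))  # 1550: one cluster is hot
print(next_clock_mhz([80, 82, 85, 79], 1200))   # 1250: cool, raise clock
```

Accurate sensors narrow the band between `t_safe` and `t_max`, which is exactly the "minimizing throttling" benefit McPartland describes: the less margin you need for sensor error, the longer the cores stay at full clock.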
In-chip voltage monitoring with multiple sense points is also recommended for high-performance devices such as AI
chips, he continued. “Again, this should be included early in the design flow to monitor the supply voltages at critical
circuits, such as the processor clusters, as well as supply drops between the supply pins and the circuit blocks.
Voltage droops occur when the AI chips start operating under load, and because the load is software-driven, this can be
difficult to predict in the chip design phase, with the software written later by another team. Including voltage sense
points gives visibility into what is going on with the internal chip supplies, and is invaluable in the chip bring-up phase,
as well as for reducing power consumption through minimizing guard bands.”
Process detectors are also a must-have on high-performance devices such as AI chips, McPartland said. “These
enable a quick and independent verification of process performance and variation, not just die-to-die but across
large individual die on advanced nodes. Further, they can be used for power optimization
(https://semiengineering.com/power-optimization-strategies-widen/), such as to reduce power consumption
(https://semiengineering.com/knowledge_centers/low-power/low-power-design/power-consumption/) through
voltage scaling schemes where the voltage guard bands are minimized on a per-die basis based on process speed.
Lower power equates to higher processing performance in the AI world, where processing power is often
constrained by thermal and power issues.”
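The payoff of trimming per-die guard bands follows from dynamic CMOS power scaling roughly with the square of supply voltage at a fixed clock. The voltages below are hypothetical, just to put a number on the effect.

```python
def dynamic_power_ratio(v_new, v_old):
    """Dynamic CMOS power scales roughly as V^2 at a fixed clock (P = C*V^2*f)."""
    return (v_new / v_old) ** 2

# Hypothetical fast die: process monitors show it meets timing at 0.72 V
# instead of the worst-case guard-banded 0.80 V supply.
ratio = dynamic_power_ratio(0.72, 0.80)
print(f"{(1 - ratio) * 100:.0f}% dynamic power saved")  # 19%
```

Under a fixed thermal budget, those saved watts can go straight back into compute, which is the sense in which lower power equates to higher processing performance.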
AI algorithm performance challenges
An important consideration of AI and other high-performance devices is the fact that actual performance is not
known until the end application is run. This raises questions for many AI processor startups that insist they can build
a better hardware accelerator for matrix math and other AI algorithms than the next guy.
“That’s their key di erentiation,” said Ron Lowman, strategic marketing manager for IoT at Synopsys
(https://semiengineering.com/entities/synopsys-inc/). “Some of those companies may be in their second or third
designs, whereas the bigger players are in their third or fourth designs, and they’re learning something every time.
The math is changing on them just as rapidly as they can get a chip out, which is helping the situation, but it’s a game
for who can get the highest performance in the data center. That’s now moving down to edge computing
(https://semiengineering.com/knowledge_centers/compute-architectures/edge-computing/). Those AI accelerators
are being built on local and on-premise servers now, and they want to find their niche in performance per watt and
for specific applications. But in that space, they still have to accommodate many different types of AI functions, be it
for voice or audio or database extraction or vision. That’s a lot of different things. Then there’s the guys building the
applications, like for ADAS (https://semiengineering.com/knowledge_centers/automotive/adas-advanced-driver-
assistance-systems/). That’s a very specific use case, and they can be more specific to what they’re building, so they
know exactly the model they may want, although that too changes pretty rapidly.”
If the design team has a better handle on the end application and the intended use cases, they can look at each
specific space, whether it’s for mobile or edge computing, or for automotive. “You can see that the TOPS,
just the pure performance, has grown orders of magnitude over the last couple of years,” Lowman said. “The initial
mobile devices that were going to handle AI had under a TOPS (tera operations per second). Now you’re seeing up to
16 TOPS in those mobile devices. That’s how they start, by saying, ‘This is the general direction because we have to
handle many different types of AI functions in the mobile phone.’ You look at ADAS, and those guys were even ahead
of the mobile phones. Now you’re seeing up to 35 TOPS for a single instantiation for ADAS, and that continues to grow.
In edge computing, they’re basically scaling down the data center devices to be more power-efficient, and those
applications can range from 50 to hundreds of TOPS. That’s where you start.”
However, a first-generation AI architecture often is very inefficient for what its designers want to accomplish, because
they’re trying to do too much. If the actual application could be run, the architecture could be tuned significantly, because it’s
not just a processor or the ability to just do the MAC. It’s a function of accessing the coefficients from memory, then
processing them very effectively. Nor is it just a matter of adding a bunch of on-chip SRAM
(https://semiengineering.com/knowledge_centers/memory/volatile-memory/static-random-access-memory/) to
solve the problem. By modeling the IP, such as DDR instantiations, different bitwidths with different access
capabilities, different types of DRAM (https://semiengineering.com/knowledge_centers/memory/volatile-
memory/dynamic-random-access-memory/) configurations, or LPDDR versus DDR, optimal architectures can be found before
system development is complete using prototyping tools and systems exploration tools.
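A rough bandwidth estimate shows why coefficient access, not the MACs, is often the bottleneck. The model size, inference rate, and the worst-case assumption that every coefficient is fetched from DRAM on every inference are all hypothetical, for illustration only.

```python
def required_bw_gbs(n_coeffs, bytes_per_coeff, inferences_per_s):
    """DRAM bandwidth needed if every coefficient is fetched once per
    inference, i.e. the model does not fit in on-chip SRAM."""
    return n_coeffs * bytes_per_coeff * inferences_per_s / 1e9

# Hypothetical: a 500M-coefficient model at int8, run 30 times per second.
need = required_bw_gbs(500_000_000, 1, 30)
print(f"{need:.1f} GB/s needed")  # 15.0 GB/s
```

Whether 15 GB/s is cheap or ruinous depends entirely on the LPDDR versus DDR choice, the bitwidth, and how much coefficient reuse the on-chip memory hierarchy can extract, which is exactly what the prototyping tools are modeling.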
“If the development team has the real algorithm, it’s much more effective,” Lowman said. “A lot of people use ResNet-50
as a benchmark because that’s better than TOPS. But people are well beyond that. You see voice applications for
natural language understanding. ResNet-50 has maybe a few million coefficients, but some of these are in the billions
of coefficients now, so it’s not even representative. And the more representative you can get of the application, the
more accurately you can define your SoC architecture to handle those types of things.”
There are so many moving pieces here that the more modeling you can do up front with the actual IP, the better off you
are. “This is where some traction is happening, and it shows up in many aspects: the memory pieces that are so important,
the processing pieces that are so important, even the interfaces for the sensor inputs, like MIPI, or audio interfaces.
All of that architecture can be optimized based on the algorithm, and it’s no different than it always has been. If you run
the actual software, you can go ahead and optimize much more effectively. But there’s a constant need to grow the
performance per watt. If the estimates are to be believed, with some saying that 20% to 50% of all electricity will be
consumed by AI, that’s a huge problem. That is spurring the trend to move to more localized computing, and to
compress these things into the application itself. All of those require different types of architectures to handle the
different functions and features that you’re trying to accomplish,” Lowman said.
Power does play a role here because of the amount of memory capacity needed, the changing number of coefficients,
and the number of math blocks.
“You can throw on tons of multiply/accumulates, put them all on chip, but you also have to have all the other things
that are done afterward,” he said. “That includes the input of the data and conditioning of that input data. For
instance, for audio, you need to make sure there are no bottlenecks. How much cache is needed for each of these
data movements? There are all kinds of different architectural tradeoffs, so the more modeling you can do up front,
the better your system will be if you know the application. If you create a generic architecture, and then run the workload
you actually deploy in the system, you may not get the accuracy that you thought you had. There’s a lot of work being done
to improve that over time, and to make corrections to get the accuracy and power footprint that teams need. You
can start with some general features, but every generation I’ve seen is moving very quickly on more performance,
less power, more optimized math, more optimized architectures, and the ability to do not just a standard SRAM but a
multi-port SRAM. This means you’re doing two accesses at once, so you may have as many multiply/accumulates as
you want. But if you can go ahead and do several reads and writes in a single cycle, that saves on power. You can
optimize what that looks like when you’re accessing, and the number of multiply/accumulates you need to do for that
particular stage in the pipeline.”
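The multi-port SRAM point can be put in numbers. The sketch below uses hypothetical figures (64 MACs, two operands each) and ignores banking and wide-word tricks; it only shows how port count limits how fast operands can be delivered.

```python
def cycles_to_feed(n_macs, operands_per_mac, sram_ports):
    """Cycles to fetch one round of operands when the SRAM services
    only sram_ports accesses per cycle."""
    accesses = n_macs * operands_per_mac
    return -(-accesses // sram_ports)  # ceiling division

# 64 MACs each need a weight and an activation per compute cycle.
print(cycles_to_feed(64, 2, 1))  # 128 cycles with single-port SRAM
print(cycles_to_feed(64, 2, 2))  # 64 with dual-port: half the stall time
```

Doubling the ports halves the cycles (and wasted clock energy) spent waiting on memory, which is why a dual-port array can save power overall even though each access costs slightly more.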
Conclusion
With so much activity in the high-performance and AI space, it’s an exciting time for the semiconductor ecosystem
around these applications. There is a tremendous amount of startup activity, with the thinking evolving from a more
generic mindset of, “We can do the math for neural networks,” to one in which everybody can do the math for
specific neural networks in different fields, Lowman said. “You can do it for voice, you can do it for vision, you can do
it for data mining, and there are specific types of vision, voice or sound where you can optimize for certain things.”
This only makes the AI market opportunity more exciting as the technology branches out into many different fields
that are extensions of current ones or new areas altogether, and the development technologies and tool ecosystem
discover new ways to make it all a reality.
—Ed Sperling contributed to this report.
Ann Steffora Mutschler (all posts) (https://semiengineering.com/author/ann/)
Ann Steffora Mutschler is executive editor at Semiconductor Engineering.
Leave a Reply
Comment
Name*
(Note: This name will be displayed publicly)
Email*
(This will not be displayed publicly)
Post Comment
SPONSORS
(http://www.mentor.com/) (http://www.rambus.com/)
(http://www.synopsys.com) (http://www.ansys.com/)
(http://www.arm.com/) (http://www.cadence.com)
(http://moortec.com/) (https://www.adestotech.com/)

More Related Content

PPTX
Digital twins - Technology that is Changing Industry
PDF
The Enterprise Internet of Things: Think Security First
PDF
AIOps: Anomalous Span Detection in Distributed Traces Using Deep Learning
PPTX
Green Compute and Storage - Why does it Matter and What is in Scope
PDF
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
PDF
SIMA AZ: Emerging Information Technology Innovations & Trends 11/15/17
PDF
IEEE CS Phoenix - Internet of Things Innovations & Megatrends Update
PDF
SeGW Whitepaper from Radisys
Digital twins - Technology that is Changing Industry
The Enterprise Internet of Things: Think Security First
AIOps: Anomalous Span Detection in Distributed Traces Using Deep Learning
Green Compute and Storage - Why does it Matter and What is in Scope
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
SIMA AZ: Emerging Information Technology Innovations & Trends 11/15/17
IEEE CS Phoenix - Internet of Things Innovations & Megatrends Update
SeGW Whitepaper from Radisys

What's hot (20)

PDF
The Value of Enterprise GIS
PDF
Accelerating IT Velocity: Agile Transformation at Dell
PDF
[Case study] DONG Energy: Improving the bottom line and getting better data q...
PPTX
Big Data for Big Power: How smart is the grid if the infrastructure is stupid?
PDF
Build the network of the future on your terms today
PPTX
Edge AI Framework for Healthcare Applications
PDF
AI in Healh Care using IBM POWER systems
PDF
Architecting the Enterprise Internet of Things
PDF
Vertex perspectives ai optimized chipsets (part i)
PDF
Edge optimized architecture for fabric defect detection in real-time
PPTX
Deep learning for smart manufacturing
PDF
Edge Computing for the Industry
PDF
8. 9590 1-pb
PPT
Apc by Schneider - 27mai2011
PDF
IT OT Integration_Vishnu_Murali_05262016_UPDATED
PDF
Vertex Perspectives | AI Optimized Chipsets | Part IV
PDF
Vertex Perspectives | AI Optimized Chipsets | Part II
PDF
Tiarrah Computing: The Next Generation of Computing
PPTX
Digital_Twin_GUC_IE _AvinashMisra_&_AvinashNeema
PDF
Edge computing and its role in architecting IoT
The Value of Enterprise GIS
Accelerating IT Velocity: Agile Transformation at Dell
[Case study] DONG Energy: Improving the bottom line and getting better data q...
Big Data for Big Power: How smart is the grid if the infrastructure is stupid?
Build the network of the future on your terms today
Edge AI Framework for Healthcare Applications
AI in Healh Care using IBM POWER systems
Architecting the Enterprise Internet of Things
Vertex perspectives ai optimized chipsets (part i)
Edge optimized architecture for fabric defect detection in real-time
Deep learning for smart manufacturing
Edge Computing for the Industry
8. 9590 1-pb
Apc by Schneider - 27mai2011
IT OT Integration_Vishnu_Murali_05262016_UPDATED
Vertex Perspectives | AI Optimized Chipsets | Part IV
Vertex Perspectives | AI Optimized Chipsets | Part II
Tiarrah Computing: The Next Generation of Computing
Digital_Twin_GUC_IE _AvinashMisra_&_AvinashNeema
Edge computing and its role in architecting IoT
Ad

Similar to Priorities Shift In IC Design (20)

PDF
Keynote Speech - Low Power Seminar, Jain College, October 5th 2012
PPT
Conferencia
PPT
Conferencia
PDF
Artificial Intelligence has become a driving force across various industries,...
PPTX
High performance energy efficient multicore embedded computing
PDF
ChipEx 2019 keynote
PDF
Implementing AI: Running AI at the Edge
 
PPTX
CAQA5e_ch1 (3).pptx
PDF
Lecture 1 Advanced Computer Architecture
PDF
Implementing AI: Hardware Challenges
 
PDF
The Art of Applied Engineering - An Overview
PDF
lec01.pdf
PPTX
Caqa5e ch1 with_review_and_examples
PPTX
SYSTEM approach in system on chip architecture
PPTX
STUDY Introduction to advanced VLSI Design
PPTX
Education of basic VLSI design and its processor
PDF
Chip design with AI inside—designed by AI
PDF
Heterogeneous Computing : The Future of Systems
PDF
1.1. SOC AND MULTICORE ARCHITECTURES FOR EMBEDDED SYSTEMS (2).pdf
PDF
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
Keynote Speech - Low Power Seminar, Jain College, October 5th 2012
Conferencia
Conferencia
Artificial Intelligence has become a driving force across various industries,...
High performance energy efficient multicore embedded computing
ChipEx 2019 keynote
Implementing AI: Running AI at the Edge
 
CAQA5e_ch1 (3).pptx
Lecture 1 Advanced Computer Architecture
Implementing AI: Hardware Challenges
 
The Art of Applied Engineering - An Overview
lec01.pdf
Caqa5e ch1 with_review_and_examples
SYSTEM approach in system on chip architecture
STUDY Introduction to advanced VLSI Design
Education of basic VLSI design and its processor
Chip design with AI inside—designed by AI
Heterogeneous Computing : The Future of Systems
1.1. SOC AND MULTICORE ARCHITECTURES FOR EMBEDDED SYSTEMS (2).pdf
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
Ad

More from Abacus Technologies (20)

PDF
Cloud Technology Is the Underdog Of The Tech World
PDF
Small Business Owners: Eight Impactful Reasons To Leverage Cloud Technology
PDF
How to Improve Your Cloud and Container Security
PDF
Controlling cloud infrastructure costs: Tips & tricks
PDF
Does AI-driven cloud computing need ethics guidelines?
PDF
How the hybrid cloud is key to enterprise AI infrastructure strategies
PDF
Cloud Computing in Defence: Defence Trends
PDF
Remote Work Trends: How Cloud Computing Security Changed
PDF
Overcoming Digital Transformation Challenges With The Cloud
PDF
Why is Cloud Computing Important for Companies that Want to Deploy IoT Soluti...
PDF
5 best cloud computing certification courses in the U.S.
PDF
The 9 Best Cloud Computing Events and Conferences to Attend in 2021
PDF
Top 7 security mistakes when migrating to cloud-based apps
PDF
5 programming languages cloud engineers should learn
PDF
10 Fastest-growing cybersecurity skills to learn in 2021
PDF
Cybersecurity Is Not (Just) a Tech Problem
PDF
9 Tips to Prepare for the Future of Cloud & Network Security
PDF
Hybrid cloud strategy: 5 expert tips
PDF
14 Pro Tips For Efficiently Tracking Tech Bugs And Issues
PDF
The way a team functions and communicates
Cloud Technology Is the Underdog Of The Tech World
Small Business Owners: Eight Impactful Reasons To Leverage Cloud Technology
How to Improve Your Cloud and Container Security
Controlling cloud infrastructure costs: Tips & tricks
Does AI-driven cloud computing need ethics guidelines?
How the hybrid cloud is key to enterprise AI infrastructure strategies
Cloud Computing in Defence: Defence Trends
Remote Work Trends: How Cloud Computing Security Changed
Overcoming Digital Transformation Challenges With The Cloud
Why is Cloud Computing Important for Companies that Want to Deploy IoT Soluti...
5 best cloud computing certification courses in the U.S.
The 9 Best Cloud Computing Events and Conferences to Attend in 2021
Top 7 security mistakes when migrating to cloud-based apps
5 programming languages cloud engineers should learn
10 Fastest-growing cybersecurity skills to learn in 2021
Cybersecurity Is Not (Just) a Tech Problem
9 Tips to Prepare for the Future of Cloud & Network Security
Hybrid cloud strategy: 5 expert tips
14 Pro Tips For Efficiently Tracking Tech Bugs And Issues
The way a team functions and communicates

Recently uploaded (20)

PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Network Security Unit 5.pdf for BCA BBA.
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Encapsulation theory and applications.pdf
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PPTX
Cloud computing and distributed systems.
PDF
Spectral efficient network and resource selection model in 5G networks
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Approach and Philosophy of On baking technology
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
DOCX
The AUB Centre for AI in Media Proposal.docx
PPTX
MYSQL Presentation for SQL database connectivity
PDF
KodekX | Application Modernization Development
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Diabetes mellitus diagnosis method based random forest with bat algorithm
Network Security Unit 5.pdf for BCA BBA.
“AI and Expert System Decision Support & Business Intelligence Systems”
Encapsulation theory and applications.pdf
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
Agricultural_Statistics_at_a_Glance_2022_0.pdf
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Understanding_Digital_Forensics_Presentation.pptx
Cloud computing and distributed systems.
Spectral efficient network and resource selection model in 5G networks
20250228 LYD VKU AI Blended-Learning.pptx
Approach and Philosophy of On baking technology
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Chapter 3 Spatial Domain Image Processing.pdf
Advanced methodologies resolving dimensionality complications for autism neur...
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
The AUB Centre for AI in Media Proposal.docx
MYSQL Presentation for SQL database connectivity
KodekX | Application Modernization Development
Build a system with the filesystem maintained by OSTree @ COSCUP 2025

Priorities Shift In IC Design

AI, edge applications are driving design teams to find new ways to achieve the best performance per watt.

JANUARY 16TH, 2020 - BY: ANN STEFFORA MUTSCHLER

The rush to the edge and new applications around AI are causing a shift in design strategies toward the highest performance per watt, rather than the highest performance or lowest power.

This may sound like hair-splitting, but it has set a scramble in motion around how to process more data more quickly without just relying on faster processors and accelerators. Several factors are driving these changes, including the slowdown in Moore’s Law, which limits the number of traditional options, the rollout of AI everywhere, and a surge in data from more sensors, cameras and images with higher resolutions. In addition, more data is being run through convolutional neural networks or deep learning inferencing systems, which bring huge data processing loads.

“As semiconductor scaling slows, but processing demands increase, designers are going to need to start working harder for those performance and efficiency gains,” said Russell Klein, HLS platform director at Mentor, a Siemens Business. “When optimizing any system, you need to focus on the biggest inefficiencies first. For data processing on embedded systems, that will usually be software.”

When Moore’s Law was in its prime, processor designers had so many gates they didn’t know what to do with them all, Klein said. “One answer was to plop down more cores, but programmers were reluctant to adopt multi-core programming paradigms. Another answer was to make the processor go as fast as possible without regard to area. A feature that would add 10% to the speed of a processor was considered a win, even if it doubled the size of that processor. Over time, high-end processors picked up a lot of bloat, but no one really noticed or cared. The processors were being stamped out on increasingly efficient and dense silicon. MIPS was the only metric that mattered, but if you start to care about system-level efficiency, that bloated processor, and especially the software running on it, might warrant some scrutiny.”

Software has a lot of very desirable characteristics, Klein pointed out, but even well-written software is neither fast nor efficient when compared to the same function implemented in hardware. “Moving algorithms from software on the processor into hardware can improve both performance and power consumption, because software alone is not going to deliver the performance needed to meet the demands of inferencing, high-resolution video processing, or 5G.”

The need for speed

At the same time, traffic data speeds are increasing, and there are new demands on high-speed interfaces to access that data. “High-speed interfaces and SerDes are an integral part of the networking chain, and these speed increases are required to support the latest technology demands of artificial intelligence (AI), the Internet of Things (IoT), virtual reality (VR) and many more technologies that have yet to be envisioned,” noted Suresh Andani, senior director of IP cores at Rambus.
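The shift Klein describes, ranking candidate designs by performance per watt rather than by raw performance or raw power alone, can be illustrated with a toy comparison. Every design point and number below is hypothetical, chosen only to show how the two rankings diverge:

```python
# Illustrative only: compares hypothetical design points by performance per
# watt rather than by raw throughput. All names and figures are made up.
designs = {
    "fast_cpu":    {"gops": 200.0, "watts": 40.0},   # fastest, power-hungry
    "multicore":   {"gops": 150.0, "watts": 15.0},
    "accelerator": {"gops": 120.0, "watts": 3.0},    # slowest, most efficient
}

def perf_per_watt(d):
    """Throughput delivered per watt consumed (GOPS/W)."""
    return d["gops"] / d["watts"]

best_raw = max(designs, key=lambda k: designs[k]["gops"])
best_ppw = max(designs, key=lambda k: perf_per_watt(designs[k]))

# The raw-performance winner and the perf/watt winner differ:
print(best_raw, best_ppw)  # fast_cpu accelerator
```

The gap between those two winners is exactly what edge and AI applications, constrained by batteries and thermal budgets, care about.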
Best design practices for high-performance devices include defining and analyzing the solution space through accurate full-system modeling; utilizing system design and concurrent engineering to maximize first-time-right silicon; ensuring tight correlation between models and silicon results; leveraging a system-aware design methodology; and including built-in test features to support bring-up, characterization and debug, he said.

There are many ways to improve performance per watt, and not just in hardware or software. Kunle Olukotun, Cadence Design Systems Professor of electrical engineering and computer science at Stanford University, said that relaxing precision, synchronization and cache coherence can reduce the amount of data that needs to be sent back and forth. That can be reduced even further by domain-specific languages, which do not require translation. “You can have restricted expressiveness for a particular domain,” said Olukotun in a recent presentation. “You also can utilize parallel patterns and put functional data into parallel patterns based on representation. And you can optimize for locality and exploit parallelism.”

He noted that flexible mapping of data is much more efficient. That can take advantage of data parallelism, model parallelism, and dynamic precision as needed. In addition, the dataflow can be made hierarchical using a wider interface between the algorithms and the hardware, allowing for parallel patterns, explicit memory hierarchies, hierarchical control and explicit parameters, all of which are very useful in boosting performance per watt in extremely performance-centric applications.

Flexibility in designs has been one of the tradeoffs in optimizing performance per watt, and many of the new AI chips under development have been struggling to combine optimally tuned hardware and software into designs while still leaving enough room for ongoing changes in algorithms and different compute tasks.
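Olukotun’s first point, that relaxing precision shrinks the data that must be sent back and forth, is easy to see with back-of-envelope arithmetic. The parameter count below is hypothetical, used only to make the ratio concrete:

```python
# Sketch of how relaxing precision reduces data movement. The model size is
# a hypothetical figure, not taken from any specific network.
weights = 25_000_000            # parameters in a hypothetical model
bytes_fp32 = weights * 4        # 32-bit floating point: 4 bytes each
bytes_int8 = weights * 1        # 8-bit integer after quantization: 1 byte each

reduction = bytes_fp32 / bytes_int8   # traffic shrinks by this factor
print(f"fp32 traffic: {bytes_fp32/1e6:.0f} MB, "
      f"int8 traffic: {bytes_int8/1e6:.0f} MB, "
      f"{reduction:.0f}x less data moved")
```

A 4x cut in bytes moved translates directly into less memory bandwidth and less energy spent on data transfer, which is usually where inference power goes.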
“You may spend 6 to 9 months mapping how to cut up work, and that provides a big impediment to embracing new markets quickly,” said Stuart Biles, a fellow and director of research architecture at Arm Research. “For large OSes, there is a set of functionality in the system where a particular domain is likely to execute on a general-purpose core. But you can add in flexibility for how you partition that and make the loop quicker. That basically comes down to how well you use an SoC’s resources.”
Biles noted that once a common subset is identified, then certain functions can be specialized with an eFPGA or using 3D integration. “We’ve moved from the initial 3D integration to the microarchitecture, where you can cut out cycles and branch prediction. What you’re looking at is the time it takes to get from load/store to processor versus doing that vertically, and you can change the microarchitectural assumptions based on specific assumptions in 3D. That results in different delays.”

A different take on the same problem is to limit the amount of data that needs to be processed in the first place. This is particularly important in edge systems such as cars, where performance per watt is critical due to limited battery power and the need for real-time results. One way to change that equation is to sharply limit the amount of data being sent to centralized processing systems in the vehicle by pre-screening it at the sensor level. So while not actually speeding up the processing per watt, it achieves faster results using less power.

“You can provide a reasonable amount of compute power at the sensor, and you can reduce the amount of data that the sensor identifies through pre-selection,” said Benjamin Prautsch, group manager for advanced mixed-signal automation at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “So if you’re looking at what is happening in a room, the first layer can identify if there are people in there. The same can be used on a manufacturing line.
You also can run DNN calculations in a parallel way to be more efficient.”

Further, AI chips, like many high-performance devices, have a tendency to develop hotspots, noted Richard McPartland, technical marketing manager at Moortec. “AI chips are designed to tackle immense processing tasks for training and inference,” he said. “They are typically very large in silicon area, with hundreds or even thousands of cores on advanced finFET processes consuming high current, 100 amperes or more at supply voltages below 1 volt. With AI chip power consumptions at a minimum in the tens of watts, but often well over 100 watts, it should be no surprise that best design practices include in-chip temperature monitoring. And it’s not just one sensor, but typically tens of temperature sensors distributed throughout the clusters of processors and other blocks. In-chip monitoring should be considered early in the design flow and included up front in floor planning, not added as an afterthought. At a minimum, temperature monitoring can provide protection from thermal runaway. But accurate temperature monitoring also supports maximizing data throughput by minimizing throttling of the compute elements.”

In-chip voltage monitoring with multiple sense points is also recommended for high-performance devices such as AI chips, he continued. “Again, this should be included early in the design flow to monitor the supply voltages at critical circuits, such as the processor clusters, as well as supply drops between the supply pins and the circuit blocks. Voltage droops occur when the AI chips start operating under load, and being software-driven, this can be difficult to predict in the chip design phase with the software written later by another team.
Including voltage sense points gives visibility into what is going on with the internal chip supplies, and is invaluable in the chip bring-up phase, as well as for reducing power consumption through minimizing guard bands.”

Process detectors are also a must-have on high-performance devices such as AI chips, McPartland said. “These enable a quick and independent verification of process performance and variation, not just die-to-die but across large individual die on advanced nodes. Further, they can be used for power optimization, such as to reduce power consumption through voltage scaling schemes where the voltage guard bands are minimized on a per-die basis based on process speed.” Lower power equates to higher processing performance in the AI world, where processing power is often constrained by thermal and power issues.

AI algorithm performance challenges

An important consideration for AI and other high-performance devices is the fact that actual performance is not known until the end application is run. This raises questions for many AI processor startups that insist they can build a better hardware accelerator for matrix math and other AI algorithms than the next guy.

“That’s their key differentiation,” said Ron Lowman, strategic marketing manager for IoT at Synopsys. “Some of those companies may be in their second or third designs, whereas the bigger players are in their third or fourth designs, and they’re learning something every time. The math is changing on them just as rapidly as they can get a chip out, which is helping the situation, but it’s a game for who can get the highest performance in the data center. That’s now moving down to edge computing. Those AI accelerators are being built on local and on-premise servers now, and they want to find their niche in performance per watt and for specific applications. But in that space, they still have to accommodate many different types of AI functions, be it for voice or audio or database extraction or vision. That’s a lot of different things. Then there’s the guys building the applications, like for ADAS.
That’s a very specific use case, and they can be more specific to what they’re building, so they know exactly the model they may want, although that too changes pretty rapidly.”

If the design team has a better handle on the end application and the intended use cases, it can look at each specific space, whether it’s for mobile or edge computing, or for automotive. “You can see that the TOPS, just the pure performance, has grown orders of magnitude over the last couple of years,” Lowman said. “The initial mobile devices that were going to handle AI had under a TOPS (tera operations per second). Now you’re seeing up to 16 TOPS in those mobile devices. That’s how they start, by saying, ‘This is the general direction, because we have to handle many different types of AI functions in the mobile phone.’ You look at ADAS, and those guys were even ahead of the mobile phones. Now you’re seeing up to 35 TOPS for a single instantiation for ADAS, and that continues to grow. In edge computing, they’re basically scaling down the data center devices to be more power-efficient, and those applications can range between 50 and hundreds of TOPS. That’s where you start.”

However, a first-generation AI architecture often is very inefficient for what its designers want to accomplish, because they’re trying to do too much. If the actual application could be run, the architecture could be tuned significantly, because it’s not just a processor or the ability to just do the MAC. It’s a function of accessing the coefficients from memory, then processing them very effectively. It’s also not just adding a bunch of on-chip SRAM that solves the problem.
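The point that it is not just the MAC throughput but the fetching of coefficients from memory that limits performance can be made concrete with a rough bound check. All figures below are assumptions for illustration, not specifications of any real device:

```python
# Back-of-envelope check of whether a hypothetical accelerator layer is
# limited by MAC throughput or by fetching coefficients from DRAM.
# Every figure here is an assumption chosen for illustration.
macs_per_s   = 16e12      # 16 TOPS of multiply-accumulate capability
dram_bytes_s = 25.6e9     # assumed ~25.6 GB/s of DRAM bandwidth

layer_macs   = 2.0e9      # MACs required by one layer of a made-up network
layer_coeffs = 1.0e9      # coefficient bytes fetched for that layer (int8)

t_compute = layer_macs / macs_per_s       # time if only MACs mattered
t_memory  = layer_coeffs / dram_bytes_s   # time if only fetches mattered

bound = "memory" if t_memory > t_compute else "compute"
print(f"compute: {t_compute*1e3:.3f} ms, memory: {t_memory*1e3:.3f} ms "
      f"-> {bound}-bound")
```

Under these assumed numbers the MACs sit idle waiting on DRAM, which is why adding more on-chip multiply/accumulate units, or even more SRAM, does not by itself solve the problem.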
By modeling the IP, such as DDR instantiations, different bitwidths with different access capabilities, different types of DRAM configurations, or LPDDR versus DDR, optimal approaches can be found before system development is complete, using prototyping tools and systems-exploration tools. “If the development team has the real algorithm, it’s much more effective,” Lowman said. “A lot of people use ResNet-50 as a benchmark because that’s better than TOPS. But people are well beyond that. You see voice applications for natural language understanding. ResNet-50 has maybe a few million coefficients, but some of these are in the billions of coefficients now, so it’s not even representative. And the more representative you can get of the application, the more accurately you can define your SoC architecture to handle those types of things.”
With so many moving pieces, the more modeling you can do up front with the actual IP, the better off you are. “This is where some traction is happening, seen in many aspects. The memory pieces that are so important, the processing pieces that are so important. Just even the interfaces for the sensor inputs, like MIPI, or audio interfaces. All that architecture can be optimized based on the algorithm, and it’s no different than it always has been. If you run the actual software, you can go ahead and optimize much more effectively. But there’s a constant need to grow the performance per watt. If the estimates are to be believed, with some saying that 20% to 50% of all electricity will be consumed by AI, that’s a huge problem. That is spurring the trend to move to more localized computing, and trying to compress these things into the application itself. All of those require different types of architectures to handle the different functions and features that you’re trying to accomplish,” Lowman said.

Power does play a role here because of the memory capacity needed, the changing number of coefficients, and the number of math blocks. “You can throw on tons of multiply/accumulates, put them all on chip, but you also have to have all the other things that are done afterward,” he said. “That includes the input of the data and conditioning of that input data. For instance, for audio, you need to make sure there are no bottlenecks. How much cache is needed for each of these data movements? There are all kinds of different architectural tradeoffs, so the more modeling you can do up front, the better your system will be if you know the application. If you create a generic one, and then run the one that you actually run in the system, you may not get the accuracy that you thought you had.
There’s a lot of work being done to improve that over time, and make corrections for that to get the accuracy and power footprint that they need. You can start with some general features, but every generation I’ve seen is moving very quickly on more performance, less power, more optimized math, more optimized architectures, and the ability to do not just a standard SRAM but a multi-port SRAM. This means you’re doing two accesses at once, so you may have as many multiply/accumulates as you want. But if you can go ahead and do several reads and writes in a single cycle, that saves on power. You can optimize what that looks like when you’re accessing, and the number of multiply/accumulates you need to do for that particular stage in the pipeline.”

Conclusion

With so much activity in the high-performance and AI space, it’s an exciting time for the semiconductor ecosystem around these applications. There is a tremendous amount of startup activity, with the thinking evolving from a generic mindset of, “We can do the math for neural networks,” to one in which everybody can do the math for specific neural networks in different fields, Lowman said. “You can do it for voice, you can do it for vision, you can do it for data mining, and there are specific types of vision, voice or sound where you can optimize for certain things.”

This only makes the AI market opportunity more exciting as the technology branches out into many different fields that are extensions of current ones or new areas altogether, and the development technologies and tool ecosystem discover new ways to make it all a reality.

—Ed Sperling contributed to this report.
Ann Steffora Mutschler is executive editor at Semiconductor Engineering.