Pushing Intelligence to Edge Nodes: Low-Power Circuits for Self-Localization and Speaker Recognition
Nick Iliev
Presented to:
prof. Trivedi
prof. Paprotny
prof. Rao
prof. Metlushko
prof. Zheng
Intelligence at the edge nodes: Applications
Internet-of-acoustic-things
Simultaneous Localization
and Mapping (SLAM)
Autonomous vehicles
Wearables
Focus of this Research
• Develop ultralow power computing platforms for
– Speaker recognition hardware accelerator
– Localization hardware accelerator
• Low power neural network implementations
• Low power GMM-based speaker recognition
• Runtime adaptation: depending on battery state and performance requirements, select the processing clock frequency and/or the number of quantization bits to use.
Ultralow power spatial localization
Publications:
• 1. On-board accelerators based on massively parallel neural networks (Recurrent Neural Networks, RNNs) for coordinate computation and localization. Initial results published in IEEE ICCD 2017.
• 2. Non-RNN image-based localization and coordinate mapping (registration): published in IEEE ISM 2016.
• 3. Review and comparison of spatial localization methods for low-power wireless sensor networks: IEEE Sensors Journal 2015.
Spatial Localization: Centralized vs Distributed
[Diagram: cloud, user, server, an anchor sink node, nearest anchor nodes (shaded), and unknown-location IoT nodes.]
• Anchor node: broadcasts its own location and routes data to sink nodes.
• Unknown-location node (IoT node): receives the locations of the anchors and calculates its own location based on measurements.
• In centralized algorithms, the server receives all measurements and calculates locations for all unknown nodes.
• In distributed algorithms, the server only stores locations for all nodes; each unknown node computes its own location and broadcasts it to the network.
Distributed computational load to anchors
• Computational load at the anchors increases with the number of unknown nodes when the unknown nodes have no RNN capability.
[Diagram legend: arrows mark computational load increase at the anchors.]
Distributed computational load to anchors
• Each unknown node computes its own location with an on-board RNN accelerator, decreasing the load at the anchors. The RNN accelerator offloads the CPU, reduces power and latency, and requires no off-line training.
[Diagram: each unknown node carries an RNN block; legend marks computational load increase at the anchors vs load decrease with the RNNs.]
Spatial localization in 2D – AOA Geometry
• Two or more anchors illuminate each unknown node.
• Centralized: measure Φ1, Φ2 and transmit them to the server; receive own (x, y) from the server.
• Decentralized (self-localization): measure Φ1, Φ2, compute own (x, y), and transmit (x, y) to the server; this saves communication bandwidth and power.
[Diagram: anchor 'R1' at (X1, Y1), anchor 'R2' at (X2, Y2), and sensor 'U' of unknown location in the X-Y plane, with angles of arrival (AOA) Φ1 and Φ2.]
Spatial Localization in 2D: Applications
[Figure: AOA sensor distribution of fields for a sensor with 12 photodetectors.]
❑ Most use a CPU with matrix / linear-algebra hardware accelerators.
❑ A few use a Recurrent Neural Network (RNN) in hardware/software:
S. Li, S. Chen, Y. Lou, B. Lu, and Y. Liang, "A Recurrent Neural Network for Inter-Localization of Mobile Phones," in Proc. IEEE-WCCI, Jun. 10-15, 2012.
• Recurrent Neural Network (RNN) hardware/software embedded accelerators, compared in Mop/s/W.
[Bar chart: current RNN solutions, up to 128 neurons, spanning roughly 0 to 300 Mop/s/W.]
Spatial Localization in 2D - my RNN Solution
• Formulate 2D AOA localization as a constrained primal-dual linear program
• Solve it with an RNN – from 2 to 128 neurons
• Primal: $\min\ C^{T}\theta$ subject to $G\theta = H,\ \theta \ge 0$
• Dual: $\max\ H^{T}\varphi$ subject to $G^{T}\varphi \le C$
The RNN model for solving the above system is:

$$\frac{d}{dt}\begin{bmatrix}\theta \\ \varphi\end{bmatrix} = -\begin{bmatrix}\theta - \left(\theta + G^{T}\varphi - C\right)^{+} \\ G\left(\theta + G^{T}\varphi - C\right)^{+} - H\end{bmatrix}$$

• Here, for a variable w, $(w)^{+} = \max(w, 0)$, applied element-wise.
Localization in 2D - Discrete time RNN
• We control the convergence rate via dt, which is implemented as a fixed-point fraction in Q15.17 format. All arithmetic operations in the datapath also use the Q15.17 format.

$$\begin{bmatrix}\theta(k+1) \\ \varphi(k+1)\end{bmatrix} = \begin{bmatrix}\theta(k) + dt \times r(k) \\ \varphi(k) + dt \times \left(H - G \times r(k)\right)\end{bmatrix},$$

where $r(k) = \max\left[\,\theta(k) + G^{T}\varphi(k) - C,\ 0\,\right]$.
• The min-cost function coefficients C in the above primal problem can be chosen at random, since the primary goal is to solve for θ.
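As a sanity check, the discrete-time recursion above can be sketched in floating-point Python (the hardware uses Q15.17 fixed point; the tiny G, H, C system below is illustrative only, not from the thesis):

```python
import numpy as np

def rnn_step(theta, phi, G, H, C, dt):
    # Hidden variable: r(k) = max(theta(k) + G^T phi(k) - C, 0), element-wise
    r = np.maximum(theta + G.T @ phi - C, 0.0)
    # Primal and dual updates from the slide's discrete-time recursion
    theta_next = theta + dt * r
    phi_next = phi + dt * (H - G @ r)
    return theta_next, phi_next

# Tiny illustrative system (2 neurons): G = I, so primal and dual match in size
G = np.eye(2)
H = np.array([2.0, 3.0])
C = np.array([1.0, 1.0])
theta = np.zeros(2)
phi = np.zeros(2)
for _ in range(2):
    theta, phi = rnn_step(theta, phi, G, H, C, dt=0.5)
print(theta, phi)  # after 2 steps: theta = [0, 0.25], phi = [2, 2.75]
```

Each call to `rnn_step` corresponds to one pass through the datapath; the fixed-point hardware performs the same arithmetic at Q15.17 precision.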
Localization in 2D - Digital RNN Architecture
[Block diagram of the RNN datapath: registers hold θ(k) and φ(k). A matrix-product evaluation unit forms G^T×φ(k+1); an adder computes G^T×φ(k+1) + θ(k+1) − C, and a comparator against 0 produces the hidden variable r(k+1) (the RNN block evaluation). A second matrix-product evaluation unit forms H − G×r(k). Multipliers scale both update terms by dt before the accumulating adders, yielding the primal solution θ and the dual solution φ.]
Characterization of FPGA-based Localization
Platform: ProASIC3E A3PE3000
Combinational cells: 24,946
Sequential cells (DFFs): 1,453
Max clock frequency: 31.45 MHz
Power dissipation for core at 1.5 V: 180 mW
Power dissipation for core (1.5 V) and I/O pads (3.3 V): 301.219 mW
Digital RNN Architecture
• Characterization of ASIC-based localization – NCSU PDK45, 1 V VDD
• HSpice simulations with a netlist from Cadence Virtuoso were used to compute average power dissipation with a 1 V supply, measuring the total current drawn from the supply over a 3.2 μs period.
Design technology: NCSU PDK 45 nm
Combinational cells: 51,890
Sequential cells (DFFs): 962
Max clock frequency: 516 MHz
Total power dissipation at VDD = 1 V: 6.15 mW
Simulated Performance – Mop/sec/W
[Bar chart, log scale 1 to 1000 Mop/s/W: performance per unit power of different embedded RNN realizations (the higher the better). Platforms: RNN PDK45 (this work), RNN FPGA (this work), LSTM HW 2x Zynq FPGA, LSTM HW Zynq FPGA, Zynq ZC7020 CPU, Exynos5422 4x Cortex-A7, Exynos5422 4x Cortex-A15, Tegra TK1 GPU, Tegra TK1 CPU.]
FPGA – 128 neurons (accounting for AOA measurements from 128 anchors) results in 13 Mop/s/W with a 31.25 MHz processing clock.
PDK45 – 677.165 Mop/s/W with a 516 MHz processing clock.
A. Chang, B. Martini, E. Culurciello, "Recurrent neural networks hardware implementation on FPGA," IJAREEIE, vol. 5, no. 1, pp. 401-409, Jan. 2016.
Simulated RNN state convergence
[Plot: simulated RNN state convergence over 4000 time steps, in multiples of dt = 0.01. Primal states q1 (blue) and q2 (red); dual states f1 (magenta) and f2 (black). The inset shows steps 400 to 1400.]
Simulated convergence: q1 (blue) and q2 (red) are the 2D (x, y) coordinates. Solid lines are from the MATLAB reference simulation; dashed lines are from the Q17.15 fixed-point Verilog simulation.
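The small gap between the solid and dashed curves comes from quantizing every value to the Q-format's fractional resolution. A minimal sketch of that quantization (assuming 17 fractional bits, per the Q15.17 format quoted earlier; the slides use Q15.17 and Q17.15 in different places):

```python
def to_q(x, frac_bits=17):
    """Quantize a float to the nearest Q-format value (round to nearest LSB)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

dt = to_q(0.01)          # dt as it would be stored in the datapath
err = abs(dt - 0.01)     # quantization error is bounded by half an LSB
print(dt, err)
```

With 17 fractional bits an LSB is 2^-17 ≈ 7.6e-6, so the per-operation error stays far below the convergence tolerance seen in the plot.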
Estimates with Noisy Measurements - 1
Error in the X & Y estimates against increasing measurement noise. The noise in the measurement angles β1 & β2 is normally distributed. The error in the X & Y estimates is defined as the sum of absolute differences between the true and estimated coordinates. Each point is an average over 100 runs.
Estimates with Noisy Measurements - 2
Histogram of estimated X & Y coordinates (normalized to 1).
Localization in 2D – Digital RNN Result Summary
• The proposed 2D AOA localization architecture uses a digital fixed-point RNN with a scalable number of neurons (2 to 128) in the hidden layer. The largest overdetermined system has 128 neurons for AOA measurements from 128 anchors.
• The RNN solves a primal-dual LP for the target's (x, y) coordinates.
Future Work in Localization
Localization with Digital RNN
• Reduce power consumption of the HSpice netlist – apply power-gating PMOS/NMOS transistor techniques
• Reduce power consumption of the Verilog gate-level netlist by aggressive clock gating, arithmetic operand gating, and imprecise add/mult bit-widths with acceptable error bounds
• Apply the RNN to 3D localization – a 3x3 primal/dual LP with 3 neurons for the basic 3x3 system; scale to Nx3 for overdetermined systems, where N = 3, 6, 9, …
• Compare the digital RNN solution with an analog OTA-based solution – see backup slides
Ultralow power speaker recognition
Publications:
1. Paper to be submitted to IEEE ICCD 2018
Text Independent Speaker Recognition
• Gaussian mixture model (GMM)-based speaker probability extraction
• Feature extraction as Mel-frequency cepstral coefficients (MFCCs)
IoT Device - Text-Independent Speaker Recognition
• The classification block above is a maximum-likelihood GMM-based classifier, with all computations in the log domain; p(·| λ_i) is speaker i's GMM scored at each MFCC vector x_1 … x_T.
Ref: D. Reynolds, 1995 Ph.D. thesis.
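A minimal sketch of log-domain GMM scoring with diagonal covariances, using the same max-subtract log-sum-exp trick the hardware pipeline implements (the weights, means, and inverse variances below are illustrative, not trained TIMIT models):

```python
import math

def log_gmm_score(x, weights, means, inv_vars):
    """log p(x | lambda) for a diagonal-covariance GMM, computed in the log domain."""
    d = len(x)
    comp = []
    for w, mu, iv in zip(weights, means, inv_vars):
        # log of the d-variate Gaussian with diagonal covariance (iv = 1/sigma^2)
        quad = sum((xi - mi) ** 2 * vi for xi, mi, vi in zip(x, mu, iv))
        log_det_inv = sum(math.log(vi) for vi in iv)
        comp.append(math.log(w)
                    - 0.5 * (d * math.log(2 * math.pi) - log_det_inv + quad))
    m = max(comp)                     # subtract the max, as the sorter stage does
    return m + math.log(sum(math.exp(c - m) for c in comp))

# Illustrative 2-component, 2-D mixture
w = [0.4, 0.6]
mu = [[0.0, 0.0], [1.0, 1.0]]
iv = [[1.0, 1.0], [4.0, 4.0]]
score = log_gmm_score([0.5, 0.5], w, mu, iv)
```

Scoring a test utterance then sums such per-frame log-likelihoods per speaker and picks the argmax.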
IoT Device - Text-Independent Speaker Recognition
• Example digital system for GMM scoring, up to the log domain (up to the Log-Sum of Exponents, LSE); see backup for the GMM matrix equation. Simulated in floating-point MATLAB: 16 clocks to score one 12-dimensional z centroid for mixture GMM_i.
[Datapath for GMM component i, scoring an incoming 1x12 16-bit two's-complement Q(16,14) MFCC vector (12 jointly Gaussian random variables from the audio stream): Stage 1 loads the GMM_i parameters (mean Mu_[11:0][15:0] and inverse variance Inv_sigma[11:0][15:0]) and computes the element-wise differences Sub_0 … Sub_11 into Sub_vec; Stage 2 squares them (Sqr_0 … Sqr_11) into Sqr_sub; Stage 3 runs 12 multiply-add accumulator iterations through Accum_Reg1[15:0], whose output accum[15:0] feeds the Log_Sum (LSE) domain.]
GMM scoring – in the log domain (Log_Sum of Exponents, LSE, domain); simulated in floating-point MATLAB.
[LSE unit: the M accumulator (accum) outputs for x1 … xM (accum1 … accum20) are added element-wise to the pre-computed log(k1) … log(kM), giving x1new … xMnew. A systolic bubble sorter finds the maximum element in M cycles while an M-deep FIFO holds x1new … xMnew; the saved maximum xmax is then subtracted element-wise (xi − xmax, i = 1 … M) into Register_sub and sent to the exp unit. 20 clocks for M = 20 mixtures.]
GMM scoring – in the log domain (Log_Sum of Exponents, LSE, domain); simulated in floating-point MATLAB.
Total_1z = 16 + 20 + 5 = 41 clocks to score one z centroid with GMM_i
Total_40z = 4 × 41 = 164 clocks to score all 40 z centroids with all GMMs
GMM scoring – number of operations – Power analysis Estimate
based on published implementations
• NCSU PDK 45 nm, Vdd = 1.1 V, published implementations:
• One 16-bit carry-skip adder = 20 uW, ref 1, 50 MHz clock, 20 ns delay
• One 16x16 array multiplier = 55 uW, ref 2, 1.234 GHz clock, 0.824 ns delay
• One 3-way magnitude comparator = 40 uW, ref 3, 1.2 GHz clock, 0.833 ns delay
• SRAM, 4 Kb, dynamic read access = 350 uW (leakage 800 uW), ref 4, 250 MHz clock
[Bar chart, 0 to 1200 mW scale: power for a GMM with 20 mixtures, one MFCC frame (12 random-variable features) scored; 1 GMM / 1 speaker vs 38 GMMs / 38 speakers.]
Calculated worst-case power, with all operations in each 1.234 GHz cycle, using P = αCV²f for each block/operation, with α = 1 and f = 1.234 GHz for all:
1 GMM total = 52.48 mW
38 GMMs total = 1994 mW (HIGH!)
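The worst-case totals follow directly from the per-block unit powers (refs 1–4) and the operation counts; a sketch reproducing the arithmetic:

```python
# Per-operation power (uW) from the cited 45 nm implementations
unit_uw = {"add": 20, "mult": 55, "comp": 40, "lookup": 350}

# Operation counts per scored MFCC frame for a 20-mixture GMM
ops_1 = {"add": 517, "mult": 480, "comp": 26, "lookup": 42}     # 1 speaker
ops_38 = {k: 38 * v for k, v in ops_1.items()}                  # 38 speakers

mw_1 = sum(unit_uw[k] * ops_1[k] for k in ops_1) / 1000.0
mw_38 = sum(unit_uw[k] * ops_38[k] for k in ops_38) / 1000.0
print(mw_1, mw_38)  # 52.48 mW and 1994.24 mW (the slide rounds to 1994)
```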
GMM Scoring – Worst case power reduction
techniques - 1
• 1 – Clock frequency reduction: from (max) 1.234 GHz to 1.234 MHz (divide by 1000); the total drops from 1994 mW to 1.994 mW. In a 10 ms frame, the GMM scoring pipeline above has 10 stages, i.e. 1 ms per stage; at 1.234 MHz (810 ns period) that gives 1234 clock cycles per stage, enough clocks for all operations in a stage, still using 16 bits for the arithmetic.
Calculated worst-case power for all 38 GMMs is now 1.994 mW.
GMM Scoring – Worst case power reduction
techniques - 2
• 2 – Imprecise arithmetic – fewer quantization bits, starting from the 16-bit design above:
• Using 6 bits (vs 16) for all arithmetic and for MFCC quantization reduces adder power from 20 uW to 20 uW/(16/6) = 7.5 uW; multiplier power to 7.73 uW; comparator power to 5.62 uW. New total worst-case power:
1 speaker (GMM) = 22.4 mW = 517×(7.5 uW) + 480×(7.73 uW) + 26×(5.62 uW) + 42×(350 uW)
38 speakers (GMMs) = 852.5 mW = 19646×(7.5 uW) + 18240×(7.73 uW) + 988×(5.62 uW) + 1596×(350 uW)
• Reducing the clock rate from 1.234 GHz to 1.234 MHz reduces this to 0.8525 mW for all 38 GMMs, and to 0.0224 mW for 1 GMM.
Calculated worst-case power for all 38 GMMs is now 0.8525 mW.
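The 6-bit scaling can be checked the same way (the 350 uW SRAM lookup power is unchanged; the reduced unit powers are those quoted above):

```python
unit_uw_6bit = {"add": 7.5, "mult": 7.73, "comp": 5.62, "lookup": 350}
ops_1 = {"add": 517, "mult": 480, "comp": 26, "lookup": 42}
ops_38 = {k: 38 * v for k, v in ops_1.items()}

mw_1 = sum(unit_uw_6bit[k] * ops_1[k] for k in ops_1) / 1000.0    # ~22.4 mW
mw_38 = sum(unit_uw_6bit[k] * ops_38[k] for k in ops_38) / 1000.0  # ~852.5 mW
# A further 1000x clock reduction (1.234 GHz -> 1.234 MHz) scales these to
# ~0.8525 mW (38 GMMs) and ~0.0224 mW (1 GMM)
print(mw_38 / 1000, mw_1 / 1000)
```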
GMM Scoring – Worst case power reduction
techniques - 3
• 3 – Frame decimation (downsampling) – the majority of today's GMM-based systems use fixed-rate frame skipping (usually rate = 1, i.e. skip every other frame); power is saved since fewer frames are scored with all the GMMs.
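Fixed-rate frame skipping is simple to model; a sketch where FS_Rate = r keeps one frame out of every r + 1 (so rate = 1 skips every other frame, matching the convention above):

```python
def decimate(frames, fs_rate):
    """Keep one frame out of every (fs_rate + 1); fs_rate = 0 keeps all frames."""
    return frames[:: fs_rate + 1]

frames = list(range(500))      # stand-in for the 500 MFCC test frames
kept = decimate(frames, 1)     # rate = 1: every other frame is scored
print(len(kept))               # 250 frames -> roughly half the GMM scoring power
```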
IoT Device - Text-Independent Speaker Recognition : Frame
decimation
• Low-power focus, FS_mode = 0 (simulator mode, no frames skipped):
A) Reduce exact GMM evaluations (scoring of GMMs) to an absolute minimum.
Simulation result from the floating-point MATLAB simulation: 500 test frames, post min-energy filtering; no frame skipping is done, for 100% success but at 100% computation (every frame is scored with all GMMs, maximum power dissipation).
[Plot; note that the X axis has FS_Rate = 0 at all times (not to scale).]
IoT Device - Text-Independent Speaker Recognition : Frame
decimation
• Low-power focus, FS_mode = 1 (simulator mode, skip 1, 2, 4, … 128 frames):
• B) Reduce exact GMM evaluations (scoring of GMMs) to an absolute minimum.
Simulation result from the floating-point MATLAB simulation: 500 test frames as before; frame skipping is done, for less than 100% success and less than 100% computation (not every frame is scored with all GMMs); power is saved with fewer computations, but the recognition success rate is lower (already below 90% when skipping every 16th frame).
IoT Device - Text-Independent Speaker Recognition : Frame
decimation
• The success rate increases as fewer frames are skipped (128, 64, … 4, 2, 1), FS_Mode = 1.
Challenge: develop an algorithm and architecture to generate the red performance curve.
Text-Independent Speaker Recognition – Clustering test frames: Challenge met
• Low-power focus, FS_mode = 1 vs FS_mode = 0 with k-means clusters: but k = ?
C) Reduce exact GMM evaluations (scoring of GMMs) to an absolute minimum.
– My idea: find clusters in the 500 test frames using batch k-means, starting with 10 clusters and incrementing by 10. Use the centroids of all clusters to score all the GMMs; this "decimates" the 500 test frames to N frames (N centroids), N << 500, for each N-cluster scenario.
– Simulation result from the floating-point MATLAB simulation: for N = k = 40 clusters the success rate is already 97%, with 8% computation (% of GMMs scored). Classical FS_mode = 1 achieves 94% success with 21% computation.
Text-Independent Speaker Recognition – number of
computations – Clustering k-Means
• On-line k-means clustering – number of computations to find 40 clusters. Uses the LMS-like cluster-center update below; at each timestep t, each frame x1 … xt contributes equally to determining the updated centers z1 … zk.
• Algorithm (Lloyd's).
• Clustering (k-means, k = 40) method, using on-line k-means with k = 40, 1 iteration: 40 distance computations (480 adds, 480 mults); sort 40 values (40·log(40) ≈ 64 3-way comparisons); centroid update: 1 counter add, 12 sub/adds, 12 divides, 12 adds.
• Total for 10 iterations: 5,050 adds; 4,800 mults; 640 3-way comparisons; 120 divides. The MATLAB simulation above for k = 40 clusters converges in 10 iterations.
Text-Independent Speaker Recognition – number of computations
– GMM scoring with k centroids
• Compare computations for FS_mode = 1 vs FS_mode = 0 with clustering and GMM scoring with k centroids:
FS_mode = 1, 250 frames from 500, 1 GMM scored: 129,250 adds; 120,000 mults; 6,500 3-way comparisons; 10,500 lookups; 0 divisions
On-line k = 40 clusters from 500 frames, 1 GMM scored: 25,730 adds; 24,000 mults; 1,680 3-way comparisons; 1,680 lookups; 120 divisions
Text-Independent Speaker Recognition – number of
computations – GMM scoring with k centroids
[Bar chart, 0 to 140,000 scale: number of operations (adds, mults, 3-way comparisons, lookups, divisions) for FS_mode = 1 vs FS_mode = 0 with k = 40 clusters.]
Text-Independent Speaker Recognition – GMM scoring with k
centroids – Worst-case power analysis
• Using 6 bits (vs 16) for all arithmetic and for MFCC quantization; clock reduced from 1.234 GHz to 1.234 MHz.
• From slide 30, scoring all 38 GMMs with 1 frame takes 0.8525 mW; scale the power for rate = 1, 2, 4, 8, 16 with FS_mode = 1.
• Then scale the power for 10, 20, 30, 40, 50 centroids (frames) with FS_mode = 0.
• My worst-case estimate is 34 mW for FS_mode = 0 with k = 40 centroids.
• Competitive with the 54 mW design by G. He, "A 40-nm 54-mW 3x-Real-Time VLSI Processor for 60-kWord Continuous Speech Recognition."
• State-of-the-art is 6 mW: M. Price 2017, "A 6-mW 5000-Word Real-Time Speech Recognizer Using WFST Models."
• Not an apples-to-apples comparison, since in speech recognition the decoder's active-list feedback selects 1 GMM, and the GMMs model senones, not speakers; similar in that GMM scoring constitutes the bulk of all computations.
Text-Independent Speaker Recognition – hardware for on-line k-
means
[Datapath, sequenced by an FSM: a counter block holds n1 … nk; a storage block holds the k cluster centers z1 … zk; for each new test data vector xt arriving at time t, a block finds the closest zi to xt, and an update block revises zi using ni.]
Text-Independent Speaker Recognition – hardware for on-line k-
means – detail on Euclidean distance (closest)
[Pipeline stages 5 and 6: the reference bus is driven with zi (i = 1 … 40) and the test bus with xt; the Euclidean distance (zi − xt) goes to the sorter unit. 6 clocks to compute one Euclidean distance between the 40 12-dimensional z centroids and the incoming x vector.]
Text-Independent Speaker Recognition – hardware for
linear time Sorting of K words
[Systolic sorter: sorting_cell_0 … sorting_cell_39, each holding state, cell_data, prev_data_is_pushed, and data_is_pushed signals; a 32-bit unsorted_data bus feeds the cells, and clk / shift_up control drives the 32-bit sorted_data output.]
• For on-line k-means with K = 40, there are 40 systolic sorting cells; the Euclidean-distance block drives 1 to 40 words on the unsorted_data bus.
• 40 clocks to sort 40 distance values. Only the winning (smallest) z is used in the next stage (the LMS update stage).
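The systolic sorter can be modeled cell-by-cell: each clock, one new word enters, and each cell keeps the smaller of its stored word and the incoming one, passing the larger onward; after N clocks the N words sit in ascending order with the minimum at cell 0. A behavioral Python sketch (cell count and distance values are illustrative):

```python
import math

def make_cells(n):
    # Empty cells hold +inf, the identity for a "keep the smaller" cell
    return [math.inf] * n

def push(cells, word):
    """One clock: insert one unsorted word; each cell keeps the min, passes the max on."""
    for i in range(len(cells)):
        if word < cells[i]:
            cells[i], word = word, cells[i]

cells = make_cells(5)
for d in [9.0, 3.0, 7.0, 1.0]:   # e.g. four Euclidean distances
    push(cells, d)               # one word per clock
winner = cells[0]                # smallest distance -> winning centroid
print(cells[:4], winner)         # [1.0, 3.0, 7.0, 9.0] 1.0
```

In the hardware only `winner` feeds the LMS update stage; the rest of the sorted list is a side effect of the same N-clock pass.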
Text-Independent Speaker Recognition – linear
time Sorter Verilog simulation
Text-Independent Speaker Recognition – update
winning cluster center stage (LMS update)
• Z_i(new) = Z_i + (1/n_i) × (X − Z_i)
• 4 clocks to compute the LMS update
• Total clocks for 1 iteration = 6 + 40 + 4 = 50
• Total for 10 iterations = 500 clocks
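The three hardware stages (closest-center search, sort/min, LMS update) amount to the following per-vector step; a sketch with toy 1-D data (the real vectors are 12-D MFCC frames):

```python
def online_kmeans_step(centers, counts, x):
    """Assign x to the nearest center, then apply the LMS update
    z_i <- z_i + (1/n_i) * (x - z_i)."""
    # Closest-center search (squared Euclidean distance)
    i = min(range(len(centers)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(centers[j], x)))
    counts[i] += 1
    centers[i] = [z + (xv - z) / counts[i] for z, xv in zip(centers[i], x)]
    return i

centers = [[0.0], [10.0]]   # k = 2 toy centers (k = 40, 12-D in the real design)
counts = [1, 1]             # each center seeded from one frame
online_kmeans_step(centers, counts, [1.0])   # nearest to center 0
online_kmeans_step(centers, counts, [9.0])   # nearest to center 1
print(centers)              # [[0.5], [9.5]]
```

The 1/n_i weighting makes every frame seen so far contribute equally to its center, matching the LMS-like update on the slide.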
Text-Independent Speaker Recognition – Result summary
• For the TIMIT TEST/DR1 38-speaker set, I've shown that 40 clusters from on-line k-means can achieve a 97% recognition success rate.
• I have achieved a 12.5:1 (500 to 40) reduction in the number of frames used for GMM scoring while maintaining a 97% success rate; only 40 centroids are needed.
• 5:1 reduction in the number of adds and mults
• 3.9:1 reduction in the number of 3-way comparisons
• 6.25:1 reduction in the number of lookups
• Estimated 6:1 reduction in worst-case power (34 mW vs 213 mW)
• The above estimates are for 6-bit quantization of all parameters and MFCC data, using a 1.234 MHz processing clock and published PDK 45 nm implementations of the arithmetic blocks.
Future Work
• Complete the fixed-point Verilog implementation of the on-line 40-cluster k-means datapath
• Complete its integration with the GMM scoring datapath
• Simulate the end-to-end design and characterize performance: power, latency, success rate
• Evaluate additional low-power techniques:
• 1 – At the GMM layer, select 1 GMM to score instead of all GMMs (pruning)
• 2 – Deeper pipelines for the on-line clustering unit and the GMM scoring unit: preferred over adding parallel units due to leakage-current issues at 45 nm and below
• 3 – Power modes: sleep, deep-sleep, doze (last GMM used stays on, the others off)
• Scale the design to all 168 speakers in the TIMIT TEST/DRx data set.
• Publication: paper to be submitted to IEEE ICCD 2018
Backup Slides
IoT Device – Speaker Recognition
• If speaker-recognition computations can be offloaded from the cloud processor to the edge IoT node, the cloud processor does not have to be as fast.
• Smartphone apps (Alexa, Siri, Google Assistant) generally need 1 Watt of power to process a single speech-recognition query; 100 Watts for 100 queries.
• The dominant computation in max-likelihood GMM speaker recognition is Gaussian probability estimation (scoring) – from 6 mW (MIT) to 1.8 W (CMU) with GMM accelerators and MFCC frames.
• I focus on reducing this power by reducing the total number of GMM scoring operations via a frame-downsampling accelerator, processing-clock frequency reduction, and imprecise arithmetic (fewer quantization bits).
• Initial results in a paper to be submitted to IEEE MWCS 2018.
IoT Device – Localization and Self-Localization
• Goal: off-load cloud-server computations to the IoT device – less network congestion, faster response times for IoT-device localization.
• The IoT device has custom low-power circuits for spatial self-localization.
• 2D or 3D spatial coordinates of the IoT device: on-board sensors supply data (acoustic or optical AOA to anchors, the anchors' locations) to the device's processor and accelerators; it then computes its coordinates (in its own coordinate system) and sends them to the cloud server.
• The cloud server then does coordinate translation and maps the IoT device into the global absolute coordinate map; alternatively, the IoT device does the coordinate translation on-board.
• My research area: on-board accelerators based on massively parallel neural networks (Recurrent Neural Networks, RNNs) for coordinate computation. Initial results published in IEEE ICCD 2017.
• Additional result: non-RNN image-based localization and coordinate mapping (registration), published in IEEE ISM 2016.
For AOA measurements from M anchors, this leads to a system of linear equations:

$$\begin{bmatrix} \sin\alpha_1 & -\cos\alpha_1 \\ \vdots & \vdots \\ \sin\alpha_M & -\cos\alpha_M \end{bmatrix} \begin{bmatrix} x_s \\ y_s \end{bmatrix} = \begin{bmatrix} \sin\alpha_1\,x_1 - \cos\alpha_1\,y_1 \\ \vdots \\ \sin\alpha_M\,x_M - \cos\alpha_M\,y_M \end{bmatrix}. \quad (2)$$

In the noiseless case, the above set of linear equations is consistent. However, due to noise in the AOA measurements, the system should be solved in a least-squares sense. Therefore (2) can be written as

$$\begin{bmatrix} f_1 \\ \vdots \\ f_M \end{bmatrix} = \begin{bmatrix} \sin\alpha_1 & -\cos\alpha_1 & -\sin\alpha_1\,x_1 + \cos\alpha_1\,y_1 \\ \vdots & \vdots & \vdots \\ \sin\alpha_M & -\cos\alpha_M & -\sin\alpha_M\,x_M + \cos\alpha_M\,y_M \end{bmatrix} \begin{bmatrix} x_s \\ y_s \\ 1 \end{bmatrix}. \quad (3)$$

Here, the location of the sensor is estimated as

$$[\,x_s \;\; y_s\,] = \arg\min \sum_{i=1}^{M} f_i^2. \quad (4)$$

If we write $H = \sum_{i=1}^{M} f_i^2$, the total error is minimized when $dH/dt = 0$. However, since $H \ge 0$, $dH/dt \le 0$ is also a sufficient condition to minimize H [ ]. $dH/dt$ is expanded as

$$\frac{dH}{dt} = [\,x_s \;\; y_s \;\; 1\,] \begin{bmatrix} \sin\alpha_1 & -\cos\alpha_1 & -\sin\alpha_1\,x_1 + \cos\alpha_1\,y_1 \\ \vdots & \vdots & \vdots \\ \sin\alpha_M & -\cos\alpha_M & -\sin\alpha_M\,x_M + \cos\alpha_M\,y_M \end{bmatrix}^{T} \begin{bmatrix} \sin\alpha_1 & -\cos\alpha_1 \\ \vdots & \vdots \\ \sin\alpha_M & -\cos\alpha_M \end{bmatrix} \begin{bmatrix} dx_s/dt \\ dy_s/dt \end{bmatrix} = 0.$$
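Eq. (2)–(4) can be checked with a direct least-squares solve; a sketch with two illustrative anchors and a noiseless target (positions chosen for easy hand-checking, not from the thesis):

```python
import math
import numpy as np

def aoa_lstsq(anchors, alphas):
    """Least-squares 2D position from AOA bearings, per Eq. (2)."""
    G = np.array([[math.sin(a), -math.cos(a)] for a in alphas])
    h = np.array([math.sin(a) * x - math.cos(a) * y
                  for a, (x, y) in zip(alphas, anchors)])
    pos, *_ = np.linalg.lstsq(G, h, rcond=None)
    return pos

# Anchors at (0,0) and (10,0); a target at (5,5) gives bearings of 45 and 135 degrees
anchors = [(0.0, 0.0), (10.0, 0.0)]
alphas = [math.atan2(5 - y, 5 - x) for x, y in anchors]
print(aoa_lstsq(anchors, alphas))   # approximately [5. 5.]
```

With noisy bearings the same call returns the least-squares estimate of (4).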
Localization in 2D Future Work – low power Analog OTA circuit 1 - backup
The localization problem can also be formulated as a system of linear differential equations, as shown below:

$$\begin{bmatrix} dx_s/dt \\ dy_s/dt \end{bmatrix} = - \begin{bmatrix} \sin\alpha_1 & -\cos\alpha_1 \\ \vdots & \vdots \\ \sin\alpha_M & -\cos\alpha_M \end{bmatrix}^{T} \begin{bmatrix} \sin\alpha_1 & -\cos\alpha_1 & -\sin\alpha_1\,x_1 + \cos\alpha_1\,y_1 \\ \vdots & \vdots & \vdots \\ \sin\alpha_M & -\cos\alpha_M & -\sin\alpha_M\,x_M + \cos\alpha_M\,y_M \end{bmatrix} \begin{bmatrix} x_s \\ y_s \\ 1 \end{bmatrix}. \quad (6)$$

• Eq. (6) can be rearranged as

$$\begin{bmatrix} dx_s/dt \\ dy_s/dt \end{bmatrix} = - \begin{bmatrix} \sum_{i=1}^{M}\sin^2\alpha_i & -\sum_{i=1}^{M}\sin\alpha_i\cos\alpha_i \\ -\sum_{i=1}^{M}\sin\alpha_i\cos\alpha_i & \sum_{i=1}^{M}\cos^2\alpha_i \end{bmatrix} \begin{bmatrix} x_s \\ y_s \end{bmatrix} - \begin{bmatrix} \sum_{i=1}^{M}\left(-\sin^2\alpha_i\,x_i + \sin\alpha_i\cos\alpha_i\,y_i\right) \\ \sum_{i=1}^{M}\left(\sin\alpha_i\cos\alpha_i\,x_i - \cos^2\alpha_i\,y_i\right) \end{bmatrix}. \quad (7)$$

• Eq. (7) is abbreviated as

$$\begin{bmatrix} dx_s/dt \\ dy_s/dt \end{bmatrix} = -A \begin{bmatrix} x_s \\ y_s \end{bmatrix} - B. \quad (8)$$
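The gradient flow (8) can be integrated numerically to verify that it settles at the least-squares solution, which is what the OTA circuit does in continuous time. A forward-Euler sketch with an illustrative two-anchor geometry (not from the thesis):

```python
import math
import numpy as np

def build_AB(anchors, alphas):
    """A and B of Eq. (8) from anchor positions and AOA bearings."""
    s = np.array([math.sin(a) for a in alphas])
    c = np.array([math.cos(a) for a in alphas])
    x = np.array([p[0] for p in anchors])
    y = np.array([p[1] for p in anchors])
    A = np.array([[np.sum(s * s), -np.sum(s * c)],
                  [-np.sum(s * c), np.sum(c * c)]])
    B = np.array([np.sum(-s * s * x + s * c * y),
                  np.sum(s * c * x - c * c * y)])
    return A, B

anchors = [(0.0, 0.0), (10.0, 0.0)]
alphas = [math.atan2(5 - y, 5 - x) for x, y in anchors]   # target at (5, 5)
A, B = build_AB(anchors, alphas)

p = np.zeros(2)                 # initial guess for (xs, ys)
for _ in range(200):            # forward Euler on dp/dt = -A p - B
    p = p + 0.1 * (-A @ p - B)
print(p)                        # converges near [5, 5]
```

Convergence requires A to be positive definite, which motivates the op-amp alternative discussed for the 3D case below.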
Localization in 2D Future Work – low power Analog
OTA circuit 2 - backup
• The following OTA circuit is proposed for solving equation (8) above; Andrea Gualco's OTA design and OTA-based localizer are compared with the RNN circuit.
Localization in 2D Future Work – low power Analog
OTA circuit 3 - backup
[Schematic: four OTAs referenced to the common-mode voltage VCM, with integration capacitors C on the xs and ys output nodes; component values: R1 = 1/A11, R2 = 1/A22, GM1 = GM2 = A12, I1 = −B11, I2 = −B21.]
Localization in 2D Future Work – low power Analog OTA
circuit 3a - backup
• A linear coupled-differential-equation circuit for 2D localization; the OTA Verilog-A model is completed and unit-tested in HSpice simulations. The plot is example simulation output: OTA output current I(Vcm_out) vs input voltage difference V, from an HSpice simulation with Vcm (common-mode voltage) = 0.5 V.
Localization in 3D Future Work – low power Analog Linear
System circuit 4 - backup
• In some 3D spatial localization cases the A matrix in the above OTA circuit may not be positive definite, hence no convergence can be achieved.
• I have a solution for this case using a linear voltage op-amp (balanced adder-subtractor) circuit.
[Schematic: three op-amp stages, one per coordinate (x, y, z). Each stage has a feedback resistor (Rfx, Rfy, Rfz), input resistors R1–R5, a DC source (b1/a11, b2/a22, b3/a33), and cross-coupled inputs from the other two coordinates.]
Localization in 3D Future Work – low power Analog
Linear System circuit 4a - backup
• The coefficients in these equations are derived from three measured AOA values (azimuth angles β1, β2 and elevation angle γ1) and two anchors' known positions (x1, y1, z1) and (x2, y2, z2).
• The active analog network for solving the above 3x3 system requires 3 op-amps, 3 DC voltage sources, and 18 resistors, as shown in the previous slide.
Localization in 3D Future Work – low power Analog Linear
System circuit 4b - backup
• The following 45 nm op-amp and biasing network was used, based on R. J. Baker (Reference: Baker, "CMOS Circuit Design, Layout, and Simulation," 3rd edition, sect. 24.1, Fig. 24.2).
Localization in 3D Future Work – low power Analog Linear System
circuit 5 - backup
• X coordinate = V(out) convergence = 94.532 mV * 50 = 4.73 approx. 5 (true)
• Y coordinate = V(out2) = 309.014 mV * 50 = 15.45 approx. 15 (true )
Localization in 3D Future Work – low power Analog Linear System
circuit 6 - backup
• Z coordinate = V(out3) = 274.6874 mV * 50 = 13.7 approx. 14 (true)
RNN solver – Quadratic Program
• Solving a quadratic program with the QP block, via the "Select QP or LP" mux.
[Datapath: a matrix-vector multiplier applies [D11 D12; D21 D22] to Y1(n), Y2(n); the result is combined with X1(n), X2(n) and −C1, −C2, then passed through a max(·, 0) block to give R1(n), R2(n). A vector-scalar multiplier scales by dt to produce dX1(n), dX2(n), which are accumulated with X1(n−1), X2(n−1) to give X1(n), X2(n). The "Select QP or LP" mux chooses between I and I + A.]
IoT Device - Text-Independent Speaker Recognition
• I'm focusing on speaker recognition (identification of 1 speaker from a closed set of M enrolled speakers), not verification of a speaker's claimed identity.
• GMM-based, generative stochastic models, using the open-source TIMIT database for model construction and for algorithm and hardware verification. A GMM model is built with EM for each enrolled speaker, using the speaker's training set of MFCC feature vectors (frames); this is an offline process. A typical speaker training utterance (10 ms MFCC frames) can yield 2000 12-element MFCC vectors for GMM model building during offline training.
• During online recognition, after voice activity detection and minimum acoustic-energy filtering, about 500 12-element MFCC frames are generated by the unknown (test) speaker.
• A typical maximum-likelihood, GMM-based speaker recognition system: online recognition uses the bottom path.
IoT Device - Text-Independent Speaker Recognition
• GMM model of 1 speaker: a mixture of multivariate Gaussian densities.
• The Gaussian mixture probability density function of model (speaker) λ consists of a sum of K weighted component densities. K is the number of Gaussian components, P_k is the prior probability (mixture weight) of the k-th Gaussian component, and each component is the d-variate Gaussian density function with mean vector μ_k and covariance matrix Σ_k. The mixture weights P_k ≥ 0 are constrained to sum to 1.
GMM scoring – number of operations – Worst case all done in
each clock cycle ( activity factor = 1 for all )
MFCC frames = 1; 20 GMM components (mixtures) per speaker.
1 speaker: 517 adds; 480 mults; 26 3-way comparisons; 42 lookups
38 speakers: 19,646 adds; 18,240 mults; 988 3-way comparisons; 1,596 lookups
1 speaker (GMM) = 52.48 mW = 517×(20 uW) + 480×(55 uW) + 26×(40 uW) + 42×(350 uW)
38 speakers (GMMs) = 1994 mW = 19646×(20 uW) + 18240×(55 uW) + 988×(40 uW) + 1596×(350 uW)
IoT Device - Text-Independent Speaker Recognition
• For the above FS_Mode = 0, FS_Rate = 0 simulation: evolution of probAll, p(·| λ_s), the posterior probabilities of all speakers; the winning speaker has the smallest negative log(prob), approx. −7990; the X axis is the number of test frames.
IoT Device - Text-Independent Speaker Recognition
• For the above FS_Mode = 1, FS_Rate = 1, 2, 4, 8, 16 simulation: evolution of probAll, p(·| λ_s), the posterior probabilities of all speakers; the winning speaker has the smallest negative log(prob), approx. −820; the X axis is the number of test frames. Jumps occur when FS_Rate changes, since probAll is recomputed only for a new FS_Rate.
Text-Independent Speaker Recognition – Clustering test frames
• The above FS_mode = 0, FS_Rate = 0, k-means 40-cluster simulation: evolution of probAll, all speakers' posterior probabilities over all 40 test frames (centroids); the winning speaker has the smallest negative log(prob), approx. −590; the X axis is the number of test frames.
Text-Independent Speaker Recognition – GMM scoring with k
centroids – power analysis
• ref 1 – S. Sharma et al., 2015, "Design of Low Power High Speed 16 bit Adder with McCMOS in 45 nm Technology"
• ref 2 – S. Mohan et al., 2017, "An improved implementation of hierarchy array multiplier using Cs1A adder and full swing GDI logic – 45 nm PDK"
• ref 3 – P. Sharma et al., 2016, "Design Analysis of 1-bit Comparator using 45nm Technology"
• ref 4 – J. Stine et al., 2017, "A high performance multi-port SRAM for low voltage shared memory systems"
Text-Independent Speaker Recognition – number of computations –
Clustering k-Means – table of operations
• The above on-line k-means (k = 40, on 500 test frames) clustering algorithm requires the following operations per iteration (40 squared Euclidean distances, sorting to find the min of 40 values, and the LMS update of the winning cluster):
1 iteration: 505 adds; 480 mults; 64 3-way comparisons; 12 divisions
10 iterations: 5,050 adds; 4,800 mults; 640 3-way comparisons; 120 divisions
Text-Independent Speaker Recognition – number of computations – GMM
scoring with k centroids ; table of ops; FS_Mode=0
10 iterations to converge to 40 frames (centroids): 5,050 adds; 4,800 mults; 640 3-way comparisons; 120 divisions; 0 lookups
Score GMMs with 40 frames (centroids): 20,680 adds; 19,200 mults; 1,040 3-way comparisons; 0 divisions; 1,680 lookups
Total, k-means and GMM: 25,730 adds; 24,000 mults; 1,680 3-way comparisons; 120 divisions; 1,680 lookups
Machine printing techniques and plangi dyeing
YOW2022-BNE-MinimalViableArchitecture.pdf
Quality Control Management for RMG, Level- 4, Certificate
12. Community Pharmacy and How to organize it
DOC-20250430-WA0014._20250714_235747_0000.pptx
EGWHermeneuticsffgggggggggggggggggggggggggggggggg.ppt
Ad

Pushing Intelligence to Edge Nodes : Low Power circuits for Self Localization and Speaker Recognition

  • 1. Pushing Intelligence to Edge Nodes: Low Power circuits for Self-Localization and Speaker Recognition. Nick Iliev. Presented to: prof. Trivedi, prof. Paprotny, prof. Rao, prof. Metlushko, prof. Zheng 1
  • 2. Intelligence at the edge nodes: Applications Internet-of-acoustic-things Simultaneous Localization and Mapping (SLAM) Autonomous vehicles Wearables 2
  • 3. Focus of this Research • Develop ultralow power computing platforms for – Speaker recognition hardware accelerator – Localization hardware accelerator • Low power neural network implementations • Low power GMM-based speaker recognition • Runtime adaptation : depending on battery state and performance, select processing clock frequency and/or number of quantization bits to use. 3
  • 4. Ultralow power spatial localization Publications: • 1. on-board Accelerators based on massively parallel neural networks ( Recurrent Neural Networks , RNNs ) for coordinate computation and localization. Initial results published in IEEE ICCD 2017 • 2. non-RNN image-based localization and coordinate mapping (registration) : published in IEEE ISM 2016 • 3. Review and Comparison of Spatial Localization Methods for Low-Power Wireless Sensor Networks : IEEE Sensors Journal 2015 4
  • 5. Spatial Localization: Centralized vs Distributed. (Figure: cloud, user, and server connected to a field of anchor, sink, and unknown-location nodes. The nearest anchor nodes (shaded) broadcast their own locations and route data to sink nodes; each unknown-location node (IoT node) receives the anchors' locations and calculates its own location from measurements.) In Centralized algorithms, the server receives all measurements and calculates locations for all unknown nodes. In Distributed algorithms, the server only stores locations for all nodes; each unknown node computes its own location and broadcasts it to the network. 5
  • 6. Distributed computational load on anchors 6 • Increases with the number of unknown nodes when there is no RNN capability at the unknown nodes (figure legend: arrows mark computational load increase)
  • 7. Distributed computational load on anchors 7 • Each unknown node computes its own location with an RNN accelerator - load decrease at the anchors. The RNN accelerator offloads the CPU and reduces power and latency. No off-line training. (Figure: unknown nodes equipped with RNN blocks; legend marks computational load increase and decrease.)
  • 8. Spatial localization in 2D - AOA Geometry 8 • Two or more anchors illuminate each unknown node • Centralized - measure Φ1, Φ2 and transmit them to the server; receive own (x,y) from the server • Decentralized (self-localization) - measure Φ1, Φ2, compute own (x,y), and transmit (x,y) to the server; this saves communications bandwidth and power. (Figure: anchor 'R1' at (X1,Y1), anchor 'R2' at (X2,Y2), sensor 'U' of unknown location, with angles of arrival (AOA) Φ1 and Φ2.)
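The decentralized fix described above can be sketched in a few lines. This is an illustrative Python/NumPy sketch (function and variable names are mine, not from the design): each anchor's bearing defines a line through the unknown node, and two or more such lines are intersected in a least-squares sense.

```python
import numpy as np

def aoa_2d_fix(anchors, angles):
    """Least-squares 2D position from AOA bearings (radians).

    Each anchor i at (xi, yi) measures the bearing angle a_i of the
    unknown node, giving the line constraint
        sin(a)*x - cos(a)*y = sin(a)*xi - cos(a)*yi.
    With M >= 2 anchors this is solved in a least-squares sense.
    """
    anchors = np.asarray(anchors, dtype=float)
    a = np.asarray(angles, dtype=float)
    G = np.column_stack([np.sin(a), -np.cos(a)])       # M x 2 bearing matrix
    h = np.sin(a) * anchors[:, 0] - np.cos(a) * anchors[:, 1]
    xy, *_ = np.linalg.lstsq(G, h, rcond=None)
    return xy

# Example: target at (3, 4), anchors at the origin and at (10, 0).
target = np.array([3.0, 4.0])
anchors = [(0.0, 0.0), (10.0, 0.0)]
angles = [np.arctan2(target[1] - ay, target[0] - ax) for ax, ay in anchors]
est = aoa_2d_fix(anchors, angles)
```

With exact (noiseless) bearings the two lines intersect at the true position; with noisy angles the same call returns the least-squares fix.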
  • 9. Spatial Localization in 2D: Applications 9 (Figure: AOA sensor, distribution of fields for a sensor with 12 photodetectors.) ❑ Most use a CPU and matrix / linear-algebra hardware accelerators ❑ A few use a Recurrent Neural Network (RNN) in hardware/software: S. Li, S. Chen, Y. Lou, B. Lu and Y. Liang, "A Recurrent Neural Network for Inter-Localization of Mobile Phones", in Proc. IEEE-WCCI, Jun. 10-15, 2012.
  • 10. Current RNN Solutions - up to 128 Neurons • Recurrent Neural Network (RNN) hardware/software embedded accelerators, compared in Mop/s/W (bar chart; vertical axis 0 to 300 Mop/s/W)
  • 11. Spatial Localization in 2D - my RNN Solution • Formulate 2D AOA localization as a constrained primal-dual linear program • Solve it with an RNN - from 2 to 128 neurons

  Primal: $\min C^T \theta$ subject to $G\theta = H$, $\theta \ge 0$. Dual: $\max H^T \varphi$ subject to $G^T \varphi \le C$.

  The RNN model for solving the above system is

  $$\frac{d}{dt}\begin{bmatrix}\theta \\ \varphi\end{bmatrix} = -\begin{bmatrix}\theta - (\theta + G^T\varphi - C)^+ \\ G\,(\theta + G^T\varphi - C)^+ - H\end{bmatrix},$$

  where, for a variable $w$, $(w)^+ = \max(w, 0)$.
  • 12. Localization in 2D - Discrete time RNN • We control the convergence rate via dt, which is implemented as a fixed-point fraction in Q15.17 format. All arithmetic operations in the data path also use the Q15.17 format.

  $$\begin{bmatrix}\theta(k+1) \\ \varphi(k+1)\end{bmatrix} = \begin{bmatrix}\theta(k) + dt \times \big(r(k) - \theta(k)\big) \\ \varphi(k) + dt \times \big(H - G \times r(k)\big)\end{bmatrix}, \qquad r(k) = \max\!\big(\theta(k) + G^T\varphi(k) - C,\ 0\big).$$

  • The min-cost function coefficients C in the above primal problem can be chosen at random, since the primary goal is to solve for θ.
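A floating-point Python/NumPy sketch of this recurrence on a toy LP (min θ1 + θ2 subject to θ1 + θ2 = 1, θ ≥ 0), assuming the Euler form θ(k+1) = θ(k) + dt·(r(k) − θ(k)) that matches the continuous-time dynamics. The hardware uses Q15.17 fixed point instead of floats, and all names here are illustrative.

```python
import numpy as np

# Toy LP: min C^T theta  s.t.  G theta = H, theta >= 0.
G = np.array([[1.0, 1.0]])
H = np.array([1.0])
C = np.array([1.0, 1.0])

theta = np.zeros(2)   # primal state (one neuron per unknown)
phi = np.zeros(1)     # dual state
dt = 0.01             # Euler step, the role played by the Q15.17 'dt'

for _ in range(5000):
    r = np.maximum(theta + G.T @ phi - C, 0.0)  # hidden variable r(k)
    theta = theta + dt * (r - theta)            # primal update
    phi = phi + dt * (H - G @ r)                # dual update
```

At the fixed point θ = r and G·r = H, so the iterate settles on a feasible primal solution (here θ = [0.5, 0.5]) with dual φ = 1.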
  • 13. Localization in 2D - Digital RNN Architecture (datapath: primal and dual registers with adders for θ(k+1) and φ(k+1); matrix-product evaluators for G, G^T and -G; a comparator against 0 producing r(k+1) from G^T·φ(k+1) + θ(k+1) - C; and dt multipliers, with the hidden-variable evaluation forming the RNN block). Characterization of the FPGA-based localization platform, ProASIC3E A3PE3000: combinatorial cells 24946; sequential cells (DFFs) 1453; max clock frequency 31.45 MHz; power dissipation for the core at 1.5 V: 180 mW; power dissipation for the core (1.5 V) and IO pads (3.3 V): 301.219 mW.
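The Q15.17 arithmetic used throughout this datapath can be modeled in software to cross-check the Verilog. A minimal Python sketch, assuming the convention of a 32-bit two's-complement word with 17 fraction bits (the exact word split is my assumption, not taken from the design):

```python
FRAC = 17    # fraction bits in Q15.17
WIDTH = 32   # assumed total word width (sign + 14 integer + 17 fraction)

def to_q(x):
    """Quantize a float to Q15.17 with saturation."""
    v = int(round(x * (1 << FRAC)))
    lo, hi = -(1 << (WIDTH - 1)), (1 << (WIDTH - 1)) - 1
    return max(lo, min(hi, v))

def to_float(q):
    """Interpret a Q15.17 integer as a real value."""
    return q / (1 << FRAC)

def q_mul(a, b):
    """Fixed-point multiply: full integer product, then shift back FRAC bits."""
    return (a * b) >> FRAC

# dt = 0.01 is representable to within one LSB (2^-17)
dt_q = to_q(0.01)
```

One design consequence visible here: quantizing dt introduces an error bounded by 2^-17, which slightly perturbs the convergence rate but not the fixed point of the recurrence.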
  • 14. Digital RNN Architecture • Characterization of the ASIC-based localization - PDK45, 1 V VDD • HSpice simulations with a netlist from Cadence Virtuoso were used to compute average power dissipation with a 1 V supply • measuring the total current drawn from the supply over a 3.2 μs period. Design technology: NCSU PDK 45 nm; combinatorial cells 51890; sequential cells (DFFs) 962; max clock frequency 516 MHz; total power dissipation at VDD = 1 V: 6.15 mW.
  • 15. Simulated Performance - Mop/s/W. Performance per unit power of different embedded RNN realizations, the higher the better (log-scale bar chart comparing RNN PDK45 and RNN FPGA (this work) against LSTM HW on 2x Zynq FPGA, LSTM HW on Zynq FPGA, Zynq ZC7020 CPU, Exynos5422 4x Cortex-A7, Exynos5422 4x Cortex-A15, Tegra TK1 GPU, and Tegra TK1 CPU). FPGA - 128 neurons (accounting for AOA measurements from 128 anchors) results in 13 Mop/s/W with a 31.25 MHz processing clock. PDK45 - 677.165 Mop/s/W with a 516 MHz processing clock. A. Chang, B. Martini, E. Culurciello, "Recurrent neural networks hardware implementation on FPGA", IJAREEIE, vol. 5, no. 1, pp. 401-409, Jan. 2016.
  • 16. Simulated RNN state convergence. Primal states q1 (blue) and q2 (red) are the 2D (x,y) coordinates; dual states f1 (magenta) and f2 (black). Time steps in multiples of dt = 0.01; the inset shows steps 400 to 1400. Solid lines are from the MATLAB reference simulation; dashed lines are from the Q15.17 fixed-point Verilog simulation.
  • 17. Estimates with Noisy Measurements - 1. Error in the X & Y estimates against increasing measurement noise. Noise in the measurement angles β1 & β2 is Normally distributed. Error in the X & Y estimates is defined as the sum of absolute differences between the true and estimated coordinates. Each point is an average over 100 runs.
  • 18. Estimates with Noisy Measurements - 2 Histogram of estimated X & Y coordinates (normalized to 1).
  • 19. Localization in 2D – Digital RNN Result Summary • Proposed 2D AOA Localization architecture uses a digital fixed-point RNN, with a scalable number of neurons ( 2 to 128 ) in the hidden layer. The largest overdetermined system has 128 neurons for AOA measurements from 128 anchors. • The RNN solves a primal-dual LP program for the target’s x,y coordinates.
  • 20. Future Work in Localization
  • 21. Localization with Digital RNN • Reduce power consumption of the HSpice netlist - apply power-gating PMOS/NMOS transistor techniques • Reduce power consumption of the Verilog gate-level netlist by aggressive clock gating, arithmetic operand gating, and imprecise add/mult bit-widths with acceptable error bounds • Apply the RNN to 3D localization - a 3x3 primal/dual LP with 3 neurons for the basic 3x3 system; scale to Nx3 for overdetermined systems, where N = 3, 6, 9, ... etc. • Compare the digital RNN solution with the analog OTA-based solution - backup slides
  • 22. Ultralow power speaker recognition Publications: 1. Paper to be submitted at IEEE ICCD 2018 22
  • 23. Text Independent Speaker Recognition 23 • Gaussian mixture model (GMM)-based speaker probability extraction • Feature extraction as Mel frequency cepstral coefficients (MFCCs)
  • 24. IoT Device - Text-Independent Speaker Recognition • The Classification block above is a maximum-likelihood GMM-based classifier, with all computations in the log domain; p(· | λi) is a speaker's GMM scored at each MFCC vector x1 ... xT. Ref: D. Reynolds, 1995 Ph.D. thesis
  • 25. IoT Device - Text-Independent Speaker Recognition • Example digital system for GMM scoring, up to the log domain (up to the Log_Sum of Exponents, LSE) - see backup for the GMM matrix equation; simulated in floating-point Matlab: 16 clocks to score one 12-dimensional z centroid for mixture GMM_i • GMM component i - scoring (evaluation) for an incoming 1x12, 16-bit two's-complement Q(16,14) MFCC vector from the audio stream (an MFCC vector of 12 jointly Gaussian random variables). Datapath (three stages): the mean register Mu_[11:0][15:0] and inverse-sigma register Inv_sigma[11:0][15:0], loaded via Load_GMM_i_params, feed 12 subtract units (Sub_0 ... Sub_11) and square units (Sqr_0 ... Sqr_11), followed by a mult-add accumulator (Accum_Reg1[15:0]) iterated 12 times on the way to the Log_Sum (LSE) domain.
  • 26. GMM scoring - in the log domain (Log_Sum of Exponents, LSE, domain); simulated in floating-point Matlab. The log_LSE unit receives the M accumulator (accum) outputs for x1 ... xM (accum1 ... accum20), adds the pre-computed log(k1) ... log(kM) element-wise to form x1_new ... xM_new, sorts them with a systolic bubble sorter (sort in M cycles) to find the maximum element, saves the max in register x_max, subtracts element-wise (xi_new - x_max, i = 1 ... M) into Register_sub, and passes the differences to the exp unit. 20 clocks for M = 20 mixtures.
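The sort / subtract-max / exp stages above implement the standard numerically stable log-sum-exp. A small Python sketch of the same computation (names are illustrative):

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x_i))): subtract the max first,
    mirroring the hardware's sort-to-find-max and subtract stages."""
    m = max(xs)                            # what the systolic sorter finds
    s = sum(math.exp(x - m) for x in xs)   # all exponents are <= 0, no overflow
    return m + math.log(s)

# Mixture log-likelihood from M per-component log-domain accumulators,
# i.e. log(w_k) + log N(x | mu_k, Sigma_k); example values only:
acc = [-700.0, -702.5, -695.1]
ll = log_sum_exp(acc)
```

Note that a direct `log(sum(exp(acc)))` would underflow to log(0) for values near -700; subtracting the max first is exactly why the hardware needs the sorter before the exp unit.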
  • 27. GMM scoring - in the log domain (Log_Sum of Exponents, LSE, domain); simulated in floating-point Matlab. Total_1z = 16 + 20 + 5 = 41 clocks to score 1 z centroid with GMM_i. Total_40z = 40 * 41 = 1640 clocks to score all 40 z centroids with all GMMs.
  • 28. GMM scoring - number of operations - power analysis, estimated from published implementations • NCSU PDK 45 nm, Vdd = 1.1 V, published implementations • One 16-bit carry-skip add = 20 uW, ref 1, 50 MHz clk, 20 nsec delay • One 16x16 array mult = 55 uW, ref 2, 1.234 GHz clk, 0.824 nsec delay • One 3-way magnitude comparator = 40 uW, ref 3, 1.2 GHz clk, 0.833 nsec delay • SRAM, 4 Kb, read-access dynamic = 350 uW (leakage 800 uW), ref 4, 250 MHz clk. (Bar chart: power for a GMM with 20 mixtures, 1 MFCC frame of 12 random-variable features scored; 1 GMM / 1 speaker vs 38 GMMs / 38 speakers.) Calculated worst-case power, all ops in each 1.234 GHz cycle, using P = a C V^2 f with a = 1 and f = 1.234 GHz for every block/operation: 1 GMM total = 52.48 mW; 38 GMMs total = 1994 mW - HIGH!
  • 29. GMM Scoring - Worst-case power reduction techniques - 1 • 1 - Clock frequency reduction: from (max) 1.234 GHz to 1.234 MHz (divide by 1000): total drops from 1994 mW to 1.994 mW. In a 10 msec frame, the above GMM scoring pipeline has 10 stages, or 1 msec per stage; at 1.234 MHz (810 nsec period) we have 1234 clock cycles per stage - enough clocks for all operations in a stage, still using 16 bits for math operations. Now-calculated worst-case power for the 38-GMM total = 1.994 mW. 29
  • 30. GMM Scoring - Worst-case power reduction techniques - 2 • 2 - Imprecise arithmetic - fewer quantization bits, starting from the 16-bit figures above: • Using 6 bits (vs 16) for all arithmetic and for MFCC quantization reduces adder power roughly linearly with width, from 20 uW to 20 uW/(16/6) = 7.5 uW, and multiplier and comparator power roughly quadratically, to 55 uW/(16/6)^2 = 7.73 uW and 40 uW/(16/6)^2 = 5.62 uW; SRAM lookups are unchanged. 16-bit baseline: 1 speaker (GMM) = 52.48 mW = 517*(20 uW) + 480*(55 uW) + 26*(40 uW) + 42*(350 uW); 38 speakers (GMMs) = 1994 mW = 19646*(20 uW) + 18240*(55 uW) + 988*(40 uW) + 1596*(350 uW). New worst-case totals at 6 bits: 1 speaker (GMM) = 22.4 mW = 517*(7.5 uW) + 480*(7.73 uW) + 26*(5.62 uW) + 42*(350 uW); 38 speakers (GMMs) = 852.5 mW = 19646*(7.5 uW) + 18240*(7.73 uW) + 988*(5.62 uW) + 1596*(350 uW). • Reducing the clock rate from 1.234 GHz to 1.234 MHz reduces this to 0.8525 mW for all 38 GMMs, and to 0.0224 mW for 1 GMM. 30
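The 6-bit totals above can be reproduced with a few lines of arithmetic. A Python sketch, assuming (as the slide's numbers imply) that adder power scales linearly with bit width, multiplier and comparator power quadratically, and SRAM lookups not at all; op counts and 16-bit per-op powers are taken from the operation table for 38 GMMs:

```python
# Per-op power at 16 bits (uW) and op counts to score 1 frame with 38 GMMs
P16 = {"add": 20.0, "mul": 55.0, "cmp": 40.0, "lut": 350.0}
OPS_38GMM = {"add": 19646, "mul": 18240, "cmp": 988, "lut": 1596}

def scale_width(p16, op, bits=6, ref_bits=16):
    """Assumed width scaling: adds ~ width, mults/comparators ~ width^2,
    SRAM lookups unchanged."""
    if op == "add":
        return p16 * bits / ref_bits
    if op in ("mul", "cmp"):
        return p16 * (bits / ref_bits) ** 2
    return p16

total_uW = sum(OPS_38GMM[op] * scale_width(P16[op], op) for op in P16)
total_mW_6bit = total_uW / 1000.0                 # ~852.5 mW at 1.234 GHz
total_mW_slow_clk = total_mW_6bit / 1000.0        # ~0.8525 mW at 1.234 MHz
```

The clock-rate step simply divides dynamic power by 1000 (P ∝ f), matching the slide's 852.5 mW to 0.8525 mW reduction.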
  • 31. GMM Scoring - Worst-case power reduction techniques - 3 • 3 - Frame decimation (downsampling) - the majority of today's GMM-based systems use fixed-rate frame skipping (usually rate = 1, i.e. skip every other frame); power is saved since fewer frames are scored with all GMMs. 31
  • 32. IoT Device - Text-Independent Speaker Recognition: Frame decimation • Low-power focus, FS_mode=0 (simulator mode, no frames skipped): A) reduce exact GMM evaluations (scoring of GMMs) to an absolute minimum. Simulation result from a floating-point Matlab simulation: 500 test frames, post min-energy filtering; no frame skipping is done, for 100% success but at 100% computation (every frame is scored with all GMMs, maximum power dissipation). Note that the X axis, FS_Rate, = 0 at all times (not to scale).
  • 33. IoT Device - Text-Independent Speaker Recognition: Frame decimation • Low-power focus, FS_mode=1 (simulator mode, skip 1, 2, 4, ... 128 frames): • B) reduce exact GMM evaluations (scoring of GMMs) to an absolute minimum. Simulation result from a floating-point Matlab simulation: 500 test frames as before; frame skipping is done, for less than 100% success and less than 100% computation (not every frame is scored with all GMMs); this saves power with fewer computations, but lowers the recognition success rate (already below 90% when skipping every 16th frame).
  • 34. IoT Device - Text-Independent Speaker Recognition: Frame decimation • Success rate increases as fewer frames are skipped (128, 64, ... 4, 2, 1), FS_Mode=1. Challenge: develop an algorithm and architecture to generate the red performance curve.
  • 35. Text-Independent Speaker Recognition - Clustering test frames: met Challenge • Low-power focus, FS_mode=1 vs FS_mode=0 with kMeans clusters: but k = ? C) reduce exact GMM evaluations (scoring of GMMs) to an absolute minimum. My idea: find clusters in the 500 test frames; use batch k-Means, starting with 10 clusters and incrementing by 10; use the centroids of all clusters to score all GMMs. This "decimates" the 500 test frames to N frames (N centroids), N << 500, for each N-cluster scenario. Simulation result from a floating-point Matlab simulation: for N = k = 40 clusters the success rate is already 97%, with 8% computation (% of GMMs scored). Classical FS_mode=1 achieves 94% success with 21% computation.
  • 36. Text-Independent Speaker Recognition - number of computations - Clustering k-Means • On-line k-Means clustering - number of computations to find 40 clusters: uses the LMS-like cluster-center update below; at each timestep t, each frame x1 ... xt contributes equally to determining the updated centers z1 ... zk • Algorithm (Lloyd's) • Clustering (k-Means, k = 40) using on-line k-Means with k = 40, 1 iteration: 40 distance computations (480 adds, 480 mults); sort 40 values (40·log(40) ≈ 64 3-way comparisons); centroid update: 1 counter add, 12 sub/adds, 12 divides, 12 adds • Total for 10 iterations: 5,050 adds; 4,800 mults; 640 3-way comparisons; 120 divides; the Matlab simulation above for k = 40 clusters converges in 10 iterations.
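A Python/NumPy sketch of the on-line k-Means with the LMS-style center update z_i ← z_i + (1/n_i)(x − z_i). The initialization scheme and the toy data are mine, for illustration only; the real input would be the 500 post-filtering 12-element MFCC frames.

```python
import numpy as np

def online_kmeans(frames, k, iters=10, seed=0):
    """On-line (sequential) k-means: for each frame, find the closest
    center and move it toward the frame by 1/n_i (LMS-style update)."""
    rng = np.random.default_rng(seed)
    z = frames[rng.choice(len(frames), k, replace=False)].astype(float).copy()
    n = np.ones(k)                                  # per-cluster counters
    for _ in range(iters):
        for x in frames:
            i = np.argmin(((z - x) ** 2).sum(axis=1))   # closest centroid
            n[i] += 1
            z[i] += (x - z[i]) / n[i]                   # LMS update
    return z

# Toy data: two well-separated 12-dimensional clusters, k = 2
rng = np.random.default_rng(1)
frames = np.vstack([rng.normal(0.0, 0.1, (50, 12)),
                    rng.normal(5.0, 0.1, (50, 12))])
z = online_kmeans(frames, k=2, iters=10, seed=0)
```

Because the update divides by the running count n_i, each center is simply the running mean of the frames assigned to it, which is what lets the 40 centroids stand in for the 500 frames during GMM scoring.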
  • 37. Text-Independent Speaker Recognition - number of computations - GMM scoring with k centroids • Compare computations for FS_mode=1 vs FS_mode=0 with clustering and GMM scoring with k centroids: FS_mode=1 (250 frames from 500, 1 GMM scored): 129250 adds, 120000 mults, 6500 3-way comparisons, 10500 lookups, 0 divisions. On-line k = 40 clusters from 500 frames, 1 GMM scored: 25730 adds, 24000 mults, 1680 3-way comparisons, 1680 lookups, 120 divisions.
  • 38. Text-Independent Speaker Recognition - number of computations - GMM scoring with k centroids (bar chart of adds, mults, 3-way comparisons, lookups, and divisions for FS_mode=1 vs FS_mode=0 with k = 40 clusters)
  • 39. Text-Independent Speaker Recognition - GMM scoring with k centroids - Worst-case power analysis • Using 6 bits (vs 16) for all arithmetic and for MFCC quantization; clk reduced from 1.234 GHz to 1.234 MHz • From slide 30, scoring all 38 GMMs with 1 frame takes 0.8525 mW: scale power for rate = 1, 2, 4, 8, 16, FS_mode=1 • Then scale power for 10, 20, 30, 40, 50 centroids (frames) and FS_mode=0 • My worst-case estimate is 34 mW, FS_mode=0 with k = 40 centroids • Competitive with the 54 mW design by G. He, "A 40-nm 54-mW 3x-Real-Time VLSI Processor for 60-kWord Continuous Speech Recognition" • State-of-the-art is 6 mW, M. Price 2017, "A 6-mW 5000-Word Real-Time Speech Recognizer Using WFST Models" • Not an apples-to-apples comparison, since in speech recognition the decoder's active-list feedback selects 1 GMM, and GMMs model senones, not speakers; similar in that GMM scoring makes up the bulk of all computations
  • 40. Text-Independent Speaker Recognition - hardware for on-line k-means (block diagram: a counter block n1 ... nk; a block storing the k cluster centers z1 ... zk; a 'find closest zi to xt' unit for the new test data vector xt at time t; an 'update zi' unit; all sequenced by an FSM)
  • 41. Text-Independent Speaker Recognition - hardware for on-line k-means - detail of the Euclidean-distance (closest) stage (pipeline stages 5 and 6): the reference bus is driven with zi, i = 1 ... 40, and the test bus with xt; the Euclidean distance (zi - xt) goes to the sorter unit. 6 clocks to compute 1 Euclidean distance between the 40 12-dimensional z centroids and the incoming x vector.
  • 42. Text-Independent Speaker Recognition - hardware for linear-time sorting of K words 42 • For on-line k-Means with K = 40: 40 systolic sorting cells (sorting_cell_0 ... sorting_cell_39, each with state, cell_data, prev_data_is_pushed, and data_is_pushed signals, plus clk and shift_up); the Euclidean-distance block drives 1 to 40 words on the 32-bit unsorted_data bus, and sorted_data is read out on a 32-bit bus. 40 clocks to sort 40 distance values. Only the winning (smallest) z is used in the next stage (the LMS update stage).
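Functionally, the systolic sorter behaves like an insertion sort that absorbs one word per clock. A Python model of that behavior (one ordered insert per "clock"; the hardware performs the cell comparisons and shifts in parallel, which this sequential model does not capture):

```python
import bisect

def systolic_sort(words):
    """Software model of the K-cell systolic sorter: each new word from
    the Euclidean-distance block is absorbed in one clock, so K words
    are fully sorted after K clocks."""
    cells = []                    # the chain of cell_data registers
    for w in words:               # one clock per unsorted_data word
        bisect.insort(cells, w)   # cells stay sorted ascending
    return cells

dists = [17, 3, 9, 25, 1]         # example distance values
ordered = systolic_sort(dists)
winner = ordered[0]               # smallest distance feeds the LMS update stage
```

Since only the minimum is consumed downstream, a min-register would suffice functionally; the full sorted chain is what makes the design reusable for the LSE max-search in the GMM scoring path.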
  • 43. Text-Independent Speaker Recognition – linear time Sorter Verilog simulation 43
  • 44. Text-Independent Speaker Recognition - update winning cluster center stage (LMS update) • $Z_{i+1} = Z_i + (1/n_i)(X - Z_i)$ • 4 clocks to compute the LMS update • Total clocks for 1 iteration = 6 + 40 + 4 = 50 • Total for 10 iterations = 500 clocks 44
  • 45. Text-Independent Speaker Recognition – Result summary • For TIMIT TEST/DR1 38 speaker set I’ve shown that 40 clusters from online k-Means can achieve 97% recognition success rate • I have achieved a 12.5 : 1 ( 500 to 40 ) reduction in number of frames used for GMM scoring while maintaining a 97% success rate ; only 40 centroids are needed • 5 : 1 reduction in number of adds and mults • 3.9 : 1 reduction in number of 3-way comparisons • 6.25 : 1 reduction in number of lookups • Estimated 6 : 1 reduction in worst-case power ( 34 mW vs 213 mW) • Above estimates are for 6-bits quantization for all params and MFCC data; using 1.234 MHz processing clock for published PDK 45 nm implementations of arith blocks
  • 46. Future Works • Complete the fixed-point Verilog implementation of the on-line 40 cluster k-Means datapath • Complete its integration with the GMM scoring datapath • Simulate end-to-end design and characterize performance : power, latency, success rate • Evaluate additional low-power techniques : • 1 – at GMM layer, select 1 GMM to score, instead of all GMMs (pruning) ; • 2 - deeper pipelines for on-line clustering unit and for GMM scoring unit : preferred over adding parallel-units due to leakage current issues at 45 nm and below ; • 3 – power modes : sleep , deep-sleep, doze (last GMM used On, others Off ) • Scale the design to all 168 speakers in the TIMIT TEST/DRx data set. • Publication : paper to be submitted at IEEE ICCD 2018 46
  • 48. IoT Device – Speaker Recognition • If speaker recognition computations can be offloaded from the cloud processor to the edge IoT node, that cloud processor does not have to be as fast • Smartphone apps ( Alexa, Siri, Google Assistant ) generally need 1 Watt of power to process a single speech-recognition query ; 100 Watts for 100 queries • Dominant computation in max-likelihood GMM speaker recognition is Gaussian probability estimation (scoring ) – from 6 mW (MIT) to 1.8 W ( CMU ) with GMM accelerators and MFCC frames • I focus on reducing this power by reducing the total number of GMM scoring operations via a frame downsampling accelerator , processing clock frequency reduction, and imprecise arithmetic ( fewer quantization bits ) • Initial results in paper to be submitted to IEEE MWCS 2018
  • 49. IoT Device - Localization and Self-Localization • Goal: off-load Cloud-server computations to the IoT device - less network congestion, faster response times for IoT-device localization • The IoT device has custom low-power circuits for spatial self-localization • 2D or 3D spatial coordinates of the IoT device: on-board sensors supply data (acoustic or optical AOA to anchors, the anchors' locations) to the device's processor and accelerators; it then computes its coordinates (in its own coordinate system) and sends them to the Cloud server • The Cloud server then does coordinate translation and maps the IoT device onto a global absolute coordinate map; or the IoT device does the coordinate translation on-board • My research area: on-board accelerators based on massively parallel neural networks (Recurrent Neural Networks, RNNs) for coordinate computation • Initial results published in IEEE ICCD 2017 • Additional result: non-RNN image-based localization and coordinate mapping (registration), published in IEEE ISM 2016
  • 50. Localization in 2D Future Work - low power Analog OTA circuit 1 - backup. The localization problem can also be formulated as a system of linear differential equations, as shown below. For AOA measurements from M anchors, this leads to the system of linear equations

  $$\begin{bmatrix} \sin\alpha_1 & -\cos\alpha_1 \\ \vdots & \vdots \\ \sin\alpha_M & -\cos\alpha_M \end{bmatrix} \begin{bmatrix} x_s \\ y_s \end{bmatrix} = \begin{bmatrix} \sin\alpha_1\, x_1 - \cos\alpha_1\, y_1 \\ \vdots \\ \sin\alpha_M\, x_M - \cos\alpha_M\, y_M \end{bmatrix}. \quad (2)$$

  In the noiseless case, the above set of linear equations is consistent. However, due to noise in the AOA measurements, the system should be solved in a least-squares sense. Therefore (2) can be written as

  $$\begin{bmatrix} f_1 \\ \vdots \\ f_M \end{bmatrix} = \begin{bmatrix} \sin\alpha_1 & -\cos\alpha_1 & -\sin\alpha_1\, x_1 + \cos\alpha_1\, y_1 \\ \vdots & \vdots & \vdots \\ \sin\alpha_M & -\cos\alpha_M & -\sin\alpha_M\, x_M + \cos\alpha_M\, y_M \end{bmatrix} \begin{bmatrix} x_s \\ y_s \\ 1 \end{bmatrix}. \quad (3)$$

  Here, the estimated location of the sensor is

  $$\begin{bmatrix} x_s \\ y_s \end{bmatrix} = \arg\min \sum_{i=1}^{M} f_i^2. \quad (4)$$

  If we write $H = \sum_{i=1}^{M} f_i^2$, the total error is minimized when $dH/dt = 0$. However, since $H \ge 0$, $dH/dt \le 0$ is also a sufficient condition to minimize H. $dH/dt$ expands as

  $$\frac{dH}{dt} = \begin{bmatrix} x_s & y_s & 1 \end{bmatrix} \begin{bmatrix} \sin\alpha_1 & -\cos\alpha_1 & -\sin\alpha_1\, x_1 + \cos\alpha_1\, y_1 \\ \vdots & \vdots & \vdots \\ \sin\alpha_M & -\cos\alpha_M & -\sin\alpha_M\, x_M + \cos\alpha_M\, y_M \end{bmatrix}^T \begin{bmatrix} \sin\alpha_1 & -\cos\alpha_1 \\ \vdots & \vdots \\ \sin\alpha_M & -\cos\alpha_M \end{bmatrix} \begin{bmatrix} dx_s/dt \\ dy_s/dt \end{bmatrix} = 0.$$
  • 51. Localization in 2D Future Work - low power Analog OTA circuit 2 - backup. The descent dynamics enforcing $dH/dt \le 0$ are

  $$\begin{bmatrix} dx_s/dt \\ dy_s/dt \end{bmatrix} = -\begin{bmatrix} \sin\alpha_1 & -\cos\alpha_1 \\ \vdots & \vdots \\ \sin\alpha_M & -\cos\alpha_M \end{bmatrix}^T \begin{bmatrix} \sin\alpha_1 & -\cos\alpha_1 & -\sin\alpha_1\, x_1 + \cos\alpha_1\, y_1 \\ \vdots & \vdots & \vdots \\ \sin\alpha_M & -\cos\alpha_M & -\sin\alpha_M\, x_M + \cos\alpha_M\, y_M \end{bmatrix} \begin{bmatrix} x_s \\ y_s \\ 1 \end{bmatrix}. \quad (6)$$

  Eq. (6) can be rearranged as

  $$\begin{bmatrix} dx_s/dt \\ dy_s/dt \end{bmatrix} = -\begin{bmatrix} \sum_{i=1}^{M}\sin^2\alpha_i & -\sum_{i=1}^{M}\sin\alpha_i\cos\alpha_i \\ -\sum_{i=1}^{M}\sin\alpha_i\cos\alpha_i & \sum_{i=1}^{M}\cos^2\alpha_i \end{bmatrix} \begin{bmatrix} x_s \\ y_s \end{bmatrix} - \begin{bmatrix} \sum_{i=1}^{M}\big(-\sin^2\alpha_i\, x_i + \sin\alpha_i\cos\alpha_i\, y_i\big) \\ \sum_{i=1}^{M}\big(\sin\alpha_i\cos\alpha_i\, x_i - \cos^2\alpha_i\, y_i\big) \end{bmatrix}. \quad (7)$$

  Eq. (7) is abbreviated as

  $$\begin{bmatrix} dx_s/dt \\ dy_s/dt \end{bmatrix} = -A \begin{bmatrix} x_s \\ y_s \end{bmatrix} - B. \quad (8)$$
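Equation (8) is a gradient flow whose equilibrium is the least-squares AOA fix, so before committing to an OTA circuit it can be checked numerically with forward-Euler integration. A Python/NumPy sketch (function names and the example geometry are mine, for illustration):

```python
import numpy as np

def gradient_flow_fix(anchors, angles, dt=0.05, steps=2000):
    """Euler integration of eq. (8), d[xs, ys]/dt = -A [xs, ys] - B.
    The equilibrium A p = -B is the least-squares AOA solution that the
    analog circuit is meant to settle to."""
    anchors = np.asarray(anchors, dtype=float)
    s, c = np.sin(angles), np.cos(angles)
    G = np.column_stack([s, -c])                 # M x 2 bearing matrix
    h = s * anchors[:, 0] - c * anchors[:, 1]    # right-hand side of (2)
    A = G.T @ G                                  # matrix A of eq. (7)
    B = -(G.T @ h)                               # vector B of eq. (7)
    p = np.zeros(2)                              # start at the origin
    for _ in range(steps):
        p = p + dt * (-(A @ p) - B)              # forward-Euler step
    return p

# Example: noiseless bearings to a target at (2, 3) from three anchors.
target = np.array([2.0, 3.0])
anchors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
angles = np.array([np.arctan2(target[1] - y, target[0] - x) for x, y in anchors])
p = gradient_flow_fix(anchors, angles)
```

Convergence requires dt·λmax(A) < 2; since trace(A) = M, any dt well below 2/M is safe, which mirrors the RC time-constant choice in the OTA realization.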
  • 52. Localization in 2D Future Work - low power Analog OTA circuit 3 - backup • The following OTA circuit is proposed for solving equation (8) above; Andrea Gualco's OTA design and OTA-based localizer are compared with the RNN circuit. (Schematic: two cross-coupled OTA integrators with common-mode voltage VCM, capacitors C, resistors R1 = 1/A11 and R2 = 1/A22, transconductances GM1 = GM2 = A12, and current sources I1 = -B11, I2 = -B21, producing xs and ys.)
  • 53. Localization in 2D Future Work - low power Analog OTA circuit 3a - backup • Linear coupled differential-equation circuit for 2D localization; the OTA Verilog-A model is completed and unit-tested in HSpice simulations (Vcm, common-mode voltage, = 0.5 V) • The plot is an example simulation output: OTA output current I(Vcm_out) vs input voltage difference V
  • 54. Localization in 3D Future Work - low power Analog Linear System circuit 4 - backup • In some 3D spatial-localization cases the A matrix in the above OTA circuit may not be positive definite - hence no convergence can be achieved • I have a solution for this case using a linear voltage op-amp (balanced adder-subtractor) circuit. (Schematic: three op-amp stages for x, y, and z, each with resistors R1..R5 and Rf, a DC source b1/a11, b2/a22, or b3/a33, and cross-coupling through the other two coordinates.)
  • 55. Localization in 3D Future Work – low power Analog Linear System circuit 4a - backup • The coefficients in these equations are derived from three measured AOA values ( azimuth angles beta1, beta2, and elevation angle gamma1 ), and two anchor’s known data (x1,y1,z1) and (x2,y2,z2). • The active analog network for solving the above 3x3 system is shown below, it requires 3 op-amps , 3 DC voltage sources, and 18 resistors as shown in the previous slide. •
  • 56. Localization in 3D Future Work – low power Analog Linear System circuit 4b - backup • The following 45 nm op-amp and biasing network was used, based on R.J.Baker ( Reference: Baker, “CMOS Circuit Design , Layout, and Simulation”, 3rd edition, sect. 24.1 , Fig. 24.2 )
  • 57. Localization in 3D Future Work – low power Analog Linear System circuit 5 - backup • X coordinate = V(out) convergence = 94.532 mV * 50 = 4.73 approx. 5 (true) • Y coordinate = V(out2) = 309.014 mV * 50 = 15.45 approx. 15 (true )
  • 58. Localization in 3D Future Work – low power Analog Linear System circuit 6 - backup • Z coordinate = V(out3) = 274.6874 mV * 50 = 13.7 approx. 14 (true)
  • 59. RNN solver - Quadratic Program • Solving a Quadratic Program with the QP block, via the 'Select QP or LP' mux 59 (datapath: a D-matrix [D11 D12; D21 D22] matrix-vector multiply produces Y1(n), Y2(n); C1 and C2 are subtracted; a max-with-0 stage produces R1(n), R2(n); a vector-scalar multiply by dt gives dX1(n), dX2(n), which are accumulated with X1(n-1), X2(n-1) into X1(n), X2(n))
  • 60. IoT Device - Text-Independent Speaker Recognition • I’m focusing on Speaker Recognition ( Identification of 1 speaker from a closed set of M enrolled speakers) not Verification of speaker’s claimed Identity • GMM based, generative stochastic models, using open-source TIMIT database for model construction and algorithm and hardware verification ; GMM model build with EM for each enrolled speaker, using speaker’s training set of MFCC feature vectors (frames) ; an offline process. A typical 10 msec speaker’s training utterance can have 2000 12-element MFCC vectors for GMM model building during offline training. • During online recognition, after Voice Activity Detection and minimum acoustic energy filtering, about 500 12-element MFCC frames are generated by the unknown (test) speaker. • A typical maximum-likelihood, GMM-based, speaker recognition system : online recognition uses the bottom path :
  • 61. IoT Device - Text-Independent Speaker Recognition • GMM model of 1 speaker: a mixture of multivariate Gaussian densities • The Gaussian mixture probability density function of model (speaker) λ consists of a sum of K weighted component densities,

  $$p(\mathbf{x} \mid \lambda) = \sum_{k=1}^{K} P_k\, \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),$$

  where K is the number of Gaussian components, $P_k$ is the prior probability (mixture weight) of the k-th Gaussian component, and

  $$\mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}_k|^{1/2}} \exp\!\Big(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}-\boldsymbol{\mu}_k)\Big)$$

  is the d-variate Gaussian density function with mean vector $\boldsymbol{\mu}_k$ and covariance matrix $\boldsymbol{\Sigma}_k$. The mixture weights $P_k \ge 0$ are constrained by $\sum_{k=1}^{K} P_k = 1$.
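For diagonal covariances, the log-domain score log p(x | λ) used throughout the scoring pipeline can be sketched as follows in Python/NumPy. This is an illustrative floating-point model (the hardware uses fixed point and the sorter-based LSE unit); all names are mine.

```python
import numpy as np

def gmm_log_score(x, weights, means, inv_vars):
    """log p(x | lambda) for a diagonal-covariance GMM, in the log domain:
    per-component quadratic (Mahalanobis) accumulation, then log-sum-exp
    across the K mixtures."""
    d = x.shape[0]
    # log of each component's Gaussian normalizer, Sigma_k = diag(1/inv_vars[k])
    log_norm = -0.5 * (d * np.log(2.0 * np.pi) - np.sum(np.log(inv_vars), axis=1))
    # squared Mahalanobis distance, accumulated over the d features
    maha = np.sum((x - means) ** 2 * inv_vars, axis=1)
    comp = np.log(weights) + log_norm - 0.5 * maha
    m = comp.max()                               # subtract max for stability
    return m + np.log(np.sum(np.exp(comp - m)))

# Sanity case: one zero-mean, unit-variance component in d = 12 dimensions
x = np.zeros(12)
score = gmm_log_score(x, np.array([1.0]), np.zeros((1, 12)), np.ones((1, 12)))
```

The inner `maha` sum is exactly the subtract/square/multiply-accumulate loop of the scoring datapath, and the final two lines are the LSE stage.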
  • 62. GMM scoring - number of operations - worst case, all done in each clock cycle (activity factor = 1 for all); MFCC frames = 1, 20 GMMs per speaker. 1 speaker: 517 adds, 480 mults, 26 3-way comparisons, 42 lookups; 38 speakers: 19646 adds, 18240 mults, 988 3-way comparisons, 1596 lookups. 1 speaker (GMM) = 52.48 mW = 517*(20 uW) + 480*(55 uW) + 26*(40 uW) + 42*(350 uW); 38 speakers (GMMs) = 1994 mW = 19646*(20 uW) + 18240*(55 uW) + 988*(40 uW) + 1596*(350 uW)
  • 63. IoT Device - Text-Independent Speaker Recognition • For the above FS_Mode=0, FS_Rate=0 simulation: evolution of probAll, p(· | λs), all speakers' posterior probabilities; the winning speaker has the smallest negative log(prob), approx. -7990; the X axis is the number of test frames
  • 64. IoT Device - Text-Independent Speaker Recognition • For the above FS_Mode=1, FS_Rate=1,2,4,8,16 simulation: evolution of probAll, p(· | λs), all speakers' posterior probabilities; the winning speaker has the smallest negative log(prob), approx. -820; the X axis is the number of test frames; jumps occur when FS_rate changes; probAll is recomputed only for a new FS_rate
  • 65. Text-Independent Speaker Recognition - Clustering test frames • For the above FS_mode=0, FS_Rate=0, kMeans 40-cluster simulation: evolution of probAll, all speakers' posterior probabilities over all 40 test frames (centroids); the winning speaker has the smallest negative log(prob), approx. -590; the X axis is the number of test frames
  • 66. Text-Independent Speaker Recognition - GMM scoring with k centroids - power analysis references • ref 1 - S. Sharma et al., 2015, "Design of Low Power High Speed 16-bit Adder with McCMOS in 45 nm Technology" • ref 2 - S. Mohan et al., 2017, "An improved implementation of hierarchy array multiplier using Cs1A adder and full swing GDI logic - 45 nm PDK" • ref 3 - P. Sharma et al., 2016, "Design Analysis of 1-bit Comparator using 45nm Technology" • ref 4 - J. Stine et al., 2017, "A high performance multi-port SRAM for low voltage shared memory systems"
  • 67. Text-Independent Speaker Recognition - number of computations - Clustering k-Means - table of operations • The above on-line k-Means (k = 40, on 500 test frames) clustering algorithm requires the following operations per iteration (40 squared Euclidean distances, sorting to find the min of 40 values, and the LMS update of the winning cluster): 1 iteration: 505 adds, 480 mults, 64 3-way comparisons, 12 divisions; 10 iterations: 5050 adds, 4800 mults, 640 3-way comparisons, 120 divisions
  • 68. Text-Independent Speaker Recognition - number of computations - GMM scoring with k centroids; table of ops; FS_Mode=0 • 10 iterations to converge to 40 frames (centroids): 5050 adds, 4800 mults, 640 3-way comparisons, 120 divisions, 0 lookups • Score GMM with 40 frames (centroids): 20680 adds, 19200 mults, 1040 3-way comparisons, 0 divisions, 1680 lookups • Total kMeans and GMM: 25730 adds, 24000 mults, 1680 3-way comparisons, 120 divisions, 1680 lookups