Pushing Intelligence to Edge Nodes : Low Power circuits for Self Localization and Speaker Recognition

Pushing Intelligence to Edge Nodes :
Low Power circuits for Self-
Localization and Speaker Recognition
Nick Iliev
Presented to:
prof. Trivedi
prof. Paprotny
prof. Rao
prof. Metlushko
prof. Zheng
1

Intelligence at the edge nodes: Applications
Internet-of-acoustic-things
Simultaneous Localization
and Mapping (SLAM)
Autonomous vehicles
Wearables
2

Focus of this Research
• Develop ultralow power computing platforms for
– Speaker recognition hardware accelerator
– Localization hardware accelerator
• Low power neural network implementations
• Low power GMM-based speaker recognition
• Runtime adaptation : depending on battery state and
performance, select processing clock frequency
and/or number of quantization bits to use.
3

Ultralow power spatial localization
Publications:
• 1. on-board Accelerators based on massively parallel neural networks ( Recurrent
Neural Networks , RNNs ) for coordinate computation and localization. Initial
results published in IEEE ICCD 2017
• 2. non-RNN image-based localization and coordinate mapping (registration) :
published in IEEE ISM 2016
• 3. Review and Comparison of Spatial Localization Methods for Low-Power Wireless
Sensor Networks : IEEE Sensors Journal 2015
4

Spatial Localization: Centralized vs Distributed
Cloud
User
Server
Anchor Sink Node
Nearest Anchor Nodes (shaded)
Broadcasts own location and
Routes data to sink nodes
Unknown location node (IoT node)
Receives locations of anchors
And calculates own location
Based on measurements
In Centralized algorithms,
Server receives all measurements
And calculates locations for all unknown nodes
In Distributed algorithms,
Server only stores locations for all nodes,
Each unknown node computes own location and
broadcasts it to the network
Unknown location node (IoT node)
Receives locations of anchors
And calculates own location
Based on measurements
5

Distributed computational load to anchors
6
• Increases with number of unknown nodes ; no RNN capability at the unknown nodes
= computational
load increase

Distributed computational load to anchors
7
• Each unknown node computes own location with RNN
accelerator – load decrease at anchors. RNN accelerator
offloads CPU, reduces power and latency. No off-line
training.
= computational
load increase
= computational
load decrease
RNN
RNN
RNN
RNN
RNN

Spatial localization in 2D – AOA Geometry
8
• Two or more anchors illuminate each unknown
• Centralized – measure F1,F2 and transmit to server ;
receive own (x,y) from server
• Decentralized (self-localization) – measure F1,F2, compute
own (x,y), and transmit (x,y) to server ; saves
communications bandwidth, power
X
Y
Anchor
‘R1’ at
(X1,Y1)
Anchor ‘R2’ at
(X2,Y2) Sensor ‘U’
of unknown
location
Φ1
Φ2
Angle of
arrival (AOA)

Spatial Localization in 2D: Applications
9
AOA sensor distribution of fields
for sensor with 12 photodetectors
❑ Most use CPU and matrix / linear-algebra
hardware accelerators
❑ A few use Recurrent Neural Network (RNN) in
hardware/software :
S. Li and S. Chen and Y. Lou and B. Lu and Y. Liang, “A
Recurrent Neural Network for Inter-Localization of
Mobile Phones”, in Proc. IEEE-WCCI, Jun. 10-15,
2012.

• Recurrent Neural Network (RNN) hardware/software
embedded accelerators – Mop/s/W
Current RNN Solutions – up to 128 Neurons
0
50
100
150
200
250
300
Mop/s
/
W

Spatial Localization in 2D - my RNN Solution
• Formulate 2D AOA localization as a constrained
primal-dual linear program
• Solve it with RNN – from 2 to 128 neurons
• 𝑀𝑖𝑛 𝐶𝑇
𝜃 ∀ 𝐺 × 𝜃 = 𝐻, 𝜃 ≥ 0 primal
𝑀𝑎𝑥 𝐻𝑇𝜑 ∀ 𝐺𝑇 × 𝜑 ≤ 𝐶 dual
The RNN model for solving the above system is :
•
𝑑
𝑑𝑡
𝜃
𝜑
= −
𝜃 − 𝜃 + 𝐺𝑇𝜑 − 𝐶 +
𝐺 𝜃 + 𝐺𝑇
𝜑 − 𝐶 +
− 𝐻
• here, for a variable w, (w)+ = max (w, 0)

Localization in 2D - Discrete time RNN
• We control convergence rate via dt, which is implemented
as a fixed-point fraction in Q15.17 format. All arithmetic
operations in the data path also use the Q15.17 format
•
𝜃(𝑘 + 1)
𝜑(𝑘 + 1)
=
𝜃 𝑘 + 𝑑𝑡 × 𝑟 𝑘
𝜑 𝑘 + 𝑑𝑡 × 𝐻 − 𝐺 × 𝑟 𝑘
,
where 𝑟 𝑘 = max[ 𝜃 𝑘 + 𝐺𝑇𝜑 𝑘 − 𝐶 , 0].
• The min cost function coefficients, C, in the above primal
problem can be chosen at random since the primary goal
is to solve for q.

Localization in 2D - Digital RNN Architecture
Register
Adder
Register
Adder
×2
×M
θ(k+1)
θ(k)
φ(k+1)
φ(k) Matrix
Product Eval.
G
Adder
GT
φ(k+1)
-C
Register
Register
C
o
m
p
a
r
a
t
o
r
GT
φ(k+1)+θ(k+1)-C
0
Register
r(k+1)
×2 ×2
Multiplier
r(k)
dt
Matrix
Product Eval.
-G
Register
Adder Mult.
dt
Reg.
H H-G×r(k)
r(k)
×2
×2
×M
Primal
solution
Dual solution
Hidden variable
evaluation (RNN block)
Adder
-
Characterization of FPGA-based Localization
Platform ProASIC3EA3PE3000
Combinatorial Cells 24946
Sequential cells (DFFs) 1453
Max Clock Freq MHz 31.45
Power Dissipation for Core at 1.5V 180 mW
Power Dissipation for Core (1.5V) and IO pads
(3.3V)
301.219 mW

Digital RNN Architecture
• Characterization of ASIC-based Localization - PDK45 1V VDD
• HSpice simulations with netlist from Cadence Virtuoso used to
compute average power dissipation, with a 1V supply
• measuring the total current drain from the supply over a 3.2 μs
period
Design Technology NCSU PDK 45 nm
Combinatorial Cells 51890
Sequential cells (DFFs) 962
Max Clock Freq MHz 516
Total Power Dissipation at VDD = 1 V 6.15 mW

Simulated Performance – Mop/sec/W
1 10 100 1000
RNN PDK45 (This work)
RNN FPGA (This work)
LSTM HW 2x Zynq FPGA
LSTM HW Zynq FPGA
Zynq ZC7020 CPU
Exynos5422 4Cortex-A7
Exynos5422 4Cortex-A15
Tegra TK1 GPU
Tegra TK1 CPU
Performance per unit power of different
embedded RNN realizations ( the higher the
better )
Mop / s / W
FPGA - 128 neurons (accounting AOA measurements from 128 anchors),
results in 13 Mop/sec/W with 31.25 MHz processing clock.
PDK 45 – 677.165 Mop/sec/W with 516 MHz processing clock.
A. Chang, B.Martini, E.Culurciello, “Recurrent neural networks hardware
implementation on fpga”, IJAREEIE vol. 5, no 1, pp. 401-409, Jan. 2016.

Simulated RNN state convergence
0 500 1000 1500 2000 2500 3000 3500 4000
0
0.2
0.4
0.6
0.8
1
Time steps, multiples of dt = 0.01. Inset shows steps 400 to 1400.
Primal
states
q1
(blue)
and
q2
(red)
dual
states
f1
(magenta)
and
f2
(black)
400 500 600 700 800 900 1000 1100 1200 1300 1400
0.4
0.5
0.6
0.7
0.8
0.9
1
Simulated convergence: q1 (blue) and q2 (red) are 2D (x,y) coordinates. Solid
lines from MATLAB reference simulation. Dashed lines from Q17.15 fixed-
point Verilog simulation.

Estimates with Noisy Measurements - 1
Error in X & Y estimates against increasing measurement noise. Noise in
measurement angles β1 & β2 is distributed Normally. Error in X & Y estimates is
defined as sum of absolute differences between true and estimated coordinates.
Each point is average over 100 runs.

Estimates with Noisy Measurements - 2
Histogram of estimated X & Y coordinates (normalized to 1).

Localization in 2D – Digital RNN Result Summary
• Proposed 2D AOA Localization architecture uses a
digital fixed-point RNN, with a scalable number of
neurons ( 2 to 128 ) in the hidden layer. The largest
overdetermined system has 128 neurons for AOA
measurements from 128 anchors.
• The RNN solves a primal-dual LP program for the
target’s x,y coordinates.

Localization with Digital RNN
• Reduce power consumption of Hspice netlist – apply power-
gating PMOS / NMOS transistor techniques
• Reduce power consumption of Verilog gate-level netlist by
aggressive clock gating, arithmetic operand gating, imprecise
add/mult bit-widths with acceptable error bounds
• Apply RNN to 3D localization – 3x3 primal/dual LP with 3
neurons for the basic 3x3 system : scale to Nx3 for
overdetermined systems, where N=3,6,9, … etc.
• Compare digital RNN solution with analog OTA based solution
– backup slides

Ultralow power speaker recognition
Publications:
1. Paper to be submitted at IEEE ICCD 2018
22

Text Independent Speaker Recognition
23
• Gaussian mixture model (GMM)-based speaker
probability extraction
• Feature extraction as Mel frequency cepstral
coefficients (MFCCs)

IoT Device - Text-Independent Speaker Recognition
• The Classification block above is a Maximum Likelihood GMM-based classifier,
with all computations in the log domain ; p( . | l i ) is a speaker’s GMM scored
at each MFCC vector x 1 … x T
Ref : D. Reynolds 1995 Ph.D. Thesis

• Example digital system for GMM scoring, up to log domain ( up to Log_Sum of
Exponents, LSE) – see backup for GMM matrix equation; Simulated in floating-
point Matlab : 16 clocks to score 1 12-dimensional z centroid for mixture GMM_i
1
GMM component i – scoring ( evaluation) for incoming 1x12 16-bit two’s complement Q(16,14) MFCC vector
Audio
Stream
MFCC vector
of 12 jointly
Gaussian
Rand Var
X22_Reg1[11:0][15:0]
Mu_[11:0][15:0]
Load_GMM_i_
params
Inv_sigma [11:0][15:0]
Sub_0
…
Sub_11
Sqr_0
…
Sqr_11
mult add
Accum_Reg1[15:0]
Accumulator control :
12 Iterations of mult-add
to Log_Sum
(LSE)
domain
accum[15:0]
Sqr_sub[11:0][15:0]
Sub_vec[11:0][15:0]
Stage 1 Stage 2
Stage 3

GMM scoring – in log domain ( Log_Sum of Exponents, LSE, domain ) ; simulated in flt –point
Matlab
l
To log_LSE unit : M accumulator (accum)
outputs for x1 … xM (accum1 …. accum20 )
Log(k1)
.
.
.
Log(kM)
Pre-computed
element-wise addition
x 1new … x Mnew
sorter
Sorter
(systolic bubble sort)
Sort in M cycles
Find max element from
x 1new … x Mnew
M deep FIFO for
x 1new … x Mnew Saved max in register
x max
Subtract element-wise
x i max - x max i = 1… M
Register_sub
To exp unit
20 clocks for M=20 mixtures

GMM scoring – in log domain ( Log_Sum of Exponents, LSE, domain )
simulated in flt-point Matlab
Total_1z = 16 + 20 + 5 = 41 clocks to score 1 z centroid with GMM_i
Total_40z = 4 * 41 = 164 clocks to score all 40 z centroids all GMMs

GMM scoring – number of operations – Power analysis Estimate
based on published imlementations
• NCSU PDK 45 nm , Vdd = 1.1V , published implementations
• One 16-bit add Carry-Skip = 2 uW, ref 1 , Clk 50 MHz, delay 20 nsec
• On 16x16 mult, Array = 55 uW , ref 2 , Clk 1.234 GHz, delay 0.824 nsec
• 3-way comparator , magnitude = 40 uW , ref 3 , clk 1.2 GHz, delay 0.833 nsec
• SRAM , 4Kb, read access dynamic = 350 uW (leakage 800 uW) , ref 4 , clk 250 MHz
0
200
400
600
800
1000
1200
adds mults compar lookup
Power (mW) for GMM with 20 mixtures,
1 MFCC frame (12 rand var features) scored
1 GMM 1 speaker 38 GMMs 38 speakers 2
Calculated Worst-Case Power:
All ops in each 1.234 GHz cycle :
1 GMM total = 52.48 mW
38 GMMs total = 1994 mW
HIGH !
For each block/operation
using P = a C V 2 f
with a = 1 for all, f=1.234GHz for all

GMM Scoring – Worst case power reduction
techniques - 1
• 1 – Clock Frequency reduction : from (max) 1.234
GHz to 1.234 MHz (div-1000 ): total 1994 mW to
1.994 mW ; in 10 msec frame , above GMM scoring
pipeline has 10 stages, or 1 msec per stage : with
1.234 MHz ( 810 nsec ), we have 1234 clk cycles per
stage : enough clocks for all operations in a stage ;
still using 16-bits for math operations
Now Calculated Worst-case power
for 38 GMM total = 1.994 mW
29

techniques - 2
• 2 – Imprecise arithmetic – fewer quantization bits ; using 16-bits from
above :
• Using 6-bits ( vs 16 ) for all arithmetic and for MFCC quantization reduces
adder power from 20 uW to 20uW/(16/6) = 7.5 uW ; mult to 7.73 uW ;
comp to 5.62 uW ; new total worst-case power :
• Reducing the clock rate from 1.234 GHz to 1.234 MHz reduces this to
0.8525 mW , and to 0.0224 mW for 1 GMM
30
1 speaker (GMM ) = 52.48 mW = 517*(20uW) +480*(55uW)+26*(40uW)+42*(350uW)
38 speakers (GMMs ) = 1994 mW = 19646*(20uW) +18240*(55uW)+988*(40uW)+
1596*(350uW)
38 speakers (GMMs ) = 852.5 mW = 19646*(7.5 uW) +18240*(7.73 uW)+988*(5.62 uW)+
1596*(350uW)
Now Calculated Worst-case power
for 38 GMM total = 0.8525 mW
1 speaker (GMM ) = 22.4 mW = 517*(7.5 uW) +480*(7.73 uW)+26*(5.62 uW)+42*(350uW)

techniques - 3
• 3 – Frame Decimation (downsampling) – the majority
of todays GMM-based systems use a fixed rate frame
skipping (usually rate = 1, or skip every other frame );
power is saved since fewer frames are scored with all
GMMs
31

IoT Device - Text-Independent Speaker Recognition : Frame
decimation
• Low-power focus , FS_mode=0 (Simulator mode no frames skipped):
A ) reduce exact GMM evaluations ( scoring of GMMs ) to an absolute minimum
Simulation result from floating point Matlab simulation : 500 test frames, post
min-energy filtering ; no frame skipping is done, for 100% success but at 100%
computation ; ( every frame scored with all GMMs, maximum power dissipation)
Note that X axis,FS_Rate=0 at all times ( not to scale )
B )

decimation
• Low-power focus , FS_mode=1 ( simulator mode, skip 1, 2, 4, … 128 frames ):
• B ) reduce exact GMM evaluations ( scoring of GMMs ) to an absolute minimum
Simulation result from floating point Matlab simulation : 500 test frames as before;
frame skipping is done, for less than 100% success and less than 100% computation
( every frame Not scored with all GMMs ) ; saving power with fewer computations, but
lower success rate of recognition (already below 90% when skipping every 16th frame)

decimation
• Success rate increases as less frames are skipped ( 128, 64, … 4, 2, 1 )
FS_Mode=1
Challenge : develop algorithm and architecture to generate
the Red performance curve

Text-Independent Speaker Recognition – Clustering test frames : met Challenge
• Low-Power focus, FS_mode=1 vs FS_mode=0 with kMeans clusters : but k = ?
C) reduce exact GMM evaluations ( scoring of GMMs ) to an absolute minimum
- my idea - find clusters in the 500 test frames ; use batch k-Means , starting
- with 10 clusters and incrementing by 10. Use centroids for all clusters to
- score all GMMs with ; this “decimates” 500 test frames to N frames
- ( N centroids ) , N << 500, for each N cluster scenario.
– Simulation result from floating point Matlab simulation : for N= k =40 clusters
– success rate is already 97%, with 8% computation ( % GMMs scored ).
– Classical FS_mode=1 achieves 94% success with 21% computation.
–

Text-Independent Speaker Recognition – number of
computations – Clustering k-Means
• On-line k-Means Clustering – number computations to find 40 clusters : Uses
LMS-like cluster-center update below ; at each timestep t, each frame x1…xt
contributes equally to determine the updated centers z1…z k
• Algorithm ( Lloyd’s) :
•
• Clustering (k-Means, k=40 ) method, using on-line k-Means with k=40 , 1
Iteration : 40 distance computations : 480 adds, 480 mults ; sort 40 values (
40(log40) = 64 3-way comparisons ) ; centroid update : 1 counter add, 12
sub/add, 12 divide, 12 add
• Total for 10 Iterations : 5,050 adds ; 4800 mults; 640 3-way comparisons ;
120 divides ; above Matlab simulation for k=40 clusters converges in 10
iterations

Text-Independent Speaker Recognition – number of computations
– GMM scoring with k centroids
• Compare computations for FS_mode=1 vs computations for
FS_mode=0 with clustering and GMM scoring with k centroids
Adds Mults 3-way
comparisons
Lookups divisions
FS_mode=1
250 frames
from 500;
1GMM
scored
129250 120000 6500 10500 0
On-line k=40
clusters from
500
frames,1GM
M scored
25730 24000 1680 1680 120

Text-Independent Speaker Recognition – number of
computations – GMM scoring with k centroids
0
20000
40000
60000
80000
100000
120000
140000
adds mults 3-way comp lookups divisions
Number of operations
FS_mode=1 FS_mode=0 with k=40 clusters Column1

Text-Independent Speaker Recognition – GMM scoring with k
centroids – Worst-case power analysis
• Using 6-bits ( vs 16 ) for all arithmetic and for MFCC quantization; Clk reduced from 1.234 GHz to 1.234 MHz
• From slide 30, scoring all 38 GMMs with 1 frame takes 0.8525 mW : scale power for rate=1,2,4,8,16 fs_mode=1
• Then scale power for 10,20,30,40,50 centroids (frames) and fs_mode=0
• My worst-case estimate is 34 mW , fs_mode=0 with k=40 centroids
• Competitive with 54 mW design by G.He “A 40-nm 54 mW 3x-real Time VLSI Processor for 60-Kword Continuous
Speech Recognition”
• State-of-the is 6 mW , M.Price 2017 “ A 6mW 5000-Word Real-Time Speech Recognizer Using WFST Models”
• Not apples-to-apples comparison since in Speech Recogniton, decoder’s active-list feedback selects 1 GMM,
GMMs don’t model speakers but senones ; similar in that GMM scoring makes the bulk of all computations

Text-Independent Speaker Recognition – hardware for on-line k-
means
Counter block
n1 … n k
Block to store k
cluster centers
z 1 … z k
Find closest z i to
x t
Update z i
New
test
data
vector
at time t
x t
n i
FSM

Text-Independent Speaker Recognition – hardware for on-line k-
means – detail on Euclidean distance (closest)
stage 5 stage 6
Euclidean dist ( z i - x t ) to
sorter unit
Drive Reference
with z i
i = 1…40
Drive Test bus
with x t
6 clocks to compute 1 Euclidean dist between
40 12-dimensional z centroids and incoming x vector

Text-Independent Speaker Recognition – hardware for
linear time Sorting of K words
sorting_cell_0
state
cell_data
prev_data_is_pushed
data_is_pushed
sorting_cell_1
state
cell_data
prev_data_is_pushed
data_is_pushed
sorting_cell_39
state
cell_data
prev_data_is_pushed
data_is_pushed
. . .
unsorted
_data
32
clk
shift_up
sorted_data
32
42
• For on-line K-Means, with K=40 , 40 systolic sorting cells ; Euclidean-distance block
• drives 1 to 40 words on the unsorted_data bus
40 clocks to sort 40 distance values
Only winning (smallest ) z used
in next stage (LMS update stage)

Text-Independent Speaker Recognition – linear
time Sorter Verilog simulation
43

Text-Independent Speaker Recognition – update
winning cluster center stage (LMS update)
• Z i+1 = Z i + ( 1/n i )*( X – Z i )
• 4 clocks to compute LMS update
• Total clocks for 1 iteration = 6 + 40 + 4 = 50
• Total for 10 iterations = 500 clocks
44

Text-Independent Speaker Recognition – Result summary
• For TIMIT TEST/DR1 38 speaker set I’ve shown that 40 clusters from online
k-Means can achieve 97% recognition success rate
• I have achieved a 12.5 : 1 ( 500 to 40 ) reduction in number of frames used
for GMM scoring while maintaining a 97% success rate ; only 40 centroids
are needed
• 5 : 1 reduction in number of adds and mults
• 3.9 : 1 reduction in number of 3-way comparisons
• 6.25 : 1 reduction in number of lookups
• Estimated 6 : 1 reduction in worst-case power ( 34 mW vs 213 mW)
• Above estimates are for 6-bits quantization for all params and MFCC data;
using 1.234 MHz processing clock for published PDK 45 nm
implementations of arith blocks

Future Works
• Complete the fixed-point Verilog implementation of the on-line 40 cluster
k-Means datapath
• Complete its integration with the GMM scoring datapath
• Simulate end-to-end design and characterize performance : power,
latency, success rate
• Evaluate additional low-power techniques :
• 1 – at GMM layer, select 1 GMM to score, instead of all GMMs (pruning) ;
• 2 - deeper pipelines for on-line clustering unit and for GMM scoring unit :
preferred over adding parallel-units due to leakage current issues at 45 nm
and below ;
• 3 – power modes : sleep , deep-sleep, doze (last GMM used On, others
Off )
• Scale the design to all 168 speakers in the TIMIT TEST/DRx data set.
• Publication : paper to be submitted at IEEE ICCD 2018
46

IoT Device – Speaker Recognition
• If speaker recognition computations can be offloaded from the cloud
processor to the edge IoT node, that cloud processor does not have to
be as fast
• Smartphone apps ( Alexa, Siri, Google Assistant ) generally need 1
Watt of power to process a single speech-recognition query ; 100
Watts for 100 queries
• Dominant computation in max-likelihood GMM speaker recognition is
Gaussian probability estimation (scoring ) – from 6 mW (MIT) to
1.8 W ( CMU ) with GMM accelerators and MFCC frames
• I focus on reducing this power by reducing the total number of GMM
scoring operations via a frame downsampling accelerator , processing
clock frequency reduction, and imprecise arithmetic ( fewer
quantization bits )
• Initial results in paper to be submitted to IEEE MWCS 2018

IoT Device – Localization and Self-Localization
• - Goal : off-load Cloud server computations to IoT device – less network
congestion, faster response times for IoT device localization
• - IoT device has custom low-power circuits for spatial self-localization
• - 2D or 3D spatial coordinates of IoT device : on-board sensors supply data (
acoustic or optical AOA to anchors, anchor’s locations) to the device’s
Processor and Accelerators : it then computes its coordinates ( in its own
coordinate system) and sends them to Cloud server
• Cloud server then does coordinate translation and maps IoT device to global
absolute coordinate map ; or IoT device does coordinate translation on-board
• My research area : on-board Accelerators based on massively parallel neural
networks ( Recurrent Neural Networks , RNNs ) for coordinate computation
• Initial results published in IEEE ICCD 2017
• Additional result : non-RNN image-based localization and coordinate
mapping (registration) : published in IEEE ISM 2016

For AOA measurements from M anchors, this leads to a system of linear equation as
𝑆𝑖𝑛 𝛼1 −𝐶𝑜𝑠 𝛼1
⋮ ⋮
𝑆𝑖𝑛 𝛼𝑀 −𝐶𝑜𝑠 𝛼𝑀
𝑥𝑠
𝑦𝑠
=
𝑆𝑖𝑛 𝛼1 × 𝑥1 − 𝐶𝑜𝑠 𝛼1 × 𝑦1
⋮
𝑆𝑖𝑛 𝛼𝑀 × 𝑥𝑀 − 𝐶𝑜𝑠 𝛼𝑀 × 𝑦𝑀
. (2)
In the noiseless case, the above set of linear equations is consistent. However, due to noise
in AOA measurements, the system should be solved in a least square sense. Therefore (2)
can be written as
𝑓1
⋮
𝑓𝑀
=
𝑆𝑖𝑛 𝛼1 −𝐶𝑜𝑠 𝛼1 −𝑆𝑖𝑛 𝛼1 × 𝑥1 + 𝐶𝑜𝑠 𝛼1 × 𝑦1
⋮ ⋮ ⋮
𝑆𝑖𝑛 𝛼𝑀 −𝐶𝑜𝑠 𝛼𝑀 −𝑆𝑖𝑛 𝛼𝑀 × 𝑥𝑀 + 𝐶𝑜𝑠 𝛼𝑀 × 𝑦𝑀
𝑥𝑠
𝑦𝑠
1
. (3)
Here, the estimated location of sensor estimated as
𝑥𝑠 𝑦𝑠 = 𝑎𝑟𝑔𝑚𝑖𝑛 ෍
𝑖=1
𝑀
𝑓𝑖
2
. (4)
If we represent, 𝐻 = σ𝑖=1
𝑀
𝑓𝑖
2
, the total error minimizes when Τ
𝑑𝐻 𝑑𝑡 = 0. However, since
𝐻 ≥ 0, Τ
𝑑𝐻 𝑑𝑡 ≤ 0 is also a sufficient condition to minimize H []. Τ
𝑑𝐻 𝑑𝑡 is expanded as,
𝑑𝐻
𝑑𝑡
= 𝑥𝑠 𝑦𝑠 1
⋮ ⋮ ⋮
𝑇
×
⋮ ⋮
ൗ
𝑑𝑥𝑠
𝑑𝑡
ൗ
𝑑𝑦𝑠
𝑑𝑡
= 0.
Localization in 2D Future Work – low power
Analog OTA circuit 1 - backup
The localization problem can be also formulated
As a system of linear differential equations as show below

•
ൗ
𝑑𝑥𝑠
𝑑𝑡
ൗ
𝑑𝑦𝑠
𝑑𝑡
= −
⋮ ⋮
𝑇
• ×
⋮ ⋮ ⋮
𝑥𝑠
𝑦𝑠
1
. (6)
• Eq. (6) can be rearranged as
•
ൗ
𝑑𝑥𝑠
𝑑𝑡
ൗ
𝑑𝑦𝑠
𝑑𝑡
= −
σ𝑖=1
𝑀
𝑆𝑖𝑛 𝛼𝑖
2
− σ𝑖=1
𝑀
𝑆𝑖𝑛 𝛼𝑖 𝐶𝑜𝑠 𝛼𝑖
− σ𝑖=1
𝑀
𝑆𝑖𝑛 𝛼𝑖 𝐶𝑜𝑠 𝛼𝑖 σ𝑖=1
𝑀
𝐶𝑜𝑠 𝛼𝑖
2
𝑥𝑠
𝑦𝑠
−
σ𝑖=1
𝑀
−𝑆𝑖𝑛 𝛼𝑖
2
× 𝑥𝑖 + 𝑆𝑖𝑛 𝛼𝑖 𝐶𝑜𝑠 𝛼𝑖 × 𝑦𝑖
σ𝑖=1
𝑀
𝑆𝑖𝑛 𝛼𝑖 𝐶𝑜𝑠 𝛼𝑖 × 𝑥𝑖 − 𝐶𝑜𝑠 𝛼𝑖
2
× 𝑦𝑖
. (7)
• Eq. (7) is abbreviated as
•
ൗ
𝑑𝑥𝑠
𝑑𝑡
ൗ
𝑑𝑦𝑠
𝑑𝑡
= −𝐴
𝑥𝑠
𝑦𝑠
− 𝐵 (8)
Localization in 2D Future Work – low power Analog
OTA circuit 2 - backup

• The following OTA circuit is proposed for solving
equation 8) above ; Andrea Gualco’s OTA design and
OTA-based localizer compared with RNN ckt
•
OTA circuit 3 - backup
+
-
+
-
+
-
+
-
VCM
VCM
VCM
VCM
C
C
xs
ys
R1 = 1/A11
R2 = 1/A22
GM2 = A12
GM1 = A12
I1 = -B11
I2 = -B21

Localization in 2D Future Work – low power Analog OTA
circuit 3a - backup
• Linear Coupled Differential equation circuit for 2D localization ; OTA
Verilog-A model completed ; The plot is example simulation output :
OTA output current I(Vcm_out) vs input voltage difference V Hspice
simulation : Vcm (common mode V) = 0.5V
OTA Verilog-A completed – unit tested in HSpice sims

Localization in 3D Future Work – low power Analog Linear
System circuit 4 - backup
• In some 3D spatial localization cases the A matrix in the above OTA circuit
may not be positive definite – hence no convergence can be achieved
• I have a solution in this case using a linear voltage op-amp ( balanced
adder-subtractor ) circuit
x
+
-
Rfx
R4x
R5x
R3x
R1x
R2x
GND
+
-
b1/a11
+
-
Rfy
R4y
R5y
R3y
R1y
R2y
GND
+
-
b2/a22
+
-
Rfz
R4z
R5z
R3z
R1z
R2z
GND
+
-
b3/a33
y
z
y
z
x
z
x
y

Linear System circuit 4a - backup
• The coefficients in these equations are derived from three measured AOA
values ( azimuth angles beta1, beta2, and elevation angle gamma1 ), and two
anchor’s known data (x1,y1,z1) and (x2,y2,z2).
• The active analog network for solving the above 3x3 system is shown below, it
requires 3 op-amps , 3 DC voltage sources, and 18 resistors as shown in the
previous slide.
•

Localization in 3D Future Work – low power Analog Linear
System circuit 4b - backup
• The following 45 nm op-amp and biasing network was used, based on
R.J.Baker ( Reference: Baker, “CMOS Circuit Design , Layout, and
Simulation”, 3rd edition, sect. 24.1 , Fig. 24.2 )

Localization in 3D Future Work – low power Analog Linear System
circuit 5 - backup
• X coordinate = V(out) convergence = 94.532 mV * 50 = 4.73 approx. 5 (true)
• Y coordinate = V(out2) = 309.014 mV * 50 = 15.45 approx. 15 (true )

Localization in 3D Future Work – low power Analog Linear System
circuit 6 - backup
• Z coordinate = V(out3) = 274.6874 mV * 50 = 13.7 approx. 14 (true)

RNN solver – Quadratic Program
• Solving a Quadratic Program with QP block, via Select
QP or LP mux
59
D11 D12
D21 D22
X
Y1(n)
Y2(n)
matrix_vector
mult
+
X1(n)
X2(n)
-
C1
C2
M
a
x
0
R1(n)
R2(n)
- X
X1(n)
X2(n)
dt (scaler)
dX1(n)
dX2(n)
vector-scaler
mult
dX1(n)
dX2(n)
+
X1(n-1)
X2(n-1)
X1(n)
X2(n)
X
I
I + A Select QP or LP
QP

• I’m focusing on Speaker Recognition ( Identification of 1 speaker from a closed set of M
enrolled speakers) not Verification of speaker’s claimed Identity
• GMM based, generative stochastic models, using open-source TIMIT database for model
construction and algorithm and hardware verification ; GMM model build with EM for
each enrolled speaker, using speaker’s training set of MFCC feature vectors (frames) ; an
offline process. A typical 10 msec speaker’s training utterance can have 2000 12-element
MFCC vectors for GMM model building during offline training.
• During online recognition, after Voice Activity Detection and minimum acoustic energy
filtering, about 500 12-element MFCC frames are generated by the unknown (test) speaker.
• A typical maximum-likelihood, GMM-based, speaker recognition system : online recognition
uses the bottom path :

• GMM model of 1 speaker : mixture of multivariate Gaussian densities
•
• The Gaussian mixture probability density function of model ( speaker ) λ
consists of a sum of K weighted component densities, given by the above
equation. K is the number of Gaussian components, Pk is the prior probability
(mixture weight) of the k-th Gaussian component, and
• is the d-variate Gaussian density function with mean vector μk and covariance
matrix Σk. The mixture weights Pk ≥0 are constrained as

GMM scoring – number of operations – Worst case all done in
each clock cycle ( activity factor = 1 for all )
MFCC
Frames = 1
20 GMMs per
Speaker
Speakers Adds Mults 3-way
comparisons
lookups
1 517 480 26 42
38 19646 18240 988 1596
1 speaker (GMM ) = 52.48 mW = 517*(20uW) +480*(55uW)+26*(40uW)+42*(350uW)
38 speakers (GMMs ) = 1994 mW = 19646*(20uW) +18240*(55uW)+988*(40uW)+1596*(350uW)

• For above FS_Mode=0, FS_Rate=0 sim, evolution of probAll,
p( . | l s ) , of all speaker’s posterior probabilities; winning speaker has
smallest negative log(prob) approx. -7990 ; X axis is number of test
frames
•

• For above FS_Mode=1, FS_Rate=1,2,4,8,16 sim, evolution of probAll,
p( . | l s ) , of all speaker’s posterior probabilities; winning speaker has
smallest negative log(prob) approx. -820 ; X axis is number of test frames;
jumps when FS_rate changes; probAll recomputed only for a new FS_rate

Text-Independent Speaker Recognition – Clustering test frames
• Above FS_mode=0 , FS_Rate=0, kMeans 40 clusters simulation :
evolution of probAll, all speaker’s posterior prob over all 40 test
frames (centroids ); winning speaker has smallest negative log(prob)
approx. -590 ; X axis is number of test frames

Text-Independent Speaker Recognition – GMM scoring with k
centroids – power analysis
• ref 1 - S. Shartma et al. 2015, “Design of Low Power High Speed 16 bit Address with
McCMOS in 45 nm Technology”
• ref 2 - S. Mohan et al. 2017, “An improved implementation of hierarchy array multiplier
using Cs1A adder and full swing GDI logic – 45 nm PDK”
• ref 3 - P. Sharma et al. 2016, “Design Analysis of 1-bit Comparator using 45nm Technology”
• ref 4 – J. Stine et al. 2017, “A high performance multi-port SRAM for low voltage shared
memory systems”

Text-Independent Speaker Recognition – number of computations –
Clustering k-Means – table of operations
• Above on-line k-Means, k=40, on 500 test frames, clustering
algorithm requires the following operations per iteration : ( 40 squared
Euclidean distances, sorting (find min of 40 values ), and LMS update
of winning cluster :
Number of
iterations
Adds Mults 3-way
comparisons
divisions
1 505 480 64 12
10 5050 4800 640 120

Text-Independent Speaker Recognition – number of computations – GMM
scoring with k centroids ; table of ops; FS_Mode=0
Adds Mults 3-way
compariso
ns
Divisions Lookups
10
iterations
to converge
to 40
frames
(centroids)
5050 4800 640 120 0
Score GMM
with 40
frames
(centroids)
20680 19200 1040 0 1680
Total
kMeans
and GMM
25730 24000 1680 120 1680

Pushing Intelligence to Edge Nodes : Low Power circuits for Self Localization and Speaker Recognition

More Related Content

Similar to Pushing Intelligence to Edge Nodes : Low Power circuits for Self Localization and Speaker Recognition (20)

Recently uploaded (20)

Pushing Intelligence to Edge Nodes : Low Power circuits for Self Localization and Speaker Recognition