© 2021 Perceive
Facing Up To Bias
Steve Teig
Perceive
The concerning state of face recognition (FR)
[Images of news coverage: New York Times; Khan Academy]
Backlash and emerging legislation
Discrimination is pervasive…
Discrimination is pervasive… but not the whole story
• Training a neural network (typically) minimizes a loss function
• Near-universal loss function: expected value – i.e., the average – of the error
• E.g., cross-entropy H(p,q) = −E_p[log q] = average over p of −log q
• Suppose our FR training set has 10,000 white faces and 100 black faces
• Error_W = average error on white faces; Error_B = average error on black faces
• Total error is proportional to 10,000 * Error_W + 100 * Error_B
• Yup. Average error penalizes errors on white faces 100x as much as errors on black faces! (See the sketch below.)
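To make the arithmetic concrete, here is a minimal NumPy sketch. The per-example error values are hypothetical, chosen only to illustrate the 10,000-vs.-100 imbalance above:

```python
import numpy as np

# Hypothetical per-example errors: the model is far worse on the
# underrepresented group, yet the plain average barely notices.
err_white = np.full(10_000, 0.10)  # 10,000 white faces, low error each
err_black = np.full(100, 1.00)     # 100 black faces, high error each

avg = np.concatenate([err_white, err_black]).mean()
print(f"average error: {avg:.4f}")  # ~0.1089
# Halving the black-face error (1.00 -> 0.50) improves the average by
# only 100 * 0.50 / 10_100 ≈ 0.005, so gradient descent has almost no
# incentive to do it; halving the white-face error improves it by ~0.05.
```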
Of course, the trained model does better on white faces!
• Total error ∝ 100 * Error_W + 1 * Error_B
• Average error penalizes errors on white faces 100x as much as errors on black faces!
• Model compression makes this problem even worse
• Quantize the network, sparsify the network, etc.
• If the network must jettison some information to fit, the rare faces’ information is the cheapest (by average error) to discard…
Why “balancing” the dataset won’t fix this
For experts: why (naïve) GANs won’t fix this either
• GAN: Generative Adversarial Network
• Generates synthetic data points that are hard to distinguish from real data points
• Can’t we use GANs to add more representative, interesting examples to the dataset?
• Yes, but…
• Mainstream GANs optimize only for “this datum looks as though it came from the original dataset”
• What if synthetic, clean-shaven faces are easier to generate than bearded ones?
• What if white faces are easier to generate than black ones?
• More bias ☹
How much influence should one image have?
[Figure: two contrasting face images, one vs. the other]
Can we enable some images to have more influence?
• In today’s deep learning, each datum appears only once per epoch during training
• Loss L = (1/N) Σ_d error(d) → (1/N) Σ_d mass(d) · error(d), where Σ_d mass(d) = N
• Typically, mass(d) = 1 for all d → average error
• What if we increase the mass of some data points vs. others?
• Mr. Muttonchops gets mass k, where all other data points get mass (N − k)/(N − 1)
• Gradient pushes k times as hard on Mr. M
• Sounds reasonable, right?
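A minimal sketch of this remassed loss (a hypothetical helper in PyTorch terms, not Perceive’s implementation):

```python
import torch

def remassed_loss(per_example_error: torch.Tensor,
                  mass: torch.Tensor) -> torch.Tensor:
    """L = (1/N) * sum_d mass(d) * error(d), with sum_d mass(d) = N.

    mass = 1 everywhere recovers the ordinary average error.
    """
    n = per_example_error.numel()
    mass = mass * (n / mass.sum())  # renormalize so masses sum to N
    return (mass * per_example_error).mean()

# Mr. Muttonchops (index 0) gets mass k = 50; the other N - 1 points
# share the remainder, (N - k)/(N - 1) each, so the masses sum to N.
N, k = 10_100, 50.0
mass = torch.full((N,), (N - k) / (N - 1))
mass[0] = k
loss = remassed_loss(torch.rand(N), mass)
```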
Nope. Making some gradients bigger is a bad plan
[Figure: learning rate]
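Why this is a bad plan, in one line of algebra (the original figure is not reproduced here, but the next slide confirms the learning-rate framing): a vanilla SGD step on a datum d with mass k is

w ← w − η ∇_w[k · error(d)] = w − (kη) ∇_w error(d)

so giving d mass k is indistinguishable from training on d with learning rate kη, and an oversized effective learning rate causes exactly the overshooting and instability that careful learning-rate tuning exists to avoid.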
A new idea: repeated selection vs. higher “learning rate”
• If d’s relative mass = k, include d (once) in each of ~k minibatches of each epoch (see the sampler sketch after this list)
• Look at d more than once per epoch (in different local contexts)
• Now, d’s learning rate is the same as others’, but…
• d moves ~k times as far per epoch
• Wait a minute! How should we compute the mass of each datum?
• Loss_d quantifies d’s distance from happiness (happiness: Loss_d = 0)
• Lots of papers advocate Loss_d as relative importance…
• Gradient_d quantifies d’s current velocity on the path to happiness
• Lots of papers advocate Gradient_d as relative importance…
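A minimal sketch of repeated selection (hypothetical helper; the stochastic rounding of fractional masses is an assumption): instead of scaling d’s gradient, place d in ~k minibatches per epoch:

```python
import random

def remassed_epoch(num_data: int, mass, batch_size: int):
    """Yield minibatches in which datum i appears ~mass[i] times per epoch.

    mass[i] = 1 for all i reduces to ordinary one-pass shuffling.
    """
    pool = []
    for i in range(num_data):
        m = mass[i]
        # Round stochastically so the expected count equals mass[i].
        copies = int(m) + (random.random() < m - int(m))
        pool.extend([i] * copies)
    random.shuffle(pool)
    # Note: a plain shuffle can occasionally put two copies of the same
    # datum in one batch; the talk's scheme keeps them in distinct batches.
    for start in range(0, len(pool), batch_size):
        yield pool[start:start + batch_size]

# Datum 0 has relative mass 5, so it is visited ~5 times per epoch,
# each time in a different local context (minibatch).
mass = [1.0] * 1000
mass[0] = 5.0
for batch in remassed_epoch(1000, mass, batch_size=32):
    pass  # train on this batch
```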
Why loss and gradient are poor choices for mass
[Figure: four data points A, B, C, D compared by loss and gradient]
A new idea: “time to happiness”
• Distance = rate * time → Time = distance / rate
• T_d = Loss_d / Gradient_d
• Want every data point to achieve happiness at (roughly) the same time
• Otherwise, either stop before every data point is happy, …
• Or wait for eons
• Make each datum’s mass be equal to its time to happiness
• Datum with more “work to do” gets more time to do it
• Time to happiness is a better criterion than Loss_d or Gradient_d (a sketch follows below)
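A sketch of computing time-to-happiness masses (all names hypothetical; using the overall gradient norm as the “rate” is an assumption, since the talk does not specify the norm):

```python
import torch

def time_to_happiness_mass(model, loss_fn, xs, ys, eps=1e-8):
    """mass(d) ∝ T_d = Loss_d / ||Gradient_d||, rescaled so Σ_d mass(d) = N."""
    times = []
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()  # per-example gradient
        grad_sq = sum((p.grad ** 2).sum()
                      for p in model.parameters() if p.grad is not None)
        times.append(loss.item() / (grad_sq.sqrt().item() + eps))
    t = torch.tensor(times)
    return t * (len(t) / t.sum())  # normalize: masses sum to N
```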
Remassing is powerful
• Optimizes worst-case accuracy, rather than average accuracy
• No customer really cares about average accuracy, yet everybody optimizes that!
• “Accuracy: Beware of Red Herrings and Black Swans” – Embedded Vision 2020
• But wait! There’s more!
• Remassing can massively accelerate training
• Focus optimization effort on points with the most work to do
• Most data points resemble other data points: get optimized “for free”!
Facing up to bias
• Remassing optimizes worst-case accuracy, not average accuracy
• Treats rare data points and common data points as equally important
• Treats rare (explanatory) features and common features as equally important
• Remassing addresses a major source of observed bias in face recognition
Resources
• “TinyML Is Not Thinking Big Enough” (talk), 2021 Embedded Vision Summit
• Remassing based on gradient direction: https://arxiv.org/pdf/1803.09050.pdf
• Remassing based on loss: https://arxiv.org/pdf/1511.06343.pdf
• Perceive: https://www.perceive.io