2103 ACM FAccT

Human Interface Laboratory
Towards Cross-Lingual Generalization of Translation Gender Bias
March 9, 2021 @ ACM FAccT Conference
Won Ik Cho*, Jiwon Kim*, Jaeyoung Yang, Nam Soo Kim

Contents
• Translation gender bias
◦ What is the problem, and why does it matter?
◦ In which language pairs is it significant? Previous efforts
• Our approach
◦ Language pairs and template
◦ Dataset construction
◦ Measurement of fluency and biasedness
• Discussion
◦ Results and analysis
◦ Takeaways

Bias
• Bias in machine learning?
◦ Bias and variance
• Overfitting and underfitting
◦ Bias from the viewpoint of fairness in machine learning?
• A problem of individuality and context rather than of statistics and systems (Binns, 2017)
◦ Is statistical bias in machine learning related to the bias studied in fair machine learning, and to real-world social bias?
• e.g., image semantic role labeling
– Zhao et al., Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints, in Proc. EMNLP, 2017.
• This also happens in translation!

Bias
• What kinds of (social) bias have been shown in AI and NLP?
◦ Sun et al., Mitigating Gender Bias in Natural Language Processing: Literature Review, in Proc. ACL, 2019.

Overview: Gender bias in translation?
• Formulation #1
◦ Gender-neutral pronouns
• Target problem?
◦ Translation of gender-neutral pronouns into gender-specific ones
• Gender-neutral pronoun
◦ A pronoun that does not display biological gender
◦ Frequently appears in languages such as Korean, Japanese, Turkish, ...
◦ Prates et al., Assessing Gender Bias in Machine Translation: A Case Study with Google Translate, Neural Computing and Applications, 2018.
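To make formulation #1 concrete: when a sentence with a gender-neutral pronoun is translated into English, the output pronoun reveals which gender the system chose. The Python sketch below tallies that choice over a batch of translations, loosely in the spirit of Prates et al.; the keyword heuristic, the helper names, and the example outputs are assumptions for illustration, not their procedure or data.

```python
import re
from collections import Counter
from typing import Iterable

FEMALE = {"she", "her", "hers", "herself"}
MALE = {"he", "him", "his", "himself"}
# Heuristic list of English forms treated as gender-neutral renderings of a GNP.
NEUTRAL_PATTERN = re.compile(r"\bs/he\b|\bhe or she\b|\bshe or he\b|\bthey\b")


def pronoun_gender(english: str) -> str:
    """Classify an English translation by the gender of its pronouns."""
    text = english.lower()
    # Check neutral forms first so that "s/he" is not read as "he".
    if NEUTRAL_PATTERN.search(text):
        return "neutral"
    tokens = set(re.findall(r"[a-z]+", text))
    if tokens & FEMALE:
        return "female"
    if tokens & MALE:
        return "male"
    return "unknown"


def gender_distribution(translations: Iterable[str]) -> Counter:
    """Count how often each gender was chosen across translations."""
    return Counter(pronoun_gender(t) for t in translations)


if __name__ == "__main__":
    # Hypothetical MT outputs for Korean inputs of the form "걔는 [occupation]이야".
    outputs = ["He is a doctor.", "She is a nurse.", "S/he is a teacher.", "He is an engineer."]
    print(gender_distribution(outputs))
```

A skewed distribution (e.g., mostly "male" for technical occupations) is the kind of signal formulation #1 targets; keyword matching will miss rephrased outputs, so it only approximates a human check.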

• Formulation #2
◦ Gendered languages
• Target problem?
◦ Translation of expressions without gender marking into gendered forms
• Gendered languages
◦ Grammatical gender in articles, nouns, and adjectives
◦ Differs from biological gender
◦ Vanmassenhove et al., Getting Gender Right in Neural Machine Translation, in Proc. EMNLP, 2018.

• Why do they matter?
◦ The result can be offensive to end users
• When do they matter?
◦ Whether or not the users are familiar with the source/target language
• Who may be offended?
◦ Potentially any user, especially if the mistranslation involves social stereotypes
• Research questions
◦ How can the evaluation incorporate various aspects of translation gender bias?
◦ How do grammatical properties and resource conditions influence the bias?

Template-based attacks
• 걔(s/he)는 [##]이야! ("S/he is a [##]!")
◦ Cho et al., On Measuring Gender Bias in Translation of Gender-neutral Pronouns, in Proc. GeBNLP (ACL Workshop), 2019.
• Why Korean?
◦ Displays various sentence styles
◦ Translation services are popular among Korean users
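The template above can be instantiated programmatically. This is a minimal Python sketch, not the authors' code: the occupation list is hypothetical, and the `korean_copula` helper is an assumption beyond the slide's fixed 이야. It picks 이야 after a final consonant and 야 after a vowel so that every generated sentence stays grammatical.

```python
# Fill the slot of the template "걔는 [##]이야!" with occupation words.
# OCCUPATIONS is a hypothetical sample, not the word list used in the study.
OCCUPATIONS = ["의사", "간호사", "선생님", "경찰관"]  # doctor, nurse, teacher, police officer

HANGUL_BASE, HANGUL_COUNT = 0xAC00, 11172  # precomposed Hangul syllable block


def korean_copula(noun: str) -> str:
    """Return '이야' if the noun's last syllable has a final consonant, else '야'."""
    offset = ord(noun[-1]) - HANGUL_BASE
    # Each initial+medial combination spans 28 code points; index 0 means no final consonant.
    has_final = 0 <= offset < HANGUL_COUNT and offset % 28 != 0
    return "이야" if has_final else "야"


def fill_template(occupation: str) -> str:
    """Build one test sentence with the gender-neutral pronoun 걔 (s/he)."""
    return f"걔는 {occupation}{korean_copula(occupation)}!"


if __name__ == "__main__":
    for occ in OCCUPATIONS:
        print(fill_template(occ))
```

Each generated sentence would then be sent to an MT system, and the gender of the English pronoun in the output inspected.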

Semantic cues
• WinoMT
◦ Stanovsky et al., Evaluating Gender Bias in Machine Translation, in Proc. ACL, 2019.
- Performance can differ even among languages of the same family
- High accuracy on stereotypical cases does not guarantee high accuracy on the anti-stereotypical counterparts
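The second observation can be checked by splitting a WinoMT-style test set into stereotypical and anti-stereotypical instances and comparing accuracies. The Python sketch below is similar in spirit to WinoMT's stereotype gap; the `Example` record and the toy judgments are hypothetical, and the predicted gender is assumed to have already been extracted from each translation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    gold_gender: str       # gender of the entity in the English source
    predicted_gender: str  # gender realized in the translation (assumed already extracted)
    stereotypical: bool    # True if the gold gender matches the occupation stereotype


def accuracy(examples: List[Example]) -> float:
    """Fraction of examples whose translated gender matches the gold gender."""
    if not examples:
        raise ValueError("empty subset")
    return sum(e.gold_gender == e.predicted_gender for e in examples) / len(examples)


def stereotype_gap(examples: List[Example]) -> float:
    """Accuracy on stereotypical minus accuracy on anti-stereotypical examples."""
    pro = [e for e in examples if e.stereotypical]
    anti = [e for e in examples if not e.stereotypical]
    return accuracy(pro) - accuracy(anti)


if __name__ == "__main__":
    toy = [
        Example("male", "male", True),      # stereotypical, correct
        Example("female", "female", True),  # stereotypical, correct
        Example("female", "male", False),   # anti-stereotypical, mistranslated
        Example("male", "male", False),     # anti-stereotypical, correct
    ]
    print(accuracy([e for e in toy if e.stereotypical]), stereotype_gap(toy))
```

A positive gap means the system does better when the sentence agrees with the stereotype, which is exactly why high stereotypical accuracy alone is not reassuring.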

Our approach
• Combined approach: cross-lingual evaluation?
◦ Covers two different types of translation gender bias
• Different types of gender bias can be observed within a single translation system

Our approach
• Language pairs
◦ Different types of language pairs should be investigated:
• 1) Languages with gender-neutral pronouns (GNPs)
• 2) Languages without (colloquially used) GNPs that are also not grammatically gendered
• 3) Languages with grammatical gender

Our approach
• Language pairs
◦ Linguistic criterion: presence of GNPs and grammatical gender
• Korean/Tagalog → English → German/Portuguese
◦ Resource criterion: amount of available language resources
• Korean > Tagalog
• German > Portuguese
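The two criteria can be encoded as a small lookup table from which the evaluated translation directions follow. This Python sketch is illustrative only: the `LangType` enum, the use of ISO 639-1 codes, and the helper names are hypothetical, and the resource ranking simply restates Korean > Tagalog and German > Portuguese.

```python
from enum import Enum
from typing import List, Tuple


class LangType(Enum):
    GNP = 1           # type 1: has gender-neutral pronouns
    NON_GENDERED = 2  # type 2: no colloquial GNPs, no grammatical gender
    GENDERED = 3      # type 3: grammatical gender


LANGS = {
    "ko": LangType.GNP,
    "tl": LangType.GNP,
    "en": LangType.NON_GENDERED,
    "de": LangType.GENDERED,
    "pt": LangType.GENDERED,
}

# Within a type, languages ordered from more to fewer available resources.
RESOURCE_ORDER = {
    LangType.GNP: ["ko", "tl"],
    LangType.GENDERED: ["de", "pt"],
}


def resource_rank(lang: str) -> int:
    """0 for the higher-resource language of its type, 1 for the lower-resource one."""
    return RESOURCE_ORDER[LANGS[lang]].index(lang)


def evaluation_directions(pivot: str = "en") -> List[Tuple[str, str]]:
    """Type 1 languages translate into the pivot; the pivot translates into type 3 languages."""
    into_pivot = [(src, pivot) for src, t in LANGS.items() if t is LangType.GNP]
    from_pivot = [(pivot, tgt) for tgt, t in LANGS.items() if t is LangType.GENDERED]
    return into_pivot + from_pivot


if __name__ == "__main__":
    print(evaluation_directions())
```

Keeping the typology and the resource ranking as separate tables makes it possible to compare languages of the same type while varying only the resource condition.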

Our approach
• Template
◦ How can all the pairs be evaluated simultaneously?
Korean: 그 여자에 대해 하나 말하자면, 걔는 초등학교 선생님이야.
Tagalog: Isang bagay tungkol sa babae, siya ay isang guro sa elementarya.
English: One thing about the woman, she is an elementary school teacher.
German: Eine Sache über die Frau, sie ist eine Grundschullehrerin.
Portuguese: Um facto sobre a mulher, ela é professora do ensino primário.
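Because the antecedent noun (여자/babae, "woman") carries the biological gender while 걔 and siya stay gender-neutral, one template yields aligned sentences in all five languages. The Python sketch below regenerates the example above and its male counterpart. The male noun phrases, the pronoun tables, and the `korean_copula` helper are assumptions for illustration, and occupation phrases (with their articles and gendered forms) must be supplied per language.

```python
from typing import Dict

HANGUL_BASE, HANGUL_COUNT = 0xAC00, 11172

# Antecedent noun phrases carrying biological gender (the male forms are assumed).
NOUN = {
    "female": {"ko": "여자", "tl": "babae", "en": "the woman", "de": "die Frau", "pt": "a mulher"},
    "male": {"ko": "남자", "tl": "lalaki", "en": "the man", "de": "den Mann", "pt": "o homem"},
}
# Gendered pronouns; Korean (걔) and Tagalog (siya) are gender-neutral and fixed in the template.
PRON = {
    "female": {"en": "she", "de": "sie", "pt": "ela"},
    "male": {"en": "he", "de": "er", "pt": "ele"},
}


def korean_copula(noun: str) -> str:
    """'이야' after a final consonant, '야' after a vowel."""
    offset = ord(noun[-1]) - HANGUL_BASE
    return "이야" if 0 <= offset < HANGUL_COUNT and offset % 28 else "야"


def instantiate(gender: str, occ: Dict[str, str]) -> Dict[str, str]:
    """Fill the five-language template for one gender and one occupation."""
    n, p = NOUN[gender], PRON[gender]
    return {
        "ko": f"그 {n['ko']}에 대해 하나 말하자면, 걔는 {occ['ko']}{korean_copula(occ['ko'])}.",
        "tl": f"Isang bagay tungkol sa {n['tl']}, siya ay {occ['tl']}.",
        "en": f"One thing about {n['en']}, {p['en']} is {occ['en']}.",
        "de": f"Eine Sache über {n['de']}, {p['de']} ist {occ['de']}.",
        "pt": f"Um facto sobre {n['pt']}, {p['pt']} é {occ['pt']}.",
    }


if __name__ == "__main__":
    teacher_f = {
        "ko": "초등학교 선생님",
        "tl": "isang guro sa elementarya",
        "en": "an elementary school teacher",
        "de": "eine Grundschullehrerin",
        "pt": "professora do ensino primário",
    }
    for lang, sentence in instantiate("female", teacher_f).items():
        print(lang, sentence)
```

Only the English, German, and Portuguese sides expose gender in the pronoun, so the KO/TL → EN and EN → DE/PT directions can be scored from the same instances.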


Our approach
• Evaluation
◦ Template-based construction of the evaluation set
◦ Inference with publicly available MT modules
◦ Human evaluation (gender-related) and automatic metrics (fluency)
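The three steps can be wired together as below. This is a hedged Python sketch, not the authors' pipeline: translators are passed in as plain functions (no real MT API is called), each direction is assumed to translate the template sentence of its source language directly, and the template sentence in the target language is assumed to serve as the reference for the automatic fluency metrics. The exported CSV leaves an empty `gender_correct` column for human annotators.

```python
import csv
import io
from typing import Callable, Dict, List, Tuple

# A translator maps (text, source_lang, target_lang) to a translation.
Translator = Callable[[str, str, str], str]


def run_inference(
    eval_set: List[Dict[str, str]],
    directions: List[Tuple[str, str]],
    systems: Dict[str, Translator],
) -> List[Dict[str, str]]:
    """Translate every template instance in every direction with every system."""
    rows = []
    for item in eval_set:
        for src, tgt in directions:
            for name, translate in systems.items():
                rows.append({
                    "id": item["id"],
                    "gender": item["gender"],
                    "system": name,
                    "direction": f"{src}-{tgt}",
                    "source": item[src],
                    "hypothesis": translate(item[src], src, tgt),
                    "reference": item[tgt],  # assumed: template sentence in the target language
                })
    return rows


def export_for_annotation(rows: List[Dict[str, str]]) -> str:
    """CSV with an empty column that human annotators fill with gender judgments."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[*rows[0].keys(), "gender_correct"])
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "gender_correct": ""})
    return buf.getvalue()


if __name__ == "__main__":
    item = {
        "id": "0", "gender": "female",
        "ko": "그 여자에 대해 하나 말하자면, 걔는 초등학교 선생님이야.",
        "en": "One thing about the woman, she is an elementary school teacher.",
    }
    # Hypothetical stand-in for a public MT module.
    fake_mt: Translator = lambda text, src, tgt: "One thing about the woman, he is an elementary school teacher."
    rows = run_inference([item], [("ko", "en")], {"fake": fake_mt})
    print(export_for_annotation(rows))
```

Separating inference from annotation keeps the gender judgments human-made while the `hypothesis`/`reference` columns can be fed straight to automatic fluency metrics.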

Our approach
• Measurement
◦ Biasedness
• Accuracy on biological gender
• Accuracy on grammatical gender
• Disparate impact
– Accuracy on female cases divided by accuracy on male cases
◦ Fluency
• BLEU
– For EN, DE, and PT outputs
• BERTScore
– With multilingual BERT
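Disparate impact as defined above is a simple ratio, DI = Acc_female / Acc_male: a value of 1 indicates parity, and values below 1 mean female cases are translated correctly less often. A minimal Python sketch, assuming each translation has already received a binary gender-correctness judgment (e.g., from the human evaluation); the record format and toy judgments are hypothetical.

```python
from typing import Iterable, List, Tuple

# Each record: (gender of the template instance, whether the translation got the gender right).
Record = Tuple[str, bool]


def accuracy(judgments: List[bool]) -> float:
    """Fraction of correct judgments."""
    if not judgments:
        raise ValueError("no judgments for this group")
    return sum(judgments) / len(judgments)


def disparate_impact(records: Iterable[Record]) -> float:
    """Accuracy on female cases divided by accuracy on male cases."""
    records = list(records)
    acc_female = accuracy([ok for g, ok in records if g == "female"])
    acc_male = accuracy([ok for g, ok in records if g == "male"])
    if acc_male == 0:
        raise ZeroDivisionError("accuracy on male cases is zero; DI is undefined")
    return acc_female / acc_male


if __name__ == "__main__":
    toy = [("female", True), ("female", False), ("female", True), ("female", True),
           ("male", True), ("male", True), ("male", True), ("male", True)]
    print(disparate_impact(toy))  # 0.75
```

The same two functions apply unchanged to biological-gender and grammatical-gender judgments; only the source of the boolean differs.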

Results and analysis
• Results
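◦ More bias-related errors in EN → DE/PT than in KO/TL → EN
• EN → DE: "She is a game programmer" → "Sie ist ein professioneller Spieler" (literally "She is a professional player", with a masculine noun)
• EN → PT: aviador, soldado, monge (masculine forms of airman, soldier, monk)
• Exceptional cases for Bing KO → EN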

Results and analysis
• Analysis
◦ Unbiasedness/disparate impact
• Higher among type 1 languages
– DE, PT < KO, TL (overall)
• Within the same type, resources seem to matter (the lower-resource language scores higher)
– DE < PT, KO < TL
◦ Fluency measurement
• Lexical and semantic measures give different results
– BLEU (lexical): DE > PT > KO, TL
– BERTScore (semantic): DE < PT, KO < TL
◦ Observations
• A larger amount of available language resources (only assumed here, since the systems are public MT modules) does not guarantee unbiased translation, even if some fluency measures are higher
• The gender-related evaluation relates differently to each fluency measure
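Why a lexical and a semantic measure can disagree, and why neither substitutes for the gender-related evaluation, can be seen with n-gram overlap alone. The Python sketch below is a simplified sentence-level BLEU (up to bigrams, brevity penalty, no smoothing), not standard corpus-level BLEU or any specific toolkit; the example sentences are hypothetical. A translation with the wrong pronoun keeps most n-grams and scores high, while a correct paraphrase shares few n-grams and scores zero.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of contiguous n-grams."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hyp: List[str], ref: List[str], n: int) -> float:
    """Clipped n-gram precision of the hypothesis against a single reference."""
    hyp_counts = ngrams(hyp, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    ref_counts = ngrams(ref, n)
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total


def sentence_bleu(hypothesis: str, reference: str, max_n: int = 2) -> float:
    """Geometric mean of 1..max_n precisions times the brevity penalty (no smoothing)."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    brevity = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)


if __name__ == "__main__":
    ref = "she is an elementary school teacher"
    wrong_gender = "he is an elementary school teacher"  # gender error, high overlap
    paraphrase = "she teaches at a primary school"       # gender correct, low overlap
    print(round(sentence_bleu(wrong_gender, ref), 3))  # 0.816
    print(sentence_bleu(paraphrase, ref))              # 0.0
```

An embedding-based measure such as BERTScore would likely rate the paraphrase much higher, which is one way the two fluency measures can rank systems differently.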

Takeaways
• Translation gender bias is problematic since wrong results can be
offensive to end users
• Translation gender bias matters regardless of the user's proficiency in the language, and is especially offensive if the mistranslation involves social stereotypes
• Our approach, comprising the template and the measurements, enables a unified evaluation of translation gender bias across various language pairs
• Our evaluation results suggest that inductive bias in the form of social stereotypes is a major cause of the errors, and that augmenting training corpora may not be a solution

Reference (order of appearance)
• Binns, Reuben. "Fairness in Machine Learning: Lessons from Political Philosophy." arXiv preprint arXiv:1712.03586 (2017).
• Zhao, Jieyu, et al. "Men Also Like Shopping: Reducing Gender Bias Amplification Using Corpus-level Constraints." arXiv preprint arXiv:1707.09457 (2017).
• Sun, Tony, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, and William Yang Wang. "Mitigating Gender Bias in Natural Language Processing: Literature Review." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1630–1640. 2019.
• Prates, Marcelo O. R., Pedro H. Avelar, and Luís C. Lamb. "Assessing Gender Bias in Machine Translation: A Case Study with Google Translate." Neural Computing and Applications (2018): 1–19.
• Vanmassenhove, Eva, Christian Hardmeier, and Andy Way. "Getting Gender Right in Neural Machine Translation." In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3003–3008. 2018.
• Cho, Won Ik, et al. "On Measuring Gender Bias in Translation of Gender-neutral Pronouns." GeBNLP 2019 (2019): 173.
• Stanovsky, Gabriel, Noah A. Smith, and Luke Zettlemoyer. "Evaluating Gender Bias in Machine Translation." arXiv preprint arXiv:1906.00591 (2019).
