Data ethics and machine learning: discrimination, algorithmic bias, and how to discover them. Dino Pedreschi

Data ethics and machine learning
Discrimination, algorithmic bias, and
how to discover them.
DINO PEDRESCHI
KDDLAB, DIPARTIMENTO DI INFORMATICA, UNIVERSITÀ DI PISA

5
Spot business trends
Prevent diseases
Fight crime
Improve transportation
Personalised services
Improve wellbeing

Event Detection
Detecting events in a geographic area
classifying the different kinds of users.
City of Rome
Metropolitan area
Covered geographical region: city of Rome
Dataset size per snapshot: ≈ 1.2 GBytes per day
Number of records: ≈ 5.6 million lines per day
8 months between 2015 and 2016

End users
Traveler
Mobility
Manager
City

Personal mobility assistant
12
Carpooling
Network

Estimating wellbeing with mobility data
AI and Big Data 13
A
B
C
H
W

Predicting GDP with Retail Market data
14
generic utility
function
(rationality)
personal utility
function
(diversity)
Product
Price
Quantity
Needed
Sophistication
R2 = 17.25% R2 = 32.38%
R2 = 85.72%

Big Data, Big Risks
Big data is algorithmic, therefore it cannot be biased! And yet…
• All traditional evils of social discrimination, and many new ones, exhibit
themselves in the big data ecosystem
• Because of its tremendous power, massive data analysis must be used
responsibly
• Technology alone won’t do: also need policy, user involvement and
education efforts
16

By 2018, 50% of business ethics
violations will occur through
improper use of big data analytics
[source: Gartner, 2016]
AI and Big Data 17

The danger of black boxes - 1
The COMPAS score (Correctional Offender Management Profiling for
Alternative Sanctions)
A 137-questions questionnaire and a predictive model for “risk of
crime recidivism.” The model is a proprietary secret of Northpointe,
Inc.
The data journalists at propublica.org have shown that
• the prediction accuracy of recidivism is rather low (around 60%)
• the model has a strong ethnic bias
◦ blacks who did not reoffend are classified as high risk twice as much as
whites who did not reoffend
◦ whites who did reoffend were classified as low risk twice as much as
blacks who did reoffend.
AI and Big Data 20

The danger of black boxes -2
The three major US credit bureaus, Experian, TransUnion, and
Equifax, providing credit scoring for millions of individuals, are
often discordant.
In a study of 500,000 records, 29% of consumers received credit
scores that differ by at least fifty points between credit bureaus, a
difference that may mean tens of thousands dollars over the life of
a mortgage [CRS+16].
AI and Big Data 21

In 2010, some homeowners with a regular payment
history of their mortgage reported a sudden drop of forty
points in their credit score, soon after their own enquiry.
AI and Big Data 22

During the 1970s and 1980s, St. George’s Hospital
Medical School in London used a computer program for
initial screening of job applicants.
The program used information from applicants’ forms,
which contained no reference to ethnicity.
The program was found to unfairly discriminate against
female applicants and ethnic minorities (inferred from
surnames and place of birth), less likely to be selected for
interview [LM88].
AI and Big Data 23

In a recent paper at SIGKDD 2016 [RSG16] the authors
show how an accurate but untrustworthy classifier may
result from an accidental bias in the training data.
In a task of discriminating wolves from huskies in a
dataset of images, the resulting deep learning model is
shown to classify a wolf in a picture based solely on …
AI and Big Data 24

In a recent paper at SIGKDD 2016 [RSG16] the authors
show how an accurate but untrustworthy classifier may
result from an accidental bias in the training data.
In a task of discriminating wolves from huskies in a
dataset of images, the resulting deep learning model is
shown to classify a wolf in a picture based solely on …
the presence of snow in the background!
[RSG16] “Why Should I Trust You?” Explaining the Predictions of Any Classifier
SIGKDD 2016 Conference Paper
AI and Big Data 25

Deep learning is creating computer
systems we don't fully understand
www.theverge.com/2016/7/12/12158238/first-click-deep-learning-algorithmic-
black-boxes
AI and Big Data 26

Is AI Permanently Inscrutable?
nautil.us/issue/40/learning/is-artificial-intelligence-permanently-inscrutable
27

In a recent study at Princeton Univ, the authors show
how the semantics derived automatically from large
text/web corpora contains human biases
◦ E.g., names associated with whites were found to be
significantly easier to associate with pleasant than
unpleasant terms, compared to names associated with
black people.
Therefore, any machine learning model trained on text
data for, e.g., sentiment or opinion mining has a strong
chance of inheriting the prejudices reflected in the
human-produced training data.
AI and Big Data 28

Human Bias can be Learned - 7
AI and Big Data 30

As we stated in our 2008 SIGKDD paper that started the field of
discrimination-aware data mining [PRT08]:
“learning from historical data recording human decision making
may mean to discover traditional prejudices that are endemic in
reality, and to assign to such practices the status of general rules,
maybe unconsciously, as these rules can be deeply hidden within
the learned classifier.”
AI and Big Data 31

Satya Nadella's rules for AI
www.theverge.com/2016/6/29/12057516/satya-nadella-ai-robot-laws
AI and Big Data 33

U.S. – F.T.C.
Salvatore Ruggieri 34
www.ftc.gov/system/files/documents/reports/big-data-tool-inclusion-or-
exclusion-understanding-issues/160106big-data-rpt.pdf (Sept. 2014)

U.S. – White House
www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1
_2014.pdf (May 2014)

Salvatore Ruggieri
36
www.whitehouse.gov/sites/default/files/microsites/ostp/2016_0504_data_disc
rimination.pdf (May 2016)

www.whitehouse.gov/sites/default/files/whitehouse_files/microsites/ostp/NST
C/preparing_for_the_future_of_ai.pdf (October 2016)
AI and Big Data 37

E.U. - EDPS
secure.edps.europa.eu/EDPSWEB/webdav/site/mySite/shared/Documents/Con
sultation/Opinions/2015/15-11-19_Big_Data_EN.pdf

E.U. - EDPS
secure.edps.europa.eu/EDPSWEB/webdav/site/mySite/shared/Documents/Con
sultation/Opinions/2015/15-09-11_Data_Ethics_EN.pdf

Netherlands
www.knaw.nl/en/news/publications/ethical-and-legal-aspects-of-informatics-
research (September 2016)
AI and Big Data 40

Big Data Ethics
informationaccountability.org/big-data-ethics-initiative/
AI and Big Data 41

Value-Sensitive Design
Design for privacy
Design for security
Design for inclusion
Design for sustainability
Design for democracy
Design for safety
Design for transparency
Design for accountability
Design for human capabilities
AI and Big Data 42

EU Projects: SoBigData.eu
Social Mining & Big Data Ecosystem project (SoBigData, H2020-INFRAIA-2014-2015,
duration: 2015-2019, www.sobigdata.eu
AI and Big Data 43

Master Universitario Di II Livello
BigData Technology
BigData Sensing&Procurement
BigData Mining
BigData StoryTelling
BigData Ethics
Il Master Big Data ha l’obiettivo di formare“data scientists”,dei
professionisti dotati di un mix di competenze multidisciplinari
che permettono non solo di acquisire dati ed estrarne conos-
cenza, ma anche di raccontare“storie” attraverso questi dati, a
supporto delle decisioni, della creatività e dello sviluppo di
servizi innovativi, e di saper gestire le ripercussioni etiche e
legali dei Big Data, che spesso contengono informazioni
personali e suscitano problematiche relative alla privacy, alla
trasparenza,alla consapevolezza.
Aree di innovazione socio-economica:
BigData for Social Good
BigData forBusiness
Big Data AnalyticsESocial Mining
SoBigData
Data Ethics Literacy
Rapporto MIUR su Big Data, 28 Luglio 2016
◦ www.istruzione.it/allegati/2016/bigdata.pdf
Master UNIPI in Big Data Analytics & Social Mining
◦ masterbigdata.it
AI and Big Data 44

Data ethics
technologies
DISCRIMINATION DISCOVERY FROM DATA

Discrimination discovery
Given:
◦ an historical database of decision records, each describing
features of an applicant to a benefit
◦ e.g., a credit request to a bank and the corresponding on credit approval/denial
◦ some designated categories of applicants, such as groups
protected by anti-discrimination laws,
find whether, and in which circumstances, there are
evidences of discrimination of the designated categories
that emerge from the data.
DCUBE: Discrimination Discovery in Databases 47

German Credit dataset

How? Fight with the same weapons
Idea: use data mining to discover discrimination
◦ the decision policies hidden in a database can be represented by
decision rules and discovered by frequent pattern mining
◦ Once found all such decision rules, highlight all potential niches
of discrimination by filtering the rules using a measure that
quantifies the discrimination risk.

Discrimination discovery from data
FOREIGN_WORKER=yes
& PURPOSE=new_car & HOUSING=own
 CREDIT=bad
◦ elift = 5,19 supp = 56 conf = 0,37
elift = 5,19 means that foreign workers have more than 5
times more probability of being refused credit than the
average population (even if they own their house).
50

 Outcome:
 Funded
 Not funded
 Conditionally funded
Case Study: grant evaluation
51

Dataset attributes
52
Features of the PI
Project costs
Research Area
Project Evaluation

A potentially discriminatory rule
Antecedent
◦ Project proposals in “Physical and Analytical
Chemical Sciences”
◦ Young females
◦ Total cost of 1,358,000 Euros or above
Possible interpretation
◦ “Peer-reviewers of panel PE4 trusted young females
requiring high budgets less than males leading
similar projects”
53

Case study: US Harmonized Tariff System
US Harmonized Tariff System (HTS)
https://hts.usitc.gov/
Detailed tariff classification system for
merchandise imported to US
Chapter 61, 62, 64, 65: apparels
◦ Different taxes for same garments
separately produced for male and female
◦ Description is at semi-structured form
64.4¢/kg + 18.8%96¢/doz + 1.4%8.5%Women and
girls
38.6¢/kg + 10%08.9%Men and boys
CoatsFur felt hatsCotton pajamas
Different
taxes for
same
apparels for
men and
women
64.4¢/kg + 18.8%96¢/doz + 1.4%8.5%Women and
girls
38.6¢/kg + 10%08.9%Men and boys
CoatsFur felt hatsCotton pajamas
Different
taxes for
same
apparels for
men and
women
54
Women: 14%
Men: 9%
1.3 billions USD!!!

AI and Big Data 55
Totes-Isotoner Corp. v. U.S.
Rack Room Shoes Inc. and
Forever 21 Inc. vs U.S.
Court of International Trade
U.S. Court of Appeals for the Federal
Circuit (2014)
“[…] the courts may have concluded that
Congress had no discriminatory intent when
ruling the HTS, but there is little
doubt that gender-based tariffs have
discriminatory impact”

Sample rule from the HTS dataset
AI and Big Data 56

Soccer Player Ratings
How humans
evaluate sports
performance?

Human evaluation line
Technical
features
Machine
performance

Human evaluation line
Technical
features
Technical+Contextual
features
Machine
performance

Wrapping up
AI AND BIG DATA 62

Right of explanation
• Applying AI within many domains requires
transparency and responsibility:
• health care
• finance
• surveillance
• autonomous vehicles
• Government
• EU General Data Protection Regulation (April
2016) establishes (?) a right of explanation
for all individuals to obtain “meaningful
explanations of the logic involved” when
automated (algorithmic) individual decision-
making, including profiling, takes place.
• In sharp contrast, (big) data-driven AI/ML
models are often black boxes.
AI and Big Data 63

Accountability
“Why exactly was my loan application rejected?”
“What could I have done differently so that my application
would not have been rejected?”
AI and Big Data 64

Social Mining & Big Data Ecosystem
www.sobigdata.eu

66
Knowledge Discovery
& Data Mining Lab
http://kdd.isti.cnr.it

Special thanks
• Salvatore Ruggieri
• Franco Turini
• Fosca Giannotti
• Anna Monreale
• Luca Pappalardo
SMARTCATs

Data ethics and machine learning: discrimination, algorithmic bias, and how to discover them. Dino Pedreschi

More Related Content

What's hot (20)

Similar to Data ethics and machine learning: discrimination, algorithmic bias, and how to discover them. Dino Pedreschi (20)

More from Data Driven Innovation (20)

Recently uploaded (20)

Data ethics and machine learning: discrimination, algorithmic bias, and how to discover them. Dino Pedreschi