Data ethics and machine learning
Discrimination, algorithmic bias, and
how to discover them.
DINO PEDRESCHI
KDDLAB, DIPARTIMENTO DI INFORMATICA, UNIVERSITÀ DI PISA
Opportunities of big data
Spot business trends
Prevent diseases
Fight crime
Improve transportation
Personalised services
Improve wellbeing
Event Detection
Detecting events in a geographic area,
classifying the different kinds of users.
City of Rome
Metropolitan area
Covered geographical region: city of Rome
Dataset size per snapshot: ≈ 1.2 GBytes per day
Number of records: ≈ 5.6 million lines per day
8 months between 2015 and 2016
San Pietro
San Giovanni
Circo Massimo
Stadio Olimpico
End users
Traveler
Mobility
Manager
City
Personal mobility assistant
Carpooling
Network
Estimating wellbeing with mobility data
AI and Big Data 13
Predicting GDP with Retail Market data
Generic utility function (rationality); personal utility function (diversity)
[Figure labels: Product, Price, Quantity, Needed, Sophistication]
R2 = 17.25%, R2 = 32.38%, R2 = 85.72%
Risks of big data
Big Data, Big Risks
Big data is algorithmic, therefore it cannot be biased! And yet…
• All traditional evils of social discrimination, and many new ones, exhibit
themselves in the big data ecosystem
• Because of its tremendous power, massive data analysis must be used
responsibly
• Technology alone won’t do: we also need policy, user involvement and
education efforts
By 2018, 50% of business ethics
violations will occur through
improper use of big data analytics
[source: Gartner, 2016]
The danger of black boxes - 1
The COMPAS score (Correctional Offender Management Profiling for
Alternative Sanctions)
A 137-question questionnaire and a predictive model for “risk of
crime recidivism.” The model is a proprietary secret of Northpointe,
Inc.
The data journalists at propublica.org have shown that
• the prediction accuracy of recidivism is rather low (around 60%)
• the model has a strong ethnic bias
◦ Black defendants who did not reoffend were classified as high risk twice as often as
white defendants who did not reoffend
◦ white defendants who did reoffend were classified as low risk twice as often as
Black defendants who did reoffend.
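The disparity described above is a gap in group-wise error rates. A minimal sketch of how such a gap can be measured, assuming records of the hypothetical form (group, predicted high risk, reoffended). The function name and the toy data are illustrative assumptions, not the ProPublica analysis or its figures:

```python
from collections import defaultdict


def error_rates_by_group(records):
    """Return {group: (false_positive_rate, false_negative_rate)}.

    records: iterable of (group, predicted_high_risk, reoffended) tuples.
    """
    counts = defaultdict(lambda: {"fp": 0, "tn": 0, "fn": 0, "tp": 0})
    for group, predicted_high, reoffended in records:
        if reoffended:
            key = "tp" if predicted_high else "fn"
        else:
            key = "fp" if predicted_high else "tn"
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        negatives = c["fp"] + c["tn"]  # people who did not reoffend
        positives = c["fn"] + c["tp"]  # people who did reoffend
        fpr = c["fp"] / negatives if negatives else float("nan")
        fnr = c["fn"] / positives if positives else float("nan")
        rates[group] = (fpr, fnr)
    return rates


# Synthetic toy data for two hypothetical groups "A" and "B".
toy = (
    [("A", True, False)] * 4 + [("A", False, False)] * 6
    + [("A", True, True)] * 7 + [("A", False, True)] * 3
    + [("B", True, False)] * 2 + [("B", False, False)] * 8
    + [("B", True, True)] * 4 + [("B", False, True)] * 6
)
rates = error_rates_by_group(toy)
```

In this toy data, group A has twice the false positive rate of group B, and group B has twice the false negative rate of group A. Overall accuracy alone would not reveal this pattern.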
The danger of black boxes - 2
The three major US credit bureaus, Experian, TransUnion, and
Equifax, which provide credit scoring for millions of individuals, are
often discordant.
In a study of 500,000 records, 29% of consumers received credit
scores that differed by at least fifty points between credit bureaus, a
difference that may mean tens of thousands of dollars over the life of
a mortgage [CRS+16].
The danger of black boxes - 3
In 2010, some homeowners with a regular mortgage payment
history reported a sudden drop of forty
points in their credit score, soon after their own enquiry.
The danger of black boxes - 4
During the 1970s and 1980s, St. George’s Hospital
Medical School in London used a computer program for
initial screening of job applicants.
The program used information from applicants’ forms,
which contained no reference to ethnicity.
The program was found to unfairly discriminate against
female applicants and ethnic minorities (inferred from
surnames and place of birth), who were less likely to be selected for
interview [LM88].
The danger of black boxes - 5
In a recent paper at SIGKDD 2016 [RSG16] the authors
show how an accurate but untrustworthy classifier may
result from an accidental bias in the training data.
In a task of discriminating wolves from huskies in a
dataset of images, the resulting deep learning model is
shown to classify a wolf in a picture based solely on …
the presence of snow in the background!
[RSG16] “Why Should I Trust You?” Explaining the Predictions of Any Classifier
SIGKDD 2016 Conference Paper
Deep learning is creating computer
systems we don't fully understand
www.theverge.com/2016/7/12/12158238/first-click-deep-learning-algorithmic-black-boxes
Is AI Permanently Inscrutable?
nautil.us/issue/40/learning/is-artificial-intelligence-permanently-inscrutable
The danger of black boxes - 6
In a recent study at Princeton University, the authors show
how the semantics derived automatically from large
text/web corpora contain human biases
◦ E.g., names associated with white people were found to be
significantly easier to associate with pleasant than
unpleasant terms, compared to names associated with
black people.
Therefore, any machine learning model trained on text
data for, e.g., sentiment or opinion mining has a strong
chance of inheriting the prejudices reflected in the
human-produced training data.
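A rough sketch of how such an association can be measured on word vectors: compare a name's mean cosine similarity to pleasant terms with its mean similarity to unpleasant terms. The 2-d vectors below are invented for illustration. A real analysis would use embeddings learned from a corpus, and the function names are assumptions rather than the study's code:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(word_vec, pleasant_vecs, unpleasant_vecs):
    """Mean similarity to pleasant terms minus mean similarity to unpleasant terms."""
    pleasant = sum(cosine(word_vec, p) for p in pleasant_vecs) / len(pleasant_vecs)
    unpleasant = sum(cosine(word_vec, q) for q in unpleasant_vecs) / len(unpleasant_vecs)
    return pleasant - unpleasant


# Hypothetical toy vectors (NOT taken from any trained embedding).
pleasant = [(1.0, 0.1), (0.9, 0.2)]
unpleasant = [(0.1, 1.0), (0.2, 0.9)]
name_group_1 = (0.8, 0.3)
name_group_2 = (0.3, 0.8)

gap_1 = association(name_group_1, pleasant, unpleasant)
gap_2 = association(name_group_2, pleasant, unpleasant)
```

A positive gap for one group of names and a negative gap for the other is the kind of asymmetry a downstream sentiment model could inherit.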
Human Bias
Human Bias can be Learned - 7
As we stated in our 2008 SIGKDD paper that started the field of
discrimination-aware data mining [PRT08]:
“learning from historical data recording human decision making
may mean to discover traditional prejudices that are endemic in
reality, and to assign to such practices the status of general rules,
maybe unconsciously, as these rules can be deeply hidden within
the learned classifier.”
Policies
BIG DATA ETHICS
Satya Nadella's rules for AI
www.theverge.com/2016/6/29/12057516/satya-nadella-ai-robot-laws
U.S. – F.T.C.
Salvatore Ruggieri 34
www.ftc.gov/system/files/documents/reports/big-data-tool-inclusion-or-exclusion-understanding-issues/160106big-data-rpt.pdf (Sept. 2014)
U.S. – White House
www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf (May 2014)
U.S. – White House
www.whitehouse.gov/sites/default/files/microsites/ostp/2016_0504_data_discrimination.pdf (May 2016)
U.S. – White House
www.whitehouse.gov/sites/default/files/whitehouse_files/microsites/ostp/NSTC/preparing_for_the_future_of_ai.pdf (October 2016)
E.U. - EDPS
secure.edps.europa.eu/EDPSWEB/webdav/site/mySite/shared/Documents/Consultation/Opinions/2015/15-11-19_Big_Data_EN.pdf
E.U. - EDPS
secure.edps.europa.eu/EDPSWEB/webdav/site/mySite/shared/Documents/Consultation/Opinions/2015/15-09-11_Data_Ethics_EN.pdf
Netherlands
www.knaw.nl/en/news/publications/ethical-and-legal-aspects-of-informatics-research (September 2016)
Big Data Ethics
informationaccountability.org/big-data-ethics-initiative/
Value-Sensitive Design
Design for privacy
Design for security
Design for inclusion
Design for sustainability
Design for democracy
Design for safety
Design for transparency
Design for accountability
Design for human capabilities
EU Projects: SoBigData.eu
Social Mining & Big Data Ecosystem project (SoBigData, H2020-INFRAIA-2014-2015,
duration: 2015-2019), www.sobigdata.eu
Second-level University Master's programme
BigData Technology
BigData Sensing & Procurement
BigData Mining
BigData StoryTelling
BigData Ethics
The Master in Big Data aims to train “data scientists”: professionals
with a mix of multidisciplinary skills that allow them not only to
acquire data and extract knowledge from it, but also to tell “stories”
through these data, in support of decision making, creativity and the
development of innovative services, and to manage the ethical and
legal repercussions of Big Data, which often contain personal
information and raise issues of privacy, transparency and awareness.
Areas of socio-economic innovation:
BigData for Social Good
BigData for Business
Big Data Analytics & Social Mining
SoBigData
Data Ethics Literacy
MIUR report on Big Data, 28 July 2016
◦ www.istruzione.it/allegati/2016/bigdata.pdf
Master UNIPI in Big Data Analytics & Social Mining
◦ masterbigdata.it
Data ethics technologies
DISCRIMINATION DISCOVERY FROM DATA
Discrimination discovery
Given:
◦ a historical database of decision records, each describing
features of an applicant for a benefit
◦ e.g., a credit request to a bank and the corresponding credit approval/denial
◦ some designated categories of applicants, such as groups
protected by anti-discrimination laws,
find whether, and in which circumstances, there is
evidence of discrimination against the designated categories
that emerges from the data.
DCUBE: Discrimination Discovery in Databases 47
German Credit dataset
How? Fight with the same weapons
Idea: use data mining to discover discrimination
◦ the decision policies hidden in a database can be represented by
decision rules and discovered by frequent pattern mining
◦ Once all such decision rules are found, highlight all potential niches
of discrimination by filtering the rules using a measure that
quantifies the discrimination risk.
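The two steps above can be sketched on a toy table. This is an illustration under stated assumptions, not the DCUBE implementation. It enumerates contexts by brute force instead of using a real frequent pattern miner, and it uses the ratio between the rule's confidence with and without the protected item as the risk measure. All names, thresholds and records are hypothetical:

```python
from itertools import combinations


def covers(record, itemset):
    """True if the record contains every (attribute, value) pair in itemset."""
    return all(record.get(attr) == val for attr, val in itemset)


def confidence(records, premise, outcome):
    """Return (confidence, support count) of the rule premise -> outcome."""
    matching = [r for r in records if covers(r, premise)]
    if not matching:
        return 0.0, 0
    hits = sum(1 for r in matching if covers(r, [outcome]))
    return hits / len(matching), hits


def discriminatory_rules(records, protected, outcome, min_support=2, min_ratio=1.5):
    """Enumerate rules 'protected & context -> outcome' and keep the risky ones."""
    context_items = sorted({(a, v) for r in records for a, v in r.items()
                            if a not in (protected[0], outcome[0])})
    found = []
    for size in range(1, 3):  # contexts of one or two items (brute force)
        for context in combinations(context_items, size):
            if len({a for a, _ in context}) < size:
                continue  # same attribute twice can never match a record
            conf_ctx, _ = confidence(records, context, outcome)
            conf_rule, support = confidence(records, (protected,) + context, outcome)
            if support >= min_support and conf_ctx > 0:
                ratio = conf_rule / conf_ctx
                if ratio >= min_ratio:
                    found.append((context, support, conf_rule, ratio))
    return sorted(found, key=lambda rule: -rule[3])


# Hypothetical toy records (NOT the German Credit dataset).
records = [
    {"foreign_worker": "yes", "purpose": "new_car", "credit": "bad"},
    {"foreign_worker": "yes", "purpose": "new_car", "credit": "bad"},
    {"foreign_worker": "no", "purpose": "new_car", "credit": "good"},
    {"foreign_worker": "no", "purpose": "new_car", "credit": "good"},
    {"foreign_worker": "no", "purpose": "new_car", "credit": "good"},
    {"foreign_worker": "no", "purpose": "new_car", "credit": "bad"},
    {"foreign_worker": "yes", "purpose": "education", "credit": "good"},
    {"foreign_worker": "no", "purpose": "education", "credit": "good"},
]
rules = discriminatory_rules(records, ("foreign_worker", "yes"), ("credit", "bad"))
```

On this toy table the only surviving rule has the context purpose=new_car: half of all new-car applicants are refused, but both foreign workers in that context are, giving a ratio of 2.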
Discrimination discovery from data
FOREIGN_WORKER=yes
& PURPOSE=new_car & HOUSING=own
→ CREDIT=bad
◦ elift = 5.19 supp = 56 conf = 0.37
elift = 5.19 means that foreign workers are more than 5
times as likely to be refused credit as the
average population (even if they own their house).
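As a reading aid, and assuming the elift on this slide is the extended lift defined in [PRT08], the measure compares the confidence of the rule with the confidence of the same rule without the protected item. The symbols A (protected item, here FOREIGN_WORKER=yes), B (context, here PURPOSE=new_car & HOUSING=own) and C (decision, here CREDIT=bad) are notation introduced for illustration.

```latex
\[
\mathrm{elift}(A, B \rightarrow C)
  = \frac{\mathrm{conf}(A, B \rightarrow C)}{\mathrm{conf}(B \rightarrow C)}
\]
% With the values on the slide, conf(A, B -> C) = 0.37 and elift = 5.19,
% the implied baseline confidence is
\[
\mathrm{conf}(B \rightarrow C) = \frac{0.37}{5.19} \approx 0.071
\]
```

Under this reading, about 37% of foreign workers in this context are refused credit, against roughly 7% of all applicants in the same context.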
Case Study: grant evaluation
Outcome:
• Funded
• Not funded
• Conditionally funded
Dataset attributes
Features of the PI
Project costs
Research Area
Project Evaluation
A potentially discriminatory rule
Antecedent
◦ Project proposals in “Physical and Analytical
Chemical Sciences”
◦ Young females
◦ Total cost of 1,358,000 Euros or above
Possible interpretation
◦ “Peer-reviewers of panel PE4 trusted young females
requiring high budgets less than males leading
similar projects”
Case study: US Harmonized Tariff System
US Harmonized Tariff System (HTS)
https://hts.usitc.gov/
Detailed tariff classification system for
merchandise imported to the US
Chapters 61, 62, 64, 65: apparel
◦ Different taxes for the same garments
produced separately for men and women
◦ Descriptions are in semi-structured form
Different taxes for the same apparel for men and women:

                   Cotton pajamas   Fur felt hats    Coats
Men and boys       8.9%             0                38.6¢/kg + 10%
Women and girls    8.5%             96¢/doz + 1.4%   64.4¢/kg + 18.8%
Women: 14%
Men: 9%
1.3 billion USD!!!
Totes-Isotoner Corp. v. U.S.
Rack Room Shoes Inc. and
Forever 21 Inc. vs U.S.
Court of International Trade
U.S. Court of Appeals for the Federal
Circuit (2014)
“[…] the courts may have concluded that
Congress had no discriminatory intent when
ruling the HTS, but there is little
doubt that gender-based tariffs have
discriminatory impact”
Sample rule from the HTS dataset
Soccer Player Ratings
How do humans evaluate sports performance?
[Figure: machine performance compared with the human evaluation line, first using technical features only, then technical + contextual features]
Wrapping up
Right of explanation
• Applying AI within many domains requires
transparency and responsibility:
• health care
• finance
• surveillance
• autonomous vehicles
• government
• EU General Data Protection Regulation (April
2016) establishes (?) a right of explanation
for all individuals to obtain “meaningful
explanations of the logic involved” when
automated (algorithmic) individual decision-making,
including profiling, takes place.
• In sharp contrast, (big) data-driven AI/ML
models are often black boxes.
Accountability
“Why exactly was my loan application rejected?”
“What could I have done differently so that my application
would not have been rejected?”
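The second question asks for a counterfactual explanation. A minimal sketch, assuming a transparent linear scoring model with hypothetical features, weights and threshold. It searches, one feature at a time, for the smallest change that would flip a rejection into an approval. Real credit models, and real black boxes, are far more complex:

```python
def approve(applicant, weights, threshold):
    """Approve when the weighted sum of the features reaches the threshold."""
    score = sum(weights[f] * applicant[f] for f in weights)
    return score >= threshold


def single_feature_counterfactuals(applicant, weights, threshold, steps):
    """For each adjustable feature, find the nearest value that yields approval.

    steps maps feature -> (step size, maximum number of steps to try).
    Features that cannot flip the decision on their own are left out.
    """
    suggestions = {}
    for feature, (step, max_steps) in steps.items():
        for k in range(1, max_steps + 1):
            changed = dict(applicant)
            changed[feature] = applicant[feature] + k * step
            if approve(changed, weights, threshold):
                suggestions[feature] = changed[feature]
                break
    return suggestions


# Hypothetical model and applicant, for illustration only.
weights = {"income_k": 0.5, "years_employed": 2.0, "open_debts": -3.0}
threshold = 30.0
applicant = {"income_k": 40, "years_employed": 2, "open_debts": 1}
steps = {"income_k": (5, 10), "years_employed": (1, 10), "open_debts": (-1, 1)}

rejected = not approve(applicant, weights, threshold)
suggestions = single_feature_counterfactuals(applicant, weights, threshold, steps)
```

For this toy applicant the answer would read: "your application would have been approved with an income of 60k, or with 7 years of employment; paying off your open debt alone would not have been enough."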
Social Mining & Big Data Ecosystem
www.sobigdata.eu
Knowledge Discovery
& Data Mining Lab
http://kdd.isti.cnr.it
Special thanks
• Salvatore Ruggieri
• Franco Turini
• Fosca Giannotti
• Anna Monreale
• Luca Pappalardo
SMARTCATs
