Responsible Data Use
Sofus A. Macskássy
Data Science @ LinkedIn
smacskassy@linkedin.com
November 2019
Pillars of
Responsible
Data Use
Bias
Privacy
Explainability
Governance
The Coded Gaze [Joy Buolamwini 2016]
Face detection software: Fails for some darker faces
Bias
• Facial analysis software: higher accuracy for light-skinned men
• Error rates for dark-skinned women: 20%–34%
Gender Shades
[Joy Buolamwini &
Timnit Gebru, 2018]
Bias
Bias
• Ethical challenges posed
by AI systems
• Inherent biases present in
society
• Reflected in training data
• AI/ML models prone to
amplifying such biases
Algorithmic Bias
Bias
Massachusetts Group
Insurance Commission
(1997): Anonymized medical
history of state employees
William Weld vs
Latanya Sweeney
Latanya Sweeney (MIT grad
student): $20 – Cambridge
voter roll
Weld: born July 31, 1945,
resident of ZIP code 02138
Privacy
64% of the US population is uniquely identifiable
with ZIP + birth date + gender
Golle, “Revisiting the Uniqueness of Simple Demographics in the US Population”, WPES 2006
Privacy
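To make the re-identification risk concrete, here is a minimal sketch (pandas; the dataframe and the column names zip_code, birth_date, and gender are hypothetical, not a LinkedIn dataset) that measures what fraction of records are unique on these quasi-identifiers:

```python
import pandas as pd

def uniqueness_rate(df: pd.DataFrame,
                    quasi_ids=("zip_code", "birth_date", "gender")) -> float:
    """Fraction of rows that are the only row with their quasi-identifier combination."""
    cols = list(quasi_ids)
    group_size = df.groupby(cols)[cols[0]].transform("size")
    return float((group_size == 1).mean())

# Toy example (synthetic rows, not real people)
df = pd.DataFrame({
    "zip_code":   ["02138", "02138", "94043", "94043"],
    "birth_date": ["1945-07-31", "1945-07-31", "1980-01-02", "1990-03-04"],
    "gender":     ["M", "M", "F", "F"],
})
print(uniqueness_rate(df))  # 0.5 -> half of these rows are unique on ZIP + DOB + gender
```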
A self-driving car knocked down and
killed a pedestrian in Tempe, AZ in 2018.
- Who is to blame? (accountability)
- How do we prevent this? (safety)
- Should we ban self-driving cars?
(liability and policy evaluation)
The need for XAI
Explainability
https://www.nytimes.com/2018/03/19/technology/uber-driverless-fatality.html
A recent research paper shows that a
classifier trained to distinguish wolves
from husky dogs was basing its decision
solely on the presence of snow in the
background.
The need for XAI
Ribeiro, Singh, and Guestrin. 2016. "Why Should I Trust You?": Explaining
the Predictions of Any Classifier. SIGKDD 2016.
Explainability
The need for XAI
Explainable AI is good for multiple reasons:
- Builds trust (why did you do this)
- Can be judged (how much do I believe
the prediction)
- Can be corrected (new training or
tweaks to correct errors)
- Is actionable (I know what to do next)
- … Explainability
Data Governance
Governance
Reflects company policies
Ensures compliance
Protects data
Protects company
Involves all orgs in a company
https://www.dama.org/sites/default/files/download/DAMA-DMBOK2-Framework-V2-20140317-FINAL.pdf
Laws against Discrimination
• Immigration Reform and Control Act: Citizenship
• Rehabilitation Act of 1973; Americans with Disabilities Act of 1990: Disability status
• Civil Rights Act of 1964: Race
• Age Discrimination in Employment Act of 1967: Age
• Equal Pay Act of 1963; Civil Rights Act of 1964: Sex
And more...
Fairness · Privacy · Transparency · Explainability
Responsible
Data Use @
LinkedIn
Case studies
- Bias
- Privacy
- Governance
LinkedIn operates the largest professional
network on the Internet
Tell your story
• 645M+ members
• 30M+ companies are represented on LinkedIn
• 90K+ schools listed (high school & college)
• 35K+ skills listed
• 20M+ open jobs on LinkedIn Jobs
• 280B feed updates
Bias @
LinkedIn
Fairness-aware Talent
Search Ranking
Guiding Principle:
“Diversity by Design”
Insights to
Identify Diverse
Talent Pools
Representative
Talent Search
Results
Diversity
Learning
Curriculum
“Diversity by Design” in LinkedIn’s Talent Solutions
Plan for Diversity
Identify Diverse Talent Pools
Inclusive Job Descriptions / Recruiter Outreach
Representative Ranking for Talent Search
S. C. Geyik, S. Ambler,
K. Kenthapadi, Fairness-
Aware Ranking in Search &
Recommendation Systems with
Application to LinkedIn Talent
Search, KDD’19.
[Microsoft’s AI/ML
conference
(MLADS’18). Distinguished
Contribution Award]
Building Representative
Talent Search at LinkedIn
(LinkedIn engineering blog)
Intuition for Measuring and Achieving
Representativeness
• Ideal: Top ranked results should follow a desired distribution on
gender/age/…
• E.g., same distribution as the underlying talent pool
• Inspired by “Equal Opportunity” definition [Hardt et al, NIPS’16]
• Defined measures (skew, divergence) based on this intuition
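One plausible way to operationalize the skew measure mentioned above (an illustrative sketch following the stated intuition, not the exact KDD'19 formulation): compare the observed share of an attribute value in the top-k results against its desired share on a log scale.

```python
import math

def skew_at_k(ranked_attribute_values, attr_value, desired_proportion, k):
    """log(observed share of attr_value in top-k / desired share).
    0 means the top-k matches the desired distribution; negative means
    the value is under-represented, positive means over-represented."""
    top_k = ranked_attribute_values[:k]
    observed = sum(1 for v in top_k if v == attr_value) / max(len(top_k), 1)
    eps = 1e-12  # avoid log(0) for empty shares
    return math.log((observed + eps) / (desired_proportion + eps))

# Desired 40% of one gender among qualified candidates, but only 2 of the top 10 results
print(skew_at_k(list("FMMFMMMMMM"), "F", desired_proportion=0.40, k=10))  # negative -> under-represented
```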
Fairness-aware Reranking Algorithm (Simplified)
• Partition the set of potential candidates into different buckets for each
attribute value
• Rank the candidates in each bucket according to the scores assigned by
the machine-learned model
• Merge the ranked lists, balancing the representation requirements and
the selection of highest scored candidates
• Algorithmic variants based on how we choose the next attribute
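A minimal sketch of this simplified reranking (the greedy merge rule and data shapes here are illustrative assumptions, not the production algorithm):

```python
from collections import defaultdict

def fairness_aware_rerank(candidates, target_proportions, k):
    """candidates: list of (score, attribute_value), higher score is better.
    target_proportions: desired share per attribute value in the top-k.
    Greedy merge: at each slot, prefer the attribute value furthest below its
    target so far, breaking ties by the highest-scored remaining candidate."""
    buckets = defaultdict(list)
    for score, attr in sorted(candidates, key=lambda c: -c[0]):
        buckets[attr].append((score, attr))          # each bucket stays score-sorted

    ranked, counts = [], defaultdict(int)
    while len(ranked) < k and any(buckets.values()):
        def deficit(attr):
            target = target_proportions.get(attr, 0.0) * (len(ranked) + 1)
            return target - counts[attr]             # how far below target this value is
        available = [a for a, b in buckets.items() if b]
        chosen = max(available, key=lambda a: (deficit(a), buckets[a][0][0]))
        ranked.append(buckets[chosen].pop(0))
        counts[chosen] += 1
    return ranked

# Toy usage: 50/50 target over attribute values "A" and "B"
cands = [(0.9, "A"), (0.8, "A"), (0.7, "A"), (0.6, "B"), (0.5, "B")]
print(fairness_aware_rerank(cands, {"A": 0.5, "B": 0.5}, k=4))
```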
Architecture
Validating Our Approach
• Gender Representativeness
• Over 95% of all searches are representative compared to the qualified
population of the search
• Business Metrics
• A/B test over LinkedIn Recruiter users for two weeks
• No significant change in business metrics (e.g., # InMails sent or accepted)
• Ramped to 100% of LinkedIn Recruiter users worldwide
Lessons
learned
• Post-processing approach desirable
• Model agnostic
• Scalable across different model choices
for our application
• Acts as a “fail-safe”
• Robust to application-specific business
logic
• Easier to incorporate as part of existing
systems
• Build as a stand-alone service or
component for post-processing
• No significant modifications to the existing
components
• Complementary to efforts to reduce bias from
training data & during model training
Engineering for Fairness in AI Lifecycle
Problem
Formation
Dataset
Construction
Algorithm
Selection
Training
Process
Testing
Process
Deployment
Feedback
Is an algorithm an ethical
solution to our problem?
Does our data include enough
minority samples?
Are there missing/biased
features?
Do we need to apply debiasing
algorithms to preprocess our
data?
Do we need to include fairness
constraints in the function?
Have we evaluated the model
using relevant fairness metrics?
Is the model’s effect
similar across all users?
Are we deploying our
model on a population
that we did not train/test
on?
Does the model encourage
feedback loops that can
produce increasingly unfair
outcomes?
Credit: K. Browne & J. Draper
Engineering for Fairness in AI Lifecycle
S. Vasudevan, K. Kenthapadi, FairScale: A Scalable Framework for Measuring Fairness in AI Applications, 2019
Fairness-aware Experimentation
[Saint-Jacques & Sepehri, KDD’19 Social Impact Workshop]
Imagine LinkedIn has 10 members.
Each of them has 1 session a day.
A new product increases sessions by +1 session per member on average.
That +1 could come from every member gaining one session, or from one member
gaining ten sessions while the others gain none.
Both of these are +1 session / member on average!
One is much more unequal than the other. We want to catch that.
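One way to catch it, sketched here with the Gini coefficient as an illustrative inequality measure (the referenced work motivates inequality-aware experimentation; the metric choice and code below are assumptions, not its implementation):

```python
def gini(values):
    """Gini coefficient: 0 = perfectly equal, approaches 1 as one member holds everything."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    cum = sum((i + 1) * x for i, x in enumerate(xs))   # rank-weighted cumulative sum
    return (2.0 * cum) / (n * total) - (n + 1.0) / n

# Two treatments, both +1 session/member on average over 10 members who had 1 session each:
even   = [2] * 10               # every member gains one session
skewed = [11] + [1] * 9         # one member gains ten sessions, the rest gain none
print(gini(even), gini(skewed)) # 0.0 vs ~0.45 -> same mean, very different inequality
```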
Privacy @
LinkedIn
Framework to compute
robust, privacy-
preserving analytics
Analytics & Reporting Products at LinkedIn
• Profile View Analytics
• Content Analytics
• Ad Campaign Analytics
All showing demographics of members
engaging with the product
• Admit only a small # of predetermined query types
• Querying for the number of member actions, for a specified time period,
together with the top demographic breakdowns
Analytics & Reporting Products at LinkedIn
• Admit only a small # of predetermined query types
• Querying for the number of member actions, for a specified time period,
together with the top demographic breakdowns
Analytics & Reporting Products at LinkedIn
E.g., Title = “Senior
Director”
E.g., Clicks on a
given ad
Privacy Requirements
• Attacker cannot infer whether a member performed an action
• E.g., click on an article or an ad
• Attacker may use auxiliary knowledge
• E.g., knowledge of attributes associated with the target member (say,
obtained from this member’s LinkedIn profile)
• E.g., knowledge of all other members that performed similar action (say, by
creating fake accounts)
Possible Privacy Attacks
Targeting:
Senior directors in US, who studied at Cornell
Matches ~16k LinkedIn members
→ over minimum targeting threshold
Demographic breakdown:
Company = X
May match exactly one person
→ can determine whether the person
clicks on the ad or not
Require minimum reporting threshold
Attacker could create fake profiles!
E.g., if the threshold is 10, create 9 fake profiles
that all click.
Rounding mechanism
E.g., report counts only in increments of 10
Still amenable to attacks
E.g., using incremental counts over time to
infer individuals' actions
Need rigorous techniques to preserve member privacy
(not reveal exact aggregate counts)
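A tiny worked sketch of why rounding alone can fail (hypothetical numbers; this just illustrates the fake-profile padding attack described above):

```python
def rounded_report(true_count, increment=10):
    """Release counts rounded down to the nearest increment."""
    return (true_count // increment) * increment

# Attacker controls 19 fake profiles that all click; the target may add one more click.
for target_clicked in (False, True):
    true_count = 19 + int(target_clicked)
    print(target_clicked, rounded_report(true_count))
# Output: False 10 / True 20 -> the rounded report alone reveals the target's click.
```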
Problem Statement
• Compute robust, reliable analytics in a privacy-preserving
manner, while addressing the product needs.
Differential Privacy
Defining Privacy
[Curator diagram: a trusted curator answers the same query over the database
with your data (+ your data) and without it (- your data)]
Databases D and D′ are neighbors if they differ in one person's data.
Differential Privacy: The distribution of the curator's output M(D) on database
D is (nearly) the same as the distribution of M(D′).
Dwork, McSherry, Nissim, Smith [TCC 2006]
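Stated formally (the standard ε-differential privacy definition, included here for completeness):

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S]
\quad \text{for all neighboring databases } D, D' \text{ and all output sets } S.
```

For count queries like the analytics above, the textbook way to satisfy this is the Laplace mechanism; a minimal sketch (not necessarily the mechanism used in LinkedIn's production system):

```python
import numpy as np

def dp_count(true_count, epsilon, rng=None):
    """ε-differentially private count: a count query has sensitivity 1,
    so adding Laplace(0, 1/ε) noise satisfies ε-DP."""
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

print(round(dp_count(1234, epsilon=0.5)))  # a noisy count is released instead of the exact 1234
```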
Privacy System Architecture
Governance
@ LinkedIn
Keeping our data safe and
secure for members
Problem statement
• We have a lot of data
• Some may contain PII
• How do we keep this secure?
• Removing PII data
• Tracking access
Policy: Keeping the data safe
Solution through Technology
• Metadata store
• Tag all attributes in all tables
• Know which fields are PII
• Know which fields need protection
• Audit access to data
• Obfuscate data wherever possible
Tracking pedigree of data
• Tables can be combined to
create new tables
• Automatically track
pedigree of attributes and
their PII value
• Assess new attributes for
PII as well
• Hold table authors accountable
Name   Type     PII?
Name   string   Yes
Age    string   Yes
A1     string   No
A2     url      No

Name   Type     PII?
Name   string   Yes
Adult  boolean  No
B2     string   No
C1     number   No

Name   Type     PII?
Name   string   Yes
B1     number   No
B2     string   No
B3     string   No
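A minimal sketch of how pedigree tracking could propagate PII flags when attributes are derived from existing tables (the schema and helper below are illustrative assumptions, not LinkedIn's metadata store):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attribute:
    name: str
    type: str
    pii: bool
    derived_from: tuple = ()   # upstream attribute names, kept for auditing

def derive(new_name, new_type, sources):
    """Conservatively flag a derived attribute as PII if any source is PII,
    pending review by the table's author (who stays accountable)."""
    return Attribute(
        name=new_name,
        type=new_type,
        pii=any(src.pii for src in sources),
        derived_from=tuple(src.name for src in sources),
    )

# Example: "Adult" derived from the PII "Age" column stays flagged until reviewed.
age = Attribute("Age", "string", pii=True)
adult = derive("Adult", "boolean", [age])
print(adult)
```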
Reflections
Fairness in ML
• Application specific challenges
• Conversational AI systems: Unique bias/fairness/ethics considerations
• E.g., Hate speech, Complex failure modes
• Beyond protected categories, e.g., accent, dialect
• Entire ecosystem (e.g., including apps such as Alexa skills)
• Two-sided markets: e.g., fairness to buyers and to sellers, or to content consumers
and producers
• Fairness in advertising (externalities)
• Tools for ensuring fairness (measuring & mitigating bias) in AI lifecycle
• Pre-processing (representative datasets; modifying features/labels)
• ML model training with fairness constraints
• Post-processing
• Experimentation & Post-deployment
Explainability in ML
• Actionable explanations
• Balance between explanations & model secrecy
• Robustness of explanations to failure modes (Interaction between ML
components)
• Application-specific challenges
• Conversational AI systems: contextual explanations
• Gradation of explanations
• Tools for explanations across AI lifecycle
• Pre & post-deployment for ML models
• Model developer vs. End user focused
Privacy in ML
• Privacy-preserving model training, robust against adversarial
membership inference attacks
• Privacy for highly sensitive data: model training & analytics using secure
enclaves, homomorphic encryption, federated learning / on-device
learning, or a hybrid
• Privacy-preserving transfer learning (broadly, privacy-preserving
mechanisms for data marketplaces)
Thank you
Sofus A. Macskássy
Data Science @ LinkedIn
smacskassy@linkedin.com