Personal Identifiable
Information vs Attribute Data
An exploration of handling Personal
Identifiable Information while enabling analysis.
This slide deck was compiled by Jonathan Swan, ADR-UK Engineering Lead
Data Engineering and Operations | Data Growth and Operations
Office for National Statistics July 2023
Contents
Background and context Slide 3
Examples (of PII) Slide 9
The Legal Bit Slide 12
Disclosure Control Slide 16
Using PII Slide 23
Basic Definitions
Personal
Identifiable
Information (PII)
• Personally identifiable
information (PII) is
information that, when
used alone or with other
relevant data, can
identify an individual.
Attribute data
• Data that can be used to describe or quantify an object
or entity.
• A characteristic or feature that is measured for each
observation (record) and can vary from one observation
to another. It might be measured in continuous values (e.g.
time spent on a website) or in categorical values (e.g.
red, yellow, green). The terms "attribute" and "feature"
are used in the machine learning community; "variable"
is used in the statistics community. They are synonyms.
• Any data that are used for statistics, analysis, or
research, to describe a data subject, or structural data to
support such analysis.
Background and context
• DPA – Data Protection Act 2018
• GDPR – General Data Protection Regulation
• SRSA – Statistics and Registration Service Act 2007
• DEA – Digital Economy Act 2017
Relevant legislation
Background and context
Why does it matter?
• Section 64 of the DEA allows sharing of “personal information” for research
purposes. Subsection (3) (a) requires that “the person’s identity is not specified
in the information”
• The third GDPR principle (DPA, S37) requires that processing of personal data is
“relevant and not excessive” i.e. proportionate
• The fifth GDPR principle (DPA S39 (1)) requires that personal data “must be kept
for no longer than is necessary for the purpose for which it is processed”
• In short
• PII cannot be shared with approved researchers, and
• Processors (like ONS) must handle personal data proportionately and keep it no longer than necessary
Background and context
So why is PII required in research?
Why can’t it be removed ‘at source’?
• Data held in operational systems is often structured around “unique
identifiers”
• Unique identifiers, such as National Insurance Number, are often PII
• Unique identifiers like National Insurance Numbers (known as NINo) can be
needed to join data across sources
• PII, such as names, is essential to good-quality matching
• PII can be used to measure error and bias when matching or joining data
• This includes joining on identifiers (like NHS no) where we know there is error
• PII can be used to create attributes (e.g. address used to derive
geographical data)
Background and context
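To illustrate why shared identifiers matter, here is a minimal sketch of joining two sources on NINo and measuring the match rate; the records and field names are invented for illustration:

```python
# Hypothetical data: why unique identifiers like NINo are needed to join
# records across operational sources, and how match rates expose linkage error.

education = {"AB123456A": {"qualification": "Degree"},
             "CD234567B": {"qualification": "GCSE"}}
tax = {"AB123456A": {"income": 32000},
       "EF345678C": {"income": 27500}}

# Join the two sources on the shared identifier.
joined = {nino: {**education[nino], **tax[nino]}
          for nino in education.keys() & tax.keys()}

# Unmatched records are one simple measure of linkage error.
match_rate = len(joined) / len(education)
print(joined)      # {'AB123456A': {'qualification': 'Degree', 'income': 32000}}
print(match_rate)  # 0.5
```

Without the identifier there is no reliable key to join on, which is why it often cannot be stripped "at source" before linkage.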
Implications
• It is illegal to share PII with Approved Researchers
• Separating processing of PII and attributes is (often) a proportionate approach
• Separation reduces the burden on, and protects, the people processing the data
• Yes I can see their names – but I don’t know anything about them
• I can see intimate details about people – but I don’t know who they are
• It’s far less likely I find something out about people, maybe even a friend
• I can’t be accused of leaking personal details, if I can’t see them
Background and context
In context
• Five Safes:
• safe people.
• safe projects.
• safe settings.
• safe data.
• safe outputs.
• Appropriate separation of PII and attributes helps ensure safe data.
Background and context
Examples of PII
PII                          Attributes
---                          ----------
Name                         Sex
Address                      Age at (a reference date)
Date of Birth                Post Code
Email                        Income
Phone Number                 SIC
National Insurance No.       SOC
NHS No.                      Qualification
Employer Reference No.       Number of employees
Company Name
Examples of PII
Preparing data for analysis
Source record:
Forename: Michael | Surname: Mouse | NINo: AB123456A | Company ID: Disney1 | Sex: M | DOB: 01/05/1928

Lookup (held separately):
NINo: AB123456A → ADRID: XYZ123

De-identified record:
ADRID: XYZ123 | Company ID: PQ7TH89U | Sex: M | Age at 1/1/23: 94

Operations: Suppress (Forename, Surname) | Lookup (NINo → ADRID) | Apply hash function (Company ID) | Derive variable (DOB → age)
Examples of PII
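The four operations on this slide can be sketched in Python. The record, lookup table, and salt are illustrative only, and the truncated hash of the Company ID will not match the slide's example value:

```python
import hashlib
from datetime import date

# Hypothetical single-record sketch: suppress names, replace NINo via a
# separately held lookup, hash the company ID, and derive an attribute
# (age at a reference date) from the DOB.

record = {"Forename": "Michael", "Surname": "Mouse", "NINo": "AB123456A",
          "Company ID": "Disney1", "Sex": "M", "DOB": date(1928, 5, 1)}
nino_lookup = {"AB123456A": "XYZ123"}  # kept separately from attribute data
SALT = b"example-secret"               # illustrative; real keys are managed securely

def age_at(dob, ref):
    """Whole years between dob and the reference date."""
    return ref.year - dob.year - ((ref.month, ref.day) < (dob.month, dob.day))

deidentified = {
    "ADRID": nino_lookup[record["NINo"]],          # lookup
    "Company ID": hashlib.sha256(SALT + record["Company ID"].encode()).hexdigest()[:8],  # hash
    "Sex": record["Sex"],
    "Age at 1/1/23": age_at(record["DOB"], date(2023, 1, 1)),  # derive variable
}  # Forename and Surname are simply never copied across: suppression

print(deidentified["Age at 1/1/23"])  # 94
```

The researcher receives only the de-identified record; the NINo lookup stays with the data processor.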
The Legal Bit
Comparison of Legislation
Definition:
• DPA/GDPR: “Personal data” means any information relating to an identified or identifiable living individual …
• SRSA: “personal information” means information which relates to and identifies a particular person (including a body corporate)
• DEA: … information is “personal information” if— (a) it relates to a particular person (including a body corporate), but (b) it is not information about the internal administrative arrangements of a public authority.
Term used:
• DPA/GDPR: personal data | SRSA: personal information | DEA: personal information
Bodies corporate in scope:
• DPA/GDPR: no (sole traders only) | SRSA: yes | DEA: yes
Deceased in scope:
• DPA/GDPR: no (living individuals only) | SRSA: yes | DEA: assumed yes
The legal bit
Bodies Corporate
• If you are used to GDPR – the concept of protecting the identity of a
corporate body may seem odd to you. But:
• Sole traders are covered under GDPR, and
• Corporate Bodies are explicitly covered under both the SRSA and DEA – so we
have to avoid identifying them.
• Bodies corporate definitely include companies and charities, but
• Schools, local authorities, government departments, etc. are also included
under the SRSA, and may be covered under the DEA in some circumstances.
• Best to treat them as requiring protection of identity,
• But under some specific circumstances it may be possible to share
identifiers.
The legal bit
I see dead people
• The GDPR explicitly refers to “living individual[s]”
• The SRSA is interpreted to include dead people in scope
• The DEA is not explicit, but it should be assumed they are covered
• It is safest to assume the identity of dead people is protected
• But death registrations are public
• And the 100 year rule may apply (like for the Census)
• So it may be possible to use identifiable data on dead people in
specific circumstances.
The legal bit
Disclosure Control
Suppressing PII - Part of the story
• Data made available to approved
researchers are de-identified
• Published data must be
anonymous
• Anonymisation is a high
standard with an explicit legal
definition.
• De-identification: The act of
removing identifiers from data
• Anonymous: “information which
does not relate to an identified
or identifiable natural person or
to personal data rendered
anonymous in such a manner
that the data subject is not or no
longer identifiable.” (GDPR)
Disclosure control
Isn’t Pseudonymisation enough?
• In short: NO!
• GDPR defines pseudonymisation: “…the processing of personal data in such
a manner that the personal data can no longer be attributed to a specific
data subject without the use of additional information, provided that such
additional information is kept separately and is subject to technical and
organisational measures to ensure that the personal data are not
attributed to an identified or identifiable natural person.”
• And GDPR says “…Personal data which have undergone
pseudonymisation, which could be attributed to a natural person by the
use of additional information should be considered to be information on an
identifiable natural person…”
• Pseudonymisation is a risk reduction method only, which is good practice
under certain circumstances.
Disclosure control
De-identification – a little more
• De-identification may involve the removal of postcode or other small
area identifiers (like output area) to ensure compliance with legislation
or appropriate risk management.
• De-identification may also require other measures, like record
swapping or ‘blurring’ or rounding to prevent identification.
• For some variables removal of extreme outliers is required.
• e.g. Income data ‘capping’ may be required - very high salaries can become
identifiers.
Disclosure control
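As a hedged sketch of the capping and rounding measures above (the thresholds are illustrative, not actual policy):

```python
# Two de-identification measures: capping extreme values, since very high
# salaries can themselves become identifiers, and rounding ('blurring').
# Both thresholds below are hypothetical.

CAP = 150_000    # illustrative income cap
ROUND_TO = 100   # round to the nearest £100

def deidentify_income(income):
    """Cap extreme incomes, then round to reduce precision."""
    capped = min(income, CAP)
    return round(capped / ROUND_TO) * ROUND_TO

print(deidentify_income(23_456))     # 23500
print(deidentify_income(1_000_000))  # 150000 - the outlier no longer stands out
```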
Other measures
• To ensure compliance with legislation, and to avoid
(re)identification, other measures are required
• Safe Projects avoid re-identification by avoiding
toxic data mixes
• Safe Settings help prevent combining data with
other data to enable re-identification
• The higher the risk – the more stringent the
measure
• These measures help to keep “safe data”
• Disclosure control, as above, ensures safe
outputs.
Five Safes:
• safe people.
• safe projects.
• safe settings.
• safe data.
• safe outputs.
Disclosure control
Publishing (Disclosure) – Issues to be aware of
• Publishing requires data / information are anonymous
• Re-identification must not be possible
• Care with dominance
• Especially for corporate bodies
• Caution for small geographies or other groupings
• Where one or two units provide the majority of a measure within a grouping
• Aggregate tables
• Sufficient aggregation required
• Small values are an issue
• Specific requirements for some data sets
• Summary statistics
• Care with point values (max, min, etc)
• Computed statistics or models
• High detail may cause disclosure
• Graphical output
• Care with point values
Disclosure control
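Two of the checks above, small cells in aggregate tables and dominance within a grouping, can be sketched as follows; the thresholds and helper names are hypothetical, and real disclosure-control rules are dataset-specific:

```python
# Illustrative disclosure checks with made-up thresholds:
#  - suppress small cell counts in aggregate tables
#  - flag dominance, where one unit supplies most of a measure in a group

MIN_CELL = 5        # hypothetical: counts below this are suppressed
DOMINANCE = 0.8     # hypothetical: flag if one unit contributes >80% of the total

def check_cell(count):
    """Return the count if publishable, or None if it must be suppressed."""
    return count if count >= MIN_CELL else None

def dominant(values):
    """True if a single unit dominates the group total."""
    total = sum(values)
    return total > 0 and max(values) / total > DOMINANCE

print(check_cell(3))                # None  (too small to publish)
print(dominant([900, 40, 30, 30]))  # True  (one unit dominates the group)
```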
Publishing – output can be achieved without
identification
• Individual case studies or examples are possible
• As long as the identity of the individual is not discoverable (by the researcher
or other parties)
• Qualitative results can be achieved
• And may avoid the identification issues that would occur by putting numbers
on the results
Disclosure control
Using PII
PII as a resource
• We cannot share PII with Approved Researchers
• But we can use it to help Researchers achieve their aims
• It is entirely legitimate, and intended, that we process PII
• Matching and joining data
• The obvious way we can help
• But not the end of the story …
Using PII
A brief aside
Hashing:
• The use of a cryptographic hash function to apply a one-way
transformation of a string of characters into a fixed-length encoded string.
• ‘One way encryption of data’
• A secure and repeatable way of transforming text into a ‘random’
string that is practically irreversible.
Using PII
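A minimal demonstration of these properties using Python's hashlib:

```python
import hashlib

# Hashing is repeatable (same input, same output), fixed-length, and
# practically irreversible. Example NINo-style inputs are invented.

a = hashlib.sha256(b"AB123456A").hexdigest()
b = hashlib.sha256(b"AB123456A").hexdigest()
c = hashlib.sha256(b"AB123456B").hexdigest()

print(a == b)          # True  - repeatable, so it still works as a join key
print(a == c)          # False - any change gives a completely different digest
print(len(a), len(c))  # 64 64 - fixed length regardless of input
```

In practice a secret key or salt (e.g. a keyed hash such as HMAC) is needed: identifiers like NINo come from a small, guessable space, so an unkeyed hash could be reversed by hashing every candidate value.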
Ways to use PII
• Hashing an ID (e.g. NHS no.) so that data from different sources can
be joined without using identifiers
• Hashing to enable analysis by group (e.g. hash school name, hospital,
company, etc. to enable analysis at unit level without disclosing the
unit – e.g. how large is the range of school performance?)
• Creation of derived variables from PII – e.g. whether a company name includes
the word “partner”, or calculating a weekday where a date (e.g. DoB) cannot
be shared
• Applying algorithms to derive values, e.g. applying an algorithm
derived from test or anonymised data to real data – e.g. textual
analysis algorithms
Using PII
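The derivations above might be sketched as follows; the company name and date are invented for illustration:

```python
from datetime import date

# Derive shareable attributes from PII that cannot itself be shared:
# a flag from a company name, and a weekday from a date of birth.

def derive(company_name, dob):
    return {
        "is_partnership_named": "partner" in company_name.lower(),
        "dob_weekday": dob.strftime("%A"),  # the weekday can be shared; the full DOB cannot
    }

print(derive("Smith & Partners Ltd", date(1928, 5, 1)))
# {'is_partnership_named': True, 'dob_weekday': 'Tuesday'}
```

The researcher gets the derived attribute; the underlying identifier never leaves the processing environment.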
More ways to use PII
• Measuring error or bias in data
• Particularly linked data
• Including error in identifiers like NINo
• Hashing identifiers to enable frequency type analysis (e.g. does
having a rare name correlate to higher salary?)
• Correlation of PII and attribute – e.g. does forename correlate to a
characteristic (e.g. ethnicity)
• Applying an imputed characteristic or proxy – using name or title to
imply sex
Using PII
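A hedged sketch of the frequency-type analysis above, with invented data: the names are hashed once, and everything downstream works only on the digests.

```python
import hashlib
from collections import Counter
from statistics import mean

# Toy question from the slide: do rare forenames correlate with salary?
# The names and salaries below are made up for illustration.

people = [("Olivia", 30000), ("Olivia", 28000), ("Olivia", 31000),
          ("Zebedee", 52000)]

# Hash the name; frequency analysis needs only equality of digests.
hashed = [(hashlib.sha256(name.encode()).hexdigest(), salary)
          for name, salary in people]
freq = Counter(h for h, _ in hashed)

rare = [s for h, s in hashed if freq[h] == 1]
common = [s for h, s in hashed if freq[h] > 1]
print(mean(rare) > mean(common))  # True - rare names earn more in this toy data
```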
Editor's Notes
  • #4: Attribute Data has a different meaning in “lean six sigma” methodology. Data Attributes is a distinct term used in coding.
  • #8: Illegal to share under the DEA; sharing may be possible under a different gateway, in very specific circumstances.
  • #18: The word anonymisation is frequently misused – it is much more than just removing a name.
  • #22: Publishing includes removal from a safe environment. Full detail here is beyond scope, but the issue is relevant to the context of PII.
  • #27: Some of the examples may be a bit tenuous - but are intended to provoke ideas.