Personal Identifiable
Information vs Attribute Data
An exploration of handling Personal
Identifiable Information while enabling analysis.
This slide deck was compiled by Jonathan Swan, ADR-UK Engineering Lead
Data Engineering and Operations | Data Growth and Operations
Office for National Statistics July 2023
Contents
Background and context Slide 3
Examples (of PII) Slide 9
The Legal Bit Slide 12
Disclosure Control Slide 16
Using PII Slide 23
Basic Definitions
Personal
Identifiable
Information (PII)
• Personally identifiable
information (PII) is
information that, when
used alone or with other
relevant data, can
identify an individual.
Attribute data
• Data that can be used to describe or quantify an object
or entity.
• A characteristic or feature that is measured for each
observation (record) and can vary from one observation
to another. It might be measured in continuous values (e.g.
time spent on a website) or in categorical values (e.g.
red, yellow, green). The terms "attribute" and "feature"
are used in the machine learning community; "variable"
is used in the statistics community. They are synonyms.
• Any data that are used for statistics, analysis, or
research, to describe a data subject, or structural data to
support such analysis.
Background and context
• DPA – Data Protection Act 2018
• GDPR – General Data Protection Regulation
• SRSA – Statistics and Registration Service Act 2007
• DEA – Digital Economy Act 2017
Relevant legislation
Background and context
Why does it matter?
• Section 64 of the DEA allows sharing of “personal information” for research
purposes. Subsection (3) (a) requires that “the person’s identity is not specified
in the information”
• The third GDPR principle (DPA, S37) requires that processing of personal data is
“relevant and not excessive” i.e. proportionate
• The fifth GDPR principle (DPA S39 (1)) requires that personal data “must be kept
for no longer than is necessary for the purpose for which it is processed”
• In short
• PII cannot be shared with approved researchers, and
• Processors (like ONS) must handle personal data proportionately and keep it no longer than necessary
Background and context
So why is PII required in research?
Why can’t it be removed ‘at source’?
• Data held in operational systems is often structured around “unique
identifiers”
• Unique identifiers, such as National Insurance Number, are often PII
• Unique identifiers like National Insurance Numbers (known as NINo) can be
needed to join data across sources
• PII, such as names, is essential to good-quality matching
• PII can be used to measure error and bias when matching or joining data
• This includes joining on identifiers (like NHS no) where we know there is error
• PII can be used to create attributes (e.g. address used to derive
geographical data)
Background and context
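To illustrate why shared identifiers matter, here is a minimal sketch of joining two sources on NINo and measuring the match rate; the records and field names are invented for illustration:

```python
# Hypothetical data: why unique identifiers like NINo are needed to join
# records across operational sources, and how match rates expose linkage error.

education = {"AB123456A": {"qualification": "Degree"},
             "CD234567B": {"qualification": "GCSE"}}
tax = {"AB123456A": {"income": 32000},
       "EF345678C": {"income": 27500}}

# Join the two sources on the shared identifier.
joined = {nino: {**education[nino], **tax[nino]}
          for nino in education.keys() & tax.keys()}

# Unmatched records are one simple measure of linkage error.
match_rate = len(joined) / len(education)
print(joined)      # {'AB123456A': {'qualification': 'Degree', 'income': 32000}}
print(match_rate)  # 0.5
```

Without the identifier there is no reliable key to join on, which is why it often cannot be stripped "at source" before linkage.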
Implications
• It is illegal to share PII with Approved Researchers
• Separating processing of PII and attributes is (often) a proportionate approach
• Separation reduces the burden on, and protects, the people processing the data
• Yes I can see their names – but I don’t know anything about them
• I can see intimate details about people – but I don’t know who they are
• It’s far less likely I find something out about people, maybe even a friend
• I can’t be accused of leaking personal details, if I can’t see them
Background and context
In context
• Five Safes:
• safe people.
• safe projects.
• safe settings.
• safe data.
• safe outputs.
• Appropriate separation of PII and attributes helps ensure safe data.
Background and context
Examples of PII
PII                          Attributes
---                          ----------
Name                         Sex
Address                      Age at (a reference date)
Date of Birth                Post Code
Email                        Income
Phone Number                 SIC
National Insurance No.       SOC
NHS No.                      Qualification
Employer Reference No.       Number of employees
Company Name
Examples of PII
Preparing data for analysis
Source record:
Forename: Michael | Surname: Mouse | NINo: AB123456A | Company ID: Disney1 | Sex: M | DOB: 01/05/1928

Lookup (held separately):
NINo: AB123456A → ADRID: XYZ123

De-identified record:
ADRID: XYZ123 | Company ID: PQ7TH89U | Sex: M | Age at 1/1/23: 94

Operations: Suppress (Forename, Surname) | Lookup (NINo → ADRID) | Apply hash function (Company ID) | Derive variable (DOB → age)
Examples of PII
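The four operations on this slide can be sketched in Python. The record, lookup table, and salt are illustrative only, and the truncated hash of the Company ID will not match the slide's example value:

```python
import hashlib
from datetime import date

# Hypothetical single-record sketch: suppress names, replace NINo via a
# separately held lookup, hash the company ID, and derive an attribute
# (age at a reference date) from the DOB.

record = {"Forename": "Michael", "Surname": "Mouse", "NINo": "AB123456A",
          "Company ID": "Disney1", "Sex": "M", "DOB": date(1928, 5, 1)}
nino_lookup = {"AB123456A": "XYZ123"}  # kept separately from attribute data
SALT = b"example-secret"               # illustrative; real keys are managed securely

def age_at(dob, ref):
    """Whole years between dob and the reference date."""
    return ref.year - dob.year - ((ref.month, ref.day) < (dob.month, dob.day))

deidentified = {
    "ADRID": nino_lookup[record["NINo"]],          # lookup
    "Company ID": hashlib.sha256(SALT + record["Company ID"].encode()).hexdigest()[:8],  # hash
    "Sex": record["Sex"],
    "Age at 1/1/23": age_at(record["DOB"], date(2023, 1, 1)),  # derive variable
}  # Forename and Surname are simply never copied across: suppression

print(deidentified["Age at 1/1/23"])  # 94
```

The researcher receives only the de-identified record; the NINo lookup stays with the data processor.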
The Legal Bit
Comparison of Legislation
Definition:
• DPA/GDPR: “Personal data” means any information relating to an identified or identifiable living individual …
• SRSA: “personal information” means information which relates to and identifies a particular person (including a body corporate)
• DEA: … information is “personal information” if— (a) it relates to a particular person (including a body corporate), but (b) it is not information about the internal administrative arrangements of a public authority.
Term used:
• DPA/GDPR: personal data | SRSA: personal information | DEA: personal information
Bodies corporate in scope:
• DPA/GDPR: no (sole traders only) | SRSA: yes | DEA: yes
Deceased in scope:
• DPA/GDPR: no (living individuals only) | SRSA: yes | DEA: assumed yes
The legal bit
Bodies Corporate
• If you are used to GDPR – the concept of protecting the identity of a
corporate body may seem odd to you. But:
• Sole traders are covered under GDPR, and
• Corporate Bodies are explicitly covered under both the SRSA and DEA – so we
have to avoid identifying them.
• Bodies corporate definitely include companies and charities, but
• Schools, local authorities, government departments, etc. are also included
under the SRSA, and may be covered under the DEA in some circumstances.
• Best to treat them as requiring protection of identity,
• But under some specific circumstances it may be possible to share
identifiers.
The legal bit
I see dead people
• The GDPR explicitly refers to “living individual[s]”
• The SRSA is interpreted to include dead people in scope
• The DEA is not explicit, but it should be assumed they are covered
• It is safest to assume the identity of dead people is protected
• But death registrations are public
• And the 100 year rule may apply (like for the Census)
• So it may be possible to use identifiable data on dead people in
specific circumstances.
The legal bit
Disclosure Control
Suppressing PII - Part of the story
• Data made available to approved
researchers are de-identified
• Published data must be
anonymous
• Anonymisation is a high
standard with an explicit legal
definition.
• De-identification: The act of
removing identifiers from data
• Anonymous: “information which
does not relate to an identified
or identifiable natural person or
to personal data rendered
anonymous in such a manner
that the data subject is not or no
longer identifiable.” (GDPR)
Disclosure control
Isn’t Pseudonymisation enough?
• In short: NO!
• GDPR defines pseudonymisation: “…the processing of personal data in such
a manner that the personal data can no longer be attributed to a specific
data subject without the use of additional information, provided that such
additional information is kept separately and is subject to technical and
organisational measures to ensure that the personal data are not
attributed to an identified or identifiable natural person.”
• And GDPR says “…Personal data which have undergone
pseudonymisation, which could be attributed to a natural person by the
use of additional information should be considered to be information on an
identifiable natural person…”
• Pseudonymisation is a risk reduction method only, which is good practice
under certain circumstances.
Disclosure control
De-identification – a little more
• De-identification may involve the removal of postcode or other small
area identifiers (like output area) to ensure compliance with legislation
or appropriate risk management.
• De-identification may also require other measures, like record
swapping or ‘blurring’ or rounding to prevent identification.
• For some variables removal of extreme outliers is required.
• e.g. Income data ‘capping’ may be required - very high salaries can become
identifiers.
Disclosure control
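As a hedged sketch of the capping and rounding measures above (the thresholds are illustrative, not actual policy):

```python
# Two de-identification measures: capping extreme values, since very high
# salaries can themselves become identifiers, and rounding ('blurring').
# Both thresholds below are hypothetical.

CAP = 150_000    # illustrative income cap
ROUND_TO = 100   # round to the nearest £100

def deidentify_income(income):
    """Cap extreme incomes, then round to reduce precision."""
    capped = min(income, CAP)
    return round(capped / ROUND_TO) * ROUND_TO

print(deidentify_income(23_456))     # 23500
print(deidentify_income(1_000_000))  # 150000 - the outlier no longer stands out
```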
Other measures
• To ensure compliance with legislation, and to avoid
(re)identification, other measures are required
• Safe Projects avoid re-identification by avoiding
toxic data mixes
• Safe Settings help prevent combining data with
other data to enable re-identification
• The higher the risk – the more stringent the
measure
• These measures help to keep “safe data”
• Disclosure control, as above, ensures safe
outputs.
Five Safes:
• safe people.
• safe projects.
• safe settings.
• safe data.
• safe outputs.
Disclosure control
Publishing (Disclosure) – Issues to be aware of
• Publishing requires data / information are anonymous
• Re-identification must not be possible
• Care with dominance
• Especially for corporate bodies
• Caution for small geographies or other groupings
• Where one or two units provide the majority of a measure within a grouping
• Aggregate tables
• Sufficient aggregation required
• Small values are an issue
• Specific requirements for some data sets
• Summary statistics
• Care with point values (max, min, etc)
• Computed statistics or models
• High detail may cause disclosure
• Graphical output
• Care with point values
Disclosure control
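Two of the checks above, small cells in aggregate tables and dominance within a grouping, can be sketched as follows; the thresholds and helper names are hypothetical, and real disclosure-control rules are dataset-specific:

```python
# Illustrative disclosure checks with made-up thresholds:
#  - suppress small cell counts in aggregate tables
#  - flag dominance, where one unit supplies most of a measure in a group

MIN_CELL = 5        # hypothetical: counts below this are suppressed
DOMINANCE = 0.8     # hypothetical: flag if one unit contributes >80% of the total

def check_cell(count):
    """Return the count if publishable, or None if it must be suppressed."""
    return count if count >= MIN_CELL else None

def dominant(values):
    """True if a single unit dominates the group total."""
    total = sum(values)
    return total > 0 and max(values) / total > DOMINANCE

print(check_cell(3))                # None  (too small to publish)
print(dominant([900, 40, 30, 30]))  # True  (one unit dominates the group)
```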
Publishing – output can be achieved without
identification
• Individual case studies or examples are possible
• As long as the identity of the individual is not discoverable (by the researcher
or other parties)
• Qualitative results can be achieved
• And may avoid the identification issues that would occur by putting numbers
on the results
Disclosure control
Using PII
PII as a resource
• We cannot share PII with Approved Researchers
• But we can use it to help Researchers achieve their aims
• It is entirely legitimate, and intended, that we process PII
• Matching and joining data
• The obvious way we can help
• But not the end of the story …
Using PII
A brief aside
Hashing:
• The use of a cryptographic hash function to apply a one-way
transformation of a string of characters into a fixed-length encoded string.
• ‘One way encryption of data’
• A secure and repeatable way of transforming text into a ‘random’
string that is practically irreversible.
Using PII
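A minimal demonstration of these properties using Python's hashlib:

```python
import hashlib

# Hashing is repeatable (same input, same output), fixed-length, and
# practically irreversible. Example NINo-style inputs are invented.

a = hashlib.sha256(b"AB123456A").hexdigest()
b = hashlib.sha256(b"AB123456A").hexdigest()
c = hashlib.sha256(b"AB123456B").hexdigest()

print(a == b)          # True  - repeatable, so it still works as a join key
print(a == c)          # False - any change gives a completely different digest
print(len(a), len(c))  # 64 64 - fixed length regardless of input
```

In practice a secret key or salt (e.g. a keyed hash such as HMAC) is needed: identifiers like NINo come from a small, guessable space, so an unkeyed hash could be reversed by hashing every candidate value.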
Ways to use PII
• Hashing an ID (e.g. NHS no.) so that data from different sources can
be joined without using identifiers
• Hashing to enable analysis by group (e.g. hash school name, hospital,
company, etc. to enable analysis at unit level without disclosing the
unit – e.g. how large is the range of school performance?)
• Creation of derived variables from PII – e.g. whether a company name includes
the word “partner”, or calculating a weekday where a date (e.g. DoB) cannot
be shared
• Applying algorithms to derive values, e.g. applying an algorithm
derived from test or anonymised data to real data – e.g. textual
analysis algorithms
Using PII
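The derivations above might be sketched as follows; the company name and date are invented for illustration:

```python
from datetime import date

# Derive shareable attributes from PII that cannot itself be shared:
# a flag from a company name, and a weekday from a date of birth.

def derive(company_name, dob):
    return {
        "is_partnership_named": "partner" in company_name.lower(),
        "dob_weekday": dob.strftime("%A"),  # the weekday can be shared; the full DOB cannot
    }

print(derive("Smith & Partners Ltd", date(1928, 5, 1)))
# {'is_partnership_named': True, 'dob_weekday': 'Tuesday'}
```

The researcher gets the derived attribute; the underlying identifier never leaves the processing environment.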
More ways to use PII
• Measuring error or bias in data
• Particularly linked data
• Including error in identifiers like NINo
• Hashing identifiers to enable frequency type analysis (e.g. does
having a rare name correlate to higher salary?)
• Correlation of PII and attribute – e.g. does forename correlate to a
characteristic (e.g. ethnicity)
• Applying an imputed characteristic or proxy – using name or title to
imply sex
Using PII
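A hedged sketch of the frequency-type analysis above, with invented data: the names are hashed once, and everything downstream works only on the digests.

```python
import hashlib
from collections import Counter
from statistics import mean

# Toy question from the slide: do rare forenames correlate with salary?
# The names and salaries below are made up for illustration.

people = [("Olivia", 30000), ("Olivia", 28000), ("Olivia", 31000),
          ("Zebedee", 52000)]

# Hash the name; frequency analysis needs only equality of digests.
hashed = [(hashlib.sha256(name.encode()).hexdigest(), salary)
          for name, salary in people]
freq = Counter(h for h, _ in hashed)

rare = [s for h, s in hashed if freq[h] == 1]
common = [s for h, s in hashed if freq[h] > 1]
print(mean(rare) > mean(common))  # True - rare names earn more in this toy data
```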
Editor's Notes
  • #4: Attribute Data has a different meaning in “lean six sigma” methodology. Data Attributes is a distinct term used in coding.
  • #8: Illegal to share under the DEA; sharing may be possible under a different gateway, in very specific circumstances.
  • #18: The word anonymisation is frequently misused – it is much more than just removing a name.
  • #22: Publishing includes removal from a safe environment. Full detail here is beyond scope, but the issue is relevant to the context of PII.
  • #27: Some of the examples may be a bit tenuous - but are intended to provoke ideas.