Managing Confidential
Information in Research
                               Micah Altman
                   Senior Research Scientist
    Institute for Quantitative Social Science
                          Harvard University
Personally identifiable private
information is surprisingly common
       Includes information from a variety of sources, such as…
           Research data, even if you aren't the original collector
           Student "records" such as e-mail, grades
           Logs from web servers and other systems
       Lots of things are potentially identifying:
           Under some federal laws: IP addresses, dates, zipcodes, …
           Birth date + zipcode + gender uniquely identify ~87% of
            people in the U.S. [Sweeney 2002]
           With date and place of birth, the first five digits of a
            social security number (SSN) can be guessed > 60% of the
            time. (The whole SSN can be guessed in under 10 tries for
            a significant minority of people.) [Acquisti & Gross 2009]
           Analysis of writing style or eclectic tastes has been used
            to identify individuals
       Tables, graphs and maps can also reveal identifiable
        information [Brownstein et al. 2006, NEJM 355(16)]
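The uniqueness figures above can be checked mechanically on any file: the sketch below, using invented records, counts how many records share each combination of quasi-identifiers (birth date, zipcode, gender) and reports the fraction that are unique in the file — a rough proxy for re-identification risk. All names and values here are hypothetical.

```python
# Count how many records share each quasi-identifier combination;
# records unique on these fields are the most re-identifiable.
from collections import Counter

records = [  # invented illustration data, not real subjects
    {"name": "A", "birthdate": "1961-01-01", "zip": "02145", "gender": "M"},
    {"name": "B", "birthdate": "1961-01-01", "zip": "02145", "gender": "M"},
    {"name": "C", "birthdate": "1972-11-11", "zip": "94043", "gender": "M"},
]

def unique_fraction(records, quasi_ids=("birthdate", "zip", "gender")):
    """Fraction of records whose quasi-identifier combination is
    unique in the file -- a rough proxy for disclosure risk."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    uniques = sum(1 for r in records
                  if counts[tuple(r[q] for q in quasi_ids)] == 1)
    return uniques / len(records)

print(unique_fraction(records))  # only C is sample-unique: 1/3
```

A population-level risk assessment needs external data (as in Sweeney's study), but a within-file uniqueness count like this is a common first screen.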
    4                                                                     [Micah Altman, 3/10/2011]
IQSS (and affiliates) offer you support across all stages of your
  quantitative research:

       Research design, including:
        design of surveys, selection of statistical methods.
       Primary and secondary data collection, including:
        the collection of geospatial and survey data.
       Data management, including:
        storage, cataloging, permanent archiving, and distribution.
       Data analysis, including:
        statistical consulting, GIS consulting, high-performance research computing.


                        http://iq.harvard.edu/
The IQSS grants administration team helps with every aspect of
  the grant process. Contact us when you are planning your
  proposal.

           Assisting in identifying research funding opportunities
           Consulting on writing proposals
           Assisting IQSS affiliates with:
            preparation, review, and submission of all grant applications
             ("pre-award support")
            management of their sponsored research portfolio
             ("post-award support")
            interpretation of sponsor policies
            coordination with FAS Research Administration and the Central
             Office for Sponsored Programs



… And, of course, support seminars like this!


Goals for course
       Overview of key areas
       Identify key concepts & issues
       Summarize Harvard
        policies, procedures, resources
       Establish framework for action
       Provide connection to resources, literature




Outline

   [Preliminaries]

   Law, policy, ethics
   Research methods, design, management
   Information security (storage, transmission, use)
   Disclosure limitation

   [Additional Resources & Summary of Recommendations]
Steps to Manage Confidential Research Data
    Identify potentially sensitive information in planning
        Identify legal requirements, institutional requirements, data use agreements
        Consider obtaining a certificate of confidentiality
        Plan for IRB review
    Reduce sensitivity of collected data in design
    Separate sensitive information in collection
    Encrypt sensitive information in transit
    Desensitize information in processing
        Removing names and other direct identifiers
        Suppressing, aggregating, or perturbing indirect identifiers
    Protect sensitive information in systems
        Use systems that are controlled, securely configured, and audited
        Ensure people are authenticated, authorized, licensed
    Review sensitive information before dissemination
        Review disclosure risk
        Apply non-statistical disclosure limitation
        Apply statistical disclosure limitation
        Review past releases and publicly available data
        Check for changes in the law
        Require a use agreement
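The "separate sensitive information in collection" step above is commonly implemented by splitting each record into an identity file and a de-identified response file, linked only by a random study key and stored under separate protections. A minimal sketch, with hypothetical field names:

```python
# Sketch: split a collected record into an identity part (direct
# identifiers) and a response part, linked only by a random key.
# The two parts would be stored in separately protected systems.
import secrets

def split_record(record, id_fields=("name", "email")):
    key = secrets.token_hex(8)  # random, non-derivable linking key
    identity = {f: record[f] for f in id_fields}
    responses = {k: v for k, v in record.items() if k not in id_fields}
    return key, identity, responses

key, identity, responses = split_record(
    {"name": "A. Jones", "email": "aj@example.edu", "q1": 4, "q2": "yes"})
print(identity)   # direct identifiers only
print(responses)  # de-identified survey responses
```

The key is generated randomly rather than derived from the identifiers, so the response file cannot be re-linked without the separately held crosswalk.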



Law, Policy & Ethics

        Ethical Obligations
        Laws
        Fun and games 
        Harvard Policies
        [Summary]
Confidentiality & Research Ethics

    Belmont Principles

        Respect for Persons
            individuals should be treated as autonomous agents
            persons with diminished autonomy are entitled to protection
            implies "informed consent"
            implies respect for confidentiality and privacy
        Beneficence
            research must have individual and/or societal benefit to
             justify risks
            implies minimizing the risk/benefit ratio
Scientific & Societal Benefits of Data Sharing

    Increases replicability of research
        Journal publication policies may apply
    Increases scientific impact of research
        Follow-up studies
        Extensions
        Citations
    Public interest in data produced by public funders
        Funder policies may apply
    Public interest in data that supports public policy
        FOIA and state FOI laws may apply
    Open data facilitates…
        Transparent government
        Scientific collaboration
        Scientific verification
        New forms of science
        Participation in science
        Hands-on education
        Continuity of research

Sources: Fienberg et al. 1985; ICSU 2004; Nature 2009
Sources of Confidentiality Restrictions for University Research Data

    Overlapping laws
    Different laws apply to different cases
    All affiliates are subject to university policy

(Not included: EU directive, foreign laws, classified data, …)
45 CFR 46 [Overview]
"The Common Rule"

 Governs human subject research
        with federal funds / at federal institutions
    Establishes rules for the conduct of research
    Establishes confidentiality and consent requirements for
     identified private data
    However, some information may be required to be disclosed
     under state and federal laws (e.g. in cases of child abuse)
    Delegates procedural decisions to Institutional Review
     Boards (IRBs)
HIPAA [Overview]
Health Insurance Portability and Accountability Act

 Protects personal health care information held by "covered
  entities"
 Detailed technical protection requirements
 Provides the clearest legal standards for dissemination
 Provides a "safe harbor"
 Has become accepted practice for dissemination in other areas
  where laws are less clear
 HITECH Act of 2009 extends HIPAA
      Extends coverage to associated entities of covered entities
      Additional technical safeguards
      Adds breach reporting requirement

  HIPAA provides three dissemination options …
Dissemination under HIPAA [option 1]

    "safe harbor" -- remove 18 identifiers
        [Personal identifiers]
            Names
            Social Security #s; personal account #s; certificate/license #s;
             full-face photos (and comparable images); biometric IDs; medical
             record #s
            Any other unique identifying number, characteristic, or code
        [Asset identifiers]
            Fax #s; phone #s; vehicle #s
            Personal URLs; IP addresses; e-mail addresses
            Device IDs and serial numbers
        [Quasi identifiers]
            Dates smaller than a year (and ages > 89 collapsed into one
             category)
            Geographic subdivisions smaller than a state (except the first
             3 digits of a zipcode, if the unit has > 20,000 people)
     And
        The entity does not have actual knowledge [direct and clear
         awareness] that it would be possible to use the remaining
         information, alone or in combination with other information, to
         identify the subject
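The quasi-identifier rules above (dates reduced to years, zipcodes truncated to 3 digits, ages over 89 collapsed) are mechanical enough to sketch in code. The field names and the small-population zip3 list below are illustrative assumptions — a real implementation would need Census data and expert review, so this is not a compliance tool:

```python
# Sketch of safe-harbor-style generalization of quasi-identifiers.
# Assumes dates are "YYYY-MM-DD" strings; field names are hypothetical.

# 3-digit zip prefixes whose geographic unit has <= 20,000 people must
# be suppressed entirely; this set is illustrative, not the real list.
SMALL_ZIP3 = {"036", "059", "102"}

def generalize(record):
    out = dict(record)
    out["birthdate"] = record["birthdate"][:4]   # keep year only
    zip3 = record["zip"][:3]
    out["zip"] = "000" if zip3 in SMALL_ZIP3 else zip3
    out["age"] = min(record["age"], 90)          # collapse ages 90+
    return out

print(generalize({"birthdate": "1961-01-01", "zip": "02145", "age": 95}))
# {'birthdate': '1961', 'zip': '021', 'age': 90}
```

Note that generalization alone does not satisfy the safe harbor — the other listed identifiers must also be removed, and the "no actual knowledge" condition still applies.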
Dissemination under HIPAA [Option 2]

    "limited dataset" – leave some quasi-identifiers

        Remove personal and asset identifiers
        Permitted dates: dates of birth, death, service; years
        Permitted geographic subdivisions: town, city, state, zip code
And
        Require access control and a data use agreement.
Dissemination under HIPAA [Option 3]

    "qualified statistician" – statistical determination
     Have a qualified statistician determine, using generally accepted
     statistical and scientific principles and methods, that the risk
     is very small that the information could be used, alone or in
     combination with other reasonably available information, by the
     anticipated recipient to identify the subject of the information.
    Important caveats
        Methods and results of the analysis must be documented
        No bright line for "qualified"; the text of the rule is:
         "a person with appropriate knowledge of and experience with
         generally accepted statistical and scientific principles and
         methods for rendering information not individually identifiable."
         [Section 164.514(b)(1)]
        No clear definitions of "generally accepted", "very small", or
         "reasonably available information"…
         however, the Federal Register references statistical
         publications to be used as "starting points"
FERPA

     Family Educational Rights and Privacy Act

    Applies to schools that receive federal (D.O.E.) funding
    Restricts use of student (not employee) information
    Establishes
        Right to privacy of educational records
        Right to inspect and correct records (with appeal to the Federal
         government)
        Definition of public "directory" information
        Right to block access to public "directory" information, and to
         other records
    Educational records include:
        Identified information about a student
        Maintained by the institution
        Not …
            Employee records
            Some medical and law-enforcement records
            Records solely in the possession of and for use by the creator
             (e.g. unpublished instructor notes)
    Personally identifiable information includes:
        Direct identifiers
        Indirect (quasi) identifiers
        Indirectly linkable identifiers
        "Information requested by a person who the educational agency or
         institution reasonably believes knows the identity of the student
         to whom the education record relates."
MA 201 CMR 17
Standards for the Protection of Personal Information

 Strongest U.S. general privacy protection law
 Has been delayed/modified repeatedly
 Requires reporting of breaches
        if the data is not encrypted,
        or the encryption key is released in conjunction with the data
    Requires specific technical protections:
        Firewalls
        Encryption of data transmitted over public networks
        Anti-virus software
        Software updates
Inconsistencies in Requirements and Definitions

   Inconsistent definitions of "personally identifiable"
   Inconsistent definitions of sensitive information
   Requirements for de-identification do not always jibe with
    statistical realities

                 FERPA                HIPAA                  Common Rule          MA 201 CMR 17

Coverage         Students in          Medical information    Living persons in    Mass. residents
                 educational          in "covered            research by funded
                 institutions         entities"              institutions
Identification   - Direct             - Direct               - Direct             - Direct
Criteria         - Indirect           - Indirect             - Indirect
                 - Linked             - Linked               - Linked
                 - Bad intent (!)
Sensitivity      Any non-directory    Any medical            Private              Financial, state,
Criteria         information          information            information --       federal
                                                             based on harm        identifiers
Management       - Directory opt-out  - Consent              - Consent            - Specific technical
Requirements     - [Implied] good     - Specific technical   - [Implied] risk       safeguards
                   practice             safeguards             minimization       - Breach reporting
                                      - Breach reporting
Third Party Requirements

    Licensing requirements
    Intellectual property requirements
    Federal/state law and/or policy requirements
        State protection of personal information laws
        Freedom of information laws (FOIA & state FOI)
        State mandatory abuse/neglect notification laws
    And … think ahead to publisher requirements
        Replication requirements
        IP requirements

    Examples
        NSF requires that data from funded research be shared
        NIH requires a data sharing plan for large projects
        Wellcome Trust requires a data sharing plan
        Many leading journals require data sharing
(Some) More Laws & Standards

    California Laws
        Lots of rules
        Apply to any data about California residents
        Privacy policy
        Disclosure
        Reporting policy
    EU Directive 95/46/EC
        Data protection directive
        Provides for notice, limits on purpose of use, consent,
         security, disclosure, accountability
        Forbids transfer of data to entities in countries not
         compliant with the directive
        U.S. is not compliant but …
           Organizations can certify compliance with the FTC
           No auditing/enforcement!
           Substantial criticism of this arrangement
    Payment Card Industry (PCI) Security Standards
        Governs treatment of credit card numbers
        Requires reports, audits, fines
        Detailed technical measures
        Not a law, but helps define good practice
        Nevada law mandates PCI standards
    FISMA
       Federal Information Security Management Act (FISMA),
        Public Law (P.L.) 107-347
       Detailed technical controls over information systems
       Is starting to be applied to NIH-sponsored research
    Sarbanes-Oxley (aka SOX, aka SARBOX)
        Corporate and Auditing Accountability and Responsibility Act
         of 2002
        Applies to U.S. public company boards, management, and public
         accounting firms
        Rarely applies to research in universities
        Section 404 requires annual assessment of organizational
         internal controls – but does not specify details of controls
    Classified Data
        Separate and complex rules and requirements
        The University does not accept classified data
        But may have "Controlled But Unclassified" information
           Vaguely defined area
           Mostly government-produced
           Penalties unclear
        And… export-controlled information, under ITAR and EAR
           Export controls may cover technologies, software, and
            documentation/design documents
           Large penalties

… and over 1100 international human subjects laws…
Predicted Legal Changes for 2011…

    Caselaw
        "personal privacy" does not apply to information about
         corporations (a corporation is not a "person" for this
         purpose) [FCC v. AT&T, 2011]
    Scheduled
        EU "cookie privacy" directive 2009/136/EC goes into effect
        Proposed updates to EU information privacy directives
    Very Likely
        New information privacy laws in selected states in 2011
    Likely
        Increased federal regulation of internet privacy
What's wrong with this picture?

Name       SSN     Birthdate   Zipcode   Gender   Favorite     # of crimes
                                                  Ice Cream    committed
A. Jones   12341   01011961    02145     M        Raspberry    0
B. Jones   12342   02021961    02138     M        Pistachio    0
C. Jones   12343   11111972    94043     M        Chocolate    0
D. Jones   12344   12121972    94043     M        Hazelnut     0
E. Jones   12345   03251972    94041     F        Lemon        0
F. Jones   12346   03251972    02127     F        Lemon        1
G. Jones   12347   08081989    02138     F        Peach        1
H. Smith   12348   01011973    63200     F        Lime         2
I. Smith   12349   02021973    63300     M        Mango        4
J. Smith   12350   02021973    63400     M        Coconut      16
K. Smith   12351   03031974    64500     M        Frog         32
L. Smith   12352   04041974    64600     M        Vanilla      64
M. Smith   12353   04041974    64700     F        Pumpkin      128
N. Smith   12354   04041974    64800     F        Allergic     256
What's wrong with this picture? [annotated]

Column roles: Name – identifier; SSN – sensitive private identifier;
Birthdate, Zipcode, Gender – (quasi) identifiers; Favorite Ice Cream –
private; # of crimes committed – sensitive.

Name       SSN     Birthdate   Zipcode   Gender   Favorite     # of crimes
                                                  Ice Cream    committed
A. Jones   12341   01011961    02145     M        Raspberry    0     ← Mass. resident
B. Jones   12342   02021961    02138     M        Pistachio    0
C. Jones   12343   11111972    94043     M        Chocolate    0     ← Californian
D. Jones   12344   12121972    94043     M        Hazelnut     0
E. Jones   12345   03251972    94041     F        Lemon        0     ← Twins, separated at birth?
F. Jones   12346   03251972    02127     F        Lemon        1
G. Jones   12347   08081989    02138     F        Peach        1     ← FERPA too?
H. Smith   12348   01011973    63200     F        Lime         2
I. Smith   12349   02021973    63300     M        Mango        4
J. Smith   12350   02021973    63400     M        Coconut      16
K. Smith   12351   03031974    64500     M        Frog         32
L. Smith   12352   04041974    64600     M        Vanilla      64
M. Smith   12353   04041974    64700     F        Pumpkin      128
N. Smith   12354   04041974    64800     F        Allergic     256   ← Unexpected response?
Harvard Enterprise Information Security Policy (HEISP)

    Storing High Risk Confidential Information (HRCI)
        Must not be stored on individual user computers or portable
         storage devices
        Must be stored on "target computers" or in secure locked
         containers
    Human subject information
        All research on human subjects must be approved by the IRB
        All proposals must include a data management plan
    Personally identifiable medical information (PIMI)
        "Covered entities" at Harvard are subject to HIPAA requirements
        PIMI is to be treated as HRCI throughout the university
    Obtaining confidential information requires approval
    All confidential information must be encrypted when transported
     across any network
    Public directories must adhere to privacy preferences established
     by the individuals
    Identifying users with access to confidential information
        System owners must be able to identify users that have access
         to confidential information
        Strong passwords
        No account/password sharing
    Inhibit password guessing with logging and lockouts
    Limit application availability time with timeouts
    Limit user access to confidential information based on business
     need
    Confidential information on Harvard computing devices
        Confidential information must be protected
        Confidential information on portable devices must be encrypted
        Laptops must have encryption (some schools require whole-disk
         encryption)
        Systems must be scanned annually
    Cannot save confidential information on computers directly
     accessible from the internet or open Harvard networks
    Employees who have access must annually agree to confidentiality
     agreements
    Access to lists and databases of Harvard University ID numbers is
     restricted
    Each school must provide training
    Registrars have developed a common definition of FERPA directory
     information
    Must adhere to student requests to block their directory
     information, per FERPA
    Accepting payment cards – restricted to procedures outlined in the
     HU Credit Card Merchant Handbook

[ More on next page…]
HEISP – Part 2

     Physical environment
         All digital/non-digital media must be properly protected
         Computers must be physically secure
         Automatic logging must be consistent with written policies
     Vendor contracts
         Require approval by security officer
         Include OGC contract rider
     Computer operators
         Computers must be regularly updated
         Operated securely
         Only necessary applications installed
         Annually certify compliance with university policies
     Computer setup: must filter malicious traffic
     "Target" systems and controllers
         Private address space; locally firewalled
         Annual vulnerability scanning
     Network take-down
         Network managers run vulnerability scans
         May take computers off the network
     Service resumption
         Must have a service resumption plan if loss of confidential data is a substantial business risk
     Incident response policy
     Disposition and destruction of records
     Acquisition/use by unauthorized persons must be reported to OGC
     Interacting with legal authorities: always refer to OGC unless imminent health/safety risk requires otherwise
     Web-based surveys must have protections in place

    32
Harvard:
Research Data Security Policy (HRDSP)

     Sensitivity of research data is based on potential harm if disclosed:
         Level 5 = "extremely sensitive"
         Level 4 = "very sensitive" ~= HRCI
         Level 3 = "sensitive" ~= HCI
         Level 2 = "benign" ~= good computer hygiene
         Level 1 = anonymous and not business confidential
     Required protections are based on sensitivity:
         Level 5: entirely disconnected from network ("bubble security")
         Level 4: protections as per HRCI
         Level 3: protections as per HCI
         Level 2: good computer hygiene
     Designates procedures for treatment of external data use agreements [ next section ]
         Legally binding
         Can be both very detailed and not supported by Harvard security procedures
         Investigators should not sign these; forward them to OSP
     Designates responsibilities for IRB, Investigator, OSP, IT, and Security Officers

security.harvard.edu/research-data-security-policy
    33
Harvard:
Researcher Responsibilities

     … for knowing the rules
     … for identifying potentially confidential information in all forms (digital/analogue; on-line/off-line)
     … for notifying recipients of their responsibility to protect confidentiality
     … for obtaining IRB approval for any human subjects research
     … for following an IRB-approved plan
     … for obtaining OSP approval of restricted data use agreements with providers, even if no money is involved
     … and for proper
         Storage
         Access
         Transmission
         Disposal

Confidentiality is not an "IT problem"

    34
Harvard:
Staff – Personnel Manual

     Protect Harvard information and systems
     Keep your own information in PeopleSoft up to date
     Comply with copyrights and the DMCA
     Comply with Harvard systems policies and procedures
     All information produced at work is Harvard property
     Attach only approved devices to the Harvard network

harvie.harvard.edu/docroot/standalone/Policies_Contracts/Staff_Personnel_Manual/Section2/Privacy.shtml

    35
Key Concepts & Issues Review

     Privacy
         Control over the extent and circumstances of sharing
     Confidentiality
         Treatment of private, sensitive information
     Sensitive information
         Information that would cause harm if disclosed and linked to an individual
     Personally/individually identifiable information
         Private information
         Directly or indirectly linked to an identifiable individual
     Human subjects
         A living person who is interacted with to obtain research data, or whose private identifiable information is included in research data
     Research
         Systematic investigation
         Designed to develop or contribute to generalizable knowledge
     "Common Rule"
         Law governing federally funded human subjects research
     HIPAA
         Law governing use of personal health information in covered and associated entities
     MA 201 CMR 17
         Law governing use of certain personal identifiers for Massachusetts residents

    36
Checklist: Identify Requirements

Check if research includes …
 Interaction with humans  Common Rule & HEISP/HRDSP apply

Check if data used includes identified …
 Student records  FERPA & HEISP/HRDSP apply
 State, federal, or financial IDs  state law & HEISP/HRDSP apply
 Medical/health information  HIPAA (likely) & HEISP/HRDSP apply
 Human subjects & private info  Common Rule & HEISP/HRDSP apply

Check for other requirements/restrictions on data dissemination:
 Data provider restrictions and University approvals thereof
 Open data requirements and norms
 University information policy
    37
Resources

     E.A. Bankert & R.J. Amdur, 2006, Institutional Review Board: Management and Function, Jones and Bartlett Publishers
     P. Ohm, "Broken Promises of Privacy", SSRN Working Paper
      [ssrn.com/abstract=1450006]
     D.J. Mazur, 2007, Evaluating the Science and Ethics of Research on Humans, Johns Hopkins University Press
     IRB: Ethics & Human Research [journal], Hastings Press
      www.thehastingscenter.org/Publications/IRB/
     Journal of Empirical Research on Human Research Ethics, University of California Press
      ucpressjournals.com/journal.asp?j=jer
     201 CMR 17 text
      www.mass.gov/Eoca/docs/idtheft/201CMR17amended.pdf
     FERPA website
      www.ed.gov/policy/gen/guid/fpco/ferpa/index.html
     HIPAA website
      www.hhs.gov/ocr/privacy/
     Common Rule website
      www.hhs.gov/ohrp/humansubjects/guidance/45cfr46.htm
     State laws
      www.ncsl.org/Default.aspx?TabId=13489
     Harvard Enterprise Information Security Policy / Research Data Security Policy
      www.security.harvard.edu
     Harvard Institutional Review Board
      www.fas.harvard.edu/~research/hum_sub/
     Harvard FAS Policies and Procedures
      www.fas-it.fas.harvard.edu/services/catalog/browse/39
     IQSS Policies and Procedures
      support.hmdc.harvard.edu/kb-930/hmdc_policies

    38
Research design, methods, management

     Reducing risk
         Sensitivity of information
         Partitioning
     Decreasing identification
     Managing confidentiality and dissemination
     [Summary]

    39
Trade-offs

     Anonymity vs. research utility
     Sensitivity vs. research utility
     (Anonymity * Sensitivity) vs. research costs/efforts

    40
Types of Sensitive Information

     Information is sensitive if, once disclosed, there is a "significant" likelihood of harm
     IRB literature suggests possible categories of harm:
         loss of insurability
         loss of employability
         criminal liability
         psychological harm
         social harm to a vulnerable group
         reputational harm
         emotional harm
         dignitary harm
         physical harm: risk of disease, injury, or death
    41
Levels of sensitivity

    No widely accepted scale
    Publicly available data are not sensitive under the "Common Rule"
    The Common Rule anchors the scale at "minimal risk":
     "if disclosed, the probability and magnitude of harm or discomfort anticipated are not greater in and of themselves than those ordinarily encountered in daily life or during the performance of routine physical or psychological examinations or tests"
    Harvard Research Data Security Policy:
        Level 5 – Extremely sensitive information about individually identifiable people.
         Information that if exposed poses significant risk of serious harm. Includes information posing serious risk of criminal liability, serious psychological harm or other significant injury, loss of insurability or employability, or significant social harm to an individual or group.
        Level 4 – Very sensitive information about individually identifiable people.
         Information that if exposed poses a non-minimal risk of moderate harm. Includes civil liability, moderate psychological harm, or material social harm to individuals or groups; medical records not classified as Level 5; sensitive-but-unclassified national security information; and financial identifiers (as per HRCI standards).
        Level 3 – Sensitive information about individually identifiable people.
         Information that if disclosed poses a significant risk of minor harm. Includes information that would reasonably be expected to damage reputation or cause embarrassment, and FERPA records.
        Level 2 – Benign information about individually identifiable people.
         Information that would not be considered harmful, but as to which a subject has been promised confidentiality.
        Level 1 – De-identified information about people, and information not about people.

    42
IRB Review Scope

     IRB approval is needed for:
         all federally funded research;
         any research involving "human subjects" at (almost all) institutions receiving federal funding ( any organization operating under a general "federal-wide assurance");
         all human subjects research at Harvard
     Human subject: an individual about whom an investigator (whether professional or student) conducting research obtains
         (1) data through intervention or interaction with a living individual, or
         (2) identifiable private information about living individuals

     See www.hhs.gov/ohrp/

    43
Research not requiring IRB approval

     Non-research:
      not a systematic investigation designed to produce generalizable knowledge
     Non-funded:
      institution receives no federal funds for research
     Not human subjects:
         No living people described
         Observation only AND no private identifiable information is obtained
     Human subjects, but "exempt" under 45 CFR 46:
         use of existing, publicly available data
         use of existing non-public data, if individuals cannot be directly or indirectly identified
         research conducted in educational settings, involving normal educational practices
         taste & food quality evaluation
         federal program evaluation approved by agency head
         observational, survey, test & interview research on public officials and candidates (in their formal capacity, or not identified)
     Caution: not all "exempt" research is exempt…
         Some research on prisoners or children is not exemptable
         Some universities require review of "exempt" research
     Harvard requires review of all human subjects research
     See:
      www.hhs.gov/ohrp/humansubjects/guidance/decisioncharts.htm

    44
IRBs and Confidential Information

     IRBs review consent procedures and documentation
     IRBs may review data management plans
         May require procedures to minimize risk of disclosure
         May require procedures to minimize harm resulting from disclosure
     IRBs determine the sensitivity of information (the potential harm resulting from disclosure)
     IRBs determine whether data is de-identified for "public use"
      [see NHRPAC, "Recommendations on Public Use Data Files"]

    45
Harvard IRB Approval

     The Harvard Institutional Review Board (IRB) must approve all human subjects research at Harvard prior to data collection or use
     Research involves human subjects if:
         there is any interaction or intervention with living humans; or
         identifiable private data about living humans is used
     Some examples of human subjects research in social science:
         Surveys
         Behavioral experiments
         Educational tests and evaluations
         Analysis of identified private data collected from people
          (your e-mail inbox, logs of web-browsing activity, Facebook activity, eBay bids …)
     The IRB will:
         Assess the research protocol
         Identify whether research is exempt from further review and management
         Identify the sensitivity level of the data

    46
HRDSP Responsibilities

     Researchers are responsible for disclosing to the IRB, and for following the IRB-approved plan
     The IRB is responsible for ensuring the adequacy of investigators' plans, and for granting (lawful) variances from security requirements justified by research needs
     IT is responsible for assisting with the identification of the security level, and for assisting in the implementation of security protections
     The Security Officer/CIO may review IT facilities and approve (give written designation) that they meet protections for a given level

    48
Valuation of private information is uncertain

     Privacy valuations are often inconsistent
         Framing effects: ordering, endowment effect, possibly others
         Non-normal/uniform distribution of valuations
         One study: < 10% of subjects would give up $2 of a $12 gift card to buy anonymity of purchases
          [Acquisti & Loewenstein 2009]
     Cost-benefit of information security may not be optimal for users [Herley 2009]
         E.g., loss from all phishing attacks is 100x less than time spent avoiding them
         Note, however, weaknesses in this analysis:
             Only loss of time modeled; no valuation of privacy made
             Institutional costs not included, only personal costs
             Very simplified model, not calibrated through surveys, etc.
     Repeated surveys of students show they tend to disclose a lot, e.g.:
         > 80% of students sampled in several studies had public Facebook pages with birthdays, home town, and other private information
             This information can easily be used to link to other databases!
         Disclosure of extensive information on sexual orientation, private cell numbers, drinking habits, etc. is not uncommon
          [See Kolek & Saunders 2008]
     Emerging markets for privacy?
         Micropayments for disclosures
         www.personal.com
         www.i-allow.com

    49
Reducing Risk in Data Collection

     Avoid collecting sensitive information, unless it is required by research design, method, or hypothesis
         Unnecessary sensitive information  not minimal risk
         Reducing sensitivity  higher participation, greater honesty
     Collect sensitive information in private settings
         Reduces risk of disclosure
         Increases participation
     Reduce sensitivity through indirect measures
         Less sensitive proxies
             E.g. the Implicit Association Test [Greenwald, et al. 1998]
         Unfolding brackets
         Group response collection
         Randomized response technique [Warner 1965]
         Item count/unmatched count/list experiment technique

    50
Managing Sensitive Data
Collection

     Separate: sensitive measures, (quasi-)identifiers, and other measures
     If possible, avoid storing identifiers with measures:
         Collect identifying information beforehand
         Assign opaque subject identifiers
     For sensitive data:
         Collect on-line directly (with appropriate protections); or
         Encrypt collection devices/media (laptops, USB keys, etc.)
     For very/extremely sensitive data:
         Collect with oversight directly; then
         Store on an encrypted device; and
         Transfer to a secure server as soon as feasible
    51
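The "opaque subject identifiers" above can be generated from a cryptographically secure random source, so that the identifier itself carries no information about the subject. A minimal Python sketch, with the helper name `assign_opaque_ids` being illustrative rather than from any policy or library:

```python
import secrets

def assign_opaque_ids(subjects):
    """Map each subject to a random, meaning-free identifier.

    Unlike a hash of a name or SSN, a random token cannot be
    re-derived or reversed by an attacker. The mapping table itself
    becomes the sensitive artifact: store it separately from (and
    more securely than) the measurement data.
    """
    return {name: secrets.token_hex(8) for name in subjects}

id_map = assign_opaque_ids(["A. Jones", "B. Jones", "C. Jones"])
# Measurement records then carry only id_map[name], never the name.
```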
Randomized Response
Technique

     Subject rolls a die
         If the roll is > 2: subject answers the sensitive question, and the answer is recorded
         If the roll is  2: subject simply says "YES", and that answer is recorded
     The interviewer records the answer without knowing which branch the subject took

Variations:
     Ask two different questions
     Item counts with sensitive and non-sensitive items (eliminates subject randomization)
     Regression analysis methods

    52
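Under the die-roll design above, a "yes" is observed with probability P(yes) = (4/6)·p + (2/6), where p is the true rate of the sensitive trait, so p can be recovered from the aggregate responses without knowing any individual's branch. A minimal Python sketch (the function names are illustrative, not from any library):

```python
import random

P_TRUTH = 4 / 6  # P(die roll > 2): subject answers the sensitive question

def simulate_responses(true_rate, n, seed=0):
    """Simulate the die-roll design: with probability P_TRUTH the
    subject answers the sensitive question truthfully; otherwise
    the subject simply says "yes"."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        if rng.random() < P_TRUTH:
            out.append(rng.random() < true_rate)  # truthful answer
        else:
            out.append(True)                      # forced "yes"
    return out

def estimate_rate(responses):
    """Invert P(yes) = P_TRUTH * p + (1 - P_TRUTH) to recover p."""
    p_yes = sum(responses) / len(responses)
    return (p_yes - (1 - P_TRUTH)) / P_TRUTH

responses = simulate_responses(true_rate=0.2, n=100_000)
estimate = estimate_rate(responses)  # close to 0.2 for large n
```

Note the efficiency cost flagged two slides below: each forced "yes" adds noise, so the estimator needs a larger sample than direct questioning would.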
Our Table – Less (?) Sensitive

Name       SSN     Birthdate   Zipcode   Gender   Favorite    Treat?   # acts*
                                                  Ice Cream
A. Jones   12341   01011961    02145     M        Raspberry   0        0
B. Jones   12342   02021961    02138     M        Pistachio   1        20
C. Jones   12343   11111972    94043     M        Chocolate   0        0
D. Jones   12344   12121972    94043     M        Hazelnut    1        12
E. Jones   12345   03251972    94041     F        Lemon       0        0
F. Jones   12346   03251972    02127     F        Lemon       1        7
G. Jones   12347   08081989    02138     F        Peach       0        1
H. Smith   12348   01011973    63200     F        Lime        1        17
I. Smith   12349   02021973    63300     M        Mango       0        4
J. Smith   12350   02021973    63400     M        Coconut     1        18
K. Smith   12351   03031974    64500     M        Frog        0        32
L. Smith   12352   04041974    64600     M        Vanilla     1        65
M. Smith   12353   04041974    64700     F        Pumpkin     0        128
N. Smith   12354   04041974    64800     F        Allergic    1        256

* acts = crimes if treatment = 0; crimes + acts of generosity if treatment = 1
    53
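The Treat?/# acts columns above follow the item-count (list experiment) idea: control subjects report only the baseline items, treated subjects report the baseline plus the sensitive ones, and the difference in group means estimates the average count of the sensitive items without any single row revealing them. A sketch using the toy numbers from the table:

```python
# (treat, acts) pairs transcribed from the table above
rows = [
    (0, 0), (1, 20), (0, 0), (1, 12), (0, 0), (1, 7), (0, 1),
    (1, 17), (0, 4), (1, 18), (0, 32), (1, 65), (0, 128), (1, 256),
]

def item_count_estimate(data):
    """Difference in mean counts between the two groups estimates the
    mean number of items present only in the treatment list, with no
    individual ever reporting those items directly."""
    treated = [acts for treat, acts in data if treat == 1]
    control = [acts for treat, acts in data if treat == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

estimate = item_count_estimate(rows)  # 230/7, about 32.9, for this toy table
```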
Randomized Response –
Pros and Cons

     Pros
         Can substantially reduce risks of disclosure
         Can increase response rate
         Can decrease mis-reporting
     Warning!
         None of the randomized models uses a formal measure of disclosure limitation
         Some would clearly violate measures (such as differential privacy) we'll see in section 4
         Do not use as a replacement for disclosure limitation
     Other issues
         Loss of statistical efficiency (if compliance would otherwise be the same)
         Complicates data analysis, especially model-based analysis
         Leaving randomization up to the subject can be unreliable
         May provide less confidentiality protection if:
             Randomization is incomplete
             Records of randomization assignment are kept
             Lists of responses overlap across questions
             The sensitive question's response is large enough to dominate the overall response
             Non-sensitive question responses are extremely predictable, or publicly observable

    54
Partitioning Information

    Reduces risk in information management
    Partition information based on sensitivity
        Identifying information
        Descriptive information
        Sensitive information
        Other information
    Segregate
        Storage of information
        Access regimes
        Data collection channels
        Data transmission channels
    Plan to segregate as early as feasible in data collection
     and processing
    Link segregated information with artificial keys …
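A minimal sketch of this partitioning with an artificial link key (the record layout and field names are hypothetical):

```python
import secrets

# Hypothetical record mixing identifying and sensitive fields.
record = {
    "name": "A. Jones", "ssn": "123-41-0000",       # identifying
    "favorite_ice_cream": "Raspberry", "treat": 0,  # sensitive
}

def partition(record: dict, identifying_fields: set):
    """Split a record into an identified row and a de-identified row,
    joined only by a random artificial key (not derived from identifiers)."""
    link = secrets.token_hex(4)
    identified = {k: v for k, v in record.items() if k in identifying_fields}
    deidentified = {k: v for k, v in record.items() if k not in identifying_fields}
    identified["link"] = deidentified["link"] = link
    return identified, deidentified

ident, deident = partition(record, {"name", "ssn"})
# Store and transmit the two rows separately; only the (highly sensitive)
# link mapping can rejoin them.
```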
Partitioned table

Identified:
Name      SSN    Birthdate  Zipcode  Gender  LINK
A. Jones  12341  01011961   02145    M       1401
B. Jones  12342  02021961   02138    M       283
C. Jones  12343  11111972   94043    M       8979
D. Jones  12344  12121972   94043    M       7023
E. Jones  12345  03251972   94041    F       1498
F. Jones  12346  03251972   02127    F       1036
G. Jones  12347  08081989   02138    F       3864
H. Smith  12348  01011973   63200    F       2124
I. Smith  12349  02021973   63300    M       4339

Not Identified:
LINK  Favorite Ice Cream  Treat  # acts
1401  Raspberry           0      0
283   Pistachio           1      20
8979  Chocolate           0      0
7023  Hazelnut            1      12
1498  Lemon               0      0
1036  Lemon               1      7
3864  Peach               0      1
2124  Lime                1      17
4339  Mango               0      4
6629  Coconut             1      18
9091  Frog                0      32
9918  Vanilla             1      65
4749  Pumpkin             0      128
8197  Allergic            1      256
Choosing Linking Keys

    Entirely randomized
        Most resistant to relinking
        Mapping from original IDs to random keys is highly sensitive
        Must keep and be able to access mapping to add new identified data
        Most computer-generated random numbers are not sufficient by themselves
            Most are PSEUDO-random – predictable sequences
            Use a cryptographically secure PRNG: Blum Blum Shub, AES (or another block cipher) in counter mode, OR
            Use real random numbers (e.g. from physical sources – see http://maltman.hmdc.harvard.edu/numal/), OR
            Use a PRNG with a real random seed to randomize the order of the table; then another to generate the IDs for this randomly
             ordered table
    Encryption
        More troublesome to compute
        Same IDs + same key + same "salt" produce the same values → facilitates merging
        IDs can be recovered if the key is exposed, cracked, or the algorithm is weak
    Cryptographic Hash
     e.g. SHA-256
        Security is well understood
        Tools available to compute
        Same IDs produce same hashes → easier to merge new identified data
        IDs cannot be recovered from the hash because the hash loses information
        IDs can be confirmed if identifying information is known or guessable
    Cryptographic Hash + secret key
        Security is well understood
        Tools available to compute
        Same IDs produce same hashes → easier to merge new identified data
        IDs cannot be recovered from the hash because the hash loses information
        IDs cannot be confirmed unless the key is also known
    Do not choose arbitrary mathematical functions of other identifiers!
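For example, a keyed hash (HMAC-SHA256, available in the Python standard library) yields linking keys that merge consistently but cannot be confirmed without the secret key. The key value below is a placeholder:

```python
import hmac
import hashlib

# Placeholder key: in practice, generate it randomly and store it separately
# from the data set (the key itself is highly sensitive).
SECRET_KEY = b"store-this-key-separately"

def linking_key(subject_id: str) -> str:
    """Keyed hash (HMAC-SHA256) of an identifier: the same ID always maps
    to the same key (so new identified data can be merged), but IDs cannot
    be recovered or confirmed without the secret key."""
    return hmac.new(SECRET_KEY, subject_id.encode(), hashlib.sha256).hexdigest()[:16]

stable = linking_key("123-41-0000")  # deterministic, mergeable pseudonym
```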
Managing Confidential Information in Research
Anonymous Data Collection: Pros & Cons

    Pros
        Presumption that data is not identifiable
        May increase participation
        May increase honesty
    Cons
        Barrier to follow-up, longitudinal studies
        Can conflict with quality control, validation
        Data still may be indirectly identifiable if respondent
         descriptive information is collected
        Linking data to other sources of information may have
         large research benefits



Anonymous Data Collection Methods

    Trusted third party intermediates
    Respondent initiates re-contacts
    No identifying information recorded
    Use IDs randomized to subjects; destroy the mapping
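A sketch of the last option using Python's `secrets` module (the roster is hypothetical); once the mapping is destroyed, responses cannot be traced back to subjects:

```python
import secrets

subjects = ["A. Jones", "B. Jones", "C. Jones"]  # hypothetical roster

# Assign each subject a cryptographically random ID with no mathematical
# relationship to any identifier.
mapping = {name: secrets.token_hex(8) for name in subjects}
anonymous_ids = sorted(mapping.values())

# Destroying the mapping makes the IDs untraceable.
del mapping
```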




Remote Data Collection Challenges

    Where a network connection is readily available, it is easy to transfer
     data as collected, or to enter it on a remote system
        Encrypted network file transfer (e.g. SFTP, part of SSH)
        Encrypted/tunneled network file system (e.g. ExpanDrive)
    Where the network connection is less reliable, or the data is high-bandwidth
        Whole-disk-encrypted laptop
        Plus encrypted cloud backup solutions: CrashPlan, BackBlaze,
         SpiderOak
    Small data, short term
        Encrypted USB keys (e.g. w/IronKey, TrueCrypt, PGP)
    Foreign Travel
        Be aware of U.S. EAR export restrictions, use commercial or
         widely-available open encryption software only. Do not use
         bespoke software.
        Be aware of country import restrictions (as of 2008): Burma,
         Belarus, China, Hungary, Iran, Israel, Morocco, Russia, Saudi
         Arabia, Tunisia, Ukraine
        Encrypt data if possible, but don't break foreign laws. Check with
         the Department of State.
Online/Electronic Data Collection Challenges

    IP addresses are identifiers
        IP addresses can be logged automatically by the host, even if not intended by the researcher
        IP addresses can trivially be observed as data is collected
        Partial IP numbers can be used for probabilistic geographical identification at sub-zipcode
         levels
    Cookies may be identifiers
        Cookies provide a way to link data more easily
        May or may not explicitly identify subject
    Jurisdiction
        Data collected from subjects from other states / countries could subject you to laws in that
         jurisdiction
        Jurisdiction may depend on residency of subject, availability of data collection instrument in
         jurisdiction, or explicit data collection efforts within jurisdiction
    Vendor
        Vendor could retain IP addresses, identifying cookies, etc., even if not intended by researcher
    Recommendation
        Use only vendors that certify compliance with your confidentiality policy
        Do not retain IP numbers if data is being collected anonymously
        Use SSL/TLS encryption unless data is non-sensitive and anonymous
 Some tools for anonymizing IP addresses and system/network logs
www.caida.org/tools/taxonomy/anonymization.xml
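One simple form of log anonymization is prefix truncation: zero the host bits of each logged address before storage. A sketch using Python's standard `ipaddress` module (the `keep_bits` default is an assumption; choose it to balance analytic utility against re-identification risk):

```python
import ipaddress

def anonymize_ip(addr: str, keep_bits: int = 16) -> str:
    """Zero the host bits of an address before logging, keeping only a
    coarse network prefix (a common log-anonymization approach)."""
    net = ipaddress.ip_network(f"{addr}/{keep_bits}", strict=False)
    return str(net.network_address)

print(anonymize_ip("128.103.224.15"))  # prints 128.103.0.0
```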
    Harvard policy
        Recommendations as above
        Plus: do not use or display Level 4+ information for web surveys
Managing Confidential Information in Research
Certificates of Confidentiality

    Issued by DHHS agencies such as NIH, CDC, FDA
    Protects against many types of forced legal
     disclosure of confidential information
    May not protect against all state disclosure laws
    Does not protect against voluntary disclosures by
     researcher/research institutions




Confidentiality & Consent

Best practice is to describe in the consent form...
    Practices in place to protect confidentiality
    Plans for making the data available: to whom, under
     what circumstances, and the rationale
    Limitations on confidentiality (e.g. limits to a certificate of
     confidentiality under state law, planned voluntary
     disclosure)
 Consent form should be consistent with your:
        Data management plan
        Data sharing plans and requirements
    Not generally best practice to promise
        Unlimited confidentiality
        Destruction of all data
        Restriction of all data to original researchers

Data Management Plan

   When is it required?
       Any NIH request over $500K
       All NSF proposals after 12/31/2010
       NIJ
       Wellcome Trust
       Any proposal where collected data will be a resource beyond the project
   Safeguarding data during collection
       Documentation
       Backup and recovery
       Review
   Treatment of confidential information
       Overview: http://www.icpsr.org/DATAPASS/pdf/confidentiality.pdf
       Separation of identifying and sensitive information
       Obtain certificate of confidentiality, other legal safeguards
       De-identification and public use files
   Dissemination
       Archiving commitment (include letter of support)
       Archiving timeline
       Access procedures
       Documentation
       User vetting, tracking, and support

                              One size does not fit all projects.
Data Management Plan Outline

    Data description
        nature of data {generated, observed, experimental information; samples; publications; physical collections; software; models}
        scale of data
    Access and Sharing
        Plans for depositing in an existing public database
        Access procedures
        Embargo periods
        Access charges
        Timeframe for access
        Technical access methods
        Restrictions on access
    Audience
        Potential secondary users
        Potential scope or scale of use
        Reasons not to share or reuse
    Existing Data [if applicable]
        description of existing data relevant to the project
        plans for integration with data collection
        added value of collection, need to collect/create new data
    Formats
        Generation and dissemination formats and procedural justification
        Storage format and archival justification
    Metadata and documentation
        Metadata to be provided
        Metadata standards used
        Treatment of field notes and collection records
    Planned documentation and supporting materials
        Quality assurance procedures for metadata and documentation
    Data Organization [if complex]
        File organization
        Naming conventions
    Quality Assurance [if not described in main proposal]
        Procedures for ensuring data quality in collection, and expected measurement error
        Cleaning and editing procedures
        Validation methods
    Storage, backup, replication, and versioning
        Facilities
        Methods
        Procedures
        Frequency
        Replication
        Version management
        Recovery guarantees
    Security
        Procedural controls
        Technical controls
        Confidentiality concerns
        Access control rules
        Restrictions on use
    Responsibility
        Individual or project team role responsible for data management
    Budget
        Cost of preparing data and documentation
        Cost of permanent archiving
    Intellectual Property Rights
        Entities who hold property rights
        Types of IP rights in data
        Protections provided
        Dispute resolution process
    Legal Requirements
        Provider requirements and plans to meet them
        Institutional requirements and plans to meet them
    Archiving and Preservation
        Requirements for data destruction, if applicable
        Procedures for long-term preservation
        Institution responsible for long-term costs of data preservation
        Succession plans for data should archiving entity go out of existence
    Ethics and privacy
        Informed consent
        Protection of privacy
        Other ethical issues
    Adherence
        When will adherence to the data management plan be checked or demonstrated
        Who is responsible for managing data in the project
        Who is responsible for checking adherence to the data management plan
IQSS Data Management Services

    The Henry A. Murray Research Archive
        Harvard's endowed permanent data archive

       Assists in developing data management plans
       Can provide cataloging assistance for public release of
        data
       Dissemination of data through IQSS Dataverse Network
   The IQSS Dataverse Network
       Standard data management plan for public, small data
       Provides easy virtual archiving and dissemination
       Data is catalogued and controlled by you
       You theme and brand your virtual archive
       Universally searchable, citable
       Automatically provides data formatting and statistical
        analysis on-line
                         http://dvn.iq.harvard.edu
Data Management Plan Examples (Summaries)

    Example 1
    The proposed research will involve a small sample (less than 20 subjects) recruited from clinical facilities in the
     New York City area with Williams syndrome. This rare craniofacial disorder is associated with distinguishing facial
     features, as well as mental retardation. Even with the removal of all identifiers, we believe that it would be difficult if
     not impossible to protect the identities of subjects given the physical characteristics of subjects, the type of clinical
     data (including imaging) that we will be collecting, and the relatively restricted area from which we are recruiting
     subjects. Therefore, we are not planning to share the data.
    Example 2
    The proposed research will include data from approximately 500 subjects being screened for three bacterial
     sexually transmitted diseases (STDs) at an inner city STD clinic. The final dataset will include self-reported
     demographic and behavioral data from interviews with the subjects and laboratory data from urine specimens
     provided. Because the STDs being studied are reportable diseases, we will be collecting identifying information.
     Even though the final dataset will be stripped of identifiers prior to release for sharing, we believe that there remains
     the possibility of deductive disclosure of subjects with unusual characteristics. Thus, we will make the data and
     associated documentation available to users only under a data-sharing agreement that provides for: (1) a
     commitment to using the data only for research purposes and not to identify any individual participant; (2) a
     commitment to securing the data using appropriate computer technology; and (3) a commitment to destroying or
     returning the data after analyses are completed.
    Example 3
    This application requests support to collect public-use data from a survey of more than 22,000 Americans over the
     age of 50 every 2 years. Data products from this study will be made available without cost to researchers and
     analysts. https://ssl.isr.umich.edu/hrs/
    User registration is required in order to access or download files. As part of the registration process, users must
     agree to the conditions of use governing access to the public release data, including restrictions against attempting
     to identify study participants, destruction of the data after analyses are completed, reporting responsibilities,
     restrictions on redistribution of the data to third parties, and proper acknowledgement of the data resource.
     Registered users will receive user support, as well as information related to errors in the data, future releases,
     workshops, and publication lists. The information provided to users will not be used for commercial purposes, and
     will not be redistributed to third parties.

    FROM NIH, [grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm#ex]


External Data Usage Agreements

    Between provider and individual
        Careful – you're liable
        Harvard will not help if you sign
        University review and OSP signature strongly recommended
    Between provider and University
        University liable
        Requires University-approved signer: OSP
    Avoid nonstandard protections whenever possible
        DUAs can impose very specific and detailed requirements
        Compatible in spirit does not imply compatibility in legal practice
    Use University policies/procedures as a …

Controls on Confidential Information

Harvard University has developed extensive technical and administrative
procedures to be used with all identified personal information and other
confidential information. The University classifies this form of information internally
as "Harvard Confidential Information" – or HCI.

Any use of HCI at Harvard includes the following safeguards:

- Systems security. Any system used to store HCI is subject to a checklist of
technical and procedural security measures, including: operating system and
applications must be patched to current security levels, a host-based firewall is
enabled, anti-virus software is enabled, and the definitions file is current.
- Server security. Any server used to distribute HCI to other systems (e.g. through
providing a remote file system), or otherwise offering login access, must employ
additional security measures, including: connection through a private network only;
limitation on length of idle sessions; limitations on incorrect password attempts;
and additional logging and monitoring.
- Access restriction: an individual is allowed to access HCI only if there is a
specific need for access. All access to HCI is over physically controlled and/or
encrypted channels.
- Disposal processes: including secure file erasure and document destruction.
- Encryption: HCI must be strongly encrypted whenever it is transmitted across a
public network, stored on a laptop, or stored on a portable device such as a flash
drive or on portable media.

This is only a brief summary. The full University security policy can be found here:
    http://security.harvard.edu/heisp
And a more detailed checklist used to verify systems compliance is found here:
    http://security.harvard.edu/files/resources/forms/

These safeguards are applied consistently throughout the University; we believe
that these requirements offer stringent protection for the requested data. These
requirements will be applied in addition to any others required by a specific
data use agreement.
IQSS Data Management Services

    The Henry A. Murray Research Archive
        Harvard's endowed permanent data archive

        Assists in developing data management plans
        Can provide cataloging assistance for public release of
         data
        Dissemination of data through IQSS Dataverse Network
        Provides letters of commitment to permanent archiving
                         www.murray.harvard.edu

    The IQSS Dataverse Network
        Provides easy virtual archiving and dissemination
        Data is catalogued and controlled by you
        You theme and brand your virtual archive
        Universally searchable, citable
        Automatically provides data formatting and statistical
         analysis on-line
                            dvn.iq.harvard.edu
Key Concepts & Issues Review

    Levels of sensitivity
    Anonymity criteria
    Sensitivity reduction
    Certificate of Confidentiality
    Data sharing plan
    Data management plan
    Information Partitioning
    Linking Keys




Checklist: Research Design …

    Does research involve human subjects?
    What are the possible harms that could occur if identified information were
     disclosed?
    Is information collected benign, sensitive, very sensitive, or extremely
     sensitive? (IRB makes final determination)
    Can the sensitivity of the information be reduced?
    Can research be carried out with anonymity?
    Can research data be de-identified during collection?
    How can identifying information, descriptive information and sensitive
     information be segregated?
    Have you:
        Completed NIH human subjects training?
        Completed Harvard HETHR training?
    Have you written the following to be consistent with final plans for analysis
     and dissemination:
        data management plan?
        consent documents?
        application for certificate of confidentiality?



    Resources

    E.A. Bankert & R.J. Amdur, 2006, Institutional Review Board:
    Management and Function,
    Jones and Bartlett Publishers
   R. Groves, et al., 2004, Survey Methodology, John Wiley &
     Sons.
   J.A. Fox, P.E. Tracy, 1986, Randomized Response, Sage
    Publications.
   R.M. Lee, 1993, Doing Research on Sensitive Topics, Sage
    Publications.
   D. Corstange, 2009, "Sensitive Questions, Truthful Answers?
    Modeling the List Experiment with LISTIT", Political Analysis
    17:45–63
   ICPSR Data Enclave
    [www.icpsr.umich.edu/icpsrweb/ICPSR/access/restricted/encla
    ve]
   Murray Research Archive
   [www.murray.harvard.edu]
   IQSS Dataverse Network
 74                                                 [Micah Altman, 3/10/2011]
    [dvn.iq.harvard.edu/]
Law, policy, ethics

     Information Security                    Research design …
                                             Information security
                                             Disclosure limitation

        Security principles
        FISMA
        Categories of technical
         controls
        A simplified approach
        Harvard Policies
        [Summary]





Core Information Security Concepts
    Security properties
        Confidentiality
        Integrity
        Availability
        [Authenticity]
        [Nonrepudiation]
    Security practices
        Defense in depth
        Threat modeling
        Risk assessment
        Vulnerability assessment



Risk Assessment

    [NIST 800-100, simplification of NIST 800-30]

    Information Security Control Selection Process:

    System Analysis + Threat Modeling + Vulnerability Identification
      → Analysis (likelihood, impact, mitigating controls)
      → Institute Selected Controls
      → Testing and Auditing

Risk Management Details
    System Characterization
    Threat Identification
    Control analysis
    Likelihood determination
    Impact Analysis
    Risk Determination
    Control recommendation
    Results documentation
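
The likelihood and impact steps above can be made concrete with a toy scoring sketch. The three-level scale and multiplicative combination below are illustrative assumptions only, not the actual NIST 800-30 method:

```python
# Toy risk determination: combine qualitative likelihood and impact
# ratings into a score and rank threats. The three-level scale and
# multiplicative scoring are illustrative assumptions, not NIST's.
LEVELS = {"low": 1, "medium": 2, "high": 3}

def risk_score(likelihood, impact):
    """Numeric risk score from qualitative likelihood and impact."""
    return LEVELS[likelihood] * LEVELS[impact]

def rank_risks(threats):
    """threats: {name: (likelihood, impact)} -> names, highest risk first."""
    return sorted(threats, key=lambda name: risk_score(*threats[name]),
                  reverse=True)
```

The ranking then drives control recommendation: the highest-scoring threats get controls first.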





Classes of threats and vulnerabilities

    Sources of threat

        Natural
        Unintentional Human
        Intentional
    Areas of vulnerability
        Logical
            Data at rest in system
            Data in motion across networks
            Data being processed in applications
        Physical
            Computer systems
            Network
            Backups, disposal, media
        Social
            Social engineering
            Mistakes
            Insider threats



Simple Control Model

    [Figure: Resource Control Model. A client presents credentials, and
    request/response traffic passes through access control (authentication,
    then authorization) before reaching the resource; auditing records each
    decision to a log that is reviewed by an external auditor.]
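The resource control model can be sketched as a chain of checks. This is a toy illustration only; the user table, grant table, and function names are invented for the example:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("audit")

USERS = {"alice": "s3cret"}             # toy credential store
GRANTS = {("alice", "dataset1"): True}  # toy authorization table

def access(user, password, resource):
    """Authenticate, then authorize; audit every decision to the log."""
    if USERS.get(user) != password:
        log.info("DENY auth %s -> %s", user, resource)
        return False
    if not GRANTS.get((user, resource)):
        log.info("DENY authz %s -> %s", user, resource)
        return False
    log.info("ALLOW %s -> %s", user, resource)
    return True
```

Note the ordering: authentication before authorization, and every outcome logged, so an external auditor can reconstruct who touched which resource.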

Operational and Technical Controls [NIST 800-53]

    Operational

        Personnel security
        Physical and environmental protection
        Contingency planning
        Configuration management
        Maintenance
        System and information integrity
        Media protection
        Incident Response
        Awareness and training
    Technical Controls
        Identification and authentication
        Access control
        Audit and accountability
        System and communication protection



Key Information Security Standards
    Comprehensive Information Security Standards
        FISMA – framework for non-classified information security
         in federal government.
        ISO/IEC 27002 – framework of similar scope to FISMA, used
         internationally
        PCI – Payment card industry security standards. Used by major
         payment card companies, processors, etc.
    Related Certifications
        FIPS-compliance and certification
            Establishes standards for cryptographic methods and modules
            Be aware that FIPS certification is often limited to the
             algorithm used, not the entire system
        SAS 70 Audits – Type 2
            Independent audit of controls and control objectives
            Does not establish sufficiency of control objectives
        CISSP -- Certified Information Systems Security Professional
            Widely recognized certification for information security
             professionals

FISMA Overview
Federal Information Security Management Act of 2002
 All federal agencies required to develop agency-wide
  information security plan
 NIST published extensive list of recommendations
 Federal sponsors seem to be trending to FISMA as
  best practice for managing confidential data
  produced by award
 Identifies risk and impact level; monitoring;
  technical and procedural controls
 Harvard HRCI controls are
  less stringent than FISMA “low”


Security Control Baselines
Access Control
    Low (impact): Policies; Account management*; Access Enforcement;
    Unsuccessful Login Attempts; System Use Notification; Restrict Anonymous
    Access*; Restrict Remote Access*; Restrict Wireless Access*; Restrict
    Mobile Devices*; Restrict use of External Information Systems*; Restrict
    Publicly Accessible Content
    Medium-High (impact), adds: Information flow enforcement; Separation of
    Duties; Least Privilege; Session Lock

Security Awareness and Training
    Low (impact): Policies; Awareness; Training; Training Records

Audit and Accountability
    Low (impact): Policies; Auditable Events*; Content of Audit Records*;
    Storage Capacity; Audit Review, Analysis and Reporting*; Time Stamps*;
    Protection of Audit Information; Audit Record Retention; Audit Generation
    Medium-High (impact), adds: Audit Reduction; Non-Repudiation

Security Assessment and Authorization
    Low (impact): Policies; Assessments*; System Connections; Planning;
    Authorization; Continuous Monitoring

Security Control Baselines
Configuration Management
    Low (impact): Policies; Baseline*; Impact Analysis; Settings*; Least
    Functionality; Component Inventory*
    Medium-High (impact), adds: Change Control; Access Restrictions for
    Change; Configuration Management Plan

Contingency Planning
    Low (impact): Policies; Plan*; Training*; Plan Testing*; System Backup*;
    Recovery & Reconstitution*
    Medium-High (impact), adds: Alternate storage site; Alternate processing
    site; Telecomm

Identification and Authentication
    Low (impact): Policies; Organizational Users*; Identifier Management;
    Authenticator Management*; Authenticator Feedback; Cryptographic Module
    Authentication; Non-Organizational Users
    Medium-High (impact), adds: Device identification and authentication

Incident Response
    Low (impact): Policies; Training; Handling*; Monitoring; Reporting*;
    Response Assistance; Response Plan
    Medium-High (impact), adds: Testing

Maintenance
    Low (impact): Policies; Control*; Non-Local Maintenance Restrictions*;
    Personnel Restrictions*
    Medium-High (impact), adds: Tools; Maintenance scheduling/timeliness

Security Control Baselines
Media Protection
    Low (impact): Policies; Access restrictions*; Sanitization
    Medium-High (impact), adds: Marking; Storage; Transport

Physical and Environmental Protection
    Low (impact): Policies; Access Authorizations; Access Control*;
    Monitoring*; Visitor Control*; Records*; Emergency Lighting; Fire
    protection*; Temperature, humidity, water damage*; Delivery and removal
    Medium-High (impact), adds: Network access control; Output device access
    control; Power equipment access, shutoff, backup; Alternate work site;
    Location of information system components; Information leakage

Planning
    Low (impact): Policies; Plan; Rules of Behavior; Privacy Impact Assessment
    Medium-High (impact), adds: Activity planning

Personnel Security
    Low (impact): Policies; Position categorization; Screening; Termination;
    Transfer; Access Agreements; Third-Parties; Sanctions

Risk Assessment
    Low (impact): Policies; Categorization; Assessment; Vulnerability
    Scanning*

Security Control Baselines
System and Services Acquisition
    Low (impact): Policies; Resource Allocation; Life Cycle Support;
    Acquisition*; Documentation; Software usage restrictions; User-installed
    software restrictions; External information system services restrictions
    Medium-High (impact), adds: Security Engineering; Developer configuration
    management; Developer security testing; Supply chain protection;
    Trustworthiness

System and Communications Protection
    Low (impact): Policies; Denial of Service Protection; Boundary
    protection*; Cryptographic key management; Encryption; Public Access
    Protection; Collaborative computing devices restriction; Secure name
    resolution*
    Medium-High (impact), adds: Application Partitioning; Restrictions on
    Shared Resources; Transmission integrity & confidentiality; Network
    Disconnection Procedure; Public Key Infrastructure Certificates; Mobile
    Code management; VOIP management; Session authenticity; Fail in known
    state; Protection of information at rest; Information system partitioning

System and Information Integrity
    Low (impact): Policies; Flaw remediation*; Malicious code protection*;
    Security Advisory monitoring*; Information output handling
    Medium-High (impact), adds: Information system monitoring; Software and
    information integrity; Spam protection; Information input restrictions &
    validation; Error handling

Program Management
    Plan; Security Officer Role; Resources; Inventory; Performance Measures;
    Enterprise architecture; Risk management strategy; Authorization process;
    Mission definition
HIPAA Requirements

    Administrative controls
        Access authorization, establishment, modification, and termination.
        Training program
        Vendor compliance
        Disaster recovery
        Internal audits
        Breach procedures
    Physical controls
        Disposal
        Access to equipment
        Access to physical environment
        Workstation environment
    Technical controls
        Intrusion protection
        Network encryption
        Integrity checking
        Authentication of communication
        Configuration management
        Risk analysis





Delegating Systems Security

    What are goals for confidentiality, integrity, availability?
    What threats are envisioned?
    What controls are in place?
    Is there a checklist?
    Who is responsible for technical controls?
        Do they have appropriate training, experience and/or
         certification?
    Who is responsible for procedural controls?
        Have they received appropriate training?
    How is security monitored, audited, and tested?
        E.g., SAS 70 Type 2 audits; FISMA compliance; ISO certification
    What security standards are referenced?
        E.g. FISMA, ISO, HEISP/HDRSP/PCI

What most security plans do not do
    Protect against all insider threats
    Protect against all unintentional threats (human
     error, voluntary disclosure)
    Protect against the CIA, TEMPEST, evil maids, and
     other well-resourced, sophisticated adversaries
    Protect against prolonged physical threats to
     computer equipment, or data owner





Information Security is Systemic

Not just control implementation but…



 Policy creation, maintenance, auditing
 Implementation review, auditing, logging, monitoring
 Regular vulnerability & threat assessment





Simplified Approach for Sensitive Data

     Use whole-disk/media encryption to protect data at
      rest
     Use end-to-end encryption to protect data in motion
     Use core information hygiene to protect systems
     Scan for HRCI regularly
     Be thorough in disposal of information

 Very sensitive/extremely sensitive data requires
  more protection.
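
The "scan for HRCI regularly" step can be sketched as a toy Python scanner. The two regex patterns and all names below are illustrative assumptions; production tools such as Cornell Spider or Identity Finder use many more patterns plus checksum validation to cut false positives:

```python
import re
from pathlib import Path

# Regexes for two common high-risk patterns; illustrative only.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
}

def scan_file(path):
    """Return {pattern_name: match_count} for one text file."""
    text = Path(path).read_text(errors="ignore")
    return {name: len(rx.findall(text)) for name, rx in PATTERNS.items()}

def scan_tree(root):
    """Scan every regular file under root; report files with any hits."""
    hits = {}
    for p in Path(root).rglob("*"):
        if p.is_file():
            counts = scan_file(p)
            if any(counts.values()):
                hits[str(p)] = counts
    return hits
```

Run on a schedule against home directories and shared storage; any hit is a candidate for encryption, relocation, or disposal.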




Plan Outline – Very Sensitive Data

    Protect very sensitive data on “target systems”

        Extra physical, logical, administrative access control
            Record keeping
            Limitations
            Lockouts
        Extra monitoring, auditing
        Extra procedural controls – specific, renewed approvals
        Limits on network connectivity
            Private network, not directly connected to public network
    Regular scans
        Vulnerability scans
        Scans for PII
    Extremely sensitive
        Increased access control, procedural limitations
        Not physically/logically connected (even via wireless) to public
         network, directly or indirectly


Key Concepts Review

   Confidentiality
   Integrity
   Availability
   Threat modeling
   Vulnerability assessment
   Risk assessment
   Defense in depth
   Logical Controls
   Physical Controls
   Administrative Controls


Checklist: Identify Requirements

    Documented information security plan?
         What are goals for confidentiality, integrity, availability?
         What threats are envisioned?
         What are the broad types of controls in place?
    Key protections
         Use whole-disk/media encryption to protect data at rest
         Use end-to-end encryption to protect data in motion
         Use basic information hygiene to protect systems
         Be thorough in disposal of information
    Additional protections for sensitive data
         Extra logical, administrative, physical controls for very sensitive data?
         Monitoring and vulnerability scanning for very sensitive data?
         Check requirements for remote and foreign data collection
    Refer to security standards
         FIPS encryption
         FISMA / ISO practices
         SAS-70 Auditing
         CISSP certification of key staff
    Delegate implementation to information security professionals


Resources

    S. Garfinkel, et al., 2003, Practical Unix and Internet
     Security, 3rd ed., O'Reilly Media
 Shon Harris, 2001, CISSP All-in-One Exam Guide,
   Osborne
    NIST, 2009, DRAFT Guide to Protecting the
     Confidentiality of Personally Identifiable Information, NIST
     Publication 800-122.
 NIST, 2009, Recommended Security Controls for Federal
   Information Systems and Organizations v. 3, NIST 800-
   53.
   (Also see related NIST 800-53A, and other NIST
   Computer Security Division Special Publications)
   [csrc.nist.gov/publications/PubsSPs.html]
 NIST, 2006, Information Security Handbook: A Guide for
   Managers, NIST Publication 800-100.
    Harvard Enterprise Security Checklists

    Recommended Software
    Whole Disk Encryption
         Open Source: truecrypt.org
         Commercial: pgp.com
    Scanning
         Vulnerability scanner/assessment tool: www.nessus.org/nessus
         Commercial version scans for (limited) PII: www.nessus.org/nessus
         PII Scanning tool (open source), Cornell Spider: www2.cit.cornell.edu/security/tools
         PII Scanning tool (commercial), Identity Finder: www.identityfinder.com
         File integrity/intrusion detection engine, Samhain: la-samhna.de/samhain
         Network intrusion detection, Snort: www.snort.org
    Encrypt transmission over network
         Open SSL: http://openssl.org
         Open SSH: http://openssh.org
         VTUN: http://vtun.sourceforge.net
    Cloud backup services with encryption
         Crashplan: http://crashplan.com
         Spider oak: http://spideroak.com
         Backblaze: http://backblaze.com
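
The file-integrity tools listed above (e.g., Samhain) rest on a baseline-and-compare idea that can be sketched in a few lines. This is a toy illustration, not a substitute for a real integrity engine (no tamper-proof storage of the baseline, no daemon, no alerting):

```python
import hashlib
from pathlib import Path

def baseline(root):
    """Record SHA-256 digests for every file under root."""
    return {str(p): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in Path(root).rglob("*") if p.is_file()}

def changed_files(old, new):
    """Files added, removed, or modified between two baselines."""
    added = set(new) - set(old)
    removed = set(old) - set(new)
    modified = {p for p in set(old) & set(new) if old[p] != new[p]}
    return added, removed, modified
```

A real deployment stores the baseline off-host (otherwise an intruder can rewrite it along with the files it protects).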




Disclosure Limitation
         Threat models
         Disclosure limitation methods
         Statistical disclosure limitation
          methods
         Types of disclosure
         Factors affecting disclosure protection
         SDL Caveats
         SDL Observations





Threat Models

    Nosy neighbor (nosy employer)

    Muck-raking Journalist (zero-tolerance)
    Business rival contributing to same survey
    Absent-minded professor
    …





Non-statistical Disclosure Limitation Methods

    Licensing
         Used in conjunction with limited de-identification
         Should prohibit re-identification and linking, and dissemination to
          third parties; limit retention
         Advantages: can decrease cost of processing, increase utility
          of research data
         Disadvantages: licenses may be violated unintentionally or
          intentionally, difficult to enforce outside of limited domains (e.g.
          HIPAA)
    Automated de-identification
         Primarily used for qualitative text such as medical records.
          Replaces identifiers with dummy strings.
         Advantages: can decrease cost, increase accuracy of manual
          de-identification of qualitative information
         Disadvantage: little available software; error rates still slightly
          higher than teams of trained human coders



Automated De-identification

    Trained human sensitivity rates:

         Single worker: [.63-.94] (.81)
         Two-person team: [.89-.98] (.94)
         Three-person team: [.98-.99] (.98)
          [Neamatullah 2008]
    State-of-the-art algorithms approach recall of .95
     [Uzuner, et al. 2007]
         Statistical learning of rule template features worked best
         Simpler rules-based approach still did as well as median
          2-person team
         Rules for PII and local dictionary important
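
The rules-based approach described above can be sketched as a toy scrubber. The three rules and placeholder tokens below are invented for illustration; real systems use far richer rule sets plus local dictionaries of names and places:

```python
import re

# Ordered (pattern, placeholder) rules; a real scrubber also consults
# local dictionaries of names, hospitals, and locations.
RULES = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "[DATE]"),
    (re.compile(r"\bDr\.\s+[A-Z][a-z]+\b"), "[NAME]"),
]

def scrub(text):
    """Replace identifiers matched by each rule with a dummy string."""
    for rx, placeholder in RULES:
        text = rx.sub(placeholder, text)
    return text
```

Even this crude version shows why recall, not precision, is the hard part: any identifier the rules miss is disclosed.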



Text de-identification (HIPAA)

Cleaned (by hand check):

Name       SSN   Birthdate   Zipcode   Gender   Favorite    # of crimes
                                                Ice Cream   committed
[Name 1]   *     *1961       021*      M        Raspberry   0
[Name 2]   *     *1961       021*      M        Pistachio   0
[Name 3]   *     *1972       940*      M        Chocolate   0
[Name 4]   *     *1972       940*      M        Hazelnut    0
[Name 5]   *     *1972       940*      F        Lemon       0
[Name 6]   *     *1972       021*      F        Lemon       1
[Name 7]   *     *1989       021*      F        Peach       1
[Name 8]   *     *1973       632*      F        Lime        2
[Name 9]   *     *1973       633*      M        Mango       4
[Name 10]  *     *1973       634*      M        Coconut     16
[Name 11]  *     *1974       645*      M        Frog        32
[Name 12]  *     *1974       646*      M        Vanilla     64
[Name 13]  *     *1974       647*      F        Pumpkin     128
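
How much protection a cleaned table like this offers can be checked by computing equivalence-class sizes over the quasi-identifiers (k-anonymity, in Sweeney's sense). A minimal sketch, with invented column names:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifier columns.

    records: list of dicts; quasi_identifiers: list of keys.
    A result of k means every record is indistinguishable from at
    least k-1 others on those columns.
    """
    classes = Counter(tuple(r[q] for q in quasi_identifiers)
                      for r in records)
    return min(classes.values())
```

Records in an equivalence class of size 1 (e.g., the lone *1989/021*/F row above) remain unique and hence re-identifiable by anyone who knows those attributes.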
Hybrid Statistical/Non-statistical Limitation

    Data enclaves – physically restrict access to data
         Examples: ICPSR, Census Research Data Center
         May include availability of synthetic data as an aid to preparing model
          specifications
         Advantages: extensive human auditing, vetting; information security threats much
          reduced
         Disadvantages: expensive, slow, inconvenient to access
    Controlled remote access
         Varies from remote access to all data and output to human vetting of output
         Advantages: auditable, potential to impose human review, potential to limit analysis
         Disadvantages: complex to implement, slow
    Model servers
         Mediated remote access – analysis limited to designated models
         Advantages: faster, no human in loop
         Disadvantage: statistical methods for ensuring model safety are immature –
          residuals, categorical variables, dummy variables are all risky; very limited set of
          models currently supported; complex to implement
    Statistical Disclosure Limitation
         Modifications to the data to decrease the probability of disclosure
         Advantages/Disadvantages… to follow…



Pure Statistical Disclosure Limitation Techniques

    Data reduction
         Removing variables (i.e. deidentifying)
         Suppressing records
         Sub-sampling
         Global recoding (including top/bottom coding)
         Local suppression
         Global complete suppression 
    Perturbation
         Microaggregation
             Sorting based on similarity
             Replace value of records in clusters with mean
         Rule-based data swapping
         Adding noise
         Resampling
    Synthetic microdata
         Bootstrap
         Multiple imputation
         Model based
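The microaggregation steps above (sort by similarity, cluster, replace with the cluster mean) can be sketched in a few lines. This is an illustrative Python version, not the deck's R/sdcMicro workflow; the single-variable setting and fixed minimum cluster size are simplifying assumptions:

```python
def microaggregate(values, k=3):
    """Microaggregation sketch: sort records by value (similarity),
    group them into clusters of at least k records, and replace each
    record's value with its cluster mean."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # The last cluster absorbs leftovers so every cluster has >= k records
        j = len(order) if len(order) - i < 2 * k else i + k
        cluster = order[i:j]
        mean = sum(values[idx] for idx in cluster) / len(cluster)
        for idx in cluster:
            result[idx] = mean
        i = j
    return result
```

Each released value is now shared by at least k records, at the cost of within-cluster variation.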

Suppression with R and sdcMicro

# setup
>library(sdcMicro)

# load data
>classexample.df<-read.csv("examplesdc.csv", as.is=T,
stringsAsFactors=F, colClasses=c("character","character","character","character","factor","factor","numeric"))

# create a weight variable if needed
>classexample.df$weight<-1


# simple frequency table shows that data is uniquely identified
>ftable(Birthdate~Zipcode,data=classexample.df)
        Birthdate 01/01/1973 02/02/1973 03/25/1972 04/04/1974 08/08/1989 10/01/1961 11/11/1972 12/12/1972 20/02/1961 30/03/1974
Zipcode
02127             0      0      1      0      0      0      0      0      0      0
02138             0      0      0      0      1      0      0      0      1      0
02145             0      0      0      0      0      1      0      0      0      0
63200             1      0      0      0      0      0      0      0      0      0
63300             0      1      0      0      0      0      0      0      0      0
63400             0      1      0      0      0      0      0      0      0      0
64500             0      0      0      0      0      0      0      0      0      1
64600             0      0      0      1      0      0      0      0      0      0
64700             0      0      0      1      0      0      0      0      0      0
64800             0      0      0      1      0      0      0      0      0      0
94041             0      0      1      0      0      0      0      0      0      0
94043             0      0      0      0      0      0      1      1      0      0


Suppression with R and sdcMicro

# global recoding

>recoded.df<-classexample.df
>recoded.df$Birthdate<-substring(classexample.df$Birthdate,7)
>recoded.df$Zipcode<-substring(classexample.df$Zipcode,1,3)


# Check whether the recoded data is anonymous
# NOTE: make sure to use column numbers and w=NULL

>print(freqCalc(recoded.df,keyVars=3:5,w=NULL))

--------------------------
10 observation with fk=1
4 observation with fk=2
 --------------------------




Suppression with R and sdcMicro

# try local suppression with preference for suppressing Gender
>anonymous.out<-localSupp2Wrapper(recoded.df,3:5,w=NULL,kAnon=2,importance=c(1,1,100))
...
[1] "2-anonymity after 2 iterations."

# look at the data
>as.data.frame(anonymous.out$xAnon)

       Name   SSN Birthdate Zipcode Gender Ice.cream Crimes weight
1  A. Jones 12341      1961     021   <NA> Raspberry      0      1
2  B. Jones 12342      1961     021   <NA> Pistachio      0      1
3  C. Jones 12343      1972     940      M Chocolate      0      1
4  D. Jones 12344      1972     940      M  Hazelnut      0      1
5  E. Jones 12345      1972     940   <NA>     Lemon      0      1
6  F. Jones 12346      <NA>     021   <NA>     Lemon      1      1
7  G. Jones 12347      <NA>     021   <NA>     Peach      1      1
8  H. Smith 12348      1973    <NA>   <NA>      Lime      2      1
9  I. Smith 12349      <NA>     633   <NA>     Mango      4      1
10 J. Smith 12350      <NA>     634   <NA>   Coconut     16      1
11 K. Smith 12351      1974    <NA>   <NA>      Frog     32      1
12 L. Smith 12352      <NA>     646   <NA>   Vanilla     64      1
13 M. Smith 12353      <NA>     647   <NA>   Pumpkin    128      1
14 N. Smith 12354      <NA>     648   <NA>  Allergic    256      1




Suppression with R and sdcMicro

# launch the gui if you like
>sdcGui()


# and play around some more




How SDL Methods Reduce Utility
Method                             Issues
Removing variables                 Model misspecification
Suppressing records                Induced non-response bias
Sub-sampling                       Weak protection
Global recoding (generalization)   Censoring
Local suppression                  Non-ignorable missing value bias
Rules-based swapping               Biased; rules must be kept secret
Random swapping                    Weakens bivariate, multivariate relationships
Adding noise                       Weak protection
Resampling                         Weak protection
Synthetic microdata                Destroys unmodeled relationships; not currently widely accepted

Types of Disclosure

    Identity disclosure (re-identification disclosure) –
     associate an individual with a record and set of
     sensitive variables
    Attribute disclosure (prediction disclosure) – improve
     prediction of value of sensitive variable for an
     individual
    Group disclosure -- predict the value of a sensitive
     variable for a known group of people




Factors affecting disclosure protection

    Properties of the sample
         Measured variables
         Realizations of measurements
             Outliers
             Content of qualitative responses
    Distribution of population
    Adversarial knowledge
         Variables
         Completeness
         Errors
         Priors

    Individual reidentification occurs when:
         Respondent is unique on values of the key
         Attacker has access to measurements of key
         Respondent is in attacker's set of measurements
         Attacker comes across disclosed data
         Attacker recognizes respondent
    [Willenborg & De Waal 1996]
Disclosure protection:
k-anonymity [Sweeney 2002]
    Operates on micro-data
    Designate a subset of variables as keys – variables
     that the attacker could use to identify individuals
    For each combination of key variables in the sample,
     there must be ≥ k rows taking on that combination
    k is typically desired to be 3-5
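The k-anonymity condition amounts to frequency counting over the key variables. A minimal Python sketch (the row structure and variable names are hypothetical; sdcMicro's freqCalc, used earlier, performs this check on real data):

```python
from collections import Counter

def k_anonymity(records, key_vars):
    """Return the k-anonymity level of a table: the size of the
    smallest group of rows sharing one combination of key
    (quasi-identifier) values."""
    counts = Counter(tuple(r[v] for v in key_vars) for r in records)
    return min(counts.values())

# Hypothetical rows mirroring the recoded birth-year/zipcode example
rows = [
    {"zip": "021*",  "birth_year": 1961, "gender": "M"},
    {"zip": "021*",  "birth_year": 1961, "gender": "M"},
    {"zip": "9404*", "birth_year": 1972, "gender": "*"},
]
```

Here the third row is a sample unique, so the table is only 1-anonymous.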




Our table made 2-anonymous (one way)

              Cleaned Quasi-keys
Name     SSN   Birthdate   Zipcode   Gender   Favorite    # of crimes
                                              Ice Cream   committed
* Jones  *     *1961       021*      M        Raspberry   0
* Jones  *     *1961       021*      M        Pistachio   0
* Jones  *     *1972       9404*     *        Chocolate   0
* Jones  *     *1972       9404*     *        Hazelnut    0
* Jones  *     *1972       9404*     *        Lemon       0
* Jones  *     *           021*      F        Lemon       1
* Jones  *     *           021*      F        Peach       1
* Smith  *     *1973       63*       *        Lime        2
* Smith  *     *1973       63*       *        Mango       4
* Smith  *     *1973       63*       *        Coconut     16
* Smith  *     *1974       64*       M        Frog        32
* Smith  *     *1974       64*       M        Vanilla     64
* Smith  *     04041974    64*       F        Pumpkin     128
* Smith  *     04041974    64*       F        Allergic    256

(Both more and less than the HIPAA default)
k-anonymous – but not protected
(vulnerabilities: sort order/structure, additional background knowledge, homogeneity)

Name     SSN   Birthdate   Zipcode   Gender   Favorite    # of crimes
                                              Ice Cream   committed
* Jones  *     *1961       021*      M        Raspberry   0
* Jones  *     *1961       021*      M        Pistachio   0
* Jones  *     *1972       9404*     *        Chocolate   0
* Jones  *     *1972       9404*     *        Hazelnut    0
* Jones  *     *1972       9404*     *        Lemon       0
* Jones  *     *           021*      F        Lemon       1
* Jones  *     *           021*      F        Peach       1    ← Homogeneity
* Smith  *     *1973       63*       *        Lime        2
* Smith  *     *1973       63*       *        Mango       4
* Smith  *     *1973       63*       *        Coconut     16
* Smith  *     *1974       64*       M        Frog        32
* Smith  *     *1974       64*       M        Vanilla     64
* Smith  *     04041974    64*       F        Pumpkin     128
* Smith  *     04041974    64*       F        Allergic    256
More than one way to de-identify
(but don't release both…)
One 2-anonymous recoding:

Name     SSN   Birthdate   Zipcode   Gender
* Jones  *     *1961       021*      *
* Jones  *     *1961       021*      *
* Jones  *     *1972       94043     *
* Jones  *     *1972       94043     *
* Jones  *     03251972    *         *
* Jones  *     03251972    *         *
*        *     *           *         *
*        *     *           *         *
* Smith  *     02021973    6*        *
* Smith  *     02021973    6*        *
* Smith  *     03031974    6*        *
* Smith  *     04041974    6*        *
* Smith  *     04041974    6*        *

Another 2-anonymous recoding:

Name     SSN   Birthdate   Zipcode   Gender
* Jones  *     *1961       021*      M
* Jones  *     *1961       021*      M
* Jones  *     *1972       9404*     *
* Jones  *     *1972       9404*     *
* Jones  *     *1972       9404*     *
* Jones  *     *           021*      F
* Jones  *     *           021*      F
* Smith  *     *1973       63*       *
* Smith  *     *1973       63*       *
* Smith  *     *1973       63*       *
* Smith  *     *1974       64*       M
* Smith  *     *1974       64*       M
* Smith  *     04041974    64*       F
* Smith  *     04041974    64*       F
Vulnerabilities of k-anonymity

    Sort order [Sweeney 2002]
         Information in structure of data, not content!
    Contemporaneous release [Sweeney 2002]
         Overlap of information under different anonymization schemes  disclosure
    Information in suppression mechanism may allow recovery
     – e.g. rules-based swapping
    Temporal changes
         "Barn door" – deletion of tuples can subvert k-anonymity  can't "unrelease"
          records
         Additions of tuples can yield disclosures if you re-do anonymization 
          must anonymize these based on the past data release [Sweeney 2002]
    Variable Background Knowledge [Machanavajjhala 2007]
         Incorrect assumption about what variables are in quasi-key
         This may change over time
    Homogeneity [Truta 2006]
         Sensitive values may be homogeneous, even if individuals are not literally
          identified
Strengthening k-anonymity
vs. homogeneity
    Ensure each k-anonymous set also satisfies some measure
     of attribute diversity
         P-sensitive k-anonymity [Truta 2006]
         Fixed l-diversity, Entropy l-diversity, Recursive (c,l) diversity
          [Machanavajjhala 2007]
         T-closeness [Li 2007]
    Diversity measures may be too strong or too
     weak
    And sometimes attribute disclosure is not
     justifiable
         It does not literally (legally?) identify an individual
         Research may be explicitly designed to make
          attribute more predictable
         In some cases, a study would probabilistically identify
          an attribute even if the participant weren't in it!
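The simplest of these diversity measures, fixed (distinct) l-diversity, is easy to compute. An illustrative Python sketch; the row structure and variable names are hypothetical:

```python
from collections import defaultdict

def distinct_l_diversity(records, key_vars, sensitive):
    """Fixed (distinct) l-diversity: the minimum, over the equivalence
    classes induced by the quasi-identifier key, of the number of
    distinct sensitive values within the class. l = 1 means some
    class is completely homogeneous on the sensitive attribute."""
    classes = defaultdict(set)
    for r in records:
        classes[tuple(r[v] for v in key_vars)].add(r[sensitive])
    return min(len(vals) for vals in classes.values())
```

A 2-anonymous table whose first equivalence class shares a single "crimes" value would score l = 1: k-anonymous, but vulnerable to the homogeneity attack.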
Sometimes k-anonymity is too strong

    Embodies several worst-case assumptions
     – safer, but more information loss:
        Sample unique  population unique
        Attacker discovers your data with certainty
        Attacker has complete database of non-
         sensitive variables and their links to identifiers
        Attacker database and sample are error-free




Research Areas

    Standard SDL approaches are designed to apply to dense
     single tables of quantitative data… use caution & seek
     consultation for the following:

    Dynamic data
         Adding new attributes
         Incremental updates
         Multiple views
    Relational data
         Multiple relations that are not easily normalized
    Non-tabular data
         Sparse matrices
         Transactional data
         Trajectory data
         Rich text
         Social networks

Problem 2: Information loss

    No free lunch: anonymization  information loss
    Various approaches – none satisfactory or commonly used
         Count number of suppressed values
         Compare data matrix before & after anonymization
             Entropy, MSE, MAE, mean variation
         Compare statistics on data matrix before & after
             Variance, Bias, MSE
         Weight by (ad-hoc) importance of variable
    Optimal (information loss) k-anonymity is NP-hard
     [Meyerson & Williams 2004]
    Utility degrades very fast as privacy increases
         See [Brickell & Shmatikov 2008; Ohm 2009;
          Dinur & Nissim 2004; Dwork et al. 2006, 2007]
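Two of the information-loss measures listed above (counting suppressed cells, and bias in a simple statistic) can be sketched directly. An illustrative Python version; the before/after list-of-lists interface is an assumption for the example:

```python
def suppression_rate(original, anonymized, missing=None):
    """Fraction of cells suppressed (set to `missing`) by anonymization."""
    cells = [(o, a) for ro, ra in zip(original, anonymized)
             for o, a in zip(ro, ra)]
    return sum(1 for o, a in cells if a is missing and o is not missing) / len(cells)

def mean_shift(original_col, anonymized_col, missing=None):
    """Bias in a simple statistic (the column mean) induced by
    anonymization; suppressed cells are dropped, mimicking
    complete-case analysis on the released data."""
    kept = [a for a in anonymized_col if a is not missing]
    return sum(kept) / len(kept) - sum(original_col) / len(original_col)
```

Comparing such statistics before and after anonymization is exactly the "compare statistics on data matrix" approach from the slide, in miniature.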


Alternative risk limitation –
non-microdata approaches
    Models and tables can be safely generated from anonymized microdata; however,
     information loss may be less when anonymization is applied at the model/table level
     directly
    Model servers
         Compute models on full microdata
         Limit models being run on data
         Limit specifications of models
         Synthesize residuals; perturb results
    Table-based de-identification
         Compute tables on full micro-data
         Perturb (noise, rounding), suppress cells (and complementary cells, if marginals are computed),
          restructure tables (generalization, variable suppression), synthesize values
         Disclosure rule: number of contributors to a cell (similar to k-anonymity); proportion of largest
          group of contributors to a cell total; percentage decrease in upper/lower bounds on contributor
          values
    Limitations
         Feasible (privacy protecting) multi-dimensional table/multiple table protection is NP-hard
         Model/table disclosure requires evaluating entire history of previous disclosures
         Dynamic table servers, model servers should be considered open research topics, not mature.
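The simplest table-level disclosure rule above – a minimum number of contributors per cell – can be sketched as follows. An illustrative Python version; complementary suppression (needed when marginals are also published) is a much harder problem and is deliberately omitted:

```python
def threshold_suppress(table, n=3):
    """Primary cell suppression by a minimum-contributor ('threshold')
    rule: any cell whose count of contributors is below n is
    suppressed (replaced by None) before release."""
    return [[c if c >= n else None for c in row] for row in table]
```

This parallels k-anonymity: no released cell describes fewer than n individuals.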




Alternate solution concept –
probabilistic record linkage
    Apply disclosure rule to population based on threshold
     probability, and estimated population distribution
    E.g. for 3-anonymity – probability < .02 that there exists a
     tuple of quasi-identifier values that occurs < 3 times in the
     population
    Advantages
         When sample is small, population risk model will result in far
          less modification & information loss
    Disadvantages
         Harder to explain.
         Does not literally prevent individual reidentification.
         Need to justify reidentification risk threshold
         Need to justify population distribution model
         Assumes that background knowledge of attacker does not
          include whether each identified individual is in the sample
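One way to operationalize such a population-based rule is sketched below in Python. The independent-Poisson model for cell counts is purely an illustrative assumption standing in for the "population distribution model" the slide says must be justified:

```python
import math

def prob_any_rare_cell(cell_probs, population, k=3):
    """Probability that at least one quasi-identifier combination
    occurs fewer than k times in the population, assuming (as a
    simplifying model) independent Poisson cell counts with mean
    population * p_cell. A release rule might require this
    probability to be below a threshold (e.g. .02 for 3-anonymity)."""
    p_all_safe = 1.0
    for p in cell_probs:
        lam = population * p
        # P(count < k) under Poisson(lam)
        p_rare = sum(math.exp(-lam) * lam**j / math.factorial(j)
                     for j in range(k))
        p_all_safe *= 1.0 - p_rare
    return 1.0 - p_all_safe
```

When the population is large relative to the number of cells, this risk is tiny even though the sample may contain sample uniques – which is why the population model yields far less suppression.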

Alternate Solution Concept
– Bayesian Optimal Privacy

    Possibly…
         Minimal distance between posterior and prior distributions,
          for some or all priors…



    Limitations… [See Machanavajjhala et al. 2007]
             Insufficient knowledge about distributions of attributes
             Insufficient knowledge about distributions of priors
             Instance-level knowledge not modeled well
             Multiple adversaries not modeled
    Possible limitations
         Complexity of computation not known
         Implementation mechanisms not well-known
         Utility reduction not well-known


Alternate Solution Concept –
Differential Privacy




    Based on cryptography theory (traitor tracing schemes) &
     provides formal bounds on disclosure risk across all
     inferences -- handles attribute disclosure well [Dwork 2006]
    Roughly, differential privacy guarantees that all inferences
     made from the data with a subject included will differ only by
     epsilon if the subject is removed
    Analysis is accompanied by formal analysis of estimator
     efficiency – differential privacy can be achieved in many cases
     with (asymptotic) efficiency
    DP is essentially Frequentist … possible Bayesian
     interpretation
         Prior: n-1 complete records, and distribution over nth record
         DP criterion implies Hellinger distance [Fienberg 2009]
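The epsilon guarantee is typically achieved by adding calibrated noise to query answers. A minimal Python sketch of the Laplace mechanism for a count query (the function name and interface are illustrative, not from the deck):

```python
import math
import random

def laplace_count(true_count, epsilon, rng=None):
    """Laplace mechanism for a count query: a count has sensitivity 1
    (adding or removing one subject changes it by at most 1), so
    adding Laplace noise with scale 1/epsilon yields
    epsilon-differential privacy for this single query."""
    rng = rng or random.Random()
    # log(U1) - log(U2) for U1, U2 ~ Uniform(0, 1] is Laplace(0, 1)
    u1 = 1.0 - rng.random()
    u2 = 1.0 - rng.random()
    noise = (1.0 / epsilon) * (math.log(u1) - math.log(u2))
    return true_count + noise
```

Smaller epsilon means stronger privacy and noisier answers; repeated queries consume the privacy budget additively.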
Implementing Differential Privacy

    Currently, almost all realizations of differential privacy rely on noise applied to queries
     against numeric tabular databases – unknown how to apply it to new forms of data
     such as networks [Dwork 2008]
    Static sanitization is possible … BUT limited
         If the possible number of queries in the analysis family is superpolynomial in the
          size of the data, no efficient anonymization exists [Dwork et al 2009]
    Differential privacy methods need to be developed for the type of analysis being
     performed
         Currently, differentially private versions of data-mining queries exist, but
         … development of differentially private versions of common statistical methods is just
          beginning [Dwork & Smith 2009]
    Differential privacy may be too strong in some cases…
         identity disclosure may be the appropriate measure
         disclosing attributes that are the explicit topic of research may be appropriate
         allowing for greater than epsilon gains in information may be appropriate
    There is only one publicly available software tool that supports these methods
     (PINQ)
         Test use only
         Restricted domain of queries
    Researchers may need access to data not just coefficients
     – e.g. “show me the residuals”!



MIND THE GAPS
– Future Research

    Reconcile Bayesian and Frequentist notions of privacy
    Model privacy from game theoretic/social choice & policy analysis point of view
    Reconcile “random response”/sensitive survey methods and statistical disclosure
     concepts
    Disclosure limitation methods needed for new forms of data
    Differential Privacy methods needed for many more statistical models
    Bridge gap between regulatory and statistical views
         Update regulations/law based on statistical concepts
         Educate IRBs on statistical disclosure control
         Integrate permission for data sharing and some disclosure in consent & design of experiments
    Bridge gap between mathematics and implementation
         Very few software packages available for disclosure limitation and analysis
         Interactive disclosure limitation requires not just software, but validated, audited software
          infrastructure
    Data sharing infrastructure needed for managing confidentiality effectively:
         Applying interactive privacy automatically
         Implementing limited data use agreements
         Managing access & logging – virtual enclave
         Providing chokepoint for human auditing of results
         Providing systems auditing, vulnerability & threat assessment
         Ideally:
             Research design information automatically fed into disclosure control parameterization
             Consent documentation automatically integrated with disclosure policies, enforced by system
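The "random response" survey methods mentioned above can be illustrated with Warner's classic design. A hedged Python sketch; the truth probability p and the interface are illustrative:

```python
import random

def randomized_response_estimate(true_answers, p=0.75, rng=None):
    """Warner-style randomized response: each respondent answers
    truthfully with probability p and gives the opposite answer with
    probability 1-p, so no individual answer is incriminating. The
    observed 'yes' rate y relates to the true prevalence pi by
    y = p*pi + (1-p)*(1-pi), hence pi_hat = (y - (1-p)) / (2p - 1)."""
    rng = rng or random.Random()
    observed = [a if rng.random() < p else (not a) for a in true_answers]
    y = sum(observed) / len(observed)
    return (y - (1 - p)) / (2 * p - 1)
```

The estimator recovers the population prevalence while each individual response carries plausible deniability – a per-respondent privacy guarantee closely related to the statistical disclosure concepts above.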
What to do – for now…
    (1) Use only information that has already been made
     public, is entirely innocuous, or has been declared legally
     deidentified; or
    (2) Obtain informed consent from research subjects, at
     the time of data collection, that includes acceptance of
     the potential risks of disclosure of personally identifiable
     information; or
    (3) Pay close attention to the technical requirements
     imposed by law:
         Remove all 18 HIPAA factors; or
         Use suppression and recoding to achieve k-anonymity with l-
          diversity on data before releasing it or generating detailed
          figures, maps, or summary tables.
         Supplement data sharing with data-use agreements.
         Apply extra caution & seek consultation for "non-traditional"
          data – networks, text corpuses, etc.

Preliminary Recommendations

    Avoid complexities of table and model SDL

         Apply SDL to microdata
          Tables and models based on de-identified microdata are
           de-identified
    Use substantive knowledge to guide disclosure
     limitation
         Globally recode using natural categories
         Use local suppression – check suppressed observations
         Estimate substantively interesting statistics from original
          and modified data as a check
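The recode-then-check workflow above can be sketched in a few lines of Python. This is illustrative only: the field names and the ten-year age bands are hypothetical, and tools such as sdcMicro implement global recoding and suppression properly.

```python
from collections import Counter

def recode_age(age):
    # Global recoding: replace exact age with a natural ten-year band.
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def min_group_size(records, quasi_identifiers):
    # k for a k-anonymity check: the size of the smallest equivalence
    # class formed by the quasi-identifier values.
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

records = [
    {"age": recode_age(34), "zip3": "021", "income": 52000},
    {"age": recode_age(37), "zip3": "021", "income": 61000},
    {"age": recode_age(36), "zip3": "021", "income": 58000},
]
print(min_group_size(records, ["age", "zip3"]))  # 3: all share ("30-39", "021")
```

Estimating the same substantively interesting statistics (here, e.g., mean income) from the original and the recoded data then gives a direct check on information loss.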
Key Concepts Review

     Text de-identification

    License and access control restrictions
     k-anonymity
    Suppression
    Attribute homogeneity
    Risk/utility tradeoff
Checklist
 Will license be used to limit disclosure?
 Will enclave or remote access limit disclosure?
 Are there natural categories for global recoding?
 Is there a natural measure of information loss, or
  natural weighting for importance of variables?
 What level of reidentification risk is acceptable?
 What is the expected background knowledge of the
  attacker?
Available Software

     De-identification of text
         Regular expression, lookup tables, template matching
          [www.physionet.org/physiotools/deid]
   De-identification of IP addresses and system/network logs
    [www.caida.org/tools/taxonomy/anonymization.xml]
  Interactive Privacy
         PINQ – Experimental interactive differential privacy engine
          [research.microsoft.com/en-us/projects/PINQ/]
    Tabular Data – Tau Argus
         Cell suppression, controlled rounding
          [neon.vb.cbs.nl/casc]
    Microdata
         Mu-Argus
             Microaggregation, local suppression, global recoding, PRAM
              [neon.vb.cbs.nl/casc]
         SDCmicro
             Microaggregation, local suppression, global recoding, PRAM, rank swapping
             Heuristic k-anonymity (using local suppression)
             R module
              [cran.r-project.org/web/packages/sdcMicro]
         NISS Data Swapping Toolkit (DSTK)
             Data swapping in risk/utility framework
             Implemented in Java
          [nisla05.niss.org/software/dstk.html]
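For context, the mechanism underlying interactive engines like PINQ can be sketched as a Laplace-mechanism count query. This is an illustrative Python sketch of the idea only, not PINQ's actual (C#/LINQ) API:

```python
import math
import random

def laplace_noise(scale):
    # Sample Laplace(0, scale) by inverse-CDF transform of a uniform draw.
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def dp_count(values, predicate, epsilon):
    # Epsilon-differentially private count: a count query has
    # sensitivity 1, so Laplace noise with scale 1/epsilon suffices.
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon)
```

An engine like PINQ also tracks the cumulative privacy budget across queries, which this single-query sketch omits.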
Resources
    FCSM, 2005. “Report on Statistical Disclosure Limitation Methodology”, FCSM Statistical Working Paper Series
     [www.fcsm.gov/working-papers/spwp22.html]
    L. Willenborg, T. de Waal, 2001. Elements of Statistical Disclosure Control, Springer.
    ICPSR Human Subjects Protection Project Citation Database
     [ www.icpsr.umich.edu/HSP/citations]
    A. Hundepool, et al. 2009, Handbook of Statistical Disclosure Control, ESSNET
     [neon.vb.cbs.nl/casc/..%5Ccasc%5Chandbook.htm]
    Privacy in Statistical Databases Conference Series
     [unescoprivacychair.urv.cat/psd2010/]
     (See Springer’s Lecture Notes in Computer Science series for previous proceedings volumes)
    ASA Committee on Privacy and Confidentiality Website
     [ www.amstat.org/committees/pc ]
    National Academies Press, Information Security Book Series
     [www.nap.edu/topics.php?topic=320]
    National Institute of Statistical Sciences, Technical Reports
     [www.niss.org/publications/technical-reports]
    Transactions on Data Privacy, IIIA-CSIC [Journal]
     [ www.tdp.cat ]
    Journal of Official Statistics, Statistics Sweden:
     [www.jos.nu]
    Journal of Privacy and Confidentiality, Carnegie-Mellon
     [jpc.cylab.cmu.edu]
    IEEE Security and Privacy
     [www.computer.org/security]
    Census Statistical Disclosure Control checklist
     [www.census.gov/srd/sdc]
    B. C.M. Fung, K. Wang, R. Chen, P.S. Yu, 2010, Privacy Preserving Data Publishing: A Survey of Recent
     Developments, ACM CSUR 42(4)
Additional Resources
         Final review
         Additional training resources
         Harvard Consulting
         Handout for Harvard staff
         Harvard IQSS Research Support
         Additional references
Final Review: 7 Steps
    Identify potentially sensitive information in planning
         Identify legal requirements, institutional requirements, data use agreements
         Consider obtaining a certificate of confidentiality
         Plan for IRB review
    Reduce sensitivity of collected data in design
    Separate sensitive information in collection
    Encrypt sensitive information in transit
    Desensitize information in processing
         Removing names and other direct identifiers
         Suppressing, aggregating, or perturbing indirect identifiers
    Protect sensitive information in systems
         Use systems that are controlled, securely configured, and audited
         Ensure people are authenticated, authorized, licensed
    Review sensitive information before dissemination
         Review disclosure risk
         Apply non-statistical disclosure limitation
         Apply statistical disclosure limitation
         Review past releases and publicly available data
         Check for changes in the law
         Require a use agreement
Preliminary Recommendation: Choose the Lesser of Three Evils
    (1) Use only information that has already been made
     public, is entirely innocuous, or has been declared
     legally deidentified; or
    (2) Obtain informed consent from research subjects,
     at the time of data collection, that includes
     acceptance of the potential risks of disclosure of
     personally identifiable information; or
    (3) Pay close attention to the technical requirements
     imposed by law:
         Use suppression and recoding to achieve k-anonymity
          with l-diversity on data before releasing it or generating
          detailed figures, maps, or summary tables.
         Supplement data sharing with data-use agreements.
Preliminary Recommendations
Planning and methods
    Review research design for sensitive identified information
         Information which would cause harm if disclosed
         HIPAA identifiers
         Other indirectly identifying characteristics
    Design research methods to reduce sensitivity
         Eliminate sensitive/identifying information not needed for research questions
         Consider randomized response, list experiment design
    Design human subjects plan with information management in mind
         Recognize benefits of data sharing
         Ask for consent to share data appropriately
         Apply for a certificate of confidentiality where data is very sensitive
    Separate sensitive information
         Separate sensitive/identifying information at collection, if feasible
         Link separate files using cryptographic hash of identifiers plus secret key; or cryptographic-
          strength random number
    Incorporate extra protections for on-line data collection
         Use vendor agreements that specify anonymity and confidentiality protections
         Do not collect IP addresses if possible; regularly anonymize and purge them otherwise
         Restrict display of very sensitive information in user interfaces
         Limit on-line collection of very sensitive information
         Harvard prohibits display/collection of HRCI online
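The file-linkage recommendation above can be sketched with a keyed hash (HMAC-SHA256). The key handling and identifier format here are hypothetical; the point is that the pseudonym is stable across files, but cannot be recomputed, or attacked by dictionary, without the secret key:

```python
import hashlib
import hmac
import secrets

# Secret linking key, generated once and stored separately from both
# data files (key management is outside the scope of this sketch).
LINK_KEY = secrets.token_bytes(32)

def link_id(direct_identifier):
    # Keyed hash of a direct identifier; serves as the linking pseudonym.
    return hmac.new(LINK_KEY, direct_identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()

# The same subject receives the same pseudonym in every file.
pseudonym = link_id("subject-00042")
```

A cryptographic-strength random number per subject, with the mapping table kept under separate access control, is the alternative mentioned above.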
Preliminary Recommendations
Information Security
    Use FISMA as a reference for baseline controls
    Document:
         Protection goals
         Threat models
         Types of controls
    Delegate implementation to IT professionals
    Refer to standards
         Gold standards: FISMA / ISO practices, SAS-70 Auditing, CISSP certification of key staff
    Strongly recommended controls
         Use whole-disk/media encryption to protect data at rest
         Use end-to-end encryption to protect data in motion
         Use core information hygiene to protect systems
             Use a virus checker, and keep it updated
             Use a host-based firewall
             Update your software regularly
             Install all operating system and application security updates
             Don’t share accounts or passwords
              Don’t use administrative accounts all the time
              Don’t run programs from untrusted sources
              Don’t give out your password to anyone
         Scan for HRCI regularly
         Be thorough in disposal of information
             Use secure file erase tools when disposing of files
             Use secure disk erase tools when disposing/repurposing disks
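The “scan for HRCI regularly” control can be approximated with simple pattern matching. A minimal sketch, assuming regular expressions for SSN-like and card-number-like strings; production scanners use far broader rule sets and should be preferred:

```python
import re
from pathlib import Path

# Illustrative patterns only: a real HRCI scanner covers many more types.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12}\d{1,4}\b")  # 13-16 digits, optional separators

def contains_pii(text):
    # True if the text contains an SSN- or card-number-like string.
    return bool(SSN_RE.search(text) or CARD_RE.search(text))

def scan_tree(root):
    # Report files under `root` that appear to contain such strings.
    return [str(p) for p in Path(root).rglob("*")
            if p.is_file() and contains_pii(p.read_text(errors="ignore"))]
```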
Preliminary Recommendations
Very Sensitive/Extremely Sensitive Information Security
    Protect very sensitive data on “target systems”
         Extra physical, logical, administrative access control
             Record keeping
             Limitations
             Lockouts
         Extra monitoring, auditing
         Extra procedural controls – specific, renewed approvals
         Limits on network connectivity
             Private network, not directly connected to public network
    Regular scans
         Vulnerability scans
         Scans for PII
    Extremely sensitive
         Increased access control, procedural limitations
         Not physically/logically connected (even via wireless) to public
          network, directly or indirectly
Preliminary Recommendations
Non-Tabular Data Disclosure
    Use licensing agreements – even if they are “clickthroughs”
     Reason: They provide additional protection without limiting
     legitimate research.
    For qualitative text information
         Use software for the first pass
         Supplement with localized dictionary of place names, common last
          names, etc
         Have a human review results
Reason: Software is more effective than a single human coder.
  However, the error rate is high enough that human review is
  still necessary.
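A first pass along these lines might look like the sketch below. The patterns and dictionary entries are hypothetical, and dedicated tools such as the PhysioNet deid package are far more complete:

```python
import re

# Illustrative first-pass patterns; a real tool combines many more
# expressions, lookup tables, and templates.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}
# Locally supplemented dictionary of place names, common last names, etc.
LOCAL_DICTIONARY = {"Cambridge", "Altman"}

def deidentify(text):
    # Replace each pattern match, then each dictionary term, with a tag.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    for term in LOCAL_DICTIONARY:
        text = re.sub(rf"\b{re.escape(term)}\b", "[NAME/PLACE]", text)
    return text
```

The output still goes to a human reviewer, per the recommendation above.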
 For emerging forms of data (networks, etc.)
         Use remote access, and user authentication, if feasible
          Reason: Greater auditability to compensate for less well
          understood statistical de-identification.
         Pay careful attention to structure of data.
          Reason: Identifying information may be present in structure of
          information (word ordering, prose style, network topology, sparse
           matrix missingness) rather than in the primary attribute information.
Preliminary Recommendations
Tabular Data Disclosure
    Use licensing agreements – even if they are “clickthroughs”
     Reason: They provide additional protection without limiting legitimate research.
    Use HIPAA default variable suppression and recoding if, in the PI’s best judgment, this does not seriously
     degrade the research value of the data.
     Reason: Clearest legal standard
    For quantitative tabular data
         Use generalization, local suppression, variable suppression.
          Reason: These are effective, commonly used in HIPAA and in statistical disclosure control
         Use k-anonymity
          Reason: k-anonymity appears to be current good practice; provably eliminates literal individual re-identification; works if
          attacker has knowledge of sample participation
         Choose k in [3-5]
          Reason: Best practice at federal agencies for table suppression requires table cells to have 3-5 contributors. Tables
          derived from k-anonymous microdata will also fulfill this.
         Choose quasi-identifiers based on plausible threat models
          Reason: Too broad a definition of quasi-identifiers renders de-identification impossible. Background knowledge is
          pivotal, and threat model is the only source for this.
         Use micro-data anonymization, rather than tabular/model anonymization
           Reason: (1) Table/model methods become computationally intractable. (2) Analysis of model anonymization is immature.
           (3) Anonymizing microdata implies derived tables and models are also anonymized. (4) It is administratively harder to track
           and evaluate the entire history of previous models/tables than the history of previously released versions of a single microdata set.
         Use domain knowledge in choosing recodings and testing the resulting anonymization for information loss.
           Reason: MSE, etc., are probably not a good proxy for the research value of the data. Use standard measures, but also consider
          planned uses and simulate possible analyses.
          Inspect data for attribute diversity; use the PI’s judgment regarding suppression
          Reason: (1) Some attribute disclosures are not avoidable if research is to be conducted at all, some would occur even if
          subject had not participated. (2) Disclosures that would not have resulted if subject had opted out, and are not
          substantially based on representative causal/predictive relationships revealed by the research, should be eliminated. (3)
          All current diversity measures are likely to severely reduce the utility of the anonymized data if applied routinely.
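The attribute-diversity inspection can be automated as a simple l-diversity check (illustrative sketch with hypothetical field names): for each equivalence class over the quasi-identifiers, count the distinct sensitive values; l = 1 flags a homogeneous class and a potential attribute disclosure.

```python
from collections import defaultdict

def min_l_diversity(records, quasi_identifiers, sensitive):
    # Group records into equivalence classes over the quasi-identifiers,
    # then return the smallest number of distinct sensitive values in
    # any class (the dataset's l).
    classes = defaultdict(set)
    for r in records:
        classes[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min(len(values) for values in classes.values())

records = [
    {"age": "30-39", "zip3": "021", "diagnosis": "flu"},
    {"age": "30-39", "zip3": "021", "diagnosis": "asthma"},
    {"age": "40-49", "zip3": "021", "diagnosis": "flu"},
    {"age": "40-49", "zip3": "021", "diagnosis": "flu"},
]
print(min_l_diversity(records, ["age", "zip3"], "diagnosis"))  # 1: the 40-49 class is homogeneous
```

Whether a flagged class actually warrants suppression remains the PI's judgment call, per the caveats above.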
On-line training
    NIH Protecting Human Subject Research Participants
         Provides minimal testing and certification
         Required for human subjects research at NIH
                      [phrp.nihtraining.com]
    NIH Security and Privacy Awareness
         Includes basics of information security, review of privacy laws
                      [irtsectraining.nih.gov/]

    Harvard Staff Training
         Provides compact training for staff members in handling of confidential
          information
                       [www.security.harvard.edu/resources/training]

    Collaborative Institutional Training Initiative (CITI)
         Provides testing, certification, continuing education credits
         Required for human subjects research at Harvard
         Includes basic training on confidentiality, and informed consent
                       [https://www.citiprogram.org/]
Harvard IQSS Research Support
    IQSS supports your research design:
         Research design, including:
          design of surveys, selection of statistical methods.
    IQSS supports your research process:
         Primary and secondary data collection, including:
          the collection of geospatial and survey data.
         Data management, including:
          storage, cataloging, permanent archiving, and distribution.
         Data analysis, including:
          statistical consulting, GIS consulting, high performance research computing
    IQSS supports your projects
         Dissemination: web site hosting, scholars website
         Research computing infrastructure and hosting
         Conference/seminar/event planning and facilities

Strengthen your proposal through:
         Consultation on research design, statistical issues, GIS, research computing
         Including relevant resources in “facilities” etc.
         Obtaining IQSS letters of support
Additional References
    A. Acquisti, L. John, G. Loewenstein, 2009, "What is Privacy Worth?", 21st Workshop on Information Systems and Economics.
    A. Blum, K. Ligett, A Roth, 2008. “A Learning Theory Approach to Non-Interactive Database Privacy”, STOC’08
    L. Backstrom, C. Dwork, J. Kleinberg, 2007, Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns,
     and Structural Steganography. Proc. 16th Intl. World Wide Web Conference.
    J. Brickell and V. Shmatikov, 2008. The Cost of Privacy: Destruction of Data-Mining Utility in Anonymized Data Publishing
    P. Buneman, A. Chapman and J. Cheney, 2006,
     "Provenance Management in Curated Databases", in Proceedings of the 2006 ACM SIGMOD International Conference on Management
     of Data, (Chicago, IL: 2006), 539-550. http://portal.acm.org/citation.cfm?doid=1142473.1142534
    Calabrese F., Colonna M., Lovisolo P., Parata D., Ratti C., 2007, "Real-Time Urban Monitoring Using Cellular Phones: a
     Case-Study in Rome", Working paper # 1, SENSEable City Laboratory, MIT, Boston http://senseable.mit.edu/papers/
     [also see the Real Time Rome project, http://senseable.mit.edu/realtimerome/]
    Campbell, D. 2009, reported in D. Goodin, 2009, Amazon's EC2 brings new might to password cracking, The Register, Nov 2,
     2009, http://www.theregister.co.uk/2009/11/02/amazon_cloud_password_cracking/
    I. Dinur and K. Nissim. Revealing information while preserving privacy. Proceedings of the twenty-second ACM SIGMOD-
     SIGACT-SIGART Symposium on Principles of Database Systems, pages 202–210, 2003.
    C. Dwork, M Naor, O Reingold, G Rothblum, S Vadhan, 2009. When and How Can Data be Efficiently Released with Privacy,
     STOC 2009.
    C. Dwork, A. Smith, 2009. Differential Privacy for Statistics: What we know and what we want to learn, Journal of Privacy and
     Confidentiality 1(2): 135-54
    C. Dwork, 2008, Differential Privacy: A Survey of Results. TAMC 2008, LNCS 4978, Springer Verlag. 1-19
    C. Dwork. Differential privacy. Proc. ICALP, 2006.
    C. Dwork, F. McSherry, and K. Talwar. The price of privacy and the limits of LP decoding. Proceedings of the thirty-ninth
     annual ACM Symposium on Theory of Computing, pages 85–94, 2007.
    C. Dwork, F. McSherry, K. Nissim, and A. Smith, Calibrating Noise to Sensitivity in Private Data Analysis, Proceedings of the
     3rd IACR Theory of Cryptography Conference, 2006
    A. Desrosieres. 1998. The Politics of Large Numbers, Harvard U. Press.
    S.E. Fienberg, M.E. Martin, and M.L. Straf (eds.), 1985. Sharing Research Data, Washington, D.C.: National Academies
     Press.
    S. Fienberg, 2010. Towards a Bayesian Characterization of Privacy Protection & the Risk-Utility Tradeoff, IPAM--Data 2010
    B. C.M. Fung, K. Wang, R. Chen, P.S. Yu, 2010, Privacy Preserving Data Publishing: A Survey of Recent Developments,
     ACM CSUR 42(4)
    Greenwald, A. G., McGhee, D. E., Schwartz, J. L. K., 1998, "Measuring Individual Differences In Implicit Cognition: The Implicit
     Association Test", Journal of Personality and Social Psychology 74(6):1464-1480
    C. Herley, 2009, So Long and No Thanks for the Externalities: The Rational Rejection of Security Advice by Users; NSPW 09
    A. F. Karr, 2009. Statistical Analysis of Distributed Databases, Journal of Privacy and Confidentiality 1(2)
Additional References
    International Council For Science (ICSU) 2004. ICSU Report of the CSPR Assessment Panel on Scientific
     Data and Information. Report.
    J. Klump, et al., 2006. “Data publication in the open access initiative”, Data Science Journal Vol. 5 pp. 79-
     83.
    E.A. Kolek, D. Saunders, 2008. Online Disclosure: An Empirical Examination of Undergraduate Facebook
     Profiles, NASPA Journal 45 (1): 1-25
    N. Li, T. Li, and S. Venkatasubramanian. T-closeness: privacy beyond k-anonymity and l-diversity. In Pro-
     ceedings of the IEEE ICDE 2007, 2007.
    A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam, 2007, "l-Diversity: Privacy Beyond k-
     Anonymity" ACM Transactions on Knowledge Discovery from Data, 1(1): 1-52
    A. Meyerson, R. Williams, 2004. “On the complexity of Optimal K-Anonymity”, ACM Symposium on the
     Principles of Database Systems
    Nature 461, 145 (10 September 2009) | doi:10.1038/461145a
    A. Narayanan and V. Shmatikov, 2008, “Robust De-anonymization of Large Sparse Datasets” , Proc. of
     29th IEEE Symposium on Security and Privacy (Forthcoming)
    I. Neamatullah, et al., 2008, Automated de-identification of free-text medical records, BMC Medical
     Informatics and Decision Making 8:32
    J. Novak, P. Raghavan, A. Tomkins, 2004. Anti-aliasing on the Web, Proceedings of the 13th international
     conference on World Wide Web
    National Science Board (NSB), 2005, Long-Lived Digital Data Collections: Enabling Research and
     Education in the 21st Century, NSF. (NSB-05-40).
    A. Acquisti, R. Gross, 2009, “Predicting Social Security Numbers from Public Data”, PNAS 27(106): 10975–
     10980
    Sweeney, L., (2002) k-Anonymity: A Model for Protecting Privacy, International Journal on Uncertainty,
     Fuzziness, and Knowledge-based Systems, Vol. 10, No. 5, pp. 557 – 570.
    Truta T.M., Vinay B. (2006), Privacy Protection: p-Sensitive k-Anonymity Property, International Workshop
     on Privacy Data Management (PDM2006), in conjunction with the 22nd International Conference on Data
     Engineering (ICDE), Atlanta, Georgia.
    O. Uzuner, et al., 2007, “Evaluating the State-of-the-Art in Automatic De-identification”, Journal of the
     American Medical Informatics Association 14(5):550
    W. Wagner & R. Steinzor, 2006. Rescuing Science from Politics, Cambridge U. Press.
    Warner, S. 1965. Randomized response: A survey technique for eliminating evasive answer bias. Journal
     of the American Statistical Association 60(309):63–9.
    D.L. Zimmerman, C. Pavlik , 2008. "Quantifying the Effects of Mask Metadata, Disclosure and Multiple
     Releases on the Confidentiality of Geographically Masked Health Data", Geographical Analysis 40: 52-76
Creative Commons License
This work, Managing Confidential Information in Research, by Micah Altman (http://redistricting.info) is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

Digital-Transformation-Roadmap-for-Companies.pptx
Review of recent advances in non-invasive hemoglobin estimation
Approach and Philosophy of On baking technology
Encapsulation theory and applications.pdf
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
The Rise and Fall of 3GPP – Time for a Sabbatical?

Managing Confidential Information in Research

  • 1. Managing Confidential Information in Research Micah Altman Senior Research Scientist Institute for Quantitative Social Science Harvard University
  • 4. Personally identifiable private information is surprisingly common
     Includes information from a variety of sources, such as…
       Research data, even if you aren't the original collector
       Student "records" such as e-mail, grades
       Logs from web-servers, other systems
     Lots of things are potentially identifying:
       Under some federal laws: IP addresses, dates, zipcodes, …
       Birth date + zipcode + gender uniquely identify ~87% of people in the U.S. [Sweeney 2002]
       With date and place of birth, can guess the first five digits of a social security number (SSN) > 60% of the time. (Can guess the whole thing in under 10 tries, for a significant minority of people.) [Acquisti & Gross 2009]
       Analysis of writing style or eclectic tastes has been used to identify individuals [Brownstein, et al., 2006, NEJM 355(16)]
     Tables, graphs and maps can also reveal identifiable information
    [Micah Altman, 3/10/2011]
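The birth date + zipcode + gender result cited above can be checked mechanically on any dataset: count how many records are unique on the chosen combination of quasi-identifiers. A minimal sketch, where the `uniqueness_rate` helper, field names, and toy records are illustrative assumptions rather than anything from the slides:

```python
from collections import Counter

def uniqueness_rate(records, quasi_ids):
    """Fraction of records whose quasi-identifier combination is unique
    (i.e., re-identifiable by anyone who knows those attributes)."""
    combos = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    unique = sum(1 for r in records
                 if combos[tuple(r[q] for q in quasi_ids)] == 1)
    return unique / len(records)

# Toy data: even with names removed, half of these records are unique
# on (birthdate, zipcode, gender) alone.
people = [
    {"birthdate": "1961-01-01", "zipcode": "02145", "gender": "M"},
    {"birthdate": "1961-02-02", "zipcode": "02138", "gender": "M"},
    {"birthdate": "1972-03-25", "zipcode": "94041", "gender": "F"},
    {"birthdate": "1972-03-25", "zipcode": "94041", "gender": "F"},
]
print(uniqueness_rate(people, ["birthdate", "zipcode", "gender"]))  # 0.5
```

The same count underlies k-anonymity-style checks: a release is safer when every quasi-identifier combination is shared by several records.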
  • 6. IQSS (and affiliates) offer you support across all stages of your quantitative research:
     Research design, including: design of surveys, selection of statistical methods.
     Primary and secondary data collection, including: the collection of geospatial and survey data.
     Data management, including: storage, cataloging, permanent archiving, and distribution.
     Data analysis, including: statistical consulting, GIS consulting, high performance research computing.
    http://iq.harvard.edu/
  • 7. The IQSS grants administration team helps with every aspect of the grant process. Contact us when you are planning your proposal.  Assisting in identifying research funding opportunities  Consulting on writing proposals  Assisting IQSS affiliates with:  preparation, review and submission of all grant applications (“pre-award support”)  management of their sponsored research portfolio (“post-award support”)  Interpret sponsor policies  Coordinate with FAS Research Administration and the Central Office for Sponsored Programs … And, of course, support seminars like this! 7 [Micah Altman, 3/10/2011]
  • 8. Goals for course  Overview of key areas  Identify key concepts & issues  Summarize Harvard policies, procedures, resources  Establish framework for action  Provide connection to resources, literature 8 [Micah Altman, 3/10/2011]
  • 9. Outline
     [Preliminaries]
     Law, policy, ethics
     Research methods, design, management
     Information Security (Storage, Transmission, Use)
     Disclosure Limitation
     [Additional Resources & Summary of Recommendations]
  • 10. Steps to Manage Confidential Research Data
     Identify potentially sensitive information in planning
       Identify legal requirements, institutional requirements, data use agreements
       Consider obtaining a certificate of confidentiality
       Plan for IRB review
     Reduce sensitivity of collected data in design
     Separate sensitive information in collection
     Encrypt sensitive information in transit
     Desensitize information in processing
       Removing names and other direct identifiers
       Suppressing, aggregating, or perturbing indirect identifiers
     Protect sensitive information in systems
       Use systems that are controlled, securely configured, and audited
       Ensure people are authenticated, authorized, licensed
     Review sensitive information before dissemination
       Review disclosure risk
       Apply non-statistical disclosure limitation
       Apply statistical disclosure limitation
       Review past releases and publicly available data
       Check for changes in the law
       Require a use agreement
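The "desensitize information in processing" step, dropping direct identifiers and then coarsening indirect ones, can be sketched as follows. The `desensitize` helper and its field names are illustrative assumptions, not a Harvard or HIPAA-mandated procedure:

```python
def desensitize(record, direct_ids=("name", "ssn")):
    """Drop direct identifiers and coarsen two common indirect identifiers.
    Field names here are illustrative, not a fixed schema."""
    out = {k: v for k, v in record.items() if k not in direct_ids}
    if "birthdate" in out:                  # aggregate date to year only
        out["birth_year"] = out.pop("birthdate")[:4]
    if "zipcode" in out:                    # aggregate zip to 3-digit prefix
        out["zip3"] = out.pop("zipcode")[:3]
    return out

row = {"name": "A. Jones", "ssn": "12341", "birthdate": "1961-01-01",
       "zipcode": "02145", "gender": "M"}
print(desensitize(row))
# {'gender': 'M', 'birth_year': '1961', 'zip3': '021'}
```

Note that this only reduces, not eliminates, disclosure risk; the remaining quasi-identifiers still need the disclosure review steps listed above.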
  • 12. Law, Policy & Ethics
     Ethical Obligations
     Laws
     Fun and games
     Harvard Policies
     [Summary]
  • 13. Confidentiality & Research Ethics
     Belmont Principles
     Respect for Persons
       individuals should be treated as autonomous agents
       persons with diminished autonomy are entitled to protection
       implies "informed consent"
       implies respect for confidentiality and privacy
     Beneficence
       research must have individual and/or societal benefit to justify risks
       implies minimizing risk/benefit ratio
  • 14. Scientific & Societal Benefits of Data Sharing
     Increases replicability of research
       Journal publication policies may apply
     Increases scientific impact of research
       Follow-up studies
       Extensions
       Citations
     Public interest in data produced by a public funder
       Funder policies may apply
     Public interest in data that supports public policy
       FOIA and state FOI laws may apply
     Open data facilitates…
       Transparent government
       Scientific collaboration
       Scientific verification
       New forms of science
       Participation in science
       Hands-on education
       Continuity of research
    Sources: Fienberg et al. 1985; ICSU 2004; Nature 2009
  • 15. Sources of Confidentiality Restrictions for University Research Data
     Overlapping laws
     Different laws apply to different cases
     All affiliates subject to university policy
    (Not included: EU directive, foreign laws, classified data, …)
  • 16. 45 CFR 46 [Overview] – "The Common Rule"
     Governs human subject research
       With federal funds / at a federal institution
     Establishes rules for conduct of research
     Establishes confidentiality and consent requirements for identified private data
       However, some information may be required to be disclosed under state and federal laws (e.g. in cases of child abuse)
     Delegates procedural decisions to Institutional Review Boards (IRBs)
  • 17. HIPAA [Overview] – Health Insurance Portability and Accountability Act
     Protects personal health care information for "covered entities"
     Detailed technical protection requirements
     Provides clearest legal standards for dissemination
       Provides a "safe harbor"
       Has become an accepted practice for dissemination in other areas where laws are less clear
     HITECH Act of 2009 extends HIPAA
       Extends coverage to associated entities of covered entities
       Additional technical safeguards
       Adds breach reporting requirement
    HIPAA provides three dissemination options …
  • 18. Dissemination under HIPAA [option 1]
     "safe harbor" -- remove 18 identifiers
     [Personal identifiers]
       Names
       Social Security #s; personal account #s; certificate/license #s; full face photos (and comparable images); biometric IDs; medical record #s
       Any other unique identifying number, characteristic, or code
     [Asset identifiers]
       Fax #s; phone #s; vehicle #s
       Personal URLs; IP addresses; e-mail addresses
       Device IDs and serial numbers
     [Quasi identifiers]
       Dates smaller than a year (and ages > 89 collapsed into one category)
       Geographic subdivisions smaller than a state (except for the first 3 digits of zipcode, if the unit > 20,000 people)
    And
     The entity does not have actual knowledge [direct and clear awareness] that it would be possible to use the remaining information alone or in combination with other information to identify the subject
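Two of the quasi-identifier rules above (zipcodes cut to 3 digits, with small-population prefixes suppressed, and ages over 89 collapsed into one category) can be sketched in code. This is a partial illustration only; the `RESTRICTED_ZIP3` set below is an illustrative subset, not the authoritative census-derived list of prefixes covering 20,000 or fewer people:

```python
# Illustrative subset of 3-digit zip prefixes treated as too small to release.
RESTRICTED_ZIP3 = {"036", "692", "878"}

def safe_harbor_zip(zipcode):
    """Keep only the first 3 digits; zero out restricted small-population prefixes."""
    z3 = zipcode[:3]
    return "000" if z3 in RESTRICTED_ZIP3 else z3

def safe_harbor_age(age):
    """Collapse ages over 89 into a single top-coded category."""
    return "90+" if age > 89 else str(age)

print(safe_harbor_zip("03601"), safe_harbor_age(93))  # 000 90+
print(safe_harbor_zip("02145"), safe_harbor_age(45))  # 021 45
```

A full safe-harbor pass would also drop every personal and asset identifier listed above and reduce all dates to years.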
  • 19. Dissemination under HIPAA [option 2]
     "limited dataset" – leave some quasi-ids
       Remove personal and asset identifiers
       Permitted dates: dates of birth, death, service, years
       Permitted geographic subdivisions: town, city, state, zip code
    And
     Require access control and data use agreement.
  • 20. Dissemination under HIPAA [option 3]
     "qualified statistician" – statistical determination
    Have a qualified statistician determine, using generally accepted statistical and scientific principles and methods, that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by the anticipated recipient to identify the subject of the information.
     Important caveats
       Methods and results of the analysis must be documented
       No bright line for "qualified"; text of rule is: "a person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable." [Section 164.514(b)(1)]
       No clear definitions for "generally accepted" or "very small" or "reasonably available information" … however, there are references in the federal register to statistical publications to be used as "starting points"
  • 21. FERPA – Family Educational Rights and Privacy Act
     Applies to schools that receive federal (D.O.E.) funding
     Restricts use of student (not employee) information
     Establishes
       Right to privacy of educational records
       Right to inspect and correct records (with appeal to Federal government)
       Definition of public "directory" information
       Right to block access to public "directory" information, and to other records
     Educational records include:
       Identified information about student
       Maintained by institution
       Not …
         Employee records
         Some medical and law-enforcement records
         Records solely in the possession of and for use by the creator (e.g. unpublished instructor notes)
     Personally identifiable information includes:
       Direct identifiers
       Indirect (quasi) identifiers
       Indirectly linkable identifiers
       "Information requested by a person who the educational agency or institution reasonably believes knows the identity of the student to whom the education record relates."
  • 22. MA 201 CMR 17 – Standards for the Protection of Personal Information
     Strongest U.S. general privacy protection law
     Has been delayed/modified repeatedly
     Requires reporting of breaches
       If data is not encrypted
       Or encryption key is released in conjunction with data
     Requires specific technical protections:
       Firewalls
       Encryption of data transmitted in public
       Anti-virus software
       Software updates
  • 23. Inconsistencies in Requirements and Definitions
     Inconsistent definitions of "personally identifiable"
     Inconsistent definitions of sensitive information
     Requirements to de-identify don't always jibe with statistical realities
    FERPA – Coverage: students in educational institutions. Identification criteria: direct, indirect, linked. Sensitivity criteria: any non-directory information. Management requirements: directory opt-out; [implied] good practice.
    HIPAA – Coverage: medical information in "covered entities". Identification criteria: direct, indirect, linked. Sensitivity criteria: any medical information. Management requirements: consent; specific technical safeguards; breach reporting.
    Common Rule – Coverage: living persons in research by funded institutions. Identification criteria: direct, indirect. Sensitivity criteria: private information, based on harm. Management requirements: consent; [implied] risk minimization.
    MA 201 CMR 17 – Coverage: Mass. residents. Identification criteria: direct, linked, "bad intent" (!). Sensitivity criteria: financial, state, federal identifiers. Management requirements: specific technical safeguards; breach reporting.
  • 24. Third Party Requirements
     Licensing requirements
     Intellectual property requirements
     Federal/state law and/or policy requirements
       State protection of personal information laws
       Freedom of information laws (FOIA & state FOI)
       State mandatory abuse/neglect notification laws
     And … think ahead to publisher requirements
       Replication requirements
       IP requirements
     Examples
       NSF requires data from funded research be shared
       NIH requires a data sharing plan for large projects
       Wellcome Trust requires a data sharing plan
       Many leading journals require data sharing
  • 25. (Some) More Laws & Standards
     California Laws
       Lots of rules
       Applies to any data about California residents
       Privacy policy
       Disclosure
       Reporting policy
     Sarbanes-Oxley (aka SOX, aka SARBOX)
       Corporate and Auditing Accountability and Responsibility Act of 2002
       Detailed technical controls over information systems
       Applies to U.S. public company boards, management and public accounting firms
       Rarely applies to research in universities
       Section 404 requires annual assessment of organizational internal controls – but does not specify details of controls
     EU Directive 95/46/EC
       Data protection directive
       Provides for notice, limits on purpose of use, consent, security, disclosure, access, accountability
       Forbids transfer of data to entities in countries not compliant with directive
       U.S. is not compliant but …
         Organizations can certify compliance with FTC
         No auditing/enforcement!
         Substantial criticism of this arrangement
     Classified Data
       Separate and complex rules and requirements
       The University does not accept classified data
       But, may have "Controlled But Unclassified"
         Vaguely defined area
         Mostly government produced
         Penalties unclear
     Payment Card Industry (PCI) Security Standards
       Governs treatment of credit card numbers
       Requires reports, audits, fines
       Detailed technical measures
       Not a law, but helps define good practice
       Large penalties
       Nevada law mandates PCI standards
     And … export controlled information, under ITAR and EAR
       Export controls include technologies, software; documentation/design documents may be included
     FISMA
       Federal Information Security Management Act (FISMA), Public Law (P.L.) 107-347
       Is starting to be applied to NIH sponsored research
     … and over 1100 International Human Subjects laws …
  • 26. Predicted Legal Changes for 2011 …
     Caselaw
       "Personal privacy" does not apply to information about corporations (a corporation is not a "person" for this purpose) – FCC v. AT&T (2011)
     Scheduled
       EU "cookie privacy" directive 2009/136/EC goes into effect
       Proposed updates to EU information privacy directives
     Very likely
       New information privacy laws in selected states in 2011
     Likely
       Increased federal regulation of internet privacy
  • 27. What's wrong with this picture?
    Name      SSN    Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
    A. Jones  12341  01011961   02145    M       Raspberry           0
    B. Jones  12342  02021961   02138    M       Pistachio           0
    C. Jones  12343  11111972   94043    M       Chocolate           0
    D. Jones  12344  12121972   94043    M       Hazelnut            0
    E. Jones  12345  03251972   94041    F       Lemon               0
    F. Jones  12346  03251972   02127    F       Lemon               1
    G. Jones  12347  08081989   02138    F       Peach               1
    H. Smith  12348  01011973   63200    F       Lime                2
    I. Smith  12349  02021973   63300    M       Mango               4
    J. Smith  12350  02021973   63400    M       Coconut             16
    K. Smith  12351  03031974   64500    M       Frog                32
    L. Smith  12352  04041974   64600    M       Vanilla             64
    M. Smith  12353  04041974   64700    F       Pumpkin             128
    N. Smith  12354  04041974   64800    F       Allergic            256
  • 28. What's wrong with this picture? [annotated]
    Column roles (from the slide): Name – identifier; SSN – sensitive identifier; Birthdate, Zipcode, Gender – identifiers; Favorite Ice Cream – private; # of crimes committed – sensitive private.
    Name      SSN    Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
    A. Jones  12341  01011961   02145    M       Raspberry           0
    B. Jones  12342  02021961   02138    M       Pistachio           0
    C. Jones  12343  11111972   94043    M       Chocolate           0
    D. Jones  12344  12121972   94043    M       Hazelnut            0
    E. Jones  12345  03251972   94041    F       Lemon               0
    F. Jones  12346  03251972   02127    F       Lemon               1
    G. Jones  12347  08081989   02138    F       Peach               1
    H. Smith  12348  01011973   63200    F       Lime                2
    I. Smith  12349  02021973   63300    M       Mango               4
    J. Smith  12350  02021973   63400    M       Coconut             16
    K. Smith  12351  03031974   64500    M       Frog                32
    L. Smith  12352  04041974   64600    M       Vanilla             64
    M. Smith  12353  04041974   64700    F       Pumpkin             128
    N. Smith  12354  04041974   64800    F       Allergic            256
    Margin annotations on the slide: "Mass resident"; "Californian"; "Twins, separated at birth?"; "FERPA too?"; "Unexpected Response?"
  • 31. Harvard: Enterprise Security Policy (HEISP)
     Storing High Risk Confidential Information (HRCI)
       Must not be stored on individual user computer or portable storage device
       Must be stored on "target computers" or secure locked containers
     Confidential information on Harvard computing devices
       Confidential information must be protected
       Confidential information on portable devices must be encrypted
       Laptops must have encryption (some schools require whole-disk encryption)
       Systems must be scanned annually
       Cannot save confidential information on computers directly accessible from the internet or open Harvard networks
     All confidential information must be encrypted when transported across any network
     Human subject information
       All research on human subjects must be approved by the IRB
       All proposals must include a data management plan
     Personally identifiable medical information (PIMI)
       "Covered entities" at Harvard are subject to HIPAA requirements
       PIMI is to be treated as HRCI throughout the university
     Obtaining confidential information requires approval
     Employees who have access must annually agree to confidentiality agreements
     Access to lists and databases of Harvard University ID numbers is restricted
     Public directories must adhere to privacy preferences established by the individuals
       Registrars have developed a common definition of FERPA directory information
       Must adhere to student requests to block their directory information, per FERPA
     Accepting Payment Cards – restricted to procedures outlined in HU Credit Card Merchant Handbook
     Each school must provide training
     Identifying Users with Access to Confidential Information
       System owners must be able to identify users that have access to confidential information
       Strong passwords
       No account/password sharing
       Inhibit password guessing with logging and lockouts
       Limit application availability time with timeouts
       Limit user access to confidential information based on business need
    [More on next page…]
  • 32. HEISP – Part 2
     Physical Environment
       All digital/non-digital media must be properly protected
       Computers must be physically secure
       Automatic logging must be consistent with written policies
     Vendor contracts
       Require approval by security officer
       Include OGC contract rider
     Computer operator
       Computer must be regularly updated
       Operated securely
       Only necessary applications installed
       Annually certify compliance with university policies
     Computer setup – must filter malicious traffic
     "Target" systems and controllers
       Private address space; locally firewalled
       Annual vulnerability scanning
     Network take down
       Network managers run vulnerability scans
       May take computers off the network
     Service Resumption
       Must have a service resumption plan if loss of confidential data is a substantial business risk
     Incident Response Policy
       Disposition and destruction of records
       Acquisition/use by unauthorized persons must be reported to OGC
       Interacting with legal authorities – always refer to OGC unless imminent health/safety risk requires otherwise
     Web based surveys must have protections in place
  • 33. Harvard: Research Data Security Policy (HRDSP)
     Sensitivity of research data based on potential harm if disclosed:
       Level 5 = "extremely sensitive"
       Level 4 = "very sensitive" ~= HRCI
       Level 3 = "sensitive" ~= HCI
       Level 2 = "benign" ~= good computer hygiene
       Level 1 = anonymous and not business confidential
     Required protections based on sensitivity
       Level 5: entirely disconnected from network ("bubble security")
       Level 4: protections as per HRCI
       Level 3: protections as per HCI
       Level 2: good computer hygiene
     Designates procedures for treatment of external data use agreements [next section]
       Legally binding
       Can be both very detailed and not supported by Harvard security procedures
       Investigators should not sign these – forward to OSP
     Designates responsibilities for IRB, Investigator, OSP, IT, Security Officers
    security.harvard.edu/research-data-security-policy
  • 34. Harvard: Researcher Responsibilities
     … for knowing the rules
     … for identifying potentially confidential information in all forms (digital/analogue; on-line/off-line)
     … for notifying recipients of their responsibility to protect confidentiality
     … for obtaining IRB approval for any human subjects research
     … for following an IRB approved plan
     … for obtaining OSP approval of restricted data use agreements with providers, even if no money is involved
    … and for proper
     Storage
     Access
     Transmission
     Disposal
    Confidentiality is not an "IT problem"
  • 35. Harvard: Staff – Personnel Manual
     Protect Harvard information and systems
     Keep your own information in PeopleSoft up to date
     Comply with copyrights and DMCA
     Comply with Harvard systems policies and procedures
     All information produced at work is Harvard property
     Attach only approved devices to the Harvard network
    harvie.harvard.edu/docroot/standalone/Policies_Contracts/Staff_Personnel_Manual/Section2/Privacy.shtml
  • 36. Key Concepts & Issues Review
     Privacy – control over extent and circumstances of sharing
     Confidentiality – treatment of private, sensitive information
     Sensitive information – information that would cause harm if disclosed and linked to an individual
     Personally/individually identifiable information – private information directly or indirectly linked to an identifiable individual
     Human subject – a living person …
       who is interacted with to obtain research data
       whose private identifiable information is included in research data
     Research – systematic investigation designed to develop or contribute to generalizable knowledge
     "Common Rule" – law governing funded human subjects research
     HIPAA – law governing use of personal health information in covered and associated entities
     MA 201 CMR 17 – law governing use of certain personal identifiers for Massachusetts residents
  • 37. Checklist: Identify Requirements
    Check if research includes …
     Interaction with humans  Common Rule & HEISP/HRDSP apply
    Check if data used includes identified …
     Student records  FERPA & HEISP/HRDSP apply
     State, federal, financial IDs  state law & HEISP/HRDSP apply
     Medical/health information  HIPAA (likely) & HEISP/HRDSP apply
     Human subjects & private info  Common Rule & HEISP/HRDSP apply
    Check for other requirements/restrictions on data dissemination:
     Data provider restrictions and University approvals thereof
     Open data requirements and norms
     University information policy
  • 38. Resources
     E.A. Bankert & R.J. Amdur, 2006, Institutional Review Board: Management and Function, Jones and Bartlett Publishers
     P. Ohm, "Broken Promises of Privacy", SSRN Working Paper [ssrn.com/abstract=1450006]
     D.J. Mazur, 2007, Evaluating the Science and Ethics of Research on Humans, Johns Hopkins University Press
     IRB: Ethics & Human Research [journal], Hastings Press – www.thehastingscenter.org/Publications/IRB/
     Journal of Empirical Research on Human Research Ethics, University of California Press – ucpressjournals.com/journal.asp?j=jer
     201 CMR 17 text – www.mass.gov/Eoca/docs/idtheft/201CMR17amended.pdf
     FERPA website – www.ed.gov/policy/gen/guid/fpco/ferpa/index.html
     HIPAA website – www.hhs.gov/ocr/privacy/
     Common Rule website – www.hhs.gov/ohrp/humansubjects/guidance/45cfr46.htm
     State laws – www.ncsl.org/Default.aspx?TabId=13489
     Harvard Enterprise Information Security Policy / Research Data Security Policy – www.security.harvard.edu
     Harvard Institutional Review Board – www.fas.harvard.edu/~research/hum_sub/
     Harvard FAS Policies and Procedures – www.fas-it.fas.harvard.edu/services/catalog/browse/39
     IQSS Policies and Procedures – support.hmdc.harvard.edu/kb-930/hmdc_policies
  • 39. Research Design, Methods, Management
     Reducing risk
       Sensitivity of information
       Partitioning
       Decreasing identification
     Managing confidentiality and dissemination
     [Summary]
  • 40. Trade-offs
     Anonymity vs. research utility
     Sensitivity vs. research utility
     (Anonymity * Sensitivity) vs. research costs/efforts
  • 41. Types of Sensitive Information
     Information is sensitive if, once disclosed, there is a "significant" likelihood of harm
     IRB literature suggests possible categories of harm:
       loss of insurability
       loss of employability
       criminal liability
       psychological harm
       social harm to a vulnerable group
       reputational harm
       emotional harm
       dignitary harm
       physical harm: risk of disease, injury, or death
  • 42. Levels of Sensitivity
     No widely accepted scale
     Publicly available data not sensitive under "common rule"
     Common rule anchors scale at "minimal risk": "if disclosed, the probability and magnitude of harm or discomfort anticipated are not greater in and of themselves than those ordinarily encountered in daily life or during the performance of routine physical or psychological examinations or tests"
     Harvard Research Data Security Policy
       Level 5 – Extremely sensitive information about individually identifiable people. Information that if exposed poses significant risk of serious harm. Includes information posing serious risk of criminal liability, serious psychological harm or other significant injury, loss of insurability or employability, or significant social harm to an individual or group.
       Level 4 – Very sensitive information about individually identifiable people. Information that if exposed poses a non-minimal risk of moderate harm. Includes civil liability, moderate psychological harm, or material social harm to individuals or groups; medical records not classified as Level 5; sensitive-but-unclassified national security information; and financial identifiers (as per HRCI standards).
       Level 3 – Sensitive information about individually identifiable people. Information that if disclosed poses a significant risk of minor harm. Includes information that would reasonably be expected to damage reputation or cause embarrassment; and FERPA records.
       Level 2 – Benign information about individually identifiable people. Information that would not be considered harmful, but as to which a subject has been promised confidentiality.
       Level 1 – De-identified information about people, and information not about people
  • 43. IRB Review Scope
     IRB approval needed for all:
       federally-funded research;
       or any research involving "human subjects" at (almost all) institutions receiving federal funding (any organization operating under a general "federal-wide assurance")
     All human subjects research at Harvard
     Human subject: individual about whom an investigator (whether professional or student) conducting research obtains
       (1) Data through intervention or interaction with a living individual, or
       (2) Identifiable private information about living individuals
     See www.hhs.gov/ohrp/
  • 44. Research Not Requiring IRB Approval
     Non-research: not systematic inquiry designed to produce generalizable knowledge
     Non-funded: institution receives no federal funds for research
     Not human subject:
       No living people described
       Observation only AND no private identifiable information is obtained
     Human subjects, but "exempt" under 45 CFR 46
       use of existing, publicly-available data
       use of existing non-public data, if individuals cannot be directly or indirectly identified
       research conducted in educational settings, involving normal educational practices
       taste & food quality evaluation
       federal program evaluation approved by agency head
       observational, survey, test & interview research on public officials and candidates (in their formal capacity, or not identified)
     Caution: not all "exempt" is exempt …
       Some research on prisoners and children is not exemptable
       Some universities require review of "exempt" research
       Harvard requires review of all human subject research
     See: www.hhs.gov/ohrp/humansubjects/guidance/decisioncharts.htm
• 45. IRBs and Confidential Information
- IRBs review consent procedures and documentation
- IRBs may review data management plans:
  - may require procedures to minimize the risk of disclosure
  - may require procedures to minimize the harm resulting from disclosure
- IRBs determine the sensitivity of information, i.e., the potential harm resulting from disclosure
- IRBs determine whether data is de-identified for "public use" [see NHRPAC, "Recommendations on Public Use Data Files"]
• 46. Harvard IRB Approval
- The Harvard Institutional Review Board (IRB) must approve all human subjects research at Harvard prior to data collection or use
- Research involves human subjects if:
  - there is any interaction or intervention with living humans; or
  - identifiable private data about living humans is used
- Some examples of human subjects research in the social sciences:
  - surveys
  - behavioral experiments
  - educational tests and evaluations
  - analysis of identified private data collected from people (your e-mail inbox, logs of web-browsing activity, Facebook activity, eBay bids…)
- The IRB will:
  - assess the research protocol
  - identify whether the research is exempt from further review and management
  - identify the sensitivity level of the data
• 48. HRDSP Responsibilities
- Researchers are responsible for disclosing to the IRB, and for following the IRB-approved plan
- The IRB is responsible for ensuring the adequacy of investigators' plans, and for granting (lawful) variances from security requirements where justified by research needs
- IT is responsible for assisting with the identification of the security level, and for assisting in the implementation of security protections
- The Security Officer/CIO may review IT facilities and approve (give a written designation) that they meet the protections for a given level
• 49. Valuation of Private Information Is Uncertain
- Privacy valuations are often inconsistent:
  - framing effects: ordering, endowment effect, possibly others
  - non-normal/non-uniform distribution of valuations
  - one study: fewer than 10% of subjects would give up $2 of a $12 gift card to buy anonymity of purchases [Acquisti & Loewenstein 2009]
- The cost-benefit of information security may not be optimal for users [Herley 2009]:
  - e.g., the loss from all phishing attacks is 100x less than the time spent avoiding them
  - note, however, weaknesses in this analysis:
    - only loss of time is modeled; no valuation of privacy is made
    - institutional costs are not included, only personal costs
    - very simplified model, not calibrated through surveys, etc.
- Repeated surveys of students show they tend to disclose a lot, e.g.:
  - over 80% of students sampled in several studies had public Facebook pages with birthdays, home town, and other private information
  - this information can easily be used to link to other databases!
  - disclosure of extensive information on sexual orientation, private cell numbers, drinking habits, etc. is not uncommon [see Kolek & Saunders 2008]
- Emerging markets for privacy?
  - micropayments for disclosures
  - http://www.personal.com/ , http://www.i-allow.com/
• 50. Reducing Risk in Data Collection
- Avoid collecting sensitive information unless it is required by the research design, method, or hypothesis:
  - unnecessary sensitive information → not minimal risk
  - reducing sensitivity → higher participation, greater honesty
- Collect sensitive information in private settings:
  - reduces the risk of disclosure
  - increases participation
- Reduce sensitivity through indirect measures:
  - less sensitive proxies, e.g., the Implicit Association Test [Greenwald, et al. 1998]
  - unfolding brackets
  - group response collection
  - randomized response technique [Warner 1965]
  - item count/unmatched count/list experiment technique
• 51. Managing Sensitive Data Collection
- Separate sensitive measures, (quasi-)identifiers, and other measures
- If possible, avoid storing identifiers with measures:
  - collect identifying information beforehand
  - assign opaque subject identifiers
- For sensitive data:
  - collect online directly (with appropriate protections); or
  - encrypt collection devices/media (laptops, USB keys, etc.)
- For very/extremely sensitive data:
  - collect with oversight directly; then
  - store on an encrypted device; and
  - transfer to a secure server as soon as feasible
• 52. Randomized Response Technique
- The subject rolls a die, hidden from the interviewer:
  - if the roll is > 2, the subject answers the sensitive question
  - otherwise, the subject says "YES"
  - the interviewer records the answer without knowing which case applied
- Variations:
  - ask two different questions
  - item counts with sensitive and non-sensitive items: eliminates subject randomization
  - regression analysis methods
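Because the die roll is hidden from the interviewer, a single "YES" reveals nothing about that respondent, yet the population rate of the sensitive behavior is still estimable. Below is a minimal sketch of the estimator for the forced-response design above (roll > 2 → truthful answer, probability 2/3; otherwise a forced "YES", probability 1/3); the function name and sample counts are illustrative, not from the slides:

```python
import math

def rrt_estimate(n_yes, n_total, p_truth=2/3, p_forced_yes=1/3):
    """Estimate prevalence of the sensitive trait under forced response:
    with probability p_truth the subject answers truthfully,
    otherwise the subject gives a forced "YES"."""
    p_obs = n_yes / n_total
    # P(yes) = p_truth * pi + p_forced_yes  =>  solve for pi
    pi_hat = (p_obs - p_forced_yes) / p_truth
    # Clamp to [0, 1]: sampling noise can push the raw estimate outside.
    pi_hat = min(max(pi_hat, 0.0), 1.0)
    # Standard error is inflated relative to a direct question;
    # this is the efficiency loss noted on the pros/cons slide.
    se = math.sqrt(p_obs * (1 - p_obs) / n_total) / p_truth
    return pi_hat, se

# e.g. 500 respondents, 250 of whom said "YES"
pi_hat, se = rrt_estimate(250, 500)
```

With 250 of 500 "YES" answers, the estimated prevalence is 0.25, with a wider standard error than a direct question of the same sample would give.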
• 53. Our Table
Name | SSN | Birthdate | Zipcode | Gender | Favorite Ice Cream (less (?) sensitive) | Treat? | # acts* (less (?) sensitive)
A. Jones | 12341 | 01011961 | 02145 | M | Raspberry | 0 | 0
B. Jones | 12342 | 02021961 | 02138 | M | Pistachio | 1 | 20
C. Jones | 12343 | 11111972 | 94043 | M | Chocolate | 0 | 0
D. Jones | 12344 | 12121972 | 94043 | M | Hazelnut | 1 | 12
E. Jones | 12345 | 03251972 | 94041 | F | Lemon | 0 | 0
F. Jones | 12346 | 03251972 | 02127 | F | Lemon | 1 | 7
G. Jones | 12347 | 08081989 | 02138 | F | Peach | 0 | 1
H. Smith | 12348 | 01011973 | 63200 | F | Lime | 1 | 17
I. Smith | 12349 | 02021973 | 63300 | M | Mango | 0 | 4
J. Smith | 12350 | 02021973 | 63400 | M | Coconut | 1 | 18
K. Smith | 12351 | 03031974 | 64500 | M | Frog | 0 | 32
L. Smith | 12352 | 04041974 | 64600 | M | Vanilla | 1 | 65
M. Smith | 12353 | 04041974 | 64700 | F | Pumpkin | 0 | 128
N. Smith | 12354 | 04041974 | 64800 | F | Allergic | 1 | 256
* Acts = crimes if treatment = 0; crimes + acts of generosity if treatment = 1
• 54. Randomized Response: Pros and Cons
- Pros:
  - can substantially reduce risks of disclosure
  - can increase response rates
  - can decrease mis-reporting
- Warning!
  - none of the randomized models uses a formal measure of disclosure limitation
  - some would clearly violate measures (such as differential privacy) we'll see in section 4
  - do not use as a replacement for disclosure limitation
- Other issues:
  - loss of statistical efficiency (if compliance would otherwise be the same)
  - complicates data analysis, especially model-based analysis
  - leaving randomization up to the subject can be unreliable
  - may provide less confidentiality protection if:
    - randomization is incomplete
    - records of the randomization assignment are kept
    - lists of responses overlap across questions
    - the sensitive-question response is large enough to dominate the overall response
    - non-sensitive-question responses are extremely predictable, or publicly observable
• 55. Partitioning Information
- Reduces risk in information management
- Partition information based on sensitivity:
  - identifying information
  - descriptive information
  - sensitive information
  - other information
- Segregate:
  - storage of information
  - access regimes
  - data collection channels
  - data transmission channels
- Plan to segregate as early as feasible in data collection and processing
- Link segregated information with artificial keys…
• 56. Partitioned Table
Identified:
Name | SSN | Birthdate | Zipcode | Gender | LINK
A. Jones | 12341 | 01011961 | 02145 | M | 1401
B. Jones | 12342 | 02021961 | 02138 | M | 283
C. Jones | 12343 | 11111972 | 94043 | M | 8979
D. Jones | 12344 | 12121972 | 94043 | M | 7023
E. Jones | 12345 | 03251972 | 94041 | F | 1498
F. Jones | 12346 | 03251972 | 02127 | F | 1036
G. Jones | 12347 | 08081989 | 02138 | F | 3864
H. Smith | 12348 | 01011973 | 63200 | F | 2124
I. Smith | 12349 | 02021973 | 63300 | M | 4339
…
Not identified:
LINK | Favorite Ice Cream | Treat | # acts
1401 | Raspberry | 0 | 0
283 | Pistachio | 1 | 20
8979 | Chocolate | 0 | 0
7023 | Hazelnut | 1 | 12
1498 | Lemon | 0 | 0
1036 | Lemon | 1 | 7
3864 | Peach | 0 | 1
2124 | Lime | 1 | 17
4339 | Mango | 0 | 4
6629 | Coconut | 1 | 18
9091 | Frog | 0 | 32
9918 | Vanilla | 1 | 65
4749 | Pumpkin | 0 | 128
8197 | Allergic | 1 | 256
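This split can be automated at processing time. A minimal sketch, with illustrative field names (not from the slides), that separates identifiers from measures and joins them only through a random link key drawn from a cryptographically strong source:

```python
import secrets

IDENTIFIERS = ("name", "ssn", "birthdate", "zipcode", "gender")

def partition(records):
    """Split each record into an identified row and a de-identified row,
    linked only by an unpredictable artificial key."""
    identified, deidentified = [], []
    for rec in records:
        # secrets, not random: link keys must not be predictable.
        # (Real use should also guarantee uniqueness across records.)
        link = secrets.randbelow(10**8)
        identified.append({**{k: rec[k] for k in IDENTIFIERS}, "link": link})
        deidentified.append({**{k: v for k, v in rec.items()
                                if k not in IDENTIFIERS}, "link": link})
    return identified, deidentified

rows = [{"name": "A. Jones", "ssn": "12341", "birthdate": "01011961",
         "zipcode": "02145", "gender": "M", "flavor": "Raspberry", "acts": 0}]
ident, deident = partition(rows)
# Store the two tables separately, under different access regimes.
```

The identified table then lives under the strictest access regime, while the de-identified table can be analyzed more freely.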
• 57. Choosing Linking Keys
- Entirely randomized:
  - most resistant to re-linking
  - the mapping from original IDs to random keys is highly sensitive
  - must keep, and be able to access, the mapping to add new identified data
  - most computer-generated random numbers are not sufficient by themselves:
    - most are PSEUDO-random: predictable sequences
    - use a cryptographically secure PRNG (Blum Blum Shub, or AES or another block cipher in counter mode); OR
    - use real random numbers (e.g., from physical sources; see http://maltman.hmdc.harvard.edu/numal/); OR
    - use a PRNG with a real random seed to randomize the order of the table, then another to generate the IDs for the randomly ordered table
- Encryption:
  - more troublesome to compute
  - same IDs + same key + same "salt" produce the same values → facilitates merging
  - IDs can be recovered if the key is exposed or cracked, or the algorithm is weak
- Cryptographic hash (e.g., SHA-256):
  - security is well understood; tools are available to compute it
  - same IDs produce the same hashes → easier to merge new identified data
  - IDs cannot be recovered from the hash, because the hash loses information
  - IDs can be confirmed if identifying information is known or guessable
- Cryptographic hash + secret key:
  - security is well understood; tools are available to compute it
  - same IDs produce the same hashes → easier to merge new identified data
  - IDs cannot be recovered from the hash, because the hash loses information
  - IDs cannot be confirmed unless the key is also known
- Do not choose arbitrary mathematical functions of other identifiers!
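The last option ("cryptographic hash + secret key") corresponds to an HMAC. A minimal sketch using Python's standard hmac and hashlib modules; key handling here is deliberately simplified:

```python
import hmac
import hashlib
import secrets

# Generate once; store separately from the data, and never publish.
KEY = secrets.token_bytes(32)

def link_key(subject_id: str, key: bytes = KEY) -> str:
    """Keyed pseudonym: the same ID + key always yields the same value
    (so new identified data can be merged), but without the key an ID
    can be neither recovered nor confirmed by guessing."""
    return hmac.new(key, subject_id.encode("utf-8"),
                    hashlib.sha256).hexdigest()
```

Unlike a bare SHA-256 hash, an attacker who can guess an SSN cannot confirm the guess against the table without also holding the key.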
• 59. Anonymous Data Collection: Pros & Cons
- Pros:
  - presumption that the data is not identifiable
  - may increase participation
  - may increase honesty
- Cons:
  - barrier to follow-up and longitudinal studies
  - can conflict with quality control and validation
  - data may still be indirectly identifiable if respondent descriptive information is collected
  - linking data to other sources of information may have large research benefits
• 60. Anonymous Data Collection Methods
- A trusted third party intermediates
- The respondent initiates re-contacts
- No identifying information is recorded
- Use IDs randomized to subjects, then destroy the mapping
• 61. Remote Data Collection Challenges
- Where a network connection is readily available, it is easy to transfer data as collected, or to enter it on a remote system:
  - encrypted network file transfer (e.g., SFTP, part of ssh)
  - encrypted/tunneled network file system (e.g., ExpanDrive)
- Where the network connection is less reliable, or bandwidth demands are high:
  - whole-disk-encrypted laptop
  - plus encrypted cloud backup solutions: CrashPlan, Backblaze, SpiderOak
- Small data, short term:
  - encrypted USB keys (e.g., with IronKey, TrueCrypt, PGP)
- Foreign travel:
  - be aware of U.S. EAR export restrictions; use commercial or widely available open encryption software only; do not use bespoke software
  - be aware of country import restrictions (as of 2008): Burma, Belarus, China, Hungary, Iran, Israel, Morocco, Russia, Saudi Arabia, Tunisia, Ukraine
  - encrypt data if possible, but don't break foreign laws; check with the Department of State
• 62. Online/Electronic Data Collection Challenges
- IP addresses are identifiers:
  - IP addresses can be logged automatically by the host, even if not intended by the researcher
  - IP addresses can trivially be observed as data is collected
  - partial IP numbers can be used for probabilistic geographic identification at sub-zipcode levels
- Cookies may be identifiers:
  - cookies provide a way to link data more easily
  - may or may not explicitly identify the subject
- Jurisdiction:
  - data collected from subjects in other states/countries could subject you to laws in those jurisdictions
  - jurisdiction may depend on the residency of the subject, the availability of the data collection instrument in the jurisdiction, or explicit data collection efforts within the jurisdiction
- Vendor:
  - a vendor could retain IP addresses, identifying cookies, etc., even if not intended by the researcher
- Recommendations:
  - use only vendors that certify compliance with your confidentiality policy
  - do not retain IP numbers if data is being collected anonymously
  - use SSL/TLS encryption unless the data is non-sensitive and anonymous
  - some tools for anonymizing IP addresses and system/network logs: www.caida.org/tools/taxonomy/anonymization.xml
- Harvard policy:
  - recommendations as above
  - plus: do not use or display Level 4+ data in web surveys
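If logs must be retained, IP addresses can be coarsened or pseudonymized before storage. Below is a minimal sketch of two common approaches; it is illustrative only, not one of the CAIDA tools linked above:

```python
import hmac
import hashlib
import ipaddress
import secrets

KEY = secrets.token_bytes(32)  # store separately from the logs

def truncate_ip(addr: str, keep_bits: int = 24) -> str:
    """Coarsen an IPv4 address to its enclosing /keep_bits network
    (the default drops the last octet)."""
    net = ipaddress.ip_network(f"{addr}/{keep_bits}", strict=False)
    return str(net.network_address)

def pseudonymize_ip(addr: str, key: bytes = KEY) -> str:
    """Keyed pseudonym: stable, so log lines can still be joined,
    but not reversible without the key."""
    return hmac.new(key, addr.encode(), hashlib.sha256).hexdigest()[:16]
```

Truncation destroys precision irreversibly; keyed pseudonyms keep linkability but concentrate risk in the key, so the two are often combined.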
• 64. Certificates of Confidentiality
- Issued by DHHS agencies such as NIH, CDC, and FDA
- Protect against many types of forced legal disclosure of confidential information
- May not protect against all state disclosure laws
- Do not protect against voluntary disclosures by the researcher or research institution
• 65. Confidentiality & Consent
- Best practice is to describe in the consent form:
  - the practices in place to protect confidentiality
  - the plans for making the data available: to whom, under what circumstances, and the rationale
  - limitations on confidentiality (e.g., limits to a certificate of confidentiality under state law, planned voluntary disclosure)
- The consent form should be consistent with your:
  - data management plan
  - data sharing plans and requirements
- It is generally not best practice to promise:
  - unlimited confidentiality
  - destruction of all data
  - restriction of all data to the original researchers
• 66. Data Management Plan
- When is it required?
  - any NIH request over $500K
  - all NSF proposals after 12/31/2010
  - NIJ
  - Wellcome Trust
  - any proposal where collected data will be a resource beyond the project
- Safeguarding data during collection:
  - documentation
  - backup and recovery
  - review
- Treatment of confidential information:
  - overview: http://www.icpsr.org/DATAPASS/pdf/confidentiality.pdf
  - separation of identifying and sensitive information
  - obtain a certificate of confidentiality, and other legal safeguards
  - de-identification and public-use files
- Dissemination:
  - archiving commitment (include a letter of support)
  - archiving timeline
  - access procedures
  - documentation
  - user vetting, tracking, and support
- One size does not fit all projects.
• 67. Data Management Plan Outline
- Data description
  - nature of the data {generated, observed, or experimental information; samples; publications; physical collections; software; models}
  - scale of the data
- Existing data [if applicable]
  - description of existing data relevant to the project
  - plans for integration with data collection
  - added value of the collection; need to collect/create new data
- Formats
  - generation and dissemination formats, and procedural justification
  - storage format, and archival justification
- Metadata and documentation
  - metadata to be provided
  - metadata standards used
  - planned documentation and supporting materials
  - quality assurance procedures for metadata and documentation
  - treatment of field notes and collection records
- Data organization [if complex]
  - file organization
  - naming conventions
- Quality assurance [if not described in the main proposal]
  - procedures for ensuring data quality in collections, and expected measurement error
  - cleaning and editing procedures
  - validation methods
- Storage, backup, replication, and versioning
  - facilities; methods; procedures; frequency
  - replication
  - version management
  - recovery guarantees
- Security
  - procedural controls
  - technical controls
  - confidentiality concerns
  - access control rules
  - restrictions on use
- Budget
  - cost of preparing data and documentation
  - cost of permanent archiving
- Intellectual property rights
  - entities who hold property rights
  - types of IP rights in the data
  - protections provided
  - dispute resolution process
- Access and sharing
  - plans for depositing in an existing public database
  - access procedures; embargo periods; access charges
  - timeframe for access; technical access methods; restrictions on access
  - audience; potential secondary users; potential scope or scale of use
  - reasons not to share or reuse
- Legal requirements
  - provider requirements and plans to meet them
  - institutional requirements and plans to meet them
  - requirements for data destruction, if applicable
- Archiving and preservation
  - procedures for long-term preservation
  - institution responsible for the long-term costs of data preservation
  - succession plans for the data should the archiving entity go out of existence
- Ethics and privacy
  - informed consent
  - protection of privacy
  - other ethical issues
- Adherence
  - when adherence to the data management plan will be checked or demonstrated
- Responsibility
  - who is responsible for managing data in the project
  - who is responsible for checking adherence to the data management plan
  - individual or project team role responsible for data management
• 68. IQSS Data Management Services
- The Henry A. Murray Research Archive:
  - Harvard's endowed permanent data archive
  - assists in developing data management plans
  - can provide cataloging assistance for the public release of data
  - disseminates data through the IQSS Dataverse Network
- The IQSS Dataverse Network:
  - standard data management plan for public, small data
  - provides easy virtual archiving and dissemination
  - data is catalogued and controlled by you
  - you theme and brand your virtual archive
  - universally searchable and citable
  - automatically provides data formatting and statistical analysis online
- http://dvn.iq.harvard.edu
• 69. Data Management Plan Examples (Summaries)
- Example 1: The proposed research will involve a small sample (fewer than 20 subjects) with Williams syndrome, recruited from clinical facilities in the New York City area. This rare craniofacial disorder is associated with distinguishing facial features, as well as mental retardation. Even with the removal of all identifiers, we believe that it would be difficult if not impossible to protect the identities of subjects, given the physical characteristics of subjects, the type of clinical data (including imaging) that we will be collecting, and the relatively restricted area from which we are recruiting subjects. Therefore, we are not planning to share the data.
- Example 2: The proposed research will include data from approximately 500 subjects being screened for three bacterial sexually transmitted diseases (STDs) at an inner-city STD clinic. The final dataset will include self-reported demographic and behavioral data from interviews with the subjects, and laboratory data from urine specimens provided. Because the STDs being studied are reportable diseases, we will be collecting identifying information. Even though the final dataset will be stripped of identifiers prior to release for sharing, we believe that there remains the possibility of deductive disclosure of subjects with unusual characteristics. Thus, we will make the data and associated documentation available to users only under a data-sharing agreement that provides for: (1) a commitment to using the data only for research purposes and not to identify any individual participant; (2) a commitment to securing the data using appropriate computer technology; and (3) a commitment to destroying or returning the data after analyses are completed.
- Example 3: This application requests support to collect public-use data from a survey of more than 22,000 Americans over the age of 50 every 2 years. Data products from this study will be made available without cost to researchers and analysts (https://ssl.isr.umich.edu/hrs/). User registration is required in order to access or download files. As part of the registration process, users must agree to the conditions of use governing access to the public release data, including restrictions against attempting to identify study participants, destruction of the data after analyses are completed, reporting responsibilities, restrictions on redistribution of the data to third parties, and proper acknowledgement of the data resource. Registered users will receive user support, as well as information related to errors in the data, future releases, workshops, and publication lists. The information provided to users will not be used for commercial purposes, and will not be redistributed to third parties.
- From NIH [grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm#ex]
• 70. External Data Usage Agreements: Controls on Confidential Information
- Agreements between the provider and an individual:
  - careful: you are liable
  - Harvard will not help if you sign
  - University review, and OSP signature, are strongly recommended
- Agreements between the provider and the University:
  - the University is liable
  - requires a University-approved signer: OSP
- Avoid nonstandard protections whenever possible:
  - DUAs can impose very specific and detailed requirements
  - "compatible in spirit" does not imply compatibility in legal practice
  - use University policies/procedures as a…
- Sample description of University safeguards:
"Harvard University has developed extensive technical and administrative procedures to be used with all identified personal information and other confidential information. The University classifies this form of information internally as 'Harvard Confidential Information' (HCI). Any use of HCI at Harvard includes the following safeguards:
- Systems security. Any system used to store HCI is subject to a checklist of technical and procedural security measures, including: the operating system and applications must be patched to current security levels, a host-based firewall is enabled, and antivirus software is enabled with a current definitions file.
- Server security. Any server used to distribute HCI to other systems (e.g., through providing a remote file system), or otherwise offering login access, must employ additional security measures, including: connection through a private network only; limits on the length of idle sessions; limits on incorrect password attempts; and additional logging and monitoring.
- Access restriction. An individual is allowed to access HCI only if there is a specific need for access. All access to HCI is over physically controlled and/or encrypted channels.
- Disposal processes, including secure file erasure and document destruction.
- Encryption. HCI must be strongly encrypted whenever it is transmitted across a public network, stored on a laptop, or stored on a portable device such as a flash drive or on portable media.
This is only a brief summary. The full University security policy can be found at http://security.harvard.edu/heisp, and a more detailed checklist used to verify systems compliance at http://security.harvard.edu/files/resources/forms/. These safeguards are applied consistently throughout the University; we believe these requirements offer stringent protection for the requested data, and they will be applied in addition to any others required by the specific data use agreement."
• 71. IQSS Data Management Services
- The Henry A. Murray Research Archive:
  - Harvard's endowed permanent data archive
  - assists in developing data management plans
  - can provide cataloging assistance for the public release of data
  - disseminates data through the IQSS Dataverse Network
  - provides letters of commitment to permanent archiving (www.murray.harvard.edu)
- The IQSS Dataverse Network:
  - provides easy virtual archiving and dissemination
  - data is catalogued and controlled by you
  - you theme and brand your virtual archive
  - universally searchable and citable
  - automatically provides data formatting and statistical analysis online (dvn.iq.harvard.edu)
• 72. Key Concepts & Issues Review
- Levels of sensitivity
- Anonymity criteria
- Sensitivity reduction
- Certificate of confidentiality
- Data sharing plan
- Data management plan
- Information partitioning
- Linking keys
• 73. Checklist: Research Design
- Does the research involve human subjects?
- What are the possible harms that could occur if identified information were disclosed?
- Is the information collected benign, sensitive, very sensitive, or extremely sensitive? (The IRB makes the final determination.)
- Can the sensitivity of the information be reduced?
- Can the research be carried out with anonymity?
- Can the research data be de-identified during collection?
- How can identifying information, descriptive information, and sensitive information be segregated?
- Have you completed:
  - NIH human subjects training?
  - Harvard HETHR training?
- Have you written the following to be consistent with your final plans for analysis and dissemination:
  - data management plan?
  - consent documents?
  - application for a certificate of confidentiality?
• 74. Resources
- E.A. Bankert & R.J. Amdur, 2006, Institutional Review Board: Management and Function, Jones and Bartlett Publishers
- R. Groves, et al., 2004, Survey Methodology, John Wiley & Sons
- J.A. Fox & P.E. Tracy, 1986, Randomized Response, Sage Publications
- R.M. Lee, 1993, Doing Research on Sensitive Topics, Sage Publications
- D. Corstange, 2009, "Sensitive Questions, Truthful Answers? Modeling the List Experiment with LISTIT", Political Analysis 17:45–63
- ICPSR Data Enclave [www.icpsr.umich.edu/icpsrweb/ICPSR/access/restricted/enclave]
- Murray Research Archive [www.murray.harvard.edu]
- IQSS Dataverse Network [dvn.iq.harvard.edu/]
• 75. Information Security
- Security principles
- FISMA
- Categories of technical controls
- A simplified approach
- Harvard policies [summary]
• 76. Core Information Security Concepts
- Security properties:
  - confidentiality
  - integrity
  - availability
  - [authenticity]
  - [nonrepudiation]
- Security practices:
  - defense in depth
  - threat modeling
  - risk assessment
  - vulnerability assessment
• 77. Risk Assessment [NIST 800-100, a simplification of NIST 800-30]
The information security control selection process:
- Threat modeling and vulnerability identification feed an analysis of:
  - likelihood
  - impact
  - mitigating controls
- The analysis drives control selection; the selected controls are instituted in the system
- Testing and auditing verify the instituted controls
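The likelihood/impact analysis in the middle of this process is often operationalized as a simple matrix. A toy sketch in the style of the NIST SP 800-30 example matrix; the threat list and ratings here are invented for illustration:

```python
# Ordinal ratings mapped to weights, following the example matrix in
# NIST SP 800-30 (risk = likelihood weight x impact weight, scaled to 100).
LEVELS = {"low": 0.1, "medium": 0.5, "high": 1.0}

def risk_score(likelihood: str, impact: str) -> float:
    """Combine ordinal likelihood and impact ratings into a 0-100 score."""
    return LEVELS[likelihood] * LEVELS[impact] * 100

# Hypothetical threats for a research data system (illustrative only).
threats = [
    ("stolen laptop holding subject data", "medium", "high"),
    ("web server exposes access logs", "high", "medium"),
    ("natural disaster destroys backups", "low", "low"),
]

# Rank threats to prioritize which controls to select first.
ranked = sorted(threats, key=lambda t: risk_score(t[1], t[2]), reverse=True)
```

The ranking feeds the next slide's steps: risk determination, then control recommendation and documentation.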
• 78. Risk Management Details
- System characterization
- Threat identification
- Control analysis
- Likelihood determination
- Impact analysis
- Risk determination
- Control recommendation
- Results documentation
• 79. Classes of Threats and Vulnerabilities
- Sources of threat:
  - natural
  - unintentional human
  - intentional
- Areas of vulnerability:
  - logical:
    - data at rest in the system
    - data in motion across networks
    - data being processed in applications
  - physical:
    - computer systems
    - network
    - backups, disposal, media
  - social:
    - social engineering
    - mistakes
    - insider threats
• 80. Simple Control Model
- A client presents credentials with each request/response for a resource
- Authentication verifies the credentials; authorization decides whether the request may access the resource
- Auditing records each access decision to a log
- An external auditor reviews the log, independently of the control model
• 81. Operational and Technical Controls [NIST 800-53]
- Operational controls:
  - personnel security
  - physical and environmental protection
  - contingency planning
  - configuration management
  - maintenance
  - system and information integrity
  - media protection
  - incident response
  - awareness and training
- Technical controls:
  - identification and authentication
  - access control
  - audit and accountability
  - system and communication protection
• 82. Key Information Security Standards
- Comprehensive information security standards:
  - FISMA: framework for non-classified information security in the federal government
  - ISO/IEC 27002: framework of similar scope to FISMA, used internationally
  - PCI: payment card industry security standards; used by major payment card companies, processors, etc.
- Related certifications:
  - FIPS compliance and certification:
    - establishes standards for cryptographic methods and modules
    - be aware that FIPS certification is often limited to the algorithm used, not the entire system
  - SAS 70 audits (Type 2):
    - independent audit of controls and control objectives
    - does not establish the sufficiency of the control objectives
  - CISSP (Certified Information Systems Security Professional):
    - widely recognized certification for information security professionals
• 83. FISMA Overview
Federal Information Security Management Act of 2002
- All federal agencies are required to develop agency-wide information security plans
- NIST has published an extensive list of recommendations
- Federal sponsors seem to be trending toward FISMA as best practice for managing confidential data produced by an award
- Identifies risk and impact level; monitoring; technical and procedural controls
- Harvard HRCI controls: less than FISMA "low"
• 84. Security Control Baselines
  Access Control
  - Low (impact): Policies; Account Management*; Access Enforcement; Unsuccessful Login Attempts; System Use Notification; Restrict Anonymous Access*; Restrict Remote Access*; Restrict Wireless Access*; Restrict Mobile Devices*; Restrict Use of External Information Systems*; Restrict Publicly Accessible Content
  - Medium-High (impact), adds: Information Flow Enforcement; Separation of Duties; Least Privilege; Session Lock
  Security Awareness and Training
  - Low: Policies; Awareness; Training; Training Records
  Audit and Accountability
  - Low: Policies; Auditable Events*; Content of Audit Records*; Storage Capacity; Audit Review, Analysis and Reporting*; Time Stamps*; Protection of Audit Information; Audit Record Retention; Audit Generation
  - Medium-High, adds: Audit Reduction; Non-Repudiation
  Security Assessment and Authorization
  - Low: Policies; Assessments*; System Connections; Planning; Authorization; Continuous Monitoring
• 85. Security Control Baselines (continued)
  Configuration Management
  - Low: Policies; Baseline*; Settings*; Least Functionality; Component Inventory*
  - Medium-High, adds: Impact Analysis; Change Control; Access Restrictions for Change; Configuration Management Plan
  Contingency Planning
  - Low: Policies; Plan*; Training*; Plan Testing*; System Backup*; Recovery & Reconstitution*
  - Medium-High, adds: Alternate Storage Site; Alternate Processing Site; Telecommunications
  Identification and Authentication
  - Low: Policies; Organizational Users*; Identifier Management; Authenticator Management*; Authenticator Feedback; Cryptographic Module Authentication; Non-Organizational Users
  - Medium-High, adds: Device Identification and Authentication
  Incident Response
  - Low: Policies; Training; Handling*; Monitoring; Reporting*; Response Assistance; Response Plan
  - Medium-High, adds: Testing
  Maintenance
  - Low: Policies; Control*; Non-Local Maintenance Restrictions*; Personnel Restrictions*
  - Medium-High, adds: Tools; Maintenance Scheduling/Timeliness
• 86. Security Control Baselines (continued)
  Media Protection
  - Low: Policies; Access Restrictions*; Sanitization
  - Medium-High, adds: Marking; Storage; Transport
  Physical and Environmental Protection
  - Low: Policies; Access Authorizations; Access Control*; Access Monitoring*; Visitor Control*; Records*; Emergency Lighting; Fire Protection*; Temperature, Humidity, Water Damage*; Delivery and Removal
  - Medium-High, adds: Network Access Control; Output Device Control; Power Equipment Access, Shutoff, Backup; Alternate Work Site; Location of Information System Components; Information Leakage
  Planning
  - Low: Policies; Plan; Rules of Behavior; Privacy Impact Assessment
  - Medium-High, adds: Activity Planning
  Personnel Security
  - Low: Policies; Position Categorization; Screening; Termination; Transfer; Access Agreements; Third Parties; Sanctions
  Risk Assessment
  - Low: Policies; Categorization; Assessment; Vulnerability Scanning*
• 87. Security Control Baselines (continued)
  System and Services Acquisition
  - Low: Policies; Resource Allocation; Life Cycle Support; Acquisition*; Documentation; Software Usage Restrictions; User-Installed Software Restrictions; External Information System Services Restrictions
  - Medium-High, adds: Security Engineering; Developer Configuration Management; Developer Security Testing; Supply Chain Protection; Trustworthiness
  System and Communications Protection
  - Low: Policies; Denial of Service Protection; Boundary Protection*; Cryptographic Key Management; Encryption; Public Access Protection; Collaborative Computing Devices Restriction; Secure Name Resolution*
  - Medium-High, adds: Application Partitioning; Restrictions on Shared Resources; Transmission Integrity & Confidentiality; Network Disconnection Procedure; Public Key Infrastructure Certificates; Mobile Code Management; VOIP Management; Session Authenticity; Fail in Known State; Protection of Information at Rest; Information System Partitioning
  System and Information Integrity
  - Low: Policies; Flaw Remediation*; Malicious Code Protection*; Security Advisory Monitoring*; Information Output Handling
  - Medium-High, adds: Information System Monitoring; Software and Information Integrity; Spam Protection; Information Input Restrictions & Validation; Error Handling
  Program Management
  - Plan; Security Officer Role; Resources; Inventory; Performance Measures; Enterprise Architecture; Risk Management Strategy; Authorization Process; Mission Definition
• 88. HIPAA Requirements
  Administrative controls:
  - Access authorization, establishment, modification, and termination
  - Training program
  - Vendor compliance
  - Disaster recovery
  - Internal audits
  - Breach procedures
  Physical controls:
  - Disposal
  - Access to equipment
  - Access to physical environment
  - Workstation environment
  Technical controls:
  - Intrusion protection
  - Network encryption
  - Integrity checking
  - Authentication of communication
  - Configuration management
  - Risk analysis
• 89. Delegating Systems Security
  - What are the goals for confidentiality, integrity, and availability?
  - What threats are envisioned?
  - What controls are in place? Is there a checklist?
  - Who is responsible for technical controls? Do they have appropriate training, experience, and/or certification?
  - Who is responsible for procedural controls? Have they received appropriate training?
  - How is security monitored, audited, and tested?
    - E.g., SAS 70 Type 2 audits; FISMA compliance; ISO certification
  - What security standards are referenced?
    - E.g., FISMA, ISO, HEISP/HDRSP/PCI
• 90. What Most Security Plans Do Not Do
  - Protect against all insider threats
  - Protect against all unintentional threats (human error, voluntary disclosure)
  - Protect against the CIA, TEMPEST, evil maids, and other well-resourced, sophisticated adversaries
  - Protect against prolonged physical threats to computer equipment, or to the data owner
• 91. Information Security is Systemic
  Not just control implementation, but:
  - Policy creation, maintenance, auditing
  - Implementation review, auditing, logging, monitoring
  - Regular vulnerability & threat assessment
• 92. Simplified Approach for Sensitive Data
  - Use whole-disk/media encryption to protect data at rest
  - Use end-to-end encryption to protect data in motion
  - Use core information hygiene to protect systems
  - Scan for HRCI regularly
  - Be thorough in disposal of information
  Very sensitive/extremely sensitive data requires more protection.
• 93. Plan Outline – Very Sensitive Data
  Protect very sensitive data on "target systems":
  - Extra physical, logical, administrative access control
    - Record keeping
    - Limitations
    - Lockouts
  - Extra monitoring, auditing
  - Extra procedural controls – specific, renewed approvals
  - Limits on network connectivity
    - Private network, not directly connected to the public network
  - Regular scans
    - Vulnerability scans
    - Scans for PII
  Extremely sensitive data:
  - Increased access control, procedural limitations
  - Not physically/logically connected (even via wireless) to the public network, directly or indirectly
• 102. Key Concepts Review
  - Confidentiality
  - Integrity
  - Availability
  - Threat modeling
  - Vulnerability assessment
  - Risk assessment
  - Defense in depth
  - Logical controls
  - Physical controls
  - Administrative controls
• 103. Checklist: Identify Requirements
  - Documented information security plan?
    - What are the goals for confidentiality, integrity, availability?
    - What threats are envisioned?
    - What are the broad types of controls in place?
  - Key protections
    - Use whole-disk/media encryption to protect data at rest
    - Use end-to-end encryption to protect data in motion
    - Use basic information hygiene to protect systems
    - Be thorough in disposal of information
  - Additional protections for sensitive data
    - Extra logical, administrative, physical controls for very sensitive data?
    - Monitoring and vulnerability scanning for very sensitive data?
  - Check requirements for remote and foreign data collection
  - Refer to security standards
    - FIPS encryption; FISMA/ISO practices; SAS 70 auditing; CISSP certification of key staff
  - Delegate implementation to information security professionals
• 104. Resources
  - S. Garfinkel, et al., 2003, Practical Unix and Internet Security, 3rd ed., O'Reilly Media
  - Shon Harris, 2001, CISSP All-in-One Exam Guide, Osborne
  - NIST, 2009, DRAFT Guide to Protecting the Confidentiality of Personally Identifiable Information, NIST Special Publication 800-122
  - NIST, 2009, Recommended Security Controls for Federal Information Systems and Organizations v. 3, NIST SP 800-53 (see also the related NIST SP 800-53A and other NIST Computer Security Division Special Publications) [csrc.nist.gov/publications/PubsSPs.html]
  - NIST, 2006, Information Security Handbook: A Guide for Managers, NIST Special Publication 800-100
  - Harvard Enterprise Security Checklists
• 105. Recommended Software
  - Whole-disk encryption
    - Open source: truecrypt.org
    - Commercial: pgp.com
  - Scanning
    - Vulnerability scanner/assessment tool, Nessus: www.nessus.org/nessus (the commercial version also scans for limited PII)
    - PII scanning tool (open source), Cornell Spider: www2.cit.cornell.edu/security/tools
    - PII scanning tool (commercial), Identity Finder: www.identityfinder.com
    - File integrity/intrusion detection engine, Samhain: la-samhna.de/samhain
    - Network intrusion detection, Snort: www.snort.org
  - Encrypting transmission over the network
    - OpenSSL: openssl.org
    - OpenSSH: openssh.org
    - VTun: vtun.sourceforge.net
  - Cloud backup services with encryption
    - CrashPlan: crashplan.com
    - SpiderOak: spideroak.com
    - Backblaze: backblaze.com
• 106. Disclosure Limitation
  - Threat models
  - Disclosure limitation methods
  - Statistical disclosure limitation methods
  - Types of disclosure
  - Factors affecting disclosure protection
  - SDL caveats
  - SDL observations
• 107. Threat Models
  - Nosy neighbor (nosy employer)
  - Muck-raking journalist (zero tolerance)
  - Business rival contributing to the same survey
  - Absent-minded professor
  - …
• 108. Non-statistical Disclosure Limitation Methods
  - Licensing
    - Used in conjunction with limited de-identification
    - Should prohibit re-identification, linking, and dissemination to third parties; limit retention
    - Advantages: can decrease the cost of processing, increase the utility of research data
    - Disadvantages: licenses may be violated unintentionally or intentionally; difficult to enforce outside of limited domains (e.g., HIPAA)
  - Automated de-identification
    - Primarily used for qualitative text medical records; replaces identifiers with dummy strings
    - Advantages: can decrease the cost and increase the accuracy of manual de-identification of qualitative information
    - Disadvantages: little available software; error rates still slightly higher than teams of trained human coders
• 109. Automated De-identification
  - Trained human sensitivity rates [Neamatullah 2008]:
    - Single worker: [.63–.94] (.81)
    - Two-person team: [.89–.98] (.94)
    - Three-person team: [.98–.99] (.98)
  - State-of-the-art algorithms approach recall of .95 [Uzuner, et al. 2007]
    - Statistical learning of rule-template features worked best
    - A simpler rules-based approach still did as well as the median two-person team
    - Rules for PII and a local dictionary are important
• 110. Text de-identification (HIPAA)
  (Name, SSN, Birthdate, and Zipcode columns cleaned; checked by hand.)
  Name       SSN  Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
  [Name 1]   *    *1961      021*     M       Raspberry           0
  [Name 2]   *    *1961      021*     M       Pistachio           0
  [Name 3]   *    *1972      940*     M       Chocolate           0
  [Name 4]   *    *1972      940*     M       Hazelnut            0
  [Name 5]   *    *1972      940*     F       Lemon               0
  [Name 6]   *    *1972      021*     F       Lemon               1
  [Name 7]   *    *1989      021*     F       Peach               1
  [Name 8]   *    *1973      632*     F       Lime                2
  [Name 9]   *    *1973      633*     M       Mango               4
  [Name 10]  *    *1973      634*     M       Coconut             16
  [Name 11]  *    *1974      645*     M       Frog                32
  [Name 12]  *    *1974      646*     M       Vanilla             64
  [Name 13]  *    *1974      647*     F       Pumpkin             128
• 112. Hybrid Statistical/Non-statistical Limitation
  - Data enclaves – physically restrict access to data
    - Examples: ICPSR, Census Research Data Center
    - May include availability of synthetic data as an aid to preparing model specifications
    - Advantages: extensive human auditing and vetting; information security threats much reduced
    - Disadvantages: expensive, slow, inconvenient to access
  - Controlled remote access
    - Varies from remote access to all data and output, to human vetting of output
    - Advantages: auditable; potential to impose human review; potential to limit analysis
    - Disadvantages: complex to implement, slow
  - Model servers
    - Mediated remote access – analysis limited to designated models
    - Advantages: faster, no human in the loop
    - Disadvantages: statistical methods for ensuring model safety are immature – residuals, categorical variables, and dummy variables are all risky; very limited set of models currently supported; complex to implement
  - Statistical disclosure limitation
    - Modifications to the data to decrease the probability of disclosure
    - Advantages/disadvantages… to follow…
• 114. Pure Statistical Disclosure Limitation Techniques
  - Data reduction
    - Removing variables (i.e., de-identifying)
    - Suppressing records
    - Sub-sampling
    - Global recoding (including top/bottom coding)
    - Local suppression
    - Global complete suppression
  - Perturbation
    - Microaggregation: sort records based on similarity, then replace values of records in clusters with the cluster mean
    - Rule-based data swapping
    - Adding noise
    - Resampling
  - Synthetic microdata
    - Bootstrap
    - Multiple imputation
    - Model based
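Two of the data-reduction techniques above, global recoding and top-coding, can be sketched in a few lines of Python. This is an illustration only; the records, field names, and cutoff values are hypothetical, and real recoding choices should follow the substantive categories of the data.

```python
# Hypothetical microdata: full birthdate and 5-digit ZIP are quasi-identifiers
records = [
    {"birthdate": "03/25/1972", "zipcode": "94041", "income": 40_000},
    {"birthdate": "10/01/1961", "zipcode": "02145", "income": 1_200_000},
]

def global_recode(rec):
    """Coarsen quasi-identifiers: keep birth year only and a 3-digit
    ZIP prefix; top-code extreme incomes so outliers are less identifying."""
    return {
        "birthyear": rec["birthdate"][-4:],   # drop month and day
        "zip3": rec["zipcode"][:3],           # generalize ZIP to 3 digits
        "income": min(rec["income"], 500_000) # top-code at a chosen cutoff
    }

recoded = [global_recode(r) for r in records]
print(recoded)
```

The same idea extends to bottom-coding (a `max` with a floor value) and to collapsing rare categories into an "other" code.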
• 116. Suppression with R and sdcMicro
  # setup
  > library(sdcMicro)
  # load data (one colClasses entry per column)
  > classexample.df <- read.csv("examplesdc.csv", as.is=TRUE,
      colClasses=c("character","character","character","character",
                   "factor","factor","numeric"))
  # create a weight variable if needed
  > classexample.df$weight <- 1
  # a simple frequency table shows that the data is uniquely identified
  > ftable(Birthdate ~ Zipcode, data=classexample.df)
          Birthdate 01/01/1973 02/02/1973 03/25/1972 04/04/1974 08/08/1989 10/01/1961 11/11/1972 12/12/1972 20/02/1961 30/03/1974
  Zipcode
  02127                      0          0          1          0          0          0          0          0          0          0
  02138                      0          0          0          0          1          0          0          0          1          0
  02145                      0          0          0          0          0          1          0          0          0          0
  63200                      1          0          0          0          0          0          0          0          0          0
  63300                      0          1          0          0          0          0          0          0          0          0
  63400                      0          1          0          0          0          0          0          0          0          0
  64500                      0          0          0          0          0          0          0          0          0          1
  64600                      0          0          0          1          0          0          0          0          0          0
  64700                      0          0          0          1          0          0          0          0          0          0
  64800                      0          0          0          1          0          0          0          0          0          0
  94041                      0          0          1          0          0          0          0          0          0          0
  94043                      0          0          0          0          0          0          1          1          0          0
• 117. Suppression with R and sdcMicro
  # global recoding
  > recoded.df <- classexample.df
  > recoded.df$Birthdate <- substring(classexample.df$Birthdate, 7)
  > recoded.df$Zipcode <- substring(classexample.df$Zipcode, 1, 3)
  # check if anonymous
  # NOTE: make sure to use column numbers and w=NULL
  > print(freqCalc(recoded.df, keyVars=3:5, w=NULL))
  --------------------------
  10 observation with fk=1
  4 observation with fk=2
  --------------------------
• 118. Suppression with R and sdcMicro
  # try local suppression, with a preference for suppressing Gender
  > anonymous.out <- localSupp2Wrapper(recoded.df, 3:5, w=NULL, kAnon=2,
      importance=c(1,1,100))
  ...
  [1] "2-anonymity after 2 iterations."
  # look at the data
  > as.data.frame(anonymous.out$xAnon)
         Name   SSN Birthdate Zipcode Gender Ice.cream Crimes weight
  1  A. Jones 12341      1961     021   <NA> Raspberry      0      1
  2  B. Jones 12342      1961     021   <NA> Pistachio      0      1
  3  C. Jones 12343      1972     940      M Chocolate      0      1
  4  D. Jones 12344      1972     940      M  Hazelnut      0      1
  5  E. Jones 12345      1972     940   <NA>     Lemon      0      1
  6  F. Jones 12346      <NA>     021   <NA>     Lemon      1      1
  7  G. Jones 12347      <NA>     021   <NA>     Peach      1      1
  8  H. Smith 12348      1973    <NA>   <NA>      Lime      2      1
  9  I. Smith 12349      <NA>     633   <NA>     Mango      4      1
  10 J. Smith 12350      <NA>     634   <NA>   Coconut     16      1
  11 K. Smith 12351      1974    <NA>   <NA>      Frog     32      1
  12 L. Smith 12352      <NA>     646   <NA>   Vanilla     64      1
  13 M. Smith 12353      <NA>     647   <NA>   Pumpkin    128      1
  14 N. Smith 12354      <NA>     648   <NA>  Allergic    256      1
• 119. Suppression with R and sdcMicro
  # launch the GUI if you like, and play around some more
  > sdcGui()
• 120. How SDL Methods Reduce Utility
  Method                            Issues
  Removing variables                Model misspecification
  Suppressing records               Induced non-response bias
  Sub-sampling                      Weak protection
  Global recoding (generalization)  Censoring
  Local suppression                 Non-ignorable missing-value bias
  Rules-based swapping              Biased; the swapping rules must be kept secret
  Random swapping                   Weakens bivariate, multivariate relationships
  Adding noise                      Weak protection
  Resampling                        Weak protection
  Synthetic microdata               Destroys unmodeled relationships; not currently widely accepted
• 121. Types of Disclosure
  - Identity disclosure (re-identification disclosure): associate an individual with a record and a set of sensitive variables
  - Attribute disclosure (prediction disclosure): improve prediction of the value of a sensitive variable for an individual
  - Group disclosure: predict the value of a sensitive variable for a known group of people
• 122. Factors Affecting Disclosure Protection
  Properties of the sample:
  - Measured variables
  - Realizations of measurements
  - Outliers
  - Content of qualitative responses
  - Distribution of population
  - Adversarial knowledge: variables, completeness, errors, priors
  Individual re-identification occurs when [Willenborg & De Waal 1996]:
  - The respondent is unique on values of the key
  - The attacker has access to measurements of the key
  - The respondent is in the attacker's set of measurements
  - The attacker comes across the disclosed data
  - The attacker recognizes the respondent
• 123. Disclosure Protection: k-anonymity [Sweeney 2002]
  - Operates on microdata
  - Designate a subset of variables as keys – variables that an attacker could use to identify individuals
  - For each combination of key variables in the sample, there must be k rows taking on that combination
  - k is typically chosen to be in the range 3–5
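The k-anonymity condition above can be checked mechanically: group rows by their key-variable values and take the smallest group size. A minimal Python sketch (the data and variable names are hypothetical):

```python
from collections import Counter

def k_anonymity(rows, keys):
    """Return the smallest equivalence-class size over the key variables:
    the data is k-anonymous for any k up to this value."""
    classes = Counter(tuple(row[k] for k in keys) for row in rows)
    return min(classes.values())

# Hypothetical microdata with birth year and 3-digit ZIP as quasi-keys
rows = [
    {"birthyear": 1961, "zip3": "021", "crimes": 0},
    {"birthyear": 1961, "zip3": "021", "crimes": 0},
    {"birthyear": 1972, "zip3": "940", "crimes": 0},
    {"birthyear": 1972, "zip3": "940", "crimes": 1},
]
print(k_anonymity(rows, ["birthyear", "zip3"]))  # 2: every key combination occurs twice
```

In practice this check is what tools such as sdcMicro's freqCalc report (the fk counts on slide 117): fk=1 cells are sample-unique and violate k-anonymity for any k > 1.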
• 124. Our Table Made 2-anonymous (one way)
  (Quasi-keys: Birthdate, Zipcode, Gender. Cleaning is both more and less than the HIPAA default.)
  Name     SSN  Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
  * Jones  *    *1961      021*     M       Raspberry           0
  * Jones  *    *1961      021*     M       Pistachio           0
  * Jones  *    *1972      9404*    *       Chocolate           0
  * Jones  *    *1972      9404*    *       Hazelnut            0
  * Jones  *    *1972      9404*    *       Lemon               0
  * Jones  *    *          021*     F       Lemon               1
  * Jones  *    *          021*     F       Peach               1
  * Smith  *    *1973      63*      *       Lime                2
  * Smith  *    *1973      63*      *       Mango               4
  * Smith  *    *1973      63*      *       Coconut             16
  * Smith  *    *1974      64*      M       Frog                32
  * Smith  *    *1974      64*      M       Vanilla             64
  * Smith  *    04041974   64*      F       Pumpkin             128
  * Smith  *    04041974   64*      F       Allergic            256
• 125. k-anonymous – But Not Protected
  (The same 2-anonymous table as the previous slide, annotated to show remaining attack vectors:)
  - Additional background knowledge about a respondent
  - Sort order / structure of the released data
  - Homogeneity of the sensitive variable within an equivalence class
• 126. More Than One Way to De-identify (but don't release both…)
  (Two alternative 2-anonymous versions of the same table, side by side: one generalizes birthdates and suppresses gender while retaining full 5-digit zipcodes for some records; the other retains full birthdates for some records while generalizing zipcodes and gender. Either release alone satisfies k-anonymity; releasing both allows the versions to be linked, defeating the protection.)
• 127. Vulnerabilities of k-anonymity
  - Sort order [Sweeney 2002]
    - Information in the structure of the data, not the content!
  - Contemporaneous release [Sweeney 2002]
    - Overlap of information under different anonymization schemes → disclosure
  - Information in the suppression mechanism may allow recovery
    - E.g., rules-based swapping
  - Temporal changes
    - "Barn door": deletion of tuples can subvert k-anonymity, but you can't "unrelease" records
    - Additions of tuples or information can yield disclosures if you re-do anonymization; these must be anonymized based on the past data release [Sweeney 2002]
  - Variable background knowledge [Machanavajjhala 2007]
    - Incorrect assumption about which variables are in the quasi-key; this may change over time
  - Homogeneity [Truta 2006]
    - Sensitive values may be homogeneous within a class, even if no individual is literally identified
• 128. Strengthening k-anonymity vs. Homogeneity
  - Ensure each k-anonymous set also satisfies some measure of attribute diversity
    - P-sensitive k-anonymity [Truta 2006]
    - Fixed l-diversity, entropy l-diversity, recursive (c,l)-diversity [Machanavajjhala 2007]
    - T-closeness [Li 2007]
  - Diversity measures may be too strong or too weak
  - And sometimes protecting against attribute disclosure may not be justified:
    - It does not literally (legally?) identify an individual
    - Research may be explicitly designed to make an attribute more predictable
    - In some cases, a study would probabilistically identify an attribute even if the participant weren't in it!
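The simplest of the diversity measures above, distinct l-diversity, just counts distinct sensitive values per equivalence class. A minimal Python sketch (data hypothetical) showing how a table can be 2-anonymous yet completely homogeneous on the sensitive attribute:

```python
from collections import defaultdict

def distinct_l_diversity(rows, keys, sensitive):
    """Minimum number of distinct sensitive values found in any
    equivalence class defined by the key variables."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[k] for k in keys)].add(row[sensitive])
    return min(len(values) for values in groups.values())

rows = [
    {"zip3": "021", "disease": "flu"},
    {"zip3": "021", "disease": "flu"},   # 2-anonymous class, but homogeneous
    {"zip3": "940", "disease": "flu"},
    {"zip3": "940", "disease": "cold"},
]
print(distinct_l_diversity(rows, ["zip3"], "disease"))  # 1
```

Here every key combination occurs twice (k = 2), yet an attacker who knows someone lives in ZIP 021* learns their disease with certainty, because l = 1 for that class.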
• 129. Sometimes k-anonymity Is Too Strong
  Embodies several worst-case assumptions – safer, but more information loss:
  - Sample unique → population unique
  - The attacker discovers your data with certainty
  - The attacker has a complete database of non-sensitive variables and their links to identifiers
  - The attacker's database and the sample are error-free
• 130. Research Areas
  Standard SDL approaches are designed to apply to dense single tables of quantitative data… use caution and seek consultation with the following:
  - Dynamic data
    - Adding new attributes
    - Incremental updates
    - Multiple views
  - Relational data
    - Multiple relations that are not easily normalized
  - Non-tabular data
    - Sparse matrices
    - Transactional data
    - Trajectory data
    - Rich text
    - Social networks
• 131. Problem 2: Information Loss
  - No free lunch: anonymization → information loss
  - Various measurement approaches; none satisfactory or commonly used:
    - Count the number of suppressed values
    - Compare the data matrix before & after anonymization: entropy, MSE, MAE, mean variation
    - Compare statistics on the data matrix before & after: variance, bias, MSE
    - Weight by (ad hoc) importance of variables
  - Optimal (minimum information loss) k-anonymity is NP-hard [Meyerson & Williams 2004]
  - Utility degrades very quickly as privacy increases
    - See [Brickell & Shmatikov 2008; Ohm 2009; Dinur & Nissim 2004; Dwork et al. 2006, 2007]
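Two of the ad hoc loss measures above, the suppression count and the before/after change in a substantive statistic, are easy to compute. A minimal Python sketch with made-up data (suppressing the top values, as local suppression of outliers might):

```python
# Hypothetical variable before and after anonymization;
# None marks a suppressed value.
original = [0, 0, 0, 1, 1, 2, 4, 16, 32, 64]
anonymized = [0, 0, 0, 1, 1, 2, 4, None, None, None]

# Loss measure 1: how many cells were suppressed
suppressed = sum(v is None for v in anonymized)

# Loss measure 2: change in a substantive statistic (here, the mean)
kept = [v for v in anonymized if v is not None]
mean_before = sum(original) / len(original)
mean_after = sum(kept) / len(kept)

print(suppressed, mean_before, mean_after)
```

The example also shows why suppression bias is "non-ignorable" (slide 120): because the suppressed values were the largest, the post-anonymization mean drops from 12.0 to about 1.14, badly misleading any analyst who treats the missingness as random.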
• 132. Alternative Risk Limitation – Non-microdata Approaches
  Models and tables can be safely generated from anonymized microdata; however, information loss may be lower when anonymization is applied at the model/table level directly.
  - Model servers
    - Compute models on the full microdata
    - Limit the models being run on the data; limit the specifications of models
    - Synthesize residuals; perturb results
  - Table-based de-identification
    - Compute tables on the full microdata
    - Perturb (noise, rounding), suppress cells (and complementary cells, if marginals are computed), restructure tables (generalization, variable suppression), synthesize values
    - Disclosure rules: number of contributors to a cell (similar to k-anonymity); proportion of the largest group of contributors to a cell total; percentage decrease in upper/lower bounds on contributor values
  - Limitations
    - Feasible (privacy-protecting) multi-dimensional table/multiple-table protection is NP-hard
    - Model/table disclosure requires evaluating the entire history of previous disclosures
    - Dynamic table servers and model servers should be considered open research topics, not mature
• 133. Alternate Solution Concept – Probabilistic Record Linkage
  - Apply the disclosure rule to the population, based on a threshold probability and an estimated population distribution
    - E.g., for 3-anonymity: probability < .02 that there exists a tuple of quasi-identifier values that occurs < 3 times in the population
  - Advantages
    - When the sample is small, a population risk model will result in far less modification and information loss
  - Disadvantages
    - Harder to explain
    - Does not literally prevent individual re-identification
    - Need to justify the re-identification risk threshold
    - Need to justify the population distribution model
    - Assumes that the attacker's background knowledge does not include whether each identified individual is in the sample
• 134. Alternate Solution Concept – Bayesian Optimal Privacy
  - Possibly: minimize the distance between posterior and prior distributions, for all priors…
  - Limitations [see Machanavajjhala, et al. 2007]
    - Insufficient knowledge about distributions of attributes
    - Insufficient knowledge about distributions of priors
    - Instance-level knowledge not modeled well
    - Multiple adversaries not modeled
  - Possible limitations
    - Complexity of computation not known
    - Implementation mechanisms not well known
    - Utility reduction not well known
• 135. Alternate Solution Concept – Differential Privacy
  - Based on cryptographic theory (traitor-tracing schemes); provides formal bounds on disclosure risk across all inferences – handles attribute disclosure well [Dwork 2006]
  - Roughly, differential privacy guarantees that any inference made from the data with a subject included will differ only by epsilon if the subject is removed
  - Analysis is accompanied by formal analysis of estimator efficiency – differential privacy can be achieved in many cases with (asymptotic) efficiency
  - DP is essentially frequentist… with a possible Bayesian interpretation:
    - Prior: n−1 complete records, and a distribution over the nth record
    - The DP criterion implies a bound in Hellinger distance [Fienberg 2009]
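The classic way to satisfy the guarantee above for a single numeric query is the Laplace mechanism [Dwork 2006]: add noise scaled to the query's sensitivity divided by epsilon. The Python sketch below is illustrative only (values and names are hypothetical), using inverse-CDF sampling of the Laplace distribution:

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Return true_value plus Laplace(scale = sensitivity/epsilon) noise.
    Smaller epsilon means stronger privacy and larger noise."""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5                     # uniform on [-0.5, 0.5)
    # Inverse-CDF sample of the Laplace distribution
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_value + noise

# For a counting query ("how many subjects have attribute X?"), adding or
# removing one subject changes the answer by at most 1, so sensitivity = 1.
rng = random.Random(0)
noisy_count = laplace_mechanism(true_value=42, sensitivity=1, epsilon=0.5, rng=rng)
print(noisy_count)
```

Because the noise depends only on sensitivity and epsilon, not on the data, the released answer reveals (in the epsilon-bounded sense) almost nothing about whether any one subject was included.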
• 136. Implementing Differential Privacy
  - Currently, almost all realizations of differential privacy rely on noise applied to queries against numeric tabular databases; it is unknown how to apply it to new forms of data such as networks [Dwork 2008]
  - Static sanitization is possible… BUT limited
    - If the possible number of queries in the analysis family is superpolynomial in the size of the data, no efficient anonymization exists [Dwork et al. 2009]
  - Differential privacy methods need to be developed for the type of analysis being performed
    - Differentially private versions of data-mining queries exist, but development of differentially private versions of common statistical methods is just beginning [Dwork & Smith 2009]
  - Differential privacy may be too strong in some cases
    - Identity disclosure may be the appropriate measure
    - Disclosing attributes that are the explicit topic of research may be appropriate
    - Allowing for greater-than-epsilon gains in information may be appropriate
  - There is only one publicly available software tool that supports these methods (PINQ)
    - Test use only; restricted domain of queries
  - Researchers may need access to data, not just coefficients – e.g., "show me the residuals"!
• 138. Mind the Gaps – Future Research
  - Reconcile Bayesian and frequentist notions of privacy
  - Model privacy from a game-theoretic/social-choice and policy-analysis point of view
  - Reconcile "randomized response"/sensitive survey methods and statistical disclosure concepts
  - Disclosure limitation methods needed for new forms of data
  - Differential privacy methods needed for many more statistical models
  - Bridge the gap between regulatory and statistical views
    - Update regulations/law based on statistical concepts
    - Educate IRBs on statistical disclosure control
    - Integrate permission for data sharing and some disclosure into the consent & design of experiments
  - Bridge the gap between mathematics and implementation
    - Very few software packages are available for disclosure limitation and analysis
    - Interactive disclosure limitation requires not just software, but validated, audited software infrastructure
  - Data sharing infrastructure needed for managing confidentiality effectively:
    - Applying interactive privacy automatically
    - Implementing limited data-use agreements
    - Managing access & logging – a virtual enclave
    - Providing a chokepoint for human auditing of results
    - Providing systems auditing, vulnerability & threat assessment
  - Ideally:
    - Research design information automatically fed into disclosure control parameterization
    - Consent documentation automatically integrated with disclosure policies, enforced by the system
  • 139. What to Do – For Now…
    - (1) Use only information that has already been made public, is entirely innocuous, or has been declared legally de-identified; or
    - (2) Obtain informed consent from research subjects, at the time of data collection, that includes acceptance of the potential risks of disclosure of personally identifiable information; or
    - (3) Pay close attention to the technical requirements imposed by law:
      - Remove all 18 HIPAA factors; or
      - Use suppression and recoding to achieve k-anonymity with l-diversity on data before releasing it or generating detailed figures, maps, or summary tables.
      - Supplement data sharing with data-use agreements.
    - Apply extra caution & consultation with “non-traditional” data – networks, text corpora, etc.
  • 140. Preliminary Recommendations
    - Avoid the complexities of table and model SDL
      - Apply SDL to microdata
      - Tables and models based on de-identified microdata are de-identified
    - Use substantive knowledge to guide disclosure limitation
      - Globally recode using natural categories
      - Use local suppression – check suppressed observations
      - Estimate substantively interesting statistics from original and modified data as a check
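The microdata-first workflow recommended above can be sketched concretely: compute k-anonymity and l-diversity on the raw records, globally recode using a natural category (age bands), and check again. The data, column names, and band width are hypothetical illustrations.

```python
from collections import Counter, defaultdict

def k_anonymity(records, quasi_ids):
    """Smallest equivalence-class size over the quasi-identifier columns."""
    classes = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return min(classes.values())

def l_diversity(records, quasi_ids, sensitive):
    """Fewest distinct sensitive values found in any equivalence class."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_ids)].add(r[sensitive])
    return min(len(vals) for vals in groups.values())

def recode_age(age, width=10):
    """Global recode: replace an exact age with a natural 10-year band."""
    lo = (age // width) * width
    return "%d-%d" % (lo, lo + width - 1)

people = [
    {"age": 31, "zip": "02138", "dx": "flu"},
    {"age": 34, "zip": "02138", "dx": "flu"},
    {"age": 37, "zip": "02138", "dx": "asthma"},
    {"age": 62, "zip": "02139", "dx": "flu"},
    {"age": 65, "zip": "02139", "dx": "diabetes"},
    {"age": 68, "zip": "02139", "dx": "flu"},
]

# Exact ages make every record unique on (age, zip): k = 1.
raw_k = k_anonymity(people, ["age", "zip"])

# After globally recoding age into bands, each class holds 3 records
# (k = 3), and each class contains 2 distinct diagnoses (l = 2).
recoded = [dict(r, age=recode_age(r["age"])) for r in people]
rec_k = k_anonymity(recoded, ["age", "zip"])
rec_l = l_diversity(recoded, ["age", "zip"], "dx")
```

Because tables built from the recoded microdata inherit its k-anonymity, checking the microdata once covers all derived releases.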
  • 141. Key Concepts Review
    - Text de-identification
    - License and access-control restrictions
    - K-anonymity
    - Suppression
    - Attribute homogeneity
    - Risk/utility tradeoff
  • 142. Checklist
    - Will a license be used to limit disclosure?
    - Will an enclave or remote access limit disclosure?
    - Are there natural categories for global recoding?
    - Is there a natural measure of information loss, or a natural weighting for the importance of variables?
    - What level of re-identification risk is acceptable?
    - What is the expected background knowledge of an attacker?
  • 143. Available Software
    - De-identification of text
      - Regular expressions, lookup tables, template matching [www.physionet.org/physiotools/deid]
    - De-identification of IP addresses and system/network logs [www.caida.org/tools/taxonomy/anonymization.xml]
    - Interactive privacy
      - PINQ – experimental interactive differential privacy engine [research.microsoft.com/en-us/projects/PINQ/]
    - Tabular data – Tau-Argus
      - Cell suppression, controlled rounding [neon.vb.cbs.nl/casc]
    - Microdata
      - Mu-Argus
        - Microaggregation, local suppression, global recoding, PRAM [neon.vb.cbs.nl/casc]
      - sdcMicro
        - Microaggregation, local suppression, global recoding, PRAM, rank swapping
        - Heuristic k-anonymity (using local suppression)
        - R package [cran.r-project.org/web/packages/sdcMicro]
      - NISS Data Swapping Toolkit (DSTK)
        - Data swapping in a risk/utility framework
        - Implemented in Java [nisla05.niss.org/software/dstk.html]
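The regular-expression approach used by text de-identification tools can be sketched in a few lines. These patterns are hypothetical and deliberately minimal; a production tool such as the PhysioNet deid package above adds lookup tables, template matching, and human review.

```python
import re

# Hypothetical first-pass patterns: SSNs, e-mail addresses, and dates.
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "[DATE]"),
]

def deidentify(text):
    """First-pass scrub: replace each matched identifier with a category tag."""
    for pattern, tag in PATTERNS:
        text = pattern.sub(tag, text)
    return text

note = "Pt SSN 123-45-6789, admitted 3/10/2011; contact jdoe@example.edu."
scrubbed = deidentify(note)
```

Pattern matching alone leaves false negatives (unusual date formats, names), which is why the recommendations later in this deck still require a human review pass.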
  • 144. Resources
    - FCSM, 2005. “Report on Statistical Disclosure Limitation Methodology”, FCSM Statistical Working Paper Series [www.fcsm.gov/working-papers/spwp22.html]
    - L. Willenborg, T. de Waal, 2001. Elements of Statistical Disclosure Control, Springer.
    - ICPSR Human Subjects Protection Project Citation Database [www.icpsr.umich.edu/HSP/citations]
    - A. Hundepool, et al., 2009. Handbook of Statistical Disclosure Control, ESSNET [neon.vb.cbs.nl/casc/handbook.htm]
    - Privacy in Statistical Databases conference series [unescoprivacychair.urv.cat/psd2010/] (see Springer’s Lecture Notes in Computer Science series for previous proceedings volumes)
    - ASA Committee on Privacy and Confidentiality website [www.amstat.org/committees/pc]
    - National Academies Press, Information Security book series [www.nap.edu/topics.php?topic=320]
    - National Institute of Statistical Sciences, technical reports [www.niss.org/publications/technical-reports]
    - Transactions on Data Privacy, IIIA-CSIC [journal] [www.tdp.cat]
    - Journal of Official Statistics, Statistics Sweden [www.jos.nu]
    - Journal of Privacy and Confidentiality, Carnegie Mellon [jpc.cylab.cmu.edu]
    - IEEE Security and Privacy [www.computer.org/security]
    - Census Statistical Disclosure Control checklist [www.census.gov/srd/sdc]
    - B. C. M. Fung, K. Wang, R. Chen, P. S. Yu, 2010. “Privacy-Preserving Data Publishing: A Survey of Recent Developments”, ACM CSUR 42(4)
  • 145. Additional Resources
    - Final review
    - Additional training resources
    - Harvard consulting
    - Handout for Harvard staff
    - Harvard IQSS research support
    - Additional references
  • 146. Final Review: 7 Steps
    - Identify potentially sensitive information in planning
      - Identify legal requirements, institutional requirements, data-use agreements
      - Consider obtaining a certificate of confidentiality
      - Plan for IRB review
    - Reduce the sensitivity of collected data in design
    - Separate sensitive information in collection
    - Encrypt sensitive information in transit
    - Desensitize information in processing
      - Remove names and other direct identifiers
      - Suppress, aggregate, or perturb indirect identifiers
    - Protect sensitive information in systems
      - Use systems that are controlled, securely configured, and audited
      - Ensure people are authenticated, authorized, licensed
    - Review sensitive information before dissemination
      - Review disclosure risk
      - Apply non-statistical disclosure limitation
      - Apply statistical disclosure limitation
      - Review past releases and publicly available data
      - Check for changes in the law
      - Require a use agreement
  • 147. Preliminary Recommendation: Choose the Lesser of Three Evils
    - (1) Use only information that has already been made public, is entirely innocuous, or has been declared legally de-identified; or
    - (2) Obtain informed consent from research subjects, at the time of data collection, that includes acceptance of the potential risks of disclosure of personally identifiable information; or
    - (3) Pay close attention to the technical requirements imposed by law:
      - Use suppression and recoding to achieve k-anonymity with l-diversity on data before releasing it or generating detailed figures, maps, or summary tables.
      - Supplement data sharing with data-use agreements.
  • 148. Preliminary Recommendations: Planning and Methods
    - Review the research design for sensitive identified information
      - Information which would cause harm if disclosed
      - HIPAA identifiers
      - Other indirectly identifying characteristics
    - Design research methods to reduce sensitivity
      - Eliminate sensitive/identifying information not needed for the research questions
      - Consider randomized response, list-experiment designs
    - Design the human-subjects plan with information management in mind
      - Recognize the benefits of data sharing
      - Ask for consent to share data appropriately
      - Apply for a certificate of confidentiality where data is very sensitive
    - Separate sensitive information
      - Separate sensitive/identifying information at collection, if feasible
      - Link separate files using a cryptographic hash of identifiers plus a secret key, or a cryptographic-strength random number
    - Incorporate extra protections for on-line data collection
      - Use vendor agreements that specify anonymity and confidentiality protections
      - Do not collect IP addresses if possible; regularly anonymize and purge otherwise
      - Restrict display of very sensitive information in user interfaces
      - Limit on-line collection of very sensitive information
        - Harvard prohibits display/collection of HRCI online
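The keyed-hash linking technique recommended above can be sketched with the standard library's HMAC support. The key value and function name here are hypothetical placeholders.

```python
import hmac
import hashlib

def link_id(identifier, secret_key):
    """Derive a linking ID from a direct identifier using a keyed hash.

    A plain (unkeyed) hash of an SSN or name can be reversed by brute
    force over the small identifier space; keying the hash with a
    secret means only someone holding the key can re-derive or test
    the linkage between the separated files.
    """
    return hmac.new(secret_key, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()

key = b"example-secret-key"  # hypothetical; store separately from both files

# Store this opaque ID in both the identifier file and the data file,
# in place of the SSN itself.
link = link_id("123-45-6789", key)
```

The same identifier always yields the same linking ID under the same key, so records can be re-joined later; destroying the key permanently severs the link.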
  • 149. Preliminary Recommendations: Information Security
    - Use FISMA as a reference for baseline controls
    - Document:
      - Protection goals
      - Threat models
      - Types of controls
    - Delegate implementation to IT professionals
    - Refer to standards
      - Gold standards: FISMA / ISO practices, SAS-70 auditing, CISSP certification of key staff
    - Strongly recommended controls
      - Use whole-disk/media encryption to protect data at rest
      - Use end-to-end encryption to protect data in motion
      - Use core information hygiene to protect systems
        - Use a virus checker, and keep it updated
        - Use a host-based firewall
        - Update your software regularly
        - Install all operating-system and application security updates
        - Don't share accounts or passwords
        - Don't use administrative accounts all the time
        - Don't run programs from untrusted sources
        - Don't give out your password to anyone
      - Scan for HRCI regularly
      - Be thorough in disposal of information
        - Use secure file-erase tools when disposing of files
        - Use secure disk-erase tools when disposing of or repurposing disks
  • 150. Preliminary Recommendations: Very Sensitive / Extremely Sensitive Information
    - Protect very sensitive data on “target systems”
      - Extra physical, logical, administrative access control
        - Record keeping
        - Limitations
        - Lockouts
      - Extra monitoring, auditing
      - Extra procedural controls – specific, renewed approvals
      - Limits on network connectivity
        - Private network, not directly connected to the public network
      - Regular scans
        - Vulnerability scans
        - Scans for PII
    - Extremely sensitive
      - Increased access control, procedural limitations
      - Not physically/logically connected (even via wireless) to the public network, directly or indirectly
  • 151. Preliminary Recommendations: Non-Tabular Data Disclosure
    - Use licensing agreements – even if they are “clickthroughs”
      Reason: They provide additional protection without limiting legitimate research.
    - For qualitative text information
      - Use software for the first pass
      - Supplement with a localized dictionary of place names, common last names, etc.
      - Have a human review the results
      Reason: Software is more effective than a single human coder. However, the error rate is high enough that a human is still necessary.
    - For emerging forms of data (networks, etc.)
      - Use remote access and user authentication, if feasible
        Reason: Greater auditability compensates for less well understood statistical de-identification.
      - Pay careful attention to the structure of the data.
        Reason: Identifying information may be present in the structure of the information (word ordering, prose style, network topology, sparse-matrix missingness) rather than in the primary attribute information.
  • 152. Preliminary Recommendations: Tabular Data Disclosure
    - Use licensing agreements – even if they are “clickthroughs”
      Reason: They provide additional protection without limiting legitimate research.
    - Use HIPAA default variable suppression and recoding if, in the PI’s best judgment, this does not seriously degrade the research value of the data.
      Reason: Clearest legal standard.
    - For quantitative tabular data
      - Use generalization, local suppression, variable suppression.
        Reason: These are effective, and commonly used in HIPAA and in statistical disclosure control.
      - Use k-anonymity.
        Reason: k-anonymity appears to be current good practice; it provably eliminates literal individual re-identification, and works if the attacker has knowledge of sample participation.
      - Choose k in [3-5].
        Reason: Best practice at federal agencies for table suppression requires table cells to have 3-5 contributors. Tables derived from k-anonymous microdata will also fulfill this.
      - Choose quasi-identifiers based on plausible threat models.
        Reason: Too broad a definition of quasi-identifiers renders de-identification impossible. Background knowledge is pivotal, and a threat model is the only source for this.
      - Use microdata anonymization, rather than tabular/model anonymization.
        Reason: (1) Table/model methods become computationally intractable. (2) Analysis of model anonymization is immature. (3) Anonymizing microdata implies that derived tables and models are also anonymized. (4) It is administratively harder to track and evaluate the entire history of previously released models/tables than the history of previously released versions of a single microdata set.
      - Use domain knowledge in choosing recodings and in testing the resulting anonymization for information loss.
        Reason: MSE, etc., is probably not a good proxy for the research value of the data. Use standard measures, but also consider planned uses and simulate possible analyses.
      - Inspect data for attribute diversity; use the PI’s judgment regarding suppression.
        Reason: (1) Some attribute disclosures are not avoidable if the research is to be conducted at all; some would occur even if the subject had not participated. (2) Disclosures that would not have resulted if the subject had opted out, and that are not substantially based on representative causal/predictive relationships revealed by the research, should be eliminated. (3) All current diversity measures are likely to severely reduce the utility of the anonymized data if applied routinely.
  • 153. On-line Training
    - NIH Protecting Human Subjects Research Participants
      - Provides minimal testing and certification
      - Required for human-subjects research at NIH [phrp.nihtraining.com]
    - NIH Security and Privacy Awareness
      - Includes basics of information security, review of privacy laws [irtsectraining.nih.gov]
    - Harvard Staff Training
      - Provides compact training for staff members in the handling of confidential information [www.security.harvard.edu/resources/training]
    - Collaborative Institutional Training Initiative (CITI)
      - Provides testing, certification, continuing-education credits
      - Required for human-subjects research at Harvard
      - Includes basic training on confidentiality and informed consent [www.citiprogram.org]
  • 156. Harvard IQSS Research Support
    - IQSS supports your research design:
      - Research design, including: design of surveys, selection of statistical methods.
    - IQSS supports your research process:
      - Primary and secondary data collection, including: the collection of geospatial and survey data.
      - Data management, including: storage, cataloging, permanent archiving, and distribution.
      - Data analysis, including: statistical consulting, GIS consulting, high-performance research computing.
    - IQSS supports your projects:
      - Dissemination: web-site hosting, scholars website
      - Research computing infrastructure and hosting
      - Conference/seminar/event planning and facilities
    - Strengthen your proposal through:
      - Consultation on research design, statistical issues, GIS, research computing
      - Including relevant resources in “facilities”, etc.
      - Obtaining IQSS letters of support
  • 157. Additional References
    - A. Acquisti, L. John, G. Loewenstein, 2009. “What is Privacy Worth?”, 21st Workshop on Information Systems and Economics.
    - A. Blum, K. Ligett, A. Roth, 2008. “A Learning Theory Approach to Non-Interactive Database Privacy”, STOC ’08.
    - L. Backstrom, C. Dwork, J. Kleinberg, 2007. “Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography”, Proc. 16th Intl. World Wide Web Conference.
    - J. Brickell and V. Shmatikov, 2008. “The Cost of Privacy: Destruction of Data-Mining Utility in Anonymized Data Publishing”, KDD 2008.
    - P. Buneman, A. Chapman and J. Cheney, 2006. “Provenance Management in Curated Databases”, in Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (Chicago, IL), 539-550. [portal.acm.org/citation.cfm?doid=1142473.1142534]
    - F. Calabrese, M. Colonna, P. Lovisolo, D. Parata, C. Ratti, 2007. “Real-Time Urban Monitoring Using Cellular Phones: a Case-Study in Rome”, Working Paper #1, SENSEable City Laboratory, MIT, Boston [senseable.mit.edu/papers/]; also see the Real Time Rome project [senseable.mit.edu/realtimerome/].
    - D. Campbell, 2009, reported in D. Goodin, 2009. “Amazon's EC2 brings new might to password cracking”, The Register, Nov 2, 2009 [www.theregister.co.uk/2009/11/02/amazon_cloud_password_cracking/]
    - I. Dinur and K. Nissim, 2003. “Revealing Information While Preserving Privacy”, Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 202-210.
    - C. Dwork, M. Naor, O. Reingold, G. Rothblum, S. Vadhan, 2009. “When and How Can Data Be Efficiently Released with Privacy?”, STOC 2009.
    - C. Dwork, A. Smith, 2009. “Differential Privacy for Statistics: What We Know and What We Want to Learn”, Journal of Privacy and Confidentiality 1(2): 135-54.
    - C. Dwork, 2008. “Differential Privacy: A Survey of Results”, TAMC 2008, LNCS 4978, Springer Verlag, 1-19.
    - C. Dwork, 2006. “Differential Privacy”, Proc. ICALP.
    - C. Dwork, F. McSherry, and K. Talwar, 2007. “The Price of Privacy and the Limits of LP Decoding”, Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, 85-94.
    - C. Dwork, F. McSherry, K. Nissim, and A. Smith, 2006. “Calibrating Noise to Sensitivity in Private Data Analysis”, Proceedings of the 3rd IACR Theory of Cryptography Conference.
    - A. Desrosières, 1998. The Politics of Large Numbers, Harvard U. Press.
    - S. E. Fienberg, M. E. Martin, and M. L. Straf (eds.), 1985. Sharing Research Data, Washington, D.C.: National Academies Press.
    - S. Fienberg, 2010. “Towards a Bayesian Characterization of Privacy Protection & the Risk-Utility Tradeoff”, IPAM Data 2010.
    - B. C. M. Fung, K. Wang, R. Chen, P. S. Yu, 2010. “Privacy-Preserving Data Publishing: A Survey of Recent Developments”, ACM CSUR 42(4).
    - A. G. Greenwald, D. E. McGhee, J. L. K. Schwartz, 1998. “Measuring Individual Differences in Implicit Cognition: The Implicit Association Test”, Journal of Personality and Social Psychology 74(6): 1464-1480.
    - C. Herley, 2009. “So Long, and No Thanks for the Externalities: The Rational Rejection of Security Advice by Users”, NSPW ’09.
    - A. F. Karr, 2009. “Statistical Analysis of Distributed Databases”, Journal of Privacy and Confidentiality 1(2).
  • 158. Additional References
    - International Council for Science (ICSU), 2004. ICSU Report of the CSPR Assessment Panel on Scientific Data and Information. Report.
    - J. Klump, et al., 2006. “Data Publication in the Open Access Initiative”, Data Science Journal 5: 79-83.
    - E. A. Kolek, D. Saunders, 2008. “Online Disclosure: An Empirical Examination of Undergraduate Facebook Profiles”, NASPA Journal 45(1): 1-25.
    - N. Li, T. Li, and S. Venkatasubramanian, 2007. “t-Closeness: Privacy Beyond k-Anonymity and l-Diversity”, Proceedings of IEEE ICDE 2007.
    - A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam, 2007. “l-Diversity: Privacy Beyond k-Anonymity”, ACM Transactions on Knowledge Discovery from Data 1(1): 1-52.
    - A. Meyerson, R. Williams, 2004. “On the Complexity of Optimal K-Anonymity”, ACM Symposium on Principles of Database Systems.
    - Nature 461, 145 (10 September 2009). doi:10.1038/461145a
    - A. Narayanan and V. Shmatikov, 2008. “Robust De-anonymization of Large Sparse Datasets”, Proc. of the 29th IEEE Symposium on Security and Privacy.
    - I. Neamatullah, et al., 2008. “Automated De-identification of Free-Text Medical Records”, BMC Medical Informatics and Decision Making 8:32.
    - J. Novak, P. Raghavan, A. Tomkins, 2004. “Anti-aliasing on the Web”, Proceedings of the 13th International Conference on World Wide Web.
    - National Science Board (NSB), 2005. Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century, NSF (NSB-05-40).
    - A. Acquisti, R. Gross, 2009. “Predicting Social Security Numbers from Public Data”, PNAS 106(27): 10975-10980.
    - L. Sweeney, 2002. “k-Anonymity: A Model for Protecting Privacy”, International Journal on Uncertainty, Fuzziness, and Knowledge-Based Systems 10(5): 557-570.
    - T. M. Truta, B. Vinay, 2006. “Privacy Protection: p-Sensitive k-Anonymity Property”, International Workshop on Privacy Data Management (PDM 2006), in conjunction with the 22nd International Conference on Data Engineering (ICDE), Atlanta, Georgia.
    - O. Uzuner, et al., 2007. “Evaluating the State-of-the-Art in Automatic De-identification”, Journal of the American Medical Informatics Association 14(5): 550.
    - W. Wagner & R. Steinzor, 2006. Rescuing Science from Politics, Cambridge U. Press.
    - S. Warner, 1965. “Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias”, Journal of the American Statistical Association 60(309): 63-9.
    - D. L. Zimmerman, C. Pavlik, 2008. “Quantifying the Effects of Mask Metadata, Disclosure and Multiple Releases on the Confidentiality of Geographically Masked Health Data”, Geographical Analysis 40: 52-76.
  • 160. Creative Commons License
    This work, Managing Confidential Information in Research, by Micah Altman (http://redistricting.info), is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

Editor's Notes

  • #44: Institutions may give “limited assurance” of Common Rule compliance – just for funded research. Most give “general assurance”.