Capsule computing: safe open
science
Beth Plale
Professor, Indiana University Bloomington
On loan to National Science Foundation
*Opinions expressed herein are those of the author alone and do not represent
the views of the National Science Foundation
Binghamton University December 04, 2018
Data: a
foundation
of science
plale@Indiana.edu
Weather simulation
• Data has value beyond the use for which
it is originally collected
• Open science is a broad-based global
effort to make data emerging from
research available for wider use
• Open science is thus acknowledgement
of inherent value of scientific data
independent of published scientific or
scholarly outcome
Open Science encourages from researcher:
• More thoughtful
research
processes;
• Thought to uses of data and code beyond original intent
• Attention to
reproducibility /
replicability of
work
The Upturned Microscope by Nik Papageorgiou is licensed under CC BY NC ND
Tension of Open Science
Much data resulting from externally funded
research can be made open, but some data
simply cannot, and never will, be completely and
freely open
Data should be made open - open access, open
use, open license and perhaps made open by
default, but there are important cases where
controls on the data must remain
More options needed for
restricted data reuse on
spectrum between
completely open (Open
Access) and completely
hidden
Possible way forward suggested by
principle: Open as possible, closed as
necessary*
Principle articulated in "Guidelines on FAIR Data Management in
Horizon 2020", EU Horizon 2020 programme
Forms of data availability on spectrum
between pure open access and fully
hidden
Capsule framework
Controlled compute environment, capsule
framework, is viable approach to accessing and
sharing restricted data that
satisfies sharing while protecting data from
unintended use or use prohibited by law
Capsule framework
Implemented through combination of
policy, processes, and software services,
to protect the data and
make the software infrastructure as easy to use
as possible.
Capsule technical design
Trust Model
Threat model: high-level articulation of tradeoffs made during
design of the system. Not an implementation guide.
Policy decisions influenced by situation of use:
restrictions on the data;
assumptions of use; and
limits of software services.
Our major tradeoff: how much trust must you place with
the user versus a locked down (and relatively unusable)
system
The motivating need for capsule computing is the
HathiTrust (HT) shared digital repository
HT Mission and Purpose
To contribute to research, scholarship, and the
common good by collaboratively collecting, organizing,
preserving, communicating, and sharing the record of
human knowledge.
• A trusted digital preservation service enabling the broadest
possible access worldwide.
• An organization with over 100 research libraries partnering to
develop its programs.
• A range of transformative programs enabled by working at a
very large scale.
Current Major Cooperative Initiatives
• Distributed manual copyright reviews.
• Establishing a distributed shared print
monograph archive.
• Expanding and enhancing access to US Federal
Government Documents.
• Expanding services of the HathiTrust Research
Center.
Scale of the HathiTrust Collection
• 16,639,076 total volumes
– 8,075,459 book titles
– 446,580 serial titles
– 5,823,676,600 pages
– 746 terabytes
• 6,256,362 open volumes (~38% of total)
Collection includes (mostly) published materials
in bound form, digitized from research and
academic library collections.
Example use: how influenced is a writer by
time spent at Iowa Writers Workshop?
• Assemble corpus of works by authors affiliated with
renowned Iowa Writers Workshop
• Perform analysis to determine whether a Workshop
style exists and what the characteristics of such a
style might be.
• Collect metrics such as
vocabulary size,
sentence length, or
even frequency of
male and female
pronouns
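The metrics listed above can be sketched in a few lines of Python. This is a minimal illustration and not the HTRC toolkit: the function name `style_metrics`, the pronoun lists, and the regex-based tokenization and sentence splitting are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical pronoun lists chosen for this sketch only.
MALE_PRONOUNS = {"he", "him", "his", "himself"}
FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}


def style_metrics(text: str) -> dict:
    """Compute simple, aggregate style metrics for one text."""
    # Crude sentence split on runs of ., ! or ? (an assumption, not a real parser).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased word tokens made of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {
        "vocabulary_size": len(counts),
        "mean_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "male_pronouns": sum(counts[w] for w in MALE_PRONOUNS),
        "female_pronouns": sum(counts[w] for w in FEMALE_PRONOUNS),
    }


sample = "She wrote her first novel in Iowa. He read it twice."
metrics = style_metrics(sample)
print(metrics)
```

The function returns only aggregate counts and never the text itself, which is the kind of output a non-consumptive analysis is expected to produce.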
Controlled Compute Environment
(or remote secure enclave) provides
researchers with remote analytical
access to a data collection that has
restrictions (legal, privacy) on use,
and because of its size, requires
compute to come to the data and
not vice versa
Beth Plale, Inna Kouper, Samitha Liayanage, Yu Ma, Robert McDonald, and John Walsh,
Capsule Computing: Safe Open Science, under review, 2019.
Capsule Framework, as a controlled
compute environment, is
implemented through policy,
processes, and software services
working together to protect the data
while making the software
infrastructure as easy to use as
possible.
DataPASS
Policies in place for HathiTrust
Human facing
• Non-consumptive Use Research Policy
• Terms of Use
Infrastructure facing
• HathiTrust Rights Database
• Trust (threat) model
Export review
• Human review of results exported from
Capsule
Human facing
policy
Infrastructure facing
“service agreement”
mutually
reinforcing
Overriding policy is that of Non-
consumptive Use Research Policy
Research in which computational analysis is
performed on one or more volumes but not
research in which researcher reads or displays
substantial portions of an in-copyright or rights-
restricted volume to understand expressive content
presented within that volume.
Examples: text extraction, automated translation, image
analysis, file manipulation, OCR correction, and indexing and
search.
https://www.hathitrust.org/htrc_ncup
Terms of Use
Agreement between HTRC/HT and individual intending to
use HTRC Data Capsule service. Top 4 terms:
1. Read and comply with Non-Consumptive Use Research
Policy.
2. Use their Capsule for non-consumptive research
purposes only as defined in Section 1 of the policy.
3. Prior to first use, User submits form indicating
intended use and expected forms of outputs.
4. By using HTRC Data Capsule service, User
acknowledges that information about their activities
while in Capsule may be reviewed in manner consistent
with HathiTrust privacy policy.
https://www.hathitrust.org/htrc_dc_tou
Rights Database
● Database for storing and tracking rights
information for each digitized volume in HathiTrust
● At core of system is algorithm that considers a)
copyright status and/or explicit access controls
associated with the volume, b) volume's digitizing
agent (e.g., Google or the University of Chicago),
and c) identity of user (if known) in order to
determine access rights.
● How used: demo capsule uses only public domain
content.
https://www.hathitrust.org/rights_database
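A decision that weighs the three inputs named on this slide might look like the sketch below. Only the inputs (copyright status, digitizing agent, user identity) and the demo capsule's public-domain restriction come from the slide. The class names, the status strings, the `agent_permits` table, and the specific rules are hypothetical and do not describe the actual HathiTrust algorithm.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Volume:
    volume_id: str
    copyright_status: str   # hypothetical values: "public_domain", "in_copyright"
    digitizing_agent: str   # e.g. "google"


@dataclass(frozen=True)
class User:
    user_id: str
    signed_terms_of_use: bool


def may_access(volume: Volume, user: Optional[User], demo_capsule: bool,
               agent_permits: Dict[str, bool]) -> bool:
    """Return True if a capsule may read the volume (illustrative rules only)."""
    # (a) Copyright status: public domain content is readable by any capsule.
    if volume.copyright_status == "public_domain":
        return True
    # The demo capsule uses only public domain content (stated on the slide).
    if demo_capsule:
        return False
    # (b) Digitizing agent: assumed per-agent permission table, deny if unknown.
    if not agent_permits.get(volume.digitizing_agent, False):
        return False
    # (c) User identity: assumed to require a known user who accepted the terms.
    return user is not None and user.signed_terms_of_use


pd_vol = Volume("vol-1", "public_domain", "google")
ic_vol = Volume("vol-2", "in_copyright", "google")
alice = User("alice", signed_terms_of_use=True)
permits = {"google": True}
print(may_access(ic_vol, alice, demo_capsule=False, agent_permits=permits))
```

The sketch denies by default: an unknown agent or an unidentified user yields no access to restricted content.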
Threat model
● Threat model: structured representation of all
information that affects security of an
application.
● Two most relevant clauses:
○ Analysts are themselves considered to act in good
faith, but this does not preclude possibility of them
unwittingly allowing system to be compromised.
■ Reasonable assumption and motivates why analysts are
required to sign use agreement.
Capsule Framework
k*N user VMs running in back end layer; managed by a hypervisor. All
software implementing Capsule framework is open source.
Mode one: Maintenance mode
Access to
Internet
permitted;
Channel to
restricted
data closed
HT DL
Mode two: Secure mode
Access to
Internet
denied;
Channel to
restricted
data open
HT DL
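The two modes above are mutually exclusive: at no point does a capsule have both Internet access and an open channel to restricted data. A minimal sketch of that invariant follows. The `Capsule` class and its property names are hypothetical, and in the real system the policy is enforced by services outside the analyst's VM, not by code inside it.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    MAINTENANCE = "maintenance"
    SECURE = "secure"


@dataclass
class Capsule:
    # A new capsule is assumed to start in maintenance mode (an assumption).
    mode: Mode = Mode.MAINTENANCE

    @property
    def internet_allowed(self) -> bool:
        # Maintenance mode: access to Internet permitted.
        return self.mode is Mode.MAINTENANCE

    @property
    def data_channel_open(self) -> bool:
        # Secure mode: channel to restricted data open.
        return self.mode is Mode.SECURE

    def switch(self, mode: Mode) -> None:
        self.mode = mode


capsule = Capsule()
capsule.switch(Mode.SECURE)
print(capsule.internet_allowed, capsule.data_channel_open)
```

Deriving both permissions from a single `mode` field makes the forbidden state (Internet on, data channel open) unrepresentable.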
Threat Model implementation in 7 easy
steps
● The threat model for the Capsule framework
implementation in HathiTrust is built on the
assumed existence of a Trusted Computing
Base (TCB), in which the totality of
security mechanisms within a secure system
resides
● Threat model implementation (7 statements)
Threat Model implementation
1. An analyst accesses restricted data through
remotely accessed VMs that read data from a
network-accessed data service.
2. The VM that is given to the analyst for use is not
part of the TCB. The remaining support is within the
TCB: the Virtual Machine Manager (VMM), the host
that the VMM runs on, and the system services that
enforce network and data access policies for the
virtual machines. Data storage is included within the
TCB.
Threat Model implementation
3. Users may inadvertently install malware; there may be
other remotely initiated attacks on the VM. These attacks
could potentially compromise the entire operating system
and install a rootkit, both of which are undetectable to the
end user.
Analysts are themselves considered to act in good faith,
but this does not preclude the possibility of them
unwittingly allowing the system to be compromised.
Analysts are required to sign a use agreement before using
the system. Results are reviewed before being made available to
the user for download.
Threat Model implementation
4. Users have VNC access to their virtual machines in non-secure
mode to give them a desktop interface to the machine. They also
have SSH access in non-secure mode so that they can upload data
sets and install software more easily. However, VNC access
represents a channel for potential data leak; through use of a use
agreement and profile, HT is comfortable that the analyst is acting
in good faith. An analyst must refrain from sharing their virtual
machine.
5. A potential threat is that of covert channels between virtual
machines that run on the same host machine. A solution requires
using two physically separated systems, one that only runs VMs in
secure mode and another that runs VMs only in maintenance
mode. HT currently performs routine host port scanning.
Threat Model implementation
6. Analysts have complete freedom to access the Internet
from their Capsule to upload/download material while in a
non-secure mode. Once switching to a secure mode, the
analyst has direct access to the restricted materials.
While in secure mode, Internet access is prohibited, as is
copying from the Capsule to the desktop.
7. The analyst’s state is retained in a Capsule across
sessions of work, but when an analyst completes her work
and wants to pull data out of the Capsule, she must store
results off to a special drive. The contents of this drive are
manually reviewed before results are made available to
the analyst.
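Statement 7 describes a release gate: results written to the special drive stay unavailable until a human reviewer approves them. The sketch below models that flow under stated assumptions. The names `ExportDrive`, `ExportRequest`, and `ReviewStatus` are hypothetical, and the review decision is a boolean supplied by a person, not an automated check.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ReviewStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ExportRequest:
    analyst: str
    filename: str
    status: ReviewStatus = ReviewStatus.PENDING


@dataclass
class ExportDrive:
    """Stand-in for the special drive whose contents are manually reviewed."""
    requests: List[ExportRequest] = field(default_factory=list)

    def submit(self, analyst: str, filename: str) -> ExportRequest:
        # The analyst stores a result file; it starts out pending review.
        request = ExportRequest(analyst, filename)
        self.requests.append(request)
        return request

    def review(self, request: ExportRequest, approve: bool) -> None:
        # Records the human reviewer's decision.
        request.status = ReviewStatus.APPROVED if approve else ReviewStatus.REJECTED

    def downloadable(self, analyst: str) -> List[str]:
        # Only approved results are released to the analyst.
        return [r.filename for r in self.requests
                if r.analyst == analyst and r.status is ReviewStatus.APPROVED]


drive = ExportDrive()
counts = drive.submit("alice", "pronoun_counts.csv")
pages = drive.submit("alice", "page_scans.zip")
drive.review(counts, approve=True)
drive.review(pages, approve=False)
print(drive.downloadable("alice"))
```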
Policy / Infrastructure tradeoff
Human facing policies
• Non-consumptive Use Research Policy
• Terms of Use
Infrastructure facing
• HathiTrust Rights Database
• Trust (threat) model
Export review
• Human review of results exported from Capsule
Tradeoff: how much trust must you place with the
user versus a locked down (and relatively unusable)
system?
Human facing
policy
Infrastructure facing
“service agreement”
mutually
reinforcing
Takeaways
• HT Capsule implementation heavily driven by
“non-consumptive research” and wide-open
research modes to the HathiTrust collection
– By Authors Guild et al. v HathiTrust et al., 11 Civ
6351 (S.D.N.Y Sep 12 2011), research must be
non-consumptive. That is, “no eyeballs on texts”
– Massive collection where no single set of tools
satisfies needs. Researcher needs freedom to
install own tools
Takeaways
• Recently released toolkit comes default in
every VM. Helps connect to restricted data
and import other data resources (user’s
workset) into Capsule. Proven to reduce
programming burden.
• Running on physical servers at Indiana
University Bloomington.
Resources
• Non-Consumptive Use Research Policy
https://www.hathitrust.org/htrc_ncup
• HathiTrust Rights Database
https://www.hathitrust.org/rights_database
• Trust (threat) model (somewhat outdated)
– Plale, Beth; Prakash, Atul; McDonald, Robert (2015). The
Data Capsule for Non-Consumptive Research: Final Report.
Available from http://hdl.handle.net/2022/19277
• Terms of Use https://www.hathitrust.org/htrc_dc_tou
• HTRC Data Capsule accessible at
https://analytics.hathitrust.org
Please feel free to reach out to me for
more information
Beth Plale
plale@indiana.edu
