HathiTrust Research Center
  Architecture Overview
      Robert H. McDonald | @mcdonald
Executive Committee-HathiTrust Research Center (HTRC)
         Deputy Director-Data to Insight Center
           Associate Dean-University Libraries
                Indiana University
Follow Along




http://slidesha.re/U4z1gW
HTRC Architecture Group
Indiana University
• Beth Plale, Lead
• Yiming Sun
• Stacy Kowalczyk
• Aaron Todd
• Jiaan Zeng
• Guangchen Ruan
• Zong Peng
• Swati Nagde
University of Illinois
• J. Stephen Downie
• Loretta Auvil
• Boris Capitanu
• Kirk Hess
• Harriett Green
Presentation Overview
•   Considerations for Current Architecture
•   Architecture - Use Case Methodology
•   Technical Overview
•   UnCamp Sessions for Further Review
Main Case – Data Near Computation
[Diagram] Components shown:
• HT Volume Store (UM)
• HT Volume Store (IUPUI)
• HTRC Volume Store and Index (IUB)
• FutureGrid Computation Cloud
• XSEDE Compute Allocation
• UIUC Compute Allocation
• IU Compute Allocation
Non-Consumptive Research Paradigm
• No action or set of actions on the part of users,
  either acting alone or in cooperation with
  other users over the duration of one or multiple
  sessions, can result in sufficient information being
  gathered from a collection of copyrighted works
  to reassemble pages from the collection.
• The definition disallows collusion between users
  and accumulation of material over time.
  It differentiates a human researcher from a proxy,
  which is not a user. Users are human beings.
Amicus Brief and NCR
• Jockers, Sag, Schultz –
• http://tinyurl.com/cy34hhr
Use Cases for Phase 1 Architecture
• Use Case #1 - A previously registered,
  user-submitted algorithm is retrieved and run,
  producing a result set
• Use Case #2 - HTRC applications/portal access
  (SEASR)
• Use Case #3 – Blacklight Lucene/Solr faceted
  access
• Use Case #4 - Direct programmatic access
  through Secure Data API
HTRC Current Infrastructure
• Servers
  – 14 production-level quad-core servers
     • 16 – 32GB of memory
     • 250 – 500GB of local disk each
  – 6-node Cassandra cluster for volume store
  – Ingest service and secure Data API access point
• Storage (IU University Infrastructure)
  – 13TB of 15,000 RPM SAS disk storage
  – Increase up to 17TB by end of 2012
  – 500TB available from late year 2 to year 3
Key Components of Architecture
•   Portal Access
•   Blacklight Access
•   Agent
•   Registry
•   Secured Data API Access
HTRC Architecture
[Architecture diagram] Components shown:
• Access paths: Portal Access, Blacklight, Direct programmatic access (by programs running on HTRC machines)
• Agent: Application submission, Collection building
• Security (OAuth2)
• Registry (WSO2): Algorithms, Meandre Workflows, Result Sets, Collections
• Data API access interface
• Solr Proxy
• Audit
• Storage resources: Cassandra cluster volume store, Solr index
• Compute resources
HTRC Architecture: Portal Access
[Architecture diagram repeated, with the Portal Access layer called out]
• HTRC Portal
• Blacklight
• App SEAR
• App Blacklight
HTRC Architecture: Agent
[Architecture diagram repeated, with the HTRC Agent called out]
• HTRC Agent
• Application submission
• Collection building
HTRC Architecture: HTRC Registry
[Architecture diagram repeated, with the HTRC Registry called out]
• Registry (WSO2)
• Algorithms
• Meandre Workflows
• Result Sets
• Collections
HTRC Architecture: Secure Data API
[Architecture diagram repeated, with the Secure Data API called out]
• RESTful Web Service
  – Language agnostic
  – Clients don’t have to deal with Cassandra
• Simple OAuth2 authentication
• HTTP over SSL
• Audits client access
• Protected behind firewall, accessible only to authorized IPs
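To make the Secure Data API properties concrete, here is a minimal client-side sketch in Python. It only builds an HTTPS request carrying an OAuth2 bearer token. The host name, the `/volumes` path, and the `volumeIDs` parameter are hypothetical placeholders chosen for illustration, not the documented HTRC interface.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_volume_request(base_url: str, volume_ids: list[str], access_token: str) -> Request:
    """Build (but do not send) an authenticated GET request for a set of volumes."""
    # The slide specifies HTTP over SSL, so refuse plain-HTTP endpoints.
    if not base_url.startswith("https://"):
        raise ValueError("Data API requests must use HTTPS")
    # Hypothetical resource path and parameter name, for illustration only.
    query = urlencode({"volumeIDs": "|".join(volume_ids)})
    request = Request(f"{base_url}/volumes?{query}")
    # OAuth2 bearer token, obtained beforehand from the authorization server.
    request.add_header("Authorization", f"Bearer {access_token}")
    return request
```

Passing the returned object to `urllib.request.urlopen` would send it. Because the client only speaks HTTP, it never needs a Cassandra driver, which is the point of the "language agnostic" bullet.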
NoSQL Methodology
• Currently HT content is stored using the Pairtree
  file system convention (CDL)
• Moving these files into a NoSQL store like
  Cassandra enabled HTRC to aggregate them into
  larger sets of files for retrieval
• Use of Cassandra enabled HTRC to share content
  over a commodity-based Cassandra cluster of
  virtual machines
• Originally investigated use of MongoDB,
  CouchDB, HBase, and Cassandra
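For readers unfamiliar with Pairtree, the sketch below shows how an identifier maps to a nested directory path, which is why content ends up spread across many small files. It is a simplified illustration: it applies only the single-character substitutions from the Pairtree convention and omits the hex-encoding step for other special characters. The function name is hypothetical.

```python
def pairtree_path(identifier: str) -> str:
    """Map an identifier to a Pairtree-style path (simplified sketch)."""
    # Single-character substitutions: "/" -> "=", ":" -> "+", "." -> ","
    cleaned = identifier.translate(str.maketrans({"/": "=", ":": "+", ".": ","}))
    # Split the cleaned identifier into two-character directory names;
    # a trailing odd character becomes a one-character directory.
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join(pairs)


print(pairtree_path("ark:/13030/xt12t3"))  # ar/k+/=1/30/30/=x/t1/2t/3
```

Every object sits at the end of its own directory chain, so reading a large set of volumes means walking many directories. Aggregating the files into a NoSQL store avoids that traversal.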
HTRC Solr index
• The Solr Data API 0.1 test version
  – Preserves all query syntax of the original Solr
  – Prevents users from modifying the index
  – Hides the host machine and port number that HTRC
    Solr is actually running on
  – Creates an audit log of requests
  – Provides a filtered term vector for words starting
    with a user-specified letter
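The behaviours listed above can be sketched as a small read-only proxy routine. This is an illustrative sketch under stated assumptions, not the HTRC implementation: the backend address, the allow-list containing only Solr's `select` handler, and both function names are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical internal address; clients never see the real host or port.
HIDDEN_SOLR_BASE = "http://127.0.0.1:8983/solr"
# Assumption: only the read-only "select" handler is exposed.
READ_ONLY_HANDLERS = {"select"}


def proxy_target(public_url: str, user: str, audit_log: list) -> str:
    """Return the internal URL for a request, or raise if it could modify the index."""
    parts = urlsplit(public_url)
    handler = parts.path.rstrip("/").rsplit("/", 1)[-1]
    allowed = handler in READ_ONLY_HANDLERS
    # Every request is recorded in the audit log, whether allowed or denied.
    audit_log.append((user, handler, parts.query, "allowed" if allowed else "denied"))
    if not allowed:
        raise PermissionError(f"handler '{handler}' is not exposed by the proxy")
    # The query string is forwarded unchanged, preserving Solr query syntax.
    return f"{HIDDEN_SOLR_BASE}/{handler}?{parts.query}"


def filter_term_vector(term_freqs: dict, letter: str) -> dict:
    """Keep only the terms that start with the user-specified letter."""
    prefix = letter.lower()
    return {t: f for t, f in term_freqs.items() if t.lower().startswith(prefix)}
```

A request to an `update` handler is logged and rejected, while a `select` query passes through to the hidden backend with its query string intact.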
Data Capsules
[Diagram: Non-Consumptive Research-Secure Data Capsule] Components shown:
• Scholars, connecting by Remote Desktop or VNC
• VM Cluster: provide secure VM
• Submit secure capsule map/reduce Data Capsule images to FutureGrid. Receive and review results
• FutureGrid Computation Cloud
• HTRC Volume Store and Index
Sessions for Further Review
• For more on API – Tues Topic I/II
  (Yiming Sun)
• For more on Portal/SEASR – Tues Topic II
  (Loretta Auvil)
• For more on Portal/Blacklight – Tues Topic III
  (Stacy Kowalczyk)
Contact Information
• Robert H. McDonald
  – Email – robert@indiana.edu
  – Chat – rhmcdonald on googletalk | skype
  – Twitter - @mcdonald
  – Blog – http://www.rmcdonald.net
  – Twitter Hashtag: #HTRC12

Editor's Notes

  • #12 to #16: Registry. The agent can deploy any service listed in this diagram and run it with the computational resources. The original plan is to use XSEDE. Currently not using this on IIS machine but using ODIN (128-node cluster, each node with 4 GB memory and 4 computation cores) and smoketree (D2I server, 24 physical cores, 48 logical cores, 128 GB memory). These are not long term, just in use for now.