1/23
On-Demand Service-Based Big Data Integration: Optimized for Research Collaboration

Pradeeban Kathiravelu [1,2], Yiru Chen [3], Ashish Sharma [4], Helena Galhardas [1], Peter Van Roy [2], Luís Veiga [1]

The 3rd International Workshop on Data Management and Analytics for Medicine and Healthcare (DMAH), in conjunction with the 43rd International Conference on Very Large Data Bases. Munich, Germany. September 1, 2017.

[1] INESC-ID / Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
[2] Université catholique de Louvain, Louvain-la-Neuve, Belgium
[3] Peking University, Beijing, China
[4] Department of Biomedical Informatics, Emory University, Atlanta, USA
2/23
Introduction
● Scale and diversity of big data are rising.
  – Geographically distributed data of exabytes.
  – Structured, semi-structured, unstructured, or ill-formed data.
● Integration of data is crucial for data science.
● Sharing of integrated data and results.
  – Mandatory for reproducible research.
3/23
Challenges in Medical Research for Big Data Integration
● Multiple types of data.
  – Imaging, clinical, and genomic.
● Numerous data sources.
  – No shared messaging protocol.
● Do we really need to integrate all the data?
4/23
A Story of Medical Data Researchers...
5/23
● Jim is interested in the effects of a medicine for treating brain tumors in patients of certain age groups.
6/23
Observation - 1
● Various sources.
  – Service-based data access through APIs.
    ● Thanks to specifications such as HL7 FHIR.
● The researchers possess domain knowledge.
● Integrate on demand.
  – Avoid eager loading of binary data or its textual metadata.
  – Use the researcher query as an input in loading data.
● Scalable storage in-house.
  – Potential to load, integrate, index, and query unstructured data.
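The contrast between eager and on-demand loading can be sketched as follows. This is a minimal illustration, not the Óbidos implementation: `MetadataRecord`, `eager_load`, `on_demand_load`, and the sample records are all hypothetical, and the in-memory list merely stands in for a remote service-based source.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class MetadataRecord:
    """Hypothetical textual metadata for one item held by a data source."""
    item_id: str
    modality: str
    patient_age: int


def eager_load(source: Iterable[MetadataRecord]) -> List[MetadataRecord]:
    """Eager approach: copy every record, whether or not anyone needs it."""
    return list(source)


def on_demand_load(
    source: Iterable[MetadataRecord],
    query: Callable[[MetadataRecord], bool],
) -> List[MetadataRecord]:
    """On-demand approach: the researcher query decides what gets loaded."""
    return [record for record in source if query(record)]


# Made-up records standing in for a remote source.
source = [
    MetadataRecord("item-1", "MR", 34),
    MetadataRecord("item-2", "CT", 61),
    MetadataRecord("item-3", "MR", 67),
]

# Hypothetical researcher query: MR items for patients aged 60 or above.
loaded = on_demand_load(source, lambda r: r.modality == "MR" and r.patient_age >= 60)
```

In a real deployment the filter would more plausibly be pushed to the source API rather than applied after retrieval; the sketch only shows that the query, not the size of the source, bounds what is loaded.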
7/23
● Paula has overlapping research interests with Jim.
8/23
Observation - 2
● Load data only once per organization.
  – Bandwidth and storage efficiency.
9/23
● Sharing the research data with researchers beyond organizational boundaries.
10/23
Observation - 3
● Do not duplicate data!
  – We "own" our interest, not the data.
● Point to the data in the data sources.
  – Pointers to data, like Dropbox Shared Links, work well.
    ● Avoids outdated duplicate data.
    ● Easy to maintain.
● APIs – Access the list of research data sets.
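A small sketch of why sharing a pointer avoids outdated duplicates. The `DataPointer` type, the `resolve` helper, and the nested dictionary standing in for the data sources are assumptions made for illustration; the slides do not specify how pointers are represented.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DataPointer:
    """Hypothetical pointer: names a source and an item, holds no data."""
    source: str
    item_id: str


def resolve(pointer: DataPointer, sources: Dict[str, Dict[str, str]]) -> str:
    """Follow the pointer to the source at access time."""
    return sources[pointer.source][pointer.item_id]


# Made-up source contents.
sources = {"archive-a": {"series-1": "v1 pixels"}}
pointer = DataPointer("archive-a", "series-1")

# A duplicate taken at share time freezes the data as it was then.
copied = resolve(pointer, sources)

# The source is updated later.
sources["archive-a"]["series-1"] = "v2 pixels"

# The pointer still resolves to current data; the copy is now outdated.
current = resolve(pointer, sources)
```

The recipient of a pointer needs access to the original source, which is the trade-off for not maintaining copies.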
11/23
Problems
● How to..
  – Load data from several service-based big data sources.
    ● Avoid duplicate downloads and near duplicate data.
  – Integrate disparate data and persist for future accesses.
  – Share pointers to data internally and externally.
12/23
Óbidos
On-demand Big Data Integration, Distribution, and Orchestration System
● Researcher query → Narrow down the search space.
● Define subsets of data that are of interest.
  – Exploiting the well-defined hierarchical structure of medical data.
    ● Medical Images (DICOM)
    ● Clinical data
    ● ..
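As an illustration of defining a subset through a hierarchy, the sketch below assumes a collection → patient → study → series layout, which is how DICOM imaging archives are commonly organized. The `Catalog` type, `series_in_subset`, and all identifiers are hypothetical and are not taken from Óbidos.

```python
from typing import Dict, List, Optional

# Hypothetical catalog: collection -> patient -> study -> list of series IDs.
Catalog = Dict[str, Dict[str, Dict[str, List[str]]]]


def series_in_subset(
    catalog: Catalog,
    collection: str,
    patient: Optional[str] = None,
    study: Optional[str] = None,
) -> List[str]:
    """Return the series IDs under the chosen level of the hierarchy.

    Leaving `patient` or `study` as None selects everything at that level.
    """
    selected: List[str] = []
    for patient_id, studies in catalog.get(collection, {}).items():
        if patient is not None and patient_id != patient:
            continue
        for study_id, series in studies.items():
            if study is not None and study_id != study:
                continue
            selected.extend(series)
    return selected


# Made-up catalog for illustration.
catalog: Catalog = {
    "brain-collection": {
        "patient-1": {"study-a": ["s1", "s2"], "study-b": ["s3"]},
        "patient-2": {"study-c": ["s4"]},
    }
}

whole_collection = series_in_subset(catalog, "brain-collection")
one_patient = series_in_subset(catalog, "brain-collection", patient="patient-1")
one_study = series_in_subset(catalog, "brain-collection", "patient-1", "study-b")
```

Choosing a coarse level (a whole collection) yields a short description of a large subset, while a fine level (one study) narrows the search space further.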
13/23
Óbidos Approach
● Hybrid of virtual and materialized data integration approaches.
  – Lazy load of metadata: Load the matching subset of metadata.
  – Store integrated data and query results → scalable storage.
● Track already loaded data.
  – Near duplicate detection.
  – Download only updates (changesets).
● Efficient SQL queries on NoSQL storage.
● Share pointers to the datasets rather than the dataset itself.
● Generic design; implementation for medical research data.
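Tracking already loaded data so that only updates are downloaded can be sketched as below. This is a simplified assumption, not the Óbidos design: `LoadTracker` is hypothetical, it compares an opaque version string per item, and it covers exact change tracking only, not near duplicate detection.

```python
from typing import Dict, List


class LoadTracker:
    """Hypothetical record of which item versions are already stored locally."""

    def __init__(self) -> None:
        self._loaded: Dict[str, str] = {}

    def pending(self, remote: Dict[str, str]) -> List[str]:
        """Return IDs of remote items that are new or whose version changed."""
        return sorted(
            item_id
            for item_id, version in remote.items()
            if self._loaded.get(item_id) != version
        )

    def mark_loaded(self, item_id: str, version: str) -> None:
        """Remember that this version of the item is now stored locally."""
        self._loaded[item_id] = version


tracker = LoadTracker()
remote = {"item-a": "1", "item-b": "1"}  # made-up item versions at the source

first_pass = tracker.pending(remote)  # nothing loaded yet, so everything
for item_id in first_pass:
    tracker.mark_loaded(item_id, remote[item_id])

remote["item-b"] = "2"  # an existing item changes at the source
remote["item-c"] = "1"  # a new item appears
second_pass = tracker.pending(remote)  # only the changeset remains
```

A second researcher in the same organization issuing an overlapping query would hit the tracker first, which is what keeps each item to a single download.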
14/23
Óbidos Architecture
15/23
Evaluation
● Evaluation Data:
  – Clinical data and DICOM imaging collections of TCIA.
● Benchmark Óbidos against eager and lazy ETL.
  – Performance of loading and querying data.
● Óbidos (inter- and intra-organization) against binary data sharing.
  – Space/bandwidth efficiency of data sharing.
16/23
Workload Characterization
Various Entries in Evaluated Collections
17/23
Data load time
Change in total data volume (Same query and same interest)
● Observation:
  – Load time increases for eager and lazy ETL with total volume.
  – Load time for Óbidos remains constant.
    ● Total volume of data is irrelevant for Óbidos.
18/23
Data load time
Change in studies of interest (Same query and constant total data volume)
● Observation:
  – Load time for eager and lazy ETL remains constant.
  – Load time increases for Óbidos with the interest.
    ● Converges to the load time of lazy ETL.
19/23
Query completion time for the integrated data repository
● Observation:
  – We assume the corresponding data is already loaded.
    ● Thus, lazy and eager ETL perform similarly.
  – Indexed scalable NoSQL architecture of Óbidos → Better performance.
20/23
Efficiency in Sharing Medical Research Data
● Observation:
  – Intra-organization: a constant-size UID is sufficient.
  – Inter-organization: Óbidos pointers grow with the number of series.
  – Traditional binary data sharing: shared data size = volume of the image series.
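The difference in shared volume can be made concrete with a small calculation. The numbers below are made up for illustration and are not measurements from the evaluation; the helper names are hypothetical, and a pointer is assumed to be just the UID text of a series.

```python
from typing import Dict, List


def pointer_share_bytes(series_uids: List[str]) -> int:
    """Bytes needed to share one pointer (here: the UID text) per series."""
    return sum(len(uid.encode("utf-8")) for uid in series_uids)


def binary_share_bytes(series_sizes: Dict[str, int], series_uids: List[str]) -> int:
    """Bytes needed to ship the image series themselves."""
    return sum(series_sizes[uid] for uid in series_uids)


# Illustrative, made-up sizes: three series of 50 MB each.
series_sizes = {
    "series-0001": 50_000_000,
    "series-0002": 50_000_000,
    "series-0003": 50_000_000,
}
shared = list(series_sizes)

pointer_cost = pointer_share_bytes(shared)               # grows with the series count
binary_cost = binary_share_bytes(series_sizes, shared)   # grows with the image volume
```

Both costs grow as more series are shared, but the pointer cost depends only on how many series there are, not on how large each one is.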
21/23
Conclusion
● Óbidos offers on-demand service-based big data integration.
  – Fast and resource-efficient data analysis.
  – SQL queries over NoSQL data store for the integrated data.
  – Efficient data sharing without duplicating actual data.
● Future Work
  – Consume data from repositories of domains beyond medical data.
    ● EUDAT
  – Óbidos distributed virtual data warehouses.
    ● Leverage the proximity of the organizations in data integration and sharing.
22/23
Acknowledgements
● NCI QIN grant (1U01CA187013, Resources for Development and Validation Of Radiomic Analyses and Adaptive Therapy).
● Google Summer of Code (2014, 2015, and 2016).
● The Cancer Imaging Archive (TCIA).
● Tyk and API Umbrella Teams.
23/23
Thank you!
Questions?
