Studying Public Medical Images from the Open Access
Literature and Social Networks for Model Training and
Knowledge Extraction
Vincent Andrearczyk
HES-SO, Switzerland
MMM 2020, 08.01.2020
Henning Müller, Vincent Andrearczyk, Oscar Jimenez, Anjani Dhrangadhariya,
Roger Schaer, and Manfredo Atzori
Motivation
• Deep learning has been a driving force for
improving many applications of image analysis
• Complex networks require large amounts of
training data
– Data diversity is important for generalizability
• Most medical data sets have strong class
imbalances (rare diseases)
– Rare diseases require data from multiple centers,
which makes the organization complex
• Many resources that include images have become
available in the past few years
– PubMed Central, TCIA, social networks, etc.
Objectives of this article
• Summarize existing approaches that harvest
public data
– Focusing on PubMed Central and social networks
• Highlight advantages and difficulties in exploiting
the data
– (+) Very diverse data
– (+) Rare cases are oversampled
– (-) Much preprocessing and filtering is required
• Develop next steps required to fully use the data
PubMed Central
• Repository with the biomedical open access
literature, including images as files, etc.
– 3-4 images per article,
– increasing number of articles
Methodology for finding articles
• Analysis of tasks of ImageCLEF and work done
on these tasks using data from ImageCLEF
– Over the past 12 years
– Data filtering steps were taken from this work
• Use of Google Scholar to add references
– Terms “medical image classification”, “publicly
accessible resources”, “medical literature”,
“machine learning” were combined
• Dynamically growing data sets were favored
• Journal papers were preferred over
conference publications
Image retrieval
• Allows searching for images with text
– Or semantic terms such as UMLS or MeSH
• Content-based image retrieval
Demner-Fushman, et al. (2012), Journal of Computing Science and Engineering
Structuring the visual content
• Define types of images to make the literature
images classifiable
– Extremely large variety in most categories
– Many sub-categories are possible
– Categories with clinical relevance
are most important
– Allows removing noise
– Compound figures
are separately treated
[ImageCLEF 2013]
Challenges in the data
• Look-alikes
– Much strange content that needs to be removed
• Compound figures cannot easily be classified,
as they may contain aspects of several classes
– Cutting them into subfigures makes content
accessible
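A minimal sketch of the idea of cutting a compound figure into subfigures, assuming the panels are separated by near-white vertical gutters. This is an illustrative heuristic only, not the separation method used in the cited work; the function name `split_columns` and its `background` and `tolerance` parameters are hypothetical.

```python
def split_columns(image, background=255, tolerance=5):
    """Return (start, end) column ranges of panels separated by background gutters.

    `image` is a list of rows of grayscale values; `end` is exclusive.
    Assumption: gutters are columns in which every pixel is near `background`.
    """
    if not image:
        return []
    width = len(image[0])
    # A column is a gutter if all of its pixels are close to the background value.
    is_gutter = [
        all(abs(row[x] - background) <= tolerance for row in image)
        for x in range(width)
    ]
    panels, start = [], None
    for x, gutter in enumerate(is_gutter):
        if not gutter and start is None:
            start = x                    # a panel begins
        elif gutter and start is not None:
            panels.append((start, x))    # a panel ends at the gutter
            start = None
    if start is not None:
        panels.append((start, width))    # last panel touches the right edge
    return panels


# Toy 3x8 "figure": two dark panels separated by a two-pixel white gutter.
toy = [
    [10, 20, 30, 255, 255, 40, 50, 60],
    [15, 25, 35, 255, 255, 45, 55, 65],
    [12, 22, 32, 255, 255, 42, 52, 62],
]
panels = split_columns(toy)
```

The same scan can be repeated over rows to find horizontal gutters. Real compound figures often lack clean gutters, which is one reason learned separation methods are used in practice.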
Metadata available for PMC
• Text of the figure caption
– Relatively specific but often short
– Hard for compound figures that contain many parts
• Full text of the article
– Not specific to individual figures
– Location of the figure is available
• Article title and author-generated keywords
• Global MeSH terms (manually attached)
– Cover species and organs
• Not all is available for all articles (incomplete)
Tasks to make figures accessible
• Remove very small images and images with
unusual aspect ratios
• Classify figures into figure types
– Using image data and also text
– Remove non-relevant images, e.g. flowcharts
• Detect and cut compound figures into their parts
– Classify these into figure types again
• Filter human and animal tissue
• Filter specific organs of interest
• Find diseases or grading/staging
– Ground truth classes for machine learning
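The first task in the list above can be sketched as a simple size and aspect-ratio check. The thresholds `MIN_SIDE` and `MAX_ASPECT` and the `Figure` record are hypothetical values chosen for illustration; the talk does not state which limits were used.

```python
from dataclasses import dataclass


@dataclass
class Figure:
    name: str
    width: int   # pixels
    height: int  # pixels


# Hypothetical thresholds, not values from the talk.
MIN_SIDE = 100
MAX_ASPECT = 4.0


def keep_figure(fig, min_side=MIN_SIDE, max_aspect=MAX_ASPECT):
    """Return True if the figure is large enough and not extremely elongated."""
    short, long = sorted((fig.width, fig.height))
    if short < min_side:
        return False                      # too small (also covers zero-sized images)
    return long / short <= max_aspect     # reject banner-like shapes


figures = [
    Figure("fig1.jpg", 800, 600),
    Figure("icon.png", 32, 32),
    Figure("banner.jpg", 2000, 150),
]
kept = [f.name for f in figures if keep_figure(f)]
```

A cheap geometric filter like this would typically run before any classifier, so that the more expensive figure-type classification only sees plausible candidates.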
Advantages of literature images
• Rare images are generally used for articles and
case descriptions
– Mostly extreme cases to share the knowledge
on them
– Creates critical mass for rare diseases
• Images are from many laboratories and thus
contain many image variations
– Increase generalizability of learned models
• Exponentially increasing content
Problems with filtered images
• Many images might be missed by automatic
filtering
• Ground truth is not always solid
• Images might not have clinical quality
– Grey level resolution
– No information on level/window setting
– Cropped images, arrows in images, other overlays
• Size of the images is often small for publications
• Scale of images is not known (can be detected)
Otalora et al. (2018) MICCAI 2018
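To illustrate why the missing level/window setting matters, the sketch below applies a standard linear window to a raw intensity. The function name `apply_window` and the two example windows are assumptions made for illustration; the point is only that the same raw value can end up as very different grey values, and a published image stores the result without recording which window produced it.

```python
def apply_window(value, level, width):
    """Map a raw intensity to an 8-bit grey value using a linear window.

    Values below the window are clipped to 0, values above it to 255.
    """
    lower = level - width / 2
    upper = level + width / 2
    if value <= lower:
        return 0
    if value >= upper:
        return 255
    return round((value - lower) / (upper - lower) * 255)


# The same raw value rendered with two illustrative windows.
raw = 40
narrow_window = apply_window(raw, level=40, width=400)     # mid-grey
wide_window = apply_window(raw, level=-600, width=1500)    # close to white
```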
An example of Twitter images
• Images and information posted by pathologists on
Twitter
• Create dataset of histopathology images
• Train machine learning algorithms
– identify stains (H&E, IHC ...)
– discriminate between different tissues
– predict malignant tumors
• Limitations:
– good results (AUROC 0.9) only for simple tasks: H&E
vs rest
Schaumberg et al. (2018), bioRxiv
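For readers unfamiliar with the AUROC figure quoted above, here is a minimal sketch of how it can be computed for a binary task such as H&E vs. rest. It uses the pairwise-ranking definition (the probability that a positive example scores higher than a negative one). The labels and scores are made-up toy data, not results from the cited study.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = H&E, 0 = any other stain (values invented for illustration).
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.1]
value = auroc(labels, scores)  # 5 of 6 pairs are ranked correctly
```

The quadratic pairwise loop is fine for a sketch; for large datasets a rank-based computation gives the same value more efficiently.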
Next steps
• Quickly increasing content offers many possibilities
– Automatic pipelines need to contain update
mechanisms based on latest imaging equipment
– Community efforts for data curation
• Distribute the class labels with confidence scores
via PMC
• Evaluate impact on machine learning tasks of
adding such diverse sources
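One way to picture labels distributed with confidence scores is a per-figure record that downstream users can threshold themselves. The field names, identifiers, and class names below are hypothetical; the talk does not specify a format, and PMC does not currently define one here.

```python
import json

# Hypothetical records: one predicted figure type per figure, with a confidence.
records = [
    {"figure_id": "fig-001", "label": "light microscopy", "confidence": 0.97},
    {"figure_id": "fig-002", "label": "flowchart", "confidence": 0.55},
    {"figure_id": "fig-003", "label": "light microscopy", "confidence": 0.81},
]


def confident(records, threshold):
    """Keep only records whose confidence is at least `threshold`."""
    return [r for r in records if r["confidence"] >= threshold]


# A consumer picks a threshold that suits the task's tolerance for label noise.
selected = [r["figure_id"] for r in confident(records, threshold=0.8)]

# Records serialize to JSON and back without loss.
payload = json.dumps(records)
restored = json.loads(payload)
```

Shipping the score rather than a hard label leaves the precision/recall trade-off to each user of the data.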
Next steps
• We have been working on it!
– Mined 32,486 light microscopy images of rare human
cancers Dhrangadhariya et al. (2020) SPIE2020
– Automatic generalizable filtering pipeline
In preparation: Jimenez et al. (2020) Journal of the American Medical Informatics Association
– Benefits in deep learning clinical tasks … to come
Conclusions
• Images from public resources are complementary to
clinical images for machine learning
– Rare cases, much diversity
– Very large amount of data
• How can we obtain high-quality annotations with
limited effort (for example via active learning)?
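As a minimal sketch of the active learning idea mentioned above, the snippet below uses uncertainty sampling: the images whose predicted class probabilities have the highest entropy are sent to an annotator first. The image ids, probabilities, and the `select_for_annotation` helper are hypothetical; other selection strategies exist and the talk does not commit to one.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a class probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_annotation(predictions, budget):
    """Return the ids of the `budget` images with the most uncertain predictions."""
    ranked = sorted(predictions, key=lambda item: entropy(item[1]), reverse=True)
    return [image_id for image_id, _ in ranked[:budget]]


# Hypothetical model outputs over three classes for three unlabeled images.
predictions = [
    ("img_a", [0.98, 0.01, 0.01]),   # confident prediction
    ("img_b", [0.40, 0.35, 0.25]),   # very uncertain
    ("img_c", [0.70, 0.20, 0.10]),   # moderately uncertain
]
to_label = select_for_annotation(predictions, budget=2)
```

After the selected images are annotated, the model is retrained and the selection is repeated, so annotation effort concentrates on the cases the model finds hardest.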
Contact
• More information can be found at
– http://medgift.hevs.ch/
– http://publications.hevs.ch
• Contact:
– vincent.andrearczyk@hevs.ch
– henning.mueller@hevs.ch
