Learning, Training, Classification, Common Sense and
Exascale Computing
Joel Saltz MD, PhD
Chair and Professor Department of Biomedical Informatics
Professor Department of Computer Science
Cherith Endowed Chair
Stony Brook University
Oak Ridge National Laboratory, December 3, 2018
Overview
• Sensor data analysis is becoming an integral component of scientific
computing workflows
• Results from numerical simulations are often also spatio-temporal data
• In virtually all scientific fields, deep learning methods are being adopted
to make sense of sensor and scientific data
• Extremely broad implications – making sense of data via deep learning
generally involves a type of semantically based data compression
• Making sense of data also allows intelligent methods for steering data
acquisition and computation
• These issues were debated and discussed in great detail at the recent
DOE SSIO workshop
• Hot off the press – Nov 29-30th BDEC workshop – same basic conclusions
https://www.exascale.org/bdec/
Manish Parashar, Director of Advanced Cyberinfrastructure, NSF – BDEC 2 Kickoff, Bloomington, Indiana
Dan Reed, Provost, University of Utah – BDEC 2 Kickoff, Bloomington, Indiana
Scene Understanding
●Scene understanding from coarse-grained to
fine-grained inference
●Classify scene as a whole
●Identify objects
●Specify where objects are located
●Identify each object category’s pixels or
voxels
●Identify pixels and voxels associated with
each instance of each object
Image from Garcia et al arXiv 1704.06857
Scenes from Science Applications
• Size: Gigabyte to Terabyte per image; Petabyte to Exabyte
per collection
• Number of objects: can be billions or more, can be very few –
needle in a haystack
• Categories of object: Many, many ways of classifying natural
objects so category definition depends on what question is
being asked
• Object boundaries: Depends on exactly how an object is
defined. What is the boundary of a hurricane, tumor or city?
• Sources: Sensors and cameras, scientific simulations
Satellite Sensor Data
Pathology Image
Detection of oil leaks and spills (Serge Petiton from U. de Lille and TOTAL)
Wind Farm Simulation – Dimitri Mavriplis
Neural Networks and Deep Learning
Training
●Classify scene as a whole
●Identify objects
●Specify where objects are located
●Identify each object category’s pixels or
voxels
●Identify pixels and voxels associated with
each instance of each object
Image from Garcia et al arXiv 1704.06857
Increasing Training Effort
Deep Learning, Sensor/Scientific Data, Architectures and
Training
• Extremely large datasets
• Objects from the natural world differ from humans, animals and
human-built objects (tables, chairs, houses, etc.) and can be
harder to identify, classify and segment
• Coupled deep learning architectures and training strategies are
crucial for success
• In this talk, I will present several examples of work we’ve
done and results obtained
Prehistory – first decade of the 2000s
• Reverse engineer classification – break scene (whole slide
image) into patches
• Create machine learning model and generate training data by
examining characteristics of each patch, understanding
classification system and classifying each training patch
• Close collaboration with applications person (Hiro Shimada)
• Patch level predictions aggregated to generate predictions for
each scene
https://www.sciencedirect.com/science/article/abs/pii/S0031320308003439
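The patch-then-aggregate idea above can be sketched in a few lines of Python. This is only an illustration: the tile size, the toy `classify_patch` rule (mean intensity) and the majority vote are assumptions made here, not the published pipeline.

```python
from collections import Counter

def split_into_patches(image, patch):
    """Yield non-overlapping patch x patch tiles from a 2-D list of pixel values."""
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - patch + 1, patch):
        for c in range(0, cols - patch + 1, patch):
            yield [row[c:c + patch] for row in image[r:r + patch]]

def classify_patch(tile):
    """Hypothetical stand-in for a trained patch classifier:
    label a tile by its mean intensity."""
    values = [v for row in tile for v in row]
    return "dense" if sum(values) / len(values) > 0.5 else "sparse"

def classify_scene(image, patch=2):
    """Aggregate patch-level labels into one scene-level label by majority vote."""
    votes = Counter(classify_patch(t) for t in split_into_patches(image, patch))
    return votes.most_common(1)[0][0]

# Toy 4 x 4 "scene": three of the four 2 x 2 patches are bright.
scene = [
    [0.9, 0.8, 0.1, 0.2],
    [0.7, 0.9, 0.0, 0.1],
    [0.8, 0.9, 0.9, 0.7],
    [0.6, 0.9, 0.8, 0.9],
]
scene_label = classify_scene(scene)
```

A real system would replace `classify_patch` with a model trained on expert-labeled patches and might use a richer aggregation rule than a vote.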
Early Steps to Pathology Computer Aided Classification
2005-2010
Gurcan, Shimada, Kong, Saltz
Hiro Shimada, Metin Gurcan, Jun Kong, Lee Cooper, Joel Saltz
BISTI/NIBIB Center for Grid Enabled Image Analysis - P20 EB000591, PI Saltz
Neuroblastoma Classification
FH: favorable histology; UH: unfavorable histology
CANCER 2003; 98:2274-81
Figure: decision tree that assigns FH or UH from the following criteria
• Schwannian development: ≥50% or none to <50%
• Grossly visible nodule(s): absent or present
• Microscopic neuroblastic foci: absent or present
• Ganglioneuroma (Schwannian stroma-dominant), maturing and mature subtypes: FH
• Ganglioneuroblastoma, Intermixed (Schwannian stroma-rich): FH
• Ganglioneuroblastoma, Nodular (composite, Schwannian stroma-rich/stroma-dominant and stroma-poor): UH/FH*; variant forms*
• Neuroblastoma (Schwannian stroma-poor): undifferentiated, poorly differentiated and differentiating subtypes
• Mitotic & karyorrhectic cells: <100/5,000 cells, 100-200/5,000 cells, ≥200/5,000 cells
• Age: <1.5 yr, ≥1.5 yr, <5 yr, ≥5 yr, any age
Multi-Scale Machine Learning Based Shimada Classification System
• Background Identification
• Image Decomposition (Multi-resolution levels)
• Image Segmentation (EMLDA)
• Feature Construction (2nd order statistics, Tonal Features)
• Feature Extraction (LDA) + Classification (Bayesian)
• Multi-resolution Layer Controller (Confidence Region)
Flowchart, TRAINING: Training Tiles -> Down-sampling -> Segmentation -> Feature Construction -> Feature Extraction -> Classifier Training
Flowchart, TESTING: Image Tile -> Initialization (I = L) -> Background? (Yes: Label) -> Create Image I(L) -> Segmentation -> Feature Construction -> Feature Extraction -> Classification -> Within Confidence Region? (Yes: Label; No: I = I - 1 and repeat while I > 1)
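The coarse-to-fine control loop of the testing flowchart can be sketched as follows. Only the loop structure (start at the coarsest level, stop as soon as the decision falls within the confidence region, otherwise move to a finer level) follows the slide. The classifier, its confidence formula, the class names and the threshold are hypothetical placeholders for the segmentation, LDA and Bayesian stages.

```python
import math

def downsample(tile, factor):
    """Average non-overlapping factor x factor blocks of a square 2-D list."""
    size = len(tile) // factor
    return [
        [
            sum(tile[r * factor + i][c * factor + j]
                for i in range(factor) for j in range(factor)) / factor ** 2
            for c in range(size)
        ]
        for r in range(size)
    ]

def classify_with_confidence(image):
    """Hypothetical stand-in for the feature and classification stages.
    Confidence grows with the margin from the decision boundary and with
    the number of pixels available at this resolution."""
    values = [v for row in image for v in row]
    mean = sum(values) / len(values)
    label = "class_a" if mean >= 0.5 else "class_b"
    confidence = min(1.0, 2.0 * abs(mean - 0.5) * math.sqrt(len(values)))
    return label, confidence

def multiresolution_classify(tile, levels=3, threshold=0.7):
    """Return (label, level used). Level `levels` is the coarsest,
    level 1 is full resolution and always produces a label."""
    for level in range(levels, 0, -1):
        image = downsample(tile, 2 ** (level - 1))
        label, confidence = classify_with_confidence(image)
        if confidence >= threshold or level == 1:
            return label, level

def uniform_tile(value, size=4):
    """Toy 4 x 4 tile with a single intensity, used only for the demo."""
    return [[value] * size for _ in range(size)]

confident = multiresolution_classify(uniform_tile(0.875))   # decided at level 3
borderline = multiresolution_classify(uniform_tile(0.75))   # decided at level 2
ambiguous = multiresolution_classify(uniform_tile(0.5625))  # falls through to level 1
```

The appeal of this design is cost: most tiles are labeled from a small, cheap image, and only uncertain tiles pay for full-resolution processing.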
Training using only scene level categories
• Gigapixel image scenes – only scene level labels are provided
• In Pathology application – the Pathology classification of a whole slide
image is given
• No fine level training information
• Method must infer which patches are crucial (discriminative) to the
scene level classification
• Hidden variable representing likelihood that the label of the patch is the
same as the true label for the entire image
• Neural network architecture carries out computations to determine
which patches are discriminative
• Patch level classifiers are used to determine scene level prediction using
multi-class logistic regression or SVM
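A toy sketch of the hidden-variable idea, assuming each patch is reduced to a single feature value and each class is modeled by its mean feature. The alternating loop (refit the model on the patches currently kept, then re-score every patch and drop those unlikely to share the image label) illustrates the principle only. The softmax score, the threshold and the data are assumptions, and the final fusion step (logistic regression or SVM) is not shown.

```python
import math

def fit_means(images, keep):
    """M-step: per-class mean feature over the patches currently kept."""
    sums, counts = {}, {}
    for (label, patches), mask in zip(images, keep):
        for value, used in zip(patches, mask):
            if used:
                sums[label] = sums.get(label, 0.0) + value
                counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def prob_matches_image_label(value, label, means):
    """E-step quantity: softmax over negative squared distances to class means,
    read as the likelihood that the patch shares its image's label."""
    scores = {k: math.exp(-(value - m) ** 2) for k, m in means.items()}
    return scores[label] / sum(scores.values())

def select_discriminative(images, threshold=0.7, iterations=5):
    """Alternate model fitting and patch re-scoring; return (keep masks, class means)."""
    keep = [[True] * len(patches) for _, patches in images]
    for _ in range(iterations):
        means = fit_means(images, keep)
        keep = [
            [prob_matches_image_label(v, label, means) > threshold for v in patches]
            for label, patches in images
        ]
    return keep, fit_means(images, keep)

# Two labeled "slides"; the values near 3.0 mimic tissue common to both classes.
images = [
    ("A", [1.0, 1.2, 0.8, 3.0, 3.1]),
    ("B", [5.0, 4.8, 5.2, 3.0, 2.9]),
]
keep, means = select_discriminative(images)
```

On this toy input the shared background patches are dropped from both slides and the class means tighten around the discriminative patches.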
Deep Learning - Brain Tumor Classification – CVPR 2016
https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Hou_Patch-Based_Convolutional_Neural_CVPR_2016_paper.html
Heterogeneity: category shared by two patches in each
column
EM method iteratively eliminates non-discriminative patches
Image level decision fusion model
Brain
Le Hou, Dimitris Samaras, Tahsin Kurc, Yi Gao, Liz Vanner, James
Davis, Joel Saltz
Specify where objects are located -> Semantic
Segmentation
Input: Percent pixels in each patch belonging to a category
Output: High resolution semantic segmentation
Hou, Jojic, Malkin, Robinson, Samaras, Saltz (Microsoft Research and
Stony Brook University)
Input image with low resolution labels – generates a high
resolution image – “Super resolution”
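One way to train against low-resolution labels is to require that the high-resolution predictions, averaged over each labeled block, reproduce the known category fractions. The sketch below uses a mean squared difference for that comparison. The squared-error form and all names are assumptions chosen for illustration and may differ from the loss used in the work cited above.

```python
def block_class_fractions(probabilities, block):
    """Average per-pixel class probabilities over each block x block region.

    probabilities[r][c] is a list of class probabilities for one pixel.
    Returns {(block_row, block_col): [mean probability per class]}."""
    rows, cols = len(probabilities), len(probabilities[0])
    n_classes = len(probabilities[0][0])
    fractions = {}
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            totals = [0.0] * n_classes
            for r in range(r0, r0 + block):
                for c in range(c0, c0 + block):
                    for k in range(n_classes):
                        totals[k] += probabilities[r][c][k]
            fractions[(r0 // block, c0 // block)] = [t / block ** 2 for t in totals]
    return fractions

def super_resolution_loss(probabilities, target_fractions, block):
    """Mean squared gap between predicted and given per-block class fractions."""
    predicted = block_class_fractions(probabilities, block)
    errors = [
        (p - t) ** 2
        for key, target in target_fractions.items()
        for p, t in zip(predicted[key], target)
    ]
    return sum(errors) / len(errors)

# One 2 x 2 block labeled "50% class 0, 50% class 1".
targets = {(0, 0): [0.5, 0.5]}
consistent = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
all_class_0 = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]]
loss_consistent = super_resolution_loss(consistent, targets, block=2)
loss_all_class_0 = super_resolution_loss(all_class_0, targets, block=2)
```

The loss says nothing about which pixels inside a block carry which label. The network has to work that out from image appearance, which is what makes the output "super resolution".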
Labels from the Multi-Resolution Land Characteristics
Consortium - https://www.mrlc.gov/tools
Example Accuracy and Jaccard
Performance Data : see upcoming paper
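For reference, the two metrics named above can be computed from a ground-truth label map and a predicted label map as in this short sketch. The label maps are made-up examples, not results from the talk.

```python
def paired_labels(truth, predicted):
    """Flatten two same-shaped 2-D label maps into (truth, predicted) pairs."""
    return [(t, p) for tr, pr in zip(truth, predicted) for t, p in zip(tr, pr)]

def pixel_accuracy(truth, predicted):
    """Fraction of pixels whose predicted label equals the ground truth."""
    pairs = paired_labels(truth, predicted)
    return sum(t == p for t, p in pairs) / len(pairs)

def jaccard_index(truth, predicted, category):
    """Intersection over union of the pixels assigned to one category."""
    pairs = paired_labels(truth, predicted)
    intersection = sum(t == category and p == category for t, p in pairs)
    union = sum(t == category or p == category for t, p in pairs)
    return intersection / union if union else 1.0

truth = [[1, 1, 0],
         [0, 0, 0]]
predicted = [[1, 0, 0],
             [0, 0, 1]]
accuracy = pixel_accuracy(truth, predicted)      # 4 of 6 pixels agree
jaccard = jaccard_index(truth, predicted, 1)     # 1 shared pixel out of 3 in the union
```

Accuracy can look good when one category dominates the image. The Jaccard index penalizes both missed and spurious pixels of the category of interest, which is why the two are usually reported together.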
Ground truth vs Super-resolution and High Resolution Models
Application to Pathology Lymphocyte Segmentation
Instance Segmentation using synthetic scenes
Uses Generative Adversarial Networks (GANs)
Two players: a generator and a discriminator
Generator generates new instances of an object while the
discriminator determines whether the new instance belongs to the
actual dataset
Simple version of this – generate scene that looks realistic.
Because the scene is artificial, we can semantically segment all
object instances
We then use the artificial scenes as training data
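The key point, that an artificial scene comes with its instance labels for free, can be shown without any GAN at all. The sketch below draws discs on an empty grid and records which instance owns each pixel. The disc shapes, sizes and intensities are arbitrary assumptions, and the GAN step that would make such a scene look realistic is omitted.

```python
import random

def generate_scene(size, n_objects, radius, seed=0):
    """Place discs on an empty grid and return (image, instance_mask).

    instance_mask[r][c] is 0 for background, otherwise the 1-based id of
    the disc drawn last at that pixel."""
    rng = random.Random(seed)
    image = [[0.0] * size for _ in range(size)]
    mask = [[0] * size for _ in range(size)]
    for instance_id in range(1, n_objects + 1):
        center_r, center_c = rng.randrange(size), rng.randrange(size)
        intensity = rng.uniform(0.5, 1.0)
        for r in range(max(0, center_r - radius), min(size, center_r + radius + 1)):
            for c in range(max(0, center_c - radius), min(size, center_c + radius + 1)):
                if (r - center_r) ** 2 + (c - center_c) ** 2 <= radius ** 2:
                    image[r][c] = intensity
                    mask[r][c] = instance_id
    return image, mask

image, mask = generate_scene(size=32, n_objects=5, radius=3)
instance_ids = {v for row in mask for v in row}
```

Each (image, mask) pair is a ready-made training example for an instance segmentation network, with no human annotation involved.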
The challenge
• Conceptually identical objects can look different in different kinds of
scenes
• Generate training data to find objects in new types of scenes
• In Pathology, nuclei look different in different tissue types – e.g. brain,
kidney, liver, prostate, etc.
• Clearly not limited to Pathology – consider segmenting trees, vortices,
tornados or microfossils
Microfossils - TOTAL
Requirements and Approach
• GAN scene generation needs a good starting point – we need to
generate somewhat realistic artificial tissue samples
• Use excellent and OK artificial scenes as training data by using
GAN provided artificial tissue quality estimates
• Evaluate impact on segmentation loss function and give
preference to “hard” examples
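The slide does not give the weighting rule, so the following is purely a hypothetical illustration of how a quality estimate and a per-example loss could be combined: examples below a minimum quality are dropped, and the rest are weighted by quality times current loss so that realistic but "hard" scenes count most.

```python
def training_weights(quality, loss, min_quality=0.5):
    """Hypothetical weighting of synthetic examples.

    quality: GAN-style realism estimates in [0, 1], one per example.
    loss: current segmentation loss, one per example.
    Returns weights that sum to 1 (or all zeros if nothing qualifies)."""
    raw = [q * l if q >= min_quality else 0.0 for q, l in zip(quality, loss)]
    total = sum(raw)
    return [w / total for w in raw] if total else raw

quality = [0.9, 0.6, 0.2]   # excellent, OK, poor
loss = [0.1, 0.8, 0.9]      # easy, hard, hard
weights = training_weights(quality, loss)
```

Here the OK-quality but hard example receives the largest weight, the excellent but easy example a small one, and the poor-quality scene is excluded however hard it is.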
Generation of starting point synthetic tissue
GAN based refinement of synthetic tissue
Examples of generated tissue with quality estimates
Deep Learning Pipeline for NCI consortium research
Deep learning problem - specify where objects are located
• National Cancer Institute Sponsored Cancer Genome Atlas Pan Cancer
Atlas Collaboration
• High resolution (250K patches per whole slide image) mapping of
immune cells
• 13 Cancer Types
• Correlation of spatial immune cell distribution with very detailed
molecular datasets
• Datasets available in Cancer Imaging Archive; publications in Cell journal
series
• Linked learning/molecular data can be interactively explored in CRI
iATLAS - https://www.cri-iatlas.org/
TCGA Pan Cancer Atlas – Immune Landscape of Cancer
• Six identified immune subtypes span
cancer tissue types and molecular
subtypes
• Immune subtypes differ by somatic
aberrations, microenvironment, and
survival
• Multiple control modalities of
molecular networks affect tumor-
immune interactions
• These analyses serve as a resource
for exploring immunogenicity across
cancer types
http://www.cell.com/immunity/fulltext/S1074-7613(18)30121-3
• Stony Brook, Institute for Systems Biology, MD Anderson, Emory group
• TCGA Pan Cancer Immune Group – led by ISB researchers
• Deep dive into linked molecular and image based characterization of
cancer related immune response
http://www.cell.com/cell-reports/pdf/S2211-1247(18)30447-9.pdf
● Deep learning based
computational stain for staining
tumor infiltrating lymphocytes
(TILs)
●TIL patterns generated from
4,759 TCGA subjects (5,202 H&E
slides), 13 cancer types
●Computationally stained TILs
correlate with pathologists' visual
assessments and molecular estimates
●TIL patterns linked to tumor and
immune molecular features, cancer
type, and outcome
Le Hou – Graduate Student
Computer Science
Vu Nguyen – Graduate Student
Computer Science
Anne Zhao – Pathology Informatics
Biomedical Informatics, Pathology
(now Surg Path Fellow SBM)
Raj Gupta – Pathology Informatics
Biomedical Informatics, Pathology
Deep Learning
and Lymphocytes
Importance of Immune System in Cancer Treatment and Prognosis
• Tumor spatial context and cellular heterogeneity are important
in cancer prognosis
• Spatial TIL densities in different tumor regions have been
shown to have high prognostic value
• Immune related assays used to steer cancer immune therapy
• TIL maps being computed for SEER Pathology studies and will
be routinely computed for data contributed to TCIA archive
• Ongoing study to relate TIL patterns with immune gene
expression groups and patient response
Training, Model Creation
• Algorithm first trained on image patches
• Several cooperating deep learning algorithms generate heat
maps
• Heat maps used to generate new predictions
• Companion molecular statistical data analysis pipelines
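A minimal sketch of how a heat map of per-patch probabilities can be turned into predictions, assuming a simple cutoff. Choosing the cutoff by agreement with a small set of reviewed patch labels is an assumption made here for illustration. All names and numbers are hypothetical.

```python
def threshold_heat_map(heat_map, threshold):
    """Binary prediction map: 1 where the patch probability reaches the threshold."""
    return [[1 if p >= threshold else 0 for p in row] for row in heat_map]

def positive_fraction(binary_map):
    """Share of patches predicted positive."""
    cells = [v for row in binary_map for v in row]
    return sum(cells) / len(cells)

def pick_threshold(heat_map, labels, candidates):
    """Choose the candidate threshold that agrees best with reviewed patch labels."""
    def agreement(threshold):
        predicted = threshold_heat_map(heat_map, threshold)
        return sum(p == l for pr, lr in zip(predicted, labels) for p, l in zip(pr, lr))
    return max(candidates, key=agreement)

heat_map = [[0.9, 0.4],
            [0.6, 0.1]]
reviewed = [[1, 0],
            [1, 0]]          # labels confirmed by a human reviewer
best = pick_threshold(heat_map, reviewed, candidates=[0.3, 0.5, 0.7])
til_map = threshold_heat_map(heat_map, best)
```

In practice the cutoff may need to differ between cancer types or staining batches, which is one reason to keep a human review step in the loop.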
Training, threshold adjustment, quality control
We pre-train using a sparse autoencoder that “likes”
immune cells
Visual tools used to QC and iteratively refine
training data: Quantitative Imaging Pathology - QuIP Tool Set
Interactive Deep Learning Training Tool
Validation – Stratified sampling from 5K whole slide images
Arvind Rao, expert in spatial biostatistics (U Michigan)
SKCM TCGA-D3-A2JF-06Z-00-DX1
SKCM TCGA-D3-A2JA-06Z-00-DX1
TIL Pattern Descriptions
Qualitative (Alex Lazar, Raj Gupta)
• “Brisk, diffuse”: diffusely infiltrative TILs scattered throughout at
least 30% of the area of the tumor (1,856 cases)
• “Brisk, band-like”: band-like boundaries bordering the tumor at its
periphery (1,185 cases)
• “Non-brisk, multi-focal”: loosely scattered TILs present in less than
30% but more than 5% of the area of the tumor (1,083 cases)
• “Non-brisk, focal”: TILs scattered throughout less than 5% but more
than 1% of the area of the tumor (874 cases)
• “None”: TILs in less than 1% of the area of the tumor (143 cases)
Quantitative – Arvind Rao
• Agglomerative clustering
• Cluster indices representing
cluster number, density, cluster
size, distance between clusters
• Traditional spatial statistics
measures
• R package clusterCrit by
Bernard Desgraupes - Ball-
Hall, Banfield-Raftery, C Index,
and Determinant Ratio indices
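To make the quantitative description concrete, here is a small Python sketch of agglomerative clustering on patch coordinates followed by the Ball-Hall index, defined as the mean over clusters of the mean squared distance from each point to its cluster centroid. The centroid-linkage merge rule and the sample coordinates are assumptions. The analysis described above used the R package clusterCrit, not this code.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerative_clusters(points, n_clusters):
    """Start with one cluster per point and repeatedly merge the two
    clusters whose centroids are closest (centroid linkage)."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: squared_distance(centroid(clusters[ab[0]]),
                                            centroid(clusters[ab[1]])),
        )
        merged = clusters.pop(j)          # j > i, so index i stays valid
        clusters[i] = clusters[i] + merged
    return clusters

def ball_hall_index(clusters):
    """Mean over clusters of the mean squared distance to the cluster centroid."""
    dispersions = [
        sum(squared_distance(p, centroid(c)) for p in c) / len(c) for c in clusters
    ]
    return sum(dispersions) / len(dispersions)

# Hypothetical grid coordinates of TIL-positive patches: two tight groups.
til_patches = [(0, 0), (0, 2), (10, 0), (10, 2)]
clusters = agglomerative_clusters(til_patches, n_clusters=2)
index = ball_hall_index(clusters)
```

Low values indicate compact TIL clusters. Together with cluster count and size, an index of this kind gives a numeric handle on patterns that are otherwise described only qualitatively.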
Use of Pathology Deep Learning Methods in Multi-institutional
Consortia
• NCI Quantitative Imaging for Pathology (QuIP): Stony Brook, Emory,
MD Anderson, Institute for Systems Biology, Oak Ridge
• NCI SEER Pathology: Stony Brook, Emory, Rutgers, University of
Kentucky (three Cancer registries)
• Cancer Imaging Archive: Arkansas, Stony Brook, Emory (Stony Brook
leads Pathology component)
• Virtual Tissue Repository: Led by NCI SEER; Stony Brook, Emory
• TIES Research Network - Integrated Pathology text and imaging:
Pittsburgh, Stony Brook main sites, 6+ other sites (Stony Brook leads
digital Pathology)
SEER coverage includes 31.9 percent of Whites, 30.0
percent of African Americans, 44.0 percent of Hispanics,
49.3 percent of American Indians and Alaska Natives,
57.5 percent of Asians, and 68.5 percent of
Hawaiian/Pacific Islanders.
Methods and tools for integrating pathomics data into cancer
registries
Saltz, Sharma, Foran and Durban
• Enhance SEER registry data with machine learning based classifications
and quantitative pathomics feature sets.
• The New Jersey State Cancer Registry, Georgia and Kentucky State
Cancer Registries
• Prostate Cancer, Lymphoma and NSCLC
• Repository of high‐quality digitized pathology images for subjects
whose data is being collected by the registries.
• Extract computational features and establish deep linkages with
registry data, thus enabling the creation of information‐rich, population
cohorts containing objective imaging and clinical attributes
Future Exascale Efforts I - Interactive Incorporation of high
throughput training data
• Iterative paradigm of human in the loop training set correction
• For example, current training in TIL application involves roughly 100K
image patches
• Our 5K WSI dataset, iteratively corrected by humans, now has 500M
image patches that can be used for training
• Super-resolution semantic segmentation can make use of multiple types
of sensor and map data to provide ground truth – sensor and human
markup integration challenge
• Crucial issue is support for interactive exploration of impact on training
data
• Rapid iteration on impact of variations in CNN architecture also crucial
Future Exascale Efforts II – Leveraging Deep Learning to
Optimize Systems Software Performance
• Methods for optimizing
movement of computation
and data through complex
storage hierarchies
• My past work in this has
assumed the need for simple,
effective heuristics
• Was developed before the
acute need for this kind of
thing (but does have 487
citations)
Active Disk/Data Cutter Re-Do with Deep Learning
• Complex storage and I/O
Systems require intelligent
management of work and data
• Extreme scale computing
systems can support complex
optimizations
• Deep learning systems capable
of carrying out policies with
previously infeasible levels of
complexity
Future Exascale Efforts III
• Scaling up whole slide imaging
analytics
• Current studies involve order
10K whole slide images
• TIES consortium aims to obtain
500M slides over next three
years
• SEER will ultimately require
roughly 6M whole slide images
• Human tumor and human tissue
atlas are adding 3rd dimension
and deep molecular annotation
to analysis
THANKS!
ITCR Team
Stony Brook University
Joel Saltz
Tahsin Kurc
Yi Gao
Allen Tannenbaum
Erich Bremer
Jonas Almeida
Alina Jasniewski
Fusheng Wang
Tammy DiPrima
Andrew White
Le Hou
Furqan Baig
Mary Saltz
Raj Gupta
Emory University
Ashish Sharma
Adam Marcus
Oak Ridge National
Laboratory
Scott Klasky
Dave Pugmire
Jeremy Logan
Yale University
Michael Krauthammer
Harvard University
Rick Cummings
Funding – Thanks!
• This work was supported in part by U24CA180924,
U24CA215109, NCIP/Leidos 14X138 and
HHSN261200800001E, UG3CA225021-01 from the
NCI; R01LM011119-01 and R01LM009239 from the
NLM
• This research used resources provided by the
National Science Foundation XSEDE Science
Gateways program under grant TG-ASC130023 and
the Keeneland Computing Facility at the Georgia
Institute of Technology, which is supported by the
NSF under Contract OCI-0910735.

Learning, Training,  Classification,  Common Sense and Exascale Computing

  • 1. Learning, Training, Classification, Common Sense and Exascale Computing Joel Saltz MD, PhD Chair and Professor Department of Biomedical Informatics Professor Department of Computer Science Cherith Endowed Chair Stony Brook University Oak Ridge National Laboratory, December 3, 2018
  • 2. Overview • Sensor data analysis is becoming an integral component of scientific computing workflows • Results from numerical simulations are often also spatio-temporal data • In virtually all scientific fields, deep learning methods are being adopted to make sense of sensor and scientific data • Extremely broad implications – making sense of data via deep learning generally involves a type of semantically based data compression • Making sense of data also allows intelligent methods for steering data acquisition and computation • These issues were debated and discussed in great detail at the recent DOE SSIO workshop • Hot off the press – Nov 29-30th BDEC workshop – same basic conclusions
  • 5. Dan Reed, Provost University of Utah BDEC 2 Kickoff – Bloomington Indiana
  • 6. Scene Understanding ●Scene understanding from coarse-grained to fine-grained inference ●Classify scene as a whole ●Identify objects ●Specify where objects are located ●Identify each object category’s pixels or voxels ●Identify pixels and voxels associated with each instance of each object Image from Garcia et al arXiv 1704.06857
  • 7. Scenes from Science Applications • Size: Gigabyte to Terabyte per image; Petabyte to Exabyte per collection • Number of objects: can be billions or more, can be very few – needle in hay stack • Categories of object: Many, many ways of classifying natural objects so category definition depends on what question is being asked • Object boundaries: Depends on exactly how an object is defined. What is the boundary of a hurricane, tumor or city? • Sources: Sensors and cameras, scientific simulations
  • 10. Detection of oil leaks and spills (Serge Petiton from U. de Lille and TOTAL)
  • 11. Wind Farm Simulation – Dimitri Mavriplis
  • 12. Neural Networks and Deep Learning
  • 13. Training ●Classify scene as a whole ●Identify objects ●Specify where objects are located ●Identify each object category’s pixels or voxels ●Identify pixels and voxels associated with each instance of each object Image from Garcia et al arXiv 1704.06857 Increasing Training Effort
  • 14. Deep Learning, Sensor/Scientific Data, Architectures and Training • Extremely large datasets • Humans, animals and human built objects (tables, chairs, houses etc) are different from objects from the natural world and can be harder to identify, classify and segment • Coupled deep learning architectures and training strategies are crucial for success • In this talk, I will present several examples of work we’ve done and results obtained
  • 15. Prehistory – first decade of 2000’s • Reverse engineer classification – break scene (whole slide image) into patches • Create machine learning model and generate training data by examining characteristics of each patch, understanding classification system and classifying each training patch • Close collaboration with applications person (Hiro Shimada) • Patch level predictions aggregated to generate predictions for each scene https://www.sciencedirect.com/science/article/abs/pii/S0031320 308003439
  • 16. Early Steps to Pathology Computer Aided Classification 2005-2010 Gurcan, Shamada, Kong, Saltz Hiro Shimada, Metin Gurcan, Jun Kong, Lee Cooper Joel Saltz BISTI/NIBIB Center for Grid Enabled Image Analysis - P20 EB000591, PI Saltz
  • 17. Neuroblastoma Classification FH: favorable histology UH: unfavorable histology CANCER 2003; 98:2274-81 <5 yr Schwannian Development ≥50% Grossly visible Nodule(s) absent present Microscopic Neuroblastic foci absent present Ganglioneuroma (Schwannian stroma-dominant) Maturing subtype Mature subtype Ganglioneuroblastoma, Intermixed (Schwannian stroma-rich) FH FH Ganglioneuroblastoma, Nodular (composite, Schwannian stroma-rich/ stroma-dominant and stroma-poor) UH/FH* Variant forms* None to <50% Neuroblastoma (Schwannian stroma-poor) Poorly differentiated subtype Undifferentiated subtype Differentiating subtype Any age UH ≥200/5,000 cells Mitotic & karyorrhectic cells 100-200/5,000 cells <100/5,000 cells Any age ≥1.5 yr <1.5 yr UH UH FH ≥200/5,000 cells 100-200/5,000 cells <100/5,000 cells Any age UH ≥1.5 yr <1.5 yr ≥5 yr UH FH UH FH
  • 18. Multi-Scale Machine Learning Based Shimada Classification System • Background Identification • Image Decomposition (Multi- resolution levels) • Image Segmentation (EMLDA) • Feature Construction (2nd order statistics, Tonal Features) • Feature Extraction (LDA) + Classification (Bayesian) • Multi-resolution Layer Controller (Confidence Region) No Yes Image Tile Initialization I = L Background? Label Create Image I(L) Segmentation Feature Construction Feature Extraction Classification Segmentation Feature Construction Feature Extraction Classifier Training Down-sampling Training Tiles Within Confidence Region ? I = I -1 I > 1? Yes Yes No No TRAINING TESTING
  • 20. Training using only scene level categories • Gigapixel image scenes – only scene level labels are provided • In Pathology application – the Pathology classification of a whole slide image is given • No fine level training information • Method must infer which patches are crucial (discriminative) to the scene level classification • Hidden variable representing likelihood that the label of the patch is the same as the true label for the entire image • Neural network architecture carries out computations to determine which patches are discriminative • Patch level classifiers are used to determine scene level prediction using multi-class logistic regression or SVM
  • 21. Deep Learning - Brain Tumor Classification – CVPR 2016 https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Hou_Patch- Based_Convolutional_Neural_CVPR_2016_paper.html
  • 22. Heterogeneity: category shared by two patches in each column
  • 23. EM Method iteratively eliminates non discriminative patches
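The iterative elimination can be sketched as a loop that alternates between fitting a patch classifier and discarding low-scoring patches. This is a simplified sketch, not the published procedure: it assumes each patch gets a probability of carrying its image's label and that a fixed percentile of patches is dropped per round. `train_fn` and `score_fn` are hypothetical callables standing in for CNN training and inference.

```python
def percentile(values, q):
    """Simple rank-based q-th percentile (0-100) of a list of numbers."""
    ordered = sorted(values)
    rank = int(round(q / 100 * (len(ordered) - 1)))
    rank = max(0, min(len(ordered) - 1, rank))
    return ordered[rank]

def e_step(patch_probs, drop_percent):
    """Indices of patches whose probability is at or above the percentile threshold."""
    threshold = percentile(patch_probs, drop_percent)
    return [i for i, p in enumerate(patch_probs) if p >= threshold]

def em_filter(patch_ids, train_fn, score_fn, rounds, drop_percent):
    """Alternate model fitting (M-step) and patch elimination (E-step)."""
    kept = list(patch_ids)
    for _ in range(rounds):
        model = train_fn(kept)                       # M-step: fit on kept patches
        probs = [score_fn(model, p) for p in kept]   # score every kept patch
        kept = [kept[i] for i in e_step(probs, drop_percent)]
    return kept

# Toy stand-ins: the "model" is unused and the scores are fixed per patch.
TOY_SCORES = {0: 0.9, 1: 0.2, 2: 0.6, 3: 0.1, 4: 0.8}
survivors = em_filter(
    patch_ids=[0, 1, 2, 3, 4],
    train_fn=lambda kept: None,
    score_fn=lambda model, pid: TOY_SCORES[pid],
    rounds=1,
    drop_percent=40,
)
```

A real implementation would retrain the network on the surviving patches each round, so the scores themselves change between iterations.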
  • 24. Image level decision fusion model
  • 25. Brain Le Hou, Dimitris Samaras, Tahsin Kurc, Yi Gao, Liz Vanner, James Davis, Joel Saltz
  • 26. Specify where objects are located -> Semantic Segmentation Input: Percent pixels in each patch belonging to a category Output: High resolution semantic segmentation Hou, Jojic, Malkin, Robinson, Samaras, Saltz (Microsoft Research and Stony Brook University)
  • 27. Input image with low resolution labels – generates a high resolution image – “Super resolution”
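One way to read the input/output pairing above is that a high resolution prediction is consistent with the low resolution labels when, inside each coarse block, the predicted class fractions match the given percentages. The sketch below checks that consistency on hard labels. It is an assumption-laden illustration: the function names are hypothetical, and an actual training loss would presumably operate on soft network outputs so that it stays differentiable.

```python
def block_fractions(pred, block, num_classes):
    """Per-block class fractions of a 2D grid of per-pixel class ids."""
    rows, cols = len(pred), len(pred[0])
    out = []
    for r0 in range(0, rows, block):
        row_out = []
        for c0 in range(0, cols, block):
            counts = [0] * num_classes
            for r in range(r0, min(r0 + block, rows)):
                for c in range(c0, min(c0 + block, cols)):
                    counts[pred[r][c]] += 1
            n = sum(counts)
            row_out.append([k / n for k in counts])
        out.append(row_out)
    return out

def fraction_mismatch(pred, target_fractions, block, num_classes):
    """Mean absolute difference between predicted and given block fractions."""
    got = block_fractions(pred, block, num_classes)
    diffs = [abs(g - t)
             for got_row, tgt_row in zip(got, target_fractions)
             for got_blk, tgt_blk in zip(got_row, tgt_row)
             for g, t in zip(got_blk, tgt_blk)]
    return sum(diffs) / len(diffs)

# 4x4 high resolution prediction, two classes, 2x2 low resolution blocks.
PRED = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
fractions = block_fractions(PRED, block=2, num_classes=2)
```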
  • 28. Labels from the Multi-Resolution Land Characteristics Consortium - https://www.mrlc.gov/tools
  • 29. Example Accuracy and Jaccard Performance Data : see upcoming paper
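For reference, the two metrics named on this slide can be computed as below. This is a generic sketch on flattened label lists, not the evaluation code behind the reported numbers. Returning 1.0 when a class is absent from both prediction and ground truth is a convention assumed here.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted class equals the ground truth class."""
    correct = sum(1 for p, t in zip(pred, truth) if p == t)
    return correct / len(truth)

def jaccard(pred, truth, cls):
    """Intersection over union for one class."""
    intersection = sum(1 for p, t in zip(pred, truth) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, truth) if p == cls or t == cls)
    if union == 0:
        return 1.0  # assumed convention: class absent everywhere counts as perfect
    return intersection / union

PRED = [0, 1, 1, 0, 1]
TRUTH = [0, 1, 0, 0, 1]
acc = pixel_accuracy(PRED, TRUTH)
iou_class1 = jaccard(PRED, TRUTH, cls=1)
```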
  • 30. Ground truth vs Super-resolution and High Resolution Models
  • 31. Application to Pathology Lymphocyte Segmentation
  • 32. Instance Segmentation using synthetic scenes • Uses Generative Adversarial Networks (GANs) • Two players: a generator and a discriminator • The generator generates new instances of an object while the discriminator determines whether the new instance belongs to the actual dataset • Simple version of this: generate a scene that looks realistic • Because the scene is artificial, we can semantically segment all object instances • We then use the artificial scenes as training data
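The two-player setup of a generative adversarial network (GAN) can be made concrete through the losses each player minimizes. The sketch below uses the standard binary cross-entropy formulation with the commonly used non-saturating generator loss. It takes discriminator outputs as plain probabilities and is not tied to the specific networks used in this work.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Cross-entropy loss: real samples should score near 1, generated near 0."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake):
    """Non-saturating loss: the generator wants its samples scored near 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Discriminator outputs (probability "real") for two real and two generated samples.
d_loss = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.3, 0.1])
g_loss = generator_loss(d_fake=[0.3, 0.1])
```

When the discriminator cannot tell the two apart (all outputs 0.5), the discriminator loss equals 2 ln 2 and the generator loss equals ln 2.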
  • 33. The challenge • Conceptually identical objects can look different in different kinds of scenes • Generate training data to find objects in new types of scenes • In Pathology, nuclei look different in different tissue types – e.g. brain, kidney, liver, prostate, etc. • Clearly not limited to Pathology – consider segmenting trees, vortices, tornadoes or microfossils
  • 34. Requirements and Approach • GAN scene generation needs a good starting point – we need to generate somewhat realistic artificial tissue samples • Use excellent and OK artificial scenes as training data by using GAN provided artificial tissue quality estimates • Evaluate impact on segmentation loss function and give preference to “hard” examples
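One possible way to combine the two ideas above, quality estimates and preference for hard examples, is to weight each synthetic scene by its quality estimate times its current segmentation loss. The scheme below is purely a hypothetical illustration: the slide does not specify the weighting rule, and `min_quality` is an assumed cutoff.

```python
def sample_weights(quality, loss, min_quality=0.5):
    """Normalized training weights: quality-gated and proportional to loss."""
    raw = [q * l if q >= min_quality else 0.0 for q, l in zip(quality, loss)]
    total = sum(raw)
    if total == 0:
        # Fall back to uniform weights when nothing passes the quality gate.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]

# Three synthetic scenes: quality estimates and current segmentation losses.
weights = sample_weights(quality=[0.9, 0.6, 0.2], loss=[1.0, 3.0, 5.0])
```

Here the third scene has the highest loss but is excluded because its quality estimate falls below the assumed cutoff.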
  • 35. Generation of starting point synthetic tissue
  • 36. GAN based refinement of synthetic tissue
  • 37. Examples of generated tissue with quality estimates
  • 38. Deep Learning Pipeline for NCI consortium research Deep learning problem - specify where objects are located • National Cancer Institute Sponsored Cancer Genome Atlas Pan Cancer Atlas Collaboration • High resolution (250K patches per whole slide image) mapping of immune cells • 13 Cancer Types • Correlation of spatial immune cell distribution with very detailed molecular datasets • Datasets available in Cancer Imaging Archive; publications in Cell journal series • Linked learning/molecular data can be interactively explored in CRI iATLAS - https://www.cri-iatlas.org/
  • 39. TCGA Pan Cancer Atlas – Immune Landscape of Cancer • Six identified immune subtypes span cancer tissue types and molecular subtypes • Immune subtypes differ by somatic aberrations, microenvironment, and survival • Multiple control modalities of molecular networks affect tumor-immune interactions • These analyses serve as a resource for exploring immunogenicity across cancer types http://www.cell.com/immunity/fulltext/S1074-7613(18)30121-3
  • 40. • Stony Brook, Institute for Systems Biology, MD Anderson, Emory group • TCGA Pan Cancer Immune Group – led by ISB researchers • Deep dive into linked molecular and image based characterization of cancer related immune response http://www.cell.com/cell-reports/pdf/S2211-1247(18)30447-9.pdf
  • 41. ● Deep learning based computational stain for staining tumor infiltrating lymphocytes (TILs) ● TIL patterns generated from 4,759 TCGA subjects (5,202 H&E slides), 13 cancer types ● Computationally stained TILs correlate with pathologist eye and molecular estimates ● TIL patterns linked to tumor and immune molecular features, cancer type, and outcome
  • 42. Deep Learning and Lymphocytes • Le Hou – Graduate Student, Computer Science • Vu Nguyen – Graduate Student, Computer Science • Anne Zhao – Pathology Informatics; Biomedical Informatics, Pathology (now Surg Path Fellow SBM) • Raj Gupta – Pathology Informatics; Biomedical Informatics, Pathology
  • 43. Importance of Immune System in Cancer Treatment and Prognosis • Tumor spatial context and cellular heterogeneity are important in cancer prognosis • Spatial TIL densities in different tumor regions have been shown to have high prognostic value • Immune related assays used to steer cancer immune therapy • TIL maps being computed for SEER Pathology studies and will be routinely computed for data contributed to TCIA archive • Ongoing study to relate TIL patterns with immune gene expression groups and patient response
  • 44. Training, Model Creation • Algorithm first trained on image patches • Several cooperating deep learning algorithms generate heat maps • Heat maps used to generate new predictions • Companion molecular statistical data analysis pipelines
  • 46. We pre-train using a sparse autoencoder that “likes” immune cells
  • 47. Visual tools used to QC and iteratively refine training data: Quantitative Imaging Pathology - QuIP Tool Set
  • 48. Interactive Deep Learning Training Tool
  • 50. Validation – Stratified sampling from 5K whole slide images Arvind Rao, expert in spatial biostatistics (U Michigan)
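Stratified sampling for validation can be sketched as drawing a fixed number of slides from each stratum rather than from the pooled collection. The example below assumes, for illustration only, that strata are categories such as cancer type. The slide identifiers and stratum names are made up, and the actual sampling design used in the study is not described on this slide.

```python
import random
from collections import defaultdict

def stratified_sample(slides, per_stratum, seed=0):
    """Draw up to per_stratum slide ids from each stratum, reproducibly.

    slides: list of (slide_id, stratum) pairs.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for slide_id, stratum in slides:
        groups[stratum].append(slide_id)
    return {
        stratum: sorted(rng.sample(ids, min(per_stratum, len(ids))))
        for stratum, ids in sorted(groups.items())
    }

# Hypothetical slide ids and strata.
SLIDES = [("s1", "A"), ("s2", "A"), ("s3", "A"), ("s4", "B")]
picked = stratified_sample(SLIDES, per_stratum=2, seed=1)
```

Sampling per stratum keeps small groups represented in the validation set, which pooled random sampling would not guarantee.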
  • 55. TIL Pattern Descriptions Qualitative (Alex Lazar, Raj Gupta) • “Brisk, diffuse” - diffusely infiltrative TILs scattered throughout at least 30% of the area of the tumor (1,856 cases); • “Brisk, band-like” - band-like boundaries bordering the tumor at its periphery (1,185); • “Non-brisk, multi-focal” - loosely scattered TILs present in less than 30% but more than 5% of the area of the tumor (1,083); • “Non-brisk, focal” - TILs scattered throughout less than 5% but greater than 1% of the area of the tumor (874); • “None” - < 1% TILs, in 143 cases Quantitative – Arvind Rao • Agglomerative clustering • Cluster indices representing cluster number, density, cluster size, distance between clusters • Traditional spatial statistics measures • R package clusterCrit by Bernard Desgraupes - Ball-Hall, Banfield-Raftery, C Index, and Determinant Ratio indices
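As an illustration of the quantitative indices listed above, the Ball-Hall index is the mean, over clusters, of each cluster's mean squared distance to its own centroid. The sketch below computes it in plain Python for already-clustered points. It is not the clusterCrit implementation, and the toy coordinates stand in for TIL patch positions.

```python
def ball_hall(points, labels):
    """Mean over clusters of the mean squared distance to the cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        dispersion = sum(
            sum((m[d] - centroid[d]) ** 2 for d in range(dim)) for m in members
        )
        total += dispersion / len(members)
    return total / len(clusters)

# Two toy clusters of patch coordinates.
POINTS = [(0, 0), (2, 0), (10, 10), (10, 14)]
LABELS = [0, 0, 1, 1]
index = ball_hall(POINTS, LABELS)
```

Lower values indicate tighter clusters; in the toy data the first cluster contributes a mean squared distance of 1 and the second 4.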
  • 57. Use of Pathology Deep Learning Methods in Multi-institutional Consortia • NCI Quantitative Imaging for Pathology (QuIP): Stony Brook, Emory, MD Anderson, Institute for Systems Biology, Oak Ridge • NCI SEER Pathology: Stony Brook, Emory, Rutgers, University of Kentucky (three Cancer registries) • Cancer Imaging Archive: Arkansas, Stony Brook, Emory (Stony Brook leads Pathology component) • Virtual Tissue Repository: Led by NCI SEER; Stony Brook, Emory • TIES Research Network - Integrated Pathology text and imaging: Pittsburgh, Stony Brook main sites, 6+ other sites (Stony Brook leads digital Pathology)
  • 58. SEER coverage includes 31.9 percent of Whites, 30.0 percent of African Americans, 44.0 percent of Hispanics, 49.3 percent of American Indians and Alaska Natives, 57.5 percent of Asians, and 68.5 percent of Hawaiian/Pacific Islanders.
  • 59. Methods and tools for integrating pathomics data into cancer registries Saltz, Sharma, Foran and Durban • Enhance SEER registry data with machine learning based classifications and quantitative pathomics feature sets. • The New Jersey State Cancer Registry, Georgia and Kentucky State Cancer Registries • Prostate Cancer, Lymphoma and NSCLC • Repository of high‐quality digitized pathology images for subjects whose data is being collected by the registries. • Extract computational features and establish deep linkages with registry data, thus enabling the creation of information‐rich, population cohorts containing objective imaging and clinical attributes
  • 60. Future Exascale Efforts I - Interactive Incorporation of high throughput training data • Iterative paradigm of human in the loop training set correction • For example, current training in TIL application involves roughly 100K image patches • Our 5K WSI dataset, iteratively corrected by humans, now has 500M image patches that can be used for training • Super-resolution semantic segmentation can make use of multiple types of sensor and map data to provide ground truth – sensor and human markup integration challenge • Crucial issue is support for interactive exploration of impact on training data • Rapid iteration on impact of variations in CNN architecture also crucial
  • 61. Future Exascale Efforts II – Leveraging Deep Learning to Optimize Systems Software Performance • Methods for optimizing movement of computation and data through complex storage hierarchies • My past work in this area has assumed the need for simple, effective heuristics • Was developed before the acute need for this kind of thing (but does have 487 citations)
  • 62. Active Disk/Data Cutter Re-Do with Deep Learning • Complex storage and I/O Systems require intelligent management of work and data • Extreme scale computing systems can support complex optimizations • Deep learning systems capable of carrying out policies with previously infeasible levels of complexity
  • 63. Future Exascale Efforts III • Scaling up whole slide imaging analytics • Current studies involve order 10K whole slide images • TIES consortium aims to obtain 500M slides over next three years • SEER will ultimately require roughly 6M whole slide images • Human tumor and human tissue atlas are adding 3rd dimension and deep molecular annotation to analysis
  • 65. ITCR Team Stony Brook University Joel Saltz Tahsin Kurc Yi Gao Allen Tannenbaum Erich Bremer Jonas Almeida Alina Jasniewski Fusheng Wang Tammy DiPrima Andrew White Le Hou Furqan Baig Mary Saltz Raj Gupta Emory University Ashish Sharma Adam Marcus Oak Ridge National Laboratory Scott Klasky Dave Pugmire Jeremy Logan Yale University Michael Krauthammer Harvard University Rick Cummings
  • 66. Funding – Thanks! • This work was supported in part by U24CA180924, U24CA215109, NCIP/Leidos 14X138 and HHSN261200800001E, UG3CA225021-01 from the NCI; R01LM011119-01 and R01LM009239 from the NLM • This research used resources provided by the National Science Foundation XSEDE Science Gateways program under grant TG-ASC130023 and the Keeneland Computing Facility at the Georgia Institute of Technology, which is supported by the NSF under Contract OCI-0910735.

Editor's Notes

  • #18: This is Dr. Shimada’s prognosis classification