Learning, Training, Classification, Common Sense and Exascale Computing

Joel Saltz MD, PhD
Chair and Professor Department of Biomedical Informatics
Professor Department of Computer Science
Cherith Endowed Chair
Stony Brook University
Oak Ridge National Laboratory, December 3, 2018

Overview
• Sensor data analysis is becoming an integral component of scientific
computing workflows
• Results from numerical simulations are often also spatio-temporal data
• In virtually all scientific fields, deep learning methods are being adopted
to make sense of sensor and scientific data
• Extremely broad implications – making sense of data via deep learning
generally involves a type of semantically based data compression
• Making sense of data also allows intelligent methods for steering data
acquisition and computation
• These issues were debated and discussed in great detail at the recent
DOE SSIO workshop
• Hot off the press – Nov 29-30th BDEC workshop – same basic conclusions

https://www.exascale.org/bdec/

Manish Parashar, Director of Advanced Cyberinfrastructure, NSF
BDEC 2 Kickoff – Bloomington, Indiana

Dan Reed, Provost, University of Utah
BDEC 2 Kickoff – Bloomington, Indiana

Scene Understanding
●Scene understanding from coarse-grained to
fine-grained inference
●Classify scene as a whole
●Identify objects
●Specify where objects are located
●Identify each object category’s pixels or
voxels
●Identify pixels and voxels associated with
each instance of each object
Image from Garcia et al arXiv 1704.06857

Scenes from Science Applications
• Size: Gigabyte to Terabyte per image; Petabyte to Exabyte
per collection
• Number of objects: can be billions or more, or very few –
needle in a haystack
• Categories of object: Many, many ways of classifying natural
objects so category definition depends on what question is
being asked
• Object boundaries: Depends on exactly how an object is
defined. What is the boundary of a hurricane, tumor or city?
• Sources: Sensors and cameras, scientific simulations

Detection of oil leaks and spills (Serge Petiton from U. de Lille and TOTAL)

Wind Farm Simulation – Dimitri Mavriplis

Neural Networks and Deep Learning

Training
●Classify scene as a whole
●Identify objects
●Specify where objects are located
●Identify each object category’s pixels or
voxels
●Identify pixels and voxels associated with
each instance of each object
Image from Garcia et al arXiv 1704.06857
Increasing Training Effort

Deep Learning, Sensor/Scientific Data, Architectures and
Training
• Extremely large datasets
• Humans, animals and human-built objects (tables, chairs,
houses, etc.) are different from objects from the natural world,
which can be harder to identify, classify and segment
• Coupled deep learning architectures and training strategies are
crucial for success
• In this talk, I will present several examples of work we’ve
done and results obtained

Prehistory – first decade of the 2000s
• Reverse engineer classification – break scene (whole slide
image) into patches
• Create machine learning model and generate training data by
examining characteristics of each patch, understanding
classification system and classifying each training patch
• Close collaboration with applications person (Hiro Shimada)
• Patch level predictions aggregated to generate predictions for
each scene
https://www.sciencedirect.com/science/article/abs/pii/S0031320308003439

Early Steps to Pathology Computer Aided Classification
2005-2010
Gurcan, Shimada, Kong, Saltz
Hiro Shimada, Metin Gurcan, Jun Kong, Lee Cooper, Joel Saltz
BISTI/NIBIB Center for Grid Enabled Image Analysis - P20 EB000591, PI Saltz

Neuroblastoma Classification
FH: favorable histology; UH: unfavorable histology
CANCER 2003; 98:2274-81
[Flowchart; branch labels as recovered from the slide]
• Schwannian development ≥50%, grossly visible nodule(s) absent,
microscopic neuroblastic foci absent: Ganglioneuroma (Schwannian
stroma-dominant), maturing or mature subtype – FH
• Schwannian development ≥50%, grossly visible nodule(s) absent,
microscopic neuroblastic foci present: Ganglioneuroblastoma, Intermixed
(Schwannian stroma-rich) – FH
• Grossly visible nodule(s) present: Ganglioneuroblastoma, Nodular
(composite, Schwannian stroma-rich/stroma-dominant and stroma-poor) –
UH/FH*; variant forms*
• Schwannian development none to <50%: Neuroblastoma (Schwannian
stroma-poor), with undifferentiated, poorly differentiated and
differentiating subtypes
• Within the neuroblastoma subtypes, FH or UH is assigned from the count
of mitotic & karyorrhectic cells (<100, 100-200 or ≥200 per 5,000 cells)
and age (cutoffs at 1.5 yr and 5 yr)

Multi-Scale Machine Learning Based Shimada
Classification System
• Background Identification
• Image Decomposition (Multi-resolution levels)
• Image Segmentation (EMLDA)
• Feature Construction (2nd order statistics, Tonal Features)
• Feature Extraction (LDA) + Classification (Bayesian)
• Multi-resolution Layer Controller (Confidence Region)
[Flowchart; decision nodes carry Yes/No branches]
TESTING: Image Tile → Initialization (I = L) → Background? (Label) →
Create Image I(L) → Segmentation → Feature Construction → Feature
Extraction → Classification → Within Confidence Region? → I = I - 1 →
I > 1?
TRAINING: Training Tiles → Down-sampling → Segmentation → Feature
Construction → Feature Extraction → Classifier Training
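
The multi-resolution layer controller can be read as a coarse-to-fine loop: classify the tile at the coarsest level, accept the answer if it is confident enough, and otherwise move to a finer level. The sketch below is one possible reading, not the original implementation. The `Classifier` interface, the function name and the scalar `min_confidence` test (standing in for the "confidence region" check) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a per-level classifier maps tile features to
# (label, confidence). Neither name comes from the original system.
Classifier = Callable[[Sequence[float]], Tuple[str, float]]


def classify_coarse_to_fine(
    pyramid: List[Sequence[float]],
    classifiers: List[Classifier],
    min_confidence: float,
) -> Tuple[str, int]:
    """Return (label, level used).

    pyramid[k] holds the features of the tile at level k + 1, where level 1
    is full resolution and level L = len(pyramid) is the coarsest.
    """
    if not pyramid or len(pyramid) != len(classifiers):
        raise ValueError("need one classifier per resolution level")
    level = len(pyramid)  # I = L: start at the coarsest level
    while True:
        label, confidence = classifiers[level - 1](pyramid[level - 1])
        # Assumed stand-in for the "within confidence region?" test.
        if confidence >= min_confidence or level == 1:
            return label, level
        level -= 1  # I = I - 1: retry at the next finer level
```

Starting at the coarsest level means most tiles are decided cheaply, and only ambiguous tiles pay for full-resolution processing.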

Training using only scene level categories
• Gigapixel image scenes – only scene level labels are provided
• In Pathology application – the Pathology classification of a whole slide
image is given
• No fine level training information
• Method must infer which patches are crucial (discriminative) to the
scene level classification
• Hidden variable representing likelihood that the label of the patch is the
same as the true label for the entire image
• Neural network architecture carries out computations to determine
which patches are discriminative
• Patch level classifiers are used to determine scene level prediction using
multi-class logistic regression or SVM
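
A minimal sketch of the two ingredients described above, under stated assumptions: the patch CNN has already produced, for each patch, a probability of the slide-level label and a predicted class. `select_discriminative` keeps the patches most consistent with the slide label (one step of an EM-style loop), and `class_histogram` builds a normalized class-count vector, one plausible input to the slide-level logistic regression or SVM. Function names, the `keep_fraction` parameter and the top-fraction rule are illustrative choices, not the published procedure.

```python
from typing import List, Sequence


def select_discriminative(patch_probs: Sequence[float], keep_fraction: float) -> List[int]:
    """Indices of the patches whose probability of the slide label is highest.

    patch_probs[i] is the current model's probability that patch i carries
    the slide-level label (the hidden variable in the bullets above).
    """
    if not patch_probs:
        return []
    n_keep = max(1, int(keep_fraction * len(patch_probs)))
    ranked = sorted(range(len(patch_probs)), key=lambda i: patch_probs[i], reverse=True)
    return sorted(ranked[:n_keep])


def class_histogram(patch_classes: Sequence[int], n_classes: int) -> List[float]:
    """Normalized counts of patch-level predictions for one slide."""
    counts = [0] * n_classes
    for c in patch_classes:
        counts[c] += 1
    total = len(patch_classes)
    return [c / total for c in counts] if total else [0.0] * n_classes
```

In a full loop the model would be retrained on the selected patches and the selection repeated until it stabilizes; the histogram of the final patch predictions then feeds the slide-level classifier.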

Deep Learning - Brain Tumor Classification – CVPR 2016
https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Hou_Patch-Based_Convolutional_Neural_CVPR_2016_paper.html

Heterogeneity: category shared by two patches in each
column

EM method iteratively eliminates non-discriminative patches

Image level decision fusion model

Brain
Le Hou, Dimitris Samaras, Tahsin Kurc, Yi Gao, Liz Vanner, James
Davis, Joel Saltz

Specify where objects are located -> Semantic
Segmentation
Input: Percent pixels in each patch belonging to a category
Output: High resolution semantic segmentation
Hou, Jojic, Malkin, Robinson, Samaras, Saltz (Microsoft Research and
Stony Brook University)
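
To make the input/output relationship concrete, here is a deliberately simplified training signal: for each low-resolution block, compare the average of the network's per-pixel class probabilities with the given class fractions. This squared-error form is an assumption chosen for clarity and is not claimed to be the loss used by the authors; `block_fraction_loss` and its arguments are hypothetical names.

```python
from typing import Sequence


def block_fraction_loss(
    pixel_probs: Sequence[Sequence[float]],
    target_fractions: Sequence[float],
) -> float:
    """Squared error between predicted and given class fractions in one block.

    pixel_probs[i][c] is the predicted probability of class c at pixel i of
    the block; target_fractions[c] is the known fraction of class-c pixels.
    """
    n_pixels = len(pixel_probs)
    if n_pixels == 0:
        raise ValueError("block has no pixels")
    n_classes = len(target_fractions)
    predicted = [sum(p[c] for p in pixel_probs) / n_pixels for c in range(n_classes)]
    return sum((predicted[c] - target_fractions[c]) ** 2 for c in range(n_classes))
```

The loss is zero whenever the high-resolution prediction reproduces the block-level fractions, so the low-resolution labels constrain the output without fixing where inside the block each class appears.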

Input image with low resolution labels – generates a high
resolution image – “Super resolution”

Labels from the Multi-Resolution Land Characteristics
Consortium - https://www.mrlc.gov/tools

Example accuracy and Jaccard performance data: see upcoming paper
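
For reference, the two metrics named here can be computed from flattened label maps as follows. This is a generic sketch, with the Jaccard index taken as intersection over union for one class; the convention of returning 1.0 when a class is absent from both maps is an assumption.

```python
from typing import Sequence


def pixel_accuracy(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Fraction of pixels whose predicted label matches the ground truth."""
    if len(pred) != len(truth) or not truth:
        raise ValueError("label maps must be non-empty and equally sized")
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def jaccard(pred: Sequence[int], truth: Sequence[int], label: int) -> float:
    """Intersection over union of the pixels assigned to one class."""
    intersection = sum(p == label and t == label for p, t in zip(pred, truth))
    union = sum(p == label or t == label for p, t in zip(pred, truth))
    return intersection / union if union else 1.0
```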

Ground truth vs Super-resolution and High Resolution Models

Application to Pathology Lymphocyte Segmentation

Instance Segmentation using synthetic scenes
Uses Generative Adversarial Networks (GANs)
Two players: a generator and a discriminator
Generator generates new instances of an object while the
discriminator determines whether the new instance belongs to the
actual dataset
Simple version of this – generate scene that looks realistic.
Because the scene is artificial, we can semantically segment all
object instances
We then use the artificial scenes as training data
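
The two-player game can be written as the standard GAN minimax objective, given here for orientation; the specific variant used in this work may differ. Here G is the generator, D the discriminator, x a real sample and z the generator's random input.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator is rewarded for telling real samples from generated ones, and the generator is rewarded for producing samples the discriminator accepts.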

The challenge
• Conceptually identical objects can look different in different kinds of
scenes
• Generate training data to find objects in new types of scenes
• In Pathology, nuclei look different in different tissue types – e.g. brain,
kidney, liver, prostate, etc.
• Clearly not limited to Pathology – consider segmenting trees, vortices,
tornadoes or microfossils
Microfossils - TOTAL

Requirements and Approach
• GAN scene generation needs a good starting point – we need to
generate somewhat realistic artificial tissue samples
• Use excellent and OK artificial scenes as training data by using
GAN provided artificial tissue quality estimates
• Evaluate impact on segmentation loss function and give
preference to “hard” examples
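
One way to combine the last two bullets is sketched below, assuming each synthetic scene comes with a per-example segmentation loss and a quality estimate in [0, 1]. The function keeps the hardest examples (largest loss) and averages their losses weighted by quality. The name `weighted_hard_example_loss`, the `hard_fraction` parameter and this particular combination rule are illustrative assumptions, not the published method.

```python
from typing import Sequence


def weighted_hard_example_loss(
    losses: Sequence[float],
    quality: Sequence[float],
    hard_fraction: float = 0.5,
) -> float:
    """Quality-weighted mean loss over the hardest examples in a batch.

    losses[i] is the segmentation loss of synthetic scene i; quality[i] is
    its realism estimate (higher means more realistic).
    """
    if len(losses) != len(quality) or not losses:
        raise ValueError("need one quality estimate per loss")
    n_keep = max(1, int(hard_fraction * len(losses)))
    # "Hard" examples are taken to be those with the largest current loss.
    hardest = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)[:n_keep]
    total_weight = sum(quality[i] for i in hardest)
    if total_weight == 0:
        return 0.0
    return sum(quality[i] * losses[i] for i in hardest) / total_weight
```

Weighting by quality lets merely "OK" scenes contribute without dominating, while the hard-example selection focuses training on scenes the segmenter still gets wrong.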

Generation of starting point synthetic tissue

GAN based refinement of synthetic tissue

Examples of generated tissue with quality estimates

Deep Learning Pipeline for NCI consortium research
Deep learning problem - specify where objects are located
• National Cancer Institute Sponsored Cancer Genome Atlas Pan Cancer
Atlas Collaboration
• High resolution (250K patches per whole slide image) mapping of
immune cells
• 13 Cancer Types
• Correlation of spatial immune cell distribution with very detailed
molecular datasets
• Datasets available in Cancer Imaging Archive; publications in Cell journal
series
• Linked learning/molecular data can be interactively explored in CRI
iATLAS - https://www.cri-iatlas.org/

TCGA Pan Cancer Atlas – Immune Landscape of Cancer
• Six identified immune subtypes span cancer tissue types and molecular
subtypes
• Immune subtypes differ by somatic aberrations, microenvironment, and
survival
• Multiple control modalities of molecular networks affect tumor-immune
interactions
• These analyses serve as a resource for exploring immunogenicity across
cancer types
http://www.cell.com/immunity/fulltext/S1074-7613(18)30121-3

• Stony Brook, Institute for Systems Biology, MD Anderson, Emory group
• TCGA Pan Cancer Immune Group – led by ISB researchers
• Deep dive into linked molecular and image based characterization of
cancer related immune response
http://www.cell.com/cell-reports/pdf/S2211-1247(18)30447-9.pdf

●Deep learning based computational stain for staining tumor infiltrating
lymphocytes (TILs)
●TIL patterns generated from 4,759 TCGA subjects (5,202 H&E slides),
13 cancer types
●Computationally stained TILs correlate with pathologist visual
assessment and molecular estimates
●TIL patterns linked to tumor and immune molecular features, cancer
type, and outcome

Deep Learning and Lymphocytes
Le Hou – Graduate Student, Computer Science
Vu Nguyen – Graduate Student, Computer Science
Anne Zhao – Pathology Informatics; Biomedical Informatics, Pathology
(now Surg Path Fellow SBM)
Raj Gupta – Pathology Informatics; Biomedical Informatics, Pathology

Importance of Immune System in Cancer Treatment and Prognosis
• Tumor spatial context and cellular heterogeneity are important
in cancer prognosis
• Spatial TIL densities in different tumor regions have been
shown to have high prognostic value
• Immune related assays used to steer cancer immune therapy
• TIL maps being computed for SEER Pathology studies and will
be routinely computed for data contributed to TCIA archive
• Ongoing study to relate TIL patterns with immune gene
expression groups and patient response

Training, Model Creation
• Algorithm first trained on image patches
• Several cooperating deep learning algorithms generate heat
maps
• Heat maps used to generate new predictions
• Companion molecular statistical data analysis pipelines

Training, threshold adjustment, quality control

We pre-train using a sparse autoencoder that “likes”
immune cells

Visual tools used to QC and iteratively refine
training data: Quantitative Imaging Pathology - QuIP Tool Set

Interactive Deep Learning Training Tool

Validation – Stratified sampling from 5K whole slide images
Arvind Rao, expert in spatial biostatistics (U Michigan)

TIL Pattern Descriptions
Qualitative (Alex Lazar, Raj Gupta)
• “Brisk, diffuse”: diffusely infiltrative TILs scattered throughout at
least 30% of the area of the tumor (1,856 cases)
• “Brisk, band-like”: band-like boundaries bordering the tumor at its
periphery (1,185)
• “Non-brisk, multi-focal”: loosely scattered TILs present in less than
30% but more than 5% of the area of the tumor (1,083)
• “Non-brisk, focal”: TILs scattered throughout less than 5% but greater
than 1% of the area of the tumor (874)
• “None”: < 1% TILs (143 cases)
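
The area cutoffs in the qualitative descriptions can be written as a small lookup, shown here only to make the thresholds explicit. The categories on the slide were assigned by pathologists, not by this rule. The "band-like" pattern depends on spatial arrangement and cannot be derived from area alone, so it is omitted. Placing values exactly at 5% or 1% in the lower category is an assumption, since the slide leaves those boundaries open.

```python
def til_category_from_area(til_fraction: float) -> str:
    """Map the fraction of tumor area containing TILs to a pattern name."""
    if not 0.0 <= til_fraction <= 1.0:
        raise ValueError("til_fraction must lie in [0, 1]")
    if til_fraction >= 0.30:
        return "Brisk, diffuse"
    if til_fraction > 0.05:
        return "Non-brisk, multi-focal"
    if til_fraction > 0.01:
        return "Non-brisk, focal"
    return "None"
```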
Quantitative – Arvind Rao
• Agglomerative clustering
• Cluster indices representing cluster number, density, cluster size,
distance between clusters
• Traditional spatial statistics measures
• R package clusterCrit by Bernard Desgraupes - Ball-Hall,
Banfield-Raftery, C Index, and Determinant Ratio indices
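
As an illustration of one of the listed indices, the Ball-Hall index is the mean, over clusters, of each cluster's mean squared distance to its centroid. The sketch below re-implements that definition in plain Python for TIL patch coordinates. It assumes cluster labels are already available (for example from agglomerative clustering) and is not the clusterCrit code itself; `ball_hall` is a hypothetical name.

```python
from typing import Dict, Hashable, List, Sequence


def ball_hall(points: Sequence[Sequence[float]], labels: Sequence[Hashable]) -> float:
    """Mean over clusters of the mean squared distance to the cluster centroid."""
    if len(points) != len(labels) or not points:
        raise ValueError("need one label per point")
    clusters: Dict[Hashable, List[Sequence[float]]] = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        # Mean dispersion of this cluster around its centroid.
        total += sum(
            sum((m[d] - centroid[d]) ** 2 for d in range(dim)) for m in members
        ) / len(members)
    return total / len(clusters)
```

Smaller values indicate tighter clusters, so the index gives one scalar summary of how compact the TIL clusters on a slide are.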

Use of Pathology Deep Learning Methods in Multi-institutional
Consortia
• NCI Quantitative Imaging for Pathology (QuIP): Stony Brook, Emory,
MD Anderson, Institute for Systems Biology, Oak Ridge
• NCI SEER Pathology: Stony Brook, Emory, Rutgers, University of
Kentucky (three Cancer registries)
• Cancer Imaging Archive: Arkansas, Stony Brook, Emory (Stony Brook
leads Pathology component)
• Virtual Tissue Repository: Led by NCI SEER; Stony Brook, Emory
• TIES Research Network - Integrated Pathology text and imaging:
Pittsburgh, Stony Brook main sites, 6+ other sites (Stony Brook leads
digital Pathology)

SEER coverage includes 31.9 percent of Whites, 30.0
percent of African Americans, 44.0 percent of Hispanics,
49.3 percent of American Indians and Alaska Natives,
57.5 percent of Asians, and 68.5 percent of
Hawaiian/Pacific Islanders.

Methods and tools for integrating pathomics data into cancer
registries
Saltz, Sharma, Foran and Durbin
• Enhance SEER registry data with machine learning based classifications
and quantitative pathomics feature sets.
• New Jersey, Georgia and Kentucky State Cancer Registries
• Prostate Cancer, Lymphoma and NSCLC
• Repository of high‐quality digitized pathology images for subjects
whose data is being collected by the registries.
• Extract computational features and establish deep linkages with
registry data, thus enabling the creation of information-rich population
cohorts containing objective imaging and clinical attributes

Future Exascale Efforts I - Interactive Incorporation of high
throughput training data
• Iterative paradigm of human in the loop training set correction
• For example, current training in TIL application involves roughly 100K
image patches
• Our 5K WSI dataset, iteratively corrected by humans, now has 500M
image patches that can be used for training
• Super-resolution semantic segmentation can make use of multiple types
of sensor and map data to provide ground truth – sensor and human
markup integration challenge
• Crucial issue is support for interactive exploration of impact on training
data
• Rapid iteration on impact of variations in CNN architecture also crucial

Future Exascale Efforts II – Leveraging Deep Learning to
Optimize Systems Software Performance
• Methods for optimizing
movement of computation
and data through complex
storage hierarchies
• My past work in this has
assumed the need for simple,
effective heuristics
• Was developed before the
acute need for this kind of
thing (but does have 487
citations)

Active Disk/Data Cutter Re-Do with Deep Learning
• Complex storage and I/O
Systems require intelligent
management of work and data
• Extreme scale computing
systems can support complex
optimizations
• Deep learning systems capable
of carrying out policies with
previously infeasible levels of
complexity

Future Exascale Efforts III
• Scaling up whole slide imaging
analytics
• Current studies involve on the order of
10K whole slide images
• TIES consortium aims to obtain
500M slides over next three
years
• SEER will ultimately require
roughly 6M whole slide images
• Human tumor and human tissue
atlas are adding 3rd dimension
and deep molecular annotation
to analysis

ITCR Team
Stony Brook University
Joel Saltz
Tahsin Kurc
Yi Gao
Allen Tannenbaum
Erich Bremer
Jonas Almeida
Alina Jasniewski
Fusheng Wang
Tammy DiPrima
Andrew White
Le Hou
Furqan Baig
Mary Saltz
Raj Gupta
Emory University
Ashish Sharma
Adam Marcus
Oak Ridge National Laboratory
Scott Klasky
Dave Pugmire
Jeremy Logan
Yale University
Michael Krauthammer
Harvard University
Rick Cummings

Funding – Thanks!
• This work was supported in part by U24CA180924,
U24CA215109, NCIP/Leidos 14X138 and
HHSN261200800001E, UG3CA225021-01 from the
NCI; R01LM011119-01 and R01LM009239 from the
NLM
• This research used resources provided by the
National Science Foundation XSEDE Science
Gateways program under grant TG-ASC130023 and
the Keeneland Computing Facility at the Georgia
Institute of Technology, which is supported by the
NSF under Contract OCI-0910735.
