MR-PLIP: Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation
Presented To: Dr. Waqas Sultani & Mr. Abdul-Rehman
Presented By: M Usama
Introduction
1. Pathology is the branch of medicine that studies the causes, nature, and effects of diseases.
2. Pathologists are the detectives of the medical world: they examine tissues, blood, cells, and body fluids to figure out why a person is ill and how to treat them.
3. WSI stands for Whole Slide Image.
4. Experienced pathologists analyze these WSIs to identify established cancer subtypes and explore potential new forms of the disease.
5. WSIs are captured using digital scanning technology.
[Figure: a Whole Slide Image]
Background
How a pathologist analyzes a WSI:
1. View the image at the lowest resolution level to focus on the overall architectural layout.
2. View the image at high resolution to inspect the tissues and cells in specific areas.
3. Identify and scrutinize tumor epithelial cells (abnormal in shape, size, nucleus, and arrangement).
Takeaway: to effectively model pathology data, it is essential to capture both the architectural overview and the high-resolution cellular details.
Limitations of Previous Vision Encoders
1. They treat each resolution independently.
2. They use a single resolution from the WSI, which limits their ability to understand both global context and local details.
3. They fail to connect the dots: they cannot model relationships across different resolutions (5×, 10×, 20×, 40×).
Limitations of Previous Vision Encoders (cont.)
1. Textual descriptions generated by Quilt-LLaVA vary across resolution levels, even with the same prompt.
2. Moving toward higher magnification (5× to 40×) shifts the focus from contextual to detailed information.
3. Previous VLMs ignored how context changes across magnifications, failed to use text to guide this multiscale learning, and were limited to single-level learning.
How MR-PLIP Resolves the Problem
Key Innovations:
1. Extract patches at multiple resolution levels.
2. Obtain a textual description for each resolution using Quilt-LLaVA.
3. Learn to connect and align visual and textual features across zoom levels using a multimodal encoder.
Now the Problem for MR-PLIP
With great detail comes great complexity: MR-PLIP's multiresolution patch-text extraction enriches learning, but it also amplifies data volume and organizational challenges.
Organizing Dataset
MR-PLIP employs a parent-child hierarchy to:
1. Structure and organize the multiresolution data.
2. Enable the model to learn both global context and fine-grained details cohesively.
Patch sampling (see the sketch after this list):
- 20 random patches from each WSI at 5× (2 µm/pixel), each sized 512 × 512.
- Each 5× patch yields 4 patches at 10× (1 µm/pixel).
- Each 5× patch yields 16 patches at 20× (0.5 µm/pixel).
- Each 5× patch yields 64 patches at 40× (0.25 µm/pixel).
- Total per hierarchy: 1 (original 5×) + 4 (10×) + 16 (20×) + 64 (40×) = 85 patches.
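To make the hierarchy concrete, here is a minimal sketch of how one 5× patch expands into its 85-patch quadtree. The coordinate convention (top-left origin, coordinates expressed in each level's own pixel grid) and the helper names are illustrative assumptions, not the paper's implementation.

```python
PATCH = 512  # patch side length in pixels at every level (assumption from the slide)

def children(x, y, mag):
    """Return the 4 child patches at 2x magnification covering the
    same tissue region as the 512x512 parent patch at (x, y)."""
    # Doubling the magnification doubles pixel coordinates, so the parent's
    # footprint becomes 1024x1024 at the finer level: four 512x512 quadrants.
    px, py = 2 * x, 2 * y
    return [(px + dx, py + dy, 2 * mag)
            for dy in (0, PATCH) for dx in (0, PATCH)]

def hierarchy(x, y, mag=5, max_mag=40):
    """Enumerate the whole 5x -> 40x quadtree rooted at one 5x patch."""
    nodes = [(x, y, mag)]
    if mag < max_mag:
        for cx, cy, cmag in children(x, y, mag):
            nodes.extend(hierarchy(cx, cy, cmag, max_mag))
    return nodes

tree = hierarchy(0, 0)
print(len(tree))  # 85 = 1 (5x) + 4 (10x) + 16 (20x) + 64 (40x)
```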
Organizing Dataset (cont.)
1. Used 20,000 WSIs from the TCGA dataset.
2. 20,000 WSIs × 20 patches = 400,000 total 5× patches.
3. Each 5× patch expands to 1 (original 5×) + 4 (10×) + 16 (20×) + 64 (40×) = 85 patches.
4. 400,000 × 85 = 34,000,000 patches (checked below).
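As a quick sanity check, the slide's counts multiply out directly:

```python
# Dataset-size arithmetic from the slide above.
num_wsis = 20_000               # WSIs from TCGA
patches_per_wsi = 20            # random 5x patches sampled per WSI
per_tree = 1 + 4 + 16 + 64      # patches in one 5x -> 40x hierarchy (85)

root_patches = num_wsis * patches_per_wsi
total_patches = root_patches * per_tree
print(root_patches)   # 400000
print(total_patches)  # 34000000
```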
MR-PLIP Architecture
Notation: p^r_{i,j} denotes a patch taken from a WSI, where
- i is the index of the WSI,
- j is the patch number,
- r is the resolution level.
CVTA Loss
Symbol     Meaning
vₒ         Total number of patches (visual features) in the visual bag Bᵥᵢⱼ
k          Total number of textual keywords in the textual bag Bₜᵢⱼ
k₀         Number of top relevant (positive) keywords selected for each patch
vₐ         Visual feature vector of the a-th patch in the bag
wᵦ         Textual keyword feature vector of the b-th keyword
w⁺ᵦ        One of the top-k₀ keywords most similar to vₐ
τ (tau)    Temperature hyperparameter (initially 0.07); controls how sharply the model focuses on the most similar keywords
Cosine similarity, computed as cos(vₐ, wᵦ) = (vₐ · wᵦ) / (‖vₐ‖ ‖wᵦ‖)
Procedure:
1. Measure the cosine similarity between the textual features and the visual features.
2. Select the top-k₀ words with the highest cosine similarity and treat them as positive words.
3. Treat the remaining words, with low cosine similarity, as negative words.
4. Filter out the irrelevant words.
Why select only the top-k₀ words? Because not all words in a description are important. Using the full description:
- Adds noise.
- Includes irrelevant terms (e.g., "this image primarily shows...").
Selecting the top-k₀ words ensures that:
- Only the most meaningful, patch-relevant words are kept.
- Dimensionality is reduced and model alignment improves.
- The model focuses on keywords like "lymphocyte", "tumor", and "sinuses".
It works like attention: focus only on what matters.
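A minimal sketch of this selection step in PyTorch, assuming an InfoNCE-style contrastive form built from the slide's symbols (vₐ, wᵦ, k₀, τ); the paper's exact CVTA loss may differ:

```python
import torch
import torch.nn.functional as F

def cvta_topk(v, w, k0, tau=0.07):
    """v: (v_o, d) visual features of patches in the visual bag.
       w: (k, d)  textual features of keywords in the textual bag.
       Selects the top-k0 keywords per patch as positives; the remaining
       low-similarity keywords act as negatives via the softmax."""
    v = F.normalize(v, dim=-1)
    w = F.normalize(w, dim=-1)
    sim = v @ w.T                            # (v_o, k) cosine similarities
    _, top_idx = sim.topk(k0, dim=-1)        # indices of top-k0 positives per patch
    log_prob = (sim / tau).log_softmax(dim=-1)   # temperature-scaled softmax over all keywords
    pos_log_prob = log_prob.gather(1, top_idx).mean(dim=-1)  # avg. over positives
    return -pos_log_prob.mean(), top_idx

loss, keep = cvta_topk(torch.randn(8, 256), torch.randn(40, 256), k0=5)
```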
Multi-Resolution Text-Guided Visual Representation Alignment (MRTVA)
- Concatenate the visual features and the top-k₀ words to get the enriched information vector Z^r_{i,j} (sketched below).
- This facilitates aligning visual features with their corresponding optimal textual representations.
- The fused features are passed into the MRTVA loss for contextual learning from low resolution to high resolution.
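A toy illustration of the concatenation that forms Z^r_{i,j}, assuming simple flatten-and-concatenate fusion; in MR-PLIP the fused embedding comes from the multimodal encoder shown on the next slide:

```python
import torch

v = torch.randn(256)          # visual feature of one patch (illustrative size)
w_pos = torch.randn(5, 256)   # its top-k0 = 5 positive keyword features

z = torch.cat([v, w_pos.flatten()])   # enriched vector Z^r_{i,j}
print(z.shape)                # torch.Size([1536]) = 256 + 5 * 256
```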
MRTVA Loss Flow
- Parent patch → Multi-Modal Encoder (encodes V and W⁺) → fused embedding zᵖ
- Child patch → Multi-Modal Encoder (encodes V and W⁺) → fused embedding zᶜ
- Projection head: hᵖ = Proj(zᵖ), hᶜ = Proj(zᶜ)
- Prediction head: gᶜ = Pred(hᶜ), gᵖ = Pred(hᵖ)
- MRTVA loss: L = -cosine(hᵖ, gᶜ), with stop-gradient on the target branch, so gradients flow only through the predicting branch.
- The loss is symmetric, applied in both directions (parent → child and child → parent): L = -½ cosine(gᶜ, Sg(hᵖ)) - ½ cosine(gᵖ, Sg(hᶜ))
Term          Meaning
h^p_{i,j}     Projection of the parent feature vector
g^c_{i,j}     Prediction from the child vector (tries to guess the parent)
Sg(...)       Stop-gradient: gradients are frozen here (no backpropagation)
g^p_{i,j}     Prediction from the parent vector (now predicting what the child looks like)
h^c_{i,j}     Projection of the child feature vector
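A minimal sketch of this symmetric, stop-gradient loss in PyTorch; the single-layer Proj/Pred heads and the feature sizes are illustrative stand-ins (SimSiam-style heads are typically MLPs), not the paper's architecture:

```python
import torch
import torch.nn.functional as F

proj = torch.nn.Linear(512, 128)   # projection head (Proj), simplified
pred = torch.nn.Linear(128, 128)   # prediction head (Pred), simplified

def neg_cos(p, h):
    """Negative cosine similarity; h is detached (stop-gradient, Sg)."""
    return -F.cosine_similarity(p, h.detach(), dim=-1).mean()

def mrtva_loss(z_parent, z_child):
    h_p, h_c = proj(z_parent), proj(z_child)   # h^p, h^c
    g_p, g_c = pred(h_p), pred(h_c)            # g^p, g^c
    # Symmetric: the child predicts the parent, and the parent predicts the child.
    return 0.5 * neg_cos(g_c, h_p) + 0.5 * neg_cos(g_p, h_c)

loss = mrtva_loss(torch.randn(4, 512), torch.randn(4, 512))
loss.backward()
```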
Results
- Zero-shot tile-level and WSI-level classification performance of MR-PLIP compared with existing SOTA VLMs, in terms of weighted F1-score, with PE and NPE.
- Performance comparison of the proposed MR-PLIP with existing SOTA on tile-level classification using linear-probe evaluation and on WSI-level classification using weakly supervised learning.
Thank You
MR-PLIP Architecture
1. Image-Text Contrasting (ITC)
2. Image-Text Matching (ITM)
3. Masked Language Modeling (MLM)
ITC uses softmax-normalized similarities between image-to-text (i2t) and text-to-image (t2i) pairs.
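A minimal sketch of the ITC term, assuming a CLIP-style batch layout in which matched image-text pairs sit on the diagonal; it illustrates the softmax-normalized i2t and t2i similarities, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def itc_loss(img_feat, txt_feat, tau=0.07):
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    logits = img @ txt.T / tau                   # (B, B) similarity matrix
    targets = torch.arange(len(logits))          # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)      # softmax over texts per image
    loss_t2i = F.cross_entropy(logits.T, targets)    # softmax over images per text
    return 0.5 * (loss_i2t + loss_t2i)

loss = itc_loss(torch.randn(16, 512), torch.randn(16, 512))
```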
Simple Siamese Representation Learning (SimSiam)
- A self-supervised learning framework that teaches a model to understand images without any labels.
- A predictor transforms z₁ into p₁, a prediction of what the other view's features (z₂) should be.
Loss Function
SimSiam's symmetric loss (the standard form, which MRTVA adapts to parent-child pairs):
D(p₁, z₂) = -cos(p₁, Sg(z₂))
L = ½ D(p₁, z₂) + ½ D(p₂, z₁)
Organizing Dataset (how the 34-million-patch dataset is formed)
[Diagram: one 5× patch subdivided into 4 patches at 10×, 16 at 20×, and 64 at 40×]
1. 20 random patches are taken from each WSI at 5× (2 µm/pixel), each sized 512 × 512.
2. Then each of these 5× patches is zoomed in to get:
   - 4 patches at 10× (1 µm/pixel)
   - 16 patches at 20× (0.5 µm/pixel)
   - 64 patches at 40× (0.25 µm/pixel)
3. 20,000 WSIs × 20 patches = 400,000 total 5× patches.
4. 1 (original 5×) + 4 (10×) + 16 (20×) + 64 (40×) = 85 patches.
5. 400,000 × 85 = 34,000,000 patches.