MR-PLIP: Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation
Presented To: Dr. Waqas Sultani & Mr. Abdul-Rehman
Presented By: M Usama
Introduction
1. Pathology is the branch of medicine that studies the causes, nature, and effects of diseases.
2. Pathologists are the detectives of the medical world: they examine tissues, blood, cells, and body fluids to figure out why a person is ill and how to treat them.
3. WSI stands for Whole Slide Image.
4. Experienced pathologists analyze these WSIs to identify established cancer subtypes and explore potential new forms of the disease.
5. WSIs are captured using digital scanning technology.
[Figure: a Whole Slide Image]
Background
How a pathologist analyzes a WSI:
1. View the image at the lowest resolution level to focus on the overall architectural layout.
2. View the image at high resolution to inspect the tissues and cells in specific areas.
3. Identify and scrutinize tumor epithelial cells (abnormal in shape, size, nucleus, and arrangement).
Takeaway: to effectively model pathology data, it is essential to capture both the architectural overview and the high-resolution cellular details.
Limitations of Previous Vision Encoders
1. They treat each resolution independently.
2. They use a single resolution from the WSI, which limits their ability to understand both global context and local details.
3. They fail to connect the dots: they cannot model relationships across different resolutions (5×, 10×, 20×, 40×).
Limitations of Previous Vision Encoders (cont.)
1. Textual descriptions generated by Quilt-LLaVA vary across resolution levels, even with the same prompt.
2. Moving toward higher magnification (5× to 40×) shifts the focus from contextual to detailed information.
3. Previous VLMs ignored how context changes across magnifications, failed to use text to guide this multiscale learning, and were limited to single-level learning.
How MR-PLIP Resolves the Problem
Key Innovations:
1. Extract patches at multiple resolution levels.
2. Obtain a textual description for each resolution using Quilt-LLaVA.
3. Learn to connect and align visual and textual features across zoom levels using a multimodal encoder.
Now the Problem for MR-PLIP
With great detail comes great complexity: MR-PLIP's multiresolution patch-text extraction enriches learning, but it also amplifies data volume and organizational challenges.
Organizing Dataset
MR-PLIP employs a parent-child hierarchy to:
1. Structure and organize the multiresolution data.
2. Enable the model to learn both global context and fine-grained details cohesively.
Patch sampling (see the sketch after this list):
- 20 random patches from each WSI at 5× (2 µm/pixel), each sized 512 × 512.
- Each 5× patch yields 4 patches at 10× (1 µm/pixel).
- Each 5× patch yields 16 patches at 20× (0.5 µm/pixel).
- Each 5× patch yields 64 patches at 40× (0.25 µm/pixel).
- Total per hierarchy: 1 (original 5×) + 4 (10×) + 16 (20×) + 64 (40×) = 85 patches.
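To make the hierarchy concrete, here is a minimal sketch of how one 5× patch expands into its 85-patch quadtree. The coordinate convention (top-left origin, coordinates expressed in each level's own pixel grid) and the helper names are illustrative assumptions, not the paper's implementation.

```python
PATCH = 512  # patch side length in pixels at every level (assumption from the slide)

def children(x, y, mag):
    """Return the 4 child patches at 2x magnification covering the
    same tissue region as the 512x512 parent patch at (x, y)."""
    # Doubling the magnification doubles pixel coordinates, so the parent's
    # footprint becomes 1024x1024 at the finer level: four 512x512 quadrants.
    px, py = 2 * x, 2 * y
    return [(px + dx, py + dy, 2 * mag)
            for dy in (0, PATCH) for dx in (0, PATCH)]

def hierarchy(x, y, mag=5, max_mag=40):
    """Enumerate the whole 5x -> 40x quadtree rooted at one 5x patch."""
    nodes = [(x, y, mag)]
    if mag < max_mag:
        for cx, cy, cmag in children(x, y, mag):
            nodes.extend(hierarchy(cx, cy, cmag, max_mag))
    return nodes

tree = hierarchy(0, 0)
print(len(tree))  # 85 = 1 (5x) + 4 (10x) + 16 (20x) + 64 (40x)
```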
Organizing Dataset (cont.)
1. Used 20,000 WSIs from the TCGA dataset.
2. 20,000 WSIs × 20 patches = 400,000 total 5× patches.
3. Each 5× patch expands to 1 (original 5×) + 4 (10×) + 16 (20×) + 64 (40×) = 85 patches.
4. 400,000 × 85 = 34,000,000 patches (checked below).
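As a quick sanity check, the slide's counts multiply out directly:

```python
# Dataset-size arithmetic from the slide above.
num_wsis = 20_000               # WSIs from TCGA
patches_per_wsi = 20            # random 5x patches sampled per WSI
per_tree = 1 + 4 + 16 + 64      # patches in one 5x -> 40x hierarchy (85)

root_patches = num_wsis * patches_per_wsi
total_patches = root_patches * per_tree
print(root_patches)   # 400000
print(total_patches)  # 34000000
```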
MR-PLIP Architecture
Notation: p^r_{i,j} denotes a patch taken from a WSI, where
- i is the index of the WSI,
- j is the patch number,
- r is the resolution level.
CVTA Loss
Symbol     Meaning
vₒ         Total number of patches (visual features) in the visual bag Bᵥᵢⱼ
k          Total number of textual keywords in the textual bag Bₜᵢⱼ
k₀         Number of top relevant (positive) keywords selected for each patch
vₐ         Visual feature vector of the a-th patch in the bag
wᵦ         Textual keyword feature vector of the b-th keyword
w⁺ᵦ        One of the top-k₀ keywords most similar to vₐ
τ (tau)    Temperature hyperparameter (initially 0.07); controls how sharply the model focuses on the most similar keywords
Cosine similarity, computed as cos(vₐ, wᵦ) = (vₐ · wᵦ) / (‖vₐ‖ ‖wᵦ‖)
Procedure:
1. Measure the cosine similarity between the textual features and the visual features.
2. Select the top-k₀ words with the highest cosine similarity and treat them as positive words.
3. Treat the remaining words, with low cosine similarity, as negative words.
4. Filter out the irrelevant words.
Why select only the top-k₀ words? Because not all words in a description are important. Using the full description:
- Adds noise.
- Includes irrelevant terms (e.g., "this image primarily shows...").
Selecting the top-k₀ words ensures that:
- Only the most meaningful, patch-relevant words are kept.
- Dimensionality is reduced and model alignment improves.
- The model focuses on keywords like "lymphocyte", "tumor", and "sinuses".
It works like attention: focus only on what matters.
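A minimal sketch of this selection step in PyTorch, assuming an InfoNCE-style contrastive form built from the slide's symbols (vₐ, wᵦ, k₀, τ); the paper's exact CVTA loss may differ:

```python
import torch
import torch.nn.functional as F

def cvta_topk(v, w, k0, tau=0.07):
    """v: (v_o, d) visual features of patches in the visual bag.
       w: (k, d)  textual features of keywords in the textual bag.
       Selects the top-k0 keywords per patch as positives; the remaining
       low-similarity keywords act as negatives via the softmax."""
    v = F.normalize(v, dim=-1)
    w = F.normalize(w, dim=-1)
    sim = v @ w.T                            # (v_o, k) cosine similarities
    _, top_idx = sim.topk(k0, dim=-1)        # indices of top-k0 positives per patch
    log_prob = (sim / tau).log_softmax(dim=-1)   # temperature-scaled softmax over all keywords
    pos_log_prob = log_prob.gather(1, top_idx).mean(dim=-1)  # avg. over positives
    return -pos_log_prob.mean(), top_idx

loss, keep = cvta_topk(torch.randn(8, 256), torch.randn(40, 256), k0=5)
```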
Multi-Resolution Text-Guided Visual Representation Alignment (MRTVA)
- Concatenate the visual features and the top-k₀ words to get the enriched information vector Z^r_{i,j} (sketched below).
- This facilitates aligning visual features with their corresponding optimal textual representations.
- The fused features are passed into the MRTVA loss for contextual learning from low resolution to high resolution.
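A toy illustration of the concatenation that forms Z^r_{i,j}, assuming simple flatten-and-concatenate fusion; in MR-PLIP the fused embedding comes from the multimodal encoder shown on the next slide:

```python
import torch

v = torch.randn(256)          # visual feature of one patch (illustrative size)
w_pos = torch.randn(5, 256)   # its top-k0 = 5 positive keyword features

z = torch.cat([v, w_pos.flatten()])   # enriched vector Z^r_{i,j}
print(z.shape)                # torch.Size([1536]) = 256 + 5 * 256
```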
MRTVA Loss Flow
- Parent patch → Multi-Modal Encoder (encodes V and W⁺) → fused embedding zᵖ
- Child patch → Multi-Modal Encoder (encodes V and W⁺) → fused embedding zᶜ
- Projection head: hᵖ = Proj(zᵖ), hᶜ = Proj(zᶜ)
- Prediction head: gᶜ = Pred(hᶜ), gᵖ = Pred(hᵖ)
- MRTVA loss: L = -cosine(hᵖ, gᶜ), with stop-gradient on the target branch, so gradients flow only through the predicting branch.
- The loss is symmetric, applied in both directions (parent → child and child → parent): L = -½ cosine(gᶜ, Sg(hᵖ)) - ½ cosine(gᵖ, Sg(hᶜ))
Term          Meaning
h^p_{i,j}     Projection of the parent feature vector
g^c_{i,j}     Prediction from the child vector (tries to guess the parent)
Sg(...)       Stop-gradient: gradients are frozen here (no backpropagation)
g^p_{i,j}     Prediction from the parent vector (now predicting what the child looks like)
h^c_{i,j}     Projection of the child feature vector
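A minimal sketch of this symmetric, stop-gradient loss in PyTorch; the single-layer Proj/Pred heads and the feature sizes are illustrative stand-ins (SimSiam-style heads are typically MLPs), not the paper's architecture:

```python
import torch
import torch.nn.functional as F

proj = torch.nn.Linear(512, 128)   # projection head (Proj), simplified
pred = torch.nn.Linear(128, 128)   # prediction head (Pred), simplified

def neg_cos(p, h):
    """Negative cosine similarity; h is detached (stop-gradient, Sg)."""
    return -F.cosine_similarity(p, h.detach(), dim=-1).mean()

def mrtva_loss(z_parent, z_child):
    h_p, h_c = proj(z_parent), proj(z_child)   # h^p, h^c
    g_p, g_c = pred(h_p), pred(h_c)            # g^p, g^c
    # Symmetric: the child predicts the parent, and the parent predicts the child.
    return 0.5 * neg_cos(g_c, h_p) + 0.5 * neg_cos(g_p, h_c)

loss = mrtva_loss(torch.randn(4, 512), torch.randn(4, 512))
loss.backward()
```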
Results
- Zero-shot tile-level and WSI-level classification performance of MR-PLIP compared with existing SOTA VLMs, in terms of weighted F1-score, with PE and NPE.
- Performance comparison of the proposed MR-PLIP with existing SOTA on tile-level classification using linear-probe evaluation and on WSI-level classification using weakly supervised learning.
Thank You
MR-PLIP Architecture
1. Image-Text Contrasting (ITC)
2. Image-Text Matching (ITM)
3. Masked Language Modeling (MLM)
ITC uses softmax-normalized similarities between image-to-text (i2t) and text-to-image (t2i) pairs.
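A minimal sketch of the ITC term, assuming a CLIP-style batch layout in which matched image-text pairs sit on the diagonal; it illustrates the softmax-normalized i2t and t2i similarities, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def itc_loss(img_feat, txt_feat, tau=0.07):
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    logits = img @ txt.T / tau                   # (B, B) similarity matrix
    targets = torch.arange(len(logits))          # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)      # softmax over texts per image
    loss_t2i = F.cross_entropy(logits.T, targets)    # softmax over images per text
    return 0.5 * (loss_i2t + loss_t2i)

loss = itc_loss(torch.randn(16, 512), torch.randn(16, 512))
```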
Simple Siamese Representation Learning (SimSiam)
- A self-supervised learning framework that teaches a model to understand images without any labels.
- A predictor transforms z₁ into p₁, a prediction of what the other view's features (z₂) should be.
Loss Function
SimSiam's symmetric loss (the standard form, which MRTVA adapts to parent-child pairs):
D(p₁, z₂) = -cos(p₁, Sg(z₂))
L = ½ D(p₁, z₂) + ½ D(p₂, z₁)
Organizing Dataset (how the 34-million-patch dataset is formed)
[Diagram: one 5× patch subdivided into 4 patches at 10×, 16 at 20×, and 64 at 40×]
1. 20 random patches are taken from each WSI at 5× (2 µm/pixel), each sized 512 × 512.
2. Then each of these 5× patches is zoomed in to get:
   - 4 patches at 10× (1 µm/pixel)
   - 16 patches at 20× (0.5 µm/pixel)
   - 64 patches at 40× (0.25 µm/pixel)
3. 20,000 WSIs × 20 patches = 400,000 total 5× patches.
4. 1 (original 5×) + 4 (10×) + 16 (20×) + 64 (40×) = 85 patches.
5. 400,000 × 85 = 34,000,000 patches.