Set Transformer: A Framework for Attention-based
Permutation-Invariant Neural Networks
Juho Lee, Yoonho Lee, Jungtaek Kim, Adam R. Kosiorek*,
Seungjin Choi, Yee Whye Teh
(ICML 2019)
Selection reasons:
- I think it is interesting and useful in a wide range of scenarios
- It is a main ingredient of Stacked Capsule Autoencoders
(maybe next time)
Many tasks use set-structured inputs
Multiple instance learning
Ex: image bag classification
● Input: a set of images
● Output: a label for that set
(example bag labels: beach, desert)
Many tasks use set-structured inputs
3D shape recognition
● Input: 2D images of an object taken from different angles
● Output: the object's class
https://www.kaggle.com/nepuerto/the-small-norb-dataset-v10
Many tasks use set-structured inputs
Anomaly detection
● Input: a set of items that contains an anomalous item
● Output: the anomalous item
For such problems...
The model should satisfy:
1. Permutation invariance: the output of the model is identical under any
permutation of the elements in the input set
2. The ability to process sets of any size
Existing methods:
● Classical feed-forward networks violate both
● RNN-like models are sensitive to the order of the input
Recent solution: set pooling methods
Deep Sets (Zaheer et al., 2017)
1. Element-wise feed-forward network (applied to each item)
2. Set embedding obtained by aggregation (mean, sum, max, ...)
3. Final output obtained by a further non-linear transform (a minimal sketch follows below)
Properties:
● Proven to be a universal approximator for set functions
● Interactions between elements are necessarily discarded by the pooling step
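For concreteness, a minimal PyTorch-style sketch of this three-step pipeline (module names and layer sizes are illustrative, not taken from the paper):

import torch.nn as nn

class SetPoolingNet(nn.Module):
    # Minimal Deep Sets-style model: per-element encoder, permutation-invariant
    # pooling, then a decoder that maps the set embedding to the output.
    def __init__(self, dim_in, dim_hidden, dim_out, pool="mean"):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.ReLU(),
                                     nn.Linear(dim_hidden, dim_hidden))
        self.decoder = nn.Sequential(nn.Linear(dim_hidden, dim_hidden), nn.ReLU(),
                                     nn.Linear(dim_hidden, dim_out))
        self.pool = pool

    def forward(self, x):                # x: (batch, set_size, dim_in)
        h = self.encoder(x)              # applied element-wise
        if self.pool == "mean":
            z = h.mean(dim=1)            # aggregation discards pairwise interactions
        elif self.pool == "max":
            z = h.max(dim=1).values
        else:
            z = h.sum(dim=1)
        return self.decoder(z)           # (batch, dim_out)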
An example: amortized clustering
● Problem:
Input: a set of points
Output: a parametric mapping from the input points to the centers of their clusters
● Difficulty: the parametric mapping must assign each point to its cluster
while modelling the explaining-away pattern among clusters
● Typically solved via iterative algorithms
● Set-pooling methods suffer from under-fitting (results later)
Overall idea and contributions
● Use the self-attention mechanism to encode pairwise or higher-order
interactions between elements of the set
● Use self-attention again to aggregate features, which is useful when the
outputs are related to each other (multiple-output problems)
● A method to reduce computation time for large sets
Ingredient 1: Pooling Architecture for Sets
Deep Sets (Zaheer et al., 2017)
Properties:
● Any permutation-invariant function can be represented in this form when
pool() is sum() and the element-wise encoder and the decoder are continuous functions (written out below)
● Stacking permutation-equivariant layers preserves equivariance, so the network
remains permutation invariant after pooling
● Example of a permutation-equivariant layer (see Zaheer et al., 2017)
(Figure: encoder and decoder parts of the set-pooling architecture)
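Written compactly, the Deep Sets result states that (under mild conditions) a set function f is permutation invariant if and only if it decomposes as

f(X) = ρ( ∑_{x ∈ X} φ(x) )

for a suitable element-wise encoder φ and decoder ρ.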
Ingredient 2: Attention
● n query vectors Q ∈ R^{n×d_q}
● n_v key-value pairs K ∈ R^{n_v×d_q}, V ∈ R^{n_v×d_v}
● Output: Att(Q, K, V; ω) = ω(QKᵀ)V ∈ R^{n×d_v},
where ω is an activation such as the row-wise softmax (typically scaled by 1/√d_q)
Ingredient 2: Attention (multi-head)
Extension of attention
● Project Q, K, V onto h lower-dimensional vectors
● The attention function is applied to each projected tuple
● The final output is a linear transformation of the concatenation of the
h attention outputs
● Some typical design choices:
○ split the dimensions across heads, e.g. d_q^M = d_q / h and d_v^M = d_v / h
○ use a scaled row-wise softmax for ω, i.e. softmax(QKᵀ / √d_q^M)
○ a learnable linear map W^O for the output projection
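Written out (following the notation of the paper):

Multihead(Q, K, V; λ, ω) = concat(O_1, ..., O_h) W^O
where O_j = Att(Q W_j^Q, K W_j^K, V W_j^V; ω_j)

and λ = {W_j^Q, W_j^K, W_j^V, W^O} are the learnable projection matrices.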
Encoder: Set Attention Block (SAB)
● Multihead Attention Block (MAB)
Able to compute pairwise or higher-order interactions among instances
rFF = row-wise feed-forward layer
● Set Attention Block (SAB): a MAB attending from the set to itself
Problem: high computational complexity for large sets (see below)
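The definitions from the paper:

MAB(X, Y) = LayerNorm(H + rFF(H)),  where H = LayerNorm(X + Multihead(X, Y, Y; ω))
SAB(X) = MAB(X, X)

Computing SAB(X) for a set of n elements costs O(n²), since every element attends to every other element.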
Encoder: Induced Set Attention Block (ISAB)
● Introduce m d-dimensional vectors I ∈ R^{m×d}, called inducing points,
which are trainable parameters
● Analogous to a low-rank projection
● Expected to encode global structure that explains the input set
● Ex: in the amortized clustering problem,
they could be reference points against which the encoder compares the elements of the set
● Time complexity is O(nm) instead of O(n²)
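In formulas (from the paper):

ISAB_m(X) = MAB(X, H) ∈ R^{n×d},  where H = MAB(I, X) ∈ R^{m×d}

The m inducing points I first attend to the n set elements, and the set elements then attend back to the resulting m summaries, so both attention steps cost O(nm).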
Decoder: Pooling by Multihead Attention (PMA)
● Instead of simple pooling, use k learnable seed vectors S
Typically, k correlated outputs -> k seed vectors (k = 1 in most cases)
● Next, model the interactions among the k outputs
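With Z ∈ R^{n×d} denoting the encoder output, the paper defines

PMA_k(Z) = MAB(S, rFF(Z)),  where S ∈ R^{k×d} are the learnable seed vectors.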
Set Transformer: Overall Architecture
● The encoder:
○ A stack of SABs or ISABs (composition shown below)
○ Time complexity
■ O(n²) for stacks of SABs
■ O(nm) for stacks of ISAB_m
● The decoder:
○ Aggregates features from the encoder using PMA with k seeds
○ Next, models the correlation among the outputs using a SAB
○ Produces the final output with a feed-forward network
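For example, with two encoder blocks (as written in the paper):

Encoder(X) = SAB(SAB(X))   or   Encoder(X) = ISAB_m(ISAB_m(X))
Decoder(Z) = rFF( SAB( PMA_k(Z) ) ),  where Z = Encoder(X)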
Analysis
Proposition 1. The Set Transformer is permutation invariant
● Encoder blocks (SAB, ISAB) are permutation equivariant
● PMA in the decoder is permutation invariant, so the overall network is invariant
Proposition 2. The Set Transformer is a universal approximator of
permutation-invariant functions.
Related work
● Applications:
○ 3D shape recognition
○ discovering causality
○ learning the statistics of a set
○ few-shot image classification
● Attention-based approaches for sets
● Modelling interactions between elements of sets
● The idea of inducing points is also used in
○ sparse Gaussian processes
○ the Nyström method for low-rank matrix approximation
Experiments: compared methods
● Zaheer et al., 2017
○ rFF + Pooling: rFF encoder and simple pooling
○ rFFp-mean/rFFp-max + Pooling: rFF encoder + simple pooling + rFF decoder
● Yang et al., 2018; Ilse et al., 2018
○ rFF + Dotprod: rFF encoder + dot-product attention (weighted-sum) pooling + rFF decoder
● Proposed
○ SAB (ISAB) + Pooling: stack of SABs (ISABs) as encoder and simple pooling
○ rFF + PMA (ours): rFF layers in the encoder and a PMA decoder
○ SAB (ISAB) + PMA: stack of SABs (ISABs) as encoder and a PMA decoder
Experiments: Maximum Value Regression
● Setting
○ Input: a set of numbers
○ Output: the maximum of the set
○ Loss function: mean absolute error between the prediction and the true maximum
● Note: an encoder with the identity function + max-pooling gives the exact answer
● Results:
○ rFF + mean/sum pooling performs poorly
○ The Set Transformer achieved performance comparable to rFF + max-pooling
=> it is able to learn to focus on the max value
Experiments: Counting Unique Characters
● Setting:
○ Input: a set of character images from the Omniglot dataset
○ Output: the number of unique characters in the set
○ Model: Poisson regression (objective sketched below)
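As a reminder, in the Poisson-regression setup the network outputs a rate λ(X) for the input set X and is trained by maximizing the Poisson log-likelihood of the true count c:

log p(c | λ(X)) = c · log λ(X) − λ(X) − log(c!)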
● Results:
Experiments: Amortized Clustering
● Problem:
○ Input: points generated from a mixture of Gaussians
○ Output: the parameters of this mixture (see the objective sketched below)
○ The number of seeds for PMA equals the number of Gaussians (4 in the experiments)
○ Data:
■ Synthetic 2D points (100-500 points per dataset, different parameters per dataset)
■ CIFAR-100: each dataset contains 100-500 samples drawn from 4 random classes
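A sketch of the training objective, assuming the usual maximum-likelihood setup for a k-component Gaussian mixture with predicted parameters θ(X) = {π_j, μ_j, Σ_j}:

log p(X; θ(X)) = ∑_i log ∑_j π_j N(x_i; μ_j, Σ_j)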
Experiments: Amortized Clustering
● Result:
○ ISAB outperforms the others in all tasks,
even the oracle obtained by running EM on CIFAR-100
○ ISAB may outperform SAB thanks to knowledge transfer and regularization
via the inducing points, which help the network learn global structures
Experiments: Set Anomaly Detection
● Problem:
○ Input:
■ 7 images with the same attributes
■ 1 image without them
○ Output: the anomalous image
○ Data: CelebA
● Result:
An interesting application
● ZOZO Town: fashion coordinate recommendation system
● Problem: fill in the blank (recommend the item missing from an outfit)
https://speakerdeck.com/yukisaito/set-transformer-for-coordinating-outfits
Conclusion and thoughts
● A permutation-invariant network for set-structured input
● Attractive results
● Already used in some applications
○ An important ingredient of the new capsule network (Stacked Capsule Autoencoders) from Hinton's group
○ Recommendation problems
● Might be usable in some of our problems?