Set Transformer: A Framework for Attention-based
Permutation-Invariant Neural Networks
Juho Lee, Yoonho Lee, Jungtaek Kim, Adam R. Kosiorek*,
Seungjin Choi, Yee Whye Teh
(ICML 2019)
Selection reasons:
- I think it is interesting and useful in a wide range of scenarios
- It is a main ingredient of Stacked Capsule Autoencoders
(maybe next time)
Many tasks use set-structured inputs
Multiple instance learning
Ex: image bag classification
● Input: a set of images
● Output: a label for that set
(example bag labels: beach, desert)
Many tasks use set-structured inputs
3D shape recognition
● Input: 2D images of an object taken from different angles
● Output: the object's class
https://www.kaggle.com/nepuerto/the-small-norb-dataset-v10
Many tasks use set-structured inputs
Anomaly detection
● Input: a set of items that contains an anomalous item
● Output: the anomalous item
For such problems...
The model should satisfy:
1. Permutation invariance: the output of the model is identical under any
permutation of the elements in the input set
2. The ability to process sets of any size
Existing methods:
● Classical feed-forward networks violate both
● RNN-like models are sensitive to the order of the input
Recent solution: set pooling methods
Deep Sets (Zaheer et al., 2017)
1. Element-wise feed-forward network (applied to each item)
2. Set embedding obtained by aggregation (mean, sum, max, ...)
3. Final output obtained by a further non-linear transform (a minimal sketch follows below)
Properties:
● Proven to be a universal approximator for set functions
● Interactions between elements are necessarily discarded by the pooling step
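For concreteness, a minimal PyTorch-style sketch of this three-step pipeline (module names and layer sizes are illustrative, not taken from the paper):

import torch.nn as nn

class SetPoolingNet(nn.Module):
    # Minimal Deep Sets-style model: per-element encoder, permutation-invariant
    # pooling, then a decoder that maps the set embedding to the output.
    def __init__(self, dim_in, dim_hidden, dim_out, pool="mean"):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.ReLU(),
                                     nn.Linear(dim_hidden, dim_hidden))
        self.decoder = nn.Sequential(nn.Linear(dim_hidden, dim_hidden), nn.ReLU(),
                                     nn.Linear(dim_hidden, dim_out))
        self.pool = pool

    def forward(self, x):                # x: (batch, set_size, dim_in)
        h = self.encoder(x)              # applied element-wise
        if self.pool == "mean":
            z = h.mean(dim=1)            # aggregation discards pairwise interactions
        elif self.pool == "max":
            z = h.max(dim=1).values
        else:
            z = h.sum(dim=1)
        return self.decoder(z)           # (batch, dim_out)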
An example: amortized clustering
● Problem:
Input: a set of points
Output: a parametric mapping from the input points to the centers of their clusters
● Difficulty: the parametric mapping must assign each point to its cluster
while modelling the explaining-away pattern among clusters
● Typically solved via iterative algorithms
● Set-pooling methods suffer from under-fitting (results later)
Overall idea and contributions
● Use the self-attention mechanism to encode pairwise or higher-order
interactions between elements of the set
● Use self-attention again to aggregate features, which is useful when the
outputs are related to each other (multiple-output problems)
● A method to reduce computation time for large sets
Ingredient 1: Pooling Architecture for Sets
Deep Sets (Zaheer et al., 2017)
Properties:
● Any permutation-invariant function can be represented in this form when
pool() is sum() and the element-wise encoder and the decoder are continuous functions (written out below)
● Stacking permutation-equivariant layers preserves equivariance, so the network
remains permutation invariant after pooling
● Example of a permutation-equivariant layer (see Zaheer et al., 2017)
(Figure: encoder and decoder parts of the set-pooling architecture)
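Written compactly, the Deep Sets result states that (under mild conditions) a set function f is permutation invariant if and only if it decomposes as

f(X) = ρ( ∑_{x ∈ X} φ(x) )

for a suitable element-wise encoder φ and decoder ρ.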
Ingredient 2: Attention
● n query vectors Q ∈ R^{n×d_q}
● n_v key-value pairs K ∈ R^{n_v×d_q}, V ∈ R^{n_v×d_v}
● Output: Att(Q, K, V; ω) = ω(QKᵀ)V ∈ R^{n×d_v},
where ω is an activation such as the row-wise softmax (typically scaled by 1/√d_q)
Ingredient 2: Attention (multi-head)
Extension of attention
● Project Q, K, V onto h lower-dimensional vectors
● The attention function is applied to each projected tuple
● The final output is a linear transformation of the concatenation of the
h attention outputs
● Some typical design choices:
○ split the dimensions across heads, e.g. d_q^M = d_q / h and d_v^M = d_v / h
○ use a scaled row-wise softmax for ω, i.e. softmax(QKᵀ / √d_q^M)
○ a learnable linear map W^O for the output projection
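Written out (following the notation of the paper):

Multihead(Q, K, V; λ, ω) = concat(O_1, ..., O_h) W^O
where O_j = Att(Q W_j^Q, K W_j^K, V W_j^V; ω_j)

and λ = {W_j^Q, W_j^K, W_j^V, W^O} are the learnable projection matrices.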
Encoder: Set Attention Block (SAB)
● Multihead Attention Block (MAB)
Able to compute pairwise or higher-order interactions among instances
rFF = row-wise feed-forward layer
● Set Attention Block (SAB): a MAB attending from the set to itself
Problem: high computational complexity for large sets (see below)
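The definitions from the paper:

MAB(X, Y) = LayerNorm(H + rFF(H)),  where H = LayerNorm(X + Multihead(X, Y, Y; ω))
SAB(X) = MAB(X, X)

Computing SAB(X) for a set of n elements costs O(n²), since every element attends to every other element.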
Encoder: Induced Set Attention Block (ISAB)
● Introduce m d-dimensional vectors I ∈ R^{m×d}, called inducing points,
which are trainable parameters
● Analogous to a low-rank projection
● Expected to encode global structure that explains the input set
● Ex: in the amortized clustering problem,
they could be reference points against which the encoder compares the elements of the set
● Time complexity is O(nm) instead of O(n²)
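In formulas (from the paper):

ISAB_m(X) = MAB(X, H) ∈ R^{n×d},  where H = MAB(I, X) ∈ R^{m×d}

The m inducing points I first attend to the n set elements, and the set elements then attend back to the resulting m summaries, so both attention steps cost O(nm).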
Decoder: Pooling by Multihead Attention (PMA)
● Instead of simple pooling, use k learnable seed vectors S
Typically, k correlated outputs -> k seed vectors (k = 1 in most cases)
● Next, model the interactions among the k outputs
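With Z ∈ R^{n×d} denoting the encoder output, the paper defines

PMA_k(Z) = MAB(S, rFF(Z)),  where S ∈ R^{k×d} are the learnable seed vectors.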
Set Transformer: Overall Architecture
● The encoder:
○ A stack of SABs or ISABs (composition shown below)
○ Time complexity
■ O(n²) for stacks of SABs
■ O(nm) for stacks of ISAB_m
● The decoder:
○ Aggregates features from the encoder using PMA with k seeds
○ Next, models the correlation among the outputs using a SAB
○ Produces the final output with a feed-forward network
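For example, with two encoder blocks (as written in the paper):

Encoder(X) = SAB(SAB(X))   or   Encoder(X) = ISAB_m(ISAB_m(X))
Decoder(Z) = rFF( SAB( PMA_k(Z) ) ),  where Z = Encoder(X)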
Analysis
Proposition 1. The Set Transformer is permutation invariant
● Encoder blocks (SAB, ISAB) are permutation equivariant
● PMA in the decoder is permutation invariant, so the overall network is invariant
Proposition 2. The Set Transformer is a universal approximator of
permutation-invariant functions.
Related work
● Applications:
○ 3D shape recognition
○ discovering causality
○ learning the statistics of a set
○ few-shot image classification
● Attention-based approaches for sets
● Modelling interactions between elements of sets
● The idea of inducing points is also used in
○ sparse Gaussian processes
○ the Nyström method for low-rank matrix approximation
Experiments: compared methods
● Zaheer et al., 2017
○ rFF + Pooling: rFF encoder and simple pooling
○ rFFp-mean/rFFp-max + Pooling: rFF encoder + simple pooling + rFF decoder
● Yang et al., 2018; Ilse et al., 2018
○ rFF + Dotprod: rFF encoder + dot-product attention (weighted-sum) pooling + rFF decoder
● Proposed
○ SAB (ISAB) + Pooling: stack of SABs (ISABs) as encoder and simple pooling
○ rFF + PMA (ours): rFF layers in the encoder and a PMA decoder
○ SAB (ISAB) + PMA: stack of SABs (ISABs) as encoder and a PMA decoder
Experiments: Maximum Value Regression
● Setting
○ Input: a set of numbers
○ Output: the maximum of the set
○ Loss function: mean absolute error between the prediction and the true maximum
● Note: an encoder with the identity function + max-pooling gives the exact answer
● Results:
○ rFF + mean/sum pooling performs poorly
○ The Set Transformer achieved performance comparable to rFF + max-pooling
=> it is able to learn to focus on the max value
Experiments: Counting Unique Characters
● Setting:
○ Input: a set of character images from the Omniglot dataset
○ Output: the number of unique characters in the set
○ Model: Poisson regression (objective sketched below)
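As a reminder, in the Poisson-regression setup the network outputs a rate λ(X) for the input set X and is trained by maximizing the Poisson log-likelihood of the true count c:

log p(c | λ(X)) = c · log λ(X) − λ(X) − log(c!)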
● Results:
Experiments: Amortized Clustering
● Problem:
○ Input: points generated from a mixture of Gaussians
○ Output: the parameters of this mixture (see the objective sketched below)
○ The number of seeds for PMA equals the number of Gaussians (4 in the experiments)
○ Data:
■ Synthetic 2D points (100-500 points per dataset, different parameters per dataset)
■ CIFAR-100: each dataset contains 100-500 samples drawn from 4 random classes
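A sketch of the training objective, assuming the usual maximum-likelihood setup for a k-component Gaussian mixture with predicted parameters θ(X) = {π_j, μ_j, Σ_j}:

log p(X; θ(X)) = ∑_i log ∑_j π_j N(x_i; μ_j, Σ_j)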
Experiments: Amortized Clustering
● Result:
○ ISAB outperforms the others in all tasks,
even the oracle obtained by running EM on CIFAR-100
○ ISAB may outperform SAB thanks to knowledge transfer and regularization
via the inducing points, which help the network learn global structures
Experiments: Set Anomaly Detection
● Problem:
○ Input:
■ 7 images with the same attributes
■ 1 image without them
○ Output: the anomalous image
○ Data: CelebA
● Result:
An interesting application
● ZOZO Town: fashion coordinate recommendation system
● Problem: fill in the blank (recommend the item missing from an outfit)
https://speakerdeck.com/yukisaito/set-transformer-for-coordinating-outfits
Conclusion and thoughts
● A permutation-invariant network for set-structured input
● Attractive results
● Already used in some applications
○ An important ingredient of the new capsule network (Stacked Capsule Autoencoders) from Hinton's group
○ Recommendation problems
● Might be usable in some of our problems?