On Popularity Bias of Multimodal-aware Recommender Systems:
a Modalities-driven Analysis
Daniele Malitesta, Giandomenico Cornacchia, Claudio Pomo, Tommaso Di Noia
Politecnico di Bari, Bari (Italy)
email: firstname.lastname@poliba.it
The 1st International Workshop on Deep Multimodal Learning for Information Retrieval
Ottawa, ON, Canada, November 2, 2023
Co-located with ACM Multimedia 2023
● Introduction and motivations
● Background
● Proposed analysis
● Results and discussion
● Conclusion and future work
Outline
2
Introduction and motivations
3
Multimodal-aware recommender systems [Malitesta et al. (2023a)] exploit multimodal (e.g., audio, visual, textual) content data to augment the representation of items, thus tackling known issues such as dataset sparsity and the difficulty of interpreting users' actions (e.g., views, clicks) on online platforms.
4
Recommendation systems leveraging multimodal data
[Figure: overview of a multimodal recommendation pipeline for user u and item i. (1) Input: the item modalities m1, m2, m3, ...; (2) a multimodal feature extractor φ_m(·) produces (a) joint μ(·) or (b) coordinate μ_m(·) multimodal representations; (3) multimodal fusion is performed either (a) early, γ_e(·), or (b) late, γ_l(·); (4) inference ρ(·) yields the recommendation score r. Open design questions: which modalities, how to represent them, and when to fuse them?]
[Malitesta et al. (2023a)] 2023. Formalizing Multimedia Recommendation through Multimodal Deep Learning. Under review at TORS. Available online at: arXiv:2309.05273.
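As a rough illustration of the pipeline above, the sketch below contrasts early fusion (combining modality representations before computing the user-item score) with late fusion (combining per-modality scores). It is a minimal NumPy sketch under assumed toy dimensions; the names (visual_feat, textual_feat, project, ...) are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy collaborative embedding and (assumed) multimodal item features.
d = 64                                  # latent size
e_u = rng.normal(size=d)                # user embedding
visual_feat = rng.normal(size=4096)     # e.g., a CNN visual feature
textual_feat = rng.normal(size=1024)    # e.g., a sentence-embedding feature

def project(x, out_dim=d, seed=1):
    """Random linear projection standing in for a learned one."""
    w = np.random.default_rng(seed).normal(size=(out_dim, x.shape[0]))
    return w @ x / np.sqrt(x.shape[0])

# (a) Early fusion: merge modality representations, then score once.
item_multimodal = np.concatenate([project(visual_feat, seed=1),
                                  project(textual_feat, seed=2)])
score_early = e_u @ project(item_multimodal, seed=3)

# (b) Late fusion: score each modality separately, then merge the scores.
score_visual = e_u @ project(visual_feat, seed=1)
score_textual = e_u @ project(textual_feat, seed=2)
score_late = 0.5 * (score_visual + score_textual)

print(score_early, score_late)
```

The "which/how/when" questions on the slide correspond, respectively, to choosing the modalities, their representation, and the fusion point.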
Most multimodal-aware recommender systems are built upon factorization models for recommendation, such as matrix factorization with Bayesian personalized ranking (MFBPR [Rendle et al.]). Given its simple implementation and its efficacy, MFBPR has long constituted the backbone of collaborative filtering algorithms [He et al. (2020), Mao et al.], not only in multimodal recommendation.
5
Multimodal-aware recommendation and factorization models
[Rendle et al.] 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In UAI.
[He et al. (2020)] 2020. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. In SIGIR. ACM, 639–648.
[Mao et al.] 2021. SimpleX: A Simple and Strong Baseline for Collaborative Filtering. In CIKM. ACM, 1243–1252.
[Figure: MFBPR predicts the score ŷ_ui between user u and item i; modalities m1, m2, m3 act as item side information.]
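To make the MFBPR backbone concrete, here is a minimal sketch of plain matrix factorization trained with the BPR pairwise objective of Rendle et al. Hyperparameters and the sampling of (user, positive item, negative item) triples are simplified placeholders, not the configuration used in the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
n_users, n_items, d, lr, reg = 100, 50, 16, 0.05, 1e-4

# Collaborative user/item embeddings (e_u, e_i).
E_u = 0.1 * rng.normal(size=(n_users, d))
E_i = 0.1 * rng.normal(size=(n_items, d))

def score(u, i):
    return E_u[u] @ E_i[i]

def bpr_step(u, i_pos, i_neg):
    """One SGD step on the BPR loss -log sigmoid(score(u, i+) - score(u, i-))."""
    x_uij = score(u, i_pos) - score(u, i_neg)
    sig = 1.0 / (1.0 + np.exp(x_uij))      # sigmoid(-x_uij), the gradient weight
    grad_u = sig * (E_i[i_pos] - E_i[i_neg]) - reg * E_u[u]
    grad_pos = sig * E_u[u] - reg * E_i[i_pos]
    grad_neg = -sig * E_u[u] - reg * E_i[i_neg]
    E_u[u] += lr * grad_u
    E_i[i_pos] += lr * grad_pos
    E_i[i_neg] += lr * grad_neg

# Toy training loop over random triples; in practice i_pos is an observed
# interaction of u and i_neg an unobserved one.
for _ in range(1000):
    u, i_pos, i_neg = rng.integers(n_users), rng.integers(n_items), rng.integers(n_items)
    bpr_step(u, i_pos, i_neg)
```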
Nevertheless, the literature has shown that MFBPR-like models may be affected by popularity bias [Abdollahpouri et al., Ricardo Baeza-Yates, Boratto et al., Jannach et al.]: such recommender systems tend to boost the recommendation of items from the short-head to the detriment of items from the long-tail.
6
Popularity bias in matrix factorization
[Abdollahpouri et al.] 2017. Controlling Popularity Bias in Learning-to-Rank Recommendation. In RecSys. ACM, 42–46.
[Ricardo Baeza-Yates] 2020. Bias in Search and Recommender Systems. In RecSys. ACM, 2.
[Boratto et al.] 2021. Connecting user and item perspectives in popularity debiasing for collaborative recommendation. Inf. Process. Manag. 58, 1 (2021), 102387.
[Jannach et al.] 2015. What recommenders recommend: an analysis of recommendation biases and possible countermeasures. User Model. User Adapt. Interact. 25, 5 (2015), 427–491.
[Embedded: first page of the paper]
ABSTRACT. Multimodal-aware recommender systems (MRSs) exploit multimodal content (e.g., product images or descriptions) as items' side information to improve recommendation accuracy. While most of such methods rely on factorization models (e.g., MFBPR) as base architecture, it has been shown that MFBPR may be affected by popularity bias, meaning that it inherently tends to boost the recommendation of popular (i.e., short-head) items to the detriment of niche (i.e., long-tail) items from the catalog. Motivated by this assumption, in this work, we provide one of the first analyses on how multimodality in recommendation could further amplify popularity bias. Concretely, we evaluate the performance of four state-of-the-art MRS algorithms (i.e., VBPR, MMGCN, GRCN, LATTICE) on three datasets from Amazon by assessing, along with recommendation accuracy metrics, performance measures accounting for the diversity of recommended items and the portion of retrieved niche items. To better investigate this aspect, we study the separate influence of each modality (i.e., visual and textual) on popularity bias across different evaluation dimensions. Results, which demonstrate how a single modality may amplify the negative effect of popularity bias, shed light on the importance of a more rigorous analysis of the performance of such models.
[Figure 1: Short-head and long-tail items from the Office dataset in the Amazon catalog.]
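Figure 1 relies on splitting the catalog into short-head and long-tail items by interaction counts. A minimal sketch of such a split is given below; the 20% threshold for the short-head is a common convention in the popularity-bias literature and is an assumption here, not a value stated on the slides.

```python
import numpy as np

def split_short_head_long_tail(interactions, head_fraction=0.2):
    """Split item ids into short-head / long-tail sets by popularity.

    interactions: list of (user_id, item_id) pairs.
    head_fraction: assumed share of most popular items forming the short-head.
    """
    items, counts = np.unique([i for _, i in interactions], return_counts=True)
    order = np.argsort(-counts)                      # most popular first
    n_head = max(1, int(head_fraction * len(items)))
    short_head = set(items[order[:n_head]].tolist())
    long_tail = set(items[order[n_head:]].tolist())
    return short_head, long_tail

# Toy example: item 7 is the most interacted with, so it falls in the short-head.
toy = [(0, 7), (1, 7), (2, 7), (0, 3), (1, 5), (2, 9), (3, 11)]
head, tail = split_short_head_long_tail(toy)
print(head, tail)
```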
Some recent works [Liu et al., Kowald and Lacic, Malitesta et al. (2023b)] address bias in multimodal-aware recommendation, but with definitions and settings that differ from the notion of popularity bias presented earlier.
7
Popularity bias in multimodal-aware recommendation
[Liu et al.] 2022. EliMRec: Eliminating Single-modal Bias in Multimedia Recommendation. In ACM Multimedia. ACM, 687–695.
[Kowald and Lacic] 2022. Popularity Bias in Collaborative Filtering-Based Multimedia Recommender Systems. In BIAS (Communications in Computer and Information Science, Vol. 1610). Springer, 1–11.
[Malitesta et al. (2023b)] 2023. Disentangling the Performance Puzzle of Multimodal-aware Recommender Systems. In EvalRS@KDD (CEUR Workshop Proceedings, Vol. 3450). CEUR-WS.org.
✓ Propose one of the first analyses on how multimodal-aware recommender systems may amplify popularity bias
✓ Select four state-of-the-art multimodal-aware recommender systems (i.e., VBPR, MMGCN, GRCN, and LATTICE)
✓ Train them on three categories of the Amazon Catalogue (i.e., Office, Toys, and Clothing)
✓ Evaluate the performance on recommendation accuracy and popularity bias (i.e., diversity and percentage of retrieved items from the long-tail)
✓ Assess the separate impact of each multimodal side information on single and paired recommendation metrics
8
Our contributions
Research questions
RQ1) How do multimodal-aware recommendation models behave in terms of accuracy, diversity, and popularity bias?
RQ2) What is the influence of each modality (i.e., visual, textual, multimodal) on such performance measures?
Background
9
10
Preliminaries
[Figure: a toy scenario with users U = {u1, ..., u5}, items I = {i1, i2, i3}, and the binary user-item interaction matrix X (rows [1 1 1], [0 1 0], [1 1 0], [0 0 1], [1 1 0]). On the collaborative side, each user u and item i is associated with an embedding e_u or e_i; on the multimodal side, with a feature f_u or f_i.]
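A minimal sketch of these preliminaries, assuming PyTorch: the collaborative embeddings e_u, e_i are trainable ID embeddings, while the multimodal features f_i come from frozen, pre-extracted tensors. Dimensions, the random placeholders, and the way f_u is derived are illustrative assumptions.

```python
import torch

n_users, n_items, d = 5, 3, 64

# Binary user-item interaction matrix X (toy values as in the slide figure).
X = torch.tensor([[1, 1, 1],
                  [0, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1],
                  [1, 1, 0]], dtype=torch.float32)

# Collaborative embeddings e_u, e_i (trainable).
e_u = torch.nn.Embedding(n_users, d)
e_i = torch.nn.Embedding(n_items, d)

# Multimodal item features f_i (pre-extracted, frozen); random placeholders here.
f_i_visual = torch.randn(n_items, 4096)   # e.g., image embeddings
f_i_textual = torch.randn(n_items, 1024)  # e.g., description embeddings

# A user-side multimodal feature f_u can be derived, e.g., by averaging the
# features of the items the user interacted with (one possible choice).
f_u_visual = X @ f_i_visual / X.sum(1, keepdim=True).clamp(min=1)
```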
• Visual Bayesian personalized ranking (VBPR [He et al. (2016)])
• Multimodal graph convolutional network for recommendation (MMGCN [Wei et al. (2019)])
• Graph-refined convolutional network (GRCN [Wei et al. (2020)])
• Latent structure mining method for multimodal recommendation (LATTICE [Zhang et al.])
11
Multimodal-aware recommender systems
[He et al. (2016)] 2016. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. In AAAI. AAAI Press, 144–150.
[Wei et al. (2019)] 2019. MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video. In ACM Multimedia. ACM, 1437–1445.
[Wei et al. (2020)] 2020. Graph-Refined Convolutional Network for Multimedia Recommendation with Implicit Feedback. In ACM Multimedia. ACM, 3541–3549.
[Zhang et al.] 2021. Mining Latent Structures for Multimedia Recommendation. In ACM Multimedia. ACM, 3872–3880.
[Table, embedded from the paper] The analyzed models and their prediction rules:

VBPR (2016, AAAI): $\hat{y}_{ui} = \mathbf{e}_u^\top \mathbf{e}_i + \mathbf{f}_u^\top t(\mathbf{f}_i)$, with $\mathbf{f}_i = \big\Vert_{m \in \mathcal{M}} \mathbf{f}_i^m$

MMGCN (2019, MM): $\hat{y}_{ui} = \mathbf{f}_u^\top \mathbf{f}_i$, with $\mathbf{f}_u = \sum_{m \in \mathcal{M}} c\big(\mathbf{e}_u,\, g(\mathbf{f}_u^m),\, t(\mathbf{f}_u^m, \mathbf{e}_u)\big)$

GRCN (2020, MM): $\hat{y}_{ui} = \mathbf{f}_u^\top \mathbf{f}_i$, with $\mathbf{f}_u = g(\mathbf{e}_u, \mathbf{f}_u^m, \forall m \in \mathcal{M}) \,\big\Vert\, \big(\Vert_{m \in \mathcal{M}}\, t(\mathbf{f}_u^m)\big)$

LATTICE (2021, MM): $\hat{y}_{ui} = \mathbf{e}_u^\top \mathbf{f}_i$, with $\mathbf{f}_i = \mathbf{e}_i + \frac{g(\mathbf{e}_i, \mathbf{f}_i^m, \forall m \in \mathcal{M})}{\Vert g(\mathbf{e}_i, \mathbf{f}_i^m, \forall m \in \mathcal{M}) \Vert_2}$

Here, $\mathbf{e}$ denotes collaborative embeddings and $\mathbf{f}^m$ the features from modality $m$; in LATTICE, $g(\cdot)$ is a LightGCN architecture performing graph structure learning.
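As an illustration of the first entry above (VBPR), the following sketch computes a VBPR-style score: the collaborative dot product plus a multimodal term in which the concatenated modality features pass through a projection t(·), read here as a trainable linear map. This is a schematic reading of the formula with assumed dimensions, not the authors' implementation.

```python
import torch

d, d_m = 64, 64                      # latent sizes (assumed)
n_items = 10

e_u = torch.randn(d)                 # collaborative user embedding
e_i = torch.randn(n_items, d)        # collaborative item embeddings
f_u = torch.randn(d_m)               # multimodal user embedding
f_i_visual = torch.randn(n_items, 4096)
f_i_textual = torch.randn(n_items, 1024)

# f_i: concatenation of the modality features over m in M.
f_i = torch.cat([f_i_visual, f_i_textual], dim=1)

# t(.): trainable projection of the multimodal features into the latent space.
t = torch.nn.Linear(f_i.shape[1], d_m, bias=False)

# VBPR-style prediction: y_ui = e_u^T e_i + f_u^T t(f_i), scored for all items.
y_u = e_i @ e_u + t(f_i) @ f_u
top_k = torch.topk(y_u, k=5).indices
```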
Proposed analysis
12
13
Datasets and multimodal features
[McAuley et al.] 2015. Image-Based Recommendations on Styles and Substitutes. In SIGIR. ACM, 43–52.
[Deldjoo et al.] 2021. A Study on the Relative Importance of Convolutional Neural Networks in Visually-Aware Recommender Systems. In CVPR Workshops. Computer Vision Foundation / IEEE, 3961–3967.
[Zhang et al.] 2021. Mining Latent Structures for Multimedia Recommendation. In ACM Multimedia. ACM, 3872–3880.
Table 1: Statistics of the tested datasets.

Datasets   |U|      |I|      |R|       Sparsity (%)
Office     4,905    2,420    53,258    99.5513
Toys       19,412   11,924   167,597   99.9276
Clothing   39,387   23,033   278,677   99.9693
Amazon Catalogue [McAuley et al.]: pre-extracted visual features publicly available at https://cseweb.ucsd.edu/~jmcauley/datasets/amazon/links.html
Multimodal features:
• Visual features: 4,096-dimensional embeddings [Deldjoo et al.]
• Textual features: 1,024-dimensional embeddings [Zhang et al.]
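For the textual modality, a sketch of how item descriptions can be encoded with sentence transformers is shown below. The specific checkpoint name is an assumption for illustration (the slide only states that the resulting embeddings are 1,024-dimensional); the visual features are instead loaded from the pre-extracted files linked above.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Hypothetical item texts: title + description + categories + brand, as in the slide.
item_texts = [
    "Stapler. Heavy-duty desktop stapler. Office Products. AcmeBrand",
    "Ink cartridge. Black ink, high yield. Office Products. PrintCo",
]

# Assumed checkpoint; any sentence-transformers model producing 1,024-d vectors fits.
model = SentenceTransformer("all-roberta-large-v1")
textual_embeddings = model.encode(item_texts)   # shape: (n_items, 1024)
print(textual_embeddings.shape)
```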
14
Evaluation metrics

In the proposed study, we refer to metrics that may bring out additional insights not yet investigated in multimodal recommendation: we do not solely rely on accuracy metrics (i.e., Recall and nDCG), but also on diversity (i.e., item coverage) and popularity bias (i.e., APLT) metrics. All metrics are calculated on top-$k$ recommendation lists.

Accuracy

Recall assesses the system's capacity to retrieve relevant items in the recommendation list:
$$\mathrm{Recall@}k = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{|\mathrm{Rel}_u@k|}{|\mathrm{Rel}_u|},$$
where $\mathrm{Rel}_u$ is the set of relevant items for user $u$, and $\mathrm{Rel}_u@k$ is the set of relevant recommended items in the top-$k$ list.

Normalized discounted cumulative gain (nDCG) considers both the relevance and the ranking position of recommended items, accounting for varied degrees of relevance:
$$\mathrm{nDCG@}k = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{\mathrm{DCG}_u@k}{\mathrm{IDCG}_u@k}, \qquad \mathrm{DCG}_u@k = \sum_{i=1}^{k} \frac{2^{rel_{u,i}} - 1}{\log_2(i + 1)},$$
with $rel_{u,i} \in \mathrm{Rel}_u$, and where $\mathrm{IDCG}_u@k$ is the cumulative gain of a perfect (ideal) recommender system.

Popularity Bias

Item coverage (iCov) measures the fraction of the catalog appearing in the recommendation lists; a higher item coverage means a larger portion of the item space is recommended to users, implying a more comprehensive coverage of user preferences:
$$\mathrm{iCov@}k = \frac{\left| \bigcup_{u} \hat{I}_u@k \right|}{|\mathcal{I}_{\mathrm{train}}|},$$
where $\hat{I}_u@k$ is the list of top-$k$ recommended items for user $u$.

Average percentage of long-tail items (APLT) [Abdollahpouri et al.] assesses the presence of popularity bias, i.e., the tendency of recommendation algorithms to prioritize popular or mainstream items over niche ones. It measures the percentage of recommended items belonging to the medium/long-tail distribution, averaged over all users:
$$\mathrm{APLT@}k = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{|\{i \mid i \in (\hat{I}_u@k \cap \sim\Phi)\}|}{k},$$
where $\Phi$ is the set of short-head items and $\sim\Phi$ is the set of medium/long-tail items. APLT is evaluated together with iCov because, following their definitions, the two metrics are conceptually related and iCov helps interpret APLT.

Metrics value interpretation. An ideal recommender system should increase all the metrics listed above according to the principle "higher is better", boosting accuracy and diversity while reducing the popularity bias of the produced recommendations. Nevertheless, since this work tries to unveil whether and why multimodal-aware recommender systems are affected by popularity bias, the focus is on those settings in which accuracy is high while diversity and popularity bias are low (according to the metrics definitions).

Modality settings. The models' behavior is investigated in three settings: (i) visual, employing only visual features; (ii) textual, employing only textual features; and (iii) multimodal, where both modalities are combined (the original setting of each tested model). Further reproducibility details (dataset splitting and filtering, hyperparameter search strategy) are reported in the paper.

[Abdollahpouri et al.] 2017. Controlling Popularity Bias in Learning-to-Rank Recommendation. In RecSys. ACM, 42–46.
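The four metrics can be computed directly from their definitions. Below is a minimal NumPy sketch, assuming binary relevance and precomputed top-k lists; variable names and the toy inputs are illustrative, not taken from the paper's codebase.

```python
import numpy as np

def recall_at_k(topk, relevant):
    """topk: {user: list of recommended item ids}; relevant: {user: set of item ids}."""
    vals = [len(set(topk[u]) & relevant[u]) / len(relevant[u])
            for u in relevant if relevant[u]]
    return float(np.mean(vals))

def ndcg_at_k(topk, relevant):
    """Binary relevance: an item gains 1 if it belongs to the user's relevant set."""
    vals = []
    for u in relevant:
        gains = [1.0 / np.log2(rank + 2)
                 for rank, i in enumerate(topk[u]) if i in relevant[u]]
        ideal = sum(1.0 / np.log2(r + 2)
                    for r in range(min(len(relevant[u]), len(topk[u]))))
        vals.append(sum(gains) / ideal if ideal > 0 else 0.0)
    return float(np.mean(vals))

def icov_at_k(topk, n_train_items):
    """Fraction of the training catalog appearing in any recommendation list."""
    covered = set().union(*[set(items) for items in topk.values()])
    return len(covered) / n_train_items

def aplt_at_k(topk, long_tail_items):
    """Average share of medium/long-tail items in the top-k lists."""
    vals = [len([i for i in items if i in long_tail_items]) / len(items)
            for items in topk.values()]
    return float(np.mean(vals))

# Toy usage: two users, k = 3.
topk = {0: [5, 1, 9], 1: [2, 5, 7]}
relevant = {0: {1, 4}, 1: {7}}
print(recall_at_k(topk, relevant), ndcg_at_k(topk, relevant))
print(icov_at_k(topk, n_train_items=10), aplt_at_k(topk, long_tail_items={9, 4, 2}))
```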
Results and discussion
15
16
Recommendation accuracy, diversity, and popularity bias (RQ1)
Table 2: Results in terms of recommendation accuracy (Recall, nDCG), diversity (iCov) and popularity bias (APLT). For accuracy metrics, ↑ means better performance, while ↓ means less diversity and more popularity bias. We remind that, while the iCov and APLT metrics would generally adhere to the principle of "higher is better" (↑) for an ideal recommender system, in this work we consider the opposite, as we want to emphasize which models are performing worst in terms of diversity and popularity bias.

Columns per cut-off (top@10 | top@20 | top@50): Recall↑, nDCG↑, iCov↓, APLT↓

Office
Random   0.0034 0.0020 2,414 0.5950 | 0.0079 0.0034 2,414 0.5948 | 0.0220 0.0068 2,414 0.5924
MostPop  0.0302 0.0208 20 0.0000 | 0.0533 0.0282 32 0.0000 | 0.1143 0.0439 66 0.0000
MFBPR    0.0602 0.0389 2,268 0.2294 | 0.0955 0.0500 2,357 0.2379 | 0.1657 0.0677 2,398 0.2513
VBPR     0.0652 0.0419 2,265 0.2321 | 0.1025 0.0533 2,354 0.2375 | 0.1774 0.0721 2,404 0.2469
MMGCN    0.0455 0.0300 74 0.0016 | 0.0798 0.0405 112 0.0078 | 0.1575 0.0598 247 0.0205
GRCN     0.0393 0.0253 2,390 0.3438 | 0.0667 0.0339 2,409 0.3469 | 0.1250 0.0488 2,414 0.3548
LATTICE  0.0664 0.0449 2,121 0.1752 | 0.1029 0.0566 2,315 0.2039 | 0.1780 0.0751 2,397 0.2413

Toys
Random   0.0011 0.0006 11,879 0.4894 | 0.0021 0.0008 11,879 0.4896 | 0.0051 0.0015 11,879 0.4902
MostPop  0.0130 0.0075 13 0.0000 | 0.0229 0.0104 24 0.0000 | 0.0451 0.0156 56 0.0000
MFBPR    0.0641 0.0403 10,016 0.1167 | 0.0903 0.0481 10,944 0.1268 | 0.1394 0.0596 11,544 0.1460
VBPR     0.0710 0.0458 10,085 0.1064 | 0.1006 0.0545 11,026 0.1180 | 0.1523 0.0667 11,624 0.1400
MMGCN    0.0256 0.0150 4,499 0.0961 | 0.0426 0.0200 6,238 0.1058 | 0.0785 0.0285 8,657 0.1263
GRCN     0.0554 0.0354 11,007 0.2368 | 0.0831 0.0436 11,609 0.2482 | 0.1355 0.0559 11,847 0.2679
LATTICE  0.0805 0.0512 8,767 0.0546 | 0.1165 0.0617 10,285 0.0684 | 0.1771 0.0759 11,397 0.0950

Clothing
Random   0.0004 0.0002 23,016 0.4487 | 0.0010 0.0003 23,016 0.4478 | 0.0024 0.0006 23,016 0.4482
MostPop  0.0089 0.0046 13 0.0000 | 0.0157 0.0063 24 0.0000 | 0.0322 0.0095 56 0.0000
MFBPR    0.0303 0.0156 18,414 0.0729 | 0.0459 0.0195 20,582 0.0824 | 0.0734 0.0249 22,171 0.1017
VBPR     0.0339 0.0181 19,195 0.0809 | 0.0529 0.0229 21,251 0.0915 | 0.0847 0.0292 22,555 0.1112
MMGCN    0.0227 0.0119 1,744 0.0044 | 0.0348 0.0150 2,864 0.0066 | 0.0609 0.0201 5,373 0.0121
GRCN     0.0319 0.0164 21,490 0.2358 | 0.0496 0.0209 22,503 0.2459 | 0.0858 0.0281 22,954 0.2631
LATTICE  0.0502 0.0275 13,463 0.0134 | 0.0744 0.0336 17,538 0.0207 | 0.1186 0.0425 21,458 0.0385

From the paper (excerpts):
"… popularity bias phenomenon as much as MMGCN does. Indeed, even if LATTICE's iCov is the second-worst across all the datasets, the metric is always close to the best models in terms of diversity. Finally, VBPR and GRCN confirm their ability (already observed on the diversity measure) to tackle also popularity bias in all ex…"
"… discuss the influence of each single modality on the performance. We consider two evaluation dimensions where modalities influence is assessed (i) on accuracy, diversity, and popularity bias separately, and (ii) on pairs of metrics to investigate their joint variations."
LATTICE stands out for its accuracy performance… 😃
…but amplifies popularity bias 🥲
17
Recommendation accuracy, diversity, and popularity bias (RQ1)
MMGCN struggles with diversity… 🤒
…exhibits strong popularity bias… 😱
…and sacrifices accuracy in certain scenarios ☠
18
Recommendation accuracy, diversity, and popularity bias (RQ1)
VBPR and GRCN better manage all the metrics by finding the right compromise among them 😎
19
Modalities influence on recommendation performance (RQ2)
[Figure 2: Percentage variation on the (a) Recall, (b) iCov, and (c) APLT when training the multimodal recommender systems (VBPR, MMGCN, GRCN, LATTICE) on Office, Toys, and Clothing with either visual or textual modalities. The 0% line stands for the reference performance provided by the multimodal version of the model. All results refer to the top@20 recommendation lists.]
…showing consistent trends. Indeed, the visual modality reduces the Recall while the textual increases it (with the only exception of VBPR, whose percentage variation is negligible).

Differently from the accuracy analysis, we recognize a quasi-stable trend in the performance variation measured for the diversity metric (Figure 2b). Considering the Office dataset, each modality's contribution is generally irrelevant except for MMGCN, for which the visual modality slightly improves the coverage across the whole recommendation list, while the textual one worsens the performance by a large margin. Assessing the trend on Toys, both modalities decrease the coverage performance of the model when injected separately into the recommendation pipeline; remarkably, MMGCN is once again the model most affected by the presence of a single modality, but this time the coverage performance widely deteriorates because of both the visual and textual modalities. Finally, on Clothing, both modalities lower the model's item coverage, with specific reference to the visual modality.

As the last part of our analysis, we take into account each modality's contribution to the popularity bias dimension (Figure 2c). Starting from Office, we notice how both modalities are prone to enforce popularity bias if injected singularly, with the only exception of LATTICE, whose textual modality limits the popularity bias (the…
The textual modality improves the accuracy… 💪
…while both modalities negatively affect the diversity and reinforce the popularity bias 😭
Single metric setting
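The percentage variations plotted in Figure 2 are relative to the multimodal configuration of each model (the 0% line). A one-line sketch of that computation, as an assumption consistent with the figure caption rather than the paper's exact script, is given below.

```python
def pct_variation(metric_single_modality, metric_multimodal):
    """Percentage variation of a metric w.r.t. the multimodal reference (the 0% line)."""
    return 100.0 * (metric_single_modality - metric_multimodal) / metric_multimodal

# Hypothetical example: Recall@20 of a model trained with textual-only vs. multimodal features.
print(pct_variation(0.105, 0.100))   # +5.0% in this made-up case
```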
20
Modalities influence on recommendation performance (RQ2)
The textual modality has a significant influence on accuracy… 😣
…but minimal effects on diversity and popularity bias 😇
Pair-wise metric setting
[Figure 3: Performance analysis on Clothing when comparing (a) Recall vs. APLT, (b) Recall vs. iCov, and (c) iCov vs. APLT for VBPR, MMGCN, GRCN, and LATTICE under the multimodal, visual, and textual modality settings. Metrics are on top@20.]
…APLT increases); this is interesting as we remind that LATTICE is the second-worst model in terms of popularity bias, but using only the textual modality reduces its accuracy performance and the influence of popular items in the recommendation list. When it comes to the Toys dataset, every single modality enforces the popularity bias of MMGCN and GRCN; for VBPR, the visual and textual modalities…
21
Modalities influence on recommendation performance (RQ2)
The visual modality reduces accuracy… 😨
…and jointly worsens the popularity bias and diversity 😵
Pair-wise metric setting (cont’d)
Conclusion and future work
22
Conclusion
● Analysis on influence of multimodality on popularity bias
● Four SOTA multimodal recommendation approaches on three datasets
● Three evaluation dimensions and three modality settings
● [RQ1] VBPR and GRCN strike a better compromise among all metrics
● [RQ2 single] Separate injection of modalities improves accuracy but negatively impacts diversity and popularity bias
● [RQ2 pairs, textual] Strongly impacts accuracy but has little effect on diversity and popularity bias
● [RQ2 pairs, visual] Reduces accuracy while exacerbating popularity bias and limiting diversity
Future work
● More complete study on the performance of these models
● Assessing the performance of more recent multimodal approaches [Malitesta et al. (2023a)]
23
Reach out to us!
24
The authors:
• Daniele Malitesta (daniele.malitesta@poliba.it)
• Giandomenico Cornacchia (giandomenico.cornacchia@poliba.it)
• Claudio Pomo (claudio.pomo@poliba.it)
• Tommaso Di Noia (tommaso.dinoia@poliba.it)
Don’t forget to check out our theoretical/experimental survey
25
More Related Content

• Recommender systems: a novel approach based on singular value decomposition (PDF)
• An Unsupervised Approach For Reputation Generation (PDF)
• FHCC: A SOFT HIERARCHICAL CLUSTERING APPROACH FOR COLLABORATIVE FILTERING REC... (PDF)
• Effective Cross-Domain Collaborative Filtering using Temporal Domain – A Brie... (PDF)
• A Novel Approach for Travel Package Recommendation Using Probabilistic Matrix... (PDF)
• PhD defense (PPTX)
• Survey on Location Based Recommendation System Using POI (PDF)
• Multidirectional Product Support System for Decision Making In Textile Indust... (PDF)
Similar to [MMIR@MM2023] On Popularity Bias of Multimodal-aware Recommender Systems: A Modalities-driven Analysis (20)

• Advances In Collaborative Filtering (PDF)
• Recommenders, Topics, and Text (PPTX)
• factorization methods (PDF)
• A Proposal on Social Tagging Systems Using Tensor Reduction and Controlling R... (PDF)
• Poster Abstracts (DOC)
• An improvised model for identifying influential nodes in multi parameter soci... (PDF)
• Data mining java titles adrit solutions (PDF)
• An Improvised Fuzzy Preference Tree Of CRS For E-Services Using Incremental A... (PDF)
• Current trends of opinion mining and sentiment analysis in social networks (PDF)
• Recommender Systems (PDF)
• Bx044461467 (PDF)
• Improving-Movie-Recommendation-Systems-Filtering-by-Exploiting-UserBased-Revi... (PDF)
• Literature Review on Social Networking in Supply chain (PPTX)
• Top Machine Learning Algorithms Used By AI Professionals ARTiBA.pdf (PDF)
• Detailed structure applicable to hybrid recommendation technique (PDF)
• On the benefit of logic-based machine learning to learn pairwise comparisons (PDF)
• DBLP-SSE: A DBLP Search Support Engine (PPT)
• Extending canonical action research model to implement social media in microb... (PDF)
• A Novel Latent Factor Model For Recommender System (PDF)
Ad

Recently uploaded (20)

PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PPT
Mechanical Engineering MATERIALS Selection
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PPTX
Safety Seminar civil to be ensured for safe working.
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PDF
Well-logging-methods_new................
PPTX
bas. eng. economics group 4 presentation 1.pptx
PPT
introduction to datamining and warehousing
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PPTX
Construction Project Organization Group 2.pptx
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PDF
PPT on Performance Review to get promotions
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PDF
Digital Logic Computer Design lecture notes
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
Mechanical Engineering MATERIALS Selection
Model Code of Practice - Construction Work - 21102022 .pdf
Safety Seminar civil to be ensured for safe working.
R24 SURVEYING LAB MANUAL for civil enggi
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
Well-logging-methods_new................
bas. eng. economics group 4 presentation 1.pptx
introduction to datamining and warehousing
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
Construction Project Organization Group 2.pptx
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PPT on Performance Review to get promotions
Operating System & Kernel Study Guide-1 - converted.pdf
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
Digital Logic Computer Design lecture notes
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Ad

[MMIR@MM2023] On Popularity Bias of Multimodal-aware Recommender Systems: A Modalities-driven Analysis

  • 1. On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis Daniele Malitesta, Giandomenico Cornacchia, Claudio Pomo, Tommaso Di Noia Politecnico di Bari, Bari (Italy) email: firstname.lastname@poliba.it The 1st International Workshop on Deep Multimodal Learning for Information Retrieval Ottawa, ON, Canada, 11-02-2023 Co-located with ACM Multimedia 2023
  • 2. On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis The 1st International Workshop on Deep Multimodal Learning for Information Retrieval (Ottawa, November 02, 2023) ● Introduction and motivations ● Background ● Proposed analysis ● Results and discussion ● Conclusion and future work Outline 2
  • 4. On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis The 1st International Workshop on Deep Multimodal Learning for Information Retrieval (Ottawa, November 02, 2023) Multimodal-aware recommender systems [Malitesta et al. (2023a)] exploit multimodal (i.e., audio, visual, textual) content data to augment the representation of items, thus tackling known issues such as dataset sparsity and the inexplicable nature of users’ actions (i.e., views, clicks) on online platforms. 4 Recommendation systems leveraging multimodal data ࢛ ࢏ MODALITIES ࢓૚ ࢓૛ ࢓૜ . . . . . . MULTIMODAL FEATURE EXTRACTOR ࣐࢓ሺ‫ڄ‬ሻ MULTIMODAL REPRESENTATION JOINT ࣆሺ‫ڄ‬ሻ COORDINATE ࣆ࢓ ‫ڄ‬ . . . INFERENCE ࣋ሺ‫ڄ‬ሻ EARLY FUSION ࢽࢋሺ‫ڄ‬ሻ LATE FUSION ࢽ࢒ሺ‫ڄ‬ሻ (1) (2) (a) (b) MULTIMODAL FUSION (3) (a) (b) (4) ࢘ Which? How? When? INPUT [Malitesta et al. (2023a)] 2023. Formalizing Multimedia Recommendation through Multimodal Deep Learning. Under review at TORS. Available online at: arXiv:2309.05273.
  • 5. On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis The 1st International Workshop on Deep Multimodal Learning for Information Retrieval (Ottawa, November 02, 2023) Most of multimodal-aware recommender systems are based upon factorization models for recommendation, such as the matrix factorization with Bayesian personalized ranking architecture (MFBPR [Rendle et al.]). Given its simple implementation and efficacy, MFBPR has long constituted the backbone of recommendation algorithms in collaborative filtering [He et al. (2020), Mao et al.], not only in multimodal recommendation. 5 Multimodal-aware recommendation and factorization models [Rendle et al.] 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In UAI. [He et al. (2020)] 2020. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. In SIGIR. ACM, 639–648. [Mao et al.] 2021. SimpleX: A Simple and Strong Baseline for Collaborative Filtering. In CIKM. ACM, 1243–1252. 𝑢 𝑖 # 𝑦!" 𝑚# 𝑚$ 𝑚%
  • 6. On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis The 1st International Workshop on Deep Multimodal Learning for Information Retrieval (Ottawa, November 02, 2023) Nevertheless, the literature has shown that MFBPR-like models may be affected by popularity bias [Abdollahpouri et al., Ricardo Baeza-Yates, Boratto et al., Jannach et al.]. Such recommender systems tend to boost the performance of items from the short-head at the detriment of the items from the long-tail. 6 Popularity bias in matrix factorization [Abdollahpouri et al.] 2017. Controlling Popularity Bias in Learning-to-Rank Recommendation. In RecSys. ACM, 42–46. [Ricardo Baeza-Yates] 2020. Bias in Search and Recommender Systems. In RecSys. ACM, 2. [Boratto et al.] 2021. Connecting user and item perspectives in popularity debiasing for collaborative recommendation. Inf. Process. Manag. 58, 1 (2021), 102387. [Jannach et al.] 2015. What recommenders recommend: an analysis of recommendation biases and possible countermeasures. User Model. User Adapt. Interact. 25, 5 (2015), 427–491. Daniele Malitesta∗ Politecnico di Bari, Italy daniele.malitesta@poliba.it Giandomenico Cornacchia∗ Politecnico di Bari, Italy giandomenico.cornacchia@poliba.it Claudio Pomo Politecnico di Bari, Italy claudio.pomo@poliba.it Tommaso Di Noia Politecnico di Bari, Italy tommaso.dinoia@poliba.it ABSTRACT Multimodal-aware recommender systems (MRSs) exploit multi- modal content (e.g., product images or descriptions) as items’ side information to improve recommendation accuracy. While most of such methods rely on factorization models (e.g., MFBPR) as base architecture, it has been shown that MFBPR may be a�ected by popularity bias, meaning that it inherently tends to boost the rec- ommendation of popular (i.e., short-head) items at the detriment of niche (i.e., long-tail) items from the catalog. Motivated by this as- sumption, in this work, we provide one of the �rst analyses on how multimodality in recommendation could further amplify popularity bias. Concretely, we evaluate the performance of four state-of-the- art MRSs algorithms (i.e., VBPR, MMGCN, GRCN, LATTICE) on three datasets from Amazon by assessing, along with recommen- dation accuracy metrics, performance measures accounting for the diversity of recommended items and the portion of retrieved niche items. To better investigate this aspect, we decide to study the separate in�uence of each modality (i.e., visual and textual) on popularity bias in di�erent evaluation dimensions. Results, which demonstrate how the single modality may augment the negative e�ect of popularity bias, shed light on the importance to provide a more rigorous analysis of the performance of such models. 0 500 1,000 1,500 2,000 2,500 0 20 40 60 80 100 120 items popularity short-head long-tail Figure 1: Short-head and long-tail items from the O�ce dataset in the Amazon catalog. Systems: a Modalities-driven Analysis. In Proceedings of Make sure to en- ter the correct conference title from your rights con�rmation emai (Confer- ence acronym ’XX). ACM, New York, NY, USA, 10 pages. https://doi.org/ .12911v1 [cs.IR] 24 Aug 2023
  • 7. On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis The 1st International Workshop on Deep Multimodal Learning for Information Retrieval (Ottawa, November 02, 2023) Some recent works [Liu et al., Kowald and Lacic, Malitesta et al. (2023b)] address bias in multimodal-aware recommendation, but under definitions and settings that differ from the notion of popularity bias presented earlier. 7 Popularity bias in multimodal-aware recommendation [Liu et al.] 2022. EliMRec: Eliminating Single-modal Bias in Multimedia Recommendation. In ACM Multimedia. ACM, 687–695. [Kowald and Lacic] 2022. Popularity Bias in Collaborative Filtering-Based Multimedia Recommender Systems. In BIAS (Communications in Computer and Information Science, Vol. 1610). Springer, 1–11. [Malitesta et al. (2023b)] 2023. Disentangling the Performance Puzzle of Multimodal-aware Recommender Systems. In EvalRS@KDD (CEUR Workshop Proceedings, Vol. 3450). CEUR-WS.org. [Slide again shows the first page of the companion paper.]
  • 8. On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis The 1st International Workshop on Deep Multimodal Learning for Information Retrieval (Ottawa, November 02, 2023) 8 Our contributions
  ✓ Propose one of the first analyses on how multimodal-aware recommender systems may amplify popularity bias
  ✓ Select four state-of-the-art multimodal-aware recommender systems (i.e., VBPR, MMGCN, GRCN, and LATTICE)
  ✓ Train them on three categories of the Amazon catalog (i.e., Office, Toys, and Clothing)
  ✓ Evaluate the performance in terms of recommendation accuracy, diversity (i.e., item coverage), and popularity bias (i.e., percentage of retrieved long-tail items)
  ✓ Assess the separate impact of each modality's side information on single and paired recommendation metrics
  Research questions
  RQ1) How do multimodal-aware recommendation models behave in terms of accuracy, diversity, and popularity bias?
  RQ2) What is the influence of each modality (i.e., visual, textual, multimodal) on such performance measures?
  • 10. On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis The 1st International Workshop on Deep Multimodal Learning for Information Retrieval (Ottawa, November 02, 2023) 10 Preliminaries [Diagram: a binary user-item interaction matrix X over users U = {u1, ..., u5} and items I = {i1, i2, i3}; each user u and item i is associated with a collaborative embedding (e_u, e_i) and a multimodal feature vector (f_u, f_i).]
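The preliminaries above can be summarized with a small sketch; shapes and the toy interactions are illustrative assumptions. It builds the binary interaction matrix X and sets up collaborative embeddings alongside pre-extracted multimodal item features.

```python
import numpy as np

# Preliminaries sketch (illustrative shapes): a binary user-item interaction
# matrix X, collaborative embeddings e_u / e_i learned from X, and multimodal
# item features extracted from item content (e.g., images, descriptions).

n_users, n_items = 5, 3
interactions = [(0, 0), (0, 1), (1, 2), (2, 0), (3, 1), (4, 2)]  # observed (u, i) pairs

X = np.zeros((n_users, n_items), dtype=np.int8)
for u, i in interactions:
    X[u, i] = 1  # implicit feedback: 1 = interaction observed, 0 = unobserved

d, d_v, d_t = 64, 4096, 1024           # latent / visual / textual dimensions
rng = np.random.default_rng(0)
e_u = rng.normal(size=(n_users, d))    # collaborative user embeddings (learned)
e_i = rng.normal(size=(n_items, d))    # collaborative item embeddings (learned)
f_i_visual = rng.normal(size=(n_items, d_v))   # pre-extracted visual features
f_i_textual = rng.normal(size=(n_items, d_t))  # pre-extracted textual features

print(X)
```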
  • 11. On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis The 1st International Workshop on Deep Multimodal Learning for Information Retrieval (Ottawa, November 02, 2023) 11 Multimodal-aware recommender systems
  • Visual Bayesian personalized ranking (VBPR [He et al. (2016)])
  • Multimodal graph convolutional network for recommendation (MMGCN [Wei et al. (2019)])
  • Graph-refined convolutional network (GRCN [Wei et al. (2020)])
  • Latent structure mining method for multimodal recommendation (LATTICE [Zhang et al.])
  [He et al. (2016)] 2016. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. In AAAI. AAAI Press, 144–150.
  [Wei et al. (2019)] 2019. MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video. In ACM Multimedia. ACM, 1437–1445.
  [Wei et al. (2020)] 2020. Graph-Refined Convolutional Network for Multimedia Recommendation with Implicit Feedback. In ACM Multimedia. ACM, 3541–3549.
  [Zhang et al.] 2021. Mining Latent Structures for Multimedia Recommendation. In ACM Multimedia. ACM, 3872–3880.
  Prediction formulas as reported in the companion paper (e: collaborative embeddings, f^m: modality-m features, ∥: concatenation, M: set of modalities):
  • VBPR (2016, AAAI): ŷ_ui = e_u^T e_i + f_u^T φ(f_i), with f_i = ∥_{m∈M} f_i^m
  • MMGCN (2019, MM): ŷ_ui = f_u^T f_i, with f_u = Σ_{m∈M} c(e_u, g(f_u^m), φ(f_u^m, e_u))
  • GRCN (2020, MM): ŷ_ui = f_u^T f_i, with f_u = g(e_u, f_u^m, ∀m ∈ M) ∥ (∥_{m∈M} φ(f_u^m))
  • LATTICE (2021, MM): ŷ_ui = e_u^T f_i, with f_i = e_i + g(e_i, f_i^m, ∀m ∈ M) / ‖g(e_i, f_i^m, ∀m ∈ M)‖_2
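As an example of how such prediction rules look in code, below is a hedged sketch of the VBPR-style score ŷ_ui = e_u^T e_i + f_u^T φ(f_i) from the list above, where φ is rendered as a single linear projection of the concatenated visual and textual item features. This simplification, and the omission of bias terms, is an assumption made only for illustration.

```python
import numpy as np

# VBPR-style scoring sketch: a collaborative term plus a multimodal term in
# which the concatenated item features are projected into the space of a
# separate "multimodal preference" user embedding. Rendering phi as one linear
# projection is an assumption kept minimal for illustration.

rng = np.random.default_rng(1)
d, d_v, d_t = 64, 4096, 1024

e_u, e_i = rng.normal(size=d), rng.normal(size=d)   # collaborative embeddings
f_u = rng.normal(size=d)                            # user multimodal preference vector
f_i = np.concatenate([rng.normal(size=d_v), rng.normal(size=d_t)])  # ||_m f_i^m

W_phi = rng.normal(scale=0.01, size=(d, d_v + d_t))  # phi(.): learned projection matrix

def vbpr_score(e_u, e_i, f_u, f_i):
    """Collaborative dot product + projected multimodal dot product."""
    return e_u @ e_i + f_u @ (W_phi @ f_i)

print(vbpr_score(e_u, e_i, f_u, f_i))
```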
  • 13. On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis The 1st International Workshop on Deep Multimodal Learning for Information Retrieval (Ottawa, November 02, 2023) 13 Datasets and multimodal features
  Amazon catalog [McAuley et al.]: Office Products (Office), Toys & Games (Toys), and Clothing, Shoes & Jewelry (Clothing); each dataset provides both images and descriptions for the available items.
  Table 1: Statistics of the tested datasets.
  Datasets    |U|      |I|      |R|       Sparsity (%)
  Office      4,905    2,420    53,258    99.5513
  Toys        19,412   11,924   167,597   99.9276
  Clothing    39,387   23,033   278,677   99.9693
  Multimodal features
  • Visual features: pre-extracted 4,096-dimensional embeddings [Deldjoo et al.]
  • Textual features: 1,024-dimensional embeddings obtained by aggregating each item's title, description, categories, and brand and encoding them with sentence transformers [Zhang et al.]
  [McAuley et al.] 2015. Image-Based Recommendations on Styles and Substitutes. In SIGIR. ACM, 43–52.
  [Deldjoo et al.] 2021. A Study on the Relative Importance of Convolutional Neural Networks in Visually-Aware Recommender Systems. In CVPR Workshops. Computer Vision Foundation / IEEE, 3961–3967.
  [Zhang et al.] 2021. Mining Latent Structures for Multimedia Recommendation. In ACM Multimedia. ACM, 3872–3880.
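For the textual side, a hedged sketch of the extraction recipe described above (concatenate title, description, categories, and brand, then encode with a sentence transformer) might look as follows. The specific checkpoint name is an assumption, chosen only because it outputs 1,024-dimensional embeddings; it is not necessarily the model used in the paper.

```python
from sentence_transformers import SentenceTransformer

# Hedged sketch of the textual feature extraction: item metadata fields are
# concatenated into one string per item and encoded with a sentence-transformer.
# The checkpoint below is an assumption (any 1,024-dim sentence encoder matches
# the stated feature size).

items = [
    {"title": "Stapler", "description": "Heavy-duty office stapler",
     "categories": "Office Products > Staplers", "brand": "ACME"},
]
texts = [" ".join([it["title"], it["description"], it["categories"], it["brand"]])
         for it in items]

model = SentenceTransformer("all-roberta-large-v1")  # assumption: a 1,024-dim model
textual_features = model.encode(texts)               # shape: (n_items, 1024)
print(textual_features.shape)
```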
  • 14. On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis The 1st International Workshop on Deep Multimodal Learning for Information Retrieval (Ottawa, November 02, 2023) 14 Evaluation metrics
  Along with accuracy metrics (Recall, nDCG), the study also considers diversity (item coverage) and popularity bias (APLT) metrics; all metrics are computed on top-k recommendation lists.
  Accuracy
  • Recall: Recall@k = (1/|U|) Σ_{u∈U} |Rel_u@k| / |Rel_u|, where Rel_u is the set of relevant items for user u and Rel_u@k the set of relevant items in the top-k list.
  • Normalized discounted cumulative gain: nDCG@k = (1/|U|) Σ_u DCG_u@k / IDCG_u@k, where DCG@k = Σ_{i=1}^{k} (2^{rel_{u,i}} − 1) / log_2(i+1), with rel_{u,i} ∈ Rel_u, and IDCG is the cumulative gain of a perfect (ideal) recommender.
  Diversity
  • Item coverage: iCov@k = |⋃_u Î_u@k| / |I_train|, where Î_u@k is the top-k recommendation list of user u. Higher coverage means a larger fraction of the item space is actually recommended, implying a more comprehensive coverage of user preferences.
  Popularity bias [Abdollahpouri et al.]
  • Average percentage of long-tail items: APLT@k = (1/|U|) Σ_{u∈U} |{i | i ∈ (Î_u@k ∩ Φ̃)}| / k, where Φ is the set of short-head items and Φ̃ the set of medium/long-tail items.
  APLT is evaluated together with iCov because, following their definitions, the two metrics are conceptually related, and the latter helps interpret the former.
  Metric interpretation: an ideal recommender should increase all the metrics above ("higher is better") to boost accuracy and diversity while reducing popularity bias; since this work investigates whether multimodal-aware recommenders are affected by popularity bias, the focus is on settings where accuracy is high while diversity and APLT are low.
  [Abdollahpouri et al.] 2017. Controlling Popularity Bias in Learning-to-Rank Recommendation. In RecSys. ACM, 42–46.
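A minimal sketch of how these four metrics can be computed on top-k lists is shown below; `recs` maps each user to its ranked top-k items and `rel` to the held-out relevant items. The 80/20 popularity split used to define short-head items is a common convention assumed here, not necessarily the exact threshold adopted in the paper.

```python
import numpy as np

# Minimal sketch of the four evaluation metrics on top-k recommendation lists.

def split_short_head(train_interactions, ratio=0.2):
    """Short-head items: the most popular `ratio` fraction of the catalog (assumption)."""
    items, counts = np.unique([i for _, i in train_interactions], return_counts=True)
    order = np.argsort(-counts)
    cut = max(1, int(len(items) * ratio))
    return {int(x) for x in items[order[:cut]]}

def recall_at_k(recs, rel):
    return np.mean([len(set(recs[u]) & rel[u]) / len(rel[u]) for u in recs if rel[u]])

def ndcg_at_k(recs, rel, k):
    scores = []
    for u in recs:
        dcg = sum(1.0 / np.log2(pos + 2) for pos, i in enumerate(recs[u][:k]) if i in rel[u])
        idcg = sum(1.0 / np.log2(pos + 2) for pos in range(min(k, len(rel[u]))))
        scores.append(dcg / idcg if idcg > 0 else 0.0)
    return np.mean(scores)

def item_coverage(recs, n_train_items):
    return len(set().union(*recs.values())) / n_train_items

def aplt_at_k(recs, short_head, k):
    return np.mean([len([i for i in recs[u][:k] if i not in short_head]) / k for u in recs])
```

For instance, with recs = {0: [5, 2, 9]} and rel = {0: {2, 7}}, recall_at_k gives 0.5 and ndcg_at_k(recs, rel, 3) is roughly 0.39.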
  • 16. On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis The 1st International Workshop on Deep Multimodal Learning for Information Retrieval (Ottawa, November 02, 2023) 16 Recommendation accuracy, diversity, and popularity bias (RQ1)
  Table 2: Results in terms of recommendation accuracy (Recall, nDCG), diversity (iCov), and popularity bias (APLT). For accuracy metrics, ↑ means better performance, while ↓ means less diversity and more popularity bias. While iCov and APLT would generally adhere to the principle of "higher is better" (↑) for an ideal recommender, the opposite is considered here to emphasize which models perform worst in terms of diversity and popularity bias. Each cell reports Recall↑ / nDCG↑ / iCov↓ / APLT↓.
  Office
  Models    top@10                              top@20                              top@50
  Random    0.0034 / 0.0020 / 2,414 / 0.5950    0.0079 / 0.0034 / 2,414 / 0.5948    0.0220 / 0.0068 / 2,414 / 0.5924
  MostPop   0.0302 / 0.0208 / 20 / 0.0000       0.0533 / 0.0282 / 32 / 0.0000       0.1143 / 0.0439 / 66 / 0.0000
  MFBPR     0.0602 / 0.0389 / 2,268 / 0.2294    0.0955 / 0.0500 / 2,357 / 0.2379    0.1657 / 0.0677 / 2,398 / 0.2513
  VBPR      0.0652 / 0.0419 / 2,265 / 0.2321    0.1025 / 0.0533 / 2,354 / 0.2375    0.1774 / 0.0721 / 2,404 / 0.2469
  MMGCN     0.0455 / 0.0300 / 74 / 0.0016       0.0798 / 0.0405 / 112 / 0.0078      0.1575 / 0.0598 / 247 / 0.0205
  GRCN      0.0393 / 0.0253 / 2,390 / 0.3438    0.0667 / 0.0339 / 2,409 / 0.3469    0.1250 / 0.0488 / 2,414 / 0.3548
  LATTICE   0.0664 / 0.0449 / 2,121 / 0.1752    0.1029 / 0.0566 / 2,315 / 0.2039    0.1780 / 0.0751 / 2,397 / 0.2413
  Toys
  Random    0.0011 / 0.0006 / 11,879 / 0.4894   0.0021 / 0.0008 / 11,879 / 0.4896   0.0051 / 0.0015 / 11,879 / 0.4902
  MostPop   0.0130 / 0.0075 / 13 / 0.0000       0.0229 / 0.0104 / 24 / 0.0000       0.0451 / 0.0156 / 56 / 0.0000
  MFBPR     0.0641 / 0.0403 / 10,016 / 0.1167   0.0903 / 0.0481 / 10,944 / 0.1268   0.1394 / 0.0596 / 11,544 / 0.1460
  VBPR      0.0710 / 0.0458 / 10,085 / 0.1064   0.1006 / 0.0545 / 11,026 / 0.1180   0.1523 / 0.0667 / 11,624 / 0.1400
  MMGCN     0.0256 / 0.0150 / 4,499 / 0.0961    0.0426 / 0.0200 / 6,238 / 0.1058    0.0785 / 0.0285 / 8,657 / 0.1263
  GRCN      0.0554 / 0.0354 / 11,007 / 0.2368   0.0831 / 0.0436 / 11,609 / 0.2482   0.1355 / 0.0559 / 11,847 / 0.2679
  LATTICE   0.0805 / 0.0512 / 8,767 / 0.0546    0.1165 / 0.0617 / 10,285 / 0.0684   0.1771 / 0.0759 / 11,397 / 0.0950
  Clothing
  Random    0.0004 / 0.0002 / 23,016 / 0.4487   0.0010 / 0.0003 / 23,016 / 0.4478   0.0024 / 0.0006 / 23,016 / 0.4482
  MostPop   0.0089 / 0.0046 / 13 / 0.0000       0.0157 / 0.0063 / 24 / 0.0000       0.0322 / 0.0095 / 56 / 0.0000
  MFBPR     0.0303 / 0.0156 / 18,414 / 0.0729   0.0459 / 0.0195 / 20,582 / 0.0824   0.0734 / 0.0249 / 22,171 / 0.1017
  VBPR      0.0339 / 0.0181 / 19,195 / 0.0809   0.0529 / 0.0229 / 21,251 / 0.0915   0.0847 / 0.0292 / 22,555 / 0.1112
  MMGCN     0.0227 / 0.0119 / 1,744 / 0.0044    0.0348 / 0.0150 / 2,864 / 0.0066    0.0609 / 0.0201 / 5,373 / 0.0121
  GRCN      0.0319 / 0.0164 / 21,490 / 0.2358   0.0496 / 0.0209 / 22,503 / 0.2459   0.0858 / 0.0281 / 22,954 / 0.2631
  LATTICE   0.0502 / 0.0275 / 13,463 / 0.0134   0.0744 / 0.0336 / 17,538 / 0.0207   0.1186 / 0.0425 / 21,458 / 0.0385
  Even if LATTICE's iCov is the second-worst across all the datasets, the metric is always close to the best models in terms of diversity.
  Takeaway: LATTICE stands out for its accuracy performance… 😃 …but amplifies popularity bias 🥲
  • 17. On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis The 1st International Workshop on Deep Multimodal Learning for Information Retrieval (Ottawa, November 02, 2023) 17 Recommendation accuracy, diversity, and popularity bias (RQ1)
  (Table 2 from the previous slide, repeated.)
  Takeaway: MMGCN struggles with diversity… 🤒 ...exhibits strong popularity bias… 😱 …and sacrifices accuracy in certain scenarios ☠
  • 18. On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis The 1st International Workshop on Deep Multimodal Learning for Information Retrieval (Ottawa, November 02, 2023) 18 Recommendation accuracy, diversity, and popularity bias (RQ1)
  (Table 2 from the previous slides, repeated.)
  VBPR and GRCN confirm their ability, already observed for diversity, to also contain popularity bias.
  Takeaway: VBPR and GRCN better manage all the metrics by finding the right compromise among them 😎
  • 19. On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis The 1st International Workshop on Deep Multimodal Learning for Information Retrieval (Ottawa, November 02, 2023) 19 Modalities influence on recommendation performance (RQ2): single-metric setting
  [Figure 2: Percentage variation on the (a) Recall, (b) iCov, and (c) APLT when training the multimodal recommender systems with either visual or textual modalities. The 0% line stands for the reference performance provided by the multimodal version of each model. All results refer to the top@20 recommendation lists; panels cover the Office, Toys, and Clothing datasets for VBPR, MMGCN, GRCN, and LATTICE.]
  Accuracy (Figure 2a): the trends are consistent. The visual modality reduces Recall, while the textual modality increases it (with the only exception of VBPR, whose percentage variation is negligible).
  Diversity (Figure 2b): the variation is quasi-stable. On Office, each modality's contribution is generally irrelevant except for MMGCN, where the visual modality slightly improves coverage across the recommendation list while the textual one worsens it by a large margin. On Toys, both modalities decrease coverage when injected separately; MMGCN is again the model most affected, with coverage deteriorating widely under either modality. On Clothing, both modalities lower item coverage, the visual one in particular.
  Popularity bias (Figure 2c): on Office, both modalities tend to enforce popularity bias when injected singularly, with the only exception of LATTICE, whose textual modality limits it (the APLT increases). On Toys, every single modality enforces the popularity bias of MMGCN and GRCN.
  Takeaway (single-metric setting): The textual modality improves the accuracy… 💪 …while both modalities negatively affect the diversity and reinforce the popularity bias 😭
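The percentage variations plotted in Figure 2 can be reproduced with a one-line helper: each single-modality run is compared against the multimodal configuration of the same model, which defines the 0% reference. The numbers in the usage line are placeholders, not values from the paper.

```python
# Sketch of the percentage-variation computation behind Figure 2: a metric from
# a single-modality run is compared to the multimodal configuration of the same
# model, which defines the 0% reference line. Example values are illustrative.

def pct_variation(single_modality_value, multimodal_value):
    """Relative change (in %) of a metric w.r.t. the multimodal reference."""
    return 100.0 * (single_modality_value - multimodal_value) / multimodal_value

# e.g., Recall@20 of a model trained with textual features only vs. its
# multimodal counterpart (hypothetical numbers):
print(f"{pct_variation(0.0540, 0.0500):+.1f}%")  # prints "+8.0%"
```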
  • 20. On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis The 1st International Workshop on Deep Multimodal Learning for Information Retrieval (Ottawa, November 02, 2023) 20 Modalities influence on recommendation performance (RQ2): pair-wise metric setting
  [Figure 3: Performance analysis on Clothing when comparing (a) Recall vs. APLT, (b) Recall vs. iCov, and (c) iCov vs. APLT for different modality settings involving the multimodal, visual, and textual modalities. Metrics are on top@20, for VBPR, MMGCN, GRCN, and LATTICE.]
  Takeaway (pair-wise metric setting): The textual modality has a significant influence on accuracy… 😣 but minimal effects on diversity and popularity bias 😇
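A pair-wise view like Figure 3 can be sketched as a scatter plot with one point per (model, modality setting); the coordinates below are placeholders meant only to show the plotting pattern, not the reported results.

```python
import matplotlib.pyplot as plt

# Sketch of a pair-wise metric view in the spirit of Figure 3a (Recall vs. APLT):
# one point per (model, modality setting). All numbers are placeholders.

points = {
    ("VBPR", "multimodal"): (0.050, 0.090),
    ("VBPR", "visual"):     (0.048, 0.095),
    ("VBPR", "textual"):    (0.053, 0.088),
}

for (model, setting), (recall, aplt) in points.items():
    plt.scatter(recall, aplt, label=f"{model} ({setting})")

plt.xlabel("Recall@20")
plt.ylabel("APLT@20")
plt.legend()
plt.show()
```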
  • 21. On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis The 1st International Workshop on Deep Multimodal Learning for Information Retrieval (Ottawa, November 02, 2023) 21 Modalities influence on recommendation performance (RQ2): pair-wise metric setting (cont'd)
  (Figure 3 from the previous slide, repeated.)
  Takeaway: The visual modality reduces accuracy… 😨 …and jointly worsens the popularity bias and diversity 😵
  • 23. On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis The 1st International Workshop on Deep Multimodal Learning for Information Retrieval (Ottawa, November 02, 2023) 23 Conclusion
  ● Analysis of the influence of multimodality on popularity bias
  ● Four state-of-the-art multimodal recommendation approaches on three datasets
  ● Three evaluation dimensions and three modality settings
  ● [RQ1] VBPR and GRCN strike a better compromise among all metrics
  ● [RQ2, single metric] Separate injection of modalities improves accuracy but negatively impacts diversity and popularity bias
  ● [RQ2, metric pairs, textual] Strong impact on accuracy but little effect on diversity and popularity bias
  ● [RQ2, metric pairs, visual] Reduces accuracy while exacerbating popularity bias and limiting diversity
  Future work
  ● More complete study of the performance of these models
  ● Assessing the performance of more recent multimodal approaches [Malitesta et al. (2023a)]
  • 24. On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis The 1st International Workshop on Deep Multimodal Learning for Information Retrieval (Ottawa, November 02, 2023) Reach out to us! 24 The authors: • Daniele Malitesta (daniele.malitesta@poliba.it) • Giandomenico Cornacchia (giandomenico.cornacchia@poliba.it) • Claudio Pomo (claudio.pomo@poliba.it) • Tommaso Di Noia (tommaso.dinoia@poliba.it)
  • 25. On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis The 1st International Workshop on Deep Multimodal Learning for Information Retrieval (Ottawa, November 02, 2023) Don’t forget to check out our theoretical/experimental survey 25