London Information Retrieval Meetup
A Learning to Rank Project on a
Daily Song Ranking Problem
Ilaria Petreti, Information Retrieval/ML
Engineer
3rd November 2020
London Information Retrieval Meetup
Ilaria Petreti
• Information Retrieval/Machine Learning Engineer
• Master in Data Science
• Passionate about Data Mining and Machine Learning technologies
• Sports and Healthy Lifestyle lover
Who I Am
London Information Retrieval Meetup
● Headquartered in London, distributed team
● Open Source Enthusiasts
● Apache Lucene/Solr/Elasticsearch experts
● Community Contributors
● Active Researchers
● Hot Trends: Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning
www.sease.io
Search Services
London Information Retrieval Meetup
Clients
London Information Retrieval Meetup
Overview
Problem Statement
Data Preprocessing
Model Training
Results
London Information Retrieval Meetup
How can we create a Learning to Rank pipeline using Spotify’s Kaggle dataset?
Problem Statement
https://www.kaggle.com/edumucelli/spotifys-worldwide-daily-song-ranking
London Information Retrieval Meetup
LTR is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, to the construction of ranking models for information retrieval systems.
Training data consists of lists of items; each item is composed of:
• Query ID
• Relevance Rating
• Feature Vector (N features, each an <id>:<value> pair)
Learning to Rank
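For example, in the widely used LibSVM-style LTR format (values below are illustrative, not from the dataset), each line is one judged <query, document> pair:

3 qid:1 1:0.53 2:0.12 3:7.0
0 qid:1 1:0.13 2:0.91 3:2.0
2 qid:2 1:0.87 2:0.06 3:5.0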
London Information Retrieval Meetup
Spotify’s Worldwide
Daily Song Ranking:
• The 200 most-listened songs in 53 countries
• From 1st January 2017 to 9th January 2018
• More than 3 million rows
• 6629 artists and 18598 songs
• A total of 105 billion stream counts
Dataset Description
London Information Retrieval Meetup
Learning to Rank: Our Approach
• QUERY is the Region
• DOCUMENT is the Song
• Relevance Rating = estimated from Position on Chart
• Feature Vector = all the other N features
(Diagram: Spotify Search Engine → Trained Ranking Model)
London Information Retrieval Meetup
Problem Statement
Data Preprocessing
Model Training
Results
London Information Retrieval Meetup
Feature Level
Each sample is a <query, document> pair; the feature vector describes this pair numerically.

Document level
This feature describes a property of the DOCUMENT. The value of the feature depends only on the document instance.
e.g. Document Type = Digital Music Service Product
- Track Name
- Artist
- Streams

Query level
This feature describes a property of the QUERY. The value of the feature depends only on the query instance.
e.g. Query Type = Digital Music Service Search
- Month
- Day
- Weekday

Query Dependent
This feature describes a property of the QUERY in correlation with the DOCUMENT. The value of the feature depends on both the query and the document instance.
e.g. Query Type = Digital Music Service Search, Document Type = Digital Music Service Product
- Matching query Region-Title Language
- Matching query Region-Artist Nationality
London Information Retrieval Meetup
Data Preprocessing: Data Cleaning
Data Cleaning targets five qualities:
• Validity
• Accuracy
• Consistency
• Completeness
• Uniformity
Handling Missing Values:
a total of 657 NaN values in the Track Name and Artist features, filled using a DICTIONARY keyed by song ID (URL):
{0: 'Reggaetón Lento (Bailemos)', 1: 'Chantaje', 2: 'Otra Vez (feat. J Balvin)', 3:
"Vente Pa' Ca", 4: 'Safari', 5: 'La Bicicleta', 6: 'Ay Mi Dios', 7: 'Andas En Mi Cabeza',
8: 'Traicionera', 9: 'Shaky Shaky', 10: 'Vacaciones', 11: 'Dile Que Tu Me Quieres', 12:
'Let Me Love You', 13: 'DUELE EL CORAZON', 14: 'Chillax', 15: 'Borro Cassette', 16:
'One Dance', 17: 'Closer', …}
ID (URL)   Track Name
0          Reggaetón Lento (Bailemos)
1          Chantaje
2          Otra Vez (feat. J Balvin)
0          NaN
3          Vente Pa' Ca
4          Safari
3          NaN
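A minimal pandas sketch of this step (column names as in the table above; the dictionary is the ID → Track Name mapping):

import pandas as pd

track_names = {0: 'Reggaetón Lento (Bailemos)', 1: 'Chantaje',
               3: "Vente Pa' Ca", 4: 'Safari'}

df = pd.DataFrame({'ID': [0, 1, 0, 3, 4, 3],
                   'Track Name': ['Reggaetón Lento (Bailemos)', 'Chantaje',
                                  None, "Vente Pa' Ca", 'Safari', None]})

# map() looks each ID up in the dictionary; fillna() only touches the NaN rows
df['Track Name'] = df['Track Name'].fillna(df['ID'].map(track_names))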
London Information Retrieval Meetup
Feature Engineering
• Prepare the proper input dataset, compatible with the machine learning algorithm requirements
• Improve the performance of machine learning models
Main techniques: Feature Selection, Feature Extraction, Feature Transformation, Feature Importance, Categorical Encoding
London Information Retrieval Meetup
Feature Engineering: Grouping
Target - Relevance Rating
Position: song's position on chart (1, 2, 3, …, 200).
Position values have been grouped in two different ways:
1. Relevance Labels (Ranking) from 0 to 10
2. Relevance Labels (Ranking) from 0 to 20

Grouping used for the 0-10 labels:

Position   Ranking
1          10
2          9
3          8
4-5        7
6-10       6
11-20      5
21-35      4
36-55      3
56-80      2
81-130     1
131-200    0
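A pandas sketch of the 0-10 grouping, with bin edges taken from the table above:

import pandas as pd

bins = [0, 1, 2, 3, 5, 10, 20, 35, 55, 80, 130, 200]
labels = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]  # better chart position -> higher relevance

positions = pd.Series([1, 4, 15, 120, 200])
ranking = pd.cut(positions, bins=bins, labels=labels)  # -> 10, 7, 5, 1, 0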
London Information Retrieval Meetup
Feature Engineering: Categorical Encoding
Track Name: song title — a Document Level Feature
(e.g. Reggaetón Lento (Bailemos), Chantaje, Otra Vez (feat. J Balvin), …, Let Her Go)
Two different approaches:
• Hash Encoding: feature hashing maps each category in a categorical feature to an integer within a pre-determined range
• doc2vec: a method to create a numeric representation of a document/sentence, regardless of its length
London Information Retrieval Meetup
Categorical Encoding: Hash Encoding
Feature Hashing, or “The Hashing Trick”, is a fast and space-efficient way of vectorising features.
• Use of the category_encoders library:

import category_encoders as ce

title_encoder = ce.HashingEncoder(cols=['Track Name'], n_components=8)
newds = title_encoder.fit_transform(ds2)

• Main arguments:
  • cols: a list of columns to encode
  • n_components: how many bits to use to represent the feature (default is 8 bits)
  • hash_method: which hashing method to use (default is the “md5” algorithm)
https://contrib.scikit-learn.org/category_encoders/hashing.html
London Information Retrieval Meetup
Categorical Encoding: Doc2Vec
• Adaptation of Word2Vec that adds another feature vector, named Paragraph ID
• Use of the gensim library
• Represent each sentence as a list of words (tokens)
• Create a new TaggedDocument instance for each (tokens, tag) pair
• Build the Vocabulary
• Train the Doc2Vec model; the main parameters are:
  • documents: iterable list of TaggedDocument elements;
  • dm ({1, 0}): defines the training algorithm; by default dm=1, the Distributed Memory version of Paragraph Vector (PV-DM);
  • min_count: ignores all words with total frequency lower than this;
  • vector_size: dimensionality of the feature vectors (100 by default).
(Pipeline: TaggedDocument → Trained Document Vectors)
https://radimrehurek.com/gensim/models/doc2vec.html
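A minimal gensim sketch of the steps above (titles and tags are illustrative):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

titles = ['Reggaetón Lento (Bailemos)', 'Chantaje', 'One Dance']
tagged = [TaggedDocument(words=title.lower().split(), tags=[i])
          for i, title in enumerate(titles)]

model = Doc2Vec(dm=1, vector_size=100, min_count=1)  # PV-DM, 100-dim vectors
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=20)

title_vector = model.dv[0]  # trained vector for the first title (gensim >= 4)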
London Information Retrieval Meetup
Feature Engineering: Language Detection from the Song Titles
Libraries evaluated:
• langdetect and guess_language-spirit: no usage limitations, but low accuracy (built for large texts)
• TextBlob and Googletrans: high accuracy, but limited access (API)
https://pypi.org/
https://textblob.readthedocs.io/en/dev/api_reference.html
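A minimal langdetect sketch (the library is non-deterministic on short texts unless seeded, which is exactly the weakness noted above):

from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make results reproducible
print(detect('Reggaetón Lento (Bailemos)'))  # e.g. 'es'; short titles are often misdetected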
London Information Retrieval Meetup
Feature Engineering: Categorical Encoding
Artist: name of musician/singer or group — a Document Level Feature.

Leave One Out Encoding:
• Use of the category_encoders library
• It replaces each category with the mean of the target over the other rows of that level, i.e. it excludes the current row’s target when calculating the mean target for a level

Artist          Artists (encoded)
CNCO            78.12742
Shakira         68.62432
Zion & Lennox   61.62190
…               …
Passengers      167.15266

(The slide also walks through a toy example with FEATURE values A/B/C and TARGET values 0.39, 0.24, 2.21, 0.76, 0.27 and 4.01.)
https://contrib.scikit-learn.org/category_encoders/leaveoneout.html
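A minimal category_encoders sketch, reusing the toy FEATURE/TARGET values from the figure above:

import category_encoders as ce
import pandas as pd

X = pd.DataFrame({'Artist': ['A', 'C', 'B', 'B', 'C', 'A']})
y = pd.Series([0.39, 0.24, 2.21, 0.76, 0.27, 4.01])  # target (relevance label)

encoder = ce.LeaveOneOutEncoder(cols=['Artist'])
X_encoded = encoder.fit_transform(X, y)
# The first 'A' row is encoded as 4.01: the mean target of the other 'A' rows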
London Information Retrieval Meetup
Feature Engineering: Extracting Date
Date: chart date — a Query Level Feature, split into four new features:

Date         Year   Month   Day   Weekday
2017/01/01   2017   1       1     6
2017/01/02   2017   1       2     0
2017/01/03   2017   1       3     1
…            …      …       …     …
2018/01/09   2018   1       9     1
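A minimal pandas sketch of this extraction (the Date string format is assumed):

import pandas as pd

df = pd.DataFrame({'Date': ['2017/01/01', '2017/01/02', '2018/01/09']})
df['Date'] = pd.to_datetime(df['Date'], format='%Y/%m/%d')

df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['Weekday'] = df['Date'].dt.weekday  # Monday = 0 ... Sunday = 6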
London Information Retrieval Meetup
Feature Engineering
Region: country code — used as the QUERY.

Region   query_ID
ec       0
fi       1
cr       2
…        …
hn       53

pandas.factorize() is used to obtain a numeric representation of an array when all that matters is identifying distinct values.
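A one-line sketch of this mapping, with region codes from the table:

import pandas as pd

regions = pd.Series(['ec', 'fi', 'cr', 'ec', 'hn'])
query_ids, uniques = pd.factorize(regions)
# query_ids -> [0, 1, 2, 0, 3]: every distinct region gets one stable integer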
London Information Retrieval Meetup
Feature Engineering
Final Dataset
London Information Retrieval Meetup
Problem Statement
Data Preprocessing
Model Training
Results
London Information Retrieval Meetup
Model Training: XGBoost
XGBoost is an optimised distributed gradient boosting library
designed to be highly efficient, flexible and portable.
https://github.com/dmlc/xgboost
• It implements machine learning algorithms under the Gradient Boosting framework
• It is Open Source
• It supports both pairwise and list-wise models
London Information Retrieval Meetup
Model Training: XGBoost
1. Split the entire dataset into:
   • Training Set, used to build and train the model (80%)
   • Test Set, used to evaluate the model performance on unseen data (20%)
2. Separate the Relevance Label, query_ID and training vectors into different components to create the XGBoost matrices.
DMatrix is an internal data structure used by XGBoost, optimised for both memory efficiency and training speed.
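A possible sketch of step 1 (the talk states the 80/20 proportions but not the exact split method; data_set_data_frame is a hypothetical name for the full dataset):

from sklearn.model_selection import train_test_split

# Hypothetical 80/20 row split; the exact method used in the talk is not shown
training_set_data_frame, test_set_data_frame = train_test_split(
    data_set_data_frame, test_size=0.2, random_state=42)

# Rows must be contiguous per query before DMatrix.set_group() is called
training_set_data_frame = training_set_data_frame.sort_values('query_ID')
test_set_data_frame = test_set_data_frame.sort_values('query_ID')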
London Information Retrieval Meetup
Training and Test Set Creation

training_data_set = training_set_data_frame[
    training_set_data_frame.columns.difference(
        ['Ranking', 'ID', 'query_ID'])]
training_query_id_column = training_set_data_frame['query_ID']
training_query_groups = training_query_id_column.value_counts(sort=False)
training_label_column = training_set_data_frame['Ranking']

training_xgb_matrix = xgboost.DMatrix(training_data_set,
                                      label=training_label_column)
training_xgb_matrix.set_group(training_query_groups)

test_data_set = test_set_data_frame[
    test_set_data_frame.columns.difference(
        ['Ranking', 'ID', 'query_ID'])]
test_query_id_column = test_set_data_frame['query_ID']
test_query_groups = test_query_id_column.value_counts(sort=False)
test_label_column = test_set_data_frame['Ranking']

test_xgb_matrix = xgboost.DMatrix(test_data_set, label=test_label_column)
test_xgb_matrix.set_group(test_query_groups)
London Information Retrieval Meetup
Train and test the model with the LambdaMART method:
Model Training: XGBoost
• The LambdaMART model uses gradient boosted decision trees with a cost function derived from LambdaRank for solving a ranking task.
• The model performs list-wise ranking where Normalised Discounted Cumulative Gain (NDCG) is maximised.
• List-wise approaches directly look at the entire list of documents and try to come up with the optimal ordering for it.
• The Evaluation Measure is an average across the queries.
London Information Retrieval Meetup
Train and test the model with LambdaMART:
params = {'objective': 'rank:ndcg', 'eval_metric': 'ndcg@10', 'verbosity': 2}
watch_list = [(test_xgb_matrix, 'eval'), (training_xgb_matrix, 'train')]

print('- - - - Training The Model')
# early_stopping_rounds is an argument of xgboost.train(), not a params entry
xgb_model = xgboost.train(params, training_xgb_matrix, num_boost_round=999,
                          evals=watch_list, early_stopping_rounds=10)

print('- - - - Saving XGBoost model')
xgboost_model_json = output_dir + "/xgboost-" + name + ".json"
xgb_model.dump_model(xgboost_model_json, fmap='', with_stats=True,
                     dump_format='json')
Model Training: LambdaMART
London Information Retrieval Meetup
Evaluation Metric: List-wise and NDCG

• DCG@K = Discounted Cumulative Gain@K
It measures the usefulness, or gain, of a document based on its position in the result list:
DCG@K = Σ_{i=1..K} (2^rel_i − 1) / log2(i + 1)
where rel_i is the relevance weight of the document and i is its result position.
• NDCG@K = DCG@K / Ideal DCG@K
• It will be in the range [0, 1]

Relevance by result position:

Position   Model1   Model2   Model3   Ideal
1          1        2        2        4
2          2        3        4        3
3          3        2        3        2
4          4        4        2        2
5          2        1        1        1
6          0        0        0        0
7          0        0        0        0
DCG        14.01    15.76    17.64    22.60
NDCG       0.62     0.70     0.78     1.0
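A small Python check of the table above, using the exponential gain (2^rel − 1) that reproduces the slide’s numbers:

import math

def dcg_at_k(relevances, k):
    # (2^rel - 1) / log2(i + 1), with result positions i starting at 1
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

model1 = [1, 2, 3, 4, 2, 0, 0]  # relevance by result position (Model1 column)
ideal = [4, 3, 2, 2, 1, 0, 0]   # ideal ordering of the same judgements

print(round(dcg_at_k(model1, 10), 2))                        # 14.01
print(round(dcg_at_k(model1, 10) / dcg_at_k(ideal, 10), 2))  # NDCG = 0.62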
London Information Retrieval Meetup
Common Mistakes
Common mistakes to avoid during model creation:
• One sample per query group
• One Relevance Label for all the samples in a query group: under-sampled query IDs can potentially skyrocket your average NDCG
London Information Retrieval Meetup
Problem Statement
Data Preprocessing
Model Training
Results
London Information Retrieval Meetup
Results

Encoding           Relevance Labels   train-ndcg@10   eval-ndcg@10
Hash Encoding      (0-10)             0.7179          0.7351
Hash Encoding      (0-20)             0.8018          0.7740
doc2vec Encoding   (0-10)             0.8235          0.7633
doc2vec Encoding   (0-20)             0.8215          0.8244

NDCG@10, where ‘@10’ denotes that the metric is evaluated only on the top 10 documents/songs.
London Information Retrieval Meetup
• Importance of Data Preprocessing and Feature Engineering
• Language Detection as an additional feature
• doc2vec and Relevance Rating [0, 20] as the best approaches
• Online testing in LTR evaluation
• Use of the Tree SHAP library for feature importance
https://github.com/slundberg/shap
Conclusions
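As a hedged sketch of that last point, Tree SHAP can be applied directly to the trained booster (xgb_model and training_data_set as defined in the earlier slides):

import shap

explainer = shap.TreeExplainer(xgb_model)              # Tree SHAP on the trained booster
shap_values = explainer.shap_values(training_data_set) # one SHAP value per feature per row
shap.summary_plot(shap_values, training_data_set)      # global feature importance view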
London Information Retrieval Meetup
Thanks!