Deep Learning on Tabular
Data, Predicting Profitability
Peiyu Wang
Senior Data Scientist
Klarna Bank AB
Content
01 Introduction - Introduction to the problem and research questions
02 Background - Tabular data, models used
03 Method - How did I answer the research questions?
04 Results and analysis - What were the outcomes?
05 Conclusions - Conclusions and future work
Introduction.
Klarna: the largest buy-now-pay-later fintech in Europe
Buy now, pay later:
- Allows customers to buy a product without paying right away
- Essentially, the customer is taking a loan from the company
- Therefore, Klarna has to assess the value/risk of letting a customer pay at a later point in time
Introduction
Problem
Problem: Which transactions should be granted credit?
Solution:
- Predict the expected profit of the transaction, and set thresholds/limits accordingly
Introduction
Problem
Modelling problem: given the customer/transaction attributes at purchase time, what is the expected, matured profit?
Introduction
Research questions
1. Can deep learning models outperform the GBT model?
2. Can the models’ decisions be understood by a human?
3. Can pre-training across markets give a boost in performance?
Background.
Problem space
DL on tabular data
Q: Why even use deep learning on tabular data?
Some reasons:
1. A minor boost in performance can have a significant impact
2. Multimodal data, e.g. image + tabular data
3. Get a better understanding of the limits of DL
Problem space
DL on tabular data
Q: What makes tabular data different from e.g. images or text?
Some reasons:
1. No common correlation structure between features
2. Often requires pre-processing
3. Data quality
Problem space
DL on tabular data
Encoding based
- Neural nets cannot work with categoricals directly
- Simple one-hot encoding, transformation to a latent space, or conversion to homogeneous data
Attention based
- Inspired by progress in NLP/CV
- Attention helps the network focus on salient features
- Attention provides a way of interpretability
Hybrid based
- Combines classical ML methods with DL
- E.g. use DL to find complex feature interactions, then pass them to a linear model
Problem space
Selected models
Two models were selected:
- VIME (Value Imputation and Mask Estimation) [1]
- TabNet [2]
[1] J. Yoon, Y. Zhang, J. Jordon, and M. van der Schaar, “VIME: Extending the success of self- and semi-supervised learning to tabular domain,” 2020.
[2] S. O. Arik and T. Pfister, “TabNet: Attentive interpretable tabular learning,” 2020.
Background
VIME - Value Imputation and Mask Estimation
- Introduces two pretext tasks for tabular data. Features are corrupted by a binary mask and recovered through:
  1. Mask estimation
  2. Feature estimation
- Otherwise a very simple architecture: only multilayer perceptrons
- The encoder can then be used for the downstream task
[Figure: VIME self-supervised learning framework]
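To make the pretext tasks concrete, here is a minimal sketch of VIME-style corruption and the two estimation losses, loosely following Yoon et al. (2020); the corruption rate p_m, layer sizes and single-layer heads are illustrative assumptions, not the thesis's configuration.

```python
# Minimal VIME-style self-supervised step (sketch; sizes/rates illustrative).
import torch
import torch.nn as nn

def corrupt(x, p_m=0.3):
    """Corrupt features with a Bernoulli mask; masked entries are replaced
    by values drawn from each column's empirical marginal distribution."""
    mask = torch.bernoulli(torch.full_like(x, p_m))
    shuffle_idx = torch.argsort(torch.rand_like(x), dim=0)  # per-column shuffle
    x_marginal = torch.gather(x, 0, shuffle_idx)
    return (1 - mask) * x + mask * x_marginal, mask

d = 30                                       # features after pre-processing
encoder = nn.Sequential(nn.Linear(d, d), nn.ReLU())
mask_head = nn.Sequential(nn.Linear(d, d), nn.Sigmoid())   # mask estimation
feature_head = nn.Linear(d, d)                             # feature estimation

x = torch.randn(256, d)                      # stand-in for a data batch
x_tilde, mask = corrupt(x)
z = encoder(x_tilde)
loss = nn.BCELoss()(mask_head(z), mask) + nn.MSELoss()(feature_head(z), x)
loss.backward()  # after training, the encoder is reused for the downstream task
```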
Background
TabNet
[Figure: TabNet encoder architecture]
- Interpretable DL model
- Attention masks enable instance-wise feature selection
- Built from feature transformers and attentive transformers
- Additive output from all decision steps
- Global feature importance from aggregated attention masks
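As a sketch of how the aggregated masks translate into feature importances in practice, the snippet below uses the pytorch-tabnet package; the library choice, synthetic data and n_steps value are assumptions, since the talk does not name an implementation.

```python
# Global and per-step feature importance from TabNet's attention masks
# (sketch using pytorch-tabnet; data and settings are illustrative).
import numpy as np
from pytorch_tabnet.tab_model import TabNetRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 30)).astype(np.float32)
y_train = rng.normal(size=(2000, 1)).astype(np.float32)
X_val = rng.normal(size=(500, 30)).astype(np.float32)
y_val = rng.normal(size=(500, 1)).astype(np.float32)

model = TabNetRegressor(n_steps=3)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], max_epochs=5)

# Global importance: attention masks aggregated over all decision steps.
importances = model.feature_importances_      # shape: (30,)

# Local view: per-step masks show which features each row attended to.
explain_matrix, masks = model.explain(X_val)  # masks: {step: (rows, features)}
```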
Background
TabNet
[Figure: TabNet self-supervised pre-training: masked features are encoded and then reconstructed by a decoder of fully connected layers]
Method.
Method
Data
Market   Rows        Columns   % Missing Values
1          637,521     187           8.4
2          602,923     237          13.6
3        1,034,630      83           6.8
Method
Data
Pre-processing steps (a pipeline sketch follows below):
1. Raw data is converted into the correct datatypes, and remaining missing values are imputed using the mean
2. Feature scaling and one-hot encoding
3. Recursive feature elimination => the 30 most salient features
4. Data is split into train, validation and test sets
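A minimal sketch of these four steps with scikit-learn; the synthetic columns, the estimator driving RFE and the split ratios are illustrative assumptions, and the talk selects 30 features rather than the 2 used here.

```python
# Pre-processing sketch: impute -> scale/one-hot -> RFE -> split.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({                        # stand-in for raw market data
    "order_amount": np.where(rng.random(n) < 0.1, np.nan,
                             rng.exponential(100, n)),
    "channel": rng.choice(["web", "app", "store"], n),
    "profit": rng.normal(2, 5, n),         # matured profit, the target
})
X, y = df.drop(columns="profit"), df["profit"]

prep = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),   # step 1
                      ("scale", StandardScaler())]),                # step 2
     ["order_amount"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["channel"]),   # step 2
])
pipeline = Pipeline([
    ("prep", prep),
    ("rfe", RFE(GradientBoostingRegressor(), n_features_to_select=2)),  # step 3
])

# Step 4: hold out validation and test sets, then fit on train only.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2,
                                                random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp,
                                                  test_size=0.25,
                                                  random_state=0)
pipeline.fit(X_train, y_train)
```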
Method
Experiments
1. Evaluate the performance of the DL models and the XGBoost model for each market, with fully supervised training
2. Generate SHAP plots for both DL models, and attention maps for TabNet
3. Evaluate gains from pre-training/fine-tuning on pairs of markets (see the sketch after this list)
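For experiment 3, here is a minimal sketch of the pre-train/fine-tune flow using pytorch-tabnet's self-supervised pretrainer; the library, data shapes and epoch counts are assumptions, not the thesis's setup.

```python
# Cross-market pre-training sketch: self-supervised on market A,
# supervised fine-tuning on a small labelled set from market B.
import numpy as np
from pytorch_tabnet.pretraining import TabNetPretrainer
from pytorch_tabnet.tab_model import TabNetRegressor

rng = np.random.default_rng(0)
X_market_a = rng.normal(size=(5000, 30)).astype(np.float32)  # unlabeled source
X_market_b = rng.normal(size=(1000, 30)).astype(np.float32)  # few-label target
y_market_b = rng.normal(size=(1000, 1)).astype(np.float32)

# Pre-training: randomly mask features and learn to reconstruct them.
pretrainer = TabNetPretrainer()
pretrainer.fit(X_market_a, pretraining_ratio=0.5, max_epochs=20)

# Fine-tuning: warm-start the regressor from the pre-trained encoder.
model = TabNetRegressor()
model.fit(X_market_b, y_market_b, from_unsupervised=pretrainer, max_epochs=50)
```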
Method
Hyperparameter tuning
As with all ML research, hyperparameters have a large influence on the results.
Hyperopt, a Bayesian hyperparameter optimization framework, was used for efficient tuning (sketched below).
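A minimal sketch of a Hyperopt search; the search space and the stand-in objective are illustrative assumptions (in the thesis the objective would train a model and return its validation error).

```python
# Bayesian hyperparameter search with Hyperopt's TPE algorithm (sketch).
from hyperopt import Trials, fmin, hp, tpe

space = {
    "learning_rate": hp.loguniform("learning_rate", -7, 0),
    "n_steps": hp.choice("n_steps", [3, 4, 5]),
}

def objective(params):
    # In practice: train a model with `params` and return validation RMSE.
    # This stand-in just scores the parameters directly.
    return params["learning_rate"] + 0.01 * params["n_steps"]

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)  # best settings found (hp.choice values are reported as indices)
```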
Method
Metrics
Apart from RMSE and MAE, a custom evaluation metric, the portfolio error, was used (sketched below).
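The formula itself was shown as an image; below is a sketch reconstructed from the speaker notes: accept every transaction with predicted profit above 0, compare the profit of that strategy against the maximum achievable profit (the sum of all positive profits), and average the shortfall. The exact normalization is an assumption.

```python
# Portfolio-error sketch: average profit shortfall of the "accept if
# predicted profit > 0" strategy versus an oracle that accepts exactly
# the truly profitable transactions.
import numpy as np

def portfolio_error(y_true, y_pred):
    strategy_profit = y_true[y_pred > 0].sum()  # profit if we follow the model
    max_profit = y_true[y_true > 0].sum()       # oracle upper bound
    return (max_profit - strategy_profit) / len(y_true)

y_true = np.array([3.0, -2.0, 1.5, -0.5])
y_pred = np.array([2.5, 0.7, -0.2, -1.0])       # one bad accept, one miss
print(portfolio_error(y_true, y_pred))          # (4.5 - 1.0) / 4 = 0.875
```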
Results.
Results
Supervised performance
[Table: RMSE, MAE and portfolio error per market; the DL models generally win on RMSE and MAE, while XGBoost wins on the portfolio error]
Results
Training and inference times per row of data:

Model     Training time (μs)   Inference time (μs)
XGBoost         31.3                  0.18
TabNet          56.7                 12
VIME            47.8                 11
Results
Pre-training results - Market 1
[Figure: training loss with and without pre-training on another market; the loss reaches a minimum much faster with pre-training]
Results
Pre-training results - Market 1
[Figure: RMSE vs. number of labels, with and without self-supervised pre-training; pre-training helps most when few labels are available, with diminishing gains]
Results
Interpretability - Market 1
[Figure: SHAP summary plots for TabNet (left) and VIME (right); the top features largely agree between the two models]
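A minimal sketch of producing a SHAP summary plot like the ones shown; the model-agnostic KernelExplainer is an assumption here, chosen because it works for both the DL models and XGBoost (TreeExplainer would be faster for the latter), and the data is synthetic.

```python
# SHAP summary plot sketch: red/blue dots show high/low feature values
# pushing predictions up or down.
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3 * X[:, 0] + rng.normal(size=500)       # feature 0 dominates by design

model = xgb.XGBRegressor(n_estimators=50).fit(X, y)

background = shap.sample(X, 100)             # background set for the explainer
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X[:50])  # explain 50 rows

shap.summary_plot(shap_values, X[:50])       # the plot read off the slides
```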
Results
Interpretability - Market 1
[Figure: TabNet attention masks showing which features each decision step focuses on, per row of data]
Results
Interpretability - Market 1
[Figure: global feature importance from attention masks aggregated across all steps; the ranking matches the SHAP plots]
Conclusions & Discussion.
Conclusions & Discussion
Research questions
Q: Can deep learning models outperform the currently used model?
A: Yes, at least in terms of RMSE
Q: Can the models’ decisions be understood by a human?
A: Yes, through SHAP and attention maps
Q: Can pre-training across markets give a boost in performance?
A: Yes, pre-training gives a statistically significant (α = 0.05) boost in performance
Conclusions & Discussion
Discussion
Should we replace gradient boosted trees with VIME or TabNet?
- The DL models show promise: better RMSE, interpretable, and not too slow
- For this problem, lower RMSE ≠ better profitability
- DL requires GPUs to train efficiently => more cost
- It is a question of whether the decrease in RMSE can justify the added costs
Questions?

Editor's Notes

  • #2: Thank you all for coming and listening to my thesis presentation. My name is Samuel. I am a final-year student in Engineering Physics at KTH, and I have done my master's thesis on the topic of testing deep learning models on tabular data, in particular applying them to transactional underwriting.
  • #3: First I just want to give an overview of what I will be speaking about. I'll give an introduction to the problem and research questions, then go over the problem space and some background on relevant topics, and explain the models I used in depth. Then I'll get into the bulk of the thesis and talk about the experiments I did, the results, and finally the conclusions we can draw from them.
  • #5: My thesis has focused on modelling the transactional profitability of retail purchases made through buy now pay later. Buy now pay later, as the name suggests, simply means that a customer can buy a product without paying at purchase time. In essence, the company providing the buy-now-pay-later option is lending money to the customer and acting as a middleman between the merchant and the customer to allow the customer more time to pay. To make this a profitable business, buy-now-pay-later companies have to ensure that, to a large extent, they only accept customers who will bring profit and reject the rest.
  • #6: The core problem I have been working on is figuring out which transactions should be granted credit. By looking at this from a transactional perspective instead of a customer perspective, we can make more granular decisions based on the potential profitability of each transaction. The key metric used here is the profit of a transaction. [NEW SLIDE] Deciding who should be granted credit can be done in essentially two ways: one, by deciding on a profit threshold and classifying whether a transaction will yield more or less than this threshold, or two, by predicting the profit of the transaction and setting the threshold post hoc. In the team, the second option is the one that is used.
  • #7: Essentially, this becomes a regression problem where we want to predict the profit given what we know about the transaction, customer and merchant at purchase time.
  • #8: The company currently uses gradient boosted tree models for this problem, specifically XGBoost and CatBoost, and wanted to test whether deep learning could give better performance. This led me to the following research questions: Can the DL models outperform the currently used model? Here, by outperform, I mean performance measured using three different metrics that I will cover later. A very important aspect of using these complex models in an industry setting is explainability: can the models' decisions be understood by a human? Finally, can pre-training across markets give a boost in performance?
  • #10: Before diving into a project like this one, I think it is important to ask why we would want to use DL on tabular data in the first place. I have collected a few reasons here. Firstly, over the last few years DL has proved to give incredible results in the computer vision and NLP domains; motivated by this, similar performance gains might be possible if DL is applied to tabular data. Secondly, a nice thing about DL models is that they can work with data from different modalities at the same time, so for example if we have both image and tabular data we can use both to train a DL model, which would not be possible using, for example, XGBoost. Finally, I think that applying DL models to tabular data is interesting in itself because it gives a better understanding of what is possible with DL.
  • #11: In images and text there is a very strong correlation structure, for example between pixels in an image or words in a sentence. In tabular data the correlation among features is usually weaker, and the dependencies between features that do exist are usually rather complex and irregular, so the assumptions underlying, for example, CNNs or RNNs cannot be used. Tabular data also often contains categorical features that require some form of encoding, which adds complexity. Finally, tabular data, especially from industry applications, often has a lot of missing values, outliers and generally inconsistent data, which means more challenges for the models we use.
  • #12: Nevertheless, there has been quite a bit of research into DL applied to tabular data, and several successful models. Broadly speaking, these models fall into three categories. Encoding-based methods aim to produce encodings of the data so that existing architectures can be used; this can vary from simple one-hot encoding of categoricals followed by a simple multilayer perceptron, to turning the tabular data into an image and using a convolutional NN on it. One of the models tested in this thesis is from this category. The second category is based on the very successful self-attention mechanism from natural language processing. One nice feature of attention is that it can be used as a sort of feature importance and give increased transparency of the model, something I will talk more about later. The final category combines classical ML methods with DL to reap the benefits of both.
  • #13: Two models were selected for the thesis, based on their benchmark performance and how complicated they were to implement: VIME, which is an encoding-based model, and TabNet, which is attention-based.
  • #14: Now I will move on to talk a bit about the models I used; note that I won't go into detail here since we are limited on time. First out is a model called VIME, which introduces a way to do self-supervised learning on tabular data through two pretext tasks. These tasks are designed so that the network learns a useful representation of the data without access to labels. Essentially, the self-supervised training works by producing a binary mask that is used to corrupt the feature matrix; an encoder is then trained to recover both the corrupted feature values and the mask used to corrupt them. The architecture is fairly simple, consisting only of MLPs. After training the encoder, it can be used in a semi-supervised setting where any available labels can be used.
  • #15: TabNet is an attention-based DL model tailored for tabular data, consisting of multiple steps of sequential attention. The attention allows instance-wise feature selection through so-called attention masks, which let the model focus on a subset of features. The basic building blocks of the model are the attentive transformer, which produces the mask, and the feature transformer, which creates an encoding of the data. The output from each step in the sequence is aggregated to produce the final prediction and attention masks. It achieves impressive results on benchmark datasets, outperforming XGBoost.
  • #16: TabNet supports pre-training on unlabeled data: features are masked by a binary mask sampled from a Bernoulli distribution. The masked features are encoded using the same encoder scheme shown on the previous slide and finally passed through a decoder, consisting of FC layers, that tries to reconstruct the masked features. The network is trained using a reconstruction loss that measures the discrepancy between the reconstructed and original features. Once the encoder has been trained, it can be used in a supervised setting with any available labels.
  • #17: Now I want to talk a bit about the data I used, what experiments I did and what metrics I used to evaluate performance.
  • #18: I used data from three markets. There are plenty of rows, which is good since DL models often need a lot of data to converge, and many features, containing both internal and external variables (example on the next slide). Some features have quite a lot of missing values, especially features from external bureaus; columns with more than 50% missing values were dropped entirely.
  • #19: Raw data is converted into the correct datatypes and remaining missing values are imputed; feature scaling and one-hot encoding are applied; recursive feature elimination selects the 30 most salient features; and the data is split into train, validation and test sets.
  • #20: With the research questions in mind, 3 experiments were conducted.
  • #21: Hyperopt, which is a Bayesian hyperparameter optimization framework.
  • #22: This metric aims to evaluate the potential profitability over an entire dataset if the model's predictions were used. It works as follows: assume that we accept all transactions with a predicted profitability above 0; compute the total profit we would get following this strategy; compute the maximum profit possible under the strategy, that is, the sum of all the positive profits; and compute the average distance between the two.
  • #23: I have only included results
  • #24: Bold numbers indicate the best performance for the given market. Analysis: both DL models generally perform better than XGBoost in terms of MAE and RMSE, while XGBoost is the winner on the portfolio error. Note that there is some bias in this result, as it comes from predictions on only one test set per market, so we cannot really say that these results are statistically significant.
  • #25: Analysis: all times are averaged per row of data. Both DL models are significantly slower than XGBoost, especially in terms of inference time; however, they are still quick enough to be used in a production environment. Note the difficulty in measuring these times: training times can vary a lot depending on factors such as initialization and hardware, but I did set a random seed and averaged over all the market datasets to get a more accurate result. Moreover, the first epoch of training is always slower than the rest since the data has to be loaded into memory.
  • #26: Pre-training results: I have only included results from one market, using TabNet, due to lack of time; the full results are in my report. What do I mean by pre-training? Basically, I selected two of the three markets, used one market for self-supervised training, i.e. without the labels, and then used a limited number of labels from the other market. Analysis: in this figure we can see the loss minimization process when training TabNet on 1000 labels from market 1, with and without pre-training on market 2. Note how the loss reaches a minimum much more quickly with pre-training.
  • #27: Here I have plotted the RMSE for varying numbers of labels, with and without self-supervised pre-training on the two other markets. Analysis: when few labels are available, pre-training gives a significant boost in performance, with diminishing gains the more labels we have access to. The gains differ a bit depending on the market used for pre-training/fine-tuning, which could be due to some markets being more similar than others, but more investigation would be needed to confirm this.
  • #28: Here are SHAP plots for VIME and TabNet on market 1, for those of you who are not familiar with how to read these plots. Analysis: each dot is one row of data; red indicates a high feature value, blue a low feature value, and dots far to one side have a strong influence on the prediction in the positive or negative direction. For example, a high order amount seems to give a greater predicted profit. Some conclusions: the top features are similar for both TabNet and VIME, which gives us more confidence that these are truly important features. A benefit of SHAP is that we can see in which direction the features push, which would be useful if a customer wanted to know why they were not granted credit.
  • #29: The next interpretability approach is to look at the masks, or attention maps, produced by TabNet. I have put the feature names on the side so you can see them more clearly. Attention maps show which features are focused on, for each row of data, in the different steps of TabNet. This gives a very local view of the feature importance; however, these plots can be a bit hard to read, as we are more often interested in the global feature importance. Luckily, we can aggregate these masks to get a better understanding of which features are important.
  • #30: Analysis: by aggregating the attention maps across all the steps, one gets a feature importance ranking which is easier to interpret. We can see that the results match the SHAP plots. However, a downside is that we cannot deduce in which direction the features push the predictions, as we can with SHAP.
  • #32: 1. The DL models outperform XGBoost in RMSE, but not in portfolio error. 2. I would argue that the models are as interpretable as gradient boosted machines; yes, they are more complex and difficult to understand, but so are GBTs. 3. This result indicates that when few labels are available in one market, other markets can be used to pre-train the model and give it a head start in training.
  • #33: The DL models look promising: they outperform XGBoost in terms of RMSE, they are interpretable, and they are not far too slow. However, for this specific problem, a better RMSE does not necessarily mean better profitability: after prediction we set a threshold on the predictions, classifying them as accept or reject, so being very close to the actual value will not matter in most cases. Moreover, training DL models often requires a GPU to be done efficiently, and GPUs cost quite a bit more to train on. Essentially, it becomes a question of whether the slightly better performance can justify the added cost.
  • #34: If I had more time on this project, I would look further into two things: approaching the problem as a classification problem and comparing the performance to the current setup; and, as noted before, investigating why pre-training on some markets gave better performance. Perhaps those markets are more similar?