Deep Learning on Tabular
Data, Predicting Profitability
Peiyu Wang
Senior Data Scientist
Klarna Bank AB
Content
01 Introduction - Introduction to the problem and research questions
02 Background - Tabular data, models used
03 Method - How did I answer the research questions?
04 Results and analysis - What were the outcomes?
05 Conclusions - Conclusions and future work
Introduction.
Klarna: the largest buy-now-pay-later fintech in Europe
Buy now, pay later:
- Allows customers to buy a product without paying right away
- Essentially, the customer is taking a loan from the company
- Therefore, Klarna has to assess the value/risk of letting a customer pay at a later point in time
Introduction
Problem
Problem: Which transactions should be granted credit?
Solution:
- Predict the expected profit of the transaction, and set thresholds/limits accordingly
Introduction
Problem
Modelling problem: given the customer/transaction attributes at purchase time, what is the expected, matured profit?
Introduction
Research questions
1. Can deep learning models outperform the GBT model?
2. Can the models’ decisions be understood by a human?
3. Can pre-training across markets give a boost in performance?
Background.
Problem space
DL on tabular data
Q: Why even use deep learning on tabular data?
Some reasons:
1. A minor boost in performance can have a significant impact
2. Multimodal data, e.g. image + tabular data
3. Get a better understanding of the limits of DL
Problem space
DL on tabular data
Q: What makes tabular data different from e.g. images or text?
Some reasons:
1. No common correlation structure between features
2. Often requires pre-processing
3. Data quality
Problem space
DL on tabular data
Encoding based
- Neural nets cannot work with categoricals directly
- Simple one-hot encoding, transformation to a latent space, or conversion to homogeneous data
Attention based
- Inspired by progress in NLP/CV
- Attention helps the network focus on salient features
- Attention provides a way of interpretability
Hybrid based
- Combines classical ML methods with DL
- E.g. use DL to find complex feature interactions, then pass them to a linear model
Problem space
Selected models
Two models were selected:
- VIME (Value Imputation and Mask Estimation) [1]
- TabNet [2]
[1] J. Yoon, Y. Zhang, J. Jordon, and M. van der Schaar, “VIME: Extending the success of self- and semi-supervised learning to tabular domain,” 2020.
[2] S. O. Arik and T. Pfister, “TabNet: Attentive interpretable tabular learning,” 2020.
Background
VIME - Value Imputation and Mask Estimation
- Introduces two pretext tasks for tabular data. Features are corrupted by a binary mask and recovered through:
  1. Mask estimation
  2. Feature estimation
- Otherwise a very simple architecture: only multilayer perceptrons
- The encoder can then be used for the downstream task
[Figure: VIME self-supervised learning framework]
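To make the pretext tasks concrete, here is a minimal sketch of VIME-style corruption and the two estimation losses, loosely following Yoon et al. (2020); the corruption rate p_m, layer sizes and single-layer heads are illustrative assumptions, not the thesis's configuration.

```python
# Minimal VIME-style self-supervised step (sketch; sizes/rates illustrative).
import torch
import torch.nn as nn

def corrupt(x, p_m=0.3):
    """Corrupt features with a Bernoulli mask; masked entries are replaced
    by values drawn from each column's empirical marginal distribution."""
    mask = torch.bernoulli(torch.full_like(x, p_m))
    shuffle_idx = torch.argsort(torch.rand_like(x), dim=0)  # per-column shuffle
    x_marginal = torch.gather(x, 0, shuffle_idx)
    return (1 - mask) * x + mask * x_marginal, mask

d = 30                                       # features after pre-processing
encoder = nn.Sequential(nn.Linear(d, d), nn.ReLU())
mask_head = nn.Sequential(nn.Linear(d, d), nn.Sigmoid())   # mask estimation
feature_head = nn.Linear(d, d)                             # feature estimation

x = torch.randn(256, d)                      # stand-in for a data batch
x_tilde, mask = corrupt(x)
z = encoder(x_tilde)
loss = nn.BCELoss()(mask_head(z), mask) + nn.MSELoss()(feature_head(z), x)
loss.backward()  # after training, the encoder is reused for the downstream task
```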
Background
TabNet
[Figure: TabNet encoder architecture]
- Interpretable DL model
- Attention masks enable instance-wise feature selection
- Built from feature transformers and attentive transformers
- Additive output from all decision steps
- Global feature importance from aggregated attention masks
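As a sketch of how the aggregated masks translate into feature importances in practice, the snippet below uses the pytorch-tabnet package; the library choice, synthetic data and n_steps value are assumptions, since the talk does not name an implementation.

```python
# Global and per-step feature importance from TabNet's attention masks
# (sketch using pytorch-tabnet; data and settings are illustrative).
import numpy as np
from pytorch_tabnet.tab_model import TabNetRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 30)).astype(np.float32)
y_train = rng.normal(size=(2000, 1)).astype(np.float32)
X_val = rng.normal(size=(500, 30)).astype(np.float32)
y_val = rng.normal(size=(500, 1)).astype(np.float32)

model = TabNetRegressor(n_steps=3)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], max_epochs=5)

# Global importance: attention masks aggregated over all decision steps.
importances = model.feature_importances_      # shape: (30,)

# Local view: per-step masks show which features each row attended to.
explain_matrix, masks = model.explain(X_val)  # masks: {step: (rows, features)}
```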
Background
TabNet
[Figure: TabNet self-supervised pre-training: masked features are encoded and then reconstructed by a decoder of fully connected layers]
Method.
Method
Data
Market   Rows        Columns   % Missing Values
1          637,521     187           8.4
2          602,923     237          13.6
3        1,034,630      83           6.8
Method
Data
Pre-processing steps (a pipeline sketch follows below):
1. Raw data is converted into the correct datatypes, and remaining missing values are imputed using the mean
2. Feature scaling and one-hot encoding
3. Recursive feature elimination => the 30 most salient features
4. Data is split into train, validation and test sets
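A minimal sketch of these four steps with scikit-learn; the synthetic columns, the estimator driving RFE and the split ratios are illustrative assumptions, and the talk selects 30 features rather than the 2 used here.

```python
# Pre-processing sketch: impute -> scale/one-hot -> RFE -> split.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({                        # stand-in for raw market data
    "order_amount": np.where(rng.random(n) < 0.1, np.nan,
                             rng.exponential(100, n)),
    "channel": rng.choice(["web", "app", "store"], n),
    "profit": rng.normal(2, 5, n),         # matured profit, the target
})
X, y = df.drop(columns="profit"), df["profit"]

prep = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),   # step 1
                      ("scale", StandardScaler())]),                # step 2
     ["order_amount"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["channel"]),   # step 2
])
pipeline = Pipeline([
    ("prep", prep),
    ("rfe", RFE(GradientBoostingRegressor(), n_features_to_select=2)),  # step 3
])

# Step 4: hold out validation and test sets, then fit on train only.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2,
                                                random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp,
                                                  test_size=0.25,
                                                  random_state=0)
pipeline.fit(X_train, y_train)
```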
Method
Experiments
1. Evaluate the performance of the DL models and the XGBoost model for each market, with fully supervised training
2. Generate SHAP plots for both DL models, and attention maps for TabNet
3. Evaluate gains from pre-training/fine-tuning on pairs of markets (see the sketch after this list)
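For experiment 3, here is a minimal sketch of the pre-train/fine-tune flow using pytorch-tabnet's self-supervised pretrainer; the library, data shapes and epoch counts are assumptions, not the thesis's setup.

```python
# Cross-market pre-training sketch: self-supervised on market A,
# supervised fine-tuning on a small labelled set from market B.
import numpy as np
from pytorch_tabnet.pretraining import TabNetPretrainer
from pytorch_tabnet.tab_model import TabNetRegressor

rng = np.random.default_rng(0)
X_market_a = rng.normal(size=(5000, 30)).astype(np.float32)  # unlabeled source
X_market_b = rng.normal(size=(1000, 30)).astype(np.float32)  # few-label target
y_market_b = rng.normal(size=(1000, 1)).astype(np.float32)

# Pre-training: randomly mask features and learn to reconstruct them.
pretrainer = TabNetPretrainer()
pretrainer.fit(X_market_a, pretraining_ratio=0.5, max_epochs=20)

# Fine-tuning: warm-start the regressor from the pre-trained encoder.
model = TabNetRegressor()
model.fit(X_market_b, y_market_b, from_unsupervised=pretrainer, max_epochs=50)
```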
Method
Hyperparameter tuning
As with all ML research, hyperparameters have a large influence on the results.
Hyperopt, a Bayesian hyperparameter optimization framework, was used for efficient tuning (sketched below).
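A minimal sketch of a Hyperopt search; the search space and the stand-in objective are illustrative assumptions (in the thesis the objective would train a model and return its validation error).

```python
# Bayesian hyperparameter search with Hyperopt's TPE algorithm (sketch).
from hyperopt import Trials, fmin, hp, tpe

space = {
    "learning_rate": hp.loguniform("learning_rate", -7, 0),
    "n_steps": hp.choice("n_steps", [3, 4, 5]),
}

def objective(params):
    # In practice: train a model with `params` and return validation RMSE.
    # This stand-in just scores the parameters directly.
    return params["learning_rate"] + 0.01 * params["n_steps"]

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)  # best settings found (hp.choice values are reported as indices)
```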
Method
Metrics
Apart from RMSE and MAE, a custom evaluation metric, the portfolio error, was used (sketched below).
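The formula itself was shown as an image; below is a sketch reconstructed from the speaker notes: accept every transaction with predicted profit above 0, compare the profit of that strategy against the maximum achievable profit (the sum of all positive profits), and average the shortfall. The exact normalization is an assumption.

```python
# Portfolio-error sketch: average profit shortfall of the "accept if
# predicted profit > 0" strategy versus an oracle that accepts exactly
# the truly profitable transactions.
import numpy as np

def portfolio_error(y_true, y_pred):
    strategy_profit = y_true[y_pred > 0].sum()  # profit if we follow the model
    max_profit = y_true[y_true > 0].sum()       # oracle upper bound
    return (max_profit - strategy_profit) / len(y_true)

y_true = np.array([3.0, -2.0, 1.5, -0.5])
y_pred = np.array([2.5, 0.7, -0.2, -1.0])       # one bad accept, one miss
print(portfolio_error(y_true, y_pred))          # (4.5 - 1.0) / 4 = 0.875
```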
Results.
Results
Supervised performance
[Table: RMSE, MAE and portfolio error per market; the DL models generally win on RMSE and MAE, while XGBoost wins on the portfolio error]
Results
Training and inference times per row of data:

Model     Training time (μs)   Inference time (μs)
XGBoost         31.3                  0.18
TabNet          56.7                 12
VIME            47.8                 11
Results
Pre-training results - Market 1
[Figure: training loss with and without pre-training on another market; the loss reaches a minimum much faster with pre-training]
Results
Pre-training results - Market 1
[Figure: RMSE vs. number of labels, with and without self-supervised pre-training; pre-training helps most when few labels are available, with diminishing gains]
Results
Interpretability - Market 1
[Figure: SHAP summary plots for TabNet (left) and VIME (right); the top features largely agree between the two models]
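A minimal sketch of producing a SHAP summary plot like the ones shown; the model-agnostic KernelExplainer is an assumption here, chosen because it works for both the DL models and XGBoost (TreeExplainer would be faster for the latter), and the data is synthetic.

```python
# SHAP summary plot sketch: red/blue dots show high/low feature values
# pushing predictions up or down.
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3 * X[:, 0] + rng.normal(size=500)       # feature 0 dominates by design

model = xgb.XGBRegressor(n_estimators=50).fit(X, y)

background = shap.sample(X, 100)             # background set for the explainer
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X[:50])  # explain 50 rows

shap.summary_plot(shap_values, X[:50])       # the plot read off the slides
```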
Results
Interpretability - Market 1
[Figure: TabNet attention masks showing which features each decision step focuses on, per row of data]
Results
Interpretability - Market 1
[Figure: global feature importance from attention masks aggregated across all steps; the ranking matches the SHAP plots]
Conclusions & Discussion.
Conclusions & Discussion
Research questions
Q: Can deep learning models outperform the currently used model?
A: Yes, at least in terms of RMSE
Q: Can the models’ decisions be understood by a human?
A: Yes, through SHAP and attention maps
Q: Can pre-training across markets give a boost in performance?
A: Yes, pre-training gives a statistically significant (α = 0.05) boost in performance
Conclusions & Discussion
Discussion
Should we replace gradient boosted trees with VIME or TabNet?
- The DL models show promise: better RMSE, interpretable, and not too slow
- For this problem, lower RMSE ≠ better profitability
- DL requires GPUs to train efficiently => more cost
- It is a question of whether the decrease in RMSE can justify the added costs
Questions?

Editor's Notes

  • #2: Thank you all for coming and listening to my thesis presentation. My name is Samuel. I am a final-year student in Engineering Physics at KTH, and I have done my master's thesis on the topic of testing deep learning models on tabular data, in particular applying them to transactional underwriting.
  • #3: First I just want to give an overview of what I will be speaking about. I'll give an introduction to the problem and research questions, then go over the problem space and some background on relevant topics, and explain the models I used in depth. Then I'll get into the bulk of the thesis and talk about the experiments I did, the results, and finally the conclusions we can draw from them.
  • #5: My thesis has focused on modelling the transactional profitability of retail purchases made through buy now pay later. Buy now pay later, as the name suggests, simply means that a customer can buy a product without paying at purchase time. In essence, the company providing the buy-now-pay-later option is lending money to the customer and acting as a middleman between the merchant and the customer to allow the customer more time to pay. To make this a profitable business, buy-now-pay-later companies have to ensure that, to a large extent, they only accept customers who will bring profit and reject the rest.
  • #6: The core problem I have been working on is figuring out which transactions should be granted credit. By looking at this from a transactional perspective instead of a customer perspective, we can make more granular decisions based on the potential profitability of each transaction. The key metric used here is the profit of a transaction. [NEW SLIDE] Deciding who should be granted credit can be done in essentially two ways: one, by deciding on a profit threshold and classifying whether a transaction will yield more or less than this threshold, or two, by predicting the profit of the transaction and setting the threshold post hoc. In the team, the second option is the one that is used.
  • #7: Essentially, this becomes a regression problem where we want to predict the profit given what we know about the transaction, customer and merchant at purchase time.
  • #8: The company currently uses gradient boosted tree models for this problem, specifically XGBoost and CatBoost, and wanted to test whether deep learning could give better performance. This led me to the following research questions: Can the DL models outperform the currently used model? Here, by outperform, I mean performance measured using three different metrics that I will cover later. A very important aspect of using these complex models in an industry setting is explainability: can the models' decisions be understood by a human? Finally, can pre-training across markets give a boost in performance?
  • #10: Before diving into a project like this one, I think it is important to ask why we would want to use DL on tabular data in the first place. I have collected a few reasons here. Firstly, over the last few years DL has proved to give incredible results in the computer vision and NLP domains; motivated by this, similar performance gains might be possible if DL is applied to tabular data. Secondly, a nice thing about DL models is that they can work with data from different modalities at the same time, so for example if we have both image and tabular data we can use both to train a DL model, which would not be possible using, for example, XGBoost. Finally, I think that applying DL models to tabular data is interesting in itself because it gives a better understanding of what is possible with DL.
  • #11: In images and text there is a very strong correlation structure, for example between pixels in an image or words in a sentence. In tabular data the correlation among features is usually weaker, and the dependencies between features that do exist are usually rather complex and irregular, so the assumptions underlying, for example, CNNs or RNNs cannot be used. Tabular data also often contains categorical features that require some form of encoding, which adds complexity. Finally, tabular data, especially from industry applications, often has a lot of missing values, outliers and generally inconsistent data, which means more challenges for the models we use.
  • #12: Nevertheless, there has been quite a bit of research into DL applied to tabular data, and several successful models. Broadly speaking, these models fall into three categories. Encoding-based methods aim to produce encodings of the data so that existing architectures can be used; this can vary from simple one-hot encoding of categoricals followed by a simple multilayer perceptron, to turning the tabular data into an image and using a convolutional NN on it. One of the models tested in this thesis is from this category. The second category is based on the very successful self-attention mechanism from natural language processing. One nice feature of attention is that it can be used as a sort of feature importance and give increased transparency of the model, something I will talk more about later. The final category combines classical ML methods with DL to reap the benefits of both.
  • #13: Two models were selected for the thesis, based on their benchmark performance and how complicated they were to implement: VIME, which is an encoding-based model, and TabNet, which is attention-based.
  • #14: Now I will move on to talk a bit about the models I used; note that I won't go into detail here since we are limited on time. First out is a model called VIME, which introduces a way to do self-supervised learning on tabular data through two pretext tasks. These tasks are designed so that the network learns a useful representation of the data without access to labels. Essentially, the self-supervised training works by producing a binary mask that is used to corrupt the feature matrix; an encoder is then trained to recover both the corrupted feature values and the mask used to corrupt them. The architecture is fairly simple, consisting only of MLPs. After training the encoder, it can be used in a semi-supervised setting where any available labels can be used.
  • #15: TabNet is an attention-based DL model tailored for tabular data, consisting of multiple steps of sequential attention. The attention allows instance-wise feature selection through so-called attention masks, which let the model focus on a subset of features. The basic building blocks of the model are the attentive transformer, which produces the mask, and the feature transformer, which creates an encoding of the data. The output from each step in the sequence is aggregated to produce the final prediction and attention masks. It achieves impressive results on benchmark datasets, outperforming XGBoost.
  • #16: TabNet supports pre-training on unlabeled data: features are masked by a binary mask sampled from a Bernoulli distribution. The masked features are encoded using the same encoder scheme shown on the previous slide and finally passed through a decoder, consisting of FC layers, that tries to reconstruct the masked features. The network is trained using a reconstruction loss that measures the discrepancy between the reconstructed and original features. Once the encoder has been trained, it can be used in a supervised setting with any available labels.
  • #17: Now I want to talk a bit about the data I used, what experiments I did and what metrics I used to evaluate performance.
  • #18: I used data from three markets. There are plenty of rows, which is good since DL models often need a lot of data to converge, and many features, containing both internal and external variables (example on the next slide). Some features have quite a lot of missing values, especially features from external bureaus; columns with more than 50% missing values were dropped entirely.
  • #19: Raw data is converted into the correct datatypes and remaining missing values are imputed; feature scaling and one-hot encoding are applied; recursive feature elimination selects the 30 most salient features; and the data is split into train, validation and test sets.
  • #20: With the research questions in mind, 3 experiments were conducted.
  • #21: Hyperopt, which is a Bayesian hyperparameter optimization framework.
  • #22: This metric aims to evaluate the potential profitability over an entire dataset if the model's predictions were used. It works as follows: assume that we accept all transactions with a predicted profitability above 0; compute the total profit we would get following this strategy; compute the maximum profit possible under the strategy, that is, the sum of all the positive profits; and compute the average distance between the two.
  • #23: I have only included results
  • #24: Bold numbers indicate the best performance for the given market. Analysis: both DL models generally perform better than XGBoost in terms of MAE and RMSE, while XGBoost is the winner on the portfolio error. Note that there is some bias in this result, as it comes from predictions on only one test set per market, so we cannot really say that these results are statistically significant.
  • #25: Analysis: all times are averaged per row of data. Both DL models are significantly slower than XGBoost, especially in terms of inference time; however, they are still quick enough to be used in a production environment. Note the difficulty in measuring these times: training times can vary a lot depending on factors such as initialization and hardware, but I did set a random seed and averaged over all the market datasets to get a more accurate result. Moreover, the first epoch of training is always slower than the rest since the data has to be loaded into memory.
  • #26: Pre-training results: I have only included results from one market, using TabNet, due to lack of time; the full results are in my report. What do I mean by pre-training? Basically, I selected two of the three markets, used one market for self-supervised training, i.e. without the labels, and then used a limited number of labels from the other market. Analysis: in this figure we can see the loss minimization process when training TabNet on 1000 labels from market 1, with and without pre-training on market 2. Note how the loss reaches a minimum much more quickly with pre-training.
  • #27: Here I have plotted the RMSE for varying numbers of labels, with and without self-supervised pre-training on the two other markets. Analysis: when few labels are available, pre-training gives a significant boost in performance, with diminishing gains the more labels we have access to. The gains differ a bit depending on the market used for pre-training/fine-tuning, which could be due to some markets being more similar than others, but more investigation would be needed to confirm this.
  • #28: Here are SHAP plots for VIME and TabNet on market 1, for those of you who are not familiar with how to read these plots. Analysis: each dot is one row of data; red indicates a high feature value, blue a low feature value, and dots far to one side have a strong influence on the prediction in the positive or negative direction. For example, a high order amount seems to give a greater predicted profit. Some conclusions: the top features are similar for both TabNet and VIME, which gives us more confidence that these are truly important features. A benefit of SHAP is that we can see in which direction the features push, which would be useful if a customer wanted to know why they were not granted credit.
  • #29: The next interpretability approach is to look at the masks, or attention maps, produced by TabNet. I have put the feature names on the side so you can see them more clearly. Attention maps show which features are focused on, for each row of data, in the different steps of TabNet. This gives a very local view of the feature importance; however, these plots can be a bit hard to read, as we are more often interested in the global feature importance. Luckily, we can aggregate these masks to get a better understanding of which features are important.
  • #30: Analysis: by aggregating the attention maps across all the steps, one gets a feature importance ranking which is easier to interpret. We can see that the results match the SHAP plots. However, a downside is that we cannot deduce in which direction the features push the predictions, as we can with SHAP.
  • #32: 1. The DL models outperform XGBoost in RMSE, but not in portfolio error. 2. I would argue that the models are as interpretable as gradient boosted machines; yes, they are more complex and difficult to understand, but so are GBTs. 3. This result indicates that when few labels are available in one market, other markets can be used to pre-train the model and give it a head start in training.
  • #33: The DL models look promising: they outperform XGBoost in terms of RMSE, they are interpretable, and they are not far too slow. However, for this specific problem, a better RMSE does not necessarily mean better profitability: after prediction we set a threshold on the predictions, classifying them as accept or reject, so being very close to the actual value will not matter in most cases. Moreover, training DL models often requires a GPU to be done efficiently, and GPUs cost quite a bit more to train on. Essentially, it becomes a question of whether the slightly better performance can justify the added cost.
  • #34: If I had more time on this project, I would look further into two things: approaching the problem as a classification problem and comparing the performance to the current setup; and, as noted before, investigating why pre-training on some markets gave better performance. Perhaps those markets are more similar?