Introduction Problem formulation Learning strategy Experiments Conclusion
Credit Card Fraud Detection and
Concept-Drift Adaptation with Delayed
Supervised Information
Andrea Dal Pozzolo, Giacomo Boracchi, Olivier Caelen,
Cesare Alippi, and Gianluca Bontempi
15/07/2015
IEEE IJCNN 2015 conference
INTRODUCTION
Fraud detection is a notably challenging problem because of:
- concept drift (customers' habits evolve);
- class unbalance (genuine transactions far outnumber frauds);
- uncertain class labels (some frauds are not reported, or are reported with a large delay, and only a few transactions can be investigated in time).
INTRODUCTION II
Fraud-detection systems (FDSs) differ from a standard classification task:
- only a small set of supervised samples is provided by human investigators (they can check only a few alerts);
- the labels of the majority of transactions become available only several days later (after customers have reported unauthorized transactions).
PROBLEM FORMULATION
We formalise fraud detection as a classification problem:
- At day t, the classifier K_{t-1} (trained on day t - 1) associates to each feature vector x ∈ R^n a fraud score P_{K_{t-1}}(+|x).
- The k transactions with the largest P_{K_{t-1}}(+|x) define the alerts A_t reported to the investigators.
- Investigators provide feedbacks F_t about the alerts in A_t, defining a set of k supervised couples (x, y):

    F_t = {(x, y) : x ∈ A_t}.   (1)

F_t are the only immediately available supervised samples.
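As a concrete illustration of the alert mechanism (not the authors' code), building A_t from the classifier scores is a simple top-k selection; the function and variable names below are hypothetical:

```python
import numpy as np

def select_alerts(scores, k=100):
    """Return the indices of the k transactions with the largest
    fraud score P(+|x), i.e. the alert set A_t reported to the
    investigators. `scores` is a 1-D array of posteriors."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    return order[:k]

scores = np.array([0.1, 0.9, 0.4, 0.8, 0.2])
print(select_alerts(scores, k=2))  # the two highest-scoring transactions
```

The investigators' labels on exactly these k transactions then form the feedback set F_t of Eq. (1).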
PROBLEM FORMULATION II
At day t, the delayed supervised couples D_{t-δ} are transactions that have not been checked by investigators, but whose labels are assumed to be correct once δ days have elapsed.
Figure: Timeline from day t - δ to day t distinguishing feedbacks from delayed samples. The supervised samples available at day t include: i) the feedbacks of the first δ days and ii) the delayed couples that occurred before the δth day.
F_t is a small set of risky transactions according to the FDS. D_{t-δ} contains all the transactions that occurred in a day (≈ 99% genuine transactions).
Figure: Every day we obtain a new set of feedbacks (F_t, F_{t-1}, ..., F_{t-(δ-1)}) from the first δ days and a new set of delayed transactions that occurred on the δth day (D_{t-δ}). In this figure we assume δ = 7.
ACCURACY MEASURE FOR A FDS
The goal of an FDS is to return accurate alerts, i.e. the highest precision in A_t. This precision can be measured by the quantity

    p_k(t) = #{(x, y) ∈ F_t s.t. y = +} / k,   (2)

where p_k(t) is the proportion of frauds among the top k transactions with the highest fraud likelihood [1].
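A minimal sketch of how p_k(t) could be computed from the day's feedback set, assuming labels are encoded as '+' / '-' (the function name and encoding are illustrative, not from the paper):

```python
def pk(feedbacks, k):
    """Alert precision p_k(t) of Eq. (2): fraction of frauds among
    the k alerted transactions. `feedbacks` is the set F_t of (x, y)
    couples, where y == '+' marks a confirmed fraud."""
    frauds = sum(1 for _, y in feedbacks if y == '+')
    return frauds / k

# toy feedback set: 3 confirmed frauds out of k = 5 alerts
F_t = [(None, '+'), (None, '-'), (None, '+'), (None, '-'), (None, '+')]
print(pk(F_t, k=5))  # 0.6
```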
LEARNING STRATEGY
Learning from the feedbacks F_t is a different problem from learning from the delayed samples in D_{t-δ}:
- F_t provides recent, up-to-date information, while D_{t-δ} may already be obsolete by the time it arrives.
- The percentage of frauds in F_t and D_{t-δ} is different.
- The supervised couples in F_t are not independently drawn: they are selected by K_{t-1}.
- A classifier trained on F_t learns how to label the transactions that are most likely to be fraudulent.
Feedbacks and delayed transactions therefore have to be treated separately.
CONCEPT DRIFT ADAPTATION
Two conventional solutions for CD adaptation are the sliding window W_t and the ensemble E_t [6, 5]. To learn separately from feedbacks and delayed transactions we propose F_t, W_t^D and E_t^D.
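The two conventional adaptation schemes can be sketched as follows: the sliding window retrains a single model on the last α days of supervised data, while the ensemble keeps one model per day and averages their posteriors at prediction time. This is an illustrative sketch, with a hypothetical `train` routine standing in for the actual classifier-fitting step:

```python
def sliding_window_model(daily_batches, alpha, train):
    """W_t-style adaptation: one classifier retrained on the samples
    of the most recent `alpha` days, concatenated into one set."""
    window = [s for day in daily_batches[-alpha:] for s in day]
    return train(window)

def ensemble_models(daily_batches, alpha, train):
    """E_t-style adaptation: one classifier per day, keeping the
    `alpha` most recent models; posteriors are averaged later."""
    return [train(day) for day in daily_batches[-alpha:]]
```

The delayed-only variants W_t^D and E_t^D simply restrict `daily_batches` to the delayed sets D_{t-δ}, D_{t-δ-1}, ...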
Figure: Supervised information used by the different classifiers (W_t, W_t^D, E_t, E_t^D, F_t) in the ensemble and sliding-window approaches.
Introduction Problem formulation Learning strategy Experiments Conclusion
CLASSIFIER AGGREGATIONS
W_t^D and E_t^D have to be aggregated with F_t to exploit the information provided by feedbacks. We combine these classifiers by averaging the posterior probabilities.

Sliding window:

    P_{A_t^W}(+|x) = ( P_{F_t}(+|x) + P_{W_t^D}(+|x) ) / 2

Ensemble:

    P_{A_t^E}(+|x) = ( P_{F_t}(+|x) + P_{E_t^D}(+|x) ) / 2

A_t^E and A_t^W give larger influence to the feedbacks in the probability estimates w.r.t. E_t and W_t.
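The aggregation rule is deliberately simple, an unweighted average of the two posteriors; a sketch with an illustrative function name:

```python
def aggregate_posteriors(p_feedback, p_delayed):
    """A_t-style aggregation: average the fraud posterior of the
    feedback classifier F_t and that of the delayed-sample
    classifier (W_t^D or E_t^D), so the few feedback samples
    carry as much weight as all delayed ones."""
    return (p_feedback + p_delayed) / 2

print(aggregate_posteriors(0.8, 0.2))  # 0.5
```

Because the feedback classifier contributes half of the final score, the handful of investigator-checked samples influences the ranking far more than it would inside a single model trained on all supervised data.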
TWO RANDOM FORESTS
We used two different Random Forest (RF) classifiers depending on the fraud prevalence in the training set:
- for the classifiers on delayed samples we used a Balanced RF [3] (undersampling before training each tree);
- for F_t we adopted a standard RF [2] (no undersampling).
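A Balanced RF draws, for each tree, a bootstrap of the minority (fraud) class plus an equally sized undersample of the majority (genuine) class, as in [3]. A rough sketch of that per-tree sampling step, with illustrative names:

```python
import random

def balanced_bootstrap(frauds, genuines, rng=random):
    """Training set for one tree of a Balanced RF sketch:
    bootstrap the frauds, then undersample an equal number
    of genuine transactions without replacement."""
    boot_frauds = [rng.choice(frauds) for _ in frauds]
    boot_genuine = rng.sample(genuines, len(frauds))
    return boot_frauds + boot_genuine
```

Each tree thus sees a roughly 50/50 class mix even though frauds are ~0.2% of the stream; the feedback classifier skips this step because F_t is already far less unbalanced.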
DATASETS
We considered two datasets of credit card transactions:
Table: Datasets
Id    Start day   End day     # Days  # Instances  # Features  % Fraud
2013  2013-09-05  2014-01-18  136     21,830,330   51          0.19%
2014  2014-08-05  2014-10-09  44      7,619,452    51          0.22%

In the 2013 dataset there is an average of 160k transactions per day and about 304 frauds per day, while in the 2014 dataset there is a daily average of 173k transactions and 380 frauds.
EXPERIMENTS
Settings:
- We assume that after δ = 7 days all transaction labels are provided (delayed supervised information).
- A budget of k = 100 alerts can be checked by the investigators (F_t is trained on a window of 700 feedbacks).
- A window of α = 16 days is used to train W_t^D (16 models in E_t^D).
Each experiment is repeated 10 times and performance is assessed using p_k.
In both the 2013 and 2014 datasets, the aggregations A_t^W and A_t^E outperform the other FDSs in terms of p_k.

Table: Average p_k over all batches for the sliding window
            Dataset 2013     Dataset 2014
classifier  mean    sd       mean    sd
F           0.609   0.250    0.596   0.249
WD          0.540   0.227    0.549   0.253
W           0.563   0.233    0.559   0.256
AW          0.697   0.212    0.657   0.236

Table: Average p_k over all batches for the ensemble
            Dataset 2013     Dataset 2014
classifier  mean    sd       mean    sd
F           0.603   0.258    0.596   0.271
ED          0.459   0.237    0.443   0.242
E           0.555   0.239    0.516   0.252
AE          0.683   0.220    0.634   0.239
Figure: Sum of ranks from the Friedman test [4] for (a) sliding window 2013, (b) sliding window 2014, (c) ensemble 2013, and (d) ensemble 2014. Classifiers having the same letter are not significantly different (paired t-test based upon the ranks).
EXPERIMENTS ON ARTIFICIAL DATASET WITH CD
In the second part we artificially introduce CD on specific days by juxtaposing transactions acquired at different times of the year.
Table : Datasets with Artificially Introduced CD
Id Start 2013 End 2013 Start 2014 End 2014
CD1 2013-09-05 2013-09-30 2014-08-05 2014-08-31
CD2 2013-10-01 2013-10-31 2014-09-01 2014-09-30
CD3 2013-11-01 2013-11-30 2014-08-05 2014-08-31
Table : Average pk in the month before and after CD for the sliding
window approach
(a) Before CD
CD1 CD2 CD3
classifier mean sd mean sd mean sd
F 0.411 0.142 0.754 0.270 0.690 0.252
WD 0.291 0.129 0.757 0.265 0.622 0.228
W 0.332 0.215 0.758 0.261 0.640 0.227
AW 0.598 0.192 0.788 0.261 0.768 0.221
(b) After CD
CD1 CD2 CD3
classifier mean sd mean sd mean sd
F 0.635 0.279 0.511 0.224 0.599 0.271
WD 0.536 0.335 0.374 0.218 0.515 0.331
W 0.570 0.309 0.391 0.213 0.546 0.319
AW 0.714 0.250 0.594 0.210 0.675 0.244
Figure: Average p_k per day (the higher the better) for (e) sliding-window strategies (W, AW) on dataset CD1, (f) sliding-window strategies on dataset CD2, (g) sliding-window strategies on dataset CD3, and (h) ensemble strategies (E, AE) on dataset CD3, smoothed using a 15-day moving average. The vertical bar denotes the date of the concept drift.
CONCLUDING REMARKS
We notice that:
- F_t outperforms classifiers trained on delayed samples (obsolete couples).
- F_t outperforms classifiers trained on the entire supervised dataset (dominated by delayed samples).
- Aggregation gives larger influence to the feedbacks.
CONCLUSION
- We formalise a real-world FDS framework that meets realistic working conditions.
- In a real-world scenario there is a strong alert-feedback interaction that has to be explicitly considered.
- Feedbacks and delayed samples should be handled separately when training an FDS.
- Aggregating two distinct classifiers is an effective strategy and enables prompter adaptation in concept-drifting environments.
FUTURE WORK
Future work will focus on:
- adaptive aggregation of F_t and the classifier trained on delayed samples;
- studying the sample selection bias in F_t introduced by the alert-feedback interaction.
BIBLIOGRAPHY
[1] S. Bhattacharyya, S. Jha, K. Tharakunnel, and J. C. Westland.
Data mining for credit card fraud: A comparative study.
Decision Support Systems, 50(3):602–613, 2011.
[2] L. Breiman.
Random forests.
Machine Learning, 45(1):5–32, 2001.
[3] C. Chen, A. Liaw, and L. Breiman.
Using random forest to learn imbalanced data.
University of California, Berkeley, 2004.
[4] M. Friedman.
The use of ranks to avoid the assumption of normality implicit in the analysis of variance.
Journal of the American Statistical Association, 32(200):675–701, 1937.
[5] J. Gao, B. Ding, W. Fan, J. Han, and P. S. Yu.
Classifying data streams with skewed class distributions and concept drifts.
Internet Computing, 12(6):37–49, 2008.
[6] D. K. Tasoulis, N. M. Adams, and D. J. Hand.
Unsupervised clustering in streaming data.
In ICDM Workshops, pages 638–642, 2006.
