Detection of fraud in financial blockchain-based
transactions through big data analytics
Jessica Páez Bonilla
Director: Jose Maria Álvarez Rodríguez
Universidad Carlos III de Madrid
Master in Big Data Analytics
2017-2018
July 11, 2018
Jessica Páez Bonilla (UC3M) Master Thesis July 11, 2018 1 / 27
Overview
1 Introduction
2 Project Objectives
3 System Design
4 Implementation
5 Experiment
6 Project Budget and Plan
7 Legal Framework and socio-economic environment
8 Conclusions and Future works
Introduction
Using analytical techniques (data gathering, preprocessing, and model
building), it could be possible to detect and prevent financial fraud.
The aim is to describe complex fraud in terms of patterns suitable for
system-driven detection and analysis.
Network analysis can provide useful insight into large datasets based
on the interconnectedness of the agents in the network being
analyzed.
Introduction
Network: shows the relationships among blockchain users and the flow of
money. It enables the discovery of fraud patterns.
Network graph analysis offers a method for capturing the context
of fraud in a standard, machine-readable and transferable format.
Associations learned from visually observing fraudulent transactions
could be used as knowledge input to create analytical models.
Project Objectives
1 Research techniques used for fraud detection and explore blockchain
data.
2 Design a system that could take into account the patterns
surrounding the fraudulent transactions.
3 Implement a system using big data analytics tools such as R and Python.
4 Experiment and validate the designed system.
System Overview
System Design - Network Metrics
Metric        Interpretation
Degree        Influence on the network
Closeness     How quickly the node can reach other nodes in the network
Betweenness   Node location: is it on the shortest paths between other nodes?
Density       Level of linkage among the nodes
Modularity    How modular the network is
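As a rough illustration of the metrics above, the following sketch computes degree centrality, closeness centrality and density on a toy graph using only the Python standard library. The edge list and all function names are hypothetical and are not taken from the thesis; betweenness and modularity are left out to keep the sketch short.

```python
from collections import deque

# Hypothetical toy transaction network: each pair is an undirected link
# between two addresses (illustrative names, not thesis data).
EDGES = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e")]


def build_adjacency(edges):
    """Return {node: set(neighbours)} for an undirected edge list."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def degree_centrality(adjacency):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(adjacency)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: hop distance from source to each reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


def closeness_centrality(adjacency):
    """(n - 1) divided by the sum of distances; assumes a connected graph."""
    n = len(adjacency)
    result = {}
    for node in adjacency:
        total = sum(shortest_path_lengths(adjacency, node).values())
        result[node] = (n - 1) / total if total else 0.0
    return result


def density(adjacency):
    """Existing links divided by the maximum possible number of links."""
    n = len(adjacency)
    m = sum(len(nbrs) for nbrs in adjacency.values()) / 2
    return 2 * m / (n * (n - 1))


graph = build_adjacency(EDGES)
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
graph_density = density(graph)
```

In this toy graph, node "a" has both the highest degree and the highest closeness, which matches the "influence" and "quick access" readings in the table.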
Implementation - Technology used
BigQuery, R (igraph) and Python have been used in the development of
this system.
Table 1: Versions of the packages used
Package      Version
matplotlib   1.5.1
pandas       0.19.2
networkx     1.11
community    0.9
numpy        1.11.3
scipy        0.18.1
Experiment - Steps
1 Data Exploration.
2 Network metrics and extraction of communities.
3 Features and ML algorithms selection.
4 Performance Measures.
5 Execution.
6 Analysis of Results.
7 Experiment Limitations.
Experiment - 1. Data Exploration
Bitcoin blockchain data was explored using BigQuery. A data segment
containing fraudulent movements was chosen as the sample for analysis in
this project.
Figure 1: Blocks over time
Figure 2: Transactions in the sample
Experiment - 2. Network Metrics and extraction of communities
Communities
1 Network modeling
2 Clustering
3 Giant Component
Figure 3: Communities extraction
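The giant component step can be sketched as follows: find the connected components of the transaction graph and keep the largest one. This is a minimal standard-library sketch under assumptions; the edge list and function names are hypothetical, and the slides do not show how the thesis performed this step.

```python
from collections import deque

# Hypothetical edge list: two separate groups of addresses.
EDGES = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("x", "y")]


def connected_components(edges):
    """Return a list of node sets, one per connected component."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from start.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def giant_component(edges):
    """Return the nodes of the largest component and the edges inside it."""
    nodes = max(connected_components(edges), key=len)
    kept_edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
    return nodes, kept_edges


giant_nodes, giant_edges = giant_component(EDGES)
```

Here the small isolated pair ("x", "y") is discarded and the four-node group is kept for further analysis.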
Experiment - 3. Features and ML algorithms selection
Figure 4: Selected features
ML Algorithms
1 Decision Tree
    1 White-box model: its decisions can be interpreted.
    2 Performs well on imbalanced datasets.
2 Random Forest
    1 Ensemble: combines the predictions of several base estimators in
      order to improve robustness over a single estimator.
    2 Each tree in the ensemble is built from a sample drawn with
      replacement from the training set.
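The two Random Forest ideas above (sampling with replacement and combining several estimators) can be illustrated with a small sketch. It only shows the sampling and the vote, not tree training; the names and labels are hypothetical, and a majority vote is just one way to combine predictions (some implementations average predicted probabilities instead).

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as bagging does for each tree."""
    return rng.choices(rows, k=len(rows))


def majority_vote(predictions):
    """Combine the base estimators' labels by taking the most common one."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = list(range(10))  # stand-in for ten training observations

# One bootstrap sample per (hypothetical) tree in the ensemble.
samples = [bootstrap_sample(rows, rng) for _ in range(3)]

# Hypothetical labels returned by three trees for one transaction.
ensemble_label = majority_vote(["suspicious", "non-suspicious", "suspicious"])
```

Because rows are drawn with replacement, a sample usually repeats some observations and omits others, so each tree sees a slightly different training set.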
Experiment - 4. Performance Measures
Classification Accuracy
It gives the percentage of correct predictions.
Confusion Matrix
It is a 2x2 matrix that shows the types of errors that the classifier is
making.
AUC - Area Under the (ROC) Curve
It is a single-number summary of classifier performance, useful even when
there is class imbalance.
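As an illustration of these measures, the sketch below derives the confusion matrix counts, the usual ratios built from them, and the AUC for a binary problem. The labels and scores are made up for the example and the function names are hypothetical; the AUC is computed through its rank interpretation (the chance that a random positive is scored above a random negative).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def summary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, specificity and precision from the counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # 1 - false positive rate
        "precision": tp / (tp + fp),
    }


def auc(y_true, scores, positive=1):
    """Share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up data: 1 = suspicious, 0 = non-suspicious.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = summary_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
```

The sketch does not guard against empty classes (for example, no predicted positives), which a real evaluation would need to handle.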
Experiment - 5. Execution
Once the features (transaction network metrics) are obtained, and the ML
algorithms and their performance metrics are defined, two main tasks need
to be run before fitting the models.
Observations Labeling
Analysis of a real fraudulent transaction.
Dataset Balancing
Once the dataset was labeled, there were many more observations of one
class than of the other. An oversampling technique was applied in order to
balance it.
Experiment - 5.1. Analysis of a fraudulent transaction
Figure 5: Fraudster Neighbours
Experiment - 5.2. Dataset Balancing
The dataset used has around 30k observations in the training set and
around 7k in the test set.
The Python package imbalanced-learn was used. It applies oversampling to
the minority class.
Table 2: Proportion of classes
Dataset   Class            Proportion
Train     Suspicious       0.498627
Train     Non-suspicious   0.501373
Test      Suspicious       0.500343
Test      Non-suspicious   0.499657
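A minimal sketch of what oversampling the minority class means, assuming plain random oversampling with replacement. The slides do not say which imbalanced-learn technique was applied, so this standard-library version is only an illustration; the function name and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, rng):
    """Duplicate randomly chosen rows until every class matches the largest."""
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        # Sampling with replacement; draws nothing for the majority class.
        extra = rng.choices(pool, k=target - count)
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Toy data: five non-suspicious observations and one suspicious one.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
labels = ["non-suspicious"] * 5 + ["suspicious"]

balanced_rows, balanced_labels = random_oversample(rows, labels, rng)
class_counts = Counter(balanced_labels)
```

After the call both classes have five observations, which corresponds to the roughly 50/50 proportions reported in Table 2.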
Experiment - 6. Analysis of Results
The metrics obtained for the selected ML algorithms are summarized in the
table below:
Table 3: Classification Metrics Comparison
Model           Class. Accuracy   Sensitivity   AUC
Decision Tree   0.9989            0.9979        0.9994
Random Forest   0.9619            0.9752        0.9974
The selected method was the Random Forest, as it was the one giving more
weight to the different network metrics while still achieving a high
accuracy.
Experiment - 6. Analysis of Results
The weight given to each of the features by the Random Forest is presented
in this bar chart.
Experiment - 6. Analysis of Results
Table 4: Classification Metrics - Random Forest
Metric                    Value
Classification accuracy   0.9619
Classification error      0.0380
Sensitivity               0.9752
Specificity               0.9487
False positive rate       0.0512
Precision                 0.9502
AUC                       0.9974
Experiment - 6. Analysis of Results
ROC Curve obtained:
Experiment - 7. Limitations
Studying more known cases of fraud within the Bitcoin blockchain could
increase the number of known fraudulent transaction patterns.
Having more data would also help to prevent overfitting with
decision trees, as a tree would be less able to fit all of the
training data exactly.
Project Budget
A summary of the project budget is presented in the table.
Cost             Total (€)
Direct Costs     8,827.5
Indirect Costs   882.75
Total Costs      9,710.25
Profit (10%)     971.025
Cost + Profit    10,681.275
VAT (IVA, 21%)   2,243.06
TOTAL + VAT      12,924.343
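The budget figures can be checked with a few lines of arithmetic. In the sketch below, the 10% indirect-cost rate is an assumption inferred from the table (882.75 is 10% of 8,827.5); the profit and VAT rates are the ones stated in the table.

```python
from decimal import Decimal

direct = Decimal("8827.5")

# Assumption: indirect costs are 10% of direct costs (inferred from the table).
indirect = direct * Decimal("0.10")
total_costs = direct + indirect

profit = total_costs * Decimal("0.10")     # 10% profit, as stated
cost_plus_profit = total_costs + profit

vat = cost_plus_profit * Decimal("0.21")   # 21% VAT (IVA), as stated
total_with_vat = cost_plus_profit + vat
```

Decimal is used instead of float so that the intermediate euro amounts are exact; the table values agree with these results up to truncation of the last digits.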
Project Planning
Legal Framework and socio-economic environment
Legal Framework: The Bitcoin blockchain data is available for
exploration with BigQuery, using Google Cloud services. The data is
public and no licensing is required.
Socio-economic environment: Blockchain technology is rapidly
evolving and is expected to be widely used in the finance world in the
coming years.
10% of world GDP is forecast to be stored in blockchains by 2020.
The IoT era also promotes the Fintech revolution.
This creates the challenge of developing and applying different sets of
techniques in order to detect fraud on these new digital platforms.
Conclusions
1 Business: Detecting and flagging suspected fraudulent activity before it
actually takes place could save billions annually in both developed and
developing economies.
2 Technical: The proposed system can flag a suspicious blockchain
transaction with high accuracy, taking into account network metrics
obtained by modeling the giant components of the transactions.
3 Personal: Learning about a growing sector ("Fintech") that combines
finance and technology, as well as how analytic techniques can
be applied to it.
Future works
1 Create a software platform that can access and integrate both
environments, R and Python.
2 This platform could run continuously and raise a flag through
a UI whenever the model classifies a new observation as Suspicious.
3 Knowing more patterns of fraudulent transactions can help to
avoid overfitting in the models.
4 Try other network metrics (such as mean neighbour degree, node
correlation similarity, etc.) as features for the classification model.
Thank you for your attention

