Cancer Diagnostic Prediction with Amazon ML – A Tutorial
By
Kato Mivule, Researcher
June 2015
  
Agenda
•  The Dataset
•  Amazon ML Account setup
•  S3 Services – data storage
•  The ML Model
•  Results
•  Conclusion
•  References
  
•  Characteristics of the Wisconsin Breast Cancer dataset are given in the accompanying figure.
•  The dataset contains 11 attributes: 10 for the observations and 1 for the class label.
•  The goal is to use the observation data to predict whether a future diagnosis from data with
similar traits will be 2 (Benign) or 4 (Malignant).
Cancer Diagnostic Prediction with Amazon ML– The Dataset
  
The Wisconsin Breast Cancer Dataset Characteristics: UCI Machine Learning Repository
•  Download the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine
Learning repository.
•  Online at: [https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)]
  
•  Preprocessing:
•  Add a header row so that each column (attribute) is named.
•  Replace missing values at this stage with the column average or most frequent value.
Amazon ML at this point does not handle missing values in data well.
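The two preprocessing steps above can be sketched in plain Python. This is only an illustration, not the procedure used for the slides: it assumes the raw file has no header row, has 11 columns, and marks missing values with `?`. The column names (`Feature1` to `Feature10`, `Label1`) and the helpers `prepare_rows` and `to_csv_text` are hypothetical, and the sample rows are made up.

```python
import csv
import io
from collections import Counter

# Hypothetical column names: 10 observation columns plus the class label.
HEADER = [f"Feature{i}" for i in range(1, 11)] + ["Label1"]
MISSING = "?"  # assumed missing-value marker in the raw file


def prepare_rows(raw_rows):
    """Prepend a header row and replace missing values with each column's mode."""
    rows = [list(r) for r in raw_rows]
    for col in range(len(HEADER)):
        present = [r[col] for r in rows if r[col] != MISSING]
        most_frequent = Counter(present).most_common(1)[0][0]
        for r in rows:
            if r[col] == MISSING:
                r[col] = most_frequent
    return [HEADER] + rows


def to_csv_text(rows):
    """Serialize rows as CSV text, ready to be written to a .csv file."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Tiny made-up sample (not real records) with one missing value in column 7.
sample = [
    ["1", "5", "1", "1", "1", "2", "1", "3", "1", "1", "2"],
    ["2", "5", "4", "4", "5", "7", "?", "3", "2", "1", "2"],
    ["3", "8", "10", "10", "8", "7", "1", "9", "7", "1", "4"],
]
cleaned = prepare_rows(sample)
print(to_csv_text(cleaned))
```

Using the most frequent value keeps every cell within the values already present in the column; swapping in the column average would be an equally valid reading of the step.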
  
•  Make sure to save your file as a CSV file using the Windows CSV format if you are
using MS Excel.
  
•  Log into your Amazon Web Services (AWS) account.
•  You can use the same credentials if you already buy and sell on Amazon.
•  Online: http://www.aws.amazon.com/
Cancer Diagnostic Prediction with Amazon ML – The Account
  
•  Once logged in, you will notice many services offered by Amazon.
•  Our interest for now is the Storage and Content Delivery - S3 and Analytics -
Machine Learning Services.
•  To start, click on the S3 web service link under the Storage and Content Delivery.
•  S3 web service allows us to upload and store data on the Amazon Cloud.
  
•  Clicking on the S3 link should bring us inside the S3 console.
•  To create a new bucket to store our dataset, click on the Create Bucket tab.
Some entries in the lists are blacked out for security reasons.
Cancer Diagnostic Prediction with Amazon ML – S3 Services
  
•  At this point you can give your data bucket a name.
•  AWS requires you to select a “Region” where your data will be stored.
•  For now we shall go with the default region, “US Standard”.
  
•  On the right side of the S3 panel are the None, Properties, and Transfers tabs.
•  Click on your Bucket-name link on the left to open the data bucket.
  
•  Once inside the data bucket, S3 shows no datasets – the bucket is empty.
•  Next, click on the Upload tab on the top-left corner to upload data.
  
•  You could either drag and drop datasets directly into the bucket or use the “Add
Files” button to upload the old fashioned way.
•  Keep in mind, Amazon ML at this point in time will only support CSV files.
  
•  After a successful upload of data, click on the radio button on the left to highlight the
new dataset and then click on the Properties Tab on the right to learn more about the
dataset.
•  Copy the URL link provided for the dataset on the right of the bucket panel. Amazon ML
will need this S3 link to locate the data.
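The console upload can also be scripted. The sketch below is an assumption-laden alternative to the point-and-click steps, not part of the original walkthrough: it relies on the third-party `boto3` package and configured AWS credentials, the bucket and file names are hypothetical, and it assumes Amazon ML accepts the `s3://bucket/key` form of the link.

```python
def s3_uri(bucket, key):
    """Build the s3://bucket/key form of an object's location."""
    return f"s3://{bucket}/{key.lstrip('/')}"


def upload_csv(local_path, bucket, key):
    """Upload a local CSV to an existing bucket and return its S3 location.

    Needs the third-party boto3 package and configured AWS credentials,
    so it is imported lazily and not called in this sketch.
    """
    import boto3

    boto3.client("s3").upload_file(local_path, bucket, key)
    return s3_uri(bucket, key)


# Hypothetical bucket and object names, for illustration only.
print(s3_uri("my-cancer-data", "breast-cancer.csv"))
```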
  
•  For now, we are done with the S3 web service under the Storage and Content Delivery
section.
•  We return to the main Amazon Web Services panel and choose the Machine Learning
service under the Analytics section.
Cancer Diagnostic Prediction with Amazon ML – ML Model
  
•  At this point, we are now presented with the Amazon Machine Learning panel.
•  Click on the Create new tab on the left of the panel.
•  Select the Datasource and ML model – this allows us to use the cancer data we
uploaded to the S3 services.
  
•  Selecting the Datasource option brings us to the Create datasource panel.
•  The first step is to select where Amazon ML will get the data.
•  Select the S3 radio button and input the link saved from the S3 services after uploading
the cancer diagnosis dataset.
  
•  Amazon ML requests permission to access your dataset in the S3 services section.
•  Select “Yes” to proceed.
  
•  Amazon ML indicates that it successfully accessed and validated your data in the S3
storage service.
•  Click “Continue” to proceed.
  
•  Select the “Yes” radio button since the cancer dataset contains column names.
•  In this preprocessing step, Amazon ML allows for the editing of the data schema to
choose various data types.
  
•  The next step is to select the “Target” – the attribute that will serve as the class label for
classification of the data.
•  Label1 is chosen for this particular cancer dataset; it contains two classes representing
cancer cases diagnosed as 2 for Benign and 4 for Malignant.
•  Amazon ML uses the “Target” attribute to automatically select the ML model type – in
this case, Regression is chosen. Later, Binary Classification will be used.
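One plausible reason Regression is selected automatically is that the labels 2 and 4 read as ordinary numbers. The sketch below recodes them to 0 and 1, a common encoding for a binary target. Whether Amazon ML needs exactly this encoding is an assumption here, and `recode_label` is a hypothetical helper working on made-up rows.

```python
# Assumed mapping from the dataset's class codes to a 0/1 binary target.
BINARY_CODES = {"2": "0", "4": "1"}  # 2 = Benign -> 0, 4 = Malignant -> 1


def recode_label(rows, label_index=-1):
    """Return copies of the data rows with the class label recoded to 0/1."""
    recoded = []
    for row in rows:
        new_row = list(row)
        new_row[label_index] = BINARY_CODES[new_row[label_index]]
        recoded.append(new_row)
    return recoded


# Made-up rows: the last column holds the class label.
rows = [["5", "1", "2"], ["8", "10", "4"]]
print(recode_label(rows))  # [['5', '1', '0'], ['8', '10', '1']]
```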
  
•  Amazon ML allows for the selection of a row identifier attribute to help track which
prediction (in this case, class label 2 or 4) corresponds to which observation.
  
•  Updates and corrections can still be made at this point.
•  Click on the Edit button to review the Input data, Schema, and Target.
•  Regression is chosen as the model type for the Label1 target, but the Target section will be
edited later to choose Binary Classification.
  
•  Training/testing split: By default Amazon ML divides the dataset into two parts, with 70
percent of the data for Training and the remaining 30 percent for Testing.
•  In this case, the 699 breast cancer diagnosis records were divided into 488 for Training
and 211 for Testing.
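A 70/30 hold-out split of this kind can be sketched as follows. The helper `holdout_split` is hypothetical and simply cuts the records in order. A plain 70 percent cut of 699 rows gives 489 and 210, so the 488 and 211 reported above suggest Amazon ML rounds or samples slightly differently; the slides do not say how.

```python
def holdout_split(records, train_fraction=0.7):
    """Split records in order: the first part for training, the rest for testing."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


records = list(range(699))  # stand-in for the 699 data rows
train, test = holdout_split(records)
print(len(train), len(test))
```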
  
•  In the Review Panel, a summary of the ML model is given and adjustments can still be
made at this point to the model settings.
•  Click Finish to proceed once you are satisfied with the settings.
  
•  After execution, the ML model report is returned. The Evaluation status on the right side,
under the Evaluation Summary, should read Completed, in green.
•  Under the ML model report is the Evaluations link, with a dropdown menu of
Summary, Alerts, and Explore performance links to evaluate performance.
•  Amazon ML returns a performance metric value, and the Explore model performance
button gives more visualization of results.
Cancer Diagnostic Prediction with Amazon ML – Results
  
The RMSE: Amazon Machine Learning Developers Guide – Evaluating ML Models
•  Amazon ML returns the root-mean-square error (RMSE) value for Regression models.
  
•  The root-mean-square error (RMSE) value is returned for the Regression model.
•  The smaller the RMSE, the better the performance of the ML model.
•  Amazon ML reports that for this experiment, the Regression model achieved an RMSE
of 0.35, better than the baseline of 0.90.
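RMSE is the square root of the mean squared difference between observed and predicted values. The sketch below computes it and, as a worked example, evaluates a baseline that always predicts the mean of the training labels. That definition of the baseline is an assumption, though with the class counts reported in these slides it lands close to the 0.90 quoted above. The helper `rmse` is hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Class counts reported in the slides (2 = Benign, 4 = Malignant).
train_labels = [2] * 295 + [4] * 193
test_labels = [2] * 163 + [4] * 48

# Assumed baseline: always predict the mean of the training labels.
train_mean = sum(train_labels) / len(train_labels)
baseline = rmse(test_labels, [train_mean] * len(test_labels))
print(round(baseline, 2))  # 0.9
```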
  
•  Amazon ML provides a visual distribution of residuals for the ML Regression model in
the form of a bar chart with the option to change the bin width.
•  Here, Residual = Observed value – Predicted value.
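The residual bar chart amounts to counting residuals per bin. The sketch below shows one way to do that for a chosen bin width. The helper `residual_histogram` is hypothetical, the numbers are made up, and Amazon ML's own binning may differ.

```python
import math
from collections import Counter


def residual_histogram(observed, predicted, bin_width=0.5):
    """Count residuals (observed - predicted) per bin of the given width."""
    counts = Counter()
    for o, p in zip(observed, predicted):
        residual = o - p
        # Each bin is labeled by its lower edge.
        bin_start = math.floor(residual / bin_width) * bin_width
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


# Made-up observed labels and regression outputs, for illustration only.
observed = [2, 2, 4, 4]
predicted = [2.2, 1.9, 3.4, 4.6]
print(residual_histogram(observed, predicted))
```

A narrower bin width spreads the same residuals over more bars, which is the effect of the bin-width control in the chart.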
  
•  Under the ML model report, click on the Evaluations dropdown menu and then click
the Alerts link.
•  A summary of the criteria used to evaluate the ML model is given, showing the cross
validation, number of records for both training and testing, and schema attributes used.
•  488 records were used for Training, while 211 were used for Testing (evaluation data).
  
•  Amazon ML provides the option to learn about the characteristics of the dataset being
used for the ML model.
•  Go back to the Amazon ML dashboard and, in the listed Entities, click on the Entity
Name whose Type is Datasource.
  
•  A frequency distribution is given for the class Label1 attribute in the Training sample
data, showing 295 cases listed as 2 = Benign while 193 as 4 = Malignant.
•  A total of 488 records were used for Training, while 211 records were reserved for
Testing.
  
•  Amazon ML gives basic descriptive statistics for each attribute in the dataset.
•  Click on Preview, on the right of the table, to see a visualization for each attribute.
  
•  The Preview for Feature6 in the dataset gives a visualization of the frequency
distribution and a summary of the basic descriptive statistics for that attribute.
  
The F1 score: Amazon Machine Learning Developers Guide – Evaluating ML Models
•  In this next section, the Target parameters are edited to select Binary Classification.
•  Run the Binary Classification ML model and compare its results with those from the
Regression ML model.
•  A summary of the Binary Classification ML model performance returns an F1 score of
0.94.
•  The F1 score ranges between 0 and 1; a higher F1 score, in this case 0.94,
indicates better performance for the Binary Classification model.
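For reference, the standard definition of the F1 score is the harmonic mean of precision and recall. The symbols TP, FP, and FN (true positives, false positives, false negatives) are introduced here for illustration and do not appear in the slides.

```latex
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]
```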
  
•  Amazon ML provides both the F1 score metric value and visualization for the Binary
Classification ML model.
•  Hovering the cursor over each rectangular box displays the percentage of records correctly
classified and those misclassified.
  
•  To explore the visualization aspect of the model, click on the Explore model
performance button. A confusion matrix is presented.
•  On the horizontal side are the Predicted values, while on the vertical, are the True
values. The F1 score values are presented for each row, including the totals.
  
•  211 records were used as Testing data for the Binary classification ML model.
•  163 records belonged to Label 2 (Benign), the other 48, belonged to Label 4 (Malignant).
•  About 99% of the Benign cases in the Testing data were correctly predicted
as belonging to group 2 (Benign), while only 0.61% of the same
records were misclassified as belonging to group 4 (Malignant).
•  The F1 score for group 2 was 0.98, almost a perfect score – approaching 1.
  
•  About 85% of the Malignant cases in the Testing data were correctly
predicted as belonging to group 4 (Malignant), while 14.58% of the
same records were mistakenly predicted as group 2 (Benign).
•  The F1 score for group 4 was 0.91. The overall F1 score averaged 0.94.
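The per-class and average F1 scores can be reproduced from the confusion matrix. The sketch below infers the record counts from the reported percentages, which is an assumption: 0.61% of 163 Benign records is about 1, and 14.58% of 48 Malignant records is about 7. The helper `f1_score` is hypothetical.

```python
def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)


# Counts inferred from the reported percentages (an assumption):
# 0.61% of 163 Benign records is about 1; 14.58% of 48 Malignant is about 7.
benign_correct, benign_as_malignant = 162, 1
malignant_correct, malignant_as_benign = 41, 7

# Treat each class in turn as the "positive" class.
f1_benign = f1_score(benign_correct, malignant_as_benign, benign_as_malignant)
f1_malignant = f1_score(malignant_correct, benign_as_malignant, malignant_as_benign)
average = (f1_benign + f1_malignant) / 2

print(round(f1_benign, 2), round(f1_malignant, 2), round(average, 2))
# 0.98 0.91 0.94
```

With these inferred counts the three values round to the 0.98, 0.91, and 0.94 shown in the results.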
  
Conclusion
•  Amazon ML is intuitive and could assist the data scientist to focus on knowledge
discovery while leaving issues to do with hardware and other computational resources to
the engineers at Amazon cloud services.
•  The potential for Amazon ML applications in Health Data Science is enormous.
•  ML algorithms are still constrained to the choices provided by Amazon ML, namely Binary
classification, Multi-class classification, and Regression models. Including other ML algorithms in the
future would provide more choice for comparative studies.
•  Data preprocessing is still a pain – one has to strictly follow Amazon ML guidelines.
Currently Amazon ML only accepts the CSV file format; automating this
process would be ideal.
Cancer Diagnostic Prediction with Amazon ML – Conclusion
  
References
•  Amazon ML, Online: [www.aws.amazon.com]
•  Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning repository. Online at:
[https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)]
•  K. P. Bennett & O. L. Mangasarian: "Robust linear programming discrimination of two linearly inseparable sets",
Optimization Methods and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).
•  Evaluating ML Models - Amazon Machine Learning Developer Guide, Available Online:
[http://docs.aws.amazon.com/machine-learning/latest/dg/evaluating_models.html]
•  Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of
California, School of Information and Computer Science.
Cancer Diagnostic Prediction with Amazon ML – References
  
Thanks
Questions?
Contact
Kato Mivule @ [kmivule/gmail/com]
Cancer Diagnostic Prediction with Amazon ML – Questions
44	
  

More Related Content

DOCX
Data management Final Project
PPTX
softwares in public health
PPT
Реальные углы обзора видеорегистраторов
DOCX
Oumh1103 bab 4
DOC
Mechanical engineering
PDF
A Comparative Analysis of Data Privacy and Utility Parameter Adjustment, Usin...
PPT
Burton Industries ppt 2012
Data management Final Project
softwares in public health
Реальные углы обзора видеорегистраторов
Oumh1103 bab 4
Mechanical engineering
A Comparative Analysis of Data Privacy and Utility Parameter Adjustment, Usin...
Burton Industries ppt 2012

Viewers also liked (20)

DOC
Ttss consulting(1)
PDF
Lesson 7 world_history_medieval_period_new_
PDF
PROFESSIONAL LEARNING NETWORKS- MASS CUE 2013
DOCX
OUMH1103: TOPIK 3: READING FOR INFORMATION
PDF
Book Design by Jason Gonzales
PPT
Baker Business Bootcamp
PPT
Wonju Medical Industry Techno Valley Introduction
PDF
June 2013 IRMAC slides
PDF
Thrust and lube - Startupfest 2012
PDF
17.mengadministrasi server dalam_jaringan
PPT
About P&T
PPT
4 Seasons Virtual Field Trip
PDF
Kato Mivule - Utilizing Noise Addition for Data Privacy, an Overview
PPTX
Iltabloidmotori
PPT
Wmit introduction 2012 english
PDF
Presentazione Peopleware Marcom
PDF
Top Firefox Addons
PPTX
Comparison between different marketing plans
PPTX
PPTX
HumanCloud - Trace
Ttss consulting(1)
Lesson 7 world_history_medieval_period_new_
PROFESSIONAL LEARNING NETWORKS- MASS CUE 2013
OUMH1103: TOPIK 3: READING FOR INFORMATION
Book Design by Jason Gonzales
Baker Business Bootcamp
Wonju Medical Industry Techno Valley Introduction
June 2013 IRMAC slides
Thrust and lube - Startupfest 2012
17.mengadministrasi server dalam_jaringan
About P&T
4 Seasons Virtual Field Trip
Kato Mivule - Utilizing Noise Addition for Data Privacy, an Overview
Iltabloidmotori
Wmit introduction 2012 english
Presentazione Peopleware Marcom
Top Firefox Addons
Comparison between different marketing plans
HumanCloud - Trace
Ad

Similar to Cancer Diagnostic Prediction with Amazon ML – A Tutorial (20)

PPTX
Presentazione tutorial
PPTX
BAS 250 Lecture 3
PDF
Amazon Machine Learning im Einsatz: smartes Marketing - AWS Machine Learning...
PPTX
Statistical Learning and Model Selection (1).pptx
PDF
bookrecommendations-230615063942-3b1016c9 (1).pdf
PPTX
Book Recommendations.pptx
PPTX
Azure machine learning
PDF
An explanation of machine learning for business
PPTX
Selected Topics in CS-CHapter-twooo.pptx
PPTX
Decoding Loan Approval: Predictive Modeling in Action
PPTX
Pricing like a data scientist
PPTX
24AI201_AI_Unit_4 (1).pptx Artificial intelligence
PDF
Understanding Mahout classification documentation
PPTX
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
PPTX
Diabetes_Prediction_Presentation .pptx
PPT
Xlminer demo
PPTX
Recommender System Using AZURE ML
PDF
Barga Data Science lecture 10
PPTX
Breast Cancer Prediction - Arwa Marfatia.pptx
PPTX
The 8 Step Data Mining Process
Presentazione tutorial
BAS 250 Lecture 3
Amazon Machine Learning im Einsatz: smartes Marketing - AWS Machine Learning...
Statistical Learning and Model Selection (1).pptx
bookrecommendations-230615063942-3b1016c9 (1).pdf
Book Recommendations.pptx
Azure machine learning
An explanation of machine learning for business
Selected Topics in CS-CHapter-twooo.pptx
Decoding Loan Approval: Predictive Modeling in Action
Pricing like a data scientist
24AI201_AI_Unit_4 (1).pptx Artificial intelligence
Understanding Mahout classification documentation
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Diabetes_Prediction_Presentation .pptx
Xlminer demo
Recommender System Using AZURE ML
Barga Data Science lecture 10
Breast Cancer Prediction - Arwa Marfatia.pptx
The 8 Step Data Mining Process
Ad

More from Kato Mivule (20)

PDF
A Study of Usability-aware Network Trace Anonymization
PDF
Towards A Differential Privacy and Utility Preserving Machine Learning Classi...
PDF
An Investigation of Data Privacy and Utility Preservation Using KNN Classific...
PDF
Implementation of Data Privacy and Security in an Online Student Health Recor...
PDF
Applying Data Privacy Techniques on Published Data in Uganda
PPTX
Kato Mivule - Towards Agent-based Data Privacy Engineering
PDF
A Codon Frequency Obfuscation Heuristic for Raw Genomic Data Privacy
PDF
An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge
PDF
Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms
PDF
Lit Review Talk by Kato Mivule: Protecting DNA Sequence Anonymity with Genera...
PDF
An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge
PDF
An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge
PDF
Lit Review Talk - Signal Processing and Machine Learning with Differential Pr...
PDF
A Comparative Analysis of Data Privacy and Utility Parameter Adjustment, Usin...
PDF
Kato Mivule: An Overview of CUDA for High Performance Computing
PDF
Literature Review: The Role of Signal Processing in Meeting Privacy Challenge...
PDF
Kato Mivule: An Overview of Adaptive Boosting – AdaBoost
PDF
Kato Mivule: COGNITIVE 2013 - An Overview of Data Privacy in Multi-Agent Lear...
PDF
Kato Mivule: An Investigation of Data Privacy and Utility Preservation Using ...
PPTX
Towards A Differential Privacy Preserving Utility Machine Learning Classifier
A Study of Usability-aware Network Trace Anonymization
Towards A Differential Privacy and Utility Preserving Machine Learning Classi...
An Investigation of Data Privacy and Utility Preservation Using KNN Classific...
Implementation of Data Privacy and Security in an Online Student Health Recor...
Applying Data Privacy Techniques on Published Data in Uganda
Kato Mivule - Towards Agent-based Data Privacy Engineering
A Codon Frequency Obfuscation Heuristic for Raw Genomic Data Privacy
An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge
Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms
Lit Review Talk by Kato Mivule: Protecting DNA Sequence Anonymity with Genera...
An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge
An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge
Lit Review Talk - Signal Processing and Machine Learning with Differential Pr...
A Comparative Analysis of Data Privacy and Utility Parameter Adjustment, Usin...
Kato Mivule: An Overview of CUDA for High Performance Computing
Literature Review: The Role of Signal Processing in Meeting Privacy Challenge...
Kato Mivule: An Overview of Adaptive Boosting – AdaBoost
Kato Mivule: COGNITIVE 2013 - An Overview of Data Privacy in Multi-Agent Lear...
Kato Mivule: An Investigation of Data Privacy and Utility Preservation Using ...
Towards A Differential Privacy Preserving Utility Machine Learning Classifier

Recently uploaded (20)

PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PPTX
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PPTX
Introduction to machine learning and Linear Models
PPTX
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
PPTX
IB Computer Science - Internal Assessment.pptx
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPTX
climate analysis of Dhaka ,Banglades.pptx
PPTX
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
PDF
Foundation of Data Science unit number two notes
PDF
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
PPT
ISS -ESG Data flows What is ESG and HowHow
PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
Data_Analytics_and_PowerBI_Presentation.pptx
Business Ppt On Nestle.pptx huunnnhhgfvu
Introduction to machine learning and Linear Models
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
IB Computer Science - Internal Assessment.pptx
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
IBA_Chapter_11_Slides_Final_Accessible.pptx
climate analysis of Dhaka ,Banglades.pptx
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
Foundation of Data Science unit number two notes
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
STUDY DESIGN details- Lt Col Maksud (21).pptx
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
ISS -ESG Data flows What is ESG and HowHow
Galatica Smart Energy Infrastructure Startup Pitch Deck

Cancer Diagnostic Prediction with Amazon ML – A Tutorial

  • 1. Cancer Diagnostic Prediction with Amazon ML – A Tutorial By Kato Mivule, Researcher June 2015 1  
  • 2. Cancer Diagnostic Prediction with Amazon ML – A Tutorial Agenda •  The Dataset •  Amazon ML Account setup •  S3 Services – data storage •  The ML Model •  Results •  Conclusion •  References 2  
  • 3. •  Characteristics of the Wisconsin Breast Cancer dataset is given in the figure above. •  The dataset contains 11 attributes, 10 for the observations, and 1 for the class label. •  The goal is to use the data collected from the observations to make a prediction if future diagnosis from data with similar traits will be, 2 (Benign) or 4 (Malignant). Cancer Diagnostic Prediction with Amazon ML– The Dataset 3   The Wisconsin Breast Cancer Dataset Characteristics: UCI Machine Learning Repository
  • 4. •  Download the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning repository. •  Online at: [https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)] Cancer Diagnostic Prediction with Amazon ML– The Dataset 4  
  • 5. Cancer Diagnostic Prediction with Amazon ML– The Dataset •  Download the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning repository. •  Online at: [https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)] 5  
  • 6. •  Preprocessing: •  For the first row of the data, each column or attribute is named. •  Ensure at this stage that missing values are replaced with average or most frequent values. Amazon ML at this point does not do well with missing values in data. Cancer Diagnostic Prediction with Amazon ML – The Dataset 6  
  • 7. •  Make sure to save your file as a CSV file using the Windows CSV format if you are using MS Excel. Cancer Diagnostic Prediction with Amazon ML – The Dataset 7  
  • 8. •  Log into your Amazon web services (AWS) account. •  You can use the same credentials if you already buy and sell on Amazon. •  Online: http://www.aws.amazon.com/ Cancer Diagnostic Prediction with Amazon ML – The Account 8  
  • 9. •  Once logged in, you will notice many services offered by Amazon. •  Our interest for now is the Storage and Content Delivery - S3 and Analytics - Machine Learning Services. •  To start, click on the S3 web service link under the Storage and Content Delivery. •  S3 web service allows us to upload and store data on the Amazon Cloud. Cancer Diagnostic Prediction with Amazon ML – The Account 9  
  • 10. •  Clicking on the S3 link should bring us inside the S3 console. •  To create a new bucket to store our dataset, click on the Create Bucket tab. The black out on some of the lists was for security reasons. Cancer Diagnostic Prediction with Amazon ML – S3 Services 10  
  • 11. •  At this point you can give your data bucket a name. •  Amazon AWS demands you select a “Region” where your data will be stored. •  For now we shall go with the default region, the “US Standard. The black out on some of the lists was for security reasons. Cancer Diagnostic Prediction with Amazon ML – S3 Services 11  
  • 12. •  On the right side of the S3 Panel, is the None, Properties, and Transfers tabs. •  Click on your Bucket-name link on the left to open the data bucket. The black out on some of the lists was for security reasons. Cancer Diagnostic Prediction with Amazon ML – S3 Services 12  
  • 13. •  Once inside the data bucket, Amazon ML shows no datasets – the bucket is empty. •  Next, click on the Upload tab on the top-left corner to upload data. The black out on some of the lists was for security reasons. Cancer Diagnostic Prediction with Amazon ML – S3 Services 13  
  • 14. •  You could either drag and drop datasets directly into the bucket or use the “Add Files” button to upload the old fashioned way. •  Keep in mind, Amazon ML at this point in time will only support CSV files. Cancer Diagnostic Prediction with Amazon ML – S3 Services 14  
  • 15. •  After a successful upload of data, click on the radio button on the left to highlight the new dataset and then click on the Properties Tab on the right to learn more about the dataset. •  Copy the provided URL link for the dataset on the right of the bucket panel. The S3 link will be needed to tell the Amazon ML where to access the data. Cancer Diagnostic Prediction with Amazon ML – S3 Services 15  
  • 16. •  For now, we are done with the S3 web service under the Storage and Content Delivery section. •  We return to the main Amazon Web Services Panel and choose Machine Learning service under Analytics section. Cancer Diagnostic Prediction with Amazon ML – ML Model 16  
  • 17. •  At this point, we are now presented with the Amazon Machine Learning panel. •  Click on the Create new tab on the left of the panel. •  Select the Datasource and ML model – this allows us to use the cancer data we uploaded to the S3 services. Cancer Diagnostic Prediction with Amazon ML – ML Model 17  
  • 18. •  Selecting the Datasource option brings us to the Create datasource panel. •  The first step is to select where Amazon ML will get the data. •  Select the S3 radio button and input the link saved from the S3 services after uploading the cancer diagnosis dataset. Cancer Diagnostic Prediction with Amazon ML – ML Model 18  
  • 19. •  Amazon ML requests permission to access your dataset in the S3 services section. •  Select “Yes” to proceed. Cancer Diagnostic Prediction with Amazon ML – ML Model 19  
  • 20. •  Amazon ML indicates that it successfully accessed and validated your data in the S3 storage service. •  “Continue” to proceed. Cancer Diagnostic Prediction with Amazon ML – ML Model 20  
  • 21. •  Select the “Yes” radio button since the cancer dataset contains column names. •  In this preprocessing step, Amazon ML allows for the editing of the data schema to choose various data types. Cancer Diagnostic Prediction with Amazon ML – ML Model 21  
  • 22. •  The next step is to select the “Target” – the attribute that will work as the class label for classification of the data. •  Label1 is chosen for this particular cancer dataset, it contains two classes representing cancer cases diagnosed as 2 for Benign, and 4 for Malignant. •  Amazon ML uses the “Target” attribute to automatically select the ML algorithm – in this case, Regression is chosen. Later, the Binary Classification will be used. Cancer Diagnostic Prediction with Amazon ML – ML Model 22  
  • 23. •  Amazon ML allows for the selection of a row identifier attribute to help follow which prediction, in this case, class labels 2 and 4, parallels to which observation. Cancer Diagnostic Prediction with Amazon ML – ML Model 23  
  • 24. •  Update and corrections can still be made at this point. •  Click on the Edit button to review the Input data, Schema, and Target. •  Regression is chosen as the Target for the Label1 attribute but the Target section will be edited later to choose Binary Classification. Cancer Diagnostic Prediction with Amazon ML – ML Model 24  
  • 25. •  Cross Validation: By default Amazon ML divides the dataset into two parts; with 70 percent of the data for Training and the remaining 30 percent for Testing. •  In this case, the breast cancer diagnosis data was divided into 488 records for Training and 211 records for Testing. Cancer Diagnostic Prediction with Amazon ML – ML Model 25  
  • 26. •  In the Review Panel, a summary of the ML model is given and adjustments can still be made at this point to the model settings. •  Click finish to proceed once you are satisfied with the settings. Cancer Diagnostic Prediction with Amazon ML – ML Model 26  
  • 27. •  After execution, the ML model report is returned. The Evaluation status on the right side under the Evaluation Summary, should read Completed, in green. •  Under the ML model report is the Evaluations link, with a dropdown menu to Summary, Alerts, and Explore performance links to evaluate performance. •  Amazon ML returns a performance metric value and the Explore model performance button gives more visualization of results. Cancer Diagnostic Prediction with Amazon ML – Results 27  
  • 28. The RMSE: Amazon Machine Learning Developers Guide – Evaluating ML Models •  Amazon ML returns the root-mean-square error (RMSE) value for Regression models. Cancer Diagnostic Prediction with Amazon ML – Results 28  
  • 29. •  The root-mean-square error (RMSE) value is returned for the Regression model. •  The smaller the RMSE, the better the performance of the ML model. •  Amazon ML reports that for this experiment, the Regression model achieved a root mean square error ( RMSE) of 0.35, better than the baseline of 0.90. Cancer Diagnostic Prediction with Amazon ML – Results 29  
  • 30. •  Amazon ML provides a visual distribution of residuals for the ML Regression model in the form of a bar chart, with the option to change the bin width. •  Here, Residual = Observed value – Predicted value.
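The residual chart can be approximated with a short Python sketch that groups residuals into bins of a chosen width. This is an assumption about how such a chart could be built, not a description of Amazon ML's internals; `residual_histogram` and the sample values are hypothetical.

```python
import math
from collections import Counter


def residual_histogram(observed, predicted, bin_width=0.5):
    """Count residuals (observed - predicted) per bin.

    Each key is the lower edge of a bin that is `bin_width` wide.
    """
    counts = Counter()
    for o, p in zip(observed, predicted):
        residual = o - p
        bin_start = math.floor(residual / bin_width) * bin_width
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


# Illustrative values only, not the experiment's data.
hist = residual_histogram([2, 2, 4, 4], [2.2, 1.6, 3.1, 4.4])
print(hist)
```

Changing `bin_width` plays the same role as the bin-width option in the console: wider bins give a coarser chart, narrower bins a more detailed one.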
  • 31. •  Under the ML model report, click on the Evaluations dropdown menu and then click the Alerts link. •  A summary of the criteria used to evaluate the ML model is given, showing the cross validation, number of records for both training and testing, and schema attributes used. •  488 records were used for Training, while 211 were used for Testing (evaluation data).
  • 32. •  Amazon ML provides the option to learn about the characteristics of the dataset being used for the ML model. •  Go back to the Amazon ML dashboard and, in the list of Entities, click on the Entity Name whose Type is Datasource.
  • 33. •  A frequency distribution is given for the class Label1 attribute in the Training sample data, showing 295 cases listed as 2 = Benign while 193 as 4 = Malignant. •  A total of 488 records were used for Training, while 211 records were reserved for Testing.
  • 34. •  Amazon ML gives basic descriptive statistics for each attribute in the dataset. •  Click on Preview, on the right of the table, to see a visualization for each attribute.
  • 35. •  The Preview for Feature6 in the dataset gives a visualization of the frequency distribution and a summary of the basic descriptive statistics for that attribute.
  • 36. The F1 score (source: Amazon Machine Learning Developer Guide, Evaluating ML Models).
  • 37. •  In this next section, the Target parameters are edited to select Binary Classification. •  Run the Binary Classification ML model and compare its results with those of the Regression ML model. •  A summary of the Binary classification ML model performance returns an F1 score of 0.94. •  The F1 score ranges from 0 to 1, and a higher score indicates better performance; the 0.94 obtained here indicates that the Binary classification model performs well.
  • 38. •  Amazon ML provides both the F1 score metric value and a visualization for the Binary Classification ML model. •  Hovering the cursor over each rectangular box displays the percentage of records correctly classified and the percentage misclassified.
  • 39. •  To explore the visualization aspect of the model, click on the Explore model performance button. A confusion matrix is presented. •  The Predicted values run along the horizontal axis, and the True values along the vertical axis. The F1 score values are presented for each row, including the totals.
  • 40. •  211 records were used as Testing data for the Binary classification ML model. •  163 records belonged to Label 2 (Benign); the other 48 belonged to Label 4 (Malignant). •  About 99% of the cases labeled Benign in the Testing data were correctly predicted as belonging to group 2 (Benign), while only 0.61% of the same records were misclassified as belonging to group 4 (Malignant). •  The F1 score for group 2 was 0.98, close to the perfect score of 1.
  • 41. •  About 85% of the cases labeled Malignant in the Testing data were correctly predicted as belonging to group 4 (Malignant), while 14.58% of the same records were mistakenly predicted as group 2 (Benign). •  The F1 score for group 4 was 0.91. The overall F1 score, averaged across both groups, was 0.94.
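The per-group F1 scores can be checked with a short Python sketch. The record counts below are not stated on the slides; they are inferred from the reported percentages (0.61% of 163 Benign test records is 1 record, and 14.58% of 48 Malignant test records is 7 records), and the plain average of the two scores is an assumption about how the overall figure was obtained. The function name `f1_score` is hypothetical.

```python
def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)


# Counts inferred from the slide percentages (assumption, see lead-in).
benign_correct, benign_as_malignant = 162, 1      # 163 Benign test records
malignant_correct, malignant_as_benign = 41, 7    # 48 Malignant test records

# Treat each group in turn as the "positive" class.
f1_benign = f1_score(tp=benign_correct,
                     fp=malignant_as_benign,
                     fn=benign_as_malignant)
f1_malignant = f1_score(tp=malignant_correct,
                        fp=benign_as_malignant,
                        fn=malignant_as_benign)
average_f1 = (f1_benign + f1_malignant) / 2

print(round(f1_benign, 2), round(f1_malignant, 2), round(average_f1, 2))
# prints: 0.98 0.91 0.94
```

Under these inferred counts, the computed values agree with the 0.98, 0.91, and 0.94 reported on the slides.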
  • 42. Conclusion •  Amazon ML is intuitive and could help the data scientist focus on knowledge discovery, while leaving hardware and other computational resource issues to the engineers at Amazon cloud services. •  The potential for Amazon ML applications in Health Data Science is enormous. •  ML algorithms are still constrained to the choices provided by Amazon ML, namely Binary classification, Multi-class classification, and Regression models. Including other ML algorithms in the future would provide more choice for comparative studies. •  Data preprocessing is still a pain – one has to strictly follow Amazon ML guidelines. Currently Amazon ML only accepts CSV file formats. Automating this process would be ideal.
  • 43. References •  Amazon ML, Online: [www.aws.amazon.com] •  Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning repository. Online at: [https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)] •  K. P. Bennett & O. L. Mangasarian: "Robust linear programming discrimination of two linearly inseparable sets", Optimization Methods and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers). •  Evaluating ML Models - Amazon Machine Learning Developer Guide, Available Online: [http://docs.aws.amazon.com/machine-learning/latest/dg/evaluating_models.html] •  Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
  • 44. Thanks. Questions? Contact Kato Mivule @ [kmivule/gmail/com]