Machine Learning and Bioinformatics Approach
Yields Noninvasive miRNA Biomarkers for Early Lung
Cancer Detection
Andrew Gao
Abstract
Non-small cell lung cancer (NSCLC), the most common type of lung cancer, affects
millions of people. In 2020, lung cancer caused 1.8 million deaths, partly because it is
difficult to diagnose lung cancer at early stages. Detecting cancer earlier results in better
survival. One potential method for early diagnosis, liquid biopsy, relies on biomarkers
present in body fluids. MicroRNAs (miRNAs) in blood could serve as biomarkers for
noninvasive diagnostic/prognostic tests. MiRNAs are RNA molecules that regulate the
expression of specific genes. Through differential expression analysis of four public
datasets, the present study identified 13 miRNAs that are consistently underexpressed
in the tissue, blood, and serum of NSCLC patients. Kaplan-Meier survival analysis found
that six miRNAs had statistically significant prognostic power (miR-140, miR-199a,
miR-29c, miR-320e, miR-103a, miR-526b). Functional enrichment analysis of the genes
targeted by the miRNAs demonstrated that they are involved in several hallmarks of
cancer, such as the epithelial to mesenchymal transition. A machine learning model was
constructed using microRNA expression data. Recursive feature elimination was
performed to select miRNAs with the greatest diagnostic value. Five classifiers were
tested on the selected miRNAs, with Random Forest and Logistic Regression performing
the best. A novel three-microRNA panel with 91.5% accuracy for NSCLC detection
was identified (miR-320e, miR-103a, miR-526b). These miRNAs also have significant
prognostic power for lung adenocarcinoma. The machine learning and analysis workflow
was adapted into an open-source online tool for automatic biomarker selection,
available at biomarkergenie.com. Future steps include experimental validation and clinical
trials.
Background
• 1.8 million deaths from lung cancer in 2018
– 5 year survival rate is only 21%
• My grandfather passed away at 67 due to
non-small cell lung cancer
• Hard to accurately diagnose lung cancer
– CT Scan
– Tissue Biopsy (follow up)
• Cells release molecules like DNA and RNA
into the bloodstream
• Drawing blood is noninvasive
• Different levels of biomarkers could
indicate cancer
– High levels of miR-1228-3p (Xue 2020)
Decreased gene expression
(degraded mRNA can’t be
translated into protein)
Statement of Purpose
Purpose: Identify non-invasive microRNA biomarkers for non-small cell lung
cancer (NSCLC) diagnostic and prognostic tests.
Criteria:
1. Must be found in blood
2. Consistently differentially expressed across studies
○ Account for high variance in results between studies
3. Statistically significant change
4. Ideally has biological relevance
Outline:
1. Find miRNAs that are reliably differentially expressed in lung cancer
2. Find what pathways these miRNAs are involved in
3. Check if these miRNAs can predict survival
4. Identify best combination of miRNAs for diagnosing lung cancer
Materials:
● Computer
● R Studio
○ R
○ Limma
● Google Colab
● Atom code editor
● Python
○ Matplotlib
○ Seaborn
○ Scikit Learn
○ Streamlit
● Jvenn
● Heroku
● Github
● Gene Expression
Omnibus (GEO)
○ GSE137140
○ GSE93300
○ GSE94536
○ GSE53882
● The Cancer Genome Atlas (TCGA)
● Search Tool for the Retrieval of Interacting Proteins (STRING)
● Kaplan Meier Plotter (kmplot.com)
● Graphpad Prism 9
● Cytoscape
○ MCODE
● Gene Ontology
○ Panther
● miRWalk
● miRDB
● miRTarBase
● GeneCards
Data: miRNA expression profiling data
Control: Non-cancer people
Experimental: NSCLC patients
Procedure:
Selected datasets and characteristics
Differential expression data
● Four datasets
were selected.
● In total, 1978
NSCLC and 1932
control samples.
The limma package in R was
used to calculate logFC and
p values (moderated t-test).
cutoff: p<0.05
logFC: log2 ratio of expression in
disease vs. controls
negative = underexpressed
positive = overexpressed
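The sign convention for logFC can be illustrated with a minimal Python sketch. This is not limma (limma fits linear models in R and reports moderated t-statistics); it only computes a log2 fold change and an ordinary Welch t-statistic, assuming the expression values are already on a log2 scale. The function names and toy values are hypothetical.

```python
import math
from statistics import mean, variance

def log_fold_change(disease, control):
    """Difference of mean log2 expression (disease minus control)."""
    return mean(disease) - mean(control)

def welch_t(disease, control):
    """Welch's t-statistic for two samples with unequal variances."""
    standard_error = math.sqrt(
        variance(disease) / len(disease) + variance(control) / len(control)
    )
    return (mean(disease) - mean(control)) / standard_error

# Toy log2-scale expression values for one miRNA (invented for illustration).
disease = [5.1, 4.8, 5.0, 4.7]
control = [6.9, 7.2, 7.0, 7.1]

lfc = log_fold_change(disease, control)
print(f"logFC = {lfc:.2f}, t = {welch_t(disease, control):.2f}")
print("underexpressed" if lfc < 0 else "overexpressed")
```

A negative logFC here means the miRNA is lower in the disease group, matching the "negative = underexpressed" convention on the slide.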
Table columns: microRNA name, ratio of expression, other name
Venn diagram of overlapping differentially
expressed microRNAs between datasets
Heatmap of logFC values of
each of the 13 microRNAs
across all datasets
13 microRNAs are
differentially expressed in
all four datasets (p<0.05).
All are underexpressed
(negative logFC).
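The overlap step shown in the Venn diagram can be sketched as a set intersection. This is a minimal illustration, assuming each dataset's differential expression result is available as a mapping from miRNA name to (logFC, p value); the function name, dataset keys, miRNA names, and numbers are placeholders, not study results.

```python
def consistent_mirnas(results, p_cutoff=0.05):
    """Return miRNAs significant in every dataset with the same sign of logFC.

    `results` maps dataset name -> {miRNA: (logFC, p_value)}.
    """
    tables = list(results.values())
    # miRNAs that pass the p-value cutoff in each dataset
    significant = [
        {name for name, (_, p) in table.items() if p < p_cutoff}
        for table in tables
    ]
    shared = set.intersection(*significant)
    consistent = {}
    for name in shared:
        signs = {table[name][0] > 0 for table in tables}
        if len(signs) == 1:  # same direction in every dataset
            consistent[name] = "over" if signs.pop() else "under"
    return consistent

# Toy input: miR-y flips direction, miR-z misses the cutoff in dataset_A.
toy = {
    "dataset_A": {"miR-x": (-1.2, 0.001), "miR-y": (0.8, 0.01), "miR-z": (-0.5, 0.20)},
    "dataset_B": {"miR-x": (-0.9, 0.003), "miR-y": (-0.4, 0.02), "miR-z": (-0.7, 0.01)},
}
print(consistent_mirnas(toy))
```

Requiring agreement in direction as well as significance is one way to account for variance between studies.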
349 target genes are involved in many hallmarks of cancer:
● epithelial to mesenchymal transition
● SMAD protein phosphorylation
● heterochromatin
● miRNA silencing (impaired)
● transforming growth factor beta
Protein-protein interaction (PPI) network of target genes
Highly interconnected
clusters of interacting genes
(using Cytoscape)
Kaplan Meier Survival Analysis: miR-140, miR-199a, miR-29c, miR-320e, miR-103a,
and miR-526b are significantly associated with patient survival (p<0.05)
high expression = better survival (consistent with these miRNAs being underexpressed in NSCLC)
Squamous Carcinoma
Adenocarcinoma
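For readers unfamiliar with Kaplan-Meier curves, the product-limit estimator behind them can be sketched in a few lines. The survival plots in this deck came from Kaplan Meier Plotter (kmplot.com), not from this code; the function name and toy follow-up data are hypothetical, and the comparison between high- and low-expression groups (for example a log-rank test) is not shown.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            # Multiply by the fraction of at-risk patients who survive time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after time t.
        at_risk -= leaving
    return curve

# Toy follow-up times (months) with two censored patients.
for time, prob in kaplan_meier([5, 8, 12, 12, 20], [1, 0, 1, 1, 0]):
    print(f"t = {time}: S(t) = {prob:.3f}")
```

In practice one curve would be computed per expression group and the two curves compared.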
Can these miRNAs distinguish lung cancer and
healthy controls?
Input raw data
Stage characteristics of input data (GSE137140)
Most samples are Stage 1 (72%)
columns: microRNAs, target = label
( 1 for cancer, 0 for control)
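The input layout described above (one column per microRNA plus a label column) could be read as follows. This is a sketch using only the Python standard library; it assumes the label column is literally named `target`, and the function name and expression values are invented for illustration.

```python
import csv
import io

def load_expression_csv(handle, label_column="target"):
    """Split rows into feature vectors (floats) and integer labels."""
    reader = csv.DictReader(handle)
    feature_names = [c for c in reader.fieldnames if c != label_column]
    X, y = [], []
    for row in reader:
        X.append([float(row[name]) for name in feature_names])
        y.append(int(row[label_column]))  # 1 = cancer, 0 = control
    return feature_names, X, y

# Tiny in-memory file standing in for a real expression table.
example = io.StringIO(
    "miR-320e,miR-103a,miR-526b,target\n"
    "4.2,3.1,2.8,1\n"
    "6.5,5.0,4.9,0\n"
)
names, X, y = load_expression_csv(example)
print(names, X, y)
```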
Initial 3-component PCA using 13
miRNAs shows distinct separation
between lung cancer and healthy
controls (63% explained variance)
Which classifier works the best?
Extra Forest, Random Forest, and
Recursive Feature Elimination
(Logistic Regression) rank 13 miRNAs
by importance
● The three rankings generally agreed
Random Forest performs best
20% of data for training
80% for testing
Which miRNAs are most important?
Top 3: miR-320e, miR-103a, miR-526b
Chart of Feature Importances based on Extra Forest Ranker
Confusion matrix of results
for top 3 microRNAs
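A confusion matrix and the metrics derived from it can be computed as in the sketch below. The labels follow the deck's convention (1 for cancer, 0 for control); the function names and the toy predictions are hypothetical and do not reproduce the matrix on the slide.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        else:  # truth == 1 and pred == 0
            counts["fn"] += 1
    return counts

def summarize(counts):
    """Accuracy, sensitivity (recall on cancer), specificity (recall on controls)."""
    total = sum(counts.values())
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / total,
        "sensitivity": counts["tp"] / (counts["tp"] + counts["fn"]),
        "specificity": counts["tn"] / (counts["tn"] + counts["fp"]),
    }

# Toy labels: one missed cancer and one false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts, summarize(counts))
```

For a screening test, sensitivity and specificity are usually worth reporting alongside accuracy.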
10 fold cross validation on top 3 miRNAs:
Accuracy = 0.915 (stdev = 0.13)
10 fold cross validation on top 4 miRNAs:
Accuracy = 0.916 (stdev = 0.13)
(virtually identical)
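The 10-fold cross-validation figures above (mean accuracy and standard deviation across folds) come from a procedure like the one sketched here. To keep the example dependency-free, a nearest-centroid classifier stands in for the Random Forest used in the study, and the folds are not stratified; the function names and the toy data are hypothetical.

```python
import random
from statistics import mean, stdev

def nearest_centroid_predict(train_X, train_y, test_X):
    """Assign each test row to the class whose mean vector is closest."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]

    def closest(x):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    return [closest(x) for x in test_X]

def k_fold_accuracy(X, y, k=10, seed=0):
    """Return (mean, standard deviation) of accuracy over k held-out folds."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        preds = nearest_centroid_predict(
            [X[i] for i in train], [y[i] for i in train], [X[i] for i in fold]
        )
        correct = sum(1 for i, p in zip(fold, preds) if y[i] == p)
        scores.append(correct / len(fold))
    return mean(scores), stdev(scores)

# Toy data: 20 "cancer" rows with low expression, 20 "control" rows with high.
X = [[4.0 + 0.01 * i, 3.0 + 0.01 * i] for i in range(20)]
X += [[7.0 + 0.01 * i, 6.0 + 0.01 * i] for i in range(20)]
y = [1] * 20 + [0] * 20
mean_acc, sd_acc = k_fold_accuracy(X, y, k=10)
print(f"accuracy = {mean_acc:.3f} (stdev = {sd_acc:.3f})")
```

Reporting the spread across folds, as the slide does, gives a sense of how much the accuracy depends on which samples are held out.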
Accuracy:
miR-320e: 82.5%
top 3: 91.6%
top 4: 92.6%
Which combination of miRNAs works the best?
biomarkergenie.com
“one click” omics data analysis tool
simply input a csv file
works on: metabolomics, molecular descriptors, etc.
Conclusion
Combined machine learning/bioinformatics approach identifies:
• 2 miRNA panel with 90% accuracy
– miR-320e + miR-103a
• 3 miRNA panel with 91.5% accuracy
– miR-320e + miR-103a + miR-526b
– prognostic biomarkers for squamous carcinoma
• 3 miRNA panel with 86% accuracy
– miR-140 + miR-199a + miR-29c
– prognostic biomarkers for adenocarcinoma
• Potential “2 in 1” tests
Web tool:
• Speeds up exploratory analysis
• No setup needed
• Increases accessibility of machine learning
• Widely applicable to any disease or omics data
Conclusion
Advantages
• biomarkers were tested on majority Stage 1 patient data
• consistent across large sample size and across 4 studies
– methodology of this study accounts for variance between studies
• greater sampling flexibility (microRNAs are present in serum AND
plasma)
Limitations:
• small sample size for some data
• groups LUSC and LUAD together
• flaws with training data
– batch effect/confounding
Next steps
• Experimental validation
• Differences between LUAD and LUSC
• Clinical trials
• Add functionality to web tool
– Regression
Topics:
1. microRNAs in cancer
2. differential expression analysis
3. target gene prediction
4. functional enrichment analysis
5. machine learning classification
6. biomarkergenie.com
References
1. Press Release N° 263. (2018). In World Health Organization. International Agency for Research on Cancer.
2. American Cancer Society. Facts & Figures 2019. American Cancer Society. Atlanta, Ga. 2019. Howlader N, Noone AM, Krapcho M, Miller D,
Bishop K, Kosary CL, Yu M, Ruhl J, Tatalovich Z, Mariotto A, Lewis DR, Chen HS, Feuer EJ, Cronin KA (eds). SEER Cancer Statistics
Review, 1975-2014, National Cancer Institute. Bethesda, MD, https://seer.cancer.gov/csr/1975_2014/, based on November 2016 SEER data
submission, posted to the SEER web site, April 2017.
3. Heneghan, H. M., Miller, N., & Kerin, M. J. (2010). MiRNAs as biomarkers and therapeutic targets in cancer. Current Opinion in
Pharmacology, 10(5), 543–550.
4. Farazi, T. A., Spitzer, J. I., Morozov, P., & Tuschl, T. (2010). miRNAs in human cancer. The Journal of Pathology, 223(2), 102–115.
5. Ma, J., Lin, Y., Zhan, M., Mann, D. L., Stass, S. A., & Jiang, F. (2015). Differential miRNA expressions in peripheral blood mononuclear cells
for diagnosis of lung cancer. Laboratory Investigation, 95(10), 1197–1206.
6. Hennessey, P. T., Sanford, T., Choudhary, A., Mydlarz, W. W., Brown, D., Adai, A. T., Ochs, M. F., Ahrendt, S. A., Mambo, E., & Califano, J.
A. (2012). Serum microRNA Biomarkers for Detection of Non-Small Cell Lung Cancer. PLoS ONE, 7(2), e32307.
7. Shen, J., Todd, N. W., Zhang, H., Yu, L., Lingxiao, X., Mei, Y., Guarnera, M., Liao, J., Chou, A., Lu, C. L., Jiang, Z., Fang, H., Katz, R. L., &
Jiang, F. (2010). Plasma microRNAs as potential biomarkers for non-small-cell lung cancer. Laboratory Investigation, 91(4), 579–587.
8. Heneghan, H. M., Miller, N., Lowery, A. J., Sweeney, K. J., Newell, J., & Kerin, M. J. (2010). Circulating microRNAs as Novel Minimally
Invasive Biomarkers for Breast Cancer. Annals of Surgery, 251(3), 499–505.
9. Xue, W.-X., Zhang, M.-Y., Rui Li, Liu, X., Yin, Y.-H., & Qu, Y.-Q. (2020, July 7). Serum miR-1228-3p and miR-181a-5p as Noninvasive
Biomarkers for Non-Small Cell Lung Cancer Diagnosis and Prognosis. BioMed Research International.
10. Ying, Lisha, et al. "Development of a serum miRNA panel for detection of early stage non-small cell lung cancer." Proceedings of the National
Academy of Sciences 117.40 (2020): 25036-25042.
11. Chugh, P., & Dittmer, D. P. (2012). Potential pitfalls in microRNA profiling. Wiley Interdisciplinary Reviews: RNA, 3(5), 601–616.
https://doi.org/10.1002/wrna.1120
12. Tang, Gusheng, et al. "Different normalization strategies might cause inconsistent variation in circulating microRNAs in patients with
hepatocellular carcinoma." Medical science monitor: international medical journal of experimental and clinical research 21 (2015): 617.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4345856/
13. Kenny, Louise C., et al. "Novel biomarkers for pre-eclampsia detected using metabolomics and machine learning." Metabolomics 1.3 (2005):
227-234. https://link.springer.com/article/10.1007/s11306-005-0003-1
14. Huang, Yao, et al. "Serum microRNA panel excavated by machine learning as a potential biomarker for the detection of gastric cancer."
Oncology reports 39.3 (2018): 1338-1346. https://www.spandidos-publications.com/10.3892/or.2017.6163
Full list in notebook.
Acknowledgement
Thank you to my parents, my teachers, the Van
Allen lab, and my mentor Wendy Slijk for their
support!

Sigma Xi 2021 Andrew Gao Presentation
