INTEGRATING DATA SCIENCE AND
DATA ANALYTICS IN VARIOUS
RESEARCH THRUSTS OF THE
UNIVERSITY
Menchita F. Dumlao, Ph.D.
DATA SCIENCE IS FOR BIG DATA
Lec 1 integrating data science and data analytics in various research thrust
MACHINE LEARNING
• "computer programs that automatically improve with experience."
• interdisciplinary in nature
• employs techniques from the fields of computer science, statistics, and
artificial intelligence, among others.
• algorithms which facilitate automatic improvement from
• machine learning is a central aspect of data science.
• pattern recognition
• Machine learning has a complex relationship with data mining.
DATA SCIENCE: DAY TO DAY
WHAT DOES
FACEBOOK DO TO
YOUR DATA?
• learning what consumers prefer
• emotional contagion study
• Cookies on your browser predict who you are
• Social plugins ("like", "subscribe" or "recommend" buttons)
• information that Facebook sells to advertisers
we’ve agreed to a huge
amount of data being turned
over and signing off on the
social network’s seemingly
limitless ability to do with it
whatever it wants
FACEBOOK DATA SCIENCE
crawled or scraped data will be valuable and
constructive for commercial, scientific, and many
other fields of prediction and analysis
FACEBOOK’S DATA PRIVACY
POLICY:
• …in addition to helping people see and find things that you do and share,
we may use the information we receive about you… for internal operations,
including troubleshooting, data analysis, testing, research and service
improvement.
OCTOPARSE
Octoparse is a powerful web scraper that
can scrape both static and dynamic
websites that use AJAX, JavaScript, cookies,
etc.
https://www.octoparse.com/blog/facebook-data-mining/
VISUAL SCRAPER
• Visual Scraper is another great free web scraper with a simple point-and-click interface
• collect data from the web
• export the extracted data as CSV,
XML, JSON or SQL files.
• scrape data from up to 50,000 web pages for only one user.
http://www.visualscraper.com/
FACEBOOK DATA SCIENCE USING R
• R is a language and environment for statistical computing, widely used for data mining and for analyzing big data.
• Data science in FB using R.pdf
• The Rfacebook package provides an interface to the Facebook API: functions that let R retrieve information about posts, comments, likes, groups that mention specific keywords, and much more.
• But: it is not much different from what we, especially statisticians, have been doing for many years
• Much more data is digitally available than before
• Inexpensive computing + cloud + easy-to-use programming frameworks = much easier to analyze it
• Often: large-scale data + simple algorithms > small data + complex algorithms
• Changes how you do analysis dramatically
• Causation --> Correlation
• The goal of analysis is often to figure out what caused what, but causation is very hard to figure out
  • What causes breast cancer and other diseases
• Data science looks for correlations with the things that happen:
  • When an earthquake will come
  • Why students fail or pass the board exam
  • What job graduates get after graduation, and why
• Using data understanding and computer science algorithms
"Datafication":
• Process of converting abstract things into concrete data, e.g., what you like represented as a stream of your likes
• Your "sitting posture" captured using hundreds of sensors placed in a car seat
• Google Flu Trends
• Early warning of flu outbreaks by analyzing search
queries
• Up to 1 or 2 weeks ahead of CDC
• Analyzed 50M search queries to see which of them fit
the physician visits for flu
• 45 search terms used to create a single model
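The idea of checking which search queries "fit" physician visits can be sketched as a correlation ranking. This is only an illustration, not Google's actual method: the weekly numbers, the query terms, and the helper names `pearson` and `rank_queries` are all made up for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Hypothetical weekly counts, invented for illustration only.
physician_visits = [120, 135, 160, 210, 260, 240]
query_counts = {
    "flu symptoms": [900, 1000, 1250, 1700, 2100, 1950],
    "movie times": [500, 480, 510, 495, 505, 490],
}

def rank_queries(queries, target):
    """Sort query terms from most to least correlated with the target series."""
    scored = [(pearson(counts, target), term) for term, counts in queries.items()]
    return sorted(scored, reverse=True)

ranked = rank_queries(query_counts, physician_visits)
for score, term in ranked:
    print(f"{term}: r = {score:.2f}")
```

A real system would pick the best-fitting terms out of millions of candidates and then validate the combined model on held-out weeks; the sketch only shows the ranking step.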
DATA SCIENCE PROJECTS:
ALGORITHMS, SIMULATION
AND APPLICATIONS
DATA SCIENCE PROJECTS
•Determining Rice Bug Epidemic Using Decision
Trees
•Prediction Model for Students’ Performance in
Java Programming with Course-content
Recommendation System
•Predicting IT Employability Using Data Mining
Techniques
DETERMINING RICE BUG EPIDEMIC USING
DECISION TREES
• Roland Calderon, Menchita Dumlao et al. (2016)
• data mining techniques in agriculture for predicting future trends such as
bug epidemic.
• Insect Epidemiology Data Mining (IEDM).
• IEDM - Discrete Mathematics and Theoretical Computer Science (DIMACS)
that aims to provide an opportunity to develop and test problem instances
and other methods of testing and comparing performance of algorithms
• uses a decision tree
• classification and prediction
• represents rules
• CRISP-DM methodology
• Rice Field Insect Light Trap (RFILT) mass-traps both sexes of insect pests
• insect distribution, abundance, flight patterns, timing of the application of pesticide
• forecasting precision of a predictive model: confusion matrix
• Lunar Cycle level is the best predictor of epidemic status
• followed by Vegetative level
• In the Vegetative stage level, 100% resulted in outbreak status
• For the Ripening stage, the next best predictor is temperature.
• Over 82% of bugs occurred in the outbreak status if the temperature is less than or equal to 32 to 38
• 97.3% if the temperature is greater than 32
• For the Reproduction and Resting stages, 52.7% of bugs occurred in the infested status, and this is also considered a terminal node.
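How a decision tree ends up calling one attribute the "best predictor" can be sketched with information gain. The slides do not say which splitting criterion the study used, so entropy-based gain is an assumption here, and the four rows below are invented toy data, not the study's records.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in target entropy obtained by splitting rows on one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    n = len(rows)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder

def best_predictor(rows, attributes, target):
    """Attribute with the highest information gain; a tree would split on it first."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Toy rows, made up so that one attribute separates the classes perfectly.
rows = [
    {"lunar": "new", "stage": "vegetative", "status": "outbreak"},
    {"lunar": "new", "stage": "ripening", "status": "outbreak"},
    {"lunar": "full", "stage": "vegetative", "status": "infested"},
    {"lunar": "full", "stage": "ripening", "status": "infested"},
]

print(best_predictor(rows, ["lunar", "stage"], "status"))
```

The tree repeats the same choice inside each branch, which is how a "next best predictor" appears lower in the tree.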
Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System
• Evale, Digna, Menchita F. Dumlao, et al. (2016)
• Comparative analysis among different data mining algorithms for attribute selection and classification
• a two-phase study which aimed to predict the students’ performance in Java Programming and to generate recommendations
• Knowledge Discovery in Databases (KDD)
• Logistic Regression and Correlation-based Feature Selection were used for finding significant predictors
• Classifiers such as CHAID, Exhaustive CHAID, CRT, QUEST, J48, BayesNet, NaïveBayes and JRip were implemented
• J48 has the highest percentage of prediction.
• For the second phase, evolutionary prototyping was implemented
• Ruby on Rails: a web-based examination module that determines the students’ index of learning style and assesses their prior knowledge in Java
• A course-content recommendation presenting the learners’ strengths and weaknesses in the subject, with a suggested method of learning style, will be automatically generated by the system.
• KDD: selection, pre-processing, transformation, mining and interpretation.
• Selection - possible attributes are collected for the data set
• Pre-processing - filtering and removing irrelevant data.
• Transformation - determining the most suited data mining technique to provide the best prediction algorithm.
• Mining - discovering the patterns captured through classification rules, regression models or decision trees.
• Evaluation or interpretation - visualizing what is extracted from the models.
• Waikato Environment for Knowledge Analysis (WEKA) data mining tool and IBM Statistical Package for the Social Sciences (SPSS).
• There were 8 attributes, namely gender, age, course, section, schedule and 3 academic performance attributes (grades in Programming 1, 2 and 3).
• Attribute selection was done using Standard Regression Analysis, Forward and Backward Conditional Regression, Likelihood Ratio, and Wald
• WEKA was also used to conduct pre-processing through filtering by AttributeSelection
• Summary of Attribute Selection Result
• 2 significant attributes out of eight original attributes
• With a critical p-value of .05 (significant predictors should have a p-value smaller than the critical value),
• Binary Logistic Regression (SPSS)
• section and course were highly insignificant, with p-values of .747 and .221 respectively.
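The significance rule above amounts to a simple threshold test. A minimal sketch, using only the two p-values quoted in the slides (the p-values of the other six attributes are not reported, so they are left out); the names `ALPHA`, `is_significant` and `to_remove` are illustrative.

```python
ALPHA = 0.05  # critical p-value quoted in the slides

def is_significant(p_value, alpha=ALPHA):
    """A predictor counts as significant when its p-value is below the threshold."""
    return p_value < alpha

# Only these two p-values appear in the slides.
reported_p_values = {"section": 0.747, "course": 0.221}

# Attributes that fail the test are candidates for removal during pre-processing.
to_remove = sorted(name for name, p in reported_p_values.items()
                   if not is_significant(p))
print(to_remove)
```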
• Pre-processing using attribute selection (SPSS and WEKA)
• course and section were automatically removed (highly insignificant)
• CfsSubsetEvaluation - to further verify the significance of the attribute gender
• BestFirst method - gender was found significant with a 0.239 value of merit of best subset (0 to 1, incorrectly classified instance)
• 76.1% of correctly classified instances
• GreedyStepWise search method (through cross-validation)
• course and section were not found in any of the ten folds, while gender appeared in 7 out of 10 folds (70%).
• significant predictors: age, gender, schedule, grade in Programming 1, grade in Programming 2, and grade in Programming 3.
Summary of Accuracy of Different Algorithms tested
• J48 is the best algorithm
• J48 has the highest accuracy in making predictions
• It also has the highest Cohen’s Kappa value, which means that the prediction is strongly reliable, with 64% to 81% reliability
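Cohen's Kappa compares the observed agreement between predicted and actual classes with the agreement expected by chance. A minimal sketch follows; the 2x2 confusion matrix is hypothetical, since the study's own matrix is not shown in the slides.

```python
def cohens_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("empty confusion matrix")
    k = len(matrix)
    # Observed agreement: share of cases on the diagonal.
    observed = sum(matrix[i][i] for i in range(k)) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    # Chance agreement: product of marginal proportions, summed over classes.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (total * total)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)

# Hypothetical matrix, invented for illustration.
hypothetical = [[40, 10],
                [5, 45]]
kappa = cohens_kappa(hypothetical)
print(round(kappa, 2))
```

With this matrix the observed agreement is 0.85 and the chance agreement is 0.50, so kappa is 0.70.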
Predicting IT Employability Using Data Mining Techniques
• Piad, Keno, Menchita F. Dumlao, et al. (2016)
• Knowledge Discovery in Databases (KDD)
• CRISP-DM (Cross-Industry Standard Process for Data Mining)
• Naive Bayes
• Decision Tree
• Ensemble
• pre-processing
• data sets: training and testing data sets
• training data sets: used to generate the model
• testing data sets: used to determine the acceptability of the model.
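Splitting records into training and testing sets can be sketched as a seeded shuffle followed by a cut. The 25% test fraction, the seed, and the placeholder records are assumptions for illustration; the slides give only the total of 685 instances, not the split procedure.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows and return (training_rows, testing_rows)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder records standing in for the 685 instances mentioned in the deck.
records = list(range(685))
train, test = train_test_split(records)
print(len(train), len(test))
```

The model is fitted on `train` only; `test` is held back so the reported accuracy reflects records the model has not seen.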
• Apriori algorithm - determines associated attributes that frequently occur together in the data sets
• decision tree and naive Bayes algorithms - used to design the predictive model
• predictive model = equation or rule sets for prediction
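The frequent-attribute step can be sketched with a compact Apriori implementation. The attribute names and the four records are hypothetical, and the study itself ran its analysis in WEKA and SPSS rather than with code like this.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    if n == 0:
        return {}

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}
    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Hypothetical graduate records, invented for illustration.
records = [
    {"high_cgpa", "employed"},
    {"high_cgpa", "employed", "internship"},
    {"internship", "employed"},
    {"high_cgpa"},
]
frequent = apriori(records, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```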
• Rule set or equation learning instances of the testing sets
• WEKA and SPSS
• graduate tracer
• student’s biographic profile
• cumulative grade point average (CGPA)
• 685 instances (tuples), SY 2011-2015
• training and testing sets of data
Accuracy Result in Predicting IT Graduate Employability

Algorithm             Accuracy Result (%)   Error Estimation Rate (%)
Naive Bayes           75.33                 24.47
J48                   74.95                 25.05
SimpleCart            73.01                 26.99
Logistic regression   78.4                  22.60
CHAID                 76.3                  23.70
• Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function
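The logistic function maps any weighted sum of predictors to a probability between 0 and 1. A minimal sketch; the weights, bias and feature values in the usage lines are hypothetical and are not coefficients from the study.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function, written to stay stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_probability(features, weights, bias):
    """Probability of the positive class for one record."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)

# Hypothetical one-feature model: the probability is 0.5 where the weighted sum is zero.
print(predict_probability([3.0], weights=[1.0], bias=-3.0))
print(round(predict_probability([4.0], weights=[1.0], bias=-3.0), 3))
```

A record is then assigned to the positive class when the probability passes a chosen cut-off, commonly 0.5.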
Accuracy Result in Predicting IT Specific Profession

Algorithm          Accuracy Result (%)   Error Estimation Rate (%)
CHAID              70.1                  29.9
QUEST              40                    60
CRT                70.2                  29.8
Exhaustive CHAID   70.1                  29.9
ID3                67                    33
J48                70                    30
• CRT: Classification and Regression Trees.
• CRT splits the data into segments that are as homogeneous as possible with respect to the dependent variable.
• A node in which all cases have the same value for the dependent variable is a homogeneous, "pure" node.
• The CRT growing method: maximize within-node homogeneity.
• The extent to which a node does not represent a homogeneous subset of cases is an indication of impurity.
• A terminal node in which all cases have the same value for the dependent variable is a homogeneous node that requires no further splitting because it is “pure.”
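Node impurity can be made concrete with the Gini index, one common impurity measure for classification trees. The slides do not state which measure the study used, so Gini is an assumption here, and the labels are made up.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 0.0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    if n == 0:
        raise ValueError("a node needs at least one case")
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_impurity(left, right):
    """Impurity of a two-way split, weighting each child by its share of cases."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n

print(gini_impurity(["software"] * 4))               # pure node
print(gini_impurity(["software", "network"]))        # evenly mixed node
print(weighted_impurity(["software"] * 2, ["network"] * 2))
```

Growing the tree means choosing, at each node, the split with the lowest weighted impurity, and stopping when a node is pure or another stopping rule applies.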
Classification Table of Logistic Regression in Testing Data (N=170)

Observed Value (Target)   Predicted: Not Related   Predicted: Related   Percentage Correct
Related                   22                       48                   68.5
Not Related               72                       28                   72
Average Percentage                                                      70.5

Results of Testing the Accuracy of Logistic Regression in Predicting Employability
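The percentages in the classification table can be recomputed from its counts. The sketch assumes the layout in which rows are observed classes and the two count columns are "Predicted: Not Related" and "Predicted: Related", in that order; that reading is inferred from the totals (22 + 48 + 72 + 28 = 170), not stated in the slides.

```python
# Counts from the classification table: observed class -> predicted class -> cases.
matrix = {
    "Related": {"Not Related": 22, "Related": 48},
    "Not Related": {"Not Related": 72, "Related": 28},
}

def percent_correct(matrix):
    """Per-class and overall percentage of correctly classified cases."""
    per_class = {}
    correct = total = 0
    for actual, row in matrix.items():
        row_total = sum(row.values())
        per_class[actual] = 100.0 * row[actual] / row_total
        correct += row[actual]
        total += row_total
    return per_class, 100.0 * correct / total

per_class, overall = percent_correct(matrix)
for name, pct in per_class.items():
    print(f"{name}: {pct:.2f}%")
print(f"Overall: {overall:.2f}%")
```

Under that reading the computed values are about 68.57%, 72.00% and 70.59%, which agree with the 68.5, 72 and 70.5 shown in the table if those were truncated rather than rounded.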
Classification Table of CRT in Testing Data (N=70)

IT Classification             IT Specific Career   Correct Classification   Error Rate
1 (IT Software)               34                   23 (67.64)               11 (32.35)
2 (IT Network/Sys/DB Admin)   25                   16 (64.00)               9 (36.00)
3 (Other IT-related field)    11                   5 (45.45)                6 (54.54)

Results of Testing the Accuracy of CRT in Predicting Specific IT Field/Job to be Employed
DATA SCIENCE AND MACHINE LEARNING TOOLS
FOR PEOPLE WHO DON’T KNOW PROGRAMMING
• RapidMiner - https://rapidminer.com/
• BigML - https://bigml.com/
• Google Cloud AutoML - https://cloud.google.com/automl/
• Paxata - https://www.paxata.com/
• Trifacta - https://www.trifacta.com/
• MLBase - http://mlbase.org/
• Auto-WEKA - http://www.cs.ubc.ca/labs/beta/Projects/autoweka/
• Driverless AI - https://www.h2o.ai/driverless-ai/
• Azure Machine Learning Studio - https://studio.azureml.net/
• MLJar - https://mljar.com/
• Amazon Lex - https://aws.amazon.com/lex/
• IBM Watson Studio - https://www.ibm.com/cloud/watson-studio
• Automatic Statistician - https://www.automaticstatistician.com/index/
• KNIME - https://www.knime.com/
• FeatureLab - http://www.featurelab.co/
• MarketSwitch - http://www.experian.com/decision-analytics/marketswitch-optimization.html
• Logical Glue - http://www.logicalglue.com/
• Pure Predictive - http://www.purepredictive.com/
DO YOU THINK DATA
SCIENCE CAN DEVELOP
YOUR RESEARCH SKILLS?
AND HELP YOU DEVELOP
AMAZING RESEARCH?

More Related Content

DOCX
Himansu sahoo resume-ds
PPTX
The UVA School of Data Science
PPTX
The Roots: Linked data and the foundations of successful Agriculture Data
PPTX
Summary of 3DPAS
PDF
Hands-on Introduction to Machine Learning
PPTX
Machines are people too
PDF
Resume
PDF
Ramil Mauleon: Galaxy: bioinformatics for rice scientists
Himansu sahoo resume-ds
The UVA School of Data Science
The Roots: Linked data and the foundations of successful Agriculture Data
Summary of 3DPAS
Hands-on Introduction to Machine Learning
Machines are people too
Resume
Ramil Mauleon: Galaxy: bioinformatics for rice scientists

What's hot (20)

PPTX
Learning Analytics: Realizing the Big Data Promise in the CSU
PDF
OpenML data@Sheffield
PPTX
DataONE Education Module 09: Analysis and Workflows
PPTX
Databases, Web Services and Tools For Systems Immunology
PDF
Novero a resume-2018
PPT
Machine Learning for automated diagnosis of distributed ...AE
 
PPT
Data at the NIH: Some Early Thoughts
PPT
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
PPTX
Lessons from Data Science Program at Indiana University: Curriculum, Students...
PDF
Connecting and synchronizing scientific knowledge
PPTX
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
PDF
Reproducible research: First steps.
PDF
An Introduction to Machine Learning and Genomics
PDF
Amrapali Zaveri Defense
PPTX
Introduction to Big Data and its Potential for Dementia Research
PDF
NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...
PPTX
RDAP 15: The Role of Assessment in Research Data Services
 
PDF
Crowdsourcing Linked Data Quality Assessment
PPT
Web analytics webinar
PPT
eScience: A Transformed Scientific Method
Learning Analytics: Realizing the Big Data Promise in the CSU
OpenML data@Sheffield
DataONE Education Module 09: Analysis and Workflows
Databases, Web Services and Tools For Systems Immunology
Novero a resume-2018
Machine Learning for automated diagnosis of distributed ...AE
 
Data at the NIH: Some Early Thoughts
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
Lessons from Data Science Program at Indiana University: Curriculum, Students...
Connecting and synchronizing scientific knowledge
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
Reproducible research: First steps.
An Introduction to Machine Learning and Genomics
Amrapali Zaveri Defense
Introduction to Big Data and its Potential for Dementia Research
NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...
RDAP 15: The Role of Assessment in Research Data Services
 
Crowdsourcing Linked Data Quality Assessment
Web analytics webinar
eScience: A Transformed Scientific Method
Ad

Similar to Lec 1 integrating data science and data analytics in various research thrust (20)

PPTX
Pemanfaatan Big Data Dalam Riset 2023.pptx
PPTX
Data Science Mastery Course in Pitampura
PDF
Data Science Course in Pune
PDF
Data Science Training and Placement
PPTX
Data Science and Analysis.pptx
PPTX
Best Selenium certification course
PDF
Data science training in hyd ppt converted (1)
PDF
Data science training in hyd pdf converted (1)
PDF
Data science training in hydpdf converted (1)
PPTX
Data Science.pptx NEW COURICUUMN IN DATA
PPTX
Which institute is best for data science?
PPTX
Best Selenium certification course
PPTX
Data science training in hyd ppt (1)
PPTX
Data science training institute in hyderabad
PPTX
Data science training in Hyderabad
PPTX
Data science training Hyderabad
PPTX
Data science online training in hyderabad
PPTX
Data science training in hyd ppt (1)
PPTX
data science training and placement
PPTX
online data science training
Pemanfaatan Big Data Dalam Riset 2023.pptx
Data Science Mastery Course in Pitampura
Data Science Course in Pune
Data Science Training and Placement
Data Science and Analysis.pptx
Best Selenium certification course
Data science training in hyd ppt converted (1)
Data science training in hyd pdf converted (1)
Data science training in hydpdf converted (1)
Data Science.pptx NEW COURICUUMN IN DATA
Which institute is best for data science?
Best Selenium certification course
Data science training in hyd ppt (1)
Data science training institute in hyderabad
Data science training in Hyderabad
Data science training Hyderabad
Data science online training in hyderabad
Data science training in hyd ppt (1)
data science training and placement
online data science training
Ad

Recently uploaded (20)

PPTX
Introduction to Knowledge Engineering Part 1
PPTX
climate analysis of Dhaka ,Banglades.pptx
PPTX
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
PPTX
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
PPTX
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PDF
annual-report-2024-2025 original latest.
PPT
Miokarditis (Inflamasi pada Otot Jantung)
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
PPTX
Qualitative Qantitative and Mixed Methods.pptx
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PPTX
Business Acumen Training GuidePresentation.pptx
PDF
Foundation of Data Science unit number two notes
PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
PDF
Fluorescence-microscope_Botany_detailed content
PPT
Quality review (1)_presentation of this 21
PDF
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
Introduction to Knowledge Engineering Part 1
climate analysis of Dhaka ,Banglades.pptx
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
Introduction-to-Cloud-ComputingFinal.pptx
annual-report-2024-2025 original latest.
Miokarditis (Inflamasi pada Otot Jantung)
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Acceptance and paychological effects of mandatory extra coach I classes.pptx
Qualitative Qantitative and Mixed Methods.pptx
Business Ppt On Nestle.pptx huunnnhhgfvu
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
Business Acumen Training GuidePresentation.pptx
Foundation of Data Science unit number two notes
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
Fluorescence-microscope_Botany_detailed content
Quality review (1)_presentation of this 21
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf

Lec 1 integrating data science and data analytics in various research thrust

  • 1. INTEGRATING DATA SCIENCE AND DATA ANALYTICS IN VARIOUS RESEARCH TRUST OF THE UNIVERSITY Menchita F. Dumlao, Ph.D.
  • 2. DATA SCIENCE IS FOR BIG DATA
  • 10. MACHINE LEARNING • computer programs that automatically improve with experience." • interdisciplinary in nature • employs techniques from the fields of computer science, statistics, and artificial intelligence, among others. • algorithms which facilitate automatic improvement from • machine learning is a central aspect of data science. • pattern recognition Machine learning has a complex relationship with data mining.
  • 14. DATA SCIENCE : DAY T0 DAY
  • 15. WHAT DOES FACEBOOK DO TO YOUR DATA? • learning what consumers prefer • emotional contagion study • Cookies on your browser predicts who you are • Social plugins ("like", subscribe" or "recommend" buttons.) • information that Facebook sells to advertisers
  • 16. we’ve agreed to a huge amount of data being turned over and signing off on the social network’s seemingly limitless ability to do with it whatever it wants
  • 17. FACEBOOK DATA SCIENCE crawled or scraped data will be valuable and constructive for commercial, scientific, and many other fields of prediction and analysis
  • 18. FACEBOOK’S DATA PRIVACY POLICY: • …in addition to helping people see and find things that you do and share, we may use the information we receive about you… for internal operations, including troubleshooting, data analysis, testing, research and service improvement.
  • 19. OCTOPARSE Octoparse is a powerful web scraper that can scrape both static and dynamic websites with AJAX, JavaScript, cookies and etc
  • 21. VISUAL SCRAPER • Visual Scraper is another great free web scraper with simple point-and- click interface • collect data from the web • export the extracted data as CSV, XML, JSON or SQL files. • scrape data from up to 50,000 web pages for only one user.
  • 23. FACEBOOK DATA SCIENCE USING R • R is a data mining software application used to analyze big data. • Data science in FB using R.pdf
  • 24. • Rfacebook Package provides an interface to the Facebook API. For mining Facebook using R, the Rfacebook package provides functions that allow R to access Facebook’s API to get information about posts, comments, likes, group that mention specific keywords & much more.
  • 28. But: it is not much different from what we, especially statisticians, have been doing for many years
  • 29. Much more data is digitally available than was before Inexpensive computing + Cloud + Easy-to- use programming frameworks = Much easier to analyze it Often: large-scale data + simple algorithms > small data + complex algorithms Changes how you do analysis dramatically
  • 30. •Causation --> Correlation Goal of analysis often to figure out what caused what. Causation very hard to figure out  What causes breast cancer and other diseases Data Science correlates what causes things to happen:  When will earthquake come  Why students fail and pass board exam  job after graduation and why Using data understanding and computer science algorithms
  • 31. Datafication": •Process of converting abstract things into concrete data e.g., what you like represented as a stream of your likes; •your "sitting posture" captured using 100's of sensors placed in a car seat
  • 32. • Google Flu Trends • Early warning of flu outbreaks by analyzing search queries • Up to 1 or 2 weeks ahead of CDC • Analyzed 50M search queries to see which of them fit the physician visits for flu • 45 search terms used to create a single model
  • 33. DATA SCIENCE PROJECTS: ALGORITHMS, SIMULATION AND APPLICATIONS
  • 34. DATA SCIENCE PROJECTS •Determining Rice Bug Epidemic Using Decision Trees •Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System •Predicting IT Employability Using Data Mining Techniques
  • 35. DETERMINING RICE BUG EPIDEMIC USING DECISION TREES • Roland Calderon, Menchita Dumlao et. al (2016) • data mining techniques in agriculture for predicting future trends such as bug epidemic. • Insect Epidemiology Data Mining (IEDM). • IEDM - Discrete Mathematics and Theoretical Computer Science (DIMACS) that aims to provide an opportunity to develop and test problem instances and other methods of testing and comparing performance of algorithms Data Science Projects:
  • 36. DETERMINING RICE BUG EPIDEMIC USING DECISION TREES • uses decision tree . • classification and prediction • represents rules • CRISP-DM methodology Data Science Projects:
  • 37. • Rice Field Insect Light Trap (RFILT) mass traps both the sexes of insect pests • insect distribution, abundance, flight patterns, timing of the application of pesticide DETERMINING RICE BUG EPIDEMIC USING DECISION TREES Data Science Projects:
  • 38. • forecasting precision of a predictive model: confusion matrix Data Science Projects: DETERMINING RICE BUG EPIDEMIC USING DECISION TREES
  • 39. Data Science Projects: DETERMINING RICE BUG EPIDEMIC USING DECISION TREES
  • 40. Data Science Projects: DETERMINING RICE BUG EPIDEMIC USING DECISION TREES
  • 41. Data Science Projects: DETERMINING RICE BUG EPIDEMIC USING DECISION TREES
  • 42. • Lunar Cycle level is the best predictor of epidemic status • followed by Vegetative level • In Vegetative stage level, 100% resulted in outbreak status Data Science Projects: DETERMINING RICE BUG EPIDEMIC USING DECISION TREES
  • 43. • For the Ripening stage, the next best predictor is temperature. • Over 82% bugs occurred in the outbreak status if the temperature is lesser or equal to 32 to 38 temperatures • 97.3% if the temperature greater than to 32 temperatures. • For Reproduction and Resting stage, 52.7% bugs occurred in the infested status and this is also considered a terminal node. Data Science Projects: DETERMINING RICE BUG EPIDEMIC USING DECISION TREES
  • 44. • Evale, Digna, Menchita F. Dumlao, et.al (2016) • Comparative analysis among different data mining algorithm for attribute selection and classification • a two-phase study which aimed to predict the students’ performance in Java Programming and be able to generate recommendations Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 45. • Knowledge Discovery in Database (KDD) • Logistic Regression and Correlation-based Feature Selection was used for finding significant predictors • Classifiers such as CHAID, Exhaustive CHAID, CRT, QUEST, J48, BayesNet, NaĂŻveBayes and JRip were implemented Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 46. • J48, has the highest percentage of prediction. • For the second phase evolutionary prototyping implemented • Ruby on Rails : a web-based examination module that will determine the students’ index of learning style and to assess their prior knowledge in Java Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 47. • A course-content recommendation presenting the learners’ strengths and weaknesses in the subject with suggested method of learning style will be automatically generated by the system. Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 48. • KDD: selection, pre-processing, transformation, mining and interpretation. • Selection- possible attributes is collected for data set • pre-processing - filtering and removing of irrelevant data. • Transformation- determining the most suited data mining technique to provide the best prediction algorithm. • Mining -discovering the pattern captured through classification rules, regression models or decision tree. Evaluation or interpretation is the process of visualization extracted from models. Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 49. • Waikato Environment for Knowledge Analysis (WEKA) data mining tool and IBM Statistical Package for the Social Science (SPSS). • There were 8 attributes namely gender, age, course, section, schedule and 3 academic performance for programming languages. Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 50. • Attribute selection was done using Standard Regression Analysis, Forward and Backward Conditional Regression, Likelihood Ratio, and WALD • WEKA was also used to conduct pre-processing thru filtering by AttributeSelection Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 51. • Summary of Attribute Selection Result Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 52. • 2 significant attributes out of the eight original attributes. • Critical p value of .05 (a significant predictor should have a p value below it). • Binary Logistic Regression (SPSS) identified section and course as highly insignificant, with p values of .747 and .221 respectively. Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 53. • Pre-processing using attribute selection (SPSS and WEKA). • Course and section were automatically removed (highly insignificant). Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
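The p-value screening described on this slide can be sketched in a few lines of Python. This is only an illustration of the rule "keep a predictor when its p value is below the critical value", not the SPSS procedure itself. The function name `split_by_significance` is hypothetical, and only the two p values reported on the slide are used; the p values of the other attributes are not given in the slides and are left out.

```python
# Critical p value stated on the slide.
ALPHA = 0.05

# Only these two p values are reported in the slides.
p_values = {"section": 0.747, "course": 0.221}


def split_by_significance(p_values, alpha=ALPHA):
    """Return (kept, dropped) attribute names, each sorted alphabetically.

    An attribute is kept when its p value is strictly below alpha.
    """
    kept = sorted(name for name, p in p_values.items() if p < alpha)
    dropped = sorted(name for name, p in p_values.items() if p >= alpha)
    return kept, dropped


kept, dropped = split_by_significance(p_values)
print("kept:", kept)
print("dropped:", dropped)
```

With the two reported values, both section and course land in the dropped list, which agrees with their removal during pre-processing.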
  • 54. • CfsSubsetEvaluation: used to further verify the significance of the attribute gender. • BestFirst method: gender was found significant, with a merit of best subset of 0.239 (scale of 0 to 1, incorrectly classified instances). • 76.1% correctly classified instances. Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 55. • GreedyStepWise search method (through cross-validation). • Course and section were not found in any of the ten folds, while gender appeared in 7 out of 10 folds (70%). • Significant predictors: age, gender, schedule, grade in Programming 1, grade in Programming 2, and grade in Programming 3. Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 56. Summary of Accuracy of Different Algorithms tested Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 57. • J48 is the best algorithm. • J48 has the highest accuracy in making predictions. • It also has the highest Cohen’s Kappa value, which indicates that the prediction is strongly reliable (64% to 81% reliability). Prediction Model for Students’ Performance in Java Programming with Course-content Recommendation System Data Science Projects:
  • 58. • Piad, Keno, Menchita F. Dumlao, et al. (2016) • Knowledge discovery in databases (KDD) • CRISP-DM (Cross-Industry Standard Process for Data Mining) • Naive Bayes • Decision Tree • Ensemble Data Science Projects: Predicting IT Employability Using Data Mining Techniques
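For readers unfamiliar with Cohen's Kappa, the following is a minimal sketch of how the statistic is computed from a confusion matrix: observed agreement corrected for the agreement expected by chance. The 2x2 matrix below is made up for illustration; the study's actual confusion matrices are not in the slides.

```python
def cohens_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows = actual, cols = predicted)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of cases on the diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: sum over classes of (row total * column total) / total^2.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / (total * total)
    return (observed - expected) / (1 - expected)


# Hypothetical confusion matrix, for illustration only.
example = [[40, 10],
           [5, 45]]
kappa = cohens_kappa(example)
print(round(kappa, 3))
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance, which is why the statistic is reported alongside raw accuracy.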
  • 59. • Pre-processing • Data sets: training and testing data sets • Training data set: used to generate the model • Testing data set: used to determine the acceptability of the model. Data Science Projects: Predicting IT Employability Using Data Mining Techniques
  • 60. • Apriori algorithm: determines associated attributes that frequently occur together in the data sets • Decision tree and naive Bayes algorithms: used to design the predictive model • Predictive model = equation or rule sets for prediction Data Science Projects: Predicting IT Employability Using Data Mining Techniques
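The training/testing split can be sketched as follows, assuming a simple random hold-out. The slides do not state how the split was made or in what proportion, so the 25% test fraction, the seed, and the helper name `train_test_split` are all assumptions for illustration, and the records are placeholders.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and hold out a fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder records standing in for graduate profiles.
records = list(range(100))
train, test = train_test_split(records)
print(len(train), len(test))
```

The model is fitted on `train` only; `test` is kept aside so that its accuracy estimate reflects cases the model has not seen.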
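As a rough illustration of what the Apriori algorithm does, the sketch below finds every attribute combination whose support (share of records containing it) meets a minimum threshold. It is a compact teaching version, not the WEKA implementation, and the records and attribute values are invented.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Start with all single-item candidates.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        level = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
    return frequent


# Invented records, for illustration only.
records = [
    {"gender=F", "cgpa=high", "employed=yes"},
    {"gender=M", "cgpa=high", "employed=yes"},
    {"gender=F", "cgpa=low", "employed=no"},
    {"gender=M", "cgpa=high", "employed=yes"},
]
result = apriori(records, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what makes Apriori practical: a combination is only counted if all of its smaller subsets were already frequent.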
  • 61. Data Science Projects: Predicting IT Employability Using Data Mining Techniques [Diagram labels: graduate tracer; student’s biographic profile; cumulative grade point average (CGPA); 685 instances (tuples), SY 2011-2015; training and testing sets of data; WEKA and SPSS; learning; rule set or equation; instances of the testing sets.]
  • 62. Accuracy Result in Predicting IT Graduate Employability (algorithm: accuracy result %, error estimation rate %) • Naive Bayes: 75.33, 24.47 • J48: 74.95, 25.05 • SimpleCart: 73.01, 26.99 • Logistic regression: 78.4, 22.60 • Chaid: 76.3, 23.70 Data Science Projects: Predicting IT Employability Using Data Mining Techniques
  • 63. • Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function. Data Science Projects: Predicting IT Employability Using Data Mining Techniques
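The logistic function mentioned on this slide maps any real-valued score to a probability between 0 and 1. A minimal sketch follows; the feature values, weights, and bias are made up and are not the coefficients of the study's fitted model, and `predict_probability` is a hypothetical helper name.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(features, weights, bias):
    """Probability of the positive class for one case under a linear score."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)


# Made-up inputs: the linear score is 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.
p = predict_probability([1.0, 2.0], [0.5, -0.25], 0.0)
print(p)
```

A case is typically assigned to the positive class when this probability exceeds a cutoff such as 0.5.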
  • 64. Accuracy Result in Predicting IT Specific Profession (algorithm: accuracy result %, error estimation rate %) • Chaid: 70.1, 29.9 • Quest: 40, 60 • CRT: 70.2, 29.8 • Exhaustive Chaid: 70.1, 29.9 • ID3: 67, 33 • J48: 70, 30 Data Science Projects: Predicting IT Employability Using Data Mining Techniques
  • 65. • CRT: Classification and Regression Trees. • CRT splits the data into segments that are as homogeneous as possible with respect to the dependent variable. • A terminal node in which all cases have the same value for the dependent variable is a homogeneous, "pure" node. Data Science Projects: Predicting IT Employability Using Data Mining Techniques
  • 66. • The CRT growing method attempts to maximize within-node homogeneity. • The extent to which a node does not represent a homogeneous subset of cases is an indication of impurity. • A terminal node in which all cases have the same value for the dependent variable is a homogeneous node that requires no further splitting because it is “pure.” Data Science Projects: Predicting IT Employability Using Data Mining Techniques
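Node impurity can be made concrete with the Gini index, one common impurity measure for classification trees. The slides do not say which measure was used in the study, so Gini is an assumption here, and the labels are invented.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 0.0 when all cases share one label ("pure")."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted impurity of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n


# Invented labels: a mixed parent node and a split that separates it perfectly.
parent = ["related", "not related", "related", "not related"]
print(gini_impurity(parent))
print(split_impurity(["related", "related"], ["not related", "not related"]))
```

A tree grower compares candidate splits by how far they reduce the weighted impurity below that of the parent node, and stops splitting a node once it is pure.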
  • 67. Results of Testing the Accuracy of Logistic Regression in Predicting Employability. Classification Table of Logistic Regression in Testing Data (N=170) (observed value: predicted Not Related, predicted Related, percentage correct) • Observed Related: 22, 48, 68.5 • Observed Not Related: 72, 28, 72 • Average percentage: 70.5 Data Science Projects: Predicting IT Employability Using Data Mining Techniques
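The percentages in this classification table can be re-derived from the counts. The sketch below assumes the flattened table reads as rows = observed class and columns = predicted class; that reading is an assumption, though the row totals do add up to the stated N of 170.

```python
# Counts from the slide, read as table[observed][predicted].
table = {
    "Related": {"Not Related": 22, "Related": 48},
    "Not Related": {"Not Related": 72, "Related": 28},
}


def percent_correct(table):
    """Return (per-class percent correct, overall percent correct, total cases)."""
    per_class = {}
    correct = 0
    total = 0
    for observed, predicted in table.items():
        row_total = sum(predicted.values())
        per_class[observed] = 100.0 * predicted[observed] / row_total
        correct += predicted[observed]
        total += row_total
    return per_class, 100.0 * correct / total, total


per_class, overall, total = percent_correct(table)
for name, pct in per_class.items():
    print(name, round(pct, 2))
print("overall", round(overall, 2), "of", total, "cases")
```

Under this reading the per-class values come out close to the 68.5 and 72 on the slide, and the overall share of correct predictions is close to the reported 70.5.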
  • 68. Results of Testing the Accuracy of CRT in Predicting Specific IT Field/Job to be Employed. Classification Table of CRT in Testing Data (N=70) (IT classification: cases in that IT-specific career, correct classification count (%), error count (%)) • 1 (IT Software): 34, 23 (67.64), 11 (32.35) • 2 (IT Network/Sys/DB Admin): 25, 16 (64.00), 9 (36.00) • 3 (other IT-related field): 11, 5 (45.45), 6 (54.54) Data Science Projects: Predicting IT Employability Using Data Mining Techniques
  • 69. DATA SCIENCE AND MACHINE LEARNING TOOLS FOR PEOPLE WHO DON’T KNOW PROGRAMMING • RapidMiner - https://rapidminer.com/ • BigML - https://bigml.com/ • Google Cloud AutoML - https://cloud.google.com/automl/ • Paxata - https://www.paxata.com/ • Trifacta - https://www.trifacta.com/ • MLBase - http://mlbase.org/ • Auto-WEKA - http://www.cs.ubc.ca/labs/beta/Projects/autoweka/ • Driverless AI - https://www.h2o.ai/driverless-ai/
  • 70. DATA SCIENCE AND MACHINE LEARNING TOOLS FOR PEOPLE WHO DON’T KNOW PROGRAMMING • Microsoft Azure ML Studio - https://studio.azureml.net/ • MLJar - https://mljar.com/ • Amazon Lex - https://aws.amazon.com/lex/ • IBM Watson Studio - https://www.ibm.com/cloud/watson-studio • Automatic Statistician - https://www.automaticstatistician.com/index/ • KNIME - https://www.knime.com/ • FeatureLab - http://www.featurelab.co/ • MarketSwitch - http://www.experian.com/decision-analytics/marketswitch-optimization.html • Logical Glue - http://www.logicalglue.com/ • Pure Predictive - http://www.purepredictive.com/
  • 71. DO YOU THINK DATA SCIENCE CAN DEVELOP YOUR RESEARCH SKILLS? AND HELP YOU DEVELOP AMAZING RESEARCH?