Big Data & DS Analytics
for PAARL
Albert Anthony D. Gavino, MBA
Data Scientist / DS Evangelist
About the speaker: Albert Anthony D. Gavino
Project profile
Program Objectives / Program Goals
Participants will be able to relate Big Data and Data Science
applications to library services.
1. What is Big Data?
Extremely large data sets that may be analyzed to reveal patterns,
trends and associations
The BIG 3 V’s
• Variety: different types of data
(Facebook, Twitter, CCTV feed)
• Velocity: the speed at which data arrives
(batch, streaming every second)
• Volume: the sheer size of the data
(1GB, 1TB, 1PB, 1ZB)
Library Data Resources
What resources does the library have (budget, staff, premises, media, opening
hours etc.) and how is the library performing against traditional parameters, like
lending figures, visitors and social media activity? This library data can also be
combined with environmental information like community education levels,
geographical distances, age and so on.
http://www.axiell.co.uk/gettingthemostfromyourlibrarydata/
DATA Analytics Challenges and Pitfalls
The challenges to creating a robust institutional data analytics program include
culture, talent, cost, and data. We have deliberately mentioned culture first
because it is very easy to jump to data challenges. In fact, most of the literature
surrounding data analytics starts with challenges surrounding the data itself.
However, we are convinced that institutional culture is the most important factor
in determining the success of any given data analytics program, including the
politics and process around questions of talent, cost, and data itself.
Reference: The Journal of Academic Librarianship, Libraries and Institutional Data Libraries:
Challenges and Opportunities
63% of researchers and administrators expressed unhappiness with the use
of metrics in higher education (Abbott et al., 2010)
What about New Tasks like streamlining for the Librarian?
If librarians take on new tasks, it is very important to track the amount of time and level of staff
required when undertaking analytics projects. For example, collecting citation data for a
researcher with a common name often requires manual and painstaking record-by-record
searching in order to disambiguate that individual's research from others that share his/her
name. This type of work requires a librarian with a deep and intimate knowledge of the
bibliometric databases that are being used to harvest the bibliometric data.
Reference: The Journal of Academic Librarianship, Libraries and Institutional Data Libraries:
Challenges and Opportunities
What is the Cost?
• Data analytics should be thought of as a strategic investment,
not a cost-saving technique.
• The real cost is the time spent on cultural change and on
developing and educating a staff with the analytical skills that we
need in our discipline.
• A visionary analytics plan invests in people, in hiring and training,
over data tools and platforms.
Pitfalls of Data Sharing:
Challenges on Institutional Data Analytics
Pitfalls and Possible Solution/s
• Ownership: who owns the data? It could be the registrar, the library, or IT services.
  Possible solution: an assigned office (e.g. the Office of the President or the Compliance Office) can release the official reports.
• Quality: deciding when data is accurate, good, and reliable.
  Possible solution: a Data Governance Unit assures the quality of the data.
• Standards: what kinds of data variables are in use (string, numeric).
  Possible solution: this can be addressed by Data Management through data warehousing.
• Access: who has access to the data?
  Possible solution: user roles can be defined to control who has access.
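The access solution above, user roles that define who can see which data, can be sketched as a simple lookup. This is a minimal illustration only: the role names, dataset names, and the `can_access` helper are hypothetical and not taken from the source.

```python
# Hypothetical roles and datasets; a real institution would define its own.
ROLE_PERMISSIONS = {
    "library_analyst": {"circulation", "gate_counts"},
    "registrar_staff": {"enrollment"},
    "data_governance": {"circulation", "gate_counts", "enrollment"},
}


def can_access(role: str, dataset: str) -> bool:
    """Return True only if the role is known and explicitly granted the dataset."""
    return dataset in ROLE_PERMISSIONS.get(role, set())


print(can_access("library_analyst", "circulation"))  # granted explicitly
print(can_access("library_analyst", "enrollment"))   # not granted
```

Denying by default (unknown roles get an empty permission set) keeps the answer to "who has access" explicit rather than implied.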
Getting Started on Institutional Data
• Creating an inventory of institutional data
• Developing a data dictionary
• Designing an unambiguous process for cleaning up those data
• Creating an open data set that answers the most commonly asked data
questions across campus.
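As a rough sketch of the data dictionary step listed above, the snippet below records what each field means, its type, and which office owns it, then checks a record against those definitions. The field names, the owning office, and the `validate_record` helper are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One data-dictionary entry: what a field means and what type it holds."""
    name: str
    dtype: type
    description: str
    owner: str  # office responsible for the field (hypothetical values below)


# Hypothetical data dictionary for a circulation dataset.
DATA_DICTIONARY = {
    spec.name: spec
    for spec in [
        FieldSpec("patron_id", str, "Anonymized patron identifier", "Library"),
        FieldSpec("item_barcode", str, "Barcode of the borrowed item", "Library"),
        FieldSpec("loan_days", int, "Length of the loan in days", "Library"),
    ]
}


def validate_record(record: dict) -> list[str]:
    """Return the problems found when checking a record against the dictionary."""
    problems = []
    for name, spec in DATA_DICTIONARY.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], spec.dtype):
            problems.append(f"{name}: expected {spec.dtype.__name__}")
    for name in record:
        if name not in DATA_DICTIONARY:
            problems.append(f"undocumented field: {name}")
    return problems


print(validate_record({"patron_id": "p1", "item_barcode": "b1", "loan_days": "14"}))
```

A check like this is one way to make the clean-up process unambiguous: every rejected record comes with a stated reason.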
Opportunities for Libraries on Big Data
• Libraries know metadata
• Libraries know strategy
• Libraries know assessment
• Libraries are neutral
• Libraries know the vendors
• Libraries are part of larger bodies like PAARL
• Libraries have influence over campuses
• Libraries know metrics
• Libraries have user-centered culture
• Libraries know the politics and policy issues with commercial parties
• Libraries collaborate with both academic and academic support units
2. Building a BIG DATA culture
• Openness to and acceptance of technology: Upper Management
• Willingness to invest in the Big Data Platform: which entails cost
• Training Staff and making sure of job security: Skills upgrade
• Make data sharing acceptable: Trust in the data quality and people
• Create Data Quality Assurance Team/s
• Foster collaboration among departments
• Continuous improvement of models
DATA Governance and DATA Management are different roles
Data governance is the designation of decision-rights and policy-making surrounding institutional data,
while data management is the implementation of those decisions and policies. Institutions need both,
and both require investment, but the senior leadership of our institutions needs to design the former.
[Diagram labels: Data Governance Council; Data Management; policies; metrics; Data Quality Dept; Data Warehouse / Data Lake]
Machine Learning
Is a type of artificial intelligence that provides
computers with the ability to learn without being
explicitly programmed.
Market Basket Analysis on Book Recommendations (Association Rule Algorithm)
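A minimal sketch of how association rules could drive book recommendations: it counts which titles are borrowed together and computes support, confidence, and lift for each pair. The loan baskets and the `pair_rules` helper are hypothetical, and only two-item rules are considered; a full algorithm such as Apriori would also handle larger itemsets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical loan "baskets": the set of titles each patron borrowed together.
baskets = [
    {"Statistics 101", "Python Basics"},
    {"Statistics 101", "Python Basics", "Data Mining"},
    {"Statistics 101", "Data Mining"},
    {"Python Basics"},
]


def pair_rules(baskets):
    """Return (antecedent, consequent, support, confidence, lift) for every pair."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        frozenset(pair)
        for basket in baskets
        for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for pair, count in pair_counts.items():
        for antecedent in pair:
            (consequent,) = pair - {antecedent}
            support = count / n                        # share of baskets with both titles
            confidence = count / item_counts[antecedent]  # P(consequent | antecedent)
            lift = confidence / (item_counts[consequent] / n)
            rules.append((antecedent, consequent, support, confidence, lift))
    return rules


# Show the rules with the highest lift first.
for rule in sorted(pair_rules(baskets), key=lambda r: r[4], reverse=True):
    print("%s -> %s: support=%.2f confidence=%.2f lift=%.2f" % rule)
```

A lift above 1 suggests the two titles are borrowed together more often than chance would predict, which is the signal a recommendation would be based on.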
Weather related information and reading a book (use of hash tags and location and weather data)
Pic from Marco Rasos
Social Listening – is the process of monitoring digital conversations to
understand what customers are saying about a brand or service.
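One very simple form of social listening is counting which hashtags appear in posts that mention the library. The sketch below assumes the posts have already been collected as plain strings; the sample posts and the `hashtag_counts` helper are hypothetical.

```python
import re
from collections import Counter

# Hypothetical posts gathered from social media.
posts = [
    "Rainy day, staying in with a good book #reading #library",
    "The #library wifi is down again",
    "Loved the new study rooms at the #Library!",
]


def hashtag_counts(posts):
    """Count hashtags across posts, ignoring letter case."""
    tags = (tag.lower() for post in posts for tag in re.findall(r"#(\w+)", post))
    return Counter(tags)


print(hashtag_counts(posts).most_common())
```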
Online Research Journals and Click through Rates
Click-through Rate (CTR)
Ratio of users who click on a specific link, ad, or button to the
total number of users who view it.
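As a worked example, CTR is simply clicks divided by views. The journal names and counts below are hypothetical, and treating a link with zero views as a CTR of 0.0 is an assumption of this sketch.

```python
def click_through_rate(clicks: int, views: int) -> float:
    """CTR = clicks / views; treated as 0.0 when the link was never shown."""
    if clicks < 0 or views < 0:
        raise ValueError("counts must be non-negative")
    return clicks / views if views else 0.0


# Hypothetical usage statistics for links to online journals: (clicks, views).
journal_links = {
    "Journal A": (25, 1000),
    "Journal B": (30, 400),
    "Journal C": (0, 0),
}

ctr_by_journal = {
    name: click_through_rate(clicks, views)
    for name, (clicks, views) in journal_links.items()
}

# Rank journals from highest to lowest CTR.
ranked = sorted(ctr_by_journal, key=ctr_by_journal.get, reverse=True)
for name in ranked:
    print(f"{name}: {ctr_by_journal[name]:.1%}")
```

Ranking by CTR rather than raw clicks keeps a rarely shown but frequently clicked journal from being overlooked.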
OpenCV (Open Source Computer Vision Library)
Modern Day Data Scientists
Dr. Reina Reyes, Astrophysicist
Andrew Ng of Baidu, Coursera
Amy Smith, Uber Singapore
Data Science Conference 2016
YOU as the next
Doctor Strange
(Entering the world of
Data Science)
Isaac Reyes, Data Scientist Talas Data Scientists
CRISP-DM Methodology (Cross-Industry Standard Process for Data Mining)
The project was led by
five companies: SPSS,
Teradata, Daimler AG,
NCR Corporation and
OHRA, an insurance
company
CRISP-DM Tasks
From regular data to BIG data, from stat to AI
Regular data → BIG data
Statistical modeling → Machine Learning → Deep Learning / A.I.
Traditional → Modern
Trends in Data Science Domains
Data Science Domain: Current Status
• Natural Language Processing (NLP): Entered the market
• Predictive Analytics / Machine Learning: Entered the market
• Visualization / Dashboards: Entered the market
• Image Processing (OpenCV): Exploration
• Internet of Things (IoT): Exploration
• Artificial Intelligence: Exploration
DS/Big Data Applications to the field of Study
Sample Field of Study: Specific Applications
• Agriculture: Climate forecast modeling to help farmers manage plantations (e.g. corn yields)
• Medical field: Image processing for chest X-rays and retina images for diabetic patients
• Linguistics: Natural Language Processing (NLP) for dialects and sentiment analysis applications
• Economics/Finance: Predicting a stock price based on certain indicators (e.g. noise, competitor price)
• Engineering: Internet of Things (IoT) applications to Big Data
Building a Data Science Team
Data Science Team Composition
• Data Scientist. Prog language: R, Python, Spark ML. Sample output: neural nets, random forest.
• Data Engineer / DevOps. Prog language: Hadoop, Spark Core, Spark Streaming. Sample output: RDD, dataframes, SQLContext.
• Statistician. Prog language: SAS, SPSS, R, Matlab. Sample output: linear regression, k-means clustering.
• Viz Expert. Prog language: Tableau, Cognos, D3, JavaScript. Sample output: visualization, GIS maps.
Trends on Programming Languages
[Figure labels: Scala, R, Python, Spark, RapidMiner, EMC, Java]
TOOLS: OPEN SOURCE vs PROPRIETARY SOFTWARE
OPEN SOURCE
• Pros: no cost for software; packages are available faster
• Cons: takes some time to create and integrate with other software
• Tools: Python, R, Apache Spark
PROPRIETARY SOFTWARE
• Pros: easy to deploy
• Cons: expensive software; you have to buy in modules
• Tools: SAS, IBM-SPSS, AWS, Google
Small Data vs Big Data (in comparison)
Small data
• Sampling can be done (e.g. a survey sample)
• No special memory or computing setup needed; can be run on a regular PC/Mac
• Statistical assumptions hold true: normality, homoskedasticity, independence
Big data
• Use all of the data in storage
• Eats up memory and needs distributed computing
• Conventional yardsticks such as p-values become less informative because the
data is so large: an effect that is not significant in a small set can become
statistically significant, so be careful when relying on these assumptions
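The warning about p-values can be shown with a small calculation. The sketch below assumes a one-sample z-test with a known standard deviation and a hypothetical, practically negligible effect of 0.02 standard deviations; only the sample size changes.

```python
import math


def two_sided_p_value(mean_diff: float, sd: float, n: int) -> float:
    """Two-sided p-value of a one-sample z-test for H0: the true difference is 0."""
    z = mean_diff / (sd / math.sqrt(n))
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Same tiny effect (hypothetical), increasingly large samples.
results = {n: two_sided_p_value(0.02, 1.0, n) for n in (100, 10_000, 1_000_000)}
for n, p in results.items():
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n={n:>9,}: p={p:.3g} ({verdict} at the 0.05 level)")
```

The effect never changes, yet it moves from not significant at n = 100 to overwhelmingly significant at n = 1,000,000. With big data, effect size matters more than the p-value alone.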
Simple DS Cheat sheet
• Classifiers: Neural Nets, Random Forest
• Clustering: K-means, Hierarchical Clustering
• Association: Association Rules
• Predicting: Linear Regression, Logistic Regression (binary), Cox Regression (survival)
• Medical: SVM (cancer cells)
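To make one entry of the cheat sheet concrete, here is a minimal k-means sketch in plain Python. The patron data (visits and loans per month) is hypothetical, and the starting centroids are fixed by hand to keep the example repeatable; library implementations normally choose them randomly.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign points to the nearest centroid, then re-average."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical patrons described by (visits per month, loans per month).
patrons = [(1, 2), (2, 1), (1, 1), (9, 10), (10, 9), (10, 10)]
centroids, clusters = kmeans(patrons, centroids=[patrons[0], patrons[3]])
print(centroids)
print(clusters)
```

Here the two groups correspond to occasional and heavy library users; on real data the number of clusters would itself need to be chosen and checked.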
Visualization TOOLS
Color Hues and Functionality
Local Implications: Data Privacy Act of 2012 (Republic Act No. 10173)
Sensitive personal information refers to personal information:
1. About an individual’s race, ethnic origin, marital status, age, color, and religious, philosophical or
political affiliations;
2. About an individual’s health, education, genetic or sexual life of a person, or to any proceeding for
any offense committed or alleged to have been committed by such individual, the disposal of such
proceedings, or the sentence of any court in such proceedings;
3. Issued by government agencies peculiar to an individual which includes, but is not limited to, social
security numbers, previous or current health records, licenses or its denials, suspension or
revocation, and tax returns; and
4. Specifically established by an executive order or an act of Congress to be kept classified.
Solutions to the Data Privacy Act: Policies
Make sure you have the following in place
• Opt-in for customers
• Opt-out for customers
• Update your customer policy accordingly
• Make your policy publicly available, e.g. on your website
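The opt-in and opt-out items above can be sketched as a consent flag that is checked before any record is used for analytics. This is an illustration of the mechanism only, not a statement of what the Act requires; the `Patron` class, the helper functions, and the sample loans are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Patron:
    patron_id: str
    opted_in: bool = False  # no consent is assumed until the patron opts in


def opt_in(patron: Patron) -> None:
    patron.opted_in = True


def opt_out(patron: Patron) -> None:
    patron.opted_in = False


def records_for_analytics(patrons, loans):
    """Keep only the loan records of patrons who have currently opted in."""
    consenting = {p.patron_id for p in patrons if p.opted_in}
    return [loan for loan in loans if loan["patron_id"] in consenting]


# Hypothetical patrons and loan records.
ana, ben = Patron("p1"), Patron("p2")
loans = [
    {"patron_id": "p1", "title": "Data Mining"},
    {"patron_id": "p2", "title": "Python Basics"},
]

opt_in(ana)
print(records_for_analytics([ana, ben], loans))  # only p1's loan
opt_out(ana)
print(records_for_analytics([ana, ben], loans))  # no records remain
```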
References
• www.coursera.org/learn/machine-learning
• www.kaggle.com
• www.crowdanalytix.com
• www.talas.ph
• www.facebook.com/analytics4pinoys
• www.linkedin.com/albertgavino
