Big Data and Machine Learning
An introduction to Key Ideas
Mauritian JEDI
Bruce Bassett
bruce@saao.ac.za
AIMS/SAAO/UCT
Jan 2015
History of the JEDI concept
We developed the format at several SA workshops (2005-2008)
NRF-Royal Society 5 year Bilateral with Portsmouth, Sussex
and Oxford: train new researchers & do excellent
cosmology research
• JEDI 1 – Langebaan 2008
• JEDI 2 – STIAS/Avalon 2008
• We are now past JEDI X…
Aim of the JEDI series: explore to find the most efficient way
of teaching & learning research, building new
collaborations and doing excellent research
“Sciama” Principles
• Creativity has to be nurtured creatively
• Ideas are a non-linear function of interaction – want as much
discussion/interaction as possible
• Learning is most efficient when it is fun, informal and playful.
• Academia is a small-world network…
• Hence personal contacts and networking are crucial for progress
• Being part of the “fratelli fisici” (Coleman) is important. People
need to know and trust you…
“Google” Principles
• Take good people and treat them really well.
• Trust that good things will come out…things that you can’t
predict beforehand.
• Get out of your comfort zone!
“Creativity requires chaos”. Talk to people you would not
normally talk to. Do things that scare you!
• Attitude and atmosphere are crucial: be friendly, have fun,
relax, enjoy yourself, be proactive, interact, work hard.
How does the JEDI work?
• Research is best learned by doing it with people who
do it better or differently than you.
• Work with a “screw-it let’s do it” attitude
• Work on coming up with and evaluating new ideas
• Work on real research projects in teams.
• You choose the projects you are interested in and how
you spend your time.
Success on different timescales
• 1-3 years: are there any ongoing projects between people
who met at the JEDI?
• 10-20 years: Successful if two people can look back and say,
“actually I first worked or became good friends with X at JEDI
and we have since written papers together, they took my
students for post-docs, they wrote a letter of reference for
me, examined my student’s thesis, helped referee my grant,
helped get me promoted etc…”
Brain Teaser
• A man tosses a coin 30 times and it comes up
heads 30 times in a row.
• What is the probability that it comes up heads
on the 31st coin toss?
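The slide leaves the answer open. A minimal sketch of how the answer depends on what you assume about the coin is given here; the three functions and the one-in-a-million prior for a two-headed coin are illustrative assumptions, not part of the original talk.

```python
from fractions import Fraction


def next_head_if_fair() -> Fraction:
    """Textbook answer: a fair coin has no memory, so the past run is irrelevant."""
    return Fraction(1, 2)


def next_head_rule_of_succession(heads: int, tosses: int) -> Fraction:
    """Laplace's rule of succession, assuming a uniform prior on the coin's bias."""
    return Fraction(heads + 1, tosses + 2)


def next_head_fair_vs_two_headed(heads_in_a_row: int,
                                 prior_two_headed: Fraction) -> Fraction:
    """Bayes' rule over two hypotheses: the coin is fair, or it is two-headed."""
    likelihood_fair = Fraction(1, 2) ** heads_in_a_row  # chance of the run if fair
    likelihood_two_headed = Fraction(1)                 # the run is certain if two-headed
    evidence = (prior_two_headed * likelihood_two_headed
                + (1 - prior_two_headed) * likelihood_fair)
    posterior_two_headed = prior_two_headed * likelihood_two_headed / evidence
    # Two-headed coin always gives heads; fair coin gives heads half the time.
    return posterior_two_headed + (1 - posterior_two_headed) * Fraction(1, 2)


if __name__ == "__main__":
    print(next_head_if_fair())                     # 1/2
    print(next_head_rule_of_succession(30, 30))    # 31/32
    # Hypothetical prior: one coin in a million is two-headed.
    print(float(next_head_fair_vs_two_headed(30, Fraction(1, 1_000_000))))
```

Even with a very sceptical prior, 30 heads in a row pushes the last estimate very close to 1: the data, not the assumption of fairness, ends up driving the prediction.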
What is the scientific method?
• What is the first thing we do when we try to
understand something with physics/applied
mathematics?
• We build a toy model of it, a representation,
that we can study.
• We then study this simplified model and make
predictions.
Machine Learning
• In machine learning, we do the same. We
must choose a set of features that we think
are the most important to achieve our goals
• We then train the machine learning algorithm and use it
to make predictions.
www.quora.com
Data Science in 3 nutshells
The Deeper Drivers
Data Science is really driven by the intersection of:
• Moore’s Law – cheaper, faster, smaller…
• Development of powerful, fast new algorithms that
take advantage of the computing power (e.g. Bayesian
methods)
• Turing completeness, which allows near-universal
application of the algorithms…
Moore’s Law applies to lots of things…
250,000x more storage and
about 10x cheaper!
The Lean Startup Model
• What we are trying is very close to running a
startup in a competitive landscape
• In Lean Startup, the Minimum Viable Product
is central… test basic assumptions!
• The same is true in data science – start with
something very basic. You will learn a lot…
then build a better model.
A Very Simple & Brief Intro to
Machine Learning
Typically there are two classes of
problems people want solved…
• Classification – what group does this data fall
into? (e.g. male vs female, big spender vs
frugal shopper etc…)
• Regression – predict the value of this variable.
(e.g. how much money will our store make
next year?)
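The two problem classes can be sketched side by side in plain Python. The numbers below are made up for illustration, and the nearest-class-mean rule and the least-squares line are just simple stand-ins for whatever classifier or regressor you would really use.

```python
def fit_class_means(examples):
    """Classification 'training': average the feature value of each labelled group."""
    totals = {}
    for value, label in examples:
        total, count = totals.get(label, (0.0, 0))
        totals[label] = (total + value, count + 1)
    return {label: total / count for label, (total, count) in totals.items()}


def classify(value, class_means):
    """Classification output is a group: the label whose mean is closest."""
    return min(class_means, key=lambda label: abs(value - class_means[label]))


def fit_line(xs, ys):
    """Regression 'training': ordinary least-squares straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical monthly spend per customer, with labels.
SPEND = [(20, "frugal"), (35, "frugal"), (30, "frugal"),
         (200, "big spender"), (260, "big spender")]
# Hypothetical yearly store revenue (arbitrary units).
YEARS = [2011, 2012, 2013, 2014]
REVENUE = [100, 110, 120, 130]

if __name__ == "__main__":
    means = fit_class_means(SPEND)
    print(classify(150, means))          # a label
    slope, intercept = fit_line(YEARS, REVENUE)
    print(slope * 2015 + intercept)      # a number
```

The difference to notice is the type of the answer: the classifier returns one of a fixed set of labels, while the regressor returns a value on a continuous scale.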
Separate these two classes…
Campbell et al, 2012
There are two basic steps in machine
learning
1. Feature extraction – what information do you pull
from the data to learn from?
(e.g. “you dunt neid atl the leytirs to reqd tjis”)
2. Apply the learning algorithm – feed the features to
the algorithm you have chosen and get the answers.
You can play with either step to get better results (and
there are algorithms that do both in one step, e.g.
deep learning, convnets).
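The garbled sentence in step 1 makes a handy toy problem. In the sketch below the feature choice (first letter, last letter, word length) and the tiny vocabulary are assumptions made for illustration; the "learning algorithm" is nothing more than a lookup table built from the vocabulary.

```python
def extract_features(word):
    """Step 1, feature extraction: keep only first letter, last letter and length."""
    return (word[0], word[-1], len(word))


def train(vocabulary):
    """Step 2, learning: memorise which known word produces each feature tuple."""
    return {extract_features(word): word for word in vocabulary}


def decode(sentence, model):
    """Prediction: map every garbled word to the known word with the same features."""
    return " ".join(model.get(extract_features(word), word)
                    for word in sentence.split())


# Hypothetical training vocabulary of correctly spelled words.
VOCABULARY = ["you", "dont", "need", "all", "the", "letters", "to", "read", "this"]

if __name__ == "__main__":
    model = train(VOCABULARY)
    print(decode("you dunt neid atl the leytirs to reqd tjis", model))
```

Either step can be changed independently: richer features (for example, the set of consonants) would cope with a larger vocabulary, and a smarter learner could tolerate features that do not match exactly.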
There are typically two types of ML
problems…
• Supervised – “here are some examples with the
model answers. Learn from these and apply to
new examples…” (labeled data). Just like school.
Learn from Training set → Apply to Test data set
• Unsupervised – ‘Here is some data. I don’t know
anything, figure everything out yourself.’
(unlabeled data). This is basically clustering → see
Nadeem’s dataset.
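As a rough illustration of the unsupervised case, here is a minimal one-dimensional k-means sketch with two clusters. The data values are made up, and starting the centres at the smallest and largest value is a simplifying assumption rather than a recommended initialisation.

```python
def kmeans_1d(values, iterations=20):
    """Group unlabeled numbers into two clusters (Lloyd's algorithm, k = 2)."""
    centres = [min(values), max(values)]  # assumed, simplistic initialisation
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        labels = [min((0, 1), key=lambda k: abs(v - centres[k])) for v in values]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            new_centres.append(sum(members) / len(members) if members else centres[k])
        if new_centres == centres:  # converged
            break
        centres = new_centres
    return labels, centres


# Hypothetical unlabeled measurements.
DATA = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]

if __name__ == "__main__":
    labels, centres = kmeans_1d(DATA)
    print(labels)
    print(centres)
```

No labels are supplied at any point; the algorithm only discovers that the numbers fall into two groups. Deciding what those groups mean is still up to you.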
Pitfalls and Warnings
https://www.topstocks.com.au/
1. Correlation is not causation…
If you look through enough correlations (and algorithms),
some of them will appear significant, just by chance…
But they have no real value.
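A small simulation makes the point concrete. Everything here is random noise by construction; the threshold 0.444 is the usual two-sided 5% critical value of the Pearson correlation for 20 samples, and the function names and default sizes are illustrative choices.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def count_spurious_correlations(n_features=1000, n_samples=20,
                                threshold=0.444, seed=0):
    """Count pure-noise features that look 'significantly' correlated with a noise target."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    hits = 0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        if abs(pearson(feature, target)) > threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    print(count_spurious_correlations(), "of 1000 noise features pass the 5% test")
```

Roughly 5% of the features should pass the test even though none of them carries any information, which is exactly what a 5% significance level means when you test many candidates.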
2. Representative training data
• If the data you train on is not similar to the
test data, you will usually get very bad results!
Representative Training
The Ugly Duckling lacked representative training data…
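One hedged way to see the effect: train a trivial classifier on heights in centimetres, then hand it test data recorded in inches. The heights, labels and the midpoint-threshold rule are all made up for illustration.

```python
# Hypothetical training data: (height in cm, label).
TRAIN_CM = [(110, "child"), (120, "child"), (130, "child"),
            (165, "adult"), (175, "adult"), (185, "adult")]
# Test data that resembles the training data (same units, same range).
TEST_CM = [(115, "child"), (180, "adult"), (125, "child"), (170, "adult")]
# The same people, but recorded in inches: no longer representative.
TEST_INCHES = [(height / 2.54, label) for height, label in TEST_CM]


def fit_threshold(train):
    """Learn a cut point halfway between the mean child and mean adult height."""
    child = [h for h, label in train if label == "child"]
    adult = [h for h, label in train if label == "adult"]
    return (sum(child) / len(child) + sum(adult) / len(adult)) / 2


def accuracy(data, threshold):
    """Fraction of examples where the threshold rule matches the true label."""
    correct = sum(("adult" if h > threshold else "child") == label
                  for h, label in data)
    return correct / len(data)


if __name__ == "__main__":
    threshold = fit_threshold(TRAIN_CM)
    print(accuracy(TEST_CM, threshold))      # test data like the training data
    print(accuracy(TEST_INCHES, threshold))  # test data unlike the training data
```

The model itself is unchanged between the two calls; only the match between training and test data differs, and the second accuracy drops to chance level because every height in inches falls below the learned threshold.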
3. Overfitting
If your friend says “I know how to get to the
supermarket, follow me” and then goes to the
toilet before getting in the car, you probably
don’t need to follow them into the
bathroom…
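In code, "following your friend into the bathroom" looks like a model that memorises every training point, noise included. The sketch below is an assumed toy setup (a true rule `x > 0.5` with 20% of labels flipped) comparing a one-nearest-neighbour memoriser with a single learned threshold.

```python
import random


def make_data(n, rng, noise=0.2):
    """Points in [0, 1); the true label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        data.append((x, label))
    return data


def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)


def fit_threshold(train):
    """Simple model: the single cut point (from a coarse grid) that best fits the training set."""
    best = max((i / 20 for i in range(21)),
               key=lambda t: accuracy(lambda x: x > t, train))
    return lambda x: x > best


def fit_memoriser(train):
    """Overfitting model: copy the label of the nearest training point, noise and all."""
    return lambda x: min(train, key=lambda point: abs(point[0] - x))[1]


def run_experiment(seed=0):
    rng = random.Random(seed)
    train, test = make_data(200, rng), make_data(2000, rng)
    simple, memo = fit_threshold(train), fit_memoriser(train)
    return {"simple_train": accuracy(simple, train),
            "simple_test": accuracy(simple, test),
            "memo_train": accuracy(memo, train),
            "memo_test": accuracy(memo, test)}


if __name__ == "__main__":
    for name, value in run_experiment().items():
        print(name, round(value, 3))
```

The memoriser scores perfectly on the data it has seen and noticeably worse on new data, while the simpler threshold model does about equally well on both. A large gap between training and test performance is the usual warning sign of overfitting.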
Robust Classification…
Overfitting
Data Science: First Steps
Step 1. Determine sample size, an indicator of data depth.
Step 2. Know the number of numeric and character variables, an indicator
of data breadth.
Step 3. Calculate the percentage of missing data for each numeric variable.
Step 4. Histogram, plot or otherwise map each variable.
Step 5. Start a search for unexpected values of each variable: improbable
values, and undefined values caused by division by zero (e.g. 1/0).
Step 6. Know the nature of numeric variables. I.e., declare the formats of
the numerics as decimal, integer or date.
If your data has some nasty peculiarities you don’t know about, it can
really upset a clever algorithm.
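Several of these first steps can be sketched with the standard library alone. The tiny table, the column names and the plausible range for `age` are made up for illustration; on a real dataset you would read a file instead of the inline string.

```python
import csv
import io
import math

# Hypothetical raw data with some deliberate peculiarities.
RAW = """age,income,region
34,52000,north
,61000,south
29,,north
41,58000,
-5,inf,south
"""


def load_rows(text):
    return list(csv.DictReader(io.StringIO(text)))


def is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def column_kinds(rows):
    """Step 2: a column is numeric if every non-missing entry parses as a number."""
    kinds = {}
    for name in rows[0]:
        present = [row[name] for row in rows if row[name] != ""]
        kinds[name] = "numeric" if all(is_number(v) for v in present) else "character"
    return kinds


def missing_percentage(rows, name):
    """Step 3: percentage of rows where this column is empty."""
    return 100 * sum(row[name] == "" for row in rows) / len(rows)


def unexpected_values(rows, name, low, high):
    """Step 5: values that are not finite or fall outside an assumed plausible range."""
    values = [float(row[name]) for row in rows if row[name] != ""]
    return [v for v in values if not math.isfinite(v) or not low <= v <= high]


if __name__ == "__main__":
    rows = load_rows(RAW)
    print("sample size:", len(rows))                         # Step 1
    print("variable kinds:", column_kinds(rows))             # Step 2
    print("missing age %:", missing_percentage(rows, "age")) # Step 3
    print("odd ages:", unexpected_values(rows, "age", 0, 120))
```

Checks like these are cheap, and they catch problems such as a negative age or an infinite income before those values reach a learning algorithm.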
• Machine learning competition site
(kaggle.com)
• They give a training dataset and a test set for
which we need to predict the answers.
• We can submit up to 5 test submissions per
day until the competition closes.
• The final score is based on an unknown subset of
the test data.
The Titanic Problem
• Start with: https://www.kaggle.com/c/titanic-gettingStarted
• Do the tutorials!
• Read the forums (https://www.kaggle.com/c/titanic-gettingStarted/forums)
• Download the ipython notebook:
https://www.kaggle.com/c/titanic-gettingStarted/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster
• This is a classification problem (0 = died, 1 = survived)
• Good luck!
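In the spirit of starting with something very basic, a first model could simply predict survival from one column. The sketch below uses a few made-up rows (not real passenger records) that mimic three of the competition's columns, `PassengerId`, `Survived` and `Sex`; with the real data you would read the downloaded training and test files instead of the inline strings.

```python
import csv
import io

# Made-up rows for illustration only; NOT real passenger records.
TOY_TRAIN = """PassengerId,Survived,Sex
1,0,male
2,1,female
3,1,female
4,0,male
5,1,male
6,0,female
"""
TOY_TEST = """PassengerId,Sex
7,female
8,male
"""


def survival_rate_by_sex(train_text):
    """Fraction of survivors within each value of the Sex column."""
    counts = {}
    for row in csv.DictReader(io.StringIO(train_text)):
        survived, total = counts.get(row["Sex"], (0, 0))
        counts[row["Sex"]] = (survived + int(row["Survived"]), total + 1)
    return {sex: survived / total for sex, (survived, total) in counts.items()}


def make_submission(test_text, rates):
    """Predict 1 (survived) when the group's training survival rate exceeds 0.5."""
    lines = ["PassengerId,Survived"]
    for row in csv.DictReader(io.StringIO(test_text)):
        prediction = 1 if rates.get(row["Sex"], 0.0) > 0.5 else 0
        lines.append(f"{row['PassengerId']},{prediction}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    rates = survival_rate_by_sex(TOY_TRAIN)
    print(rates)
    print(make_submission(TOY_TEST, rates))
```

A baseline like this gives you a score to beat. Each extra feature or cleverer algorithm should then earn its place by improving on it.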
