Predicting the
“Next Big Thing”
in Science
ADRIAN MLADENIĆ GROBELNIK
ADRIAN.GROBELNIK@GMAIL.COM
BRITISH INTERNATIONAL SCHOOL LJUBLJANA
LJUBLJANA, SLOVENIA
#scichallenge2017
What is this research project about?
 The aim is to make a C++ program for predicting which scientific topics will
become important in the future
 To predict the future of science, I have used Machine Learning algorithms
to learn how science behaved in the past, and to use the resulting model
to predict future trends in science
 To analyse how science evolved in the past, I used the data from the
recently released “Microsoft Academic Graph” which includes 125 million
scientific articles from the year 1800 to the present
Research Hypothesis
 My research hypothesis is that the scientific topics which will
become important in the future already exist in today’s scientific
articles
 …they are just not visible yet,
 …but it is possible to identify them with Machine Learning
 The task is to find early indicators suggesting which scientific topics
in today’s literature will likely become important in the future
Context: How does science evolve?
 The main element of science is an invention
 Inventions always happen at the beginning of a scientific process
 After an invention happens, there is a period of scientific
exploration, to prove the invention is useful
 Some inventions prove themselves, and some do not
 If an invention proves itself, new products and research are developed
using ideas from the invention
 …less useful inventions usually get forgotten
Context: How to detect scientific
inventions and concepts?
 Scientists are typically strict and consistent when naming things
 In the same way, inventions and other scientific concepts get names which
are then used in scientific articles
 In this project I have used the names from the titles of scientific articles
to track how particular scientific topics evolve through time
 We can spot when a scientific topic appears for the first time, we can count
how frequently it appears, and we can spot when it stops being used
 …this is my basis for predicting the “next big thing” in science
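
The tracking idea above can be sketched in C++ as follows. This is only a minimal illustration, not the project's actual code: the `Article` struct, `tokenize`, `countPhrases` and `firstYear` are hypothetical names, the tokenizer splits on whitespace only, and the phrase length limit of 5 words is passed in as an assumption. For a database of this size, in-memory `std::map` containers would likely need to be replaced by something more compact.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type: only the two fields this sketch needs.
struct Article {
    int year;
    std::string title;
};

// For one phrase: year -> number of occurrences in titles of that year.
using YearCounts = std::map<int, int>;

// Split a title into lowercase words on whitespace (no punctuation handling).
std::vector<std::string> tokenize(const std::string& title) {
    std::vector<std::string> words;
    std::istringstream in(title);
    std::string word;
    while (in >> word) {
        for (char& c : word) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
        words.push_back(word);
    }
    return words;
}

// Count every phrase of 1..maxWords consecutive words, per year.
std::map<std::string, YearCounts> countPhrases(const std::vector<Article>& articles,
                                               std::size_t maxWords = 5) {
    assert(maxWords >= 1);
    std::map<std::string, YearCounts> counts;
    for (const Article& article : articles) {
        const std::vector<std::string> words = tokenize(article.title);
        for (std::size_t start = 0; start < words.size(); ++start) {
            std::string phrase;
            for (std::size_t len = 1;
                 len <= maxWords && start + len <= words.size(); ++len) {
                if (len > 1) phrase += ' ';
                phrase += words[start + len - 1];
                ++counts[phrase][article.year];
            }
        }
    }
    return counts;
}

// First year in which a phrase was seen, or -1 if it was never seen.
int firstYear(const YearCounts& yearCounts) {
    return yearCounts.empty() ? -1 : yearCounts.begin()->first;
}
```

Because `YearCounts` is ordered by year, the first and last keys give the years in which a phrase first appears and was last used, and the values give its yearly frequency.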
What data do we have available?
 There are many databases of scientific articles in the world, but only some are
open and available for research.
 The biggest open database of scientific articles is “Microsoft Academic Graph”
which was released for research use in 2016
 The database size is 130 Gigabytes
 It includes references to 125 million scientific articles from the year 1800 to the present
from all areas of science
 Each scientific article in the database is described by: (a) title, (b) authors and their (c)
institutions, (d) journal/conference where it was published, and (e) the year of publication
 Data available from: https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/
The task to be solved
 The core task in this project is to use the data from over 200 years of
science to identify the early signs of a scientific topic
becoming successful
 With Machine Learning algorithms I trained a statistical model to
distinguish scientific topics which became successful from those which didn’t
 I apply the trained model to current data (after 2010) to
predict which topics will be hot and relevant in the near future (in
the early 2020s)
Description of the experiment (1/2)
 From 125 million article titles I extracted 2.5 million candidate topics
 …each topic is described by a phrase of 1 to 5 words
 …the phrase must appear at least 100 times in the database of article titles
 Each topic is represented by a set of features (attributes) describing the
first 10 years after its appearance
 …features include the frequency and trend (slope from linear regression) of
the topic’s appearances within institutions, journals and conferences
 …each topic is described by approx. 55,000 features, represented in a feature
vector
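
The two feature types named above (frequency and trend) can be sketched as follows for a single source, such as one institution or one journal. This is an illustration under assumptions rather than the project's code: `SourceFeatures`, `trendSlope` and `describeSource` are hypothetical names, and the input is assumed to be the topic's yearly counts within that source over the first 10 years, with years numbered 0, 1, 2, and so on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pair of features for one topic within one source.
struct SourceFeatures {
    double frequency;  // total number of appearances in the window
    double slope;      // least-squares trend of the yearly counts
};

// Slope of the least-squares line through the points (i, y[i]), i = 0..n-1.
double trendSlope(const std::vector<double>& y) {
    const std::size_t n = y.size();
    assert(n >= 2);  // a slope needs at least two points
    double sumX = 0.0, sumY = 0.0, sumXY = 0.0, sumXX = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double x = static_cast<double>(i);
        sumX += x;
        sumY += y[i];
        sumXY += x * y[i];
        sumXX += x * x;
    }
    const double count = static_cast<double>(n);
    // The denominator is nonzero because the x values are distinct.
    return (count * sumXY - sumX * sumY) / (count * sumXX - sumX * sumX);
}

// Frequency and trend of a topic within one source.
SourceFeatures describeSource(const std::vector<double>& yearlyCounts) {
    double total = 0.0;
    for (double c : yearlyCounts) total += c;
    return SourceFeatures{total, trendSlope(yearlyCounts)};
}
```

A positive slope means the topic is used more each year within that source, while a slope near zero means its use is flat. Repeating this for every institution, journal and conference is one way a feature vector with tens of thousands of entries could arise.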
Description of the experiment (2/2)
 Each topic is classified either as:
 Positive, if it became popular in the past (its use increased by a factor of 2 after the 10 years
from the topic’s first appearance), or as
 Negative, if the topic didn’t attract much attention
 We split the topics into a training set (70%) and a test set (30%)
 …where the training set is used to train the model and the test set is used to evaluate it
 For machine learning I used the Perceptron algorithm, which is relatively easy to
implement (https://en.wikipedia.org/wiki/Perceptron)
 …I used an improved version of the Perceptron (MaxMargin)
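
The slides do not specify how the MaxMargin variant works, so the following is only a sketch of one common margin-based Perceptron, not the project's implementation. It assumes labels of +1 (positive) and -1 (negative), sparse feature vectors, and an update whenever an example is misclassified or falls inside the margin. The names `SparseVec`, `Example` and `Perceptron`, and the `epochs`, `margin` and `learningRate` parameters, are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sparse feature vector: (feature index, value) pairs.
using SparseVec = std::vector<std::pair<std::size_t, double>>;

struct Example {
    SparseVec x;
    int y;  // +1 for a positive topic, -1 for a negative one
};

struct Perceptron {
    std::vector<double> w;  // one weight per feature
    double bias = 0.0;

    explicit Perceptron(std::size_t dim) : w(dim, 0.0) {}

    // Signed score: weighted sum of the features plus the bias.
    double score(const SparseVec& x) const {
        double s = bias;
        for (const auto& f : x) {
            assert(f.first < w.size());
            s += w[f.first] * f.second;
        }
        return s;
    }

    int predict(const SparseVec& x) const { return score(x) > 0.0 ? +1 : -1; }

    // Margin variant (an assumption): update whenever y * score <= margin,
    // which covers misclassified examples and correct ones that are too
    // close to the decision boundary.
    void train(const std::vector<Example>& data, int epochs,
               double margin, double learningRate) {
        for (int e = 0; e < epochs; ++e) {
            for (const Example& ex : data) {
                assert(ex.y == 1 || ex.y == -1);
                if (ex.y * score(ex.x) <= margin) {
                    for (const auto& f : ex.x) {
                        w[f.first] += learningRate * ex.y * f.second;
                    }
                    bias += learningRate * ex.y;
                }
            }
        }
    }
};
```

After training, the features with the largest positive weights are the ones that most strongly indicate a positive topic, so the learned weights can be inspected directly.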
Key statistical results
 The statistical model, trained with the MaxMargin Perceptron
algorithm, produced the following results on the testing data:
 Precision: 74%
 Recall: 72%
 F1 (the harmonic mean of precision and recall): 73%
 …this means that about 74% of the topics the model predicts to be
successful did become successful, and that the model finds about
72% of the topics which became successful
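
For reference, the three measures can be computed from test-set predictions as in the sketch below. It assumes labels of +1 for successful topics and -1 for unsuccessful ones; `Metrics`, `f1Score` and `evaluate` are hypothetical names. As a check on the reported numbers, the harmonic mean of 0.74 and 0.72 is 2 × 0.74 × 0.72 / (0.74 + 0.72) ≈ 0.73.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Metrics {
    double precision;  // share of predicted positives that are truly positive
    double recall;     // share of true positives that were predicted positive
    double f1;         // harmonic mean of precision and recall
};

// Harmonic mean of precision and recall (0 when both are 0).
double f1Score(double precision, double recall) {
    const double sum = precision + recall;
    return sum > 0.0 ? 2.0 * precision * recall / sum : 0.0;
}

// Labels are assumed to be +1 (successful) or -1 (unsuccessful).
Metrics evaluate(const std::vector<int>& truth, const std::vector<int>& predicted) {
    assert(truth.size() == predicted.size());
    std::size_t tp = 0, fp = 0, fn = 0;
    for (std::size_t i = 0; i < truth.size(); ++i) {
        if (predicted[i] == 1 && truth[i] == 1) ++tp;  // true positive
        else if (predicted[i] == 1) ++fp;              // false positive
        else if (truth[i] == 1) ++fn;                  // false negative
    }
    Metrics m{0.0, 0.0, 0.0};
    if (tp + fp > 0) m.precision = static_cast<double>(tp) / (tp + fp);
    if (tp + fn > 0) m.recall = static_cast<double>(tp) / (tp + fn);
    m.f1 = f1Score(m.precision, m.recall);
    return m;
}
```

None of these measures counts correctly rejected unsuccessful topics, so F1 is not the same as overall accuracy.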
Key descriptive results
 Looking at the resulting statistical model we can see:
 If a scientific topic gets increasingly used by important research
institutions (universities and research institutes)
 …and is getting published by important journals and conferences
 …within 10 years from the invention (when the initial mention is
spotted)
 …then, we can expect the increased use of the topic (by a factor
of two or more) by science and industry in the next 5 years
Examples of best topics and features
 Example Best Topics (as predicted by the model):
 Collisions, efficient, proton proton collisions, higgs boson, system, quark,
particles, hadron, mobile augmented reality, variable quantum,
advanced network, molecular dynamics simulations
 Example Best Features (as identified by the Perceptron training):
 CERN, Journal of Proteomics & Bioinformatics, Industrial Research Limited,
Circulation-cardiovascular Imaging, Molecular BioSystems, Metamaterials,
Atw-international Journal for Nuclear Power
Summary
 In this research project I analyzed 125 million articles from “Microsoft
Academic Graph” from over 200 years of science
 I made a program in C++ to process 130 Gigabytes of data and to
build a machine learning model to predict which scientific topics will
become important in the future
 The resulting model reaches an F1 score of 73% (precision 74%, recall 72%)
when predicting which scientific topics became important in the history of science
 C++ code and detailed results are available from: https://goo.gl/8luSwz
