Predicting the “Next Big Thing” in Science - #scichallenge2017

ADRIAN MLADENIĆ GROBELNIK
ADRIAN.GROBELNIK@GMAIL.COM
BRITISH INTERNATIONAL SCHOOL LJUBLJANA
LJUBLJANA, SLOVENIA
#scichallenge2017

What is this research project about?
 The aim is to make a C++ program for predicting which scientific topics will
become important in the future
 To predict the future of science, I have used Machine Learning algorithms
to learn how science behaved in the past, and to use the resulting model
to predict future trends in science
 To analyse how science evolved in the past, I used the data from the
recently released “Microsoft Academic Graph” which includes 125 million
scientific articles from the year 1800 to the present

Research Hypothesis
 My research hypothesis is that the science topics which will
become important in the future already exist in today’s scientific
articles
 …they are just not visible yet,
 …but it is possible to identify them with Machine Learning
 The task is to find early indicators suggesting which scientific topics
in today’s literature will likely become important in the future

Context: How does science evolve?
 The main element of science is an invention
 Inventions always happen at the beginning of a scientific process
 After an invention happens, there is a period of scientific
exploration, to prove the invention is useful
 Some inventions prove themselves, and some do not
 If an invention proves itself, new products and further research are
built on its ideas
 …less useful inventions usually get forgotten

Context: How to detect scientific
inventions and concepts?
 Scientists are typically strict and consistent when naming things
 In the same way, inventions and other scientific concepts get names which
are then used in scientific articles
 In this project I have used the names from the titles of scientific articles
to track how particular scientific topics evolve through time
 We can spot when a scientific topic appears for the first time, count
how frequently it appears, and spot when it stops being used
 …this is my basis for predicting the “next big thing” in science
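
The tracking idea above can be sketched in C++ as follows. This is a minimal illustration, not the project's actual code: the names `TopicTimeline`, `tokenize` and `addTitle` are hypothetical, the tokenizer is deliberately naive (lowercasing and splitting on whitespace only), and each phrase is counted at most once per title, which is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical per-topic timeline: how many titles mention the phrase in each year.
struct TopicTimeline {
    std::map<int, int> countByYear;  // year -> number of titles containing the phrase

    int firstYear() const {
        assert(!countByYear.empty());
        return countByYear.begin()->first;   // std::map keeps years sorted
    }
    int lastYear() const {
        assert(!countByYear.empty());
        return countByYear.rbegin()->first;
    }
    int total() const {
        int sum = 0;
        for (const auto& entry : countByYear) sum += entry.second;
        return sum;
    }
};

// Lowercase a title and split it on whitespace (punctuation is not handled).
std::vector<std::string> tokenize(const std::string& title) {
    std::string lower = title;
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    std::istringstream in(lower);
    std::vector<std::string> words;
    for (std::string w; in >> w;) words.push_back(w);
    return words;
}

// Record every phrase of 1..maxWords consecutive words found in one title.
void addTitle(std::map<std::string, TopicTimeline>& topics,
              const std::string& title, int year, std::size_t maxWords = 5) {
    const std::vector<std::string> words = tokenize(title);
    std::set<std::string> seen;  // assumption: count a phrase once per title
    for (std::size_t start = 0; start < words.size(); ++start) {
        std::string phrase;
        for (std::size_t len = 1; len <= maxWords && start + len <= words.size(); ++len) {
            if (len > 1) phrase += ' ';
            phrase += words[start + len - 1];
            if (seen.insert(phrase).second) ++topics[phrase].countByYear[year];
        }
    }
}
```

Calling `addTitle` once per article fills the map; afterwards `firstYear()`, `lastYear()` and `total()` give the first appearance, the last appearance and the overall frequency of each phrase.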

What data do we have available?
 There are many databases of scientific articles in the world, but only some are
open and available for research.
 The biggest open database of scientific articles is “Microsoft Academic Graph”
which was released for research use in 2016
 The database size is 130 Gigabytes
 It includes references to 125 million scientific articles from the year 1800 to the present
from all areas of science
 Each scientific article in the database is described by: (a) title, (b) authors and their (c)
institutions, (d) journal/conference where it was published, and (e) the year of publication
 Data available from: https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/

The task to be solved
 The core task in this project is to use the data from over 200 years of
science to extract the early signs that a scientific topic will
become successful
 With Machine Learning algorithms I trained a statistical model to
separate scientific topics which became successful from those which didn’t
 I apply the trained model to the current data (after 2010) to
predict which topics will be hot and relevant in the near future (the
early 2020s)

Description of the experiment (1/2)
 From 125 million article titles I extracted 2.5 million candidate topics
 …each topic is described by a phrase of 1 to 5 words
 …the phrase must appear at least 100 times in the database of article titles
 Each topic is represented by a set of features (attributes) describing the
first 10 years after its appearance
 …features include the frequency and trend (slope from linear regression) of the
topic’s appearances within institutions, journals and conferences
 …each topic is described by approx. 55,000 features, represented in a feature
vector
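
A feature pair of this kind could be computed roughly as below. This is a sketch under assumptions: the names `trendSlope` and `windowFeatures` are hypothetical, and the input is taken to be the topic's yearly counts within one source (an institution, journal or conference) over the 10-year window, with years numbered 0, 1, 2, … from the first appearance.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Ordinary least-squares slope of y against x = 0, 1, ..., n-1.
double trendSlope(const std::vector<double>& y) {
    const std::size_t n = y.size();
    assert(n >= 2);  // a slope needs at least two points
    const double meanX = (n - 1) / 2.0;
    double meanY = 0.0;
    for (double v : y) meanY += v;
    meanY /= static_cast<double>(n);

    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double dx = static_cast<double>(i) - meanX;
        num += dx * (y[i] - meanY);
        den += dx * dx;
    }
    return num / den;  // den > 0 because n >= 2
}

// Hypothetical feature pair for one (topic, source) combination.
struct WindowFeatures {
    double frequency;  // total mentions in the window
    double slope;      // trend of the yearly counts
};

// yearlyCounts[i] = mentions of the topic in year (firstYear + i) within one source.
WindowFeatures windowFeatures(const std::vector<double>& yearlyCounts) {
    double total = 0.0;
    for (double v : yearlyCounts) total += v;
    return WindowFeatures{total, trendSlope(yearlyCounts)};
}
```

Repeating this for every institution, journal and conference would yield one long, mostly empty feature vector per topic, which is why a sparse representation is a natural fit.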

Description of the experiment (2/2)
 Each topic is classified either as:
 Positive, if it became popular in the past (its use increased by a factor of 2 or more after the
first 10 years from the topic’s first appearance), or as
 Negative, if the topic didn’t attract much attention
 We split the topics into a training set (70%) and a test set (30%)
 …where the training set is used to train the model and the test set is used to test it
 For machine learning I used the Perceptron algorithm which is relatively easy to
implement (https://en.wikipedia.org/wiki/Perceptron)
 …I used an improved version of the Perceptron (MaxMargin)
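
The split and the training loop could look roughly like the sketch below. The slides do not spell out the exact MaxMargin variant, so this shows a generic perceptron with a margin (it also updates on examples that are classified correctly but too close to the boundary), which is an assumption. All names (`Example`, `splitTrainTest`, `trainMarginPerceptron`, `predict`) and the parameters `margin`, `learningRate` and `epochs` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Sparse feature vector: (feature index, value) pairs, since most features are zero.
using SparseVector = std::vector<std::pair<std::size_t, double>>;

struct Example {
    SparseVector features;
    int label;  // +1 = positive topic, -1 = negative topic
};

// Shuffle the examples and cut them into a training and a test part.
std::pair<std::vector<Example>, std::vector<Example>>
splitTrainTest(std::vector<Example> data, double trainFraction, unsigned seed) {
    std::mt19937 rng(seed);
    std::shuffle(data.begin(), data.end(), rng);
    const std::size_t cut = static_cast<std::size_t>(data.size() * trainFraction);
    std::vector<Example> train(data.begin(), data.begin() + cut);
    std::vector<Example> test(data.begin() + cut, data.end());
    return {train, test};
}

// Dot product of a dense weight vector with a sparse feature vector.
double dot(const std::vector<double>& w, const SparseVector& x) {
    double sum = 0.0;
    for (const auto& entry : x) sum += w[entry.first] * entry.second;
    return sum;
}

// Perceptron with a margin: update whenever label * score <= margin.
// The returned vector has numFeatures weights followed by one bias term.
std::vector<double> trainMarginPerceptron(const std::vector<Example>& train,
                                          std::size_t numFeatures, double margin,
                                          double learningRate, int epochs) {
    std::vector<double> w(numFeatures + 1, 0.0);
    for (int epoch = 0; epoch < epochs; ++epoch) {
        for (const Example& ex : train) {
            const double score = dot(w, ex.features) + w[numFeatures];
            if (ex.label * score <= margin) {
                for (const auto& entry : ex.features)
                    w[entry.first] += learningRate * ex.label * entry.second;
                w[numFeatures] += learningRate * ex.label;  // bias update
            }
        }
    }
    return w;
}

// Predict +1 or -1 from the sign of the score.
int predict(const std::vector<double>& w, const SparseVector& x) {
    return dot(w, x) + w.back() > 0.0 ? +1 : -1;
}
```

After training, the largest positive weights point to the features (institutions, journals, conferences) that most strongly indicate a successful topic.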

Key statistical results
 The statistical model, trained with the MaxMargin Perceptron
algorithm, produced the following results on the test data:
 Precision: 74%
 Recall: 72%
 F1 (the harmonic mean of precision and recall): 73%
 …this means that about 74% of the topics the model predicts as
successful really became successful, and that the model finds about
72% of all topics which became successful
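
For reference, the three measures can be computed from predicted and true labels as in the sketch below. The names `Metrics`, `f1Score` and `evaluate` are hypothetical, and labels are assumed to be +1 (successful) and -1 (not successful).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Metrics {
    double precision;  // share of predicted positives that are truly positive
    double recall;     // share of true positives that the model found
    double f1;         // harmonic mean of precision and recall
};

// F1 = 2 * P * R / (P + R), defined as 0 when both are 0.
double f1Score(double precision, double recall) {
    const double sum = precision + recall;
    return sum > 0.0 ? 2.0 * precision * recall / sum : 0.0;
}

// Compare predicted labels with true labels (+1 = positive, -1 = negative).
Metrics evaluate(const std::vector<int>& predicted, const std::vector<int>& actual) {
    assert(predicted.size() == actual.size());
    double tp = 0.0, fp = 0.0, fn = 0.0;
    for (std::size_t i = 0; i < predicted.size(); ++i) {
        if (predicted[i] == +1 && actual[i] == +1) tp += 1.0;       // true positive
        else if (predicted[i] == +1 && actual[i] == -1) fp += 1.0;  // false positive
        else if (predicted[i] == -1 && actual[i] == +1) fn += 1.0;  // false negative
    }
    const double precision = (tp + fp) > 0.0 ? tp / (tp + fp) : 0.0;
    const double recall = (tp + fn) > 0.0 ? tp / (tp + fn) : 0.0;
    return Metrics{precision, recall, f1Score(precision, recall)};
}
```

With the values reported above, `f1Score(0.74, 0.72)` gives approximately 0.73, which matches the reported F1.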

Key descriptive results
 Looking at the resulting statistical model we can see:
 If a scientific topic gets increasingly used by important research
institutions (universities and research institutes)
 …and is getting published by important journals and conferences
 …within 10 years from the invention (when the initial mention is
spotted)
 …then, we can expect the increased use of the topic (by a factor
of two or more) by science and industry in the next 5 years

Examples of best topics and features
 Example Best Topics (as predicted by the model):
 Collisions, efficient, proton proton collisions, higgs boson, system, quark,
particles, hadron, mobile augmented reality, variable quantum,
advanced network, molecular dynamics simulations
 Example Best Features (as identified by the Perceptron training):
 CERN, Journal of Proteomics & Bioinformatics, Industrial Research Limited,
Circulation-cardiovascular Imaging, Molecular BioSystems, Metamaterials,
Atw-international Journal for Nuclear Power

Summary
 In this research project I analyzed 125 million articles from “Microsoft
Academic Graph” from over 200 years of science
 I made a program in C++ to process 130 Gigabytes of data and to
build a machine learning model to predict which scientific topics will
become important in the future
 The resulting model reaches an F1 score of 73% (precision 74%, recall 72%)
when identifying the scientific topics which became important in the
history of science
 C++ code and detailed results are available from: https://goo.gl/8luSwz
