OpenML
O P E N , A U T O M A T E D M A C H I N E L E A R N I N G
J O A Q U I N V A N S C H O R E N , T U / E
@open_ml · www.openml.org
OpenML
You can be part of this presentation :)
Follow the code examples:
• On Google Colab: goo.gl/VwbKb4
• On Github: https://git.io/fA3eL
J O A Q U I N V A N S C H O R E N , T U / E · @open_ml · www.openml.org
World-wide telescope
Networked science
(Not so) Automatic Machine Learning
It’s hard to find and learn from prior machine learning data
(Auto)ML: manual work, unnecessary friction
Hard to find and reuse prior results
No standards / hubs for sharing and organizing results
Scattered, ill-described datasets
Manual searching, reformatting, making assumptions
Hard to automate model building end-to-end
Requires automated data organization, clean APIs,…
Myriad algorithms, versions, languages
Write code, set up experiments, store results,…
Reproducibility is hard
Manually tracking every detail is error-prone
Easy to use: Integrated in many ML tools/environments
Easy to contribute: Automated sharing of data, code, results
Organized data: APIs to find & reuse data, models, experiments
Reward structure: Track your impact, build reputation
Self-learning: Learn from many experiments to help people
OpenML
S H A R E A N D R E U S E
M A C H I N E L E A R N I N G D A T A O N L I N E
www.openml.org
OpenML: Components
Flows: Pipelines/code that build ML models
Run locally (or wherever), auto-upload all results
Datasets: Auto-annotated, organized, well-formatted
Find the datasets you need, share your own
Tasks: Auto-generated, machine-readable
Everyone’s results are directly comparable
Runs: All results from running flows on tasks
All details needed for tracking and reproducibility
Evaluations can be queried, compared, reused
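In Python, these four building blocks map directly onto the openml-python API; a compact sketch using the calls shown later in this deck:

import openml as oml
from sklearn import tree

# Datasets: find and download well-formatted, annotated data
datasets = oml.datasets.list_datasets()      # dict: dataset id -> metadata
eeg = oml.datasets.get_dataset(1471)         # 'eeg-eye-state'

# Tasks: machine-readable problem definitions on a dataset
task = oml.tasks.get_task(14951)             # classification task on eeg-eye-state

# Flows and Runs: run a pipeline locally, share all results
flow = oml.flows.sklearn_to_flow(tree.ExtraTreeClassifier())
run = oml.runs.run_flow_on_task(task, flow)
run.publish()                                # results become a shared, comparable Run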
It starts with data
Data (tabular) easily uploaded or referenced (URL)
It starts with data
Data can remain in existing repositories
-> registered via URL, transparent to users
interoperability
For now: only tabular data
-> ARFF or CSV import (auto-annotate features)
-> FrictionlessData support in the works
auto-versioned, analysed, organised online
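A hedged sketch of uploading your own tabular data from Python (assuming a recent openml-python release that provides create_dataset; parameter names may differ across versions):

import pandas as pd
import openml as oml

df = pd.read_csv("my_data.csv")   # hypothetical local file with a 'class' column

# Assumption: create_dataset accepts a pandas DataFrame with attributes='auto'
# and infers feature types; unused metadata fields can be left as None.
my_dataset = oml.datasets.create_dataset(
    name="my-dataset", description="Example upload",
    creator=None, contributor=None, collection_date=None,
    language="English", licence="CC0", citation=None,
    attributes="auto", data=df,
    default_target_attribute="class", ignore_attribute=None)
my_dataset.publish()              # auto-versioned, analysed and organised online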
Search (API)
import	openml	as	oml	
openml_list	=	oml.datasets.list_datasets()
Python, R, Java, C#
Search (API)
import pandas as pd
# index by dataset id so each dataset becomes one row
datalist = pd.DataFrame.from_dict(openml_list, orient='index')
datalist[datalist.NumberOfInstances > 10000
         ].sort_values(['NumberOfInstances'])
Python, R, Java, C#
Search (API)
datalist.query('name	==	"eeg-eye-state"')
Python, R, Java, C#
data id
Get (API)
dataset	=	oml.datasets.get_dataset(1471)	
dataset.description[:500]	
Python, R, Java, C#
Get (API)
X, y, attribute_names = dataset.get_data(
    target=dataset.default_target_attribute,
    return_attribute_names=True)
Python, R, Java, C#
eeg	=	pd.DataFrame(X,	columns=attribute_names)	
eeg['class']	=	y
Get (API)
eeg.plot()	
pd.DataFrame(y).plot()	
Python, R, Java, C#
Fit (API) Python, R, Java, C#
from	sklearn	import	neighbors	
clf	=	neighbors.KNeighborsClassifier()	
clf.fit(X,	y)
Complete code to build a model,
automatically, anywhere
import openml as oml
from sklearn import neighbors

dataset = oml.datasets.get_dataset(1471)
X, y = dataset.get_data(target=dataset.default_target_attribute)
clf = neighbors.KNeighborsClassifier()
clf.fit(X, y)
Tasks contain data, goals, procedures.
Auto-build + evaluate models correctly
All evaluations are directly comparable
Tasks: benchmarking and collaboration
[Task diagram: predict target T, 10-fold cross-validation, optimize accuracy]
Collaborate in real time online
Search
task_list	=	oml.tasks.list_tasks(size=5000)
task id
Search
mytasks = pd.DataFrame.from_dict(task_list, orient='index')
mytasks.query('name == "eeg-eye-state"')
task id
Get
task	=	oml.tasks.get_task(14951)
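A task bundles the dataset, the target and the evaluation procedure; a small sketch of inspecting those parts (attribute names as in openml-python):

dataset = task.get_dataset()             # the underlying eeg-eye-state data
print(task.target_name)                  # the column to predict
print(task.estimation_procedure)         # e.g. 10-fold cross-validation
train_idx, test_idx = task.get_train_test_split_indices(repeat=0, fold=0)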
Auto-run algorithms/workflows on any task
Integrated in many machine learning tools (+ APIs)
Flows
Run experiments locally, share them globally
Integrated in many machine learning tools (+ APIs)
import	openml	as	oml	
from	sklearn	import	tree	
task	=	oml.tasks.get_task(14951)	
clf	=	tree.ExtraTreeClassifier()	
flow	=	oml.flows.sklearn_to_flow(clf)	
run	=	oml.runs.run_flow_on_task(task,	flow)	
myrun	=	run.publish()
Fit and share (complete code)
Uploaded to http://www.openml.org/r/9204488
import	openml	as	oml	
from	sklearn	import	tree	
task	=	oml.tasks.get_task(14951)	
clf	=	tree.ExtraTreeClassifier()	
flow	=	oml.flows.sklearn_to_flow(clf)	
run	=	oml.runs.run_flow_on_task(task,	flow)	
myrun	=	run.publish()
Fit and share pipelines
Uploaded to http://www.openml.org/r/7943199
import openml as oml
from sklearn import pipeline, ensemble, preprocessing

task = oml.tasks.get_task(59)
pipe = pipeline.Pipeline(steps=[
            ('Imputer', preprocessing.Imputer()),
            ('OneHotEncoder', preprocessing.OneHotEncoder()),
            ('Classifier', ensemble.RandomForestClassifier())
          ])
flow = oml.flows.sklearn_to_flow(pipe)
run = oml.runs.run_flow_on_task(task, flow)
myrun = run.publish()
Fit and share deep learning models
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D, Reshape
from keras.layers.core import Activation

model = Sequential()
model.add(Reshape((28, 28, 1), input_shape=(784,)))
model.add(Conv2D(20, (5, 5), padding="same", input_shape=(28, 28, 1),
          activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
model.add(Conv2D(50, (5, 5), padding="same", activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
model.add(Flatten())
model.add(Dense(500))
model.add(Activation('relu'))
model.add(Dense(10))
model.add(Activation('softmax'))
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
Fit and share deep learning models
Uploaded to https://www.openml.org/r/9204337
task	=	tasks.get_task(3573)	#MNIST	
flow	=	oml.flows.keras_to_flow(model)	
run	=	oml.runs.run_flow_on_task(task,	flow)	
myrun	=	run.publish()
Compare to state-of-the-art
reproducible, linked to data, flows, authors
and all other experiments
Experiments auto-uploaded, evaluated online
Runs
Share and reuse results
Experiments auto-uploaded, evaluated online
Download, reuse runs
myruns = oml.runs.list_runs(task=[14951])
# scores per flow come from the evaluations listing (openml-python)
evals = oml.evaluations.list_evaluations(function="predictive_accuracy", task=[14951])
scores = pd.DataFrame([{"score": e.value, "flow": e.flow_name} for e in evals.values()])
sns.violinplot(x="score", y="flow", data=scores)
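Individual runs can be downloaded again with every detail needed for reproducibility; a small sketch using the run uploaded earlier:

run = oml.runs.get_run(9204488)
print(run.flow_id, run.task_id)          # which flow ran on which task
print(run.setup_string)                  # exact hyperparameter settings
print(run.evaluations)                   # server-side evaluation measures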
Open and Automated Machine Learning
Publishing, impact tracking
OpenML Community
5600+ registered users,
120000+ yearly users
A U T O M L : M E T A - L E A R N I N G
• Find similar datasets
• 20,000+ versioned datasets, with 130+ meta-features
• Instead of starting from scratch, start from configurations
that worked well on similar datasets
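As a rough illustration (the distance measure below is just an example, not OpenML's own notion of similarity), the meta-features are exposed as dataset.qualities in openml-python:

import numpy as np
import openml as oml

# meta-features ("qualities") of two datasets; ids chosen for illustration
q1 = oml.datasets.get_dataset(1471).qualities
q2 = oml.datasets.get_dataset(31).qualities

# naive similarity: Euclidean distance over shared, non-missing meta-features
shared = [k for k in q1 if k in q2 and q1[k] is not None and q2[k] is not None]
distance = np.sqrt(sum((q1[k] - q2[k]) ** 2 for k in shared))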
A U T O M L : M E T A - L E A R N I N G
• Find similar datasets
• 20,000+ versioned datasets, with 130+ meta-features
• Instead of starting from scratch, start from configurations
that worked well on similar datasets
• Auto-sklearn (AutoML challenge winner, NIPS 2016)
• Lookup similar datasets, start with best pipelines
Matthias Feurer et al. (2016) NIPS
A U T O M L : M E T A - L E A R N I N G
• Find similar datasets
• 20,000+ versioned datasets, with 130+ meta-features
• Instead of starting from scratch, start from configurations
that worked well on similar datasets
• Auto-sklearn (AutoML challenge winner, NIPS 2016)
• Lookup similar datasets, start with best pipelines
A U T O M L : M E T A - L E A R N I N G
• Reuse (millions of) prior model evaluations:
• Benchmark new algorithms against state-of-the-art
• Meta-models: E.g. predict performance or training time
• MIT AutoML system (ICBD 2017)
• Uses and compares against OpenML results
A U T O M L : M E T A - L E A R N I N G
• Reuse (millions of) prior model evaluations:
• Benchmark new algorithms against state-of-the-art
• Meta-models: E.g. predict performance or training time
• Runtime prediction
A U T O M L : M E T A - L E A R N I N G
• Reuse (millions of) prior model evaluations:
• Benchmark new algorithms against state-of-the-art
• Meta-models: E.g. predict performance or training time
• Faster TPOT (in progress)
• Build meta-models (Random Forest works well)
• Focus on fast configurations first
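A hedged sketch of such a meta-model (the file and column names are hypothetical, not TPOT's actual interface): regress runtime on dataset meta-features, then try the configurations predicted to be fast first:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# hypothetical export: one row per (pipeline, dataset) with meta-features and measured runtime
meta = pd.read_csv("openml_runtime_metadata.csv")
X = meta.drop(columns=["runtime"])
y = meta["runtime"]

# Random Forest meta-model (the slide notes it works well in practice)
meta_model = RandomForestRegressor(n_estimators=100).fit(X, y)

# rank candidate configurations, fastest predicted runtime first
ranked = meta.assign(predicted=meta_model.predict(X)).sort_values("predicted")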
A U T O M L : M E T A - L E A R N I N G
• Reuse results on many hyperparameter settings
• Surrogate models: predict best hyperparameter settings
• Study hyperparameter effects/importance
• Amazon’s multi-task learning AutoML (NIPS 2017)
• Trains surrogate models per task
• On new tasks: learns how to combine them with neural net
A U T O M L : M E T A - L E A R N I N G
• Reuse results on many hyperparameter settings
• Surrogate models: predict best hyperparameter settings
• Study hyperparameter effects/importance
• Hyperparameter space design
• Use OpenML data to learn which hyperparameters to tune
Jan van Rijn et al. (2017) AutoML@ICML
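A minimal surrogate-model sketch (hypothetical column names; not the exact setup of the cited paper): fit a regressor from hyperparameter settings to observed scores and query it instead of running new experiments:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# hypothetical input: OpenML evaluations of one flow under many hyperparameter settings
evals = pd.read_csv("svm_evaluations.csv")        # columns: C, gamma, predictive_accuracy
surrogate = RandomForestRegressor().fit(evals[["C", "gamma"]],
                                        evals["predictive_accuracy"])

# cheap "virtual experiments": predict the score of unseen settings
candidates = pd.DataFrame({"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]})
predicted = surrogate.predict(candidates)

# hyperparameter importance can be read off the fitted surrogate
importance = dict(zip(["C", "gamma"], surrogate.feature_importances_))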
A U T O M L : M E T A - L E A R N I N G
• Never-ending Automatic Machine Learning:
• AutoML methods built on top of OpenML get increasingly
better as more meta-data is added
• Faster drug discovery (QSAR)
• Meta-learning to build better models that recommend drug
candidates for rare diseases
ChEMBL DB: 1.4M compounds, 10k proteins, 12.8M activities

Molecule representations (example feature table):

MW       LogP    TPSA    b1  b2  b3  b4  b5  b6  b7  b8  b9  …
377.435  3.883   77.85   1   1   0   0   0   0   0   0   0
341.361  3.411   74.73   1   1   0   1   0   0   0   0   0   …
197.188  -2.089  103.78  1   1   0   1   0   0   0   1   0
346.813  4.705   50.70   1   0   0   1   0   0   0   0   0
[QSAR meta-learning workflow: 16,000 regression datasets x 52 pipelines (on OpenML) -> meta-model; given all data on a new protein, it recommends the optimal models to predict activity]
(Olier et al., Machine Learning 107(1), 2018)
Learning to learn
Bots that learn from all prior experiments
Automate drudge work, help people build models
Join us! (and change the world)
Active open source community
We need more bright people
- ML/DB experts
- Developers
- UX
Support is welcome!
Workshop sponsorship (hackathons 2x/year)
Donations: OpenML foundation
Compute time
Project ideas
E I N D H O V E N U N I V E R S I T Y
Looking for:
• PhD Students
• Scientific programmer
O P E N M L H A C K A T H O N
Paris, September 17-21, 2018
meet.openml.org
Co-located with COSEAL
Thank you!
谢谢 (Thank you!)
@open_ml
OpenML
Questions?
