ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Machine Learning for Data Extraction
OECD Workshop on Evaluation of Existing Data on EDCs
October 24, 2018, Paris, France
Christopher Stahl,
stahlcg@ornl.gov
Dasha Herrmannova,
dasha.herrmannova@open.ac.uk
Oak Ridge National Laboratory, US & The Open University, UK
Acknowledgements
• The following people contributed to the research in this presentation:
– Steven Young, ORNL
– Robert Patton, ORNL
– Jack Wells, ORNL
– Mary Wolfe, NTP, NIEHS, NIH
– Nicole Kleinstreuer, NICEATM, NTP, NIEHS, NIH
Outline
• Introduction
– Who are we and where are we from?
• What is Machine Learning?
– Introduction to Machine Learning and Deep Learning
• ORNL Deep Learning Experiment
– PDF extraction with ORNL DeepPDF
• Machine Learning for Data Extraction
– Extraction of study descriptors in Toxicology research
Introduction
Introduction
• Christopher Stahl
– Graduated from Florida Southern College with a Bachelor of Science in 2011
and has been working with the Computational Data Analytics group at Oak
Ridge National Laboratory since then. He currently is a Data Analytics
Software Engineer who works on multiple projects with a focus on data
mining, data analytics, and knowledge discovery.
Introduction
• Dasha Herrmannova
– Research Associate in Scholarly Data Mining at the Open University
– Visiting Researcher at Oak Ridge National Laboratory since 2016
– Goal: Accelerate scientific discovery, help researchers work more effectively by
enabling intelligent access to the content of research papers
– My research: text-mining for research evaluation (co-founder of
semantometrics.org), analysis of research collaboration and trends,
information extraction from scholarly publications
Introduction
• Oak Ridge National Laboratory
– Oak Ridge National Laboratory is the largest US Department of Energy science
and energy laboratory, conducting basic and applied research to deliver
transformative solutions to compelling problems in energy and security.
Introduction
• Computational Data Analytics Group
– The Computational Data Analytics Research Group at Oak Ridge National
Laboratory conducts innovative basic and applied computer science research
on challenges of national interest. Our main research area is the analysis of
very large sets of data: our focus is large-scale data analytics and
architectures, which includes work on machine learning, deep learning, text
analytics, graph analytics, visual analytics, and biomedical analytics.
What is Machine Learning?
What is Machine Learning?
• Simply put
– “Machine Learning is the science of getting computers to learn and act like
humans do, and improve their learning over time in autonomous fashion, by
feeding them data and information in the form of observations and real-world
interactions.” - https://www.techemergence.com/what-is-machine-learning/
Machine Learning
• Two common conceptions:
– Fully autonomous robots that ultimately
destroy all mankind.
– Human assisted machines that make us better,
faster, and stronger.
But Why Teach Machines?
• Big Problems, Big Data
Deep Learning
• Deep Learning research is exploding
What is Deep Learning?
• Deep learning is data-driven feature extraction supported by a
hierarchy of neuron layers
– Lower layers learn local detail
– Higher layers learn global concepts
http://www.datarobot.com/blog/a-primer-on-deep-learning/
Simple Deep Learning
• Software programs trying to mimic the human brain.
• So let's talk more about those cats.
• What is a cat?
“Features” of a Cat
Small and fluffy
2 ears
2 eyes
Nose
4 legs
Tail
Simple Deep Learning
Why?
Small and fluffy - ✓
2 ears - ✓
2 eyes - X
Nose - X
4 legs - X
Tail - ✓
Based on our training...not a cat.
How does our brain know better?
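The checklist above behaves like a hand-written rule: every feature must be ticked, or the answer is "not a cat". A minimal sketch of that rule follows; the function name and the "photographed from behind" scenario are assumptions made for illustration, since the slide's image is not available.

```python
# Hand-written checklist of cat "features" taken from the slide.
CAT_FEATURES = ["small and fluffy", "2 ears", "2 eyes", "nose", "4 legs", "tail"]


def is_cat(observed):
    """Return True only if every feature on the checklist was observed."""
    return all(feature in observed for feature in CAT_FEATURES)


# Hypothetical example: a cat seen from behind, so eyes, nose and legs are hidden.
seen_from_behind = {"small and fluffy", "2 ears", "tail"}

print(is_cat(seen_from_behind))   # False: three boxes are unticked
print(is_cat(set(CAT_FEATURES)))  # True: every box is ticked
```

A rigid rule like this fails as soon as one feature is hidden, which is the weakness the slide points at.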
Simple Deep Learning
• Convolutional Neural Network
– Instead of defining features (eyes, nose, tail, etc.) by hand, let's give the
machine the entire image and allow it to scan for features by dividing the
image into smaller blocks.
– For each block, the computer asks questions such as: is there a horizontal
line? Is this block mostly black? Does this look fluffy?
– The computer can learn tens to hundreds of such feature detectors. These are
called neurons.
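The block-by-block scan described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 6x6 image, the hand-set 3x3 "horizontal line" detector and the `scan` helper are hypothetical, and in a real convolutional network the detector weights are learned from data and the window usually slides with overlap.

```python
# Toy 6x6 grayscale image (0 = black, 1 = white) with a short white
# horizontal line in the top-left block.
IMAGE = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Hand-set detector: responds strongly to a bright row between two dark rows.
HORIZONTAL_LINE = [
    [-1, -1, -1],
    [2, 2, 2],
    [-1, -1, -1],
]


def scan(image, kernel, stride=3):
    """Apply the kernel to each block of the image and return one response per block."""
    k = len(kernel)
    responses = []
    for top in range(0, len(image) - k + 1, stride):
        row = []
        for left in range(0, len(image[0]) - k + 1, stride):
            total = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(total)
        responses.append(row)
    return responses


print(scan(IMAGE, HORIZONTAL_LINE))  # [[6, 0], [0, 0]]
```

Only the top-left block produces a high response, so the detector has "found" the horizontal line there.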
Simple Deep Learning
• Now we can make another layer with larger blocks, then another and
another, until the network has finally learned what defines a cat.
http://www.datarobot.com/blog/a-primer-on-deep-learning/
Not Just Robots, Self Driving Cars, and Cat Videos
• Researchers from Sutter Health and Georgia Institute of Technology
showed they could predict heart failure as much as nine months
before doctors using traditional methods.
(https://blogs.nvidia.com/blog/2016/04/11/predict-heart-failure/)
• Wearables such as Horus help blind people “see” by using
deep learning to analyze their surroundings and provide feedback to
the user.
(https://newatlas.com/horus-wearable-blind-assistant/46173/)
ORNL Deep Learning Experiment
DeepPDF
https://www.osti.gov/servlets/purl/1460210
ORNL Deep Learning Experiment
• What current problem is easy for a human but hard for a
machine?
Background
• Test of Grobid on 100 Sponsor Publications
Background
• Deep learning is being applied heavily to image analysis.
• Publication files (PDF) look like images.
• Can we use deep learning to improve results?
Data Collection
• 50 randomly selected PDF files from
PMC_sample_1943
• 12 different annotation types
• 407 total pages
Experiment
• Semantic segmentation
• The process of assigning a label to each pixel of an image.
• U-Net: a popular network for semantic segmentation tasks.
• https://github.com/shreyaspadhy/UNet-Zoo
• This network was chosen as it typically provides good performance
with relatively few training examples.
Results
• The per-pixel classification accuracy on the validation set was 94.32%,
compared to a baseline of classifying every pixel as “not paragraph”,
which would give 79.67% accuracy.
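To make the comparison concrete, per-pixel accuracy and the "always not paragraph" baseline can be computed as below. This is a sketch on a made-up 2x5 page with two labels only; the `pixel_accuracy` helper and the toy data are assumptions, not the evaluation code used in the experiment.

```python
def pixel_accuracy(predicted, truth):
    """Fraction of pixels whose predicted label equals the annotated label."""
    pairs = [
        (p, t)
        for p_row, t_row in zip(predicted, truth)
        for p, t in zip(p_row, t_row)
    ]
    return sum(p == t for p, t in pairs) / len(pairs)


# Toy page: "P" = paragraph pixel, "-" = not paragraph.
truth = [list("PP---"), list("-----")]
predicted = [list("PP---"), list("P----")]  # one pixel is wrong
baseline = [list("-----"), list("-----")]   # always predict "not paragraph"

print(pixel_accuracy(predicted, truth))  # 0.9
print(pixel_accuracy(baseline, truth))   # 0.8
```

The baseline scores well simply because most pixels are not paragraph pixels, which is why the reported accuracy is read against it rather than against zero.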
Next Steps
• Take the results and extract them to structured text. This should improve
the results of all our work that uses PDFs as input.
Machine Learning for Data Extraction
https://www.aclweb.org/anthology/W18-5609/
Goal
• Goal: automated identification/extraction of study descriptors
from research publications
– Information pertaining to the guideline for rodent uterotrophic bioassays
(OECD TG 440)
Minimal criteria
Female weanling rats
Six rats per group
Four treatment groups
The Data Being Extracted
• OECD Test Guideline 440: Uterotrophic Bioassay in Rodents
– The guideline consists of 6 minimal criteria (MC)
– All six MC have to be met for a study to be guideline-like (GL)
– Below: manually annotated (GL) abstract showing the six MC
OECD TG 440: Minimum criteria (MC)
• MC 1: Animal model
• MC 2: Group size
• MC 3: Route of administration
• MC 4: Number of dose groups
• MC 5: Dosing interval
• MC 6: Necropsy timing
Source: Kleinstreuer et al. (2016). A Curated Database of
Rodent Uterotrophic Bioactivity.
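Since all six MC have to be met for a study to count as guideline-like (GL), the decision rule itself is a simple conjunction. A minimal sketch follows; the dictionary layout and the `is_guideline_like` name are assumptions made for illustration.

```python
# The six minimum criteria (MC) of OECD TG 440, as listed on the slide.
MINIMUM_CRITERIA = [
    "animal model",
    "group size",
    "route of administration",
    "number of dose groups",
    "dosing interval",
    "necropsy timing",
]


def is_guideline_like(criteria_met):
    """A study is guideline-like (GL) only if all six minimum criteria are met."""
    return all(criteria_met.get(name, False) for name in MINIMUM_CRITERIA)


study = {name: True for name in MINIMUM_CRITERIA}
print(is_guideline_like(study))  # True

study["group size"] = False      # a single unmet criterion makes the study non-GL
print(is_guideline_like(study))  # False
```

The hard part is not this rule but filling in the six True/False values from the text of a publication.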
Dataset
• 670 research publications with results for 2,615 uterotrophic
bioassays
– Kleinstreuer et al. (2016). A Curated Database of Rodent Uterotrophic
Bioactivity. Environmental Health Perspectives.
– ~120 publications (~18%) contain GL studies (~540 out of 2,615)
Information extraction
• Standard approach to document annotation: train a prediction
model from labeled data
• Requires fine-grained sentence/word level annotations
– Obtaining training data can be very costly
– Depending on task, data being extracted can vary significantly
– What if training data isn’t available?
Human learning vs. machine learning
Figure: a human reads a document, notes that it talks about dogs and cats, and
labels it “pets”. A machine sees only the document label (“pets”) and which
words were in the document (dog, cat, buy).
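The machine's view in this comparison, which words were in the document, is commonly called a bag-of-words representation. A minimal sketch, assuming a made-up document and a simple lower-case tokenizer:

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text, split it into alphabetic words, and count each word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Hypothetical document; "pets" is the label a supervised model would be
# trained to predict from the word counts alone.
document = "Where to buy a dog or a cat: a cat needs less space than a dog."
label = "pets"

counts = bag_of_words(document)
print({word: counts[word] for word in ("dog", "cat", "buy")})
# {'dog': 2, 'cat': 2, 'buy': 1}
```

Word order and meaning are discarded; the model only learns which word counts tend to go with which label.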
Human learning vs. machine learning
• A human non-expert can correctly annotate criteria given just a
description of the target information
Human learning vs. machine learning
• Machine Learning (ML) algorithm learns to associate certain
linguistic patterns with the criteria
– These patterns may be hard to spot for a human
• For example, an ML algorithm can learn to correctly predict whether a
document met the criteria based on short text snippets (such as abstracts)
that do not mention the criteria
Our approach
• Goal: extract information for
the purpose of identifying
GL/non-GL documents
• Two approaches:
1. Classify, then extract: Can we
train a good classifier? If so,
extract learned patterns (fully
supervised)
2. Extract, then classify: Extract
text segments relevant to criteria
descriptions (unsupervised), then
classify the extracted segments
Figure, approach 1: the document is labeled “pets”; which words in the
document (dog, cat, buy) explain why?
Figure, approach 2: sentences scored by similarity to MC 1.
Sentence 1: Pre-pubertal (day 18 of life) female mice were randomized to
receive placebo, 200 mg/kg CTX, or 120 mg/kg CTX.
Similarity to MC1: 0.5478
Sentence 2: The dosages of CTX used were based on previous studies, which
demonstrated a significant dose-dependent ovarian toxicity.
Similarity to MC1: 0.3887
Classify, then extract
• Train a classifier (e.g. Logistic Regression, Convolutional Neural
Network) to distinguish between publications that met/didn’t meet
criteria
• Extract the linguistic patterns the classifier learns to use to make
the decision
– Occurrence/lack of certain words in text, importance of words
• Pro: theoretically a more accurate method
• Con: requires labeled data – what is the minimal amount of data
needed to make this approach work reliably?
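The "classify, then extract" idea can be sketched with a tiny logistic regression whose learned word weights are then read back as the extracted patterns. Everything below is an illustrative assumption: the four-document corpus is made up, the model is trained with plain stochastic gradient descent, and it is not the authors' classifier or data.

```python
import math

# Hypothetical toy corpus: (text, 1 if the criterion was met, else 0).
DOCS = [
    ("immature female rats dosed orally", 1),
    ("weanling female rats received daily doses", 1),
    ("adult male mice dosed orally", 0),
    ("adult male rabbits received daily doses", 0),
]

vocab = sorted({word for text, _ in DOCS for word in text.split()})


def features(text):
    """Binary bag-of-words vector: 1.0 if the vocabulary word occurs in the text."""
    words = set(text.split())
    return [1.0 if word in words else 0.0 for word in vocab]


def train(docs, epochs=200, lr=0.5):
    """Fit logistic regression weights with per-example gradient steps."""
    weights, bias = [0.0] * len(vocab), 0.0
    for _ in range(epochs):
        for text, label in docs:
            x = features(text)
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            error = 1.0 / (1.0 + math.exp(-z)) - label  # prediction minus target
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


weights, bias = train(DOCS)

# "Extract" step: the most positive and most negative weights are the words
# the classifier relies on to decide met / not met.
ranked = sorted(zip(vocab, weights), key=lambda pair: pair[1], reverse=True)
top_met = [word for word, _ in ranked[:2]]
top_not_met = [word for word, _ in ranked[-2:]]
print("evidence for 'met':    ", top_met)
print("evidence for 'not met':", top_not_met)
```

On this toy corpus the words shared by both positive documents end up with the largest weights, while words that occur in both classes stay near zero. With realistic data, how many labeled documents this needs is exactly the open question raised in the "Con" bullet.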
Extract, then classify
• Utilize criteria descriptions, sample sentences, or a taxonomy as a “query”
into the document
• Extract parts of the document most relevant to the “query” (e.g.
parts which are the most similar to the query)
• Pro: Does not require labeled data, works on individual documents
• Con: What is a good “query” and a reliable extraction (similarity)
method?
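A minimal sketch of the "extract, then classify" step ranks sentences by their similarity to a query. The slides do not say which similarity measure was used, so the bag-of-words cosine similarity below is an assumption, as is the wording of the query; the two sentences are the examples shown earlier in the deck, and the scores this sketch produces will not match the ones reported there.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words counts over lower-cased alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical "query" for MC 1 (animal model); the wording is an assumption.
query = "animal model immature or ovariectomized female rats or mice"

sentences = [
    "Pre-pubertal (day 18 of life) female mice were randomized to receive "
    "placebo, 200 mg/kg CTX, or 120 mg/kg CTX.",
    "The dosages of CTX used were based on previous studies, which "
    "demonstrated a significant dose-dependent ovarian toxicity.",
]

q = vectorize(query)
ranked = sorted(sentences, key=lambda s: cosine(q, vectorize(s)), reverse=True)
print(ranked[0])  # the sentence that describes the animal model
```

No labeled data is needed, but the ranking is only as good as the query and the similarity measure, which is the trade-off named in the "Con" bullet.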
Example abstract
Results for MC 1: Animal model
• Top: supervised (classify, then extract), bottom: unsupervised
• Correct answer is underlined
Results for MC 2: Group size
• Top: supervised (classify, then extract), bottom: unsupervised
Results for MC 3: Route of administration
• Top: supervised (classify, then extract), bottom: unsupervised
Results for MC 4: Number of dose groups
• Top: supervised (classify, then extract), bottom: unsupervised
Results for MC 5: Dosing interval
• Top: supervised (classify, then extract), bottom: unsupervised
Results for MC 6: Necropsy timing
• Top: supervised (classify, then extract), bottom: unsupervised
Unsupervised (similarity-based) approach
• Top sentences extracted from full text (correct answer underlined)
• Similarity scores: 70.61, 65.31, and 63.69
1. After weaning on pnd 21, the dams were euthanized by CO2 asphyxiation and the juvenile
females were individually housed.
2. Six CD(SD) rat dams, each with reconstituted litters of six female pups, were received from Charles
River Laboratories (Raleigh, NC, USA) on offspring postnatal day (pnd) 16.
3. This validation study followed OECD TG 440, with six female weanling rats (postnatal day 21)
per dose group and six treatment groups.
Current limitations and future work
• “Semantic implication”
• Tables/figures
• Multiple studies
• Answer not in text
Current limitations
• “Semantic implication”
– MC 4: Number of dose groups: minimum of two dose groups; must have
a positive and a negative control
Current limitations
• Tables/figures
Current limitations
• Multiple studies
– We currently don’t have a definitive way of recognizing publications with
multiple studies or matching sentences to studies
Current limitations
• Document 24096037 (describes 4 studies), top sentence from full text
for each MC
1. Animal model: (I) Female Sprague-Dawley rats and C57BL/6 mice, ovarectomized on PND
20 and all within 10% of the average body weight (b.w.), were obtained from ...
2. Group size: (I) An equal number of time-matched Veh animals (n=5) were treated in the
same manner.
3. Route of administration: However, o,p-DDT induced Ca2 in the rat, with a clear difference
between EE and o,p-DDT treatment.
4. Number of dose groups: (I) For graphing purposes, the relative expression levels were
scaled such that the expression level of the time-matched control group was equal to one.
5. Dosing interval: Female Sprague-Dawley rats and C57BL/6 mice, ovarectomized on PND 20
and all within 10% of the average body weight (b.w.), were obtained from ...
6. Necropsy timing: (HI) All animals were sacrificed 24 h after the last treatment (72 h after
the initial dose).
Current limitations
• Answer not in text
– We currently don’t have a definitive way of recognizing that the answer is not there
Other improvements
• Making extraction more accurate
– Applying robust evaluation metrics to the extracted text
– Combining both (supervised and unsupervised) approaches
• Labeling extracted text
• Using ensembles to improve classification / joint modelling of MC
labels (e.g. using Deep Learning)
Conclusions
• A hard problem given the lack of annotations
• Similarity-based approaches may provide a way forward
Funding
• Support for this research was provided by an Interagency Agreement with the National Institute
of Environmental Health Sciences (AES 16002-001) and the U.S. Department of Energy at Oak
Ridge National Laboratory.
• This research was supported in part by an appointment to the Oak Ridge National Laboratory
ASTRO Program, sponsored by the U.S. Department of Energy and administered by the Oak
Ridge Institute for Science and Education.
• This manuscript has been authored by UT-Battelle, LLC and used resources of the Oak Ridge
Leadership Computing Facility at the Oak Ridge National Laboratory under Contract No.
DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and
the publisher, by accepting the article for publication, acknowledges that the United States
Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or
reproduce the published form of this manuscript, or allow others to do so, for United States
Government purposes. The Department of Energy will provide public access to these results of
federally sponsored research in accordance with the DOE Public Access Plan.
Thank you for listening!

Machine Learning for Data Extraction

  • 1. ORNL is managed by UT-Battelle, LLC for the US Department of Energy Machine Learning for Data Extraction OECD Workshop on Evaluation of Existing Data on EDCs October 24, 2018, Paris, France Christopher Stahl, stahlcg@ornl.gov Dasha Herrmannova, dasha.herrmannova@open.ac.uk Oak Ridge National Laboratory, US & The Open University, UK
  • 2. 22 Machine Learning for Data Extraction Acknowledgements • The following people contributed to the research in this presentation: – Steven Young, ORNL – Robert Patton, ORNL – Jack Wells, ORNL – Mary Wolfe, NTP, NIEHS, NIH – Nicole Kleinstreuer, NICEATM, NTP, NIEHS, NIH
  • 3. 33 Machine Learning for Data Extraction Outline • Introduction – Who are we and where are we from? • What is Machine Learning? – Introduction to Machine Learning and Deep Learning • ORNL Deep Learning Experiment – PDF extraction with ORNL DeepPDF • Machine Learning for Data Extraction – Extraction of study descriptors in Toxicology research
  • 5. 55 Machine Learning for Data Extraction Introduction • Christopher Stahl – Graduated from Florida Southern College with a Bachelor of Science in 2011 and has been working with the Computational Data Analytics group at Oak Ridge National Laboratory since then. He currently is a Data Analytics Software Engineer who works on multiple projects with a focus on data mining, data analytics, and knowledge discovery.
  • 6. 66 Machine Learning for Data Extraction Introduction • Dasha Herrmannova – Research Associate in Scholarly Data Mining at the Open University – Visiting Researcher at @ Oak Ridge National Laboratory since 2016 – Goal: Accelerate scientific discovery, help researchers work more effectively by enabling intelligent access to the content of research papers – My research: text-mining for research evaluation (co-founder of semantometrics.org), analysis of research collaboration and trends, information extraction from scholarly publications
  • 7. 77 Machine Learning for Data Extraction Introduction • Oak Ridge National Laboratory – Oak Ridge National Laboratory is the largest US Department of Energy science and energy laboratory, conducting basic and applied research to deliver transformative solutions to compelling problems in energy and security.
  • 8. 88 Machine Learning for Data Extraction Introduction • Computational Data Analytics Group – The Computational Data Analytics Research Group at the Oak Ridge National Laboratory conducts innovative basic and applied computer science research on challenges of national interest. Our primary research focus is in the areas of large-scale data analytics and architectures. We conduct innovative basic and applied computer science research on challenges of national interest. Our main research area is analysis of very large sets of data. Our research focus is in the areas of large-scale data analytics and architectures, which includes work machine learning, deep learning, text analytics, graph analytics, visual analytics, and biomedical analytics.
  • 9. What is Machine Learning?
  • 10. 1010 Machine Learning for Data Extraction What is Machine Learning? • Simply put – “Machine Learning is the science of getting computers to learn and act like humans do, and improve their learning over time in autonomous fashion, by feeding them data and information in the form of observations and real-world interactions.” - https://www.techemergence.com/what-is-machine-learning/
  • 11. 1111 Machine Learning for Data Extraction Machine Learning • Two common conceptions: – Fully autonomous robots that ultimately destroy all mankind. – Human assisted machines that make us better, faster, and stronger.
  • 12. 1212 Machine Learning for Data Extraction But Why Teach Machines? • Big Problems, Big Data
  • 13. 1313 Machine Learning for Data Extraction Deep Learning • Deep Learning research is exploding
  • 14. 1414 Machine Learning for Data Extraction What is Deep Learning? • Deep learning is data driven feature extraction supported by hierarchy of neuron layers – Lower layers learn local detail – Higher layers learn global concepts http://www.datarobot.com/blog/a-primer-on-deep-learning/
  • 15. 1515 Machine Learning for Data Extraction Simple Deep Learning • Software programs trying to mimic the human brain. • So let's talk more about those cats. • What is a cat? “Features” of a Cat Small and fluffy 2 ears 2 eyes Nose 4 legs Tail
  • 16. 1616 Machine Learning for Data Extraction Simple Deep Learning Why? Small and fluffy - ✓ 2 ears - ✓ 2 eyes - X Nose - X 4 legs - X Tail - ✓ Based on our training...not a cat. How does our brain know better?
  • 17. 1717 Machine Learning for Data Extraction Simple Deep Learning • Convolutional Neural Network – Instead of defining features (eyes, nose, tail, etc.) let's give the machine the entire image and allow it to scan for features dividing the image into smaller blocks. – The computer looks for things such as, is there a horizontal line, this block is mostly black, does this look fluffy etc. – The computer can find tens to hundreds of such features. These are called neurons.
  • 18. 1818 Machine Learning for Data Extraction Simple Deep Learning
  • 19. 1919 Machine Learning for Data Extraction Simple Deep Learning • Now we can make another layer with larger blocks. Then another and another, etc. Until we finally have learned what defines a cat. http://www.datarobot.com/blog/a-primer-on-deep-learning/
  • 20. 2020 Machine Learning for Data Extraction Not Just Robots, Self Driving Cars, and Cat Videos • Researchers from Sutter Health and Georgia Institute of Technology showed they could predict heart failure as much as nine months before doctors using traditional methods. (https://blogs.nvidia.com/blog/2016/04/11/predict-heart-failure/) • Wearables such as Horus help the blind to be able to “see” by using deep learning to analyze their surroundings and provide feedback to the user. (https://newatlas.com/horus-wearable-blind-assistant/46173/)
  • 22. ORNL Deep Learning Experiment • What current problem is easy for a human but hard for a machine?
  • 23. Background • Test of Grobid on 100 sponsor publications
  • 24. Background • Deep learning is being applied heavily to image analysis. • Publication files (PDF) look like images. • Can we use deep learning to improve results?
  • 25. Data Collection • 50 randomly selected PDF files from PMC_sample_1943 • 12 different annotation types • 407 total pages
  • 26. Experiment • Semantic segmentation: the process of assigning a label to each pixel of an image. • U-Net, a popular network for semantic segmentation tasks. • https://github.com/shreyaspadhy/UNet-Zoo • This network was chosen because it typically provides good performance with relatively few training examples.
  • 27. Results • The per-pixel classification accuracy on the validation set was 94.32%, compared with a baseline of 79.67% accuracy from classifying every pixel as “not paragraph”.
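To make the comparison on this slide concrete, here is a minimal sketch of how per-pixel accuracy and the "everything is not paragraph" baseline can be computed. The tiny 3x4 masks are hypothetical toy data (1 = paragraph, 0 = not paragraph), not the experiment's pages, and the function name is an assumption for illustration.

```python
def pixel_accuracy(predicted, truth):
    """Fraction of pixels whose predicted label equals the true label."""
    correct = 0
    total = 0
    for pred_row, true_row in zip(predicted, truth):
        for p, t in zip(pred_row, true_row):
            correct += int(p == t)
            total += 1
    return correct / total


# Hypothetical ground-truth mask: 2 of 12 pixels belong to a paragraph.
truth = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

# Hypothetical model output that misses one paragraph pixel.
predicted = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

# Baseline: label every pixel "not paragraph".
baseline = [[0] * len(row) for row in truth]

model_acc = pixel_accuracy(predicted, truth)
baseline_acc = pixel_accuracy(baseline, truth)
print(f"model: {model_acc:.2%}, baseline: {baseline_acc:.2%}")
```

Because most pixels on a page are background, the baseline is already high, so the model's accuracy is only meaningful when read against it.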
  • 28. Results
  • 29. Results
  • 30. Next Steps • Take the results and extract them to structured text. This should improve the results of all our work that uses PDFs as input.
  • 31. Machine Learning for Data Extraction https://www.aclweb.org/anthology/W18-5609/
  • 32. Goal • Goal: automated identification/extraction of study descriptors from research publications – Information pertaining to the guideline for rodent uterotrophic bioassays (OECD TG 440) Minimal criteria: female weanling rats; six rats per group; four treatment groups
  • 33. The Data Being Extracted • OECD Test Guideline 440: Uterotrophic Bioassay in Rodents – The guideline consists of six minimal criteria (MC) – All six MC have to be met for a study to be guideline-like (GL) – Below: manually annotated (GL) abstract showing the six MC
  • 34. OECD TG 440: Minimum criteria (MC) • MC 1: Animal model • MC 2: Group size • MC 3: Route of administration • MC 4: Number of dose groups • MC 5: Dosing interval • MC 6: Necropsy timing Source: Kleinstreuer et al. (2016). A Curated Database of Rodent Uterotrophic Bioactivity.
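The rule that a study is guideline-like (GL) only when all six MC are met can be written as a simple check. This is a sketch, assuming each study is represented as a dictionary of per-criterion booleans; that representation and the names `is_guideline_like`, `study_a`, and `study_b` are hypothetical and not taken from the curated database.

```python
# The six minimum criteria, named as on the slide.
MINIMUM_CRITERIA = [
    "Animal model",
    "Group size",
    "Route of administration",
    "Number of dose groups",
    "Dosing interval",
    "Necropsy timing",
]


def is_guideline_like(met):
    """Return True only if every one of the six MC is marked as met.
    A criterion missing from `met` counts as not met."""
    return all(met.get(mc, False) for mc in MINIMUM_CRITERIA)


# Hypothetical study that meets all six criteria.
study_a = {mc: True for mc in MINIMUM_CRITERIA}

# Hypothetical study that fails a single criterion.
study_b = dict(study_a, **{"Group size": False})

print(is_guideline_like(study_a), is_guideline_like(study_b))
```

Treating a missing criterion as not met is a conservative design choice for this sketch: a study is never labeled GL on incomplete evidence.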
  • 35. Dataset • 670 research publications with results for 2,615 uterotrophic bioassays – Kleinstreuer et al. (2015). A curated database of rodent uterotrophic bioactivity. Environmental Health Perspectives. – ~120 publications (~18%) contain GL studies (~540 of the 2,615 bioassays)
  • 36. Information extraction • Standard approach to document annotation: train a prediction model from labeled data • Requires fine-grained sentence/word-level annotations – Obtaining training data can be very costly – Depending on the task, the data being extracted can vary significantly – What if training data isn’t available?
  • 37. Human learning vs. machine learning [Figure: human view: the document is “pets” because it talks about dogs and it talks about cats; machine view: document label “pets” and which words were in the document (dog, cat, buy)]
  • 38. Human learning vs. machine learning • A human non-expert can correctly annotate criteria given just a description of the target information
  • 39. Human learning vs. machine learning • A machine learning (ML) algorithm learns to associate certain linguistic patterns with the criteria – These patterns may be hard for a human to spot • For example, an ML algorithm can learn to correctly predict whether a document met the criteria based on short text snippets that do not mention the criteria (such as abstracts)
  • 40. Our approach • Goal: extract information for the purpose of identifying GL/non-GL documents • Two approaches: 1. Classify, then extract: Can we train a good classifier? If so, extract the learned patterns (fully supervised) 2. Extract, then classify: Extract text segments relevant to the criteria descriptions (unsupervised), then classify the extracted segments [Figure: document label “pets”. Why? Which words were in the document (dog, cat, buy)?] Sentence 1: Pre-pubertal (day 18 of life) female mice were randomized to receive placebo, 200 mg/kg CTX, or 120 mg/kg CTX. Similarity to MC1: 0.5478 Sentence 2: The dosages of CTX used were based on previous studies, which demonstrated a significant dose-dependent ovarian toxicity. Similarity to MC1: 0.3887 ✓
  • 41. Classify, then extract • Train a classifier (e.g. logistic regression, convolutional neural network) to distinguish between publications that met/didn’t meet the criteria • Extract the linguistic patterns the classifier learns to use to make the decision – Occurrence/lack of certain words in the text, importance of words • Pro: theoretically a more accurate method • Con: requires labeled data – what is the minimal amount of data needed to make this approach work reliably?
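The "classify, then extract" idea can be illustrated with a very small bag-of-words logistic regression written in plain Python: train on labeled documents, then read off the words with the largest positive weights as the learned "patterns". This is a sketch under stated assumptions, not the presenters' implementation: the four toy documents, their labels, the training settings, and the function names are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(docs, labels, epochs=200, lr=0.5):
    """Fit word weights with plain stochastic gradient descent
    on the logistic loss. Features are raw word counts."""
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        for doc, y in zip(docs, labels):
            counts = Counter(tokenize(doc))
            z = bias + sum(weights.get(w, 0.0) * n for w, n in counts.items())
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of "met"
            err = p - y
            bias -= lr * err
            for w, n in counts.items():
                weights[w] = weights.get(w, 0.0) - lr * err * n
    return weights, bias


def top_patterns(weights, k):
    """Words the classifier relies on most for the positive class."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


# Hypothetical toy corpus: 1 = criterion met, 0 = not met.
docs = [
    "immature female rats dosed orally",
    "female weanling rats six per group",
    "adult male mice single injection",
    "cell culture assay in vitro",
]
labels = [1, 1, 0, 0]

weights, bias = train(docs, labels)
print(top_patterns(weights, 2))
```

With so little data the weights simply reflect which words co-occur with the positive label, which is also why the slide's question about the minimal amount of labeled data matters.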
  • 42. Extract, then classify • Use criteria descriptions/sample sentences/taxonomy (= a “query” into the document) • Extract the parts of the document most relevant to the “query” (e.g. the parts most similar to the query) • Pro: does not require labeled data, works on individual documents • Con: what is a good “query” and a reliable extraction (similarity) method?
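A minimal sketch of the "extract, then classify" step: score each sentence against a query and keep the most similar ones. The slides do not say which similarity method was used, so this example swaps in plain word-count cosine similarity as a stand-in; the query text is a hypothetical description of MC 1, and the two sentences are the ones quoted on the "Our approach" slide. The scores it produces are not comparable to the similarity values shown in the deck.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sentences(query, sentences):
    """Return (score, sentence) pairs, most similar to the query first."""
    q = bag_of_words(query)
    scored = [(cosine_similarity(q, bag_of_words(s)), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical query standing in for a description of MC 1 (animal model).
query = "immature female rats or mice"

sentences = [
    "The dosages of CTX used were based on previous studies, which "
    "demonstrated a significant dose-dependent ovarian toxicity.",
    "Pre-pubertal (day 18 of life) female mice were randomized to receive "
    "placebo, 200 mg/kg CTX, or 120 mg/kg CTX.",
]

ranked = rank_sentences(query, sentences)
for score, sentence in ranked:
    print(f"{score:.4f}  {sentence[:60]}")
```

Word-count overlap is the simplest possible choice; it cannot match synonyms such as "pre-pubertal" and "immature", which is one reason the choice of similarity method is listed as an open question.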
  • 43. OECD TG 440: Minimum criteria (MC) • MC 1: Animal model • MC 2: Group size • MC 3: Route of administration • MC 4: Number of dose groups • MC 5: Dosing interval • MC 6: Necropsy timing Source: Kleinstreuer et al. (2016). A Curated Database of Rodent Uterotrophic Bioactivity.
  • 44. Example abstract
  • 45. Results for MC 1: Animal model • Top: supervised (classify, then extract), bottom: unsupervised Correct answer (underlined)
  • 46. Results for MC 2: Group size • Top: supervised (classify, then extract), bottom: unsupervised
  • 47. Results for MC 3: Route of administration • Top: supervised (classify, then extract), bottom: unsupervised
  • 48. Results for MC 4: Number of dose groups • Top: supervised (classify, then extract), bottom: unsupervised
  • 49. Results for MC 5: Dosing interval • Top: supervised (classify, then extract), bottom: unsupervised
  • 50. Results for MC 6: Necropsy timing • Top: supervised (classify, then extract), bottom: unsupervised
  • 51. Unsupervised (similarity-based) approach • Top sentences extracted from full text (correct answer underlined) • Similarity scores: 70.61, 65.31, and 63.69 1. After weaning on pnd 21, the dams were euthanized by CO2 asphyxiation and the juvenile females were individually housed. 2. Six CD(SD) rat dams, each with reconstituted litters of six female pups, were received from Charles River Laboratories (Raleigh, NC, USA) on offspring postnatal day (pnd) 16. 3. This validation study followed OECD TG 440, with six female weanling rats (postnatal day 21) per dose group and six treatment groups.
  • 52. Current limitations and future work • “Semantic implication” • Tables/figures • Multiple studies • Answer not in text
  • 53. Current limitations • “Semantic implication” – MC 4: Number of dose groups – Minimum of two dose groups, must have positive and negative control
  • 54. Current limitations • Tables/figures
  • 55. Current limitations • Multiple studies – We currently don’t have a definitive way of recognizing publications with multiple studies or of matching sentences to studies
  • 56. Current limitations • Document 24096037 (describes 4 studies), top sentence from full text for each MC 1. Animal model: (I) Female Sprague-Dawley rats and C57BL/6 mice, ovarectomized on PND 20 and all within 10% of the average body weight (b.w.), were obtained from ... 2. Group size: (I) An equal number of time-matched Veh animals (n=5) were treated in the same manner. 3. Route of administration: However, o,p-DDT induced Ca2 in the rat, with a clear difference between EE and o,p-DDT treatment. 4. Number of dose groups: (I) For graphing purposes, the relative expression levels were scaled such that the expression level of the time-matched control group was equal to one. 5. Dosing interval: Female Sprague-Dawley rats and C57BL/6 mice, ovarectomized on PND 20 and all within 10% of the average body weight (b.w.), were obtained from ... 6. Necropsy timing: (HI) All animals were sacrificed 24 h after the last treatment (72 h after the initial dose).
  • 57. Current limitations • Answer not in text – We currently don’t have a definitive way of recognizing if the answer is not there
  • 58. Other improvements • Making extraction more accurate – Applying robust evaluation metrics to the extracted text – Combining both (supervised and unsupervised) approaches • Labeling extracted text • Using ensembles to improve classification / joint modelling of MC labels (e.g. using deep learning)
  • 59. Conclusions • A hard problem given the lack of annotations • Similarity-based approaches may provide a way forward
  • 60. Funding • Support for this research was provided by an Interagency Agreement with the National Institute of Environmental Health Sciences (AES 16002-001) and the U.S. Department of Energy at Oak Ridge National Laboratory. • This research was supported in part by an appointment to the Oak Ridge National Laboratory ASTRO Program, sponsored by the U.S. Department of Energy and administered by the Oak Ridge Institute for Science and Education. • This manuscript has been authored by UT-Battelle, LLC and used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan.
  • 61. Thank you for listening!

Editor's Notes

  • #17: $200 million supercomputer
  • #32: Say this is an ongoing project with NIEHS / NTP
  • #38: Difference – humans make the decision based on understanding of text vs. machines use statistics