Machine Learning for Data Extraction

ORNL is managed by UT-Battelle, LLC for the US Department of Energy
OECD Workshop on Evaluation of Existing Data on EDCs
October 24, 2018, Paris, France
Christopher Stahl,
stahlcg@ornl.gov
Dasha Herrmannova,
dasha.herrmannova@open.ac.uk
Oak Ridge National Laboratory, US & The Open University, UK

Acknowledgements
• The following people contributed to the research in this presentation:
– Steven Young, ORNL
– Robert Patton, ORNL
– Jack Wells, ORNL
– Mary Wolfe, NTP, NIEHS, NIH
– Nicole Kleinstreuer, NICEATM, NTP, NIEHS, NIH

Outline
• Introduction
– Who are we and where are we from?
• What is Machine Learning?
– Introduction to Machine Learning and Deep Learning
• ORNL Deep Learning Experiment
– PDF extraction with ORNL DeepPDF
• Machine Learning for Data Extraction
– Extraction of study descriptors in Toxicology research

Introduction
• Christopher Stahl
– Graduated from Florida Southern College with a Bachelor of Science in 2011
and has been working with the Computational Data Analytics group at Oak
Ridge National Laboratory since then. He currently is a Data Analytics
Software Engineer who works on multiple projects with a focus on data
mining, data analytics, and knowledge discovery.

Introduction
• Dasha Herrmannova
– Research Associate in Scholarly Data Mining at the Open University
– Visiting Researcher at Oak Ridge National Laboratory since 2016
– Goal: Accelerate scientific discovery, help researchers work more effectively by
enabling intelligent access to the content of research papers
– My research: text-mining for research evaluation (co-founder of
semantometrics.org), analysis of research collaboration and trends,
information extraction from scholarly publications

Introduction
• Oak Ridge National Laboratory
– Oak Ridge National Laboratory is the largest US Department of Energy science
and energy laboratory, conducting basic and applied research to deliver
transformative solutions to compelling problems in energy and security.

Introduction
• Computational Data Analytics Group
– The Computational Data Analytics Research Group at the Oak Ridge National
Laboratory conducts innovative basic and applied computer science research
on challenges of national interest. Our primary research focus is large-scale
data analytics and architectures, that is, the analysis of very large sets of data.
This includes work in machine learning, deep learning, text analytics, graph
analytics, visual analytics, and biomedical analytics.

What is Machine Learning?
• Simply put
– “Machine Learning is the science of getting computers to learn and act like
humans do, and improve their learning over time in autonomous fashion, by
feeding them data and information in the form of observations and real-world
interactions.” - https://www.techemergence.com/what-is-machine-learning/

Machine Learning
• Two common conceptions:
– Fully autonomous robots that ultimately
destroy all mankind.
– Human assisted machines that make us better,
faster, and stronger.

But Why Teach Machines?
• Big Problems, Big Data

Deep Learning
• Deep Learning research is exploding

What is Deep Learning?
• Deep learning is data-driven feature extraction supported by a
hierarchy of neuron layers
– Lower layers learn local detail
– Higher layers learn global concepts
http://www.datarobot.com/blog/a-primer-on-deep-learning/

Simple Deep Learning
• Software programs trying to mimic the human brain.
• So let's talk more about those cats.
• What is a cat?
“Features” of a Cat
Small and fluffy
2 ears
2 eyes
Nose
4 legs
Tail

Why?
Small and fluffy - ✓
2 ears - ✓
2 eyes - X
Nose - X
4 legs - X
Tail - ✓
Based on our training...not a cat.
How does our brain know better?
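
The checklist above can be read as a rule-based classifier. The following is a minimal sketch of that reading, assuming each feature is a simple yes/no observation; the function name `is_cat` and the dictionary layout are illustrative and not part of the presentation.

```python
# Hand-written "features of a cat" checklist, as listed on the slide.
CAT_FEATURES = ["small and fluffy", "2 ears", "2 eyes", "nose", "4 legs", "tail"]


def is_cat(observed):
    """Return True only if every checklist feature was observed."""
    return all(observed.get(feature, False) for feature in CAT_FEATURES)


# Observations from the slide: eyes, nose and legs were not found in the photo.
photo = {
    "small and fluffy": True,
    "2 ears": True,
    "2 eyes": False,
    "nose": False,
    "4 legs": False,
    "tail": True,
}

missing = [feature for feature in CAT_FEATURES if not photo[feature]]
print(is_cat(photo))  # False
print(missing)        # ['2 eyes', 'nose', '4 legs']
```

A rigid rule like this fails as soon as one feature is hidden, which is the weakness the slide points at.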

• Convolutional Neural Network
– Instead of defining features (eyes, nose, tail, etc.), let's give the machine the
entire image and allow it to scan for features by dividing the image into smaller
blocks.
– The computer looks for things such as: is there a horizontal line, is this block
mostly black, does this look fluffy, etc.
– The computer can find tens to hundreds of such features. The units that detect
them are called neurons.
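
Scanning an image block by block for one feature can be sketched as a small 2D convolution. This is an illustration only: the toy image and the `horizontal_line` kernel are hand-picked assumptions, whereas a real convolutional network learns its kernels from data.

```python
def scan(image, kernel):
    """Slide `kernel` over `image` and return the response of every block."""
    kh, kw = len(kernel), len(kernel[0])
    feature_map = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            # Response of one block: element-wise product, then sum.
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        feature_map.append(row)
    return feature_map


# 5x5 toy image containing a horizontal line in the middle row.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

# Hand-picked detector that responds strongly to horizontal lines.
horizontal_line = [
    [-1, -1, -1],
    [2, 2, 2],
    [-1, -1, -1],
]

feature_map = scan(image, horizontal_line)
for row in feature_map:
    print(row)
# [-3, -3, -3]
# [6, 6, 6]
# [-3, -3, -3]
```

The strongest responses appear where the blocks are centred on the line, which is the "is there a horizontal line?" question answered per block.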

• Now we can make another layer with larger blocks, then another and
another, and so on, until we have finally learned what defines a cat.
http://www.datarobot.com/blog/a-primer-on-deep-learning/

Not Just Robots, Self Driving Cars, and Cat Videos
• Researchers from Sutter Health and Georgia Institute of Technology
showed they could predict heart failure as much as nine months
before doctors using traditional methods.
(https://blogs.nvidia.com/blog/2016/04/11/predict-heart-failure/)
• Wearables such as Horus help the blind to be able to “see” by using
deep learning to analyze their surroundings and provide feedback to
the user.
(https://newatlas.com/horus-wearable-blind-assistant/46173/)

ORNL Deep Learning
Experiment
DeepPDF
https://www.osti.gov/servlets/purl/1460210

ORNL Deep Learning Experiment
• What current problem is easy for a human but hard for a
machine?

Background
• Test of Grobid on 100 Sponsor Publications

Background
• Deep learning is being heavily applied to image analysis.
• Publication files (PDF) look like images.
• Can we use deep learning to improve results?

Data Collection
• 50 randomly selected PDF files from
PMC_sample_1943
• 12 different annotation types
• 407 total pages

Experiment
• Semantic segmentation
• The process of assigning a label to each pixel of an image.
• U-Net: a popular network for semantic segmentation tasks.
• https://github.com/shreyaspadhy/UNet-Zoo
• This network was chosen as it typically provides good performance
with relatively few training examples.
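
Assigning a label to each pixel can be sketched as taking the highest-scoring class at every pixel position. The sketch below assumes the network has already produced per-class scores; the two-class label set and the score values are hypothetical (the experiment itself used 12 annotation types), and this is not the DeepPDF code.

```python
# Hypothetical two-class label set, chosen for illustration.
LABELS = ["not paragraph", "paragraph"]


def segment(scores):
    """scores[r][c] holds one score per class; return the best class index per pixel."""
    return [[max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
            for row in scores]


# Toy 2x2 "image" of per-class scores, standing in for network output.
scores = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.6, 0.4], [0.3, 0.7]],
]

mask = segment(scores)
print(mask)  # [[0, 1], [0, 1]]
print([[LABELS[i] for i in row] for row in mask])
```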

Results
• The per-pixel classification accuracy on the validation set was 94.32%,
compared to a baseline of 79.67% accuracy obtained by classifying every
pixel as “not paragraph”.
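
Per-pixel accuracy and the all-“not paragraph” baseline can be computed as in the sketch below. The tiny masks are made-up toy data chosen so the arithmetic is easy to follow; they do not reproduce the reported figures.

```python
def pixel_accuracy(predicted, truth):
    """Fraction of pixels whose predicted label equals the true label."""
    total = correct = 0
    for predicted_row, truth_row in zip(predicted, truth):
        for p, t in zip(predicted_row, truth_row):
            total += 1
            correct += (p == t)
    return correct / total


# Toy ground truth: 1 = paragraph, 0 = not paragraph (4 of 20 pixels are paragraph).
truth = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]

# Toy model output with a single wrong pixel.
model = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

# Baseline: label every pixel as "not paragraph".
baseline = [[0] * 5 for _ in range(4)]

model_accuracy = pixel_accuracy(model, truth)
baseline_accuracy = pixel_accuracy(baseline, truth)
print(model_accuracy)     # 0.95
print(baseline_accuracy)  # 0.8
```

Because most pixels are not paragraphs, the baseline already scores high, so the model's accuracy is only meaningful when read against it.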

Results

Next Steps
• Take the results and extract them to structured text. This should
improve the results of all our work that uses PDFs as input.

Machine Learning for Data
Extraction
https://www.aclweb.org/anthology/W18-5609/

Goal
• Goal: automated identification/extraction of study descriptors
from research publications
– Information pertaining to guideline for rodent uterotrophic bioassays
(OECD TG 440)
Minimal criteria
Female weanling rats
Six rats per group
Four treatment groups

The Data Being Extracted
• OECD Test Guideline 440: Uterotrophic Bioassay in Rodents
– The guideline consists of 6 minimal criteria (MC)
– All six MC have to be met for a study to be guideline-like (GL)
– Below: manually annotated (GL) abstract showing the six MC

OECD TG 440: Minimum criteria (MC)
• MC 1: Animal model
• MC 2: Group size
• MC 3: Route of administration
• MC 4: Number of dose groups
• MC 5: Dosing interval
• MC 6: Necropsy timing
Source: Kleinstreuer et al. (2016). A Curated Database of
Rodent Uterotrophic Bioactivity.
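
The rule that all six MC have to be met for a study to be guideline-like (GL) can be written down directly. This is a minimal sketch; the function name `is_guideline_like` and the example studies are hypothetical.

```python
# The six minimum criteria (MC) of OECD TG 440, as listed on the slide.
MINIMUM_CRITERIA = [
    "Animal model",
    "Group size",
    "Route of administration",
    "Number of dose groups",
    "Dosing interval",
    "Necropsy timing",
]


def is_guideline_like(met):
    """A study is guideline-like (GL) only if every minimum criterion is met."""
    return all(met.get(criterion, False) for criterion in MINIMUM_CRITERIA)


# Hypothetical annotations for two studies.
study_a = {criterion: True for criterion in MINIMUM_CRITERIA}
study_b = dict(study_a, **{"Group size": False})

print(is_guideline_like(study_a))  # True
print(is_guideline_like(study_b))  # False
```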

Dataset
• 670 research publications with results for 2,615 uterotrophic
bioassays
– Kleinstreuer et al. (2015). A curated database of rodent uterotrophic
bioactivity. Environmental Health Perspectives.
– ~120 publications (~18%) contain GL studies (~540 out of 2,615)

Information extraction
• Standard approach to document annotation: train a prediction
model from labeled data
• Requires fine-grained sentence/word level annotations
– Obtaining training data can be very costly
– Depending on task, data being extracted can vary significantly
– What if training data isn’t available?

Human learning vs. machine learning
• Human: reads the document (“It talks about dogs”, “It talks about cats”, …)
and labels it “pets”
• Machine: given the document label “pets”, asks which words were in
the document (dog, cat, buy)
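
The machine's view of a document as "which words were in it" is commonly called a bag of words. A minimal sketch, assuming simple lower-cased alphabetic tokens; the example document is made up for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count how often each alphabetic token occurs."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Hypothetical document that a human would label "pets".
document = "You can buy a dog or a cat. A dog needs walks; a cat does not."

bag = bag_of_words(document)
print(bag["dog"], bag["cat"], bag["buy"])  # 2 2 1
```

Word order and sentence structure are discarded; only the counts remain as input to the learning algorithm.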

• A human non-expert can correctly annotate criteria given just a
description of the target information

• Machine Learning (ML) algorithm learns to associate certain
linguistic patterns with the criteria
– These patterns may be hard to spot for a human
• For example, an ML algorithm can learn to correctly predict whether a
document met the criteria based on short text snippets that do not mention
the criteria (such as abstracts)

Our approach
• Goal: extract information for
the purpose of identifying
GL/non-GL documents
• Two approaches:
1. Classify, then extract: Can we
train a good classifier? If so,
extract learned patterns (fully
supervised)
2. Extract, then classify: Extract
text segments relevant to criteria
descriptions (unsupervised), then
classify the extracted segments
Illustration of approach 1: given the document label “pets”, ask why:
which words (dog, cat, buy) were in the document?
Sentence 1:
Pre-pubertal (day 18 of life) female mice
were randomized to receive placebo, 200
mg/kg CTX, or 120 mg/kg CTX.
Similarity to MC1: 0.5478
Sentence 2:
The dosages of CTX used were based on previous studies, which
demonstrated a significant dose-dependent ovarian toxicity.
Similarity to MC1: 0.3887
✓

Classify, then extract
• Train a classifier (e.g. Logistic Regression, Convolutional Neural
Network) to distinguish between publications that met/didn’t meet
criteria
• Extract the linguistic patterns the classifier learns to use to make
the decision
– Occurrence/lack of certain words in text, importance of words
• Pro: theoretically a more accurate method
• Con: requires labeled data – what is the minimal amount of data
needed to make this approach work reliably?
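
A minimal sketch of "classify, then extract": train a bag-of-words logistic regression, then read off the words with the largest weights as the learned patterns. The four training documents and their labels are toy data invented for illustration, and the plain stochastic gradient descent loop is an assumption, not the authors' implementation.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(docs, labels, epochs=200, lr=0.5):
    """Fit bag-of-words logistic regression with plain stochastic gradient descent."""
    weights = {word: 0.0 for doc in docs for word in tokenize(doc)}
    bias = 0.0
    for _ in range(epochs):
        for doc, label in zip(docs, labels):
            counts = Counter(tokenize(doc))
            z = bias + sum(weights[word] * n for word, n in counts.items())
            error = label - 1.0 / (1.0 + math.exp(-z))  # label minus predicted probability
            bias += lr * error
            for word, n in counts.items():
                weights[word] += lr * error * n
    return weights, bias


def predict(text, weights, bias):
    """Probability that the text meets the criterion; unknown words are ignored."""
    z = bias + sum(weights.get(word, 0.0) for word in tokenize(text))
    return 1.0 / (1.0 + math.exp(-z))


# Toy training set: 1 = met the criterion, 0 = did not (hypothetical labels).
docs = [
    "female weanling rats six per group",
    "immature female rats dosed orally",
    "adult male mice cell culture",
    "in vitro cell assay",
]
labels = [1, 1, 0, 0]

weights, bias = train(docs, labels)

# "Extract" step: the words the classifier relies on most.
top_positive = sorted(weights, key=weights.get, reverse=True)[:2]
most_negative = min(weights, key=weights.get)
print(top_positive, most_negative)
```

On this toy data the two highest-weighted words are the ones shared by both positive documents, which is the kind of pattern the extraction step surfaces.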

Extract, then classify
• Utilize criteria descriptions/sample sentences/taxonomy (=a “query”
into the document)
• Extract parts of the document most relevant to the “query” (e.g.
parts which are the most similar to the query)
• Pro: Does not require labeled data, works on individual documents
• Con: What is a good “query” and a reliable extraction (similarity)
method?
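
A minimal sketch of "extract, then classify": score each sentence by its similarity to a criterion "query" and keep the best match. Bag-of-words cosine similarity is used here as a simple stand-in; the similarity method behind the scores in this presentation is not specified on the slides, and the query wording below is hypothetical. The two sentences are the examples shown earlier in the deck.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical query for MC 1 (animal model); not the authors' wording.
query = bag_of_words("pre-pubertal or ovariectomized female rats or mice")

sentences = [
    "The dosages of CTX used were based on previous studies, which "
    "demonstrated a significant dose-dependent ovarian toxicity.",
    "Pre-pubertal (day 18 of life) female mice were randomized to receive "
    "placebo, 200 mg/kg CTX, or 120 mg/kg CTX.",
]

# Rank sentences from most to least similar to the query.
ranked = sorted(((cosine(query, bag_of_words(s)), s) for s in sentences),
                reverse=True)
for score, sentence in ranked:
    print(f"{score:.4f}  {sentence[:60]}")
```

No labeled data is needed, but the result depends entirely on the query wording and the similarity measure, which is the open question raised in the "Con" bullet.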

OECD TG 440: Minimum criteria (MC)
• MC 1: Animal model
• MC 2: Group size
• MC 3: Route of administration
• MC 4: Number of dose groups
• MC 5: Dosing interval
• MC 6: Necropsy timing
Source: Kleinstreuer et al. (2016). A Curated Database of
Rodent Uterotrophic Bioactivity.

Example abstract

Results for MC 1: Animal model
• Top: supervised (classify, then extract), bottom: unsupervised
Correct answer
(underlined)

Results for MC 2: Group size

Results for MC 3: Route of administration

Results for MC 4: Number of dose groups

Results for MC 5: Dosing interval

Results for MC 6: Necropsy timing

Unsupervised (similarity-based) approach
• Top sentences extracted from full text (correct answer underlined)
• Similarity scores: 70.61, 65.31, and 63.69
1. After weaning on pnd 21, the dams were euthanized by CO2 asphyxiation and the juvenile
females were individually housed.
2. Six CD(SD) rat dams, each with reconstituted litters of six female pups, were received from Charles
River Laboratories (Raleigh, NC, USA) on offspring postnatal day (pnd) 16.
3. This validation study followed OECD TG 440, with six female weanling rats (postnatal day 21)
per dose group and six treatment groups.

Current limitations and future work
• “Semantic implication”
• Tables/figures
• Multiple studies
• Answer not in text

Current limitations
• “Semantic implication”
– MC 4 (Number of dose groups): minimum of two dose groups; must have
positive and negative control

Current limitations
• Tables/figures

Current limitations
• Multiple studies
– We currently don’t have a definitive way of recognizing publications with
multiple studies or matching sentences to studies

Current limitations
• Document 24096037 (describes 4 studies), top sentence from full text
for each MC
1. Animal model: (I) Female Sprague-Dawley rats and C57BL/6 mice, ovarectomized on PND
20 and all within 10% of the average body weight (b.w.), were obtained from ...
2. Group size: (I) An equal number of time-matched Veh animals (n=5) were treated in the
same manner.
3. Route of administration: However, o,p-DDT induced Ca2 in the rat, with a clear difference
between EE and o,p-DDT treatment.
4. Number of dose groups: (I) For graphing purposes, the relative expression levels were
scaled such that the expression level of the time-matched control group was equal to one.
5. Dosing interval: Female Sprague-Dawley rats and C57BL/6 mice, ovarectomized on PND 20
and all within 10% of the average body weight (b.w.), were obtained from ...
6. Necropsy timing: (HI) All animals were sacrificed 24 h after the last treatment (72 h after
the initial dose).

Current limitations
• Answer not in text
– We currently don’t have a definitive way of recognizing when the answer is not there

Other improvements
• Making extraction more accurate
– Applying robust evaluation metrics toward the extracted text
– Combining both (supervised and unsupervised) approaches
• Labeling extracted text
• Using ensembles to improve classification / joint modelling of MC
labels (e.g. using Deep Learning)

Conclusions
• A hard problem given the lack of annotations
• Similarity-based approaches may provide a way forward

Funding
• Support for this research was provided by an Interagency Agreement with the National Institute
of Environmental Health Sciences (AES 16002-001) and the U.S. Department of Energy at Oak
Ridge National Laboratory.
• This research was supported in part by an appointment to the Oak Ridge National Laboratory
ASTRO Program, sponsored by the U.S. Department of Energy and administered by the Oak
Ridge Institute for Science and Education.
• This manuscript has been authored by UT-Battelle, LLC and used resources of the Oak Ridge
Leadership Computing Facility at the Oak Ridge National Laboratory under Contract No.
DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and
the publisher, by accepting the article for publication, acknowledges that the United States
Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or
reproduce the published form of this manuscript, or allow others to do so, for United States
Government purposes. The Department of Energy will provide public access to these results of
federally sponsored research in accordance with the DOE Public Access Plan.
