Accelerating Discovery Mining Unstructured Information For Hypothesis Generation Scott Spangler
ACCELERATING DISCOVERY
MINING UNSTRUCTURED
INFORMATION FOR HYPOTHESIS
GENERATION
Scott Spangler
Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series
Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series
SERIES EDITOR
Vipin Kumar
University of Minnesota
Department of Computer Science and Engineering
Minneapolis, Minnesota, U.S.A.
AIMS AND SCOPE
This series aims to capture new developments and applications in data mining and knowledge
discovery, while summarizing the computational tools and techniques useful in data analysis. This
series encourages the integration of mathematical, statistical, and computational methods and
techniques through the publication of a broad range of textbooks, reference works, and handbooks.
The inclusion of concrete examples and applications is highly encouraged. The scope of the
series includes, but is not limited to, titles in the areas of data mining and knowledge discovery
methods and applications, modeling, algorithms, theory and foundations, data and knowledge
visualization, data mining systems and tools, and privacy and security issues.
PUBLISHED TITLES
ACCELERATING DISCOVERY: MINING UNSTRUCTURED INFORMATION FOR
HYPOTHESIS GENERATION
Scott Spangler
ADVANCES IN MACHINE LEARNING AND DATA MINING FOR ASTRONOMY
Michael J. Way, Jeffrey D. Scargle, Kamal M. Ali, and Ashok N. Srivastava
BIOLOGICAL DATA MINING
Jake Y. Chen and Stefano Lonardi
COMPUTATIONAL BUSINESS ANALYTICS
Subrata Das
COMPUTATIONAL INTELLIGENT DATA ANALYSIS FOR SUSTAINABLE
DEVELOPMENT
Ting Yu, Nitesh V. Chawla, and Simeon Simoff
COMPUTATIONAL METHODS OF FEATURE SELECTION
Huan Liu and Hiroshi Motoda
CONSTRAINED CLUSTERING: ADVANCES IN ALGORITHMS, THEORY,
AND APPLICATIONS
Sugato Basu, Ian Davidson, and Kiri L. Wagstaff
CONTRAST DATA MINING: CONCEPTS, ALGORITHMS, AND APPLICATIONS
Guozhu Dong and James Bailey
DATA CLASSIFICATION: ALGORITHMS AND APPLICATIONS
Charu C. Aggarwal
DATA CLUSTERING: ALGORITHMS AND APPLICATIONS
Charu C. Aggarwal and Chandan K. Reddy
DATA CLUSTERING IN C++: AN OBJECT-ORIENTED APPROACH
Guojun Gan
DATA MINING FOR DESIGN AND MARKETING
Yukio Ohsawa and Katsutoshi Yada
DATA MINING WITH R: LEARNING WITH CASE STUDIES
Luís Torgo
FOUNDATIONS OF PREDICTIVE ANALYTICS
James Wu and Stephen Coggeshall
GEOGRAPHIC DATA MINING AND KNOWLEDGE DISCOVERY,
SECOND EDITION
Harvey J. Miller and Jiawei Han
HANDBOOK OF EDUCATIONAL DATA MINING
Cristóbal Romero, Sebastian Ventura, Mykola Pechenizkiy, and Ryan S.J.d. Baker
HEALTHCARE DATA ANALYTICS
Chandan K. Reddy and Charu C. Aggarwal
INFORMATION DISCOVERY ON ELECTRONIC HEALTH RECORDS
Vagelis Hristidis
INTELLIGENT TECHNOLOGIES FOR WEB APPLICATIONS
Priti Srinivas Sajja and Rajendra Akerkar
INTRODUCTION TO PRIVACY-PRESERVING DATA PUBLISHING: CONCEPTS
AND TECHNIQUES
Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu, and Philip S. Yu
KNOWLEDGE DISCOVERY FOR COUNTERTERRORISM AND
LAW ENFORCEMENT
David Skillicorn
KNOWLEDGE DISCOVERY FROM DATA STREAMS
João Gama
MACHINE LEARNING AND KNOWLEDGE DISCOVERY FOR
ENGINEERING SYSTEMS HEALTH MANAGEMENT
Ashok N. Srivastava and Jiawei Han
MINING SOFTWARE SPECIFICATIONS: METHODOLOGIES AND APPLICATIONS
David Lo, Siau-Cheng Khoo, Jiawei Han, and Chao Liu
MULTIMEDIA DATA MINING: A SYSTEMATIC INTRODUCTION TO
CONCEPTS AND THEORY
Zhongfei Zhang and Ruofei Zhang
MUSIC DATA MINING
Tao Li, Mitsunori Ogihara, and George Tzanetakis
NEXT GENERATION OF DATA MINING
Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, and Vipin Kumar
RAPIDMINER: DATA MINING USE CASES AND BUSINESS ANALYTICS
APPLICATIONS
Markus Hofmann and Ralf Klinkenberg
RELATIONAL DATA CLUSTERING: MODELS, ALGORITHMS,
AND APPLICATIONS
Bo Long, Zhongfei Zhang, and Philip S. Yu
SERVICE-ORIENTED DISTRIBUTED KNOWLEDGE DISCOVERY
Domenico Talia and Paolo Trunfio
SPECTRAL FEATURE SELECTION FOR DATA MINING
Zheng Alan Zhao and Huan Liu
STATISTICAL DATA MINING USING SAS APPLICATIONS, SECOND EDITION
George Fernandez
SUPPORT VECTOR MACHINES: OPTIMIZATION BASED THEORY,
ALGORITHMS, AND EXTENSIONS
Naiyang Deng, Yingjie Tian, and Chunhua Zhang
TEMPORAL DATA MINING
Theophano Mitsa
TEXT MINING: CLASSIFICATION, CLUSTERING, AND APPLICATIONS
Ashok N. Srivastava and Mehran Sahami
THE TOP TEN ALGORITHMS IN DATA MINING
Xindong Wu and Vipin Kumar
UNDERSTANDING COMPLEX DATASETS: DATA MINING WITH MATRIX
DECOMPOSITIONS
David Skillicorn
ACCELERATING DISCOVERY
MINING UNSTRUCTURED
INFORMATION FOR HYPOTHESIS
GENERATION
Scott Spangler
IBM Research
San Jose, California, USA
The views expressed here are solely those of the author in his private capacity and do not in any way represent
the views of the IBM Corporation.
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2016 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Version Date: 20150721
International Standard Book Number-13: 978-1-4822-3914-0 (eBook - PDF)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted,
or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
To Karon, my love
Contents
Preface, xvii
Acknowledgments, xxi
Chapter 1 ◾ Introduction 1
Chapter 2 ◾ Why Accelerate Discovery? 9
Scott Spangler and Ying Chen
THE PROBLEM OF SYNTHESIS 11
THE PROBLEM OF FORMULATION 11
WHAT WOULD DARWIN DO? 13
THE POTENTIAL FOR ACCELERATED DISCOVERY:
USING COMPUTERS TO MAP THE KNOWLEDGE SPACE 14
WHY ACCELERATE DISCOVERY: THE BUSINESS
PERSPECTIVE 15
COMPUTATIONAL TOOLS THAT ENABLE ACCELERATED
DISCOVERY 16
Search 16
Business Intelligence and Data Warehousing 17
Massive Parallelization 17
Unstructured Information Mining 17
Natural Language Processing 17
Machine Learning 18
Collaborative Filtering/Matrix Factorization 18
Modeling and Simulation 18
Service-Oriented Architectures 19
Ontological Representation Schemes 19
DeepQA 19
Reasoning under Uncertainty 20
ACCELERATED DISCOVERY FROM A SYSTEM PERSPECTIVE 20
Content Curator 21
Domain-pedia 21
Annotators 23
Normalizers 23
BigInsights Framework 23
Query Services 23
Analytics Services 23
User Interface 23
Catalogue 24
ACCELERATED DISCOVERY FROM A DATA PERSPECTIVE 24
Initial Domain Content and Knowledge Collection 24
Content Comprehension and Semantic Knowledge Extraction 26
Complex and High-Level Knowledge Composition and
Representation 26
New Hypothesis and Discovery Creation 27
ACCELERATED DISCOVERY IN THE ORGANIZATION 28
CHALLENGE (AND OPPORTUNITY) OF ACCELERATED
DISCOVERY 29
REFERENCES 30
Chapter 3 ◾ Form and Function 33
THE PROCESS OF ACCELERATED DISCOVERY 34
CONCLUSION 40
REFERENCE 40
Chapter 4 ◾ Exploring Content to Find Entities 41
SEARCHING FOR RELEVANT CONTENT 42
HOW MUCH DATA IS ENOUGH? WHAT IS TOO MUCH? 42
HOW COMPUTERS READ DOCUMENTS 43
EXTRACTING FEATURES 43
Editing the Feature Space 46
FEATURE SPACES: DOCUMENTS AS VECTORS 47
CLUSTERING 48
DOMAIN CONCEPT REFINEMENT 50
Category Level 50
Document Level 51
MODELING APPROACHES 51
Classification Approaches 52
Centroid 52
Decision Tree 52
Naïve Bayes 52
Numeric Features 52
Binary Features 53
Rule Based 53
Statistical 53
DICTIONARIES AND NORMALIZATION 54
COHESION AND DISTINCTNESS 54
Cohesion 55
Distinctness 56
SINGLE AND MULTIMEMBERSHIP TAXONOMIES 56
SUBCLASSING AREAS OF INTEREST 57
GENERATING NEW QUERIES TO FIND ADDITIONAL
RELEVANT CONTENT 57
VALIDATION 58
SUMMARY 58
REFERENCES 58
Chapter 5 ◾ Organization 61
DOMAIN-SPECIFIC ONTOLOGIES AND DICTIONARIES 61
SIMILARITY TREES 62
USING SIMILARITY TREES TO INTERACT WITH DOMAIN
EXPERTS 65
SCATTER-PLOT VISUALIZATIONS 65
USING SCATTER PLOTS TO FIND OVERLAPS BETWEEN
NEARBY ENTITIES OF DIFFERENT TYPES 67
DISCOVERY THROUGH VISUALIZATION OF TYPE SPACE 69
REFERENCES 69
Chapter 6 ◾ Relationships 71
WHAT DO RELATIONSHIPS LOOK LIKE? 71
HOW CAN WE DETECT RELATIONSHIPS? 72
REGULAR EXPRESSION PATTERNS FOR EXTRACTING
RELATIONSHIPS 72
NATURAL LANGUAGE PARSING 73
COMPLEX RELATIONSHIPS 74
EXAMPLE: P53 PHOSPHORYLATION EVENTS 74
PUTTING IT ALL TOGETHER 75
EXAMPLE: DRUG/TARGET/DISEASE RELATIONSHIP
NETWORKS 75
CONCLUSION 79
Chapter 7 ◾ Inference 81
CO-OCCURRENCE TABLES 81
CO-OCCURRENCE NETWORKS 83
RELATIONSHIP SUMMARIZATION GRAPHS 83
HOMOGENEOUS RELATIONSHIP NETWORKS 83
HETEROGENEOUS RELATIONSHIP NETWORKS 86
NETWORK-BASED REASONING APPROACHES 86
GRAPH DIFFUSION 87
MATRIX FACTORIZATION 87
CONCLUSION 88
REFERENCES 89
Chapter 8 ◾ Taxonomies 91
TAXONOMY GENERATION METHODS 91
SNIPPETS 92
TEXT CLUSTERING 92
TIME-BASED TAXONOMIES 94
Partitions Based on the Calendar 94
Partitions Based on Sample Size 95
Partitions on Known Events 95
KEYWORD TAXONOMIES 95
Regular Expression Patterns 96
NUMERICAL VALUE TAXONOMIES 97
Turning Numbers into X-Tiles 98
EMPLOYING TAXONOMIES 98
Understanding Categories 98
Feature Bar Charts 98
Sorting of Examples 99
Category/Category Co-Occurrence 99
Dictionary/Category Co-Occurrence 100
REFERENCES 101
Chapter 9 ◾ Orthogonal Comparison 103
AFFINITY 104
COTABLE DIMENSIONS 105
COTABLE LAYOUT AND SORTING 106
FEATURE-BASED COTABLES 107
COTABLE APPLICATIONS 109
EXAMPLE: MICROBES AND THEIR PROPERTIES 109
ORTHOGONAL FILTERING 111
CONCLUSION 114
REFERENCE 115
Chapter 10 ◾ Visualizing the Data Plane 117
ENTITY SIMILARITY NETWORKS 117
USING COLOR TO SPOT POTENTIAL NEW
HYPOTHESES 119
VISUALIZATION OF CENTROIDS 123
EXAMPLE: THREE MICROBES 125
CONCLUSION 127
REFERENCE 127
Chapter 11 ◾ Networks 129
PROTEIN NETWORKS 130
MULTIPLE SCLEROSIS AND IL7R 130
EXAMPLE: NEW DRUGS FOR OBESITY 134
CONCLUSION 136
REFERENCE 136
Chapter 12 ◾ Examples and Problems 139
PROBLEM CATALOGUE 139
EXAMPLE CATALOGUE 140
Chapter 13 ◾ Problem: Discovery of Novel Properties of Known Entities 141
ANTIBIOTICS AND ANTI-INFLAMMATORIES 141
SOS PATHWAY FOR ESCHERICHIA COLI 146
CONCLUSIONS 149
REFERENCES 150
Chapter 14 ◾ Problem: Finding New Treatments for Orphan Diseases from Existing Drugs 151
IC50:IC50 152
REFERENCES 158
Chapter 15 ◾ Example: Target Selection Based on Protein Network Analysis 159
TYPE 2 DIABETES PROTEIN ANALYSIS 159
Chapter 16 ◾ Example: Gene Expression Analysis for Alternative Indications 165
Scott Spangler, Ignacio Terrizzano, and Jeffrey Kreulen
NCBI GEO DATA 165
CONCLUSION 173
REFERENCES 174
Chapter 17 ◾ Example: Side Effects 175
Chapter 18 ◾ Example: Protein Viscosity Analysis Using Medline Abstracts 183
DISCOVERY OF ONTOLOGIES 184
USING ORTHOGONAL FILTERING TO DISCOVER
IMPORTANT RELATIONSHIPS 187
REFERENCE 194
Chapter 19 ◾ Example: Finding Microbes to Clean Up Oil Spills 195
Scott Spangler, Zarath Summers, and Adam Usadi
ENTITIES 196
USING COTABLES TO FIND THE RIGHT COMBINATION
OF FEATURES 199
DISCOVERING NEW SPECIES 202
ORGANISM RANKING STRATEGY 205
CHARACTERIZING ORGANISMS 206
Respiration 209
Environment 215
Substrate 215
CONCLUSION 216
Chapter 20 ◾ Example: Drug Repurposing 225
COMPOUND 1: A PDE5 INHIBITOR 226
PPARα/γ AGONIST 228
Chapter 21 ◾ Example: Adverse Events 231
FENOFIBRATE 231
PROCESS 232
CONCLUSION 237
REFERENCES 239
Chapter 22 ◾ Example: P53 Kinases 241
AN ACCELERATED DISCOVERY APPROACH BASED ON
ENTITY SIMILARITY 243
RETROSPECTIVE STUDY 246
EXPERIMENTAL VALIDATION 248
CONCLUSION 250
REFERENCE 251
Chapter 23 ◾ Conclusion and Future Work 253
ARCHITECTURE 254
FUTURE WORK 255
ASSIGNING CONFIDENCE AND PROBABILITIES TO
ENTITIES, RELATIONSHIPS, AND INFERENCES 255
DEALING WITH CONTRADICTORY EVIDENCE 259
UNDERSTANDING INTENTIONALITY 259
ASSIGNING VALUE TO HYPOTHESES 261
TOOLS AND TECHNIQUES FOR AUTOMATING THE
DISCOVERY PROCESS 261
CROWD SOURCING DOMAIN ONTOLOGY CURATION 262
FINAL WORDS 262
REFERENCE 262
Preface
A few years ago, having spent more than a decade doing
unstructured data mining of one form or another, in domains
spanning helpdesk problem tickets, social media, and patents, I thought
I fully understood the potential range of problems and likely areas of
applicability of this mature technology. Then, something happened that
completely changed how I thought about what I was doing and what its
potential really was.
The change in my outlook began with the Watson Jeopardy challenge.
Seeing a computer learn from text to play a game I had thought was far
beyond the capabilities of any artificial intelligence opened my eyes to new
possibilities. And I was not alone. Soon many customers were coming forward
with their own unique problems, problems I would have said a few
years ago were just too hard to solve with existing techniques. And now, I
said, let’s give it a try.
This wasn’t simply a straightforward application of the algorithms
used to win Jeopardy in a different context. Most of the problems I was
being asked to solve weren’t really even question-answering problems
at all. But they all had a similar quality in that they forced us to digest
all of the information in a given area and find a way to synthesize a new
kind of meaning out of it. This time the problem was not to win a game
show, but (to put it bluntly) to advance human scientific knowledge. More
than once in the early going, before we had any results to show for our
efforts, I wondered if I was out of my mind for even trying this. Early on, I
remember making more than one presentation to senior executives at my
company, describing what I was doing, half expecting they would tell me
to cease immediately, because I was attempting something way too hard
for current technology and far outside the bounds of a reasonable business
opportunity. But fortunately, no one ever said that, and I kept on going.
Somewhere along the way (I can’t say just when), I lost all doubt that I
was really onto something very important. This was more than another
new application of unstructured data mining. And as each new scientific
area came forward for analysis, the approach we were using began to
solidify into a kind of methodology. And then, just as this was happening,
I attended the ACM Knowledge Discovery and Data Mining Conference
in 2013 (KDD13); I met with CRC Press in a booth at the conference and
told them about my idea. Shortly thereafter, I was signed up to write a
book.
I knew at the time that what would make this book especially challenging
is that I was still proving the methodology and tweaking it, even as
I was writing out the description of that method. This was not ideal, but
neither could it be avoided if I wanted to broaden the application of the
method beyond a small team of people working in my own group. And if
it could not be broadened, it would never realize its full potential.
Data science is a new discipline. It lacks a curriculum, a set of textbooks,
a fundamental theory, and a set of guiding principles. This is both
regrettable and exciting. It must be rectified if the discipline is to become
established. Since I greatly desire that end, I write this book in the hopes
of furthering it.
Many years ago, I remember stumbling across a book by the statistician
John Tukey called Exploratory Data Analysis. It was written 30 years before
I read it, and the author was no longer living; yet it spoke to me as if I were
his research collaborator. It showed me how the ideas I had been grappling
with in the area of unstructured text had been similarly addressed in the
realm of structured data. Reading that book gave me renewed confidence
in the direction I was taking and a larger vision for what the fulfillment of
that vision might one day accomplish.
This book is one more step on that journey. The journey is essentially
my life’s work, and this book is that work’s synthesis thus far. It is far from
perfect, as anything that is real will always be a diminishment from what
is merely imagined. But I hope even in its imperfect state it will communicate
some part of what I experience on a daily basis working through
accelerated discovery problems as a data scientist. I think it is unquestionably
the most rewarding and exciting job in the world. And I dare to hope
that 30 years from now, or maybe even sooner, someone will pick up this
book and see their own ideas and challenges reflected in its pages and feel
renewed confidence in the direction they are heading.
At the same time, I fear that one or two readers (at least) will buy this
book and immediately be disappointed because it is not at all what they
were expecting it would be. It’s not a textbook. It’s not a business book. It’s
not a popular science book. It doesn’t fit well in any classification. I can
only say this: I wrote it for the person I was a few years back. Read it with
an open mind: you might find you get something useful out of it, regardless
of the failure to meet your initial expectations. It took me a long time
to get to this level of proficiency in knowing how to address accelerated
discovery problems. I’ve tried my best to capture exactly how I go about
it, both from a systematic perspective and from a practical point of view.
This book provides motivation, strategy, tactics, and a heterogeneous set
of comprehensive examples to illustrate all the points I make. If it works
as I have intended it to, it will fill an important gap left by the other types
of books I have mentioned…the ones you thought this might be. You can
still buy those other books as well, but keep this one. Come back to it later
when you have started to put in practice the theories that you learned in
school to solve real-world applications. You may find then that the book
has more to say to you than you first thought.
Today, I go into each new problem domain with complete confidence
that I know how to get started; I know the major steps I need to accomplish,
and I have a pretty good idea what the final solution will look like (or
at least I know the range of things it might look like when we first deliver
a useful prototype). It wasn’t always that way. Those first few customer
engagements of this type that I did, I was mostly winging it. It was exciting,
no doubt, but I would have really loved to have this book on my desk
(or in my e-reader) to look over after each meeting and help me figure out
what I should do next. If you are fortunate enough to do what I do for a
living, I think you will (eventually) find this book worthwhile.
Acknowledgments
There were many people who were instrumental in the creation of
this methodology and in the process of writing the book that explains
it. First, the team at IBM Watson Innovations, who made it all possible:
Ying Chen, Meena Nagarajan, Qi He, Linda Kato, Ana Lelescu, Jacques
LaBrie, Cartic Ramakrishnan, Sheng Hua Boa, Steven Boyer, Eric Louie,
Anshu Jain, Isaac Cheng, Griff Weber, Su Yan, and Roxana Stanoi. Also
instrumental in realizing the vision were the team at Baylor College of
Medicine, led by Olivier Lichtarge, with Larry Donehower, Angela Dawn
Wilkins, Sam Regenbogen, Curtis Pickering, and Ben Bachman.
Jeff Kreulen has been a collaborator for many years now and continues
to be a big supporter and contributor to the ideas described here.
Michael Karasick and Laura Haas have been instrumental in consistently
supporting and encouraging this work from a management perspective at
IBM. John Richter, Meena Nagarajan, and Peter Haas were early reviewers
of my first draft, and I appreciate their input. Ying Chen helped write the
chapter on Why Accelerate Discovery?, for which I am most grateful. Pat
Langley provided some very good advice during the planning phase for
the book, which I profited from.
Finally, and most importantly, my wife, Karon Barber, who insisted
that I finish this project, at the expense of time that I would rather have
spent with her. Nothing I’ve accomplished in life would have happened
without her steadfast faith and love.
Chapter 1
Introduction
This book is about discovery in science and the importance of
heterogeneous data analytics in aiding that discovery. As the volume
of scientific data and literature increases exponentially, scientists
need ever more powerful tools to process and synthesize that information
in a practical and meaningful way. But in addition, scientists
need a methodology that takes all the relevant information in a given
problem area—all the available evidence—and processes it in order to
propose the set of potential new hypotheses that are most likely to be
both true and important. This book describes a method for achieving
this goal.
But first, I owe the reader a short introduction and an explanation of
why I am the one writing this book. The short answer is a lucky accident
(lucky for me anyway; for you it remains to be seen). I stumbled into a
career doing the most exciting and rewarding work I can imagine. I do
not know it for a fact, but I suspect that I have done more of this kind
of work and for a longer period of time than anyone else now alive. It is
this experience that I now feel compelled to share, and it is that experience
that should make the book interesting reading for those who also
see the potential of the approach but do not know how to get started
applying it.
It all started out with a love of mathematics, in particular discrete
mathematics, and to be even more specific: combinatorics. Basically this is
the study of how to count things. I was never happier than when I found
this course in college. The discovery of a discipline devoted precisely to
what one instinctively loves is one of life’s greatest joys. I was equally
disappointed to find there was no such thing as a career in combinatorics,
outside of academia—at least, not at that time.
But I wandered into computer science, and from there into machine
learning and from there into text mining, and suddenly I became aware
that the skill and practice of knowing how to count things had a great deal
of practical application after all. And now 30 years have passed since I finished
that combinatorics course, and with every passing year the number,
variety, importance, and fascination of the problems I work on are still
increasing.
Nothing thrills me more than to have a new data set land in my inbox.
This is especially so if it is some kind of data I have never seen before, better
still if analyzing it requires me to learn about a whole new field of knowledge,
and best yet if the result will somehow make a difference in the world.
I can honestly say I have had the privilege of working on such problems,
not once, not twice, but more times than I can reckon. I do not always succeed,
but I do make progress often enough that more and more of these
problems seem to find their way to me. At some level I wish I could do them
all, but that would be selfish (and not very practical). So I am writing this
book instead. If you love working with unstructured, heterogeneous data
the way I do, I believe this book will have a positive impact on your career,
and that you will in turn have a positive impact on society.
This book is an attempt to document and teach Accelerated Discovery
to the next generation of data scientists. Once you have learned these
techniques and practiced them in the lab, your mission will be to find
a scientist or an engineer struggling with a big data challenge and help
them to make a better world. I know these scientists and engineers exist,
and I know they have these challenges, because I have talked to them
and corresponded with them. I have wished there were more of me to go
around so that I could help with every one of them, because they are all
fascinating and all incredibly promising and worthy efforts. But there are
only so many hours in a week, and I can only pick a few of the most promising
to pursue, and every one of these has been a rewarding endeavor.
For the rest and for those problems that will come, I have written this
book.
This book is not a data-mining manual. It does not discuss how to build
a text-classification engine or the ins and outs of writing an unsupervised
clustering implementation. Other books already do this, and I could not
surpass these. This book assumes you already know how to process data
using the basic bag of tools that are now taught in any good data-mining
or machine-learning course. Where those courses leave off, this book
begins. The question this book answers is how to use unstructured mining
approaches to solve a really complex problem in a given scientific domain.
How do you create a system that can reason in a sophisticated way about a
complex problem and come up with solutions that are profound, nonobvious,
and original?
From here on, this book is organized in a more or less top-down fashion.
The next chapter discusses the importance of the Accelerated Discovery
problem space and why the time has come to tackle it with the tools we
currently have available. Even if you are already motivated to read the
book, do not skip this chapter, because it contains some important material
about how flexibly the technology can be applied across a wide swath
of problems.
What follows immediately thereafter is a set of five chapters that
describe the method at a fairly high level. These are the most important
chapters of the book, because they should be in the front of your mind
each time you face a new analytics challenge in science. First there is a
high-level description of our method for tackling these problems, followed
by four detailed chapters giving a general approach to arriving at a solution.
When put together, these five chapters essentially cover our method
for accelerating discovery. Not every problem you encounter will use every
step of this method in its solution, but the basic approach can be applied in
a more or less universal way.
The next section brings the level of detail down to specific technologies
for implementing the method. These are less universal in character but
hopefully will make the method more concrete. This set of four chapters
goes into greater detail about the tools and algorithms I use to help realize
the approach in practice. It is not complete, but hopefully it will be
illustrative of the kinds of techniques that can make the abstract process
a reality.
The rest of the book is made up of sample problems or examples of how
this really works in practice. I included ten such examples because it was
a nice round number, and I think the examples I have selected do provide
a good representative sample of this kind of engagement. All of these
examples are from real scientists, are based on real data, and are focused
on important problems in discovery. The examples all come from the life-sciences
area, but that is not meant to be the only area where these techniques
would apply; in fact, I have applied them in several other sciences,
including materials and chemistry. But my best physical science examples
are not publishable due to proprietary concerns, so for this book I have
chosen to focus on the science of biology.
That is how the book is organized, but do not feel you have to read it
this way. You could just as well start with the examples and work your
way back to the earlier chapters when you want to understand the method
in more detail. You will quickly notice that not every problem solution
employs every step of the methodology anyway. The methodology is a
flexible framework on which to assemble the components of your solu-
tion, as you need them, and where they make sense. And it is meant to be
iterative and to evolve as you get deeper into the information complexity
of each new domain.
As you read the book, I hope that certain core principles of how to be
a good data scientist will naturally become apparent. Here is a brief cata-
logue of those principles to keep in mind each time you are faced with a
new problem.
• The whole is greater than the sum of the parts: As scientists we natu-
rally tend toward reductionism when we think about how to solve a
problem. But in data science, it is frequently the case that, by con-
sidering all the relevant data at once, we can learn something that
we cannot see by looking at each piece of data in isolation. Consider
ways to unify everything we know about an individual entity as a
complete picture. What you learn is frequently surprising.
• More X is not always better: There is a wishful tendency among those
less familiar with the problems of data science to imagine that every
problem can be solved with more data, no matter how irrelevant that
data happens to be; or that, if we have run out of data, then adding
more features to the feature space ought to help; or, if that fails, that
adding more categories to our taxonomy should help, and so on. The
operative concept is “more is always better.” And certainly, one is supposed
to assume that at least more stuff can never hurt; the solution
must be in there somewhere. But the problem is, if you add mostly
more noise, the signal gets harder to find, not easier. Careful selec-
tion of the right data, the right features, and the right categories is
always preferable to indiscriminate addition.
• Compare and contrast: Measuring something in isolation does not
tell you very much. Only when you compare the value to some other
related thing does it begin to have meaning. If I tell you a certain
Introduction   ◾    5
baseball player hit 50 home runs last season, this will not mean much
if you know nothing about the game. But if you know what percentile
that puts him in compared to other players, that tells you something,
especially if you also take into account plate appearances, difficulty
of pitchers faced, and the ball parks he played in. The point is that too
often in data science, we are tempted to look too narrowly at only one
aspect of a domain in order to get the precise number we are look-
ing for. We also need to look more broadly in order to understand
the implications of that value: to know whether it means anything of
importance.
• Divide and conquer: When you have a lot of data you are trying to
make sense of, the best strategy for doing this is to divide it into
smaller and smaller chunks that you can more easily comprehend.
But this only works if you divide up the data in a way that you can
make sense of when you put it all back together again. For exam-
ple, one way to divide up data is by letter of the alphabet, but this is
unlikely to make any one category much different than any other,
and thus the problem has not become any easier within each subcat-
egory. But if I focus on concepts rather than syntax, I stand a much
better chance of being enlightened at the end.
• “There’s more than one way to…”: Being a cat lover, I shy away from
completing that statement, but the sentiment is no less true for being
illustrated in such an unpleasant way. Once we find a solution or
approach, our brains seem to naturally just turn off. We have to avoid
this trap and keep looking for other ways to arrive at the result. If we
apply these additional ways and get the same answer, we can be far
more confident than we were that the answer is correct. If we apply
the additional approaches and get a different answer, that opens up
whole new areas for analysis that were closed to us before. Either
way, we win.
• Use your whole brain (and its visual cortex): Find a way to make the
data draw a picture, and you will see something new and important
that was hidden before. Of course, the challenge is to draw the pic-
ture that illustrates the key elements of importance across the most
visible dimensions. Our brains have evolved over time to take in
vast amounts of data through the eyes and convert it effortlessly into
a reasonably accurate view of what is going on around us. Find a
way to put that powerful specialized processor to work on your data
problem and you will inevitably be astounded at what you can see.
• Everything is a taxonomy/feature vector/network: At the risk of
oversimplifying things a bit, there are really only three basic things
you need to know to make sense of data: What the entities are that
you care about and how they relate to each other (the taxonomy),
how you can describe those entities as features (feature vector), and
how you can represent the way those entities interact (network).
Every problem involves some subset or combination of these ideas. It
really is that simple.
• Time is not a magazine: The data we take to begin our investigation
with is usually static, meaning it sits in a file that we have down-
loaded from somewhere, and we make sure that file does not change
over time (we may even back it up to be absolutely sure). This often
leads us to forget that change is the only constant in the universe,
and over time we will find that our file bears less and less relation
to the new reality of now. Find a way to account for time and to use
time recorded in data to learn how things evolve.
• All data is local: A corollary to the problem of time is the problem
of localization. Most data files we work with are subsets of the larger
data universe, and thus we have to generalize what we learn from
them to make them applicable to the real universe. That generaliza-
tion problem is going to be much harder than you realize. Prejudice
toward what we know and ignorance of what we do not is the bane
of all future predictions. Be humble in the face of your own limited
awareness.
• Prepare for surprise: If you are not constantly amazed by what you
find in data, you are doing something wrong.
Hopefully this brief introduction gives you some sense of the ideas to
keep in mind as you begin to master this discipline. Discovery is always
hard and always involves synthesizing different kinds of data and ana-
lytics. The crucial step is to make all those moving parts work together
constructively to illuminate what lies just beyond the known. The key
ingredient is figuring out what to count and how to count it. In the end,
everything is just combinatorics all the way down!
I hope this book helps you to solve complex and important problems
of this type. More than that, I encourage you to develop your own meth-
ods for accelerating discovery and publish them as I have done mine. This
problem is too important for one small group of data scientists in one
organization to have all the fun. Come join us.
Accelerating Discovery Mining Unstructured Information For Hypothesis Generation Scott Spangler
Chapter 2
Why Accelerate
Discovery?
Scott Spangler and Ying Chen
There is a crisis emerging in science due to too much data. On the
surface, this sounds like an odd problem for a scientist to have. After
all, science is all about data, and the more the better. Scientists crave data;
they spend time and resources collecting it. How can there be too much
data? After all, why can scientists not simply ignore the data they do not
need and keep the data they find useful?
But therein lies the problem. Which data do they need? What data will
end up proving useful? Answering this question grows more difficult with
increasing data availability. And if data grows exponentially, the problem
may reach the point where individual scientists cannot make optimal
decisions based on their own limited knowledge of what the data contains.
I believe we have reached this situation in nearly all sciences today. We
have certainly reached it in some sciences.
So by accelerating discovery, I do not simply mean doing science the
way we do it today, only faster; I really mean doing science in a profoundly
new way, using data in a new way, and generating hypotheses in a new
way. But before getting into all that, I want to present some historical con-
text in order to show why science the way it has always been practiced is
becoming less and less viable over time.
To illustrate what I mean, consider the discovery of evolution by
Charles Darwin. This is one of the most studied examples of great science
in history, and in many ways his example provided the template for all
scientific practice for the next 150 years. On the surface, the story is
remarkably elegant and straightforward. Darwin travels to the Galapagos
Islands, where he discovers a number of new and unique species. When
he gets back from his trip, he notices a pattern of species changing over
time. From this comes the idea of species “evolution,” which he publishes
to much acclaim and significant controversy. Of course, what really hap-
pened is far more complex and illuminating from the standpoint of how
science was, and for the most part is still, actually practiced.
First of all, as inhabitants of the twenty-first century, we may forget
how difficult world travel was back in Darwin’s day. Darwin’s voyage was
a survey mission that took him around the world on a sailing ship, the
Beagle. He made many stops and had many adventures. Darwin left on
his trip in 1831 and returned five years later, in 1836. During that time,
he collected many samples from numerous locations and he took copi-
ous notes on everything he did and saw. After he got back to England, he
then spent many years collating, organizing, and systematically catalogu-
ing his specimens and notes. In 1839, he published, to much acclaim, a
book describing the incidents of this voyage (probably not the one you are
thinking of; that one came much later): Journal and Remarks, Voyage of
the Beagle. Darwin then spent the next 20 years doing research and col-
lecting evidence on plants and animals and their tendency to change over
time. But though he was convinced the phenomenon was real, he still did
not have a mechanism by which this change occurred.
Then Darwin happened upon Essay on the Principle of Population (1798)
by Thomas Malthus. It introduced the idea that animals produce more
offspring than typically actually survive. This created a “struggle for exis-
tence” among competing offspring. This led Darwin directly to the idea of
“natural selection.” The Origin of Species was published in 1859, 28 years
after the Beagle left on its voyage. And of course, it was many decades
before Darwin’s theory would be generally accepted.
There are certain key aspects of this story that I want to highlight as
particularly relevant to the question “Why Accelerate Discovery?” The
first has to do with the 20 years it took Darwin to collect and analyze the
data he felt was necessary to develop and validate his theory. The second
is related to the connection that Darwin made between his own work and
that of Malthus. I think both of these phenomena are characteristic of the
big data issue facing scientists both then and now. And if we think about
them carefully in the context of their time and ours, we can see how it
Why Accelerate Discovery?   ◾    11
becomes imperative that scientists working today use methods and tools
that are far more powerful than those of Darwin and his contemporaries.
THE PROBLEM OF SYNTHESIS
When Darwin returned from his 5-year voyage, he had a formidable col-
lection of notes and specimens to organize and catalogue. This step took
him many years; longer, in fact, than it took him to collect the data in the
first place, but it was crucial to the discovery process. We often think of
scientific discovery as a Eureka moment—a bolt from the blue. But in real-
ity, it is much more frequently the result of painstaking labor carried out
to collect, organize, and catalogue all the relevant information. In essence,
this is a problem of synthesis. Data hardly ever comes in a form that can be
readily processed and interpreted. In nearly every case, the genius lies not
in the finding and collecting of the data but in the organization scheme
that makes the interpretation possible. In a very real sense, all significant
scientific discoveries are about overcoming the data-synthesis problem.
Clearly, data synthesis is hard (because otherwise everyone would do
it), but what makes it so? It is often not easy to see the effort required if
only the result is observed. This is because the most difficult step, the part
that requires the real genius, is almost invisible. It is hidden within the
structure of the catalogue itself. Let us look at the catalogue of specimens
Darwin created.
Darwin’s task in specimen organization and cataloguing was not just to
record the species and physical characteristics of each specimen—it was
to find the hidden relationships between them, as illustrated in Figure 2.1.
Organizing data into networks of entities and relationships is a recurring
theme in science. Taxonomies and ontologies are another manifestation of
this. Taxonomies break entities down into classes and subclasses based on
some measure of similarity, so that the further down the tree you go, the
more alike things are within the same class. Ontologies represent a more
general kind of entity network that expresses how entities relate to each
other in the world. Creating such networks from raw data is the problem
of synthesis. The more data there is, and in particular the more hetero-
geneous the forms of that data, the more challenging synthesis becomes.
THE PROBLEM OF FORMULATION
Once Darwin had synthesized his data, it became clear to him that species
did indeed change over time. But merely to observe this phenomenon was
not enough. To complete his theory, he needed a mechanism by which
this change takes place. He needed to create a model that could explain
how the data points (i.e., the species) connected to each other; otherwise
all he would have is a way to organize the data, without having any addi-
tional insight into what the data meant. Creating this additional insight
that emerges from synthesis is the problem of formulation.
Formulation requires the creation of an equation or algorithm that
explains a process or at least simulates or approximates mathematically
how that process behaves in the physical world.
From a data-science perspective, formulation requires extracting
patterns that may appear across many disparate, heterogeneous data
collections. Going beyond synthesis to explanation may require data visu-
alization and sometimes even analogy. It requires pattern matching and
FIGURE 2.1 Illustrations from The Origin of Species.
being able to draw from a wide array of related data. This is what Darwin
was able to do when he drew on the writings of Thomas Malthus to dis-
cover the driving mechanism behind species change. Darwin reused an
existing formulation, the struggle for existence among competing off-
spring (i.e., “survival of the fittest”), and applied it to competition among
all living things in order to arrive at a formulation of how species evolve.
The process of formulation begins with observation. The scientist
observes how entities change and interact over time. She observes which
properties of entities tend to occur together and which tend to be indepen-
dent. Often, data visualization—charts or graphs, for example—is used to
summarize large tables of numbers in a way that the human visual cortex
can digest and make sense of. The synthesis of data is one of the key steps
in discovery—one that often looks obvious in retrospect but, at the begin-
ning of research, is far from being so in most cases.
WHAT WOULD DARWIN DO?
The process of synthesis and formulation used by Darwin and other
scientists worked well in the past, but this process is increasingly prob-
lematic. To put it bluntly, the amount and variety of data that needs to be
synthesized and the complexity of the models that need to be formulated
have begun to exceed the capacity of individual human intelligence. To see
this, let us compare the entities and relationships described in The Origin
of Species to the ones that today’s biologists need to grapple with.
Today’s biologists need to go well beyond the species and physical anat-
omy of organisms. Today, biology probes life at the molecular level. The
number of different proteins that compose the cells in the human organ-
ism is over a million. Each of these proteins has different functions in the
cell. Individual proteins work in concert with other proteins to create
additional functionality. The complexity and richness of all these interac-
tions and functions is awe inspiring. It is also clearly beyond the capability
of a single human mind to grasp. And all of this was entirely unknown to
Darwin.
To look at this in a different way, the number of scholarly publications
available for Darwin to read in his field might have been on the order
of 10,000–100,000 at most. Today, that number would be on the
order of fifty million [1].
How do the scientists of today even hope to fathom such complexity and
scale of knowledge? There are two strategies that every scientist employs
to one degree or another: specialization and consensus. Each scientist
chooses an area of specialization that is narrow enough to encompass a
field wherein they can be familiar with all the important published litera-
ture and findings. Of course, this implies that as time goes on and more
and more publications occur, specialization must grow more and more
intense. This has the obvious drawback of narrowing the scope of each
scientist’s knowledge and the application of their research. In addition,
each scientist will read only the publications in the most prestigious, high-
profile journals. These will represent the best consensus opinion of the
most important research in their area. The drawback is that consensus in
science is frequently wrong. Also, if the majority of scientists are pursuing
the same line of inquiry, the space of possible hypotheses is very inefficiently
and incompletely explored.
THE POTENTIAL FOR ACCELERATED DISCOVERY: USING
COMPUTERS TO MAP THE KNOWLEDGE SPACE
But all is not lost for the scientists of today, for the very tools that help
generate the exponentially increasing amounts of data can also help to
synthesize and formulate that data. Due to Moore’s Law, scientists have had
and will most likely continue to have exponentially increasing amounts
of computational power available to them. What is needed is a way to
harness that computational power to carry out better synthesis and for-
mulation—to help the scientist see the space of possibilities and explore
that space much more effectively than they can today. What is needed is
a methodology that can be repeatedly employed to apply computation to
any scientific domain in such a way as to make the knowledge space com-
prehensible to the scientist’s brain.
The purpose of this book is to present one such methodology and to
describe exactly how to carry it out, with specific examples from biology
and elsewhere. We have shown this method to be an effective way to syn-
thesize all published literature in a given subject area and to formulate
new properties of entities based on everything that we know about those
entities from previous results. This leads us to conclude that the meth-
odology is an effective tool for accelerating scientific discovery. Since the
methods we use are in no way specific to these examples, we think there
is a strong possibility that they may be effective in many other scientific
domains as well.
Moreover, regardless of whether our particular methodology is opti-
mal or effective for any particular scientific domain, the fact remains
that all scientific disciplines that are pursued by ever-increasing numbers
of investigators must ultimately address this fundamental challenge:
Eventually, the rate of data publication will exceed the individual human
capacity to process it in a timely fashion. Only with the aid of computation
can the brain hope to keep pace. The challenges we address here and the
method we employ to meet those challenges will continue to be relevant
and essential for science for the foreseeable future.
So clearly the need exists and will continue to increase for aiding sci-
entific discovery with computational tools. But some would argue that no
such tools exist, or that if they do exist, they are still too rudimentary to
really create value on a consistent basis. Computers can do computation
and information retrieval, but scientific discovery requires creativity and
thinking “outside the box,” which is just what computers cannot do. A few
years ago, the authors would have been largely in agreement with this
viewpoint, but something has changed in the field of computer science
that makes us believe that accelerating scientific discovery is no longer a
distant dream but is actually well within current capability. Later in this
chapter, we will describe these recent developments and preview some of
the implications of these emerging capabilities.
WHY ACCELERATE DISCOVERY: THE BUSINESS PERSPECTIVE
Discovery is central and critical to the whole of humanity and to many
of the world’s most significant challenges. Discovery represents an ability
to uncover things that were not previously known. It underpins all innovations
(Figure 2.2).
Looking at what we human beings consume—for example, consumer
goods such as food, clothing, household items, and energy—we would
quickly realize that we need significant innovations across the board.
We need to discover new ways to generate and store energy, new water
[Figure 2.2 content: Major discoveries and innovation are critical to many
world challenges and the success of many companies across industries.
Panels: Smarter planet/consumer goods (water filtration, product innovation);
Information technology (nanotechnologies, mobile); Life sciences (drug
discovery, biomedical research); Energy storage and generation (batteries,
solar, CO2).]
FIGURE 2.2 Example application areas for Accelerated Discovery.
filtration methods, and new product formations for food and other
goods so that they are more sustainable for our environments and
healthier for human beings. We need these innovations more desper-
ately than ever.
Looking at what we make and build—for example, new computer and
mobile devices, infrastructures, and machines—again, the need for dis-
covery and innovation is at the center of all these. We need new kinds of
nanotechnologies that can scale human capacity to unimaginable limits,
new materials that have all the desired properties that will lower energy
consumption while sustaining their intended functions, and new designs
that can take a wide variety of factors into consideration.
Looking at ourselves, human beings, our own wellbeing depends heav-
ily on the discovery and innovation in healthcare, life sciences, and a wide
range of social services. We need a much better understanding of human
biology. We need to discover new drugs and new therapies that can target
diseases much more effectively and efficiently.
Yet today, the discovery processes in many industries are slow, manual,
expensive, and ad hoc. For example, in drug discovery, it takes on average
10–15 years to develop one drug, and costs hundreds of millions of dol-
lars per drug. The attrition rate of drug development today is over 90%.
Similarly, new energy forms, such as the lithium battery, take tens of years
to discover. New consumer product formations are mostly done on a trial-
and-error basis. There is a need across all industries for a reliable, repeat-
able process to make discovery more cost-effective.
COMPUTATIONAL TOOLS THAT ENABLE
ACCELERATED DISCOVERY
Accelerated Discovery is not just one capability or algorithm but a com-
bination of many complementary approaches and strategies for synthe-
sizing information and formulating hypotheses. The following existing
technologies represent the most important enablers of better hypothesis
generation in science.
Search
The ability to index and retrieve documents based on the words they con-
tain across all relevant content in a given scientific field is a primary enabler
of all the technologies that are involved in Accelerated Discovery. Being
able to selectively and rapidly acquire all the relevant content concerning
a given subject of interest is the first step in synthesizing the meaning of
that content. The easy availability, scalability, and application of search to
this problem space have made everything else possible.
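As a rough illustration of indexing and retrieving documents by the words they contain, here is a minimal inverted-index sketch. The toy corpus and the function names (`tokenize`, `build_index`, `search`) are assumptions made for this example and are not taken from any system described in this book; a production search engine adds ranking, stemming, and scale.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical toy corpus; a real system would index full-text literature.
documents = {
    "doc1": "Protein kinase A phosphorylates the target protein",
    "doc2": "A kinase inhibitor reduces tumor growth",
    "doc3": "Tumor suppressor protein p53 regulates the cell cycle",
}
index = build_index(documents)
hits = search(index, "protein kinase")
```

With this corpus, only `doc1` mentions both query words, so it is the only hit; the retrieved set is then the raw material for the synthesis steps that follow.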
Business Intelligence and Data Warehousing
The ability to store, retrieve, and summarize large amounts of structured
data (i.e., numeric and categorical data in tables) allows us to deal with
all kinds of information in heterogeneous formats. This gives us the criti-
cal ability to survey scientific discoveries over time and space or to com-
pare similar treatments on different populations. The ability to aggregate
data and then accurately compare different subsets is a technology that
we apply over and over again in our approach as we seek to determine the
credibility and reliability of each fact and conclusion.
Massive Parallelization
In recent years, Hadoop and MapReduce frameworks [2] have made par-
allelization approaches much more applicable to real-world computing
problems. This gives us the ability to attack hard problems involving large
amounts of data in entirely new ways. In short, we can build up a number
of simple strategies to mine and annotate data that can, in aggregate, add
up to a very sophisticated model of what the data actually means. Massive
parallelization also allows us to try out thousands of approaches and com-
binations in real time before selecting the few candidates that are most
likely to succeed based on our models and predictions.
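The idea of building up simple annotation strategies that add up in aggregate can be sketched with the map, shuffle, and reduce phases written out in a single process. This is only an illustration of the programming model, not Hadoop code: the `WATCHED_TERMS` set, the documents, and all function names are hypothetical, and a real framework would run the map and reduce calls on many machines.

```python
from collections import defaultdict

# Hypothetical terms of interest; a real annotator would use a curated dictionary.
WATCHED_TERMS = {"p53", "kinase"}


def map_phase(doc_id, text):
    """Emit a (term, 1) pair for each watched term mentioned in one document."""
    for token in text.lower().split():
        token = token.strip(".,;:()")
        if token in WATCHED_TERMS:
            yield token, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single total."""
    return key, sum(values)


def run_job(documents):
    """Run map, shuffle, and reduce serially over a dict of documents."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_phase(doc_id, text))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


documents = {
    "d1": "The kinase phosphorylates p53.",
    "d2": "p53 mutations are common; p53 loss is frequent.",
    "d3": "A kinase inhibitor was tested.",
}
counts = run_job(documents)
```

Because each `map_phase` call looks at one document in isolation, the work can be split across as many workers as there are documents, which is what makes the approach attractive for large collections.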
Unstructured Information Mining
Most of the critical information in science is unstructured. In other words,
it comes in the form of words, not numbers. Unstructured information
mining provides the ability to reliably and accurately convert words into
other kinds of structures that computers can more readily deal with. As
we will see in this book, this is a key element of the accelerated discovery
process. This allows us to go beyond retrieving the right document, to
actually discovering hidden relationships between the elements described
by those documents.
Natural Language Processing
The ability to recognize entities, relationships or transitions, and features
and properties and to attribute them appropriately requires natural lan-
guage processing. This technology allows us to parse the individual ele-
ments of the sentence, identify their part of speech, and determine to what
they refer. It can also allow us to discover the intentionality of the author.
These natural language processing abilities enable the precise determina-
tion of what is known, what is hypothesized, and what is still to be deter-
mined through experimentation. It creates the underlying fact-based
framework from which our hypotheses can be generated.
Machine Learning
To do Accelerated Discovery in complex domains requires more than
just establishing the factual statements in literature. Not all literature is
equally trustworthy, and the trustworthiness may differ depending on
the scope and context. To acquire the level of sophistication and nuance
needed to make these determinations will require more than human pro-
gramming can adequately provide. It will require learning from mistakes
and past examples in order to get better and better over time at deciding
which information is credible and which is suspect. Machine learning is
the technology that provides this type of capability. In fact, it has proven
remarkably adept at tackling even heretofore intractable problems where
there is sufficient training data to be had [3]. Machine learning will enable
our Accelerated Discovery approach to apply sophisticated judgment at
each decision point and to improve that judgment over time.
Collaborative Filtering/Matrix Factorization
Collaborative filtering is a technique made famous by Amazon and Netflix
[4], where it is used to accurately identify the best movie or book for a given
customer based on the purchase history of that customer and other simi-
lar customers (customers who buy similar things). Customer purchases
are a special kind of entity-entity network. Other kinds of entity-entity
networks are similarly amenable to this kind of link prediction. We can
use a generalization of this approach as a way to predict new links in an
existing entity-entity network, which can be considered to be a hypothesis
of a new connection between entities (or a new property of an entity) that is
not currently known but is very likely based on everything we know about
the relevant entities.
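To make link prediction in an entity-entity network concrete, the sketch below scores missing links with a simple neighborhood-based form of collaborative filtering (Jaccard similarity between entities), which stands in here for matrix factorization because it needs no numerical library. The gene and disease names are invented placeholders, and the scoring rule is an assumption for illustration rather than the method used in later chapters.

```python
def jaccard(a, b):
    """Similarity of two link sets: shared links divided by all links."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_links(links, entity):
    """Rank targets not yet linked to `entity` by similarity-weighted votes.

    Every other entity votes for the targets it is linked to, and each vote
    is weighted by how similar that entity's links are to `entity`'s links.
    """
    known = links[entity]
    scores = {}
    for other, targets in links.items():
        if other == entity:
            continue
        sim = jaccard(known, targets)
        if sim == 0.0:
            continue
        for target in targets - known:
            scores[target] = scores.get(target, 0.0) + sim
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical entity-entity network: which genes are linked to which diseases.
links = {
    "geneA": {"disease1", "disease2"},
    "geneB": {"disease1", "disease2", "disease3"},
    "geneC": {"disease2", "disease4"},
    "geneD": {"disease5"},
}
ranked = predict_links(links, "geneA")
```

Here `geneB` shares two of its three links with `geneA`, so its extra link (`disease3`) becomes the top-ranked hypothesis, ahead of `disease4`, which is supported only by the less similar `geneC`.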
Modeling and Simulation
In order to reason accurately about the physical world we have to be able to
simulate its processes in silico and predict what would happen if an experiment
were tried or a new property or relationship were actually found to
exist. These types of simulations will help reveal potential downstream
problems or contradictions that might occur if we were to hypothesize a
physically unrealizable condition or some impossible connection between
entities. Moreover, modeling and simulation can help determine what the
likely impact would be on the physical system as a whole of any new prop-
erty or relationships being discovered, in order to foresee whether such a
discovery would be likely to be uninteresting or quite valuable because it
would imply a favorable outcome or have a wide impact in the field.
Service-Oriented Architectures
Clearly, doing Accelerated Discovery requires a large array of hetero-
geneous software components providing a wide variety of features and
functions across multiple platforms. Service-oriented architectures (SOA)
provide a uniform communication protocol that allows the components
to communicate across the network using complex and evolving data rep-
resentations. It also allows new implementations and algorithms to be eas-
ily swapped in as they become available. SOAs represent an indispensable
tool for enabling the large, sophisticated, and distributed systems needed
for accelerated discovery applications to emerge from components that
can largely be found on the shelf or in open-source libraries.
Ontological Representation Schemes
In addition to being able to extract entities and relationships from unstruc-
tured content, we also need powerful ways to represent those entities and
their features and connections in a persistent fashion, and in a way that
makes it possible to do reasoning over these objects. Existing ontological
representation schemes (e.g., OWL [5]) make it possible to store entities
in a way that retains all the pertinent contextual information about those
entities while still maintaining a degree of homogeneity. This homogene-
ity makes it possible to design algorithms that can discover or infer new
properties based on all known existing patterns. The ability to store such
representations in a scalable database and/or index provides the capability
of growing the stored version of what is known to the level necessary to
comprehend an entire scientific domain.
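To make the idea concrete, the following minimal sketch stores entities and relationships as subject-predicate-object triples and infers new facts from a known pattern. It is plain Python rather than OWL, and the entity names, the is_a predicate, and the transitivity rule are illustrative assumptions, not taken from any real ontology.

```python
# Hypothetical triples (subject, predicate, object); names are illustrative only.
triples = {
    ("kinase_A", "is_a", "kinase"),
    ("kinase", "is_a", "enzyme"),
    ("enzyme", "is_a", "protein"),
}


def infer_transitive(facts, predicate):
    """Return the facts closed under transitivity of `predicate`."""
    closed = set(facts)
    changed = True
    while changed:
        changed = False
        for (a, p1, b) in list(closed):
            for (c, p2, d) in list(closed):
                if p1 == p2 == predicate and b == c:
                    new_fact = (a, predicate, d)
                    if new_fact not in closed:
                        closed.add(new_fact)
                        changed = True
    return closed


# Facts that were not stored explicitly but follow from the stored ones.
inferred = infer_transitive(triples, "is_a") - triples
```

Because every fact has the same three-part shape, one generic rule can run over the whole store, which is the kind of homogeneity the passage above describes.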
DeepQA
While question answering is not a central feature of Accelerated Discovery,
the two applications share many common components. Both require the
computational digestion of large amounts of unstructured content, which
then must be used in aggregate to form a conclusion with a likelihood
20   ◾    Accelerating Discovery
estimate. Both also support their answers with evidence extracted from
text sources.
Reasoning under Uncertainty
Machine learning techniques allow us to predict the likelihood that some
conclusion or fact is true. Reasoning under uncertainty is how we use this
probabilistic knowledge to form new hypotheses or to invalidate or ignore
some fact that is highly unlikely. Bayesian inferencing [6] is one example
of an existing framework that we can apply to uncertain causal networks
in order to do credible deduction of probable end states. This capability is
central to telling the difference between a likely outcome and something
that is wildly fanciful.
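As a small worked example of this kind of reasoning, the sketch below applies Bayes' rule to update belief in a single hypothesis as evidence arrives. It covers only one hypothesis and one evidence type, not a full causal network, and the prior and likelihood values are made-up numbers chosen for illustration.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' rule: P(H | E) from P(H), P(E | H), and P(E | not H)."""
    numerator = p_evidence_given_h * prior
    evidence = numerator + p_evidence_given_not_h * (1.0 - prior)
    return numerator / evidence


# Hypothetical numbers: a hypothesis with a 10% prior, and evidence that is
# four times more likely when the hypothesis is true (0.8 versus 0.2).
after_one = posterior(0.10, 0.8, 0.2)

# A second, independent piece of the same kind of evidence.
after_two = posterior(after_one, 0.8, 0.2)
```

With these assumed numbers, one supporting observation raises the belief from 0.10 to roughly 0.31, and a second raises it to 0.64, which is how weak individual signals can accumulate into a credible hypothesis.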
ACCELERATED DISCOVERY FROM A SYSTEM PERSPECTIVE
The previous list of enabling technologies available to support Accelerated
Discovery (AD) is necessary to the task, but incomplete. What is needed
is a coherent framework in which these technologies can effectively work
together to achieve the desired outcome.
To support and enable such continuous data transformations and dis-
covery, we must design our discovery solution carefully. In particular,
our discovery solution must adapt and scale to a wide range of chang-
ing dynamics, including data content, domain knowledge, and human
interactions. This is crucial because in all industry domains, domain con-
tent and knowledge are constantly changing and expanding. If we do not
design our discovery system to cope with such changes, the outcome will
be a system that lacks longevity and capability. Most, if not all, of today’s
discovery solutions can only deal with a limited volume and diversity of
content and knowledge.
To enable adaptation and scaling, we instituted two system design prin-
ciples: agility and adaptivity. Agility means that a discovery system must
be able to rapidly generate outputs in the face of changes in data content,
knowledge, and human inputs. This is far from a reality in today’s discov-
ery systems. For example, a discovery system may be built for one kind
of data input format. When the data input format changes, significant
manual intervention may be needed and downstream system components
may also need to change accordingly. Such designs make the discovery process
extremely lengthy and error prone. We will describe our approach to
building agility into the system from the very beginning.
Adaptivity means that a discovery solution must treat “changes in
all forms” as first-class citizens: for example, changes in individual
system components, changes in data content, and changes in knowledge.
We envision that the various underlying technology components will
evolve and improve over time, and the same is true
of data content and knowledge bases. A discovery system must have the
notion of adaptivity built into it from day one.
To enable agility, we suggest that all major system components be
designed with a “core-abstraction plus configurable customization”
approach. The core abstraction defines all the major services that the sys-
tem components intend to support. The configurable customizations allow
for changes and adaptations of the core abstractions for specific needs in
data-content formats, knowledge repositories, and interactions with other
components. For example, a content collection component may have com-
mon and core services dealing with the processing of typical unstructured
documents. A configurable customization can define the specific fields
and format extensions that are needed for those unstructured sources
without code change in the core abstraction services.
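The "core abstraction plus configurable customization" approach can be sketched as follows. This is only one possible reading of the idea: the class names, the field names, and the patent example are hypothetical and do not come from the system described here.

```python
from dataclasses import dataclass, field


@dataclass
class CollectorConfig:
    """Hypothetical customization: which source fields feed the core fields."""
    title_field: str = "title"
    body_field: str = "body"
    extra_fields: list = field(default_factory=list)


class ContentCollector:
    """Core abstraction: the service logic is fixed; only the config varies."""

    def __init__(self, config):
        self.config = config

    def collect(self, raw):
        doc = {
            "title": raw.get(self.config.title_field, ""),
            "text": raw.get(self.config.body_field, ""),
        }
        # Source-specific extensions are copied through without code changes.
        for name in self.config.extra_fields:
            if name in raw:
                doc[name] = raw[name]
        return doc


# A patent source is supported by configuration alone.
patent_config = CollectorConfig(
    title_field="invention_title",
    body_field="claims",
    extra_fields=["assignee"],
)
patent_doc = ContentCollector(patent_config).collect(
    {"invention_title": "Compound X", "claims": "A compound", "assignee": "Example Corp"}
)
```

Supporting a new source format then means writing a new configuration rather than editing the collector itself.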
To enable adaptivity, we defined generalized data input and output for-
mats, called common data models (CDMs), and common interfaces around
all system components. This allows developers to change the component
engine itself without impacting the overall function of the discovery system.
It also allows us to adapt to new changes in data sources and knowledge bases
by simply mapping them to CDMs without changing the rest of the system.
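A minimal sketch of a common data model, under assumed names, might look like the following. The CommonDocument record, its fields, and the two source formats are hypothetical; the point is only that downstream code depends on the shared model and never on a source's native layout.

```python
from dataclasses import dataclass, field


@dataclass
class CommonDocument:
    """Hypothetical CDM record shared by all components."""
    doc_id: str
    source: str
    text: str
    metadata: dict = field(default_factory=dict)


def from_patent(record):
    """Map an assumed patent record layout onto the CDM."""
    return CommonDocument(
        doc_id=record["patent_no"],
        source="patent",
        text=record["abstract"],
        metadata={"assignee": record.get("assignee")},
    )


def from_article(record):
    """Map an assumed journal-article record layout onto the CDM."""
    return CommonDocument(
        doc_id=record["article_id"],
        source="article",
        text=record["abstract_text"],
        metadata={"journal": record.get("journal")},
    )


def word_count(doc):
    """A downstream component: it sees only the CDM, never the source format."""
    return len(doc.text.split())


docs = [
    from_patent({"patent_no": "P-1", "abstract": "A novel compound"}),
    from_article({"article_id": "A-9", "abstract_text": "Gene expression in yeast"}),
]
counts = [word_count(d) for d in docs]
```

Adding a new data source in this sketch requires one new mapping function and no change to word_count or any other consumer.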
Figure 2.3 summarizes the key system components of a discovery
solution. The boxes are the major system components. Each of these
components is summarized below:
Content Curator
The content curator is responsible for managing domain content sources. This
includes collecting, cleansing, and making diverse content sources available
for downstream processing, end-user queries, and browsing. Typical functions
include data ingestion, extraction, transformation, and loading into some
content repository. It also includes functions such as indexing and searching.
Domain-pedia
Domain-pedia is responsible for managing domain knowledge. It must
ingest, update, and process data knowledge from existing knowledge
FIGURE 2.3 (See color insert.) Functional model. Key components: catalog,
content curator, Domain-pedia, annotators, normalizers, BigInsights framework,
query services, analytics services, and user interface. All key components
have a common data model, an extensible data exchange format, standard web
service interfaces, and customizable configurations.
sources as well as some of the downstream processing, such as the seman-
tic entity extraction process. It can also be a resource for runtime process-
ing and analytics and end-user searching and browsing, similar to what
one might do on Wikipedia, such as searching for knowledge around a
given protein.
Annotators
Annotators are the engines that pull semantic domain concepts, such as
chemicals, genes, and proteins, along with their relationships, out of
unstructured text.
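A very small dictionary-based annotator illustrates the idea. Real annotators rely on far richer resources and on natural language processing; the three-entry lexicon and the function name below are assumptions made purely for illustration.

```python
import re

# Hypothetical lexicon mapping surface forms to entity types.
LEXICON = {"aspirin": "chemical", "TP53": "gene", "p53": "protein"}

# Longer terms first so that the alternation prefers the longest match.
_TERMS = sorted(LEXICON, key=len, reverse=True)
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(t) for t in _TERMS) + r")\b")


def annotate(text):
    """Return (start, end, surface, entity_type) for each lexicon match."""
    return [
        (m.start(), m.end(), m.group(), LEXICON[m.group()])
        for m in _PATTERN.finditer(text)
    ]


sentence = "We tested aspirin on p53, the product of TP53."
spans = annotate(sentence)
```

Each result carries character offsets, so the extracted concept can always be traced back to the exact text that supports it.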
Normalizers
Normalizers are the engines that organize the vocabularies around vari-
ous domain concepts into a more consistent form. As the name indicates,
they normalize domain concepts into standardized forms, such as unique
chemical structures and protein names.
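Normalization can be sketched as a lookup from many surface names to one canonical form. The synonym table below is a tiny hand-written example, and the choice of a lowercase name as the canonical key is an assumption; a production normalizer would typically map to structures or database identifiers instead.

```python
# Hypothetical synonym table: several names for the same compound.
SYNONYMS = {
    "aspirin": "acetylsalicylic acid",
    "asa": "acetylsalicylic acid",
    "2-acetoxybenzoic acid": "acetylsalicylic acid",
    "acetylsalicylic acid": "acetylsalicylic acid",
}


def normalize(name):
    """Return the canonical name, or None when the name is unknown."""
    key = " ".join(name.lower().split())  # fold case and collapse whitespace
    return SYNONYMS.get(key)


canonical = {normalize(n) for n in ["Aspirin", "ASA", "2-Acetoxybenzoic  acid"]}
```

Returning None for unknown names, rather than guessing, keeps unrecognized vocabulary visible so that it can be curated later.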
BigInsights Framework
The BigInsights framework is an orchestration engine that manages the interactions
between the components described above in a scalable and efficient
fashion by mapping runtime content, knowledge, annotation, and normal-
ization processing in a large-scale Hadoop-like infrastructure framework.
Such a framework is like the blood vessels of the human body; without it, we
have only the pieces and parts of a system, rather than a living system.
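The map-and-reduce style of processing that such a framework orchestrates can be illustrated in a few lines. The sketch below is a single-process stand-in written in plain Python; it does not use the BigInsights or Hadoop APIs, and the two sample documents are invented.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Emit (term, 1) per token; stands in for a per-document annotation step."""
    return [(token.lower().strip(".,"), 1) for token in text.split()]


def reduce_phase(pairs):
    """Sum the values for each key; stands in for cross-document aggregation."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Invented sample documents.
docs = {"d1": "TP53 regulates MDM2.", "d2": "MDM2 inhibits TP53"}
mapped = chain.from_iterable(map_phase(i, t) for i, t in docs.items())
counts = reduce_phase(mapped)
```

Because each map call touches only one document, the map phase can be spread across many machines, and only the much smaller key-value pairs need to be brought together in the reduce phase.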
Query Services
Query services provide consistent data access interfaces for end users and
other system components to query underlying data repositories without
knowing what format each data repository might be in.
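One way to picture this is a single search call in front of repositories that store data differently. The two repository classes and their storage layouts below are hypothetical; they exist only to show that the caller never sees the underlying format.

```python
class DictRepository:
    """Hypothetical backend storing documents as {doc_id: text}."""

    def __init__(self, records):
        self._records = records

    def find(self, term):
        return [i for i, text in self._records.items() if term.lower() in text.lower()]


class RowRepository:
    """Hypothetical backend storing documents as (doc_id, text) rows."""

    def __init__(self, rows):
        self._rows = rows

    def find(self, term):
        return [row[0] for row in self._rows if term.lower() in row[1].lower()]


class QueryService:
    """Single entry point; callers never learn how each repository stores data."""

    def __init__(self, repositories):
        self._repositories = list(repositories)

    def search(self, term):
        hits = set()
        for repo in self._repositories:
            hits.update(repo.find(term))
        return sorted(hits)


service = QueryService([
    DictRepository({"patent-1": "A kinase inhibitor"}),
    RowRepository([("article-7", "Kinase signalling in cells"), ("article-8", "Unrelated")]),
])
results = service.search("kinase")
```

A new repository type can be added by giving it a find method with the same shape, without changing any caller of search.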
Analytics Services
Analytics services include runtime algorithms to enable the discovery
and construction of more complex knowledge representations. For exam-
ple, they may produce network representations based on the underlying
knowledge repository of gene-to-gene relationships.
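As an illustration, a network representation can be built from pairwise relationships and then queried for its most connected node. The gene names and relationships below are placeholders rather than real findings, and the degree-based notion of a hub is just one simple analytic among many.

```python
from collections import defaultdict

# Placeholder gene-to-gene relationships, as might be read from a repository.
relations = [
    ("GENE_A", "GENE_B"),
    ("GENE_B", "GENE_C"),
    ("GENE_A", "GENE_C"),
    ("GENE_C", "GENE_D"),
]


def build_network(pairs):
    """Build an undirected adjacency map from relationship pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


network = build_network(relations)

# The node with the most neighbors is a simple candidate for a "hub".
hub = max(network, key=lambda gene: len(network[gene]))
```

The same adjacency map can feed visualizations or further graph algorithms, which is why it is computed at runtime from the repository rather than stored as a fixed picture.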
User Interface
The user interactions of a discovery system can be diverse. They can range
from basic searches and reporting to more complex visualizations and
workflows. Our user interface component is built with a “platform+appli-
cations” principle. That is, we developed a suite of common user-interface
widgets that can be leveraged by a wide range of use cases. We then designed
specific applications that are targeted for specific use cases by leveraging
the common widgets as much as possible. This way, our discovery system
can quickly be adapted to support different use cases without applications
having to be rewritten each time.
Catalogue
Finally, the catalogue component manages the system records about all
system components and their configurations and metadata about the con-
tent sources and knowledge sources. It creates and manages a list of sys-
tem and data catalogues. System administrators are the typical users of
such a component.
Clearly, a discovery system is complex and dynamic. Without the design
principles that have been described, a discovery solution can quickly find
itself stuck in the face of changing content, knowledge, and user-interaction
paradigms.
ACCELERATED DISCOVERY FROM A DATA PERSPECTIVE
The process of Accelerated Discovery can essentially be thought of as a
data transformation process. Having just described the system for trans-
forming that data, let us look at the process again from a data perspective
and see how the data may be changed as we move through the discovery
process.
The discovery process is inherently a continuous transformation from
raw pieces of data and information to enriched pieces of knowledge and
more comprehensive knowledge representations, all the way to brand new
discoveries and hypotheses. A discovery solution/platform can therefore
be viewed as a system that supports and enables such data transformations
and end-user interactions. Figure 2.4 summarizes four major data trans-
formations driven by each of the four discovery process steps. We will now
discuss each of these steps of transformation in detail.
Initial Domain Content and Knowledge Collection
The bottom inner circle of Figure 2.4 marks the beginning of the data
transformation journey. To enable major discoveries, we must ensure that
the system is aware of a vast and diverse set of prior domain knowledge
and domain content. Without this step, the downstream analysis and
discovery will be extremely limited in scope, and wrong conclusions could
be drawn. This is often the case with today’s bioinformatics tools, which
operate on small and narrowly scoped data sets.
We differentiate domain knowledge and domain content deliberately
here since they mean different things. Domain knowledge means prior
knowledge that has been captured digitally, such as manually curated
domain ontologies, taxonomies, dictionaries, and manually curated
structured databases. For example, in drug discovery, such domain
knowledge may include the ChEMBL database [7], OBO ontologies [8], and
other dictionaries.
Domain content means raw digital data content that might capture existing
domain knowledge but has not been curated systematically to allow a
broad set of scientists to gain such knowledge easily. For example, many of
the unstructured information sources, such as patents, published confer-
ence and journal articles, and internal reports, often contain knowledge
known by the individual scientists who wrote the documents. However, if
this knowledge has not been made widely and easily accessible, much of
it is locked down and unusable. Domain content also includes structured
data and semistructured data such as genomic screening information
The Discovery Platform: A Data Perspective
The discovery platform is a system that continuously transforms from initial raw data and
domain knowledge to brand new discoveries through a series of data transformation steps.
Each step along the way will enable services and bring value to our clients and partners.
Emerging patterns and discovery: Create new hypothesis and predictions
• Given known, show unknown
• Can be simple or complex representations
Complex knowledge and visual representations: Compose complex and holistic knowledge representations
• Graphs and networks
• Runtime calculated visualizations, such as scatter plots
• ......
Enriched data and enhanced domain knowledge: Comprehend and extract semantic knowledge
• Entity, relationships, and complex relationship extraction core and customizations
• E.g., chemical, biological and toxicology annotators and PTM relationships
• Enhanced domain-knowledge and enriched data content
Data and domain knowledge: Collect and curate domain content and knowledge
• Content collection and ingestion
• Content curation, indexing
• Patents, medline literature, ChEMBL, ...
• Domain knowledge/ontology ingestion
• Domain-pedia management
• OBO, SIDER, dictionaries, ontologies
FIGURE 2.4 Data evolution.
and experiments. Many such sources of raw data also require significant
cleansing and processing before they can be made available and accessible.
A discovery solution must be able to collect, cleanse, and make acces-
sible a vast and diverse amount of domain knowledge and content in its
initial data transformation step. The output of this initial step is a content
and knowledge repository that is ready for downstream analysis, index-
ing, data exploration, and browsing.
Content Comprehension and Semantic Knowledge Extraction
The second transformation has to do with the ability to derive an enriched
set of domain knowledge from the initial content and knowledge col-
lection. Specifically, when it comes to unstructured and semistructured
content sources, such a transformation step is critical in capturing and
extracting knowledge buried in text, tables, and figures in a systematic
fashion; for example, extracting chemical structures and their properties,
genes and proteins described in text, and tables of experimental results.
Other ways to comprehend the content also include classification and
taxonomy generation. Such methods help organize content more meaningfully
and allow scientists to examine data through different lenses.
Such a step often requires natural language processing, text mining,
and machine learning capabilities. It also requires systems to
coalesce and cross-map different styles of vocabulary into something that
is more consistent. For example, one chemical compound may have over
100 different names. Unless such chemical compounds are cross-mapped
and normalized in some fashion, there is very little hope that scientists can
gain comprehensive knowledge about them, let alone discover new insights.
A well-designed system component that carries out such forms of data
comprehension will result in an enriched content and knowledge reposi-
tory. Such a repository not only unlocks and captures domain knowledge
buried in unstructured data, but also cross-maps different domain vocabularies
in a more consistent fashion. An analogy for such a knowledge
repository is a “domain-pedia.” That is, it contains all pieces of knowledge
gained via machine curation and gathers them together into a reposi-
tory that is like Wikipedia but more in-depth and comprehensive for the
domain under study.
Complex and High-Level Knowledge Composition and Representation
Building on the repositories created from the previous data transforma-
tion steps, the third data transformation step attempts to generate more
Random documents with unrelated
content Scribd suggests to you:
she just turned around and went home. That is the time she found
out he was married and was deceiving her.
But I don't know why she left, I mean why, exactly she left but I
think this is the person that was her reason for leaving Leningrad.
Mr. Liebeler. Did she tell you that she had left in part at least to
get away from this man?
Mrs. Ford. That is what I understood.
Mr. Liebeler. Now, did Marina tell you why she married Lee
Oswald when she was in Russia, did she talk to you about that?
Mrs. Ford. Yes; she did.
I don't exactly know why she married him. But she said she met
him at a dance, and soon after that, I don't know the reason why he
was in the hospital but he was in a hospital, and she called on him,
and I don't know how long he stayed there, either, and she liked the
man, I think, and she bought him an Easter Egg, that was during
Easter sometime and he was very surprised that such a thing could
be done in Russia.
I think it rather pleased him very well. She said somehow she
felt sorry for the man because none of her friends liked him, and
mistrusted him, and she felt sort of like she was on the defensive,
she wanted to, she felt sorry for him in a way.
Representative Ford. Did she tell you why her friends didn't trust
him?
Mrs. Ford. Yes; she said they were thinking that he was an
American spy or something like that, that is what they were trying to
tell her. Maybe he is a spy, and how can you trust a man like that?
She told me the other day, she says no one trusted him, but she
says, I wasn't afraid of him, that is how she put it.
Representative Ford. Did Marina indicate to you whether she
thought or had any reason to believe that Oswald was a spy?
Mrs. Ford. No; she didn't. She didn't think so. She never said
that, I mean.
Representative Ford. Did Marina ever indicate to you or did you
gather that one of the reasons that Marina married Lee Oswald was
she had the possibility of leaving the Soviet Union in mind?
Mrs. Ford. Yes. She never did go out directly and say that but I
think I got an impression that was her reason. She was telling me
that way before she met Oswald she was dreaming of coming over
here, and that is, I mean gathering by that later I thought that she
wanted to come over, and he was, I suppose he was a reason.
Mr. Liebeler. In this connection, I wanted to ask you whether you
ever had any contact with any newspaper reporters from the Dallas
newspapers about this, did they ask you about this at any time?
Mrs. Ford. About this I don't remember if they asked me. I don't
remember. But a couple of reporters came to my house soon after
the assassination and talked to me.
Mr. Liebeler. I represent to you that there was a story in the
November 27, 1963, issue of the Dallas Times Herald which told
about some Russian-born woman in Dallas to whom Marina had
supposedly confided some of the most intimate secrets of her
stormy marriage, in the words of the newspaper article. Do you have
any knowledge whether that would be you or somebody else?
Mrs. Ford. I think that would be me.
Mr. Liebeler. Did you tell the reporters that Marina had told you
that Marina had felt sorry for Oswald because everybody hated him
even in Russia?
Mrs. Ford. Yes; I might have said that.
Mr. Liebeler. Marina did say that to you?
Mrs. Ford. Yes; she did.
Mr. Liebeler. Did Marina tell you anything at this time about their
trip back to the United States and the difficulties that they
encountered, or how they did it, that sort of thing?
Mrs. Ford. No; I never talked with her about that.
Mr. Liebeler. Was there any conversation while Marina stayed
with you during that week in November 1962, about the possibility
of a divorce, of her divorcing Lee Oswald?
Mrs. Ford. The possibility—I know she didn't want to go back to
him at the time she stayed with me.
Mr. Liebeler. But you don't remember any specific conversation?
Mrs. Ford. No; I don't remember any specific conversation.
Mr. Liebeler. About divorce?
Mrs. Ford. Frankly, there was talk about it, she didn't want to go
back and I just told her, I felt that Marina wasn't really the domestic
type she could stand very long being a help at home, not that I think
she is not capable of taking care of her own house. I see now since
she has got even her own place she keeps it very clean and her
children are always neat. But she wasn't right for domestic help and
I told her to stay with Lee, that is what I told her myself, and wait
until she could be able to take care of herself other than working in
a house.
Mr. Liebeler. What did she say about that?
Mrs. Ford. She didn't say, she was really just listening, I think,
and she didn't say anything.
Mr. Liebeler. Did Marina say anything to you at this time about
wanting to go back to Russia?
Mrs. Ford. No, no; she didn't want to.
In fact, she told me that Lee soon after he came to the United
States, he was telling her that he would want to go back because he
couldn't find a job here and he was, of course, seeing a lot of
difficulties for himself, and Marina said, If you want to go back, you
can go but I am not going, that is what she told me.
Mr. Liebeler. Marina told Lee if he wanted to go back he could but
she wasn't going to go back to Russia.
Mrs. Ford. That is right.
Mr. Liebeler. Is that all the conversation that you had with her
about going back to Russia at that time?
Mrs. Ford. That is right. That is about all.
Mr. Liebeler. Did you know where the Oswalds lived at this time
when Marina came to the Meller's house and then to your house?
Mrs. Ford. No; I have never been at their apartment, and she
couldn't tell me. I know she lived in Oak Cliff, the Dallas section
about southwest, I believe.
Mr. Liebeler. In Dallas?
Mrs. Ford. Yes.
Mr. Liebeler. Did you know whether Marina had lived with a lady
in Fort Worth before they came to Dallas?
Mrs. Ford. Yes; I know they stayed there but I didn't talk to her
during the time and I didn't visit her. I know she stayed at Elena
Hall's house and I think Elena had an accident just before that and
she was—she stayed in bed most of the time. Marina was helping
her out.
Mr. Liebeler. Now, you had only seen Lee Oswald, up to this week
that Marina came to live with you, one time, is that correct?
Mrs. Ford. That is correct.
Mr. Liebeler. That was the Meller's luncheon party?
Mrs. Ford. That is correct.
Mr. Liebeler. Were you surprised on the basis of any judgment
you might have made of Lee Oswald to learn that he had beaten his
wife?
Mrs. Ford. Just from seeing him once I would not have made—
no; that he has beaten his wife; no, I didn't think at that time. I did,
when she came in after I learned that he has beaten her, I was
rather—I remember the bruises on her face and that rather made
signs to me that he did.
Mr. Liebeler. Did it surprise you that he would have done this?
Mrs. Ford. No; it did not surprise me. I just felt that young man
as he was, if he was—decided to go to Russia after living in a
country like the United States, I didn't feel he was very, what shall I
say, how would you say, a person's mind won't work at this time——
The Chairman. Unstable?
Mrs. Ford. Unstable, that is how I felt. I felt a person like that, I
felt frankly could do anything.
Representative Ford. Did she ever tell you that Lee Oswald was
the cause of these bruises on her face?
Mrs. Ford. Well, she did tell me after she came to the house to
stay with me.
Representative Ford. That is what I mean?
Mrs. Ford. That is right.
Representative Ford. The bruises you saw on her face at the
house she told you Lee Oswald was the cause?
Mrs. Ford. Yes, that is right.
Mr. Liebeler. Let's clarify that a little. Did Marina Oswald have any
bruises at the time she came to live in your house in November
1962?
Mrs. Ford. No; that is right. But she stayed at Anna Meller's
house for a week and when she came to Anna Meller's house I
heard there were bruises at that time.
Mr. Liebeler. Did Anna Meller tell you that?
Mrs. Ford. Either Anna Meller or George Bouhe told me that. I
don't remember.
Mr. Liebeler. And you yourself did see bruises on her face the first
time?
Mrs. Ford. I did see the first time; yes.
Mr. Liebeler. Did she tell you that Lee Oswald had given her those
bruises?
Mrs. Ford. That is right.
Mr. Liebeler. She told you that when she stayed with you?
Mrs. Ford. That is right.
Mr. Liebeler. Tell us the circumstances under which Marina left
your home in November of 1962, where she went and what
happened?
Mrs. Ford. Well, she stayed with me a week, and my husband
came home on Saturday, and we discussed with another friend of
mine for Marina to go to her house and stay there as long as she
wanted, and I think Sunday morning this friend of mine, Anna Ray,
came with a station wagon and picked all her things up, her playpen
she had for baby, and diapers and things, and took her to her house
and I believe my husband was with her, too, at that time and that is
how she left.
Mr. Liebeler. Now, do you know how long she stayed with the
Rays?
Mrs. Ford. With the Rays. I think she just stayed there, she had
had dinner there, I believe she stayed one afternoon. I don't know
how soon Lee came there but he came soon over to the house, but
Marina said he cried and begged her to return, he would be nothing,
if she didn't return, he would be finished, that is what he was telling
her, and she said she just couldn't say, no to him.
Mr. Liebeler. So she returned to Oswald at that time?
Mrs. Ford. Yes; she returned to Oswald.
Mr. Liebeler. Did you ever talk with the Mellers about their
experiences with Marina when she lived at their house for that time?
Mrs. Ford. No; I don't remember, she did not discuss it.
Mr. Liebeler. Did you remember or did you know where Lee
Oswald was living prior to the time that Marina came to Dallas?
Mrs. Ford. No; I don't know where he lived at any time.
Mr. Liebeler. Did you know where he was working at that time or
if he was working at all?
Mrs. Ford. During the time they lived in Dallas, I believe, I don't
know exactly, though, either George Bouhe or Anna Meller's husband
found him a job in a printing shop, I think, or I believe it is printing
shop, somewhere in Oak Cliff, and that is why they had an
apartment there. I remember that is the reason because George
Bouhe was rather mad at Marina for taking an apartment in Oak Cliff
because it was too far for him to drive and help her when she
needed help and the baby, I think he was taking her to the dentist
and taking the baby to a doctor to help her in ways that she couldn't
do herself.
Representative Ford. Who was doing this driving?
Mrs. Ford. I believe George Bouhe did this. He has the car.
Mr. Liebeler. Where does Mr. Bouhe live?
Mrs. Ford. He lives, well, I don't know his address now. I know
where he lives but I don't know the street number.
Mr. Liebeler. But it is not in the Oak Cliff Section of Dallas?
Mrs. Ford. No; it is not. I think it would be in the east part of
Dallas.
Mr. Liebeler. During the time that Marina stayed with you, did she
say anything to you about Lee Oswald's political beliefs or his
attitudes concerning politics?
Mrs. Ford. No; she didn't talk to me about that and I didn't ask
her.
Frankly, I didn't talk with Lee about that, I didn't feel the need of
it myself to discuss politics with him.
Mr. Liebeler. Did you discuss that subject with any of your
friends?
Mrs. Ford. Well, yes. They were telling me, those friends that
went to his apartment, were telling me, that they have seen books
like Karl Marx open in front of him, just lying there on the table, that
he didn't even hide it when someone came in, and then someone
else said there was a book laying there of How to Be a Spy, laying
right open there.
Mr. Liebeler. Do you remember who told you that?
Mrs. Ford. I believe it was Lydia Dymitruk.
Mr. Liebeler. D-y-m-i-t-r-u-k.
Do you remember anything about that particular conversation?
Mrs. Ford. Well, she was telling me, she took, when the baby left
my house, she had a cold, and it was getting worse, and I believe
soon after she left Anna Ray, the baby began to have a fever, and
Lydia, I believe, I don't know how she got to go to her apartment,
really, I don't know the reason she went there; she went there and
wanted to take the baby to the doctor and she told me of an
incident that says even Marina was ashamed of Lee because when
she took her to the hospital Lee was lying about that he didn't have
a job at the time, which Lydia knew that he did have. He didn't want
to pay for the services, and people at the hospital was asking him
how does he pay for the apartment and he was telling them that,
My friends were helping me, and Marina just said something in
Russian that Lydia remembers, What a liar, you know, behind his
back.
Mr. Liebeler. Did she say that so he could hear it?
Mrs. Ford. I think so, because she said it aloud.
Mr. Liebeler. And she said it in Russian?
Mrs. Ford. Yes; she did.
Mr. Liebeler. In front of Lydia Dymitruk?
Mrs. Ford. Yes; that is right, and Lydia was rather mad about the
whole thing and she said she is not going to help them any more if
they are acting that way.
The baby wasn't helped at the hospital. I think the hospital
didn't want to take the child because the father couldn't pay, that is
what I got, the father couldn't pay for it.
Mr. Liebeler. Did you ever see Marina say anything adverse to
Oswald in his presence, did she run him down or make fun of him in
public so far as you know?
Mrs. Ford. As far as I know, I don't, except the first time, the
one time, I would think when she said, What a liar, in front of him.
Mr. Liebeler. You don't know of any other instance when she
would have done that?
Mrs. Ford. I don't know of any other instance.
Mr. Liebeler. Do you know whether she ever spoke of his political
views before other people or make fun of him?
Mrs. Ford. No; I don't remember except lately I have been
talking to her about that and she said she thought of him being
young, and she thought she hoped it would pass with years, that he
would mature, this is what she was telling me.
Mr. Liebeler. Did she tell you whether or not she discussed
politics with him herself? Did she argue with him about anything?
Mrs. Ford. She said she was arguing with him about that.
Certainly, in fact, he called her, she was typical American girl, that
she is not interested at all in politics, except in the material things
that he wasn't interested in. She said she wanted a house and a
family and he said, All the American girls think that way and he
thought he married a different sort of a girl, a Russian girl.
Mr. Liebeler. Now, think back on that week that Marina stayed
with you. Is there anything else that happened or is there anything
that Marina told you that you think we should know about and about
which I haven't already asked you?
Mrs. Ford. I cannot think of it at the time.
Mr. Liebeler. When was the next contact that you had with the
Oswalds?
Mrs. Ford. It was right after Christmas before New Year's. I
believe it was the 28th of December, a Friday. I gave, I had, a party
for all the friends, and I invited a family by the name of De
Mohrenschildt, wife and husband, and Mrs. De Mohrenschildt called
me up and asked me if she couldn't bring Marina and her husband
over because she was saying it is a shame the way all their Russian
friends have forsaken them during that time and they had no place
to go and the De Mohrenschildts were the only ones helping them at
the time, and I told her I didn't object to it. So she brought them
over with them.
Mr. Liebeler. Was Mr. Bouhe at the party?
Mrs. Ford. Yes; I think so.
Mr. Liebeler. Mr. and Mrs. Frank Ray?
Mrs. Ford. Yes.
Mr. Liebeler. Mr. and Mrs. Thomas Ray?
Mrs. Ford. Yes.
Mr. Liebeler. Were Elena Hall and her husband there?
Mrs. Ford. I don't remember them being at that party. I don't
think so.
Mr. Liebeler. Were the Mellers there?
Mrs. Ford. Yes.
Mr. Liebeler. You mentioned that De Mohrenschildt was there.
Mrs. Ford. Yes.
Mr. Liebeler. And the Oswalds.
Mrs. Ford. That is right.
Mr. Liebeler. Was there a gentleman by the name of Allen A.
Jackson at the party?
Mrs. Ford. Yes.
Mr. Liebeler. And his wife?
Mrs. Ford. And his wife.
Mr. Liebeler. Do you remember any conversations or observe any
conversations between Marina Oswald and Mr. Jackson?
Mrs. Ford. Marina Oswald, no; I don't think so. I don't think
Marina spoke English at the time.
Mr. Liebeler. Were Mr. and Mrs. Charles Edward Harris at the
party?
Mrs. Ford. Yes.
Mr. Liebeler. Did you talk to Mr. and Mrs. Harris about the party
afterward?
Mrs. Ford. I might have.
Mr. Liebeler. Did she mention translating a conversation between
Mr. Jackson and Marina Oswald?
Mrs. Ford. No; she did not mention it to me.
Mr. Liebeler. And you didn't see her doing that?
Mrs. Ford. No; I didn't see her doing that at the party.
Mr. Liebeler. Did you have any conversation with Lee Oswald at
the party that night?
Mrs. Ford. No.
Mr. Liebeler. Did you notice anybody else talking to him?
Mrs. Ford. Yes; I did. I noticed a girl talking to him who was of
Japanese descent but I don't remember her name.
Mr. Liebeler. Was there anything striking about that?
Mrs. Ford. No; I think, the only thing it was that I think he talked
to her most of the time and wasn't making any conversation with
anyone else.
Mr. Liebeler. Did Marina comment on that to you?
Mrs. Ford. No; she didn't. But I heard from somebody else that
she did comment.
Mr. Liebeler. Who told you that?
Mrs. Ford. I think George Bouhe again. He always spoke to
everybody.
Mr. Liebeler. What did Mr. Bouhe say about that?
Mrs. Ford. Well, he said something that I asked Marina afterward
and she told me that it wasn't true. He said that Lee talked to that
Japanese girl like a—it is an expression in Russia to take a bath and
then beat themselves with the leaves from a tree, and the leaf would
stick to the body, in the wintertime, and so the expression from that,
like a leaf sticks to the hot body when you take a bath, you know.
Then I suppose Lee struck her as just not saying anything—and I
asked Marina and she said he did not do it.
Mr. Liebeler. Was there any talk at the party about Oswald's
experiences in Russia or his marriage to Marina?
Mrs. Ford. During the party?
Mr. Liebeler. Yes.
Mrs. Ford. No; I didn't talk to him at all.
Mr. Liebeler. And you didn't overhear anybody else talking about
Oswald's experiences in Russia?
Mrs. Ford. No.
Mr. Liebeler. You mentioned that the Oswalds came with De
Mohrenschildt. Did they go home with De Mohrenschildt?
Mrs. Ford. Yes; they did because Mrs. De Mohrenschildt stated
that the lady who stayed with Oswald's child had to leave at 12
o'clock and they left before that.
Mr. Liebeler. After the Oswalds left the party, was there any
discussion about the Oswalds that you can remember?
Mrs. Ford. No; really not. I don't think they made a big
impression on that party on anybody.
Mr. Liebeler. Can you fix the date on which your party was held?
Mrs. Ford. The 28th of December.
Mr. Liebeler. What year?
Mrs. Ford. 1962.
Mr. Liebeler. Was there any other party the next day or the day
after that?
Mrs. Ford. There were other parties. I don't think that Marina
and her husband were present. I don't know if you would call it a
party, there was a sort of a get-together at my house afterward
because some people stayed over in town for a few days, I think
that was the Rays, and the Harrises, and a friend of my husband
from Louisiana, Sullivans, stayed there. But they hadn't met Oswald,
they came much later after 12 o'clock. So we sort of had a get-
together.
Mr. Liebeler. This would have been the next day?
Mrs. Ford. The next day.
Mr. Liebeler. On the 29th of December.
Was Mr. Bouhe——
Mrs. Ford. No; Mr. Bouhe wasn't present at that time.
Mr. Liebeler. Was there any conversation at that time about
Oswald?
Mrs. Ford. I don't remember, I don't think so.
Mr. Liebeler. Do you remember that there was a party or open
house at Mr. Bouhe's house on the 29th?
Mrs. Ford. If there was, I wasn't present, I didn't go.
Mr. Liebeler. Was there a party at Meller's house?
Mrs. Ford. There was a party or luncheon.
Mr. Liebeler. Did you go?
Mrs. Ford. No; I didn't go.
Mr. Liebeler. Did you hear any conversation during this period of
3 or 4 days about Oswald, anybody speculating about Oswald or
discussing his experiences——
Mrs. Ford. No; I think——
Mr. Liebeler. In Russia?
Mrs. Ford. I think at that time everyone rather—George Bouhe
said he was not going to help them any more, he was through, since
Marina, he tried to help her very hard, and she did not hold her
word about not going back to him. So he said since she went back,
so now it is her problem.
Well, he is sort of that type of man, he is trying to help hard and
if you are doing what he says otherwise he is not going to help, so
that was it. So it was rather, sort of Marina and her husband were
dropped at that time, nobody actually wanted to help, and I think
what they heard about Lydia Dymitruk was saying that he couldn't
help those people. I mean they were just sort of—he couldn't reach
them. He was lying in hospital and things, we sort of gave it up.
Mr. Liebeler. Did Lydia make that remark at one of those parties
or was that at a previous time?
Mrs. Ford. Oh, that was a previous time. As soon as she left to
go back to her husband, George Bouhe even took the Russian
dictionary back to him. He told her to give it back and he was just
through with him.
Mr. Liebeler. So far as you know Mr. Bouhe had no more contact
with the Oswalds after that?
Mrs. Ford. No; I don't think so.
Mr. Liebeler. Did you have any conversation with any of your
friends in Dallas or Fort Worth on the question of Oswald's ability to
leave Russia and come back to the United States and bring Marina
with him?
Mrs. Ford. We didn't speculate on that until really later, until
now, after the assassination that subject came up, and people
asking why they left so soon. He was telling me it took them a year,
so I don't know.
Mr. Liebeler. Was there any conversation prior to the
assassination, during this period in 1962, any speculation as to
whether Oswald might be an agent of any government?
Mrs. Ford. No. I frankly didn't think he was capable of it. That
was my feeling on it.
Mr. Liebeler. Were there any conversations on that?
Mrs. Ford. No. There were not.
Mr. Liebeler. Do you remember that Mr. Bouhe suggested at one
time that Oswald was a mental case?
Mrs. Ford. Mr. Bouhe, he might have; yes, I think we all thought
that.
Mr. Liebeler. Do you remember that Mr. Bouhe said that?
Mrs. Ford. I don't remember particularly that he would say that.
The Chairman. Did you say, "We all thought that"?
Mrs. Ford. Yes; we thought that, that he was rather mentally—
you just said the word before.
The Chairman. Unstable?
Mrs. Ford. Unstable.
Mr. Liebeler. Unstable.
Why did you think that, Mrs. Ford?
Mrs. Ford. In my own opinion, I just didn't think that a man as
young as he was could come to the conclusions just by rather
experiences or living a long time in America and I mean studying the
whole economic structures of different governments, and things that
he would come to the conclusion that is the best thing for him. I
think he was just rather too young for that.
I thought he was just rather—something was rather wrong with
the man.
Mr. Liebeler. You based this——
Representative Ford. In the conversation that Oswald had with
this Japanese lady at your party, did you overhear any of that
conversation?
Mrs. Ford. No; I did not. I did not have time, I was the hostess
and I just didn't get to talk to anyone.
Representative Ford. Did you ever ask the Japanese lady what
the gist of the conversation was or what the content of the
conversation was?
Mrs. Ford. No; I never have. In fact, I have not seen her after
that. That was the first time she came to my house. I mean I have
seen her later on in the beauty shop but I have never talked to her
about it.
Representative Ford. Did she speak English?
Mrs. Ford. Yes; she speaks English well.
Representative Ford. All right.
Mr. Liebeler. At this get-together that you mentioned at your
house on, I think it was the 29th, after the party, the informal get-
together, the Rays were there, and the Sullivans, and Mr. and Mrs.
Harris.
Would it refresh your recollection if I suggested to you that some
of the people that were there at that party said the possibility of
Oswald being a Russian agent was discussed in detail at that party in
that group?
Mrs. Ford. No.
Mr. Liebeler. You don't remember any of the discussion?
Mrs. Ford. I don't remember any of the discussion.
Mr. Liebeler. Is there anything other than what you told us that
led you to believe that Oswald was unstable or a mental case?
Mrs. Ford. Nothing, except that I was thinking about him myself
beating his wife. That would have been one reason. I don't think
that any stable man would do that, especially she appeared to me
very sick sort of a woman, not sick, but frail and fragile, I think any
man who strikes a woman who is incapable of striking back, I would
think would be unstable.
Mr. Liebeler. This Russian group that we have been discussing,
Mr. Bouhe and these other people, do they see each other regularly?
Is there sort of a Russian community in Dallas, would you say?
Mrs. Ford. Yes; there are about, I think, maybe four families in
Fort Worth and maybe half a dozen in Dallas or more than that, but
that mainly we see each other, and there is one Eastern Orthodox
Church in Russia where that is where we actually meet each other.
Mr. Liebeler. In Dallas, you mean, or Fort Worth?
Mrs. Ford. Dallas. I am sorry. In Dallas, and we still observe our
Eastern Orthodox religion during the holidays and sometimes like
Christmas falls 13 days after the American Christmas so there is New
Year's sometimes we celebrate those.
Mr. Liebeler. How well do you know the De Mohrenschildts?
Mrs. Ford. Well, I know George De Mohrenschildt the same,
approximately, I will say I was acquainted with him for
approximately 14 years but I don't know him well.
Mr. Liebeler. You mentioned before that De Mohrenschildt was
the only member of the Russian community that kept on seeing the
Oswalds and trying to help them.
Was there any discussion about that among your friends?
Mrs. Ford. Yes; George De Mohrenschildt is rather an odd ball,
among Russians anyway, so it was nothing unusual about him doing
that. He was always doing something unusual. He would even go to
church with shorts on, you know, this is something, he would do
something that nobody else would do.
Mr. Liebeler. Shorts?
Mrs. Ford. Shorts.
Mr. Liebeler. Short trousers?
Mrs. Ford. Yes.
Mr. Liebeler. Do you remember any specific conversations about
the reasons as to why De Mohrenschildt continued to associate with
the Oswalds after the rest of you had given them up?
Mrs. Ford. Well, I remember his wife was telling me like she felt
it was their duty now since everybody else dropped them and they
needed help.
Mr. Liebeler. Do you remember any conversation with the
Oswalds among any of your friends as to whether or not Oswald
went hunting in Russia and had access to weapons?
Mrs. Ford. Yes; I think that George Bouhe was telling me that.
He was telling him that he was going hunting and he told him about
killing ducks or something of that type.
Mr. Liebeler. Can you remember that in any greater detail?
Mrs. Ford. No; I don't remember it. I only remember that
because of the way he was saying, ducks in Russian, George was
saying that he was using sort of a word when you call for it, it is a
small duck rather than for ducks, he was saying that his Russian
wasn't perfect.
I mean in that conversation he was using, an example of he was
saying, when he would go hunting for ducks, instead of utki for
ducks he would say utitschki that would mean small ducks, and he
was saying that his Russian was imperfect.
Mr. Liebeler. That is Oswald's Russian?
Mrs. Ford. That is right.
Mr. Liebeler. What about Marina's ability to speak English at that
time, did she speak English at that time?
Mrs. Ford. I don't think she did. She could speak a few words
but I don't think she did.
Mr. Liebeler. Did any of you attempt to teach her English?
Mrs. Ford. Yes; George Bouhe was attempting to teach her to
write and was giving her lessons.
Mr. Liebeler. Can you tell us something about that?
Mrs. Ford. Well, he was telling me that he had gotten her a
dictionary and he had—or some other book anyway and he was
telling me that every time he saw her, made an attempt to see her I
don't know how he did that but anyway he was giving her a lesson
and she was supposed to have completed it by the next time in
writing.
Mr. Liebeler. Did Mr. Bouhe tell you anything about Marina's
ability to speak English or write English? Did Marina learn as a result
of that?
Mrs. Ford. He said she was doing very well. I don't know if she
learned to speak but he said she was a good student.
Mr. Liebeler. Did Lee Oswald ever object to this effort on Mr.
Bouhe's part?
Mrs. Ford. Well, he was objecting to anyone of the Russians
helping her.
Mr. Liebeler. Helping her learn English?
Mrs. Ford. Not learning English, but I mean helping her about
anything. I don't know whether he was objecting to that. But we
talked with Lee about this, why he wasn't teaching her English and
he wasn't speaking to her and he said that he didn't want to forget
Russian and he really said, If she wants, this is what Marina said
the other day, that he didn't actually object, but he thought if she
could learn Russian just by herself in any way she could, she could,
but he is not going to help her. He just didn't want to help her by
speaking English.
Mr. Liebeler. You mean English.
Mrs. Ford. That is right. He is not going to talk English with her,
he wanted Russian. He wanted for the little girl to learn Russian and
for himself not to forget it.
Mr. Liebeler. Did he indicate, in other words that he wanted
Marina to speak Russian so that he could maintain his own ability in
Russian?
Mrs. Ford. That is correct.
Mr. Liebeler. And he also wanted his children to learn to speak
Russian?
Mrs. Ford. That is right.
Mr. Liebeler. After this party on the 28th of December, what was
your next contact with either Lee or Marina Oswald?
Mrs. Ford. After the 28th? I think after the assassination, I only
heard once about her, I just heard she went to New Orleans. Again a
friend, Lydia Dymitruk, was in the bakery and she said she saw
Marina coming in and she told her she thought Marina was pregnant
and she told her Marina was going to New Orleans. That is the only
time I have heard anything about them after that party. The next
contact I had with her was, I don't know the date but it was soon
after the assassination when I just felt sorry for Marina, I thought
she was, I always felt she was innocent, I thought she was a naive
girl in a lot of ways and that is why she got into a lot of problems
and troubles.
I just felt if she didn't have anybody there except the FBI and
nobody to speak in Russian, she didn't know how everybody felt, I
would think she would feel very badly, so I called Mrs. Paine who, I
found out after the assassination she was a friend of Marina's and I
told her that if she would have a contact with Marina, tell her that
we sympathize with her that she is in the position that she is, and to
call me or to let me—that I would like to talk with her.
Mr. Liebeler. Can you tell us approximately when it was that you
talked to Mrs. Paine?
Mrs. Ford. I think that was either the first part of December, I
think it must have been at least a week after the assassination.
Mr. Liebeler. In early December?
Mrs. Ford. Yes.
Mr. Liebeler. What happened after that?
Mrs. Ford. Right after that, I think the next day Marina called
me, and she said the reason she called me was because Mrs. Paine
told her that I called, and let her know.
Mr. Liebeler. Do you remember specifically that Marina told you
that she, Marina, had talked to Mrs. Paine?
Mrs. Ford. Well, this, I don't know. I think that is what she did.
In fact, I think that Mrs. Paine told me she talked on the telephone
with Marina but I couldn't be positive about it. I know she wrote
letters, they wrote, I know Mrs. Paine wrote to Marina, and I
couldn't exactly say she talked to her on the telephone or how it
was, either Marina called Mrs. Paine or I don't know.
Mr. Liebeler. You say you know Marina wrote Mrs. Paine a letter?
Did Marina tell you that she wrote Mrs. Paine during that period?
Mrs. Ford. I don't remember. I know later Mrs. Paine kept calling
me and asked if I heard from Marina, because she kept writing to
Marina and Marina didn't answer, so she wanted to know if I talked
to her, that she was the reason she was calling me.
Mr. Liebeler. Did Mrs. Paine indicate to you after she called you
trying to find out from Marina, did she indicate to you she had or
had not heard from Marina since the assassination?
Mrs. Ford. I believe she told me she talked on the telephone
with her right soon after, after I did.
Mr. Liebeler. During this first telephone conversation with Marina,
was there anything said about the events of the assassination in any
way?
Mrs. Ford. No; not at all, except that Marina was very surprised
the way people treated her. She was telling me that if it had
happened in Russia, she just would—she just knew she wouldn't be
talking to me or anybody else. She knew they would be sent to
Siberia or shot right away is what she said.
Mr. Liebeler. Is that what Marina said?
Mrs. Ford. That is what she said. I told her that was the big
difference in Russian Government and the American Government.
And then she asked me for advice. Someone contacted her at the
time from a western paper and offered her $10,000 or something for
the story and she asked me if it was the right thing to do because
she felt she didn't want to make money on such a thing, a horrible
thing as that, and I advised her to take the money because I
thought she would need it for the children.
Mr. Liebeler. Now, after that first—was there anything else you
discussed in that telephone conversation?
Mrs. Ford. No; I think we talked mostly about that book deal,
about the offer she had.





Accelerating Discovery Mining Unstructured Information For Hypothesis Generation Scott Spangler

  • 5. ACCELERATING DISCOVERY MINING UNSTRUCTURED INFORMATION FOR HYPOTHESIS GENERATION Scott Spangler Chapman & Hall/CRC Data Mining and Knowledge Discovery Series
  • 8. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series
    SERIES EDITOR
    Vipin Kumar
    University of Minnesota
    Department of Computer Science and Engineering
    Minneapolis, Minnesota, U.S.A.
    AIMS AND SCOPE
    This series aims to capture new developments and applications in data mining and knowledge discovery, while summarizing the computational tools and techniques useful in data analysis. This series encourages the integration of mathematical, statistical, and computational methods and techniques through the publication of a broad range of textbooks, reference works, and handbooks. The inclusion of concrete examples and applications is highly encouraged. The scope of the series includes, but is not limited to, titles in the areas of data mining and knowledge discovery methods and applications, modeling, algorithms, theory and foundations, data and knowledge visualization, data mining systems and tools, and privacy and security issues.
    PUBLISHED TITLES
    ACCELERATING DISCOVERY: MINING UNSTRUCTURED INFORMATION FOR HYPOTHESIS GENERATION Scott Spangler
    ADVANCES IN MACHINE LEARNING AND DATA MINING FOR ASTRONOMY Michael J. Way, Jeffrey D. Scargle, Kamal M. Ali, and Ashok N. Srivastava
    BIOLOGICAL DATA MINING Jake Y. Chen and Stefano Lonardi
    COMPUTATIONAL BUSINESS ANALYTICS Subrata Das
    COMPUTATIONAL INTELLIGENT DATA ANALYSIS FOR SUSTAINABLE DEVELOPMENT Ting Yu, Nitesh V. Chawla, and Simeon Simoff
    COMPUTATIONAL METHODS OF FEATURE SELECTION Huan Liu and Hiroshi Motoda
    CONSTRAINED CLUSTERING: ADVANCES IN ALGORITHMS, THEORY, AND APPLICATIONS Sugato Basu, Ian Davidson, and Kiri L. Wagstaff
    CONTRAST DATA MINING: CONCEPTS, ALGORITHMS, AND APPLICATIONS Guozhu Dong and James Bailey
    DATA CLASSIFICATION: ALGORITHMS AND APPLICATIONS Charu C. Aggarwal
  • 9. DATA CLUSTERING: ALGORITHMS AND APPLICATIONS Charu C. Aggarwal and Chandan K. Reddy
    DATA CLUSTERING IN C++: AN OBJECT-ORIENTED APPROACH Guojun Gan
    DATA MINING FOR DESIGN AND MARKETING Yukio Ohsawa and Katsutoshi Yada
    DATA MINING WITH R: LEARNING WITH CASE STUDIES Luís Torgo
    FOUNDATIONS OF PREDICTIVE ANALYTICS James Wu and Stephen Coggeshall
    GEOGRAPHIC DATA MINING AND KNOWLEDGE DISCOVERY, SECOND EDITION Harvey J. Miller and Jiawei Han
    HANDBOOK OF EDUCATIONAL DATA MINING Cristóbal Romero, Sebastian Ventura, Mykola Pechenizkiy, and Ryan S.J.d. Baker
    HEALTHCARE DATA ANALYTICS Chandan K. Reddy and Charu C. Aggarwal
    INFORMATION DISCOVERY ON ELECTRONIC HEALTH RECORDS Vagelis Hristidis
    INTELLIGENT TECHNOLOGIES FOR WEB APPLICATIONS Priti Srinivas Sajja and Rajendra Akerkar
    INTRODUCTION TO PRIVACY-PRESERVING DATA PUBLISHING: CONCEPTS AND TECHNIQUES Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu, and Philip S. Yu
    KNOWLEDGE DISCOVERY FOR COUNTERTERRORISM AND LAW ENFORCEMENT David Skillicorn
    KNOWLEDGE DISCOVERY FROM DATA STREAMS João Gama
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY FOR ENGINEERING SYSTEMS HEALTH MANAGEMENT Ashok N. Srivastava and Jiawei Han
    MINING SOFTWARE SPECIFICATIONS: METHODOLOGIES AND APPLICATIONS David Lo, Siau-Cheng Khoo, Jiawei Han, and Chao Liu
    MULTIMEDIA DATA MINING: A SYSTEMATIC INTRODUCTION TO CONCEPTS AND THEORY Zhongfei Zhang and Ruofei Zhang
    MUSIC DATA MINING Tao Li, Mitsunori Ogihara, and George Tzanetakis
  • 10. NEXT GENERATION OF DATA MINING Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, and Vipin Kumar
RAPIDMINER: DATA MINING USE CASES AND BUSINESS ANALYTICS APPLICATIONS Markus Hofmann and Ralf Klinkenberg
RELATIONAL DATA CLUSTERING: MODELS, ALGORITHMS, AND APPLICATIONS Bo Long, Zhongfei Zhang, and Philip S. Yu
SERVICE-ORIENTED DISTRIBUTED KNOWLEDGE DISCOVERY Domenico Talia and Paolo Trunfio
SPECTRAL FEATURE SELECTION FOR DATA MINING Zheng Alan Zhao and Huan Liu
STATISTICAL DATA MINING USING SAS APPLICATIONS, SECOND EDITION George Fernandez
SUPPORT VECTOR MACHINES: OPTIMIZATION BASED THEORY, ALGORITHMS, AND EXTENSIONS Naiyang Deng, Yingjie Tian, and Chunhua Zhang
TEMPORAL DATA MINING Theophano Mitsa
TEXT MINING: CLASSIFICATION, CLUSTERING, AND APPLICATIONS Ashok N. Srivastava and Mehran Sahami
THE TOP TEN ALGORITHMS IN DATA MINING Xindong Wu and Vipin Kumar
UNDERSTANDING COMPLEX DATASETS: DATA MINING WITH MATRIX DECOMPOSITIONS David Skillicorn
  • 11. ACCELERATING DISCOVERY MINING UNSTRUCTURED INFORMATION FOR HYPOTHESIS GENERATION Scott Spangler IBM Research San Jose, California, USA
  • 12. The views expressed here are solely those of the author in his private capacity and do not in any way represent the views of the IBM Corporation.
CRC Press, Taylor & Francis Group, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
© 2016 by Taylor & Francis Group, LLC. CRC Press is an imprint of Taylor & Francis Group, an Informa business. No claim to original U.S. Government works. Version Date: 20150721. International Standard Book Number-13: 978-1-4822-3914-0 (eBook - PDF)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com
  • 13. To Karon, my love
  • 15. Contents
Preface
Acknowledgments
Chapter 1 ◾ Introduction
Chapter 2 ◾ Why Accelerate Discovery?
  Scott Spangler and Ying Chen
  THE PROBLEM OF SYNTHESIS
  THE PROBLEM OF FORMULATION
  WHAT WOULD DARWIN DO?
  THE POTENTIAL FOR ACCELERATED DISCOVERY: USING COMPUTERS TO MAP THE KNOWLEDGE SPACE
  WHY ACCELERATE DISCOVERY: THE BUSINESS PERSPECTIVE
  COMPUTATIONAL TOOLS THAT ENABLE ACCELERATED DISCOVERY
    Search
    Business Intelligence and Data Warehousing
    Massive Parallelization
    Unstructured Information Mining
    Natural Language Processing
    Machine Learning
    Collaborative Filtering/Matrix Factorization
    Modeling and Simulation
    Service-Oriented Architectures
  • 16.
    Ontological Representation Schemes
    DeepQA
    Reasoning under Uncertainty
  ACCELERATED DISCOVERY FROM A SYSTEM PERSPECTIVE
    Content Curator
    Domain-pedia
    Annotators
    Normalizers
    BigInsights Framework
    Query Services
    Analytics Services
    User Interface
    Catalogue
  ACCELERATED DISCOVERY FROM A DATA PERSPECTIVE
    Initial Domain Content and Knowledge Collection
    Content Comprehension and Semantic Knowledge Extraction
    Complex and High-Level Knowledge Composition and Representation
    New Hypothesis and Discovery Creation
  ACCELERATED DISCOVERY IN THE ORGANIZATION
  CHALLENGE (AND OPPORTUNITY) OF ACCELERATED DISCOVERY
  REFERENCES
Chapter 3 ◾ Form and Function
  THE PROCESS OF ACCELERATED DISCOVERY
  CONCLUSION
  REFERENCE
Chapter 4 ◾ Exploring Content to Find Entities
  SEARCHING FOR RELEVANT CONTENT
  HOW MUCH DATA IS ENOUGH? WHAT IS TOO MUCH?
  HOW COMPUTERS READ DOCUMENTS
  EXTRACTING FEATURES
  • 17.
    Editing the Feature Space
  FEATURE SPACES: DOCUMENTS AS VECTORS
  CLUSTERING
  DOMAIN CONCEPT REFINEMENT
    Category Level
    Document Level
  MODELING APPROACHES
    Classification Approaches
    Centroid
    Decision Tree
    Naïve Bayes
    Numeric Features
    Binary Features
    Rule Based
    Statistical
  DICTIONARIES AND NORMALIZATION
  COHESION AND DISTINCTNESS
    Cohesion
    Distinctness
  SINGLE AND MULTIMEMBERSHIP TAXONOMIES
  SUBCLASSING AREAS OF INTEREST
  GENERATING NEW QUERIES TO FIND ADDITIONAL RELEVANT CONTENT
  VALIDATION
  SUMMARY
  REFERENCES
Chapter 5 ◾ Organization
  DOMAIN-SPECIFIC ONTOLOGIES AND DICTIONARIES
  SIMILARITY TREES
  USING SIMILARITY TREES TO INTERACT WITH DOMAIN EXPERTS
  • 18.
  SCATTER-PLOT VISUALIZATIONS
  USING SCATTER PLOTS TO FIND OVERLAPS BETWEEN NEARBY ENTITIES OF DIFFERENT TYPES
  DISCOVERY THROUGH VISUALIZATION OF TYPE SPACE
  REFERENCES
Chapter 6 ◾ Relationships
  WHAT DO RELATIONSHIPS LOOK LIKE?
  HOW CAN WE DETECT RELATIONSHIPS?
  REGULAR EXPRESSION PATTERNS FOR EXTRACTING RELATIONSHIPS
  NATURAL LANGUAGE PARSING
  COMPLEX RELATIONSHIPS
  EXAMPLE: P53 PHOSPHORYLATION EVENTS
  PUTTING IT ALL TOGETHER
  EXAMPLE: DRUG/TARGET/DISEASE RELATIONSHIP NETWORKS
  CONCLUSION
Chapter 7 ◾ Inference
  CO-OCCURRENCE TABLES
  CO-OCCURRENCE NETWORKS
  RELATIONSHIP SUMMARIZATION GRAPHS
  HOMOGENEOUS RELATIONSHIP NETWORKS
  HETEROGENEOUS RELATIONSHIP NETWORKS
  NETWORK-BASED REASONING APPROACHES
  GRAPH DIFFUSION
  MATRIX FACTORIZATION
  CONCLUSION
  REFERENCES
Chapter 8 ◾ Taxonomies
  TAXONOMY GENERATION METHODS
  SNIPPETS
  TEXT CLUSTERING
  • 19.
  TIME-BASED TAXONOMIES
    Partitions Based on the Calendar
    Partitions Based on Sample Size
    Partitions on Known Events
  KEYWORD TAXONOMIES
    Regular Expression Patterns
  NUMERICAL VALUE TAXONOMIES
    Turning Numbers into X-Tiles
  EMPLOYING TAXONOMIES
    Understanding Categories
    Feature Bar Charts
    Sorting of Examples
    Category/Category Co-Occurrence
    Dictionary/Category Co-Occurrence
  REFERENCES
Chapter 9 ◾ Orthogonal Comparison
  AFFINITY
  COTABLE DIMENSIONS
  COTABLE LAYOUT AND SORTING
  FEATURE-BASED COTABLES
  COTABLE APPLICATIONS
  EXAMPLE: MICROBES AND THEIR PROPERTIES
  ORTHOGONAL FILTERING
  CONCLUSION
  REFERENCE
Chapter 10 ◾ Visualizing the Data Plane
  ENTITY SIMILARITY NETWORKS
  USING COLOR TO SPOT POTENTIAL NEW HYPOTHESES
  VISUALIZATION OF CENTROIDS
  EXAMPLE: THREE MICROBES
  • 20.
  CONCLUSION
  REFERENCE
Chapter 11 ◾ Networks
  PROTEIN NETWORKS
  MULTIPLE SCLEROSIS AND IL7R
  EXAMPLE: NEW DRUGS FOR OBESITY
  CONCLUSION
  REFERENCE
Chapter 12 ◾ Examples and Problems
  PROBLEM CATALOGUE
  EXAMPLE CATALOGUE
Chapter 13 ◾ Problem: Discovery of Novel Properties of Known Entities
  ANTIBIOTICS AND ANTI-INFLAMMATORIES
  SOS PATHWAY FOR ESCHERICHIA COLI
  CONCLUSIONS
  REFERENCES
Chapter 14 ◾ Problem: Finding New Treatments for Orphan Diseases from Existing Drugs
  IC50:IC50
  REFERENCES
Chapter 15 ◾ Example: Target Selection Based on Protein Network Analysis
  TYPE 2 DIABETES PROTEIN ANALYSIS
Chapter 16 ◾ Example: Gene Expression Analysis for Alternative Indications
  Scott Spangler, Ignacio Terrizzano, and Jeffrey Kreulen
  NCBI GEO DATA
  CONCLUSION
  REFERENCES
  • 21.
Chapter 17 ◾ Example: Side Effects
Chapter 18 ◾ Example: Protein Viscosity Analysis Using Medline Abstracts
  DISCOVERY OF ONTOLOGIES
  USING ORTHOGONAL FILTERING TO DISCOVER IMPORTANT RELATIONSHIPS
  REFERENCE
Chapter 19 ◾ Example: Finding Microbes to Clean Up Oil Spills
  Scott Spangler, Zarath Summers, and Adam Usadi
  ENTITIES
  USING COTABLES TO FIND THE RIGHT COMBINATION OF FEATURES
  DISCOVERING NEW SPECIES
  ORGANISM RANKING STRATEGY
  CHARACTERIZING ORGANISMS
    Respiration
    Environment
    Substrate
  CONCLUSION
Chapter 20 ◾ Example: Drug Repurposing
  COMPOUND 1: A PDE5 INHIBITOR
  PPARα/γ AGONIST
Chapter 21 ◾ Example: Adverse Events
  FENOFIBRATE
  PROCESS
  CONCLUSION
  REFERENCES
Chapter 22 ◾ Example: P53 Kinases
  AN ACCELERATED DISCOVERY APPROACH BASED ON ENTITY SIMILARITY
  • 22.
  RETROSPECTIVE STUDY
  EXPERIMENTAL VALIDATION
  CONCLUSION
  REFERENCE
Chapter 23 ◾ Conclusion and Future Work
  ARCHITECTURE
  FUTURE WORK
  ASSIGNING CONFIDENCE AND PROBABILITIES TO ENTITIES, RELATIONSHIPS, AND INFERENCES
  DEALING WITH CONTRADICTORY EVIDENCE
  UNDERSTANDING INTENTIONALITY
  ASSIGNING VALUE TO HYPOTHESES
  TOOLS AND TECHNIQUES FOR AUTOMATING THE DISCOVERY PROCESS
  CROWD SOURCING DOMAIN ONTOLOGY CURATION
  FINAL WORDS
  REFERENCE
  • 23. Preface
A few years ago, having spent more than a decade doing unstructured data mining of one form or another, in domains spanning helpdesk problem tickets, social media, and patents, I thought I fully understood the potential range of problems and likely areas of applicability of this mature technology. Then, something happened that completely changed how I thought about what I was doing and what its potential really was.
The change in my outlook began with the Watson Jeopardy challenge. Seeing a computer learn from text to play a game I had thought was far beyond the capabilities of any artificial intelligence opened my eyes to new possibilities. And I was not alone. Soon many customers were coming forward with their own unique problems, problems I would have said a few years ago were just too hard to solve with existing techniques. And now, I said, let’s give it a try.
This wasn’t simply a straightforward application of the algorithms used to win Jeopardy in a different context. Most of the problems I was being asked to solve weren’t really even question-answering problems at all. But they all had a similar quality in that they forced us to digest all of the information in a given area and find a way to synthesize a new kind of meaning out of it. This time the problem was not to win a game show, but (to put it bluntly) to advance human scientific knowledge.
More than once in the early going, before we had any results to show for our efforts, I wondered if I was out of my mind for even trying this. Early on, I remember making more than one presentation to senior executives at my company, describing what I was doing, half expecting they would tell me to cease immediately, because I was attempting something way too hard for current technology and far outside the bounds of a reasonable business opportunity. But fortunately, no one ever said that, and I kept on going.
  • 24. Somewhere along the way (I can’t say just when), I lost all doubt that I was really onto something very important. This was more than another new application of unstructured data mining. And as each new scientific area came forward for analysis, the approach we were using began to solidify into a kind of methodology. And then, just as this was happening, I attended the ACM Knowledge Discovery and Data Mining Conference in 2013 (KDD13); I met with CRC Press in a booth at the conference and told them about my idea. Shortly thereafter, I was signed up to write a book.
I knew at the time that what would make this book especially challenging is that I was still proving the methodology and tweaking it, even as I was writing out the description of that method. This was not ideal, but neither could it be avoided if I wanted to broaden the application of the method beyond a small team of people working in my own group. And if it could not be broadened, it would never realize its full potential.
Data science is a new discipline. It lacks a curriculum, a set of textbooks, a fundamental theory, and a set of guiding principles. This is both regrettable and exciting. It must be rectified if the discipline is to become established. Since I greatly desire that end, I write this book in the hopes of furthering it.
Many years ago, I remember stumbling across a book by the statistician John Tukey called Exploratory Data Analysis. It was written 30 years before I read it, and the author was no longer living; yet it spoke to me as if I were his research collaborator. It showed me how the ideas I had been grappling with in the area of unstructured text had been similarly addressed in the realm of structured data. Reading that book gave me renewed confidence in the direction I was taking and a larger vision for what the fulfillment of that vision might one day accomplish. This book is one more step on that journey. 
The journey is essentially my life’s work, and this book is that work’s synthesis thus far. It is far from perfect, as anything that is real will always be a diminishment from what is merely imagined. But I hope even in its imperfect state it will communicate some part of what I experience on a daily basis working through accelerated discovery problems as a data scientist. I think it is unquestionably the most rewarding and exciting job in the world. And I dare to hope that 30 years from now, or maybe even sooner, someone will pick up this book and see their own ideas and challenges reflected in its pages and feel renewed confidence in the direction they are heading.
  • 25. At the same time, I fear that one or two readers (at least) will buy this book and immediately be disappointed because it is not at all what they were expecting it would be. It’s not a textbook. It’s not a business book. It’s not a popular science book. It doesn’t fit well in any classification. I can only say this: I wrote it for the person I was a few years back. Read it with an open mind: you might find you get something useful out of it, regardless of the failure to meet your initial expectations.
It took me a long time to get to this level of proficiency in knowing how to address accelerated discovery problems. I’ve tried my best to capture exactly how I go about it, both from a systematic perspective and from a practical point of view. This book provides motivation, strategy, tactics, and a heterogeneous set of comprehensive examples to illustrate all the points I make. If it works as I have intended it to, it will fill an important gap left by the other types of books I have mentioned…the ones you thought this might be. You can still buy those other books as well, but keep this one. Come back to it later when you have started to put in practice the theories that you learned in school to solve real-world applications. You may find then that the book has more to say to you than you first thought.
Today, I go into each new problem domain with complete confidence that I know how to get started; I know the major steps I need to accomplish, and I have a pretty good idea what the final solution will look like (or at least I know the range of things it might look like when we first deliver a useful prototype). It wasn’t always that way. In those first few customer engagements of this type, I was mostly winging it. It was exciting, no doubt, but I would have really loved to have this book on my desk (or in my e-reader) to look over after each meeting and help me figure out what I should do next. 
If you are fortunate enough to do what I do for a living, I think you will (eventually) find this book worthwhile.
  • 27. Acknowledgments
There were many people who were instrumental in the creation of this methodology and in the process of writing the book that explains it. First, the team at IBM Watson Innovations, who made it all possible: Ying Chen, Meena Nagarajan, Qi He, Linda Kato, Ana Lelescu, Jacques LaBrie, Cartic Ramakrishnan, Sheng Hua Boa, Steven Boyer, Eric Louie, Anshu Jain, Isaac Cheng, Griff Weber, Su Yan, and Roxana Stanoi. Also instrumental in realizing the vision were the team at Baylor College of Medicine, led by Olivier Lichtarge, with Larry Donehower, Angela Dawn Wilkins, Sam Regenbogen, Curtis Pickering, and Ben Bachman.
Jeff Kreulen has been a collaborator for many years now and continues to be a big supporter and contributor to the ideas described here. Michael Karasick and Laura Haas have been instrumental in consistently supporting and encouraging this work from a management perspective at IBM. John Richter, Meena Nagarajan, and Peter Haas were early reviewers of my first draft, and I appreciate their input. Ying Chen helped write the chapter on Why Accelerate Discovery?, for which I am most grateful. Pat Langley provided some very good advice during the planning phase for the book, which I profited from.
Finally, and most importantly, my wife, Karon Barber, who insisted that I finish this project, at the expense of time that I would rather have spent with her. Nothing I’ve accomplished in life would have happened without her steadfast faith and love.
  • 29. Chapter 1 ◾ Introduction
This book is about discovery in science and the importance of heterogeneous data analytics in aiding that discovery. As the volume of scientific data and literature increases exponentially, scientists need ever-more powerful tools to process and synthesize that information in a practical and meaningful way. But in addition, scientists need a methodology that takes all the relevant information in a given problem area (all the available evidence) and processes it in order to propose the set of potential new hypotheses that are most likely to be both true and important. This book describes a method for achieving this goal.
But first, I owe the reader a short introduction and an explanation of why I am the one writing this book. The short answer is a lucky accident (lucky for me anyway; for you it remains to be seen). I stumbled into a career doing the most exciting and rewarding work I can imagine. I do not know it for a fact, but I suspect that I have done more of this kind of work and for a longer period of time than anyone else now alive. It is this experience that I now feel compelled to share, and it is that experience that should make the book interesting reading for those who also see the potential of the approach but do not know how to get started applying it.
It all started out with a love of mathematics, in particular discrete mathematics, and to be even more specific: combinatorics. Basically this is the study of how to count things. I was never happier than when I found this course in college. The discovery of a discipline devoted precisely to what one instinctively loves is one of life’s greatest joys. I was equally
  • 30. disappointed to find there was no such thing as a career in combinatorics, outside of academia, at least not at that time.
But I wandered into computer science, and from there into machine learning and from there into text mining, and suddenly I became aware that the skill and practice of knowing how to count things had a great deal of practical application after all. And now 30 years have passed since I finished that combinatorics course, and with every passing year the number, variety, importance, and fascination of the problems I work on are still increasing.
Nothing thrills me more than to have a new data set land in my inbox. This is especially so if it is some kind of data I have never seen before, better still if analyzing it requires me to learn about a whole new field of knowledge, and best yet if the result will somehow make a difference in the world. I can honestly say I have had the privilege of working on such problems, not once, not twice, but more times than I can reckon. I do not always succeed, but I do make progress often enough that more and more of these problems seem to find their way to me. At some level I wish I could do them all, but that would be selfish (and not very practical). So I am writing this book instead.
If you love working with unstructured, heterogeneous data the way I do, I believe this book will have a positive impact on your career, and that you will in turn have a positive impact on society. This book is an attempt to document and teach Accelerated Discovery to the next generation of data scientists. Once you have learned these techniques and practiced them in the lab, your mission will be to find a scientist or an engineer struggling with a big data challenge and help them to make a better world. I know these scientists and engineers exist, and I know they have these challenges, because I have talked to them and corresponded with them. 
I have wished there were more of me to go around so that I could help with every one of them, because they are all fascinating and all incredibly promising and worthy efforts. But there are only so many hours in a week, and I can only pick a few of the most promising to pursue, and every one of these has been a rewarding endeavor. For the rest and for those problems that will come, I have written this book.
This book is not a data-mining manual. It does not discuss how to build a text-classification engine or the ins and outs of writing an unsupervised clustering implementation. Other books already do this, and I could not surpass these. This book assumes you already know how to process data using the basic bag of tools that are now taught in any good data-mining
  • 31. or machine-learning course. Where those courses leave off, this book begins. The question this book answers is how to use unstructured mining approaches to solve a really complex problem in a given scientific domain. How do you create a system that can reason in a sophisticated way about a complex problem and come up with solutions that are profound, nonobvious, and original?
From here on, this book is organized in a more or less top-down fashion. The next chapter discusses the importance of the Accelerated Discovery problem space and why the time has come to tackle it with the tools we currently have available. Even if you are already motivated to read the book, do not skip this chapter, because it contains some important material about how flexibly the technology can be applied across a wide swath of problems.
What follows immediately thereafter is a set of five chapters that describe the method at a fairly high level. These are the most important chapters of the book, because they should be in the front of your mind each time you face a new analytics challenge in science. First there is a high-level description of our method for tackling these problems, followed by four detailed chapters giving a general approach to arriving at a solution. When put together, these five chapters essentially cover our method for accelerating discovery. Not every problem you encounter will use every step of this method in its solution, but the basic approach can be applied in a more or less universal way.
The next section brings the level of detail down to specific technologies for implementing the method. These are less universal in character but hopefully will make the method more concrete. This set of four chapters goes into greater detail about the tools and algorithms I use to help realize the approach in practice. 
It is not complete, but hopefully it will be illustrative of the kinds of techniques that can make the abstract process a reality.
The rest of the book is made up of sample problems or examples of how this really works in practice. I included ten such examples because it was a nice round number, and I think the examples I have selected do provide a good representative sample of this kind of engagement. All of these examples are from real scientists, are based on real data, and are focused on important problems in discovery. The examples all come from the life-sciences area, but that is not meant to be the only area where these techniques would apply; in fact, I have applied them in several other sciences, including materials and chemistry. But my best physical science examples
  • 32. are not publishable due to proprietary concerns, so for this book I have chosen to focus on the science of biology.
That is how the book is organized, but do not feel you have to read it this way. You could just as well start with the examples and work your way back to the earlier chapters when you want to understand the method in more detail. You will quickly notice that not every problem solution employs every step of the methodology anyway. The methodology is a flexible framework on which to assemble the components of your solution, as you need them, and where they make sense. And it is meant to be iterative and to evolve as you get deeper into the information complexity of each new domain.
As you read the book, I hope that certain core principles of how to be a good data scientist will naturally become apparent. Here is a brief catalogue of those principles to keep in mind each time you are faced with a new problem.
    • The whole is greater than the sum of the parts: As scientists we naturally tend toward reductionism when we think about how to solve a problem. But in data science, it is frequently the case that, by considering all the relevant data at once, we can learn something that we cannot see by looking at each piece of data in isolation. Consider ways to unify everything we know about an individual entity as a complete picture. What you learn is frequently surprising.
    • More X is not always better: There is a wishful tendency among those less familiar with the problems of data science to imagine that every problem can be solved with more data, no matter how irrelevant that data happens to be; or that, if we have run out of data, then adding more features to the feature space ought to help; or, if that fails, that adding more categories to our taxonomy should help, and so on. The operative concept is “more is always better.” 
      And certainly, one is supposed to assume that at least more stuff can never hurt; the solution must be in there somewhere. But the problem is, if you add mostly more noise, the signal gets harder to find, not easier. Careful selection of the right data, the right features, and the right categories is always preferable to indiscriminate addition.
    • Compare and contrast: Measuring something in isolation does not tell you very much. Only when you compare the value to some other related thing does it begin to have meaning. If I tell you a certain
  • 33. baseball player hit 50 home runs last season, this will not mean much if you know nothing about the game. But if you know what percentile that puts him in compared to other players, that tells you something, especially if you also take into account plate appearances, difficulty of pitchers faced, and the ball parks he played in. The point is that too often in data science, we are tempted to look too narrowly at only one aspect of a domain in order to get the precise number we are looking for. We also need to look more broadly in order to understand the implications of that value: to know whether it means anything of importance.
    • Divide and conquer: When you have a lot of data you are trying to make sense of, the best strategy for doing this is to divide it into smaller and smaller chunks that you can more easily comprehend. But this only works if you divide up the data in a way that you can make sense of when you put it all back together again. For example, one way to divide up data is by letter of the alphabet, but this is unlikely to make any one category much different than any other, and thus the problem has not become any easier within each subcategory. But if I focus on concepts rather than syntax, I stand a much better chance of being enlightened at the end.
    • “There’s more than one way to…”: Being a cat lover, I shy away from completing that statement, but the sentiment is no less true for being illustrated in such an unpleasant way. Once we find a solution or approach, our brains seem to naturally just turn off. We have to avoid this trap and keep looking for other ways to arrive at the result. If we apply these additional ways and get the same answer, we can be far more confident than we were that the answer is correct. If we apply the additional approaches and get a different answer, that opens up whole new areas for analysis that were closed to us before. Either way, we win. 
    • Use your whole brain (and its visual cortex): Find a way to make the data draw a picture, and you will see something new and important that was hidden before. Of course, the challenge is to draw the picture that illustrates the key elements of importance across the most visible dimensions. Our brains have evolved over time to take in vast amounts of data through the eyes and convert it effortlessly into a reasonably accurate view of what is going on around us. Find a
  • 34. way to put that powerful specialized processor to work on your data problem and you will inevitably be astounded at what you can see.
    • Everything is a taxonomy/feature vector/network: At the risk of oversimplifying things a bit, there are really only three basic things you need to know to make sense of data: What the entities are that you care about and how they relate to each other (the taxonomy), how you can describe those entities as features (feature vector), and how you can represent the way those entities interact (network). Every problem involves some subset or combination of these ideas. It really is that simple.
    • Time is not a magazine: The data we take to begin our investigation with is usually static, meaning it sits in a file that we have downloaded from somewhere, and we make sure that file does not change over time (we may even back it up to be absolutely sure). This often leads us to forget that change is the only constant in the universe, and over time we will find that our file bears less and less relation to the new reality of now. Find a way to account for time and to use time recorded in data to learn how things evolve.
    • All data is local: A corollary to the problem of time is the problem of localization. Most data files we work with are subsets of the larger data universe, and thus we have to generalize what we learn from them to make them applicable to the real universe. That generalization problem is going to be much harder than you realize. Prejudice toward what we know and ignorance of what we do not is the bane of all future predictions. Be humble in the face of your own limited awareness.
    • Prepare for surprise: If you are not constantly amazed by what you find in data, you are doing something wrong.
Hopefully this brief introduction gives you some sense of the ideas to keep in mind as you begin to master this discipline. 
Discovery is always hard and always involves synthesizing different kinds of data and analytics. The crucial step is to make all those moving parts work together constructively to illuminate what lies just beyond the known. The key ingredient is figuring out what to count and how to count it. In the end, everything is just combinatorics all the way down!
I hope this book helps you to solve complex and important problems of this type. More than that, I encourage you to develop your own methods for accelerating discovery and publish them as I have done mine. This problem is too important for one small group of data scientists in one organization to have all the fun. Come join us.
Chapter 2

Why Accelerate Discovery?

Scott Spangler and Ying Chen

There is a crisis emerging in science due to too much data. On the surface, this sounds like an odd problem for a scientist to have. After all, science is all about data, and the more the better. Scientists crave data; they spend time and resources collecting it. How can there be too much data? After all, why can scientists not simply ignore the data they do not need and keep the data they find useful?

But therein lies the problem. Which data do they need? What data will end up proving useful? Answering this question grows more difficult with increasing data availability. And if data grows exponentially, the problem may reach the point where individual scientists cannot make optimal decisions based on their own limited knowledge of what the data contains. I believe we have reached this situation in nearly all sciences today. We have certainly reached it in some sciences.

So by accelerating discovery, I do not simply mean doing science the way we do it today, only faster; I really mean doing science in a profoundly new way, using data in a new way, and generating hypotheses in a new way. But before getting into all that, I want to present some historical context in order to show why science the way it has always been practiced is becoming less and less viable over time.

To illustrate what I mean, consider the discovery of evolution by Charles Darwin. This is one of the most studied examples of great science
in history, and in many ways his example provided the template for all scientific practice for the next 150 years. On the surface, the story is remarkably elegant and straightforward. Darwin travels to the Galapagos Islands, where he discovers a number of new and unique species. When he gets back from his trip, he notices a pattern of species changing over time. From this comes the idea of species “evolution,” which he publishes to much acclaim and significant controversy. Of course, what really happened is far more complex and illuminating from the standpoint of how science was, and for the most part is still, actually practiced.

First of all, as inhabitants of the twenty-first century, we may forget how difficult world travel was back in Darwin’s day. Darwin’s voyage was a survey mission that took him around the world on a sailing ship, the Beagle. He made many stops and had many adventures. Darwin left on his trip in 1831 and returned five years later, in 1836. During that time, he collected many samples from numerous locations and he took copious notes on everything he did and saw. After he got back to England, he then spent many years collating, organizing, and systematically cataloguing his specimens and notes. In 1839, he published, to much acclaim, a book describing the incidents of this voyage (probably not the one you are thinking of; that one came much later): Journal and Remarks, Voyage of the Beagle. Darwin then spent the next 20 years doing research and collecting evidence on plants and animals and their tendency to change over time. But though he was convinced the phenomenon was real, he still did not have a mechanism by which this change occurred.

Then Darwin happened upon Essay on the Principle of Population (1798) by Thomas Malthus. It introduced the idea that animals produce more offspring than typically survive. This created a “struggle for existence” among competing offspring.
This led Darwin directly to the idea of “natural selection.” The Origin of Species was published in 1859, 28 years after the Beagle left on its voyage. And of course, it was many decades later before Darwin’s theory would be generally accepted. There are certain key aspects of this story that I want to highlight as particularly relevant to the question “Why Accelerate Discovery?” The first has to do with the 20 years it took Darwin to collect and analyze the data he felt was necessary to develop and validate his theory. The second is related to the connection that Darwin made between his own work and that of Malthus. I think both of these phenomena are characteristic of the big data issue facing scientists both then and now. And if we think about them carefully in the context of their time and ours, we can see how it
becomes imperative that scientists working today use methods and tools that are far more powerful than those of Darwin and his contemporaries.

THE PROBLEM OF SYNTHESIS

When Darwin returned from his 5-year voyage, he had a formidable collection of notes and specimens to organize and catalogue. This step took him many years; longer, in fact, than it took him to collect the data in the first place, but it was crucial to the discovery process. We often think of scientific discovery as a Eureka moment, a bolt from the blue. But in reality, it is much more frequently the result of painstaking labor carried out to collect, organize, and catalogue all the relevant information. In essence, this is a problem of synthesis. Data hardly ever comes in a form that can be readily processed and interpreted. In nearly every case, the genius lies not in the finding and collecting of the data but in the organization scheme that makes the interpretation possible. In a very real sense, all significant scientific discoveries are about overcoming the data-synthesis problem.

Clearly, data synthesis is hard (because otherwise everyone would do it), but what makes it so? It is often not easy to see the effort required if only the result is observed. This is because the most difficult step, the part that requires the real genius, is almost invisible. It is hidden within the structure of the catalogue itself. Let us look at the catalogue of specimens Darwin created. Darwin’s task in specimen organization and cataloguing was not just to record the species and physical characteristics of each specimen; it was to find the hidden relationships between them, as illustrated in Figure 2.1. Organizing data into networks of entities and relationships is a recurring theme in science. Taxonomies and ontologies are another manifestation of this.
Taxonomies break entities down into classes and subclasses based on some measure of similarity, so that the further down the tree you go, the more alike things are within the same class. Ontologies represent a more general kind of entity network that expresses how entities relate to each other in the world. Creating such networks from raw data is the problem of synthesis. The more data there is, and in particular the more heterogeneous the forms of that data, the more challenging synthesis becomes.

THE PROBLEM OF FORMULATION

Once Darwin had synthesized his data, it became clear to him that species did indeed change over time. But merely to observe this phenomenon was not enough. To complete his theory, he needed a mechanism by which
this change takes place. He needed to create a model that could explain how the data points (i.e., the species) connected to each other; otherwise all he would have is a way to organize the data, without having any additional insight into what the data meant. Creating this additional insight that emerges from synthesis is the problem of formulation.

FIGURE 2.1 Illustrations from The Origin of Species.

Formulation requires the creation of an equation or algorithm that explains a process or at least simulates or approximates mathematically how that process behaves in the physical world. From a data-science perspective, formulation requires extracting patterns that may appear across many disparate, heterogeneous data collections. Going beyond synthesis to explanation may require data visualization and sometimes even analogy. It requires pattern matching and
being able to draw from a wide array of related data. This is what Darwin was able to do when he drew on the writings of Thomas Malthus to discover the driving mechanism behind species change. Darwin reused an existing formulation, the struggle for existence among competing offspring (i.e., “survival of the fittest”), and applied it to competition among all living things in order to arrive at a formulation of how species evolve.

The process of formulation begins with observation. The scientist observes how entities change and interact over time. She observes which properties of entities tend to occur together and which tend to be independent. Often, data visualization (charts or graphs, for example) is used to summarize large tables of numbers in a way that the human visual cortex can digest and make sense of. The synthesis of data is one of the key steps in discovery, one that often looks obvious in retrospect but, at the beginning of research, is far from being so in most cases.

WHAT WOULD DARWIN DO?

The process of synthesis and formulation used by Darwin and other scientists worked well in the past, but this process is increasingly problematic. To put it bluntly, the amount and variety of data that needs to be synthesized and the complexity of the models that need to be formulated have begun to exceed the capacity of individual human intelligence. To see this, let us compare the entities and relationships described in The Origin of Species to the ones that today’s biologists need to grapple with.

Today’s biologists need to go well beyond the species and physical anatomy of organisms. Today, biology probes life at the molecular level. The number of different proteins that compose the cells in the human organism is over a million. Each of these proteins has different functions in the cell. Individual proteins work in concert with other proteins to create additional functionality.
The complexity and richness of all these interactions and functions is awe inspiring. It is also clearly beyond the capability of a single human mind to grasp. And all of this was entirely unknown to Darwin. To look at this in a different way, the number of scholarly publications available for Darwin to read in his field might have been on the order of 10,000–100,000 at most. Today, that number would be on the order of fifty million [1].

How do the scientists of today even hope to fathom such complexity and scale of knowledge? There are two strategies that every scientist employs to one degree or another: specialization and consensus. Each scientist
chooses an area of specialization that is narrow enough to encompass a field wherein they can be familiar with all the important published literature and findings. Of course, this implies that as time goes on and more and more publications occur, specialization must grow more and more intense. This has the obvious drawback of narrowing the scope of each scientist’s knowledge and the application of their research. In addition, each scientist will read only the publications in the most prestigious, high-profile journals. These will represent the best consensus opinion of the most important research in their area. The drawback is that consensus in science is frequently wrong. Also, if the majority of scientists are pursuing the same line of inquiry, the space of possible hypotheses is very inefficiently and incompletely explored.

THE POTENTIAL FOR ACCELERATED DISCOVERY: USING COMPUTERS TO MAP THE KNOWLEDGE SPACE

But all is not lost for the scientists of today, for the very tools that help generate the exponentially increasing amounts of data can also help to synthesize and formulate that data. Due to Moore’s Law, scientists have, and will most likely continue to have, exponentially increasing amounts of computational power available to them. What is needed is a way to harness that computational power to carry out better synthesis and formulation, to help the scientist see the space of possibilities and explore that space much more effectively than they can today. What is needed is a methodology that can be repeatedly employed to apply computation to any scientific domain in such a way as to make the knowledge space comprehensible to the scientist’s brain.

The purpose of this book is to present one such methodology and to describe exactly how to carry it out, with specific examples from biology and elsewhere.
We have shown this method to be an effective way to synthesize all published literature in a given subject area and to formulate new properties of entities based on everything that we know about those entities from previous results. This leads us to conclude that the methodology is an effective tool for accelerating scientific discovery. Since the methods we use are in no way specific to these examples, we think there is a strong possibility that they may be effective in many other scientific domains as well.

Moreover, regardless of whether our particular methodology is optimal or effective for any particular scientific domain, the fact remains that all scientific disciplines that are pursued by ever-increasing numbers
of investigators must ultimately address this fundamental challenge: Eventually, the rate of data publication will exceed the individual human capacity to process it in a timely fashion. Only with the aid of computation can the brain hope to keep pace. The challenges we address here and the method we employ to meet those challenges will continue to be relevant and essential for science for the foreseeable future.

So clearly the need exists and will continue to increase for aiding scientific discovery with computational tools. But some would argue that no such tools exist, or that if they do exist, they are still too rudimentary to really create value on a consistent basis. Computers can do computation and information retrieval, but scientific discovery requires creativity and thinking “outside the box,” which is just what computers cannot do. A few years ago, the authors would have been largely in agreement with this viewpoint, but something has changed in the field of computer science that makes us believe that accelerating scientific discovery is no longer a distant dream but is actually well within current capability. Later in this chapter, we will describe these recent developments and preview some of the implications of these emerging capabilities.

WHY ACCELERATE DISCOVERY: THE BUSINESS PERSPECTIVE

Discovery is central and critical to the whole of humanity and to many of the world’s most significant challenges. Discovery represents an ability to uncover things that were not previously known. It underpins all innovations (Figure 2.2).

Looking at what we human beings consume (for example, consumer goods such as food, clothing, household items, and energy), we would quickly realize that we need significant innovations across the board.
FIGURE 2.2 Example application areas for Accelerated Discovery. Major discoveries and innovation are critical to many world challenges and the success of many companies across industries. Panels: smarter planet/consumer goods (water filtration, product innovation); information technology (nanotechnologies, mobile); life sciences (drug discovery, biomedical research); energy storage and generation (batteries, solar, CO2).

We need to discover new ways to generate and store energy, new water
filtration methods, and new product formations for food and other goods so that they are more sustainable for our environments and healthier for human beings. We need these innovations more desperately than ever.

Looking at what we make and build (for example, new computer and mobile devices, infrastructures, and machines), again, the need for discovery and innovation is at the center of all these. We need new kinds of nanotechnologies that can scale human capacity to unimaginable limits, new materials that have all the desired properties that will lower energy consumption while sustaining their intended functions, and new designs that can take a wide variety of factors into consideration.

Looking at ourselves, human beings, our own wellbeing depends heavily on the discovery and innovation in healthcare, life sciences, and a wide range of social services. We need a much better understanding of human biology. We need to discover new drugs and new therapies that can target diseases much more effectively and efficiently.

Yet today, the discovery processes in many industries are slow, manual, expensive, and ad hoc. For example, in drug discovery, it takes on average 10–15 years to develop one drug, and costs hundreds of millions of dollars per drug. The attrition rate of drug development today is over 90%. Similarly, new energy forms, such as the lithium battery, take tens of years to discover. New consumer product formations are mostly done on a trial-and-error basis. There is a need across all industries for a reliable, repeatable process to make discovery more cost-effective.

COMPUTATIONAL TOOLS THAT ENABLE ACCELERATED DISCOVERY

Accelerated Discovery is not just one capability or algorithm but a combination of many complementary approaches and strategies for synthesizing information and formulating hypotheses.
The following existing technologies represent the most important enablers of better hypothesis generation in science.

Search

The ability to index and retrieve documents based on the words they contain across all relevant content in a given scientific field is a primary enabler of all the technologies that are involved in Accelerated Discovery. Being able to selectively and rapidly acquire all the relevant content concerning a given subject of interest is the first step in synthesizing the meaning of
that content. The easy availability, scalability, and application of search to this problem space have made everything else possible.

Business Intelligence and Data Warehousing

The ability to store, retrieve, and summarize large amounts of structured data (i.e., numeric and categorical data in tables) allows us to deal with all kinds of information in heterogeneous formats. This gives us the critical ability to survey scientific discoveries over time and space or to compare similar treatments on different populations. The ability to aggregate data and then accurately compare different subsets is a technology that we apply over and over again in our approach as we seek to determine the credibility and reliability of each fact and conclusion.

Massive Parallelization

In recent years, Hadoop and MapReduce frameworks [2] have made parallelization approaches much more applicable to real-world computing problems. This gives us the ability to attack hard problems involving large amounts of data in entirely new ways. In short, we can build up a number of simple strategies to mine and annotate data that can, in aggregate, add up to a very sophisticated model of what the data actually means. Massive parallelization also allows us to try out thousands of approaches and combinations in real time before selecting the few candidates that are most likely to succeed based on our models and predictions.

Unstructured Information Mining

Most of the critical information in science is unstructured. In other words, it comes in the form of words, not numbers. Unstructured information mining provides the ability to reliably and accurately convert words into other kinds of structures that computers can more readily deal with. As we will see in this book, this is a key element of the accelerated discovery process.
This allows us to go beyond retrieving the right document, to actually discovering hidden relationships between the elements described by those documents.

Natural Language Processing

The ability to recognize entities, relationships or transitions, and features and properties and to attribute them appropriately requires natural language processing. This technology allows us to parse the individual elements of the sentence, identify their part of speech, and determine to what
they refer. It can also allow us to discover the intentionality of the author. These natural language processing abilities enable the precise determination of what is known, what is hypothesized, and what is still to be determined through experimentation. It creates the underlying fact-based framework from which our hypotheses can be generated.

Machine Learning

To do Accelerated Discovery in complex domains requires more than just establishing the factual statements in literature. Not all literature is equally trustworthy, and the trustworthiness may differ depending on the scope and context. To acquire the level of sophistication and nuance needed to make these determinations will require more than human programming can adequately provide. It will require learning from mistakes and past examples in order to get better and better over time at deciding which information is credible and which is suspect. Machine learning is the technology that provides this type of capability. In fact, it has proven remarkably adept at tackling even heretofore intractable problems where there is sufficient training data to be had [3]. Machine learning will enable our Accelerated Discovery approach to apply sophisticated judgment at each decision point and to improve that judgment over time.

Collaborative Filtering/Matrix Factorization

Collaborative filtering is a technique made famous by Amazon and Netflix [4], where it is used to accurately identify the best movie or book for a given customer based on the purchase history of that customer and other similar customers (customers who buy similar things). Customer purchases are a special kind of entity-entity network. Other kinds of entity-entity networks are similarly amenable to this kind of link prediction.
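As a minimal sketch of link prediction in a customer-item network, the Python below scores unseen items for a customer by the votes of similar customers. It assumes a simple neighborhood-based form of collaborative filtering with cosine similarity rather than matrix factorization, and it does not represent how Amazon or Netflix actually implement their systems. The purchase data and the function names `cosine` and `predict_links` are hypothetical.

```python
from math import sqrt

# Hypothetical purchase history: customer -> set of items bought.
purchases = {
    "ann": {"book_a", "book_b", "book_c"},
    "bob": {"book_a", "book_b"},
    "cat": {"book_c", "book_d"},
    "dan": {"book_d"},
}

def cosine(a, b):
    """Cosine similarity between two sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def predict_links(customer, history):
    """Rank items the customer lacks by similarity-weighted votes.

    Each other customer votes for the items they own, weighted by how
    similar their history is to the target customer's history.
    """
    owned = history[customer]
    scores = {}
    for other, items in history.items():
        if other == customer:
            continue
        weight = cosine(owned, items)
        if weight == 0.0:
            continue  # customers with nothing in common carry no evidence
        for item in items - owned:
            scores[item] = scores.get(item, 0.0) + weight
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

bob_candidates = predict_links("bob", purchases)
ann_candidates = predict_links("ann", purchases)
```

Each returned pair is a candidate new link (customer, item) with a score. Replacing customers and items with, say, proteins and diseases gives the same computation over a different entity-entity network.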
We can use a generalization of this approach as a way to predict new links in an existing entity-entity network, which can be considered to be a hypothesis of a new connection between entities (or a new property of an entity) that is not currently known but is very likely based on everything we know about the relevant entities.

Modeling and Simulation

In order to reason accurately about the physical world we have to be able to simulate its processes in silico and predict what would happen if an experiment were tried or a new property or relationship was actually found to exist. These types of simulations will help reveal potential downstream
problems or contradictions that might occur if we were to hypothesize a physically unrealizable condition or some impossible connection between entities. Moreover, modeling and simulation can help determine what the likely impact would be on the physical system as a whole of any new property or relationship being discovered, in order to foresee whether such a discovery would be likely to be uninteresting or quite valuable because it would imply a favorable outcome or have a wide impact in the field.

Service-Oriented Architectures

Clearly, doing Accelerated Discovery requires a large array of heterogeneous software components providing a wide variety of features and functions across multiple platforms. Service-oriented architectures (SOA) provide a uniform communication protocol that allows the components to communicate across the network using complex and evolving data representations. They also allow new implementations and algorithms to be easily swapped in as they become available. SOAs represent an indispensable tool for enabling the large, sophisticated, and distributed systems needed for accelerated discovery applications to emerge from components that can largely be found on the shelf or in open-source libraries.

Ontological Representation Schemes

In addition to being able to extract entities and relationships from unstructured content, we also need powerful ways to represent those entities and their features and connections in a persistent fashion, and in a way that makes it possible to do reasoning over these objects. Existing ontological representation schemes (e.g., OWL [5]) make it possible to store entities in a way that retains all the pertinent contextual information about those entities while still maintaining a degree of homogeneity. This homogeneity makes it possible to design algorithms that can discover or infer new properties based on all known existing patterns.
The ability to store such representations in a scalable database and/or index provides the capability of growing the stored version of what is known to the level necessary to comprehend an entire scientific domain.

DeepQA

While question answering is not a central feature of Accelerated Discovery, the two applications share many common components. Both require the computational digestion of large amounts of unstructured content, which then must be used in aggregate to form a conclusion with a likelihood
estimate. Both also support their answers with evidence extracted from text sources.

Reasoning under Uncertainty

Machine learning techniques allow us to predict the likelihood that some conclusion or fact is true. Reasoning under uncertainty is how we use this probabilistic knowledge to form new hypotheses or to invalidate or ignore some fact that is highly unlikely. Bayesian inferencing [6] is one example of an existing framework that we can apply to uncertain causal networks in order to do credible deduction of probable end states. This capability is central to telling the difference between a likely outcome and something that is wildly fanciful.

ACCELERATED DISCOVERY FROM A SYSTEM PERSPECTIVE

The previous list of enabling technologies available to support Accelerated Discovery (AD) is necessary to the task, but incomplete. What is needed is a coherent framework in which these technologies can effectively work together to achieve the desired outcome.

To support and enable such continuous data transformations and discovery, we must design our discovery solution carefully. In particular, our discovery solution must adapt and scale to a wide range of changing dynamics, including data content, domain knowledge, and human interactions. This is crucial because in all industry domains, domain content and knowledge are constantly changing and expanding. If we do not design our discovery system to cope with such changes, the outcome will be a system that lacks longevity and capability. Most, if not all, of today’s discovery solutions can only deal with limited volume and diversity of content and knowledge.

To enable adaptation and scaling, we instituted two system design principles: agility and adaptivity. Agility means that a discovery system must be able to rapidly generate outputs in the face of changes in data content, knowledge, and human inputs. This is far from a reality in today’s discovery systems.
For example, a discovery system may be built for one kind of data input format. When the data input format changes, significant manual intervention may be needed and downstream system components may also need to change accordingly. Such designs make a discovery process extremely lengthy and error prone. We will describe our approach to build agility into the system from the very beginning.
Adaptivity means that a discovery solution must consider “changes in all forms” to be first-class citizens; for example, changes in individual system components, changes in data content, and changes in knowledge. We envision that various underlying technology components will go through their evolution to become better over time, and the same is true of data content and knowledge bases. A discovery system must have the notion of adaptivity built into it from day one.

To enable agility, we suggest that all major system components be designed with a “core-abstraction plus configurable customization” approach. The core abstraction defines all the major services that the system components intend to support. The configurable customizations allow for changes and adaptations of the core abstractions for specific needs in data-content formats, knowledge repositories, and interactions with other components. For example, a content collection component may have common and core services dealing with the processing of typical unstructured documents. A configurable customization can define the specific fields and format extensions that are needed for those unstructured sources without code change in the core abstraction services.

To enable adaptivity, we defined generalized data input and output formats, called common data models (CDMs), and common interfaces around all system components. This allows developers to change the component engine itself without impacting the overall function of the discovery system. It also allows us to adapt to new changes in data sources and knowledge bases by simply mapping them to CDMs without changing the rest of the system.

Figure 2.3 summarizes the key system components of a discovery solution. The boxes are the major system components.
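The “core-abstraction plus configurable customization” approach and the common data model described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class names `CommonDocument` and `ContentCollector`, the configuration keys, and the sample records are all hypothetical and are not the actual CDM or component interfaces of the system described here.

```python
from dataclasses import dataclass, field

@dataclass
class CommonDocument:
    """Hypothetical common data model (CDM) record shared by all components."""
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)

class ContentCollector:
    """Core abstraction: turns raw source records into CDM documents.

    Source-specific details live in `config`, so supporting a new source
    format means writing a new configuration, not changing this class.
    """

    def __init__(self, config):
        self.id_field = config["id_field"]
        self.text_fields = config["text_fields"]
        self.extra_fields = config.get("extra_fields", [])

    def collect(self, record):
        # Join the configured text fields that are present in the record.
        text = " ".join(record[name] for name in self.text_fields if name in record)
        # Carry over any configured extra fields as metadata.
        metadata = {name: record[name] for name in self.extra_fields if name in record}
        return CommonDocument(str(record[self.id_field]), text, metadata)

# Two hypothetical source formats, handled by configuration only.
patent_config = {
    "id_field": "patent_no",
    "text_fields": ["title", "claims"],
    "extra_fields": ["assignee"],
}
article_config = {"id_field": "pmid", "text_fields": ["title", "abstract"]}

patent = ContentCollector(patent_config).collect(
    {"patent_no": 123, "title": "Battery", "claims": "A lithium cell.", "assignee": "Acme"}
)
article = ContentCollector(article_config).collect(
    {"pmid": "42", "title": "Kinase study", "abstract": "We examine a protein."}
)
```

Downstream components in such a design would only ever see `CommonDocument` objects, so a change in a source format stays confined to one configuration dictionary.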
We will also provide a summary description of each of the following major system components:

Content Curator

Content curator is responsible for managing domain content sources. It includes collecting, cleansing, and making diverse content sources available for downstream processing, end-user queries, and browsing. Typical functions include data ingestion, extraction, transformation, and loading into some content repository. It also includes functions such as indexing and searching.

Domain-pedia

Domain-pedia is responsible for managing domain knowledge. It must ingest, update, and process data knowledge from existing knowledge
sources as well as some of the downstream processing, such as the semantic entity extraction process. It can also be a resource for runtime processing and analytics and end-user searching and browsing, similar to what one might do on Wikipedia, such as searching for knowledge around a given protein.

Annotators

Annotators are the engines that pull semantic domain concepts, such as chemicals, genes, and proteins, and their relationships out of unstructured text information.

Normalizers

Normalizers are the engines that organize the vocabularies around various domain concepts into a more consistent form. As the name indicates, they normalize domain concepts into standardized forms, such as unique chemical structures and protein names.

BigInsights Framework

The BigInsights framework is an orchestration engine that manages the interactions between the components described above in a scalable and efficient fashion by mapping runtime content, knowledge, annotation, and normalization processing in a large-scale Hadoop-like infrastructure framework. Such a framework is like the blood vessels of the human being; without it, we only have pieces and parts of the system, rather than a live system.

Query Services

Query services provide consistent data access interfaces for end users and other system components to query underlying data repositories without knowing what format each data repository might be in.

Analytics Services

Analytics services include runtime algorithms to enable the discovery and construction of more complex knowledge representations. For example, they may produce network representations based on the underlying knowledge repository of a gene-to-gene relationship.

User Interface

The user interactions of a discovery system can be diverse. They can range from basic searches and reporting to more complex visualizations and
workflows. Our user interface component is built with a “platform+applications” principle. That is, we developed a suite of common user-interface widgets that can be leveraged by a wide range of use cases. We then design specific applications that are targeted for specific use cases by leveraging the common widgets as much as possible. This way, our discovery system can quickly be adapted to support different use cases without applications having to be rewritten each time.

Catalogue
Finally, the catalogue component manages the system records about all system components and their configurations and metadata about the content sources and knowledge sources. It creates and manages a list of system and data catalogues. System administrators are the typical users of such a component.

Clearly, a discovery system is complex and dynamic. Without the design principles that have been described, a discovery solution can quickly find itself stuck in the face of changing content, knowledge, and user-interaction paradigms.

ACCELERATED DISCOVERY FROM A DATA PERSPECTIVE
The process of Accelerated Discovery can essentially be thought of as a data transformation process. Having just described the system for transforming that data, let us look at the process again from a data perspective and see how the data may be changed as we move through the discovery process.

The discovery process is inherently a continuous transformation from raw pieces of data and information to enriched pieces of knowledge and more comprehensive knowledge representations, all the way to brand new discoveries and hypotheses. A discovery solution/platform can therefore be viewed as a system that supports and enables such data transformations and end-user interactions. Figure 2.4 summarizes four major data transformations driven by each of the four discovery process steps. We will now discuss each of these steps of transformation in detail.
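As a rough illustration of this data view, the four transformations can be pictured as a chain of functions, each consuming the output of the one before it. The sketch below is a toy under stated assumptions: the function names (collect, comprehend, compose, hypothesize), the record layout, the placeholder entity names, and the shared-neighbour rule for proposing new links are all hypothetical and are not taken from the system described in this chapter.

```python
from itertools import combinations

# Hypothetical toy vocabulary of placeholder entity names (lowercased).
# A real system would draw on curated ontologies and dictionaries instead.
KNOWN_ENTITIES = {"genea", "geneb", "drugx"}

def collect(raw_docs):
    """Step 1: cleanse raw content into a simple repository (list of dicts)."""
    return [{"id": i, "text": " ".join(d.split())}
            for i, d in enumerate(raw_docs) if d.strip()]

def comprehend(repository):
    """Step 2: attach the known entities mentioned in each document."""
    enriched = []
    for doc in repository:
        tokens = {t.strip(".,;").lower() for t in doc["text"].split()}
        enriched.append({**doc, "entities": sorted(tokens & KNOWN_ENTITIES)})
    return enriched

def compose(enriched):
    """Step 3: build a co-occurrence network as {(a, b): count}."""
    edges = {}
    for doc in enriched:
        for pair in combinations(doc["entities"], 2):
            edges[pair] = edges.get(pair, 0) + 1
    return edges

def hypothesize(edges):
    """Step 4: propose unseen links between entities that share a neighbour."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    candidates = set()
    for a, b in combinations(sorted(neighbours), 2):
        if b not in neighbours[a] and neighbours[a] & neighbours[b]:
            candidates.add((a, b))
    return sorted(candidates)

# Made-up toy sentences, not scientific claims.
docs = ["DrugX inhibits  GeneA.", "", "GeneA activates GeneB."]
network = compose(comprehend(collect(docs)))
candidates = hypothesize(network)
```

With these toy inputs, the network links each pair of entities that appear in the same sentence, and the final step proposes the one pair that is never mentioned together but shares a neighbour. This is meant only to mirror the "given known, show unknown" idea in the smallest possible form.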
Initial Domain Content and Knowledge Collection
The bottom inner circle of Figure 2.4 marks the beginning of the data transformation journey. To enable major discoveries, we must ensure that the system is aware of a vast and diverse set of prior domain knowledge and domain content. Without this step, the downstream analysis and
discovery will be extremely limited in scope, and wrong conclusions could be drawn. This is often the case with today’s bioinformatics tools, which operate on small and narrowly scoped data sets.

We differentiate domain knowledge and domain content deliberately here since they mean different things. Domain knowledge means prior knowledge that has been captured digitally, such as manually curated domain ontologies, taxonomies, dictionaries, and manually curated structured databases. For example, in drug discovery, such domain knowledge may include the ChemBL database [7], OBO ontologies [8], and other dictionaries.

Domain content means raw digital data content that might capture existing domain knowledge but has not been curated systematically to allow a broad set of scientists to gain such knowledge easily. For example, many of the unstructured information sources, such as patents, published conference and journal articles, and internal reports, often contain knowledge known by the individual scientists who wrote the documents. However, if this knowledge has not been made widely and easily accessible, much of it is locked down and unusable. Domain content also includes structured data and semistructured data such as genomic screening information and experiments. Many such sources of raw data also require significant cleansing and processing before they can be made available and accessible.

FIGURE 2.4 Data evolution. The Discovery Platform: A Data Perspective. The discovery platform is a system that continuously transforms from initial raw data and domain knowledge to brand new discoveries through a series of data transformation steps. Each step along the way will enable services and bring value to our clients and partners. The figure names four data levels (emerging patterns and discovery; complex knowledge and visual representations; enriched data and enhanced domain knowledge; data and domain knowledge) and four process steps:
• Create new hypothesis and predictions: given known, show unknown; can be simple or complex representations.
• Compose complex and holistic knowledge representations: graphs and networks; runtime calculated visualizations, such as scatter plots.
• Comprehend and extract semantic knowledge: entity, relationships, and complex relationship extraction core and customizations; e.g., chemical, biological, and toxicology annotators and PTM relationships; enhanced domain knowledge and enriched data content.
• Collect and curate domain content and knowledge: content collection and ingestion; content curation, indexing; patents, Medline literature, ChemBL, ...; domain knowledge/ontology ingestion; domain-pedia management; OBO, SIDER, dictionaries, ontologies.

A discovery solution must be able to collect, cleanse, and make accessible a vast and diverse amount of domain knowledge and content in its initial data transformation step. The output of this initial step is a content and knowledge repository that is ready for downstream analysis, indexing, data exploration, and browsing.

Content Comprehension and Semantic Knowledge Extraction
The second transformation has to do with the ability to derive an enriched set of domain knowledge from the initial content and knowledge collection. Specifically, when it comes to unstructured and semistructured content sources, such a transformation step is critical in capturing and extracting knowledge buried in text, tables, and figures in a systematic fashion; for example, extracting chemical structures and their properties, genes and proteins described in text, and tables of experimental results.

Other ways to comprehend the content also include classification and taxonomy generation. Such methods help organize content more meaningfully and allow scientists to examine data with different lenses.

Such a step often requires natural language processing, text mining, and machine learning capabilities. In addition, it also requires systems to coalesce and cross-map different styles of vocabulary into something that is more consistent. For example, one chemical compound may have over 100 different names. Unless such chemical compounds are cross-mapped and normalized in some fashion, there is very little hope that scientists can gain comprehensive knowledge about them, let alone discover new insights.

A well-designed system component that carries out such forms of data comprehension will result in an enriched content and knowledge repository.
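To make the cross-mapping idea concrete, the following minimal sketch shows one way a normalizer might collapse several surface names onto a single canonical identifier, so that mentions scattered across documents can be gathered under one entry. It is an illustration only: the synonym table, the identifier scheme (CHEM:0001 and so on), and the function names normalize and group_mentions are hypothetical, and a production normalizer would rely on curated dictionaries and structure-based matching rather than a hand-written lookup.

```python
from collections import defaultdict

# Hypothetical synonym table: surface name -> made-up canonical identifier.
SYNONYMS = {
    "aspirin": "CHEM:0001",
    "acetylsalicylic acid": "CHEM:0001",
    "asa": "CHEM:0001",
    "ibuprofen": "CHEM:0002",
}

def normalize(name):
    """Return the canonical identifier for a name, or None when unmapped."""
    key = " ".join(name.lower().split())  # fold case and collapse whitespace
    return SYNONYMS.get(key)

def group_mentions(mentions):
    """Group (doc_id, name) mentions by canonical identifier."""
    grouped = defaultdict(list)
    for doc_id, name in mentions:
        canonical = normalize(name)
        if canonical is not None:  # unmapped names are skipped in this sketch
            grouped[canonical].append(doc_id)
    return dict(grouped)

mentions = [
    (1, "Aspirin"),
    (2, "acetylsalicylic  acid"),
    (3, "Ibuprofen"),
    (4, "unknownium"),
]
index = group_mentions(mentions)
```

Here documents 1 and 2 end up under the same identifier even though they use different names, which is the property a scientist needs in order to see everything known about one compound in one place. Names missing from the table are simply dropped in this sketch; a real component would more likely queue them for curation.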
Such a repository not only unlocks and captures domain knowledge buried in unstructured data, but also cross-maps different domain vocabularies in a more consistent fashion. An analogy of such a knowledge repository is a “domain-pedia.” That is, it contains all pieces of knowledge gained via machine curation and gathers them together into a repository that is like Wikipedia but more in-depth and comprehensive for the domain under study.

Complex and High-Level Knowledge Composition and Representation
Building on the repositories created from the previous data transformation steps, the third data transformation step attempts to generate more
Random documents with unrelated content Scribd suggests to you:
she just turned around and went home. That is the time she found out he was married and was deceiving her. But I don't know why she left, I mean why, exactly she left but I think this is the person that was her reason for leaving Leningrad.
Mr. Liebeler. Did she tell you that she had left in part at least to get away from this man?
Mrs. Ford. That is what I understood.
Mr. Liebeler. Now, did Marina tell you why she married Lee Oswald when she was in Russia, did she talk to you about that?
Mrs. Ford. Yes; she did. I don't exactly know why she married him. But she said she met him at a dance, and soon after that, I don't know the reason why he was in the hospital but he was in a hospital, and she called on him, and I don't know how long he stayed there, either, and she liked the man, I think, and she bought him an Easter Egg, that was during Easter sometime and he was very surprised that such a thing could be done in Russia. I think it rather pleased him very well. She said somehow she felt sorry for the man because none of her friends liked him, and mistrusted him, and she felt sort of like she was on the defensive, she wanted to, she felt sorry for him in a way.
Representative Ford. Did she tell you why her friends didn't trust him?
Mrs. Ford. Yes; she said they were thinking that he was an American spy or something like that, that is what they were trying to tell her. Maybe he is a spy, and how can you trust a man like that? She told me the other day, she says no one trusted him, but she says, I wasn't afraid of him, that is how she put it.
Representative Ford. Did Marina indicate to you whether she thought or had any reason to believe that Oswald was a spy?
Mrs. Ford. No; she didn't. She didn't think so. She never said that, I mean.
Representative Ford. Did Marina ever indicate to you or did you gather that one of the reasons that Marina married Lee Oswald was she had the possibility of leaving the Soviet Union in mind?
Mrs. Ford. Yes. She never did go out directly and say that but I think I got an impression that was her reason. She was telling me that way before she met Oswald she was dreaming of coming over here, and that is, I mean gathering by that later I thought that she wanted to come over, and he was, I suppose he was a reason.
Mr. Liebeler. In this connection, I wanted to ask you whether you ever had any contact with any newspaper reporters from the Dallas newspapers about this, did they ask you about this at any time?
Mrs. Ford. About this I don't remember if they asked me. I don't remember. But a couple of reporters came to my house soon after the assassination and talked to me.
Mr. Liebeler. I represent to you that there was a story in the November 27, 1963, issue of the Dallas Times Herald which told about some Russian-born woman in Dallas to whom Marina had supposedly confided some of the most intimate secrets of her stormy marriage, in the words of the newspaper article. Do you have any knowledge whether that would be you or somebody else?
Mrs. Ford. I think that would be me.
Mr. Liebeler. Did you tell the reporters that Marina had told you that Marina had felt sorry for Oswald because everybody hated him even in Russia?
Mrs. Ford. Yes; I might have said that.
Mr. Liebeler. Marina did say that to you?
Mrs. Ford. Yes; she did.
Mr. Liebeler. Did Marina tell you anything at this time about their trip back to the United States and the difficulties that they
encountered, or how they did it, that sort of thing?
Mrs. Ford. No; I never talked with her about that.
Mr. Liebeler. Was there any conversation while Marina stayed with you during that week in November 1962, about the possibility of a divorce, of her divorcing Lee Oswald?
Mrs. Ford. The possibility—I know she didn't want to go back to him at the time she stayed with me.
Mr. Liebeler. But you don't remember any specific conversation?
Mrs. Ford. No; I don't remember any specific conversation.
Mr. Liebeler. About divorce?
Mrs. Ford. Frankly, there was talk about it, she didn't want to go back and I just told her, I felt that Marina wasn't really the domestic type she could stand very long being a help at home, not that I think she is not capable of taking care of her own house. I see now since she has got even her own place she keeps it very clean and her children are always neat. But she wasn't right for domestic help and I told her to stay with Lee, that is what I told her myself, and wait until she could be able to take care of herself other than working in a house.
Mr. Liebeler. What did she say about that?
Mrs. Ford. She didn't say, she was really just listening, I think, and she didn't say anything.
Mr. Liebeler. Did Marina say anything to you at this time about wanting to go back to Russia?
Mrs. Ford. No, no; she didn't want to. In fact, she told me that Lee soon after he came to the United States, he was telling her that he would want to go back because he couldn't find a job here and he was, of course, seeing a lot of difficulties for himself, and Marina said, If you want to go back, you can go but I am not going, that is what she told me.
Mr. Liebeler. Marina told Lee if he wanted to go back he could but she wasn't going to go back to Russia.
Mrs. Ford. That is right.
Mr. Liebeler. Is that all the conversation that you had with her about going back to Russia at that time?
Mrs. Ford. That is right. That is about all.
Mr. Liebeler. Did you know where the Oswalds lived at this time when Marina came to the Meller's house and then to your house?
Mrs. Ford. No; I have never been at their apartment, and she couldn't tell me. I know she lived in Oak Cliff, the Dallas section about southwest, I believe.
Mr. Liebeler. In Dallas?
Mrs. Ford. Yes.
Mr. Liebeler. Did you know whether Marina had lived with a lady in Fort Worth before they came to Dallas?
Mrs. Ford. Yes; I know they stayed there but I didn't talk to her during the time and I didn't visit her. I know she stayed at Elena Hall's house and I think Elena had an accident just before that and she was—she stayed in bed most of the time. Marina was helping her out.
Mr. Liebeler. Now, you had only seen Lee Oswald, up to this week that Marina came to live with you, one time, is that correct?
Mrs. Ford. That is correct.
Mr. Liebeler. That was the Meller's luncheon party?
Mrs. Ford. That is correct.
Mr. Liebeler. Were you surprised on the basis of any judgment you might have made of Lee Oswald to learn that he had beaten his wife?
Mrs. Ford. Just from seeing him once I would not have made— no; that he has beaten his wife; no, I didn't think at that time. I did, when she came in after I learned that he has beaten her, I was rather—I remember the bruises on her face and that rather made signs to me that he did.
Mr. Liebeler. Did it surprise you that he would have done this?
Mrs. Ford. No; it did not surprise me. I just felt that young man as he was, if he was—decided to go to Russia after living in a country like the United States, I didn't feel he was very, what shall I say, how would you say, a person's mind won't work at this time——
The Chairman. Unstable?
Mrs. Ford. Unstable, that is how I felt. I felt a person like that, I felt frankly could do anything.
Representative Ford. Did she ever tell you that Lee Oswald was the cause of these bruises on her face?
Mrs. Ford. Well, she did tell me after she came to the house to stay with me.
Representative Ford. That is what I mean?
Mrs. Ford. That is right.
Representative Ford. The bruises you saw on her face at the house she told you Lee Oswald was the cause?
Mrs. Ford. Yes, that is right.
Mr. Liebeler. Let's clarify that a little. Did Marina Oswald have any bruises at the time she came to live in your house in November 1962?
Mrs. Ford. No; that is right. But she stayed at Anna Meller's house for a week and when she came to Anna Meller's house I heard there were bruises at that time.
Mr. Liebeler. Did Anna Meller tell you that?
Mrs. Ford. Either Anna Meller or George Bouhe told me that. I don't remember.
Mr. Liebeler. And you yourself did see bruises on her face the first time?
Mrs. Ford. I did see the first time; yes.
Mr. Liebeler. Did she tell you that Lee Oswald had given her those bruises?
Mrs. Ford. That is right.
Mr. Liebeler. She told you that when she stayed with you?
Mrs. Ford. That is right.
Mr. Liebeler. Tell us the circumstances under which Marina left your home in November of 1962, where she went and what happened?
Mrs. Ford. Well, she stayed with me a week, and my husband came home on Saturday, and we discussed with another friend of mine for Marina to go to her house and stay there as long as she wanted, and I think Sunday morning this friend of mine, Anna Ray, came with a station wagon and picked all her things up, her playpen she had for baby, and diapers and things, and took her to her house and I believe my husband was with her, too, at that time and that is how she left.
Mr. Liebeler. Now, do you know how long she stayed with the Rays?
Mrs. Ford. With the Rays. I think she just stayed there, she had had dinner there, I believe she stayed one afternoon. I don't know how soon Lee came there but he came soon over to the house, but Marina said he cried and begged her to return, he would be nothing, if she didn't return, he would be finished, that is what he was telling her, and she said she just couldn't say, no to him.
Mr. Liebeler. So she returned to Oswald at that time?
Mrs. Ford. Yes; she returned to Oswald.
Mr. Liebeler. Did you ever talk with the Mellers about their experiences with Marina when she lived at their house for that time?
Mrs. Ford. No; I don't remember, she did not discuss it.
Mr. Liebeler. Did you remember or did you know where Lee Oswald was living prior to the time that Marina came to Dallas?
Mrs. Ford. No; I don't know where he lived at any time.
Mr. Liebeler. Did you know where he was working at that time or if he was working at all?
Mrs. Ford. During the time they lived in Dallas, I believe, I don't know exactly, though, either George Bouhe or Anna Meller's husband found him a job in a printing shop, I think, or I believe it is printing shop, somewhere in Oak Cliff, and that is why they had an apartment there. I remember that is the reason because George Bouhe was rather mad at Marina for taking an apartment in Oak Cliff because it was too far for him to drive and help her when she needed help and the baby, I think he was taking her to the dentist and taking the baby to a doctor to help her in ways that she couldn't do herself.
Representative Ford. Who was doing this driving?
Mrs. Ford. I believe George Bouhe did this. He has the car.
Mr. Liebeler. Where does Mr. Bouhe live?
Mrs. Ford. He lives, well, I don't know his address now. I know where he lives but I don't know the street number.
Mr. Liebeler. But it is not in the Oak Cliff Section of Dallas?
Mrs. Ford. No; it is not. I think it would be in the east part of Dallas.
Mr. Liebeler. During the time that Marina stayed with you, did she say anything to you about Lee Oswald's political beliefs or his attitudes concerning politics?
Mrs. Ford. No; she didn't talk to me about that and I didn't ask her. Frankly, I didn't talk with Lee about that, I didn't feel the need of it myself to discuss politics with him.
Mr. Liebeler. Did you discuss that subject with any of your friends?
Mrs. Ford. Well, yes. They were telling me, those friends that went to his apartment, were telling me, that they have seen books like Karl Marx open in front of him, just lying there on the table, that he didn't even hide it when someone came in, and then someone else said there was a book laying there of How to Be a Spy, laying right open there.
Mr. Liebeler. Do you remember who told you that?
Mrs. Ford. I believe it was Lydia Dymitruk.
Mr. Liebeler. D-y-m-i-t-r-u-k. Do you remember anything about that particular conversation?
Mrs. Ford. Well, she was telling me, she took, when the baby left my house, she had a cold, and it was getting worse, and I believe soon after she left Anna Ray, the baby began to have a fever, and Lydia, I believe, I don't know how she got to go to her apartment, really, I don't know the reason she went there; she went there and wanted to take the baby to the doctor and she told me of an incident that says even Marina was ashamed of Lee because when she took her to the hospital Lee was lying about that he didn't have a job at the time, which Lydia knew that he did have. He didn't want to pay for the services, and people at the hospital was asking him how does he pay for the apartment and he was telling them that, My friends were helping me, and Marina just said something in Russian that Lydia remembers, What a liar, you know, behind his back.
Mr. Liebeler. Did she say that so he could hear it?
Mrs. Ford. I think so, because she said it aloud.
Mr. Liebeler. And she said it in Russian?
Mrs. Ford. Yes; she did.
Mr. Liebeler. In front of Lydia Dymitruk?
Mrs. Ford. Yes; that is right, and Lydia was rather mad about the whole thing and she said she is not going to help them any more if they are acting that way. The baby wasn't helped at the hospital. I think the hospital didn't want to take the child because the father couldn't pay, that is what I got, the father couldn't pay for it.
Mr. Liebeler. Did you ever see Marina say anything adverse to Oswald in his presence, did she run him down or make fun of him in public so far as you know?
Mrs. Ford. As far as I know, I don't, except the first time, the one time, I would think when she said, What a liar, in front of him.
Mr. Liebeler. You don't know of any other instance when she would have done that?
Mrs. Ford. I don't know of any other instance.
Mr. Liebeler. Do you know whether she ever spoke of his political views before other people or make fun of him?
Mrs. Ford. No; I don't remember except lately I have been talking to her about that and she said she thought of him being young, and she thought she hoped it would pass with years, that he would mature, this is what she was telling me.
Mr. Liebeler. Did she tell you whether or not she discussed politics with him herself? Did she argue with him about anything?
Mrs. Ford. She said she was arguing with him about that. Certainly, in fact, he called her, she was typical American girl, that she is not interested at all in politics, except in the material things that he wasn't interested in. She said she wanted a house and a
family and he said, All the American girls think that way and he thought he married a different sort of a girl, a Russian girl.
Mr. Liebeler. Now, think back on that week that Marina stayed with you. Is there anything else that happened or is there anything that Marina told you that you think we should know about and about which I haven't already asked you?
Mrs. Ford. I cannot think of it at the time.
Mr. Liebeler. When was the next contact that you had with the Oswalds'?
Mrs. Ford. It was right after Christmas before New Year's. I believe it was the 28th of December, a Friday. I gave, I had, a party for all the friends, and I invited a family by the name of De Mohrenschildt, wife and husband, and Mrs. De Mohrenschildt called me up and asked me if she couldn't bring Marina and her husband over because she was saying it is a shame the way all their Russian friends have forsaken them during that time and they had no place to go and the De Mohrenschildts were the only ones helping them at the time, and I told her I didn't object to it. So she brought them over with them.
Mr. Liebeler. Was Mr. Bouhe at the party?
Mrs. Ford. Yes; I think so.
Mr. Liebeler. Mr. and Mrs. Frank Ray?
Mrs. Ford. Yes.
Mr. Liebeler. Mr. and Mrs. Thomas Ray?
Mrs. Ford. Yes.
Mr. Liebeler. Were Elena Hall and her husband there?
Mrs. Ford. I don't remember them being at that party. I don't think so.
Mr. Liebeler. Were the Mellers there?
Mrs. Ford. Yes.
Mr. Liebeler. You mentioned that De Mohrenschildt was there.
Mrs. Ford. Yes.
Mr. Liebeler. And the Oswalds.
Mrs. Ford. That is right.
Mr. Liebeler. Was there a gentleman by the name of Allen A. Jackson at the party?
Mrs. Ford. Yes.
Mr. Liebeler. And his wife?
Mrs. Ford. And his wife.
Mr. Liebeler. Do you remember any conversations or observe any conversations between Marina Oswald and Mr. Jackson?
Mrs. Ford. Marina Oswald, no; I don't think so. I don't think Marina spoke English at the time.
Mr. Liebeler. Were Mr. and Mrs. Charles Edward Harris at the party?
Mrs. Ford. Yes.
Mr. Liebeler. Did you talk to Mr. and Mrs. Harris about the party afterward?
Mrs. Ford. I might have.
Mr. Liebeler. Did she mention translating a conversation between Mr. Jackson and Marina Oswald?
Mrs. Ford. No; she did not mention it to me.
Mr. Liebeler. And you didn't see her doing that?
Mrs. Ford. No; I didn't see her doing that at the party.
Mr. Liebeler. Did you have any conversation with Lee Oswald at the party that night?
Mrs. Ford. No.
Mr. Liebeler. Did you notice anybody else talking to him?
Mrs. Ford. Yes; I did. I noticed a girl talking to him who was of Japanese descent but I don't remember her name.
Mr. Liebeler. Was there anything striking about that?
Mrs. Ford. No; I think, the only thing it was that I think he talked to her most of the time and wasn't making any conversation with anyone else.
Mr. Liebeler. Did Marina comment on that to you?
Mrs. Ford. No; she didn't. But I heard from somebody else that she did comment.
Mr. Liebeler. Who told you that?
Mrs. Ford. I think George Bouhe again. He always spoke to everybody.
Mr. Liebeler. What did Mr. Bouhe say about that?
Mrs. Ford. Well, he said something that I asked Marina afterward and she told me that it wasn't true. He said that Lee talked to that Japanese girl like a—it is an expression in Russia to take a bath and then beat themselves with the leaves from a tree, and the leaf would stick to the body, in the wintertime, and so the expression from that, like a leaf sticks to the hot body when you take a bath, you know. Then I suppose Lee struck her as just not saying anything—and I asked Marina and she said he did not do it.
Mr. Liebeler. Was there any talk at the party about Oswald's experiences in Russia or his marriage to Marina?
Mrs. Ford. During the party?
Mr. Liebeler. Yes.
Mrs. Ford. No; I didn't talk to him at all.
Mr. Liebeler. And you didn't overhear anybody else talking about Oswald's experiences in Russia?
Mrs. Ford. No.
Mr. Liebeler. You mentioned that the Oswalds came with De Mohrenschildt. Did they go home with De Mohrenschildt?
Mrs. Ford. Yes; they did because Mrs. De Mohrenschildt stated that the lady who stayed with Oswald's child had to leave at 12 o'clock and they left before that.
Mr. Liebeler. After the Oswalds left the party, was there any discussion about the Oswalds that you can remember?
Mrs. Ford. No; really not. I don't think they made a big impression on that party on anybody.
Mr. Liebeler. Can you fix the date on which your party was held?
Mrs. Ford. The 28th of December.
Mr. Liebeler. What year?
Mrs. Ford. 1962.
Mr. Liebeler. Was there any other party the next day or the day after that?
Mrs. Ford. There were other parties. I don't think that Marina and her husband were present. I don't know if you would call it a party, there was a sort of a get-together at my house afterward because some people stayed over in town for a few days, I think that was the Rays, and the Harrises, and a friend of my husband from Louisiana, Sullivans, stayed there. But they hadn't met Oswald, they came much later after 12 o'clock. So we sort of had a get-together.
Mr. Liebeler. This would have been the next day?
Mrs. Ford. The next day.
Mr. Liebeler. On the 29th of December. Was Mr. Bouhe——
Mrs. Ford. No; Mr. Bouhe wasn't present at that time.
Mr. Liebeler. Was there any conversation at that time about Oswald?
Mrs. Ford. I don't remember, I don't think so.
Mr. Liebeler. Do you remember that there was a party or open house at Mr. Bouhe's house on the 29th?
Mrs. Ford. If there was, I wasn't present, I didn't go.
Mr. Liebeler. Was there a party at Meller's house?
Mrs. Ford. There was a party or luncheon.
Mr. Liebeler. Did you go?
Mrs. Ford. No; I didn't go.
Mr. Liebeler. Did you hear any conversation during this period of 3 or 4 days about Oswald, anybody speculating about Oswald or discussing his experiences——
Mrs. Ford. No; I think——
Mr. Liebeler. In Russia?
Mrs. Ford. I think at that time everyone rather—George Bouhe said he was not going to help them any more, he was through, since Marina, he tried to help her very hard, and she did not hold her word about not going back to him. So he said since she went back, so now it is her problem. Well, he is sort of that type of man, he is trying to help hard and if you are doing what he says otherwise he is not going to help, so that was it. So it was rather, sort of Marina and her husband were dropped at that time, nobody actually wanted to help, and I think what they heard about Lydia Dymitruk was saying that he couldn't help those people. I mean they were just sort of—he couldn't reach them. He was lying in hospital and things, we sort of gave it up.
Mr. Liebeler. Did Lydia make that remark at one of those parties or was that at a previous time?
Mrs. Ford. Oh, that was a previous time. As soon as she left to go back to her husband, George Bouhe even took the Russian dictionary back to him. He told her to give it back and he was just through with him.
Mr. Liebeler. So far as you know Mr. Bouhe had no more contact with the Oswalds after that?
Mrs. Ford. No; I don't think so.
Mr. Liebeler. Did you have any conversation with any of your friends in Dallas or Fort Worth on the question of Oswald's ability to leave Russia and come back to the United States and bring Marina with him?
Mrs. Ford. We didn't speculate on that until really later, until now, after the assassination that subject came up, and people asking why they left so soon. He was telling me it took them a year, so I don't know.
Mr. Liebeler. Was there any conversation prior to the assassination, during this period in 1962, any speculation as to whether Oswald might be an agent of any government?
Mrs. Ford. No. I frankly didn't think he was capable of it. That was my feeling on it.
Mr. Liebeler. Were there any conversations on that?
Mrs. Ford. No. There were not.
Mr. Liebeler. Do you remember that Mr. Bouhe suggested at one time that Oswald was a mental case?
Mrs. Ford. Mr. Bouhe, he might have; yes, I think we all thought that.
Mr. Liebeler. Do you remember that Mr. Bouhe said that?
Mrs. Ford. I don't remember particularly that he would say that.
The Chairman. Did you say, We all thought that?
Mrs. Ford. Yes; we thought that, that he was rather mentally— you just said the word before.
The Chairman. Unstable?
Mrs. Ford. Unstable.
Mr. Liebeler. Unstable. Why did you think that, Mrs. Ford?
Mrs. Ford. In my own opinion, I just didn't think that a man as young as he was could come to the conclusions just by rather experiences or living a long time in America and I mean studying the whole economic structures of different governments, and things that he would come to the conclusion that is the best thing for him. I think he was just rather too young for that. I thought he was just rather—something was rather wrong with the man.
Mr. Liebeler. You based this——
Representative Ford. In the conversation that Oswald had with this Japanese lady at your party, did you overhear any of that conversation?
Mrs. Ford. No; I did not. I did not have time, I was the hostess and I just didn't get to talk to anyone.
Representative Ford. Did you ever ask the Japanese lady what the gist of the conversation was or what the content of the conversation was?
Mrs. Ford. No; I never have. In fact, I have not seen her after that. That was the first time she came to my house. I mean I have seen her later on in the beauty shop but I have never talked to her about it.
Representative Ford. Did she speak English?
Mrs. Ford. Yes; she speaks English well.
Representative Ford. All right.
Mr. Liebeler. At this get-together that you mentioned at your house on, I think it was the 29th, after the party, the informal get-together, the Rays were there, and the Sullivans, and Mr. and Mrs. Harris. Would it refresh your recollection if I suggested to you that some of the people that were there at that party said the possibility of Oswald being a Russian agent was discussed in detail at that party in that group?
Mrs. Ford. No.
Mr. Liebeler. You don't remember any of the discussion?
Mrs. Ford. I don't remember any of the discussion.
Mr. Liebeler. Is there anything other than what you told us that led you to believe that Oswald was unstable or a mental case?
Mrs. Ford. Nothing, except that I was thinking about him myself beating his wife. That would have been one reason. I don't think that any stable man would do that, especially she appeared to me very sick sort of a woman, not sick, but frail and fragile, I think any man who strikes a woman who is incapable of striking back, I would think would be unstable.
Mr. Liebeler. This Russian group that we have been discussing, Mr. Bouhe and these other people, do they see each other regularly? Is there sort of a Russian community in Dallas, would you say?
Mrs. Ford. Yes; there are about, I think, maybe four families in Fort Worth and maybe half a dozen in Dallas or more than that, but that mainly we see each other, and there is one Eastern Orthodox Church in Russia where that is where we actually meet each other.
Mr. Liebeler. In Dallas, you mean, or Fort Worth?
Mrs. Ford. Dallas. I am sorry. In Dallas, and we still observe our Eastern Orthodox religion during the holidays and sometimes like Christmas falls 13 days after the American Christmas so there is New Year's sometimes we celebrate those.
  • 73. Mr. Liebeler. How well do you know the De Mohrenschildts?
    Mrs. Ford. Well, I know George De Mohrenschildt the same, approximately, I will say I was acquainted with him for approximately 14 years but I don't know him well.
    Mr. Liebeler. You mentioned before that De Mohrenschildt was the only member of the Russian community that kept on seeing the Oswalds and trying to help them. Was there any discussion about that among your friends?
    Mrs. Ford. Yes; George De Mohrenschildt is rather an odd ball, among Russians anyway, so it was nothing unusual about him doing that. He was always doing something unusual. He would even go to church with shorts on, you know, this is something, he would do something that nobody else would do.
    Mr. Liebeler. Shorts?
    Mrs. Ford. Shorts.
    Mr. Liebeler. Short trousers?
    Mrs. Ford. Yes.
    Mr. Liebeler. Do you remember any specific conversations about the reasons as to why De Mohrenschildt continued to associate with the Oswalds after the rest of you had given them up.
    Mrs. Ford. Well, I remember his wife was telling me like she felt it was their duty now since everybody else dropped them and they needed help.
    Mr. Liebeler. Do you remember any conversation with the Oswalds among any of your friends as to whether or not Oswald went hunting in Russia and had access to weapons?
    Mrs. Ford. Yes; I think that George Bouhe was telling me that. He was telling him that he was going hunting and he told him about killing ducks or something of that type.
    Mr. Liebeler. Can you remember that in any greater detail?
  • 74. Mrs. Ford. No; I don't remember it. I only remember that because of the way he was saying, ducks in Russian, George was saying that he was using sort of a word when you call for it, it is a small duck rather than for ducks, he was saying that his Russian wasn't perfect. I mean in that conversation he was using, an example of he was saying, when he would go hunting for ducks, instead of utki for ducks he would say utitschki that would mean small ducks, and he was saying that his Russian was imperfect.
    Mr. Liebeler. That is Oswald's Russian?
    Mrs. Ford. That is right.
    Mr. Liebeler. What about Marina's ability to speak English at that time, did she speak English at that time?
    Mrs. Ford. I don't think she did. She could speak a few words but I don't think she did.
    Mr. Liebeler. Did any of you attempt to teach her English?
    Mrs. Ford. Yes; George Bouhe was attempting to teach her to write and was giving her lessons.
    Mr. Liebeler. Can you tell us something about that?
    Mrs. Ford. Well, he was telling me that he had gotten her a dictionary and he had—or some other book anyway and he was telling me that every time he saw her, made an attempt to see her I don't know how he did that but anyway he was giving her a lesson and she was supposed to have completed it by the next time in writing.
    Mr. Liebeler. Did Mr. Bouhe tell you anything about Marina's ability to speak English or write English? Did Marina learn as a result of that?
    Mrs. Ford. He said she was doing very well. I don't know if she learned to speak but he said she was a good student.
  • 75. Mr. Liebeler. Did Lee Oswald ever object to this effort on Mr. Bouhe's part?
    Mrs. Ford. Well, he was objecting to anyone of the Russians helping her.
    Mr. Liebeler. Helping her learn English?
    Mrs. Ford. Not learning English, but I mean helping her about anything. I don't know whether he was objecting to that. But we talked with Lee about this, why he wasn't teaching her English and he wasn't speaking to her and he said that he didn't want to forget Russian and he really said, If she wants, this is what Marina said the other day, that he didn't actually object, but he thought if she could learn Russian just by herself in any way she could, she could, but he is not going to help her. He just didn't want to help her by speaking English.
    Mr. Liebeler. You mean English.
    Mrs. Ford. That is right. He is not going to talk English with her, he wanted Russian. He wanted for the little girl to learn Russian and for himself not to forget it.
    Mr. Liebeler. Did he indicate, in other words that he wanted Marina to speak Russian so that he could maintain his own ability in Russian?
    Mrs. Ford. That is correct.
    Mr. Liebeler. And he also wanted his children to learn to speak Russian?
    Mrs. Ford. That is right.
    Mr. Liebeler. After this party on the 28th of December, what was your next contact with either Lee or Marina Oswald?
    Mrs. Ford. After the 28th? I think after the assassination, I only heard once about her, I just heard she went to New Orleans. Again a friend, Lydia Dymitruk, was in the bakery and she said she saw Marina coming in and she told her she thought Marina was pregnant
  • 76. and she told her Marina was going to New Orleans. That is the only time I have heard anything about them after that party. The next contact I had with her was, I don't know the date but it was soon after the assassination when I just felt sorry for Marina, I thought she was, I always felt she was innocent, I thought she was a naive girl in a lot of ways and that is why she got into a lot of problems and troubles. I just felt if she didn't have anybody there except the FBI and nobody to speak in Russian, she didn't know how everybody felt, I would think she would feel very badly, so I called Mrs. Paine who, I found out after the assassination she was a friend of Marina's and I told her that if she would have a contact with Marina, tell her that we sympathize with her that she is in the position that she is, and to call me or to let me—that I would like to talk with her.
    Mr. Liebeler. Can you tell us approximately when it was that you talked to Mrs. Paine?
    Mrs. Ford. I think that was either the first part of December, I think it must have been at least a week after the assassination.
    Mr. Liebeler. In early December?
    Mrs. Ford. Yes.
    Mr. Liebeler. What happened after that?
    Mrs. Ford. Right after that, I think the next day Marina called me, and she said the reason she called me was because Mrs. Paine told her that I called, and let her know.
    Mr. Liebeler. Do you remember specifically that Marina told you that she, Marina, had talked to Mrs. Paine?
    Mrs. Ford. Well, this, I don't know. I think that is what she did. In fact, I think that Mrs. Paine told me she talked on the telephone with Marina but I couldn't be positive about it. I know she wrote letters, they wrote, I know Mrs. Paine wrote to Marina, and I couldn't exactly say she talked to her on the telephone or how it was, either Marina called Mrs. Paine or I don't know.
  • 77. Mr. Liebeler. You say you know Marina wrote Mrs. Paine a letter? Did Marina tell you that she wrote Mrs. Paine during that period?
    Mrs. Ford. I don't remember. I know later Mrs. Paine kept calling me and asked if I heard from Marina, because she kept writing to Marina and Marina didn't answer, so she wanted to know if I talked to her, that she was the reason she was calling me.
    Mr. Liebeler. Did Mrs. Paine indicate to you after she called you trying to find out from Marina, did she indicate to you she had or had not heard from Marina since the assassination?
    Mrs. Ford. I believe she told me she talked on the telephone with her right soon after, after I did.
    Mr. Liebeler. During this first telephone conversation with Marina, was there anything said about the events of the assassination in any way?
    Mrs. Ford. No; not at all, except that Marina was very surprised the way people treated her. She was telling me that if it had happened in Russia, she just would—she just knew she wouldn't be talking to me or anybody else. She knew they would be sent to Siberia or shot right away is what she said.
    Mr. Liebeler. Is that what Marina said?
    Mrs. Ford. That is what she said. I told her that was the big difference in Russian Government and the American Government. And then she asked me for advice. Someone contacted her at the time from a western paper and offered her $10,000 or something for the story and she asked me if it was the right thing to do because she felt she didn't want to make money on such a thing, a horrible thing as that, and I advised her to take the money because I thought she would need it for the children.
    Mr. Liebeler. Now, after that first—was there anything else you discussed in that telephone conversation?
    Mrs. Ford. No; I think we talked mostly about that book deal, about the offer she had.