Text Mining: Beyond Extraction Towards Exploitation

TIE - Text Information Exploitation
Project Proposal for “Future and Emerging Technologies”
in the EU-IST Programme
S. Staab (1), R. Studer, Karlsruhe University
K. Markert, B. Webber, University of Edinburgh
N. Kushmerick, University College Dublin
B. Bremdal, R. Engels, Cognit a.s

http://www.aifb.uni-karlsruhe.de/~sst/Research/Projects/TextMining/

1 Abstract
Motivation: The revolutionary step from printed text to digital documents has led to
an explosive growth of knowledge available (semi-)publicly through the internet or
through community and corporate intranets. With this flood of potentially useful
information comes the urgent need to sift through it, find the golden nuggets of
information and analyze them in order to make informed decisions.
Problem: So far, work in text mining has mostly concentrated on purposes like
extracting information from texts, summarizing the relevant information, or answering
questions on texts. However, this information extraction-based vision, which has been
elaborated, e.g., in approaches for message understanding, mostly neglects the amount
and complexity of information that the user must deal with and act upon. In contrast,
the further connotations of text mining that go well beyond information extraction,
namely the aggregation and analysis of information into golden nuggets of knowledge
that may lead to informed actions, have hardly been investigated so far.
Objectives: In our project we want to go beyond information extraction towards text
information exploitation. This means we want to aggregate extracted information in
order to deduce knowledge that may not have been in the mind of the authors of the text.
For example, a decade ago there were a lot of separate medical research reports
describing symptoms of a special migraine headache, such as a lack of magnesium (among
others). However, the causal link between magnesium deficits and migraine headaches
was implicit, and sometimes it was given only indirectly via other symptoms. A manual
search of the texts found that the literature reported eight different types of direct
and indirect links between magnesium deficits and migraine headaches. This result
strongly supported the hypothesis that a lack of magnesium causes this type of migraine
headache, a hypothesis which easily proved true in subsequent medical experiments and
thus became very valuable to know. However, though all the information was in the
documents, the knowledge about the potential causal link was very hard to discover.

(1) Contact: Steffen Staab, AIFB, Karlsruhe University, D-76128 Karlsruhe,
email: staab@aifb.uni-karlsruhe.de, Tel.: +49(0)721/608 7363, Fax: +49(0)721/693717

Text mining in the sense we envision will allow for applications that handle tasks like
finding causal links described in texts (semi-)automatically. Once the text mining
application has been set up by a systems engineer, the naive user lets the system extract
information, aggregate it and discover those nuggets of knowledge that may actually help
to solve the user's problem.
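
The aggregation step behind the magnesium example can be sketched roughly as follows.
This is only an illustration under stated assumptions, not the method proposed here:
the extracted (document, cause, effect) triples, the report names and the placeholder
symptoms are all invented, and the hypothetical helper indirect_links only chains two
extracted links through one shared intermediate term.

```python
from collections import defaultdict

# Hypothetical output of an information extraction step:
# (source document, cause, effect). All names and values are invented.
EXTRACTED_LINKS = [
    ("report_a", "magnesium deficit", "symptom A"),
    ("report_b", "symptom A", "migraine"),
    ("report_c", "magnesium deficit", "symptom B"),
    ("report_d", "symptom B", "migraine"),
    ("report_e", "symptom C", "magnesium deficit"),
]


def indirect_links(links, start, goal):
    """Return every two-step chain start -> middle -> goal with its two sources."""
    effects = defaultdict(list)  # cause -> [(effect, document)]
    for doc, cause, effect in links:
        effects[cause].append((effect, doc))

    chains = []
    for middle, doc1 in effects.get(start, []):
        for end, doc2 in effects.get(middle, []):
            if end == goal:
                chains.append((middle, doc1, doc2))
    return chains


chains = indirect_links(EXTRACTED_LINKS, "magnesium deficit", "migraine")
for middle, doc1, doc2 in chains:
    print(f"magnesium deficit -> {middle} -> migraine ({doc1}, {doc2})")
```

No single report in this toy data states the link between the two end points; it only
appears once the statements of several documents are combined.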
Method:
The objective just described will be put to work in a realistic environment. This means
we must consider:
1. Real-world texts: This means we must include semi-structured information such as
given in de facto web standards, like HTML and XML. This also implies that we
must try to extract information from layout structures and from more rigid formats,
such as tables appearing in natural language texts.
2. Integrating Information Extraction Techniques: We need a broad basis for
information extraction in order to solve the information exploitation task. For this
purpose, we want to build on the experience gained with TREC- and MUC-style
approaches as well as with machine learning techniques that have been applied to
wrapping semi-structured data. This requires strong competence in the fields of
Information Retrieval and Extraction, Computational Linguistics and Machine Learning.
3. Domain Ontology: In order to go beyond “simple” phrase extraction, we need a
domain ontology that acts as a semantic mediator between different information
extraction methods, that allows for knowledge discovery at different levels of
granularity and that allows for mappings between different terminologies.
4. Text mining as a semi-automatic process: We consider text mining a semi-automatic
process that is designed and set up with a particular application and particular topics
in mind.
The design involves the construction of a domain ontology and a domain lexicon,
the formulation and/or learning of interesting structures with computational
linguistics and/or information retrieval techniques and the exploration of the
corresponding results. Once the domain-specific text mining application is set up, the
naive user may run it to extract information and, in particular, to find associations
and rules that were not present in the original texts, but that could only be found by
considering, integrating and comparing various text sources.
This approach parallels the development in data mining, where the utopia of a fully
automatic knowledge discovery process has matured with great success into an
engineering approach towards this problem.
5. Knowledge Discovery: Finally, we actually need to apply machine learning
techniques to aggregate and analyse extracted information, yielding “golden nuggets of
knowledge”.
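
To make items 1 and 3 of the list above more concrete, the following sketch reads a
rigid layout structure (an HTML table) and maps its column headers onto shared concepts,
so that facts from sources using different terminologies become comparable. It is an
assumption-laden illustration: the term-to-concept mapping, the class TableRows, the
function table_to_facts and the sample tables are hypothetical, and the sketch assumes
that the first row of each table is a header row.

```python
from html.parser import HTMLParser

# Hypothetical ontology fragment: surface terms mapped to one shared concept.
TERM_TO_CONCEPT = {
    "revenue": "Revenue",
    "turnover": "Revenue",
    "sales": "Revenue",
    "employees": "Headcount",
    "staff": "Headcount",
}


class TableRows(HTMLParser):
    """Collect the cell texts of every table row in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_facts(html):
    """Turn a header row plus data rows into concept-keyed dictionaries."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    header, *data = parser.rows  # assumes at least one (header) row
    concepts = [TERM_TO_CONCEPT.get(h.lower(), h) for h in header]
    return [dict(zip(concepts, row)) for row in data]


# Two invented report fragments that name the same concept differently.
HTML_A = ("<table><tr><th>Year</th><th>Turnover</th></tr>"
          "<tr><td>1999</td><td>120</td></tr></table>")
HTML_B = ("<table><tr><th>Year</th><th>Sales</th></tr>"
          "<tr><td>1999</td><td>95</td></tr></table>")

facts_a = table_to_facts(HTML_A)
facts_b = table_to_facts(HTML_B)
print(facts_a, facts_b)
```

Both fragments end up with a "Revenue" entry, which is the mediating role the domain
ontology is meant to play between different extraction results.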
Research Issues:
Open research issues in this field are manifold, e.g.:
1. Extracting the semantics of layout with computational linguistics, aligning
semi-structured data with the corresponding ontology information
2. Aligning several information extraction techniques (TREC-style) with

Integrating techniques: ontology, machine learning, information retrieval and
extraction, computational linguistics (learning with ontologies, inducing ontologies from
computational linguistics and information extraction techniques, aligning wrapper
induction with ontologies, applying information extraction measures to the syntactic
and semantic level,
Scenario: As an interesting case study we choose the mining of annual business reports
and analysts' reports that comment on companies from a particular area (e.g.,
telecommunication). This scenario is very appropriate, because
1. It allows the observation of competitors and the detection of trends that are
extremely important for decision makers, such as trends in organizational structures or
in markets and products.
2. The understanding of these texts cannot be performed in isolation. Rather, the
knowledge that needs to be found is mostly available in the annual changes that take
place and in the comparisons between companies in the same trade.
3. The setting is well enough observed and understood by professionals for the
techniques we develop to be verified.
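
Point 2 of the scenario, that the interesting knowledge lies in annual changes and in
comparisons between companies, might be sketched as below. All company names, years and
figures are invented, and the hypothetical function yearly_changes assumes that facts
have already been extracted and normalized to (company, year, concept, value) tuples
with non-zero values.

```python
# Hypothetical facts extracted from annual reports:
# (company, year, concept, value). All entries are invented.
FACTS = [
    ("AlphaTel", 1998, "Headcount", 1000),
    ("AlphaTel", 1999, "Headcount", 1200),
    ("BetaCom", 1998, "Headcount", 800),
    ("BetaCom", 1999, "Headcount", 760),
]


def yearly_changes(facts, concept):
    """Return {company: [(year, relative change vs. the previous reported year)]}."""
    series = {}
    for company, year, fact_concept, value in facts:
        if fact_concept == concept:
            series.setdefault(company, {})[year] = value

    changes = {}
    for company, by_year in series.items():
        years = sorted(by_year)
        changes[company] = [
            (year, (by_year[year] - by_year[prev]) / by_year[prev])
            for prev, year in zip(years, years[1:])
        ]
    return changes


changes = yearly_changes(FACTS, "Headcount")
for company, deltas in changes.items():
    for year, delta in deltas:
        print(f"{company} {year}: {delta:+.1%}")
```

Neither report on its own says that one company is growing while its competitor is
shrinking; that statement only emerges from the comparison across reports and years.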

2 Chances for Europe

Semi-automatic text mining offers opportunities on several levels:

1 Informed Decisions: Results from our project may deliver critical information to
European businesses, thus keeping them competitive and able to react quickly to new
trends and possibilities.

2 Individual Learning: The more time the individual may spend on understanding
interconnections and the less time she spends searching for information and testing
hypotheses, the more she profits from the information technology that is now at hand.

3 Research: Though our scenario develops a particular business case, many research
issues may profit from semi-automatic text mining, too. Indeed, research hypotheses
may be easier to (pre-)test or even to generate.

All these factors are critical to developing the high potential of Europeans, and for Europeans.
Informed decisions, faster learning and improved research all work together in keeping
Europe competitive.

4 Partner Profile

We consider text mining to be a knowledge acquisition process that should be
facilitated by learning approaches and by the techniques found in information retrieval
and computational linguistics. Hence, the consortium includes people from these
different communities:

Prof. Dr. Studer holds a chair for knowledge management at Karlsruhe University. He has
carried out research and organized numerous activities in the fields of knowledge
acquisition, knowledge management and data mining for over 20 years.

Dr. Steffen Staab is a senior researcher and lecturer at Karlsruhe University. His research
interests include knowledge management, ontology engineering, information extraction,
and data mining. He is now project manager for Karlsruhe in the project GETESS
(http://www.getess.de), which aims at an information extraction system for the tourism
domain and which is funded by the German government.

Prof. Dr. Bonnie Webber...

Dr. Katja Markert....

Dr. Nicholas Kushmerick is College Lecturer in the Department of Computer Science,
University College Dublin, Ireland. Dr. Kushmerick received his Ph.D. in 1997 at the
University of Washington, and his dissertation was nominated for the ACM Distinguished
Dissertation award. Dr. Kushmerick has worked in the areas of planning, machine
learning, and information extraction, integration, and retrieval. His work has
been published in several international journals, and he has been on the organizing
committee of numerous conferences and workshops. Dr. Kushmerick’s current work focuses
on the use of machine learning to scale up knowledge engineering on the Internet, in
service of problems such as information extraction and designing intelligent browsing
assistants.

Dr. Bernt Arild Bremdal: Studied Marine Technology in Trondheim, Norway. After
finishing his MSc at the NTNU he wrote his PhD at the same university. He received his
PhD, on the application of artificial intelligence, rule-based and object-oriented
programming in project planning, in 1988. After being affiliated with a variety of
companies, he co-founded CognIT a.s, which he directs. Author of more than 50 articles
and published reports on computer applications in engineering and industry, design and
planning, object-oriented technology and artificial intelligence. Most recent
publication is Braunschweig and Bremdal, “AI in the Petroleum Industry.” Volume 2.
Edition Technip 1996.

Dr. Robert Engels: Studied Artificial Intelligence, Psychology and (partly) Computer
Science at the University of Amsterdam, NL. He conducted his MSc thesis on
applications of Inductive Logic Programming in Stockholm, Sweden. In 1999 he received
his PhD from the University of Karlsruhe for research conducted in the area of Knowledge
Discovery and Data Mining. He has (co-)authored a variety of papers and organised several
international and national (German) workshops on practical applications of Data Mining.
Currently he is affiliated with CognIT as a senior systems architect.

The work packages would be split along the following lines (bold face indicates
leadership for a particular work package):

Work package areas (table columns): Knowledge Acquisition | Computational Linguistics |
Machine Learning | Information Retrieval

Univ. Karlsruhe: Ontology acquisition | Mining Information
Univ. Edinburgh: Information Extraction with Layout
Univ. College Dublin: Wrappers with Ontologies; Mining Information | Indexing and
querying structured documents
Cognit: Ontology induction | Understanding XML Texts

5 Partner Addresses
Dr. Steffen Staab, Prof. Dr. Rudi Studer
Institute for Applied Computer Science and Formal Description Methods (AIFB),
Karlsruhe University, D-76128 Karlsruhe, Germany
http://www.aifb.uni-karlsruhe.de/WBS
mailto:staab@aifb.uni-karlsruhe.de,studer@aifb.uni-karlsruhe.de

Dr. Katja Markert, Prof. Dr. Bonnie Webber
Division of Informatics, University of Edinburgh, 80 South Bridge
Edinburgh EH1 1HN, Scotland
http://www.informatics.ed.ac.uk/research/irr/
mailto:markert@cogsci.ed.ac.uk,bonnie@dai.ed.ac.uk

Dr. Nicholas Kushmerick
Department of Computer Science, University College Dublin, Dublin 4, Ireland
http://www.cs.ucd.ie/staff/nick/
mailto:nick@ucd.ie

Dr. Robert Engels, Dr. Bernt Bremdal
Cognit a.s, P.B. 610, N-1754 Halden, Norway
http://www.cognit.no/
mailto:robert.engels@cognit.no,bernt.bremdal@cognit.no
