Text Mining:
     Beyond Extraction Towards Exploitation
         TIE - Text Information Exploitation
    Project Proposal for „Future and Emerging Technologies“
                   in the EU-IST Programme
           S. Staab1, R. Studer                              Karlsruhe University
          K. Markert, B. Webber                             University of Edinburgh
             N. Kushmerick                                 University College Dublin
          B. Bremdal, R. Engels                                   Cognit a.s

        http://www.aifb.uni-karlsruhe.de/~sst/Research/Projects/TextMining/


1 Abstract
Motivation: The revolutionary step from printed text to digital documents has led to
an explosive growth of knowledge available (semi-)publicly through the internet or
through community and corporate intranets. With this flood of potentially useful
information comes the urgent need to sift through it, find the golden nuggets of
information and analyze them in order to make informed decisions.
Problem: So far, work in text mining has mostly concentrated on tasks such as
extracting information from texts, summarizing the relevant information, or answering
questions on texts. However, this information extraction-based vision, which has been
elaborated, e.g., in approaches for message understanding, mostly neglects the amount
and complexity of information that the user must deal with and act upon. In contrast,
the further connotations of text mining that go well beyond information extraction,
namely the aggregation and analysis of information into golden nuggets of knowledge
that may lead to informed actions, have hardly been investigated so far.
Objectives: In our project we want to go beyond information extraction towards text in-
formation exploitation. This means we want to aggregate extracted information in order
to deduce knowledge that may not have been in the mind of the authors of the text.
For example, a decade ago there were many separate medical research reports
describing symptoms of a particular type of migraine headache, such as a lack of
magnesium (among others). However, the causal link between magnesium deficits and
migraine headaches was implicit, and sometimes it was given only indirectly via other
symptoms. A manual literature search found that the literature reported eight
different types of direct and indirect links between magnesium deficits and migraine
headaches. This result strongly supported the hypothesis that a lack of magnesium
causes this type of migraine headache, a hypothesis which easily proved true in
subsequent medical experiments and thus became very valuable to know. However,
though all the information was in the documents, the knowledge about the potential
causal link was very hard to discover.

1 Contact: Steffen Staab, AIFB, Karlsruhe University, D-76128 Karlsruhe, email:
  staab@aifb.uni-karlsruhe.de, Tel.: +49(0)721/608 7363, Fax.: +49(0)721/ 693717

Text mining in the sense we envision will allow for applications that handle tasks like
finding causal links described in texts (semi-)automatically. Once the text mining
application has been set up by a systems engineer, the naive user lets the system
extract information, aggregate it and discover those nuggets of knowledge that may
actually help him or her to solve a problem.
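
The aggregation step behind the magnesium example can be sketched in a few lines.
The following is only an illustration under stated assumptions, not the method of the
project: it assumes an extraction step has already produced (document, cause, effect)
triples, and it looks for intermediate terms that connect a start term to a goal term
across different documents. The names RELATIONS and indirect_links, the report
identifiers and the "finding A/B/C" terms are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical output of an information extraction step: each tuple is
# (source document, cause, effect). All names are placeholders.
RELATIONS = [
    ("report-1", "magnesium deficit", "finding A"),
    ("report-2", "finding A", "migraine"),
    ("report-3", "magnesium deficit", "finding B"),
    ("report-4", "finding B", "migraine"),
    ("report-5", "finding C", "migraine"),
]


def indirect_links(relations, start, goal):
    """Map each intermediate term to the documents supporting both hops."""
    effects = defaultdict(set)  # cause -> {(effect, document)}
    for doc, cause, effect in relations:
        effects[cause].add((effect, doc))

    links = {}
    for mid, doc_first in effects.get(start, ()):
        for end, doc_second in effects.get(mid, ()):
            if end == goal:
                # Collect the evidence for start -> mid and mid -> goal separately.
                first, second = links.setdefault(mid, (set(), set()))
                first.add(doc_first)
                second.add(doc_second)
    return links


if __name__ == "__main__":
    found = indirect_links(RELATIONS, "magnesium deficit", "migraine")
    for mid, (first, second) in sorted(found.items()):
        print(mid, sorted(first), sorted(second))
```

No single placeholder report states the link between the start and the goal term; it
only emerges once the triples from several reports are combined, which is the point of
the example above.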
Method:
The objective just described will be put to work in a realistic environment. This means
we must consider:
1.   Real-world texts: This means we must include semi-structured information such as
     given in de facto web standards, like HTML and XML. This also implies that we
     must try to extract information from layout structures and from more rigid formats,
     such as tables appearing in natural language texts.
2.   Integrating Information Extraction Techniques: We need a broad basis for
     information extraction in order to solve the information exploitation task. For this
     purpose, we want to build on the experience gained with TREC- and MUC-style
     approaches as well as with machine learning techniques that have been applied
     for wrapping semi-structured data. This requires strong competence in the fields of
     Information Retrieval and Extraction, Computational Linguistics and Machine
     Learning.
3.   Domain Ontology: In order to go beyond „simple“ phrase extraction, we need a do-
     main ontology that acts as a semantic mediator between different information extrac-
     tion methods, that allows for knowledge discovery at different levels of granularity
     and that allows for mappings between different terminologies.
4.   Text mining as a semi-automatic process: We consider text mining a semi-automatic
     process that is designed and set up with a particular application and particular topics
     in mind.
     The design involves the construction of a domain ontology and a domain lexicon,
     the formulation and/or learning of interesting structures with computational
     linguistics and/or information retrieval techniques, and the exploration of the
     corresponding results. Once the domain-specific text mining application is set up,
     the naive user may run it to extract information and, in particular, to find
     associations and rules that were not present in the original texts but could only be
     found by considering, integrating and comparing various text sources.
     This approach parallels the development in data mining where the utopia of a fully
     automatic knowledge discovery process has matured with great success into an engi-
     neering approach towards this problem.
5.   Knowledge Discovery: Finally, we actually need to apply machine learning techni-
     ques to aggregate and analyse extracted information yielding „golden nuggets of
     knowledge“.
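
How a domain ontology might act as a mediator (point 3) before aggregation (point 5)
can be illustrated with a minimal sketch. It assumes that two different extraction
methods deliver surface terms in different terminologies, maps them to shared concepts,
and counts mentions at a chosen level of granularity. The mini-ontology, the concept
names and the functions generalize and aggregate are hypothetical and serve
illustration only.

```python
from collections import Counter

# Hypothetical mini-ontology: surface term -> concept, and concept -> parent.
TERM_TO_CONCEPT = {
    "mobile phone": "MobileHandset",
    "cell phone": "MobileHandset",
    "handset": "MobileHandset",
    "dsl line": "BroadbandAccess",
}
PARENT = {
    "MobileHandset": "Product",
    "BroadbandAccess": "Service",
    "Product": "Offering",
    "Service": "Offering",
}


def generalize(concept, levels):
    """Walk up the is-a hierarchy by at most `levels` steps."""
    for _ in range(levels):
        concept = PARENT.get(concept, concept)
    return concept


def aggregate(mentions, levels=0):
    """Count concept mentions after mapping terms through the ontology."""
    counts = Counter()
    for term in mentions:
        concept = TERM_TO_CONCEPT.get(term.lower())
        if concept is not None:  # terms unknown to the ontology are skipped
            counts[generalize(concept, levels)] += 1
    return counts


if __name__ == "__main__":
    from_wrapper = ["Mobile phone", "DSL line"]           # e.g. from an HTML table
    from_parser = ["cell phone", "handset", "widget"]     # e.g. from running text
    mentions = from_wrapper + from_parser
    print(aggregate(mentions))            # fine-grained concepts
    print(aggregate(mentions, levels=1))  # one level more general
```

Raising levels moves the counts up the hierarchy, so the same extracted material can be
inspected at different levels of granularity without rerunning the extraction.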
Research Issues:
Open research issues in this field are manifold, e.g.:
1. Extracting the semantics of layout with computational linguistics, aligning semi-
   structured data with the corresponding ontology information
2. Aligning several information extraction techniques (TREC-style) with
Integrating techniques: ontology, machine learning, information retrieval and extrac-
     tion, computational linguistics (learning with ontologies, inducing ontologies from
     computational linguistics and information extraction techniques, aligning wrapper
     induction with ontologies, applying information extraction measures to the syntactic
     and semantic level,
Scenario: As an interesting case study we choose the mining of annual business reports
and analysts' reports that comment on companies from a particular sector (e.g.,
telecommunications). This scenario is very appropriate, because
1.   It allows the observation of competitors and the detection of trends that are
     extremely important for decision makers, such as trends in organizational
     structures or in markets and products.
2.   These texts cannot be understood in isolation. Rather, the knowledge that needs
     to be found lies mostly in the annual changes that take place and in the
     comparisons between companies in the same trade.
3.   The setting is sufficiently well observed and understood by professionals to
     allow verification of the techniques we develop.
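
The second point, that the knowledge lies in annual changes and cross-company
comparisons rather than in any single report, can be made concrete with a small
sketch. It assumes that figures have already been extracted from the reports into
(company, year, metric) entries; the company names, the values and the function
yearly_changes are hypothetical placeholders, not project data.

```python
# Hypothetical facts extracted from annual reports:
# (company, year, metric) -> value. All values are made up for illustration.
FACTS = {
    ("Company A", 1998, "employees"): 1000,
    ("Company A", 1999, "employees"): 1200,
    ("Company B", 1998, "employees"): 800,
    ("Company B", 1999, "employees"): 760,
}


def yearly_changes(facts, metric):
    """Relative change per company between consecutive reported years."""
    by_company = {}
    for (company, year, name), value in facts.items():
        if name == metric:
            by_company.setdefault(company, {})[year] = value

    changes = {}
    for company, series in by_company.items():
        years = sorted(series)
        for prev, curr in zip(years, years[1:]):
            if series[prev]:  # skip a zero base to avoid division by zero
                changes[(company, curr)] = (series[curr] - series[prev]) / series[prev]
    return changes


if __name__ == "__main__":
    for (company, year), change in sorted(yearly_changes(FACTS, "employees").items()):
        print(f"{company} {year}: {change:+.1%}")
```

Neither placeholder report on its own shows a trend; the growth of one company and the
contraction of the other only become visible once reports are aligned by year and
compared within the same trade.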


2 Chances for Europe

Semi-automatic text mining opens up opportunities on several levels:

1    Informed Decisions: Results from our project may deliver critical information to
     European businesses, keeping them competitive and allowing them to react quickly
     to new trends and possibilities.

2    Individual Learning: The more time an individual can spend on understanding
     interconnections, and the less time she spends searching for information and
     testing hypotheses, the more she profits from the information technology that is
     now at hand.

3    Research: Though our scenario develops a particular business case, many research
     issues may profit from semi-automatic text mining, too. Indeed, research hypotheses
     may become easier to (pre-)test or even to generate.

All these factors are critical for developing high potential among Europeans and for
Europeans. Informed decisions, faster learning and improved research all work together
in keeping Europe competitive.


4 Partner Profile

We consider text mining to be a knowledge acquisition process that should be
facilitated by learning approaches and by the techniques found in information retrieval
and computational linguistics. Hence, the consortium includes people from these
different communities:
Prof. Dr. Studer holds a chair for knowledge management at Karlsruhe University. He has
carried out research and organized numerous activities in the fields of knowledge
acquisition, knowledge management and data mining for over 20 years.

Dr. Steffen Staab is a senior researcher and lecturer at Karlsruhe University. His research
interests include knowledge management, ontology engineering, information extraction,
and data mining. He is now project manager for Karlsruhe in the project GETESS
(http://www.getess.de), which aims at an information extraction system for the tourism
domain and which is funded by the German government.

Prof. Dr. Bonnie Webber...

Dr. Katja Markert....

Dr. Nicholas Kushmerick is College Lecturer in the Department of Computer Science,
University College Dublin, Ireland. Dr. Kushmerick received his Ph.D. in 1997 at the
University of Washington, and his dissertation was nominated for the ACM Distinguished
Dissertation award. Dr. Kushmerick has worked in the areas of planning, machine
learning, and information extraction, integration, and retrieval. His work has
been published in several international journals, and he has been on the organizing
committee of numerous conferences and workshops. Dr. Kushmerick’s current work focuses
on the use of machine learning to scale up knowledge engineering on the Internet, in
service of problems such as information extraction and designing intelligent browsing
assistants.

Dr. Bernt Arild Bremdal studied Marine Technology in Trondheim, Norway. After
finishing his MSc at the NTNU he wrote his PhD at the same university. He received his
PhD in 1988, on the application of artificial intelligence, rule-based and object-oriented
programming in project planning. After being affiliated with a variety of companies, he
co-founded CognIT a.s, which he now directs. He is the author of more than 50 articles
and published reports on computer applications in engineering and industry, design and
planning, object-oriented technology and artificial intelligence. His most recent
publication is Braunschweig and Bremdal, “AI in the Petroleum Industry.” Volume 2.
Edition Technip 1996.

Dr. Robert Engels studied Artificial Intelligence, Psychology and (partly) Computer
Science at the University of Amsterdam, NL. He carried out his MSc thesis on
applications of Inductive Logic Programming in Stockholm, Sweden. In 1999 he received
his PhD from the University of Karlsruhe for research conducted in the area of Knowledge
Discovery and Data Mining. He has (co-)authored a variety of papers and organised
several international and national (German) workshops on practical applications of Data
Mining. He is currently affiliated with CognIT as a senior systems architect.
The work packages would be split along the following lines (bold face indicates
leadership for a particular work package):

                      Knowledge             Computational         Machine                   Information
                      Acquisition           Linguistics           Learning                  Retrieval
Univ. Karlsruhe       Ontology acquisition                        Mining Information
Univ. Edinburgh                             Information
                                            Extraction with
                                            Layout
Univ. College Dublin                                              Wrappers with             Indexing and querying
                                                                  Ontologies;               structured documents
                                                                  Mining Information
Cognit                Ontology induction                                                    Understanding XML Texts




5 Partner Addresses
Dr. Steffen Staab, Prof. Dr. Rudi Studer
     Institute for Applied Computer Science and Formal Description Methods (AIFB),
     Karlsruhe University, D-76128 Karlsruhe, Germany
     http://www.aifb.uni-karlsruhe.de/WBS
     mailto:staab@aifb.uni-karlsruhe.de,studer@aifb.uni-karlsruhe.de


Dr. Katja Markert, Prof. Dr. Bonnie Webber
     Division of Informatics, University of Edinburgh, 80 South Bridge
     Edinburgh EH1 1HN, Scotland
     http://www.informatics.ed.ac.uk/research/irr/
     mailto:markert@cogsci.ed.ac.uk,bonnie@dai.ed.ac.uk


Dr. Nicholas Kushmerick
     Department of Computer Science, University College Dublin, Dublin 4, Ireland
     http://www.cs.ucd.ie/staff/nick/
     mailto:nick@ucd.ie




Dr. Robert Engels, Dr. Bernt Bremdal
     Cognit a.s, P.B. 610, N-1754 Halden, Norway
     http://www.cognit.no/
     mailto:robert.engels@cognit.no,bernt.bremdal@cognit.no
