Automated Information Retrieval
   and Text Categorization:
    The RIKS Demonstrator

           Acknowledge final event
             November 25, 2008
     Marie-Francine Moens, Erik Boiy, Javier Arias (HMDB-LIIR)
                      Saskia Debergh (i.Know)
         Philippe De Lombaerde, Birger Fühne (UNU-CRIS)




                     Overview
• UNU-CRIS: The RIKS Demonstrator
• K.U.Leuven:
    – Content extraction from multilingual Web pages
    – Text categorization: machine learning approach
    – Search engine and indexing infrastructure
    – Interfacing the Acknowledge platform
• i.Know:
    – Information forensics


                       Acknowledge 25-11-2008




The RIKS Demonstrator
• United Nations University – Comparative Regional
  Integration Studies (UNU-CRIS)
• Issues addressed in research and capacity building:
   – (i) emergence of regional (= supra-national)
     governance level
   – (ii) linkages with other governance levels (national,
     global/UN)
   – (iii) building of regional institutions
   – (iv) growing regional interdependence, etc.
• RIKS = Regional Integration Knowledge System
  (UNU-CRIS and GARNET NoE)




The RIKS Demonstrator
Issues addressed in the demonstrator:
  How to automate the retrieval and processing
  (cleaning, search, categorization,
  presentation) of particular types of relevant
  information in an e-learning environment?
   – ‘News’: short texts, various formats, dynamic
     collection, short life cycle, role of news in the
     e-learning application
   – ‘Documentation’: heterogeneous texts (scientific
     articles, theses, essays, ...), rather static collection
   – Treaty texts: long and complex texts, static
     collection, issue of accessibility




[Figure: RIKS example output]




Demo








K.U.Leuven: Content extraction
 from multilingual Web pages

• Goal: extract the main content from a Web page and
  remove extraneous data (navigation menus,
  advertisements, etc.)
• Requirements of the tool:
   – Accurate
   – Generic
   – Multilingual
   – Fast
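
The slides do not describe the extraction algorithm itself, so the following is only a minimal sketch of the task, using a simple link-density heuristic as a stand-in and not the method of Arias et al. It splits a page into text blocks and keeps blocks that are long enough and contain little link text. The names `BlockCollector` and `extract_main_content`, the tag sets, and the thresholds `min_chars` and `max_link_density` are all hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; real pages need a broader list.
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count the link characters in each."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, link_chars) tuples
        self._text = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, normalising whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style content
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_main_content(html, min_chars=40, max_link_density=0.3):
    """Keep blocks with enough text and a low share of link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser._flush()
    kept = [text for text, link_chars in parser.blocks
            if len(text) >= min_chars
            and link_chars / len(text) <= max_link_density]
    return "\n".join(kept)


page = """<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About us and contact</a></div>
<p>The Agreement will be applied with the European Union and with the EFTA states.</p>
<script>var x = 1;</script>
<div class="ad"><a href="http://example.com">Buy now, great offers available today only!</a></div>
</body></html>"""
main_text = extract_main_content(page)
```

The heuristic uses no language-specific resources, which fits the "generic" and "multilingual" requirements, and it makes a single pass over the page. Its accuracy on real pages would depend heavily on the thresholds.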

[Figure: Arias et al., submitted]

[Figure: Arias et al., submitted; [5] = Gottron 2008]




 K.U.Leuven: Text categorization
• Heterogeneous documentation and Google News
  classified into 27 categories (e.g., trade, poverty, ...)
• Supervised classifiers: Multinomial Naïve Bayes, Support
  Vector Machine, ...
• Features:
   – unigrams, bigrams, feature item sets, ...
• Additional feature selection:
   – Chi-square, Information Gain, Linear Classifier
      Weights, Orthogonal Centroid Feature Selection
• Different test setups
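
To make the classifier and feature bullets concrete, here is a minimal sketch of a Multinomial Naïve Bayes classifier over unigram and bigram features. It is not the K.U.Leuven implementation: the whitespace tokenizer, the add-one (Laplace) smoothing, the class name `ToyMultinomialNB`, and the four training sentences are assumptions made for illustration. Only the two category names, trade and poverty, come from the slide.

```python
import math
from collections import Counter, defaultdict


def features(text, use_bigrams=True):
    """Lowercased unigrams, optionally extended with word bigrams."""
    tokens = text.lower().split()
    feats = list(tokens)
    if use_bigrams:
        feats += [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    return feats


class ToyMultinomialNB:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)          # documents per class
        self.feat_counts = defaultdict(Counter)    # feature counts per class
        for text, label in zip(texts, labels):
            self.feat_counts[label].update(features(text))
        self.vocab = {f for counts in self.feat_counts.values() for f in counts}
        return self

    def predict(self, text):
        n_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_docs.items():
            counts = self.feat_counts[label]
            total = sum(counts.values())
            score = math.log(doc_count / n_docs)   # log prior
            for f in features(text):
                if f in self.vocab:                # unseen features are ignored
                    score += math.log((counts[f] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data for two of the 27 categories.
train_texts = [
    "tariff cuts boost regional trade agreement",
    "free trade agreement lowers export tariffs",
    "poverty reduction programme targets rural households",
    "income inequality and poverty rise in rural regions",
]
train_labels = ["trade", "trade", "poverty", "poverty"]

model = ToyMultinomialNB().fit(train_texts, train_labels)
prediction = model.predict("new tariffs threaten the trade agreement")
```

A feature selection step such as Chi-square would sit between feature extraction and `fit`, keeping only the features most associated with a category. It is omitted here to keep the sketch short.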




[Figure: K.U.Leuven: Text categorization]

[Figure: RIKS, K.U.Leuven: search engine]

Demo




Weten dat je niet weet wat je zou moeten weten
(Knowing that you do not know what you ought to know)

          1. Information Forensics ‐ Smart Indexing
               – more than just an index
               – distinguishes between concepts and relations
               – starts from unstructured text (bottom‐up instead of top‐down)
               – recognises word groups as meaningful units

          Top‐down:   knowledge → keywords → text
          Bottom‐up:  text → concepts and relations → knowledge
           © i.Know NV ‐ All rights reserved.





      1. Information Forensics – Smart Indexing

      Example sentence: De Fortis Bank werd overgenomen door BNP Paribas.
      (English: Fortis Bank was taken over by BNP Paribas.)

      Traditional indexing (keywords):

      Processing steps: stopwords, stemming, calculation, correlation

      Keyword Index
      Keyword       Weight
      Fortis        0.23
      Bank          0.38
      werd          0.08
      overgenomen   0.21
      door          0.12
      BNP           0.34
      Paribas       0.27
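
The keyword pipeline on this slide can be sketched as follows. This is a rough illustration under stated assumptions, not the indexer behind the slide: the stopword list is a toy list of Dutch articles, stemming and correlation are left out, and the weight is a plain relative term frequency, so the numbers will not match the weights shown above. The function name `keyword_index` is hypothetical.

```python
import re
from collections import Counter

# Assumed toy stopword list (Dutch articles only).
STOPWORDS = {"de", "het", "een"}


def keyword_index(text, stopwords=STOPWORDS):
    """Map each non-stopword token to its relative frequency in the text."""
    tokens = re.findall(r"\w+", text)
    kept = [t for t in tokens if t.lower() not in stopwords]
    counts = Counter(kept)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


index = keyword_index("De Fortis Bank werd overgenomen door BNP Paribas.")
for term, weight in index.items():
    print(f"{term:<12} {weight:.2f}")
```

Each word becomes a separate entry, so "Fortis" and "Bank" are indexed independently and the link between the two banks is lost. This is the limitation that Smart Indexing is meant to address.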





     1. Information Forensics – Smart Indexing

     Example sentence: De Fortis Bank werd overgenomen door BNP Paribas.

     Smart Indexing (concepts and relations):

     Processing steps: relation detection, concept detection

     Smart Index
     Type       Entry
     Concept    Fortis Bank
     Relation   werd overgenomen door
     Concept    BNP Paribas
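
The slides do not explain how i.Know detects concepts and relations, so the following toy sketch only mimics the output format for this one sentence. It assumes a hand-written lexicon of relation words and a short article list, both invented for illustration. Consecutive words of the same type are merged into one word group. The names `RELATION_WORDS`, `ARTICLES`, and `smart_index` are hypothetical.

```python
# Assumed toy lexicons; a real system would not rely on fixed word lists.
RELATION_WORDS = {"werd", "overgenomen", "door"}
ARTICLES = {"de", "het", "een"}


def smart_index(sentence):
    """Group consecutive words into Concept and Relation entries."""
    entries = []
    for token in sentence.rstrip(".").split():
        word = token.lower()
        if word in ARTICLES:
            continue  # articles are dropped from the index
        kind = "Relation" if word in RELATION_WORDS else "Concept"
        if entries and entries[-1][0] == kind:
            # Same type as the previous word: extend the word group.
            entries[-1] = (kind, entries[-1][1] + " " + token)
        else:
            entries.append((kind, token))
    return entries


smart_entries = smart_index("De Fortis Bank werd overgenomen door BNP Paribas.")
for kind, entry in smart_entries:
    print(f"{kind:<10} {entry}")
```

Unlike the keyword index, the word groups "Fortis Bank" and "BNP Paribas" stay intact, and the relation between them is kept as its own entry.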





     2. Categorisation based on Smart Indexing
 Preconditions:
     – Pre‐defined taxonomy/ontology
     – Top‐down processing

 Advantages of Smart Indexing:
 Smart Indexing results can be used to fill and enrich the taxonomy, thus ensuring
 that the entries are
     – relevant
     – precise
     – complete




  2. Categorisation

Input:
   The Agreement will be applied with the European Union and with the EFTA states.

Smart Indexing (concepts and relations):
   The | Agreement | will be applied with | the | European Union | and with | the | EFTA states.

Categorisation:
   European Union → EU
   EFTA states → EFTA
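
The categorisation step can be sketched as a lookup of detected concepts in a pre-defined taxonomy. This is a minimal illustration, assuming the concepts have already been extracted. The taxonomy entries, the `TAXONOMY` dictionary, and the function name `categorise` are hypothetical. Only the two category labels, EU and EFTA, come from the slide.

```python
# Assumed toy taxonomy: category label -> lowercased concept entries.
TAXONOMY = {
    "EU": {"european union"},
    "EFTA": {"efta states", "efta"},
}


def categorise(concepts, taxonomy=TAXONOMY):
    """Return the categories whose entries match a detected concept."""
    found = []
    for concept in concepts:
        for category, entries in taxonomy.items():
            if concept.lower() in entries and category not in found:
                found.append(category)
    return found


# Concepts taken from the example sentence on the slide.
concepts = ["Agreement", "European Union", "EFTA states"]
categories = categorise(concepts)
```

Because matching works on whole concepts rather than single keywords, "European Union" maps to EU as one unit. New concepts found by Smart Indexing could be added to the taxonomy entries to enrich it.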

[Figures: RIKS, i.Know: news categorization]

Demo




Thank you




