When ECM Meets the
                             Semantic Web


                             20 Oct 2011 - Olivier Grisel & Stefane Fermigier



           Open Source ECM


Thursday, October 20, 2011
Business Motivations


[Two image slides. Source: Wikipedia]

The DIKW hierarchy




But every coin has
                               another side


Infobesity!




A few figures
                     • 50% more data / content / information
                             produced every year
                     • 1.8 zettabytes of data produced in 2011
                             (1 zettabyte = 1 billion terabytes)
                     • Employees are drowning in a sea of email,
                             status messages, etc., and spend on average
                             more than 6 hours per week unsuccessfully
                             searching for or recreating lost documents


A Solution: the Semantic
        Web


A Brief History of the Web
            • Web 1.0 (1990-now): web of sites and pages,
              aka the World Wide Web
            • Web 2.0 (2000-now): web of people and of
              participation, aka the Social Web (Blogs, RSS,
              tags, Facebook, Wikipedia, etc.)
            • Web 3.0 (2010-now): web of data, of meaning
              and connected knowledge, aka the Semantic
              Web




“To a computer, then, the web is a flat,
                              boring world devoid of meaning”


          Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/




“This is a pity, as in fact documents on the
                   web describe real objects and imaginary
                 concepts, and give particular relationships
                                   between them”
          Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/




“Adding semantics to the web involves two things:
       allowing documents which have information in
      machine-readable forms, and allowing links to be
            created with relationship values.”
          Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/




“The Semantic Web is not a separate Web but an
        extension of the current one, in which information
           is given well-defined meaning, better enabling
        computers and people to work in cooperation.”

          Tim Berners-Lee, James Hendler and Ora Lassila, “The Semantic Web”, Scientific American, May 2001




Means and Tools


4 stages


            • Extract meaning from raw data / content
            • Connect information to form knowledge
            • Reason about this knowledge
            • Present this knowledge in actionable form

Extracting

            • Leverage metadata embedded in or associated with
              documents (when they exist)
            • Or use machine learning, NLP (Natural Language
              Processing) and image processing algorithms to
              extract meaning from text / images
            • Examples include: named entity extraction,
              automatic categorization / tagging, sentiment
              analysis, etc.
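
As a rough illustration of the extraction step, here is a minimal Python sketch of named entity extraction. It uses a tiny hand-written dictionary (`GAZETTEER`, a hypothetical name and hypothetical data) rather than the machine learning and NLP models the slide refers to, so it only shows the shape of the output: typed mentions with their positions in the text.

```python
import re

# Hypothetical gazetteer: surface form -> entity type. A real system would
# rely on a trained statistical model instead of a fixed dictionary.
GAZETTEER = {
    "Paris": "Location",
    "Tim Berners-Lee": "Person",
    "W3C": "Organization",
}


def extract_entities(text):
    """Return (surface form, type, start offset) for each gazetteer match."""
    found = []
    for name, entity_type in GAZETTEER.items():
        # Word boundaries avoid matching inside longer words.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((name, entity_type, match.start()))
    # Report mentions in reading order.
    return sorted(found, key=lambda item: item[2])


mentions = extract_entities("Tim Berners-Lee founded the W3C and spoke in Paris.")
for name, entity_type, offset in mentions:
    print(offset, entity_type, name)
```

The typed mentions produced here are the kind of input the linking step would then match against a knowledge base.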




Interlude:
        Linked Open Data


[Figures: Linked Open Data in 2007, 2008, 2009, 2010 and 2011]

Linking

            • Many Linked Open Data repositories have been
              made available over the last 10 years
            • RDF and graph database systems are now available
              to manage this huge mass of information (billions of
              triples)
            • Match information extracted from content with
              these public (or internal) data/knowledge bases
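
The matching step can be pictured with a toy in-memory triple store. The sketch below is in Python; the triples are hand-written assumptions that merely imitate DBpedia-style identifiers, and `link` is a hypothetical helper, not part of any RDF library. It shows how the type of an extracted mention can disambiguate between candidate entities.

```python
# Toy knowledge base as (subject, predicate, object) triples. The identifiers
# imitate the DBpedia naming style, but the data is a hand-written assumption.
TRIPLES = [
    ("dbpedia:Paris", "rdfs:label", "Paris"),
    ("dbpedia:Paris", "rdf:type", "dbo:Place"),
    ("dbpedia:Paris_Hilton", "rdfs:label", "Paris Hilton"),
    ("dbpedia:Paris_Hilton", "rdf:type", "dbo:Person"),
]


def objects(subject, predicate):
    """Return every object of the triples matching subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def link(mention, expected_type):
    """Return identifiers whose label equals the mention and whose type matches."""
    candidates = [s for s, p, o in TRIPLES if p == "rdfs:label" and o == mention]
    return [uri for uri in candidates if expected_type in objects(uri, "rdf:type")]


links = link("Paris", "dbo:Place")
print(links)
```

A production setup would query an RDF store or a local index holding billions of triples instead of a Python list, but the lookup has the same structure.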




Reasoning

            • When you are working on reliable metadata (e.g.
              RDFa embedded in web pages), you can use rule /
              inference engines to infer actionable knowledge
              from your content (e.g. a shopping recommendation
              engine)
            • Rules can also be used to clean up / flag errors
              when working with unreliable (e.g. automatically
              extracted) information
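
Both uses of rules can be sketched in a few lines of Python. The facts, predicate names and the two rules below are invented for illustration; a real deployment would express them in the rule language of an inference engine rather than as Python functions.

```python
# Hypothetical facts as (subject, predicate, object) triples.
facts = {
    ("alice", "bought", "camera"),
    ("camera", "accessory", "tripod"),
    ("bob", "bornIn", "2090"),
}


def infer_recommendations(facts):
    """Rule: if X bought P and P has accessory A, then recommend A to X."""
    inferred = set()
    for buyer, predicate, product in facts:
        if predicate != "bought":
            continue
        for subject, other_predicate, accessory in facts:
            if other_predicate == "accessory" and subject == product:
                inferred.add((buyer, "recommend", accessory))
    return inferred


def flag_errors(facts, current_year=2011):
    """Rule: a birth year in the future is flagged as suspicious."""
    return {f for f in facts if f[1] == "bornIn" and int(f[2]) > current_year}


recommendations = infer_recommendations(facts)
suspicious = flag_errors(facts)
print(recommendations, suspicious)
```

The first rule derives new knowledge from trusted facts; the second only flags a doubtful fact for review, which suits automatically extracted information.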




Presenting

            • Allow the users of your system to interact with the
              knowledge thus extracted or produced, in a way
              that allows them to do their jobs better
            • A smart presentation system solves the information
              overload issue by contextualizing the information,
              i.e. presenting only information relevant to what the
              user is currently doing

R&D Projects
        Involving Nuxeo


IKS project

            • European R&D project under FP7, with 13
                    partners (6 SMEs) and an 8.5M EUR budget

            • Goal: create a semantic software “stack” that
                    will be used by CMS vendors to add semantic
                    features to their products

            • Started in Jan. 2009, will last until Dec. 2012
            • First tangible result: Apache Stanbol
                    (more about this later)




SAMAR project
            • French collaborative R&D project with 10
              partners, and a 4.5M EUR budget
            • Goal: create a platform for managing
              multimedia content in Arabic, for news agencies
              such as AFP
            • Will include: automated translation, named
              entity extraction, content classification
            • First results: integration between Nuxeo and
              Temis (more later)




State of the Art
        Semantic ECM at Nuxeo


The Semantic Engine

           • From unstructured content to Knowledge

           • Language guessing

           • Topic classification (Business, Sports, Media, ...)

           • Named Entities extraction and linking

           • Relationships and properties extraction
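
To give a feel for the simplest of these steps, language guessing, here is a minimal Python sketch based on counting common function words. The word lists and the function name `guess_language` are assumptions made for this example; they do not describe how the Semantic Engine itself detects languages.

```python
# Hypothetical stopword lists; real detectors typically use character n-gram
# statistics over many more languages.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "in", "to"},
    "fr": {"le", "la", "et", "de", "est", "les"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    tokens = text.lower().split()
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # No stopword hit at all means there is no evidence to decide on.
    return best if scores[best] > 0 else None


print(guess_language("The web is a world of meaning"))
print(guess_language("Le web est un monde de sens"))
```

Knowing the language first matters because the later steps (classification, entity extraction) generally need language-specific models.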

Demo time!



RESTful is Beautiful




Apache Stanbol = Semantic Engines (Apache OpenNLP)
               + Fast Linked Data local index (Apache Solr)
               + Semantic Rule Engine (Apache Jena)

Apache Stanbol

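[Diagram: Nuxeo DM with its addon, Stanbol engines (Engine 1, Engine 2, Engine 3), the data sources DBpedia, Freebase, Geonames and LDAP, and a boundary labeled “Local IT infrastructure (LAN)”; steps are numbered 1 to 3]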
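
Assuming the addon talks to Stanbol over HTTP, a content analysis call could look roughly like the Python sketch below. The URL (`http://localhost:8080/enhancer`) is an assumption: it presumes a Stanbol instance on the local machine, and the path has differed between Stanbol releases. The helper `build_enhancement_request` is hypothetical and only builds the request, so the example runs without a server.

```python
import urllib.request

# Assumed endpoint: a Stanbol instance on localhost. The path is an
# assumption and may differ depending on the Stanbol release.
STANBOL_ENHANCER_URL = "http://localhost:8080/enhancer"


def build_enhancement_request(text, url=STANBOL_ENHANCER_URL):
    """Build (but do not send) a POST request carrying plain text to analyze."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain", "Accept": "text/turtle"},
        method="POST",
    )


request = build_enhancement_request("Nuxeo World took place in Paris.")
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

The request body is the raw text and the `Accept` header selects the RDF serialization of the returned annotations, which keeps the client side down to a single HTTP call.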




How to build engines?



Training statistical models for NER with
        Wikipedia and DBpedia

           •       Extract sentences with link positions in Wikipedia articles

           •       DBpedia to find the type of the target entity (Person,
                   Location, Organization)

           •       Apache Pig scripts to compute the join + format the result as
                   training files for OpenNLP

           •       Apache OpenNLP to build and evaluate the models

           •       Apache Hadoop for distributed processing

           •       Apache Whirr for deployment and management on Amazon
                   EC2 cluster
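
The formatting step of this pipeline can be sketched in Python on a single sentence. The original work used Apache Pig scripts on Hadoop; the code below is a small stand-in, not that implementation. `ENTITY_TYPES` is a hypothetical stand-in for the DBpedia type lookup, and the output follows the `<START:type> ... <END>` annotation style used by OpenNLP name finder training files.

```python
# Hypothetical DBpedia type lookup: Wikipedia link target -> coarse entity type.
ENTITY_TYPES = {
    "Marie_Curie": "person",
    "Paris": "location",
}


def to_opennlp_line(sentence, links):
    """Annotate a tokenized sentence given (start, end, target) link spans.

    Links whose target has no known type are left as plain text.
    """
    parts, cursor = [], 0
    for start, end, target in sorted(links):
        entity_type = ENTITY_TYPES.get(target)
        if entity_type is None:
            continue
        parts.append(sentence[cursor:start])
        parts.append(f"<START:{entity_type}> {sentence[start:end]} <END>")
        cursor = end
    parts.append(sentence[cursor:])
    # Collapse whitespace so tokens and tags are separated by single spaces.
    return " ".join("".join(parts).split())


sentence = "Marie Curie was born in Warsaw and worked in Paris ."
links = [(0, 11, "Marie_Curie"), (24, 30, "Warsaw"), (45, 50, "Paris")]
line = to_opennlp_line(sentence, links)
print(line)
```

Because the annotations come from existing Wikipedia links rather than manual labeling, the same procedure can in principle be repeated for any language that has a large enough Wikipedia.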




Training statistical models for topic
        classification from Wikipedia and DBpedia


           •       Filter category tree from DBpedia SKOS entries (~500k)

           •       Pig scripts to compute the joins with article abstracts for all
                   the articles categorized in Wikipedia

           •       Export as 2.8GB TSV file to be indexed in Apache Solr

           •       Use Solr MoreLikeThisHandler to find the top 3 most related
                   Wikipedia categories for any kind of text

           •       Apache Whirr & Hadoop for deployment and management on
                   Amazon EC2 cluster
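
The idea of ranking categories by text similarity can be shown with a small Python sketch. It replaces Solr's MoreLikeThisHandler with a plain bag-of-words cosine similarity, which is a deliberate simplification, and `CATEGORY_TEXTS` holds invented sample text rather than real Wikipedia abstracts.

```python
import math
from collections import Counter

# Hypothetical category "documents": category name -> concatenated abstracts.
CATEGORY_TEXTS = {
    "Sports": "football match team player league goal stadium",
    "Business": "company market shares profit revenue investors",
    "Media": "newspaper television broadcast journalist news",
    "Science": "research experiment physics theory laboratory",
}


def vectorize(text):
    """Turn a text into a bag-of-words term frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term frequency vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_categories(text, k=3):
    """Return up to k category names, most similar first."""
    query = vectorize(text)
    scored = [
        (cosine(query, vectorize(doc)), name)
        for name, doc in CATEGORY_TEXTS.items()
    ]
    scored = [item for item in scored if item[0] > 0]
    return [name for _, name in sorted(scored, reverse=True)[:k]]


best = top_categories("the company reported record profit and revenue in the news")
print(best)
```

A search index adds term weighting and scales to hundreds of thousands of categories, but the principle is the same: no classifier is trained, the category whose text looks most like the input wins.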




Wrap Up on Recent Work

            • Full offline mode: Stanbol EntityHub
            • Multi-lingual Indexes
            • New UI for reviewing occurrences
            • Temis Luxid Annotation Factory integration

What’s next?

           • Stanbol and Temis connection in Admin Center

           • Embedded Stanbol mode for easy deployment

           • More OpenNLP models for more languages

           • Finalize topic classification - handle hierarchy

           • Tight integration with Nuxeo DM search features

Thank you for your attention!




