Text Analytics in Enterprise Search
         Daniel Ling (Findwise)
What will I cover?
   ▪ Intro
   ▪ About Text Analytics
   ▪ Benefits and possibilities
   ▪ Examples
   ▪ Solution Techniques to Examples
   ▪ Conclusions




My Background
   ▪ Daniel Ling
   ▪ Findwise
   ▪ Enterprise Search and Findability Consultant
   ▪ Experience and expertise
      • 5+ years of enterprise search experience
      • 20+ enterprise search implementations across a range of industries
      • Lucene, FAST ESP, Solr
      • Apache Solr is my primary search platform
      • Focus areas include Findability, Search Architecture and
        Implementation, Text Analytics, and Document Processing.




About Text Analytics




Text Analytics in the Enterprise
Challenges:
 ▪ 80% of data in the enterprise is unstructured.
 ▪ Reduce the time spent looking for information (currently 9.6 hours per week)
 ▪ Reduce the time spent reading documents / e-mails (currently 14.5 hours per
   week)

Benefits:
 ▪ More predictable scale and domain
 ▪ Well-understood domain
 ▪ Supporting content for analytics can be identified




Text Analytics
The definition


   A set of linguistic, statistical and machine learning techniques
   used to model and structure the information content of textual
   sources.

      - Wikipedia.org




Types of Applications


•   Entity Extraction
•   Document Categorization
•   Sentiment Analysis
•   Summarization




Frameworks and Techniques


Framework                          Techniques

Solr                               Statistics, linguistics
Mallet, Classifier4j, etc.         Statistical natural language processing
Mahout (Hadoop)                    Machine learning, statistics
GATE                               General language processing framework
UIMA                               Content analytics, text mining, pipeline
OpenNLP                            Machine learning toolkit for NLP

Benefits and possibilities





 ▪ Text analytics can bring some structure to unstructured content
 ▪ Enhance discovery and findability of content
   • Works well together with search
 ▪ Increase relevance and precision with extracted keywords and
   metadata
 ▪ Generate content for dynamic pages / topic pages
   • Selection of documents and extracts from documents
 ▪ Track and discover sentiments
 ▪ Reduce the time users need to analyze content




Examples




Entity Extraction

 ▪ Types of entities for extraction:
   • Dates
   • Places
   • Companies
   • Objects (product names, etc.)
   • People
   • Events




Example – Presenting the data








Example – Facets on the data




Example Solution: Entity Extraction
 ▪ Rule-based entity extraction
    • Combination of lists and regular expressions
 ▪ Works within well-understood domains.
 ▪ Requires maintaining lists.
 ▪ List sources: countries from the World Factbook, public companies from
   Google Finance, customers from the CRM.
 ▪ Workflow: Document for indexing > Update Request Handler >
   Update Chain (look up and match entities) > Writes to index



   Update Chain (processor: lists | input fields | entity fields) > entity fields > Lucene Index
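
The rule-based workflow on this slide can be illustrated outside Solr. The following is a minimal Python sketch, not the presenter's implementation: the list contents, the field names (`body`, `entity_country`, `entity_company`, `entity_date`) and the date pattern are all illustrative assumptions. It mimics what a processor in the update chain would do: read the input fields, match lists and a regular expression, and add entity fields before the document is written to the index.

```python
import re

# Hypothetical lookup lists; in practice these would be loaded from
# maintained sources such as a country list or a CRM export.
ENTITY_LISTS = {
    "entity_country": ["Sweden", "Norway", "United States"],
    "entity_company": ["Findwise", "Acme Corp"],
}

# Assumed date format, for illustration only: ISO-style YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def build_list_pattern(terms):
    """Compile one case-insensitive pattern matching any whole term."""
    # Longest terms first so a longer name wins over a shorter overlap.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


LIST_PATTERNS = {field: build_list_pattern(terms)
                 for field, terms in ENTITY_LISTS.items()}


def extract_entities(document, input_fields=("body",)):
    """Return a copy of the document with entity fields added."""
    text = " ".join(str(document.get(name, "")) for name in input_fields)
    enriched = dict(document)
    for field, pattern in LIST_PATTERNS.items():
        # Normalise matches to the canonical spelling from the list.
        canonical = {term.lower(): term for term in ENTITY_LISTS[field]}
        found = {canonical[m.group(0).lower()] for m in pattern.finditer(text)}
        enriched[field] = sorted(found)
    enriched["entity_date"] = sorted(set(DATE_PATTERN.findall(text)))
    return enriched


doc = {"id": "1", "body": "Findwise opened an office in sweden on 2012-05-09."}
print(extract_entities(doc))
```

Keeping the lists in plain data structures reflects the maintenance cost noted on the slide: extraction quality depends on how current those lists are.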




Example Solution: Entity Extraction
 ▪ Register a custom class that looks up resources and writes the entities
   it finds to specific Solr fields; set up in solrconfig.xml:




Document Categorization

   ▪ Assigns a label to a document, piece of content, or data.
   ▪ Labels can represent a category or a sentiment.
   ▪ A threshold value must be reached before a category label is applied.
   ▪ Statistics and “knowledge” from previous examples can be used.
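
The threshold idea can be shown in a few lines. This is a minimal sketch that assumes a classifier has already produced per-category scores between 0 and 1; the function name `label_document`, the category names and the default threshold of 0.6 are hypothetical.

```python
def label_document(scores, threshold=0.6):
    """Return the best-scoring category, or None if it misses the threshold."""
    if not scores:
        return None
    best_category = max(scores, key=scores.get)
    if scores[best_category] < threshold:
        return None  # Not confident enough: leave the document unlabeled.
    return best_category


# A confident prediction is labeled; an ambiguous one is left unlabeled.
print(label_document({"finance": 0.82, "sports": 0.11, "politics": 0.07}))
print(label_document({"finance": 0.40, "sports": 0.35, "politics": 0.25}))
```

Leaving a document unlabeled when no category clears the threshold tends to keep category facets cleaner than forcing a weak guess.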




Example – Facets from Categories




Example Solution: Document Categorization


                                               *

 ▪ Training the component: Mallet (Machine Learning for Language
   Toolkit).
   • Alternative components include a Lucene (TF-IDF) index
      (MoreLikeThis), OpenNLP, Textcat, Classifier4j.
 ▪ Running new documents against the model/index of trained
   documents.
 ▪ Training from an interface, ad hoc, or from a pre-categorized index.

* Figure from the book Taming Text.
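
The train-then-evaluate flow can be illustrated without Mallet. The sketch below substitutes a small multinomial Naive Bayes classifier written in plain Python; it is a stand-in for the idea, not Mallet's API, and the class name, training texts and category names are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesCategorizer:
    """Multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # documents per category
        self.term_counts = defaultdict(Counter)  # term counts per category
        self.vocabulary = set()

    def train(self, text, category):
        """Add one pre-categorized document to the model."""
        tokens = tokenize(text)
        self.doc_counts[category] += 1
        self.term_counts[category].update(tokens)
        self.vocabulary.update(tokens)

    def log_scores(self, text):
        """Return the unnormalised log-probability of each category."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        tokens = tokenize(text)
        scores = {}
        for category, doc_count in self.doc_counts.items():
            counts = self.term_counts[category]
            total_terms = sum(counts.values())
            score = math.log(doc_count / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_terms + vocab_size))
            scores[category] = score
        return scores

    def categorize(self, text):
        """Return the most likely category for a new document."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


categorizer = NaiveBayesCategorizer()
categorizer.train("quarterly revenue profit shares market", "finance")
categorizer.train("stock market investors earnings", "finance")
categorizer.train("match goal team season league", "sports")
categorizer.train("coach players team tournament", "sports")

print(categorizer.categorize("investors expect higher profit this quarter"))
print(categorizer.categorize("the team won the league match"))
```

Because `train` only updates counts, more pre-categorized documents can be added at any time, which loosely corresponds to the ad hoc training mentioned above.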


Example Solution: Document Categorization
 ▪ Setting up and training Mallet:




Example Solution: Document Categorization
 ▪ Evaluating a new document:




 ▪ Setting the evaluated category tag on the document in the pipeline:

   Update Chain (processor: input document) > category field > Lucene Index




Document Summarization

 ▪ Summarizes a document, at index time or on demand.
 ▪ Leverages the knowledge and term statistics of the document
   and the index.
 ▪ Picks the “most important” sentences based on those statistics and
   displays them.
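
One way to read "most important" is as a sentence score built from term statistics. The sketch below is an assumption about how such scoring could work, not the presenter's component: it weights each term by its frequency in the document and by a smoothed inverse document frequency over a small in-memory corpus that stands in for the index. The function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def document_frequencies(corpus):
    """Count in how many documents each term occurs."""
    frequencies = Counter()
    for text in corpus:
        frequencies.update(set(tokenize(text)))
    return frequencies


def summarize(text, corpus, max_sentences=2):
    """Pick the highest-scoring sentences and return them in original order."""
    doc_freq = document_frequencies(corpus)
    term_freq = Counter(tokenize(text))
    num_docs = len(corpus)

    def weight(term):
        # Smoothed IDF: terms that are rare across the corpus weigh more.
        idf = math.log((1 + num_docs) / (1 + doc_freq[term])) + 1
        return term_freq[term] * idf

    sentences = split_sentences(text)
    scored = []
    for position, sentence in enumerate(sentences):
        terms = tokenize(sentence)
        if not terms:
            continue
        # Average weight, so long sentences are not favoured automatically.
        score = sum(weight(t) for t in terms) / len(terms)
        scored.append((score, position))
    top = sorted(scored, reverse=True)[:max_sentences]
    return [sentences[position] for _, position in sorted(top, key=lambda item: item[1])]


text = "Solr indexes documents. Solr extracts entities from documents. It is the best."
corpus = [text, "the weather is nice today", "it is the best day"]
print(summarize(text, corpus, max_sentences=2))
```

Running this at index time would give a static summary stored with the document; calling it per request corresponds to the on-demand case.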




Example – Summarize content


Static Summaries




Dynamic Summaries




Example – Summarize content - 1




Example – Summarize content - 2




Example Solution: Document Summarization
 ▪ A custom RequestHandler receives the document ID and the field to
   summarize.
 ▪ A custom SearchComponent selects the top sentences.
 ▪ The selected subset of sentences is sent back in a field.

   RequestHandler (SearchComponent for summarization) > Lucene Index




Wrap Up

• Examples: Entity Extraction, Document Categorization,
  Summarization.
• Technology: You can take small steps and gain a great
  deal, since you can leverage features and
  components of Solr and Lucene (as well as other open
  source NLP frameworks).
• Value: The benefits of text analytics include increased
  discovery, findability and productivity from the
  solution.




Questions?



daniel.ling@findwise.com
www.findabilityblog.com




