Smartlogic™
Lucene Revolution 2012
Jeremy Bentley, CEO
1st degree of order


Filing management
• 80% of enterprise information is unstructured
• Doubling every 19 months and accelerating [Gartner]
• Increasing burden of compliance
• Enterprise 2.0 additions
• Big Data connotations
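
To make the "doubling every 19 months" figure concrete, here is a minimal sketch of the growth it implies. It assumes steady exponential growth; the slide says growth is accelerating, so treat the result as a floor. The function name and the sample horizons are illustrative and not from the talk.

```python
def growth_factor(months: float, doubling_period_months: float = 19.0) -> float:
    """Multiplier on information volume after `months`, assuming steady doubling."""
    return 2.0 ** (months / doubling_period_months)


if __name__ == "__main__":
    # Print the implied multiplier for a few planning horizons.
    for years in (1, 5, 10):
        print(f"{years:>2} years: x{growth_factor(12 * years):.1f}")
```

Running the script prints the multiplier for each horizon, which is a quick way to size storage and classification capacity.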
2nd degree of order


Index management
• File plans and metadata schema
• Manually applied classification
• Low level of consistency and quality
3rd degree of order
Automation of 1st & 2nd Degrees
[Diagram: systems shown around the automation layer]
• Enterprise Search
• Content Management
• Portal Infrastructure
• Document Management
• SharePoint
• Records Management
• Publishing Systems
• Process Management & Workflow
• Digital Asset Management
• eDiscovery
A 10 year Flatline
[Chart: User Search Satisfaction, 50% in 2001 and 48% in 2011]
• 2001, IDC, “Quantifying Enterprise Search”: Searchers are successful in finding what they seek 50% of the time or less
• 2011, MindMetre/SmartLogic: More than half (52%) cannot find the information they need using their Enterprise search system
The explosion of information
[Chart: Terabytes of data per period: 4Tb (1993-2001), 80Tb (2001-2009), followed by "?"]
20 times increase in information volume
Source: the National Archives
Volume + other disruptive factors
• Velocity
• Variety
• Complexity
   • Cross-organizational and cross platform information needs
   • Changing requirements for information over time
Copyright @ 2011 Smartlogic Semaphore Limited
New 4th degree of order
Content Intelligence
[Diagram: systems shown around Content Intelligence]
• Enterprise Search
• Content Management
• Portal Infrastructure
• Document Management
• SharePoint
• Records Management
• Publishing Systems
• Process Management & Workflow
• Digital Asset Management
• eDiscovery
Content Intelligence
[Diagram: Metadata, surrounded by the areas below]
• Information Manufacturing
• Monetisation
• Knowledge Recovery
• Data Loss Prevention
• Risk & Compliance
• Content Analytics
Knowing what you have
Metadata
[Diagram: metadata attributes grouped as Information, Process and Structural]
• Subject
• Location
• Project
• Function (IT, HR, Finance)
• Creation Date
• Modified Date
• Author
• Format (PDF, DOC, XLS)
• Protective Marker
• Expiry
• Publisher
• Expert
• Retention
• Site
4th degree of order
Content Intelligence
[Diagram: Content Intelligence Platform, shown with FAST and SharePoint]
What is Content Intelligence
Content Intelligence is the process of IDENTIFYING, CLASSIFYING, EXTRACTING, ANALYZING and SURFACING information based on its meaning and context to make timely and informed business decisions.
Content Intelligence Solutions
• KNOWLEDGE ACQUISITION & REUSE
• MICROTARGETING & DISTRIBUTION
• GOVERNANCE, COMPLIANCE & RISK
• WEB-BASED SELF SERVICE
Big Data + Content Intelligence
[Figure: from Gartner, 2011]
Semaphore – Three Core Capabilities
[Diagram: SEMAPHORE at the centre, with three capabilities and an "Inform" label]
• Semantic Model: Ontology Manager. Build, Manage and Deploy Vocabularies/Libraries
• Apply (Content): Classification Server. Automate the Metadata Enrichment
• Expose (Users): Semantic Enhancement Server. Explore data to find insights
Enterprise Classification
Important requirements for Velocity/Volume:
• Scalability for large volumes of content, users, metadata and systems
• Easy integration with processing systems: search, content, records and document management systems as well as file shares and content migration tools
• Support for all the organization's languages and data formats
From Many Different Sources
Metadata Generation
[Diagram: metadata attributes grouped as Information, Process and Structural]
• Brand
• Service
• Geography
• Products
• Creation Date
• Modified Date
• Author
• Format (PDF, DOC, XLS)
• Expert
• Protective Marker
• Retention
• Publisher
• Expiry
• Site
Different Vocabulary and Ambiguity
Different vocabulary leads to missing results:
You Say / I Say
• Perpetrator: Burglar, Thief
• Swine Flu: Swine Influenza Virus, H1N1
• Touchscreen: Touch screen, Multi-touch
Ambiguity leads to too many results:
You Say / What do you mean?
• Apple: A fruit? Fiona - A singer / songwriter? An electronics company?
• Rights: Employment rights? Equal rights? Right of way?
• Ford: Ford Motor? Forward Industrials (ticker=FORD)? A shallow river crossing?
© 2010
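
The two problems on this slide can be sketched in a few lines. This is a minimal illustration, not Semaphore's implementation: the `SYNONYMS` and `SENSES` tables are hypothetical and hold only the slide's examples, whereas a real deployment would load them from a managed taxonomy or ontology. Python 3.9+ is assumed.

```python
# Hypothetical mini-vocabulary: alternative labels for the same concept.
SYNONYMS = {
    "perpetrator": {"burglar", "thief"},
    "swine flu": {"swine influenza virus", "h1n1"},
    "touchscreen": {"touch screen", "multi-touch"},
}

# Hypothetical sense inventory: one label, several possible meanings.
SENSES = {
    "apple": ["a fruit", "Fiona, a singer/songwriter", "an electronics company"],
    "rights": ["employment rights", "equal rights", "right of way"],
    "ford": ["Ford Motor", "Forward Industrials (ticker=FORD)", "a shallow river crossing"],
}


def expand_query(term: str) -> set[str]:
    """Add known synonyms so differently worded documents are not missed."""
    key = term.strip().lower()
    return {key} | SYNONYMS.get(key, set())


def needs_disambiguation(term: str) -> list[str]:
    """Return the candidate senses to offer the user; empty if unambiguous."""
    return SENSES.get(term.strip().lower(), [])
```

Expansion addresses "missing results" by widening the query; the sense list addresses "too many results" by asking the user which meaning they intend before searching.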
  
Without Accurate Metadata
Big Data has its perils. With huge data sets and fine-grained measurement, there is increased risk of “false discoveries.” The trouble with seeking a meaningful needle in massive haystacks of data is that “many bits of straw look like needles.”
- Trevor Hastie, Statistics Professor at Stanford University
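
The "false discoveries" risk in the quote can be put into numbers. This is a minimal sketch under a simplifying assumption that is not in the talk: every test is independent and no real effect exists, so every "hit" is straw. The function names and the 0.05 significance level are illustrative.

```python
def expected_false_discoveries(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of spurious 'needles' among n_tests null comparisons."""
    return n_tests * alpha


def prob_at_least_one(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one spurious hit across n_tests independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests
```

With more comparisons, the chance of at least one spurious finding approaches certainty, which is why the talk argues for accurate metadata to narrow the haystack before analysis.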
  	
  
What Classification Must Handle
Capability (each marked as Included):
• Look for all the vocabulary associated with topic/entity
• Determine aboutness / avoid passing mentions
• Address term ambiguity
• Handle stemming errors
• Determine if topics in the same context
• Split documents into components
• Generate scores (so most relevant content bubbles to top)
• Show dynamic summaries to users
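
Two of the capabilities above, "aboutness" and scoring, can be sketched together. This is a hedged illustration rather than the product's rule language: the `TOPIC_TERMS` weights, the per-100-words normalisation and the threshold of 5.0 are all assumptions made for the example. Python 3.9+ is assumed.

```python
import re

# Hypothetical rule for one topic: vocabulary terms with evidence weights.
TOPIC_TERMS = {"swine flu": 3.0, "h1n1": 3.0, "influenza": 2.0, "vaccine": 1.0}


def topic_score(text: str, terms: dict[str, float] = TOPIC_TERMS) -> float:
    """Weighted evidence per 100 words, so a passing mention in a long text scores low."""
    lowered = text.lower()
    n_words = max(len(re.findall(r"\w+", lowered)), 1)
    evidence = sum(
        weight * len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
        for term, weight in terms.items()
    )
    return 100.0 * evidence / n_words


def classify(text: str, threshold: float = 5.0) -> bool:
    """Tag the document with the topic only when the evidence is dense enough."""
    return topic_score(text) >= threshold
```

Sorting documents by `topic_score` in descending order gives the "most relevant content bubbles to top" behaviour; the threshold is what filters out passing mentions.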
  
Enhancing Metadata
• Accurately classify content into subject areas defined in a taxonomy/ontology
• Entity extraction (Text Mining)
• Sentiment Analysis
• Fact Extraction
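
Entity extraction, the second item above, is often bootstrapped with a dictionary (gazetteer) lookup. The sketch below shows that swapped-in technique only; it is not a description of how Semaphore extracts entities, and the `GAZETTEER` entries and type names are hypothetical. Python 3.9+ is assumed.

```python
import re

# Hypothetical gazetteer: surface form -> (entity type, preferred label).
GAZETTEER = {
    "ford motor": ("Organization", "Ford Motor"),
    "h1n1": ("Disease", "Swine Flu"),
    "swine flu": ("Disease", "Swine Flu"),
}


def extract_entities(text: str) -> list[tuple[str, str]]:
    """Return the distinct (type, preferred label) pairs mentioned in the text."""
    lowered = text.lower()
    found = set()
    for surface, entity in GAZETTEER.items():
        # Whole-word match so short surface forms do not fire inside other words.
        if re.search(r"\b" + re.escape(surface) + r"\b", lowered):
            found.add(entity)
    return sorted(found)
```

Mapping every surface form to one preferred label means the extracted metadata stays consistent even when authors use different vocabulary.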
  
Physical Architecture
[Diagram: deployment layout, summarised by component group]

Ontology Management Services
• Ontology Manager Standalone Desktop: Win 7, Vista; 2Gb RAM; 2GHz Dual CPU
• Ontology Manager Desktop (two shown): Win 7, Vista; 2Gb RAM; 2GHz Dual CPU
• Optional RDBMS data store: Oracle; MySQL; SQL Server 2005 + 2008 + 2008 R2
• Ontology Manager Server: Ontology Instance 1 (Port 8001), Ontology Instance 2 (Port 8002); Win 7, Vista, 2003, 2008 +R2; Linux; 2Gb RAM; 2GHz CPU

Semantic Enhancement Server
• Search Enhancement Server: Search Enhancement Instance with GSA Extensions, FAST Extensions and SharePoint Extensions
• Windows Server 2003, 2008 (32bit/64bit) +R2; Linux; IIS/Apache HTTP Server
• RAM and disk access intensive. Scale to expected peak search throughput

Content Classification Server
• Classification Server: Classification Instance
• Port 5058
• Windows Server 2003, 2008 (32bit/64bit) + R2; Linux
• CPU and RAM intensive. Scale to volume of content and number of publishing users
• Classification Test Interface: Internet Explorer, Firefox
• Rule and Template Editor: Win 7, Vista; 2Gb RAM; 2GHz Dual CPU

Integration Components
• Google Classification Handler: Dispatcher, Proxy; Windows Server 2003, 2008 (32bit/64bit) +R2; Scale for throughput of GSA Indexing Crawler
• Google Search Appliance: Search Application Framework
• Microsoft FAST ESP Server Farm: Search Application Framework, Semaphore Document Processor
• SOLR: Search Application Framework, Semaphore Document Processor
• Microsoft Office SharePoint Server 2007 / 2010 Server Farm: Document Library Components, Search Web Parts
Leveraging Metadata Schemes
Examples – Customer Service
Examples – Following Trends
Examples – Fact Extraction
How Else Does Semaphore Help
• Disambiguate queries
• Perfectly formed filters organised by facet
• Graphical drill down
• Explore relationships
• Supporting documents
Happy, Successful Customers

Big Data Meets Metadata – Analyzing Large Data Sets
