OCTOBER 13-16, 2016 • AUSTIN, TX
Properly integrate ManifoldCF with Solr
Aurélien MAZOYER
Search Expert, Co-founder, France Labs
3
Apache Manifold CF
o Agenda
• Overview of ManifoldCF
• Our scenario : find files on a file share
• In real life
4
Apache Manifold CF
o Overview
• Connector Framework
• Incremental crawling
• Handle authorization
• Configuration via REST API and UI
5
Apache Manifold CF
o History
• Based on « Connector Framework » developed by Karl Wright
for the MetaCarta Appliance
• Donated to the Apache Software Foundation in 2009
• May 2012 : out of incubation
• Current version : 2.2 (August 2015)
6
Connectors gone wild
o Different connectors for :
• Content repositories
• Web, Wiki, DB, Email, RSS, CMIS, Alfresco…
• But also Windows Share, Sharepoint, Dropbox…
• Authorities
• LDAP, AD, CMIS…
• Output
• Solr, Elasticsearch, OSS…
7
Big picture
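[Architecture diagram of ManifoldCF]
• Repositories (Wiki, DB, …, Repository N): reached through repository connectors (Conn. 1 … Conn. N) run by the daemon agent
• Outputs (Solr, Elasticsearch, …): reached through output connectors (Conn. 1, Conn. 2 … Conn. N) run by the daemon agent
• Authorities (OpenLDAP, …, Authority N): reached through authority connectors (Conn. 1, Conn. 2 … Conn. N) run by the ManifoldCF authority service
• ManifoldCF UI
• ManifoldCF API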
8
Roles of components
o Daemon agent
• Java process
• Run repository and output connectors
• Run data crawling jobs
9
Roles of components
o Authority service
• Web application
• Run authority connectors
• Get security tokens for a specific user
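A minimal sketch of asking the authority service for a user's security tokens. The base URL, the `UserACLs` servlet name and the line-oriented `KEY:value` response (with `TOKEN:` lines carrying the tokens) are assumptions based on the default single-process example; check them against your ManifoldCF version. The helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed location of the authority service in the single-process example.
AUTHORITY_SERVICE = "http://localhost:8345/mcf-authority-service"


def user_acls_url(username, base=AUTHORITY_SERVICE):
    """URL that asks the authority service for one user's access tokens."""
    return f"{base}/UserACLs?{urlencode({'username': username})}"


def parse_user_acls(text):
    """Split a 'KEY:value' per-line response into tokens and status lines."""
    tokens, statuses = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        if key == "TOKEN":
            tokens.append(value)            # one access token per TOKEN line
        else:
            statuses.append((key, value))   # any other line is kept as a status
    return tokens, statuses
```

The front end (or a Solr plugin) would fetch `user_acls_url(...)` for the authenticated user and pass the resulting tokens to the search filter.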
10
Component
Output Connection, Repo Connection, Crawl Job
1…1 1…* 1…* 1…*
o ManifoldCF UI
That’s it.
11
API Configuration
o API
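A rough sketch of driving ManifoldCF through its REST API from a script. The base URL (port 8345, `mcf-api-service/json`) and the resource names `jobs` and `start/<job_id>` are assumptions taken from the default single-process example and should be checked against the programmatic operation documentation; the helper names are hypothetical.

```python
import json
import urllib.request

# Assumed base URL of the API web application in the single-process example.
MCF_API = "http://localhost:8345/mcf-api-service/json"


def api_url(resource, base=MCF_API):
    """Build the URL of an API resource such as 'jobs' or 'start/123'."""
    return f"{base.rstrip('/')}/{resource.lstrip('/')}"


def build_request(method, resource, body=None, base=MCF_API):
    """Prepare (but do not send) an HTTP request with an optional JSON body."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(api_url(resource, base), data=data,
                                  headers=headers, method=method)


def call(method, resource, body=None, base=MCF_API):
    """Send the request and decode the JSON answer (needs a running ManifoldCF)."""
    with urllib.request.urlopen(build_request(method, resource, body, base)) as resp:
        text = resp.read().decode("utf-8")
    return json.loads(text) if text else {}
```

With a running instance, `call("GET", "jobs")` would list the configured jobs and `call("PUT", "start/<job_id>")` would start one, assuming those resource names hold for your version.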
12
Test it!
o For testing purposes:
• java -jar start.jar
• All-in-one process
• Embedded database (HSQL)
13
Taking MCF to production
Multi-process deployment
o 3 web applications in a servlet container
• mcf-crawler-ui
• mcf-authority-service
• mcf-api-service
o Daemon agent
o Database
• PostgreSQL
o Synchronization through the filesystem (local) or ZooKeeper (distributed)
14
Search files with Security : Solr + MCF
o Our scenario
• File share using Active Directory
• Search with Solr
• With security constraints
15
Security model : Solr + MCF
o Authorization
• Early Binding
• Index documents with ACLs
• Compute authorization at runtime
o Authentication
• Not handled by Solr/ManifoldCF
• Front-end application should authenticate user
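The early-binding idea can be sketched in a few lines: ACLs are stored with each document at index time, and at query time the user's tokens are compared with them. This is a simplified illustration, not the actual plugin logic (which also distinguishes share, parent and document levels); the field names `allow_tokens` and `deny_tokens` are hypothetical.

```python
def is_visible(doc, user_tokens):
    """A document is shown if the user holds an allow token and no deny token."""
    user = set(user_tokens)
    allowed = bool(user & set(doc.get("allow_tokens", [])))
    denied = bool(user & set(doc.get("deny_tokens", [])))
    return allowed and not denied


def authorized_results(docs, user_tokens):
    """Keep only the documents the user may see, preserving result order."""
    return [d for d in docs if is_visible(d, user_tokens)]
```

Because the ACLs live in the index, no call to the file share is needed at search time; only the user's tokens have to be resolved.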
16
Search files with security : Solr + MCF
Phase 1 : Indexing
[Diagram]
• Repositories: Windows Share, crawled by the JCIFS connector of the daemon agent ("Crawl documents with ACLs")
• Output connector: Solr connector, feeding the Solr Extracting Handler ("Send docs and ACLs")
• Authorities: AD, behind the AD connector of the ManifoldCF authority service
• Solr also hosts the MCF plugin
17
Search files with security : Solr + MCF
Phase 2 : Searching
[Diagram: same components as in phase 1]
• Front end → Solr: authenticated search
• Solr MCF plugin → ManifoldCF authority service (AD connector → AD): get user access token
• Solr: filter docs based on ACLs and user info
• Solr → front end: authorized results
18
Configure Solr + MCF
o ManifoldCF side
o 4 connections and 1 job
• Create Windows Share connection
• Create Solr connection
• Create Active Directory connection
• Create Authority Group connection
• Create a crawling Job
19
Component
0…1
1…*
Authority Group
Authority Connection
1…1
1…*
Output Connection, Repo Connection, Crawl Job
1…1 1…* 1…* 1…*
20
Component
AD Group
Crawl Job Solr Connection
AD Connection
Windows Share Connection
21
Configure Solr + MCF
o Front-end side
o Authentication
• For Tomcat
• JNDI Tomcat Realm
• TomcatSPNEGO
22
Configure Solr + MCF
o Solr side
o Modify schema.xml
• Add fields for security tokens
o Modify solrconfig.xml
• Add MCF Solr Plugin (query parser)
o And don’t forget to protect the Solr instance :-P
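To give an idea of what the security fields are used for, the sketch below builds a Solr filter query from a user's tokens. It only illustrates the shape of such a filter: the real MCF plugin does this work inside Solr, and the field names `allow_token_document` and `deny_token_document` are assumptions that must match whatever you declared in schema.xml.

```python
def quote(token):
    """Quote a token as a Solr phrase so ':' and '-' are taken literally."""
    return '"' + token.replace("\\", "\\\\").replace('"', '\\"') + '"'


def security_filter(user_tokens, allow_field="allow_token_document",
                    deny_field="deny_token_document"):
    """Build a filter query: at least one allow token, and no deny token."""
    if not user_tokens:
        # Without tokens there is nothing to match on; refuse rather than
        # risk producing an unrestricted query.
        raise ValueError("no access tokens for this user")
    joined = " OR ".join(quote(t) for t in user_tokens)
    return f"+{allow_field}:({joined}) -{deny_field}:({joined})"
```

The resulting string would be sent as an `fq` parameter next to the user's actual query, which is why the Solr instance itself must not be reachable by end users directly.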
23
Configure Solr + MCF
o Leverage the Solr Extracting Handler
• Based on Apache Tika
• MIME type detection
• Embedded parsing libraries
• Supported formats:
• MS Office (OLE2 and OOXML)
• OpenDocument
• PDF
• Audio/video/image files
• Now OCR too, thanks to Tika 1.7 (and Tesseract)
o This can now be done directly in MCF!
24
Component
0…1
1…*
Authority Group
Authority Connection
1…1
1…*
Output Connection, Repo Connection, Crawl Job
1…1 1…* 1…* 1…*
Transformation
Connection
0…*
1…*
25
Crawling principle
o Crawling model
• Incremental model
• Continuous model
ManifoldCF In Action – Chapter 1 (Karl Wright)
Phase 1 Phase 2
26
Incremental crawling of file share
o Incremental crawling is not so easy with some repositories:
Windows Share Connector (JCIFS): "Uhuuu, file share, what's new since last time we met?"
Windows Share: "Errkkk…"
27
Incremental crawling of file share : Solr + MCF
o Phase 1 : Discovery/Indexing (depth first)
For each start path entry
    Fetch SMB file attributes from the Windows Share
    If file is a directory and if matches inclusion regex
        List files in SMB directory
        For each file: repeat from "Fetch SMB file attributes"
    If file is a regular file and if matches inclusion regex
        Check ingeststatus entry in crawler DB
        If no entry or the version attribute is different
            Fetch file content
            Update ingeststatus entry in DB
            Push file to Solr
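The phase 1 walk can be sketched as below, with an in-memory tree standing in for the SMB share and plain dictionaries standing in for the crawler database and for Solr. Everything here (the `Entry` class, `version_of`, `crawl`) is a hypothetical simplification, not the connector's actual code.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical stand-in for a file or directory on the share."""
    path: str
    is_dir: bool = False
    last_modified: int = 0
    acl: tuple = ()
    content: bytes = b""
    children: list = field(default_factory=list)


def version_of(entry):
    """Version string built from the ACLs and the last-modified time."""
    return "+".join(sorted(entry.acl)) + "+" + str(entry.last_modified)


def crawl(entry, include, ingeststatus, index, seen):
    """Depth-first discovery; only new or changed files are fetched and pushed."""
    if not include.search(entry.path):
        return
    if entry.is_dir:
        for child in entry.children:
            crawl(child, include, ingeststatus, index, seen)
        return
    seen.add(entry.path)                      # remembered for the deletion phase
    version = version_of(entry)
    if ingeststatus.get(entry.path) != version:
        index[entry.path] = entry.content     # fetch the content and push it
        ingeststatus[entry.path] = version    # record the version just indexed
```

Note that every file is still visited on each run; what the version check saves is fetching content and re-indexing unchanged files.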
28
Incremental crawling of file share
o What is an ingeststatus database entry?
o Simplified version :
DOCURI                            LAST_INGEST            LAST_VERSION
protocol://REPO_HOST/Doc1.docx    10.09.2015 18:21:04    Doc1_Version1
protocol://REPO_HOST/Doc2.docx    10.09.2015 19:21:04    Doc2_Version1
o LastVersion?
• Here, computed from lastModified and ACLs on the file
• Example:
+S-1-5-18+S-1-5-21-3380247023-2036360560-1108467148-1118+S-1-5-21-3380247023-2036360560-1108467148-500+S-1-5-32-544+1+DEAD_AUTHORITY+-file://///52.30.17.184/ShareFolder/TestFile.txt+1444462827664:16Y
29
Incremental crawling of file share : Solr + MCF
o Phase 2 : Deleting unreachable documents
For each crawler DB entry that is no longer reachable
    Send delete command to Solr
    Update crawler database
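A minimal sketch of this cleanup phase, assuming the crawl recorded the set of document URIs it reached. The function and parameter names are hypothetical; `delete_from_index` stands for whatever sends the delete command to Solr.

```python
def delete_unreachable(ingeststatus, seen, delete_from_index):
    """Remove every tracked document that the last crawl did not reach."""
    removed = []
    for doc_uri in list(ingeststatus):      # copy keys: the dict is modified below
        if doc_uri not in seen:
            delete_from_index(doc_uri)      # e.g. send a delete command to Solr
            del ingeststatus[doc_uri]       # then update the crawler database
            removed.append(doc_uri)
    return removed
```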
30
How to see what happened
o Search History
o Monitoring
• Job Status
• Notification Connections
31
How to see what happened
o Search History
o History
• Simple History
• Maximum Activity
• Maximum Bandwidth
• Result Histogram
o Status
• Document Status
• Queue Status
32
Performance issue
o Find bottleneck
• Crawled repository
• Network
• Solr
• MCF database
• MCF configuration
33
Handle performance issue
o Specific connector’s configuration
• Throttling
• Max JVM connections
o Can improve speed / limit impact on crawled repository
o Very specific to the repository
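In ManifoldCF, throttling is set in the connection configuration, not in code. Purely to illustrate the principle (spacing fetches so the crawled repository is not overloaded), here is a small sketch; the `Throttle` class is hypothetical and unrelated to ManifoldCF's own implementation.

```python
import time


class Throttle:
    """Allow at most `max_per_second` fetches by spacing them out."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock, self.sleep = clock, sleep
        self.next_allowed = None   # earliest time the next fetch may start

    def wait(self):
        """Block until the next fetch is allowed, then reserve the slot after it."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval
```

A crawler loop would call `throttle.wait()` before each fetch. A lower rate protects the repository, a higher one speeds up the crawl.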
34
Handle performance issue
o Job settings
o Size limit of ingested documents
o Use regex to remove some extensions from crawl
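These two job-level filters boil down to a simple predicate, sketched below. In practice they are set in the job and connection configuration; the extension list and the size limit here are hypothetical values chosen for the example.

```python
import re

# Hypothetical exclusion list and size limit, for illustration only.
EXCLUDED = re.compile(r"\.(avi|mkv|mp4|iso)$", re.IGNORECASE)
MAX_BYTES = 20 * 1024 * 1024


def should_ingest(path, size, excluded=EXCLUDED, max_bytes=MAX_BYTES):
    """Ingest a file only if it is small enough and its extension is not excluded."""
    return size <= max_bytes and not excluded.search(path)
```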
35
Investigate errors
• Increase connector’s log level
• Read MCF simple history
• Thread Dump
36
Common errors in file crawling
o Crawler account rights
o Exotic files
o Very biiiiiiig files
o JCIFS errors
o Solr connector timeout
37
When to use ManifoldCF?
q = crawled_environment:heterogeneous
OR scenario:intranet
OR security:mandatory
38
References
o ManifoldCF documentation
https://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html
o ManifoldCF in Action (K. Wright)
https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs
o Securing Solr document with MCF (K. Wright)
http://fr.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
o France Labs blog posts :
http://www.francelabs.com/blog/tutorial-for-combining-manifoldcf-and-solr-for-files-search/
http://www.francelabs.com/blog/tutorial-on-authorizations-for-manifold-cf-and-solr/
39
Datafari
Search
Admin
o Intranet “ready to play” search solution
• Apache License
o Embeds:
• Solr
• ManifoldCF
o And other cool stuff:
• Admin and responsive search UI
• User Management
• Banana for user behavior analysis
• Tesseract OCR
• A funny zebra
• Etc…
www.datafari.com
40
aurelien.mazoyer@francelabs.com
@francelabs
www.francelabs.com

Editor's Notes

  • #3: Start : 0:00. End : 1:05 Hi, thank you. Me: Aurélien MAZOYER, co-founder of France Labs, an open source company based in France. We offer consulting on search technologies and their ecosystem. Datafari, our intranet search solution: say a few words about Datafari at the end of the talk. Topic. Survey: how many of you have ever used ManifoldCF?
  • #4: Start : 1:05. End : 1:40 3 parts. Overview of MCF. A case study on integrating MCF with Solr in order to search files. What happens with MCF in real life.
  • #5: Start 1:40 to 2:45 CF stands for Connector Framework. That means it is a tool that helps you connect to heterogeneous repositories. Push the data to your favorite search engine. Keep it synchronized. Take access rights into account to perform authenticated searches. Provides a complete UI and a REST API.
  • #6: Start 2:45 to 3:15 Karl Wright, when he worked for MetaCarta. Apache Top-Level Project since 2012. Active project: last release this summer.
  • #7: Start 3:15 to 4:25 Plenty of connectors are included in ManifoldCF. What is called… You can write your own (ManifoldCF in Action gives you the best practices for writing your own connector). Domain controller, such as an Active Directory. Search engine.
  • #8: Start 4:25 to 5:03 Contains different components. You can see the different connectors for interaction with the external world. Administration interface: talk about it in a few slides. What you cannot see here is the underlying database, the backbone of the solution.
  • #9: Start 5:03 to 5:30 Actually does the crawling job.
  • #10: Start 5:30 to 6:26 You pass the username as a parameter and it provides the security tokens for that specific user. For example, it gives the SID of the user in Active Directory and the SIDs of all the groups they belong to.
  • #11: Start 6:26 to 7:14 Also a web application. Administers MCF. To begin, you will have to create an output connection, a repository connection and a crawl job. Start the job. Once you are done, you are able to start your crawl.
  • #12: Start 7:14 to 8:00 What can be done in the admin UI can also be done through the API. Put new config. Send commands. Respects REST standards.
  • #13: Start 8:00 to 8:49 Very simple to test. Extract the binary distribution, open the example directory. Not unfamiliar to Solr users. Not the recommended way to run it in production (mainly because of the HSQL database).
  • #14: Start 8:49 to 10:00 The components we described run in different processes. The database is very important. PostgreSQL is one of the recommended databases. Synchronize via a local folder on the machine or with ZooKeeper.
  • #15: Start 10:00 to 11:07 Here is our scenario. Let's imagine an intranet. Users authenticate against AD. They put their files in shared folders. Access rights on folders are based on the user, but with specific permissions for some users. Of course it is a mess, so they need a good search engine to find their documents. Quite simple, not very unusual, but it can be a nightmare if you don't have the right tool. We are here in a fully proprietary environment, but we will see that MCF and Solr can deal with it.
  • #16: Start 11:07 to 12:00 A few words on the security model. Authorization: computed when the user runs a Solr query. Authentication: neither Solr nor MCF will do this job; it is up to the front-end application.
  • #17: Start 12:00 to 12:34 Go back to the big picture. Step 1: the JCIFS connector fetches documents with their access control lists and pushes them to the Solr Extracting Handler.
  • #18: Start 12:34 to 13:00 Step 2: the front end sends an authenticated query. The plugin retrieves the security tokens linked to the current user. Then it runs a normal search and filters the result set with the help of the documents' access control lists and the user's security tokens.
  • #19: Start 13:00 to 13:44 How can we actually implement that? On the MCF side. Windows Share connection: a few steps to do (download the latest version of the JCIFS library, uncomment the Windows Share line in the connectors config file).
  • #20: Start 13:44 to 14:03 An authority connection must belong to an authority group.
  • #21: Start 14:03 to 14:18 That's it for ManifoldCF.
  • #22: Start 14:18 to 15:00 As I told you, the front end is in charge of authentication. LDAP protocol to authenticate. Tomcat SPNEGO (Active Directory). SPNEGO: use single sign-on.
  • #23: Start 15:00 to 15:57 Add fields that will contain the access control list of the document. Declare the MCF plugin. Configure the endpoint of the authority service. Add a filter query that uses this plugin in your search handler. This is for the search handler.
  • #24: Start 15:57 to 16:30 For the update handler. It is a default extracting handler that integrates Apache Tika. As a reminder, since Solr 5, the extracting handler can run Tesseract to extract content from images. Solr can do this job.
  • #25: Start 16:30 to 17:22 In new versions of ManifoldCF, it can also be done in MCF. In fact, it is a processing pipeline. You can do field mapping but also Tika extraction. Perfect if you don't want to send big files over the network.
  • #26: Start 17:22 to 18:11 Now we will try to understand what is going on under the hood during our crawl. These two crawling models are available with ManifoldCF. To avoid indexing… Discover new documents, remove old ones.
  • #27: Start 18:11 to 18:33 Some repositories work well with incremental crawling. Others don't. Unfortunately, our Windows share won't be able to answer.
  • #28: Start 18:33 to 20:00 Therefore JCIFS connector. How does the Windows Share connector handle incremental crawling? If it is a file… Next slide is about the version attribute. Fetch from the Windows share.
  • #29: Start 20:00 to 20:50 For each document. Last version. Depends on the repository.
  • #30: Start 20:50 to 21:04 This was for step 1. We can repeat these 2 steps in order to keep our data synchronized. We have covered how to configure this and described how it works under the hood. Now it is in production and you want to be sure of what is going on.
  • #31: Start 21:04 to 22:33 A lot of information, through the UI or the API. Send an alert if something went wrong, or just when the crawl is finished.
  • #32: Start 22:33 to 23:00 You also have a tab that shows you a history of all the different activities. Document status, for example if you want to see whether a document has already been ingested in the current crawl. Maximum bandwidth will give you information on crawling performance.
  • #33: Start 23:00 to 23:16 Unfortunately, sometimes facing performance issues. Obvious: a crawled repository that is overloaded. It can be because of the network: you should capture packets with Wireshark. Solr server: for example if the autocommit frequency is too high. The MCF database is an important component; be sure that you followed the best practices in the documentation.
  • #34: Start 23:16 to 24:25 Maybe it is because of the configuration of your connector. Two main parameters can have an impact on performance. Throttling: fixing a hard limit on document fetching (useful if you are doing web crawling and don't want to be banned by the webmaster). Max connections that will be made to the system. Increasing this value can be a good idea for web crawling, but a Windows share won't work very well with a lot of connections, so in our scenario we should use a small value.
  • #35: Start 24:25 to 25:15 In the job settings, you can filter the documents that you want to index. For an intranet file share, you probably don't want to index the latest Star Wars movie that an employee wanted to share with their colleagues.
  • #36: Start 25:15 to 26:00 Those were some examples of performance issues. But unfortunately, it can be even worse: you can face errors. If you are facing errors… A thread dump can give you information on…
  • #37: Start 26:00 to 28:08 One common problem is when the account you use for the crawl doesn't have enough rights. It must be able to read everything and to read the ACLs of each file. It can need special rights, such as Print Operator. As we just saw, we can use an exclusion regex or a size limit. Also be sure to enable ignoring Tika exceptions in Solr. JCIFS errors, linked or not to network issues: timeouts. Sometimes solved by increasing the JCIFS timeout. Sometimes you have to increase the Solr timeout. Big processing.
  • #38: Start 28:08 to 28:45 What can happen in real life. To conclude. Massive web crawling: Nutch is the best tool for you. Then, go for it.
  • #39: Start 28:45 to 29:18 Here are some references. That is now freely available. You can have a look at our blog posts, which show you how to run through the different steps that I covered in the file search scenario I described.
  • #40: Start 29:18 to 29:40 If you are too lazy to integrate Solr and ManifoldCF by yourself
  • #41: Start 29:40 to 30:00 Thank you very much for your attention. I will be pleased to answer any questions you may have.