Patrick Beaucamp
Founder of the Vanilla Project
Mail : Patrick.beaucamp@bpm-conseil.com
Custom Open Source Search Engine with Drupal 8
and Solr at French Ministry of Environment
II-SDV, Nice 24th April 2017
1II-SDV, Nice
Presentation Agenda
Open Source Search Engine & Search Platform
Some interesting Platforms
Features expected for Search Platforms (Interface)
Open Source Platform at French Ministry
Project Context
Platform Architecture
WebSite Powered by a Search engine
Echo : Tuesday am, presentation from Deep Search 9 and
Tuesday pm presentation from FranceLabs
Personal Experience of Search
Searching … and finding !
II-SDV : SEARCH, DATA MINING and
VISUALISATION
How many times per day do you Google ? (search,
maps, translate …)
Tribute to Open Source at II-SDV
Search is the first Step : collecting information
Searching … and finding !
An example : my personal experience
I tried to find a person for 23 years, roughly from 1993
to 2016
From 1993 to 1998 : no search engine available …
only private investigator ?
From 1999 to 2015 : regular Search – no results
I found this person on Facebook, not on Google
From a browser : « f + tab » … « g + tab », « y + tab » …
Some years : no search, other years : multiple searches
Searching … and finding !
1) We all become private investigators at one time or another
Searching … and finding !
2) Different search engines lead to different results
Searching … and finding !
2) Different search engines by country
Searching … and finding !
Funny word : SEO … it's more « how to be found on
Internet » … and you need to pay for it !
Searching … and finding !
3) The person I was looking for published on Facebook using
his/her real name : it is his/her decision to be visible or not
4) Where do we stand with the « Right to be Forgotten » ?
Searching … and finding !
Companies like Facebook have tons of data : they need to
provide search infrastructure (indexing + search interface)
I was lucky to try the Facebook search interface
Searching … and finding !
Discovery of the source of a cholera outbreak, 1854 (John Snow)
http://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak
Searching … and finding !
Bicycle accidents in the street : who is taking care of traffic management ?
Example in Boston :
http://www.boston.com/bostonglobe/editorial_opinion/blogs/the_angle/2010/12/bike_crash_map.html
Open Data
Searching … and finding !
LION – 2016 (Garth Davis)
Mistake 1 : Ganesh Tanei – Mistake 2 : Saroo
OpenSource LandScape
Crawling
Indexing
Storing
WebSite
Reference
WebSite
Accessibility
Update Management
Search Interface
Result Visualization
Auto Completion
Natural Language
Voice Recognition
Maps
Ads
Unstructured data
Access Management
Search Platform Objectives
Constraints : being able to reach WebSite and content :
Internal WebSites (Intranet) & External WebSites
Internal Document Repositories
Being able to index WebSite content (and page updates)
Being able to store unstructured data
Crawling
Storing
Indexing
Search Platform Objectives
Provide usable Search results (auto classification,
visualization)
Don’t Forget why and what you search :
• You search in existing documents
• You need visualization tools
• It's not a crystal ball : search reflects the past
Provide usable Search interfaces (semantic search, multi
language search …)
Search Interface
Result Visualization
Lucene is a Java-based indexing and search API
Solr is the leading search server built on Lucene. Two companies, LucidWorks
(Fusion) and ElasticSearch, provide packaging and extensions on top of Lucene
and Solr.
-Nutch is the crawling component
-Tika is a content analysis toolkit that extracts metadata and text from documents
-ZooKeeper is a coordination service for distributed processes (configuration, synchronization)
OpenSource LandScape
-Search Landscape
-Lucene : http://lucene.apache.org
-Solr/Lucene : http://lucene.apache.org/solr/
-Platform OpenSearch : http://www.open-search-server.com
-Platform Katta : http://katta.sourceforge.net
-Platform LucidWorks : http://www.lucidworks.com
-Platform ElasticSearch : http://www.elasticsearch.com
-Sphinx : http://sphinxsearch.com/
-Cloudera : https://www.cloudera.com/documentation/enterprise/5-5-x/topics/search_architecture.html
-FranceLabs : http://www.francelabs.com/ (Datafari)
-AklaBox : www.aklabox.com (AklaSearch)
OpenSource LandScape
Lucene : Retrieval Software library
Use existing Search Infrastructure like Solr/Lucene (Vanilla certified)
http://www.lucidworks.com/ or http://www.elasticsearch.org/
Search Engine Focus
-Cloudera with SolrCloud (Solr/Lucene)
-MapR with ElasticSearch (Lucene code)
-HortonWorks with LucidWorks (Solr/Lucene)
Hadoop Search Platform - Big Data
Before indexing your document base, you need to access it !
Apache Nutch is a highly extensible and scalable open source web crawler
software project.
Reference : http://nutch.apache.org/
Nutch
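To make the crawling step concrete, here is a minimal sketch of what a crawler does with each fetched page: extract the links and keep only those that stay inside the sites it is allowed to reach. This is not Nutch's API. The class and function names, and the `intranet.example` host, are hypothetical and use only the Python standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the <a href> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html, allowed_hosts):
    """Return the links of a page that stay inside the allowed hosts."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return [u for u in parser.links if urlparse(u).netloc in allowed_hosts]


page = '<a href="/news">News</a> <a href="http://other.example/x">Out</a>'
print(extract_links("http://intranet.example/", page, {"intranet.example"}))
```

A real crawler adds a queue of pages to visit, politeness delays and revisit scheduling on top of this step.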
Solr
• What is Solr
– Indexing and search engine
• A project of the Apache Software Foundation
• Built on top of Apache Lucene (Java search library)
– Major engine characteristics
• Scalable, fault tolerant, distributed indexing, dynamic
load balancing, centralized configuration
– Technical environment
• Java
• Embedded Jetty server for platform administration
Solr
Main characteristics
Admin Interface
Flexible and scalable Configuration
Modular
Multiple index management with a single instance
Solr
Main characteristics
Standard communication interfaces (HTTP, XML, JSON)
Configuration can be done with or without schema
Real-time indexing
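As an illustration of these standard interfaces, the sketch below builds a query URL for Solr's `/select` handler and reads the JSON structure Solr returns (`response.numFound`, `response.docs`). The collection name `website`, the `title` field and the sample response are assumptions made for the example. No request is actually sent.

```python
import json
from urllib.parse import urlencode


def select_url(base_url, collection, query, rows=10):
    """Build a URL for Solr's /select handler asking for a JSON response."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/{collection}/select?{params}"


def titles_from_response(body):
    """Return the hit count and the 'title' field of each returned document."""
    data = json.loads(body)["response"]
    return data["numFound"], [doc.get("title") for doc in data["docs"]]


# Hypothetical collection and field names; localhost:8983 is Solr's default port.
print(select_url("http://localhost:8983/solr", "website", "title:climate"))

# Hand-written sample shaped like a Solr JSON response.
sample = '{"response": {"numFound": 1, "start": 0, "docs": [{"id": "1", "title": "Climate plan"}]}}'
print(titles_from_response(sample))
```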
Solr
Main characteristics
Customizable Full Text analysis
Rich document indexing (using Tika)
Solr
Main characteristics
Search by facet and filters
Term suggestion and spelling correction
Geospatial Search
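A hedged sketch of how facets, filters and a geospatial restriction can be expressed as Solr query parameters (`facet`, `facet.field`, `fq`, and the `{!geofilt}` query parser). The field names `type` and `location` and the coordinates are assumptions for illustration. The function only builds the query string.

```python
from urllib.parse import urlencode, parse_qsl


def facet_params(query, facet_field, filters=(), geo=None):
    """Build a Solr query string with one facet field and optional filters.

    geo, when given, is (field, latitude, longitude, distance_km).
    """
    params = [("q", query), ("facet", "true"), ("facet.field", facet_field)]
    # Each filter query narrows the result set without affecting scoring.
    params += [("fq", f) for f in filters]
    if geo is not None:
        field, lat, lon, km = geo
        params.append(("fq", f"{{!geofilt sfield={field} pt={lat},{lon} d={km}}}"))
    return urlencode(params)


qs = facet_params(
    "air quality",
    facet_field="type",                 # hypothetical field
    filters=["type:report"],
    geo=("location", 43.7, 7.26, 10),   # hypothetical field, 10 km around Nice
)
print(parse_qsl(qs))
```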
Solr
Solr behavior
-Synonyms
- A search can be extended to synonyms if they are listed in a
glossary. For example, a search for “TV” can also find articles that
contain a synonym such as “television”.
-Metadata
- Dictionary listing the searchable keywords
Search Engine Basic (1/2)
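The synonym mechanism can be sketched as a query-time expansion: each search term is looked up in a glossary and its synonyms are added to the query. The glossary content and function name below are hypothetical. In Solr the glossary is normally a synonyms file handled by the analysis chain, not Python code.

```python
# Hypothetical glossary: each key maps to the synonyms that should also match.
SYNONYMS = {"tv": {"television", "telly"}}


def expand_query(terms, glossary):
    """Return the lower-cased terms followed by their listed synonyms."""
    expanded = []
    for term in terms:
        key = term.lower()
        expanded.append(key)
        # sorted() keeps the output deterministic.
        expanded.extend(sorted(glossary.get(key, ())))
    return expanded


print(expand_query(["TV", "guide"], SYNONYMS))
# → ['tv', 'television', 'telly', 'guide']
```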
-Reserved Words, Protected Words
- Indexing usually uses stemming, which reduces words to their root. For
example, with the root "develop", a search for "development" also finds
items that contain "develop". However, stemming sometimes goes wrong
and indexes two unrelated words under the same root. Stemming can be
prevented for specific words by listing them in the file protwords.txt.
-StopWords
- Stopwords are words that carry little meaning. A word considered insignificant
is ignored. Note that some words are insignificant only in some contexts,
and others have meaningful homonyms. For example, the French word "été" can
mean the summer season (rather meaningful) or the past participle of the
verb "to be" (relatively insignificant). Stopwords are listed in the file stopwords.txt.
Search Engine Basic (2/2)
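The interplay of stopwords, protected words and stemming can be sketched as a tiny analysis chain. The suffix-stripping "stemmer" below is deliberately naive and is not the algorithm Solr uses. The word lists stand in for the content of stopwords.txt and protwords.txt and are assumptions for the example.

```python
STOPWORDS = {"the", "of", "a"}       # stand-in for stopwords.txt
PROTECTED = {"news"}                 # stand-in for protwords.txt
SUFFIXES = ("ment", "ing", "s")      # toy suffix list, not a real stemmer


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text, stopwords=STOPWORDS, protected=PROTECTED):
    """Lower-case, drop stopwords, and stem every word that is not protected."""
    tokens = []
    for word in text.lower().split():
        if word in stopwords:
            continue
        tokens.append(word if word in protected else naive_stem(word))
    return tokens


print(analyze("The development of news"))
# → ['develop', 'news']
```

Without the protected list, this toy stemmer would reduce "news" to "new", which is the kind of unwanted merge that protected words are meant to prevent.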
-Multi-language support (this is where commercial search engines still have more
to offer customers), even though Asian languages are now supported (Hindi,
Thai, Chinese, …)
-Elision :
- Elision is a feature of French : words like "le" or "la" are contracted
when they are followed by a vowel. Example : "le" + "avion" (aircraft) gives
"l'avion" (the aircraft). These elisions can be removed using a lexicon.
-Limits solved over the past 3 years
• Full text search interface (language with search engine)
• SubQuery support : available since Solr 4.7 (current version is 6)
• Scalability (this is where Solr is taking a technical advantage)
Search Engine Current Limits
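Elision removal can be sketched with a small lexicon of contracted French words: when one of them is followed by an apostrophe and a letter, the prefix is dropped so that only the main word is indexed. The lexicon below is an illustrative subset chosen for the example, not the list shipped with any search engine.

```python
import re

# Illustrative subset of contracted forms (le/la, de, je, me, ne, se, te, que).
ARTICLES = ("l", "d", "j", "m", "n", "s", "t", "qu")

# A listed prefix at a word boundary, an apostrophe, then a word character.
ELISION = re.compile(r"\b(?:" + "|".join(ARTICLES) + r")['’](?=\w)", re.IGNORECASE)


def strip_elisions(text):
    """Remove elided prefixes such as l' or d' from the start of words."""
    return ELISION.sub("", text)


print(strip_elisions("l'avion et l'eau d'une île"))
# → avion et eau une île
```

Because the pattern requires a word boundary before the prefix, a word such as "aujourd'hui" is left untouched.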
-Advanced indexing and querying tools.
-Distributed search capabilities, so that no single server becomes a bottleneck.
-Generation of document excerpts (snippets) that summarize each search result.
-Relevance ranking, with extracts from the documents displayed according to the query.
Search Interface expectation (1/3)
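Snippet generation, in its simplest form, cuts a window of text around the first occurrence of the search term and marks the hit. The function below is a minimal sketch of that idea. The name, the `<em>` markers and the window size are assumptions, and real highlighters work on analyzed tokens rather than raw substrings.

```python
def make_snippet(text, term, width=30):
    """Return an excerpt around the first hit, with the hit wrapped in <em>."""
    pos = text.lower().find(term.lower())
    if pos < 0:
        # No hit: fall back to the beginning of the document.
        return text[: 2 * width]
    start = max(0, pos - width)
    end = min(len(text), pos + len(term) + width)
    hit = text[pos : pos + len(term)]
    return text[start:pos] + "<em>" + hit + "</em>" + text[pos + len(term) : end]


print(make_snippet("Air quality report for 2016", "quality", width=4))
# → Air <em>quality</em> rep
```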
-Duplicate document detection, including fuzzy near duplicates
-Rich document parsing and indexing without using database indexing.
-Ranking control : carry out a targeted ranking of individual documents.
-Search grouping by Type / Tag / Categories (General page, documents, images)
Search Interface expectation (2/3)
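One common way to detect fuzzy near duplicates is to compare the sets of word shingles of two documents with the Jaccard similarity. The sketch below shows that technique under assumed parameters (3-word shingles, a 0.5 threshold). It is not a description of how any particular search platform implements deduplication.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i : i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a, b):
    """Jaccard similarity of the shingle sets of two texts (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a, b, threshold=0.5):
    """Flag two texts as near duplicates when their similarity is high enough."""
    return jaccard(a, b) >= threshold


doc_a = "the ministry publishes a new report on air quality"
doc_b = "the ministry publishes a new report on water quality"
print(is_near_duplicate(doc_a, doc_b))                      # True (5 of 9 shingles shared)
print(is_near_duplicate(doc_a, "solr is built on lucene"))  # False
```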
-Multi Criteria support
-Ranking
-Natural language support
-Apps Support (Android, iPad)
Search Interface expectation (3/3)
Project at Ministry
Initial decision and guidelines from Ministry
New WebSite will be built with Drupal CMS 8.2
WebSite should be powered by a « Google-like Search Toolbar »
WebSite infrastructure should connect with multiple other
WebSites
All infrastructure software must be Open Source components
Project at Ministry
http://www.developpement-durable.gouv.fr/
Project at Ministry - Architecture
Project at Ministry - Technical
Project Steps
Nutch crawler for various WebSites
• Facebook, LinkedIn, Twitter, YouTube …
• Internal WebSite, Previous WebSite
Drupal Forms for Metadata & indexing
• Specific Forms for different kinds of documents
• Drupal CMS process to add new content
Drupal 8 Module for Solr : custom search, monitoring, reporting
• The existing Drupal Solr module is limited to a single instance of Drupal
• Not possible to use the Solr Admin interface
Project at Ministry - Technical
Additional PHP libraries
Curl : Communication Drupal-Solr (HTTP GET, HTTP POST & attached files)
Ssh2 : server administration commands
Zookeeper : Communication Drupal-Zookeeper
MemCached : Communication Drupal-Memcached
Solarium : Communication Drupal-Solr (abstraction layer)
GoogleApi : YouTube content indexing
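The HTTP POST communication with Solr can be illustrated by preparing an indexing request for Solr's `/update` handler, which accepts a JSON array of documents. The project itself used PHP libraries. The Python sketch below only shows the shape of such a request, with a hypothetical collection name and document fields, and it does not send anything.

```python
import json
from urllib.request import Request


def build_update_request(base_url, collection, docs):
    """Prepare (without sending) a JSON POST to Solr's /update handler."""
    url = f"{base_url}/{collection}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical collection and document fields.
docs = [{"id": "page-1", "title": "Air quality report"}]
req = build_update_request("http://localhost:8983/solr", "website", docs)
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it to a running Solr server.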
Project at Ministry – Admin Interface
Drupal 8 add-on to set up the global infrastructure (Zookeeper, Solr)
Project at Ministry – Admin Interface
Drupal 8 add-on to monitor the global infrastructure - Statistics
Project at Ministry - Validation
Project Validation & Deployment
No problems with Zookeeper, Solr, Nutch
Stress tests for the global platform : initial slowdown with 10 000
simultaneous connections
Sub-Project : Addressing the Single Point of Failure
Solution : problems with Drupal & MySQL solved with MemCached
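The slide does not say how Memcached was wired in. A common way to relieve a database under load is the cache-aside pattern, sketched here as an assumption. A plain dictionary stands in for the Memcached client and a function stands in for the MySQL query, so the example runs without any server.

```python
class PageCache:
    """Cache-aside lookup: try the cache first, fall back to the database."""

    def __init__(self, load_from_db):
        self.load_from_db = load_from_db  # slow path, e.g. a MySQL query
        self.store = {}                   # stand-in for a Memcached client
        self.db_calls = 0                 # counts how often the slow path runs

    def get(self, key):
        if key in self.store:
            return self.store[key]
        # Cache miss: hit the database once, then remember the result.
        self.db_calls += 1
        value = self.load_from_db(key)
        self.store[key] = value
        return value


cache = PageCache(lambda key: f"rendered page for {key}")
cache.get("home")
cache.get("home")
print(cache.db_calls)  # → 1
```

With many simultaneous connections asking for the same pages, only the first request for each page reaches the database in this scheme.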
Project at Ministry - Next
Next Steps
Review of WebSite content … new Ministry
New Content to be indexed :
• Other WebSites and Social Content
• New set of documents to be added to the repository
