A Real-Time Search Engine with Lucene and S4
Yahoo! S4 applied to Information Retrieval




 2/5/2011                                    Michaël Figuière
Speaker

      @mfiguiere
      blog.xebia.fr



      Michaël Figuière

      Distributed Architectures
      NoSQL
      Search Engines
Our case study




      A Search Engine to keep track of activities
                within an enterprise
The Problem
A Search Engine


                      MyCustomer                               Search




   Document     Non Disclosure Agreement                                          12 days ago
                   ... MyCustomer agrees not to disclose any part of ...



   Document     2010 Sales Report                                                 1 month ago
                ... MyCustomer: 12 M€ with 3 deals ...



                Phone Call                                                        2 days ago
   Phone Call   Customer: MyCustomer           Time: 9:55am     Duration: 13min
                Description: Invoice not received for order #2354E
Indexing Pipeline

      PDF        -> Text Extractor (Tika) -> Analyzer -> Search Index
      Phone Call ->                          Analyzer -> Search Index

      Analyzers and Search Index: Lucene
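
The pipeline above can be sketched in a few lines of plain Python. This is only an illustration of the three stages, not the Tika or Lucene API: `extract_text`, `analyze`, and `SearchIndex` are hypothetical stand-ins for the text extractor, the analyzer, and the search index.

```python
import re
from collections import defaultdict


def extract_text(raw):
    """Stand-in for a text extractor: here we simply decode UTF-8 bytes."""
    return raw.decode("utf-8", errors="replace")


def analyze(text):
    """Stand-in for an analyzer: lowercase and split on non-word characters."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


class SearchIndex:
    """Toy inverted index mapping each term to a set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, tokens):
        for tok in tokens:
            self.postings[tok].add(doc_id)

    def search(self, query):
        """Return the ids of documents containing every query term."""
        terms = analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = SearchIndex()
# A PDF goes through text extraction first; a phone call record is already text.
index.add("nda.pdf", analyze(extract_text(b"MyCustomer agrees not to disclose any part of")))
index.add("call-1", analyze("Customer: MyCustomer. Invoice not received for order #2354E"))
```

Searching for `MyCustomer` returns both documents, while adding `invoice` to the query narrows the result to the phone call.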
A more complex Search Engine


                      MyCustomer                               Search

                    Sales                   Legal                     Accounting




   Document     2010 Sales Report                                                 1 month ago
                ... MyCustomer: 12 M€ with 3 deals ...



                Phone Call                                                        2 days ago
   Phone Call   Customer: MyCustomer           Time: 9:55am     Duration: 13min
                Description: Invoice not received for order #2354E
Indexing Pipeline

      PDF        -> Text Extractor (Tika) -> Classifier (Mahout) -> Analyzer -> Search Index
      Phone Call ->                          Classifier (Mahout) -> Analyzer -> Search Index

      Analyzers and Search Index: Lucene
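
The classifier stage attaches a category to each document so that results can be filtered by facet. The sketch below is a deliberately naive keyword matcher in plain Python, not Mahout: the category names and keyword lists are assumptions made up for the example, and a real classifier would be trained on labeled data.

```python
# Hypothetical keyword lists, chosen only for this illustration.
CATEGORY_KEYWORDS = {
    "Sales": {"sales", "deal", "deals"},
    "Legal": {"agreement", "disclose", "contract"},
    "Accounting": {"invoice", "payment"},
}


def classify(text):
    """Return the category whose keywords occur most often, or None if none match."""
    tokens = [tok.strip(".,:;#") for tok in text.lower().split()]
    scores = {
        category: sum(1 for tok in tokens if tok in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def enrich(document):
    """Classifier stage: copy the document and attach a 'category' field."""
    enriched = dict(document)
    enriched["category"] = classify(document["text"])
    return enriched


doc = enrich({"id": "report-2010", "text": "2010 Sales Report: 12 M€ with 3 deals"})
```

The enriched document then continues to the analyzer, and the `category` field can be indexed as a facet.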
More complex ...

• Entity Recognition
         Recognizes an entity however it is written



• Language Recognition
         To index each language separately



• Fetching linked URLs
         Enhances document context by also indexing linked URLs



• ...
A Real-Time Search Engine


                      MyCustomer                               Search

                    Sales                   Legal                     Accounting




   Document     2010 Sales Report                                                  1 month ago
                ... MyCustomer: 12 M€ with 3 deals ...



                Phone Call                                                        3 seconds ago
   Phone Call   Customer: MyCustomer           Time: 9:55am     Duration: 13min
                Description: Invoice not received for order #2354E
Indexing Pipeline

      PDF        -> Text Extractor -> Some Pre-Processing -> Analyzer -> Near Real-Time Search Index
      Phone Call ->                   Some Pre-Processing -> Analyzer -> Near Real-Time Search Index

      Near Real-Time Search Index: since Lucene 2.9
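
The idea behind a near real-time index can be illustrated with a toy class: newly added documents sit in a buffer and become searchable after a cheap refresh, without waiting for a full commit. This is a sketch of the concept only, under that assumption, and not Lucene's API; `NearRealTimeIndex` and its methods are hypothetical names.

```python
class NearRealTimeIndex:
    """Toy index: writes land in a buffer; refresh() makes them searchable."""

    def __init__(self):
        self._searchable = []   # documents visible to searches
        self._buffer = []       # documents added since the last refresh

    def add(self, doc_id, text):
        self._buffer.append((doc_id, text.lower()))

    def refresh(self):
        """Cheap operation: expose buffered documents to searches."""
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, term):
        term = term.lower()
        return [doc_id for doc_id, text in self._searchable if term in text]


index = NearRealTimeIndex()
index.add("call-1", "Invoice not received for order #2354E")
before = index.search("invoice")   # the document is not visible yet
index.refresh()
after = index.search("invoice")    # visible after the refresh
```

The shorter the interval between refreshes, the closer the "3 seconds ago" result on the previous slide comes to being achievable.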
But...

      (same pipeline as on the previous slide)

      What if processing takes one second per document
      on a single box?
Let’s distribute it

      Server 1:  Pre-Processing + Search Index
      Server 2:  Pre-Processing + Search Index
      Server 3:  Pre-Processing + Search Index
      ...
      Server N:  Pre-Processing + Search Index

      Processing logic and index structure are distributed together
That’s a problem...

• Processing and index storage may have different scaling needs
        Depending on the search traffic, the processing overhead, ...



• Scaling index storage up or down is slow and complex
        Whereas stateless processing is simple to scale up/down



• Expensive pre-processing may make searches slower
        And indexing in real time shouldn’t make searches slower!
Let’s move it to Hadoop

      PDF        -> Text Extractor -> Some Pre-Processing -> Analyzer -> Near Real-Time Search Index
      Phone Call ->                   Some Pre-Processing -> Analyzer -> Near Real-Time Search Index

      Processing runs on Hadoop MapReduce
But...

• Hadoop can only deal with bounded chunks of data
         Data must be available somewhere on HDFS



• An unbounded stream of data doesn’t fit into Hadoop MapReduce
         Hadoop is designed and optimized for batch processing



• Manually bounding the stream won’t be efficient
         It would result in many frequent, inefficient batches
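
A rough back-of-the-envelope model shows why bounding the stream manually is a poor fit. The function below is a sketch under simple assumptions: each batch job pays a fixed startup overhead plus a per-document cost, and all the figures in the usage lines are hypothetical.

```python
def batch_stats(arrival_rate, batch_interval_s, job_overhead_s, per_doc_s):
    """Return (worst-case indexing latency in seconds, share of job time spent on overhead)."""
    docs_per_batch = arrival_rate * batch_interval_s
    job_time = job_overhead_s + docs_per_batch * per_doc_s
    # A document arriving right after a batch closes waits a full interval,
    # then the whole job must finish before it becomes searchable.
    worst_latency = batch_interval_s + job_time
    overhead_fraction = job_overhead_s / job_time
    return worst_latency, overhead_fraction


# Hypothetical workload: 10 documents/s, 30 s of fixed job overhead, 10 ms per document.
small = batch_stats(10, 60, 30, 0.01)      # one-minute batches
large = batch_stats(10, 3600, 30, 0.01)    # one-hour batches
```

With these assumed numbers, one-minute batches spend most of each job on overhead and still leave documents unsearchable for over a minute and a half, while one-hour batches are efficient but far from real time.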
S4
S4

• A distributed, fault-tolerant, stream processing system



• Elastic

            Based on ZooKeeper



• Project started in November 2010, still experimental

            But things are moving fast!
Where does S4 come from?

• Open Source project created by Yahoo!




• Initially built for relevant ad selection and clever positioning on web pages
         But designed to be generic enough



Processing Element

      Events Input -> Processing Element -> Events Output

      Your business logic goes here, inside the Processing Element
Processing Node

      Processing Node
          Processing Element 1
          Processing Element 2
          ...
          Processing Element N
S4 Cluster

      Events Stream -> Processing Node 1
                       Processing Node 2
                       ...
                       Processing Node N

      Cluster Management: ZooKeeper
Programming model

      Input event                   PhoneCallPE                   Output event
      Type: PhoneCall               Accepts events with:          Type: EnrichedPhoneCall
      KeyTuple: «Id=15497»            Type=PhoneCall              KeyTuple: «Id=15497»
      Value: <serialized object>      KeyTuple: Id=15497          Value: <serialized object>

      A new Processing Element instance is created for each value of «Id»
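
The keyed programming model can be mimicked in plain Python. This sketch is not the S4 API: `PhoneCallPE`, `Dispatcher`, and the dictionary-based event layout are hypothetical, chosen to show how one Processing Element instance per key value keeps its own state.

```python
class PhoneCallPE:
    """Hypothetical PE: one instance per «Id» value, holding per-key state."""

    def __init__(self, key):
        self.key = key
        self.events_seen = 0

    def process(self, event):
        self.events_seen += 1
        # Business logic goes here; this sketch just tags the event as enriched.
        return {
            "type": "EnrichedPhoneCall",
            "key": event["key"],
            "value": {**event["value"], "seen": self.events_seen},
        }


class Dispatcher:
    """Routes events to PE instances, creating one lazily per key value."""

    def __init__(self, accepted_type, pe_class):
        self.accepted_type = accepted_type
        self.pe_class = pe_class
        self.instances = {}

    def dispatch(self, event):
        if event["type"] != self.accepted_type:
            return None  # this PE does not accept the event type
        pe = self.instances.get(event["key"])
        if pe is None:
            pe = self.instances[event["key"]] = self.pe_class(event["key"])
        return pe.process(event)


dispatcher = Dispatcher("PhoneCall", PhoneCallPE)
first = dispatcher.dispatch({"type": "PhoneCall", "key": "Id=15497", "value": {"duration_min": 13}})
second = dispatcher.dispatch({"type": "PhoneCall", "key": "Id=15497", "value": {"duration_min": 2}})
other = dispatcher.dispatch({"type": "PhoneCall", "key": "Id=20001", "value": {}})
ignored = dispatcher.dispatch({"type": "Email", "key": "Id=1", "value": {}})
```

The two events sharing `Id=15497` reach the same instance, the event with a different key gets a fresh one, and the event of another type is not accepted.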
An indexing pipeline with S4

      ReRoutingPE
          Handles incoming events and load-balances them
          according to partitioning

      TextExtractionPE (several instances in parallel)

      ReRoutingPE
          Handles result events and load-balances them
          between Processing Nodes

      ClassificationPE (several instances in parallel)

      MergingPE
          Handles final result events and pushes
          them to the Indexer
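
The flow of an event through these stages can be sketched in a single process. Everything below is an assumption made for illustration rather than S4 code: the functions stand in for the PEs of the slide, hash-based partitioning is one plausible reading of "according to partitioning", and `NODES` is a hypothetical cluster size.

```python
import zlib

NODES = 2  # hypothetical number of Processing Nodes


def reroute(event, nodes=NODES):
    """ReRoutingPE stand-in: pick a node from a stable hash of the event key."""
    return zlib.crc32(event["key"].encode("utf-8")) % nodes


def text_extraction(event):
    """TextExtractionPE stand-in: pretend the payload is already UTF-8 text."""
    return {**event, "text": event["payload"].decode("utf-8")}


def classification(event):
    """ClassificationPE stand-in: a trivial rule instead of a trained model."""
    category = "Accounting" if "invoice" in event["text"].lower() else "Other"
    return {**event, "category": category}


class MergingPE:
    """Collects final events and pushes them to an indexer callback."""

    def __init__(self, indexer):
        self.indexer = indexer

    def process(self, event):
        self.indexer({"id": event["key"], "text": event["text"],
                      "category": event["category"]})


def run_pipeline(events, indexer):
    merging = MergingPE(indexer)
    for event in events:
        event = {**event, "node": reroute(event)}   # first ReRoutingPE
        event = text_extraction(event)
        event = {**event, "node": reroute(event)}   # second ReRoutingPE
        event = classification(event)
        merging.process(event)


indexed = []
run_pipeline(
    [{"key": "call-1", "payload": b"Invoice not received for order #2354E"}],
    indexed.append,
)
```

In a real cluster each stage would run on the node chosen by the routing step; here the chosen node is only recorded on the event to keep the sketch runnable on one machine.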
Some drawbacks

• The system is lossy
         Events may be lost when nodes are overloaded or during failure



• A workaround is to increase the size of the nodes’ incoming queues
         But still, events may be lost during failure



• Still experimental
         But very promising
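
The lossy behavior under overload can be illustrated with a bounded queue that drops events once it is full. This is a minimal sketch in plain Python, assuming drop-on-full semantics; `LossyInbox` is a hypothetical name and does not reflect how S4 implements its queues.

```python
import queue


class LossyInbox:
    """Bounded inbox that drops events when full instead of blocking the sender."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def offer(self, event):
        """Try to enqueue an event; return False and count it if the inbox is full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self):
        """Remove and return all queued events in arrival order."""
        events = []
        while True:
            try:
                events.append(self._queue.get_nowait())
            except queue.Empty:
                return events


inbox = LossyInbox(capacity=2)
accepted = [inbox.offer(n) for n in range(5)]
```

A larger capacity absorbs longer bursts before anything is dropped, which is the workaround mentioned above, but whatever is still queued on a node is gone if that node fails.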
More: Real-Time Inverted Search


                      MyCustomer                               Search

                    Sales                   Legal                     Accounting


                                     20 new results...


   Document     2010 Sales Report                                                  1 month ago
                ... MyCustomer: 12 M€ with 3 deals ...



                Phone Call                                                        3 seconds ago
   Phone Call   Customer: MyCustomer           Time: 9:55am     Duration: 13min
                Description: Invoice not received for order #2354E
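
One way to read "inverted search" is that queries are stored and every incoming document is matched against them, so an open result page can announce new results. The sketch below illustrates that reading in plain Python; it is an interpretation of the slide, and `InvertedSearch` and its methods are hypothetical names.

```python
class InvertedSearch:
    """Stores standing queries and matches each new document against them."""

    def __init__(self):
        self.queries = {}       # query id -> set of required terms
        self.new_results = {}   # query id -> number of matches not yet shown

    def register(self, query_id, query):
        self.queries[query_id] = set(query.lower().split())
        self.new_results[query_id] = 0

    def on_document(self, text):
        """Called for every freshly indexed document; returns matching query ids."""
        tokens = set(text.lower().replace(":", " ").split())
        matched = [qid for qid, terms in self.queries.items() if terms <= tokens]
        for qid in matched:
            self.new_results[qid] += 1
        return matched


search = InvertedSearch()
search.register("q1", "MyCustomer")
hits = search.on_document("Customer: MyCustomer Invoice not received for order #2354E")
misses = search.on_document("Unrelated memo")
```

The counter in `new_results` is what a page showing the `MyCustomer` query could display as "N new results...".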
Summary

• S4 is a nice processing system for real-time search
         Though events may be lost when nodes are overloaded or during failure



• Not only at indexing time, but also at query time!
         As S4 ensures low latency, query-time processing is possible



• A promising roadmap...
         Better failure handling, client API in major languages,
         initial processing with Hadoop, ...
Questions / Answers




                       ?
                      blog.xebia.fr
                      @mfiguiere

FOSDEM (feb 2011) - A real-time search engine with Lucene and S4
