Leveraging Solr and Mahout for Next Gen Data Access and Insight

Grant Ingersoll
Chief Scientist

Confidential © Copyright 2012
Search is Dead, Long Live Search

• Modern data challenges are multi-structured

• Search is a system building block
    - Text is only a part of the story

• If the algorithms fit, use them!

• Embrace fuzziness!

• Scoring features are everywhere

[Diagram labels: Content, Relationships, Users, Access]

Confidential and Proprietary
© 2012 LucidWorks
Topics

    • Intros

    • Search (R)Evolution

    • Apache Solr
    • Apache Mahout

    • Search and Machine Learning

    • Scaling


Grant’s Background

• Co-founder:
    - LucidWorks – Chief Scientist
    - Apache Mahout
• Long-time Lucene/Solr committer
• Author: Taming Text
    - www.manning.com/ingersoll
• Background in IR and NLP
    - Built CLIR, QA and a variety of other search-based apps




Search (R)evolution

• Search use leads to search abuse
    - Denormalization frees your mind
    - Scoring is just a sparse matrix multiply

• Lucene/Solr evolution
    - Non-free text usages abound
    - Many DB-like features
    - NoSQL before NoSQL was cool
    - Flexible indexing
    - Finite State Transducers FTW!

• Scale

• “This ain’t your father’s relevance anymore”
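
The "scoring is just a sparse matrix multiply" point can be made concrete with a small sketch. It assumes raw term counts as weights, which is a simplification and not Lucene's actual similarity formula; the function names `build_index` and `score` are hypothetical. The inverted index plays the role of a sparse document-term matrix, and scoring multiplies it by a sparse query vector.

```python
from collections import defaultdict


def build_index(docs):
    """Build a sparse term -> {doc_id: count} mapping (an inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def score(index, query):
    """Multiply the sparse doc-term matrix by the sparse query vector."""
    query_vec = defaultdict(int)
    for term in query.lower().split():
        query_vec[term] += 1

    scores = defaultdict(float)
    # Only non-zero entries are visited, which is what makes this sparse.
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    return dict(scores)


docs = [
    "solr search server",
    "mahout machine learning",
    "search and machine learning",
]
index = build_index(docs)
scores = score(index, "machine search")
```

Swapping the raw counts for TF-IDF or other weights changes the matrix entries but not the shape of the computation.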

Apache Solr?

• “Solr is an open source enterprise search server based
  on the Lucene Java search library, with XML/HTTP and
  JSON APIs, hit highlighting, faceted search, caching,
  replication, a web administration interface and many
  more features. It runs in a Java servlet container such
  as Tomcat.”
    - http://lucene.apache.org/solr


• Did I mention free?
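
As a rough illustration of the HTTP and JSON APIs mentioned in the quote, the sketch below builds a faceted query URL for Solr's `/select` handler. The host and port, the core name `collection1`, and the field names `title` and `category` are assumptions chosen for illustration; `solr_select_url` is a hypothetical helper, not part of Solr.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core, query, facet_field=None, rows=10):
    """Build a URL for Solr's /select handler that asks for a JSON response."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_field:
        # Ask Solr to also return value counts for one field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Assumed local Solr instance and schema; adjust to your own setup.
url = solr_select_url(
    "http://localhost:8983/solr",
    "collection1",
    "title:mahout",
    facet_field="category",
)
```

Fetching the URL with `urllib.request.urlopen` and parsing the body with `json.load` would return the matching documents along with the facet counts, provided a Solr instance with that core and schema is running.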




Apache Mahout

• Goal: create library of scalable machine learning
  algorithms

• Mahout’s 3 “C”s provide tools for helping across many
  aspects of discovery
    - Collaborative Filtering
    - Classification
    - Clustering
• Also:
    - Collocations (Statistically Interesting Phrases)
    - SVD
    - Java math, primitives libraries and more
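
To give a feel for how "statistically interesting phrases" can be found, here is a minimal sketch of the log-likelihood ratio (G-test) on a 2x2 table of bigram counts, a statistic commonly used for collocations. It is written from the textbook formula and is not Mahout's code; the function name and the example counts are made up for illustration.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G-test statistic for a 2x2 contingency table of bigram counts.

    k11: A and B together, k12: A without B,
    k21: B without A,      k22: neither A nor B.
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))

    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:
                # Expected count if the two words were independent.
                expected = rows[i] * cols[j] / total
                g += o * math.log(o / expected)
    return 2.0 * g


independent = log_likelihood_ratio(10, 10, 10, 10)
associated = log_likelihood_ratio(100, 5, 5, 1000)
```

A score near zero means the two words co-occur about as often as chance predicts; a large score flags a candidate phrase.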

Search + Machine Learning

• Search-driven applications present multiple
  opportunities for leveraging machine learning
    - Clustering – Enhance Discovery, outlier detection
    - Classification – Queries, Documents, Users
    - Content Recommendation – Collab. Filtering and
      personalization
    - NLP – phrases, named entities, co-reference, much more


• Many of these can also power faceted navigation

• Aside: Search can also often be used effectively to
  implement many machine learning algorithms
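
The aside about using search to implement machine learning can be sketched as a k-nearest-neighbour classifier: treat the text to classify as a query, retrieve the top-k labeled documents, and let them vote. The term-overlap scoring below is a stand-in for a real search engine's ranking, and the tiny labeled set and function names are hypothetical.

```python
from collections import Counter


def overlap_score(query_terms, doc_terms):
    """Score a document by how often it contains each distinct query term."""
    doc_counts = Counter(doc_terms)
    return sum(doc_counts[t] for t in set(query_terms))


def knn_classify(labeled_docs, text, k=3):
    """Classify text by majority vote over the k best-matching documents."""
    query_terms = text.lower().split()
    scored = [
        (overlap_score(query_terms, doc.lower().split()), label)
        for doc, label in labeled_docs
    ]
    # "Search" step: keep the k highest-scoring hits.
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = Counter(label for hit_score, label in top if hit_score > 0)
    return votes.most_common(1)[0][0] if votes else None


labeled_docs = [
    ("solr faceted search query", "search"),
    ("lucene index query relevance", "search"),
    ("mahout clustering kmeans", "ml"),
    ("classification model training", "ml"),
]
label = knn_classify(labeled_docs, "relevance of a search query", k=2)
```

In a real system the retrieval step would be a query against the search index, so the classifier inherits the engine's scaling and relevance tuning.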

How and When
• Access APIs
    - Search View
    - Analytic Services: view into numeric/historic data
    - Personalization & Machine Learning Services: classification, recommendation

• Shards: 1, 2, 3 ... N

• Discovery & Enrichment
    - Clustering, classification, NLP, topic identification,
      search log analysis, user behavior

• Document Store
    - Documents
    - Users
    - Logs

• Classification Models
    - In memory
    - Replicated
    - Multi-tenant

• Content Acquisition
    - ETL, batch or near real-time

• Data
    - LucidWorks Search connectors
    - Push


Scaling

• Search
    - SolrCloud = Large-scale, distributed search and faceting
          » http://wiki.apache.org/solr/SolrCloud


• Machine Learning
    - Mahout is built on Hadoop for most things
    - SGD is sequential and really fast


• Sometimes all you can do is make an educated guess
    - Storm, Kafka, etc. can help by allowing you to make estimates in
      near real time
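
The remark that SGD is sequential and fast can be illustrated with a minimal sketch of logistic regression trained one example at a time. This is plain Python written for illustration, not Mahout's implementation; the learning rate, epoch count, and toy data are arbitrary assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_logistic(examples, n_features, learning_rate=0.5, epochs=50):
    """Train logistic regression with stochastic gradient descent.

    Each example updates the model immediately, so training is a single
    sequential pass over the data per epoch with constant memory.
    """
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            z = bias + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(z) - label  # gradient of the log loss w.r.t. z
            bias -= learning_rate * error
            weights = [
                w - learning_rate * error * x
                for w, x in zip(weights, features)
            ]
    return weights, bias


def predict(weights, bias, features):
    """Return the predicted probability of the positive class."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


# Toy, linearly separable data: label 1 when the first feature dominates.
examples = [
    ([0.0, 1.0], 0),
    ([1.0, 0.0], 1),
    ([0.2, 0.9], 0),
    ([0.9, 0.1], 1),
]
weights, bias = sgd_logistic(examples, n_features=2)
```

Because the model is updated per example, the same loop can consume a stream of events, which fits the near-real-time estimation mentioned above.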



Wrap

• Search, Discovery and Analytics, when combined into
  a single, coherent system, provide powerful insight into
  both your content and your users

• LucidWorks has combined many of these things into
  LucidWorks Big Data
    - http://www.lucidworks.com/products/lucidworks-big-data

• Design for the big picture when building search-based
  applications



Resources

• LucidWorks
    - http://www.lucidworks.com
    - http://www.lucidworks.com/products/lucidworks-big-data
    - @LucidImagineer

• Me
    - grant@lucidworks.com
    - @gsingers


• Taming Text
    - http://www.manning.com/ingersoll
    - http://www.tamingtext.com
    - @tamingtext




Editor's Notes

  • #3: This is a money slide where people should say “Wow, man.” They shouldn’t understand the implications of this, but they should be very, very aware that something big just slid into the room. Tech Building Block: not just text; not just users + queries. Embrace Fuzziness: especially in Big Data, it is the only way you are going to survive. TED: I think that this should make the case for advanced that is still search at its heart. The idea that search can be radically changed should be on the next slide.
  • #6: Search Abuse: Can discuss how I started just doing free text, but then a curious thing happened: I started to see people using the engine for things like key/value, denormalized DBs, browsing engines, plagiarism detection, teaching languages, record linkage and much, much more. Search has added more DB features over the years. TED: We need to introduce the idea of *REVOLUTION* somewhere in here.
  • #12: Big Picture: too often devs are stuck in the weeds