Solr@
Things I’m not going to
     talk about:
     A/B Testing
        i18n
Continuous Deployment
About
 Us
Solr @ Etsy - Apache Lucene Eurocon
10+ Million Listings
     500 qps
Architecture
 Overview
Architecture Overview
Thrift
      struct Listing {
         1: i64 listing_id
     }

     struct ListingResults {
         1: i64 count,
         2: list<Listing> listings
     }

     service Search {
         ListingResults search(1:string query)
     }
Architecture Overview
Thrift
Generated Java server code:
 public class Search {

   public interface Iface {

     public ListingResults search(String query) throws TException;

   }

 }

Generated PHP client code:
 class SearchClient implements SearchIf {

    /**...**/
    public function search($query)
    {
      $this->send_search($query);
      return $this->recv_search();
    }
 }
Architecture Overview
Thrift
Why use Thrift?
    • Service Encapsulation
    • Reduced Network Traffic
Architecture Overview
Thrift
Why only return IDs?
    • Index Size
    • Easy to scale PK lookups
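A minimal sketch of the "IDs only" pattern, assuming the caller hydrates search hits from a primary-key store afterwards. The class name, the in-memory map standing in for the listings database, and the choice to skip IDs that are no longer in the store are all assumptions made for illustration, not details from the talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ListingHydrator {

    // Looks up each ID in the primary store, preserving search-result order.
    // IDs missing from the store are skipped (an assumed way to handle stale hits).
    static List<String> hydrate(List<Long> listingIds, Map<Long, String> store) {
        List<String> titles = new ArrayList<>();
        for (Long id : listingIds) {
            String title = store.get(id);
            if (title != null) {
                titles.add(title);
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        // Hypothetical data: a map standing in for the listings database.
        Map<Long, String> store = new HashMap<>();
        store.put(101L, "Silver ring");
        store.put(102L, "Wool scarf");

        // What the search service would hand back: IDs only, in ranked order.
        List<Long> searchHits = Arrays.asList(102L, 999L, 101L);
        System.out.println(hydrate(searchHits, store));
    }
}
```

The index only has to carry what is needed for matching and ranking; everything shown to the user comes from a lookup by primary key.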
The Search Server
Architecture Overview
Search Server


 • Identical Code + Hardware
 • Roles/Behavior controlled by Env variables
 • Single Java Process
 • Solr running as a Jetty Servlet
 • Thrift Servers
 • Smoker
Architecture Overview
Search Server




Master-specific processes:
 • Incremental Indexer
 • External File Field Updaters
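A rough sketch of how identical code could pick its role from the environment. The variable name SEARCH_ROLE, its values, and the default are hypothetical; the talk only says that roles and behavior are controlled by env variables.

```java
import java.util.Map;

class ServerRole {

    enum Role { MASTER, REPLICA }

    // SEARCH_ROLE is a made-up variable name used only for this sketch.
    // Anything other than "master" (including an unset variable) means replica.
    static Role fromEnv(Map<String, String> env) {
        String value = env.getOrDefault("SEARCH_ROLE", "replica");
        return "master".equalsIgnoreCase(value.trim()) ? Role.MASTER : Role.REPLICA;
    }

    public static void main(String[] args) {
        Role role = fromEnv(System.getenv());
        System.out.println("Starting search server as " + role);
        if (role == Role.MASTER) {
            // Only the master would start the extra processes listed above.
            System.out.println("Would also start: incremental indexer, external file field updaters");
        }
    }
}
```

Passing the environment in as a map keeps the decision testable without touching the real process environment.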
Load Balancing
Load Balancing
Thrift TSocketPool
Load Balancing
Server Affinity
Load Balancing
    Server Affinity Algorithm
// Inputs: $servers (array of search hosts) and $query (the user's query string)
$serversNew = array();   // ends up ordered per query, e.g. ["host2", "host3", "host1", "host4"]
$numServers = count($servers);

while ($numServers > 0) {
    // Take the first 4 chars of the md5sum of the server count
    // and the query, mod the available servers
    $key = hexdec(substr(md5($numServers . '+' . $query), 0, 4)) % $numServers;
    $keySet = array_keys($servers);
    $serverId = $keySet[$key];

    // Push the chosen server onto the new list and remove it
    // from the initial list
    array_push($serversNew, $servers[$serverId]);
    unset($servers[$serverId]);
    --$numServers;
}
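For readers who do not use PHP, here is a Java port of the same loop, offered as a sketch rather than Etsy's code. The class and method names are made up; the hashing step (first 4 hex characters of the MD5 of the remaining server count, a plus sign, and the query, modulo the remaining count) follows the PHP above.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ServerAffinity {

    // Lowercase hex MD5, zero-padded to 32 characters like PHP's md5().
    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Returns the servers reordered for this query: the first entry is the
    // preferred host, the rest are fallbacks in the order they would be tried.
    static List<String> orderFor(String query, List<String> servers) {
        List<String> remaining = new ArrayList<>(servers);
        List<String> ordered = new ArrayList<>(servers.size());
        while (!remaining.isEmpty()) {
            int n = remaining.size();
            // First 4 hex chars of md5(count + "+" + query), mod the remaining count.
            int key = Integer.parseInt(md5Hex(n + "+" + query).substring(0, 4), 16) % n;
            // Move the chosen server from the remaining list to the ordered list.
            ordered.add(remaining.remove(key));
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("host1", "host2", "host3", "host4");
        System.out.println("jewelry -> " + orderFor("jewelry", hosts));
        System.out.println("scarf   -> " + orderFor("scarf", hosts));
    }
}
```

The same query always produces the same ordering, so repeated queries land on the same host while it is up and move to a predictable fallback when it is not.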
Load Balancing
Server Affinity Algorithm
              $key = hexdec(substr(md5($query),0,4))




  “jewelry”                 [“host2”, “host3”, “host1”, “host4”]

  “scarf”                   [“host2”, “host3”, “host1”, “host4”]
Load Balancing
Server Affinity Algorithm
      $key = hexdec(substr(md5($numServers . '+' . $query),0,4))%(count($servers));




  “jewelry”                            [“host2”, “host3”, “host1”, “host4”]

  “scarf”                              [“host2”, “host1”, “host4”, “host3”]
Load Balancing
Server Affinity Results


      2%        20%
Load Balancing
Server Affinity Caveats

     • Stemming / Analysis
     • Be wary of query distribution
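One reading of the first caveat is that queries which analyze to the same terms should also hash to the same server, otherwise they never share a cache. The sketch below normalizes the raw query before it is used as the affinity key. The normalization steps (trim, lowercase, collapse whitespace) are an assumption and are much weaker than a real analysis chain with stemming.

```java
import java.util.Locale;

class AffinityKey {

    // Hypothetical normalization applied before hashing the query.
    // A real setup would need to mirror whatever the search analyzer does.
    static String normalize(String rawQuery) {
        return rawQuery.trim()
                .toLowerCase(Locale.ROOT)
                .replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Both variants now produce the same key, so they would hash identically.
        System.out.println("[" + normalize("  Silver   Ring ") + "]");
        System.out.println("[" + normalize("silver ring") + "]");
    }
}
```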
Replication
Replication
The Problem
Replication
Multicast Rsync?
[15:25]  <engineer> patrick: i'm gonna test multi-rsyncing some indexes
from host1 to host2 and host3 in prod. I'll be watching the graphs and
what not, but let me know if you see anything funky with the network
[15:26]  <patrick> ok
....

[15:31]  <keyur> is the site down?
Replication
Multicast Rsync?
Hmm...Bit Torrent?
Replication
Bit Torrent POC
Using BitTornado:
Replication
Bit Torrent + Solr
Fork of ttorrent: https://github.com/etsy/ttorrent

                            Multi-File Support
                       Performance Enhancements
Replication
Bit Torrent + Solr
Solr InterOp
QParsers
“writing query strings
   is for suckers”
Solr InterOp
QParsers

  http://host:8393/solr/person/select/?q=_query_:%22{!dismax
  %20qf=$fqf%20v=$fnq}%22%20OR%20(_query_:%22{!dismax%20qf=$fiqf
  %20v=$fiq}%22%20AND%20(_query_:%22{!dismax%20qf=$lwqf%20v=$lwq}
  %22%20OR%20_query_:%22{!dismax%20qf=$lqf%20v=$lq}%20%22))
  &fnq=%22giovanni%20fernandez-kincade%22
  &fqf=full_name^4
  &fiq=giovanni
  &fiqf=first_name^2.0%20first_name_syn
  &qt=standard
  &lwq=fernandez-kincade*
  &lwqf=last_name
  &lq=fernandez-kincade
  &lqf=last_name^3
Solr InterOp
QParsers

   http://host:8393/solr/person/select/?q={!personrealqp}giovanni%20fernandez-kincade
Solr InterOp
QParsers

 class PersonNameRealQParser extends QParser {
   public PersonNameRealQParser(String qstr, SolrParams localParams,
       SolrParams params, SolrQueryRequest req) {
     super(qstr, localParams, params, req);
   }
Solr InterOp
   QParsers
  @Override
  public Query parse() throws ParseException {
    TermQuery exactFullNameQuery = new TermQuery(new Term("full_name", qstr));
    exactFullNameQuery.setBoost(4.0f);

    String[] userQueryTerms = qstr.split("\\s+");
    Query firstLastQuery = null;

    if (2 == userQueryTerms.length)
      firstLastQuery = parseAsFirstAndLast(userQueryTerms[0], userQueryTerms[1]);
    else
      firstLastQuery = parseAsFirstOrLast(userQueryTerms);

    DisjunctionMaxQuery realNameQuery = new DisjunctionMaxQuery(0);
    realNameQuery.add(exactFullNameQuery);
    realNameQuery.add(firstLastQuery);

    return realNameQuery;
  }
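The branching in parse() can be tried without Lucene on the classpath. The sketch below builds a readable string instead of Query objects. The helpers parseAsFirstAndLast and parseAsFirstOrLast are not shown in the deck, so the way the two branches combine first_name and last_name here is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class NameQuerySketch {

    // Describes the query that would be built, as plain text.
    static String describe(String qstr) {
        // Exact match on the whole string, boosted like the TermQuery above.
        String exact = "full_name:\"" + qstr + "\"^4.0";

        // "\\s+" is the regex \s+ : split on runs of whitespace.
        String[] terms = qstr.trim().split("\\s+");

        String firstLast;
        if (terms.length == 2) {
            // Exactly two terms: treat them as first name, then last name.
            firstLast = "(first_name:" + terms[0] + " AND last_name:" + terms[1] + ")";
        } else {
            // Otherwise let every term match either field (assumed behavior).
            List<String> clauses = new ArrayList<>();
            for (String term : terms) {
                clauses.add("first_name:" + term);
                clauses.add("last_name:" + term);
            }
            firstLast = "(" + String.join(" OR ", clauses) + ")";
        }

        // With a tie-breaker of 0, a DisjunctionMaxQuery scores by its best clause.
        return "dismax(" + exact + " | " + firstLast + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe("giovanni fernandez-kincade"));
        System.out.println(describe("giovanni"));
    }
}
```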
Solr InterOp
QParsers
The QParserPlugin that returns our new QParser:
  public class PersonNameRealQParserPlugin extends QParserPlugin {
   public static final String NAME = "personrealqp";

   @Override
   public void init(NamedList args) {}

   @Override
   public QParser createParser(String qstr, SolrParams localParams,
       SolrParams params, SolrQueryRequest req) {
     return new PersonNameRealQParser(qstr, localParams, params, req);
   }
 }
Solr InterOp
QParsers

Registering the plugin in solrconfig.xml:

   <queryParser name="personrealqp"
      class="com.etsy.person.solr.PersonNameRealQParserPlugin" />
Custom Stemmer
Solr InterOp
Custom Stemmer

 banded, banding, birding, bouldering, bounded, buffing, bundler, canning,
carded, circled, coupler, dangler, doubler, firring, foiling, hooper, japanned,
lipped, napped, papered, pebbled, pitted, pocketed, reductive, ricer, rooter,
roper, seeded, shouldered, silvered, skinning, spindling, staining, stitcher,
                      strapped, threaded, yellowing
Solr InterOp
Custom Stemmer
First we extend KStemmer and intercept stem calls:
  public class LStemmer extends KStemmer {

     /**.....**/

      @Override
      String stem(String term) {
          String override = overrideStemTransformations.get(term);
          if(override != null) return override;
          return super.stem(term);
      }
  }
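The override pattern itself does not depend on KStemmer. A self-contained sketch, assuming a toy suffix-stripping fallback in place of the real stemmer and a hypothetical override entry (the deck does not show what the listed words are mapped to):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

class OverrideStemmer {

    private final Map<String, String> overrides;
    private final UnaryOperator<String> fallback;

    OverrideStemmer(Map<String, String> overrides, UnaryOperator<String> fallback) {
        this.overrides = overrides;
        this.fallback = fallback;
    }

    // Check the override table first; only unknown terms reach the real stemmer.
    String stem(String term) {
        String override = overrides.get(term);
        return override != null ? override : fallback.apply(term);
    }

    // Toy stand-in for KStemmer: strips a trailing "ing" or "ed". Not a real stemmer.
    static String toyStem(String term) {
        if (term.endsWith("ing") && term.length() > 5) {
            return term.substring(0, term.length() - 3);
        }
        if (term.endsWith("ed") && term.length() > 4) {
            return term.substring(0, term.length() - 2);
        }
        return term;
    }

    public static void main(String[] args) {
        Map<String, String> overrides = new HashMap<>();
        overrides.put("banded", "banded");   // hypothetical: keep this word unstemmed
        OverrideStemmer stemmer = new OverrideStemmer(overrides, OverrideStemmer::toyStem);
        System.out.println(stemmer.stem("banded"));   // taken from the override table
        System.out.println(stemmer.stem("painted"));  // falls through to the toy stemmer
    }
}
```

Keeping the overrides in a map means problem words can be corrected one at a time without touching the stemming rules.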
Solr InterOp
 Custom Stemmer
Then create a TokenFilter that uses the new Stemmer:
 final class LStemFilter extends TokenFilter {

   /**.....**/
   protected LStemFilter(TokenStream input, int cacheSize) {
     super(input);
     stemmer = new LStemmer(cacheSize);
   }

   @Override
   public boolean incrementToken() throws IOException {
     /**....**/
   }
 }
Solr InterOp
Custom Stemmer
Create a FilterFactory that exposes it:
 public class LStemFilterFactory extends BaseTokenFilterFactory {
   private int cacheSize = 20000;

   @Override
   public void init(Map<String, String> args) {
     super.init(args);
     String cacheSizeStr = args.get("cacheSize");
     if (cacheSizeStr != null) {
       cacheSize = Integer.parseInt(cacheSizeStr);
     }
   }

   @Override
   public TokenStream create(TokenStream in) {
     return new LStemFilter(in, cacheSize);
   }
 }
Solr InterOp
Custom Stemmer
And finally plug it into your analysis chain:

 <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
       words="solr/common/conf/stopwords.txt"/>
    <filter class="com.etsy.solr.analysis.LStemFilterFactory" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
Thanks!
