Solr @ Etsy - Apache Lucene Eurocon

Things I’m not going to talk about:
A/B Testing
i18n
Continuous Deployment

10+ Million Listings
500 qps

Architecture Overview
Thrift
struct Listing {
    1: i64 listing_id
}

struct ListingResults {
    1: i64 count,
    2: list<Listing> listings
}

service Search {
    ListingResults search(1:string query)
}

Thrift
Generated Java server code:
public class Search {

    public interface Iface {

        public ListingResults search(String query) throws TException;

    }

    /**...**/
}

Generated PHP client code:
class SearchClient implements SearchIf {

    /**...**/
    public function search($query)
    {
        $this->send_search($query);
        return $this->recv_search();
    }

    /**...**/
}

Thrift
Why use Thrift?
• Service Encapsulation
• Reduced Network Traffic

Thrift
Why only return IDs?
• Index Size
• Easy to scale PK lookups

Search Server

• Identical Code + Hardware
• Roles/Behavior controlled by Env variables
• Single Java Process
• Solr running as a Jetty Servlet
• Thrift Servers
• Smoker

Search Server

Master-specific processes:
• Incremental Indexer
• External File Field Updaters
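
As a rough illustration of identical code whose role is chosen by an environment variable, here is a minimal Java sketch. The variable name SEARCH_ROLE, its values, and the component names are hypothetical; the slides do not say what the real variables are called.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: every node runs the same code, and an environment
// variable decides which extra components are started.
final class SearchServerRoles {

    // Components every node runs, following the bullets above.
    static final List<String> COMMON =
            Arrays.asList("solr", "thrift-servers", "smoker");

    // Components that only the master runs.
    static final List<String> MASTER_ONLY =
            Arrays.asList("incremental-indexer", "external-file-field-updaters");

    // SEARCH_ROLE is an assumed variable name, not taken from the talk.
    static List<String> componentsFor(Map<String, String> env) {
        String role = env.getOrDefault("SEARCH_ROLE", "slave");
        List<String> components = new ArrayList<>(COMMON);
        if ("master".equals(role)) {
            components.addAll(MASTER_ONLY);
        }
        return components;
    }

    public static void main(String[] args) {
        // Uses the real process environment; prints the components for this node.
        System.out.println(componentsFor(System.getenv()));
    }
}
```

Passing the environment in as a map keeps the role decision easy to test without touching the real process environment.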

Load Balancing
Thrift TSocketPool

Load Balancing
Server Affinity

Load Balancing
Server Affinity Algorithm
// $servers (the list of search hosts) and $query are assumed to be set already.
// Server list shown on the slide: ["host2", "host3", "host1", "host4"]
$serversNew = array();
$numServers = count($servers);

while ($numServers > 0) {
    // Take the first 4 chars of the md5sum of the server count
    // and the query, mod the available servers
    $key = hexdec(substr(md5($numServers . '+' . $query), 0, 4)) % $numServers;
    $keySet = array_keys($servers);
    $serverId = $keySet[$key];

    // Push the chosen server onto the new list and remove it
    // from the initial list
    array_push($serversNew, $servers[$serverId]);
    unset($servers[$serverId]);
    --$numServers;
}
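
For readers who want to experiment with the ordering outside PHP, the loop above can be sketched in Java as follows. This is a port under assumptions, not code from the talk: the class and method names are made up, and the query is hashed as UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical Java port of the PHP server affinity loop.
final class ServerAffinity {

    // Equivalent of hexdec(substr(md5($input), 0, 4)): the first four hex
    // characters of the digest are its first two bytes.
    static int hashPrefix(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            return ((digest[0] & 0xFF) << 8) | (digest[1] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Returns the servers in the order a client would try them for this query.
    static List<String> order(List<String> servers, String query) {
        List<String> remaining = new ArrayList<>(servers);
        List<String> ordered = new ArrayList<>(servers.size());
        while (!remaining.isEmpty()) {
            int numServers = remaining.size();
            // Hash the remaining-server count together with the query,
            // then pick one of the remaining servers.
            int key = hashPrefix(numServers + "+" + query) % numServers;
            ordered.add(remaining.remove(key));
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> servers = Arrays.asList("host1", "host2", "host3", "host4");
        System.out.println("jewelry " + order(servers, "jewelry"));
        System.out.println("scarf   " + order(servers, "scarf"));
    }
}
```

The same query always yields the same ordering, so repeated queries land on the same server, while the fallback order is re-hashed at every step rather than fixed.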

Load Balancing
$key = hexdec(substr(md5($query),0,4))

“jewelry” [“host2”, “host3”, “host1”, “host4”]

“scarf” [“host2”, “host3”, “host1”, “host4”]

Load Balancing
$key = hexdec(substr(md5($numServers . '+' . $query),0,4))%(count($servers));

“jewelry” [“host2”, “host3”, “host1”, “host4”]

“scarf” [“host2”, “host1”, “host4”, “host3”]

Load Balancing
Server Affinity Results

2% 20%

Load Balancing
Server Affinity Caveats

• Stemming / Analysis
• Be wary of query distribution
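
One possible reading of the first caveat is that queries which analyze to the same terms should also hash to the same server, so the string fed to the hash may need normalizing first. The sketch below is an assumption on that reading, not something shown in the talk: the class name is hypothetical, and it only trims, lowercases, and collapses whitespace, which is far less than a real analysis chain does.

```java
import java.util.Locale;

// Hypothetical helper: builds the string that would be hashed for affinity.
final class AffinityKey {

    // Assumed normalization: trim, lowercase, collapse runs of whitespace.
    // Stemming is deliberately not applied here.
    static String normalize(String query) {
        return query.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(normalize("  Silver   Ring "));
    }
}
```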

Replication
Multicast Rsync?
[15:25] <engineer> patrick: i'm gonna test multi-rsyncing some indexes
from host1 to host2 and host3 in prod. I'll be watching the graphs and
what not, but let me know if you see anything funky with the network
[15:26] <patrick> ok
....

[15:31] <keyur> is the site down?

Replication
Bit Torrent POC
Using BitTornado:

Replication
Bit Torrent + Solr
Fork of ttorrent: https://github.com/etsy/ttorrent

Multi-File Support
Performance Enhancements

Replication
Bit Torrent + Solr

“writing query strings
is for suckers”

Solr InterOp
QParsers

http://host:8393/solr/person/select/?q=_query_:%22{!dismax%20qf=$fqf%20v=$fnq}%22%20OR%20(_query_:%22{!dismax%20qf=$fiqf%20v=$fiq}%22%20AND%20(_query_:%22{!dismax%20qf=$lwqf%20v=$lwq}%22%20OR%20_query_:%22{!dismax%20qf=$lqf%20v=$lq}%20%22))&fnq=%22giovanni%20fernandez-kincade%22&fqf=full_name^4&fiq=giovanni&fiqf=first_name^2.0%20first_name_syn&qt=standard&lwq=fernandez-kincade*&lwqf=last_name&lq=fernandez-kincade&lqf=last_name^3

Solr InterOp
QParsers

http://host:8393/solr/person/select/?q={!personrealqp}giovanni%20fernandez-kincade

Solr InterOp
QParsers

class PersonNameRealQParser extends QParser {
   public PersonNameRealQParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
     super(qstr, localParams, params, req);
   }

Solr InterOp
QParsers
  @Override
  public Query parse() throws ParseException {
    // An exact match on the full name gets the highest boost.
    TermQuery exactFullNameQuery = new TermQuery(new Term("full_name", qstr));
    exactFullNameQuery.setBoost(4.0f);

    // Split the user query on runs of whitespace.
    String[] userQueryTerms = qstr.split("\\s+");
    Query firstLastQuery = null;

    if (2 == userQueryTerms.length)
      firstLastQuery = parseAsFirstAndLast(userQueryTerms[0], userQueryTerms[1]);
    else
      firstLastQuery = parseAsFirstOrLast(userQueryTerms);

    DisjunctionMaxQuery realNameQuery = new DisjunctionMaxQuery(0);
    realNameQuery.add(exactFullNameQuery);
    realNameQuery.add(firstLastQuery);

    return realNameQuery;
  }

  // parseAsFirstAndLast(...) and parseAsFirstOrLast(...) are not shown on the slides.
}
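
The branch on the number of whitespace-separated terms can be tried in isolation, without Solr on the classpath. The sketch below is only an illustration: the class and enum names are hypothetical, and it reports which helper would be chosen rather than building a Lucene query.

```java
// Hypothetical stand-alone sketch of the term-count branch in parse().
final class NameQueryShape {

    enum Shape { FIRST_AND_LAST, FIRST_OR_LAST }

    // Exactly two terms are treated as "first last"; anything else falls
    // back to matching each term against first or last name.
    static Shape shapeOf(String qstr) {
        String[] userQueryTerms = qstr.trim().split("\\s+");
        return userQueryTerms.length == 2 ? Shape.FIRST_AND_LAST : Shape.FIRST_OR_LAST;
    }

    public static void main(String[] args) {
        System.out.println(shapeOf("giovanni fernandez-kincade"));
        System.out.println(shapeOf("giovanni"));
    }
}
```

Note that the regular expression must be written as "\\s+" in Java source; a pattern of "s+" would split on the letter s instead of on whitespace.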

Solr InterOp
QParsers
The QParserPlugin that returns our new QParser:
public class PersonNameRealQParserPlugin extends QParserPlugin {
   public static final String NAME = "personrealqp";

   @Override
   public void init(NamedList args) {}

   @Override
   public QParser createParser(String qstr, SolrParams localParams,
                               SolrParams params, SolrQueryRequest req) {
     return new PersonNameRealQParser(qstr, localParams, params, req);
   }
}

Solr InterOp
QParsers

Registering the plugin in solrconfig.xml:

<queryParser name="personrealqp"
class="com.etsy.person.solr.PersonNameRealQParserPlugin" />

Solr InterOp
Custom Stemmer

banded, banding, birding, bouldering, bounded, buffing, bundler, canning,
carded, circled, coupler, dangler, doubler, firring, foiling, hooper, japanned,
lipped, napped, papered, pebbled, pitted, pocketed, reductive, ricer, rooter,
roper, seeded, shouldered, silvered, skinning, spindling, staining, stitcher,
strapped, threaded, yellowing

Solr InterOp
Custom Stemmer
First we extend KStemmer and intercept stem calls:
public class LStemmer extends KStemmer {

/**.....**/

     @Override
     String stem(String term) {
         String override = overrideStemTransformations.get(term);
         if(override != null) return override;
         return super.stem(term);
     }
}
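
The override-then-delegate pattern can be sketched without Lucene on the classpath. Everything below is illustrative: the class name is hypothetical, the fallback is a toy suffix stripper standing in for KStemmer, and the override entries are made-up examples rather than Etsy's real table.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: consult an override table first, then fall back to
// the underlying stemmer.
final class OverrideStemmer {

    private final Map<String, String> overrides;
    private final UnaryOperator<String> fallback;

    OverrideStemmer(Map<String, String> overrides, UnaryOperator<String> fallback) {
        this.overrides = new HashMap<>(overrides);
        this.fallback = fallback;
    }

    String stem(String term) {
        String override = overrides.get(term);
        if (override != null) return override;
        return fallback.apply(term);
    }

    public static void main(String[] args) {
        // Toy stand-in for a real stemmer: strips a trailing "ing".
        UnaryOperator<String> toy =
                t -> t.endsWith("ing") ? t.substring(0, t.length() - 3) : t;

        Map<String, String> overrides = new HashMap<>();
        overrides.put("bouldering", "bouldering"); // made-up entry: keep the term as-is

        OverrideStemmer stemmer = new OverrideStemmer(overrides, toy);
        System.out.println(stemmer.stem("bouldering"));
        System.out.println(stemmer.stem("painting"));
    }
}
```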

Solr InterOp
Custom Stemmer
Then create a TokenFilter that uses the new Stemmer:
final class LStemFilter extends TokenFilter {

    /**.....**/
    protected LStemFilter(TokenStream input, int cacheSize) {
        super(input);
        stemmer = new LStemmer(cacheSize);
    }

    @Override
    public boolean incrementToken() throws IOException {
        /**....**/
    }
}

Solr InterOp
Custom Stemmer
Create a FilterFactory that exposes it:
public class LStemFilterFactory extends BaseTokenFilterFactory {
    private int cacheSize = 20000;

    @Override
    public void init(Map<String, String> args) {
        super.init(args);
        String cacheSizeStr = args.get("cacheSize");
        if (cacheSizeStr != null) {
            cacheSize = Integer.parseInt(cacheSizeStr);
        }
    }

    @Override
    public TokenStream create(TokenStream in) {
        return new LStemFilter(in, cacheSize);
    }
}

Solr InterOp
Custom Stemmer
And finally plug it into your analysis chain:

<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="solr/common/conf/stopwords.txt"/>
<filter class="com.etsy.solr.analysis.LStemFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
