An Introduction to DSE Search
Caleb Rackliffe
Software Engineer
caleb.rackliffe@datastax.com
@calebrackliffe
What problem were we trying to solve?
Application
DataStax Driver
SELECT * FROM customers WHERE country LIKE '%land%';
What about secondary indexes?
Why not just create your own secondary index
implementation that supports wildcard queries?
I need full-text search!
DataStax: An Introduction to DataStax Enterprise Search
Why did we build something new?
Application
DataStax Driver Solr Client
Polyglot Persistence!
Application
DataStax Driver Solr Client
Consistency
Cost
Complexity
partitioning
multi-DC
replication
geospatial
wildcards
monitoring
C* field type support (UDT, Tuple, collections)
security
live indexing
sorting
faceting
fault-tolerant distributed search
caching
text analysis
grouping
automatic index updates
JVM
CQL
repair
Application
DataStax Driver Solr Client
Consistency
Complexity
Cost
How about some examples?
Creating a Solr Core
Start a node…
bash$ dse cassandra -s
Create a table…
cqlsh> CREATE KEYSPACE test
WITH replication = {'class': 'NetworkTopologyStrategy', 'Solr':1};
cqlsh:test> CREATE TABLE test.user(username text PRIMARY KEY,
fullname text,
address_ map<text, text>);
Create the core…
bash$ dsetool create_core test.user generateResources=true
bash$ dsetool get_core_schema test.user
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
<types>
<fieldType class="org.apache.solr.schema.TextField" name="text">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType class="org.apache.solr.schema.StrField" name="string"/>
</types>
<fields>
<field indexed="true" name="username" stored="true" type="string"/>
<field indexed="true" name="fullname" stored="true" type="text"/>
<dynamicField indexed="true" name="address_*" stored="true" type="string"/>
</fields>
<uniqueKey>username</uniqueKey>
</schema>
The Schema
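As a rough illustration of what the analyzer chain in the generated schema does to a `text` field such as `fullname`, here is a small Python approximation. It is only a sketch: `analyze_text` and `analyze_string` are hypothetical helper names, and the regular expression merely stands in for `solr.StandardTokenizerFactory`, which follows Unicode word-break rules rather than a simple pattern.

```python
import re


def analyze_text(value):
    """Approximate the `text` field type: split into word tokens, then lowercase."""
    tokens = re.findall(r"\w+", value)   # stand-in for StandardTokenizerFactory
    return [t.lower() for t in tokens]   # stand-in for LowerCaseFilterFactory


def analyze_string(value):
    """Approximate the `string` field type: the whole value is one unmodified term."""
    return [value]


print(analyze_text("Sergio Bossa"))   # ['sergio', 'bossa']
print(analyze_string("sbtourist"))    # ['sbtourist']
```

This is why a `text` field can match on a single lowercased word, while a `string` field matches only on the exact value (or a wildcard over it).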
Insert Rows (…and Index Documents)
cqlsh:test> INSERT INTO user(username, fullname, address_)
VALUES('sbtourist', 'Sergio Bossa', {'address_home' : 'UK', 'address_work' : 'UK'});
cqlsh:test> INSERT INTO user(username, fullname, address_)
VALUES('bereng', 'Berenguer Blasi', {'address_home' : 'ES', 'address_work' : 'ES'});
cqlsh:test> INSERT INTO user(username, fullname, address_)
VALUES('thegrinch', 'Sven Delmas', {'address_home' : 'US', 'address_work' : 'HQ'});
…and that’s it. No ETL. No writing to a second datastore.
Wildcards
cqlsh:test> SELECT username, address_
FROM user
WHERE solr_query='{"q":"address_home:U*"}';
username | address_
-----------+----------------------------------------------------
sbtourist | {'address_home': 'UK', 'address_work': 'UK'}
thegrinch | {'address_home': 'US', 'address_work': 'HQ'}
(2 rows)
Sorting and Limits
cqlsh:test> SELECT username, address_
FROM user
WHERE solr_query='{"q":"*:*", "sort":"address_home desc"}';
username | address_
-----------+----------------------------------------------------
thegrinch | {'address_home': 'US', 'address_work': 'HQ'}
sbtourist | {'address_home': 'UK', 'address_work': 'UK'}
bereng | {'address_home': 'ES', 'address_work': 'ES'}
(3 rows)
cqlsh:test> SELECT username, address_
FROM user
WHERE solr_query='{"q":"*:*", "sort":"address_home desc"}'
LIMIT 1;
username | address_
-----------+----------------------------------------------------
thegrinch | {'address_home': 'US', 'address_work': 'HQ'}
(1 rows)
Faceting
cqlsh:test> SELECT *
FROM user
WHERE solr_query='{"q":"*:*", "facet":{"field" : "address_work"}}';
facet_fields
--------------------------------------------
{"address_work" : {"ES" : 1 , "HQ" : 1 , "UK" : 1}}
(1 rows)
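The wildcard, sorting, and faceting queries above all pass a small JSON document as the `solr_query` value. An application would usually assemble that JSON in code rather than by hand. The Python sketch below shows one way to do it; `build_solr_query` and `build_select` are hypothetical helpers, not part of any DataStax driver, and only the `q`, `sort`, and `facet` keys seen on the slides are covered.

```python
import json


def build_solr_query(q, sort=None, facet_field=None):
    """Return the JSON string to pass as the solr_query value."""
    body = {"q": q}
    if sort is not None:
        body["sort"] = sort
    if facet_field is not None:
        body["facet"] = {"field": facet_field}
    return json.dumps(body)


def build_select(columns, table, solr_query, limit=None):
    """Embed the JSON in a CQL SELECT; CQL escapes a single quote by doubling it."""
    escaped = solr_query.replace("'", "''")
    cql = f"SELECT {columns} FROM {table} WHERE solr_query='{escaped}'"
    if limit is not None:
        cql += f" LIMIT {int(limit)}"
    return cql + ";"


query = build_solr_query("*:*", sort="address_home desc")
print(build_select("username", "test.user", query, limit=1))
# SELECT username FROM test.user WHERE solr_query='{"q": "*:*", "sort": "address_home desc"}' LIMIT 1;
```

Serializing with `json.dumps` avoids hand-written quoting mistakes inside the JSON, and doubling single quotes keeps the surrounding CQL string literal valid.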
Partition Restrictions
cqlsh:test> CREATE TABLE event(sensor_id bigint,
recording_time timestamp,
description text,
PRIMARY KEY(sensor_id, recording_time));
…
cqlsh:test> SELECT recording_time, description
FROM test.event
WHERE sensor_id = 2314234432
AND solr_query='description:unremarkable';
What do the internals look like?
Indexing
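[Diagram, shown on two consecutive slides: indexed documents pass through three stages, Buffered, Searchable, and Durable, laid out across Memory and Disk.]
[Diagram: the same stages mapped to Solr, with a RAMBuffer and Segments across Memory and Disk, labeled Buffered, Searchable, and Durable, and with Soft Commit and Hard Commit as the transitions.]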
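The buffered, searchable, and durable stages can be made concrete with a toy model. The Python sketch below is not Lucene, Solr, or DSE code: `ToyIndex` and its methods are hypothetical, documents are plain strings, and "disk" is just another list. It only illustrates when a document becomes visible to a search, with and without live indexing.

```python
class ToyIndex:
    """Toy model of the buffered -> searchable -> durable stages."""

    def __init__(self, live_indexing=False):
        self.live_indexing = live_indexing
        self.buffered = []    # accumulated in memory, normally invisible to searches
        self.searchable = []  # visible to searches, would not survive a restart
        self.durable = []     # stands in for segments flushed to disk

    def add(self, doc):
        self.buffered.append(doc)

    def soft_commit(self):
        """Make buffered documents visible to searches."""
        self.searchable.extend(self.buffered)
        self.buffered.clear()

    def hard_commit(self):
        """Make everything accumulated so far durable."""
        self.durable.extend(self.searchable)
        self.durable.extend(self.buffered)
        self.searchable.clear()
        self.buffered.clear()

    def search(self, term):
        visible = self.durable + self.searchable
        if self.live_indexing:
            # Assumption for this toy: live indexing searches the buffer directly.
            visible = visible + self.buffered
        return [doc for doc in visible if term in doc]


classic = ToyIndex()
classic.add("unremarkable reading")
print(classic.search("unremarkable"))  # []
classic.soft_commit()
print(classic.search("unremarkable"))  # ['unremarkable reading']

live = ToyIndex(live_indexing=True)
live.add("unremarkable reading")
print(live.search("unremarkable"))     # ['unremarkable reading']
```

In the classic mode, visibility depends on how often `soft_commit` runs, which is the tradeoff between visibility and indexing throughput. In the live mode, the write is searchable immediately.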
Querying
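Replica Selection
RF=2, shards: A-E
[Diagram: ring of nodes 1-5 with a coordinator; each of the shards A-E appears on two nodes, and nodes are shaded on a scale from Healthy to Unhealthy.]
Replica Selection
RF=2, shards: A-E
[Diagram: the same ring of nodes 1-5 with a coordinator, shards A-E on two nodes each, and the Healthy to Unhealthy scale.]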
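One way to picture replica selection is as a greedy covering problem: keep picking the node that covers the most still-uncovered shards, and break ties using a per-node score. The Python sketch below is an illustration under stated assumptions, not DSE's actual algorithm. `select_replicas`, the shard layout, and the numeric scores are all hypothetical; the score stands in for whatever health and activity signals the coordinator uses. A greedy cover is also not guaranteed to be minimal in every topology.

```python
def select_replicas(shards, node_shards, score):
    """Pick nodes to cover `shards`: most coverage first, best score on ties."""
    uncovered = set(shards)
    chosen = []
    while uncovered:
        candidates = [n for n in node_shards if node_shards[n] & uncovered]
        if not candidates:
            raise ValueError(f"no replica available for shards {sorted(uncovered)}")
        best = max(
            candidates,
            key=lambda n: (len(node_shards[n] & uncovered), score[n]),
        )
        chosen.append(best)
        uncovered -= node_shards[best]
    return chosen


# Hypothetical 5-node ring with RF=2: each shard lives on two adjacent nodes.
layout = {1: {"E", "A"}, 2: {"A", "B"}, 3: {"B", "C"}, 4: {"C", "D"}, 5: {"D", "E"}}
score = {1: 1.0, 2: 0.9, 3: 0.4, 4: 1.0, 5: 0.7}

print(select_replicas({"B"}, layout, score))  # [2]

score[2] = 0.1  # e.g. node 2 is busy rebuilding its search index
print(select_replicas({"B"}, layout, score))  # [3]
```

Folding health and current activity into a single score keeps the tie-break simple; a real coordinator may weigh those signals separately.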
What happens if a shard query fails?
Failover: Phase 1
4 nodes, RF = 2, shards: A-D, no vnodes
[Diagram: ring of nodes 1-4.]
Failover: Phase 2
4 nodes, RF = 2, shards: A-D, no vnodes
[Diagram: ring of nodes 1-4.]
Failover: Phase 3
4 nodes, RF = 2, shards: A-D, no vnodes
[Diagram: ring of nodes 1-4.]
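The three failover phases can be sketched as a small retry planner: when a node fails, only the shards it had been asked to serve are reassigned to surviving replicas. The Python below is a hedged illustration, not DSE's implementation. `plan_retry` is a hypothetical helper, and the shard layout is an assumption chosen to match the speaker notes, where nodes 1 and 3 initially cover shards A-D and nodes 2 and 4 take over B and C after node 3 fails.

```python
def plan_retry(failed_node, assignments, node_shards):
    """Reassign only the shards that the failed node was asked to serve."""
    lost = set(assignments[failed_node])
    survivors = {n: s for n, s in node_shards.items() if n != failed_node}
    retry = {}
    while lost:
        # Prefer the surviving node that covers the most lost shards.
        best = max(survivors, key=lambda n: len(survivors[n] & lost), default=None)
        gained = survivors[best] & lost if best is not None else set()
        if not gained:
            raise RuntimeError(f"no live replica for shards {sorted(lost)}")
        retry[best] = gained
        lost -= gained
    return retry


# Hypothetical 4-node ring, RF = 2, no vnodes.
layout = {1: {"D", "A"}, 2: {"A", "B"}, 3: {"B", "C"}, 4: {"C", "D"}}

# Phase 1: coordinator node 1 serves A and D itself and asks node 3 for B and C.
assignments = {1: {"A", "D"}, 3: {"B", "C"}}

# Phases 2 and 3: node 3 fails, so B and C are retried on other replicas.
print(plan_retry(3, assignments, layout))  # {2: {'B'}, 4: {'C'}}
```

Retrying only the lost shards means the results already returned by node 1 do not have to be fetched again.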
Platform Integrations
Search + Analytics: Explicit Predicate Pushdown
bash$ dse spark
scala> val table = sc.cassandraTable("wiki","solr")
scala> val result = table.select("id","title")
.where("solr_query='body:dog'")
.collect
http://docs.datastax.com

Editor's Notes

  • #2: “Hello! My name is Caleb Rackliffe, and I’m a member of the search team at DataStax. Today I’d like to walk you through a brief (but action-packed) introduction to DataStax Enterprise Search. I’ll start with a question…”
  • #3: “Before we talk about what DSE Search is, let’s make sure we know why we built it.”
  • #4: “Here we have a small Cassandra cluster and an application sitting on top of it, using the DataStax driver. We can go a long way with CQL and proper denormalization, but what happens when we find ourselves wanting to do something as seemingly simple as…”
  • #5: “…this. You’ll recognize the SQL-style wildcard query, which Cassandra does not support out of the box.”
  • #6: Cassandra’s built-in secondary indexes might seem like a solution, but they… …don’t support wildcard queries. …can perform poorly unless limited to a single partition. …can perform poorly for very high or very low cardinality fields. …may fail for a frequently updated/deleted column.
  • #7: “You could, but then you’d be saddled with the cost of building and maintaining that, and you’ll still end up with something that is designed for a fairly specific use-case.”
  • #8: “So when our search problem lacks the structure to make denormalization effective, and is beyond the capabilities of C* secondary indexes, we need to think a bit more broadly.”
  • #9: “Fortunately, there are technologies out there that handle full-text and other more advanced kinds of search well, and most of them, like Solr, are built on the foundation of the Apache Lucene project.”
  • #10: “Well, let’s see what it would look like to use a separate, Lucene-based search cluster alongside our Cassandra cluster…”
  • #11: “…here we are. Our application is now sitting on top of both a Cassandra cluster and separate search cluster. Notice that we’ve added a new client to our application, specifically for search. So we’ve got Cassandra doing key-value lookups and probably some range queries…we’ve got our search cluster handling the more advanced ad-hoc queries for us.”
  • #12: “This is polyglot persistence at its best…right?”
  • #13: “Well, maybe not…and we can talk about this along 3 axes.” Complexity - The persistence layer of our application is now more complex. We have to configure two clients, write to two data stores, and, if we write to one of them asynchronously, manage a queueing solution. Consistency - Since the two data stores have no explicit knowledge of each other, we have to manage questions of consistency between them in our application. Cost - Aside from the implicit cost of complexity, we’ll also need to deal with the explicit cost of infrastructure and hardware for a separate cluster.
  • #14: “So if you need to avoid data loss, scale your writes, and replicate your index over multiple DCs, your architecture might start to look like this lovely Rube Goldberg machine. We wanted to provide all of this in an operationally simple package…”
  • #15: “DSE Search is designed to address those problems. We’ve built a coherent search platform that integrates Cassandra’s distributed persistence, Lucene’s core search and indexing functionality, and the advanced features of Solr in the same JVM…and then we’ve made a number of our own enhancements, which we’ll see in the coming slides.”
  • #16: “So back to our architecture diagram. First, with DSE Search, we can eliminate the cost associated with running a separate search cluster. We can eliminate much of the complexity at the application layer, since we don’t have to deal with two clients, and we only have to manage one write path…and with all of our data stored in Cassandra alone and collocated with the relevant shards of our search index, we’ve eliminated many of the potential issues of consistency between the two.”
  • #17: “We’ll go into more details on the indexing and query paths, but before we do that, let’s run through some basic examples and get a feel for the ergonomics of our solution.”
  • #18: “First, we’ll start up a single node. (The -s switch here tells the node it’s going to handle a search workload.) Second, we create a table from the CQL prompt. Third, we create a Solr core over that table from dsetool…and that’s it. We’re ready to index documents. Note that we don’t have to create the Solr schema explicitly, because DSE Search creates it for us, using the CQL schema to determine its type mappings.”
  • #19: “Under the hood, the schema actually looks something like this, but you shouldn’t need to trouble yourself with it, unless our default type mappings aren’t quite right for you. In that case, you can just tweak the auto-generated schema and re-upload it.”
  • #20: “Next we insert a few rows, which will be indexed automatically for search. There is no ETL involved and no explicit writing to a second data store. We’re ready to make some queries…”
  • #21: “…so let’s start with a simple wildcard query. Here, we want to find everyone whose home address starts with a U, and of course we find users in the United States and the UK.”
  • #22: “Sorting and Limits! In the first query, we just find all our users and sort them descending by home address. In the second query, we do the same thing except we also use the CQL LIMIT keyword to narrow our results down to just the top result by home address.”
  • #23: “Faceting allows us to take the results of a query, in this case a query for all documents, group them, and count the members in each group. In this example, faceting on our users’ work addresses tells us that we have one working in Spain, one at corporate headquarters, and one in the UK. This is very common in the context of a product search, where a user wants to drill into results by brand.”
  • #24: “What if we want to restrict our search to a specific partition? Here I have another table, one that records series of sensor events. Using a CQL partition key restriction in our WHERE clause, we can ensure that our query visits only the node that contains that partition and then filters on it once we get there. Much like our earlier usage of LIMIT, this is a case where we’re translating CQL instructions to search-specific instructions under the hood.”
  • #25: “Now that we have an idea of what basic usage looks like, let’s take a high-level look at what’s going on in the indexing and query internals…”
  • #26: “The indexing process starts with a Cassandra write. It arrives at the coordinator, is distributed to the proper replicas, and is written to the commit log and Memtable, as you would expect. At this point, we create an updated Lucene document and queue it up for indexing, then we return to the coordinator and the client. Then, asynchronously, we update the index. Finally, also in the background, when a C* Memtable is flushed to disk, we also flush the corresponding index updates to disk, ensuring their durability.”
  • #27: “In near-real-time search systems, updated documents, once indexed, progress through 3 stages: a buffered stage, where they are just accumulated in memory; a searchable stage, where they move to disk and become visible to ongoing queries; and a durable stage, where they are permanently added to the index and will survive restart.” “Because moving from the “buffered” layer to the “searchable” layer is expensive, we are forced to make a tradeoff between the visibility of our data and indexing throughput. That is, we can make our writes visible to ongoing searches more quickly at the cost of slower indexing throughput, or we can maximize indexing throughput with longer delays before writes are visible to searches.”
  • #28: In DSE 4.7, we released a feature called “Live Indexing”. Essentially, we’ve made indexed documents buffered in memory searchable, eliminating the need to build a separate “searchable” representation of the index and the need to make a hard decision between update availability and throughput. This might remind you of the Cassandra write path, where we have “searchable” Memtables buffered in memory that are periodically flushed to “durable” SSTables.
  • #29: “This is what it would look like if we mapped these stages to their equivalents in Solr. Notice that the soft commit process creates searchable segments, which must later be merged by Lucene in the background. Since live indexing bypasses this second level, we can accumulate larger segments before flushing to disk, and this reduces the cost of the segment merges that occur in the background.”
  • #30: “On the query side, we’ve implemented our own distributed search, informed by the topology of the cluster that Cassandra makes available to us. Here we have a 4-node cluster with a replication factor of 2. Our first step is to determine the set of nodes that optimally covers the ring, in this case, the tokens from 0 -> 1000. We then scatter the query to those nodes, find the IDs for matching documents, and read the documents themselves, which are stored only in Cassandra. Notice here that, to minimize fan-out, we only contact node 3, not 4 + 2 to cover ranges 0 -> 250 and 250 -> 500.”
  • #31: “When we need to choose between replicas of a particular token, we do our best to minimize fan-out, to cover the entire dataset optimally. When multiple nodes could be optimal selections, we look more closely at the health and activity of those nodes. In this example we have a 5-node cluster with a replication factor of 2 and index shards A-E. We’ll denote health here by color, with green being healthy, red being unhealthy, and yellow in the middle. If we need to cover shard B, we can query either node 2 or node 3, but we’ll pick node 2, because it’s healthier.”
  • #32: “However, node health is not the only criterion we use for selection. If node 2 is healthy, but is also in the middle of an expensive operation, let’s say, rebuilding its search index, we’ll want to choose node 3, since node 2 is potentially both out of date and unable to devote as many resources to handling incoming queries.”
  • #34: “Here we have a healthy 4-node cluster with a replication factor of 2 and 4 index shards. If node 1 coordinates our request, it only needs to contact itself and node 3 to cover all 4 of the shards A-D…”
  • #35: “…but then node 3 fails. It could have been a disk failure or a network issue…”
  • #36: “…but it was probably because you let this guy near it.”
  • #37: “In any case, we still need to cover shards B and C, but node 3 was the only node that contained both of them, so we’ll need to contact nodes 2 and 4.”
  • #38: “To this point, I’ve talked about search in a fairly isolated way, but in the context of a larger platform, there are opportunities to step outside that.”
  • #39: “One example is the integration we released in DSE 4.7 with Spark - a component of DSE Analytics. There are cases where pushing a search query through a Spark job can meaningfully cut down on the size of the RDD Spark presents for analysis. In this example, we’re filtering every Wikipedia article that contains the word ‘dog’ using search, avoiding some unnecessary filtering after we build the RDD.”
  • #40: “Well that wraps it up for me. If you’d like to dig deeper into any of the topics I covered here, or you’d like to try DSE out for yourself, please visit docs.datastax.com. Thank you all so much for coming, and enjoy the rest of your Summit!”