DataStax: An Introduction to DataStax Enterprise Search


The document provides a comprehensive overview of implementing full-text search using the DataStax Enterprise (DSE) search capabilities, detailing how to create and manage schemas while supporting advanced querying features such as wildcards, sorting, and faceting. It outlines practical examples using CQL commands, demonstrating how to integrate Solr with Cassandra for efficient data handling without the need for ETL processes. Additionally, the document discusses the internal workings of indexing and querying within the DSE search architecture.


An Introduction to DSE Search
Caleb Rackliffe
Software Engineer
caleb.rackliffe@datastax.com
@calebrackliffe



1. An Introduction to DSE Search Caleb Rackliffe Software Engineer caleb.rackliffe@datastax.com @calebrackliffe

2. What problem were we trying to solve?

3. [Diagram: Application, DataStax Driver]

4. SELECT * FROM customers WHERE country LIKE '%land%';

5. What about secondary indexes?

6. Why not just create your own secondary index implementation that supports wildcard queries?

7. I need full-text search!

9. Why did we build something new?

10. [Diagram: Application, DataStax Driver, Solr Client]

11. Polyglot Persistence!

12. [Diagram: Application, DataStax Driver, Solr Client; labels: Consistency, Cost, Complexity]

14. [Diagram labels: partitioning, multi-DC, replication, geospatial, wildcards, monitoring, C* field type support (UDT, Tuple, collections), security, live indexing, sorting, faceting, fault-tolerant distributed search, caching, text analysis, grouping, automatic index updates, JVM, CQL, repair]

15. [Diagram: Application, DataStax Driver, Solr Client; labels: Consistency, Complexity, Cost]

16. How about some examples?

17. Creating a Solr Core

Start a node…
bash$ dse cassandra -s

Create a table…
cqlsh> CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy', 'Solr':1};
cqlsh:test> CREATE TABLE test.user(username text PRIMARY KEY, fullname text, address_ map<text, text>);

Create the core…
bash$ dsetool create_core test.user generateResources=true

18. The Schema

bash$ dsetool get_core_schema test.user
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
  <types>
    <fieldType class="org.apache.solr.schema.TextField" name="text">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType class="org.apache.solr.schema.StrField" name="string"/>
  </types>
  <fields>
    <field indexed="true" name="username" stored="true" type="string"/>
    <field indexed="true" name="fullname" stored="true" type="text"/>
    <dynamicField indexed="true" name="address_*" stored="true" type="string"/>
  </fields>
  <uniqueKey>username</uniqueKey>
</schema>

19. Insert Rows (…and Index Documents)

cqlsh:test> INSERT INTO user(username, fullname, address_) VALUES('sbtourist', 'Sergio Bossa', {'address_home' : 'UK', 'address_work' : 'UK'});
cqlsh:test> INSERT INTO user(username, fullname, address_) VALUES('bereng', 'Berenguer Blasi', {'address_home' : 'ES', 'address_work' : 'ES'});
cqlsh:test> INSERT INTO user(username, fullname, address_) VALUES('thegrinch', 'Sven Delmas', {'address_home':'US','address_work':'HQ'});

…and that’s it. No ETL. No writing to a second datastore.

20. Wildcards

cqlsh:test> SELECT username, address_ FROM user WHERE solr_query='{"q":"address_home:U*"}';

 username  | address_
-----------+----------------------------------------------
 sbtourist | {'address_home': 'UK', 'address_work': 'UK'}
 thegrinch | {'address_home': 'US', 'address_work': 'HQ'}

(2 rows)

21. Sorting and Limits

cqlsh:test> SELECT username, address_ FROM user WHERE solr_query='{"q":"*:*", "sort":"address_home desc"}';

 username  | address_
-----------+----------------------------------------------
 thegrinch | {'address_home': 'US', 'address_work': 'HQ'}
 sbtourist | {'address_home': 'UK', 'address_work': 'UK'}
 bereng    | {'address_home': 'ES', 'address_work': 'ES'}

(3 rows)

cqlsh:test> SELECT username, address_ FROM user WHERE solr_query='{"q":"*:*", "sort":"address_home desc"}' LIMIT 1;

 username  | address_
-----------+----------------------------------------------
 thegrinch | {'address_home': 'US', 'address_work': 'HQ'}

(1 rows)

22. Faceting

cqlsh:test> SELECT * FROM user WHERE solr_query='{"q":"*:*", "facet":{"field" : "address_work"}}';

 facet_fields
-----------------------------------------------------
 {"address_work" : {"ES" : 1 , "HQ" : 1 , "UK" : 1}}

(1 rows)

23. Partition Restrictions

cqlsh:test> CREATE TABLE event(sensor_id bigint,
                               recording_time timestamp,
                               description text,
                               PRIMARY KEY(sensor_id, recording_time));
…
cqlsh:test> SELECT recording_time, description
            FROM test.event
            WHERE sensor_id = 2314234432
            AND solr_query='description:unremarkable';

24. What do the internals look like?

25. Indexing

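26. [Diagram: index update stages Buffered, Searchable, and Durable, shown across Memory and Disk]

27. [Diagram: index update stages Buffered, Searchable, and Durable, shown across Memory and Disk]

28. [Diagram: RAMBuffer and Segments across Memory and Disk; stages Buffered, Searchable, Durable; transitions Soft Commit and Hard Commit]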

29. Querying

30. Replica Selection [Diagram: coordinator and nodes 1-5, RF=2, shards A-E; legend: Healthy, Unhealthy]

31. Replica Selection [Diagram: coordinator and nodes 1-5, RF=2, shards A-E; legend: Healthy, Unhealthy]

32. What happens if a shard query fails?

33. Failover: Phase 1 [Diagram: 4 nodes (1-4), RF = 2, shards A-D, no vnodes]

34. Failover: Phase 2 [Diagram: 4 nodes (1-4), RF = 2, shards A-D, no vnodes]

36. Failover: Phase 3 [Diagram: 4 nodes (1-4), RF = 2, shards A-D, no vnodes]

37. Platform Integrations

38. Search + Analytics: Explicit Predicate Pushdown

bash$ dse spark
scala> val table = sc.cassandraTable("wiki","solr")
scala> val result = table.select("id","title")
                         .where("solr_query='body:dog'")
                         .collect

39. http://docs.datastax.com

Editor's Notes

#2: “Hello! My name is Caleb Rackliffe, and I’m a member of the search team at DataStax. Today I’d like to walk you through a brief (but action-packed) introduction to DataStax Enterprise Search. I’ll start with a question…”
#3: “Before we talk about what DSE Search is, let’s make sure we know why we built it.”
#4: “Here we have a small Cassandra cluster and an application sitting on top of it, using the DataStax driver. We can go a long way with CQL and proper denormalization, but what happens when we find ourselves wanting to do something as seemingly simple as…”
#5: “…this. You’ll recognize the SQL-style wildcard query, which Cassandra does not support out of the box.”
#6: Cassandra’s built-in secondary indexes might seem like a solution, but they…
…don’t support wildcard queries.
…can perform poorly unless limited to a single partition.
…can perform poorly for very high or very low cardinality fields.
…may fail for a frequently updated/deleted column.
#7: “You could, but then you’d be saddled with the cost of building and maintaining that, and you’ll still end up with something that is designed for a fairly specific use-case.”
#8: “So when our search problem lacks the structure to make denormalization effective, and is beyond the capabilities of C* secondary indexes, we need to think a bit more broadly.”
#9: “Fortunately, there are technologies out there that handle full-text and other more advanced kinds of search well, and most of them, like Solr, are built on the foundation of the Apache Lucene project.”
#10: “Well, let’s see what it would look like to use a separate, Lucene-based search cluster alongside our Cassandra cluster…”
#11: “…here we are. Our application is now sitting on top of both a Cassandra cluster and separate search cluster. Notice that we’ve added a new client to our application, specifically for search. So we’ve got Cassandra doing key-value lookups and probably some range queries…we’ve got our search cluster handling the more advanced ad-hoc queries for us.”
#12: “This is polyglot persistence at its best…right?”
#13: “Well, maybe not…and we can talk about this along 3 axes.”
Complexity - The persistence layer of our application is now more complex. We have to configure two clients, write to two data stores, and, if we write to one of them asynchronously, manage a queueing solution.
Consistency - Since the two data stores have no explicit knowledge of each other, we have to manage questions of consistency between them in our application.
Cost - Aside from the implicit cost of complexity, we’ll also need to deal with the explicit cost of infrastructure and hardware for a separate cluster.
#14: “So if you need to avoid data loss, scale your writes, and replicate your index over multiple DCs, your architecture might start to look like this lovely Rube Goldberg machine. We wanted to provide all of this in an operationally simple package…”
#15: “DSE Search is designed to address those problems. We’ve built a coherent search platform that integrates Cassandra’s distributed persistence, Lucene’s core search and indexing functionality, and the advanced features of Solr in the same JVM…and then we’ve made a number of our own enhancements, which we’ll see in the coming slides.”
#16: “So back to our architecture diagram. First, with DSE search, we can eliminate the cost associated with running a separate search cluster. We can eliminate much of the complexity at the application layer, since we don’t have to deal with two clients, and we only have to manage one write path…and with all of our data stored in Cassandra alone and collocated with the relevant shards of our search index, we’ve eliminated many of the potential issues of consistency between the two.”
#17: “We’ll go into more details on the indexing and query paths, but before we do that, let’s run through some basic examples and get a feel for the ergonomics of our solution.”
#18: “First, we’ll start up a single node. (The -s switch here tells the node it’s going to handle a search workload.) Second, we create a table from the CQL prompt. Third, we create a Solr core over that table from dsetool…and that’s it. We’re ready to index documents. Note that we don’t have to create the Solr schema explicitly, because DSE Search creates it for us, using the CQL schema to determine its type mappings.”
#19: “Under the hood, the schema actually looks something like this, but you shouldn’t need to trouble yourself with it, unless our default type mappings aren’t quite right for you. In that case, you can just tweak the auto-generated schema and re-upload it.”
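
To make the analyzer chain in the generated schema more concrete, here is a rough Python sketch of what a tokenizer followed by a lowercase filter does to a `fullname` value. It only approximates `StandardTokenizerFactory` with a simple regex (the real tokenizer follows Unicode text segmentation rules), and the helper names `analyze` and `index_as_string` are hypothetical, not part of DSE or Solr.

```python
import re


def analyze(text):
    """Approximate the 'text' field type: split into word tokens, then lowercase."""
    # Assumption: a regex word split stands in for Solr's StandardTokenizer.
    tokens = re.findall(r"\w+", text)
    # Same idea as LowerCaseFilterFactory: normalize case so matching ignores it.
    return [token.lower() for token in tokens]


def index_as_string(value):
    """Approximate the 'string' field type: the whole value is a single term."""
    return [value]


print(analyze("Sergio Bossa"))
print(index_as_string("sbtourist"))
```

Under these assumptions, a search for `bossa` can match the `fullname` field, while `username` only matches on its exact, unanalyzed value.
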
#20: “Next we insert a few rows, which will be indexed automatically for search. There is no ETL involved and no explicit writing to a second data store. We’re ready to make some queries…”
#21: “…so let’s start with a simple wildcard query. Here, we want to find everyone whose home address starts with a U, and of course we find users in the United States and the UK.”
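
As an illustration of why a prefix wildcard such as `U*` is cheap for a search index, the sketch below runs a binary search over a sorted term list and then follows the matching terms to their documents. This is a simplified stand-in: Lucene's actual term dictionary is more elaborate, and `prefix_terms` and the `postings` layout are assumptions made for the example.

```python
from bisect import bisect_left


def prefix_terms(sorted_terms, prefix):
    """Return every term in a sorted list that starts with the given prefix."""
    # Binary search finds the first term that could match the prefix.
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order means no later term can match
        matches.append(term)
    return matches


# Toy inverted index for address_home: term -> usernames containing it.
postings = {"ES": ["bereng"], "UK": ["sbtourist"], "US": ["thegrinch"]}
terms = sorted(postings)
hits = [user for term in prefix_terms(terms, "U") for user in postings[term]]
print(hits)
```

The lookup touches only the terms that share the prefix, which is the contrast with scanning every row for a SQL-style `LIKE` pattern.
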
#22: “Sorting and Limits! In the first query, we just find all our users and sort them descending by home address. In the second query, we do the same thing except we also use the CQL LIMIT keyword to narrow our results down to just the top result by home address.”
#23: “Faceting allows us to take the results of a query, in this case a query for all documents, group them, and count the members in each group. In this example, faceting on our users’ work addresses tells us that we have one working in Spain, one at corporate headquarters, and one in the UK. This is very common in the context of a product search, where a user wants to drill into results by brand.”
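
Conceptually, field faceting is a group-and-count over the documents a query matched. The following Python sketch reproduces the counts from the faceting slide with the three example users; the `facet` function is a hypothetical helper, not a DSE or Solr API.

```python
from collections import Counter

users = [
    {"username": "sbtourist", "address_home": "UK", "address_work": "UK"},
    {"username": "bereng", "address_home": "ES", "address_work": "ES"},
    {"username": "thegrinch", "address_home": "US", "address_work": "HQ"},
]


def facet(documents, field):
    """Count how many of the matched documents carry each value of a field."""
    return dict(Counter(doc[field] for doc in documents if field in doc))


# The query "*:*" matches every document, so we facet over all users.
print(facet(users, "address_work"))
```
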
#24: “What if we want to restrict our search to a specific partition? Here I have another table, one that records series of sensor events. Using a CQL partition key restriction in our WHERE clause, we can ensure that our query visits only the node that contains that partition and then filters on it once we get there. Much like our earlier usage of LIMIT, this is a case where we’re translating CQL instructions to search-specific instructions under the hood.”
#25: “Now that we have an idea of what basic usage looks like, let’s take a high-level look at what’s going on in the indexing and query internals…”
#26: “The indexing process starts with a Cassandra write. It arrives at the coordinator, is distributed to the proper replicas, and is written to the commit log and Memtable, as you would expect. At this point, we create an updated Lucene document and queue it up for indexing, then we return to the coordinator and the client. Then, asynchronously, we update the index. Finally, also in the background, when a C* Memtable is flushed to disk, we also flush the corresponding index updates to disk, ensuring their durability.”
#27: “In near-real-time search systems, updated documents, once indexed, progress through 3 stages: a buffered stage, where they are just accumulated in memory; a searchable stage, where they move to disk and become visible to ongoing queries; and a durable stage, where they are permanently added to the index and will survive restart.” “Because moving from the “buffered” layer to the “searchable” layer is expensive, we are forced to make a tradeoff between the visibility of our data and indexing throughput. That is, we can make our writes visible to ongoing searches more quickly at the cost of slower indexing throughput, or we can maximize indexing throughput with longer delays before writes are visible to searches.”
#28: In DSE 4.7, we released a feature called “Live Indexing”. Essentially, we’ve made indexed documents buffered in memory searchable, eliminating the need to build a separate “searchable” representation of the index and the need to make a hard decision between update availability and throughput. This might remind you of the Cassandra write path, where we have “searchable” Memtables buffered in memory that are periodically flushed to “durable” SSTables.
#29: “This is what it would look like if we mapped these stages to their equivalents in Solr. Notice that the soft commit process creates searchable segments, which must later be merged by Lucene in the background. Since live indexing bypasses this second level, we can accumulate larger segments before flushing to disk, and this reduces the cost of the segment merges that occur in the background.”
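
The buffered, searchable, and durable stages described in these notes can be modeled with a small toy class. This is only a cartoon of the idea, not how DSE or Lucene is implemented: the `ToyIndex` class, its method names, the `live_indexing` flag, and the choice to have a hard commit also make documents visible are all assumptions made for illustration.

```python
class ToyIndex:
    """Toy model of the buffered -> searchable -> durable progression."""

    def __init__(self, live_indexing=False):
        self.live_indexing = live_indexing
        self.buffered = []    # accepted, not yet visible to queries
        self.searchable = []  # visible to queries, lost on restart
        self.durable = []     # visible to queries and persisted

    def add(self, doc):
        self.buffered.append(doc)

    def soft_commit(self):
        # Make buffered documents visible to searches.
        self.searchable.extend(self.buffered)
        self.buffered.clear()

    def hard_commit(self):
        # Persist everything accepted so far (toy assumption: also makes it visible).
        self.soft_commit()
        self.durable.extend(self.searchable)
        self.searchable.clear()

    def search(self, word):
        visible = self.searchable + self.durable
        if self.live_indexing:
            # With live indexing, buffered documents are searchable directly.
            visible = self.buffered + visible
        return [doc for doc in visible if word in doc.split()]

    def restart(self):
        # Only durable documents survive a restart.
        self.buffered.clear()
        self.searchable.clear()


index = ToyIndex()
index.add("unremarkable reading")
print(index.search("unremarkable"))  # nothing visible before a commit
index.soft_commit()
print(index.search("unremarkable"))  # visible after the soft commit
```

Constructing the toy with `live_indexing=True` makes a document show up in `search` immediately after `add`, which is the trade-off the Live Indexing note describes removing.
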
#30: “On the query side, we’ve implemented our own distributed search, informed by the topology of the cluster that Cassandra makes available to us. Here we have a 4-node cluster with a replication factor of 2. Our first step is to determine the set of nodes that optimally covers the ring, in this case, the tokens from 0 -> 1000. We then scatter the query to those nodes, find the IDs for matching documents, and read the documents themselves, which are stored only in Cassandra. Notice here that, to minimize fan-out, we only contact node 3, not 4 + 2 to cover ranges 0 -> 250 and 250 -> 500.”
#31: “When we need to choose between replicas of a particular token, we do our best to minimize fan-out, to cover the entire dataset optimally. When multiple nodes could be optimal selections, we look more closely at the health and activity of those nodes. In this example we have a 5-node cluster with a replication factor of 2 and index shards A-E. We’ll denote health here by color, with green being healthy, red being unhealthy, and yellow in the middle. If we need to cover shard B, we can query either node 2 or node 3, but we’ll pick node 2, because it’s healthier.”
#32: “However, node health is not the only criterion we use for selection. If node 2 is healthy, but is also in the middle of an expensive operation, let’s say, rebuilding its search index, we’ll want to choose node 3, since node 2 is potentially both out of date and unable to devote as many resources to handling incoming queries.”
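
The two selection criteria in these notes (avoid a node that is rebuilding its index, otherwise prefer the healthier node) can be sketched as a simple ranking. The talk does not describe DSE's actual scoring, so the `health` score, the `reindexing` flag, and the `pick_replica` function are all invented for this example.

```python
def pick_replica(candidates):
    """Pick the node to query for one shard.

    candidates: list of dicts with hypothetical keys
      'node' (int), 'health' (0.0 to 1.0, higher is healthier),
      'reindexing' (bool, True while the node rebuilds its search index).
    """
    # Rank nodes that are not reindexing first, then by descending health.
    best = min(candidates, key=lambda c: (c["reindexing"], -c["health"]))
    return best["node"]


# Shard B lives on nodes 2 and 3; node 2 is healthier.
shard_b = [
    {"node": 2, "health": 0.9, "reindexing": False},
    {"node": 3, "health": 0.5, "reindexing": False},
]
print(pick_replica(shard_b))

# Node 2 starts rebuilding its index, so node 3 becomes the better choice.
shard_b[0]["reindexing"] = True
print(pick_replica(shard_b))
```
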
#34: “Here we have a healthy 4-node cluster with a replication factor of 2 and 4 index shards. If node 1 coordinates our request, it only needs to contact itself and node 3 to cover all 4 of the shards A-D…”
#35: “…but then node 3 fails. It could have been a disk failure or a network issue…”
#36: “…but it was probably because you let this guy near it.”
#37: “In any case, we still need to cover shards B and C, but node 3 was the only node that contained both of them, so we’ll need to contact nodes 2 and 4.”
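
The failover walkthrough above can be reproduced with a small greedy covering routine. This is a sketch under stated assumptions rather than DSE's algorithm: the shard placement (each shard on two adjacent nodes) is inferred from the notes, `cover_shards` is a hypothetical helper, and a greedy choice is not guaranteed to be minimal in general.

```python
def cover_shards(placement, coordinator, down=frozenset()):
    """Greedily choose nodes so that every shard is queried exactly once.

    placement: dict mapping node -> set of shards stored on that node.
    Returns a dict mapping each chosen node -> shards to query there.
    """
    live = {node: shards for node, shards in placement.items() if node not in down}
    uncovered = set().union(*placement.values())
    plan = {}
    # Use the coordinator's own shards first: no extra network hop.
    if coordinator in live:
        plan[coordinator] = live[coordinator] & uncovered
        uncovered -= plan[coordinator]
    while uncovered:
        # Pick the live node that covers the most still-uncovered shards.
        best = max(sorted(live), key=lambda node: len(live[node] & uncovered))
        gained = live[best] & uncovered
        if not gained:
            raise RuntimeError(f"no live replica for shards {sorted(uncovered)}")
        plan[best] = gained
        uncovered -= gained
    return plan


# Assumed placement: 4 nodes, RF = 2, shards A-D, no vnodes.
placement = {1: {"A", "D"}, 2: {"A", "B"}, 3: {"B", "C"}, 4: {"C", "D"}}
print(sorted(cover_shards(placement, coordinator=1)))            # nodes used when healthy
print(sorted(cover_shards(placement, coordinator=1, down={3})))  # nodes used after node 3 fails
```

With this placement, the healthy plan contacts nodes 1 and 3, and once node 3 is down the plan falls back to nodes 2 and 4 for shards B and C, matching the three failover phases.
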
#38: “To this point, I’ve talked about search in a fairly isolated way, but in the context of a larger platform, there are opportunities to step outside that.”
#39: “One example is the integration we released in DSE 4.7 with Spark - a component of DSE Analytics. There are cases where pushing a search query through a Spark job can meaningfully cut down on the size of the RDD Spark presents for analysis. In this example, we’re filtering every Wikipedia article that contains the word ‘dog’ using search, avoiding some unnecessary filtering after we build the RDD.”
#40: “Well that wraps it up for me. If you’d like to dig deeper into any of the topics I covered here, or you’d like to try DSE out for yourself, please visit docs.datastax.com. Thank you all so much for coming, and enjoy the rest of your Summit!”
