AI from your data lake: Using Solr for analytics

Who we are
Cassandra Targett
Lucene/Solr Committer & PMC
Director of Engineering at
Lucidworks
Solr and HDP Search
Development
Marcelline Saunders
Director of Global Partner
Enablement at Lucidworks

Lucidworks is the primary sponsor
of the Apache Solr project
Employs over 40% of the active
committers on the Solr project
Contributes over 70% of Solr's
open source codebase
Based in San Francisco
Offices in Bangalore, Bangkok,
New York City, London, Raleigh
Over 400 customers across the
Fortune 1000
Fusion, a Solr-powered platform
for search-driven apps
Consulting and support for
organizations using Solr
Produces the world’s largest open source user
conference for Lucene/Solr (now also AI!)

Visit activate-conf.com for more information & to register

About Solr
Solr is the most popular search engine available today
Built on Lucene
Open source
Scalable
Distributed
Flexible
Extensible
Search Features:
● Admin UI
● Facets
● Hit highlights
● Multiple languages
● Spell check, auto-complete

What is HDP Search?
Developed by Lucidworks
Built & Distributed by
Hortonworks
Add-on package for HDP, which
includes:
● Apache Solr
● HDFS, Hive and Pig Connectors
● Ambari MPack for Solr
● Banana
● Documentation

HDP Search
SerDe
Job Jar
Data Files

AI Features in Solr
● Streaming Expressions
○ Math programming syntax
○ Train regression models
○ Classify results of a search
○ Parallel processing
○ Graph Traversal
○ Parallel SQL
● Learning-to-Rank
● Analytics Component

Streaming Expressions
Powerful stream processing language for Solr
● Suite of functions to query,
transform, and aggregate your
data
● Functions can be nested to
perform multiple tasks in one
request
● Work across your entire
dataset
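
A minimal sketch of how nesting works in practice, written in Python. It only builds the expression text and the request URL; nothing is sent. The `call` and `param` helpers are hypothetical conveniences, and the host, port, and `techproducts` collection are assumptions for a local test install. The `/stream` handler and its `expr` parameter come from the Solr Reference Guide.

```python
from urllib.parse import urlencode


def param(key, value):
    """Render a named parameter such as q="*:*"."""
    return f'{key}="{value}"'


def call(name, *parts):
    """Render a function call from already-rendered parts, in order."""
    return f"{name}({', '.join(parts)})"


# Inner function: a stream source that reads tuples from a collection.
source = call(
    "search",
    "techproducts",
    param("q", "*:*"),
    param("fl", "id,color"),
    param("sort", "color asc"),
)

# Outer function: a decorator that wraps the source and aggregates it.
nested = call("rollup", source, param("over", "color"), "count(*)")

# The whole nested expression travels in a single expr parameter.
# localhost:8983 and the collection name are assumptions.
url = "http://localhost:8983/solr/techproducts/stream?" + urlencode({"expr": nested})

print(nested)
```

Because the inner `search` is just an argument to `rollup`, one request performs both the query and the aggregation.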

What Can You Do?
● Request/response stream
processing
● Batch stream processing
● Fast interactive MapReduce
● Aggregations (pushed down
faceted and shuffling
MapReduce)
● Parallel relational algebra
(distributed joins, intersections,
unions, complements)
● Publish/subscribe messaging
● Distributed graph traversal
● Machine learning and parallel
iterative model training
● Anomaly detection
● Recommendation systems
● Retrieve and rank services
● Text classification and feature
extraction
● Streaming NLP
● Build your own!

Stream Sources
output -> tuples
Stream Sources originate streams (of
tuples).
● search
● jdbc
● echo
● facet
● features
● nodes
● knn
● model
● random
● significantTerms
● shortestPath
● shuffle
● stats
● timeseries
● train
● topic
● tuple

Stream Decorators
input -> tuples
output -> tuples
● cartesianProduct
● classify
● commit
● complement
● daemon
● eval
● executor
● fetch
● having
● leftOuterJoin
● hashJoin
● innerJoin
● intersect
● merge
● null
● outerHashJoin
● parallel
● priority
● reduce
● rollup
● scoreNodes
● select
● sort
● top
● unique
● update
Stream Decorators wrap other stream functions or
perform operations on a stream (of tuples).
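
The source/decorator relationship can be pictured with plain Python generators. This is only an in-memory analogy, not how Solr implements streams: the function names borrow from the lists above, the sample documents are made up, and `unique` here assumes (as the Solr function of the same name does) that the stream is already sorted on the `over` field.

```python
def source(rows):
    """Stream source: originates a stream of tuples (dicts here)."""
    for row in rows:
        yield dict(row)


def select(stream, *fields):
    """Decorator: keep only the named fields of each tuple."""
    for tup in stream:
        yield {field: tup[field] for field in fields}


def unique(stream, over):
    """Decorator: emit one tuple per run of equal `over` values.

    Assumes the incoming stream is sorted on `over`.
    """
    previous = object()  # sentinel that equals no real value
    for tup in stream:
        if tup[over] != previous:
            yield tup
        previous = tup[over]


# Made-up documents, already sorted by color.
docs = [
    {"id": "1", "color": "black", "price": 10},
    {"id": "2", "color": "black", "price": 12},
    {"id": "3", "color": "red", "price": 7},
]

# Decorators wrap other streams, so the pipeline reads inside-out.
pipeline = unique(select(source(docs), "id", "color"), over="color")
result = list(pipeline)
print(result)
```

Each decorator takes tuples in and hands tuples out, which is what lets them wrap one another freely.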

Stream Evaluators
input -> parameter
(possibly from a field in a
tuple)
output -> parameter
(possibly from a field in a
tuple)
● analyze
● abs
● add
● div
● log
● mult
● sub
● pow
● mod
● ceil
● floor
● sin
● asin
● sinh
● cos
● acos
● atan
● round
● sqrt
● cbrt
● and
● eq
● eor
● gteq
● gt
● if
● lteq
● lt
● not
● or
● raw
● sample
Stream Evaluators are functions that evaluate
parameters and return a result. These can be used
to transform values inside the tuples in a streaming
expression, or can be used independently.
● regress
● predict
● standardize
● distance
● kmeans
● timeseries
● monteCarlo
● cumulativeProbability
● betaDistribution
● termVectors
● matrix
● rowCount
● mean
● describe
● percentile
● cov
...and many MORE

Streaming Expression Examples
● Parallel Batch Processing
● Train a Logistic Regression Model
● Distributed Joins
● Pull Results from External Database
● Classify Search Results
● Rapid Export of all Search Results
Sources: https://lucene.apache.org/solr/guide/streaming-expressions.html http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html

Parallel SQL
● SQL interface for writing streaming expressions
● Statements are parsed to proper streaming expression syntax
● Supports a basic SQL syntax: SELECT, WHERE, ORDER BY,
LIMIT, etc.
rollup(
  search(techproducts, q="*:*", fl="id,color", sort="color asc"),
  over="color", count(*))
SELECT count(*) FROM techproducts
WHERE _text_='(*:*)' GROUP BY color
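
A hedged sketch of submitting a SQL statement over HTTP from Python. The `/sql` handler and its `stmt` parameter are described in the Solr Reference Guide; the host, port, collection name, and the `build_sql_request` helper are assumptions for illustration. The request is only constructed here, not sent, since sending it needs a running Solr in SolrCloud mode.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed local install with a techproducts collection.
SOLR_SQL_URL = "http://localhost:8983/solr/techproducts/sql"


def build_sql_request(statement):
    """Build a POST request that carries the SQL text in the stmt parameter."""
    body = urlencode({"stmt": statement}).encode("utf-8")
    return Request(SOLR_SQL_URL, data=body, method="POST")


request = build_sql_request(
    "SELECT color, count(*) FROM techproducts GROUP BY color"
)

# To run it against a live server:
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       print(response.read().decode("utf-8"))
print(request.get_method(), request.full_url)
```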

Graph Traversal
● Part of Solr’s broader Streaming
Expressions capability
● Implements a powerful, breadth-first
traversal
● Works across shards AND collections
● Supports aggregations
● Cycle aware
● Ability to both traverse AND score
nodes within the graph

Graph Traversal - Syntax
All movies that user "trey" watched
gatherNodes(movielens,walk="trey->user_name_s",gather="movie_id_i")
All movies that viewers of a specific movie watched
gatherNodes(movielens,
  gatherNodes(movielens, walk="123->movie_id_i", gather="user_id_i"),
  walk="node->user_id_i", gather="movie_id_i", trackTraversal="true"
)
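
To make the two-hop example concrete, here is a small in-memory Python analogy of what the nested traversal computes. It is not Solr's implementation: the `gather_nodes` helper and the sample viewing records are made up, and this sketch does no cycle detection, so the starting movie shows up again in the second hop.

```python
def gather_nodes(docs, roots, walk_field, gather_field):
    """One hop: match docs whose walk_field is in roots,
    then collect the distinct gather_field values from them."""
    roots = set(roots)
    return {doc[gather_field] for doc in docs if doc[walk_field] in roots}


# Made-up viewing records: one document per (user, movie) pair.
views = [
    {"user_id_i": 1, "movie_id_i": 123},
    {"user_id_i": 2, "movie_id_i": 123},
    {"user_id_i": 1, "movie_id_i": 456},
    {"user_id_i": 2, "movie_id_i": 789},
    {"user_id_i": 3, "movie_id_i": 999},
]

# Hop 1 (inner gatherNodes): users who watched movie 123.
viewers = gather_nodes(views, [123], "movie_id_i", "user_id_i")

# Hop 2 (outer gatherNodes): every movie those users watched.
movies = gather_nodes(views, viewers, "user_id_i", "movie_id_i")

print(sorted(viewers), sorted(movies))
```

The output of the inner hop becomes the root set of the outer hop, which is the breadth-first pattern the nested expression expresses.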

Graph Traversal - Use Cases
• Anomaly detection /
fraud detection
• Recommenders
• Social network analysis
• Graph Search
• Access Control
• Relationship discovery / scoring
Examples
o Find all draft blog posts about “Parallel SQL”
written by a developer
o Find all tweets mentioning “Solr” by me or people
I follow
o Find 3-star hotels in NYC my friends stayed in
last year

Learning to Rank (LTR)
Rank query results based on trained models
Traditional relevance ranking uses algorithms that score how well user
query terms match terms in the document (TF/IDF, BM25)
LTR allows you to rank results for user queries according to trained
models stored in Solr (trained outside Solr)
Factors for training data:
● Implicit: clicks, time spent on page, historical sales, previously
viewed documents
● Explicit: human judgement
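
A minimal sketch of the re-ranking idea, assuming a simple linear model. The feature names, weights, and documents are invented for illustration; in Solr the model would be trained outside Solr, uploaded, and applied to the top results of a query. This Python version only shows the arithmetic.

```python
def rerank(results, weights, top_n):
    """Re-score the first top_n results with a linear model and
    reorder them; results below top_n keep their original order."""
    head, tail = results[:top_n], results[top_n:]

    def model_score(doc):
        # Weighted sum of feature values; missing features count as 0.
        return sum(
            weight * doc["features"].get(name, 0.0)
            for name, weight in weights.items()
        )

    return sorted(head, key=model_score, reverse=True) + tail


# Hypothetical results in their original (text relevance) order.
results = [
    {"id": "a", "features": {"bm25": 2.0, "clicks": 0.0}},
    {"id": "b", "features": {"bm25": 1.5, "clicks": 9.0}},
    {"id": "c", "features": {"bm25": 1.0, "clicks": 3.0}},
    {"id": "d", "features": {"bm25": 0.5, "clicks": 50.0}},
]

# Hypothetical weights, standing in for a model trained offline.
weights = {"bm25": 1.0, "clicks": 0.2}

reranked_ids = [doc["id"] for doc in rerank(results, weights, top_n=3)]
print(reranked_ids)
```

Only the top three documents are re-scored, so "d" stays last despite its click count; limiting the re-rank window keeps the model's cost bounded.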

Analytics Component
Calculate complex statistical aggregations over result sets.
Expressions, functions and groupings of data from your documents:
● Expressions: calculations to perform over the result set to
return a single value
● Functions: variables re-used in expressions or groupings
● Groupings: facets, which can include functions or expressions
neg, round, ceil, if, gt, lt, add, sub, div, sum, count,
unique, percentile, date, concat, log, pow, mean, min, max
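
The three concepts can be illustrated with a small in-memory Python analogy. It is not the Analytics Component's request syntax: the documents, the `sale` function, and the expression names are all made up to show how a reusable function feeds expressions, and how a grouping evaluates the same expressions per facet value.

```python
from collections import defaultdict
from statistics import mean

# Made-up documents.
docs = [
    {"category": "books", "price": 10.0, "quantity": 3},
    {"category": "books", "price": 20.0, "quantity": 1},
    {"category": "games", "price": 60.0, "quantity": 2},
]


def sale(doc):
    """Function: a per-document value reused by several expressions."""
    return doc["price"] * doc["quantity"]


# Expressions: each reduces a result set to a single value.
expressions = {
    "mean_sale": lambda result_set: mean(sale(d) for d in result_set),
    "max_sale": lambda result_set: max(sale(d) for d in result_set),
}


def evaluate(result_set):
    """Evaluate every expression over one result set."""
    return {name: fn(result_set) for name, fn in expressions.items()}


# Expressions over the whole result set.
overall = evaluate(docs)

# Grouping: the same expressions, evaluated per category facet value.
groups = defaultdict(list)
for doc in docs:
    groups[doc["category"]].append(doc)
by_category = {value: evaluate(members) for value, members in groups.items()}

print(overall)
print(by_category)
```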

Tools for Analytics & Visualization

Search Driven Analytics
Motivation
- Go beyond full text search
- Self-service exploration of data
- Provide tools for analysts to mine data without having to
understand query languages
- Create views of data for users

Why SQL with Search?
● Known query language
● Eliminates re-training users on proprietary tools and query
languages
● Third party BI tools use JDBC/ODBC
● Leverage powerful full text search
● Join Solr collections
● Join Solr collections with other data sources

Analytics Visualization tools
Banana (available with HDP Search)
Solr 6.0+ (Solr SQL)
- Apache Zeppelin
Lucidworks Fusion (Spark SQL - Solr SQL)
- Tableau
- Apache Zeppelin
- Jupyter
- Any third party product that supports JDBC/ODBC
Lucidworks Fusion App Insights

Banana Dashboards
Provided with HDP
Search
Easily create
dashboards for a Solr
collection
Based on facet queries
Requires basic
knowledge of Solr

FusionSQL - Using Spark and Solr together

Tableau: Solr Collections look like tables

Fusion - Tableau: Self Service BI/Analytics

SQL Benefits with Solr/Fusion
• Leverage existing BI tools like Tableau and
Zeppelin
• Add full-text search and advanced Solr AI
features to your SQL query
• Ranking by relevance
• Joins across collections
• Fast and responsive queries at scale
• Ask interesting questions of your data

Fusion App Insights
• Customizable dashboards to visualize
Query Analytics.
• Built in Analytics reports based on
Fusion AI Smart jobs for analyzing query
performance.
• Experiment analysis to give you
feedback on how search variants are
performing.
• Thorough analytics on users, sessions,
and all interactions (signals)

Resources
Solr Reference Guide:
● Streaming Expressions: https://lucene.apache.org/solr/guide/streaming-expressions.html
● Setting up Solr to be used with generic SQL clients: https://lucene.apache.org/solr/guide/7_3/parallel-sql-interface.html#generic-clients
● Solr and Apache Zeppelin: https://lucene.apache.org/solr/guide/7_3/solr-jdbc-apache-zeppelin.html#solr-jdbc-apache-zeppelin
Lucidworks Fusion (Solr SQL and Spark SQL) - setting up Tableau:
https://lucidworks.com/2017/02/01/sql-in-fusion-3/
Tech at Bloomberg: The search for Solr analytics: https://www.techatbloomberg.com/blog/the-search-for-solr-analytics/
