SQL Analytics for Search Engineers
Timothy Potter
Manager of Smart Data @ Lucidworks / Apache Solr Committer
@thelabdude
#Activate18 #ActivateSearch
An ever-expanding list of needs from search engineers
• Better relevancy, less manual tuning
• Bigger scale, less downtime, fixed resources
• Higher QPS, more complex query pipelines
• More bespoke, search-driven applications,
faster!
• Trying out new ideas
• Making better decisions with self-service
analytics
• Random one-off jobs for this and that
• Use AI everywhere!
The ideal solution …
• Easy to explain to your boss how it works
• Tooling available
• Résumé friendly
• Extensible / customizable / flexible
• Scalable
• People want to feel productive
SQL in Fusion!
Data Ingest = Project Friction
• Bespoke, search-driven applications >
general purpose dashboard tools
• Getting data in continues to be a hassle
/ friction when getting started
• Need something nimble but also fast /
scalable
• For every connector, there are probably
20 SQL / NoSQL data silos
Fusion’s Parallel Bulk Loader
• Get to the fun stuff faster!
• Complement Fusion’s connectors for those dirty
ETL jobs that cause friction in every project
• High performance parallel reads from structured
data sources, including Cassandra, Elastic, HBase,
JDBC, Hadoop, …
• Basic ETL tasks with SQL and/or custom Scala
• ML Model predictions as UDF
• Direct to Solr for optimal speed or send to index-
pipelines for optimal flexibility
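The "basic ETL with SQL" and "ML model predictions as UDF" bullets can be pictured with a small stand-in. This sketch uses Python's built-in sqlite3 module rather than Spark SQL, and `predict_sentiment` is a hypothetical placeholder for a real model; the table and column names are invented for illustration only.

```python
import sqlite3


def predict_sentiment(text):
    """Hypothetical stand-in for an ML model: a trivial keyword rule."""
    if text is None:
        return "unknown"
    return "positive" if "great" in text.lower() else "negative"


def run_etl(rows):
    """Load raw rows, clean them with SQL, and add a UDF-based prediction."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_reviews (id INTEGER, title TEXT, body TEXT)")
    conn.executemany("INSERT INTO raw_reviews VALUES (?, ?, ?)", rows)
    # Register the Python function so SQL can call it like a built-in.
    conn.create_function("predict_sentiment", 1, predict_sentiment)
    result = conn.execute(
        """
        SELECT id,
               trim(upper(title))      AS title_s,
               predict_sentiment(body) AS sentiment_s
        FROM raw_reviews
        WHERE body IS NOT NULL          -- drop rows with nothing to score
        ORDER BY id
        """
    ).fetchall()
    conn.close()
    return result


etl_output = run_etl([
    (1, "  solr in action ", "A great book"),
    (2, "spark guide", "Too shallow"),
    (3, "empty", None),
])
```

In the Parallel Bulk Loader the same shape would presumably be a Spark SQL statement over a DataFrame, with the model exposed as a Spark UDF.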
A foundation built on SparkSQL
• Expose structured data as a DataFrame:
RDD + schema
• 100’s of data sources + formats
• spark-solr translates Solr query results
to a DataFrame
• Highly optimized parallel reads, with
predicate pushdown across a Spark
cluster
• Spark optimizes the SQL query plan
• 100’s of built-in functions
Demo: Parallel Bulk Loader
• Read parquet from S3
• Write to a Fusion Index Pipeline
• Advanced transforms with Scala
• Transform with SQL
• Add job dependencies on-the-fly
User Feedback to Improve Relevancy
• MRR is sub-optimal for many queries?
• Want to boost some docs based on user
click behavior (per query)
• Older clicks should age out over time
• Some user actions are more important
than others: click < cart add < purchase
• Sometimes you need to join signals with
other tables, e.g. item metadata
• Hide complex business logic behind UDF
/ UDAF (pluggable)
• Designed for change!
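The bullets above (per-query boosts, clicks that age out, weighted action types, a join against item metadata) can be sketched as one aggregation. This is an illustration only: it runs on SQLite through Python's standard library rather than Spark, the action weights and the 30-day half-life are assumed values, and `signal_weight` is a hypothetical scalar UDF summed with `sum()` rather than a true UDAF.

```python
import math
import sqlite3

# Assumed weights reflecting click < cart add < purchase.
ACTION_WEIGHTS = {"click": 1.0, "cart_add": 5.0, "purchase": 25.0}
HALF_LIFE_DAYS = 30.0  # assumed: a signal loses half its weight every 30 days


def signal_weight(action, age_days):
    """Weight of one signal: action weight times exponential time decay."""
    base = ACTION_WEIGHTS.get(action, 0.0)
    return base * math.pow(0.5, age_days / HALF_LIFE_DAYS)


def aggregate_signals(signals, items):
    """Sum decayed signal weights per (query, doc_id), joined with item metadata."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("signal_weight", 2, signal_weight)
    conn.execute(
        "CREATE TABLE signals (query TEXT, doc_id TEXT, action TEXT, age_days REAL)"
    )
    conn.execute("CREATE TABLE items (doc_id TEXT, category TEXT)")
    conn.executemany("INSERT INTO signals VALUES (?, ?, ?, ?)", signals)
    conn.executemany("INSERT INTO items VALUES (?, ?)", items)
    rows = conn.execute(
        """
        SELECT s.query, s.doc_id, i.category,
               sum(signal_weight(s.action, s.age_days)) AS boost_weight
        FROM signals s
        JOIN items i ON i.doc_id = s.doc_id
        GROUP BY s.query, s.doc_id, i.category
        ORDER BY s.query, boost_weight DESC
        """
    ).fetchall()
    conn.close()
    return rows


boosts = aggregate_signals(
    [
        ("ipad", "d1", "click", 0.0),
        ("ipad", "d1", "purchase", 30.0),
        ("ipad", "d2", "click", 60.0),
    ],
    [("d1", "tablets"), ("d2", "cases")],
)
```

Keeping the decay and weighting inside a function is what makes the job "designed for change": the SQL stays the same when the business rule changes.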
Signal Data Flow in Fusion
Demo: SQL Aggregation
• Join with other tables
• Custom UDAF
• Final output to Solr
Window Functions
WITH sessions AS (
  -- A gap of more than 30 seconds between requests starts a new session;
  -- the running sum of those gap flags becomes the per-client session number.
  SELECT *,
         sum(IF(diff_secs > 30, 1, 0))
           OVER (PARTITION BY clientip ORDER BY ts) session_id
  FROM (
    -- Lag window function: seconds since the previous request from the same client.
    SELECT *,
           unix_timestamp(ts) - lag(unix_timestamp(ts))
             OVER (PARTITION BY clientip ORDER BY ts) AS diff_secs
    FROM ${inputCollection}
    WHERE clientip IS NOT NULL AND ts IS NOT NULL AND bytes IS NOT NULL
      AND verb IS NOT NULL AND response IS NOT NULL
  )
)
SELECT concat_ws('||', clientip, session_id) AS id,
       first(clientip) AS clientip,
       min(ts) AS session_start,
       max(ts) AS session_end,
       timediff(max(ts), min(ts), "MILLISECONDS") AS session_len_ms_l,
       sum(bytes) AS total_bytes_l,
       count(*) AS total_requests_l
FROM sessions
GROUP BY clientip, session_id
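To try the sessionization pattern without a Fusion or Spark cluster, the same lag-plus-running-sum idea can be run on SQLite (3.25 or later, for window functions) through Python's standard library. This is a sketch under simplifying assumptions: timestamps are plain integer seconds, the `logs` table and its sample rows are invented, and Fusion-specific functions such as `timediff` are left out.

```python
import sqlite3

SESSION_GAP_SECS = 30  # same threshold as the query on the slide


def sessionize(events):
    """Group (clientip, ts, bytes) log events into sessions per client."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logs (clientip TEXT, ts INTEGER, bytes INTEGER)")
    conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", events)
    rows = conn.execute(
        """
        WITH gaps AS (
            -- Seconds since the previous request from the same client
            -- (NULL for a client's first request).
            SELECT *,
                   ts - lag(ts) OVER (PARTITION BY clientip ORDER BY ts) AS diff_secs
            FROM logs
        ),
        sessions AS (
            -- Running count of "long gap" flags numbers the sessions 0, 1, 2, ...
            SELECT *,
                   sum(CASE WHEN diff_secs > ? THEN 1 ELSE 0 END)
                     OVER (PARTITION BY clientip ORDER BY ts) AS session_id
            FROM gaps
        )
        SELECT clientip, session_id,
               min(ts)    AS session_start,
               max(ts)    AS session_end,
               sum(bytes) AS total_bytes,
               count(*)   AS total_requests
        FROM sessions
        GROUP BY clientip, session_id
        ORDER BY clientip, session_id
        """,
        (SESSION_GAP_SECS,),
    ).fetchall()
    conn.close()
    return rows


session_rows = sessionize([
    ("10.0.0.1", 0, 100),
    ("10.0.0.1", 10, 200),
    ("10.0.0.1", 100, 50),   # 90 s after the previous request: new session
    ("10.0.0.2", 5, 10),
])
```

The first client ends up with two sessions (requests at 0 s and 10 s, then a separate one at 100 s) and the second client with one.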
SQL Aggregations Scalability
• Aggregate 42M signals into 11M groups
(query / doc_id)
• ~18 mins on 3 node EC2 cluster (r3.xlarge)
• Mostly I/O from/to Solr
Why Self-service Analytics?
• Powerful connectors, relevance, speed,
and massive scalability = more mission-
critical datasets finding their way into
Fusion
• Don’t be another data silo!
• Let users ask questions of this data
using their tool of choice w/o adding
work for the IT group!
• Aggregations over full-text ranked
results
• But it has to be fast, or else you’re right
back to data warehousing problems
Self-service Analytics
• Fusion SQL is a JDBC service that
supports SQL
• Fusion SQL plugs into Apache
Spark’s query planner to translate
SQL into optimized Solr queries
(streaming expressions and JSON
facets)
• Integrate with popular BI tools like
Tableau, PowerBI, and Spotfire +
Notebooks like Apache Zeppelin
Demo: Self-Service Analytics
Self-service Analytics Performance
• A Periscope Data blog post (linked below) compared their SQL
engine against common DBs using a count-distinct query typical
for dashboards
• 14M logs, 1200 distinct dashboards, 1700 distinct
user_id/dashboard_id pairs
• Replicated the experiment with Fusion on EC2 (m1.xlarge),
single instance of Solr
Fusion: ~900 ms
28M rows: ~1.3 secs
https://www.periscopedata.com/blog/count-distinct-in-mysql-postgres-sql-server-and-oracle.html
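For readers unfamiliar with the benchmark, a count-distinct dashboard query has roughly the following shape. This is a guess at the form of the query, not a copy of the one used in the blog post or in the Fusion run; the table names, column names, and sample rows are assumptions, and SQLite (via Python's standard library) stands in for the engines that were measured.

```python
import sqlite3


def distinct_users_per_dashboard(dashboards, logs):
    """Count distinct users per dashboard, most-viewed dashboards first."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dashboards (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE time_on_site_logs (user_id TEXT, dashboard_id INTEGER)")
    conn.executemany("INSERT INTO dashboards VALUES (?, ?)", dashboards)
    conn.executemany("INSERT INTO time_on_site_logs VALUES (?, ?)", logs)
    rows = conn.execute(
        """
        SELECT d.name, count(DISTINCT l.user_id) AS users
        FROM time_on_site_logs l
        JOIN dashboards d ON d.id = l.dashboard_id
        GROUP BY d.name
        ORDER BY users DESC, d.name
        """
    ).fetchall()
    conn.close()
    return rows


distinct_counts = distinct_users_per_dashboard(
    [(1, "sales"), (2, "ops")],
    [("u1", 1), ("u1", 1), ("u2", 1), ("u1", 2), ("u1", 2)],
)
```

Repeated visits by the same user count once per dashboard, which is why this kind of query is expensive on large log tables.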
Self-service Analytics Performance
SELECT m.title AS title, agg.aggCount AS aggCount
FROM movies m
INNER JOIN (
  SELECT movie_id, COUNT(*) AS aggCount
  FROM ratings
  WHERE rating >= 4
  GROUP BY movie_id
  ORDER BY aggCount DESC
  LIMIT 10
) AS agg ON agg.movie_id = m.id
ORDER BY aggCount DESC
20M rows
Fusion SQL: ~1.1 secs
MySQL: 17 secs (w/ index on movie_id)
MovieLens data: Aggregate 20M ratings
https://lucidworks.com/2018/08/06/using-tableau-sql-and-search-for-fast-data-visualizations/
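The top-10 join above can be exercised on a toy dataset to see what it returns. The sketch below runs essentially the same SQL on SQLite through Python's standard library; the handful of movies and ratings is invented, an extra tie-breaker on title is added to keep the output deterministic, and nothing here says anything about the timings quoted on the slide.

```python
import sqlite3


def top_rated_movies(movies, ratings, limit=10):
    """Titles of the movies with the most ratings of 4 or higher."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating REAL)")
    conn.executemany("INSERT INTO movies VALUES (?, ?)", movies)
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", ratings)
    rows = conn.execute(
        """
        SELECT m.title AS title, agg.aggCount AS aggCount
        FROM movies m
        INNER JOIN (
            -- Aggregate first, keep only the top N, then join to get titles.
            SELECT movie_id, COUNT(*) AS aggCount
            FROM ratings
            WHERE rating >= 4
            GROUP BY movie_id
            ORDER BY aggCount DESC
            LIMIT ?
        ) AS agg ON agg.movie_id = m.id
        ORDER BY agg.aggCount DESC, m.title
        """,
        (limit,),
    ).fetchall()
    conn.close()
    return rows


top_movies = top_rated_movies(
    [(1, "Alien"), (2, "Brazil"), (3, "Casablanca")],
    [(1, 1, 5.0), (2, 1, 4.0), (3, 1, 3.0), (1, 2, 4.5), (2, 3, 2.0)],
)
```

Aggregating inside the subquery before joining keeps the join small: only the top N movie ids ever meet the `movies` table.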
Experiments
• Run live experiments to try out
new ideas and compare
outcomes between variants
• Built-in metrics: MRR, avg|min|max
response time, CTR … and, you
guessed it, SQL!
• Bayesian Bandits to
explore/exploit the best
performing variant
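The slide does not say which bandit algorithm Fusion uses. One common Bayesian approach is Thompson sampling with a Beta posterior per variant, sketched below as an assumption rather than a description of Fusion's implementation; the variant names and click-through rates in the simulation are made up.

```python
import random


class BetaBandit:
    """Thompson sampling over variants with binary (click / no click) rewards."""

    def __init__(self, variants, seed=None):
        self.rng = random.Random(seed)
        # [alpha, beta] per variant; Beta(1, 1) is a uniform prior.
        self.stats = {v: [1, 1] for v in variants}

    def choose(self):
        """Sample a plausible success rate per variant and pick the highest."""
        samples = {
            v: self.rng.betavariate(a, b) for v, (a, b) in self.stats.items()
        }
        return max(samples, key=samples.get)

    def update(self, variant, success):
        """Record one observed outcome for the chosen variant."""
        if success:
            self.stats[variant][0] += 1
        else:
            self.stats[variant][1] += 1


def simulate(true_rates, rounds, seed=42):
    """Route simulated traffic with the bandit and count pulls per variant."""
    bandit = BetaBandit(list(true_rates), seed=seed)
    env = random.Random(seed + 1)  # separate RNG for simulated user behavior
    pulls = {v: 0 for v in true_rates}
    for _ in range(rounds):
        variant = bandit.choose()
        pulls[variant] += 1
        bandit.update(variant, env.random() < true_rates[variant])
    return pulls


pulls = simulate({"control": 0.05, "variant_b": 0.50}, rounds=2000)
```

Early on both variants receive traffic (explore); as evidence accumulates, most requests go to the variant with the higher observed rate (exploit).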
Demo: Experiment Metrics
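Per-variant metrics such as CTR, MRR, and average response time reduce to a GROUP BY over request-level data. The sketch below shows one way to express them in SQL, using SQLite through Python's standard library; the `requests` table layout is hypothetical, and the metric definitions used here (CTR as the share of requests with a click, MRR as the mean of 1/rank of the clicked result with 0 for no click) are common conventions that may differ from Fusion's built-in definitions.

```python
import sqlite3


def variant_metrics(requests):
    """Compute request count, avg response time, CTR, and MRR per variant."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE requests (variant TEXT, response_ms INTEGER, clicked_rank INTEGER)"
    )
    conn.executemany("INSERT INTO requests VALUES (?, ?, ?)", requests)
    rows = conn.execute(
        """
        SELECT variant,
               count(*)         AS requests,
               avg(response_ms) AS avg_response_ms,
               -- share of requests that led to a click
               avg(CASE WHEN clicked_rank IS NOT NULL THEN 1.0 ELSE 0.0 END) AS ctr,
               -- reciprocal rank of the clicked result, 0 when nothing was clicked
               avg(CASE WHEN clicked_rank IS NOT NULL
                        THEN 1.0 / clicked_rank ELSE 0.0 END) AS mrr
        FROM requests
        GROUP BY variant
        ORDER BY variant
        """
    ).fetchall()
    conn.close()
    return rows


metrics = variant_metrics([
    ("A", 100, 1), ("A", 200, None), ("A", 300, 2), ("A", 400, None),
    ("B", 100, 1), ("B", 100, 1),
])
```

Because the metrics are plain SQL, adding a new KPI is a matter of adding another aggregate expression to the SELECT list.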
Recap
• How to build powerful SQL aggregations with
joins, custom UDF/ UDAF, and window functions to
power boosting and recommendations
• Ingesting data from data sources using SQL for
ETL, ML
• Self-service analytics from popular BI visualization
tools
• Measure outcomes between variants in an
experiment using SQL
https://github.com/lucidworks/fusion-spark-bootcamp
Top 10 Things you can do with SQL in Fusion
1. Aggregate signals by query / doc / user to compute boost
weights and generate recommendations
2. Ingest & ETL from 100’s of data sources using SparkSQL
3. Use ML models to generate predictions and Lucene text
analysis using UDF functions
4. Join data from multiple Solr collections and data sources
5. Self-service analytics with BI tools like Tableau and PowerBI
6. Hide complex business logic behind UDF / UDAF
7. Use window functions for tasks like sessionization
8. Grouping sets and cubes for advanced analytic reporting
9. Compute KPIs across variants in an experiment
10. Expose complex Solr streaming expressions as simple SQL
views
Thank you!
Timothy Potter
Manager Smart Data, Lucidworks
@thelabdude
#Activate18 #ActivateSearch

Editor's Notes

  • #3: How are you going to get all this done? In Fusion, we chose SQL as the foundational technology to solve many of these issues.
  • #4: So I think we’re all pretty clear on the scope of the problem, but what might the ideal solution look like? Audience poll: (1) How many know SQL and have used it in some fashion in the last year? (2) How many have integrated search with some sort of SQL database today?
  • #5: One of the amazing things about App Studio is that you can rapidly build bespoke search applications w/o creating another data silo! Getting data indexed is not the end goal of a project; it is an impediment on most projects that adds friction and distracts us from the important stuff (queries / visualization). Organizations are really good at provisioning data silos. To let people ask new questions of your data, they need access across many data sources. SQL and NoSQL databases are everywhere! We need something nimble to go grab data from multiple places. Connectors are great for complex business apps like SharePoint and Box, but for every SharePoint there are 100 SQL / NoSQL databases in a modern org.
  • #7: SQL lets Spark create an optimized query plan, which sometimes we know how to optimize further for Solr. Typically built by experts. NoSQL: Cassandra, HBase, Hive, Mongo; S3, HDFS, parquet; Search: Solr, Elastic; RDBMS: JDBC, Redshift, Hive; Azure, Google Analytics.
  • #8: Ingest data from S3; invoke an ML model to do NLP stuff; do some basic ETL with SQL.
  • #9: Just a placeholder slide for what is shown in the demo
  • #10: Spark function reference: https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/index.html
  • #11: See: https://doc.lucidworks.com/fusion-ai/4.0/user-guide/signals/signals.html
  • #12: Fusion’s built-in click signal SQL aggregation job; time-decay function; custom UDF (price bucketing); custom SQL job: sessionization of logs with window functions.
  • #13: Just a placeholder slide for what is shown in the demo
  • #14: Just a placeholder to show another example of a SQL agg job, this time with a window function to find sessions.
  • #16: The traditional problem with self-service analytics is speed, flexibility, scalability. A whole that’s greater than the sum of its parts.
  • #17: Push down the computation of an aggregated query into Solr for maximum performance. Or, pull rows into Spark from Solr to perform almost any analytics task.
  • #18: At step 1, a Fusion data analyst is authenticated by the JDBC/ODBC client application (e.g. Spotfire or Tableau) using Kerberos. Once authenticated, the user’s SQL query is sent to the Fusion SQL Thriftserver over HTTP (step 2 in the diagram). The SQL Thriftserver uses the service principal keytab to validate the incoming user identity using Kerberos (step 3).
    The Fusion SQL Thriftserver is a Spark application with a specific number of CPU cores and memory allocated from the pool of Spark resources. You can scale out the number of Spark worker nodes to increase available memory and CPU resources to the SQL service.
    The Thriftserver sends the query to Spark to be parsed into a logical query plan (step 4). During the query planning stage, Spark sends the logical plan to Fusion’s pushdown strategy component (step 5). The pushdown strategy analyzes the query plan to determine if there is an optimal Solr query / streaming expression that can “push down” aggregations into Solr to improve performance and scalability. For instance, the following SQL query can be translated into a Solr facet query by the Fusion pushdown strategy:
      select count(1) as the_count, movie_id from ratings group by movie_id
    The basic idea behind Fusion’s pushdown strategy is that it is much faster to let Solr facets perform basic aggregations than it is to export raw documents from Solr and have Spark perform the aggregation. If an optimal pushdown query is not possible, then Spark will pull raw documents from Solr and then perform any joins / aggregations needed in Spark. Put simply, the Fusion SQL service tries to translate SQL queries into optimized Solr queries but, failing that, the service simply reads all matching docs for a query into Spark and performs the SQL execution logic across the Spark cluster.
    During pushdown analysis, Fusion calls out to the registered AuthzFilterProvider implementation to get a filter query to perform row-level filtering for the Kerberos-authenticated user (step 6). By default there is no row-level security provider, but users can install their own implementation using the Fusion SQL service API. Lastly, a distributed Solr query gets executed by Spark to return documents that satisfy the SQL query criteria and row-level security filter (step 7). To leverage the distributed nature of Spark and Solr, Fusion SQL sends a query to all replicas for each shard in a Solr collection. Consequently, you can scale out SQL query performance by adding more Spark and/or Solr resources to your cluster.
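The example query in this note (a count per `movie_id`) maps naturally onto a Solr terms facet. As a rough illustration, and not Fusion's actual pushdown code (which these notes do not show), the hypothetical helper below builds a request body for Solr's JSON Facet API that would return one bucket per `movie_id` with its document count.

```python
import json


def group_count_to_json_facet(field, query="*:*"):
    """Build a Solr JSON request body that counts documents per value of `field`.

    Rough equivalent of: SELECT count(1), <field> FROM <collection> GROUP BY <field>
    """
    return {
        "query": query,
        "limit": 0,  # return no documents, only the facet buckets
        "facet": {
            field: {
                "type": "terms",   # one bucket per distinct field value
                "field": field,
                "limit": -1,       # no cap on the number of buckets
            }
        },
    }


facet_request = group_count_to_json_facet("movie_id")
facet_json = json.dumps(facet_request, indent=2, sort_keys=True)
```

Each bucket in Solr's response carries the field value and a count, so the aggregation happens inside Solr and no raw documents need to be exported to Spark.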
  • #19: Show connecting to Fusion SQL from Tableau (or maybe Apache Superset). Build a simple data visualization on-the-fly.
  • #20: Just a placeholder slide for what is shown in the demo
  • #24: Avg. time on site / # of interactions per variant. Show results in App Insights.