Thoth - Realtime Solr Monitor and Search Analysis Engine
Thoth
Real-time Solr Monitor
Search Analysis Engine
dbraga@trulia.com
pmhatre@trulia.com
Damiano Braga
Sr. Software Engineer
Praneet Mhatre
Data Mining Engineer
Overview
- What is Thoth?
- Data Collection and Thoth Core Indexing
- Thoth API & Thoth Dashboard
- Thoth Monitor
- Thoth ML: Prediction and Topic Modeling
- Special Thanks & Q/A
Demo
What is Thoth?
- Innovation project at Trulia
- Understand our search infrastructure without touching logs
- Troubleshoot search performance issues
- Designed as a modular system
- A set of tools to gather information about, monitor, and understand a search infrastructure
- Open source project:
thoth
thoth-ml
thoth-api
thoth-dashboard
thoth-monitor
thoth-demo
Problem: Know Your Search Infrastructure
- Solr logs are a good source, but sometimes hold only partial information
- Decentralized data (at least 1 log per search server)
- Log rotation
- Not searchable
What if we could index all of this information? Let's use Solr!
- We can search on it
- We have some handy features for free: facets, stats, etc.
- It's scalable
Thoth Document
1 Solr Request = 1 Thoth (Solr) Document
Server Info
hostname, port number, core name, pool name
Query Info
timestamp, actual query, qtime, hits, exception?
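The one-request-to-one-document mapping above could be modeled roughly as follows. This is only a sketch: the class name and field names are hypothetical and are not taken from the Thoth schema; only the list of attributes comes from the slide.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class ThothDocument:
    """One Solr request captured as one document (field names are illustrative)."""
    # Server info
    hostname: str
    port: int
    core_name: str
    pool_name: str
    # Query info
    timestamp: str                    # ISO-8601, e.g. "2014-09-16T18:00:02Z"
    query: str                        # the request as Solr received it
    qtime_ms: int                     # time Solr spent on the request
    hits: int                         # number of matching documents
    exception: Optional[str] = None   # None when the request succeeded


# Hypothetical sample values, for illustration only.
doc = ThothDocument("search01", 8983, "collection", "default",
                    "2014-09-16T18:00:02Z", "q=*:*&rows=10", 12, 95)
fields = asdict(doc)  # plain dict, ready to hand to an indexing client
```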
Data Collection (1/2)
- Should be smooth: collection must not slow down search traffic
- We care about near real-time data
- We care about historical data
- Dataset is growing fast
- Interceptor on each search server
- We use a SolrComponent attached to a Request Handler
- Queue system (e.g. ActiveMQ) to relay and temporarily store messages
- Each search server has a manifest in its solrconfig.xml
Data Collection (2/2)
<requestHandler name="select" class="com.solr2activemq.SolrToActiveMQHandler">
  <arr name="last-components">
    <str>solr2activemq</str>
  </arr>
</requestHandler>
<searchComponent name="solr2activemq" class="com.solr2activemq.SolrToActiveMQComponent">
  <str name="activemq-broker-uri">localhost</str>
  <int name="activemq-broker-port">61616</int>
  <str name="activemq-broker-destination-type">queue</str>
  <str name="activemq-broker-destination-name">test-queue</str>
  <str name="solr-hostname">localhost</str>
  <int name="solr-port">8983</int>
  <str name="solr-poolname">default</str>
  <str name="solr-corename">collection</str>
  <int name="solr2activemq-buffer-size">1000</int>
  <int name="solr2activemq-dequeuing-buffer-polling">500</int>
  <int name="solr2activemq-check-activemq-polling">5000</int>
</searchComponent>
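The buffer-size and polling settings in the configuration above suggest a bounded in-memory buffer that the request path fills and a background poller drains. The sketch below illustrates that pattern in Python under stated assumptions: the class and method names are hypothetical, and dropping messages when the buffer is full is an assumption made here to keep the request path non-blocking, not documented behavior of the solr2activemq component.

```python
import queue


class RequestBuffer:
    """Bounded buffer between the search request path and the message queue."""

    def __init__(self, max_size=1000):
        self._queue = queue.Queue(maxsize=max_size)
        self.dropped = 0

    def offer(self, message):
        """Request path: never blocks; drops the message if the buffer is full."""
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            self.dropped += 1  # assumption: losing a sample beats slowing a search
            return False

    def drain(self, send, max_batch=100):
        """Background poller: forward up to max_batch buffered messages via send()."""
        sent = 0
        while sent < max_batch:
            try:
                message = self._queue.get_nowait()
            except queue.Empty:
                break
            send(message)
            sent += 1
        return sent


# Usage: a buffer of size 2 accepts two messages and drops the third.
buf = RequestBuffer(max_size=2)
accepted = [buf.offer(m) for m in ("a", "b", "c")]
delivered = []
buf.drain(delivered.append)
```

In a real deployment `send` would publish to the broker, and the poller would run on its own thread at a fixed interval.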
Sizing of Data
- Need for granular information for near real-time data
- Less granularity for historical data
Too much data = slow search, space problem
- Shrinking feature:
- Create Shrank Document
- Real-time Core cleanup
- Shrinking time is configurable
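One plausible reading of the shrinking feature is that many per-request documents are collapsed into one summary per server and time window, after which the granular documents can be deleted from the real-time core. The sketch below shows that aggregation only. The input keys, the summary fields, and the 15-minute default are assumptions for illustration, not Thoth's actual schema.

```python
from collections import defaultdict
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def shrink(docs, bucket_minutes=15):
    """Collapse per-request docs into one summary per (hostname, time bucket).

    bucket_minutes is assumed to divide 60 evenly.
    """
    buckets = defaultdict(list)
    for d in docs:
        ts = datetime.strptime(d["timestamp"], TS_FORMAT)
        # Floor the timestamp to the start of its bucket.
        start = ts.replace(minute=ts.minute - ts.minute % bucket_minutes, second=0)
        buckets[(d["hostname"], start)].append(d)

    summaries = []
    for (hostname, start), group in sorted(buckets.items()):
        qtimes = [g["qtime"] for g in group]
        summaries.append({
            "hostname": hostname,
            "bucket_start": start.strftime(TS_FORMAT),
            "nqueries": len(group),
            "avg_qtime": sum(qtimes) / len(qtimes),
            "zero_hits": sum(1 for g in group if g["hits"] == 0),
            "exceptions": sum(1 for g in group if g.get("exception")),
        })
    return summaries


# Hypothetical sample data.
sample = [
    {"hostname": "s1", "timestamp": "2014-09-16T18:00:02Z", "qtime": 10, "hits": 5},
    {"hostname": "s1", "timestamp": "2014-09-16T18:14:59Z", "qtime": 30, "hits": 0},
    {"hostname": "s1", "timestamp": "2014-09-16T18:16:00Z", "qtime": 50, "hits": 1,
     "exception": "boom"},
]
shrunk = shrink(sample)
```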
Thoth Index
- Solr 4.7
- Soft commit for near real-time search
- Soft commit maxTime set to 1s
- Auto commit set to 15s
- Update chain set to enforce UUID as PkID
- Use of SolrJ to index data and query
Thoth API
- Abstraction for Thoth index and Thoth data
- Read only REST-like API
- JSON response
- Written in Node.js to accommodate socket.io
Example:
{"numFound":95,"values":[
{"timestamp":"2014-09-16T18:00:02Z","value":45337},
{"timestamp":"2014-09-16T18:15:02Z","value":77325},
{"timestamp":"2014-09-16T18:30:02Z","value":109523},
{"timestamp":"2014-09-16T18:45:02Z","value":112279},
{"timestamp":"2014-09-16T19:00:02Z","value":115334}
thoth:3001/api/server/foo/core/bar/port/portbar/start/NOW-1DAY/end/NOW/count/nqueries
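A client for the endpoint shown above might look like the following sketch. The URL layout and the `numFound`/`values` keys are taken from the example. The function names are hypothetical, and the response body in the usage lines is a made-up, complete JSON document rather than real Thoth output.

```python
import json


def count_url(host, server, core, port, start, end, attribute):
    """Build a count request path following the layout of the example URL."""
    return (f"{host}/api/server/{server}/core/{core}/port/{port}"
            f"/start/{start}/end/{end}/count/{attribute}")


def parse_values(body):
    """Turn a JSON response body into a list of (timestamp, value) pairs."""
    payload = json.loads(body)
    return [(point["timestamp"], point["value"]) for point in payload["values"]]


url = count_url("thoth:3001", "foo", "bar", "portbar",
                "NOW-1DAY", "NOW", "nqueries")

# Hypothetical, well-formed response used only to exercise the parser.
body = ('{"numFound":2,"values":['
        '{"timestamp":"2014-09-16T18:00:02Z","value":45337},'
        '{"timestamp":"2014-09-16T18:15:02Z","value":77325}]}')
points = parse_values(body)
```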
Thoth Dashboard (1/5)
- Visual insight on Thoth data
- Useful graphs divided by server or pool
- Handy list of slow queries and exceptions
- Real-time view for server
- Selecting data based on time
- Shareable URLs (for the Ops team, QA team, Release Eng.)
Thoth Dashboard (2/5)
Thoth Dashboard (3/5)
Thoth Dashboard (4/5)
Thoth Dashboard (5/5)
Thoth Monitor
- Continuously monitoring for metrics
- Stateless
- Alerting through email or Nagios
- Examples: QTime, Number of Zero hits, Predictor Model Health
- Possibility to implement custom monitors
- Reuse StatsComponent [http://wiki.apache.org/solr/StatsComponent] if possible
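A stateless monitor built on StatsComponent could, on each run, ask Solr for statistics over a recent time window and compare them with a threshold, keeping nothing between runs. The sketch below builds such a request and evaluates the result. `stats=true` and `stats.field` are standard StatsComponent parameters and `mean` is one of the values it returns. The field names `qtime_i` and `timestamp_dt`, the window, and the threshold are assumptions.

```python
from urllib.parse import urlencode


def stats_query(field, start="NOW-5MINUTES", end="NOW", time_field="timestamp_dt"):
    """Query string asking Solr for statistics on `field` over a time window."""
    return urlencode({
        "q": "*:*",
        "fq": f"{time_field}:[{start} TO {end}]",
        "rows": 0,              # only the statistics are needed, not documents
        "stats": "true",
        "stats.field": field,
    })


def check_mean(field, stats, threshold):
    """Return an alert message if the mean exceeds the threshold, else None."""
    mean = stats.get("mean")
    if mean is None or mean <= threshold:
        return None
    return f"{field}: mean {mean:.1f} exceeds threshold {threshold}"


params = stats_query("qtime_i")
alert = check_mean("qtime_i", {"mean": 120.0, "count": 10}, threshold=100)
quiet = check_mean("qtime_i", {"mean": 50.0, "count": 10}, threshold=100)
```

The alert string would then be handed to whatever channel is configured, such as email or Nagios.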
Thoth ML
What can we do with all this data?
• Rich source of information
• Can we turn it into knowledge?
• How about machine learning?
1. Query time prediction
2. Query pattern recognition
3. Server sizing and resource allocation
1. Query Time Prediction (1/4)
• Goal: route each query to the appropriate pool (slow or fast)
• Look at query attributes
  • Query text
  • Start parameter
  • Facets, range queries, geospatial searches, etc.
• Train a supervised learning model
• Use the learned model to predict whether a query will be slow or fast
• H2O Machine Learning Library
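The query attributes listed above have to be turned into features before a model can be trained. The sketch below is one way to do that from a raw Solr query string. The feature names and detection rules are assumptions for illustration (the slides do not specify the feature set), although `start`, `facet`, `fq`, and the `geofilt`/`bbox` query parsers are standard Solr.

```python
from urllib.parse import parse_qs


def extract_features(query_string):
    """Derive simple boolean and numeric features from a Solr query string."""
    params = parse_qs(query_string)
    q = params.get("q", [""])[0]
    filters = params.get("fq", [])
    return {
        "start": int(params.get("start", ["0"])[0]),          # paging depth
        "num_terms": len(q.split()),                           # rough size of query text
        "has_facet": params.get("facet", ["false"])[0].lower() == "true",
        "has_range": any(" TO " in clause for clause in [q] + filters),
        "has_geo": any("{!geofilt" in f or "{!bbox" in f for f in filters),
    }


heavy = extract_features(
    "q=san+francisco&start=20&facet=true"
    "&fq=price:[100000+TO+500000]&fq={!geofilt}"
)
light = extract_features("q=condo")
```

Each request's feature dictionary, paired with its observed QTime, would form one training row.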
1. Query Time Prediction (2/4)
Challenges
• Imbalanced dataset
• Frequency of model training
• Type of model
• Minimal delay requirement
1. Query Time Prediction (3/4)
Challenges Addressed
• Imbalanced dataset
  • Stratified sampling
• Frequency of model training
  • Automatically identify relearning frequency
• Type of model
  • Boolean, categorical features -> Tree based
  • High accuracy
  • Gradient Boosted Machine
• Minimal delay requirement
  • User pool queries: 45-50 ms
  • Prediction: 1-3 ms
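Stratified sampling for an imbalanced slow/fast dataset can be sketched as sampling each class separately so the rare class is not swamped. The slides do not say how the strata were sized, so the equal per-class cap used here is an assumption, as are the function name and the sample data.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_of, per_class, seed=0):
    """Draw up to per_class rows from every label, then shuffle the result."""
    rng = random.Random(seed)          # fixed seed keeps the sample reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        sample.extend(rng.sample(group, min(per_class, len(group))))
    rng.shuffle(sample)
    return sample


# Hypothetical data: 95 fast queries and 5 slow ones (slow means qtime > 100 ms).
rows = [{"id": i, "qtime": 10} for i in range(95)]
rows += [{"id": 95 + i, "qtime": 500} for i in range(5)]


def is_slow(row):
    return row["qtime"] > 100


balanced = stratified_sample(rows, is_slow, per_class=5)
```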
1. Query Time Prediction (4/4)
• 1000 Gradient Boosted Trees
• Slow queries: QTime > 100 ms (configurable)
• Experimental Results
  • Training on ~3.1 million
  • Test on ~1.4 million
  • AUC: 0.94542
  • Accuracy: 0.9202223
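For readers unfamiliar with the two reported metrics, the sketch below computes them from labels (1 = slow) and predicted scores. It uses the pairwise definition of AUC, which is quadratic in the data size and only suitable for small examples. The toy data is invented and unrelated to the results above.

```python
def accuracy(labels, scores, cutoff=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    correct = sum((s >= cutoff) == bool(y) for y, s in zip(labels, scores))
    return correct / len(labels)


def auc(labels, scores):
    """Probability that a random slow query scores above a random fast one.

    Ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6]
toy_auc = auc(labels, scores)
toy_accuracy = accuracy(labels, scores)
```

AUC is independent of the cutoff, which makes it the more informative of the two on an imbalanced dataset.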
Query Time Prediction in Action (1/2)
Performance on real time traffic at Trulia
Query Time Prediction in Action (2/2)
Performance on real time traffic at Trulia
2. Query Pattern Recognition
• Exceptions, zero hit queries
• Analyze and find out why
• Probabilistic Topic Modeling
• Using MALLET open source toolkit
Topic Modeling Flow
Topics With Keywords
Future Direction
- Thoth ML improvements:
  • Predicting query time buckets
  • Regression vs. classification
  • Exceptions and zero hit query analysis
  • Sizing and resource allocation
- SolrCloud integration
- Dashboard integration with SolrCloud
- More standard metrics on Thoth Monitor
- More data collection (load, GC)
Contributors and Special Thanks
Damiano: dbraga@trulia.com
Praneet: pmhatre@trulia.com
Fork us on GitHub!
github.com/trulia/thoth
JD Cantrell (API, Dashboard)
Giulio Grillanda (API, Dashboard)
Rajendra Shioramwar (Core)
Ying Wang (Design)
Girish Gudla (Monitor)
Alexander Kanarsky
Alex Burmester
