Payloads and OCR with Solr
OpenSource Connections October 2019
Apache Lucene/Solr - London User Group
Introductions
Eric Pugh: Search Relevance Engineer at OpenSource Connections
Daniel Worley: Search Relevance Engineer at OpenSource Connections
Searching Text Inside Images
Demo: open http://pdf-discovery-demo.dev.o19s.com:8080 and search for “HELOC”
OCR
● Tesseract and Tika enable us to get text out of images via OCR
● Text can then be indexed into Solr
● Problem solved?
Highlighting
● What if we want to highlight the matching text inside the image itself?
● Regular Solr highlighting is not good enough
○ We get text snippets, but we can’t see where in the image they came
from
○ For images with a lot of information, this makes it hard for users to
see why a particular image matched their query
The Problem
● Tesseract gives us bounding boxes for all of the OCR’d text
● We need access to this bounding box information within Solr on a
per-match basis
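To make the bounding-box data concrete, here is a minimal Python sketch that reads word-level rows in the TSV column layout Tesseract emits. The sample rows, the coordinates, and the `parse_word_boxes` helper are made up for illustration and are not taken from the demo.

```python
import csv
import io

# Column order of Tesseract's TSV output.
TSV_HEADER = ["level", "page_num", "block_num", "par_num", "line_num",
              "word_num", "left", "top", "width", "height", "conf", "text"]

# Hypothetical sample: one line-level row (level 4) and two word-level rows (level 5).
SAMPLE_TSV = "\t".join(TSV_HEADER) + "\n" + "\n".join([
    "4\t1\t1\t1\t1\t0\t100\t200\t260\t24\t-1\t",
    "5\t1\t1\t1\t1\t1\t100\t200\t80\t24\t96\tHELOC",
    "5\t1\t1\t1\t1\t2\t190\t200\t170\t24\t91\tstatement",
])


def parse_word_boxes(tsv_text):
    """Return one dict per recognized word with its pixel bounding box."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    boxes = []
    for row in reader:
        # Level 5 rows describe single words; skip layout rows and empty text.
        if row["level"] != "5" or not row["text"].strip():
            continue
        boxes.append({
            "text": row["text"],
            "page": int(row["page_num"]),
            "left": int(row["left"]),
            "top": int(row["top"]),
            "width": int(row["width"]),
            "height": int(row["height"]),
        })
    return boxes


boxes = parse_word_boxes(SAMPLE_TSV)
for box in boxes:
    print(box)
```

Each resulting record pairs a word with the rectangle it occupies, which is the per-token metadata the rest of the talk tries to carry into Solr.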
What about Payloads?
● Payloads provide a way of attaching various metadata to each token
● More info:
https://www.slideshare.net/lucidworks/payloads-in-solr-erik-hatcher-lucidworks
But ....
The Challenge
● Payloads are typically used at query time for matching or to affect the
score of matching documents.
● There is little support for surfacing payload data in query results; the
usual workaround is to extract it again manually from the stored data
Iteration 1 - Idea
Create a highlighter formatter that surfaces payload attributes
Iteration 1 - Results
● Required hacking on low-level Lucene internals to include the payload
attribute in the token stream.
● Suitable for a PoC, but not for real applications
Iteration 2 - Idea
Create a component that returns payloads only for the query clauses that
matched
Iteration 2 - Results
A deployable plugin that works without modifying Lucene
Payload Component - What’s in the box?
● Payload Component
● And some conveniences:
○ Base64Encoder
○ PayloadBufferFilterFactory
● Available at: https://github.com/o19s/payload-component
Payload Component
● Similar to the highlighting component, but returns only the matches
● Matches are currently not scored
● For each match, the payload data is added to the response if available
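The behaviour described above can be modelled in a few lines of Python. This is a toy in-memory sketch, not the plugin's implementation or its response format: `matched_payloads`, the token list, and the "left top width height" payload strings are all hypothetical.

```python
def matched_payloads(doc_tokens, query_terms):
    """Return {term: [payload, ...]} for query terms found in the document.

    doc_tokens is a list of (term, payload) pairs; payload may be None.
    Matches are collected in document order and are not scored.
    """
    wanted = {term.lower() for term in query_terms}
    matches = {}
    for term, payload in doc_tokens:
        key = term.lower()
        # Only report a match when it actually carries payload data.
        if key in wanted and payload is not None:
            matches.setdefault(key, []).append(payload)
    return matches


# Hypothetical document: payloads hold "left top width height" boxes.
doc = [
    ("HELOC", "100 200 80 24"),
    ("statement", "190 200 170 24"),
    ("HELOC", "100 400 80 24"),
    ("balance", None),  # token without a payload
]

result = matched_payloads(doc, ["heloc", "balance"])
print(result)
```

With payloads holding bounding boxes, a front end could draw one rectangle per returned payload on top of the page image.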
PayloadBufferFilterFactory
● A filter that works around payload oddities in Solr
● Filters that produce new tokens often clear all token attributes,
including payloads.
● This filter copies the payload data and restores it after the other
filters have run.
PayloadBufferFilterFactory
[Diagram: analysis chain with a COPY step and a later PASTE step]
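The copy-then-paste idea can be sketched conceptually in Python. This models the behaviour only and says nothing about how the plugin implements it: the `Token` class, the three functions, and the choice to key the buffer by token position are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    term: str
    position: int
    payload: Optional[bytes] = None


def copy_payloads(tokens: List[Token], buffer: Dict[int, bytes]) -> List[Token]:
    """COPY: remember each payload by token position, pass tokens through."""
    for tok in tokens:
        if tok.payload is not None:
            buffer[tok.position] = tok.payload
    return tokens


def lossy_lowercase(tokens: List[Token]) -> List[Token]:
    """Stand-in for a filter that builds new tokens and drops the payload."""
    return [Token(tok.term.lower(), tok.position) for tok in tokens]


def paste_payloads(tokens: List[Token], buffer: Dict[int, bytes]) -> List[Token]:
    """PASTE: reattach the remembered payload to the token at that position."""
    return [Token(t.term, t.position, buffer.get(t.position)) for t in tokens]


buffer: Dict[int, bytes] = {}
stream = [Token("HELOC", 0, b"100 200 80 24"),
          Token("Statement", 1, b"190 200 170 24")]

stream = copy_payloads(stream, buffer)
stream = lossy_lowercase(stream)
lost = [t.payload for t in stream]   # payloads are gone at this point
stream = paste_payloads(stream, buffer)
print(lost, stream)
```

Keying by position only works while the filters in between keep positions stable; a filter that injects or removes tokens would need a different matching strategy.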
Base64Encoder
● The DelimitedPayloadTokenFilterFactory expects data as:
○ [term][delimiter][payload]
● What about a payload that contains whitespace, such as dog|barks woofs?
○ Will “woofs” be included as part of the payload?
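A small Python model shows what happens to that input, assuming the field is split on whitespace before the delimiter is examined (a common arrangement, but an assumption here). `split_delimited` is a hypothetical helper, not a Solr API.

```python
def split_delimited(field_value, delimiter="|"):
    """Split on whitespace, then split each token into (term, payload)."""
    pairs = []
    for raw in field_value.split():  # whitespace tokenization comes first
        term, sep, payload = raw.partition(delimiter)
        # Tokens without a delimiter carry no payload at all.
        pairs.append((term, payload if sep else None))
    return pairs


pairs = split_delimited("dog|barks woofs")
print(pairs)  # [('dog', 'barks'), ('woofs', None)]
```

Under that assumption, "woofs" becomes a separate term with no payload instead of being part of the payload for "dog".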
Base64Encoder Cont’d
● To get around this problem, the payload can be encoded in Base64
○ dog|YmFya3Mgd29vZnM=
● The Base64Encoder accepts Base64 data at index time but stores the
decoded value as the payload.
○ YmFya3Mgd29vZnM= -> barks woofs
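The encoding step on the client side can be sketched with Python's standard `base64` module. `encode_token` and `decode_payload` are hypothetical helper names; only the `dog|YmFya3Mgd29vZnM=` value comes from the slides.

```python
import base64


def encode_token(term, payload_text, delimiter="|"):
    """Build a term|payload token whose payload is Base64-encoded."""
    encoded = base64.b64encode(payload_text.encode("utf-8")).decode("ascii")
    return f"{term}{delimiter}{encoded}"


def decode_payload(encoded):
    """Recover the original payload text, as done at index time."""
    return base64.b64decode(encoded).decode("utf-8")


token = encode_token("dog", "barks woofs")
print(token)                               # dog|YmFya3Mgd29vZnM=
print(decode_payload("YmFya3Mgd29vZnM="))  # barks woofs
```

The standard Base64 alphabet contains neither whitespace nor the `|` character, so an encoded payload cannot be split by the tokenizer or confused with the delimiter.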
The Future: Matches Component
● Surface which terms and phrases from the query matched
● Surface payload data, as the payload component already does
● Surface other data from the index, such as offsets
Thanks
● PayloadComponent Repo: https://github.com/o19s/payload-component
● Demo Repo: https://github.com/o19s/pdf-discovery-demo
Big thanks to Dan Worley and Andrew Boyd and a brave
client for working with me to make this idea happen!
Interested in Relevance? Join us at www.o19s.com/slack to chat with your
peers.

