Payloads and OCR with Solr

OpenSource Connections October 2019
Apache Lucene/Solr - London User Group

Introductions
Eric Pugh: Search Relevance Engineer at OpenSource Connections
Daniel Worley: Search Relevance Engineer at OpenSource Connections

Searching Text Inside Images
Demo: http://pdf-discovery-demo.dev.o19s.com:8080 (search for “HELOC”)

OCR
● Tesseract and Tika enable us to get text out of images via OCR
● Text can then be indexed into Solr
● Problem solved?

Highlighting
● What if we want to highlight the text in an image that matched?
● Regular Solr highlighting is not good enough
○ We can get text snippets but we can’t see where they came from in the
image
○ For images with a lot of information this can make it hard for users to
see why a particular image matched their query

The Problem
● Tesseract has provided us bounding boxes for all of the OCR’d text
● We need to access this bounding box information within Solr on a per
match basis
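One way to carry the bounding boxes into Solr is to write each OCR'd word as a delimited payload token. The sketch below is illustrative only: the OcrWord fields, the "left,top,width,height" payload layout, and the sample coordinates are assumptions, not the format used in the demo.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """One OCR'd word plus its bounding box (hypothetical field names)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def to_payload_token(word: OcrWord, delimiter: str = "|") -> str:
    # Assumed payload layout: "left,top,width,height" with no spaces,
    # so a whitespace tokenizer keeps term and payload together.
    box = f"{word.left},{word.top},{word.width},{word.height}"
    return f"{word.text}{delimiter}{box}"


def to_field_value(words: list[OcrWord]) -> str:
    # Join all tokens into a single string suitable for one text field.
    return " ".join(to_payload_token(w) for w in words)


words = [
    OcrWord("HELOC", 120, 48, 96, 22),
    OcrWord("statement", 224, 48, 140, 22),
]
field_value = to_field_value(words)
print(field_value)
```

The resulting string follows the [term][delimiter][payload] shape, so the box travels with the token it describes.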

What about Payloads?
● Payloads provide a way of attaching various metadata to each token
● More info:
https://www.slideshare.net/lucidworks/payloads-in-solr-erik-hatcher-lucidworks

The Challenge
● Payloads are typically used at query time for matching or to affect the
score of matching documents.
● There is little support for surfacing payload data in query results
without manually extracting it again from the stored data

Iteration 1 - Idea
Create a highlighter formatter that surfaces payload attributes

Iteration 1 - Results
● Required hacking at low-level Lucene internals to include the payload
attribute in the token stream.
● Suitable for a PoC, not great for any real applications

Iteration 2 - Idea
Create a component that only returns payloads for clauses that matched in
the query

Iteration 2 - Results
A deployable plugin that doesn’t require hacking on Lucene to work

Payload Component - What’s in the box?
● Payload Component
● And some conveniences:
○ Base64Encoder
○ PayloadBufferFilterFactory
● Available at: https://github.com/o19s/payload-component

Payload Component
● Similar to the highlighting component but returns matches only
● Currently no scoring of matches
● For each match, add the payload data to the response if available
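The idea of returning payloads only for matching terms can be modeled in a few lines. This is a toy in-memory sketch, not the plugin's request parameters or response format; the function names and the sample document are hypothetical.

```python
def parse_tokens(field_value: str, delimiter: str = "|"):
    """Split a delimited-payload string into (term, payload) pairs."""
    tokens = []
    for raw in field_value.split():
        term, sep, payload = raw.partition(delimiter)
        # Tokens without a delimiter carry no payload.
        tokens.append((term.lower(), payload if sep else None))
    return tokens


def payloads_for_matches(field_value: str, query_terms: list[str]) -> dict:
    """Collect payloads only for tokens that match a query term."""
    wanted = {t.lower() for t in query_terms}
    matches: dict[str, list[str]] = {}
    for term, payload in parse_tokens(field_value):
        if term in wanted and payload is not None:
            matches.setdefault(term, []).append(payload)
    return matches


doc = "HELOC|120,48,96,22 statement|224,48,140,22 HELOC|80,310,96,22"
result = payloads_for_matches(doc, ["heloc"])
print(result)
```

Each matching occurrence contributes its own payload, which is what lets a front end draw one box per hit on the image.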

PayloadBufferFilterFactory
● A filter to work around payload oddities in Solr
● Filters that produce new tokens often remove all attributes, which
includes payloads.
● This filter will copy the Payload data and restore it later on after other
filters have been run.
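The copy-and-restore idea can be illustrated with a small simulation. This is a conceptual sketch only: the Token and PayloadBuffer classes are hypothetical, the real filter's internals may differ, and the sketch assumes the intermediate filter emits exactly one token per input token.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    term: str
    payload: Optional[bytes] = None


class PayloadBuffer:
    """Holds payloads outside the token stream between copy and paste."""

    def __init__(self) -> None:
        self._saved: list[Optional[bytes]] = []

    def copy(self, tokens: list[Token]) -> list[Token]:
        # COPY: remember each payload by token position.
        self._saved = [t.payload for t in tokens]
        return tokens

    def paste(self, tokens: list[Token]) -> list[Token]:
        # PASTE: reattach saved payloads by position (assumes 1:1 filters).
        return [Token(t.term, saved) for t, saved in zip(tokens, self._saved)]


def attribute_dropping_filter(tokens: list[Token]) -> list[Token]:
    # Stand-in for a filter that emits fresh tokens and loses the payload.
    return [Token(t.term.lower()) for t in tokens]


stream = [Token("HELOC", b"120,48,96,22"), Token("Statement", b"224,48,140,22")]
buffer = PayloadBuffer()
stream = buffer.copy(stream)
stream = attribute_dropping_filter(stream)
lost = [t.payload for t in stream]          # payloads are gone here
stream = buffer.paste(stream)
restored = [(t.term, t.payload) for t in stream]
print(lost, restored)
```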

PayloadBufferFilterFactory
[Diagram: the filter chain, with a COPY step and a PASTE step marked]

Base64Encoder
● The DelimitedPayloadTokenFilterFactory expects data as:
○ [term][delimiter][payload]
● What about dog|barks woofs?
○ Will “woofs” be included as part of the payload?

Base64Encoder Cont’d
● To get around this problem, the payload can be encoded in Base64
○ dog|YmFya3Mgd29vZnM=
● The Base64Encoder accepts Base64 data at index time but stores the
decoded version.
○ YmFya3Mgd29vZnM= -> barks woofs
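The whitespace problem and the Base64 workaround can be reproduced with the Python standard library. This sketch assumes the field is tokenized on whitespace before the delimiter is processed; split_token is a hypothetical helper, not part of the plugin.

```python
import base64


def split_token(raw: str, delimiter: str = "|") -> tuple[str, str]:
    """Split one token into (term, payload); payload is "" if absent."""
    term, _, payload = raw.partition(delimiter)
    return term, payload


# Naive form: whitespace tokenization cuts the payload in two.
naive_tokens = [split_token(t) for t in "dog|barks woofs".split()]

# Base64 form: the payload contains no whitespace, so it survives intact.
encoded = base64.b64encode("barks woofs".encode("utf-8")).decode("ascii")
safe_tokens = f"dog|{encoded}".split()
term, payload = split_token(safe_tokens[0])
decoded = base64.b64decode(payload).decode("utf-8")

print(naive_tokens)
print(encoded, "->", decoded)
```

Under that assumption, “woofs” becomes a separate token with no payload in the naive form, while the encoded form keeps the whole payload attached to “dog”.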

The Future: Matches Component
● Surface which terms/phrases from the query matched
● Surface payload attribute data (already provided by the payload
component)
● Surface other data from the index such as offsets

Thanks
● PayloadComponent Repo: https://github.com/o19s/payload-component
● Demo Repo: https://github.com/o19s/pdf-discovery-demo
Big thanks to Dan Worley and Andrew Boyd and a brave
client for working with me to make this idea happen!
Interested in Relevance? Join us at www.o19s.com/slack to chat with your
peers.
