Challenges of Simple Documents: When Basic isn't so Basic - Cassandra Targett, Lucidworks

Challenges of Simple Documents
Cassandra Targett
Director of Engineering, Lucidworks
@childerelda
#Activate18 #ActivateSearch

About Me
• Lucene/Solr committer and member of PMC
• Director of Engineering at Lucidworks
• Manage team of Solr committers
• Live in the Florida Panhandle

Agenda
• Looking at the Solr Reference Guide as a content source
• Structure of raw documents
• HTML format
• Indexing the Reference Guide with bin/post
• Indexing the Reference Guide with Site Search

Solr Reference Guide as a Content Source

Brief History of the Ref Guide
• 2009: First version of the Guide created by Lucidworks
• 2013: Lucidworks donates Guide to Lucene/Solr community
• 2017: Guide integrated with Solr source

Moving from Confluence
• What we gained:
• Control over information design and presentation
• Versioned Guides
• Tighter integration with developers
• What we lost:
• Comments
• Managed infrastructure
• SEARCH

Challenges with Providing Ref Guide Search as a Community Artifact
Server
None of this is the core mission of the committer community

Baseline Feature Set
• Full text search
• Auto-complete
• Suggestions/Spellcheck
• Highlighting
• Facets? …based on?

Reference Guide Document Format

Asciidoc Format
Page title
Text
Image reference
with caption
Section title

Ref Guide Content Structure
• Asciidoc is relatively well-structured
• headings clearly separate from general text (==)
• code examples in blocks ([source,xml])
• Doesn’t include header/footer/nav “cruft”
• Challenges:
• Document links are to other .adoc files
• No URL for access via a search result list
• HTML metadata missing
• No established means of indexing .adoc files

Maybe We Should Use HTML Format?
• Lots of systems know how to read HTML
• URLs for access already exist and are correct
• Inter/intra-document links converted to correct HTML references (anchors or other pages)
• Challenges:
• Includes “cruft” of navs and headers/footers
• HTML can be pretty unstructured

Take One: bin/post
aka, Solr Cell and Tika

Solr’s bin/post
• Simple command line tool for POSTing content
• XML, JSON, CSV, HTML, PDF
• Includes a basic crawler
• Determines update handler to use based on file type
• JSON, CSV -> /update/json, /update/csv
• PDF, HTML, Word -> /update/extract
• Delegates to post.jar (in example/exampledocs)
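
The file-type routing described on this slide can be sketched in a few lines. This is a simplified illustration and not the logic of SimplePostTool itself: the table only covers the types the slide names, and the `/update` fallback for anything else (such as XML) is an assumption made for the example.

```python
from pathlib import PurePosixPath

# Mapping taken from the slide; the real tool recognizes more file types.
HANDLER_BY_EXTENSION = {
    ".json": "/update/json",
    ".csv": "/update/csv",
    ".pdf": "/update/extract",
    ".html": "/update/extract",
    ".doc": "/update/extract",
    ".docx": "/update/extract",
}


def choose_handler(filename: str, default: str = "/update") -> str:
    """Pick an update handler path from the file extension.

    The default of "/update" for unlisted types is an assumption.
    """
    suffix = PurePosixPath(filename).suffix.lower()
    return HANDLER_BY_EXTENSION.get(suffix, default)


for name in ("logging.html", "books.csv", "films.xml"):
    print(name, "->", choose_handler(name))
```

Routing on the extension alone is why every Ref Guide page in the run that follows ends up at `[base]/extract`.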

$ ./bin/post -c post-html -filetypes html example/refguide-html/
/Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/bin/java -classpath /Applications/Solr/solr-7.5.0/dist/solr-core-7.5.0.jar -Dauto=yes -Dfiletypes=html -Dc=post-html -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool example/refguide-html/
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/post-html/update...
Entering auto mode. File endings considered are html
Entering recursive mode, max depth=999, delay=0s
Indexing directory example/refguide-html (245 files, depth=0)
POSTing file requestdispatcher-in-solrconfig.html (text/html) to [base]/extract
POSTing file client-api-lineup.html (text/html) to [base]/extract
…
250 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/post-html/update...
Time spent: 0:00:07.660

/update/extract aka Solr Cell
• ExtractingRequestHandler
• Uses configuration in solrconfig.xml if not defined with runtime parameters
• Uses Apache Tika for content extraction & parsing
• Streams documents to Solr

Indexed document
{
"id":"/Applications/Solr/solr-7.5.0/example/refguide-html/language-analysis.html",
"stream_size":[222027],
"x_ua_compatible":["IE=edge"],
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.HtmlParser"],
"stream_content_type":["text/html"],
"keywords":[" "],
"viewport":["width=device-width, initial-scale=1"],
"dc_title":["Language Analysis | Apache Solr Reference Guide 7.5-DRAFT"],
"content_encoding":["UTF-8"],
"resourcename":["/Applications/Solr/solr-7.5.0/example/refguide-html/language-analysis.html"],
"title":["Language Analysis | Apache Solr Reference Guide 7.5-DRAFT"],
"content_type":["text/html; charset=UTF-8"],
"_version_":1612232702110466048}

/update/extract in solrconfig.xml
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">_text_</str>
</lst>
</requestHandler>
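
To make the interaction between these defaults and runtime parameters concrete, here is a rough sketch of how the final field name could be derived. It is an illustration of the idea only, not Solr Cell's implementation: the helper names `effective_params` and `target_field` are hypothetical, and the normalization rule (lowercase, other characters replaced by underscores) is a simplified reading of what `lowernames` does.

```python
import re

# Defaults copied from the /update/extract handler definition on the slide.
HANDLER_DEFAULTS = {
    "lowernames": "true",
    "fmap.meta": "ignored_",
    "fmap.content": "_text_",
}


def effective_params(runtime: dict) -> dict:
    """Runtime parameters win over the defaults in solrconfig.xml."""
    merged = dict(HANDLER_DEFAULTS)
    merged.update(runtime)
    return merged


def target_field(name: str, params: dict) -> str:
    """Normalize a Tika field name, then apply any fmap.<source> mapping."""
    if params.get("lowernames") == "true":
        # Simplified: lowercase, and replace anything else with "_".
        name = re.sub(r"[^a-z0-9]", "_", name.lower())
    return params.get("fmap." + name, name)


defaults_only = effective_params({})
with_body = effective_params({"fmap.content": "body"})

print(target_field("X-UA-Compatible", defaults_only))
print(target_field("content", defaults_only))
print(target_field("content", with_body))
```

With only the defaults, extracted content lands in `_text_`; passing `fmap.content=body` at request time redirects it to a `body` field, which is what the next run does.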

$ ./bin/post -c post-html -filetypes html -params "fmap.content=body" example/refguide-html/
/Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/bin/java -classpath /Applications/Solr/solr-7.5.0/dist/solr-core-7.5.0.jar -Dauto=yes -Dfiletypes=html -Dparams=fmap.content=body -Dc=post-html -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool example/refguide-html/
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/post-html/update?fmap.content=body...
Entering auto mode. File endings considered are html
Entering recursive mode, max depth=999, delay=0s
Indexing directory example/refguide-html (245 files, depth=0)
POSTing file requestdispatcher-in-solrconfig.html (text/html) to [base]/extract
POSTing file client-api-lineup.html (text/html) to [base]/extract
…
250 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/post-html/update?fmap.content=body...
Time spent: 0:00:07.347

Indexed document with body
{
"id":"/Applications/Solr/solr-7.5.0/example/refguide-html/logging.html",
"stream_size":[46314],
"x_ua_compatible":["IE=edge"],
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.HtmlParser"],
"stream_content_type":["text/html"],
"keywords":[" "],
"viewport":["width=device-width, initial-scale=1"],
"dc_title":["Logging | Apache Solr Reference Guide 7.5-DRAFT"],
"content_encoding":["UTF-8"],
"resourcename":["/Applications/Solr/solr-7.5.0/example/refguide-html/logging.html"],
"title":["Logging | Apache Solr Reference Guide 7.5-DRAFT"],
"content_type":["text/html; charset=UTF-8"],
"body":[" \n \n stylesheet text/css https://maxcdn.bootstrapcdn.com/font-awesome/4.5.0/css/font-awesome.min.css \n stylesheet css/lavish-bootstrap.css \n stylesheet css/customstyles.css \n stylesheet css/theme-solr.css \n stylesheet css/ref-guide.css \n https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.4/jquery.min.js \n https://cdnjs.cloudflare.com/ajax/libs/jquery-cookie/1.4.1/jquery.cookie.min.js \n js/jquery.navgoco.min.js \n https://maxcdn.bootstrapcdn.com/bootstrap/3.3.4/js/bootstrap.min.js \n https://cdnjs.cloudflare.com/ajax/libs/anchor-js/2.0.0/anchor.min.js \n js/toc.js \n
....],
"_version_":1612248222923751424}

capture and captureAttr
• These parameters allow putting HTML tags (XHTML elements) into separate fields in Solr
• capture=<element> puts a specific element into its own field
• captureAttr=true puts all attributes of elements into their own fields

Example Using captureAttr
• Captures attributes of elements
• classnames
• ids
• Best used for something like getting href values out of <a> tags
./bin/post -c post-html -filetypes html -params "fmap.content=body&captureAttr=true" example/refguide-html/
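
As a rough picture of what attribute capture produces, the sketch below collects every attribute value into a field named after its element, using Python's standard `html.parser` in place of Tika. The sample markup and the `AttrCollector` class are made up for illustration; Solr Cell's actual output may differ in detail.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Gather attribute values into one multi-valued field per element name."""

    def __init__(self):
        super().__init__()
        self.fields = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        for _name, value in attrs:
            if value is not None:
                self.fields[tag].append(value)


# Hypothetical fragment in the style of a Ref Guide page.
sample = (
    '<div class="sect1" id="logging">'
    '<a href="configuring-logging.html">Configuring Logging</a>'
    "</div>"
)

collector = AttrCollector()
collector.feed(sample)
collector.close()
fields = dict(collector.fields)
print(fields)
```

The `a` field holds the useful link target, while the `div` field fills up with class names and ids, which is why the slide suggests reserving this option for cases like href extraction.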

Example Using capture
• Map everything in h2 tags to the sectiontitles field
• Great option for parsing HTML pages
./bin/post -c post-html -filetypes html -params "fmap.content=body&capture=h2&fmap.h2=sectiontitles" example/refguide-html/
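
The effect of `capture=h2` combined with `fmap.h2=sectiontitles` can be imitated with a small parser. This is a sketch under stated assumptions rather than a reproduction of Solr Cell: the `ElementCapture` class and the sample markup are hypothetical, and whether captured text also stays in the main content field is not something the sketch tries to mirror.

```python
from html.parser import HTMLParser


class ElementCapture(HTMLParser):
    """Send text inside one element to its own field, the rest to "body"."""

    def __init__(self, element: str, field: str):
        super().__init__()
        self.element = element
        self.field = field
        self.inside = False
        self.doc = {field: [], "body": []}

    def handle_starttag(self, tag, attrs):
        if tag == self.element:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == self.element:
            self.inside = False

    def handle_data(self, data):
        text = data.strip()
        if text:
            target = self.field if self.inside else "body"
            self.doc[target].append(text)


# Hypothetical page fragment.
parser = ElementCapture("h2", "sectiontitles")
parser.feed("<h2>Log Levels</h2><p>Set levels per logger.</p><h2>Log Files</h2>")
parser.close()
doc = parser.doc
print(doc)
```

A dedicated `sectiontitles` field like this could later be boosted or highlighted separately from the page body.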

More We Could Explore
• Tika gives us a lot of power to parse our documents
• tika.config opens up all of Tika’s options
• parseContext.config gives control over parser options
• Haven’t looked at default field analysis:
• Are we storing the fields the way we want? Storing too many fields?
• Should we do different analysis to support more search use cases?

Remaining Challenges
• Because the files were indexed locally, the documents don’t have the correct paths for URLs
• Add a field with this information?
• Crawler option doesn’t like our pages (why?)
• Still need a front-end UI
• Haven’t solved server & maintenance questions
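
One way to address the first challenge is to derive a public URL from the local file path before (or while) indexing and store it in an extra field. The sketch below assumes the published Guide lives under `https://lucene.apache.org/solr/guide/7_5/`; that base URL and the `public_url` helper are assumptions for illustration, while the local root is taken from the `id` values shown earlier.

```python
from pathlib import PurePosixPath

# Local directory seen in the indexed "id" values.
LOCAL_ROOT = "/Applications/Solr/solr-7.5.0/example/refguide-html"
# Assumed public location of the same pages; adjust as needed.
PUBLIC_BASE = "https://lucene.apache.org/solr/guide/7_5/"


def public_url(local_id: str) -> str:
    """Translate a local file path into the URL a search result should link to."""
    relative = PurePosixPath(local_id).relative_to(LOCAL_ROOT)
    return PUBLIC_BASE + relative.as_posix()


url = public_url(
    "/Applications/Solr/solr-7.5.0/example/refguide-html/logging.html"
)
print(url)
```

The resulting value could be written to a field (for example one named `url`, a hypothetical name) so that a front end can link results to the live page.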

Use /browse for Front-End?
• http://localhost:8983/solr/<collection>/browse
• Most config done via solrconfig.xml & query parameters

What is Site Search?
• Hosted service from Lucidworks, based on Fusion
• Designed to make basic search easy to configure, manage, and embed into a site

Embed into an Existing Site
• Add JS snippet to <head> element
• Add search.html for results
• Add elements to embed:
• <cloud-search-box>
• <cloud-results>
• <cloud-tabs>
• <cloud-facets>

A Few Challenges
• Uniform Information Model works best when content can conform or adapt to it
• Poorly formed HTML presents problems for machines
• Which elements hold the content?
• Which elements are nav elements?
• Default extraction of a “description” field did not allow for good highlighting experience
• Fallback to entire <body> brought in all the navigation “cruft”

Better HTML structure might help…
Before:
<html>
<head> .. </head>
<body>
<div class="container">
<div class="row">
<div class="col-md-9">
<div class="post-title-main"> .. </div>
<div class="post-content">
<div class="main-content">
<div class="sect1">
<div class="sectionbody">
<div class="paragraph">
<p> .. </p>
</div>
</div>
</div>
</div>
</div>
<footer> .. </footer>
</div>
</div>
</div>
</body>
</html>

Reader View & Search Engines have a hard time with this structure

After implementing semantic elements (SOLR-12746):
<html>
<head> .. </head>
<body>
<div class="container">
<div class="row">
<nav class="col-md-3"> .. </nav>
<article class="col-md-9 post-content">
<header class="header"> .. </header>
<nav class="toc"> .. </nav>
<section class="content">
<section class="sect1">
<h2> .. </h2>
<p> .. </p>
</section>
</section>
</article>
</div>
<footer> .. </footer>
</div>
</body>
</html>
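
With semantic elements in place, separating content from navigation becomes a matter of naming tags. The sketch below keeps only text inside `<article>` and skips anything under `<nav>`, `<header>`, or `<footer>`. It uses Python's standard `html.parser` and a made-up sample page; it is not how Site Search or Tika extract content, only an illustration of why the new structure is easier for a machine to read.

```python
from html.parser import HTMLParser


class ArticleText(HTMLParser):
    """Collect text inside <article>, ignoring nav/header/footer regions."""

    SKIP = {"nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.article_depth = 0
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "article":
            self.article_depth -= 1
        elif tag in self.SKIP:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.article_depth > 0 and self.skip_depth == 0:
            self.chunks.append(text)


# Hypothetical page following the semantic layout.
page = """
<body><div class="row"><nav class="col-md-3">Sidebar links</nav>
<article class="col-md-9 post-content"><header class="header">Logging</header>
<nav class="toc">On this page</nav>
<section class="content"><section class="sect1">
<h2>Log Levels</h2><p>Set levels per logger.</p>
</section></section></article></div>
<footer>Copyright</footer></body>
"""

extractor = ArticleText()
extractor.feed(page)
extractor.close()
chunks = extractor.chunks
print(chunks)
```

Doing the same against the earlier all-`<div>` layout would require matching on class names, which is more brittle. Note that this sketch also drops the page title in `<header>`; a real mapping would likely route it to a title field instead.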

Do better tags help…
• Site Search?
• Yes!
• We can define elements to extract & map those to Site Search information model
• bin/post (Solr Cell)?
• Not yet: TIKA-985 tracks support for HTML5 elements
• In the meantime, they are ignored

Is Site Search the Solution?
• Hosted and managed for us
• Easy to integrate with our existing site
• Basic search features with very short set up time
• Better than a title keyword lookup
• Challenges:
• Advanced features are obscured
• Are the basic features good enough (maybe just for now)?

No matter what your data looks like, you will face challenges
No tools have perfected this yet.
Your stuff is unique! Learn how it’s structured!

The problem isn’t always the tool you are trying to use
Sometimes you need to try to fix your data
(Assuming you can!)

Site Search can be a solution for Ref Guide search
It’s not perfect, but it’s better than today!

Thank you!
Cassandra Targett
Director of Engineering, Lucidworks
@childerelda
#Activate18 #ActivateSearch
