Challenges of Simple
Documents
Cassandra Targett
Director of Engineering, Lucidworks
@childerelda
#Activate18 #ActivateSearch
About Me
• Lucene/Solr committer and
member of PMC
• Director of Engineering at
Lucidworks
• Manage team of Solr
committers
• Live in the Florida
Panhandle
Agenda
• Looking at the Solr Reference Guide as a content source
• Structure of raw documents
• HTML format
• Indexing the Reference Guide with bin/post
• Indexing the Reference Guide with Site Search
Solr Reference Guide as a
Content Source
Brief History of the Ref Guide
• 2009: First version of the Guide created by Lucidworks
• 2013: Lucidworks donates Guide to Lucene/Solr community
• 2017: Guide integrated with Solr source
Moving from Confluence
• What we gained:
• Control over information design and presentation
• Versioned Guides
• Tighter integration with developers
• What we lost:
• Comments
• Managed infrastructure
• SEARCH
Challenges with Providing Ref Guide
Search as a Community Artifact
Server
None of this is the core
mission of the committer
community
Baseline Feature Set
• Full text search
• Auto-complete
• Suggestions/Spellcheck
• Highlighting
• Facets? …based on?
Reference Guide
Document Format
Asciidoc Format
Page title
Text
Image reference
with caption
Section title
Ref Guide Content Structure
• Asciidoc is relatively well-structured
• headings clearly separate from general text (==)
• code examples in blocks ([source,xml])
• Doesn’t include header/footer/nav “cruft”
• Challenges:
• Document links are to other .adoc files
• No URL for access via a search result list
• HTML metadata missing
• No established means of indexing .adoc files
Maybe We Should Use HTML Format?
• Lots of systems know how to read HTML
• URLs for access already exist and are correct
• Inter/intra-document links converted to correct HTML
references (anchors or other pages)
• Challenges:
• Includes “cruft” of navs and headers/footers
• HTML can be pretty unstructured
toc.js
Jekyll
template
Take One: bin/post
aka, Solr Cell and Tika
Solr’s bin/post
• Simple command line tool for POSTing content
• XML, JSON, CSV, HTML, PDF
• Includes a basic crawler
• Determines update handler to use based on file type
• JSON, CSV -> /update/json, /update/csv
• PDF, HTML, Word -> /update/extract
• Delegates to post.jar (in example/exampledocs)
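The routing rule listed above can be sketched in a few lines of Python. This only illustrates the mapping given on the slide; it is not the logic inside `SimplePostTool`. The function name `handler_for` is hypothetical, and sending XML to the plain `/update` handler is an assumption, since the slide does not say where XML goes.

```python
import os

# Mapping taken from the slide; the XML entry is an assumption for illustration.
HANDLERS = {
    ".json": "/update/json",
    ".csv": "/update/csv",
    ".xml": "/update",  # assumption: XML goes to the plain update handler
}


def handler_for(filename: str) -> str:
    """Return the update handler path a file would be routed to."""
    ext = os.path.splitext(filename)[1].lower()
    # Rich formats (PDF, HTML, Word, ...) fall through to Solr Cell.
    return HANDLERS.get(ext, "/update/extract")


if __name__ == "__main__":
    for name in ("logging.html", "books.csv", "films.json"):
        print(name, "->", handler_for(name))
```

Keeping the rich-document case as the fallback mirrors the idea that anything that is not already structured data is handed to the extraction handler.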
$ ./bin/post -c post-html -filetypes html example/refguide-html/
/Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/bin/java -classpath /Applications/Solr/solr-7.5.0/dist/solr-core-7.5.0.jar -Dauto=yes -Dfiletypes=html -Dc=post-html -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool example/refguide-html/
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/post-html/update...
Entering auto mode. File endings considered are html
Entering recursive mode, max depth=999, delay=0s
Indexing directory example/refguide-html (245 files, depth=0)
POSTing file requestdispatcher-in-solrconfig.html (text/html) to [base]/extract
POSTing file client-api-lineup.html (text/html) to [base]/extract
…
250 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/post-html/update...
Time spent: 0:00:07.660
/update/extract aka Solr Cell
• ExtractingRequestHandler
• Uses configuration in solrconfig.xml if not defined with
runtime parameters
• Uses Apache Tika for content extraction & parsing
• Streams documents to Solr
Indexed document
{
"id":"/Applications/Solr/solr-7.5.0/example/refguide-html/language-analysis.html",
"stream_size":[222027],
"x_ua_compatible":["IE=edge"],
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.HtmlParser"],
"stream_content_type":["text/html"],
"keywords":[" "],
"viewport":["width=device-width, initial-scale=1"],
"dc_title":["Language Analysis | Apache Solr Reference Guide 7.5-DRAFT"],
"content_encoding":["UTF-8"],
"resourcename":["/Applications/Solr/solr-7.5.0/example/refguide-html/language-analysis.html"],
"title":["Language Analysis | Apache Solr Reference Guide 7.5-DRAFT"],
"content_type":["text/html; charset=UTF-8"],
"_version_":1612232702110466048}
/update/extract in solrconfig.xml
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">_text_</str>
</lst>
</requestHandler>
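How runtime parameters interact with the handler defaults can be modeled as a simple dictionary merge. This is a minimal sketch of the precedence rule only, assuming nothing beyond the `defaults` list shown in the fragment above (Solr request handlers can also declare `appends` and `invariants`, which are not modeled here). The names `HANDLER_DEFAULTS` and `effective_params` are hypothetical.

```python
# Defaults as shown in the solrconfig.xml fragment above.
HANDLER_DEFAULTS = {
    "lowernames": "true",
    "fmap.meta": "ignored_",
    "fmap.content": "_text_",
}


def effective_params(runtime: dict) -> dict:
    """Runtime parameters win; defaults fill in whatever was not supplied."""
    merged = dict(HANDLER_DEFAULTS)
    merged.update(runtime)
    return merged


if __name__ == "__main__":
    # Overriding fmap.content sends the extracted text to a "body" field
    # instead of the default _text_ field.
    print(effective_params({"fmap.content": "body"}))
```

With no runtime parameters the extracted content is mapped to `_text_`; passing `fmap.content=body` changes only that one mapping and leaves the other defaults in place.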
$ ./bin/post -c post-html -filetypes html -params "fmap.content=body" example/refguide-html/
/Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/bin/java -classpath /Applications/Solr/solr-7.5.0/dist/solr-core-7.5.0.jar -Dauto=yes -Dfiletypes=html -Dparams=fmap.content=body -Dc=post-html -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool example/refguide-html/
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/post-html/update?fmap.content=body...
Entering auto mode. File endings considered are html
Entering recursive mode, max depth=999, delay=0s
Indexing directory example/refguide-html (245 files, depth=0)
POSTing file requestdispatcher-in-solrconfig.html (text/html) to [base]/extract
POSTing file client-api-lineup.html (text/html) to [base]/extract
…
250 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/post-html/update?fmap.content=body...
Time spent: 0:00:07.347
Indexed document with body
{
"id":"/Applications/Solr/solr-7.5.0/example/refguide-html/logging.html",
"stream_size":[46314],
"x_ua_compatible":["IE=edge"],
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.HtmlParser"],
"stream_content_type":["text/html"],
"keywords":[" "],
"viewport":["width=device-width, initial-scale=1"],
"dc_title":["Logging | Apache Solr Reference Guide 7.5-DRAFT"],
"content_encoding":["UTF-8"],
"resourcename":["/Applications/Solr/solr-7.5.0/example/refguide-html/logging.html"],
"title":["Logging | Apache Solr Reference Guide 7.5-DRAFT"],
"content_type":["text/html; charset=UTF-8"],
"body":[" \n \n stylesheet text/css https://maxcdn.bootstrapcdn.com/font-awesome/4.5.0/css/font-awesome.min.css \n stylesheet css/lavish-bootstrap.css \n stylesheet css/customstyles.css \n stylesheet css/theme-solr.css \n stylesheet css/ref-guide.css \n https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.4/jquery.min.js \n https://cdnjs.cloudflare.com/ajax/libs/jquery-cookie/1.4.1/jquery.cookie.min.js \n js/jquery.navgoco.min.js \n https://maxcdn.bootstrapcdn.com/bootstrap/3.3.4/js/bootstrap.min.js \n https://cdnjs.cloudflare.com/ajax/libs/anchor-js/2.0.0/anchor.min.js \n js/toc.js \n
....],
"_version_":1612248222923751424}
capture and captureAttr
• These parameters allow putting HTML tags (XHTML
elements) into separate fields in Solr
• capture=<element> puts a specific element into its
own field
• captureAttr=true puts all attributes of elements into
their own fields
Example Using captureAttr
• Captures attributes
of elements
• classnames
• ids
• Best used for
something like
getting href values
out of <a> tags
./bin/post -c post-html -filetypes html -params "fmap.content=body&captureAttr=true" example/refguide-html/
Example Using capture
• Map everything in
h2 tags to the
sectiontitles
field
• Great option for
parsing HTML
pages
./bin/post -c post-html -filetypes html -params "fmap.content=body&capture=h2&fmap.h2=sectiontitles" example/refguide-html/
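The `-params` value in these commands is an ordinary URL query string, so it can be assembled programmatically rather than typed by hand. The sketch below uses only the parameter names that appear on the slides (`fmap.content`, `capture`, `fmap.<element>`, `captureAttr`); the helper name `extract_params` and its arguments are hypothetical.

```python
from urllib.parse import urlencode


def extract_params(content_field, capture=None, capture_field=None,
                   capture_attr=False):
    """Build a query string suitable for bin/post's -params option."""
    params = {"fmap.content": content_field}
    if capture:
        # Put the captured element into its own field ...
        params["capture"] = capture
        if capture_field:
            # ... and optionally rename that field.
            params["fmap." + capture] = capture_field
    if capture_attr:
        params["captureAttr"] = "true"
    return urlencode(params)


if __name__ == "__main__":
    print(extract_params("body", capture="h2", capture_field="sectiontitles"))
    print(extract_params("body", capture_attr=True))
```

Building the string this way avoids the quoting and line-wrap mistakes that are easy to make when the parameters are typed inline in a shell command.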
More We Could Explore
• Tika gives us a lot of power to parse our documents
• tika.config opens up all of Tika’s options
• parseContext.config gives control over parser options
• Haven’t looked at default field analysis:
• Are we storing the fields the way we want? Storing too
many fields?
• Should we do different analysis to support more search
use cases?
Remaining Challenges
• Indexed the files locally, so I don’t have the correct paths for
URLs
• Add a field with this information?
• Crawler option doesn’t like our pages (why?)
• Still need a front-end UI
• Haven’t solved server & maintenance questions
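One way to approach the missing-URL problem is to derive a public URL from the local file path that ended up in the `id` field. The sketch below is an illustration under stated assumptions: `BASE_URL` is a placeholder rather than the Guide's real location, and `public_url` is a hypothetical helper. How the resulting value would be attached to each indexed document is left open.

```python
import posixpath

# Hypothetical public location of the published Guide; replace with the real one.
BASE_URL = "https://example.org/solr/guide/7_5/"


def public_url(local_id: str, base_url: str = BASE_URL) -> str:
    """Turn a local file path (the indexed `id`) into a public page URL."""
    return base_url.rstrip("/") + "/" + posixpath.basename(local_id)


if __name__ == "__main__":
    print(public_url(
        "/Applications/Solr/solr-7.5.0/example/refguide-html/logging.html"
    ))
```

This works only because the Guide's HTML files sit in a single flat directory; a nested site would need the relative path preserved instead of just the file name.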
Use /browse for Front-End?
• http://localhost:8983/solr/<collection>/browse
• Most config done via solrconfig.xml & query parameters
Take Two: Site Search
What is Site Search?
• Hosted service from Lucidworks, based on Fusion
• Designed to make basic search easy to configure,
manage, and embed into a site
Challenges of Simple Documents: When Basic isn't so Basic - Cassandra Targett, Lucidworks
Fields in crawled documents
Embed into an Existing Site
• Add JS snippet to <head>
element
• Add search.html for results
• Add elements to embed:
• <cloud-search-box>
• <cloud-results>
• <cloud-tabs>
• <cloud-facets>
Queries and results
A Few Challenges
• Uniform Information Model works best when content can
conform or adapt to it
• Poorly formed HTML presents problems for machines
• Which elements hold the content?
• Which elements are nav elements?
• Default extraction of a “description” field did not allow for
good highlighting experience
• Fallback to entire <body> brought in all the navigation
“cruft”
<html>
  <head> .. </head>
  <body>
    <div class="container">
      <div class="row">
        <div class="col-md-9">
          <div class="post-title-main"> .. </div>
          <div class="post-content">
            <div class="main-content">
              <div class="sect1">
                <div class="sectionbody">
                  <div class="paragraph">
                    <p> .. </p>
                  </div>
                </div>
              </div>
            </div>
          </div>
          <footer> .. </footer>
        </div>
      </div>
    </div>
  </body>
</html>
Better HTML
structure might
help…
Before:
Reader View & Search Engines have a
hard time with this structure
<html>
  <head> .. </head>
  <body>
    <div class="container">
      <div class="row">
        <nav class="col-md-3"> .. </nav>
        <article class="col-md-9 post-content">
          <header class="header"> .. </header>
          <nav class="toc"> .. </nav>
          <section class="content">
            <section class="sect1">
              <h2> .. </h2>
              <p> .. </p>
            </section>
          </section>
        </article>
      </div>
      <footer> .. </footer>
    </div>
  </body>
</html>
After
implementing
semantic
elements
(SOLR-12746)
Reader View Improvement
Do better tags help…
• Site Search?
• Yes!
• We can define elements to extract & map those to Site
Search information model
• bin/post (Solr Cell)?
• No; TIKA-985 tracks support for HTML5 elements
• In the meantime, they are ignored
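To see why semantic elements make extraction easier, here is a small sketch using Python's standard-library `html.parser`. It is not what Tika or Site Search does internally; it only shows that once the page uses `<article>`, `<nav>`, `<header>`, and `<footer>`, a very simple rule ("keep text inside the article, skip navigation and header/footer") is enough to separate content from "cruft". The class name `ArticleText`, the helper `article_text`, and the sample page text are all made up for the example.

```python
from html.parser import HTMLParser


class ArticleText(HTMLParser):
    """Collect text inside <article>, skipping nested <nav>, <header>, <footer>."""

    SKIP = {"nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.in_article = 0  # nesting depth of <article>
        self.skipping = 0    # nesting depth of skipped elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article += 1
        elif tag in self.SKIP:
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.in_article:
            self.in_article -= 1
        elif tag in self.SKIP and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.in_article and not self.skipping:
            self.chunks.append(text)


def article_text(html: str) -> str:
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up page following the "after" structure shown earlier.
SAMPLE = """
<body><div class="row">
<nav class="col-md-3">Site nav</nav>
<article class="col-md-9 post-content">
<header class="header">Logging</header>
<nav class="toc">On this page</nav>
<section class="content"><h2>Log levels</h2><p>Set the level.</p></section>
</article></div>
<footer>Copyright</footer></body>
"""

if __name__ == "__main__":
    print(article_text(SAMPLE))
```

With the earlier all-`<div>` structure, the same rule would have to rely on class names such as `post-content`, which is exactly the kind of site-specific knowledge a generic tool does not have.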
Search with Semantic HTML
Is Site Search the Solution?
• Hosted and managed for us
• Easy to integrate with our existing site
• Basic search features with very short set up time
• Better than a title keyword lookup
• Challenges:
• Advanced features are obscured
• Are the basic features good enough (maybe just for
now)?
Takeaways
No matter what your data
looks like, you will face
challenges
No tools have perfected this yet.
Your stuff is unique! Learn how it’s structured!
The problem isn’t always
the tool you are trying to
use
Sometimes you need to try to fix your data
(Assuming you can!)
Site Search can be a
solution for Ref Guide
search
It’s not perfect, but it’s better than today!
Questions?
Thank you!
Cassandra Targett
Director of Engineering, Lucidworks
@childerelda
#Activate18 #ActivateSearch
