Starting a Search Application


Marc Krellenstein
CTO
Lucid Imagination
Abstract
If you think you need a search application, there are some useful first steps to take:
validating that full-text search is the right technology; producing sets of ideal results you’d
like to return for a range of queries; considering the value of supplementing a basic search
results list with document clustering; and producing more specific requirements and
investigating technology options.
There are many ways that people come to the conclusion that they need a search
application, and a variety of ways in which they can then proceed. Here are some of the
steps that users often consider (or maybe should consider) before they’re ready to start
building an application.




Table of Contents

Do You Need Full-text Search?
Identifying Ideal Results
Clustering Results
Producing Requirements and Choosing a Technology
Resources




Do You Need Full-text Search?
Full-text search is useful when you want to find something in a large amount of
unstructured text—documents, articles, product descriptions, case studies, informal notes,
web sites, discussion boards, wikis, e-mail messages, job descriptions, resumes, patents,
legal decisions, etc.
A “large” amount is usually much more than you can conveniently scan or read from start
to end to find what you’re looking for, and it could potentially be as much as millions or
billions of items. “Unstructured text” refers to the fact that the text consists of ordinary
sentences, sentence fragments, or words of some kind rather than numeric values,
structured alphanumeric values (e.g., account IDs, product codes) or non-textual content
such as pictures, video, and audio. The unstructured text you’re searching will usually exist
together with structured “fields” or related pieces of information—the price of a product,
the date of an article. But what makes full-text search useful is that some of the data you are
looking through is unstructured.
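
As an illustration, here is a minimal Lucene sketch of indexing an item that mixes unstructured text with structured fields; the field names and values are hypothetical, and error handling is omitted:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class IndexProduct {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // Unstructured text: tokenized, so individual words are searchable.
            doc.add(new TextField("description",
                    "Compact 32-inch LED television with built-in tuner",
                    Field.Store.YES));
            // Structured fields stored alongside the text: an exact-match code
            // and a numeric price for range queries.
            doc.add(new StringField("productCode", "TV-1032", Field.Store.YES));
            doc.add(new DoublePoint("price", 199.99));
            writer.addDocument(doc);
        }
    }
}
```
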
Full-text search is good at a variety of information requests that can be hard to satisfy with
other technologies. These include:
   •   Finding the most relevant information about a specific topic, or an answer to a
       particular question,
   •   Locating a specific document or content item, and
   •   Exploring information in a general area, or even browsing the collection of
       documents or other content as a whole (this is often supported by clustering; see
       below).
Full-text search is useful even if you have a way of navigating to a particular item of
interest. Early on, Yahoo! pioneered the idea of a directory of websites, allowing users to
select what they want by drilling down in a hierarchy from a general category (“geographic
regions”) to successively more specific ones (“downtown Seattle”). However, most users
seem to prefer search even when a browsing option is available. This is partly a matter of
personal style, but also reflects the fact that an accurate search system may allow a user to
find information faster than they could by browsing (e.g., a search on “downtown Seattle”
may get you what you want faster than browsing by country, then state, then city, then city
area). Of course, users may sometimes want to browse – especially when they are just
exploring different content areas – and at other times want to search. It’s readily possible
to combine these in one system, as is discussed below in the section on clustering search
results.
Full-text search is less effective when you always know exactly what answer to provide.
Some full-text search systems (such as open source Solr) let you “rig” specific documents as
answers to certain matching queries. But the greater value of such systems is in the way
they can quickly find the most likely answers to an ad hoc request that may have no
predefined right answer. If all or nearly all the information requests can be answered with
specific documents or content, users might be better served by an FAQ or some other
system that directly links specific requests to specific answers.
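
Solr implements this “rigging” with its query elevation component; conceptually, a hand-rolled version is just a curated lookup consulted before falling back to full-text search. A hypothetical sketch (the queries, document IDs, and fallback are all illustrative):

```java
import java.util.List;
import java.util.Map;

public class CuratedAnswers {
    // Hand-curated mapping from known queries to the documents that answer them.
    static final Map<String, List<String>> CURATED = Map.of(
            "return policy", List.of("faq-returns"),
            "store hours", List.of("faq-hours"));

    static List<String> search(String query) {
        // Serve the curated answer when one exists...
        List<String> curated = CURATED.get(query.trim().toLowerCase());
        if (curated != null) return curated;
        // ...otherwise fall back to ordinary full-text search.
        return fullTextSearch(query);
    }

    static List<String> fullTextSearch(String query) {
        return List.of(); // placeholder for a real engine call
    }
}
```
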
A full-text search system may also not be the best choice when unstructured text search is
secondary to more structured searches. In those cases a relational database may be a better
technology choice, especially since most such databases today provide some form of full-text
search for their unstructured text fields. (See the article on choosing a search engine or
a DBMS, in the Resources section at the end of this paper.)


Identifying Ideal Results
If you’ve concluded that a full-text search system is what you need, you’re going to have to
understand the data you’re searching and the users who will be doing the searching. You
will then need to design the search application to maximize “findability” of different kinds
of information for those users on that data. (Grant Ingersoll has a good article on
optimizing findability in your search application; see the Resources section at the end of
this paper.)
However, system designers sometimes jump too quickly into the middle of the
requirements or the design process without being clear about just what they would most
like the search application to do. To figure that out, you might create a basic “paper
prototype” of some search scenarios: Take a pencil and some blank paper and, for a
number of possible user queries you expect, sketch the content items you would ideally like
the system to return to the screen. You should put these ideal results in the actual order in
which you want users to see them. Include any specific content for each item you’d like
returned on the search results screen—a title, summary and/or any other particular
attribute of the item. Don’t try to specify the graphical specifics of the user interface—
layout, color, formatting—but just the specific content you would ideally like returned.
This is often not an easy task. It can require scouring your data by whatever means you
have available (existing search systems or databases, browsing individual documents, etc.)
to determine what the possible results would be and carefully considering the user’s
request to figure out which are the best results.
Because this is often difficult, many application developers tend to skip this step, preferring
just to build a system and improve it over time to get better results. People may also skip
this step because they assume that ideal results are unattainable. While that is often true,
it’s not always the case. If you don’t at least start with what you’d ideally like to find, you
may have already made an unnecessary compromise, and you can’t really design for
effective relevancy ranking if you don’t have a goal.
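
One lightweight way to make those ideal results concrete is to record them as “golden” fixtures that the eventual system can be checked against. A hypothetical sketch (the queries and document IDs are invented for illustration):

```java
import java.util.List;

public class GoldenResults {
    // A query you expect users to run, paired with the ideal results in the
    // exact order you want users to see them.
    record GoldenQuery(String query, List<String> idealOrder) {}

    static final List<GoldenQuery> FIXTURES = List.of(
            new GoldenQuery("downtown seattle hotels",
                    List.of("hotel-monaco", "hotel-vintage", "neighborhood-guide")),
            new GoldenQuery("32 inch television",
                    List.of("tv-1032", "tv-1040")));

    // Later, compare what the real system returns against each fixture.
    static boolean matchesIdeal(GoldenQuery g, List<String> actual) {
        return actual.equals(g.idealOrder);
    }
}
```
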
Determining what you most want to get in light of the user’s query will also give you
important clues about what technology you need and how to build the system. Matching
results to a query usually combines finding the results most similar to what the
user is asking with favoring results that may be inherently “better” by some
measure in a particular context, for example, more authoritative, more recent, longer (or
shorter).
To determine the best results and give you some insight into the system you’ll need to
build, you may want to consider some or all of the following:
   •   What exactly is the content you want to return (and where is it located and how will
       you be able to gather and process it for search)?
   •   How varied is the data? Are there many different kinds of information? Are you able
       to intelligently order the results into one list or do multiple lists, corresponding to
       different data sources, make more sense? Is the content of varying length or size
       and, if so, how does that influence what results are best?
   •   What are the direct connections between the content and the query, such as
       matching keywords or phrases? What transformations of the queries or content
       (case normalization, synonyms, translation, etc.) are you assuming to help find those
        ideal results? Does the structure of the content—titles, summaries, layout, etc.—
        matter for matching purposes? Is deeper analysis of the query and/or content—
       recognizing complex entities such as specific people, locations or events, or
       relationships among them—implicit in what you think are the best results?
   •   What, if any, are the indirect connections between the query and the results that
       bear on the quality of the results? Does the source or authority of the result affect its
       value as an ideal result? Do you as system developer want a say in which results are
       best? Is third-party information available and useful in determining the results (e.g.,
        that other people have linked to or cited that result, or that people who performed a
        similar query then spent time reading specific results)? Are there
       attributes of the user—level of expertise, interests, job description, etc.—that
       determine what the ideal results are?
   •   How important is it that the information be recent? Do you want results sorted by
       date, or is date just one factor in what makes a result good? Or is it not important at
       all? If recency is important, how different would the ideal results be a minute, hour,
       day, or week later?
   •   Are there many acceptable answers or only one or a small number? If there are
       many, is it important to return all of them? That would suggest recall is important—
       that it’s important to get most or all relevant results, even if that means getting some
       irrelevant ones (as is usually the case if you want to see most or all good results).
   •   Do you only want to see very relevant results, or are you also interested in
        somewhat relevant ones (and don’t mind some irrelevant ones)? If you want to see
        only relevant results, that would suggest that precision is especially important—
        returning relevant results even if it means missing some other relevant ones in order
        to minimize seeing irrelevant ones. (It is also usually the case that you’ll miss some
        good ones if you don’t want to see too many bad ones; see the sketch of both
        measures after this list.)
   •   Will certain users only have permission to see certain results?
   •   Are there so many results that typical users will need additional tools to manage or
       view them, such as clustering results into categories? See below for more on
       clustering.
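
To make the recall/precision trade-off concrete, here is the sketch referenced above: a minimal computation of the two measures over a set of returned results and a set of known-relevant documents (the document IDs are hypothetical):

```java
import java.util.Set;

public class SearchMetrics {
    // Precision: what fraction of the returned results are relevant?
    static double precision(Set<String> returned, Set<String> relevant) {
        if (returned.isEmpty()) return 0.0;
        long hits = returned.stream().filter(relevant::contains).count();
        return (double) hits / returned.size();
    }

    // Recall: what fraction of all relevant documents were returned?
    static double recall(Set<String> returned, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        long hits = returned.stream().filter(relevant::contains).count();
        return (double) hits / relevant.size();
    }

    public static void main(String[] args) {
        Set<String> returned = Set.of("d1", "d2", "d3", "d4");
        Set<String> relevant = Set.of("d1", "d3", "d5", "d6", "d7");
        // 2 of 4 returned are relevant -> precision 0.50
        // 2 of 5 relevant are returned -> recall 0.40
        System.out.printf("precision=%.2f recall=%.2f%n",
                precision(returned, relevant), recall(returned, relevant));
    }
}
```
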


Clustering Results
There are many ways in which a basic search application can be supplemented beyond
simply providing an ordered list of results. One common extension is the clustering of
search results into categories of related results. This is useful when even the best relevancy
ranked list of documents will not suffice to give users what they want. This may be because
the query is too general to identify what the user is really interested in or because the user
is not interested in something specific but is just exploring an area of interest. In other
cases you may be interested in discovering common themes in the results, or in learning
that a particular portion of the content is relevant.
In all these cases, supplementing a best effort relevancy ranking by also presenting groups
or clusters of documents organized around a specific topic or common characteristic can
give a user greater visibility and ease of navigation over the full set of results. Choosing
from a set of categories organized by a common subject, date, price range, or
other attribute will often be faster for getting the user to what they want than walking a
long results list from beginning to end.
The situation is analogous to what happens when a user begins requesting information in
some real-life situations, such as at a store or library. To get what you want often requires a
dialogue with someone knowledgeable, and the need to simulate such a dialogue was one
early motivation for document clustering. If someone goes into a store and
asks for a television or into a library and asks for information on baseball, producing the
single best television or baseball book is really a wild guess. The store clerk might first ask
“how big a television?” and the librarian might ask “What about baseball? I have general
information, statistics, team information, books about how to play baseball, the World
Series, the World Baseball Classic….” The user might respond to the librarian with “I want
information about teams.” And the librarian’s response might be “the Red Sox?” (because
you’re in Boston), or “the Yankees?” (because people in Boston hate them), etc. The user
might also learn about an aspect of the game—the existence of a World Baseball Classic—of
which they were unaware.
A system that clusters documents responds like a (somewhat inarticulate) librarian.
Instead of a real dialogue in response to a query on baseball, the system responds with the
best documents it can but also blurts out (displays) some possible clusters: history, teams,
statistics, etc. Then the user might select a cluster—teams—and the system responds both
with more specific documents (the query has been effectively narrowed from baseball to
baseball teams) and by clustering the now more specific results. Clusters then might be Red
Sox, Yankees, etc.
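
The baseball dialogue maps directly onto faceted search. Here is a hedged SolrJ sketch of the query-then-narrow loop, assuming a hypothetical Solr core named “articles” with “topic” and “team” facet fields:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetDrilldown {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/articles").build()) {
            SolrQuery q = new SolrQuery("baseball");
            q.setFacet(true);
            q.addFacetField("topic"); // e.g., history, teams, statistics
            QueryResponse rsp = solr.query(q);
            for (FacetField.Count c : rsp.getFacetField("topic").getValues()) {
                System.out.println(c.getName() + " (" + c.getCount() + ")");
            }
            // The user picks "teams": keep the query, add a filter, and facet
            // on the next level down (Red Sox, Yankees, ...).
            q.addFilterQuery("topic:teams");
            q.addFacetField("team");
            rsp = solr.query(q);
        }
    }
}
```
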
The user can keep selecting categories (and/or entering new queries against the now
smaller set of documents) until the request is specific enough that the documents are what
the user is looking for. Of course, the user may not have had a very specific question in
mind, in which case the successive clusters and documents offer a way to explore the
subject that the original results list doesn’t really provide (short of reading the whole
thing). In some cases of exploration, users may want to start not with a search at all but by
browsing the entire collection, starting with the top-level categories and drilling down to
more specific areas of interest (a situation where something like Yahoo’s directory—which
is a top-level list of document categories—can be particularly valuable).
Note that the goal is usually still to get to the best or some particular documents of interest.
Where this can be more readily achieved, clusters may be of less value. Even when clusters
are of value they are sometimes ignored by users, since the current dominant search
paradigm pushes users toward reviewing the results list. For this reason, clusters
should not take the place of doing the best possible relevancy ranking, but they can
supplement that, especially when it’s difficult to determine (even for a human) what makes
up a best ranking or when discovering new aspects of something is important.
Some studies have shown that about 10 percent of queries result in selection of a cluster
when clusters are available. However, regular users tend to use clusters at least
occasionally, and the perceived value of clusters—even when they are not widely used—
may contribute to the perceived value and actual use of the system.
(You may want to see Yonik Seeley’s article for information on faceted search in Solr;
faceting is a particular form of document clustering. See “Resources” at the end of this
paper.)


Producing Requirements and Choosing a Technology
After determining a search application is needed and considering what the system should
ideally return, the next steps are usually to produce more detailed requirements and
choose a technology. These are sometimes intertwined. Certain core requirements may
dictate certain technologies, but a particular technology may then have capabilities that
suggest other application goals.
Any of the above characteristics of ideal results, including the possible value of clustering,
will drive your more specific requirements and may affect what technology is best for your
application. Different technologies provide different advantages for integrating or
searching varied data, for transforming data or queries, for exploiting third-party
information, for handling dates or frequent updates, for favoring recall or precision, for
providing security, or for grouping results.
All of these are points to include in your review of technologies, in an RFP, and/or in
discussion with technology providers. Things like cost, support, performance, scalability,
reliability, likelihood of further improvements and customizability may also be important
factors, as they often are for any technology decision.
Loosely speaking, there are some technologies—such as Lucene/Solr—that are relatively
generic and handle most or all of these capabilities well or very well. Such technologies are
usually strongest in the “basics”—strong and flexible ranking of documents, excellent query
performance, and the ability to scale to large systems—and may be the best functional choice
for most applications, or when a more specific need can be handled within them. Technologies
with a more specific focus may be more useful for more specialized needs, though they
might still do well enough on the basics for more general solutions if there is some
particular (nontechnical) reason for choosing them.
At one time, users looking to build search applications often struggled with the question of
“make vs. buy,” that is, should they build their own search system from scratch or start
with technology from an existing vendor. “Make” decisions were mostly motivated by the
perceived lack of good existing systems or by strong needs for control or customizability to
produce exactly the right application.
Today there is usually little reason to consider a pure “make-from-scratch” alternative.
There are a number of good quality technologies, and the existence of a high quality open
source alternative—Lucene/Solr—generally satisfies any strong need for control or for
customizability, since the user has and controls the underlying source code but without the
cost or risk of writing it from scratch.
Sometimes, one useful way both to drive requirements and evaluate a technology is to
build a proof of concept or prototype of the intended application. This will usually be done
on a limited set of data, with limited user interface or packaging, and/or with limited
functionality, perhaps focusing on areas where there is uncertainty about whether a
needed capability—query or content transformation, accuracy of results, speed or
scalability—can be provided. A basic prototype may also serve as a first iteration in an
“agile” development model in which the final product is built less by an upfront, top down
requirements and design process and more by successive iterations that include
increasingly more functionality, content, and/or polish, as seems appropriate at each step.
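
A proof of concept can be as small as a searcher over a handful of indexed documents, such as the Lucene index sketched earlier; the index path and field names below are again hypothetical:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class SearchPrototype {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("description", new StandardAnalyzer())
                    .parse("television");
            TopDocs top = searcher.search(query, 10); // top 10 by relevance
            for (ScoreDoc sd : top.scoreDocs) {
                Document doc = searcher.doc(sd.doc);
                System.out.println(doc.get("productCode") + "  score=" + sd.score);
            }
        }
    }
}
```
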




Resources
   •   “Search Engine versus DBMS,” by Marc Krellenstein. Discusses the choice between a
       search engine and a relational database for your application:
       http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Search-Engine-versus-DBMS

   •   “Optimizing Findability in Lucene and Solr,” by Grant Ingersoll. Discusses designing
       your application to maximize the ability of users to find what they’re looking for:
       http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Optimizing-Findability-Lucene-and-Solr

   •   “Faceted search with Solr,” by Yonik Seeley. Discusses the faceting (a form of
       clustering) available in Solr:
       http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr



