Surviving the Information Glut


Presentation by Bob Boeri
Factory Mutual Engineering & Research
bboeri@world.std.com

October 7, 1994
                               bbtitle 9/23/94
Roots of the Problem
<Storage: increasing
<Access: faster
<Document complexity: more
<Information quantity: increasing
 exponentially



"'A cat may look at a king,' said Alice. 'I've read that in some book,
but I don't remember where.'" -- Alice in Wonderland
                                                                         bb1 9/13/94
Document Complexity
<word processor types
<fonts
<rich layouts
<tables
<graphics/ photos
<video, sound, hypertext
<SGML
<... what isn't a document?
"and what is the use of a book," thought Alice, "without pictures or
conversations?" -- Alice in Wonderland
                                                                       bb2 9/13/94
How to Find What You Are
        Looking For
<how large is your collection of
 documents?
<how complex are they?
<how complex are the searches?
<who will search?
•individuals working by themselves
•members of a corporate organization


                                       bb3 9/17/94
Searching a Very Small
           Collection
 <a few dozen documents
 <simple structure (e.g., memo or e-mail)
 <written consistently (e.g., you, one author)

Find that note about an inexpensive, simple
word processor that never needs upgrading
and will let you add simple graphics to your
writing. Runs under MS-Windows.
                                          bb4 9/17/94
Trivial Search Techniques
<browse through each one
<use a word processor "list files"
<simple search system, simple boolean
 search




                                        bb5 9/17/94
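A simple boolean search over a very small collection can be sketched as follows. This is only an illustration, not a description of any particular product: the three memos in `DOCS` and the function names are hypothetical.

```python
import re

# Hypothetical three-document collection, for illustration only.
DOCS = {
    "memo1.txt": "An inexpensive word processor that runs under Windows.",
    "memo2.txt": "Notes on spreadsheet upgrades for the sales group.",
    "memo3.txt": "A simple word processor with basic graphics support.",
}

def words(text):
    """Lower-case a text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def boolean_search(docs, all_of=(), any_of=(), none_of=()):
    """Return names of documents that satisfy the AND / OR / NOT word lists."""
    hits = []
    for name, text in docs.items():
        w = words(text)
        if (all(t in w for t in all_of)
                and (not any_of or any(t in w for t in any_of))
                and not any(t in w for t in none_of)):
            hits.append(name)
    return sorted(hits)
```

For example, `boolean_search(DOCS, all_of=["word", "processor"], none_of=["graphics"])` keeps only the memo that mentions a word processor without mentioning graphics.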
First Search Barrier
<somewhere between a few dozen
 documents and several hundred
<can't remember exactly the words to
 search, begin searching synonyms or
 using wild cards.
She went on, rather surprised at not being able to think of the word. 'I mean to get under the -- under the --
under THIS, you know!' putting her hand on the trunk of a tree. -- Through the Looking Glass




                                                                                                         bb6 9/14/94
First-Level Search
           Techniques
<Range searches: "word processor"
 <sentence> "inexpensive"

<only 2 hits (probably missed something).
< forgot to ask about graphics support.


                                       bb7 9/15/94
First-Level Search
           Techniques
<Wild cards: word process* <sentence>
 basic


<99 hits; unusable
< Maybe asking about graphics support too
will reduce number of hits

                                        bb8 9/15/94
Searching Gets Complex
<(basic <sentence> word process*)
 <paragraph> (support* <sentence>
 graphic*)


 < complex expression
 < system searches a long time
 < finds nothing useful.

                                    bb9 9/16/94
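The complex expression on this slide can be approximated in a few lines, which also shows why it produces poor hits. This is a rough sketch under stated assumptions: sentences end at ".", "!" or "?", paragraphs are separated by blank lines, and all function names are hypothetical.

```python
import re
from fnmatch import fnmatchcase

def tokens(text):
    """Lower-case word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def has_phrase(toks, phrase):
    """True if consecutive tokens match the phrase; '*' is a wild card."""
    pats = phrase.lower().split()
    return any(
        all(fnmatchcase(t, p) for t, p in zip(toks[i:i + len(pats)], pats))
        for i in range(len(toks) - len(pats) + 1)
    )

def sentences(paragraph):
    # Naive splitter: a sentence ends at '.', '!' or '?'.
    return [s for s in re.split(r"[.!?]+", paragraph) if s.strip()]

def same_sentence(paragraph, a, b):
    """a <sentence> b: both phrases occur in one sentence."""
    return any(has_phrase(tokens(s), a) and has_phrase(tokens(s), b)
               for s in sentences(paragraph))

def complex_query(text):
    """(basic <sentence> word process*) <paragraph> (support* <sentence> graphic*)"""
    paragraphs = re.split(r"\n\s*\n", text)
    return [p for p in paragraphs
            if same_sentence(p, "basic", "word process*")
            and same_sentence(p, "support*", "graphic*")]
```

A paragraph about Visual Basic that happens to mention a word processor and graphic support satisfies this query just as well as a relevant one, because the expression tests only which words occur near each other, not what they mean.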
Sample Hits from 1st-level
     Complex Search
<"Although Visual Basic contains a
 rudimentary word processor... graphic
 support is really limited to OLE and
 DDE."
<"Basic word processing skills can
 sometimes be transferred to... programs
 which allow you to create graphic effects."

                                        bb10 9/17/94
Need to Break the 1st-Level
       Search Barrier:
<reduce hits to most relevant
<get hits when simpler searches fail
<additional techniques beyond Boolean


< new ways to divide and conquer
< richer, easier search aids
< richer reporting of results
                                        bb11 9/19/94
Combine Structured and Full
      Text Queries
<Apply search to portion of library ("form
 queries")
<Requires knowledge of the library
<Requires "catalog card" for each
 document (e.g., date, subject)
<Smart system might construct catalog
 card
•Requires highly regular documents
•Risk of catalog errors
                                        bb12 9/19/94
Combine Structured and Full
        Text Queries
 <Could design as a form for users to fill out
 <Example:

DATE: after 1/1/94

(inexpensive <sentence> "word processor"
<sentence> "windows")
                                          bb13 9/19/94
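The form query in this example might work roughly as sketched below: the structured DATE field narrows the library first, and the full-text test runs only on what is left. The catalog cards in `LIBRARY` are invented, and for brevity the sketch treats each document as a single sentence (a plain substring test), which is an assumption and not how a real proximity operator behaves.

```python
from datetime import date

# Hypothetical catalog cards: structured fields plus the full text.
LIBRARY = [
    {"date": date(1993, 11, 2), "subject": "word processors",
     "text": "An inexpensive word processor for Windows."},
    {"date": date(1994, 3, 15), "subject": "word processors",
     "text": "This inexpensive word processor runs under Windows."},
    {"date": date(1994, 5, 1), "subject": "spreadsheets",
     "text": "A spreadsheet for Windows."},
]

def form_query(library, after, phrases):
    """Filter on the DATE field first, then run the full-text test."""
    hits = []
    for card in library:
        if card["date"] <= after:
            continue  # structured part: DATE: after <after>
        text = card["text"].lower()
        if all(p.lower() in text for p in phrases):
            hits.append(card)
    return hits
```

Calling `form_query(LIBRARY, date(1994, 1, 1), ["inexpensive", "word processor", "windows"])` returns only the March 1994 card; the 1993 document matches the text but fails the date field.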
Relevancy Ranking
<Puts most likely hits at the top of the list
<Requires understanding of what's most
 important
•# of hits/document
•weighting certain hits (e.g., exact matches) more
 than others
•weighting other criteria (such as date or other
 structured fields)
•let users say what's most important to them


                                                     bb14 9/19/94
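One simple way to rank by hit count and weighting is sketched here. The weights and the rule that a prefix match counts as a weaker hit are assumptions chosen for illustration; a real system would use its own criteria.

```python
import re

def score(text, terms, exact_weight=2.0, partial_weight=1.0):
    """Count hits; exact word matches weigh more than prefix matches."""
    total = 0.0
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        for term in terms:
            if word == term:
                total += exact_weight
            elif word.startswith(term):
                total += partial_weight
    return total

def rank(docs, terms, **weights):
    """Return (score, name) pairs, best first, dropping documents with no hits."""
    scored = [(score(text, terms, **weights), name) for name, text in docs.items()]
    return sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
```

Passing different `exact_weight` and `partial_weight` values is one way to let users say what matters most to them; changing the weights can reorder the list.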
Thesauruses
  <General
  <Specific
  •medical
  •legal
  •scientific
  <user-modifiable

"I don't know the meaning of half those long words, and what's more, I don't believe you do either!"
-- Alice in Wonderland




                                                                                                       bb15 9/20/94
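Thesaurus-based query expansion, including a user-modifiable layer, could look something like this sketch. The thesaurus entries are invented for illustration, and `expand` and `matches` are hypothetical names.

```python
# Hypothetical general thesaurus; the entries are invented for illustration.
GENERAL = {"inexpensive": {"cheap", "affordable"}, "simple": {"basic", "easy"}}

def expand(terms, *thesauruses):
    """Return each query term together with its synonyms from every thesaurus."""
    expanded = {}
    for term in terms:
        synonyms = {term}
        for thesaurus in thesauruses:
            synonyms |= thesaurus.get(term, set())
        expanded[term] = synonyms
    return expanded

def matches(text, expanded):
    """True if, for every query term, the term or one of its synonyms occurs."""
    words = set(text.lower().split())
    return all(words & synonyms for synonyms in expanded.values())
```

A user's own dictionary, for example `{"inexpensive": {"budget"}}`, can be passed as an extra argument to `expand` so that it adds to the general thesaurus rather than replacing it.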
Linguistic Helps
<Automatic search for inflected forms of a word
•"sprinkle" also searches for "sprinkled,"
 "sprinkling," etc.
<Fuzzy search
•"sprinkle" also searches for "sparkle"
•helps overcome some OCR errors.
•user-specifiable (how many letters to make "fuzzy")
•gets words you would have missed
•gets words that make no sense at all.
<Natural Language Queries: ("Find me
 cheap reliable easy Windows word
 processors")

"Language is worth a thousand pounds a word."
-- Through the Looking Glass
                                                 bb16 9/20/94
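The two word-level aids on this slide can be illustrated with a crude suffix stripper and an edit-distance test. This is a minimal sketch: the suffix list is deliberately naive, and using Levenshtein distance as the measure of "fuzziness" is an assumption, not a claim about how any product implements fuzzy search.

```python
def stem(word):
    """Very crude suffix stripper; real stemmers are far more careful."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def edit_distance(a, b):
    """Levenshtein distance: inserts, deletes and substitutions needed."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # delete ca
                               current[j - 1] + 1,             # insert cb
                               previous[j - 1] + (ca != cb)))  # substitute
        previous = current
    return previous[-1]

def fuzzy_matches(term, vocabulary, max_edits):
    """Words within max_edits of term; max_edits is the user's fuzziness."""
    return sorted(w for w in vocabulary if edit_distance(term, w) <= max_edits)
```

With a small `max_edits` the search picks up OCR slips such as "sprink1e"; raising it to 3 also brings in "sparkle", which is the kind of match that makes no sense at all.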
Complex and Modular
            Queries
<Create, debug, save queries
<Use queries as models for new queries
<If modular ("Legos")
•assemble large search queries by plugging together
 smaller ones.
•fine tune searches (adjusting rankings of search
 criteria).
•build libraries of modular searches

                                                 bb19 9/22/94
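The "plug together" idea can be sketched with queries represented as small functions that combine into larger ones. The names `term`, `all_of`, `any_of` and the `SAVED` library are hypothetical.

```python
def term(word):
    """Smallest building block: does the word occur in the text?"""
    return lambda text: word.lower() in text.lower().split()

def all_of(*queries):
    """Combine queries so that every one must match."""
    return lambda text: all(q(text) for q in queries)

def any_of(*queries):
    """Combine queries so that at least one must match."""
    return lambda text: any(q(text) for q in queries)

# A small library of saved queries (names are hypothetical).
SAVED = {
    "cheap": any_of(term("inexpensive"), term("cheap")),
    "windows": term("windows"),
}
# A larger query plugged together from the saved pieces.
SAVED["cheap windows software"] = all_of(SAVED["cheap"], SAVED["windows"])
```

Because each saved query is an ordinary value, it can be tested on its own before being reused as a model or as a part of a bigger search.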
Fuzzy Searches
<use neural network technology
<like sophisticated wildcard searches
<help overcome OCR errors
<find good matches and irrelevant ones
<can distort relevancy rankings by hit
 count



                                         bb20 9/22/94
SGML Usage
<"Zone" searches
•Confine searches to paragraph headings, chapter
 titles, etc.
<Use SGML DTDs directly:
•Full, Arbitrary (all DTDs)
     - exploits full capabilities of your tag set
     - performance and/or size penalties
•Specific DTDs only
     - "Any color Ford you want as long as it's black."
     - May be tuned for better use



                                                          bb17 9/20/94
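A "zone" search can be illustrated on simple tagged text. The sketch below uses Python's `html.parser` as a stand-in for a real SGML parser: it knows nothing about DTDs or tag minimization and assumes every element has explicit start and end tags. The tag names `heading` and `para` are invented.

```python
import re
from html.parser import HTMLParser

class ZoneIndexer(HTMLParser):
    """Collect the text found inside each element, keyed by tag name."""
    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.zones = {}

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the innermost open element with this name.
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for tag in self.open_tags:
            self.zones.setdefault(tag, []).append(data)

def zone_search(markup, zone, word):
    """True if word occurs inside any element named zone."""
    indexer = ZoneIndexer()
    indexer.feed(markup)
    indexer.close()
    text = " ".join(indexer.zones.get(zone, [])).lower()
    return word.lower() in re.findall(r"[a-z0-9]+", text)
```

Confining the search to `heading` elements finds "graphics" only where it is the subject of a section, not wherever it is mentioned in passing.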
SGML Usage
<Filter (convert) SGML tags to application-
 specific codes.
•Not authentic SGML use
•May be better performance than authentic SGML
<Best when documents are themselves
 highly structured.
<One-way (from SGML to proprietary);
 loses important SGML benefit.
<Few vendors support SGML well
<Those who do may skimp on other search
 facilities.
                                         bb18 9/21/94
Interest Profiling
<Profile determined by any number of
 means
<"I like these documents. Find me more
 like this."
•simple
•unexpected results
•electronic highlighter improves search
<The more search tools the better.
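"Find me more like this" can be approximated by comparing word counts. The sketch below builds a profile from the liked documents and ranks candidates by cosine similarity; this particular measure is an assumption chosen for brevity, and it has no stop-word handling.

```python
import math
import re
from collections import Counter

def vector(text):
    """Bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def more_like_this(liked, candidates, top=3):
    """Rank candidate names by similarity to the combined liked texts."""
    profile = Counter()
    for text in liked:
        profile.update(vector(text))
    ranked = sorted(candidates,
                    key=lambda name: cosine(profile, vector(candidates[name])),
                    reverse=True)
    return ranked[:top]
```

Because common words such as "for" count as much as "processor" here, a profile built this way can return unexpected results, which is one reason to combine it with other search tools.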
Information Agents
<passive
•computed once, updated periodically
•use when you choose (whenever new CD-ROM title
 appears)
<active
•information gobots
•always on the lookout for anything relevant
•inform you with results or email notification
•on-line or jukeboxes
Looking in classifieds for a low-mileage Saab, prefer beige or red, one-owner,
automatic, 1993 or newer, less than $10,000.

Looking in PC literature for a Windows word processor, easy to use, never needs
upgrades, can handle graphics, bug-free, uses 1MB disk, less than $29.95.
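An active agent is essentially a standing query that is shown every new document and notifies its owner on a match. The following is a minimal sketch; the `Agent` class, the `publish` function and the all-words-required matching rule are assumptions made for illustration.

```python
class Agent:
    """A standing query that is checked against each new document."""
    def __init__(self, name, required, notify):
        self.name = name
        self.required = {w.lower() for w in required}
        self.notify = notify  # any callable, e.g. one that sends e-mail

    def consider(self, title, text):
        """Notify the owner if every required word occurs in the text."""
        words = set(text.lower().split())
        if self.required <= words:
            self.notify(f"{self.name}: new match '{title}'")
            return True
        return False

def publish(agents, title, text):
    """Show a newly arrived document to every active agent."""
    return [a.name for a in agents if a.consider(title, text)]
```

For instance, `Agent("saab", ["saab", "automatic"], inbox.append)` with `inbox` a plain list collects a notification each time a new classified ad mentions both words.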
Collateral Issues: Authoring
          and Using
<Authoring
•Populating the system
•Subject areas and forms
•Document size
•Legacy Documents




                           bb24 9/23/94
Populating the system
<Security: does everyone have identical access?
<Easy way to get documents into system?
<Form per document for form queries
 (date, subject area, sub-type)?
•subject area (e.g., word processors)?
•sub-types within areas (e.g., character-based, GUI)?
<Easy way to retract documents? Re-file
 documents? "See also" subject areas?
<QA of forms and documents
•Form field info correct?
•Complex document objects (e.g., tables).
                                                  bb25 9/24/94
Document Size
<Whole documents or chunks?
<What's appropriate to users?
•Effort to build collection
•Precision of hits
•Size of hit list
•What's natural and expected
"What size do you want to be?" the Caterpillar asked.

"Oh, I'm not so particular as to size," Alice hastily replied. "Only one doesn't like changing so
often, you know."

-- Alice in Wonderland




                                                                                                  bb26 9/24/94
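The trade-off between whole documents and chunks shows up directly in the hit list. The sketch below indexes the same text both ways; splitting on blank lines into paragraph-sized chunks is an assumption, and the function names are hypothetical.

```python
import re

def chunks(text):
    """Split a document into paragraph-sized chunks on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def search(docs, word, by_chunk):
    """Return hits as (document, chunk number); chunk 0 means the whole document."""
    hits = []
    for name, text in docs.items():
        units = chunks(text) if by_chunk else [text]
        for number, unit in enumerate(units, start=1 if by_chunk else 0):
            if word.lower() in re.findall(r"[a-z0-9]+", unit.lower()):
                hits.append((name, number))
    return hits
```

Searching by chunk gives a longer hit list but tells the user exactly where in the document to look; searching whole documents gives one hit and leaves the reading to the user.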
Legacy Documents
<Paper
•size, number, quality
•OCR
•Ability to attach page images
•At least name file for faxing
<Electronic
•document type
•quality of author practices
•fonts ...
•command launch when possible
•what about form queries/document?

"These words were followed by a very long silence, broken only by an occasional
exclamation of 'Hjckrrh!' from the Gryphon."
-- Alice in Wonderland
                                                                     bb27 9/24/94
Collateral Issues: Using
<Pie fonts
<Non-English characters
<Equations
<Font fidelity, size on-screen
•letter "o" and zero
•letters one "1", el "l", and capital i "I".
 "The White Queen whispered, 'I can read words of one letter!... However, don't be discouraged.
 You'll come to it in time.'"

 -- Through the Looking Glass




                                                                                                  bb28 9/25/94
Collateral Issues: Using
  <Navigation within documents
  <Viewers
  <Launching when Viewers Inadequate
  <CD-ROM Performance
  <Exporting information for reuse.
  <Printing
"... the books are something like our books, only the words go the wrong way."

-- Through the Looking Glass




                                                                                 bb29 9/25/94
Collateral Issues: Using
<Interactive searches
<Batch searches ("go do this later and tell
 me what you found")
<Autonomous information agents
•Continuous monitoring
•Urgent, routine notification
•Empower agents to "Ring a bell"; "Push a button"
•Active documents: "Go find me more like yourself"



                                               bb30 9/26/94
Adobe Acrobat version 2.0
<Powerful searching
<CD-ROM performance
<Font problem disappears
<SGML promised




                           bb31 9/26/94
And What of Our Original
         Search... Perfect Word
        Processor, Saab for a Song
 Alice laughed. 'There's no use trying,' she said: 'one CAN'T believe impossible things.'

 'I daresay you haven't had much practice,' said the Queen.
 -- Through the Looking Glass




Even the best searching system can't find
what isn't there. But the best ones will keep
on trying.

                                                                                             bb28 9/25/94

Beyond Boolean - Enterprise Search Technologies
