Search Domain Basics


      Praveen Manvi
       April - 2009
Objectives

Search Goals
Business Models
Structured vs. Unstructured Content
Search Terminologies
Technologies Behind Search
Goal: “Make it like this”

Simple, mostly accurate & fast
But that’s not always possible
Business Models
Sponsored Search
Content Match
It’s all about billboards




              TLR Confidential
Vertical search, or domain-specific search




Structured vs. Unstructured Data
Unstructured: about 80%, structured: about 20%

Relational = structured
Everything else = unstructured
Why not use SQL/RDBMS?
SQL search is limited: pattern matching such as LIKE '%bla bla%'
 is constrained by the schema and by SQL itself (a limited DSL)
Cannot handle user input that is misspelled but phonetically
 correct
Difficult to implement search requirements such as proximity:
 if "Java" appears close to "Serialization", the content is
 probably about software
Difficult to scale, manage changes & implement
 parallelization (Map-Reduce)
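
The point about phonetically correct input can be illustrated with a small sketch. The code below is a simplified Soundex written for this example only (the slides do not name a phonetic algorithm, so the choice of Soundex is an assumption). It shows that two spellings a LIKE pattern would treat as different strings can still be matched by comparing phonetic codes.

```python
# Illustrative sketch: a simplified Soundex, not a production phonetic matcher.
# Letters that sound alike share a digit; vowels and h, w, y have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return a 4-character code; words that sound alike tend to share a code."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "hw":  # h and w do not separate letters with the same code
            prev = digit
    return (result + "000")[:4]  # pad or truncate to one letter plus three digits


# Both spellings map to the same code, so a phonetic lookup finds either one.
print(soundex("serialization"), soundex("serialisation"))
```

A search engine would typically store such codes in the index next to the original words and compare codes at query time, rather than scanning text with a pattern.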
Sample Search Requests
Sample collection: Sun JDK classes
How many times does the “synchronized” keyword appear in JDK
  Java classes outside the java.lang package?
How many static methods are present in JDK classes that
  have synchronized methods?
How many Java classes in the Collections framework use the
  synchronized keyword and have more than 200 lines?
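
As a rough sketch of the first request, the snippet below counts whole-word occurrences of a keyword while skipping one package. The `SOURCES` dictionary is a hypothetical stand-in for the JDK source tree, and `count_keyword` is a name invented for this example. Plain text matching like this also counts occurrences inside comments and string literals, so a faithful answer would need a real Java parser.

```python
import re

# Hypothetical stand-in for the JDK sources: fully qualified class name -> source text.
SOURCES = {
    "java.lang.StringBuffer": "public synchronized int length() { return count; }",
    "java.util.Vector": (
        "public synchronized int size() { return n; }\n"
        "public synchronized boolean isEmpty() { return n == 0; }"
    ),
    "java.util.ArrayList": "public int size() { return size; }",
}


def count_keyword(sources, keyword, exclude_package=None):
    """Count whole-word occurrences of keyword, skipping classes whose
    package is exactly exclude_package (subpackages are not excluded)."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b")
    total = 0
    for class_name, text in sources.items():
        package = class_name.rpartition(".")[0]
        if exclude_package is not None and package == exclude_package:
            continue
        total += len(pattern.findall(text))
    return total


print(count_keyword(SOURCES, "synchronized", exclude_package="java.lang"))
```

The other two requests combine a text condition with structural facts (static methods, line counts), which is why they are awkward to express in SQL without first extracting that structure into an index.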
Search Terminologies
Proximity search: A search that lets users specify that the
 documents returned should have the words near each other.
Concept Search: A search for documents related
 conceptually to a word, rather than specifically containing the
 word itself.
Boolean search: A search allowing the inclusion or
 exclusion of documents containing certain words through the
 use of operators such as AND, NOT and OR.
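
Proximity and Boolean search as defined above can both be sketched on top of a positional inverted index. This is a minimal illustration under simplifying assumptions (whitespace tokenization, no punctuation handling); the function names and sample documents are made up for the example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each lowercased word to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def boolean_and_not(index, must, must_not=()):
    """Documents containing every word in must (AND) and none in must_not (NOT)."""
    if not must:
        return set()
    result = set.intersection(*(set(index.get(w, {})) for w in must))
    for w in must_not:
        result -= set(index.get(w, {}))
    return result


def proximity(index, a, b, max_distance):
    """Documents where a and b occur within max_distance positions of each other."""
    hits = set()
    for doc_id in set(index.get(a, {})) & set(index.get(b, {})):
        if any(abs(pa - pb) <= max_distance
               for pa in index[a][doc_id] for pb in index[b][doc_id]):
            hits.add(doc_id)
    return hits


DOCS = {
    1: "java serialization writes objects to a stream",
    2: "java is an island and coffee needs no serialization",
    3: "coffee from java",
}
INDEX = build_positional_index(DOCS)
print(proximity(INDEX, "java", "serialization", 2))   # only the software document
print(boolean_and_not(INDEX, ["java"], ["coffee"]))   # java AND NOT coffee
```

OR is the union of the per-word document sets. Storing positions rather than just document ids is the design choice that makes the proximity check possible.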
Contd…
Stemming: The ability for a search to include the "stem" of
 words. For example, stemming allows a user to enter
 "running" and get back results also for the stem word "run."
Lemmatization: The process of grouping together the
 different inflected forms of a word so they can be analyzed as
 a single item.
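
The difference between the two terms can be seen in a toy sketch. The suffix rules below are a deliberately naive stand-in (not the Porter algorithm), and the `LEMMAS` table is a hypothetical hand-made dictionary; real lemmatizers rely on a full dictionary and part-of-speech information.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # toy rule list, not the Porter algorithm


def naive_stem(word: str) -> str:
    """Strip one common suffix and undo a doubled final consonant ("runn" -> "run")."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


# Hypothetical lemma table for illustration only.
LEMMAS = {"ran": "run", "running": "run", "runs": "run", "mice": "mouse"}


def lemmatize(word: str) -> str:
    """Look the word up in the lemma table; unknown words are returned lowercased."""
    return LEMMAS.get(word.lower(), word.lower())


for w in ("running", "runs", "ran"):
    print(w, naive_stem(w), lemmatize(w))
```

The stemmer handles "running" and "runs" by rule but leaves the irregular form "ran" untouched, while the dictionary lookup maps all three to "run".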
Contd…
Noise or stop words: Conjunctions, prepositions, articles,
 and other words such as AND, TO and A that appear often in
 documents yet carry little meaning on their own.
Thesaurus: A list of synonyms a search engine can use to
 find matches for particular words if the words themselves
 don't appear in documents.
Index: A normalized representation of the words in the documents.
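
The three terms above fit together in a small inverted index. The sketch below is one possible arrangement, with a tiny illustrative stop-word list and a hypothetical two-entry thesaurus: words are normalized and stop words dropped at indexing time, and synonyms are expanded at query time.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in"}      # small illustrative list
THESAURUS = {"car": {"automobile"}, "automobile": {"car"}}     # hypothetical synonyms


def normalize(text):
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def build_index(docs):
    """Inverted index: normalized word -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in normalize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Documents containing every query word or one of its synonyms."""
    results = None
    for word in normalize(query):
        matches = set()
        for term in {word} | THESAURUS.get(word, set()):
            matches |= index.get(term, set())
        results = matches if results is None else results & matches
    return results or set()


DOCS = {
    1: "The automobile was sold to a dealer.",
    2: "A car and a bike.",
    3: "The bike shop",
}
INDEX = build_index(DOCS)
print(search(INDEX, "the car"))   # matches "car" and its synonym "automobile"
```

Expanding synonyms at query time keeps the index small; expanding them at indexing time is the usual alternative and trades index size for faster queries.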
Contd…
Semantic search: A process used to improve online
  searching by using data from semantic networks to
  disambiguate queries and web text in order to generate more
  relevant results.
Web Search Vs Enterprise Search
Web search: Content is public & generic. Uses keywords and
 links (relevancy), based on some kind of historic traffic.
 HTTP crawlers are usually used for content acquisition.
Enterprise search: Also contains private, domain-specific
 documents. Content should be of the highest quality, not
 necessarily popular. Information and metadata need to be
 secured with role-based access to the content. It has to
 support security (realms, roles), SLAs and many other
 requirements.
Search Technologies
RDBMS: to store metadata
Cache service: for fast access
Parsers: to interpret input queries
Internationalization: for handling different languages
Search DSL: catering to a particular domain
Map/Reduce, parallelization & algorithms
Indexing, file storage systems, multi-threading
Contd…
Thank You!


Editor's Notes

  • #2: Hi. First, let me introduce myself: I'm Praveena Manvi, a programmer for more than a decade and now a search engineer.
  • #3: Retrieve documents, from a fixed set, with information that is relevant to the user's information need and helps them complete a task. We now have vast amounts of information to which accurate and speedy access is becoming ever more difficult. One effect of this is that relevant information gets ignored since it is never uncovered, which in turn leads to much duplication of work and effort. When high speed computers became available for non-numerical work, many thought that a computer would be able to 'read' an entire document collection to extract the relevant documents. It soon became apparent that using the natural language text of a document not only caused input and storage problems (it still does) but also left unsolved the intellectual problem of characterising the document content. It is conceivable that future hardware developments may make natural language input and storage more feasible. But automatic characterisation in which the software attempts to duplicate the human process of 'reading' is a very sticky problem indeed. More specifically, 'reading' involves attempting to extract information, both syntactic and semantic, from the text and using it to decide whether each document is relevant or not to a particular request. The difficulty is not only knowing how to extract the information but also how to use it to decide relevance. The comparatively slow progress of modern linguistics on the semantic front and the conspicuous failure of machine translation (Bar-Hillel) show that these problems are largely unsolved.
  • #4: Simple, yet powerful enough to change the world's business models. It exemplifies the end goal of search: “providing relevant information, taking the minimum number of inputs, in the fastest possible way”. Search engines are meant to minimize the time required to find information and the amount of information to be searched. Google has indexed over 9.6 billion static pages (56 billion publicly available pages). Over 200 billion dynamic pages are unavailable to search engines. Predictive, geo-aware, personalized, international, and semantic (context-aware) searches. Salient features: no complex forms, with instant results; no navigation, simple user interface; no hierarchy of categories; results in less than a second; non-invasive model of displaying ads.
  • #5: But that is not always possible. There are cases where the user always needs to provide extra information. Software will never be able to read the searcher's mind about what content they want, but it has become far better at guessing and keeps improving day by day, with Google leading this innovation at amazing speed.
  • #6: Keyword bidding: "keyword", "term", and "ad word" are used interchangeably. Real estate bidding. Content Match enables publishers to monetize their sites by leveraging user interest. Targeted Content Match results may be displayed based on the specific content of a publisher's site, geographic location, and other factors.
  • #7: From my Gmail inbox. Sample from a blog.
  • #8: It’s all about billboards.
  • #9: The main intention is to catch more eyeballs and serve ads. Monetization comes only after some time, once the site reaches a threshold level of traffic (usually in the millions). The real estate value of these sites (the billboards) increases with the number of eyeballs they attract.
  • #10: Yahoo!, Google, and MSN control the largest share of the search market. Bloomberg, Reuters, Thomson: finance, legal, and various other domains.
  • #11: Vertical and domain search will always have a place in the search domain: "Google makes finding information easier than ever, but nothing beats interacting with an expert." There are industries consisting of search engines that focus on specific slices of content or on the approach. Answers: DeeperWeb.com. Travel: Kayak.com. Jobs: Indeed.com, SimplyHired.com. Health: HealthPricer.com. Stock market: StockTraderSearch.com. Biomedical: VADLO.com. Metrics: DeeperWeb.com.
  • #12: According to Wikipedia, unstructured data accounts for about 80% of total content. It has long been argued that the fabled "80% of corporate content" remaining unstructured across an enterprise may turn out, on deeper examination, to be quite structured or at least semi-structured. Veterans of our industry insist that for data to be called structured, it must live in a database; by elimination, all other content is unstructured. Structured vs. unstructured is cellular data vs. non-cellular data. Database LOB types are special exception cases. Structured: data that can be queried through SQL.
  • #13: SQL is good for a predefined schema. IR can be represented as a mapping function, IR: Q → D, where Q is the set of natural language queries that specify user information needs and D is a set of documents in the document collection that meet these needs, (optionally) ordered according to the degree of relevance.
  • #14: A sample project that can serve as a test bed for getting to know the JDK better.
  • #15: In text processing, a proximity search looks for documents where two or more separately matching term occurrences are within a specified distance, where distance is the number of intermediate words or characters. Google uses * for this: "house * dog" OR "dog * house" searches for house and dog up to 2 words apart. The program looks for intersections of the meanings power, struggle, and India where they are located together within a sentence.
  • #16: Lemmatization is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.
  • #18: A semantic search engine is a search engine that takes the sense of a word as a factor in its ranking algorithm, or offers the user a choice as to the sense of a word or phrase.
  • #19: Public web search engines such as Google and Yahoo do not cover private enterprise content. Professionals enter the content for enterprise search.
  • #20: The basic search term weighting formula known as IDF, proposed by Spärck Jones on heuristic grounds in 1972, has proved extraordinarily robust. It remains at the core of many, if not most, ranking methods used in search engines. In 1972, Karen Spärck Jones published in the Journal of Documentation a paper called “A statistical interpretation of term specificity and its application in retrieval” (Spärck Jones, 1972). The measure of term specificity first proposed in that paper later became known as inverse document frequency, or IDF; it is based on counting the number of documents in the collection being searched which contain (or are indexed by) the term in question. Assume there are N documents in the collection, and that term t_i occurs in n_i of them. (What might constitute a ‘term’ is not of concern to us here, but we may assume that terms are words, or possibly phrases or word stems.) Then idf(t_i) = log(N / n_i).
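
The IDF weighting described in note #20 can be sketched in a few lines. The function follows the formula idf(t_i) = log(N / n_i) with a natural logarithm; the sample collection, and the choice to raise an error for a term that occurs in no document, are assumptions made for this example.

```python
import math


def idf(term, docs):
    """idf(t) = log(N / n_t), where N is the number of documents in the
    collection and n_t is the number of documents containing the term."""
    n_docs = len(docs)
    n_term = sum(1 for words in docs if term in words)
    if n_term == 0:
        raise ValueError(f"term {term!r} does not occur in the collection")
    return math.log(n_docs / n_term)


# Hypothetical collection: each document is represented as a set of terms.
DOCS = [
    {"java", "search"},
    {"java", "index"},
    {"java", "search", "rank"},
    {"lucene"},
]

for term in ("java", "search", "lucene"):
    print(term, round(idf(term, DOCS), 3))
```

A term that appears in most documents ("java" here) gets a weight close to zero, while a rare term ("lucene") gets a high weight, which is why IDF works as a measure of term specificity.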