Deeper Dive into Purpose-Built Search: A Bullet Point Journey
Core Concept
Tailored information retrieval systems designed for specific domains or user needs, offering superior relevance and efficiency compared to general-purpose search.
Key Benefits:
• Domain Expertise: Deep understanding of language, data structures, and search intent within a specific domain.
• Targeted Functionalities: Specialized features and operators catered to the domain (e.g., legal citation search, product filtering).
• Streamlined Efficiency: Faster and more accurate results, saving time and effort.
Diverse Applications:
• E-commerce: Advanced product comparisons based on specific criteria.
• Legal Research: Efficient navigation of databases with specialized search operators.
• Enterprise Search: Role-specific search for internal documents and resources.
• Media & Entertainment: Granular search by genre, cast, release date, etc.
• Scientific Exploration: Domain-specific ranking algorithms for relevant research papers.
• Healthcare: Search medical databases based on symptoms, diagnoses, and medications.
• Education: Curated search experiences for students and educators across disciplines.
Technical Underpinnings:
• Advanced Indexing & Processing: Algorithms optimize data for specific domain searches.
• Specialized Query Understanding: Intent analysis tailored to the domain vocabulary and patterns.
• Domain-Specific Ranking: Prioritizes results based on relevance and search context within the domain (a minimal sketch follows this list).
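To make these underpinnings concrete, here is a minimal Python sketch of a purpose-built product search: field-aware matching stands in for specialized query understanding, and a hand-tuned scoring function stands in for domain-specific ranking. All product data, field names, and weights are illustrative assumptions, not any real engine's API.

```python
# A toy purpose-built product search: structured filters + keyword match
# over domain fields, ranked by a domain-specific heuristic.

PRODUCTS = [
    {"id": 1, "title": "trail running shoes", "brand": "acme", "price": 89.0},
    {"id": 2, "title": "road running shoes",  "brand": "zoom", "price": 120.0},
    {"id": 3, "title": "hiking boots",        "brand": "acme", "price": 140.0},
]

def matches(product, keywords, max_price=None):
    """Specialized query understanding: keywords hit the title field,
    while max_price is a structured, domain-specific filter."""
    if max_price is not None and product["price"] > max_price:
        return False
    text = product["title"].lower()
    return all(kw in text for kw in keywords)

def score(product, keywords):
    """Domain-specific ranking: title matches outweigh brand matches,
    and cheaper items get a small boost (an assumed e-commerce heuristic)."""
    s = sum(2.0 for kw in keywords if kw in product["title"].lower())
    s += sum(1.0 for kw in keywords if kw in product["brand"].lower())
    return s + 1.0 / (1.0 + product["price"] / 100.0)

def search(keywords, max_price=None):
    hits = [p for p in PRODUCTS if matches(p, keywords, max_price)]
    return sorted(hits, key=lambda p: score(p, keywords), reverse=True)

print(search(["running", "shoes"], max_price=100))  # -> the 89.0 acme shoes
```

A production system would replace the linear scan with an inverted index and learn the ranking weights from data, but the division of labor (parse, filter, rank) is the same.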
Emerging Trends:
• AI-Powered Insights: Extracting deeper connections and patterns from search results.
• Cross-Domain Integration: Seamlessly search across specialized tools for broader exploration.
• Personalization & Adaptability: Intuitive interfaces learning from user habits and preferences.
Future Implications:
• Democratization of information access across various domains.
• Increased productivity and efficiency in knowledge-driven tasks.
• Personalized learning experiences and deeper understanding of complex topics.
Controlled Queries vs. Uncontrolled Queries in Web Mining:
Concept
• Controlled queries: Formulated by the researcher with specific goals and requirements, often tailored to a particular domain or dataset. They leverage structured query languages (e.g., SQL, XPath) or web APIs to precisely retrieve relevant data.
• Uncontrolled queries: Submitted by users (e.g., search keywords, reviews, forum posts) with varying levels of clarity, structure, and intent. They represent spontaneous information needs in diverse formats and require parsing, understanding, and interpretation (both kinds are contrasted in the sketch after this list).
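As a minimal illustration of the contrast, the sketch below runs a controlled SQL query against an in-memory table and then applies a crude keyword pass to an uncontrolled user review; the schema, values, and cue words are invented for the example.

```python
import sqlite3

# Controlled query: researcher-formulated, structured, precise.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hotels (name TEXT, city TEXT, price REAL)")
con.executemany("INSERT INTO hotels VALUES (?, ?, ?)",
                [("Sea View", "Lisbon", 90.0), ("Old Town Inn", "Lisbon", 60.0)])
rows = con.execute(
    "SELECT name, price FROM hotels WHERE city = ? AND price < ?",
    ("Lisbon", 80.0),
).fetchall()
print(rows)  # [('Old Town Inn', 60.0)] -- exact, schema-bound retrieval

# Uncontrolled query: user-submitted free text with no schema; it must be
# parsed and interpreted (a crude keyword pass stands in for real NLP).
review = "cheap hotel near the old town, but the wifi barely worked :("
negative_cues = {"barely", "broken", "dirty", ":("}
print(any(cue in review for cue in negative_cues))  # True -> likely complaint
```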
Key Differences:
• Origin: controlled queries are designed by the researcher; uncontrolled queries arrive spontaneously from users.
• Structure: controlled queries follow a formal syntax (SQL, XPath, API parameters); uncontrolled queries are free-form text.
• Processing: controlled queries return predictable, structured results; uncontrolled queries must be parsed and interpreted before use.
Relation to Web Mining:
• Controlled queries:
  – Used to access well-organized data repositories (e.g., databases, websites with clean APIs)
  – Support targeted extraction of specific data points for analysis or modeling
  – Examples: Crawling product prices from e-commerce sites, extracting scientific literature through APIs
• Uncontrolled queries:
  – Often require pre-processing, text analysis, and natural language processing (NLP) techniques
  – Present challenges due to noise, subjectivity, and ambiguity
  – Used for broader exploration, sentiment analysis, topic modeling, and understanding user behavior
  – Examples: Analyzing customer reviews, mining social media trends, exploring unstructured knowledge bases
Considerations:
• Choice between controlled and uncontrolled queries depends on research objectives, data availability, and resource constraints.
• Both approaches can be valuable, and often they are combined for comprehensive web mining.
• Uncontrolled queries offer broader insights but necessitate deeper understanding and careful processing.
Web Mining Examples:
• Travel website data:
  – Controlled queries could be used to extract hotel listings based on specific criteria (location, price, amenities).
  – Uncontrolled queries could analyze visitor reviews to understand sentiment and identify areas for improvement (see the sketch after this list).
• News analysis:
  – Controlled queries could retrieve articles on specific topics from credible sources.
  – Uncontrolled queries could explore broader social media discussions to uncover emerging trends and public opinion.
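A compact sketch of the travel-site scenario, under the assumption that listings arrive as structured records from some API while reviews arrive as free text; the data and the tiny sentiment lexicon are made up for illustration.

```python
# Controlled: filter structured listings pulled from an (assumed) API.
listings = [
    {"hotel": "Sea View", "location": "beach", "price": 90, "amenities": ["pool"]},
    {"hotel": "Old Town Inn", "location": "center", "price": 60, "amenities": []},
]
cheap_central = [l for l in listings if l["location"] == "center" and l["price"] < 80]

# Uncontrolled: score free-text reviews with a tiny lexicon to surface
# sentiment and recurring complaints (a real pipeline would use an NLP model).
reviews = ["great staff, lovely pool", "noisy room and rude staff", "loved the view"]
POS, NEG = {"great", "lovely", "loved"}, {"noisy", "rude", "dirty"}

def polarity(text):
    words = set(text.lower().split())
    return len(words & POS) - len(words & NEG)

print(cheap_central[0]["hotel"])       # Old Town Inn
print([polarity(r) for r in reviews])  # [2, -2, 1]
```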
Future Directions:
• Integration of semantic web technologies and advanced NLP techniques to better understand unstructured data.
• Development of adaptive mining methods that can dynamically switch between controlled and uncontrolled queries based on context and needs.
• Enhanced use of explainable AI (XAI) to make query interpretation and analysis more transparent.
Understanding Word Embedding and Word2Vec for Efficient Language Processing
https://www.youtube.com/watch?v=viZrOnJclY0
• Word embeddings and the Word2Vec model can be used to assign numerical representations to words based on their context, allowing for more efficient processing of language and understanding of word similarities.
Key insights
• Word embeddings allow similar words to have similar numbers, making it easier to analyze and understand text data.
• Words with similar meanings and usage should be assigned similar numbers in word embedding to help neural networks learn more efficiently.
• Backpropagation is used to optimize the random values of the weights in a neural network, enabling the network to make accurate predictions.
• The word embedding model uses input words to predict the next word in a phrase, assigning higher values to the desired output word.
• Optimizing the weights of word embeddings can potentially improve the performance of natural language processing models by capturing semantic relationships between words.
• Using word embeddings to optimize the weights in a neural network allows it to learn how similar words are used, improving language processing.
• Word2Vec efficiently creates word embeddings by selectively optimizing weights for specific outputs, allowing for the creation of multiple embeddings for each word in a large vocabulary (a toy training example follows this list).
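A toy gensim sketch of these insights: train Word2Vec on a few repetitive phrases and check that words used in the same contexts receive similar vectors. The corpus and hyperparameters are illustrative; real embeddings need far more text.

```python
from gensim.models import Word2Vec

# Words appearing in identical contexts should end up with similar vectors.
sentences = [
    ["troll2", "is", "great"], ["gymkata", "is", "great"],
    ["troll2", "is", "a", "movie"], ["gymkata", "is", "a", "movie"],
]
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1,
                 epochs=200, seed=1)

print(model.wv.similarity("troll2", "gymkata"))  # high: same contexts
print(model.wv["troll2"])  # the learned numeric representation (embedding)
```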
Q&A
• What are word embeddings and Word2Vec?
  — Word embeddings and Word2Vec are methods used to convert words into numerical representations based on their context, making it easier to process language and understand word similarities in machine learning.
• How does a neural network determine word associations?
  — A simple neural network can determine the association between words and numbers based on their context in phrases, allowing for the prediction of the next word in a phrase.
• Why is training a neural network important for word embeddings?
  — Training a neural network is important for correctly predicting the next word in a phrase and adjusting word embeddings to make similar words more similar to each other based on their context.
• What strategies does Word2Vec use to increase context in word embeddings?
  — Word2Vec uses two strategies, continuous bag-of-words and skip-gram, to increase context in word embeddings by predicting surrounding words based on the middle word and vice versa.
• How does Word2Vec optimize training for word embeddings?
  — Word2Vec speeds up training by using negative sampling to optimize only for the words we want to predict, efficiently creating word embeddings by selecting a few words to predict and optimizing only a fraction of the total weights in the neural network (see the parameter sketch below).
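Both strategies and negative sampling map directly onto gensim's Word2Vec parameters, as this hedged sketch shows (the two-sentence corpus is a placeholder):

```python
from gensim.models import Word2Vec

corpus = [["the", "quick", "brown", "fox"], ["the", "lazy", "brown", "dog"]]

# sg=0: continuous bag-of-words (surrounding words predict the middle word)
cbow = Word2Vec(corpus, sg=0, negative=5, vector_size=50, window=2, min_count=1)

# sg=1: skip-gram (the middle word predicts its surrounding words)
skipgram = Word2Vec(corpus, sg=1, negative=5, vector_size=50, min_count=1)

# negative=5: for each positive example, update weights for only 5 sampled
# "wrong" words instead of the whole vocabulary, which speeds up training.
```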
Timestamped Summary
• 00:00 Word embeddings and Word2Vec convert words into numbers, allowing similar words to have similar numerical representations for easier use in machine learning algorithms.
• 02:38 Similar words should have similar numbers to help a neural network learn and apply knowledge, and a simple neural network can determine word-number associations based on context.
• 04:54 We create a neural network with inputs for each unique word, connect them to activation functions, and optimize the weights through backpropagation to associate numbers with each word.
• 06:20 Using word embeddings and the Word2Vec model, we can predict the next word in a phrase by training a neural network to assign values to input words, connect them to activation functions with weights, and run the outputs through the softmax function for classification (sketched in code after this list).
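The 04:54–06:20 steps can be sketched in a few lines of numpy: one input per unique word, an embedding weight matrix, output weights, and a softmax over the vocabulary. The sizes, data, and initialization are toy assumptions, and training (backpropagation) is omitted.

```python
import numpy as np

vocab = ["troll2", "is", "great", "gymkata"]
word_to_ix = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 2                # vocabulary size, embedding dimension

rng = np.random.default_rng(0)
W_embed = rng.normal(size=(V, D))   # weights optimized by backpropagation
W_out = rng.normal(size=(D, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_next(word):
    x = np.zeros(V)
    x[word_to_ix[word]] = 1.0       # one-hot input for the word
    h = x @ W_embed                 # the word's embedding (its "numbers")
    return softmax(h @ W_out)       # probability of each next word

print(predict_next("troll2"))  # untrained: near-uniform until backprop adjusts weights
```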
Timestamped Summary
• 08:18 Word embeddings are adjusted through backpropagation to make words that appear in the same context more similar to each other, and the neural network accurately predicts the next word based on input.
• 10:37 Training a neural network with Word2Vec can help process language and understand how similar words are used by assigning numbers to words based on their context.
• 12:31 Word2Vec uses multiple activation functions and a large vocabulary to efficiently create word embeddings by optimizing only a fraction of the total weights in the neural network.
GOOGLE BERT
• https://jalammar.github.io/illustrated-bert/
How to download pre-trained models and corpora
• https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html
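Following that guide, a minimal usage example of gensim's downloader API. The model name is a real entry in gensim's catalogue; the first call downloads the data, which can take a while.

```python
import gensim.downloader as api

print(api.info()["models"].keys())            # list available pre-trained models
vectors = api.load("glove-wiki-gigaword-50")  # 50-dimensional GloVe vectors
print(vectors.most_similar("search", topn=3))
```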
Pre-trained corpus
• A pre-trained corpus is a massive collection of text data that has already been used to train a language model. Think of it like a vast library of books that a language model has already read and learned from. This "reading" process lets the model understand the nuances of language, like how words are used together, sentence structure, and different writing styles.
What's in it?
• A pre-trained corpus can contain diverse sources like books, articles, code, websites, and even social media conversations.
• The size can vary, with some corpora containing billions of words!
Why is it used?
• Training a language model from scratch requires immense computing power and data.
• Pre-trained corpora save time and resources by providing a foundation of knowledge.
• The model can then be fine-tuned on specific tasks like summarizing text, translating languages, or writing different kinds of creative content.
Benefits:
• Faster training of language models.
• Improved performance on various NLP tasks.
• Adaptability to diverse domains by fine-tuning.
Examples:
• Well-known pre-trained corpora include Wikipedia, BookCorpus, and Common Crawl.
• Specialized corpora exist for legal documents, medical texts, or scientific papers (a loading sketch follows this list).
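As a closing sketch, the gensim downloader also exposes ready-made corpora such as text8 (a Wikipedia excerpt), so embeddings can be trained without collecting billions of words first. Training still takes a few minutes on a laptop.

```python
import gensim.downloader as api
from gensim.models import Word2Vec

corpus = api.load("text8")   # iterable of tokenized sentences from Wikipedia
model = Word2Vec(corpus, vector_size=100, window=5, min_count=5)
print(model.wv.most_similar("medicine", topn=3))
```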