Semantic similarity and text summarization
based novelty detection
American University of Culture and Education (AUCE)
Nabatieh Campus
Advanced Database System CSI531
Ali Saad & Mostafa Abbas
Dr. Hassan Harb
Outline
1 What is novelty detection?
2 Challenges
3 Generic crawler methodology
4 Proposed crawler methodology for novelty detection
5 Simulation set parameters
6 Implementation
7 Result and discussion
8 Conclusion
What is novelty detection?
• Novelty detection is the process of finding information that has not appeared
before, or that is new with respect to the relevant information already seen; the
mechanism can be appended to current web crawlers.
• The proposed approach can also be used with other search engines such as Google,
Yahoo and Bing to minimize superfluous documents.
Challenges
• Information accumulates progressively due to the explosive growth of
documents on the web, which has resulted in duplication of information.
• Reading redundant information consumes the user's precious time and storage
space; the problem of novelty detection, i.e. filtering out redundant information,
still persists.
• Therefore, a novelty detection mechanism is proposed that can be appended to
current web crawlers.
Generic crawler methodology
First, a generic crawler is proposed that takes a query on a specific domain; the
crawler's results are stored in an indexed database. The URLs entered by the admin
are stored in the dictionary.
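As a rough sketch of this storage step (the table and column names below are illustrative, not the ones from the actual implementation), pages fetched for a domain query could be kept in an indexed database like this:

```python
import sqlite3

def make_db():
    # In-memory database standing in for the crawler's indexed store.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages (url TEXT PRIMARY KEY, domain TEXT, html TEXT)"
    )
    # Index on domain so domain-specific queries are fast.
    conn.execute("CREATE INDEX idx_domain ON pages (domain)")
    return conn

def store_page(conn, url, domain, html):
    # INSERT OR IGNORE skips URLs that are already stored.
    conn.execute(
        "INSERT OR IGNORE INTO pages (url, domain, html) VALUES (?, ?, ?)",
        (url, domain, html),
    )
    conn.commit()

conn = make_db()
store_page(conn, "http://example.com/a", "technology", "<html>code</html>")
store_page(conn, "http://example.com/a", "technology", "<html>code</html>")  # same URL again
count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
print(count)  # the duplicate URL is ignored, so 1 row
```

Note this only de-duplicates identical URLs; content-level redundancy is what the proposed novelty-detection step adds on top.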
[Figures: interface of the domain-specific generic crawler, and the SQL database]
Proposed crawler methodology for novelty detection
Algorithm for the proposed novelty-detection crawler
Input: URL (source code fetched into S_out), DB -> database
       (each DB row is summarized in turn into S_current)
Begin
Step 1: Fetch the page source code into S_out
Step 2: Summarize the fetched data S_out
Step 3: For each row in DB
    3.1 Summarize the current DB row into S_current
Step 4: Compute the similarity of the two summaries (S_out, S_current)
Step 5: If (similarity > threshold)
    5.1 Break the loop without comparing further rows,
    5.2 since a similar row has already been found
    Else continue with the next DB row
Step 6: If no similar document is found in DB, then save the fetched data as a new row in DB.
End.
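The steps above can be sketched in Python. The slides do not fix the summarization or similarity functions, so the summarizer is stubbed out here as simple truncation and similarity as Jaccard overlap of word sets; both are placeholders, not the actual implementation:

```python
def summarize(text, n_words=50):
    # Stand-in summarizer: keep the first n_words words (Steps 2 and 3.1).
    return " ".join(text.split()[:n_words])

def similarity(a, b):
    # Jaccard similarity of the word sets of two summaries (Step 4).
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def is_novel(fetched, db_rows, threshold=0.8):
    # Steps 3-5: compare the fetched page's summary with the summary
    # of every stored row; stop at the first sufficiently similar one.
    s_out = summarize(fetched)
    for row in db_rows:
        s_current = summarize(row)
        if similarity(s_out, s_current) > threshold:
            return False  # a similar row already exists
    return True

db = ["the quick brown fox jumps over the lazy dog"]
page = "the quick brown fox jumps over a sleepy dog"
if is_novel(page, db):  # Step 6: save only novel pages
    db.append(page)
print(len(db))  # 0.7 similarity is below the 0.8 threshold, so 2 rows
```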
[Figure: detailed steps of the proposed crawler]
Similarity calculation of summarized data:
• N-gram formation
N-gram formation is the process of splitting a string into overlapping substrings of
fixed size. For a string of length p and grams of size m, the number of grams is:
N = p − m + 1
Example:
Gram size m = 5 and the string ThisIsSKGram (p = 12), so
N = (12 − 5 + 1) = 8 five-grams:
ThisI hisIs isIsS sIsSK IsSKG sSKGr SKGra KGram
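The formation step can be written directly from the formula above:

```python
def ngrams(s, m):
    # A string of length p yields p - m + 1 grams of size m.
    return [s[i:i + m] for i in range(len(s) - m + 1)]

grams = ngrams("ThisIsSKGram", 5)
print(len(grams))           # 12 - 5 + 1 = 8
print(grams[0], grams[-1])  # ThisI KGram
```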
• Process of fingerprint selection
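The slide presents fingerprint selection as a figure, and the text does not specify the selection rule. One common scheme for picking a small subset of n-gram hashes as a document fingerprint is winnowing (keep the minimum hash in each sliding window of hashes); a sketch under that assumption:

```python
import zlib

def fingerprints(grams, window=4):
    # Hash each n-gram deterministically, then keep the minimum hash of
    # every sliding window of `window` consecutive hashes (winnowing).
    hashes = [zlib.crc32(g.encode()) % 10_000 for g in grams]
    selected = set()
    for i in range(len(hashes) - window + 1):
        selected.add(min(hashes[i:i + window]))
    return selected

grams = ["ThisI", "hisIs", "isIsS", "sIsSK", "IsSKG", "sSKGr", "SKGra", "KGram"]
fps = fingerprints(grams)
# The fingerprint is a (usually much smaller) subset of the n-gram hashes,
# so two documents can be compared by fingerprint overlap instead of full text.
print(len(fps) <= len(grams))
```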
Simulation set parameters
The proposed algorithm uses a similarity calculation to decide whether a new web page
is added to the database, based on a threshold value.
Performance parameters:
To measure the efficacy of the proposed scheme, several performance metrics are taken
as given below:
• Redundancy removal (RR):
RR = abs(NPGA − NOPA)
• Memory overhead (MO):
MO = NPR ∗ PS
• Number of pages identified (NPI):
NPI = NPPA
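The slides do not expand these acronyms. A plausible reading, consistent with the results table later, is that NPGA is the number of pages retrieved by the generic approach, NOPA the number of novel pages retrieved by the proposed approach, NPR the number of pages retained, PS the page size, and NPPA the number of pages identified by the proposed approach. Under those assumed meanings the metrics reduce to:

```python
def redundancy_removal(npga, nopa):
    # RR = abs(NPGA - NOPA): pages the proposed crawler avoided storing.
    return abs(npga - nopa)

def memory_overhead(npr, page_size):
    # MO = NPR * PS: total storage for the retained pages.
    return npr * page_size

def pages_identified(nppa):
    # NPI = NPPA: pages the proposed approach reports for a query.
    return nppa

# Using the 'Code' query from the results table: 344 generic pages, 4 novel.
print(redundancy_removal(344, 4))  # 340 redundant pages removed
print(memory_overhead(4, 50))      # 200, with an assumed page size of 50 KB
```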
Implementation
The implementation uses Microsoft Visual Studio 2012 (.NET) as the front end
and SQL Server 2012 as the back-end database.
SQL database tables:
 T_Category
 T_website
 T_webpages
The database stores the URL of each query result together with its HTML tags; it also
includes the ontology table.
Generic and proposed crawler search
[Figure: search engine interface]
List of webpages for the query ‘code’ on the generic crawler search interface.
List of webpages for the query ‘code’ on the proposed crawler search interface.
Result and discussion

Comparison of generic crawler and proposed crawler novelty:

Domain       Query      Generic crawler          Proposed crawler               Redundant
                        (no. of pages retrieved) (no. of novel pages retrieved) pages
Technology   Code       344                      4                              340
             Web        130                      8                              122
             HTML       130                      8                              122
             Java       120                      6                              114
Health       Patient    220                      40                             180
             Medicine   300                      50                             250
             Doctor     190                      30                             160
             Health     390                      70                             320
Transport    Bus        480                      50                             430
             Car        430                      40                             390
             Truck      415                      30                             385
             Vehicle    360                      42                             318
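In every row of the table, the redundant-page count equals the generic count minus the novel count, which is easy to verify (data copied from the table above):

```python
# (domain, query, generic pages, novel pages, redundant pages)
rows = [
    ("Technology", "Code", 344, 4, 340),
    ("Technology", "Web", 130, 8, 122),
    ("Technology", "HTML", 130, 8, 122),
    ("Technology", "Java", 120, 6, 114),
    ("Health", "Patient", 220, 40, 180),
    ("Health", "Medicine", 300, 50, 250),
    ("Health", "Doctor", 190, 30, 160),
    ("Health", "Health", 390, 70, 320),
    ("Transport", "Bus", 480, 50, 430),
    ("Transport", "Car", 430, 40, 390),
    ("Transport", "Truck", 415, 30, 385),
    ("Transport", "Vehicle", 360, 42, 318),
]
# Redundant pages should equal generic minus novel for every query.
ok = all(generic - novel == redundant
         for _, _, generic, novel, redundant in rows)
print(ok)  # True
```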
[Figure: bar chart of the number of pages (0–600) per domain-specific query (Code,
Web, HTML, Java; Patient, Medicine, Doctor, Health; Bus, Car, Truck, Vehicle),
comparing the generic crawler, the proposed crawler, and the redundant pages]
Conclusion
• Reduced redundancy yields novel results for the given search instead of replicating
previous results, making the search effort more effective.
• The memory required for the search results is also reduced to a large extent.
• A main feature of this technique is that the number of pages identified for a given
search is far smaller than with the generic technique. This eliminates repeated
occurrences and lowers both the memory requirement and the execution time.
Hence, it is concluded that this proposed approach can be used successfully in the field of
information retrieval.