Semantic similarity and text summarization
based novelty detection
American University of Culture and Education (AUCE)
Nabatieh Campus
Advanced Database System CSI531
Ali Saad & Mostafa Abbas
Dr. Hassan Harb
Outline
1 What is novelty detection?
2 Challenges
3 Generic crawler methodology
4 Proposed crawler methodology for novelty detection
5 Simulation set parameters
6 Implementation
7 Result and discussion
8 Conclusion
What is novelty detection?
• Novelty detection is the process of finding information that has not appeared
before, or that is new with respect to the relevant information already seen; the
mechanism can be appended to current web crawlers.
• The proposed approach can also be used with other search engines such as Google,
Yahoo and Bing to minimize superfluous documents.
Challenges
• Information accumulates progressively due to the explosive growth of
documents on the web, which has resulted in duplication of information.
• Reading redundant information consumes the user's precious time and storage
space; the problem of novelty detection, i.e. filtering out redundant information,
still persists.
• Therefore, a novelty detection mechanism is proposed that can be appended to
current web crawlers.
Generic crawler methodology
First, a generic crawler is proposed that takes a query on a specific domain; the
crawler's results are stored in an indexed database. The URLs entered by the admin
are stored in the dictionary.
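As a rough sketch of this storage step (the table and column names below are illustrative, not the ones from the actual implementation), pages fetched for a domain query could be kept in an indexed database like this:

```python
import sqlite3

def make_db():
    # In-memory database standing in for the crawler's indexed store.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages (url TEXT PRIMARY KEY, domain TEXT, html TEXT)"
    )
    # Index on domain so domain-specific queries are fast.
    conn.execute("CREATE INDEX idx_domain ON pages (domain)")
    return conn

def store_page(conn, url, domain, html):
    # INSERT OR IGNORE skips URLs that are already stored.
    conn.execute(
        "INSERT OR IGNORE INTO pages (url, domain, html) VALUES (?, ?, ?)",
        (url, domain, html),
    )
    conn.commit()

conn = make_db()
store_page(conn, "http://example.com/a", "technology", "<html>code</html>")
store_page(conn, "http://example.com/a", "technology", "<html>code</html>")  # same URL again
count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
print(count)  # the duplicate URL is ignored, so 1 row
```

Note this only de-duplicates identical URLs; content-level redundancy is what the proposed novelty-detection step adds on top.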
[Figures: interface of the domain-specific generic crawler, and the SQL database]
Proposed crawler methodology for novelty detection
Algorithm for the proposed novelty-detection crawler
Input: URL (source code fetched into S_out), DB -> database
       (each DB row is summarized in turn into S_current)
Begin
Step 1: Fetch the page source code into S_out
Step 2: Summarize the fetched data S_out
Step 3: For each row in DB
    3.1 Summarize the current DB row into S_current
Step 4: Compute the similarity of the two summaries (S_out, S_current)
Step 5: If (similarity > threshold)
    5.1 Break the loop without comparing further rows,
    5.2 since a similar row has already been found
    Else continue with the next DB row
Step 6: If no similar document is found in DB, then save the fetched data as a new row in DB.
End.
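The steps above can be sketched in Python. The slides do not fix the summarization or similarity functions, so the summarizer is stubbed out here as simple truncation and similarity as Jaccard overlap of word sets; both are placeholders, not the actual implementation:

```python
def summarize(text, n_words=50):
    # Stand-in summarizer: keep the first n_words words (Steps 2 and 3.1).
    return " ".join(text.split()[:n_words])

def similarity(a, b):
    # Jaccard similarity of the word sets of two summaries (Step 4).
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def is_novel(fetched, db_rows, threshold=0.8):
    # Steps 3-5: compare the fetched page's summary with the summary
    # of every stored row; stop at the first sufficiently similar one.
    s_out = summarize(fetched)
    for row in db_rows:
        s_current = summarize(row)
        if similarity(s_out, s_current) > threshold:
            return False  # a similar row already exists
    return True

db = ["the quick brown fox jumps over the lazy dog"]
page = "the quick brown fox jumps over a sleepy dog"
if is_novel(page, db):  # Step 6: save only novel pages
    db.append(page)
print(len(db))  # 0.7 similarity is below the 0.8 threshold, so 2 rows
```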
[Figure: detailed steps of the proposed crawler]
Similarity calculation of summarized data:
• N-gram formation
N-gram formation is the process of splitting a string into overlapping substrings of
fixed size. For a string of length p and grams of size m, the number of grams is:
N = p − m + 1
Example:
Gram size m = 5 and the string ThisIsSKGram (p = 12), so
N = (12 − 5 + 1) = 8 five-grams:
ThisI hisIs isIsS sIsSK IsSKG sSKGr SKGra KGram
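The formation step can be written directly from the formula above:

```python
def ngrams(s, m):
    # A string of length p yields p - m + 1 grams of size m.
    return [s[i:i + m] for i in range(len(s) - m + 1)]

grams = ngrams("ThisIsSKGram", 5)
print(len(grams))           # 12 - 5 + 1 = 8
print(grams[0], grams[-1])  # ThisI KGram
```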
• Process of fingerprint selection
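The slide presents fingerprint selection as a figure, and the text does not specify the selection rule. One common scheme for picking a small subset of n-gram hashes as a document fingerprint is winnowing (keep the minimum hash in each sliding window of hashes); a sketch under that assumption:

```python
import zlib

def fingerprints(grams, window=4):
    # Hash each n-gram deterministically, then keep the minimum hash of
    # every sliding window of `window` consecutive hashes (winnowing).
    hashes = [zlib.crc32(g.encode()) % 10_000 for g in grams]
    selected = set()
    for i in range(len(hashes) - window + 1):
        selected.add(min(hashes[i:i + window]))
    return selected

grams = ["ThisI", "hisIs", "isIsS", "sIsSK", "IsSKG", "sSKGr", "SKGra", "KGram"]
fps = fingerprints(grams)
# The fingerprint is a (usually much smaller) subset of the n-gram hashes,
# so two documents can be compared by fingerprint overlap instead of full text.
print(len(fps) <= len(grams))
```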
Simulation set parameters
The proposed algorithm uses a similarity calculation to decide whether a new web page
is added to the database, based on a threshold value.
Performance parameters:
To measure the efficacy of the proposed scheme, several performance metrics are taken
as given below:
• Redundancy removal (RR):
RR = abs(NPGA − NOPA)
• Memory overhead (MO):
MO = NPR ∗ PS
• Number of pages identified (NPI):
NPI = NPPA
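The slides do not expand these acronyms. A plausible reading, consistent with the results table later, is that NPGA is the number of pages retrieved by the generic approach, NOPA the number of novel pages retrieved by the proposed approach, NPR the number of pages retained, PS the page size, and NPPA the number of pages identified by the proposed approach. Under those assumed meanings the metrics reduce to:

```python
def redundancy_removal(npga, nopa):
    # RR = abs(NPGA - NOPA): pages the proposed crawler avoided storing.
    return abs(npga - nopa)

def memory_overhead(npr, page_size):
    # MO = NPR * PS: total storage for the retained pages.
    return npr * page_size

def pages_identified(nppa):
    # NPI = NPPA: pages the proposed approach reports for a query.
    return nppa

# Using the 'Code' query from the results table: 344 generic pages, 4 novel.
print(redundancy_removal(344, 4))  # 340 redundant pages removed
print(memory_overhead(4, 50))      # 200, with an assumed page size of 50 KB
```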
Implementation
The implementation uses Microsoft Visual Studio 2012 (.NET) as the front end
and SQL Server 2012 as the back-end database.
SQL database tables:
 T_Category
 T_website
 T_webpages
The database stores the URL of each query result together with its HTML tags; it also
includes the ontology table.
Generic and proposed crawler search
[Figure: search engine interface]
List of webpages for the query ‘code’ on the generic crawler search interface.
List of webpages for the query ‘code’ on the proposed crawler search interface.
Result and discussion

Comparison of generic crawler and proposed crawler novelty:

Domain       Query      Generic crawler          Proposed crawler               Redundant
                        (no. of pages retrieved) (no. of novel pages retrieved) pages
Technology   Code       344                      4                              340
             Web        130                      8                              122
             HTML       130                      8                              122
             Java       120                      6                              114
Health       Patient    220                      40                             180
             Medicine   300                      50                             250
             Doctor     190                      30                             160
             Health     390                      70                             320
Transport    Bus        480                      50                             430
             Car        430                      40                             390
             Truck      415                      30                             385
             Vehicle    360                      42                             318
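In every row of the table, the redundant-page count equals the generic count minus the novel count, which is easy to verify (data copied from the table above):

```python
# (domain, query, generic pages, novel pages, redundant pages)
rows = [
    ("Technology", "Code", 344, 4, 340),
    ("Technology", "Web", 130, 8, 122),
    ("Technology", "HTML", 130, 8, 122),
    ("Technology", "Java", 120, 6, 114),
    ("Health", "Patient", 220, 40, 180),
    ("Health", "Medicine", 300, 50, 250),
    ("Health", "Doctor", 190, 30, 160),
    ("Health", "Health", 390, 70, 320),
    ("Transport", "Bus", 480, 50, 430),
    ("Transport", "Car", 430, 40, 390),
    ("Transport", "Truck", 415, 30, 385),
    ("Transport", "Vehicle", 360, 42, 318),
]
# Redundant pages should equal generic minus novel for every query.
ok = all(generic - novel == redundant
         for _, _, generic, novel, redundant in rows)
print(ok)  # True
```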
[Figure: bar chart of the number of pages (0–600) per domain-specific query (Code,
Web, HTML, Java; Patient, Medicine, Doctor, Health; Bus, Car, Truck, Vehicle),
comparing the generic crawler, the proposed crawler, and the redundant pages]
Conclusion
• Reduced redundancy yields novel results for the given search instead of replicating
previous results, making the search effort more effective.
• The memory required for the search results is also reduced to a large extent.
• A main feature of this technique is that the number of pages identified for a given
search is far smaller than with the generic technique. This eliminates repeated
occurrences and lowers both the memory requirement and the execution time.
Hence, it is concluded that this proposed approach can be used successfully in the field of
information retrieval.