DLAlert - AN INFORMATION ALERT
                 SYSTEM FOR DIGITAL LIBRARIES


                                      by

                          GIANNIS ALEXAKIS

                    Submitted in partial fulfillment of the
                      requirements for the diploma of


                   Electronic and Computer Engineering


                        Technical University of Crete


                                  June 2003




Guidance Committee :
       Manolis Koubarakis            Associate Professor (supervisor)
       Stavros Christodoulakis       Professor
       Euripides Petrakis            Associate Professor
Information Alert System for Digital Libraries - 2 -




Abstract
   As the information available on the Internet increases from day to day, a user has to
spend a lot of time searching, browsing and rejecting useless information in order to stay
up to date, until he finds exactly what he is looking for. A new type of web application,
the alerting service, takes on the responsibility of collecting all relevant data in a specific
area and delivering them to each user regularly, according to his fields of interest.
   Alerting services could prove very helpful in the area of digital libraries, as the
quantity of scientific publications doubles every 10-15 years and classic search
applications are becoming less and less able to handle this information overload on
their own. In this text we describe the design and implementation of DLAlert, an alerting
system for digital libraries developed for the Library of the Technical University of Crete
and ready to support many other sources, including libraries, publishing houses and
other alerting services. DLAlert is a web application that receives requests through a
web page about the publications that interest each user, stores profiles of each
user’s fields of interest, collects information about new publications from the Technical
University Library gateway, produces notifications for each user and sends an
appropriate e-mail containing all relevant bibliographical data.




Acknowledgements
I would like to express my grateful thanks to the following people for their help, advice
and information during the development of this application.


                   My sincere gratitude to my supervisor Manolis Koubarakis
                              for his precious guidance and advice.


        Stamatis Andranakis for his help during the implementation of the Z39.50 client.


          Epemenidis Voutsakis for his help on the installation of the Oracle database.


                Christos Tryfonopoulos for his help during the validation
                              of the filtering module of DLAlert.


  Ian Ibbotson from Knowledge Integration for his technical advice on the JZKit toolkit.


                 All the undergraduate and postgraduate students of the
            Intelligent Systems Laboratory for their cooperation and support.




Contents

Chapter 1
Introduction
   1.1    Alerting applications – description
   1.2    Alerting applications – examples
   1.3    DLAlert - Alert system for Digital Libraries
   1.4    Organization of the dissertation

Chapter 2
Related Work
   2.1    Alerting applications on the web
     2.1.1 Elsevier Contents Direct
     2.1.2 Kluwer Alert
     2.1.3 Springer Alert
     2.1.4 Hermes
   2.2    DIAS (Distributed Information Alert System)
     2.2.1 The WP (Word Pattern) data model
     2.2.2 The AWP (Attribute based Word Pattern) data model
     2.2.3 The AWPS data model
     2.2.4 Some interesting problems
   2.3    The Z39.50 protocol
     2.3.1 Initialization facility
     2.3.2 Search facility
     2.3.3 Type-1 query syntax
     2.3.4 Retrieval facility
     2.3.5 Record format (UNIMARC)
     2.3.6 Z39.50 and interoperability
   2.4    Oracle Text and filtering applications
     2.4.1 The CTXRULE index
     2.4.2 Creating the tables
     2.4.3 Language of the stored queries
     2.4.4 Indexing the stored queries
     2.4.5 Filtering
   2.5    Conclusions

Chapter 3
System Overview
   3.1    DLAlert architecture
   3.2    Main Technologies used
   3.3    Graphical User Interface overview
   3.4    The language of the text queries
   3.5    Conclusions

Chapter 4
Database Schema
   4.1    Requirements analysis
   4.2    Relational schema
   4.3    Key consistency and atomic transactions
   4.4    Indexing of the stored queries
   4.5    Conclusions

Chapter 5
PL/SQL packages
   5.1    Filtering module
     5.1.1 The algorithm
   5.2    Notifying module
     5.2.1 The UTL_SMTP package
     5.2.2 Collecting the matched publications for a single user
   5.3    Performance
   5.4    Conclusions

Chapter 6
The Graphical User Interface
   6.1    Middle application tier architecture
   6.2    The Enterprise Java Bean
   6.3    OC4J custom tag library
   6.4    Preventing CTXRULE index errors
   6.5    Parsing the text queries
   6.6    Conclusions

Chapter 7
The Observer
   7.1    Information providers
   7.2    Observer architecture
   7.3    JZKit API
   7.4    UNIMARC parser
   7.5    SQLJ functionality
   7.6    Performance
   7.7    Important technical issues
   7.8    Conclusions

Chapter 8
Scheduling DLAlert
   8.1    Simple scenario
   8.2    Supporting three types of desired notification frequencies
   8.3    Conclusions

Chapter 9
Concluding remarks
   9.1    Future work on DLAlert
   9.2    Conclusion

Bibliography




Chapter 1
Introduction


       A survey made two years ago showed that more than one-third of all Internet users
spend more than 2 hours per week searching the web∗. The same statistics showed that
86 per cent of users thought searching should be made more efficient. However, it
seems that coping with all the information available today on the Internet is becoming an
even harder and more time-consuming task. A user has to spend a lot of time searching,
browsing and rejecting useless information in order to stay up to date, until he finds
exactly what he is looking for.
       A new type of application, the alerting service, promises to deliver to each user
what he is interested in. The user is informed regularly about a specific area
instead of constantly chasing after hyper-links and search engines. In the following
sections we describe the use of alerting services on today’s Internet and particularly in
the domain of digital libraries and publishing houses, which is the main topic of this
dissertation.



1.1 Alerting applications – description
An alerting service is any application that:
       •   Collects all relevant information in a specific area. This information could refer to
           events in the real world or data retrieved from databases on the Internet.
       •   Receives requests about this information from users. These requests, called
           profiles, are stored and treated as long-standing queries that are evaluated
           regularly without the constant interaction of the user.
       •   Filters incoming data according to these stored queries and finds the
           profiles whose conditions are satisfied.
       •   Produces notifications that are adapted to the user’s needs and sends all
           relevant data in his fields of interest, as defined by his profiles.


∗ Source: “Search Engine Watch”, URL: http://www.searchenginewatch.com/




The main difference from classic information retrieval applications is that the latter
evaluate every user’s request only once and send the results immediately. In a selective
dissemination of information (alert) system the queries are stored and the user is
notified every time there is a new event or new data that might interest him.
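
The four tasks listed above can be illustrated with a minimal Python sketch. It is not part of DLAlert: the names Profile, AlertingService, subscribe and publish, as well as the sample e-mail addresses and items, are hypothetical and chosen only for this illustration. Profiles are stored once as long-standing queries, and every incoming item is filtered against all of them, in contrast to a one-off search request.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# An incoming item is modelled as a plain dictionary of attributes (assumption).
Item = Dict[str, str]


@dataclass
class Profile:
    """A stored, long-standing query belonging to one user (hypothetical model)."""
    user_email: str
    condition: Callable[[Item], bool]


@dataclass
class AlertingService:
    """Keeps the profiles and filters every new item against them."""
    profiles: List[Profile] = field(default_factory=list)

    def subscribe(self, profile: Profile) -> None:
        # The request is stored instead of being evaluated once and discarded.
        self.profiles.append(profile)

    def publish(self, item: Item) -> List[str]:
        # Filtering step: return the users whose stored condition is satisfied.
        # A real service would now build and send one notification per user.
        return [p.user_email for p in self.profiles if p.condition(item)]


service = AlertingService()
service.subscribe(Profile("alice@example.org",
                          lambda item: "databases" in item["title"].lower()))
service.subscribe(Profile("bob@example.org",
                          lambda item: item["type"] == "journal"))

new_item = {"title": "Distributed Databases", "type": "book"}
recipients = service.publish(new_item)  # only the first profile is satisfied
```

The design choice worth noting is that the user interacts with the service only in subscribe; publish runs whenever new data arrives, without any further action by the user.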




1.2 Alerting applications – examples
   Alerting services are becoming more and more helpful, and selective dissemination of
information techniques can be applied in many domains. Consider the following
examples:
   •   A news portal sending registered members e-mail messages that contain the daily
       news, adapted to every user’s fields of interest.
   •   An e-commerce company integrating information about merchandise from
       various providers and sending advertisements according to every customer’s
       previous purchases.
   •   A stock exchange company alerting stockholders upon certain events in the
       stock market (an increase or decrease of current rates and prices).


Every application like the ones above has its own characteristics:
   •   The way the system collects information from various sources. Sources could be
       data retrieved from databases that represent events in the real world. Every
       selective dissemination service monitors different events according to its specific
       area of interest.
   •   The way users request this information and define the conditions under which
       they want to be notified, for which the system should provide a standard method. The
       so-called language of the profiles is the language in which the stored queries are defined.
       Queries can be expressed directly in expressions that the system can recognize,
       through a simple user-friendly graphical interface with buttons and drop-down
       lists, or even implicitly by monitoring the user’s previous actions (as in the second
       example above).



1.3 DLAlert - Alert system for Digital Libraries
       The number of scientific publications is estimated to double every 10-15 years.
This means that the number of scientific papers, journals and books published before
1990 is close to the number published in the last decade. It becomes harder for people to
follow this evolution of technology without spending a lot of time daily on the Internet
searching and browsing. Additionally, technological knowledge is often provided by
various independent organizations (universities, publishing houses, research
departments in companies etc.). As new branches of technology appear every day, there
is a need for tools that help people navigate through all this available information. Search
engines dedicated to literature (scientific or not) and alerting applications in the same
area are becoming very popular tools on today’s Internet.
   This dissertation presents the design and implementation of DLAlert, an alert system
for digital libraries developed for the Library of the Technical University of Crete and
ready to support many other sources, including libraries of other organizations and
alerting services.
   Features distinguishing DLAlert include:
   •    The events that interest a potential user are new publications.
   •    The sources are the bibliographical attributes of the new publications inserted in
        the digital libraries supported by DLAlert (up to now only the Library of the
        Technical University of Crete).
   •    The language of profiles offers queries about words, phrases or concepts
        contained in the publications’ bibliographical attributes. Additionally, it provides
        Boolean and proximity operators for defining relations between these
        terms.
   •    The user is expected to describe his fields of interest by typing his queries in a
        graphical user interface on the web ( http://intelligence.tuc.gr/alert/login.html ).
   •    Notifications are well-formed e-mail messages sent to the user, containing all
        relevant bibliographical attributes of the matched publications.



1.4 Organization of the dissertation
   This dissertation is organized as follows. In Chapter 2 we present related work in the
areas of information retrieval and selective dissemination of information. In Chapter 3 we
discuss modeling and design issues of DLAlert. Chapter 4 presents the internal
structure and organization of the database schema used for storing the profiles and
filtering the incoming data. In Chapters 5-6 we briefly explain the operation of each
of the system’s modules (filtering, notifying, web Graphical User Interface). In Chapter 7
we discuss possible ways to collect data from sources and present the current
implementation, which uses the Z39.50 standard to retrieve information from various digital
libraries. Chapter 8 presents the scheduling of the actions performed by the system. Finally,
in Chapter 9 we present our conclusions and propose directions for future work on DLAlert.




Chapter 2
Related Work
   In this chapter we discuss related technologies in the areas of information retrieval
and selective dissemination of information in the context of digital libraries. We present
previous developments on this topic that inspired and helped us implement DLAlert.


2.1 Alerting applications on the web
   In the following sections we present popular alerting services [2, 3] in the area of
digital libraries. The main issues on which we concentrate when designing a notification
service are:
      i. Number of sources. The system should be able to collect data from
         various and often dissimilar sources and integrate them in such a way that the user
         will not notice the difference.
     ii. Scalability. The application should be able to filter thousands of publications
         against millions of profiles every day.
     iii. User friendliness. The graphical user interface (GUI) and the notifications must be
         easy to use, otherwise no one will use the application.
     iv. Query expressiveness. The language in which the profiles are defined should
         allow the user to request notifications on almost any possible
         subset of incoming publications. In most cases the alerting services for digital
         libraries offer queries on categories or keywords contained in the
         bibliographical data.
     v. Type of the notifications. The notifications are often e-mail messages that
         contain hyper-links to web pages related to a specific publication,
         bibliographical attributes, an abstract or the complete table of contents of the
         book or journal of interest. They could be plain text, HTML or XML for later
         processing by other alerting services.
     vi. Search and alert capabilities. Providing a search engine for the sources
         that are included in the alert is a positive feature.



2.1.1 Elsevier Contents Direct
   Elsevier Contents Direct ( http://contentsdirect.elsevier.com/ ) is the free e-mail
alerting service from Elsevier Science, which delivers notifications to users about new
publications. The sources of this system are a large number of daily publications from
various areas and providers. The web interface is very simple and friendly
and allows queries on subject categories only. The user can also choose to be notified
regularly about single journals instead of selecting a whole category. Complex profiles on
keywords cannot be defined. The subjects are defined in a hierarchical manner, which
means that every category contains one or more sub-categories. An advantage of this
application is that the user can search for journals and update his profile on the same
page. E-mail messages are well-formed HTML messages that contain the title of the
document and a hyper-link to the table of contents and abstract of the book or journal.
Unfortunately, more details about the internal design of this service are not available.




                      Figure 2.1-1 Elsevier Contents Direct interface



2.1.2 Kluwer Alert
   Kluwer Alert ( http://www.kluweralert.nl/ ) is a service which promises to keep
researchers on top of the latest in scientific publishing. The system collects a large
number of publications daily from various areas. It offers functionality similar to that of the
previous application, which means that queries are subject categories of books and
journals. Subjects are organized hierarchically and the user is able to select single
journals. Queries on words or bibliographical attributes are not allowed in the alert.
Notifications are well-formed HTML messages that provide all the necessary information
about a publication and hyper-links to the table of contents. E-mails also contain pricing
information, and a user can buy books on-line. The user can also search the
sources included in Kluwer Alert using a simple interface.




                            Figure 2.1-2 Kluwer Alert interface



2.1.3 Springer Alert
   Springer Alert ( http://www.springer.de/alert/ ) is a service provided by Springer
Science that supports a wide variety of sources. The queries are subject categories
which are organized hierarchically. The user can also select the frequency of
notifications and the preferred language of publications. The e-mails are sent regularly
and contain only the title/author of the matched book or journal and a hyper-link to a
web page where one can read a detailed description of the publication, download the table
of contents and purchase it. Springer Alert is the only service that provides
promotional brochures and software via surface mail. Catalogue search for books and
electronic media is available in the same interface.




                           Figure 2.1-3 Springer Alert interface



2.1.4 Hermes
   Hermes ( http://hermes.inf.fu-berlin.de/ ) [4, 5] is an alerting service developed by the
Institute of Computer Science of the University of Berlin. Hermes promises to integrate
heterogeneous interfaces of different providers and to support publishing houses or
libraries that do not offer an alerting service themselves. Emphasis is on scholarly
publications such as journal articles, technical reports and books.
   It is the only service that lets users specify their interests through an advanced
mechanism. Profiles consist of one or more queries on bibliographical metadata and
query terms, as well as selections of specific journals. Queries on attributes can contain
keywords or phrases combined with Boolean operators (AND, OR, NOT). Queries to be
stored are checked, and those containing syntax errors are not inserted in the database
of Hermes (an error message is produced). Single words standing alone or between
Boolean operators are stemmed (for example, the expressions library and libraries
are equivalent).




                               Figure 2.1-4 Hermes interface



    The user can define the notification frequency (day, week, month) and format (plain text,
HTML, XML). The main disadvantage of this service is that the graphical interface is not
as friendly and simple as those of the previous applications. The messages, which are usually
not well formed, contain the main bibliographical attributes of the publications (title /
author / abstract) and a hyper-link to a detailed description. The system accepts relevance
feedback on notifications, which means that the user can evaluate the relevance of the
delivered documents so that the ranking of results is improved in later filtering. Generally
speaking, Hermes is an application with advanced features, but it is not as simple and
friendly as the services presented earlier.



2.2 DIAS (Distributed Information Alert System)
    DIAS [6, 7, 8] is a distributed alert system for digital libraries, currently under
development in project DIET by the Intelligent Systems Laboratory of the Department of
Electronic and Computer Engineering, Technical University of Crete. DIAS is currently
implemented as a part of a system called P2P-DIET, “A query and notification service
based on mobile agents for rapid implementation of peer to peer applications” [52].
    The advanced functionality that DIAS offers and the basic ideas that this project
proposes motivated us during the implementation of DLAlert. Of course, DLAlert is not yet a
distributed system, but the models and languages supported by both services
are similar. In addition, during our implementation stored queries from DIAS were used
to validate the filtering process of DLAlert.




                              Figure 2.2-1 Architecture of DIAS



   Before presenting the DIAS architecture in detail we have to mention a difference
in the terminology used. In DIAS, notifications are not only the messages sent to the
end-users but all the messages that are exchanged between peers and can potentially
contain information about new publications that must be disseminated through the
network.
   The architecture of DIAS is shown in Figure 2.2-1. Resource agents retrieve data about
new publications from the information providers and produce streams of notifications
containing this information. Users post profiles to some middle-agent(s) and receive
notifications from the network by using their personal agents. The notifications produced
by the resource agents are propagated through the P2P network and arrive at
interested subscribers (end-agents). Middle-agents forward the long-standing queries to
other middle-agents in such a way that the matching of a profile with a notification takes
place as close as possible to the origin of the incoming notification.
   The models proposed by DIAS for notifications and profiles are presented below.
We will not discuss the complete definition of these models; for more details on DIAS
see [6, 7, 8]. In the following sections we present the schema for notifications that every
model proposes and the queries provided by its grammar. The definitions of the models
are reproduced verbatim from the paper [6].

2.2.1 The WP (Word Pattern) data model
   This model assumes that textual information (of notifications) is in the form of free
text and can be queried by word patterns. A word w is a finite non-empty
sequence of letters from a given alphabet. We also assume the existence of a (finite or
infinite) set of words called the vocabulary. A text value s is a finite sequence of words
from the assumed vocabulary. Thus s(i) gives the i-th element of s and |s| its number
of words. The queries are word patterns generated according to the following grammars.

   A proximity-free word pattern is an expression generated by the grammar
           WP → w | ¬ WP | WP ∧ WP | WP ∨ WP | (WP)

   A proximity word pattern is an expression
           wp_1 ≺_{i_1} wp_2 ≺_{i_2} ... ≺_{i_{n-1}} wp_n
where wp_1, wp_2, ..., wp_n are positive proximity-free word patterns (that is, they do not
contain the negation operator ¬) and i_1, i_2, ..., i_{n-1} are intervals (representing order
and distance between words) from the set I, where
           I = { [l, u] : l, u ∈ ℕ, 0 ≤ l ≤ u } ∪ { [l, ∞) : l ∈ ℕ, 0 ≤ l }

   A word pattern is an expression generated by the grammar
           WP → PFWP | PWP | WP ∧ WP | WP ∨ WP | (WP)
where PFWP is a proximity-free word pattern and PWP is a proximity word pattern.

Query examples for the WP model


                                       artificial ∧ intelligence
matches text values that contain both words.


                      constraint ∧ ( programming ∨ e-commerce )
matches text values that contain the word constraint and at least one of the words
programming, e-commerce.


                                           search ≺[0,6] optimisation
matches text values that contain both words, with at most 6 words between them.


                 (global ∧ local) ≺[3,6] search ≺[1,1] optimisation
matches text values that satisfy all of the following three conditions:
   i.   They contain one of the words global, local and the words search,
        optimisation.
   ii.  The distance between one of the words global, local and the
        word search is at least 3 and at most 6 words.
   iii. There is exactly 1 word between the words search and optimisation.


                                             University ≺[7,∞) Crete
matches text values that contain the word University and the word Crete, with
at least 7 words between them.
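
The proximity examples above can be checked with a short Python sketch. It is a simplification made for illustration and not the DIAS implementation: every operand is restricted to a single word, the distance between two occurrences is taken to be the number of words lying between them (as the examples suggest), and the function name matches_proximity is hypothetical.

```python
import math
from typing import List, Tuple

# An interval [l, u]; u may be math.inf to model [l, ∞).
Interval = Tuple[int, float]


def matches_proximity(text_value: List[str],
                      words: List[str],
                      intervals: List[Interval]) -> bool:
    """Check w_1 ≺_{i_1} w_2 ... ≺_{i_{n-1}} w_n for single-word operands."""
    if len(intervals) != len(words) - 1:
        raise ValueError("need exactly one interval between consecutive words")

    def search(k: int, previous: int) -> bool:
        # Try every occurrence of words[k] that respects the interval
        # relative to the position chosen for words[k - 1].
        if k == len(words):
            return True
        for position, token in enumerate(text_value):
            if token != words[k]:
                continue
            if k > 0:
                low, high = intervals[k - 1]
                gap = position - previous - 1  # words strictly in between
                if gap < low or gap > high:    # also rejects wrong order
                    continue
            if search(k + 1, position):
                return True
        return False

    return search(0, -1)


text_value = "search for global query optimisation".split()

# search ≺[0,6] optimisation: 3 words lie between the two words
close_enough = matches_proximity(text_value, ["search", "optimisation"], [(0, 6)])

# search ≺[7,∞) optimisation: at least 7 words would be required
far_apart = matches_proximity(text_value, ["search", "optimisation"],
                              [(7, math.inf)])
```

The recursive search backtracks over all occurrences of each word, so a text value is accepted as soon as any combination of positions respects every interval.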



2.2.2 The AWP (Attribute based Word Pattern) data model
   The AWP data model assumes that textual information in notifications is organized in
attributes or fields with finite-length strings as values. Strings are understood as
sequences of words (text values), as formalized by the WP model presented earlier.
Attributes can be used to encode the bibliographical attributes of a publication (e.g.,
author, title, abstract of a paper and so on).
   A notification schema N is a pair (A, V) where A is a set of attributes and V is
a vocabulary. For example, a notification schema for a digital library containing three
attributes could be N = ({AUTHOR, TITLE, ABSTRACT}, ε).

   A notification is a set of attribute-value pairs. For example, a valid notification over the
previous schema is

{ (AUTHOR, "John Brown"),
  (TITLE, "Interaction of constraint programming and local search for optimisation problems"),
  (ABSTRACT, "In this paper we show that adapting constraint propagation...") }



   Queries in the AWP model reference text values inside attributes. A query is a
formula in any of the following forms.
           1. A ⊒ wp where A is an attribute and wp is a positive word pattern. This
              query can be read as “attribute A contains word pattern wp”.
           2. A = s where A is an attribute and s is a text value (sequence of
              words). This query can be read as “attribute A equals text value s”.
           3. ¬ φ where φ is a query containing only proximity-free word patterns.

           4. φ1 ∨ φ2 where φ1 and φ2 are queries.

           5. φ1 ∧ φ2 where φ1 and φ2 are queries.

Query examples for the AWP model


        AUTHOR ⊒ (John ≺ [0,6] Smith) ∨ (TITLE ⊒ programming)

matches notifications (on the bibliographical attributes of publications) that contain the
words John and Smith with at most 6 words between them in the attribute AUTHOR, or
contain the word programming in the attribute TITLE. This query matches the example
notification (TITLE contains the word programming).


              ¬ AUTHOR = "John Brown" ∧ ABSTRACT ⊒ paper
matches notifications whose AUTHOR attribute is not “John Brown” and whose ABSTRACT
attribute contains the word paper. This query does not match the example
notification (its AUTHOR attribute is “John Brown”).
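
To make the evaluation of the two example queries concrete, the following Python sketch models a notification as a dictionary and a query as a function from notifications to truth values. The helper names (contains, equals, q_not, q_and, q_or) are hypothetical and only single-word patterns are handled; it is an illustration under these assumptions, not the DIAS matching algorithm.

```python
def contains(attribute, word):
    """Query 'attribute contains word' (single-word patterns only)."""
    return lambda n: word.lower() in n.get(attribute, "").lower().split()


def equals(attribute, value):
    """Query 'attribute equals text value'."""
    return lambda n: n.get(attribute) == value


def q_not(query):
    return lambda n: not query(n)


def q_and(left, right):
    return lambda n: left(n) and right(n)


def q_or(left, right):
    return lambda n: left(n) or right(n)


# The example notification of this section.
notification = {
    "AUTHOR": "John Brown",
    "TITLE": "Interaction of constraint programming and local search "
             "for optimisation problems",
    "ABSTRACT": "In this paper we show that adapting constraint propagation...",
}

# Simplified form of the first example: the AUTHOR part is reduced to one word.
first_query = q_or(contains("AUTHOR", "Smith"), contains("TITLE", "programming"))

# Second example: the author is not "John Brown" and the abstract contains "paper".
second_query = q_and(q_not(equals("AUTHOR", "John Brown")),
                     contains("ABSTRACT", "paper"))
```

Applied to the example notification, first_query holds because of the TITLE attribute, while second_query fails because of the AUTHOR attribute, in line with the discussion above.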


2.2.3 The AWPS data model
   AWPS extends AWP with the concept of similarity between two text values. The
AWP model allows us to issue queries on attributes that contain one or more words that
satisfy a set of statements. The new functionality allows us to request notifications that
might not contain a strictly defined set of words but have text values that are similar to a
given string. Queries with similarity could be extremely useful when a user wants to
query a collection of notifications based on a concept and does not know exactly which
keywords should be included in the profile in order to produce the desired results. For
example, requests like “I am interested in papers about the use of local search
techniques for the problem of test pattern optimization” cannot easily be translated into
queries of the AWP model.
   Queries on similarity use the concept of a word weight as defined in the Vector
Space Model [9, 10]. In VSM documents (text values) are represented as vectors. If our
vocabulary consists of n distinct words then a text value s is represented as an
n− dimensional vector of the form (ω1 , ω2 , ..., ωn ) where ωi is the weight of the i − th
word (the weight assigned to a non-existent word is 0). In VSM, the weight of a word is
computed using the heuristic of assigning higher weights to words that are frequent in a
document and infrequent in the collection of documents available. Generally this
mechanism tries to distinguish words that represent the semantic content of a document
by assigning them a higher weight. Presenting the definition of this heuristic is out of the
scope of this dissertation.
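    sim ( sq , sd ) is a function that uses the weights of the words of two text values
sq , sd to produce a number in the interval [0,1] that represents the concept of similarity
between them:

    sim(s_q, s_d) = \frac{\sum_{i=1}^{n} \omega_{q_i} \cdot \omega_{d_i}}{\sqrt{\sum_{i=1}^{n} \omega_{q_i}^{2} \cdot \sum_{i=1}^{n} \omega_{d_i}^{2}}}

where \omega_{q_i} and \omega_{d_i} are the weights of the i-th word in sq and sd respectively.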
   If the similarity value of two documents is close to 1 then these documents have
similar semantic content.
   The AWPS data model provides a new type of query that utilizes this function and
selects attributes whose similarity to a given string exceeds a certain threshold.
The syntax of this query is A ∼ k s, where A is an attribute, s is a text value and k is a
number in the interval [0,1] giving the relevance threshold that the similarity between s
and a candidate text value must exceed in order to satisfy the predicate. A low similarity
threshold k might result in many irrelevant documents satisfying a query, whereas a high
similarity threshold would result in very few documents (or even none at all) satisfying it.


Query examples for the AWPS model


                   TITLE ∼ 0.6 " Object Relational Databases"

 matches documents with TITLE relevant to “Object Relational Databases “


We should mention that we cannot give a notification that will always satisfy this
predicate, because the similarity value of its attribute (TITLE) with the query string
depends on previously processed notifications. For example, the notification below is
most likely to satisfy the previous query:
            {( AUTHOR," Richard Niemec "),
             (TITLE ," Object Oriented Programming and Relational Databases "),
             ( ABSTRACT ,"...") }
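
As a rough illustration of the similarity predicate, the sketch below computes the cosine measure with raw term frequencies as word weights. This is an assumption made only to keep the example self-contained: the VSM weights used by the model depend on the whole collection, so the actual value for the example notification would differ. The function names are hypothetical.

```python
import math
from collections import Counter


def weights(text):
    """Stand-in word weights: raw term frequencies (not the VSM heuristic)."""
    return Counter(text.lower().split())


def sim(sq, sd):
    """Cosine similarity of two text values, a number in [0, 1]."""
    wq, wd = weights(sq), weights(sd)
    dot = sum(wq[word] * wd[word] for word in wq)
    norm = math.sqrt(sum(v * v for v in wq.values())
                     * sum(v * v for v in wd.values()))
    return dot / norm if norm else 0.0


def satisfies(notification, attribute, k, s):
    """Evaluate the predicate 'attribute ∼k s' on a notification (a dict)."""
    return sim(s, notification.get(attribute, "")) > k
```

With these simplified weights the two titles of the example share three words, giving a similarity of 3 / sqrt(3 · 6), roughly 0.71, which exceeds the threshold 0.6.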

Queries on similarity can be combined with Boolean operators and queries on word
patterns.
AUTHOR ⊒ (John ≺ [0,6] Smith) ∧ (TITLE ∼ 0.9 " Artificial Intelligence ")

matches notifications that contain the words John and Smith with at most 6 words
between them in the attribute AUTHOR and have a TITLE relevant to “Artificial
Intelligence”.

2.2.4 Some interesting problems
   In the previous sections we presented in detail the language of the profiles used.
DIAS also provides algorithms that efficiently solve the following problems:
   •   The satisfiability problem. As profiles and notifications propagate through the
       network, a middle agent should be able to decide whether a query can be satisfied
       by any notification at all.
   •   The matching problem. Deciding whether an incoming notification matches a
       profile.
   •   The filtering problem. Given a notification n an agent should be able to find all
       stored queries that match n .
   •   The entailment problem. Deciding whether a profile is more or less “general” than
       another. An agent should detect profiles that request the same sets of
       notifications in order to minimize profile forwarding between peers.
   In DLAlert, efficient filtering and matching are the necessary functionalities.
   We have explained the language of the profiles provided by DIAS because queries
on DLAlert are generated according to a similar grammar. For further details, the papers
[6, 7, 8] formally define the algorithms and models of DIAS.


2.3 The Z39.50 protocol
   Z39.50 [1] is the protocol used by DLAlert in order to retrieve new publications from
databases. Z39.50 is a standard which is playing an increasingly important role for
information retrieval, especially in the library world. It was established by NISO (National
Information Standards Organization) [11] and it is maintained by the Library of Congress
[12] of the USA. The first version of this protocol was issued in 1988 and today it is
supported by almost all digital libraries around the world. In this section we briefly
explain this protocol and provide a few simple examples.

    Z39.50 is an application layer network protocol (like HTTP, FTP, SMTP etc.) that
uses the TCP/IP functionality and provides advanced information retrieval services, for
organizations such as universities, libraries, union catalogue centers and museums. It
addresses connection oriented program-to-program communication. The protocol
specification includes the definition of the protocol control information, the rules for
exchanging this information via connection oriented program-to-program communication,
and the conformance requirements to be met by the implementation of this protocol.



   [Diagram: a Z39.50 client requests data via TCP/IP from the Z39.50 gateway, which
   sits in front of the DBMS and returns records.]

                                Figure 2.3-1 Z39.50 protocol
   The purpose of Z39.50 is interoperability for search and retrieval of information
between different client/server systems. The sources behind the gateway are not directly
visible to the client, which can access them only through the strictly defined rules of
Z39.50. For this purpose the gateway implements an “abstract database” as a front end to
the real DBMS. The client uses standardized access points, grouped into “attribute sets”,
and standardized queries to request information from the abstract database. The gateway
returns records in a standardized format (MARC [13], XML [14] etc.). In Z39.50
terminology the client is usually referenced as the “origin” and the server as the “target”.
   This protocol is hard to penetrate, so we give some simple examples, along
with the definitions of the relevant actions and models of the standard. The deployment
of a new Z39.50 gateway is out of the scope of this dissertation, so we focus on the
retrieval of records from existing abstract databases and the configuration of the client
used. The examples show a connection to the Library of the Technical University of
Crete made with a sample interactive command-line client provided in JZKit. JZKit [15] is an
open source API (Application Programming Interface) written in Java and provided by
Knowledge Integration, that helps us construct applications that utilize the Z39.50
functionality. Instead of displaying the raw messages exchanged by the end-systems,
we present the output of the sample Z-client for convenience.
   The functionality of Z39.50 is organized in “facilities”, which represent actions and
consist of one or more services.

2.3.1 Initialization facility
   The initialization facility is the action that establishes the connection (“Z-association”)
between the client and the server. In the Init request, the client proposes values for
initialization parameters (version of Z39.50, option flags, message sizes, other
implementation information). The option flags indicate which other facilities are enabled
during the Z-association. If the target requires authentication the origin should include a
secret id / password in the request. In the Init response, the server responds with values
for the initialization parameters; those values, which may differ from the client-proposed
values, are in effect for the Z-association. If the server responds affirmatively (Result =
‘accept’), the Z-association is established. If the client then does not wish to accept the
values in the server response, it may terminate the Z-association, via the Close service
(and may subsequently attempt to initialize again). If the server responds negatively, the
client may attempt to initialize again.
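
The negotiation described above can be sketched as follows. The code is a toy model and not part of JZKit or of the protocol encoding: the class and function names are hypothetical, and the policy of the target (grant the lower version, the common option flags and the smaller message size) is an assumption chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InitParams:
    """A reduced set of initialization parameters (hypothetical model)."""
    version: int
    options: frozenset
    max_message_size: int


def target_respond(proposed, supported):
    """Assumed target policy: never grant more than the target supports."""
    return InitParams(
        version=min(proposed.version, supported.version),
        options=proposed.options & supported.options,
        max_message_size=min(proposed.max_message_size,
                             supported.max_message_size),
    )


def origin_accepts(effective, required_options):
    """The origin keeps the Z-association only if the options it needs are in effect."""
    return set(required_options) <= effective.options


proposed = InitParams(3, frozenset({"search", "present", "scan"}), 65536)
supported = InitParams(2, frozenset({"search", "present"}), 32768)
effective = target_respond(proposed, supported)
```

In this sketch an origin that needs only Search and Present would keep the Z-association, whereas one that requires Scan would terminate it with the Close service and possibly try again.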




   [Diagram: the origin sends an Init request (version, (id/password), option flags,
   message sizes, implementation information) to the target; the target answers with an
   Init response (result, version, option flags, message sizes, implementation
   information).]

                               Figure 2.3-2 Initialization facility

                         Figure 2.3-3 JZKit sample client output
For example, as Figure 2.3-3 shows, by connecting to the gateway of the Technical
University of Crete (issuing the command “open dias.library.tuc.gr”) we get
a response that contains the implementation id “1995”, the name “Geac Advance
Z39.50 Server” and the version “2.0”. The services enabled by the target are Search, Present,
Delete Result Set, Scan, Sort, Extended Services and Named Result Sets. In this
dissertation we present only the Search and Present services because these are the
only ones needed for the retrieval of new records for the purposes of DLAlert. More
information on other Z39.50 services can be found in [1].

2.3.2 Search facility



   [Diagram: the origin sends a Search request (database names, query type, query,
   result set name, preferred record syntax) to the target; the target answers with a
   Search response (search status, result count, number of records attached, next
   result set position).]

                               Figure 2.3-4 Search facility
   The search facility is the action that sends a query on abstract records to the
gateway. The origin sends the database name, query-type, query, the result set name
and preferred record syntax. The database name is sent because the gateway can be a
front end to multiple databases and the query can refer to all or some of them. The query
type used is Type-1 (the default setting of the client) because it is supported by both the
TUC gateway and JZKit, provides the functionality we need for the purposes of
DLAlert and is the most common query type. We focus on this query type and give
a detailed definition in Section 2.3.3. The result set name is a string generated by
the client so that the results of a search can be referenced later. The preferred record
syntax is UNIMARC [16, 17], the only one supported by the TUC gateway. The target returns
the search status, the result count, the number of records attached and the next result set
position. The search status indicates whether the search completed successfully. The result
count is the number of records that satisfy the query. The response can contain
attached records (usually when there are only one or two results). The parameter next
result set position takes the value M+1, where M is the position of the result set item that
identifies the database record corresponding to the last record returned in the
response. It takes the value 1 when the response does not contain attached
records. The result records of a query are requested using the Present facility explained
in Section 2.3.4.




                          Figure 2.3-5 JZKit sample client output
    For example, we select the database named “Advance” of the Technical
University of Crete with the command “base advance” and set the preferred record
format with “format UNIMARC”. Then we send the query “@attrset bib-1 @attr
1=1035 smith” with the command “find”. This query requests bibliographical records
that contain the word smith in any attribute. The response contains the name of the
result set, “Search:0” (a string generated by JZKit), the status (“true”), which indicates
that the search completed successfully, and the number of records satisfying the query
(“177”). The number of records returned is 0, so the next result set position is 1.


2.3.3 Type-1 query syntax
   The Type-1 [18] query is also called an RPN (Reverse Polish Notation) string, although
in the textual form used here each operator precedes its two operands. An RPN string is
generated according to the following grammar. The reserved words might differ slightly
among Z39.50 API implementations, but the grammar is part of the
protocol.
                   rpn-string → @attrset default-attrset expr
                   default-attrset → bib-1

   The access points to the abstract database are called attributes and are grouped into
attribute sets. The most common attribute set used in information retrieval from digital
libraries is the bib-1 [19] attribute set. The bib-1 attribute set includes access points to
attributes of bibliographic records. Other attribute sets could reference extended
services tasks (ext-1), details of the target implementation (exp-1) or different
organization of the access points to bibliographical records (GILS, CCL). A full listing of
all registered attribute sets and generally all Z39.50 object identifiers can be found at
[20]. Almost all Z39.50 implementations support the bib-1 attribute set.


     expr → boolean | attr-plus-term
     attr-plus-term → attrdef [ single-term | quoted-string ]
     attrdef → @attr attrtype = attrval
     boolean → operator expr expr
     operator → @and | @or | @not

single-term is considered a single word and quoted-string a set of words (enclosed in
“ ”) that should be contained in the corresponding attribute of a record in order to satisfy
the statement. attrtype for the bib-1 attribute set is a value between 1 and 6 that
describes the type of the attributes used. For attrtype 1 we can reference several
attributes (the attrval value) that correspond to bibliographical records of the digital
library of TUC.




         Attribute   Description                          UNIMARC fields
         4           Title                                200,5XX,4XX except 410
         5           Series Title                         410
         7           ISBN                                 010
         8           ISSN                                 011
         12          Local number (of the record)         001
         21          Subject Heading                      60X
         31          Date of publication (year)           210d
         32          Date of acquisition (year month)     960
         54          Language code (ENG or GRE)           101
         59          Publication place                    210a
         63          Notes                                3XX
         1003        Author                               7XX
         1018        Publisher                            210
         1035        Anywhere                             almost all fields
             Table 2.3-1 Access points supported by the TUC Z39.50 Gateway


UNIMARC field numbers are the field tags of the result records returned (the record format
is explained in detail in Section 2.3.5). attr-plus-term formulas with attrtype 1 implicitly
define a “contains word or phrase” relation for the given attribute. Other types (2-6)
define attributes with advanced relations like less than, greater than, position in field etc.
The operators @and, @or, @not define Boolean relations (AND, OR, AND-NOT) between
attr-plus-term formulas.


Example queries:


@attrset bib-1 @attr 1=4 databases
returns bibliographical records that contain the word “databases” in the title.


@attrset bib-1 @not @attr 1=32 2003 @attr 1=1035 algebra
returns bibliographical records that were acquired in the year 2003 and do not contain the
word “algebra” in any field.

@attrset bib-1 @attr 1=32 2003
returns bibliographical records that contain the number 2003 in the date of acquisition
field (records acquired in the year 2003).


@attrset bib-1 @or @or @attr 1=4 science @attr 1=4 algebra @attr 1=4 mathematics
returns bibliographical records that contain at least one of the three words in the title.
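
Query strings like the ones above can be assembled programmatically. The sketch below builds them by plain string composition following the grammar of this section; the helper names and the ACCESS_POINTS dictionary are hypothetical (the attribute numbers are taken from Table 2.3-1), and no Z39.50 library is involved.

```python
# A few access points of the TUC gateway (numbers from Table 2.3-1).
ACCESS_POINTS = {"title": 4, "acquired": 32, "author": 1003, "anywhere": 1035}


def term(access_point, text):
    """Build an attr-plus-term; multi-word terms become quoted strings."""
    value = f'"{text}"' if " " in text else text
    return f"@attr 1={ACCESS_POINTS[access_point]} {value}"


def boolean(operator, left, right):
    """Build a boolean expression; the operator precedes both operands."""
    if operator not in ("and", "or", "not"):
        raise ValueError(f"unknown operator: {operator}")
    return f"@{operator} {left} {right}"


def rpn_string(expr, attrset="bib-1"):
    """Prefix the expression with the default attribute set."""
    return f"@attrset {attrset} {expr}"


three_titles = rpn_string(
    boolean("or",
            boolean("or", term("title", "science"), term("title", "algebra")),
            term("title", "mathematics"))
)
```

Here three_titles reproduces the last example query, and boolean("not", a, b) yields the AND-NOT form of the second example.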

2.3.4 Retrieval facility
   The Retrieval facility is the action where the origin requests the results of a query
from the target. It consists of the Present and Segment services. In the Present service
the client sends to the gateway the result set name referenced earlier in the Search
facility, a number defining the starting point of the records, and the number of records to
be returned. For example if a query returns 70 records and we want to retrieve the first
twenty the starting point is 1 and the number of records is 20. The target returns the
records in a standardized format (XML, MARC, etc.), a number indicating the number of
records returned and the status which indicates whether the Present service completed
successfully.



   [Diagram: the origin sends a Present request (number of records, starting point,
   result set) to the target; the target answers with a Present response (number of
   returned records, status, records).]

                                Figure 2.3-6 Present service


Sometimes the result set of a query may contain hundreds or thousands of records. The
Present response could exceed an upper limit of bytes. Thus the server splits a Present
response that is larger than this limit into segments. In some Z39.50 implementations the
origin can propose preferred message sizes, but the segmentation service in the
target decides the maximum segment size. In this case the number of
records returned from the target can be less than the number of records requested. Thus,
when constructing a client that will collect records from various sources, we should
always check the number of records returned from a Present request.
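
A client loop that follows this advice might look like the sketch below. The function present(start, count) is a hypothetical stand-in for a Present request that returns the list of records actually delivered; it is not a JZKit call.

```python
def fetch_all(present, result_count, batch_size=20):
    """Collect every record of a result set, tolerating short Present responses.

    `present(start, count)` is a hypothetical callable standing in for a
    Present request; it returns the records actually delivered.
    """
    records = []
    start = 1  # result set positions are 1-based
    while start <= result_count:
        wanted = min(batch_size, result_count - start + 1)
        batch = present(start, wanted)
        if not batch:
            raise RuntimeError(f"no records returned at position {start}")
        records.extend(batch)
        start += len(batch)  # advance by what arrived, not by what was requested
    return records
```

The important design choice is that the starting point advances by the number of records received, so a target that returns fewer records than requested does not cause records to be skipped.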




                         Figure 2.3-7 JZKit sample client output
   For example, let us retrieve the results of the query that requests bibliographical
records containing the word smith in any attribute (“@attrset bib-1 @attr
1=1035 smith”). The query in the previous example returned 177 records. We want to
retrieve the last 5 records, so we issue the command “show 173+5”. The first record
requested is the 173rd record. Figure 2.3-7 shows the last record returned
by this request. The strange characters in fields 801 – 852 are Greek characters not
displayed properly by the sample client.

2.3.5 Record format (UNIMARC)
   MARC is an acronym for Machine Readable Catalogue and is a standard for
assigning labels to each part of a catalogue record so that it can be handled by
computers. While the MARC format was primarily designed to serve the needs of
libraries, the concept has since been embraced by the wider information community as a
convenient way of storing and exchanging bibliographic data. The original MARC format
was developed at the Library of Congress in 1965-6 and since the early 1970s an
extended family of more than 20 MARC formats has grown up. Differences among
the various MARC formats meant that editing was required before records could be
exchanged. One solution to the problem of incompatibility was to create an international
MARC format (UNIMARC) [16, 17] which would accept records created in any MARC
format. So in 1977 the International Federation of Library Associations and Institutions
(IFLA) published UNIMARC: Universal MARC format, stating that "The primary purpose
of UNIMARC is to facilitate the international exchange of data in machine-readable form
between national bibliographic agencies".
   The records retrieved from the Library of the Technical University of Crete are in the UNIMARC
standard. The record structure is designed to control the representation of data by
storing it in the form of strings of characters known as fields. The fields, which are
identified by three-character numeric tags, are arranged in functional blocks. These
blocks organise the data according to its function in a traditional catalogue record.
 Tag num.  Description                        Tag num.  Description
 0XX       Identification block               5XX       Related title block
 1XX       Coded information block            6XX       Subject analysis block
 2XX       Descriptive information block      7XX       Intellectual responsibility block
 3XX       Notes block                        8XX       International use block
 4XX       Linking entry block                9XX       Reserved for local use
                          Table 2.3-2 UNIMARC functional blocks
   Within each field, data is coded into one or more subfields, e.g. 700 $a ... $b ..., etc.,
according to the kind of the information. The effect of the subfield coding is to refine
further the definition of the data for computer processing. The subfield identifiers consist
of a special character, represented by a $ in the examples, and a lower case alphabetic
character or a number 0-9. For example the field starting with the tag 210 contains
publication related data and the subfield $d contains publication date. We do not present
the whole definition of UNIMARC format because it defines thousands of tags and
subfields [17]. The main UNIMARC tags used by the TUC library and the corresponding
Z39.50 attributes are shown in Table 2.3-1 (Section 2.3.3).
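
The field and subfield structure can be illustrated with a small parser. The sketch assumes a textual display of the form "TAG $a value $d value", which is only one possible rendering and not the exchange format itself; the function names are hypothetical and the sample field in the usage note is made up.

```python
import re

# Functional blocks from Table 2.3-2, keyed by the first digit of the tag.
BLOCKS = {
    "0": "Identification", "1": "Coded information",
    "2": "Descriptive information", "3": "Notes", "4": "Linking entry",
    "5": "Related title", "6": "Subject analysis",
    "7": "Intellectual responsibility", "8": "International use",
    "9": "Reserved for local use",
}


def parse_field(line):
    """Split 'TAG $a value $d value' into (tag, {subfield code: [values]})."""
    tag, _, rest = line.strip().partition(" ")
    subfields = {}
    for code, value in re.findall(r"\$([a-z0-9])\s*([^$]*)", rest):
        # Subfields may repeat, so values are collected in lists.
        subfields.setdefault(code, []).append(value.strip())
    return tag, subfields


def block_of(tag):
    """Return the functional block a three-character tag belongs to."""
    return BLOCKS[tag[0]]
```

For a made-up line such as "210 $a Chania $d 2003", parse_field returns the tag 210 with subfields a and d, and block_of places it in the descriptive information block.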

2.3.6 Z39.50 and interoperability
   Most digital libraries around the world nowadays have a Z39.50 gateway as a front
end. The organizations, universities, museums and publishing houses that support this
protocol are too numerous to list. Accessing all those digital libraries in a common way,
which is independent of the specific implementation of each database, is the main
advantage of Z39.50. As an indication, we list some digital libraries in Greece
that support this standard and from which we have successfully retrieved records using
the client we constructed.

                           Organization                       Gateway host : port
            University of Thessaly                        library.lib.uth.gr : 210
            University of Patras                          pherusa.lis.upatras.gr : 210
            University of Cyprus                          194.42.4.129 : 210
            University of Aegean                          library.lib.aegean.gr : 210
            Technical Chamber of Greece (TEE)             artemis.tee.gr : 21210
            Panteion University                           library.panteion.gr : 210
            Ionian University                             zante.ionio.gr : 210
            Hellenic American Education Foundation        194.30.242.11 : 210
                        Table 2.3-3 Some Z39.50 Gateways in Greece

2.4 Oracle Text and filtering applications
   Oracle Text [23-27] is a tool of the Oracle RDBMS [21, 22] that enables us to build
text retrieval and filtering applications. Retrieval applications enable users to find
documents that contain one or more search terms defined in a query. Text here is a
collection of documents in plain text, HTML, or XML. A filtering application stores queries
in the database and finds those which match a given document. DLAlert and alerting
services in general are considered filtering applications. The grammars of the queries used
in text retrieval and in filtering are similar, and search terms can be simple words, phrases
or themes. Themes define concepts inside a document. Figure 2.4-1 presents
an overview of the architecture of a filtering application.


   [Diagram: incoming documents enter the filtering application, which compares them
   against the queries stored in the Oracle RDBMS; matched documents trigger an action.]

                                Figure 2.4-1 Filtering application

2.4.1 The CTXRULE index
    The filtering functionality in Oracle database was first introduced in version 9.0.1
(June 2001) with the CTXRULE index type. In filtering applications queries are stored in
a column of a table and a CTXRULE index should be constructed. This index is a
structure that holds information about the stored queries. When Oracle finds matching
queries for a given document, it requests data from the index and not directly from the
table that holds the profiles.
    Consider the simple Boolean queries in Table 2.4-1, which describe keywords
contained in the desired documents.
 Query  Syntax                  Description
   1    oracle                  matches documents that contain the word "oracle"
   2    larry or ellison        matches documents that contain either "larry" or "ellison"
   3    DBMS and oracle         matches documents that contain both "DBMS" and "oracle"
   4    market share            matches documents that contain the phrase "market share"
   5    near((USA, Asia), 5)    matches documents that contain both "USA" and "Asia"
                                within a group of 5 words
                                      Table 2.4-1 Simple queries


    The indexing process for the CTXRULE index type given a populated table holding
the queries is shown in Figure 2.4-2:

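   [Diagram: the Datastore reads the column data and passes query strings to the Lexer;
   the Lexer exchanges query strings and parse trees with the Query Parser and sends rules
   to the Indexing Engine, which writes column data to the CTXRULE index.]

                           Figure 2.4-2 CTXRULE Indexing Process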

Datastore is the object which retrieves data from the table with the queries and creates a
stream of query strings. The Lexer breaks the text into tokens according to our
language. These tokens are usually words or query operators. The parser gets the
queries from the Lexer, creates a parse tree and sends it back to the Lexer. The Lexer
normalizes the tokens (converts them to upper case, omits very frequent words like ‘a’, ‘is’ etc.),
breaks the parse tree into rules and sends these to the engine. The engine builds up an
inverted index of rules and stores it in the index tables.
   As an example, we present the index content for some simple Boolean queries on
keywords. The index is a structure that represents the parse tree generated by the query
parser, containing among others the columns TOKEN_TEXT and TOKEN_EXTRA.
QUERY_STRING           : the query string as it is stored in the base table
TOKEN_TEXT             : the first token to be found for the query to be matched
TOKEN_EXTRA            : the other tokens to be found for the query to be matched

Query 1 is a single word query. A document is a full match if it contains the word
“oracle”. In this case, matching TOKEN_TEXT alone is sufficient, so TOKEN_EXTRA is
NULL. Notice the normalization of the token to upper case:


QUERY_STRING                  TOKEN_TEXT           TOKEN_EXTRA
----------------              ----------           -----------
oracle                        ORACLE               (null)



Query 2 is an OR statement. A document is a full match if it contains the word “larry” or
the word “ellison”. This can be reduced to two single-word queries, each of which has
TOKEN_EXTRA NULL:


QUERY_STRING                  TOKEN_TEXT           TOKEN_EXTRA
----------------              ----------           -----------
larry or ellison              LARRY                (null)
                              ELLISON              (null)

Query 3 is an AND statement. A document must have both words “dbms” and “oracle” to
be a full match. The engine will choose one of these as the filing term, and place the
other in the TOKEN_EXTRA criteria:



QUERY_STRING                 TOKEN_TEXT          TOKEN_EXTRA
----------------             ----------          -----------
DBMS and oracle              DBMS                {ORACLE}

Documents that contain the word “dbms” will pull this rule up as a partial match. The
query engine will then examine the TOKEN_EXTRA criteria, see that it requires the
presence of the word “oracle”, check if the document contains that word, and judge the
rule a full match if so.


Query 4 is a phrase. All expressions that contain multiple tokens which are not
connected with operators are considered phrases. The engine will use the first word of
the phrase as the filing term, and the whole phrase as the TOKEN_EXTRA:


QUERY_STRING                 TOKEN_TEXT          TOKEN_EXTRA
----------------             ----------          -----------
market share                 MARKET              {MARKET} {SHARE}


Query 5 is a proximity statement. The engine will use the first word of the proximity
statement as the TOKEN_TEXT, and the whole expression as the TOKEN_EXTRA. The
parameter FALSE means the order of terms is not specified.


QUERY_STRING                     TOKEN_TEXT         TOKEN_EXTRA
----------------                 ----------         -----------
near((USA,Asia),5)               USA                NEAR(({USA},{ASIA}),5,FALSE)


    The filtering application uses the data stored in the index to find matching profiles.
During filtering the incoming document is considered a stream of tokens. For every token
contained both in this stream and in the TOKEN_TEXT column, the corresponding query
is considered partially matched. For every partially matched query, Oracle Text
examines whether the criteria stored in the TOKEN_EXTRA column are satisfied by the given
document. In this way only the partially matched queries are evaluated in full,
instead of issuing all queries contained in the base table against the given document.
    The first step when constructing a filtering application is to define the tables used to
store the incoming documents and the profiles.



2.4.2 Creating the tables
For example, let us define the following simple schema that consists of two tables. The
table Documents (Table 2.4-1) has two columns: article_id, the primary key of the table,
and article_text, containing the unstructured plain text of the incoming article. The table
Profiles (Table 2.4-2) holds the stored profiles of users and consists of two
columns: query_id, the primary key of the table, and query, the query itself. article_id and
query_id are integers; article_text and query are strings.




article_id   article_text
    12       Metals mining is the industrial sector responsible for the largest amount of toxic releases
             in the United States, according to a highly...
    34       Papers in the robotics literature often concern specific technical aspects of robot research
             and development. At the same time, several robot competitions have emphasized …
    97       The Eighth Annual Mobile Robot Competition and Exhibition was held as part of the
             Sixteenth National Conference on Artificial Intelligence in Orlando...
    80       The United States National Security Agency, with help from Network Associates of Santa
             Clara, Calif., has made a security-enhanced version of Linux available for download...
                                 Table 2.4-1 Table Documents



                           query_id query
                                  ……
                                311 toxic releases
                                312 US or Europe
                                313 Artificial AND Intelligence
                                314 near( ( personal, computers ) , 4)
                                315 $library
                                316 about(politics)
                                  … ….
                                 Table 2.4-2 Table Profiles
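
A minimal sketch of the DDL for this schema is given below. The column sizes, the use of
CLOB for article_text and the NOT NULL constraint are our own assumptions for
illustration; the text only requires integer keys and string columns.

```sql
-- Table of incoming documents (Table 2.4-1).
-- CLOB is an assumption; VARCHAR2 would also do for short texts.
CREATE TABLE documents (
    article_id   NUMBER(10) PRIMARY KEY,
    article_text CLOB
);

-- Table of stored user profiles (Table 2.4-2).
-- The VARCHAR2 length is an illustrative choice.
CREATE TABLE profiles (
    query_id NUMBER(10) PRIMARY KEY,
    query    VARCHAR2(2000) NOT NULL
);

-- A few of the sample profiles listed in Table 2.4-2.
INSERT INTO profiles (query_id, query) VALUES (311, 'toxic releases');
INSERT INTO profiles (query_id, query) VALUES (312, 'US or Europe');
INSERT INTO profiles (query_id, query) VALUES (313, 'Artificial AND Intelligence');
INSERT INTO profiles (query_id, query) VALUES (314, 'near((personal, computers), 4)');
COMMIT;
```

Declaring query as NOT NULL keeps empty profiles out of the base table, so they cannot
reach the indexing step.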



2.4.3 Language of the stored queries
   Let us introduce the grammar which generates the queries for plain text documents.
All operators are case-insensitive (AND is equivalent to and). All expressions that
contain multiple tokens which are not connected with operators are considered terms
(exact phrases). The available Boolean queries are shown in Table 2.4-2.



Function      Syntax            Description                                          Examples

              term1             matches documents that contain term1                 toxic releases
                                                                                     intelligence

conjunction   term1 and term2   matches documents that contain both terms            Artificial and Intelligence
              term1 & term2                                                          mobile & phone

disjunction   term1 or term2    matches documents that contain term1 or term2        US or Europe
              term1 | term2                                                          Linux | Windows

negation      term1 not term2   matches documents that contain term1 but not term2   paper not journal
              term1 ~term2                                                           software ~hardware

                                Table 2.4-2 Boolean queries syntax


 Along with the standard Boolean queries, the CTXRULE index type grammar provides
 proximity, stemming and theme functionality.



     Proximity operator NEAR (;)

        The operator NEAR matches documents based on the proximity of two or more
 query terms. The syntax of proximity queries is
                           near ( (term1, term2,..., termn ) , max_span , order )


 term 1-n: the terms in the query separated by commas. The query terms can be single
 words or phrases.
 max_span (optional – default 100): the maximum size of a clump where clump is the
 smallest group of words where all query terms occur. All clumps begin and end with a
 query term. max_span cannot be greater than 100.
 order (optional – default FALSE): indicates whether terms are to be found in the same
 order as in the query.


 Alternatively proximity can be defined according to the syntax
                                            term1 near term2
                                              term1 ; term2
 These queries are equivalent to the expression: near ( (term1, term2 ) , 100,FALSE )



Example                                               Description
near( ( personal, computers ) , 4)                    matches documents that contain both terms "personal"
                                                      and "computers" in any order within a group of 4 words
near( (monday, tuesday, wednesday), 20, TRUE)         matches documents that contain all terms "monday",
                                                      "tuesday", and "wednesday" in the order specified and
                                                      within a group of 20 words
windows near XP                                       matches documents that contain both terms "windows" and
                                                      "XP" in any order within a group of 100 words
near( (digital signal processing, VLSI), 89, FALSE)   matches documents that contain both terms "digital
                                                      signal processing" and "VLSI" within a group of 89 words

                                 Table 2.4-3 Examples of Proximity queries

         Stem operator ($)
         The stemming functionality enables us to request terms that have the same linguistic
     root as the query term. The stem operator ($) expands a query to include all terms
     having the same stem or root word as the specified term.
                     Example          Description
                     $scream          matches documents that contain at least one of the terms
                                      "scream", "screamed", "screaming"
                     $library         matches documents that contain at least one of the terms
                                      "library", "libraries", "librarian"
                     $sing            matches documents that contain at least one of the terms
                                      "sing", "sang", "sung"
                                 Table 2.4-4 Examples of Stemming queries
     The definition of the algorithm used for stemming is not available in the manuals
     provided by Oracle [23-25, 27]. The Oracle Text stemmer supports the following
     languages: English, French, Spanish, Italian, German, and Dutch.

         Theme indexing
         Oracle supplies a database of themes in English or French. Themes are tokens
     organized hierarchically and connected to each other with relations that describe their
     semantic content. Oracle Text supports the typical relations used by thesauri and is
     compliant with the ISO-2788 [28] and ANSI Z39.19 (1993) [29] standards. The relation
     definitions are reproduced from the ANSI Z39.19 specification [29].
     These relations are
         •   A Synonym (SYN) relation defines equivalence between two terms and connects
             terms with very close meaning like the ones below.
             scary SYN fear,     hiddenly SYN secrecy,    drive-up SYN automobiles, lounge SYN rest.



      •   A Broader Term (BT) relation defines a superordinate semantic class that the first
          concept belongs to. For example
          lordships BT royalty and aristocracy,       American Revolution BT military wars,
          backward motion BT withdrawal,         Bible BT sacred texts and objects.
      •   A Narrower Term (NT) relation defines a subordinate semantic class that the first
          concept includes. For example.
          Roman Catholicism NT papism,          computers NT laptops,         behaviour NT sympathy,
          organized crime NT gangsters,         cosmology NT astronomy
          The NT and BT relations are inverses of each other: X BT Y if and only if Y NT X.
      •   A Related Term (RT) relation implies semantic overlap (there is an element of
          meaning common to both terms). For example.
          oceans RT fish,      water birds RT mammals,        beds RT rest, Islam RT Middle East
  Using these binary relations Oracle constructs a tree containing all supported concepts.
  Consider the word “government” as a root node. The tree in Figure 2.4-3 contains only a
  few of the terms related to “government”.

  [Figure: thesaurus tree rooted at "government", with NT links to "public facilities",
  "politics" and "law"; further nodes are "political" (SYN), "military" (RT), and
  "politicians", "political sciences" and "military wars" (NT).]

                                  Figure 2.4-3 Oracle Text Thesaurus
  Oracle Text provides operators for theme query expansion using these relations (SYN,
  BT, NT and RT). Thesaural queries of this type add the main term and all its related
  terms to the set of strings that may be found in a matching document.
Example                          Description
SYN( politics )                  matches documents that contain the term "politics" or one of its synonyms
RT ( pharmaceutical industry )   matches documents that contain "pharmaceutical industry" or one of its
                                 related terms
BT ( gangsters )                 matches documents that contain the term "gangsters" or one of its broader terms

                                 Table 2.4-5 Theme queries examples



   Operator ABOUT
   Another operator that uses the supplied thesaurus to expand theme queries is the
ABOUT(phrase) statement. The phrase parameter can be a single word, a phrase, or
a string of words in free text format. If the phrase parameter does not match a stored
concept exactly, Oracle normalizes it and finds the stored concepts closest to the original
string. For example, before expansion “politic” is normalized to “politics” and “national” to
“nations”. If normalization fails to find a concept describing the phrase parameter, the query
is satisfied only by an exact phrase match. Otherwise Oracle Text expands the query
using synonyms and narrower terms of the concepts inside the parentheses. The terms
of the expansion can be connected to the original term directly or via an
intermediate term (expansion level > 1). The definition of the algorithm used for
normalization and query expansion is not available in the manuals provided by Oracle
[23-25, 27] and is presumably rather complex. We provide an example of the expansion of
the word “politics” instead.
   Expansion terms are separated by commas. The word “and” in expansion terms is
not considered an operator.
   Relation to term "politics"          Expansions of the query ABOUT(politics)
   narrower terms level 1               civil rights, elections and campaigns, political parties,
                                        political scandals, political sciences, politicians, politicians
                                        and activists, revolution and subversion, world politics
   narrower terms level 2               civil liberties, elections, human rights, insurgents,
                                        insurrectionary, insurrections, partisan politics,
                                        revolutionaries, revolutionists, terrorism
   narrower terms level 3               terrorist activities, terrorist incidents, terrorists
   narrower terms level 1 of            animal rights, consumer advocacy
     "political advocacy"
   narrower terms level 2 of            animal rights activists, animal rights movement, animal-rights
     "political advocacy"               activists, consumer activists, consumer advocates, consumer rights
   both "politics" and "policymakers"   policymakers
     narrower terms of "government"
                          Table 2.4-6 Expansions of term "politics"
Important Notes:
i. An ABOUT statement cannot contain the proximity operator NEAR. For example the
   query: “ NEAR( ( personal, computers ) , 4) AND ABOUT( software ) “ is not valid and
   cannot be parsed.
ii. The phrase parameter in an ABOUT statement should be in lower case.
iii. Inside thesaural statements (SYN, BT, NT, RT and ABOUT) any reserved words like
     (AND, OR, NOT, NEAR) are not considered operators but simple tokens.



   Operator precedence
   Within query expressions with two operands, the operators have the following order
of evaluation from highest precedence to lowest: NEAR, NOT, AND, OR. For example:
                        Query Expression       Order of Evaluation
                        w1 OR w2 AND w3       w1 OR ( w2 AND w3 )
                        w1 AND w2 OR w3       ( w1 AND w2 ) OR w3
                       w1 NOT w2 AND w3      ( w1 NOT w2 ) AND w3
                       w1 OR w2 NEAR w3      w1 OR ( w2 NEAR w3 )
                       w1 NOT w2 NEAR w3     w1 NOT ( w2 NEAR w3 )
                        Table 2.4-7 Operator precedence examples
Grouping characters can be used to control operator precedence. Grouping characters
are parentheses ( ) and brackets [ ].
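
As an illustration of grouping, the two hypothetical profiles below contain the same words
and operators but are parsed differently. The query_id values and the search terms are
invented for this sketch; the profiles table is the one defined in Section 2.4.2.

```sql
-- Parsed as: robot OR (linux AND security), because AND binds tighter than OR.
INSERT INTO profiles (query_id, query)
VALUES (401, 'robot OR linux AND security');

-- Parentheses force the OR to be evaluated first: (robot OR linux) AND security.
INSERT INTO profiles (query_id, query)
VALUES (402, '(robot OR linux) AND security');

COMMIT;
```

Under the precedence rules above, profile 401 is satisfied by any document that contains
"robot", whereas profile 402 always requires "security" as well.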

2.4.4 Indexing the stored queries
   Let us consider the simple relational schema defined in Section 2.4.2. Before filtering
the documents we must first index the queries in order to be able to collect matching
profiles. This is done with the following command:


   CREATE INDEX profile_index on profiles(query)
               INDEXTYPE IS ctxsys.ctxrule;


Now Oracle constructs an index according to the process described in Section 2.4.1.
   DML (Data Manipulation Language) operations to the base table refer to inserts,
updates and deletes from the base table. During filtering Oracle reads the queries stored
in the CTXRULE index. In order to ensure that filtering results are correct and consistent
with the queries, the index should be synchronized regularly with the base table after
DML actions. This is done with the following command:


       EXEC ctx_ddl.sync_index('profile_index');
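
The sketch below shows a typical sequence: a DML change to the base table, a commit, and
then synchronization of the index. The profile key 317 and its query are illustrative
values. The anonymous PL/SQL block is the long form of the SQL*Plus EXEC shorthand.

```sql
-- Add a new profile; the CTXRULE index does not reflect it yet.
INSERT INTO profiles (query_id, query)
VALUES (317, 'digital AND libraries');
COMMIT;

-- Bring the index up to date with the pending changes of the base table.
BEGIN
    ctx_ddl.sync_index('profile_index');
END;
/
```

The calling user needs the EXECUTE privilege on the ctx_ddl package for the second step.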

   Syntax errors
   Stored queries that contain operators and do not correspond to CTXRULE grammar
are considered invalid and cause index errors. For example queries like the ones in
Table 2.4-8 are considered syntax errors.



 Error Description                                    Query
 missing parenthesis                                  ( software design
 term2 missing                                        international and
 term1 missing                                        or mathematics
 about(...) and near(...) in the same field           about(management) and near((financial, planning),4)
 operator between term1 RDBMS and near(...) missing   RDBMS near((data, warehousing),6)
 about is an operator                                 journals about science
                                Table 2.4-8 Syntax error examples
    Syntax errors cause index errors (upon index creation or synchronization), and then
filtering may not produce the expected results. Indexing of null fields also produces
errors. These errors can be displayed using a simple SELECT statement on the data
dictionary view CTX_USER_INDEX_ERRORS. This view contains four columns: the
name of the index, the time of the error, the row id of the erroneous query and the error
message. This view is very helpful for program debugging and database administration but
not for automatic error detection on queries. Moreover, detecting an error after the
transaction is committed is not the optimal way to solve this problem. An application that
lets users define their profiles should include a mechanism that validates queries before
insert/update transactions (Sections 6.4-6.5). The main disadvantage of Oracle Text is that
it does not provide such functionality for the CTXRULE index.
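
A possible way to inspect these errors is sketched below. The index name profile_index
comes from Section 2.4.4; the join back to the profiles table is our own addition and
relies on the error view storing the row id of the offending row as a string.

```sql
-- List the indexing errors recorded for profile_index, newest first.
-- Index names are stored in upper case in the data dictionary.
SELECT err_index_name,
       err_timestamp,
       err_textkey,
       err_text
FROM   ctx_user_index_errors
WHERE  err_index_name = 'PROFILE_INDEX'
ORDER  BY err_timestamp DESC;

-- Show the offending stored queries next to their error messages.
-- err_textkey holds the ROWID of the base-table row as a string.
SELECT p.query_id,
       p.query,
       e.err_text
FROM   ctx_user_index_errors e
       JOIN profiles p ON p.ROWID = CHARTOROWID(e.err_textkey)
WHERE  e.err_index_name = 'PROFILE_INDEX';
```

Such queries help an administrator after the fact; they do not replace validation of a
query before it is stored.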

2.4.5 Filtering
    In order to filter the documents we use the SQL operator MATCHES. We use this
operator to find all rows in a query table that match a given document. The document
must be a plain text, HTML, or XML document. This operator requires a CTXRULE index
on our set of queries.
The syntax for this operator is


        MATCHES( column, document VARCHAR2 or CLOB) RETURN NUMBER;


column is the indexed column of the base table,
document is a PL/SQL variable of type VARCHAR2 or CLOB.
        VARCHAR2 is a string of variable length (<=4000 characters),
        CLOB is a character large object (a string type for very long texts).
PL/SQL (Procedural Language/SQL) [30, 31, 32] is a procedural extension to SQL,
used in the Oracle database to declare stored procedures.




        MATCHES returns 1 in case of matching and 0 for no match. This number
cannot be assigned to a variable because MATCHES does not support functional
invocation. This operator can only be used in a SELECT statement according to the following
syntax, using the greater than zero condition “>0”.


        select column1,column2,.. columnN from table
        where MATCHES( column, document VARCHAR2 or CLOB )>0
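
As a concrete illustration of this syntax, the statement below filters a literal piece of
text against the profiles table of Section 2.4.2. The literal is taken from article 97 of
Table 2.4-1; passing the document as a string literal is chosen here only to keep the
sketch short.

```sql
-- Find every stored profile that is satisfied by the given text.
SELECT query_id, query
FROM   profiles
WHERE  MATCHES(query,
       'held as part of the Sixteenth National Conference on Artificial Intelligence in Orlando') > 0;
```

With the sample profiles of Table 2.4-2 we would expect at least profile 313
(Artificial AND Intelligence) to be returned, since both words occur in the text.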


To find all matching profiles of the schema of Section 2.4.2 for a given document single_doc
we use the following procedure find_matching_profiles().


single_doc is a PL/SQL variable of the same type as a row of the table Documents.
matched_profile is a PL/SQL cursor record of the same type as a row of the table Profiles.


create procedure
  find_matching_profiles(single_doc documents%rowtype) as

begin

  -- cursor FOR loop over all profiles that match the given document
  for matched_profile in
      (
       select *
       from profiles
       where matches(profiles.query, single_doc.article_text) > 0
      )

  loop
       --- ACTION(S) FOR EVERY MATCHING PROFILE
       --- PRINTS THE MATCHING PROFILE'S NUMBER TO THE SCREEN
       dbms_output.put_line('Matched profile ' || matched_profile.query_id);
  end loop;
end;

The find_matching_profiles procedure collects all matching profiles (according to
the mechanism explained in Section 2.4.1) for the document single_doc using the SELECT
statement. The for…loop prints all matching profiles (their primary keys) to the
screen.
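
A short usage sketch for SQL*Plus follows. It assumes that the procedure above has been
compiled and that the Documents table contains the sample row with article_id 97 from
Table 2.4-1.

```sql
-- SQL*Plus setting so that dbms_output text is shown on the screen.
SET SERVEROUTPUT ON

DECLARE
    doc documents%ROWTYPE;   -- one row of the Documents table
BEGIN
    -- Fetch a single document and filter it against all stored profiles.
    SELECT * INTO doc FROM documents WHERE article_id = 97;
    find_matching_profiles(doc);
END;
/
```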




To find matching profiles for all documents in the table, we use the procedure find_all
which uses a for…loop that calls the previous procedure on all documents.




create procedure find_all as

begin
  for current_document in
      ( select * from documents )
  loop
       dbms_output.put_line('Article:' || current_document.article_id);
       find_matching_profiles(current_document);
  end loop;
end;

   Instead of displaying the results on the screen we could have specified other actions
to be performed on every matching profile. The most common case in a complete filtering
application is to insert the primary keys of the results into another database table for
further processing by other database modules. In any case, the important feature of the
MATCHES statement is the ability to find all matching profiles for a certain document
instead of periodically running the stored queries on the documents.
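
The variant mentioned above, storing the keys of the results instead of printing them,
could be sketched as follows. The table matched_profiles and the procedure
store_all_matches are hypothetical names introduced only for this example, and the key
columns are assumed to be numeric.

```sql
-- Hypothetical result table: one row per (document, matching profile) pair.
CREATE TABLE matched_profiles (
    article_id NUMBER(10) REFERENCES documents (article_id),
    query_id   NUMBER(10) REFERENCES profiles (query_id),
    PRIMARY KEY (article_id, query_id)
);

CREATE OR REPLACE PROCEDURE store_all_matches AS
BEGIN
    FOR current_document IN (SELECT * FROM documents)
    LOOP
        -- Same MATCHES condition as in find_matching_profiles, but the
        -- keys are stored instead of being printed.
        INSERT INTO matched_profiles (article_id, query_id)
        SELECT current_document.article_id, p.query_id
        FROM   profiles p
        WHERE  MATCHES(p.query, current_document.article_text) > 0;
    END LOOP;
    COMMIT;
END;
/
```

Running the procedure twice on the same documents would violate the primary key, so a
real application would first remove or mark the documents that are already processed.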



2.5 Conclusions
   In this chapter we presented the previous developments on this topic that inspired
and helped us implement DLAlert. We evaluated popular Alerting Services on the web in
the area of Digital Libraries. Then we discussed DIAS, a distributed alert system for
digital libraries, and explained in detail its language of profiles and its functionality. We
presented the main facilities of the Z39.50 standard, used for publication record retrieval
from digital libraries. Finally, we introduced filtering application construction with Oracle
Text. In the rest of the dissertation we focus on the design and implementation of
DLAlert starting with an overview of the system.




Chapter 3
System Overview
   In this and the following chapters we introduce the main design issues that concerned us
during the implementation of DLAlert. First we present an overview of its modules and all
related Oracle database features used. Then we give a brief manual for the interface of
DLAlert and the language supported.


3.1 DLAlert architecture

   DLAlert, as a complete alerting service, consists of several components that
cooperate with each other, in order to achieve the desired function. The main parts of
this system are (shown in Figure 3.1-1):
       The Oracle Database is where the profiles and user data are stored. New
       publication records are also inserted into the database and stored temporarily
       until the notification messages (e-mail) are produced and sent to each user. The
       RDBMS is the core of DLAlert.
       The Observer is the component responsible for collecting new publication
       records from a Z39.50 gateway (the TUC digital library), and inserting them into
       the Oracle database.
       The Filtering module classifies incoming documents according to the users’
       stored queries and finds matching profiles and related publication records.
       The Notifying module collects all matched profiles and sends a single message
       for each user, containing the bibliographical attributes of the relevant publication.
   The Observer, the Filtering and Notifying modules communicate with each other by
   writing/reading from common records of the database. These modules are scheduled
   to perform actions sequentially (see Chapter 8) so that the results from a previous
   component are processed by the following one.
       The Application Server provides the necessary software infrastructure for
       developing and deploying the middle tier of DLAlert. In addition to providing the
       necessary forms for the transactions on the web (inserts/deletes/updates of
       profiles, user credentials), the Alerting Service must check the queries to be
       stored for syntax errors and maintain user session state (every user has access
       restricted to the profiles of his account).

[Figure: the USER describes his fields of interest and subscribes to the service through
the WWW. The Web Application Server (Apache Tomcat or Oracle AS) inserts, updates and
deletes profiles and user credentials in the Oracle database. The Observer requests data on
new publications via TCP/IP from the TUC Digital Library Z39.50 Gateway (DBMS "Advance"),
or from another Z39.50 gateway and DBMS, which sends records back. The Filtering module
works on the stored queries and the new publications data. The Notifying module takes the
notifications data together with the user's e-mail and name and sends e-mail via SMTP,
which the user receives. A legend marks the components that are stored in the database and
executed inside the RDBMS environment.]


                                        Figure 3.1-1 DLAlert architecture


      The other parts of this schema are:
           The Z39.50 Gateways and DBMS of Digital Libraries as described in Section 2.3.
           The User accesses the alerting service through a web page on the Internet. The
           Notifying module sends the e-mails using an SMTP server and the user gets
           them from his mailbox (omitted from the figure).



3.2 Main Technologies used

   The Oracle database is the core of DLAlert, and most of the application's
program code is stored, compiled and executed inside the database. Below we
explain the advantages of using Oracle RDBMS as the essential component of our
system.
     Nowadays most systems that require a robust and reliable centralized mechanism
     to store and retrieve large amounts of data utilize a database. One of the main
     requirements of DLAlert is to be able to handle hundreds of thousands or millions
     of user profiles and hundreds of new publication records every day in a way that
     ensures the validity of the information stored. The necessity of using a database
     that can achieve these standards is indisputable. Any reliable RDBMS
     provides functionalities that ensure data consistency and integrity, transaction
     concurrency and scalability for handling large amounts of information. Using a
     software infrastructure like an RDBMS, the developer can consider the
     aforementioned issues solved and concentrate on the particular
     requirements of his application.
     Oracle Text (as shown in Section 2.4) is a specialized tool provided by the Oracle
     RDBMS which is able to provide document filtering techniques and profile
     matching functionalities, necessary for DLAlert. Without these mechanisms we
     would have had to construct the filtering module of our system virtually from scratch.
     Oracle Text is the key feature that persuaded us to choose Oracle RDBMS among
     other equally reliable databases.
     PL/SQL [30-32] is Oracle's procedural extension to industry-standard SQL. Its
     primary strength is that it provides a robust, server-side procedural language. PL/SQL
     code is stored and executed inside the Oracle RDBMS environment and is
     responsible for the application’s actions that involve data processing. The
     developer can construct procedures, functions and triggers using this
     language. Logically related stored procedures, functions and object types can be
     grouped into packages. The filtering and notifying modules are PL/SQL packages.
     The Oracle RDBMS provides built-in PL/SQL packages [33] that are useful for application
     developers. In DLAlert two of them proved to be very helpful.
o   UTL_SMTP is a package that provides functionality for sending e-mail
    messages over SMTP from the database. We used UTL_SMTP for sending
    the notifications on new publications to users.
o   DBMS_JOB can be used to schedule the execution of Oracle packaged
    procedures at a specific time and on a recurring basis. The collection of new
    publication records, document filtering and notification of users are operations
    of DLAlert that must be scheduled to run at certain intervals of time (for
    example once a day). We also have to ensure that these actions are performed one
    after the other, so that the results of each component are processed by the
    following one. The main advantage of using DBMS_JOB instead of an
    external scheduling application is security (no one can access the
    database without permission). Another important feature is that
    DBMS_JOB can be programmed to detect the unsuccessful completion of a
    module and re-schedule it, preserving the proper order of actions.
Java [34] is one of the most popular languages among application developers today.
One of the main advantages of the Oracle RDBMS [35-39] is the ability to integrate
Java classes inside the database and deploy them on a supplied Enterprise-level
Java 2 platform (Oracle Java Server). The Java code, which can be loaded as
either source or compiled bytecode, is executed inside the database using the
internal Oracle Java Virtual Machine. Accesses to the database use the
server-side internal JDBC driver [35, 38], which means that the application can
handle large amounts of data faster than any external process. The static methods
of any Java class can be declared as stored procedures and can even be called from
PL/SQL code. Although Java is not as fast as PL/SQL for intensive data
access, it is preferred when the application under development requires a complex
computation mechanism or functionality not available in PL/SQL. DLAlert regularly
collects new publication records from Digital Libraries using the Z39.50 protocol
(see Section 2.3). In our case there was no existing Z39.50 implementation in
PL/SQL, and parsing the new publication records to be inserted requires considerable
computation. Implementing the Observer module of DLAlert with Java
stored procedures was considered the best solution from the performance
perspective as well as the least complex implementation. Alternatively, we could
have used an external client for handling Z39.50 requests, based on previously
developed functionality written in a different programming language (C, C++, Perl).
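
The scheduling requirement described above for DBMS_JOB (stages run one after the other, and a failed stage is run again before the next one starts) can be illustrated with a minimal sketch. It is written in Python rather than PL/SQL and does not use DBMS_JOB; the function run_pipeline, the stage names and the retry limit are hypothetical and serve only to show the ordering logic.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], None]]

def run_pipeline(stages: List[Stage], max_attempts: int = 3) -> List[str]:
    """Run stages strictly in order; retry a failing stage before moving on."""
    log: List[str] = []
    for name, stage in stages:
        for attempt in range(1, max_attempts + 1):
            try:
                stage()
            except Exception as exc:
                log.append(f"{name}: failed on attempt {attempt} ({exc})")
            else:
                log.append(f"{name}: ok on attempt {attempt}")
                break
        else:
            # Every attempt failed: stop, so later stages never run on stale input.
            log.append(f"{name}: giving up")
            break
    return log

# Hypothetical stages standing in for the collecting, filtering and notifying steps.
attempts = {"filter": 0}

def collect() -> None:
    pass

def filter_documents() -> None:
    attempts["filter"] += 1
    if attempts["filter"] == 1:
        raise RuntimeError("temporary failure")

def notify() -> None:
    pass

log = run_pipeline([("collect", collect), ("filter", filter_documents), ("notify", notify)])
print("\n".join(log))
```

In this sketch the notifying stage starts only after the filtering stage has succeeded on its second attempt, which is the behaviour the text expects from the scheduler.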

   The other important component of DLAlert is the Application Server. The Application
Server provides a J2EE (Java 2 Enterprise Edition) platform for developing and
deploying web applications [40-42]. Multi-tiered applications (shown in Figure 3.2-1)
dominate today’s Internet. The client tier contains logic related to presentation and
requests for services. The application server contains business logic that reads and
writes data. In our case the Application Server provides the necessary dynamic web
pages and business logic for data transactions to the Client (profiles, user credentials),
restricts each user's access to his own account, checks queries for syntax errors and
produces the appropriate error messages in case of invalid profiles. The internal
architecture of the middle tier is explained in detail in Section 6.2.

        Client                      Application Server                    RDBMS
     (Presentation)  <-- Service -->  (Business Logic)  <--- Data --->  (Resources)
                         Requests

                            Figure 3.2-1 Three-tiered architecture
   DLAlert’s middle layer J2EE components are currently deployed using Apache Tomcat.
Tomcat is not a full Enterprise-level infrastructure, but the J2EE classes that DLAlert
needs can be loaded into it. The application was also deployed on Oracle9i AS,
which provides advanced features for scalability and better performance for the
application. In the end Tomcat was chosen because it consumes far fewer hardware
resources (RAM, CPU) than Oracle9i AS on the already loaded server available at the
time. We discuss further the architecture and implementation of the middle application
tier of DLAlert in Chapter 6.



3.3 Graphical User Interface overview
   After directing our browser (Internet Explorer, Netscape or Mozilla) to the URL of
DLAlert ( http://intelligence.tuc.gr/alert/login.html ), the login screen of the web GUI
appears.

                            Figure 3.3-1 DLAlert: Login screen
   A registered user types his e-mail address and password and logs in to his account. An
unregistered user clicks ‘Help’ for more information about DLAlert, or clicks ‘New
User’ and enters his credentials.




                          Figure 3.3-2 DLAlert: Welcome screen

   In the ‘New User Registration’ screen (Figure 3.3-3) the user enters his e-mail address, his
first/last name and the preferred frequency of notifications. A similar screen appears
when a registered user wants to update his credentials.




                          Figure 3.3-3 DLAlert: Registration form
   After registering with the service, the user is expected to enter his profiles. The
profiles define the user’s fields of interest. In Figure 3.3-4 we see the empty account of a
presumably new user.




                           Figure 3.3-4 DLAlert: Empty account

   If the user's account already contains profiles, a table with all of his
previously stored profiles appears. For convenience only the profile description
is displayed. The user can edit or delete his long-standing queries or enter new ones. In
Figure 3.3-5 we see an account containing two profiles.




                       Figure 3.3-5 DLAlert: Profiles of the account
   Suppose we enter a new profile. A form appears that allows us to type the profile name and
queries on all bibliographical attributes (see Figure 3.3-6). Text queries, written
in a language similar to the CTXRULE grammar (defined in the next section), are allowed on the
sections Title, Series, Author, Publisher, Subject and Notes. The publication year, ISBN and ISSN
queries can contain only a single number rather than keywords to be found in the
incoming publication. Operators and terms in queries are case insensitive. For a profile
to be matched, all non-empty stored queries must be satisfied for a given
document. The profile description doesn’t affect the publications that match a profile. It
is simply a phrase that reminds the user of the purpose of the given profile.
   In Figure 3.3-6 we can see a sample profile named “books about Programming”.
The profile contains two queries. The text query “programming or Java or C” requests
documents that contain at least one of the words “programming”, “Java” or “C” in the
Subject bibliographical attribute. The query “2002” on the publication year field restricts
the returned publications of the profile to those published in 2002. So the desired
document must contain at least one of the words entered in the text query and be
published in the year 2002.
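
The matching rule for the sample profile can be sketched as follows. This is an illustration in Python, not the code of DLAlert: the function text_query_matches is a hypothetical stand-in that understands only queries of the form “a or b or c”, and the dictionary keys are chosen for the example.

```python
def text_query_matches(query: str, text: str) -> bool:
    """Hypothetical stand-in for the text matcher: supports only 'a or b or c'."""
    words = set(text.lower().split())
    terms = [term.strip() for term in query.lower().split(" or ")]
    return any(term in words for term in terms)

def profile_matches(profile: dict, publication: dict) -> bool:
    """A profile matches only if every non-empty query is satisfied."""
    for attribute, query in profile.items():
        if not query:                      # empty queries impose no condition
            continue
        if attribute == "pub_year":        # single number, compared for equality
            if int(query) != publication["pub_year"]:
                return False
        elif not text_query_matches(query, publication.get(attribute, "")):
            return False
    return True

# The sample profile of the text: one Subject query and one publication year.
profile = {"subject": "programming or Java or C", "pub_year": "2002", "title": ""}
book_2002 = {"subject": "Java language reference", "pub_year": 2002, "title": "Thinking in Java"}
book_1999 = {"subject": "Java language reference", "pub_year": 1999, "title": "Thinking in Java"}

print(profile_matches(profile, book_2002), profile_matches(profile, book_1999))
```

The second publication satisfies the text query but not the publication year, so the profile as a whole is not matched.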

                            Figure 3.3-6 DLAlert: Sample profile


   In case of syntax error(s) in the queries, the system does not update or insert the profile
but shows an error message and displays ‘Syntax error’ to the right of the invalid
query. The system stores the profile only after the user has corrected it.
   In Figure 3.3-7 we can see an invalid profile. The user has probably attempted to
update a profile of his account. The user has entered “my area of interest” as the profile
name, “1995” as the desired publication year, “environmental” or “socialist” as requested
keywords in the Title and the phrase “Planning research” for the Series bibliographical
attribute. The query “Alexakis or” in the author field is invalid, as “or” is the
disjunction operator and its second term is missing. An error message appears at the
top of the page and the string “Syntax Error” marks the invalid text query or queries.
The user must replace the invalid query with a valid one, or omit it from the profile, in
order to be able to store the profile in his account. The profile is not stored unless all
queries defined by the user are valid.

Figure 3.3-7 DLAlert: Wrong profile




Figure 3.3-8 DLAlert: Another profile

   Suppose we have already entered our fields of interest and we log out or close our
browser. Every user regularly receives e-mail messages. These messages contain all
bibliographical attributes of the matching incoming publications. Figure 3.3-9 presents a
sample notification for the profile of Figure 3.3-8 (publications that contain at least one of
the words “environmental”, “socialist” in the Title). As we can see, terms in queries are
case insensitive.




                       Figure 3.3-9 DLAlert: Sample e-mail message

3.4 The language of the text queries
   The language of the text queries (referencing the Title, Series, Author, Publisher,
Subject, Notes bibliographical attributes) is a subset of the CTXRULE grammar (see
Section 2.4.3). The available query types provided by the interface of DLAlert are shown
in Table 3.4-1. Terms are words or phrases (series of tokens containing no
operators). The queries are not case sensitive.


   Syntax                              Description                                    Examples

   term1                               documents that contain term1.                  TITLE: mathematical programming

   term1 and term2                     documents that contain both terms.             TITLE: agents and (artificial intelligence)

   term1 or term2                      documents that contain term1 or term2.         AUTHOR: ( Bradley Brown ) or Beck

   term1 not term2                     documents that contain term1 but not term2.    NOTES: science not electronics

   near((term1,term2,...,termn),max)   documents that contain all terms within a      TITLE: near( ( personal, computers ) , 4)
                                       set of words with size max. Order of terms
                                       is not specified.
                                       Limitation: max cannot be greater than 99.

   about(term)                         documents that contain concepts that are       SUBJECT: about( engineering )
                                       related to your query word or phrase.
                                       Limitation: about(...) and near(...) cannot
                                       both occur in the same field.

   $term                               documents that contain words with the same     SUBJECT: $library
                                       linguistic root as term.

                                     Table 3.4-1 DLAlert language


   Given the available query types shown in Table 3.4-1 and the grouping characters “( )”,
we can define complex queries like those in Table 3.4-2. As we can see, the query
language of DLAlert is quite expressive.

   Category                                 Example

   complex Boolean statements               TITLE: ( mine or disasters ) not industry
   proximity and Boolean operators          TITLE: near( (mine, disasters) , 5) and industry
   concept queries and Boolean operators    SUBJECT: about( engineering ) or software not databases
   stemmed words and Boolean operators      NOTES: digital and $library
   stemmed words and proximity              NOTES: software not near( (advanced, $electronics) , 3)

                                  Table 3.4-2 Complex queries


   Queries like those shown in Table 3.4-3 are invalid and produce a syntax error when
the user tries to enter one of them.

   Error type                                      Example

   missing parenthesis                             TITLE: ( software design
   term2 missing                                   SUBJECT: international and
   term1 missing                                   SUBJECT: not mathematics
   about(...) and near(...) in the same field      TITLE: about(management) and near((financial,planning),4)
   max > 99                                        SUBJECT: near((financial,planning),142)
   operator between term1 and near(...) missing    NOTES: RDBMS near((data,warehousing),6)
   about is an operator                            TITLE: journals about science
   missing stemmed word                            TITLE: digital and $

                                   Table 3.4-3 Syntax errors
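
The first three error types of Table 3.4-3 can be detected with a small recursive-descent check. The sketch below is a simplified illustration in Python and not the validator of DLAlert: it covers only the Boolean operators and the grouping characters, it treats near(...), about(...) and $term as ordinary words, and the function names tokenize and is_valid are our own.

```python
import re

OPERATORS = {"and", "or", "not"}

def tokenize(query: str) -> list:
    # Parentheses become separate tokens; everything else splits on whitespace.
    return re.findall(r"[()]|[^\s()]+", query.lower())

def is_valid(query: str) -> bool:
    """Accept 'operand (operator operand)*' with nested parentheses."""
    tokens = tokenize(query)
    pos = 0

    def parse_expr() -> bool:
        nonlocal pos
        if not parse_operand():                # term1 missing
            return False
        while pos < len(tokens) and tokens[pos] in OPERATORS:
            pos += 1                           # consume the binary operator
            if not parse_operand():            # term2 missing
                return False
        return True

    def parse_operand() -> bool:
        nonlocal pos
        if pos >= len(tokens):
            return False
        if tokens[pos] == "(":
            pos += 1
            if not parse_expr():
                return False
            if pos >= len(tokens) or tokens[pos] != ")":
                return False                   # missing parenthesis
            pos += 1
            return True
        # A term is one or more plain words (a phrase).
        start = pos
        while (pos < len(tokens) and tokens[pos] not in OPERATORS
               and tokens[pos] not in ("(", ")")):
            pos += 1
        return pos > start

    return parse_expr() and pos == len(tokens)

for query in ["agents and (artificial intelligence)", "( software design",
              "international and", "not mathematics"]:
    print(query, "->", "ok" if is_valid(query) else "Syntax error")
```

Only the first of the four sample queries is accepted; the other three correspond to the “missing parenthesis”, “term2 missing” and “term1 missing” rows of the table.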



3.5 Conclusions
   In this chapter we discussed the main functionality and the architecture of DLAlert.
We explained the technologies used in the database as well as in the middle layer of our
application. We also introduced the interface of DLAlert and the query language it supports. In
the following chapters we present in detail the implementation of every component of the
system.

Chapter 4
Database Schema
   In this chapter we present in detail the design of the database schema of DLAlert. In
addition to the basic requirements analysis and the Entity-Relationship diagram, we also deal
with issues like primary-foreign key consistency and the creation and maintenance of the
necessary CTXRULE indices (see also Section 2.4).



4.1 Requirements analysis


          Publication                  N           M               Profile
   (bibliographical attributes   ------ matches ------   (set of queries on document's
     of acquired document)                                bibliographical attributes)
                                                                      N
                                                                      |
                                                                   defines
                                                                      |
                                                                      1
                                                                    user

                    Figure 4.1-1 A high level Entity-Relationship diagram


   The data requirements of DLAlert involve only three entities (as shown
in Figure 4.1-1).
     The entity user contains all necessary user credentials and information. These are
     the user’s e-mail, first/last name and the password for his login into DLAlert.
     Requiring a password at login ensures that each user can access only
     his own account. Another helpful feature is the desired frequency of notifications.
     DLAlert collects all matching documents of a certain user and sends him a single
     message at the end of the desired interval (‘DAY’, ‘WEEK’, ‘MONTH’)
     containing all relevant bibliographical attributes. Sending a new e-mail message
     every time a new matching publication arrives would be annoying.

     The entity publication contains all necessary bibliographical attributes of the
     documents that can be requested. The attributes cover all characteristics of the
     document that are included in the query form of the TUC Library’s catalog search (see
     Figure 4.1-2 below). These bibliographical attributes are: Title, Series, Author,
     Publisher, Subject, Notes, Publication year, ISBN (International Standard Book
     Number) and ISSN (International Standard Serial Number). The Language attribute
     is omitted because English is the only language supported by DLAlert at the time.
     The Observer parses the UNIMARC records retrieved from the Z39.50 Gateway in
     order to identify these fields (see Chapter 7) and populate the corresponding
     table. The same basic attributes can also be identified in records from different
     sources.




    Figure 4.1-2 Technical University Digital Library catalog search engine
     The entity profile is a set of conditions that the requested document must satisfy.
     These conditions are queries in the CTXRULE language (see Section 2.4) and
     reference certain attributes of the publications. So we have queries for each of the
     bibliographical attributes Title, Series, Author, Publisher, Subject, Notes,
     Publication year, ISBN and ISSN. For a profile to be matched, all non-null queries
     must be satisfied for a given document. A user can only access and define the
     profiles of his own account. The user can also specify a short string describing each
     profile. This description does not affect the set of matching documents but allows a
     single user to define many profiles in a convenient way.


4.2 Relational schema

   notification          publication        profile               user
   ------------          -----------        -------               ----
   public_id             public_id          email                 email
   email                 title              profile_desc          first_name
   profile_desc          series             title_query           last_name
                         author             series_query          password
                         publisher          author_query          frequency
                         subject            publisher_query
                         notes              subject_query
                         pub_year           notes_query
                         bookn              pub_year_query
                         serialn            bookn_query
                                            serialn_query

   Foreign keys:
     notification (public_id)             references publication (public_id)
     notification (email, profile_desc)   references profile (email, profile_desc)
     profile (email)                      references user (email)

                             Figure 4.2-1 Relational schema
   Considering the E-R diagram and the requirements analysis of Section 4.1, we
construct the relational schema in Figure 4.2-1. The email address is unique for every
user (primary key) and is also stored as a foreign key in the table of profiles in order to
restrict each user's access to his own account. The primary key of the profiles table consists of two
columns (email, profile_desc), so that the profile description and the user’s
account email together uniquely identify a profile. The primary key of the publications table
(public_id) is the local number of the UNIMARC record (field 001) in TUC’s Library
(Sections 2.3.4 - 2.3.6).

   In order to represent the relation “publication matches profile” (M to N) we declare a
table (notifications) holding the primary keys of the related tables. The
primary key of table profiles (email, profile_desc) is a foreign key in the
notifications table. The filtering module populates this table with the primary keys of the
satisfied profiles and matching documents. The notification module uses these records
to send messages with the matching documents’ bibliographical attributes.
   All attributes are declared as strings of variable length (data type varchar2) except
those named public_id, pub_year and pub_year_query, which are integers (data
type number). The value of attribute frequency (users) can be ‘DAY’, ‘WEEK’ or
‘MONTH’ (CHECK constraint).
   The tables of profiles and users are accessed through the web Graphical User
Interface. The tables holding publication and notification records are populated and
accessed only by stored procedures of the Oracle RDBMS.
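
The key structure described above can be tried out with the following sketch. It is an approximation in Python with SQLite, not the Oracle DDL of DLAlert: most columns are omitted, SQLite types replace varchar2 and number, the frequency values follow Section 4.1, and the helper names open_db and try_insert are our own.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    email      TEXT PRIMARY KEY,
    first_name TEXT,
    last_name  TEXT,
    password   TEXT,
    frequency  TEXT CHECK (frequency IN ('DAY', 'WEEK', 'MONTH'))
);
CREATE TABLE profiles (
    email          TEXT REFERENCES users(email),
    profile_desc   TEXT,
    title_query    TEXT,
    pub_year_query INTEGER,
    PRIMARY KEY (email, profile_desc)
);
CREATE TABLE publications (
    public_id INTEGER PRIMARY KEY,
    title     TEXT,
    pub_year  INTEGER
);
CREATE TABLE notifications (
    public_id    INTEGER REFERENCES publications(public_id),
    email        TEXT,
    profile_desc TEXT,
    FOREIGN KEY (email, profile_desc) REFERENCES profiles(email, profile_desc)
);
"""

def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only on request
    conn.executescript(SCHEMA)
    return conn

def try_insert(conn: sqlite3.Connection, sql: str, params: tuple) -> bool:
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

conn = open_db()
user_sql = "INSERT INTO users VALUES (?, ?, ?, ?, ?)"
profile_sql = "INSERT INTO profiles (email, profile_desc) VALUES (?, ?)"
print(try_insert(conn, user_sql, ("a@tuc.gr", "A", "B", "pw", "DAY")))
print(try_insert(conn, user_sql, ("b@tuc.gr", "C", "D", "pw", "HOUR")))
print(try_insert(conn, profile_sql, ("nobody@tuc.gr", "orphan profile")))
```

The second insert is rejected by the CHECK constraint on frequency and the third by the foreign key on email, since no such user exists.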


4.3 Key consistency and atomic transactions
   Suppose a user wants to update his email, and his account contains profiles and/or
notifications on publications not yet sent as messages. In other words, the primary key
value of a record in the users table must be changed, and this value is also stored as a
foreign key in other tables (profiles, notifications). Whenever a
transaction involves more than one insert, update or delete, and either all
sub-operations must be committed successfully or none of them, we consider this
transaction atomic. The atomicity of such actions is achieved by using PL/SQL
stored procedures. In our case we declared a package (named “transactions”) that
handles updates of the email or profile_desc attributes (both part of a foreign key).
It also handles deletes on the profiles and users tables atomically. For example,
consider the scenario in which the user with email del_email unsubscribes from DLAlert. We
execute the transaction using this procedure.
   procedure delete_user( del_email varchar2 ) is
             pragma autonomous_transaction;
   begin
             -- delete the dependent rows first, then the user record itself
             delete notifications where email=del_email;
             delete profiles where email=del_email;
             delete users where email=del_email;
             commit;
   end delete_user;


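   When a user unsubscribes, his data must be deleted from three tables: users, profiles and
notifications. The declaration “pragma autonomous_transaction;” makes the procedure run
as an independent transaction, so its commit does not affect any pending work of the caller.
The three deletes become permanent together at the single commit; if one of them fails,
none of them is committed.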
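
The all-or-nothing behaviour of such a delete can be demonstrated with a small sketch. It uses Python with SQLite instead of PL/SQL and has no counterpart of the autonomous transaction pragma; the explicit rollback in the error handler, the simplified tables and the helper names demo_db and count are assumptions of the example.

```python
import sqlite3

def delete_user(conn: sqlite3.Connection, del_email: str) -> None:
    """Delete a user's rows from all three tables, all or nothing."""
    try:
        # Dependent rows first, then the user record itself.
        conn.execute("DELETE FROM notifications WHERE email = ?", (del_email,))
        conn.execute("DELETE FROM profiles WHERE email = ?", (del_email,))
        conn.execute("DELETE FROM users WHERE email = ?", (del_email,))
        conn.commit()          # a single commit makes all three deletes permanent
    except sqlite3.Error:
        conn.rollback()        # any failure undoes the deletes already issued
        raise

def demo_db(with_users_table: bool = True) -> sqlite3.Connection:
    """Build a tiny database; omitting the users table forces a failure later."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE notifications (public_id INTEGER, email TEXT, profile_desc TEXT)")
    conn.execute("CREATE TABLE profiles (email TEXT, profile_desc TEXT)")
    if with_users_table:
        conn.execute("CREATE TABLE users (email TEXT)")
        conn.execute("INSERT INTO users VALUES ('a@tuc.gr')")
    conn.execute("INSERT INTO profiles VALUES ('a@tuc.gr', 'books')")
    conn.execute("INSERT INTO notifications VALUES (1, 'a@tuc.gr', 'books')")
    conn.commit()
    return conn

def count(conn: sqlite3.Connection, table: str) -> int:
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

good = demo_db()
delete_user(good, "a@tuc.gr")
print(count(good, "users"), count(good, "profiles"), count(good, "notifications"))

broken = demo_db(with_users_table=False)
try:
    delete_user(broken, "a@tuc.gr")
except sqlite3.Error:
    pass
print(count(broken, "profiles"), count(broken, "notifications"))
```

In the second run the third delete fails because the users table is missing, and the rollback restores the rows that the first two deletes had removed.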
The specifications of the other procedures of this package are:


procedure update_user( old_email varchar2 ,
                       new_email varchar2 );
Called when the email value of a record changes from old_email to new_email.
Performs the update of the e-mail value in the tables (profiles, notifications,
users) as an atomic transaction.


procedure update_profile( user_email varchar2,
                          old_desc varchar2,
                          new_desc varchar2);
Called when the profile description of a record changes from old_desc to
new_desc. Performs the update of the value in the tables (profiles,
notifications) as an atomic transaction.


procedure delete_profile( user_email varchar2,
                          del_desc varchar2);
Called when the profile with primary key value ( user_email, del_desc ) is deleted.
Performs the deletion of the profile from the tables (profiles, notifications) as an
atomic transaction.


4.4 Indexing of the stored queries
   The next step is to declare the columns that contain stored queries and create the
necessary CTXRULE indices. As shown in Section 2.4.4, we must first index the queries
in order to be able to collect matching profiles. The bibliographical attributes (Title,
Series, Author, Publisher, Subject, Notes) contain plain text, and the corresponding
stored queries are written in the CTXRULE grammar and reference these
sections. The queries on ISBN and ISSN are ten- and eight-character strings respectively
(digits 0-9 or X). Many records of TUC’s Digital Library have multiple ISBN and/or ISSN
numbers, and the number included in the query must be equal to one of them. Our solution to
this problem was to index the queries on ISBN and ISSN numbers as well: the MATCHES operator
is able to find matching publications with multiple book/serial identifiers. The publication
year attribute always has a single value, and text queries on this attribute, using the
CTXRULE grammar, are usually meaningless. Instead of using the MATCHES predicate to
evaluate queries on publication year, we use the standard SQL equality operator (‘=’)
in a WHERE clause. So we have to create eight CTXRULE indices, one on each column
containing stored queries except pub_year_query.
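
The difference between the multi-valued identifiers and the single-valued publication year can be sketched as follows. The sketch is in Python and does not use Oracle Text; the function names are hypothetical, and stripping hyphens before the comparison is an assumption of the example rather than documented behaviour of DLAlert.

```python
import re

def normalize_id(value: str) -> str:
    """Keep only digits and the check character X (assumed normalization)."""
    return re.sub(r"[^0-9X]", "", value.upper())

def identifier_matches(query: str, identifiers: list) -> bool:
    """True if the queried ISBN/ISSN equals any identifier of the record."""
    wanted = normalize_id(query)
    return any(wanted == normalize_id(identifier) for identifier in identifiers)

def year_matches(year_query, pub_year: int) -> bool:
    """The publication year is single-valued, so plain equality is enough."""
    return year_query is None or int(year_query) == pub_year

# A record carrying two ISBN numbers, as many records of the library do.
record_isbns = ["0131103628", "0131103709"]
print(identifier_matches("0-13-110370-9", record_isbns))
print(identifier_matches("020161622X", record_isbns))
print(year_matches("2002", 2002))
```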
    As we showed in Section 2.4.4, indexing null queries produces index errors. Given
the structure of DLAlert and the queries we want to provide, we do not expect the user to
describe every attribute of the requested document, but at least one. So we have to
choose a non-reserved symbol to be inserted in the fields for which the user does not
specify a condition. We chose the less-than symbol (“<”) because it is
not associated with any functionality or operator. This character is also skipped by the
indexing engine, so the number of columns containing this symbol does not affect the
size of the index. We construct a simple trigger that is executed before every insert or
update on the table profiles and inserts the symbol “<” instead of null. This trigger has
an if…then rule like the following for every stored query.


    if ( :new.title_query is NULL ) then
        :new.title_query := '<';
    end if;


Queries that contain only the symbol “<” are displayed as empty strings on the Graphical
User Interface (see Sections 6.4 - 6.5), so that this mechanism is transparent to the user.
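
The round trip between empty queries and the placeholder can be sketched as follows. This is an illustration in Python, not the trigger or the GUI code of DLAlert; the function names before_store and for_display are hypothetical, and empty strings are treated like null, as Oracle does for varchar2 values.

```python
PLACEHOLDER = "<"

# The eight indexed query columns; pub_year_query is not indexed with CTXRULE.
QUERY_COLUMNS = ["title_query", "series_query", "author_query", "publisher_query",
                 "subject_query", "notes_query", "bookn_query", "serialn_query"]

def before_store(row: dict) -> dict:
    """Mimic the trigger: replace missing queries with the placeholder."""
    stored = dict(row)
    for column in QUERY_COLUMNS:
        if not stored.get(column):
            stored[column] = PLACEHOLDER
    return stored

def for_display(stored: dict) -> dict:
    """Mimic the interface: show placeholder-only queries as empty strings."""
    return {key: ("" if value == PLACEHOLDER else value) for key, value in stored.items()}

stored = before_store({"title_query": "java or databases", "author_query": None})
shown = for_display(stored)
print(stored["author_query"], repr(shown["author_query"]), shown["title_query"])
```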
    In order to ensure that filtering results are correct and consistent with the queries, the
indices must be synchronized with the base table after DML actions and before filtering.
For this purpose we declared the package “indexing” with a procedure (“sync”) that
synchronizes all CTXRULE indices sequentially before any execution of the filtering
module. In DLAlert we assume that index synchronization and filtering are executed at
least once a day, at the end of the day.

4.5 Conclusions
   In this chapter we analyzed the requirements that the database used by DLAlert
must meet. We explained the database schema in detail and proceeded to particular
design issues (atomic transactions and index creation). In the next chapter we present
the main PL/SQL packages of the system: the filtering and notifying modules.

Chapter 5
PL/SQL packages
   In this chapter we present the filtering and notification modules of DLAlert,
implemented as PL/SQL packages. Packages are constructs that allow us to logically
group procedures, functions, object types and items into a single database object.
PL/SQL packages offer functionality similar to classes in object-oriented languages (Java
or C++), except that every package is instantiated only once during a database session.


5.1 Filtering module
   Suppose we have the table publications in the database schema of Figure 4.2-1
populated with bibliographic attributes of documents and the profiles table with queries
of the CTXRULE grammar. In order to be able to produce the necessary messages, we
must first find the matched profiles for every document. A profile is matched if all set
conditions on the corresponding attributes of the document are satisfied.

5.1.1 The algorithm
   As we have shown in Section 2.4.4, we use the following algorithm to find all matched
profiles for a single document. The record current_publication is a parameter of the
procedure find_matched_profiles.


procedure find_matched_profiles
     (current_publication publications%rowtype) is
begin
     for matched_profile in
          (
          SELECT <needed columns>
          FROM   <table holding the profiles>
          …
          MATCHES condition(s)
          )
     loop
          ACTION EXECUTED FOR EVERY matched_profile
     end loop;
end;

First we have to construct the appropriate SELECT statement that collects all matching
profiles. If the variable holding the given publication is named current_publication
and we collect all matched profiles according to the title attribute and the corresponding
query, we have
       select email,profile_desc from profiles
       where matches(title_query, current_publication.title)>0


Suppose now we want to find profiles that satisfy not only the query on the attribute title
but the whole set of conditions in a conjunction. The most obvious solution to our
problem would be the following selection.


       select email,profile_desc from profiles
       where (matches(title_query, current_publication.title) >0 )
       and (matches(author_query, current_publication.author) >0 )
       …
       …
       and (matches(serialn_query, current_publication.serialn) >0)




This statement is syntactically correct but produces the runtime error below, due to a
limitation of the MATCHES operator.


ORA-20000: Oracle Text error:
DRG-50610: internal error:
MATCHES does not support Functional Invocation


“MATCHES does not support Functional Invocation” means that the results of multiple
MATCHES predicates in a conjunction cannot be combined using the Boolean SQL
operator ‘and’, because the value produced by MATCHES cannot be passed on for
further processing by the Oracle SQL processor. This error occurs whenever
a MATCHES clause that references a cursor is used in conjunction or disjunction with
any other condition (MATCHES or standard SQL).

There are two possible solutions to our problem. The first one is to treat the results of
SELECT clauses as sets and take the intersection of the intermediate results of the
partially satisfied profiles. This can be done as follows:


(select email,profile_desc from profiles
 where matches(title_query, current_publication.title) >0 )
intersect
(select email,profile_desc from profiles
 where matches(author_query, current_publication.author) >0 )
     …
     …
intersect
(select email,profile_desc from profiles
 where matches(serialn_query, current_publication.serialn) >0 )

Another solution is using the intermediate results and issuing over them a SELECT
clause that ensures their primary keys’ equality. This can be done as follows:


select Title_Results.email,Title_Results.profile_desc from

    (       select email,profile_desc from profiles
            where matches(title_query,current_publication.title)>0
    )       Title_Results ,

    (       select email,profile_desc from profiles
            where matches(series_query,current_publication.series)>0
    )       Series_Results ,
    …
    …
    …
    (       select email,profile_desc from profiles
            where matches(serialn_query,current_publication.serialn)>0
    )       Serialn_Results

where
               Title_Results.email=Series_Results.email
        and    Title_Results.profile_desc=Series_Results.profile_desc
        …
        …
        …
        and    Title_Results.email=Serialn_Results.email
        and    Title_Results.profile_desc=Serialn_Results.profile_desc

   Both SQL statements showed the same performance and are
processed by the Oracle SQL query processor in a similar way.
   In Section 4.4 we explained that null fields produce index errors so we choose to use
the symbol ‘<’ instead of empty query cells. The symbol ‘<’ is skipped by the Lexer
during the indexing process so it does not appear in the indices. So the intermediate
SELECT clause of the profiles, that match a single attribute in the previous SQL
statements, taking in account the cells described as null queries is ( for example the
title_query with the title attribute) are implemented as follows.

  (select email,profile_desc from profiles
   where matches(title_query, current_publication.title)>0)
    union
  (select email,profile_desc from profiles
   where title_query = '<')
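
The set logic behind these statements can be sketched outside the database. The following Java fragment is only an illustration under stated assumptions: the class ProfileSetFilter, its method names and the "email|profile_desc" key strings are hypothetical, and the real matching is performed by MATCHES inside Oracle. For each attribute the matched profiles are united with the profiles whose query cell is empty, and the per-attribute sets are then intersected.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ProfileSetFilter {

    /** Profiles satisfied on one attribute: real matches plus profiles whose cell is '<'. */
    static Set<String> satisfiedOnAttribute(Set<String> matched, Set<String> emptyQuery) {
        Set<String> result = new LinkedHashSet<>(matched);
        result.addAll(emptyQuery);              // plays the role of UNION
        return result;
    }

    /** Profiles satisfied on every attribute. */
    static Set<String> satisfiedOnAll(List<Set<String>> perAttribute) {
        if (perAttribute.isEmpty()) {
            return new LinkedHashSet<>();
        }
        Set<String> result = new LinkedHashSet<>(perAttribute.get(0));
        for (Set<String> attributeSet : perAttribute.subList(1, perAttribute.size())) {
            result.retainAll(attributeSet);     // plays the role of INTERSECT
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical keys of the form "email|profile_desc".
        Set<String> title = satisfiedOnAttribute(Set.of("a@tuc.gr|db"), Set.of("b@tuc.gr|ai"));
        Set<String> series = satisfiedOnAttribute(Set.of("b@tuc.gr|ai"), Set.of());
        System.out.println(satisfiedOnAll(List.of(title, series)));
    }
}
```

In this sketch a profile with an empty title query still qualifies as long as its other queries match, which is the purpose of the union with the ‘<’ rows.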

The action executed for every matched_profile is an insert of the primary keys of
the matched profile and relevant publication into the notifications table.
       insert into notifications(public_id,email,profile_desc)
               values( current_publication.public_id,
                           matched_profile.email,
                           matched_profile.profile_desc
                           );


We must also issue the command commit; so that the transaction is committed.
A commit; statement ends a transaction and makes permanent any changes
performed. This statement is preferably issued once, outside the for…loop,
since a commit’s response time is fairly flat regardless of the transaction size.
     To find matching profiles for all publications, another for…loop is needed to call
the previous procedure on all documents.
for all_documents in

       (
        select * from publications
        )

     loop
         find_matched_profiles(all_documents);
     end loop;

5.2 Notifying module
   Once the notifications table is populated with matching profiles and publications, all
we have to do is summarize the matched documents for every user and send him a
single e-mail. The preferred frequency of notifications does not affect the filtering
process: all new publications are filtered against all profiles, and the frequency only
controls the time at which the e-mail message is sent. For sending e-mail messages from the
database we use the supplied package UTL_SMTP. We first present the basic features of
this package and then describe the whole functionality of our module.

5.2.1 The UTL_SMTP package
   SMTP [43] stands for Simple Mail Transfer Protocol. This is the protocol that was
developed to allow people around the world to exchange electronic mail. UTL_SMTP [30,
33, 35, 40] is an email utility that provides us with the ability to email from the database.
In other words, we can dynamically generate email from the database and we can
dynamically send it to different people based on different criteria. The message
constructed can be sent as a standard ASCII text email or an enhanced HTML email.
   An SMTP connection is initiated by a call to open_connection, which returns an SMTP
connection. After a connection is established, the following procedure calls are required
to send a mail (we do not specify the complete syntax):
     helo() or ehlo() - identify the domain of the sender
     mail() - start a mail, specify the sender
     rcpt() - specify the recipient
     open_data() - start the mail body
     write_data() - write the mail header/body (multiple calls)
     close_data() - close the mail body and send the mail
     quit() - close the SMTP connection
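
The calls listed above correspond to the lines an SMTP client sends to the server. The following Java sketch only builds that sequence as strings for a single plain-text message; it does not open a network connection and is not the UTL_SMTP implementation. The class SmtpDialogue, its method and the sample recipient address are hypothetical, and dot-stuffing of body lines is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

final class SmtpDialogue {

    /** Returns the client-side SMTP lines for one plain-text message, in order. */
    static List<String> commands(String domain, String sender, String recipient,
                                 String subject, String body) {
        List<String> lines = new ArrayList<>();
        lines.add("HELO " + domain);                 // helo()
        lines.add("MAIL FROM:<" + sender + ">");     // mail()
        lines.add("RCPT TO:<" + recipient + ">");    // rcpt()
        lines.add("DATA");                           // open_data()
        lines.add("Subject: " + subject);            // write_data(): header
        lines.add("");                               // blank line separates header from body
        lines.add(body);                             // write_data(): body
        lines.add(".");                              // close_data()
        lines.add("QUIT");                           // quit()
        return lines;
    }

    public static void main(String[] args) {
        // The recipient address is a made-up example.
        commands("intelligence.tuc.gr", "alert@intelligence.tuc.gr", "user@example.org",
                 "New Publications", "One new title matches your profile.")
            .forEach(System.out::println);
    }
}
```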
Using these commands we define a rather complex PL/SQL procedure inside the
notifying module’s package which has the following syntax.
procedure send_mail
       (in_mail_server,               in_sender_email,
       in_recipient_email,            in_recipient_name,
       in_html_flg,                   in_subject,
       in_importance,                 in_body);

in_mail_server is the mail server,
       default value 'intellix.intelligence.tuc.gr',
in_sender_email is the sender's e-mail address,
       default value 'alert@intelligence.tuc.gr',
in_recipient_email is the recipient's e-mail address,
in_recipient_name is the recipient's full name,
in_html_flg indicates whether the message has HTML structure, default value 'Y',
in_subject is the subject of the message, default value 'New Publications',
in_importance is the importance of the message, default value 'Normal',
in_body is the body of the message.
All variables are PL/SQL strings (varchar2) [30] except in_body, which is a clob (character
large object).
   Alternatively, we could have declared a Java stored procedure to send the messages
[35-37] using JavaMail, an API supplied by Sun Microsystems which implements the
most commonly used mail protocols. Using JavaMail we could also retrieve e-mail
messages from mailboxes and store and process them inside the database. However,
UTL_SMTP provides sufficient functionality for DLAlert at this time.

5.2.2 Collecting the matched publications for a single user
   Once we have constructed the send_mail procedure, the next step is to define the
procedure that collects all matched publications for a given user and generates a
message. In order to generate HTML e-mail messages we use the supplied PL/SQL
Web Toolkit [31, 40]. This toolkit includes PL/SQL stored procedures and functions
useful for generating dynamic HTML pages directly from the database. It also supports
sending those HTML pages directly to the user’s browser, using an external Web Server
that is configured suitably. In our case, the HTML structured data are sent to the user via
e-mail only, and we use the PL/SQL Web Toolkit for generating them.
   The functions used return HTML tags as character strings.
For example:
Title := htf.title( 'Library Alert Service Notification');
Assigns to the string Title the value:
       <title>Library Alert Service Notification</title>

Text := htf.fontOpen( cface => 'Arial Narrow', csize => '5')
        || 'T.U.C Library Alert'
        || htf.fontClose;


Assigns to the string Text the value:
<font face="Arial Narrow" size="5">T.U.C Library Alert</font>
We do not present the functions used in detail, because generating HTML from PL/SQL
code is a rather complex issue. The algorithm that collects all matched publications for a
given user and generates a message is the following.


procedure notify_user( user_email varchar2 ) is

begin

--GENERATING THE HEADER OF THE MESSAGE

for matched_publication in(

                      select publications.*
                      from publications,
                           (select public_id
                            from notifications
                            where email=user_email
                            group by public_id) matched
                      where matched.public_id=publications.public_id

                                )
        loop
              --APPEND ALL NON-EMPTY BIBLIOGRAPHICAL ATTRIBUTES
              --TO THE MESSAGE FOR EVERY matched_publication
        end loop;

        --FINISHING AND SENDING THE MESSAGE
        --USING THE send_mail PROCEDURE.
end;


user_email is an input string containing the e-mail address of the user to be notified.
We concentrate on the SQL SELECT clause used.


The clause
        (select public_id
         from notifications
         where email=user_email
         group by public_id) matched
retrieves the identifiers of all matched publications of the user with email=user_email and
removes duplicate entries from the result, in case the user has more than one profile
matching the same publication. It also names the result as table matched.


The clause
         select publications.*
         from publications,matched
         where matched.public_id=publications.public_id


retrieves the matched publications for a single user from the table holding the
bibliographical attributes, using the primary keys collected in table
matched.
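
The effect of the two clauses can be illustrated in plain Java. This is only a sketch under stated assumptions: the class MatchedPublications, the representation of a notifications row as a three-element string array {public_id, email, profile_desc} and the sample data are hypothetical. A set removes the duplicate identifiers, as GROUP BY public_id does, and a map lookup stands in for the join on public_id.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class MatchedPublications {

    /** Distinct public_id values for one user; each row is {public_id, email, profile_desc}. */
    static Set<String> distinctPublicationIds(List<String[]> notifications, String userEmail) {
        Set<String> ids = new LinkedHashSet<>();    // a set keeps each identifier once
        for (String[] row : notifications) {
            if (row[1].equals(userEmail)) {
                ids.add(row[0]);
            }
        }
        return ids;
    }

    /** Looks up the title of every matched publication (the join on public_id). */
    static List<String> titlesFor(Set<String> ids, Map<String, String> titleByPublicId) {
        List<String> titles = new ArrayList<>();
        for (String id : ids) {
            String title = titleByPublicId.get(id);
            if (title != null) {
                titles.add(title);
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        // Two profiles of the same user match publication 17; it is reported once.
        List<String[]> notifications = List.of(
            new String[] {"17", "user@example.org", "databases"},
            new String[] {"17", "user@example.org", "information retrieval"},
            new String[] {"23", "other@example.org", "networks"});
        Map<String, String> titles = Map.of("17", "Title A", "23", "Title B");
        Set<String> ids = distinctPublicationIds(notifications, "user@example.org");
        System.out.println(titlesFor(ids, titles));
    }
}
```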


In order to produce and send messages to all users that have matched profiles we use
the following algorithm.
procedure notify_all is

begin

for current_user in
     (
     select email from notifications group by email
     )
     loop

                 notify_user(current_user.email);
                 delete from notifications where
                       email = current_user.email;

        end loop;

        commit;
end;

The SELECT clause retrieves the unique e-mail addresses of the users that have
matched profiles in the notifications table, and the loop calls the notify_user procedure
for each one of them.


We must also delete the sent notifications after each successfully sent message, with
the simple DML statement:
        delete from notifications where
                      email = current_user.email;

If we want to notify users according to their desired frequency of notifications, we must
construct procedures that notify all users who have defined the same
interval between e-mail messages. For example, to notify all users who want to
receive e-mail messages every day (whenever there is a matching publication), we replace
the previous SELECT statement that controls the for…loop with:
       select users.email
       from notifications, users
            where users.frequency='DAY'
            and users.email=notifications.email
            group by users.email

Therefore we send messages to each category of users separately. For example, the
users who selected day as their preferred frequency are notified every day, those
who selected week are notified once a week, and so on. As we said in the previous section,
the profiles are not filtered separately; the frequency of notifications affects only the time
at which the messages are sent to the user. In Chapter 9 we explain the scheduling
of each module in more detail.
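
The selection by frequency can be sketched in Java as follows. This is an illustration, not the DLAlert code: the class FrequencySelector and its method are hypothetical, and while the value 'DAY' appears in the statement above, the value "WEEK" used in the example is an assumption about how weekly users are stored.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class FrequencySelector {

    /**
     * Returns the users that have pending notifications and the given preferred
     * frequency, mirroring the join of notifications and users shown above.
     */
    static Set<String> usersToNotify(Map<String, String> frequencyByEmail,
                                     Set<String> emailsWithNotifications,
                                     String frequency) {
        Set<String> result = new TreeSet<>();
        for (String email : emailsWithNotifications) {
            if (frequency.equals(frequencyByEmail.get(email))) {
                result.add(email);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample data; "WEEK" is an assumed value for weekly users.
        Map<String, String> users = Map.of(
            "a@example.org", "DAY",
            "b@example.org", "WEEK",
            "c@example.org", "DAY");
        Set<String> pending = Set.of("a@example.org", "b@example.org");
        System.out.println(usersToNotify(users, pending, "DAY"));
    }
}
```

A daily job would call the method with "DAY" and a weekly job with the weekly value, so the filtering itself stays identical for all users.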


5.3 Performance
   We measured the performance of DLAlert in order to ensure that the
time needed for filtering would be acceptable, considering that this process would be
executed once a day. Our measurements represent the worst-case scenario on the server of
the Intelligent Systems Laboratory (two Pentium III processors, 1 GB RAM). We took
documents from the work of Theodoros Koutris and Christos Tryfonopoulos on a DIAS
implementation [53,54]. We also generated profiles from random keywords encountered in
those documents, using the profile generator of the same DIAS implementation (only
Boolean and proximity statements), and translated them into equivalent CTXRULE
expressions.
   The time taken for indexing 340082 complex stored queries (that is, 100000 profiles
with 1 to 4 stored queries on attributes each) on the server of the Intelligent Systems
Laboratory was approximately 10 minutes. We do not measure the time taken to insert
the profiles. We consider this quite satisfactory, since
indexing 340082 stored queries means that there are 340082 new queries (inserts or
updates) since the last index synchronization (the previous day, for example), which is most
unlikely to happen even for the popular alerting applications we described in Section 2.1.
Assuming that index synchronization is executed once a day, we are able to store an
even larger number of profiles. The time taken for roughly one tenth of the queries
(3300) was approximately 2 minutes, which means that the time required is not a linear
function of the number of profiles inserted.
    As stated in the “Oracle Text Technical Overview” [27], CTXRULE query
performance depends mainly on the size and number of documents. As these factors
increase, there are more unique words, each of which results in a query on the index.
Performance is also affected by the number of unique rules indexed and the complexity of
the stored queries. However, the number of unique rules has much less impact on query
performance than the size of the document.
    The SQL query that collects matching profiles is rather complex in both cases. The
time taken for filtering 84 documents against 340082 stored queries (that is, 100000
profiles with 1 to 4 stored queries on attributes each) on the server of the Intelligent
Systems Laboratory was approximately 13 minutes for both of the previous implementations.
The size of a document attribute could vary from a few words to a few thousand
words. The total size of all documents (which is the most important factor) was 3.5
Mbytes (approximately 380,000 words), which is considered a very large number of
words for filtering. In the actual implementation of DLAlert, where records are retrieved
from the Digital Library, the average record size is much smaller, since bibliographic
attributes are usually short strings. Even assuming that filtering, which is executed once
a day, requires roughly 13 minutes on average, this time is acceptable for the
purposes of our application. Oracle states that the expected response time for filtering
64337 words against 16000 indexed queries is approximately 20 sec. In Chapter 9 we
present techniques, proposed as future work on DLAlert, that will reduce this time further.
    We have to underline that filtering using MATCHES is always a CPU-intensive
task and that, on the server we used for development, there are many other processes
running all the time. Also, we did not utilize any additional functionality provided by
Oracle that speeds up the overall database performance. Our goal was only to roughly estimate
the time needed for filtering and indexing, given the usual workload on the server of the
Intelligent Systems Laboratory, and to decide whether DLAlert could be deployed on this
computer. Our measurements do not evaluate the overall performance of Oracle Text.

5.4 Conclusions
   We explained in detail the essential PL/SQL components of DLAlert: the filtering and
the notifying module. The filtering module finds all matched profiles for every publication
and stores the primary keys of those rows in a table. The notifying module processes
those data and produces and sends dynamic HTML e-mail messages to each user,
containing the bibliographical attributes of all matching publications. In the next chapter
we present the Graphical User Interface of DLAlert.

   Chapter 6
   The Graphical User Interface
       In this chapter we present the GUI of DLAlert. We start with an overview of this
   component and continue with particular technical issues and implementation details. The
   URL of DLAlert is http://www.intelligence.tuc.gr/alert/login.html .


   6.1 Middle application tier architecture
       In this section we explain in detail the 3-tiered architecture of our web application,
   first introduced in Section 3.2, and we present the specific characteristics of our
   implementation.




   [Figure: Client Host (HTML, JavaScript) -- HTTP -- Web Tier (presentation logic: JSP pages)
   -- RMI -- Business Tier (business logic: EJB stateful session bean) -- JDBC -- Resources
   (data and stored procedures). The Web Tier and the Business Tier together form the
   Middle Application Tier.]

                          Figure 6.1-1 DLAlert: Three-tiered architecture


The figure above presents the components of DLAlert organized as a J2EE platform
application. This schema consists of the following elements:
   The Client receives dynamic HTML pages from the Application Server via HTTP (HyperText
   Transfer Protocol). JavaScript code is executed in the client’s browser and, in our
   case, is responsible for validating the fields of HTML forms. If the client discovers that
   the necessary user credentials have not been filled in during user registration, it produces
   an error message and does not send those parameters to the middle tier. The client also
   checks that the profile description and at least one of the query fields on the edit/new
   profile form are non-empty before sending them to the application server. The e-mail,
   publication year (four digits required), ISBN and ISSN fields are validated on the client
   as well. JavaScript is a language of limited functionality (unlike Java) and is not able to
   perform validation of CTXRULE text queries (generated according to a complex recursive
   grammar).
The Web tier generates presentation logic, accepts user input from HTML and generates
appropriate responses for the user. We implement this tier as pages created with Java
Server Pages (JSP) [45, 46] technology on the application server. JSPs simplify the
development of dynamic Web pages. JSP technology enables us to mix regular, static
HTML with dynamically generated content. The parts that are generated dynamically are
marked with special HTML-like tags and contain Java code.
     Apart from the standard JSP tags, in our application we used the Oracle9iAS
Containers for J2EE (OC4J) Custom Tag Library for SQL, provided by Oracle with the
Oracle9i Application Server [40, 47]. A tag library defines a collection of custom actions
for JavaServer Pages. OC4J is a framework for rapid JSP development. The OC4J tags
related to database access support opening and closing a database
connection and executing a query or any other SQL statement (DML or DDL) within JSP
code. Apart from the standard dynamic pages related to presentation, we have constructed
JSPs, not visible to the user, that process transactions on user credentials and already
validated profiles. Although OC4J functionality is provided with the Oracle AS, the
applications developed with this framework can be deployed on any other application
server that supports JavaServer Pages technology.
The Business Tier implements business logic related to the user’s session and is
developed using Enterprise JavaBeans (EJB) technology. An EJB is a server-side
component, written in Java, that encapsulates the business logic of an application. In
DLAlert this functionality includes user authentication, profile validation and data retrieval
(user credentials and profiles) from the database. Atomic transactions that
update/delete foreign keys and require stored procedure calls (see package
‘transactions’, Section 4.3) are also handled by the EJB. A stateful session bean is an EJB
that acts as a server-side extension of the client that uses it. The stateful session bean is
created by a client and works only for that client until the client connection is dropped
or the bean is explicitly removed. Therefore we use the EJB to restrict the user’s access
to his own account and to maintain login information.
     A very important function of the EJB is the validation of profiles according to the
CTXRULE language. If the profile contains at least one invalid query, the EJB produces
an error message which is displayed in the corresponding dynamic JSP page. The
phrase ‘Syntax Error’ also appears at the right side of the invalid query. This mechanism was
implemented using the Java Compiler-Compiler (JavaCC) [50], a parser generator for Java.
JavaCC generates source code for parsers using LL(k) grammars. We explain in detail
the implementation of syntax checking in Section 6.4.
     We could have also included all transaction handling functionality inside the EJB
instead of using the OC4J custom tag library, but applications that use this framework
are easier to develop and maintain. In any case, the JSPs interact with the EJB in a
way that ensures security and isolation of the user’s session.
The Web and Business Tier communicate with each other using Remote Method
Invocation (RMI). RMI is a Java-based Application Programming Interface (API) for
distributed object computing and Web connectivity. RMI allows an application to obtain a
reference to an object that exists elsewhere on the network and then invoke methods on
that object as though it existed locally. So, the web and business tiers can be
implemented on different J2EE platforms, although in our case they are deployed on the
same application server.
The Middle Application Tier communicates with the database using the Java Database
Connectivity interface (JDBC) [35, 38, 40]. The JDBC API is a specification for database
connectivity using Java. Software vendors (like Oracle) produce their own JDBC drivers
that implement the API specification to a greater or lesser degree, but all of them support
a common set of interfaces. Thus the way the programmer interacts with the database
is to some extent independent of the JDBC driver used.
The Database (also called EIS: Enterprise Information System) tier includes the RDBMS
infrastructure (both data and stored procedures). We have already explained in detail the
role of the RDBMS in Chapter 3.


6.2 The Enterprise Java Bean
In the next sections we concentrate on the business logic of DLAlert. First we present
the main class used as an Enterprise Java Bean.
    The class of the EJB is called ‘LoginBean’ and maintains login and account
information of the user session. The main private fields of this class are:
JDBC related fields.
o   java.sql.Connection conn     the JDBC connection to the database
o   java.sql.Statement stmt      the SQL statement to be executed
Database schema related fields.
o   java.lang.String dbUser      the username for the database schema (constant)
o   java.lang.String dbPass      the password for the database schema (constant)
o   java.lang.String dbURL       the URL of the database (constant)
Account related fields.
o   java.lang.String username    the username of the account (e-mail)
o   java.lang.String password    the password for the account
o   java.lang.String[] ProfileArray
An array of strings with the profile descriptions of all profiles inside the account


Objects representing entities inside the database.
o   ReadProfile ResultProfile
Object representing the profile to be inserted/updated or the profile read from the database. This
object holds all the queries of the profile. The functionality of this class is rather complex so it is
presented separately in the next section.
o   User ResultUser
Object holding all user credentials (e-mail, password, first name, last name, frequency) as
strings. This object is instantiated during the retrieval of the user credentials from the database.
This class includes the five strings that represent user credentials and simple public methods that
assign or return their values.


The main methods of the Enterprise Java Bean are:
Methods called during login/logout.
o   void Initialize()
Clears all account information – user logs out.
o   Boolean authenticate()
Returns true if there is a registered user with the given e-mail/password in the database. The
e-mail/password are required at login.
Account information related methods.
o   User getCurrentUser()
Returns an object holding all of the user credentials after reading the corresponding data from the
database (table users).
o   java.lang.String[] getProfileArray()
Returns an array of strings with the profile descriptions of all profiles inside the account after
reading the corresponding data from the database (table profiles).
Profile validation related methods (explained in Section 6.6).
o   ReadProfile getReadProfile(java.lang.String ProfileDesc)
Returns an object holding all queries of the profile with name ProfileDesc. Calls the
constructor of ReadProfile class.
o   StoreProfile getStoreProfile()
Returns a profile to be stored, already parsed.
o   StoreProfile getStoreProfile( … )
Constructs, parses and returns the parsed profile to be stored as an object. Calls the constructor of
StoreProfile class.


Methods that prevent primary key constraint violation error.
o   Boolean ProfileExists(java.lang.String NewProfileName)
Returns true if profile with description NewProfileName already exists inside the user’s
account. Prevents primary key constraint violation on the table profiles.
o   Boolean UserExists(java.lang.String NewEmail)
Returns true if user with e-mail NewEmail already exists. Checks before new user
registration/credentials update.


Methods that represent atomic transactions (see Section 4.3) – call PL/SQL stored
procedures of package ‘transactions’.
o   void DeleteProfile(java.lang.String DelProfileName)
Deletes a profile inside the account of the user with name DelProfileName.
o   void DeleteUser()
Deletes all user information from the database - unsubscription
o   void UpdateUser(java.lang.String NewEmail)
Updates current user’s e-mail to NewEmail .
o   void UpdateProfileDesc(java.lang.String OldProfileName,
                                    java.lang.String NewProfileName)
Renames the profile OldProfileName inside the account to
NewProfileName.

6.3 OC4J custom tag library
    Transactions can be declared inside JavaServer Pages. These transactions use the
OC4J custom tag library for SQL functionality.
    The tags used from this library are the following.


We use the dbOpen tag to open a database connection for subsequent SQL operations:
<database:dbOpen
    [ connId = "connection_id" ]
    [ scope = "page" | "request" | "session" | "application" ]
    user = "username"
    password = "password"
    URL = "databaseURL"
    [ commitOnClose = "true" | "false" ] >
… OPTIONALLY JSP CODE …
</database:dbOpen>


Parameters :
o   connId -- Optionally used to specify an ID name for the connection. You can then
    reference this ID in subsequent tags such as dbExecute. Alternatively, we can nest
    dbExecute tags inside the dbOpen tag.
o   scope (used only with a connId) – We use this to specify the desired scope of the
    connection instance. The default is page scope.
o   user – the username of the database schema.
o   password – the password for the database schema.
o   URL – the URL of the RDBMS.
o   commitOnClose -- "true" for an automatic SQL commit when the connection is
    closed or goes out of scope. The default setting is “false” for automatic rollback on
    connection close.




    We use the dbExecute tag to execute a single DDL or DML statement, either inside the
dbOpen tag or outside of it using the same connId and scope parameters. The syntax
for this tag is the following.
<database:dbExecute
        [ connId = "connection_id" ]
        [ scope = "page" | "request" | "session" | "application" ]
    … DML or DDL statement (one only)…
</database:dbExecute >


    We use the dbClose tag outside the dbOpen tag to explicitly terminate a database
connection. We use the same parameters defined in the dbOpen tag to reference the same
connection.


<database:dbClose connId = "connection_id"
        [ scope = "page" | "request" | "session" | "application" ] />


OC4J includes many other useful tags that we did not use in DLAlert and that are not
covered in this dissertation. For a complete reference of this library see [48, 49].


6.4 Preventing CTXRULE index errors
    Before explaining in detail the components that store/read profiles from the database
we must introduce the mechanism that validates profiles. The text queries on
bibliographical attributes (Title, Series, Publisher, Subject, Notes) are written
according to the CTXRULE grammar. Invalid profiles should not be inserted into the
database but rejected by the web interface: if the user defines an invalid
query, an error message appears. There are several restrictions in the CTXRULE
grammar. For each of the cases below, an index error appears.


    Queries that contain obvious syntax errors like unclosed parentheses, a missing term
    or a missing operator.
Error Description                              Query
missing parenthesis                            ( information systems
Term2 missing                                  security and
Term1 missing                                  or mathematics
operator between term1 and near(...) missing   RDBMS near((data, warehousing),6)
                                 Table 6.4-1 Wrong queries
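
As an illustration of the first row of the table, unclosed parentheses can be detected with a simple counter. The Java sketch below is a hand-written check under that narrow assumption; the class ParenthesisCheck is hypothetical, and it is not a CTXRULE parser, so it does not detect missing terms or operators.

```java
final class ParenthesisCheck {

    /** Returns true if every '(' has a matching ')' and no ')' comes too early. */
    static boolean isBalanced(String query) {
        int depth = 0;
        for (char c : query.toCharArray()) {
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth < 0) {
                    return false;       // closing parenthesis without an opening one
                }
            }
        }
        return depth == 0;              // anything left open is an error
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("( information systems"));
        System.out.println(isBalanced("RDBMS and near((data, warehousing),6)"));
    }
}
```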

    Queries that violate limitations on certain operators.
             Error Description                                                Query
proximity parameter>100                           near((artificial, intelligence),120)
theme inside about(.. ) in upper case             about ( POLITICIANS )
about(...) and near(...) in the same field        about(management) and near((financial, planning),4)
                                     Table 6.4-2 Wrong queries
         Queries like the second one above do not produce index errors but are
never expanded properly. According to the Oracle Text Reference [24, page 3-7], the
normalization of themes inside about(…) queries is case-sensitive. The themes stored
inside the database are in lower case. Therefore, to ensure that
normalization always succeeds in finding the appropriate theme, we must turn words or
phrases inside about(…) statements to lower case.
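
A minimal Java sketch of this lower-casing step is shown below. It is an assumption-laden illustration rather than the DLAlert parser: the class AboutNormalizer is hypothetical, it uses a regular expression instead of the JavaCC grammar, it assumes the theme itself contains no parentheses, and it also removes the spaces around the theme.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AboutNormalizer {

    // Matches about( ... ) in any letter case, with optional spaces before '('.
    private static final Pattern ABOUT =
        Pattern.compile("(?i)\\babout\\s*\\(([^)]*)\\)");

    /** Rewrites every about(...) statement so that its theme is in lower case. */
    static String lowerCaseThemes(String query) {
        Matcher m = ABOUT.matcher(query);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String theme = m.group(1).trim().toLowerCase(Locale.ROOT);
            m.appendReplacement(out, Matcher.quoteReplacement("about(" + theme + ")"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(lowerCaseThemes("about ( POLITICIANS )"));
    }
}
```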


    Queries that contain reserved words.
    The CTXRULE language has many reserved words and symbols, some of which are
not even associated with functionality for this type of index (they represent functionality of
the CONTEXT index type only [23-24]). Reserved words that are to be treated as query terms
should be enclosed in ‘{ }’. The unused symbols are escaped using a ‘\’.
    We have also decided to include only one type of theme query, the about(…)
statement, because it returns the greatest number of relevant concepts during expansion.
Hence we have the following categories of reserved words:


    Thesaural operators not used
Operator    Name
BT          Broader Term
BTG         Broader Term Generic
BTI         Broader Term Instance
BTP         Broader Term Partitive
NT          Narrower Term
NTG         Narrower Term Generic
NTI         Narrower Term Instance
NTP         Narrower Term Partitive
PT          Preferred Term
RT          Related Term
TR          Translation Term
TRSYN       Translation Term Synonym
TT          Top Term
SYN         Synonym

    Operators used only for XML document classification (not plain text)
WITHIN, HASPATH, INPATH

    Operators not supported by CTXRULE
Operator    Symbol       Name
FUZZY       ?            fuzzy
ACCUM       ,            Accumulate
(none)      %            wildcard characters
(none)      _            wildcard character
(none)      !            soundex
SQE         (none)       Stored Query Expression
(none)      >            threshold
(none)      *            weight
MINUS       -            MINUS



Rest of reserved operators
Operator     Symbol           Meaning            Operator     Symbol            Meaning
AND        &          Boolean and               (none)      $            stem
OR         |          Boolean or                ABOUT       (none)       related concepts
NOT        ~          Boolean and-not           (none)      ()           grouping characters
NEAR       ;          proximity                 (none)      []           grouping characters


We have decided to use only word operators when possible (AND, OR, NOT, NEAR).
Also we do not use ‘[ ]’ as grouping characters. The stemming character ($) is the only
symbol operator used.


   Therefore analyzing the requirements of our application we conclude that we must
implement a mechanism that escapes or rejects the following reserved words and
symbols.
   o   Escaped reserved words :         ACCUM, BT, BTG, BTI, BTP, FUZZY, HASPATH,
       INPATH, MINUS, NT, NTG, NTI, NTP, PT, RT, SQE, SYN, TR, TRSYN, TT,
       WITHIN .
   o   Escaped symbols: &, ? , - , ; , ~ , > , * , %.
   o   Any other special character or symbol is omitted.
The CTXRULE index contains only keywords and escaped symbols are never indexed
by default. Therefore including an escaped special symbol in a query does not affect the
filtering results. Special symbols are usually treated as token delimiters by the index
engine by default.


We cannot expect the user to be an expert in the CTXRULE language, so we must
construct a parser that automatically escapes reserved tokens. This functionality should
be transparent to the user: the characters { } \ added by the parser must not be visible
in the web GUI.
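
The escaping step described above can be sketched in Java as follows. This is only an illustration under stated assumptions, not the JavaCC-generated DLAlert parser: the class and method names are hypothetical, the helper is assumed to be applied to plain word tokens after operators and parentheses have been recognized, and the reserved word and symbol lists are the ones given in this section.

```java
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch of the escaping step (hypothetical names, not DLAlert code). */
final class ReservedTokenEscaper {
    // Reserved words listed in this section; they are wrapped in { }.
    private static final Set<String> RESERVED_WORDS = Set.of(
        "ACCUM", "BT", "BTG", "BTI", "BTP", "FUZZY", "HASPATH", "INPATH",
        "MINUS", "NT", "NTG", "NTI", "NTP", "PT", "RT", "SQE", "SYN",
        "TR", "TRSYN", "TT", "WITHIN");

    // Symbols listed in this section as escaped; each gets a leading backslash.
    private static final String ESCAPED_SYMBOLS = "&?-;~>*%";

    /** Escapes one plain word token typed by the user. */
    static String escapeToken(String token) {
        if (RESERVED_WORDS.contains(token.toUpperCase(Locale.ROOT))) {
            return "{" + token + "}";
        }
        StringBuilder out = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (ESCAPED_SYMBOLS.indexOf(c) >= 0) {
                out.append('\\').append(c);
            } else if (Character.isLetterOrDigit(c) || c == '$') {
                out.append(c); // letters, digits and the stemming character are kept
            }
            // Any other special character is omitted.
        }
        return out.toString();
    }

    /** Removes the characters added by the parser before a query is shown in the form. */
    static String unescapeForDisplay(String stored) {
        return stored.replace("{", "").replace("}", "").replace("\\", "");
    }
}
```

Because every other special character is dropped on input, stripping all braces and backslashes on output is enough to restore what the user typed for the tokens that were kept.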


   Empty Queries.
As we said empty cells in profiles are substituted by the symbol <. This character is
skipped by the indexing engine so it is not included in the index. This symbol also
should not be visible from the web GUI.



In conclusion, we have constructed a parser that is executed before profiles are stored
inside the database. The parser:
   o     Checks queries for syntax errors.
   o     Allows only two digits on the proximity parameter ( < 100 ).
   o     Allows only one of the statements about(…) or near(…) in the same text query.
   o     Turns themes inside about(…) clauses to lower-case.
   o     Escapes reserved words and symbols.
       To implement such functionality we used JavaCC, a compiler generator for Java.
JavaCC processes a text file that defines the grammar and the semantic actions of the
compiler, and generates the appropriate Java source code. To construct this parser we
have used the following LL(1) grammar. We must underline that this grammar neither
defines the actual CTXRULE grammar used by Oracle nor represents the parser used for
CTXRULE indexing. A complete definition of the CTXRULE language is not provided by
Oracle. Therefore the rules below represent the language used by DLAlert and define a
heuristic that detects syntax errors. The necessary semantic actions are not included in
the grammar. Ambiguities that occur in the grammar can be handled by the lookahead
(one token) mechanism of the parser. Bold characters are tokens.


   (1)     text_query → or_exp EOF | EOF
A text query can be an empty or a non-empty expression. EOF marks the end of the string
(End Of File).


   (2)     or_exp     → and_exp ( OR and_exp )*
   (3)     and_exp    → not_exp ( AND not_exp )*
   (4)     not_exp    → operand ( NOT operand )*
The rules (2), (3), (4) allow Boolean queries according to the operator precedence
explained in Section 2.4.3. The symbol * means “zero or more occurrences”.


   (5)     operand    → group | about_exp | near_exp | ( any_word )+
An operand can be a group of expressions, an about or near expression, or
a phrase (a series of words). The symbol + means “at least one occurrence”.



    (6)     group        → left or_exp right
A group of expressions starts and ends with parentheses.


    (7)     near_exp     → NEAR left left ( any_word )+
                           [ comma ( any_word )+ ]+ right comma two_digits right
A near expression has the following syntax: near ( (term1, term2, ..., termn) , max_span )


    (8)     about_exp    → ABOUT left concept right
An about expression has the following syntax: about(concept)


    (9)     concept      → ( any_word | any_operator )+
The concept can be a series of any words or operators (treated as plain keywords inside
about(..) ).


    (10)    any_operator → AND | OR | NOT | NEAR | ABOUT

    (11)    any_word     → word | stemmed_word | reserved_word
                           | number | two_digits
A word can be a plain word, a stemmed word (starts with $), a reserved word or any
number.


Our language has minor differences from the actual CTXRULE grammar. The DLAlert
grammar does not allow:
    o     Symbol operators ( &, | , ~ , ; ) (escaped by the parser).
    o     The brackets as grouping characters ( [ , ] ) (escaped by the parser).
    o     The syntax “term1 near term2” for proximity.
    o     Order of terms in proximity queries.
    o     Stemming on expressions or phrases. Equivalent expressions can be defined
          using stemming on each word separately. For example the query “$( software
          and engineering )” can be equivalently specified as “$software and $engineering”
          so the first syntax is not allowed.
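
To make the grammar more concrete, the following is a minimal hand-written recursive-descent recognizer for rules (1) to (11). It is a sketch under stated assumptions and not the JavaCC-generated parser used by DLAlert: the class name and tokenizer are hypothetical, keywords are matched case-insensitively, any token that is not a parenthesis, comma or operator is treated as any_word, and semantic actions (escaping, lower-casing of themes, the restriction to one about or near statement) are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-written recognizer for rules (1)-(11); illustrative, not the JavaCC output. */
final class QuerySyntaxChecker {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    private QuerySyntaxChecker(String query) {
        // Parentheses and commas become separate tokens; the rest splits on whitespace.
        String spaced = query.replace("(", " ( ").replace(")", " ) ").replace(",", " , ");
        for (String t : spaced.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
    }

    /** Rule (1): an empty query, or or_exp followed by end of input. */
    static boolean isValid(String query) {
        QuerySyntaxChecker p = new QuerySyntaxChecker(query);
        return p.tokens.isEmpty() || (p.orExp() && p.pos == p.tokens.size());
    }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : null;
    }

    private boolean accept(String expected) {
        String t = peek();
        if (t != null && t.equalsIgnoreCase(expected)) {
            pos++;
            return true;
        }
        return false;
    }

    private static boolean isOperator(String t) {   // rule (10)
        return t.equalsIgnoreCase("AND") || t.equalsIgnoreCase("OR")
            || t.equalsIgnoreCase("NOT") || t.equalsIgnoreCase("NEAR")
            || t.equalsIgnoreCase("ABOUT");
    }

    private boolean atWord() {                      // rule (11), simplified
        String t = peek();
        return t != null && !t.equals("(") && !t.equals(")") && !t.equals(",") && !isOperator(t);
    }

    private boolean words() {                       // ( any_word )+
        if (!atWord()) return false;
        while (atWord()) pos++;
        return true;
    }

    private boolean orExp() {                       // rule (2)
        if (!andExp()) return false;
        while (accept("OR")) if (!andExp()) return false;
        return true;
    }

    private boolean andExp() {                      // rule (3)
        if (!notExp()) return false;
        while (accept("AND")) if (!notExp()) return false;
        return true;
    }

    private boolean notExp() {                      // rule (4)
        if (!operand()) return false;
        while (accept("NOT")) if (!operand()) return false;
        return true;
    }

    private boolean operand() {                     // rules (5) and (6)
        if (accept("(")) return orExp() && accept(")");
        if (accept("ABOUT")) return aboutExp();
        if (accept("NEAR")) return nearExp();
        return words();
    }

    private boolean aboutExp() {                    // rules (8) and (9)
        if (!accept("(")) return false;
        int n = 0;
        while (atWord() || (peek() != null && isOperator(peek()))) { pos++; n++; }
        return n > 0 && accept(")");
    }

    private boolean nearExp() {                     // rule (7)
        if (!accept("(") || !accept("(") || !words()) return false;
        int extraTerms = 0;
        while (accept(",")) {
            if (!words()) return false;
            extraTerms++;
        }
        if (extraTerms == 0 || !accept(")") || !accept(",")) return false;
        String span = peek();
        if (span == null || !span.matches("\\d{1,2}")) return false; // two_digits, i.e. < 100
        pos++;
        return accept(")");
    }
}
```

One method per grammar rule keeps the correspondence with the rules visible; a single token of lookahead (peek) is enough to choose between the alternatives of rule (5).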



6.5 Parsing the text queries
   As a conclusion to the previous section, we need a mechanism that parses the
profiles and produces error messages. If a user tries to insert or update a profile that
contains an invalid query, the application should be able to point out the syntax error. If
the form were always refilled with data from the RDBMS, we would not be able to provide
such functionality after a syntax error, because the invalid query would be lost. Therefore
we must implement an object that holds the values of the profiles. The session EJB
decides whether the dynamic JSP displayed on screen contains queries read from the
database, a profile that was not successfully inserted / updated, or empty text fields
in the case of a new profile. We also assume that the profiles already in the database are
valid and should not be re-parsed (unless the user tries to update them).


                     StoreProfile  →  ReadProfile  →  Profile

                               Figure 6.5-1 Class Hierarchy


   For this purpose we have constructed a class hierarchy as shown in Figure 6.5-1.
The arrows represent an “is a” relation.
   The class Profile is an abstract class and cannot be instantiated.
   The class ReadProfile includes private fields for all of the text query strings (Title,
   Series, Author, Publisher, Subject, Notes, Publication Year, ISBN, ISSN) and the profile
   name, together with methods that set or return the values of these variables. It also
   includes methods that return each query exactly as it is displayed on screen (omitting
   the escape characters { } \ and showing one-character strings consisting of the symbol
   ‘<’ as empty). A ReadProfile object is instantiated with fields containing queries read
   from the database.



   The class StoreProfile includes all fields and methods of ReadProfile. A
   StoreProfile object is instantiated with fields containing parsed text queries,
   which are defined by the user. The constructor of StoreProfile calls the
   JavaCC-generated parser, which performs the necessary semantic actions on all text
   query fields before assigning the strings to the private variables. In addition to the
   methods inherited from ReadProfile, the class includes the following ones.




   o    public Boolean isValid(int QueryIndex)
        Returns true if the referenced text query is valid. QueryIndex defines which
        text query of the profile is referenced.
   o    public String getErrorMessage(int QueryIndex)
        Returns the string “Syntax Error” displayed on screen, if the referenced text
        query is invalid.
   o    public String getHeaderMessage()
        Returns the string “Encountered XX wrong queries …” displayed on screen, if the
        referenced profile contains at least one wrong query.
   o    public int NumberOfErrors()
        Returns the number of wrong queries of the profile.


   This class hierarchy allows us to call two different constructors (ReadProfile()
and StoreProfile()) for what is virtually the same Profile object, according to the
source of the text queries assigned. The constructor StoreProfile() calls the parser, while
the constructor ReadProfile() just assigns the text queries to the fields. Thus we
avoid re-parsing text queries that have already been parsed.
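
A compact Java sketch of this hierarchy is given below. It is an illustration only, not the DLAlert source: the queries are held in a simple array, the JavaCC-generated parser is replaced by a caller-supplied syntax check (an assumption made to keep the sketch self-contained), and getDisplayQuery is a hypothetical name for the methods that return a query as displayed on screen.

```java
import java.util.function.Predicate;

/** Abstract root of the hierarchy; cannot be instantiated. */
abstract class Profile {
    abstract String getQuery(int queryIndex);
}

/** Holds queries read from the database; no parsing takes place. */
class ReadProfile extends Profile {
    protected final String[] queries;

    ReadProfile(String[] storedQueries) {
        this.queries = storedQueries.clone();
    }

    @Override
    String getQuery(int queryIndex) {
        return queries[queryIndex];
    }

    /** Query as shown in the form: "<" means empty, escape characters are hidden. */
    String getDisplayQuery(int queryIndex) {
        String q = queries[queryIndex];
        if (q.equals("<")) {
            return "";
        }
        return q.replace("{", "").replace("}", "").replace("\\", "");
    }
}

/** Holds queries typed by the user; every query is checked in the constructor. */
class StoreProfile extends ReadProfile {
    private final boolean[] valid;

    // syntaxCheck stands in for the JavaCC-generated parser (assumption).
    StoreProfile(String[] userQueries, Predicate<String> syntaxCheck) {
        super(userQueries);
        valid = new boolean[userQueries.length];
        for (int i = 0; i < userQueries.length; i++) {
            valid[i] = syntaxCheck.test(userQueries[i]);
            if (valid[i] && queries[i].trim().isEmpty()) {
                queries[i] = "<"; // empty cells are stored as the symbol <
            }
        }
    }

    public boolean isValid(int queryIndex) {
        return valid[queryIndex];
    }

    public String getErrorMessage(int queryIndex) {
        return valid[queryIndex] ? "" : "Syntax Error";
    }

    public int NumberOfErrors() {
        int errors = 0;
        for (boolean ok : valid) {
            if (!ok) errors++;
        }
        return errors;
    }
}
```

Code that only displays a profile can work with the Profile or ReadProfile type and does not need to know which constructor produced the object.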

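   [Diagram: the JavaServer Page (profile insert/update form) passes text queries to the
   EJB. The EJB builds a Profile object either through the Parser and StoreProfile( ), for
   text queries entered by the user, or through ReadProfile( ), for text queries read from
   the RDBMS. The OC4J tags read the text queries from the Profile object and insert /
   update the profile in the RDBMS.]

                           Figure 6.5-2 Profile insert / update mechanism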



   The functionality that performs inserts/updates on profiles is shown in Figure 6.5-2
and can handle the following four actions in any valid sequence.
   If a new profile is to be defined, the JavaServer Page is filled with empty strings.
   If a profile is to be updated, the form is filled with the actual RDBMS data. The EJB
   calls the ReadProfile( ) constructor and the JavaServer Page reads the text queries
   from the object (omitting the escape characters { } \ and displaying one-character
   strings with the symbol ‘<’ as empty fields).
   If the user enters a valid profile to be inserted / updated, the EJB calls the
   StoreProfile( ) constructor, which parses the profile; the JSP with the OC4J tags
   then reads the values from the Profile object and performs the transaction.
   If the user enters an invalid profile, the EJB calls the StoreProfile( ) constructor,
   which parses the profile, and the form containing the invalid queries and the error
   messages appears. In this case the transaction is not performed.



6.6 Conclusions
   The Graphical User Interface was implemented with the intention to be a simple and
friendly application. We have achieved this goal to some extent, for this first version of
DLAlert. In the last chapter of the dissertation we propose future work on the service that
will improve the functionality of DLAlert. However we must first mention the way DLAlert
collects publications from Digital Libraries in the next chapter.




Chapter 7
The Observer
   In this chapter we present the mechanism that collects records from Digital Libraries
using Java Stored Procedure technology and the Z39.50 protocol. Before explaining the
implementation of this component (Observer) we reference different ways for retrieving
data from information sources.


7.1 Information providers

   Information providers are any suppliers of information in the particular area of
interest of the alerting service (in our case scholarly material) [4,5]. The information
collected is publications metadata (records containing bibliographical attributes).
   We can distinguish information providers to be either active or passive. Active
providers virtually provide their own alerting service; they regularly notify interested
systems or users on new data (publications). Passive providers have to be queried for
new material in a scheduled manner. For example any Z39.50 Gateway is a passive
information provider.
   Also information providers are either cooperative or non-cooperative. Cooperative
providers offer materials in a standard format, non-cooperative ones provide data in a
proprietary custom format. For example human readable and unstructured records in a
web-page or e-mail message are not in a standard format that can be easily processed
by a service.
   The target of highest priority, in constructing an Observer for an Alerting Service, is
the ability of the system to deal efficiently with as many as possible heterogeneous
information providers, in a common approach. During the implementation of DLAlert we
had the following possible solutions for collecting data from the Digital Library of TUC.
   Trying to construct a mechanism inside the Digital Library of TUC that will notify us
   of new publications would be the worst solution, since it would not be easy to add
   other sources in the future. It would also require specialized knowledge of the RDBMS
   used by the Library and by every other system to be supported.
   Querying the database of the Digital Library directly using a standard JDBC or ODBC
   API would require development of functionality adapted to the database schema. In



   addition, the Library of TUC does not currently provide a licensed ODBC interface.
   Using such a mechanism would require us to construct the Observer from scratch every
   time a new source is to be added.
   Using the Web Page of the TUC Digital Library to retrieve new records would not be
   an efficient solution, since the search interface used does not provide queries on the
   date of acquisition of documents. Trying to request records inserted during the last
   month for example, would be impossible.
   The use of the Z39.50 protocol for this purpose is considered the best solution since:
   o   It is supported by the majority of Digital Libraries around the world.
   o   Provides standardized access points to the resources.
   o   Offers records in standardized format that can be easily processed.
   o   Requires minor changes in order to add other sources.
   o   Does not require intervention inside the database of every Digital Library
       supported.
   o   Usually querying Gateways using this protocol is free and does not require
       specialized permissions.
   o   The implementation of Z39.50 used by the TUC Library Gateway, provides
       queries on the document’s date of acquisition. Also adding this attribute for
       queries in an existing Gateway does not involve software development but
       requires minor changes on the configuration of the Z39.50 server.


7.2 Observer architecture

    As we explained in Section 3.2 Oracle RDBMS [35-39] is able to integrate Java
classes inside the database and deploy them on a supplied Enterprise level Java 2
platform (Oracle Java Server). The Java code, which can be loaded as either source or
compiled (bytecode), is executed inside the database using the internal Oracle Java
Virtual Machine. The static methods of any Java class can be declared stored
procedures and can be even called from PL/SQL code.
   Using this functionality we can insert all the necessary Java classes into the
database (both the JZKit API and our classes). The main static method that handles the
collection of records and calls all the other methods is published as Java stored
procedure. Thus we can reference the Observer from PL/SQL code and schedule it to
request new publications regularly using the PL/SQL package DBMS_JOB. First let us
present the three sub-components of the Observer.




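   [Diagram: the Observer (Java Stored Procedure) consists of three sub-components in
   sequence: the JZKit API, which passes an array of characters to the UNIMARC parser,
   which passes records (objects) to the SQLJ class. The SQLJ class inserts the records
   into the Oracle database through the JDBC server-side internal driver. The JZKit API
   requests data on new publications from the TUC Digital Library Z39.50 Gateway (DBMS
   "Advance"), or from another Z39.50 Gateway in front of another DBMS, and the Gateway
   sends back the records.]

                     Figure 7.2-1 Architecture of the Observer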



JZKit [15] is an open source Java toolkit for building distributed information retrieval
systems, with particular focus on the Z39.50 Information Retrieval standard. JZKit
offers functionality that helps us develop clients for the Z39.50 protocol. We have
already presented the main Z39.50 facilities and services in Section 2.3; in this
chapter we focus on the particular characteristics of the Observer. The code
developed writes the requested records into an array of characters, which is processed
by the next component.
The UNIMARC [17] parser is a typical parser generated by JavaCC [50]. This parser
takes UNIMARC records as input and maps the UNIMARC fields to the desired
bibliographical attributes (Title, Series, Author, Publisher, Subject, Notes, Pub. Year,
ISBN, ISSN). The UNIMARC format was presented in detail in Section 2.3.5. This
component essentially processes structured text and produces a set of objects



   representing the records requested. JavaCC is the Java compiler generator used in
   the development of the Web Graphical User Interface (Chapter 6).
    The last component is a Java class that uses the SQLJ [35, 51] functionality. SQLJ
   is an industry standard that enables database developers to:
   o   Embed SQL code directly inside of Java source code.
   o   Write Java-based code without resorting to low-level JDBC calls.
   o   Construct applications that are portable to all database platforms that support
       JDBC drivers.
   The last sub-component of the Observer is a class that reads the fields of the objects
   produced by the parser and performs the necessary inserts. We focus on this
   component and the SQLJ functionality in Section 7.5.
We must mention that the sub-components are methods that are executed sequentially;
every module acts on the output of the previous one. For example, the SQLJ class calls
the method that parses the records, which in turn uses the records retrieved by JZKit.



7.3 JZKit API

   In this section we will explain the functionality of the JZKit API. We will not give a
detailed reference on the classes and methods used. We focus on the services of
Z39.50 used and the algorithm developed according to the Z39.50 terminology
introduced in Section 2.3.
   We need a mechanism that collects publications acquired during a certain interval of
time. The Z39.50 implementation used by the Digital Library offers an access point
(attribute 32) to the bibliographic records according to the month of acquisition (shorter
intervals can not be requested yet).
   For example if we want to request the records inserted in the Digital Library during
February of 2003 we issue the following type-1 (introduced in 2.3.3) query using the
JZKit API. The month stored in this field is always in Greek.


       @attrset bib-1 @attr 1=32             2003-ΦΕΒΡΟΥΑΡΙΟΣ


By choosing the string that corresponds to the current month and year we can retrieve the
desired records.
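
As a small illustration, the query string for a given month could be built as follows. This is a sketch with hypothetical names; only the February spelling (ΦΕΒΡΟΥΑΡΙΟΣ) is confirmed by the example above, and the other month names are assumed to follow the same upper-case nominative form, which would have to be checked against the values actually stored by the Library.

```java
import java.time.YearMonth;

/** Builds the type-1 query for one month of acquisition (illustrative sketch). */
final class AcquisitionQuery {
    // Assumed spellings; only ΦΕΒΡΟΥΑΡΙΟΣ appears in the text.
    private static final String[] GREEK_MONTHS = {
        "ΙΑΝΟΥΑΡΙΟΣ", "ΦΕΒΡΟΥΑΡΙΟΣ", "ΜΑΡΤΙΟΣ", "ΑΠΡΙΛΙΟΣ", "ΜΑΪΟΣ", "ΙΟΥΝΙΟΣ",
        "ΙΟΥΛΙΟΣ", "ΑΥΓΟΥΣΤΟΣ", "ΣΕΠΤΕΜΒΡΙΟΣ", "ΟΚΤΩΒΡΙΟΣ", "ΝΟΕΜΒΡΙΟΣ", "ΔΕΚΕΜΒΡΙΟΣ"
    };

    /** Returns the query on access point 32 (date of acquisition) for the given month. */
    static String forMonth(YearMonth month) {
        String greekName = GREEK_MONTHS[month.getMonthValue() - 1];
        return "@attrset bib-1 @attr 1=32 " + month.getYear() + "-" + greekName;
    }
}
```

For example, AcquisitionQuery.forMonth(YearMonth.of(2003, 2)) produces the query for February 2003 shown above, with single spaces between its parts.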



Using queries like the above at the end of each month we can retrieve the necessary
records. The complete algorithm developed utilizes the Initialize, Search and Present
services. We must mention that the target cannot return all requested records in one
response (size of response limited by the segmentation service) so we have to count the
records returned until all of them are retrieved. Phrases enclosed in < > represent
variables.


Initialize ( URL : dias.library.tuc.gr , port : 210 )
         returns initialization parameters
Search ( database-name : “Advance” ,
         query : @attrset bib-1 @attr 1=32 <current year>-<month in Greek> )
         returns <total number of records>
<starting point> = 0

repeat
         Present ( number of records , starting point )
                returns <number of returned records> , <records in UNIMARC>
         <starting point> += <number of returned records>
         write <records in UNIMARC> in an array of char.
until    <starting point> equals <total number of records>

terminate connection


The previous algorithm initializes a Z-association with the target, issues a query and
sends Present requests until all records are returned. The records of each Present
response are written into an array in order to be processed further by the next module.
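
The retrieval loop can be sketched in Java as follows. The ZTarget interface is a hypothetical stand-in for a Z39.50 client and is not the JZKit API; the starting point is a zero-based offset as in the pseudocode, which a real client may need to translate to the numbering expected by the target.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a Z39.50 client; not the JZKit API. */
interface ZTarget {
    /** Runs the query and returns the total number of matching records. */
    int search(String query);

    /** Returns up to count records starting at the given offset (possibly fewer). */
    List<String> present(int start, int count);
}

final class RecordCollector {
    /** Repeats Present requests until every record of the result set is retrieved. */
    static List<String> collectAll(ZTarget target, String query, int batchSize) {
        int total = target.search(query);
        List<String> records = new ArrayList<>(total);
        int start = 0;
        while (start < total) {
            int wanted = Math.min(batchSize, total - start);
            List<String> batch = target.present(start, wanted);
            if (batch.isEmpty()) {
                // Guard against looping forever if the target stops returning records.
                throw new IllegalStateException("No records returned at offset " + start);
            }
            records.addAll(batch);
            start += batch.size(); // the target may return fewer records than requested
        }
        return records;
    }
}
```

Compared with the repeat/until form, the while loop also terminates when the search returns no records at all, and the empty-batch check avoids an endless loop if the target misbehaves.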


7.4 UNIMARC parser
    The UNIMARC record format is not currently supported by Oracle Text. Thus we
have to parse the incoming records into plain text in order to insert them in the database
schema explained in Chapter 4. The parser processes the UNIMARC records and maps
the UNIMARC fields to the bibliographical attributes used by DLAlert (Title, Series,
Author, Publisher, Subject, Notes, Pub. Year, ISBN, ISSN) according to Table 7.4-1.



                           Destination         UNIMARC fields
                      Local Number       001
                      Title              200,5XX,4XX except 410
                      Series Title       410
                      Author             7XX
                      Publisher          210
                      Subject            60X
                      Notes              3XX
                      Publication Year   210 $d
                      ISBN               010 $a
                      ISSN               011 $a
                           Table 7.4-1 UNIMARC field mapping
Local number is the unique identifier of the record inside the database of the Digital
Library of TUC. We use this number as a primary key for our schema (public_id). Other
fields included in the UNIMARC record (like information about the book’s lending) are
omitted.
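
The mapping of Table 7.4-1 can be expressed as a small lookup function. The sketch below is illustrative and uses hypothetical names; it is not the JavaCC-generated UNIMARC parser. It assumes that subfield 210 $d goes only to Publication Year and the remaining 210 subfields to Publisher, and that control field 001 is queried with a blank subfield code.

```java
/** Maps a UNIMARC tag and subfield code to a DLAlert attribute (illustrative sketch). */
final class UnimarcFieldMapper {
    /** Returns the destination attribute of Table 7.4-1, or null if the field is omitted. */
    static String destination(String tag, char subfield) {
        if (tag.equals("001")) return "Local Number";
        if (tag.equals("010")) return subfield == 'a' ? "ISBN" : null;
        if (tag.equals("011")) return subfield == 'a' ? "ISSN" : null;
        if (tag.equals("210")) return subfield == 'd' ? "Publication Year" : "Publisher";
        if (tag.equals("200")) return "Title";
        if (tag.equals("410")) return "Series Title";                     // checked before 4XX
        if (tag.startsWith("4") || tag.startsWith("5")) return "Title";   // 4XX except 410, 5XX
        if (tag.startsWith("60")) return "Subject";                       // 60X
        if (tag.startsWith("7")) return "Author";                         // 7XX
        if (tag.startsWith("3")) return "Notes";                          // 3XX
        return null; // every other field is omitted
    }
}
```

The specific tags (410, 210, 01X) are tested before the wildcard ranges so that the exception "4XX except 410" falls out of the ordering.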
For example consider the following record. The extraction of bibliographical attributes is
shown in the Figure 7.4-1. The date of acquisition is not inserted in the Oracle database.




                          Figure 7.4-1 Sample UNIMARC record
The fields of the processed record are shown below (they represent private variables of
the object LibraryRecord).




Attribute    Value
public_id    10024364
title        The international business book Vincent Guy, John Mattock
series       <null>
author       Guy Vincent Mattock John NTC Business Books
publisher    Lincolnwood, Ill., USA NTC Business Books
subject      International business enterprises Management
notes        "All the tools, tactics, and tips you need for doing business across
             cultures"--Cover. Includes bibliographical references (p.[171]-173) and index.
pub. year    1995
ISBN         0844235172
ISSN         <null>



7.5 SQLJ functionality
    The last sub-component of our module is an SQLJ class. This is a Java class
containing methods that call the aforementioned sub-components, and embedded SQL
code that performs the transactions.
In order to compile an SQLJ class we use the SQLJ translator that:
  i. Validates that the SQLJ statements are syntactically correct
  ii. Validates that the SQL code inside the SQLJ statements is correct.
 iii. Validates that the database objects being manipulated by the SQL code in the
      SQLJ statements are valid.
 iv. Translates the SQL code into syntactically correct Java statements.



   SQLJ source → [SQLJ Translator] → Java source → [Java Compiler] → Java class (bytecode)

                              Figure 7.5-1 SQLJ compilation process
The file generated by the translator still contains Java source code (compilation to
bytecode is still necessary). The SQL statements are declared with the #sql token and
must be enclosed in braces “{ }”. Java variables declared outside the statement and
referenced inside it start with the symbol “:”.



For example, to insert a new publication in the database schema of DLAlert with
Public_id=1000 and Author=’Giannis Alexakis’ we have the following SQLJ code.


String NewPublic_id = "1000";
String NewAuthor = "Giannis Alexakis";
// Host variables are referenced inside the SQL statement with a leading ':'
#sql {
            INSERT INTO alert.publications (PUBLIC_ID, AUTHOR)
            VALUES ( :NewPublic_id, :NewAuthor )
            };




The actual SQLJ statement used for the transaction is:


#sql {
       INSERT INTO alert.publications
                 (PUBLIC_ID, TITLE, SERIES, AUTHOR, PUBLISHER,
                  SUBJECT, NOTES, PUB_YEAR, BOOKN, SERIALN)
       VALUES
                 (:Publication_Id, :Title, :Series, :Author, :Publisher,
                  :Subject, :Notes, :Year, :BookNumber, :SerialNumber)
       };


The variables are assigned the values to be inserted. Iterating through all the
records, we insert all of the new publications requested and parsed previously.
   The translated or compiled Java code produced by the SQLJ translator can be
loaded and executed inside the Oracle database. Java applications executed inside the
RDBMS environment use the server-side internal JDBC driver for Oracle. Since we use this
type of driver, it is not necessary to explicitly declare a statement that establishes a JDBC
connection with the database. Code executed inside an RDBMS is implicitly
assumed to reference that same database.
   In order to be able to call the method that requests records through the JZKit API,
parses them, and stores the bibliographical attributes of new publications, we have to



publish it as a Java Stored Procedure. The declared Java Stored Procedure,
ReadFromLibrary( ), calls the Observer and returns the integer 1 on abnormal
termination (for example, when a network failure prevents a Z-association with the
Gateway from being established). If an exception occurs, no records are inserted in the
Oracle database.


7.6 Performance
   The time needed to retrieve records from the Digital Library of TUC depends mainly
on the network congestion between the Oracle database and the Gateway. It
usually takes less than five minutes to retrieve about one thousand records from the
Gateway, because a single response contains at most 33 records and about 31 Present
requests are therefore needed. The time needed for parsing and the insertions is
insignificant, as it is less than ten seconds for a thousand records.


7.7 Important technical issues
   We have the following important technical problems with the Digital Library of TUC
that do not yet allow us to deploy DLAlert in full operation:
   Most records that are inserted in the Digital Library of TUC have the date of
   acquisition field empty. The total number of records in the Digital Library is
   close to 60000 and the number of those with the date of acquisition filled is less than
   6500. This means that almost 90% of the records inserted in the database
   cannot be retrieved using this mechanism. This problem can be easily solved with
   the cooperation of the Library of TUC, as long as we ensure that future
   inserts/updates contain this essential bibliographical attribute. There is no need to
   change the data already in the database of the Library because we focus on new
   publications from the time DLAlert is scheduled to operate on a regular basis.
   The way the date of acquisition is stored does not allow us to easily support a
   notification frequency shorter than a month. The dates stored contain only the year and
   month of acquisition. Therefore if we want to retrieve, for example, the records
   inserted during the last day, we have to request all of the records inserted in the
   current month and extract the records that are new since the last request. This problem
   can also be solved optimally as long as we ensure that future inserts/updates in the
   Digital Library contain a more detailed date of acquisition. In the next chapter we



   assume this issue is solved (in some way) and propose a schedule for the actions of
   DLAlert.
   The Greek character set supported by the Digital Library is a non-standard custom
   character set defined by the company that installed and configured the Digital
   Library. Most records include Greek characters and Greek support (both in keyword
   queries and e-mail messages) is a very important issue that must be solved soon.
   Proposed future work that will solve this problem is discussed in Chapter 9.
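
The workaround mentioned in the second issue above (requesting the whole current month and keeping only the records that are new since the last request) amounts to a set difference on the record identifiers. The following Java sketch shows one way to do this under stated assumptions: the class and method names are hypothetical, records are identified by their local number (public_id), and the set of identifiers already seen is kept between requests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Keeps only the records not seen in earlier requests of the same month (sketch). */
final class NewRecordFilter {
    /**
     * Returns the identifiers in monthIds that are not yet in alreadySeen,
     * and adds them to alreadySeen so that the next request skips them.
     */
    static List<String> newSince(List<String> monthIds, Set<String> alreadySeen) {
        List<String> fresh = new ArrayList<>();
        for (String id : monthIds) {
            if (alreadySeen.add(id)) { // add() returns false for known identifiers
                fresh.add(id);
            }
        }
        return fresh;
    }
}
```

This still transfers every record of the month on each request, so it only hides the limitation; a more detailed date of acquisition in the Library remains the preferable solution.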


7.8 Conclusions
   Due to the technical problems explained in 7.7 we have not yet scheduled DLAlert for
full operation. As the total number of records with the date of acquisition filled is
very small, we have developed and validated DLAlert using sets of documents acquired
during certain years (500-800 records). Given the technical issues presented earlier,
scheduling DLAlert to operate under the current conditions would result in users being
notified very rarely. Despite the previously mentioned limitations, in the next chapter
we present how DLAlert’s processes should be scheduled.




Chapter 8
Scheduling DLAlert
   For the system to operate properly and on a regular basis we must schedule all the
related modules and actions. For this purpose we can use the PL/SQL package
DBMS_JOB [30, 33, 35, 40] which provides functionality for:
   Scheduling stored procedures to run unattended at some time in the future or at
   certain intervals of time.
   Handling jobs that are broken for any reason (network or power failure, database
   error etc). Such jobs are retried up to 16 times if they are not successfully executed.
We will not focus on the package, since scheduling the database is a rather complex
administrator’s task. We explain the sequence of actions to be executed, focusing on
two simple scenarios. We discuss the two cases of supporting, or not supporting,
different user categories according to their preferred frequency of notifications.


8.1 Simple scenario

   Collection of new publications -> Synchronization of indices -> Filtering ->
   Construction and transmission of messages -> Deletion of new publication records
   from the database


                          Figure 8.1-1 Simple scenario sequence


When we do not support different categories of users according to their preferred
frequency of notifications, the following actions are executed regularly (every day or
every week, for example).
   First we collect the new publication records that were inserted into the Digital
   Library during a certain interval of time. For this first step we call the module
   Observer.
   Synchronization of the CTXRULE indices is always necessary before filtering, so that
   the index is consistent with the base table of queries. For this purpose we have
   developed the PL/SQL package “indexing” (Section 4.4). This process can also be
   executed in the background during the first step, since the Observer is not a
   CPU-intensive process.
   After synchronizing the indices we find the matching profiles for every new
   publication (PL/SQL package “filtering”).
   Once the matching profiles are collected we summarize all matched publications for
   every user and transmit e-mail messages via SMTP.
   Once the e-mail messages containing all relevant bibliographical attributes have been
   constructed and delivered, there is no need to keep the already filtered documents.
   Unless we want to provide functionality beyond alerting (for example information
   retrieval on the documents stored in the Oracle database) we can delete the
   publication records.
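
The order of the actions listed above can be sketched in a few lines of Python. This is only an illustration of the sequence, not the DLAlert implementation: the step names and the function run_simple_scenario are hypothetical, and each step is passed in as a plain callable standing in for the Observer, the “indexing” and “filtering” packages, the mailer and the clean-up. The sketch runs the steps strictly one after the other, although the text allows the synchronization to overlap with the collection.

```python
from typing import Callable, Dict, List


def run_simple_scenario(steps: Dict[str, Callable[[], None]],
                        keep_records: bool = False) -> List[str]:
    """Run the actions of the simple scenario in order; return the steps executed."""
    order = ["collect", "synchronize", "filter", "notify"]
    if not keep_records:
        order.append("delete")      # deletion is optional according to the text
    executed = []
    for name in order:
        steps[name]()               # an exception here stops the later steps
        executed.append(name)
    return executed


if __name__ == "__main__":
    log: List[str] = []
    # Hypothetical stand-ins: each step only records that it was called.
    stubs = {name: (lambda n=name: log.append(n))
             for name in ["collect", "synchronize", "filter", "notify", "delete"]}
    print(run_simple_scenario(stubs))
```

Passing keep_records=True skips the final deletion, which corresponds to the case where the stored documents are also used for information retrieval.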


8.2 Supporting three types of desired notification frequencies

In order to support different notification frequencies we have three sets of actions, as
shown in Figures 8.2-1 to 8.2-3. We categorize the actions according to the interval
between two subsequent executions. “Every day” actions are executed every day,
regardless of whether that day is also the end of a week or month. For example, at the
end of each month all three sets are executed sequentially.

   Collection of new publications during the last day -> Synchronization of indices ->
   Filtering for publication records acquired during the last day -> Construction and
   transmission of messages only for users with desired frequency = 'DAY'



                                 Figure 8.2-1 Actions executed every day
1) “Every day” actions.
     The first step is to collect the new publication records inserted into the Digital
     Library during the last day. As the first step we always call the module Observer.
     Synchronization of indices is necessary in order for filtering to produce results
     consistent with the base table of queries.
     The next step is to filter the publication records inserted during the last day.
     The filtering module finds all matched profiles for every publication and stores the
     primary keys of those rows in a table. The data produced are kept inside the
     database until all relevant e-mail messages are delivered.
     Then we summarize all matched publications for every single user and transmit
     e-mail messages via SMTP. We perform this action only for users that have defined
     ‘DAY’ as the desired frequency of notifications.

   Construction and transmission of messages only for users with desired
   frequency = 'WEEK'




                      Figure 8.2-2 Actions executed once in a week
2) “Every week” actions.
      Since we have already inserted and filtered publications for every single day of the
     week, the only action that remains is to construct and send e-mail messages to
     users with ‘WEEK’ as the desired notification frequency.

   Construction and transmission of messages only for users with desired
   frequency = 'MONTH' -> Deletion of unnecessary publication records from the database


                     Figure 8.2-3 Actions executed once in a month
3) “Every month” actions.
     We have already inserted and filtered publications for every day up to the end of
     the month. Since we have already notified users of the first two categories (‘DAY’
     and ‘WEEK’), the action that remains is to construct and send e-mail messages to
     users with ‘MONTH’ as the desired notification frequency.
     Once the e-mail messages containing all relevant bibliographical attributes for
     publications over the last month have been constructed and delivered, there is no
     need to keep the already filtered documents. Optionally we can delete the
     unnecessary publication records from the Oracle database.
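
A minimal Python sketch of how the three sets of actions could be selected for a given date is shown here. It is an illustration under stated assumptions rather than the scheduling used by DLAlert: the functions action_sets and recipients are hypothetical, a week is assumed to end on Sunday, and the end of a month is taken to be its last calendar day. Following the remark that all three sets are executed at the end of each month, the weekly set is also run on the last day of the month.

```python
import calendar
from datetime import date
from typing import Dict, List


def action_sets(today: date) -> List[str]:
    """Return which sets of actions ('DAY', 'WEEK', 'MONTH') run on `today`."""
    last_day = calendar.monthrange(today.year, today.month)[1]
    end_of_month = today.day == last_day
    end_of_week = today.weekday() == 6      # assumption: weeks end on Sunday
    sets = ["DAY"]                          # daily actions always run
    if end_of_week or end_of_month:         # all three sets run at month end
        sets.append("WEEK")
    if end_of_month:
        sets.append("MONTH")
    return sets


def recipients(frequency_by_user: Dict[str, str], today: date) -> List[str]:
    """Users whose desired frequency matches a set that runs today, sorted by name."""
    due = set(action_sets(today))
    return sorted(user for user, freq in frequency_by_user.items() if freq in due)


if __name__ == "__main__":
    users = {"anna": "DAY", "babis": "WEEK", "costas": "MONTH"}   # made-up users
    for day in (date(2003, 6, 25), date(2003, 6, 29), date(2003, 6, 30)):
        print(day, action_sets(day), recipients(users, day))
```

Running the weekly set at the end of the month as well means that users with ‘WEEK’ as their frequency are notified before the optional monthly deletion removes the filtered records.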


8.3 Conclusions
   We have completed the presentation of the implementation and development of
DLAlert. We think that with minor configuration changes, mainly on the Digital Library of
TUC (Section 7.7), this system could easily be deployed for full operation and run on a
regular basis. In the previous sections we presented two operating scenarios of DLAlert
and the corresponding actions to be scheduled for each. In the next chapter we propose
future work on DLAlert.

Chapter 9
Concluding remarks
   We think that an alerting service such as the one already developed would prove
very helpful to the academic community of the Technical University of Crete. DLAlert
should be enhanced with more functionality, such as Greek support, integration of
various sources and an even easier to use web interface. In addition to DLAlert, a search
engine that will support several information providers is being developed by the
Intelligence Systems Laboratory. In the following section we propose future work on
DLAlert.


9.1 Future work on DLAlert
   The following functionalities are proposed as future enhancements to DLAlert. We think
that if some of these are supported, DLAlert could become a popular alerting service,
since no similar system has been developed in Greece at this time.


   Greek support
   Greek support is a very important issue since most records in the Digital Library of
TUC contain bibliographical attributes in this language. The character set used by the
Digital Library is a non-standard custom character set defined by the company that
installed and configured this system. Therefore we must either provide support for this
custom character set or convert the Greek characters into a standard character set
supported by Oracle. The Oracle RDBMS provides Locale Builder, a useful tool for this
purpose that would help us manipulate character set types, character mappings or
classifications.
   The standard Boolean and proximity queries on keywords can be supported by
Oracle Text in the Greek language. However, the stemmer provided with Oracle Text, the
mechanism that expands queries using tokens with the same linguistic root as the
requested term, does not support Greek. Developing a stemmer from scratch is hard and
complex work, since it requires knowledge of linguistics and literature. Stemmers for
the Greek language have already been developed by students of the Department of
Electronic and Computer Engineering [55]. Oracle also provides a database of English
and French themes (presented in 2.4.3) organized hierarchically and connected to each
other with relations that describe their semantic content (Synonym, and Broader,
Narrower or Related Term). In order to support this functionality in Greek we should
extract the main concepts found in the documents of the Digital Library and organize
them hierarchically.


     XML records classification
     Instead of representing incoming publications as records containing the
bibliographical attributes in plain text, we can represent publications as XML documents
with sections defined as tags. For example, consider a publication with Title: “The
international business book” and Author: “Vincent Guy, John Mattock”. The corresponding
XML document would be
     <publication_record>
        <title> The international business book </title>
        <author> Vincent Guy, John Mattock </author>
     </publication_record>
Oracle Text provides query operators for XML section searching, such as the operator
WITHIN. We use the WITHIN operator to narrow a query down to document sections.
For example, to request documents with Title containing the word “business” we issue
the CTXRULE query
        business WITHIN title
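
The representation of a record as XML, and the effect of restricting a keyword to one section, can be illustrated with a short Python sketch. It is not Oracle Text and does not reproduce its matching rules: the functions record_to_xml, xml_to_record and within are hypothetical, the root tag publication_record is our own choice (an XML element name cannot contain a space), and the match is a simple case-insensitive whole-word test.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict


def record_to_xml(attributes: Dict[str, str]) -> str:
    """Wrap each bibliographical attribute in a tag of the same name."""
    root = ET.Element("publication_record")
    for name, value in attributes.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


def xml_to_record(xml_text: str) -> Dict[str, str]:
    """Extract the attributes back into plain strings, e.g. for an e-mail body."""
    return {child.tag: (child.text or "") for child in ET.fromstring(xml_text)}


def within(word: str, section: str, xml_text: str) -> bool:
    """Rough stand-in for `word WITHIN section`: whole-word, case-insensitive."""
    text = xml_to_record(xml_text).get(section, "")
    return word.lower() in re.findall(r"\w+", text.lower())


if __name__ == "__main__":
    record = {"title": "The international business book",
              "author": "Vincent Guy, John Mattock"}
    xml_text = record_to_xml(record)
    print(xml_text)
    print(within("business", "title", xml_text))    # keyword found in the title
    print(within("business", "author", xml_text))   # keyword absent from the author
```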
This approach has several advantages and disadvantages:
 o    It allows even more complex queries referencing sections that are not included in
      the current implementation of DLAlert (like “Anywhere” clauses). Using this
      approach we can easily request documents containing a keyword in any attribute or
      in a custom set of attributes. We can even declare nested sections on records.
 o    The filtering module will speed up in this case, since we will need only one
      CTXRULE index for the text queries regardless of the number of attributes
      supported. The time consumed for indexing will not improve, since it depends
      mainly on the total number of queries inserted / updated.
 o    The WITHIN operator cannot be combined with the ABOUT operator, therefore we
      cannot request themes inside sections.
 o    It requires a more complex parser for the profiles, since queries referencing more
      than one attribute must be concatenated into a single CTXRULE query before being
      inserted into the database. The operator WITHIN should not be visible to the user,
      so the text query stored in the database must be re-parsed and broken into
      simple CTXRULE statements before being displayed on the GUI. For example, to
      request documents with Title containing the word “business” and Author containing
      the word “John”, the profile parser produces the following CTXRULE query.

      Query visible to the user from the GUI:
         Author   John
         Title    business
      CTXRULE query stored in the database (output of the profile parser):
         (business WITHIN title) AND (John WITHIN author)

 o    The publication records should be converted to XML before being inserted into the
      database. Also the matching documents’ bibliographical attributes should be
      extracted from XML into plain text strings before the construction of e-mail
      messages. For this purpose we can develop the necessary Java stored procedures
      that will manipulate the XML documents.
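
The two directions of the profile parser described above (joining the rows of the form into one stored query, and splitting the stored query back into rows for display) can be sketched as follows. This is a simplified illustration, not the DLAlert parser: the functions to_ctxrule and from_ctxrule are hypothetical, rows are always combined with AND, and section names are assumed to be single lower-case words.

```python
import re
from typing import Dict

# One "(query WITHIN section)" clause of the stored query.
_CLAUSE = re.compile(r"\((.+?) WITHIN (\w+)\)")


def to_ctxrule(profile: Dict[str, str]) -> str:
    """Join the non-empty rows of the form into one CTXRULE query string."""
    clauses = [f"({query} WITHIN {attribute.lower()})"
               for attribute, query in profile.items() if query.strip()]
    return " AND ".join(clauses)


def from_ctxrule(stored: str) -> Dict[str, str]:
    """Split a stored query back into one row per attribute for display."""
    return {section.capitalize(): query
            for query, section in _CLAUSE.findall(stored)}


if __name__ == "__main__":
    stored = to_ctxrule({"Title": "business", "Author": "John"})
    print(stored)                  # query stored in the database
    print(from_ctxrule(stored))    # rows shown again on the form
```

A regular expression is enough for this flat form. Queries that themselves contain WITHIN or deeply nested parentheses would need a real grammar-based parser.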


     “Anywhere” queries – variable number of rows in profiles
     After we carry out the changes mentioned earlier we can support a more convenient
Profile insert/update form like the one below. We can then define queries with a variable
number of rows and request keywords in any section of the document.




                              Figure 9.1-1 Profile insert form

   Automatic word stemming expansion on queries
   Instead of expecting the user to enter the symbol $ in order to request tokens with
the same linguistic root as the requested term, we can enhance the parser so that it
automatically puts the stemming symbol $ before all queries. As we said previously, this
functionality cannot currently support the Greek language if we use the supplied
Oracle stemmer. For example, suppose we request documents that contain the words
“business” and “management”. DLAlert can automatically include all the tokens with the
same linguistic root as equivalent terms to the requested keywords.

   Query visible to the user from the GUI:
      business and management
   CTXRULE query stored in the database (output of the profile parser):
      $( business and management )
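
A possible form of this parser enhancement is sketched below in Python. It is an assumption on our part, not the DLAlert parser: the function add_stemming is hypothetical, the list of operator words is only a small illustrative subset, and the symbol $ is placed before each keyword separately instead of before the whole parenthesised query shown above.

```python
import re

# Illustrative subset of operator words that must not receive a '$'.
_OPERATORS = {"and", "or", "not", "near"}


def add_stemming(query: str) -> str:
    """Prefix every keyword with '$'; leave operators and existing '$' terms alone."""
    def mark(match):
        token = match.group(0)
        if token.lower() in _OPERATORS or token.startswith("$"):
            return token
        return "$" + token
    return re.sub(r"\$?\w+", mark, query)


if __name__ == "__main__":
    print(add_stemming("business and management"))
```

Terms that the user has already marked with $ are left unchanged, so applying the function twice gives the same result as applying it once.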



   Z39.50 sources support
   Integrating various Digital Libraries is a target that can be easily achieved, since:
   o   Almost all Z39.50 Gateways in Greece (Section 2.3.6) support the same record
       format (UNIMARC) and are implemented according to the same Z39.50
       specification (Geac Advance Z39.50 version 2). We can already retrieve records
       from these databases for our alerting service as long as they support date of
       acquisition as an access point (attribute).
   o   For Gateways that support a different implementation of Z39.50 with a different
       record format (for example USMARC, XML), only minor changes to the parser
       sub-component of the Observer module are needed.
If we decide that multiple sources are to be supported, an algorithm that detects
duplicate entries for publications among different Digital Libraries is necessary. DLAlert
should not send multiple notifications for a single document found in more than one
database.


   Other information providers, alerting services integration
   We could support in the future any other type of information provider (cooperative or
not, active or passive) as long as we develop the necessary functionality. Any
functionality that can be developed in Java can be loaded and executed inside the
Oracle RDBMS as Java Stored Procedure(s). Other existing alerting services can also
be supported, provided they offer a record format that can be parsed so that the
bibliographical attributes can be identified. An algorithm for rejecting duplicate
publication entries is needed in this case too.


   More impressive web GUI
   During the implementation of DLAlert we focused on the functionality of the GUI and
the way the user interacts with the system. A more attractive Web GUI can easily be
developed on top of the functionality that already exists.


   Similar search and alert capabilities
   A search engine integrated in the same interface would be very useful so that the
user can find which already acquired publications are matched by his profile. Along
with DLAlert, a search engine that will support several information providers is being
developed by the Intelligence Systems Laboratory. A future version of DLAlert should
provide search functionality inside the Profile insert/update form, as in the figure below.




                          Figure 9.1-2 Profile insert / search form


   Journal support
   The mechanism already developed as an alerting service on new publications is not
practical in the area of scientific journals. Every journal series acquired by the Digital
Library of the Technical University of Crete is represented by a single record inside the
database (sample record in the following figure). Therefore DLAlert cannot yet notify
users about each issue of a journal, but only sends an e-mail message when the Library
takes out a new subscription. Supporting specific journal requests in Profiles is an
essential feature offered by most popular Alerting Services for scholarly material
(Section 2.1). The journals supported could be organized hierarchically according to
their scientific area. Users should be notified regularly not only about a new journal
subscription but also about each separate issue.




                          Figure 9.1-3 Sample record of a journal


   Hyper-links in e-mail messages
   Providing all the bibliographical attributes of new publications in e-mail messages is,
for now, an accurate way of notifying the user. If we want to provide more information
on new publications (like the Table of Contents), it would be preferable to include
hyper-links to web pages containing all relevant data instead.


   Notifications in various formats (plain text, HTML, XML)
   Providing notifications in various formats would be a useful feature. Some users may
prefer shorter plain text e-mail messages. Also XML messages would be useful in case
we send the notifications to another alerting service or application.
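
A small Python sketch of how the same notification could be produced in the three formats follows. It is an illustration only, with hypothetical function names (render_text, render_html, render_xml, render) and hypothetical tag names (notification, publication). Each publication is assumed to be a dictionary whose keys are attribute names that are also valid XML tag names.

```python
import html
import xml.etree.ElementTree as ET
from typing import Dict, List

Publication = Dict[str, str]


def render_text(pubs: List[Publication]) -> str:
    """One 'Attribute: value' line per attribute, blank line between publications."""
    return "\n\n".join("\n".join(f"{k}: {v}" for k, v in p.items()) for p in pubs)


def render_html(pubs: List[Publication]) -> str:
    """An unordered list with one item per publication; values are HTML-escaped."""
    items = "".join(
        "<li>" + "<br>".join(f"<b>{html.escape(k)}</b>: {html.escape(v)}"
                             for k, v in p.items()) + "</li>"
        for p in pubs)
    return f"<ul>{items}</ul>"


def render_xml(pubs: List[Publication]) -> str:
    """A <notification> element with one <publication> child per record."""
    root = ET.Element("notification")
    for p in pubs:
        node = ET.SubElement(root, "publication")
        for k, v in p.items():
            ET.SubElement(node, k.lower()).text = v
    return ET.tostring(root, encoding="unicode")


RENDERERS = {"text": render_text, "html": render_html, "xml": render_xml}


def render(pubs: List[Publication], fmt: str = "text") -> str:
    """Render the notification in the format preferred by the user."""
    return RENDERERS[fmt](pubs)
```

The preferred format could be stored in the user's profile next to the notification frequency, so that the message construction step only has to look up the right renderer.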

   Using DIAS algorithms inside the database as Java stored procedures
   Functionality developed in the DIAS project could be integrated into the Oracle
database if an implementation in Java that handles database records becomes available
in the future. As we explained in Section 2.2.4, DIAS provides efficient algorithms for
document filtering and profile matching. In this case the use of Oracle Text and the
CTXRULE index would not be necessary. Java Stored Procedure technology enables us
to integrate almost anything that can be implemented in Java as a stored procedure
inside the Oracle database.


   Ranking of matching documents according to relevance
   Ranking of matching documents according to relevance is not supported for the
CTXRULE index type in the current version of Oracle Text. Supporting this feature
would require enhancing the filtering functionality currently available.


   Relevance feedback
   Relevance feedback on notifications means that the user can evaluate the relevance
of the delivered documents so that the ranking results are improved in later filtering. This
feature is also not currently supported for the CTXRULE index type, and implementing it
will require much development work. Java Stored Procedure technology will be most
useful if we try to develop functionality with high computational complexity, such as
enhancing the filtering mechanism already available.



9.2 Conclusion
   The main achievement of this dissertation is the development of a centralized
alerting service for the Digital Library of the Technical University of Crete with the ability
to integrate many information providers. Once the technical issues presented in Section
7.7 are solved, DLAlert can be scheduled to operate on a regular basis. We hope that this
dissertation will be a good starting point for further work on this application.

Bibliography

[1] Information Retrieval (Z39.50): Application Service Definition and Protocol, March
    29 - May 13, 2002 Specification National Information Standards Organization.
    Available at: www.niso.org/standards/resources/Z39-50-200x.pdf
[2] Current Awareness and Alerting Services Alphabetical Listing, Sheffield Hallam
    University. http://www.shu.ac.uk/services/lc/se/alertingservicesalpha.html
[3] Alerting Systems and Services, Freie Universität Berlin.
    http://page.inf.fuberlin.de/~hinze/projects/AS.html
[4] D. Faensen, L. Faulstich, H. Schweppe, A. Hinze, and A. Steidinger. Hermes -- a
    notification service for digital libraries. In ACM/IEEE Joint Conference on Digital
    Libraries, Roanoke, Virginia, USA, June 24-28, 2001.
    Available at: http://www.inf.fu-berlin.de/inst/ag-db/publications/2001/jcdl01.pdf
[5] D. Faensen, A. Hinze, and H. Schweppe. Alerting in a digital library environment --
    do channels meet the requirements? Freie Universität Berlin, 1998.
    Available at: ftp://ftp.inf.fu-berlin.de/pub/reports/tr-b-98-08.ps.gz
[6] M. Koubarakis, T. Koutris, C. Tryfonopoulos, P. Raftopoulou, Information Alert in
    Distributed Digital Libraries: The Models, Languages and Architecture of DIAS. 6th
    European Conference on Research and Advanced Technology for Digital Libraries
    (ECDL 02), 16-18 September 2002, Pontifical Gregorian University, Rome, Italy.
    Available at: http://www.intelligence.tuc.gr/publications/information-ecdl02.pdf
[7] M. Koubarakis, C. Tryfonopoulos, P. Raftopoulou, T. Koutris, Data Models and
    Languages for Agent-Based Textual Information Dissemination. 6th International
    Workshop on Cooperative Information Systems (CIA 02), 18-20 September 2002,
    Universidad Rey Juan Carlos, Madrid, Spain.
    Available at: http://www.intelligence.tuc.gr/publications/data-cia02-long.zip
[8] M. Koubarakis and C. Tryfonopoulos. Peer-to-peer agent systems for textual
    information dissemination: algorithms and complexity. In UK Workshop on
    Multiagent Systems (UKMAS-2002), Liverpool, UK, 18 & 19 December, 2002.
    Available at: http://www.intelligence.tuc.gr/publications/peer2peer-ukmas02.pdf
[9] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison
    Wesley, 1999.
[10] C.D. Manning and H. Schütze. Foundations of Statistical Natural Language
    Processing. The MIT Press, Cambridge, Massachusetts, 1999.
[11] National Information Standards Organization: http://www.niso.org
[12] Z39.50 Standard Maintenance Agency. http://www.loc.gov/z3950/agency/
[13] MARC standards, Library of Congress Network Development and MARC standards
    Office. http://www.loc.gov/marc/
[14] Extensible Markup Language (XML) 1.0 (Second Edition) W3C Recommendation 6
    October 2000. Available at: http://www.w3.org/TR/2000/REC-xml-20001006.pdf
[15] Knowledge Integration JZkit: http://developer.k-int.com/products/jzkit/
[16] Universal Bibliographic Control and International MARC Core Programme:
    http://www.ifla.org/VI/3/p1996-1/UNIMARC.htm
[17] UNIMARC Manual: Bibliographic Format 1994:
    http://www.ifla.org/VI/3/p1996-1/sec-uni.htm
[18] Z39.50 Text Part 9: Type-1 and Type-101 Queries:
    http://www.loc.gov/z3950/agency/markup/09.html
[19] Bib-1 Attribute Set: http://lcweb.loc.gov/z3950/agency/defns/bib1.html
[20] Registry of Z39.50 Object Identifiers: http://lcweb.loc.gov/z3950/agency/defns/oids.html
[21] Oracle Technology Network: http://otn.oracle.com/
[22] Oracle Corporation: http://www.oracle.com/
[23] Oracle Text Application Developer’s Guide Release 9.2 Oracle Corporation.
    http://otn.oracle.com/docs/products/oracle9i/doc_library/release2/text.920/a96517.pdf
[24] Oracle Text Reference Release 9.2. Oracle Corp.
    http://otn.oracle.com/docs/products/oracle9i/doc_library/release2/text.920/a96518.pdf
[25] Oracle Text Documentation http://otn.oracle.com/products/text/content.html
[26] Oracle Text Discussion Forum http://otn.oracle.com/forums/text.html
[27] Oracle Text Technical Overview (The CTXRULE Indextype):
    http://technet.oracle.com/products/text/x/Tech_Overviews/text_901.html
[28] ISO 2788:1986 Documentation -- Guidelines for the establishment and development
    of monolingual thesauri.
[29] ANSI/NISO Z39.19 - 1993 Guidelines for the Construction, Format, and
    Management of Monolingual Thesauri. Available at:
    http://www.niso.org/standards/resources/Z39-19.html
[30] Sean Dillon, Christopher Beck, Thomas Kyte. Beginning Oracle Programming, 2002
    Wrox Press.
[31] Application Developer's Guide – Fundamentals. 2002, Oracle Corporation. Available
    at: http://otn.oracle.com/docs/products/oracle9i/doc_library/release2/appdev.920/a96590.pdf
[32] PL/SQL User's Guide and Reference. 2002, Oracle Corporation. Available at:
    http://otn.oracle.com/docs/products/oracle9i/doc_library/release2/appdev.920/a96624.pdf
[33] Supplied PL/SQL Packages and Types Reference. Oracle Corporation. Available at:
    http://otn.oracle.com/docs/products/oracle9i/doc_library/release2/appdev.920/a96612.pdf
[34] The Source for Java Technology. Java home page: http://java.sun.com/
    Sun Microsystems, Inc.
[35] Bjarki Holm, John Carnell, Tomas Stubbs, Poornachandra Sarang, Kevin Mukhar,
    Sant Singh, Jaeda Goodman, Ben Marcotte, Mauricio Naranjo, Anand Raj, Mark
    Piermanini. Oracle 9i Java Programming: Solutions for Developers Using PL/SQL
    and Java. 2002 Wrox Press.
[36] Java Developer's Guide. 2002, Oracle Corporation. Available at:
    http://otn.oracle.com/docs/products/oracle9i/doc_library/release2/java.920/a96656.pdf
[37] Java Stored Procedures Developer's Guide. 2002, Oracle Corporation. Available at:
    http://otn.oracle.com/docs/products/oracle9i/doc_library/release2/java.920/a96659.pdf
[38] JDBC Developer's Guide and Reference. 2002, Oracle Corporation. Available at:
    http://otn.oracle.com/docs/products/oracle9i/doc_library/release2/java.920/a96654.pdf
[39] Supplied Java Packages Reference. 2002, Oracle Corporation. Available at:
    http://otn.oracle.com/docs/products/oracle9i/doc_library/release2/appdev.920/a96609.pdf
[40] Bradley D. Brown, Oracle9i Web Development. November 2001 McGraw-Hill/Osborne
    Media.
[41] Developing Java 2 Platform, Enterprise Edition (J2EE) Compatible Applications
    Roles-based Training for Rapid Implementation. January 2001, Sun Educational
    Services Java Technology Team. Available at: http://java.sun.com/j2ee/white/j2ee.pdf
[42] Developing Java 2 Platform, Enterprise Edition (J2EE). 1999, Sun Microsystems,
    Inc. Available at: http://java.sun.com/j2ee/white/j2ee_guide.pdf
[43] Jonathan B. Postel, RFC 821 – Simple Mail Transfer Protocol. August 1982
    Information Sciences Institute, University of Southern California. Available at:
    http://www.ietf.org/rfc/rfc821.txt
[44] JavaMail 1.3 Release, Sun Microsystems, Inc.
    Available at: http://java.sun.com/products/javamail/
[45] Core JavaScript Guide 1.5. 2000, Netscape Communications Corp.:
    http://devedge.netscape.com/library/manuals/2000/javascript/1.5/guide/
[46] Marty Hall, Core Servlets and JavaServer Pages, Sun Microsystems
    Press/Prentice Hall. Available at http://pdf.coreservlets.com/
[47] JavaServer Pages Documentation, Sun Microsystems. Available at:
    http://java.sun.com/products/jsp/docs.html
[48] OracleJSP Support for JavaServer Pages Developer's Guide and Reference,
    Release 1.1.3.1 Oracle Corporation. Available at:
    http://otn.oracle.com/docs/tech/java/oc4j/pdf/jsp1131.pdf
[49] Peter Koletzke, Paul Dorsey, Avrom Faderman. Oracle9i JDeveloper Handbook.
    December 2002 McGraw-Hill/Osborne Media.
[50] JavaCC Home page: http://www.experimentalstuff.com/Technologies/JavaCC/
[51] SQLJ Developer's Guide and Reference. 2002 Oracle Corporation. Available at:
    http://otn.oracle.com/docs/products/oracle9i/doc_library/901_doc/java.901/a90212.pdf


Related dissertations
[52] Stratos Ydraios. “A query and notification service based on mobile agents for rapid
    implementation of peer to peer applications”, 2003, Department of Electronic and
    Computer Engineering, Technical University of Crete.
[53] Christos Tryfonopoulos. “Agent-Based Textual Information Dissemination: Data
    Models, Query Languages, Algorithms and Computational Complexity”, 2002,
    Department of Electronic and Computer Engineering, Technical University of Crete.
[54] Theodoros Koutris. “Textual information dissemination in distributed agent systems:
    Architectures and efficient filtering algorithms”, 2003, Department of Electronic and
    Computer Engineering, Technical University of Crete.
[55] Sotiris Diplaris, Dimitris Pratsolis. “Development of statistic linguistic models for the
    Greek language with stemming and part of speech functionality”, 2001, Department
    of Electronic and Computer Engineering, Technical University of Crete.

PPTX
A Presentation on Artificial Intelligence
Building Integrated photovoltaic BIPV_UPV.pdf
A novel scalable deep ensemble learning framework for big data classification...
Web App vs Mobile App What Should You Build First.pdf
gpt5_lecture_notes_comprehensive_20250812015547.pdf
Zenith AI: Advanced Artificial Intelligence
Heart disease approach using modified random forest and particle swarm optimi...
A comparative study of natural language inference in Swahili using monolingua...
Mushroom cultivation and it's methods.pdf
Enhancing emotion recognition model for a student engagement use case through...
WOOl fibre morphology and structure.pdf for textiles
A Presentation on Touch Screen Technology
A comparative analysis of optical character recognition models for extracting...
Tartificialntelligence_presentation.pptx
From MVP to Full-Scale Product A Startup’s Software Journey.pdf
cloud_computing_Infrastucture_as_cloud_p
Programs and apps: productivity, graphics, security and other tools
SOPHOS-XG Firewall Administrator PPT.pptx
Hindi spoken digit analysis for native and non-native speakers
TLE Review Electricity (Electricity).pptx
A Presentation on Artificial Intelligence

Thesis:"DLAlert and Information Alert System for Digital Libraries"

  • 1. DLAlert - AN INFORMATION ALERT SYSTEM FOR DIGITAL LIBRARIES by GIANNIS ALEXAKIS Submitted in partial fulfillment of the requirements for the diploma of Electronic and Computer Engineering Technical University of Crete June 2003 Guidance Committee : Manolis Koubarakis Associate Professor (supervisor) Stavros Christodoulakis Professor Euripides Petrakis Associate Professor
  • 2. Information Alert System for Digital Libraries - 2 - Abstract As the information available on the Internet increases from day to day, a user has to spend a lot of time searching, browsing and rejecting useless information in order to stay up to date, until he finds exactly what he is looking for. A new type of web application, called an alerting service, assumes the responsibility of collecting all relevant data in a specific area and delivering them to each user regularly according to his fields of interest. Alerting services could prove to be very helpful in the area of digital libraries, as the quantity of scientific publications doubles every 10-15 years and classic search applications become less and less able to handle this information overload on their own. In this text we describe the design and implementation of DLAlert, an alerting system for digital libraries developed for the Library of the Technical University of Crete and ready to support many other sources including libraries, publishing houses and other alerting services. DLAlert is a web application that receives requests through a web page about the publications that interest every user, stores profiles about every user’s fields of interest, collects information about new publications from the Technical University Library gateway, produces notifications for each user and sends an appropriate e-mail containing all relevant bibliographical data.
  • 3. Acknowledgements I would like to express my grateful thanks to the following people for their help, advice and information in developing this application. My sincere gratitude goes to my supervisor Manolis Koubarakis for his precious guidance and advice. I also thank Stamatis Andranakis for his help during the implementation of the Z39.50 client, Epemenidis Voutsakis for his help with the installation of the Oracle database, Christos Tryfonopoulos for his help during the validation of the filtering module of DLAlert, Ian Ibbotson from Knowledge Integration for his technical advice on the JZKit toolkit, and all the undergraduate and postgraduate students of the Intelligent Systems Laboratory for their cooperation and support.
  • 4. Contents
      Chapter 1 Introduction
        1.1 Alerting applications – description
        1.2 Alerting applications – examples
        1.3 DLAlert - Alert system for Digital Libraries
        1.4 Organization of the dissertation
      Chapter 2 Related Work
        2.1 Alerting applications on the web
          2.1.1 Elsevier Contents Direct
          2.1.2 Kluwer Alert
          2.1.3 Springer Alert
          2.1.4 Hermes
        2.2 DIAS (Distributed Information Alert System)
          2.2.1 The WP (Word Pattern) data model
          2.2.2 The AWP (Attribute based Word Pattern) data model
          2.2.3 The AWPS data model
          2.2.4 Some interesting problems
        2.3 The Z39.50 protocol
          2.3.1 Initialization facility
          2.3.2 Search facility
          2.3.3 Type-1 query syntax
          2.3.4 Retrieval facility
          2.3.5 Record format (UNIMARC)
          2.3.6 Z39.50 and interoperability
        2.4 Oracle Text and filtering applications
          2.4.1 The CTXRULE index
          2.4.2 Creating the tables
          2.4.3 Language of the stored queries
          2.4.4 Indexing the stored queries
          2.4.5 Filtering
        2.5 Conclusions
      Chapter 3 System Overview
        3.1 DLAlert architecture
        3.2 Main Technologies used
        3.3 Graphical User Interface overview
        3.4 The language of the text queries
        3.5 Conclusions
  • 5. Chapter 4 Database Schema
        4.1 Requirements analysis
        4.2 Relational schema
        4.3 Key consistency and atomic transactions
        4.4 Indexing of the stored queries
        4.5 Conclusions
      Chapter 5 PL/SQL packages
        5.1 Filtering module
          5.1.1 The algorithm
        5.2 Notifying module
          5.2.1 The UTL_SMTP package
          5.2.2 Collecting the matched publications for a single user
        5.3 Performance
        5.4 Conclusions
      Chapter 6 The Graphical User Interface
        6.1 Middle application tier architecture
        6.2 The Enterprise Java Bean
        6.3 OC4J custom tag library
        6.4 Preventing CTXRULE index errors
        6.5 Parsing the text queries
        6.6 Conclusions
      Chapter 7 The Observer
        7.1 Information providers
        7.2 Observer architecture
        7.3 JZKit API
        7.4 UNIMARC parser
        7.5 SQLJ functionality
        7.6 Performance
        7.7 Important technical issues
        7.8 Conclusions
      Chapter 8 Scheduling DLAlert
        8.1 Simple scenario
        8.2 Supporting three types of desired notification frequencies
        8.3 Conclusions
      Chapter 9 Concluding remarks
        9.1 Future work on DLAlert
        9.2 Conclusion
      Bibliography
  • 6. Chapter 1 Introduction A survey made two years ago found that more than one-third of all Internet users spend more than 2 hours per week searching the web∗. The same survey reported that 86 per cent of users thought searching should be made more efficient. However, coping with all the information available today on the Internet is becoming an ever harder and more time-consuming task. A user has to spend a lot of time searching, browsing and rejecting useless information in order to stay up to date, until he finds exactly what he is looking for. A new type of application, called an alerting service, promises to deliver to everyone what they are interested in. The user is informed regularly about a specific area instead of constantly chasing after hyper-links and search engines. In the following sections we describe the use of alerting services on today’s Internet, and particularly in the domain of digital libraries and publishing houses, which is the main topic of this dissertation. 1.1 Alerting applications – description An alerting service is any application that • Collects all relevant information in a specific area. This information could refer to events in the real world or to data retrieved from databases on the Internet. • Receives requests about this information from users. These requests, called profiles, are stored and treated as long-standing queries that are evaluated regularly without the constant interaction of the user. • Filters incoming data according to these stored queries and finds the matching profiles whose conditions are satisfied. • Produces notifications that are adapted to the user’s needs and sends all relevant data in his fields of interest, as defined by his profiles. ∗ Source “Search Engine Watch” URL : http://www.searchenginewatch.com/
  • 7. The main difference from classic information retrieval applications is that the latter evaluate every user’s request only once and send the results immediately. In a selective dissemination of information (alert) system the queries are stored and the user is notified every time there is a new event or new data that might interest him. 1.2 Alerting applications – examples Alerting services are becoming more and more helpful, and selective dissemination of information techniques can be applied in many domains. Consider the following examples: • A news portal broadcasting e-mail messages containing the daily news to registered members, adapted to every user’s fields of interest. • An e-commerce company integrating information about merchandise from various providers and sending advertisements according to every customer’s previous purchases. • A stock exchange company alerting stockholders upon certain events in the stock market (an increase or decrease of current rates and prices). Every application like the ones above has its own characteristics: • The way the system collects information from various sources. Sources could be data retrieved from databases representing events in the real world. Every selective dissemination service monitors different events according to its specific area of interest. • The way users request this information and define the conditions upon which they want to be notified. The system should provide a standard way to do this. The so-called language of the profiles is the language in which the stored queries are defined. Queries can be expressed directly in expressions that the system can recognize, through a simple user-friendly graphical interface with buttons and drop-down lists, or even implicitly by monitoring the user’s previous actions (as in the second example above).
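The contrast between one-shot retrieval and stored profiles described in Section 1.1 can be illustrated with a small sketch. The Python code below is only an illustration under simplifying assumptions and is not how DLAlert is implemented: the class name AlertService, the methods subscribe and publish, and the representation of a profile as a Python predicate are all hypothetical. A profile is stored once and is then re-evaluated against every batch of new items without further user interaction.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumption: a profile is modelled as a predicate over an incoming item.
Profile = Callable[[dict], bool]


@dataclass
class AlertService:
    """Toy selective-dissemination loop (hypothetical, not DLAlert's design)."""

    profiles: Dict[str, List[Profile]] = field(default_factory=dict)

    def subscribe(self, user: str, profile: Profile) -> None:
        # The query is stored once and reused for every future batch of items.
        self.profiles.setdefault(user, []).append(profile)

    def publish(self, items: List[dict]) -> Dict[str, List[dict]]:
        # Filter the incoming items against all stored profiles and group the
        # matches per user, ready to be turned into notifications.
        notifications: Dict[str, List[dict]] = {}
        for item in items:
            for user, queries in self.profiles.items():
                if any(query(item) for query in queries):
                    notifications.setdefault(user, []).append(item)
        return notifications


service = AlertService()
service.subscribe("alice", lambda item: "library" in item["title"].lower())
matches = service.publish([{"title": "Digital Library Systems"},
                           {"title": "Stock markets"}])
```

In a classic retrieval system the predicate would be evaluated once against the existing collection; here it survives between calls to publish, which is the essential property of a long-standing query.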
  • 8. 1.3 DLAlert - Alert system for Digital Libraries The number of scientific publications is estimated to double every 10-15 years. This means that the number of scientific papers, journals and books published before 1990 is close to the number published in the last decade. It becomes harder for people to follow this evolution of technology without spending a lot of time daily on the Internet searching and browsing. Additionally, technological knowledge is often provided by a variety of independent organizations (universities, publishing houses, research departments in companies etc.). As new technological branches appear every day, there is a need for tools that help people navigate through all this available information. Search engines dedicated to literature (scientific or not) and alerting applications in the same area are becoming very popular tools on today’s Internet. This dissertation presents the design and implementation of DLAlert, an alert system for digital libraries developed for the Library of the Technical University of Crete and ready to support many other sources including libraries of other organizations and alerting services. Features distinguishing the aforementioned system include • The events that interest a potential user are new publications. • The sources are the bibliographical attributes of the new publications inserted in the digital libraries supported by DLAlert (up to now only the Library of the Technical University of Crete). • The language of profiles offers queries about words, phrases or concepts contained in the publications’ bibliographical attributes. Additionally, it provides Boolean and proximity operators for defining relations between these terms. • The user is expected to describe his fields of interest by typing his queries in a graphical user interface on the web ( http://intelligence.tuc.gr/alert/login.html ). • Notifications are well-formed e-mail messages sent to the user, containing all relevant bibliographical attributes of the matched publication.
  • 9. 1.4 Organization of the dissertation This dissertation is organized as follows. In Chapter 2 we present related work in the areas of information retrieval and selective dissemination of information. In Chapter 3 we discuss modeling and design issues of DLAlert. Chapter 4 presents the internal structure and organization of the database schema used for storing the profiles and filtering the incoming data. In Chapters 5 and 6 we briefly explain the operation of the system’s modules (filtering, notifying, web Graphical User Interface). In Chapter 7 we discuss possible ways to collect data from sources and present the current implementation, which uses the Z39.50 standard to retrieve information from various digital libraries. Chapter 8 presents the scheduling of the actions performed by the system. Finally, in Chapter 9 we present our conclusions and propose directions for future work on DLAlert.
  • 10. Chapter 2 Related Work In this chapter we discuss related technologies in the areas of information retrieval and selective dissemination of information in the context of digital libraries. We present previous developments on this topic that inspired and helped us implement DLAlert. 2.1 Alerting applications on the web In the following sections we present popular alerting services [2, 3] in the area of digital libraries. The main issues that we concentrate on when designing a notification service are: i. Number of sources. The system should be able to collect data from various and often dissimilar sources and integrate them in a way that the user will not notice the difference. ii. Scalability. The application should be able to filter thousands of publications against millions of profiles every day. iii. User friendliness. The graphical user interface (GUI) and the notifications must be easy to use, otherwise no one will use the application. iv. Query expressiveness. The language in which the profiles are defined should let the user request notifications on almost any possible subset of incoming publications. In most cases the alerting services for digital libraries offer queries on categories or keywords contained in the bibliographical data. v. Type of the notifications. The notifications are often e-mail messages that contain hyper-links to web pages related to a specific publication, bibliographical attributes, an abstract or the complete table of contents of the book or journal of interest. They could be plain text, HTML, or XML for later processing by other alerting services. vi. Similar search and alert capabilities. Providing a search engine for the sources that are included in the alert is a positive feature.
  • 11. 2.1.1 Elsevier Contents Direct Elsevier Contents Direct (http://contentsdirect.elsevier.com/ ) is the free e-mail alerting service from Elsevier Science which delivers notifications to users about new publications. The sources of this system are a large number of daily publications from various areas and providers. The web interface is as friendly and simple as it could be and allows queries on subject categories only. The user can also choose to be notified regularly about single journals instead of selecting a whole category. Complex profiles on keywords cannot be defined. The subjects are defined in a hierarchical manner, which means that every category contains one or more sub-categories. An advantage of this application is that the user can search for journals and update his profile on the same page. E-mail messages are well-formed HTML messages that contain the title of the document and a hyper-link to the table of contents and abstract of the book or journal. Unfortunately, more details about the internal design of this service are not available. Figure 2.1-1 Elsevier Contents Direct interface
  • 12. 2.1.2 Kluwer Alert Kluwer Alert (http://www.kluweralert.nl/ ) is a service which promises to keep researchers on top of the latest in scientific publishing. The system collects a large number of publications daily from various areas. It offers functionality similar to that of the previous application: queries are subject categories of books and journals, subjects are organized hierarchically, and the user is able to select single journals. Queries on words or bibliographical attributes are not allowed in the alert. Notifications are well-formed HTML messages that provide all the necessary information about a publication and hyper-links to the table of contents. E-mails also contain pricing information, and a user can buy books on-line. The user can also search the sources included in Kluwer Alert using a simple interface. Figure 2.1-2 Kluwer Alert interface
  • 13. 2.1.3 Springer Alert Springer Alert (http://www.springer.de/alert/ ) is a service provided by Springer Science and supports a wide variety of sources. The queries are subject categories which are organized hierarchically. The user can also select the frequency of notifications and the preferred language of publications. The e-mails are sent regularly and contain only the title and author of the matched book or journal and a hyper-link to a web page where one can read a detailed description of the publication, download the table of contents and purchase it. Springer Alert is the only service that provides promotional brochures and software via surface mail. Catalogue search for books and electronic media is available in the same interface. Figure 2.1-3 Springer Alert interface
  • 14. 2.1.4 Hermes Hermes (http://hermes.inf.fu-berlin.de/ ) [4, 5] is an alerting service developed by the Institute of Computer Science of the University of Berlin. Hermes promises to integrate heterogeneous interfaces of different providers and to support publishing houses or libraries that do not offer an alerting service themselves. Emphasis is on scholarly publications such as journal articles, technical reports, and books. It is the only one of these services that lets users specify their interests through an advanced mechanism. Profiles consist of one or more queries on bibliographical metadata and query terms, as well as selections of specific journals. Queries on attributes can contain keywords or phrases related with Boolean operators (AND, OR, NOT). Queries to be stored are checked, and those containing syntax errors are not inserted in the database of Hermes (an error message is produced). Single words standing alone or between Boolean operators are stemmed (for example, the expressions library and libraries are equivalent). Figure 2.1-4 Hermes interface
  • 15. The user can define the notification frequency (day, week, month) and format (plain text, HTML, XML). The main disadvantage of this service is that the graphical interface is not as friendly and simple as those of the previous applications. The messages, which are usually not well formed, contain the main bibliographical attributes of the publications (title / author / abstract) and a hyper-link to a detailed description. The system accepts relevance feedback on notifications, which means that the user can evaluate the relevance of the delivered documents so that the ranking results are improved in later filtering. Generally speaking, Hermes is an application with advanced features but not as simple and friendly as the services presented earlier. 2.2 DIAS (Distributed Information Alert System) DIAS [6, 7, 8] is a distributed alert system for digital libraries, currently under development in project DIET by the Intelligent Systems Laboratory of the Department of Electronic and Computer Engineering, Technical University of Crete. DIAS is currently implemented as a part of a system called P2P-DIET, “A query and notification service based on mobile agents for rapid implementation of peer to peer applications” [52]. The advanced functionality that DIAS offers and the basic ideas that this project proposes motivated us during the implementation of DLAlert. DLAlert is not yet a distributed system, but the models and languages supported by both services are similar. In addition, during our implementation stored queries from DIAS were used for validating the filtering process of DLAlert. Figure 2.2-1 Architecture of DIAS
  • 16. Before presenting the DIAS architecture in detail we have to mention a difference in terminology. In DIAS, notifications are not only the messages sent to the end users but also all the messages exchanged between peers, which can potentially contain information about new publications that must be disseminated through the network. The architecture of DIAS is shown in Figure 2.2-1. Resource agents retrieve data about new publications from the information providers and produce streams of notifications containing this information. Users post profiles to some middle-agent(s) and receive notifications from the network by using their personal agents. The notifications produced by the resource agents are propagated through the P2P network and arrive at interested subscribers (end-agents). Middle-agents forward the long-standing queries to other middle-agents in such a way that the matching of a profile with a notification takes place as close as possible to the origin of the incoming notification. The models proposed by DIAS for notifications and profiles are presented below. We will not discuss the complete definition of these models; for more details on DIAS see [6, 7, 8]. In the following sections we present the schema for notifications that every model proposes and the queries provided by its grammar. The definitions of the models are reproduced verbatim from the paper [6]. 2.2.1 The WP (Word Pattern) data model This model assumes that textual information (of notifications) is in the form of free text and can be queried by word patterns. A word $w$ is considered a finite non-empty sequence of letters from a given alphabet. We also assume the existence of a (finite or infinite) set of words called the vocabulary. A text value $s$ is a finite sequence of words from the assumed vocabulary. Thus $s(i)$ gives the $i$-th element of $s$ and $|s|$ its number of words.
The queries are word patterns generated according to the following grammars. A proximity-free word pattern is an expression generated by the grammar $WP \rightarrow w \mid \neg WP \mid WP \wedge WP \mid WP \vee WP \mid (WP)$. A proximity word pattern is an expression $wp_1 \prec_{i_1} wp_2 \prec_{i_2} \cdots \prec_{i_{n-1}} wp_n$, where $wp_1, wp_2, \ldots, wp_n$ are positive proximity-free word patterns (they do not contain the negation operator $\neg$) and $i_1, i_2, \ldots, i_{n-1}$ are intervals (representing order and distance between words) from the set $I = \{\, [l, u] : l, u \in \mathbb{N},\ 0 \le l \le u \,\} \cup \{\, [l, \infty) : l \in \mathbb{N},\ 0 \le l \,\}$.
  • 17. A word pattern is an expression generated by the grammar $WP \rightarrow PFWP \mid PWP \mid WP \wedge WP \mid WP \vee WP \mid (WP)$, where $PFWP$ is a proximity-free word pattern and $PWP$ is a proximity word pattern. Query examples for the WP model: $artificial \wedge intelligence$ matches text values that contain both words. $constraint \wedge (programming \vee e\text{-}commerce)$ matches text values that contain the word constraint and at least one of the words programming, e-commerce. $search \prec_{[0,6]} optimisation$ matches text values that contain both words, in that order, with 6 or fewer words between them. $(global \wedge local) \prec_{[3,6]} search \prec_{[1,1]} optimisation$ matches text values that satisfy all of the following three conditions: i. They contain one of the words global, local and the words search, optimisation. ii. The distance between one of the words global, local and the word search is at least 3 and at most 6 words. iii. There is exactly 1 word between the words search, optimisation. $University \prec_{[7,\infty)} Crete$ matches text values that contain the word University and the word Crete with at least 7 words between them.
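The word-pattern semantics above can be made concrete with a short Python sketch. It is an illustration only and not the matching algorithm of DIAS or DLAlert. Patterns are encoded as nested tuples, an encoding chosen here for convenience, and, as a deliberate simplification, each operand of a proximity pattern is restricted to a single word rather than an arbitrary positive proximity-free pattern. The distance between two words is taken to be the number of words strictly between them.

```python
import math
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (l, u); use math.inf for an open upper end


def tokenize(text: str) -> List[str]:
    # Assumption: words are whitespace-separated and compared case-insensitively.
    return text.lower().split()


def _proximity(terms: Sequence[str], intervals: Sequence[Interval],
               words: Sequence[str]) -> bool:
    # Try to place terms[k] after position `pos` so that the number of words
    # between consecutive terms falls inside the corresponding interval.
    def extend(k: int, pos: int) -> bool:
        if k == len(terms):
            return True
        low, high = intervals[k - 1]
        for nxt in range(pos + 1, len(words)):
            gap = nxt - pos - 1
            if gap > high:
                break
            if gap >= low and words[nxt] == terms[k].lower() and extend(k + 1, nxt):
                return True
        return False

    return any(w == terms[0].lower() and extend(1, i) for i, w in enumerate(words))


def matches(pattern: tuple, words: Sequence[str]) -> bool:
    kind = pattern[0]
    if kind == "word":
        return pattern[1].lower() in words
    if kind == "not":
        return not matches(pattern[1], words)
    if kind == "and":
        return matches(pattern[1], words) and matches(pattern[2], words)
    if kind == "or":
        return matches(pattern[1], words) or matches(pattern[2], words)
    if kind == "prox":  # ("prox", [w1, ..., wn], [i1, ..., i(n-1)])
        return _proximity(pattern[1], pattern[2], words)
    raise ValueError(f"unknown pattern kind: {kind}")


text = tokenize("Interaction of constraint programming and local search "
                "for optimisation problems")
near = ("prox", ["search", "optimisation"], [(0, 6)])
far = ("prox", ["University", "Crete"], [(7, math.inf)])
boolean = ("and", ("word", "constraint"),
           ("or", ("word", "programming"), ("word", "e-commerce")))
```

With this encoding, matches(near, text) and matches(boolean, text) hold for the sample title, while matches(far, tokenize("Technical University of Crete")) does not, because only one word separates University and Crete.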
2.2.2 The AWP (Attribute based Word Pattern) data model

In the AWP data model the textual information of notifications is organized in attributes (fields) whose values are finite-length strings. Strings are understood as sequences of words (text values), as formalized by the WP model presented earlier. Attributes can be used to encode the bibliographical attributes of a publication (e.g., author, title, abstract of a paper and so on).

A notification schema N is a pair (A, V) where A is a set of attributes and V is a vocabulary. For example, a notification schema for a digital library containing three attributes could be N = ({AUTHOR, TITLE, ABSTRACT}, ε).

A notification is a set of attribute-value pairs. For example, a valid notification over the previous schema is

{ (AUTHOR, "John Brown"),
  (TITLE, "Interaction of constraint programming and local search for optimisation problems"),
  (ABSTRACT, "In this paper we show that adapting constraint propagation...") }

Queries in the AWP model reference text values inside attributes. A query is a formula in any of the following forms.

1. A ⊒ wp, where A is an attribute and wp is a positive word pattern. This query can be read as “attribute A contains word pattern wp”.
2. A = s, where A is an attribute and s is a text value (sequence of words). This query can be read as “attribute A equals text value s”.
3. ¬φ, where φ is a query containing only proximity-free word patterns.
4. φ1 ∨ φ2, where φ1 and φ2 are queries.
5. φ1 ∧ φ2, where φ1 and φ2 are queries.
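The five query forms can be sketched as composable predicates over a notification. This is only an illustration under simplifying assumptions: a notification is a Python dictionary, the word pattern in the "contains" form is restricted to a single word, and all helper names (`contains`, `equals`, `negation`, `disjunction`, `conjunction`) are hypothetical rather than part of DIAS.

```python
# The example notification from the text, as attribute-value pairs.
notification = {
    "AUTHOR": "John Brown",
    "TITLE": "Interaction of constraint programming and local search "
             "for optimisation problems",
    "ABSTRACT": "In this paper we show that adapting constraint propagation...",
}


def contains(attribute, word):
    """Form 1, restricted to a single-word pattern (an assumption)."""
    return lambda n: word.lower() in n.get(attribute, "").lower().split()


def equals(attribute, text):
    """Form 2: the attribute equals the given sequence of words."""
    return lambda n: n.get(attribute, "").split() == text.split()


def negation(query):
    """Form 3."""
    return lambda n: not query(n)


def disjunction(q1, q2):
    """Form 4."""
    return lambda n: q1(n) or q2(n)


def conjunction(q1, q2):
    """Form 5."""
    return lambda n: q1(n) and q2(n)


title_or_author = disjunction(contains("AUTHOR", "Smith"),
                              contains("TITLE", "programming"))
not_brown = conjunction(negation(equals("AUTHOR", "John Brown")),
                        contains("ABSTRACT", "paper"))

print(title_or_author(notification))  # TITLE contains "programming"
print(not_brown(notification))        # AUTHOR is "John Brown"
```

Representing each query as a function keeps the Boolean forms 3 to 5 independent of how the atomic forms 1 and 2 are evaluated.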
Query examples for the AWP model

AUTHOR ⊒ (John ≺[0,6] Smith) ∨ (TITLE ⊒ programming)
matches notifications (on the bibliographical attributes of publications) that contain the words John, Smith with at most 6 words between them in the attribute AUTHOR, or contain the word programming in the attribute TITLE. This query matches the example notification (TITLE contains the word programming).

¬ AUTHOR = "John Brown" ∧ ABSTRACT ⊒ paper
matches notifications whose AUTHOR attribute is not “John Brown” and whose ABSTRACT attribute contains the word paper. This query does not match the example notification (its AUTHOR attribute is “John Brown”).

2.2.3 The AWPS data model

AWPS extends AWP with the concept of similarity between two text values. The AWP model allows us to issue queries on attributes that contain one or more words satisfying a set of statements. The new functionality allows us to request notifications that might not contain a strictly defined set of words but have text values that are similar to a given string. Queries with similarity can be extremely useful when a user wants to query a collection of notifications based on a concept and does not know exactly which keywords should be included in the profile in order to produce the desired results. For example, requests like “I am interested in papers about the use of local search techniques for the problem of test pattern optimization” cannot easily be translated into queries of the AWP model.

Queries on similarity use the concept of a word weight as defined in the Vector Space Model (VSM) [9, 10]. In VSM, documents (text values) are represented as vectors. If our vocabulary consists of n distinct words then a text value s is represented as an n-dimensional vector of the form (ω1, ω2, ..., ωn) where ωi is the weight of the i-th word (the weight assigned to a non-existent word is 0).
In VSM, the weight of a word is computed using the heuristic of assigning higher weights to words that are frequent in a document and infrequent in the collection of documents available. Generally this mechanism tries to distinguish words that represent the semantic content of a document
by assigning them a higher weight. Presenting the definition of this heuristic is out of the scope of this dissertation.

sim(sq, sd) is a function that uses the weights of the words of two text values sq, sd to produce a number in the interval [0,1] that represents the concept of similarity between them:

$$\mathrm{sim}(s_q, s_d) = \frac{\sum_{i=1}^{N} \omega_{q_i} \cdot \omega_{d_i}}{\sqrt{\sum_{i=1}^{N} \omega_{q_i}^{2} \cdot \sum_{i=1}^{N} \omega_{d_i}^{2}}}$$

where $\omega_{q_i}$ and $\omega_{d_i}$ are the weights of the i-th word in sq and sd respectively. If the similarity value of two documents is close to 1 then these documents have similar semantic content.

The AWPS data model provides a new type of query that utilizes this function and issues requests on attributes that have similarity values over a certain threshold when compared with a given string. The syntax for this query is

A ∼_k s

where A is an attribute, s is a text value and k is a number in the interval [0,1] that gives a relevance threshold which the similarity of a candidate text value with s must exceed in order to satisfy the predicate. A low similarity threshold k might result in many irrelevant documents satisfying a query, whereas a high similarity threshold would result in very few documents (or even none at all) satisfying it.

Query examples for the AWPS model

TITLE ∼_0.6 "Object Relational Databases"
matches documents with a TITLE relevant to “Object Relational Databases”.

We should mention that we cannot give a notification that will always satisfy this predicate, because the similarity value of its attribute (TITLE) with the query string depends on previously processed notifications. For example, the notification below is most likely to satisfy the previous query:

{ (AUTHOR, "Richard Niemec"),
  (TITLE, "Object Oriented Programming and Relational Databases"),
  (ABSTRACT, "...") }
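A minimal sketch of the similarity predicate follows. The weighting heuristic of VSM is not defined in this text, so the sketch substitutes raw term counts for the word weights; this is an assumption made only to keep the example self-contained, and real weights would also depend on the collection of previously processed notifications. The names `term_count_weights`, `sim` and `satisfies` are hypothetical.

```python
import math
from collections import Counter


def term_count_weights(text):
    """Stand-in weights: raw term counts (NOT the VSM heuristic of [9, 10])."""
    return Counter(text.lower().split())


def sim(weights_q, weights_d):
    """Cosine similarity of two sparse weight vectors (word -> weight)."""
    dot = sum(w * weights_d.get(word, 0.0) for word, w in weights_q.items())
    norm_q = math.sqrt(sum(w * w for w in weights_q.values()))
    norm_d = math.sqrt(sum(w * w for w in weights_d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def satisfies(attribute_value, query_string, k):
    """Evaluate the predicate 'A ~k s' for one attribute value."""
    return sim(term_count_weights(query_string),
               term_count_weights(attribute_value)) > k


title = "Object Oriented Programming and Relational Databases"
print(satisfies(title, "Object Relational Databases", 0.6))
```

Words missing from one of the two text values contribute a weight of 0, which matches the convention stated for non-existent words.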
Queries on similarity can be combined with Boolean operators and queries on word patterns.

AUTHOR ⊒ (John ≺[0,6] Smith) ∧ (TITLE ∼_0.9 "Artificial Intelligence")
matches notifications that contain the words John, Smith with at most 6 words between them in the attribute AUTHOR and have a TITLE relevant to “Artificial Intelligence”.

2.2.4 Some interesting problems

In the previous sections we presented in detail the language of the profiles used. DIAS also provides algorithms that efficiently solve the following problems:

• The satisfiability problem. As profiles and notifications propagate through the network, a middle-agent should be able to decide whether a query can be satisfied by any notification at all.
• The matching problem. Deciding whether an incoming notification matches a profile.
• The filtering problem. Given a notification n, an agent should be able to find all stored queries that match n.
• The entailment problem. Deciding whether a profile is more or less “general” than another. An agent should detect profiles that request the same sets of notifications in order to minimize profile forwarding between peers.

In DLAlert, efficient filtering and matching are the necessary functionalities. We have explained the language of the profiles provided by DIAS because queries in DLAlert are generated according to a similar grammar. For further details, the papers [6, 7, 8] formally define the algorithms and models of DIAS.

2.3 The Z39.50 protocol

Z39.50 [1] is the protocol used by DLAlert in order to retrieve new publications from databases. Z39.50 is a standard that plays an increasingly important role in information retrieval, especially in the library world. It was established by NISO (National Information Standards Organization) [11] and it is maintained by the Library of Congress [12] of the USA.
The first version of this protocol was issued in 1988 and today it is supported by almost all digital libraries around the world. In this section we briefly explain the protocol and provide a few simple examples.
Z39.50 is an application layer network protocol (like HTTP, FTP, SMTP etc.) that runs over TCP/IP and provides advanced information retrieval services for organizations such as universities, libraries, union catalogue centers and museums. It addresses connection-oriented program-to-program communication. The protocol specification includes the definition of the protocol control information, the rules for exchanging this information via connection-oriented program-to-program communication, and the conformance requirements to be met by implementations of the protocol.

[Figure 2.3-1 Z39.50 protocol: the Z39.50 client sends requests via TCP/IP to the Z39.50 gateway, which requests data from the DBMS and returns records to the client.]

The purpose of Z39.50 is interoperability for search and retrieval of information between different client/server systems. The sources behind the gateway are not directly visible to the client, which can access them only through the strictly defined rules of Z39.50. For this purpose the gateway implements an “abstract database” as a front end to the real DBMS. The client uses standardized access points, which are organized in “attribute sets”, and standardized queries to request information from the abstract database. The gateway returns records in a standardized format (MARC [13], XML [14] etc.). In Z39.50 terminology the client is usually referred to as the “origin” and the server as the “target”.

The protocol is hard to grasp from its specification alone, so we give some simple examples along with the definitions of the relevant actions and models of the standard. The deployment of a new Z39.50 gateway is out of the scope of this dissertation, so we focus on the retrieval of records from existing abstract databases and the configuration of the client used. The examples show a connection to the Library of the Technical University of Crete using the sample interactive command-line client provided with JZKit.
JZKit [15] is an open source API (Application Programming Interface) written in Java and provided by Knowledge Integration, that helps us construct applications that utilize the Z39.50
functionality. Instead of displaying the raw messages exchanged by the end-systems, we present the output of the sample Z-client for convenience. The functionality of Z39.50 is organized in “facilities”, which represent actions and consist of one or more services.

2.3.1 Initialization facility

The initialization facility is the action that establishes the connection (“Z-association”) between the client and the server. In the Init request, the client proposes values for the initialization parameters (version of Z39.50, option flags, message sizes, other implementation information). The option flags indicate which other facilities are enabled during the Z-association. If the target requires authentication, the origin should include an id / password in the request.

In the Init response, the server responds with values for the initialization parameters; those values, which may differ from the client-proposed values, are in effect for the Z-association. If the server responds affirmatively (Result = ‘accept’), the Z-association is established. If the client then does not wish to accept the values in the server response, it may terminate the Z-association via the Close service (and may subsequently attempt to initialize again). If the server responds negatively, the client may attempt to initialize again.

[Figure 2.3-2 Initialization facility.
 Origin → Target, Init request: version, (id/password), option flags, message sizes, implementation information.
 Target → Origin, Init response: result, version, option flags, message sizes, implementation information.]
[Figure 2.3-3 JZKit sample client output]

For example, as we see in Figure 2.3-3, by connecting to the Technical University's gateway (issuing the command “open dias.library.tuc.gr”) we get a response that contains the implementation id “1995”, the name “Geac Advance Z39.50 Server” and the version “2.0”. The services enabled by the target are Search, Present, Delete Result Set, Scan, Sort, Extended Services and Named Result Sets. In this dissertation we present only the Search and Present services because these are the only ones needed to retrieve new records for the purposes of DLAlert. More information on other Z39.50 services can be found in [1].

2.3.2 Search facility

[Figure 2.3-4 Search facility.
 Origin → Target, Search request: database names, query type, query, result set name, preferred record syntax.
 Target → Origin, Search response: search status, result count, number of records attached, next result set position.]

The search facility is the action that sends a query on abstract records to the gateway. The origin sends the database name, query type, query, the result set name
and preferred record syntax. The database name is sent because the gateway can be a front end to multiple databases and the query can refer to all or some of them. The query type used is Type-1 (the default setting of the client) because it is supported by both the TUC gateway and JZKit, provides the functionality we need for the purposes of DLAlert and is the most common query type. We focus on this query type and give a detailed definition in the following section. The result set name is a string generated by the client so that the results of a search can be referenced. The preferred record syntax is UNIMARC [16, 17], the only one supported by the TUC gateway.

The target returns the search status, result count, number of records attached and next result set position. The search status indicates whether the search completed successfully or not. The result count is the number of records that satisfy the query. The response can contain attached records (usually when there are only one or two results). The parameter next result set position takes the value M+1, where M is the position of the result set item that identifies the database record corresponding to the last response record among those returned. It usually takes the value “1” when the response does not contain attached records. The result records of a query are requested using the Present facility explained later.

[Figure 2.3-5 JZKit sample client output]

For example, we connect to the digital library named “Advance” of the Technical University of Crete with the command “base advance” and define the preferred record format with “format UNIMARC”. Then we send the query “@attrset bib-1 @attr 1=1035 smith” with the command “find”. This query requests bibliographical records that contain the word smith in any attribute. The response contains the name of the result set “Search:0” (a string generated by JZKit), the status (“true”) that indicates
that the search completed successfully and the number of records satisfying the query, “177”. The number of records returned is 0, so the next result set position is 1.

2.3.3 Type-1 query syntax

The Type-1 [18] query is also called the RPN (Reverse Polish Notation) query. In the string form used here, each operator always precedes its two operands. An RPN string is generated according to the following grammar. The reserved words might differ slightly among Z39.50 API implementations, but the grammar is part of the protocol.

rpn-string → @attrset default-attrset expr
default-attrset → bib-1

The access points to the abstract database are called attributes and are categorized in attribute sets. The most common attribute set used in information retrieval from digital libraries is the bib-1 [19] attribute set, which includes access points to attributes of bibliographic records. Other attribute sets reference extended services tasks (ext-1), details of the target implementation (exp-1) or a different organization of the access points to bibliographical records (GILS, CCL). A full listing of all registered attribute sets, and generally of all Z39.50 object identifiers, can be found at [20]. Almost all Z39.50 implementations support the bib-1 attribute set.

expr → boolean | attr-plus-term
attr-plus-term → attrdef [ single-term | quoted-string ]
attrdef → @attr attrtype = attrval
boolean → operator expr expr
operator → @and | @or | @not

single-term is a single word and quoted-string a sequence of words (enclosed in “ ”) that should be contained in the corresponding attribute of a record in order to satisfy the statement. attrtype for the bib-1 attribute set is a value between 1 and 6 that describes the type of the attributes used. For attrtype 1 we can reference several
attributes (the attrval value) that correspond to bibliographical records of the digital library of TUC.

Attribute  Description                        UNIMARC fields
4          Title                              200, 5XX, 4XX except 410
5          Series Title                       410
7          ISBN                               010
8          ISSN                               011
12         Local number (of the record)       001
21         Subject Heading                    60X
31         Date of publication (year)         210d
32         Date of acquisition (year month)   960
54         Language code (ENG or GRE)         101
59         Publication place                  210a
63         Notes                              3XX
1003       Author                             7XX
1018       Publisher                          210
1035       Anywhere                           almost all fields

Table 2.3-1 Access points supported by the TUC Z39.50 Gateway

The UNIMARC field numbers are the tags that appear in the result records returned (the record format is explained in detail in Section 2.3.5). attr-plus-term formulas with attrtype 1 implicitly define a “contains word or phrase” relation for the given attribute. The other types (2-6) define advanced relations like less than, greater than, position in field etc. The operators @and, @or, @not define Boolean relations (AND, OR, AND-NOT) between attr-plus-term formulas.

Example queries:

@attrset bib-1 @attr 1=4 databases
returns bibliographical records that contain the word “databases” in the title.

@attrset bib-1 @not @attr 1=32 2003 @attr 1=1035 algebra
returns bibliographical records that were acquired in the year 2003 and do not contain the word “algebra” in any field.
@attrset bib-1 @attr 1=32 2003
returns bibliographical records that contain the number 2003 in the date of acquisition field (records acquired in the year 2003).

@attrset bib-1 @or @or @attr 1=4 science @attr 1=4 algebra @attr 1=4 mathematics
returns bibliographical records that contain at least one of the three words in the title.

2.3.4 Retrieval facility

The Retrieval facility is the action where the origin requests the results of a query from the target. It consists of the Present and Segment services. In the Present service the client sends to the gateway the result set name referenced earlier in the Search facility, a number defining the starting point of the records, and the number of records to be returned. For example, if a query returns 70 records and we want to retrieve the first twenty, the starting point is 1 and the number of records is 20. The target returns the records in a standardized format (XML, MARC, etc.), the number of records returned and a status that indicates whether the Present service completed successfully.

[Figure 2.3-6 Present service.
 Origin → Target, Present request: number of records, starting point, result set.
 Target → Origin, Present response: number of returned records, status, records.]

Sometimes the result set of a query may contain hundreds or thousands of records. The Present response could exceed an upper limit of bytes. Thus the server splits a Present
response that is larger than this limit into segments. In some Z39.50 implementations the origin can propose preferred message sizes, but the segmentation service of the target decides the maximum segment size. In this case the number of records returned from the target is less than the number of records requested. Thus, when constructing a client that will collect records from various sources, we should always check the number of records returned from a Present request.

[Figure 2.3-7 JZKit sample client output]

For example, let us retrieve the results of the query that requests bibliographical records containing the word smith in any attribute (“@attrset bib-1 @attr 1=1035 smith”). The query in the previous example returned 177 records. We want to retrieve the last 5 records, so we issue the command “show 173+5”. The first record requested is the 173rd record. Figure 2.3-7 shows the last record returned from this request. The strange characters in fields 801 – 852 are Greek characters not displayed properly by the sample client.

2.3.5 Record format (UNIMARC)

MARC is an acronym for Machine Readable Catalogue and is a standard for assigning labels to each part of a catalogue record so that it can be handled by computers. While the MARC format was primarily designed to serve the needs of libraries, the concept has since been embraced by the wider information community as a convenient way of storing and exchanging bibliographic data. The original MARC format was developed at the Library of Congress in 1965-6 and since the early 1970s an
extended family of more than 20 MARC formats has grown up. Differences among the various MARC formats meant that editing was required before records could be exchanged. One solution to the problem of incompatibility was to create an international MARC format (UNIMARC) [16, 17] which would accept records created in any MARC format. So in 1977 the International Federation of Library Associations and Institutions (IFLA) published UNIMARC: Universal MARC format, stating that "The primary purpose of UNIMARC is to facilitate the international exchange of data in machine-readable form between national bibliographic agencies".

The records retrieved from the Technical University Library are in the UNIMARC standard. The record structure is designed to control the representation of data by storing it in the form of strings of characters known as fields. The fields, which are identified by three-character numeric tags, are arranged in functional blocks. These blocks organise the data according to its function in a traditional catalogue record.

Tag num.  Description
0XX       Identification block
1XX       Coded information block
2XX       Descriptive information block
3XX       Notes block
4XX       Linking entry block
5XX       Related title block
6XX       Subject analysis block
7XX       Intellectual responsibility block
8XX       International use block
9XX       Reserved for local use

Table 2.3-2 UNIMARC functional blocks

Within each field, data is coded into one or more subfields, e.g. 700 $a ... $b ..., etc., according to the kind of information. The effect of the subfield coding is to refine further the definition of the data for computer processing. The subfield identifiers consist of a special character, represented by a $ in the examples, and a lower case alphabetic character or a number 0-9. For example, the field starting with the tag 210 contains publication related data and the subfield $d contains the publication date.
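The tag and subfield structure can be illustrated with a short sketch. It assumes a simplified textual display of a field ("tag, space, then $-prefixed subfields"), which is a hypothetical format chosen for readability; records exchanged over Z39.50 use a binary layout, and indicators are ignored here. The sample values in the field are made up, and the helper names `parse_field` and `functional_block` are hypothetical.

```python
# Functional blocks of Table 2.3-2, keyed by the first digit of the tag.
BLOCKS = {
    "0": "Identification block",
    "1": "Coded information block",
    "2": "Descriptive information block",
    "3": "Notes block",
    "4": "Linking entry block",
    "5": "Related title block",
    "6": "Subject analysis block",
    "7": "Intellectual responsibility block",
    "8": "International use block",
    "9": "Reserved for local use",
}


def parse_field(line):
    """Split a display line such as '210 $aLondon $d2003' into its tag and
    a mapping from subfield identifier to the list of its values."""
    tag, _, rest = line.strip().partition(" ")
    subfields = {}
    for chunk in rest.split("$")[1:]:
        if not chunk:
            continue  # skip an empty piece produced by a doubled '$'
        subfields.setdefault(chunk[0], []).append(chunk[1:].strip())
    return tag, subfields


def functional_block(tag):
    """Name of the functional block a three-character tag belongs to."""
    return BLOCKS[tag[0]]


# Made-up field content, for illustration only.
tag, subfields = parse_field("210 $aLondon $cPrentice Hall $d2003")
print(tag, functional_block(tag), subfields["d"])
```

Values are kept in lists because a subfield identifier may repeat within one field.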
We do not present the whole definition of the UNIMARC format because it defines thousands of tags and subfields [17]. The main UNIMARC tags used by the TUC library and the corresponding Z39.50 attributes are shown in Section 2.3.3.

2.3.6 Z39.50 and interoperability

Most digital libraries around the world nowadays have a Z39.50 Gateway as a front-end. Numerous organizations, universities, museums and publishing houses support this protocol. Accessing all those digital libraries using a common way,
which is independent of the specific implementation of each database, is the main advantage of Z39.50. As an indication, we list some digital libraries in Greece that support this standard and from which we have successfully retrieved records using the client we constructed.

Organization                             Gateway host : port
University of Thessaly                   library.lib.uth.gr : 210
University of Patras                     pherusa.lis.upatras.gr : 210
University of Cyprus                     194.42.4.129 : 210
University of Aegean                     library.lib.aegean.gr : 210
Technical Chamber of Greece (TEE)        artemis.tee.gr : 21210
Panteion University                      library.panteion.gr : 210
Ionian University                        zante.ionio.gr : 210
Hellenic American Education Foundation   194.30.242.11 : 210

Table 2.3-3 Some Z39.50 Gateways in Greece

2.4 Oracle Text and filtering applications

Oracle Text [23-27] is a tool of the Oracle RDBMS [21, 22] that enables us to build text retrieval and filtering applications. Retrieval applications enable users to find documents that contain one or more search terms defined in a query. The text is a collection of documents in plain text, HTML, or XML. A filtering application stores queries in the database and finds those which match a certain document. DLAlert, and alerting services in general, are filtering applications. The grammars of the queries used in text retrieval and in filtering are similar, and search terms can be simple words, phrases or themes. Themes define concepts inside a document. Figure 2.4-1 presents an overview of the architecture of a filtering application.

[Figure 2.4-1 Filtering application: incoming documents enter the filtering application, which compares them against the queries stored in the Oracle RDBMS and performs an action on the matched documents.]
2.4.1 The CTXRULE index

The filtering functionality of the Oracle database was first introduced in version 9.0.1 (June 2001) with the CTXRULE index type. In filtering applications the queries are stored in a column of a table, on which a CTXRULE index must be constructed. This index is a structure that holds information about the stored queries. When Oracle looks for the queries that match a given document, it requests data from the index and not directly from the table that holds the profiles. Consider the simple Boolean queries in Table 2.4-1 that represent keywords contained in the desired document.

Query  Syntax                 Description
1      oracle                 matches documents that contain the word "oracle"
2      larry or ellison       matches documents that contain either "larry" or "ellison"
3      DBMS and oracle        matches documents that contain both "DBMS" and "oracle"
4      market share           matches documents that contain the phrase "market share"
5      near((USA, Asia), 5)   matches documents that contain both "USA" and "Asia" within a group of 5 words

Table 2.4-1 Simple queries

The indexing process for the CTXRULE index type, given a populated table holding the queries, is shown in Figure 2.4-2.

[Figure 2.4-2 CTXRULE Indexing Process: the Datastore reads the column data and passes query strings to the Lexer; the Query Parser returns a parse tree for each query string; the Lexer sends query rules to the Indexing Engine, which builds the CTXRULE index.]
The Datastore is the object that retrieves data from the table with the queries and creates a stream of query strings. The Lexer breaks the text into tokens according to our language. These tokens are usually words or query operators. The parser gets the queries from the Lexer, creates a parse tree and sends it back to the Lexer. The Lexer normalizes the tokens (turns them into upper case, omits very frequent words like ‘a’, ‘is’ etc.), breaks the parse tree into rules and sends these to the engine. The engine builds an inverted index of rules and stores it in the index tables.

As an example we present the index content for some simple Boolean queries on keywords. The index is a structure that represents the parse tree generated by the query parser, containing among others the columns TOKEN_TEXT and TOKEN_EXTRA.

QUERY_STRING : the query string as it is stored in the base table
TOKEN_TEXT   : the first token to be found for the query to be matched
TOKEN_EXTRA  : the other tokens to be found for the query to be matched

Query 1 is a single word query. A document is a full match if it contains the word “oracle”. In this case, matching TOKEN_TEXT alone is sufficient, so TOKEN_EXTRA is NULL. Notice the normalization of the token to upper case:

QUERY_STRING      TOKEN_TEXT  TOKEN_EXTRA
----------------  ----------  -----------
oracle            ORACLE      (null)

Query 2 is an OR statement. A document is a full match if it contains the word “larry” or the word “ellison”. This can be reduced to two single-word queries, each of which has a NULL TOKEN_EXTRA:

QUERY_STRING      TOKEN_TEXT  TOKEN_EXTRA
----------------  ----------  -----------
larry or ellison  LARRY       (null)
                  ELLISON     (null)

Query 3 is an AND statement. A document must contain both words “dbms” and “oracle” to be a full match. The engine will choose one of these as the filing term, and place the other in the TOKEN_EXTRA criteria:
QUERY_STRING      TOKEN_TEXT  TOKEN_EXTRA
----------------  ----------  -----------
DBMS and oracle   DBMS        {ORACLE}

Documents that contain the word “dbms” will pull this rule up as a partial match. The query engine will then examine the TOKEN_EXTRA criteria, see that they require the presence of the word “oracle”, check whether the document contains that word, and judge the rule a full match if so.

Query 4 is a phrase. All expressions that contain multiple tokens not connected with operators are considered phrases. The engine will use the first word of the phrase as the filing term, and the whole phrase as the TOKEN_EXTRA:

QUERY_STRING      TOKEN_TEXT  TOKEN_EXTRA
----------------  ----------  -----------
market share      MARKET      {MARKET} {SHARE}

Query 5 is a proximity statement. The engine will use the first word of the proximity statement as the TOKEN_TEXT, and the whole expression as the TOKEN_EXTRA. The parameter FALSE means that the order of the terms is not specified.

QUERY_STRING        TOKEN_TEXT  TOKEN_EXTRA
------------------  ----------  ----------------------------
near((USA,Asia),5)  USA         NEAR(({USA},{ASIA}),5,FALSE)

The filtering application uses the data stored in the index to find matching profiles. During filtering the incoming document is considered a stream of tokens. For every token contained both in this stream and in the TOKEN_TEXT column, the corresponding query is considered partially matched. For every partially matched query Oracle Text examines whether the criteria stored in the TOKEN_EXTRA column are satisfied for the given document. With this heuristic only the partially matched queries are evaluated, instead of issuing all the queries contained in the base table against the given document.

The first step when constructing a filtering application is to define the tables used to store the incoming documents and the profiles.
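The two-stage lookup described above (partial match on TOKEN_TEXT, then a check of TOKEN_EXTRA) can be sketched in a few lines. This is an illustration of the idea only and not Oracle's implementation: queries are given as already parsed (kind, words) pairs instead of query strings, stop words and proximity rules are left out, and the names `build_index`, `extra_satisfied` and `matching_queries` are hypothetical.

```python
from collections import defaultdict


def build_index(queries):
    """Map each filing term (TOKEN_TEXT) to its rules (query_id, TOKEN_EXTRA).

    `queries` maps query_id -> (kind, words) with kind in
    {"or", "and", "phrase"}; a single-word query is an "or" of one word.
    """
    index = defaultdict(list)
    for query_id, (kind, words) in queries.items():
        words = [w.upper() for w in words]  # normalization to upper case
        if kind == "or":
            for word in words:
                index[word].append((query_id, None))  # TOKEN_EXTRA is null
        elif kind in ("and", "phrase"):
            index[words[0]].append((query_id, (kind, words)))
        else:
            raise ValueError(f"unsupported query kind: {kind}")
    return index


def extra_satisfied(extra, tokens):
    """Second stage: check the extra criteria against the whole document."""
    kind, words = extra
    if kind == "and":
        return all(word in tokens for word in words)
    n = len(words)  # phrase: the words must appear consecutively
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def matching_queries(index, document):
    """Return the ids of all stored queries matched by the document."""
    tokens = document.upper().split()
    matched = set()
    for token in set(tokens):
        for query_id, extra in index.get(token, []):  # partial matches only
            if extra is None or extra_satisfied(extra, tokens):
                matched.add(query_id)
    return matched


queries = {
    1: ("or", ["oracle"]),
    2: ("or", ["larry", "ellison"]),
    3: ("and", ["DBMS", "oracle"]),
    4: ("phrase", ["market", "share"]),
}
index = build_index(queries)
print(matching_queries(index, "Oracle gained market share in the DBMS segment"))
```

Only rules filed under a token that actually occurs in the document are examined, which is the point of the heuristic.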
2.4.2 Creating the tables

For example, let us define the following simple schema that consists of two tables: the table Documents (Table 2.4-1) with two columns, article_id, the primary key of the table, and article_text, containing the unstructured plain text of the incoming article; and the table Profiles (Table 2.4-2) that holds the stored profiles of users and consists of two columns, query_id, the primary key of the table, and query, the query itself. article_id and query_id are integers; article_text and query are strings.

article_id  article_text
12          Metals mining is the industrial sector responsible for the largest amount of toxic releases in the United States, according to a highly...
34          Papers in the robotics literature often concern specific technical aspects of robot research and development. At the same time, several robot competitions have emphasized …
97          The Eighth Annual Mobile Robot Competition and Exhibition was held as part of the Sixteenth National Conference on Artificial Intelligence in Orlando...
80          The United States National Security Agency, with help from Network Associates of Santa Clara, Calif., has made a security-enhanced version of Linux available for download...

Table 2.4-1 Table Documents

query_id  query
…         …
311       toxic releases
312       US or Europe
313       Artificial AND Intelligence
314       near( ( personal, computers ) , 4)
315       $library
316       about(politics)
…         …

Table 2.4-2 Table Profiles

2.4.3 Language of the stored queries

Let us introduce the grammar that generates the queries for plain text documents. All operators are case-insensitive (AND is equivalent to and). All expressions that contain multiple tokens not connected with operators are considered terms (exact phrases). The available Boolean queries are shown in Table 2.4-2.
  • 36.
    Function     Syntax           Description                                         Examples
    -----------  ---------------  --------------------------------------------------  ---------------------------
                 term1            matches documents that contain term1                toxic releases
                                                                                      intelligence
    conjunction  term1 and term2  matches documents that contain both terms           Artificial and Intelligence
                 term1 & term2                                                        mobile & phone
    disjunction  term1 or term2   matches documents that contain term1 or term2       US or Europe
                 term1 | term2                                                        Linux | Windows
    negation     term1 not term2  matches documents that contain term1 but not term2  paper not journal
                 term1 ~term2                                                         software ~hardware

Table 2.4-2 Boolean queries syntax

Along with the standard Boolean queries, the CTXRULE index type grammar provides proximity, stemming and theme functionality.

Proximity operator NEAR (;)

The operator NEAR matches documents based on the proximity of two or more query terms. The syntax of proximity queries is

    near ( (term1, term2,..., termn ) , max_span , order )

term 1-n: the terms in the query, separated by commas. The query terms can be single words or phrases.

max_span (optional, default 100): the maximum size of a clump, where a clump is the smallest group of words in which all query terms occur. All clumps begin and end with a query term. max_span cannot be greater than 100.

order (optional, default FALSE): indicates whether terms are to be found in the same order as in the query.

Alternatively, proximity can be defined according to the syntax

    term1 near term2
    term1 ; term2

These queries are equivalent to the expression:

    near ( (term1, term2 ) , 100, FALSE )
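To make the notions of clump, max_span and order concrete, here is a minimal Python sketch of a proximity test. It is only an illustration under stated assumptions, not Oracle Text's algorithm: the function name `near` and its signature are hypothetical, only single-word terms are handled, and the clump size is counted as the number of words from the first to the last query term, inclusive.

```python
import re
from itertools import product

def near(text, terms, max_span=100, order=False):
    """True if all single-word `terms` occur within a clump of at most
    `max_span` words; with order=True they must appear in query order."""
    words = re.findall(r"\w+", text.lower())
    # Positions of every query term in the document.
    positions = [[i for i, w in enumerate(words) if w == t.lower()] for t in terms]
    if any(not p for p in positions):
        return False  # some term does not occur at all
    # Try every combination of one position per term (fine for a sketch).
    for combo in product(*positions):
        if order and not all(a < b for a, b in zip(combo, combo[1:])):
            continue
        if max(combo) - min(combo) + 1 <= max_span:
            return True
    return False
```

For instance, `near("personal and portable computers", ["personal", "computers"], 4)` holds because the clump from "personal" to "computers" spans exactly four words.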
  • 37.
    Example                                              Description
    ---------------------------------------------------  ------------------------------------------------------------
    near( ( personal, computers ) , 4)                   matches documents that contain both terms "personal" and "computers" in any order within a group of 4 words
    near( (monday, tuesday, wednesday), 20, TRUE)        matches documents that contain all terms "monday", "tuesday", and "wednesday" in the order specified and within a group of 20 words
    windows near XP                                      matches documents that contain both terms "windows", "XP" in any order within a group of 100 words
    near( (digital signal processing, VLSI), 89, FALSE)  matches documents that contain both terms "digital signal processing", "VLSI" within a group of 89 words

Table 2.4-3 Examples of Proximity queries

Stem operator ($)

The stemming functionality enables us to request terms that have the same linguistic root as the query term. The stem operator ($) expands a query to include all terms having the same stem or root word as the specified term.

    Example   Description
    --------  ------------------------------------------------------------
    $scream   matches documents that contain at least one of the terms "scream", "screamed", "screaming"
    $library  matches documents that contain at least one of the terms "library", "libraries", "librarian"
    $sing     matches documents that contain at least one of the terms "sing", "sang", "sung"

Table 2.4-4 Examples of Stemming queries

The definition of the algorithm used for stemming is not available in the manuals provided by Oracle [23-25, 27]. The Oracle Text stemmer supports the following languages: English, French, Spanish, Italian, German, and Dutch.

Theme indexing

Oracle supplies a database of themes in English or French. Themes are tokens organized hierarchically and connected to each other with relations that describe their semantic content. Oracle Text supports the typical relations used by thesauri and is compliant with the ISO-2788 [28] and ANSI Z39.19 (1993) [29] standards. The relation definitions are reproduced from the ANSI Z39.19 specification [29]. These relations are:

• A Synonym (SYN) relation defines equivalence between two terms and connects terms with very close meaning, like the ones below: scary SYN fear, hiddenly SYN secrecy, drive-up SYN automobiles, lounge SYN rest.
  • 38.
• A Broader Term (BT) relation defines a superordinate semantic class that the first concept belongs to. For example: lordships BT royalty and aristocracy, American Revolution BT military wars, backward motion BT withdrawal, Bible BT sacred texts and objects.

• A Narrower Term (NT) relation defines a subordinate semantic class that the first concept includes. For example: Roman Catholicism NT papism, computers NT laptops, behaviour NT sympathy, organized crime NT gangsters, cosmology NT astronomy. The NT and BT relations are inverses of each other: X BT Y if and only if Y NT X.

• A Related Term (RT) relation implies semantic overlap (there is an element of meaning common to both terms). For example: oceans RT fish, water birds RT mammals, beds RT rest, Islam RT Middle East.

Using these binary relations Oracle constructs a tree containing all supported concepts. Consider the word “government” as a root node. The tree in Figure 2.4-3 contains only a few of the terms related to “government”.

[Figure 2.4-3 Oracle Text Thesaurus: a tree rooted at “government” whose nodes include public facilities, law, politics, political, military, politicians, political sciences and military wars, linked by NT, SYN and RT relations]

Oracle Text provides operators for theme query expansion using these relations (SYN, BT, NT and RT). Thesaural queries of this type add the main term and all of its related terms to the set of strings that may be found in a matching document.

    Example                         Description
    ------------------------------  ------------------------------------------------------------
    SYN( politics )                 matches documents that contain the term "politics" or one of its synonyms
    RT ( pharmaceutical industry )  matches documents that contain "pharmaceutical industry" or one of its related terms
    BT ( gangsters )                matches documents that contain the term "gangsters" or one of its broader terms

Table 2.4-5 Theme queries examples
  • 39.
Operator ABOUT

Another operator that uses the supplied thesaurus to expand theme queries is the ABOUT(phrase) statement. The phrase parameter can be a single word, a phrase, or a string of words in free text format. If the phrase parameter does not exactly match a stored concept, Oracle normalizes it and finds the stored concepts closest to the original string. For example, before expansion “politic” is normalized to “politics” and “national” to “nations”. If normalization fails to find a concept describing the phrase parameter, the query is satisfied only by an exact phrase match. Otherwise Oracle Text expands the query using synonyms and narrower terms of the concepts inside the parentheses. The terms of the expansion can be connected to the original term directly or via intermediate terms (expansion level > 1).

The definition of the algorithm used for normalization and query expansion is not available in the manuals provided by Oracle [23-25, 27] and is presumably rather complex. We provide an example of the expansion of the word “politics” instead. Expansion terms are separated by commas. The word “and” in expansion terms is not considered an operator.

    Relation to term "politics"                      Expansions of the query ABOUT(politics)
    -----------------------------------------------  ------------------------------------------------------------
    narrower terms level 1                           civil rights, elections and campaigns, political parties, political scandals, political sciences, politicians, politicians and activists, revolution and subversion, world politics
    narrower terms level 2                           civil liberties, elections, human rights, insurgents, insurrectionary, insurrections, partisan politics, revolutionaries, revolutionists, terrorism
    narrower terms level 3                           terrorist activities, terrorist incidents, terrorists
    narrower terms level 1 of "political advocacy"   animal rights, consumer advocacy
    narrower terms level 2 of "political advocacy"   animal rights activists, animal rights movement, animal-rights activists, consumer activists, consumer advocates, consumer rights
    both "politics" and "policymakers" are           policymakers
    narrower terms of "government"

Table 2.4-6 Expansions of term "politics"

Important Notes:

i. An ABOUT statement cannot be combined with the proximity operator NEAR. For example the query “NEAR( ( personal, computers ) , 4) AND ABOUT( software )” is not valid and cannot be parsed.

ii. The phrase parameter in an ABOUT statement should be in lower case.

iii. Inside thesaural statements (SYN, BT, NT, RT and ABOUT) any reserved words (AND, OR, NOT, NEAR) are not considered operators but simple tokens.
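Since Oracle does not document its expansion algorithm, the following Python sketch only illustrates the idea of expanding a term through narrower-term levels with a breadth-first traversal. The `NARROWER` dictionary is a tiny hypothetical thesaurus: its terms are taken from the table above, but which level-1 term each deeper term hangs under is an assumption made for the example.

```python
from collections import deque

# Hypothetical narrower-term (NT) relation; the parent/child pairing is assumed.
NARROWER = {
    "politics": ["political parties", "politicians", "revolution and subversion"],
    "revolution and subversion": ["insurgents", "terrorism"],
    "terrorism": ["terrorist activities", "terrorists"],
}

def expand(term, max_level):
    """Return {expansion term: level} for narrower terms up to max_level."""
    levels = {}
    queue = deque([(term, 0)])
    while queue:
        current, level = queue.popleft()
        if level == max_level:
            continue  # do not descend below the requested level
        for child in NARROWER.get(current, []):
            if child != term and child not in levels:
                levels[child] = level + 1
                queue.append((child, level + 1))
    return levels
```

With this toy data, `expand("politics", 1)` returns only the three level-1 terms, while `expand("politics", 3)` also reaches "terrorists" at level 3 through the intermediate terms.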
  • 40.
Operator precedence

Within query expressions, the binary operators have the following order of evaluation from highest precedence to lowest: NEAR, NOT, AND, OR. For example:

    Query Expression    Order of Evaluation
    ------------------  ----------------------
    w1 OR w2 AND w3     w1 OR ( w2 AND w3 )
    w1 AND w2 OR w3     ( w1 AND w2 ) OR w3
    w1 NOT w2 AND w3    ( w1 NOT w2 ) AND w3
    w1 OR w2 NEAR w3    w1 OR ( w2 NEAR w3 )
    w1 NOT w2 NEAR w3   w1 NOT ( w2 NEAR w3 )

Table 2.4-7 Operator precedence examples

Grouping characters can be used to control operator precedence. Grouping characters are parentheses ( ) and brackets [ ].

2.4.4 Indexing the stored queries

Let us consider the simple relational schema defined in Section 2.4.2. Before filtering the documents we must first index the queries in order to be able to collect matching profiles. This is done with the following command:

    CREATE INDEX profile_index ON profiles(query) INDEXTYPE IS ctxsys.ctxrule;

Oracle now constructs an index according to the process described in Section 2.4.1.

DML (Data Manipulation Language) operations on the base table are inserts, updates and deletes. During filtering Oracle reads the queries stored in the CTXRULE index. In order to ensure that filtering results are correct and consistent with the queries, the index should be synchronized regularly with the base table after DML actions. This is done with the following command:

    EXEC ctx_ddl.sync_index('profile_index');

Syntax errors

Stored queries that contain operators and do not conform to the CTXRULE grammar are considered invalid and cause index errors. For example, queries like the ones in Table 2.4-8 are considered syntax errors.
  • 41.
    Error Description                                    Query
    ---------------------------------------------------  ---------------------------------------------------
    missing parenthesis                                  ( software design
    term2 missing                                        international and
    term1 missing                                        or mathematics
    about(...) and near(...) in the same field           about(management) and near((financial, planning),4)
    operator between term1 RDBMS and near(...) missing   RDBMS near((data, warehousing),6)
    about is an operator                                 journals about science

Table 2.4-8 Syntax error examples

Syntax errors cause index errors (upon index creation or synchronization), and filtering may then not produce the expected results. Indexing of null fields also produces errors. These errors can be displayed using a simple SELECT statement on the data dictionary view CTX_USER_INDEX_ERRORS. This view contains four columns: the name of the index, the time of the error, the row id of the faulty query and the error message. The view is very helpful for program debugging and database administration, but not for automatic error detection on queries. Moreover, detecting an error after the transaction is committed is not the best way to solve this problem. An application that lets users define their profiles should include a mechanism that validates queries before insert/update transactions (Sections 6.4-6.5). The main disadvantage of Oracle Text is that it does not provide such functionality for the CTXRULE index.

2.4.5 Filtering

In order to filter the documents we use the SQL operator MATCHES. We use this operator to find all rows in a query table that match a given document. The document must be a plain text, HTML, or XML document. This operator requires a CTXRULE index on our set of queries. The syntax for this operator is

    MATCHES( column, document VARCHAR2 or CLOB) RETURN NUMBER;

column is the indexed column of the base table; document is a PL/SQL variable of type VARCHAR2 or CLOB. VARCHAR2 is a string of variable length (<=4000 characters); CLOB is a character large object (a string type that can hold very long texts). PL/SQL (Procedural Language/SQL) [30, 31, 32] is a procedural extension to SQL, used in the Oracle database to declare stored procedures.
  • 42.
MATCHES returns 1 in case of matching and 0 for no match. This number cannot be assigned to a variable because MATCHES does not support functional invocation. The operator can only be used in a SELECT statement, according to the following syntax with the greater-than-zero condition “>0”:

    select column1, column2, .. columnN
    from table
    where MATCHES( column, document VARCHAR2 or CLOB ) > 0

To find all matching profiles of the schema of Section 2.4.2 for a given document single_doc we use the following procedure find_matching_profiles(). single_doc is a PL/SQL variable of the same type as a row of the table Documents. matched_profile is the record of the cursor loop and has the same type as a row of the table Profiles.

    create procedure find_matching_profiles(single_doc documents%rowtype) as
    begin
      for matched_profile in (
        select * from profiles
        where matches(profiles.query, single_doc.article_text) > 0
      ) loop
        -- action(s) for every matching profile:
        -- here we print the matching profile's number to the screen
        dbms_output.put_line('Matched profile ' || matched_profile.query_id);
      end loop;
    end;

The find_matching_profiles procedure collects all matching profiles (according to the mechanism explained in Section 2.4.1) for the document single_doc using the SELECT statement. The for…loop prints all matching profiles (their primary keys) to the screen.
  • 43.
To find matching profiles for all documents in the table, we use the procedure find_all, which uses a for…loop that calls the previous procedure on every document.

    create procedure find_all as
    begin
      for current_document in ( select * from documents ) loop
        dbms_output.put_line('Article: ' || current_document.article_id);
        find_matching_profiles(current_document);
      end loop;
    end;

Instead of displaying the results on the screen we could have specified other actions to be performed on every matching profile. The most common case in a complete filtering application is to insert the primary keys of the results into another database table for further processing by other database modules. In any case, the important feature of the MATCHES statement is the ability to find all matching profiles for a certain document instead of periodically running the stored queries on the documents.

2.5 Conclusions

In this chapter we presented the previous developments on this topic that inspired and helped us implement DLAlert. We evaluated popular alerting services on the web in the area of digital libraries. Then we discussed DIAS, a distributed alert system for digital libraries, and explained in detail its language of profiles and its functionality. We presented the main facilities of the Z39.50 standard, used for publication record retrieval from digital libraries. Finally, we introduced the construction of filtering applications with Oracle Text. In the rest of the dissertation we focus on the design and implementation of DLAlert, starting with an overview of the system.
  • 44.
Chapter 3 System Overview

In this and the following chapters we introduce the main design issues that concerned us during the implementation of DLAlert. First we present an overview of its modules and all related Oracle database features used. Then we give a brief manual for the interface of DLAlert and the language supported.

3.1 DLAlert architecture

DLAlert, as a complete alerting service, consists of several components that cooperate with each other in order to achieve the desired function. The main parts of this system (shown in Figure 3.1-1) are:

The Oracle Database is where the profiles and user data are stored. New publication records are also inserted into the database and stored temporarily until the notification messages (e-mail) are produced and sent to each user. The RDBMS is the core component of DLAlert.

The Observer is the component responsible for collecting new publication records from a Z39.50 gateway (the TUC digital library) and inserting them into the Oracle database.

The Filtering module classifies incoming documents according to the users’ stored queries and finds matching profiles and related publication records.

The Notifying module collects all matched profiles and sends a single message to each user, containing the bibliographical attributes of the relevant publications.

The Observer, the Filtering and the Notifying modules communicate with each other by writing to and reading from common records of the database. These modules are scheduled to perform actions sequentially (see Chapter 8) so that the results from a previous component are processed by the following one.

The Application Server provides the necessary software infrastructure for developing and deploying the middle tier of DLAlert. In addition to providing the necessary forms for the transactions on the web (inserts/deletes/updates of profiles, user credentials), the Alerting Service must check the queries to be stored for syntax errors and maintain user session state (every user has access restricted to the profiles of his account).

  • 45.
[Figure 3.1-1 DLAlert architecture: the user describes his fields of interest and subscribes to the service over the WWW through the Web Application Server (Apache Tomcat or Oracle AS), which inserts, updates and deletes profiles and user credentials in the Oracle database. The Observer requests data on new publications via TCP/IP from the Z39.50 Gateway of the TUC Digital Library (DBMS "Advance"), which sends records; other Z39.50 Gateways and DBMSs can be attached in the same way. The Filtering module reads the stored queries and the new publications data. The Notifying module reads the notifications data and the user e-mail and name, and sends e-mail via SMTP, which the user receives. The Observer, Filtering and Notifying modules are components stored in the database and executed inside the RDBMS environment.]

The other parts of this schema are:

The Z39.50 Gateways and DBMS of Digital Libraries, as described in Section 2.3.

The User, who accesses the alerting service through a web page on the Internet. The Notifying module sends the e-mails using an SMTP server and the user gets them from his mailbox (omitted from the figure).
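The cooperation of the three modules through shared database records can be sketched in a few lines of Python. This is a toy model, not DLAlert's PL/SQL or Java code: the dictionary `db` stands in for the database tables, all function and key names are hypothetical, and a profile is reduced to a single keyword searched in the title.

```python
from collections import defaultdict

def observer(db, fetched_records):
    """Insert newly collected publication records into the shared store."""
    db["publications"].extend(fetched_records)

def filtering(db):
    """Record a (user, title) pair for every profile a publication satisfies."""
    for pub in db["publications"]:
        for profile in db["profiles"]:
            if profile["keyword"].lower() in pub["title"].lower():
                db["matches"].append((profile["user"], pub["title"]))

def notifying(db):
    """Build one message per user, then clear the temporary records."""
    per_user = defaultdict(list)
    for user, title in db["matches"]:
        if title not in per_user[user]:  # list each publication once per user
            per_user[user].append(title)
    db["publications"].clear()
    db["matches"].clear()
    return {user: "New publications:\n" + "\n".join(titles)
            for user, titles in per_user.items()}

def run_cycle(db, fetched_records):
    """Run the modules one after the other, as the scheduler would."""
    observer(db, fetched_records)
    filtering(db)
    return notifying(db)
```

Each step reads only what the previous step wrote, which is why the modules have to run in this order.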
  • 46.
3.2 Main Technologies used

The Oracle database is the core of DLAlert, and most of the application's program code is stored, compiled and executed inside the database. Below we explain the advantages of using the Oracle RDBMS as the essential component of our system.

Nowadays most systems that require a robust and reliable centralized mechanism to store and retrieve large amounts of data utilize a database. One of the main requirements of DLAlert is to be able to handle hundreds of thousands or millions of user profiles and hundreds of new publication records every day in a way that ensures the validity of the information stored. A database that can meet these requirements is therefore indispensable. Any reliable RDBMS provides functionality that ensures data consistency and integrity, transaction concurrency and scalability for handling large amounts of information. Using a software infrastructure like an RDBMS, the developer can consider the aforementioned issues solved and concentrate on the particular requirements of his application.

Oracle Text (as shown in Section 2.4) is a specialized tool provided by the Oracle RDBMS which offers the document filtering techniques and profile matching functionality necessary for DLAlert. Without these mechanisms we would have had to construct the filtering module of our system virtually from scratch. Oracle Text is the key feature that persuaded us to choose the Oracle RDBMS among other equally reliable databases.

PL/SQL [30-32] is Oracle's procedural extension to industry-standard SQL. Its primary strength is in providing a robust, server-side procedural language. PL/SQL code is stored and executed inside the Oracle RDBMS environment and is responsible for the application’s actions that involve data processing. The developer is able to construct procedures, functions and triggers using this language. Logically related stored procedures, functions and object types can be grouped into packages. The filtering and notifying modules are PL/SQL packages.

Oracle RDBMS provides PL/SQL Packages [33] useful for application developers. In DLAlert two of them proved to be very helpful.
  • 47.
o UTL_SMTP is a package that provides functionality for sending e-mail messages over SMTP from the database. We utilized UTL_SMTP for sending the notifications on new publications to users.

o DBMS_JOB can be used to schedule the execution of Oracle packaged procedures at a specific time and on a recurring basis. The collection of new publication records, document filtering and notification of users are operations of DLAlert that must be scheduled to run at certain intervals of time (for example once a day). We also have to ensure that these actions are performed one after the other, so that the results of a previous component are processed by the following one. The main advantage of using DBMS_JOB instead of an external scheduling application is security (no one can access the database without permission). Another important feature is that DBMS_JOB can be programmed to detect the unsuccessful completion of a module and re-schedule it, ensuring the proper sequence of actions.

Java [34] is one of the most popular languages among application developers nowadays. One of the main advantages of the Oracle RDBMS [35-39] is the ability to integrate Java classes inside the database and deploy them on a supplied Enterprise level Java 2 platform (Oracle Java Server). The Java code, which can be loaded either as source or compiled (bytecode), is executed inside the database using the internal Oracle Java Virtual Machine. Database access uses the server-side internal JDBC driver [35, 38], which means that the application can handle large amounts of data faster than any external process. The static methods of any Java class can be declared as stored procedures and can even be called from PL/SQL code. Although Java is not as fast as PL/SQL in the case of intensive data access, it is preferred if the application under development requires a complex computation mechanism or functionality not available in PL/SQL.

DLAlert regularly collects new publication records from Digital Libraries using the Z39.50 protocol (see Section 2.3). In our case no previous Z39.50 implementation in PL/SQL was available, and the parsing of new publication records to be inserted requires much computation. Implementing the Observer module of DLAlert with Java stored procedures was considered the best solution from the performance perspective, as well as the least complex implementation. Alternatively, we could have used an external client for handling Z39.50 requests, based on previously developed functionality written in a different programming language (C, C++, Perl).
  • 48.
The other important component of DLAlert is the Application Server. The Application Server provides a J2EE (Java 2 Enterprise Edition) platform for developing and deploying web applications [40-42]. Multi-tiered applications (shown in Figure 3.2-1) dominate today’s Internet. The client tier contains logic related to presentation and requests for services. The application server contains business logic that reads and writes data. In our case the Application Server provides the Client with the necessary dynamic web pages and business logic for data transactions (profiles, user credentials), restricts every user's access to his own account, checks queries for syntax errors and produces the appropriate error messages in case of invalid profiles. The internal architecture of the middle tier is explained in detail in Section 6.2.

[Figure 3.2-1 Three-tiered architecture: Client (presentation), Application Server (business logic) and RDBMS (data resources); the Client sends service requests to the Application Server]

DLAlert’s middle layer J2EE components are currently deployed using Apache Tomcat. Tomcat is not a pure Enterprise level infrastructure, but the J2EE classes we need can still be loaded as ordinary class libraries. The application was also deployed on Oracle9i AS, which provides advanced features for scalability and better performance. In the end Tomcat was chosen because it consumes far fewer hardware resources (RAM, CPU) than Oracle9i AS on the already loaded server available at the time. We discuss the architecture and implementation of the middle application tier of DLAlert further in Chapter 6.

3.3 Graphical User Interface overview

After directing our browser (Internet Explorer, Netscape or Mozilla) to the URL of DLAlert ( http://intelligence.tuc.gr/alert/login.html ), the login screen of the web GUI appears.
  • 49.
Figure 3.3-1 DLAlert: Login screen

A registered user types his e-mail and password and logs in to his account. An unregistered user clicks ‘Help’ for more information about DLAlert, or clicks ‘New User’ and enters his credentials.

Figure 3.3-2 DLAlert: Welcome screen

  • 50.
In the ‘New User Registration’ screen (Figure 3.3-3) the user writes his e-mail address, his first/last name and the preferred frequency of notifications. A similar screen appears when a registered user wants to update his credentials.

Figure 3.3-3 DLAlert: Registration form

After registering with the service the user is expected to enter his profiles. The profiles define the user’s fields of interest. In Figure 3.3-4 we see the empty account of a presumably new user.

Figure 3.3-4 DLAlert: Empty account
  • 51.
If the user has an account containing stored queries, a table with all of his previously defined profiles appears. For convenience only the profile description is displayed. The user can edit or delete his long-standing queries or enter new ones. In Figure 3.3-5 we see an account containing two profiles.

Figure 3.3-5 DLAlert: Profiles of the account

Suppose we enter a new profile. A form appears that allows us to type the profile name and queries on all bibliographical attributes (see Figure 3.3-6). Text queries based on a language similar to that of CTXRULE (defined in the next section) are allowed on the fields Title, Series, Author, Publisher, Subject and Notes. The publication year, ISBN and ISSN queries can contain only a single number, which must be found in the incoming publication, instead of keywords. Operators and terms in queries are case insensitive. For a profile to be matched, all of its non-empty queries must be satisfied by a given document. The profile description does not affect the matching publications of a profile. It is simply a phrase that reminds the user of the purpose of the given profile.

In Figure 3.3-6 we can see a sample profile with the name “books about Programming”. The profile contains two queries. The text query “programming or Java or C” requests documents that contain at least one of the words “programming”, “Java” or “C” in the Subject bibliographical attribute. The query “2002” on the publication year field restricts the returned publications of the profile to those published in the year 2002. So the desired document must contain at least one of the words entered in the text query and be published in the year 2002.
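The matching rule for a whole profile (every non-empty query must hold, and the description is ignored) can be illustrated with a small Python sketch. It is an assumption-laden toy, not DLAlert's code: the field names, `profile_matches` and `text_query_matches` are hypothetical, and the text-query helper understands only single words joined by "or", which is enough for the sample profile above.

```python
import re

NUMERIC_FIELDS = {"year", "isbn", "issn"}

def text_query_matches(query, field_text):
    """Toy stand-in for a text query: handles only 'or'-separated single words."""
    words = set(re.findall(r"\w+", field_text.lower()))
    alternatives = [w.strip().lower()
                    for w in re.split(r"\s+or\s+", query, flags=re.IGNORECASE)]
    return any(alt in words for alt in alternatives)

def profile_matches(profile, publication):
    """A profile matches when every non-empty query is satisfied."""
    for field, query in profile["queries"].items():
        if not query.strip():
            continue  # empty queries impose no condition
        value = str(publication.get(field, ""))
        if field in NUMERIC_FIELDS:
            if query.strip() != value.strip():
                return False
        elif not text_query_matches(query, value):
            return False
    return True  # the profile description plays no part in matching
```

For the sample profile, a publication with subject "Java language" and year 2002 matches, while the same publication dated 2001 does not.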
  • 52.
Figure 3.3-6 DLAlert: Sample profile

In case of syntax error(s) in queries the system does not update or insert the profile, but shows an error message and displays ‘Syntax error’ to the right of the faulty query. The system stores the profile only after the user corrects it.

In Figure 3.3-7 we can see an invalid profile. The user has probably attempted to update a profile of his account. The user has entered “my area of interest” as the profile name, “1995” as the desired publication year, “environmental” or “socialist” as requested keywords in the Title, and the phrase “Planning research” for the Series bibliographical attribute. The query “Alexakis or” in the author field is invalid, as “or” is the disjunction operator and its second term is missing. An error message appears at the top of the page and the string “Syntax Error” marks the faulty text query or queries. The user must replace the invalid query with a valid one, or omit it from the profile, in order to be able to store the profile in his account. The profile is not stored unless all queries defined by the user are valid.
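A check of this kind, run before a profile is stored, can be sketched as follows. The Python below is a simplified illustration and not DLAlert's validator: the function names are hypothetical, and only two kinds of errors are detected (unbalanced parentheses and a binary operator with a missing term at the start or end of the query), not the full query grammar.

```python
import re

BINARY_OPERATORS = {"and", "or", "not"}

def query_errors(query):
    """Return the problems found by a few simple checks (not a full parser)."""
    errors = []
    depth = 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                break  # a closing parenthesis without an opening one
    if depth != 0:
        errors.append("unbalanced parentheses")
    words = re.findall(r"[^\s()]+", query.lower())
    if words and words[0] in BINARY_OPERATORS:
        errors.append("term1 missing")
    if words and words[-1] in BINARY_OPERATORS:
        errors.append("term2 missing")
    return errors

def invalid_fields(profile_queries):
    """Map every non-empty field holding a faulty query to its errors."""
    result = {}
    for field, query in profile_queries.items():
        if query.strip():
            errors = query_errors(query)
            if errors:
                result[field] = errors
    return result
```

A profile would be stored only when `invalid_fields` returns an empty dictionary; otherwise the returned field names indicate where to display the ‘Syntax error’ marker.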
  • 53.
Figure 3.3-7 DLAlert: Wrong profile

Figure 3.3-8 DLAlert: Another profile

  • 54.
Suppose we have already entered our fields of interest and we log out or close our browser. Every user regularly receives e-mail messages that contain all bibliographical attributes of matching incoming publications. Figure 3.3-9 presents a sample notification for the profile of Figure 3.3-8 (publications that contain at least one of the words “environmental”, “socialist” in the Title). As we can see, terms in queries are case insensitive.

Figure 3.3-9 DLAlert: Sample e-mail message
  • 55.
3.4 The language of the text queries

The language of the text queries (referencing the Title, Series, Author, Publisher, Subject, Notes bibliographical attributes) is a subset of the CTXRULE grammar (see Section 2.4.3). The available query types provided by the interface of DLAlert are shown in Table 3.4-1. Terms are considered words or phrases (series of tokens containing no operators). The queries are not case sensitive.

    Syntax                              Examples                                    Description
    ----------------------------------  ------------------------------------------  ------------------------------------------------------------
    term1                               TITLE: mathematical programming             documents that contain term1.
    term1 and term2                     TITLE: agents and (artificial intelligence) documents that contain both terms.
    term1 or term2                      AUTHOR: ( Bradley Brown ) or Beck           documents that contain term1 or term2.
    term1 not term2                     NOTES: science not electronics              documents that contain term1 but not term2.
    near((term1,term2,...,termn),max)   TITLE: near( ( personal, computers ) , 4)   documents that contain all terms within a set of words with size max. Order of terms is not specified. Limitation: max cannot be greater than 99.
    about(term)                         SUBJECT: about( engineering )               documents that contain concepts that are related to your query word or phrase. Limitation: about(...) and near(...) cannot both occur in the same field.
    $term                               SUBJECT: $library                           documents that contain words with the same linguistic root as term.

Table 3.4-1 DLAlert language

Given the available query types shown in Table 3.4-1 and the grouping characters “( )” we can define complex queries like those in Table 3.4-2. As we can see, DLAlert provides queries with advanced expressiveness capabilities.
  • 56.
    Category                               Example
    -------------------------------------  ------------------------------------------------------------
    complex Boolean statements             TITLE: ( mine or disasters ) not industry
    proximity and Boolean operators        TITLE: near( (mine, disasters) , 5) and industry
    concept queries and Boolean operators  SUBJECT: about( engineering ) or software not databases
    stemmed words and Boolean operators    NOTES: digital and $library
    stemmed words and proximity            NOTES: software not near( (advanced, $electronics) , 3)

Table 3.4-2 Complex queries

Queries like those shown in Table 3.4-3 are invalid and produce a syntax error if the user tries to enter one of them.

    Error type                                           Example
    ---------------------------------------------------  ------------------------------------------------------------
    missing parenthesis                                  TITLE: ( software design
    term2 missing                                        SUBJECT: international and
    term1 missing                                        SUBJECT: not mathematics
    about(...) and near(...) in the same field           TITLE: about(management) and near((financial,planning),4)
    max >99                                              SUBJECT: near((financial,planning),142)
    operator between term1 RDBMS and near(...) missing   NOTES: RDBMS near((data,warehousing),6)
    about is an operator                                 TITLE: journals about science
    missing stemmed word                                 TITLE: digital and $

Table 3.4-3 Syntax errors

3.5 Conclusions

In this chapter we discussed the main functionalities and the architecture of DLAlert. We explained the technologies used in the database as well as in the middle layer of our application. We also introduced the interface of DLAlert and the language supported. In the following chapters we present in detail the implementation of every component of the system.
  • 57. Chapter 4 Database Schema

In this chapter we present in detail the design of the database schema of DLAlert. In addition to the basic requirements analysis and the Entity-Relationship diagram, we also deal with issues like primary-foreign key consistency and the creation and maintenance of the necessary CTXRULE indices (see also Section 2.4).

4.1 Requirements analysis

[Figure 4.1-1 A high level Entity-Relationship diagram. Entities: publication (bibliographical attributes of acquired document), profile (set of queries on document's bibliographical attributes), user. Relationships: publication matches profile (N to M); user defines profile (1 to N).]

The data requirements of DLAlert involve only three entities (as shown in Figure 4.1-1). The entity user contains all necessary user credentials and information: the user’s e-mail, first and last name, and the password for his login into DLAlert. Requiring a password at login ensures that each user can access only his own account. Another helpful feature is the desired frequency of notifications. DLAlert collects all matching documents of a certain user and sends him a single message at the end of the desired interval ('DAY', 'WEEK', 'MONTH') containing all relevant bibliographical attributes. Sending a new e-mail message every time a new matching publication arrives would be annoying.
  • 58. The entity publication contains all necessary bibliographical attributes of the documents to be requested. These attributes cover all characteristics of a document and are the ones included in the query form of TUC Library’s catalog search (see Figure 4.1-2 below). The bibliographical attributes are: Title, Series, Author, Publisher, Subject, Notes, Publication year, ISBN (International Standard Book Number) and ISSN (International Standard Serial Number). The Language attribute is omitted because English is the only language supported by DLAlert at this time. The Observer parses the UNIMARC records retrieved from the Z39.50 Gateway in order to identify the previous fields (see Chapter 7) and populate the corresponding table. These basic attributes can also be identified using records from different sources.

Figure 4.1-2 Technical University Digital Library catalog search engine

The entity profile is a set of conditions to be satisfied by the requested document. These conditions are queries in the CTXRULE language (see Section 2.4) and reference certain attributes of the publications. So we have queries for each of the
  • 59. bibliographical attributes Title, Series, Author, Publisher, Subject, Notes, Publication year, ISBN and ISSN. For a profile to be matched, all non-null queries must be satisfied by a given document. A user can only access and define the profiles of his own account. The user can also specify a short string describing each profile. This description does not affect the set of matching documents but allows a single user to define many profiles in a convenient way.

4.2 Relational schema

[Figure 4.2-1 Relational schema.
    user(email, first_name, last_name, password, frequency)
    publication(public_id, title, series, author, publisher, subject, notes, pub_year, bookn, serialn)
    profile(email, profile_desc, title_query, series_query, author_query, publisher_query, subject_query, notes_query, pub_year_query, bookn_query, serialn_query)
    notification(public_id, email, profile_desc)
    Foreign keys: profile.email references user.email; notification.public_id references publication.public_id; notification.(email, profile_desc) references profile.(email, profile_desc).]

Considering the E-R diagram and the requirements analysis of Section 4.1, we construct the relational schema in Figure 4.2-1. The email address is unique for every user (primary key) and is also stored as a foreign key in the profiles table in order to restrict each user's access to his own account. The primary key of the profiles table consists of two columns (email, profile_desc), so that the profile description together with the user’s account email uniquely identifies a profile. The primary key of the publications table (public_id) is the local number of the UNIMARC record (field 001) in TUC’s Library (Sections 2.3.4 - 2.3.6).
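The keys of Figure 4.2-1 can be tried out in a minimal sketch. It uses Python's built-in sqlite3 module instead of the Oracle RDBMS of DLAlert, so the data types (TEXT, INTEGER) and the helper name open_schema are assumptions of the sketch; the column names, the plural table names used in the text, and the frequency values 'DAY', 'WEEK', 'MONTH' follow this chapter.

```python
import sqlite3

# SQLite stand-in for the relational schema of Figure 4.2-1 (not Oracle DDL).
SCHEMA = """
CREATE TABLE users (
    email      TEXT PRIMARY KEY,
    first_name TEXT,
    last_name  TEXT,
    password   TEXT,
    frequency  TEXT CHECK (frequency IN ('DAY', 'WEEK', 'MONTH'))
);
CREATE TABLE publications (
    public_id INTEGER PRIMARY KEY,
    title TEXT, series TEXT, author TEXT, publisher TEXT, subject TEXT,
    notes TEXT, pub_year INTEGER, bookn TEXT, serialn TEXT
);
CREATE TABLE profiles (
    email        TEXT REFERENCES users(email),
    profile_desc TEXT,
    title_query TEXT, series_query TEXT, author_query TEXT,
    publisher_query TEXT, subject_query TEXT, notes_query TEXT,
    pub_year_query INTEGER, bookn_query TEXT, serialn_query TEXT,
    PRIMARY KEY (email, profile_desc)
);
CREATE TABLE notifications (
    public_id    INTEGER REFERENCES publications(public_id),
    email        TEXT,
    profile_desc TEXT,
    FOREIGN KEY (email, profile_desc) REFERENCES profiles(email, profile_desc)
);
"""


def open_schema():
    """Create an in-memory database with the four tables (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
    conn.executescript(SCHEMA)
    return conn
```

With this schema, inserting a profile for an e-mail address that is not in users, or a user with a frequency other than the three allowed values, is rejected by the database.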
  • 60. In order to represent the relation “publication matches profile” (M to N) we declare a table (notifications) holding the primary keys of the related tables. The primary key of the profiles table (email, profile_desc) is a foreign key in the notifications table. The filtering module populates this table with the primary keys of the satisfied profiles and matching documents. The notification module uses these records to send messages with the matching documents’ bibliographical attributes.

All attributes are declared as variable-length strings (data type varchar2) except public_id, pub_year and pub_year_query, which are integers (data type number). The value of the attribute frequency (users) can be either 'DAY', 'WEEK' or 'MONTH' (CHECK constraint). The profiles and users tables are accessed through the web Graphical User Interface. The tables holding publication and notification records are populated and accessed only by stored procedures of the Oracle RDBMS.

4.3 Key consistency and atomic transactions

Suppose a user wants to update his email while his account contains profiles and/or notifications on publications not yet sent as messages. In other words, the primary key value of a record in the users table must be changed, and this value is also stored as a foreign key in other tables (profiles, notifications). Whenever a transaction consists of more than one insert, update or delete, and either all of these sub-operations must be committed successfully or none of them, we consider the transaction atomic. The atomicity of such actions is ensured by using PL/SQL stored procedures. In our case we declared a package (named “transactions”) that handles updates of the email or profile_desc attributes (both part of a foreign key). It also handles deletes on the profiles and users tables atomically.
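The all-or-nothing behaviour required here can be sketched outside Oracle as follows. The sketch uses Python's sqlite3 module, reduces the tables to their key columns, and omits foreign key declarations for brevity; make_db and this Python version of update_user are hypothetical stand-ins, not the PL/SQL package of DLAlert.

```python
import sqlite3


def make_db():
    """In-memory tables reduced to the columns needed for the example."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (email TEXT PRIMARY KEY);
        CREATE TABLE profiles (email TEXT, profile_desc TEXT,
                               PRIMARY KEY (email, profile_desc));
        CREATE TABLE notifications (public_id INTEGER, email TEXT, profile_desc TEXT);
    """)
    return conn


def update_user(conn, old_email, new_email):
    """Change a user's e-mail in all three tables, or in none of them."""
    # "with conn" commits when the block succeeds and rolls back
    # every statement of the block if one of them raises an error.
    with conn:
        for table in ("profiles", "notifications", "users"):
            conn.execute(f"UPDATE {table} SET email = ? WHERE email = ?",
                         (new_email, old_email))
```

If the new address already belongs to another user, the update of the users table fails on its primary key, and the changes already made to profiles and notifications are rolled back with it.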
For example, consider the scenario in which the user with email del_email unsubscribes from DLAlert. We execute the transaction using this procedure.

    procedure delete_user( del_email varchar2 ) is
        pragma autonomous_transaction;
    begin
        -- remove the rows that reference the user first, then the user
        delete notifications where email=del_email;
        delete profiles where email=del_email;
        delete users where email=del_email;
        -- a single commit makes all three deletes permanent together
        commit;
    end delete_user;

  • 61. When a user unsubscribes, his data must be deleted from three tables: users, profiles and notifications. The declaration “pragma autonomous_transaction;” makes the procedure run as its own transaction, independent of the caller's, so that the three deletes in the PL/SQL block are either all committed by the final commit or none of them is.

The specifications of the other procedures of this package are:

    procedure update_user( old_email varchar2, new_email varchar2 );

Called if the email value in a record is changed from old_email to new_email. Performs the update of the e-mail value on the tables (profiles, notifications, users) as an atomic transaction.

    procedure update_profile( user_email varchar2, old_desc varchar2, new_desc varchar2 );

Called if the profile description value in a record is changed from old_desc to new_desc. Performs the update of the value on the tables (profiles, notifications) as an atomic transaction.

    procedure delete_profile( user_email varchar2, del_desc varchar2 );

Called if the profile with primary key value ( user_email, del_desc ) is deleted. Performs the deletion of the profile on the tables (profiles, notifications) as an atomic transaction.

4.4 Indexing of the stored queries

The next step is to declare the columns that contain stored queries and create the necessary CTXRULE indices. As shown in Section 2.4.4, we must first index the queries in order to be able to collect matching profiles. The bibliographical attributes (Title, Series, Author, Publisher, Subject, Notes) contain plain text, and the corresponding stored queries are generated using the CTXRULE grammar and reference these sections. The queries on ISBN and ISSN are ten and eight digit strings respectively ( 0-9 or X ). Many records of TUC’s Digital Library have multiple ISBN and/or ISSN numbers and
  • 62. the number included in the query must be equal to one of them. Our solution to this problem was to also index the queries on ISBN and ISSN numbers. The MATCHES operator is able to find matching publications with multiple book/serial identifiers.

The publication year attribute always has a single value, and text queries on this attribute using the CTXRULE grammar are usually meaningless. Instead of using the MATCHES predicate to evaluate queries on publication year, we use the standard SQL equality operator (‘=’) in a WHERE clause. So we have to create eight indices that use the CTXRULE functionality, one on each column containing stored queries (all except pub_year_query).

As we showed in Section 2.4.4, indexing null queries produces index errors. Given the structure of DLAlert and the queries we want to provide, we do not expect the user to describe every attribute of the requested document, but at least one. So we have to choose a non-reserved symbol to be inserted in the fields for which the user does not specify a condition. We chose the less-than symbol (“<”) because it is not associated with any functionality or operator. This character is also skipped by the indexing engine, so the number of columns containing this symbol does not affect the size of the index. We construct a simple trigger that is executed before every insert or update on the profiles table and inserts the symbol “<” instead of null. This trigger has an if…then rule like the following for every stored query.

    if ( :new.title_query is NULL ) then
        :new.title_query:='<';
    end if;

Queries that contain only the symbol “<” are displayed as empty strings on the Graphical User Interface (see Sections 6.4 - 6.5) so that this mechanism is transparent to the user.

In order to ensure that filtering results are correct and consistent with the queries, the indices should be synchronized with the base table after DML actions and before filtering.
For this purpose we declared the package “indexing” with a procedure (“sync”) that synchronizes all CTXRULE indices sequentially before any execution of the filtering module. In DLAlert we assume that index synchronization and filtering are executed at least once a day, at the end of the day.
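The placeholder mechanism of this section can be summarized in a few lines. The sketch below is written in Python rather than as the PL/SQL trigger of DLAlert, and the names EMPTY_QUERY, to_stored, to_displayed and store_profile are hypothetical; treating an empty string like NULL mirrors the way Oracle handles empty varchar2 values.

```python
EMPTY_QUERY = "<"  # placeholder chosen in this section for unspecified queries


def to_stored(query):
    """Value written to a *_query column: the placeholder replaces a missing query."""
    if query is None or query == "":
        return EMPTY_QUERY
    return query


def to_displayed(stored):
    """Value shown in the interface: the placeholder is rendered as an empty string."""
    return "" if stored == EMPTY_QUERY else stored


def store_profile(form):
    """Apply the placeholder rule to every query field of a profile (a dict)."""
    return {column: to_stored(value) for column, value in form.items()}
```

Because the two conversions are inverses for every real query, the user never sees the placeholder.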
  • 63. 4.5 Conclusions

In this chapter we analyzed the requirements that the database used by DLAlert must meet. We explained the database schema in detail and proceeded to particular design issues (atomic transactions and index creation). In the next chapter we present the main PL/SQL packages of the system: the filtering and the notifying module.
  • 64. Chapter 5 PL/SQL packages

In this chapter we present the filtering and notification modules of DLAlert, implemented as PL/SQL packages. Packages are constructs that allow us to logically group procedures, functions, object types and items into a single database object. PL/SQL packages have functionality similar to classes in object oriented languages (Java or C++), except that every package is instantiated only once during a database session.

5.1 Filtering module

Suppose we have the publications table of the database schema of Figure 4.2-1 populated with bibliographic attributes of documents, and the profiles table populated with queries of the CTXRULE grammar. In order to be able to produce the necessary messages, we must first find the matched profiles for every document. A profile is matched if all the conditions set on the corresponding attributes of the document are satisfied.

5.1.1 The algorithm

As we have shown in Section 2.4.4, to find all matched profiles for a single document we use the following algorithm. current_publication is a parameter of the procedure find_matched_profiles.

    procedure find_matched_profiles (current_publication publications%rowtype) is
    begin
        for matched_profile in (
            SELECT <needed columns>
            FROM <table holding the profiles>
            … MATCHES condition(s)
        ) loop
            ACTION EXECUTED FOR EVERY matched_profile
        end loop;
    end;
  • 65. First we have to construct the appropriate SELECT statement that collects all matching profiles. If the variable holding the given publication is named current_publication and we collect all matched profiles according to the title attribute and the corresponding query, we have

    select email,profile_desc
    from profiles
    where matches(title_query, current_publication.title)>0

Suppose now we want to find the profiles that satisfy not only the query on the title attribute but the whole set of conditions in a conjunction. The most obvious solution to our problem would be to define this selection.

    select email,profile_desc
    from profiles
    where (matches(title_query, current_publication.title) >0 )
    and (matches(author_query, current_publication.author) >0 )
    … …
    and (matches(serialn_query, current_publication.serialn) >0)

This statement is syntactically correct but produces the runtime error below due to a limitation of the MATCHES operator.

    ORA-20000: Oracle Text error:
    DRG-50610: internal error: MATCHES does not support Functional Invocation

“MATCHES does not support Functional Invocation” means that the results of multiple MATCHES predicates in conjunction cannot be evaluated using the Boolean SQL operator ‘and’, because the value produced by MATCHES cannot be passed on for further processing by the Oracle SQL processor module. This error occurs in all cases where a MATCHES clause that references a cursor is used in conjunction/disjunction with any other condition (MATCHES or standard SQL).
  • 66. There are two possible solutions to our problem. The first one is to treat the results of the SELECT clauses as sets and to use the intersection of the intermediate results of the partially satisfied profiles. This can be done as follows:

    (select email,profile_desc from profiles
     where matches(title_query, current_publication.title) >0 )
    intersect
    (select email,profile_desc from profiles
     where matches(author_query, current_publication.author) >0 )
    … …
    intersect
    (select email,profile_desc from profiles
     where matches(serialn_query, current_publication.serialn) >0 )

Another solution is to use the intermediate results and issue over them a SELECT clause that ensures the equality of their primary keys. This can be done as follows:

    select Title_Results.email,Title_Results.profile_desc
    from
      ( select email,profile_desc from profiles
        where matches(title_query,current_publication.title)>0 ) Title_Results ,
      ( select email,profile_desc from profiles
        where matches(series_query,current_publication.series)>0 ) Series_Results ,
      … … …
      ( select email,profile_desc from profiles
        where matches(serialn_query,current_publication.serialn)>0 ) Serialn_Results
    where Title_Results.email=Series_Results.email
      and Title_Results.profile_desc=Series_Results.profile_desc
      … … …
      and Title_Results.email=Serialn_Results.email
      and Title_Results.profile_desc=Serialn_Results.profile_desc
  • 67. Both SQL statements yield the same performance and are processed by the Oracle SQL query processor in a similar way.

In Section 4.4 we explained that null fields produce index errors, so we chose to use the symbol ‘<’ instead of empty query cells. The symbol ‘<’ is skipped by the Lexer during the indexing process, so it does not appear in the indices. Therefore, in the previous SQL statements, the intermediate SELECT clause that collects the profiles matching a single attribute must also take into account the cells that hold this placeholder. For example, for the title_query and the title attribute it is implemented as follows.

    (select email,profile_desc from profiles
     where matches(title_query, current_publication.title)>0)
    union
    (select email,profile_desc from profiles
     where title_query = '<')

The action executed for every matched_profile is an insert of the primary keys of the matched profile and the relevant publication into the notifications table.

    insert into notifications(public_id,email,profile_desc)
    values( current_publication.public_id,
            matched_profile.email,
            matched_profile.profile_desc );

We must also issue the command commit; so that the transaction is committed. A commit; statement ends a transaction and makes permanent any changes performed. This statement is preferably issued outside the for…loop and executed once, as a commit’s response time is fairly flat regardless of the transaction size.

To find matching profiles for all publications, another for…loop is needed to call the previous procedure on all documents.

    for all_documents in ( select * from publications ) loop
        find_matched_profiles (all_documents);
    end loop;
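The set-based formulation of this section (one partial result per attribute, the '<' placeholder counted as always satisfied, and an intersection over all attributes) can be sketched in Python. The function matches below is only a word-containment stand-in for Oracle's MATCHES operator and does not implement the CTXRULE grammar; the in-memory dictionaries, the restriction to three attributes and the name filter_all are likewise assumptions of the sketch.

```python
ATTRIBUTES = ("title", "author", "subject")  # a subset of the attributes, for brevity
EMPTY_QUERY = "<"                            # placeholder for unspecified queries


def matches(query, text):
    """Stand-in for MATCHES: true if every word of the query occurs in the text."""
    words = text.lower().split()
    return all(word in words for word in query.lower().split())


def find_matched_profiles(publication, profiles):
    """Intersect, over all attributes, the profiles satisfied on each attribute."""
    matched = None
    for attr in ATTRIBUTES:
        # Partial result: query satisfied, or no condition set on this attribute.
        partial = {key for key, queries in profiles.items()
                   if queries[attr] == EMPTY_QUERY
                   or matches(queries[attr], publication[attr])}
        matched = partial if matched is None else matched & partial
    return matched


def filter_all(publications, profiles):
    """Return the (public_id, email, profile_desc) rows for the notifications table."""
    notifications = []
    for publication in publications:
        for email, desc in sorted(find_matched_profiles(publication, profiles)):
            notifications.append((publication["public_id"], email, desc))
    return notifications
```

Here profiles maps the key (email, profile_desc) to a dictionary of stored queries. A profile holding only placeholders would match every publication, which is why the interface requires at least one query per profile.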
  • 68. 5.2 Notifying module

Once the notifications table is populated with matching profiles and publications, all we have to do is summarize the matched documents for every user and send him a single e-mail. The preferred frequency of notifications does not affect the filtering process: all new publications are filtered against all profiles, and the frequency only controls when the e-mail message is sent. For sending e-mail messages from the database we use the supplied package UTL_SMTP. We first present the basic features of this package and then describe the whole functionality of our module.

5.2.1 The UTL_SMTP package

SMTP [43] stands for Simple Mail Transfer Protocol. This is the protocol that was developed to allow people around the world to exchange electronic mail. UTL_SMTP [30, 33, 35, 40] is an email utility that provides us with the ability to send email from the database. In other words, we can dynamically generate email from the database and dynamically send it to different people based on different criteria. The message constructed can be sent as a standard ASCII text email or as an enhanced HTML email. An SMTP connection is initiated by a call to open_connection, which returns an SMTP connection. After a connection is established, the following procedure calls are required to send a mail (we do not specify the complete syntax):

    helo() or ehlo() - identify the domain of the sender
    mail() - start a mail, specify the sender
    rcpt() - specify the recipient
    open_data() - start the mail body
    write_data() - write the mail header/body (multiple calls)
    close_data() - close the mail body and send the mail
    quit() - close the SMTP connection

Using these commands we define a rather complex PL/SQL procedure inside the notifying module’s package, which has the following syntax.
procedure send_mail (in_mail_server, in_sender_email, in_recipient_email, in_recipient_name, in_html_flg, in_subject, in_importance, in_body);
  • 69.
    in_mail_server is the mail server, default value 'intellix.intelligence.tuc.gr',
    in_sender_email is the sender's e-mail address, default value 'alert@intelligence.tuc.gr',
    in_recipient_email is the recipient's e-mail address,
    in_recipient_name is the recipient's full name,
    in_html_flg indicates whether the message has HTML structure, default value 'Y',
    in_subject is the subject of the message, default value 'New Publications',
    in_importance is the importance of the message, default value 'Normal',
    in_body is the body of the message.

All variables are PL/SQL strings (varchar2) [30] except in_body, which is a clob (character large object). Alternatively we could have declared a Java stored procedure to send the messages [35-37] using JavaMail, an API supplied by Sun Microsystems which implements the most commonly used mail protocols. Using JavaMail we could also retrieve e-mail messages from mailboxes, and store and process them inside the database. However, UTL_SMTP provides sufficient functionality for DLAlert at this time.

5.2.2 Collecting the matched publications for a single user

Once we have constructed the send_mail procedure, the next step is to define the procedure that collects all matched publications for a given user and generates a message. In order to generate HTML e-mail messages we use the supplied PL/SQL Web Toolkit [31, 40]. This toolkit includes PL/SQL stored procedures and functions useful for generating dynamic HTML pages directly from the database. It also supports sending those HTML pages directly to the user’s browser, using a suitably configured external Web Server. In our case, the HTML structured data are sent to the user via e-mail only, and we use the PL/SQL Web Toolkit for generating them. The functions used return HTML tags as their character output.
For example:

    Title := htf.title( 'Library Alert Service Notification');

assigns to the string Title the value:

    <title> Library Alert Service Notification </title>
  • 70.
    Text := htf.fontOpen( cface => 'Arial Narrow', csize => '5') || 'T.U.C Library Alert' || htf.fontClose

assigns to the string Text the value:

    <font face="Arial Narrow" size="5"> T.U.C Library Alert </font>

We do not present the functions used in detail, because generating HTML from PL/SQL code is a rather complex issue. The algorithm that collects all matched publications for a given user and generates a message is:

    procedure notify_user( user_email varchar2 ) is
    begin
        --GENERATING THE HEADER OF THE MESSAGE
        for matched_publication in (
            select publications.*
            from publications,
                 (select public_id from notifications
                  where email=user_email
                  group by public_id) matched
            where matched.public_id=publications.public_id
        ) loop
            --APPEND ALL NON-EMPTY BIBLIOGRAPHICAL ATTRIBUTES
            --TO THE MESSAGE FOR EVERY matched_publication
        end loop;
        --FINISHING AND SENDING THE MESSAGE
        --USING THE send_mail PROCEDURE.
    end;

user_email is an input string containing the e-mail of the user to be notified. We concentrate on the SQL SELECT clause used. The clause

    (select public_id from notifications where email=user_email group by public_id) matched
  • 71. retrieves all matched publications of the user with email=user_email and removes duplicate matched entries from the result, in case the user has more than one profile matching the same publication. It also names the result as table matched. The clause

    select publications.*
    from publications,matched
    where matched.public_id=publications.public_id

retrieves the matched publications for a single user from the table holding the bibliographical attributes, using the primary keys of the matched ones collected in table matched.

In order to produce and send messages to all users that have matched profiles we use the following algorithm.

    procedure notify_all is
    begin
        for current_user in (
            select email from notifications group by email
        ) loop
            notify_user(current_user.email);
            delete from notifications where email = current_user.email;
        end loop;
        commit;
    end;

The SELECT clause retrieves all unique e-mail addresses of the users that have matched profiles in the notifications table, and the loop calls the notify_user procedure for each one of them. After each successfully sent message we must also delete the sent notifications, with the simple DML statement

    delete from notifications where email = current_user.email;
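The flow of notify_user and notify_all can be sketched in Python with the standard sqlite3 and email modules. This is not the UTL_SMTP and PL/SQL Web Toolkit code of DLAlert: the sender address alert@example.org, the reduced message layout, the send callback (which stands in for the send_mail procedure) and the frequency parameter (which selects users by their preferred notification interval) are all assumptions of the sketch.

```python
import sqlite3
from email.message import EmailMessage
from html import escape


def notify_user(conn, user_email):
    """Build one HTML message listing every publication matched for the user."""
    # The IN subquery yields each matched publication once, even when
    # several profiles of the same user matched it.
    rows = conn.execute(
        "SELECT title, author FROM publications "
        "WHERE public_id IN (SELECT public_id FROM notifications WHERE email = ?) "
        "ORDER BY public_id", (user_email,)).fetchall()
    items = "".join(f"<li>{escape(title)} ({escape(author)})</li>"
                    for title, author in rows)
    message = EmailMessage()
    message["From"] = "alert@example.org"  # placeholder sender address
    message["To"] = user_email
    message["Subject"] = "New Publications"
    message.set_content(f"<html><body><ul>{items}</ul></body></html>", subtype="html")
    return message


def notify_all(conn, send, frequency):
    """Notify every user of the given frequency who has pending notifications."""
    emails = [row[0] for row in conn.execute(
        "SELECT DISTINCT u.email FROM users u "
        "JOIN notifications n ON n.email = u.email "
        "WHERE u.frequency = ? ORDER BY u.email", (frequency,))]
    for email in emails:
        send(notify_user(conn, email))
        conn.execute("DELETE FROM notifications WHERE email = ?", (email,))
    conn.commit()  # one commit after the loop, as in the PL/SQL version
    return emails
```

Passing the delivery function as a parameter keeps the sketch testable without a mail server; in a real deployment it would hand the message to an SMTP client.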
  • 72. If we want to notify users according to their desired frequency of notifications, we must construct the appropriate procedures that notify all users who have defined the same interval between e-mail messages. For example, if we want to notify all users that want to be sent e-mail messages every day, in case there is a matching publication, then instead of the previous SELECT statement that controls the for…loop we issue

    select users.email
    from notifications, users
    where users.frequency='DAY'
      and users.email=notifications.email
    group by users.email

Therefore we send messages to each category of users separately. For example, the users that selected day as their preferred frequency are notified each day, those who preferred week once a week, and so on. As we said in the previous section, the profiles are not filtered separately; the frequency of notifications affects only the time the messages will be sent to the user. In Chapter 9 we explain the scheduling of each module in more detail.

5.3 Performance

We tried to measure the performance of DLAlert. Our goal was to ensure that the time needed for filtering would be acceptable, considering that this process would be executed once a day. Our measurements represent the worst case scenario on the server of the Intelligent Systems Laboratory (two Pentium III processors, 1 GB RAM). We took documents from the work of Theodoros Koutris and Christos Tryfonopoulos on a DIAS implementation [53,54]. We also generated profiles on random keywords encountered in those documents, using the profile generator of the same DIAS implementation (only Boolean and proximity statements), and parsed them into equivalent CTXRULE expressions. The time taken for indexing 340082 complex stored queries (that is, 100000 profiles with 1 to 4 stored queries on attributes each) on the server of the Intelligent Systems Laboratory was approximately 10 minutes.
We did not measure the time taken to insert the profiles. This was considered quite satisfactory performance, since indexing 340082 stored queries means that there are 340082 new queries (inserts or updates) since the last index synchronization (the previous day, for example), which is most unlikely to happen even for the popular alerting applications we described in Section 2.1.
  • 73. Assuming that index synchronization is executed once a day, we are able to store an even larger number of profiles. The time taken for roughly one tenth of the queries ( 3300 ) was approximately 2 minutes, which means that the time required is not a linear function of the number of profiles inserted.

As stated in the “Oracle Text Technical Overview” [27], CTXRULE query performance depends mainly on the size and number of documents. As these factors increase, there are more unique words, each of which results in a query on the index. Performance is also affected by the number of unique rules indexed and the complexity of the stored queries. However, the number of unique rules has much less impact on query performance than the size of the document.

The SQL query that collects matching profiles is rather complex in both cases. The time taken for filtering 84 documents against 340082 stored queries (that is, 100000 profiles with 1 to 4 stored queries on attributes each) on the server of the Intelligent Systems Laboratory was approximately 13 minutes with both of the SQL formulations of Section 5.1.1. The size of an attribute of a document could vary from a few words to a few thousand words. The total size of all documents (which is the most important factor) was 3.5 Mbytes (approximately 380.000 words), which is considered a very large number of words for filtering. In the actual implementation of DLAlert, when records are retrieved from the Digital Library, the average record size is much smaller, since bibliographic attributes are usually short strings. Even assuming that filtering, which is executed once a day, requires roughly 13 minutes on average, this time is acceptable for the purposes of our application. Oracle states that the expected response time for filtering 64337 words against 16000 indexed queries is approximately 20 sec.
In Chapter 9 we present techniques, proposed for future work on DLAlert, that will reduce this time further. We have to underline that filtering using MATCHES is always a CPU-intensive task, and on the server we used for development there are many other processes running all the time. Also, we did not utilize any additional functionality provided by Oracle that speeds up the overall database performance. However, our goal was to roughly estimate the time needed for filtering and indexing, given the usual workload on the server of the Intelligent Systems Laboratory, and to decide whether DLAlert could be deployed on this computer. Our measurements do not evaluate the overall performance of Oracle Text.
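The figures reported in this section can be turned into rough rates with a few lines of arithmetic. The Python sketch below uses only the approximate numbers quoted above; the variable names are introduced for the illustration, and since the two filtering figures come from different hardware and different numbers of indexed queries, they are not directly comparable.

```python
# Approximate measurements quoted in Section 5.3
INDEXED_QUERIES = 340082
INDEXING_SECONDS = 10 * 60       # about 10 minutes
FILTERED_DOCUMENTS = 84
FILTERED_WORDS = 380_000         # about 3.5 Mbytes of text
FILTERING_SECONDS = 13 * 60      # about 13 minutes
ORACLE_WORDS = 64337             # figure attributed to Oracle
ORACLE_SECONDS = 20

# Indexing throughput in stored queries per second
indexing_rate = INDEXED_QUERIES / INDEXING_SECONDS

# Filtering cost per document and throughput in words per second
seconds_per_document = FILTERING_SECONDS / FILTERED_DOCUMENTS
dlalert_words_per_second = FILTERED_WORDS / FILTERING_SECONDS

# Throughput implied by the figure attributed to Oracle
oracle_words_per_second = ORACLE_WORDS / ORACLE_SECONDS

print(f"indexing:  about {indexing_rate:.0f} queries per second")
print(f"filtering: about {seconds_per_document:.1f} s per document, "
      f"{dlalert_words_per_second:.0f} words per second")
print(f"Oracle's figure: about {oracle_words_per_second:.0f} words per second")
```

Under these numbers a document of the test set takes a little over nine seconds to filter, which is consistent with the conclusion that a daily run is affordable.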
  • 74. 5.4 Conclusions

We explained in detail the essential PL/SQL components of DLAlert, the filtering and the notifying module. The filtering module finds all matched profiles for every publication and stores the primary keys of those rows in a table. The notifying module processes those data, and produces and sends dynamic HTML e-mail messages to each user containing the bibliographical attributes of all matching publications. In the next chapter we present the Graphical User Interface of DLAlert.
  • 75. Chapter 6 The Graphical User Interface

In this chapter we present the GUI of DLAlert. We start with an overview of this component and continue with particular technical issues and implementation details. The URL of DLAlert is http://www.intelligence.tuc.gr/alert/login.html .

6.1 Middle application tier architecture

In this section we explain in detail the 3-tiered architecture of our web application, first introduced in Section 3.2, and we present the specific characteristics of our implementation.

[Figure 6.1-1 DLAlert: Three-tiered architecture. Client Host (HTML, JavaScript) connects via HTTP to the Web Tier (presentation logic: JSP pages), which connects via RMI to the Business Tier (business logic: EJB stateful session bean), which connects via JDBC to the Resources (data and stored procedures). The Web Tier and the Business Tier form the Middle Application Tier.]

The figure above presents the components of DLAlert organized as a J2EE platform application. This schema consists of the following elements:

The Client receives dynamic HTML pages from the Application Server via HTTP (Hyper-Text Transfer Protocol). JavaScript is code executed in the client’s browser and, in our case, is responsible for validating the fields of HTML forms. If the client discovers that the necessary user credentials are not filled in during user registration, it produces an error message and does not send those parameters to the middle tier. The client checks that the profile description and at least one of the query fields on the edit/new profile form are non-empty before sending them to the application server. The e-mail, publication year (four digits required), ISBN and ISSN fields are also validated on the client. JavaScript is a language of limited functionality (unlike Java) and is not used to validate CTXRULE text queries (which are generated according to a complex recursive grammar).
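The field checks performed on the client can be sketched as follows. DLAlert implements them in JavaScript; the Python version below only illustrates the rules, and the form field names, the deliberately loose e-mail pattern and the error messages are assumptions. The ISBN and ISSN patterns follow the description of Section 4.4 (ten and eight characters from 0-9 or X).

```python
import re

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # loose check, not a full address grammar
YEAR_RE = re.compile(r"\d{4}")                      # four digits required
ISBN_RE = re.compile(r"[0-9X]{10}", re.IGNORECASE)
ISSN_RE = re.compile(r"[0-9X]{8}", re.IGNORECASE)

# Hypothetical names of the query fields on the edit/new profile form
QUERY_FIELDS = ("title", "series", "author", "publisher", "subject",
                "notes", "pub_year", "isbn", "issn")


def is_valid_email(email):
    """True if the registration e-mail looks like name@host.domain."""
    return EMAIL_RE.fullmatch(email) is not None


def validate_profile_form(form):
    """Return the error messages for one edit/new profile form (a dict)."""
    errors = []
    if not form.get("profile_desc", "").strip():
        errors.append("profile description is empty")
    filled = {field: form.get(field, "").strip() for field in QUERY_FIELDS}
    if not any(filled.values()):
        errors.append("at least one query field is required")
    checks = (("pub_year", YEAR_RE, "publication year must have four digits"),
              ("isbn", ISBN_RE, "ISBN must have ten characters (0-9 or X)"),
              ("issn", ISSN_RE, "ISSN must have eight characters (0-9 or X)"))
    for field, pattern, message in checks:
        # Empty fields are allowed; only filled ones are checked for format.
        if filled[field] and not pattern.fullmatch(filled[field]):
            errors.append(message)
    return errors
```

The text queries themselves are deliberately left unchecked here, since their validation belongs to the business tier.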
  • 76. The Web tier generates presentation logic, accepts user input from HTML forms and generates appropriate responses for the user. We implement this tier as pages created with Java Server Pages (JSP) [45, 46] technology on the application server. JSPs simplify the development of dynamic Web pages. JSP technology enables us to mix regular, static HTML with dynamically generated content. The parts that are generated dynamically are marked with special HTML-like tags and contain Java code. Apart from the standard JSP tags, in our application we used the Oracle9iAS Containers for J2EE (OC4J) Custom Tag Library for SQL, provided by Oracle with the Oracle9i Application Server [40, 47]. A tag library defines a collection of custom actions for JavaServer Pages. OC4J is a framework for rapid JSP development. The OC4J tags related to database access support opening and closing a database connection and executing a query or any other SQL statement (DML or DDL) within JSP code. Apart from the standard dynamic pages related to presentation, we have constructed JSPs, not visible to the user, that process transactions on user credentials and already validated profiles. Although the OC4J functionality is provided with the Oracle AS, applications developed with this framework can be deployed on any other application server that supports JavaServer Pages technology.

The Business Tier implements business logic related to the user’s session and is developed using Enterprise JavaBeans (EJB) technology. An EJB is a server-side component, written in Java, that encapsulates the business logic of an application. In DLAlert this functionality includes user authentication, profile validation and data retrieval (user credentials and profiles) from the database. Atomic transactions that update or delete foreign keys and require stored procedure calls (see the package ‘transactions’ in Section 4.3) are also handled by the EJB.
A stateful session bean is an EJB that acts as a server-side extension of the client that uses it. The stateful session bean is created by a client and works only for that client until the client connection is dropped or the bean is explicitly removed. We therefore use the EJB to restrict the user's access to his own account and to maintain login information. A very important function of the EJB is the validation of profiles according to the CTXRULE language. If the profile contains at least one wrong query, the EJB produces an error message, which is displayed in the corresponding dynamic JSP page, and the phrase ‘Syntax Error’ appears to the right of each wrong query. This mechanism was implemented using the Java Compiler-Compiler (JavaCC) [50], a parser generator for Java.
JavaCC generates source code for parsers using LL(k) grammars. We explain the implementation of syntax checking in detail in Section 6.4. We could have included all transaction handling functionality inside the EJB instead of using the OC4J custom tag library, but applications that use this framework are easier to develop and maintain. The JSPs nevertheless interact with the EJB in a way that ensures security and isolation of the user's session.

The Web and Business Tiers communicate with each other using Remote Method Invocation (RMI). RMI is a Java-based Application Programming Interface (API) for distributed object computing and Web connectivity. RMI allows an application to obtain a reference to an object that exists elsewhere on the network and then invoke methods on that object as though it existed locally. The web and business tiers can therefore be deployed on different J2EE platforms, although in our case they are deployed on the same application server.

The Middle Application Tier communicates with the database using the Java Database Connectivity interface (JDBC) [35, 38, 40]. The JDBC API is a specification for database connectivity using Java. Software vendors (like Oracle) produce their own JDBC drivers that implement the API specification to a greater or lesser degree, but all of them support a common set of interfaces. Thus the way the programmer interacts with the database is, to some extent, independent of the JDBC driver used.

The Database tier (also called EIS: Enterprise Information System) includes the RDBMS infrastructure (both data and stored procedures). We have already explained the role of the RDBMS in detail in Chapter 3.

6.2 The Enterprise Java Bean

In the next sections we concentrate on the business logic of DLAlert. First we present the main class used as an Enterprise Java Bean.
The class of the EJB is called ‘LoginBean’ and maintains the login and account information of the user session. The main private fields of this class are:

JDBC related fields.
o java.sql.Connection conn: the JDBC connection to the database.
o java.sql.Statement stmt: the SQL statement to be executed.

Database schema related fields.
o java.lang.String dbUser: the username for the database schema (constant).
o java.lang.String dbPass: the password for the database schema (constant).
o java.lang.String dbURL: the URL of the database (constant).

Account related fields.
o java.lang.String username: the username of the account (e-mail).
o java.lang.String password: the password for the account.
o java.lang.String[] ProfileArray: an array of strings with the profile descriptions of all profiles inside the account.

Objects representing entities inside the database.
o ReadProfile ResultProfile: object representing the profile to be inserted/updated or the profile read from the database. This object holds all the queries of the profile. The functionality of this class is rather complex, so it is presented separately in the next section.
o User ResultUser: object holding all user credentials (e-mail, password, first name, last name, frequency) as strings. This object is instantiated during the retrieval of the user credentials from the database. The class includes the five strings that represent user credentials and simple public methods that assign or return their values.

The main methods of the Enterprise Java Bean are:

Methods called during login/logout.
o void Initialize(): clears all account information (the user logs out).
o Boolean authenticate(): returns true if there is a registered user with the given e-mail/password in the database. The e-mail and password are required at login.

Account information related methods.
o User getCurrentUser(): returns an object holding all of the user credentials after reading the corresponding data from the database (table users).
o java.lang.String[] getProfileArray(): returns an array of strings with the profile descriptions of all profiles inside the account after reading the corresponding data from the database (table profiles).
Profile validation related methods (explained in Section 6.5).
o ReadProfile getReadProfile(java.lang.String ProfileDesc): returns an object holding all queries of the profile with name ProfileDesc. Calls the constructor of the ReadProfile class.
o StoreProfile getStoreProfile(): returns a profile to be stored, already parsed.
o StoreProfile getStoreProfile( … ): constructs, parses and returns the parsed profile to be stored as an object. Calls the constructor of the StoreProfile class.

Methods that prevent primary key constraint violation errors.
o Boolean ProfileExists(java.lang.String NewProfileName): returns true if a profile with description NewProfileName already exists inside the user's account. Prevents a primary key constraint violation on the table profiles.
o Boolean UserExists(java.lang.String NewEmail): returns true if a user with e-mail NewEmail already exists. Checked before a new user registration or a credentials update.

Methods that represent atomic transactions (see Section 4.3). They call PL/SQL stored procedures of the package ‘transactions’.
o void DeleteProfile(java.lang.String DelProfileName): deletes the profile named DelProfileName from the account of the user.
o void DeleteUser(): deletes all user information from the database (unsubscription).
o void UpdateUser(java.lang.String NewEmail): updates the current user's e-mail to NewEmail.
o void UpdateProfileDesc(java.lang.String OldProfileName, java.lang.String NewProfileName): renames the profile OldProfileName inside the account to NewProfileName.
6.3 OC4J custom tag library

Transactions can be declared inside JavaServer Pages. These transactions use the OC4J custom tag library for SQL functionality. The tags used from this library are the following.

We use the dbOpen tag to open a database connection for subsequent SQL operations:

    <database:dbOpen
        [ connId = "connection_id" ]
        [ scope = "page" | "request" | "session" | "application" ]
        user = "username"
        password = "password"
        URL = "databaseURL"
        [ commitOnClose = "true" | "false" ] >
        … OPTIONALLY JSP CODE …
    </database:dbOpen>

Parameters:
o connId: optionally used to specify an ID name for the connection. You can then reference this ID in subsequent tags such as dbExecute. Alternatively, we can nest dbExecute tags inside the dbOpen tag.
o scope (used only with a connId): specifies the desired scope of the connection instance. The default is page scope.
o user: the username of the database schema.
o password: the password for the database schema.
o URL: the URL of the RDBMS.
o commitOnClose: "true" for an automatic SQL commit when the connection is closed or goes out of scope. The default setting is "false", for an automatic rollback on connection close.

We use the dbExecute tag to execute a single DDL or DML statement, either inside the dbOpen tag or outside of it using the same connId and scope parameters. The syntax for this tag is:
    <database:dbExecute
        [ connId = "connection_id" ]
        [ scope = "page" | "request" | "session" | "application" ] >
        … DML or DDL statement (one only) …
    </database:dbExecute>

We use the dbClose tag outside the dbOpen tag to explicitly terminate a database connection. We use the same parameters defined in dbOpen to reference the same connection.

    <database:dbClose
        connId = "connection_id"
        [ scope = "page" | "request" | "session" | "application" ] />

OC4J includes many other useful tags that we did not use in DLAlert and that are not covered in this dissertation. For a complete reference of this library see [48, 49].

6.4 Preventing CTXRULE index errors

Before explaining in detail the components that store and read profiles from the database, we must introduce the mechanism that validates profiles. The text queries on bibliographical attributes (Title, Series, Publisher, Subject, Notes) are generated according to the CTXRULE grammar. Invalid profiles should not be inserted into the database but rejected by the web interface. The user is not allowed to define wrong queries; if he does, an error message appears.

There are several restrictions on the CTXRULE grammar. For each of the cases below, an index error appears.

Queries that contain obvious syntax errors, like unclosed parentheses, a missing term or a missing operator.

    Error description                                  Query
    missing parenthesis                                ( information systems
    Term2 missing                                      security and
    Term1 missing                                      or mathematics
    operator between term1 and near(...) missing       RDBMS near((data, warehousing),6)

    Table 6.4-1 Wrong queries
Queries that violate limitations on certain operators.

    Error description                               Query
    proximity parameter > 100                       near((artificial, intelligence),120)
    theme inside about(..) in upper case            about ( POLITICIANS )
    about(...) and near(...) in the same field      about(management) and near((financial, planning),4)

    Table 6.4-2 Wrong queries

Queries like the second one above do not produce index errors but are never expanded properly. According to the Oracle Text Reference [24, page 3-7], the normalization of themes inside about(…) queries is case-sensitive. The themes stored inside the database are in lower case. Therefore, to ensure that normalization always finds the appropriate theme, we must turn words or phrases inside about(…) statements to lower case.

Queries that contain reserved words. The CTXRULE language has many reserved words and symbols; some of them are not even associated with functionality for this type of index (they represent functionality of the CONTEXT index type only [23-24]). Reserved words that are to be treated as query terms should be enclosed in ‘{ }’. The unused symbols are escaped using a ‘\’. We have also decided to include only one type of theme query, the about(…) statement, because it returns the greatest number of relevant concepts during expansion. Hence we have the following categories of reserved words:

Thesaural operators not used

    Operator   Name
    BT         Broader Term
    BTG        Broader Term Generic
    BTI        Broader Term Instance
    BTP        Broader Term Partitive
    NT         Narrower Term
    NTG        Narrower Term Generic
    NTI        Narrower Term Instance
    NTP        Narrower Term Partitive
    PT         Preferred Term
    RT         Related Term
    TR         Translation Term
    TRSYN      Translation Term Synonym
    TT         Top Term
    SYN        Synonym

Operators used only for XML document classification (not plain text)

    WITHIN, HASPATH, INPATH

Operators not supported by CTXRULE

    Operator   Symbol   Name
    FUZZY      ?        fuzzy
    ACCUM      ,        Accumulate
    (none)     %        wildcard characters
    (none)     _        wildcard character
    (none)     !        soundex
    SQE        (none)   Stored Query Expression
    (none)     >        threshold
    (none)     *        weight
    MINUS      -        MINUS
Rest of reserved operators

    Operator   Symbol   Meaning
    AND        &        Boolean and
    OR         |        Boolean or
    NOT        ~        Boolean and-not
    NEAR       ;        proximity
    (none)     $        stem
    ABOUT      (none)   related concepts
    (none)     ()       grouping characters
    (none)     []       grouping characters

We have decided to use only word operators when possible (AND, OR, NOT, NEAR). We also do not use ‘[ ]’ as grouping characters. The stemming character ($) is the only symbol operator used. Analyzing the requirements of our application, we therefore conclude that we must implement a mechanism that escapes or rejects the following reserved words and symbols.
o Escaped reserved words: ACCUM, BT, BTG, BTI, BTP, FUZZY, HASPATH, INPATH, MINUS, NT, NTG, NTI, NTP, PT, RT, SQE, SYN, TR, TRSYN, TT, WITHIN.
o Escaped symbols: & , ? , - , ; , ~ , > , * , % .
o Any other special character or symbol is omitted.

The CTXRULE index contains only keywords, and escaped symbols are never indexed by default. Therefore including an escaped special symbol in a query does not affect the filtering results. By default the index engine usually treats special symbols as token delimiters. We cannot expect the user to be an expert in the CTXRULE language, so we must construct a parser that automatically escapes reserved tokens. This functionality should be hidden from the user, so the characters { } added by the parser are not shown in the web GUI.

Empty queries. As we said, empty cells in profiles are substituted by the symbol <. This character is skipped by the indexing engine, so it is not included in the index. This symbol should also not be shown in the web GUI.
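The escaping rules listed above can be illustrated with a small stand-alone sketch. It is not the JavaCC-generated parser used by DLAlert: the class name ReservedTokenEscaper and its method are hypothetical, and the choice to keep whitespace, ‘$’, parentheses and commas (needed by the stemming, grouping and near(…) syntax) while dropping every other special character is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper; a sketch of the escaping rules, not DLAlert's parser. */
class ReservedTokenEscaper {
    // Reserved words from the list above; wrapped in { } when used as terms.
    private static final Set<String> RESERVED = new HashSet<>(Arrays.asList(
        "ACCUM", "BT", "BTG", "BTI", "BTP", "FUZZY", "HASPATH", "INPATH",
        "MINUS", "NT", "NTG", "NTI", "NTP", "PT", "RT", "SQE", "SYN",
        "TR", "TRSYN", "TT", "WITHIN"));
    // Symbols from the list above; each is escaped with a backslash.
    private static final String ESCAPED_SYMBOLS = "&?-;~>*%";
    // Assumption: characters the query syntax itself needs are kept as-is.
    private static final String KEPT_SYMBOLS = "$(),";

    static String escape(String query) {
        StringBuilder out = new StringBuilder();
        StringBuilder word = new StringBuilder();
        for (char c : query.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                word.append(c);          // still inside a word
                continue;
            }
            flushWord(word, out);        // a word (if any) has just ended
            if (ESCAPED_SYMBOLS.indexOf(c) >= 0) {
                out.append('\\').append(c);
            } else if (Character.isWhitespace(c) || KEPT_SYMBOLS.indexOf(c) >= 0) {
                out.append(c);
            }
            // any other special character is omitted
        }
        flushWord(word, out);
        return out.toString();
    }

    private static void flushWord(StringBuilder word, StringBuilder out) {
        if (word.length() == 0) {
            return;
        }
        String w = word.toString();
        if (RESERVED.contains(w.toUpperCase(Locale.ROOT))) {
            out.append('{').append(w).append('}');
        } else {
            out.append(w);
        }
        word.setLength(0);
    }
}
```

Under these assumptions, a query such as "fuzzy logic & sets" is stored with the reserved word in braces and the ampersand preceded by a backslash, while a well-formed near(…) query passes through unchanged.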
In conclusion, we have constructed a parser that is executed before profiles are stored in the database. The parser:
o Checks queries for syntax errors.
o Allows only two digits in the proximity parameter ( < 100 ).
o Allows only one of the statements about(…) or near(…) in the same text query.
o Turns themes inside about(…) clauses to lower case.
o Escapes reserved words and symbols.

To implement this functionality we used JavaCC, a compiler generator for Java. JavaCC processes a text file that defines the grammar and the semantic actions of the compiler and generates the appropriate Java source code. To construct this parser we have used the LL(1) grammar given below. We must underline that this grammar neither defines the actual CTXRULE grammar used by Oracle nor represents the parser used for CTXRULE indexing. A complete definition of the CTXRULE language is not provided by Oracle. The rules below therefore represent the language used by DLAlert and define a heuristic that detects syntax errors. The necessary semantic actions are not included in the grammar. Ambiguities that occur in the grammar can be handled by the lookahead (one token) mechanism of the parser. The tokens (set in bold in the original layout) are EOF, OR, AND, NOT, NEAR, ABOUT, left, right, comma, word, stemmed_word, reserved_word, number and two_digits.

    (1)  text_query → or_exp EOF | EOF

A text query can be an empty or a non-empty expression. EOF is the end of the string (End Of File).

    (2)  or_exp  → and_exp ( OR and_exp )*
    (3)  and_exp → not_exp ( AND not_exp )*
    (4)  not_exp → operand ( NOT operand )*

Rules (2), (3) and (4) allow Boolean queries according to the operator precedence explained in Section 2.4.3. The symbol * means “zero or more occurrences”.

    (5)  operand → group | about_exp | near_exp | ( any_word )+

An operand can be a group of expressions, an about or near expression, or a phrase (a series of words). The symbol + means “at least one occurrence”.
    (6)  group → left or_exp right

A group of expressions starts and ends with parentheses.

    (7)  near_exp → NEAR left left ( any_word )+ [ comma ( any_word )+ ]+ right comma two_digits right

A near expression has the following syntax: near ( (term1, term2, ..., termn) , max_span )

    (8)  about_exp → ABOUT left concept right

An about expression has the following syntax: about(concept)

    (9)  concept → ( any_word | any_operator )+

The concept can be a series of any words or operators (operators are treated as plain keywords inside about(..) ).

    (10) any_operator → AND | OR | NOT | NEAR | ABOUT
    (11) any_word → word | stemmed_word | reserved_word | number | two_digits

A word can be a plain word, a stemmed word (starting with $), a reserved word or any number.

Our language has minor differences from the actual CTXRULE grammar. The DLAlert grammar does not allow:
o Symbol operators ( & , | , ~ , ; ) (escaped by the parser).
o Brackets as grouping characters ( [ , ] ) (escaped by the parser).
o The syntax “term1 near term2” for proximity.
o Order of terms in proximity queries.
o Stemming on expressions or phrases. Equivalent expressions can be defined using stemming on each word separately. For example, the query “$( software and engineering )” can be equivalently specified as “$software and $engineering”, so the first syntax is not allowed.
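To make rules (1) to (11) concrete, the following is a minimal hand-written recursive-descent checker. It is only a sketch: DLAlert uses a JavaCC-generated parser, whereas the class QuerySyntaxChecker below is hypothetical. It assumes that keywords are case-insensitive, that two_digits accepts one or two digits, and that the "only one of about(…) or near(…)" restriction can be checked with two flags; escaping and lower-casing are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical checker for rules (1)-(11); not the JavaCC parser of DLAlert. */
class QuerySyntaxChecker {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;
    private boolean sawAbout = false;
    private boolean sawNear = false;

    private QuerySyntaxChecker(String query) {
        // Pad structural characters so that a whitespace split yields tokens.
        String padded = query.replace("(", " ( ").replace(")", " ) ").replace(",", " , ");
        for (String t : padded.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
    }

    /** Rule (1): an empty query or one or_exp followed by end of input. */
    static boolean isValid(String query) {
        QuerySyntaxChecker p = new QuerySyntaxChecker(query);
        try {
            if (!p.tokens.isEmpty()) p.orExp();
            return p.pos == p.tokens.size() && !(p.sawAbout && p.sawNear);
        } catch (IllegalStateException syntaxError) {
            return false;
        }
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : ""; }
    private boolean at(String t) { return peek().equalsIgnoreCase(t); }
    private void expect(String t) {
        if (!at(t)) throw new IllegalStateException("expected " + t);
        pos++;
    }
    private boolean atOperator() {                       // rule (10)
        return at("and") || at("or") || at("not") || at("near") || at("about");
    }
    private boolean atWord() {                           // rule (11)
        String t = peek();
        return !t.isEmpty() && !atOperator()
            && !t.equals("(") && !t.equals(")") && !t.equals(",");
    }
    private void words() {                               // ( any_word )+
        if (!atWord()) throw new IllegalStateException("expected a term");
        while (atWord()) pos++;
    }

    private void orExp()  { andExp();  while (at("or"))  { pos++; andExp(); } }   // (2)
    private void andExp() { notExp();  while (at("and")) { pos++; notExp(); } }   // (3)
    private void notExp() { operand(); while (at("not")) { pos++; operand(); } }  // (4)

    private void operand() {                             // rule (5)
        if (at("(")) {                                   // rule (6)
            pos++; orExp(); expect(")");
        } else if (at("about")) {                        // rules (8), (9)
            sawAbout = true;
            pos++; expect("(");
            if (!atWord() && !atOperator()) throw new IllegalStateException("empty concept");
            while (atWord() || atOperator()) pos++;
            expect(")");
        } else if (at("near")) {                         // rule (7)
            sawNear = true;
            pos++; expect("("); expect("(");
            words(); expect(","); words();
            while (at(",")) { pos++; words(); }
            expect(")"); expect(",");
            if (!peek().matches("\\d{1,2}")) throw new IllegalStateException("bad span");
            pos++;
            expect(")");
        } else {
            words();
        }
    }
}
```

With this sketch the wrong queries of Tables 6.4-1 and 6.4-2 that are syntax or operator-limit errors are rejected, while the upper-case theme case is left to a separate lower-casing step.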
6.5 Parsing the text queries

As the previous section showed, we need a mechanism that parses the profiles and produces error messages. If a user tries to insert or update a profile that contains an invalid query, the application should be able to point out the syntax error. If the form were always filled with data from the RDBMS, we would not be able to provide this functionality in case of a syntax error, because the wrong query would be lost. Therefore we must implement an object that holds the values of the profiles. The session EJB decides whether the dynamic JSP displayed on screen contains queries read from the database, a profile that was not successfully inserted or updated, or empty text fields in the case of a new profile. We also assume that the profiles already in the database are valid and should not be re-parsed (unless the user tries to update them).

[Figure 6.5-1 Class Hierarchy: StoreProfile → ReadProfile → Profile]

For this purpose we have constructed a class hierarchy as shown in Figure 6.5-1. The arrows represent an “is a” relation. The class Profile is an abstract class and cannot be instantiated.

The class ReadProfile includes private fields for all of the query strings (Title, Series, Author, Publisher, Subject, Notes, Publication Year, ISBN, ISSN) and the profile name, together with methods that set or return the values of these variables. It also includes methods that return each query exactly as it is displayed on screen (omitting the escape characters { } and showing one-character strings with the symbol ‘<’ as empty). A ReadProfile object is instantiated with fields containing queries read from the database.
The class StoreProfile includes all fields and methods of ReadProfile. A StoreProfile object is instantiated with fields containing parsed text queries, which are defined by the user. The constructor of StoreProfile calls the JavaCC-generated parser, which performs the necessary semantic actions on all text query fields before assigning the strings to the private variables. In addition to the methods inherited from ReadProfile, the class includes the following ones.
o public Boolean isValid(int QueryIndex): returns true if the referenced text query is valid. QueryIndex defines which text query of the profile is referenced.
o public String getErrorMessage(int QueryIndex): returns the string “Syntax Error”, displayed on screen, if the referenced text query is invalid.
o public String getHeaderMessage(): returns the string “Encountered XX wrong queries …”, displayed on screen, if the referenced profile contains a wrong query.
o public int NumberOfErrors(): returns the number of wrong queries of the profile.

This class hierarchy allows us to call two different constructors for what is virtually the same Profile object (ReadProfile() and StoreProfile()), according to the source of the text queries assigned. The constructor StoreProfile() calls the parser, and the constructor ReadProfile() just assigns the text queries to the fields. Thus we avoid re-parsing text queries that have already been parsed.

[Figure 6.5-2 Profile insert / update mechanism. The JavaServer Page (profile insert/update form) exchanges text queries with the EJB. Inside the EJB, ReadProfile( ) fills the Profile object with text queries read from the RDBMS, and StoreProfile( ) fills it through the Parser with text queries entered by the user. The JSP with the OC4J tags reads the text queries from the Profile object and inserts or updates the profile in the RDBMS.]
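The class hierarchy can be sketched as follows. This is an illustrative reduction, not the DLAlert source: the query fields are collapsed into one array, the JavaCC parser is replaced by a caller-supplied syntax check (a hypothetical Predicate parameter), the method name getDisplayQuery is invented for the example, and getHeaderMessage() is left out because its exact wording is not given in full.

```java
import java.util.function.Predicate;

/** Sketch only: field layout and constructor signatures are assumptions. */
abstract class Profile {
    protected final String[] queries;   // one entry per query field, as stored

    protected Profile(String[] queries) {
        this.queries = queries.clone();
    }

    /** Query as displayed in the form: { } removed, "<" shown as empty. */
    public String getDisplayQuery(int queryIndex) {
        String q = queries[queryIndex];
        if (q.equals("<")) {
            return "";
        }
        return q.replace("{", "").replace("}", "");
    }
}

/** Holds queries read from the database; they are assumed valid, so no parsing. */
class ReadProfile extends Profile {
    ReadProfile(String[] storedQueries) {
        super(storedQueries);
    }
}

/** Holds queries typed by the user; every non-empty query is checked once. */
class StoreProfile extends ReadProfile {
    private final boolean[] valid;

    StoreProfile(String[] userQueries, Predicate<String> syntaxCheck) {
        super(withPlaceholders(userQueries));
        valid = new boolean[userQueries.length];
        for (int i = 0; i < userQueries.length; i++) {
            String q = userQueries[i].trim();
            valid[i] = q.isEmpty() || syntaxCheck.test(q);
        }
    }

    // Empty cells are stored as the one-character string "<".
    private static String[] withPlaceholders(String[] userQueries) {
        String[] out = new String[userQueries.length];
        for (int i = 0; i < userQueries.length; i++) {
            String q = userQueries[i].trim();
            out[i] = q.isEmpty() ? "<" : q;
        }
        return out;
    }

    public boolean isValid(int queryIndex) {
        return valid[queryIndex];
    }

    public String getErrorMessage(int queryIndex) {
        return valid[queryIndex] ? "" : "Syntax Error";
    }

    public int NumberOfErrors() {
        int errors = 0;
        for (boolean ok : valid) {
            if (!ok) errors++;
        }
        return errors;
    }
}
```

Because both constructors produce an object that the JSP can read through the same display methods, the form code does not need to know whether the values came from the database or from a rejected submission.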
The functionality that performs inserts and updates on profiles is shown in the schema above and can handle the following four actions in any valid sequence.
o If a new profile is to be defined, the form in the JavaServer Page is filled with empty strings.
o If a profile is to be updated, the form is filled with the actual RDBMS data. The EJB calls the ReadProfile( ) constructor and the JavaServer Page reads the text queries from the object (omitting the escape characters { }; one-character strings with the symbol ‘<’ are displayed as empty fields).
o If the user enters a valid profile to be inserted or updated, the EJB calls the StoreProfile( ) constructor and parses the profile; the JSP with the OC4J tags then reads the values from the Profile object and performs the transaction.
o If the user enters a wrong profile, the EJB calls the StoreProfile( ) constructor and parses the profile, and the form appears again containing the wrong queries and the error messages. In this case the transaction is not performed.

6.6 Conclusions

The Graphical User Interface was implemented with the intention of being a simple and friendly application. We have achieved this goal to some extent for this first version of DLAlert. In the last chapter of the dissertation we propose future work on the service that will improve the functionality of DLAlert. First, however, the next chapter describes how DLAlert collects publications from Digital Libraries.
Chapter 7

The Observer

In this chapter we present the mechanism that collects records from Digital Libraries using Java Stored Procedure technology and the Z39.50 protocol. Before explaining the implementation of this component (the Observer), we review different ways of retrieving data from information sources.

7.1 Information providers

Information providers are any suppliers of information in the particular area of interest of the alerting service (in our case scholarly material) [4,5]. The information collected is publication metadata (records containing bibliographical attributes).

We can distinguish information providers as either active or passive. Active providers virtually provide their own alerting service; they regularly notify interested systems or users of new data (publications). Passive providers have to be queried for new material on a schedule. For example, any Z39.50 Gateway is a passive information provider.

Information providers are also either cooperative or non-cooperative. Cooperative providers offer materials in a standard format; non-cooperative ones provide data in a proprietary custom format. For example, human-readable and unstructured records in a web page or e-mail message are not in a standard format that can be easily processed by a service.

The highest priority in constructing an Observer for an Alerting Service is the ability of the system to deal efficiently, in a common approach, with as many heterogeneous information providers as possible. During the implementation of DLAlert we considered the following possible solutions for collecting data from the Digital Library of TUC.

Constructing a mechanism inside the Digital Library of TUC that notifies us of new publications would be the worst solution, since it would not be easy to add other sources in the future. It would also require specialized knowledge of the RDBMS used by the Library and by every other system to be supported.

Querying the database of the Digital Library directly using a standard JDBC or ODBC API would require the development of functionality adapted to the database schema. In addition, the Library of TUC does not currently provide a licensed ODBC interface. Using such a mechanism would require us to construct the Observer from scratch every time a new source is added.

Using the Web Page of the TUC Digital Library to retrieve new records would not be an efficient solution, since the search interface does not provide queries on the date of acquisition of documents. Requesting the records inserted during the last month, for example, would be impossible.

The use of Z39.50 for this purpose is considered the best solution since:
o It is supported by the majority of Digital Libraries around the world.
o It provides standardized access points to the resources.
o It offers records in a standardized format that can be easily processed.
o It requires only minor changes in order to add other sources.
o It does not require intervention inside the database of every Digital Library supported.
o Querying Gateways using this protocol is usually free and does not require special permissions.
o The implementation of Z39.50 used by the TUC Library Gateway provides queries on the document's date of acquisition. Adding this attribute for queries to an existing Gateway does not involve software development, only minor changes to the configuration of the Z39.50 server.

7.2 Observer architecture

As we explained in Section 3.2, the Oracle RDBMS [35-39] is able to integrate Java classes inside the database and deploy them on a supplied enterprise-level Java 2 platform (Oracle Java Server). The Java code, which can be loaded either as source or compiled (bytecode), is executed inside the database using the internal Oracle Java Virtual Machine. The static methods of any Java class can be declared as stored procedures and can even be called from PL/SQL code. Using this functionality we can load all the necessary Java classes into the database (both the JZKit API and our classes).
The main static method that handles the collection of records and calls all the other methods is published as a Java stored procedure. Thus we can reference the Observer from PL/SQL code and schedule it to request new publications regularly using the PL/SQL package DBMS_JOB. First let us present the three sub-components of the Observer.
[Figure 7.2-1 Architecture of the Observer. The Observer (Java Stored Procedure) consists of the JZKit API, the UNIMARC parser and the SQLJ class. The JZKit API requests data on new publications from the Z39.50 Gateways (the TUC Digital Library Gateway in front of the "Advance" DBMS, or the Gateway of another Digital Library in front of another DBMS), and the Gateways send back records. The records are passed as an array of characters to the UNIMARC parser, which passes records (objects) to the SQLJ class. The SQLJ class performs inserts into the Oracle database through the server-side internal JDBC driver.]

JZKit [15] is an open source Java toolkit for building distributed information retrieval systems, with particular focus on the Z39.50 Information Retrieval standard. JZKit offers functionality that helps us develop clients for the Z39.50 protocol. We have already presented the main Z39.50 facilities and services in Section 2.3; in this chapter we focus on the particular characteristics of the Observer. The code developed writes the requested records into an array of characters, which is processed by the next component.

The UNIMARC [17] parser is a typical parser generated by JavaCC [50]. This parser takes UNIMARC records as input and maps the UNIMARC fields to the desired bibliographical attributes (Title, Series, Author, Publisher, Subject, Notes, Pub. Year, ISBN, ISSN). The UNIMARC format was presented in detail in Section 2.3.5. This component essentially processes structured text and produces a set of objects representing the records requested. JavaCC is the Java compiler generator also used in the development of the Web Graphical User Interface (Chapter 6).

The last component is a Java class that uses the SQLJ [35, 51] functionality. SQLJ is an industry standard that enables database developers to:
o Embed SQL code directly inside Java source code.
o Write Java-based code without resorting to low-level JDBC calls.
o Construct applications that are portable to all database platforms that support JDBC drivers.

This last sub-component of the Observer is a class that reads the fields of the objects produced by the parser and performs the necessary inserts. We focus on this component and the SQLJ functionality in Section 7.5.

We must mention that the sub-components are methods that are executed sequentially; every module acts on the result of the previous one. For example, the SQLJ class calls the method that parses the records, which uses the records retrieved by JZKit.

7.3 JZKit API

In this section we explain the functionality of the JZKit API. We do not give a detailed reference of the classes and methods used. We focus on the Z39.50 services used and on the algorithm developed, according to the Z39.50 terminology introduced in Section 2.3.

We need a mechanism that collects publications acquired during a certain interval of time. The Z39.50 implementation used by the Digital Library offers an access point (attribute 32) to the bibliographic records according to the month of acquisition (shorter intervals cannot be requested yet). For example, if we want to request the records inserted in the Digital Library during February 2003, we issue the following type-1 query (introduced in Section 2.3.3) using the JZKit API. The month stored in this field is always in Greek.

    @attrset bib-1 @attr 1=32 2003-ΦΕΒΡΟΥΑΡΙΟΣ

By choosing the string that corresponds to the current month and year we can retrieve the desired records.
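Building the query string for an arbitrary month can be sketched as below. The class AcquisitionQuery is hypothetical and does not use the JZKit API. Only the February spelling is confirmed by the example above; the other eleven month names are assumed to follow the same upper-case nominative form and would have to be checked against the values actually stored by the Library.

```java
/** Hypothetical helper that builds the type-1 query for one acquisition month. */
class AcquisitionQuery {
    // Assumption: months are stored in the same upper-case form as the
    // February example; only index 1 is confirmed by the text.
    private static final String[] GREEK_MONTHS = {
        "ΙΑΝΟΥΑΡΙΟΣ", "ΦΕΒΡΟΥΑΡΙΟΣ", "ΜΑΡΤΙΟΣ", "ΑΠΡΙΛΙΟΣ",
        "ΜΑΪΟΣ", "ΙΟΥΝΙΟΣ", "ΙΟΥΛΙΟΣ", "ΑΥΓΟΥΣΤΟΣ",
        "ΣΕΠΤΕΜΒΡΙΟΣ", "ΟΚΤΩΒΡΙΟΣ", "ΝΟΕΜΒΡΙΟΣ", "ΔΕΚΕΜΒΡΙΟΣ"
    };

    /** month is 1-based (1 = January, 12 = December). */
    static String forMonth(int year, int month) {
        if (month < 1 || month > 12) {
            throw new IllegalArgumentException("month must be 1..12: " + month);
        }
        // Access point 32 of the bib-1 attribute set: month of acquisition.
        return "@attrset bib-1 @attr 1=32 " + year + "-" + GREEK_MONTHS[month - 1];
    }
}
```

Calling AcquisitionQuery.forMonth(2003, 2) reproduces the February 2003 query shown above.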
Using queries like the one above at the end of each month, we can retrieve the necessary records. The complete algorithm developed uses the Initialize, Search and Present services. We must mention that the target cannot return all requested records in one response (the size of a response is limited by the segmentation service), so we have to count the records returned until all of them are retrieved. Phrases enclosed in < > represent variables.

    Initialize ( URL : dias.library.tuc.gr , port : 210 )
        returns initialization parameters
    Search ( database-name : “Advance” ,
             query : @attrset bib-1 @attr 1=32 <current year>-<month in greek> )
        returns <total number of records>
    <starting point> = 0
    repeat
        Present ( number of records , starting point )
            returns <number of returned records> , <records in UNIMARC>
        <starting point> += <number of returned records>
        write <records in UNIMARC> in an array of char.
    until <starting point> equals <total number of records>
    terminate connection

The algorithm initializes a Z-association with the target, issues a query and sends Present requests until all records are returned. The records of each Present response are written into an array in order to be processed further by the next module.

7.4 UNIMARC parser

The UNIMARC record format is not currently supported by Oracle Text. Thus we have to parse the incoming records into plain text in order to insert them into the database schema explained in Chapter 4. The parser processes the UNIMARC records and maps UNIMARC fields to the fields used by DLAlert (Title, Series, Author, Publisher, Subject, Notes, Pub. Year, ISBN, ISSN) according to Table 7.4-1.
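The Search/Present loop of Section 7.3 can be sketched in Java as follows. The interface ZTarget is a hypothetical stand-in for the Z39.50 client and is not the JZKit API; the zero-based starting point follows the pseudocode above, and the guard against an empty response is an addition made so that the sketch cannot loop forever.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical abstraction of a Z39.50 target; NOT the JZKit API. */
interface ZTarget {
    /** Runs the query and returns the total number of matching records. */
    int search(String database, String query);

    /** Returns up to 'count' records starting at 'start'; may return fewer. */
    List<String> present(int start, int count);
}

class RecordCollector {
    /** Repeats Present requests until every record of the result set is read. */
    static List<String> collect(ZTarget target, String database, String query, int batch) {
        int total = target.search(database, query);
        List<String> records = new ArrayList<>();
        int start = 0;
        while (start < total) {
            List<String> chunk = target.present(start, Math.min(batch, total - start));
            if (chunk.isEmpty()) {
                // Added guard: stop instead of spinning if the target sends nothing.
                throw new IllegalStateException("target returned no records at " + start);
            }
            records.addAll(chunk);
            start += chunk.size();   // advance by what was actually returned
        }
        return records;
    }
}
```

The loop advances by the number of records actually returned rather than by the number requested, which is what makes it robust to a target that caps the size of each response.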
    Destination         UNIMARC fields
    Local Number        001
    Title               200, 5XX, 4XX except 410
    Series Title        410
    Author              7XX
    Publisher           210
    Subject             60X
    Notes               3XX
    Publication Year    210 $d
    ISBN                010 $a
    ISSN                011 $a

    Table 7.4-1 UNIMARC field mapping

Local number is the unique identifier of the record inside the database of the Digital Library of TUC. We use this number as the primary key of our schema (public_id). Other fields included in the UNIMARC record (like information about the lending of the book) are omitted. For example, consider the following record. The extraction of bibliographical attributes is shown in Figure 7.4-1. The date of acquisition is not inserted in the Oracle database.

[Figure 7.4-1 Sample UNIMARC record]

The fields of the processed record (the private variables of the LibraryRecord object) are listed with the sample record.
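Table 7.4-1 can be read as a lookup from a UNIMARC tag (and, for a few fields, a subfield code) to a destination attribute. The sketch below shows one way to express it; it is not the JavaCC-generated parser. The class name is hypothetical, and two details are assumptions: subfields of field 210 other than $d are sent to Publisher, and fields outside the table return null to signal that they are omitted.

```java
/** Hypothetical lookup that mirrors Table 7.4-1; not the DLAlert parser. */
class UnimarcFieldMap {
    /**
     * Returns the DLAlert attribute for a UNIMARC tag and subfield code,
     * or null if the field is omitted. Pass ' ' for fields without subfields.
     */
    static String destinationFor(String tag, char subfield) {
        if (tag.equals("001")) return "Local Number";
        if (tag.equals("010")) return subfield == 'a' ? "ISBN" : null;
        if (tag.equals("011")) return subfield == 'a' ? "ISSN" : null;
        if (tag.equals("210")) {
            // Assumption: everything in 210 except $d belongs to Publisher.
            return subfield == 'd' ? "Publication Year" : "Publisher";
        }
        if (tag.equals("410")) return "Series Title";   // must precede the 4XX rule
        if (tag.equals("200") || tag.startsWith("4") || tag.startsWith("5")) {
            return "Title";
        }
        if (tag.startsWith("60")) return "Subject";
        if (tag.startsWith("7")) return "Author";
        if (tag.startsWith("3")) return "Notes";
        return null;   // e.g. lending information: not stored by DLAlert
    }
}
```

The order of the checks matters: the specific tag 410 has to be tested before the general 4XX rule, which is how the "4XX except 410" entry of the table is honoured.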
Extracted fields of the sample record:

    public_id    10024364
    title        The international business book Vincent Guy, John Mattock
    series       <null>
    author       Guy Vincent Mattock John
    publisher    Lincolnwood, Ill., USA NTC Business Books NTC Business Books
    subject      International business enterprises Management
    notes        "All the tools, tactics, and tips you need for doing business across cultures"--Cover. Includes bibliographical references (p.[171]-173) and index.
    pub. year    1995
    ISBN         0844235172
    ISSN         <null>

7.5 SQLJ functionality

The last sub-component of our module is an SQLJ class. This is a Java class containing methods that call the aforementioned sub-components, together with embedded SQL code that performs the transactions. In order to compile an SQLJ class we use the SQLJ translator, which:
i. Validates that the SQLJ statements are syntactically correct.
ii. Validates that the SQL code inside the SQLJ statements is correct.
iii. Validates that the database objects manipulated by the SQL code in the SQLJ statements are valid.
iv. Translates the SQL code into syntactically correct Java statements.

[Figure 7.5-1 SQLJ compilation process: SQLJ source → SQLJ Translator → Java source → Java Compiler → Java class (bytecode)]

The file generated by the translator still contains Java source code, so compilation to bytecode is still necessary. SQL statements are declared with the #sql token and must be enclosed in braces “{ }”. Java variables that are declared outside the statement and referenced inside it are prefixed with the symbol “:”.
For example, to insert a new publication into the database schema of DLAlert with Public_id=1000 and Author='Giannis Alexakis' we have the following SQLJ code.

    String NewPublic_id = "1000";
    String NewAuthor = "Giannis Alexakis";
    #sql { INSERT INTO alert.publications (PUBLIC_ID, AUTHOR)
           VALUES ( :NewPublic_id, :NewAuthor ) };

The actual SQLJ statement used for the transaction is:

    #sql { INSERT INTO alert.publications
             (PUBLIC_ID, TITLE, SERIES, AUTHOR, PUBLISHER,
              SUBJECT, NOTES, PUB_YEAR, BOOKN, SERIALN)
           VALUES ( :Publication_Id, :Title, :Series, :Author, :Publisher,
                    :Subject, :Notes, :Year, :BookNumber, :SerialNumber ) };

The host variables are assigned the values to be inserted. Iterating through all the records, we insert all of the new publications requested and parsed previously.
The translated and compiled Java code produced by the SQLJ translator can be loaded and executed inside the Oracle database. Java applications executed inside the RDBMS environment use Oracle's server-side internal JDBC driver. Since we use this type of driver, it is not necessary to explicitly declare a statement that establishes a JDBC connection with the database: code executed inside the RDBMS is implicitly assumed to reference the same database.
In order to be able to call the method that requests records through the JZKit API, parses them, and stores the bibliographical attributes of new publications, we have to
publish it as a Java Stored Procedure. The Java Stored Procedure declared ( ReadFromLibrary ( ) ) calls the Observer and returns the integer 1 on abnormal termination (for example, due to a network failure when a Z-association with the Gateway cannot be established). In case of an exception no records are inserted into the Oracle database.

7.6 Performance

The time needed to retrieve records from the Digital Library of TUC depends mainly on the network congestion between the Oracle database and the Gateway. It usually takes less than five minutes to retrieve about one thousand records from the Gateway, since a single response contains at most 33 records. The time needed for parsing and insertion is insignificant: less than ten seconds per thousand records.

7.7 Important technical issues

The following technical problems with the Digital Library of TUC do not yet allow us to deploy DLAlert in full operation:

Most records inserted in the Digital Library of TUC have an empty date of acquisition field. The total number of records in the Digital Library is close to 60000, and fewer than 6500 of them have the date of acquisition filled in. This means that almost 90% of the records in the database cannot be retrieved using this mechanism. This problem can easily be solved with the cooperation of the Library of TUC, as long as we ensure that future inserts/updates contain this essential bibliographical attribute. There is no need to change the data already in the database of the Library, because we focus on new publications from the time DLAlert is scheduled to operate on a regular basis.

The way the date of acquisition is stored does not allow us to easily support notification frequencies shorter than a month. The stored dates contain only the year and month of acquisition.
Therefore, if we want to retrieve, for example, the records inserted during the last day, we have to request all of the records inserted in the current month and extract those that are new since the last request. This problem can also be solved optimally, as long as we ensure that future inserts/updates in the Digital Library contain a more detailed date of acquisition. In the next chapter we
assume that this issue has been solved (in any way) and propose a schedule for the actions of DLAlert.

The Greek character set supported by the Digital Library is a non-standard custom character set defined by the company that installed and configured the Digital Library. Most records include Greek characters, and Greek support (both in keyword queries and in e-mail messages) is a very important issue that must be solved soon. Proposed future work that will solve this problem is discussed in Chapter 9.

7.8 Conclusions

Due to the technical problems explained in Section 7.7 we have not yet scheduled DLAlert for full operation. As the total number of records with the date of acquisition filled in is very small, we have developed and validated DLAlert using sets of documents acquired during certain years (500-800 records). Given the technical issues presented earlier, scheduling DLAlert to operate under the current conditions would result in users being notified very rarely. Despite these limitations, in the next chapter we present how DLAlert's processes should be scheduled.
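
The workaround described in Section 7.7 for month-granularity dates (request all records of the current month and keep only those not returned by an earlier request) can be sketched in Java as follows. This is only a minimal illustration under stated assumptions: the class AcquisitionDeltaFilter and its method extractNew are hypothetical names that do not appear in DLAlert, records are represented only by their public_id values, and the identifiers already seen are kept in memory rather than in the Oracle database.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of DLAlert: remembers which public_id
// values were already returned by earlier requests.
class AcquisitionDeltaFilter {
    private final Set<String> seenIds = new HashSet<>();

    // Takes the public_ids of ALL records the Gateway returns for a month
    // and returns only those that no earlier call has returned.
    List<String> extractNew(List<String> idsOfMonth) {
        List<String> fresh = new ArrayList<>();
        for (String id : idsOfMonth) {
            if (seenIds.add(id)) {   // add() returns false if the id was already seen
                fresh.add(id);
            }
        }
        return fresh;
    }
}
```

Calling extractNew once a day with the full result set of the current month would yield only the records added since the previous call. Around a change of month, the previous month would presumably have to be requested once more, so that records acquired after the last request of that month are not missed.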
Chapter 8 Scheduling DLAlert

For the system to operate properly and on a regular basis we must schedule all the related modules and actions. For this purpose we can use the PL/SQL package DBMS_JOB [30, 33, 35, 40], which provides functionality for:

Scheduling stored procedures to run unattended at some time in the future or at certain intervals of time.
Handling jobs that are broken for any reason (network or power failure, database error etc.). Such jobs are retried up to 16 times if they are not executed successfully.

We will not focus on the package itself, since scheduling inside the database is a rather complex administrator's task. Instead we explain the sequence of actions to be executed, focusing on two simple scenarios: one that does not distinguish users by their preferred frequency of notifications, and one that does.

8.1 Simple scenario

Figure 8.1-1 Simple scenario sequence (Collection of new publications -> Synchronization of indices -> Filtering -> Construction and transmission of messages -> Deletion of new publication records from the database)

When we do not support different categories of users according to their preferred frequency of notifications, the following actions are executed regularly (every day or every week, for example).
First we have to collect the new publication records inserted in the Digital Library during a certain interval of time. As the first step we therefore call the Observer module.
Synchronization of the CTXRULE indices is always necessary before filtering, in order to keep the index consistent with the base table of queries. For this purpose we have developed the PL/SQL package “indexing” (Section 4.4). This process can also be executed in the background during the first step, since the Observer is not a CPU-intensive process.
After synchronizing the indices we find the matching profiles for every new publication (PL/SQL package “filtering”).
Once the matching profiles are collected we can summarize all matched publications for every user and transmit e-mail messages via SMTP.
Once the e-mail messages containing all relevant bibliographical attributes have been constructed and delivered, there is no need to keep the already filtered documents. Unless we want to provide functionality beyond alerting (for example information retrieval on the documents stored in the Oracle database), we can delete the publication records.

8.2 Supporting three types of desired notification frequencies

In order to support different notification frequencies we have three sets of actions, as shown in the diagrams of this section. We categorize the actions according to the interval between two subsequent executions. “Every day” actions are executed every day, regardless of whether that day is also the end of a week or of a month. For example, at the end of each month all three sets are executed sequentially.

Figure 8.2-1 Actions executed every day (Collection of new publications during the last day -> Synchronization of indices -> Filtering for publication records acquired during the last day -> Construction and transmission of messages only for users with desired frequency = 'DAY')

1) “Every day” actions.
The first step is to collect the new publication records inserted in the Digital Library during the last day. As the first step we always call the Observer module.
Synchronization of the indices is necessary in order for filtering to produce results consistent with the base table of queries.
The next step is to filter the publication records inserted during the last day. The filtering module finds all matched profiles for every publication and stores the primary keys of those rows in a table.
The data produced are maintained inside the database until all relevant e-mail messages are delivered. Then we must summarize all matched publications for every single user and transmit e-mail messages via SMTP. We perform this action only for users that have defined ‘DAY’ as the desired frequency of notifications.
Figure 8.2-2 Actions executed once a week (Construction and transmission of messages only for users with desired frequency = 'WEEK')

2) “Every week” actions.
Since we have already inserted and filtered the publications for every single day of the week, the only action that remains is to construct and send e-mail messages to users with ‘WEEK’ as the desired notification frequency.

Figure 8.2-3 Actions executed once a month (Construction and transmission of messages only for users with desired frequency = 'MONTH' -> Deletion of unnecessary publication records from the database)

3) “Every month” actions.
We have already inserted and filtered the publications for every day up to the end of the month. Since we have already notified the users of the first two categories (‘DAY’ and ‘WEEK’), the action that remains is to construct and send e-mail messages to users with ‘MONTH’ as the desired notification frequency.
Once the e-mail messages containing all relevant bibliographical attributes for the publications of the last month have been constructed and delivered, there is no need to keep the already filtered documents. Optionally we can delete the unnecessary publication records from the Oracle database.

8.3 Conclusions

We have completed the presentation of the implementation and development of DLAlert. We think that with minor configuration changes, mainly on the Digital Library of TUC (Section 7.7), this system could easily be deployed in full operation and run on a regular basis. In the last sections we presented two operating scenarios of DLAlert and the corresponding actions to be scheduled in order to achieve the desired target. In the next chapter we propose future work on DLAlert.
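
The decision of which of the three action sets of Section 8.2 has to send messages on a given day can be sketched in Java as follows. This is an illustration only and not the scheduling code of DLAlert, which would be expressed as DBMS_JOB jobs inside the database. The class NotificationSchedule and the method frequenciesDue are hypothetical names, and the sketch assumes that a week ends on Sunday, which the text does not specify. Following Section 8.2, all three sets are due on the last day of a month.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of DLAlert.
class NotificationSchedule {
    // Returns the notification frequencies whose messages are due on the given day.
    static List<String> frequenciesDue(LocalDate day) {
        boolean endOfMonth = day.getDayOfMonth() == day.lengthOfMonth();
        // Assumption: the week ends on Sunday.
        boolean endOfWeek = day.getDayOfWeek() == DayOfWeek.SUNDAY;

        List<String> due = new ArrayList<>();
        due.add("DAY");                 // the daily set runs every day
        if (endOfWeek || endOfMonth) {
            due.add("WEEK");            // weekly users are served before the monthly clean-up
        }
        if (endOfMonth) {
            due.add("MONTH");
        }
        return due;
    }
}
```

Under these assumptions an ordinary weekday yields only DAY, a Sunday yields DAY and WEEK, and the last day of a month yields all three frequencies, after which the unnecessary publication records may be deleted.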
Chapter 9 Concluding remarks

We think that an alerting service such as the one developed would prove very helpful to the academic community of the Technical University of Crete. DLAlert should be enhanced with more functionality, such as Greek support, integration of various sources and an even easier to use web interface. Along with DLAlert, a search engine that will support several information providers is being developed by the Intelligence Systems Laboratory. In the following section we propose future work on DLAlert.

9.1 Future work on DLAlert

The following functionalities are proposed as future enhancements of DLAlert. We think that if some of them are supported, DLAlert could become a popular alerting service, since no similar system has been developed in Greece at this time.

Greek support
Greek support is a very important issue, since most records in the Digital Library of TUC contain bibliographical attributes in this language.
The character set used by the Digital Library is a non-standard custom character set defined by the company that installed and configured this system. Therefore we must either provide support for this custom character set or translate the Greek text into a standard character set supported by Oracle. The Oracle RDBMS provides Locale Builder, a useful tool for this purpose that would help us manipulate character set types, character mappings or classifications.
The standard Boolean and proximity queries on keywords can be supported by Oracle Text in the Greek language.
The stemmer provided with Oracle Text, the mechanism that expands queries using tokens with the same linguistic root as the requested term, does not support Greek. Developing a stemmer from scratch is hard and complex work, since it requires knowledge of linguistics and literature.
Stemmers for the Greek language have already been developed by students of the Department of Electronic and Computer Engineering [55]. Oracle also provides a database of English and French themes (presented in 2.4.3) organized hierarchically and connected to each other with relations that describe their semantic content
(Synonym, and Broader, Narrower or Related Term). In order to support this functionality in Greek we should extract the main concepts found in the documents of the Digital Library and organize them hierarchically.

XML records classification
Instead of using records that contain the bibliographical attributes of incoming publications in plain text, we can represent publications as XML documents with sections defined as tags. For example, consider a publication with Title: “The international business book” and Author: “Vincent Guy, John Mattock”. The corresponding XML document would be

    <publication_record>
      <title> The international business book </title>
      <author> Vincent Guy, John Mattock </author>
    </publication_record>

Oracle Text provides query operators for XML section searching, such as the operator WITHIN. We use the WITHIN operator to narrow a query down to document sections. For example, to request documents whose Title contains the word “business” we issue the CTXRULE query

    business WITHIN title

This approach has several advantages and disadvantages:
o It allows even more complex queries referencing sections that are not included in the current implementation of DLAlert (like “Anywhere” clauses). Using this approach we can easily request documents containing a keyword in any attribute or in a custom set of attributes. We can even declare nested sections in records.
o The filtering module will be faster in this case, since we will need only one CTXRULE index for the text queries regardless of the number of attributes supported. The time consumed for indexing will not improve, since it depends mainly on the total number of queries inserted / updated.
o The WITHIN operator cannot be combined with the ABOUT operator, therefore we cannot request themes inside sections.
o It requires a more complex parser for the profiles, since queries referencing more than one attribute must be concatenated into a single CTXRULE query before being inserted into the database. The operator WITHIN should not be visible to the user, and the text query stored in the database should be re-parsed and broken into
simple CTXRULE statements before being displayed on the GUI. For example, to request documents with Title containing the word “business” and Author containing the word “John” we issue the following CTXRULE query.

    Query visible to the user from the GUI:    Author: John, Title: business
        -> Profile parser ->
    CTXRULE query stored in the database:      (business WITHIN title) AND (John WITHIN author)

o The publication records should be converted to XML before being inserted into the database. Also, the bibliographical attributes of the matching documents should be extracted from XML into plain text strings before the construction of e-mail messages. For this purpose we can develop the necessary Java stored procedures that will manipulate the XML documents.

“Anywhere” queries - variable number of rows in profiles
After we carry out the changes mentioned earlier we can support a more convenient Profile insert/update form like the one below. We can then define queries with a variable number of rows and request keywords in any section of the document.

Figure 9.1-1 Profile insert form
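
The profile-parser step discussed above, which concatenates the rows of a profile into a single CTXRULE query using the WITHIN operator, might look like the following Java sketch. It is an illustration under assumptions and not the parser of DLAlert: the class ProfileQueryBuilder and the method toCtxRuleQuery are hypothetical names, the rows are assumed to arrive as a map from section name to the keyword expression typed by the user, and all rows are combined with AND as in the example of the text.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch of the profile-parser step; not DLAlert's actual parser.
class ProfileQueryBuilder {
    // rows: section name -> keyword expression typed by the user, in form order.
    static String toCtxRuleQuery(Map<String, String> rows) {
        StringJoiner query = new StringJoiner(" AND ");
        for (Map.Entry<String, String> row : rows.entrySet()) {
            String keywords = row.getValue().trim();
            if (keywords.isEmpty()) {
                continue;                  // skip rows the user left blank
            }
            String section = row.getKey().trim().toLowerCase();
            query.add("(" + keywords + " WITHIN " + section + ")");
        }
        return query.toString();
    }
}
```

With an insertion-ordered map holding Title = business and Author = John, the method returns (business WITHIN title) AND (John WITHIN author), the query of the example above. The reverse step, breaking the stored query back into rows for display on the GUI, is not shown.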
Automatic word stemming expansion on queries
Instead of expecting the user to enter the symbol $ in order to request tokens with the same linguistic root as the requested term, we can enhance the parser so that it automatically puts the stemming symbol $ before all queries. As we said previously, this functionality cannot currently support the Greek language if we use the supplied Oracle stemmer. For example, suppose we request documents that contain the words “business” and “management”. DLAlert can automatically include all the tokens with the same linguistic root as equivalent terms to the requested keywords.

    Query visible to the user from the GUI:    business and management
        -> Profile parser ->
    CTXRULE query stored in the database:      $( business and management )

Z39.50 sources support
Integrating various Digital Libraries is a target that can be easily achieved, since:
o Almost all Z39.50 Gateways in Greece (Section 2.3.6) support the same record format (UNIMARC) and are implemented according to the same Z39.50 specification (Geac Advance Z39.50 version 2). We can already retrieve records from these databases for our alerting service, as long as they support the date of acquisition as an access point (attribute).
o For Gateways that support a different implementation of Z39.50 with a different record format (for example USMARC, XML), only minor changes to the parser sub-component of the Observer module are needed.
If we decide that multiple sources are to be supported, an algorithm that detects duplicate entries for publications among different Digital Libraries is necessary. DLAlert should not send multiple notifications for a single document found in more than one database.

Other information providers, alerting services integration
We could support in the future any other type of information provider (cooperative or not, active or passive) as long as we develop the necessary functionality.
Any functionality that can be developed in Java can be inserted and executed inside the
Oracle RDBMS as Java Stored Procedure(s). Other existing alerting services can also be supported, provided they offer a record format that can be processed and parsed so that the bibliographical attributes can be identified. An algorithm for the rejection of duplicate publication entries is needed in this case too.

More impressive web GUI
During the implementation of DLAlert we have focused on the functionality of the GUI and the way the user interacts with the system. A more attractive Web GUI can easily be built on top of the functionality already developed.

Similar search and alert capabilities
A search engine integrated in the same interface would be very useful, so that the user can find which already acquired publications match his profile. Along with DLAlert, a search engine that will support several information providers is being developed by the Intelligence Systems Laboratory. A future version of DLAlert must provide search functionality inside the Profile insert/update form, as in the picture below.

Figure 9.1-2 Profile insert / search form

Journal support
The mechanism already developed as an alerting service for new publications is not practical in the area of scientific journals. Every journal series acquired by the Digital
Library of the Technical University of Crete is represented by a single record inside the database (a sample record is shown in the following picture). Therefore DLAlert cannot yet notify users about each issue of a journal; it only sends an e-mail message when the Library starts a new subscription. Supporting specific journal requests in Profiles is an essential feature offered by the most popular Alerting Services for scholarly material (Section 2.1). The journals supported could be organized hierarchically according to their scientific area. Users should be notified regularly not only about a new journal subscription but also about each separate issue.

Figure 9.1-3 Sample record of a journal

Hyper-links in e-mail messages
Providing all the bibliographical attributes of new publications in e-mail messages is an accurate way of notifying the user for the time being. If we want to provide more information on new publications (like the Table of Contents), it would be preferable to include hyper-links to web pages containing all relevant data instead.

Notifications in various formats (plain text, HTML, XML)
Providing notifications in various formats would be a useful feature. Some users may prefer shorter plain text e-mail messages. XML messages would also be useful in case we send the notifications to another alerting service or application.
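
Supporting several notification formats mainly means rendering the same matched publication in more than one way. The following Java sketch shows the idea for plain text and XML. It is an illustration only: the class NotificationFormatter, its methods and the XML element names are assumptions made for this example and are not part of DLAlert, and only two of the bibliographical attributes are included.

```java
// Hypothetical sketch, not part of DLAlert: one matched publication rendered
// either as plain text or as an XML fragment.
class NotificationFormatter {
    static String toPlainText(String title, String author) {
        return "Title: " + title + "\nAuthor: " + author + "\n";
    }

    static String toXml(String title, String author) {
        return "<publication>"
             + "<title>" + escape(title) + "</title>"
             + "<author>" + escape(author) + "</author>"
             + "</publication>";
    }

    // Escapes the characters that may not appear literally in XML character data.
    // The ampersand is replaced first so that the other entities are not re-escaped.
    private static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }
}
```

An HTML variant could be added in the same way, and the choice between the formats could be stored with each user's profile.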
Using DIAS algorithms inside the database as Java stored procedures
Functionality developed in the DIAS project could be integrated into the Oracle database if an implementation in Java that handles database records becomes available in the future. As we explained in Section 2.2.4, DIAS provides efficient algorithms for document filtering and profile matching. In this case the use of Oracle Text and the CTXRULE index would not be necessary. Java Stored Procedure technology enables us to integrate almost everything that can be implemented in Java as a stored procedure inside the Oracle database.

Ranking of matching documents according to relevance
Ranking of matching documents according to relevance is not supported for the CTXRULE index type in the current version of Oracle Text. Supporting this feature would require enhancing the filtering functionality available now.

Relevance feedback
Relevance feedback on notifications means that the user can evaluate the relevance of the delivered documents so that the ranking results are improved in later filtering. This feature is also not supported at this time for the CTXRULE index type, and implementing it will require much development work. Java Stored Procedure technology will be most useful if we try to develop functionality with high computational complexity, such as enhancing the filtering mechanism already available.

9.2 Conclusion

The main achievement of this dissertation is the development of a centralized alerting service for the Digital Library of the Technical University of Crete with the ability to integrate many information providers. As long as the technical issues presented in Section 7.7 are solved, DLAlert can be scheduled to operate on a regular basis. We hope that this dissertation will be a good starting point for further work on this application.
Bibliography

[1] Information Retrieval (Z39.50): Application Service Definition and Protocol Specification, March 29 - May 13, 2002. National Information Standards Organization. Available at: www.niso.org/standards/resources/Z39-50-200x.pdf
[2] Current Awareness and Alerting Services Alphabetical Listing, Sheffield Hallam University. http://www.shu.ac.uk/services/lc/se/alertingservicesalpha.html
[3] Alerting Systems and Services, Freie Universität Berlin. http://page.inf.fu-berlin.de/~hinze/projects/AS.html
[4] D. Faensen, L. Faulstich, H. Schweppe, A. Hinze, and A. Steidinger. Hermes -- a notification service for digital libraries. In ACM/IEEE Joint Conference on Digital Libraries, Roanoke, Virginia, USA, June 24-28, 2001. Available at: http://www.inf.fu-berlin.de/inst/ag-db/publications/2001/jcdl01.pdf
[5] D. Faensen, A. Hinze, and H. Schweppe. Alerting in a digital library environment -- do channels meet the requirements? Freie Universität Berlin, 1998. Available at: ftp://ftp.inf.fu-berlin.de/pub/reports/tr-b-98-08.ps.gz
[6] M. Koubarakis, T. Koutris, C. Tryfonopoulos, P. Raftopoulou. Information Alert in Distributed Digital Libraries: The Models, Languages and Architecture of DIAS. 6th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 02), 16-18 September 2002, Pontifical Gregorian University, Rome, Italy. Available at: http://www.intelligence.tuc.gr/publications/information-ecdl02.pdf
[7] M. Koubarakis, C. Tryfonopoulos, P. Raftopoulou, T. Koutris. Data Models and Languages for Agent-Based Textual Information Dissemination. 6th International Workshop on Cooperative Information Systems (CIA 02), 18-20 September 2002, Universidad Rey Juan Carlos, Madrid, Spain. Available at: http://www.intelligence.tuc.gr/publications/data-cia02-long.zip
[8] M. Koubarakis and C. Tryfonopoulos.
Peer-to-peer agent systems for textual information dissemination: algorithms and complexity. In UK Workshop on Multiagent Systems (UKMAS-2002), Liverpool, UK, 18 & 19 December, 2002. Available at: http://www.intelligence.tuc.gr/publications/peer2peer-ukmas02.pdf
[9] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
[10] C.D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts, 1999.
[11] National Information Standards Organization: http://www.niso.org
[12] Z39.50 Standard Maintenance Agency. http://www.loc.gov/z3950/agency/
[13] MARC standards, Library of Congress Network Development and MARC Standards Office. http://www.loc.gov/marc/
[14] Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation, 6 October 2000. Available at: http://www.w3.org/TR/2000/REC-xml-20001006.pdf
[15] Knowledge Integration JZkit: http://developer.k-int.com/products/jzkit/
[16] Universal Bibliographic Control and International MARC Core Programme: http://www.ifla.org/VI/3/p1996-1/UNIMARC.htm
[17] UNIMARC Manual: Bibliographic Format 1994: http://www.ifla.org/VI/3/p1996-1/sec-uni.htm
[18] Z39.50 Text Part 9: Type-1 and Type-101 Queries: http://www.loc.gov/z3950/agency/markup/09.html
[19] Bib-1 Attribute Set: http://lcweb.loc.gov/z3950/agency/defns/bib1.html
[20] Registry of Z39.50 Object Identifiers: http://lcweb.loc.gov/z3950/agency/defns/oids.html
[21] Oracle Technology Network: http://otn.oracle.com/
[22] Oracle Corporation: http://www.oracle.com/
[23] Oracle Text Application Developer's Guide Release 9.2. Oracle Corporation. http://otn.oracle.com/docs/products/oracle9i/doc_library/release2/text.920/a96517.pdf
[24] Oracle Text Reference Release 9.2. Oracle Corp.
http://otn.oracle.com/docs/products/oracle9i/doc_library/release2/text.920/a96518.pdf
[25] Oracle Text Documentation: http://otn.oracle.com/products/text/content.html
[26] Oracle Text Discussion Forum: http://otn.oracle.com/forums/text.html
[27] Oracle Text Technical Overview (The CTXRULE Indextype): http://technet.oracle.com/products/text/x/Tech_Overviews/text_901.html
[28] ISO 2788:1986 Documentation -- Guidelines for the establishment and development of monolingual thesauri.
[29] ANSI/NISO Z39.19 - 1993 Guidelines for the Construction, Format, and Management of Monolingual Thesauri. Available at: http://www.niso.org/standards/resources/Z39-19.html
[30] Sean Dillon, Christopher Beck, Thomas Kyte. Beginning Oracle Programming. 2002, Wrox Press.
[31] Application Developer's Guide - Fundamentals. 2002, Oracle Corporation. Available at: http://otn.oracle.com/docs/products/oracle9i/doc_library/release2/appdev.920/a96590.pdf
[32] PL/SQL User's Guide and Reference. 2002, Oracle Corporation. Available at: http://otn.oracle.com/docs/products/oracle9i/doc_library/release2/appdev.920/a96624.pdf
[33] Supplied PL/SQL Packages and Types Reference. Oracle Corporation. Available at: http://otn.oracle.com/docs/products/oracle9i/doc_library/release2/appdev.920/a96612.pdf
[34] The Source for Java Technology. Java home page: http://java.sun.com/ Sun Microsystems, Inc.
[35] Bjarki Holm, John Carnell, Tomas Stubbs, Poornachandra Sarang, Kevin Mukhar, Sant Singh, Jaeda Goodman, Ben Marcotte, Mauricio Naranjo, Anand Raj, Mark Piermanini. Oracle 9i Java Programming: Solutions for Developers Using PL/SQL and Java. 2002, Wrox Press.
[36] Java Developer's Guide. 2002, Oracle Corporation. Available at: http://otn.oracle.com/docs/products/oracle9i/doc_library/release2/java.920/a96656.pdf
[37] Java Stored Procedures Developer's Guide. 2002, Oracle Corporation. Available at: http://otn.oracle.com/docs/products/oracle9i/doc_library/release2/java.920/a96659.pdf
[38] JDBC Developer's Guide and Reference. 2002, Oracle Corporation. Available at: http://otn.oracle.com/docs/products/oracle9i/doc_library/release2/java.920/a96654.pdf
[39] Supplied Java Packages Reference. 2002, Oracle Corporation. Available at: http://otn.oracle.com/docs/products/oracle9i/doc_library/release2/appdev.920/a96609.pdf
[40] Bradley D. Brown, Oracle9i Web Development.
November 2001, McGraw-Hill/Osborne Media.
[41] Developing Java 2 Platform, Enterprise Edition (J2EE) Compatible Applications: Roles-based Training for Rapid Implementation. January 2001, Sun Educational Services Java Technology Team. Available at: http://java.sun.com/j2ee/white/j2ee.pdf
[42] Developing Java 2 Platform, Enterprise Edition (J2EE). 1999, Sun Microsystems, Inc. Available at: http://java.sun.com/j2ee/white/j2ee_guide.pdf
[43] Jonathan B. Postel, RFC 821 - Simple Mail Transfer Protocol. August 1982, Information Sciences Institute, University of Southern California. Available at: http://www.ietf.org/rfc/rfc821.txt
[44] JavaMail 1.3 Release, Sun Microsystems, Inc. Available at: http://java.sun.com/products/javamail/
[45] Core JavaScript Guide 1.5. 2000, Netscape Communications Corp.: http://devedge.netscape.com/library/manuals/2000/javascript/1.5/guide/
[46] Marty Hall, Core Servlets and JavaServer Pages, Sun Microsystems Press/Prentice Hall. Available at: http://pdf.coreservlets.com/
[47] JavaServer Pages Documentation, Sun Microsystems. Available at: http://java.sun.com/products/jsp/docs.html
[48] OracleJSP Support for JavaServer Pages Developer's Guide and Reference, Release 1.1.3.1. Oracle Corporation. Available at: http://otn.oracle.com/docs/tech/java/oc4j/pdf/jsp1131.pdf
[49] Peter Koletzke, Paul Dorsey, Avrom Faderman. Oracle9i JDeveloper Handbook. December 2002, McGraw-Hill/Osborne Media.
[50] JavaCC Home page: http://www.experimentalstuff.com/Technologies/JavaCC/
[51] SQLJ Developer's Guide and Reference. 2002, Oracle Corporation. Available at: http://otn.oracle.com/docs/products/oracle9i/doc_library/901_doc/java.901/a90212.pdf

Related dissertations

[52] Stratos Ydraios. “A query and notification service based on mobile agents for rapid implementation of peer to peer applications”, 2003, Department of Electronic and Computer Engineering, Technical University of Crete.
[53] Christos Tryfonopoulos. “Agent-Based Textual Information Dissemination: Data Models, Query Languages, Algorithms and Computational Complexity”, 2002, Department of Electronic and Computer Engineering, Technical University of Crete.
[54] Theodoros Koutris. “Textual information dissemination in distributed agent systems: Architectures and efficient filtering algorithms”, 2003, Department of Electronic and Computer Engineering, Technical University of Crete.
[55] Sotiris Diplaris, Dimitris Pratsolis.
“Development of statistic linguistic models for the Greek language with stemming and part of speech functionality”, 2001, Department of Electronic and Computer Engineering, Technical University of Crete.