Facilitating Document Annotation Using Content and Querying Value

Contact: 9703109334, 9533694296
Email id: academicliveprojects@gmail.com, www.logicsystems.org.in
ABSTRACT:
A large number of organizations today generate and share textual descriptions of their products,
services, and actions. Such collections of textual data contain a significant amount of structured
information, which remains buried in the unstructured text. While information extraction
algorithms facilitate the extraction of structured relations, they are often expensive and
inaccurate, especially when operating on top of text that does not contain any instances of the
targeted structured information. We present a novel alternative approach that facilitates the
generation of structured metadata by identifying documents that are likely to contain
information of interest, where this information will subsequently be useful for querying the
database. Our approach relies on the idea that humans are more likely to add the necessary
metadata during creation time if prompted by the interface, and that it is much easier for humans
(and/or algorithms) to identify the metadata when such information actually exists in the
document, instead of naively prompting users to fill in forms with information that is not
available in the document. As a major contribution of this paper, we present algorithms that
identify structured attributes that are likely to appear within the document, by jointly utilizing the
content of the text and the query workload. Our experimental evaluation shows that our approach
generates superior results compared to approaches that rely only on the textual content or only on
the query workload to identify attributes of interest.
EXISTING SYSTEM:
Many annotation systems allow only “untyped” keyword annotation: for instance, a user may
annotate a weather report using a tag such as “Storm Category 3”. Annotation strategies that use
attribute-value pairs are generally more expressive, as they can contain more information than
untyped approaches. In such settings, the above information can be entered as
(StormCategory, 3). A recent line of work towards using more expressive queries that leverage
such annotations is the “pay-as-you-go” querying strategy in Dataspaces [2], where users provide
data integration hints at query time. The assumption in such systems is that the data sources
already contain structured information and the problem is to match the query attributes with the
source attributes. Many systems, though, do not even have the basic “attribute-value” annotation
that would make “pay-as-you-go” querying feasible. Annotations that use “attribute-value” pairs
require users to be more principled in their annotation efforts. Users should know the underlying
schema and field types to use; they should also know when to use each of these fields. With
schemas that often have tens or even hundreds of available fields to fill, this task becomes
complicated and cumbersome. As a result, data entry users ignore such annotation capabilities.
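The difference between the two annotation styles can be sketched in Java as follows. This is only an illustration: the class AnnotationExample, its Document type, and the helper methods are hypothetical names introduced here, not part of CADS. It shows that the pair (StormCategory, 3) can answer a range condition, while the untyped tag only supports exact keyword matching.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration: untyped tags versus attribute-value annotations. */
class AnnotationExample {

    /** A document carrying both kinds of annotation. */
    static class Document {
        final String title;
        final List<String> tags = new ArrayList<>();                   // untyped keywords
        final Map<String, String> attributes = new LinkedHashMap<>();  // attribute -> value

        Document(String title) {
            this.title = title;
        }
    }

    /** Structured condition: the attribute exists and holds an integer >= minValue. */
    static boolean hasAtLeast(Document d, String attribute, int minValue) {
        String value = d.attributes.get(attribute);
        if (value == null) {
            return false;
        }
        try {
            return Integer.parseInt(value.trim()) >= minValue;
        } catch (NumberFormatException e) {
            return false; // value is present but not numeric
        }
    }

    /** Keyword condition: only an exact tag match is possible. */
    static boolean hasTag(Document d, String tag) {
        return d.tags.contains(tag);
    }

    /** The weather report from the text, annotated both ways. */
    static Document sampleReport() {
        Document d = new Document("Weather report");
        d.tags.add("Storm Category 3");
        d.attributes.put("StormCategory", "3");
        return d;
    }

    public static void main(String[] args) {
        Document report = sampleReport();
        // "Storms of category 2 or higher" is answerable with the typed annotation...
        System.out.println(hasAtLeast(report, "StormCategory", 2));
        // ...but the untyped tag does not match a differently worded keyword.
        System.out.println(hasTag(report, "Storm Category 2"));
    }
}
```

The typed form costs the annotator more effort, since the attribute name has to be chosen from the schema, which is the burden the text above describes.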
DISADVANTAGES OF EXISTING SYSTEM:
• The cost of creating annotation information is high.
• The existing system produces errors in its suggestions.
PROPOSED SYSTEM:
In this paper, we propose CADS (Collaborative Adaptive Data Sharing platform), which is an
“annotate-as-you-create” infrastructure that facilitates fielded data annotation. A key contribution
of our system is the direct use of the query workload to guide the annotation process, in addition
to examining the content of the document. In other words, we try to prioritize the annotation of
documents towards generating attribute values for attributes that are often used by querying
users. The goal of CADS is to encourage and lower the cost of creating well-annotated documents
that can be immediately useful for commonly issued semi-structured queries. Our key goal is to
encourage the annotation of the documents at creation time, while the creator is still in the
“document generation” phase, even though the techniques can also be used for post-generation
document annotation. In our scenario, the author generates a new document and uploads it to the
repository. After the upload, CADS analyzes the text and creates an adaptive insertion form. The
form contains the best attribute names given the document text and the information need (query
workload), and the most probable attribute values given the document text. The author (creator)
can inspect the form, modify the generated metadata as necessary, and submit the annotated
document for storage.
ADVANTAGES OF PROPOSED SYSTEM:
• We present an adaptive technique for automatically generating data input forms for
annotating unstructured textual documents, such that the utilization of the inserted data is
maximized, given the user information needs.
• We create principled probabilistic methods and algorithms to seamlessly integrate
information from the query workload into the data annotation process, in order to generate
metadata that are not just relevant to the annotated document, but also useful to the users
querying the database.
• We present extensive experiments with real data and real users, showing that our system
generates accurate suggestions that are significantly better than the suggestions from
alternative approaches.
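The idea of choosing form attributes from both the document text and the query workload can be sketched as follows (Java 8 or later). This is a simplified stand-in, not the probabilistic estimators of the paper: the class AttributeSuggester, the cue-word lists, the sample workload, and the product of the two scores are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: rank candidate attributes by content and by query workload. */
class AttributeSuggester {

    /** Share of workload queries that mention the attribute. */
    static double queryingValue(String attribute, List<Set<String>> workload) {
        if (workload.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (Set<String> query : workload) {
            if (query.contains(attribute)) {
                hits++;
            }
        }
        return (double) hits / workload.size();
    }

    /** Share of the attribute's cue words that occur in the document text. */
    static double contentValue(List<String> cueWords, String text) {
        if (cueWords.isEmpty()) {
            return 0.0;
        }
        Set<String> tokens = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        int hits = 0;
        for (String cue : cueWords) {
            if (tokens.contains(cue.toLowerCase())) {
                hits++;
            }
        }
        return (double) hits / cueWords.size();
    }

    /** Returns the k attributes with the highest contentValue * queryingValue. */
    static List<String> suggest(Map<String, List<String>> cues, List<Set<String>> workload,
                                String text, int k) {
        final Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : cues.entrySet()) {
            double score = contentValue(e.getValue(), text) * queryingValue(e.getKey(), workload);
            scores.put(e.getKey(), score);
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> Double.compare(scores.get(b), scores.get(a))); // descending
        return new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
    }

    /** Made-up schema, workload, and document used only for this example. */
    static List<String> demo(int k) {
        Map<String, List<String>> cues = new LinkedHashMap<>();
        cues.put("StormCategory", Arrays.asList("storm", "category", "hurricane"));
        cues.put("WindSpeed", Arrays.asList("wind", "mph"));
        cues.put("Location", Arrays.asList("coast", "county"));
        cues.put("Author", Arrays.asList("reported", "by"));

        List<Set<String>> workload = new ArrayList<>();
        workload.add(new HashSet<>(Arrays.asList("StormCategory", "Location")));
        workload.add(new HashSet<>(Arrays.asList("StormCategory")));
        workload.add(new HashSet<>(Arrays.asList("WindSpeed")));
        workload.add(new HashSet<>(Arrays.asList("Location")));

        String text = "Hurricane Gustav is a category 3 storm with winds of 115 mph near the coast.";
        return suggest(cues, workload, text, k);
    }

    public static void main(String[] args) {
        System.out.println(demo(2)); // prints [StormCategory, Location]
    }
}
```

An attribute that is well supported by the text but never queried, or often queried but absent from the text, scores low under the product, which mirrors the motivation for combining both signals.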
MODULES:
1. Collaborative Annotation Module
2. Dataspaces and pay-as-you-go integration Module
3. Content management product Module
4. Information extraction Module
5. Schema Evolution Module
6. Query Forms Module
MODULES DESCRIPTION:
Collaborative Annotation Module:
This module relates to the significant amount of existing work on predicting tags for documents
or other resources (web pages, images, videos). Depending on the object and the user involvement,
these approaches make different assumptions about what is expected as input; nevertheless, their
goals are similar, as they aim to find missing tags that are related to the object. We argue that
our approach is different, as we use the query workload to increase document visibility after the
tagging process. Compared with the other approaches, precision is a secondary goal, as we expect
that the annotator can improve the annotations in the process. On the other hand, the discovered
tags assist in the task of retrieval instead of simply bookmarking.
Dataspaces and pay-as-you-go integration Module:
The integration model of CADS is similar to that of Dataspaces, where a loose integration
model is proposed for heterogeneous sources. The basic difference is that Dataspaces integrate
existing annotations for data sources in order to answer queries. Our work suggests the
appropriate annotation during insertion time, and also takes into consideration the query
workload to identify the most promising attributes to add. Another related data model is that of
Google Base, where users can specify their own attribute/value pairs, in addition to the ones
proposed by the system. However, the proposed attributes in Google Base are hard-coded for
each item category (e.g., real estate property). In CADS, the goal is to learn what attributes to
suggest. Pay-as-you-go integration techniques like PayGo are useful for suggesting candidate
matchings at query time.
Content management product Module:
In this module, CADS improves content management platforms by learning the user information
demand and adjusting the insertion forms accordingly.
Information extraction Module:
Information extraction is related to this effort, mainly in the context of value suggestion for the
computed attributes. We can broadly separate the area into two main efforts: Closed IE and Open
IE. Closed IE requires the user to define the schema, and then the system populates the tables
with relations extracted from the text. Our work on attribute suggestion naturally complements
closed IE, as we identify what attributes are likely to appear within a document. Once we have
that information, we can then employ the IE system to extract the values for the attributes. Open
IE is closer to the needs of CADS. In particular, Open IE generates RDF-like triplets, e.g.,
(Gustav, is category, 3) with no input from the user. Open IE leads to a very large number of
triplets, which means that even after the successful extraction of the attribute values, we still
have to deal with the problem of schema explosion that prevents the successful execution of
structured queries that require knowledge of the attribute names and values that appear within a
document.
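The way attribute suggestion complements closed IE can be sketched as follows: once the likely attributes are known, a value extractor is run only for those attributes. The class ValueExtractor, the attribute names, and the hand-written regular expressions are hypothetical and stand in for a real IE system.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: extract values only for attributes suggested for a document. */
class ValueExtractor {

    // Assumed hand-written patterns, one per attribute; capture group 1 holds the value.
    static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("StormCategory",
                Pattern.compile("category\\s+(\\d)", Pattern.CASE_INSENSITIVE));
        PATTERNS.put("WindSpeedMph",
                Pattern.compile("(\\d+)\\s*mph", Pattern.CASE_INSENSITIVE));
    }

    /** Returns attribute -> value for every suggested attribute found in the text. */
    static Map<String, String> extract(String text, Iterable<String> suggestedAttributes) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String attribute : suggestedAttributes) {
            Pattern pattern = PATTERNS.get(attribute);
            if (pattern == null) {
                continue; // no extractor is defined for this attribute
            }
            Matcher matcher = pattern.matcher(text);
            if (matcher.find()) {
                values.put(attribute, matcher.group(1));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String text = "Gustav is category 3 with winds of 115 mph.";
        Map<String, String> values =
                extract(text, Arrays.asList("StormCategory", "WindSpeedMph", "Location"));
        System.out.println(values); // prints {StormCategory=3, WindSpeedMph=115}
    }
}
```

Restricting extraction to the suggested attributes keeps the resulting schema small, in contrast to the very large number of triplets that Open IE produces.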
Schema Evolution Module:
In this module, the adaptive annotation in CADS can be viewed as semi-automatic schema
evolution. Previous work on schema evolution [27] did not address the problem of which attributes
to add to the schema, but rather how to support querying and other database operations when the
schema changes.
Query Forms Module:
In related work on query forms, schema information is used to auto-complete attribute or value
names in query forms, and keyword queries are used to select the most appropriate query forms.
Our work can be considered a dual approach: instead of generating query forms using the database
contents, we create the schema and contents of the database by considering the content of the
query workload (and the contents of the documents, of course).
SYSTEM CONFIGURATION:-
HARDWARE CONFIGURATION:-
• Processor - Pentium IV
• Speed - 1.1 GHz
• RAM - 256 MB (min)
• Hard Disk - 20 GB
• Keyboard - Standard Windows Keyboard
• Mouse - Two or Three Button Mouse
• Monitor - SVGA
SOFTWARE CONFIGURATION:-
• Operating System : Windows XP
• Programming Language : JAVA/J2EE
• Java Version : JDK 1.6 & above
• IDE : NetBeans 7.2.1
• Database : MySQL
REFERENCE:
Eduardo J. Ruiz, Vagelis Hristidis, and Panagiotis G. Ipeirotis, “Facilitating Document
Annotation Using Content and Querying Value”, IEEE TRANSACTIONS, VOL. 26, NO. 2,
FEBRUARY 2014.