Open-Domain Question Answering System - Research Project in NLP

RESEARCH PROJECT FROM IIITH
LTRC RESEARCH CENTRE - Hyderabad
OPEN DOMAIN QUESTION
ANSWERING SYSTEM
Technologies used: Python, Stanford Parser, OpenEphyra, Google Search
wrapper, SVMlight, LingPipe.
Domains: Natural language processing, data indexing

ABSTRACT
Using a computer to answer questions has been a human dream since the beginning of
the digital era. A first step towards such an ambitious goal is to deal with natural
language so that the computer can understand what its user asks. The discipline that
studies the connection between natural language and the representation of its meaning
via computational models is computational linguistics. Within this discipline, Question
Answering can be defined as the task that, given a question formulated in natural
language, aims at finding one or more concise answers. Improvements in technology and
the explosive demand for better information access have reignited interest in Q&A
systems. The wealth of information on the web makes it an interactive resource for
seeking quick answers to factual questions such as “Who is the first American to land
in space?” or “What is the second tallest mountain in the world?”, yet today’s most
advanced web search systems (Bing, Google, Yahoo) make it surprisingly tedious to
locate the answers. A Q&A system aims to develop techniques that go beyond retrieval
of relevant documents in order to return exact answers to natural language factoid
questions.

INTRODUCTION
1.1 Functionality:
Questions:
One definition of a question could be ‘a request for information’. But how do we recognise
such a request? In written language we often rely on question marks to denote questions.
However, this clue is misleading as rhetorical questions do not require an answer but
are often terminated by a question mark while statements asking for information may
not be phrased as questions. For example the question “What cities have underground
railways?” could also be written as a statement “Name cities which have underground
railways”. Both ask for the same information but one is a question and one an instruction.
People can easily handle these different expressions as we tend to focus on the meaning
(semantics) of an expression and not the exact phrasing (syntax). We can, therefore, use
the full complexities of language to phrase questions knowing that when they are asked
other people will understand them and may be able to provide an answer.
Answers:
If a question is a request for information and answers are given in response to questions,
then answers must be responses to requests for information. But what constitutes an
answer? Almost any statement can be an answer to a question and, in the same way that
there can be many different ways of expressing the same question, there can be many
ways of describing the same answer. For example, any question whose answer is numeric
may have the answer expressed in an infinite number of ways.
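To make the point about numeric answers concrete, here is a minimal sketch that maps several surface forms of the same number to one exact value. The helper name normalize_number is hypothetical and not part of the project code; it only covers a few simple formats.

```python
from fractions import Fraction


def normalize_number(text):
    """Map a numeric surface form to an exact Fraction (hypothetical helper)."""
    cleaned = text.strip().replace(",", "")  # drop thousands separators
    if cleaned.endswith("%"):
        return Fraction(cleaned[:-1]) / 100  # "50%" -> 1/2
    return Fraction(cleaned)  # handles "0.5", "1/2" and "5e-1"


# Four different spellings of the same answer collapse to a single value.
forms = ["0.5", "1/2", "50%", "5e-1"]
values = {normalize_number(form) for form in forms}
print(values)
```

A real system would also need to handle number words, units and rounding, which this sketch ignores.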
Process of Question Answering:
If we think about question answering as a human activity then what do we expect to
happen when a person is asked a question to which they do not know the answer? In this
situation it is likely that the person of whom the question was asked would consult some
store of knowledge (a book, a library, the Internet...) in order to find some text that they
could read and understand, allowing them to determine the answer to the question. They
could then return to the person who had originally asked the question and tell them the answer.
They could also report on where they had found the answer which would allow
the questioner to place some level of confidence in the answer.
What they would not do would be simply to give a book or maybe a handful of
documents, which they thought might contain the answer, to the person asking the question.
Unfortunately this is what most computer users have to be content with at the moment.
Many people would like to use the Internet as a source of knowledge in which they could
find answers to their questions. Although many search engines suggest that you can ask
natural language questions, the results that they return are usually sections of documents
which may or may not contain the answer but which do have many words in common with
the question. This is because question answering requires a much deeper understanding
and processing of text than most web search engines are currently capable of performing.

1.2 Existing system
Information retrieval systems take a set of keywords as input and output a ranked list
of documents that contain those keywords. Such systems rank large document collections
well, but they do not give the user the exact answer. The user has to study the retrieved
documents and find the answer, if it is present in the documents at all.
A Question Answering System helps the user find the exact answer without studying
and understanding entire documents. It therefore reduces the time the user spends
searching the internet or other sources.
The limitations of Information Retrieval are:
• It retrieves documents, not answers.
• It does not handle natural language questions.
• The results differ when the query is phrased as a natural language question.
• The user has to spend time studying documents; this is the main disadvantage of
information retrieval.
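As a rough illustration of the keyword-matching behaviour described above, the sketch below ranks documents by how many query keywords they contain. The function names and sample documents are hypothetical and not part of the project code; real search engines use far more elaborate scoring.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_documents(keywords, documents):
    """Return (score, document) pairs, best keyword overlap first."""
    wanted = set(tokenize(" ".join(keywords)))
    scored = []
    for doc in documents:
        overlap = wanted & set(tokenize(doc))
        scored.append((len(overlap), doc))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


docs = [
    "K2 is the second tallest mountain in the world.",
    "The tallest building in the world is in Dubai.",
    "Mountain biking is popular in the Alps.",
]
ranking = rank_documents(["second", "tallest", "mountain", "world"], docs)
for score, doc in ranking:
    print(score, doc)
```

Note that the output is still a list of whole documents: the user must read the top one to find "K2", which is exactly the limitation listed above.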
1.3 Proposed system
The solution initially proposed was to take a query from the user in natural language
(English) and return the exact answer in little time. The proposed system retrieves
documents from the web, processes the retrieved documents using computational
linguistics techniques and finds the exact answer for the user.

The Question Answering System reduces the time the user spends on understanding
the documents. A user may sometimes take days to find the required answer, whereas the
Question Answering System aims to give the exact result within seconds.
Features of the Project
The main feature of this project is to retrieve answers for the user from the huge
collection of documents on the web.
Question Answering System Advantages:
• Searches for information in natural language.
• Finds precise answers to custom questions.
• Greater relevance of found results.
• Answers to questions can be found on the World Wide Web or in any documents.
• Document and knowledge management improves with a QAS solution.
Question Answering System Disadvantages:
• Sometimes it gives a wrong or approximate answer rather than the accurate one.
• Knowledge extraction is very difficult.
• Its computational cost is very high.
• Has to be enhanced further in future.

ANALYSIS AND SRS DOCUMENT
2.1 ANALYSIS DOCUMENT
2.1.1 ANALYSIS MODEL:
Requirements analysis results in the specification of software’s operational
characteristics. It indicates software’s interface with other system elements and establishes
constraints that the software must meet. It allows the software engineer to elaborate on basic
requirements established during earlier requirement engineering tasks and build models that
depict user scenarios, functional activities, problem classes and their relationships, system and
class behavior.
The analysis model must achieve three primary objectives:
1. To describe what the customer wants,
2. To establish a basis for the creation of a software design,
3. To define a set of requirements that can be validated once the software is built.
One view of analysis modelling, called structured analysis, considers data and the
processes that transform the data as separate entities. Data objects are modelled in a way that
defines their attributes and relationships. Processes that manipulate objects are modelled in a
manner that shows how they transform data as objects flow through the system.
A second approach to analysis modeling, called object-oriented analysis, focuses on the
definition of classes and the manner in which they collaborate with one another to effect the
customer requirements. UML and the Unified Process are predominantly object-oriented.
To solve actual problems in an industry setting, a software engineer or a team of engineers
must incorporate a development strategy that encompasses the process, methods and tools
layers. This strategy is often referred to as a process model. A process model is chosen based
on the nature of the project and application, and on the methods and tools to be used.
The various process models are:
1. Waterfall Model
2. Prototype Model
3. RAD Model
4. Incremental Model
5. Spiral Model
Waterfall Model
It is also called the classic life cycle or the linear sequential model. It suggests a systematic,
sequential approach to software development that begins at the system level and progresses
through analysis, design, coding, testing and deployment. The initial requirements should be
specified at the beginning of the project. This is the main disadvantage of this model, as it is
difficult for the customer to state all requirements explicitly. In addition, real projects rarely
follow the sequential flow that the model proposes, and a working version of the program will
not be available until late in the project time span.
Prototype Model
Prototyping is a process that enables the developer to create a model of the software that
must be built. It begins with requirements gathering. Developer and customer meet and define
the overall objectives for the software, identify whatever requirements are known, and outline
areas where further definition is mandatory. A “quick design” then occurs. The quick design
focuses on a representation of those aspects of the software that will be visible to the user. The
quick design leads to the construction of a prototype.
RAD Model
Rapid Application Development (RAD) is an incremental software development process
model that emphasizes an extremely short development cycle. The RAD model is a high-speed
adaptation of the linear sequential model in which rapid development is achieved by using
component based construction. If requirements are well understood and project scope is
constrained, the RAD process enables a development team to create a “fully functional system”
within very short time periods. The RAD approach encompasses the following phases: business
modeling, data modeling, process modeling, application generation, and testing and turnover.
Incremental Model
The incremental model combines elements of the linear sequential model with the
iterative philosophy of prototyping. When an incremental model is used, the first increment is
often a core product. That is, basic requirements are addressed, but many supplementary features
(some known, others unknown) remain undelivered.
The core product is used by the customer. As a result of use and/or evaluation, a plan is
developed for the next increment. The plan addresses the modification of the core product to
better meet the needs of the customer and the delivery of additional features and functionality.
This process is repeated following the delivery of each increment, until the complete product is
produced.
Iterative Incremental Model

During the first requirements analysis phase, customers and developers specify as many
requirements as possible and prepare an SRS document. This model delivers an operational
quality product at each release, but one that satisfies only a subset of the customer’s
requirements. The complete product is divided into releases, and the developer delivers the
product release by release.
2.1.2 THE REQUIREMENT STUDY
The origin of most software systems is in the need of a client, who either wants to automate
an existing manual system or desires a new software system. The software system itself is
created by the developer. Finally, the completed system will be used by the end user. Thus, there
are three major parties interested in a new system: the client, the users and the developer. The
requirements for the system that will satisfy the needs of the clients and the concerns of the users
have to be communicated to the developer. The problem is that the client usually does not
understand software or the software development process, and the developer often does not
understand the client’s problem and application area. This causes a communication gap between
the parties involved in the development project. A basic purpose of the software requirements
specification (SRS) is to bridge this communication gap. The SRS is the medium through which
the client and user needs are accurately specified; indeed, the SRS forms the basis of software
development. A good SRS should satisfy all the parties, which is very hard to achieve and
involves trade-offs and persuasion.
2.1.3 Requirement process
Modelling generally focuses on the problem structure, not its external behaviour.
Consequently, things like user interfaces are rarely modelled, whereas they frequently form a
major component of the SRS. Similarly, performance constraints, design constraints, standards
compliance, recovery, etc., are specified clearly in the SRS because the designer must know
about them to properly design the system.
To properly satisfy the basic goals, an SRS should have certain properties and should contain
different types of requirements. A good SRS has the following properties [IEEE87, IEEE94]. An
SRS is complete if everything the software is supposed to do and the responses of the software
to all classes of input data are specified in the SRS. Correctness and completeness go hand in
hand. An SRS is unambiguous if and only if every requirement stated has one and only one
interpretation; this is hard to achieve because requirements are often written in natural
language. An SRS is verifiable if and only if every stated requirement is verifiable. A
requirement is verifiable if there exists some cost-effective process that can check whether the
final software meets that requirement. An SRS is consistent if there is no requirement that
conflicts with another.
Writing an SRS is an iterative process. Even when the requirements of a system are
specified, they are later modified as the needs of the client change. Hence an SRS should be
easy to modify. An SRS is traceable if the origin of each of its requirements is clear and if it
facilitates the referencing of each requirement in future development [IEEE87]. One of the most
common problems in requirement specification is when some of the requirements of the client
are not specified. This necessitates addition and modifications to the requirements later in the
development cycle, which are often expensive to incorporate.

2.1.4 Project schedule
Study phase
In the study phase we do the preliminary investigation and determine the system
requirements. We study the system and collect the data needed to draw the dataflow diagrams.
We use methods such as questioning and observation to find the facts involved in the
process. This phase is important because if the specification study is not done properly, the
design phase and the phases that follow it will go wrong.
Design phase
In the design phase we design the system using the results of the study phase and the
dataflow diagrams. We make use of general access methods for designing and follow the
top-down approach. In the design phase we determine the entities, their attributes and the
relationships between the entities. We do the logical and physical design of the system.
Development phase
In the development phase we mostly do the coding, following the design of the system.
We follow modular programming, and after developing each module we do unit testing
followed by integration testing.
Implementation phase
The last phase of the project is the implementation phase. Quality assurance is the primary
motive in this phase. Quality assurance is the review of software products and related
documentation for completeness, correctness, reliability and maintainability. The philosophy
behind testing is to find errors. The testing strategies are of two types: code testing and
specification testing. In code testing we examine the logic of the program.

PHASES OF DEVELOPMENT
Feasibility study
A feasibility study is a compressed capsule version of the entire System Analysis and
Design Process. The study begins by clarifying the problem definition. The aim of a feasibility
study is not to solve the problem but to determine whether it is worth doing. Once an acceptable
problem definition has been generated, the analyst develops a logical model of the system as a
reference. Next, the alternatives are carefully analyzed for feasibility. At least three different
types of feasibility are considered.
Economic Feasibility
A system that can be developed technically and that will be used if installed must still
be a good investment for the organization. Financial benefits must equal or exceed the cost. The
cost of the feasibility study should be approximately 5 to 10 per cent of the estimated cost. The
next factor is the cost of development, which for this sort of project is the cost per man-hour.
In this case the cost is nil and, considering the time factor, the project is to be completed in
1 month. Hence it is economically feasible.
Technical Feasibility
The technical issues usually raised during a feasibility study are: Does the necessary
technology exist to do what is suggested? Can the system be expanded if developed? The present
project is being done after all the software requirements are met, and there is also provision for
further enhancement.
Operational Feasibility
This test of feasibility asks if the system will work when it is developed and installed.
Here are the questions that help test the operational feasibility of a project. Is there sufficient
support for the project from management and users? Will the proposed system work under all
conditions? Have the users been involved in the planning and development of the project? The
project has been done with the involvement of management and users and it is tested to work in
all conditions. So it can be considered operationally feasible.
2.2 SRS DOCUMENT
2.2.1. PURPOSE

The purpose behind the System Requirements and Specification document is to describe
the resources and management of those resources used in the design of the Question Answering
System. This System Requirements and Specification will further provide details regarding the
functional and performance related requirements of the algorithm.
2.2.2. General Description
This section describes the users of the system along with their characteristics, and
specifies the product. It also briefly describes the functional and data requirements of the
project.
2.2.1 Users and their Characteristics
The users of this product are everybody who requires answers to their questions.
• The user gives a natural language query. It is open domain.
• The user has to give the query without any mistakes.
• The user then gets the related answer as the output.
2.2.2 Product Perspective
Our product is developed to retrieve relevant answers to a natural language query from
the web. It helps the user with information retrieval.
2.2.3 Overview of Functional Requirements
Purpose: To retrieve answers to the user from the web.
Inputs: Natural language Query from the real world entity.

Outputs: Answer to the user question.
2.2.4 Overview of Data Requirement
The user must have an internet connection to get the output.
User View of Product Use
User can use this product in many areas like:
Education
Medical
Social
Information retrieving etc.

REQUIREMENTS
Hardware and Software Requirements
The hardware and software required to implement the project are listed below.
3.1 SOFTWARE REQUIREMENTS
Technologies     : Python, Java, C++
Tools            : SENNA, NLTK 3.0, SVM-Light-TK, scikit-learn, Stanford Parser
Domain           : Machine learning, Natural language processing
JDK              : Version 1.6 or above
Python           : 2.6 or above
Operating system : Linux versions
3.2 HARDWARE REQUIREMENTS
Processor : Intel Core processor
RAM       : 3 GB (minimum)
Hard Disk : 80 GB (minimum)

DESIGN
Software design is an iterative process through which requirements are translated into a
“blueprint” for constructing software. Initially, the blueprint depicts a holistic view of software.
That is, the design is represented at a high level of abstraction. As design iterations occur,
subsequent refinement leads to design representations at much lower levels of abstraction.
These can still be traced to requirements, but the connection is more subtle.
Throughout the design process, the quality of the evolving design is assessed with a
series of formal technical reviews or design walkthroughs. Three characteristics that serve as a
guide for evaluation of good design:
• The design must implement all of the explicit requirements contained in the analysis
model.
• The design must be a readable, understandable guide for those who generate code and for
those who test and subsequently support the software.
• Design should provide a complete picture of the software, addressing the data,
functional and behavioral domains from an implementation perspective.
Modules of Design:
There are three main modules of design in Question Answering System. They are:
1) Question Analysis
2) Document Retrieval
3) Answer Extraction
Architecture: The figure shows the basic architecture of Question Answering System.
Question Analysis:
Question analysis deals with the question given by the user. It is very important for the
accuracy of the entire system. At present, question analysis consists of question
classification. Question classification determines what the user is asking for, i.e.,
whether the user is asking for a number, definition, name, abbreviation etc. We implemented
question classification following the paper by Xin Li and Dan Roth. This paper classifies
questions into 6 coarse classes and 50 fine-grained classes.
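For reference, the six coarse classes of the Li and Roth taxonomy can be written down as a small lookup table. The sketch below uses the abbreviated tags found in the publicly distributed question classification data, where a label looks like "NUM:date". The helper split_label is hypothetical and only illustrates how such a label separates into its coarse and fine parts; the 50 fine-grained classes are not enumerated here.

```python
# Coarse classes of the Li and Roth question taxonomy (tag -> meaning).
COARSE_CLASSES = {
    "ABBR": "abbreviation",
    "DESC": "description",
    "ENTY": "entity",
    "HUM": "human",
    "LOC": "location",
    "NUM": "numeric value",
}


def split_label(label):
    """Split a 'COARSE:fine' label such as 'NUM:date' into its two parts."""
    coarse, _, fine = label.partition(":")
    if coarse not in COARSE_CLASSES:
        raise ValueError("unknown coarse class: %r" % coarse)
    return coarse, fine


coarse, fine = split_label("LOC:mount")
print(COARSE_CLASSES[coarse], fine)
```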

Question Classification
Question classification is done using a statistical method: machine learning techniques
are used to classify the questions. The system is first trained, and testing is then performed
with the trained models. The SVM-Light-TK tool is used to train and test on the data. The
input features for the train and test files are:
1) Parse Tree (PT)
2) Bag of Words (BOW)
3) Bag of Parts of Speech (BOP)
4) Predicate Argument Structure (PAS)
5) TF-IDF values
The PT, BOW, BOP and PAS features are produced with the SENNA tool. The tf-idf values
are computed with the scikit tool. After all these values are calculated, they are passed to the
SVM-Light-TK tool for training. The SVM-Light-TK tool creates the models from the training
data. The test data is then passed to the classifier, which outputs the predictions.
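To illustrate two of the features listed above, the following sketch builds a bag of words for each question and weights it with the textbook formula tf × log(N / df). It is a pure-Python stand-in written for illustration only: the project computes tf-idf with scikit, whose default weighting is smoothed and normalised differently, and the function names here are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(questions):
    """Return one {term: weight} dict per question using tf * log(N / df)."""
    bags = [Counter(tokenize(q)) for q in questions]  # bag of words per question
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())  # number of questions containing each term
    total = len(questions)
    vectors = []
    for bag in bags:
        length = sum(bag.values())
        vectors.append({
            term: (count / length) * math.log(total / doc_freq[term])
            for term, count in bag.items()
        })
    return vectors


questions = [
    "Who is the first American in space?",
    "What is the second tallest mountain in the world?",
    "Who wrote Hamlet?",
]
vectors = tfidf_vectors(questions)
# Rare terms such as "hamlet" get a higher weight than common ones such as "who".
print(sorted(vectors[2].items(), key=lambda item: item[1], reverse=True))
```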
Keyword Extraction:
After the question is classified, keywords are extracted from it to be passed to the web
search. The keywords are extracted using the Rapid Automatic Keyword Extraction (RAKE)
algorithm. The output keywords are passed to the search engine for document retrieval.
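The core idea of RAKE can be sketched in a few lines: split the text into candidate phrases at stopwords and punctuation, score each word by degree divided by frequency, and score each phrase as the sum of its word scores. The version below is a simplified illustration and not the project's implementation; in particular, the stop list is deliberately tiny and is an assumption.

```python
import re
from collections import defaultdict

# A deliberately tiny stop list for illustration; real RAKE uses a much longer one.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "in", "of", "to", "who",
             "what", "which", "when", "where", "how", "did", "do", "does"}


def candidate_phrases(text):
    """Split text into runs of consecutive non-stopword tokens."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):  # punctuation delimits
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake(text):
    """Return (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, including the word itself
    scored = {}
    for phrase in phrases:
        scored[" ".join(phrase)] = sum(degree[w] / freq[w] for w in phrase)
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


keywords = rake("What is the second tallest mountain in the world?")
print(keywords)
```

Longer phrases score higher because every word in them has a higher degree, which is why "second tallest mountain" outranks "world" for this question.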
Document Retrieval:
The keywords are passed to the Google search engine by a program. The top
documents are downloaded from the web to the local machine. The downloaded documents
are in HTML format. The text is extracted from the HTML files using the Beautiful Soup tool.
The extracted text is then divided into sentences using the NLTK tool.
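The text extraction and sentence splitting steps can be approximated with the Python 3 standard library alone. The sketch below is a stand-in for illustration: it uses html.parser instead of Beautiful Soup and a naive regular expression instead of the NLTK sentence tokenizer, so it will mis-split abbreviations such as "Dr." that a trained tokenizer handles. The class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_sentences(html):
    """Strip markup from an HTML string and split the text into sentences."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


page = ("<html><head><style>p {color: red}</style></head><body>"
        "<p>K2 is in the Karakoram. It is 8,611 metres high.</p>"
        "<script>var x = 1;</script></body></html>")
sentences = html_to_sentences(page)
print(sentences)
```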
Answer Extraction:
Sentence Ranking: The sentence ranking step is responsible for ranking the sentences and
giving a relative probability estimate to each one. It also registers the frequency of each
individual phrase chunk marked by the NE recognizer for a given question class. The final answer
is the phrase chunk with maximum frequency belonging to the sentence with the highest rank.
The probability estimate and the retrieved answer’s frequency are used to compute the
confidence of the answer.
We use sense net ranking algorithm to rank the sentences.
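The selection rule described above can be sketched as follows. Several parts are assumptions made only for illustration: sentences are ranked by simple keyword overlap rather than by the ranking algorithm used in the project, regular expressions stand in for the NE recognizer, and the confidence is taken as the product of the sentence's probability estimate and the answer's relative frequency, since the text does not give the exact formula.

```python
import re
from collections import Counter

# Stand-in "NE recognizer": one regex per coarse question class (assumption).
CHUNK_PATTERNS = {
    "NUM": re.compile(r"\d[\d,.]*\d|\d"),
    "HUM": re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+"),
}


def tokens(text):
    """Return the set of lowercase word tokens in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract_answer(keywords, sentences, question_class):
    """Return (answer, confidence), or (None, 0.0) when no chunk is found."""
    wanted = tokens(" ".join(keywords))
    pattern = CHUNK_PATTERNS[question_class]
    # Overlap score per sentence, normalised into a relative probability estimate.
    overlaps = [len(wanted & tokens(s)) for s in sentences]
    total = sum(overlaps)
    if total == 0:
        return None, 0.0
    probs = [o / total for o in overlaps]
    # Frequency of every candidate chunk across all retrieved sentences.
    chunk_freq = Counter(c for s in sentences for c in pattern.findall(s))
    best = max(range(len(sentences)), key=lambda i: probs[i])
    candidates = pattern.findall(sentences[best])
    if not candidates:
        return None, 0.0
    # Most frequent chunk within the top-ranked sentence.
    answer = max(candidates, key=lambda c: chunk_freq[c])
    confidence = probs[best] * chunk_freq[answer] / sum(chunk_freq.values())
    return answer, confidence


sentences = [
    "Alan Shepard was the first American in space.",
    "Alan Shepard flew aboard Freedom 7 in 1961.",
    "Yuri Gagarin was the first human in space.",
]
answer, confidence = extract_answer(["first", "American", "space"], sentences, "HUM")
print(answer, round(confidence, 2))
```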
4.2 Detailed System Design
Detailed design specifies who uses the Question Answering System from outside the system,
i.e., within the environment, how the processes within the system communicate, and how the
objects within a process collaborate, using both static and dynamic UML diagrams. In the
ever-changing world of object-oriented application development, it has been getting harder
and harder to develop and manage high-quality applications in a reasonable amount of time.
As a result of this challenge and the need for a universal object modeling language that
everyone could use, the Unified Modeling Language (UML) became the information industry’s
version of the blueprint. It is a method for describing the system’s architecture in detail. Using
this blueprint, it becomes much easier to build or maintain a system, and to ensure that the
system will hold up to requirement changes. This part also specifies the other levels of the data
flow diagram.
4.2.1 Use Case View
A use case diagram presents a collection of use cases and actors and is typically used to
specify or characterize the functionality and behavior of a whole application system interacting
with one or more external actors. The users and any system that may interact with the system
are the actors. Since actors represent system users, they help delimit the system and give a
clearer picture of what it is supposed to do. Use cases are developed on the basis of the actors’
needs. This ensures that the system will turn out to be what the user expected.
Use case diagrams contain icons representing actors, association relationships,
generalization relationships, packages, and use cases. A use case diagram shows the set of
external actors and the system use cases that the actors participate in. A use case specification
makes it possible to display and modify the properties and relationships of a use case. The
information in the specification is presented textually; some of this information can also be
displayed inside the icon representing a use case. A use case diagram is used to depict the
requirements and the users of the Question Answering System.
The use case diagram shown below in Figure 3.4 identifies the use cases and actors
in the Question Answering System.

4.2.2 Sequence Diagram
Class and object diagrams are static model views. Interaction diagrams are dynamic.
They describe how objects collaborate. A sequence diagram is an interaction diagram that details
how operations are carried out: what messages are sent and when. Sequence diagrams are
organized according to time. The time progresses as you go down the page. The objects involved
in the operation are listed from left to right according to when they take part in the message
sequence.
The sequence diagram of the Question Answering System is shown below in the figure. The
sequence of steps in the figure is as follows:

4.2.3 Collaboration Diagram
The collaboration diagram of the Question Answering System is shown below:

DEVELOPMENT
5.1 ABOUT SOFTWARE DEVELOPMENT
Software development is the set of activities that results in software products. Software
development may include research, new development, modification, reuse, re-engineering,
maintenance, or any other activities that result in software products. Especially the first phase
in the software development process may involve many departments, including marketing,
engineering, research and development and general management.
The software development process includes the following steps:
• Requirement Analysis: The most important task in creating the software product is
extracting the requirements or requirement analysis. Frequently demonstrating live code
may help reduce the risk that the requirements are incorrect. Once the general
requirements are gleaned from the client, an analysis of scope of the development should
be determined and clearly stated.
• Specification: It is the task of precisely describing the software to be written. In
practice, most successful specifications are written to understand and fine-tune
applications that are already developed. These are most important for external interfaces
that must remain stable.
• Architecture: The architecture of the system refers to an abstract representation of
the system. It is concerned with making sure the software system will meet the
requirements of the product.

• Design, implementation and testing: Implementation is the part of the process where
software engineers actually program the code for the project. Software testing is an integral
and important part of the software development process. This part of the process ensures
that bugs are recognized as early as possible.
• Deployment and maintenance: Deployment starts after the code is appropriately tested, is
approved for release and sold. Maintaining and enhancing software to cope with newly
discovered problems or new requirements can take far more time than the initial
development of the software.

IMPLEMENTATION
6.1. CODING STANDARDS:
The coding standards are:
• Naming standards for the applications, forms and variables etc.
• Screen design standards.
• Validation and checks that need to be implemented.
The coding guidelines are:
• Coding should be well documented.
• Coding style should be simple.
• Length of functions should be short.
6.1.1. Code efficiency:
The code is designed with the following characteristics in mind.
1. Uniqueness
The code structure must ensure that only one value of the code with a single meaning is
correctly applied to a given entity or attribute.
2. Expandability
The code structure is designed so that it allows for growth of its set of entities
and attributes, thus providing sufficient space for the entry of new items within each
classification.

3. Conciseness
The code requires the fewest possible number of positions to include and define each item.
4. Uniform size and format
Uniform size and format are highly desirable in mechanized data processing systems. The
addition of suffixes and prefixes to the root code should not be allowed especially as it is
incompatible with the uniqueness requirement.
5. Simplicity
The codes are designed in a simple manner to understand and simple to apply.
6. Versatility
The code can be modified easily to reflect necessary changes in conditions, characteristics
and relationship of the encoded entities. Such changes must result in a corresponding change in
the code or coding structure.
7. Sortability
Reports are most valuable for user efficiency when sorted and presented in a
predetermined format or order. Although data must be stored and collated, the representative
code for the data does not need to be in a sortable form if it can be correlated with another code
that is sortable.
8. Stability
Codes that do not need frequent updates also promote user efficiency. Individual
code assignments for a given entity should be made with a minimal likelihood of change either
in the specific code or in the entire coding structure.
9. Meaningfulness
Code is meaningful. Code values should reflect the characteristics of the coded entities, such
as mnemonic features, unless such a procedure results in inconsistency and inflexibility.
10. Operability
The code is adequate for present and anticipated data processing both for machine
and human use. Care is taken to minimize the clerical effort and computer time required
for continuing the operation.
6.2. SAMPLE CODE:
Google.py:
#!/usr/bin/env python
__all__ = ['search']
import os
import sys
import time
# Import the URL and cookie helpers from their Python 3 or Python 2 locations.
# (The unconditional "import urllib2" was dropped: it fails on Python 3 and the
# names it provides are imported below.)
if sys.version_info[0] > 2:
    from http.cookiejar import LWPCookieJar
    from urllib.request import Request, urlopen
    from urllib.parse import quote_plus, urlparse, parse_qs
else:
    from cookielib import LWPCookieJar
    from urllib import quote_plus
    from urllib2 import Request, urlopen
    from urlparse import urlparse, parse_qs
BeautifulSoup = None
# URL templates used to build the search requests.
url_home = "http://www.google.%(tld)s/"
url_search = "http://www.google.%(tld)s/search?hl=%(lang)s&q=%(query)s&btnG=Google+Search&tbs=%(tbs)s&safe=%(safe)s&tbm=%(tpe)s"
url_next_page = "http://www.google.%(tld)s/search?hl=%(lang)s&q=%(query)s&start=%(start)d&tbs=%(tbs)s&safe=%(safe)s&tbm=%(tpe)s"
url_search_num = "http://www.google.%(tld)s/search?hl=%(lang)s&q=%(query)s&num=%(num)d&btnG=Google+Search&tbs=%(tbs)s&safe=%(safe)s&tbm=%(tpe)s"
url_next_page_num = "http://www.google.%(tld)s/search?hl=%(lang)s&q=%(query)s&num=%(num)d&start=%(start)d&tbs=%(tbs)s&safe=%(safe)s&tbm=%(tpe)s"
# Cookie jar, stored in the user's home folder.
home_folder = os.getenv('HOME')
if not home_folder:
    home_folder = os.getenv('USERHOME')
    if not home_folder:
        home_folder = '.'  # fall back to the current folder
cookie_jar = LWPCookieJar(os.path.join(home_folder, '.google-cookie'))
try:
    cookie_jar.load()
except Exception:
    pass
def get_page(url):
    # Request the given URL with the stored cookies and return the response body.
    request = Request(url)
    request.add_header('User-Agent',
                       'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)')
    cookie_jar.add_cookie_header(request)
    response = urlopen(request)
    cookie_jar.extract_cookies(response, request)
    html = response.read()
    response.close()
    cookie_jar.save()
    return html
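def filter_result(link):
    # Return the link if it is a valid external result, or None otherwise.
    try:
        # Valid results are absolute URLs not pointing to a Google domain.
        o = urlparse(link, 'http')
        if o.netloc and 'google' not in o.netloc:
            return link
        # Decode links hidden behind Google's "/url?q=..." redirect.
        if link.startswith('/url?'):
            link = parse_qs(o.query)['q'][0]
            o = urlparse(link, 'http')
            if o.netloc and 'google' not in o.netloc:
                return link
    except Exception:
        pass
    return None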
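def search_images(query, tld='com', lang='en', tbs='0', safe='off', num=10, start=0,
                  stop=None, pause=2.0, only_standard=False, extra_params={}):
    return search(query, tld, lang, tbs, safe, num, start, stop, pause, only_standard,
                  extra_params, tpe='isch')

def search_news(query, tld='com', lang='en', tbs='0', safe='off', num=10, start=0,
                stop=None, pause=2.0, only_standard=False, extra_params={}):
    return search(query, tld, lang, tbs, safe, num, start, stop, pause, only_standard,
                  extra_params, tpe='nws')

def search_videos(query, tld='com', lang='en', tbs='0', safe='off', num=10, start=0,
                  stop=None, pause=2.0, only_standard=False, extra_params={}):
    return search(query, tld, lang, tbs, safe, num, start, stop, pause, only_standard,
                  extra_params, tpe='vid')

def search_shop(query, tld='com', lang='en', tbs='0', safe='off', num=10, start=0,
                stop=None, pause=2.0, only_standard=False, extra_params={}):
    return search(query, tld, lang, tbs, safe, num, start, stop, pause, only_standard,
                  extra_params, tpe='shop')

def search_books(query, tld='com', lang='en', tbs='0', safe='off', num=10, start=0,
                 stop=None, pause=2.0, only_standard=False, extra_params={}):
    return search(query, tld, lang, tbs, safe, num, start, stop, pause, only_standard,
                  extra_params, tpe='bks')

def search_apps(query, tld='com', lang='en', tbs='0', safe='off', num=10, start=0,
                stop=None, pause=2.0, only_standard=False, extra_params={}):
    return search(query, tld, lang, tbs, safe, num, start, stop, pause, only_standard,
                  extra_params, tpe='app')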
def search(query, tld='com', lang='en', tbs='0', safe='off', num=10, start=0,
           stop=None, pause=2.0, only_standard=False, extra_params={}, tpe=''):
    # Lazy import of BeautifulSoup: try version 4 first, then fall back to 3.
    global BeautifulSoup
    if BeautifulSoup is None:
        try:
            from bs4 import BeautifulSoup
        except ImportError:
            from BeautifulSoup import BeautifulSoup

    # Hashes of the links found so far, used to skip duplicates.
    hashes = set()

    # Prepare the search string.
    query = quote_plus(query)

    # Reject extra GET parameters that clash with the built-in ones.
    for builtin_param in ('hl', 'q', 'btnG', 'tbs', 'safe', 'tbm'):
        if builtin_param in extra_params:
            raise ValueError(
                'GET parameter "%s" is overlapping with '
                'the built-in GET parameter' % builtin_param)

    # Grab the cookie from the home page.
    get_page(url_home % vars())

    # Prepare the URL of the first request.
    if start:
        if num == 10:
            url = url_next_page % vars()
        else:
            url = url_next_page_num % vars()
    else:
        if num == 10:
            url = url_search % vars()
        else:
            url = url_search_num % vars()

    # Loop until the requested last result is reached or no results remain.
    while not stop or start < stop:
        # Append the extra GET parameters once to the current URL.
        for k, v in extra_params.items():
            url += '&%s=%s' % (k, v)

        # Sleep between requests, then fetch and parse the result page.
        time.sleep(pause)
        html = get_page(url)
        soup = BeautifulSoup(html)
        anchors = soup.find(id='search').findAll('a')
        for a in anchors:
            # Keep only standard result links (inside <h3>) when requested.
            if only_standard and (
                    not a.parent or a.parent.name.lower() != "h3"):
                continue
            try:
                link = a['href']
            except KeyError:
                continue
            link = filter_result(link)
            if not link:
                continue
            h = hash(link)
            if h in hashes:
                continue
            hashes.add(h)
            yield link

        # Stop if there is no navigation block, i.e. no further result pages.
        if not soup.find(id='nav'):
            break

        # Prepare the URL of the next request.
        start += num
        if num == 10:
            url = url_next_page % vars()
        else:
            url = url_next_page_num % vars()
if __name__ == "__main__":
    from optparse import OptionParser, IndentedHelpFormatter

    class BannerHelpFormatter(IndentedHelpFormatter):
        "Just a small tweak to optparse to be able to print a banner."

        def __init__(self, banner, *argv, **argd):
            self.banner = banner
            IndentedHelpFormatter.__init__(self, *argv, **argd)

        def format_usage(self, usage):
            msg = IndentedHelpFormatter.format_usage(self, usage)
            return '%s\n%s' % (self.banner, msg)

    # Parse the command line arguments.
    formatter = BannerHelpFormatter(
        "Python script to use the Google search engine\n"
        "By Mario Vilas (mvilas at gmail dot com)\n"
        "https://github.com/MarioVilas/google\n"
    )
    parser = OptionParser(formatter=formatter)
    parser.set_usage("%prog [options] query")
    parser.add_option("--tld", metavar="TLD", type="string", default="com",
                      help="top level domain to use [default: com]")
    parser.add_option("--lang", metavar="LANGUAGE", type="string", default="en",
                      help="produce results in the given language [default: en]")
    parser.add_option("--tbs", metavar="TBS", type="string", default="0",
                      help="produce results from period [default: 0]")
    parser.add_option("--safe", metavar="SAFE", type="string", default="off",
                      help="kids safe search [default: off]")
    parser.add_option("--num", metavar="NUMBER", type="int", default=10,
                      help="number of results per page [default: 10]")
    parser.add_option("--start", metavar="NUMBER", type="int", default=0,
                      help="first result to retrieve [default: 0]")
    parser.add_option("--stop", metavar="NUMBER", type="int", default=0,
                      help="last result to retrieve [default: unlimited]")
    parser.add_option("--pause", metavar="SECONDS", type="float", default=2.0,
                      help="pause between HTTP requests [default: 2.0]")
    parser.add_option("--all", dest="only_standard",
                      action="store_false", default=True,
                      help="grab all possible links from result pages")
    (options, args) = parser.parse_args()
    query = ' '.join(args)
    if not query:
        parser.print_help()
        sys.exit(2)
    params = [(k, v) for (k, v) in options.__dict__.items() if not k.startswith('_')]
    params = dict(params)

    # Run the query and save each result page as result_<i>.html.
    i = 0
    for url in search(query, **params):
        print(url)
        response = urllib2.urlopen(url)
        html = response.read()
        response.close()
        f = open("result_" + str(i) + ".html", "wb")
        f.write(html)
        f.close()
        i = i + 1
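To illustrate how the URL templates in Google.py are filled in, here is a minimal Python 3 sketch. It copies the url_search template from the listing and encodes the query with quote_plus. The helper name build_search_url is hypothetical, and it formats the template with locals() where the listing uses vars().

```python
from urllib.parse import quote_plus

# Same template as url_search in the listing above.
url_search = ("http://www.google.%(tld)s/search?hl=%(lang)s&q=%(query)s"
              "&btnG=Google+Search&tbs=%(tbs)s&safe=%(safe)s&tbm=%(tpe)s")


def build_search_url(query, tld="com", lang="en", tbs="0", safe="off", tpe=""):
    # Hypothetical helper, not part of Google.py.
    # quote_plus turns spaces into "+" and escapes reserved characters.
    query = quote_plus(query)
    # Each %(name)s placeholder is filled from the local variable of that name.
    return url_search % locals()


print(build_search_url("who is the prime minister of india"))
```

Encoding the query first matters: a raw "&" or space inside the question would otherwise break the query string.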
html2text.py:
from bs4 import BeautifulSoup

# Convert each saved result page into plain text.
for i in range(0, 20):
    markup = open("result_" + str(i) + ".html")
    soup = BeautifulSoup(markup.read())
    markup.close()
    # remove script and style elements
    for script in soup(["script", "style"]):
        script.extract()
    # get text
    text = soup.get_text()
    # strip leading and trailing whitespace on each line
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    print(text)
    f = open("example" + str(i) + ".txt", "wb")
    f.write(text.encode('utf-8'))
    f.close()
sentencesplitting.py:
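import nltk.data
import re, math
from collections import Counter

# Load the Punkt sentence tokenizer once and split every extracted text file.
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = []
for i in range(0, 33):
    f = open('example' + str(i) + '.txt', 'r')
    text = f.read()
    f.close()
    sentences.append(sent_detector.tokenize(text.strip()))

# Write all sentences to one file, separated by blank lines.
f1 = open('sentencesfile', 'w')
for string in sentences:
    for s in string:
        f1.write(s + "\n\n\n")
f1.close()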
WORD = re.compile(r'\w+')


def get_cosine(vec1, vec2):
    # Cosine similarity between two word-count vectors.
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator


def text_to_vector(text):
    # Count the occurrences of each word in the text.
    words = WORD.findall(text)
    return Counter(words)
# Print every sentence that shares at least one word with the query keywords.
text1 = 'prime minister india'
vector1 = text_to_vector(text1)
for string in sentences:
    for s in string:
        vector2 = text_to_vector(s)
        cosine = get_cosine(vector1, vector2)
        if cosine > 0:
            print(s + "\n\n\n")
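As a small worked illustration of the ranking step above, the sketch below scores three made-up sentences against the query 'prime minister india' with the same bag-of-words cosine measure. It is written for Python 3 and makes two assumptions that are not in the listing: the text is lower-cased before counting (the listing compares case-sensitively), and a hypothetical helper named rank sorts the matching sentences by score.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def text_to_vector(text):
    # Bag-of-words counts; lower-casing is an added assumption.
    return Counter(WORD.findall(text.lower()))


def get_cosine(vec1, vec2):
    # Dot product over shared words divided by the product of the vector norms.
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in shared)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    if not norm1 or not norm2:
        return 0.0
    return numerator / (norm1 * norm2)


def rank(query, sentences):
    # Hypothetical helper: keep sentences with non-zero similarity, best first.
    query_vector = text_to_vector(query)
    scored = [(get_cosine(query_vector, text_to_vector(s)), s) for s in sentences]
    return sorted((pair for pair in scored if pair[0] > 0), reverse=True)


sentences = [
    "The Prime Minister of India addressed the nation.",
    "India won the match.",
    "The weather is pleasant today.",
]
for score, sentence in rank("prime minister india", sentences):
    print("%.3f  %s" % (score, sentence))
```

With these inputs the first sentence scores about 0.548 and the second about 0.289, while the third shares no word with the query and is dropped.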
rake.py:
import re
import operator

debug = False
test = True


def is_number(s):
    try:
        float(s) if '.' in s else int(s)
        return True
    except ValueError:
        return False


def load_stop_words(stop_word_file):
    # Read stop words from a file, skipping comment lines that start with "#".
    stop_words = []
    for line in open(stop_word_file):
        if line.strip()[0:1] != "#":
            for word in line.split():  # in case more than one per line
                stop_words.append(word)
    return stop_words
def separate_words(text, min_word_return_size):
    # Return the words of the text that are longer than min_word_return_size.
    splitter = re.compile('[^a-zA-Z0-9_\\+\\-/]')
    words = []
    for single_word in splitter.split(text):
        current_word = single_word.strip().lower()
        # leave numbers in phrase, but don't count as words, since they tend to
        # invalidate scores of their phrases
        if (len(current_word) > min_word_return_size and current_word != ''
                and not is_number(current_word)):
            words.append(current_word)
    return words


def split_sentences(text):
    # Split the text into fragments at punctuation marks.
    sentence_delimiters = re.compile(u'[.!?,;:\t\\-\\"\\(\\)\\\'\u2019\u2013]')
    sentences = sentence_delimiters.split(text)
    return sentences
def build_stop_word_regex(stop_word_file_path):
    # Build one regex that matches any stop word as a whole word.
    stop_word_list = load_stop_words(stop_word_file_path)
    stop_word_regex_list = []
    for word in stop_word_list:
        word_regex = r'\b' + word + r'\b'
        stop_word_regex_list.append(word_regex)
    stop_word_pattern = re.compile('|'.join(stop_word_regex_list), re.IGNORECASE)
    return stop_word_pattern


def generate_candidate_keywords(sentence_list, stopword_pattern):
    # Candidate phrases are the word runs left between stop words.
    phrase_list = []
    for s in sentence_list:
        tmp = re.sub(stopword_pattern, '|', s.strip())
        phrases = tmp.split("|")
        for phrase in phrases:
            phrase = phrase.strip().lower()
            if phrase != "":
                phrase_list.append(phrase)
    return phrase_list
def calculate_word_scores(phraseList):
    word_frequency = {}
    word_degree = {}
    for phrase in phraseList:
        word_list = separate_words(phrase, 0)
        word_list_length = len(word_list)
        word_list_degree = word_list_length - 1
        #if word_list_degree > 3: word_list_degree = 3 #exp.
        for word in word_list:
            word_frequency.setdefault(word, 0)
            word_frequency[word] += 1
            word_degree.setdefault(word, 0)
            word_degree[word] += word_list_degree  #orig.
            #word_degree[word] += 1/(word_list_length*1.0) #exp.
    # The degree of a word also counts the word itself.
    for item in word_frequency:
        word_degree[item] = word_degree[item] + word_frequency[item]

    # Calculate word scores = deg(w)/freq(w)
    word_score = {}
    for item in word_frequency:
        word_score.setdefault(item, 0)
        word_score[item] = word_degree[item] / (word_frequency[item] * 1.0)  #orig.
        #word_score[item] = word_frequency[item]/(word_degree[item] * 1.0) #exp.
    return word_score
def generate_candidate_keyword_scores(phrase_list, word_score):
    # The score of a phrase is the sum of the scores of its words.
    keyword_candidates = {}
    for phrase in phrase_list:
        keyword_candidates.setdefault(phrase, 0)
        word_list = separate_words(phrase, 0)
        candidate_score = 0
        for word in word_list:
            candidate_score += word_score[word]
        keyword_candidates[phrase] = candidate_score
    return keyword_candidates
class Rake(object):
    def __init__(self, stop_words_path):
        self.stop_words_path = stop_words_path
        self.__stop_words_pattern = build_stop_word_regex(stop_words_path)

    def run(self, text):
        # Return (phrase, score) pairs sorted by descending score.
        sentence_list = split_sentences(text)
        phrase_list = generate_candidate_keywords(sentence_list, self.__stop_words_pattern)
        word_scores = calculate_word_scores(phrase_list)
        keyword_candidates = generate_candidate_keyword_scores(phrase_list, word_scores)
        sorted_keywords = sorted(keyword_candidates.items(), key=operator.itemgetter(1),
                                 reverse=True)
        return sorted_keywords
if test:
    text = "input question here"

    # Split text into sentences
    sentenceList = split_sentences(text)
    # The Fox stoplist contains "numbers", so it will not find "natural numbers"
    # like in Table 1.1.
    #stoppath = "FoxStoplist.txt"
    # The SMART stoplist misses some of the lower-scoring keywords in Figure 1.5,
    # which means that the top 1/3 cuts off one of the 4.0 score words in Table 1.1.
    stoppath = "SmartStoplist.txt"
    stopwordpattern = build_stop_word_regex(stoppath)

    # generate candidate keywords
    phraseList = generate_candidate_keywords(sentenceList, stopwordpattern)
    # calculate individual word scores
    wordscores = calculate_word_scores(phraseList)
    # generate candidate keyword scores
    keywordcandidates = generate_candidate_keyword_scores(phraseList, wordscores)
    if debug:
        print(keywordcandidates)
    sortedKeywords = sorted(keywordcandidates.items(), key=operator.itemgetter(1),
                            reverse=True)
    if debug:
        print(sortedKeywords)
    totalKeywords = len(sortedKeywords)
    if debug:
        print(totalKeywords)
    # Keep the top third of the ranked keywords.
    print(sortedKeywords[0:(totalKeywords // 3)])

    # The same pipeline through the Rake class.
    rake = Rake("SmartStoplist.txt")
    keywords = rake.run(text)
    print(keywords)
# end
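The scoring rule used in calculate_word_scores, score(w) = deg(w)/freq(w), can be followed by hand on a tiny input. The sketch below is a condensed Python 3 restatement, not the rake.py code itself: the function names word_scores and phrase_score and the three sample phrases are assumptions made for illustration, and the phrases are given already split into words.

```python
from collections import defaultdict


def word_scores(phrases):
    # phrases: candidate phrases, each already split into a list of words.
    freq = defaultdict(int)
    degree = defaultdict(int)
    for words in phrases:
        for w in words:
            freq[w] += 1
            degree[w] += len(words) - 1  # other words in the same phrase
    # deg(w) also counts the word itself, hence the "+ freq[w]".
    return {w: (degree[w] + freq[w]) / freq[w] for w in freq}


def phrase_score(words, scores):
    # A phrase scores the sum of its word scores.
    return sum(scores[w] for w in words)


phrases = [["prime", "minister"], ["india"], ["first", "prime", "minister"]]
scores = word_scores(phrases)
for words in phrases:
    print(" ".join(words), phrase_score(words, scores))
```

Here "prime" and "minister" each occur twice with a total degree of 5, giving 2.5 per word, so the phrases score 5.0, 1.0 and 8.0 respectively. Words that appear in longer phrases are favoured, which is why multi-word keywords tend to rank first.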
6.3. SCREENSHOTS:
Figure 1 shows a snapshot of the Question Classification training module.
Figure 1 Training module
Figure 2 shows a snapshot of the testing module in Question Classification.
Figure 3 shows a snapshot of the keyword extraction module output.
Figure 4 shows the output of the Document Retrieval module in the QA System.

TESTING
7.1 INTRODUCTION
Testing is the process of detecting errors. It plays a critical role in quality
assurance and in ensuring the reliability of software. The results of testing are also used
later, during maintenance.
Testing is often assumed to demonstrate that a program works by showing that it has
no errors. However, the basic purpose of the testing phase is to detect the errors that may be
present in the program. Hence one should not start testing with the intent of showing that a
program works; the intent should be to show that it does not work. Testing is the process of
executing a program with the intent of finding errors.
7.2 Testing objectives
The main objective of testing is to uncover a host of errors, systematically and with
minimum effort and time. Stated formally:
• Testing is a process of executing a program with the intent of finding an error.
• A good test case is one that has a high probability of finding an error, if it exists.
• No set of tests is adequate to detect every error that may be present.
7.3 Process
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, sub-assemblies, assemblies and/or a finished product. It is the
process of exercising software with the intent of ensuring that the software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of test, and each test type addresses a specific testing requirement.
7.4 Levels of testing
In order to uncover the errors present in different phases we have the concept of levels
of testing. The basic levels of testing are as shown below.
Client Needs    <-->   Acceptance Testing
Requirements    <-->   System Testing
Design          <-->   Integration Testing
Code            <-->   Unit Testing
Fig 7.1 Levels of testing

7.5. TESTING STRATEGIES and TEST CASES
Unit testing
Unit testing involves the design of test cases that validate that the internal program logic
is functioning properly and that program inputs produce valid outputs. All decision
branches and internal code flow should be validated. It is the testing of individual software units
of the application, done after the completion of an individual unit and before integration. This
is structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests
perform basic tests at component level and test a specific business process, application, and/or
system configuration. Unit tests ensure that each unique path of a business process performs
accurately to the documented specifications and contains clearly defined inputs and expected
results.
Unit testing focuses verification effort on the smallest unit of software design, the
module. The software built is a collection of individual modules, each performing a
unique function. In this kind of testing, the exact flow of control for each module was verified.
With detailed design considerations used as a guide, important control paths are tested to
uncover errors within the boundary of the module.
The unit tests can be conducted in parallel for multiple modules. The steps followed in
testing the individual modules are:
Step-1: Check whether the input file format is accepted by the svmlight-tk tool.
Step-2: Check whether the SVM classification is working correctly.
Step-3: Test whether the query is passed to the search engine without any connection problems.
Step-4: Test whether all the documents are retrieved properly.
Step-5: Test whether the text is properly separated from the HTML documents.
Step-6: Check whether the sentence ranking algorithm is working correctly.
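Step-1 above could be automated along the following lines. This is only a sketch under stated assumptions: it checks the plain SVM-light feature-vector layout (a numeric target followed by index:value pairs with strictly increasing indices and an optional "#" comment), and it does not cover the tree sections that svmlight-tk adds or the special qid feature. The function name is_valid_svmlight_line and the sample lines are hypothetical.

```python
def is_valid_svmlight_line(line):
    # Hypothetical checker for one line of a plain SVM-light training file.
    body = line.split("#", 1)[0].strip()  # drop the trailing comment, if any
    if not body:
        return False
    fields = body.split()
    try:
        float(fields[0])  # target label, e.g. +1, -1 or 0
    except ValueError:
        return False
    last_index = 0
    for field in fields[1:]:
        index, sep, value = field.partition(":")
        if not sep or not index.isdigit():
            return False
        try:
            float(value)
        except ValueError:
            return False
        if int(index) <= last_index:  # indices start at 1 and must increase
            return False
        last_index = int(index)
    return True


samples = [
    "+1 1:0.5 3:1.0 # who is the prime minister of india",
    "-1 3:1 1:0.5",
    "abc 1:0.5",
]
for sample in samples:
    print(is_valid_svmlight_line(sample), sample)
```

Running such a check over the generated training file before calling the classifier separates format errors from genuine classification errors in Step-2.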
Integration testing
Integration tests are designed to test integrated software components to determine whether
they actually run as one program. Testing is event driven and is more concerned with the basic
outcome of screens or fields. Integration tests demonstrate that, although the components were
individually satisfactory, as shown by successful unit testing, the combination of components
is correct and consistent. Integration testing is specifically aimed at exposing the problems that
arise from the combination of components.
Here, many class-tested modules are combined into subsystems, which are then tested.
The goal is to see whether all the modules can be integrated properly. We have integrated all the
classes and checked them for compatibility. The errors found were identified and debugged.
Functional testing
Functional tests provide systematic demonstrations that functions tested are available as
specified by the business and technical requirements, system documentation and user manuals.
Functional testing is centered on the following items:
Valid Input : identified classes of valid input must be accepted.
Invalid Input : identified classes of invalid input must be rejected.
Functions : identified functions must be exercised.
Output : identified classes of application outputs must be exercised.
Systems/Procedures : interfacing systems or procedures must be invoked.
Organization and preparation of functional tests is focused on requirements, key
functions, or special test cases. In addition, systematic coverage of identified business
process flows, data fields, predefined processes, and successive processes must be considered
for testing. Before functional testing is complete, additional tests are identified and the effective
value of current tests is determined.
System testing
System testing ensures that the entire integrated software system meets requirements. It
tests a configuration to ensure known and predictable results. An example of system testing is
the configuration oriented system integration test. System testing is based on process
descriptions and flows, emphasizing pre-driven process links and integration points.
Here the entire software system is tested. The reference document for this process is the
requirements document, and the goal is to see if the software meets its requirements.
Debugging is not always easy. Some bugs can take a long time to find. Debugging
concurrent code can be particularly difficult and time consuming. It helps to understand the
language and its libraries. A common source of errors in Java is using libraries without properly
understanding what their methods do and how they work. For example, one might forget to call a
method that initializes an object.
White Box testing:
White Box Testing is testing in which the software tester has knowledge of the inner
workings, structure and language of the software, or at least its purpose. It is used to test areas
that cannot be reached from a black box level.
Here, test cases are generated from the logic of each module by drawing flow graphs of
that module, and the logical decisions are tested on all the cases.
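As an illustration of deriving test cases from a module's flow graph, the sketch below restates the filter_result logic from Google.py for Python 3 and supplies one input per path through it: a direct external link, a redirect that resolves to an external link, a redirect that resolves to a Google link, a non-redirect internal link, and a redirect with no q parameter (the exception path). The sample links are made up for the example.

```python
from urllib.parse import urlparse, parse_qs


def filter_result(link):
    # Python 3 restatement of the filter_result logic listed in Google.py.
    try:
        o = urlparse(link, 'http')
        if o.netloc and 'google' not in o.netloc:
            return link                      # path 1: direct external link
        if link.startswith('/url?'):
            link = parse_qs(o.query)['q'][0]
            o = urlparse(link, 'http')
            if o.netloc and 'google' not in o.netloc:
                return link                  # path 2: redirect to an external link
    except Exception:
        pass                                 # path 5: e.g. missing "q" parameter
    return None                              # paths 3 and 4: nothing usable


# One test input per path of the flow graph (hypothetical links).
cases = {
    "http://example.org/page": "http://example.org/page",
    "/url?q=http://example.org/a&sa=U": "http://example.org/a",
    "/url?q=http://maps.google.com/": None,
    "/search?q=test": None,
    "/url?sa=U": None,
}
for link, expected in cases.items():
    assert filter_result(link) == expected, link
print("all paths covered")
```

Choosing inputs from the branches, not from the specification alone, is what makes this a white box test: the fifth case exists only because the code contains an exception handler.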
Black box testing:
Black Box Testing is testing the software without any knowledge of the inner workings,
structure or language of the module being tested. The software under test is treated as a
black box, i.e., you cannot "see" into it. The test provides inputs and responds to
outputs without considering how the software works.
In this strategy, test cases are generated as input conditions that fully exercise all
functional requirements of the program. This testing is used to find errors in the following
categories:
• Incorrect or missing functions
• Interface errors
• Errors in network
• Performance errors
• Initialization and termination errors
In this testing, only the output is checked for correctness. The logical flow of
the data is not checked.
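A black box check of the text extraction step (Step-5 of the unit tests) might look like the sketch below: it feeds in an HTML page and inspects only the returned text. To stay self-contained it uses the standard library HTMLParser as a stand-in for the BeautifulSoup-based html2text.py, which is an assumption; the names TextExtractor and html_to_text and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    # Stand-in for the html2text.py step; collects visible text only.
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = ("<html><head><style>p{color:red}</style>"
        "<script>var x = 1;</script></head>"
        "<body><h1>Title</h1><p>Delhi is the capital of India.</p></body></html>")
text = html_to_text(page)
# Only the output is inspected: visible text is kept, code and styles are gone.
assert "Delhi is the capital of India." in text
assert "var x" not in text and "color" not in text
print(text)
```

The assertions would stay the same if html_to_text were swapped for the BeautifulSoup implementation, which is the point of testing at the black box level.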
Acceptance testing
User Acceptance Testing is a critical phase of any project and requires significant participation
by the end user. It also ensures that the system meets the functional requirements.
Test Results: All the test cases mentioned above passed successfully. No defects were encountered.

8 CONCLUSION AND FUTURE ENHANCEMENT
8.1. CONCLUSION
The purpose of an Open Domain Question Answering System is to answer whatever question
the user asks. Such a system covers questions from any field and is not restricted to a specific
domain. In this project we implemented an open domain question answering system that uses
the World Wide Web as its source of knowledge.
8.2. FUTURE ENHANCEMENT
Possible further enhancements of the project are as follows:
• To improve the accuracy of the answers, with 100% accuracy as the goal.
• To increase the speed of the system.
• To use artificial intelligence techniques to identify the answers.

