Must be similar to the screenshots.
I must be able to run the projects in Eclipse so that I can upload
the code to my GitHub account.
The projects must say that they were created by
Juliet Mercado
Zachary Willis
Ihor Panchenko
Craig Anderson
Building a Search Engine, Part I: Governance, Workflow, and UI
(This is the first project in this series)
You are going to design, build, and test a scaled-down version
of “Google Search”. Rather than searching the Internet's files,
you will only search local files added to your search engine's
index. Your search engine will allow an administrator to add,
update, and remove files from the index. Users will be able to
enter search terms, and select between Boolean AND, OR, or
PHRASE search. The matching file names (if any) are then
displayed in a list.
You also need to design the system architecture (the high-level
design), so you can plan each part.
Search Engine Project Proposal:
Build a search engine with a simple GUI that can do AND, OR,
and PHRASE Boolean searches on a small set of text files. The
user should be able to specify the type of search to do, and enter
some search terms. The results should be a list of file
pathnames that match the search. This should be a stand-alone
application.
User Interfaces
In addition to the main user interface (for doing searching), you
will need a separate administrator or maintenance interface to
manage your application. It should be easy to add and remove
files (from the set of indexed files), and to regenerate the index
anytime. When starting, your application should check if any of
the files have been changed or deleted since the application last
saved the index. If so, the administrator should be able to have
the index updated with the modified file(s).
Note that with HTML, Word, or other types of documents, you
would need to extract a plain text version before indexing. That
isn't hard, but the search engine is complex enough already. For
these projects, limit your search engine to only plain text files
(including .txt, .html, and other text files).
The index must be stored on disk, so next time your application
starts it can reload its data. The index, list of files, and other
data, can be stored in one or more file(s) or in a database. The
saved data should be read whenever your application starts. The
saved data should be updated (or recreated) when you add,
update, or remove documents from your set (of indexed
documents), or perhaps just when your application exits. If you
use files, the file formats are up to you; have a format that is
fast and simple to load and store.
To keep things as simple as possible, in this project you can
assume that only a small set of documents will be indexed, and
thus the whole index can be kept in memory at once. (That's
probably not the case for Google's data!) All you need to do is
be able to read the index data from disk at startup into memory,
and write it back either when updating the index, or when your
application shuts down. Note, the names (pathnames) of the
added files as well as their last modification time must be
stored in addition to the index.
If using an XML file, you can define an XML schema for it and
have some tool such as Notepad++ validate your file format for
you. XML may have other benefits, but it isn't as simple as
using plain text files. JSON might be the easiest format for
storing and reading the index data. In any case, don't forget to
include the list of file pathnames and other data you decide is
needed, along with the index itself.
Requirements:
In this project, we will follow the model-view-controller design
pattern for the project organization. This allows one to develop
each part mostly independently from the other parts.
Develop Stub User Interfaces:
In this part of the project, you must implement a non-functional
(that is, it looks good but doesn't do a thing) graphical user
interface (the “view”) for the application. The main (default)
user interface must support searching and displaying results. It
should have various other features, such as an “About...” menu
or button, a way to quit the application (if a stand-alone
application; if your group creates a web application, there is no
need to quit), and a way to get to the administrator/maintenance
view.
The maintenance/administrator view must allow the user to
perform various administration operations: view the list of
indexed file names, add files to the index, remove files from
the index, and update the index (when files have been modified
since they were indexed).
The user interface should be complete, but none of the
functionality needs to be implemented at this time. You should
implement stub methods for the functionality not yet
implemented, and invoke them from your event handlers. The
stub methods can either return “canned” (fake but realistic)
data, or throw an OperationNotSupported exception. The only
button that needs to do anything is the one used to switch to the
maintenance view.
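As an illustration only, the two kinds of stub described above might look like the following sketch. The class and method names are hypothetical, and the canned pathnames are made up. The sketch assumes the standard `java.lang.UnsupportedOperationException` is an acceptable stand-in for the "OperationNotSupported" exception mentioned above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical controller stubs; names are illustrative, not from the handout.
class SearchStubs {
    // Returns "canned" (fake but realistic) results so the view has data to show.
    static List<String> search(String terms, String mode) {
        return Arrays.asList("/home/user/docs/notes.txt", "/home/user/docs/todo.txt");
    }

    // Not implemented yet: fails loudly so the event handler can report it.
    static void addFileToIndex(String pathname) {
        throw new UnsupportedOperationException("addFileToIndex is not implemented yet");
    }
}
```

An event handler would call one of these methods and either display the canned list or catch the exception and show a "not yet implemented" message.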
Since the user interfaces don't do anything, there is nothing to
test yet. However, you must create a test class with at least one
test method (it can just return success if you wish). I suggest
you agree to use JUnit 4 style tests for now.
Building a Search Engine, Part II: Persistent Data
Please read the background information and full project
description from Search Engine Project, Part I. In this project,
you will implement the persistent data (the “model”) part of the
project: the saving of data and the loading of data at the next
start. The persistent data contains the list of files used in the
index, and the index itself.
First discuss which persistence solution you will use: text files,
XML or JSON files, or a database. If using a database, choose
between embedded (my suggestion) or server, and choose between
the JDBC and JPA database APIs (I suggest JPA). You can make
this decision before knowing the details of
the data structures used.
Before working on actual code, you need to decide on the data
structures to be used for the file list and the inverted index. Try
to read the Java collections material before deciding.
It should be easy to add and remove files (from the set of
indexed files). When starting, your application should check if
any of the files used have been changed or deleted since the
application last saved the index. If so, the “admin” user should
be able to have the inverted index file(s) updated, from the
maintenance interface.
(Note that with HTML or Word documents, you would need to
extract a plain text version before indexing.) In this project, all
the “indexable” files are plain text. You are free to assume the
system-default text file encoding, or assume UTF-8 encoding,
for all files.
The inverted index can be stored in one or more file(s), which
should be read whenever your application starts. The file(s)
should be updated (or recreated) when you add, update, or
remove documents from your set (of indexed documents). The
file format is up to you, but it should be fast and simple to
search. However, to keep things simpler, in this project you can
assume that only a small set of documents will be indexed, and
thus the whole index can be kept in memory. All you need to do
is be able to read the index data from a file at startup into
memory, and write it back when updating the index. Don't forget
that the names (pathnames) of the files, as well as their last
modification times, must also be stored. It is your
choice to use a single file or multiple files, in plain text, JSON,
XML, or any format your group chooses, to hold the persistent
data. If you want, you can use any DBMS. (In that case, I
suggest using the JavaDB included with the JDK, as an
embedded database.) In any case, your file format(s) or database
schema must be documented completely, so that someone else,
without access to your source code could use your file(s) or
database correctly.
If using XML format, you can define an XML schema for your
file and have some tool such as Notepad++ validate your file
format for you. XML may have other benefits, but it isn't as
simple as plain text files or even JSON files. In any case, don't
forget to include the list of file (path) names, along with the
index itself, in your persistent data store.
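To make the plain-text option concrete, here is a minimal sketch of one possible line-oriented format. The format itself, the class name `IndexStore`, and its fields are all assumptions for illustration, not the model solution's format. The index is held as a Map from word to a List of `int[2]` (document id, position) arrays, and pathnames are assumed not to contain tab characters.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical format, one record per line, fields separated by tabs:
//   F <TAB> lastModifiedMillis <TAB> pathname     (file id = order of appearance)
//   W <TAB> word <TAB> docId:position <TAB> docId:position ...
class IndexStore {
    final List<String> paths = new ArrayList<>();
    final List<Long> modified = new ArrayList<>();
    final Map<String, List<int[]>> index = new TreeMap<>();

    // Writes the file list first, then one line per indexed word.
    void save(Writer out) {
        PrintWriter pw = new PrintWriter(out);
        for (int id = 0; id < paths.size(); id++) {
            pw.println("F\t" + modified.get(id) + "\t" + paths.get(id));
        }
        for (Map.Entry<String, List<int[]>> e : index.entrySet()) {
            StringBuilder line = new StringBuilder("W\t").append(e.getKey());
            for (int[] p : e.getValue()) {
                line.append('\t').append(p[0]).append(':').append(p[1]);
            }
            pw.println(line);
        }
        pw.flush();
    }

    // Rebuilds the in-memory data from the saved lines; unknown lines are skipped.
    static IndexStore load(Reader in) throws IOException {
        IndexStore store = new IndexStore();
        BufferedReader br = new BufferedReader(in);
        for (String line = br.readLine(); line != null; line = br.readLine()) {
            String[] f = line.split("\t");
            if (f[0].equals("F")) {
                store.modified.add(Long.parseLong(f[1]));
                store.paths.add(f[2]);
            } else if (f[0].equals("W")) {
                List<int[]> postings = new ArrayList<>();
                for (int i = 2; i < f.length; i++) {
                    String[] dp = f[i].split(":");
                    postings.add(new int[] { Integer.parseInt(dp[0]), Integer.parseInt(dp[1]) });
                }
                store.index.put(f[1], postings);
            }
        }
        return store;
    }
}
```

Passing a `FileWriter` or `FileReader` to these methods would store the data on disk. Whatever format you pick, the same description (record types, field order, separators) is what belongs in your format documentation.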
Part II Requirements:
In this part, you must implement the file operations of your
search engine application (the model). That includes reading
and updating your persistent data (that is, the inverted index as
well as any other information you need to store between runs of
your application, such as the list of files (their pathnames) that
have been indexed). The main file operations are reading each
file to be indexed a “word” at a time; you also need to check
whether the previously indexed files still exist or have been
modified since they were last indexed.
The maintenance part of the user interface should allow users to
select files for indexing, and to keep track of which files have
been added to the index. For each file, you need to keep the full
pathname of the file as well as the file's last modification time.
Your code should correctly handle the user entering non-existent
files and unreadable files. How you handle such errors
is up to you.
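One way to combine the existence, readability, and modification checks is sketched below using `java.nio.file.Files`. The class name `FileCheck` and the `Status` values are hypothetical. The sketch assumes the saved modification time is kept in milliseconds since the epoch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class FileCheck {
    enum Status { MISSING, UNREADABLE, MODIFIED, UNCHANGED }

    // Compares a file on disk with the modification time saved when it was indexed.
    static Status check(String pathname, long savedMillis) {
        Path path = Paths.get(pathname);
        // Missing, or not a regular file (for example, a directory).
        if (!Files.isRegularFile(path)) return Status.MISSING;
        if (!Files.isReadable(path)) return Status.UNREADABLE;
        try {
            long current = Files.getLastModifiedTime(path).toMillis();
            return current == savedMillis ? Status.UNCHANGED : Status.MODIFIED;
        } catch (IOException e) {
            return Status.UNREADABLE; // treat an I/O failure like an unreadable file
        }
    }
}
```

Running this check for every saved pathname at startup gives the maintenance view what it needs to mark files as changed or deleted. The same check can reject a bad pathname when the user tries to add a file.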
You can download a Search Engine model solution, to play with
it and inspect its user interface. My solution keeps all persistent
data in a single text file in the user's home directory, but you
can certainly use a different persistence solution.
Possible data structures you can use:
In Part III, you will
implement the index operations, including Boolean searching,
adding to the index, and removing files from the index. (The
index is a complex collection of collections.) Because the
format of the index and file list will affect the code used to read
and write them to and from storage, you must decide early on the
in-memory data structures to be used. In the model solution,
I used a List of FileItem objects for the list of indexed files;
each FileItem contained a file's pathname and the date it was read
for the index. The index data itself is stored in a Map,
using the indexed words as keys, and a Set of IndexData objects
as the values. Each IndexData object holds the id of the file
containing the word and the position of the word in that
document. (The classes FileItem and IndexData were trivial to
write.)
This is NOT the only, or the best, way to represent the index or
file list! (For example, a List of int[2] arrays might be simpler
than a Set of IndexData objects.) You should decide on the
types of collections used. Only then can you implement the
methods to read and write the data.
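For reference, the List/Map/Set arrangement described above could be declared roughly as follows. This is a sketch under assumptions, not the model solution's code: the field names, the `IndexModel` wrapper, and the use of a file's list position as its id are all illustrative choices.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// One indexed file: its pathname and the timestamp stored for it.
final class FileItem {
    final String pathname;
    final long lastModified;
    FileItem(String pathname, long lastModified) {
        this.pathname = pathname;
        this.lastModified = lastModified;
    }
}

// One occurrence of a word: which file (by id) and at which word position.
final class IndexData {
    final int fileId, position;
    IndexData(int fileId, int position) { this.fileId = fileId; this.position = position; }
    // equals and hashCode are needed so a HashSet treats equal occurrences as one.
    @Override public boolean equals(Object o) {
        return o instanceof IndexData && ((IndexData) o).fileId == fileId
                && ((IndexData) o).position == position;
    }
    @Override public int hashCode() { return 31 * fileId + position; }
}

class IndexModel {
    final List<FileItem> files = new ArrayList<>();            // file id = list position (assumption)
    final Map<String, Set<IndexData>> index = new HashMap<>(); // word -> occurrences

    // Records that the word occurs in the given file at the given position.
    void addOccurrence(String word, int fileId, int position) {
        index.computeIfAbsent(word, k -> new HashSet<>()).add(new IndexData(fileId, position));
    }
}
```

Using the list position as the file id is the simplest choice, but ids shift when a file is removed from the middle of the list. Storing an explicit id in each FileItem avoids that at the cost of one more field.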
Building a Search Engine, Part III: Collections
Please read the background information and full project
description from Search Engine Project, Part I.
In this final part of the project, you will complete the
application by implementing the index functions. These include
adding a file to the index, and removing a file from the index,
and reading and writing the index from/to a file. (Updating the
index when a file has been changed can then be done by
removing and then re-adding a file.) Other operations include
searching the index for a given word, and returning a Set of
pairs (document ID and position) for that word.
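The add, remove, and lookup operations might be sketched as below. The names `Pair` and `InvertedIndex` are hypothetical, positions are assumed to count words from 0, and the only normalization applied here is lowercasing. The text of a document is passed in as a String to keep the sketch short; a real implementation would read it from the file.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Locale;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;

// A (document id, word position) pair; equals/hashCode let it live in a HashSet.
final class Pair {
    final int docId, position;
    Pair(int docId, int position) { this.docId = docId; this.position = position; }
    @Override public boolean equals(Object o) {
        return o instanceof Pair && ((Pair) o).docId == docId && ((Pair) o).position == position;
    }
    @Override public int hashCode() { return 31 * docId + position; }
}

class InvertedIndex {
    final Map<String, Set<Pair>> index = new HashMap<>();

    // Indexes one document a "word" at a time; Scanner splits on whitespace by default.
    void addDocument(int docId, String text) {
        int position = 0;
        Scanner words = new Scanner(text);
        while (words.hasNext()) {
            String word = words.next().toLowerCase(Locale.ROOT);
            index.computeIfAbsent(word, k -> new HashSet<>()).add(new Pair(docId, position));
            position++;
        }
    }

    // Removes every occurrence belonging to docId and drops words left with none.
    void removeDocument(int docId) {
        Iterator<Set<Pair>> it = index.values().iterator();
        while (it.hasNext()) {
            Set<Pair> pairs = it.next();
            pairs.removeIf(p -> p.docId == docId);
            if (pairs.isEmpty()) it.remove();
        }
    }

    // Returns the occurrences of a word, or an empty set if it is not indexed.
    Set<Pair> lookup(String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```

With these in place, updating a changed file is `removeDocument(id)` followed by `addDocument(id, newText)`.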
Finally, you will have to implement the Boolean search
functions of the main user interface. (This is complex enough
that it should have been another project!) I suggest you start
with an “OR” search, then worry about implementing the
“AND” and “PHRASE” search functions.
When building the index, keep in mind you will need to define
what you mean by “word”. One possibility is to strip out any
characters that are not letters or digits, and convert the result
to all lowercase, both when you build the inverted index and when
you read the search terms entered by the user. Ideally, you can
use the I18N methods discussed in class to normalize the words.
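A minimal sketch of such a normalization step follows. It assumes that `java.text.Normalizer` (NFKC form) plus locale-independent lowercasing is an acceptable reading of "I18N methods"; the class name `Words` is hypothetical. Applying the same method to file contents and to search terms keeps the two consistent.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class Words {
    // Splits text on whitespace, then keeps only letters and digits, lowercased.
    static List<String> normalize(String text) {
        String folded = Normalizer.normalize(text, Normalizer.Form.NFKC).toLowerCase(Locale.ROOT);
        List<String> words = new ArrayList<>();
        for (String token : folded.split("\\s+")) {
            // \p{L} matches any Unicode letter, \p{N} any Unicode digit or number.
            String word = token.replaceAll("[^\\p{L}\\p{N}]", "");
            if (!word.isEmpty()) words.add(word);
        }
        return words;
    }
}
```

Under this definition, `Words.normalize("Big-Top, NOW!")` yields the two words `bigtop` and `now`, and input consisting only of punctuation yields no words at all.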
Implementing Boolean Search:
The exact method depends in part on how you implement the
inverted index. In the suggested implementation (a Map with
words as the keys, and a List or Set of (document ID, position)
pairs as the values), you could implement the Boolean searches
using algorithms similar to the following (you can come up with
your own if you wish):
OR Search
This is the easiest one to implement. The general idea is to start
with an empty Set of matching files. Then add to that Set the
files containing each search term: just search the Map for that
word, and add each document found (if any). The result is the
OR search results: the files that contain any word in the search
list. (If the user inputs no search words, say “ ,.”, then no
files are considered as matching.)
AND Search
This is done the opposite way from an OR search, and is only a
little harder to implement. The idea is to start with a set of all
files in the index. Then, for each search term, check that each file
in the Set is contained in the index for that search
term. Remove any files from the set that don't contain that
word. The resulting final set is the documents matching all
search terms. (If the user inputs no search words, say “ ,.”, then
all files are considered as matching. If that isn't the behavior
you want, you need to treat that as a special case.)
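The OR and AND procedures described above can be sketched as follows. This is an illustration under assumptions: the index is a Map from word to a Set of (document id, position) pairs, the names `Pair` and `BooleanSearch` are hypothetical, and the search terms are assumed to be normalized already. Results are document ids, which still have to be mapped back to pathnames for display.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// A (document id, word position) pair, as stored in the index.
final class Pair {
    final int docId, position;
    Pair(int docId, int position) { this.docId = docId; this.position = position; }
    @Override public boolean equals(Object o) {
        return o instanceof Pair && ((Pair) o).docId == docId && ((Pair) o).position == position;
    }
    @Override public int hashCode() { return 31 * docId + position; }
}

class BooleanSearch {
    // Collects the ids of the documents that contain the given term.
    static Set<Integer> docsFor(Map<String, Set<Pair>> index, String term) {
        Set<Integer> docs = new HashSet<>();
        for (Pair p : index.getOrDefault(term, Collections.emptySet())) {
            docs.add(p.docId);
        }
        return docs;
    }

    // OR: start empty and add the documents for each term.
    static Set<Integer> or(Map<String, Set<Pair>> index, List<String> terms) {
        Set<Integer> matches = new TreeSet<>();
        for (String term : terms) {
            matches.addAll(docsFor(index, term));
        }
        return matches;
    }

    // AND: start with all documents and keep only those containing every term.
    static Set<Integer> and(Map<String, Set<Pair>> index, Set<Integer> allDocs, List<String> terms) {
        Set<Integer> matches = new TreeSet<>(allDocs);
        for (String term : terms) {
            matches.retainAll(docsFor(index, term));
        }
        return matches;
    }
}
```

With an empty term list, `or` returns no documents and `and` returns every document, matching the behavior noted in the two descriptions.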
PHRASE Search
This is the hardest search to implement. Unlike the OR and the
AND searches, with PHRASE searching, the position of the
search terms in the files matters. The algorithm I came up with
is:
1. Create an initially empty Set of Pair objects.
2. Add to the set the Pair objects for the files that contain the
   first word of the phrase. This is the easy part: just look up
   that word in the Map, and add all Pair objects found to the set.
3. The Set now contains Pair objects for just the files that might
   contain the phrase. Next, loop over the remaining words of the
   phrase, removing any Pairs from the set that are no longer
   possible phrase continuations. (Actually, I just build a new
   Set.) For each remaining word in the phrase:
   a. Create a new, empty set of Pairs.
   b. For each Pair in the previous set, see if the word appears in
      the same file, but in the next position. If so, add the Pair
      object for the word to the new set.
An example may help clarify this. Suppose the search phrase is
“big top now”. The set initially contains all the Pair objects for
the word “big”. Let's say for example, that set looks like:
(file1,position7), (file1,position22), (file3,position4)
For each Pair object in that set, you need to see if “top” is in
that same file, but the next position. If so, you add the Pair
object for that to the new Set. The (inner) loop for this example
checks each of the following:
Is a (file1,position8) Pair object in the Map for the word "top"?
Is a (file1,position23) Pair object in the Map for the word "top"?
Is a (file3,position5) Pair object in the Map for the word "top"?
If the answer is “yes”, then add that Pair object to the new set.
When this loop ends, the new set will contain the Pair objects
for the phrase “big top” (pointing to the position of the word
“top”).
For example, suppose “top” is only found in (file1,position8)
and (file3,position5). You replace the first set with this new set:
(file1,position8), (file3,position5)
Repeat for the next word in the phrase, using the set built in the
previous loop.
Continue until the set is empty (so the phrase was not found), or until
the last word of the phrase has been processed. The Pair objects
remaining in the final set are the ones that contain the phrase;
the position will be that of the last word of the phrase. (We only
need to display the file name; in this project, the position of the
phrase doesn't matter.)
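The phrase algorithm above translates fairly directly into code. The sketch below assumes the same Map of words to Sets of (document id, position) pairs and already-normalized phrase words; `Pair` and `PhraseSearch` are hypothetical names. Note that the `contains` check only works because `Pair` defines `equals` and `hashCode`.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// A (document id, word position) pair, as stored in the index.
final class Pair {
    final int docId, position;
    Pair(int docId, int position) { this.docId = docId; this.position = position; }
    @Override public boolean equals(Object o) {
        return o instanceof Pair && ((Pair) o).docId == docId && ((Pair) o).position == position;
    }
    @Override public int hashCode() { return 31 * docId + position; }
}

class PhraseSearch {
    // Returns the ids of documents in which the words occur consecutively, in order.
    static Set<Integer> search(Map<String, Set<Pair>> index, List<String> phrase) {
        Set<Integer> docs = new TreeSet<>();
        if (phrase.isEmpty()) return docs;

        // Start with every occurrence of the first word.
        Set<Pair> current = index.getOrDefault(phrase.get(0), Collections.emptySet());

        // For each later word, keep only occurrences that directly follow a survivor.
        for (int i = 1; i < phrase.size() && !current.isEmpty(); i++) {
            Set<Pair> occurrences = index.getOrDefault(phrase.get(i), Collections.emptySet());
            Set<Pair> next = new HashSet<>();
            for (Pair p : current) {
                Pair following = new Pair(p.docId, p.position + 1);
                if (occurrences.contains(following)) next.add(following);
            }
            current = next;
        }

        // Only the document ids matter for display; positions are dropped.
        for (Pair p : current) docs.add(p.docId);
        return docs;
    }
}
```

Building a new set on each pass, rather than removing from the old one, avoids modifying a set while iterating over it.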
Part III Requirements:
This project has been split into three parts. Each part counts as
a separate project. In the first two parts, you designed and
implemented a graphical user interface for the application, and
added all required file operations.
In this part, you must implement the remaining operations of
your search engine application: the index operations, and the
searching.
You can download a Search Engine model solution, to play with
it and inspect its user interface, but please keep in mind you
should not copy that user interface; instead, invent a better,
nicer-looking one.
Hints:
Keep your code as simple as possible.
The inverted index is naturally a Map, from words (the keys) to
a Set of objects (the values). Each of the objects represents a
document and a location within that document where the word
was found. I called these objects Pairs, since they are a pair of
numbers, but you can use any name for your classes. Note that you
will need to be able to go from a document number to a file
name when you display the search results.

More Related Content

PDF
Java Abs Java Productivity Creator & Analyzer
PPTX
this is PPT for mobail application development
PPT
Android application structure
PPT
CustomizingStyleSheetsForHTMLOutputs
PPTX
SharePoint Intelligence Extending Share Point Designer 2010 Workflows With Cu...
DOCX
Abstract.DOCX
PPT
Understanding and Configuring the FO Plug-in for Generating PDF Files: Part I...
DOCX
Company Visitor Management System Report.docx
Java Abs Java Productivity Creator & Analyzer
this is PPT for mobail application development
Android application structure
CustomizingStyleSheetsForHTMLOutputs
SharePoint Intelligence Extending Share Point Designer 2010 Workflows With Cu...
Abstract.DOCX
Understanding and Configuring the FO Plug-in for Generating PDF Files: Part I...
Company Visitor Management System Report.docx

Similar to Must be similar to screenshotsI must be able to run the projects.docx (20)

PPT
File Handling In C++(OOPs))
PPTX
Data Science Process.pptx
PPTX
Utilizing the natural langauage toolkit for keyword research
PPTX
PPTX
OPEN TEXT ADMINISTRATION
PPT
Fusion P8 for FileNet Overview
PPTX
Intro to OctoberCMS
PPT
IntelliJ IDEA Architecture and Performance
PDF
IRJET- Resume Information Extraction Framework
PPS
Test Strategy Utilising Mc Useful Tools
PDF
InnerSoft STATS - Introduction
PDF
ssssss
PPTX
ADF- Lect-3 Django Project Structure, Django Apps Structure.pptx
DOCX
Migration from 8.1 to 11.3
PDF
A Review of Data Access Optimization Techniques in a Distributed Database Man...
PDF
A Review of Data Access Optimization Techniques in a Distributed Database Man...
PPTX
ESSIR LivingKnowledge DiversityEngine tutorial
DOCX
pega training with project level Trainingwhatsup@8142976573
DOCX
pega training whatsup@8142976573
PDF
Doc manual 3.x
File Handling In C++(OOPs))
Data Science Process.pptx
Utilizing the natural langauage toolkit for keyword research
OPEN TEXT ADMINISTRATION
Fusion P8 for FileNet Overview
Intro to OctoberCMS
IntelliJ IDEA Architecture and Performance
IRJET- Resume Information Extraction Framework
Test Strategy Utilising Mc Useful Tools
InnerSoft STATS - Introduction
ssssss
ADF- Lect-3 Django Project Structure, Django Apps Structure.pptx
Migration from 8.1 to 11.3
A Review of Data Access Optimization Techniques in a Distributed Database Man...
A Review of Data Access Optimization Techniques in a Distributed Database Man...
ESSIR LivingKnowledge DiversityEngine tutorial
pega training with project level Trainingwhatsup@8142976573
pega training whatsup@8142976573
Doc manual 3.x

More from herthaweston (20)

DOCX
TOPIC 6 NEGOTIATIONNegotiation is a highly important personal a.docx
DOCX
Topic 4 DQ 1In an SEI classroom, the language objective dr.docx
DOCX
Topic 6 Unit 3 Part 2 Definition EssayObjectivesDraft .docx
DOCX
Topic 2 Partial Correlation 1.From your textbook readings.docx
DOCX
Topic 5 Conformity, Deviance, Crime, and StratificationObject.docx
DOCX
Topic 2 The Disaster Management CycleReadRichards, E. A., Nova.docx
DOCX
Topic 1Identify an experience where you had to assess the needs, i.docx
DOCX
Topic 1 Vicarious Liability StatutesWhat is vicarious liability .docx
DOCX
Topic 2 DQ 1Qualitative research is a research method which .docx
DOCX
Topic 2 DQ 1Child abuse is any Act of Violence either physic.docx
DOCX
TOPIC 1 Active Listening Skill Set; Assessing your Listening Skills.docx
DOCX
Topic 1 How Does Transcultural Nursing Theory FitTranscultural n.docx
DOCX
Topic 1 FTC’s Green Guide Research the policy behind the FTC’.docx
DOCX
Topic 1Personal Philosophy From your readings about teaching and .docx
DOCX
Topic 1 Based upon an A through F scale, what grade would you give .docx
DOCX
Topic #1Define transnational crime. How has it changed in the la.docx
DOCX
Tony’s Chips has recently been sold to a new independent company. .docx
DOCX
Topic #1 Discuss several technologies that have changed the face o.docx
DOCX
Tony’s Chips has recently been sold to a new independent company. Th.docx
DOCX
Today, many infants and toddlers are learning two languages. Wha.docx
TOPIC 6 NEGOTIATIONNegotiation is a highly important personal a.docx
Topic 4 DQ 1In an SEI classroom, the language objective dr.docx
Topic 6 Unit 3 Part 2 Definition EssayObjectivesDraft .docx
Topic 2 Partial Correlation 1.From your textbook readings.docx
Topic 5 Conformity, Deviance, Crime, and StratificationObject.docx
Topic 2 The Disaster Management CycleReadRichards, E. A., Nova.docx
Topic 1Identify an experience where you had to assess the needs, i.docx
Topic 1 Vicarious Liability StatutesWhat is vicarious liability .docx
Topic 2 DQ 1Qualitative research is a research method which .docx
Topic 2 DQ 1Child abuse is any Act of Violence either physic.docx
TOPIC 1 Active Listening Skill Set; Assessing your Listening Skills.docx
Topic 1 How Does Transcultural Nursing Theory FitTranscultural n.docx
Topic 1 FTC’s Green Guide Research the policy behind the FTC’.docx
Topic 1Personal Philosophy From your readings about teaching and .docx
Topic 1 Based upon an A through F scale, what grade would you give .docx
Topic #1Define transnational crime. How has it changed in the la.docx
Tony’s Chips has recently been sold to a new independent company. .docx
Topic #1 Discuss several technologies that have changed the face o.docx
Tony’s Chips has recently been sold to a new independent company. Th.docx
Today, many infants and toddlers are learning two languages. Wha.docx

Recently uploaded (20)

PDF
Empowerment Technology for Senior High School Guide
PPTX
Onco Emergencies - Spinal cord compression Superior vena cava syndrome Febr...
PPTX
Introduction to pro and eukaryotes and differences.pptx
PDF
Vision Prelims GS PYQ Analysis 2011-2022 www.upscpdf.com.pdf
PDF
BP 704 T. NOVEL DRUG DELIVERY SYSTEMS (UNIT 1)
PPTX
202450812 BayCHI UCSC-SV 20250812 v17.pptx
PDF
MBA _Common_ 2nd year Syllabus _2021-22_.pdf
DOCX
Cambridge-Practice-Tests-for-IELTS-12.docx
PPTX
History, Philosophy and sociology of education (1).pptx
PDF
A GUIDE TO GENETICS FOR UNDERGRADUATE MEDICAL STUDENTS
PDF
medical_surgical_nursing_10th_edition_ignatavicius_TEST_BANK_pdf.pdf
PPTX
B.Sc. DS Unit 2 Software Engineering.pptx
PDF
IGGE1 Understanding the Self1234567891011
PDF
Complications of Minimal Access-Surgery.pdf
PDF
My India Quiz Book_20210205121199924.pdf
PPTX
Unit 4 Computer Architecture Multicore Processor.pptx
PDF
Practical Manual AGRO-233 Principles and Practices of Natural Farming
PPTX
20th Century Theater, Methods, History.pptx
PDF
Chinmaya Tiranga quiz Grand Finale.pdf
PDF
1.3 FINAL REVISED K-10 PE and Health CG 2023 Grades 4-10 (1).pdf
Empowerment Technology for Senior High School Guide
Onco Emergencies - Spinal cord compression Superior vena cava syndrome Febr...
Introduction to pro and eukaryotes and differences.pptx
Vision Prelims GS PYQ Analysis 2011-2022 www.upscpdf.com.pdf
BP 704 T. NOVEL DRUG DELIVERY SYSTEMS (UNIT 1)
202450812 BayCHI UCSC-SV 20250812 v17.pptx
MBA _Common_ 2nd year Syllabus _2021-22_.pdf
Cambridge-Practice-Tests-for-IELTS-12.docx
History, Philosophy and sociology of education (1).pptx
A GUIDE TO GENETICS FOR UNDERGRADUATE MEDICAL STUDENTS
medical_surgical_nursing_10th_edition_ignatavicius_TEST_BANK_pdf.pdf
B.Sc. DS Unit 2 Software Engineering.pptx
IGGE1 Understanding the Self1234567891011
Complications of Minimal Access-Surgery.pdf
My India Quiz Book_20210205121199924.pdf
Unit 4 Computer Architecture Multicore Processor.pptx
Practical Manual AGRO-233 Principles and Practices of Natural Farming
20th Century Theater, Methods, History.pptx
Chinmaya Tiranga quiz Grand Finale.pdf
1.3 FINAL REVISED K-10 PE and Health CG 2023 Grades 4-10 (1).pdf

Must be similar to screenshotsI must be able to run the projects.docx

  • 1. Must be similar to screenshots I must be able to run the projects on Eclipse so that I can upload the codes to my Github account The projects must say that they were created by Juliet Mercado Zachary Willis Ihor Panchenko Craig Anderson Building a Search Engine, Part I: Governance, Workflow, and UI (This is the first project in this series) You are going to design, build, and test a scaled-down version of “Google Search”. Rather than searching the Internet's files, you will only search local files added to your search engine's index. Your search engine will allow an administrator to add, update, and remove files from the index. Users will be able to enter search terms, and select between Boolean AND, OR, or PHRASE search. The matching file names (if any) are then displayed in a list. You also need to design the system architecture (the high-level design), so you can plan each part. Search Engine Project Proposal:
  • 2. Build a search engine with simple GUI, that can do AND, OR, and PHRASE Boolean searches on a small set of text files. The user should be able to say the type of search to do, and enter some search terms. The results should be a list of file pathnames that match the search. This should be a stand-alone application User Interfaces In addition to the main user interface (for doing searching), you will need a separate administrator or maintenance interface to manage your application. It should be easy to add and remove files (from the set of indexed files), and to regenerate the index anytime. When starting, your application should check if any of the files have been changed or deleted since the application last saved the index. If so, the administrator should be able to have the index updated with the modified file(s). Note that with HTML, Word, or other types of documents, you would need to extract a plain text version before indexing. That isn't hard, but the search engine is complex enough already. For these projects, limit your search engine to only plain text files (including .txt, .html, and other text files). The index must be stored on disk, so next time your application starts it can reload its data. The index, list of files, and other data, can be stored in one or more file(s) or in a database. The saved data should be read whenever your application starts. The saved data should be updated (or recreated) when you add, update, or remove documents from your set (of indexed documents), or perhaps just when your application exits. If you use files, the file formats are up to you; have a format that is fast and simple to load and store. To keep things as simple as possible, in this project you can
  • 3. assume that only a small set of documents will be indexed, and thus the whole index can be kept in memory at once. (That's probably not the case for Google's data!) All you need to do is be able to read the index data from disk at startup into memory, and write it back either when updating the index, or when your application shuts down. Note, the names (pathnames) of the added files as well as their last modification time must be stored in addition to the index. If using XML file, you can define an XML schema for it and have some tool such as Notepad++ validate your file format for you. XML may have other benefits, but it isn't as simple as using plain text files. JSON might be the easiest format for storing and reading the index data. In any case, don't forget to include the list of file pathnames and other data you decide is needed, along with the index itself. Requirements: In this project, we will follow the model-view-controller design pattern for the project organization. This allows one to develop each part mostly independently from the other parts. Develop Stub User Interfaces: In this part of the project, you must implement a non-functional (that means looks good but doesn't do a thing) graphic user interface for the application. (The “view”.) The main (default) user interface must support searching and displaying results. It should have various other features, such as an “About...” menu or button, a way to quit the application (if a stand-alone application; if your group creates a web application, there is no need to quit), and a way to get to the administrator/maintenance view. The maintenance/administrator view must allow the user to
  • 4. perform various administration operations: view the list of indexed file names, adding files to the index, remove files from the index, and update the index (when files have been modified since they were indexed). The user interface should be complete, but none of the functionality needs to be implemented at this time. You should implement stub methods for the functionality not yet implemented, and invoke them from your event handlers. The stub methods can either return “canned” (fake but realistic) data, or throw an OperationNotSupported exception. The only button that needs to do anything is the one used to switch to the maintenance view. Since the user interfaces don't do anything, there is nothing to test yet. However, you must create a test class with at least one test method (it can just return success if you wish). I suggest you agree to use JUnit 4 style tests for now. Building a Search Engine, Part II: Persistent Data Please read the background information and full project description from Search Engine Project, Part I. In this project, you will implement the persistent data (the “model”) part of the project: the saving of data and the loading of data at the next start. The persistent data contains the list of files used in the index, and the index itself. First discuss which persistence solution you will use: text files, XML or JSON files, or a database (and chose between embedded (my suggestion) or server, and if using a database, chose between the JDBC and JPA database APIs (I suggest JPA). You can make this decision before knowing the details of the data structures used. Before working on actual code, you need to decide on the data
  • 5. structures to be used for the file list and the inverted index. Try to read the Java collections material before deciding. It should be easy to add and remove files (from the set of indexed files). When starting, your application should check if any of the files used have been changed or deleted since the application last saved the index. If so, the “admin” user should be able to have the inverted index file(s) updated, from the maintenance interface. (Note that with HTML or Word documents, you would need to extract a plain text version before indexing.) In this project, all the “indexible” files are plain text. You are free to assume the system-default text file encoding, or assume UTF-8 encoding, for all files. The inverted index can be stored in one or more file(s), and that should be read whenever your application starts. The file(s) should be updated (or recreated) when you add, update, or remove documents from your set (of indexed documents). The file format is up to you, but should have a format that is fast and simple to search. However, to keep things simpler, in this project you can assume that only a small set of documents will be indexed, and thus the whole index can be kept in memory. All you need to do is be able to read the index data from a file at startup into memory, and write it back when updating the index. Don't forget the names (pathnames) of the files as well as their last modification time must be stored as well. It is your choice to use a single file or multiple files, in plain text, JSON, XML, or any format your group chooses, to hold the persistent data. If you want, you can use any DBMS. (In that case, I suggest using the JavaDB included with the JDK, as an embedded database.) In any case, your file format(s) or database schema must be documented completely, so that someone else, without access to your source code could use your file(s) or database correctly.
  • 6. If using XML format, you can define an XML schema for your file and have some tool such as Notepad++ validate your file format for you. XML may have other benefits, but it isn't as simple as plain text files or even JSON files. In any case, don't forget to include the list of file (path) names, along with the index itself, in your persistent data store.
    Part II Requirements:
    In this part, you must implement the file operations of your search engine application (the model). That includes reading and updating your persistent data (that is, the inverted index as well as any other information you need to store between runs of your application, such as the list of pathnames of the files that have been indexed). The main file operation is reading each file to be indexed a “word” at a time; you also need to check whether the previously indexed files still exist or have been modified since they were last indexed.
    The maintenance part of the user interface should allow users to select files for indexing, and to keep track of which files have been added to the index. For each file, you need to keep the full pathname of the file as well as the file's last modification time. Your code should correctly handle the user entering nonexistent files and unreadable files. How you handle such errors is up to you.
    You can download a Search Engine model solution, to play with it and inspect its user interface. My solution keeps all persistent data in a single text file in the user's home directory, but you can certainly use a different persistence solution.
    Possible Data Structures You Can Use
    In Part III, you will implement the index operations, including Boolean searching, adding to the index, and removing files from the index. (The
  • 7. index is a complex collection of collections.) Because the format of the index and file list will affect the code used to read and write them to and from storage, you must decide early on the in-memory data structures to be used.
    In the model solution, I used a List of FileItem objects for the list of indexed files; each FileItem contained a file's pathname and the date it was read for the index. The index data itself is stored in a Map, using the indexed words as keys, and a Set of IndexData objects as the values. Each IndexData object holds the id of the file containing the word and the position of the word in that document. (The classes FileItem and IndexData were trivial to write.)
    This is NOT the only, or the best, way to represent the index or file list! (For example, a List of int[2] arrays might be simpler than a Set of IndexData objects.) You should decide on the types of collections used. Only then can you implement the methods to read and write the data.
    Building a Search Engine, Part III: Collections
    Please read the background information and full project description from Search Engine Project, Part I. In this final part of the project, you will complete the application by implementing the index functions. These include adding a file to the index, removing a file from the index, and reading and writing the index from/to a file. (Updating the index when a file has been changed can then be done by removing and then re-adding the file.) Other operations include searching the index for a given word, and returning a Set of pairs (document ID and position) for that word.
    Finally, you will have to implement the Boolean search
  • 8. functions of the main user interface. (This is complex enough that it should have been another project!) I suggest you start with an “OR” search, then worry about implementing the “AND” and “PHRASE” search functions.
    When building the index, keep in mind you will need to define what you mean by “word”. One possibility is to strip out any characters that are not letters or digits, and convert the result to all lowercase, both when you build the inverted index and when you read the search terms entered by the user. Ideally, you can use the I18N methods discussed in class to normalize the words.
    Implementing Boolean Search:
    The exact method depends in part on how you implement the inverted index. In the suggested implementation (a Map with words as the keys, and a List or Set of (document ID, position) pairs as the values), you could implement the Boolean searches using algorithms similar to the following (you can come up with your own if you wish):
    OR Search
    This is the easiest one to implement. The general idea is to start with an empty Set of matching files. Then, for each search term, look up that word in the Map and add each document found (if any) to the Set. The result is the OR search results: the files that contain any word in the search list. (If the user inputs no search words, say “ ,.”, then no files are considered as matching.)
    AND Search
    This is done the opposite way from an OR search, and is only a little harder to implement. The idea is to start with a set of all files in the index. Then for each search term, for each file in the
  • 9. Set, make sure that file is contained in the index for that search term. Remove any files from the set that don't contain that word. The resulting final set is the documents matching all search terms. (If the user inputs no search words, say “ ,.”, then all files are considered as matching. If that isn't the behavior you want, you need to treat that as a special case.)
    PHRASE Search
    This is the hardest search to implement. Unlike the OR and the AND searches, with PHRASE searching, the position of the search terms in the files matters. The algorithm I came up with is:
    Create an initially empty Set of Pair objects. Add to the set the Pair objects for the files that contain the first word of the phrase. This is the easy part: just look up that word in the Map, and add all Pair objects found to the set. The Set now contains Pair objects for just the files that might contain the phrase.
    Next, loop over the remaining words of the phrase, removing any Pairs from the set that are no longer possible phrase continuations. (Actually, I just build a new Set.) For each remaining word in the phrase:
    Create a new, empty set of Pairs.
    For each Pair in the previous set, see if the word appears in the same file, but in the next position. If so, add the Pair object for the word to the new set.
    An example may help clarify this. Suppose the search phrase is “big top now”. The set initially contains all the Pair objects for the word “big”. Let's say, for example, that the set looks like:
  • 10. (file1,position7), (file1,position22), (file3,position4)
    For each Pair object in that set, you need to see if “top” is in that same file, but at the next position. If so, you add the Pair object for that to the new Set. The (inner) loop for this example checks each of the following:
    Is a (file1,position8) Pair object in the Map for the word "top"?
    Is a (file1,position23) Pair object in the Map for the word "top"?
    Is a (file3,position5) Pair object in the Map for the word "top"?
    If the answer is “yes”, then add that Pair object to the new set. When this loop ends, the new set will contain the Pair objects for the phrase “big top” (pointing to the position of the word “top”). For example, suppose “top” is only found in (file1,position8) and (file3,position5). You replace the first set with this new set:
    (file1,position8), (file3,position5)
    Repeat for the next word in the phrase, using the set built in the previous loop. Continue until the set is empty (so the phrase was not found), or until the last word of the phrase has been processed. The Pair objects remaining in the final set identify the files that contain the phrase; the position will be that of the last word of the phrase. (We only need to display the file name; in this project, the position of the phrase doesn't matter.)
    Part III Requirements:
  • 11. This project has been split into three parts. Each part counts as a separate project. In the first two parts, you designed and implemented a graphical user interface for the application, and added all required file operations. In this part, you must implement the remaining operations of your search engine application: the index operations, and the searching.
    You can download a Search Engine model solution, to play with it and inspect its user interface, but please keep in mind you should not copy that user interface; instead, invent a better, nicer-looking one.
    Hints:
    Keep your code as simple as possible.
    The inverted index is naturally a Map, from words (the keys) to a Set of objects (the values). Each of the objects represents a document and a location within that document where the word was found. I called these objects Pairs, since they are a pair of numbers, but you can use any name for your classes. Note that you will need to be able to go from a document number to a file name when you display the search results.
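    To make the hints and the three search algorithms concrete, here is one possible in-memory sketch. It is not the model solution. Pair is the name used above, but its fields, the InvertedIndex class, and every method name are assumptions made for this example. It treats a “word” as a maximal run of letters and digits (so it splits on other characters rather than stripping them), uses a document's position in a List as its document number, takes the file's text as a String instead of reading the file, and leaves out removal and persistence.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

/** A (document id, word position) pair; the field names are assumptions. */
final class Pair {
    final int docId;
    final int position;
    Pair(int docId, int position) { this.docId = docId; this.position = position; }
    @Override public boolean equals(Object o) {
        return o instanceof Pair && ((Pair) o).docId == docId && ((Pair) o).position == position;
    }
    @Override public int hashCode() { return Objects.hash(docId, position); }
}

/** Hypothetical in-memory index: a Map from each word to the Set of Pairs where it occurs. */
final class InvertedIndex {
    private final Map<String, Set<Pair>> index = new HashMap<>();
    private final List<String> fileNames = new ArrayList<>(); // document id -> pathname

    /** Lowercases the text and splits it into runs of letters and digits. */
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) result.add(w);
        }
        return result;
    }

    /** Indexes already-loaded text under the given pathname and returns its document id. */
    int addDocument(String pathname, String text) {
        int docId = fileNames.size();
        fileNames.add(pathname);
        List<String> ws = words(text);
        for (int pos = 0; pos < ws.size(); pos++) {
            index.computeIfAbsent(ws.get(pos), k -> new HashSet<>()).add(new Pair(docId, pos));
        }
        return docId;
    }

    private Set<Pair> postings(String word) {
        return index.getOrDefault(word, Collections.<Pair>emptySet());
    }

    private Set<Integer> docsContaining(String word) {
        Set<Integer> docs = new HashSet<>();
        for (Pair p : postings(word)) docs.add(p.docId);
        return docs;
    }

    /** OR: start with an empty set and add the documents of every search term. */
    Set<String> searchOr(String query) {
        Set<Integer> docs = new HashSet<>();
        for (String w : words(query)) docs.addAll(docsContaining(w));
        return names(docs);
    }

    /** AND: start with all documents and keep only those containing every search term. */
    Set<String> searchAnd(String query) {
        Set<Integer> docs = new HashSet<>();
        for (int id = 0; id < fileNames.size(); id++) docs.add(id);
        for (String w : words(query)) docs.retainAll(docsContaining(w));
        return names(docs);
    }

    /** PHRASE: keep only the Pairs that are followed, at the next position, by the next word. */
    Set<String> searchPhrase(String query) {
        List<String> ws = words(query);
        Set<Pair> current = new HashSet<>();
        if (!ws.isEmpty()) current.addAll(postings(ws.get(0)));
        for (int i = 1; i < ws.size() && !current.isEmpty(); i++) {
            Set<Pair> candidates = postings(ws.get(i));
            Set<Pair> next = new HashSet<>();
            for (Pair p : current) {
                Pair following = new Pair(p.docId, p.position + 1);
                if (candidates.contains(following)) next.add(following);
            }
            current = next;
        }
        Set<Integer> docs = new HashSet<>();
        for (Pair p : current) docs.add(p.docId);
        return names(docs);
    }

    /** Maps document numbers back to pathnames, sorted for display. */
    private Set<String> names(Set<Integer> docIds) {
        Set<String> result = new TreeSet<>();
        for (int id : docIds) result.add(fileNames.get(id));
        return result;
    }
}
```

    Pair overrides equals and hashCode so that the contains test in the phrase loop compares by value. With an empty query, this sketch follows the behavior described earlier: OR and PHRASE match nothing, while AND matches every file. Using the List position as the document number is the simplest way to get from a number to a file name, but it breaks down once files can be removed, so a real implementation would likely need stable ids.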