
NSF Org: | IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | July 23, 2008 |
Latest Amendment Date: | July 23, 2008 |
Award Number: | 0841275 |
Award Instrument: | Standard Grant |
Program Manager: | Maria Zemankova, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | August 1, 2008 |
End Date: | July 31, 2010 (Estimated) |
Total Intended Award Amount: | $200,000.00 |
Total Awarded Amount to Date: | $200,000.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: | 5000 FORBES AVE PITTSBURGH PA US 15213-3890 (412)268-8746 |
Sponsor Congressional District: |
|
Primary Place of Performance: | 5000 FORBES AVE PITTSBURGH PA US 15213-3890 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | CLUSTER EXPLORATORY (CLuE) |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
This project is adapting prior work on federated search to create a more selective approach to searching web indexes, which we call topic-partitioned indexing. Each subset (shard) of a topic-partitioned index covers specific content areas, so only the shards that cover a query's topic area(s) need to be searched. Our research is developing methods to assign documents to shards efficiently. Supervised and unsupervised techniques are used to match queries to shards. The result is a selective search that delivers accuracy similar to that of more exhaustive searches but requires an order of magnitude less effort, yielding significant computational and financial savings. The project is using the Google/IBM cluster to crawl the web and perform the data cleansing and pre-processing necessary to develop a web dataset of 500 million to 1 billion documents to support the research. Additional effort is being devoted to producing a corpus that is useful for a broad range of research purposes. A project goal is to share the dataset with other researchers on the Google/IBM cluster, and eventually with a broader research community.
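The idea of searching only the shards that cover a query's topic can be illustrated with a small sketch. This is a toy model written under stated assumptions, not the project's method: documents are assigned to shards using topic labels supplied by hand, and shards are ranked by the relative frequency of the query terms they contain. The function names (`build_shards`, `rank_shards`, `selective_search`) and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_shards(docs, doc_topics):
    """Group documents into shards keyed by an assumed, hand-supplied topic label."""
    shards = defaultdict(dict)
    for doc_id, text in docs.items():
        shards[doc_topics[doc_id]][doc_id] = Counter(tokenize(text))
    return dict(shards)


def rank_shards(query, shards):
    """Order shards by the share of their term occurrences that match the query."""
    terms = tokenize(query)
    scores = {}
    for name, shard in shards.items():
        total = sum(sum(tf.values()) for tf in shard.values())
        hits = sum(tf[t] for tf in shard.values() for t in terms)
        scores[name] = hits / total if total else 0.0
    return sorted(scores, key=scores.get, reverse=True)


def selective_search(query, shards, top_shards=1):
    """Search only the best-ranked shards instead of the whole collection."""
    terms = tokenize(query)
    results = []
    for name in rank_shards(query, shards)[:top_shards]:
        for doc_id, tf in shards[name].items():
            score = sum(tf[t] for t in terms)
            if score > 0:
                results.append((score, doc_id))
    # Highest score first; ties broken by document id for a stable order.
    return [doc_id for _, doc_id in sorted(results, key=lambda r: (-r[0], r[1]))]


# Hypothetical sample data, used for illustration only.
docs = {
    "d1": "football match score goal",
    "d2": "tennis match final score",
    "d3": "stock market shares price",
    "d4": "bank interest rate price",
}
doc_topics = {"d1": "sports", "d2": "sports", "d3": "finance", "d4": "finance"}
shards = build_shards(docs, doc_topics)

print(selective_search("match score", shards))
```

With `top_shards=1`, the query touches only one of the two shards, which is the source of the savings described above; a real system would need a learned or clustered shard assignment and a more robust shard-ranking function.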
The project will have three types of broad impact. First, the data centers of large web search companies are expensive and are major consumers of electrical power, so reducing their costs has significant financial and environmental benefits. Second, lower computational costs make it practical for academic researchers to conduct research on datasets that web search companies consider credible, increasing the impact of academic research. Finally, research datasets such as ours typically have long life spans and are used for diverse research projects by scientists around the world.