Award Abstract # 0841275
SGER: Multi-Tier Indexing for Web Search Engines

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: CARNEGIE MELLON UNIVERSITY
Initial Amendment Date: July 23, 2008
Latest Amendment Date: July 23, 2008
Award Number: 0841275
Award Instrument: Standard Grant
Program Manager: Maria Zemankova
IIS
 Division of Information & Intelligent Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: August 1, 2008
End Date: July 31, 2010 (Estimated)
Total Intended Award Amount: $200,000.00
Total Awarded Amount to Date: $200,000.00
Funds Obligated to Date: FY 2008 = $200,000.00
History of Investigator:
  • Jamie Callan (Principal Investigator)
    callan@cs.cmu.edu
Recipient Sponsored Research Office: Carnegie-Mellon University
5000 FORBES AVE
PITTSBURGH
PA  US  15213-3890
(412)268-8746
Sponsor Congressional District: 12
Primary Place of Performance: Carnegie-Mellon University
5000 FORBES AVE
PITTSBURGH
PA  US  15213-3890
Primary Place of Performance Congressional District: 12
Unique Entity Identifier (UEI): U3NKNFLNQ613
Parent UEI: U3NKNFLNQ613
NSF Program(s): CLUSTER EXPLORATORY (CLuE)
Primary Program Source: 01000809DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7782, 9215, 9237, HPCC
Program Element Code(s): 778200
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

This project is adapting prior work on federated search to create a more selective approach to searching web indexes that we call topic-partitioned indexing. Each subset (shard) of a topic-partitioned index covers specific content areas, so that only shards covering the query's topic area(s) need to be searched. Our research is developing methods to efficiently assign documents to shards. Supervised and unsupervised techniques are used to match queries to shards. The result is a selective search that delivers accuracy similar to that of more exhaustive searches but requires an order of magnitude less effort, thus yielding significant computational and financial savings. The project is using the Google/IBM cluster to crawl the web and perform the data cleansing and pre-processing necessary to develop a web dataset of 500 million to 1 billion documents to support the research. Additional effort is being devoted to producing a corpus that is useful for a broad range of research purposes. A project goal is to share the dataset with other researchers on the Google/IBM cluster, and eventually with a broader research community.
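The selective-search idea described above can be illustrated with a minimal Python sketch. It is not the project's method: the abstract does not specify how documents are assigned to shards or how queries are matched to them, so the sketch assumes documents are already grouped by topic and uses a simple term-frequency score as a stand-in for shard selection. All names (`Shard`, `select_shards`, `selective_search`) and the toy documents are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


class Shard:
    """One topic-specific subset of the index (hypothetical structure)."""

    def __init__(self, name):
        self.name = name
        self.docs = []                 # list of (doc_id, tokens)
        self.term_counts = Counter()   # shard-level vocabulary summary

    def add(self, doc_id, text):
        tokens = tokenize(text)
        self.docs.append((doc_id, tokens))
        self.term_counts.update(tokens)

    def search(self, query_tokens):
        """Score each document by how often the query terms occur in it."""
        hits = []
        for doc_id, tokens in self.docs:
            score = sum(tokens.count(t) for t in query_tokens)
            if score > 0:
                hits.append((score, doc_id))
        return hits


def select_shards(shards, query_tokens, k=1):
    """Rank shards by the share of their term mass matching the query.

    Assumption: this frequency score stands in for the supervised or
    unsupervised query-to-shard matching mentioned in the abstract.
    """
    def score(shard):
        total = sum(shard.term_counts.values()) or 1
        return sum(shard.term_counts[t] for t in query_tokens) / total

    ranked = sorted(shards, key=score, reverse=True)
    return [s for s in ranked[:k] if score(s) > 0]


def selective_search(shards, query, k=1):
    """Search only the k most promising shards instead of all of them."""
    query_tokens = tokenize(query)
    hits = []
    for shard in select_shards(shards, query_tokens, k):
        hits.extend(shard.search(query_tokens))
    # Highest score first; ties broken by document id for determinism.
    return [doc_id for _, doc_id in sorted(hits, key=lambda h: (-h[0], h[1]))]


# Toy data: two shards, each covering one content area.
sports = Shard("sports")
sports.add("d1", "football match score goal")
sports.add("d2", "tennis match final")

cooking = Shard("cooking")
cooking.add("d3", "pasta recipe tomato sauce")
cooking.add("d4", "bread recipe flour")

shards = [sports, cooking]
results = selective_search(shards, "pasta recipe", k=1)
```

With `k=1`, the query touches only the cooking shard, so the sports documents are never scanned. In this toy setting, that skipped work is the source of the savings; raising `k` trades effort for coverage.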

The project will have three types of broad impact. First, the data centers of large web search companies are expensive and are major consumers of electrical power, so reducing their costs has significant financial and environmental benefits. Second, lower computational costs make it practical for academic researchers to conduct research on datasets that web search companies consider credible, thus increasing the impact of academic research. Finally, research datasets such as ours typically have long life spans and are used for diverse research projects by scientists around the world.

Please report errors in award information by writing to: awardsearch@nsf.gov.
