Using Bloom Filters to Refine Web Search Results
Search engines have primarily focused on presenting
the most relevant pages to the user quickly. A less well-explored
aspect of improving the search experience is removing or grouping
near-duplicate documents in the results presented to the user. In
this paper, we apply a Bloom filter based similarity detection technique to address
this issue by refining the search results presented to the
user. First, we present and analyze our technique for finding similar
documents using content-defined chunking and Bloom filters, and demonstrate its effectiveness in
compactly representing and quickly matching pages for similarity
testing. We then show that a number of the results returned for popular and
random search queries by different search engines (Google,
Yahoo, and MSN) are similar and can be eliminated or
re-organized.
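The general idea of combining content-defined chunking with Bloom filters can be illustrated with a minimal Python sketch. This is not the authors' implementation: the window size, divisor, filter size, and number of hash functions are illustrative assumptions, an MD5 hash of each sliding window stands in for a true rolling fingerprint such as Rabin's, and the similarity measure (fraction of set bits in one filter that are also set in the other) is one plausible choice rather than the exact metric from the paper.

```python
import hashlib

# All parameters below are illustrative assumptions, not values from the paper.
WINDOW = 16         # bytes in the sliding window that decides chunk boundaries
DIVISOR = 64        # a boundary is declared when fingerprint % DIVISOR == 0
FILTER_BITS = 4096  # Bloom filter size in bits
NUM_HASHES = 4      # number of bit positions set per chunk


def chunks(data: bytes):
    """Split data at content-defined boundaries.

    A boundary depends only on the WINDOW bytes preceding it, so an edit
    in one place leaves chunks elsewhere in the document unchanged.
    MD5 of the window is a simple stand-in for a rolling fingerprint.
    """
    start = 0
    for i in range(WINDOW, len(data) + 1):
        window = data[i - WINDOW:i]
        fingerprint = int.from_bytes(hashlib.md5(window).digest()[:4], "big")
        if fingerprint % DIVISOR == 0:
            yield data[start:i]
            start = i
    if start < len(data):
        yield data[start:]  # trailing chunk after the last boundary


def bloom(data: bytes) -> int:
    """Build a Bloom filter (stored as an int bitmask) over the chunks."""
    bits = 0
    for chunk in chunks(data):
        digest = hashlib.sha1(chunk).digest()  # 20 bytes, sliced into 4 indexes
        for j in range(NUM_HASHES):
            index = int.from_bytes(digest[4 * j:4 * j + 4], "big") % FILTER_BITS
            bits |= 1 << index
    return bits


def similarity(a: int, b: int) -> float:
    """Fraction of the bits set in filter a that are also set in filter b."""
    set_in_a = bin(a).count("1")
    if set_in_a == 0:
        return 0.0
    return bin(a & b).count("1") / set_in_a


if __name__ == "__main__":
    page = b" ".join(b"word%d" % i for i in range(600))
    near_duplicate = page[:2500] + b" INSERTED TEXT " + page[2500:]
    print(similarity(bloom(page), bloom(near_duplicate)))
```

Because each document is reduced to a fixed-size bit vector, comparing two pages costs one bitwise AND and a bit count, regardless of page length. A result set could then be grouped by flagging any pair whose similarity exceeds a chosen threshold.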
Papers and Presentations
- Navendu Jain, Mike Dahlin, and Renu Tewari. Using Bloom Filters to Refine Web Search Results. Eighth International Workshop on the Web and Databases (WebDB '05), Baltimore, Maryland, June 2005. [PDF] [PS]
- Navendu Jain, Mike Dahlin, and Renu Tewari. Using Bloom Filters to Refine Web Search Results. Technical Report TR-05-29, Department of Computer Sciences, University of Texas at Austin. [PDF]
People
Source Code and Datasets
- To download the source code, please see the similarity detection source code in TAPER.