Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks
Source:
NIPS, Vancouver (2007)
Abstract:
We propose a model that leverages the millions of clicks received by web search
engines to predict document relevance.
This allows the comparison of ranking functions when clicks are available
but complete relevance judgments are not.
After an initial training phase using a set of relevance judgments paired with click data, we show that our model can predict the relevance score of documents that have not been judged. These predictions can be used to evaluate the performance of a search engine, using our novel formalization of the confidence of the
standard evaluation metric discounted cumulative gain (DCG), so comparisons can
be made across time and datasets. This contrasts with previous methods, which can
provide only pairwise relevance judgments between results shown for the same query. When no relevance judgments are available, we can identify the better of
two ranked lists up to 82% of the time, and with only two relevance judgments for each query, we can identify the better ranking up to 94% of the time. While
our experiments are on sponsored search results, which are the financial backbone
of web search, our method is general enough to be applicable to algorithmic web
search results as well. Furthermore, we give an algorithm to
guide the selection of additional documents to judge to improve
confidence.
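
For readers unfamiliar with the metric named above, the following is a minimal sketch of discounted cumulative gain (DCG). It assumes one common formulation, with gain 2^rel - 1 and a log2(rank + 1) discount; the abstract does not state which variant the paper uses. The function name `dcg` and the example relevance grades are hypothetical, and the sketch does not model how relevance is predicted from clicks or how confidence in the comparison is computed.

```python
import math


def dcg(relevances, k=None):
    """Discounted cumulative gain of a ranked list of relevance grades.

    Assumed formulation (not taken from the paper):
    gain = 2**rel - 1, discount = log2(rank + 1), ranks starting at 1.
    """
    if k is not None:
        relevances = relevances[:k]
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


# Hypothetical relevance grades for two rankings of the same four documents.
ranking_a = [3, 2, 0, 1]  # most relevant documents placed first
ranking_b = [0, 1, 2, 3]  # most relevant documents placed last

# The ranking with the higher DCG is judged the better one.
better = "A" if dcg(ranking_a) > dcg(ranking_b) else "B"
print(better)
```

In the setting the abstract describes, the grades passed to such a function would be judged where judgments exist and predicted from clicks otherwise.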